1 Introduction
Reinforcement Learning (RL) has been applied in robotics for decades and has gained popularity with the development of deep learning. In recent studies, it has been applied to learning 3D locomotion tasks, e.g. bipedal and quadrupedal locomotion Schulman et al. [2015], as well as robot arm manipulation tasks, e.g. stacking blocks Haarnoja et al. [2018]. Google DeepMind also showed the power of RL for learning to play Atari 2600 games Mnih et al. [2013] and the game of Go Silver et al. [2016]. In these applications, either the operating environment is simple, e.g. with no interaction with other agents, or the action space is limited to a discrete selection, e.g. left, right, forward, and backward. With regard to driving tasks for autonomous vehicles, the situation is totally different, because vehicles are required to operate smoothly and efficiently in a dynamic and complicated driving environment. Tough challenges frequently arise in driving domains. For example, the vehicle agent needs to coordinate with surrounding vehicles so as not to disturb the traffic flow significantly when it executes a maneuver, e.g. merging into a joint traffic flow. More importantly, the control action should be continuous to guarantee smooth traveling.
There have been some efforts in applying RL to autonomous driving Yu et al. [2016], Ngai and Yung [2007], Sallab et al. [2016]. However, in some of these applications the state space or action space is arbitrarily discretized (e.g. vehicle acceleration is split into a few fixed values) to fit the RL algorithm (e.g. Q-learning), without considering the specific features of the driving problem. Such simplified discretization loses the complete representation of the continuous space. Policy-gradient-based methods are alternatives for continuous action problems, but they often complicate the training process by involving a policy network and sometimes suffer from vanishing or exploding gradients.
In this study, we resort to model-free Q-learning and design the Q-function in a quadratic form based on the idea of the Normalized Advantage Function Gu et al. [2016]. With this form, the optimal action can be obtained in closed form. Additionally, we incorporate domain knowledge of the control mechanism into the design of an action network to help the agent with action exploration. We test the algorithm on two challenging driving cases, a lane-change situation and a ramp-merge situation.
The remainder of the paper is organized as follows. Related work is described in Section 2. The methodology is given in Section 3, followed by the application cases in Section 4 and experiments in Section 5. Conclusions and discussions are given in the last section.
2 Related Work
In the autonomous driving field, the vast majority of studies on the operational level of driving are based on traditional methods. For example, in Ho et al. [2009], a virtual trajectory reference was created by a polynomial function for each moving vehicle, and a bicycle model was used to estimate vehicle positions based on the pre-calculated trajectories. In Choi et al. [2015], a number of waypoints obtained from Differential Global Positioning System and Real-time Kinematic devices were used to generate a path for the autonomous vehicle to follow. Such approaches can work well in predefined situations or within the model limits; however, they have limited performance in unforeseeable driving conditions.
In recent years, we have seen many applications of RL in the automated driving domain Yu et al. [2016], Ngai and Yung [2007], Wang and Chan [2017], Sallab et al. [2016]. For example, Yu et al. [2016] explored the application of Deep Q-Learning to the control of a simulated car in JavaScript Racer. They discretized the action space into nine actions and found that the vehicle agent could learn turning operations when there were no cars on the raceway but could not perform well in obstacle avoidance. Ngai and Yung [2007] put multiple goals in the RL framework (i.e. destination seeking and collision avoidance) to address the overtaking problem. They also converted continuous sensor values into discrete state-action pairs. In all of these applications, the action space was treated as discrete and few interactions with the surrounding environment were considered. Wang and Chan [2017] proposed an RL framework for learning on-ramp merge behavior, where a Long Short-Term Memory (LSTM) network was used to learn internal states and a Deep Q-Network was used for deciding the optimal control policy.
Sallab et al. [2016] moved further to explore the impacts of a discrete and a continuous action space on the lane keeping case. They conducted experiments in a simulated environment and concluded that the vehicle agent traveled more smoothly under the continuous action design than under the discrete one.
Q-learning is simple but effective, and is basically applicable to discrete action spaces. If a Q-function approximator can be designed to map continuous action values to corresponding Q-values, action selection becomes an optimization problem in a continuous action space Wang et al. [2018]. This also avoids involving a complicated policy network as in most policy-gradient-based methods Sutton et al. [2000], Degris et al. [2012]. Based on these thoughts, we design a quadratic Q-network, similar to the idea of Normalized Advantage Functions (NAF) Gu et al. [2016], in which the advantage term is parameterized as a quadratic function of nonlinear features of the state. We apply the method to practical application cases in autonomous driving, and combine it with domain knowledge of the vehicle control mechanism to assist action exploration.
3 Methodology
3.1 Quadratic Q-network
In our RL formulation, the state space and the action space are taken as continuous. The goal of the reinforcement learning is to find an optimal policy π, so that the total return accumulated over the course of the driving task is maximized. For a given policy π with parameters θ, a Q-value function Q(s_t, a_t) is used to estimate the total reward from taking action a_t in state s_t at time t Sutton et al. [2000]. A value function V(s_t) is used to estimate the total reward from state s_t. The advantage function calculates how much better action a_t is in state s_t, as

A(s_t, a_t) = Q(s_t, a_t) − V(s_t)    (1)
In the case where the action space is discrete, we can obtain the optimal action with the greedy policy directly by iterating over the action space, as

a*_t = argmax_a Q(s_t, a)    (2)
However, when the action space is continuous, the basic Q-learning formula cannot be readily applied to find the optimal action. By the nature of a quadratic function, if the Q-function has a quadratic form, the optimal action can be obtained analytically and easily. With this idea, we design the Q-function in a quadratic format as

Q(s_t, a_t) = (a_t − μ(s_t))^T P(s_t) (a_t − μ(s_t)) + V(s_t)    (3)

where μ(s_t) is a vector with the same dimension as the action, P(s_t) is a negative semidefinite matrix, and V(s_t) is considered as the value function with a scalar output. With this special form of the Q-function, the optimal action can still be obtained in a greedy way as in equation (2), and is given by a*_t = μ(s_t).
Figure 1 (left) depicts the architecture of the quadratic Q-network. V and P are each built with a single multilayer perceptron (MLP) with two hidden layers. In contrast, μ consists of three MLPs that are combined in a special way. We call μ the action network and give its details in the next subsection. μ, P, and V are combined in the function given by equation (3). Note that because any smooth Q-function can be Taylor-expanded into this quadratic format near the greedy action, there is not much loss of generality with this assumption if we stay close to the greedy policy that is being updated in the Q-learning process.
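As a concrete illustration, the quadratic form of equation (3) can be sketched in a few lines of code. Here `mu`, `L`, and `v` are hand-picked stand-ins for the outputs of the three networks (the names and values are illustrative, not the paper's implementation); the negative semidefinite matrix P is built as −L Lᵀ so the maximizer of Q is exactly μ(s).

```python
import numpy as np

def make_negative_semidefinite(L):
    """Build P = -L L^T, which is negative semidefinite by construction."""
    return -L @ L.T

def quadratic_q(a, mu, L, v):
    """Q-value of equation (3) for action a, given stand-in network outputs."""
    P = make_negative_semidefinite(L)
    diff = a - mu
    return float(diff @ P @ diff + v)

def greedy_action(mu):
    """Because P is negative semidefinite, Q is maximized at a = mu(s)."""
    return mu

# Toy check with hand-picked values in place of network outputs.
mu = np.array([0.3, -0.1])       # stand-in for the action network output
L = np.array([[1.0, 0.0],
              [0.2, 0.5]])       # stand-in lower-triangular factor of -P
v = 1.5                          # stand-in state value V(s)

q_at_mu = quadratic_q(mu, mu, L, v)        # equals V(s)
q_off = quadratic_q(mu + 0.4, mu, L, v)    # any other action scores lower
```

This makes the closed-form greedy step concrete: no search over actions is needed, since the optimum is read off the action-network output.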
3.2 Action network
From equation (3) we can observe that μ plays a critical role in learning the optimal action. If it is designed purely as a neural network with thousands of neurons, it may have a hard time learning actions meaningful to a driving policy.
Based on this insight, we design the form of μ to be similar to a PID controller in which some tuning parameters are replaced with neural networks. In other words, we do not manually tune the coefficients for the proportional, integral, and derivative terms, but use neural networks to automatically find the appropriate values based on the defined reward function in RL. This makes the controller adaptable to different driving situations; moreover, the output action is based on the long-term goal of the task rather than an action calculated only for a target at the current step, as in a PID controller.
The right graph in Figure 1 depicts the architecture of the action network μ, where three variables, τ, u_max, and k, are produced by neural networks, and equations (4) and (5) show how these variables are combined. To be specific, from equation (4) we obtain a temporary action u_tmp based on PID properties, where τ is the output of a neural network and is interpreted as a transition time to mitigate errors between the current and target states.

(4)

The temporary value then goes through a hyperbolic tangent activation function in equation (5), where another two parameters, u_max and k, are learned from neural networks: u_max represents a tunable maximum acceleration and k indicates a sensitivity factor enforced on the temporary control action.

u = u_max · tanh(k · u_tmp)    (5)

where Δx, used in equation (4), is the state difference between a desired state and the current state. The desired state can be defined conveniently; for example, it can be the target lane ID for lateral control or the preferred car-following distance for longitudinal control. The state difference can include values such as relative distance Δd, relative speed Δv, and/or relative yaw angle Δφ.
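The combination described above can be sketched as follows. Note that the exact form of the temporary action in equation (4) is simplified here to a proportional error term divided by the learned transition time τ, which is an assumption for illustration; the tanh squashing with the learned u_max and k follows equation (5).

```python
import math

def temporary_action(delta_x, tau):
    # Simplified stand-in for equation (4): drive the state error delta_x
    # toward zero over the transition time tau output by a network.
    # (The paper's exact PID-style expression may differ.)
    return delta_x / tau

def bounded_action(u_tmp, u_max, k):
    # Equation (5): squash the temporary action with tanh so the final
    # control always stays within [-u_max, u_max].
    return u_max * math.tanh(k * u_tmp)

# Example with hand-picked stand-ins for the three network outputs.
u_tmp = temporary_action(delta_x=2.0, tau=1.5)
u = bounded_action(u_tmp, u_max=3.0, k=0.8)
```

The design choice worth noting is the saturation: however large the state error, the emitted control action is bounded by the learned u_max, which keeps exploration physically plausible.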
3.3 Learning procedure
There are two iterative loops in learning the policy. One is a simulation loop, which provides the environment that the vehicle agent interacts with; the other is an update loop, in which the neural network weights are updated.
In the simulation loop, we use a_t = μ(s_t) to obtain the greedy action for a given state s_t at step t. The greedy action is then perturbed with Gaussian noise to increase exploration and executed in the simulation. After the execution, we get a new state s_{t+1} as well as a reward r_t from the environment, and store the transition tuple (s_t, a_t, r_t, s_{t+1}) in a replay memory M.
In the update loop, samples of tuples are drawn randomly from M. To overcome the inherent instability issues of Q-learning, we use the experience replay technique and a target Q-network as proposed in Adam et al. [2011]. Weights in the Q-network (θ) are updated by gradient descent at every time step, while the weights in the target Q-network (θ′) are periodically overwritten by θ. Algorithm 1 gives the learning process.
It is also worth mentioning that the overall training process includes two steps, a pre-training step and a training step. In pre-training, we train only a subset of the neural networks and freeze the parameters of the rest. During training, we jointly update the parameters in all neural networks. This trick helps the agent learn faster.
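The two-loop procedure above can be summarized in a short skeleton. The simulator and the network update are replaced by illustrative stand-in functions (all names here are assumptions for illustration); only the structure — noisy greedy actions, replay memory, minibatch sampling — follows the text.

```python
import random
from collections import deque

memory = deque(maxlen=2000)            # replay memory M

def greedy_action(state):
    return 0.0                         # stand-in for the action network mu(s)

def step_env(state, action):
    # Stand-in environment: returns the next state and a reward.
    return state + action, -abs(action)

def simulation_step(state, noise_std=0.1):
    # Perturb the greedy action with Gaussian noise for exploration,
    # execute it, and store the transition tuple in the replay memory.
    action = greedy_action(state) + random.gauss(0.0, noise_std)
    next_state, reward = step_env(state, action)
    memory.append((state, action, reward, next_state))
    return next_state

def update_step(batch_size=4):
    # Draw a random minibatch; a full implementation would take a gradient
    # step on the Q-network here and periodically copy its weights into
    # the target Q-network.
    return random.sample(list(memory), min(batch_size, len(memory)))

state = 0.0
for _ in range(8):
    state = simulation_step(state)
batch = update_step()
```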
3.4 Reward function
In our study, the immediate reward is designed as a linear combination of multiple feature functions with respect to driving safety, comfort, and efficiency. Each state-action pair is evaluated with a negative value, thus teaching the agent to avoid situations with large penalties.
To be specific, safety is evaluated by the relative distances to the vehicles that matter most to the ego vehicle. It includes the relative distances to vehicles in the longitudinal direction and the distance to the centerline of the target lane in the lateral direction.
r_safe = Σ_{i=1}^{n} w_lon f_lon(Δd_i^lon) + w_lat f_lat(Δd^lat)    (6)

where r_safe is the safety reward term, f_lon and f_lat are the feature functions, which can be power functions depending on how much we want to rate this feature, w_lon and w_lat are the weights, Δd^lon and Δd^lat are the relative distances in the longitudinal and lateral directions, and n is the number of adjacent vehicles.
Comfort is evaluated by the control variables a_v and a_φ (i.e. speed acceleration and yaw acceleration) and their derivatives ȧ_v and ȧ_φ.

r_comf = w_1 f_1(a_v, ȧ_v) + w_2 f_2(a_φ, ȧ_φ)    (7)

where r_comf is the comfort reward term, f_1 and f_2 are feature functions, and w_1 and w_2 are the weights.
Efficiency is evaluated by the maneuvering time, i.e. how long it takes to finish the task. For example, in a merging case, it is the time consumed from the initiation to the completion of the behavior. The efficiency at a single time step is calculated from the time-step interval Δt.

r_eff = w_t f_t(Δt)    (8)

where r_eff is the efficiency reward term, f_t is the feature function, which can also be a power function of Δt, and w_t is the function weight.
The function weights are hyperparameters that are manually tuned through multiple training episodes. Their values and the expressions of the feature functions are given in the next section.
4 Applications to lateral and longitudinal control
We apply the proposed algorithm to two use cases, a lane-change situation and a ramp-merge situation. In the lane-change scenario, the lateral control is learned while the longitudinal control follows an adapted Intelligent Driver Model (IDM) Treiber et al. [2000]. In the ramp-merge scenario, the longitudinal control is learned while the lateral control simply follows the centerline of the current lane. We defer simultaneously learning the control variables in the two directions to future work.
We assume that a decision-making module and a gap-selection module at a higher level issue commands on when to make a lane change or ramp merge. Our work focuses on learning the control variables given the received commands.
4.1 Lateral control in the lane-change case
The lane change behavior is affected by the ego vehicle's kinematics (e.g. vehicle speed, position, yaw angle, yaw rate, etc.) as well as those of the surrounding vehicles in the target gap. Road curvature also affects the success of a lane change; for example, a curved road segment introduces additional centrifugal force that should be considered in the lane change process. Therefore, we define the state space to include both vehicle dynamics and road curvature information.
As mentioned earlier, we resort to a well-developed car-following model, the Intelligent Driver Model (IDM) Treiber et al. [2000], with some adaptation, for the longitudinal control. IDM describes the dynamics of the positions and velocities of individual vehicles. Due to space limitations, we only briefly introduce the modified IDM, which is adapted to alleviate overly conservative driving behaviors. The longitudinal acceleration is calculated by equation (9).

a_IDM = a_max [ 1 − (v/v_0)^δ − (s*(v, Δv)/s)^2 ],  with  s*(v, Δv) = s_0 + vT + vΔv / (2√(a_max b))    (9)

where v is the current speed, Δv is the velocity difference between the ego vehicle and its preceding vehicle, v_0 is the desired velocity of the ego vehicle in free traffic, s_0 is the minimum spacing to the leader, s is the current spacing, T is the minimum headway to the leader, a_max is the maximum acceleration, b is the comfortable braking deceleration, and δ is the exponential parameter. In our test case, the exponent δ is set to 4.
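The standard IDM acceleration rule can be sketched as follows. The default parameter values below are common textbook choices, shown only as placeholders; they are not the values used in the paper, and the paper's anti-conservative adaptation is not reproduced here.

```python
import math

def idm_acceleration(v, delta_v, s,
                     v0=33.0,    # desired free-flow speed (m/s), placeholder
                     s0=2.0,     # minimum spacing (m), placeholder
                     T=1.5,      # minimum time headway (s), placeholder
                     a_max=1.5,  # maximum acceleration (m/s^2), placeholder
                     b=2.0,      # comfortable deceleration (m/s^2), placeholder
                     delta=4):   # acceleration exponent, as in the text
    # Desired dynamic gap s*, then the IDM acceleration of equation (9).
    s_star = s0 + v * T + v * delta_v / (2.0 * math.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / s) ** 2)

# Far behind the leader the model accelerates; closing fast on a near
# leader, it brakes (the raw value may exceed -b before any clipping).
a_free = idm_acceleration(v=20.0, delta_v=0.0, s=200.0)
a_close = idm_acceleration(v=20.0, delta_v=5.0, s=10.0)
```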
The action space for lateral control is treated as continuous to allow any reasonable real values to be taken in the lane change process. Specifically, we define the lateral control action to be the yaw acceleration φ̈, with the consideration that a finite yaw acceleration ensures smoothness in steering, where φ is the yaw angle.
The reward function, composed of the three parts of safety, comfort, and efficiency, is given in Table 1. In the safety part, only the reward from the lateral direction is considered, in which Δd_lat is the lateral deviation from the current position to the centerline of the target lane. Safety in the longitudinal direction is taken care of by the gap selection and the IDM. The comfort part is evaluated by the lateral action φ̈ and the yaw rate φ̇. The efficiency is evaluated by time-step intervals.
4.2 Longitudinal control in the ramp-merge case
In the ramp-merging case, when the gap selection module finds a proper gap on the merging lane, the vehicle agent tries to merge into it by adjusting its longitudinal acceleration while keeping itself in the middle of the lane. The state space in such a situation includes the speed, position, and heading angle of the ego vehicle, its leading vehicle, and the vehicles of the target gap. The action space is the longitudinal acceleration, with continuous values in a limited range.
The reward function is given in Table 1. The safety term is determined by the relative distances to the leading and lagging vehicles of the target gap on the merging lane. No lateral deviation is considered, as discussed above. The longitudinal acceleration determines the comfort reward term. Efficiency is evaluated by time-step intervals, the same as in the lane-change case.
Feature function         Weight (lane change)   Weight (ramp merge)
Longitudinal safety      None                   0.01
Lateral safety           0.05                   None
Comfort (action)         0.5                    0.5
Comfort (yaw rate)       2.0                    None
Efficiency               0.05                   0.05
5 Simulation and results
5.1 Simulation environment
The lane-change behavior is simulated on a highway segment 350 m long with three lanes in each direction. The ramp-merge behavior is simulated in the highway-ramp merging zone, where the ramp merges into the rightmost lane of the highway. An illustration of the simulation scene is shown in Figure 2.
The simulated traffic is customized to generate diverse driving conditions. The initial speed, departure time interval, and speed limit of each individual vehicle are set to random values within reasonable ranges, e.g. [30 km/h, 50 km/h], [5 s, 10 s], and [80 km/h, 120 km/h], respectively. In the simulation, vehicles can interact with each other. For example, lagging vehicles in a lane-change case can yield to or overtake the ego vehicle, creating diverse and realistic driving situations for training the RL agent.
The RL vehicle agent in the lane-change case is randomly generated in the middle lane so that it can make either a left or a right lane change based on the command received after traveling for 150 m. In the ramp-merge case, the RL agent is generated on the ramp, about 150 m away from the merging intersection. Vehicles on the highway travel in their current lanes and follow the IDM car-following behavior. Additionally, a small portion of aggressive driving behaviors is simulated by setting a relatively high acceleration range and small car-following distances, and conversely for defensive driving behaviors.
5.2 Training results
The hyperparameters for training in the two application cases are similar except for the learning rate, 0.0005 for lane change and 0.001 for ramp merge, and the number of training episodes, 6000 for lane change and 4000 for ramp merge. Other hyperparameters are set as follows: replay memory = 2000, batch size = 64, discount factor = 0.95, target-Q weight update frequency = 1000, optimizer = Adam. Twelve intermediate checkpoints are saved in each application case for testing the learned models.
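For reference, the hyperparameters above can be collected in one place. The dictionary layout is an illustrative convention; the values are the ones listed in the text.

```python
# Shared hyperparameters for both application cases.
COMMON = dict(replay_memory=2000, batch_size=64, discount=0.95,
              target_update_freq=1000, optimizer="Adam")

# Per-case overrides: learning rate and number of training episodes.
CASES = {
    "lane_change": dict(COMMON, learning_rate=0.0005, episodes=6000),
    "ramp_merge":  dict(COMMON, learning_rate=0.001,  episodes=4000),
}
```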
Training losses and accumulated rewards are plotted in Figure 3 for both the lane-change case and the ramp-merge case.
From Figure 3, we can observe that in both cases the training loss curve shows clear convergence and that the total rewards demonstrate a consistently increasing trend, which indicates that the RL vehicle agent has learned the lane-change and ramp-merge behaviors.
Since each point in the total reward curves represents only one random driving case at the corresponding training step, it might not be enough to prove the learned driving behavior. Therefore, we test the saved checkpoints to obtain an averaged driving performance. We run 100 episodes at each checkpoint in both the lane-change and ramp-merge situations, and then average their total rewards. The results are plotted in Figure 4.
The testing curves in Figure 4 show a consistent upward trend like the total reward curves in Figure 3, indicating that the RL agent has indeed progressively learned the driving behaviors of lane change and ramp merge, and can take responsible actions with respect to safety, comfort, and efficiency as defined in the reward function.
We also plot some driving dynamics to further compare the driving performance at the initial stage and the final stage, i.e., at the earliest saved checkpoint and the last saved checkpoint. The right graph in Figure 5 shows the lane change trajectories (blue for left lane change and red for right lane change) at the initial stage (upper right) and the final stage (lower right), respectively. It clearly shows that the trajectories at the final stage, in comparison to the initial stage, are quite smooth and stable. The left graph in Figure 5 shows the acceleration curves in the ramp merge case, as we learn the longitudinal control in this situation. We can see that the acceleration in a merging case becomes smoother at the final stage (lower left) than at the initial stage (upper left).
6 Conclusion and discussion
In this work, we designed a quadratic Q-network for handling continuous control problems in autonomous driving. With the quadratic format, the optimal action can be obtained easily and analytically. We also leverage domain knowledge of the vehicle control mechanism in designing an action network, to guide the vehicle agent in action exploration.
The proposed method is applied to two challenging driving cases, the lane-change case and the ramp-merge case. Training results show convergence in both the training losses and the total rewards, indicating that the RL vehicle agent has learned to drive with higher rewards as defined in the reward function. Testing results show a convergence trend consistent with training, proving that the agent has indeed learned the behaviors of lane changing and ramp merging. Comparison of the driving trajectories (in the lane change situation) and vehicle accelerations (in the ramp merge situation) at the initial stage and the final stage, respectively, also reveals that the agent can drive safely, smoothly, and efficiently.
This study demonstrates the potential of applying the quadratic Q-learning framework to continuous control problems in autonomous driving. Our next step is to learn the longitudinal and lateral controls simultaneously based on different designs of reward functions. We will also try other directions for learning the policy, such as methods based on adversarial learning. Generative Adversarial Imitation Learning Ho and Ermon [2016] and Adversarial Inverse Reinforcement Learning Fu et al. [2017] show promising features in learning robotic control, and they can recover both a policy and a reward function from demonstrations. Applying them to the dynamically changing environment of autonomous driving will be challenging but interesting work.

References
[1] Adam et al. (2011) Experience replay for real-time reinforcement learning control. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (2), pp. 201–212.
[2] Choi et al. (2015) Lane change and path planning of autonomous vehicles using GIS. In 2015 12th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), pp. 163–166.
[3] Degris et al. (2012) Model-free reinforcement learning with continuous action in practice. In 2012 American Control Conference (ACC), pp. 2177–2182.
[4] Fu et al. (2017) Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248.
[5] Gu et al. (2016) Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838.
[6] Haarnoja et al. (2018) Composable deep reinforcement learning for robotic manipulation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6244–6251.
[7] Ho and Ermon (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573.
[8] Ho et al. (2009) Lane change algorithm for autonomous vehicles via virtual curvature method. Journal of Advanced Transportation 43 (1), pp. 47–70.
[9] Mnih et al. (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[10] Ngai and Yung (2007) Automated vehicle overtaking based on a multiple-goal reinforcement learning framework. In 2007 IEEE Intelligent Transportation Systems Conference, pp. 818–823.
[11] Sallab et al. (2016) End-to-end deep reinforcement learning for lane keeping assist. arXiv preprint arXiv:1612.04340.
[12] Schulman et al. (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
[13] Silver et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
[14] Sutton et al. (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
[15] Treiber et al. (2000) Congested traffic states in empirical observations and microscopic simulations. Physical Review E 62 (2), pp. 1805.
[16] Wang et al. (2018) A reinforcement learning based approach for automated lane change maneuvers. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1379–1384.
[17] Wang and Chan (2017) Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pp. 1–6.
[18] Yu et al. (2016) Deep reinforcement learning for simulated autonomous vehicle control. Course Project Reports: Winter, pp. 1–7.