I Introduction
Reinforcement learning is a goal-directed, learning-based method that can be used for control tasks [1]. Reinforcement learning is formulated as a Markov Decision Process (MDP) wherein an agent takes an action based on the current environment state, and receives a reward as the environment moves to the next state due to the action taken. The goal of the reinforcement learning agent is to learn a state-action mapping policy that maximizes the long-term cumulative reward. Deep reinforcement learning (DRL) utilizes deep (multi-layer) neural nets to approximate the optimal state-action policy through trial and error as the agent interacts with the environment during training [2]. DRL has found recent breakthroughs as it surpassed humans in playing board games [3]. DRL is actively evolving, and various algorithms have been developed, including Deep Q Networks [2], Double Deep Q Networks [4], Deep Deterministic Policy Gradient (DDPG) [5], Distributed Distributional Deterministic Policy Gradient [6], and Soft Actor Critic [7].

Connected and automated vehicles have become increasingly popular in academia and industry since the DARPA Urban Challenge, as autonomous driving could potentially become a reality [8]. Fully autonomous driving is a challenging task since transportation traffic can be dynamic, high-speed, and unpredictable. The Society of Automotive Engineers has defined multiple levels of automation as we progress from partial automation, such as Advanced Driver Assistance Systems (ADAS), to full automation. Current ADAS include Adaptive Cruise Control (car-following control), lane-keeping assistance, lane-change assistance, emergency braking assistance, and driver drowsiness detection [9]. Future highly automated vehicles shall be able to tackle more challenging traffic scenarios such as freeway on-ramp merging, intersection maneuvering, and roundabout traversing.
Since DRL has been demonstrated to surpass humans in certain domains, it could potentially be suited to solving challenging tasks in automated driving to achieve superhuman performance. The current literature has seen DRL used to tackle various traffic scenarios for automated driving. In [10], Deep Q-learning is used to guide an autonomous vehicle merging onto a freeway from an on-ramp. In [11, 12, 13, 14], Deep Q Networks and/or DDPG allow an autonomous vehicle to maneuver through a single intersection while avoiding collisions. DRL is also used to solve the multiple-intersection management problem to maximize a global autonomous driving objective [15]. In [16], DRL is used to handle the lane-change maneuver. Other studies have also used DRL to train a single agent to handle a variety of driving tasks [17, 18].
However, all the above-mentioned studies consider point-mass kinematic models of the vehicle, instead of vehicle dynamic models wherein acceleration delay and acceleration command dynamics are included. With acceleration delay, the reinforcement learning action, such as the target acceleration, is delayed in time; with acceleration command dynamics, the actual acceleration does not rise to the target acceleration immediately [19]. We acknowledge that acceleration command dynamics is considered in a couple of recent works that use DRL for vehicle control. In [20], a longitudinal dynamic model is considered for predictive speed control using DDPG. In [21], a car-following controller with acceleration command dynamics considered is developed using DDPG by learning from naturalistic human-driving data. However, neither study investigated the impact of acceleration delay, which could degrade the control performance.
Regarding car-following control using DRL, there are other studies in the literature that have developed such controllers. In [22], a cooperative car-following controller is developed using policy gradient with a single-layer neural net. In [23, 24], human-like car-following controllers that do not consider vehicle dynamics are developed using deterministic policy gradient by learning from naturalistic human-driving data. To the best of our knowledge, there is currently no study that utilizes DRL to develop a car-following controller through self-play in simulation (rather than learning from naturalistic data).
There are studies in the literature that investigate delayed control inputs in non-deep reinforcement learning. It is suggested that the delay can negatively influence control performance if it is not considered in the reinforcement learning controller development [25]. A few approaches have been proposed to cope with control delay in reinforcement learning. In [26], the environment state is augmented by adding the delayed control inputs, i.e., the actions in the delay interval that have not been executed, for developing a vehicle speed controller using reinforcement learning whose state-action mapping policy is a decision tree instead of a neural net. In [27], the authors proposed learning the underlying dynamic system model so as to use the model to predict the future state after the delay for the purpose of determining the current control action. In [28], a memoryless method that exploits the delay length is proposed to directly learn the control action from the current environment state, with the state-action mapping policy being a tile coding function instead of a neural net. There is currently no study that researches how a deep neural net trained in a no-control-delay environment responds to control delay, and no work that develops a DRL controller with control delay considered.

Our work here studies the importance of vehicle dynamics, which include both acceleration delay and acceleration command dynamics, in developing a DRL controller for automated vehicles through empirical study. We first investigate whether a DRL agent trained using vehicle kinematic models could be used for more realistic control with vehicle dynamics. We consider a particular car-following scenario wherein the preceding vehicle maintains a constant speed. As it turns out that the DRL controller trained using a kinematic model suffers significantly degraded performance when vehicle dynamics exist, we redesign the DRL controller by adding the delayed control inputs and the actual acceleration to the environment state [29, 26] to accommodate vehicle dynamics.
II Car-Following Problem Formulation
In this section, we derive the state-space equations of the car-following control system so as to (1) understand how it could fit into the reinforcement learning framework with state-action mapping, and (2) use dynamic programming (DP) to compute the global optimal solutions for comparison with DRL solutions. DP is based on the state-space equations and checks all permissible state values to search for the global minimum cost for the control system [30].
We acknowledge that the relatively easy car-following control problem may preferably be solved using classical control methods instead of DRL, which is better suited to more challenging control tasks such as freeway on-ramp merging. We choose the car-following control problem here because it can be explicitly modeled to obtain the state-space equations, with which we can use DP to solve for guaranteed global optimal solutions for comparison purposes. The DP solutions are critical because they serve as benchmarks with which we can evaluate the DRL controllers trained with either the vehicle dynamic or kinematic model. Other autonomous driving control tasks such as freeway on-ramp merging may not be explicitly modeled since they involve highly complex multi-vehicle interactions.
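To make the DP benchmark concrete, the following is a minimal finite-horizon value-iteration sketch over a discretized grid of the kinematic error states. The grid resolutions, horizon, and the k_1/k_2 weight split are illustrative assumptions, and the function names are our own, not the paper's implementation.

```python
import numpy as np

# Minimal finite-horizon DP sketch for the kinematic car-following model
# (state: gap error e and error rate e_dot; constant-speed leader, a_p = 0).
DT = 0.1           # time step [s]
U_MAX = 2.6        # max |acceleration command| [m/s^2]
E_MAX = 10.0       # nominal max |error| [m]
K1, K2 = 0.8, 0.2  # cost weights (assumed example split)

e_grid = np.linspace(-5.0, 5.0, 101)
de_grid = np.linspace(-5.0, 5.0, 101)
u_grid = np.linspace(-U_MAX, U_MAX, 21)

def nearest(grid, vals):
    """Index of the nearest grid point for each value (state snapping)."""
    return np.abs(grid[None, :] - vals[:, None]).argmin(axis=1)

def dp_policy(horizon=200):
    """Backward recursion for the optimal cost-to-go V over the state grid."""
    V = np.zeros((e_grid.size, de_grid.size))
    E, DE = np.meshgrid(e_grid, de_grid, indexing="ij")
    for _ in range(horizon):
        best = np.full_like(V, np.inf)
        for u in u_grid:
            # Forward-Euler transition: e' = e + e_dot*dt, e_dot' = e_dot - u*dt
            e_next = np.clip(E + DE * DT, e_grid[0], e_grid[-1])
            de_next = np.clip(DE - u * DT, de_grid[0], de_grid[-1])
            ii = nearest(e_grid, e_next.ravel())
            jj = nearest(de_grid, de_next.ravel())
            # Discretized absolute-value stage cost (Eq. 7 integrand times dt)
            stage = (K1 * np.abs(E) / E_MAX + K2 * abs(u) / U_MAX) * DT
            best = np.minimum(best, stage + V[ii, jj].reshape(V.shape))
        V = best
    return V
```

A finer grid reduces the snapping error of the nearest-neighbor state lookup, at the cost of a much longer run time, which is the trade-off noted in Section IV.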
We consider a simple car-following control problem wherein a following vehicle desires to maintain a constant distance headway d_des between itself and its preceding vehicle; see Fig. 1. The gap-keeping error dynamic equations of the car-following control system can be derived as:

e = (x_p - x_f - L) - d_des, \quad \dot{e} = v_p - v_f, \quad \ddot{e} = a_p - a_f   (1)

where e is the error between the actual inter-vehicle distance x_p - x_f - L and the desired distance headway d_des, L is the vehicle body length of the preceding vehicle, x_p and x_f are the distances traveled by the preceding and the following vehicles, respectively, v_p and v_f are the velocities of the preceding and following vehicles, respectively, and a_p and a_f are the actual accelerations of the preceding and the following vehicles, respectively. For the state-space representation, we define x_1 = e and x_2 = \dot{e}. Then

\dot{x}_1 = x_2, \quad \dot{x}_2 = a_p - a_f   (2)
Assuming no vehicle-to-vehicle communication, the preceding vehicle's acceleration a_p is unknown to the following vehicle. As the DRL algorithm used here is Deep Deterministic Policy Gradient, which demands the system to be deterministic, we only consider the preceding vehicle's speed to be constant, with a_p = 0. In fact, without knowing the preceding vehicle's acceleration, the system is not closed and the exact optimal solution cannot be found. We found that even though the DRL neural nets are trained for this scenario in which the preceding vehicle has a constant speed, the trained neural nets can be applied to scenarios where the preceding vehicle accelerates or decelerates, with acceptable gap-keeping errors. Since the purpose of this paper is to compare the use of dynamic versus kinematic models for vehicle control, we do not show such results here.
Now we consider using the vehicle kinematic and dynamic models for the control. For a point-mass kinematic model, the following vehicle's control input u is exactly the acceleration a_f, i.e., a_f = u. The vehicle integrates and double-integrates the control input (acceleration) for velocity and position updates, respectively. Thus, the state-space representation when using a point-mass kinematic model is

\dot{x}_1 = x_2, \quad \dot{x}_2 = a_p - u   (3)
For a vehicle dynamic model, we adopt a simplified first-order system for the acceleration command dynamics from the current literature used for the Toyota Prius and Volvo S60 [31, 32], which is shown in the Laplace domain as

\frac{A_f(s)}{U(s)} = \frac{e^{-\tau_d s}}{\tau s + 1}   (4)

where s is the Laplace transform variable, A_f(s) and U(s) are the Laplace transforms of a_f(t) and u(t), respectively, \tau is the time constant of the first-order system, and \tau_d is the acceleration time delay. In the time domain, the first-order system can be interpreted as

\tau \dot{a}_f(t) + a_f(t) = u(t - \tau_d)   (5)

where u(t - \tau_d) denotes that u is delayed by \tau_d in time. Introducing another state variable x_3 = a_f, the state-space representation when using the dynamic model is

\dot{x}_1 = x_2, \quad \dot{x}_2 = a_p - x_3, \quad \dot{x}_3 = \frac{u(t - \tau_d) - x_3}{\tau}   (6)
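A minimal forward-Euler rollout of the dynamic model above, with the input delay realized as a FIFO buffer of pending commands, might look as follows. The parameter values follow Table I; the sign conventions for the default initial conditions and the feedback gains in the usage note are our own assumptions.

```python
from collections import deque

# Forward-Euler simulation of the error dynamics with first-order
# acceleration command dynamics and input delay (Eqs. 5-6), assuming a
# constant-speed preceding vehicle (a_p = 0).
DT = 0.1      # discrete time step [s]
TAU = 0.5     # acceleration command dynamics time constant [s]
TAU_D = 0.2   # acceleration delay [s]

def simulate(controller, e0=-2.5, de0=2.5, a0=0.0, steps=200):
    """Roll out [e, e_dot, a_f]; controller maps (e, e_dot, a_f) -> u."""
    n_delay = int(round(TAU_D / DT))
    pending = deque([0.0] * n_delay)  # commands issued but not yet applied
    e, de, a = e0, de0, a0
    trace = []
    for _ in range(steps):
        pending.append(controller(e, de, a))
        u_delayed = pending.popleft()  # command issued TAU_D seconds ago
        # Euler updates: e += e_dot*dt; e_dot += -a_f*dt (a_p = 0);
        # a_f += (u_delayed - a_f)/tau * dt
        e, de, a = (e + de * DT,
                    de - a * DT,
                    a + (u_delayed - a) / TAU * DT)
        trace.append((e, de, a))
    return trace
```

For example, a simple saturated feedback u = 0.5 e + 1.0 \dot{e} (gains chosen purely for illustration) drives the gap error toward zero despite the lag and delay.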
The control goal is to minimize both the error and the control effort, which is a common goal of classical control methods such as the Linear Quadratic Regulator (LQR) and Model Predictive Control (MPC). Here we define the absolute-value cost for the car-following control system as

J = \int_0^T \left( k_1 \frac{|e|}{e_{max}} + k_2 \frac{|u|}{u_{max}} \right) dt   (7)

where T is the terminal time, |e| and |u| denote the absolute values of the error and control input, respectively, u_{max} is the allowed maximum of |u|, e_{max} is the nominal maximum of |e|, and k_1 and k_2 are weighting coefficients that satisfy k_1 > 0, k_2 > 0, and k_1 + k_2 = 1. The k_1 and k_2 values can be adjusted so as to decide the weighting of minimizing the error over the control action in the combined cost. The e_{max} is a nominal maximum because the gap-keeping error can be very large, especially during DRL training wherein the vehicle can have any acceleration behavior before it gets well trained; see the next section. We choose a sufficiently large e_{max} to represent a maximum gap-keeping error of a general car-following transient state.
As both dynamic programming and reinforcement learning are based on discrete time, the above continuous-time equations are discretized using a forward Euler integrator. Note that the absolute-value cost is different from the quadratic cost used for LQR and MPC. This is because, for DRL, absolute-value rewards lead to lower steady-state errors [33]. As we want to compare DRL solutions with DP ones, the DP cost function needs to be the same as the DRL's.
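The forward-Euler discretization of the kinematic model (Eq. 3 with a_p = 0) and the discretized absolute-value stage cost can be sketched as below; the matrix form is our own notation and the k_1/k_2 split is an assumed example.

```python
import numpy as np

# Forward-Euler discretization of the kinematic error dynamics,
# written as x_{k+1} = A x_k + B u_k, plus the discretized
# absolute-value stage cost of Eq. 7.
DT = 0.1
A = np.array([[1.0, DT],
              [0.0, 1.0]])
B = np.array([0.0, -DT])  # u enters the error-rate equation with a minus sign

def step(x, u):
    """One discrete-time update of the state x = [e, e_dot]."""
    return A @ x + B * u

def stage_cost(x, u, k1=0.8, k2=0.2, e_max=10.0, u_max=2.6):
    """Integrand of Eq. 7 evaluated at (x, u), multiplied by the time step."""
    return (k1 * abs(x[0]) / e_max + k2 * abs(u) / u_max) * DT
```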
III Deep Reinforcement Learning Algorithm
In this section, we introduce the reinforcement learning framework and the specific DRL algorithm, DDPG (Deep Deterministic Policy Gradient), that we use to solve the above carfollowing control problem.
III-A Reinforcement Learning
As stated in [1], reinforcement learning is learning what to do, i.e., how to map states to actions, so as to maximize a numerical cumulative reward. The formulation of reinforcement learning is a Markov Decision Process. At each time step t = 0, 1, 2, ..., a reinforcement learning agent receives the environment state s_t, and on that basis selects an action a_t. As a consequence of the action, the agent receives a numerical reward r_{t+1} and finds itself in a new state s_{t+1}. In reinforcement learning, there are probability distributions for transitioning from a state to an action and for the corresponding reward, which are not illustrated here. The goal in reinforcement learning is to learn an optimal state-action mapping policy \pi^* that maximizes the expected cumulative discounted reward \mathbb{E}[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}], with \mathbb{E} denoting the expectation over the probabilities and \gamma the discount factor. The superscript * denotes optimality. The Q-value, i.e., the state-action value, for time step t is defined as the expected cumulative discounted reward calculated from time t, i.e., Q(s_t, a_t) = \mathbb{E}[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}]. The reinforcement learning problem is solved using Bellman's principle of optimality. That is, if the optimal state-action value Q^*(s_{t+1}, a_{t+1}) for the next time step is known, then the optimal state-action value for the current time step can be solved by taking the action that maximizes r_{t+1} + \gamma Q^*(s_{t+1}, a_{t+1}).

The reinforcement learning framework for the car-following control system is based on the state-space equations described in the previous section. The action of the reinforcement learning framework is the control input u_t of the car-following control system for time step t. The reward function is the negative value of the discretized absolute-value cost defined in Equation 7 of the previous section:

r_t = -\left( k_1 \frac{|e_t|}{e_{max}} + k_2 \frac{|u_t|}{u_{max}} \right) \Delta t   (8)
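The per-step reward of Eq. 8, together with the clipping to [-1, 0] that the text applies for training stability, can be sketched as follows. The normalization constants and clipping range follow the text; the k_1/k_2 weight split is an assumed example value.

```python
# Clipped reward: the negative discretized absolute-value cost (Eq. 8),
# clipped to [-1, 0] to avoid huge bumps in the gradient updates.
E_MAX = 10.0   # nominal max |error| [m]
U_MAX = 2.6    # max |control input| [m/s^2]
DT = 0.1       # time step [s]

def reward(e, u, k1=0.8, k2=0.2):
    """Per-step reward seen by the DDPG agent; k1/k2 split is assumed."""
    r = -(k1 * abs(e) / E_MAX + k2 * abs(u) / U_MAX) * DT
    return max(r, -1.0)
```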
With this expression, the reward value range is (-\infty, 0]. We clip the reward to be in the range [-1, 0] to avoid huge bumps in the gradient updates of the policy and Q-value neural networks of DDPG. Such huge bumps in the gradient updates lead to training instability [34].

We consider 4 cases of the reinforcement learning framework, as this work compares using dynamic versus kinematic models for autonomous vehicle control. For case 1, a kinematic model is used. Based on Equation 3, the gap-keeping error and error rate are sufficient to describe the dynamic system, so the environment state vector is s_t = (e_t, \dot{e}_t) for time step t.

For case 2, only acceleration delay is considered, with no acceleration command dynamics. We consider this intermediate case for comparison purposes as well. In fact, for hybrid electric vehicles such as the Toyota Prius [31], the time constant \tau in the acceleration dynamics equation is small, which means that the vehicle responds to a desired acceleration very quickly. For pure electric vehicles, the response is even faster. For such vehicles, the acceleration command dynamics results in little degradation of the DRL control performance, as we observed in our simulations. Therefore, case 2 may represent DRL control for hybrid and pure electric vehicles. For this case, we define the state vector as s_t = (e_t, \dot{e}_t, u_{t-k}, ..., u_{t-1}), with k being the largest integer such that k \Delta t \le \tau_d, where \Delta t = 0.1 s is one time step. This means that we feed into the DRL agent the past control inputs that have not yet been executed by the control system due to the time delay. We expect the DRL agent to use these delayed control inputs to solve for the corresponding system responses that would happen in the future and predict the next optimal control input u_t.
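A sketch of how the case-2 augmented state might be maintained, with a fixed-length buffer of the pending (not-yet-executed) commands; the class and method names are our own, not from the paper.

```python
from collections import deque
import numpy as np

# Building the case-2 augmented environment state: gap error, error rate,
# and the k most recent commands not yet executed due to the delay.
DT = 0.1
TAU_D = 0.2
K = int(round(TAU_D / DT))  # number of pending (delayed) actions

class DelayedActionState:
    def __init__(self):
        # u_{t-K}, ..., u_{t-1}: issued but not yet applied to the plant
        self.pending = deque([0.0] * K, maxlen=K)

    def observe(self, e, e_dot):
        """State vector fed to the DRL agent: [e, e_dot, u_{t-K..t-1}]."""
        return np.array([e, e_dot, *self.pending])

    def push(self, u):
        """Record the newly issued command; the oldest pending command
        drops out of the buffer (it is the one reaching the plant now)."""
        self.pending.append(u)
```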
For case 3, only acceleration command dynamics is considered, with no acceleration delay. For this case, the time constant is \tau = 0.5 s, which applies to gas-engine vehicles such as the Volvo S60 [32]. We consider this intermediate case for comparison purposes. According to Equation 6, the state vector is s_t = (e_t, \dot{e}_t, a_{f,t}), which includes the error, error rate, and the actual acceleration of the following vehicle.
TABLE I: Car-following control system parameters
Discrete time step \Delta t: 0.1 s
Nominal max error e_{max}: 10 m
Max control input u_{max}: 2.6 m/s^2
Acceleration delay \tau_d: 0.2 s
Acceleration command dynamics time constant \tau: 0.5 s
Preceding vehicle constant speed: 30 m/s
Following vehicle initial speed: 27.5 m/s
Initial gap-keeping error: 2.5 m
For case 4, both acceleration command dynamics and delay are considered. For this case, the time constant is also \tau = 0.5 s for gas-engine vehicles. The state vector is s_t = (e_t, \dot{e}_t, a_{f,t}, u_{t-k}, ..., u_{t-1}). Table I shows the parameter values for the car-following control system.
III-B Deep Deterministic Policy Gradient
The DRL algorithm that we use is DDPG, exactly as proposed in [5]. Here we provide a brief description of the DDPG algorithm and encourage the reader to read the original paper. The DDPG algorithm utilizes two deep neural networks: an actor network and a critic network. The actor network represents the state-action mapping policy \mu(s|\theta^\mu), where \theta^\mu denotes the actor neural net weight parameters, and the critic network represents the Q-value function (cumulative discounted reward) Q(s, a|\theta^Q), where \theta^Q denotes the critic neural net weight parameters. DDPG concurrently learns the policy and the Q-value function. For learning the Q-value (Q-learning), Bellman's principle of optimality is followed to minimize the mean-squared Bellman error loss using gradient descent. For learning the policy, gradient ascent is performed with respect to the policy parameters only, to maximize the Q-value Q(s, \mu(s|\theta^\mu)).
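The two core DDPG computations described above, the critic's Bellman targets and the soft update of the target-network parameters, can be sketched with plain numpy and generic callables standing in for the neural nets; a real implementation would use a deep-learning framework with automatic differentiation, and the function names here are our own.

```python
import numpy as np

# Core DDPG update quantities; parameter values follow Table II.
GAMMA = 0.99  # reward discount factor
RHO = 0.001   # target network update coefficient

def critic_targets(rewards, next_states, target_actor, target_critic):
    """y_i = r_i + gamma * Q'(s'_i, mu'(s'_i)) for a minibatch; the critic
    is then fit to these targets by minimizing the mean-squared error."""
    next_actions = target_actor(next_states)
    return rewards + GAMMA * target_critic(next_states, next_actions)

def soft_update(target_params, params):
    """theta' <- rho * theta + (1 - rho) * theta' for each parameter array,
    so the target networks track the learned networks slowly."""
    return [RHO * p + (1.0 - RHO) * tp for tp, p in zip(target_params, params)]
```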
TABLE II: DDPG algorithm parameters
Target network update coefficient: 0.001
Reward discount factor \gamma: 0.99
Actor learning rate: 0.0001
Critic learning rate: 0.001
Experience replay memory size: 500,000
Minibatch size: 64
Actor Gaussian noise mean: 0
Actor Gaussian noise standard deviation: 0.02
Target networks are adopted to stabilize training [2]. We use Gaussian noise for action exploration [20]. Minibatch gradient descent is used [5]. Experience replay is used for stability concerns [2]. Batch normalization is used to accelerate learning by reducing internal covariate shift [35]. Please see Table II for the DDPG algorithm parameter values.

Both the actor and critic networks are 2-layer neural nets for all cases. For training with vehicle kinematics (case 1) and just acceleration command dynamics (case 3), the neural nets have 64 neurons per layer and the training time is 1 million time steps. For training with control delays (cases 2 and 4), the neural nets have 128 neurons per layer and the training time is 1.5 million time steps. For all cases, the training converges. Fig. 2 shows the undiscounted episode reward for case 1. The plots of the undiscounted episode rewards for all the other cases look similar to that for case 1 and are not shown here. We use the undiscounted episode reward since it allows us to track changes in the latter part of the car-following errors easily. Note that, with the discount factor, the last reward at 20 seconds (200 time steps) of one episode is discounted by \gamma^{200} = 0.99^{200} \approx 0.13.

IV Results
In this section, the DRL results for the above-mentioned 4 cases are presented. We also present the DP results, which are the global optimal solutions, for all cases for comparison purposes. We first present DRL and DP results for car-following control with a point-mass kinematic model, and the results of applying this kinematics-model-trained DRL controller to car-following control with vehicle dynamics, which are shown in Fig. 3. We then present the results of our proposed solution to deal with acceleration delay and acceleration command dynamics by adding the delayed control inputs and the current actual acceleration to the environment state, which are shown in Fig. 4. Note that the acceleration command dynamics for all related cases is for gas-engine vehicles with time constant \tau = 0.5 s.
In Fig. 3, when trained using a point-mass kinematic model, the DRL agent achieves a near-optimal solution as compared with the DP results; see the blue solid and black dashed lines. When this DRL controller is applied to car-following control with just acceleration delay, the car-following performance is degraded to a small extent. The gap-keeping error is able to return to near-zero in the steady state; see column (a) of Fig. 3. When this DRL controller is applied to car-following control with just acceleration command dynamics, the car-following performance is degraded to a greater extent than in the delay case. The gap-keeping error takes longer to return to near-zero in the steady state; see column (b) of Fig. 3. When this DRL controller is applied to car-following control with both acceleration delay and command dynamics, the performance is the worst. Both the transient and steady-state performances are significantly degraded. The steady-state error does not return to zero and forms a wavy oscillation pattern with a maximum of 0.73 m and a minimum of 0.22 m.
Columns (a), (b), and (c) in Fig. 4 show the results for the redesigned DRL controllers trained with acceleration delay (case 2), acceleration command dynamics (case 3), and both acceleration delay and command dynamics (case 4), respectively. For all these cases, the DRL controllers achieve near-optimal solutions as compared to the DP ones. Note that the steady-state gap-keeping errors of the DP solutions in columns (b) and (c) are around 0.5 m. These would be reduced if we used a smaller interval to create the evenly spaced samples of the states for DP, although that would take much longer to run.
V Conclusion
By solving a particular car-following control problem using deep reinforcement learning (DRL), we show that a DRL controller trained with a point-mass kinematic model cannot be generalized to more realistic control situations with both vehicle acceleration delay and acceleration command dynamics. We added the control inputs that are delayed and have not been executed, together with the actual acceleration of the vehicle, to the reinforcement learning environment state for DRL controller development with vehicle dynamics. The training results show that this approach provides near-optimal solutions for car-following control with vehicle dynamics.
When the reinforcement learning environment state is augmented with the delayed control inputs, the DRL agent is expected to utilize the delayed control inputs to predict the system behavior in the future and determine the next optimal control action. Our results show that the DRL agent is capable of doing so after training, in a near-optimal manner. However, because the environment state is augmented with more variables, the neural network size needs to be increased and more training time is needed, which is a disadvantage. As stated in the introduction, an alternative method is to learn the underlying dynamic system separately and use the learned system to predict the system behavior after the delay time so as to determine the current control action [27]. However, this method may not be feasible for challenging autonomous driving control systems such as merging control, because such systems are subject to many variations and disturbances due to multi-vehicle interactions. It may not be easy to develop or learn an accurate model for such systems.
Future work includes developing a more robust car-following DRL controller that can be trained with rich variations of the preceding vehicle's speed. Another research direction is to develop DRL controllers with vehicle dynamics considered for more challenging autonomous driving scenarios such as freeway on-ramp merging.
Acknowledgment
The authors would like to thank Toyota, Ontario Centres of Excellence, and Natural Sciences and Engineering Research Council of Canada for the support of this work.
References
 [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
 [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.

 [4] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [5] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [6] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap, “Distributed distributional deterministic policy gradients,” arXiv preprint arXiv:1804.08617, 2018.
 [7] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
 [8] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer, et al., “Autonomous driving in urban environments: Boss and the urban challenge,” Journal of Field Robotics, vol. 25, no. 8, pp. 425–466, 2008.
 [9] A. Eskandarian, Handbook of intelligent vehicles. Springer London, 2012.
 [10] P. Wang and C.-Y. Chan, “Autonomous ramp merge maneuver based on reinforcement learning with continuous action space,” arXiv preprint arXiv:1803.09203, 2018.
 [11] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, “Navigating occluded intersections with autonomous vehicles using deep reinforcement learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 2034–2039.
 [12] Z. Qiao, K. Muelling, J. M. Dolan, P. Palanisamy, and P. Mudalige, “Automatically generated curriculum based reinforcement learning for autonomous vehicles in urban environment,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1233–1238.
 [13] Z. Qiao, K. Muelling, J. Dolan, P. Palanisamy, and P. Mudalige, “POMDP and hierarchical options MDP with continuous actions for autonomous driving at intersections,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2377–2382.
 [14] C. Li and K. Czarnecki, “Urban driving with multi-objective deep reinforcement learning,” arXiv preprint arXiv:1811.08586, 2018.
 [15] H. Mirzaei and T. Givargis, “Finegrained acceleration control for autonomous intersection management using deep reinforcement learning,” in 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation. IEEE, 2017, pp. 1–8.
 [16] P. Wang, C.-Y. Chan, and A. de La Fortelle, “A reinforcement learning based approach for automated lane change maneuvers,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1379–1384.
 [17] P. Wolf, K. Kurzer, T. Wingert, F. Kuhnt, and J. M. Zollner, “Adaptive behavior generation for autonomous driving using deep reinforcement learning with compact semantic states,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 993–1000.
 [18] S. Aradi, T. Becsi, and P. Gaspar, “Policy gradient based reinforcement learning approach for autonomous highway driving,” in 2018 IEEE Conference on Control Technology and Applications (CCTA). IEEE, 2018, pp. 670–675.
 [19] R. N. Jazar, Vehicle dynamics: theory and application. Springer, 2017.
 [20] M. Bucchel and A. Knoll, “Deep reinforcement learning for predictive longitudinal control of automated vehicles,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2391–2397.
 [21] S. Wei, Y. Zou, T. Zhang, X. Zhang, and W. Wang, “Design and experimental validation of a cooperative adaptive cruise control system based on supervised reinforcement learning,” Applied Sciences, vol. 8, no. 7, p. 1014, 2018.
 [22] C. Desjardins and B. Chaib-Draa, “Cooperative adaptive cruise control: A reinforcement learning approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, pp. 1248–1260, 2011.
 [23] D. Zhao, B. Wang, and D. Liu, “A supervised actor–critic approach for adaptive cruise control,” Soft Computing, vol. 17, no. 11, pp. 2089–2099, 2013.
 [24] M. Zhu, X. Wang, and Y. Wang, “Humanlike autonomous carfollowing model with deep reinforcement learning,” Transportation Research Part C: Emerging Technologies, vol. 97, pp. 348–368, 2018.
 [25] E. Schuitema, M. Wisse, T. Ramakers, and P. Jonker, “The design of leo: a 2d bipedal walking robot for online autonomous reinforcement learning,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 3238–3243.
 [26] T. Hester and P. Stone, “Texplore: real-time sample-efficient reinforcement learning for robots,” Machine Learning, vol. 90, no. 3, pp. 385–429, 2013.
 [27] T. J. Walsh, A. Nouri, L. Li, and M. L. Littman, “Learning and planning in environments with delayed feedback,” Autonomous Agents and MultiAgent Systems, vol. 18, no. 1, p. 83, 2009.
 [28] E. Schuitema, L. Buşoniu, R. Babuška, and P. Jonker, “Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 3226–3231.
 [29] K. V. Katsikopoulos and S. E. Engelbrecht, “Markov decision processes with delays and asynchronous cost collection,” IEEE transactions on automatic control, vol. 48, no. 4, pp. 568–574, 2003.
 [30] D. S. Naidu, Optimal control systems. CRC press, 2002.
 [31] J. Ploeg, B. T. Scheepers, E. Van Nunen, N. Van de Wouw, and H. Nijmeijer, “Design and experimental evaluation of cooperative adaptive cruise control,” in 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, 2011, pp. 260–265.
 [32] K. Lidstrom, K. Sjoberg, U. Holmberg, J. Andersson, F. Bergh, M. Bjade, and S. Mak, “A modular CACC system integration and design,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 3, pp. 1050–1061, 2012.
 [33] J.-M. Engel and R. Babuška, “Online reinforcement learning for nonlinear motion control: Quadratic and non-quadratic reward functions,” IFAC Proceedings Volumes, vol. 47, no. 3, pp. 7043–7048, 2014.
 [34] H. P. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver, “Learning values across many orders of magnitude,” in Advances in Neural Information Processing Systems, 2016, pp. 4287–4295.
 [35] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.