Reinforcement learning is a goal-directed learning-based method that can be used for control tasks 
. Reinforcement learning is formulated as a Markov Decision Process (MDP) wherein an agent takes an action based on the current environment state, and receives a reward as the environment moves to the next state due to the action taken. The goal of the reinforcement learning agent is to learn a state-action mapping policy that maximizes the long-term cumulative reward. DRL utilizes deep (multi-layer) neural nets to approximate the optimal state-action policy through trial and error as the agent interacts with the environment during training. DRL has found recent breakthroughs as it surpassed humans in playing board games . DRL is actively evolving and various algorithms have been developed which include Deep Q Networks , Double Deep Q Networks , Deep Deterministic Policy Gradient (DDPG) , Distributed Distributional Deterministic Policy Gradient , and Soft Actor Critic .
Connected and automated vehicles have become increasingly popular in academia and industry since DARPA urban challenge as autonomous driving could potentially become a reality . Fully autonomous driving is a challenging task since the transportation traffic can be dynamic, high-speed, and unpredictable. The Society of Automotive Engineers has defined multiple levels of automation as we progress from partial, such as Advanced Driver Assistance Systems (ADAS), to full automation. Current ADAS include Adaptive Cruise Control (car-following control), lane-keeping assistance, lane-change assistance, emergency braking assistance, and driver drowsiness detection. Future highly automated vehicles shall be able to tackle more challenging traffic scenarios such as freeway on-ramp merging, intersection maneuver, and roundabout traversing.
Since DRL has been demonstrated to surpass humans in certain domains, it could potentially be suited to solve the challenging tasks in automated driving to achieve superhuman performance. Current literature has seen that DRL is used to tackle various traffic scenarios for automated driving. In , Deep Q-learning is used to guide an autonomous vehicle to merge to freeway from on-ramp. In [11, 12, 13, 14], Deep Q Networks and/or DDPG allow an autonomous vehicle to maneuver through a single intersection while avoiding collisions. DRL is also used to solve the multiple-intersections management problem to maximize a global autonomous driving objective . In , DRL is used to solve for the lane change maneuver. Other studies have also used DRL to train a single agent to handle a variety of driving tasks [17, 18].
However, all the above-mentioned studies consider point-mass kinematic models of the vehicle, instead of vehicle dynamic models wherein acceleration delay and acceleration command dynamics are included. With acceleration delay, the reinforcement learning action such as the target acceleration is delayed in time; with acceleration command dynamics, the actual acceleration does not rise up to the target acceleration immediately . We acknowledge that acceleration command dynamics is being considered in a couple of most recent works that use DRL for vehicle control. In , a longitudinal dynamic model is considered for predictive speed control using DDPG. In , a car-following controller is developed with acceleration command dynamics considered using DDPG by learning from naturalistic human-driving data. However, both studies did not investigate the impact of acceleration delay, which could degrade the control performance.
Regarding car-following control using DRL, there are other studies in the literature that have developed such controllers. In , a cooperative car-following controller is developed using policy gradient with a single-layer neural net. In [23, 24], human-like car-following controllers without considering vehicle dynamics are developed using deterministic policy gradient by learning from naturalistic human-driving data. To the best of our knowledge, there is currently no study that utilizes DRL to develop a car-following controller through self-play in simulation (not learning from naturalistic data).
There are studies in the literature that investigate delayed control inputs in non-deep reinforcement learning. It is suggested that the delay can negatively influence control performance if it is not considered in the reinforcement learning controller development . A few approaches have been proposed to cope with control delay for reinforcement learning. In 
, the environment state is augmented by adding the delayed control inputs, i.e., the actions in the delay interval which have not been executed, for developing a vehicle speed controller using reinforcement learning whose state-action mapping policy is a decision tree instead of a neural net. In, the authors proposed to learn the underlying dynamic system model so as to use the model to predict the future state after the delay for the purpose of determining the current control action. In , a memoryless method that exploits the delay length is proposed to directly learn the control action from the current environment state with the state-action mapping policy being a tile coding function instead of a neural net. There is currently no study that researches how a deep neural net trained in a no-control-delay environment responds to control delay. There is currently no work that develops a DRL controller with control delay considered.
Our work here studies the importance of vehicle dynamics, which include both acceleration delay and acceleration command dynamics, in developing a DRL controller for automated vehicles through empirical study. We first investigate whether a DRL agent trained using vehicle kinematic models could be used for more realistic control with vehicle dynamics. We consider a particular car-following scenario wherein the preceding vehicle maintains a constant speed. As it shows that the DRL controller trained using a kinematic model causes significantly degraded performance when vehicle dynamics exists, we redesign the DRL controller by adding the delayed control inputs and the actual acceleration to the environment state [29, 26] to accommodate for vehicle dynamics.
Ii Car-Following Problem Formulation
In this section, we derive the state-space equations of the car-following control system so as to (1) understand how it could fit into the reinforcement learning framework with state-action mapping, and (2) use dynamic programming (DP) to compute the global optimal solutions for comparison with DRL solutions. DP is based on the state-space equations and checks all permissible state values to search for the global minimum cost for the control system .
We acknowledge that the relatively easy car-following control problem may preferably be solved using classical control method instead of DRL which is more capable to solve more challenging control tasks such as freeway on-ramp merging. We choose the car-following control problem here because it can be explicitly modeled to obtain the state-space equations with which we can use DP to solve for the guaranteed global optimal solutions for comparison purposes. The DP solutions are critical because they serve as benchmarks with which we can evaluate the DRL controllers trained with either the vehicle dynamic or kinematic model. The other autonomous driving control tasks such as freeway on-ramp merging may not be explicitly modeled since they involve highly complex multi-vehicle interactions.
We consider a simple car-following control problem wherein a following vehicle desires to maintain a constant distance headway between itself and its preceding vehicle , see Fig. 1. The gap-keeping error dynamic equations of the car-following control system can be derived as:
where is the error between the actual inter-vehicle distance and the desired distance headway , is the vehicle body length of the preceding vehicle , and are the distances traveled by the preceding and the following vehicles, respectively, and are the velocities of the preceding and following vehicles, respectively, and are the actual accelerations of the preceding and the following vehicles, respectively. For the state space representation, we define . Then
Assuming no vehicle-to-vehicle communication, the preceding vehicle’s acceleration is unknown to the following vehicle. As the DRL algorithm used here is Deep Deterministic Policy Gradient which demands the system to be deterministic, we only consider preceding vehicle’s speed to be a constant with . In fact, without knowing the preceding vehicle’s acceleration, the system is not closed and the exact optimal solution could not be found. We found that even though the DRL neural nets are trained for this scenario in which the preceding vehicle has a constant speed, the trained neural nets could be applied to scenarios when the preceding vehicle accelerates or decelerates with acceptable gap-keeping errors. Since the purpose of this paper is to compare the use of dynamics versus kinematic models for vehicle control, we do not show such results here.
Now we consider using the vehicle kinematic and dynamic models for the control. For a point-mass kinematic model, the following vehicle’s control input is exactly the acceleration , i.e., . The vehicle integrates and double-integrates over the control input (acceleration) for velocity and position updates, respectively. Thus, the state space representation when using a point-mass kinematic model is
For a vehicle dynamic model, we adopt a simplified first-order system for the acceleration command dynamics from the current literature used for Toyota Prius and Volvo S60 [31, 32], which is shown in Laplace Domain as
where is the Laplace Transform variable, and are the Laplace Transforms of and , respectively, is the time constant of the first-order system, and is the acceleration time delay. In time domain, the first-order system can be interpreted as
where denotes that is delayed by in time. Introducing another state variable , the state space representation when using the dynamic model is
The control goal is to minimize both the error and control effort, which is a common goal of classical control methods such as Linear Quadratic Regulator and Model Predictive Control. Here we define the absolute-value cost for the car-following control system as
where is the terminal time, and denote the absolute values of the error and control input, respectively, is the allowed maximum of , is the nominal maximum of , and and are coefficients that satisfy , , and . The and values can be adjusted so as to decide the weighting of minimizing the error over the control action in the combined cost. The is a nominal maximum because the gap-keeping error can be very large, especially during DRL training wherein the vehicle can have any acceleration behavior before it gets well trained, see the next section. We choose a sufficiently large to represent a maximum gap-keeping error of a general car-following transient state.
As both dynamic programming and reinforcement learning are based on discrete time, the above continuous-time equations are discretized using a forward Euler integrator. Note that the absolute-value cost is different than the quadratic cost for LQR and MPC. This is because, for DRL, absolute-value rewards lead to lower steady-state errors . As we want to compare DRL solutions with DP ones, the DP cost function needs to be the same as the DRL’s.
Iii Deep Reinforcement Learning Algorithm
In this section, we introduce the reinforcement learning framework and the specific DRL algorithm, DDPG (Deep Deterministic Policy Gradient), that we use to solve the above car-following control problem.
Iii-a Reinforcement Learning
As stated in , reinforcement learning is learning what to do, i.e., how to map states to actions, so as to maximize a numerical cumulative reward. The formulation of reinforcement learning is a Markov Decision Process. At each time step , , a reinforcement learning agent receives the environment state , and on that basis selects an action . As a consequence of the action, the agent receives a numerical reward and finds itself in a new state
. In reinforcement learning, there are probability distributions for transitioning from a state to an action and for the corresponding reward, which are not illustrated here. The goal in reinforcement learning is to learn an optimal state-action mapping policythat maximizes the expected cumulative discounted reward with denoting the expectation of the probabilities. The symbol denotes optimality. The Q-value, i.e., the state-action value, for time step is defined as the expected cumulative discounted reward calculated from time , i.e., . Reinforcement learning problem is solved using Bellman’s principle of optimality. That is, if the optimal state-action value for the next time step is known , then the optimal state-action value for the current time step can be solved by taking the action that maximizes .
The reinforcement learning framework for the car-following control system is based on the state-space equations described in the previous section. The action of the reinforcement learning framework is the control input of the car-following control system for time . The reward function is the negative value of the discretized absolute-value cost defined in Equation 7 of the previous section.
With this expression, the reward value range is (-
,0]. We clip the reward to be in the range [-1,0] to avoid huge bumps in the gradient update of the policy and Q-value neural networks of DDPG. The huge bumps in the gradient update lead to training instability.
We consider 4 cases of the reinforcement learning framework as this work compares using dynamic versus kinematic models for autonomous vehicle control. For case 1, a kinematic model is used. Based on Equation 3
, only the gap-keeping error and error rate are sufficient to solve for the dynamic system. So the environment state vector isfor time step .
For case 2, only acceleration delay is considered with no acceleration command dynamics. We consider this intermediate case for comparison purposes as well. In fact, for hybrid electric vehicles such as Toyota Prius , the time constant in the acceleration dynamics equation is small s, which means that the vehicle responds to a desired acceleration very quickly. Also, for pure electric vehicles, the response is even faster. For such vehicles, the acceleration command dynamics results in little degradation in the DRL control performance, as we observed in our simulations. Therefore, case 2 may represent DRL control for hybrid and pure electric vehicles. For this case, we define the state vector as with being the largest integer such that with s being one time step value. This means that we feed into the DRL agent the past control inputs that haven’t been executed by the control system due to time delay. We expect the DRL agent to use these delayed control inputs to solve for the corresponding system responses that would happen in the future and predict the next optimal control input .
For case 3, only acceleration command dynamics is considered with no acceleration delay . For this case, the time constant is s, which applies to gas-engine vehicles such as Volvo S60 . We consider this intermediate case for comparison purposes. According to Equation 6, the state vector is which includes the error, error rate, and the actual acceleration of the following vehicle.
|Discrete time step||0.1s|
|Nominal max error||10m|
|Max control input||2.6 m/s|
|Acceleration command dynamics time constant||0.5s|
|Preceding vehicle constant speed||30m/s|
|Following vehicle initial speed||27.5m/s|
|Initial gap-keeping error||2.5m|
For case 4, both acceleration command dynamics and delay are considered. For this case, the time constant is also s for gas-engine vehicles. The state vector is . Table I shows the parameter values for the car-following control system.
Iii-B Deep Deterministic Policy Gradient
The DRL algorithm that we use is DDPG, which is exactly the same as proposed in . Here we provide a brief description of the DDPG algorithm and we encourage the readers to read the original paper. The DDPG algorithm utilizes two deep neural networks: actor and critic networks. The actor network is for the state-action mapping policy where denotes the actor neural net weight parameters, and the critic network is for Q-value function (cumulative discounted reward) where denotes the critic neural net weight parameters. DDPG concurrently learns the policy and Q-value function. For learning the Q-value (Q-learning), the Bellman’s principle of optimality is followed to minimize the root-mean-squared loss using gradient descent. For learning the policy, gradient ascent is performed with respect to only the policy parameters to maximize the Q-value .
|Target network update coefficient||0.001|
|Reward discount factor||0.99|
|Actor learning rate||0.0001|
|Critic learning rate||0.001|
|Experience replay memory size||500000|
|Actor Gaussian noise mean||0|
Actor Gaussian noise standard deviation
. Batch normalization is used to accelerate learning by reducing internal covariant shift. Please see Table II for the DDPG algorithm parameter values.
Both the actor and critic networks are 2-layer neural nets for all cases. For training with vehicle kinematics (case 1) and just acceleration command dynamics (case 3), the neural nets have 64 neurons for each layer and the training time is 1 million time steps. For training with control delays (cases 2 and 4), the neural nets have 128 neurons for each layer and the training time is 1.5 million time steps. For all cases, the training converges. Fig.2 shows the undiscounted episode reward for case 1. The plots of the undiscounted episode rewards for all the other cases look similar to that for case 1, and are not shown here. We use the undiscounted episode reward since it allows us to track changes for the latter part of the car-following errors easily. Note that, with the discount factor, the last reward at 20 seconds (200 time steps) of one episode is discounted by .
In this section, the DRL results for the above mentioned 4 cases are presented. We also present the DP results which are the global optimal solutions for all cases for comparison purposes. We first present DRL and DP results for the car-following control with a point-mass kinematic model and the results of applying this kinematics-model-trained DRL controller to car-following control with vehicle dynamics, which are shown in Fig. 3. We then present the results of our proposed solution to deal with acceleration delay and acceleration command dynamics by adding the delayed control inputs and the current actual acceleration to the environment state, which are shown in Fig. 4. Note that, the acceleration command dynamics for all related cases is for gas-engine vehicles with time constant s.
In Fig. 3, when trained using a point-mass kinematic model, the DRL agent achieves a near-optimal solution as compared with DP results, see the blue solid and black dashed lines. When this DRL controller is applied to car-following control with just acceleration delay, the car-following performance is degraded to a small extent. The gap-keeping error is able to return to near-zero in the steady state, see column (a) of Fig. 3. When this DRL controller is applied to car-following control with just acceleration command dynamics, the car-following performance is degraded to a bigger extent as compared to the delay case. The gap-keeping error returns to near-zero in the steady state in a longer time, see column (b) of Fig. 3. When this DRL controller is applied to car-following control with both acceleration delay and command dynamics, the performance is the worst. Both the transient and steady-state performances are significantly degraded. The steady-state error does not return to zero and forms a wavy oscillation pattern with the maximum being 0.73m and the minimum being -0.22m.
The columns (a), (b), and (c) in Fig. 4 show the results for the redesigned DRL controllers trained with acceleration delay (case 2), acceleration command dynamics (case 3), and both acceleration delay and command dynamics (case 4), respectively. For all these cases, the DRL controllers achieve near-optimal solutions as compared to the DP ones. Note that the steady-state gap-keeping errors of the DP solutions in columns (b) and (c) are around 0.5 meters. This will be reduced if we use a smaller interval to create the evenly spaced samples of the states for DP, although it takes much longer time to run.
By solving a particular car-following control problem using DRL (deep reinforcement learning), we show that a DRL controller trained with a point-mass kinematic model could not be generalized to solve more realistic control situations with both vehicle acceleration delay and command dynamics. We added the control inputs that are delayed and have not been executed, and the actual acceleration of the vehicle, to the reinforcement learning environment state for DRL controller development with vehicle dynamics. The training results show that this approach provides near-optimal solutions for car-following control with vehicle dynamics.
When the reinforcement learning environment state is augmented with the delayed control inputs, the DRL agent is expected to utilize the delayed control inputs to predict the system behavior in the future and determine the next optimal control action. Our results show that the DRL agent is capable to do so after training, in a near-optimal manner. However, because the environment state is augmented with more variables, the neural network size needs to be increased and more training time is needed, which is the disadvantage. As stated in the introduction, an alternative method is to learn the underlying dynamic system separately and use the learned system to predict the system behavior in the future after the delay time so as to determine the current control action . However, this method may not be feasible for challenging autonomous driving control systems such as merging control because such systems are subject to many variations and disturbances due to multi-vehicle interactions. It may not be easy to develop or learn an accurate model for such systems.
Future work includes developing a more robust car-following DRL controller that can be trained with rich variations of the preceding vehicle’s speed. Another research direction is to develop DRL controllers with vehicle dynamics considered for more challenging autonomous driving scenarios such as freeway on-ramp merging.
The authors would like to thank Toyota, Ontario Centres of Excellence, and Natural Sciences and Engineering Research Council of Canada for the support of this work.
-  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with
double q-learning,” in
Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap, “Distributed distributional deterministic policy gradients,” arXiv preprint arXiv:1804.08617, 2018.
-  T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
-  C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer, et al., “Autonomous driving in urban environments: Boss and the urban challenge,” Journal of Field Robotics, vol. 25, no. 8, pp. 425–466, 2008.
-  A. Eskandarian, Handbook of intelligent vehicles. Springer London, 2012.
-  P. Wang and C.-Y. Chan, “Autonomous ramp merge maneuver based on reinforcement learning with continuous action space,” arXiv preprint arXiv:1803.09203, 2018.
-  D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, “Navigating occluded intersections with autonomous vehicles using deep reinforcement learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 2034–2039.
-  Z. Qiao, K. Muelling, J. M. Dolan, P. Palanisamy, and P. Mudalige, “Automatically generated curriculum based reinforcement learning for autonomous vehicles in urban environment,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1233–1238.
-  Z. Qiao, K. Muelling, J. Dolan, P. Palanisamy, and P. Mudalige, “POMDP and hierarchical options MDP with continuous actions for autonomous driving at intersections,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2377–2382.
-  C. Li and K. Czarnecki, “Urban driving with multi-objective deep reinforcement learning,” arXiv preprint arXiv:1811.08586, 2018.
-  H. Mirzaei and T. Givargis, “Fine-grained acceleration control for autonomous intersection management using deep reinforcement learning,” in 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation. IEEE, 2017, pp. 1–8.
-  P. Wang, C.-Y. Chan, and A. de La Fortelle, “A reinforcement learning based approach for automated lane change maneuvers,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1379–1384.
-  P. Wolf, K. Kurzer, T. Wingert, F. Kuhnt, and J. M. Zollner, “Adaptive behavior generation for autonomous driving using deep reinforcement learning with compact semantic states,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 993–1000.
-  S. Aradi, T. Becsi, and P. Gaspar, “Policy gradient based reinforcement learning approach for autonomous highway driving,” in 2018 IEEE Conference on Control Technology and Applications (CCTA). IEEE, 2018, pp. 670–675.
-  R. N. Jazar, Vehicle dynamics: theory and application. Springer, 2017.
-  M. Bucchel and A. Knoll, “Deep reinforcement learning for predictive longitudinal control of automated vehicles,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2391–2397.
-  S. Wei, Y. Zou, T. Zhang, X. Zhang, and W. Wang, “Design and experimental validation of a cooperative adaptive cruise control system based on supervised reinforcement learning,” Applied Sciences, vol. 8, no. 7, p. 1014, 2018.
-  C. Desjardins and B. Chaib-Draa, “Cooperative adaptive cruise control: A reinforcement learning approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, pp. 1248–1260, 2011.
-  D. Zhao, B. Wang, and D. Liu, “A supervised actor–critic approach for adaptive cruise control,” Soft Computing, vol. 17, no. 11, pp. 2089–2099, 2013.
-  M. Zhu, X. Wang, and Y. Wang, “Human-like autonomous car-following model with deep reinforcement learning,” Transportation Research Part C: Emerging Technologies, vol. 97, pp. 348–368, 2018.
-  E. Schuitema, M. Wisse, T. Ramakers, and P. Jonker, “The design of leo: a 2d bipedal walking robot for online autonomous reinforcement learning,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 3238–3243.
-  T. Hester and P. Stone, “Texplore: real-time sample-efficient reinforcement learning for robots,” Machine learning, vol. 90, no. 3, pp. 385–429, 2013.
-  T. J. Walsh, A. Nouri, L. Li, and M. L. Littman, “Learning and planning in environments with delayed feedback,” Autonomous Agents and Multi-Agent Systems, vol. 18, no. 1, p. 83, 2009.
-  E. Schuitema, L. Buşoniu, R. Babuška, and P. Jonker, “Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 3226–3231.
-  K. V. Katsikopoulos and S. E. Engelbrecht, “Markov decision processes with delays and asynchronous cost collection,” IEEE transactions on automatic control, vol. 48, no. 4, pp. 568–574, 2003.
-  D. S. Naidu, Optimal control systems. CRC press, 2002.
-  J. Ploeg, B. T. Scheepers, E. Van Nunen, N. Van de Wouw, and H. Nijmeijer, “Design and experimental evaluation of cooperative adaptive cruise control,” in 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, 2011, pp. 260–265.
-  K. Lidstrom, K. Sjoberg, U. Holmberg, J. Andersson, F. Bergh, M. Bjade, and S. Mak, “A modular CACC system integration and design,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 3, pp. 1050–1061, 2012.
-  J.-M. Engel and R. Babuška, “On-line reinforcement learning for nonlinear motion control: Quadratic and non-quadratic reward functions,” IFAC Proceedings Volumes, vol. 47, no. 3, pp. 7043–7048, 2014.
-  H. P. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver, “Learning values across many orders of magnitude,” in Advances in Neural Information Processing Systems, 2016, pp. 4287–4295.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.