I Introduction
Dynamic programming (DP) is a theoretically grounded and effective tool for solving discrete-time (DT) optimal learning problems with known dynamics [1]. The optimal value function for DT systems is obtained by solving the DT Hamilton-Jacobi-Bellman (HJB) equation, also known as the Bellman optimality equation, which develops backward in time [2]. However, due to the curse of dimensionality, running DP directly to obtain the optimal solution of the DT HJB equation is usually computationally untenable for complex nonlinear DT systems [3]. Adaptive dynamic programming (ADP) algorithms were first proposed by Werbos as a way to overcome this difficulty by finding an approximate solution of the DT HJB equation forward in time [4, 5]. ADP has several synonyms, including approximate DP [6], reinforcement learning (RL) [7], and neuro-DP [8]. ADP methods are usually implemented as an actor-critic architecture, which involves a critic parameterized function for value function approximation and an actor parameterized function for policy approximation [9, 10, 11, 12, 13]. Recently, deep neural networks (NNs) have been widely used as approximators of both the value function and the policy due to their strong fitting ability, and they achieve state-of-the-art performance on many control tasks [14, 7]. Most ADP methods adapt both the value and policy networks by iteratively solving the DT HJB equation [10]. Generalized policy iteration (GPI), which contains policy iteration (PI) and value iteration as special cases, is an important iterative framework widely used in ADP [7]. The GPI framework alternates between two procedures: 1) policy evaluation, which makes the value function consistent with the current policy, and 2) policy improvement, which improves the policy to reduce the corresponding value function.
Over the last few decades, many ADP methods for finding optimal control solutions for DT systems with known dynamics have emerged. Chen and Jagannathan (2008) proposed an ADP method to find nearly optimal state feedback control laws for affine nonlinear DT systems by iteratively solving the generalized HJB equation. The value function was approximated by a linear combination of artificially designed basis functions, while the policy was directly derived from the value function [15]. Both actor and critic NNs were utilized by Al-Tamimi et al. (2008) to develop a value-iteration-based algorithm for DT systems, and it was shown that the algorithm could converge to the optimal value function and policy as long as the control coefficient matrix was known [16]. Furthermore, Wei et al. (2016) established a new termination criterion to guarantee the effectiveness of the iterative policy NN for value iteration ADP algorithms [17]. Liu et al. (2015) proposed a GPI algorithm for DT nonlinear systems, in which the admissibility of the NN policy is guaranteed as long as the initial policy is admissible [18]. To relax the need for knowledge of the system dynamics, Dierks et al. (2009) introduced a model NN to learn the unknown system dynamics; ADP training was then undertaken using only the learned NN model [19]. In addition, model-free algorithms such as DQN, DDPG, A3C, and PPO have also been widely used to solve DT optimal learning problems [20, 21, 22, 23, 24].
It should be pointed out that most existing ADP techniques share a common shortcoming: they are not applicable to optimal learning problems with state constraints. This is because the gradient descent method is only suitable for solving unconstrained policy optimization problems. In practical applications, however, most controlled systems are subject to some state restrictions. Taking vehicle control in the path-tracking task as an example, in addition to tracking performance, certain state functions of the vehicle must be constrained to the stability zone to prevent vehicle instability [25]. Model predictive control (MPC) is a commonly used method that computes the control input online while satisfying a set of constraints [26]. However, compared with ADP, complex systems such as nonaffine nonlinear models remain a big challenge for MPC.
In this paper, a new ADP algorithm, called constrained deep ADP (CDADP), is developed to solve optimal learning problems of general nonlinear DT systems with nonaffine constrained inputs. Both the actor and critic are approximated by deep NNs, which map the system state to the action and the value function, respectively. No hand-crafted basis function is needed in this case. The proposed algorithm handles state constraints by transforming the policy improvement process into a constrained optimization problem. Meanwhile, a trust region constraint is added to allow large update steps without violating the monotonic improvement condition. We first linearize this constrained optimization problem locally into a quadratically constrained linear programming problem, and then obtain the optimal update of the policy NN parameters by solving its dual problem. We also propose a series of recovery rules to update the policy in case the primal problem is infeasible. In addition, parallel learners are employed to explore different regions of the state space, which stabilizes and accelerates learning.
The paper is organized as follows. In Section II, we provide the formulation of the DT optimal learning problem, followed by a general description of the GPI algorithm. Section III presents the CDADP algorithm. In Section IV, we present a simulation example that shows the generality and effectiveness of the CDADP algorithm for DT systems. Section V concludes this paper.
II Mathematical Preliminaries
II-A DT HJB Equation
Consider the general time-invariant dynamical system
(1) 
with state , control input , and . We assume that is Lipschitz continuous on a compact set that contains the origin, and that the system is stabilizable on , i.e., there exists a continuous policy , where , such that the system is asymptotically stable on . The system dynamics are assumed to be known; they can be nonlinear and nonaffine analytic functions, NNs, or even a Matlab/Simulink model (only if is known). Moreover, the system input can be either constrained or unconstrained. Given the policy , define its associated infinite-horizon value function
(2) 
where is the utility function, and is the discount factor.
The optimal learning problem can now be formulated as finding a policy such that the value function in Eq. (2) associated with system (1) is minimized for all . The minimized value function defined by
(4) 
satisfies the DT HJB equation or Bellman optimality equation (BOE)
(5) 
Meanwhile, the optimal control can be derived as
(6) 
In order to find the optimal control solution for the problem, one only needs to solve Eq. (5) for the value function and then substitute the solution into Eq. (6) to obtain the optimal control. However, due to the nonlinear nature of the DT HJB equation, finding its solution is generally difficult or impossible.
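For reference, the standard forms of the quantities discussed in this subsection can be sketched as follows. The notation is an assumption made for illustration: $x_t$ is the state, $u_t$ the control input, $f$ the dynamics, $l$ the utility function, $\gamma$ the discount factor, and $\pi$ the policy. The lines give, in order, the dynamics, the infinite-horizon value function of a fixed policy, its self-consistency (Bellman) condition, the Bellman optimality equation, and the optimal control.

```latex
% Assumed notation: x_t state, u_t control, f dynamics, l utility,
% \gamma discount factor, \pi policy. Standard textbook forms only.
\begin{align}
x_{t+1} &= f(x_t, u_t), \\
V^{\pi}(x_t) &= \sum_{i=0}^{\infty} \gamma^{i}\, l\bigl(x_{t+i}, \pi(x_{t+i})\bigr), \\
V^{\pi}(x_t) &= l\bigl(x_t, \pi(x_t)\bigr) + \gamma V^{\pi}(x_{t+1}), \\
V^{*}(x_t) &= \min_{u_t}\Bigl[\, l(x_t, u_t) + \gamma V^{*}\bigl(f(x_t, u_t)\bigr) \Bigr], \\
\pi^{*}(x_t) &= \arg\min_{u_t}\Bigl[\, l(x_t, u_t) + \gamma V^{*}\bigl(f(x_t, u_t)\bigr) \Bigr].
\end{align}
```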
II-B Generalized Policy Iteration
The proposed algorithm for DT optimal learning problems is motivated by GPI techniques [7]. GPI is an iterative method widely used in model-based and model-free ADP (or RL) problems to find an approximate solution of the BOE. GPI usually employs an actor-critic (AC) architecture to approximate both the policy and the value function. In our work, both the value function and the policy are approximated by deep NNs, called respectively the value network (or critic network) and the policy network (or actor network) , where and are network parameters. These two networks directly map the raw system states to the approximated value function and control, respectively; in this case, no hand-crafted basis function is needed.
GPI involves two interacting processes: 1) policy evaluation, which drives the estimated value function towards the true value function of the current policy based on Eq. (3), and 2) policy improvement, which improves the policy with respect to the current estimated value function based on Eq. (6). Defining the accumulated future cost of state as , the policy evaluation process of GPI proceeds by iteratively minimizing the following loss function:
(7) 
where is usually called the temporal difference (TD) error. Therefore, the update gradient for the value network is:
(8) 
In the policy improvement step, the parameters of policy network are updated to minimize the objective function
(9) 
Denoting as , as , the update gradient for policy network is:
(10)  
where
Any off-the-shelf NN optimization method can be used to update these two NNs, including stochastic gradient descent (SGD), RMSProp, and Adam [27]. Taking the SGD method as an example, the updating rules of the value network and the policy network in the th iteration are:
(11)  
where and denote the learning rate of value and policy network, respectively.
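The two alternating SGD updates can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the implementation used in this paper: the scalar linear system `x' = A*x + B*u`, the quadratic utility `x^2 + u^2`, the one-parameter critic `V(x) = w*x^2`, the one-parameter actor `u = theta*x`, the fixed state batch, and all function names are hypothetical choices made so that the result can be checked against the discounted Riccati solution. A one-step TD target is used for brevity.

```python
# Minimal GPI sketch on a scalar linear system (all names and values are illustrative).
A, B, GAMMA = 0.9, 0.5, 0.95                   # assumed dynamics x' = A*x + B*u, discount
STATES = [k / 10.0 for k in range(-10, 11)]    # fixed batch of sampled states


def gpi(iterations=2000, lr_value=0.5, lr_policy=0.2):
    """Alternate one policy-evaluation step and one policy-improvement step."""
    w, theta = 0.0, 0.0                        # critic V(x) = w*x^2, actor u = theta*x
    n = len(STATES)
    for _ in range(iterations):
        # 1) Policy evaluation: semi-gradient step on the mean squared TD error.
        grad_w = 0.0
        for x in STATES:
            u = theta * x
            x_next = A * x + B * u
            target = x * x + u * u + GAMMA * w * x_next * x_next
            td_error = target - w * x * x
            grad_w += -td_error * x * x / n    # d(0.5*td^2)/dw with the target held fixed
        w -= lr_value * grad_w
        # 2) Policy improvement: gradient step on l(x, u) + gamma * V(x').
        grad_theta = 0.0
        for x in STATES:
            u = theta * x
            x_next = A * x + B * u
            # chain rule: (dl/du + gamma * dV/dx' * dx'/du) * du/dtheta
            grad_theta += (2.0 * u + GAMMA * 2.0 * w * x_next * B) * x / n
        theta -= lr_policy * grad_theta
    return w, theta


def riccati(iterations=1000):
    """Reference solution from the discounted scalar Riccati recursion."""
    p = 0.0
    for _ in range(iterations):
        p = 1.0 + GAMMA * A * A * p / (1.0 + GAMMA * B * B * p)
    k = -GAMMA * A * B * p / (1.0 + GAMMA * B * B * p)
    return p, k
```

Calling `gpi()` and `riccati()` should return nearly identical pairs `(w, theta)` and `(p, k)`, which shows that the interleaved value and policy updates settle at the solution of the Bellman optimality equation for this toy problem.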
III Constrained GPI
III-A Constrained Policy Improvement
One drawback of the policy update rule mentioned above is that it is not suitable for optimal learning problems with state constraints. In practical applications, however, most controlled systems are subject to some state restrictions. Although the state constraints can be added to the objective function as a penalty, it is often difficult to choose hyperparameter values that balance the constraint requirements against the control objectives. Moreover, a penalty alone cannot ensure that the policy satisfies the constraints. Therefore, in this paper, the state constraints of future steps are introduced to transform the policy improvement process into a constrained optimization problem. Assuming there are state constraints, the th state constraint can be formulated as:
(12) 
where is the th state constraint bounded above by boundary .
In addition, inspired by trust region policy optimization (TRPO) [28], we also add a policy constraint to avoid excessive policy updates, so that larger update steps can be taken in a robust way without violating the monotonic improvement guarantee. This is because the monotonic improvement condition of can only be guaranteed when the policy change is not very large, since is an estimate. We define the following function
(13) 
to measure the difference between the new policy and the old policy. Then, the policy constraint is described as:
(14) 
where is the corresponding step size bound. The policy constraint is also called the trust region (TR) constraint.
Therefore, the policy improvement process can be formulated as the following constrained optimization problem:
(15)  
where is the number of state constraints. A proof of convergence for constrained PI in a tabular setting is presented in Appendix A. The convergence results can be extended to the GPI framework and deep NN models according to [7].
III-B Approximate Solution
For policies with high-dimensional parameter spaces like NNs, directly solving problem (15) may be impractical because of the computational cost and the nonlinear characteristics of NNs. However, for small step sizes , the objective function of problem (15) and the state functions in the th iteration can be well approximated by linearizing around the current policy using a Taylor expansion. Denoting , it follows that:
where
In addition, and its gradient are both zero at ; therefore, the trust region constraint is well approximated by its second-order Taylor expansion:
where is the Hessian of . Since is always no less than , is always positive semidefinite, and we will assume it to be positive definite in the following.
Denoting , , and . With and , the approximation to problem (15) is:
(16)  
Denoting the optimal solution of problem (16) as , the new updating rule for policy improvement process is
(17) 
Although (16) is a convex constrained optimization problem, solving it directly takes considerable computation time and resources because the number of variables is very large (usually over ten thousand). Since is assumed to be positive definite, this problem can also be solved using the dual method when it is feasible. Assuming that problem (16) is feasible, its Lagrangian can be expressed as
(18) 
Then the dual to (16) is:
(19) 
The gradient of w.r.t. parameters is:
(20) 
When , we have
(21) 
Substituting Eq. (21) into (19), the dual to (16) can be expressed as:
(22) 
where , , . Let
(23) 
then we can rewrite Eq. (22) with
(24) 
Problem (24) is a bound-constrained convex optimization problem with only variables, which equals the number of constraints in problem (15) and is much smaller than its number of variables. Therefore, compared with problem (15), problem (24) can be solved more easily and efficiently using off-the-shelf algorithms such as L-BFGS-B or the truncated Newton method [29, 30]. If , are the optimal solutions to the dual problem, can be updated as
(25) 
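The dual route described in this subsection can be sketched numerically. The code below is an illustration under stated assumptions rather than the solver used in this paper: it takes the linearized problem in the generic form "minimize `g.d` subject to `0.5*d'Hd <= delta` and `B d + c <= 0`", where `d` is the parameter update, and all symbol and function names are hypothetical. The Hessian is taken to be diagonal to keep the sketch short, the trust-region multiplier is eliminated analytically, and plain projected gradient ascent stands in for L-BFGS-B. The sketch also assumes that the problem is feasible and that the trust region constraint is active at the optimum.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve_trust_region_dual(g, h_diag, B, c, delta, iters=5000, step=0.05):
    """Solve  min g.d  s.t.  0.5*d'Hd <= delta,  B d + c <= 0  through its dual.

    H is assumed diagonal (h_diag) purely to keep the sketch short.
    Returns (lam, nu, d): trust-region multiplier, constraint multipliers, update.
    """
    m = len(c)
    hinv_g = [gi / hi for gi, hi in zip(g, h_diag)]
    hinv_b = [[bi / hi for bi, hi in zip(row, h_diag)] for row in B]
    q = dot(g, hinv_g)                                   # g' H^-1 g
    r = [dot(row, hinv_g) for row in B]                  # B H^-1 g
    S = [[dot(B[i], hinv_b[j]) for j in range(m)] for i in range(m)]  # B H^-1 B'

    def quad(nu):
        s_nu = [dot(S[i], nu) for i in range(m)]
        a = q + 2.0 * dot(r, nu) + dot(nu, s_nu)         # (g + B'nu)' H^-1 (g + B'nu)
        return max(a, 1e-12), s_nu

    nu = [0.0] * m
    for _ in range(iters):
        a, s_nu = quad(nu)
        # With lam eliminated, D(nu) = -sqrt(2*delta*a) + nu.c;
        # take an ascent step and project onto nu >= 0.
        scale = math.sqrt(2.0 * delta / a)
        grad = [c[i] - scale * (r[i] + s_nu[i]) for i in range(m)]
        nu = [max(0.0, nu[i] + step * grad[i]) for i in range(m)]
    a, _ = quad(nu)
    lam = math.sqrt(a / (2.0 * delta))
    d = [-(hinv_g[k] + sum(nu[j] * hinv_b[j][k] for j in range(m))) / lam
         for k in range(len(g))]
    return lam, nu, d
```

For example, with `g = [1, 0]`, an identity Hessian, the single constraint `-d1 - d2 <= 0`, and `delta = 0.5`, both the trust region and the state constraint are active at the optimum, and the recovered update lies on the unit circle along the constraint boundary. The dual has one variable per constraint (plus the eliminated trust-region multiplier), regardless of how many policy parameters there are.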
III-C Feasibility
Due to approximation errors, the optimal solution to problem (16) in the th iteration may be a bad update, producing a new policy that fails to satisfy the state constraints. This may cause problem (16) in the th iteration to be infeasible. In other words, the feasible region of (16) would be empty, i.e., , where , and . Hence, before solving the dual problem (24), we construct the following optimization problem to determine whether the feasible region is empty:
(26)  
Denoting the optimal solution to problem (26) as , the minimum trust region boundary that makes problem (16) feasible is , and it is clear that
(27) 
The value can be efficiently obtained by solving the following dual problem:
(28) 
If is the optimal solution of problem (28), then .
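The feasibility check can be sketched in the same spirit. The code below assumes that the auxiliary problem has the generic form "minimize `0.5*d'Hd` subject to `B d + c <= 0`", whose dual is the bound-constrained concave quadratic "maximize `-0.5*nu'S nu + nu.c` over `nu >= 0`" with `S = B H^-1 B'`. The names, the diagonal Hessian, and the use of projected gradient ascent are illustrative assumptions; the step size must stay below 2 divided by the largest eigenvalue of `S` for the iteration to converge.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def min_trust_region(h_diag, B, c, iters=2000, step=0.2):
    """Smallest delta for which {d : 0.5*d'Hd <= delta, B d + c <= 0} is non-empty.

    Maximizes the dual  -0.5*nu'S nu + nu.c  over nu >= 0 by projected gradient
    ascent; H is assumed diagonal (h_diag) to keep the sketch short.
    """
    m = len(c)
    hinv_b = [[bi / hi for bi, hi in zip(row, h_diag)] for row in B]
    S = [[dot(B[i], hinv_b[j]) for j in range(m)] for i in range(m)]  # B H^-1 B'
    nu = [0.0] * m
    for _ in range(iters):
        grad = [c[i] - dot(S[i], nu) for i in range(m)]
        nu = [max(0.0, nu[i] + step * grad[i]) for i in range(m)]
    s_nu = [dot(S[i], nu) for i in range(m)]
    delta_min = -0.5 * dot(nu, s_nu) + dot(nu, c)
    return delta_min, nu


def is_feasible(delta, delta_min):
    """The linearized problem is feasible when the trust region reaches the constraint set."""
    return delta_min <= delta
```

With an identity Hessian and the two constraints `d1 + d2 >= 1` and `d1 <= 2`, only the first constraint is violated at `d = 0`, and the closest admissible update is `(0.5, 0.5)`, so the minimum trust region boundary is 0.25. A trust region of 0.1 is then infeasible, while 0.5 is feasible.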
It is known from Eq. (27) that the magnitude of the value directly affects the feasibility of the problem (16). Denoting the expected TR boundary as , if , we can directly solve problem (24) with . For the infeasible case, i.e., , a recovery method is needed to calculate a reasonable policy update. By introducing the recovery TR boundary which is slightly greater than , we propose two recovery rules according to the value of : 1) If , we solve problem (24) with for and ; 2) If , we recover the policy by adding the state constraints as a penalty to the original objective function:
(29)  
where is the hyperparameter that trades off the importance between the original objective function and the penalty term, is the weight of the th state constraint that is calculated by:
(30) 
where if , otherwise, which penalizes violations of the th state constraint. Defining , the dual to (29) can be expressed as:
(31) 
where . In this case, we can easily find the policy recovery rule:
(32) 
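The closed-form character of this recovery step can be seen from a short derivation. The sketch assumes that the penalized problem retains only the trust region constraint, and it uses assumed symbols: $\hat{g}$ for the gradient of the penalized objective, $H$ for the trust-region Hessian (positive definite), $\delta$ for the trust region boundary in use, and $\Delta$ for the parameter update, with $\hat{g} \neq 0$.

```latex
% Assumed symbols: \hat g (penalized gradient), H (positive definite Hessian),
% \delta (trust region boundary in use), \Delta (parameter update).
\begin{align}
&\min_{\Delta}\ \hat{g}^{\top}\Delta
  \quad \text{s.t.} \quad \tfrac{1}{2}\,\Delta^{\top} H \Delta \le \delta, \\
&L(\Delta,\lambda) = \hat{g}^{\top}\Delta
  + \lambda\bigl(\tfrac{1}{2}\,\Delta^{\top} H \Delta - \delta\bigr),
  \qquad \lambda \ge 0, \\
&\nabla_{\Delta} L = 0 \ \Longrightarrow\ \Delta = -\tfrac{1}{\lambda}\, H^{-1}\hat{g}, \\
&D(\lambda) = -\frac{\hat{g}^{\top} H^{-1} \hat{g}}{2\lambda} - \lambda\delta
  \ \Longrightarrow\
  \lambda^{*} = \sqrt{\frac{\hat{g}^{\top} H^{-1} \hat{g}}{2\delta}}, \\
&\Delta^{*} = -\sqrt{\frac{2\delta}{\hat{g}^{\top} H^{-1} \hat{g}}}\; H^{-1}\hat{g}.
\end{align}
```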
Inspired by the ideas used in multi-threaded variants of deep RL, we use multiple parallel agents to explore different regions of the state space, thereby removing correlations in the training set and stabilizing the learning process [22]. During learning, we apply value function and policy updates on the state set that contains the current observed states of these parallel agents. All the state constraints of these parallel agents are stored in the constraints buffer. Due to the computational burden of estimating the matrices , and solving problem (24), the speed of the policy optimization process decreases as the number of state constraints increases. Therefore, in each iteration, we only consider state constraints randomly selected from the constraints buffer. The diagram and pseudocode of CDADP are shown in Fig. 1 and Algorithm 1.
IV Simulation
To evaluate the performance of our CDADP algorithm, we choose vehicle lateral and longitudinal control in the path-tracking task as an example. It is a nonaffine nonlinear system control problem with state constraints [31].
The expected trajectory is a circle with the radius , and the control objective is to maximize the vehicle speed, while maintaining small tracking error and ensuring that the vehicle state stays within the stability region. The system states and control inputs of this problem are listed in Table I, and the vehicle parameters are listed in Table II. Note that the system frequency used for simulation is different from the sampling frequency . The vehicle is controlled by a saturating actuator, where and . The vehicle dynamics are:
(33) 
where and are the lateral tire forces of the front and rear tires respectively [32]. The lateral tire forces are usually approximated according to the Fiala tire model:
where is the tire slip angle, is the tire load, is the lateral friction coefficient, and the subscript represents the front or rear tires. The slip angles can be calculated from the geometric relationship between the front/rear axle and the center of gravity (CG):
The notation represents the tire slip angle at which fully-sliding behavior occurs, calculated as:
Assuming that the rolling resistance is negligible, the lateral friction coefficient of the front/rear wheel is:
where and are the longitudinal tire forces of the front and rear tires respectively, calculated as:
(34) 
The loads on the front and rear tires can be approximated by:
Table I: System states and control inputs
state  Lateral velocity  [m/s]
state  Yaw rate at center of gravity (CG)  [rad/s]
state  Longitudinal velocity  [m/s]
state  Yaw angle between vehicle & trajectory  [rad]
state  Distance between CG & trajectory  [m]
input  Front wheel angle  [rad]
input  Longitudinal acceleration  [m/s²]
Table II: Vehicle parameters
Front wheel cornering stiffness  88000 [N/rad]
Rear wheel cornering stiffness  94000 [N/rad]
Distance from CG to front axle  1.14 [m]
Distance from CG to rear axle  1.40 [m]
Mass  1500 [kg]
Polar moment of inertia at CG  2420 [kg·m²]
Tire-road friction coefficient  1.0
Sampling frequency  40 [Hz]
System frequency  200 [Hz]
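The Fiala tire model described above can be sketched as follows, using the parameters of Table II. This is the commonly used brush-type Fiala form and may differ in detail from the expressions used in this paper. The function and constant names, the gravitational acceleration, and the static front-axle load (which ignores load transfer under acceleration) are assumptions made for illustration. The friction coefficient is passed in as a given constant here, whereas the text derives the lateral friction coefficient from the longitudinal tire force.

```python
import math

G = 9.81  # gravitational acceleration [m/s^2] (assumed; not listed in Table II)


def fiala_lateral_force(alpha, cornering_stiffness, mu, load):
    """Lateral tire force [N] for slip angle alpha [rad] (standard Fiala brush form)."""
    # Slip angle beyond which the whole contact patch slides.
    alpha_max = math.atan(3.0 * mu * load / cornering_stiffness)
    if abs(alpha) > alpha_max:
        return -mu * load * math.copysign(1.0, alpha)    # fully sliding: saturated force
    t = math.tan(alpha)
    c = cornering_stiffness
    return (-c * t
            + c ** 2 / (3.0 * mu * load) * abs(t) * t
            - c ** 3 / (27.0 * (mu * load) ** 2) * t ** 3)


# Table II values; the static front-axle load m*g*l_r/(l_f + l_r) ignores load transfer.
C_F, L_F, L_R, MASS, MU = 88000.0, 1.14, 1.40, 1500.0, 1.0
FZ_FRONT = MASS * G * L_R / (L_F + L_R)
```

In this form the force is continuous at the full-sliding slip angle, where the polynomial branch reaches the friction limit, and for small slip angles it reduces to the linear tire model with slope equal to the negative cornering stiffness.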
To ensure vehicle stability, the yaw rate at the CG, and the slip angles should be subject to the following constraints:
where .
The utility function is:
Therefore, the policy optimization problem of this example can be formulated as:
(35)  
where , and are the state constraint functions of state bounded above by , and respectively. It is clear that the form of this problem is the same as that of problem (15), which means that we can train the vehicle control policy using the proposed CDADP algorithm.
In this paper, the value function and policy are represented by 5-layer fully-connected NNs, which have the same architecture except for the output layers. For each network, the input layer is composed of the states, followed by 5 hidden layers using exponential linear units (ELUs) as activation functions with units per layer. The output of the value network is a linear unit, while the output layer of the policy network is set as a layer with two units, multiplied by the matrix to handle bounded controls. We use the Adam method to update the value network with the learning rate of . Other hyperparameters of this problem are shown in Table III.
Table III: Hyperparameters
agent number  256  
prediction step size  N  30 
state constraints number  M  10 
discount factor  0.98  
TR boundary  
recovery TR boundary  
penalty factor  0.8 
We compare the CDADP algorithm with three other variants of the ADP algorithm, namely GPI, TRADP (i.e., GPI with the trust region constraint), and penalty TRADP (PTRADP, i.e., updating the policy network by directly solving problem (29) with respectively). Note that TRADP can be considered a special case of PTRADP, in which . Fig. 2 shows the mean and confidence interval of policy performance for 20 different training runs. The policy performance at each iteration is measured by the undiscounted accumulated cost over 400 steps (10 s) of simulation starting from a randomly initialized state. As shown in the figure, the convergence speed of CDADP is among the fastest, while that of GPI is the slowest. According to Schulman et al. [28], the monotonic improvement condition of can only be guaranteed when the policy change is not very large, since is an estimate. Therefore, the policy learning rate of the traditional GPI algorithm is usually very small ( in this paper), which leads to slow learning. On the other hand, the TR constraint allows the policy to take larger adaptive update steps in a robust way without violating the monotonic improvement guarantee. This is why the algorithms with the TR constraint learn faster and more stably than GPI. In addition, CDADP also considers state constraints, which effectively reduce the state space to be explored, thereby further improving the learning speed.
In addition to the training speed, we also compare the algorithms in terms of constraint compliance and policy performance. Fig. 3 shows boxplots of the relevant indicators over 20 different runs obtained during simulation for each algorithm. In particular, Fig. 3(a) shows the accumulated cost over 500 steps obtained with the learned policy, and Figs. 3(b), 3(c), and 3(d) show the differences between the maximum values of , , and their bounds during simulation, respectively. As shown in these figures, for all algorithms except CDADP, the cumulative cost increases as the penalty factor increases, while the difference between the state function and its bound shows the opposite trend. When , the policy performance becomes very poor because the constraint term accounts for too much of the objective function in problem (29). Even so, for the PTRADP algorithm with , cases in which the controlled system violates the constraints still exist. In fact, only CDADP meets the three state constraints in all runs. Although the accumulated cost of CDADP is slightly higher than that of the other algorithms except PTRADP with , the performance of these methods is obtained at the expense of violating constraints. Fig. 4 shows the evolution of some vehicle states controlled by one of the trained CDADP policies. The learned policy can make the vehicle track the circle with higher speed and smaller tracking error without violating the constraints (, , ).
V Conclusion
This paper proposes a constrained deep adaptive dynamic programming (CDADP) algorithm to solve general nonlinear nonaffine discrete-time learning problems with known dynamics. Unlike previous ADP algorithms, it can deal with problems with state constraints. Both the policy and value function are approximated by deep neural networks (NNs), which directly map the system state to the action and the value function, respectively, without the need for hand-crafted basis functions. We transform the policy improvement process into a constrained optimization problem to account for the state constraints. Meanwhile, a trust region constraint is added to allow large update steps without violating the monotonic improvement condition. To solve this problem, we first linearize it locally into a quadratically constrained linear programming problem, and then determine its feasibility by calculating the minimum trust region boundary. For the feasible case, the optimal update of the policy NN parameters is obtained by solving its dual problem. We also propose a series of recovery rules to update the policy in case the primal problem is infeasible. In addition, parallel learners are employed to explore different regions of the state space, which stabilizes and accelerates learning. We apply our algorithm and five other baseline algorithms to the vehicle control problem in the path-tracking task, which is a nonlinear nonaffine optimal learning problem with multiple state constraints. The results show that CDADP has the fastest learning speed and is the only algorithm that can satisfy all the state constraints during simulation.
Appendix A Proofs of Constrained Policy Iteration
In a version of constrained PI for a tabular setting, we maintain two tables to estimate policy and the corresponding value function respectively. In the policy evaluation step of constrained PI, we want to compute the value according to Bellman equation (3). For a fixed policy , the corresponding value function can be computed iteratively, starting from any initialization function and repeatedly applying the Bellman update rule given by:
(36) 
where . The convergence of policy evaluation shown below has been proved and described in previous studies [7].
Lemma 1.
(Policy Evaluation). For a fixed policy , consider the Bellman backup rule in Eq. (36) and a mapping , then the sequence will converge to the value function as .
Proof. Define . According to Eq. (36), it follows that
Similarly, we have
Therefore, , which also means that satisfies the Bellman equation (3). So, the sequence converges to the value function as .
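The iteration in Lemma 1 can be illustrated with a small tabular example. This is a sketch under stated assumptions: the deterministic three-state chain, the costs, the discount factor, and the function name are hypothetical, and the fixed policy is encoded directly by the successor table `next_state`.

```python
def policy_evaluation(next_state, cost, gamma, sweeps=200):
    """Repeated Bellman backups V <- l + gamma * V(next) for a fixed policy."""
    values = [0.0] * len(cost)                 # any initialization works
    for _ in range(sweeps):
        # Synchronous backup: every state is updated from the previous table.
        values = [cost[s] + gamma * values[next_state[s]] for s in range(len(cost))]
    return values


# Hypothetical chain under a fixed policy: 0 -> 1 -> 2 -> 2 (state 2 loops on itself).
NEXT_STATE = [1, 2, 2]
COST = [1.0, 2.0, 1.0]
GAMMA = 0.5
```

For this chain the fixed point can be computed by hand: the self-loop gives `V(2) = 1 / (1 - 0.5) = 2`, then `V(1) = 2 + 0.5 * 2 = 3` and `V(0) = 1 + 0.5 * 3 = 2.5`. The iterates approach these values geometrically at rate `GAMMA`, regardless of the initial table.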
Define . In the constrained policy improvement step, we update the policy by solving the following constrained optimization problem:
(37)  
For this projection, we can show that the new policy has a lower value function than the old policy . The proof borrows heavily from the policy improvement theorem of Q-learning and soft Q-learning [7, 33, 34, 35].
Definition 1.
Lemma 2.
Proof. Because and , problem (37) is feasible for . Then we can show that
where and for . So, decreases monotonically.
Theorem 1.
(Constrained Policy Iteration). Through repeated application of policy evaluation and constrained policy improvement, any policy will converge to the optimal policy , such that for all and .
Proof. Let be the policy at iteration . We can find follows from Lemma 1. By Lemma 2, the sequence is monotonically decreasing. Since is bounded below for all and (the utility function is also bounded), and will converge to some and . At convergence, it must follow that for all and . Using the same iterative argument as in Eq. (A) of Lemma 2, it is clear that for all and . Hence is optimal in .
Acknowledgment
We would like to acknowledge Yang Zheng, Yiwen Liao, Guofa Li and Ziyu Lin for their valuable suggestions.
References
 [1] D. P. Bertsekas, Dynamic programming and optimal control, vol. 1. Athena Scientific, Belmont, MA, 2005.
 [2] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal control. John Wiley & Sons, 2012.
 [3] F.-Y. Wang, H. Zhang, and D. Liu, “Adaptive dynamic programming: An introduction,” IEEE Computational Intelligence Magazine, vol. 4, no. 2, 2009.
 [4] P. Werbos, “Beyond regression: New tools for prediction and analysis in the behavioral sciences,” Ph.D. dissertation, Harvard University, 1974.
 [5] P. Werbos, “Approximate dynamic programming for real-time control and neural modelling,” Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, pp. 493–525, 1992.
 [6] W. B. Powell, Approximate Dynamic Programming: Solving the curses of dimensionality, vol. 703. John Wiley & Sons, 2007.
 [7] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 2018.
 [8] D. P. Bertsekas and J. N. Tsitsiklis, “Neuro-dynamic programming: an overview,” in Proc. of the 34th IEEE Conf. on Decision and Control, vol. 1, pp. 560–564, IEEE Publ. Piscataway, NJ, 1995.
 [9] J. Duan, S. E. Li, Z. Liu, M. Bujarbaruah, and B. Cheng, “Generalized policy iteration for optimal control in continuous time,” arXiv preprint arXiv:1909.05402, 2019.
 [10] D. Liu, Q. Wei, D. Wang, X. Yang, and H. Li, Adaptive dynamic programming with applications in optimal control. Springer, 2017.
 [11] K. G. Vamvoudakis and F. L. Lewis, “Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878–888, 2010.
 [12] L. Dong, X. Zhong, C. Sun, and H. He, “Event-triggered adaptive dynamic programming for continuous-time systems with control constraints,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 8, pp. 1941–1952, 2016.
 [13] J. Li, H. Modares, T. Chai, F. L. Lewis, and L. Xie, “Off-policy reinforcement learning for synchronization in multi-agent graphical games,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2434–2445, 2017.

 [14] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
 [15] Z. Chen and S. Jagannathan, “Generalized Hamilton–Jacobi–Bellman formulation-based neural network control of affine nonlinear discrete-time systems,” IEEE Transactions on Neural Networks, vol. 19, no. 1, pp. 90–106, 2008.
 [16] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 943–949, 2008.
 [17] Q. Wei, D. Liu, and H. Lin, “Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems,” IEEE Transactions on Cybernetics, vol. 46, no. 3, pp. 840–853, 2016.
 [18] D. Liu, Q. Wei, and P. Yan, “Generalized policy iteration adaptive dynamic programming for discrete-time nonlinear systems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 12, pp. 1577–1591, 2015.
 [19] T. Dierks, B. T. Thumati, and S. Jagannathan, “Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence,” Neural Networks, vol. 22, no. 5-6, pp. 851–860, 2009.
 [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [21] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [22] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in Int. Conf. on Machine Learning, pp. 1928–1937, 2016.
 [23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox