1 Introduction
Q-learning is one of the most popular reinforcement learning methods that seek efficient control policies without knowledge of an explicit system model. The key idea of Q-learning is to combine dynamic programming and stochastic approximation so as to estimate the optimal state-action value function, also called the Q-function, by using trajectory samples. For discrete-time Markov decision processes, Q-learning has been extensively studied (see [1, 2] and the references therein), while the literature on continuous-time Q-learning is sparse. In discrete time, the Bellman equation for Q-functions can be defined via dynamic programming in a straightforward manner. However, the corresponding Bellman equation for continuous-time Q-functions has not yet been fully characterized despite some prior attempts. A variant of the Q-function is used in [3, 4], which has a different meaning from the Q-function in reinforcement learning. In other work, a Q-function similar to that of reinforcement learning was introduced, but with function-valued control input [5] or by heavily utilizing the linear time-invariant (LTI) system structure [6]. A similar model-free approach for LTI systems has also been studied [7, 8], although the Q-function is not specifically defined there. A continuous-time Q-function was also considered to prove the convergence of stochastic approximation [9].
In this paper, we consider continuous-time deterministic optimal control problems with Lipschitz continuous controls. We show that the associated Q-function corresponds to the unique viscosity solution of a Hamilton–Jacobi–Bellman (HJB) equation in a particular form. In the viscosity solution framework, even when it is not differentiable, the Q-function can be used to verify the optimality of a given control and to design an optimal control strategy. Then, we use the proposed HJB equation to derive an integral equation that the optimal Q-function and optimal trajectory should satisfy. Based on this equation, we propose a Q-learning algorithm for continuous-time dynamical systems. For high-dimensional state and control spaces, we also propose a DQN-like algorithm by using deep neural networks (DNNs) as a function approximator. This opens a new avenue of research that connects viscosity solution theory for HJB equations and the reinforcement learning domain. The performance of the proposed Q-learning algorithm is tested through a set of numerical experiments with 1-, 10- and 20-dimensional systems.
2 Continuous-Time Q-Functions and HJB Equations
Consider a controlled dynamical system of the form
where is the system state and is the control input. Let be the set of admissible controls. The standard finite-horizon optimal control problem can be formulated as
with , where and are running and terminal cost functions of interest, respectively, and is a subset of . The Q-function of (2.2) is defined by
which represents the minimal cost incurred from time to when starting from with . In particular, when , the Q-function reduces to the standard optimal value function , defined by
Suppose that . Then, the Q-function (2.3) corresponds to for each , i.e.,
Fix . Let be an arbitrary positive constant. Then, there exists such that
where satisfies (2.1) with in the Carathéodory sense: . We now construct a new control as if ; if . Such a modification of controls at a single point does not affect the trajectory or the total cost. Therefore, we have
Since was arbitrary, we conclude that for any . ∎
Thus, if is chosen to be , the Q-function carries no information beyond the optimal value function. Motivated by this observation, we restrict the controls to Lipschitz continuous functions. Since any Lipschitz continuous function is differentiable almost everywhere, we define the set of admissible controls as
Then, for any , there exists a unique measurable function with such that the following ODE holds a.e.: , . By using the dynamic programming principle, we can deduce that
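Since the displayed symbols did not survive extraction, the following minimal sketch uses generic names: `x` for the state, `u` for the control, and a rate input `a` clipped to a Lipschitz bound `L`. It illustrates how a trajectory with a Lipschitz continuous control can be generated by Euler integration; the vector field `f`, the rate policy `a_fn`, and the step sizes are all illustrative assumptions, not the paper's notation.

```python
import numpy as np

def simulate(f, x0, u0, a_fn, L=1.0, T=1.0, dt=1e-3):
    """Euler integration of x' = f(x, u), where the control u is made
    Lipschitz continuous in time by evolving it as u' = a with |a| <= L.
    f, a_fn, L, T, dt are illustrative names, not the paper's notation."""
    x, u = np.asarray(x0, float), np.asarray(u0, float)
    xs, us = [x.copy()], [u.copy()]
    for _ in range(int(round(T / dt))):
        a = np.clip(a_fn(x, u), -L, L)  # enforce the Lipschitz bound on the control
        x = x + dt * f(x, u)            # Euler step for the state
        u = u + dt * a                  # Euler step for the control
        xs.append(x.copy())
        us.append(u.copy())
    return np.array(xs), np.array(us)
```

By construction, any control produced this way satisfies |u(t + dt) - u(t)| <= L * dt, so its piecewise-linear interpolation is Lipschitz with constant L.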
Suppose for a moment that the Q-function is continuously differentiable. Then, a Taylor expansion in (2.5) yields
Letting the time increment tend to zero, we arrive at the following HJB equation for the Q-function:
Note that minimizes the Hamiltonian, and thus the HJB equation can be expressed as
In what follows, we uncover some mathematical properties of the HJB equation (2.6) and the Q-function.
2.1 Viscosity Solution: Existence and Uniqueness
In general, the Q-function is not a C^1-function. To interpret the HJB equation in a weak sense, we use the framework of viscosity solutions [11, 12]. We begin by defining the viscosity solution of (2.6) in the following standard manner [13, 14]:
A continuous function is a viscosity solution of (2.6) if
for all .
For any , if has a local maximum at a point , then
For any , if has a local minimum at a point , then
From now on, we assume the following regularity conditions on the problem data (these assumptions can be relaxed by using a modulus associated with each function as in [13, Chapter III.1–3]):
The functions , and are bounded, i.e., there exists a constant such that
The functions , and are Lipschitz continuous, i.e., there exists a constant such that
Then, the HJB equation (2.6) has a unique viscosity solution, which corresponds to the Q-function.
To accommodate the Lipschitz continuity constraint on controls, we consider the following augmented system:
where and are interpreted as a new state and input, respectively. Let be the augmented state, and
be the augmented vector field. Then, the Q-function can be expressed as
where . The HJB equation (2.6) can be rewritten as
with , where the Hamiltonian is defined by . By the assumptions (A1) and (A2), we have
These imply that the Hamiltonian satisfies the Lipschitz continuity conditions, and thus the standard proof of the existence and uniqueness of viscosity solutions applies directly (e.g., [13, 14]). Furthermore, by [13, Proposition 3.5 in Ch. 3], the Q-function corresponds to the unique viscosity solution. The boundedness and the Lipschitz continuity of the Q-function follow from [13, Proposition 3.1 in Ch. 3]. ∎
2.2 Asymptotic Consistency
We now discuss the convergence of the Q-function to the optimal value function as the Lipschitz constant tends to infinity. This convergence property demonstrates that the proposed HJB framework is asymptotically consistent with our observation in Proposition 1. We parametrize the Q-function by so that the scaling becomes similar to that of other classical singular limit problems. More precisely, let
where . Since for any , it is straightforward to observe that . We also notice that is a bounded function under Assumption (A1). Therefore, by the monotone convergence theorem, there exists a limit function such that
The limit function corresponds to the optimal value function (2.4) without the Lipschitz continuity constraint on controls.
For any , we have .
Since the argument in the proof of [13, Theorem 4.1 in Ch. 7] can be used, we omit the proof. ∎
2.3 Optimal Controls
To characterize a necessary and sufficient condition for optimality of a control , we consider the following function:
By (2.5), we deduce that the control is optimal if and only if is a constant function for each . On the other hand, the dynamic programming principle implies that the function is non-decreasing for any control . Thus, the control is optimal if and only if the function is non-increasing. If the Q-function is differentiable, this implies that
Since the Q-function satisfies the HJB equation (2.6), we have
Therefore, when is differentiable, is optimal if and only if with . To obtain a rigorous principle of optimality when is not differentiable, we use generalized derivatives of the Q-function. We define the lower and upper Dini derivatives of a Lipschitz continuous function at a point in a given direction by
Moreover, we define the sub- and superdifferentials of by
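The displayed definitions did not survive extraction; one standard formulation (as in [13]), for a Lipschitz continuous function $v$ at a point $z$ in a direction $q$, reads:

```latex
D^{-}v(z; q) = \liminf_{h \to 0^{+}} \frac{v(z + hq) - v(z)}{h},
\qquad
D^{+}v(z; q) = \limsup_{h \to 0^{+}} \frac{v(z + hq) - v(z)}{h},

D^{-}v(z) = \Big\{ p : \liminf_{z' \to z} \frac{v(z') - v(z) - p \cdot (z' - z)}{|z' - z|} \ge 0 \Big\},
\qquad
D^{+}v(z) = \Big\{ p : \limsup_{z' \to z} \frac{v(z') - v(z) - p \cdot (z' - z)}{|z' - z|} \le 0 \Big\}.
```

The notation here is the textbook convention and may differ from the authors' original symbols.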
The following optimality theorem is a direct application of the classical results in [13, Theorem 3.39, Ch. 3].
The trajectory-control pair is optimal if and only if
Furthermore, suppose that and are continuously differentiable. Then, the pair is optimal if and only if
At a point where is differentiable, the sub- and superdifferentials of are identical to the classical derivative of . Thus, at such a point, we can construct the optimal control using . At a point where is not differentiable, one can choose any control given by (2.8).
Suppose that is an optimal control constructed using (2.8) with the initial condition . Then, we have
for any admissible control with .
3 Q-Learning Using the HJB Equation
In the infinite-horizon case, we consider the following discounted cost function (with ):
and the Q-function is defined by . Again, using the dynamic programming principle, we can derive the following HJB equation:
As in Theorem 1, we can show that the Q-function is the unique viscosity solution of the HJB equation (3.1). A necessary and sufficient condition for optimality can be characterized in a way similar to Theorem 3.
We now discuss how the HJB equation (3.1) can be used to design a Q-learning algorithm for estimating using the (simulation) data of system trajectories. To provide the essential idea, we assume that is differentiable. We then have
Suppose now that an optimal control is employed, i.e., . Then, the time derivative of the Q-function along the trajectory simplifies further to
By integrating (3.2) along the optimal trajectory-control pair, we obtain
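The display equations were lost in extraction; under standard notation (discount rate $\gamma$, running cost $r$, and an optimal trajectory-control pair), the resulting integral relation (3.3) plausibly takes the form

```latex
Q^{\star}(x_t, u_t)
= \int_{t}^{t+h} e^{-\gamma (s - t)}\, r(x_s, u_s)\, ds
+ e^{-\gamma h}\, Q^{\star}(x_{t+h}, u_{t+h}),
```

which follows from integrating $\frac{d}{dt}\big[e^{-\gamma t} Q^{\star}\big] = -e^{-\gamma t} r$ along the optimal trajectory over $[t, t+h]$.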
However, since the optimal Q-function and the optimal trajectory-control pair are unknown a priori, we iteratively update the Q-function and the control with by using simulation data. The iteration is based on equation (3.3), which characterizes the optimal Q-function. Specifically, for a given control , we obtain the system trajectory starting from a randomly chosen for a time interval with small , and collect the (simulation) data , and . We then update a new estimate of the Q-function by using (3.3) and the simulation data so that (3.3) holds asymptotically.
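As a concrete sketch of this update target, the right-hand side of (3.3) over a short interval can be approximated from sampled data. The names `Q`, `xs`, `us`, `rs`, `gamma`, `dt` below are illustrative, and the trapezoidal rule is one possible quadrature choice, not necessarily the authors'.

```python
import numpy as np

def td_target(Q, xs, us, rs, gamma, dt):
    """Approximate target for the Q-update over one interval [t, t+h]:
    the discounted integral of the sampled running cost (trapezoidal rule)
    plus the discounted bootstrap value at the end of the interval.
    xs, us, rs are states, controls, and running costs on a grid of spacing dt."""
    n = len(rs)
    s = np.arange(n) * dt
    w = np.exp(-gamma * s) * np.asarray(rs, float)
    integral = dt * (0.5 * w[0] + w[1:-1].sum() + 0.5 * w[-1])  # trapezoidal rule
    return integral + np.exp(-gamma * s[-1]) * Q(xs[-1], us[-1])
```

The current Q-estimate at the start of the interval is then regressed toward this target.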
To handle high-dimensional state and control spaces, we propose a DQN-like algorithm using DNNs as a function approximator. Let denote the approximate Q-function parameterized by . We update the network parameters by minimizing the mean squared error (MSE) loss between and the target . To enhance the stability of the learning procedure, we use mini-batch samples to define the MSE loss and introduce a target network parameter for estimating the target value, as in DQN [10]. The target network is slowly updated as a weighted sum of and itself, as in [15]. The algorithm minimizes the error between the left- and right-hand sides of (3.3) at each iteration, so that the learned Q-function satisfies (3.3) as closely as possible. The overall procedure is summarized in Algorithm 1. Note that this algorithm does not require knowledge of an explicit system model, as in discrete-time Q-learning and DQN.
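A minimal, self-contained sketch of this training loop follows, with a linear-in-features approximator standing in for the DNN so the example stays dependency-free. The feature map, step sizes, toy dynamics, and cost are all illustrative assumptions, not the paper's setup; only the structure (target from the integral relation, frozen target parameters, soft update) mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, u):
    # Quadratic features for a scalar state-control pair (illustrative choice),
    # standing in for the DNN: Q(x, u) is approximated by theta @ phi(x, u).
    return np.array([x * x, u * u, x * u, x, u, 1.0])

theta = np.zeros(6)       # online parameters
theta_tgt = np.zeros(6)   # slowly updated target parameters
alpha, tau, gamma, dt = 0.05, 0.01, 1.0, 0.1

for _ in range(2000):
    x, u = rng.uniform(-1.0, 1.0, size=2)   # randomly chosen start (x0, u0)
    r = x * x + u * u                       # sampled running cost
    x2, u2 = 0.9 * x, 0.9 * u               # next state-control sample (toy dynamics)
    # Target from the integral relation: short-horizon discounted cost plus
    # a bootstrap value computed with the frozen target parameters.
    y = r * dt + np.exp(-gamma * dt) * (theta_tgt @ phi(x2, u2))
    td = theta @ phi(x, u) - y              # residual of (3.3)
    theta -= alpha * td * phi(x, u)         # gradient step on the squared residual
    theta_tgt = tau * theta + (1.0 - tau) * theta_tgt  # soft target update
```

The soft update with small tau keeps the regression target slowly moving, which is the stabilization mechanism borrowed from [15].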
4 Numerical Experiments
We consider the following linear system with an exponentially discounted quadratic cost
where and . We restrict the controls to Lipschitz continuous functions with Lipschitz constant 1. The parameters for simulation are chosen as and . As the DNNs for approximating Q-functions, we use fully connected networks consisting of an input layer with
nodes, two hidden layers with 128 nodes each, and an output layer with a single node. For the two hidden layers, we use the ReLU activation function. For training the networks, we use the Adam optimizer [16] with a fixed learning rate.
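The described network can be sketched in plain NumPy as follows. The input width `n_in` (the combined state-control dimension) and the initialization scheme are assumptions for illustration; the layer sizes follow the text.

```python
import numpy as np

def init_qnet(n_in, n_hidden=128, rng=np.random.default_rng(0)):
    """Fully connected Q-network from the text: an input layer, two hidden
    layers of 128 units, and a single linear output node."""
    def layer(n_i, n_o):
        # He-style initialization (an assumption; the paper does not specify one).
        return rng.normal(0.0, np.sqrt(2.0 / n_i), (n_i, n_o)), np.zeros(n_o)
    return [layer(n_in, n_hidden), layer(n_hidden, n_hidden), layer(n_hidden, 1)]

def qnet_forward(params, z):
    """Forward pass: z is a batch of concatenated (state, control) inputs."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h = np.maximum(z @ W1 + b1, 0.0)   # hidden layer 1, ReLU
    h = np.maximum(h @ W2 + b2, 0.0)   # hidden layer 2, ReLU
    return h @ W3 + b3                 # scalar Q-value per input
```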
4.1 One-dimensional problem
As a toy example for a sanity check, we first consider a one-dimensional model, where , and . To measure the performance of a control, we fix the initial state and control and integrate the running cost over . Figure 1 (a) shows the log of the cost at each iteration of Algorithm 1. The solid line represents the learning curve averaged over five different trials, and the shaded region represents the minimum and maximum of the cost over those trials. The running cost rapidly decreases as the parameter is learned. Figure 1 (b) shows the trajectories of , , and generated using the learned Q-function after training. The optimal control for this one-dimensional problem is to drive both and to 0 as quickly as possible. Starting from , this can be done by first driving the control to a negative value so that moves toward the origin, and then reducing the absolute value of so that both and approach 0 asymptotically. Such behavior is observed in Figure 1 (b), confirming that the learned policy is near-optimal.
4.2 10- and 20-dimensional problems
For 10- and 20-dimensional problems, we set the coefficient matrices and such that each element of is sampled from and then multiplied by 0.1, and each element of is sampled from and then multiplied by 5:
We integrate the running cost from a randomly chosen initial position and control . The learning curves and the first component of under the learned control are shown in Figures 2 and 3. The results show that the cost decreases super-exponentially in both the 10- and 20-dimensional problems. Note that the -axes of Figure 2 (a) and Figure 3 (a) are plotted on a log scale, and thus the decay of the cost is rapid. Moreover, the learning processes are fairly stable. As shown in Figure 2 (b) and Figure 3 (b), the learned policy confines the state-control pair to a small neighborhood of the origin. This behavior is aligned with the original control objective, confirming the desired performance.
5 Conclusion
We introduced a Q-function for continuous-time deterministic optimal control problems with Lipschitz continuous controls. By using the dynamic programming principle, we derived the corresponding HJB equation and showed that its unique viscosity solution corresponds to the Q-function. An optimality condition was also characterized in terms of the Q-function without knowledge of system models. Using the HJB equation and the optimality condition, we constructed a Q-learning algorithm and its DQN-like approximate version. The simulation results show that the proposed Q-learning algorithm is fast and stable, and that the learned controller performs well, even in the 20-dimensional case.
References
[1] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[2] D. P. Bertsekas, Reinforcement Learning and Optimal Control. Athena Scientific, 2019.
[3] J. Y. Lee, J. B. Park, and Y. H. Choi, “Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems,” Automatica, vol. 48, pp. 2850–2859, 2012.
[4] P. Mehta and S. Meyn, “Q-learning and Pontryagin’s minimum principle,” in Proceedings of the 48th IEEE Conference on Decision and Control, 2009, pp. 3598–3605.
[5] M. Palanisamy, H. Modares, F. L. Lewis, and M. Aurangzeb, “Continuous-time Q-learning for infinite-horizon discounted cost linear quadratic regulator problems,” IEEE Transactions on Cybernetics, vol. 45, pp. 165–176, 2015.
[6] K. G. Vamvoudakis, “Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14–20, 2017.
[7] Y. Jiang and Z.-P. Jiang, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,” Automatica, vol. 48, no. 10, pp. 2699–2704, 2012.
[8] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009.
[9] A. M. Devraj and S. Meyn, “Zap Q-learning,” in Advances in Neural Information Processing Systems, 2017, pp. 2235–2244.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.
[11] M. Crandall, L. C. Evans, and P.-L. Lions, “Some properties of viscosity solutions of Hamilton–Jacobi equations,” Transactions of the American Mathematical Society, vol. 282, pp. 487–502, 1984.
[12] M. Crandall and P.-L. Lions, “Viscosity solutions of Hamilton–Jacobi equations,” Transactions of the American Mathematical Society, vol. 277, pp. 1–42, 1983.
[13] M. Bardi and I. Capuzzo-Dolcetta, Optimal Control and Viscosity Solutions of Hamilton–Jacobi–Bellman Equations. Birkhäuser, 1997.
[14] L. C. Evans, Partial Differential Equations. American Mathematical Society, 2010.
[15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), 2016.
[16] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.