1 Introduction
Q-learning is one of the most popular reinforcement learning methods that seek efficient control policies without the knowledge of an explicit system model [1]. The key idea of Q-learning is to combine dynamic programming and stochastic approximation to estimate the optimal state-action value function, also called the Q-function, from trajectory samples. For discrete-time Markov decision processes, Q-learning has been extensively studied (see [2] and the references therein), while the literature on continuous-time Q-learning is sparse. In discrete time, the Bellman equation for Q-functions can be defined via dynamic programming in a straightforward manner. However, the corresponding Bellman equation for continuous-time Q-functions has not yet been fully characterized despite some prior attempts. A variant of the Q-function is used in [3, 4], but it has a different meaning from the Q-function in reinforcement learning. In other work, a Q-function similar to that of reinforcement learning was introduced, but with function-valued control input [5] or heavily relying on the linear time-invariant (LTI) system structure [6]. A similar model-free approach for LTI systems has also been studied [7, 8], although the Q-function is not explicitly defined there. A continuous-time Q-function was also considered to prove the convergence of stochastic approximation [9].

In this paper, we consider continuous-time deterministic optimal control problems with Lipschitz continuous controls. We show that the associated Q-function corresponds to the unique viscosity solution of a Hamilton–Jacobi–Bellman (HJB) equation in a particular form. In the viscosity solution framework, the Q-function can be used to verify the optimality of a given control and to design an optimal control strategy even when it is not differentiable. We then use the proposed HJB equation to derive an integral equation that the optimal Q-function and the optimal trajectory must satisfy. Based on this equation, we propose a Q-learning algorithm for continuous-time dynamical systems. For high-dimensional state and control spaces, we also propose a DQN-like algorithm that uses deep neural networks (DNNs) as function approximators
[10]. This opens a new avenue of research connecting viscosity solution theory for HJB equations with the reinforcement learning domain. The performance of the proposed Q-learning algorithm is tested through a set of numerical experiments with 1-, 10-, and 20-dimensional systems.

2 Continuous-Time Q-Functions and HJB Equations
Consider a controlled dynamical system of the form
(2.1) 
where is the system state and is the control input. Let be the set of admissible controls. The standard finite-horizon optimal control problem can be formulated as
(2.2) 
with , where and are the running and terminal cost functions of interest, respectively, and is a subset of . The Q-function of (2.2) is defined by
(2.3) 
which represents the minimal cost incurred from time to when starting from with . In particular, when , the Q-function reduces to the standard optimal value function , defined by
(2.4) 
Proposition 1.
Suppose that . Then, the Q-function (2.3) corresponds to for each , i.e.,
Proof.
Fix . Let be an arbitrary positive constant. Then, there exists such that
where satisfies (2.1) with in the Carathéodory sense: . We now construct a new control as if ; if . Such a modification of controls at a single point does not affect the trajectory or the total cost. Therefore, we have
Since was arbitrary, we conclude that for any . ∎
Thus, if is chosen to be , the Q-function has no additional interesting properties. Motivated by this observation, we restrict the control to be a Lipschitz continuous function. Since any Lipschitz continuous function is differentiable almost everywhere, we define the set of admissible controls as
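The set itself is elided above; under assumed notation (control values in a set $A$, Lipschitz constant $L > 0$, horizon $[t, T]$; symbols not from the source), it would plausibly read:

```latex
\mathcal{A}_L := \left\{ a(\cdot) : [t,T] \to A \;\middle|\;
  |a(s) - a(s')| \le L\,|s - s'| \quad \forall s, s' \in [t,T] \right\}.
```

Equivalently, $\dot a(s) = b(s)$ a.e. for some measurable $b$ with $|b(s)| \le L$, which is the representation used in the sequel.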
Then, for any , there exists a unique measurable function with such that the following ODE holds a.e.: , . By using the dynamic programming principle, we can deduce that
(2.5) 
Suppose for a moment that
. Then, the Taylor expansion of in (2.5) yields

Letting tend to zero, we arrive at the following HJB equation for the Q-function:
Note that minimizes the Hamiltonian, and thus the HJB equation can be expressed as
(2.6) 
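The displayed equation is elided here; assuming dynamics $\dot x = f(x,a)$, running cost $r(x,a)$, and Lipschitz bound $L$ on the control derivative (symbols not from the source), the HJB equation (2.6) would plausibly take the form

```latex
\partial_t Q(t,x,a) + \nabla_x Q(t,x,a) \cdot f(x,a)
  - L\,\bigl|\nabla_a Q(t,x,a)\bigr| + r(x,a) = 0,
```

where the term $-L|\nabla_a Q|$ comes from minimizing $b \cdot \nabla_a Q$ over $|b| \le L$, with minimizer $b^* = -L\,\nabla_a Q / |\nabla_a Q|$ whenever $\nabla_a Q \neq 0$.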
In what follows, we uncover some mathematical properties of the HJB equation (2.6) and the Q-function.
2.1 Viscosity Solution: Existence and Uniqueness
In general, the Q-function is not a function. As a weak notion of solution to the HJB equation, we use the framework of viscosity solutions [11, 12]. We begin by defining the viscosity solution of (2.6) in the following standard manner [13, 14]:
Definition 1.
A continuous function is a viscosity solution of (2.6) if

for all .

For any , if has a local maximum at a point , then

For any , if has a local minimum at a point , then
From now on, we assume the following regularity conditions on , and :¹

¹These assumptions can be relaxed by using a modulus associated with each function as in [13, Chapter III.1–3].

The functions , and are bounded, i.e., there exists a constant such that

The functions , and are Lipschitz continuous, i.e., there exists a constant such that
Then, the HJB equation (2.6) has a unique viscosity solution, which corresponds to the Q-function.
Theorem 1.
Proof.
To accommodate the Lipschitz continuity constraint on controls, we consider the following augmented system:
where and are interpreted as a new state and input, respectively. Let be the augmented state, and
be the augmented vector field. Then, the Q-function can be expressed as
where . The HJB equation (2.6) can be rewritten as
with , where the Hamiltonian is defined by . By the assumptions (A1) and (A2), we have
These imply that the Hamiltonian satisfies the required Lipschitz continuity conditions, and thus the standard proof of the existence and uniqueness of viscosity solutions can be directly applied (e.g., [13, 14]). Furthermore, by [13, Proposition 3.5 in Ch. 3], the Q-function corresponds to the unique viscosity solution. The boundedness and Lipschitz continuity of the Q-function follow from [13, Proposition 3.1 in Ch. 3]. ∎
2.2 Asymptotic Consistency
We now discuss the convergence of the Q-function to the optimal value function as the Lipschitz constant tends to . This convergence property demonstrates that the proposed HJB framework is asymptotically consistent with our observation in Proposition 1. We parametrize the Q-function by so that the scaling becomes similar to that of other classical singular limit problems. More precisely, let
(2.7) 
where

Since for any , it is straightforward to observe that . We also notice that is a bounded function under Assumption (A1). Therefore, by the monotone convergence theorem, there exists a limit function such that
The limit function corresponds to the optimal value function (2.4) without the Lipschitz continuity constraint on controls.
Theorem 2.
For any , we have .
Proof.
Since the argument in the proof of [13, Theorem 4.1 in Ch. 7] can be used, we omit the proof. ∎
2.3 Optimal Controls
To characterize a necessary and sufficient condition for optimality of a control , we consider the following function:
By (2.5), we deduce that the control is optimal if and only if is a constant function for each . On the other hand, the dynamic programming principle implies that the function is nondecreasing for any control . Thus, the control is optimal if and only if the function is nonincreasing. If the Q-function is differentiable, this implies that
Since the Q-function satisfies the HJB equation (2.6), we have
Therefore, when is differentiable, is optimal if and only if with . To obtain a rigorous principle of optimality when is not differentiable, we use generalized derivatives of the Q-function. We define the lower and upper Dini derivatives of a Lipschitz continuous function at a point in a direction by
Moreover, we define the sub- and superdifferentials of by
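The elided definitions are the standard ones; for a Lipschitz continuous function $Q$ at a point $z$ in a direction $v$ (notation assumed, not from the source):

```latex
D^{-}Q(z;v) = \liminf_{h \to 0^{+}} \frac{Q(z+hv)-Q(z)}{h}, \qquad
D^{+}Q(z;v) = \limsup_{h \to 0^{+}} \frac{Q(z+hv)-Q(z)}{h},
\\[4pt]
\partial^{-}Q(z) = \Bigl\{ p : \liminf_{z' \to z}
  \frac{Q(z')-Q(z)-p\cdot(z'-z)}{|z'-z|} \ge 0 \Bigr\}, \qquad
\partial^{+}Q(z) = \Bigl\{ p : \limsup_{z' \to z}
  \frac{Q(z')-Q(z)-p\cdot(z'-z)}{|z'-z|} \le 0 \Bigr\}.
```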
Then, the definition of a viscosity solution can be written in terms of sub- and superdifferentials [13]: a uniformly continuous function is a viscosity solution to (2.6) if and only if
The following optimality theorem is a direct application of the classical results in [13, Theorem 3.39, Ch. 3].
Theorem 3.
The trajectory–control pair is optimal if and only if
Furthermore, suppose that and are continuously differentiable. Then, the pair is optimal if and only if
(2.8) 
Proof.
Remark 1.
At a point where is differentiable, the sub- and superdifferentials of are identical to the classical derivative of . Thus, at such a point, we can construct the optimal control using . At a point where is not differentiable, one can choose any control given by (2.8).
Corollary 1.
Suppose that is an optimal control constructed using (2.8) with the initial condition . Then, we have
for any admissible control with .
3 Q-Learning Using the HJB Equation
In the infinite-horizon case, we consider the following discounted cost function (with ):
and the Q-function is defined by . Again, using the dynamic programming principle, we can derive the following HJB equation:
(3.1) 
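The displayed equation is elided; with discount rate $\lambda > 0$ and the same assumed notation as in Section 2 (symbols not from the source), the stationary HJB equation (3.1) would plausibly read

```latex
\lambda\,Q(x,a) = \nabla_x Q(x,a) \cdot f(x,a)
  - L\,\bigl|\nabla_a Q(x,a)\bigr| + r(x,a).
```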
As in Theorem 1, we can show that the Q-function is the unique viscosity solution of the HJB equation (3.1). A necessary and sufficient condition for optimality can be characterized in a way similar to Theorem 3.
We now discuss how the HJB equation (3.1) can be used to design a Q-learning algorithm for estimating using the (simulation) data of system trajectories. To convey the essential idea, we assume that is differentiable. We then have
Suppose now that an optimal control is employed, i.e., . Then, the time derivative of the Q-function along the trajectory is further simplified as
(3.2) 
By integrating (3.2) along the optimal trajectory–control pair, we obtain
(3.3) 
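The elided relation (3.3) would, under the same assumed notation, be the continuous-time Bellman equation obtained by integrating (3.2) over a window $[t, t+\Delta t]$ along the optimal pair $(x^*, a^*)$:

```latex
Q^{*}\bigl(x^{*}(t), a^{*}(t)\bigr)
  = \int_{t}^{t+\Delta t} e^{-\lambda (s-t)}\, r\bigl(x^{*}(s), a^{*}(s)\bigr)\, ds
  + e^{-\lambda \Delta t}\, Q^{*}\bigl(x^{*}(t+\Delta t), a^{*}(t+\Delta t)\bigr).
```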
However, since the optimal Q-function and the optimal trajectory–control pair are unknown a priori, we iteratively update the Q-function and the control with by using simulation data. The iteration is based on equation (3.3), which characterizes the optimal Q-function. Specifically, for a given control , we obtain the system trajectory starting from a randomly chosen over a time interval with small , and collect the (simulation) data , and . We then update a new estimate of the Q-function by using (3.3) and the simulation data so that (3.3) holds asymptotically.
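A minimal sketch of one such update step, assuming linear function approximation and a Riemann-sum discretization of the discounted integral (all names here are hypothetical, not from the source):

```python
import math

def td_target(rewards, dt, lam, q_next):
    # Riemann-sum approximation of the discounted running-cost integral
    # over [0, T], plus the discounted bootstrap value q_next = Q(x(T), a(T)).
    T = len(rewards) * dt
    integral = sum(math.exp(-lam * i * dt) * r * dt
                   for i, r in enumerate(rewards))
    return integral + math.exp(-lam * T) * q_next

def update(theta, features, q_value, target, lr=1e-2):
    # One gradient step on 0.5 * (Q_theta - target)^2 for a linear
    # model Q_theta(z) = sum_i theta[i] * phi(z)[i].
    err = q_value - target
    return [th - lr * err * f for th, f in zip(theta, features)]
```

Here `rewards` would hold running-cost samples along the simulated trajectory, and the update would be repeated over randomly chosen initial state–control pairs.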
To handle high-dimensional state and control spaces, we propose a DQN-like algorithm that uses DNNs as function approximators. Let denote the approximate Q-function parameterized by . We update the network parameter by minimizing the mean squared error (MSE) loss between and the target . To enhance the stability of the learning procedure, we use samples of for defining the MSE loss and introduce the target network parameter for estimating the target value, as in DQN [10]. The target network is slowly updated as a weighted sum of and itself, as in [15]. The algorithm minimizes the error between the left- and right-hand sides of (3.3) at each iteration, so that asymptotically satisfies (3.3) as closely as possible. The overall procedure is summarized in Algorithm 1. Note that this algorithm does not need the knowledge of an explicit system model, just as in discrete-time Q-learning or DQN.
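The target-network machinery can be sketched as follows; the parameter vectors are flat lists for illustration, and the Polyak-style averaging is an assumption based on the description above (all names hypothetical):

```python
def soft_update(target_params, params, tau=0.005):
    # Slowly track the online parameters: a weighted sum of the online
    # network and the current target network, as described in the text.
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, params)]

def minibatch_mse(q_estimates, targets):
    # MSE loss between the Q estimates and the targets computed with
    # the (frozen) target-network parameters.
    n = len(q_estimates)
    return sum((q - t) ** 2 for q, t in zip(q_estimates, targets)) / n
```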
4 Numerical Experiments
We consider the following linear system with an exponentially discounted quadratic cost
where and . We restrict the control to be a Lipschitz continuous function with Lipschitz constant 1. The parameters for the simulation are chosen as and . As the DNNs for approximating Q-functions, we use fully connected networks consisting of an input layer with
nodes, two hidden layers with 128 nodes, and an output layer with a single node. For the two hidden layers, we use the ReLU activation function. For training the networks, we use the Adam optimizer with a learning rate
[16].

4.1 One-dimensional problem
As a toy example for a sanity check, we first consider a one-dimensional model, where , and . To measure the performance of a control, we fix the initial state and control , and we integrate the running cost over . Figure 1(a) shows the log of the cost at each iteration of Algorithm 1. The solid line represents the learning curve averaged over five different trials, and the shaded region represents the minimum and maximum of the cost over the trials. The running cost rapidly decreases as the parameter is learned. Figure 1(b) shows the trajectories of , , and generated by using the learned Q-function after iteration. The optimal control for this one-dimensional problem is to drive both and to 0 as quickly as possible. Starting from , this can be done by first driving the control to a negative value so that moves toward the origin, and then reducing the absolute value of so that both and approach 0 asymptotically. Such behavior is observed in Figure 1(b), confirming that the learned policy is near-optimal.
4.2 10- and 20-dimensional problems
For the 10- and 20-dimensional problems, we set the coefficient matrices and such that each element of is sampled from and then multiplied by 0.1, and each element of is sampled from and then multiplied by 5:
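The sampling step can be sketched as follows; the source elides the sampling distribution, so a standard normal is assumed purely for illustration, and the matrix dimensions are hypothetical:

```python
import random

def sample_matrix(rows, cols, scale):
    # Each entry drawn from a standard normal (assumed; the source elides
    # the distribution) and then multiplied by the given scale.
    return [[scale * random.gauss(0.0, 1.0) for _ in range(cols)]
            for _ in range(rows)]

# Per the text: entries of the system matrix scaled by 0.1,
# entries of the input matrix scaled by 5.
A = sample_matrix(10, 10, 0.1)
B = sample_matrix(10, 10, 5.0)
```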
We integrate the running cost from a randomly chosen initial position and control . The learning curves and the first component of under the learned control are shown in Figures 2 and 3. The results show that the cost decreases super-exponentially in both the 10- and 20-dimensional problems. Note that the axes of Figure 2(a) and Figure 3(a) are in log scale, and thus the decay of the cost is rapid. Moreover, the learning processes are shown to be fairly stable. As shown in Figure 2(b) and Figure 3(b), the learned policy confines the state–control pair within a small neighborhood of the origin. This behavior is aligned with the original control objective, confirming that the learned controller achieves the desired performance.
5 Conclusion
We introduced a Q-function for continuous-time deterministic optimal control problems with Lipschitz continuous controls. By using the dynamic programming principle, we derived the corresponding HJB equation and showed that its unique viscosity solution corresponds to the Q-function. An optimality condition was also characterized in terms of the Q-function without the knowledge of system models. Using the HJB equation and the optimality condition, we constructed a Q-learning algorithm and its DQN-like approximate version. The simulation results show that the proposed Q-learning algorithm is fast and stable, and that the learned controller performs well even in the 20-dimensional case.
References
 [1] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3–4, pp. 279–292, 1992.
 [2] D. P. Bertsekas, Reinforcement Learning and Optimal Control. Athena Scientific, 2019.
 [3] J. Y. Lee, J. B. Park, and Y. H. Choi, “Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems,” Automatica, vol. 48, pp. 2850–2859, 2012.
 [4] P. Mehta and S. Meyn, “Q-learning and Pontryagin’s minimum principle,” in Proceedings of the 48th IEEE Conference on Decision and Control, 2009, pp. 3598–3605.
 [5] M. Palanisamy, H. Modares, F. L. Lewis, and M. Aurangzeb, “Continuous-time Q-learning for infinite-horizon discounted cost linear quadratic regulator problems,” IEEE Transactions on Cybernetics, vol. 45, pp. 165–176, 2015.
 [6] K. G. Vamvoudakis, “Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14–20, 2017.
 [7] Y. Jiang and Z.-P. Jiang, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,” Automatica, vol. 48, no. 10, pp. 2699–2704, 2012.
 [8] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009.
 [9] A. M. Devraj and S. Meyn, “Zap Q-learning,” in Advances in Neural Information Processing Systems, 2017, pp. 2235–2244.
 [10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.
 [11] M. Crandall, L. C. Evans, and P.-L. Lions, “Some properties of viscosity solutions of Hamilton–Jacobi equations,” Transactions of the American Mathematical Society, vol. 282, pp. 487–502, 1984.
 [12] M. Crandall and P.-L. Lions, “Viscosity solutions of Hamilton–Jacobi equations,” Transactions of the American Mathematical Society, vol. 277, pp. 1–42, 1983.
 [13] M. Bardi and I. Capuzzo-Dolcetta, Optimal Control and Viscosity Solutions of Hamilton–Jacobi–Bellman Equations. Birkhäuser, 1997.
 [14] L. C. Evans, Partial Differential Equations. American Mathematical Society, 2010.
 [15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), 2016.
 [16] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.