# Hamilton-Jacobi-Bellman Equations for Q-Learning in Continuous Time

In this paper, we introduce Hamilton-Jacobi-Bellman (HJB) equations for Q-functions in continuous time optimal control problems with Lipschitz continuous controls. The standard Q-function used in reinforcement learning is shown to be the unique viscosity solution of the HJB equation. A necessary and sufficient condition for optimality is provided using the viscosity solution framework. By using the HJB equation, we develop a Q-learning method for continuous-time dynamical systems. A DQN-like algorithm is also proposed for high-dimensional state and control spaces. The performance of the proposed Q-learning algorithm is demonstrated using 1-, 10- and 20-dimensional dynamical systems.


## 1 Introduction

Q-learning is one of the most popular reinforcement learning methods that seek efficient control policies without knowledge of an explicit system model [1]. The key idea of Q-learning is to combine dynamic programming and stochastic approximation to estimate the optimal state-action value function, also called the Q-function, by using trajectory samples. For discrete-time Markov decision processes, Q-learning has been extensively studied (see [2] and the references therein), while the literature on continuous-time Q-learning is sparse. In discrete time, the Bellman equation for Q-functions can be defined by using dynamic programming in a straightforward manner. However, the corresponding Bellman equation for continuous-time Q-functions has not yet been fully characterized despite some prior attempts. A variant of the Q-function is used in [3, 4], which has a different meaning from the Q-function in reinforcement learning. In other literature, a Q-function similar to that of reinforcement learning was introduced, but with function-valued control input [5] or heavily utilizing the linear time-invariant (LTI) system structure [6]. A similar model-free approach for LTI systems has also been studied [7, 8], although the Q-function is not specifically defined. A continuous-time Q-function was also considered to prove the convergence of stochastic approximation [9].

In this paper, we consider continuous-time deterministic optimal control problems with Lipschitz continuous controls. We show that the associated Q-function corresponds to the unique viscosity solution of a Hamilton–Jacobi–Bellman (HJB) equation in a particular form. In the viscosity solution framework, even when it is not differentiable, the Q-function can be used to verify the optimality of a given control and to design an optimal control strategy. Then, we use the proposed HJB equation to derive an integral equation that the optimal Q-function and optimal trajectory should satisfy. Based on this equation, we propose a Q-learning algorithm for continuous-time dynamical systems. For high-dimensional state and control spaces, we also propose a DQN-like algorithm by using deep neural networks (DNNs) as a function approximator [10]. This opens a new avenue of research that connects viscosity solution theory for HJB equations and the reinforcement learning domain. The performance of the proposed Q-learning algorithm is tested through a set of numerical experiments with 1-, 10- and 20-dimensional systems.

## 2 Continuous-Time Q-Functions and HJB Equations

Consider a controlled dynamical system of the form

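$$\dot{x}(t) = f(x(t), u(t)), \quad t > 0, \tag{2.1}$$

where $x(t) \in \mathbb{R}^n$ is the system state and $u(t) \in \mathbb{R}^m$ is the control input. Let $\mathcal{U}$ be the set of admissible controls. The standard finite-horizon optimal control problem can be formulated as

$$\inf_{u \in \mathcal{U}_1} J_x(u) := \inf_{u \in \mathcal{U}_1} \left\{ \int_0^T r(x(t), u(t))\,dt + q(x(T)) \right\} \tag{2.2}$$

with $x(0) = x$, where $r$ and $q$ are running and terminal cost functions of interest, respectively, and $\mathcal{U}_1$ is a subset of $\mathcal{U}$. The Q-function of (2.2) is defined by

$$Q(x, u, t) := \inf_{u \in \mathcal{U}_1} \left\{ \int_t^T r(x(s), u(s))\,ds + q(x(T)) \;\middle|\; x(t) = x,\ u(t) = u \right\}, \tag{2.3}$$

which represents the minimal cost incurred from time $t$ to $T$ when starting from $x(t) = x$ with $u(t) = u$. In particular, when $\mathcal{U}_1 = \mathcal{U}$, the Q-function reduces to the standard optimal value function $v$, defined by

$$v(x, t) := \inf_{u \in \mathcal{U}} \left\{ \int_t^T r(x(s), u(s))\,ds + q(x(T)) \;\middle|\; x(t) = x \right\}. \tag{2.4}$$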
###### Proposition 1.

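Suppose that $\mathcal{U}_1 = \mathcal{U}$. Then, the Q-function (2.3) corresponds to the optimal value function $v$ for each $u \in \mathbb{R}^m$, i.e.,

$$Q(x, u, t) = v(x, t) \quad \forall (x, u, t) \in \mathbb{R}^n \times \mathbb{R}^m \times [0, T].$$

###### Proof.

Fix $(x, u, t) \in \mathbb{R}^n \times \mathbb{R}^m \times [0, T]$. Let $\varepsilon$ be an arbitrary positive constant. Then, there exists a control $u(\cdot) \in \mathcal{U}$ such that

$$\int_t^T r(x(s), u(s))\,ds + q(x(T)) < v(x, t) + \varepsilon,$$

where $x(s)$ satisfies (2.1) with $x(t) = x$ in the Carathéodory sense: $x(s) = x + \int_t^s f(x(\tau), u(\tau))\,d\tau$. We now construct a new control $\tilde{u}$ as $\tilde{u}(s) := u$ if $s = t$; $\tilde{u}(s) := u(s)$ if $s > t$. Such a modification of controls at a single point does not affect the trajectory or the total cost. Therefore, we have

$$v(x, t) \le Q(x, u, t) \le \int_t^T r(x(s), \tilde{u}(s))\,ds + q(x(T)) < v(x, t) + \varepsilon.$$

Since $\varepsilon$ was arbitrary, we conclude that $v(x, t) = Q(x, u, t)$ for any $(x, u, t)$. ∎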

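Thus, if $\mathcal{U}_1$ is chosen to be $\mathcal{U}$, the Q-function has no additional interesting property. Motivated by this observation, we restrict the control to be a Lipschitz continuous function. Since any Lipschitz continuous function is differentiable almost everywhere, we define the set of admissible controls as

$$\mathcal{U}_1 := \left\{ u \in \mathcal{U} \;\middle|\; \|\dot{u}\|_{L^\infty} \le M \text{ a.e.} \right\}, \quad \text{where } M \text{ is a fixed constant.}$$

Then, for any $u \in \mathcal{U}_1$, there exists a unique measurable function $a$ with $|a(t)| \le M$ such that the following ODE holds a.e.: $\dot{u}(t) = a(t)$, $t > 0$. By using the dynamic programming principle, we can deduce that

$$Q(x, u, t) = \inf_{u \in \mathcal{U}_1} \left\{ \int_t^{t+h} r(x(s), u(s))\,ds + Q(x(t+h), u(t+h), t+h) \;\middle|\; x(t) = x,\ u(t) = u \right\}. \tag{2.5}$$

Suppose for a moment that $Q$ is continuously differentiable. Then, the Taylor expansion of $Q(x(t+h), u(t+h), t+h)$ in (2.5) yields

$$\inf_{u \in \mathcal{U}_1} \left\{ \frac{1}{h} \int_t^{t+h} r(x(s), u(s))\,ds + \partial_t Q + \nabla_x Q \cdot f(x, u) + \nabla_u Q \cdot \dot{u}(t) + O(h) \right\} = 0.$$

Letting $h$ tend to zero, we arrive at the following HJB equation for the Q-function:

$$\partial_t Q + \nabla_x Q \cdot f(x, u) + \inf_{a \in \mathbb{R}^m,\ |a| \le M} \{ \nabla_u Q \cdot a \} + r(x, u) = 0.$$

Note that $a = -M \nabla_u Q / |\nabla_u Q|$ minimizes the Hamiltonian whenever $\nabla_u Q \neq 0$, and thus the HJB equation can be expressed as

$$\partial_t Q + \nabla_x Q \cdot f(x, u) - M |\nabla_u Q| + r(x, u) = 0. \tag{2.6}$$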

In what follows, we uncover some mathematical properties of the HJB equation (2.6) and the Q-function.
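The reduction of the infimum over $|a| \le M$ to the term $-M|\nabla_u Q|$ can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names `minimizing_rate` and `sampled_infimum`, the test vector `g`, and the value `M = 2.0` are illustrative choices and do not come from the paper.

```python
import numpy as np


def minimizing_rate(grad_u_q, M):
    """Return a = -M * g / |g|, which minimizes g . a over the ball |a| <= M."""
    g = np.asarray(grad_u_q, dtype=float)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        # Every admissible a gives g . a = 0, so any choice is a minimizer.
        return np.zeros_like(g)
    return -M * g / norm


def sampled_infimum(grad_u_q, M, num_samples=20000, seed=0):
    """Estimate inf_{|a| <= M} g . a by random sampling inside the ball."""
    rng = np.random.default_rng(seed)
    g = np.asarray(grad_u_q, dtype=float)
    directions = rng.normal(size=(num_samples, g.size))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = M * rng.uniform(size=(num_samples, 1))
    return float(np.min((radii * directions) @ g))


g = np.array([3.0, -4.0])  # hypothetical value of the gradient of Q in u
M = 2.0                    # hypothetical bound on the control rate

a_star = minimizing_rate(g, M)
closed_form = float(g @ a_star)     # equals -M * |g|
estimate = sampled_infimum(g, M)    # never below the closed form
print(closed_form, estimate)
```

By the Cauchy–Schwarz inequality, no sampled rate can do better than the closed-form minimizer, so the sampled estimate approaches $-M|g|$ from above.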

### 2.1 Viscosity Solution: Existence and Uniqueness

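In general, the Q-function is not a $C^1$-function. As a weak solution of the HJB equation, we use the framework of viscosity solutions [11, 12]. We begin by defining the viscosity solution of (2.6) in the following standard manner [13, 14]:

###### Definition 1.

A continuous function $Q: \mathbb{R}^n \times \mathbb{R}^m \times [0, T] \to \mathbb{R}$ is a viscosity solution of (2.6) if

1. $Q(x, u, T) = q(x)$ for all $(x, u) \in \mathbb{R}^n \times \mathbb{R}^m$.

2. For any $R \in C^1$, if $Q - R$ has a local maximum at a point $(x_0, u_0, t_0)$, then

$$\partial_t R(x_0, u_0, t_0) + \nabla_x R(x_0, u_0, t_0) \cdot f(x_0, u_0) - M |\nabla_u R(x_0, u_0, t_0)| + r(x_0, u_0) \ge 0.$$

3. For any $R \in C^1$, if $Q - R$ has a local minimum at a point $(x_0, u_0, t_0)$, then

$$\partial_t R(x_0, u_0, t_0) + \nabla_x R(x_0, u_0, t_0) \cdot f(x_0, u_0) - M |\nabla_u R(x_0, u_0, t_0)| + r(x_0, u_0) \le 0.$$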

From now on, we assume the following regularity conditions on $f$, $r$ and $q$ (these assumptions can be relaxed by using a modulus associated with each function as in [13, Chapter III.1–3]):

• (A1) The functions $f$, $r$ and $q$ are bounded, i.e., there exists a constant $C > 0$ such that

$$\|f\|_{L^\infty} + \|r\|_{L^\infty} + \|q\|_{L^\infty} < C.$$

• (A2) The functions $f$, $r$ and $q$ are Lipschitz continuous, i.e., there exists a constant $C > 0$ such that

$$\|f\|_{\mathrm{Lip}} + \|r\|_{\mathrm{Lip}} + \|q\|_{\mathrm{Lip}} < C.$$

Then, the HJB equation (2.6) has a unique viscosity solution, which corresponds to the Q-function.

###### Theorem 1.

The Q-function (2.3) is the unique viscosity solution of the HJB equation (2.6). Moreover, it is a bounded Lipschitz continuous function.

###### Proof.

To accommodate the Lipschitz continuity constraint on controls, we consider the following augmented system:

$$\dot{x}(t) = f(x(t), u(t)), \quad \dot{u}(t) = a(t), \quad |a(t)| \le M,$$

where $u$ and $a$ are interpreted as a new state and input, respectively. Let $z := (x, u)$ be the augmented state, and $(f(x, u), a)$ be the augmented vector field. Then, the Q-function can be expressed as

$$Q(z, t) = \inf_{|a| \le M} \left\{ \int_t^T r(z(s))\,ds + \tilde{q}(z(T)) \;\middle|\; z(t) = z \right\},$$

where $\tilde{q}(z) := q(x)$. The HJB equation (2.6) can be rewritten as

$$\partial_t Q + H(\nabla_z Q, z) = 0$$

with $Q(z, T) = \tilde{q}(z)$, where the Hamiltonian is defined by $H(p, z) := p_1 \cdot f(z_1, z_2) - M |p_2| + r(z_1, z_2)$ for $p = (p_1, p_2)$ and $z = (z_1, z_2)$. By the assumptions (A1) and (A2), we have

$$\begin{aligned}
|H(p, z) - H(q, z)| &\le |p_1 - q_1|\,|f(z_1, z_2)| + M |p_2 - q_2| \le (M + \|f\|_{L^\infty}) |p - q|, \\
|H(p, z) - H(p, y)| &\le |p_1|\,|f(z_1, z_2) - f(y_1, y_2)| + |r(z_1, z_2) - r(y_1, y_2)| \\
&\le (|p|\,\|f\|_{\mathrm{Lip}} + \|r\|_{\mathrm{Lip}}) |z - y|.
\end{aligned}$$

These imply that the Hamiltonian satisfies the Lipschitz continuity conditions, and thus the standard proof for the existence and uniqueness of the viscosity solution can be directly used (e.g., [13, 14]). Furthermore, by [13, Proposition 3.5 in Ch. 3], the Q-function corresponds to the unique viscosity solution. The boundedness and the Lipschitz continuity of the Q-function follow from [13, Proposition 3.1 in Ch. 3]. ∎
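The first Lipschitz estimate in the proof can be probed numerically. The sketch below assumes NumPy and uses a hypothetical bounded vector field `f` and running cost `r`, chosen only so that assumption (A1) holds with $\|f\|_{L^\infty} \le \sqrt{n}$; neither function nor the value of `M` is taken from the paper.

```python
import numpy as np

M = 2.0  # hypothetical bound on the control rate


def f(x, u):
    """Hypothetical bounded, Lipschitz vector field with |f| <= sqrt(n)."""
    return np.tanh(x + u)


def r(x, u):
    """Hypothetical bounded, Lipschitz running cost."""
    return float(np.cos(x).sum() + np.cos(u).sum())


def hamiltonian(p1, p2, x, u):
    """H(p, z) = p1 . f(z) - M |p2| + r(z) for the augmented state z = (x, u)."""
    return float(p1 @ f(x, u) - M * np.linalg.norm(p2) + r(x, u))


def max_lipschitz_ratio(n=3, trials=2000, seed=0):
    """Largest observed |H(p, z) - H(q, z)| / |p - q| over random samples."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        x, u = rng.normal(size=n), rng.normal(size=n)
        p, q = rng.normal(size=2 * n), rng.normal(size=2 * n)
        gap = abs(hamiltonian(p[:n], p[n:], x, u) - hamiltonian(q[:n], q[n:], x, u))
        worst = max(worst, gap / np.linalg.norm(p - q))
    return worst


n = 3
bound = M + np.sqrt(n)          # M + sup-norm bound of the hypothetical f
ratio = max_lipschitz_ratio(n)  # stays below the bound
print(ratio, bound)
```

The observed ratio never exceeds $M + \|f\|_{L^\infty}$, in line with the estimate; a random search of this kind illustrates the bound but does not prove it.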

### 2.2 Asymptotic Consistency

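We now discuss the convergence of the Q-function to the optimal value function as the Lipschitz constant tends to $\infty$. This convergence property demonstrates that the proposed HJB framework is asymptotically consistent with our observation in Proposition 1. We parametrize the Q-function by $\varepsilon$ so that the scaling becomes similar to that of other classical singular limit problems. More precisely, let

$$Q^\varepsilon(x, u, t) := \inf_{u \in \mathcal{U}_1^\varepsilon} \left\{ \int_t^T r(x(s), u(s))\,ds + q(x(T)) \;\middle|\; x(t) = x,\ u(t) = u \right\}, \tag{2.7}$$

where $\mathcal{U}_1^\varepsilon := \{ u \in \mathcal{U} \mid \|\dot{u}\|_{L^\infty} \le 1/\varepsilon \}$. Since $\mathcal{U}_1^{\varepsilon} \subset \mathcal{U}_1^{\varepsilon'}$ for any $0 < \varepsilon' < \varepsilon$, it is straightforward to observe that $Q^{\varepsilon'} \le Q^{\varepsilon}$. We also notice that $Q^\varepsilon$ is a bounded function under Assumption (A1). Therefore, by the monotone convergence theorem, there exists a limit function $Q^0$ such that

$$Q^0(x, u, t) = \lim_{\varepsilon \to 0} Q^\varepsilon(x, u, t) \quad \forall (x, u, t) \in \mathbb{R}^n \times \mathbb{R}^m \times [0, T].$$

The limit function $Q^0$ corresponds to the optimal value function (2.4) without the Lipschitz continuity constraint on controls.

###### Theorem 2.

For any $(x, u, t) \in \mathbb{R}^n \times \mathbb{R}^m \times [0, T]$, we have $Q^0(x, u, t) = v(x, t)$.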

###### Proof.

Since the argument in the proof of [13, Theorem 4.1 in Ch. 7] can be used, we omit the proof. ∎

### 2.3 Optimal Controls

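To characterize a necessary and sufficient condition for optimality of a control $u \in \mathcal{U}_1$, we consider the following function:

$$g_u(s; x, u, t) := \int_t^s r(x(\tau), u(\tau))\,d\tau + Q(x(s), u(s), s), \quad x(t) = x,\ u(t) = u.$$

By (2.5), we deduce that the control $u$ is optimal if and only if $g_u$ is a constant function for each $(x, u, t)$. On the other hand, the dynamic programming principle implies that the function $g_u$ is non-decreasing for any control $u \in \mathcal{U}_1$. Thus, the control is optimal if and only if the function $g_u$ is non-increasing. If the Q-function is differentiable, this implies that

$$\frac{d}{ds} g_u(s; x, u, t) = r(x(s), u(s)) + \nabla_x Q \cdot f(x(s), u(s)) + \nabla_u Q \cdot \dot{u} + \partial_t Q \le 0.$$

Since the Q-function satisfies the HJB equation (2.6), we have

$$\begin{aligned}
0 &= \partial_t Q + \nabla_x Q \cdot f(x(s), u(s)) - M |\nabla_u Q| + r(x(s), u(s)) \\
&\le \partial_t Q + \nabla_x Q \cdot f(x(s), u(s)) + \nabla_u Q \cdot \dot{u} + r(x(s), u(s)) \le 0.
\end{aligned}$$

Therefore, when $Q$ is differentiable, $u$ is optimal if and only if $\dot{u} = -M \nabla_u Q / |\nabla_u Q|$ with $u(t) = u$. To obtain a rigorous principle of optimality when $Q$ is not differentiable, we use generalized derivatives of the Q-function. We define the lower and upper Dini derivatives of a Lipschitz continuous function $Q$ at the point $(x, u, t)$ in the direction $q \in \mathbb{R}^{n+m+1}$ by

$$\partial^- Q(x, u, t; q) := \liminf_{h \to 0^+} \frac{Q((x, u, t) + hq) - Q(x, u, t)}{h}, \qquad \partial^+ Q(x, u, t; q) := \limsup_{h \to 0^+} \frac{Q((x, u, t) + hq) - Q(x, u, t)}{h}.$$

Moreover, we define the sub- and superdifferentials of $Q$ by

$$\begin{aligned}
D^- Q(x, u, t) &:= \{ p = (p_0, p_1, p_2) \in \mathbb{R}^{n+m+1} \mid p \cdot q \le \partial^- Q(x, u, t; q) \ \ \forall q \in \mathbb{R}^{n+m+1} \}, \\
D^+ Q(x, u, t) &:= \{ p = (p_0, p_1, p_2) \in \mathbb{R}^{n+m+1} \mid p \cdot q \ge \partial^+ Q(x, u, t; q) \ \ \forall q \in \mathbb{R}^{n+m+1} \}.
\end{aligned}$$

Then, the definition of viscosity solution can be written in terms of sub- and superdifferentials [13]: a uniformly continuous function $Q$ is a viscosity solution to (2.6) if and only if

$$\begin{cases}
p_2 + p_0 \cdot f(x, u) - M |p_1| + r(x, u) \ge 0 & \forall p \in D^+ Q(x, u, t), \\
p_2 + p_0 \cdot f(x, u) - M |p_1| + r(x, u) \le 0 & \forall p \in D^- Q(x, u, t).
\end{cases}$$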

The following optimality theorem is a direct application of the classical results in [13, Theorem 3.39, Ch. 3].

###### Theorem 3.

The trajectory-control pair $(x^*, u^*)$ is optimal if and only if

$$\partial^- Q(x^*(s), u^*(s), s; f(x^*(s), u^*(s)), \dot{u}^*(s), 1) = \min_{|a| \le M} \partial^- Q(x^*(s), u^*(s), s; f(x^*(s), u^*(s)), a, 1), \quad \text{a.e. } s \ge t.$$

Furthermore, suppose that and are continuously differentiable. Then, the pair is optimal if and only if

$$\dot{u}^*(s) = -M \frac{p_1}{|p_1|} \quad \forall p = (p_0, p_1, p_2) \in D^{\pm} Q(x^*(s), u^*(s), s), \quad \text{a.e. } s \ge t. \tag{2.8}$$

###### Proof.

Note that the controlled system is equivalent to the extended dynamics defined in the proof of Theorem 1. Then, the assertions can be obtained by directly applying [13, Theorem 3.39 in Ch. 3] to this augmented system. ∎

###### Remark 1.

At a point where $Q$ is differentiable, the sub- and superdifferentials of $Q$ are identical to the classical derivative of $Q$. Thus, at such a point, we can construct the optimal control using (2.8) with $p_1 = \nabla_u Q$. At a point where $Q$ is not differentiable, one can choose any control given by (2.8).

###### Corollary 1.

Suppose that $u^*$ is an optimal control constructed using (2.8) with the initial condition $u^*(t) = u$. Then, we have

$$g_u(s; x, u, t) \ge g_{u^*}(s; x, u, t)$$

for any admissible control $u \in \mathcal{U}_1$ with $u(t) = u$.

## 3 Q-Learning Using the HJB Equation

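In the infinite-horizon case, we consider the following discounted cost function (with $\gamma > 0$):

$$J_x(u) = \int_0^\infty e^{-\gamma t} r(x(t), u(t))\,dt, \quad x(0) = x,$$

and the Q-function is defined by $Q(x, u) := \inf_{u \in \mathcal{U}_1} \{ J_x(u) \mid u(0) = u \}$. Again, using the dynamic programming principle, we can derive the following HJB equation:

$$\gamma Q(x, u) - r(x, u) - \nabla_x Q \cdot f(x, u) + M |\nabla_u Q| = 0. \tag{3.1}$$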

As in Theorem 1, we can show that the Q-function is the unique viscosity solution of the HJB equation (3.1). A necessary and sufficient condition for optimality can be characterized in a way similar to Theorem 3.

We now discuss how the HJB equation (3.1) can be used to design a Q-learning algorithm for estimating $Q$ using the (simulation) data of system trajectories. To provide the essential idea, we assume that $Q$ is differentiable. We then have

$$\begin{aligned}
\frac{d}{dt} Q(x(t), u(t)) &= \nabla_x Q \cdot f(x(t), u(t)) + \nabla_u Q \cdot \dot{u}(t) \\
&= \gamma Q(x(t), u(t)) - r(x(t), u(t)) + M |\nabla_u Q| + \nabla_u Q \cdot \dot{u}(t).
\end{aligned}$$

Suppose now that an optimal control is employed, i.e., $\dot{u} = -M \nabla_u Q / |\nabla_u Q|$. Then, the last two terms cancel, and the time derivative of the Q-function along the trajectory simplifies to

$$\frac{d}{dt} Q(x(t), u(t)) = \gamma Q(x(t), u(t)) - r(x(t), u(t)). \tag{3.2}$$

By integrating (3.2) along the optimal trajectory-control pair, we obtain

$$Q(x, u) = \int_0^t e^{-\gamma s} r(x(s), u(s))\,ds + e^{-\gamma t} Q(x(t), u(t)), \quad \forall t \ge 0. \tag{3.3}$$
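The integration step from (3.2) to (3.3) is a routine integrating-factor calculation. It is spelled out here for readability, with $s$ used as the integration variable and $x(0) = x$, $u(0) = u$:

$$\begin{aligned}
\frac{d}{ds}\Big[ e^{-\gamma s} Q(x(s), u(s)) \Big]
&= e^{-\gamma s} \Big[ \frac{d}{ds} Q(x(s), u(s)) - \gamma Q(x(s), u(s)) \Big]
= -e^{-\gamma s} r(x(s), u(s)), \\
e^{-\gamma t} Q(x(t), u(t)) - Q(x, u)
&= -\int_0^t e^{-\gamma s} r(x(s), u(s))\,ds.
\end{aligned}$$

Rearranging the second line gives (3.3).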

However, since the optimal Q-function and the optimal trajectory-control pair are unknown a priori, we iteratively update the estimate of the Q-function and the control, with $\dot{u} = -M \nabla_u Q / |\nabla_u Q|$ evaluated at the current estimate, by using simulation data. The iteration is based on equation (3.3), which characterizes the optimal Q-function. Specifically, for a given control, we obtain the system trajectory starting from a randomly chosen initial state-control pair over a short time interval and collect the (simulation) data required to evaluate both sides of (3.3). We then compute a new estimate of the Q-function from (3.3) and the simulation data so that (3.3) holds asymptotically.

To handle high-dimensional state and control spaces, we propose a DQN-like algorithm by using DNNs as a function approximator. Let $Q_\theta$ denote the approximate Q-function parameterized by $\theta$. We update the network parameter $\theta$ by minimizing the mean squared error (MSE) loss between $Q_\theta$ and the target given by the right-hand side of (3.3). To enhance the stability of the learning procedure, we use a batch of samples for defining the MSE loss and introduce the target network parameter $\theta^-$ in estimating the target value, as in DQN [10]. The target network parameter is slowly updated as a weighted sum of $\theta$ and itself, as in [15]. The algorithm minimizes the error between the left- and right-hand sides of (3.3) in each iteration, so that $Q_\theta$ asymptotically satisfies (3.3) as closely as possible. The overall procedure is summarized in Algorithm 1. Note that, as in discrete-time Q-learning or DQN, this algorithm does not need knowledge of an explicit system model.
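The update described above can be illustrated with a small self-contained sketch. It is not the paper's Algorithm 1. It assumes NumPy, a scalar state and control, a quadratic feature model in place of a DNN, a one-step quadrature `H * r` for the integral in (3.3), and a hypothetical linear simulator. All names and constants (`GAMMA`, `M`, `H`, `a_coef`, `b_coef`, `lr`, `tau`) are illustrative.

```python
import numpy as np

GAMMA, M, H = 0.1, 1.0, 0.05  # hypothetical discount rate, rate bound, time step


def features(x, u):
    """Quadratic features standing in for a neural network."""
    return np.stack([x * x, x * u, u * u], axis=-1)


def q_value(theta, x, u):
    return features(x, u) @ theta


def grad_u_q(theta, x, u):
    """Derivative of the quadratic model with respect to u."""
    return theta[1] * x + 2.0 * theta[2] * u


def greedy_rate(theta, x, u):
    """Scalar version of -M * grad / |grad| for the current estimate."""
    return -M * np.sign(grad_u_q(theta, x, u))


def simulate_step(x, u, theta, a_coef=-0.5, b_coef=1.0):
    """One Euler step of a hypothetical simulator; the learner only sees its output."""
    r = x * x + u * u
    x_next = x + H * (a_coef * x + b_coef * u)
    u_next = u + H * greedy_rate(theta, x, u)
    return r, x_next, u_next


def td_target(theta_target, r, x_next, u_next):
    """Right-hand side of (3.3) over one step, using the target parameters."""
    return H * r + np.exp(-GAMMA * H) * q_value(theta_target, x_next, u_next)


def train(num_iters=500, batch=64, lr=1e-2, tau=0.01, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(3)
    theta_target = theta.copy()
    for _ in range(num_iters):
        x = rng.uniform(-1.0, 1.0, batch)
        u = rng.uniform(-1.0, 1.0, batch)
        r, x_next, u_next = simulate_step(x, u, theta)
        y = td_target(theta_target, r, x_next, u_next)
        err = q_value(theta, x, u) - y
        theta = theta - lr * features(x, u).T @ err / batch     # semi-gradient MSE step
        theta_target = (1.0 - tau) * theta_target + tau * theta  # slow target update
    return theta


theta = train()
print(theta)
```

The target is held fixed when differentiating the loss (a semi-gradient step), and the target parameters track the learned ones through a weighted average, mirroring the two stabilization devices mentioned in the text.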

## 4 Numerical Experiments

We consider the following linear system with an exponentially discounted quadratic cost

$$\dot{x}(t) = A x(t) + B u(t), \quad J_x(u) := \int_0^\infty e^{-\gamma t} \left( |x(t)|^2 + |u(t)|^2 \right) dt, \quad x(0) = x,\ u(0) = u,$$

where and . We restrict the control to be a Lipschitz continuous function with Lipschitz constant 1. The parameters for simulation are chosen as and . As the DNNs for approximating Q-functions, we use fully connected networks consisting of an input layer with one node per component of the state-control pair, two hidden layers with 128 nodes each and an output layer with a single node. For the two hidden layers, we use the ReLU activation function. For training the networks, we use the Adam optimizer [16] with a learning rate .
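The network shape described above can be sketched as a plain forward pass. This is a minimal illustration assuming NumPy; the He-style initialization, the dimensions `state_dim = control_dim = 10`, and the function names are assumptions, and the Adam training loop is not reproduced.

```python
import numpy as np


def init_q_network(state_dim, control_dim, hidden=128, seed=0):
    """Create weights for input -> 128 -> 128 -> 1 fully connected layers."""
    rng = np.random.default_rng(seed)
    sizes = [state_dim + control_dim, hidden, hidden, 1]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # He-style scaling is an assumption; the text does not state the initializer.
        weights = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        params.append((weights, np.zeros(fan_out)))
    return params


def q_forward(params, x, u):
    """Evaluate the network on a batch of states x and controls u."""
    h = np.concatenate([x, u], axis=-1)
    for weights, bias in params[:-1]:
        h = np.maximum(h @ weights + bias, 0.0)  # ReLU on the two hidden layers
    weights, bias = params[-1]
    return (h @ weights + bias)[..., 0]          # single output node


params = init_q_network(state_dim=10, control_dim=10)
batch_x = np.ones((4, 10))
batch_u = np.zeros((4, 10))
q_values = q_forward(params, batch_x, batch_u)
print(q_values.shape)
```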

### 4.1 One-dimensional problem

As a toy example for a sanity check, we first consider a one-dimensional model, where , and . To measure the performance of the control, we fix the initial state and control and integrate the running cost over . Figure 1 (a) shows the log of the cost at each iteration of Algorithm 1. The solid line represents the learning curve averaged over five different trials, and the shaded region represents the minimum and maximum of the cost over the trials. The running cost rapidly decreases as the network parameter is learned. Figure 1 (b) shows the trajectory of , , and generated by using the learned Q-function after iteration. The optimal control for this one-dimensional problem is to drive both $x$ and $u$ to 0 as soon as possible. Starting from , this can be done by first driving the control to a negative value so that $x$ moves towards the origin, and then reducing the absolute value of $u$ so that both $x$ and $u$ approach 0 asymptotically. Such behavior is observed in Figure 1 (b), which indicates that the learned policy is near optimal.

### 4.2 10- and 20-dimensional problems

For the 10- and 20-dimensional problems, we set the coefficient matrices $A$ and $B$ such that each element of $A$ is sampled from $U[0, 1]$ and then multiplied by 0.1, and each element of $B$ is sampled from $U[0, 1]$ and then multiplied by 5:

$$A_{ij} = 0.1 X_{ij}, \quad B_{ij} = 5 Y_{ij}, \quad X_{ij}, Y_{ij} \sim U[0, 1].$$
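The sampling rule for the coefficient matrices can be written in a few lines. This sketch assumes NumPy; the function name `sample_system`, the seed, and the default of a control dimension equal to the state dimension are assumptions, since the text does not state the shape of $B$.

```python
import numpy as np


def sample_system(n, m=None, seed=0):
    """Draw A = 0.1 * X and B = 5 * Y with X, Y uniform on [0, 1]."""
    m = n if m is None else m  # assumption: control dimension defaults to n
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.uniform(0.0, 1.0, size=(n, n))
    B = 5.0 * rng.uniform(0.0, 1.0, size=(n, m))
    return A, B


A, B = sample_system(10)
print(A.shape, B.shape)
```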

We integrate the running cost from a randomly chosen initial state and control. The learning curve and the first component of the trajectory under the learned control are shown in Figures 2 and 3. The results show that the cost decreases super-exponentially in both the 10- and 20-dimensional problems. Note that the $y$-axes of Figure 2 (a) and Figure 3 (a) are plotted on a log scale, and thus the decay of the cost is rapid. Moreover, the learning processes are shown to be fairly stable. As shown in Figure 2 (b) and Figure 3 (b), the learned policy confines the state-control pair to a small neighborhood of the origin. This indicates that the learned policy achieves the desired performance, as this behavior is aligned with the original control objective.

## 5 Conclusion

We introduced a Q-function for continuous-time deterministic optimal control problems with Lipschitz continuous controls. By using the dynamic programming principle, we derived the corresponding HJB equation and showed that its unique viscosity solution corresponds to the Q-function. An optimality condition was also characterized in terms of the Q-function without knowledge of system models. Using the HJB equation and the optimality condition, we constructed a Q-learning algorithm and its DQN-like approximate version. The simulation results show that the proposed Q-learning algorithm is fast and stable, and that the learned controller performs well, even in the 20-dimensional case.

## References

• [1] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
• [2] D. P. Bertsekas, Reinforcement Learning and Optimal Control. Athena Scientific, 2019.
• [3] J. Y. Lee, J. B. Park, and Y. H. Choi, “Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems,” Automatica, vol. 48, pp. 2850–2859, 2012.
• [4] P. Mehta and S. Meyn, “Q-learning and Pontryagin’s minimum principle,” in Proceedings of the 48th IEEE Conference on Decision and Control, 2009, pp. 3598–3605.
• [5] M. Palanisamy, H. Modares, F. L. Lewis, and M. Aurangzeb, “Continuous-time Q-learning for infinite-horizon discounted cost linear quadratic regulator problems,” IEEE Transactions on Cybernetics, vol. 45, pp. 165–176, 2015.
• [6] K. G. Vamvoudakis, “Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach,” Systems & Control Letters, vol. 100, pp. 14–20, 2017.
• [7] Y. Jiang and Z.-P. Jiang, “Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics,” Automatica, vol. 48, no. 10, pp. 2699–2704, 2012.
• [8] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009.
• [9] A. M. Devraj and S. Meyn, “Zap Q-learning,” in Advances in Neural Information Processing Systems, 2017, pp. 2235–2244.
• [10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.
• [11] M. G. Crandall, L. C. Evans, and P.-L. Lions, “Some properties of viscosity solutions of Hamilton–Jacobi equations,” Transactions of the American Mathematical Society, vol. 282, pp. 487–502, 1984.
• [12] M. G. Crandall and P.-L. Lions, “Viscosity solutions of Hamilton–Jacobi equations,” Transactions of the American Mathematical Society, vol. 277, pp. 1–42, 1983.
• [13] M. Bardi and I. Capuzzo-Dolcetta, Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Birkhäuser, 1997.
• [14] L. C. Evans, Partial Differential Equations. American Mathematical Society, 2010.
• [15] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), 2016.
• [16] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.