1 Introduction
Reinforcement learning (RL) reinforcement ; lewis2009reinforcement ; sugiyama2015statistical has been successful in a variety of applications such as AlphaGo and Atari games, particularly for discrete stochastic systems. Recently, the application of RL to physical control tasks has also been gaining attention, because solving an optimal control problem (or the Hamilton-Jacobi-Bellman-Isaacs equation) optimalcontrol directly is computationally prohibitive for complex nonlinear system dynamics and/or cost functions.
In the physical world, states and actions are continuous, and many dynamical systems evolve in continuous time. OpenAI Gym openaigym and DeepMind Control Suite controlsuite offer several representative examples of such physical tasks. When handling continuous-time (CT) systems, CT formulations are methodologically preferable to discrete-time (DT) formulations with small time intervals, since such discretization is susceptible to errors. In terms of computational complexity and ease of analysis, CT formulations are also more advantageous than their DT counterparts for control-theoretic analyses such as stability and forward invariance khalil1996noninear , which are useful for safety-critical applications. As we will show in this paper, our framework allows us to constrain control inputs and/or states in a computationally efficient way.
One of the early examples of RL for CT systems baird1994reinforcement pointed out that Q-learning is incapable of learning in continuous time and proposed advantage updating. Convergence proofs were given in munos1998reinforcement for systems described by stochastic differential equations (SDEs) oksendal2003stochastic using a grid-based discretization of states and time. Stochastic differential dynamic programming and RL have also been studied in, for example, theodorou2010stochastic ; pan2014probabilistic ; theodorou2010reinforcement . For continuous states and actions, function approximators are often employed instead of finely discretizing the state space, to avoid an explosion of computational complexity. The work in doya2000reinforcement presented an application of CT RL with function approximators such as Gaussian networks with a fixed number of basis functions. In vamvoudakis2010online , it was mentioned that any continuously differentiable value function (VF) can be approximated by increasing the number of independent basis functions to infinity in CT scenarios, and a CT policy iteration was proposed.
However, without resorting to the theory of reproducing kernels aronszajn50 , determining the number of basis functions and selecting a suitable basis function class cannot, in general, be performed systematically. Nonparametric learning is often desirable when no a priori knowledge about a suitable set of basis functions is available. Kernel-based methods offer many nonparametric learning algorithms, ranging from Gaussian processes (GPs) rasmussen2006gaussian to kernel adaptive filters (KFs) liu_book10 , which can provably deal with uncertainties and nonstationarity. While DT kernel-based RL was studied in ormoneit2002kernel ; xu2007kernel ; taylor2009kernelized ; sun2016online ; nishiyama2012hilbert ; grunewalder2012modelling ; ohnishi2018safety , for example, and the Gaussian process temporal difference (GPTD) algorithm was presented in engel2005reinforcement , no CT kernel-based RL has been proposed to our knowledge. Moreover, there is no unified framework in which existing kernel methods and their convergence/tracking analyses are straightforwardly applied to model-based VF approximation.
In this paper, we present a novel theoretical framework of model-based CT-VF approximation in reproducing kernel Hilbert spaces (RKHSs) aronszajn50 for systems described by SDEs. The RKHS for learning is defined through a one-to-one correspondence to a user-defined RKHS in which the VF being obtained lies. We then obtain the associated kernel to be used for learning. The resulting framework renders any kind of kernel-based method applicable to model-based CT-VF approximation, including GPs rasmussen2006gaussian and KFs liu_book10 . In addition, we propose an efficient barrier-certified policy update for CT systems, which implicitly enforces state constraints. Relations of our framework to the existing approaches for DT, DT stochastic (Markov decision process (MDP)), CT, and CT stochastic systems are shown in Table 1. Our proposed framework covers model-based VF approximation working in RKHSs, including those for CT and CT stochastic systems. We verify the validity of the framework on the classical Mountain Car problem and a simulated inverted pendulum.

                   DT                              DT stochastic (MDP)              CT                             CT stochastic
Non-kernel-based   (e.g. baird1995residual)        (e.g. tsitsiklis1997analysis)    (e.g. doya2000reinforcement)   (e.g. munos1998reinforcement)
Kernel-based       (e.g. engel2005reinforcement)   (e.g. grunewalder2012modelling)  (This work)                    (This work)
2 Problem setting
Throughout, , , and are the sets of real numbers, nonnegative integers, and strictly positive integers, respectively. We suppose that the system dynamics, described by the SDE oksendal2003stochastic ,
dx(t) = f(x(t), u(t)) dt + σ(x(t), u(t)) dw(t),    (1)
is known or learned, where x(t), u(t), and w(t) are the state, control, and a Brownian motion of dimensions n, m, and d, respectively, f is the drift, and σ is the diffusion. The Brownian motion can be considered as process noise, and the solution of (1) is known to satisfy the Markov property oksendal2003stochastic . Given a policy π, we define f^π(x) := f(x, π(x)) and σ^π(x) := σ(x, π(x)), and make the following two assumptions.
Assumption 1.
For any Lipschitz continuous policy π, both f^π and σ^π are Lipschitz continuous, i.e., the stochastic process defined in (1) is an Itô diffusion (oksendal2003stochastic, Definition 7.1.1), which has a pathwise unique solution.
Assumption 2.
The set is compact with nonempty interior , and is invariant under the system (1) with any Lipschitz continuous policy , i.e.,
(2) 
where the left-hand side denotes the probability that the state remains in the set when starting from a point in it. Assumption 2 implies that a solution of the system (1) stays in the set with probability one. We refer the readers to khasminskii2011stochastic for stochastic stability and invariance for SDEs.
In this paper, we consider the immediate cost function (which might be obtained as the negation of the reward function), assumed continuous and with bounded discounted expectation over all trajectories (time evolutions of the state) starting from any initial point. Note that this boundedness implies either a strictly positive discount factor or the existence of a zero-cost state which is stochastically asymptotically stable khasminskii2011stochastic . Specifically, we consider the case where the immediate cost is not known a priori but is sequentially observed. Now, the VF associated with a policy is given by
V^π(x) := E^x [ ∫₀^∞ e^{−βt} c^π(x(t)) dt ],    (3)

where c^π(x) := c(x, π(x)) and β > 0 is the discount factor.
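Since the algorithm is ultimately implemented with a discrete control cycle, the integral in (3) can be approximated from a sampled trajectory. The following minimal Monte-Carlo sketch is illustrative (the function and symbol names are not from the paper; `beta` stands for the discount factor):

```python
import math

def discounted_cost_estimate(costs, dt, beta):
    """Approximate the discounted cost integral of one trajectory:
    sum_i exp(-beta * t_i) * c_i * dt, a Riemann sum of (3)."""
    return sum(math.exp(-beta * i * dt) * c * dt for i, c in enumerate(costs))
```

Averaging such estimates over many trajectories from the same initial state gives a Monte-Carlo estimate of the VF at that state.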
The advantages of using CT formulations include smooth control performance and efficient policy updates, and CT formulations require no elaborate partitioning of time doya2000reinforcement . In addition, our work shows that CT formulations make control-theoretic analyses easier and computationally more efficient, and are less susceptible to errors when the time interval is small. We mention that the CT formulation can still be adopted even though the algorithm is implemented in discrete time.
With these problem settings in place, our goal is to estimate the CT-VF in an RKHS and improve policies. However, since the output of the VF is unobservable and the so-called double-sampling problem arises when approximating VFs (see e.g., sutton2009convergent ; konda2004convergence ), kernel-based supervised learning and its analysis cannot be directly applied to VF approximation in general. Motivated by this fact, we propose a novel model-based CT-VF approximation framework which enables us to conduct kernel-based VF approximation as supervised learning.
3 Model-based CT-VF approximation in RKHSs
In this section, we briefly present an overview of our framework. We take the following steps:

Select an RKHS which is supposed to contain the VF as one of its elements.

Construct another RKHS in one-to-one correspondence with it through a certain bijective linear operator, defined in the next section.

Estimate the immediate cost function in the latter RKHS by kernel-based supervised learning, and return its estimate.

Obtain an estimate of the VF immediately by applying the inverse of the operator to the estimated cost function.
An illustration of our framework is depicted in Figure 1.
Note that we can avoid the double-sampling problem because the operator is deterministic even though the system dynamics is stochastic. Therefore, under this framework, model-based CT-VF approximation in RKHSs can be derived, and convergence/tracking analyses of kernel-based supervised learning can also be applied to VF approximation.
Policy update while restricting certain regions of the state space
As mentioned above, one of the advantages of a CT framework is its affinity for control-theoretic analyses such as stability and forward invariance, which are useful for safety-critical applications. For example, suppose that we need to restrict the region of exploration in the state space to the zero-superlevel set of a smooth function. This is often required for safety-critical applications.
To this end, control inputs must be properly constrained so that the state trajectory remains inside this set. In the safe RL context, there exists the idea of considering a smaller space of allowable policies (see Safesurvey and references therein). To effectively constrain policies, we employ control barrier certificates (cf. xu2015robustness ; wieland2007constructive ; glotfelter2017nonsmooth ; wang2017safety ; ames2016control ; 2017dcbf ). It is known that, without explicitly calculating the state trajectory over a long time horizon, any Lipschitz continuous policy satisfying control barrier certificates renders the set forward invariant xu2015robustness , i.e., the state trajectory remains inside the set. In other words, we can implicitly enforce state constraints by satisfying barrier certificates when updating policies. Barrier-certified policy updates were first introduced in ohnishi2018safety for DT systems, but are computationally more efficient in our CT scenario. This concept is illustrated in Figure 2, which depicts the space of Lipschitz continuous policies and, within it, the space of barrier-certified allowable policies.
A brief summary of the proposed model-based CT-VF approximation in RKHSs is given in Algorithm 1. In the next section, we present theoretical analyses of our framework.
4 Theoretical analyses
We presented the motivations and an overview of our framework in the previous section. In this section, we validate the proposed framework from a theoretical viewpoint. Because the output of the VF is unobservable, we follow the strategy presented in the previous section. First, by properly identifying the RKHS which is supposed to contain the VF, we can implicitly restrict the class of the VF. If the VF is twice continuously differentiable (see, for example, (fleming2006controlled, Chapter IV) and krylov2008controlled for more detailed arguments about the conditions under which twice continuous differentiability is guaranteed), we obtain the following Hamilton-Jacobi-Bellman-Isaacs equation oksendal2003stochastic :
β V^π(x) = c^π(x) + (𝒜^π V^π)(x),    (4)
where the infinitesimal generator 𝒜^π is defined as

(𝒜^π φ)(x) := ∇φ(x)ᵀ f^π(x) + (1/2) tr( σ^π(x) σ^π(x)ᵀ ∇²φ(x) ).    (5)

Here, tr(·) stands for the trace, and ∇φ and ∇²φ are the gradient and the Hessian of φ. By employing a suitable RKHS, such as a Gaussian RKHS, for the VF, we can guarantee twice continuous differentiability of an estimated VF. Note that functions in a Gaussian RKHS are smooth minh2010some , and any continuous function on every compact subset can be approximated with arbitrary accuracy steinwart01 in a Gaussian RKHS.
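As a quick illustration of this approximation capability, the following sketch (all names and parameter values are illustrative, not from the paper) fits a continuous function on a compact interval by kernel ridge regression with a Gaussian kernel and measures the fit on held-out points:

```python
import numpy as np

def gaussian_kernel(a, b, scale=0.3):
    return np.exp(-((a - b) ** 2) / (2.0 * scale ** 2))

# Target: an arbitrary continuous function on the compact set [-1, 1].
f = lambda x: np.cos(3.0 * x)

X = np.linspace(-1.0, 1.0, 50)                         # training inputs
K = gaussian_kernel(X[:, None], X[None, :])            # Gram matrix
alpha = np.linalg.solve(K + 1e-6 * np.eye(50), f(X))   # ridge regularization

def f_hat(x):
    # Expansion in the Gaussian RKHS: sum_i alpha_i k(x_i, x)
    return gaussian_kernel(X, x) @ alpha

X_test = np.linspace(-1.0, 1.0, 101)
max_err = max(abs(f_hat(x) - f(x)) for x in X_test)
```

Increasing the number of samples (and shrinking the regularization) drives the error toward zero on the compact set, mirroring the universality property cited above.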
Next, we need to construct another RKHS which contains the immediate cost function as one of its elements. The relation between the VF and the immediate cost function is given by rewriting (4) as

c^π = (β I − 𝒜^π) V^π,    (6)

where I is the identity operator. To define the operator β I − 𝒜^π over the whole RKHS, we use the following definition.
Definition 1 ((zhou2008derivative, Definition 1)).
Let for . Define where . If is compact with nonempty interior , is the space of functions over such that is well defined and continuous over for each . Define to be the space of continuous functions over such that and that has a continuous extension to for each . If , define
Now, suppose that is an RKHS associated with the reproducing kernel . Then, we define the operator as
(7) 
where the coefficients involve the entries of the diffusion term. We emphasize here that the expected value and the immediate cost are related through the deterministic operator defined in (7). The following main theorem states that the space for learning is indeed an RKHS under Assumptions 1 and 2, and gives its corresponding reproducing kernel.
Theorem 1.
Under Assumptions 1 and 2, suppose that is an RKHS associated with the reproducing kernel .
Suppose also that (i) , or that (ii) is a Gaussian RKHS, and there exists a point which is stochastically asymptotically stable over , i.e., for any starting point .
Then, the following statements hold.
(a) The space for learning is a Hilbert space isomorphic to the original RKHS, equipped with the inner product defined by
(8) 
where the operator is defined in (7).
(b) The Hilbert space has the reproducing kernel given by
(9) 
where
(10) 
Proof.
See Appendices A and B in the supplementary document. ∎
Under Assumptions 1 and 2, Theorem 1 implies that the VF can be uniquely determined by the immediate cost function for a policy if the VF is in an RKHS of a particular class. In fact, the relation between the VF and the immediate cost function in (4) is based on the assumption that the VF is twice continuously differentiable, and the verification theorem (cf. fleming2006controlled ) states that, when the immediate cost function and a twice continuously differentiable function satisfying the relation (4) are given under certain conditions, that function is indeed the VF. In Theorem 1, on the other hand, we first restrict the class of the VF by identifying an RKHS, and then approximate the immediate cost function in the RKHS, any element of which satisfies the relation (4). Because the immediate cost is observable, we can employ any kernel-based supervised learning to estimate this function, such as GPs and KFs, as elaborated later in Section 6.
In the RKHS for learning, an estimate of the immediate cost at time instant n is given by a kernel expansion over the observed samples, with the reproducing kernel defined in (9). An estimate of the VF at time instant n is thus immediately obtained by reusing the same expansion coefficients with the function defined in (10).
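To make this structure concrete, the following sketch (with placeholder kernels `k_tilde` and `k_cross` standing in for the problem-specific kernels of (9) and (10)) fits the cost coefficients by kernel ridge regression and reuses them to evaluate the VF estimate:

```python
import numpy as np

def fit_coefficients(X, c_obs, k_tilde, lam=1e-6):
    """Kernel ridge regression in the learning RKHS: solve (K + lam*I) alpha = c."""
    K = np.array([[k_tilde(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), np.asarray(c_obs))

def estimate_cost(x, X, alpha, k_tilde):
    # hat{c}(x) = sum_i alpha_i * k_tilde(x_i, x)
    return sum(a * k_tilde(xi, x) for a, xi in zip(alpha, X))

def estimate_value(x, X, alpha, k_cross):
    # The same coefficients yield hat{V}(x) through the function of (10),
    # i.e., the inverse operator applied to each kernel section.
    return sum(a * k_cross(xi, x) for a, xi in zip(alpha, X))
```

The key point is that no second regression is needed for the VF: the coefficients fitted to the observed costs are reused unchanged.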
Note that, when the system dynamics is described by an ordinary differential equation (i.e., the diffusion term vanishes), the assumption that the VF is twice continuously differentiable is relaxed to continuous differentiability, and the corresponding condition on the kernel is relaxed accordingly. As an illustrative example of Theorem 1, we show the case of the linear-quadratic regulator (LQR) below.
Special case: linear-quadratic regulator
Consider a linear feedback policy, i.e., u(t) = Kx(t), and a linear system dx(t) = (Ax(t) + Bu(t)) dt, where A and B are matrices. In this case, we know that the value function becomes quadratic with respect to the state variable (cf. zhou1996robust ). Therefore, we employ an RKHS with a quadratic kernel for the VF, i.e., k(x, y) = (xᵀy)². If we assume that the input space is large enough, the corresponding set of quadratic forms accommodates any real symmetric matrix.
Moreover, it follows from the product rule of the directional derivative bonet1997nonlinear that the operator maps each quadratic form to another quadratic form. Because the resulting matrix is symmetric, the image again consists of quadratic functions. If the closed-loop system matrix is stable (Hurwitz), the one-to-one correspondence established in Theorem 1 thus implies that the learning space coincides with the space of quadratic functions. Therefore, we can fully approximate the immediate cost function in this space if it is quadratic with respect to the state variable.
Suppose that the immediate cost function is given by a quadratic form c(x) = xᵀQx. Then, the estimated value function will be given by V(x) = xᵀPx, where P solves the well-known Lyapunov equation zhou1996robust . Unlike Gaussian-kernel cases, we only require a finite number of parameters to fully approximate the immediate cost function, and hence the solution is analytically obtainable.
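For the undiscounted case, the Lyapunov equation mentioned above can be solved directly; a minimal sketch using SciPy follows (the closed-loop matrix `A_cl` and cost weight `Q` are illustrative, and the undiscounted setting is an assumption of this sketch):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A_cl = np.array([[0.0, 1.0],
                 [-2.0, -3.0]])   # hypothetical Hurwitz closed-loop matrix
Q = np.eye(2)                     # quadratic immediate-cost weight

# Solve the Lyapunov equation A_cl^T P + P A_cl = -Q; then V(x) = x^T P x.
P = solve_continuous_lyapunov(A_cl.T, -Q)

def value(x):
    return float(x @ P @ x)
```

Note the argument convention: `solve_continuous_lyapunov(a, q)` solves a x + x aᴴ = q, so passing `A_cl.T` and `-Q` yields the standard Lyapunov equation.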
Barrier-certified policy updates under the CT formulation
Next, we show that the CT formulation makes barrier-certified policy updates computationally more efficient under certain conditions. Assume that the system dynamics is affine in the control, and that the immediate cost is the sum of a state cost and a control cost that is quadratic with a positive definite weight matrix. Then, any Lipschitz continuous policy satisfying the control barrier certificate renders the safe set forward invariant xu2015robustness , i.e., the state trajectory remains inside the set; here the certificate involves a strictly increasing function vanishing at zero. Taking this constraint into account, the barrier-certified greedy policy update is given by
(11) 
which is, by virtue of the CT formulation, a quadratic programming (QP) problem at each state when the certificate defines affine constraints (see Appendix C in the supplementary document). The space of allowable policies is thus given by the set of barrier-certified Lipschitz continuous policies. When the dynamics is learned by GPs as in pan2014probabilistic , the work in wang2017safe provides a barrier certificate for uncertain dynamics. Note that one can also employ a function approximator or add noise to the greedily updated policy to avoid unstable performance and to promote exploration (see e.g., doya2000reinforcement ). To see whether the updated policy remains in the space of Lipschitz continuous policies, we present the following proposition.
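As intuition for why (11) is cheap, when the certificate reduces to a single affine constraint on the input at a given state and the objective is quadratic in the input, the QP has a closed-form solution: project the unconstrained greedy input onto a half-space. A minimal sketch (`u_nom`, `a`, and `b` are illustrative placeholders, not quantities from the paper):

```python
import numpy as np

def barrier_certified_input(u_nom, a, b):
    """Closed-form solution of  min_u ||u - u_nom||^2  s.t.  a^T u >= b,
    a single-constraint instance of a barrier-certified QP update."""
    u_nom = np.asarray(u_nom, dtype=float)
    slack = float(a @ u_nom - b)
    if slack >= 0.0:               # certificate already satisfied
        return u_nom
    # otherwise move minimally along a onto the constraint boundary
    return u_nom - slack * a / float(a @ a)
```

With several affine constraints, the same problem is a small QP solvable by any off-the-shelf solver at each control cycle.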
Proposition 1.
Assume the conditions in Theorem 1. Assume also that the certificate defines affine constraints, and that the drift, the control matrix, the cost gradient, and the derivative of the barrier function are Lipschitz continuous over the state set. Then, the policy defined in (11) is Lipschitz continuous if the width of the feasible set³ is strictly larger than zero everywhere.

³Width indicates how much control margin is left for the strictest constraint, as defined in (morris2013sufficient, Equation 21).
Proof.
See Appendix D in the supplementary document. ∎
Note that if the certificate defines bounds on each entry of the control inputs, it defines affine constraints. Lastly, the width of the feasible set is strictly larger than zero if the admissible control set is sufficiently large.
We will further clarify the relations of the proposed theoretical framework to existing works below.
5 Relations to existing works
First, our proposed framework takes advantage of the capability of learning complicated functions and the nonparametric flexibility of RKHSs, and reproduces some of the existing model-based DT-VF approximation techniques (see Appendix E in the supplementary document). Note that some of the existing DT-VF approximations in RKHSs, such as GPTD engel2005reinforcement , also work for model-free cases (see ohnishi2018safety for model-free adaptive DT action-value function approximation, for example). Second, since the RKHS for learning is explicitly defined in our framework, any kernel-based method and its convergence/tracking analyses are directly applicable. For example, while the work in koppel2017parsimonious , which aims at attaining a sparse representation of the unknown function in an online fashion in RKHSs, was extended to policy evaluation koppel2017policy by addressing the double-sampling problem, our framework does not suffer from the double-sampling problem, and hence any kernel-based online learning (e.g., koppel2017parsimonious ; yukawa_tsp12 ; tayu_tsp15 ) can be straightforwardly applied. Third, when the time interval is small, DT formulations become susceptible to errors, while CT formulations are immune to the choice of the time interval. Note, on the other hand, that a larger time interval poorly represents system dynamics evolving in continuous time. Lastly, barrier certificates are efficiently incorporated in our CT framework through QPs under certain conditions, and state constraints are implicitly taken into account. Stochastic optimal control approaches such as theodorou2010stochastic ; theodorou2010reinforcement require sample trajectories over predefined finite time horizons, and the value is computed along those trajectories, while in our framework the VF is estimated in an RKHS without having to follow trajectories.
6 Applications and practical implementation
We apply the theory presented in the previous section to the Gaussian kernel case, introduce CT-GP as an example, and present a practical implementation. Assume that the diffusion matrix is diagonal, for simplicity. The Gaussian kernel is given by k(x, y) = exp(−‖x − y‖²/(2σ²)) for a scale parameter σ > 0. Given this Gaussian kernel, the reproducing kernel defined in (9) is derived in closed form up to a real-valued function (see Appendix F in the supplementary document for its explicit form).
CT-GP
One of the celebrated properties of GPs is their Bayesian formulation, which enables us to deal with uncertainty through credible intervals. Suppose that the observation of the immediate cost at each time instant contains some noise. Given the data observed so far, we can employ GP regression to obtain the mean and the variance of the cost estimate at a point as

(12)

where the Gram matrix has entries given by the reproducing kernel defined in (9), regularized by the noise variance times the identity matrix. Then, by the existence of the inverse operator, the mean and the variance of the VF estimate at a point can be given by

(13)

(see Appendix G in the supplementary document for more details).
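The cost-side computation in (12) is standard GP regression; a minimal sketch follows (the `kernel` argument is a placeholder for the reproducing kernel of (9) — any positive-definite kernel illustrates the computation):

```python
import numpy as np

def gp_posterior(X, y, x_star, kernel, noise_var=1e-4):
    """Standard GP regression posterior mean and variance at a query point."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    k_star = np.array([kernel(a, x_star) for a in X])
    G = K + noise_var * np.eye(len(X))       # Gram matrix + noise regularization
    w = np.linalg.solve(G, k_star)
    mean = float(w @ np.asarray(y))
    var = float(kernel(x_star, x_star) - w @ k_star)
    return mean, var
```

The VF mean and variance of (13) are then obtained by pushing these quantities through the inverse operator.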
7 Numerical experiments
In this section, we first show the validity of the proposed CT framework and its advantage over DT counterparts when the time interval is small, and then compare CT-GP and GPTD for RL on a simulated inverted pendulum. In both experiments, the coherence-based sparsification richard09 in the RKHS is employed to curb the growth of the dictionary size.
Policy evaluations: comparison of CT and DT approaches
We show that our CT approaches are advantageous over DT counterparts in terms of susceptibility to errors, using MountainCarContinuous-v0 in OpenAI Gym openaigym as the environment. The state consists of the position and the velocity of the car, which evolve according to the simulated car dynamics. The position and the velocity are clipped to their admissible ranges, and the goal is to reach the target position. In the simulation, the control cycle (i.e., the frequency of applying control inputs and observing the states and costs) is fixed. The observed immediate cost penalizes control effort until the goal is reached and includes a random component. Note that the immediate cost for the DT cases is obtained by multiplying by the time interval. For policy evaluations, we use the policy obtained by RL based on the cross-entropy method⁴, and the four methods, CT-GP, KF-based CT-VF approximation (CT-KF), GPTD, and KF-based DT-VF approximation (DT-KF), are used to learn value functions associated with the policy by executing the current policy for five episodes, each of which terminates at the goal or after a time limit. GP-based techniques are expected to handle the random component added to the immediate cost. The new policies are then obtained by the barrier-certified policy updates under the CT formulation, and these policies are evaluated five times. Here, the barrier function prevents the velocity from falling below a prescribed threshold. Figure 3 compares the value functions⁵ learned by each method for two different time intervals. We observe that the DT approaches are sensitive to the choice of the time interval. Table 2 compares the cumulative costs averaged over five episodes for each method and for different time intervals, together with the number of times the velocity was observed below the threshold with and without the barrier certificate. (Numbers appended to the algorithm names indicate the lengths of the time intervals.) Note that the cumulative costs are calculated by summing up the immediate costs multiplied by the duration of each control cycle, i.e., we discretized the immediate cost based on the control cycle. It is observed that the CT approach is immune to the choice of the time interval while the performance of the DT approach degrades when the time interval becomes small, and that the barrier-certified policy updates work efficiently.

⁴We used the code in https://github.com/udacity/deep-reinforcement-learning/blob/master/cross-entropy/CEM.ipynb offered by Udacity. The code is based on PyTorch paszke2017automatic .
⁵We used "jet" colormaps in Python Matplotlib for illustrating the value functions.

                  CT-KF      GPTD_1     DT-KF_1    CT-GP      GPTD_20    DT-KF_20
Cumulative cost
With barrier      (times)    (times)    (times)    (times)    (times)    (times)
Without barrier   (times)    (times)    (times)    (times)    (times)    (times)
Reinforcement learning: inverted pendulum
We show the advantage of CT-GP over GPTD when the time interval for the estimation is small. Let the state consist of the angle and the angular velocity of an inverted pendulum, evolving according to the given SDE. The Brownian motion may come from outer disturbances and/or modeling error. In the simulation, the dynamics evolves by a discretized update with a small time interval. In this experiment, the task is to stabilize the inverted pendulum at the upright position. The observed immediate cost penalizes deviations from the upright position. A trajectory associated with the current policy is generated to learn the VF. The trajectory terminates when the angle leaves a prescribed range and restarts from a random initial angle. After a fixed duration, the policy is updated. To evaluate the current policy, the average time over five episodes during which the pendulum stays up, when initialized upright, is used. Figure 4 compares this average time of CT-GP and GPTD, with standard deviations, for up to five updates, beyond which stable policy improvement becomes difficult without heuristic techniques such as adding noise to policies. Note that the same seed for the random number generator is used for the initializations of both approaches. It is observed that GPTD fails to improve policies. The large credible interval of CT-GP is due to the random initialization of the state.
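The simulated dynamics above advance in discrete steps; for an SDE of the form (1), one Euler-Maruyama step can be sketched as follows (`f`, `sigma`, and the step size are illustrative placeholders):

```python
import numpy as np

def euler_maruyama_step(x, u, f, sigma, dt, rng):
    """One Euler-Maruyama step for dx = f(x, u) dt + sigma(x, u) dw,
    where the Brownian increment dw ~ Normal(0, dt * I)."""
    S = sigma(x, u)                               # diffusion matrix at (x, u)
    dw = rng.normal(0.0, np.sqrt(dt), size=S.shape[1])
    return x + f(x, u) * dt + S @ dw
```

With a zero diffusion matrix, the step reduces to explicit Euler integration of the drift alone.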
8 Conclusion and future work
We presented a novel theoretical framework that renders the CT-VF approximation problem solvable in an RKHS by conducting kernel-based supervised learning for the immediate cost function in a properly defined RKHS. Our CT framework is compatible with rich theories of control, including control barrier certificates for safety-critical applications. The validity of the proposed framework and its advantage over DT counterparts when the time interval is small were verified experimentally on the classical Mountain Car problem and a simulated inverted pendulum.
There are several possible directions to explore as future work. First, we can employ state-of-the-art kernel methods within our theoretical framework or use other variants of RL, such as actor-critic methods, to improve practical performance. Second, we can consider uncertainties in value function approximation by virtue of the RKHS-based formulation, which might be used for safety verification. Lastly, the advantages of CT formulations for physical tasks are worth further exploration.
Acknowledgments
This work was partially conducted when M. Ohnishi was at the GRITS Lab, Georgia Institute of Technology; M. Ohnishi thanks the members of the GRITS Lab, including Dr. Li Wang, and Prof. Magnus Egerstedt for discussions regarding barrier functions. M. Yukawa was supported in part by KAKENHI 18H01446 and 15H02757, M. Johansson was supported in part by the Swedish Research Council and by the Knut and Alice Wallenberg Foundation, and M. Sugiyama was supported in part by KAKENHI 17H00757. Lastly, the authors thank all of the anonymous reviewers for their very insightful comments.
Appendix A Tools to prove Theorem 1
Some known properties of RKHSs and Dynkin's formula, which will be used to prove Theorem 1, are given below.
Lemma A.1 ((minh2010some, Theorem 2)).
Let be any set with nonempty interior. Then, the RKHS associated with the Gaussian kernel for an arbitrary scale parameter does not contain any polynomial on , including the nonzero constant function.
Proposition A.1 ((zhou2008derivative, Theorem 1)).
Let be the RKHS associated with a Mercer kernel , where is compact with nonempty interior. Then, , and
(A.1) 
Dynkin’s formula
Under Assumption 1, we obtain Dynkin's formula (cf. (oksendal2003stochastic, Theorems 7.3.3 and 7.4.1)):
E^x[ φ(x(τ)) ] = φ(x) + E^x[ ∫₀^τ (𝒜^π φ)(x(s)) ds ],    (A.2)
for any stopping time and any test function that is twice continuously differentiable with compact support, where the generator is defined in (5). Moreover, it holds (fleming2006controlled, Chapter III.3) that
(A.3) 
When the stopping time is the first exit time of a bounded set⁶, the condition on the test function is weakened to twice continuous differentiability over the bounded set (see the remark of (oksendal2003stochastic, Theorem 7.4.1)). See Figure B.1 for an intuition of the expectation taken over trajectories of the state starting from a given point.

⁶The first exit time of a bounded set is the first time at which the state, starting from a point in the set, leaves it.
Appendix B Proof of Theorem 1
Note first that the operator is well defined because the required derivatives exist for any element of the RKHS by Proposition A.1. We show that the operator is bijective and linear, and then show that the reproducing kernel of the learning space is given by (9).
Proof of (a)
Because the operator is surjective by the definition of its range, it remains to show that it is injective. The operator is linear because differentiation is linear by (A.1) in Proposition A.1:
(B.1) 
Hence, it is sufficient to show that the null space of the operator is trivial linearalgebra . Suppose that the image of an element is zero. It follows that the element is annihilated by the operator defined in (6). Under Assumptions 1 and 2 (i.e., compactness of the state set), Dynkin's formula (A.2) and (A.3) can be applied to this element over the bounded set because it is twice continuously differentiable. (i) When the discount factor is strictly positive, we apply (A.3). Under Assumption 1, we can consider the time being taken to infinity. Because the discounted terminal term vanishes, we obtain
(B.2) 
Under Assumption 2 (i.e., compactness and invariance of the state set), the element is bounded, from which it follows that it vanishes over the set. (ii) When the discount factor is zero and there exists a stochastically asymptotically stable point, we apply (A.2) to