Continuous-time Value Function Approximation in Reproducing Kernel Hilbert Spaces

Motivated by the success of reinforcement learning (RL) for discrete-time tasks such as AlphaGo and Atari games, there has been a recent surge of interest in using RL for continuous-time control of physical systems (cf. many challenging tasks in OpenAI Gym and the DeepMind Control Suite). Since discretization of time is susceptible to error, it is methodologically more desirable to handle the system dynamics directly in continuous time. However, very few techniques exist for continuous-time RL and they lack flexibility in value function approximation. In this paper, we propose a novel framework for continuous-time value function approximation based on reproducing kernel Hilbert spaces. The resulting framework is so flexible that it can accommodate any kind of kernel-based approach, such as Gaussian processes and the adaptive projected subgradient method, and it allows us to handle uncertainties and nonstationarity without prior knowledge about the environment or what basis functions to employ. We demonstrate the validity of the presented framework through experiments.

1 Introduction

Reinforcement learning (RL) reinforcement ; lewis2009reinforcement ; sugiyama2015statistical has been successful in a variety of applications such as AlphaGo and Atari games, particularly for discrete stochastic systems. Recently, the application of RL to physical control tasks has also been gaining attention, because solving an optimal control problem (or the Hamilton-Jacobi-Bellman-Isaacs equation) optimalcontrol directly is computationally prohibitive for complex nonlinear system dynamics and/or cost functions.

In the physical world, states and actions are continuous, and many dynamical systems evolve in continuous time. OpenAI Gym openaigym and the DeepMind Control Suite controlsuite offer several representative examples of such physical tasks. When handling continuous-time (CT) systems, CT formulations are methodologically preferable to discrete-time (DT) formulations with small time intervals, since such discretization is susceptible to errors. In terms of computational complexity and ease of analysis, CT formulations are also more advantageous than their DT counterparts for control-theoretic analyses such as stability and forward invariance khalil1996noninear , which are useful for safety-critical applications. As we will show in this paper, our framework allows us to constrain control inputs and/or states in a computationally efficient way.

One of the early examples of RL for CT systems baird1994reinforcement pointed out that Q-learning is incapable of learning in continuous time and proposed advantage updating. Convergence proofs were given in munos1998reinforcement for systems described by stochastic differential equations (SDEs) oksendal2003stochastic using a grid-based discretization of states and time. Stochastic differential dynamic programming and RL have also been studied in, for example, theodorou2010stochastic ; pan2014probabilistic ; theodorou2010reinforcement . For continuous states and actions, function approximators are often employed instead of finely discretizing the state space, to avoid an explosion of computational complexity. The work in doya2000reinforcement presented an application of CT-RL with function approximators such as Gaussian networks with a fixed number of basis functions. In vamvoudakis2010online , it was mentioned that any continuously differentiable value function (VF) can be approximated by increasing the number of independent basis functions to infinity in CT scenarios, and a CT policy iteration was proposed.

However, without resorting to the theory of reproducing kernels aronszajn50 , determining the number of basis functions and selecting a suitable basis function class cannot, in general, be performed systematically. Nonparametric learning is often desirable when no a priori knowledge about a suitable set of basis functions is available. Kernel-based methods offer many nonparametric learning algorithms, ranging from Gaussian processes (GPs) rasmussen2006gaussian to kernel adaptive filters (KFs) liu_book10 , which can provably deal with uncertainties and nonstationarity. While DT kernel-based RL was studied in ormoneit2002kernel ; xu2007kernel ; taylor2009kernelized ; sun2016online ; nishiyama2012hilbert ; grunewalder2012modelling ; ohnishi2018safety , for example, and the Gaussian process temporal difference (GPTD) algorithm was presented in engel2005reinforcement , no CT kernel-based RL has been proposed to our knowledge. Moreover, there is no unified framework in which existing kernel methods and their convergence/tracking analyses can be straightforwardly applied to model-based VF approximation.

In this paper, we present a novel theoretical framework of model-based CT-VF approximation in reproducing kernel Hilbert spaces (RKHSs) aronszajn50 for systems described by SDEs. The RKHS for learning is defined through a one-to-one correspondence to a user-defined RKHS in which the VF being obtained lies, and we then derive the associated kernel to be used for learning. The resulting framework renders any kind of kernel-based method applicable to model-based CT-VF approximation, including GPs rasmussen2006gaussian and KFs liu_book10 . In addition, we propose an efficient barrier-certified policy update for CT systems, which implicitly enforces state constraints. Relations of our framework to the existing approaches for DT, DT stochastic (Markov decision process (MDP)), CT, and CT stochastic systems are shown in Table 1. Our proposed framework covers model-based VF approximation working in RKHSs, including those for CT and CT stochastic systems. We verify the validity of the framework on the classical Mountain Car problem and a simulated inverted pendulum.

                  | DT                             | DT stochastic (MDP)              | CT                            | CT stochastic
Non kernel-based  | (e.g., baird1995residual)      | (e.g., tsitsiklis1997analysis)   | (e.g., doya2000reinforcement) | (e.g., munos1998reinforcement)
Kernel-based      | (e.g., engel2005reinforcement) | (e.g., grunewalder2012modelling) | (This work)                   | (This work)
Table 1: Relations to the existing approaches

2 Problem setting

Throughout, $\mathbb{R}$, $\mathbb{Z}_{\geq 0}$, and $\mathbb{N}$ denote the sets of real numbers, nonnegative integers, and strictly positive integers, respectively. We suppose that the system dynamics described by the SDE oksendal2003stochastic ,

$dx(t) = f(x(t), u(t))\,dt + \sigma(x(t), u(t))\,dw(t)$,   (1)

is known or learned, where $x(t)$, $u(t)$, and $w(t)$ are the state, control, and a Brownian motion of dimensions $n_x$, $n_u$, and $n_w$, respectively, $f$ is the drift, and $\sigma$ is the diffusion. A Brownian motion can be regarded as process noise, and the solution of (1) is known to satisfy the Markov property oksendal2003stochastic . Given a policy $\pi$, we define $f_{\pi}(x) := f(x, \pi(x))$ and $\sigma_{\pi}(x) := \sigma(x, \pi(x))$, and make the following two assumptions.

Assumption 1.

For any Lipschitz continuous policy $\pi$, both $f_{\pi}$ and $\sigma_{\pi}$ are Lipschitz continuous, i.e., the stochastic process defined in (1) is an Itô diffusion (oksendal2003stochastic, Definition 7.1.1), which has a pathwise unique solution for $t \geq 0$.

Assumption 2.

The set $\mathcal{X} \subset \mathbb{R}^{n_x}$ is compact with nonempty interior $\mathrm{int}(\mathcal{X})$, and is invariant under the system (1) with any Lipschitz continuous policy $\pi$, i.e.,

$P^{x}\left[x(t) \in \mathcal{X},\ \forall t \geq 0\right] = 1, \quad \forall x \in \mathcal{X}$,   (2)

where $P^{x}$ denotes the probability that $x(t)$ lies in $\mathcal{X}$ when starting from $x(0) = x$.

Assumption 2 implies that a solution of the system (1) stays in $\mathcal{X}$ with probability one. We refer the readers to khasminskii2011stochastic for stochastic stability and invariance for SDEs.
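As a concrete illustration of the kind of system in (1) (not taken from the paper), the following Python sketch simulates a closed-loop SDE with the Euler-Maruyama scheme; the drift, diffusion, and policy below are hypothetical placeholders.

  # A minimal sketch (not from the paper) of simulating the controlled SDE (1),
  # dx = f(x, u) dt + sigma(x, u) dw, with the Euler-Maruyama scheme. The drift
  # f, diffusion sigma, and policy below are hypothetical placeholders.
  import numpy as np

  def simulate_sde(f, sigma, policy, x0, dt=1e-3, steps=1000, seed=0):
      """Return one sample path of the closed-loop Ito diffusion."""
      rng = np.random.default_rng(seed)
      x = np.array(x0, dtype=float)
      path = [x.copy()]
      for _ in range(steps):
          u = policy(x)
          dw = rng.normal(scale=np.sqrt(dt), size=x.shape)   # Brownian increment
          x = x + f(x, u) * dt + sigma(x, u) @ dw            # Euler-Maruyama step
          path.append(x.copy())
      return np.array(path)

  # Illustrative 2-D linear system with a Lipschitz linear feedback policy.
  A = np.array([[0.0, 1.0], [-1.0, -0.5]])
  f = lambda x, u: A @ x + u
  sigma = lambda x, u: 0.05 * np.eye(2)
  policy = lambda x: -0.5 * x
  path = simulate_sde(f, sigma, policy, x0=[1.0, 0.0])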

In this paper, we consider the immediate cost function $c(x, u)$ (which might be obtained as the negation of a reward function), which is continuous and satisfies $\mathbb{E}^{x}\left[\int_0^{\infty} e^{-\beta t}\, |c_{\pi}(x(t))|\, dt\right] < \infty$, where $\mathbb{E}^{x}$ is the expectation over all trajectories (time evolutions of $x(t)$) starting from $x(0) = x$, $c_{\pi}(x) := c(x, \pi(x))$, and $\beta \geq 0$ is the discount factor. Note that this boundedness implies that $\beta > 0$ or that there exists a zero-cost state which is stochastically asymptotically stable khasminskii2011stochastic . Specifically, we consider the case where the immediate cost is not known a priori but is sequentially observed. Now, the VF associated with a policy $\pi$ is given by

$V^{\pi}(x) := \mathbb{E}^{x}\left[\int_0^{\infty} e^{-\beta t}\, c_{\pi}(x(t))\, dt\right], \quad x \in \mathcal{X}$,   (3)

where $\mathbb{E}^{x}$ and $c_{\pi}$ are as defined above.
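To make the definition (3) concrete, here is a minimal Monte-Carlo sketch (not from the paper) that approximates the discounted value at a single state by simulating the SDE and discretizing the cost integral; the drift, diffusion, policy, cost, and discount factor are illustrative assumptions.

  # A minimal Monte-Carlo sketch of the discounted value (3) at a single state:
  # the SDE is simulated by Euler-Maruyama and the cost integral is discretized
  # on the same time grid. All model ingredients below are hypothetical.
  import numpy as np

  def mc_value(x0, beta=1.0, dt=1e-3, steps=5000, n_paths=200, seed=0):
      rng = np.random.default_rng(seed)
      f = lambda x, u: -x + u              # drift (illustrative)
      sigma = lambda x, u: 0.1             # diffusion (scalar, illustrative)
      policy = lambda x: -0.5 * x          # Lipschitz feedback policy
      cost = lambda x, u: x ** 2 + u ** 2  # immediate cost c(x, u)
      values = []
      for _ in range(n_paths):
          x, v = float(x0), 0.0
          for k in range(steps):
              u = policy(x)
              v += np.exp(-beta * k * dt) * cost(x, u) * dt   # discounted running cost
              x += f(x, u) * dt + sigma(x, u) * np.sqrt(dt) * rng.standard_normal()
          values.append(v)
      return float(np.mean(values))

  print(mc_value(1.0))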

The advantages of using CT formulations include smooth control performance and efficient policy updates, and CT formulations require no elaborate partitioning of time doya2000reinforcement . In addition, our work shows that CT formulations make control-theoretic analyses easier and computationally more efficient, and that they are less susceptible to errors when the time interval is small. We mention that the CT formulation can still be considered even though the algorithm is ultimately implemented in discrete time.

With these problem settings in place, our goal is to estimate the CT-VF $V^{\pi}$ in an RKHS and improve policies. However, since the output of $V^{\pi}$ is unobservable and the so-called double-sampling problem arises when approximating VFs (see, e.g., sutton2009convergent ; konda2004convergence ), kernel-based supervised learning and its analysis cannot be directly applied to VF approximation in general. Motivated by this fact, we propose a novel model-based CT-VF approximation framework which enables us to conduct kernel-based VF approximation as supervised learning.

3 Model-based CT-VF approximation in RKHSs

In this section, we briefly present an overview of our framework. We take the following steps:

  1. Select an RKHS $\mathcal{H}_V$ which is supposed to contain the VF $V^{\pi}$ as one of its elements.

  2. Construct another RKHS $\mathcal{H}_c$ under one-to-one correspondence to $\mathcal{H}_V$ through a certain bijective linear operator $L$ to be defined in the next section.

  3. Estimate the immediate cost function $c_{\pi}$ in the RKHS $\mathcal{H}_c$ by kernel-based supervised learning, and return its estimate $\hat{c}_{\pi}$.

  4. Obtain an estimate of the VF immediately as $\hat{V}^{\pi} = L^{-1}\hat{c}_{\pi}$.

An illustration of our framework is depicted in Figure 1.

Figure 1: An illustration of the main ideas of our proposed framework. Given a system dynamics and an RKHS $\mathcal{H}_V$ for the VF $V^{\pi}$, define $\mathcal{H}_c$ under one-to-one correspondence to $\mathcal{H}_V$, estimate the observable immediate cost function $c_{\pi}$ in $\mathcal{H}_c$, and obtain $\hat{V}^{\pi}$ by bringing the estimate back to $\mathcal{H}_V$.

Note that we can avoid the double-sampling problem because the operator $L$ is deterministic even though the system dynamics is stochastic. Therefore, under this framework, model-based CT-VF approximation in RKHSs can be derived, and convergence/tracking analyses of kernel-based supervised learning can also be applied to VF approximation.
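As an illustration of step 3 above, the sketch below fits observed immediate costs by kernel ridge regression with a plain Gaussian kernel and an arbitrary regularization weight; in the framework itself, the learning would be carried out with the transformed kernel of Theorem 1 so that the fitted expansion can be mapped back to a VF estimate through the inverse operator.

  # An illustrative sketch of kernel-based supervised learning of the observed
  # immediate cost (step 3), using kernel ridge regression with a plain Gaussian
  # kernel; the kernel width and regularization weight are assumptions.
  import numpy as np

  def fit_cost(X, c, width=1.0, reg=1e-3):
      """Return a callable x -> c_hat(x) fitted to samples (X[i], c[i])."""
      sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
      K = np.exp(-sq_dists / (2.0 * width ** 2))
      alpha = np.linalg.solve(K + reg * np.eye(len(X)), c)  # representer coefficients

      def c_hat(x):
          k = np.exp(-((X - x) ** 2).sum(-1) / (2.0 * width ** 2))
          return float(k @ alpha)

      return c_hat

  # Toy usage with synthetic samples (illustrative only).
  X = np.random.default_rng(0).uniform(-1.0, 1.0, size=(50, 2))
  c = (X ** 2).sum(-1)
  c_hat = fit_cost(X, c)
  print(c_hat(np.array([0.2, -0.3])))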

Policy update while restricting certain regions of the state space

  Estimate of the VF: $\hat{V}^{\pi} = L^{-1}\hat{c}_{\pi}$
  for $t = 0, 1, 2, \ldots$ do
     - Receive the state $x_t$, the control $u_t$, and the immediate cost $c_t$
     - Update the estimate $\hat{c}_{\pi}$ of $c_{\pi}$ by using some kernel-based method in $\mathcal{H}_c$, e.g., Section 6
     - Update the policy with barrier certificates when $c_{\pi}$ is well estimated, e.g., (11)
  end for
Algorithm 1 Model-based CT-VF Approximation in RKHSs with Barrier-Certified Policy Updates
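The following Python skeleton sketches the loop of Algorithm 1; every component (observe, cost_model, value_estimate, barrier_policy_update) is a hypothetical placeholder standing in for the environment interface, the kernel-based learner in the cost RKHS, and the QP-based update (11), rather than an implementation from the paper.

  # A skeleton of Algorithm 1 (illustration only): all callables/objects are
  # hypothetical placeholders, not the paper's implementation.
  def run_algorithm1(policy, cost_model, observe, value_estimate,
                     barrier_policy_update, n_steps):
      for t in range(n_steps):
          x_t, u_t, c_t = observe(policy)          # receive state, control, and cost
          cost_model.update(x_t, u_t, c_t)         # kernel-based update of c_hat in H_c
          if cost_model.is_well_estimated():       # update the policy once c_hat is reliable
              V_hat = value_estimate(cost_model)   # bring the estimate back to H_V
              policy = barrier_policy_update(V_hat, policy)  # barrier-certified update (11)
      return policy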

As mentioned above, one of the advantages of a CT framework is its affinity for control-theoretic analyses such as stability and forward invariance, which are useful for safety-critical applications. For example, suppose that we need to restrict the region of exploration in the state space to some set $\mathcal{C} := \{x \in \mathcal{X} : h(x) \geq 0\}$, where $h$ is smooth. This is often required for safety-critical applications.

Figure 2: An illustration of barrier-certified policy updates. State constraints are implicitly enforced via barrier certificates.

To this end, control inputs must be properly constrained so that the state trajectory remains inside the set $\mathcal{C}$. In the safe RL context, there exists the idea of considering a smaller space of allowable policies (see Safesurvey and references therein). To effectively constrain policies, we employ control barrier certificates (cf. xu2015robustness ; wieland2007constructive ; glotfelter2017nonsmooth ; wang2017safety ; ames2016control ; 2017dcbf ). It is known that any Lipschitz continuous policy satisfying control barrier certificates renders the set $\mathcal{C}$ forward invariant xu2015robustness , i.e., the state trajectory remains inside the set $\mathcal{C}$, without explicitly calculating the state trajectory over a long time horizon. In other words, we can implicitly enforce state constraints by satisfying barrier certificates when updating policies. Barrier-certified policy update was first introduced in ohnishi2018safety for DT systems, but it is computationally more efficient in our CT scenario. This concept is illustrated in Figure 2, where the outer set is the space of Lipschitz continuous policies and the inner set is the space of barrier-certified allowable policies.

A brief summary of the proposed model-based CT-VF approximation in RKHSs is given in Algorithm 1. In the next section, we present theoretical analyses of our framework.

4 Theoretical analyses

We presented the motivations and an overview of our framework in the previous section. In this section, we validate the proposed framework from theoretical viewpoints. Because the output of the VF is unobservable, we follow the strategy presented in the previous section. First, by properly identifying the RKHS $\mathcal{H}_V$ which is supposed to contain the VF, we can implicitly restrict the class of the VF. If the VF is twice continuously differentiable over $\mathcal{X}$ (see, for example, (fleming2006controlled, Chapter IV) and krylov2008controlled for more detailed arguments about the conditions under which twice continuous differentiability is guaranteed), we obtain the following Hamilton-Jacobi-Bellman-Isaacs equation oksendal2003stochastic :

$\beta V^{\pi}(x) = c_{\pi}(x) + (\mathcal{L}^{\pi} V^{\pi})(x), \quad x \in \mathcal{X}$,   (4)

where the infinitesimal generator $\mathcal{L}^{\pi}$ is defined as

$(\mathcal{L}^{\pi} V)(x) := f_{\pi}(x)^{\top} \nabla V(x) + \frac{1}{2}\mathrm{tr}\left(\sigma_{\pi}(x)\sigma_{\pi}(x)^{\top} \nabla^{2} V(x)\right)$.   (5)

Here, $\mathrm{tr}(\cdot)$ stands for the trace, and $\nabla V$ and $\nabla^{2} V$ denote the gradient and the Hessian of $V$, respectively. By employing a suitable RKHS such as a Gaussian RKHS for $\mathcal{H}_V$, we can guarantee twice continuous differentiability of an estimated VF. Note that functions in a Gaussian RKHS are smooth minh2010some , and any continuous function on a compact subset of $\mathbb{R}^{n_x}$ can be approximated with arbitrary accuracy steinwart01 in a Gaussian RKHS.
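To make the generator (5) concrete, the following sketch applies it numerically to a test function using central finite differences; the closed-loop drift, diffusion, and test function are hypothetical placeholders.

  # A numerical sketch of the generator (5): the gradient and Hessian of V are
  # approximated by central finite differences. f_pi, sigma_pi, and V below are
  # hypothetical placeholders.
  import numpy as np

  def apply_generator(V, f_pi, sigma_pi, x, eps=1e-4):
      """Approximate f_pi(x)^T grad V(x) + 0.5 tr(sigma sigma^T Hess V(x))."""
      n = len(x)
      I = np.eye(n)
      grad = np.array([(V(x + eps * I[i]) - V(x - eps * I[i])) / (2 * eps)
                       for i in range(n)])
      hess = np.array([[(V(x + eps * (I[i] + I[j])) - V(x + eps * (I[i] - I[j]))
                         - V(x - eps * (I[i] - I[j])) + V(x - eps * (I[i] + I[j])))
                        / (4 * eps ** 2) for j in range(n)] for i in range(n)])
      s = sigma_pi(x)
      return f_pi(x) @ grad + 0.5 * np.trace(s @ s.T @ hess)

  # Example with a quadratic test function and a linear closed-loop system.
  V = lambda x: float(x @ x)
  f_pi = lambda x: -x
  sigma_pi = lambda x: 0.1 * np.eye(len(x))
  print(apply_generator(V, f_pi, sigma_pi, np.array([1.0, 0.5])))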

Next, we need to construct another RKHS $\mathcal{H}_c$ which contains the immediate cost function $c_{\pi}$ as one of its elements. The relation between the VF and the immediate cost function is given by rewriting (4) as

$c_{\pi} = (\beta I - \mathcal{L}^{\pi}) V^{\pi} =: L V^{\pi}$,   (6)

where $I$ is the identity operator. To define the operator $L$ over the whole $\mathcal{H}_V$, we use the following definition.

Definition 1 ((zhou2008derivative, Definition 1)).

Let for . Define where . If is compact with nonempty interior , is the space of functions over such that is well defined and continuous over for each . Define to be the space of continuous functions over such that and that has a continuous extension to for each . If , define

Now, suppose that $\mathcal{H}_V$ is an RKHS associated with the reproducing kernel $k$. Then, we define the operator $L$ as

$(L \phi)(x) := \beta \phi(x) - \sum_{i=1}^{n_x} f_{\pi, i}(x)\,\frac{\partial \phi}{\partial x_i}(x) - \frac{1}{2}\sum_{i, j=1}^{n_x} a_{ij}(x)\,\frac{\partial^2 \phi}{\partial x_i \partial x_j}(x), \quad \phi \in \mathcal{H}_V$,   (7)

where $a_{ij}(x)$ is the $(i, j)$ entry of $\sigma_{\pi}(x)\sigma_{\pi}(x)^{\top}$ and $f_{\pi, i}$ is the $i$th entry of $f_{\pi}$. Note here that $L = \beta I - \mathcal{L}^{\pi}$ over $\mathcal{H}_V$. We emphasize here that the expected value $V^{\pi}$ and the immediate cost $c_{\pi}$ are related through the deterministic operator $L$. The following main theorem states that $\mathcal{H}_c := L(\mathcal{H}_V)$ is indeed an RKHS under Assumptions 1 and 2, and its corresponding reproducing kernel is obtained.

Theorem 1.

Under Assumptions 1 and 2, suppose that $\mathcal{H}_V$ is an RKHS associated with the reproducing kernel $k$. Suppose also that (i) $\beta > 0$, or that (ii) $\mathcal{H}_V$ is a Gaussian RKHS, $\beta = 0$, and there exists a point which is stochastically asymptotically stable over $\mathcal{X}$, i.e., the state converges to that point with probability one for any starting point in $\mathcal{X}$. Then, the following statements hold.
(a) The space $\mathcal{H}_c := L(\mathcal{H}_V)$ is a Hilbert space isomorphic to $\mathcal{H}_V$, equipped with the inner product defined by

(8)

where the operator $L$ is defined in (7).
(b) The Hilbert space $\mathcal{H}_c$ has the reproducing kernel $\bar{k}$ given by

(9)

where

(10)
Proof.

See Appendices A and B in the supplementary document. ∎

Under Assumptions 1 and 2, Theorem 1 implies that the VF is uniquely determined by the immediate cost function for a policy if the VF lies in an RKHS of a particular class. In fact, the relation between the VF and the immediate cost function in (4) is based on the assumption that the VF is twice continuously differentiable over $\mathcal{X}$, and the verification theorem (cf. fleming2006controlled ) states that, when the immediate cost function and a twice continuously differentiable function satisfying the relation (4) are given under certain conditions, that twice continuously differentiable function is indeed the VF. In Theorem 1, on the other hand, we first restrict the class of the VF by identifying an RKHS $\mathcal{H}_V$, and then approximate the immediate cost function in the RKHS $\mathcal{H}_c$, any element of which satisfies the relation (4). Because the immediate cost is observable, we can employ any kernel-based supervised learning method to estimate $c_{\pi}$ in the RKHS $\mathcal{H}_c$, such as GPs and KFs, as elaborated in Section 6.

In the RKHS $\mathcal{H}_c$, an estimate of $c_{\pi}$ at each time instant is given by a finite kernel expansion over the set of samples collected so far, with the reproducing kernel $\bar{k}$ defined in (9). An estimate of the VF at the same time instant is thus immediately obtained by applying the inverse operator $L^{-1}$ to this expansion, using the function defined in (10).

Note that, when the system dynamics is described by an ordinary differential equation (i.e., the diffusion term vanishes), the assumptions that the VF is twice continuously differentiable and that the kernel is correspondingly smooth are relaxed to continuous differentiability of the VF and a correspondingly weaker smoothness requirement on the kernel, respectively.

As an illustrative example of Theorem 1, we show the case of the linear-quadratic regulator (LQR) below.

Special case: linear-quadratic regulator

Consider a linear feedback policy, i.e., $\pi(x) = Kx$, and a linear system $\dot{x}(t) = Ax(t) + Bu(t)$, where $A$ and $B$ are matrices of appropriate dimensions. In this case, we know that the value function is quadratic with respect to the state variable (cf. zhou1996robust ). Therefore, we employ an RKHS with a quadratic kernel for $\mathcal{H}_V$, i.e., an RKHS whose elements are quadratic forms of the state. If we assume that the input space is so large that this RKHS accommodates the quadratic form of any real symmetric matrix, we can work with value functions of the form $V(x) = x^{\top} P x$ with $P$ symmetric.

Moreover, it follows from the product rule of the directional derivative bonet1997nonlinear that, for $V(x) = x^{\top} P x$, $(\mathcal{L}^{\pi} V)(x) = x^{\top}\left(A_{\mathrm{cl}}^{\top} P + P A_{\mathrm{cl}}\right) x$, where $A_{\mathrm{cl}} := A + BK$ is the closed-loop matrix. Note that $A_{\mathrm{cl}}^{\top} P + P A_{\mathrm{cl}}$ is symmetric, so $L V = \beta V - \mathcal{L}^{\pi} V$ is again a quadratic form with a symmetric matrix. If $A_{\mathrm{cl}}$ is stable (Hurwitz), the one-to-one correspondence between $\mathcal{H}_V$ and $\mathcal{H}_c$ from Theorem 1 thus implies that $\mathcal{H}_c$ also consists of quadratic forms. Therefore, we can fully approximate the immediate cost function in $\mathcal{H}_c$ if it is quadratic with respect to the state variable.

Suppose that the immediate cost function is given by $c_{\pi}(x) = x^{\top} Q x$. Then, the estimated value function is given by $\hat{V}^{\pi}(x) = x^{\top} P x$, where $P$ satisfies $\beta P - A_{\mathrm{cl}}^{\top} P - P A_{\mathrm{cl}} = Q$, which for $\beta = 0$ is indeed the well-known Lyapunov equation zhou1996robust . Unlike Gaussian-kernel cases, we only require a finite number of parameters to fully approximate the immediate cost function, and hence $\hat{V}^{\pi}$ is analytically obtainable.
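As a numerical illustration of this special case, the sketch below evaluates a linear feedback policy by solving the undiscounted Lyapunov equation with SciPy; the matrices A, B, K, and Q are arbitrary illustrative choices, not values from the paper.

  # Evaluating a linear feedback policy via the (undiscounted) Lyapunov equation
  # A_cl^T P + P A_cl + Q = 0, so that V(x) = x^T P x. All matrices below are
  # illustrative placeholders.
  import numpy as np
  from scipy.linalg import solve_continuous_lyapunov

  A = np.array([[0.0, 1.0], [-1.0, -0.5]])
  B = np.array([[0.0], [1.0]])
  K = np.array([[-1.0, -1.0]])        # feedback gain, u = K x (A + B K is Hurwitz)
  A_cl = A + B @ K                    # closed-loop drift matrix
  Q = np.eye(2)                       # quadratic immediate cost x^T Q x

  P = solve_continuous_lyapunov(A_cl.T, -Q)
  x = np.array([1.0, 0.0])
  print(x @ P @ x)                    # value of the linear feedback policy at x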

Barrier-certified policy updates under CT formulation

Next, we show that the CT formulation makes barrier-certified policy updates computationally more efficient under certain conditions. Assume that the system dynamics is affine in the control, i.e., the drift takes the form $f(x) + g(x)u$, and that the immediate cost is given by $c(x, u) = q(x) + u^{\top} R u$, where $q$ is continuous and $R$ is a positive definite matrix. Then, any Lipschitz continuous policy satisfying the control barrier certificate, an inequality that is affine in the control input, renders the set $\mathcal{C}$ forward invariant xu2015robustness , i.e., the state trajectory remains inside the set $\mathcal{C}$, where the associated class-$\mathcal{K}$ function $\alpha$ is strictly increasing and satisfies $\alpha(0) = 0$. Taking this constraint into account, the barrier-certified greedy policy update is given by

(11)

which is, by virtue of the CT formulation, a quadratic programming (QP) problem at each state when the barrier certificate defines affine constraints (see Appendix C in the supplementary document). The space of allowable policies is thus given by the set of Lipschitz continuous policies satisfying these constraints. When the dynamics is learned by GPs as in pan2014probabilistic , the work in wang2017safe provides a barrier certificate for uncertain dynamics. Note that one can also employ a function approximator or add noise to the greedily updated policy to avoid unstable performance and promote exploration (see, e.g., doya2000reinforcement ). To see whether the updated policy remains in the space of Lipschitz continuous policies, we present the following proposition.

Proposition 1.

Assume the conditions in Theorem 1. Assume also that the barrier certificate defines affine constraints, and that the functions appearing in (11), as well as the derivative of the barrier function, are Lipschitz continuous over $\mathcal{X}$. Then, the policy defined in (11) is Lipschitz continuous over $\mathcal{X}$ if the width of the feasible set (width indicates how much control margin is left for the strictest constraint, as defined in (morris2013sufficient, Equation 21)) is strictly larger than zero over $\mathcal{X}$.

Proof.

See Appendix D in the supplementary document. ∎

Note that, if the barrier certificate merely bounds each entry of the control input, it defines affine constraints. Lastly, the width of the feasible set is strictly larger than zero if the admissible control set is sufficiently large.
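As a concrete sketch of the QP behind (11) (not the paper's exact formulation), the following code solves the barrier-certified greedy update at a single state with CVXPY, assuming control-affine dynamics, a quadratic control cost, and a linear class-K function; all model terms are hypothetical placeholders.

  # A sketch of the barrier-certified greedy policy update at one state under
  # the control-affine and quadratic-cost assumptions above. All model terms
  # (f, g, grad_V, h, grad_h) and alpha(s) = gamma * s are hypothetical; the
  # exact objective/constraint of (11) in the paper may differ.
  import numpy as np
  import cvxpy as cp

  def barrier_certified_greedy(x, f, g, grad_V, R, h, grad_h, gamma=1.0):
      m = R.shape[0]
      u = cp.Variable(m)
      # Greedy objective: u^T R u + grad_V(x)^T g(x) u (u-independent terms dropped).
      objective = cp.Minimize(cp.quad_form(u, R) + (grad_V(x) @ g(x)) @ u)
      # Control barrier certificate, affine in u: dh/dt >= -gamma * h(x).
      constraints = [grad_h(x) @ (f(x) + g(x) @ u) >= -gamma * h(x)]
      cp.Problem(objective, constraints).solve()
      return u.value

Because the constraint is affine in the control and the objective is convex quadratic, this problem can be solved efficiently at every control cycle.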

We will further clarify the relations of the proposed theoretical framework to existing works below.

5 Relations to existing works

First, our proposed framework takes advantage of the capability of RKHSs to learn complicated functions and of their nonparametric flexibility, and it reproduces some of the existing model-based DT-VF approximation techniques (see Appendix E in the supplementary document). Note that some of the existing DT-VF approximations in RKHSs, such as GPTD engel2005reinforcement , also work for model-free cases (see ohnishi2018safety for model-free adaptive DT action-value function approximation, for example). Second, since the RKHS for learning is explicitly defined in our framework, any kernel-based method and its convergence/tracking analyses are directly applicable. While, for example, the work in koppel2017parsimonious , which aims at attaining a sparse representation of an unknown function in an online fashion in RKHSs, was extended to policy evaluation koppel2017policy by addressing the double-sampling problem, our framework does not suffer from the double-sampling problem, and hence any kernel-based online learning method (e.g., koppel2017parsimonious ; yukawa_tsp12 ; tayu_tsp15 ) can be straightforwardly applied. Third, when the time interval is small, DT formulations become susceptible to errors, while CT formulations are immune to the choice of the time interval; on the other hand, a larger time interval poorly represents system dynamics evolving in continuous time. Lastly, barrier certificates are efficiently incorporated into our CT framework through QPs under certain conditions, and state constraints are implicitly taken into account. Stochastic optimal control approaches such as theodorou2010stochastic ; theodorou2010reinforcement require sample trajectories over predefined finite time horizons and compute the value along those trajectories, whereas our framework estimates the VF in an RKHS without having to follow the trajectories.

6 Applications and practical implementation

We apply the theory presented in the previous section to the Gaussian-kernel case, introduce CTGP as an example, and present a practical implementation. Assume, for simplicity, that the scale matrix $\Sigma$ of the Gaussian kernel is diagonal. The Gaussian kernel is given by $k(x, y) = \exp\left(-\tfrac{1}{2}(x - y)^{\top}\Sigma^{-1}(x - y)\right)$ for $x, y \in \mathcal{X}$. Given this Gaussian kernel $k$, the reproducing kernel $\bar{k}$ defined in (9) is derived in closed form up to a real-valued function whose explicit form is given in Appendix F of the supplementary document.
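A small numerical sketch of the Gaussian kernel with a diagonal scale matrix, as assumed above; the scale values are illustrative, and the closed-form correction that turns this kernel into the transformed kernel of (9) (Appendix F) is not included.

  # Gaussian kernel with a diagonal scale matrix Sigma (illustrative scales).
  import numpy as np

  def gaussian_kernel_diag(x, y, scales):
      """k(x, y) = exp(-0.5 * sum_i (x_i - y_i)^2 / scales_i^2)."""
      d = (np.asarray(x) - np.asarray(y)) / np.asarray(scales)
      return float(np.exp(-0.5 * d @ d))

  print(gaussian_kernel_diag([0.0, 1.0], [0.5, 0.5], scales=[1.0, 2.0]))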

CTGP

One of the celebrated properties of GPs is their Bayesian formulation, which enables us to deal with uncertainty through credible intervals. Suppose that the observation of the immediate cost at each time instant contains some additive Gaussian noise. Given the data collected up to the current time, we can employ GP regression to obtain the mean and the variance of $c_{\pi}$ at a point $x$ as

(12)

where $I$ is the identity matrix and the Gram matrix is built from the reproducing kernel $\bar{k}$ defined in (9). Then, by the existence of the inverse operator $L^{-1}$, the mean and the variance of $V^{\pi}$ at a point $x$ can be given by

(13)

(see Appendix G in the supplementary document for more details).
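The sketch below shows the standard GP-regression computations corresponding to (12), using a plain Gaussian kernel in place of the transformed kernel of (9); the noise variance and kernel width are illustrative assumptions, and mapping the posterior back through the inverse operator as in (13) is not shown.

  # GP posterior mean and variance of the noisy cost observations under a plain
  # Gaussian kernel (used here in place of the transformed kernel of (9)).
  import numpy as np

  def gaussian_gram(X, Y, width=1.0):
      sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
      return np.exp(-sq_dists / (2.0 * width ** 2))

  def gp_posterior(X_train, c_train, X_test, noise_var=1e-2, width=1.0):
      K = gaussian_gram(X_train, X_train, width) + noise_var * np.eye(len(X_train))
      Ks = gaussian_gram(X_test, X_train, width)
      alpha = np.linalg.solve(K, c_train)
      mean = Ks @ alpha                                          # posterior mean of the cost
      v = np.linalg.solve(K, Ks.T)
      var = np.ones(len(X_test)) - np.einsum('ij,ji->i', Ks, v)  # prior k(x, x) = 1
      return mean, var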

7 Numerical experiments

In this section, we first show the validity of the proposed CT framework and its advantage over DT counterparts when the time interval is small, and then compare CTGP and GPTD for RL on a simulated inverted pendulum. In both of the experiments, the coherence-based sparsification richard09 in the RKHS is employed to curb the growth of the dictionary size.

Policy evaluations: comparison of CT and DT approaches

We show that our CT approaches are advantageous over DT counterparts in terms of susceptibility to errors, using MountainCarContinuous-v0 in OpenAI Gym openaigym as the environment. The state consists of the position and the velocity of the car, both of which are clipped to bounded intervals, the dynamics is that of the MountainCarContinuous-v0 environment, and the goal is to reach a target position. In the simulation, the control cycle (i.e., the frequency of applying control inputs and observing the states and costs) is held fixed. The observed immediate cost is defined piecewise over the state space and includes a random component; the immediate cost for the DT cases is the CT immediate cost multiplied by the time interval. For policy evaluation, we use the policy obtained by RL based on the cross-entropy method (we used the code in https://github.com/udacity/deep-reinforcement-learning/blob/master/cross-entropy/CEM.ipynb offered by Udacity, which is based on PyTorch paszke2017automatic ). The four methods, CTGP, KF-based CT-VF approximation (CTKF), GPTD, and KF-based DT-VF approximation (DTKF), are used to learn the value functions associated with this policy by executing it for five episodes, each of which terminates when the goal is reached or a step limit is exceeded. GP-based techniques are expected to handle the random component added to the immediate cost. The new policies are then obtained by the barrier-certified policy updates under the CT formulation, and these policies are evaluated five times. Here, the barrier function prevents the velocity of the car from dropping below a threshold. Figure 3 compares the value functions learned by each method for two different time intervals (we used jet colormaps in Python Matplotlib to illustrate the value functions). We observe that the DT approaches are sensitive to the choice of the time interval. Table 2 compares the cumulative costs averaged over five episodes for each method and time interval, together with the numbers of times the velocity was observed below the threshold with and without the barrier certificate (numbers appended to the algorithm names indicate the lengths of the time intervals). Note that the cumulative costs are calculated by summing up the immediate costs multiplied by the duration of each control cycle, i.e., we discretize the cost integral based on the control cycle. It is observed that the CT approaches are immune to the choice of the time interval, while the performance of the DT approaches degrades when the time interval becomes small, and that the barrier-certified policy updates work efficiently.
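A small illustration of how the reported cumulative costs are computed: the immediate costs are summed and weighted by the control-cycle duration, i.e., a Riemann-sum discretization of the cost integral; the numbers below are placeholders, not values from the experiment.

  # Cumulative cost as described above (placeholder numbers, not experiment data).
  def cumulative_cost(immediate_costs, control_cycle):
      return sum(c * control_cycle for c in immediate_costs)

  print(cumulative_cost([1.0, 0.8, 0.5, 0.2], control_cycle=0.02))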

Figure 3: Illustrations of the value functions obtained by CTGP, CTKF, GPTD, and DTKF for two different time intervals (panels (a)-(f)). The policy is obtained by RL based on the cross-entropy method. The CT approaches are not affected by the choice of the time interval.
Table 2: Comparisons of the cumulative costs, and of the numbers of times the observed velocity fell below the threshold with and without barrier certificates, for CTKF, GPTD_1, DTKF_1, CTGP, GPTD_20, and DTKF_20 (numbers appended to the algorithm names indicate the lengths of the time intervals).

Reinforcement learning: inverted pendulum

Figure 4: Comparison of the time during which the pendulum stays up between CTGP and GPTD for the inverted pendulum (mean ± standard deviation).

We show the advantage of CTGP over GPTD when the time interval for the estimation is small. The state consists of the angle and the angular velocity of an inverted pendulum, and the dynamics includes a Brownian-motion term, which may represent outer disturbances and/or modeling error. In the simulation, the time interval is fixed, and the simulated dynamics evolves by Euler-Maruyama-type increments driven by Gaussian noise. In this experiment, the task is to stabilize the inverted pendulum at the upright position. The observed immediate cost penalizes deviations from the upright position. A trajectory associated with the current policy is generated to learn the VF; the trajectory terminates when the pendulum falls and restarts from a random initial angle, and the policy is updated after a fixed amount of simulated time. To evaluate the current policy, the average time over five episodes during which the pendulum stays up when initialized upright is used. Figure 4 compares this average time for CTGP and GPTD over up to five policy updates, with standard deviations, until stable policy improvement becomes difficult without heuristic techniques such as adding noise to the policies. Note that the same seed for the random number generator is used for the initializations of both approaches. It is observed that GPTD fails to improve the policies. The large credible interval of CTGP is due to the random initialization of the state.

8 Conclusion and future work

We presented a novel theoretical framework that renders the CT-VF approximation problem simultaneously solvable in an RKHS by conducting kernel-based supervised learning for the immediate cost function in the properly defined RKHS. Our CT framework is compatible with rich theories of control, including control barrier certificates for safety-critical applications. The validity of the proposed framework and its advantage over DT counterparts when the time interval is small were verified experimentally on the classical Mountain Car problem and a simulated inverted pendulum.

There are several possible directions to explore as future work. First, we can employ state-of-the-art kernel methods within our theoretical framework, or use other variants of RL, such as actor-critic methods, to improve practical performance. Second, we can consider uncertainties in value function approximation by virtue of the RKHS-based formulation, which might be used for safety verification. Lastly, the advantages of CT formulations for physical tasks are worth further exploration.

Acknowledgments

This work was partially conducted when M. Ohnishi was at the GRITS Lab, Georgia Institute of Technology; M. Ohnishi thanks the members of the GRITS Lab, including Dr. Li Wang, and Prof. Magnus Egerstedt for discussions regarding barrier functions. M. Yukawa was supported in part by KAKENHI 18H01446 and 15H02757, M. Johansson was supported in part by the Swedish Research Council and by the Knut and Alice Wallenberg Foundation, and M. Sugiyama was supported in part by KAKENHI 17H00757. Lastly, the authors thank all of the anonymous reviewers for their very insightful comments.

Appendix A Tools to prove Theorem 1

Some known properties of RKHSs and Dynkin’s formula which will be used to prove Theorem 1 are given below.

Lemma A.1 ((minh2010some, Theorem 2)).

Let $\mathcal{X}$ be any set with nonempty interior. Then, the RKHS associated with the Gaussian kernel for an arbitrary scale parameter does not contain any polynomial on $\mathcal{X}$, including the nonzero constant function.

Proposition A.1 ((zhou2008derivative, Theorem 1)).

Let $\mathcal{H}_k$ be the RKHS associated with a Mercer kernel $k$, where $\mathcal{X}$ is compact with nonempty interior. Then, the functions in $\mathcal{H}_k$ inherit the smoothness of $k$, and

(A.1)

Dynkin’s formula

Under Assumption 1, we obtain Dynkin's formula (cf. (oksendal2003stochastic, Theorem 7.3.3, Theorem 7.4.1)):

$\mathbb{E}^{x}\left[\phi(x(\tau))\right] = \phi(x) + \mathbb{E}^{x}\left[\int_{0}^{\tau}(\mathcal{L}^{\pi}\phi)(x(s))\,ds\right]$,   (A.2)

for any stopping time $\tau$ with $\mathbb{E}^{x}[\tau] < \infty$ and for any $\phi \in C_{0}^{2}(\mathbb{R}^{n_x})$, i.e., $\phi$ is twice continuously differentiable and has compact support, where $\mathcal{L}^{\pi}$ is defined in (5). Moreover, it holds (fleming2006controlled, Chapter III.3), for $\beta > 0$, that

$\mathbb{E}^{x}\left[e^{-\beta t}\phi(x(t))\right] = \phi(x) + \mathbb{E}^{x}\left[\int_{0}^{t} e^{-\beta s}\left((\mathcal{L}^{\pi}\phi)(x(s)) - \beta\phi(x(s))\right)ds\right]$.   (A.3)

When $\tau$ is the first exit time of a bounded set (the first exit time of a bounded set $U$ starting from a point $x \in U$ is given by $\inf\{t \geq 0 : x(t) \notin U\}$), the condition on $\phi$ is weakened to twice continuous differentiability over the bounded set (see the remark of (oksendal2003stochastic, Theorem 7.4.1)). See Figure B.1 for an intuition of the expectation taken over the trajectories of the state starting from $x$.

Appendix B Proof of Theorem 1

Note first that the operator $L$ is well defined because the required derivatives exist for any function in $\mathcal{H}_V$ by Proposition A.1. We show that $L$ is bijective and linear, and then show that the reproducing kernel of $\mathcal{H}_c$ is given by (9).

Proof of (a)

Because the operator $L$ is surjective by definition of $\mathcal{H}_c$, we show that $L$ is injective. The operator $L$ is linear because the generator $\mathcal{L}^{\pi}$ is linear by (A.1) in Proposition A.1:

$L(a f_1 + b f_2) = a L f_1 + b L f_2, \quad \forall a, b \in \mathbb{R},\ f_1, f_2 \in \mathcal{H}_V$.   (B.1)

Hence, it is sufficient to show that the null space of $L$ is trivial linearalgebra . Suppose that $LV = 0$ for some $V \in \mathcal{H}_V$. It follows that $(\beta I - \mathcal{L}^{\pi})V = 0$, where $L$ is defined in (6). Under Assumptions 1 and 2 (i.e., compactness of $\mathcal{X}$), Dynkin's formula (A.2) and (A.3) can be applied to $V$ over the bounded set $\mathrm{int}(\mathcal{X})$ because $V$ is twice continuously differentiable there. (i) When the discount factor $\beta > 0$, we apply (A.3) to $V$. Under Assumption 1, we can consider the time being taken to infinity. Because $LV = 0$, we obtain

$V(x) = \lim_{t \to \infty} \mathbb{E}^{x}\left[e^{-\beta t} V(x(t))\right]$.   (B.2)

Under Assumption 2 (i.e., compactness of $\mathcal{X}$ and invariance of $\mathcal{X}$), $V$ is bounded over $\mathcal{X}$, from which it follows that $V = 0$ over $\mathcal{X}$. (ii) When $\beta = 0$ and the zero-cost state is stochastically asymptotically stable over $\mathcal{X}$, we apply (A.2) to