## I Introduction

Markov Decision Problems offer a flexible framework for sequential decision-making tasks under uncertainty [2], and have gained broad interest in robotics [3], control [4], finance [5], and artificial intelligence [6]. Despite this surge of interest, few works in reinforcement learning address the computational difficulties associated with continuous state and action spaces in a principled way that guarantees convergence. The goal of this work is to develop new reinforcement learning tools for continuous problems which are provably stable and whose complexity is at worst moderate.

In the development of stochastic methods for reinforcement learning, one may attempt to estimate the transition density of the Markov Decision Process (MDP) (model-based [7]), perform gradient descent on the value function with respect to the policy (direct policy search [8]), or pursue value-function-based (model-free [9, 10]) methods which exploit structural properties of the setting to derive fixed point problems called *Bellman equations*. We adopt the latter approach in this work [11], motivated by the fact that an action-value function tells us both how to find a policy and how to evaluate it in terms of the performance metric we have defined, and that a value function encapsulates structural properties of the relationship between states, actions, and rewards.

It is well known that approaches featuring the “deadly triad” [6] of function approximation, bootstrapping (e.g., temporal-difference learning), and off-policy training are in danger of divergence, and the most well-understood techniques for ensuring convergence in a stochastic gradient descent context are those based on Gradient Temporal Difference (GTD) [12]. Though the final algorithm looks similar, our approach may be viewed as an alternative formulation and analysis of the GTD family of algorithms, centered on a flexible RKHS representation that lets us address problems with nonlinear, continuous state and action spaces in a natural way.

To understand our proposed approach, consider the fixed point problem defined by Bellman’s optimality equation [13]. When the state and action spaces are finite and small enough that expectations are computable, fixed point iterations may be used. When this fails to hold, stochastic fixed point methods, namely Q-learning [10], may be used, whose convergence may be addressed with asynchronous stochastic approximation theory [14, 15]. This approach is only valid when the action-value (or Q) function can be represented as a matrix. However, when the state and action spaces are infinite, this is no longer true, and the Q-function instead belongs to a generic function space.
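For concreteness, the finite-space stochastic fixed point iteration referenced above can be sketched as follows (a minimal tabular illustration of Q-learning with hypothetical step-size and discount values, not the continuous-space method developed in this paper):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    # Stochastic fixed-point iteration on the Bellman optimality equation:
    # move the table entry Q[s, a] toward the sampled target
    # r + gamma * max_a' Q[s', a'].
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

# Two states, two actions; all values initialized to zero.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

When the spaces are infinite, no such table exists, which is the difficulty the rest of the paper addresses.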

In particular, to solve the fixed point problem defined by Bellman’s optimality equation when spaces are continuous, one must surmount the fact that it is defined for infinitely many unknowns, one example of Bellman’s curse of dimensionality [13]. Efforts to sidestep this issue assume that the Q-function admits a finite parameterization, such as a linear [16, 12] or nonlinear [17] basis expansion, is defined by a neural network [18], or that it belongs to a reproducing kernel Hilbert space (RKHS) [19, 20]. In this work, we adopt the latter nonparametric approach, motivated by the fact that combining fixed point iterations with different parameterizations may cause divergence [21, 22], and in general the Q-function parameterization must be tied to the stochastic update to ensure the convergence of both the function sequence and its parameterization [23].

Our main result is a memory-efficient, nonparametric, stochastic method that converges to a fixed point of the Bellman optimality operator almost surely when it belongs to an RKHS. We obtain this result by reformulating the Bellman optimality equation as a nested stochastic program (Section II), a topic investigated in operations research [24] and probability [25, 26]. These problems have been addressed in finite settings with stochastic *quasi-gradient* (SQG) methods [27], which use two time-scale stochastic approximation to mitigate the fact that the objective’s stochastic gradient is not available due to its dependence on a *second expectation*, which is referred to as the double sampling problem in [12].

Here, we use a non-parametric generalization of SQG for Q-learning in infinite MDPs (Section III), motivated by its success for policy evaluation in finite [12, 17] and infinite MDPs [28]. However, a function in an RKHS has complexity comparable to the number of training samples processed, which is in general infinite, an issue that is often ignored in kernel methods for Markov decision problems [29, 30, 31, 32]. We address this bottleneck (the curse of kernelization) by requiring memory efficiency both in the function sample path and in its limit through the use of sparse projections which are constructed greedily via matching pursuit [33, 34], akin to [35, 28]. Greedy compression is appropriate since (a) kernel matrices induced by arbitrary data streams will likely become ill-conditioned and hence violate assumptions required by convex methods [36], and (b) parsimony is more important than exact recovery, as the SQG iterates are not the target signal but rather a stepping stone to the Bellman fixed point. Rather than unsupervised forgetting [37], we tie the projection-induced error to the requirement of stochastic descent [35], only keeping those dictionary points needed for convergence.

As a result, we conduct functional SQG descent via sparse projections of the SQG iterates. This maintains a moderate-complexity sample path exactly towards the optimum, which may be made arbitrarily close to a Bellman fixed point by decreasing the regularizer. In contrast to the convex structure in [28], the Bellman optimality equation induces a non-convex cost functional, which requires us to generalize the relationship between SQG for non-convex objectives and coupled supermartingales in [38] to RKHSs. In doing so, we establish that the sparse projected SQG sequence converges almost surely (Theorem 1) to the Bellman fixed point with decreasing learning rates (Section IV), and to a small Bellman error whose magnitude depends on the learning rates when they are held constant (Theorem 2). The use of constant learning rates further allows us to guarantee that the memory of the learned function remains under control. Moreover, on Continuous Mountain Car [39] and the Inverted Pendulum [40], we observe that our learned action-value function attains a favorable trade-off between memory efficiency and Bellman error, which then yields a policy whose performance is competitive with the state of the art in terms of episode average reward accumulation.

## II Markov Decision Processes

We model an autonomous agent in a continuous space as a Markov Decision Process (MDP) with continuous states $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^p$ and actions $\mathbf{a} \in \mathcal{A} \subseteq \mathbb{R}^q$. When in state $\mathbf{x}$ and taking action $\mathbf{a}$, a random transition to state $\mathbf{x}'$ occurs according to the conditional probability density $\mathbb{P}(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})$. After the agent transitions to a particular $\mathbf{x}'$ from $\mathbf{x}$, the MDP assigns an instantaneous reward $r(\mathbf{x}, \mathbf{a}, \mathbf{x}')$, where the reward function is a map $r : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to \mathbb{R}$.

In Markov Decision problems, the goal is to find the action sequence $\{\mathbf{a}_t\}_{t=0}^{\infty}$ so as to maximize the infinite-horizon discounted accumulation of rewards, i.e., the value function $V(\mathbf{x}; \{\mathbf{a}_t\}) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(\mathbf{x}_t, \mathbf{a}_t, \mathbf{x}_{t+1}) \mid \mathbf{x}_0 = \mathbf{x}\big]$, where $\gamma \in (0,1)$ is a discount factor. The action-value function is the conditional mean of the value function given the initial action $\mathbf{a}_0 = \mathbf{a}$:

$$Q(\mathbf{x}, \mathbf{a}; \{\mathbf{a}_t\}_{t=1}^{\infty}) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(\mathbf{x}_t, \mathbf{a}_t, \mathbf{x}_{t+1}) \,\Big|\, \mathbf{x}_0 = \mathbf{x}, \, \mathbf{a}_0 = \mathbf{a} \Big] \quad (1)$$

We define $Q^*(\mathbf{x}, \mathbf{a})$ as the maximum of (1) with respect to the action sequence. The reason for defining action-value functions is that the optimal $Q^*$ may be used to compute the optimal policy $\pi^*$ as

$$\pi^*(\mathbf{x}) = \operatorname*{argmax}_{\mathbf{a} \in \mathcal{A}} Q^*(\mathbf{x}, \mathbf{a}) \quad (2)$$

where a policy $\pi$ is a map from states to actions, $\pi : \mathcal{X} \to \mathcal{A}$. Thus, finding $Q^*$ solves the MDP. Value-function-based approaches to MDPs reformulate (2) by shifting the index of the summand in (1) by one, and use the time invariance of the Markov transition kernel and the homogeneity of the summand to derive the Bellman optimality equation:

$$Q^*(\mathbf{x}, \mathbf{a}) = \mathbb{E}_{\mathbf{x}'}\Big[ r(\mathbf{x}, \mathbf{a}, \mathbf{x}') + \gamma \max_{\mathbf{a}'} Q^*(\mathbf{x}', \mathbf{a}') \,\Big|\, \mathbf{x}, \mathbf{a} \Big] \quad (3)$$

where the expectation is taken with respect to the conditional distribution of the state $\mathbf{x}'$ given the state-action pair $(\mathbf{x}, \mathbf{a})$. The right-hand side of (3) defines the Bellman optimality operator $\mathcal{B}^*$ over $\mathcal{B}(\mathcal{X} \times \mathcal{A})$, the space of bounded continuous action-value functions $Q : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$:

$$(\mathcal{B}^* Q)(\mathbf{x}, \mathbf{a}) := \mathbb{E}_{\mathbf{x}'}\Big[ r(\mathbf{x}, \mathbf{a}, \mathbf{x}') + \gamma \max_{\mathbf{a}'} Q(\mathbf{x}', \mathbf{a}') \,\Big|\, \mathbf{x}, \mathbf{a} \Big] \quad (4)$$

Proposition 5.2 of [4] establishes that the fixed point of (4) is the optimal action-value function $Q^*$. Thus, to solve the MDP, we seek to compute the fixed point of (4) for all $(\mathbf{x}, \mathbf{a}) \in \mathcal{X} \times \mathcal{A}$.

**Compositional Stochastic Optimization.** The functional fixed point equation in (3) must be simultaneously satisfied for all state-action pairs $(\mathbf{x}, \mathbf{a})$. Alternatively, we may integrate (3) over an arbitrary distribution that is dense around any pair $(\mathbf{x}, \mathbf{a})$ to write a nested stochastic optimization problem [38, 35, 28]. To do so, begin by defining the function

$$f(Q; \mathbf{x}, \mathbf{a}) := \mathbb{E}_{\mathbf{x}'}\Big[ r(\mathbf{x}, \mathbf{a}, \mathbf{x}') + \gamma \max_{\mathbf{a}'} Q(\mathbf{x}', \mathbf{a}') \,\Big|\, \mathbf{x}, \mathbf{a} \Big] - Q(\mathbf{x}, \mathbf{a}) \quad (5)$$

and consider an arbitrary everywhere-dense distribution over pairs $(\mathbf{x}, \mathbf{a})$ to define the functional

$$J(Q) := \frac{1}{2} \, \mathbb{E}_{\mathbf{x}, \mathbf{a}}\big[ f(Q; \mathbf{x}, \mathbf{a})^2 \big] \quad (6)$$

Comparing (5) with (3) permits concluding that $Q^*$ is the unique function for which $f(Q^*; \mathbf{x}, \mathbf{a}) = 0$ for all $(\mathbf{x}, \mathbf{a})$. It then follows that $Q^*$ is the only function that makes the functional in (6) take the value $J(Q^*) = 0$. Since this functional is also nonnegative, we can write the optimal Q-function as

$$Q^* = \operatorname*{argmin}_{Q \in \mathcal{B}(\mathcal{X} \times \mathcal{A})} J(Q) \quad (7)$$

Computation of the optimal policy is thus equivalent to solving the optimization problem in (7). This requires a difficult search over all bounded continuous functions in $\mathcal{B}(\mathcal{X} \times \mathcal{A})$. We reduce this difficulty through a hypothesis on the function class.

**Reproducing Kernel Hilbert Spaces.** We propose restricting $\mathcal{B}(\mathcal{X} \times \mathcal{A})$ to be a Hilbert space $\mathcal{H}$ equipped with a unique reproducing kernel, an inner-product-like map $\kappa : (\mathcal{X} \times \mathcal{A}) \times (\mathcal{X} \times \mathcal{A}) \to \mathbb{R}$ such that

$$\text{(i)} \ \ \langle f, \kappa((\mathbf{x}, \mathbf{a}), \cdot) \rangle_{\mathcal{H}} = f(\mathbf{x}, \mathbf{a}), \qquad \text{(ii)} \ \ \mathcal{H} = \overline{\operatorname{span}}\,\{ \kappa((\mathbf{x}, \mathbf{a}), \cdot) \} \quad (8)$$

In (8), property (i) is called the reproducing property. Replacing $f$ by $\kappa((\mathbf{x}', \mathbf{a}'), \cdot)$ in (8)(i) yields the expression $\langle \kappa((\mathbf{x}', \mathbf{a}'), \cdot), \kappa((\mathbf{x}, \mathbf{a}), \cdot) \rangle_{\mathcal{H}} = \kappa((\mathbf{x}', \mathbf{a}'), (\mathbf{x}, \mathbf{a}))$, the origin of the term “reproducing kernel.” Moreover, property (8)(ii) states that functions in $\mathcal{H}$ admit a basis expansion in terms of kernel evaluations (9). Function spaces of this type are referred to as reproducing kernel Hilbert spaces (RKHSs).
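To make the kernel-expansion representation concrete, the following sketch (our own illustration, using a Gaussian kernel and hypothetical dictionary points) stores a function as a dictionary of points with scalar weights and evaluates it through kernel evaluations:

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.sum(diff ** 2) / (2.0 * bandwidth ** 2))

def evaluate(dictionary, weights, point, bandwidth=1.0):
    # f(point) = sum_n w_n k(d_n, point): the basis expansion of property (ii)
    return sum(w * gaussian_kernel(d, point, bandwidth)
               for d, w in zip(dictionary, weights))

# A two-element dictionary in R^2 (think of each point as a state-action pair).
D = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
w = [1.0, -0.5]
```

Storing `(D, w)` is exactly the parameterization the algorithm later keeps compressed.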

We may apply the Representer Theorem to transform the functional problem into a parametric one [41, 42]. In the RKHS, the optimal Q-function takes the form

$$Q(\mathbf{x}, \mathbf{a}) = \sum_{n=1}^{N} w_n \, \kappa\big( (\mathbf{x}_n, \mathbf{a}_n), (\mathbf{x}, \mathbf{a}) \big) \quad (9)$$

where $\{(\mathbf{x}_n, \mathbf{a}_n)\}_{n=1}^{N}$ is a sample of state-action pairs and $\{w_n\}_{n=1}^{N}$ are scalar weights: $Q$ is an expansion of kernel evaluations only at observed samples.

One complication of the restriction to the RKHS is that this setting requires the cost to be differentiable with Lipschitz gradients, but the cost $J(Q)$ [cf. (6)] defined by Bellman’s equation (4) is non-differentiable due to the presence of the maximization over the Q-function. This issue may be addressed either by operating with approximate, smoothed gradients of the non-differentiable function [43] or by approximating the non-smooth cost by a smooth one. We adopt the latter approach, replacing the term $\max_{\mathbf{a}'} Q(\mathbf{x}', \mathbf{a}')$ in (6) by the softmax over the continuous range $\mathcal{A}$, i.e.,

$$\operatorname{softmax}_{\mathbf{a}' \in \mathcal{A}} Q(\mathbf{x}', \mathbf{a}') := \frac{1}{\eta} \log \int_{\mathcal{A}} \exp\big\{ \eta \, Q(\mathbf{x}', \mathbf{a}') \big\} \, d\mathbf{a}' \quad (10)$$

and define the $\eta$-smoothed cost $J_\eta(Q)$ as the one obtained by using the softmax (10) in lieu of the hard maximum in (6), with $f_\eta$ denoting the correspondingly smoothed version of (5). Subsequently, we restrict our focus to the smoothed cost $J_\eta(Q)$.
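The smoothing can be illustrated numerically. In the sketch below (our own; a discretized action grid stands in for the integral over the continuous range), the log-sum-exp softmax approaches the hard maximum as the temperature parameter grows:

```python
import numpy as np

def soft_max(values, eta=10.0):
    # Smooth log-sum-exp approximation of the hard maximum over a
    # discretized action grid (the grid average replaces the integral).
    values = np.asarray(values, dtype=float)
    m = values.max()                          # shift for numerical stability
    return m + np.log(np.exp(eta * (values - m)).mean()) / eta

# Q-values on a hypothetical grid of candidate actions; hard max is 2.0.
q_vals = [0.1, 0.5, 2.0, 1.3]
```

Larger `eta` shrinks the gap to the hard maximum, at the cost of a sharper (less smooth) surrogate.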

In this work, we restrict the kernel used to be in the family of universal kernels, such as the Gaussian radial basis function (RBF) kernel with constant diagonal covariance $\sigma^2 I$,

$$\kappa\big( (\mathbf{x}, \mathbf{a}), (\mathbf{x}', \mathbf{a}') \big) = \exp\Big\{ -\frac{ \| (\mathbf{x}, \mathbf{a}) - (\mathbf{x}', \mathbf{a}') \|^2 }{ 2 \sigma^2 } \Big\} \quad (11)$$

motivated by the fact that a continuous function over a compact set may be approximated uniformly by a function in an RKHS equipped with a universal kernel [44].

To apply the Representer Theorem, we require the cost to be coercive in $Q$ [42], which may be satisfied through the use of a Hilbert-norm regularizer. We therefore define the regularized cost functional $J_{\eta,\lambda}(Q) = J_\eta(Q) + (\lambda/2) \| Q \|_{\mathcal{H}}^2$ and solve the regularized version of problem (7), i.e.,

$$Q^*_{\eta,\lambda} = \operatorname*{argmin}_{Q \in \mathcal{H}} J_{\eta,\lambda}(Q) = \operatorname*{argmin}_{Q \in \mathcal{H}} \Big( J_\eta(Q) + \frac{\lambda}{2} \| Q \|_{\mathcal{H}}^2 \Big) \quad (12)$$

Thus, finding a locally optimal action-value function in an MDP amounts to solving the RKHS-valued compositional stochastic program with a non-convex objective defined by the Bellman optimality equation (4). This action-value function can then be used to obtain an (approximately) optimal policy (2). In the following section, we turn to iterative stochastic methods to solve (12). We point out that this is a step back from the original intent of solving (7) to then find optimal policies using (2). This is the case because the assumption we have made about $Q^*$ being representable in the RKHS need not be true. More importantly, the functional $J_{\eta,\lambda}$ is not convex in $Q$, and there is no guarantee that a local minimum of $J_{\eta,\lambda}$ will yield the optimal policy $\pi^*$. This is a significant difference relative to policy evaluation problems [28].

## III Stochastic Quasi-Gradient Method

To solve (12), we propose applying a functional variant of stochastic quasi-gradient (SQG) descent to the loss function $J_{\eta,\lambda}(Q)$ [cf. (12)]. The reason for this approach, rather than a stochastic gradient method, is that the nested expectations cause the functional stochastic gradient to depend on a second expectation which is not computable; SQG circumvents this issue. We then apply the Representer Theorem (9) (the “kernel trick”) to obtain a parameterization of this optimization sequence, which has per-iteration complexity that grows with the iteration index. We then mitigate this untenable complexity growth while preserving optimality using greedy compressive methods, inspired by [35, 28].

To find a stationary point of (12), we use quasi-gradients of the functional $J_{\eta,\lambda}(Q)$ relative to the function $Q$ in an iterative process. To do so, introduce an iteration index $t$ and let $Q_t$ be the estimate of the stationary point at iteration $t$. Further, consider a random state-action pair $(\mathbf{x}_t, \mathbf{a}_t)$ independently and randomly chosen from the distribution. Action $\mathbf{a}_t$ is executed from state $\mathbf{x}_t$, resulting in the system moving to state $\mathbf{x}'_t$. This outcome is recorded along with reward $r(\mathbf{x}_t, \mathbf{a}_t, \mathbf{x}'_t)$ and the action $\mathbf{a}'_t$ that maximizes the action-value function when the system is in state $\mathbf{x}'_t$, i.e.,

$$\mathbf{a}'_t := \operatorname*{argmax}_{\mathbf{a}'} Q_t(\mathbf{x}'_t, \mathbf{a}') \quad (13)$$

The state $\mathbf{x}_t$ (S), action $\mathbf{a}_t$ (A), reward $r_t$ (R), next state $\mathbf{x}'_t$ (S), and next action $\mathbf{a}'_t$ (A) are collectively referred to as the SARSA tuple at time $t$.

Further, consider the expressions for $J_{\eta,\lambda}(Q)$ in (12) and $f_\eta$ in (6), and exchange the order of the expectation and differentiation operators to write the gradient of $J_{\eta,\lambda}(Q)$ as

$$\nabla_Q J_{\eta,\lambda}(Q) = \mathbb{E}_{\mathbf{x},\mathbf{a}}\Big[ \mathbb{E}_{\mathbf{x}'}\big[ \gamma \nabla_Q \operatorname{softmax}_{\mathbf{a}'} Q(\mathbf{x}', \mathbf{a}') - \kappa\big( (\mathbf{x}, \mathbf{a}), \cdot \big) \big] \cdot f_\eta(Q; \mathbf{x}, \mathbf{a}) \Big] + \lambda Q \quad (14)$$

To compute the directional derivative in (14), we need to address differentiation of the softmax and its approximation properties with respect to the exact maximum, which is done in the following remark.

###### Remark 1

(Softmax Gradient Error) The functional derivative of (10) takes the form

$$\nabla_Q \operatorname{softmax}_{\mathbf{a}'} Q(\mathbf{x}', \mathbf{a}') = \frac{ \displaystyle\int_{\mathcal{A}} \exp\big\{ \eta \, Q(\mathbf{x}', \mathbf{a}') \big\} \, \kappa\big( (\mathbf{x}', \mathbf{a}'), \cdot \big) \, d\mathbf{a}' }{ \displaystyle\int_{\mathcal{A}} \exp\big\{ \eta \, Q(\mathbf{x}', \mathbf{a}') \big\} \, d\mathbf{a}' } \quad (15)$$

by applying the Leibniz rule, the chain rule, and the reproducing property of the kernel. Moreover, a factor of $1/\eta$ cancels. Observe that as $\eta \to \infty$, the softmax becomes closer to the exact (hard) maximum, the ratio of integrals in (15) concentrates on the maximizing action, and the only term that remains is the kernel evaluated at that maximizer. This term may be used in place of (15) to avoid computing the integral, and yields the functional gradient of the exact maximum instead of the softmax. Doing so, however, comes at the cost of computing the maximizer of the Q-function.

Observe that to obtain samples of the gradient (14) we require two different queries to a simulation oracle: one to approximate the inner expectation over the Markov transition dynamics defined by $\mathbf{x}'$, and one for *each initial pair* $(\mathbf{x}, \mathbf{a})$ which defines the outer expectation. This complication, called the “double sampling problem,” was first identified in [27, 45] and has been ameliorated through the use of two time-scale stochastic approximation, which may be viewed as a stochastic variant of quasi-gradient methods [38].

Following this line of reasoning, we build up the total expectation of one of the terms in (14) while performing stochastic descent with respect to the other. In principle, it is possible to build up the expectation of either term in (14), but the mean of the difference of kernel evaluations is of infinite complexity. On the other hand, the *temporal action difference* $\delta$, defined in terms of the reward and the difference between the action-value function evaluated at the state-action pair $(\mathbf{x}, \mathbf{a})$ and at the next state with the instantaneous maximizing action $(\mathbf{x}', \mathbf{a}')$:

$$\delta := r(\mathbf{x}, \mathbf{a}, \mathbf{x}') + \gamma \, Q(\mathbf{x}', \mathbf{a}') - Q(\mathbf{x}, \mathbf{a}) \quad (16)$$

is a *scalar*, and thus so is its total expected value. Therefore, for obvious complexity motivations, we build up the total expectation of (16). To do so, we propose recursively averaging realizations of (16) through the following auxiliary sequence $z_t$, initialized as null, $z_0 = 0$:

$$z_{t+1} = (1 - \beta_t) \, z_t + \beta_t \, \delta_t \quad (17)$$

where $\delta_t$ is an independent realization of the temporal action difference associated with the random triple $(\mathbf{x}_t, \mathbf{a}_t, \mathbf{x}'_t)$ and $\beta_t \in (0, 1)$ is a learning rate.
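The auxiliary recursion (17) is a one-line exponential average of sampled temporal differences; a minimal sketch with illustrative values:

```python
def update_aux(z, delta, beta):
    # z_{t+1} = (1 - beta) * z_t + beta * delta_t: a running estimate of the
    # expected temporal action difference (the inner expectation), which is
    # how the double sampling problem is sidestepped.
    return (1.0 - beta) * z + beta * delta

z = 0.0                              # initialized as null
for delta in [1.0, 1.0, 1.0]:        # three sampled temporal differences
    z = update_aux(z, delta, beta=0.5)
```

With repeated samples of the same value, `z` converges geometrically toward that value.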

To define the stochastic descent step, we replace the first term inside the outer expectation in (14) with its instantaneous approximation evaluated at a sample triple $(\mathbf{x}_t, \mathbf{a}_t, \mathbf{x}'_t)$, which yields the stochastic quasi-gradient step:

$$Q_{t+1} = (1 - \alpha_t \lambda) \, Q_t - \alpha_t \big( \gamma \, \kappa\big( (\mathbf{x}'_t, \mathbf{a}'_t), \cdot \big) - \kappa\big( (\mathbf{x}_t, \mathbf{a}_t), \cdot \big) \big) \, z_{t+1} \quad (18)$$

where the coefficient $(1 - \alpha_t \lambda)$ comes from the regularizer and $\alpha_t$ is a positive scalar learning rate. Moreover, $\mathbf{a}'_t$ is the instantaneous Q-function-maximizing action (13). Now, using logic similar to [37], we may extract a tractable parameterization of the infinite-dimensional function sequence (18), exploiting properties of the RKHS (8).

**Kernel Parametrization.** Suppose $Q_0 = 0$. Then the update in (18) at time $t$, inductively making use of the Representer Theorem, implies the function $Q_t$ is a kernel expansion of past state-action tuples:

$$Q_t(\mathbf{x}, \mathbf{a}) = \sum_{n=1}^{2(t-1)} w_n \, \kappa\big( \mathbf{v}_n, (\mathbf{x}, \mathbf{a}) \big) = \mathbf{w}_t^\top \boldsymbol{\kappa}_{\mathbf{X}_t}\big( (\mathbf{x}, \mathbf{a}) \big) \quad (19)$$

The kernel expansion in (19), together with the functional update (18), yields the fact that functional SQG amounts to updating the kernel dictionary $\mathbf{X}_t$ and coefficient vector $\mathbf{w}_t$ as

$$\mathbf{X}_{t+1} = \big[ \mathbf{X}_t, \; (\mathbf{x}_t, \mathbf{a}_t), \; (\mathbf{x}'_t, \mathbf{a}'_t) \big], \qquad \mathbf{w}_{t+1} = \big[ (1 - \alpha_t \lambda) \mathbf{w}_t, \; \alpha_t z_{t+1}, \; -\alpha_t \gamma z_{t+1} \big] \quad (20)$$

In (20), the coefficient vector $\mathbf{w}_t$ and dictionary $\mathbf{X}_t$ are defined as

$$\mathbf{X}_t = \big[ (\mathbf{x}_1, \mathbf{a}_1), (\mathbf{x}'_1, \mathbf{a}'_1), \ldots, (\mathbf{x}_{t-1}, \mathbf{a}_{t-1}), (\mathbf{x}'_{t-1}, \mathbf{a}'_{t-1}) \big], \qquad \mathbf{w}_t = \big[ w_1, \ldots, w_{2(t-1)} \big] \quad (21)$$

and in (19), we introduce the notation $\mathbf{v}_n$ for the $n$-th dictionary element: a pair of the form $(\mathbf{x}, \mathbf{a})$ for $n$ odd and $(\mathbf{x}', \mathbf{a}')$ for $n$ even, following the ordering in (21). Moreover, in (19), we make use of a concept called the empirical kernel map associated with dictionary $\mathbf{X}_t$, defined as

$$\boldsymbol{\kappa}_{\mathbf{X}_t}\big( (\mathbf{x}, \mathbf{a}) \big) = \Big[ \kappa\big( \mathbf{v}_1, (\mathbf{x}, \mathbf{a}) \big), \ldots, \kappa\big( \mathbf{v}_{2(t-1)}, (\mathbf{x}, \mathbf{a}) \big) \Big]^\top \quad (22)$$

Observe that (20) causes $\mathbf{X}_{t+1}$ to have two more columns than its predecessor $\mathbf{X}_t$. We define the model order $M_t$ as the number of data points (columns) in the dictionary at time $t$, which for functional stochastic quasi-gradient descent is $M_t = 2(t-1)$. Asymptotically, then, the complexity of storing $Q_t$ is infinite, and even for moderately large training sets it is untenable. Next, we address this intractable complexity blowup, inspired by [35, 28], using greedy compression methods [33].
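In parametric terms, one functional SQG step appends two dictionary atoms and shrinks the old weights; the following sketch (our own illustration, with signs and constants taken as assumptions consistent with the regularized step described above) shows the bookkeeping:

```python
import numpy as np

def sqg_parametric_update(dictionary, weights, new_pair, next_pair, z, step,
                          gamma=0.9, lam=1e-3):
    # One functional SQG step in parametric form: existing weights shrink by
    # (1 - step * lam) due to the regularizer, and two new atoms are appended
    # with coefficients proportional to the averaged temporal difference z.
    weights = [(1.0 - step * lam) * w for w in weights]
    dictionary = dictionary + [new_pair, next_pair]
    weights = weights + [step * z, -step * gamma * z]
    return dictionary, weights

D, w = [], []                                       # Q_0 = 0: empty expansion
D, w = sqg_parametric_update(D, w,
                             np.array([0.0, 1.0]),  # sampled (state, action)
                             np.array([0.5, 0.2]),  # (next state, max action)
                             z=0.3, step=0.1)
```

After `t` steps the dictionary holds `2t` atoms, which is exactly the growth the compression below removes.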

**Sparse Stochastic Subspace Projections.** Since the update step (18) has growing complexity due to the RKHS parametrization, it is impractical in settings with streaming data or arbitrarily large training sets. We address this issue by replacing the stochastic quasi-descent step (18) with an orthogonally projected variant, where the projection is onto a low-dimensional functional subspace $\mathcal{H}_{\mathbf{D}_{t+1}}$ of the RKHS:

$$Q_{t+1} = \mathcal{P}_{\mathcal{H}_{\mathbf{D}_{t+1}}}\Big[ (1 - \alpha_t \lambda) \, Q_t - \alpha_t \big( \gamma \, \kappa\big( (\mathbf{x}'_t, \mathbf{a}'_t), \cdot \big) - \kappa\big( (\mathbf{x}_t, \mathbf{a}_t), \cdot \big) \big) \, z_{t+1} \Big] \quad (23)$$

where $\mathcal{H}_{\mathbf{D}} = \operatorname{span}\{ \kappa(\mathbf{d}_m, \cdot) \}_{m=1}^{M}$ for some collection of sample instances $\mathbf{D} = [\mathbf{d}_1, \ldots, \mathbf{d}_M]$, each a state-action pair. We define $\boldsymbol{\kappa}_{\mathbf{D}}(\cdot)$ as the associated empirical kernel map and $\mathbf{K}_{\mathbf{D}}$ as the kernel matrix resulting from this dictionary. We seek function parsimony by selecting dictionaries such that $M_t \ll 2(t-1)$. Suppose that $Q_t$ is parameterized by model points $\mathbf{D}_t$ and weights $\mathbf{w}_t$. Then, we denote by $\tilde{Q}_{t+1}$ the SQG step without projection, which may be represented by a dictionary and weight vector [cf. (20)]:

$$\tilde{\mathbf{D}}_{t+1} = \big[ \mathbf{D}_t, \; (\mathbf{x}_t, \mathbf{a}_t), \; (\mathbf{x}'_t, \mathbf{a}'_t) \big], \qquad \tilde{\mathbf{w}}_{t+1} = \big[ (1 - \alpha_t \lambda) \mathbf{w}_t, \; \alpha_t z_{t+1}, \; -\alpha_t \gamma z_{t+1} \big] \quad (24)$$

where $z_{t+1}$ in (24) is computed by the recursion (17) using $Q_t$ obtained from (23):

$$z_{t+1} = (1 - \beta_t) \, z_t + \beta_t \, \delta_t, \qquad \delta_t = r(\mathbf{x}_t, \mathbf{a}_t, \mathbf{x}'_t) + \gamma \, Q_t(\mathbf{x}'_t, \mathbf{a}'_t) - Q_t(\mathbf{x}_t, \mathbf{a}_t) \quad (25)$$

Observe that $\tilde{\mathbf{D}}_{t+1}$ has $M_t + 2$ columns, which is also the length of $\tilde{\mathbf{w}}_{t+1}$. We proceed to describe the construction of the subspaces onto which the SQG iterates are projected in (23). Specifically, we select the kernel dictionary $\mathbf{D}_{t+1}$ via greedy compression: we form $\mathbf{D}_{t+1}$ by selecting a subset of columns from $\tilde{\mathbf{D}}_{t+1}$ that best approximates $\tilde{Q}_{t+1}$ in terms of Hilbert-norm error. To accomplish this, we use kernel orthogonal matching pursuit [35, 28] with error tolerance $\epsilon_t$ to find a compressed dictionary $\mathbf{D}_{t+1}$ from $\tilde{\mathbf{D}}_{t+1}$, the dictionary that adds the latest samples. For a fixed dictionary $\mathbf{D}_{t+1}$, the update for the kernel weights is a least-squares problem on the coefficient vector:

$$\mathbf{w}_{t+1} = \mathbf{K}_{\mathbf{D}_{t+1}}^{-1} \, \mathbf{K}_{\mathbf{D}_{t+1}, \tilde{\mathbf{D}}_{t+1}} \, \tilde{\mathbf{w}}_{t+1} \quad (26)$$

We tune $\epsilon_t$ to ensure both stochastic descent and a finite model order; see the next section.
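A simplified destructive variant of this compression can be sketched as follows (our own illustration: it greedily drops the atom whose removal, after a least-squares refit of the remaining weights, costs the least Hilbert-norm error, stopping at the budget; the paper's Algorithm 2 may differ in details such as atom-selection order):

```python
import numpy as np

def komp_prune(dictionary, weights, eps, bandwidth=1.0):
    # Greedy destructive pruning of a Gaussian-kernel expansion.
    D = [np.asarray(d, dtype=float) for d in dictionary]
    w = np.asarray(weights, dtype=float)

    def gram(A, B):
        # Gaussian-kernel Gram matrix between two lists of atoms.
        return np.exp(-np.array([[np.sum((a - b) ** 2) for b in B] for a in A])
                      / (2.0 * bandwidth ** 2))

    while len(D) > 1:
        K = gram(D, D)
        best = None
        for j in range(len(D)):
            keep = [i for i in range(len(D)) if i != j]
            K_kk = K[np.ix_(keep, keep)]
            # least-squares refit of the kept weights against the full expansion
            w_new = np.linalg.solve(K_kk + 1e-10 * np.eye(len(keep)),
                                    K[np.ix_(keep, list(range(len(D))))] @ w)
            # squared Hilbert-norm error of the pruned approximation
            err = np.sqrt(max(w @ K @ w - w_new @ K_kk @ w_new, 0.0))
            if best is None or err < best[0]:
                best = (err, keep, w_new)
        if best[0] > eps:
            break                      # removing any atom exceeds the budget
        _, keep, w_new = best
        D = [D[i] for i in keep]
        w = w_new
    return D, list(w)
```

On a dictionary with a duplicated atom, the duplicate is absorbed at essentially zero error while distinct atoms survive.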

We summarize the proposed method, KQ-Learning, in Algorithm 1: the execution of the stochastic projection of the functional SQG iterates onto subspaces $\mathcal{H}_{\mathbf{D}_{t+1}}$. We begin with a null function $Q_0 = 0$, i.e., an empty dictionary and coefficient vector (Step 1). At each step, given an i.i.d. sample $(\mathbf{x}_t, \mathbf{a}_t, \mathbf{x}'_t)$ and step-sizes $\alpha_t, \beta_t$ (Steps 2-5), we compute the unconstrained functional SQG iterate parametrized by $\tilde{\mathbf{D}}_{t+1}$ and $\tilde{\mathbf{w}}_{t+1}$ (Steps 6-7), which are fed into KOMP (Algorithm 2) [35] with budget $\epsilon_t$ (Step 8). KOMP then returns a lower-complexity estimate $Q_{t+1}$ of $\tilde{Q}_{t+1}$ that is at most $\epsilon_t$ away in the Hilbert norm.

## IV Convergence Analysis

In this section, we shift focus to the task of establishing that the sequence of action-value function estimates generated by Algorithm 1 actually yields a locally optimal solution to the Bellman optimality equation, which, given the intrinsic non-convexity of the problem setting, is the best one may hope for in general through the use of numerical stochastic optimization methods. Our analysis extends the ideas of coupled supermartingales in reproducing kernel Hilbert spaces [28], which have been used to establish convergent policy evaluation approaches in infinite MDPs (a convex problem), to non-convex settings, and further generalizes the non-convex vector-valued setting of [38].

Before proceeding with the details of the technical setting, we introduce a few definitions which greatly simplify the derivations. In particular, for further reference, we use (13) to define $\mathbf{a}'_t$, the instantaneous maximizer of the action-value function, which defines the direction of the gradient. We also define the functional stochastic quasi-gradient of the regularized objective

$$\hat{\nabla}_Q J_{\eta,\lambda}(Q_t) = \big( \gamma \, \kappa\big( (\mathbf{x}'_t, \mathbf{a}'_t), \cdot \big) - \kappa\big( (\mathbf{x}_t, \mathbf{a}_t), \cdot \big) \big) \, z_{t+1} + \lambda Q_t \quad (27)$$

and its sparse-subspace projected variant as

$$\tilde{\nabla}_Q J_{\eta,\lambda}(Q_t) = \frac{1}{\alpha_t} \Big( Q_t - \mathcal{P}_{\mathcal{H}_{\mathbf{D}_{t+1}}}\big[ Q_t - \alpha_t \hat{\nabla}_Q J_{\eta,\lambda}(Q_t) \big] \Big) \quad (28)$$

Note that the update (23) may be rewritten as a stochastic projected quasi-gradient step rather than a stochastic quasi-gradient step followed by a set projection, i.e.,

$$Q_{t+1} = Q_t - \alpha_t \tilde{\nabla}_Q J_{\eta,\lambda}(Q_t) \quad (29)$$

With these definitions, we may state our main assumptions required to establish convergence of Algorithm 1.

###### Assumption 1

The state space $\mathcal{X} \subset \mathbb{R}^p$ and action space $\mathcal{A} \subset \mathbb{R}^q$ are compact, and the reproducing kernel map may be bounded as

$$\sup_{(\mathbf{x}, \mathbf{a}) \in \mathcal{X} \times \mathcal{A}} \kappa\big( (\mathbf{x}, \mathbf{a}), (\mathbf{x}, \mathbf{a}) \big) = K_{\max}^2 < \infty \quad (30)$$

Moreover, the subspaces $\mathcal{H}_{\mathbf{D}_t}$ are intersected with some finite Hilbert-norm ball for each $t$.

###### Assumption 2

The temporal action difference $\delta$ and auxiliary sequence $z_t$ satisfy the zero-mean, finite conditional variance, and Lipschitz continuity conditions, respectively:

$$\mathbb{E}\big[ \delta - \bar{\delta} \mid \mathbf{x}, \mathbf{a} \big] = 0, \qquad \mathbb{E}\big[ (\delta - \bar{\delta})^2 \mid \mathbf{x}, \mathbf{a} \big] \le \sigma_\delta^2, \qquad \big| \bar{\delta}(Q) - \bar{\delta}(Q') \big| \le L_\delta \| Q - Q' \|_\infty \quad (31)$$

where $\sigma_\delta$ and $L_\delta$ are positive scalars, and $\bar{\delta}$ is the expected value of the temporal action difference conditioned on the state $\mathbf{x}$ and action $\mathbf{a}$.

###### Assumption 3

The functional gradient of the temporal action difference is an unbiased estimate of $\nabla_Q \bar{\delta}$, and the difference-of-reproducing-kernels expression has finite conditional variance:

$$\mathbb{E}\big[ \gamma \, \kappa\big( (\mathbf{x}', \mathbf{a}'), \cdot \big) - \kappa\big( (\mathbf{x}, \mathbf{a}), \cdot \big) \,\big|\, \mathbf{x}, \mathbf{a} \big] = \nabla_Q \bar{\delta} \quad (32)$$

$$\mathbb{E}\Big[ \big\| \gamma \, \kappa\big( (\mathbf{x}', \mathbf{a}'), \cdot \big) - \kappa\big( (\mathbf{x}, \mathbf{a}), \cdot \big) \big\|_{\mathcal{H}}^2 \,\Big|\, \mathbf{x}, \mathbf{a} \Big] \le \sigma_\kappa^2 \quad (33)$$

Moreover, the projected stochastic quasi-gradient of the objective has finite second conditional moment as

$$\mathbb{E}\Big[ \big\| \tilde{\nabla}_Q J_{\eta,\lambda}(Q_t) \big\|_{\mathcal{H}}^2 \,\Big|\, \mathcal{F}_t \Big] \le \sigma_{\nabla}^2 \quad (34)$$

and the temporal action difference is Lipschitz continuous with respect to the action-value function $Q$; that is, for any two distinct $Q$ and $Q'$, we have

$$\big| \bar{\delta}(Q) - \bar{\delta}(Q') \big| \le L_Q \, \| Q - Q' \|_{\mathcal{H}} \quad (35)$$

with $Q$, $Q'$ distinct Q-functions and $L_Q$ a positive scalar.

Assumption 1, regarding the compactness of the state and action spaces of the MDP, holds for most application settings and limits the radius of the set from which the MDP trajectory is sampled. The mean and variance properties of the temporal difference stated in Assumption 2 are necessary to bound the error in the descent direction associated with stochastic sub-sampling, and are required to establish convergence of stochastic methods. Assumption 3 is similar to Assumption 2, but instead of establishing bounds on the stochastic approximation error of the temporal difference, it limits the stochastic error variance in the RKHS. The term related to the maximum of the Q-function in the temporal action difference is Lipschitz in the infinity norm, since $Q$ is automatically Lipschitz as it belongs to the RKHS; thus, this term can be related to the Hilbert norm through a constant factor. Hence, (35) only limits how non-smooth the reward function may be. These are natural extensions of the conditions needed for vector-valued stochastic compositional gradient methods.

Due to Assumption 1 and the use of set projections in (23), we have that $Q_t$ is always bounded in Hilbert norm, i.e., there exists some $0 < K < \infty$ such that

$$\| Q_t \|_{\mathcal{H}} \le K \quad \text{for all } t \quad (36)$$

With these technical conditions, we can derive a coupled stochastic descent-type relationship for Algorithm 1 and then apply the Coupled Supermartingale Theorem [46, Lemma 6] to establish convergence, which we state next.

###### Theorem 1

Consider the sequences $z_t$ and $Q_t$ generated by Algorithm 1. Assume the regularizer is positive, $\lambda > 0$, Assumptions 1-3 hold, and the following step-size conditions hold, with $C > 0$ a positive constant:

$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \beta_t = \infty, \qquad \sum_{t=1}^{\infty} \Big( \alpha_t^2 + \beta_t^2 + \frac{\alpha_t^2}{\beta_t} \Big) < \infty, \qquad \epsilon_t = C \alpha_t^2 \quad (37)$$

Then the functional gradient of the smoothed objective converges to null with probability 1, and hence $Q_t$ attains a stationary point of (12). In particular, the limit of $Q_t$ achieves the regularized Bellman fixed point restricted to the RKHS.

See Appendix B.

Theorem 1 establishes that Algorithm 1 converges almost surely to a stationary solution of the problem (12) defined by the Bellman optimality equation in a continuous MDP. This is one of the first Lyapunov stability results for Q-learning in continuous state-action spaces with nonlinear function parameterizations, which are intrinsically necessary when the Q-function does not admit a lookup table (matrix) representation, and it should form a foundation for value-function-based reinforcement learning in continuous spaces. A key feature of this result is that the complexity of the function parameterization will not grow untenably large, due to the use of our KOMP-based compression method, which ties the sparsification bias $\epsilon_t$ to the algorithm step-size $\alpha_t$. In particular, by modifying the above exact convergence result for diminishing learning rates to one in which they are kept constant, we are able to keep constant compression budgets as well, and establish convergence to a neighborhood together with the finiteness of the model order of $Q_t$, as we state next.

###### Theorem 2

See Appendix C.

The expression on the right-hand side of (38) is a complicated posynomial of $\alpha$ and $\beta$, but it is positive provided $\alpha, \beta < 1$, and for fixed $\beta$ it increases as $\alpha$ increases. This means that more aggressive selections of $\alpha$, for a given $\beta$, yield a larger limiting lower bound on the Bellman error. A simple example which satisfies the constant step-size conditions is $\beta = \sqrt{\alpha}$ for some small constant $\alpha$. This is consistent with the diminishing step-size conditions in (37), where summability of $\alpha_t^2 / \beta_t$ means that $\alpha_t$ must be smaller than $\beta_t$, which is in $(0, 1)$.

An additional salient feature of the parameter choice given in Theorem 2 is that [28, Corollary 1] applies, and thus we may conclude that the Q-function parameterization is at worst finite during learning when used with constant step-sizes and compression budget. In subsequent sections, we investigate the empirical validity of the proposed approach on two autonomous control tasks, the Inverted Pendulum and Continuous Mountain Car, for which we observe consistent convergence in practice. To do so, first some implementation details of Algorithm 1 must be addressed.

###### Remark 2

Observe that (39) bears a phantom resemblance to Watkins’ Q-Learning algorithm [10]; however, it is unclear how to extend [10] to continuous MDPs where function approximation is required. In practice, using (39) for all updates, we observe globally steady policy learning and convergence of the Bellman error, suggesting a link between (39) and stochastic fixed point methods [14, 15]. This link is left to future investigation. For now, we simply note that stochastic fixed point iteration is fundamentally different from stochastic descent methods, which rely on the construction of supermartingales, so the results of the previous section do not apply to (39). Moreover, this update has also been referred to as a temporal difference “semi-gradient” update in Chapters 9-10 of [6].

## V Practical Considerations

The convergence guarantees for Algorithm 1 require sequentially observing state-action-next-state triples that are independent and identically distributed. Doing so, however, only yields convergence toward a stationary point, which may or may not be the optimal Q-function. To improve the quality of the stationary points to which we converge, it is vital to observe states that yield reward (an instantiation of the explore-exploit tradeoff). To do so, we adopt a standard practice in reinforcement learning, which is to bias actions towards states that may accumulate more reward.

The method by which we propose to bias actions is to select them according to the current estimate of the optimal policy, i.e., the greedy policy. However, when doing so, the KQ-Learning updates (18) computed using greedy samples are composed of two points nearby in space. These points are then evaluated by kernels and given approximately equal weight with opposite signs. Thus, this update is immediately pruned away during the execution of KOMP [33, 47, 35]. In order to make use of the greedy samples and speed up convergence, we project the functional update onto just one kernel dictionary element, resulting in the update step:

$$Q_{t+1} = (1 - \alpha_t \lambda) \, Q_t + \alpha_t \, z_{t+1} \, \kappa\big( (\mathbf{x}_t, \mathbf{a}_t), \cdot \big) \quad (39)$$

The resulting procedure is summarized as Algorithm 3. First, trajectory samples are obtained using a greedy policy. Then, the temporal action difference is computed and averaged recursively. Finally, we update the Q-function via (39) and compress it using Algorithm 2.

**$\epsilon$-Greedy Actions and Hybrid Update.** To address the explore-exploit trade-off, we use an $\epsilon$-greedy policy [48]: with probability $\epsilon$ we select a random action, and with probability $1 - \epsilon$ we select a greedy action. We adopt this approach with $\epsilon$ decreasing linearly during training, meaning that as time passes more greedy actions are taken.

When run with an $\epsilon$-greedy policy, the algorithm is described as the *Hybrid* algorithm, which uses Algorithm 1 when exploratory actions are taken and Algorithm 3 for greedy actions. Practically, we find it useful to use training examples judiciously, which may be done with a data buffer. Thus, the hybrid algorithm is as follows. First, we accumulate trajectory samples in a buffer. Along with each sample, we store an indicator of whether the action was exploratory or greedy with respect to $Q_t$. Then, samples are drawn at random from the buffer for training. We explore two different methods for obtaining samples from the buffer: uniformly at random, and prioritized sampling, which weighs each sample in the buffer by its observed Bellman error. For greedy actions, we use the update in (39), and for exploratory actions, we use the KQ-Learning update from Algorithm 1. Finally, we use KOMP to compress the representation of the Q-function.
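The buffer-sampling step admits a one-function sketch (our own; `random.choices` draws with replacement, and the Bellman-error weights shown are hypothetical):

```python
import random

def sample_from_buffer(buffer, bellman_errors, k, prioritized=True):
    # Draw k training samples from the replay buffer: uniformly at random,
    # or weighted by each sample's observed Bellman error (prioritized).
    if prioritized:
        return random.choices(buffer, weights=bellman_errors, k=k)
    return random.choices(buffer, k=k)

# Hypothetical buffer of (state, action, next state) tags and their errors.
buffer = [("x1", "a1", "x1'"), ("x2", "a2", "x2'")]
errors = [0.0, 1.0]
```

With prioritized sampling, samples with larger Bellman error are revisited more often.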

**Maximizing the Q-Function.** In order to implement Algorithm 3, we apply simulated annealing [49] to evaluate the instantaneous maximizing action $\mathbf{a}'_t$. For a general reproducing kernel, maximizing over a weighted sum of kernels is a non-convex optimization problem, so we risk getting stuck in undesirable stationary points [50]. To reduce the chance that this undesirable outcome transpires, we use simulated annealing. First, we sample actions uniformly at random from the action space. Next, we use gradient ascent to refine our estimate of the global maximum of $Q(\mathbf{x}, \cdot)$ for state $\mathbf{x}$. We use the Gaussian radial basis function (RBF) kernel (11), so the Q-function is differentiable with respect to an arbitrary action $\mathbf{a}$:

$$\nabla_{\mathbf{a}} Q(\mathbf{x}, \mathbf{a}) = \sum_{m=1}^{M} w_m \, \kappa\big( \mathbf{d}_m, (\mathbf{x}, \mathbf{a}) \big) \, \frac{ \mathbf{a}_m - \mathbf{a} }{ \sigma^2 } \quad (40)$$

and gradient evaluations are cheap: typically their complexity scales with the model order of the Q-function, which is kept under control using Algorithm 2.
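The multi-start refinement can be sketched as follows (our own illustration on a toy objective; the annealing schedule is simplified to plain gradient ascent from random restarts, and the action box $[-1, 1]^q$ is an assumption):

```python
import numpy as np

def maximize_q(q_func, grad_q, action_dim, n_starts=20, n_steps=50, lr=0.1,
               rng=None):
    # Annealing-style maximization sketch: sample random starting actions,
    # refine each by gradient ascent on Q(x, .), and keep the best candidate.
    rng = rng if rng is not None else np.random.default_rng(0)
    best_a, best_v = None, -np.inf
    for _ in range(n_starts):
        a = rng.uniform(-1.0, 1.0, size=action_dim)  # assumed action box
        for _ in range(n_steps):
            a = a + lr * grad_q(a)                   # gradient ascent step
        v = q_func(a)
        if v > best_v:
            best_a, best_v = a, v
    return best_a, best_v

# Toy non-concave surrogate with two equal maxima at a = +/- 0.7.
q = lambda a: float(-(a[0] ** 2 - 0.49) ** 2)
dq = lambda a: np.array([-4.0 * a[0] * (a[0] ** 2 - 0.49)])
```

Multiple restarts are what guard against the undesirable stationary points mentioned above: any single start may stall, but the best over all starts reaches a global maximizer on this toy surface.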