# Differential Temporal Difference Learning

Value functions derived from Markov decision processes arise as a central component of algorithms as well as performance metrics in many statistics and engineering applications of machine learning techniques. Computation of the solution to the associated Bellman equations is challenging in most practical cases of interest. A popular class of approximation techniques, known as Temporal Difference (TD) learning algorithms, are an important sub-class of general reinforcement learning methods. The algorithms introduced in this paper are intended to resolve two well-known difficulties of TD-learning approaches: their slow convergence due to very high variance, and the fact that, for the problem of computing the relative value function, consistent algorithms exist only in special cases. First we show that the gradients of these value functions admit a representation that lends itself to algorithm design. Based on this result, a new class of differential TD-learning algorithms is introduced. For Markovian models on Euclidean space with smooth dynamics, the algorithms are shown to be consistent under general conditions. Numerical results show dramatic variance reduction when compared to standard methods.



## 1 Introduction

A central task in the application of many machine learning methods and control techniques is the (exact or approximate) computation of value functions arising from Markov decision processes. The class of Temporal Difference (TD) learning algorithms considered in this work is an important sub-class of the general family of reinforcement learning methods. Our main contributions here are the introduction of a related family of TD-learning algorithms that enjoy much better convergence properties than existing methods, and the rigorous theoretical analysis of these algorithms.

The value functions considered in this work are based on a discrete-time Markov chain $\boldsymbol{X} = \{X(t) : t \ge 0\}$ taking values in $\mathbb{R}^\ell$, and on an associated cost function $c: \mathbb{R}^\ell \to \mathbb{R}$. Our central modelling assumption throughout is that $\boldsymbol{X}$ evolves according to the nonlinear state space model,

$$ X(t+1) = a(X(t), N(t+1)), \qquad t \ge 0, \tag{1} $$

where $\boldsymbol{N} = \{N(t) : t \ge 1\}$ is an $m$-dimensional disturbance sequence of independent and identically distributed (i.i.d.) random variables, and $a: \mathbb{R}^\ell \times \mathbb{R}^m \to \mathbb{R}^\ell$ is continuous. Under these assumptions, $X(t)$ is a continuous function of the initial condition $X(0) = x$; this observation is our starting point for the construction of effective algorithms for value function approximation.
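As a concrete illustration, the evolution equation (1) can be simulated directly once a map $a$ and a disturbance law are fixed. The linear map and Gaussian noise below are illustrative stand-ins rather than choices made in the paper; this is only a minimal sketch of the model class.

```python
import numpy as np

# A minimal instance of the state space model X(t+1) = a(X(t), N(t+1)):
# here a(x, n) = 0.9*x + n with scalar i.i.d. Gaussian disturbances.
# Both the map a and the noise law are illustrative choices.
rng = np.random.default_rng(0)

def a(x, n):
    return 0.9 * x + n          # continuous in x, as the model requires

def simulate(x0, T):
    """Generate X(0), ..., X(T) from the model (1)."""
    X = np.empty(T + 1)
    X[0] = x0
    for t in range(T):
        X[t + 1] = a(X[t], rng.normal())
    return X

X = simulate(x0=1.0, T=1000)
```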

We begin with some familiar background.

### 1.1 Value functions

Given a discount factor $\beta \in (0,1)$, the discounted-cost value function, defined as,

$$ h_\beta(x) := \sum_{t=0}^{\infty} \beta^t\, E[c(X(t)) \mid X(0) = x], \qquad x \in \mathbb{R}^\ell, \tag{2} $$

solves the Bellman equation: For each $x \in \mathbb{R}^\ell$,

$$ c(x) + \beta\, E[h_\beta(X(t+1)) \mid X(t) = x] - h_\beta(x) = 0. \tag{3} $$

The average cost is defined as the ergodic limit,

$$ \eta = \lim_{n\to\infty} \frac{1}{n} \sum_{t=0}^{n-1} E[c(X(t)) \mid X(0) = x], \tag{4} $$

where the limit exists and is independent of $x$ under the conditions imposed below. The following relative value function is central to analysis of the average cost:

$$ h(x) := \sum_{t=0}^{\infty} \big( E[c(X(t)) \mid X(0) = x] - \eta \big). \tag{5} $$

Provided the sum (5) exists for each $x$, the relative value function solves the Poisson equation:

$$ E[h(X(t+1)) - h(X(t)) \mid X(t) = x] = -[c(x) - \eta]. \tag{6} $$
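On a finite state space, Poisson's equation (6) reduces to a linear system, which makes the definitions above easy to check numerically. The following sketch uses an illustrative three-state chain; the transition matrix and costs are stand-ins, not examples from the paper.

```python
import numpy as np

# For a finite-state chain the analogue of Poisson's equation (6) is
# (P - I) h = -(c - eta), with eta = pi^T c.  Illustrative 3-state example.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
c = np.array([1.0, 2.0, 4.0])

# Invariant distribution: left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()
eta = pi @ c                       # average cost, as in (4)

# Solve (P - I) h = -(c - eta); pin down the additive constant via pi @ h = 0.
A = np.vstack([P - np.eye(3), pi])
rhs = np.append(-(c - eta), 0.0)
h, *_ = np.linalg.lstsq(A, rhs, rcond=None)
```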

These equations and their solutions are of interest in learning theory, control engineering, and many other fields, including:

Optimal control and Markov decision processes: Policy iteration and actor-critic algorithms are designed to approximate an optimal policy using two-step procedures: First, given a policy, the associated value function is computed (or approximated), and then the policy is updated based on this value function [5, 24]. These approaches can be used for both discounted- and average-cost optimal control problems.

Algorithm design for variance reduction: Under general conditions, the asymptotic variance (i.e., the variance appearing in the central limit theorem for the ergodic averages in (4)) is naturally expressed in terms of the relative value function [2, 30]. The method of control variates is intended to reduce the asymptotic variance of various Monte Carlo methods; a version of this technique involves the construction of an approximate solution to Poisson's equation [18, 19, 27, 12, 10].

Nonlinear filtering: A recent approach to approximate nonlinear filtering requires the solution to Poisson's equation to obtain the innovation gain [44, 28]. Approximations of the solution can lead to efficient implementations of this method [40, 32, 33].

### 1.2 TD-learning and value function approximation

In most cases of practical interest, closed-form expressions for the value functions $h_\beta$ and $h$ in (2) or (6) cannot be derived. One approach to obtaining approximations is the simulation-based algorithm known as Temporal Difference (TD) learning [37, 6].

In the case of the discounted-cost value function, the goal of TD-learning is to approximate $h_\beta$ as a member of a parametrized family of functions $\{h^\theta_\beta : \theta \in \mathbb{R}^d\}$. Throughout the paper we restrict attention to linear parametrizations of the form,

$$ h^\theta_\beta = \sum_{j=1}^{d} \theta_j \psi_j, \tag{7} $$

where we write $\theta = (\theta_1, \dots, \theta_d)^{\mathsf T}$, $\psi = (\psi_1, \dots, \psi_d)^{\mathsf T}$, and we assume that the given collection of 'basis' functions $\{\psi_j\}$ is continuously differentiable.

In one variant of this technique (the LSTD(1) algorithm, described in Section 4), the optimal parameter vector $\theta^*$ is chosen as the solution to a minimum-norm problem,

$$ \theta^* = \arg\min_\theta \|h^\theta_\beta - h_\beta\|^2_\pi := \arg\min_\theta E\big[\big(h^\theta_\beta(X) - h_\beta(X)\big)^2\big], \tag{8} $$

where the expectation is with respect to $X \sim \pi$, and $\pi$ denotes the steady-state distribution of the Markov chain $\boldsymbol{X}$; more details are provided in Sections 2.1 and 4.
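When samples from $\pi$ are available and the target function is known, the minimum-norm problem (8) is an ordinary least-squares projection onto the span of the basis. The sketch below uses an illustrative basis and target, chosen to lie in the span so that the projection is exact; nothing here is specific to the paper's examples.

```python
import numpy as np

# Least-squares projection as in (8): theta* minimizes the mean squared
# error between theta^T psi(X) and the target, approximated from samples.
rng = np.random.default_rng(1)
X = rng.normal(size=5000)                  # stand-in for samples from pi

def psi(x):                                # basis functions psi_1, psi_2
    return np.stack([x, x**2], axis=-1)

h = 3.0 * X + 0.5 * X**2                   # "value function" to project

Psi = psi(X)                               # 5000 x 2 regression matrix
theta, *_ = np.linalg.lstsq(Psi, h, rcond=None)
# theta recovers (3.0, 0.5) since h lies in the span of the basis
```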

Theory for TD-learning in the discounted-cost setting is largely complete, in the sense that criteria for convergence are well understood, and the asymptotic variance of the algorithm is computable based on standard theory from stochastic approximation [7, 15, 16]. Theory and algorithms for the average-cost setting involving the relative value function are more fragmented. The optimal parameter in the analog of (8), with $h_\beta$ replaced by the relative value function $h$, can be computed using TD-learning techniques only for Markovian models that regenerate, i.e., under the assumption that there exists a single state that is visited infinitely often [29, 20, 22].

Regeneration is not a restrictive assumption in many cases. However, the asymptotic variance of these algorithms grows with the variance of inter-regeneration times. The variance can be massive even in simple examples such as the M/M/1 queue; see the final chapter of [29]. High variance is also predominantly observed in the discounted-cost case when the discount factor $\beta$ is close to $1$; see the relevant remarks in Section 1.4 below.

The differential TD-learning algorithms developed in this paper are designed to resolve these issues. The main idea is to estimate the gradient of the value function. Under the conditions imposed, the asymptotic variance of the resulting algorithms remains uniformly bounded over $\beta < 1$. And the same techniques can be applied to obtain finite-variance algorithms for approximating the relative value function for models without regeneration.

It is interesting to note that the needs of the analysis of the algorithms presented here have, in part, motivated the development of rich new convergence theory for general classes of discrete-time Markov processes [13]. Indeed, the results in Sections 2 and 3 of this paper draw heavily on the Lipschitz-norm convergence results established in [13].

### 1.3 Differential TD-learning

In the discounted-cost setting, suppose that the value function and all its potential approximations are continuously differentiable as functions of the state $x$, i.e., $h^\theta_\beta \in C^1$ for each $\theta$. In terms of the linear parametrization (7), we obtain approximations of the form:

$$ \nabla h^\theta_\beta = \sum_{j=1}^{d} \theta_j \nabla\psi_j. \tag{9} $$

The differential LSTD-learning algorithm introduced in Section 3 is designed to compute the solution to the quadratic program,

$$ \theta^* = \arg\min_\theta E\big[\|\nabla h^\theta_\beta(X) - \nabla h_\beta(X)\|_2^2\big], \qquad X \sim \pi, \tag{10} $$

where $\|\cdot\|_2$ is the usual Euclidean norm. Once the optimal parameter vector has been obtained, approximating the value function requires the addition of a constant:

$$ h^{\theta^*}_\beta = \sum_{j=1}^{d} \theta^*_j \psi_j + \kappa(\theta^*). \tag{11} $$

The mean-square optimal choice of $\kappa(\theta^*)$ is obtained on requiring,

$$ E\big[h^{\theta^*}_\beta(X) - h_\beta(X)\big] = 0, \qquad X \sim \pi. $$

A similar program can be carried out for the relative value function $h$, which, viewed as a solution to Poisson's equation (6), is unique only up to an additive constant. Therefore, we can set $\kappa \equiv 0$ in the average-cost setting.
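The quadratic program (10) differs from (8) only in that gradients, rather than function values, are matched in least squares. A minimal sketch, with an illustrative two-dimensional state, basis, and target chosen so that the answer is known exactly:

```python
import numpy as np

# The quadratic program (10): match gradients in the least-squares sense,
#   theta* = argmin_theta E[ || Grad_psi(X) theta - grad_h(X) ||_2^2 ].
# The basis, target, and sampling law are illustrative stand-ins.
rng = np.random.default_rng(2)
samples = rng.normal(size=(4000, 2))          # X in R^2, X ~ "pi"

def grad_psi(x):
    # Columns are gradients of psi_1(x) = x1^2 and psi_2(x) = x1*x2.
    return np.array([[2 * x[0], x[1]],
                     [0.0,      x[0]]])

def grad_h(x):
    # Target h(x) = x1^2 + 2*x1*x2, so grad h uses coefficients (1, 2).
    return np.array([2 * x[0] + 2 * x[1], 2 * x[0]])

# Stack the two gradient equations per sample into one least-squares problem.
G = np.concatenate([grad_psi(x) for x in samples])   # (n*2, 2)
g = np.concatenate([grad_h(x) for x in samples])     # (n*2,)
theta, *_ = np.linalg.lstsq(G, g, rcond=None)        # recovers (1.0, 2.0)
```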

### 1.4 Summary of contributions

The main contributions of this work are:

1. The introduction of the new differential Least Squares TD-learning ($\nabla$LSTD, or 'grad-LSTD') algorithm, which is applicable in both the discounted- and average-cost settings.

2. The development of appropriate conditions under which we can show that, for linear parametrizations, $\nabla$LSTD converges and solves the quadratic program (10).

3. The introduction of the family of $\nabla$LSTD($\lambda$)-learning algorithms. With $\lambda < 1$, $\nabla$LSTD($\lambda$) has smaller asymptotic variance, and it is shown that $\nabla$LSTD(1) also solves the quadratic program (10).

4. The new algorithms are applicable for models that do not have regeneration, and their asymptotic variance is uniformly bounded over all $\beta < 1$, under general conditions.

Finally, a few more remarks about the error rates of these algorithms are in order. From the definition of the value function (2), it can be expected that $h_\beta(x)$ grows at rate $(1-\beta)^{-1}$ for "most" $x$. This is why approximation methods in reinforcement learning typically take for granted that error will grow at this rate. Moreover, it is observed that variance in reinforcement learning can grow dramatically with the discount factor. In particular, it is shown in [15, 16] that variance in the standard Q-learning algorithm of Watkins is infinite when the discount factor satisfies $\beta > 1/2$.

The family of TD($\lambda$) algorithms was introduced in [37] to reduce the variance of earlier methods, but it brings its own potential challenges. Consider [41, Theorem 1], which compares the estimate $h^{\theta_\lambda}_\beta$ obtained using TD($\lambda$), with the $L_2$-optimal approximation $h^{\theta^*}_\beta$ obtained using TD(1):

$$ \|h^{\theta_\lambda}_\beta - h_\beta\|_\pi \le \frac{1 - \lambda\beta}{1 - \beta}\, \|h^{\theta^*}_\beta - h_\beta\|_\pi. \tag{12} $$

This bound suggests that the bias can grow as $(1-\beta)^{-1}$ for fixed $\lambda$.

The difficulties are more acute when we come to the average-cost problem. Consider the minimum-norm problem (8) with the relative value function $h$ in place of $h_\beta$:

$$ \theta^* = \arg\min_\theta \|h^\theta - h\|^2_\pi. \tag{13} $$

Here, for the TD($\lambda$) algorithm with $\lambda < 1$, Theorem 3 of [42] implies a bound in terms of the convergence rate $\rho$ for the Markov chain,

$$ \|h^{\theta_\lambda} - h\|_\pi \le c(\lambda, \rho)\, \|h^{\theta^*} - h\|_\pi, \tag{14} $$

in which $c(\lambda, \rho) \ge 1$ and $c(\lambda, \rho) \to 1$ as $\lambda \to 1$. However, there is no TD(1) algorithm to compute $\theta^*$ in the case of the relative value function $h$, except in special cases; cf. [29, 20, 22].

Under the assumptions imposed in this paper, we show that the gradients of these value functions are well behaved: $\{\nabla h_\beta : \beta < 1\}$ is a bounded collection of functions, and $\nabla h_\beta \to \nabla h$ uniformly on compact sets as $\beta \uparrow 1$. As a consequence, both the bias and variance of the new $\nabla$LSTD($\lambda$) algorithms are bounded over all $\beta < 1$.

The remainder of the paper is organized as follows: Basic definitions and value function representations are presented in Section 2. The $\nabla$LSTD-learning algorithm is introduced in Section 3, and the $\nabla$LSTD($\lambda$) algorithms are introduced in Section 4. Results from numerical experiments are shown in Section 5, and conclusions are contained in Section 6.

## 2 Representations and Approximations

We begin with modelling assumptions on the Markov process $\boldsymbol{X}$, and representations for the value functions and their gradients.

### 2.1 Markovian model and value function gradients

The evolution equation (1) defines a Markov chain $\boldsymbol{X}$ with transition semigroup $\{P^t : t \ge 0\}$, where $P^t$ is defined, for all times $t \ge 0$, any state $x \in \mathbb{R}^\ell$, and every measurable set $A \subseteq \mathbb{R}^\ell$, via,

$$ P^t(x, A) := \mathsf{P}_x\{X(t) \in A\} := \Pr\{X(t) \in A \mid X(0) = x\}. $$

For $t = 1$ we write $P := P^1$, so that:

$$ P(x, A) = \Pr\{a(x, N(1)) \in A\}. $$

The first set of assumptions ensures that the value functions $h_\beta$ and $h$ are well-defined. Fix a continuous function $v: \mathbb{R}^\ell \to [1, \infty)$ that serves as a weighting function. For any measurable function $f: \mathbb{R}^\ell \to \mathbb{R}$, the $v$-norm is defined as follows:

$$ \|f\|_v := \sup_x \frac{|f(x)|}{v(x)}. $$

The space of all measurable functions for which $\|f\|_v$ is finite is denoted $L^v_\infty$. Also, for any measurable function $f$ and measure $\mu$, we write $\mu(f)$ for the integral $\int f\, d\mu$.

Assumption A1:

• The Markov chain $\boldsymbol{X}$ is $v$-uniformly ergodic: It has a unique invariant probability measure $\pi$, and there is a continuous function $v: \mathbb{R}^\ell \to [1, \infty)$ and constants $b_0 < \infty$ and $0 < \rho_0 < 1$, such that, for each function $f \in L^v_\infty$,

$$ \big| E[f(X(t)) \mid X(0) = x] - \pi(f) \big| \le b_0 \rho_0^t \|f\|_v\, v(x), \qquad x \in \mathbb{R}^\ell, \ t \ge 0. \tag{15} $$

It is well known that assumption A1 is equivalent to the existence of a Lyapunov function that satisfies the drift condition (V4) of [29]. The following consequences are immediate [30, 29]:

###### Proposition 2.1.

Under assumption A1, for any cost function $c$ satisfying $\|c\|_v < \infty$, the limit in (4) exists with $\eta = \pi(c)$, and is independent of the initial state $x$. The value functions $h_\beta$ and $h$ exist as expressed in (2) and (5), and they satisfy equations (3) and (6), respectively.

Moreover, there exists a constant $b_c < \infty$ such that the following bounds hold:

$$ |h(x)| \le b_c v(x), \qquad |h_\beta(x)| \le b_c\big(v(x) + (1-\beta)^{-1}\big), \qquad |h_\beta(x) - h_\beta(y)| \le b_c\big(v(x) + v(y)\big), \qquad x, y \in \mathbb{R}^\ell. $$

The following operator-theoretic notation will simplify exposition. For any measurable function $f$, the new function $P^t f$ is defined as the conditional expectation:

$$ P^t f(x) = E_x[f(X(t))] := E[f(X(t)) \mid X(0) = x]. $$

For any $\beta \in (0,1)$, the resolvent kernel $R_\beta$ is the "z-transform" of the semigroup $\{P^t\}$,

$$ R_\beta := \sum_{t=0}^{\infty} \beta^t P^t. \tag{16} $$

Under the assumptions of Prop. 2.1, the discounted-cost value function admits the representation,

$$ h_\beta = R_\beta c, \tag{17} $$

and similarly, for the relative value function we have,

$$ h = R[c - \eta], \tag{18} $$

where we write $R := R_\beta$ for the case $\beta = 1$ [30, 25, 29].

The representations (17) and (18) are valuable in deriving the LSTD-learning algorithms [6, 29, 36]. Next, we will obtain analogous representations for the gradients:

$$ \nabla h_\beta = \nabla[R_\beta c], \qquad \nabla h = \nabla[R c]. $$
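For a finite-state chain the resolvent (16) has the closed form $R_\beta = (I - \beta P)^{-1}$, so the representation (17) can be verified directly. The two-state chain below is an illustrative stand-in:

```python
import numpy as np

# On a finite state space, R_beta = sum_t beta^t P^t = (I - beta P)^{-1},
# so (17) reads h_beta = (I - beta P)^{-1} c.  Illustrative example.
beta = 0.9
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
c = np.array([1.0, 5.0])

h_beta = np.linalg.solve(np.eye(2) - beta * P, c)

# Sanity check against the Bellman equation (3): c + beta*P h - h = 0
residual = c + beta * (P @ h_beta) - h_beta
```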

### 2.2 Representation for the gradient of a value function

In this section we describe the construction of operators $\Omega_\beta$ and $\Omega$, for which the following hold:

$$ \nabla h_\beta = \nabla[R_\beta c] = \Omega_\beta \nabla c, \qquad \nabla h = \nabla[R c] = \Omega \nabla c. \tag{19} $$

A more detailed account is given in Section 3.2, and a complete exposition of the underlying theory, together with the formal justification of the existence and the relevant properties of $\Omega_\beta$ and $\Omega$, can be found in [13].

For the sake of simplicity, here we restrict our discussion to $h_\beta$ and its gradient. But it is not hard to see that the construction below easily generalizes to $\nabla h$; again, see Section 3.2 and [13] for the relevant details.

We require the following further assumptions:

Assumption A2:

• The disturbance process $\boldsymbol{N}$ is independent of $X(0)$.

• The function $a$ is continuously differentiable in its first variable, with,

$$ \sup_{x,n} \|\nabla_x a(x, n)\| < \infty, $$

where $\|\cdot\|$ is any matrix norm, and the $\ell \times \ell$ matrix $\nabla_x a$ is defined as:

$$ [\nabla_x a(x, n)]_{i,j} := \frac{\partial}{\partial x_i} \big(a(x, n)\big)_j, \qquad 1 \le i, j \le \ell. $$

The first assumption, A2.1, is critical so that the initial state $X(0) = x$ can be regarded as a variable, with $X(t)$ being a continuous function of $x$. This together with A2.2 allows us to define the sensitivity process $\boldsymbol{S} = \{S(t)\}$, where, for each $t \ge 0$:

$$ S_{i,j}(t) := \frac{\partial X_i(t)}{\partial X_j(0)}, \qquad 1 \le i, j \le \ell. \tag{20} $$

Then $S(0) = I$, and from (1) the sensitivity process evolves according to the random linear system,

$$ S(t+1) = A(t+1) S(t), \qquad t \ge 0, \tag{21} $$

where the matrix $A(t+1)$ is defined in terms of the derivative in assumption A2.2, evaluated along the trajectory: $A(t+1) := [\nabla_x a(X(t), N(t+1))]^{\mathsf T}$.
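The sensitivity recursion (21) can be checked against a finite-difference derivative of the state with respect to the initial condition, provided the same disturbance sequence is reused for every initial condition. A scalar sketch with an illustrative map $a$ (not from the paper):

```python
import numpy as np

# Scalar model (ell = 1): X(t+1) = a(X(t), N(t+1)) with
# a(x, n) = sin(x) + 0.5*n (illustrative choice), so
# A(t+1) = d/dx a = cos(X(t)) and S(t+1) = A(t+1) S(t), S(0) = 1.
rng = np.random.default_rng(3)
T = 20
N = rng.normal(size=T)

def trajectory(x0):
    X = [x0]
    for t in range(T):
        X.append(np.sin(X[-1]) + 0.5 * N[t])    # same noise for every x0
    return X

X = trajectory(1.0)
S = 1.0
for t in range(T):
    S = np.cos(X[t]) * S                         # recursion (21)

# Compare with a finite-difference derivative of X(T) in the initial state.
eps = 1e-6
fd = (trajectory(1.0 + eps)[-1] - trajectory(1.0 - eps)[-1]) / (2 * eps)
```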

For any function $f \in C^1$, define the operator $\nabla^S$ as:

$$ \nabla^S f(X(t)) := S^{\mathsf T}(t) \nabla f(X(t)). \tag{22} $$

It follows from the chain rule that this coincides with the gradient of $f(X(t))$ with respect to the initial condition $X(0)$:

$$ \big[\nabla^S f(X(t))\big]_i = \frac{\partial f(X(t))}{\partial X_i(0)}, \qquad 1 \le i \le \ell. $$

Equation (22) motivates the introduction of a semigroup $\{Q^t : t \ge 0\}$ of operators, whose domain includes functions of the form $g = \nabla f$, with $f \in C^1$. For $t = 0$, $Q^0$ is the identity operator, and for $t \ge 1$,

$$ Q^t g(x) := E_x\big[S^{\mathsf T}(t)\, g(X(t))\big]. \tag{23} $$

Provided we can exchange the gradient and the expectation, we can write,

$$ \frac{\partial}{\partial x_i} E_x[f(X(t))] = E_x\big[[\nabla^S f(X(t))]_i\big], \qquad 1 \le i \le \ell, $$

and consequently, the following elegant formula is obtained:

$$ \nabla P^t f(x) = E_x\big[\nabla^S f(X(t))\big] = Q^t \nabla f(x), \qquad x \in \mathbb{R}^\ell. \tag{24} $$
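Formula (24) can be illustrated by Monte Carlo: averaging the pathwise quantity $S^{\mathsf T}(t)\nabla f(X(t))$ estimates the gradient of $x \mapsto E_x[f(X(t))]$. The scalar model and test function below are illustrative stand-ins; common random numbers make the finite-difference comparison sharp.

```python
import numpy as np

# Monte Carlo check of (24): grad_x E_x[f(X(t))] = E_x[S(t) f'(X(t))]
# for a scalar model a(x, n) = 0.8 sin(x) + 0.5 n (illustrative choice).
rng = np.random.default_rng(4)
t_final, n_mc = 5, 20000
noise = rng.normal(size=(n_mc, t_final))         # common random numbers

def f(x):
    return x**2

def run(x0):
    """Return X(t_final) and S(t_final) for each noise sequence."""
    X = np.full(n_mc, x0)
    S = np.ones(n_mc)
    for t in range(t_final):
        S = 0.8 * np.cos(X) * S                  # A(t+1) for the map below
        X = 0.8 * np.sin(X) + 0.5 * noise[:, t]
    return X, S

X, S = run(1.0)
lhs = np.mean(S * 2 * X)                         # E_x[S(t) f'(X(t))]

eps = 1e-5                                        # finite-difference gradient
Xp, _ = run(1.0 + eps)
Xm, _ = run(1.0 - eps)
rhs = (np.mean(f(Xp)) - np.mean(f(Xm))) / (2 * eps)
```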

Justification requires minimal assumptions on the function $f$. The proof of Prop. 2.2 is based on Lemmas A.1 and A.2, given in the Appendix.

###### Proposition 2.2.

Suppose that Assumptions A1 and A2 hold, and that $f$ and $\|\nabla f\|_2$ both lie in $L^v_\infty$. Then (24) holds, and $Q^t \nabla f$ is continuous as a function of $x$.

###### Proof.

The proof uses Lemma A.2 in the Appendix, and is based on a truncation argument slightly different from that in [13]. Let $\{\chi_n\}$ be a sequence of $C^1$ functions satisfying, for each $n \ge 1$:

• $\chi_n$ is a continuous approximation to the indicator function on the set,

$$ R_n = \{x \in \mathbb{R}^\ell : |x_i| \le n, \ 1 \le i \le \ell\}, $$

in the sense that $0 \le \chi_n \le 1$ for all $x$, $\chi_n \equiv 1$ on $R_n$, and $\chi_n \equiv 0$ on $R_{n+1}^c$.

• $\nabla\chi_n$ is continuous and uniformly bounded: $\sup_{n,x} \|\nabla\chi_n(x)\|_2 < \infty$.

On denoting $f_n = f \chi_n$, we have,

$$ \nabla f_n = \chi_n \nabla f + f \nabla\chi_n, $$

which is bounded and continuous under the assumptions of the proposition. An application of the mean value theorem combined with dominated convergence allows us to exchange differentiation and expectation:

$$ \frac{\partial}{\partial x_i} E_x[f_n(X(t))] = E_x\Big[\frac{\partial}{\partial x_i} f_n(X(t))\Big], \qquad 1 \le i \le \ell. $$

This identity is equivalent to (24) for $f_n$.

Under the assumptions of the proposition there is a constant $b < \infty$ such that $\|\nabla f_n\|_v \le b$ for each $n$. Applying the dominated convergence theorem once more gives,

$$ Q^t \nabla f(x) = \lim_{n\to\infty} Q^t \nabla f_n(x), \qquad x \in \mathbb{R}^\ell. $$

The limit is continuous by Lemma A.2, and an application of Lemma 3.6 of [13] completes the proof. ∎

Prop. 2.2 strongly suggests the representation in (19), with:

$$ \Omega_\beta := \sum_{t=0}^{\infty} \beta^t Q^t. \tag{25} $$

This is indeed justified (under additional assumptions) in [13, Theorem 2.4], and it forms the basis of the $\nabla$LSTD-learning algorithms developed in this paper.

Similarly, the representation $\nabla h = \Omega \nabla c$, with $\Omega := \Omega_\beta$ for $\beta = 1$, for the gradient of the relative value function is derived, under appropriate conditions, in [13, Theorem 2.3].

## 3 Differential LSTD-Learning

In this section we develop the new differential LSTD (or $\nabla$LSTD, or 'grad-LSTD') learning algorithms for approximating the value functions $h_\beta$ and $h$, cf. (2) and (5), associated with a cost function $c$ and a Markov chain $\boldsymbol{X}$ evolving according to the model (1), subject to assumptions A1 and A2. The algorithms are presented first, with supporting theory in Section 3.2. We concentrate mainly on the family of discounted-cost value functions $h_\beta$, $\beta \in (0,1)$. The extension to the case of the relative value function $h$ is briefly discussed in Section 3.3.

### 3.1 Differential LSTD algorithms

We begin with a review of the standard Least Squares TD-learning (LSTD) algorithm, cf. [6, 29]. We assume that the following are given: a target number of iterations $T$, together with samples $\{X(t) : 0 \le t \le T\}$ from the process, the discount factor $\beta \in (0,1)$, the basis functions $\{\psi_j\}$, and a gain sequence $\{\alpha_t\}$. Throughout the paper the gain sequence is taken to be $\alpha_t = 1/t$, $t \ge 1$.

Algorithm 1 is equivalent to the LSTD(1) algorithm of [9]; see Section 4 and [15, 16] for more details.
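The pseudocode box for Algorithm 1 did not survive extraction, so the sketch below should be read as a reconstruction of the standard LSTD(1) recursion, consistent with the quantities $M(t)$, $b(t)$, and the eligibility trace $\varphi(t)$ used in Prop. 3.1; the chain, cost, and basis are illustrative stand-ins.

```python
import numpy as np

# Reconstruction (not verbatim) of LSTD(1):
#   phi(t) = beta*phi(t-1) + psi(X(t))        (eligibility trace)
#   M(t)   = M(t-1) + psi(X(t)) psi(X(t))^T
#   b(t)   = b(t-1) + phi(t) c(X(t))
#   theta(t) = M(t)^{-1} b(t)
# Illustrative model: X(t+1) = 0.5 X(t) + N(t+1), c(x) = x^2, psi = (1, x^2).
rng = np.random.default_rng(5)
beta, T = 0.8, 50000

def step(x):
    return 0.5 * x + rng.normal()

def c(x):
    return x**2

def psi(x):
    return np.array([1.0, x**2])

x = 0.0
phi = np.zeros(2)
M = 1e-3 * np.eye(2)              # M(0) positive definite
b = np.zeros(2)
for _ in range(T):
    phi = beta * phi + psi(x)
    M += np.outer(psi(x), psi(x))
    b += phi * c(x)
    x = step(x)

theta = np.linalg.solve(M, b)     # estimate of theta* = M^{-1} b
```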

To simplify discussion we restrict to a stationary setting for the convergence results in this paper.

###### Proposition 3.1.

Suppose that assumption A1 holds, and that the functions $c$ and $\{\psi_j\}$ are in $L^v_\infty$. Suppose moreover that the matrix $M := E_\pi[\psi(X)\psi^{\mathsf T}(X)]$ is of full rank.

Then, there exists a version of the pair process $(X(t), \varphi(t))$ that is stationary on the two-sided time axis, and for any initial choice of $\theta(0)$ and $M(0)$ positive definite, Algorithm 1 is consistent:

$$ \theta^* = \lim_{t\to\infty} M^{-1}(t)\, b(t), $$

where $\theta^*$ is the least squares minimizer in (8).

###### Proof.

The existence of a stationary solution on the two-sided time interval follows directly from $v$-uniform ergodicity, and we then define, for each $t$,

$$ \varphi(t) = \sum_{i=0}^{\infty} \beta^i \psi(X(t-i)). $$

The optimal parameter can be expressed as $\theta^* = M^{-1} b$, in which $b = E[\varphi(0)\, c(X(0))]$, so the result follows from the law of large numbers for this ergodic process. ∎

In the construction of the LSTD algorithm, the optimization problem (8) is cast as a minimum-norm problem in the Hilbert space,

$$ L_2^\pi = \big\{ \text{measurable } g: \mathbb{R}^\ell \to \mathbb{R} \ : \ \|g\|^2_\pi = \langle g, g\rangle_\pi < \infty \big\}, $$

with inner-product, $\langle f, g\rangle_\pi = \int f(x) g(x)\, \pi(dx)$.

The $\nabla$LSTD algorithm presented next is based on a minimum-norm problem in a different Hilbert space. For functions $f, g \in C^1$ whose partial derivatives all lie in $L_2^\pi$, define the inner product,

$$ \langle f, g\rangle_{\pi,1} = \int \nabla f(x)^{\mathsf T} \nabla g(x)\, \pi(dx), $$

with the associated norm $\|f\|_{\pi,1} = \sqrt{\langle f, f\rangle_{\pi,1}}$. We let $L_2^{\pi,1}$ denote the set of such functions with finite norm:

$$ L_2^{\pi,1} = \big\{ f \in C^1 : \|f\|_{\pi,1} < \infty \big\}. \tag{26} $$

Two functions $f, g \in L_2^{\pi,1}$ are considered identical if $\|f - g\|_{\pi,1} = 0$. In particular, this is true if the difference $f - g$ is a constant independent of $x$.

The 'differential' version of the least-squares problem in (8), given as the nonlinear program (10), can now be recast as,

$$ \theta^* = \arg\min_\theta \|h^\theta_\beta - h_\beta\|_{\pi,1}. \tag{27} $$

The $\nabla$LSTD algorithm, defined by the recursions of Algorithm 2, solves (27).

Given a target number of iterations $T$ together with samples $\{X(t)\}$ from the process, the discount factor $\beta$, the functions $\{\psi_j\}$, and a gain sequence $\{\alpha_t\}$, we write $\nabla\psi$ for the $\ell \times d$ matrix,

$$ [\nabla\psi(x)]_{i,j} = \frac{\partial}{\partial x_i} \psi_j(x), \qquad x \in \mathbb{R}^\ell. \tag{28} $$
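The pseudocode box for Algorithm 2 also did not survive extraction. The sketch below is a reconstruction of the $\nabla$LSTD recursions implied by the trace formula (38) and the averages used in the proof of Prop. 3.5; the scalar linear model, quadratic cost, and basis are illustrative stand-ins, chosen so that the limiting coefficient of $\psi_2$ can be computed by hand.

```python
import numpy as np

# Reconstruction (not verbatim) of the grad-LSTD recursions:
#   phi(t) = grad_psi(X(t)) + beta * A(t) * phi(t-1)     (ell x d trace)
#   M(t)   = M(t-1) + grad_psi(X(t))^T grad_psi(X(t))
#   b(t)   = b(t-1) + phi(t)^T grad_c(X(t))
# Illustrative model: X(t+1) = 0.5 X(t) + N(t+1), c(x) = x^2,
# basis psi = (x, x^2); then grad h_beta(x) = 2x/(1 - 0.25*beta),
# so the coefficient of psi_2 is (2/0.8)/2 = 1.25 for beta = 0.8.
rng = np.random.default_rng(6)
beta, T, d = 0.8, 50000, 2

x, A = 0.0, 0.0                    # A(t): derivative of a along the path
phi = np.zeros((1, d))             # ell = 1
M = 1e-3 * np.eye(d)
b = np.zeros(d)
for _ in range(T):
    gpsi = np.array([[1.0, 2 * x]])          # grad psi(X(t))
    phi = gpsi + beta * A * phi              # trace update
    M += gpsi.T @ gpsi
    b += phi[0] * (2 * x)                    # phi^T grad_c, grad_c(x) = 2x
    x = 0.5 * x + rng.normal()               # a(x, n) = 0.5 x + n
    A = 0.5                                  # d/dx a, constant here

theta = np.linalg.solve(M, b)      # estimate of theta* = M^{-1} b
```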

After the estimate of the optimal choice of $\theta$ is obtained from Algorithm 2, the required estimate of $h_\beta$ is formed as,

$$ h^\theta_\beta = \theta^{\mathsf T}\psi + \kappa(\theta), $$

where,

$$ \kappa(\theta) = -\pi(h^\theta_\beta) + \eta/(1-\beta), \tag{29} $$

with $\eta$ as in (4), and with the two means $\pi(h^\theta_\beta)$ and $\eta$ given by the results of the following recursive estimates:

$$ \bar h_\beta(t) = \bar h_\beta(t-1) + \alpha_t\big(h^{\theta(t)}_\beta(X(t)) - \bar h_\beta(t-1)\big), \tag{30} $$

$$ \eta(t) = \eta(t-1) + \alpha_t\big(c(X(t)) - \eta(t-1)\big). \tag{31} $$

It is immediate that $\eta(t) \to \eta$, a.s., as $t \to \infty$, by the law of large numbers for $v$-uniformly ergodic Markov chains [30]. The convergence of $\theta(t)$ to $\theta^*$ is established in the following section.
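With gain $\alpha_t = 1/t$, the recursions (30)–(31) are just running sample means: after $T$ steps the estimate equals the empirical average of the observed quantities. A minimal sketch of (31), with illustrative data:

```python
import numpy as np

# Recursion (31) with alpha_t = 1/t computes the running mean of the costs.
rng = np.random.default_rng(7)
costs = rng.exponential(size=1000)        # stand-in for c(X(t))

eta = 0.0
for t, ct in enumerate(costs, start=1):
    eta = eta + (1.0 / t) * (ct - eta)    # recursion (31)
```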

### 3.2 Derivation and analysis

In the notation of the previous section, and recalling the definition of $\nabla\psi$ in (28), we write:

$$ M = E_\pi\big[(\nabla\psi(X))^{\mathsf T} \nabla\psi(X)\big], \tag{32} $$

$$ b = E_\pi\big[(\nabla\psi(X))^{\mathsf T} \nabla h_\beta(X)\big]. \tag{33} $$

Prop. 3.2 follows immediately from these representations, and the definition of the norm $\|\cdot\|_{\pi,1}$.

###### Proposition 3.2.

The norm appearing in (27) is quadratic in $\theta$:

$$ \|h^\theta_\beta - h_\beta\|^2_{\pi,1} = \theta^{\mathsf T} M \theta - 2 b^{\mathsf T}\theta + k, \tag{34} $$

in which, for each $1 \le i, j \le d$,

$$ M_{i,j} = \langle \psi_i, \psi_j\rangle_{\pi,1}, \qquad b_i = \langle \psi_i, h_\beta\rangle_{\pi,1}, \tag{35} $$

and $k = \|h_\beta\|^2_{\pi,1}$. Consequently, the optimizer (27) is any solution to:

$$ M\theta^* = b. \tag{36} $$

As in the standard LSTD-learning algorithm, the representation for the vector $b$ in (33) involves the function $\nabla h_\beta$, which is unknown. An alternative representation will be obtained that is amenable to recursive approximation, and it will form the basis of the $\nabla$LSTD algorithm.

The following assumption is used to justify this representation:

Assumption A3:

• For any $C^1$ functions $f, g$ satisfying $\|\nabla f\|_v < \infty$ and $\|\nabla g\|_v < \infty$, the following holds for the stationary version of the chain $\boldsymbol{X}$:

$$ \sum_{t=0}^{\infty} E_\pi\big[\big|\nabla f(X(t))^{\mathsf T} S(t) \nabla g(X(0))\big|\big] < \infty. \tag{37} $$

• The function $c$ is continuously differentiable, with $\|\nabla c\|_v < \infty$, and for some $b_1 < \infty$ and $\rho_1 < 1$,

$$ \|Q^t \nabla c(x)\|_2 \le b_1 \rho_1^t v(x), \qquad x \in \mathbb{R}^\ell, \ t \ge 0. $$

Theorem 2.1 of [13] establishes the bound in A3.2 under additional conditions on the model. The bound (37) is related to a negative Lyapunov exponent for the Markov chain [1].

Under A3 we can justify the representation (19) for the gradients of the value functions:

###### Lemma 3.3.

Suppose that assumptions A1–A3 hold, and that $\|c\|_v < \infty$. Then the two representations in (19) hold $\pi$-a.s.:

$$ \nabla h_\beta = \Omega_\beta \nabla c \quad\text{and}\quad \nabla h = \Omega \nabla c. $$
###### Proof.

Prop. 2.2 justifies the following calculation,

$$ \nabla h_{\beta,n}(x) := \nabla\Big(\sum_{t=0}^{n} \beta^t P^t c(x)\Big) = \sum_{t=0}^{n} \beta^t Q^t \nabla c(x), $$

and also implies that this gradient is continuous as a function of $x$. Assumption A3.2 implies that the right-hand side converges to $\Omega_\beta \nabla c(x)$ as $n \to \infty$. The function $\Omega_\beta \nabla c$ is continuous in $x$, since the limit is uniform on compact subsets of $\mathbb{R}^\ell$ (recall that $v$ is continuous). Lemma 3.6 of [13] then completes the proof. ∎

A stationary realization of the algorithm is established next. Lemma 3.4 follows immediately from the assumptions: the non-recursive expression for $\varphi(t)$ in (38) is immediate from the recursions in Algorithm 2.

###### Lemma 3.4.

Suppose that assumptions A1–A3 hold, and that $c$ and $\{\psi_j\}$ are in $L^v_\infty$. Then there is a version of the pair process $(X(t), \varphi(t))$ that is stationary on the two-sided time line, and for each $t$,

$$ \varphi(t) = \sum_{k=0}^{\infty} \beta^k \big[\Theta^{t-k} S(k)\big] \nabla\psi(X(t-k)), \tag{38} $$

where $\Theta^k$ denotes the shift operator on sample space defined in the proof of Lemma 3.6 below.

The remainder of this section consists of a proof of the following proposition.

###### Proposition 3.5.

Suppose that assumptions A1–A3 hold, and that $c$, $\{\psi_j\}$, and their gradients are in $L^v_\infty$. Suppose moreover that the matrix $M$ in (32) is of full rank. Then, for the stationary process $(X(t), \varphi(t))$, the $\nabla$LSTD-learning algorithm is consistent: For any initial $\theta(0)$ and $M(0)$ positive definite,

$$ \theta^* = \lim_{t\to\infty} M^{-1}(t)\, b(t). $$

Moreover, with probability one,

$$ \eta = \lim_{t\to\infty} \eta(t), \qquad \pi(h^{\theta^*}_\beta) = \lim_{t\to\infty} \bar h_\beta(t), $$

and hence $\kappa(\theta(t)) \to \kappa(\theta^*)$ as $t \to \infty$.

We begin with a representation of the vector $b$:

###### Lemma 3.6.

Under the assumptions of Prop. 3.5,

 b\tiny\it T =∞∑t=0βtE[(S\tiny\it T(t)∇c(X(t)))\tiny\it T∇ψ(X(0))] (39) =E[(∇c(X(0)))\tiny\it Tφ(0)].
###### Proof.

The following shift-operator on sample space is defined for a stationary version of $(\boldsymbol{X}, \boldsymbol{N})$: For a random variable of the form

$$ Z = F\big(X(r), N(r), \dots, X(s), N(s)\big), \qquad \text{with } r \le s, $$

we denote, for any integer $k$,

$$ \Theta^k Z = F\big(X(r+k), N(r+k), \dots, X(s+k), N(s+k)\big). $$

Consequently, viewing $S(t)$ as a function of the pair process as in the evolution equation (21), we have:

$$ \Theta^k S(t) = A(t+k) \cdots A(2+k)\, A(1+k). \tag{40} $$

The representation (19) for $\nabla h_\beta$ is valid under assumption A3, by Lemma 3.3. Using this and (21) gives the first representation in (39):

$$ b^{\mathsf T} = \int \big(\Omega_\beta \nabla c(x)\big)^{\mathsf T} \nabla\psi(x)\, \pi(dx) = \sum_{t=0}^{\infty} \beta^t \int E_x\big[(S^{\mathsf T}(t)\nabla c(X(t)))^{\mathsf T}\big] \nabla\psi(x)\, \pi(dx) = \sum_{t=0}^{\infty} \beta^t E\big[(S^{\mathsf T}(t)\nabla c(X(t)))^{\mathsf T} \nabla\psi(X(0))\big]. \tag{41} $$

Stationarity implies that for any $k$,

$$ E\big[(S^{\mathsf T}(t)\nabla c(X(t)))^{\mathsf T} \nabla\psi(X(0))\big] = E\big[\big([\Theta^k S^{\mathsf T}(t)]\nabla c(X(t+k))\big)^{\mathsf T} \nabla\psi(X(k))\big]. $$

Setting $k = -t$, the first representation in (39) becomes:

$$ b^{\mathsf T} = \sum_{t=0}^{\infty} \beta^t E\big[(\nabla c(X(0)))^{\mathsf T} (\Theta^{-t} S(t)) \nabla\psi(X(-t))\big] = E\Big[(\nabla c(X(0)))^{\mathsf T} \Big(\sum_{t=0}^{\infty} \beta^t (\Theta^{-t} S(t)) \nabla\psi(X(-t))\Big)\Big], $$

where the last equality is obtained under assumption A3 by applying Fubini's theorem. This combined with (38) completes the proof. ∎

##### Proof of Prop. 3.5

Lemma 3.6 combined with the stationarity assumption implies that,

$$ \lim_{T\to\infty} \frac{1}{T}\, b(T) = \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \varphi^{\mathsf T}(t) \nabla c(X(t)) = E\big[\varphi^{\mathsf T}(0)\nabla c(X(0))\big] = b. $$

Similarly, for each $T$ we have,

$$ M(T) = M(0) + \sum_{t=1}^{T} (\nabla\psi(X(t)))^{\mathsf T} \nabla\psi(X(t)), $$

and by the law of large numbers we once again obtain:

$$ \lim_{T\to\infty} \frac{1}{T}\, M(T) = M. $$

Combining these results establishes $\theta^* = \lim_{T\to\infty} M^{-1}(T)\, b(T)$.

Convergence of $\eta(t)$ in (31) is identical, and convergence of $\bar h_\beta(t)$ in (30) also follows from the law of large numbers, since we have convergence of $\theta(t)$. ∎

### 3.3 Extension to average cost

The $\nabla$LSTD recursion of Algorithm 2 is also consistent in the case $\beta = 1$, which corresponds to the relative value function $h$ in place of the discounted-cost value function $h_\beta$. Although we do not repeat the details of the analysis here, we observe that nowhere in the proof of Prop. 3.5 do we use the assumption that $\beta < 1$. Indeed, it is not difficult to establish that, under the conditions of the proposition, the $\nabla$LSTD-learning algorithm is also convergent when $\beta = 1$, and that the limit solves the quadratic program:

$$ \theta^* = \arg\min_\theta \|h^\theta - h\|_{\pi,1}. $$

## 4 Differential LSTD(λ)-Learning

In this section we introduce a Galerkin approach for the construction of the new differential LSTD($\lambda$) (or $\nabla$LSTD($\lambda$), or 'grad'-LSTD($\lambda$)) algorithms. The relationship between TD-learning algorithms and the Galerkin relaxation has a long history; see [17, 21, 31] and [41], and also [45, 4, 39] for more recent discussions.

The algorithms developed here offer approximations for the value functions $h_\beta$ and $h$ associated with a cost function $c$ and a Markov chain $\boldsymbol{X}$, under the same conditions as in Section 3. Again, we concentrate on the discounted-cost value functions $h_\beta$, $\beta \in (0,1)$. The extension to the relative value function is straightforward, following along the same lines as in Section 3.3, and thus omitted.

The starting point of the development of the Galerkin approach in this context is the Bellman equation (3). Since we want to approximate the gradient of the discounted-cost value function $h_\beta$, it is natural to begin with the 'differential' version of (3), i.e., taking gradients,

$$ \nabla c + \beta Q \nabla h_\beta - \nabla h_\beta = 0, \tag{42} $$

where we used the identity '$\nabla P = Q\nabla$' from Prop. 2.2, with $Q := Q^1$. Equivalently, using the definitions of $Q$ and $S$ in terms of the sensitivity process, this can be stated as the requirement that the expectations,

$$ E_\pi\Big[Z(t)\big(\nabla c(X(t)) + \beta A^{\mathsf T}(t+1)\nabla h_\beta(X(t+1)) - \nabla h_\beta(X(t))\big)\Big], $$

are identically equal to zero, for a 'large enough' class of random matrices $\{Z(t)\}$.

The Galerkin approach is simply a relaxation of this requirement: A specific -dimensional, stationary process