A central task in the application of many machine learning methods and control techniques is the (exact or approximate) computation of value functions arising from Markov decision processes. The class of Temporal Difference (TD) learning algorithms considered in this work is an important sub-class of the general family of reinforcement learning methods. Our main contributions here are the introduction of a related family of TD-learning algorithms that enjoy much better convergence properties than existing methods, and the rigorous theoretical analysis of these algorithms.
The value functions considered in this work are based on a discrete-time Markov chain taking values in , and on an associated cost function . Our central modelling assumption throughout is that evolves according to the nonlinear state space model,
where is an -dimensional disturbance sequence of independent and identically distributed (i.i.d.) random variables, and is continuous. Under these assumptions, is a continuous function of the initial condition ; this observation is our starting point for the construction of effective algorithms for value function approximation.
We begin with some familiar background.
1.1 Value functions
Given a discount factor , the discounted-cost value function, defined as,
solves the Bellman equation: For each ,
The average cost is defined as the ergodic limit,
where the limit exists and is independent of under the conditions imposed below. The following relative value function is central to analysis of the average cost:
Provided the sum (5) exists for each , the relative value function solves the Poisson equation:
These equations and their solutions are of interest in learning theory, control engineering, and many other fields, including:
Optimal control and Markov decision processes: Policy iteration and actor-critic algorithms are designed to approximate an optimal policy using two-step procedures: First, given a policy, the associated value function is computed (or approximated), and then the policy is updated based on this value function [5, 24]. These approaches can be used for both discounted- and average-cost optimal control problems.
Algorithm design for variance reduction: Under general conditions, the asymptotic variance (i.e., the variance appearing in the central limit theorem for the ergodic averages in (4)) is naturally expressed in terms of the relative value function [2, 30]. The method of control variates is intended to reduce the asymptotic variance of various Monte Carlo methods; a version of this technique involves the construction of an approximate solution to Poisson’s equation [18, 19, 27, 12, 10].
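As a purely illustrative aside, the Bellman and Poisson equations of this subsection can be checked numerically on a small finite-state chain, where both value functions reduce to linear algebra. The sketch below is not the continuous-state model studied in the paper: the transition matrix `P`, the cost vector `c`, and the names `beta`, `eta`, `h_beta`, `h` are assumptions introduced only for this example.

```python
import numpy as np

# Hypothetical 3-state chain (illustration only, not the paper's model).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
c = np.array([1.0, 0.0, 2.0])   # cost in each state
beta = 0.9                      # discount factor
I = np.eye(3)

# Discounted-cost value function: sum_t beta^t P^t c = (I - beta P)^{-1} c.
h_beta = np.linalg.solve(I - beta * P, c)
bellman_residual = c + beta * P @ h_beta - h_beta

# Invariant distribution: solve pi (I - P) = 0 together with sum(pi) = 1.
A = np.vstack([(I - P).T, np.ones(3)])
rhs = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, rhs, rcond=None)[0]
eta = pi @ c                    # average cost

# Relative value function: sum_t P^t (c - eta), computed with the
# fundamental matrix (I - P + 1 pi)^{-1}.
h = np.linalg.solve(I - P + np.outer(np.ones(3), pi), c - eta)
poisson_residual = P @ h - h + (c - eta)

print("Bellman residual:", bellman_residual)
print("Poisson residual:", poisson_residual)
```

Both residuals vanish up to rounding error. Adding any constant to `h` leaves the Poisson residual unchanged, which reflects the fact that the relative value function is determined only up to an additive constant.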
1.2 TD-learning and value function approximation
In most cases of practical interest, closed-form expressions for the value functions and in (2) or (6) cannot be derived. One approach to obtaining approximations is the simulation-based algorithm known as Temporal Difference (TD) learning [37, 6].
In the case of the discounted-cost value function, the goal of TD-learning is to approximate as a member of a parametrized family of functions . Throughout the paper we restrict attention to linear parametrizations of the form,
where we write , , and we assume that the given collection of ‘basis’ functions is continuously differentiable.
In one variant of this technique (the LSTD(1) algorithm, described in Section 4), the optimal parameter vector is chosen as the solution to a minimum-norm problem,
Theory for TD-learning in the discounted-cost setting is largely complete, in the sense that criteria for convergence are well understood, and the asymptotic variance of the algorithm is computable based on standard theory from stochastic approximation [7, 15, 16]. Theory and algorithms for the average-cost setting, which involves the relative value function, are more fragmented. The optimal parameter in the analog of (8), with replaced by the relative value function , can be computed using TD-learning techniques only for Markovian models that regenerate, i.e., under the assumption that there exists a single state that is visited infinitely often [29, 20, 22].
Regeneration is not a restrictive assumption in many cases. However, the asymptotic variance of these algorithms grows with the variance of inter-regeneration times. The variance can be massive even in simple examples such as the M/M/1 queue; see the final chapter of . High variance is also predominantly observed in the discounted-cost case when the discounting factor is close to ; see the relevant remarks in Section 1.4 below.
TD-learning algorithms developed in this paper are designed to resolve these issues. The main idea is to estimate the gradient of the value function. Under the conditions imposed, the asymptotic variance of the resulting algorithms remains uniformly bounded over . And the same techniques can be applied to obtain finite-variance algorithms for approximating the relative value function for models without regeneration.
It is interesting to note that the needs of the analysis of the algorithms presented here have, in part, motivated the development of rich new convergence theory for general classes of discrete-time Markov processes . Indeed, the results in Sections 2 and 3 of this paper draw heavily on the Lipschitz-norm convergence results established in .
1.3 Differential TD-learning
In the discounted-cost setting, suppose that the value function and all its potential approximations are continuously differentiable as functions of the state , i.e., , for each . In terms of the linear parametrization (7), we obtain approximations of the form:
The differential LSTD-learning algorithm introduced in Section 3 is designed to compute the solution to the quadratic program,
where is the usual Euclidean norm. Once the optimal parameter vector has been obtained, approximating the value function requires the addition of a constant:
The mean-square optimal choice of is obtained on requiring,
A similar program can be carried out for the relative value function , which, viewed as a solution to Poisson’s equation (6), is unique only up to an additive constant. Therefore, we can set in the average-cost setting.
1.4 Summary of contributions
The main contributions of this work are:
The introduction of the new differential Least Squares TD-learning (∇-LSTD, or ‘grad-LSTD’) algorithm, which is applicable in both the discounted- and average-cost settings.
The development of appropriate conditions under which we can show that, for linear parametrizations, ∇-LSTD converges and solves the quadratic program (10).
The introduction of the family of LSTD()-learning algorithms. With , LSTD() has smaller asymptotic variance, and it is shown that LSTD() also solves the quadratic program (10).
The new algorithms are applicable for models that do not have regeneration, and their asymptotic variance is uniformly bounded over all , under general conditions.
Finally, a few more remarks about the error rates of these algorithms are in order. From the definition of the value function (2), it can be expected that at rate for “most” . This is why approximation methods in reinforcement learning typically take for granted that error will grow at this rate. Moreover, it is observed that variance in reinforcement learning can grow dramatically with the discount factor. In particular, it is shown in [15, 16] that variance in the standard Q-learning algorithm of Watkins is infinite when the discount factor satisfies .
The family of TD() algorithms was introduced in  to reduce the variance of earlier methods, but it brings its own potential challenges. Consider [41, Theorem 1], which compares the estimate obtained using TD(), with the -optimal approximation obtained using TD():
This bound suggests that the bias can grow as for fixed .
The difficulties are more acute when we come to the average-cost problem. Consider the minimum-norm problem (8) with the relative value function in place of :
Here, for the TD() algorithm with , Theorem 3 of  implies a bound in terms of the convergence rate for the Markov chain,
Under the assumptions imposed in this paper, we show that the gradients of these value functions are well behaved: is a bounded collection of functions, and uniformly on compact sets. As a consequence, both the bias and variance of the new ∇-LSTD(λ) algorithms are bounded over all .
The remainder of the paper is organized as follows: Basic definitions and value function representations are presented in Section 2. The ∇-LSTD-learning algorithm is introduced in Section 3, and the ∇-LSTD(λ) algorithms are introduced in Section 4. Results from numerical experiments are shown in Section 5, and conclusions are contained in Section 6.
2 Representations and Approximations
We begin with modelling assumptions on the Markov process , and representations for the value functions and their gradients.
2.1 Markovian model and value function gradients
The evolution equation (1) defines a Markov chain with transition semigroup , where is defined, for all times , any state , and every measurable , via,
For we write , so that:
The first set of assumptions ensures that the value functions and are well-defined. Fix a continuous function that serves as a weighting function. For any measurable function , the -norm is defined as follows:
The space of all measurable functions for which is finite is denoted . Also, for any measurable function and measure , we write for the integral,
A1. The Markov chain is -uniformly ergodic: It has a unique invariant probability measure, and there is a continuous function and constants and , such that, for each function ,
Proposition 2.1. Under assumption A1, for any cost function such that , the limit in (4) exists with , and is independent of the initial state . The value functions and exist as expressed in (2) and (5), and they satisfy equations (3) and (6), respectively.
Moreover, there exists a constant such that the following bounds hold:
The following operator-theoretic notation will simplify exposition. For any measurable function , the new function is defined as the conditional expectation:
For any , the resolvent kernel is the “-transform” of the semigroup ,
Under the assumptions of Prop. 2.1, the discounted-cost value function admits the representation,
and similarly, for the relative value function we have,
2.2 Representation for the gradient of a value function
In this section we describe the construction of operators and , for which the following hold:
A more detailed account is given in Section 3.2, and a complete exposition of the underlying theory together with the formal justification of the existence and the relevant properties of and can be found in .
For the sake of simplicity, here we restrict our discussion to and its gradient. But it is not hard to see that the construction below easily generalizes to ; again, see Section 3.2 and  for the relevant details.
We require the following further assumptions:
A2.1. The disturbance process is independent of .
A2.2. The function is continuously differentiable in its first variable, with,
where is any matrix norm, and the matrix is defined as:
The first assumption, A2.1, is critical so that the initial state can be regarded as a variable, with being a continuous function of . This together with A2.2 allows us to define the sensitivity process , where, for each :
Then and from (1) the sensitivity process evolves according to the random linear system,
where the matrix is defined as in assumption A2.2, by .
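The sensitivity recursion can be illustrated with a short numerical sketch. The scalar model `a(x, n) = 0.5 x + 0.3 sin(x) + n` and all names below are hypothetical choices made only for this example. The code propagates the state and the sensitivity together, then compares the sensitivity with a central finite-difference derivative of the state with respect to the initial condition, computed along the same disturbance sequence.

```python
import numpy as np

def a(x, n):
    """Hypothetical scalar state-space map (illustration only)."""
    return 0.5 * x + 0.3 * np.sin(x) + n

def da_dx(x, n):
    """Partial derivative of a with respect to its first variable."""
    return 0.5 + 0.3 * np.cos(x)

def simulate(x0, noise):
    """Return the final state and the final sensitivity, starting from S(0) = 1."""
    x, s = x0, 1.0
    for n in noise:
        s = da_dx(x, n) * s   # linear recursion driven by the current state
        x = a(x, n)           # state update with the same disturbance
    return x, s

rng = np.random.default_rng(0)
noise = rng.normal(size=10)   # one fixed disturbance path
x0, eps = 0.7, 1e-5

x_T, s_T = simulate(x0, noise)
# Central finite difference of the final state in the initial condition.
fd = (simulate(x0 + eps, noise)[0] - simulate(x0 - eps, noise)[0]) / (2 * eps)

print("sensitivity:", s_T, "finite difference:", fd)
```

The two numbers agree up to discretization error. Holding the disturbance path fixed while the initial condition is perturbed is what the independence assumption A2.1 makes legitimate.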
For any function , define the operator as:
It follows from the chain rule that this coincides with the gradient of with respect to the initial condition:
Equation (22) motivates the introduction of a semigroup of operators, whose domain includes functions of the form , with for each . For , is the identity operator, and for ,
Provided we can exchange the gradient and the expectation, we can write,
and consequently, the following elegant formula is obtained:
Proposition 2.2. Suppose that Assumptions A1 and A2 hold, and that and both lie in . Then (24) holds, and is continuous as a function of .
is a continuous approximation to the indicator function on the set,
in the sense that for all , on , and on .
is continuous and uniformly bounded: .
On denoting , we have,
which is bounded and continuous under the assumptions of the proposition. An application of the mean value theorem combined with dominated convergence allows us to exchange differentiation and expectation:
This identity is equivalent to (24) for .
This is indeed justified (under additional assumptions) in [13, Theorem 2.4], and it forms the basis of the LSTD-learning algorithms developed in this paper.
Similarly, the representation with for the gradient of the relative value function is derived, under appropriate conditions, in [13, Theorem 2.3].
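A Monte Carlo sketch may help make the gradient representation concrete. It assumes that, after exchanging gradient and expectation, the gradient of the discounted-cost value function at a point equals the expected discounted sum of the cost gradient along the trajectory, weighted by the sensitivity process. The function `grad_value_mc`, its arguments, and the linear test model are hypothetical names and choices introduced for illustration, with a scalar state and standard Gaussian disturbances.

```python
import numpy as np

def grad_value_mc(x0, beta, step, dstep, dcost,
                  n_paths=20000, horizon=100, seed=0):
    """Estimate the gradient of the discounted value function at x0 by
    averaging sum_t beta^t S(t) c'(X(t)) over independent paths."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, float(x0))
    s = np.ones(n_paths)              # sensitivity, S(0) = 1
    total = s * dcost(x)              # t = 0 term
    discount = 1.0
    for _ in range(horizon):
        n = rng.normal(size=n_paths)
        s = dstep(x, n) * s           # sensitivity update uses the current state
        x = step(x, n)
        discount *= beta
        total += discount * s * dcost(x)
    return total.mean()

# Linear test model X(t+1) = 0.5 X(t) + N(t+1) with cost c(x) = x^2,
# for which the gradient is 2 x / (1 - beta * 0.5^2) in closed form.
a_coef, beta, x0 = 0.5, 0.9, 1.0
estimate = grad_value_mc(
    x0, beta,
    step=lambda x, n: a_coef * x + n,
    dstep=lambda x, n: np.full_like(x, a_coef),
    dcost=lambda x: 2.0 * x,
)
exact = 2.0 * x0 / (1.0 - beta * a_coef ** 2)
print("estimate:", estimate, "exact:", exact)
```

In this example the weights decay like the product of the discount factor and the sensitivity, so the summands stay small even when the discount factor is close to one. This is the mechanism behind the bounded-variance claims made for the gradient-based algorithms.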
3 Differential LSTD-Learning
In this section we develop the new differential LSTD (or ∇-LSTD, or ‘grad-LSTD’) learning algorithms for approximating the value functions and , cf. (2) and (5), associated with a cost function and a Markov chain evolving according to the model (1), subject to assumptions A1 and A2. The algorithms are presented first, with supporting theory in Section 3.2. We concentrate mainly on the family of discounted-cost value functions , . The extension to the case of the relative value function is briefly discussed in Section 3.3.
3.1 Differential LSTD algorithms
We begin with a review of the standard Least Squares TD-learning (LSTD) algorithm, cf. [6, 29]. We assume that the following are given: A target number of iterations together with samples from the process , the discount factor , the functions , and a gain sequence . Throughout the paper the gain sequence is taken to be , .
To simplify discussion we restrict to a stationary setting for the convergence results in this paper.
Suppose that assumption A1 holds, and that the functions and are in . Suppose moreover that the matrix is of full rank.
The existence of a stationary solution on the two-sided time interval follows directly from -uniform ergodicity, and we then define, for each ,
The optimal parameter can be expressed in which , so the result follows from the law of large numbers for this ergodic process.
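For orientation, the following is a minimal sketch of a standard LSTD(1)-type recursion with gain 1/t, written under assumed notation: an eligibility vector `phi` accumulates discounted basis values, `b` averages the product of `phi` with the current cost, and `M` averages the outer product of the basis vector with itself. It is not claimed to reproduce the exact statement of the algorithm in the text. The linear Gaussian model, the basis `(1, x^2)`, and all function names are assumptions chosen so that the value function is exactly representable and the output can be compared with a closed form.

```python
import numpy as np

def lstd(xs, cost, psi, beta):
    """LSTD(1)-style estimate of the parameter from a single sample path."""
    d = len(psi(xs[0]))
    phi = np.zeros(d)                      # eligibility vector
    b = np.zeros(d)
    M = np.zeros((d, d))
    for t, x in enumerate(xs, start=1):
        p = psi(x)
        phi = beta * phi + p               # discounted sum of past basis values
        alpha = 1.0 / t                    # gain sequence 1/t (running average)
        b += alpha * (phi * cost(x) - b)
        M += alpha * (np.outer(p, p) - M)
    return np.linalg.solve(M, b)

# Hypothetical model: X(t+1) = a X(t) + N(t+1), N standard normal, c(x) = x^2.
rng = np.random.default_rng(1)
a_coef, beta, T = 0.5, 0.5, 100_000
noise = rng.normal(size=T)
xs = np.empty(T)
x = 0.0
for t in range(T):
    x = a_coef * x + noise[t]
    xs[t] = x

theta = lstd(xs, cost=lambda x: x * x,
             psi=lambda x: np.array([1.0, x * x]), beta=beta)

# Closed form for this model: h(x) = k0 + k2 x^2.
k2 = 1.0 / (1.0 - beta * a_coef ** 2)
k0 = (1.0 / (1.0 - a_coef ** 2)) * (1.0 / (1.0 - beta) - k2)
print("theta:", theta, "closed form:", (k0, k2))
```

The discount factor is kept moderate here on purpose. As it approaches one, the eligibility vector grows and the estimate becomes noisier, which is the variance issue the differential algorithms are meant to address.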
In the construction of the LSTD algorithm, the optimization problem (8) is cast as a minimum-norm problem in the Hilbert space,
with inner-product, .
The ∇-LSTD algorithm presented next is based on a minimum-norm problem in a different Hilbert space. For functions , , for which each , , define the inner product,
with the associated norm . We let denote the set of functions with finite norm:
Two functions are considered identical if . In particular, this is true if the difference is a constant independent of .
The ∇-LSTD algorithm, defined by the following set of recursions, solves (27).
Given a target number of iterations together with samples from the process , the discount factor , the functions , and a gain sequence , we write for the matrix,
After the estimate of the optimal choice of is obtained from Algorithm 2, the required estimate of is formed as,
with as in (4), and with the two means and given by the results of the following recursive estimates:
It is immediate that , a.s., as , by the law of large numbers for -uniformly ergodic Markov chains . The convergence of to is established in the following section.
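The construction of this subsection can be sketched for a scalar state as follows. The recursion below is one plausible reading, derived from the sensitivity representation of the gradient rather than copied from the text: the eligibility vector is multiplied by the discount factor and by the derivative of the state map, the basis gradient is added, and the two running averages use gradients in place of function values. The constant is then chosen so that the approximation and the value function have the same stationary mean, using the fact that the stationary mean of the discounted value function is the average cost divided by one minus the discount factor. All names, the linear Gaussian model, and the basis `(x, x^2)` are assumptions for illustration.

```python
import numpy as np

def grad_lstd(xs, A, dcost, dpsi, beta):
    """Differential LSTD sketch for a scalar state.
    xs[t] is the state and A[t] the derivative of the state map
    that produced xs[t] from the previous state."""
    d = len(dpsi(xs[0]))
    phi = np.zeros(d)
    b = np.zeros(d)
    M = np.zeros((d, d))
    for t, (x, a_t) in enumerate(zip(xs, A), start=1):
        g = dpsi(x)
        phi = beta * a_t * phi + g         # sensitivity-weighted eligibility
        alpha = 1.0 / t
        b += alpha * (phi * dcost(x) - b)
        M += alpha * (np.outer(g, g) - M)
    return np.linalg.solve(M, b)

# Hypothetical model: X(t+1) = a X(t) + N(t+1), c(x) = x^2, basis (x, x^2).
rng = np.random.default_rng(2)
a_coef, beta, T = 0.5, 0.9, 100_000
noise = rng.normal(size=T)
xs = np.empty(T)
x = 0.0
for t in range(T):
    x = a_coef * x + noise[t]
    xs[t] = x
A = np.full(T, a_coef)                     # derivative of a linear map is constant

theta = grad_lstd(xs, A, dcost=lambda x: 2.0 * x,
                  dpsi=lambda x: np.array([1.0, 2.0 * x]), beta=beta)

# Additive constant: match the stationary means of the two functions.
eta_hat = np.mean(xs ** 2)                 # estimate of the average cost
psi_bar = np.array([np.mean(xs), np.mean(xs ** 2)])
kappa = eta_hat / (1.0 - beta) - theta @ psi_bar

k2 = 1.0 / (1.0 - beta * a_coef ** 2)      # closed-form coefficient of x^2
k0 = (1.0 / (1.0 - a_coef ** 2)) * (1.0 / (1.0 - beta) - k2)
print("theta:", theta, "kappa:", kappa, "closed form:", (0.0, k2, k0))
```

Constant basis functions are omitted because their gradient is zero and they would make the matrix singular; the constant is recovered separately. The formula used for `kappa` applies only to the discounted case.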
3.2 Derivation and analysis
In the notation of the previous section, and recalling the definition of in (28), we write:
Prop. 3.2 follows immediately from these representations, and the definition of the norm .
As in the standard LSTD-learning algorithm, the representation for the vector in (33) involves the function , which is unknown. An alternative representation will be obtained, which is amenable to recursive approximation and will form the basis of the ∇-LSTD algorithm.
The following assumption is used to justify this representation:
A3.1. For any functions satisfying and , the following holds for the stationary version of the chain :
A3.2. The function is continuously differentiable, and , and for some and ,
Under A3 we can justify the representation for the gradient of the value functions:
Lemma 3.3. Suppose that assumptions A1–A3 hold, and that . Then the two representations in (19) hold -a.s.:
Prop. 2.2 justifies the following calculation,
and also implies that this gradient is continuous as a function of . Assumption A3.2 implies that the right-hand side converges to as . The function is continuous in , since the limit is uniform on compact subsets of (recall that is continuous). Lemma 3.6 of  then completes the proof.
A stationary realization of the algorithm is established next. Lemma 3.4 follows immediately from the assumptions: The non-recursive expression for in (38) is immediate from the recursions in Algorithm 2.
Lemma 3.4. Suppose that assumptions A1–A3 hold, and that and are in . Then there is a version of the pair process that is stationary on the two-sided time line, and for each ,
The remainder of this section consists of a proof of the following proposition.
Proposition 3.5. Suppose that assumptions A1–A3 hold, and that , , and are in . Suppose moreover that the matrix in (32) is of full rank. Then, for the stationary process , the ∇-LSTD-learning algorithm is consistent: For any initial and positive definite,
Moreover, with probability one,
and hence .
We begin with a representation of :
Lemma 3.6. Under the assumptions of Prop. 3.5,
The following shift-operator on sample space is defined for a stationary version of : For a random variable of the form
we denote, for any integer ,
Consequently, viewing as a function of as in the evolution equation (21), we have:
Stationarity implies that for any ,
Setting , the first representation in (39) becomes:
where the last equality is obtained under assumption A3 by applying Fubini’s theorem. This combined with (38) completes the proof.
Proof of Prop. 3.5
Lemma 3.6 combined with the stationarity assumption implies that,
Similarly, for each we have,
and by the law of large numbers we once again obtain:
Combining these results establishes .
3.3 Extension to average cost
The ∇-LSTD recursion of Algorithm 2 is also consistent in the case , which corresponds to the relative value function in place of the discounted-cost value function . Although we do not repeat the details of the analysis here, we observe that nowhere in the proof of Prop. 3.5 do we use the assumption that . Indeed, it is not difficult to establish that, under the conditions of the proposition, the ∇-LSTD-learning algorithm is also convergent when , and that the limit solves the quadratic program:
4 Differential LSTD(λ)-Learning
In this section we introduce a Galerkin approach for the construction of the new differential LSTD(λ) (or ∇-LSTD(λ), or ‘grad-LSTD(λ)’) algorithms. The relationship between TD-learning algorithms and the Galerkin relaxation has a long history; see [17, 21, 31] and , and also [45, 4, 39] for more recent discussions.
The algorithms developed here offer approximations for the value functions and associated with a cost function and a Markov chain , under the same conditions as in Section 3. Again, we concentrate on the discounted-cost value functions , . The extension to the relative value function is straightforward, following along the same lines as in Section 3.3, and thus omitted.
The starting point of the development of the Galerkin approach in this context is the Bellman equation (3). Since we want to approximate the gradient of the discounted-cost value function , it is natural to begin with the ‘differential’ version of (3), i.e., taking gradients,
where we used the identity ‘’ from Prop. 2.2 . Equivalently, using the definitions of and in terms of the sensitivity process, this can be stated as the requirement that the expectations,
are identically equal to zero, for a ‘large enough’ class of random matrices .
The Galerkin approach is simply a relaxation of this requirement: A specific -dimensional, stationary process