Finite Sample Analyses for TD(0) with Function Approximation

by   Gal Dalal, et al.

TD(0) is one of the most commonly used algorithms in reinforcement learning. Despite this, there is no existing finite sample analysis for TD(0) with function approximation, even for the linear case. Our work is the first to provide such results. Existing convergence rates for Temporal Difference (TD) methods apply only to somewhat modified versions, e.g., projected variants or ones where stepsizes depend on unknown problem parameters. Our analyses obviate these artificial alterations by exploiting strong properties of TD(0). We provide convergence rates both in expectation and with high probability. The two are obtained via different approaches that use relatively unknown, recently developed stochastic approximation techniques.



1 Introduction

Temporal Difference (TD) algorithms lie at the core of Reinforcement Learning (RL), dominated by the celebrated TD(0) algorithm. The term was coined in [Sutton and Barto1998], describing an iterative process of updating an estimate of a value function $V^\pi(s)$ with respect to a given policy $\pi$ based on temporally successive samples. The classical version of the algorithm uses a tabular representation, i.e., entry-wise storage of the value estimate for each state $s \in \mathcal{S}$. However, in many problems, the state-space $\mathcal{S}$ is too large for such a vanilla approach. The common practice to mitigate this caveat is to approximate the value function using some parameterized family. Often, linear regression is used, i.e., $V^\pi(s) \approx \theta^\top \phi(s)$. This allows for an efficient implementation of TD(0) even on large state-spaces and has been shown to perform well in a variety of problems [Tesauro1995, Powell2007]. More recently, TD(0) has become prominent in many state-of-the-art RL solutions when combined with deep neural network architectures, as an integral part of fitted value iteration [Mnih et al.2015, Silver et al.2016]. In this work we focus on the former case of linear Function Approximation (FA); nevertheless, we consider this work a preliminary milestone en route to achieving theoretical guarantees for non-linear RL architectures.

Two types of convergence rate results exist in the literature: in expectation and with high probability. We stress that no results of either type exist for the actual, commonly used, TD(0) algorithm with linear FA; our work is the first to provide such results. In fact, it is the first work to give a convergence rate for an unaltered online TD algorithm of any type. We emphasize that TD(0) with linear FA is formulated and used with non-problem-specific stepsizes. Also, it does not require a projection step to keep $\theta$ in a ‘nice’ set. In contrast, the few recent works that managed to provide convergence rates for TD(0) analyzed only altered versions of it. These modifications include a projection step and eigenvalue-dependent stepsizes, or they apply only to the average of iterates; we expand on this in the coming section.

Existing Literature

The first TD(0) convergence result was obtained by [Tsitsiklis, Van Roy, and others1997] for both finite and infinite state-spaces. Following that, a key result by [Borkar and Meyn2000] paved the path to a unified and convenient tool for convergence analyses of Stochastic Approximation (SA), and hence of TD algorithms. This tool is based on the Ordinary Differential Equation (ODE) method. Essentially, that work showed that under the right conditions, the SA trajectory follows the solution of a suitable ODE, often referred to as its limiting ODE; thus, it eventually converges to the solution of the limiting ODE. Several usages of this tool in RL literature can be found in [Sutton, Maei, and Szepesvári2009, Sutton et al.2009, Sutton, Mahmood, and White2015].

As opposed to the case of asymptotic convergence analysis of TD algorithms, very little is known about their finite sample behavior. We now briefly discuss the few existing results on this topic. In [Borkar2008], a concentration bound is given for generic SA algorithms. Recent works [Kamal2010, Thoppe and Borkar2015] obtain better concentration bounds via tighter analyses. The results in these works are conditioned on the event that the $n_0$-th iterate lies in some a-priori chosen bounded region containing the desired equilibria; this, therefore, is the caveat in applying them to TD(0).

In [Korda and Prashanth2015], convergence rates for TD(0) with mixing-time consideration have been given. We note that even though doubts were recently raised regarding the correctness of the results there [Narayanan and Szepesvári2017], we shall treat them as correct for the sake of discussion. The results in [Korda and Prashanth2015] require the learning rate to be set based on prior knowledge about system dynamics, which, as argued in the paper, is problematic; alternatively, they apply to the average of iterates. Additionally, unlike in our work, a strong requirement for all high probability bounds is that the iterates need to lie in some a-priori chosen bounded set; this is ensured there via projections (personal communication). In a similar spirit, results for TD(0) requiring prior knowledge about system parameters are also given in [Konda2002]. An additional work by [Liu et al.2015] considered the gradient TD algorithms GTD(0) and GTD2, which were first introduced in [Sutton et al.2009, Sutton, Maei, and Szepesvári2009]. That work interpreted the algorithms as gradient methods for some saddle-point optimization problem. This enabled them to obtain convergence rates on altered versions of these algorithms using results from the convex optimization literature. Despite the alternate approach, in a similar fashion to the results above, a projection step that keeps the parameter vectors in a convex set is needed there.

Bounds similar in flavor to ours are also given in [Frikha and Menozzi2012, Fathi and Frikha2013]. However, they apply only to a class of SA methods satisfying strong assumptions, which do not hold for TD(0). In particular, neither the uniformly Lipschitz assumption nor its weakened version, the Lyapunov Stability-Domination criterion, holds for TD(0) when formulated in their iid noise setup.

Three additional works [Yu and Bertsekas2009, Lazaric, Ghavamzadeh, and Munos2010, Pan, White, and White2017] provide sample complexity bounds on the batch LSTD algorithms. However, in the context of finite sample analysis, these belong to a different class of algorithms. The case of online TD learning has proved to be more practical, at the expense of increased analysis difficulty compared to LSTD methods.

Our Contributions

Our work is the first to give bounds on the convergence rate of TD(0) in its original, unaltered form. In fact, it is the first to obtain convergence rate results for an unaltered online TD algorithm of any type. Indeed, as discussed earlier, existing convergence rates apply only to online TD algorithms with alterations such as projections and stepsizes dependent on unknown problem parameters; alternatively, they only apply to the average of iterates.

The methodologies for obtaining the expectation and high probability bounds are quite different. The former has a short and elegant proof that follows via induction using a subtle trick from [Kamal2010]. This bound applies to a general family of stepsizes that is not restricted to square-summable sequences, as usually was required by most previous works. This result reveals an explicit interplay between the stepsizes and noise.

As for the key ingredients in proving our high-probability bound, we first show that the $n$-th iterate at worst is only $O(n)$ away from the solution $\theta^*$. Based on that, we then utilize tailor-made stochastic approximation tools to show that after some additional steps all subsequent iterates are $\epsilon$-close to the solution w.h.p. This novel analysis approach allows us to obviate the common alterations mentioned above. Our key insight regards the role of the driving matrix’s smallest eigenvalue $\lambda$. The convergence rate is dictated by it when it is below some threshold; for larger values, the rate is dictated by the noise.

We believe these two analysis approaches are not limited to TD(0) alone.

2 Problem Setup

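We consider the problem of policy evaluation for a Markov Decision Process (MDP). An MDP is defined by the 5-tuple $(\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma)$ [Sutton1988], where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P = P(s' \mid s, a)$ is the transition kernel, $\mathcal{R}(s, a, s')$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. In each time-step, the process is in some state $s \in \mathcal{S}$, an action $a \in \mathcal{A}$ is taken, the system transitions to a next state $s' \in \mathcal{S}$ according to the transition kernel $P$, and an immediate reward $r$ is received according to $\mathcal{R}(s, a, s')$. Let policy $\pi : \mathcal{S} \to \mathcal{A}$ be a stationary mapping from states to actions. Assuming the associated Markov chain is ergodic and uni-chain, let $\nu$ be the induced stationary distribution. Moreover, let $V^\pi(s)$ be the value function at state $s$ w.r.t. $\pi$, defined via the Bellman equation $V^\pi(s) = \mathbb{E}\left[ r + \gamma V^\pi(s') \right]$, where $a = \pi(s)$ and $s' \sim P(\cdot \mid s, a)$. In our policy evaluation setting, the goal is to estimate $V^\pi(s)$ using linear regression, i.e., $V^\pi(s) \approx \theta^\top \phi(s)$, where $\phi(s) \in \mathbb{R}^d$ is a feature vector at state $s$, and $\theta \in \mathbb{R}^d$ is a weight vector. For brevity, we omit the notation $\pi$ and denote $\phi(s), \phi(s')$ by $\phi, \phi'$.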

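Let $(\phi_n, \phi'_n, r_n)$ be iid samples of $(\phi, \phi', r)$ (see Footnote 1). Then the TD(0) algorithm has the update rule

$$\theta_{n+1} = \theta_n + \alpha_n \left[ r_n + \gamma (\phi'_n)^\top \theta_n - \phi_n^\top \theta_n \right] \phi_n, \tag{1}$$

where $\alpha_n$ is the stepsize. For analysis, we can rewrite the above as

$$\theta_{n+1} = \theta_n + \alpha_n \left[ h(\theta_n) + M_{n+1} \right], \tag{2}$$

where $h(\theta) = b - A\theta$ and

$$M_{n+1} = \left( r_n + \gamma (\phi'_n)^\top \theta_n - \phi_n^\top \theta_n \right) \phi_n - \left[ b - A\theta_n \right], \tag{3}$$

with $A = \mathbb{E}\left[ \phi (\phi - \gamma \phi')^\top \right]$ and $b = \mathbb{E}[r \phi]$. It is known that $A$ is positive definite [Bertsekas2012] and that (2) converges to $\theta^* := A^{-1} b$ [Borkar2008]. Note that

$$h(\theta) = -A \left[ \theta - \theta^* \right]. \tag{4}$$

Footnote 1: The iid assumption does not hold in practice; however, it is standard when dealing with convergence bounds in reinforcement learning [Liu et al.2015, Sutton, Maei, and Szepesvári2009, Sutton et al.2009]. It allows for sophisticated and well-developed techniques from SA theory, and it is not clear how it can be avoided. Indeed, the few papers that obviate this assumption assume other strong properties such as exponentially fast mixing time [Korda and Prashanth2015, Tsitsiklis, Van Roy, and others1997]. In practice, drawing samples from the stationary distribution is often simulated by taking the last sample from a long trajectory, even though knowing when to stop the trajectory is again a hard theoretical problem. Additionally, most recent implementations of TD algorithms use long replay buffers that shuffle samples. This reduces the correlation between the samples, thereby making our assumption more realistic.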

3 Main Results

Our first main result is a bound on the expected decay rate of the TD(0) iterates. It requires the following assumption.

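A1. For some $K_s > 0$,

$$\mathbb{E}\left[ \|M_{n+1}\|^2 \,\middle|\, \mathcal{F}_n \right] \le K_s \left[ 1 + \|\theta_n - \theta^*\|^2 \right],$$

where $\mathcal{F}_n$ denotes the history of the algorithm up to step $n$.

This assumption follows from (3) when, for example, $\{\phi_n\}$, $\{\phi'_n\}$, and $\{r_n\}$ have uniformly bounded second moments. The latter is a common assumption in such results; e.g., [Sutton et al.2009, Sutton, Maei, and Szepesvári2009].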

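Recall that all eigenvalues of a symmetric matrix are real. For a symmetric matrix $X$, let $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$ be its minimum and maximum eigenvalues, respectively.

Theorem 3.1 (Expected Decay Rate for TD(0)).

Fix $\sigma \in (0, 1)$ and let $\alpha_n = (n+1)^{-\sigma}$. Fix $\lambda \in (0, \lambda_{\min}(A + A^\top))$. Then, under A1, for $n \ge 1$,

$$\mathbb{E}\|\theta_n - \theta^*\|^2 \le K_1 e^{-(\lambda/2) n^{1-\sigma}} + \frac{K_2}{n^{\sigma}},$$

where $K_1, K_2 \ge 0$ are some constants that depend on both $\lambda$ and $\sigma$; see (11) and (12) for the exact expressions.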

Remark 3.2 (Stepsize tradeoff – I).

The exponentially decaying term in Theorem 3.1 corresponds to the convergence rate of the noiseless TD(0) algorithm, while the inverse polynomial term appears due to the martingale noise $M_{n+1}$. The inverse impact of $\sigma$ on these two terms introduces the following tradeoff:

  1. For $\sigma$ close to $0$, which corresponds to slowly decaying stepsizes, the first term converges faster. This stems from speeding up the underlying noiseless TD(0) process.

  2. For $\sigma$ close to $1$, which corresponds to quickly decaying stepsizes, the second term converges faster. This is due to better mitigation of the martingale noise; recall that $M_{n+1}$ is scaled with $\alpha_n$.

While this insight is folklore, a formal estimate of the tradeoff, to the best of our knowledge, has been obtained here for the first time.
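The tradeoff can be seen numerically with a small sketch. It assumes the two terms of the bound have the shapes $e^{-(\lambda/2) n^{1-\sigma}}$ and $n^{-\sigma}$, and it sets the constants multiplying them to 1; the value of $\lambda$ is likewise a hypothetical choice, so only the relative ordering of the numbers is meaningful.

```python
import numpy as np

def bound_terms(n, sigma, lam):
    """Return (exponential term, inverse polynomial term) with both constants set to 1."""
    exp_term = np.exp(-0.5 * lam * n ** (1.0 - sigma))   # noiseless TD(0) contribution
    poly_term = n ** (-sigma)                            # martingale noise contribution
    return exp_term, poly_term

if __name__ == "__main__":
    lam = 0.5  # hypothetical value in (0, lambda_min(A + A^T))
    for sigma in (0.2, 0.5, 0.9):
        exp_term, poly_term = bound_terms(10_000, sigma, lam)
        print(f"sigma={sigma}: exp term={exp_term:.3e}, poly term={poly_term:.3e}")
```

Slowly decaying stepsizes (small $\sigma$) make the exponential term negligible but leave a large polynomial term, and quickly decaying stepsizes do the opposite.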

Remark 3.3 (Stepsize tradeoff – II).

A practitioner might expect initially large stepsizes to speed up convergence. However, Theorem 3.1 shows that as $\sigma$ becomes small, the convergence rate starts being dominated by the martingale difference noise; i.e., choosing a larger stepsize will help speed up convergence only up to some threshold.

Remark 3.4 (Non square-summable stepsizes).

In Theorem 3.1, unlike most works, $\sum_{n \ge 0} \alpha_n^2$ need not be finite. Thus this result is applicable for a wider class of stepsizes; e.g., $\alpha_n = (n+1)^{-\sigma}$ with $\sigma \in (0, 1/2]$. In [Borkar2008], on which much of the existing RL literature is based, the square summability assumption is due to the Gronwall inequality. In contrast, in our work, we use the Variation of Parameters Formula [Lakshmikantham and Deo1998] for comparing the SA trajectory to appropriate trajectories of the limiting ODE; it is a stronger tool than the Gronwall inequality.
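As a quick illustration of what square summability rules out, the sketch below computes partial sums of $\alpha_n^2$ for polynomially decaying stepsizes $\alpha_n = (n+1)^{-\sigma}$ (an assumed stepsize family chosen for the example). For $\sigma \le 1/2$ the partial sums keep growing, while for $\sigma > 1/2$ they level off.

```python
import numpy as np

def squared_stepsize_partial_sum(sigma, num_terms):
    """Sum of alpha_n^2 for alpha_n = (n + 1)^(-sigma), n = 0, ..., num_terms - 1."""
    n = np.arange(1, num_terms + 1, dtype=float)
    return float(np.sum(n ** (-2.0 * sigma)))

if __name__ == "__main__":
    for sigma in (0.4, 0.8):
        sums = [squared_stepsize_partial_sum(sigma, 10 ** k) for k in (2, 4, 6)]
        print(f"sigma={sigma}:", [round(s, 2) for s in sums])
```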

Our second main result is a high-probability bound for a specific stepsize. It requires the following assumption.

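A2. All rewards $r(s, a, s')$ and feature vectors $\phi(s)$ are uniformly bounded, i.e., $\|\phi(s)\| \le 1/2$ for all $s \in \mathcal{S}$, and $|r(s, a, s')| \le 1$ for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$.

This assumption is well accepted in the literature [Liu et al.2015, Korda and Prashanth2015].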

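In the following results, the $\tilde{O}$ notation hides problem-dependent constants and poly-logarithmic terms.

Theorem 3.5 (TD(0) Concentration Bound).

Let $\lambda \in \left(0, \min_{i \in [d]} \{\mathrm{real}(\lambda_i(A))\}\right)$, where $\lambda_i(A)$ is the $i$-th eigenvalue of $A$. Let $\alpha_n = (n+1)^{-1}$. Then, under A2, for $\epsilon > 0$ and $\delta \in (0, 1)$, there exists a function

$$N(\epsilon, \delta) = \tilde{O}\left( \max\left\{ \left[\frac{1}{\epsilon}\right]^{1 + \frac{1}{\lambda}} \left[\ln\frac{1}{\delta}\right]^{1 + \frac{1}{\lambda}}, \; \left[\frac{1}{\epsilon}\right]^{2} \left[\ln\frac{1}{\delta}\right]^{3} \right\} \right)$$

such that

$$\Pr\left\{ \|\theta_n - \theta^*\| \le \epsilon \;\; \forall n \ge N(\epsilon, \delta) \right\} \ge 1 - \delta.$$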

To enable direct comparison with previous works, one can obtain the following weaker implication of Theorem 3.5 by dropping the quantifier inside the event. This translates to the following.

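Theorem 3.6 (TD(0) High-Probability Convergence Rate).

Let $\lambda$ and $\alpha_n$ be as in Theorem 3.5. Fix $\delta \in (0, 1)$. Then, under A2, there exists some function $N_0(\delta) = O(\ln(1/\delta))$ such that for all $n \ge N_0(\delta)$,

$$\Pr\left\{ \|\theta_n - \theta^*\| = \tilde{O}\left( n^{-\min\{1/2, \, \lambda/(\lambda+1)\}} \right) \right\} \ge 1 - \delta.$$

Proof. Fix some $n$, and choose $\epsilon = \epsilon(n)$ so that $n = N(\epsilon, \delta)$. Then, on one hand, $1 - \delta \le \Pr\{\|\theta_n - \theta^*\| \le \epsilon\}$ due to Theorem 3.5 and, on the other hand, $\epsilon = \tilde{O}\left( n^{-\min\{1/2, \, \lambda/(\lambda+1)\}} \right)$ by the definition of $N(\epsilon, \delta)$. The claimed result follows. ∎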

Remark 3.7 (Eigenvalue dependence).

Theorem 3.6 shows that the rate improves as $\lambda$ increases from $0$ to $1$; however, beyond $1$ it remains fixed at $1/\sqrt{n}$. As seen in the proof of Theorem 3.5, this is because the rate is dictated by noise when $\lambda > 1$ and by the limiting ODE when $\lambda < 1$.

Remark 3.8 (Comparison to [Korda and Prashanth2015]).

Recently, doubts were raised in [Narayanan and Szepesvári2017] regarding the correctness of the results in [Korda and Prashanth2015]. Nevertheless, given the current form of those results, the following discussion is in order.

The expectation bound in Theorem 1, [Korda and Prashanth2015] requires the TD(0) stepsize to satisfy $\alpha_n = f_n(\lambda)$ for some function $f_n$, where $\lambda$ is as above. Theorem 2 there obviates this, but it applies to the average of iterates. In contrast, our expectation bound does not need any scaling of the above kind and applies directly to the TD(0) iterates. Moreover, our result applies to a broader family of stepsizes; see Remark 3.4. Our expectation bound, when compared to that of Theorem 2, [Korda and Prashanth2015], is of the same order (even though theirs is for the average of iterates). As for the high-probability concentration bounds in Theorems 1 and 2, [Korda and Prashanth2015], they require projecting the iterates to some bounded set (personal communication). In contrast, our result applies directly to the original TD(0) algorithm and we obviate all the above modifications.

4 Proof of Theorem 3.1

We begin with an outline of our proof for Theorem 3.1. Our first key step is to identify a “nice” Lyapunov function $V(\theta)$. Then, we apply conditional expectation to eliminate the linear noise terms in the relation between $V(\theta_n)$ and $V(\theta_{n+1})$; this subtle trick appeared in [Kamal2010]. Lastly, we use induction to obtain the desired result.

Our first two results hold for stepsize sequences of generic form. All that we require for is to satisfy and

Notice that the matrices $(A^\top + A)$ and $(A^\top A + K_s I)$ are symmetric, where $K_s$ is the constant from A1. Further, as $A$ is positive definite, the above matrices are also positive definite. Hence their minimum and maximum eigenvalues are strictly positive. This is used in the proofs in this section.

Lemma 4.1.

For let where

Fix Let be so that Then for any such that




Using Weyl’s inequality, we have


Since we have

For using and hence we have the following weak bound:


On the other hand, for we have


To prove the desired result, we consider three cases: and For the last case, using (6) and (4), we have

as desired. Similarly, it can be shown that the bound holds in the other cases as well. The desired result thus follows. ∎

Using Lemma 4.1, we now prove a convergence rate in expectation for general stepsizes.

Theorem 4.2 (Technical Result: Expectation Bound).

Fix Then, under ,

where are constants as defined in Lemmas 4.1 and , respectively.


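Proof. Let $V(\theta_n) = \|\theta_n - \theta^*\|^2$. Using (2) and (4), we have $\theta_{n+1} - \theta^* = (I - \alpha_n A)(\theta_n - \theta^*) + \alpha_n M_{n+1}$, and hence

$$V(\theta_{n+1}) = (\theta_n - \theta^*)^\top (I - \alpha_n A)^\top (I - \alpha_n A)(\theta_n - \theta^*) + \alpha_n^2 \|M_{n+1}\|^2 + 2\alpha_n (\theta_n - \theta^*)^\top (I - \alpha_n A)^\top M_{n+1}.$$

Taking conditional expectation and using $\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = 0$, we get

$$\mathbb{E}[V(\theta_{n+1}) \mid \mathcal{F}_n] = (\theta_n - \theta^*)^\top (I - \alpha_n A)^\top (I - \alpha_n A)(\theta_n - \theta^*) + \alpha_n^2 \, \mathbb{E}\left[\|M_{n+1}\|^2 \mid \mathcal{F}_n\right].$$

Therefore, using A1,

$$\mathbb{E}[V(\theta_{n+1}) \mid \mathcal{F}_n] \le (\theta_n - \theta^*)^\top \Lambda_n (\theta_n - \theta^*) + K_s \alpha_n^2,$$

where $\Lambda_n := I - \alpha_n (A + A^\top) + \alpha_n^2 (A^\top A + K_s I)$. Since $\Lambda_n$ is a symmetric matrix, all its eigenvalues are real. With $\lambda_n := \lambda_{\max}(\Lambda_n)$, we have

$$\mathbb{E}[V(\theta_{n+1}) \mid \mathcal{F}_n] \le \lambda_n V(\theta_n) + K_s \alpha_n^2.$$

Taking expectation on both sides and letting $w_n = \mathbb{E}[V(\theta_n)]$, we have

$$w_{n+1} \le \lambda_n w_n + K_s \alpha_n^2.$$

Sequentially using the above inequality, we have

$$w_{n+1} \le \left[ \prod_{k=0}^{n} \lambda_k \right] w_0 + K_s \sum_{i=0}^{n} \left[ \prod_{k=i+1}^{n} \lambda_k \right] \alpha_i^2.$$

Using Lemma 4.1 and using the constant defined there, the desired result follows. ∎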
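The one-step recursion behind this argument can be explored numerically. The sketch below is illustrative only: it assumes the symmetric matrix in the proof is $\Lambda_n = I - \alpha_n(A + A^\top) + \alpha_n^2(A^\top A + K_s I)$, built from the two symmetric matrices named at the start of this section, and it uses a hypothetical $2 \times 2$ matrix $A$ and $K_s = 1$. It iterates $w_{n+1} = \lambda_{\max}(\Lambda_n) w_n + K_s \alpha_n^2$ as an equality, which gives the largest sequence the recursion allows.

```python
import numpy as np

def lambda_matrix(A, alpha, K_s):
    """Assumed form: I - alpha (A + A^T) + alpha^2 (A^T A + K_s I)."""
    d = A.shape[0]
    return np.eye(d) - alpha * (A + A.T) + alpha ** 2 * (A.T @ A + K_s * np.eye(d))

def expectation_recursion(A, K_s, w0, alphas):
    """Iterate w_{n+1} = lambda_max(Lambda_n) * w_n + K_s * alpha_n^2."""
    w = w0
    history = [w]
    for alpha in alphas:
        lam_n = np.linalg.eigvalsh(lambda_matrix(A, alpha, K_s))[-1]  # largest eigenvalue
        w = lam_n * w + K_s * alpha ** 2
        history.append(w)
    return np.array(history)

if __name__ == "__main__":
    A = np.array([[1.0, 0.3], [-0.2, 0.8]])      # hypothetical positive definite matrix
    alphas = (np.arange(2000) + 1.0) ** (-0.5)   # alpha_n = (n + 1)^(-1/2)
    history = expectation_recursion(A, K_s=1.0, w0=10.0, alphas=alphas)
    print(history[[0, 10, 100, 2000]])
```

For a stepsize $\alpha \le (\lambda_{\min}(A + A^\top) - \lambda) / \lambda_{\max}(A^\top A + K_s I)$, Weyl's inequality gives $\lambda_{\max}(\Lambda) \le 1 - \lambda\alpha \le e^{-\lambda\alpha}$, which is the kind of per-step contraction that makes the product terms decay.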

The next result provides closed-form estimates of the expectation bound given in Theorem 4.2 for the specific stepsize sequence $\alpha_n = (n+1)^{-\sigma}$ with $\sigma \in (0, 1)$. Notice that this family of stepsizes is more general than other common choices in the literature, as it is non-square-summable for $\sigma \in (0, 1/2]$. See Remark 3.4 for further details.

Theorem 4.3.

Fix and let Then, under ,

where with denoting a number larger than


Let for . Observe that

where the third relation follows by treating the sum as right Riemann sum, and the last inequality follows since Hence it follows that


We claim that for all


To establish this, we show that for any monotonically increases as is varied from to To prove the latter, it suffices to show that or equivalently for all But the latter is indeed true. Thus (10) holds. From (9) and (10), we then have

where the first relation holds as for any positive sequence with and the last relation follows as and Combining the above inequality with the relation from Theorem 4.2, we have


the desired result follows. ∎

To finalize the proof of Theorem 3.1 we employ Theorem 4.3 with the following constants.


where is the constant from ,

with and and

5 Proof of Theorem 3.5

In this section we prove Theorem 3.5. Throughout this section we assume A2. All proofs for intermediate lemmas are given in Appendix B.

Outline of Approach

Stepsize   Discretization Error   Martingale Noise Impact   TD(0) Behavior
Large      Large                  Large                     Possibly diverging
Moderate   $O(n_0)$               $O(n_0)$ w.h.p.           Stay in $O(n_0)$ ball w.h.p.
Small      $\epsilon/3$           $\epsilon/3$ w.h.p.       Converging w.h.p.

Table 1: Chronological Summary of Analysis Outline

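The limiting ODE for (2) is

$$\dot{\theta}(t) = h(\theta(t)) = b - A\theta(t) = -A\left(\theta(t) - \theta^*\right). \tag{13}$$

Let $\theta(t, s, u_0)$, $t \ge s$, denote the solution to the above ODE starting at $u_0$ at time $t = s$. When the starting point and time are unimportant, we will denote this solution by $\theta(t)$.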
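Because the limiting ODE is linear, its solution is available in closed form. The sketch below assumes the ODE reads $\dot\theta(t) = b - A\theta(t)$ with $\theta^* = A^{-1}b$, in which case the solution started at $u_0$ at time $s$ is $\theta^* + e^{-A(t-s)}(u_0 - \theta^*)$. The matrix exponential is computed through an eigendecomposition, which additionally assumes that $A$ is diagonalizable; the function name `ode_solution` and the example values are hypothetical.

```python
import numpy as np

def ode_solution(A, b, u0, s, t):
    """Solution at time t of theta' = b - A theta started at u0 at time s.

    Assumes A is diagonalizable, so e^{-A (t - s)} = V diag(e^{-lambda (t - s)}) V^{-1}.
    """
    theta_star = np.linalg.solve(A, b)
    evals, V = np.linalg.eig(A)
    expm = V @ np.diag(np.exp(-evals * (t - s))) @ np.linalg.inv(V)
    return theta_star + np.real(expm @ (u0 - theta_star))

if __name__ == "__main__":
    A = np.array([[1.0, 0.3], [-0.2, 0.8]])   # hypothetical positive definite matrix
    b = np.array([1.0, -0.5])
    u0 = np.array([3.0, -2.0])
    for t in (0.0, 1.0, 5.0, 20.0):
        print(t, ode_solution(A, b, u0, 0.0, t))
```

Since the eigenvalues of a positive definite $A$ have positive real parts, the printed trajectory approaches $\theta^*$ as $t$ grows.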

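As the solutions of the ODE are continuous functions of time, we also define a linear interpolation $\{\bar{\theta}(t)\}$ of $\{\theta_n\}$. Let $t_0 = 0$. For $n \ge 0$, let $t_{n+1} = t_n + \alpha_n$ and let

$$\bar{\theta}(\tau) = \begin{cases} \theta_n & \text{if } \tau = t_n, \\ \theta_n + \dfrac{\tau - t_n}{\alpha_n} \left[ \theta_{n+1} - \theta_n \right] & \text{if } \tau \in (t_n, t_{n+1}). \end{cases} \tag{14}$$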
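A linear interpolation of this kind is simple to build in code. The following sketch assumes the standard SA construction in which the $n$-th iterate is placed at time $t_n = \sum_{k < n} \alpha_k$ and consecutive iterates are joined by straight lines; the function name `interpolate_iterates` is hypothetical.

```python
import numpy as np

def interpolate_iterates(thetas, alphas):
    """Return the time grid t_n and a function t -> linearly interpolated iterate."""
    thetas = np.asarray(thetas, dtype=float)
    # t_0 = 0 and t_{n+1} = t_n + alpha_n, truncated to the number of iterates.
    times = np.concatenate(([0.0], np.cumsum(alphas)))[: len(thetas)]

    def theta_bar(t):
        # Interpolate each coordinate separately along the time grid.
        return np.array([np.interp(t, times, thetas[:, j]) for j in range(thetas.shape[1])])

    return times, theta_bar

if __name__ == "__main__":
    thetas = [[0.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
    alphas = [1.0, 0.5]
    times, theta_bar = interpolate_iterates(thetas, alphas)
    print(times, theta_bar(0.5), theta_bar(1.25))
```

The interpolated path agrees with the iterates at the grid times, which is what allows it to be compared against a continuous ODE trajectory.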


Our tool for comparing $\bar{\theta}(t)$ to $\theta(t)$ is the Variation of Parameters (VoP) method [Lakshmikantham and Deo1998]. Initially, $\bar{\theta}(t)$ could stray away from $\theta^*$ when the stepsizes may not be small enough to tame the noise. However, we show that $\|\theta_n - \theta^*\| = O(n)$, i.e., $\theta_n$ does not stray away from $\theta^*$ too fast. Later, we show that we can fix some $n_0$ so that first the TD(0) iterates for $n \ge n_0$ stay within an $O(n_0)$ distance from $\theta^*$. Then, after some additional time, when the stepsizes have decayed enough, the TD(0) iterates start behaving almost like a noiseless version. These three different behaviours are summarized in Table 1 and illustrated in Figure 1.

Figure 1: Visualization of the proof outline. The three balls (from large to small) are respectively the ball, ball, and ball, where is from Lemma 5.4. The blue curve is the initial, possibly diverging phase of . The green curve is when the stepsizes are moderate in size ( in the analysis). Similarly, the red curve is when the stepsizes are sufficiently small (). The dotted curves are the associated ODE trajectories .


We establish some preliminary results here that will be used throughout this section. Let and Using results from Chapter 6, [Hirsch, Smale, and Devaney2012], it follows that the solution of (13) satisfies the relation


As the matrix is positive definite, for



for all and

Let be as in Theorem 3.5. From Corollary 3.6, p71, [Teschl2012], so that


Separately, as


The following result is a consequence of A2 that gives a bound directly on the martingale difference noise as a function of the iterates. We emphasize that this strong behavior of TD(0) is significant in our work. We are also not aware of other works that utilized it, even though A2 or equivalents are often assumed and accepted.

Lemma 5.1 (Martingale Noise Behavior).

For all


Remark 5.2.

The noise behavior usually used in the literature (e.g., [Sutton et al.2009, Sutton, Maei, and Szepesvári2009]) is the same as we assumed in A1 for Theorem 3.1:

for some constant $K_s$. However, here we assume the stronger A2, which, using a similar proof technique to that of Lemma 5.1, implies

for all

The remaining parts of the analysis rely on the comparison of the discrete TD(0) trajectory to the continuous solution of the limiting ODE. For this, we first switch from directly treating to treating their linear interpolation as defined in (14). The key idea then is to use the VoP method [Lakshmikantham and Deo1998] as in Lemma A.1, and express as a perturbation of due to two factors: the discretization error and the martingale difference noise. Our quantification of these two factors is as follows. For the interval let


Corollary 5.3 (Comparison of SA Trajectory and ODE Solution).

For every ,

We highlight that both the paths, and start at the same point at time Consequently, by bounding and we can estimate the distance of interest.

Part I – Initial Possible Divergence

In this section, we show that the TD(0) iterates lie in an $O(n)$-ball around $\theta^*$. We stress that this is one of the results that enable us to accomplish more than existing literature. Previously, the distance of the initial iterates from $\theta^*$ was bounded using various assumptions, often justified with an artif