Temporal Difference (TD) algorithms lie at the core of Reinforcement Learning (RL), dominated by the celebrated TD(0) algorithm. The term was coined in [Sutton and Barto1998], describing an iterative process of updating an estimate of a value function with respect to a given policy based on temporally-successive samples. The classical version of the algorithm uses a tabular representation, i.e., entry-wise storage of the value estimate for each state. However, in many problems, the state-space is too large for such a vanilla approach. The common practice to mitigate this caveat is to approximate the value function using some parameterized family. Often, linear regression is used, i.e., the value function is approximated by a linear combination of state features. This allows for an efficient implementation of TD(0) even on large state-spaces and has been shown to perform well in a variety of problems [Tesauro1995, Powell2007]. More recently, TD(0) has become prominent in many state-of-the-art RL solutions when combined with deep neural network architectures, as an integral part of fitted value iteration [Mnih et al.2015, Silver et al.2016]. In this work we focus on the former case of linear Function Approximation (FA); nevertheless, we consider this work a preliminary milestone en route to achieving theoretical guarantees for non-linear RL architectures.
Two types of convergence rate results exist in the literature: in expectation and with high probability. We stress that no results of either type exist for the actual, commonly used, TD(0) algorithm with linear FA; our work is the first to provide such results. In fact, it is the first work to give a convergence rate for an unaltered online TD algorithm of any type. We emphasize that TD(0) with linear FA is formulated and used with non-problem-specific stepsizes. Also, it does not require a projection step to keep the iterates in a ‘nice’ set. In contrast, the few recent works that managed to provide convergence rates for TD(0) analyzed only altered versions of it. These modifications include a projection step and eigenvalue-dependent stepsizes, or they apply only to the average of iterates; we expand on this in the coming section.
Earlier work paved the path to a unified and convenient tool for convergence analyses of Stochastic Approximation (SA), and hence of TD algorithms. This tool is based on the Ordinary Differential Equation (ODE) method. Essentially, that work showed that under the right conditions, the SA trajectory follows the solution of a suitable ODE, often referred to as its limiting ODE; thus, it eventually converges to the solution of the limiting ODE. Several usages of this tool in the RL literature can be found in [Sutton, Maei, and Szepesvári2009, Sutton et al.2009, Sutton, Mahmood, and White2015].
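The core idea of the ODE method can be sketched on a toy problem. The example below is only an illustration under assumptions that do not come from the text: a hypothetical scalar drift `h(x) = 1 - x`, standard Gaussian noise, and stepsizes `(n + 1) ** (-sigma)`. It runs the noisy SA recursion alongside the closed-form solution of the ODE x'(t) = 1 - x(t), evaluated at the accumulated stepsize time, and records how far apart the two are over the second half of the run.

```python
import math
import random


def sa_vs_ode(num_steps=20000, sigma=0.75, seed=0):
    """Compare a scalar SA recursion with its limiting ODE (toy example).

    SA:  x_{n+1} = x_n + a_n * (h(x_n) + noise),  h(x) = 1 - x  (assumed drift)
    ODE: x'(t) = 1 - x(t), with solution x(t) = 1 + (x0 - 1) * exp(-t).
    """
    rng = random.Random(seed)
    x0 = 5.0
    x = x0
    t = 0.0  # ODE time, i.e. the running sum of stepsizes
    max_gap_tail = 0.0
    for n in range(num_steps):
        a = (n + 1) ** (-sigma)          # assumed stepsize schedule
        noise = rng.gauss(0.0, 1.0)      # assumed zero-mean noise
        x += a * ((1.0 - x) + noise)
        t += a
        ode = 1.0 + (x0 - 1.0) * math.exp(-t)
        if n >= num_steps // 2:
            max_gap_tail = max(max_gap_tail, abs(x - ode))
    return x, ode, max_gap_tail


if __name__ == "__main__":
    final_x, final_ode, gap = sa_vs_ode()
    print(final_x, final_ode, gap)
```

Both trajectories settle near the equilibrium `1.0`, and the gap between them shrinks as the stepsizes decay, which is the qualitative behavior the ODE method formalizes.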
As opposed to the case of asymptotic convergence analysis of TD algorithms, very little is known about their finite sample behavior. We now briefly discuss the few existing results on this topic. In [Borkar2008], a concentration bound is given for generic SA algorithms. Recent works [Kamal2010, Thoppe and Borkar2015] obtain better concentration bounds via tighter analyses. The results in these works are conditioned on the event that a particular iterate lies in some a-priori chosen bounded region containing the desired equilibria; this, therefore, is the caveat in applying them to TD(0).
In [Korda and Prashanth2015], convergence rates for TD(0) with mixing-time consideration have been given. We note that even though doubts were recently raised regarding the correctness of the results there [Narayanan and Szepesvári2017], we shall treat them as correct for the sake of discussion. The results in [Korda and Prashanth2015] require the learning rate to be set based on prior knowledge about system dynamics, which, as argued in the paper, is problematic; alternatively, they apply to the average of iterates. Additionally, unlike in our work, a strong requirement for all high probability bounds is that the iterates need to lie in some a-priori chosen bounded set; this is ensured there via projections (personal communication). In a similar spirit, results for TD(0) requiring prior knowledge about system parameters are also given in [Konda2002]. An additional work by [Liu et al.2015] considered the gradient TD algorithms GTD(0) and GTD2, which were first introduced in [Sutton et al.2009, Sutton, Maei, and Szepesvári2009]. That work interpreted the algorithms as gradient methods for some saddle-point optimization problem. This enabled them to obtain convergence rates on altered versions of these algorithms using results from the convex optimization literature. Despite the alternate approach, in a similar fashion to the results above, a projection step that keeps the parameter vectors in a convex set is needed there.
Bounds similar in flavor to ours are also given in [Frikha and Menozzi2012, Fathi and Frikha2013]. However, they apply only to a class of SA methods satisfying strong assumptions, which do not hold for TD(0). In particular, neither the uniformly Lipschitz assumption nor its weakened version, the Lyapunov Stability-Domination criterion, holds for TD(0) when formulated in their iid noise setup.
Three additional works [Yu and Bertsekas2009, Lazaric, Ghavamzadeh, and Munos2010, Pan, White, and White2017] provide sample complexity bounds on the batch LSTD algorithms. However, in the context of finite sample analysis, these belong to a different class of algorithms. The case of online TD learning has proved to be more practical, at the expense of increased analysis difficulty compared to LSTD methods.
Our work is the first to give bounds on the convergence rate of TD(0) in its original, unaltered form. In fact, it is the first to obtain convergence rate results for an unaltered online TD algorithm of any type. Indeed, as discussed earlier, existing convergence rates apply only to online TD algorithms with alterations such as projections and stepsizes dependent on unknown problem parameters; alternatively, they only apply to the average of iterates.
The methodologies for obtaining the expectation and high probability bounds are quite different. The former has a short and elegant proof that follows via induction using a subtle trick from [Kamal2010]. This bound applies to a general family of stepsizes that is not restricted to square-summable sequences, as was usually required by most previous works. This result reveals an explicit interplay between the stepsizes and noise.
As for the key ingredients in proving our high-probability bound, we first show that the $n$-th iterate is at worst only $O(n)$ away from the solution $\theta^*$. Based on that, we then utilize tailor-made stochastic approximation tools to show that after some additional steps all subsequent iterates are $\epsilon$-close to the solution w.h.p. This novel analysis approach allows us to obviate the common alterations mentioned above. Our key insight regards the role of the driving matrix’s smallest eigenvalue $\lambda$. The convergence rate is dictated by it when it is below some threshold; for larger values, the rate is dictated by the noise.
We believe these two analysis approaches are not limited to TD(0) alone.
2 Problem Setup
We consider the problem of policy evaluation for a Markov Decision Process (MDP). An MDP is defined by the 5-tuple $(\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma)$ [Sutton1988], where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P$ is the transition kernel, $\mathcal{R}$ is the reward function, and $\gamma$ is the discount factor. In each time-step, the process is in some state $s \in \mathcal{S}$, an action $a \in \mathcal{A}$ is taken, the system transitions to a next state $s' \in \mathcal{S}$ according to the transition kernel $P$, and an immediate reward $r$ is received according to $\mathcal{R}$. Let policy $\pi : \mathcal{S} \to \mathcal{A}$ be a stationary mapping from states to actions. Assuming the associated Markov chain is ergodic and uni-chain, let $\nu$ be the induced stationary distribution. Moreover, let $V^\pi(s)$ be the value function at state $s$ w.r.t. $\pi$, defined via the Bellman equation $V^\pi(s) = \mathbb{E}[r + \gamma V^\pi(s') \mid s]$. In our policy evaluation setting, the goal is to estimate $V^\pi(s)$ using linear regression, i.e., $V^\pi(s) \approx \theta^\top \phi(s)$, where $\phi(s) \in \mathbb{R}^d$ is a feature vector at state $s$, and $\theta \in \mathbb{R}^d$ is a weight vector. For brevity, we omit the notation $\pi$ and denote $\phi(s), \phi(s')$ by $\phi, \phi'$.
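To make the policy evaluation target concrete, here is a minimal sketch that solves the Bellman equation by fixed-point iteration on a hypothetical two-state chain. The transition matrix, expected per-state rewards, and discount factor are illustrative assumptions, not values from the text, and the Bellman equation is taken in the standard form V = r + gamma * P V for a fixed policy.

```python
def evaluate_policy(P, r, gamma, tol=1e-12, max_iter=100000):
    """Iterate V <- r + gamma * P V until the update is below tol.

    P[s][t] is the probability of moving from state s to state t under the
    fixed policy, and r[s] is the expected immediate reward in state s.
    """
    n = len(r)
    V = [0.0] * n
    for _ in range(max_iter):
        V_new = [
            r[s] + gamma * sum(P[s][t] * V[t] for t in range(n))
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical two-state chain used only for illustration.
P_EXAMPLE = [[0.9, 0.1], [0.2, 0.8]]
R_EXAMPLE = [1.0, 0.0]
GAMMA_EXAMPLE = 0.5

if __name__ == "__main__":
    print(evaluate_policy(P_EXAMPLE, R_EXAMPLE, GAMMA_EXAMPLE))
```

For this chain the two linear equations can be solved by hand, giving V(0) = 24/13 and V(1) = 4/13, which the iteration reproduces. With one-hot features, the linear approximation is exact and the weight vector coincides with this value function.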
Let $\{(\phi_n, \phi'_n, r_n)\}_n$ be iid samples of $(\phi, \phi', r)$ (see Footnote 1).
Footnote 1: The iid assumption does not hold in practice; however, it is standard when dealing with convergence bounds in reinforcement learning [Liu et al.2015, Sutton, Maei, and Szepesvári2009, Sutton et al.2009]. It allows for sophisticated and well-developed techniques from SA theory, and it is not clear how it can be avoided. Indeed, the few papers that obviate this assumption assume other strong properties such as exponentially-fast mixing time [Korda and Prashanth2015, Tsitsiklis, Van Roy, and others1997]. In practice, drawing samples from the stationary distribution is often simulated by taking the last sample from a long trajectory, even though knowing when to stop the trajectory is again a hard theoretical problem. Additionally, most recent implementations of TD algorithms use long replay buffers that shuffle samples. This reduces the correlation between the samples, thereby making our assumption more realistic.
Then the TD(0) algorithm has the update rule
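$$\theta_{n+1} = \theta_n + \alpha_n \left[ r_n + \gamma (\phi'_n)^\top \theta_n - \phi_n^\top \theta_n \right] \phi_n, \qquad (1)$$
where $\alpha_n$ is the stepsize. For analysis, we can rewrite the above as
$$\theta_{n+1} = \theta_n + \alpha_n \left[ h(\theta_n) + M_{n+1} \right], \qquad (2)$$
where $h(\theta) = b - A\theta$ and
$$M_{n+1} = \left( r_n + \gamma (\phi'_n)^\top \theta_n - \phi_n^\top \theta_n \right) \phi_n - \left[ b - A\theta_n \right], \qquad (3)$$
with $A = \mathbb{E}[\phi (\phi - \gamma \phi')^\top]$ and $b = \mathbb{E}[r \phi]$.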
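The update rule can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a hypothetical two-state chain, one-hot features (the tabular special case of linear FA, chosen so the fixed point is easy to check), iid sampling of the current state from the stationary distribution, and stepsizes `(n + 1) ** (-sigma)`. There is no projection step and no averaging of iterates.

```python
import random


def td0_linear(num_steps=50000, sigma=0.75, gamma=0.5, seed=0):
    """Plain TD(0) with linear function approximation on iid samples."""
    rng = random.Random(seed)
    # Hypothetical two-state chain (illustrative values only).
    P = [[0.9, 0.1], [0.2, 0.8]]       # transition probabilities
    reward = [1.0, 0.0]                # reward received in each state
    nu = [2.0 / 3.0, 1.0 / 3.0]        # stationary distribution of P
    features = [[1.0, 0.0], [0.0, 1.0]]  # one-hot feature vectors

    theta = [0.0, 0.0]
    for n in range(num_steps):
        # iid sample: state from nu, next state from the transition kernel.
        s = 0 if rng.random() < nu[0] else 1
        s_next = 0 if rng.random() < P[s][0] else 1
        phi, phi_next = features[s], features[s_next]

        v = sum(t * f for t, f in zip(theta, phi))
        v_next = sum(t * f for t, f in zip(theta, phi_next))
        delta = reward[s] + gamma * v_next - v   # temporal difference

        alpha = (n + 1) ** (-sigma)              # assumed stepsize schedule
        theta = [t + alpha * delta * f for t, f in zip(theta, phi)]
    return theta


if __name__ == "__main__":
    print(td0_linear())
```

For this chain the exact value function is (24/13, 4/13), so the returned weights should land close to those values.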
3 Main Results
Our first main result is a bound on the expected decay rate of the TD(0) iterates. It requires the following assumption.
For some ,
This assumption follows from (3) when, for example, the samples have uniformly bounded second moments. The latter is a common assumption in such results; e.g., [Sutton et al.2009, Sutton, Maei, and Szepesvári2009].
Recall that all eigenvalues of a symmetric matrix are real. For a symmetric matrix $X$, let $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$ be its minimum and maximum eigenvalues, respectively.
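As a small worked example of these quantities, the sketch below computes the minimum and maximum eigenvalues of a 2x2 symmetric matrix in closed form. The helper names and the example matrix are hypothetical. The symmetrization `X + X^T` is included because it turns a possibly non-symmetric matrix into a symmetric one with real eigenvalues.

```python
import math


def symmetrize(X):
    """Return X + X^T for a 2x2 matrix given as a list of lists."""
    return [[X[i][j] + X[j][i] for j in range(2)] for i in range(2)]


def sym2x2_eigs(S):
    """Return (lambda_min, lambda_max) of a symmetric 2x2 matrix S."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius


if __name__ == "__main__":
    X = [[2.0, 1.0], [0.0, 1.0]]   # non-symmetric example matrix
    print(sym2x2_eigs(symmetrize(X)))
```

Here `X + X^T` equals `[[4, 1], [1, 2]]`, whose eigenvalues are 3 - sqrt(2) and 3 + sqrt(2); both are strictly positive, so the symmetrized matrix is positive definite.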
Theorem 3.1 (Expected Decay Rate for TD(0)).
Remark 3.2 (Stepsize tradeoff – I).
The exponentially decaying term in Theorem 3.1 corresponds to the convergence rate of the noiseless TD(0) algorithm, while the inverse polynomial term appears due to the martingale noise. The inverse impact of the stepsize decay exponent $\sigma$ on these two terms introduces the following tradeoff:
For $\sigma$ close to $0$, which corresponds to slowly decaying stepsizes, the first term converges faster. This stems from speeding up the underlying noiseless TD(0) process.
For $\sigma$ close to $1$, which corresponds to quickly decaying stepsizes, the second term converges faster. This is due to better mitigation of the martingale noise; recall that $M_{n+1}$ is scaled with $\alpha_n$.
While this insight is folklore, a formal estimate of the tradeoff, to the best of our knowledge, has been obtained here for the first time.
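The tradeoff can be illustrated numerically. The sketch below assumes, purely for illustration, a bound made of an exponential term `k1 * exp(-(lam / 2) * n ** (1 - sigma))` and a polynomial term `k2 * n ** (-sigma)`, where `sigma` is the stepsize decay exponent. The constants `lam`, `k1`, and `k2` are hypothetical placeholders, and the exact expression in Theorem 3.1 may differ.

```python
import math


def bound_terms(n, sigma, lam=1.0, k1=1.0, k2=1.0):
    """Return (exponential term, polynomial term) of an assumed bound shape."""
    exp_term = k1 * math.exp(-0.5 * lam * n ** (1.0 - sigma))
    poly_term = k2 * n ** (-sigma)
    return exp_term, poly_term


if __name__ == "__main__":
    for sigma in (0.2, 0.5, 0.9):
        print(sigma, bound_terms(10000, sigma))
```

With slowly decaying stepsizes (small `sigma`) the exponential term is negligible but the polynomial term is large; with quickly decaying stepsizes (large `sigma`) the situation is reversed.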
Remark 3.3 (Stepsize tradeoff – II).
A practitioner might expect initially large stepsizes to speed up convergence. However, Theorem 3.1 shows that as $\sigma$ becomes small, the convergence rate starts being dominated by the martingale difference noise; i.e., choosing a larger stepsize will help speed up convergence only up to some threshold.
Remark 3.4 (Non square-summable stepsizes).
In Theorem 3.1, unlike most works, $\sum_n \alpha_n^2$ need not be finite. Thus this result is applicable to a wider class of stepsizes; e.g., stepsizes decaying as $n^{-\sigma}$ with $\sigma \in (0, 1/2]$. In [Borkar2008], on which much of the existing RL literature is based, the square summability assumption is due to the Gronwall inequality. In contrast, in our work, we use the Variation of Parameters Formula [Lakshmikantham and Deo1998] for comparing the SA trajectory to appropriate trajectories of the limiting ODE; it is a stronger tool than the Gronwall inequality.
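A quick numerical check shows what square summability means for polynomially decaying stepsizes. The sketch assumes stepsizes of the form `(n + 1) ** (-sigma)`; the function name is hypothetical. The partial sums of the squared stepsizes keep growing when `sigma` is at most 1/2 and stay bounded when `sigma` exceeds 1/2.

```python
def squared_stepsize_sum(sigma, num_terms):
    """Partial sum of alpha_n ** 2 for alpha_n = (n + 1) ** (-sigma)."""
    return sum((n + 1) ** (-2.0 * sigma) for n in range(num_terms))


if __name__ == "__main__":
    for sigma in (0.5, 0.75):
        print(sigma, [squared_stepsize_sum(sigma, 10 ** k) for k in (3, 4, 5)])
```

For `sigma = 0.5` the squared stepsizes form the harmonic series, so the partial sums grow like the logarithm of the number of terms; for `sigma = 0.75` they converge.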
Our second main result is a high-probability bound for a specific stepsize. It requires the following assumption.
All rewards and feature vectors are uniformly bounded, i.e., and
In the following results, the $\tilde{O}$ notation hides problem-dependent constants and poly-logarithmic terms.
Theorem 3.5 (TD(0) Concentration Bound).
To enable direct comparison with previous works, one can obtain the following weaker implication of Theorem 3.5 by dropping the quantifier inside the event. This translates to the following.
[TD(0) High-Probability Convergence Rate] Let and be as in Theorem 3.5. Fix Then, under , there exists some function such that for all
Fix some , and choose so that . Then, on one hand, due to Theorem 3.5 and, on the other hand, by the definition of . The claimed result follows. ∎
Remark 3.7 (Eigenvalue dependence).
Remark 3.8 (Comparison to [Korda and Prashanth2015]).
Recently, doubts were raised in [Narayanan and Szepesvári2017] regarding the correctness of the results in [Korda and Prashanth2015]. Nevertheless, given the current form of those results, the following discussion is in order.
The expectation bound in Theorem 1, [Korda and Prashanth2015] requires the TD(0) stepsize to satisfy for some function where is as above. Theorem 2 there obviates this, but it applies to the average of iterates. In contrast, our expectation bound does not need any scaling of the above kind and applies directly to the TD(0) iterates. Moreover, our result applies to a broader family of stepsizes; see Remark 3.4. Our expectation bound when compared to that of Theorem 2, [Korda and Prashanth2015] is of the same order (even though theirs is for the average of iterates). As for the high-probability concentration bounds in Theorems 1&2, [Korda and Prashanth2015], they require projecting the iterates to some bounded set (personal communication). In contrast, our result applies directly to the original TD(0) algorithm and we obviate all the above modifications.
4 Proof of Theorem 3.1
We begin with an outline of our proof of Theorem 3.1. Our first key step is to identify a “nice” Lyapunov function $V(\theta)$. Then, we apply conditional expectation to eliminate the linear noise terms in the relation between $V(\theta_n)$ and $V(\theta_{n+1})$; this subtle trick appeared in [Kamal2010]. Lastly, we use induction to obtain the desired result.
Our first two results hold for stepsize sequences of generic form. All that we require for is to satisfy and
Notice that the matrices and are symmetric, where is the constant from . Further, as is positive definite, the above matrices are also positive definite. Hence their minimum and maximum eigenvalues are strictly positive. This is used in the proofs in this section.
For let where
Fix Let be so that Then for any such that
Using Weyl’s inequality, we have
Since we have
For using and hence we have the following weak bound:
On the other hand, for we have
as desired. Similarly, it can be shown that the bound holds in the other cases as well. The desired result thus follows. ∎
Using Lemma 4.1, we now prove a convergence rate in expectation for general stepsizes.
Theorem 4.2 (Technical Result: Expectation Bound).
where are constants as defined in Lemmas 4.1 and , respectively.
Taking conditional expectation and using we get
where Since is a symmetric matrix, all its eigenvalues are real. With we have
Taking expectation on both sides and letting we have
Sequentially using the above inequality, we have
Using Lemma 4.1 and using the constant defined there, the desired result follows. ∎
The next result provides closed-form estimates of the expectation bound given in Theorem 4.2 for the specific stepsize sequence $\alpha_n = (n+1)^{-\sigma}$ with $\sigma \in (0, 1)$. Notice this family of stepsizes is more general than other common choices in the literature, as it is non-square-summable for $\sigma \in (0, 1/2]$. See Remark 3.4 for further details.
Let for . Observe that
where the third relation follows by treating the sum as right Riemann sum, and the last inequality follows since Hence it follows that
We claim that for all
To establish this, we show that for any monotonically increases as is varied from to To prove the latter, it suffices to show that or equivalently for all But the latter is indeed true. Thus (10) holds. From (9) and (10), we then have
where the first relation holds as for any positive sequence with and the last relation follows as and Combining the above inequality with the relation from Theorem 4.2, we have
the desired result follows. ∎
5 Proof of Theorem 3.5
Outline of Approach
|Stepsize||Discretization Error||Martingale Noise Impact||TD(0) Behavior|
|Moderate||w.h.p.||Stay in ball w.h.p.|
The limiting ODE for (2) is
Let denote the solution to the above ODE starting at at time When the starting point and time are unimportant, we will denote this solution by .
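As a worked illustration, suppose the limiting ODE has the standard linear form associated with TD(0), with a positive definite matrix $A$, a vector $b$, and equilibrium $\theta^* = A^{-1} b$. These symbols, and the notation $\theta(t, s, u)$ for the solution started at $u$ at time $s$, are assumptions made for this sketch. Under that assumption the solution can be written in closed form:

```latex
\begin{align*}
\dot{\theta}(t) &= b - A\theta(t) = -A\bigl(\theta(t) - \theta^*\bigr), \\
\theta(t, s, u) &= \theta^* + e^{-A(t-s)}\,(u - \theta^*), \qquad t \ge s .
\end{align*}
```

Differentiating the second line with respect to $t$ recovers the first, and setting $t = s$ gives $\theta(s, s, u) = u$, so the expression is the solution with the stated initial condition.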
As the solutions of the ODE are continuous functions of time, we also define a linear interpolation of Let For let and let
Our tool for comparing to is the Variation of Parameters (VoP) method [Lakshmikantham and Deo1998]. Initially, could stray away from when the stepsizes may not be small enough to tame the noise. However, we show that i.e., does not stray away from too fast. Later, we show that we can fix some so that first the TD(0) iterates for stay within an distance from Then, after for some additional time, when the stepsizes decay enough, the TD(0) iterates start behaving almost like a noiseless version. These three different behaviours are summarized in Table 1 and illustrated in Figure 1.
We establish some preliminary results here that will be used throughout this section. Let and Using results from Chapter 6, [Hirsch, Smale, and Devaney2012], it follows that the solution of (13) satisfies the relation
As the matrix is positive definite, for
for all and
The following result is a consequence of that gives a bound directly on the martingale difference noise as a function of the iterates. We emphasize that this strong behavior of TD(0) is significant in our work. We also are not aware of other works that utilized it even though or equivalents are often assumed and accepted.
Lemma 5.1 (Martingale Noise Behavior).
The remaining parts of the analysis rely on the comparison of the discrete TD(0) trajectory to the continuous solution of the limiting ODE. For this, we first switch from directly treating to treating their linear interpolation as defined in (14). The key idea then is to use the VoP method [Lakshmikantham and Deo1998] as in Lemma A.1, and express as a perturbation of due to two factors: the discretization error and the martingale difference noise. Our quantification of these two factors is as follows. For the interval let
Corollary 5.3 (Comparison of SA Trajectory and ODE Solution).
For every ,
We highlight that both the paths, and start at the same point at time Consequently, by bounding and