# Adaptive Temporal Difference Learning with Linear Function Approximation

This paper revisits the celebrated temporal difference (TD) learning algorithm for policy evaluation in reinforcement learning. Typically, the performance of the plain-vanilla TD algorithm is sensitive to the choice of stepsizes, and TD often suffers from slow convergence. Motivated by the tight connection between TD learning and stochastic gradient methods, we develop the first adaptive variant of the TD learning algorithm with linear function approximation, which we term AdaTD. In contrast to the original TD, AdaTD is robust, or at least less sensitive, to the choice of stepsizes. Analytically, we establish that to reach an ϵ accuracy, the number of iterations needed is Õ(ln⁴(1/ϵ)/(ϵ² ln⁴(1/ρ))), where ρ represents the speed at which the underlying Markov chain converges to the stationary distribution. This implies that the iteration complexity of AdaTD is no worse than that of TD in the worst case. Going beyond TD, we further develop an adaptive variant of TD(λ), referred to as AdaTD(λ). We evaluate the empirical performance of AdaTD and AdaTD(λ) on several standard reinforcement learning tasks in OpenAI Gym with both linear and nonlinear function approximation, which demonstrates the effectiveness of our new approaches over existing ones.


## 1 Introduction

Reinforcement learning (RL) involves a sequential decision-making procedure, where an agent takes (possibly randomized) actions in a stochastic environment over a sequence of time steps, and aims to maximize the long-term cumulative reward received from the interacting environment [29]. Owing to its generality, RL has been widely studied in many areas, such as control theory, game theory, operations research, and multi-agent systems [36, 17].

Temporal difference (TD) learning is one of the most commonly used algorithms for policy evaluation in RL [28]. TD learning provides an iterative procedure to estimate the value function with respect to a given policy based on samples from a Markov chain. The classical TD algorithm adopts a tabular representation for the value function, which stores value estimates on a per-state basis. In large-scale settings, tabular TD learning can become intractable due to the increased number of states, and thus function approximation techniques are often combined with TD for better scalability and efficiency [1, 32].

The idea of TD learning with function approximation is to parameterize the value function as a linear or nonlinear combination of fixed basis functions induced by the states, termed feature vectors, and to estimate the combination parameters in the same spirit as tabular TD learning. Similar to other parametric stochastic optimization algorithms, however, the performance of TD learning with function approximation is very sensitive to the choice of stepsizes, and it often suffers from slow convergence [11]. Ad-hoc adaptive modifications of TD with function approximation have often been used empirically, but their convergence behavior and rates of convergence have not been fully understood. In this context, a natural question to consider is:

Can we develop a provably convergent adaptive algorithm to accelerate the plain-vanilla TD algorithm?

This paper presents an affirmative answer to this question. The key difficulty here is that the update used in the original TD does not follow the (stochastic) gradient direction of any objective function in an optimization problem, which prevents the use of the popular gradient-based optimization machinery. Moreover, the Markovian sampling protocol naturally involved in the TD update further complicates the analysis of adaptive and accelerated optimization algorithms.

### 1.1 Related works

We briefly review related works in the areas of TD learning and adaptive stochastic gradient methods.

Temporal difference learning. The great empirical success of TD [28] has motivated active study of its theoretical foundations. The first convergence analysis of TD was given by [14], which established convergence by leveraging stochastic approximation techniques. In [32], the limit points of TD were characterized, which also gives new intuition about the dynamics of TD learning. The ODE-based method (e.g., [3]) has greatly inspired subsequent research on the asymptotic convergence of TD. Early convergence results for TD learning were mostly asymptotic, e.g., [30], because the TD update does not follow the (stochastic) gradient direction of any objective function. A non-asymptotic analysis for gradient TD, a variant of the original TD, was first given in [20]. A finite-time analysis of TD with i.i.d. observations was developed in [6], and a convergence analysis under Markovian sampling is presented in [2]. In a concurrent line of research, TD has been viewed as a stochastic linear system, with improved results given by [16]; finite-time analyses of stochastic linear systems under Markovian sampling are established in [27, 13], and a finite-time analysis of multi-agent TD is given in [8]. However, all the aforementioned works rely on the original TD update. Adaptive and accelerated variants of TD have been studied in [7, 12], but only an asymptotic analysis is provided in [7], and [12] focuses on learning rate selection.

### 1.2 Our contributions

Complementary to existing theoretical RL efforts, we develop the first adaptive variant of the TD learning algorithm with linear function approximation that has finite-time convergence guarantees. For completeness of our analytical results, we investigate both the original TD algorithm and the TD(λ) algorithm. In a nutshell, our contributions are summarized as follows.

c1) We develop adaptive variants of the TD and TD(λ) algorithms with linear function approximation. The new AdaTD and AdaTD(λ) are (almost) as simple as the original TD and TD(λ) algorithms.

c2) We establish finite-time convergence guarantees for AdaTD and AdaTD(λ), which are no worse than those of the original TD and TD(λ) algorithms in the worst case.

c3) We test AdaTD and AdaTD(λ) on several standard RL benchmarks, where they show favorable empirical results relative to TD, TD(λ) and existing alternatives.

## 2 Preliminaries

This section introduces the notation, basic concepts and properties for RL and TD.

Notation: The i-th coordinate of a vector θ is denoted by [θ]_i, and θ^⊤ is the transpose of θ. We use E to denote the expectation with respect to the underlying probability space, and ∥·∥ for the ℓ2 norm. Given a positive constant R and θ ∈ R^d, P_R(θ) denotes the projection of θ onto the ball {x : ∥x∥ ≤ R}. For a matrix Φ, P_Φ denotes the projection onto the column space of Φ.

### 2.1 Markov Decision Process

Consider a Markov decision process (MDP) described by the tuple (S, A, P, R, γ), where S denotes the state space, A denotes the action space, P represents the transition matrix, R is the reward function, and γ ∈ (0, 1) is the discount factor. Let P(s′|s) denote the transition probability from state s to state s′, and let r(s′, s) be the corresponding transition reward. We consider the finite-state case, i.e., S consists of S elements, and a deterministic or stochastic policy μ(s) or μ(a|s) that specifies an action or a probability distribution over actions given the current state s.

We use the following two assumptions on the stationary distribution and transition reward.

###### Assumption 1.

The rewards are uniformly bounded, that is, |r(s′, s)| ≤ B for all s, s′ ∈ S, with B > 0.

Assumption 1 can be replaced with non-uniform boundedness; uniform boundedness is assumed for simplicity.

###### Assumption 2.

For any two states s, s′ ∈ S, it holds that

 π(s′) = lim_{t→∞} P(s_t = s′ | s_0 = s) > 0,   ∀s′ ∈ S.

There exist constants κ̄ > 0 and ρ ∈ (0, 1) such that for all s ∈ S,

 sup_{s′∈S} |P(s_t = s′ | s_0 = s) − π(s′)| ≤ κ̄ρ^t.

Assumption 2 is standard under the Markovian property. For irreducible and aperiodic Markov chains, Assumption 2 always holds [18]. The constant ρ represents the speed at which the Markov chain approaches the stationary distribution π. When the states are finite, the Markovian transition kernel is a matrix P, and ρ is identical to the modulus of the second largest eigenvalue of P. An important notion for a Markov chain is the mixing time, which measures the time the chain needs for its current state distribution to roughly match the stationary one π. Given an ϵ > 0, the mixing time is τ(ϵ) := min{t : κ̄ρ^t ≤ ϵ}. With Assumption 2, we can see that τ(ϵ) ≤ ln(κ̄/ϵ)/ln(1/ρ). That means if ρ is small, the mixing time is small.

### 2.2 TD versus stochastic approximation

This paper is concerned with evaluating the quality of a given policy μ. We consider the on-policy setting, where both the target policy and the behavior policy are μ. For a given policy μ, since the actions or the distribution of actions are uniquely determined, we omit the dependence on actions in the rest of the paper. We denote the expected reward at a given state s by R(s). The value function associated with a policy μ is the expected cumulative discounted reward from a given state s, that is

 Vμ(s)=E[∞∑t=0γtR(st)∣∣s0=s] (2.1)

where the expectation is taken over the trajectory of states generated under μ and P. The restriction γ < 1 guarantees the boundedness of V_μ. The Markovian property of the MDP yields the well-known Bellman equation

 TμVμ=Vμ (2.2)

where the operator T_μ acting on a value function V can be presented as

 (TμV)(s)=R(s)+γ∑s′∈SP(s′|s)V(s′),s∈S. (2.3)
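For a small finite-state chain, the operator in (2.3) can be applied directly, and iterating it converges to V_μ by contraction; equivalently, V_μ solves the linear system (I − γP)V = R. A minimal numpy sketch (toy numbers, not from the paper):

```python
import numpy as np

def bellman_operator(V, R, P, gamma):
    """Apply (T_mu V)(s) = R(s) + gamma * sum_{s'} P(s'|s) V(s') for all s at once."""
    return R + gamma * P @ V

# Toy 3-state chain under a fixed policy (hypothetical numbers).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.5, 0.5]])
R = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# Fixed-point iteration on the Bellman equation converges to V_mu,
# since T_mu is a gamma-contraction in the sup norm.
V = np.zeros(3)
for _ in range(2000):
    V = bellman_operator(V, R, P, gamma)

# V_mu also solves the linear system (I - gamma P) V = R exactly.
V_exact = np.linalg.solve(np.eye(3) - gamma * P, R)
```

Both routes give the same V_μ, which is exactly the difficulty noted next: the direct solve is infeasible when the number of states is large.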

Solving the (linear) Bellman equation allows us to find the value function induced by a given policy μ. In practice, however, S is usually very large and it is hard to solve the Bellman equation directly. An alternative is to leverage linear [29] or nonlinear approximations (e.g., kernels and neural networks [23]). We focus on the linear case here, that is

 Vμ(s)≈Vθ(s):=ϕ(s)⊤θ (2.4)

where ϕ(s) ∈ R^d is the feature vector for state s, and θ ∈ R^d is a parameter vector. To reduce the difficulty caused by dimensionality, d is set smaller than S. With the linear approximator, the vector of approximated values becomes V_θ = Φθ, where the feature matrix Φ is defined as

 Φ := ⎡ϕ_1(s_1) ⋯ ϕ_k(s_1) ⋯ ϕ_d(s_1)⎤
      ⎢   ⋮          ⋮           ⋮   ⎥ ∈ R^{S×d} (2.5)
      ⎣ϕ_1(s_S) ⋯ ϕ_k(s_S) ⋯ ϕ_d(s_S)⎦

with ϕ_k(s) being the k-th entry of ϕ(s).

###### Assumption 3.

For any state s ∈ S, we assume the feature vector is uniformly bounded such that ∥ϕ(s)∥ ≤ 1, and that the feature matrix Φ has full column rank.

It is not hard to guarantee Assumption 3, since the feature map is chosen by the user and the features can always be normalized. With Assumptions 2 and 3, the matrix Φ^⊤Diag(π)Φ is positive definite, and we denote its minimal eigenvalue as follows

 ω:=λmin(Φ⊤Diag(π)Φ)>0. (2.6)
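The quantity ω in (2.6) is straightforward to compute once Φ and π are fixed. A small sketch with randomly generated, row-normalized features (hypothetical data, only to illustrate the definition):

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 10, 3  # number of states and feature dimension, with d < S

# Random features, rows normalized so that ||phi(s)|| <= 1 (Assumption 3);
# a random Gaussian matrix has full column rank with probability one.
Phi = rng.standard_normal((S, d))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)

pi = np.full(S, 1.0 / S)  # uniform stationary distribution, for illustration

# omega = lambda_min(Phi^T Diag(pi) Phi), positive under Assumption 3.
M = Phi.T @ np.diag(pi) @ Phi
omega = float(np.linalg.eigvalsh(M).min())
```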

With the linear approximation of the value function, the task is then tantamount to finding θ that obeys the Bellman equation given by

 Φθ=TμΦθ. (2.7)

However, a θ that satisfies (2.7) may not exist if d < S. Instead, there always exists a unique solution θ* to the projected Bellman equation [32], given by

 Φθ∗=PΦTμΦθ∗ (2.8)

where P_Φ is the projection onto the column space of Φ.

With {s_k}_{k≥0} denoting a trajectory from the Markov chain, the traditional TD with linear function approximation performs an SGD-like update (with stepsize η)

 θk+1=θk+η¯¯¯g(θk;sk,sk+1) (2.9)

where the stochastic gradient is defined as

 ¯¯¯g(θ;sk,sk+1):= r(sk+1,sk)ϕ(sk)+γϕ(sk+1)⊤θϕ(sk)−ϕ(sk)⊤θϕ(sk). (2.10)

The rationale behind the TD update is that the update direction is asymptotically close to a direction whose limit point is θ*. This is akin to the celebrated stochastic approximation method [26]. Specifically, it has been established that [32]

 limk→∞E¯¯¯g(θ;sk,sk+1)=g(θ) (2.11)

where g(θ) is defined as

 g(θ):=Φ⊤Diag(π)(TμΦθ−Φθ). (2.12)

We term g(θ) the limiting update direction; it satisfies g(θ*) = 0. Note that while ¯g(θ; s_k, s_{k+1}) is an unbiased estimate of g(θ) under the stationary distribution π, it is biased for any finite k due to the Markovian property of {s_k}.

Nevertheless, an important property of the limiting update direction is that, for any θ, it holds that

 ⟨θ∗−θ,g(θ)⟩≥(1−γ)ω∥θ∗−θ∥2. (2.13)

Two important observations follow readily from (2.13). First, θ* is the unique point satisfying g(θ*) = 0: if there existed another θ′ with g(θ′) = 0, then (2.13) would give ∥θ*−θ′∥² ≤ 0, which again means θ′ = θ*. Second, ⟨θ*−θ, g(θ)⟩ ≥ 0, so g(θ) points toward θ*. This also explains why addition instead of subtraction appears in the TD update (2.9).

To ensure the boundedness of the iterates and to simplify the convergence analysis, a projected version of TD is usually used (see e.g., [2]), that is

 θk+1=PR(θk+η¯¯¯g(θk;sk,sk+1)). (2.14)

If R ≥ ∥θ*∥, the projection does not exclude the limit point of the vanilla update [2].
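One projected TD(0) step, combining the direction (2.10) with the update (2.14), can be sketched as follows (synthetic features and rewards; the radius, stepsize and discount are placeholder values, not the paper's settings):

```python
import numpy as np

def td0_update(theta, phi_s, phi_s_next, reward, gamma, eta, radius):
    """One projected TD(0) step: theta <- P_R(theta + eta * g_bar), where
    g_bar = (r + gamma * phi(s')^T theta - phi(s)^T theta) * phi(s)."""
    td_error = reward + gamma * phi_s_next @ theta - phi_s @ theta
    theta = theta + eta * td_error * phi_s
    norm = np.linalg.norm(theta)
    if norm > radius:  # projection P_R onto the ball of radius R
        theta = theta * (radius / norm)
    return theta

# Illustrative run on a synthetic trajectory of unit-norm features.
rng = np.random.default_rng(1)
d = 4
theta = np.zeros(d)
phi = rng.standard_normal(d); phi /= np.linalg.norm(phi)
for _ in range(100):
    phi_next = rng.standard_normal(d); phi_next /= np.linalg.norm(phi_next)
    theta = td0_update(theta, phi, phi_next, reward=1.0, gamma=0.9,
                       eta=0.1, radius=5.0)
    phi = phi_next
```

The projection is what delivers the uniform iterate bounds used repeatedly in the analysis below.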

## 3 Adaptive Temporal Difference Learning

In this section, we formally develop AdaTD, provide its intuitions, and then present the main results.

### 3.1 Algorithm development

We briefly review the schemes of adaptive stochastic gradient descent for minimizing F(θ) := E_ξ[F(θ; ξ)], where ξ is a random variable from an unknown distribution. In addition to adjusting the parameter using the stochastic gradient, adaptive stochastic gradient descent incorporates two important variables: a momentum variable m_k and a weight v_k. The update of the momentum follows

 m_{k+1} = βm_k + (1−β)∇F(θ_k; ξ_k) (3.1)

where β ∈ [0, 1) is a parameter, θ_k is the current iterate and ξ_k is the current sample. Common updates of the weight v_k include the recursive summation of squared gradient norms [10], the geometric summation of squared gradient norms [31], and the running squared maximum [25]. We consider the recursive summation of squared gradient norms in this paper, and leave the general schemes to future work.

As presented in the last section, ¯g(θ; s_k, s_{k+1}) is a stochastic estimate of g(θ). Based on this observation, we consider an adaptive scheme for TD. Replacing the stochastic gradient with the TD update direction, given their similarity, we propose the following scheme

 m_k = βm_{k−1} + (1−β)¯g(θ_k; s_k, s_{k+1}),
 v_k = v_{k−1} + ∥¯g(θ_k; s_k, s_{k+1})∥²,
 θ_{k+1} = P_R(θ_k + ηm_k/(v_k+δ)^{1/2}), (3.2)

where β ∈ [0, 1). In the last step, we use addition rather than subtraction, which coincides with the vanilla TD and projected TD updates. The positive parameter δ is used for numerical stability. The projection used in the scheme directly yields several bounds on the variables even under randomness. Like TD, AdaTD does not use the gradient of any objective function in an optimization problem.
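A single AdaTD iteration can be sketched as below. This is our reading of the scheme: a momentum average of TD directions, an AdaGrad-style recursive sum of squared direction norms, and a projected additive update; all parameter values are placeholders.

```python
import numpy as np

def adatd_step(theta, m, v, g_bar, beta, eta, delta, radius):
    """One AdaTD step: momentum on the TD direction g_bar, recursive
    accumulation of squared norms, then a projected additive update."""
    m = beta * m + (1.0 - beta) * g_bar
    v = v + float(np.dot(g_bar, g_bar))
    theta = theta + eta * m / np.sqrt(v + delta)
    norm = np.linalg.norm(theta)
    if norm > radius:  # projection P_R keeps all iterates bounded
        theta = theta * (radius / norm)
    return theta, m, v

# Illustrative run with random stand-ins for the TD directions.
rng = np.random.default_rng(2)
d = 4
theta, m, v = np.zeros(d), np.zeros(d), 0.0
for _ in range(50):
    g_bar = rng.standard_normal(d)  # placeholder for g_bar(theta_k; s_k, s_{k+1})
    theta, m, v = adatd_step(theta, m, v, g_bar,
                             beta=0.9, eta=0.5, delta=1e-8, radius=2.0)
```

Note how the growing v automatically shrinks the effective stepsize η/√(v+δ), which is the source of AdaTD's robustness to the initial stepsize.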

Accordingly, the main results depend on constants related to these bounds, which we present in Lemma 1.

###### Lemma 1.

For the sequences {θ_k}, {g_k}, {m_k} generated by AdaTD, the following bounds hold

 ∥θk−θ∗∥≤2R,   ∥gk∥≤G,   ∥mk∥≤G (3.3)

where the constant is defined as G := B + 2R.

Lemma 1 follows readily, but the bounds presented in Lemma 1 are critical for the subsequent analysis.

### 3.2 Finite time analysis

The convergence analysis of AdaTD differs from those of both TD and adaptive stochastic gradient descent. Even under i.i.d. sampling, ¯g(θ; s_k, s_{k+1}) fails to be the true gradient of any function, let alone when the samples are drawn from a Markov chain trajectory. The first task is to bound the difference between E¯g(θ_k; s_k, s_{k+1}) and g(θ_k). However, we cannot bound this difference directly, because θ_k is correlated with the sample path, which may miss visiting several states, i.e., some states may be visited with probability 0; the difference could then only be bounded by constants that cannot be controlled in the final convergence result. To resolve this technical issue, we consider ¯g(θ_{k−T}; s_k, s_{k+1}) and g(θ_{k−T}) instead. This is because, although ¯g is biased, E¯g(θ_{k−T}; s_k, s_{k+1}) is sufficiently close to g(θ_{k−T}) when T is large. Using this technique, we can prove the following lemma.

###### Lemma 2.

Assume {θ_k} is generated by AdaTD. Given an integer T ≤ k, we have

 ∥∥E[¯¯¯g(θk−T;sk,sk+1)]−g(θk−T)∥∥≤κρT (3.4)

where κ > 0 is a constant independent of k and T.

Sketch of the proofs: We now present a sketch of the proofs for the main result. Because AdaTD does not have an objective function to optimize, we study the sequence {∥θ*−θ_k∥²}. With direct calculations, we have

 ∥θ*−θ_{k+1}∥² ≤ ∥θ*−θ_k∥² + 2η⟨m_k, θ_k−θ*⟩/(v_k+δ)^{1/2} + η²∥m_k∥²/(v_k+δ).

The main proof consists of three steps:

s1) In the first step, we bound the sum of Ξ_k = E∥m_k∥²/(v_k+δ). Since m_k is a convex combination of past update directions, we expand ∥m_k∥² in terms of the ∥¯g_j∥² and use a provable result on nonnegative sequences (Lemmas 5 and 7 in the Appendix).

s2) In the second step, we consider how to bound E⟨θ_k−θ*, g_k⟩/(v_{k−1}+δ)^{1/2} by E⟨θ_k−θ*, g(θ_k)⟩/(v_{k−1}+δ)^{1/2} (Lemma 8 in the Appendix).

s3) In the third step, we exploit the relation between Υ_k and Υ_{k−1} (Lemma 9 in the Appendix).

With these steps, we can then bound min_{T≤k≤K} E∥θ*−θ_k∥². Combined with (2.13), we can derive the main convergence result.

###### Theorem 1.

Suppose {θ_k} is generated by AdaTD with

 R ≥ 2B/(√ω(1−γ)^{3/2}) (3.5)

under the Markovian observation model. Given integers K and T with T ≤ K, we have

 min_{T≤k≤K} E∥θ*−θ_k∥² ≤ (C_1 ln((δ+KG²)/δ) + C_2)/√K + C_3ρ^T/√K

where C_1, C_2 and C_3 are positive constants that are independent of the number of iterations K. The expressions of C_1, C_2 and C_3 are given in the supplementary materials.

With Theorem 1, to achieve ϵ-accuracy for min_{T≤k≤K} E∥θ*−θ_k∥², we need the following to hold

 C_3ρ^T/√K ≤ ϵ/2,   (C_1 ln((δ+KG²)/δ) + C_2)/√K ≤ ϵ/2. (3.6)

Setting T = ln(1/ϵ)/ln(1/ρ) gives ρ^T ≤ ϵ. Recalling (3.6), we obtain that to achieve a solution whose squared distance to θ* is ϵ, the number of iterations needed is

 Õ(ln⁴(1/ϵ)/(ϵ² ln⁴(1/ρ))). (3.7)

When ρ is not very close to 1, the term ln⁴(1/ρ) stays at a moderate level. Recalling that the most recent convergence result for TD given in [2] is Õ(1/ϵ²), the speed of AdaTD is quite close to that of TD.

We do not present a faster speed for technical reasons. In fact, such a phenomenon also exists for the adaptive stochastic gradient descent. Specifically, although numerical results demonstrate the advantage of adaptive methods, the worst-case convergence rate of adaptive methods is still similar to that of the stochastic gradient descent method.

## 4 Extension to Adaptive TD(λ)

This section presents the adaptive TD(λ) algorithm and its finite-time convergence analysis.

### 4.1 Algorithm development

We first review the TD(λ) scheme [29, 32]. If V_μ solves the Bellman equation (2.2), it also solves

 T(m)μVμ=Vμ,   m∈Z+ (4.1)

where T^{(m)}_μ denotes the (m+1)-st power of T_μ. In this case, we can also represent V_μ as

 Vμ(s)=E[m∑t=0γtR(st)+γm+1Vμ(sm+1)∣∣s0=s].

Given λ ∈ [0, 1) and m ∈ Z_+, the λ-averaged Bellman operator is given by

 (TλμV)(s)=(1−λ)∞∑m=0λm×E[m∑t=0γtR(st)+γm+1V(sm+1)∣∣s0=s]. (4.2)

Comparing (2.3) and (4.2), it is clear that T^0_μ = T_μ. Thus, the vanilla TD is also referred to as TD(0).

Although it is known that TD(λ) generates a sequence that converges almost surely to the solution of the projected λ-averaged Bellman equation, its finite-time analysis was developed only recently by [2]. We denote

 gλ(θ)=Φ⊤Diag(π)(TλμΦθ−Φθ). (4.3)

In TD(λ), the sampled update direction is

 ^g(θ;sk,sk+1,zk):=r(sk+1,sk)zk+γϕ(sk+1)⊤θzk−ϕ(sk)⊤θzk∈Rd (4.4)

where the eligibility trace z_k is recursively updated by

 zk=(γλ)zk−1+ϕ(sk). (4.5)

Similar to TD, it has been established in [32] and [33] that

 limk→+∞E^g(θ;sk,sk+1,zk)=gλ(θ). (4.6)

Like the limiting update direction in TD, a critical property of the limiting update direction in TD(λ) is given by

 ⟨θ*−θ, g^λ(θ)⟩ ≥ (1−α)ω∥θ−θ*∥² (4.7)

where α := γ(1−λ)/(1−γλ) < 1 for any λ ∈ [0, 1). Denote by {ŝ_k} the stationary sequence of the chain and by z_∞ the associated trace. Then, it also holds that

 E^g(θ;^sk,^sk+1,z∞)=gλ(θ). (4.8)

We are now ready to present AdaTD(λ). As summarized in Algorithm 2, AdaTD(λ) is similar to AdaTD except for a different sampling protocol: in each iteration, we sample (s_k, s_{k+1}) together with the trace z_k. Assumptions 1 and 3 indicate that the rewards and features are bounded. Hence, for any k, we can uniformly bound

 ∥g_k∥ ≤ G/(1−γλ),   ∥m_k∥ ≤ G/(1−γλ),   ∥z_k∥ ≤ 1/(1−γλ).
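The trace recursion (4.5) and sampled direction (4.4) are straightforward to implement; with ∥ϕ(s)∥ ≤ 1 the geometric series gives exactly the bound ∥z_k∥ ≤ 1/(1−γλ) used above. A sketch with synthetic unit-norm features (values illustrative):

```python
import numpy as np

def td_lambda_direction(theta, z, phi_s, phi_s_next, reward, gamma, lam):
    """Update the eligibility trace z <- gamma*lam*z + phi(s) as in (4.5),
    then form g_hat = (r + gamma*phi(s')^T theta - phi(s)^T theta) * z (4.4)."""
    z = gamma * lam * z + phi_s
    td_error = reward + gamma * phi_s_next @ theta - phi_s @ theta
    return td_error * z, z

rng = np.random.default_rng(3)
d, gamma, lam = 4, 0.9, 0.5
theta, z = np.zeros(d), np.zeros(d)
for _ in range(200):
    phi = rng.standard_normal(d); phi /= np.linalg.norm(phi)            # ||phi|| = 1
    phi_next = rng.standard_normal(d); phi_next /= np.linalg.norm(phi_next)
    g_hat, z = td_lambda_direction(theta, z, phi, phi_next, 1.0, gamma, lam)

# The trace norm never exceeds the geometric bound 1/(1 - gamma*lam).
```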

### 4.2 Finite time analysis

The analysis of AdaTD(λ) is more complicated than that of AdaTD due to the existence of the eligibility trace z_k. To this end, we bound the sequence {z_k} in the next lemma.

###### Lemma 3.

Assume {z_k} is generated by AdaTD(λ). It then holds that

 ∥Ez_k − Ez_∞∥ ≤ kζ^k/(1−γλ)   and   ∥z_∞∥ ≤ 1/(1−γλ), (4.9)

where ζ := max{γλ, ρ}.

With Lemma 3, and similarly to Lemma 2, we consider the "delayed" expectation. For a fixed T ≤ k, we consider the following error decomposition

 E^g(θk−T;sk,sk+1,zk)=E^g(θk−T;sk,sk+1,z∞) +E^g(θk−T;sk,sk+1,zk)−E^g(θk−T;sk,sk+1,z∞).

Therefore, our problem becomes bounding the difference between Eĝ(θ_{k−T}; s_k, s_{k+1}, z_∞) and g^λ(θ_{k−T}), and the proof is similar to that of Lemma 2. We can then establish the following result.

###### Lemma 4.

Assume {θ_k} is generated by AdaTD(λ). Given an integer T ≤ k, we then get

 ∥E[ĝ(θ_{k−T}; s_k, s_{k+1}, z_k)] − g^λ(θ_{k−T})∥ ≤ κρ^T/(1−γλ) + Ĝζ^k/(1−γλ)

where Ĝ > 0 is a constant independent of k and T.

If λ = 0, then ĝ(θ; s_k, s_{k+1}, z_k) coincides with ¯g(θ; s_k, s_{k+1}), and Lemma 4 reduces to Lemma 2.¹ It is also easy to see that ζ^k ≤ ζ^T for k ≥ T, but we do not replace ζ^k by ζ^T in Lemma 4, because the resulting bound would not diminish as k → ∞. Direct computation gives

 ∑_{k=1}^∞ kζ^k ≤ max{γλ, ρ}/(1−max{γλ, ρ})². (4.10)

¹For convenience, we follow the convention 0⁰ = 0.

###### Theorem 2.

Suppose {θ_k} is generated by AdaTD(λ) with

 R ≥ (2B/(√ω(1−γ))) · ((1−γλ)/((1−γ)(1−λ)))^{1/2} (4.11)

under the Markovian observation model. Given integers K and T with T ≤ K, we have

 min_{T≤k≤K} E∥θ†−θ_k∥² ≤ (C^λ_1 ln((δ+KG²)/δ) + C^λ_2)/√K + C^λ_3ρ^T/√K

where C^λ_1, C^λ_2 and C^λ_3 are positive constants that are independent of the number of iterations K. Their expressions are given in the supplementary materials.

When λ = 0, Theorem 2 reduces to Theorem 1. Recalling (4.10), the trace-induced error is summable. Hence, to achieve ϵ-accuracy for min_{T≤k≤K} E∥θ†−θ_k∥², the number of iterations needed is Õ(ln⁴(1/ϵ)/(ϵ² ln⁴(1/ρ))), the same as for AdaTD.

## 5 Numerical simulations

To validate the theoretical analysis and show the effectiveness of our algorithms, we tested AdaTD and AdaTD(λ) on several commonly used RL tasks. As briefly highlighted below, the first three tasks are from the popular OpenAI Gym [4], and the fourth is a single-agent version of the Navigation task in [21].

Mountain Car: The goal is to control a car in the valley of a mountain to reach the mountaintop. The car needs to drive back and forth to build up momentum, and it accumulates a larger reward if it uses fewer steps to reach the goal.
Acrobot: An agent can actuate the joint between two links. The goal is to swing the links to a certain height. The agent receives negative reward until it reaches the goal.
CartPole: A pendulum is attached to a cart with an un-actuated joint. The agent applies a bi-directional force to the cart to keep the pendulum upright while keeping the cart within the boundary. The agent gets a +1 reward for every step in which the pendulum does not fall and the cart stays within the boundary.
Navigation: The goal is for an agent to reach a landmark based on its observation. The agent's action space is the discrete set {stay, left, right, up, down}. The agent receives a larger reward when the distance to its landmark is smaller.

We compared our algorithms with other policy evaluation methods using the runtime mean squared Bellman error (RMSBE). In each test, the policy is the same for all algorithms while the value parameter is updated separately. In the first two tasks, the value function is approximated by linear functions; in the last two, it is parameterized by a neural network. In the linear tasks, for different values of λ, we compared the AdaTD(λ) algorithm, plain-vanilla TD(λ), and the ALRR algorithm in [12]. For consistency, we changed the update step in the original ALRR algorithm to a single-time-scale TD(λ) update. In the nonlinear tasks, we extended our algorithm to the nonlinear case and compared it with TD(λ). Since ALRR was not designed for neural network-parameterized cases, we did not include it in the nonlinear TD tests.

In Figure 1, for different λ, we compared our AdaTD(λ) method with TD(λ) and the ALRR method in the Mountain Car task. The common parameters for all three algorithms are set as max episode = , batch size = . For AdaTD(λ), we use , and the initial step size . For the ALRR method, we use , and . For TD(λ), the step size is set as . In this test, the performance of all three methods is close, while AdaTD(λ) still has a small advantage over the other two when λ is small.

In Figure 2, for different λ, we compared our AdaTD(λ) with TD(λ) and ALRR in the Acrobot task. In this test, max episode = , batch size = . For AdaTD(λ), we used ( and ) or ( and ), and initial step size . The initial step size is relatively large and would cause the gradient update to explode, but AdaTD(λ) is able to quickly adapt to the large initial step size and guarantee convergence afterwards. For the ALRR method, , and . For TD(λ), we used the constant step size (, , ) or (). Note there is a major fluctuation in the average loss around episode . TD(λ) has a constant step size, and thus it is more vulnerable to this fluctuation than AdaTD(λ). As a result, our algorithm demonstrates better overall convergence speed than TD(λ). In addition, our method also has better stability than the other two methods in this test, as indicated by the small shaded areas.

In Figure 3, we compared AdaTD(λ) with TD(λ) in the CartPole task. The value function is parameterized by a network with two hidden layers, each with 128 neurons, and ReLU activations for the hidden layers. In this test, max episode = , batch size = . For AdaTD(λ), we used , and initial step size . For TD(λ), the step size is . To achieve stable convergence, the step size of TD(λ) cannot be large; thus, it is outperformed by AdaTD(λ) in terms of convergence speed. In fact, when λ is large, even a small step size cannot guarantee the stability of TD(λ). It can be observed in Figure 3 that when λ gets larger, i.e., the gradient is larger, the original TD(λ) becomes less stable. In comparison, AdaTD(λ) exhibits robustness to the choice of λ and to a large initial step size in this test.

In Figure 4, we compared AdaTD(λ) with TD(λ) in the Navigation task. The value function is parameterized by a neural network with two hidden layers, each with 64 neurons, and ReLU activations for the hidden layers. In this test, max episode = , batch size = . For AdaTD(λ), we used , and initial step size . For TD(λ), the step size is . It can be observed from Figure 4 that AdaTD(λ) strongly outperforms TD(λ) in terms of stability and convergence speed. Also, it is worth noticing that when λ is large, the original TD(λ) again has stability issues when converging.

## 6 Conclusions

We focused on developing an improved variant of the celebrated temporal difference (TD) learning algorithm in this paper. Motivated by the tight link between TD and stochastic gradient-based methods, we developed the adaptive TD and TD(λ) algorithms. Finite-time convergence analyses of the novel adaptive TD and TD(λ) algorithms were established under the Markovian observation model. While the theoretical (worst-case) convergence rates of adaptive TD and TD(λ) are similar to those of the original TD and TD(λ), extensive tests on several standard benchmark tasks demonstrate the effectiveness of our new approaches.

## Appendix A Technical Lemmas

###### Lemma 5 ([5, 19]).

For a_t ≥ 0, t = 1, …, T, and δ > 0, we have

 ∑_{t=1}^T a_t/(δ + ∑_{i=1}^t a_i) ≤ ln(δ + Tā) − ln δ,

where ā := max_{1≤t≤T} a_t.
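Lemma 5 is easy to check numerically; the sketch below draws random nonnegative a_t and compares both sides (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.uniform(0.0, 2.0, size=100)  # nonnegative a_t, t = 1..T
delta = 0.1

# Left side: sum_t a_t / (delta + sum_{i<=t} a_i).
lhs = sum(a[t] / (delta + a[: t + 1].sum()) for t in range(len(a)))

# Right side: ln(delta + T * max_t a_t) - ln(delta), which dominates
# ln(delta + sum_t a_t) - ln(delta) since sum_t a_t <= T * max_t a_t.
rhs = np.log(delta + len(a) * a.max()) - np.log(delta)
```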
###### Lemma 6 ([2]).

For any θ ∈ R^d, it follows that

 ⟨θ∗−θ,g(θ)⟩≥(1−γ)ω∥θ∗−θ∥2. (A.1)

In the proofs, we use three shorthand notations for simplicity, all related to the iteration index k. Assume {θ_k}, {m_k}, {v_k} are all generated by AdaTD. We denote

 Ξ_k := E∥m_k∥²/(v_k+δ),
 Υ_k := E(⟨θ_k−θ*, m_k⟩/(v_k+δ)^{1/2}),
 R_k := (8T(1−β)/(δ^{1/2}(1−γ)ω)) ∑_{d=1}^{T} Ξ_{k−d} + ηβΞ_k + (2RG(1−β)/δ)·E[∥g_k∥²/(v_k+δ)] + 2R(1−β)κρ^T/√δ. (A.2)

The remaining technical lemmas all concern the quantities defined above.

###### Lemma 7.

With Ξ_k defined in (A.2), we have

 ∑_{k=1}^K Ξ_k ≤ ((1+β)/(1−β)) ∑_{j=1}^{K−1} E∥g_j∥²/(v_j+δ).

Further, with the boundedness of ∥g_j∥ and Lemma 5, we then get

 ∑_{k=1}^K Ξ_k ≤ ((1+β)/(1−β)) ln((δ+(K−1)G²)/δ).
###### Lemma 8.

Given T ≤ k, we have

 E⟨θ_k−θ*, g_k⟩/(v_{k−1}+δ)^{1/2} ≤ (1/2)E⟨θ_k−θ*, g(θ_k)⟩/(v_{k−1}+δ)^{1/2} + 2Rκρ^T/√δ + (8T/(δ^{1/2}(1−γ)ω)) ∑_{h=1}^{T} E∥m_{k−h}∥²/(v_{k−h}+δ)^{1/2}. (A.3)
###### Lemma 9.

With Υ_k and R_k defined in (A.2), the following result holds for AdaTD

 Υ_k + ((1−β)/2)E⟨θ*−θ_k, g(θ_k)⟩/(v_{k−1}+δ)^{1/2} ≤ βΥ_{k−1} + R_k. (A.4)

On the other hand, we can bound Υ_k as

 |Υ_k| ≤ 2ηRG/δ^{1/2}.

## Appendix B Proofs of Technical Lemmas

### b.1 Proof of Lemma 2

Given an integer T ≤ k, with Assumptions 1 and 2, we have

 E(¯¯¯g(θk−T;sk,sk+1)∣σk−T) =∑s,s′<