1 Introduction
Reinforcement learning (RL) involves a sequential decision-making procedure, where an agent takes (possibly randomized) actions in a stochastic environment over a sequence of time steps, and aims to maximize the long-term cumulative rewards received from the interacting environment [29]. Owing to its generality, RL has been widely studied in many areas, such as control theory, game theory, operations research, and multi-agent systems [36, 17]. Temporal difference (TD) learning is one of the most commonly used algorithms for policy evaluation in RL [28]
. TD learning provides an iterative procedure to estimate the value function with respect to a given policy based on samples from a Markov chain. The classical TD algorithm adopts a tabular representation of the value function, which stores value estimates on a per-state basis. In large-scale settings, tabular TD learning can become intractable due to the increased number of states, and thus function approximation techniques are often combined with TD for better scalability and efficiency
[1, 32]. The idea of TD learning with function approximation is essentially to parameterize the value function as a linear or nonlinear combination of fixed basis functions induced by the states, termed feature vectors, and to estimate the combination parameters in the same spirit as tabular TD learning. Similar to all other parametric stochastic optimization algorithms, however, the performance of TD learning with function approximation is very sensitive to the choice of stepsizes. Oftentimes, it suffers from slow convergence
[11]. Ad-hoc adaptive modifications of TD with function approximation have often been used empirically, but their convergence behavior and rates of convergence are not fully understood. In this context, a natural question to consider is: Can we develop a provably convergent adaptive algorithm to accelerate the plain-vanilla TD algorithm?
This paper presents an affirmative answer to this question. The key difficulty here is that the update used in the original TD algorithm does not follow the (stochastic) gradient direction of any objective function in an optimization problem, which prevents the use of the popular gradient-based optimization machinery. Moreover, the Markovian sampling protocol naturally involved in the TD update further complicates the analysis of adaptive and accelerated optimization algorithms.
1.1 Related works
We briefly review related works in the areas of TD learning and adaptive stochastic gradient methods.
Temporal difference learning. The great empirical success of TD [28] motivated active studies on its theoretical foundation. The first convergence analysis of TD was given by [14], which established convergence by leveraging stochastic approximation techniques. In [32], the characterization of the limit points of TD was studied, which also gave new intuition about the dynamics of TD learning. The ODE-based method (e.g., [3]) has greatly inspired the subsequent research on the asymptotic convergence of TD. Early convergence results for TD learning were mostly asymptotic, e.g., [30]
, because the TD update does not follow the (stochastic) gradient direction of any objective function. A non-asymptotic analysis for gradient TD — a variant of the original TD — was first studied in
[20]. The finite-time analysis of TD with i.i.d. observations was studied in [6], and a convergence analysis under Markovian sampling was presented in [2]. In a concurrent line of research, TD has been viewed as a stochastic linear system, with improved results given by [16]. Finite-time analyses of stochastic linear systems under Markovian sampling were established by [27, 13], and a finite-time analysis of multi-agent TD was proved by [8]. However, all the aforementioned works leverage the original TD update. Adaptive and accelerated variants of TD have been studied in [7, 12], but only an asymptotic analysis is provided in [7], and [12] focuses on learning rate selection.

Adaptive stochastic gradient descent.
In machine learning areas different from but related to RL, adaptive stochastic gradient descent methods have been actively studied. The first adaptive gradient (AdaGrad) method was proposed in
[10, 22], and the algorithm demonstrated impressive numerical results when the gradients are sparse. While the original AdaGrad has performance guarantees only in the convex case, nonconvex AdaGrad was studied in [19]. Beyond the basic convex results, a sharp analysis of convex AdaGrad was also investigated in [34]. Variants of AdaGrad have been developed in [31, 35], which use alternative updating schemes (exponential moving averages) rather than the running sum of squared past gradients. Applying momentum to adaptive stochastic algorithms gave rise to Adam and Nadam [15, 9]. However, in [25], the authors demonstrate that Adam may diverge under certain circumstances, and provide a new convergent variant called AMSGrad. Another fix, given by [37], is to use decreasing factors in the moving average of squared past gradients. In [5], a convergence framework for generic Adam-type algorithms was proposed, which covers various adaptive methods.

1.2 Our contributions
Complementary to existing theoretical RL efforts, we develop the first adaptive variant of the TD learning algorithm with linear function approximation that has finite-time convergence guarantees. For completeness of our analytical results, we investigate both the original TD algorithm and the TD(λ) algorithm. In a nutshell, our contributions are summarized as follows.
c1) We develop adaptive variants of the TD and TD(λ) algorithms with linear function approximation. The new AdaTD and AdaTD(λ) algorithms are (almost) as simple as the original TD and TD(λ) algorithms.
c2) We establish finite-time convergence guarantees for AdaTD and AdaTD(λ), which are no worse than those of the original TD and TD(λ) algorithms in the worst case.
c3) We test AdaTD and AdaTD(λ) on several standard RL benchmarks, where they show favorable empirical results relative to TD, TD(λ), and the existing alternatives.
2 Preliminaries
This section introduces the notation, basic concepts and properties for RL and TD.
Notation: The coordinate of a vector is denoted by , and is the transpose of . We use
to denote the expectation with respect to the underlying probability space, and
for the norm. Given a positive constant and , denotes the projection of onto the ball . For a matrix , denotes the projection onto the space .

2.1 Markov Decision Process
Consider a Markov decision process (MDP) described as a tuple
), where denotes the state space, denotes the action space, represents the transition matrix, is the reward function, and is the discount factor. In this case, let denote the transition probability from state to state , with corresponding transition reward . We consider the finite-state case, i.e., consists of elements, and a deterministic or stochastic policy or that specifies an action or a probability density over actions given the current state .

We use the following two assumptions on the stationary distribution and transition reward.
Assumption 1.
The rewards are uniformly bounded, that is, with .
Assumption 1 can be replaced with non-uniform boundedness; uniform boundedness is assumed for simplicity.
Assumption 2.
For any two states , it holds that
There exist constants and such that
Assumption 2 is standard under the Markov property. For irreducible and aperiodic Markov chains, Assumption 2 always holds [18]. In fact, the constant represents the speed at which the Markov chain approaches the stationary distribution . When the states are finite, the Markovian transition kernel is a matrix , and
is identical to the second largest eigenvalue of
. An important notion in Markov chains is the mixing time, which measures the time a Markov chain needs until its current state distribution roughly matches the stationary one . Given an , the mixing time . With Assumption 2, we can see . That means if is small, the mixing time is small.

2.2 TD versus stochastic approximation
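To make the mixing behavior concrete, the following sketch builds a hypothetical 3-state transition matrix (all numbers are ours, purely for illustration), computes its stationary distribution and second-largest eigenvalue modulus, and checks that the state distribution approaches the stationary one geometrically:

```python
import numpy as np

# Hypothetical 3-state transition matrix (rows sum to 1); the entries
# below are illustrative, not from the paper.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

# Stationary distribution: the left eigenvector of P for eigenvalue 1,
# normalized to sum to one.
evals, evecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(evals - 1.0))
mu = np.real(evecs[:, idx])
mu = mu / mu.sum()

# The second-largest eigenvalue modulus controls the geometric mixing rate.
rho = np.sort(np.abs(evals))[-2]

# The distribution after t steps from state 0 approaches mu at rate ~ rho**t.
d = np.zeros(3)
d[0] = 1.0
for _ in range(50):
    d = d @ P
```

Since rho ≈ 0.47 here, fifty steps already bring the distribution within numerical precision of the stationary one, illustrating why a smaller contraction constant means a smaller mixing time.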
This paper is concerned with evaluating the quality of a given policy . We consider the on-policy setting, where both the target policy and the behavior policy are . For a given policy , since the actions or the distribution of actions are uniquely determined, we eliminate the dependence on the action in the rest of the paper. We denote the expected reward at a given state by . The value function associated with a policy is the expected cumulative discounted reward from a given state , that is
(2.1) 
where the expectation is taken over the trajectory of states generated under and . The restriction on the discount factor guarantees the boundedness of . The Markovian property of the MDP yields the well-known Bellman equation
(2.2) 
where the operator acting on a value function can be expressed as
(2.3) 
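To make the Bellman fixed point concrete: for a small finite state space, the (linear) Bellman equation can be solved directly as a linear system. A minimal sketch on a hypothetical 2-state chain (all numbers are ours):

```python
import numpy as np

# Tiny illustrative Markov reward process: V = r + gamma * P V can be
# solved in closed form as V = (I - gamma*P)^{-1} r when |S| is small.
P = np.array([[0.5, 0.5], [0.5, 0.5]])  # transition matrix
r = np.array([1.0, 0.0])                # expected reward at each state
gamma = 0.9                             # discount factor

V = np.linalg.solve(np.eye(2) - gamma * P, r)  # exact value function
```

This direct solve costs on the order of |S|^3 operations, which motivates the function approximation discussed next when the state space is large.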
Solving the (linear) Bellman equation allows us to find the value function induced by a given policy . However, in practice, is usually very large and it is hard to solve the Bellman equation directly. An alternative method is to leverage linear [29]
or nonlinear approximations (e.g., kernels and neural networks
[23]). We focus on the linear case here, that is
(2.4) 
where is the feature vector for state , and is a parameter vector. To reduce the difficulty caused by the dimension, is set smaller than . It is worth mentioning that can be unequal to . With the linear approximation, the vector becomes , where the feature matrix is defined as
(2.5) 
with being the th entry of .
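As a concrete illustration (the features and parameter values below are hypothetical, not from the paper), with n = 4 states and d = 2 features the linear approximation evaluates all states at once through the feature matrix:

```python
import numpy as np

# n = 4 states, d = 2 features (d < n): V(s) is approximated as
# phi(s)^T theta. Row s of Phi is the feature vector phi(s)^T.
Phi = np.array([[1.0, 0.0],
                [1.0, 0.5],
                [0.0, 1.0],
                [0.5, 1.0]])   # the n x d feature matrix of (2.5)

theta = np.array([2.0, -1.0])  # parameter vector to be learned

V_hat = Phi @ theta            # approximate values of all states at once
```

Note that Phi here has full column rank, as required by Assumption 3 below.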
Assumption 3.
For any state , we assume the feature vector is uniformly bounded such that , and the feature matrix has full column rank.
It is not hard to guarantee Assumption 3 since the feature map is chosen by the user and . With Assumptions 2 and 3, the matrix is positive definite, and we denote its minimal eigenvalue as follows
(2.6) 
With the linear approximation of value function, the task then is tantamount to finding that obeys the Bellman equation given by
(2.7) 
However, a satisfying (2.7) may not exist if . Instead, there always exists a unique solution to the projected Bellman equation [32], given by
(2.8) 
where is the projection onto the space of .
With denoting a trajectory from the Markov chain, traditional TD with linear function approximation performs an SGD-like update (with stepsize )
(2.9) 
where the stochastic gradient is defined as
(2.10) 
The rationale behind the TD update is that the update direction is a good one, since it is asymptotically close to the direction whose limit point is . This is akin to the celebrated stochastic approximation method [26]. Specifically, it has been established [32] that
(2.11) 
where is defined as
(2.12) 
We term as the limiting update direction, which ensures that . Note that while
is an unbiased estimate under the stationary
, it is not for a finite due to the Markovian property of . Nevertheless, an important property of the limiting update direction is that, for any , it holds that
(2.13) 
Two important observations follow readily from (2.13). The first is that only one satisfies ; if there existed another such that , we would have , which again means . The second is that . This also explains why addition instead of subtraction appears in the TD update (2.9).
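The TD(0) update (2.9)–(2.10) and the limiting direction (2.12) can be sketched as follows, on a hypothetical 2-state chain with identity features (so the projected Bellman solution is the exact value function; all numbers are ours). Iterating deterministically along the limiting direction converges to the fixed point, illustrating why moving along — not against — the pseudo-gradient is the right sign:

```python
import numpy as np

# Hypothetical 2-state chain with identity features (tabular case).
P = np.array([[0.5, 0.5], [0.5, 0.5]])   # transition matrix
r = np.array([1.0, 0.0])                 # expected reward per state
gamma = 0.9
mu = np.array([0.5, 0.5])                # stationary distribution
D = np.diag(mu)

def td0_direction(theta, phi_s, phi_s_next, reward):
    """Stochastic TD(0) pseudo-gradient of (2.10): delta * phi(s)."""
    delta = reward + gamma * (phi_s_next @ theta) - (phi_s @ theta)
    return delta * phi_s

def limiting_direction(theta):
    """Limiting update direction (2.12): Phi^T D (r + gamma*P*Phi*theta
    - Phi*theta), with Phi = I here."""
    return D @ (r + gamma * P @ theta - theta)

# theta <- theta + alpha * gbar(theta) converges to the solution of the
# (projected) Bellman equation; here theta* = [5.5, 4.5].
theta = np.zeros(2)
for _ in range(2000):
    theta = theta + 0.5 * limiting_direction(theta)
```

At the fixed point the limiting direction vanishes, matching the property that only one parameter vector satisfies the equation.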
3 Adaptive Temporal Difference Learning
In this section, we formally develop AdaTD, provide the intuition behind it, and then present the main results.
3.1 Algorithm development
We briefly review the schemes of adaptive stochastic gradient descent for minimizing , where
is a random variable from an unknown distribution
. In addition to adjusting the parameter using stochastic gradients, adaptive stochastic gradient descent methods incorporate two important variables: momentum variables and weights . The update of the momentum follows
(3.1) 
where is a parameter, is the current iterate, and is the current sample. The update of can take the form of the recursive summation of squared gradient norms [10], the geometric summation of squared gradient norms [31], or the squared maximum [25]. We consider the recursive summation of squared gradient norms in this paper, and leave the general schemes to future work.
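A minimal sketch of this scheme — momentum as in (3.1) combined with the recursive summation of squared gradient norms [10] — applied to a simple quadratic (parameter names and the test function are ours, for illustration only):

```python
import numpy as np

def adaptive_sgd_step(x, m, v, grad, alpha=1.0, beta=0.9, eps=1e-8):
    """One adaptive step: momentum recursion (3.1) plus the running sum
    of squared gradient norms as the weight v (a sketch)."""
    m = beta * m + (1 - beta) * grad        # momentum recursion (3.1)
    v = v + np.dot(grad, grad)              # recursive sum of ||grad||^2
    x = x - alpha * m / (np.sqrt(v) + eps)  # adaptively scaled descent step
    return x, m, v

# Minimize f(x) = ||x||^2 / 2, whose gradient at x is simply x.
x, m, v = np.array([3.0, -2.0]), np.zeros(2), 0.0
for _ in range(2000):
    x, m, v = adaptive_sgd_step(x, m, v, x)
```

The growing weight v automatically shrinks the effective stepsize, which is what AdaTD inherits in the next subsection.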
As presented in the last section, is a stochastic estimate of . Based on this observation, we consider an adaptive scheme for TD. Replacing with on account of their similarity, we propose the following scheme
(3.2) 
where . In the second step, we use rather than , which coincides with vanilla TD and projected TD. The positive parameter is used for numerical stability. The projection used in the scheme directly yields several bounds on the variables even in the presence of randomness. Like TD, AdaTD does not use the gradient of any objective function in an optimization problem.
Consequently, the main results depend on constants related to these bounds, which we present in Lemma 1.
Lemma 1.
For , the following bounds hold
(3.3) 
where the constant is defined as .
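A sketch of the AdaTD scheme (3.2), run on the hypothetical 2-state chain used for illustration earlier (all numbers are ours): the TD pseudo-gradient replaces the stochastic gradient, the update moves along it with a '+' sign, and the iterate is projected onto a ball for stability:

```python
import numpy as np

# Hypothetical 2-state chain, identity features; the limiting TD
# direction stands in for the stochastic pseudo-gradient here.
P = np.array([[0.5, 0.5], [0.5, 0.5]])
r = np.array([1.0, 0.0])
gamma = 0.9
D = np.diag([0.5, 0.5])

def pseudo_gradient(theta):
    """Limiting TD direction gbar(theta) with identity features."""
    return D @ (r + gamma * P @ theta - theta)

def ada_td_step(theta, m, v, g, alpha=1.0, beta=0.9, eps=1e-8, radius=10.0):
    """One AdaTD iteration (a sketch of scheme (3.2))."""
    m = beta * m + (1 - beta) * g                   # momentum on g
    v = v + np.dot(g, g)                            # recursive sum of ||g||^2
    theta = theta + alpha * m / (np.sqrt(v) + eps)  # '+': move ALONG g
    n = np.linalg.norm(theta)
    if n > radius:                                  # projection onto the ball
        theta = theta * (radius / n)
    return theta, m, v

theta, m, v = np.zeros(2), np.zeros(2), 0.0
for _ in range(5000):
    theta, m, v = ada_td_step(theta, m, v, pseudo_gradient(theta))
```

The projection radius is chosen large enough to contain the fixed point, so it only guards against transient blow-ups, in line with the bounds of Lemma 1.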
3.2 Finite time analysis
The convergence analysis of AdaTD differs from those of both TD and adaptive stochastic gradient descent. Even under i.i.d. sampling, fails to be the true gradient of any function, let alone when the samples are drawn from a Markov chain trajectory. The first task is to bound the difference between and . However, it seems difficult to bound this difference directly, because is related to , which may miss visiting several states, i.e., choose some states with probability 0. Hence, and can only be uniformly bounded by constants that cannot be controlled in the final convergence result. To resolve this technical issue, we consider and . This is because, although is biased, it is sufficiently close to when is large. Using this technique, we can prove the following lemma.
Lemma 2.
Assume is generated by AdaTD. Given an integer , we have
(3.4) 
where the constant is defined as .
Sketch of the proofs: We now present a sketch of the proof of the main result. Because AdaTD does not have any objective function to optimize, we consider the sequence . By direct calculation, we have
The main proof consists of three steps:
s1) In the first step, we bound . Since is a composition of , we expand via and use a provable result on nonnegative sequences (Lemmas 5 and 7 in the Appendix).
s2) In the second step, we consider how to bound with (Lemma 8 in the Appendix).
s3) In the third step, we exploit the relation between and (Lemma 9 in the Appendix).
With these steps, we can then bound . Combining this with (2.13), we derive the main convergence result.
Theorem 1.
Suppose is generated by AdaTD with
(3.5) 
under Markovian observations. Given the integer and , for , we have
where , and , , and are positive constants that are independent of the number of iterations . The expressions of , and are given in the supplementary material.
With Theorem 1, to achieve accuracy for , we need the following to hold
(3.6) 
With the setting , it follows that . Recalling (3.6), we obtain that to achieve a solution whose squared distance to is , the number of iterations needed is
(3.7) 
When is not very close to , the term remains relatively small. Recalling that the most recent convergence result for TD given in [2] is , the rate of AdaTD is quite close to that of TD.
We do not present a faster rate for technical reasons. In fact, the same phenomenon exists for adaptive stochastic gradient descent: although numerical results demonstrate the advantage of adaptive methods, their worst-case convergence rate is still similar to that of stochastic gradient descent.
4 Extension to Adaptive TD(λ)
This section presents the adaptive TD(λ) algorithm and its finite-time convergence analysis.
4.1 Algorithm development
We first review the scheme of TD(λ) [29, 32]. If solves the Bellman equation (2.2), it also solves
(4.1) 
where denotes the th power of . In this case, we can also represent as
Given and , , the averaged Bellman operator is given by
(4.2) 
Comparing (2.3) and (4.2), it is clear that . Thus, vanilla TD is also referred to as TD(0).
Although it is known that TD(λ) generates a sequence that converges almost surely to the solution of , a finite-time analysis was developed only recently by [2]. We denote
(4.3) 
In TD(λ), an alternative sampling operator is
(4.4) 
where is recursively updated by
(4.5) 
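The trace recursion (4.5) and the update direction (4.4) can be sketched as one TD(λ) step (the adaptive scaling of Section 3 would then be applied on top of the direction g; the names below are ours):

```python
import numpy as np

def td_lambda_step(theta, z, phi_s, phi_s_next, reward, gamma, lam, alpha):
    """One TD(lambda) update: the eligibility trace z accumulates
    discounted features as in (4.5), and the TD error is applied along
    z as in (4.4) (a sketch)."""
    z = gamma * lam * z + phi_s                     # trace recursion (4.5)
    delta = reward + gamma * (phi_s_next @ theta) - (phi_s @ theta)
    g = delta * z                  # update direction; AdaTD(lambda) would
    theta = theta + alpha * g      # rescale g adaptively before this step
    return theta, z
```

With lam = 0 the trace reduces to phi(s) and the step coincides with the TD(0) update (2.9), matching the identity T = T(0) noted above.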
4.2 Finite time analysis
The analysis of TD(λ) is more complicated than that of TD due to the existence of . To this end, we bound the sequence in the next lemma.
Lemma 3.
Assume is generated by AdaTD(λ). It then holds that
(4.9) 
With Lemma 3, similar to Lemma 2, we consider the “delayed” expectation. For a fixed , we consider the following error decomposition
Therefore, our problem reduces to bounding the difference between and , and the proof is similar to that of Lemma 2. We can then establish the following result.
Lemma 4.
Assume is generated by AdaTD(λ). Given an integer , we then have
where .
If , then it holds that (for convenience, we follow the convention ), and thus Lemma 4 reduces to Lemma 2. It is also easy to see that . However, we do not replace with in Lemma 4, since the resulting bound would not diminish when . Direct computation gives
(4.10) 
Now we are ready to present the convergence of AdaTD(λ).
Theorem 2.
Suppose is generated by AdaTD(λ) with
(4.11) 
under Markovian observations. Given any integer and , , for , we have
where , , and are positive constants that are independent of the number of iterations . The expressions of , and are given in the supplementary material.
When , Theorem 2 reduces to Theorem 1. Recalling the fact (4.10), . Hence, to achieve accuracy for , the number of iterations needs to be , the same as for AdaTD.
5 Numerical simulations
To validate the theoretical analysis and show the effectiveness of our algorithms, we tested AdaTD and AdaTD(λ) on several commonly used RL tasks. As briefly highlighted below, the first three tasks are from the popular OpenAI Gym [4], and the fourth is a single-agent version of the Navigation task in [21].
Mountain Car: The goal is to control a car at the valley of a mountain to reach the top. The car needs to drive back and forth to build up momentum, and gets larger accumulated reward if it uses fewer steps to achieve the goal.
Acrobot: An agent can actuate the joint between two links. The goal is to swing the links to a certain height. The agent will get negative reward before it reaches the goal.
CartPole: A pendulum is attached to a cart with an unactuated joint. The agent can apply a bidirectional force to the cart to keep the pendulum upright while keeping the cart within bounds. The agent gets a +1 reward for every step in which the pendulum has not fallen and the cart stays within bounds.
Navigation: The goal is for an agent to reach a landmark based on its observation. The agent’s action space is a discrete set {stay, left, right, up, down}. The agent is rewarded more as the distance to its landmark becomes shorter.
We compared our algorithms with other policy evaluation methods using the runtime mean squared Bellman error (RMSBE). In each test, the policy is the same for all algorithms while the value parameter is updated separately. In the first two tasks, the value function is approximated using linear functions; in the last two, it is parameterized by a neural network. In the linear tasks, for different values of λ, we compared the AdaTD(λ) algorithm, plain-vanilla TD(λ), and the ALRR algorithm in [12]. For consistency, we changed the update step in the original ALRR algorithm to a single-timescale TD(λ) update. In the nonlinear tasks, we extended our algorithm to the nonlinear case and compared it with TD(λ). Since ALRR was not designed for neural network parameterizations, we did not include it in the nonlinear TD tests.
In Figure 1, for different λ, we compared our AdaTD(λ) method with TD(λ) and the ALRR method on the Mountain Car task. The common parameters for all three algorithms are set as max episode = , batch size = . For AdaTD(λ), we use , and the initial stepsize . For the ALRR method, we use , and . For TD(λ), the stepsize is set as . In this test, the performance of all three methods is close, while AdaTD(λ) still has a small advantage over the other two when λ is small.
In Figure 2, for different λ, we compared our AdaTD(λ) with TD(λ) and ALRR on the Acrobot task. In this test, max episode = , batch size = . For AdaTD(λ), we used ( and ) or ( and ), and initial stepsize . The initial stepsize is relatively large for AdaTD(λ), which would normally cause the gradient update to explode, but AdaTD(λ) is able to quickly adapt to the large initial stepsize and guarantee convergence afterwards. For the ALRR method, , and . For TD(λ), we used the constant stepsize (, , ) or (). Note there is a major fluctuation in average loss around episode . TD(λ) has a constant stepsize, and thus is more vulnerable to this fluctuation than AdaTD(λ). As a result, our algorithm demonstrates better overall convergence speed than TD(λ). In addition, our method also has better stability than the other two methods in this test, as indicated by the small shaded areas.
In Figure 3, we compared AdaTD(λ) with TD(λ) on the CartPole task. The value function is parameterized by a network with two hidden layers of 128 neurons each, using ReLU activations for the hidden layers. In this test, max episode = , batch size = . For AdaTD(λ), we used , and initial stepsize . For TD(λ), the stepsize is . To achieve stable convergence, the stepsize of TD(λ) cannot be large; thus, it is outperformed by AdaTD(λ) in terms of convergence speed. In fact, when λ is large, even a small stepsize cannot guarantee the stability of TD(λ). It can be observed in Figure 3 that when λ gets larger, i.e., the gradient is larger, the original TD(λ) becomes less stable. In comparison, AdaTD(λ) exhibits robustness to the choice of λ and to a large initial stepsize in this test.

In Figure 4, we compared AdaTD(λ) with TD(λ) on the Navigation task. The value function is parameterized by a neural network with two hidden layers of 64 neurons each, also with ReLU activations. In this test, max episode = , batch size = . For AdaTD(λ), we used , and initial stepsize . For TD(λ), the stepsize is . It can be observed from Figure 4 that AdaTD(λ) strongly outperforms TD(λ) in terms of stability and convergence speed. It is also worth noting that when λ is large, the original TD(λ) again exhibits stability issues while converging.
6 Conclusions
In this paper, we focused on developing an improved variant of the celebrated temporal difference (TD) learning algorithm. Motivated by the tight link between TD and stochastic gradient-based methods, we developed adaptive TD and TD(λ) algorithms, and established their finite-time convergence under the Markovian observation model. While the theoretical (worst-case) convergence rates of adaptive TD and TD(λ) are similar to those of the original TD and TD(λ), extensive tests on several standard benchmark tasks demonstrate the effectiveness of our new approaches.
References
 [1] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning, pages 30–37. 1995.
 [2] Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.
 [3] Vivek S Borkar and Sean P Meyn. The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
 [4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.
 [5] Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of adamtype algorithms for nonconvex optimization. ICLR, 2018.

 [6] Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for TD(0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [7] Adithya Devraj and Sean Meyn. Zap Q-learning. In Advances in Neural Information Processing Systems, pages 2235–2244, Long Beach, CA, Dec. 2017.
 [8] Thinh T Doan, Siva Theja Maguluri, and Justin Romberg. Convergence rates of distributed td (0) with linear function approximation for multiagent reinforcement learning. arXiv preprint arXiv:1902.07393, 2019.

 [9] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
 [10] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 [11] Eyal EvenDar and Yishay Mansour. Learning rates for qlearning. Journal of machine learning Research, 5(Dec.):1–25, 2003.
 [12] Harsh Gupta, R. Srikant, and Lei Ying. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. In Advances in Neural Information Processing Systems, pages 4706–4715, Vancouver, Canada, November 2019.
 [13] Bin Hu and Usman Syed. Characterizing the exact behaviors of temporal difference learning algorithms using markov jump linear system theory. In Advances in Neural Information Processing Systems, pages 8477–8488, Vancouver, Canada, Dec. 2019.
 [14] Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. Convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems, pages 703–710, 1994.
 [15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2014.
 [16] Chandrashekar Lakshminarayanan and Csaba Szepesvari. Linear stochastic approximation: How far does constant stepsize and iterate averaging go? In International Conference on Artificial Intelligence and Statistics, pages 1347–1355, 2018.
 [17] Donghwan Lee, Niao He, Parameswaran Kamalaruban, and Volkan Cevher. Optimization for reinforcement learning: From single agent to cooperative agents. arXiv preprint arXiv:1912.00498, 2019.
 [18] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
 [19] Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. AISTATS, 2018.
 [20] Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. Finitesample analysis of proximal gradient td algorithms. In Proc. Conf. Uncertainty in Artificial Intelligence, pages 504–513, Amsterdam, Netherlands, 2015.
 [21] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, Long Beach, CA, December 2017.
 [22] H Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. arXiv preprint arXiv:1002.4908, 2010.
 [23] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [24] Rufus Oldenburger et al. Infinite powers of matrices and characteristic roots. Duke Mathematical Journal, 6(2):357–361, 1940.
 [25] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. ICLR, 2019.
 [26] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
 [27] R Srikant and Lei Ying. Finitetime error bounds for linear stochastic approximation and TD learning. COLT, 2019.
 [28] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
 [29] Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 2. MIT press Cambridge, 1998.
 [30] Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. A convergent temporaldifference algorithm for offpolicy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 1609–1616, Vancouver, Canada, Dec. 2009.
 [31] Tijmen Tieleman and Geoffrey Hinton. Rmsprop: Neural networks for machine learning. University of Toronto, Technical Report, 2012.
 [32] JN Tsitsiklis and B Van Roy. An analysis of temporaldifference learning with function approximation. IEEE Transactions on Automatic Control, 1997.
 [33] Benjamin Van Roy. Learning and value function approximation in complex decision processes. PhD thesis, Massachusetts Institute of Technology, 1998.
 [34] Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
 [35] Matthew D Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint:1212.5701, December 2012.
 [36] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multiagent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635, 2019.

 [37] Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11127–11135, 2019.
Appendix A Technical Lemmas
Lemma 6 ([2]).
For any , it follows
(A.1) 
In the proofs, we use three shorthand notations for simplification. These notations are all related to the iteration . Assume , , are all generated by AdaTD. We denote
(A.2) 
The remaining technical lemmas all concern the notation given above.
Lemma 7.
Lemma 8.
Given , we have
(A.3) 
Lemma 9.
With and defined in (A.2), the following result holds for AdaTD
(A.4) 
On the other hand, we can bound as