1 Introduction
The theory of Q-learning with function approximation has not caught up with the famous success stories in applications. Counterexamples appeared in the 1990s, shortly following the seminal work of Watkins and Dayan establishing consistency of the Q-learning algorithm in the tabular setting [50]. These examples demonstrate failure of the natural generalization of Watkins' algorithm, even in very simple settings, such as linear function approximation in a simple finite state-action Markov decision process (MDP) [3, 28]. Even when convergence holds, it is found in practice that convergence of Q-learning can be extremely slow. This paper focuses on algorithm design to ensure stability and consistency of the algorithm, along with techniques to obtain at least qualitative insight into the rate of convergence. The framework for algorithm design is the theory of stochastic approximation (SA), and the ODE approximation that is central to that theory. There is a long history of application of SA tools in the analysis of reinforcement learning (RL) algorithms [44, 46, 23, 7, 28].
An explanation for the slow convergence of Watkins' Q-learning is given in [13, 14]. The starting point is the recognition that this RL algorithm can be represented as a d-dimensional SA recursion
(1) θ_{n+1} = θ_n + α_{n+1} f(θ_n, Φ_{n+1}),
in which Φ = {Φ_n} is a Markov chain on a finite state space Z, {α_n} is a non-negative gain sequence, and f: ℝ^d × Z → ℝ^d. In the tabular Q-learning algorithm of Watkins, the dimension d is equal to the number of state-action pairs. The definition of f for this case is given in Section 2. We assume throughout that this Markov chain has a unique invariant probability mass function (pmf).
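As a minimal illustration of (1), the sketch below runs the recursion with an invented two-state Markov chain and an invented affine f; the iterates converge to the root of the steady-state mean of f (all concrete values here are assumptions for illustration):

```python
import random

# Minimal simulation of the SA recursion (1) with Markovian noise.
# The two-state chain P and the function f are invented for illustration.
random.seed(0)

P = {0: [0.9, 0.1], 1: [0.2, 0.8]}   # transition probabilities of Phi
pi = [2/3, 1/3]                       # invariant pmf of this chain
f = lambda theta, x: x - theta        # f(theta, Phi); mean field = E_pi[Phi] - theta

theta, x = 0.0, 0
for n in range(1, 200_000):
    x = 0 if random.random() < P[x][0] else 1
    theta += (1.0 / n) * f(theta, x)  # alpha_n = 1/n

# root of the mean field: E_pi[Phi] = pi[1] = 1/3
assert abs(theta - 1/3) < 0.05
```

The limit is determined entirely by the steady-state mean of f, which is the essence of the ODE approximation discussed next.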
It is known that the evolution of (1) can be approximated by the solution to the ODE d/dt ϑ_t = f̄(ϑ_t), in which f̄ denotes the SA vector field f̄(θ) := E[f(θ, Φ_n)], where the expectation is in steady state. This is the essence of the SA algorithms that have been refined over nearly 70 years since the original work of Robbins and Monro [35]. See [6] for an accessible treatment. If the ODE is stable in a suitably strong sense, in particular, if solutions converge to some limit θ* from each initial condition, then the same is true for (1): θ_n → θ* with probability one. SA theory also provides tools to understand the rate of convergence. This is based on an approximation of (1) via the linear recursion,
(2) θ_{n+1} = θ_n + α_{n+1} [A θ̃_n + Δ_{n+1}],
where A := ∂_θ f̄(θ*) is called the linearization matrix, and Δ_{n+1} := f(θ*, Φ_{n+1}). This is intended to approximate the error dynamics for the sequence θ̃_n := θ_n − θ*; see the standard textbooks [25, 6, 4].
The sequence {Δ_n} is zero mean for the stationary version of the Markov chain Φ. Its asymptotic covariance (appearing in the Central Limit Theorem) is denoted
(3) Σ_Δ := Σ_{n=−∞}^{∞} E[Δ_n Δ_0ᵀ],
where the expectation is in steady state. For a fixed but arbitrary initial condition for (2), we obtain the following remarkable conclusions; the proof is in Appendix A.
Proposition 1.1.
Suppose that the matrix A is Hurwitz: Re(λ) < 0 for every eigenvalue λ of A, and that α_n = 1/n for n ≥ 1. Then, for the linear recursion (2):
(i) If Re(λ) < −½ for every eigenvalue λ of A, then the rate of convergence is 1/√n:
lim_{n→∞} n E[θ̃_n θ̃_nᵀ] = Σ_θ,
where Σ_θ is the solution to the Lyapunov equation
(4) (A + ½I) Σ_θ + Σ_θ (A + ½I)ᵀ + Σ_Δ = 0.
(ii) Suppose there is an eigenvalue λ of A satisfying ϱ := −Re(λ) < ½, and let v denote a corresponding left eigenvector satisfying v†Σ_Δ v > 0. Then the sequence E[|vᵀθ̃_n|²] converges to zero no faster than n^{−2ϱ}.
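The scalar case of part (i) can be checked by simulation. The sketch below is an assumption-laden toy: the Markovian noise is replaced by i.i.d. Gaussian noise (for which Σ_Δ reduces to the one-step variance s²), and all parameter values are invented; it verifies numerically that n E[θ̃_n²] approaches the solution s²/(2|a| − 1) of the scalar Lyapunov equation (4).

```python
import random

# Monte-Carlo check of the scalar case of Prop. 1.1(i): for
# x_{n+1} = x_n + (1/n)(a x_n + Delta_n) with i.i.d. Delta ~ N(0, s^2) and
# a < -1/2, n * E[x_n^2] -> s^2 / (2|a| - 1). Parameters are illustrative.
random.seed(4)
a, s, N, trials = -1.0, 1.0, 2000, 800
acc = 0.0
for _ in range(trials):
    x = 1.0
    for n in range(1, N + 1):
        x += (1.0 / n) * (a * x + random.gauss(0.0, s))
    acc += x * x
sigma_emp = N * acc / trials
sigma_thy = s * s / (2 * abs(a) - 1)   # = 1.0 here
assert abs(sigma_emp - sigma_thy) < 0.25
```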
The slow convergence of Watkins' algorithm can be explained by the fact that many eigenvalues of A may be close to zero, and −(1 − γ) is always an eigenvalue of A in the case of tabular Q-learning [13, 14], so that the mean-square error can converge as slowly as n^{−2(1−γ)} when the discount factor satisfies γ > ½. It is shown in Section 2.3 that the situation can be far worse for the GQ-learning algorithm of [28]: when implemented using a tabular basis, the linearization matrix is shown to have an eigenvalue still closer to zero, implying an even slower convergence rate.
What is remarkable is that, to know whether the convergence rate is 1/√n, it is sufficient to analyze only the deterministic ODE: provided A is Hurwitz, the 1/√n convergence rate is guaranteed by using a modified gain α_n = g/n, with g > 0 chosen so that the matrix gA + ½I is Hurwitz. We can obtain much more reliable algorithms by turning to matrix-gain algorithms.
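A deterministic caricature makes the role of the eigenvalue test and the scaled gain concrete; the scalar values below are invented:

```python
# Deterministic caricature of Prop. 1.1(ii): with step size 1/n and scalar
# "linearization matrix" a, the error x_n ~ n**a decays very slowly when a is
# close to zero. A scaled gain g with g*a < -1/2 restores fast decay.
def run(a, g, n_steps):
    x = 1.0
    for n in range(1, n_steps + 1):
        x += (g / n) * a * x
    return x

a, N = -0.1, 100_000
slow = run(a, 1.0, N)   # behaves like N**a: barely decays
fast = run(a, 8.0, N)   # g*a = -0.8 < -1/2: rapid decay
assert 0.1 * N ** a < slow < 10 * N ** a
assert 0 < fast < slow
```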
The main contributions of the present paper are summarized as follows:

A significant generalization of the Zap SA algorithm of [13, 14, 12] is proposed:
Zap SA Algorithm: Initialize θ_0 ∈ ℝ^d, Â_0 ∈ ℝ^{d×d}, and ε > 0 small; update for n ≥ 0:
(5a) θ_{n+1} = θ_n − α_{n+1} [Â_{n+1}]_ε^{−1} f(θ_n, Φ_{n+1})
(5b) Â_{n+1} = Â_n + β_{n+1} (A_{n+1} − Â_n), A_{n+1} := ∂_θ f(θ_n, Φ_{n+1}),
with {α_n}, {β_n} defined in (9), and [·]_ε defined in (8). The algorithm is designed so that it approximates the ODE:
(6) d/dt ϑ_t = −[A(ϑ_t)]_ε^{−1} f̄(ϑ_t), where A(θ) := ∂_θ f̄(θ).
It is shown in Prop. 2.1 that (6) is stable and consistent under mild assumptions. In particular, if ‖f̄‖ is a coercive function on ℝ^d, then it serves as a Lyapunov function for (6).

This new class of SA algorithms is used to propose a new class of Zap-RL algorithms. Specifically, we generalize the Zap Q-learning algorithm of [14] to a nonlinear function approximation setting. Stability and convergence of this algorithm are proved under mild conditions.

We analyze the slow convergence of the GQ-learning algorithm of [28], and use motivation from Zap-SA techniques to propose a new class of Zap GQ-learning algorithms, which are stable even with nonlinear function approximation.
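As a concrete (and simplified) illustration of the Zap SA updates (5a)-(5b), the scalar sketch below replaces the matrix inverse by clamping the scalar estimate away from zero, and assumes noisy observations of both the mean field and its derivative; all numerical values are invented:

```python
import random

# Scalar sketch of the Zap SA recursion (5a)-(5b): a fast time-scale tracks the
# derivative of the mean field, and the slow time-scale applies the regularized
# Newton-Raphson gain. The problem instance is invented for illustration.
random.seed(1)
a_true, root = -0.05, 2.0                # mean field: a_true * (theta - root)

def f(theta, w):                          # noisy observation of the mean field
    return a_true * (theta - root) + w

theta, A_hat, eps = 10.0, -1.0, 0.01
for n in range(1, 200_001):
    alpha, beta = 1.0 / n, n ** -0.85     # two time scales: alpha_n/beta_n -> 0
    w = random.gauss(0.0, 1.0)
    A_n = a_true + 0.1 * random.gauss(0.0, 1.0)      # noisy derivative sample
    A_hat += beta * (A_n - A_hat)                    # (5b): tracks the derivative
    theta -= alpha * f(theta, w) / min(A_hat, -eps)  # (5a): clamped "inverse"

assert abs(theta - root) < 0.3
```

Despite the very small slope a_true (for which the plain recursion would be painfully slow), the estimated Newton-Raphson gain restores fast convergence.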
Literature review
The Newton-Raphson flow, introduced for deterministic control applications in [38, 48], is the special case of (6) obtained with ε = 0. In this special case, the ODE was studied in the context of tabular Q-learning in [13, 14]. Stability and convergence of the ODE were established using ideas similar to the proof of Prop. 2.1.
The covariance approximation in Prop. 1.1 is the basis of algorithms designed to optimize the asymptotic covariance . Matrix gain algorithms to optimize the covariance were proposed in [25, 36, 24], and alternative approaches based on two timescale SA in [37, 32, 33, 24].
There are many versions of Prop. 1.1 in the literature, such as [25, 24], with most results couched in terms of the Central Limit Theorem rather than finite-n bounds (an exception is [16]). A complete proof of the proposition for general state-space Markov chains is contained in the supplementary material. Analogous results for the nonlinear recursion (1) are obtained through linearization, subject to additional conditions on f; e.g. [16].
Significant progress has been obtained very recently on finite-n bounds for SA and RL. Finite error bounds are obtained in [39] for the linear recursion (2) with fixed step-size, and with the noise a function of a geometrically ergodic Markov chain. In [8, 11] the authors obtain concentration bounds for two time-scale SA algorithms, under a martingale difference sequence noise assumption.
However, Q-learning with function approximation has remained a challenge for many years, with counterexamples dating back to the famous paper [3] (see also [45, 40, 18]).
A significant part of the literature on RL with function approximation deals with this issue by formulating an optimization problem, with the objective being the mean-square projected Bellman error [41, 28, 10]. Classical first-order methods such as stochastic gradient descent cannot be directly applied to solve this problem, due to the double sampling issue [3, 10]. Most recent works that aim to optimize this objective take an alternative, primal-dual approach [29, 27, 47]. The GQ-learning algorithm of [28]
is of particular interest in this work. It can be interpreted as a matrix-gain algorithm in which the gain is chosen for an entirely different purpose: to ensure stability of Q-learning in a linear function approximation setting, and to ensure that the estimates converge to the minimum of a projected Bellman error loss function.¹ The algorithm is discussed in detail in Section 2.3.
¹The explicit matrix-gain representation is hidden in the algorithm because the recursions estimate matrix-vector products, rather than the matrices themselves.
2 Zap Q-Learning with Nonlinear Function Approximation
2.1 Guidelines for Algorithm Design
Consider the d-dimensional SA recursion (1) with matrix gain:
(7) θ_{n+1} = θ_n + α_{n+1} G_n f(θ_n, Φ_{n+1}).
The Markov chain Φ is assumed to be irreducible, so there is a unique invariant pmf, denoted π, which is used to define the SA vector field f̄(θ) := E_π[f(θ, Φ_n)]. The goal of SA is to find a vector θ* satisfying f̄(θ*) = 0.
As a part of algorithm design, the matrix sequence {G_n} is chosen so that the recursion approximates the ODE d/dt ϑ_t = G(ϑ_t) f̄(ϑ_t), for a matrix-valued function G.
Based on SA theory surveyed in the introduction, we arrive at two guidelines for algorithm design:
G1. The solutions to the ODE converge to the desired limit θ* from each initial condition.
G2. The matrix G*A + ½I is Hurwitz, with G* := G(θ*) and A := ∂_θ f̄(θ*).
The Zap SA algorithm introduced in this paper is designed to achieve these two goals, and in addition to nearly minimize the asymptotic covariance.
It is assumed that f is continuously differentiable in its first variable. Fix ε > 0 (assumed small), and for a d × d matrix M denote
(8) 
A two-time-scale algorithm is used in the definition of the Zap SA algorithm (5). The step-size sequences {α_n} and {β_n} are assumed to satisfy the standard requirements for two-time-scale SA algorithms [6]: α_n/β_n → 0 as n → ∞. For concreteness we fix throughout:
(9) α_n = 1/n, β_n = n^{−ρ}, with ρ ∈ (½, 1) fixed.
The approximation Â_n ≈ A(θ_n) will hold for large n under general conditions – this is the basis of two-time-scale SA theory [6], and is commonly applied in RL analysis [7, 24, 11, 21].
The proof of the following proposition is contained in Appendix B.
Proposition 2.1.
Consider the following conditions for the function f:
(a) f is globally Lipschitz continuous and continuously differentiable in its first variable; hence ∂_θ f is a bounded matrix-valued function.
(b) ‖f̄‖ is coercive. That is, the sublevel set {θ : ‖f̄(θ)‖ ≤ c} is compact for each c ≥ 0.
(c) The function f̄ has a unique zero θ*, and f̄(θ) ≠ 0 for θ ≠ θ*. Moreover, the matrix A(θ*) = ∂_θ f̄(θ*) is non-singular.
The following hold for solutions to the ODE (6) under increasingly stronger assumptions:
(i) If (a) holds, then for each t ≥ 0 and each initial condition,
(10) 
(ii) If in addition (b) holds, then the solutions to the ODE are bounded, and
(11) 
(iii) If (a)–(c) hold, then (6) is globally asymptotically stable.
Implications of Prop. 2.1 for the Zap SA Algorithm: We must first understand what is meant by the term "ODE approximation" of (6). A precise definition can be found in [6], but we recall the basic ideas here. A change of time-scale is required: denote t_n := Σ_{k=1}^n α_k for n ≥ 1. A continuous-time process ϑ is defined via ϑ_{t_n} := θ_n for each n, and by piecewise-linear interpolation to obtain a continuous function on ℝ_+. A variation on the law of large numbers for Markov chains is then used to obtain the approximation, for any T > 0,
ϑ_{t+s} = ϑ_t + ∫_t^{t+s} −[A(ϑ_r)]_ε^{−1} f̄(ϑ_r) dr + E_{t,s}, 0 ≤ s ≤ T,
where the error term satisfies, for any T,
lim_{t→∞} sup_{0≤s≤T} ‖E_{t,s}‖ = 0.
This is by definition the ODE approximation, and it is the basis of convergence theory for SA [25, 6, 4].
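The change of time-scale can be made concrete with a two-line computation: with α_n = 1/n, the ODE time t_n = Σ_{k≤n} α_k grows only logarithmically in n. A sketch (with the Euler–Mascheroni constant hard-coded as an approximation):

```python
import math

# With alpha_n = 1/n, the ODE time t_n = sum_{k<=n} 1/k is the harmonic sum,
# which grows like log(n) + 0.5772... So n iterations of (1) cover only about
# log(n) units of ODE time -- one source of the slow convergence discussed above.
def t(n):
    return sum(1.0 / k for k in range(1, n + 1))

for n in (10**3, 10**6):
    assert abs(t(n) - (math.log(n) + 0.5772)) < 0.01
```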
Based on the ODE approximation we anticipate that we can obtain the following conclusions under Assumptions (a)–(c), perhaps under slightly stronger assumptions on the function . The following results are presented as conjectures, listed in order in increasing level of difficulty for the proofs.
I. If the parameter sequence is bounded, sup_n ‖θ_n‖ < ∞ a.s., then the Zap SA algorithm (5) is consistent: θ_n → θ* a.s.
This almost follows from [6, Theorem 6.2] (the martingale noise assumption is imposed in [6] only for convenience – much of the work in SA on single timescales allows for Markovian noise, such as [1, 4] or the more recent [15, 21, 34]).
II. The sequence {θ_n} is bounded: sup_n ‖θ_n‖ is finite almost surely. This result requires only Assumption (a) of the proposition.
The Lyapunov function used in the proof of stability of the ODE satisfies the conditions of [1, Theorem 2.3] or [15, Theorem 2.1]. These results establish stability for single timescale SA algorithms. We believe they can be generalized to the two timescale setting of this section [26].
III. The covariance is nearly optimal: let θ̃_n := θ_n − θ*. Its scaled covariance satisfies lim_{n→∞} n E[θ̃_n θ̃_nᵀ] = Σ* + O(ε), with Σ_Δ defined in (3), A := ∂_θ f̄(θ*), and Σ* := A^{−1} Σ_Δ A^{−ᵀ}. The optimal "asymptotic covariance" Σ* is found in all of the aforementioned papers [25, 24, 13, 14, 12]; in particular, [25, Ch. 10, eq. 2.7(a)].
The proof of the covariance approximation requires a Taylor series approximation of the error dynamics, as discussed in the introduction. The O(ε) approximation of the optimal covariance is immediate from optimality of the limiting matrix gain. The bound can be refined through a second Taylor series expansion:
(12) 
The proof is contained in Appendix E.
The final two statements are truly conjectures — they require substantial additional effort, and stronger assumptions:
IV. Extension to parameter-dependent Markovian noise. Rather than a time-homogeneous Markov chain, the transition kernel for Φ at time n depends on the parameter θ_n. There has been significant recent work on stochastic approximation with state-dependent noise that can be applied [15, 20, 34]. The challenge is to construct algorithms to estimate the corresponding linearization matrix.
V. Finite time bounds. This is the topic of the very recent work [11, 8, 39, 43], which may provide tools to obtain bounds for Zap stochastic approximation.
In the following subsections we introduce new Qlearning algorithms motivated by this theory, and show how Prop. 2.1 can be extended to these algorithms.
2.2 Zap Q-Learning
We restrict to a discounted-reward optimal control problem, with finite state space X, finite action space U, reward function r: X × U → ℝ, and discount factor γ ∈ (0, 1). Extensions to other criteria, such as average cost or shortest path, are obtained by substituting the corresponding formulation of the Bellman error.
The joint state-action process {(X_n, U_n)} is adapted to a filtration {F_n}, so that F_n is intended to model the information available to the controller at time n. The Q-function is defined as the maximum over all possible input sequences of the total discounted reward: for each x ∈ X and u ∈ U,
(13) Q*(x, u) := max E[ Σ_{n=0}^∞ γ^n r(X_n, U_n) | X_0 = x, U_0 = u ].
Let P_u denote the state transition matrix when action u is taken. It is known that the Q-function is the unique solution to the Bellman equation [5]:
(14) Q*(x, u) = r(x, u) + γ Σ_{x′} P_u(x, x′) Q̄*(x′),
where Q̄(x) := max_u Q(x, u) for any function Q.
For any such function Q there is a corresponding stationary policy ϕ: the greedy policy induced by Q. To avoid ambiguities when the maximizer is not unique, we enumerate all stationary policies as {ϕ^{(1)}, …, ϕ^{(ℓ)}}, and specify
(15) 
The fixed-point equation (14) is the basis for Watkins' Q-learning algorithm and its extensions [49, 2, 14]. In general, the goal of a Q-learning algorithm is to best approximate the solution to (14).
Most of these algorithms are based on a Galerkin relaxation [42, 13, 51]. Consider a (possibly nonlinear) parameterized family of approximators {Q^θ : θ ∈ ℝ^d}, wherein Q^θ: X × U → ℝ for each θ. The Galerkin relaxation is then obtained by specifying a d-dimensional stochastic process {ζ_n} that is adapted to {F_n}, and setting the goal: find θ* such that
(16) 0 = f̄(θ*) := E[ ( r(X_n, U_n) + γ Q̄^{θ*}(X_{n+1}) − Q^{θ*}(X_n, U_n) ) ζ_n ],
where the expectation is with respect to the steady-state distribution of the Markov chain.
The root-finding problem (16) is an ideal candidate for stochastic approximation. The matrix-gain algorithm (7) is obtained on specifying Φ_{n+1} := (X_n, U_n, X_{n+1}), and
(17) f(θ, Φ_{n+1}) := ( r(X_n, U_n) + γ Q̄^θ(X_{n+1}) − Q^θ(X_n, U_n) ) ζ_n.
It is assumed that ζ_n = ζ(X_n, U_n), n ≥ 0, for some function ζ.
At points of differentiability, the derivative of f̄ has a simple form:
(18) A(θ) := ∂_θ f̄(θ) = E[ ζ_n ( γ ∇_θ Q^θ(X_{n+1}, ϕ^θ(X_{n+1})) − ∇_θ Q^θ(X_n, U_n) )ᵀ ],
where ϕ^θ denotes the greedy policy induced by Q^θ (defined in (15), with Q* replaced by Q^θ). The definition of A(θ) is extended to all of ℝ^d through eq. (18), in which ϕ^θ is uniquely determined using (15). Under this notation, A can be interpreted as a weak derivative of f̄ [9].
The Zap SA algorithm for Q-learning is exactly as described in (5), with f defined in (17), and A_{n+1} defined to be the term inside the expectation in (18):
(19) A_{n+1} := ζ_n ( γ ∇_θ Q^θ(X_{n+1}, ϕ^θ(X_{n+1})) − ∇_θ Q^θ(X_n, U_n) )ᵀ |_{θ=θ_n}.
These recursions are collected together in Algorithm 1. Observe that it is assumed that the input is defined using a randomized stationary policy. In future work we will consider parameter-dependent policies such as the θ-dependent greedy policy. It is assumed that the joint process (X, U) is an irreducible Markov chain, with a unique invariant pmf.
If the parameterization is linear, we have Q^θ(x, u) = θᵀψ(x, u), where each ψ_i, 1 ≤ i ≤ d, is a basis function. In tabular Q-learning [49], the basis functions are indicator functions: ψ_i(x, u) = 1{x = x^i, u = u^i}, 1 ≤ i ≤ d, where {(x^i, u^i)} is an enumeration of all state-action pairs and d = |X| · |U|. The parameterization makes large-scale MDP problems tractable, and also invites the use of prior knowledge of the structure of the value function. But stability is not guaranteed when Q^θ is nonlinear in θ, or even in a linear setting with a general set of basis functions [3, 18].
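The tabular special case can be sketched in a few lines of Python. This is Watkins' algorithm on an invented two-state, two-action deterministic MDP; per-pair polynomial step sizes (a common practical choice, not the α_n = 1/n gain analyzed in the introduction) are used so that the toy example converges quickly:

```python
import random

# Tabular Q-learning: with state-action indicator basis functions, the Galerkin
# recursion reduces to Watkins' algorithm. The MDP below is invented.
random.seed(3)
gamma = 0.8
r   = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
nxt = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}   # deterministic transitions
Q = {sa: 0.0 for sa in r}
visits = {sa: 0 for sa in r}
x = 0
for _ in range(100_000):
    u = random.choice([0, 1])                        # randomized exploration policy
    x2 = nxt[(x, u)]
    visits[(x, u)] += 1
    td = r[(x, u)] + gamma * max(Q[(x2, 0)], Q[(x2, 1)]) - Q[(x, u)]
    Q[(x, u)] += visits[(x, u)] ** -0.6 * td         # per-pair polynomial step size
    x = x2

# The Bellman equation (14) can be solved by hand for this MDP:
Q_star = {(0, 0): 7.4, (0, 1): 8.0, (1, 0): 6.4, (1, 1): 10.0}
assert all(abs(Q[sa] - Q_star[sa]) < 0.1 for sa in Q)
```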
Assumption Q1: Q^θ is continuously differentiable in θ, and Lipschitz continuous with respect to θ; f̄ defined in (16) satisfies the coercivity property: the set {θ : ‖f̄(θ)‖ ≤ c} is compact for each c ≥ 0.
The following result extends Prop. 2.1 to Zap Q-learning. The extension is non-trivial because the function f̄ for Q-learning (defined in (16)) is only piecewise smooth. The proof is contained in Appendix C.
Theorem 2.2.
Consider the functions f̄ and A defined in (16) and (18). Suppose Assumption Q1 holds. Then the differential inclusion (6) admits at least one solution from each initial condition, and for any solution,
(20) 
If in addition f̄ has a unique zero θ*, the matrix A(θ*) is non-singular, and f̄(θ) ≠ 0 for θ ≠ θ*, then the ODE (10) is globally asymptotically stable.
The main step in the proof of the theorem is to establish the ODE (10), and this rests on convexity in θ of an "inverse reward function" associated with each state.
Implications of Thm. 2.2 for Algorithm 1: Non-smoothness of f̄ presents a tougher challenge in establishing the ODE approximation, compared to the arguments made following Prop. 2.1. Since A(·) is discontinuous, and the setting is Markovian, standard tools to analyze the SA recursion for Â_n in Algorithm 1 cannot be applied. The authors of [14] make a technical assumption to deal with this issue. A discontinuous vector field is also encountered in the GQ algorithm [28]; the authors obtain the ODE approximation only under the assumption that the noise is a martingale difference sequence. Unfortunately, this assumption typically fails in function approximation settings.
It is believed that the techniques of [15] can be extended to establish the ODE approximation for the two-time-scale Zap Q-learning algorithm. We leave this to future work.
2.3 GQ-learning and Zap GQ-learning
We now take a close look at the GQ-learning algorithm of [28]. The algorithm was introduced in a linear function approximation setting, but here we consider a generalized version to fit a nonlinear function approximation setting.
GQ-learning can be interpreted as a stochastic approximation algorithm that is designed to solve a particular optimization problem. With f̄ defined in (16), and for a given positive-definite matrix M, the objective in [28] is the following²:
(21) J(θ) := ½ f̄(θ)ᵀ M^{−1} f̄(θ),
where the expectation defining f̄ is in steady state. Using (18), we have ∇J(θ) = A(θ)ᵀ M^{−1} f̄(θ), and under the assumption made in [28] that A(θ) is non-singular for each θ, the two-time-scale GQ-learning SA algorithm aims to approximate the solution to the following ODE:
(22) d/dt ϑ_t = −A(ϑ_t)ᵀ M^{−1} f̄(ϑ_t).
²In [28], ζ_n is the vector of basis functions for the linearly parameterized Q-function.
The eigenvalue test G2 fails in one special case:
Proposition 2.3.
Prop. 2.3 combined with Prop. 1.1 implies that the convergence rate of the GQ-learning algorithm can be extremely slow. The tabular case is of course uninteresting from the point of view of the motivation of this paper or [28], but the proposition serves as a warning that the eigenvalue test may fail in GQ-learning without care in choosing the basis functions.
Following the steps used for Q-learning, we obtain:
Zap GQ-learning: Initialize θ_0, Â_0, the step-sizes {α_n}, {β_n} using (9), ε > 0 small, and M positive definite; update for n ≥ 0:
(23a)  
(23b)  
(23c) 
The matrix M in (23) can either be as defined in (21), in which case the expectation is approximated using Monte Carlo, or it can be any other positive-definite matrix. It is interesting to note that if M = I, the recursion (23) is the same as the Zap Q-learning algorithm in Alg. 1.
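A scalar back-of-the-envelope computation illustrates why the eigenvalue test can fail badly for GQ-learning: the linearization of the GQ ODE (22) at the root is −AᵀM⁻¹A, so an eigenvalue of A that is already close to zero is effectively squared. The values below are invented:

```python
# Scalar illustration: GQ-learning replaces the linearization a by -a^2/m
# (scalars here), so a small-magnitude eigenvalue becomes even smaller,
# and the 1/2-eigenvalue test G2 fails even more severely.
a, m = -0.2, 1.0              # linearization of the mean field; M = m > 0
gq_eig = -(a * a) / m         # linearization of the GQ ODE (22): -a^T M^{-1} a
assert abs(gq_eig) < abs(a) < 0.5   # both fail G2; the GQ eigenvalue is worse
```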
References
 [1] C. Andrieu, E. Moulines, and P. Priouret. Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization, 44(1):283–312, 2005.
 [2] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.
 [3] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In A. Prieditis and S. Russell, editors, Machine Learning Proceedings 1995, pages 30 – 37. Morgan Kaufmann, San Francisco (CA), 1995.
 [4] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations. Springer, 2012.
 [5] D. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Cambridge, Mass, 1996.
 [6] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.
 [7] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. (also presented at the IEEE CDC, December, 1998).
 [8] V. S. Borkar and S. Pattathil. Concentration bounds for two time scale stochastic approximation. In Allerton Conference on Communication, Control, and Computing, pages 504–511, Oct 2018.

 [9] A. Bressan. Lecture Notes on Functional Analysis: With Applications to Linear Partial Differential Equations, volume 143. American Mathematical Soc., 2013.
 [10] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285, 2017.

 [11] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Concentration bounds for two time-scale stochastic approximation with applications to reinforcement learning. Proceedings of the Conference on Computational Learning Theory, and arXiv e-prints, pages 1–35, 2017.
 [12] A. M. Devraj, A. Bušić, and S. Meyn. Zap Q-Learning – a user's guide. In Proc. of the Fifth Indian Control Conference, January 9–11, 2019.
 [13] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. arXiv e-prints, July 2017.
 [14] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
 [15] G. Fort, E. Moulines, A. Schreck, and M. Vihola. Convergence of Markovian stochastic approximation with discontinuous dynamics. SIAM Journal on Control and Optimization, 54(2):866–893, 2016.

 [16] L. Gerencser. Convergence rate of moments in stochastic approximation with simultaneous perturbation gradient approximation and resetting. IEEE Transactions on Automatic Control, 44(5):894–905, May 1999.
 [17] P. W. Glynn and S. P. Meyn. A Liapounov bound for solutions of the Poisson equation. Ann. Probab., 24(2):916–931, 1996.
 [18] G. J. Gordon. Reinforcement learning with function approximation converges to a region. In Proc. of the 13th International Conference on Neural Information Processing Systems, pages 996–1002, Cambridge, MA, USA, 2000. MIT Press.
 [19] T. Kailath. Linear systems, volume 156. Prentice-Hall, Englewood Cliffs, NJ, 1980.
 [20] P. Karmakar and S. Bhatnagar. Dynamics of stochastic approximation with iterate-dependent Markov noise under verifiable conditions in compact state space with the stability of iterates not ensured. arXiv e-prints, page arXiv:1601.02217, Jan 2016.
 [21] P. Karmakar and S. Bhatnagar. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Math. Oper. Res., 43(1):130–151, 2018.
 [22] H. K. Khalil. Nonlinear systems. Prentice-Hall, Upper Saddle River, NJ, 3rd edition, 2002.
 [23] V. R. Konda and V. S. Borkar. Actor-critic-type learning algorithms for Markov decision processes. SIAM J. Control Optim., 38(1):94–123 (electronic), 1999.
 [24] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
 [25] H. J. Kushner and G. G. Yin. Stochastic approximation algorithms and applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, 1997.
 [26] C. Lakshminarayanan and S. Bhatnagar. A stability criterion for two time-scale stochastic approximation schemes. Automatica, 79:108–114, 2017.

 [27] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms. In UAI'15 Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. Citeseer, 2015.
 [28] H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 719–726, USA, 2010. Omnipress.
 [29] S. Mahadevan, B. Liu, P. Thomas, W. Dabney, S. Giguere, N. Jacek, I. Gemp, and J. Liu. Proximal reinforcement learning: A new theory of sequential decision making in primaldual spaces. arXiv preprint arXiv:1405.6757, 2014.
 [30] M. Metivier and P. Priouret. Applications of a Kushner and Clark lemma to general classes of stochastic algorithms. IEEE Transactions on Information Theory, 30(2):140–151, March 1984.
 [31] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. 1993 edition online.
 [32] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
 [33] B. T. Polyak. Introduction to Optimization. Optimization Software Inc, New York, 1987.
 [34] A. Ramaswamy and S. Bhatnagar. Stability of stochastic approximations with ‘controlled Markov’ noise and temporal difference learning. IEEE Transactions on Automatic Control, pages 1–1, 2018.
 [35] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
 [36] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985.
 [37] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro process. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988.
 [38] S. Shivam, I. Buckley, Y. Wardi, C. Seatzu, and M. Egerstedt. Tracking control by the Newton-Raphson flow: Applications to autonomous vehicles. CoRR, abs/1811.08033, 2018.
 [39] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. CoRR, abs/1902.00923, 2019.
 [40] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Proceedings of the 8th International Conference on Neural Information Processing Systems, NIPS'95, pages 1038–1044, Cambridge, MA, USA, 1995. MIT Press.
 [41] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.
 [42] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
 [43] G. Thoppe and V. Borkar. A concentration bound for stochastic approximation via Alekseev’s formula. Stochastic Systems, 9(1):1–26, 2019.
 [44] J. Tsitsiklis. Asynchronous stochastic approximation and learning. Machine Learning, 16:185–202, 1994.
 [45] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996.
 [46] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
 [47] S. Valcarcel Macua, P. Belanovic, and S. Zazo. Diffusion gradient temporal difference for cooperative reinforcement learning with linear function approximation. In Proc. International Workshop on Cognitive Information Processing, 2012.
 [48] Y. Wardi, C. Seatzu, M. Egerstedt, and I. Buckley. Performance regulation and tracking via lookahead simulation: Preliminary results and validation. In 56th IEEE Conference on Decision and Control, pages 6462–6468, 2017.
 [49] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989.
 [50] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 [51] H. Yu and D. P. Bertsekas. Error bounds for approximations from projected linear equations. Mathematics of Operations Research, 35(2):306–329, 2010.
Appendix A Proof for Prop. 1.1
We prove the proposition for a general state-space Markov chain rather than a finite state space. Recall that we are considering the linear recursion (2), presented again here for convenience:
(24) θ_{n+1} = θ_n + α_{n+1} [A θ̃_n + Δ_{n+1}],
where θ̃_n := θ_n − θ* and Δ_{n+1} := W(Φ_{n+1}) for a function W on the state space. The matrix A is Hurwitz, and the following are assumed throughout:
Assumptions:
Φ is uniformly ergodic on a locally compact and metrizable state space (the conditions of [31]), with unique invariant measure denoted π, and W ∈ L_∞^V with π(W) = 0.
The reader is referred to [31] for definitions, except for a few clarifications and consequences: to say that a function g lies in L_∞^V means that g is measurable, and that the norm
‖g‖_V := sup_x |g(x)|/V(x)
is finite. It is assumed throughout [31] that V ≥ 1. The uniform ergodic theorem (Theorem 16.0.1 of [31]) gives the following conclusions. Part (iii) is a simple consequence of Jensen's inequality and the drift criterion that characterizes uniform ergodicity in [31, Theorem 16.0.1].
Theorem A.1.
The following hold for a uniformly ergodic Markov chain Φ:
(i) There exist δ < 1 and B_V < ∞ such that for each g ∈ L_∞^V, each initial condition x, and each n ≥ 0,
(25) |E[g(Φ_n) | Φ_0 = x] − π(g)| ≤ B_V ‖g‖_V δ^n V(x), where π(g) := ∫ g dπ.
(ii) Consider the function ĝ defined by
(26) ĝ(x) := Σ_{n=0}^∞ ( E[g(Φ_n) | Φ_0 = x] − π(g) ).
This solves Poisson's equation:
(27) E[ĝ(Φ_{n+1}) | Φ_n = x] = ĝ(x) − g(x) + π(g).
(iii) The Markov chain is also uniformly ergodic with Lyapunov function V^{1/p} for any p ≥ 1. In particular, if g ∈ L_∞^{V^{1/p}}, then ĝ ∈ L_∞^{V^{1/p}}.
The proof of Prop. 1.1 is composed of the following steps. The sequence {θ̃_n} can be expressed as the sum of three terms,
each of which is a linear SA recursion (described in (31)) differentiated by initial condition and "noise" input: the first has a martingale difference input, the second zero input (driven only by the initial condition), and the input for the third is a telescoping sequence based on a solution to Poisson's equation. Lemma A.2 shows how the telescoping input is converted into a zero-mean input defined by a solution to Poisson's equation.
A.1 Noise statistics and Poisson's equation
Under the assumptions of this section, the sequence {Δ_n} appearing in (24) is zero mean for the stationary version of the Markov chain Φ. This is because π(W) = 0. Its asymptotic covariance (appearing in the Central Limit Theorem) is denoted
(28) Σ_Δ := Σ_{n=−∞}^{∞} E[Δ_n Δ_0ᵀ],
where the expectations are in steady state.
A more useful representation of Σ_Δ is obtained through a decomposition of the noise sequence based on Poisson's equation. This now-standard technique was introduced in the SA literature in the 1980s [30]. Two Poisson-equation solutions are used in the analysis that follows:
(29) 
It is assumed for convenience that the solutions are normalized so that they have zero steady-state mean. The existence of zero-mean solutions follows from (26), and the fact that ĝ + c also solves (27) for any constant c. Bounds on solutions can be obtained under slightly weaker assumptions: see the main result of [17], and also [31, Theorem 17.4.2]. The bounds follow from Thm. A.1 (iii).
We then have the following representation, for n ≥ 1,
in which one term is a martingale difference sequence and the remainder telescopes. Each of the sequences is bounded, and the asymptotic covariance is expressed as
(30) 
where the expectation is taken in steady state. The equivalence of (30) and (28) appears in [31, Theorem 17.5.3] for the case in which Δ is scalar-valued; the generalization to vector-valued processes involves only notational changes.
A.2 Decomposition of the parameter sequence
The solution of the linear recursion (24) can be decomposed into a sum of three terms, each evolving as a stochastic approximation sequence with its own noise input and initial condition:
(31a)  
(31b)  
(31c) 
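The decomposition can be verified numerically: the recursion (24) is affine-linear in the pair (initial condition, noise sequence), so splitting the noise input into two parts and the initial condition into its own term yields three sub-recursions whose sum is exactly the original sequence. The scalar sketch below uses invented inputs in place of the martingale-difference and telescoping terms:

```python
import random

# Linearity check for the decomposition in (31): splitting the noise input of
# the scalar linear recursion x_{n+1} = x_n + (1/n)(A x_n + d_n) into two parts
# and the initial condition into its own term reproduces the full solution.
random.seed(2)
A, x0, N = -1.0, 3.0, 1000
delta = [random.gauss(0, 1) for _ in range(N)]
d1 = [0.7 * d for d in delta]          # stand-in for the martingale-difference part
d2 = [0.3 * d for d in delta]          # stand-in for the telescoping part

def sa(x, noise):
    out = x
    for n, d in enumerate(noise, start=1):
        out += (1.0 / n) * (A * out + d)
    return out

total = sa(x0, delta)
parts = sa(0.0, d1) + sa(x0, [0.0] * N) + sa(0.0, d2)
assert abs(total - parts) < 1e-9
```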
The second recursion admits a more tractable realization through a change of variables.
Lemma A.2.
The sequence evolves as the SA recursion
(32) 
A.3 Scaled parameter sequence
For any consider the scaled error sequence . To obtain a recursion for this sequence, consider the Taylor series expansion:
where the second equation uses