# Zap Q-Learning With Nonlinear Function Approximation

The Zap stochastic approximation (SA) algorithm was introduced recently as a means to accelerate convergence in reinforcement learning algorithms. While numerical results were impressive, stability (in the sense of boundedness of parameter estimates) was established in only a few special cases. This class of algorithms is generalized in this paper, and stability is established under very general conditions. This general result can be applied to a wide range of algorithms found in reinforcement learning. Two classes are considered in this paper: (i) The natural generalization of Watkins' algorithm is not always stable in function approximation settings. Parameter estimates may diverge to infinity even in the linear function approximation setting with a simple finite state-action MDP. Under mild conditions, the Zap SA algorithm provides a stable algorithm, even in the case of nonlinear function approximation. (ii) The GQ algorithm of Maei et al. (2010) is designed to address the stability challenge. Analysis is provided to explain why the algorithm may be very slow to converge in practice. The new Zap GQ algorithm is stable even for nonlinear function approximation.

## 1 Introduction

The theory of Q-learning with function approximation has not caught up with the famous success stories in applications. Counterexamples appeared in the 1990s, shortly following the seminal work of Watkins and Dayan, which established consistency of the Q-learning algorithm in the tabular setting [50]. These examples demonstrate failure of the natural generalization of Watkins' algorithm, even in very simple settings such as linear function approximation in a simple finite state-action Markov decision process (MDP) [3, 28]. Even when convergence holds, it is found in practice that convergence of Q-learning can be extremely slow.

This paper focuses on algorithm design to ensure stability of the algorithm, consistency, and techniques to obtain at least qualitative insight on the rate of convergence. The framework for algorithm design is the theory of stochastic approximation (SA), and the ODE approximation that is central to that theory. There is a long history of application of SA tools in the analysis of reinforcement learning (RL) algorithms [44, 46, 23, 7, 28].

An explanation for the slow convergence of Watkins' Q-learning is given in [13, 14]. The starting point is the recognition that this RL algorithm can be represented as a $d$-dimensional SA recursion

$$\theta_{n+1} = \theta_n + \alpha_{n+1} f(\theta_n, \Phi_{n+1}) \tag{1}$$

in which $\Phi = \{\Phi_n\}$ is a Markov chain on a finite state space $Z$, $\{\alpha_n\}$ is a non-negative gain sequence, and $f : \mathbb{R}^d \times Z \to \mathbb{R}^d$. In the tabular Q-learning algorithm of Watkins, the dimension $d$ is equal to the number of state-action pairs. The definition of $\Phi$ for this case is given in Section 2. We assume throughout that this Markov chain has a unique invariant probability mass function (pmf).

It is known that the evolution of (1) can be approximated by the solution to the ODE $\frac{d}{dt}\vartheta(t) = \bar f(\vartheta(t))$, in which $\bar f$ denotes the SA vector field $\bar f(\theta) = \mathsf{E}[f(\theta, \Phi_n)]$, where the expectation is in steady state. This is the essence of the SA algorithms that have been refined over nearly 70 years since the original work of Robbins and Monro [35]. See [6] for an accessible treatment. If the ODE is stable in a suitably strong sense, in particular if solutions converge to some limit $\theta^*$ from each initial condition, then the same is true for (1): $\lim_{n\to\infty} \theta_n = \theta^*$ with probability one.
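The recursion (1) and its ODE limit can be illustrated with a minimal sketch. The toy model below is an assumption made for illustration only: the function `run_sa`, the scalar choice $f(\theta, z) = -(\theta - 1) + z$, and the use of i.i.d. Gaussian noise in place of a Markov chain are not taken from the paper. For this choice $\bar f(\theta) = -(\theta - 1)$, so the ODE converges to $\theta^* = 1$.

```python
import numpy as np

def run_sa(f, theta0, noise, n_steps):
    """Iterate theta_{n+1} = theta_n + alpha_{n+1} * f(theta_n, Phi_{n+1}) with alpha_n = 1/n."""
    theta = float(theta0)
    for n in range(n_steps):
        alpha = 1.0 / (n + 1)          # gain alpha_{n+1} = 1/(n+1)
        theta = theta + alpha * f(theta, noise[n])
    return theta

# Hypothetical toy model: f(theta, z) = -(theta - 1) + z, so f_bar(theta) = -(theta - 1).
rng = np.random.default_rng(0)
noise = rng.standard_normal(20000)     # i.i.d. noise stands in for the Markov chain Phi
theta_final = run_sa(lambda th, z: -(th - 1.0) + z, theta0=5.0, noise=noise, n_steps=20000)
```

With the gain $1/n$ this particular recursion reduces to a running average of $1 + z_k$, which is why the estimate forgets the initial condition after the first step.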

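SA theory also provides tools to understand the rate of convergence. This is based on an approximation of (1) via the linear recursion,

$$E_{n+1} = E_n + \alpha_{n+1}[A_* E_n + \Delta_{n+1}], \qquad E_0 = \theta_0 - \theta^* \tag{2}$$

where $A_* = \partial_\theta \bar f(\theta^*)$ is called the linearization matrix, and $\Delta_{n+1} = f(\theta^*, \Phi_{n+1})$. This is intended to approximate the error dynamics: $E_n \approx \tilde\theta_n := \theta_n - \theta^*$; see the standard textbooks [25, 6, 4].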

The sequence $\{\Delta_n\}$ is zero mean for the stationary version of the Markov chain $\Phi$. Its asymptotic covariance (appearing in the Central Limit Theorem) is denoted

$$\Sigma_\Delta = \sum_{k=-\infty}^{\infty} \mathsf{E}[\Delta_k \Delta_0^\top] \tag{3}$$

where the expectation is in steady state. For a fixed but arbitrary initial condition $(\Phi_0, E_0)$ for (2), denote $\Sigma_n = \mathsf{E}[E_n E_n^\top]$. We obtain the following remarkable conclusions; the proof is in Appendix A.

###### Proposition 1.1.

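Suppose that the matrix $A_*$ is Hurwitz: $\text{Re}(\lambda) < 0$ for every eigenvalue $\lambda$ of $A_*$, and $\alpha_n = 1/n$ for $n \ge 1$. Then, for the linear recursion (2),

• If $\text{Re}(\lambda) < -\tfrac{1}{2}$ for every eigenvalue $\lambda$ of $A_*$, then the rate of convergence is $1/n$:

$$\Sigma_n = n^{-1}\Sigma_\infty + O(n^{-1-\delta})$$

where $\delta > 0$ and $\Sigma_\infty \ge 0$ is the solution to the Lyapunov equation

$$[\tfrac{1}{2} I + A_*]\Sigma_\infty + \Sigma_\infty[\tfrac{1}{2} I + A_*]^\top + \Sigma_\Delta = 0 \tag{4}$$

• Suppose there is an eigenvalue $\lambda$ satisfying $\text{Re}(\lambda) > -\tfrac{1}{2}$, let $v \neq 0$ denote a corresponding left eigenvector, and suppose that $\Sigma_\Delta v \neq 0$. Then, with $\varrho_0 = -2\,\text{Re}(\lambda)$,

$$\lim_{n\to\infty} n^{\varrho}\, \mathsf{E}[(v^\top E_n)^2] = 0 \ \text{ for } \varrho < \varrho_0, \qquad \lim_{n\to\infty} n^{\varrho}\, \mathsf{E}[(v^\top E_n)^2] = \infty \ \text{ for } \varrho > \varrho_0$$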
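The eigenvalue test and the asymptotic covariance in the proposition are easy to evaluate numerically. The sketch below is illustrative and rests on an assumption: it takes the Lyapunov equation in its standard form for gain $1/n$, namely $[\tfrac12 I + A]\Sigma + \Sigma[\tfrac12 I + A]^\top + \Sigma_\Delta = 0$. The function names and the example matrices are hypothetical.

```python
import numpy as np

def rate_is_one_over_n(A):
    """Eigenvalue test: True when Re(lambda) < -1/2 for every eigenvalue of A."""
    return bool(np.max(np.linalg.eigvals(A).real) < -0.5)

def asymptotic_covariance(A, Sigma_Delta):
    """Solve [I/2 + A] S + S [I/2 + A]^T + Sigma_Delta = 0 by vectorization."""
    d = A.shape[0]
    F = 0.5 * np.eye(d) + A
    # With column-major vec: vec(F S) = kron(I, F) vec(S), vec(S F^T) = kron(F, I) vec(S)
    K = np.kron(np.eye(d), F) + np.kron(F, np.eye(d))
    s = np.linalg.solve(K, -Sigma_Delta.reshape(-1, order="F"))
    return s.reshape(d, d, order="F")

# Hypothetical example: diagonal linearization matrix and identity noise covariance
A_example = np.diag([-1.0, -2.0])
Sigma_inf = asymptotic_covariance(A_example, np.eye(2))
slow_case = rate_is_one_over_n(np.diag([-0.1, -2.0]))  # eigenvalue -0.1 fails the test
```

For the diagonal example each entry solves $2(\tfrac12 + a_i)\sigma_i + 1 = 0$, so the result is $\text{diag}(1, 1/3)$.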

The slow convergence for Watkins' algorithm can be explained by the fact that many eigenvalues of $A_*$ may be close to zero, and $-(1-\beta)$ is always an eigenvalue in the case of tabular Q-learning [13, 14], so that the convergence rate can be as slow as $n^{-2(1-\beta)}$. It is shown in Section 2.3 that the situation can be far worse for the GQ-learning algorithm of [28]: when implemented using a tabular basis, the linearization matrix has an eigenvalue that is at least $-(1-\beta)^2$, implying a convergence rate as slow as $n^{-2(1-\beta)^2}$.

What is remarkable is that to know if the convergence rate is $1/n$, it is sufficient to analyze only the deterministic ODE: provided $A_*$ is Hurwitz, the convergence rate $1/n$ is guaranteed by using a modified gain $\alpha_n = g/n$, with $g > 0$ chosen so that the matrix $\tfrac{1}{2} I + g A_*$ is Hurwitz. We can obtain much more reliable algorithms by turning to matrix gain algorithms.

The main contributions of the present paper are summarized as follows:

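• A significant generalization of the Zap SA algorithm of [13, 14, 12] is proposed:
Zap SA Algorithm: Initialize $\theta_0 \in \mathbb{R}^d$, $\hat A_0 \in \mathbb{R}^{d\times d}$, $\varepsilon > 0$ small; update for $n \ge 0$:

$$\hat A_{n+1} = \hat A_n + \gamma_{n+1}[A_{n+1}(\theta_n) - \hat A_n], \qquad A_{n+1}(\theta) := \partial_\theta f(\theta, \Phi_{n+1}) \tag{5a}$$

$$\theta_{n+1} = \theta_n + \alpha_{n+1} G_n f(\theta_n, \Phi_{n+1}), \qquad G_n := -[\varepsilon I + \hat A_{n+1}^\top \hat A_{n+1}]^{-1} \hat A_{n+1}^\top \tag{5b}$$

with $\{\alpha_n\}$, $\{\gamma_n\}$ defined in (9). The algorithm is designed so that it approximates the ODE:

$$\frac{d}{dt}\vartheta(t) = -[\varepsilon I + A(\vartheta(t))^\top A(\vartheta(t))]^{-1} A(\vartheta(t))^\top \bar f(\vartheta(t)), \qquad A(\theta) = \partial_\theta \bar f(\theta) \tag{6}$$

It is shown in Prop. 2.1 that (6) is stable and consistent under mild assumptions. In particular, if $\|\bar f\|^2$ is a coercive function on $\mathbb{R}^d$, then it serves as a Lyapunov function for (6).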

• This new class of SA algorithms is used to propose a new class of Zap-RL algorithms. Specifically, we generalize the Zap Q-learning of [14] to a nonlinear function approximation setting. Stability and convergence of this algorithm are proved under mild conditions.

• We analyze the slow convergence of GQ-learning of [28] and use motivation from Zap-SA techniques to propose a new class of Zap GQ-learning algorithms, which is stable even for nonlinear function approximation.
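The two time-scale Zap SA recursion (5a)-(5b) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `zap_sa`, the linear toy model $f(\theta, z) = A_0\theta - b + z$, the i.i.d. noise, and the values of `eps` and `rho` are all hypothetical choices. The matrix $A_0$ below has positive eigenvalues, so it is not Hurwitz, yet the matrix gain still steers the estimates to $\theta^*$.

```python
import numpy as np

def zap_sa(f, jac, theta0, phis, eps=1e-6, rho=0.85):
    """Zap SA sketch: matrix estimate on the fast time-scale, parameter on the slow one."""
    theta = np.array(theta0, dtype=float)
    d = theta.size
    A_hat = np.zeros((d, d))
    for n, phi in enumerate(phis):
        alpha = 1.0 / (n + 1)              # slow gain, alpha_n = 1/n
        gamma = 1.0 / (n + 1) ** rho       # fast gain, gamma_n = 1/n^rho
        A_hat = A_hat + gamma * (jac(theta, phi) - A_hat)                  # (5a)
        G = -np.linalg.solve(eps * np.eye(d) + A_hat.T @ A_hat, A_hat.T)   # gain in (5b)
        theta = theta + alpha * (G @ f(theta, phi))                        # (5b)
    return theta, A_hat

# Hypothetical toy model with a non-Hurwitz Jacobian A0 (eigenvalues 2 and 1)
A0 = np.array([[2.0, 1.0], [0.0, 1.0]])
theta_star = np.array([1.0, -1.0])
b = A0 @ theta_star
rng = np.random.default_rng(1)
phis = 0.1 * rng.standard_normal((5000, 2))   # i.i.d. noise stands in for Phi
theta_n, A_hat_n = zap_sa(lambda th, z: A0 @ th - b + z,
                          lambda th, z: A0, [4.0, 3.0], phis)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical design choice; the regularization `eps` keeps the linear system well posed even when the matrix estimate is singular.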

#### Literature review

The Newton-Raphson flow, introduced for deterministic control applications in [38, 48], is the special case of (6) obtained with $\varepsilon = 0$. In this special case, this ODE was studied in the context of tabular Q-learning in [13, 14]. Stability and convergence of the ODE were established using ideas similar to the proof of Prop. 2.1.

The covariance approximation in Prop. 1.1 is the basis of algorithms designed to optimize the asymptotic covariance $\Sigma_\infty$. Matrix gain algorithms to optimize the covariance were proposed in [25, 36, 24], and alternative approaches based on two time-scale SA in [37, 32, 33, 24].

There are many versions of Prop. 1.1 in the literature, such as [25, 24], with most results couched in terms of the Central Limit Theorem rather than finite bounds (an exception is [16]). A complete proof of the proposition for general state space Markov chains is contained in the supplementary material. Analogous results for the nonlinear recursion (1) are obtained through linearization (subject to additional conditions; e.g. [16]).

Significant progress has been made very recently on finite-$n$ bounds for SA and RL. Finite-$n$ error bounds are obtained in [39] for the linear recursion (2) with fixed step-size, and with the matrix also a function of a geometrically ergodic Markov chain. In [8, 11] the authors obtain concentration bounds for two time-scale SA algorithms, under a martingale difference sequence noise assumption.

However, Q-learning with function approximation has remained a challenge for several years, with counterexamples dating back to the famous paper of [3] (also see [45, 40, 18]).

A significant part of the literature on RL with function approximation deals with this issue by formulating an optimization problem, with the objective being the mean square projected Bellman error [41, 28, 10]. Classical first-order methods such as stochastic gradient descent cannot be directly applied to solve this problem due to the double sampling issue [3, 10]. Most recent works that aim to optimize this objective take an alternative, primal-dual approach [29, 27, 47].

The GQ-learning algorithm of [28] is of particular interest in this work. It can be interpreted as a matrix-gain algorithm in which the gain is chosen for an entirely different purpose: to ensure stability of Q-learning in a linear function approximation setting, and to ensure that the estimates converge to the minimum of a projected Bellman error loss function. (Footnote 1: The explicit matrix gain representation is hidden in the algorithm because the recursions tend to estimate matrix-vector products, rather than the matrices themselves.) The algorithm is discussed in detail in Section 2.3.

## 2 Zap Q-Learning with Nonlinear Function Approximation

### 2.1 Guidelines for Algorithm Design

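Consider the $d$-dimensional SA recursion (1) with matrix gain:

$$\theta_{n+1} = \theta_n + \alpha_{n+1} G_n f(\theta_n, \Phi_{n+1}) \tag{7}$$

The Markov chain is assumed to be irreducible, so there is a unique invariant pmf denoted $\varpi$, which is used to define the SA vector field $\bar f(\theta) = \sum_{z} \varpi(z) f(\theta, z)$, $\theta \in \mathbb{R}^d$. The goal of SA is to find a vector $\theta^*$ satisfying $\bar f(\theta^*) = 0$.

As a part of algorithm design, the matrix sequence $\{G_n\}$ is chosen so that (7) approximates the ODE $\frac{d}{dt}\vartheta(t) = G(\vartheta(t))\bar f(\vartheta(t))$, for a function $G : \mathbb{R}^d \to \mathbb{R}^{d\times d}$.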

Based on SA theory surveyed in the introduction, we arrive at two guidelines for algorithm design:

G1. The solutions to the ODE converge to the desired limit $\theta^*$.

G2. The matrix $\tfrac{1}{2} I + \partial_\theta [G \bar f](\theta^*)$ is Hurwitz.

The Zap SA algorithm introduced in this paper is designed to achieve these two goals, and in addition achieve .

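It is assumed that $f$ is $C^1$ in its first variable. Fix $\varepsilon > 0$ (assumed small), and for $\theta \in \mathbb{R}^d$ denote

$$A(\theta) = \sum_{z} \varpi(z)\, \partial_\theta f(\theta, z), \qquad G(\theta) = -[\varepsilon I + A(\theta)^\top A(\theta)]^{-1} A(\theta)^\top \tag{8}$$

A two time-scale algorithm is used in the definition of the Zap SA algorithm (5). The step-size sequences $\{\alpha_n\}$ and $\{\gamma_n\}$ are assumed to satisfy standard requirements for two time-scale SA algorithms [6], in particular $\alpha_n/\gamma_n \to 0$ as $n \to \infty$. For concreteness we fix throughout:

$$\alpha_n = 1/n, \qquad \gamma_n = 1/n^{\rho}, \qquad n \ge 1, \quad \text{with } \rho \in (0.5, 1) \tag{9}$$

The approximations $\hat A_{n+1} \approx A(\theta_n)$ and $G_n \approx G(\theta_n)$ will hold for large $n$ under general conditions. This is the basis of two time-scale SA theory [6], and is commonly applied in RL analysis [7, 24, 11, 21].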

The proof of the following proposition is contained in Appendix B.

###### Proposition 2.1.

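Consider the following conditions for the function $f$:

• (a) $f$ is globally Lipschitz continuous and continuously differentiable in its first variable. Hence $A$ is a bounded matrix-valued function.

• (b) $\|\bar f\|$ is coercive. That is, $\{\theta : \|\bar f(\theta)\| \le n\}$ is compact for each $n$.

• (c) The function $\bar f$ has a unique zero $\theta^*$, and $A(\theta)^\top \bar f(\theta) \neq 0$ for $\theta \neq \theta^*$. Moreover, the matrix $A_* = A(\theta^*)$ is non-singular.

The following hold for solutions to the ODE (6) under increasingly stronger assumptions:

• If (a) holds then for each $t \ge 0$, and each initial condition $\vartheta(0)$,

$$\frac{d}{dt}\bar f(\vartheta(t)) = -A(\vartheta(t))[\varepsilon I + A(\vartheta(t))^\top A(\vartheta(t))]^{-1} A(\vartheta(t))^\top \bar f(\vartheta(t)) \tag{10}$$

• If in addition (b) holds, then the solutions to the ODE are bounded, and

$$\lim_{t\to\infty} A(\vartheta(t))^\top \bar f(\vartheta(t)) = 0 \tag{11}$$

• If (a)–(c) hold, then (6) is globally asymptotically stable.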
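The Lyapunov property behind the proposition can be observed numerically by integrating the ODE (6) with a forward Euler scheme and tracking $\|\bar f(\vartheta(t))\|$. The sketch below is an illustration under assumptions: the vector field `f_bar`, its Jacobian `A`, the step `h`, and the Euler discretization are hypothetical choices. The toy field is globally Lipschitz, coercive, and has the unique zero $\theta^* = (1, -1)$.

```python
import numpy as np

theta_star = np.array([1.0, -1.0])

def B(theta):
    return np.array([2 * theta[0] + np.sin(theta[1]), 2 * theta[1] + np.sin(theta[0])])

def f_bar(theta):
    """Hypothetical smooth vector field with unique zero at theta_star."""
    return B(theta) - B(theta_star)

def A(theta):
    """Jacobian of f_bar."""
    return np.array([[2.0, np.cos(theta[1])], [np.cos(theta[0]), 2.0]])

def zap_ode_euler(theta0, eps=1e-3, h=0.01, n_steps=2000):
    """Forward Euler for d/dt theta = -[eps I + A^T A]^{-1} A^T f_bar(theta)."""
    theta = np.array(theta0, dtype=float)
    norms = [np.linalg.norm(f_bar(theta))]
    for _ in range(n_steps):
        A_t = A(theta)
        G = -np.linalg.solve(eps * np.eye(2) + A_t.T @ A_t, A_t.T)
        theta = theta + h * (G @ f_bar(theta))
        norms.append(np.linalg.norm(f_bar(theta)))
    return theta, np.array(norms)

theta_T, norms = zap_ode_euler([5.0, -4.0])
```

For small `eps` the dynamics are close to the Newton-Raphson flow, for which $\bar f(\vartheta(t))$ decays roughly like $e^{-t}$ along every trajectory.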

Implications of Prop. 2.1 to the Zap SA Algorithm: We must first understand what is meant by the term "ODE approximation" (6). A precise definition can be found in [6], but we recall the basic ideas here. A change of time-scale is required: denote $t_n = \sum_{k=1}^{n} \alpha_k$ for $n \ge 1$. A continuous-time process $\vartheta^\bullet$ is defined via $\vartheta^\bullet(t_n) = \theta_n$ for each $n$, and by piecewise linear interpolation to obtain a continuous function on $[0, \infty)$. A variation on the law of large numbers for Markov chains is then used to obtain the approximation, for any $T_0 \ge 0$ and $t \ge 0$,

$$\vartheta^\bullet(T_0 + t) = \vartheta^\bullet(T_0) + \int_{T_0}^{T_0+t} G(\vartheta^\bullet(r))\, \bar f(\vartheta^\bullet(r))\, dr + \mathcal{E}(T_0, t)$$

where the error term satisfies for any $T > 0$,

$$\sup_{0 \le t \le T} \|\mathcal{E}(T_0, t)\| = o(\|\vartheta^\bullet(T_0)\|)$$

This is by definition the ODE approximation, and is the basis of convergence theory for SA [25, 6, 4].

Based on the ODE approximation we anticipate that we can obtain the following conclusions under Assumptions (a)–(c), perhaps under slightly stronger assumptions on the function $f$. The following results are presented as conjectures, listed in order of increasing difficulty of proof.

I. If $\sup_n \|\theta_n\| < \infty$ a.s., then the Zap SA algorithm (5) is consistent: $\lim_{n\to\infty} \theta_n = \theta^*$ a.s.

This almost follows from [6, Theorem 6.2] (the martingale noise assumption is imposed in [6] only for convenience; much of the work in SA on single time-scales allows for Markovian noise, such as [1, 4] or the more recent [15, 21, 34]).

II. The sequence $\{\theta_n\}$ is bounded. That is, $\sup_n \|\theta_n\|$ is finite a.s., and this result only requires Assumption (a) of the proposition.

The Lyapunov function used in the proof of stability of the ODE satisfies the conditions of [1, Theorem 2.3] or [15, Theorem 2.1]. These results establish stability for single time-scale SA algorithms. We believe they can be generalized to the two time-scale setting of this section [26].

III. The covariance is almost optimal: Let $\tilde\theta_n = \theta_n - \theta^*$. Its covariance satisfies

$$\text{Cov}(\tilde\theta_n) = n^{-1}\Sigma_\varepsilon + o(n^{-1}), \quad \text{with } \Sigma_\varepsilon = \Sigma^* + O(\varepsilon^2), \quad \Sigma^* = A_*^{-1} \Sigma_\Delta (A_*^\top)^{-1}$$

with $\Sigma_\Delta$ defined in (3). The optimal asymptotic covariance $\Sigma^*$ is found in all of the aforementioned papers [25, 24, 13, 14, 12]; in particular, [25, Ch. 10, eq. 2.7(a)].

The proof of the covariance approximation requires a Taylor series approximation of the error dynamics, as discussed in the introduction. The approximation of $\Sigma_\varepsilon$ is immediate by optimality of $\Sigma^*$. The bound can be refined through a second Taylor series expansion:

$$\Sigma_\varepsilon = \Sigma^* + \varepsilon^2 \Sigma^{(2)} + o(\varepsilon^2) \tag{12}$$

The proof is contained in Appendix E.

The final two statements are truly conjectures — they require substantial additional effort, and stronger assumptions:

IV. Extension to parameter-dependent Markovian noise. Rather than a time-homogeneous Markov chain, the transition matrix for $\Phi$ at time $n$ depends on the parameter $\theta_n$. There has been significant recent work on stochastic approximation with state dependent noise that can be applied [15, 20, 34]. The challenge is to construct algorithms to estimate $A(\theta)$.

V. Finite time bounds. This is the topic of the very recent work [11, 8, 39, 43], which may provide tools to obtain bounds for Zap stochastic approximation.

In the following subsections we introduce new Q-learning algorithms motivated by this theory, and show how Prop. 2.1 can be extended to these algorithms.

### 2.2 Zap Q-Learning

We restrict to a discounted reward optimal control problem, with finite state space $X$, finite action space $U$, reward function $r : X \times U \to \mathbb{R}$, and discount factor $\beta \in (0, 1)$. Extensions to other criteria, such as average cost or weighted shortest path, are obtained by substituting the corresponding formulation of the Bellman error.

The joint state-action process $\{(X_n, U_n)\}$ is adapted to a filtration $\{\mathcal{F}_n\}$, so that $\mathcal{F}_n$ is intended to model the information available to the controller at time $n$. The Q-function is defined as the maximum over all possible input sequences of the total discounted reward: for each $x \in X$ and $u \in U$,

$$Q^*(x, u) := \max \sum_{k=0}^{\infty} \beta^k\, \mathsf{E}[r(X_k, U_k) \mid X_0 = x, U_0 = u] \tag{13}$$

Let $P_u$ denote the state transition matrix when action $u$ is taken. It is known that the Q-function is the unique solution to the Bellman equation [5]:

$$Q^*(x, u) = r(x, u) + \beta \sum_{x' \in X} P_u(x, x')\, \underline{Q}^*(x'), \qquad x \in X,\ u \in U, \tag{14}$$

where $\underline{Q}(x) := \max_u Q(x, u)$ for any function $Q : X \times U \to \mathbb{R}$.

For any such function there is a corresponding stationary policy $\phi$ (the greedy policy induced by $Q$). To avoid ambiguities when the maximizer is not unique, we enumerate all stationary policies as $\{\phi^{(i)}\}$, and specify

$$\phi := \phi^{(\kappa)}, \quad \text{where } \kappa := \min\{i : \phi^{(i)}(x) \in \arg\max_u Q(x, u) \text{ for all } x \in X\} \tag{15}$$

The fixed point equation (14) is the basis for Watkins' Q-learning algorithm and its extensions [49, 2, 14]. In general, the goal of Q-learning algorithms is to best approximate the solution to (14).
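When the model is known, the fixed point of the Bellman equation (14) can be computed by successive approximation, which gives a reference solution against which Q-learning estimates can be compared. The sketch below is a minimal illustration: the function name, the two-state MDP, and the tie-breaking rule (NumPy's `argmax` returns the first maximizer, in the spirit of the minimal-index rule in (15)) are assumptions made for this example.

```python
import numpy as np

def q_value_iteration(P, r, beta, tol=1e-10, max_iter=10_000):
    """P[u] is the |X| x |X| transition matrix for action u; r has shape |X| x |U|."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(max_iter):
        Q_under = Q.max(axis=1)                      # underline-Q(x') = max_u Q(x', u)
        Q_new = r + beta * np.stack([P[u] @ Q_under for u in range(r.shape[1])], axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# Hypothetical MDP: action 0 stays in place, action 1 swaps the two states;
# reward 1 is collected whenever the current state is 1.
P = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]
r = np.array([[0.0, 0.0], [1.0, 1.0]])
Q_star = q_value_iteration(P, r, beta=0.9)
greedy = Q_star.argmax(axis=1)                        # greedy policy, first maximizer
```

For this example the optimal policy moves from state 0 to state 1 and then stays, giving $Q^* = [[8.1, 9], [10, 9.1]]$.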

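Most of these algorithms are based on a Galerkin relaxation [42, 13, 51]. Consider a (possibly nonlinear) parameterized family of approximators $\{Q^\theta : \theta \in \mathbb{R}^d\}$, wherein $Q^\theta : X \times U \to \mathbb{R}$ for each $\theta$. The Galerkin relaxation is then obtained by specifying a $d$-dimensional stochastic process $\{\zeta_n\}$ that is adapted to $\{\mathcal{F}_n\}$, and setting the goal: Find $\theta^*$ such that

$$\bar f(\theta^*) = 0, \quad \text{with } \bar f(\theta) := \mathsf{E}\big[\big(r(X_n, U_n) + \beta\, \underline{Q}^\theta(X_{n+1}) - Q^\theta(X_n, U_n)\big)\zeta_n\big] \tag{16}$$

where the expectation is with respect to the steady state distribution of the Markov chain.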

The root finding problem (16) is an ideal candidate for stochastic approximation. The matrix gain algorithm (7) is obtained on specifying , and

$$f(\theta_n, \Phi_{n+1}) := \big(r(X_n, U_n) + \beta\, \underline{Q}^{\theta_n}(X_{n+1}) - Q^{\theta_n}(X_n, U_n)\big)\zeta_n \tag{17}$$

It is assumed that , , for some function .

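At points of differentiability, the derivative of $\bar f$ has a simple form:

$$A(\theta) := \partial_\theta \bar f(\theta) = \mathsf{E}\big[\zeta_n\big(\beta\, \partial_\theta Q^\theta(X_{n+1}, \phi^\theta(X_{n+1})) - \partial_\theta Q^\theta(X_n, U_n)\big)\big] \tag{18}$$

where $\phi^\theta$ denotes the greedy policy induced by $Q^\theta$ (defined in (15), with $Q$ replaced by $Q^\theta$). The definition of $A(\theta)$ is extended to all of $\mathbb{R}^d$ through eq. (18), in which $\phi^\theta$ is uniquely determined using (15). Under this notation, $A$ can be interpreted as a weak derivative of $\bar f$ [9].

The Zap SA algorithm for Q-learning is exactly as described in (5) with $f$ defined in (17), and $A_{n+1}(\theta)$ defined to be the term inside the expectation (18):

$$A_{n+1}(\theta) = \partial_\theta f(\theta, \Phi_{n+1}) = \zeta_n\big[\beta\, \partial_\theta Q^\theta(X_{n+1}, \phi^\theta(X_{n+1})) - \partial_\theta Q^\theta(X_n, U_n)\big] \tag{19}$$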

These recursions are collected together in Algorithm 1. Observe that it is assumed that is defined using a randomized stationary policy. This requires that In future work we will consider parameter dependent policies such as -greedy. It is assumed that the joint process is an irreducible Markov chain, with unique invariant pmf denoted

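If the parameterization is linear, we have $Q^\theta(x, u) = \theta^\top \psi(x, u)$, where each $\psi_i$, $1 \le i \le d$, is a basis function. In tabular Q-learning [49], the basis functions are indicator functions: $\psi_i(x, u) = \mathbb{1}\{x = x^i, u = u^i\}$, $1 \le i \le d$, where $\{(x^i, u^i)\}$ is an enumeration of the state-action pairs, and $d = |X| \cdot |U|$. The parameterization makes large scale MDP problems tractable and also invites use of prior knowledge of the structure of the value function. But stability is not guaranteed when $Q^\theta$ is nonlinear in $\theta$, or even in a linear setting with a general set of basis functions [3, 18].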
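For a linear parameterization, the function in (17) and the matrix in (19) reduce to a few array operations per transition. The sketch below is illustrative and rests on assumptions: it takes $\zeta_n = \psi(X_n, U_n)$ (the choice attributed to [28] in Section 2.3), stores the basis as an array `psi` of shape $|X| \times |U| \times d$, and breaks ties in the greedy action by taking the first maximizer. The function name and the tabular test case are hypothetical.

```python
import numpy as np

def td_terms(theta, psi, x, u, x_next, reward, beta):
    """Return f(theta, Phi_{n+1}) and A_{n+1}(theta) for Q^theta(x, u) = theta . psi[x, u]."""
    q_next = psi[x_next] @ theta                 # Q^theta(x_next, u') for every action u'
    u_greedy = int(np.argmax(q_next))            # greedy action, first maximizer
    zeta = psi[x, u]                             # assumption: zeta_n = psi(X_n, U_n)
    td_error = reward + beta * q_next[u_greedy] - psi[x, u] @ theta
    f_val = td_error * zeta                                          # as in (17)
    A_val = np.outer(zeta, beta * psi[x_next, u_greedy] - psi[x, u])  # as in (19)
    return f_val, A_val

# Tabular basis for |X| = |U| = 2: psi[x, u] is the indicator vector of the pair (x, u)
psi_tab = np.eye(4).reshape(2, 2, 4)
theta_tab = np.array([0.0, 1.0, 2.0, 3.0])       # Q(0,0)=0, Q(0,1)=1, Q(1,0)=2, Q(1,1)=3
f_val, A_val = td_terms(theta_tab, psi_tab, x=0, u=1, x_next=1, reward=1.0, beta=0.5)
```

In the tabular case only the row of `A_val` indexed by the visited pair is non-zero, which is the source of the sparsity exploited in tabular Q-learning.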

Assumption Q1: $Q^\theta$ is continuously differentiable, and Lipschitz continuous with respect to $\theta$; $\bar f$ defined in (16) satisfies the coercivity property: $\{\theta : \|\bar f(\theta)\| \le n\}$ is compact for each $n$.

The following result extends Prop. 2.1 to Zap Q-learning. The extension is non-trivial because the function $\bar f$ for Q-learning (defined in (16)) is only piecewise smooth. The proof is contained in Appendix C.

###### Theorem 2.2.

Consider the functions $\bar f$ and $A$ defined in (16, 18). Suppose Assumption Q1 holds. Then, the differential inclusion (6) admits at least one solution from each initial condition, and for any solution

$$\lim_{t\to\infty} A(\vartheta(t))^\top \bar f(\vartheta(t)) = 0 \tag{20}$$

If in addition $\bar f$ has a unique zero at $\theta^*$, $A(\theta^*)$ is non-singular, and $A(\theta)^\top \bar f(\theta) \neq 0$ for $\theta \neq \theta^*$, then the ODE (10) is globally asymptotically stable.

The main step in the proof of the theorem is to establish the ODE (10), and this rests on convexity in $\theta$ of the "inverse reward function" defined for $\theta \in \mathbb{R}^d$ by

$$r^\theta(x, u) := \beta \sum_{x' \in X} P_u(x, x')\, \underline{Q}^\theta(x') - Q^\theta(x, u), \qquad x \in X,\ u \in U.$$

Implications of Thm. 2.2 to Algorithm 1: Non-smoothness of $\bar f$ makes it harder to establish the ODE approximation than in the arguments made following Prop. 2.1.

Since $A_{n+1}(\theta)$ in (19) is discontinuous in $\theta$, and the setting is Markovian, standard tools to analyze the SA recursion for $\{\hat A_n\}$ in Algorithm 1 cannot be applied. The authors in [14] make a technical assumption to deal with this issue. A discontinuous vector field is also encountered in the GQ algorithm [28]. The authors obtain the ODE approximation only under the assumption that the noise sequence is a martingale difference. Unfortunately, this assumption typically fails in function approximation settings.

It is believed that the techniques of [15] can be extended to establish the ODE approximation for the two time-scale Zap Q-learning algorithm. We leave this to future work.

### 2.3 GQ-learning and Zap GQ-learning

We now take a close look at the GQ-learning algorithm of [28]. The algorithm is based on a linear function approximation setting, but here we consider a generalized version of the algorithm to fit a non-linear function approximation setting.

GQ-learning can be interpreted as a stochastic approximation algorithm that is designed to solve a particular optimization problem. With $\bar f$ defined in (16), and for a given function $\zeta$ with $\zeta_n = \zeta(X_n, U_n)$, the objective in [28] is the following (Footnote 2: In [28], $\zeta_n = \psi(X_n, U_n)$, the basis functions for the linearly parameterized Q-function.):

$$\min_\theta J(\theta) = \tfrac{1}{2} \bar f(\theta)^\top M \bar f(\theta), \quad \text{with } M = \mathsf{E}[\zeta_n \zeta_n^\top]^{-1} \tag{21}$$

where the expectation is in steady state. Using (18), we have $\nabla J(\theta) = A(\theta)^\top M \bar f(\theta)$, and under the assumption made in [28] that $A(\theta)$ is nonsingular for any $\theta$, the two time-scale SA algorithm GQ-learning aims to approximate the solution to the following ODE:

$$\frac{d}{dt}\vartheta(t) = \bar f_{GQ}(\vartheta(t)), \qquad \bar f_{GQ}(\theta) := -A(\theta)^\top M \bar f(\theta) \tag{22}$$

The eigenvalue test G2 fails in one special case:

###### Proposition 2.3.

The linearization matrix for GQ-learning is given by $A_{GQ} = -A_*^\top M A_*$, with $A_* = A(\theta^*)$ and $M$ defined in (18, 21). In the special case of a linear function approximation, and with a tabular basis, there is an eigenvalue of $A_{GQ}$ satisfying

$$\lambda_{GQ} \ge -(1-\beta)^2 \max_{x,u} \varpi(x, u)$$

Prop. 2.3 combined with Prop. 1.1 implies that the convergence rate of the GQ-learning algorithm can be as slow as $n^{-2(1-\beta)^2}$. The tabular case is of course uninteresting from the point of view of the motivation of this paper or [28], but the proposition serves as a warning that the eigenvalue test may fail in GQ-learning without care in choosing the basis functions.

Following the steps in Q-learning we obtain

Zap GQ-learning: Initialize $\theta_0$, $\hat A_0$, $\{\alpha_n\}$, $\{\gamma_n\}$ using (9), $\varepsilon > 0$ small, $M$ positive definite; update for $n \ge 0$:

$$\hat A_{n+1} = \hat A_n + \gamma_{n+1}[A_{n+1}(\theta_n) - \hat A_n], \quad \text{with } A_{n+1}(\theta) \text{ defined in (19)} \tag{23a}$$

$$G_n := -[\varepsilon I + \hat A_{n+1}^\top M \hat A_{n+1}]^{-1} \hat A_{n+1}^\top M \tag{23b}$$

$$\theta_{n+1} = \theta_n + \alpha_{n+1} G_n f(\theta_n, \Phi_{n+1}), \quad \text{with } f(\theta_n, \Phi_{n+1}) \text{ defined in (17)} \tag{23c}$$

The matrix $M$ in (23) can either be as defined in (21), in which case the expectation is approximated using Monte-Carlo, or it could be any other positive definite matrix. It is interesting to note that if $M = I$, the recursion (23) is the same as the Zap Q-learning algorithm in Alg. 1.

The GQ ODE (22) has a discontinuous right hand side, which has prevented an extension of Prop. 2.1 to this case. In preliminary numerical experiments it is observed that the Zap GQ algorithm with $M$ defined in (21) has similar performance to Zap Q-learning.
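The role of the weighting matrix in the gain (23b) can be checked numerically. The sketch below is an illustration with hypothetical names and matrices: it computes the Zap GQ gain for a given estimate $\hat A$ and positive definite $M$, compares it with the Zap SA gain of (5b), and shows that for a non-singular $\hat A$ and small $\varepsilon$ both gains are close to $-\hat A^{-1}$.

```python
import numpy as np

def zap_gq_gain(A_hat, M, eps):
    """Gain of (23b): -[eps I + A^T M A]^{-1} A^T M."""
    d = A_hat.shape[1]
    return -np.linalg.solve(eps * np.eye(d) + A_hat.T @ M @ A_hat, A_hat.T @ M)

def zap_gain(A_hat, eps):
    """Gain of (5b): -[eps I + A^T A]^{-1} A^T."""
    d = A_hat.shape[1]
    return -np.linalg.solve(eps * np.eye(d) + A_hat.T @ A_hat, A_hat.T)

# Hypothetical matrices for illustration
A_hat = np.array([[2.0, 1.0], [0.0, 1.0]])
M = np.array([[2.0, 0.5], [0.5, 1.0]])           # symmetric positive definite
G_gq = zap_gq_gain(A_hat, M, eps=1e-8)
G_identity = zap_gq_gain(A_hat, np.eye(2), eps=1e-8)
G_zap = zap_gain(A_hat, eps=1e-8)
```

When $\hat A$ is square and non-singular, $(\hat A^\top M \hat A)^{-1}\hat A^\top M = \hat A^{-1}$ for every positive definite $M$, so the choice of $M$ matters mainly while the estimate is poorly conditioned or when $\varepsilon$ is not negligible.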

## References

• [1] C. Andrieu, E. Moulines, and P. Priouret. Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization, 44(1):283–312, 2005.
• [2] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.
• [3] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In A. Prieditis and S. Russell, editors, Machine Learning Proceedings 1995, pages 30 – 37. Morgan Kaufmann, San Francisco (CA), 1995.
• [4] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations. Springer, 2012.
• [5] D. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Cambridge, Mass, 1996.
• [6] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.
• [7] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. (also presented at the IEEE CDC, December, 1998).
• [8] V. S. Borkar and S. Pattathil. Concentration bounds for two time scale stochastic approximation. In Allerton Conference on Communication, Control, and Computing, pages 504–511, Oct 2018.
• [9] A. Bressan. Lecture Notes on Functional Analysis: With Applications to Linear Partial Differential Equations, volume 143. American Mathematical Soc., 2013.
• [10] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285, 2017.
• [11] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Concentration bounds for two timescale stochastic approximation with applications to reinforcement learning. Proceedings of the Conference on Computational Learning Theory, and ArXiv e-prints, pages 1–35, 2017.
• [12] A. M. Devraj, A. Bušić, and S. Meyn. Zap Q-Learning – a user’s guide. In Proc. of the Fifth Indian Control Conference, January 9-11 2019.
• [13] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv e-prints, July 2017.
• [14] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
• [15] G. Fort, E. Moulines, A. Schreck, and M. Vihola. Convergence of Markovian stochastic approximation with discontinuous dynamics. SIAM Journal on Control and Optimization, 54(2):866–893, 2016.
• [16] L. Gerencser. Convergence rate of moments in stochastic approximation with simultaneous perturbation gradient approximation and resetting. IEEE Transactions on Automatic Control, 44(5):894–905, May 1999.
• [17] P. W. Glynn and S. P. Meyn. A Liapounov bound for solutions of the Poisson equation. Ann. Probab., 24(2):916–931, 1996.
• [18] G. J. Gordon. Reinforcement learning with function approximation converges to a region. In Proc. of the 13th International Conference on Neural Information Processing Systems, pages 996–1002, Cambridge, MA, USA, 2000. MIT Press.
• [19] T. Kailath. Linear systems, volume 156. Prentice-Hall Englewood Cliffs, NJ, 1980.
• [20] P. Karmakar and S. Bhatnagar. Dynamics of stochastic approximation with iterate-dependent Markov noise under verifiable conditions in compact state space with the stability of iterates not ensured. arXiv e-prints, page arXiv:1601.02217, Jan 2016.
• [21] P. Karmakar and S. Bhatnagar. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Math. Oper. Res., 43(1):130–151, 2018.
• [22] H. K. Khalil. Nonlinear systems. Prentice-Hall, Upper Saddle River, NJ, 3rd edition, 2002.
• [23] V. R. Konda and V. S. Borkar. Actor-critic-type learning algorithms for Markov decision processes. SIAM J. Control Optim., 38(1):94–123 (electronic), 1999.
• [24] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
• [25] H. J. Kushner and G. G. Yin. Stochastic approximation algorithms and applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, 1997.
• [26] C. Lakshminarayanan and S. Bhatnagar. A stability criterion for two timescale stochastic approximation schemes. Automatica, 79:108–114, 2017.
• [27] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms. In UAI'15 Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. Citeseer, 2015.
• [28] H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, pages 719–726, USA, 2010. Omnipress.
• [29] S. Mahadevan, B. Liu, P. Thomas, W. Dabney, S. Giguere, N. Jacek, I. Gemp, and J. Liu. Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. arXiv preprint arXiv:1405.6757, 2014.
• [30] M. Metivier and P. Priouret. Applications of a Kushner and Clark lemma to general classes of stochastic algorithms. IEEE Transactions on Information Theory, 30(2):140–151, March 1984.
• [31] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. 1993 edition online.
• [32] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
• [33] B. T. Polyak. Introduction to Optimization. Optimization Software Inc, New York, 1987.
• [34] A. Ramaswamy and S. Bhatnagar. Stability of stochastic approximations with ‘controlled Markov’ noise and temporal difference learning. IEEE Transactions on Automatic Control, pages 1–1, 2018.
• [35] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
• [36] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985.
• [37] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988.
• [38] S. Shivam, I. Buckley, Y. Wardi, C. Seatzu, and M. Egerstedt. Tracking control by the Newton-Raphson flow: Applications to autonomous vehicles. CoRR, abs/1811.08033, 2018.
• [39] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. CoRR, abs/1902.00923, 2019.
• [40] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Proceedings of the 8th International Conference on Neural Information Processing Systems, NIPS’95, pages 1038–1044, Cambridge, MA, USA, 1995. MIT Press.
• [41] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.
• [42] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
• [43] G. Thoppe and V. Borkar. A concentration bound for stochastic approximation via Alekseev’s formula. Stochastic Systems, 9(1):1–26, 2019.
• [44] J. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
• [45] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996.
• [46] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
• [47] S. Valcarcel Macua, P. Belanovic, and S. Zazo. Diffusion gradient temporal difference for cooperative reinforcement learning with linear function approximation. In Proc. International Workshop on Cognitive Information Processing, 2012.
• [48] Y. Wardi, C. Seatzu, M. Egerstedt, and I. Buckley. Performance regulation and tracking via lookahead simulation: Preliminary results and validation. In 56th IEEE Conference on Decision and Control, pages 6462–6468, 2017.
• [49] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989.
• [50] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
• [51] H. Yu and D. P. Bertsekas. Error bounds for approximations from projected linear equations. Mathematics of Operations Research, 35(2):306–329, 2010.

## Appendix A Proof for Prop. 1.1

We prove the proposition for a general state space Markov chain rather than finite state space. Recall that we are considering the linear recursion (2), presented again here for convenience:

$$E_{n+1} = E_n + \alpha_{n+1}\bigl[A^* E_n + \Delta_{n+1}\bigr], \qquad E_0 = \theta_0 - \theta^* \tag{24}$$

where and . The matrix is Hurwitz, and the following are assumed throughout:

#### Assumptions:

$\Phi$ is $V$-uniformly ergodic on a locally compact and metrizable state space $\mathsf{Z}$ (the conditions of [31]), with unique invariant measure denoted $\varpi$, and .

The reader is referred to [31] for definitions, except for a few clarifications and consequences: to say that means that is measurable, and that the norm is finite:

$$\|g\|_V = \sup_{z \in \mathsf{Z}} \frac{|g(z)|}{V(z)}$$

It is assumed throughout [31] that $V \ge 1$. The $V$-uniform ergodic theorem (Theorem 16.0.1 of [31]) gives the following conclusions. Part (iii) is a simple consequence of Jensen’s inequality and the drift criterion that characterizes $V$-uniform ergodicity in [31, Theorem 16.0.1].

###### Theorem A.1.

The following hold for a $V$-uniformly ergodic Markov chain:

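• There exist $B_V < \infty$ and $\rho \in (0,1)$ such that for each function $g$ with $\|g\|_V < \infty$, and each $z \in \mathsf{Z}$ and $n \ge 0$,

$$\bigl| E[g(\Phi_n) \mid \Phi_0 = z] - \varpi(g) \bigr| \le B_V \, \|g\|_V \, \rho^n \, V(z) \tag{25}$$

where $\varpi(g) = \int g \, d\varpi$.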

• Consider the function $\hat g$ defined by

$$\hat g(z) = \sum_{n=0}^{\infty} \bigl[ E[g(\Phi_n) \mid \Phi_0 = z] - \varpi(g) \bigr], \qquad z \in \mathsf{Z} \tag{26}$$

This solves Poisson’s equation:

$$E[\hat g(\Phi_{k+1}) \mid \Phi_k = z] = \hat g(z) - g(z) + \varpi(g) \tag{27}$$
• The Markov chain is also -uniformly ergodic for any . In particular, if , then .
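The series (26) and Poisson's equation (27) can be checked numerically on a finite state space. The following is a minimal sketch, not part of the paper: the 3-state transition matrix `P` and the function `g` are illustrative assumptions, and the infinite sum in (26) is truncated after a fixed number of terms.

```python
# Toy 3-state chain. P and g are illustrative assumptions, not from the paper.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
g = [1.0, -2.0, 4.0]


def apply_P(P, h):
    """Return z -> E[h(Phi_1) | Phi_0 = z] as a list indexed by state."""
    return [sum(p * hj for p, hj in zip(row, h)) for row in P]


def stationary(P, iters=2000):
    """Approximate the invariant distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def poisson_solution(P, g, terms=2000):
    """Truncate the series (26): sum over n of [P^n g - pi(g)]."""
    pi = stationary(P)
    mean_g = sum(p * x for p, x in zip(pi, g))
    g_hat = [0.0] * len(g)
    Png = list(g)  # holds P^n g, starting at n = 0
    for _ in range(terms):
        g_hat = [a + (b - mean_g) for a, b in zip(g_hat, Png)]
        Png = apply_P(P, Png)
    return g_hat, mean_g


g_hat, mean_g = poisson_solution(P, g)

# Residual of Poisson's equation (27): P g_hat = g_hat - g + pi(g).
lhs = apply_P(P, g_hat)
rhs = [gh - gz + mean_g for gh, gz in zip(g_hat, g)]
residual = max(abs(a - b) for a, b in zip(lhs, rhs))
print("max residual in (27):", residual)
```

Because every entry of this `P` is positive, the terms of the series decay geometrically, in line with the bound (25) with $V \equiv 1$, so the truncation error is negligible here.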

The proof of Prop. 1.1 is composed of the following steps: The sequence $\{E_n\}$ can be expressed as the sum of three terms:

$$E_n = E^{(1)}_n + E^{(2)}_n + E^{(3)}_n$$

each of which is a linear SA recursion (described in (31)), distinguished by initial condition and “noise” input: the first has martingale difference input, the second has zero input (driven only by the initial condition), and the input for the third is a telescoping sequence based on a solution to Poisson’s equation. Lemma A.2 shows how the telescoping input is converted into a zero-mean input of the form $\alpha_n [I + A] Z_{n+1}$, with $Z_{n+1}$ obtained from a solution to Poisson’s equation.

Lemmas A.3 and A.4 imply the conclusions of Prop. 1.1 for ; Lemma A.5 implies the desired conclusions for ; and Lemma A.6 implies the desired conclusions for .

### A.1 Noise statistics and Poisson’s equation

Under the assumptions of this section, the sequence appearing in (24) is zero mean for the stationary version of the Markov chain . This is because . Its asymptotic covariance (appearing in the Central Limit Theorem) is denoted

$$\Sigma_\Delta = \sum_{k=-\infty}^{\infty} E_\varpi\bigl[\Delta_k \Delta_0^\intercal\bigr] \tag{28}$$

where the expectations are in steady state.

A more useful representation of $\Sigma_\Delta$ is obtained through a decomposition of the noise sequence based on Poisson’s equation. This now-standard technique was introduced in the SA literature in the 1980s [30]. Two Poisson equation solutions are used in the analysis that follows:

$$P\hat f(z) = \hat f(z) - f(\theta^*, z), \qquad P\hat{\hat f}(z) = \hat{\hat f}(z) - \hat f(z) \tag{29}$$

It is assumed for convenience that the solutions are normalized so that $\hat f$ and $\hat{\hat f}$ have zero steady-state mean. The existence of zero-mean solutions follows from (26), and the fact that also solves (27) for . Bounds on solutions can be obtained under slightly weaker assumptions: see the main result of [17], and also [31, Theorem 17.4.2]. The bounds follow from Thm. A.1 (iii).

We then have this representation, for ,

$$\Delta_n = \Delta^m_{n+1} + Z_n - Z_{n+1}$$

where and is a martingale difference sequence. Each of the sequences is bounded in , and the asymptotic covariance is expressed

$$\Sigma_\Delta = E_\varpi\bigl[\Delta^m_n \, {\Delta^m_n}^{\intercal}\bigr] \tag{30}$$

where the expectation is taken in steady-state. The equivalence of (30) and (28) appears in [31, Theorem 17.5.3] for the case in which is scalar valued; the generalization to vector valued processes involves only notational changes.
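The equivalence of (28) and (30) can be illustrated on a finite-state chain with a scalar function. The sketch below is not from the paper: the chain `P` and function `f_raw` are toy assumptions, and it uses the standard construction $Z_n = \hat f(\Phi_n)$, $\Delta^m_{n+1} = \hat f(\Phi_{n+1}) - P\hat f(\Phi_n)$, which is assumed here to match the decomposition above. Under that construction the steady-state variance of the martingale difference is $\varpi(\hat f^2) - \varpi((P\hat f)^2)$.

```python
# Toy 3-state chain and scalar function. Both are illustrative assumptions.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
f_raw = [1.0, -2.0, 4.0]
n_states = len(P)


def apply_P(h):
    """Return z -> E[h(Phi_1) | Phi_0 = z] as a list indexed by state."""
    return [sum(P[i][j] * h[j] for j in range(n_states)) for i in range(n_states)]


# Invariant distribution by power iteration.
pi = [1.0 / n_states] * n_states
for _ in range(2000):
    pi = [sum(pi[i] * P[i][j] for i in range(n_states)) for j in range(n_states)]


def steady_mean(h):
    """Steady-state mean of the function h."""
    return sum(p * x for p, x in zip(pi, h))


# Center f so that Delta_k = f(Phi_k) has zero steady-state mean.
m = steady_mean(f_raw)
f = [x - m for x in f_raw]

# (28): two-sided sum of autocovariances; lags -k and +k agree for scalar f.
# Alongside, accumulate f_hat = sum_k P^k f, the solution of Poisson's equation.
sigma_28 = steady_mean([x * x for x in f])
f_hat = list(f)
Pkf = list(f)
for _ in range(2000):
    Pkf = apply_P(Pkf)
    sigma_28 += 2.0 * steady_mean([a * b for a, b in zip(f, Pkf)])
    f_hat = [a + b for a, b in zip(f_hat, Pkf)]

# (30): steady-state variance of the martingale difference.
Pf_hat = apply_P(f_hat)
sigma_30 = steady_mean([x * x for x in f_hat]) - steady_mean([x * x for x in Pf_hat])

print("autocovariance sum (28):", sigma_28)
print("martingale variance (30):", sigma_30)
```

The two printed values should agree up to truncation and rounding error.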

### A.2 Decomposition of the parameter sequence

The solution of the linear recursion (24) can be decomposed into three terms

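$$E_n = E^{(1)}_n + E^{(2)}_n + E^{(3)}_n$$

each evolving as a stochastic approximation sequence with different noise and initial conditions:

$$
\begin{aligned}
E^{(1)}_{n+1} &= E^{(1)}_n + \alpha_{n+1}\bigl[A E^{(1)}_n + \Delta^m_{n+2}\bigr], & E^{(1)}_0 &= 0 & \text{(31a)} \\
E^{(2)}_{n+1} &= E^{(2)}_n + \alpha_{n+1} A E^{(2)}_n, & E^{(2)}_0 &= E_0 & \text{(31b)} \\
E^{(3)}_{n+1} &= E^{(3)}_n + \alpha_{n+1}\bigl[A E^{(3)}_n + Z_{n+1} - Z_{n+2}\bigr], & E^{(3)}_0 &= 0 & \text{(31c)}
\end{aligned}
$$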
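Since each recursion is linear, the decomposition holds for any input sequences, which makes it easy to check numerically. The following is a minimal sketch under simplifying assumptions that are not from the paper: a scalar `a` stands in for the Hurwitz matrix, the step size is taken to be $\alpha_n = 1/n$, and the sequences `dm` and `Z` are arbitrary random numbers standing in for $\Delta^m_n$ and $Z_n$.

```python
import random

random.seed(0)
N = 200
a = -1.5    # scalar stand-in for the Hurwitz matrix A (assumption)
E0 = 2.0    # initial error, arbitrary

# Arbitrary sequences: dm[k] plays the role of Delta^m_k, Z[k] the role of Z_k.
dm = [random.gauss(0.0, 1.0) for _ in range(N + 3)]
Z = [random.gauss(0.0, 1.0) for _ in range(N + 3)]


def alpha(n):
    """Assumed step size alpha_n = 1/n."""
    return 1.0 / n


def run(e0, noise):
    """Iterate E_{n+1} = E_n + alpha_{n+1} [a E_n + noise(n)] from E_0 = e0."""
    E = [e0]
    for n in range(N):
        E.append(E[n] + alpha(n + 1) * (a * E[n] + noise(n)))
    return E


# Full recursion (24), with Delta_{n+1} = Delta^m_{n+2} + Z_{n+1} - Z_{n+2}.
E = run(E0, lambda n: dm[n + 2] + Z[n + 1] - Z[n + 2])
E1 = run(0.0, lambda n: dm[n + 2])             # (31a): martingale input
E2 = run(E0, lambda n: 0.0)                    # (31b): initial condition only
E3 = run(0.0, lambda n: Z[n + 1] - Z[n + 2])   # (31c): telescoping input

gap = max(abs(E[n] - (E1[n] + E2[n] + E3[n])) for n in range(N + 1))
print("max |E_n - (E1_n + E2_n + E3_n)|:", gap)
```

The gap is at the level of floating-point rounding, reflecting superposition for linear recursions.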

The third recursion admits a more tractable realization through a change of variables, $\Xi_n = E^{(3)}_n + \alpha_n Z_{n+1}$, $n \ge 1$.

###### Lemma A.2.

The sequence $\{\Xi_n\}$ evolves as the SA recursion

$$\Xi_{n+1} = \Xi_n + \alpha_{n+1}\bigl[A \Xi_n - \alpha_n [I + A] Z_{n+1}\bigr], \qquad \Xi_1 = E^{(3)}_1 + Z_2 \tag{32}$$
###### Proof.

Recall the summation by parts formula: for scalar sequences $\{a_k\}$, $\{b_k\}$,

$$\sum_{k=0}^{N} a_{k+1}[b_{k+1} - b_k] = a_{N+1} b_{N+1} - a_1 b_0 - \sum_{k=1}^{N} [a_{k+1} - a_k] b_k \tag{33}$$
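As a quick sanity check, the identity can be verified numerically with the boundary term $a_{N+1} b_{N+1}$ on the right-hand side. The sequences below are arbitrary random numbers chosen only for illustration.

```python
import random

random.seed(1)
N = 50
a = [random.uniform(-1.0, 1.0) for _ in range(N + 2)]   # a[0], ..., a[N+1]
b = [random.uniform(-1.0, 1.0) for _ in range(N + 2)]   # b[0], ..., b[N+1]

# Left-hand side: sum over k = 0..N of a_{k+1} (b_{k+1} - b_k).
lhs = sum(a[k + 1] * (b[k + 1] - b[k]) for k in range(N + 1))

# Right-hand side: boundary terms minus the sum over k = 1..N.
rhs = (a[N + 1] * b[N + 1] - a[1] * b[0]
       - sum((a[k + 1] - a[k]) * b[k] for k in range(1, N + 1)))

print("lhs - rhs:", lhs - rhs)
```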

This is applied to (31c), beginning with

$$E^{(3)}_{N+1} = \sum_{n=0}^{N} \alpha_{n+1} A E^{(3)}_n + \sum_{n=0}^{N} \alpha_{n+1}[Z_{n+1} - Z_{n+2}]$$

Hence with $a_k = \alpha_k$ and $b_k = -Z_{k+1}$, the identity (33) implies

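$$
\begin{aligned}
\sum_{n=0}^{N} \alpha_{n+1}[Z_{n+1} - Z_{n+2}] &= Z_1 - \alpha_{N+1} Z_{N+2} + \sum_{n=1}^{N} [\alpha_{n+1} - \alpha_n] Z_{n+1} \\
&= Z_1 - \alpha_{N+1} Z_{N+2} - \sum_{n=1}^{N} \alpha_{n+1} \alpha_n Z_{n+1}
\end{aligned}
$$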

By substitution, and using ,

With for we finally obtain for ,

$$\Xi_{N+1} = Z_1 + \sum_{n=1}^{N} \alpha_{n+1}\bigl[A \Xi_n - \alpha_n [I + A] Z_{n+1}\bigr]$$

which is equivalent to (32).
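The conclusion of Lemma A.2 can also be checked by direct simulation. The sketch below rests on assumptions that are not stated explicitly in the text: a scalar `a` stands in for $A$, the step size is $\alpha_n = 1/n$ (so that $\alpha_{n+1} - \alpha_n = -\alpha_{n+1}\alpha_n$, as used in the proof), the change of variables is taken to be $\Xi_n = E^{(3)}_n + \alpha_n Z_{n+1}$, and the recursion is started from $\Xi_1 = Z_1$, which is what the final display of the proof gives at $N = 0$. The sequence `Z` is arbitrary random input.

```python
import random

random.seed(2)
N = 200
a = -1.5    # scalar stand-in for A (assumption)
Z = [random.gauss(0.0, 1.0) for _ in range(N + 3)]   # Z[k] plays the role of Z_k


def alpha(n):
    """Assumed step size alpha_n = 1/n."""
    return 1.0 / n


# (31c): E3_{n+1} = E3_n + alpha_{n+1} [a E3_n + Z_{n+1} - Z_{n+2}], E3_0 = 0.
E3 = [0.0]
for n in range(N):
    E3.append(E3[n] + alpha(n + 1) * (a * E3[n] + Z[n + 1] - Z[n + 2]))

# (32): Xi_{n+1} = Xi_n + alpha_{n+1} [a Xi_n - alpha_n (1 + a) Z_{n+1}], n >= 1.
Xi = {1: Z[1]}   # Xi_1 = Z_1
for n in range(1, N):
    Xi[n + 1] = Xi[n] + alpha(n + 1) * (a * Xi[n] - alpha(n) * (1.0 + a) * Z[n + 1])

# Compare with the assumed change of variables Xi_n = E3_n + alpha_n Z_{n+1}.
gap = max(abs(Xi[n] - (E3[n] + alpha(n) * Z[n + 1])) for n in range(1, N + 1))
print("max |Xi_n - (E3_n + alpha_n Z_{n+1})|:", gap)
```

The input to the transformed recursion carries the extra factor $\alpha_n$, so it is an order of magnitude smaller than the telescoping input in (31c).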

### A.3 Scaled parameter sequence

For any consider the scaled error sequence . To obtain a recursion for this sequence, consider the Taylor series expansion:

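$$
\begin{aligned}
\frac{(n+1)^{\varrho}}{n^{\varrho}} &= (1 + n^{-1})^{\varrho} \\
&= 1 + \varrho (n+1)^{-1} + \varrho n^{-1}(n+1)^{-1} - \tfrac{1}{2}\varrho(1-\varrho) n^{-2} + O(n^{-3})
\end{aligned}
$$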

where the second equation uses