# Zap Meets Momentum: Stochastic Approximation Algorithms with Optimal Convergence Rate

There are two well known Stochastic Approximation techniques that are known to have optimal rate of convergence (measured in terms of asymptotic variance): the Ruppert-Polyak averaging technique, and stochastic Newton-Raphson (SNR) (a matrix gain algorithm that resembles the deterministic Newton-Raphson method). The Zap algorithms introduced by the authors are a version of SNR designed to behave more closely like their deterministic cousin. It is found that estimates from the Zap Q-learning algorithm converge remarkably quickly, but the per-iteration complexity can be high. This paper introduces an entirely new class of stochastic approximation algorithms based on matrix momentum. For a special choice of the matrix momentum and gain sequences, it is found in simulations that the parameter estimates obtained from the algorithm couple with those obtained from the more complex stochastic Newton-Raphson algorithm. Conditions under which coupling is guaranteed are established for a class of linear recursions. Optimal finite-n error bounds are also obtained. The main objective of this work is to create more efficient algorithms for applications to reinforcement learning. Numerical results illustrate the value of these techniques in this setting.

There are no comments yet.

## Authors

• 8 publications
• 9 publications
• 6 publications
• ### Accelerating Optimization and Reinforcement Learning with Quasi-Stochastic Approximation

The ODE method has been a workhorse for algorithm design and analysis si...
09/30/2020 ∙ by Shuhang Chen, et al. ∙ 0

• ### Does Momentum Help? A Sample Complexity Analysis

Momentum methods are popularly used in accelerating stochastic iterative...
10/29/2021 ∙ by Gugan Thoppe, et al. ∙ 0

• ### Zap Q-Learning With Nonlinear Function Approximation

The Zap stochastic approximation (SA) algorithm was introduced recently ...
10/11/2019 ∙ by Shuhang Chen, et al. ∙ 0

• ### Stochastic approximation algorithms for superquantiles estimation

This paper is devoted to two different two-time-scale stochastic approxi...
07/29/2020 ∙ by Bernard Bercu, et al. ∙ 0

• ### Convergence Rate Improvement of Richardson and Newton-Schulz Iterations

Fast convergent, accurate, computationally efficient, parallelizable, an...
08/26/2020 ∙ by Alexander Stotsky, et al. ∙ 0

• ### Strong overall error analysis for the training of artificial neural networks via random initializations

Although deep learning based approximation algorithms have been applied ...
12/15/2020 ∙ by Arnulf Jentzen, et al. ∙ 0

• ### Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Streaming Data

Motivated by the high-frequency data streams continuously generated, rea...
09/15/2021 ∙ by Antoine Godichon-Baggioni, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The general goal of this paper is the efficient computation of the root of a vector valued function: obtain the solution

to the -dimensional equation . It is assumed that the function is expressed as an expectation: , where and is an

-valued random variable with distribution denoted

. The stochastic approximation (SA) literature contains a large collection of tools to construct algorithms and obtain bounds on their convergence rate. In this paper we show how algorithms with optimal rate of convergence can be constructed based on a synthesis of techniques from classical SA theory combined with variants of momentum algorithms pioneered by Polyak pol64 ; pol87 .

The algorithms and analysis in this paper admit application to both optimization and reinforcement learning. In such applications, it is commonplace to assume that there is an aperiodic and positive Harris recurrent Markov chain

whose steady-state distribution is . Let for and . Under suitable bounds on the function , the following limits hold:

 ¯f(θ)=limn→∞1nn−1∑k=0fk(θ)=limn→∞E[fn(θ)],θ∈Rd,(first limit in the a.s.\ sense \@@cite[cite]{\@@bibref{Authors Phrase1Y% earPhrase2}{MT}{\@@citephrase{(}}{\@@citephrase{)}}}). (1)

Three general classes of algorithms are investigated. Each is defined with respect to a non-negative scalar gain sequence , and two include matrix sequences . For each algorithm, the difference sequence is denoted , , with given initial condition .

#### 1. Stochastic approximation with matrix gain

 Δθn+1=αn+1Gn+1fn+1(θn). (2)

#### 2. Matrix Heavy-Ball Stochastic approximation (PolSA)

 Δθn+1=Mn+1Δθn+αn+1Gn+1fn+1(θn) (3)

#### 3. Nesterov Stochastic approximation (NeSA)

For a fixed scalar ,

 Δθn+1=Δθn+ζ[fn+1(θn)−fn+1(θn−1)]+ζαn+1fn+1(θn) (4)

If , then (2) is the classical algorithm of Robbins and Monro bor08a . In Stochastic Newton Raphson (SNR) and the more recent Zap SNR rup85 ; devmey17a ; devmey17b , the matrix sequence is chosen to be an approximation of . Stability of the algorithm has been demonstrated in application to Q-learning devmey17a ; devmey17b ; a non-trivial result, given that Q-learning is cast as root finding and not a minimization problem. The PolSA algorithm coincides with the heavy-ball method when is a sequence of scalars pol64 ; pol87 ; loiric17 . Justification for the special form (4) in NeSA is provided in the next section.

As in many previous papers in the context of high-dimensional optimization loiric17 and SA kontsi04 ; kusyin97 ; bor08a , parameter error analysis is restricted to a linear setting:

 fn(θ)=Anθ−bn=A(θ−θ∗)+Δn (5)

in which is a stochastic process with common mean , and

 Δn:=~An(θ−θ∗)+Δ∗nwithΔ∗n:=fn(θ∗),n≥1, (6)

where the tilde always denotes deviation: .

#### Goals

The main goal is to design algorithms with 1. fast convergence to zero of the error sequence , and 2. low computational complexity.

Rates of convergence are well understood for the SA recursion. It is known that the Central Limit Theorem and Law of the Iterated Logarithm hold under general conditions, and the

asymptotic covariance appearing in these results can be expressed as the limit

 Σθ=limn→∞Σθn:=limn→∞nE[~θn~θ\tiny\it Tn]. (7)

The LIL is most interesting in terms of bounds:

 |~θn(i)|≤(σ(i)+o(1))√2loglog(n)n (8)

where as , and .

The a.s. bound (8) may not be as satisfying as a Hoeffding or PAC-style finite- bound, but presently there are no such bounds for Markovian models with useful constants (see e.g. glyorm02 ). Applications to reinforcement learning are typically cast in a Markov setting.

A necessary condition for quick convergence is that the LIL holds with small . Again, for the SA recursion, optimization of this parameter is well-understood. Denote by the asymptotic covariance for (2) with . When this is finite, it admits a representation in terms of the asymptotic covariance of the noise:

 ΣΔ=limn→∞1nE[(n∑k=1Δ∗k)(n∑k=1Δ∗k)% \tiny\it T] (9)

In particular, the choice is a special case of SNR, for which asymptotic covariance admits the explicit form

 Σ∗:=A−1ΣΔA−1\tiny\it T. (10)

This is optimal: the difference is positive semi-definite for any benmetpri90 ; kusyin97 ; bor08a .

What about computational complexity? In realistic applications of SNR the gain sequence will be of the form , where are approximations of . In a nonlinear model, is an approximation of , obtained using the two time-scale algorithm of devmey17a ; devmey17b . The resulting complexity is a barrier to application in high dimension. Steps towards resolving this obstacle are presented in this paper:

• The parameters in the PolSA algorithm can be designed so that the error sequence enjoys all the attractive properties of SNR, but without the need for any matrix inversion.

• NeSA is often simpler than PolSA in applications to RL. A formula for the asymptotic covariance of a variant of NeSA is obtained in this paper. While not equal to , the reduced complexity makes it a valuable option.

These conclusions are established in Propositions 2.2, 3.1 and 3.2 for linear recursions, and illustrated in numerical examples for new Q-learning algorithms introduced in Section 4.

The techniques introduced in this paper are not directly applicable to batch gradient descent – an explanation is given in Section 2.2.

Portions of this article are taken from the recent survey devbusmey19 .

#### Literature survey

The reader is referred to devmey17a ; devmey17b for a survey on SNR and the more recent Zap SNR algorithms.

The present paper is built on a vast literature on optimization nes83 ; pol64 ; pol87 ; nes12 and stochastic approximation kontsi04 ; kusyin97 ; bor08a ; rup85 ; pol90 ; poljud92 . The work of Polyak is central to both thrusts: the introduction of momentum, and techniques to minimize variance in SA algorithms.

Many papers with similar titles focus on high dimensional optimization in which randomness is introduced to ease computational burden; an example is batch gradient descent — see wilrecjor16 for an extensive survey. Again, as explained in Section 2.2, we do not know if the concepts introduced in this paper can be applied to batch gradient descent.

Most closely related to the present work is the literature on ERM (empirical risk minimization) in which the sample path limit in (1) is replaced by a finite average, Katyusha16 ; SAGA14 ; jaikakkidnetsid17 . Under general conditions it can be shown that the sequence of ERM optimizers is convergent, and has optimal asymptotic covariance (a survey and further discussion is presented in jaikakkidnetsid17 ). The papers moubac11 ; bacmou13 ; gadpansaa18 ; duc16 ; jaikakkidnetsid17 establish the optimal convergence rate of for various algorithms.

The recent paper jaikakkidnetsid17 is most closely related to the present work, considering the shared goal of optimization of the asymptotic covariance, along with rapidly vanishing transients through algorithm design. The paper restricts to nonlinear optimization rather than the root finding problems considered here, which rules out application to many reinforcement learning algorithms. The metric for performance is slightly different, focusing on the rate of convergence of the expected loss, for which they obtain bounds for each iteration of the algorithm. Optimal asymptotic covariance is not established, but rather they obtain tight bounds on the regret.

The algorithms presented here do achieve the optimal asymptotic covariance, are not restricted to ERM, and we believe that in many applications they will be simpler to implement. This is especially true for the NeSA algorithm applied to Q-learning.

## 2 Momentum methods and applications

Consider first the deterministic root-finding problem. This will bring insight into the relationship between the three algorithms (24) discussed in the introduction. Since the Markovian disturbance is absent, the notation is used in place of . The goal remains the same: find the vector such that .

Deterministic variants of (24) commonly considered in the literature are, respectively,

 Successive approximation: Δθn+1=αf(θn) (11) Polyak’s heavy ball: Δθn+1=μΔθn+αf(θn) (12) Nesterov’s acceleration: Δθn+1=μΔθn+ζ[f(θn)−f(θn−1)]+αf(θn) (13)

where are positive constants. Nesterov’s algorithm was designed for extremal seeking, which is the special case for a real-valued function . The recursion (13) is the natural extension to the root-finding problem considered here.

The questions asked in this paper are posed in a stochastic setting, but analogous questions are:

• Why restrict to a scalar momentum term ? Can a “momentum matrix” be designed to improve performance?

• Can online algorithms be designed to approximate the optimal momentum matrix? If so, we require tools to investigate the performance of a matrix sequence :

 Δθn+1=Mn+1Δθn+αf(θn) (14)

Potential answers are obtained by establishing relationships between these deterministic recursions.

Consider the successive approximation algorithm (11) under the assumption of global convergence: as . Assume moreover that . We obtain by the mean-value theorem,

 Δθn+1−Δθn=α∂f(¯θn)Δθn=α2∂f(¯θn)f(θn−1)

where lies on the line connecting the vectors and . It follows that . This suggests a heuristic: swap and in a given convergent algorithm to obtain a new algorithm that is simpler, but with desirable properties. This is similar in spirit to the coupling concept introduced in allore14

. Applying this heuristic to (

14) results in

 Δθn+1=Mn+1Δθn+αf(θn)≈Mn+1Δθn+1+αf(θn)

Assuming that an inverse exists, this becomes

 Δθn+1≈α[I−Mn+1]−1f(θn)

We thus arrive at a possible answer to the question of optimal momentum: For the matrix sequence , the algorithm (14) can be expressed

 Δθn+1=[I+α∂f(θn)]Δθn+αf(θn) (15)

The foregoing approximations suggest that this is an approximation of Newton-Raphson:

 Δθn+1≈−[∂f(θn)]−1f(θn)

A Taylor series argument shows that the recursion (15) is approximated by

 Δθn+1=Δθn+α[f(θn)−f(θn−1)]+αf(θn) (16)

This is the special case of Nesterov’s algorithm (13) with and .

In this paper, the focus is root-finding in the stochastic setting described in the introduction. Strong justification for the stochastic analog of (15) is provided through a coupling bound between the respective algorithms: see Prop. 2.2. It is found that similar transformations lead to new algorithms for reinforcement learning and other applications.

### 2.1 Optimal matrix momentum for PolSA

Returning to the stochastic setting, the following special case of PolSA is the analog of (15) considered in this paper

 Δθn+1=[I+ζˆAn+1]Δθn+αn+1ζfn+1(θn) (17)

where is a scalar constant, and are estimates of . The SNR algorithm is (2) in which (the Moore–Penrose pseudo inverse when necessary).

Certain simplifying assumptions are imposed to ease analysis. The linear model (5

) is adopted, which is subject to the following stability assumptions: For any eigenvalue

of ,

 Real(λ)<0and|1+ζλ|<1 (18)

It is assumed without loss of generality that . The algorithms are also simplified by replacing the estimate with its true value . The simplified SNR and PolSA algorithms are defined as follows:

 Δθ∗n+1 = −αn+1A−1fn+1(θ∗n) (19) Δθn+1 = [I+A]Δθn+αn+1fn+1(θn) (20)

Theorem 2.1 shows that the SNR algorithm is in some sense optimal under general conditions. Its proof is contained in Section A, and is based on MT ; che00 ; kovsch03 .

###### Theorem 2.1.

Suppose that is a bounded sequence, and each entry is a function of a -uniformly ergodic Markov chain MT . Then,

• The asymptotic covariance of SNR is given by (10), where is the asymptotic covariance of the zero-mean sequence .

• The Central Limit Theorem and Law of the Iterated logarithm hold for with respect to this covariance sequence. In particular, (8) holds with .

• Consider any other algorithm of the form (2) for which the sequences are convergent, and the limit (7) holds. Then the matrix inequality holds: .

A drawback with SNR is complexity introduced by the matrix inverse. The algorithm (2.1) is simpler and enjoys the same attractive properties. This is established through coupling under slightly stronger assumptions:

###### Proposition 2.2.

Suppose that the assumptions of Theorem 2.1 hold, and that are bounded martingale difference sequences. Let denote the iterates using SNR (19) and the iterates obtained using (20), with identical initial conditions.

Then, there is a square-integrable random variable such that

 ∥θn−θ∗n∥≤˜Bn−1,n≥1. (21)

Consequently, the conclusions of Theorem 2.1 (i) and (ii) also hold for the PolSA algorithm.

Other than the two techniques introduced by Polyak, SNR and the Polyak-Ruppert averaging technique, PolSA is the only other SA algorithm that achieves optimal asymptotic variance.

The proof of Prop. 2.2, contained in Section B, is based on a justification of the heuristic used to construct the deterministic recursion (15).

An illustration is provided in Fig. 1 for the linear model in which is symmetric and positive definite, with , and is i.i.d. and Gaussian. Shown are the trajectories of (a short run for this example, since is over one million).

### 2.2 Applications

#### Reinforcement learning

Section 4 describes application to Q-learning, and includes numerical examples. Section D contains a full account of TD-learning. In particular, the LSTD algorithm can be regarded as an instance of SNR: (2) with an estimate of .

#### Optimization

A common application of stochastic approximation is convex optimization. In this setting, for a sequence of smooth functions , and then . The theory developed in this paper is applicable to this general class of optimization problems, except in degenerate cases. For comparison, consider the quadratic optimization problem in which , with . The stability condition (18) holds provided : a condition familiar in the convex optimization literature.

#### Batch gradient descent

Consider the following special case in which a deterministic function is given. The goal is again to compute a stationary point . A sequence of sparse matrices is constructed, and then for each

 fn:=−In∇J

It is assumed that , so that are unbiased samples of the gradient; see loiric17 for a recent survey. Randomness vanishes as approaches :

 ∥fn(θ∗)∥2=(∇J(θ∗))\tiny\it TI2n∇J(θ∗)=0

The covariance matrix is zero, and hence also the asymptotic covariance (10). There is no known CLT or LIL in this case, and hence the results of this paper do not apply to this special setting.

## 3 Variance analysis of the NeSA algorithm

The NeSA algorithm has a finite asymptotic covariance that can be expressed as the solution to a Lyapunov equation. We again restrict to the linear model, so that the recursion (4) (with ) becomes

 Δθn+1 =[I+An+1]Δθn+αn+1[An+1θn−bn+1] (22)

Stability of the recursion requires a strengthening of (18). Define the linear operator as follows: For any matrix ,

 L(Q):=E[(I+An)Q(I+An)\tiny\it T] (23)

The following assumptions are imposed throughout:

• are bounded martingale difference sequences. Moreover, for any matrix ,

 E[(I+An)Q(I+An)\tiny\it T∣Fn−1]=L(Q)

where is the natural filtration: .

• The bounds in (18) hold, and the linear operator has spectral radius bounded by unity.

Define the -dimensional vector processes

 Φn:=(√n~θnnΔθn)\it andΔΦn:=(√nΔnnΔn) (24)

A covariance matrix sequence is the focus of this section:

 (25)
###### Proposition 3.1.

Suppose that (N1) and (N2) hold. Then, the covariance sequence is convergent:

 limn→∞Σn=[Σ11∞00Σ22∞] (26)

in which the second limit is the solution to the Lyapunov equation

 Σ22∞=L(Σ22∞)+ΣΔ (27)

(an explicit solution is given in eqn. (53)), and

 Σ11∞=−Σ22∞−A−1Σ22∞−Σ22∞A−1 (28)

The following result is a corollary to Prop. 3.1, with an independent proof provided in Section C.1.

###### Proposition 3.2.

Under (N1)–(N3) the conclusions of Prop. 3.1 hold for the PolSA recursion (20). In this case the solution to the Lyapunov equation is the optimal covariance:

 Σ11∞=Σ∗:=A−1ΣΔA−1\tiny\it T (29)

and is the unique solution to the Lyapunov equation

 Σ22∞=(I+A)Σ22∞(I+A)\tiny\it T+ΣΔ (30)

The main step in the proof of Prop. 3.1 involves a finer look at the off-diagonal blocks of the covariance matrix. The proofs of the following Lemmas are postponed to Section C.

###### Lemma 3.3.

The following approximations hold for and the scaled covariance :

 Σ22n+1 =L(Σ22n)+ΣΔ+o(1),ψn=−Σ11n−A−1Σ22∞+o(1),n≥1. (31)

The second iteration is used together with the following result to obtain (28).

###### Lemma 3.4.

The following approximation holds:

 Σ11n+αn+1(Σ11n+AΣ11n +Σ11nA\tiny\it T+ψ\tiny\it Tn(I+A)\tiny\it T+(I+A)ψn+L(Σ22∞)+ΣΔ+o(1)) (32)
###### Proof of Prop. 3.1.

The first approximation in (31) combined with Assumption (N2) implies that the sequence is convergent, and the limit is the solution to (27) (details are provided in Section C.2).

Substituting the approximation (31) for into (32) gives

 Σ11n+1=Σ11n+αn+1(−Σ11n−Σ22∞−A−1Σ22∞−Σ22∞A−1+o(1))

This can be regarded as a Euler approximation to the ODE:

 ddtxt=−xt−Σ22∞−A−1Σ22∞−Σ22∞A−1

The convergence techniques of stochastic approximation theory can be applied to establish that the limits of and coincide with the stationary point, which is (28).

## 4 Application to Q-learning

Consider the discounted-cost control problem with state-action process denoted evolving on the finite set , cost function , and discount factor . The Q-function solves the Bellman equation: . For a -dimensional basis of functions and a given parameter , the corresponding estimate is defined by

 Qθ(x,u)=∑kθkψk(x,u).

Watkins’ Q-learning algorithm is designed to compute the exact Q-function that solves the corresponding Bellman errors wat89 ; watday92a . In this setting the basis is taken to be the set of indicator variables: , , with . The basic algorithm of Watkins can be written

 Δθn+1=αn+1ˆDn+1[An+1θn−bn+1] (33)

in which the matrix gain is diagonal:

 ˆDn(i,i)=[1nn−1∑k=0I{(Xk,Uk)=(xi,ui)}]−1, (34)

and , where (see csa10 for more details).

Among the other algorithms compared are

 SNR: Δθn+1=−αn+1ˆA−1n+1[An+1θn−bn+1] PolSA: Δθn+1=[I+ˆAn+1]Δθn+αn+1[An+1θn−bn+1] PolSA-D: NeSA:

The SNR algorithm considered coincides with the Zap Q-learning algorithm of devmey17a ; devmey17b . Histograms for a particular 6-state MDP model with were considered in this prior work. Fig. 2 shows that the histograms for using PolSA-D and SNR nearly coincide with (performance for PolSA is similar). The histogram for NeSA shows a much higher variance, but this algorithm requires by far the least computation per iteration.

Fig. 3 shows the maximal Bellman error as a function of iterations for several algorithms.

Experiments were performed for larger examples. Results from two such experiments are shown in Fig. 4. The MDP model is again a stochastic shortest path problem. The model construction was based on the creation of a graph with

nodes, in which the probability of an edge between a pair of nodes is i.i.d. with probability

. Additional edges are added, for each , to ensure the resulting graph is strongly connected.

The controls are as in the 6-state example: with probability the particle moves in the desired direction, and with remaining probability any of the neighboring nodes is chosen uniformly. Two exploration rules were considered: the “online” version described above (asynchronous version), and the offline “clock sampling” approach in which state-action pairs are chosen sequentially (synchronous version). At stage , if is the current pair, a random variable is chosen according to the distribution , and the entry of the Q-function is updated according to the particular algorithm using the triple . A significant change to Watkins’ iteration (33) in the synchronous setting is that is replaced by . This combined with deterministic sampling results in significant reduction in variance.

The synchronous speedy Q-learning recursion of azamunghakap11 appears similar to the NeSA algorithm with clock sampling, following the heuristic swapping of arguments used to motivate the algorithms introduced in this paper. We do not know if the swapping is justified for their algorithm.

Two graphs were used in the survey experiments, one resulting in an MDP with state-action pairs and another resulting in a larger MDP with state-action pairs. The plots in Fig. 4 show Bellman error as a function of iteration for the two cases. Comparison of the performance of algorithms in a deterministic exploration setting versus the online setting is also shown.

We have tested examples with over variables and obtain similar performance for the clock-sampling version of the algorithm. Performance for the PolSA-D algorithm degrades due to estimation error for in (34). Projection may be needed to improve reliability in high dimensions.

## 5 Conclusions

It is exciting to see how the intuitive transformation from SNR to PolSA and NeSA can be justified theoretically and in simulations. While the covariance of NeSA is not optimal, it is the simplest of the three algorithms and does well in the experiments we have conducted.

An important next step is to create adaptive techniques to ensure fast coupling or other ways to ensure fast forgetting of the initial condition. It is possible that techniques in jaikakkidnetsid17 may be adapted. We do not have a natural notion of expected loss in the general root finding problem, but it will be of interest to pursue analysis in the special case of nonlinear optimization.

## References

• [1] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. ArXiv e-prints, Mar. 2016.
• [2] Z. Allen-Zhu and L. Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. ArXiv e-prints, July 2014.
• [3] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.
• [4] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate . In Advances in Neural Information Processing Systems 26, pages 773–781. Curran Associates, Inc., 2013.
• [5] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson.
• [6] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.
• [7] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233–246, 2002.
• [8] X. Chen. On the limit laws of the second order for additive functionals of Harris recurrent Markov chains. Probability Theory and Related Fields, 116(1):89–123, Jan 2000.
• [9] A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
• [10] A. M. Devraj, A. Bušić, and S. Meyn. Zap Q Learning – a user’s guide. In Proceedings of the fifth Indian Control Conference, January 2019.
• [11] A. M. Devraj and S. Meyn. Zap Q-Learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2235–2244. Curran Associates, Inc., 2017.
• [12] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv e-prints, July 2017.
• [13] J. Duchi. Introductory lectures on stochastic optimization. Stanford Lecture Series, 2016.
• [14] S. Gadat, F. Panloup, and S. Saadane. Stochastic heavy ball. Electron. J. Statist., 12(1):461–529, 2018.
• [15] P. W. Glynn and D. Ormoneit. Hoeffding’s inequality for uniformly ergodic Markov chains. Statistics and Probability Letters, 56:143–146, 2002.
• [16] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford.

Accelerating Stochastic Gradient Descent.

ArXiv e-prints (and to appear, COLT 2018), Apr. 2017.
• [17] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
• [18] V. Koval and R. Schwabe. A law of the iterated logarithm for stochastic approximation procedures in d-dimensional euclidean space. Stochastic Processes and their Applications, 105(2):299 – 313, 2003.
• [19] H. J. Kushner and G. G. Yin. Stochastic approximation algorithms and applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, 1997.
• [20] N. Loizou and P. Richtárik. Momentum and Stochastic Momentum for Stochastic Gradient, Newton, Proximal Point and Subspace Descent Methods. ArXiv e-prints, Dec. 2017.
• [21] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. 1993 edition online.
• [22] E. Moulines and F. R. Bach.

Non-asymptotic analysis of stochastic approximation algorithms for machine learning.

In Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011.
• [23] Y. Nesterov. A method of solving a convex programming problem with convergence rate . In Soviet Mathematics Doklady, 1983.
• [24] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
• [25] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
• [26] B. T. Polyak. Introduction to Optimization. Optimization Software Inc, New York, 1987.
• [27] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika (in Russian). translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990.
• [28] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.
• [29] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985.
• [30] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988.
• [31] C. Szepesvári. Algorithms for reinforcement learning.

Synthesis lectures on artificial intelligence and machine learning

, 4(1):1–103, 2010.
• [32] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
• [33] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
• [34] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989.
• [35] C. J. C. H. Watkins and P. Dayan. -learning. Machine Learning, 8(3-4):279–292, 1992.
• [36] A. C. Wilson, B. Recht, and M. I. Jordan. A Lyapunov Analysis of Momentum Methods in Optimization. ArXiv e-prints, 2016.

## Appendix A Limit theory for Markov chains and SA: Proof of Theorem 2.1

The error recursion for the linear SNR algorithm (19) can be expressed

 ~θ∗n+1=~θ∗n+αn+1[−~θ∗n+A−1~An~θ∗n+A−1Δ∗n]

for which . This linear recursion is convergent under the assumptions of the theorem.

The sequence satisfies the CLT with covariance [21], and also the LIL [8]. Each of these results is established by first constructing a martingale:

 Mn=Hn−H0+n∑k=1Δ∗k

where are functions of the underling Markov chain , where solves a certain Poisson’s equation. We have under the assumption that is a bounded function of and the Markov chain is -uniformly ergodic (Foster’s criterion holds using the function ). Since has bounded mean, the assumptions of [18] hold to establish the LIL for the SA recursion, and the CLT can be found in [19] (see Theorem 2.1 of Chapter 10 and the following discussion in Section 10.2.2).

## Appendix B Coupling

The proof of Prop. 2.2 is based on a transformation of SNR so that it resembles PolSA with a vanishing disturbance sequence. This is essentially a reversal of the manipulations applied to derive (13) from an approximation of (14) at the start of Section 2.

For simplicity we take (this is without loss of generality by re-defining the matrix ).

It is simplest to first prove the result when is deterministic: .

### b.1 Deterministic matrix sequence

###### Lemma B.1.

The SNR and PolSA recursions with deterministic can be expressed, respectively

 Δθ∗n+1 = [I+A]Δθ∗n+αn+1[A~θ∗n+Δn+1]+En+1 (35) Δθn+1 = [I+A]Δθn+αn+1[A~θn+Δn+1], (36)

where and

 En+1=[I+A]{Δθ∗n+1−Δθ∗n},n≥0. (37)
###### Proof.

The recursion (36) is the definition of PolSA in this special case.

The recursion (35) is obtained by expressing SNR as follows:

 0 =AΔθ∗n+1+αn+1[A~θ∗n+Δn+1] =AΔθ∗n+αn+1[A~θ∗n+Δn+1]−{Δθ∗n+1−Δθ∗n}+En+1

Moving to the left-hand side completes the proof.

Denote and . The proof of Prop. 2.2 requires that we establish uniform bounds on these sequences.

###### Lemma B.2.

The error sequence evolves according to the recursion

 Δ¯θn+1=[I+A]Δ¯θn+αn+1A¯θn−En+1 (38)

in which the sequence (37) satisfies the following:

• is a bounded sequence

• The partial sums are also bounded:

 SEn:=n∑k=1kEk,n≥1.
###### Proof.

Recall that . Consequently,

 (n+1){Δθ∗n+1−Δθ∗n} =(n+1)Δθ∗n+1−nΔθ∗n}−Δθ∗n =−(θ∗n+A−1Δn+1)+(θ∗n−1+A−1Δn)−Δθ∗n =−2Δθ∗n−A−1(Δn+1−Δn)

The right hand side is a bounded and telescoping sequence; the bounds on easily follow.

The proof of the following lemma is routine.

###### Lemma B.3.

The normalized error sequence evolves according to the recursion

 Δξn+1 =1n(n+1)[I+A]ξn−1 +[I+A](Δξn+2[¯θn−¯θn−1])−(n+1)En+1

###### Proof of Prop. 2.2 – deterministic case.

On summing each side of the identity in Lemma B.3 we obtain, for any ,

 ξn+1−ξm =[I+A]n∑k=m1k(k+1)ξk−1 +[I+A](ξn−ξm−1+2[¯θn−¯θm−1])−n∑k=m(k+1)En+1

or

 ξn+1=[I+A]((1+2n−1)ξn+n∑k=m1k(k+1)ξk−1)+bξn+1

where the final term is bounded in for any fixed :

 bξn+1=ξm−(1−2(m−1)−1)[I+A]ξm−1−n∑k=m(k+1)En+1

To prove that the sequence is bounded, let and satisfy the matrix inequality

 [I+A]M[I+A]\tiny\it T≤δM

Let denote the Banach space of sequences with finite norm

 ∥\boldmathx∥∞:=supn√x\tiny% \it TnMxn.

Let denote the linear operator on such sequences defined by for , and for

 H(\boldmathx)n+1=[I+A]((1+2n−1)ξn+n∑k=m1k(k+1)ξk)+bξn+1,

For sufficiently large , this is a contraction in : for some and any pair of bounded sequences,

 ∥H(\boldmathx)−H(\boldmathx′)∥∞:=supn|H(\boldmathx)n−H(\boldmathx′)n|≤ϱ∥\boldmathx−% \boldmathx′∥∞.

It follows from the Banach fixed point theorem that the is a bounded sequence.

### b.2 Proof of Prop. 2.2 for the general linear algorithm

The major difference in the case of random is the identity (38) holds with a modified error sequence:

 Δ¯θn+1=[I+A]Δ¯θn+αn+1A¯θn−En+1+αn+1~An+1¯θn

Lemma B.3 must be modified accordingly, and from this we obtain

 ξn+1=[I+A]((1+2n−1)ξn+n∑k=m1k(k+1)ξk−1)+bξn+1+Mm,n

where for fixed , the sequence is a martingale:

 n∑k=m1k(k+1)~Ak+1ξk

The proof then proceeds as in the previous deterministic setting.

## Appendix C Variance analysis of the NeSA algorithm

Throughout this section it is assumed that the assumptions of Section 3.1 hold. We will initially impose an additional assumption:

(N3)  The covariance sequence defined in (25) is bounded.

This simplifies the proof of convergence. Following this proof, we explain how the same arguments leading to convergence can be elaborated to establish boundedness.

Even with under this assumption, the convergence proof appears complex. We first provide a proof for the simpler PolSA algorithm (20).

### c.1 Variance analysis of PolSA

The recursion (20) is expressed in state space form as follows:

 [~θn+1Δθn+1]=[I(I+A)0(I+A)][~θnΔθn]+αn+1[[A0A0][~θnΔθn]+[Δn+1Δn+1]] (39)

Using the Taylor series approximation:

 √n+1=√n+√n2n+O(1n3/2), (40)

it follows that the process (24) evolves as

 Φn+1=MΦn+αn+1(BΦn+ΔΦn+1+εΦn), (41)

involving the matrices and , and the column vector :

 (42)

Once again, the main step in the proof of Prop. 3.2 to obtain sharp results for the off-diagonal blocks of the covariance matrix. The proofs of the following lemmas are contained in Section C.2.

###### Lemma C.1.

Under the conditions of Prop. 3.2, the following approximations hold for and the scaled covariance :

 Σ22n+1 =(I+A)Σ22n(I+A)\tiny\it T+ΣΔ+O(αn+1) (43) ψn =−Σ11n−A−1Σ22∞+o(1),n≥1. Σ11n+1 =Σ11n+αn+1(Σ11n+AΣ11n+Σ11nA\tiny\it T+ψ\tiny\it Tn(I+A)\tiny\it T+(I+A)ψn +(I+A)Σ22∞(I+A)\tiny\it T+ΣΔ+o(1))
###### Proof of Prop. 3.2.

From Assumption A.1, which implies that the eigenvalues of matrix lie within the open unit disc, it follows that the first recursion in (43) of Lemma C.1 can be approximated by a geometrically stable discrete-time Lyapunov recursion with a time-invariant, bounded input . The limit (30) directly follows.

Substituting the approximation for in (43) into the right hand side of the recursion in (43) gives

 Σ11n+1=Σ11n+α