The general goal of this paper is the efficient computation of the root of a vector valued function: obtain the solutionto the -dimensional equation . It is assumed that the function is expressed as an expectation: , where and is an
-valued random variable with distribution denoted. The stochastic approximation (SA) literature contains a large collection of tools to construct algorithms and obtain bounds on their convergence rate. In this paper we show how algorithms with optimal rate of convergence can be constructed based on a synthesis of techniques from classical SA theory combined with variants of momentum algorithms pioneered by Polyak pol64 ; pol87 .
The algorithms and analysis in this paper admit application to both optimization and reinforcement learning. In such applications, it is commonplace to assume that there is an aperiodic and positive Harris recurrent Markov chainwhose steady-state distribution is . Let for and . Under suitable bounds on the function , the following limits hold:
Three general classes of algorithms are investigated. Each is defined with respect to a non-negative scalar gain sequence , and two include matrix sequences . For each algorithm, the difference sequence is denoted , , with given initial condition .
1. Stochastic approximation with matrix gain
2. Matrix Heavy-Ball Stochastic approximation (PolSA)
3. Nesterov Stochastic approximation (NeSA)
For a fixed scalar ,
If , then (2) is the classical algorithm of Robbins and Monro bor08a . In Stochastic Newton Raphson (SNR) and the more recent Zap SNR rup85 ; devmey17a ; devmey17b , the matrix sequence is chosen to be an approximation of . Stability of the algorithm has been demonstrated in application to Q-learning devmey17a ; devmey17b ; a non-trivial result, given that Q-learning is cast as root finding and not a minimization problem. The PolSA algorithm coincides with the heavy-ball method when is a sequence of scalars pol64 ; pol87 ; loiric17 . Justification for the special form (4) in NeSA is provided in the next section.
The main goal is to design algorithms with 1. fast convergence to zero of the error sequence , and 2. low computational complexity.
Rates of convergence are well understood for the SA recursion. It is known that the Central Limit Theorem and Law of the Iterated Logarithm hold under general conditions, and theasymptotic covariance appearing in these results can be expressed as the limit
The LIL is most interesting in terms of bounds:
where as , and .
The a.s. bound (8) may not be as satisfying as a Hoeffding or PAC-style finite- bound, but presently there are no such bounds for Markovian models with useful constants (see e.g. glyorm02 ). Applications to reinforcement learning are typically cast in a Markov setting.
A necessary condition for quick convergence is that the LIL holds with small . Again, for the SA recursion, optimization of this parameter is well-understood. Denote by the asymptotic covariance for (2) with . When this is finite, it admits a representation in terms of the asymptotic covariance of the noise:
In particular, the choice is a special case of SNR, for which asymptotic covariance admits the explicit form
What about computational complexity? In realistic applications of SNR the gain sequence will be of the form , where are approximations of . In a nonlinear model, is an approximation of , obtained using the two time-scale algorithm of devmey17a ; devmey17b . The resulting complexity is a barrier to application in high dimension. Steps towards resolving this obstacle are presented in this paper:
The parameters in the PolSA algorithm can be designed so that the error sequence enjoys all the attractive properties of SNR, but without the need for any matrix inversion.
NeSA is often simpler than PolSA in applications to RL. A formula for the asymptotic covariance of a variant of NeSA is obtained in this paper. While not equal to , the reduced complexity makes it a valuable option.
The techniques introduced in this paper are not directly applicable to batch gradient descent – an explanation is given in Section 2.2.
Portions of this article are taken from the recent survey devbusmey19 .
The present paper is built on a vast literature on optimization nes83 ; pol64 ; pol87 ; nes12 and stochastic approximation kontsi04 ; kusyin97 ; bor08a ; rup85 ; pol90 ; poljud92 . The work of Polyak is central to both thrusts: the introduction of momentum, and techniques to minimize variance in SA algorithms.
Many papers with similar titles focus on high dimensional optimization in which randomness is introduced to ease computational burden; an example is batch gradient descent — see wilrecjor16 for an extensive survey. Again, as explained in Section 2.2, we do not know if the concepts introduced in this paper can be applied to batch gradient descent.
Most closely related to the present work is the literature on ERM (empirical risk minimization) in which the sample path limit in (1) is replaced by a finite average, Katyusha16 ; SAGA14 ; jaikakkidnetsid17 . Under general conditions it can be shown that the sequence of ERM optimizers is convergent, and has optimal asymptotic covariance (a survey and further discussion is presented in jaikakkidnetsid17 ). The papers moubac11 ; bacmou13 ; gadpansaa18 ; duc16 ; jaikakkidnetsid17 establish the optimal convergence rate of for various algorithms.
The recent paper jaikakkidnetsid17 is most closely related to the present work, considering the shared goal of optimization of the asymptotic covariance, along with rapidly vanishing transients through algorithm design. The paper restricts to nonlinear optimization rather than the root finding problems considered here, which rules out application to many reinforcement learning algorithms. The metric for performance is slightly different, focusing on the rate of convergence of the expected loss, for which they obtain bounds for each iteration of the algorithm. Optimal asymptotic covariance is not established, but rather they obtain tight bounds on the regret.
The algorithms presented here do achieve the optimal asymptotic covariance, are not restricted to ERM, and we believe that in many applications they will be simpler to implement. This is especially true for the NeSA algorithm applied to Q-learning.
2 Momentum methods and applications
Consider first the deterministic root-finding problem. This will bring insight into the relationship between the three algorithms (2–4) discussed in the introduction. Since the Markovian disturbance is absent, the notation is used in place of . The goal remains the same: find the vector such that .
|Polyak’s heavy ball:||(12)|
where are positive constants. Nesterov’s algorithm was designed for extremal seeking, which is the special case for a real-valued function . The recursion (13) is the natural extension to the root-finding problem considered here.
The questions asked in this paper are posed in a stochastic setting, but analogous questions are:
Why restrict to a scalar momentum term ? Can a “momentum matrix” be designed to improve performance?
Can online algorithms be designed to approximate the optimal momentum matrix? If so, we require tools to investigate the performance of a matrix sequence :
Potential answers are obtained by establishing relationships between these deterministic recursions.
Consider the successive approximation algorithm (11) under the assumption of global convergence: as . Assume moreover that . We obtain by the mean-value theorem,
where lies on the line connecting the vectors and . It follows that . This suggests a heuristic: swap and in a given convergent algorithm to obtain a new algorithm that is simpler, but with desirable properties. This is similar in spirit to the coupling concept introduced in allore14
. Applying this heuristic to (14) results in
Assuming that an inverse exists, this becomes
We thus arrive at a possible answer to the question of optimal momentum: For the matrix sequence , the algorithm (14) can be expressed
The foregoing approximations suggest that this is an approximation of Newton-Raphson:
A Taylor series argument shows that the recursion (15) is approximated by
This is the special case of Nesterov’s algorithm (13) with and .
In this paper, the focus is root-finding in the stochastic setting described in the introduction. Strong justification for the stochastic analog of (15) is provided through a coupling bound between the respective algorithms: see Prop. 2.2. It is found that similar transformations lead to new algorithms for reinforcement learning and other applications.
2.1 Optimal matrix momentum for PolSA
Returning to the stochastic setting, the following special case of PolSA is the analog of (15) considered in this paper
where is a scalar constant, and are estimates of . The SNR algorithm is (2) in which (the Moore–Penrose pseudo inverse when necessary).
Certain simplifying assumptions are imposed to ease analysis. The linear model (5
) is adopted, which is subject to the following stability assumptions: For any eigenvalueof ,
It is assumed without loss of generality that . The algorithms are also simplified by replacing the estimate with its true value . The simplified SNR and PolSA algorithms are defined as follows:
Suppose that is a bounded sequence, and each entry is a function of a -uniformly ergodic Markov chain MT . Then,
A drawback with SNR is complexity introduced by the matrix inverse. The algorithm (2.1) is simpler and enjoys the same attractive properties. This is established through coupling under slightly stronger assumptions:
Suppose that the assumptions of Theorem 2.1 hold, and that are bounded martingale difference sequences. Let denote the iterates using SNR (19) and the iterates obtained using (20), with identical initial conditions.
Then, there is a square-integrable random variable such that
Consequently, the conclusions of Theorem 2.1 (i) and (ii) also hold for the PolSA algorithm.
Other than the two techniques introduced by Polyak, SNR and the Polyak-Ruppert averaging technique, PolSA is the only other SA algorithm that achieves optimal asymptotic variance.
An illustration is provided in Fig. 1 for the linear model in which is symmetric and positive definite, with , and is i.i.d. and Gaussian. Shown are the trajectories of (a short run for this example, since is over one million).
A common application of stochastic approximation is convex optimization. In this setting, for a sequence of smooth functions , and then . The theory developed in this paper is applicable to this general class of optimization problems, except in degenerate cases. For comparison, consider the quadratic optimization problem in which , with . The stability condition (18) holds provided : a condition familiar in the convex optimization literature.
Batch gradient descent
Consider the following special case in which a deterministic function is given. The goal is again to compute a stationary point . A sequence of sparse matrices is constructed, and then for each
It is assumed that , so that are unbiased samples of the gradient; see loiric17 for a recent survey. Randomness vanishes as approaches :
The covariance matrix is zero, and hence also the asymptotic covariance (10). There is no known CLT or LIL in this case, and hence the results of this paper do not apply to this special setting.
3 Variance analysis of the NeSA algorithm
The NeSA algorithm has a finite asymptotic covariance that can be expressed as the solution to a Lyapunov equation. We again restrict to the linear model, so that the recursion (4) (with ) becomes
Stability of the recursion requires a strengthening of (18). Define the linear operator as follows: For any matrix ,
The following assumptions are imposed throughout:
are bounded martingale difference sequences. Moreover, for any matrix ,
where is the natural filtration: .
The bounds in (18) hold, and the linear operator has spectral radius bounded by unity.
Define the -dimensional vector processes
A covariance matrix sequence is the focus of this section:
Suppose that (N1) and (N2) hold. Then, the covariance sequence is convergent:
in which the second limit is the solution to the Lyapunov equation
(an explicit solution is given in eqn. (53)), and
The following approximations hold for and the scaled covariance :
The second iteration is used together with the following result to obtain (28).
The following approximation holds:
Proof of Prop. 3.1.
4 Application to Q-learning
Consider the discounted-cost control problem with state-action process denoted evolving on the finite set , cost function , and discount factor . The Q-function solves the Bellman equation: . For a -dimensional basis of functions and a given parameter , the corresponding estimate is defined by
Watkins’ Q-learning algorithm is designed to compute the exact Q-function that solves the corresponding Bellman errors wat89 ; watday92a . In this setting the basis is taken to be the set of indicator variables: , , with . The basic algorithm of Watkins can be written
in which the matrix gain is diagonal:
and , where (see csa10 for more details).
Among the other algorithms compared are
The SNR algorithm considered coincides with the Zap Q-learning algorithm of devmey17a ; devmey17b . Histograms for a particular 6-state MDP model with were considered in this prior work. Fig. 2 shows that the histograms for using PolSA-D and SNR nearly coincide with (performance for PolSA is similar). The histogram for NeSA shows a much higher variance, but this algorithm requires by far the least computation per iteration.
Fig. 3 shows the maximal Bellman error as a function of iterations for several algorithms.
Experiments were performed for larger examples. Results from two such experiments are shown in Fig. 4. The MDP model is again a stochastic shortest path problem. The model construction was based on the creation of a graph with
nodes, in which the probability of an edge between a pair of nodes is i.i.d. with probability. Additional edges are added, for each , to ensure the resulting graph is strongly connected.
The controls are as in the 6-state example: with probability the particle moves in the desired direction, and with remaining probability any of the neighboring nodes is chosen uniformly. Two exploration rules were considered: the “online” version described above (asynchronous version), and the offline “clock sampling” approach in which state-action pairs are chosen sequentially (synchronous version). At stage , if is the current pair, a random variable is chosen according to the distribution , and the entry of the Q-function is updated according to the particular algorithm using the triple . A significant change to Watkins’ iteration (33) in the synchronous setting is that is replaced by . This combined with deterministic sampling results in significant reduction in variance.
The synchronous speedy Q-learning recursion of azamunghakap11 appears similar to the NeSA algorithm with clock sampling, following the heuristic swapping of arguments used to motivate the algorithms introduced in this paper. We do not know if the swapping is justified for their algorithm.
Two graphs were used in the survey experiments, one resulting in an MDP with state-action pairs and another resulting in a larger MDP with state-action pairs. The plots in Fig. 4 show Bellman error as a function of iteration for the two cases. Comparison of the performance of algorithms in a deterministic exploration setting versus the online setting is also shown.
We have tested examples with over variables and obtain similar performance for the clock-sampling version of the algorithm. Performance for the PolSA-D algorithm degrades due to estimation error for in (34). Projection may be needed to improve reliability in high dimensions.
It is exciting to see how the intuitive transformation from SNR to PolSA and NeSA can be justified theoretically and in simulations. While the covariance of NeSA is not optimal, it is the simplest of the three algorithms and does well in the experiments we have conducted.
An important next step is to create adaptive techniques to ensure fast coupling or other ways to ensure fast forgetting of the initial condition. It is possible that techniques in jaikakkidnetsid17 may be adapted. We do not have a natural notion of expected loss in the general root finding problem, but it will be of interest to pursue analysis in the special case of nonlinear optimization.
-  Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. ArXiv e-prints, Mar. 2016.
-  Z. Allen-Zhu and L. Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. ArXiv e-prints, July 2014.
-  M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.
-  F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate . In Advances in Neural Information Processing Systems 26, pages 773–781. Curran Associates, Inc., 2013.
-  A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson.
-  V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.
-  J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233–246, 2002.
-  X. Chen. On the limit laws of the second order for additive functionals of Harris recurrent Markov chains. Probability Theory and Related Fields, 116(1):89–123, Jan 2000.
-  A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
-  A. M. Devraj, A. Bušić, and S. Meyn. Zap Q Learning – a user’s guide. In Proceedings of the fifth Indian Control Conference, January 2019.
-  A. M. Devraj and S. Meyn. Zap Q-Learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2235–2244. Curran Associates, Inc., 2017.
-  A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv e-prints, July 2017.
-  J. Duchi. Introductory lectures on stochastic optimization. Stanford Lecture Series, 2016.
-  S. Gadat, F. Panloup, and S. Saadane. Stochastic heavy ball. Electron. J. Statist., 12(1):461–529, 2018.
-  P. W. Glynn and D. Ormoneit. Hoeffding’s inequality for uniformly ergodic Markov chains. Statistics and Probability Letters, 56:143–146, 2002.
P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford.
Accelerating Stochastic Gradient Descent.ArXiv e-prints (and to appear, COLT 2018), Apr. 2017.
-  V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
-  V. Koval and R. Schwabe. A law of the iterated logarithm for stochastic approximation procedures in d-dimensional euclidean space. Stochastic Processes and their Applications, 105(2):299 – 313, 2003.
-  H. J. Kushner and G. G. Yin. Stochastic approximation algorithms and applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, 1997.
-  N. Loizou and P. Richtárik. Momentum and Stochastic Momentum for Stochastic Gradient, Newton, Proximal Point and Subspace Descent Methods. ArXiv e-prints, Dec. 2017.
-  S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. 1993 edition online.
-  E. Moulines and F. R. Bach. In Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011.
-  Y. Nesterov. A method of solving a convex programming problem with convergence rate . In Soviet Mathematics Doklady, 1983.
-  Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
-  B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
-  B. T. Polyak. Introduction to Optimization. Optimization Software Inc, New York, 1987.
-  B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika (in Russian). translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990.
-  B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.
-  D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985.
-  R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988.
Algorithms for reinforcement learning.
Synthesis lectures on artificial intelligence and machine learning, 4(1):1–103, 2010.
-  C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
-  J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
-  C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989.
-  C. J. C. H. Watkins and P. Dayan. -learning. Machine Learning, 8(3-4):279–292, 1992.
-  A. C. Wilson, B. Recht, and M. I. Jordan. A Lyapunov Analysis of Momentum Methods in Optimization. ArXiv e-prints, 2016.
Appendix A Limit theory for Markov chains and SA: Proof of Theorem 2.1
The error recursion for the linear SNR algorithm (19) can be expressed
for which . This linear recursion is convergent under the assumptions of the theorem.
where are functions of the underling Markov chain , where solves a certain Poisson’s equation. We have under the assumption that is a bounded function of and the Markov chain is -uniformly ergodic (Foster’s criterion holds using the function ). Since has bounded mean, the assumptions of  hold to establish the LIL for the SA recursion, and the CLT can be found in  (see Theorem 2.1 of Chapter 10 and the following discussion in Section 10.2.2).
Appendix B Coupling
The proof of Prop. 2.2 is based on a transformation of SNR so that it resembles PolSA with a vanishing disturbance sequence. This is essentially a reversal of the manipulations applied to derive (13) from an approximation of (14) at the start of Section 2.
For simplicity we take (this is without loss of generality by re-defining the matrix ).
It is simplest to first prove the result when is deterministic: .
b.1 Deterministic matrix sequence
The SNR and PolSA recursions with deterministic can be expressed, respectively
The recursion (36) is the definition of PolSA in this special case.
The recursion (35) is obtained by expressing SNR as follows:
Moving to the left-hand side completes the proof.
Denote and . The proof of Prop. 2.2 requires that we establish uniform bounds on these sequences.
The error sequence evolves according to the recursion
in which the sequence (37) satisfies the following:
is a bounded sequence
The partial sums are also bounded:
Recall that . Consequently,
The right hand side is a bounded and telescoping sequence; the bounds on easily follow.
The proof of the following lemma is routine.
The normalized error sequence evolves according to the recursion
Proof of Prop. 2.2 – deterministic case.
On summing each side of the identity in Lemma B.3 we obtain, for any ,
where the final term is bounded in for any fixed :
To prove that the sequence is bounded, let and satisfy the matrix inequality
Let denote the Banach space of sequences with finite norm
Let denote the linear operator on such sequences defined by for , and for
For sufficiently large , this is a contraction in : for some and any pair of bounded sequences,
It follows from the Banach fixed point theorem that the is a bounded sequence.
b.2 Proof of Prop. 2.2 for the general linear algorithm
Appendix C Variance analysis of the NeSA algorithm
Throughout this section it is assumed that the assumptions of Section 3.1 hold. We will initially impose an additional assumption:
(N3) The covariance sequence defined in (25) is bounded.
This simplifies the proof of convergence. Following this proof, we explain how the same arguments leading to convergence can be elaborated to establish boundedness.
Even with under this assumption, the convergence proof appears complex. We first provide a proof for the simpler PolSA algorithm (20).
c.1 Variance analysis of PolSA
The recursion (20) is expressed in state space form as follows:
Using the Taylor series approximation:
it follows that the process (24) evolves as
involving the matrices and , and the column vector :
Under the conditions of Prop. 3.2, the following approximations hold for and the scaled covariance :
Proof of Prop. 3.2.
From Assumption A.1, which implies that the eigenvalues of matrix lie within the open unit disc, it follows that the first recursion in (43) of Lemma C.1 can be approximated by a geometrically stable discrete-time Lyapunov recursion with a time-invariant, bounded input . The limit (30) directly follows.