Stochastic approximation (SA) algorithms are widely used in many areas, including stochastic control, communications, machine learning, statistical signal processing and reinforcement learning. There is now a very rich literature on SA algorithms, their applications and the associated theory (e.g., see the books [5, 14, 8] and references therein). One set of fundamental questions concerns the convergence of SA algorithms; there are various general techniques for establishing convergence, including the ODE method, dynamical systems methods, and Lyapunov-based methods. Much of the classical theory in stochastic approximation is asymptotic in nature, whereas in more recent work, particularly in the special case of stochastic optimization, attention has shifted to non-asymptotic results [15, 4].
The goal of this paper is to develop non-asymptotic bounds for a certain class of stochastic approximation procedures. The motivating impetus for this work was to gain deeper insight into the classical Q-learning algorithm from Markov decision processes and reinforcement learning [18, 20, 6, 7, 22]. It is a stochastic approximation algorithm for solving a fixed point equation involving the Bellman operator. In the discounted setting, this operator is contractive with respect to a sup norm, and also monotonic in the elementwise ordering. We show that these conditions can be viewed as special cases of a more general structure on the operators used in stochastic approximation for solving fixed point equations. In particular, we introduce monotonicity and quasi-contractivity conditions that are defined with respect to the partial order and gauge norms induced by an underlying cone. In the case of sup norm contractions, this underlying cone is the orthant cone, but other cones also arise naturally in applications. For instance, for SA procedures that operate in the space of symmetric matrices, the cone of positive semidefinite matrices induces the spectral order, as well as various forms of spectral norms. For a sequence of operators satisfying these cone monotonicity and quasi-contractivity conditions, we prove a general result (Theorem 1) that sandwiches the error at each iteration in terms of the partial order induced by the cone. By considering concrete choices of stepsize, such as linearly or polynomially decaying ones, we derive corollaries that yield non-asymptotic bounds on the error.
We specialize this general theory to the synchronous form of Q-learning in discounted Markov decision processes, and use it to derive non-asymptotic bounds on the $\ell_\infty$-error of Q-learning, for both polynomial stepsizes and a linearly rescaled stepsize. Notably, these results are the sharpest known to date, and are instance-specific, depending on the particular structure of the optimal Q-function underlying the problem. Our bounds, when considered in a uniform (or worst-case) sense over the class of $\gamma$-discounted MDPs, establish that the number of iterations required by synchronous Q-learning to reach a solution that is $\epsilon$-accurate in $\ell_\infty$-norm scales as . We show via a careful simulation study that this guarantee is unimprovable. It improves upon the best bounds on synchronous Q-learning from previous work , which exhibit a scaling; see Section 3.3 for an in-depth discussion of relevant past results on Q-learning. For context, we note that the speedy-Q-learning method, an extension of ordinary Q-learning, is known to have iteration complexity scaling as , the same as our guarantee for ordinary Q-learning. In addition, Azar et al. show that model-based Q-iteration exhibits a scaling, and that this is the best possible for any method in a minimax sense. Consequently, a corollary of our results is to reveal a gap between the performance of standard synchronous Q-learning and an optimal (model-based) procedure.
The remainder of this paper is organized as follows. In Section 2, we introduce the class of stochastic approximation algorithms analyzed in this paper, including some required background on cones and induced gauge norms. We then state our main result (Theorem 1), as well as some of its corollaries for particular stepsize choices (Corollaries 1 and 2). In Section 3, we turn to the analysis of Q-learning. After introducing the necessary background in Section 3.1, we devote Section 3.2 to the statement of our two main results on Q-learning, namely $\ell_\infty$-norm bounds for a linear rescaled stepsize (Corollary 3) and for polynomially decaying stepsizes (Corollary 4). In Section 3.3, we discuss past work on Q-learning and compare our guarantees to the best previously known non-asymptotic results. In Section 3.4, we describe and report the results of a simulation study that provides empirical evidence for the sharpness of our bounds in a worst-case sense. We conclude with a discussion in Section 4, with more technical aspects of our proofs deferred to the appendices.
2 A general convergence result
In this section, we set up the stochastic approximation algorithms of interest. Doing so requires some background on cones, monotonic operators on cones, and gauge norms induced by order intervals, which we provide in Section 2.1. In Section 2.2, we state a general result (Theorem 1) that sandwiches the iterate error using the partial order induced by the cone. This result holds for arbitrary stepsizes in the interval ; we follow up by using this general result to derive specific bounds that apply to stepsize choices commonly used in practice (cf. Corollaries 1 and 2).
2.1 Background and problem set-up
Consider a topological vector space, and an operator that maps to itself. Our goal is to compute a fixed point of —that is, an element such that —assuming that such an element exists and is unique. In various applications, we are not able to evaluate exactly, but instead are given access to a sequence of auxiliary operators , and permitted to compute the quantity for any . Here denotes an error term, allowed to be arbitrary in the analysis of this section. In the simplest case, we have for all , but the additional generality afforded by the setup here turns out to be useful.
Given an observation model of this type, we consider algorithms that generate a sequence according to the recursion
The stepsize parameters are assumed to belong to the interval , and should be understood as design parameters. Our primary goals are to specify conditions on the auxiliary operators , noise sequence , and stepsize sequence under which the sequence converges to . Moreover, we seek to develop tools for proving non-asymptotic bounds on the error—i.e., guarantees that hold for finite iterations, as opposed to in the limit as increases to infinity.
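To make this set-up concrete, here is a minimal Python sketch. It assumes that the recursion (1) takes the standard convex-combination form, in which the new iterate is (1 - stepsize) times the old iterate plus the stepsize times the noisy operator evaluation. The function name `sa_iterates`, the toy affine operator, the Gaussian noise, and the particular stepsize sequence are all hypothetical choices made only for illustration.

```python
import numpy as np

def sa_iterates(theta0, operators, noises, stepsizes):
    """Run theta <- (1 - lam) * theta + lam * (H_t(theta) + w_t).

    Assumed form of the recursion; `operators`, `noises`, and `stepsizes`
    are equal-length sequences (a hypothetical interface).
    """
    theta = np.asarray(theta0, dtype=float)
    path = [theta.copy()]
    for H_t, w_t, lam in zip(operators, noises, stepsizes):
        theta = (1.0 - lam) * theta + lam * (H_t(theta) + w_t)
        path.append(theta.copy())
    return np.array(path)

# Toy example: H(theta) = nu * theta + b has the unique fixed point b / (1 - nu).
rng = np.random.default_rng(0)
nu, b, T = 0.5, np.array([1.0, -2.0]), 2000
H = lambda theta: nu * theta + b
theta_star = b / (1.0 - nu)

# Illustrative stepsizes in the unit interval (an assumption, not the paper's choice).
stepsizes = [1.0 / (1.0 + (1.0 - nu) * k) for k in range(1, T + 1)]
noises = 0.1 * rng.standard_normal((T, 2))
path = sa_iterates(np.zeros(2), [H] * T, noises, stepsizes)
final_error = np.max(np.abs(path[-1] - theta_star))
```

In this toy run the same operator is used at every iteration and has the target as its fixed point, which corresponds to the simplest case described above.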
Of course, convergence guarantees are not possible without imposing assumptions on the auxiliary operators. In this paper, motivated by the analysis of Q-learning and related algorithms in reinforcement learning, we assume that they satisfy certain properties that depend on a cone contained in . Let us first introduce some relevant background on cones, order intervals and induced gauge norms. Any cone induces a partial order on via the relation
Cones that have non-empty interiors and are topologically normal [1, 13] can also be used to induce a certain class of gauge norms as follows. For a given element , the associated order interval is the set
and it defines the Minkowski (gauge) norm given by
Let us consider some concrete examples to illustrate.
Example 1 (Orthant cone and $\ell_\infty$-norms).
Suppose that is the usual Euclidean space , and consider the orthant cone , where . It induces the usual elementwise ordering—viz. if and only if for all . Setting to be the all-ones vector, we find that
Thus, this choice of induces the usual $\ell_\infty$-norm on vectors. Setting to some other vector contained in the interior of the orthant cone yields a weighted $\ell_\infty$-norm.
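As a numerical check of Example 1, the following Python sketch evaluates the gauge norm for the orthant cone. It assumes the standard Minkowski gauge of the order interval, that is, the smallest scaling r for which the vector lies elementwise between -r e and r e. The function name `orthant_gauge_norm` and the test vectors are hypothetical.

```python
import numpy as np

def orthant_gauge_norm(theta, e):
    """Smallest r >= 0 with -r*e <= theta <= r*e elementwise (requires e > 0)."""
    theta = np.asarray(theta, dtype=float)
    e = np.asarray(e, dtype=float)
    if np.any(e <= 0):
        raise ValueError("e must lie in the interior of the orthant cone")
    return float(np.max(np.abs(theta) / e))

theta = np.array([3.0, -4.0, 1.0])
sup_norm = orthant_gauge_norm(theta, np.ones(3))                  # e = all-ones vector
weighted = orthant_gauge_norm(theta, np.array([1.0, 2.0, 4.0]))   # weighted variant
```

With the all-ones vector the result coincides with `np.linalg.norm(theta, np.inf)`; any other strictly positive vector rescales each coordinate before the maximum is taken.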
Example 2 (Symmetric matrices and spectral norm).
Now suppose that is the space of -dimensional symmetric matrices . Letting
denote the eigenvalues of a matrix, consider the cone of positive semidefinite matrices
This cone induces the spectral ordering if and only if all the eigenvalues of are non-negative. Setting to be the identity matrix, we have
so that the induced gauge norm is the spectral norm on symmetric matrices.
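The same check can be carried out for Example 2. The sketch below assumes the gauge norm is the smallest r for which the matrix lies between -r I and r I in the spectral order, which holds exactly when every eigenvalue lies in [-r, r]. The function name `psd_gauge_norm` and the test matrix are hypothetical.

```python
import numpy as np

def psd_gauge_norm(A):
    """Smallest r >= 0 with -r*I <= A <= r*I in the spectral order."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        raise ValueError("A must be symmetric")
    eigenvalues = np.linalg.eigvalsh(A)   # real eigenvalues of a symmetric matrix
    return float(np.max(np.abs(eigenvalues)))

A = np.array([[2.0, 1.0],
              [1.0, -3.0]])
gauge = psd_gauge_norm(A)
spectral = np.linalg.norm(A, 2)   # largest singular value
```

For a symmetric matrix the singular values are the absolute eigenvalues, so the two quantities agree up to round-off.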
In this paper, we assume that the operators in the recursion (1) satisfy two properties: cone-monotonicity and cone-quasi-contractivity. More precisely, we assume that for each , the operator is monotonic with respect to the cone, meaning that
Moreover, we assume that for some element , it is cone-quasi-contractive, meaning that there is some and some such that
Here the terminology “quasi” denotes the fact that the relation (4b) only need hold for a single , as opposed to in a uniform sense. Note that it is not necessary that be a fixed point of each .
2.2 A sandwich result and its corollaries
With this set-up, we now turn to the analysis of the sequence generated by a recursion of the form (1). We first state a general “sandwich” result, which provides both lower and upper bounds on the error in terms of the partial order induced by the cone. This result holds for any sequence of stepsizes contained in the interval . By specializing this general theorem to particular stepsize choices that are common in stochastic approximation, we obtain non-asymptotic upper bounds on the error, as measured in the cone-induced norm .
Our results depend on a form of effective noise, defined as follows
Note that the effective noise at iteration is the sum of the “defect” in the operator —meaning its failure to preserve the target as a fixed point—and the original error term introduced in our set-up.
Our bounds involve the sequence of elements in defined via the recursion
where denotes the zero element in . They also involve the sequences of non-negative scalars
Theorem 1. Consider a sequence of operators that are monotonic (4a) and -quasi-contractive (4b) with respect to a cone with gauge norm . Then for any sequence of stepsizes in the interval , the iterates generated by the recursion (1) satisfy the sandwich relation
where denotes the partial ordering induced by the cone.
See Appendix A for the proof.
Theorem 1 is a general result that applies to any choice of stepsizes that belong to the unit interval . By specializing the stepsize choice, we can use the sandwich relation (7) to obtain concrete bounds on the error . In doing so, we specialize to the case , so that all the operators share the same quasi-contractivity coefficient .
We begin by considering a sequence of stepsizes in the interval that satisfy the bound
Note that the usual linear stepsize does not satisfy this bound for . Examples of stepsizes that do satisfy this bound are the rescaled linear stepsize , valid once , as well as the shifted version of rescaled linear stepsize , valid for all iterations .
Corollary 1 (Bounds for linear stepsizes).
By applying the stepsize bound (8) repeatedly to the recursion (10a), we find that . Applying this same identity to the recursion (10b) yields the bound . Combining these two inequalities, along with the additional term from the bound (7) in Theorem 1, yields the claim (9). ∎
It is worth pointing out why the linear stepsize is excluded from our theory. If we adopt this stepsize choice and substitute into the recursion (10a), then we find that
This behavior makes clear that an unrescaled linear stepsize will lead to bounds with exponential dependence on . It should be noted that this kind of sensitivity to the choice of constants is well-documented when using linear stepsizes for stochastic optimization; e.g., see Section 2.1 of Nemirovski et al. for some examples showing slow rates when the strong convexity constant is mis-estimated. As we discuss at more length in Section 3.3, this type of exponential scaling has also been documented in past work on Q-learning [21, 10].
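The contrast between the two stepsize choices can be seen in a noiseless scalar example. The Python sketch below assumes a linear operator with contraction coefficient `nu` and fixed point zero, so that each update multiplies the error by one minus (1 - nu) times the stepsize. It also assumes that the rescaled linear stepsize has the form 1 / (1 + (1 - nu) k). Both the function name `contraction_factor` and these specific forms are illustrative assumptions.

```python
import numpy as np

def contraction_factor(stepsizes, nu):
    """Product of (1 - (1 - nu) * lam_k): the factor by which the initial
    error shrinks in a noiseless recursion with H(theta) = nu * theta."""
    stepsizes = np.asarray(stepsizes, dtype=float)
    return float(np.prod(1.0 - (1.0 - nu) * stepsizes))

T = 10_000
k = np.arange(1, T + 1)
for nu in (0.5, 0.9, 0.99):
    plain = contraction_factor(1.0 / k, nu)                          # lam_k = 1/k
    rescaled = contraction_factor(1.0 / (1.0 + (1.0 - nu) * k), nu)  # assumed rescaled form
    print(f"nu={nu}: 1/k -> {plain:.3e}, rescaled -> {rescaled:.3e}")
```

Under these assumptions the rescaled product telescopes to 1 / (1 + (1 - nu) T), whereas the product for the plain 1/k stepsize decays only polynomially in T with exponent (1 - nu), so reaching a fixed accuracy requires a number of iterations that grows exponentially in 1 / (1 - nu).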
Corollary 2 (Bounds for polynomial stepsizes).
Under the assumptions of Theorem 1, consider the sequence of stepsizes for some . Then for all iterations , we have
Observe that both of the recursions (10a) and (10b) hold for general stepsizes in the interval . In order to simplify these expressions, we need to bound the products of various stepsizes. We claim that for any positive integers , we have
The proof is straightforward. From the inequality , valid for , we find that . Now since the function is decreasing on the positive real line, we have
Combining the pieces yields the claimed bound. ∎
3 Applications to Q-learning
We now turn to the consequences of our general results for the problem of Q-learning in the tabular setting.
3.1 Background and set-up
Here we provide only a very brief introduction to Markov decision processes and the Q-learning algorithm; the reader can consult various standard sources (e.g., [18, 20, 6, 7, 22]) for more background. We consider a Markov decision process (MDP) with a finite set of possible states , and a finite set of possible actions . The dynamics are probabilistic in nature and influenced by the actions: performing action while in state causes a transition to a new state, randomly chosen according to a probability distribution. Thus, underlying the MDP is a family of probability transition functions . The reward function maps state-action pairs to real numbers, so that is the reward received upon executing action while in state . A deterministic policy is a mapping from the state space to the action space, so that action is taken when in state .
For a given policy , the Q-function or state-action function measures the expected discounted reward obtained by starting in a given state-action pair, and then following the policy in all subsequent iterations. More precisely, for a given discount factor , we define
Naturally, we would like to choose the policy so as to optimize the values of the Q-function. From the classical theory of finite Markov decision processes [18, 20, 7], this task is equivalent to computing the unique fixed point of the Bellman operator. The Bellman operator is a mapping from to itself, whose -entry is given by
It is well-known that is a -contraction with respect to the -norm, meaning that
where the $\ell_\infty$ or sup norm is defined in the usual way, viz.
It is this contractivity that guarantees the existence and uniqueness of the fixed point of the Bellman operator (i.e., for which ).
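The contraction property can be checked numerically on a small random MDP. The Python sketch below assumes the standard tabular form of the Bellman operator: the reward plus the discount factor times the expected maximum over actions of the Q-function at the next state. The array layout, the function name `bellman_operator`, and the random instance are hypothetical.

```python
import numpy as np

def bellman_operator(Q, P, r, gamma):
    """Population Bellman operator on tabular Q-functions.

    Q, r: arrays of shape (S, A); P: array of shape (S, A, S) with
    P[x, u, :] the next-state distribution; gamma: discount factor in (0, 1).
    """
    next_value = Q.max(axis=1)            # max over actions at each next state
    return r + gamma * (P @ next_value)   # expectation over next states

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)        # normalize rows into probability vectors
r = rng.random((S, A))

Q1, Q2 = rng.standard_normal((S, A)), rng.standard_normal((S, A))
lhs = np.max(np.abs(bellman_operator(Q1, P, r, gamma) - bellman_operator(Q2, P, r, gamma)))
rhs = gamma * np.max(np.abs(Q1 - Q2))
```

Here `lhs` is the sup-norm distance between the two images and `rhs` is the discount factor times the sup-norm distance between the inputs, so the contraction property asserts `lhs <= rhs`.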
In the context of reinforcement learning, the transition dynamics are unknown, so that it is not possible to exactly evaluate the Bellman operator. Instead, given some form of random access to these transition dynamics, our goal is to compute an approximation to the optimal Q-function on the basis of observed state-action pairs. Watkins and Dayan introduced the idea of Q-learning, a form of stochastic approximation designed to compute the optimal Q-function. One can distinguish between the synchronous and asynchronous forms of Q-learning; we focus on the former here. (Footnote 1: Given bounds on the behavior of synchronous Q-learning, it is possible to transform them into guarantees for the asynchronous model via notions such as the cover time of the underlying Markov process; we refer the reader to the papers [10, 2] for instances of such conversions.) In the synchronous setting of Q-learning, we make observations of the following type. At each time and for each state-action pair , we observe a sample drawn according to the transition function . Equivalently stated, we observe a random matrix with independent entries, in which the entry indexed by is distributed according to .
Based on these observations, the synchronous form of the Q-learning algorithm generates a sequence of iterates according to the recursion
Here is a mapping from to itself, and is known as the empirical Bellman operator: its -entry is given by
By construction, for any fixed , we have , so that the empirical Bellman operator (17) is an unbiased estimate of the population Bellman operator (14).
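The synchronous scheme just described can be sketched in a few lines of Python. The code assumes that the update is a convex combination of the current iterate and the empirical Bellman operator applied to it, and it assumes a rescaled linear stepsize of the form 1 / (1 + (1 - gamma) k). The function names, the array layout, and the random test MDP are hypothetical; the reference solution is obtained by iterating the population Bellman operator.

```python
import numpy as np

def empirical_bellman(Q, next_states, r, gamma):
    """Empirical Bellman operator from one sampled next state per (x, u)."""
    return r + gamma * Q.max(axis=1)[next_states]

def sample_next_states(P, rng):
    """Draw one next state for every state-action pair (inverse-CDF sampling)."""
    cdf = np.cumsum(P, axis=-1)
    u = rng.random(P.shape[:2] + (1,))
    idx = (u > cdf).sum(axis=-1)
    return np.minimum(idx, P.shape[-1] - 1)   # guard against round-off in the CDF

def synchronous_q_learning(P, r, gamma, num_iters, rng):
    Q = np.zeros_like(r)
    for k in range(1, num_iters + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)   # assumed rescaled linear stepsize
        target = empirical_bellman(Q, sample_next_states(P, rng), r, gamma)
        Q = (1.0 - lam) * Q + lam * target
    return Q

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.5
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))

# Reference solution: iterate the population Bellman operator to convergence.
Q_star = np.zeros((S, A))
for _ in range(200):
    Q_star = r + gamma * (P @ Q_star.max(axis=1))

Q_hat = synchronous_q_learning(P, r, gamma, num_iters=5000, rng=rng)
error = np.max(np.abs(Q_hat - Q_star))
```

Each iteration draws a fresh, independent next state for every state-action pair, which matches the random-matrix description of the observations given above.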
There are different ways in which we can express the Q-learning recursion (16) in a form suitable for the application of Theorem 1 and its corollaries. One very natural approach, as followed in some past work on the problem (e.g., [23, 11, 7, 10]), is to rewrite the Q-learning update (16) as an application of the population Bellman operator with noise. In particular, we can write
where the noise matrix is zero-mean, conditioned on . Theorem 1 and its corollaries can then be applied with the orthant cone and the $\ell_\infty$ norm, along with the operators and quasi-contraction coefficients for all iterations .
For our purposes, it turns out to be more convenient to apply our general theory with a different and time-varying choice—namely, with for each . This choice satisfies the required assumptions, since it can be verified that each one of the random operators is monotonic with respect to the orthant ordering, and moreover
Setting leads to effective noise variables (as defined in equation (5)) of the form
These effective noise variables are especially easy to control. In particular, note that is an i.i.d. sequence of random matrices with zero mean, where entry
Here the expectations and are both computed over .
3.2 Non-asymptotic guarantees for Q-learning
With this set-up, we are now equipped to state some non-asymptotic guarantees for Q-learning. These bounds involve the quantity , corresponding to the total number of state-action pairs, as well as the span seminorm of given by
Note that this is a seminorm (as opposed to a norm), since we have whenever is constant for all state-action pairs. See §6.6.1 of Puterman for further background on the span seminorm and its properties. Finally, we also define the maximal standard deviation
where the variance was previously defined in equation (20).
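Both instance-dependent quantities are straightforward to compute for a tabular problem. The Python sketch below takes the span seminorm to be the largest entry minus the smallest entry, and it assumes that the variance in equation (20) at a state-action pair equals the squared discount factor times the variance, over the sampled next state, of the maximum over actions of the optimal Q-function. The function names and the small test instance are hypothetical.

```python
import numpy as np

def span_seminorm(theta):
    """Span seminorm: largest entry minus smallest entry (zero for constant theta)."""
    theta = np.asarray(theta, dtype=float)
    return float(theta.max() - theta.min())

def max_std(theta_star, P, gamma):
    """Largest standard deviation of the empirical Bellman update at theta_star.

    Assumes the variance at (x, u) is gamma^2 * Var[max_u' theta_star(x', u')]
    with x' drawn from P[x, u, :].
    """
    v = np.asarray(theta_star, dtype=float).max(axis=1)   # value at each next state
    mean = P @ v
    second_moment = P @ (v ** 2)
    var = np.clip(second_moment - mean ** 2, 0.0, None)   # clip tiny negative round-off
    return float(gamma * np.sqrt(var.max()))

# Two states, two actions, uniform transitions (hypothetical instance).
P = np.full((2, 2, 2), 0.5)
theta_star = np.array([[0.0, 0.0],
                       [2.0, 2.0]])
span_value = span_seminorm(theta_star)
sigma_value = max_std(theta_star, P, gamma=0.5)
```

A constant Q-function has zero span and zero maximal standard deviation, which is why the span is only a seminorm.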
With these definitions in place, we are now ready to state bounds on the expected $\ell_\infty$-norm error for Q-learning with rescaled linear stepsizes:
Corollary 3 (Q-learning with rescaled linear stepsize).
Consider the step size choice . Then there is a universal constant such that for all iterations , we have
A few remarks about the bound (23) are in order. Naturally, the first term (involving ) measures how quickly the error due to the initialization decays. The rate for this term is , which is to be expected with a linearly decaying step size. The second term in curly braces arises from the fluctuations of the noise in Q-learning, in particular via a Bernstein bound (see Lemma 3). The term with corresponds to the standard deviation of the effective noise terms (19), whereas the term with arises from the boundedness of the noises. Finally, while we have stated a bound on the expected error, it is also possible to derive a high-probability bound: in particular, if we replace the terms with for a universal constant , then the bounds hold with probability at least . (See Lemma 2 in Appendix B.1.2 for a bound on the moment generating function of the relevant noise terms.)
Next we analyze the case of Q-learning with a polynomially decaying stepsize.
Corollary 4 (Q-learning with polynomial stepsize).
Consider the step size choice for some . Then there is a constant , universal apart from dependence on , such that for all iterations , we have
At a high level, the interpretation of this bound is similar to that of the bound in Corollary 3: the first term corresponds to the initialization error, whereas the second term corresponds to the fluctuations induced by the stochasticity of the update. When taking the much larger polynomial stepsizes—in contrast to the linear stepsize case—the initialization error vanishes much more quickly, in particular as an exponential function of . On the other hand, the noise terms exhibit larger fluctuations—with the two terms in the Bernstein bound scaling as and .
3.3 Comparison to past work and worst-case guarantees
There is a very large body of work studying Q-learning in different settings and under various criteria. Here we focus only on the subset of work that has given bounds on the $\ell_\infty$-error for discounted problems, which is most relevant for direct comparison to our results. The Q-learning algorithm was initially introduced and studied by Watkins and Dayan. General asymptotic results on the convergence of Q-learning were given by Tsitsiklis and Jaakkola et al., who made an explicit connection to stochastic approximation. Szepesvári gave an asymptotic analysis showing (among other results) that the convergence rate of Q-learning with linear stepsizes can be exponentially slow as a function of . Bertsekas and Tsitsiklis provided a general framework for the analysis of stochastic approximation of the Q-learning type, and used it to provide asymptotic convergence guarantees for a broad range of stepsizes. Using this same framework, Even-Dar and Mansour performed an epoch-based analysis that led to non-asymptotic bounds on the behavior of Q-learning, both for the non-rescaled linear stepsize , and the polynomial stepsizes for . It is these non-asymptotic results that are most directly comparable to our Corollary 4.
In order to make some precise comparisons, consider the class of MDPs in which the reward function is uniformly bounded as
Bounds from past work are given in terms of the iteration complexity of the algorithms, meaning the minimum number of iterations required to drive the expected $\ell_\infty$-error below . (Footnote 2: In fact, they stated their results as high-probability bounds but, up to some additional logarithmic factors, these are the same as the bounds on expected error.)
3.3.1 Linear stepsizes
For the unrescaled linear stepsizes , Even-Dar and Mansour proved a pessimistic result: namely, that Q-learning with this step size has an iteration complexity that grows exponentially in the quantity . As noted previously, earlier work by Szepesvári had given an asymptotic analogue of this poor behavior of Q-learning with this linear stepsize. Moreover, as we discussed following Corollary 1, this type of exponential scaling will also arise if our general machinery is applied with the ordinary linear stepsize.
Let us now turn to the bounds given by Corollary 3 using the rescaled linear stepsize . Translating these bounds into iteration complexity, we find that taking
iterations is sufficient to guarantee -accuracy in expected norm. Here the notation denotes an inequality that holds with constants and log factors dropped, so as to simplify comparison of results.
As will be established momentarily, all of the $\gamma$-dependent quantities in the complexity estimate (26) scale at most as polynomial functions of . In fact, our theory establishes that the iteration complexity of Q-learning with rescaled linear stepsize, in the worst case, scales as . See Table 1 for a summary.
3.3.2 Polynomial stepsizes
We now turn to the polynomial step sizes for , and compare our results to past work. For these polynomial step sizes, Even-Dar and Mansour  (in Theorem 2 of their paper) proved that for any MDP with -bounded rewards and discount factor , it suffices to take at most
in order to drive the error below . Here as before, our notation indicates that we are dropping constants and other logarithmic factors (including those involving ).
On the other hand, Corollary 4 in this paper guarantees that for a $\gamma$-discounted MDP with optimal Q-function , initializing at and taking
steps is sufficient to achieve an $\epsilon$-accurate estimate. As shown in the next section, choosing optimizes the trade-off between the two terms in the worst case, and leads to an overall scaling, as with Q-learning with rescaled linear stepsizes. Again, see Table 1 for a summary. Moreover, we clarify in the next section the relation between the bound (28) and the earlier result (27).
3.3.3 Worst-case guarantees
In our bounds for Q-learning, the instance-specific difficulty enters via the span seminorm and the maximal standard deviation . In this section, we bound these quantities in a worst-case sense. Doing so allows us to give a uniform version of our earlier guarantee for Q-learning with rescaled linear stepsize, and to make a more precise comparison between the bounds (27) and (28). In particular, let denote the set of all optimal Q-functions that can be obtained from a $\gamma$-discounted MDP with an -uniformly bounded reward function (as in equation (25)).
Lemma 1. Over the class , we have the uniform bounds
See Appendix B.1.1 for the proof of this lemma.
Lemma 1 allows us to derive uniform versions of our previous iteration complexity bounds. For simplicity, we assume initialization at , so that . For the rescaled linear stepsize, for all , we have
On the other hand, for the polynomial step size, we have
Note that this bound shows a trade-off between the two terms as a function of . Setting optimizes the trade-off in terms of , and as shown in Table 1 yields the scaling , where we disregard logarithmic terms. Note that this has the same scaling in as the linear rescaled bound (30), but inferior behavior in .
| Stepsize choice | Previous work | This paper |
| Linear rescaled | No results given | |
3.4 Simulation study of $\gamma$-dependence
It is natural to wonder whether or not the bounds given in Corollaries 3 and 4 give sharp scalings for the dependence of Q-learning on the discount factor . In this section, we provide empirical evidence for the sharpness of our bounds.
3.4.1 A class of “hard” problems
In order to do so, we consider a class of MDPs introduced in past work by Azar et al., and used to prove minimax lower bounds. For our purposes, namely exploring sharpness with respect to the discount , it suffices to consider an especially simple instance of these “hard” problems. As illustrated in Figure 1(a), this MDP consists of a five-element state space , and a two-element action space , shorthand for “left” and “right” respectively. When in state , taking action yields a deterministic transition (i.e., with probability ) to state , whereas taking action leads to a deterministic transition to state . When in state , taking either action leads to a transition to state with probability , and remaining in state with probability . (The same assertion applies to the behavior in state , with state replaced by state .) Finally, both states and are absorbing states. The reward function is zero in every state except for states and , for which we have
A straightforward computation yields that the optimal Q-function has the form
For any , it is valid to set .
If we run Q-learning either with a rescaled linear stepsize (as in Corollary 3) or a polynomial stepsize (as in Corollary 4), then as shown in Figure 1(b), we see convergence at the rate for the linear stepsize, and for the polynomial stepsize. This is consistent with the theory, and standard for stochastic approximation. Of most interest to us is the behavior of the curves as the discount factor is changed; as seen in Figure 1(b), the curves shift upwards, reflecting the fact that problems with larger values of are harder. We would like to understand these shifts in a quantitative manner.
3.4.2 Behavior of and
For this particular class of problems, let us compute the quantities and that play a key role in our bounds. First, observe that with our choice of from above, we have , which implies that
Recalling that in our construction, observe that (up to a constant factor) this Q-function saturates the worst-case upper bound on from Lemma 1. As for the maximal standard deviation term, as shown in Appendix C, as long as , it is lower bounded as
By comparing with Lemma 1, we see that the maximal standard deviation saturates the worst-case upper bound, again up to a constant factor. Thus, the constructed class of problems is “hard”, at least from the point of view of maximizing the $\gamma$-dependent terms in our upper bounds.
Consequently, from our earlier calculations in Table 1, we expect that for any fixed , the iteration complexity of Q-learning as a function of should be upper bounded as , and moreover, this bound should hold for either the rescaled linear stepsize, or the polynomial stepsize with . If our bounds are sharp, at least in the worst-case sense, then we expect to see that this predicted bound is met with equality in simulation. Accordingly, our numerical simulations were designed to test the correctness of this prediction.
Figure 2 illustrates the results of our simulations. Panel (a) illustrates how the iteration complexity was estimated in simulation. For a given algorithm and setting of , we ran the algorithm for steps, thereby obtaining a path of $\ell_\infty$-norm errors at each iteration . We averaged these paths over a total of independent trials. Given these Monte Carlo estimates of the average $\ell_\infty$-error, for a given , we computed by finding the smallest iteration at which the estimated $\ell_\infty$-error falls below . Panel (a) illustrates two instances of this calculation, for the settings and respectively.
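The estimation procedure described above can be sketched as follows in Python: average the error paths over trials, then report the first iteration at which the averaged error falls below the tolerance. The function name `estimate_iteration_complexity`, the array layout, and the synthetic error paths used for the check are hypothetical and do not reproduce the simulation settings of the study.

```python
import numpy as np

def estimate_iteration_complexity(error_paths, eps):
    """Smallest iteration at which the trial-averaged error falls below eps.

    error_paths: array of shape (num_trials, num_steps); entry [i, t] is the
    error of trial i at iteration t + 1. Returns None if eps is never reached.
    """
    mean_error = np.asarray(error_paths, dtype=float).mean(axis=0)
    below = np.flatnonzero(mean_error < eps)
    return int(below[0]) + 1 if below.size else None

# Synthetic check with a known decay: every trial has error exactly 1 / sqrt(t).
t = np.arange(1, 1001)
paths = np.tile(1.0 / np.sqrt(t), (5, 1))
T_eps = estimate_iteration_complexity(paths, eps=0.15)
```

Because the averaged path is a Monte Carlo estimate, the returned iteration count is itself random; using more trials reduces its variability.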
For the fixed tolerance , we repeated this Monte Carlo estimation procedure in order to estimate the quantity for each