Stochastic approximation with cone-contractive operators: Sharp ℓ_∞-bounds for Q-learning

Motivated by the study of Q-learning algorithms in reinforcement learning, we study a class of stochastic approximation procedures based on operators that satisfy monotonicity and quasi-contractivity conditions with respect to an underlying cone. We prove a general sandwich relation on the iterate error at each time, and use it to derive non-asymptotic bounds on the error in terms of a cone-induced gauge norm. These results are derived within a deterministic framework, requiring no assumptions on the noise. We illustrate these general bounds in application to synchronous Q-learning for discounted Markov decision processes with discrete state-action spaces, in particular by deriving non-asymptotic bounds on the ℓ_∞-norm for a range of stepsizes. These results are the sharpest known to date, and we show via simulation that the dependence of our bounds cannot be improved in a worst-case sense. These results show that relative to a model-based Q-iteration, the ℓ_∞-based sample complexity of Q-learning is suboptimal in terms of the discount factor γ.

1 Introduction

Stochastic approximation (SA) algorithms are widely used in many areas, including stochastic control, communications, machine learning, statistical signal processing and reinforcement learning, among others. There is now a very rich literature on SA algorithms, their applications and the associated theory (e.g., see the books [5, 14, 8] and references therein). One set of fundamental questions concerns the convergence of SA algorithms; there are various general techniques for establishing convergence, including the ODE method, dynamical systems, and Lyapunov-based methods, among others. Much of the classical theory in stochastic approximation is asymptotic in nature, whereas in more recent work, particularly in the special case of stochastic optimization, attention has shifted to non-asymptotic results [15, 4].

The goal of this paper is to develop some non-asymptotic bounds for a certain class of stochastic approximation procedures. The motivating impetus for this work was to gain a deeper insight into the classical Q-learning algorithm [26] from Markov decision processes and reinforcement learning [18, 20, 6, 7, 22]. It is a stochastic approximation algorithm for solving a fixed point equation involving the Bellman operator. In the discounted setting, this operator is contractive with respect to a sup norm, and also monotonic in the elementwise ordering. We show that these conditions can be viewed as special cases of a more general structure on the operators used in stochastic approximation for solving fixed point equations. In particular, we introduce monotonicity and quasi-contractivity conditions that are defined with respect to the partial order and gauge norms induced by an underlying cone. In the case of sup norm contractions, this underlying cone is the orthant cone, but other cones also arise naturally in applications. For instance, for SA procedures that operate in the space of symmetric matrices, the cone of positive semidefinite matrices induces the spectral order, as well as various forms of spectral norms. For a sequence of operators satisfying these cone monotonicity and quasi-contractivity conditions, we prove a general result (Theorem 1) that sandwiches the error at each iteration in terms of the partial order induced by the cone. By considering concrete choices of stepsize (such as linearly or polynomially decaying ones), we derive corollaries that yield non-asymptotic bounds on the error.

We specialize this general theory to the synchronous form of Q-learning in discounted Markov decision processes, and use it to derive non-asymptotic bounds on the $\ell_\infty$-error of Q-learning, for both polynomial stepsizes and a linearly rescaled stepsize. Notably, these results are the sharpest known to date, and are instance-specific, depending on the particular structure of the optimal Q-function underlying the problem. Our bounds, when considered in a uniform (or worst-case) sense over the class of $\gamma$-discounted MDPs, establish that the number of iterations required by synchronous Q-learning to reach a solution that is $\epsilon$-accurate in $\ell_\infty$-norm scales as $\frac{1}{(1-\gamma)^4 \epsilon^2}$. We provide simulation evidence that this guarantee cannot be improved in a worst-case sense. It improves upon the best bounds on synchronous Q-learning from previous work [10], which exhibit a $(1-\gamma)^{-5}$ scaling; see Section 3.3 for an in-depth discussion of relevant past results on Q-learning. For context, we note that the speedy-Q-learning method, an extension of ordinary Q-learning, is known to have iteration complexity scaling as $(1-\gamma)^{-4}$, the same as our guarantee for ordinary Q-learning. Moreover, Azar et al. [3] show that model-based Q-iteration exhibits a $(1-\gamma)^{-3}$ scaling, and that this is the best possible for any method in a minimax sense. Consequently, a corollary of our results is to reveal a gap between the performance of standard synchronous Q-learning and an optimal (model-based) procedure.

The remainder of this paper is organized as follows. In Section 2, we introduce the class of stochastic approximation algorithms analyzed in this paper, including some required background on cones and induced gauge norms. We then state our main result (Theorem 1), as well as some of its corollaries for particular stepsize choices (Corollaries 1 and 2). In Section 3, we turn to the analysis of Q-learning. After introducing the necessary background in Section 3.1, we devote Section 3.2 to the statement of our two main results on Q-learning, namely $\ell_\infty$-norm bounds for a linear rescaled stepsize (Corollary 3) and for polynomially decaying stepsizes (Corollary 4). In Section 3.3, we discuss past work on Q-learning and compare our guarantees to the best previously known non-asymptotic results. In Section 3.4, we describe and report the results of a simulation study that provides empirical evidence for the sharpness of our bounds in a worst-case sense. We conclude with a discussion in Section 4, with more technical aspects of our proofs deferred to the appendices.

2 A general convergence result

In this section, we set up the stochastic approximation algorithms of interest. Doing so requires some background on cones, monotonic operators on cones, and gauge norms induced by order intervals, which we provide in Section 2.1. In Section 2.2, we state a general result (Theorem 1) that sandwiches the iterate error using the partial order induced by the cone. This result holds for arbitrary stepsizes in the interval $(0,1)$; we follow up by using this general result to derive specific bounds that apply to stepsize choices commonly used in practice (cf. Corollaries 1 and 2).

2.1 Background and problem set-up

Consider a topological vector space $V$, and an operator $H$ that maps $V$ to itself. Our goal is to compute a fixed point of $H$, that is, an element $\theta^* \in V$ such that $H(\theta^*) = \theta^*$, assuming that such an element exists and is unique. In various applications, we are not able to evaluate $H$ exactly, but instead are given access to a sequence of auxiliary operators $\{H_k\}_{k \geq 1}$, and permitted to compute the quantity $H_k(\theta) + E_k$ for any $\theta \in V$. Here $E_k$ denotes an error term, allowed to be arbitrary in the analysis of this section. In the simplest case, we have $H_k = H$ for all $k$, but the additional generality afforded by the setup here turns out to be useful.

Given an observation model of this type, we consider algorithms that generate a sequence $\{\theta_k\}_{k \geq 1}$ according to the recursion

$$\theta_{k+1} = (1 - \lambda_k)\,\theta_k + \lambda_k \{H_k(\theta_k) + E_k\}. \tag{1}$$

The stepsize parameters $\lambda_k$ are assumed to belong to the interval $(0,1)$, and should be understood as design parameters. Our primary goals are to specify conditions on the auxiliary operators $\{H_k\}$, noise sequence $\{E_k\}$, and stepsize sequence $\{\lambda_k\}$ under which the sequence $\{\theta_k\}$ converges to $\theta^*$. Moreover, we seek to develop tools for proving non-asymptotic bounds on the error, i.e., guarantees that hold for finite iterations, as opposed to in the limit as $k$ increases to infinity.
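To make the update rule concrete, here is a minimal Python sketch of the recursion (1). It is not taken from the paper: the function name `stochastic_approximation` and the toy scalar operator $H(\theta) = 0.5\,\theta + 1$ (fixed point $2$, contraction coefficient $0.5$, no noise) are illustrative assumptions.

```python
import numpy as np

def stochastic_approximation(H_seq, E_seq, stepsizes, theta1):
    """Iterate theta_{k+1} = (1 - lam_k) * theta_k + lam_k * (H_k(theta_k) + E_k)."""
    theta = np.asarray(theta1, dtype=float)
    path = [theta.copy()]
    for H_k, E_k, lam in zip(H_seq, E_seq, stepsizes):
        theta = (1.0 - lam) * theta + lam * (H_k(theta) + E_k)
        path.append(theta.copy())
    return np.array(path)

# Hypothetical toy instance: H(theta) = 0.5 * theta + 1, fixed point theta* = 2.
K = 200
H_seq = [lambda th: 0.5 * th + 1.0] * K
E_seq = [np.zeros(1)] * K                      # noiseless, for illustration only
stepsizes = [1.0 / (1.0 + 0.5 * k) for k in range(1, K + 1)]

path = stochastic_approximation(H_seq, E_seq, stepsizes, theta1=np.array([10.0]))
final_error = abs(path[-1][0] - 2.0)
```

In this noiseless toy case the error contracts by the factor $1 - 0.5\lambda_k$ at each step, so the product telescopes and the final error is the initial error $8$ divided by $1 + 0.5K$.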

Of course, convergence guarantees are not possible without imposing assumptions on the auxiliary operators. In this paper, motivated by the analysis of Q-learning and related algorithms in reinforcement learning, we assume that they satisfy certain properties that depend on a cone $K$ contained in $V$. Let us first introduce some relevant background on cones, order intervals and induced gauge norms. Any cone $K$ induces a partial order on $V$ via the relation

$$\theta \preceq \theta' \iff (\theta' - \theta) \in K. \tag{2}$$

Cones that have non-empty interiors and are topologically normal [1, 13] can also be used to induce a certain class of gauge norms as follows. For a given element $e \in \mathrm{int}(K)$, the associated order interval is the set

$$[-e, e] := \{\theta \in V \mid -e \preceq \theta \preceq e\}, \tag{3a}$$

and it defines the Minkowski (gauge) norm given by

$$\|\theta\|_e = \inf\{s > 0 \mid \theta/s \in [-e, e]\}. \tag{3b}$$

Let us consider some concrete examples to illustrate.

Example 1 (Orthant cone and ℓ∞-norms).

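Suppose that $V$ is the usual Euclidean space $\mathbb{R}^d$, and consider the orthant cone $K = \{\theta \in \mathbb{R}^d \mid \theta_j \geq 0 \text{ for all } j \in [d]\}$, where $[d] := \{1, \ldots, d\}$. It induces the usual elementwise ordering: $\theta \preceq \theta'$ if and only if $\theta_j \leq \theta'_j$ for all $j \in [d]$. Setting $e$ to be the all-ones vector, we find that

$$\|\theta\|_e = \inf\{s > 0 \mid -1 \leq \theta_j/s \leq 1 \text{ for all } j \in [d]\} = \max_{j \in [d]} |\theta_j| = \|\theta\|_\infty.$$

Thus, this choice of $e$ induces the usual $\ell_\infty$-norm on vectors. Setting $e$ to some other vector contained in the interior of the orthant cone yields a weighted $\ell_\infty$-norm.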
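As a small illustration of Example 1, the following Python sketch evaluates the gauge norm (3b) for the orthant cone in closed form, namely $\max_j |\theta_j| / e_j$. The helper name `orthant_gauge_norm` and the test vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def orthant_gauge_norm(theta, e):
    """Gauge norm inf{s > 0 : -e <= theta / s <= e} for the orthant cone.

    For e in the interior of the orthant (all entries > 0), the infimum
    is attained at max_j |theta_j| / e_j.
    """
    theta = np.asarray(theta, dtype=float)
    e = np.asarray(e, dtype=float)
    if np.any(e <= 0):
        raise ValueError("e must lie in the interior of the orthant cone")
    return float(np.max(np.abs(theta) / e))

theta = np.array([3.0, -4.0, 1.0])
sup_norm = orthant_gauge_norm(theta, np.ones(3))                 # plain sup norm
weighted = orthant_gauge_norm(theta, np.array([1.0, 2.0, 0.5]))  # weighted sup norm
```

With the all-ones vector the result coincides with `np.max(np.abs(theta))`; any other interior vector rescales each coordinate before taking the maximum.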

Example 2 (Symmetric matrices and spectral norm).

Now suppose that $V$ is the space of $d$-dimensional symmetric matrices $\{M \in \mathbb{R}^{d \times d} \mid M = M^T\}$. Letting $\gamma_1(M), \ldots, \gamma_d(M)$ denote the eigenvalues of a matrix $M \in V$, consider the cone of positive semidefinite matrices

$$K_{\mathrm{PSD}} = \{M \in V \mid \gamma_j(M) \geq 0 \text{ for all } j \in [d]\}.$$

This cone induces the spectral ordering: $M \preceq M'$ if and only if all the eigenvalues of $M' - M$ are non-negative. Setting $e$ to be the identity matrix $I$, we have

$$\|M\|_e = \inf\{s > 0 \mid -1 \leq \gamma_j(M)/s \leq 1 \text{ for all } j \in [d]\} = \max_{j \in [d]} |\gamma_j(M)| = |\!|\!|M|\!|\!|_2,$$

so that the induced gauge norm is the spectral norm on symmetric matrices.
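The identity in Example 2 can be checked numerically. The sketch below (an illustration, with a hypothetical helper name and an arbitrary test matrix) computes the gauge norm with respect to the identity as the largest absolute eigenvalue and compares it with NumPy's spectral norm.

```python
import numpy as np

def psd_gauge_norm_identity(M):
    """Gauge norm of a symmetric matrix M for the PSD cone with e = I.

    The constraint -I <= M / s <= I holds exactly when every eigenvalue of
    M / s lies in [-1, 1], so the infimum is the largest absolute eigenvalue.
    """
    eigenvalues = np.linalg.eigvalsh(M)  # eigenvalues of a symmetric matrix
    return float(np.max(np.abs(eigenvalues)))

M = np.array([[2.0, 1.0],
              [1.0, -3.0]])
gauge = psd_gauge_norm_identity(M)
spectral = float(np.linalg.norm(M, 2))   # largest singular value
```

For symmetric matrices the singular values are the absolute eigenvalues, which is why the two quantities agree.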

In this paper, we assume that the operators $\{H_k\}_{k \geq 1}$ in the recursion (1) satisfy two properties: cone-monotonicity and cone-quasi-contractivity. More precisely, we assume that for each $k$, the operator $H_k$ is monotonic with respect to the cone, meaning that

$$H_k(\theta) \preceq H_k(\theta') \quad \text{whenever } \theta \preceq \theta'. \tag{4a}$$

Moreover, we assume that for some element $e \in \mathrm{int}(K)$, it is cone-quasi-contractive, meaning that there is some $\nu_k \in (0,1)$ and some $\theta^* \in V$ such that

$$\|H_k(\theta) - H_k(\theta^*)\|_e \leq \nu_k \|\theta - \theta^*\|_e \quad \text{for all } \theta \in V. \tag{4b}$$

Here the terminology “quasi” denotes the fact that the relation (4b) only need hold for a single $\theta^*$, as opposed to in a uniform sense. Note that it is not necessary that $\theta^*$ be a fixed point of each $H_k$.

2.2 A sandwich result and its corollaries

With this set-up, we now turn to the analysis of the sequence $\{\theta_k\}_{k \geq 1}$ generated by a recursion of the form (1). We first state a general “sandwich” result, which provides both lower and upper bounds on the error $\theta_k - \theta^*$ in terms of the partial order induced by the cone. This result holds for any sequence of stepsizes contained in the interval $(0,1)$. By specializing this general theorem to particular stepsize choices that are common in stochastic approximation, we obtain non-asymptotic upper bounds on the error, as measured in the cone-induced norm $\|\cdot\|_e$.

Our results depend on a form of effective noise, defined as follows:

$$W_k := H_k(\theta^*) - \theta^* + E_k. \tag{5}$$

Note that the effective noise at iteration $k$ is the sum of the “defect” in the operator $H_k$ (meaning its failure to preserve the target $\theta^*$ as a fixed point) and the original error term $E_k$ introduced in our set-up.

Our bounds involve the sequence of elements $\{P_k\}_{k \geq 1}$ in $V$ defined via the recursion

$$P_k := (1 - \lambda_{k-1})\,P_{k-1} + \lambda_{k-1} W_{k-1}, \quad \text{with initialization } P_1 = 0, \tag{6a}$$

where $0$ denotes the zero element in $V$. They also involve the sequences of non-negative scalars

$$b_k := \big(1 - (1 - \nu_{k-1})\lambda_{k-1}\big)\,b_{k-1}, \quad \text{with initialization } b_1 = \|\theta_1 - \theta^*\|_e, \text{ and} \tag{6b}$$

$$a_k := \big(1 - (1 - \nu_{k-1})\lambda_{k-1}\big)\,a_{k-1} + \nu_{k-1}\lambda_{k-1}\|P_{k-1}\|_e, \quad \text{with initialization } a_1 = 0. \tag{6c}$$

Theorem 1.

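Consider a sequence of operators $\{H_k\}_{k \geq 1}$ that are monotonic (4a) and $\nu_k$-quasi-contractive (4b) with respect to a cone $K$ with gauge norm $\|\cdot\|_e$. Then for any sequence of stepsizes $\{\lambda_k\}_{k \geq 1}$ in the interval $(0,1)$, the iterates $\{\theta_k\}_{k \geq 1}$ generated by the recursion (1) satisfy the sandwich relation

$$-b_k e - a_k e + P_k \preceq \theta_k - \theta^* \preceq b_k e + a_k e + P_k, \tag{7}$$

where $\preceq$ denotes the partial ordering induced by the cone.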

See Appendix A for the proof.
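The sandwich relation (7) can be sanity-checked numerically for the orthant cone with $e$ the all-ones vector. The sketch below is an illustration under stated assumptions, not the paper's experiment: the operator $H(\theta) = A\theta + c$ with a nonnegative matrix $A$ whose rows sum to $\nu$ is a hypothetical choice that is monotone and $\nu$-contractive in the sup norm, the noise is Gaussian, and the scalar $a_k$ is updated with weight $\nu\lambda_k$ on $\|P_k\|_e$, which is the form that the induction step of the argument produces.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_iters, nu = 4, 200, 0.8

# Hypothetical operator: A >= 0 entrywise (monotone) with row sums nu (contractive).
A = rng.random((d, d))
A = nu * A / A.sum(axis=1, keepdims=True)
theta_star = rng.normal(size=d)
c = theta_star - A @ theta_star          # ensures H(theta_star) = theta_star

def H(theta):
    return A @ theta + c

def sup_norm(v):
    return float(np.max(np.abs(v)))

e = np.ones(d)
theta = 5.0 * rng.normal(size=d)
P, a, b = np.zeros(d), 0.0, sup_norm(theta - theta_star)   # P_1, a_1, b_1
tol = 1e-10
sandwich_holds = True

for k in range(1, num_iters + 1):
    # Check the elementwise sandwich at iteration k.
    err = theta - theta_star
    upper = (b + a) * e + P
    lower = -(b + a) * e + P
    sandwich_holds = sandwich_holds and bool(
        np.all(err <= upper + tol) and np.all(err >= lower - tol)
    )
    lam = 1.0 / (1.0 + (1.0 - nu) * k)
    W = rng.normal(size=d)               # effective noise; here W_k = E_k
    # Update the scalars with the current P_k before P itself is updated.
    a = (1.0 - (1.0 - nu) * lam) * a + nu * lam * sup_norm(P)
    b = (1.0 - (1.0 - nu) * lam) * b
    P = (1.0 - lam) * P + lam * W
    theta = (1.0 - lam) * theta + lam * (H(theta) + W)
```

Because the relation is deterministic, the check should pass for any noise realization; the random seed only fixes the particular instance.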

Theorem 1 is a general result that applies to any choice of stepsizes that belong to the unit interval $(0,1)$. By specializing the stepsize choice, we can use the sandwich relation (7) to obtain concrete bounds on the error $\|\theta_k - \theta^*\|_e$. In doing so, we specialize to the case $\nu_k = \nu$ for all $k \geq 1$, so that all the operators share the same quasi-contractivity coefficient $\nu \in (0,1)$.

We begin by considering a sequence of stepsizes in the interval $(0,1)$ that satisfy the bound

$$1 - (1 - \nu)\lambda_k \leq \frac{\lambda_k}{\lambda_{k-1}}. \tag{8}$$

Note that the usual linear stepsize $\lambda_k = 1/k$ does not satisfy this bound for any $\nu \in (0,1)$. Examples of stepsizes that do satisfy this bound are the rescaled linear stepsize $\lambda_k = \frac{1}{(1-\nu)k}$, valid once $k \geq \frac{1}{1-\nu}$, as well as the shifted version of the rescaled linear stepsize $\lambda_k = \frac{1}{1 + (1-\nu)k}$, valid for all iterations $k \geq 1$.
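The stepsize condition (8) is easy to test numerically. The following Python sketch (helper name and the value $\nu = 0.9$ are illustrative assumptions) checks the condition for the plain linear stepsize $1/k$, the rescaled stepsize $1/((1-\nu)k)$, and the shifted rescaled stepsize $1/(1+(1-\nu)k)$. For the latter two the condition holds with equality, so a small floating-point tolerance is used.

```python
def satisfies_bound_8(lam, nu, k_start, k_end, tol=1e-12):
    """Check 1 - (1 - nu) * lam(k) <= lam(k) / lam(k - 1) for k_start <= k <= k_end."""
    return all(
        1.0 - (1.0 - nu) * lam(k) <= lam(k) / lam(k - 1) + tol
        for k in range(k_start, k_end + 1)
    )

nu = 0.9

def linear(k):
    return 1.0 / k

def rescaled(k):
    return 1.0 / ((1.0 - nu) * k)

def shifted(k):
    return 1.0 / (1.0 + (1.0 - nu) * k)

linear_ok = satisfies_bound_8(linear, nu, 2, 1000)
rescaled_ok = satisfies_bound_8(rescaled, nu, 11, 1000)   # range k > 1 / (1 - nu)
shifted_ok = satisfies_bound_8(shifted, nu, 1, 1000)      # lam(0) = 1 by the formula
```

For the linear stepsize the left side is $1 - (1-\nu)/k$ while the right side is $1 - 1/k$, so the condition fails at every $k$ whenever $\nu > 0$.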

Corollary 1 (Bounds for linear stepsizes).

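Under the assumptions of Theorem 1, for any sequence of stepsizes in the interval $(0,1)$ satisfying the bound (8), we have

$$\|\theta_{k+1} - \theta^*\|_e \leq \lambda_k \Big\{ \frac{\|\theta_1 - \theta^*\|_e}{\lambda_1} + \nu \sum_{\ell=1}^{k} \|P_\ell\|_e \Big\} + \|P_{k+1}\|_e, \tag{9}$$

for all iterations $k \geq 1$.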

Proof.

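Define the error $\Delta_k = \theta_k - \theta^*$ at iteration $k$. Using the definitions (6b) and (6c) of $b_k$ and $a_k$ respectively, an inductive argument yields

$$b_{k+1} = \prod_{\ell=1}^{k} \big(1 - (1-\nu)\lambda_\ell\big)\,\|\Delta_1\|_e, \tag{10a}$$

$$a_{k+1} = \nu \sum_{\ell=1}^{k} \Big\{ \prod_{j=\ell+1}^{k} \big(1 - (1-\nu)\lambda_j\big) \Big\} \lambda_\ell \|P_\ell\|_e. \tag{10b}$$

By applying the stepsize bound (8) repeatedly to the recursion (10a), we find that $b_{k+1} \leq \frac{\lambda_k}{\lambda_1}\|\Delta_1\|_e$. Applying this same bound to the recursion (10b), in which each product is at most $\lambda_k/\lambda_\ell$, yields $a_{k+1} \leq \nu \lambda_k \sum_{\ell=1}^{k} \|P_\ell\|_e$. Combining these two inequalities, along with the additional term $\|P_{k+1}\|_e$ from the bound (7) in Theorem 1, yields the claim (9). ∎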

It is worth pointing out why the linear stepsize $\lambda_k = 1/k$ is excluded from our theory. If we adopt this stepsize choice and substitute into the recursion (10a), then we find that

$$\frac{b_{k+1}}{\|\theta_1 - \theta^*\|_e} = \prod_{\ell=1}^{k} \Big(1 - \frac{1-\nu}{\ell}\Big) \approx \exp\Big(-(1-\nu) \sum_{\ell=1}^{k} \ell^{-1}\Big) \approx \Big(\frac{1}{k}\Big)^{1-\nu}.$$

Driving this quantity below a tolerance $\epsilon$ requires $k \geq (1/\epsilon)^{1/(1-\nu)}$, so an unrescaled linear stepsize leads to bounds with exponential dependence on $\frac{1}{1-\nu}$. It should be noted that this kind of sensitivity to the choices of constants is well-documented when using linear stepsizes for stochastic optimization; e.g., see Section 2.1 of Nemirovski et al. [15] for some examples showing slow rates when the strong convexity constant is mis-estimated. As we discuss at more length in Section 3.3, this type of exponential scaling has also been documented in past work on Q-learning [21, 10].

Corollary 2 (Bounds for polynomial stepsizes).

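Under the assumptions of Theorem 1, consider the sequence of stepsizes $\lambda_k = 1/k^{\omega}$ for some $\omega \in (0,1)$. Then for all iterations $k \geq 1$, we have

$$\|\theta_{k+1} - \theta^*\|_e \leq e^{-\frac{1-\nu}{1-\omega}(k^{1-\omega} - 1)}\,\|\theta_1 - \theta^*\|_e + e^{-\frac{1-\nu}{1-\omega}k^{1-\omega}} \sum_{\ell=1}^{k} \frac{e^{\frac{1-\nu}{1-\omega}\ell^{1-\omega}}}{\ell^{\omega}}\,\|P_\ell\|_e + \|P_{k+1}\|_e. \tag{11}$$
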
Proof.

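Observe that both of the recursions (10a) and (10b) hold for general stepsizes in the interval $(0,1)$. In order to simplify these expressions, we need to bound the products of various stepsizes. We claim that for any positive integers $T_0 \leq T_1$, we have

$$\prod_{\ell=T_0}^{T_1} \Big(1 - \frac{1-\nu}{\ell^{\omega}}\Big) \leq \exp\Big(-\frac{1-\nu}{1-\omega}\big(T_1^{1-\omega} - T_0^{1-\omega}\big)\Big). \tag{12}$$

The proof is straightforward. From the inequality $\log(1-x) \leq -x$, valid for $x \in [0,1)$, we find that $\log \prod_{\ell=T_0}^{T_1} \big(1 - \frac{1-\nu}{\ell^{\omega}}\big) \leq -(1-\nu) \sum_{\ell=T_0}^{T_1} \ell^{-\omega}$. Now since the function $t \mapsto t^{-\omega}$ is decreasing on the positive real line, we have

$$\sum_{\ell=T_0}^{T_1} \ell^{-\omega} \geq \int_{T_0}^{T_1} t^{-\omega}\,dt = \frac{T_1^{1-\omega} - T_0^{1-\omega}}{1-\omega}.$$

Combining the pieces yields the claimed bound. ∎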
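As a quick numerical sanity check of the product bound (12), the Python sketch below compares the exact product with the exponential upper bound for a few parameter settings. The function names and the test cases are illustrative assumptions.

```python
import math

def stepsize_product(nu, omega, t0, t1):
    """Left side of (12): product of (1 - (1 - nu) / l^omega) over l = t0..t1."""
    return math.prod(1.0 - (1.0 - nu) / ell**omega for ell in range(t0, t1 + 1))

def exponential_bound(nu, omega, t0, t1):
    """Right side of (12)."""
    exponent = -(1.0 - nu) / (1.0 - omega) * (t1 ** (1.0 - omega) - t0 ** (1.0 - omega))
    return math.exp(exponent)

# (nu, omega, T0, T1) combinations chosen arbitrarily for illustration.
cases = [(0.5, 0.5, 1, 100), (0.9, 0.75, 3, 500), (0.99, 0.6, 10, 2000)]
all_hold = all(stepsize_product(*case) <= exponential_bound(*case) for case in cases)
```

The check only confirms the inequality on these instances; the general statement rests on the integral comparison in the proof.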

3 Applications to Q-learning

We now turn to the consequences of our general results for the problem of Q-learning in the tabular setting.

3.1 Background and set-up

Here we provide only a very brief introduction to Markov decision processes and the Q-learning algorithm; the reader can consult various standard sources (e.g., [18, 20, 6, 7, 22]) for more background. We consider a Markov decision process (MDP) with a finite set of possible states $\mathcal{X}$, and a finite set of possible actions $\mathcal{U}$. The dynamics are probabilistic in nature and influenced by the actions: performing action $u \in \mathcal{U}$ while in state $x \in \mathcal{X}$ causes a transition to a new state, randomly chosen according to a probability distribution denoted $P_u(\cdot \mid x)$. Thus, underlying the MDP is a family of probability transition functions $\{P_u(\cdot \mid x) \mid (x,u) \in \mathcal{X} \times \mathcal{U}\}$. The reward function $r$ maps state-action pairs to real numbers, so that $r(x,u)$ is the reward received upon executing action $u$ while in state $x$. A deterministic policy $\pi$ is a mapping from the state space to the action space, so that action $\pi(x)$ is taken when in state $x$.

For a given policy $\pi$, the Q-function or state-action function measures the expected discounted reward obtained by starting in a given state-action pair, and then following the policy $\pi$ in all subsequent iterations. More precisely, for a given discount factor $\gamma \in (0,1)$, we define

$$\theta^{\pi}(x,u) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k r(x_k, u_k) \,\Big|\, x_0 = x, u_0 = u\Big] \quad \text{where } u_k = \pi(x_k) \text{ for all } k \geq 1. \tag{13}$$

Naturally, we would like to choose the policy $\pi$ so as to optimize the values of the Q-function. From the classical theory of finite Markov decision processes [18, 20, 7], this task is equivalent to computing the unique fixed point of the Bellman operator. The Bellman operator $T$ is a mapping from $\mathbb{R}^{|\mathcal{X}| \times |\mathcal{U}|}$ to itself, whose $(x,u)$-entry is given by

$$T(\theta)(x,u) := r(x,u) + \gamma\,\mathbb{E}_{x'} \max_{u' \in \mathcal{U}} \theta(x', u') \quad \text{where } x' \sim P_u(\cdot \mid x). \tag{14}$$

It is well-known that $T$ is a $\gamma$-contraction with respect to the $\ell_\infty$-norm, meaning that

$$\|T(\theta) - T(\theta')\|_\infty \leq \gamma \|\theta - \theta'\|_\infty \quad \text{for all } \theta, \theta', \tag{15a}$$

where the $\ell_\infty$ or sup norm is defined in the usual way:

$$\|\theta\|_\infty := \max_{(x,u) \in \mathcal{X} \times \mathcal{U}} |\theta(x,u)|. \tag{15b}$$

It is this contractivity that guarantees the existence and uniqueness of the fixed point $\theta^*$ of the Bellman operator (i.e., for which $T(\theta^*) = \theta^*$).
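The Bellman operator (14) and its contraction property (15a) can be illustrated on a small random MDP. The sketch below is a hedged example, not code from the paper: the array layout (`P[x, u, x']` for transition probabilities, `r[x, u]` for rewards), the function name `bellman`, and the randomly generated instance are all assumptions made for illustration.

```python
import numpy as np

def bellman(theta, r, P, gamma):
    """T(theta)(x, u) = r(x, u) + gamma * E_{x' ~ P_u(.|x)} max_{u'} theta(x', u').

    theta, r have shape (n_states, n_actions); P has shape
    (n_states, n_actions, n_states), with P[x, u] a probability vector.
    """
    greedy_values = theta.max(axis=1)     # max over u' for each next state x'
    return r + gamma * (P @ greedy_values)

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

# Contraction check (15a) on two arbitrary Q-tables.
t1 = rng.normal(size=(n_states, n_actions))
t2 = rng.normal(size=(n_states, n_actions))
lhs = float(np.max(np.abs(bellman(t1, r, P, gamma) - bellman(t2, r, P, gamma))))
rhs = gamma * float(np.max(np.abs(t1 - t2)))

# Fixed-point iteration converges to theta* because T is a contraction.
theta = np.zeros((n_states, n_actions))
for _ in range(500):
    theta = bellman(theta, r, P, gamma)
residual = float(np.max(np.abs(bellman(theta, r, P, gamma) - theta)))
```

The fixed-point iteration here requires knowledge of `P`, which is exactly what is unavailable in the reinforcement learning setting discussed next.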

In the context of reinforcement learning, the transition dynamics are unknown, so that it is not possible to exactly evaluate the Bellman operator. Instead, given some form of random access to these transition dynamics, our goal is to compute an approximation to the optimal Q-function on the basis of observed state-action pairs. Watkins and Dayan [26] introduced the idea of Q-learning, a form of stochastic approximation designed to compute the optimal Q-function. One can distinguish between the synchronous and asynchronous forms of Q-learning; we focus on the former here. (Footnote 1: Given bounds on the behavior of synchronous Q-learning, it is possible to transform them into guarantees for the asynchronous model via notions such as the cover time of the underlying Markov process; we refer the reader to the papers [10, 2] for instances of such conversions.) In the synchronous setting of Q-learning, we make observations of the following type. At each time $k = 1, 2, \ldots$ and for each state-action pair $(x,u)$, we observe a sample $x_k(x,u)$ drawn according to the transition function $P_u(\cdot \mid x)$. Equivalently stated, we observe a random matrix $X_k \in \mathcal{X}^{|\mathcal{X}| \times |\mathcal{U}|}$ with independent entries, in which the entry indexed by $(x,u)$ is distributed according to $P_u(\cdot \mid x)$.

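Based on these observations, the synchronous form of the Q-learning algorithm generates a sequence of iterates $\{\theta_k\}_{k \geq 1}$ according to the recursion

$$\theta_{k+1} = (1 - \lambda_k)\,\theta_k + \lambda_k \hat{T}_k(\theta_k). \tag{16}$$

Here $\hat{T}_k$ is a mapping from $\mathbb{R}^{|\mathcal{X}| \times |\mathcal{U}|}$ to itself, and is known as the empirical Bellman operator: its $(x,u)$-entry is given by

$$\hat{T}_k(\theta)(x,u) = r(x,u) + \gamma \max_{u' \in \mathcal{U}} \theta(x_k, u') \quad \text{where } x_k \equiv x_k(x,u) \sim P_u(\cdot \mid x). \tag{17}$$

By construction, for any fixed $\theta$, we have $\mathbb{E}[\hat{T}_k(\theta)] = T(\theta)$, so that the empirical Bellman operator (17) is an unbiased estimate of the population Bellman operator (14).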
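The synchronous update (16) with the empirical Bellman operator (17) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the paper's implementation: the array layout `P[x, u, x']`, the function names, the random test MDP, the zero initialization, and the stepsize $\lambda_k = 1/(1 + (1-\gamma)k)$ are choices made here for concreteness.

```python
import numpy as np

def bellman(theta, r, P, gamma):
    """Population Bellman operator (14); P has shape (states, actions, states)."""
    return r + gamma * (P @ theta.max(axis=1))

def sync_q_learning(r, P, gamma, num_iters, rng):
    """Synchronous Q-learning (16) with stepsize 1 / (1 + (1 - gamma) * k)."""
    n_states, n_actions = r.shape
    cdf = P.cumsum(axis=2)
    theta = np.zeros((n_states, n_actions))
    for k in range(1, num_iters + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)
        # One next-state sample per pair (x, u), via inverse-CDF sampling.
        unif = rng.random((n_states, n_actions, 1))
        next_state = np.minimum((unif > cdf).sum(axis=2), n_states - 1)
        # Empirical Bellman operator (17) at the current iterate.
        t_hat = r + gamma * theta.max(axis=1)[next_state]
        theta = (1.0 - lam) * theta + lam * t_hat
    return theta

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.5
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

theta_star = np.zeros((n_states, n_actions))
for _ in range(200):                       # value iteration, used only as a reference
    theta_star = bellman(theta_star, r, P, gamma)

theta_hat = sync_q_learning(r, P, gamma, num_iters=5000, rng=rng)
sup_error = float(np.max(np.abs(theta_hat - theta_star)))
```

Note that the learner only touches `P` through sampled next states; the exact operator `bellman` appears solely to compute a reference solution against which the sup-norm error is measured.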

There are different ways in which we can express the Q-learning recursion (16) in a form suitable for the application of Theorem 1 and its corollaries. One very natural approach, as followed in some past work on the problem (e.g., [23, 11, 7, 10]), is to rewrite the Q-learning update (16) as an application of the population Bellman operator with noise. In particular, we can write

$$\theta_{k+1} = (1 - \lambda_k)\,\theta_k + \lambda_k \{T(\theta_k) + E_k\}, \tag{18}$$

where the noise matrix $E_k := \hat{T}_k(\theta_k) - T(\theta_k)$ is zero-mean, conditioned on $\theta_k$. Theorem 1 and its corollaries can then be applied with the orthant cone and the $\ell_\infty$ norm, along with the operators $H_k = T$ and quasi-contraction coefficients $\nu_k = \gamma$ for all iterations $k \geq 1$.

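For our purposes, it turns out to be more convenient to apply our general theory with a different and time-varying choice, namely with $H_k = \hat{T}_k$ for each $k \geq 1$. This choice satisfies the required assumptions, since it can be verified that each one of the random operators $\hat{T}_k$ is monotonic with respect to the orthant ordering, and moreover

$$\|\hat{T}_k(\theta) - \hat{T}_k(\theta^*)\|_\infty \leq \gamma \|\theta - \theta^*\|_\infty \quad \text{for all } \theta.$$

Setting $E_k = 0$ leads to effective noise variables (as defined in equation (5)) of the form

$$W_k := \hat{T}_k(\theta^*) - T(\theta^*). \tag{19}$$

These effective noise variables are especially easy to control. In particular, note that $\{W_k\}_{k \geq 1}$ is an i.i.d. sequence of random matrices with zero mean, where entry $(x,u)$ has variance

$$\sigma^2(\theta^*)(x,u) := \gamma^2\,\mathbb{E}_{\tilde{x}}\Big[\Big(\max_{\tilde{u} \in \mathcal{U}} \theta^*(\tilde{x}, \tilde{u}) - \mathbb{E}_{x'} \max_{u' \in \mathcal{U}} \theta^*(x', u')\Big)^2\Big]. \tag{20}$$

Here the expectations $\mathbb{E}_{\tilde{x}}$ and $\mathbb{E}_{x'}$ are both computed over $P_u(\cdot \mid x)$.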

3.2 Non-asymptotic guarantees for Q-learning

With this set-up, we are now equipped to state some non-asymptotic guarantees for Q-learning. These bounds involve the quantity $D = |\mathcal{X}| \times |\mathcal{U}|$, corresponding to the total number of state-action pairs, as well as the span seminorm of $\theta^*$ given by

$$\|\theta^*\|_{\mathrm{span}} = \max_{(x,u) \in \mathcal{X} \times \mathcal{U}} \theta^*(x,u) - \min_{(x,u) \in \mathcal{X} \times \mathcal{U}} \theta^*(x,u). \tag{21}$$

Note that this is a seminorm (as opposed to a norm), since we have $\|\theta\|_{\mathrm{span}} = 0$ whenever $\theta$ is constant for all state-action pairs. See §6.6.1 of Puterman [18] for further background on the span seminorm and its properties. Finally, we also define the maximal standard deviation

$$\|\sigma(\theta^*)\|_\infty = \sqrt{\max_{(x,u) \in \mathcal{X} \times \mathcal{U}} \sigma^2(\theta^*)(x,u)}, \tag{22}$$

where the variance $\sigma^2(\theta^*)$ was previously defined in equation (20).
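The two instance-dependent quantities in equations (20) to (22) are simple to compute once $\theta^*$ and the transition probabilities are known. The Python sketch below uses a hypothetical two-state, one-action MDP (state 0 has reward 1 and stays put with probability $p$, state 1 is absorbing with reward 0), for which $\theta^*(0) = 1/(1-\gamma p)$ by direct calculation; the function names and array layout `P[x, u, x']` are assumptions.

```python
import numpy as np

def noise_variance(theta_star, P, gamma):
    """Entrywise variance (20) of the effective noise, shape (states, actions)."""
    greedy_values = theta_star.max(axis=1)
    mean = P @ greedy_values
    second_moment = P @ greedy_values**2
    return gamma**2 * (second_moment - mean**2)

def span_seminorm(theta):
    """Span seminorm (21): largest entry minus smallest entry."""
    return float(theta.max() - theta.min())

def max_std(theta_star, P, gamma):
    """Maximal standard deviation (22)."""
    variance = noise_variance(theta_star, P, gamma)
    return float(np.sqrt(max(variance.max(), 0.0)))

gamma, p = 0.5, 0.5
P = np.zeros((2, 1, 2))
P[0, 0] = [p, 1.0 - p]        # state 0: stay w.p. p, move to state 1 otherwise
P[1, 0] = [0.0, 1.0]          # state 1: absorbing
theta_star = np.array([[1.0 / (1.0 - gamma * p)], [0.0]])

span = span_seminorm(theta_star)
sigma = max_std(theta_star, P, gamma)
```

For these values a hand calculation gives a span of $4/3$ and a maximal standard deviation of $\gamma \sqrt{p(1-p)}\,\theta^*(0) = 1/3$, and adding a constant to `theta_star` leaves the span unchanged.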

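With these definitions in place, we are now ready to state bounds on the expected $\ell_\infty$-norm error for Q-learning with rescaled linear stepsizes:

Corollary 3 (Q-learning with rescaled linear stepsize).

Consider the step size choice $\lambda_k = \frac{1}{1 + (1-\gamma)k}$. Then there is a universal constant $c$ such that for all iterations $k \geq 1$, we have

$$\mathbb{E}\big[\|\theta_{k+1} - \theta^*\|_\infty\big] \leq \frac{\|\theta_1 - \theta^*\|_\infty}{1 + (1-\gamma)k} + \frac{c}{1-\gamma} \Big\{ \frac{\|\sigma(\theta^*)\|_\infty \sqrt{\log(2D)}}{\sqrt{1 + (1-\gamma)k}} + \frac{\|\theta^*\|_{\mathrm{span}} \log\big(2eD(1 + (1-\gamma)k)\big)}{1 + (1-\gamma)k} \Big\}. \tag{23}$$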

A few remarks about the bound (23) are in order. Naturally, the first term (involving $\|\theta_1 - \theta^*\|_\infty$) measures how quickly the error due to initialization decays. The rate for this term is $1/k$, which is to be expected with a linearly decaying step size. The second term in curly braces arises from the fluctuations of the noise in Q-learning, in particular via a Bernstein bound (see Lemma 3). The term with $\|\sigma(\theta^*)\|_\infty$ corresponds to the standard deviation of the effective noise terms (19), whereas the term with $\|\theta^*\|_{\mathrm{span}}$ arises from the boundedness of the noises. Finally, while we have stated a bound on the expected error, it is also possible to derive a high probability bound: in particular, if we replace the terms with for a universal constant , then the bounds hold with probability at least . (See Lemma 2 in Appendix B.1.2 for a bound on the moment generating function of the relevant noise terms.)

Next we analyze the case of Q-learning with a polynomially decaying stepsize.

Corollary 4 (Q-learning with polynomial stepsize).

Consider the step size choice $\lambda_k = 1/k^{\omega}$ for some $\omega \in (0,1)$. Then there is a constant $c_\omega$, universal apart from dependence on $\omega$, such that for all iterations $k \geq 1$, we have

$$\mathbb{E}\|\theta_{k+1} - \theta^*\|_\infty \leq e^{-\frac{1-\gamma}{1-\omega}(k^{1-\omega} - 1)} \Big\{ \|\theta_1 - \theta^*\|_\infty + c_\omega (1-\gamma)^{-\frac{1}{1-\omega}} \Big\} + \frac{c_\omega}{1-\gamma} \Big\{ \frac{\|\sigma(\theta^*)\|_\infty \sqrt{\log(2D)}}{k^{\omega/2}} + \frac{\|\theta^*\|_{\mathrm{span}} \log(2D)}{k^{\omega}} \Big\}. \tag{24}$$

At a high level, the interpretation of this bound is similar to that of the bound in Corollary 3: the first term corresponds to the initialization error, whereas the second term corresponds to the fluctuations induced by the stochasticity of the update. When taking the much larger polynomial stepsizes (in contrast to the linear stepsize case), the initialization error vanishes much more quickly, in particular as an exponential function of $k^{1-\omega}$. On the other hand, the noise terms exhibit larger fluctuations, with the two terms in the Bernstein bound scaling as $k^{-\omega/2}$ and $k^{-\omega}$.

3.3 Comparison to past work and worst-case guarantees

There is a very large body of work on Q-learning in different settings, studying its behavior under various criteria. Here we focus only on the subset of work that has given bounds on the $\ell_\infty$-error for discounted problems, which is most relevant for direct comparison to our results. The Q-learning algorithm was initially introduced and studied by Watkins and Dayan [26]. General asymptotic results on the convergence of Q-learning were given by Tsitsiklis [23] and Jaakkola et al. [11], who made explicit connection to stochastic approximation. Szepesvári [21] gave an asymptotic analysis showing (among other results) that the convergence rate of Q-learning with linear stepsizes can be exponentially slow as a function of $\frac{1}{1-\gamma}$. Bertsekas and Tsitsiklis [7] provided a general framework for the analysis of stochastic approximation of the Q-learning type, and used it to provide asymptotic convergence guarantees for a broad range of stepsizes. Using this same framework, Even-Dar and Mansour [10] performed an epoch-based analysis that led to non-asymptotic bounds on the behavior of Q-learning, both for the non-rescaled linear stepsize $\lambda_k = 1/k$, and the polynomial stepsizes $\lambda_k = 1/k^{\omega}$ for $\omega \in (0,1)$. It is these non-asymptotic results that are most directly comparable to our Corollary 4.

In order to make some precise comparisons, consider the class of MDPs in which the reward function is uniformly bounded as

$$\max_{(x,u) \in \mathcal{X} \times \mathcal{U}} |r(x,u)| \leq r_{\max}. \tag{25}$$

Bounds from past work [10] are given in terms of the iteration complexity of the algorithms, meaning the minimum number of iterations required to drive the expected $\ell_\infty$-error below $\epsilon$. (Footnote 2: In fact, they stated their results as high-probability bounds, but up to some additional logarithmic factors, these are the same as the bounds on expected error.)

3.3.1 Linear stepsizes

For the unrescaled linear stepsizes $\lambda_k = 1/k$, Even-Dar and Mansour [10] proved a pessimistic result: namely, that Q-learning with this step size has an iteration complexity that grows exponentially in the quantity $\frac{1}{1-\gamma}$. As noted previously, earlier work by Szepesvári [21] had given an asymptotic analogue of this poor behavior of Q-learning with this linear stepsize. Moreover, as we discussed following Corollary 1, this type of exponential scaling will also arise if our general machinery is applied with the ordinary linear stepsize.

Let us now turn to the bounds given by Corollary 3 using the rescaled linear stepsize $\lambda_k = \frac{1}{1 + (1-\gamma)k}$. Translating these bounds into iteration complexity, we find that taking

$$T_{\mathrm{LinRes}}(\epsilon, \gamma, \theta^*) \precsim \Big( \frac{\|\theta_1 - \theta^*\|_\infty}{1-\gamma} + \frac{\|\theta^*\|_{\mathrm{span}}}{(1-\gamma)^2} \Big) \Big(\frac{1}{\epsilon}\Big) + \Big( \frac{\|\sigma(\theta^*)\|_\infty^2}{(1-\gamma)^3 \epsilon^2} \Big) \tag{26}$$

iterations is sufficient to guarantee $\epsilon$-accuracy in expected $\ell_\infty$ norm. Here the notation $\precsim$ denotes an inequality that holds with constants and log factors dropped, so as to simplify comparison of results.

As will be established momentarily, all of the $\theta^*$-dependent quantities in the complexity estimate (26) scale at most as polynomial functions of $\frac{1}{1-\gamma}$. In fact, our theory establishes that the iteration complexity of Q-learning with rescaled linear stepsize scales, in the worst case, as $\frac{1}{(1-\gamma)^4 \epsilon^2}$. See Table 1 for a summary.

3.3.2 Polynomial stepsizes

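We now turn to the polynomial step sizes $\lambda_k = 1/k^{\omega}$ for $\omega \in (0,1)$, and compare our results to past work. For these polynomial step sizes, Even-Dar and Mansour [10] (in Theorem 2 of their paper) proved that for any MDP with $r_{\max}$-bounded rewards and discount factor $\gamma$, it suffices to take at most

$$T_\omega(\epsilon, \gamma, r_{\max}) \precsim \Big( \frac{r_{\max}^2}{(1-\gamma)^4 \epsilon^2} \Big)^{\frac{1}{\omega}} + \Big\{ \frac{1}{1-\gamma} \log\Big( \frac{r_{\max}}{(1-\gamma)\epsilon} \Big) \Big\}^{\frac{1}{1-\omega}} \tag{27}$$

iterations in order to drive the error below $\epsilon$. Here as before, our notation $\precsim$ indicates that we are dropping constants and other logarithmic factors (including those involving $D$).

On the other hand, Corollary 4 in this paper guarantees that for a $\gamma$-discounted MDP with optimal Q-function $\theta^*$, initializing at $\theta_1 = 0$ and taking

$$T_\omega(\epsilon, \gamma, \theta^*) \precsim \Big( \frac{\|\sigma(\theta^*)\|_\infty^2}{(1-\gamma)^2 \epsilon^2} \Big)^{\frac{1}{\omega}} + \Big( \frac{\|\theta^*\|_{\mathrm{span}}^2}{(1-\gamma)^2 \epsilon^2} \Big)^{\frac{1}{2\omega}} + \Big\{ \frac{1}{1-\gamma} \log\Big( \frac{r_{\max}}{(1-\gamma)\epsilon} \Big) \Big\}^{\frac{1}{1-\omega}} \tag{28}$$

steps is sufficient to achieve an $\epsilon$-accurate estimate. As shown in the next section, choosing $\omega = 3/4$ optimizes the trade-off between the two terms in the worst case, and leads to an overall $(1-\gamma)^{-4}$ scaling, as with Q-learning with rescaled linear stepsizes. Again, see Table 1 for a summary. Moreover, we clarify in the next section the relation between the bound (28) and the earlier result (27).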

3.3.3 Worst-case guarantees

In our bounds for Q-learning, the $\theta^*$-specific difficulty enters via the span seminorm $\|\theta^*\|_{\mathrm{span}}$ and the maximal standard deviation $\|\sigma(\theta^*)\|_\infty$. In this section, we bound these quantities in a worst-case sense. Doing so allows us to give a uniform version of our earlier bound for Q-learning with rescaled linear stepsize, and to make a more precise comparison between the bounds (27) and (28). In particular, let $\mathcal{M}(\gamma, r_{\max})$ denote the set of all optimal Q-functions that can be obtained from a $\gamma$-discounted MDP with an $r_{\max}$-uniformly bounded reward function (as in equation (25)).

Lemma 1.

Over the class $\mathcal{M}(\gamma, r_{\max})$, we have the uniform bounds

$$\sup_{\theta^* \in \mathcal{M}(\gamma, r_{\max})} \|\theta^*\|_{\mathrm{span}} \leq 2 \sup_{\theta^* \in \mathcal{M}(\gamma, r_{\max})} \|\theta^*\|_\infty \leq \frac{2 r_{\max}}{1-\gamma}, \quad \text{and} \tag{29a}$$

$$\sup_{\theta^* \in \mathcal{M}(\gamma, r_{\max})} \|\sigma(\theta^*)\|_\infty \leq \frac{\sqrt{8}\, r_{\max}}{\sqrt{1-\gamma}}. \tag{29b}$$

See Appendix B.1.1 for the proof of this lemma.

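Lemma 1 allows us to derive uniform versions of our previous iteration complexity bounds. For simplicity, we assume initialization at $\theta_1 = 0$, so that $\|\theta_1 - \theta^*\|_\infty \leq \frac{r_{\max}}{1-\gamma}$. For the rescaled linear stepsize, for all , we have

$$\sup_{\theta^* \in \mathcal{M}(\gamma, r_{\max})} T_{\mathrm{LinRes}}(\epsilon, \gamma, \theta^*) \precsim \Big( \frac{r_{\max}^2}{(1-\gamma)^4 \epsilon^2} \Big). \tag{30}$$

On the other hand, for the polynomial step size, we have

$$\sup_{\theta^* \in \mathcal{M}(\gamma, r_{\max})} T_\omega(\epsilon, \gamma, \theta^*) \precsim \Big( \frac{r_{\max}^2}{(1-\gamma)^3 \epsilon^2} \Big)^{\frac{1}{\omega}} + \Big\{ \frac{1}{1-\gamma} \log\Big( \frac{r_{\max}}{(1-\gamma)\epsilon} \Big) \Big\}^{\frac{1}{1-\omega}}. \tag{31}$$

Note that this bound shows a trade-off between the two terms as a function of $\omega$. Setting $\omega = 3/4$ optimizes the trade-off in terms of $\frac{1}{1-\gamma}$, since the two exponents $3/\omega$ and $1/(1-\omega)$ then coincide, and as shown in Table 1 yields the scaling $\big(\frac{1}{1-\gamma}\big)^4 \big(\frac{1}{\epsilon}\big)^{8/3}$, where we disregard logarithmic terms. Note that this has the same scaling in $\frac{1}{1-\gamma}$ as the linear rescaled bound (30), but inferior behavior in $\epsilon$.

On the other hand, the bound (27) from past work is optimized by setting $\omega = 4/5$, and yields the scaling $\big(\frac{1}{1-\gamma}\big)^5 \big(\frac{1}{\epsilon}\big)^{5/2}$. See the first row of Table 1 for this guarantee, and a comparison to the result obtained with $\omega = 4/5$ in the bound (31).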
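The choice of $\omega$ in the bounds (27) and (31) can be reproduced with a few lines of exact arithmetic. In the sketch below (function names and the search grid are illustrative assumptions), the noise term contributes the exponent `noise_power / omega` on $1/(1-\gamma)$, with `noise_power` equal to 3 for the bound (31) and 4 for the bound (27), while the initialization term contributes $1/(1-\omega)$; logarithmic factors are ignored.

```python
from fractions import Fraction

def gamma_exponent(omega, noise_power):
    """Worst-case exponent of 1 / (1 - gamma): the larger of the two term exponents."""
    return max(Fraction(noise_power) / omega, 1 / (1 - omega))

def best_omega(noise_power, grid=1000):
    """Grid search over omega in (0, 1) for the smallest exponent."""
    candidates = [Fraction(i, grid) for i in range(1, grid)]
    return min(candidates, key=lambda omega: gamma_exponent(omega, noise_power))

omega_ours = best_omega(noise_power=3)        # bound (31)
omega_past = best_omega(noise_power=4)        # bound (27)
exponent_ours = gamma_exponent(omega_ours, 3)
exponent_past = gamma_exponent(omega_past, 4)
epsilon_exponent_ours = Fraction(2) / omega_ours
epsilon_exponent_past = Fraction(2) / omega_past
```

Since one exponent decreases in $\omega$ and the other increases, the minimum of their maximum sits at the crossing point, which the grid contains exactly for both bounds.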

3.4 Simulation study of γ-dependence

It is natural to wonder whether or not the bounds given in Corollaries 3 and 4 give sharp scalings for the dependence of Q-learning on the discount factor $\gamma$. In this section, we provide empirical evidence for the sharpness of our bounds.

3.4.1 A class of “hard” problems

In order to do so, we consider a class of MDPs introduced in past work by Azar et al. [3], and used to prove minimax lower bounds. For our purposes (namely, exploring sharpness with respect to the discount $\gamma$), it suffices to consider an especially simple instance of these “hard” problems. As illustrated in Figure 1(a), this MDP consists of a five element state space $\mathcal{X} = \{1, 2, 3, 4, 5\}$, and a two element action space $\mathcal{U} = \{L, R\}$, shorthand for “left” and “right” respectively. When in state $1$, taking action $L$ yields a deterministic transition (i.e., with probability $1$) to state $2$, whereas taking action $R$ leads to a deterministic transition to state $3$. When in state $2$, taking either action leads to a transition to state $4$ with probability $1-p$, and remaining in state $2$ with probability $p$. (The same assertion applies to the behavior in state $3$, with state $4$ replaced by state $5$.) Finally, both states $4$ and $5$ are absorbing states. The reward function is zero in every state except for states $2$ and $3$, for which we have

$$r(2,L) = r(2,R) = r(3,L) = r(3,R) = 1.$$

A straightforward computation yields that the optimal Q-function has the form

$$\theta^*(x,u) = \begin{cases} \frac{\gamma}{1 - p\gamma} & \text{for } x = 1, \\ \frac{1}{1 - p\gamma} & \text{for } x \in \{2, 3\}, \\ 0 & \text{for } x \in \{4, 5\}. \end{cases}$$

For any $\gamma \in (\frac{1}{4}, 1)$, it is valid to set $p = \frac{4\gamma - 1}{3\gamma}$.
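The closed form for $\theta^*$ can be verified by building the five-state MDP and running value iteration. The Python sketch below rests on assumptions that should be read as such: states $1$ to $5$ are mapped to indices $0$ to $4$, actions $L, R$ to indices $0, 1$, action $L$ in state $1$ is taken to lead to state $2$ and $R$ to state $3$, and $p = (4\gamma - 1)/(3\gamma)$, which is the value that makes $1/(1 - p\gamma) = \frac{3}{4} \cdot \frac{1}{1-\gamma}$.

```python
import numpy as np

def hard_mdp(gamma):
    """Build rewards r[x, u] and transitions P[x, u, x'] for the 5-state instance."""
    p = (4.0 * gamma - 1.0) / (3.0 * gamma)   # assumed parameter choice
    P = np.zeros((5, 2, 5))
    r = np.zeros((5, 2))
    P[0, 0, 1] = 1.0                          # state 1, action L -> state 2
    P[0, 1, 2] = 1.0                          # state 1, action R -> state 3
    for state, sink in [(1, 3), (2, 4)]:      # states 2 and 3 with their absorbing sinks
        P[state, :, state] = p
        P[state, :, sink] = 1.0 - p
        r[state, :] = 1.0
    P[3, :, 3] = 1.0                          # states 4 and 5 are absorbing
    P[4, :, 4] = 1.0
    return r, P, p

gamma = 0.9
r, P, p = hard_mdp(gamma)

theta = np.zeros((5, 2))
for _ in range(2000):                         # value iteration with the Bellman operator
    theta = r + gamma * (P @ theta.max(axis=1))

closed_form = np.zeros((5, 2))
closed_form[0, :] = gamma / (1.0 - p * gamma)
closed_form[1:3, :] = 1.0 / (1.0 - p * gamma)
```

With $\gamma = 0.9$ the value in states $2$ and $3$ is $7.5$, matching $\frac{3}{4} \cdot \frac{1}{1-\gamma}$.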

If we run Q-learning either with a rescaled linear stepsize (as in Corollary 3) or a polynomial stepsize (as in Corollary 4), then as shown in Figure 1(b), we see convergence at the rate $k^{-1/2}$ for the linear stepsize, and $k^{-\omega/2}$ for the polynomial stepsize. This is consistent with the theory, and standard for stochastic approximation. Of most interest to us is the behavior of the curves as the discount factor $\gamma$ is changed; as seen in Figure 1(b), the curves shift upwards, reflecting the fact that problems with larger values of $\gamma$ are harder. We would like to understand these shifts in a quantitative manner.

3.4.2 Behavior of $\|\theta^*\|_{\mathrm{span}}$ and $\|\sigma(\theta^*)\|_\infty$

For this particular class of problems, let us compute the quantities $\|\theta^*\|_{\mathrm{span}}$ and $\|\sigma(\theta^*)\|_\infty$ that play a key role in our bounds. First, observe that with our choice of $p$ from above, we have $\frac{1}{1 - p\gamma} = \frac{3}{4} \cdot \frac{1}{1-\gamma}$, which implies that

$$\|\theta^*\|_{\mathrm{span}} = \frac{3}{4} \cdot \frac{1}{1-\gamma} - 0 = \frac{3}{4} \cdot \frac{1}{1-\gamma}. \tag{32a}$$

Recalling that $r_{\max} = 1$ in our construction, observe that (up to a constant factor) this Q-function saturates the worst-case upper bound on $\|\theta^*\|_{\mathrm{span}}$ from Lemma 1. As for the maximal standard deviation term, as shown in Appendix C, as long as $\gamma \geq 1/2$, it is lower bounded as

$$\|\sigma(\theta^*)\|_\infty \geq \frac{1}{4\sqrt{3}} \cdot \frac{1}{\sqrt{1-\gamma}}. \tag{32b}$$

By comparing with Lemma 1, we see that the maximal standard deviation saturates the worst-case upper bound, again up to a constant factor. Thus, the constructed class of problems is “hard”, at least from the point of view of maximizing the $\gamma$-dependent terms in our upper bounds.

Consequently, from our earlier calculations in Table 1, we expect that for any fixed $\epsilon$, the iteration complexity of Q-learning as a function of $\gamma$ should be upper bounded as $\big(\frac{1}{1-\gamma}\big)^4$, and moreover, this bound should hold for either the rescaled linear stepsize, or the polynomial stepsize with $\omega = 3/4$. If our bounds are sharp, at least in the worst-case sense, then we expect to see that this predicted bound is met with equality in simulation. Accordingly, our numerical simulations were designed to test the correctness of this prediction.

Figure 2 illustrates the results of our simulations. Panel (a) illustrates how the iteration complexity was estimated in simulation. For a given algorithm and setting of $\gamma$, we ran the algorithm for steps, thereby obtaining a path of $\ell_\infty$-norm errors at each iteration $k$. We averaged these paths over a total of independent trials. Given these Monte Carlo estimates of the average $\ell_\infty$-error, for a given $\epsilon$, we compute the iteration complexity by finding the smallest iteration $k$ at which the estimated $\ell_\infty$-error falls below $\epsilon$. Panel (a) illustrates two instances of this calculation, for the settings and respectively.
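The estimation procedure described for panel (a) amounts to averaging error paths across trials and locating the first crossing of the tolerance. A minimal Python sketch follows; the function name, the 1-indexed iteration convention, and the synthetic error paths are assumptions introduced for illustration and are not the paper's simulation data.

```python
import numpy as np

def estimate_iteration_complexity(error_paths, eps):
    """Smallest 1-indexed iteration at which the trial-averaged error is below eps.

    error_paths has shape (num_trials, num_steps); returns None if the averaged
    error never falls below eps within the simulated horizon.
    """
    mean_error = np.asarray(error_paths, dtype=float).mean(axis=0)
    below = np.nonzero(mean_error < eps)[0]
    return int(below[0]) + 1 if below.size else None

# Synthetic example: two trials whose average decays like 1 / sqrt(k).
k = np.arange(1, 101)
paths = np.vstack([0.5 / np.sqrt(k), 1.5 / np.sqrt(k)])
t_eps = estimate_iteration_complexity(paths, eps=0.25)
never = estimate_iteration_complexity(paths, eps=0.01)
```

In the synthetic example the averaged error equals $1/\sqrt{k}$, which first drops strictly below $0.25$ at $k = 17$, and it never reaches $0.01$ within 100 steps.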

For the fixed tolerance , we repeated this Monte Carlo estimation procedure in order to estimate the quantity for each