Reinforcement learning (RL) is a powerful paradigm for modeling a learning agent’s interactions with an unknown environment, in an attempt to accumulate as much reward as possible. Because of its flexibility, RL can encode a vast array of different problem settings, many of which are entirely intractable. Therefore, it is crucial to understand what conditions make it possible for an RL agent to effectively learn about its environment.
In this paper, we consider tabular Markov decision processes (MDPs), a canonical RL setting where the agent seeks to learn a policy mapping discrete states to one of finitely many actions, in an attempt to maximize cumulative reward over an episode horizon. We shall study the regret setting, where the learner plays a policy for a sequence of episodes, and suffers a regret proportional to the average sub-optimality of the policies.
In recent years, the vast majority of the literature has focused on obtaining minimax regret bounds that match the worst-case dependence on the number of states $S$, actions $A$, and horizon length $H$; namely, a cumulative regret of $\widetilde{\mathcal{O}}(\sqrt{HSAT})$, where $T$ denotes the total number of rounds of the game (Azar et al., 2017).
While these bounds are succinct and easy to interpret, they do not elucidate the favorable structural properties of which a learning agent can hope to take advantage. The earlier literature, on the other hand, does offer precise, instance-dependent complexities, given in terms of the sub-optimality gaps associated with each action at a given state, defined as
$$\mathrm{gap}_{\infty}(s,a) := V^{\star}_{\infty}(s) - Q^{\star}_{\infty}(s,a), \qquad (1)$$
where $V^{\star}_{\infty}$ and $Q^{\star}_{\infty}$ denote the value and Q-functions for an optimal policy, and the subscript $\infty$ denotes that these bounds hold for a non-episodic, infinite-horizon setting. Unfortunately, these analyses are asymptotic in nature, and only take effect after a large number of rounds, exponential in instance-dependent parameters. Further still, the number of rounds needed for the bounds to take hold, and oftentimes the bounds themselves, depend on worst-case conditions such as uniform ergodicity, which may be overly pessimistic or intractable to verify.
Recently, Zanette and Brunskill (2019) introduced a novel algorithm which made a first step towards attaining instance-dependent, non-asymptotic guarantees for tabular MDPs. They show that it enjoys reduced dependence on the episode horizon for favorable instances, while maintaining the same worst-case dependence on the other parameters in their analysis as in Azar et al. (2017).
In this paper, we take the next step by demonstrating that a common class of algorithms for solving MDPs, based on the optimism principle, attains gap-dependent, problem-specific bounds similar to those previously found only in the asymptotic regime. For concreteness, we consider a minor modification of this algorithm. We show that
For any episodic MDP ,
enjoys a high-probability regret bound of for all rounds , where the constant depends on the sub-optimality gaps between actions at different states, as well as on the horizon length, and contains an additive, almost gap-independent term that scales as .
Unlike previous gap-dependent regret bounds,
The constant does not suffer worst-case dependencies on other problem-dependent quantities such as mixing times, hitting times, or measures of ergodicity. However, the constant does take advantage of benign problem instances (Definition 2.2).
The regret bound of is valid for any total number of rounds . Selecting , this implies a non-asymptotic expected regret bound of . (By this, we mean that for any fixed , one can attain regret; extending the bound to anytime regret is left to future work.)
The regret of interpolates between instance-dependent regret and minimax regret , the latter of which may be sharper for smaller . Following Zanette and Brunskill (2019), this dependence on may also be refined for benign instances.
Lastly, while the algorithm affords sharper regret bounds than past algorithms, our analysis techniques extend more generally to other optimism-based algorithms:
Following our analysis of , the clipped regret decomposition can establish analogous gap-dependent -regret bounds for any of the algorithms mentioned above.
What is ? In many settings, we show that is dominated by an analogue of the sum over the reciprocals of the gaps defined in (1). This is known to be optimal for non-dynamic MDP settings like contextual bandits, and we prove a lower bound (Proposition 2.3) which shows that this is unimprovable for general MDPs as well. Furthermore, building on Zanette and Brunskill (2019), we show this adapts to problems with additional structure, yielding, for example, a horizon-free bound for contextual bandits.
However, our gap-dependent bound also suffers from a certain dependence on the smallest nonzero gap (see Definition 2.1), which may dominate in some settings. We prove a lower bound (Theorem 2.2) which shows that optimistic algorithms in the recent literature, including , necessarily suffer a similar term in their regret. We believe this insight will motivate new algorithms for which this dependence can be removed, leading to new design principles and actionable insights for practitioners. Finally, our regret bound incurs an (almost) gap-independent burn-in term, which is standard for optimistic algorithms, and whose removal we believe is an exciting direction of research.
Altogether, we believe that the results in our paper serve as a preliminary but significant step to attaining sharp, instance-dependent, and non-asymptotic bounds for tabular MDPs, and hope that our analysis will guide the design of future algorithms that attain these bounds.
1.2 Related Work
Like the multi-armed bandit setting, regret bounds for MDP algorithms have been characterized both in gap-independent forms that depend solely on , and in gap-dependent forms which take into account the gaps (1), as well as other instance-specific properties of the rewards and transition probabilities.
Finite Sample Bounds, Gap-Independent Bounds: A number of notable recent works give undiscounted regret bounds for finite-horizon, tabular MDPs, nearly all of them relying on the principle of optimism which we describe in Section 3 (Dann and Brunskill, 2015; Azar et al., 2017; Dann et al., 2017; Jin et al., 2018; Zanette and Brunskill, 2019). Many of the more recent works (Azar et al., 2017; Zanette and Brunskill, 2019; Dann et al., 2018) attain a regret of $\widetilde{\mathcal{O}}(\sqrt{HSAT})$, matching the known lower bound of $\Omega(\sqrt{HSAT})$ established in Osband and Van Roy (2016); Jaksch et al. (2010); Dann and Brunskill (2015). As mentioned above, the algorithm of Zanette and Brunskill (2019) attains the minimax rates and simultaneously enjoys a reduced dependence on in benign problem instances, such as the contextual bandits setting, where the transition probabilities do not depend on the current state or the learner's actions, or when the total cumulative rewards over any roll-out are bounded by in magnitude.
Diameter Dependent Bounds: In the setting of infinite-horizon MDPs with undiscounted regret, many previous works have established logarithmic regret bounds of the form , where is a constant depending on the underlying MDP. Notably, Jaksch et al. (2010) give an algorithm which attains a gap-independent regret, and an gap-dependent regret bound, where is the difference between the average reward of and the next-best stationary policy, and where denotes the maximum expected traversal time between any two states , under the policy which attains the minimal traversal time between those two states. We note that if denotes the sub-optimality of any action at stage as in (1), then . The bounds in this work, on the other hand, depend on an average over inverse gaps, rather than a worst case. Moreover, the diameter can be quite large when there exist difficult-to-access states.
Asymptotic Bounds: Prior to Jaksch et al. (2010), and building on the bounds of Burnetas and Katehakis (1997), Tewari and Bartlett (2008) presented bounds in terms of a diameter-related quantity , which captures the minimal hitting time between states when restricted to optimal policies. Tewari and Bartlett (2008) prove that their algorithm enjoys a regret of asymptotically in (they actually present a bound of the form , but it is straightforward to extract the claimed form from the proof), where CRIT contains those sub-optimal state-action pairs that can be made optimal by replacing
with some other vector on the -simplex. Recently, Ok et al. (2018) give per-instance lower bounds for both structured and unstructured MDPs, which apply to any algorithm that enjoys sub-linear regret on every problem instance, and present an algorithm which matches these lower bounds asymptotically. This bound replaces with , where denotes the range of the bias functions, an analogue of for the non-episodic setting (Bartlett and Tewari, 2009). We further stress that whereas the logarithmic regret bounds of Jaksch et al. (2010) hold in finite time with polynomial dependence on the problem parameters, the number of episodes needed for the bounds of Burnetas and Katehakis (1997), Tewari and Bartlett (2008), and Ok et al. (2018) to hold may be exponentially large, and may depend on additional, pessimistic problem-dependent quantities (e.g., a uniform hitting time in Tewari (2007, Proposition 29)).
Novelty of this work: The major contribution of our work is showing problem-dependent regret bounds which i) attain a refined dependence on the gaps, as in Tewari and Bartlett (2008), ii) apply in finite time after a burn-in time only polynomial in , , and the gaps, iii) depend only on and not on the diameter (and thus are not adversely affected by difficult-to-access states), and iv) smoothly interpolate between regret and the minimax rate attained by Azar et al. (2017).
1.3 Problem Setting and Notation
Episodic MDP: A stationary, episodic MDP is a tuple , where for each we have that is a random reward with expectation , denotes transition probabilities, is an initial distribution over states, and is the horizon, or length of the episode. A policy is a sequence of mappings . For our given MDP , we let and denote the expectation and probability operators with respect to the law of the sequence , where , , . We define the value of as
and for and ,
which we identify with a vector in . We define the associated Q-function ,
so that . We denote the set of optimal policies
and let denote the set of optimal actions. Lastly, given any optimal , we introduce the shorthand and , where we note that even when is not unique, and do not depend on the choice of optimal policy.
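As a concrete reference for the definitions above, the following sketch computes the optimal value and Q-functions by backward induction over the horizon. The two-state, two-action MDP (`R`, `P`) is a made-up toy instance, not one from the paper:

```python
# Backward induction for a finite-horizon tabular MDP: a minimal sketch.
# The two-state, two-action MDP below is a hypothetical toy instance.
S, A, H = 2, 2, 3
R = [[0.0, 0.5], [1.0, 0.2]]            # R[s][a]: expected reward
P = [[[0.9, 0.1], [0.2, 0.8]],          # P[s][a][s']: transition probabilities
     [[0.5, 0.5], [0.7, 0.3]]]

# V[h][s] and Q[h][s][a] for stages h = 0..H (V[H] is the terminal zero vector).
V = [[0.0] * S for _ in range(H + 1)]
Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
for h in range(H - 1, -1, -1):          # Bellman backup, last stage first
    for s in range(S):
        for a in range(A):
            Q[h][s][a] = R[s][a] + sum(P[s][a][t] * V[h + 1][t]
                                       for t in range(S))
        V[h][s] = max(Q[h][s])          # optimal value at (h, s)

print(V[0])                             # optimal values from each initial state
```

The optimal policy is recovered by taking, at each stage and state, any action achieving the maximum of `Q[h][s]`.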
Episodic Regret: We consider a game that proceeds in rounds , where at each round an algorithm selects a policy , and observes a roll-out . The goal is to minimize the cumulative simple regret, defined as
Notation: For any integer we define . For two expressions that are functions of any problem-dependent variables of , we say (, respectively) if there exists a universal constant independent of such that (, respectively). We say if .
2 Main Results
We now state regret bounds that describe the performance of , an instance of the optimistic algorithm class defined in Definition 3.1; as remarked below, our techniques extend to this broader class as well. We defer a precise description of to Algorithm 1 in Appendix B.
The key quantities at play are the suboptimality-gaps between the Q-functions:
Definition 2.1 (Suboptimality Gaps).
We define the stage-dependent suboptimality gap
$$\mathrm{gap}_h(s,a) := V^{\star}_h(s) - Q^{\star}_h(s,a),$$
as well as the minimal stage-dependent gap , and the minimal gap
Above, we recall that any optimal satisfies the Bellman equation , and thus if and only if . Next, following Zanette and Brunskill (2019), we consider two illustrative benign problem settings under which we obtain an improved dependence on the horizon :
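To make Definition 2.1 concrete, the following sketch computes the stage-dependent gaps from an optimal Q-table; the numbers in `Qstar` are made up for illustration. Note that the gap at a pair is zero exactly when the action is optimal there, so the minimal gap ranges over the strictly positive entries only:

```python
# Sub-optimality gaps from an optimal Q-table: a minimal sketch.
# Qstar[h][s][a] is a hypothetical optimal Q-function with H = 2, S = 2, A = 2.
Qstar = [[[1.0, 0.7], [0.4, 0.9]],
         [[0.5, 0.5], [0.8, 0.1]]]

# gap_h(s,a) = V*_h(s) - Q*_h(s,a), with V*_h(s) = max_a Q*_h(s,a).
gap = [[[max(Qsa) - q for q in Qsa] for Qsa in Qh] for Qh in Qstar]

# gap_h(s,a) = 0 exactly when a is optimal at (h, s); the minimal gap
# is taken over the strictly positive entries.
positive = [g for Qh in gap for Qsa in Qh for g in Qsa if g > 0]
gap_min = min(positive)
print(gap_min)
```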
Definition 2.2 (Benign Settings).
We say that an MDP is a contextual bandit instance if does not depend on or . An MDP has -bounded rewards if, for any policy , holds with probability 1 over trajectories .
Lastly, we define as the set of pairs for which is optimal at for some stage , and its complement :
Note that typically or even (see Remark A.1). We now state our first result, which gives a gap-dependent regret bound that scales as with probability at least . The result is a consequence of a more general result stated as Theorem 2.4, itself a simplified version of our most granular bound stated in Appendix A.
Fix , and let , , . Then with probability at least , run with confidence parameter enjoys the following regret bound for all :
Moreover, if is either a contextual bandits instance, or has -bounded rewards for , then the factors of on the first line can be sharpened to . In addition, if is a contextual bandits instance, the factor of in the first term (summing over ) can be sharpened to .
Setting and noting that with probability , we see that the expected regret can be bounded by replacing with in the right-hand side of inequality (2); this yields an expected regret that scales as .
Three regret terms: The first term in Corollary 2.1 reflects the sum over sub-optimal state-action pairs, which a lower bound (Proposition 2.3) shows is unimprovable in general. In the infinite-horizon setting, Ok et al. (2018) give an algorithm whose regret is asymptotically bounded by an analogue of this term. The third term characterizes the burn-in time suffered by nearly all finite-time analyses, and is the number of rounds necessary before standard concentration-of-measure arguments kick in. The second term is less familiar and is addressed in the following subsection.
dependence: Comparing to known results from the infinite-horizon setting, one expects the optimal dependence of the first term on the horizon to be . However, we cannot rule out that the optimal dependence is for the following three reasons: (i) the infinite-horizon analogues (Section 1.2) are not directly comparable to the horizon ; (ii) in the episodic setting, we have a potentially different value function for each , whereas the value functions of the infinite horizon setting are constant across time; (iii) the may be unavoidable for non-asymptotic (in ) bounds, even if is the optimal asymptotic dependence after sufficient burn-in (possibly depending on diameter-like quantities). Resolving the optimal dependence is left as future work.
We also note that in the contextual bandits setting, we incur no dependence on in the first term; thus, the first term coincides with the known asymptotically optimal, instance-specific regret (Kaufmann et al., 2016).
Guarantees for other optimistic algorithms: To make the exposition concrete, we only provide regret bounds for the algorithm. However, the “gap-clipping” trick and subsequent analysis template described in Section 3.3 can be applied to obtain similar bounds for other recent optimistic algorithms, as in (Azar et al., 2017; Dann et al., 2017; Jin et al., 2018; Zanette and Brunskill, 2019; Dann et al., 2018).
To achieve logarithmic regret, some of these algorithms require a minor modification to their confidence intervals; otherwise, the gap-dependent regret scales as . See Appendix B for details.
2.1 Why the dependence on ?
Without the second term, Corollary 2.1 would only suffer one factor of due to the sum over state-action pairs (when the minimum is achieved by a single pair). However, as remarked above, typically scales like , and therefore the second term scales like , with a dependence on that is at least a factor of more than we would expect. Here, we claim that is unavoidable for the sorts of optimistic algorithms that we typically see in the literature; a rigorous proof is deferred to Appendix F.
Theorem 2.2 (Informal Lower Bound).
Fix . For universal constants , if , and satisfies , there exists an MDP with , and horizon , such that exactly one state has a sub-optimality gap of , and all other states have a minimum sub-optimality gap of at least . For this MDP, , yet all existing optimistic algorithms for finite-horizon MDPs which are -correct suffer a regret of at least with probability at least .
The particular instance described in Appendix F that witnesses this lower bound is instructive because it demonstrates a case where optimism results in over-exploration.
2.2 Sub-optimality Gap Lower Bound
Next, we show that when the total number of rounds is large, the first term of Corollary 2.1 is unavoidable in terms of regret. Specifically, for every possible choice of gaps, there exists an instance whose regret scales on the order of the first term in (2).
Following standard convention in the literature, the lower bound is stated for algorithms which have sublinear worst-case regret. Namely, we say that an algorithm is -uniformly good if, for any MDP instance , there exists a constant such that for all . (We may assume as well that is allowed to take the number of episodes as a parameter.)
Proposition 2.3 (Regret Lower Bound).
Let , and , and let denote a set of gaps. Then, for any , there exists an MDP with states , actions , and stages, such that,
and any -uniformly good algorithm satisfies
The above proposition is proven in Appendix G, using a construction based on Dann and Brunskill (2015). For simplicity, we stated an asymptotic lower bound. We remark that if the constant is , then one can show that the above asymptotic bound holds as soon as , where . More refined non-asymptotic regret bounds can be obtained by following Garivier et al. (2018).
2.3 Interpolating with Minimax Regret for Small
We remark that while the logarithmic regret in Corollary 2.1 is non-asymptotic, the expression can be loose for a number of rounds that is small relative to the sum of the inverse gaps. Our more general result interpolates between the gap-dependent and
gap-independent regret regimes. To state this more general bound, we introduce the following variance terms:
Definition 2.3 (Variance Terms).
We define the variance of a triple as
and the maximal variance as .
While for general MDPs (see e.g. Azar et al., 2017; Zanette and Brunskill, 2019), we have for the benign instances in Definition 2.2. Building on Zanette and Brunskill (2019), we can define an associated “effective horizon”, which replaces with a possibly smaller problem-dependent quantity:
Definition 2.4 (Effective Horizon).
We define the effective horizon as
which satisfies for any horizon- MDP.
Note that for contextual bandits implies , whereas for -bounded rewards with , . Our main theorem is as follows:
Theorem 2.4 (Main Regret Bound for ).
Fix , and let , , . Further, define . Then with probability at least , run with confidence parameter enjoys the following regret bound for all :
Moreover, if is an instance of contextual bandits, all the terms and , as well as the factor of in the first line, can be replaced by . If instead has bounded rewards, then can be replaced by , and by .
By the same argument as above, Theorem 2.4 with implies an expected regret of . In the regret bound of Theorem 2.4, as well as in Corollary 2.1, one notes that for each state-action pair, the maximum variance over is used, while the minimum over is used. Theorem 2.4 is proven in Section 4. Using a more careful analysis, we can refine our bound to use just the max over of the ratio of variance to gap, which can be substantially smaller; see Theorem A.1 and Appendix A for details.
3 Gap-Dependent bounds via ‘clipping’
In this section, we (i) introduce the key properties of optimistic algorithms, (ii) explain existing approaches to the analysis of such algorithms, and (iii) introduce the “clipping trick”, and sketch how this technique yields gap-dependent, non-asymptotic bounds.
We begin with a definition of optimistic algorithms, which have been the dominant approach for learning finite-horizon MDPs (Dann et al., 2017, 2018; Azar et al., 2017; Zanette and Brunskill, 2019; Jin et al., 2018). The central idea is to select policies which are optimal for an over-estimate of the Q-function, known as an optimistic Q-function:
Definition 3.1 (Optimistic Algorithm).
We say that an algorithm is optimistic if, for each round and all stages , it constructs a -function and policy satisfying
We define the associated value function , and the associated surplus
We say that is strongly optimistic if, in addition, for all , and .
We reiterate that is a particular instantiation of Definition 3.1, whose optimistic Q-function is described by Algorithm 1 in Appendix B. The notion of strong optimism is novel to this work, and will allow us to further sharpen the -dependence in the benign contextual bandit setting of Definition 2.2.
3.1 The Regret Decomposition For Optimistic Algorithms
Under optimism alone, we can see that for any and any ,
and therefore, we can bound the sub-optimality of as .
We can decompose the regret further by introducing the following notation: we let denote the probability of visiting and playing at time in episode , and let denote the total expected number of times is visited/played in episode . We note that since is a deterministic function, (but not necessarily ) is supported on only one action for each state and stage . A standard regret decomposition (see e.g. Dann et al. (2017, Lemma E.15)) then shows that for a trajectory ,
yielding a regret bound of
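The regret decomposition above can be checked numerically. The sketch below uses a made-up two-state MDP and a hand-picked table `Qbar` (chosen so that it happens to dominate the true Q-function on this instance); the identity it verifies, that the optimistic value gap at the initial state equals the visitation-weighted sum of surpluses along the greedy policy, holds exactly for any such table by the telescoping argument:

```python
# Exact check of the surplus regret decomposition on a hypothetical MDP.
S, A, H = 2, 2, 3
R = [[0.0, 0.5], [1.0, 0.2]]                      # R[s][a]
P = [[[0.9, 0.1], [0.2, 0.8]],                    # P[s][a][s']
     [[0.5, 0.5], [0.7, 0.3]]]
Qbar = [[[1.9, 2.5], [2.9, 2.0]],                 # hand-picked "optimistic" table
        [[0.9, 1.7], [2.0, 1.1]],
        [[0.3, 0.8], [1.2, 0.5]]]
pi = [[max(range(A), key=lambda a: Qbar[h][s][a]) for s in range(S)]
      for h in range(H)]                          # greedy w.r.t. Qbar

# Backward pass: V^pi, Vbar (with Vbar_h(s) = Qbar_h(s, pi_h(s))), and the
# surpluses E_h(s, pi_h(s)) = Qbar_h(s,a) - R(s,a) - sum_t P(t|s,a) Vbar_{h+1}(t).
Vpi, Vbar = [0.0] * S, [0.0] * S
E = [[0.0] * S for _ in range(H)]
for h in range(H - 1, -1, -1):
    nVpi, nVbar = [0.0] * S, [0.0] * S
    for s in range(S):
        a = pi[h][s]
        E[h][s] = Qbar[h][s][a] - R[s][a] - sum(
            P[s][a][t] * Vbar[t] for t in range(S))
        nVbar[s] = Qbar[h][s][a]
        nVpi[s] = R[s][a] + sum(P[s][a][t] * Vpi[t] for t in range(S))
    Vpi, Vbar = nVpi, nVbar

# Forward pass: visitation weights w_h(s) of pi from the initial state 0,
# accumulating the weighted surpluses along the way.
w, total = [1.0, 0.0], 0.0
for h in range(H):
    total += sum(w[s] * E[h][s] for s in range(S))
    w = [sum(w[s] * P[s][pi[h][s]][t] for s in range(S)) for t in range(S)]

# The optimistic value gap equals the expected sum of surpluses.
assert abs((Vbar[0] - Vpi[0]) - total) < 1e-9
```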
3.2 A Tale of Two Analyses
To understand how the analysis in this work differs from existing works, we shall give crude sketches of two styles of analysis: first, the -minimax analyses for MDPs; and second, an attempted gap-dependent analysis in the style of the multi-armed bandits literature, together with the reasons that transferring such an analysis to MDPs is challenging.
3.2.1 Sketch of Minimax Analysis for MDPs:
We begin by sketching the flavor of minimax analyses. Introducing the notation
existing analyses carefully manipulate the surpluses to show that
where . Finally, they replace with an “idealized analogue”, . Letting denote the filtration capturing all events up to the end of episode , we see that , and thus by standard concentration arguments (see Lemma 4.3, or Dann et al. (2018, Lemma 6)), and are within a constant factor of each other for all such that is sufficiently large. Hence, by replacing with , we have (up to lower-order terms)
Here (i) follows by viewing the sum as an integral (see Lemma 4.6), and (ii) from an application of Cauchy-Schwarz, or the pigeon-hole principle with . We emphasize that the above analysis constitutes a crude summary of how minimax bounds arise; attaining the correct dependence on relevant problem parameters requires far greater care.
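As a quick sanity check of step (i): since the summand is decreasing, each term is dominated by the integral over the preceding unit interval, so for example the square-root-type sum obeys sum_{n=1}^N 1/sqrt(n) <= 2*sqrt(N). The sketch below verifies this for several values of N:

```python
# Sum-to-integral comparison: since n^{-1/2} is decreasing,
# sum_{n=1}^N n^{-1/2} <= int_0^N x^{-1/2} dx = 2*sqrt(N).
import math

for N in (1, 10, 1000, 10**5):
    partial = sum(1.0 / math.sqrt(n) for n in range(1, N + 1))
    assert partial <= 2.0 * math.sqrt(N)
```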
3.2.2 Attempt at a Gap-Dependent Analysis:
A first attempt for a gap-dependent analysis might rely on an alternative regret decomposition. With a standard computation (again, see Dann et al. (2017, Lemma E.15)), we can bound the regret in terms of the gaps
where simply uses the definition of , and uses Hölder’s inequality and the definitions and . Now suppose that we could somehow argue that the algorithm could rule out suboptimal actions once a pair had been visited sufficiently many times; say that . Then,
Contextual Bandits: For standard optimistic algorithms designed specifically for the contextual bandit setting, one can easily establish that , yielding the familiar -regret bound. The bound of relies on the fact that the algorithm’s choice of action at depends only on the estimates of the rewards , but not on rewards at other states . We remark that for MDPs with bounded diameter, one can use the sufficient visitation of states to make similar, albeit more involved, arguments.
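The bookkeeping behind the naive gap-dependent sketch above is elementary: if each suboptimal pair is visited at most on the order of H^2 log(T) / gap^2 times, then the gap-weighted sum of visits collapses to a sum of H^2 log(T) / gap terms. A minimal numeric check, with hypothetical values for the horizon, round count, and gaps:

```python
# Sanity check of the naive gap-dependent bookkeeping: with hypothetical
# visit caps n(s,a) = H^2 log(T) / gap(s,a)^2, the regret-style sum
# sum_(s,a) n(s,a) * gap(s,a) equals sum_(s,a) H^2 log(T) / gap(s,a).
import math

H, T = 5, 10**4                      # hypothetical horizon and round count
gaps = [0.1, 0.2, 0.5]               # hypothetical positive gaps
visits = [H**2 * math.log(T) / g**2 for g in gaps]
regret_sum = sum(n * g for n, g in zip(visits, gaps))
target = sum(H**2 * math.log(T) / g for g in gaps)
assert abs(regret_sum - target) < 1e-6
```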
MDPs without bounded diameter: For general MDPs, to determine if an action is optimal at state and stage , one requires precise knowledge about the value function at other states in the MDP at future times. Without accurate value function estimates, it is possible to visit a state an arbitrary number of times and still play sub-optimally at that state. Without bounded diameter, one cannot expect these value functions to be estimated uniformly. Hence, this coupling of information between different state-action pairs makes it challenging to import known proof strategies from the bandits literature, motivating our novel proof technique introduced in the next section.
3.3 The Clipping Trick
We now introduce the “clipping trick”, a technique which merges both the minimax analysis in terms of the surpluses , and the gap-dependent strategy, which attempts to control how many times a given suboptimal action is selected. Core to our analysis, we define the clipping operator
$$\mathrm{clip}\left[x \mid \epsilon\right] := x \cdot \mathbb{I}\{x \ge \epsilon\}$$
for all . We can now state our first main technical result, which states that the sub-optimality can be controlled by a sum over surpluses which have been clipped to zero whenever they are sufficiently small.
Let . Then, if is induced by an optimistic algorithm with surpluses ,
If the algorithm is strongly optimistic, and is a contextual bandits instance, we can replace with .
The above corollary is a consequence of a more general bound, Theorem 4.1, given in Section 4. In particular, if , and for the benign contextual bandits setting, we can clip at a factor of larger. Unlike the sketch of a naive gap-dependent bound, we do not reason about when a suboptimal action will cease to be taken. In particular, unlike in the contextual bandits setting, we cannot certify that a suboptimal action will cease to be taken once the surplus is small. Nevertheless, we can reason that the cumulative sub-optimality is bounded almost as if our algorithm ceases to take suboptimal actions once the associated is small. We can exploit the above theorem to refine Equation (4) as follows (neglecting -factors and making numerous simplifications):
where uses an upper bound on the surplus (see Proposition 4.2), and neglects lower-order terms. We remark that we may apply the gap-clipping bound of Theorem 4.1 to any standard optimistic algorithm, and this high-level analysis sketch can be formalized to yield gap-dependent regret bounds. For concreteness and sharpness, we shall analyze an algorithm , formally described in Appendix B.
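The effect of clipping can be illustrated numerically. Below, a square-root-type sequence of per-visit widths stands in for the surpluses, and `eps` stands in for a gap-dependent threshold: the unclipped cumulative sum grows like the square root of the number of visits, while the clipped sum stops growing after roughly 1/eps^2 visits and is therefore bounded independently of N:

```python
# The clipping operator clip[x | eps] := x * 1{x >= eps}, and an
# illustration of how clipping converts a sqrt(N)-type sum into one
# bounded independently of N.
import math

def clip(x, eps):
    """Zero out x whenever it falls below the threshold eps."""
    return x if x >= eps else 0.0

eps = 0.05                  # stands in for a gap-dependent threshold
N = 10**6
raw, clipped = 0.0, 0.0
for n in range(1, N + 1):
    x = min(1.0, 1.0 / math.sqrt(n))   # a sqrt-type per-visit width
    raw += x
    clipped += clip(x, eps)

# raw grows like 2*sqrt(N); clipped stops accumulating once
# 1/sqrt(n) < eps, i.e. after about 1/eps^2 terms, so it is <= ~2/eps.
assert raw > math.sqrt(N)
assert clipped <= 2.0 / eps
```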
4 Proof of Theorem 2.4
We now give a rigorous proof of Theorem 2.4, the regret bound for . This proof will also provide the scaffolding for the more granular regret bounds described in Appendix A. First, we give a more general statement of Corollary 3.1 which uses “transition-sub-optimality,” a notion of distributional closeness that enables the improved clipping and sharper regret bounds for the special case of contextual bandits (Definition 2.2):
Definition 4.1 (Transition Sub-optimality).
Given , we say that a tuple is -transition suboptimal if there exists an such that