Regret Bounds for Discounted MDPs

Recently, it has been shown that carefully designed reinforcement learning (RL) algorithms can achieve near-optimal regret in the episodic or the average-reward setting. However, in practice, RL algorithms are applied mostly to the infinite-horizon discounted-reward setting, so it is natural to ask what the lowest regret an algorithm can achieve is in this case, and how close to optimal the regrets of existing RL algorithms are. In this paper, we prove a regret lower bound of Ω(√(SAT)/(1 − γ) − 1/(1 − γ)²) when T ≥ SA on any learning algorithm for infinite-horizon discounted Markov decision processes (MDPs), where S and A are the numbers of states and actions, T is the number of actions taken, and γ is the discounting factor. We also show that a modified version of the double Q-learning algorithm gives a regret upper bound of Õ(√(SAT)/(1 − γ)^2.5) when T ≥ SA. Compared to our bounds, the previous best lower and upper bounds both have worse dependencies on T and γ, while our dependencies on S, A, and T are optimal. The proof of our upper bound is inspired by recent advances in the analysis of Q-learning in the episodic setting, but the cyclic nature of infinite-horizon MDPs poses many new challenges.


1 Introduction

Reinforcement learning (RL) is concerned with how an algorithm should interact with an unknown environment in order to maximize some notion of cumulative reward. The interaction with the environment is often formulated as a Markov decision process (MDP), which is defined by a state space of size S, an action space of size A, a transition function M, and a reward function R. The learning algorithm moves between states and receives rewards along the way according to M, R, and the action taken by the algorithm at each state. A policy is a rule that decides which action should be taken at any given state.

In this paper, we will measure the optimality of RL algorithms through a notion called regret, which has attracted much attention in the RL community in recent years (Jaksch et al., 2010, Osband et al., 2013; 2016, Osband and Van Roy, 2017, Azar et al., 2017, Agrawal and Jia, 2017, Kakade et al., 2018, Jin et al., 2018). Regret, roughly speaking, measures the overall suboptimality of a learning algorithm compared to the optimal policy up until any given time. Regret is the de facto measurement of optimality for online learning algorithms (Shalev-Shwartz and others, 2012); given that many RL algorithms are deployed in an online fashion, it is not surprising that the notion of regret is favored by many in the RL community.

Recent studies have shown that carefully designed RL algorithms can achieve near-optimal regret in the episodic setting (Azar et al., 2017, Kakade et al., 2018, Jin et al., 2018), where the learner is reset to a certain state every H steps and the total return is the sum of the rewards received, and also in the average-reward setting (Jaksch et al., 2010, Agrawal and Jia, 2017), where the learner is reset only once at the very beginning and the total return is the average of the rewards received, assuming the process runs forever.

However, in practice, RL algorithms are applied mostly to the discounted setting, which is the same as the average-reward setting except that the total return is the γ-discounted sum of the rewards received. In fact, almost all deep reinforcement learning algorithms are designed with the discounted setting as the default in mind and are implemented only for the discounted setting (Lillicrap et al., 2015, Schulman et al., 2015, Mnih et al., 2016, Schulman et al., 2017, Haarnoja et al., 2018). What makes the discounted setting most attractive is that it does not have the restriction of constant resetting as in the episodic setting, and, unlike in the average-reward setting, the discounting factor automatically encourages prioritizing more immediate rewards.

1.1 Regret for Discounted MDPs

Following Lattimore and Hutter (2012), we will refer to RL in the discounted setting simply as discounted MDPs from now on. To the best of our knowledge, not much progress has been made in understanding the regret for discounted MDPs. In fact, as far as we are aware, the concept of regret has not even been formally defined for discounted MDPs. Instead, many of the previous works on discounted MDPs have focused on a quantity called the sample complexity of exploration (Kakade and others, 2003, Szita and Szepesvári, 2010, Lattimore and Hutter, 2012, Dong et al., 2019). Informally, this complexity is defined through the suboptimality Δ_h of the learning algorithm at each time step h. The sample complexity of exploration N(ϵ, δ) is defined to be the smallest number such that, with probability at least 1 − δ, the number of steps h such that E_h[Δ_h] > ϵ is at most N(ϵ, δ), where E_h is the expectation conditioned on all the history up until the state at step h. The best lower and upper bounds on N(ϵ, δ) that we are aware of are Ω(SA/(ϵ²(1−γ)³)·ln(1/δ)) (Lattimore and Hutter, 2012) and Õ(SA/(ϵ²(1−γ)⁶)) (Szita and Szepesvári, 2010), respectively.

In this paper, we propose to define the regret for discounted MDPs through a very natural adaptation of N(ϵ, δ). Specifically, we define the regret of a learning algorithm on discounted MDPs to be the cumulative suboptimality incurred by the algorithm, i.e.,

 Regret(T) = ∑_{h=0}^{T−1} Δ_h,

where the Δ_h's are exactly the suboptimalities used in defining the sample complexity of exploration N(ϵ, δ). It is easy to see that

 E[Regret(∞)] = Ω(sup_ϵ N(ϵ, 0.5)·ϵ),
 E[Regret(T)] = O(inf_ϵ (N(ϵ, 1/T)·1/(1−γ) + ϵT)).

Plugging in the existing best bounds for N(ϵ, δ), we arrive at corresponding (expected) regret lower and upper bounds.

1.2 Contributions

The primary focus of the current paper is to improve these existing bounds derived from N(ϵ, δ): we will prove a regret lower bound of Ω(√(SAT)/(1−γ) − 1/(1−γ)²) when T ≥ SA for any (possibly randomized) learning algorithm; we will also introduce a modified version of the double Q-learning algorithm (Hasselt, 2010) for which a regret upper bound of Õ(√(SAT)/(1−γ)^2.5) is proved when T ≥ SA.

It is easy to see that the dependencies on S, A, and T in our lower and upper bounds are optimal. Compared to the existing bounds derived from N(ϵ, δ), our lower bound clearly has better dependencies on all parameters; on the other hand, it may seem that the existing upper bound has better dependencies on γ, but this is not the case. In fact, we have the following inequalities for the existing upper bound:

 (1)

Note that in the above inequalities, T/(1−γ) is a trivial upper bound on the regret for any algorithm (this will become clear once the regret is formally defined in Section 2), while the other term has a worse dependency on T than our upper bound. Therefore, we can conclude that our upper bound has better dependencies on T and γ than the existing bound does, and has optimal dependencies on S, A, and T.

1.3 Regret Analysis in other MDP settings

At this point, a curious reader may wonder whether existing analysis of regret for the episodic setting or the average-reward setting can be easily adapted to the discounted setting — we argue that this is not the case:

• The definition of regret in the discounted setting is significantly different from the one used for the average-reward setting, which, roughly speaking, is an infinitely-averaged version of Δ_h without discounting. Furthermore, analyzing such averaged regret necessarily requires additional regularity assumptions, such as a bounded mixing rate (Even-Dar et al., 2005) or a bounded diameter (Jaksch et al., 2010, Agrawal and Jia, 2017). We refer the readers to Section 5.1 for a more detailed discussion.

• Our derivation of the regret upper bound is inspired by Jin et al. (2018) (and in turn Azar et al. (2017), which inspired Jin et al. (2018)), which showed that a modified version of Q-learning (Watkins and Dayan, 1992) achieves near-optimal regret in the episodic setting. Their analysis, however, is not directly applicable to the infinite-horizon setting. The difference in the definition of regret is of course one of the reasons (see Section 5.1), but more importantly, in the episodic setting, the optimal policy to compare with is a sequence of time-dependent policies, so there is no cyclic dependency in the value function. Furthermore, the rollout is terminated every H steps, so controlling the blow-up of the regret is easier. We refer the readers to Section 5.2 for a more detailed discussion.

2 Preliminaries

We begin by introducing some notation which will be used throughout the paper. For a set X, we denote by Δ(X) the set of probability distributions over X. We use x_{i:j} to denote the sequence x_i, x_{i+1}, …, x_j.

We are interested in the learning of MDPs. An MDP is defined by a finite state space 𝒮 of size S, a finite action space 𝒜 of size A, a transition function M, and a reward function R. Unlike in the planning setting, where 𝒮, 𝒜, M, and R are all known to the learner in advance, we consider the learning setting where only 𝒮 and 𝒜 are known beforehand.

An MDP can be interacted with either by calling reset, which returns an initial state s₀ and sets it as the current state; or by calling next(a) if a current state s is available, in which case a new state s′ ∼ M(s, a) and a reward r ∼ R(s, a) are returned and s′ is set as the current state.

We consider the infinite-horizon setting, where the learner calls reset once at the very beginning and repeatedly calls next afterwards. We denote the sequences of states, actions, and rewards generated in this way by {s_h}, {a_h}, and {r_h} respectively, where s_h is the current state after calling next h times, a_h is the action taken at state s_h, and r_h is the reward received after taking action a_h.

The performance of the learner is measured by a function called regret. Formally, for any discounting factor γ ∈ [0, 1), state s, and policy π, denote by V^{π,γ}_s the expected γ-discounted total reward generated by starting from state s and repeatedly following policy π to choose the next action, i.e.,

 V^{π,γ}_s = E[∑_{h=0}^{∞} γ^h r′_h], where s′_0 = s, a′_h ∼ π(s′_h), r′_h ∼ R(s′_h, a′_h), s′_{h+1} ∼ M(s′_h, a′_h).

We can define the maximum γ-discounted total rewards from state s by

 V^{*,γ}_s = sup_π V^{π,γ}_s,

and define the suboptimality at step h by

 Δ_h = V^{*,γ}_{s_h} − ∑_{t=0}^{∞} γ^t r_{h+t}.

Note that E_h[Δ_h] is a non-negative random variable, where

 E_h[⋅] = E[⋅ | s_{0:h}, a_{0:h−1}, r_{0:h−1}]. (2)

Letting T be the number of actions the learner has taken, we can define the regret as

 Regret(T) = ∑_{h=0}^{T−1} Δ_h. (3)
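To make definition (3) concrete, here is a small self-contained sketch (the function names and the toy MDP are ours, not from the paper) that computes V^{*,γ} by value iteration for a known deterministic-transition MDP and then evaluates Regret(T) along an observed trajectory, truncating the inner infinite discounted sum at the end of the trajectory:

```python
import numpy as np

def value_iteration(M, R, gamma, iters=2000):
    """Compute V*(s) for a known MDP with deterministic transitions.
    M[s, a] -> next state, R[s, a] -> expected reward."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        V = np.max(R + gamma * V[M], axis=1)  # Bellman optimality backup
    return V

def regret(states, rewards, V_star, gamma, T):
    """Regret(T) = sum_{h < T} (V*(s_h) - sum_t gamma^t r_{h+t});
    the inner infinite sum is truncated to the observed trajectory."""
    total = 0.0
    for h in range(T):
        tail = sum(gamma**t * r for t, r in enumerate(rewards[h:]))
        total += V_star[states[h]] - tail
    return total
```

On a two-state cycle where action 0 always pays 1, a trajectory that always takes action 0 incurs essentially zero regret, as expected.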

3 Regret Lower Bounds

In this section, we present a regret lower bound of Ω(√(SAT)/(1−γ) − 1/(1−γ)²) when T ≥ SA and a lower bound of Ω(T/(1−γ) − 1/(1−γ)²) when T ≤ SA.

Our construction of the hard MDP to prove the lower bound is surprisingly simple, as shown in Figure 1. In particular, the transition function is deterministic, and the learning algorithm simply cycles through all the states repeatedly no matter what action it takes; what distinguishes between different actions is the reward – at any state, there is one “good action” which gives a slightly higher reward than the remaining actions.

The overall construction is in contrast to the construction for the average-reward setting (Jaksch et al., 2010), where many two-state components with stochastic transitions are assembled together so that the MDP has a prescribed diameter (see Osband et al. (2016) for a similar proof), and is also different from the construction for proving the lower bound on the sample complexity of exploration N(ϵ, δ) (Lattimore and Hutter, 2012), where Mannor and Tsitsiklis (2004) (instead of Auer et al. (1995) in our case) is used for the basic reduction and additional fine-tuning is done to gain an extra multiplicative factor.
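To illustrate the construction just described, the following sketch implements the deterministic-cycle MDP of Figure 1. The class name and the concrete choice of ϵ and of the good actions are illustrative placeholders; in the proof of Theorem 1 these are tuned via the bandit reduction:

```python
import numpy as np

rng = np.random.default_rng(0)

class HardCycleMDP:
    """Deterministic cycle over S states: every action moves s -> (s+1) % S.
    At each state one hidden 'good' action pays Bernoulli(1/2 + eps);
    all other actions pay Bernoulli(1/2)."""
    def __init__(self, S, A, eps):
        self.S, self.A, self.eps = S, A, eps
        self.good = rng.integers(A, size=S)  # hidden optimal action per state
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def next(self, a):
        p = 0.5 + (self.eps if a == self.good[self.s] else 0.0)
        r = float(rng.random() < p)
        self.s = (self.s + 1) % self.S  # the transition ignores the action
        return self.s, r
```

Because the transition ignores actions, each state is visited ⌊T/S⌋ or ⌈T/S⌉ times, so learning at each state reduces to an independent A-armed bandit, which is how the bandit lower bound of Auer et al. (1995) enters the proof.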

Theorem 1.

For any S, A, T, any γ ∈ [0, 1), and any (possibly randomized) learning algorithm, there exists a choice of transition function and reward function such that

 E[Regret(T)] = Ω(min(√(SAT), T)/(1−γ) − 1/(1−γ)²).

Proof.

Fix S, A, T, γ, and a learning algorithm. We will construct an MDP as in Figure 1, where the transition function is deterministic and the reward function at each state s gives Bernoulli rewards with mean 1/2 + ϵ for one good action a*_s and mean 1/2 for all other actions, where ϵ and the good actions a*_s will be specified later. It is easy to see that

 E[Regret(T)] = T·(1/2 + ϵ)/(1−γ) − E[∑_{h=0}^{T−1} ∑_{t=0}^{∞} γ^t r_{h+t}].

Now note that

 ∑_{h=0}^{T−1} ∑_{t=0}^{∞} γ^t r_{h+t} = ∑_{u=0}^{T−1} r_u ∑_{v=0}^{u} γ^v + ∑_{u=T}^{∞} r_u ∑_{v=u−T+1}^{u} γ^v
 ≤ (1/(1−γ))·∑_{u=0}^{T−1} r_u + ∑_{u=T}^{∞} γ^{u−T+1}/(1−γ)
 ≤ (1/(1−γ))·∑_{u=0}^{T−1} r_u + 1/(1−γ)².

Therefore we have

 E[Regret(T)] ≥ (1/(1−γ))·(T·(1/2 + ϵ) − E[∑_{u=0}^{T−1} r_u]) − 1/(1−γ)², (4)

where we denote the expression T·(1/2 + ϵ) − E[∑_{u=0}^{T−1} r_u] by (∗).

Letting T_s be the number of actions the learner takes at state s in the first T steps, we have that

 T_s ≥ ⌊T/S⌋ ≥ T/(2S),  if T ≥ S. (5)

Also denoting by r_{s,v} the reward received when taking an action at state s for the v-th time, we have that

 (∗) = ∑_{s=1}^{S} (T_s·(1/2 + ϵ) − E[∑_{v=1}^{T_s} r_{s,v}]). (6)

According to the lower bounds for multi-armed bandits, e.g., Theorem 7.1 and its construction in Auer et al. (1995), there exist ϵ and good actions a*_s (from which the reward function is defined) such that for any s,

 T_s·(1/2 + ϵ) − E[∑_{v=1}^{T_s} r_{s,v}] ≥ (1/20)·min(√(A·T_s), T_s).

Finally, note that if T ≤ SA, then min(√(A·T_s), T_s) = T_s whenever T_s ≤ A, and consequently (∗) = Ω(T); on the other hand, if T ≥ SA, then T_s ≥ T/(2S) ≥ A/2 for any s, and consequently min(√(A·T_s), T_s) = Ω(√(A·T_s)) = Ω(√(AT/S)) for any s, where the last inequality follows from (5). Going back to (6), we have (∗) = Ω(min(√(SAT), T)).

Substituting this back into (4) gives us the lower bound. ∎

4 Regret Upper Bounds

In this section, we will introduce a modified version of the double Q-learning algorithm (Hasselt, 2010), which is used to prove our regret upper bound.

We will first consider a simpler setting in Section 4.1, where both the transition function and the reward function are deterministic, i.e., they are in fact functions mapping each state-action pair to a fixed next state and a fixed reward. We call this type of MDP a deterministic MDP. Compared to the more general stochastic setting, which we study in Section 4.2, the proof for deterministic MDPs is greatly simplified due to the absence of stochasticity, yet the analysis already highlights some of the main techniques involved in dealing with the more general setting.

4.1 Warmup: Double Q-Learning for Discounted and Deterministic MDPs

In this section, we are going to show that a modified version of the double Q-learning algorithm (Algorithm 1) has regret at most 2SA/(1−γ)² at any time T for discounted and deterministic MDPs. There are some major differences between our modified double Q-learning and the original version proposed in Hasselt (2010):

First, the Q-value functions have to be initialized to 1/(1−γ), or, for that matter, to the maximal possible discounted cumulative return if it is known. This initialization is necessary, and will turn out to be sufficient, for exploration in deterministic MDPs, as far as the Q-value functions are concerned. In fact, we will always take actions greedily based on the Q-value functions, so initializing the Q-value functions to the maximum automatically encourages exploration.

Secondly, the behavioral policy (i.e., the strategy used to choose which action to take) cannot be arbitrary as in the original version. The action has to be taken greedily based on the Q-value function that immediately gets updated. This ensures that the estimation of the Q-value functions and the interaction with the environment guide each other.

Lastly, in the original version, in each round one of the two candidate Q-value functions gets chosen, for example randomly, and gets updated from the other one. In our version, the updates have to happen in a strictly alternating fashion: the Q-value function that gets updated in this round is used for updating the other Q-value function in the next round.
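Algorithm 1 itself is not reproduced in this excerpt; the following sketch implements the three modifications just described (optimistic initialization at 1/(1−γ), greedy actions from the table being updated this round, and strictly alternating updates that bootstrap from the other table). All names are ours, and the environment is assumed to expose the reset/next interface from Section 2:

```python
import numpy as np

def double_q_deterministic(env, S, A, gamma, T):
    """Hedged sketch of modified double Q-learning for deterministic MDPs."""
    # Both tables start at the optimistic value 1/(1 - gamma).
    Q = np.full((2, S, A), 1.0 / (1.0 - gamma))
    s = env.reset()
    history = []
    for h in range(T):
        i = h % 2                     # table updated this round (strict alternation)
        a = int(np.argmax(Q[i, s]))   # greedy w.r.t. the table being updated
        s_next, r = env.next(a)
        # Bootstrap from the *other* table, which gets updated next round.
        Q[i, s, a] = r + gamma * np.max(Q[1 - i, s_next])
        history.append((s, a, r))
        s = s_next
    return Q, history
```

On a deterministic two-state cycle where action 0 always pays 1, the optimistic initialization and greedy selection lock onto the optimal action immediately, and the updated entries stay at the optimal value 2 = 1/(1 − 0.5).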

We are now ready to state our upper bound and prove it.

Theorem 2.

If M and R are both deterministic, then Algorithm 1 has Regret(T) ≤ 2SA/(1−γ)² for any T.

Proof.

Denote by Q_h the first Q-value function at the beginning of iteration h if h is even, or the second Q-value function at the beginning of iteration h if h is odd. Define

 V_h(s) = max_a Q_h(s, a),  ϕ_h = V_h(s_h) − V^{*,γ}_{s_h},  δ_h = ϕ_h + Δ_h,  Q^{*,γ}_{s,a} = R(s, a) + γ·V^{*,γ}_{M(s,a)}.

It can be easily shown by induction on h that for any s and a,

 0 ≤ V^{*,γ}_s ≤ V_h(s) ≤ 1/(1−γ), (7)
 0 ≤ Q^{*,γ}_{s,a} ≤ Q_h(s, a) ≤ 1/(1−γ). (8)

Note that

 δ_h ≥ ϕ_h ≥ 0 (9)

because of (7) and the determinism of the MDP. Also define prev_h to be the largest h′ such that h′ and h have the same parity, h′ < h, and (s_{h′}, a_{h′}) = (s_h, a_h); define prev_h to be ∅ if there is no such h′.

We have that

 δ_h = V_h(s_h) − ∑_{t=0}^{∞} γ^t r_{h+t}
 = Q_h(s_h, a_h) − ∑_{t=0}^{∞} γ^t r_{h+t}
 = 1_{prev_h=∅}·(1/(1−γ) − Q^{*,γ}_{s_h,a_h}) + 1_{prev_h≠∅}·γ·(V_{prev_h+1}(s_{prev_h+1}) − V^{*,γ}_{s_{prev_h+1}}) + (γ·V^{*,γ}_{s_{h+1}} − γ·∑_{t=0}^{∞} γ^t r_{h+1+t})
 ≤(a) 1_{prev_h=∅}·1/(1−γ) + 1_{prev_h≠∅}·γ·ϕ_{prev_h+1} + γ·(δ_{h+1} − ϕ_{h+1}),

where (a) is due to (8) and the definitions of ϕ_{h+1} and δ_{h+1}; summing up, we get

 ∑_{h=0}^{∞} δ_h ≤ γ·∑_{h=0}^{∞} δ_{h+1} + ∑_{h=0}^{∞} 1_{prev_h=∅}·1/(1−γ) + γ·(∑_{h=0}^{∞} 1_{prev_h≠∅}·ϕ_{prev_h+1} − ∑_{h=0}^{∞} ϕ_{h+1})
 ≤(a) γ·∑_{h=0}^{∞} δ_{h+1} + 2SA/(1−γ),

where (a) is because ∑_{h=0}^{∞} 1_{prev_h=∅} ≤ 2SA (each state-action pair can have prev_h = ∅ at most twice, once for each parity) and

 ∑_{h=0}^{∞} 1_{prev_h≠∅}·ϕ_{prev_h+1} ≤ ∑_{h=0}^{∞} ϕ_{h+1}.

Rearranging the terms, we get

 (1−γ)·∑_{h=0}^{∞} δ_h ≤ 2SA/(1−γ) − γ·δ_0 ≤(a) 2SA/(1−γ),

where in (a) we used the fact that δ_0 ≥ 0, which follows from (9). Therefore,

 Regret(T) = ∑_{h=0}^{T−1} Δ_h ≤(a) ∑_{h=0}^{∞} δ_h ≤ 2SA/(1−γ)²,

where (a) is due to (9). ∎

4.2 Double Q-Learning with Upper Confidence Bound for Discounted MDPs

In this section, we present a further modified version of the double Q-learning algorithm tailored to stochastic and discounted MDPs (Algorithm 2) and show that it has regret Õ(√(SAT)/(1−γ)^2.5) when T ≥ SA. Compared to Algorithm 1, there are some additional elements at play here:

First, there is a counter for each of the two Q-value functions, indicating how many times that Q-value function has been updated at each state-action pair. The counters are used both for adjusting the learning rate and for modifying the rewards.

Second, instead of updating each Q-value function entirely from the other one, we now use a learning rate α_τ, where τ is the value of the counter for the Q-value function that gets updated in the current round, to slowly incorporate more recent information.

Finally, the reward r in Algorithm 1 is now replaced by an upper confidence bound r + b_τ, where τ is again the value of the counter of the Q-value function that gets updated in the current round. This technique itself is rather standard and goes back to as early as Lai and Robbins (1985) in the bandit literature; the MDP setting does add a bit more complication, but the challenges are mostly technical.
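The exact learning rate and bonus of Algorithm 2 are not shown in this excerpt; the sketch below fills them in with forms borrowed from the episodic analysis of Jin et al. (2018), namely α_τ = (H + 1)/(H + τ) with the effective horizon H = 1/(1 − γ) and a Hoeffding-style bonus b_τ = c·√(H³·ln(1/p)/τ). These specific choices, and all names, are our assumptions, not the paper's:

```python
import math
import numpy as np

def ucb_double_q_step(Q, N, i, s, a, r, s_next, gamma, c=1.0, p=0.05):
    """One round of a UCB-flavored double Q-learning update (a sketch).
    Q: (2, S, A) value tables; N: (2, S, A) update counters;
    i: index of the table updated this round (alternates each round)."""
    H = 1.0 / (1.0 - gamma)            # effective horizon (assumed)
    N[i, s, a] += 1
    tau = N[i, s, a]
    alpha = (H + 1.0) / (H + tau)      # assumed learning rate schedule
    bonus = c * math.sqrt(H**3 * math.log(1.0 / p) / tau)  # assumed bonus
    # Bootstrap from the other table; clip to stay within [0, 1/(1-gamma)].
    target = r + bonus + gamma * np.max(Q[1 - i, s_next])
    Q[i, s, a] = (1.0 - alpha) * Q[i, s, a] + alpha * min(target, H)
    return Q, N
```

Note that α₁ = 1, so the first update of an entry fully overwrites the optimistic initialization, mirroring the behavior of Algorithm 1.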

We are now ready to state the upper bound. The proof is delayed to Appendix A due to space limitations.

Theorem 3.

For any p ∈ (0, 1), with probability at least 1 − p, Algorithm 2 has

 Regret(T) ≤ 14·√(SAT·ln(π²SAT²/p))/(1−γ)^2.5 + (2SA + 3)/(1−γ)²

for any T; consequently, Algorithm 2 has

 E[Regret(T)] ≤ 14·√(SAT·ln(π²SAT³))/(1−γ)^2.5 + (2SA + 4)/(1−γ)²

for any T.

It has been shown that replacing a Hoeffding-type upper confidence bound, such as the one we use in Algorithm 2, with a Bernstein-type upper confidence bound can give tighter regret upper bounds in the episodic setting (Azar et al., 2017, Jin et al., 2018). However, we argue that this technique, while being highly technically involved, is unlikely to help directly in the infinite-horizon setting.

In fact, the crucial observation this technique relies on is that the total variance of the value function in each episode is O(H²), which is only valid when the learning policy within each episode is independent across time steps; note that in the episodic setting there are H policies for each episode, one for each time step, and they do not affect each other within an episode.

However, in the infinite-horizon setting, if we use Q-learning-style updates, the learning policy necessarily evolves from step to step as more and more information is collected; in order to utilize Bernstein-type inequalities, algorithmic changes have to be made, for example by only periodically updating the learning policy. We refer the readers to Section 6 of Lattimore and Hutter (2012) for a related discussion on (the difficulties of) applying Bernstein inequalities to improve the upper bound on N(ϵ, δ).

5 Related Work

In this section, we discuss the related work in detail. In Section 5.1, we give a brief overview of the definition of regret and existing regret bounds for MDPs under different settings and how our regret relates to the others. In Section 5.2, we dive into more algorithmic and technical details and discuss how our algorithm design and analysis differ from previous ones.

5.1 Regret Bounds for MDPs

The learning of MDPs is typically studied in three different settings: the episodic setting, where the learner is reset to a certain state every H steps and the total return is the sum of the rewards received; the average-reward setting, where the learner is reset only once at the very beginning and the total return is the average of the rewards received, assuming the process runs forever; and finally, the discounted setting, where the learner is reset only once at the very beginning and the total return is the γ-discounted sum of the rewards received, assuming the process runs forever.

The regret in the episodic setting is arguably the easiest to analyze among the three. More specifically, let K = T/H, where K is the number of episodes; Regret(T) is defined to be ∑_{k=0}^{K−1} Δ_k, where Δ_k is the suboptimality at episode k, defined by

 Δ_k = sup_{π_{0:H−1}} E[∑_{h=0}^{H−1} r′_h] − ∑_{h=0}^{H−1} r_{kH+h}, where s′_0 = s_{kH}, a′_h ∼ π_h(s′_h), r′_h ∼ R(s′_h, a′_h), s′_{h+1} ∼ M(s′_h, a′_h).

Due to its acyclic nature (note that Δ_k is calculated w.r.t. a sequence of policies π_{0:H−1}), it should not be surprising that this is the only setting for which the gap between the regret upper bound and the regret lower bound has been closed, at Θ̃(√(HSAT)) when T is sufficiently large (Azar et al., 2017). (We note here that we could not find a published formal proof of the lower bound claimed in Azar et al. (2017) and Kakade et al. (2018). The resetting effect that happens every H steps is time-dependent but not state-dependent, so a simple reduction to the average-reward setting, where a diameter-dependent lower bound is known, is not possible.) Further improvements of the upper bound in lower-order terms can be found in Kakade et al. (2018), and an upper bound of Õ(√(H³SAT)) for model-free algorithms can be found in Jin et al. (2018). On the other hand, the drawback of being restricted to the episodic setting is that many real-world reinforcement learning scenarios are inherently non-episodic.

The definition of regret in the average-reward setting allows a little more leeway than in the episodic setting. Two definitions that have appeared in the literature are

 Regret′(T) = sup_π E_{π,s_0}[∑_{h=0}^{T−1} r′_h] − ∑_{h=0}^{T−1} r_h,
 Regret′′(T) = T·sup_π lim_{H→∞} (1/H)·E_{π,s_0}[∑_{h=0}^{H−1} r′_h] − ∑_{h=0}^{T−1} r_h,

where

 E_{π,s}[⋅] = E[⋅] with s′_0 = s, a′_h ∼ π(s′_h), r′_h ∼ R(s′_h, a′_h), s′_{h+1} ∼ M(s′_h, a′_h).

However, non-vacuous bounds for this type of regret require additional assumptions on how fast the MDP communicates: for example, a bounded mixing rate for bounding Regret′ (Even-Dar et al., 2005) (we note here that Even-Dar et al. (2005) consider the adversarial-reward setting, although the formulation of the regret is the same as stated) or a bounded diameter for bounding Regret′′ (Jaksch et al., 2010, Agrawal and Jia, 2017). In fact, in the absence of additional assumptions, there exist MDPs for which both Regret′(T) and Regret′′(T) are at least Ω(T). A formal statement can be found in Section B of the appendix, along with the proof. This issue can probably be alleviated by using a different definition of the regret: one possible alternative is Regret(T) = ∑_{h=0}^{T−1} Δ_h, where

 Δ_h = sup_π lim_{T→∞} (1/T)·E_{π,s_h}[∑_{t=0}^{T−1} r′_t] − lim_{T→∞} (1/T)·∑_{t=0}^{T−1} r_t. (10)

The advantage of this definition is that the definition itself prevents the learner from being penalized when acting optimally in a suboptimal region of the state space, while on the other hand, definitions such as Regret′ and Regret′′ rely on fast communication of the MDP to ensure that the learner can always escape from a suboptimal region. However, we suspect that (10) is significantly harder to analyze, and existing proofs for both the upper and the lower bound as in Jaksch et al. (2010) and Agrawal and Jia (2017) may be inapplicable. Fortunately, we will see shortly that the analogue of (10) in the discounted setting is actually more tractable.

In the discounted setting, we can define Δ_h analogously to (10) as

 Δ_h = sup_π E_{π,s_h}[∑_{t=0}^{∞} γ^t r′_t] − ∑_{t=0}^{∞} γ^t r_{h+t}. (11)

The study of this quantity goes back to as early as Kakade and others (2003), where the notion of sample complexity of exploration was introduced. More specifically, the sample complexity of exploration N(ϵ, δ) is defined to be the smallest number such that, with probability at least 1 − δ, the number of times E_h[Δ_h] > ϵ is at most N(ϵ, δ), where E_h is defined in (2). The best upper bound on N(ϵ, δ) to date is Õ(SA/(ϵ²(1−γ)⁶)) (Szita and Szepesvári, 2010), while the best lower bound is Ω(SA/(ϵ²(1−γ)³)·ln(1/δ)) (Lattimore and Hutter, 2012). We refer the readers to Strehl and Littman (2008) for additional discussions regarding N(ϵ, δ). On the other hand, to the best of our knowledge, the concept of regret has not been formally studied in the discounted setting. The most relevant definition we could find is the adjusted average loss introduced in Definition 3 of Strehl and Littman (2005), which is rather artificial and much more complicated than the definition we use in this paper. The definition we have chosen is a very natural adaptation of N(ϵ, δ): let Δ_h be as defined in (11), and recall that with probability at least 1 − δ,

 N(ϵ, δ) ≥ ∑_{h=0}^{∞} 1_{E_h[Δ_h] > ϵ},

and note that with our definition of regret as in (3),

 E[Regret(T)] = E[∑_{h=0}^{T−1} E_h[Δ_h]].

As we discussed in Section 1, if we translate the upper and lower bounds for N(ϵ, δ) to bounds on the regret by optimizing over ϵ, we obtain the (expected) regret bounds stated there. The translated lower bound is clearly worse than our lower bound in terms of all parameters; on the other hand, from (1) we can see that the translated upper bound has worse dependencies on T and γ, while our dependencies on S, A, and T are optimal, in light of the lower bound we proved.

As is pointed out in Theorem 3 of Dann et al. (2017), if the upper bounds on N(ϵ, δ) were to hold uniformly over all possible ϵ, then we could translate the (uniform) upper bound on N(ϵ, δ) into better regret upper bounds. In fact, if the best existing upper bound on N(ϵ, δ), Õ(SA/(ϵ²(1−γ)⁶)), were to hold uniformly over all possible ϵ, then by optimizing

 ∫_{ϵ₀}^{1/(1−γ)} SA/(ϵ²(1−γ)⁶) dϵ + T·ϵ₀

over ϵ₀, we could get a regret upper bound as good as Õ(√(SAT)/(1−γ)³). We can see that even in this imagined ideal scenario the translated upper bound still has a worse dependency on γ than ours.

We note here that all of the regrets mentioned earlier can also be considered in the Bayesian setting, where instead of considering the worst-case MDP, the MDP is drawn from some (known) prior distribution, and the regret is calculated with respect to the prior distribution over the possible MDPs. This definition of regret is particularly useful when analyzing Thompson-sampling-type algorithms (Osband et al., 2013; 2016, Osband and Van Roy, 2017).

5.2 Algorithmics and Technicalities

Our derivation of the regret upper bound is inspired by Jin et al. (2018) (and in turn Azar et al. (2017), which inspired Jin et al. (2018)), which showed that a modified version of Q-Learning (Watkins and Dayan, 1992) has near-optimal regret in the episodic setting. Their analysis, however, is not directly applicable to the infinite-horizon setting, for the following reasons.

First of all, in the episodic setting, there are H value functions to be learned, one per time step, and each depends only on those at later time steps, so there is no cyclic dependency; on the other hand, in the infinite-horizon setting, there is only one single value function, so a hierarchical induction in the analysis is not possible. To deal with the self-dependency, we find it very useful to replace regular Q-learning with double Q-learning (Hasselt, 2010), which has been widely used in deep reinforcement learning since it was introduced (Hasselt et al., 2016, Hessel et al., 2018). It is worth noting that double Q-learning was originally proposed to reduce over-estimation, but in our case, it is primarily used for handling self-loops.

Secondly, a key ingredient of the proof of Jin et al. (2018) is the choice of learning rate α_τ = (H + 1)/(H + τ); a nice consequence of this choice is that the per-step regret blow-up within an episode is a factor of 1 + 1/H; since there are at most H steps in each episode, the total blow-up is at most (1 + 1/H)^H ≤ e, which is upper bounded by a constant no matter how large H is. The same quantity also appeared in Azar et al. (2017) for the same reason. However, in the infinite-horizon setting, the blow-up could become arbitrarily large because the learner is not reset every H steps; therefore, different techniques are required to control the blow-up of the regret.
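The constant-blow-up claim can be sanity-checked numerically: with the learning rate of Jin et al. (2018), the per-step factor 1 + 1/H compounds over an H-step episode to (1 + 1/H)^H, which never exceeds e regardless of the horizon:

```python
import math

# (1 + 1/H)**H is increasing in H and converges to e from below,
# so the compounded blow-up is bounded by a constant for every H.
for H in (1, 2, 10, 100, 10_000):
    assert (1.0 + 1.0 / H) ** H <= math.e
```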

Thirdly, note that our definition of regret is in some sense stronger than the regret used in the episodic setting: we are trying to upper bound ∑_h Δ_h, while in the episodic setting it is ∑_k E_k[Δ_k] that is upper bounded; note the additional conditional expectation. The additional stochasticity may seem to be harmless, but Δ_h − E_h[Δ_h] is not a martingale difference sequence, so we cannot use Azuma–Hoeffding to control the deviation. Therefore, we have to work directly on Δ_h instead of E_h[Δ_h].

6 Future Work

We believe that both our lower and upper bounds can be improved. On the lower bound side, the additional term of −1/(1−γ)² is rather unfortunate and might be an artifact of our proof technique; the dependency on γ can potentially be improved by introducing more stochasticity into the transition function. On the upper bound side, there is a general belief that model-free algorithms, such as Q-learning and double Q-learning, are inherently less sample efficient; this was to some extent demonstrated in Jin et al. (2018), where the optimal dependency on the episode length was not achieved even with proof techniques similar to those in Azar et al. (2017), where the optimal dependency was shown for a model-based algorithm.

References

• S. Agrawal and R. Jia (2017) Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pp. 1184–1194. Cited by: item (1)., §1, §1, §5.1.
• P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (1995) Gambling in a rigged casino: the adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pp. 322–331. Cited by: §3, §3.
• M. G. Azar, I. Osband, and R. Munos (2017) Minimax regret bounds for reinforcement learning. In

Proceedings of the 34th International Conference on Machine Learning-Volume 70

,
pp. 263–272. Cited by: item (2)., §1, §1, §4.2, §5.1, §5.2, §5.2, §6, footnote 3.
• C. Dann, T. Lattimore, and E. Brunskill (2017) Unifying pac and regret: uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5713–5723. Cited by: §5.1.
• K. Dong, Y. Wang, X. Chen, and L. Wang (2019) Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. arXiv preprint arXiv:1901.09311. Cited by: §1.1.
• E. Even-Dar, S. M. Kakade, and Y. Mansour (2005) Experts in a markov decision process. In Advances in neural information processing systems, pp. 401–408. Cited by: item (1)., §5.1, footnote 4.
• T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §1.
• H. V. Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In

Thirtieth AAAI conference on artificial intelligence

,
Cited by: §5.2.
• H. V. Hasselt (2010) Double q-learning. In Advances in neural information processing systems, pp. 2613–2621. Cited by: §1.2, §4.1, §4, §5.2.
• M. Hessel, J. Modayil, H. V. Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §5.2.
• T. Jaksch, R. Ortner, and P. Auer (2010) Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11 (Apr), pp. 1563–1600. Cited by: item (1)., §1, §1, §3, §5.1.
• C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan (2018) Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863–4873. Cited by: Appendix A, item (2)., §1, §1, §4.2, §5.1, §5.2, §5.2, §6.
• S. M. Kakade et al. (2003) On the sample complexity of reinforcement learning. Ph.D. Thesis. Cited by: §1.1, §5.1.
• S. Kakade, M. Wang, and L. F. Yang (2018) Variance reduction methods for sublinear reinforcement learning. arXiv preprint arXiv:1802.09184. Cited by: §1, §1, §5.1, footnote 3.
• T. L. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22. Cited by: §4.2.
• T. Lattimore and M. Hutter (2012) PAC bounds for discounted mdps. In International Conference on Algorithmic Learning Theory, pp. 320–334. Cited by: §1.1, §3, §4.2, §5.1.
• T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
• S. Mannor and J. N. Tsitsiklis (2004) The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5 (Jun), pp. 623–648. Cited by: §3.
• V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §1.
• I. Osband, D. Russo, and B. Van Roy (2013) (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011. Cited by: §1, §5.1.
• I. Osband, B. Van Roy, and Z. Wen (2016) Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pp. 2377–2386. Cited by: §1, §3, §5.1.
• I. Osband and B. Van Roy (2017) Why is posterior sampling better than optimism for reinforcement learning?. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2701–2710. Cited by: §1, §5.1.
• J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §1.
• J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1.
• S. Shalev-Shwartz et al. (2012) Online learning and online convex optimization. Foundations and Trends® in Machine Learning 4 (2), pp. 107–194. Cited by: §1.
• A. L. Strehl and M. L. Littman (2005) A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd international conference on Machine learning, pp. 856–863. Cited by: §5.1.
• A. L. Strehl and M. L. Littman (2008) An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences 74 (8), pp. 1309–1331. Cited by: §5.1.
• I. Szita and C. Szepesvári (2010) Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the Twenty-seventh International Conference on Machine Learning, pp. 1031–1038. Cited by: §1.1, §5.1.
• C. J. Watkins and P. Dayan (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: item (2)., §5.2.

Appendix A Proof of Theorem 3

Recall that for any , . We furthermore define . Let ; it is easy to verify that . Define by and the and function at the beginning of iteration if is even, or the and the function at the beginning of iteration if is odd. Let . For , let be the smallest such that and have the same parity, , and .

Define

$$V_h(s)=\max_a Q_h(s,a),\qquad \bar R(s,a)=\mathbb{E}_{r\sim R(s,a)}[r],\qquad \bar r_h=\bar R(s_h,a_h),$$
$$V^{*,\gamma}_{M(s,a)}=\mathbb{E}_{s'\sim M(s,a)}\big[V^{*,\gamma}_{s'}\big],\qquad \phi_h=V_h(s_h)-V^{*,\gamma}_{s_h},\qquad \delta_h=\phi_h+\Delta_h,$$
$$Q^{*,\gamma}_{s,a}=\bar R(s,a)+\gamma\,V^{*,\gamma}_{M(s,a)}.$$

The following lemmas will be useful.

Lemma 4.

The following statements are true:

• (i) $\sqrt{\frac{\ln(C\cdot t)}{t}}\le\sum_{i=1}^{t}\alpha^i_t\sqrt{\frac{\ln(C\cdot i)}{i}}\le 2\sqrt{\frac{\ln(C\cdot t)}{t}}$ for any $t\ge 1$ and $C\ge e$.

• (ii) $\max_{i\in[t]}\alpha^i_t\le\frac{2}{(1-\gamma)t}$ and $\sum_{i=1}^{t}(\alpha^i_t)^2\le\frac{2}{(1-\gamma)t}$ for any $t\ge 1$.

• (iii) $\sum_{t=i}^{\infty}\alpha^i_t\le 2-\gamma$ for any $i\ge 1$.
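As a quick numerical sanity check of the weighted-sum bound in (i) (a sketch, not part of the paper: we assume the learning rate takes the form α_t = (2 − γ)/(1 + t(1 − γ)), i.e., (H + 1)/(H + t) with H = 1/(1 − γ), which matches the coefficients appearing in the proof below; all function names are ours), the following snippet builds the weights α^i_t = α_i ∏_{j=i+1}^t (1 − α_j) and checks that the weighted sum lies between √(ln(C·t)/t) and 2√(ln(C·t)/t):

```python
import math

def alpha(t, gamma):
    # Assumed learning rate alpha_t = (2 - gamma) / (1 + t*(1 - gamma)),
    # i.e. (H + 1) / (H + t) with H = 1 / (1 - gamma). Note alpha_1 = 1.
    return (2 - gamma) / (1 + t * (1 - gamma))

def weights(t, gamma):
    # alpha_t^i = alpha_i * prod_{j=i+1}^t (1 - alpha_j): the weight of the
    # i-th update after t updates. Since alpha_1 = 1, the weights sum to 1.
    w = []
    for i in range(1, t + 1):
        a = alpha(i, gamma)
        for j in range(i + 1, t + 1):
            a *= 1 - alpha(j, gamma)
        w.append(a)
    return w

def check_lemma4_i(t, gamma, C):
    # Check sqrt(ln(C*t)/t) <= sum_i alpha_t^i * sqrt(ln(C*i)/i)
    #                       <= 2 * sqrt(ln(C*t)/t).
    w = weights(t, gamma)
    s = sum(wi * math.sqrt(math.log(C * i) / i)
            for i, wi in enumerate(w, start=1))
    lo = math.sqrt(math.log(C * t) / t)
    return lo <= s <= 2 * lo
```

The check requires C ≥ e, so that x ↦ ln(C·x)/x is non-increasing on x ≥ 1; for example, `check_lemma4_i(100, 0.99, 10.0)` should hold.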

Proof.

For (ii) and (iii), the same proof as in Jin et al. (2018), Lemma 4.1.(b)-(c) can be applied, with $H$ replaced by $\frac{1}{1-\gamma}$; note that in proving (iii), the requirement in their proof for $H$ and $t$ to be positive integers can be relaxed to $H$ and $t$ being real numbers that are at least $1$. We will prove (i) by induction on $t$. The base case $t=1$ holds because $\alpha^1_1=\alpha_1=1$. Assuming the statement is true for $t$, then on one hand,

$$\sum_{i=1}^{t+1}\alpha^i_{t+1}\sqrt{\frac{\ln(C\cdot i)}{i}}\;\overset{(a)}{=}\;\alpha_{t+1}\sqrt{\frac{\ln(C\cdot(t+1))}{t+1}}+(1-\alpha_{t+1})\sum_{i=1}^{t}\alpha^i_t\sqrt{\frac{\ln(C\cdot i)}{i}}\;\overset{(b)}{\ge}\;\alpha_{t+1}\sqrt{\frac{\ln(C\cdot(t+1))}{t+1}}+(1-\alpha_{t+1})\sqrt{\frac{\ln(C\cdot t)}{t}}\;\overset{(c)}{\ge}\;\sqrt{\frac{\ln(C\cdot(t+1))}{t+1}},$$

where in (a) we used the definition of $\alpha^i_{t+1}$, in (b) we used the induction assumption, and (c) is because $\frac{\ln(C\cdot x)}{x}$ is a non-increasing function of $x$ when $x\ge 1$ and $C\ge e$. On the other hand, we have

$$\sum_{i=1}^{t+1}\alpha^i_{t+1}\sqrt{\frac{\ln(C\cdot i)}{i}}\;\overset{(b)}{\le}\;\alpha_{t+1}\sqrt{\frac{\ln(C\cdot(t+1))}{t+1}}+2(1-\alpha_{t+1})\sqrt{\frac{\ln(C\cdot t)}{t}}\;\overset{(c)}{=}\;\frac{2-\gamma}{2+t-(t+1)\gamma}\sqrt{\frac{\ln(C\cdot(t+1))}{t+1}}+\frac{2t(1-\gamma)}{2+t-(t+1)\gamma}\sqrt{\frac{\ln(C\cdot t)}{t}}\;\le\;\frac{2-\gamma}{2+t-(t+1)\gamma}\sqrt{\frac{\ln(C\cdot(t+1))}{t+1}}+\frac{2\sqrt{t}\,(1-\gamma)\sqrt{t+1}}{2+t-(t+1)\gamma}$$