Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities

05/27/2019 · Aristide Tossou et al.

We study model-based reinforcement learning in an unknown finite communicating Markov decision process. We propose a simple algorithm that leverages variance-based confidence intervals. We show that the proposed algorithm, UCRL-V, achieves the optimal regret Õ(√(DSAT)) up to logarithmic factors, and so our work closes a gap with the lower bound without additional assumptions on the MDP. We perform experiments in a variety of environments that validate the theoretical bounds and show that UCRL-V outperforms state-of-the-art algorithms.


1 Introduction

Reinforcement Learning.

In reinforcement learning (Sutton and Barto, 1998), a learner interacts with an environment over a given time horizon T. At each time t, the learner observes the current state s_t of the environment and needs to select an action a_t. This leads the learner to obtain a reward r_t and to transit to a new state s_{t+1}. In the Markov decision process (MDP) formulation of reinforcement learning, the reward and next state are generated based on the environment, the current state and the current action, but are independent of all previous states and actions. The learner does not know the true reward and transition distributions and needs to learn them while interacting with the environment. There are two variations of MDP problems: discounted and undiscounted MDPs. In the discounted setting, future rewards are discounted with a factor γ ∈ [0, 1) (Brafman and Tennenholtz, 2002; Poupart et al., 2006), and the cumulative reward is computed as the discounted sum of such rewards over an infinite horizon. In the undiscounted setting, future rewards are not discounted and the time horizon is finite. In this paper, we focus on undiscounted MDPs.
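
For concreteness, the interaction protocol described above can be sketched as follows; the environment interface (`reset`, `step`) is a hypothetical one introduced only for illustration:

```python
def interact(env, learner, horizon):
    """Undiscounted interaction over a finite horizon T: at each round t the
    learner observes the state s_t, selects an action a_t, receives a reward
    r_t and transits to s_{t+1}, which depend only on (s_t, a_t)."""
    s = env.reset()
    total_reward = 0.0
    for t in range(horizon):
        a = learner.act(s)
        r, s_next = env.step(a)            # reward and next state from the MDP
        learner.update(s, a, r, s_next)    # learner refines its model estimates
        total_reward += r
        s = s_next
    return total_reward
```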

Finite communicating MDP.

An undiscounted finite MDP M consists of a finite state space S, a finite action space A, a reward distribution R(s, a) with bounded support for every state-action pair (s, a), and a transition kernel p(·|s, a) such that p(s'|s, a) dictates the probability of transiting to state s' from state s by taking action a. In an MDP, at state s_t in round t, a learner chooses an action a_t according to a policy π. This grants the learner a reward r_t and transits it to a state s_{t+1} according to the transition kernel p(·|s_t, a_t). The diameter D of an MDP is the expected number of rounds it takes to reach any state s' from any other state s using an appropriate policy, for any pair of states (s, s'). More precisely,

Definition 1 (Diameter of an MDP).

The diameter D of an MDP M is the minimum expected number of rounds needed to go from any state to any other state while acting according to some deterministic policy. Formally,

D(M) := max_{s ≠ s'} min_π T(s' | M, π, s),

where T(s' | M, π, s) is the expected number of rounds it takes to reach state s' from s using policy π.

An MDP M is communicating if it has a finite diameter D < ∞.

Given that the rewards are undiscounted, a good measure of performance is the gain, i.e. the infinite-horizon average reward. The gain of a policy π in MDP M, starting from state s, is defined by:

ρ(M, π, s) := lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} r_t ].

Puterman (2014) shows that there is a policy π* whose gain ρ* := ρ(M, π*, s) is at least as large as that of any other policy. In addition, this gain is the same for all states s in a communicating MDP. We can then characterize the performance of the agent by its regret, defined as:

Regret(T) := T ρ* − Σ_{t=1}^{T} r_t.

Regret provides a performance metric to quantify the loss in gain due to the MDP being unknown to the learner. Thus, the learner has to explore the suboptimal state-actions to learn more about the MDP while also maximising the gain as much as possible. In the literature, this is called the exploration–exploitation dilemma. Our goal in this paper is to design a reinforcement learning algorithm that minimises the regret without prior knowledge of the original MDP, i.e. the reward distributions and the transition kernel are unknown. Thus, our algorithm needs to deal with the exploration–exploitation dilemma.

Optimistic Reinforcement Learning.

We adopt the optimistic reinforcement learning technique for algorithm design. Optimism in the face of uncertainty (OFU) is a well-studied algorithm design technique for resolving the exploration–exploitation dilemma in multi-armed bandits (Audibert et al., 2007). Optimism provides scope for researchers to adopt and extend the well-developed tools for multi-armed bandits to MDPs. For discounted MDPs and Bayesian MDPs, optimism-based techniques allow researchers to develop state-of-the-art algorithms (Kocsis and Szepesvári, 2006; Silver et al., 2016).

Jaksch et al. (2010) proposed an algorithm, UCRL2, for finite communicating MDPs that uses the optimism in the face of uncertainty framework and achieves Õ(DS√(AT)) regret, where the Õ(·) notation hides extra logarithmic factors. The design technique of UCRL2 can be deconstructed as follows:

  1. Construct a set of statistically plausible MDPs around the estimated mean rewards and transitions that contains the true MDP with high probability.

  2. Compute a policy (called optimistic) whose gain is the maximum among all MDPs in the plausible set. They developed an extended value iteration algorithm for this task.

  3. Play the computed optimistic policy for an artificial episode that lasts until the number of visits to any state-action pair is doubled. This is known as the doubling trick.

Follow-up algorithms further developed this optimism perspective, such as KL-UCRL (Filippi et al., 2010), REGAL.C (Bartlett and Tewari, 2009), UCBVI (Azar et al., 2017), and SCAL (Fruit et al., 2018). These algorithms and proof techniques improve the regret bound of optimistic reinforcement learning towards the lower bound, however with additional assumptions on the MDP. The best known lower bound on the regret for an unknown finite communicating MDP is Ω(√(DSAT)), as proven by Jaksch et al. (2010). This leaves a gap in the literature. In this paper, we design an algorithm and proof technique that closes this gap by exploiting variance-based confidence bounds, achieving a regret upper bound of Õ(√(DSAT)) with no additional assumptions on the communicating MDP.

Our Contributions.

Here, we summarise the contributions of this paper, which we elaborate in the upcoming sections.

  • We propose an algorithm, UCRL-V, using the optimistic reinforcement learning framework. It uses empirical Bernstein bounds and a new pointwise constraint on the transition kernel to construct a crisper set of plausible MDPs than the existing algorithms. (Section 2)

  • We propose a modified extended value iteration algorithm that converges under the new constraints while retaining the same complexity as the extended value iteration algorithm of Jaksch et al. (2010). (Section 2)

  • We prove that UCRL-V achieves Õ(√(DSAT)) regret without imposing any additional constraint on the communicating MDP, thus filling a gap in the literature (Theorem 1).

  • We prove a correlation between the number of visits of a policy in an MDP and the values, probabilities, and diameter (Proposition 3). This result, along with the algorithm design techniques, leads to the improved bound.

  • We perform experiments in a variety of environments that validate the theoretical bounds and show UCRL-V to outperform state-of-the-art algorithms. (Section 4)

We conclude by summarising the techniques involved in this paper and discussing the possible future work they can lead to (Section 5).

2 Methodology

In this section, we describe the algorithm design methodologies used in UCRL-V. We categorise them into the following three subsections.

Initialization: Set t := 1 and observe the initial state s_1.
Set N(s, a) to zero for all states s and actions a.
for episodes k = 1, 2, … do
     Set t_k := t
     Set ν_k(s, a) := 0 for all (s, a)
     Compute optimistic policy π_k:
     /* Update the bounds on statistically plausible MDPs */
     Update the empirical means and sample variances of the rewards and transitions.
     For any (s, a) use the confidence bounds (1)–(3).
     /* Find π_k with value ε_k-close to the optimal */
     π_k := ModifiedExtendedValueIteration (Algorithm 2)
     Execute policy π_k:
     while the extended doubling criterion of Section 2.3 is not met do
          Play action a_t := π_k(s_t) and observe r_t and s_{t+1}
          Increase N(s_t, a_t) and ν_k(s_t, a_t) by 1
          Set t := t + 1
     end while
end for
Algorithm 1 UCRL-V

2.1 Constructing the Set of Statistically Plausible MDPs.

We construct the set of statistically plausible MDPs using variance-modulated confidence bounds based on empirical Bernstein inequalities (Maurer and Pontil, 2009). In particular, an MDP with mean rewards r̃(s, a) and transition kernel p̃(·|s, a) is plausible if its expected rewards and transitions satisfy the following inequalities for every state-action pair (s, a) and every subset W of the state space S:

|r̃(s, a) − r̂(s, a)| ≤ C(s, a; X_r) (1)
|p̃(W|s, a) − p̂(W|s, a)| ≤ C(s, a; X_W) (2)

where r̂(s, a) and p̂(W|s, a) represent respectively the empirical mean rewards and transitions, and δ ∈ (0, 1) is the desired confidence level of the set of plausible MDPs. X_r represents the sequence of observed rewards for (s, a). X_W represents the sequence of observed transitions into W, whose sample at time t is 1 if s_{t+1} ∈ W and 0 otherwise.

C(s, a; X) is a state-action dependent confidence function of the form:

C(s, a; X) = sqrt( 2 V̂(X) L_δ / N_{t_k}(s, a) ) + 7 L_δ / (3 N_{t_k}(s, a)) (3)

where N_{t_k}(s, a) is the number of times (s, a) is played up to round t_k; t_k is the number of rounds at the start of episode k; L_δ is a logarithmic term in S, A, t_k and 1/δ; and V̂(X) is the sample variance:

V̂(X) = (1 / N_{t_k}(s, a)) Σ_t (x_t − X̄)²,

with x_t the observed sample at time t and X̄ the sample mean at the start of episode k. In this paper, we will use N(s, a) to mean N_{t_k}(s, a), and similarly for the other episode-indexed quantities.

Unlike Equation 2 in Jaksch et al. (2010), our transition vectors for a given state-action pair satisfy separate bounds for every possible subset of next states. This provides a crisper set of plausible MDPs. For example, if the empirical transition probability to a particular next state is close to zero, our variance-based bound only allows a small error on that transition, whereas UCRL2, which bounds the total (L1) deviation of the whole transition vector, could allocate its entire confidence budget to that single state. We illustrate this concept in Figure 2 (Appendix).
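
As an illustration, a confidence width of the empirical Bernstein form (Maurer and Pontil, 2009) can be computed as below; the simple logarithmic term used here is our own choice, whereas UCRL-V folds the union over state-action pairs, subsets of next states and time into the L_δ of Equation (3):

```python
import numpy as np

def empirical_bernstein_width(samples, delta):
    """Empirical Bernstein confidence width for the mean of [0, 1]-bounded
    samples (Maurer and Pontil, 2009):
        sqrt(2 * V * log(2/delta) / n) + 7 * log(2/delta) / (3 * (n - 1)),
    with V the unbiased sample variance of the observations."""
    x = np.asarray(samples, dtype=float)
    n = len(x)
    if n < 2:
        return 1.0                      # trivial width: the full range of the samples
    var = x.var(ddof=1)
    log_term = np.log(2.0 / delta)
    return np.sqrt(2.0 * var * log_term / n) + 7.0 * log_term / (3.0 * (n - 1))

# Width for the indicator of transiting into a subset W of next states
transitions_into_W = [1, 0, 0, 1, 0, 0, 0, 1]   # 1 if s_{t+1} is in W, else 0
print(empirical_bernstein_width(transitions_into_W, delta=0.05))
```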

2.2 Modified Extended Value Iteration.

Our goal is to find an optimistic policy whose average value is the maximum among all plausible MDPs. To that end, and similarly to Jaksch et al. (2010), we consider an extended MDP M⁺ with state space S and a continuous action space: for each action a ∈ A, each admissible transition p̃(·|s, a) according to (2) and each admissible mean reward r̃(s, a) according to (1), there is an action in M⁺ with transition p̃(·|s, a) and mean reward r̃(s, a).

We can then use extended value iteration (Jaksch et al., 2010) to solve this problem, defined as follows:

u_{i+1}(s) = max_{a ∈ A} { r̃_max(s, a) + max_{p ∈ P(s, a)} Σ_{s'} p(s') u_i(s') } (4)

where u_i(s) denotes the state value at the i-th iteration, r̃_max(s, a) is the largest admissible mean reward according to (1), and P(s, a) is the set of all possible transitions in the set of plausible MDPs.

The maximum over the rewards is attained by setting r̃(s, a) to its upper bound r̂(s, a) + C(s, a; X_r). For the inner maximum, although the set of all possible transition functions is an infinite space, computing the maximum over it is a linear optimization problem over the convex polytope P(s, a), which can be solved efficiently (Jaksch et al., 2010; Strehl and Littman, 2008). The idea is to put as much transition probability as possible on the states with maximal value, at the expense of the transition probabilities to states with small value. This idea is formally established for Algorithm 2 in Corollary 1 (Appendix), which shows that the value returned by Algorithm 2 is greater than the one obtained by any other admissible transition vector.

However, there are still up to 2^S constraints on each transition vector, and it would remain computationally expensive to check each one of them. Our analysis shows that we can satisfy all constraints by considering only a number of constraints linear in S, under some natural consistency and feasibility assumptions described in Assumption 1. Given any two nested subsets of states both containing a state s', one can construct the bound on the transition to s' implied by each subset; Assumption 1 requires that the constraint implied by the larger subset is tighter than the one implied by the smaller subset it contains. Lemma 7 (Appendix) shows that when Algorithm 2 is used with any constraints on the transitions that satisfy Assumption 1, all constraints are satisfied. This observation, together with Lemma 6 (Appendix) showing that the constraints considered by UCRL-V satisfy Assumption 1, means that Algorithm 2 indeed computes the inner maximum correctly.

Assumption 1.

For all state-action pairs (s, a), consider constraints on the transition probabilities of the form

L(W) ≤ p̃(W|s, a) ≤ U(W) for all subsets W of states,

where L and U are any functions returning real numbers. Let W₁ and W₂ be any two subsets of states such that W₂ ⊆ W₁. For any state s' such that s' ∈ W₂, assume the following:

(5) the upper bound on p̃(s'|s, a) implied by the constraints on W₁ is at most the one implied by the constraints on W₂;
(6) the lower bound on p̃(s'|s, a) implied by the constraints on W₁ is at least the one implied by the constraints on W₂.
Sort the states in descending order of value such that u_i(s'_1) ≥ u_i(s'_2) ≥ … ≥ u_i(s'_S).
Let q(s') := 0 for all states s'.
for j = 1 to S do
     Assign to s'_j the largest probability q(s'_j) compatible with the constraints on the prefix set {s'_1, …, s'_j} and with the lower bounds on the remaining states. (7)
end for
return q
Algorithm 2 Modified Extended Value Iteration for Solving Equation (4)
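
To make the inner maximisation concrete, the sketch below implements the greedy idea behind Algorithm 2 in the simplified case where only per-state lower and upper bounds are available; the full algorithm additionally enforces the subset constraints of Section 2.1 via Assumption 1, and the function name and interface are ours:

```python
import numpy as np

def inner_max_transition(u, p_lo, p_hi):
    """Greedily pick a transition vector q maximising sum_s' q(s') * u(s')
    subject to p_lo <= q <= p_hi elementwise and q.sum() == 1: put as much
    probability mass as possible on high-value states, at the expense of
    low-value states, as described in Section 2.2."""
    q = p_lo.copy()                  # start every state at its lower bound
    budget = 1.0 - q.sum()           # probability mass left to distribute
    for s in np.argsort(-u):         # states sorted by decreasing value u
        add = min(p_hi[s] - q[s], budget)
        q[s] += add
        budget -= add
        if budget <= 1e-12:
            break
    return q

# Example: three next states, empirical transitions p_hat and width c
u = np.array([10.0, 4.0, 1.0])
p_hat = np.array([0.2, 0.5, 0.3])
c = np.array([0.1, 0.1, 0.1])
q = inner_max_transition(u, np.clip(p_hat - c, 0, 1), np.clip(p_hat + c, 0, 1))
print(q)   # mass is shifted toward the highest-value state
```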

2.3 Scheduling the Adaptive Episodes.

In our analysis, we found that the standard doubling trick used to start a new episode can cause the length of an episode to be too large. More specifically, we found that the average number of states that are doubled during an episode should be a small constant independent of S. However, we also need to make sure that the total number of episodes is small. As a result, we start a new episode as soon as the ratios ν_k(s, a)/max{1, N_{t_k}(s, a)}, summed over all state-action pairs, exceed a fixed constant, where ν_k(s, a) is the number of times (s, a) is played in episode k. We call this new condition the extended doubling trick, and it forms a crucial part of removing an additional factor compared to UCRL2. A more detailed description is available in Algorithm 1, and Proposition 1 shows that the total number of episodes remains small.
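
A sketch of the execution phase with the extended doubling trick is shown below; the threshold of 1 and the environment interface (`env.step` returning a reward and the next state, as in the earlier sketch) are illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np

def run_episode(env, policy, N, threshold=1.0):
    """Play `policy` until the extended doubling criterion triggers.

    N[s, a] counts visits before the episode and nu[s, a] counts visits inside
    it; the episode ends once sum_{s,a} nu[s,a] / max(1, N[s,a]) exceeds
    `threshold`, so only a constant 'amount of doubling' happens per episode."""
    nu = np.zeros_like(N, dtype=float)
    s = env.state
    while (nu / np.maximum(1, N)).sum() <= threshold:
        a = policy[s]
        r, s_next = env.step(a)
        nu[s, a] += 1
        s = s_next
    return N + nu, nu   # updated counts and the within-episode counts
```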

3 Theoretical Analysis

Our proposed algorithm UCRL-V, formally described in Algorithm 1, combines the three techniques described in Section 2 to achieve the near-optimal regret proven in Theorem 1.

Theorem 1 (Upper Bound on the Regret of UCRL-V).

With probability at least 1 − δ, for any δ ∈ (0, 1) and any initial state, the regret of UCRL-V is bounded by:

Regret(T) ≤ Õ(√(DSAT)),

for T sufficiently large.

Sketch.

We start by decomposing the regret similarly to Jaksch et al. (2010), as shown in Lemma 1. To bound the dominant terms, we use Proposition 3 to shave off a factor compared to UCRL2. Proposition 3 uses an expected number of visits in the optimistic MDP, whereas the terms to be bounded depend on ν_k(s, a), the number of visits in the true MDP. In our analysis, we relate the two, which is possible because the expected number of visits does not refer to an infinite horizon; instead, it applies to episodes of a bounded number of steps from any starting state.

The above remark alone would lead to an additional factor. To avoid it, we assign the states, based on their values, to an infinite set of bins constructed so that the ratio between the upper and lower endpoint of each bin is 2. This turns a sum over states into a geometric sum.
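
For illustration, the binning step can be pictured as assigning each state to a dyadic bin of its value; the exact bin boundaries used in the proof may differ:

```python
import math

def dyadic_bin(value, v_max):
    """Index of the dyadic bin containing `value`, where bin j covers
    (v_max / 2**(j + 1), v_max / 2**j]; the ratio between the upper and
    lower endpoint of every bin is 2, so summing a per-bin bound yields
    a geometric series over j instead of a sum over states."""
    assert 0 < value <= v_max
    return int(math.floor(math.log2(v_max / value)))

print([dyadic_bin(v, 8.0) for v in (8.0, 5.0, 3.0, 1.0, 0.4)])   # [0, 0, 1, 3, 4]
```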

We are also able to remove an additional factor compared to UCRL2, and the main ingredient that makes this possible is our new extended doubling trick. The full proof is available in Appendix B.2.

The following theorem provides the basis for our proof and shows that extended value iteration (Algorithm 2) converges to the optimal policy based on the point-wise constraints on the transitions. Also the number of episodes incurred by our extended doubling trick is upper bounded as shown in Proposition 1.

Theorem 2 (Convergence of Extended Value Iteration).

Let ℳ be the set of all MDPs with state space S, action space A, transition probabilities p̃(·|s, a) and mean rewards r̃(s, a) that satisfy (1) and (2) for the given empirical estimates p̂(·|s, a) and r̂(s, a). If ℳ contains at least one communicating MDP, extended value iteration in Algorithm 2 converges. Further, stopping extended value iteration when

max_s {u_{i+1}(s) − u_i(s)} − min_s {u_{i+1}(s) − u_i(s)} < ε,

the greedy policy with respect to u_i is ε-optimal, meaning its gain is within ε of the optimal gain over the set of plausible MDPs.

Proposition 1 (Bounding the number of episodes).

The number of episodes is upper bounded by a quantity of order SA log₂(T).

Sketch.

The proof relies on the key observation that, after m episodes, the total number of times state-action counts have been doubled grows linearly with m, while each count can be doubled at most logarithmically many times. ∎

The following proposition links the number of visits with the values and probabilities. It applies to any MDP and any policy whose value differences are bounded. To understand its importance, observe that the trivial bound for the left-hand side would be much looser. It plays a crucial part in our analysis.

Proposition 2.

Let W₁ and W₂ be any two non-empty subsets of states, and let s be any starting state. We have:

where the right-hand side involves the total expected number of times the optimistic policy visits these states when starting from state s and playing for a bounded number of steps in the optimistic MDP M̃.

Lemma 1 is the starting point of our proof and decomposes the regret into three parts. The first part is easily bounded similarly to Jaksch et al. (2010). The bound for the second part (Lemma 2, Appendix) is the main contribution of this paper; the dependence of our confidence intervals on the variance and the extended doubling trick play a crucial role there. The bound for the last part (Lemma 3, Appendix) relies on Bernstein-based martingale concentration inequalities.

Lemma 1 (Regret decomposition).

With probability at least 1 − δ,

where p̃_k and p are respectively the transition kernels of the optimistic MDP M̃_k and of the true (but unknown) MDP M, and ν_k(s) is the number of times the optimistic policy visits state s in episode k.

These lemmas together provide us the desired bound on regret in Theorem 1.

4 Experimental Analysis

We empirically evaluate the performance of UCRL-V in comparison with that of KL-UCRL (Filippi et al., 2010) and UCRL2 (Jaksch et al., 2010). We also compare against TSDE (Ouyang et al., 2017), a variant of posterior sampling for reinforcement learning suited for infinite-horizon problems. Section 4.1 describes the environments tested. Figure 1 illustrates the evolution of the average regret along with a confidence region (standard deviation); it is a log-log plot in which the ticks represent the actual values.

Experimental Setup.

The confidence hyper-parameter δ of UCRL-V, KL-UCRL, and UCRL2 is set to the same value for all three algorithms. TSDE is initialized with independent priors for each reward and a Dirichlet prior for the transition function of each state-action pair. We plot the average regret of each algorithm over the rounds, computed using independent trials.

Experimental Protocol.

We take two measures to eliminate unintentional bias and variance introduced by the experimental setup while comparing different algorithms. Firstly, the true ID of each state and action is masked by randomly shuffling the sequence of states and actions. This is done independently for each trial so as to make sure that no algorithm can coincidentally benefit from the numbering of states and actions. Secondly, similarly to other authors (cf. McGovern and Sutton, 1998), we eliminate unintentional variance in our results by using the same pseudo-random seeds when generating transitions and rewards for each trial. Specifically, for each trial, every state-action pair's pseudo-random number generator is initialised with the same initial seed. This setup ensures that if two algorithms take the same actions in the same trial, they will generate the same transitions, thus reducing variance.
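
A sketch of this seeding scheme is given below; the way the seed is derived from the trial and the state-action pair is our own illustrative choice, not the authors' exact implementation:

```python
import numpy as np

def make_rngs(n_states, n_actions, trial_seed):
    """One pseudo-random generator per state-action pair, seeded identically
    across algorithms for a given trial."""
    return [[np.random.default_rng(trial_seed * 1_000_003 + s * n_actions + a)
             for a in range(n_actions)] for s in range(n_states)]

# Two algorithms evaluated on the same trial draw identical transitions and
# rewards whenever they visit the same state-action pair the same number of times.
rngs_algo1 = make_rngs(6, 2, trial_seed=0)
rngs_algo2 = make_rngs(6, 2, trial_seed=0)
assert rngs_algo1[3][1].random() == rngs_algo2[3][1].random()
```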

Implementation Notes on UCRL-V

We maintain the empirical means and variances of the rewards and transitions efficiently using Welford's (1962) online algorithm. Also, the empirical mean transition to any subset W of next states is the sum of the empirical transitions to its constituent states, and the corresponding sample variance is p̂(W|s, a)(1 − p̂(W|s, a)). As a result, keeping S values per state-action pair is enough.
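
For reference, a minimal version of Welford's online update for the mean and variance is sketched below; whether the population or the unbiased variance is used in the paper is not specified here, so the population form is shown:

```python
class Welford:
    """Welford (1962) online estimator of the mean and (population) variance."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses the old and the new mean

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 0 else 0.0

# Track the reward statistics of a single state-action pair
w = Welford()
for r in (0.0, 1.0, 1.0, 0.0):
    w.update(r)
print(w.mean, w.variance)   # 0.5 0.25
```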

4.1 Description of Experimental Environments

RiverSwim RiverSwim consists of six states arranged in a chain, as shown in Figure 1 of Osband et al. (2013). The agent begins at the far left state and, at every round, has the choice to swim left or right. Swimming left (with the current) is always successful, but swimming right (against the current) often fails. The agent receives a small reward for reaching the leftmost state, but the optimal policy is to attempt to swim right and receive a much larger reward. The transitions are the same as in Osband et al. (2013). To make the problem a little tougher, we increased the reward of the leftmost state and adjusted the reward of the rightmost state accordingly. This decreases the difference in value between the optimal and sub-optimal policies, so as to make it harder for an agent to distinguish between them. Figure 1(a) shows the results.

GameOfSkill-v1

This environment is inspired by real-world scenarios in which a) one needs to take a succession of decisions before receiving any explicit feedback, and b) taking a wrong decision can undo part of the right decisions taken so far.

In a more abstract way, this environment consists of 20 states arranged in a chain, with two actions available at each state (left and right). Taking the left action always succeeds and transits to the previous state. However, going right from a state only succeeds with some probability, and otherwise one stays in the same state. The reward at the leftmost state for the action left is small, whereas the reward at the rightmost state for the action right is much larger. All other rewards are zero.
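
A sketch of this environment is given below; since the success probability of the right action and the two non-zero rewards are not specified in this excerpt, the values used are hypothetical placeholders (the `step` interface matches the earlier sketches):

```python
import numpy as np

class GameOfSkillV1:
    """Chain of n states; left always succeeds, right succeeds only with
    probability p_right. The parameter values below are placeholders, not
    the ones used in the paper."""
    def __init__(self, n_states=20, p_right=0.5, r_left=0.1, r_right=1.0, seed=0):
        self.n, self.p_right = n_states, p_right
        self.r_left, self.r_right = r_left, r_right
        self.rng = np.random.default_rng(seed)
        self.state = 0                           # start at the leftmost state

    def step(self, action):                      # action: 0 = left, 1 = right
        reward = 0.0
        if action == 0:                          # left always succeeds
            if self.state == 0:
                reward = self.r_left             # small reward at the leftmost state
            self.state = max(0, self.state - 1)
        else:                                    # right succeeds with prob p_right
            if self.state == self.n - 1:
                reward = self.r_right            # large reward at the rightmost state
            elif self.rng.random() < self.p_right:
                self.state += 1
        return reward, self.state
```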

GameOfSkill-v2

This is essentially the same as GameOfSkill-v1, with the difference that going left now sends you back to the leftmost state rather than just the previous state.

Bandits

This is a standard bandit problem with two arms. One arm draws rewards from a Beta distribution, while the other always gives a fixed reward. Figure 1(c) shows the results in this environment.

(a) RiverSwim
(b) GameOfSkill-v1
(c) Bandits
(d) GameOfSkill-v2
Figure 1: Time evolution of average regret for UCRL-V, TSDE, KL-UCRL, and UCRL2.

Results and Discussion.

Figure 1(c) shows an important result, since solving a larger MDP involves solving many such bandit problems. Figure 1(c) also illustrates the main reason why UCRL-V enjoys a better regret: it is able to efficiently exploit the easiness of the bandit problem tested. In contrast, UCRL2 does not exploit the structure of the problem at hand and instead obtains a problem-independent performance. Both KL-UCRL and TSDE are also able to exploit the problem structure, but are outperformed by UCRL-V.

The results on the six-state RiverSwim MDP in Figure 1(a) illustrate the same story as the bandit problem for UCRL-V compared to UCRL2 and KL-UCRL. However, TSDE outperforms UCRL-V, and much of the gain comes from the first rounds. It seems that TSDE quickly moves to the seemingly good region of the state space without properly checking the apparently bad region. This can lead to catastrophic behavior, as illustrated by the results on the more challenging GameOfSkill environments.

In both GameOfSkill environments (Figures 1(b) and 1(d)), UCRL-V significantly outperforms all other algorithms. Indeed, UCRL-V spends the first rounds trying to learn the games and is able to do so in a reasonable time. Comparatively, TSDE never tries to learn the game; instead, it quickly decides to play the region of the state space that is apparently the best. However, this region turns out to be the worst one, and TSDE never recovers. Both KL-UCRL and UCRL2 attempt to learn the game: UCRL2 does not complete its learning before the end of the game, while KL-UCRL takes much longer than UCRL-V to learn.

5 Conclusion

Leveraging the empirical variance of the rewards and transition functions to compute the upper confidence bounds provides more control over the optimism used in the UCRL-V algorithm. This yields a narrower set of statistically plausible MDPs. Together with the modified extended value iteration and an extended doubling trick based on the average number of states doubled per episode, it provides UCRL-V with a near-optimal regret guarantee based on the empirical Bernstein inequalities (Maurer and Pontil, 2009). As UCRL-V achieves the Õ(√(DSAT)) bound on worst-case regret, it closes a gap in the literature following the lower bound proof of Jaksch et al. (2010). Experimental analysis over four different environments illustrates that UCRL-V is strictly better than the state-of-the-art algorithms.

In light of the results of KL-UCRL and the relation between the KL-divergence and the variance, we would like to explore whether a variant of KL-UCRL can guarantee a near-optimal regret. It would also be interesting to explore the possibility of guaranteeing a near-optimal regret bound for posterior sampling. Finally, it would be interesting to explore how the ideas behind UCRL-V can be reused in non-tabular settings, such as with linear function approximation or deep learning.

References

  • Audibert, J., Munos, R., and Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. In ALT, volume 4754 of Lecture Notes in Computer Science, pages 150–165. Springer.
  • Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449.
  • Bartlett, P. L. and Tewari, A. (2009). REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, pages 35–42. AUAI Press.
  • Brafman, R. I. and Tennenholtz, M. (2002). R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231.
  • Cesa-Bianchi, N. and Gentile, C. (2008). Improved risk tail bounds for on-line algorithms. IEEE Transactions on Information Theory, 54(1):386–390.
  • Filippi, S., Cappé, O., and Garivier, A. (2010). Optimism in reinforcement learning and Kullback-Leibler divergence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 115–122. IEEE.
  • Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. (2018). Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020.
  • Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.
  • Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer.
  • Maurer, A. and Pontil, M. (2009). Empirical Bernstein bounds and sample-variance penalization. In COLT.
  • McGovern, A. and Sutton, R. S. (1998). Macro-actions in reinforcement learning: An empirical analysis. Computer Science Department Faculty Publication Series, page 15.
  • Osband, I., Russo, D., and Van Roy, B. (2013). (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011.
  • Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. (2017). Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342.
  • Peel, T., Anthoine, S., and Ralaivola, L. (2010). Empirical Bernstein inequalities for U-statistics. In Advances in Neural Information Processing Systems, pages 1903–1911.
  • Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704. ACM.
  • Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484.
  • Strehl, A. L. and Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331.
  • Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.
  • Welford, B. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420.
  • Wolfram Research, Inc. (2019). Mathematica, Version 12.0. Champaign, IL.

Appendix A Notations

T: Length of the time horizon
Regret(T): Regret for a given horizon T
M: Original MDP
M̃: Optimistic MDP
D: Diameter of the original MDP
m: Total number of episodes
ℓ_k: Length of episode k
p(·|s, a): Transition kernel of the original MDP given state s and action a
p̃(·|s, a): Optimistic transition kernel given state s and action a
p̂(·|s, a): Empirical average of the transition kernel given state s and action a
r(s, a): Reward in the original MDP given state s and action a
r̃(s, a): Optimistic reward given state s and action a
r̂(s, a): Empirical average of the rewards given state s and action a
ν_k(s): Number of times state s is visited in episode k
u_k: Final value function obtained after the iterations of Algorithm 2 in episode k for the set of all plausible MDPs
Defined by Equation (50)
Table 1: Table of Notations

Appendix B Proofs of Section 3 (Theoretical Analysis)

All the proof sketches assume bounded rewards in [0, 1].

B.1 Proof of UCRL-V

The proof of UCRL-V relies on a generic proof, provided in Section B.2, that applies to any algorithm with the same structure as UCRL-V whose plausible set contains the true model with high probability, satisfies Assumption 1, and whose error is bounded in a specific form.

As a result, in this section we simply show that UCRL-V satisfies the assumptions of the generic proof of Section B.2. For that, we have to show that our plausible set contains the true rewards and transitions for each state-action pair with high probability, and then express the maximum errors in the required form. We start with the rewards and then move on to the transitions.

For the rewards, using Theorem 3, we have, for each state-action pair, with high probability:

(8)

The last inequality follows by simple upper-bounding of the remaining terms. Similarly, using Theorem 3 for the transitions of each state-action pair, replacing the sample variance by the true variance using Theorem 4 and applying the union bound in Fact 10, we have with high probability (individually for each state-action pair and each subset of next states W):

(9)
(10)
(11)
(12)
(13)

Furthermore, let us observe that the bounds in (8) and (13) hold trivially when the number of samples is small, since the second term is then greater than the range of the quantity being estimated. This means the bounds hold for any number of samples.

As a result, the proof in Section B.2 applies where

(14)
(15)
(16)
(17)

B.2 Generic Proof for the Regret Bound

In this section, we prove, in a generic way, the regret bound for Algorithm 1. In particular, the following proof applies to any method that uses Algorithm 1 with a set of plausible models whose bounds have the following properties:

  1. The bounds satisfy Assumption 1.

where w.h.p. means with probability at least 1 − δ, r and p are respectively the rewards and transition probabilities of the true model, and r̂ and p̂ are the empirical mean observations of r and p respectively.

such that

(18)
(19)

Proof Overview. We start, similarly to Jaksch et al. (2010), by decomposing the regret into its main parts, as shown in Lemma 1.

In Lemma 2, we show how to bound the second part. One of the main ideas in the proof is to assign the states, based on their values, to an infinite set of bins constructed so that the ratio between the upper and lower endpoint of each bin is 2. This construction, together with Proposition 3, which links the transitions, values and expected number of visits in episodes of a bounded number of rounds, allows us to remove a factor compared to UCRL2. The result in Lemma 2 is based on a relation between the number of visits in the true but unknown MDP and the expected number of visits in such episodes in the optimistic MDP (Proposition 5). However, we do not directly use the expected number of visits to remove this factor, because it can be too pessimistic and would not allow us to improve the remaining dependencies. Indeed, a careful analysis of Proposition 3 shows that it can be loose when the transition probabilities are already very small. Instead, we define a new variable (Equation (50)), which can be seen as the exact quantity needed to remove the factor based on the transition probabilities. In Proposition 4, we relate this new quantity to the expected number of visits and show that, by carefully grouping the states together, the ratio between this quantity and the expected number of visits is much smaller than the naive bound. This is how we remove an additional factor compared to UCRL2.

In Lemma 3, we show how to bound the last part. The key idea is to use Bernstein-based martingale concentration inequalities instead of standard martingale inequalities. However, the adaptation is not trivial, since we had to carefully introduce one quantity in place of another inside the martingale. The key step is to avoid relating those two quantities through concentration inequalities; instead, we use established propositions related to the convergence of extended value iteration (Section E).

We first provide the proof assuming that the true model is inside our plausible set for each episode k. Later on, we conclude by showing that this assumption holds with high probability.

Lemma 1 (Regret decomposition).

With probability at least 1 − δ,