In this paper, we consider Reinforcement Learning (RL) in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. More formally, let denote an MDP where is a finite set of states and is a finite set of actions available at any state, with respective cardinalities and . The reward function and the transition kernel are respectively denoted by and . The game goes as follows: the learner starts in some state at time . At each time step , the learner chooses one action in her current state based on her past decisions and observations. When executing action in state , the learner receives a random reward drawn independently from distribution with support and mean . The state then transits to a next state sampled with probability , and a new decision step begins. As the transition probabilities and reward functions are unknown, the learner has to learn them by trying different actions and recording the realized rewards and state transitions. We refer to standard textbooks (Sutton and Barto, 1998; Puterman, 2014) for background material on RL and MDPs.
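As a minimal sketch of the interaction protocol described above, the following simulates a single reset-free stream on a toy MDP. The two-state kernel `P`, the means `mu`, the Bernoulli reward model, and the fixed policy are illustrative assumptions, not part of the original setting, which only requires rewards supported on a bounded interval and allows the learner to use her whole history.

```python
import numpy as np

def run_stream(P, mu, policy, T, s0=0, seed=0):
    """Simulate T steps of one uninterrupted stream (no reset).

    P[s, a] is the next-state distribution, mu[s, a] the mean reward.
    Rewards are drawn as Bernoulli(mu[s, a]) (an assumption for illustration).
    """
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    s, history = s0, []
    for _ in range(T):
        a = policy(s)                                   # decision in the current state
        r = float(rng.random() < mu[s, a])              # random reward with mean mu[s, a]
        s_next = int(rng.choice(n_states, p=P[s, a]))   # transition to the next state
        history.append((s, a, r, s_next))
        s = s_next                                      # the stream continues from s_next
    return history

# Hypothetical toy MDP with 2 states and 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
mu = np.array([[0.1, 0.0], [0.5, 1.0]])
history = run_stream(P, mu, policy=lambda s: 1, T=1000)
total_reward = sum(r for _, _, r, _ in history)
```

The recorded tuples are exactly the information a learner has access to: the realized rewards and state transitions, never `P` or `mu` themselves.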
The performance of the learner can be quantified through the notion of regret, which compares the reward collected by the learner (or the algorithm) to that obtained by an oracle always following an optimal policy, where a policy is a mapping from states to actions. More formally, let denote a possibly stochastic policy. We further introduce the notation , and to denote the function . Likewise, let denote the mean reward after choosing action in step .
[Expected cumulative reward] The expected cumulative reward of policy when run for steps from initial state is defined as
where , , and finally .
[Average gain and bias] Let us introduce the average transition operator . The average gain and bias function are defined by
The previous definition requires some mild assumption on the MDP for the limits to make sense. It is shown (see, e.g., (Puterman, 2014)) that the average gain achieved by executing a stationary policy in a communicating MDP is well-defined and further does not depend on the initial state, i.e., . For this reason, we restrict our attention to such MDPs in the rest of this paper. Furthermore, let denote an optimal policy, that is (Footnote 1: The maximum is reached since there are only finitely many deterministic policies.)
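To illustrate why the gain of a stationary policy does not depend on the initial state, the sketch below computes the gain of a fixed policy on a toy two-state chain in two ways: through the stationary distribution, and through a Cesàro average of expected rewards started from each state. The chain `P_pi` and rewards `mu_pi` are hypothetical, and the computation assumes the induced chain has a unique stationary distribution.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Solve d P = d with sum(d) = 1 (assumes a unique stationary distribution)."""
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def average_gain(P_pi, mu_pi):
    """Long-run average reward of the chain induced by a stationary policy."""
    return float(stationary_distribution(P_pi) @ mu_pi)

# Hypothetical chain and mean rewards induced by some fixed policy.
P_pi = np.array([[0.9, 0.1], [0.5, 0.5]])
mu_pi = np.array([0.0, 1.0])
gain = average_gain(P_pi, mu_pi)

# Cesaro average of expected rewards over T steps, from each initial state.
T = 5000
expected = np.zeros(2)
dist = np.eye(2)            # row i: state distribution when starting from state i
for _ in range(T):
    expected += dist @ mu_pi
    dist = dist @ P_pi
cesaro = expected / T
```

Both entries of `cesaro` approach the same value `gain`, with a gap that vanishes at rate 1/T.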
[Regret] We define the regret of any learning algorithm after steps as
and with is a sequence generated by the optimal strategy.
By an application of Azuma-Hoeffding’s inequality for bounded random martingales, it is immediate to show that with probability higher than ,
Thus, following (Jaksch et al., 2010), it makes sense to focus on the control of the middle term in brackets only. This leads us to consider the following notion of regret, which we may call effective regret:
To date, several algorithms have been proposed in order to minimize the regret based on the optimism in the face of uncertainty principle, coming from the literature on stochastic multi-armed bandits (see (Robbins, 1952)). Algorithms designed based on this principle typically maintain confidence bounds on the unknown reward and transition distributions, and choose an optimistic model that leads to the highest average long-term reward. One of the first algorithms based on this principle for MDPs is due to (Burnetas and Katehakis, 1997), which is shown to be asymptotically optimal. Their proposed algorithm uses the Kullback-Leibler (KL) divergence to define confidence bounds for transition probabilities. Subsequent studies by (Tewari and Bartlett, 2008), (Auer and Ortner, 2007), (Jaksch et al., 2010), and (Bartlett and Tewari, 2009) propose algorithms that maintain confidence bounds on the transition kernel defined by the $L_1$ or total variation norm. The use of the $L_1$ norm, instead of the KL-divergence, allows one to describe the uncertainty of the transition kernel by a polytope, which in turn brings computational advantages and ease in the regret analysis. On the other hand, such polytopic models are typically known to provide poor representations of underlying uncertainties; we refer to the literature on the robust control of MDPs with uncertain transition kernels, e.g., (Nilim and El Ghaoui, 2005), and more appropriately to (Filippi et al., 2010). Indeed, as argued in (Filippi et al., 2010), optimistic models designed with the $L_1$ norm suffer from two shortcomings: (i) the optimistic model could lead to inconsistent models by assigning a zero mass to an already observed element, and (ii) due to the polytopic shape of $L_1$-induced confidence bounds, the maximizer of a linear optimization over the $L_1$ ball could significantly vary for a small change in the value function, thus resulting in sub-optimal exploration (we refer to the discussion and illustrations on pages 120–121 in (Filippi et al., 2010)).
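Shortcoming (i) can be reproduced on a toy example. The sketch below implements the greedy maximization of a linear function over an L1 ball around an empirical distribution, in the spirit of the inner step of Extended Value Iteration in (Jaksch et al., 2010). The function name, the numerical values, and the radius are assumptions chosen for illustration.

```python
import numpy as np

def l1_optimistic_transition(p_hat, values, radius):
    """Maximize sum_s p(s) * values[s] over {p : ||p - p_hat||_1 <= radius}.

    Greedy scheme: move up to radius/2 of mass onto the best state, then
    remove the excess mass from the worst states first.
    """
    order = np.argsort(-values)            # states sorted by decreasing value
    p = np.asarray(p_hat, dtype=float).copy()
    best = order[0]
    p[best] = min(1.0, p[best] + radius / 2.0)
    for s in order[::-1]:                  # start from the lowest-valued state
        excess = p.sum() - 1.0
        if excess <= 1e-12:
            break
        p[s] = max(0.0, p[s] - excess)
    return p

# Hypothetical empirical distribution: state 2 HAS been observed (mass 0.1).
p_hat = np.array([0.5, 0.4, 0.1])
values = np.array([1.0, 0.5, 0.0])
p_opt = l1_optimistic_transition(p_hat, values, radius=0.4)
```

Here the optimistic distribution puts zero mass on state 2 although a transition to it has been observed, which is the inconsistency discussed above.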
Both of these shortcomings are avoided by making use of the KL-divergence and properties of the corresponding KL-ball. In (Filippi et al., 2010), the authors introduce the KL-Ucrl algorithm that modifies the Ucrl2 algorithm of (Jaksch et al., 2010) by replacing norms with KL divergences in order to define the confidence bound on transition probabilities. Further, they provide an efficient way to carry out linear optimization over the KL-ball, which is necessary in each iteration of the Extended Value Iteration. Despite these favorable properties and the strictly superior performance in numerical experiments (even for very short time horizons), the best known regret bound for KL-Ucrl matches that of Ucrl2. Hence, from a theoretical perspective, the potential gain from using the KL-divergence to define confidence bounds for the transition function has remained largely unexplored. The goal of this paper is to investigate this gap.
In this paper we provide a new regret bound for KL-Ucrl scaling as for ergodic MDPs with states, actions, and diameter . Here, denotes the variance of the optimal bias function of the true (unknown) MDP with respect to next state distribution under state-action . This bound improves over the best previous bound of for KL-Ucrl as . Interestingly, in several examples and actually is comparable to . Our numerical experiments on typical MDPs further confirm that could be orders of magnitude smaller than . To prove this result, we provide novel transportation concentration inequalities inspired by the transportation method that relate the so-called transportation cost under two discrete probability measures to the KL-divergence between the two measures and the associated variances. To the best of our knowledge, these inequalities are new and of independent interest. To complete our result, we provide a new minimax regret lower bound of order , where . In view of the new minimax lower bound, the reported regret bound for KL-Ucrl can be improved by only a factor .
RL in unknown MDPs under the average-reward criterion dates back to the seminal papers by (Graves and Lai, 1997) and (Burnetas and Katehakis, 1997), followed by (Tewari and Bartlett, 2008). Among these studies, for the case of ergodic MDPs, (Burnetas and Katehakis, 1997) derive an asymptotic MDP-dependent lower bound on the regret and provide an asymptotically optimal algorithm. Algorithms with finite-time regret guarantees and for a wider class of MDPs are presented by (Auer and Ortner, 2007), (Jaksch et al., 2010; Auer et al., 2009), (Bartlett and Tewari, 2009), (Filippi et al., 2010), and (Maillard et al., 2014).
Ucrl2 and KL-Ucrl achieve a regret bound with high probability in communicating MDPs, for any unknown time horizon. Regal obtains a regret with high probability in the larger class of weakly communicating MDPs, provided that we know an upper bound on the span of the bias function. It is however still an open problem to incorporate this knowledge into an implementable algorithm. The TSDE algorithm of (Ouyang et al., 2017) achieves a regret growing as for the class of weakly communicating MDPs, where is a given bound on the span of the bias function. In a recent study, (Agrawal and Jia, 2017) propose an algorithm based on posterior sampling for the class of communicating MDPs. Under the assumption of known reward function and known time horizon, their algorithm enjoys a regret bound scaling as , which constitutes the best known regret upper bound for learning in communicating MDPs and has tight dependencies on and .
We finally mention that some studies consider regret minimization in MDPs in the episodic setting, where the length of each episode is fixed and known; see, e.g., (Osband et al., 2013), (Gheshlaghi Azar et al., 2017), and (Dann et al., 2017). Although RL in the episodic setting bears some similarities to the average-reward setting, the techniques developed in these papers strongly rely on the fixed length of the episode, which is assumed to be small, and do not directly carry over to the case of undiscounted RL considered here.
2 Background Material and The KL-Ucrl Algorithm
In this section, we recall some basic material on undiscounted MDPs and then detail the KL-Ucrl algorithm.
[Bias and Gain] The gain and bias function satisfy the following relations
According to the standard terminology, we say a policy is -improving if it satisfies . Applying the theory of MDPs (see, e.g., (Puterman, 2014)), it can be shown that any -improving policy is optimal and thus that we can choose to satisfy the following fundamental identity. (Footnote 2: The solution to this fixed-point equation is defined only up to an additive constant. Some people tend to use this equation in order to define and , but this is a bad habit that we avoid here.) (Footnote 3: Throughout this paper, we may use (resp. ) and (resp. ) interchangeably.)
We now recall the definition of diameter and mixing time as we consider only MDPs with finite diameter or mixing time. [Diameter (Jaksch et al., 2010)] Let denote the first hitting time of state when following stationary policy from initial state . The diameter of an MDP is defined as
[Mixing time (Auer and Ortner, 2007)] Let denote the Markov chain induced by the policy in an ergodic MDP and let represent the hitting time of . The mixing time of is defined as
For convenience, we also introduce, for any function defined on , its span defined by . It actually acts as a semi-norm (see (Puterman, 2014)).
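As a small sanity check of the semi-norm claim, the sketch below takes the span of a function on a finite set to be its maximum minus its minimum (the usual definition, assumed here since the displayed formula is not reproduced above) and evaluates it on hypothetical vectors `f` and `g`.

```python
import numpy as np

def span(f):
    """Span of a function on a finite set: max(f) - min(f)."""
    f = np.asarray(f, dtype=float)
    return float(f.max() - f.min())

# Hypothetical functions on a 3-element state space.
f = np.array([0.2, 1.5, -0.3])
g = np.array([1.0, 0.0, 2.0])

shift_invariant = np.isclose(span(f + 10.0), span(f))        # constants are invisible
homogeneous = np.isclose(span(2.0 * f), 2.0 * span(f))       # absolute homogeneity
triangle = span(f + g) <= span(f) + span(g) + 1e-12          # triangle inequality
constant_has_zero_span = span(np.full(3, 7.0)) == 0.0        # why it is only a semi-norm
```

The last line shows why the span is only a semi-norm: every constant function has zero span, which matches the fact that a bias function is defined up to an additive constant.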
We finally introduce the following quantity that appears in the known problem-dependent lower-bounds on the regret, and plays the analogue of the mean gap in the bandit literature.
[Sub-optimality gap] The sub-optimality of action at state is
Note importantly that is defined in terms of the bias of the optimal policy . Indeed, it can be shown that minimizing the effective regret (in expectation) is essentially equivalent to minimizing the quantity , where is the total number of steps when action has been played in state . More precisely, it is not difficult to show (see Appendix E for completeness) that for any stationary policy and all ,
The KL-Ucrl algorithm.
The KL-Ucrl algorithm (Filippi et al., 2010; Filippi, 2010) is a model-based algorithm inspired by Ucrl2 (Jaksch et al., 2010). To present the algorithm, we first describe how it defines, at each given time , the set of plausible MDPs based on the observations available at that time. To this end, we introduce the following notation. Under a given algorithm and for a state-action pair , let denote the number of visits, up to time , to : . Then, let . Similarly, denotes the number of visits to , up to time , followed by a visit to state . We introduce the empirical estimates of transition probabilities and rewards:
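The counts and empirical estimates can be sketched as follows. The history format `(state, action, reward, next_state)`, the function name, and the convention of dividing by `max(1, N)` for unvisited pairs are assumptions made for illustration.

```python
import numpy as np

def empirical_estimates(history, n_states, n_actions):
    """Visit counts and empirical transition/reward estimates from one stream."""
    N_sas = np.zeros((n_states, n_actions, n_states))   # visits to (s, a) followed by s'
    R_sum = np.zeros((n_states, n_actions))             # cumulated rewards per (s, a)
    for s, a, r, s_next in history:
        N_sas[s, a, s_next] += 1
        R_sum[s, a] += r
    N_sa = N_sas.sum(axis=2)                            # visits to (s, a)
    denom = np.maximum(N_sa, 1)                         # avoid division by zero (assumed convention)
    p_hat = N_sas / denom[:, :, None]
    mu_hat = R_sum / denom
    return N_sa, p_hat, mu_hat

# Hypothetical short history of (state, action, reward, next_state) tuples.
history = [(0, 0, 1.0, 1), (1, 1, 0.0, 0), (0, 0, 0.0, 0), (0, 0, 1.0, 1)]
N_sa, p_hat, mu_hat = empirical_estimates(history, n_states=2, n_actions=2)
```

With this convention, an unvisited pair gets an all-zero row in `p_hat`, so any confidence set built on it must be wide enough to remain uninformative there.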
KL-Ucrl, as an optimistic model-based approach, considers the set as a collection of all MDPs , whose transition kernels and reward functions satisfy:
where denotes the mean of , and where , with and , and . Importantly, as proven in (Filippi et al., 2010, Proposition 1), with probability at least , the true MDP belongs to the set uniformly over all time steps .
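A hedged sketch of the membership test for the transition part of the set of plausible MDPs is given below. It assumes the constraint takes the form KL(p_hat, q) <= c_p / N, as in KL-based confidence sets. The threshold `c_p` is a hypothetical placeholder for the constant defined above, and the function names are ours.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p, q) for discrete distributions, with the convention 0 * log(0 / q) = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0
    if np.any(q[support] == 0):
        return float("inf")        # q forbids an outcome that p allows
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

def in_kl_ball(p_hat, q, n_visits, c_p):
    """Is q a plausible transition law given p_hat built from n_visits samples?"""
    return kl_divergence(p_hat, q) <= c_p / max(1, n_visits)

p_hat = np.array([0.5, 0.4, 0.1])
close = np.array([0.45, 0.4, 0.15])
drops_observed = np.array([0.6, 0.4, 0.0])   # zero mass on an observed state

plausible_early = in_kl_ball(p_hat, close, n_visits=50, c_p=2.0)
plausible_late = in_kl_ball(p_hat, close, n_visits=500, c_p=2.0)
plausible_inconsistent = in_kl_ball(p_hat, drops_observed, n_visits=1, c_p=100.0)
```

Since the divergence is infinite whenever a candidate law puts zero mass on an observed next state, such a law is never plausible, whatever the threshold. This is how the KL-ball avoids the first shortcoming of polytopic confidence sets.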
Similarly to Ucrl2, KL-Ucrl proceeds in episodes of varying lengths; see Algorithm 1. We index an episode by . The starting time of the -th episode is denoted , and by a slight abuse of notation, let , , , and . At , the algorithm forms the set of plausible MDPs based on the observations gathered so far. It then defines an extended MDP , where for an extended action , it defines and . Then, a -optimal extended policy is computed in the form , in the sense that it satisfies
where denotes the gain of policy in MDP . and are respectively called the optimistic MDP and the optimistic policy. Finally, an episode stops at the first step when the number of local counts exceeds for some . We denote with some abuse .
The value is a parameter of Extended Value Iteration and is only here for computational reasons: with sufficient computational power, it could be replaced with .
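The episodic structure can be sketched as follows, assuming a Ucrl2-style stopping rule: an episode ends as soon as the within-episode count of some state-action pair reaches `max(1, N)`, where `N` is its count at the start of the episode. The exact threshold used by KL-Ucrl is the one stated above. The rule and names below are illustrative assumptions.

```python
from collections import defaultdict

def episode_starts(visited_pairs):
    """Return the start times of episodes for a given sequence of (s, a) visits."""
    N = defaultdict(int)    # counts at the start of the current episode
    v = defaultdict(int)    # local counts within the current episode
    starts = [0]
    for t, sa in enumerate(visited_pairs):
        v[sa] += 1
        if v[sa] >= max(1, N[sa]):          # assumed doubling-type criterion
            for key, count in v.items():    # fold local counts into global ones
                N[key] += count
            v.clear()
            if t + 1 < len(visited_pairs):
                starts.append(t + 1)        # a new episode begins at the next step
    return starts

# Hypothetical stream that keeps visiting the same state-action pair.
starts = episode_starts([(0, 0)] * 16)
```

On this degenerate stream the episode lengths are 1, 1, 2, 4, 8, which illustrates why the number of episodes grows only logarithmically with the number of steps under such a rule.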
3 Regret Lower Bound
In order to motivate the dependence of the regret on the local variance, we first provide the following minimax lower bound that exhibits this scaling. There exists an MDP with states and actions with , such that the expected regret under any algorithm after steps for any initial state satisfies
Let us recall that (Jaksch et al., 2010) present a minimax lower bound on the regret scaling as . Their lower bound follows by considering a family of hard-to-learn MDPs. To prove the above theorem, we also consider the same MDP instances as in (Jaksch et al., 2010) and leverage their techniques. We however show that a slightly different choice of transition probabilities for the problem instance leads to a lower bound scaling as , which does not depend on the diameter (the details are provided in the appendix).
We also remark that for the considered problem instance, easy calculations show that for any state-action pair , the variance of bias function satisfies for some constants and . Hence, the lower bound in Theorem 3 can serve as an alternative minimax lower bound without any dependence on the diameter.
4 Concentration Inequalities and The Kullback-Leibler Divergence
Before providing the novel regret bound for the KL-Ucrl algorithm, let us discuss some important tools that we use for the regret analysis. We believe that these results, which could also be of independent interest beyond RL, shed light on some of the challenges of the regret analysis.
Let us first recall a powerful result from mathematical statistics (we provide the proof in Appendix B for completeness) known as the transportation lemma; see, e.g., (Boucheron et al., 2013, Lemma 4.18):
[Transportation lemma] For any function , let us introduce . Whenever is defined on some possibly unbounded interval containing , define its dual . Then it holds
This result is especially interesting when is the empirical version of built from i.i.d. observations, since in that case it enables us to decouple the concentration properties of the distribution from the specific structure of the considered function. Further, it shows that controlling the KL divergence between and induces a concentration result valid for all (nice enough) functions , which is especially useful when we do not know in advance the function we want to handle (such as the bias function ).
The quantities , may look complicated. When (where ) is Gaussian, they coincide with . Controlling them in general is challenging. However for bounded functions, a Bernstein-type relaxation can be derived that uses the variance and the span :
[Bernstein transportation] For any function such that and are finite,
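As a numerical illustration of a Bernstein-type transportation bound, the sketch below checks on random discrete distributions that E_Q[f] - E_P[f] is at most sqrt(2 V_P[f] KL(Q, P)) + (2/3) S(f) KL(Q, P), where V_P[f] is the variance of f under P and S(f) its span. This specific form and its constants are an assumption on our side, obtained from the transportation lemma combined with Bernstein's bound on the moment generating function, and may differ from the exact statement of the corollary.

```python
import numpy as np

def kl(q, p):
    """KL(Q, P) = sum_x Q(x) log(Q(x) / P(x)) for strictly positive P."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def bernstein_transport(p, q, f):
    """Return (E_Q[f] - E_P[f], assumed Bernstein-type upper bound)."""
    mean_p = p @ f
    var_p = p @ (f - mean_p) ** 2          # variance of f under P
    span_f = f.max() - f.min()             # span of f
    d = kl(q, p)
    lhs = q @ f - mean_p
    rhs = np.sqrt(2.0 * var_p * d) + (2.0 / 3.0) * span_f * d
    return lhs, rhs

rng = np.random.default_rng(0)
checks = []
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))          # random "true" law
    q = rng.dirichlet(np.ones(5))          # random alternative law
    f = rng.uniform(-1.0, 1.0, size=5)     # random bounded function
    lhs, rhs = bernstein_transport(p, q, f)
    checks.append(lhs <= rhs + 1e-12)
```

The point of such a bound is that the leading term involves the variance of the function under P rather than its span, which is what makes a variance-aware regret analysis possible.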
We also provide below another variation of this result that is especially useful when the bounds of Corollary 4 cannot be handled, and that seems to be new (to the best of our knowledge):
[Transportation method II] Let be a probability distribution on a finite alphabet. Then, for any real-valued function defined on , it holds that
When is the transition law under a state-action pair and is its empirical estimate up to time , i.e. and , the first assertion in Corollary 4 can be used to decouple from the specific structure of . In particular, if is some bias function, then has a bounded span , and since , the first-order term exhibits the variance of . This would result in a term scaling as in our regret bound, where hides poly-logarithmic terms.
Now, for the case when and is the optimistic transition law at time , the second inequality in Corollary 4 allows us to bound by the variance of under law , which itself is controlled by the variance of under the true law . Using such an approach would lead to a term scaling as . We can remove the term scaling as in our regret analysis by resorting to Lemma 4 instead, in combination with the following property of the operator : Consider two distributions with . Then, for any real-valued function defined on , it holds that
5 Variance-Aware Regret Bound for KL-Ucrl
In this section, we present a regret upper bound for KL-Ucrl that leverages the results presented in the previous section. Let denote the span of the bias function, and for any define as the variance of the bias function under law .
Let denote the optimal policy in the extended MDP , whose gain satisfies . We consider a variant of KL-Ucrl, which computes, in every episode , a policy satisfying: , and . (Footnote 4: We study such a variant to facilitate the analysis and presentation of the proof. This variant of KL-Ucrl may be computationally less efficient than Algorithm 1. We stress however that, in view of the number of episodes (growing as ) as well as Remark 2, with sufficient computational power such an algorithm could be practical.)
In the following theorem, we provide a refined regret bound for KL-Ucrl:
[Variance-aware regret bound for KL-Ucrl] With probability at least , the regret under KL-Ucrl for any ergodic MDP and for any initial state satisfies
where hides the terms scaling as . Hence, with probability at least ,
If the cardinality of the set for state-action is known, then one can use the following improved confidence bound for the pair (instead of (3)):
where (see, e.g., (Filippi, 2010, Proposition 4.1) for the corresponding concentration result). As a result, if for all is known, it is then straightforward to show that the corresponding variant of KL-Ucrl, which relies on (5), achieves a regret growing as
The regret bound provided in the aforementioned remark is of particular importance in the case of sparse MDPs, where most states transit to only a few next-states under various actions. We would like to stress that to get an improvement of a similar flavour for Ucrl2, to the best of our knowledge, one has to know the sets for all rather than their cardinalities.
Sketch of proof of Theorem 5.
The detailed proof of this result is provided in Appendix C. In order to better understand it, we now provide a high-level sketch of the proof explaining the main steps of the analysis.
First note that by an application of Azuma-Hoeffding inequality, the effective regret is upper bounded by with probability at least . We proceed by decomposing the term on the episodes , where is the total number of episodes after steps. Introducing as the number of visits to during episode for any and , with probability at least we have
We focus on episodes such that , corresponding to valid confidence intervals, up to losing a probability only. In order to control , we use the decomposition
We refrain from using the fact that and instead use it as a slack later in the proof. We then introduce the bias function from the identity , and thus get
Term (a). The first term is controlled thanks to our variance-aware concentration inequalities:
The first inequality is obtained by Corollary 4 while the second one by a combination of Lemma 4 together with Lemma 4. We then relate to thanks to: For any episode such that , it holds that for any pair ,
It is then not difficult to show that this first term, when summed over all episodes, contributes to the regret as , where the term comes from the use of time-uniform concentration inequalities.
Term (b). We then turn to Term (b) and observe that it exhibits a martingale difference structure. Following the same reasoning as in (Jaksch et al., 2010) or (Filippi et al., 2010), the right way to control it is however to sum this contribution over all episodes so that a martingale difference sequence of deterministic terms appears, bounded by the deterministic quantity , since . This comes at the price of losing a constant error per episode. Now, since it can be shown that as for Ucrl2, we deduce that with probability higher than ,
Term (c). It thus remains to handle Term (c). To this end, we first partition the states into and its complementary set , and get
We thus need to control the difference of bias from above and from below. To that end, we note that by property of the bias function, it holds that