Real-world Reinforcement Learning (RL) problems often concern dynamical systems with large state and action spaces, which make the design of efficient algorithms extremely challenging. This difficulty is well illustrated by the known fundamental regret limits. The regret compares the accumulated reward of an optimal policy (aware of the system dynamics and reward function) to that of the algorithm considered, and it quantifies the loss incurred by the need to explore sub-optimal (state, action) pairs to learn the system dynamics and rewards. In online RL problems with undiscounted reward, regret lower bounds typically scale as $SA \log T$ or $\sqrt{SAT}$ (the first lower bound is asymptotic in $T$ and problem-specific, the second is minimax; for simplicity, we ignore here the dependence of these bounds on the diameter, bias span, and action sub-optimality gap), where $S$, $A$, and $T$ denote the sizes of the state and action spaces and the time horizon, respectively. Hence, with large state and action spaces, it is essential to identify and exploit any possible structure existing in the system dynamics and reward function so as to minimize exploration phases and in turn reduce regret to reasonable values. Modern RL algorithms actually implicitly impose some structural properties either in the model parameters (transition probabilities and reward function, see e.g. (Ortner and Ryabko, 2012)) or directly in the $Q$-function (for discounted RL problems, see e.g. (Mnih et al., 2015)). Despite the successes of these recent algorithms, our understanding of structured RL problems remains limited.
In this paper, we explore structured RL problems with finite state and action spaces. We first derive problem-specific regret lower bounds satisfied by any algorithm for RL problems with an arbitrary structure. These lower bounds are instrumental in devising algorithms that optimally balance exploration and exploitation, i.e., that achieve the regret fundamental limits. A similar approach has been recently applied with success to stochastic bandit problems, where the average reward of arms exhibits structural properties, e.g. unimodality (Combes and Proutiere, 2014), Lipschitz continuity (Magureanu et al., 2014), or more general properties (Combes et al., 2017). Extending these results to RL problems is highly non-trivial, and to our knowledge, this paper is the first to provide problem-specific regret lower bounds for structured RL problems. Although the results presented here concern ergodic RL problems with undiscounted reward, they could be easily generalized to discounted problems (under an appropriate definition of regret).
Our contributions are as follows:
1. For ergodic structured RL problems, we derive problem-specific regret lower bounds. The latter are valid for any structure (but are structure-specific), and for unknown system dynamics and reward function.
2. We analyze the lower bounds for unstructured MDPs, and show that they scale at most as , where and represent the span of the bias function and the minimal state-action sub-optimality gap, respectively. These results extend previously known regret lower bounds derived in the seminal paper (Burnetas and Katehakis, 1997) to the case where the reward function is unknown.
3. We further study the regret lower bounds in the case of Lipschitz MDPs. Interestingly, these bounds are shown to scale at most as where and only depend on the Lipschitz properties of the transition probabilities and reward function. This indicates that when and do not scale with the sizes of the state and action spaces, we can hope for a regret growing logarithmically with the time horizon, and independent of $S$ and $A$.
4. We propose DEL, an algorithm that achieves our regret fundamental limits for any structured MDP. DEL is rather complex to implement since it requires solving, in each round, an optimization problem similar to that providing the regret lower bounds. Fortunately, we were able to devise simplified versions of DEL, with regret scaling at most as and for unstructured and Lipschitz MDPs, respectively. In the absence of structure, DEL, in its simplified version, does not require computing action indexes as done in OLP (Tewari and Bartlett, 2008), and yet achieves similar regret guarantees without the knowledge of the reward function. DEL, simplified for Lipschitz MDPs, only needs, in each step, to compute the optimal policy of the estimated MDP, as well as to solve a simple linear program.
5. Preliminary numerical experiments (presented in the appendix) illustrate our theoretical findings. In particular, we provide examples of Lipschitz MDPs for which the regret under DEL does not seem to scale with $S$ and $A$, and for which DEL significantly outperforms algorithms that do not exploit the structure.
2 Related Work
Regret lower bounds have been extensively investigated for unstructured ergodic RL problems. (Burnetas and Katehakis, 1997) provided a problem-specific lower bound similar to ours, but only valid when the reward function is known. Minimax regret lower bounds have been studied e.g. in (Auer et al., 2009) and (Bartlett and Tewari, 2009): in the worst case, the regret has to scale as $\sqrt{DSAT}$ where $D$ is the diameter of the MDP. In spite of these results, regret lower bounds for unstructured RL problems are still attracting some attention, see e.g. (Osband and Van Roy, 2016) for insightful discussions. To our knowledge, this paper constitutes the first attempt to derive regret lower bounds in the case of structured RL problems. Our bounds are asymptotic in the time horizon $T$, but we hope to extend them to finite time horizons using techniques similar to those recently used to provide such bounds for bandit problems (Garivier et al., Jun. 2018). These techniques address problem-specific and minimax lower bounds in a unified manner, and can be leveraged to derive minimax lower bounds for structured RL problems. However, we do not expect minimax lower bounds to be very informative about the regret gains that one may achieve by exploiting a structure (indeed, the MDPs leading to worst-case regret in unstructured RL comply with many structures).
There have been a plethora of algorithms developed for ergodic unstructured RL problems. We may classify these algorithms depending on their regret guarantees, either scaling as $\log T$ or $\sqrt{T}$. In the absence of structure, (Burnetas and Katehakis, 1997) developed an asymptotically optimal, but involved, algorithm. This algorithm has been simplified in (Tewari and Bartlett, 2008), but remains more complex than our proposed algorithm. Some algorithms have finite-time regret guarantees scaling as $\log T$ (Auer and Ortner, 2007), (Auer et al., 2009), (Filippi et al., 2010). For example, the authors of (Filippi et al., 2010) propose KL-UCRL, an extension of UCRL (Auer and Ortner, 2007) with regret bounded by . Having finite-time regret guarantees is arguably desirable, but so far this comes at the expense of a much larger constant in front of $\log T$. Algorithms with regret scaling as $\sqrt{T}$ include UCRL2 (Auer et al., 2009), KL-UCRL with regret guarantees , REGAL.C (Bartlett and Tewari, 2009) with guarantees . Recently, the authors of (Agrawal and Jia, 2017) managed to achieve a regret guarantee of , but only valid when .
Algorithms devised to exploit some known structure are most often applicable to RL problems with continuous state or action spaces. Typically, the transition probabilities and reward function are assumed to be smooth in the state and action, e.g. Lipschitz continuous (Ortner and Ryabko, 2012), (Lakshmanan et al., 2015). The regret then needs to scale as a power of $T$, e.g. in (Lakshmanan et al., 2015) for 1-dimensional state spaces. An original approach to RL problems for which the transition probabilities belong to some known class of functions was proposed in (Osband and Van Roy, 2014). The regret upper bounds derived there depend on the so-called Kolmogorov and eluder dimensions, which in turn depend on the chosen class of functions. Our approach to designing learning algorithms that exploit the structure is different from all aforementioned methods, as we aim at matching the problem-specific minimal exploration rates of sub-optimal (state, action) pairs.
3 Models and Objectives
We consider an MDP with finite state and action spaces and of respective cardinalities $S$ and $A$. and are the transition and reward kernels of . Specifically, when in state , taking action , the system moves to state with probability , and a reward drawn from distribution of average is collected. The rewards are bounded, w.l.o.g., in $[0,1]$. We assume that for any , is absolutely continuous w.r.t. some measure on $[0,1]$ ( can be the Lebesgue measure; alternatively, if rewards take values in $\{0,1\}$, can be the sum of Dirac measures at 0 and 1).
The random vector represents the state, the action, and the collected reward at step . A policy selects an action, denoted by , in step when the system is in state based on the history captured through , the $\sigma$-algebra generated by observed under : is -measurable. We denote by the set of all such policies.
Structured MDPs. The MDP is initially unknown. However, we assume that belongs to some well-specified set which may encode a known structure of the MDP. The knowledge of can be exploited to devise (more) efficient policies. The results derived in this paper are valid under any structure, but we pay particular attention to the cases of
(i) Unstructured MDPs: if for all , and (here is the set of distributions on and is the set of distributions on , absolutely continuous w.r.t. );
(ii) Lipschitz MDPs: if and are Lipschitz-continuous w.r.t. and in some metric space (we provide a precise definition in the next section).
The learning problem. The expected cumulative reward up to step of a policy when the system starts in state is where denotes the expectation under policy given that . Now assume that the system starts in state and evolves according to the initially unknown MDP for given structure . The objective is to devise a policy maximizing or, equivalently, minimizing the regret up to step defined as the difference between the cumulative reward of an optimal policy and that obtained under :
Preliminaries and notations. Let be the set of stationary (deterministic) policies, i.e. when in state , selects an action independent of . is communicating if each pair of states is connected by some policy. Further, is ergodic if under any stationary policy, the resulting Markov chain is irreducible. For any communicating and any policy , we denote by the gain of (or long-term average reward) started from initial state : . We denote by the set of stationary policies with maximal gain: , where . If is communicating, the maximal gain is constant and denoted by . The bias function of is defined by , and quantifies the advantage of starting in state . We denote by and , respectively, the Bellman operator under action and the optimal Bellman operator under . They are defined by: for any and ,
Then for any , and satisfy the evaluation equation: for all state , . Furthermore, if and only if and verify the optimality equation:
We denote by the bias function of an optimal stationary policy (in case is not unique, we arbitrarily select an optimal stationary policy and define ), and by its span . For , , and , let . For ergodic , is unique up to an additive constant. Hence, for ergodic , the set of optimal actions in state under is , and . Finally, we define for any state and action ,
This can be interpreted as the long-term regret obtained by initially selecting action in state (and then applying an optimal stationary policy) rather than following an optimal policy. The minimum gap is defined as .
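The quantities introduced in these preliminaries (gain, bias, Bellman operators, and sub-optimality gaps) can be computed numerically for a small ergodic MDP. The following is a minimal sketch, not the authors' code: it assumes a tabular representation where `p[x, a, y]` is the transition probability and `r[x, a]` the average reward, normalizes the bias so that it vanishes in state 0, and uses standard average-reward policy iteration. All function and variable names are hypothetical.

```python
import numpy as np

def evaluate_policy(p, r, policy):
    """Solve the evaluation equation g + h(x) = r(x, pi(x)) + sum_y p(y|x, pi(x)) h(y).

    The bias is only defined up to an additive constant, so h(0) is pinned to 0.
    Assumes the chain induced by the policy is irreducible.
    """
    n_states = r.shape[0]
    idx = np.arange(n_states)
    P_pi = p[idx, policy]   # (S, S) transition matrix under the policy
    r_pi = r[idx, policy]   # (S,) average rewards under the policy
    # Unknowns are (g, h(1), ..., h(S-1)).
    A = np.hstack([np.ones((n_states, 1)), (np.eye(n_states) - P_pi)[:, 1:]])
    z = np.linalg.solve(A, r_pi)
    return z[0], np.concatenate(([0.0], z[1:]))

def bellman_all_actions(p, r, h):
    """Entry (x, a) is r(x, a) + sum_y p(y|x, a) h(y), i.e. the Bellman operator under a."""
    return r + p @ h

def optimal_gain_bias(p, r, max_iter=1000):
    """Average-reward policy iteration; returns (gain, bias, policy)."""
    n_states = r.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        gain, bias = evaluate_policy(p, r, policy)
        q = bellman_all_actions(p, r, bias)
        # Switch actions only on strict improvement, to avoid cycling on ties.
        better = q.max(axis=1) > q[idx, policy] + 1e-10
        if not better.any():
            break
        policy = np.where(better, q.argmax(axis=1), policy)
    return gain, bias, policy

def gaps(p, r, bias):
    """Gap of (x, a): optimal Bellman operator minus Bellman operator under a, at the optimal bias."""
    q = bellman_all_actions(p, r, bias)
    return q.max(axis=1, keepdims=True) - q

# Toy ergodic MDP (an assumption for illustration): 2 states, 2 actions,
# uniform transitions regardless of the state and action.
p = np.full((2, 2, 2), 0.5)
r = np.array([[0.2, 0.8], [0.5, 0.1]])
gain, bias, policy = optimal_gain_bias(p, r)
delta = gaps(p, r, bias)
span = bias.max() - bias.min()
```

On this toy example the optimal policy plays action 1 in state 0 and action 0 in state 1, and because transitions do not depend on the action, each gap reduces to the difference in immediate average rewards.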
We denote by . The set of MDPs is equipped with the following -norm: where .
The proofs of all results are presented in the appendix.
4 Regret Lower Bounds
In this section, we present an (asymptotic) regret lower bound satisfied by any uniformly good learning algorithm. An algorithm is uniformly good if for all ergodic , any initial state and any constant , the regret of satisfies .
To state our lower bound, we introduce the following notations. For and , we denote if the kernel of is absolutely continuous w.r.t. that of , i.e., , if . For and such that and , we define the KL-divergence between and in state-action pair as the KL-divergence between the distributions of the next state and collected reward if the state is and is selected under these two MDPs:
We further define the set of confusing MDPs as:
This set consists of MDP ’s that coincide with for state-action pairs where the actions are optimal (the kernels of and cannot be statistically distinguished under an optimal policy); and such that the optimal policies under are not optimal under .
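To make the KL-divergence between two MDPs at a given (state, action) pair concrete, here is a small sketch under an explicit assumption: rewards are Bernoulli, so that the reward part of the divergence is the binary KL-divergence between the two average rewards. The divergence is then the sum of the KL-divergence between the next-state distributions and that between the reward distributions. Function names are hypothetical.

```python
import math

def kl_categorical(p, q):
    """KL-divergence between two distributions on the same finite set.

    Returns infinity when p is not absolutely continuous w.r.t. q.
    """
    total = 0.0
    for p_y, q_y in zip(p, q):
        if p_y == 0.0:
            continue  # the term 0 * log(0 / q_y) is 0 by convention
        if q_y == 0.0:
            return math.inf
        total += p_y * math.log(p_y / q_y)
    return total

def kl_bernoulli(mean_a, mean_b):
    """KL-divergence between Bernoulli distributions with the given means."""
    return kl_categorical([mean_a, 1.0 - mean_a], [mean_b, 1.0 - mean_b])

def kl_state_action(p_phi, r_phi, p_psi, r_psi):
    """KL-divergence at one (state, action) pair between two MDPs.

    p_phi, p_psi: next-state distributions at that pair under each MDP.
    r_phi, r_psi: Bernoulli reward means at that pair under each MDP (assumption).
    """
    return kl_categorical(p_phi, p_psi) + kl_bernoulli(r_phi, r_psi)
```

A divergence of zero at a pair means the two MDPs cannot be statistically distinguished by sampling that pair, which is the first condition defining the set of confusing MDPs.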
Theorem 1. Let be ergodic. For any uniformly good algorithm and for any ,
where is the value of the following optimization problem:
The above theorem can be interpreted as follows. When selecting a sub-optimal action in state , one has to pay a regret of . Then the minimal number of times any sub-optimal action in state has to be explored scales as where solves the optimization problem (2). It is worth mentioning that our lower bound is tight, as we present in Section 5 an algorithm achieving this fundamental limit of regret.
The regret lower bound stated in Theorem 1 extends the problem-specific regret lower bound derived in (Burnetas and Katehakis, 1997) for unstructured ergodic MDPs with known reward function. Our lower bound is valid for unknown reward function, but also applies to any structure . Note however that at this point, it is only implicitly defined through the solution of (2), which seems difficult to solve. The optimization problem can actually be simplified, as shown later in this section, by providing useful structural properties of the feasibility set depending on the structure considered. The simplification will be instrumental to quantify the gain that can be achieved when optimally exploiting the structure, as well as to design efficient algorithms.
In the following, the optimization problem: is referred to as ; so that corresponds to (2).
The proof of Theorem 1 combines a characterization of the regret as a function of the number of times up to step (state, action) pair is visited, and of the ’s, and change-of-measure arguments as those recently used to prove in a very direct manner regret lower bounds in bandit optimization problems (Kaufmann et al., 2016). More precisely, for any uniformly good algorithm , and for any confusing MDP , we show that the exploration rates required to statistically distinguish from satisfy where the expectation is taken w.r.t. given any initial state . The theorem is then obtained by considering (hence optimizing the lower bound) all possible confusing MDPs.
4.1 Decoupled exploration in unstructured MDPs
In the absence of structure, , and we have:
Theorem 2. Consider the unstructured model , and let be ergodic. We have:
The theorem states that in the constraints of the optimization problem (2), we can restrict our attention to confusing MDPs that differ from the original MDP only for a single state-action pair . Further note that the condition is equivalent to saying that action becomes optimal in state under (see Lemma 1(i) in (Burnetas and Katehakis, 1997)). Hence to obtain the lower bound in unstructured MDPs, we may just consider confusing MDPs which make an initially sub-optimal action in state optimal by locally changing the kernels and rewards of at only. Importantly, this observation implies that an optimal algorithm must satisfy . In other words, the required levels of exploration of the various sub-optimal state-action pairs are decoupled, which significantly simplifies the design of optimal algorithms.
To get an idea of how the regret lower bound scales with the sizes of the state and action spaces, we can further provide an upper bound on the regret lower bound. One may easily observe that where
From this result, an upper bound of the regret lower bound is , and we can devise algorithms achieving this regret scaling (see Section 5).
Theorem 2 relies on the following decoupling lemma, actually valid under any structure .
Let be two non-overlapping subsets of the (state, action) pairs such that for all , . Define the following three MDPs in obtained starting from and changing the kernels for (state, action) pairs in . Specifically, let be some transition and reward kernels. For all , define , as
Then, if , then or .
4.2 Lipschitz structure
Lipschitz structures have been widely studied in the bandit and reinforcement learning literature. We find it convenient to use the following structure, although one could imagine other variants in more general metric spaces. We assume that the state (resp. action) space can be embedded in the (resp. ) dimensional Euclidean space: and . We consider MDPs whose transition kernels and average rewards are Lipschitz w.r.t. the states and actions. Specifically, let , , and
Here is the Euclidean distance, and for two distributions and on we denote by .
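As an illustration of what a Lipschitz structure on a finite MDP means in practice, the sketch below checks whether a tabular MDP satisfies conditions of the following assumed form: the $\ell_1$ distance between transition kernels, and the absolute difference between average rewards, at two (state, action) pairs are both bounded by `L * d(x, x')**alpha + L_prime * d(a, a')**alpha_prime`. This exact form, the constants, and all names are assumptions made for the example; the precise conditions (L1)-(L2) are those stated in the text.

```python
import numpy as np

def is_lipschitz_mdp(p, r, states, actions,
                     L=1.0, L_prime=1.0, alpha=1.0, alpha_prime=1.0, tol=1e-12):
    """Check assumed Lipschitz conditions on transitions and average rewards.

    p[x, a, y]: transition probabilities; r[x, a]: average rewards.
    states[x], actions[a]: Euclidean embeddings of states and actions.
    """
    n_states, n_actions = r.shape
    for x in range(n_states):
        for a in range(n_actions):
            for x2 in range(n_states):
                for a2 in range(n_actions):
                    d_state = np.linalg.norm(states[x] - states[x2])
                    d_action = np.linalg.norm(actions[a] - actions[a2])
                    bound = L * d_state ** alpha + L_prime * d_action ** alpha_prime
                    # l1 distance between the two next-state distributions
                    if np.abs(p[x, a] - p[x2, a2]).sum() > bound + tol:
                        return False
                    if abs(r[x, a] - r[x2, a2]) > bound + tol:
                        return False
    return True
```

The check is quadratic in the number of (state, action) pairs, which is acceptable for small tabular models used to build intuition.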
Theorem 3. For the model with Lipschitz structure (L1)-(L2), we have where is the set of satisfying for all such that ,
where we use the notation for . Furthermore, the optimal values and of and are upper bounded by where
The above theorem has important consequences. First, it states that exploiting the Lipschitz structure optimally, one may achieve a regret at most scaling as . This scaling is independent of the sizes of the state and action spaces provided that the minimal gap is fixed, and provided that the span does not scale with . The latter condition typically holds for fast mixing models or for MDPs with diameter not scaling with (refer to (Bartlett and Tewari, 2009) for a precise connection between and the diameter). Hence, exploiting the structure can really yield significant regret improvements. As shown in the next section, leveraging the simplified structure in , we may devise a simple algorithm achieving these improvements, i.e., having a regret scaling at most as .
5 Algorithms
In this section, we present DEL (Directed Exploration Learning), an algorithm that achieves the regret limits identified in the previous section. Asymptotically optimal algorithms for generic controlled Markov chains have already been proposed in (Graves and Lai, 1997), and could be adapted to our setting. By presenting DEL, we aim at providing simplified, yet optimal algorithms. Moreover, DEL can be adapted so that the exploration rates of sub-optimal actions are directed towards the solution of an optimization problem provided that (it suffices to use instead of in DEL). For example, in the case of Lipschitz structure , running DEL on yields a regret scaling at most as .
The pseudo-code of DEL with input parameter is given in Algorithm 2. There, for notational convenience, we abuse the notations and redefine as , and let . refers to the estimated MDP at time (using empirical transition rates and rewards). For any non-empty correspondence (i.e., for any , is a non-empty subset of ), let denote the restricted MDP where the set of actions available at state is . Then, and are the (optimal) gain and bias functions corresponding to the restricted MDP . Given a restriction defined by , for each , let and . For , let if , and let otherwise. For , we further define the set of confusing MDPs , and the set of feasible solutions as:
Similar sets and can be defined for the cases of unstructured and Lipschitz MDPs (refer to the appendix), and DEL can be simplified in these cases by replacing by or in the pseudo-code. Finally, refers to the optimization problem .
DEL combines the ideas behind OSSB (Combes et al., 2017), an asymptotically optimal algorithm for structured bandits, and the asymptotically optimal algorithm presented in (Burnetas and Katehakis, 1997) for RL problems without structure. The design of DEL aims at exploring sub-optimal actions no more than what the regret lower bound prescribes. To this end, it essentially solves in each iteration an optimization problem close to where is an estimate of the true MDP . Depending on the solution and the number of times apparently sub-optimal actions have been played, DEL decides to explore or exploit. The estimation phase ensures that certainty equivalence holds. The "monotonization" phase together with the restriction to relatively well selected actions were already proposed in (Burnetas and Katehakis, 1997) to make sure that only accurately estimated actions are selected in the exploitation phase. The various details and complications introduced in DEL ensure that its regret analysis can be conducted. In practice (see the appendix), our initial experiments suggest that many details can be removed without large regret penalties.
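The estimated MDP that DEL relies on is built from empirical transition rates and rewards. The following is a minimal sketch of such an estimator, not the authors' implementation: the class name, its methods, and the convention of using uniform transitions and zero reward for pairs that have never been visited are all assumptions made for the example.

```python
import numpy as np

class EmpiricalMDP:
    """Running estimate of transition probabilities and average rewards."""

    def __init__(self, n_states, n_actions):
        self.n_states = n_states
        self.visits = np.zeros((n_states, n_actions))
        self.transitions = np.zeros((n_states, n_actions, n_states))
        self.reward_sum = np.zeros((n_states, n_actions))

    def update(self, x, a, reward, y):
        """Record one observed step: in state x, action a gave `reward` and led to y."""
        self.visits[x, a] += 1
        self.transitions[x, a, y] += 1
        self.reward_sum[x, a] += reward

    def estimate(self):
        """Return (p_hat, r_hat), the empirical transition rates and average rewards."""
        n = np.maximum(self.visits, 1)  # avoid division by zero
        p_hat = self.transitions / n[:, :, None]
        r_hat = self.reward_sum / n
        # Arbitrary convention (assumption): uniform transitions for unvisited pairs.
        p_hat[self.visits == 0] = 1.0 / self.n_states
        return p_hat, r_hat
```

The visit counts kept by such an estimator are also the quantities that an explore-or-exploit rule would compare against the exploration rates prescribed by the optimization problem.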
Theorem 4. For a structure with Bernoulli rewards and for any ergodic MDP , assume that: (i) is in the interior of (i.e., there exists a constant such that for any , if and ), (ii) the solution is uniquely defined for each such that , (iii) continuous at (i.e., for any given , there exists such that for all , if , where is the solution of , and that of ). Then, for with any , we have:
For Lipschitz with (L1)-(L2) (resp. unstructured ), if uses in each step , (resp. ) instead of , its regret is asymptotically smaller than (resp. ).
In the above theorem, the assumptions about the uniqueness and continuity of the solution could be verified for particular structures. In particular, we believe that they generally hold in the case of unstructured and Lipschitz MDPs. Also note that similar assumptions have been made in (Graves and Lai, 1997).
6 Extensions and Future Work
It is worth extending the approach developed in this paper to the case of structured discounted RL problems (although for such problems, there is no ideal way of defining the regret of an algorithm). There are other extensions worth investigating. For example, since our framework allows any kind of structure, we may specify our regret lower bounds for structures stronger than that corresponding to Lipschitz continuity, e.g., the reward may exhibit some kind of unimodality or convexity. Under such structures, the regret improvements might become even more significant. Another interesting direction consists in generalizing the results to the case of communicating MDPs. This would allow us for example to consider deterministic system dynamics and unknown probabilistic rewards.
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Jungseul Ok is now with UIUC in Prof. Sewoong Oh’s group. He would like to thank UIUC for financially supporting his participation in the NIPS 2018 conference.
- Agrawal and Jia  Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems 31, 2017.
- Auer and Ortner  Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems 19, 2007.
- Auer et al.  Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems 22, 2009.
- Bartlett and Tewari  Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
- Burnetas and Katehakis  Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
- Combes and Proutiere  Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In Proceedings of the 31st International Conference on Machine Learning, 2014.
- Combes et al.  Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems 30, 2017.
- Filippi et al.  Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 48th Annual Allerton Conference on Communication, Control, and Computing, 2010.
- Garivier et al. [Jun. 2018] Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, Jun. 2018.
- Graves and Lai  Todd L. Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM J. Control and Optimization, 35(3):715–743, 1997.
- Kaufmann et al.  Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
- Lakshmanan et al.  Kailasam Lakshmanan, Ronald Ortner, and Daniil Ryabko. Improved regret bounds for undiscounted continuous reinforcement learning. In 32nd International Conference on Machine Learning, 2015.
- Magureanu et al.  Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bounds and optimal algorithms. In Conference on Learning Theory, 2014.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
- Ortner and Ryabko  Ronald Ortner and Daniil Ryabko. Online regret bounds for undiscounted continuous reinforcement learning. In Advances in Neural Information Processing Systems 25, 2012.
- Osband and Van Roy  Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the Eluder dimension. In Advances in Neural Information Processing Systems 27, 2014.
- Osband and Van Roy  Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
- Puterman  M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
- Tewari and Bartlett  Ambuj Tewari and Peter L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems 20, 2008.
Appendix A The DEL Algorithm
In this section, we present DEL, and state its asymptotic performance guarantees. DEL pseudo-code is given in Algorithm 2. There, for notational convenience, we abuse the notations and redefine as . refers to the estimated MDP at time (e.g. using empirical transition rates). For non-empty correspondence (i.e., for any , is a non-empty subset of ), let denote the restricted MDP where the set of actions available at state is limited to . Then, and are the (optimal) gain and bias functions corresponding to the restricted MDP , respectively. Given a restriction defined by , for each , let and . For , let if , and let otherwise. For , we further define the set of confusing MDPs , and the set of feasible solutions :