Near-optimal Reinforcement Learning in Factored MDPs

by   Ian Osband, et al.
Stanford University

Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer Ω(√(SAT)) regret on some MDP, where T is the elapsed time and S and A are the cardinalities of the state and action spaces. This implies T = Ω(SA) time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, S and A can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a factored MDP, it is possible to achieve regret that scales polynomially in the number of parameters encoding the factored MDP, which may be exponentially smaller than S or A. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).




1 Introduction

We consider a reinforcement learning agent that takes sequential actions within an uncertain environment with an aim to maximize cumulative reward [1]. We model the environment as a Markov decision process (MDP) whose dynamics are not fully known to the agent. The agent can learn to improve future performance by exploring poorly-understood states and actions, but might improve its short-term rewards through a policy which exploits its existing knowledge. Efficient reinforcement learning balances exploration with exploitation to earn high cumulative reward.

The vast majority of efficient reinforcement learning has focused upon the tabula rasa setting, where little prior knowledge is available about the environment beyond its state and action spaces. In this setting several algorithms have been designed to attain sample complexity polynomial in the number of states $S$ and actions $A$ [2, 3]. Stronger bounds on regret, the difference between an agent’s cumulative reward and that of the optimal controller, have also been developed. The strongest results of this kind establish $\tilde{O}(S\sqrt{AT})$ regret for particular algorithms [4, 5, 6], which is close to the lower bound $\Omega(\sqrt{SAT})$ [4]. However, in many settings of interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that even this level of regret is unacceptable.

In many practical problems the agent will have some prior understanding of the environment beyond tabula rasa. For example, in a large production line with $m$ machines in sequence, each with $K$ possible states, we may know that over a single time-step each machine can only be influenced by its direct neighbors. Such simple observations can reduce the dimensionality of the learning problem exponentially, but cannot easily be exploited by a tabula rasa algorithm. Factored MDPs (FMDPs) [7], whose transitions can be represented by a dynamic Bayesian network (DBN) [8], are one effective way to represent these structured MDPs compactly.

Several algorithms have been developed that exploit the known DBN structure to achieve sample complexity polynomial in the parameters of the FMDP, which may be exponentially smaller than $S$ or $A$ [9, 10, 11]. However, these polynomial bounds include several high order terms. We present two algorithms, UCRL-Factored and PSRL, with the first near-optimal regret bounds for factored MDPs. UCRL-Factored is an optimistic algorithm that modifies the confidence sets of UCRL2 [4] to take advantage of the network structure. PSRL is motivated by the old heuristic of Thompson sampling [12] and has been previously shown to be efficient in non-factored MDPs [13, 6]. These algorithms are described fully in Section 6.

Both algorithms make use of an approximate FMDP planner in internal steps. However, even where an FMDP can be represented concisely, solving for the optimal policy may take exponentially long in the most general case [14]. Our focus in this paper is upon the statistical aspect of the learning problem and, like earlier discussions, we do not specify which computational methods are used [10]. Our results serve as a reduction of the reinforcement learning problem to finding an approximate solution for a given FMDP. In many cases of interest, effective approximate planning methods for FMDPs do exist. Investigating and extending these methods is an ongoing subject of research [15, 16, 17, 18].

2 Problem formulation

We consider the problem of learning to optimize a random finite horizon MDP $M = (\mathcal{S}, \mathcal{A}, R^M, P^M, \tau, \rho)$ in repeated finite episodes of interaction. $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R^M(s,a)$ is the reward distribution over $\mathbb{R}$ in state $s$ with action $a$, $P^M(\cdot \mid s,a)$ is the transition probability over $\mathcal{S}$ from state $s$ with action $a$, $\tau$ is the time horizon, and $\rho$ the initial state distribution. We define the MDP and all other random variables we will consider with respect to a probability space $(\Omega, \mathcal{F}, \mathbb{P})$.


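A deterministic policy $\mu$ is a function mapping each state $s \in \mathcal{S}$ and $i = 1, \dots, \tau$ to an action $a \in \mathcal{A}$. For each MDP $M$ and policy $\mu$, we define a value function

$$V^M_{\mu,i}(s) := \mathbb{E}_{M,\mu}\left[\sum_{j=i}^{\tau} \overline{R}^M(s_j, a_j) \,\Big|\, s_i = s\right],$$

where $\overline{R}^M(s,a)$ denotes the expected reward realized when action $a$ is selected while in state $s$, and the subscripts of the expectation operator indicate that $a_j = \mu(s_j, j)$, and $s_{j+1} \sim P^M(\cdot \mid s_j, a_j)$ for $j = i, \dots, \tau$. A policy $\mu$ is optimal for the MDP $M$ if $V^M_{\mu,i}(s) = \max_{\mu'} V^M_{\mu',i}(s)$ for all $s \in \mathcal{S}$ and $i = 1, \dots, \tau$. We will associate with each MDP $M$ a policy $\mu^M$ that is optimal for $M$.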

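The reinforcement learning agent interacts with the MDP $M^*$ over episodes that begin at times $t_k = (k-1)\tau + 1$, $k = 1, 2, \dots$. At each time $t$, the agent selects an action $a_t$, observes a scalar reward $r_t$, and then transitions to $s_{t+1}$. Let $H_t = (s_1, a_1, r_1, \dots, s_{t-1}, a_{t-1}, r_{t-1})$ denote the history of observations made prior to time $t$. A reinforcement learning algorithm is a deterministic sequence $\{\pi_k \mid k = 1, 2, \dots\}$ of functions, each mapping $H_{t_k}$ to a probability distribution $\pi_k(H_{t_k})$ over policies which the agent will employ during the $k$th episode. We define the regret incurred by a reinforcement learning algorithm $\pi$ up to time $T$ to be:

$$\mathrm{Regret}(T, \pi, M^*) := \sum_{k=1}^{\lceil T/\tau \rceil} \Delta_k,$$

where $\Delta_k$ denotes regret over the $k$th episode, defined with respect to the MDP $M^*$ by

$$\Delta_k := \sum_{s \in \mathcal{S}} \rho(s)\left(V^{M^*}_{\mu^*,1}(s) - V^{M^*}_{\mu_k,1}(s)\right),$$

with $\mu^* = \mu^{M^*}$ and $\mu_k \sim \pi_k(H_{t_k})$. Note that regret is not deterministic since it can depend on the random MDP $M^*$, the algorithm’s internal random sampling and, through the history $H_{t_k}$, on previous random transitions and random rewards. We will assess and compare algorithm performance in terms of regret and its expectation.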
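The per-episode regret defined above can be made concrete for a small tabular MDP. The sketch below is illustrative only: it assumes the mean rewards and transitions are known arrays (named `R_bar` and `P` here, which are not names from the paper), computes values by backward induction over a horizon `tau`, and compares a fixed policy with the optimal one under an initial distribution `rho`.

```python
import numpy as np

def optimal_values(R_bar, P, tau):
    """Backward induction. R_bar: (S, A) mean rewards; P: (S, A, S) transitions."""
    S, _ = R_bar.shape
    V = np.zeros((tau + 1, S))           # V[tau] = 0: no reward after the horizon
    mu = np.zeros((tau, S), dtype=int)   # mu[i, s]: optimal action at step i in state s
    for i in reversed(range(tau)):
        Q = R_bar + P @ V[i + 1]         # (S, A) action values at step i
        mu[i] = Q.argmax(axis=1)
        V[i] = Q.max(axis=1)
    return mu, V

def policy_values(R_bar, P, mu, tau):
    """Value of a time-dependent deterministic policy mu[i, s]."""
    S = R_bar.shape[0]
    states = np.arange(S)
    V = np.zeros((tau + 1, S))
    for i in reversed(range(tau)):
        a = mu[i]
        V[i] = R_bar[states, a] + P[states, a] @ V[i + 1]
    return V

def episode_regret(R_bar, P, rho, mu_k, tau):
    """Expected value gap between the optimal policy and mu_k at the start of an episode."""
    _, V_star = optimal_values(R_bar, P, tau)
    V_k = policy_values(R_bar, P, mu_k, tau)
    return float(rho @ (V_star[0] - V_k[0]))

# Toy example (hypothetical): action 1 always pays 1, action 0 pays 0.
R_bar = np.array([[0.0, 1.0], [0.0, 1.0]])
P = np.full((2, 2, 2), 0.5)
rho = np.array([0.5, 0.5])
tau = 3
mu_bad = np.zeros((tau, 2), dtype=int)   # always plays action 0
mu_star, _ = optimal_values(R_bar, P, tau)
print(episode_regret(R_bar, P, rho, mu_bad, tau))   # 3.0
```

Summing this quantity over episodes gives the cumulative regret; in the learning problem the agent does not know `R_bar` or `P`, so the computation serves only as an evaluation tool.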

3 Factored MDPs

Intuitively a factored MDP is an MDP whose rewards and transitions exhibit some conditional independence structure. To formalize this definition we must introduce some more notation common to the literature [11].

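Definition 1 (Scope operation for factored sets $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_n$).

For any subset of indices $Z \subseteq \{1, 2, \dots, n\}$ let us define the scope set $\mathcal{X}[Z] := \bigotimes_{i \in Z} \mathcal{X}_i$. Further, for any $x \in \mathcal{X}$ define the scope variable $x[Z] \in \mathcal{X}[Z]$ to be the value of the variables $x_i \in \mathcal{X}_i$ with indices $i \in Z$. For singleton sets $Z$ we will write $x[i]$ for $x[\{i\}]$ in the natural way.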
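The scope operation is simply index selection on a tuple. A minimal sketch, using 0-based indices and hypothetical factor sets chosen for illustration:

```python
from itertools import product

def scope_variable(x, Z):
    """Return x[Z]: the components of x whose indices lie in Z (0-indexed)."""
    return tuple(x[i] for i in sorted(Z))

def scope_set(factors, Z):
    """Return X[Z]: the Cartesian product of the factor sets indexed by Z."""
    return list(product(*(factors[i] for i in sorted(Z))))

# Hypothetical factored set with three components.
factors = [("idle", "busy"), (0, 1, 2), ("a", "b")]
x = ("busy", 2, "a")

print(scope_variable(x, {0, 2}))        # ('busy', 'a')
print(len(scope_set(factors, {0, 1})))  # 6
```

The size of a scope set is the product of the sizes of the selected factors, which is what keeps the per-factor tables small when scopes are small.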

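Let $\mathcal{P}_{\mathcal{X},\mathcal{Y}}$ be the set of functions mapping elements of a finite set $\mathcal{X}$ to probability mass functions over a finite set $\mathcal{Y}$. $\mathcal{P}^{C,\sigma}_{\mathcal{X},\mathbb{R}}$ will denote the set of functions mapping elements of a finite set $\mathcal{X}$ to $\sigma$-sub-Gaussian probability measures over $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with mean bounded in $[0, C]$. For reinforcement learning we will write $\mathcal{X}$ for $\mathcal{S} \times \mathcal{A}$ and consider factored reward and factored transition functions which are drawn from within these families.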

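Definition 2 (Factored reward functions $\mathcal{R} \subseteq \mathcal{P}^{C,\sigma}_{\mathcal{X},\mathbb{R}}$).

The reward function class $\mathcal{R}$ is factored over $\mathcal{S} \times \mathcal{A} = \mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_n$ with scopes $Z_1, \dots, Z_l$ if and only if, for all $R \in \mathcal{R}$, $x \in \mathcal{X}$ there exist functions $\{R_i \in \mathcal{P}^{C,\sigma}_{\mathcal{X}[Z_i],\mathbb{R}}\}_{i=1}^{l}$ such that,

$$\mathbb{E}[r] = \sum_{i=1}^{l} \mathbb{E}[r_i]$$

for $r \sim R(x)$ is equal to $\sum_{i=1}^{l} r_i$ with each $r_i \sim R_i(x[Z_i])$ and individually observed.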

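Definition 3 (Factored transition functions $\mathcal{P} \subseteq \mathcal{P}_{\mathcal{X},\mathcal{S}}$).

The transition function class $\mathcal{P}$ is factored over $\mathcal{S} \times \mathcal{A} = \mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_n$ and $\mathcal{S} = \mathcal{S}_1 \times \dots \times \mathcal{S}_m$ with scopes $Z_1, \dots, Z_m$ if and only if, for all $P \in \mathcal{P}$, $x \in \mathcal{X}$, $s \in \mathcal{S}$ there exist some $\{P_i \in \mathcal{P}_{\mathcal{X}[Z_i],\mathcal{S}_i}\}_{i=1}^{m}$ such that,

$$P(s \mid x) = \prod_{i=1}^{m} P_i\left(s[i] \mid x[Z_i]\right).$$

A factored MDP (FMDP) is then defined to be an MDP with both factored rewards and factored transitions. Writing $\mathcal{X} = \mathcal{S} \times \mathcal{A}$, a FMDP is fully characterized by the tuple

$$M = \left(\{\mathcal{S}_i\}_{i=1}^{m};\ \{\mathcal{X}_i\}_{i=1}^{n};\ \{Z^R_i\}_{i=1}^{l};\ \{R_i\}_{i=1}^{l};\ \{Z^P_i\}_{i=1}^{m};\ \{P_i\}_{i=1}^{m};\ \tau;\ \rho\right),$$

where $Z^R_i$ and $Z^P_i$ are the scopes for the reward and transition functions respectively in $\{1, \dots, n\}$ for $\mathcal{X}_i$. We assume that the size of all scopes $|Z_i| \le \zeta \ll n$ and factors $|\mathcal{X}_i| \le K$ so that the domains of $R_i$ and $P_i$ are of size at most $K^\zeta$.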
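To illustrate the factored transition of Definition 3, the sketch below evaluates a transition probability as a product of per-factor probabilities, each depending only on its scope. The example is hypothetical: a chain of three binary machines with no action component, where each machine depends on itself and its direct neighbors, and randomly drawn factor tables.

```python
import numpy as np
from itertools import product

def factored_transition_prob(s_next, x, factors, scopes):
    """Product over factors i of P_i(s_next[i] | x[Z_i]).

    factors[i] maps the scope value x[Z_i] (a tuple) to a pmf over component i.
    """
    prob = 1.0
    for i, (P_i, Z_i) in enumerate(zip(factors, scopes)):
        x_scope = tuple(x[j] for j in Z_i)
        prob *= P_i[x_scope][s_next[i]]
    return prob

# Hypothetical chain: machine i depends on machines i-1, i, i+1 (0-indexed).
m = 3
scopes = [tuple(j for j in (i - 1, i, i + 1) if 0 <= j < m) for i in range(m)]
rng = np.random.default_rng(0)
factors = [
    {xz: rng.dirichlet(np.ones(2)) for xz in product((0, 1), repeat=len(Z))}
    for Z in scopes
]

x = (0, 1, 1)
total = sum(
    factored_transition_prob(s, x, factors, scopes)
    for s in product((0, 1), repeat=m)
)
n_factored = sum(len(P_i) * 2 for P_i in factors)  # table entries across all factors
n_flat = (2 ** m) * (2 ** m)                       # entries of the unfactored table
print(total, n_factored, n_flat)
```

The probabilities over all next states sum to one, and the factored representation needs 32 table entries against 64 for the flat table. The gap grows exponentially with the number of machines because the flat table scales with the full state space while each factor scales only with its scope.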

4 Results

Our first result shows that we can bound the expected regret of PSRL.

Theorem 1 (Expected regret for PSRL in factored MDPs).

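Let $M^*$ be factored with graph structure $G = \left(\{\mathcal{S}_i\}_{i=1}^{m};\ \{\mathcal{X}_i\}_{i=1}^{n};\ \{Z^R_i\}_{i=1}^{l};\ \{Z^P_i\}_{i=1}^{m};\ \tau\right)$. If $\phi$ is the distribution of $M^*$ and $\Psi$ is the span of the optimal value function then we can bound the regret of PSRL: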


We have a similar result for UCRL-Factored that holds with high probability.

Theorem 2 (High probability regret for UCRL-Factored in factored MDPs).

Let $M^*$ be factored with graph structure $G$. If $D$ is the diameter of $M^*$, then for any $M^*$ we can bound the regret of UCRL-Factored:


with probability at least

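Both algorithms give bounds $\tilde{O}\left(\Xi \sum_{j=1}^{m} \sqrt{|\mathcal{X}[Z^P_j]|\,|\mathcal{S}_j|\,T}\right)$ where $\Xi$ is a measure of MDP connectedness: expected span $\mathbb{E}[\Psi]$ for PSRL and scaled diameter $CD$ for UCRL-Factored. The span of an MDP is the maximum difference in value of any two states under the optimal policy, $\Psi(M^*) := \max_{s,s' \in \mathcal{S}} \{V^{M^*}_{\mu^*,1}(s) - V^{M^*}_{\mu^*,1}(s')\}$. The diameter of an MDP is the maximum number of expected timesteps to get between any two states, $D(M^*) = \max_{s \neq s'} \min_{\mu} T^{\mu}_{s \to s'}$. PSRL’s bounds are tighter since $\Psi(M) \le CD(M)$ and may be exponentially smaller.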

However, UCRL-Factored has stronger probabilistic guarantees than PSRL since its bounds hold with high probability for any MDP $M^*$, not just in expectation. There is an optimistic algorithm REGAL [5] which formally replaces the UCRL2 $D$ with $\Psi$ and retains the high probability guarantees. An analogous extension to REGAL-Factored is possible; however, no practical implementation of that algorithm exists even with an FMDP planner.

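The algebra in Theorems 1 and 2 can be overwhelming. For clarity, we present a symmetric problem instance for which we can produce a cleaner single-term upper bound. Let $\mathcal{Q}$ be shorthand for the simple graph structure with $l + 1 = m$, $C = \sigma = 1$, $|\mathcal{S}_i| = |\mathcal{X}_i| = K$ and $|Z^R_i| = |Z^P_j| = \zeta$ for $i = 1, \dots, l$ and $j = 1, \dots, m$; we will write $J = K^\zeta$.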

Corollary 1 (Clean bounds for PSRL in a symmetric problem).

If $\phi$ is the distribution of $M^*$ with structure $\mathcal{Q}$ then we can bound the regret of PSRL:

Corollary 2 (Clean bounds for UCRL-Factored in a symmetric problem).

For any MDP $M^*$ with structure $\mathcal{Q}$ we can bound the regret of UCRL-Factored:


with probability at least .

Both algorithms satisfy bounds of $\tilde{O}(\tau m \sqrt{JKT})$ which is exponentially tighter than can be obtained by any $\mathcal{Q}$-naive algorithm. For a factored MDP with $m$ independent components with $S$ states and $A$ actions the bound $\tilde{O}(mS\sqrt{AT})$ is close to the lower bound $\Omega(m\sqrt{SAT})$ and so the bound is near optimal. The corollaries follow directly from Theorems 1 and 2 as shown in Appendix B.

5 Confidence sets

Our analysis will rely upon the construction of confidence sets based around the empirical estimates for the underlying reward and transition functions. The confidence sets are constructed to contain the true MDP with high probability. This technique is common to the literature, but we will exploit the additional graph structure $G$ to sharpen the bounds.

Consider a family of functions $\mathcal{F} \subseteq \mathcal{M}_{\mathcal{X},(\mathcal{Y},\Sigma_{\mathcal{Y}})}$ which takes $x \in \mathcal{X}$ to a probability distribution over $(\mathcal{Y}, \Sigma_{\mathcal{Y}})$. We will write $\mathcal{M}_{\mathcal{X},\mathcal{Y}}$ unless we wish to stress a particular $\sigma$-algebra.

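Definition 4 (Set widths).

Let $\mathcal{X}$ be a finite set, and let $(\mathcal{Y}, \Sigma_{\mathcal{Y}})$ be a measurable space. The width of a set $\mathcal{F} \subseteq \mathcal{M}_{\mathcal{X},\mathcal{Y}}$ at $x \in \mathcal{X}$ with respect to a norm $\|\cdot\|$ is

$$w_{\mathcal{F}}(x) := \sup_{\overline{f}, \underline{f} \in \mathcal{F}} \left\|(\overline{f} - \underline{f})(x)\right\|.$$

Our confidence set sequence $\{\mathcal{F}_t \subseteq \mathcal{F} : t \in \mathbb{N}\}$ is initialized with a set $\mathcal{F}$. We adapt our confidence set to the observations $y_t \in \mathcal{Y}$ which are drawn from the true function $f^* \in \mathcal{F}$ at measurement points $x_t \in \mathcal{X}$ so that $y_t \sim f^*(x_t)$. Each confidence set is then centered around an empirical estimate $\hat{f}_t \in \mathcal{M}_{\mathcal{X},\mathcal{Y}}$ at time $t$, defined by

$$\hat{f}_t(x) = \frac{1}{n_t(x)} \sum_{\tau < t \,:\, x_\tau = x} \delta_{y_\tau},$$

where $n_t(x)$ is the number of times $x$ appears in $(x_1, \dots, x_{t-1})$ and $\delta_{y_t}$ is the probability mass function over $\mathcal{Y}$ that assigns all probability to the outcome $y_t$.

Our sequence of confidence sets depends on our choice of norm $\|\cdot\|$ and a non-decreasing sequence $\{d_t : t \in \mathbb{N}\}$. For each $t$, the confidence set is defined by:

$$\mathcal{F}_t = \mathcal{F}_t\left(\|\cdot\|, x_1^{t-1}, d_t\right) := \left\{ f \in \mathcal{F} \ \Big|\ \left\|(f - \hat{f}_t)(x_i)\right\| \le \sqrt{\frac{d_t}{n_t(x_i)}} \ \ \forall i = 1, \dots, t-1 \right\},$$

where $x_1^{t-1}$ is shorthand for $(x_1, \dots, x_{t-1})$ and we interpret $n_t(x_i) = 0$ as a null constraint. The following result shows that we can bound the sum of confidence widths through time.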

Theorem 3 (Bounding the sum of widths).

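For all finite sets $\mathcal{X}$, measurable spaces $(\mathcal{Y}, \Sigma_{\mathcal{Y}})$, function classes $\mathcal{F} \subseteq \mathcal{M}_{\mathcal{X},\mathcal{Y}}$ with uniformly bounded widths $w_{\mathcal{F}}(x) \le C_{\mathcal{F}}$ for all $x \in \mathcal{X}$ and non-decreasing sequences $\{d_t : t \in \mathbb{N}\}$: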


The proof follows from elementary counting arguments on $n_t(x)$ and the pigeonhole principle. A full derivation is given in Appendix A. ∎
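The construction in this section can be sketched for finite outcome spaces. The code below is a rough illustration, assuming the L1 norm and a confidence radius of the form $\sqrt{d_t / n_t(x)}$; the function names and the toy data are hypothetical. It builds the empirical estimate from observed pairs and checks whether a candidate function lies in the confidence set, treating unvisited points as unconstrained.

```python
import numpy as np

def empirical_estimate(xs, ys, n_outcomes):
    """Return (f_hat, n): empirical pmf and visit count for every observed x."""
    counts = {}
    for x, y in zip(xs, ys):
        counts.setdefault(x, np.zeros(n_outcomes))[y] += 1
    n = {x: int(c.sum()) for x, c in counts.items()}
    f_hat = {x: c / c.sum() for x, c in counts.items()}
    return f_hat, n

def in_confidence_set(f, f_hat, n, d_t):
    """Check ||f(x) - f_hat(x)||_1 <= sqrt(d_t / n(x)) at every visited x.

    Points that were never visited impose no constraint.
    """
    return all(
        np.abs(np.asarray(f[x]) - f_hat[x]).sum() <= np.sqrt(d_t / n[x])
        for x in f_hat
    )

# Hypothetical data: measurement points and observed outcomes in {0, 1}.
xs = ["a", "a", "a", "a", "b"]
ys = [0, 0, 1, 0, 1]
f_hat, n = empirical_estimate(xs, ys, n_outcomes=2)
candidate = {"a": [0.7, 0.3], "b": [0.5, 0.5]}

print(in_confidence_set(candidate, f_hat, n, d_t=1.0))  # True
print(in_confidence_set(candidate, f_hat, n, d_t=0.5))  # False
```

The radius shrinks as a point is visited more often, so frequently visited points constrain the candidate tightly while rarely visited points leave it loose.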

6 Algorithms

With our notation established, we are now able to introduce our algorithms for efficient learning in Factored MDPs. PSRL and UCRL-Factored proceed in episodes of fixed policies. At the start of the $k$th episode they produce a candidate MDP $M_k$ and then proceed with the policy which is optimal for $M_k$. In PSRL, $M_k$ is generated by a sample from the posterior for $M^*$, whereas UCRL-Factored chooses $M_k$ optimistically from the confidence set $\mathcal{M}_k$.

Both algorithms require prior knowledge of the graphical structure $G$ and an approximate planner for FMDPs. We will write $\Gamma(M, \epsilon)$ for a planner which returns an $\epsilon$-optimal policy for $M$. We will write $\tilde{\Gamma}(\mathcal{M}, \epsilon)$ for a planner which returns an $\epsilon$-optimal policy for the most optimistic realization from a family of MDPs $\mathcal{M}$. Given $\Gamma$ it is possible to obtain $\tilde{\Gamma}$ through extended value iteration, although this might become computationally intractable [4].

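PSRL remains identical to earlier treatment [13, 6] provided $G$ is encoded in the prior $\phi$. UCRL-Factored is a modification to UCRL2 that can exploit the graph and episodic structure of $M^*$. We write $\mathcal{R}^i_t(d^{R_i}_t)$ and $\mathcal{P}^j_t(d^{P_j}_t)$ as shorthand for the confidence sets $\mathcal{R}^i_t(|\mathbb{E}[\cdot]|, x_1^{t-1}[Z^R_i], d^{R_i}_t)$ and $\mathcal{P}^j_t(\|\cdot\|_1, x_1^{t-1}[Z^P_j], d^{P_j}_t)$ generated from initial sets $\mathcal{R}^i_1 = \mathcal{P}^{C,\sigma}_{\mathcal{X}[Z^R_i],\mathbb{R}}$ and $\mathcal{P}^j_1 = \mathcal{P}_{\mathcal{X}[Z^P_j],\mathcal{S}_j}$.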

We should note that UCRL2 was designed to obtain regret bounds even in MDPs without episodic reset. This is accomplished by imposing artificial episodes which end whenever the number of visits to a state-action pair is doubled [4]. It is simple to extend UCRL-Factored’s guarantees to this setting using this same strategy. This will not work for PSRL since our current analysis requires that the episode length is independent of the sampled MDP. Nevertheless, there has been good empirical performance using this method for MDPs without episodic reset in simulation [6].
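The artificial-episode rule described above can be sketched as a simple stopping test. This is an illustration of the UCRL2-style doubling criterion rather than code from the paper; the function name and the dictionary-based visit counts are assumptions.

```python
def episode_should_end(visits_this_episode, visits_before_episode):
    """True once some state-action pair's visits within the current episode
    reach max(1, its visits before the episode), i.e. its count has doubled."""
    return any(
        v >= max(1, visits_before_episode.get(x, 0))
        for x, v in visits_this_episode.items()
    )

before = {("s0", "a0"): 4, ("s1", "a1"): 2}
print(episode_should_end({("s0", "a0"): 3}, before))  # False
print(episode_should_end({("s0", "a0"): 4}, before))  # True
print(episode_should_end({("s2", "a0"): 1}, before))  # True: first visit to a new pair
```

Because each episode at least doubles some count, the number of episodes grows only logarithmically in the number of visits to each pair.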

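Algorithm 1 PSRL (Posterior Sampling)
  Input: Prior $\phi$ encoding $G$, $t = 1$
  for episodes $k = 1, 2, \dots$ do
     sample $M_k \sim \phi(\cdot \mid H_t)$
     compute $\mu_k = \Gamma(M_k, \sqrt{\tau/k})$
     for timesteps $j = 1, \dots, \tau$ do
        sample and apply $a_t = \mu_k(s_t, j)$
        observe $r_t$ and $s_{t+1}$
        $t = t + 1$
     end for
  end for

Algorithm 2 UCRL-Factored (Optimism)
  Input: Graph structure $G$, confidence $\delta$, $t = 1$
  for episodes $k = 1, 2, \dots$ do
     set reward confidence parameters $d^{R_i}_t$ for $i = 1, \dots, l$
     set transition confidence parameters $d^{P_j}_t$ for $j = 1, \dots, m$
     form $\mathcal{M}_k$ from the confidence sets $\mathcal{R}^i_t(d^{R_i}_t)$ and $\mathcal{P}^j_t(d^{P_j}_t)$
     compute $\mu_k = \tilde{\Gamma}(\mathcal{M}_k, \sqrt{\tau/k})$
     for timesteps $u = 1, \dots, \tau$ do
        sample and apply $a_t = \mu_k(s_t, u)$
        observe $r_t$ and $s_{t+1}$
        $t = t + 1$
     end for
  end for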
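The posterior sampling step of PSRL can be sketched for the transition factors alone. The example assumes independent Dirichlet priors on each row of each factor, which is one convenient choice and not something the paper prescribes; the class and argument names are hypothetical. The planner and the reward posterior are left out.

```python
import numpy as np
from itertools import product

class FactoredDirichletPosterior:
    """Independent Dirichlet posteriors, one per factor j and scope value x[Z_j]."""

    def __init__(self, x_sizes, scopes, s_sizes, prior_count=1.0):
        self.scopes = scopes  # scopes[j]: indices of x that factor j depends on
        # counts[j][x[Z_j]] holds the Dirichlet parameters over component j
        self.counts = [
            {
                key: np.full(s_sizes[j], prior_count)
                for key in product(*(range(x_sizes[i]) for i in Z))
            }
            for j, Z in enumerate(scopes)
        ]

    def update(self, x, s_next):
        """Record one observed transition from x to s_next, factor by factor."""
        for j, Z in enumerate(self.scopes):
            self.counts[j][tuple(x[i] for i in Z)][s_next[j]] += 1.0

    def sample_model(self, rng):
        """Draw one full set of factor tables, as done once per episode."""
        return [
            {key: rng.dirichlet(alpha) for key, alpha in table.items()}
            for table in self.counts
        ]

# Hypothetical structure: two binary inputs, two next-state components.
post = FactoredDirichletPosterior(x_sizes=[2, 2], scopes=[(0,), (0, 1)], s_sizes=[2, 3])
post.update(x=(1, 0), s_next=(1, 2))
model = post.sample_model(np.random.default_rng(0))
print(post.counts[1][(1, 0)])    # Dirichlet parameters after one observation
print(model[1][(1, 0)].sum())    # each sampled row is a probability vector
```

In a full implementation the sampled tables would be handed to an FMDP planner to obtain the episode's policy. The number of rows maintained scales with the scope sizes rather than with the full state-action space.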

7 Analysis

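For our common analysis of PSRL and UCRL-Factored we will let $\tilde{M}_k$ refer generally to either the sampled MDP used in PSRL or the optimistic MDP chosen from $\mathcal{M}_k$, with associated policy $\tilde{\mu}_k$. We introduce the Bellman operator $\mathcal{T}^M_\mu$, which for any MDP $M = (\mathcal{S}, \mathcal{A}, R^M, P^M, \tau, \rho)$, stationary policy $\mu : \mathcal{S} \to \mathcal{A}$ and value function $V : \mathcal{S} \to \mathbb{R}$, is defined by

$$\mathcal{T}^M_\mu V(s) := \overline{R}^M(s, \mu(s)) + \sum_{s' \in \mathcal{S}} P^M(s' \mid s, \mu(s))\, V(s').$$

This returns the expected value of state $s$ where we follow the policy $\mu$ under the laws of $M$, for one time step. We will streamline our discussion of $P^M$, $R^M$, $V^M_{\mu,i}$ and $\mathcal{T}^M_\mu$ by simply writing $*$ in place of $M^*$ or $\mu^*$ and $k$ in place of $\tilde{M}_k$ or $\tilde{\mu}_k$ where appropriate; for example $V^*_{k,i} := V^{M^*}_{\tilde{\mu}_k,i}$. We will also write $x_{k,i} := (s_{t_k+i}, \mu_k(s_{t_k+i}))$.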
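For a tabular MDP the Bellman operator is a single line of array arithmetic. A minimal sketch with hypothetical arrays `R_bar` (mean rewards, shape states by actions) and `P` (transition probabilities, shape states by actions by states):

```python
import numpy as np

def bellman_operator(R_bar, P, mu, V):
    """One-step lookahead: expected reward under mu plus expected next value."""
    states = np.arange(len(V))
    return R_bar[states, mu] + P[states, mu] @ V

# Toy example: deterministic transitions between two states.
R_bar = np.array([[1.0, 0.0], [0.0, 2.0]])
P = np.zeros((2, 2, 2))
P[0, 0] = [0.0, 1.0]   # state 0, action 0 -> state 1
P[0, 1] = [1.0, 0.0]   # state 0, action 1 -> state 0
P[1, 0] = [0.0, 1.0]   # state 1, action 0 -> state 1
P[1, 1] = [1.0, 0.0]   # state 1, action 1 -> state 0
mu = np.array([0, 1])  # stationary policy: action per state
V = np.array([10.0, 20.0])

TV = bellman_operator(R_bar, P, mu, V)
print(TV)  # [21. 12.]
```

Applying the operator repeatedly from a zero terminal value, with the policy for each step, recovers the finite-horizon value function of that policy.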

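We now break down the regret by adding and subtracting the imagined near optimal reward of policy $\tilde{\mu}_k$, which is known to the agent. For clarity of analysis we consider only the case of $\rho(s') = \mathbb{1}\{s' = s\}$ but this changes nothing for our consideration of finite $\mathcal{S}$.

$$\Delta_k = V^*_{*,1}(s) - V^*_{k,1}(s) = \left(V^k_{k,1}(s) - V^*_{k,1}(s)\right) + \left(V^*_{*,1}(s) - V^k_{k,1}(s)\right).$$

The second term, $V^*_{*,1} - V^k_{k,1}$, relates the optimal rewards of the MDP $M^*$ to those near optimal for $\tilde{M}_k$. We can bound this difference by the planning accuracy for PSRL in expectation, since $M^*$ and $\tilde{M}_k$ are equal in law, and for UCRL-Factored in high probability by optimism.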

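We decompose the first term through repeated application of dynamic programming:

$$\left(V^k_{k,1} - V^*_{k,1}\right)(s_{t_k+1}) = \sum_{i=1}^{\tau} \left(\mathcal{T}^k_{k,i} - \mathcal{T}^*_{k,i}\right) V^k_{k,i+1}(s_{t_k+i}) + \sum_{i=1}^{\tau} d_{t_k+i},$$

where $d_{t_k+i}$ is a martingale difference bounded by $\Psi_k$, the span of $V^k_{k,i}$. For UCRL-Factored we can use optimism to say that $\Psi_k \le CD$ [4] and apply the Azuma-Hoeffding inequality to say that:

$$\mathbb{P}\left(\sum_{k=1}^{\lceil T/\tau \rceil} \sum_{i=1}^{\tau} d_{t_k+i} > CD\sqrt{2T\log(2/\delta)}\right) \le \delta. \qquad (8)$$

The remaining term is the one step Bellman error of the imagined MDP $\tilde{M}_k$. Crucially this term only depends on states and actions $x_{k,i}$ which are actually observed. We can now use the Hölder inequality to bound

$$\left(\mathcal{T}^k_{k,i} - \mathcal{T}^*_{k,i}\right) V^k_{k,i+1}(s_{t_k+i}) \le \left|\overline{R}^k(x_{k,i}) - \overline{R}^*(x_{k,i})\right| + \frac{1}{2}\Psi_k \left\|P^k(\cdot \mid x_{k,i}) - P^*(\cdot \mid x_{k,i})\right\|_1. \qquad (9)$$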


7.1 Factorization decomposition

We aim to exploit the graphical structure $G$ to create more efficient confidence sets $\mathcal{M}_k$. It is clear from (9) that we may upper bound the deviations of $\overline{R}^*$ and $\overline{R}^k$ factor-by-factor using the triangle inequality. Our next result, Lemma 1, shows we can also do this for the transition functions $P^*$ and $P^k$. This is the key result that allows us to build confidence sets around each factor $P^*_j$ rather than around $P^*$ as a whole.

Lemma 1 (Bounding factored deviations).

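Let the transition function class $\mathcal{P} \subseteq \mathcal{P}_{\mathcal{X},\mathcal{S}}$ be factored over $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_n$ and $\mathcal{S} = \mathcal{S}_1 \times \dots \times \mathcal{S}_m$ with scopes $Z_1, \dots, Z_m$. Then, for any $P, \tilde{P} \in \mathcal{P}$ we may bound their L1 distance by the sum of the differences of their factorizations:

$$\left\|P(x) - \tilde{P}(x)\right\|_1 \le \sum_{i=1}^{m} \left\|P_i(x[Z_i]) - \tilde{P}_i(x[Z_i])\right\|_1.$$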


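We begin with the simple claim that for any $\alpha_1, \alpha_2, \beta_1, \beta_2 \in (0, 1]$:

$$|\alpha_1\alpha_2 - \beta_1\beta_2| = \alpha_2\left|\alpha_1 - \frac{\beta_1\beta_2}{\alpha_2}\right| \le \alpha_2\left(|\alpha_1 - \beta_1| + \left|\beta_1 - \frac{\beta_1\beta_2}{\alpha_2}\right|\right) = \alpha_2|\alpha_1 - \beta_1| + \beta_1|\alpha_2 - \beta_2|.$$

This result also holds for any $\alpha_1, \alpha_2, \beta_1, \beta_2 \in [0, 1]$, where the cases involving $0$ can be verified case by case.

We now consider the probability distributions $p, \tilde{p}$ over $\{1, \dots, d_1\}$ and $q, \tilde{q}$ over $\{1, \dots, d_2\}$. We let $Q = pq^T$ and $\tilde{Q} = \tilde{p}\tilde{q}^T$ be the joint probability distributions over $\{1, \dots, d_1\} \times \{1, \dots, d_2\}$. Using the claim above we bound the L1 deviation $\|Q - \tilde{Q}\|_1$ by the deviations of their factors:

$$\|Q - \tilde{Q}\|_1 = \sum_{i=1}^{d_1}\sum_{j=1}^{d_2} |p_i q_j - \tilde{p}_i\tilde{q}_j| \le \sum_{i=1}^{d_1}\sum_{j=1}^{d_2} \left(q_j|p_i - \tilde{p}_i| + \tilde{p}_i|q_j - \tilde{q}_j|\right) = \|p - \tilde{p}\|_1 + \|q - \tilde{q}\|_1.$$

We conclude the proof by applying this $m$ times to the factored transitions $P$ and $\tilde{P}$. ∎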
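The two-factor step of this proof is easy to check numerically. The sketch below draws random pairs of distributions (the Dirichlet draws and dimensions are arbitrary choices for illustration) and confirms that the L1 distance between the product distributions never exceeds the sum of the L1 distances between the factors.

```python
import numpy as np

def l1_product_gap(p, p_tilde, q, q_tilde):
    """Return (lhs, rhs): the L1 distance between the product distributions,
    and the sum of the L1 distances between corresponding factors."""
    lhs = np.abs(np.outer(p, q) - np.outer(p_tilde, q_tilde)).sum()
    rhs = np.abs(p - p_tilde).sum() + np.abs(q - q_tilde).sum()
    return lhs, rhs

rng = np.random.default_rng(0)
holds = True
for _ in range(1000):
    p, p_tilde = rng.dirichlet(np.ones(4), size=2)
    q, q_tilde = rng.dirichlet(np.ones(3), size=2)
    lhs, rhs = l1_product_gap(p, p_tilde, q, q_tilde)
    holds = holds and bool(lhs <= rhs + 1e-12)
print(holds)  # True

# The bound is tight when only one factor differs.
lhs, rhs = l1_product_gap(
    np.array([1.0, 0.0]), np.array([0.0, 1.0]),
    np.array([0.5, 0.5]), np.array([0.5, 0.5]),
)
print(lhs, rhs)  # 2.0 2.0
```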

7.2 Concentration guarantees for $\mathcal{M}_k$

We now want to show that the true MDP lies within $\mathcal{M}_k$ with high probability. Note that posterior sampling will also allow us to then say that the sampled $M_k$ is within $\mathcal{M}_k$ with high probability too. In order to show this, we first present a concentration result for the L1 deviation of empirical probabilities.

Lemma 2 (L1 bounds for the empirical transition function).

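For all finite sets $\mathcal{X}$, finite sets $\mathcal{Y}$, function classes $\mathcal{P} \subseteq \mathcal{P}_{\mathcal{X},\mathcal{Y}}$, then for any $x \in \mathcal{X}$ and $\epsilon > 0$ the deviation of the true distribution $P^*$ from the empirical estimate $\hat{P}_t$ after $t$ samples is bounded:

$$\mathbb{P}\left(\left\|P^*(x) - \hat{P}_t(x)\right\|_1 \ge \epsilon\right) \le \exp\left(|\mathcal{Y}|\log 2 - \frac{n_t(x)\,\epsilon^2}{2}\right).$$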

This is a relaxation of the result proved by Weissman [19]. ∎
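To see how such a tail bound turns into a confidence radius, the sketch below assumes the relaxed Weissman form, in which the probability that the empirical distribution over $|\mathcal{Y}|$ outcomes deviates in L1 by at least $\epsilon$ after $n$ samples is at most $\exp(|\mathcal{Y}|\log 2 - n\epsilon^2/2)$. Setting this equal to a target failure probability and solving for $\epsilon$ gives the radius. The function names are illustrative.

```python
import math

def tail_bound(n, n_outcomes, eps):
    """Relaxed Weissman bound on P(L1 deviation >= eps) after n samples."""
    return math.exp(n_outcomes * math.log(2.0) - n * eps ** 2 / 2.0)

def l1_deviation_radius(n, n_outcomes, delta):
    """Smallest eps for which the relaxed bound is at most delta."""
    return math.sqrt(2.0 * (n_outcomes * math.log(2.0) + math.log(1.0 / delta)) / n)

eps = l1_deviation_radius(n=100, n_outcomes=3, delta=0.05)
print(eps, tail_bound(100, 3, eps))
```

The radius scales as $\sqrt{|\mathcal{Y}|/n}$ up to logarithmic terms, which is why confidence parameters for transition factors are proportional to the size of the factor's outcome set.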

Lemma 2 ensures that for any . We then define with . Now using a union bound we conclude .

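Lemma 3 (Tail bounds for sub $\sigma$-gaussian random variables).

If $\{\epsilon_i\}$ are all independent and sub $\sigma$-gaussian then $\forall \beta \ge 0$:

$$\mathbb{P}\left(\frac{1}{n}\left|\sum_{i=1}^{n} \epsilon_i\right| > \beta\right) \le \exp\left(\log(2) - \frac{n\beta^2}{2\sigma^2}\right).$$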
A similar argument now ensures that , and so


7.3 Regret bounds

We now have all the necessary intermediate results to complete our proof. We begin with the analysis of PSRL. Using equation (10) and the fact that are equal in law by posterior sampling, we can say that . The contributions from regret in planning function are bounded by . From here we take equation (9), Lemma 1 and Theorem 3 to say that for any :

Let , since and by posterior sampling for all :

Plugging in and and setting completes the proof of Theorem 1. The analysis of UCRL-Factored and Theorem 2 follows similarly from (8) and (10). Corollaries 1 and 2 follow from substituting the structure and upper bounding the constant and logarithmic terms. This is presented in detail in Appendix B.

8 Conclusion

We present the first algorithms with near-optimal regret bounds in factored MDPs. Many practical problems for reinforcement learning will have extremely large state and action spaces; our results allow us to obtain meaningful performance guarantees even in previously intractably large systems. However, our analysis leaves several important questions unaddressed. First, we assume access to an approximate FMDP planner that may be computationally prohibitive in practice. Second, we assume that the graph structure is known a priori, but there are other algorithms that seek to learn this from experience [20, 21]. Finally, we might consider dimensionality reduction in large MDPs more generally, where either the rewards, transitions or optimal value function are known to belong to some function class $\mathcal{F}$, to obtain bounds that depend on the dimensionality of $\mathcal{F}$.


Osband is supported by Stanford Graduate Fellowships courtesy of PACCAR inc. This work was supported in part by Award CMMI-0968707 from the National Science Foundation.


  • [1] Apostolos Burnetas and Michael Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
  • [2] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
  • [3] Ronen Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231, 2003.
  • [4] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. The Journal of Machine Learning Research, 99:1563–1600, 2010.
  • [5] Peter Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
  • [6] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) Efficient Reinforcement Learning via Posterior Sampling. Advances in Neural Information Processing Systems, 2013.
  • [7] Craig Boutilier, Richard Dearden, and Moisés Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1):49–107, 2000.
  • [8] Zoubin Ghahramani. Learning dynamic bayesian networks. In Adaptive processing of sequences and data structures, pages 168–197. Springer, 1998.
  • [9] Alexander Strehl. Model-based reinforcement learning in factored-state MDPs. In Approximate Dynamic Programming and Reinforcement Learning, 2007. ADPRL 2007. IEEE International Symposium on, pages 103–110. IEEE, 2007.
  • [10] Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. In IJCAI, volume 16, pages 740–747, 1999.
  • [11] István Szita and András Lőrincz. Optimistic initialization and greediness lead to polynomial time learning in factored MDPs. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1001–1008. ACM, 2009.
  • [12] William Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • [13] Malcom Strens. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 943–950, 2000.
  • [14] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. J. Artif. Intell. Res.(JAIR), 19:399–468, 2003.
  • [15] Daphne Koller and Ronald Parr. Policy iteration for factored MDPs. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pages 326–334. Morgan Kaufmann Publishers Inc., 2000.
  • [16] Carlos Guestrin, Daphne Koller, and Ronald Parr. Max-norm projections for factored MDPs. In IJCAI, volume 1, pages 673–682, 2001.
  • [17] Karina Valdivia Delgado, Scott Sanner, and Leliane Nunes De Barros. Efficient solutions to factored MDPs with imprecise transition probabilities. Artificial Intelligence, 175(9):1498–1527, 2011.
  • [18] Scott Sanner and Craig Boutilier. Approximate linear programming for first-order MDPs. arXiv preprint arXiv:1207.1415, 2012.
  • [19] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
  • [20] Alexander Strehl, Carlos Diuk, and Michael Littman. Efficient structure learning in factored-state MDPs. In AAAI, volume 7, pages 645–650, 2007.
  • [21] Carlos Diuk, Lihong Li, and Bethany R Leffler. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 249–256. ACM, 2009.

Appendix A Bounding the widths of confidence sets

We present elementary arguments which culminate in a proof of Theorem 3.

Lemma 4 (Concentration results for ).

For all finite sets and any :

Where .


Let be the largest subsequence of such that . Now for any , let . Suppose there exist two distinct elements with so that . We note that for any so that:

This contradicts our assumption and so we must conclude that for all . This means that forms a subsequence of unique elements in , the total length of which must be bounded by . ∎

We now provide a corollary of this result which allows for episodic delays in updating visit counts . We imagine that we will only update our counts every steps.

Corollary 3 (Concentration results for in the episodic setting).

Let us associate times within episodes of length , for