Designing exploration strategies that are both computationally and statistically efficient is an important problem in Reinforcement Learning (RL). A common performance measure for the theoretical evaluation of exploration strategies is regret, the difference between the highest achievable cumulative reward and the agent’s actual cumulative reward. There is a rich line of work on regret analysis in tabular MDPs, where the state and action spaces are finite and small (Jaksch et al., 2010; Osband et al., 2013; Dann and Brunskill, 2015; Kearns and Singh, 2002). In the tabular setting, the lower bound on regret is $\Omega(\sqrt{DSAT})$, with $D$, $S$, $A$ and $T$ as the diameter, state size, action size and total steps, respectively. This lower bound has been matched by Tossou et al. (2019) in the tabular setting.
A current challenge in RL is dealing with large state and action spaces, where even polynomial dependence of regret on the state and action space size is unacceptable. One idea to meet this challenge is to consider MDPs with structure that allows us to represent them with a smaller number of parameters. One such example is a factored MDP (FMDP) (Boutilier et al., 2000), whose transitions can be represented by a compact Dynamic Bayesian network (DBN) (Ghahramani, 1997).
There is no FMDP planner that is both computationally efficient and accurate. Guestrin et al. (2003) proposed approximate algorithms with pre-specified basis functions. They use approximate linear programming and approximate policy iteration with max-norm projections, where the max-norm errors of the approximated value functions can be upper bounded by the projection error. For the even harder online learning setting, we study oracle-efficient algorithms, which can learn an unknown FMDP efficiently by assuming an efficient planning oracle. In this, we follow prior work in online learning where the goal is to design efficient online algorithms that only make a polynomial number of calls to an oracle that solves the associated offline problem. For example, oracle-based efficient algorithms have been proposed for the contextual bandit problem (Syrgkanis et al., 2016; Luo et al., 2017).
The only available near-optimal regret analysis of FMDPs is by Osband and Van Roy (2014). They consider the episodic setting, and part of our motivation is to extend their work to the more challenging non-episodic setting. They proposed two algorithms, PSRL (Posterior Sampling RL) and UCRL-factored. The two algorithms enjoy near-optimal Bayesian and frequentist regret bounds, respectively. However, their UCRL-factored algorithm relies on solving a Bounded FMDP (Givan et al., 2000), for which no computationally efficient solution is yet known in the factored case. Other computationally efficient algorithms either have high-order terms in their analysis (Strehl, 2007) or depend on strong connectivity assumptions, e.g., mixing time (Kearns and Koller, 1999).
This paper makes three main contributions which are summarised below:
We provide the first oracle-efficient algorithm, called DORL (Discrete Optimism Reinforcement Learning), with a near-optimal frequentist regret bound for non-episodic FMDPs. The algorithm calls the FMDP planner only a polynomial number of times. The upper bound, when specialized to the standard non-factored MDP setting, matches that of UCRL2 (Jaksch et al., 2010).
We provide a posterior sampling algorithm (PSRL) with a near-optimal Bayesian regret bound for non-episodic FMDPs. The Bayesian bound matches that of PSRL by Ouyang et al. (2017) when specialized to the standard setting. While a Bayesian regret bound is a weaker guarantee, PSRL has better empirical performance than DORL and the previous sample-efficient algorithms. The algorithm is able to find the true optimal policies despite our use of an approximate planner in our simulations.
We prove a lower bound for the regret in non-episodic FMDPs that depends on the span of the bias vector of the unknown FMDP rather than its diameter.
To illustrate the usefulness of span as a difficulty parameter for FMDPs, we show that the Cartesian product of independent and identical MDPs always has a bounded span but can have an arbitrarily large diameter. This result shows the limitation of regret bounds that depend on the diameter. To match the lower bound, we show that a simple modification of the REGAL.C algorithm (Bartlett and Tewari, 2009) gives a regret bound that depends only on the span. However, this requires access to a stronger computational oracle that maximizes the optimal gain with bounded span within a confidence set of MDPs.
We solve non-episodic MDPs by splitting the time steps into epochs and changing policies at the start of each epoch. Previous algorithms use a doubling trick that stops an epoch whenever the number of visits to some state-action pair is doubled. The doubling trick leads to a random splitting and introduces large terms in the factored MDP setting. We instead show that simple fixed epochs can also achieve a near-optimal regret bound. To be consistent with previous literature, we use the word “episode” to also denote the epochs of the algorithm during which the policy stays constant. However, note that our setting is non-episodic throughout the paper.
2.1 Non-episodic MDP
We consider the non-episodic and undiscounted Markov decision process (MDP), represented by , with the finite state space , the finite action space , the transition probability and reward distribution . Here denotes a distribution over the space and is the class of all the mappings from space to . Let and .
An MDP and an algorithm operating on with an initial state constitute a stochastic process described by the states visited at time step , the actions chosen by at step , the rewards and the next state obtained for . Let be the trajectory up to time .
To learn a non-episodic and undiscounted MDP with sublinear regret, we need some connectivity constraint. There are several subclasses of MDPs corresponding to different types of connectivity constraints (e.g., see the discussion in Bartlett and Tewari (2009)). This paper focuses on the class of communicating MDPs, i.e., the diameter of the MDP, which is defined below, is upper bounded by some $D$.
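Definition 2.1 (Diameter).
Consider the stochastic process defined by a stationary policy $\pi$ operating on an MDP $M$ with initial state $s$. Let $T(s' \mid M, \pi, s)$ be the random variable for the first time step in which state $s'$ is reached in this process. Then the diameter of $M$ is defined as
$$D(M) := \max_{s \neq s'} \min_{\pi} \mathbb{E}\left[ T(s' \mid M, \pi, s) \right].$$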
A stationary policy on an MDP is a mapping . An average reward (also called gain) of a policy on with an initial distribution is defined as
where the expectation is over trajectories and the limit may be a random value. We restrict the choice of policies to the set of all policies that give fixed average rewards, . It can be shown that for a communicating MDP the optimal policies with the highest average reward are in the set and that neither the optimal policy nor the optimal average reward depends on the initial state. Let denote the optimal policy for MDP starting from and denote the optimal average reward or optimal gain of the optimal policy.
We define the regret of a reinforcement learning algorithm operating on MDP up to time as
Optimality equation for undiscounted MDPs.
We let denote the -dimensional vector with each element representing and denote the matrix with each row as . For any communicating MDP using the optimal policy , there exists a vector , such that the optimal gain satisfies the following equation (Puterman, 2014):
We let be the vector satisfying the equation. Let .
2.2 Factored MDP
A factored MDP is modeled with a DBN (Dynamic Bayesian Network) (Dean and Kanazawa, 1989), where the transition dynamics and rewards are factored and each factor depends only on a finite scope of the state and action spaces. We use the definition in Osband and Van Roy (2014). We call a factored set if it can be factored by .
Definition 2.2 (Scope operation for factored sets).
For any subset of indices , let us define the scope set . Further, for any define the scope variable to be the value of the variables with indices . For singleton sets , we write for in the natural way.
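As an informal illustration of Definition 2.2, the following Python sketch treats a factored set as a list of finite component sets and a point of it as a tuple. The helper names `scope_value` and `scope_set` are hypothetical and not part of the paper; the sketch only shows how a scope variable picks out the components indexed by a subset of indices.

```python
from itertools import product

def scope_value(x, Z):
    """Return the components of the tuple x whose indices lie in Z (in index order)."""
    return tuple(x[i] for i in sorted(Z))

def scope_set(factors, Z):
    """Enumerate the Cartesian product of the component sets indexed by Z."""
    return list(product(*(factors[i] for i in sorted(Z))))

# A toy factored set with three components (illustrative values only).
factors = [(0, 1), ("a", "b", "c"), (0, 1, 2, 3)]
x = (1, "b", 3)

print(scope_value(x, {0, 2}))        # components 0 and 2 of x
print(len(scope_set(factors, {0, 1})))  # size of the scope set for indices {0, 1}
```

The size of a scope set is the product of the sizes of the selected components, which is why small scopes keep the number of parameters of a factored model manageable.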
Definition 2.3 (Factored reward distribution).
A reward distribution is factored over with scopes if and only if, for all , there exist distributions such that any can be decomposed as , with each individually observable. Throughout the paper, we also let denote the reward function of the distribution , which is the expectation .
Definition 2.4 (Factored transition probability).
A transition function is factored over and with scopes if and only if, for all there exists some such that,
For simplicity, let also denote the vector for the probability of each next state from current pair . We define in the same way.
Assumptions on FMDP.
To ensure a finite number of parameters, we assume that for , for and for all for some finite and . Furthermore, we assume that is in with probability 1.
We first define the number of visits for each factored set. Let be the number of visits to until , be the number of visits to until and be the number of visits to until . The empirical estimate for is for . The estimate for the transition probability is for . We let and be and with being the first step of episode .
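The count-based estimates above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name, the trajectory format, and the convention of dividing by max(1, count) for unvisited scope values are assumptions made here for concreteness.

```python
from collections import Counter

def empirical_factor_transition(trajectory, scope, i):
    """Estimate the i-th factored transition from visit counts.

    trajectory: list of (x, s_next), where x is the joint state-action tuple
                and s_next is the next joint state tuple (assumed format).
    scope:      indices of x that the i-th next-state component depends on.
    """
    visits = Counter()       # number of visits to each scope value
    transitions = Counter()  # number of (scope value, next component) pairs
    for x, s_next in trajectory:
        key = tuple(x[j] for j in scope)
        visits[key] += 1
        transitions[(key, s_next[i])] += 1

    def p_hat(key, s_i):
        # Assumed convention: max(1, count) avoids division by zero.
        return transitions[(key, s_i)] / max(1, visits[key])

    return visits, p_hat

# Toy data: x has three components, the next state has two.
trajectory = [
    ((0, 0, 1), (1, 0)),
    ((0, 1, 1), (1, 1)),
    ((0, 0, 0), (0, 0)),
    ((1, 1, 0), (0, 1)),
]
visits, p_hat = empirical_factor_transition(trajectory, scope=(0,), i=0)
```

Because counts are kept per scope value rather than per joint state-action pair, several distinct joint pairs contribute to the same estimate, which is the source of the statistical savings in the factored setting.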
We use PSRL (Posterior Sampling RL) and a modified version of UCRL-factored, called DORL (Discrete Optimism RL). The term “discrete” here means taking possible transition probabilities from a discrete set (instead of the continuous one used by UCRL2). The main difference between UCRL-factored and DORL is that DORL only relies on a planner for FMDPs, while UCRL-factored needs to solve a bounded-parameter FMDP (Givan et al., 2000). Both PSRL and DORL use a fixed policy within an episode. For PSRL (Algorithm 2), we apply the optimal policy for an MDP sampled from the posterior distribution of the true MDP. For DORL (Algorithm 1), instead of optimizing over a bounded MDP, we construct a new extended MDP, which is also factored, with a number of parameters polynomial in that of the true MDP. Then we find the optimal policy for the new factored MDP and map it to the policy space of the true MDP. Instead of using dynamic episodes, we show that a simple fixed episode scheme can also give us near-optimal regret bounds.
3.1 Extended FMDP
Previous near-optimal algorithms for regular MDPs depend on constructing an extended MDP that is optimistic with high probability, i.e., the optimal gain of the extended MDP is higher than that of the true MDP. There are two constructions for non-factored MDPs. Jaksch et al. (2010) construct the extended MDP with a continuous action space to allow choosing any transition probability in a confidence set, whose width decreases at a rate of order $1/\sqrt{N}$, where $N$ is the number of visits. This construction generates a bounded-parameter MDP. Agrawal and Jia (2017) instead sample transition probabilities only from the extreme points of the confidence set and combine them by adding extra discrete actions.
Solving the bounded-parameter MDP by the first construction, which requires storing and ordering the -dimensional bias vector, is not feasible for FMDPs. There is no direct adaptation that mitigates this computational issue. We show that the second construction, by removing the sampling part, can be solved with a much lower complexity in the FMDP setting.
We formally describe the construction. For simplicity, we ignore the notations for in this section. First define the error bounds as an input. For every , , we have an error bound for transition probability . For every , we have an error bound for . At the start of episode the construction takes the inputs of and the error bounds, and outputs the extended MDP .
Extreme transition dynamics.
We first define the extreme transition probability mentioned above in the factored setting. Let be the transition probability that encourages visiting , be
where is the vector with all zeros except for a one on the -th element. By this definition, is a new transition probability that puts all the uncertainty onto the direction . Our construction assigns an action to each of the extreme transition dynamics.
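The idea of putting all the uncertainty onto one direction can be sketched in Python. The exact formula is not reproduced here; the sketch assumes one natural reading, in which the mass removed from each coordinate is the error bound clipped to the empirical probability, and the removed mass is added to the target coordinate. The function name `extreme_transition` and this clipping rule are assumptions for illustration.

```python
def extreme_transition(p_hat, width, j):
    """Shift uncertainty mass toward coordinate j.

    p_hat: empirical next-state distribution (list of probabilities).
    width: per-coordinate error bounds.
    j:     index of the next-state value whose visits are encouraged.
    """
    # Assumed clipping: never remove more mass than a coordinate holds.
    w = [min(wi, pi) for wi, pi in zip(width, p_hat)]
    # Remove the clipped mass everywhere ...
    p = [pi - wi for pi, wi in zip(p_hat, w)]
    # ... and place all of it on coordinate j.
    p[j] += sum(w)
    return p

p = extreme_transition([0.5, 0.3, 0.2], [0.1, 0.1, 0.3], j=0)
```

Under this reading the result is always a valid probability vector, and there is exactly one such vector per target coordinate, so the extended action space grows only by a factor equal to the number of next-state values of each factor.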
Construction of extended FMDP.
Our new factored MDP , where and the new scopes and are the same as those for the original MDP.
Let . The new transition probability is factored over and with the factored transition probability to be
The new reward function is factored over , with reward functions to be
for any .
The factored set of the extended MDP satisfies each for any and each for any .
By Claim 3.1, any planner that efficiently solves the original MDP can also solve the extended MDP. We find the best policy for using the planner. To run a policy on the original action space, we choose such that for every , where maps any new state-action pair to the pair it is extended from, i.e. for any .
We achieve the near-optimal Bayesian regret bound by PSRL and frequentist regret bound by DORL, respectively.
Theorem 4.1 (Regret of PSRL).
Let be the factored MDP with graph structure , all and , and diameter upper bounded by . If is the true prior distribution over the set of MDPs with diameter D, then the Bayesian regret of PSRL is bounded as follows:
Theorem 4.2 (Regret of DORL).
Let be the factored MDP with graph structure , all and , and diameter upper bounded by . Then, with high probability, regret of DORL is upper bounded by:
The two bounds match the frequentist regret bound in Jaksch et al. (2010) and the Bayesian regret bound in Ouyang et al. (2017) for non-factored communicating MDPs. We also give a condition for choosing how quickly policies are changed.
Our lower bound restricts the scope of transition probability, i.e. the scope contains itself, which we believe is a natural assumption.
For any algorithm, any graph structure satisfying with , , and for , there exists an FMDP with the span of bias vector , such that for any initial state , the expected regret of the algorithm after steps is .
For a tighter regret bound depending on span, we use a factored REGAL.C, which replaces every occurrence of the diameter with an upper bound on span. The discussion of factored REGAL.C is in Appendix F.
A standard regret analysis consists of proving optimism, bounding the deviations and bounding the probability of failure of the confidence set. Our analysis follows the standard procedure while adapting it to the FMDP setting. The novelty lies in the proof of the general episode-assigning criterion and the lower bound.
For simplicity, we let denote the optimal policy of the true MDP, . Let be the starting time of episode and be the total number of episodes. Since for any does not depend on , we also let denote for any . Let and denote the optimal average reward for and .
Let be the confidence set of FMDPs at the start of episode with the same factorization, such that for and each ,
where as defined in (3);
and for each
where is defined in (2). It can be shown that
In the following analysis, we assume throughout that the true MDP for both PSRL and DORL is in and that the MDP drawn by PSRL is in for all . At the end, we bound the regret caused by the failure of the confidence set.
5.1 Regret caused by difference in optimal gain
Lemma 5.1 (Lemma 1 in Osband et al. (2013)).
If is the distribution of , then, for any function ,
We let . As is a function. Since , are fixed values for each , we have .
For DORL, we need to prove optimism, i.e., with high probability. Given , we show that there exists a policy for with an average reward .
For any policy for and any vector , let be the policy for satisfying , where . Then, given , .
Let be the policy that satisfies , where . Then for any starting state .
5.2 Regret caused by deviation
We further bound the regret caused by (5), which can be decomposed into the deviation between our belief and the true MDP. We first show that the diameter of can be upper bounded by .
We need the diameter of the extended MDP to be upper bounded to obtain a sublinear regret. For PSRL, since the prior distribution has no mass on MDPs with diameter greater than , the diameter of an MDP drawn from the posterior is upper bounded by almost surely. For DORL, we have the following lemma, the proof of which is given in Appendix B.
When is in the confidence set , the diameter of the extended MDP .
Let be the number of visits to in episode and be the row vector of . Let . Using the optimality equation,
where , and .
Using the Azuma-Hoeffding inequality and the same analysis as in Jaksch et al. (2010), we bound with probability at least ,
To bound and , we analyze the deviation in the transition and reward functions between and . For DORL, the deviation in transition probability is upper bounded by
The deviation in reward function .
For PSRL, since , and .
Decomposing the bound for each scope provided by and for PSRL , it holds for both PSRL and DORL that:
where, with some abuse of notation, we define for . The second inequality follows from the fact that (Osband and Van Roy, 2014).
5.3 Balance episode length and episode number
Lemma 5.5 implies that bounding the deviation regret amounts to balancing the total number of episodes and the length of the longest episode. The proof, as shown in Appendix C, relies on defining the last episode , such that .
Instead of using the doubling trick of Jaksch et al. (2010), we use an arithmetic progression: for . As in our algorithm, , we have and for all . Thus, by Lemma 5.5, putting (6), (7), (9), (8) together, the total regret for is upper bounded by
with a probability at least .
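To make the fixed episode scheme concrete, the following Python sketch generates episode start times for an arithmetic progression of episode lengths. The specific choice of lengths 1, 2, 3, ... is an assumption made for illustration, and `fixed_episode_starts` is a hypothetical helper name. The point of the sketch is that both the number of episodes and the longest episode grow on the order of the square root of the horizon.

```python
def fixed_episode_starts(T):
    """Return the 1-indexed start times of episodes covering steps 1..T.

    Assumed schedule: episode k has length k, so the lengths form an
    arithmetic progression and do not depend on the observed data.
    """
    starts, t, k = [], 1, 1
    while t <= T:
        starts.append(t)
        t += k   # episode k occupies k time steps
        k += 1
    return starts

starts_small = fixed_episode_starts(10)
num_episodes = len(fixed_episode_starts(1000))
```

With this schedule, the first K - 1 episodes already use K(K - 1)/2 steps, so the number of episodes K within a horizon of T steps is of order $\sqrt{T}$. Unlike the doubling trick, the schedule is deterministic and can be computed before any data is observed.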
For the failure of the confidence set, we prove the following lemma in Appendix D.
For all , with probability greater than , holds.
For PSRL, and have the same posterior distribution. The expectation of the regret caused by and are the same. Choosing sufficiently small , Theorem 4.1 follows.
6 Lower Bound
The usefulness of the span of the optimal bias vector as a difficulty parameter for standard MDPs has been discussed in the literature (Bartlett and Tewari, 2009). Here we further argue that the span is a better notion of difficulty for FMDPs, since it scales better when we generate rather simple FMDPs that decompose into independently evolving MDPs. For such FMDPs, the span grows in a controlled way whereas the diameter can blow up. We also provide a proof sketch for the lower bound of stated earlier in Theorem 4.3. Since our upper bounds are stated in terms of the diameter, we also observe that the REGAL.C algorithm of Bartlett and Tewari (2009) can be extended to FMDPs and shown to enjoy a regret bound that depends on the span, not the diameter. However, doing this does not lead to an oracle-efficient solution, which is the main focus of this paper.
Large diameter case.
We consider a simple FMDP that has infinite diameter but is still solvable. Let , . State space and action space are factored with indices . The transition probability is factored over and with scope for . The reward function is factored over with scope for . By this definition, the FMDP can be viewed as two independent MDPs, and , that are set to be communicating with bounded diameter. Each factored transition probability is chosen such that from any state and action pair, the next state will either move forward or move backward with probability one (state 0 is connected with state 3 as a circle).
This FMDP can be easily solved by solving independently. However, since always keeps the same parity, cannot be transmitted to . Thus, the FMDP has an infinite diameter. The span of the optimal policy, on the other hand, is upper bounded by , which is tight in this case. To ensure a communicating MDP, we can simply add an extra action for each MDP that, with small probability, leaves each factored state unchanged. In this way, the diameter can be arbitrarily large.
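The parity argument can be checked by brute force. The Python sketch below is an illustration under stated assumptions: each of the two component MDPs is modeled as a 4-state cycle in which every step moves one position forward or backward, and reachability is computed by breadth-first search over the product state space. The function name `reachable` and the encoding of moves as offsets are choices made here, not taken from the paper.

```python
from collections import deque
from itertools import product

def reachable(start, moves, n=4):
    """Breadth-first search over the product of two n-cycles.

    At every step each component applies one offset from `moves`
    (+1 forward, -1 backward, 0 stay), independently of the other.
    """
    seen, queue = {start}, deque([start])
    while queue:
        s1, s2 = queue.popleft()
        for d1, d2 in product(moves, repeat=2):
            nxt = ((s1 + d1) % n, (s2 + d2) % n)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Forward/backward moves only: the parity of s1 + s2 never changes.
only_moves = reachable((0, 0), moves=(+1, -1))
# Allowing a component to stay put breaks the parity invariant.
with_stay = reachable((0, 0), moves=(+1, -1, 0))

print(len(only_moves), len(with_stay))
```

With forward and backward moves only, half of the 16 joint states are unreachable from (0, 0), so the diameter of the product is infinite even though each component is communicating. Once staying in place is possible, every joint state becomes reachable, but if staying happens only with small probability the expected time to switch parity, and hence the diameter, can be made arbitrarily large.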
Lower bound with only dependency on span.
Let us formally state the lower bound. Our lower bound places some restrictions on the scope of transition probability, i.e. the scope contains itself, which we believe is a natural assumption. We provide a proof sketch for Theorem 4.3 here.
Proof sketch. Let . As , a special case is the FMDP with graph structure , which can be decomposed into independent MDPs as in the previous example. Among the MDPs, the last MDPs are trivial. By simply setting the remaining MDPs to be the construction used by Jaksch et al. (2010), which we refer to as the “JAO MDP”, the regret for each MDP with the span , is for . The total regret is .
Let be the Cartesian product of independent MDPs , each with a span of bias vector . The optimal policy for has a span .