# Thompson Sampling for Learning Parameterized Markov Decision Processes

We consider reinforcement learning in parameterized Markov Decision Processes (MDPs), where the parameterization may induce correlation across transition probabilities or rewards. Consequently, observing a particular state transition might yield useful information about other, unobserved, parts of the MDP. We present a version of Thompson sampling for parameterized reinforcement learning problems, and derive a frequentist regret bound for priors over general parameter spaces. The result shows that the number of instants where suboptimal actions are chosen scales logarithmically with time, with high probability. It holds for prior distributions that put significant probability near the true model, without any additional, specific closed-form structure such as conjugate or product-form priors. The constant factor in the logarithmic scaling encodes the information complexity of learning the MDP in terms of the Kullback-Leibler geometry of the parameter space.


## 1 Introduction

Reinforcement Learning (RL) is concerned with studying how an agent learns by repeated interaction with its environment. The goal of the agent is to act optimally to maximize some notion of performance, typically its net reward, in an environment modeled by a Markov Decision Process (MDP) comprising states, actions and state transition probabilities.

The difficulty of reinforcement learning stems primarily from the learner’s uncertainty in knowing the environment. When the environment is perfectly known, finding optimal behavior essentially becomes a dynamic programming or planning task. Without this knowledge, the learner faces a conflict between the need to explore the environment to discover its structure (e.g., reward/state transition behavior), and the need to exploit accumulated information. The trade-off is compounded by the fact that the agent’s current action influences future information. Thus, one has to strike the right balance between exploration and exploitation in order to learn efficiently.

Several modern reinforcement learning algorithms, such as UCRL2 (Jaksch et al., 2010), REGAL (Bartlett and Tewari, 2009) and R-max (Brafman and Tennenholtz, 2003), learn MDPs using the well-known “optimism under uncertainty” principle. The underlying strategy is to maintain high-probability confidence intervals for each state-action transition probability distribution and reward, shrinking the confidence interval corresponding to the current state transition/reward at each instant. Thus, observing a particular state transition/reward is assumed to provide information for only that state and action.

However, one often encounters learning problems in complex environments that have some form of lower-dimensional structure. Parameterized MDPs, in which the entire structure of the MDP is determined by a parameter with only a few degrees of freedom, are a typical example. With such MDPs, observing a state transition at an instant can be informative about other, unobserved transitions. As a motivating example, consider the problem of learning to control a queue, where the state represents the occupancy of the queue at each instant (#packets), and the action is either FAST or SLOW, denoting the (known) rate of service that can be provided. The state transitions are governed by (a) the type of service (FAST/SLOW) chosen by the agent, together with (b) the arrival rate of packets to the queue, and the cost at each step is a sum of a (known) cost for the type of service and a holding cost per queued packet. Suppose that packets arrive to the system with a fixed, unknown rate that alone parameterizes the underlying MDP. Then, every state transition is informative about this rate, and only a few transitions are necessary to pinpoint it accurately and learn the MDP fully. A more general example is a system with several queues having potentially state-dependent arrival rates of a parametric form.

A conceptually simple approach to learn MDPs with complex, parametric structure is posterior or Thompson sampling (Thompson, 1933), in which the learner starts by imposing a fictitious “prior” probability distribution over the uncertain parameters (thus, over all possible MDPs). A parameter is then sampled from this prior, the optimal behavior for that particular parameter is computed and the action prescribed by the behavior for the current state is taken. After the resulting reward/state transition is observed, the prior is updated using Bayes’ rule, and the process repeats.

### 1.1 Contributions

The main contribution of this work is to present and analyze Thompson Sampling for MDPs (TSMDP), an algorithm for undiscounted, online, non-episodic reinforcement learning in general, parameterized MDPs. The algorithm operates in cycles demarcated by visits to a reference state, samples from the posterior once every cycle and applies the optimal policy for the sample throughout the cycle. Our primary result is a structural, problem-dependent regret bound (more precisely, a pseudo-regret bound; Audibert and Bubeck, 2010) for TSMDP that holds for sufficiently general parameter spaces and initial priors. The result shows that for priors that put sufficiently large probability mass in neighborhoods of the underlying parameter, with high probability the TSMDP algorithm follows the optimal policy for all but a logarithmic (in the time horizon) number of time instants. To our knowledge, these are the first logarithmic gap-dependent bounds for Thompson sampling in the MDP setting, without using any specific/closed form prior structure. Furthermore, using a novel sample-path based concentration analysis, we provide an explicit bound for the constant factor in this logarithmic scaling which admits interpretation as a measure of the “information complexity” of the RL problem. The constant factor arises as the solution to an optimization problem involving the Kullback-Leibler geometry of the parameter space (more precisely, involving marginal KL divergences: weighted KL divergences that measure disparity between the true underlying MDP and other candidate MDPs; we discuss this in detail in Sections 3 and 5), and encodes in a natural fashion the interdependencies among elements of the MDP induced by the parametric structure. In fact, the constant factor is similar in spirit to the notion of eluder dimension coined by Russo and Van Roy (2013) in their fully Bayesian analysis of Thompson sampling for the bandit setting.
This results in significantly improved regret scaling in settings where the state/policy space is potentially large but the space of uncertain parameters is much smaller (Section 4.3), and represents an advantage over decoupled algorithms like UCRL2 which ignore the possibility of generalization across states, and explore each state transition in isolation.

We also implement and evaluate the numerical performance of the TSMDP algorithm for a queue MDP with unknown, state-dependent, parameterized arrival rates; its performance appears to be significantly better than that of the generic UCRL2 strategy.

The analysis of a distribution-based algorithm like Thompson sampling poses difficulties of a flavor unlike those encountered in the analysis of algorithms using point estimates and confidence regions (Jaksch et al., 2010; Bartlett and Tewari, 2009). In the latter class of algorithms, the focus is on (a) theoretically constructing tight confidence sets within which the algorithm uses the most optimistic parameter, and (b) tracking how the size of these confidence sets diminishes with time. In contrast, Thompson sampling, by design, is completely divorced from analytically tailored confidence intervals or point estimates. Understanding its performance is often complicated by the exercise of tracking how the (posterior) distribution, driven by heterogeneous and history-dependent observations, concentrates with time.

The problem of quantifying how the prior in Thompson sampling evolves in a general parameter space, with potentially complex structure or coupling between elements, where the posterior may not even be expressible in a convenient closed-form manner, poses unique challenges that we address here. Almost all existing analyses of Thompson sampling for the multi-armed bandit (a degenerate special case of MDPs) rely heavily on specific properties of the problem, especially independence across actions’ rewards, and/or specific structure of the prior such as belonging to a closed-form conjugate prior family (Agrawal and Goyal, 2012; Kaufmann et al., 2012; Korda et al., 2013; Agrawal and Goyal, 2013), or finitely supported priors (Gopalan et al., 2014).

Additional technical complications arise when generalizing from the bandit case, where the environment is stateless and IID (independent and identically distributed), to state-based reinforcement learning in MDPs, in which state evolution is coupled across time and evolves as a function of decisions made. This makes tracking the evolution of the posterior and the algorithm’s decisions especially challenging.

There is relatively little work on the rigorous performance analysis of Thompson sampling schemes for reinforcement learning. To the best of our knowledge, the only known regret analyses of Thompson sampling for reinforcement learning are those of Osband et al. (2013) and Osband and Van Roy (2014), which study the (purely) Bayesian setting, in which nature draws the true MDP episodically from a prior which is also completely known to the algorithm. The former work establishes Bayesian regret bounds for Thompson sampling in the canonical parameterization setup (i.e., each state-action pair having independent transition/reward parameters) whereas the latter considers the same for parameterized MDPs as we do here. Our interest, however, is in the continuous (non-episodic) learning setting, and more importantly in the frequentist notion of regret performance, where the “prior” plays the role of merely a parameter used by the algorithm operating in an unknown, fixed environment. We are also interested in problem (or “gap”) dependent regret bounds depending on the explicit structure of the MDP parameterization.

In this work, we overcome these hurdles to derive the first regret-type bounds for TSMDP at the level of a general parameter space and prior. First, we directly consider the posterior density in its general form of a normalized, exponentiated, empirical Kullback-Leibler divergence. This is reminiscent of approaches towards posterior consistency in the statistics literature (Shen and Wasserman, 2001; Ghosal et al., 2000), but we go beyond it in the sense of accounting for partial information from adaptively gathered samples. We then develop self-normalized, maximal concentration inequalities (de la Peña et al., 2007) for sums of sub-exponential random variables over Markov chain cycles, which may be of independent interest in the analysis of MDP-based algorithms. These permit us to show sample-path based bounds on the concentration of the posterior distribution, and help bound the number of cycles in which suboptimal policies are played, which is a measure of regret.

## 2 Preliminaries

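Let $\Theta$ be a space of parameters, where each $\theta \in \Theta$ parameterizes an MDP $m_\theta := (\mathcal{S}, \mathcal{A}, r, p_\theta)$. Here, $\mathcal{S}$ and $\mathcal{A}$ represent finite state and action spaces, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function and $p_\theta : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the probability transition kernel of the MDP (i.e., $p_\theta(s_1, a, s_2)$ is the probability of the next state being $s_2$ when the current state is $s_1$ and action $a$ is played). We assume that the learner is presented with an MDP $m_{\theta^\star}$ where $\theta^\star \in \Theta$ is initially unknown. In the canonical parameterization, the parameter $\theta$ factors into separate components for each state and action (Dearden et al., 1999).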

We restrict ourselves to the case where the reward function is completely known, with the only uncertainty being in the transition kernel of the unknown MDP. The extension to problems with unknown rewards is well-known from here (Bartlett and Tewari, 2009; Tewari and Bartlett, 2008).

A (stationary) policy or control $c$ is a prescription to (deterministically) play an action at every state of the MDP, i.e., $c : \mathcal{S} \to \mathcal{A}$. Let $\mathcal{C}$ denote the set of all stationary policies, which are the “reference policies” to compete with. (Note that $\mathcal{C}$ is finite since $\mathcal{S}$ and $\mathcal{A}$ are finite. In general, $\mathcal{C}$ can be a subset of the set of all stationary policies, containing optimal policies for every $\theta \in \Theta$. This serves to model policies with specific kinds of structure, e.g., threshold rules.) Each policy $c$, together with an MDP $m_\theta$, induces the discrete-time stochastic process $(S_t, A_t, R_t)_{t \geq 0}$, with $S_t$, $A_t$ and $R_t$ denoting the state, action taken and reward obtained respectively at time $t$. In particular, the sequence of visited states $(S_t)_{t \geq 0}$ becomes a discrete time Markov chain.

[Algorithm 1: Thompson Sampling for MDPs (TSMDP); the algorithm listing was not preserved in this copy.]

For each policy $c$, MDP $m_\theta$ and time horizon $T$, we define the $T$-step value function $V_T^{c,\theta}$ over initial states to be $V_T^{c,\theta}(s) := \mathbb{E}_{c,\theta}\left[\sum_{t=0}^{T} r(S_t, A_t) \mid S_0 = s\right]$, with the subscripts indicating the stochasticity induced by $c$ in the MDP $m_\theta$ (we will often drop subscripts when convenient for the sake of clarity in notation). Denote by $c^{OPT}(\theta)$ the policy with the best long-term average reward in $\mathcal{C}$ (we assume that the limiting average reward is well-defined; if not, one can restrict to the limit inferior), with ties assumed to be broken in a fixed fashion. Correspondingly, let $\mu^\star(\theta)$ be the best attainable long-term average reward for $\theta$. We will overload notation and use $c^\star := c^{OPT}(\theta^\star)$ and $\mu^\star := \mu^\star(\theta^\star)$.

In general, $x(i)$ denotes the $i$th coordinate of the vector $x$, and $x \cdot y$ is taken to mean the standard inner product of vectors $x$ and $y$. $\mathrm{KL}(p \,\|\, q)$ denotes the standard Kullback-Leibler divergence between probability distributions $p$ and $q$ on a common finite alphabet. The notation $\mathbb{1}\{E\}$ is employed to denote the indicator random variable corresponding to event $E$.

The TSMDP Algorithm. TSMDP (Algorithm 1) operates in contiguous intervals of time called epochs, induced in turn by an increasing sequence of stopping times $0 = t_0 < t_1 < t_2 < \cdots$. We will analyze the version that uses the return times to the start state $s_0$ as epoch markers, i.e., $t_0 := 0$ and $t_k := \min\{t > t_{k-1} : S_t = s_0\}$ for $k \geq 1$. The algorithm maintains a “prior” probability distribution (denoted by $\pi_t$ at time $t$) over the parameter space $\Theta$, from which it samples a parameterized MDP at the beginning of each epoch. (If the prior is analytically tractable, accurate sampling may be feasible. If not, a variety of schemes for sampling approximately from a posterior distribution, e.g., Gibbs/Metropolis-Hastings samplers, can be used.) It then uses an average-reward optimal policy for the sampled MDP throughout the epoch, and updates the prior to a “posterior” distribution via Bayes’ rule, effectively at the end of each epoch.
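The epoch structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a finite list of candidate kernels, strictly positive transition probabilities, known rewards, and brute-force planning over deterministic policies. The names `tsmdp`, `optimal_policy`, `stationary_distribution` and `toy_kernel`, and the two-state toy MDP itself, are hypothetical and introduced only for this example.

```python
import itertools
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic transition matrix P (rows sum to 1)."""
    n = P.shape[0]
    # Solve pi @ (P - I) = 0 together with sum(pi) = 1 by least squares.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def optimal_policy(kernel, reward):
    """Brute-force average-reward optimal deterministic policy.

    kernel[s, a, s2] is a transition probability; reward[s, a] is known.
    """
    n_states, n_actions, _ = kernel.shape
    rows = np.arange(n_states)
    best_gain, best = -np.inf, None
    for actions in itertools.product(range(n_actions), repeat=n_states):
        a = np.array(actions)
        gain = stationary_distribution(kernel[rows, a]) @ reward[rows, a]
        if gain > best_gain + 1e-12:      # ties are broken in a fixed order
            best_gain, best = gain, a
    return best

def tsmdp(kernels, reward, prior, true_index, horizon, start_state=0, seed=0):
    """Epoch-based posterior sampling; returns (posterior, #suboptimal actions)."""
    rng = np.random.default_rng(seed)
    log_post = np.log(np.asarray(prior, dtype=float))
    true_kernel = kernels[true_index]
    best = optimal_policy(true_kernel, reward)
    policies = [optimal_policy(k, reward) for k in kernels]  # planning, done once
    n_states = reward.shape[0]
    state, mistakes, policy = start_state, 0, None
    for _ in range(horizon):
        if state == start_state:          # each return to the start state opens an epoch
            post = np.exp(log_post - log_post.max())
            post /= post.sum()
            policy = policies[rng.choice(len(kernels), p=post)]
        action = policy[state]
        mistakes += int(action != best[state])
        nxt = rng.choice(n_states, p=true_kernel[state, action])
        # Bayes' rule in log space, applied to every candidate parameter.
        log_post += np.log([k[state, action, nxt] for k in kernels])
        state = nxt
    post = np.exp(log_post - log_post.max())
    return post / post.sum(), mistakes

def toy_kernel(q):
    """Hypothetical 2-state, 2-action MDP governed by one scalar parameter q."""
    kernel = np.empty((2, 2, 2))
    kernel[:, 0, :] = [1.0 - q, q]        # action 0 reaches state 1 w.p. q
    kernel[:, 1, :] = [0.5, 0.5]          # action 1 does not depend on q
    return kernel

reward = np.array([[0.0, 0.0], [1.0, 1.0]])   # reward 1 for being in state 1
kernels = [toy_kernel(0.2), toy_kernel(0.8)]
posterior, mistakes = tsmdp(kernels, reward, prior=[0.5, 0.5],
                            true_index=1, horizon=5000)
print(posterior, mistakes)
```

The sketch accumulates the log-likelihood at every step but only resamples at returns to the start state, which matches updating the posterior at the end of each epoch. In the toy MDP, action 1 carries no information about `q`, so learning only progresses in epochs where the sampled model prescribes action 0.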

## 3 Assumptions Required for the Main Result

We describe in this section our main result for the TSMDP algorithm (Algorithm 1), driven by the intuition presented in Section 5. We begin by stating and explaining the assumptions needed for our results to hold.

###### Assumption 1 (Recurrence).

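The start state $s_0$ is recurrent for the true MDP $m_{\theta^\star}$ under each policy $c^{OPT}(\theta)$ for $\theta$ in the support of $\pi$. (Recall that a state $s$ is said to be recurrent in a discrete time Markov chain $(X_t)_{t \geq 0}$ if $\mathbb{P}[X_t = s \text{ for some } t \geq 1 \mid X_0 = s] = 1$ (Levin et al., 2006).)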

Assumption 1 is satisfied, for instance, if $m_{\theta^\star}$ is an ergodic Markov chain under every stationary policy (a Markov chain is ergodic if it is irreducible, i.e., it is possible to go from every state to every state, not necessarily in one move), a condition commonly used in prior work on MDP learning (Tewari and Bartlett, 2008; Burnetas and Katehakis, 1997). Define $\bar{\tau}_c$ to be the expected recurrence time to state $s_0$, starting from $s_0$, when policy $c$ is used in the true MDP $m_{\theta^\star}$.
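For a finite irreducible chain, the expected recurrence time to a state equals the reciprocal of its stationary probability (Kac's formula), which gives a simple way to compute this quantity numerically. The following is a small sketch under that assumption; the function names and the example matrix are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an irreducible transition matrix P."""
    n = P.shape[0]
    # Solve pi @ (P - I) = 0 together with sum(pi) = 1 by least squares.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def expected_return_time(P, state):
    """Expected first return time to `state`, via Kac's formula 1 / pi[state]."""
    return 1.0 / stationary_distribution(P)[state]

# Chain induced by some fixed policy: from either state, move to state 1 w.p. 0.8.
P = np.array([[0.2, 0.8],
              [0.2, 0.8]])
tau_0 = expected_return_time(P, 0)   # return time to the start state 0
tau_1 = expected_return_time(P, 1)
print(tau_0, tau_1)
```

As a sanity check for state 0: the chain returns immediately with probability 0.2, and otherwise needs a geometric number of further steps with mean 1/0.2 = 5, giving 1 + 0.8 × 5 = 5 in total.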

###### Assumption 2 (Bounded Log-likelihood ratios).

Log-likelihood ratios are upper-bounded by a constant : .

Assumption 2 is primarily technical: it helps control the convergence of sample KL divergences in $\Theta$ to (expected) true KL divergences, and is commonly employed in the statistics literature, e.g., (Shen and Wasserman, 2001).

###### Assumption 3 (Unique average-reward-optimal policy).

For the true MDP , is the unique average-reward optimal policy: .

The uniqueness assumption is made merely for ease of exposition; our results continue to hold with suitable redefinition otherwise.

The remaining assumptions (4 and 5) concern the behavior of the prior and the posterior distribution under “near-ideal” trajectories of the MDP. In order to introduce them, we will need to make a few definitions. Let $\pi^{(c)}_{s_1}$ (resp. $\pi^{(c)}_{(s_1, s_2)}$) be the stationary probability of state $s_1$ (resp. joint probability of $s_1$ immediately followed by $s_2$) when the policy $c$ is applied to the true MDP $m_{\theta^\star}$; correspondingly, let $\bar{\tau}_c$ be the expected first return time to state $s_0$. We denote by $D_c(\theta^\star \| \theta)$ the important marginal Kullback-Leibler divergence for $\theta$ under $c$ (the marginal KL divergence appears as a fundamental quantity in the lower bound for regret in parameterized MDPs established by Agrawal et al. (1989)):

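$$
\begin{aligned}
D_c(\theta^\star \| \theta) &:= \sum_{s_1 \in \mathcal{S}} \pi^{(c)}_{s_1} \sum_{s_2 \in \mathcal{S}} p_{\theta^\star}(s_1, c(s_1), s_2) \log \frac{p_{\theta^\star}(s_1, c(s_1), s_2)}{p_{\theta}(s_1, c(s_1), s_2)} \\
&= \sum_{s_1 \in \mathcal{S}} \pi^{(c)}_{s_1} \, \mathrm{KL}\big( p_{\theta^\star}(s_1, c(s_1), \cdot) \,\big\|\, p_{\theta}(s_1, c(s_1), \cdot) \big).
\end{aligned}
$$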

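The marginal KL divergence $D_c(\theta^\star \| \theta)$ is a convex combination of the KL divergences between the transition probability kernels of $m_{\theta^\star}$ and $m_\theta$, with the weights of the convex combination being the appropriate invariant probabilities induced by policy $c$ under $m_{\theta^\star}$. If $D_c(\theta^\star \| \theta)$ is positive, then the MDPs $m_{\theta^\star}$ and $m_\theta$ can be “resolved apart” using samples from the policy $c$. Denote $D(\theta^\star \| \theta) := \big(D_c(\theta^\star \| \theta)\big)_{c \in \mathcal{C}}$, i.e., the vector of $D_c(\theta^\star \| \theta)$ values across all policies, with the convention that the final coordinate is associated with the optimal policy $c^\star$.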
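To make the definition concrete, the marginal KL divergence can be computed directly from two transition kernels and a policy. The sketch below is illustrative only; the function names and the two-state toy kernels (in which a single scalar `q` governs action 0, while action 1 is independent of `q`) are assumptions made for this example.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic transition matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def marginal_kl(true_kernel, other_kernel, policy):
    """Stationary-weighted KL divergence between rows of the two kernels under `policy`.

    kernel[s, a, s2] is a transition probability; policy[s] is the action at state s.
    The weights are the stationary probabilities under the TRUE kernel.
    """
    rows = np.arange(true_kernel.shape[0])
    a = np.asarray(policy)
    P_true, P_other = true_kernel[rows, a], other_kernel[rows, a]
    weights = stationary_distribution(P_true)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Terms with zero true probability contribute nothing to the KL divergence.
        terms = np.where(P_true > 0, P_true * np.log(P_true / P_other), 0.0)
    return float(weights @ terms.sum(axis=1))

def toy_kernel(q):
    """Hypothetical 2-state, 2-action MDP with one scalar parameter q."""
    kernel = np.empty((2, 2, 2))
    kernel[:, 0, :] = [1.0 - q, q]   # action 0 depends on q
    kernel[:, 1, :] = [0.5, 0.5]     # action 1 does not
    return kernel

d_informative = marginal_kl(toy_kernel(0.8), toy_kernel(0.2), policy=[0, 0])
d_uninformative = marginal_kl(toy_kernel(0.8), toy_kernel(0.2), policy=[1, 1])
print(d_informative, d_uninformative)
```

In this toy example the policy that always plays action 0 yields a positive divergence, 0.6 log 4, so it can resolve the two models apart, whereas the policy that always plays action 1 yields zero and therefore provides no information for distinguishing them.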

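For each policy $c$, define $S_c := \{\theta \in \Theta : c^{OPT}(\theta) = c\}$ to be the decision region corresponding to $c$, i.e., the set of parameters/MDPs for which the average-reward optimal policy is $c$. Fixing $\epsilon \geq 0$, let $S'_c := \{\theta \in S_c : D_{c^\star}(\theta^\star \| \theta) \leq \epsilon\}$. In other words, $S'_c$ comprises all the parameters (resp. MDPs) with average reward-optimal policy $c$ that “appear similar” to $\theta^\star$ (resp. $m_{\theta^\star}$) under the true optimal policy $c^\star$. Correspondingly, put $S''_c := S_c \setminus S'_c$ as the remaining set of parameters (resp. MDPs) in the decision region $S_c$ that are separated by at least $\epsilon$ w.r.t. $D_{c^\star}$.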

Let us use $e(t)$ to denote the epoch to which time instant $t$ belongs, i.e., $e(t) = k$ if $t_k \leq t < t_{k+1}$. Let $N_c(k)$ be the number of epochs, up to and including epoch $k$, in which the policy applied by the algorithm was $c$. Let $J_{(s_1, s_2)}(k_c, c)$ denote the total number of time instants at which the state transition $s_1 \to s_2$ occurred in the first $k_c$ epochs when policy $c$ was used.

The next assumption controls the posterior probability of playing the true optimal policy $c^\star$ during any epoch, preventing it from falling arbitrarily close to $0$. Note that at the beginning of epoch $k$ (time instant $t_k$), the posterior measure $\pi_{t_k}(M)$ of any legal subset $M \subseteq \Theta$ can be expressed solely as a function of the sample state pair counts $J_{(s_1, s_2)}(N_c(k), c)$ as

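$$
\pi_{t_k}(M) = \frac{\int_M W_{t_k}(\theta)\, \pi(d\theta)}{\int_\Theta W_{t_k}(\theta)\, \pi(d\theta)}, \qquad W_{t_k}(\theta) := \exp\left( \sum_{c, s_1, s_2} J_{(s_1, s_2)}(N_c(k), c) \log \frac{p_{\theta}(s_1, c(s_1), s_2)}{p_{\theta^\star}(s_1, c(s_1), s_2)} \right),
$$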

where $W_{t_k}(\theta)$ represents the posterior density or weight at time $t_k$. The assumption requires that the posterior probability of the decision region of $c^\star$ is uniformly bounded away from $0$ whenever the empirical state pair frequencies $J_{(s_1, s_2)}(k_c, c)/k_c$ are “near” their corresponding expected values $\bar{\tau}_c \, \pi^{(c)}_{(s_1, s_2)}$ (expectation w.r.t. the state transitions of $m_{\theta^\star}$).
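For a finitely supported prior, the expression above reduces to a weighted sum over candidate parameters, and the posterior depends on the data only through the transition counts. The sketch below illustrates this under the assumption of strictly positive transition probabilities; the names `posterior_from_counts` and `toy_kernel`, and the counts used, are hypothetical. Since the factor involving $p_{\theta^\star}$ is common to all parameters, it cancels in the normalization, so the code works with $\log p_\theta$ alone.

```python
import numpy as np

def posterior_from_counts(kernels, prior, counts):
    """Posterior over a finite list of kernels, given transition counts per policy.

    counts maps a policy (tuple with one action per state) to a matrix J,
    where J[s1, s2] is the number of observed s1 -> s2 transitions under it.
    All transition probabilities are assumed strictly positive.
    """
    log_w = np.log(np.asarray(prior, dtype=float))
    for policy, J in counts.items():
        rows = np.arange(len(policy))
        for i, kernel in enumerate(kernels):
            P = kernel[rows, list(policy)]          # chain induced by the policy
            log_w[i] += np.sum(np.asarray(J) * np.log(P))
    w = np.exp(log_w - log_w.max())                 # subtract max for stability
    return w / w.sum()

def toy_kernel(q):
    """Hypothetical 2-state, 2-action MDP with one scalar parameter q."""
    kernel = np.empty((2, 2, 2))
    kernel[:, 0, :] = [1.0 - q, q]
    kernel[:, 1, :] = [0.5, 0.5]
    return kernel

kernels = [toy_kernel(0.8), toy_kernel(0.2)]
counts = {(0, 0): [[2, 8], [2, 8]],    # informative policy: always action 0
          (1, 1): [[5, 5], [5, 5]]}    # uninformative policy: always action 1
post = posterior_from_counts(kernels, prior=[0.5, 0.5], counts=counts)
print(post)
```

Here the counts under the policy (1, 1) leave the posterior unchanged because both kernels agree on action 1, while the 20 transitions under (0, 0) produce a likelihood ratio of 4^12 in favor of the first kernel.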

###### Assumption 4 (Posterior probability of the optimal policy under “near-ideal” trajectories).

For any scalars $e_1, e_2 \geq 0$, there exists $p^\star \equiv p^\star(e_1, e_2) > 0$ such that

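$$
\pi_{t_k}(S_{c^\star}) \geq p^\star \quad \text{whenever “near-ideal” state pair frequencies have been observed:}
$$
$$
\left| \frac{J_{(s_1, s_2)}(k_c, c)}{k_c} - \bar{\tau}_c \, \pi^{(c)}_{(s_1, s_2)} \right| \leq \sqrt{\frac{e_1 \log(e_2 \log k_c)}{k_c}} \quad \forall s_1, s_2 \in \mathcal{S}, \; k_c \geq 1, \; c \in \mathcal{C}, \qquad k = \sum_{c \in \mathcal{C}} k_c.
$$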

The final assumption we make is a “grain of truth” condition on the prior, requiring it to put sufficient probability on/around the true parameter $\theta^\star$. Specifically, we require that the prior probability mass in weighted marginal KL-neighborhoods of $\theta^\star$ does not decay too fast as a function of the total weighting. This form of local prior property is analogous to the Kullback-Leibler condition (Barron, 1998; Choi and Ramamoorthi, 2008; Ghosal et al., 1999) used to establish consistency of Bayesian procedures, and in fact can be thought of as an extension of the standard condition to the partial observations setting of this paper.

###### Assumption 5 (Prior mass on KL-neighborhoods of θ⋆).

(A) There exist such that , for all choices of nonnegative integers , and .

(B) There exist such that , for all choices of nonnegative integers , , that satisfy .

The key factor that will be shown to influence the regret scaling with time is the quantity $a_4$ above, which bounds the (polynomial) decay rate of the prior mass around essentially the marginal KL neighborhood of $\theta^\star$ corresponding to always playing the policy $c^\star$.

We show later how these assumptions are satisfied in finite parameter spaces (Section 4.1) , and in continuous parameter spaces (Section 4.2). In particular, in finite parameter spaces, the assumptions can be shown to be satisfied with while for smooth (continuous) priors, the typical square-root rate of per independent parameter dimension holds, i.e., holds.

## 4 Main Result

We are now in a position to state the main, top-level result of this paper. (Due to space constraints, the proofs of all results are deferred to the appendix.)

###### Theorem 1 (Regret-type bound for TSMDP).

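Suppose Assumptions 1 through 5 hold. Let $\epsilon, \delta \in (0, 1)$, and let $c^\star$ be the unique optimal stationary policy for the true MDP $m_{\theta^\star}$. For the TSMDP algorithm, there exists $T_0 \equiv T_0(\epsilon, \delta)$ such that with probability at least $1 - \delta$, it holds for all $T \geq T_0$ that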

$$
\sum_{t=1}^{T} \mathbb{1}\{A_t \neq c^\star(S_t)\} \leq \mathsf{B} + \mathsf{C} \log T, \tag{2}
$$

where $\mathsf{B}$ is a problem- and prior-dependent quantity independent of $T$, and $\mathsf{C}$ is the value of the optimization problem (note that $a_4$ in (3) is the constant from Assumption 5(B))

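$$
\begin{aligned}
\max \quad & \left\| x_{|\mathcal{C}|-1} \right\|_1 && (3) \\
\text{s.t.} \quad & x_l \in \mathbb{R}_+^{|\mathcal{C}|}, && \forall\, l = 1, 2, \ldots, |\mathcal{C}|-1, \\
& x_l(|\mathcal{C}|) = 0, && \forall\, l = 1, 2, \ldots, |\mathcal{C}|-1, \\
& x_i \geq x_j, && \forall\, 1 \leq j \leq i \leq |\mathcal{C}|-1, \\
& x_i(l) = x_l(l), && \forall\, i \geq l, \; l = 1, 2, \ldots, |\mathcal{C}|-1, \\
& \sigma : \{1, 2, \ldots, |\mathcal{C}|-1\} \to \mathcal{C} \setminus \{c^\star\} \ \text{injective}, \\
& \min_{\theta \in S'_{\sigma(l)}} x_l \cdot D(\theta^\star \| \theta) = (1 + a_4) \left( \frac{1+\epsilon}{1-\epsilon} \right), && \forall\, 1 \leq l \leq |\mathcal{C}|-1.
\end{aligned}
$$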

Discussion. Theorem 1 gives a high-probability, logarithmic-in-$T$ bound on the quantity $\sum_{t=1}^{T} \mathbb{1}\{A_t \neq c^\star(S_t)\}$, the number of time instants in $\{1, \ldots, T\}$ when a suboptimal choice of action (w.r.t. $c^\star$) is made. This can be interpreted as a natural regret-minimization property of the algorithm. In the case of a stochastic multi-armed bandit ($|\mathcal{S}| = 1$ and rewards IID across time) with rewards bounded in $[0, 1]$, for instance, this quantity serves as an upper bound to the standard pseudo-regret (Audibert and Bubeck, 2010); a bound on a suitably defined version of pseudo-regret (see, e.g., Jaksch et al. (2010)) can easily be obtained from our main result (Theorem 1) by appropriate weighting, and we leave the details to the reader. The optimization problem (3) and the bound (2) can be interpreted as a multi-dimensional “game” in the space of (epoch) play counts of policies, with the following “rules”: (1) Start growing the non-negative $|\mathcal{C}|$-dimensional vector $x$ of epoch play counts of all policies, with initial value $x = 0$ (the $|\mathcal{C}|$-th coordinate of $x$ represents the number of plays of the optimal policy $c^\star$, which is irrelevant as far as regret is concerned, and is thus pegged to $0$ throughout), (2) Wait until the first time that some suboptimal policy $c \neq c^\star$ is “eliminated”, in the sense that $\min_{\theta \in S'_c} x \cdot D(\theta^\star \| \theta)$ reaches $(1 + a_4) \frac{1+\epsilon}{1-\epsilon}$, (3) Record $x_1 = x$ and $\sigma(1) = c$, (4) Impose the constraint that no further growth is allowed to occur in $x$ along dimension $\sigma(1)$ in the future, and (5) Repeat growing the play count vector until the time all suboptimal policies are eliminated, and aim to maximize the final $\|x\|_1$ when this occurs. An overview of how this optimization naturally arises as a regret bound for Thompson sampling is provided in Section 5.

We also have the following square-root scaling for the usual notion of regret for MDPs (Jaksch et al., 2010):

###### Theorem 2 (Regret bound for TSMDP).

Under the hypotheses of Theorem 1, with , for the TSMDP algorithm, there exists such that with probability at least , for all , .

This can be compared with the high-probability regret bound for UCRL2 (Jaksch et al., 2010, Theorem 4), which is stated in terms of the diameter $D$ of the true MDP, i.e., the time it takes to move from any state $s$ to any other state $s'$, using an appropriate policy for each pair of states $s, s'$.

The following sections show how the conclusions of Theorem 1 are applicable to various MDPs and illustrate the behavior of the scaling constant $\mathsf{C}$, showing that significant gains are obtained in the presence of correlated parameters.

### 4.1 Application: Discrete Parameter Spaces

We show here how the conclusion of Theorem 1 holds in a setting where the true MDP is known to be one among finitely many candidate models (MDPs).

###### Assumption 6 (Finitely many parameters, “Grain of truth” prior).

The prior probability distribution $\pi$ is supported on finitely many parameters: $|\Theta| < \infty$. Moreover, $\pi(\{\theta^\star\}) > 0$.

###### Theorem 3 (Regret-type bound for TSMDP, Finite parameter setting).

Suppose Assumptions 1, 2, 3 and 6 hold. Then, with , (a) Assumption 4 holds, and (b) Assumption 5 holds with and . Consequently, the conclusion of Theorem 1 holds, namely: Let , and let be the unique optimal stationary policy for the true MDP . For the TSMDP algorithm, there exists such that with probability at least , it holds for all that , where is a problem- and prior-dependent quantity independent of , and is the value of the optimization problem (3) with .

### 4.2 Application: Continuous Parameter Spaces

To illustrate the generality of our result, we apply our main result (Theorem 1) to obtain a regret bound for Thompson Sampling with a continuous prior, i.e., a probability density on $\Theta$ (by a probability density on $\Theta$, we mean a probability measure absolutely continuous w.r.t. Lebesgue measure on $\Theta$). For ease of exposition, let us consider a 2-state, 2-action MDP: $\mathcal{S} = \{1, 2\}$, $\mathcal{A} = \{1, 2\}$ (the theory can be applied in general to finite-state, finite-action MDPs). The (known) reward in state $i$ is $r_i$, $i \in \{1, 2\}$, irrespective of the action played, i.e., $r(i, a) = r_i$ for all $a \in \mathcal{A}$, with $r_1 < r_2$. All the uncertainty is in the transition kernel of the MDP, parameterized by the canonical parameters $\theta^{(i)}_{12}$ and $\theta^{(i)}_{21}$, $i \in \mathcal{A}$. Hence, we take the parameter space $\Theta$ to consist of these transition probabilities (note that we retain only independent parameters of the MDP model), with the identification $\theta^{(i)}_{12} = p_\theta(1, i, 2)$ and $\theta^{(i)}_{21} = p_\theta(2, i, 1)$. It follows that the optimal policy for a parameter $\theta$ is one that maximizes the probability of staying at state $2$:

$$
c^{OPT}(\theta) \equiv (c(1), c(2)) = (j_1, j_2), \qquad j_1 = \arg\max_{i} \theta^{(i)}_{12}, \qquad j_2 = \arg\min_{i} \theta^{(i)}_{21}.
$$

Imagine that the TSMDP algorithm is run with initial/recurrence state and prior as the uniform density on the sub-cube , on the MDP , . Also, without loss of generality, let , implying that , i.e., the optimal policy is to always play action . It can be checked that under this setup, Assumptions 1, 2 and 3 hold. The following result establishes the validity of Assumptions 4 and 5 in this continuous prior setting.

###### Theorem 4 (Regret-type bound for TSMDP, Continuous parameter/prior setting).

In the above MDP, with small enough, (a) Assumption 4 holds, and (b) Assumption 5 holds with and . Consequently, the conclusion of Theorem 1 holds.

### 4.3 Dependence of the Regret Scaling on MDP and Parameter Structure

We derive the following consequence of Theorem 1, useful in its own right, that explicitly guarantees an improvement in regret directly based on the Kullback-Leibler resolvability of parameters in the parameter space – a measure of the coupling across policies in the MDP.

###### Theorem 5 (Explicit Regret Improvement due to shared Marginal KL-Divergences).

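Suppose that $\Delta > 0$ and the integer $L$ are such that

$$
\forall\, c \neq c^\star, \; \theta \in S'_c: \qquad \left| \left\{ \hat{c} \in \mathcal{C} : \hat{c} \neq c^\star, \; D_{\hat{c}}(\theta^\star \| \theta) \geq \Delta \right\} \right| \geq L,
$$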

i.e., at least coordinates202020Note that the coordinate corresponding to the optimal policy is excluded from the condition. of are at least . Then, the multiplicative scaling factor in (2) satisfies ,where .

The result assures a non-trivial additive reduction of from the naive decoupled regret, whenever any suboptimal model in can be resolved apart from by at least actions in the sense of marginal KL-divergences of their observations.

Although the net number of decision vectors in (3) is nearly , the scale of can be significantly less than the number of policies owing to the fact that the posterior probability of several parameters is driven down simultaneously via the marginal K-L divergence terms . Put differently, using a standard bandit algorithm (e.g., UCB) naively with each arm being a stationary policy will perform much worse with a scaling like . We show (Appendix E) an example of an MDP in which the number of states can be arbitrarily large but which has only one uncertain scalar parameter, for which Thompson sampling achieves a much better regret scaling than its frequentist counterparts like UCRL2 (Jaksch et al., 2010) which are forced to explore all possible state transitions in isolation.

## 5 Sketch of Proof and Techniques used to show Theorem 1

At the outset, TSMDP is a randomized algorithm, whose decision is based on a random sample from the parameter space $\Theta$. The essence of Thompson sampling performance lies in understanding how the posterior distribution evolves as time progresses.

Let us assume, for ease of exposition, that we have finitely many parameters, $|\Theta| < \infty$. Writing out the expression for the posterior density at time $t$ using Bayes’ rule, we have, for all $\theta \in \Theta$,

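$$
\pi_{t+1}(d\theta) \propto p_\theta(S_t, A_{t+1}, S_{t+1})\, \pi_t(d\theta), \quad \text{so that} \quad \pi_t(d\theta) \propto \exp\left( -\sum_{i=0}^{t-1} \log \frac{p_{\theta^\star}(S_i, A_{i+1}, S_{i+1})}{p_\theta(S_i, A_{i+1}, S_{i+1})} \right) \pi_0(d\theta).
$$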

The sum in the exponent above can be rearranged into

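$$
\sum_{c \in \mathcal{C}} V_c(t) \sum_{s_1 \in \mathcal{S}} \frac{V_{s_1, c}(t)}{V_c(t)} \sum_{s_2 \in \mathcal{S}} \frac{1}{V_{s_1, c}(t)} \sum_{i=0}^{t-1} \mathbb{1}\{(S_{i+1}, S_i) = (s_2, s_1), \, C_{e(i)} = c\} \log \frac{p_{\theta^\star}(s_1, c(s_1), s_2)}{p_\theta(s_1, c(s_1), s_2)},
$$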

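in which $V_c(t) := \sum_{i=0}^{t-1} \mathbb{1}\{C_{e(i)} = c\}$ is the number of time instants before $t$ at which policy $c$ was in use, and $V_{s_1, c}(t) := \sum_{i=0}^{t-1} \mathbb{1}\{S_i = s_1, C_{e(i)} = c\}$ is the number of those instants at which the state was $s_1$. The above sum is an empirical quantity depending on the (random) sample path $S_0, S_1, \ldots, S_t$. To gain a clear understanding of the posterior evolution, let us replace the empirical terms in the above sum by their “ergodic averages” (i.e., expected value under the respective invariant distribution) under the respective policies. In other words, for each $c$ and $s_1$, let us approximate $V_{s_1, c}(t)/V_c(t) \approx \pi^{(c)}_{s_1}$, the stationary probability of state $s_1$ when the policy $c$ is applied to the true MDP $m_{\theta^\star}$. In the same way, we approximate $\frac{1}{V_{s_1, c}(t)} \sum_{i=0}^{t-1} \mathbb{1}\{(S_{i+1}, S_i) = (s_2, s_1), C_{e(i)} = c\} \approx p_{\theta^\star}(s_1, c(s_1), s_2)$. With these “typical” estimates, the inner sums collapse to the marginal KL divergence $D_c(\theta^\star \| \theta)$, and our approximation to the posterior density simply becomes

$$
\pi_t(d\theta) \propto \exp\left( -\sum_{c \in \mathcal{C}} V_c(t)\, D_c(\theta^\star \| \theta) \right) \pi_0(d\theta). \tag{5}
$$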

Expression (5) is the result of effectively eliminating one of the two sources of randomness in the dynamics of the TSMDP algorithm: the variability of the environment, i.e., state transitions. The other source of randomness arises due to the algorithm’s sampling behavior from the posterior distribution. We use approximation (5) to extract two basic insights that determine the posterior shrinkage and regret performance of TSMDP even for general parameter spaces. For a total time horizon of $T$ steps, we claim:

Property 1. The true model always has “high” posterior mass. Assuming $\pi_0(\theta^\star) > 0$ (the discrete “grain of truth” property), observe that (5) implies $\pi_t(\theta^\star) \geq \pi_0(\theta^\star)$ at all times $t$. Thus, roughly, the true parameter $\theta^\star$ is sampled by TSMDP with a frequency at least $\pi_0(\theta^\star)$ during the entire horizon.

We also have Property 2. Suboptimal models are sampled only as long as their posterior probability is above . The total number of times a parameter with posterior mass less than can be picked in Thompson sampling is at most , which is irrelevant as far as the scaling of the regret with is concerned.

With these two insights, we can now estimate the net number of times bad parameters may be chosen. To this end, partition the parameter space into the optimal decision regions , setting and . Now, for each and , is positive; thus, since is finite, such that uniformly across all such . But this in turn implies, using Property 1 and (5), that the posterior probability of decays exponentially with time : . Hence, such parameters , are sampled at most a constant number of times in any time horizon with high probability and do not contribute to the overall regret scaling.

The interesting and non-trivial contribution to the regret comes from the amount that parameters from , are sampled. To see this, let us follow the vector of play counts of policies, i.e., as it starts growing from the all-zeros vector at , increasing by in some coordinate at each time step . By Property 2 above, once is reached, sampling from effectively ceases. Thus, considering the “worst-case” path that can follow to delay this condition for the longest time across all , we arrive (approximately) at the optimization problem (3) stated in Theorem 1.

Though the argument above was based on rather coarse approximations to empirical, path-based quantities, the underlying intuition holds true and is made rigorous (Appendix A) to show that this is indeed the right scaling of the regret. This involves several technical tools tailored for the analysis of Thompson sampling in MDPs, including (a) the development of self-normalized concentration inequalities for sub-exponential IID random variables (epoch-related quantities), and (b) control of the posterior probability using properties of the prior in Kullback-Leibler neighborhoods of the true parameter, using techniques analogous to those used to establish frequentist consistency of Bayesian procedures (Ghosal et al., 2000; Choi and Ramamoorthi, 2008).

## 6 Numerical Evaluation

MDP and Parameter Structure: Along the lines of the motivating example in the Introduction, we model a single-buffer, discrete-time queueing system with a maximum occupancy of packets/customers. The state of the MDP is simply the number of packets in the queue at any given time, i.e., . At any given time, one of actions, Action (SLOW service) and Action (FAST service), may be chosen, i.e., . Applying SLOW (resp. FAST) service results in serving one packet from the queue with probability (resp. ) if it is not empty, i.e., the service model is Bernoulli() where is the packet processing probability under service type . Actions and incur a per-instant cost of and units respectively. In addition to this cost, there is a holding cost of per packet in the queue at all times. The system gains a reward of units whenever a packet is served from the queue. (A candidate physical interpretation of such a queueing system is in the form of a restaurant with tables, with the possibility to add more “chefs” or staff into service when desired (service rate control). However, adding staff costs the restaurant, as does customers waiting long until their orders materialize (holding cost).)

The arrival rate to the queueing system, i.e., the probability with which a new packet enters the buffer, is modeled as being state-dependent. Most importantly, the function mapping a state to its corresponding packet arrival rate is parameterized by a Normal-shaped (Gaussian) curve, used here as a functional form rather than as a probability distribution, as follows: . Here, and represent the -dimensional (mean, standard deviation) parameter for the arrival rate curve, and is chosen to be a constant that makes (to ensure valid Bernoulli packet arrival distributions). For the true, unknown MDP, we set . Figure 1 depicts (a) the optimal policy over , (b) the stationary distribution under the optimal policy and (c) the (parameterized) mean arrival rate curve over .
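To make the parameterized structure concrete, the following is a minimal sketch of how such a queueing MDP's transition kernels could be built. Everything specific in it is an assumption made for illustration: the Gaussian-shaped formula in `arrival_rates`, the within-slot dynamics (at most one arrival and one departure per step, independent, with arrivals to a full buffer dropped), the function names, and all numeric values. None of these are taken from the experiment described above.

```python
import math


def arrival_rates(n_max, mu, sigma, kappa):
    """Arrival probability per state: kappa * exp(-(s - mu)^2 / (2 sigma^2)).

    ASSUMPTION: this Gaussian-shaped form is illustrative only.
    kappa scales the curve so that every value is a valid probability.
    """
    lam = [kappa * math.exp(-((s - mu) ** 2) / (2.0 * sigma ** 2))
           for s in range(n_max + 1)]
    if min(lam) < 0.0 or max(lam) > 1.0:
        raise ValueError("arrival probabilities must lie in [0, 1]")
    return lam


def queue_kernel(lam, serve_prob):
    """Transition matrix P[s][s_next] for one service type.

    ASSUMED slot dynamics: one Bernoulli(lam[s]) arrival and, if the queue
    is non-empty, one Bernoulli(serve_prob) departure, independently.
    """
    n_max = len(lam) - 1
    P = [[0.0] * (n_max + 1) for _ in range(n_max + 1)]
    for s in range(n_max + 1):
        q = serve_prob if s > 0 else 0.0  # an empty queue cannot be served
        for arrive, p_a in ((1, lam[s]), (0, 1.0 - lam[s])):
            for depart, p_d in ((1, q), (0, 1.0 - q)):
                nxt = min(max(s + arrive - depart, 0), n_max)
                P[s][nxt] += p_a * p_d
    return P


# Hypothetical values, chosen only to make the sketch concrete.
lam = arrival_rates(n_max=10, mu=5.0, sigma=3.0, kappa=0.5)
kernels = {"SLOW": queue_kernel(lam, 0.3), "FAST": queue_kernel(lam, 0.8)}
```

Because every row of every kernel depends on the same two numbers `mu` and `sigma`, a transition observed in one state changes the likelihood of the parameter as a whole, and hence the learner's beliefs about arrival rates in states it has not yet visited.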

Simulation Results: We simulate both TSMDP and the UCRL2 algorithm (Jaksch et al., 2010) for the parameterized queueing MDP above. For UCRL2, we run the algorithm both with (a) fixed confidence intervals and (b) horizon-dependent confidence intervals. (The latter choice is used by Jaksch et al. (2010) to show a logarithmic expected regret bound for UCRL2.) We initialize TSMDP with a uniform prior for the normalized parameter on the discretized space .

Figure 2 shows the results of running the TSMDP and UCRL2 algorithms for various time horizons up to time steps, and across sample runs. We report both the average regret (w.r.t. a best per-step average reward of ) and the percentile of the regret across the runs. Thompson sampling is seen to significantly outperform UCRL2 as the horizon length increases. This advantage is presumably due to the fact that TSMDP is capable of exploiting the parameterized structure of better than UCRL2, which updates each confidence interval only when the associated state is visited.
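The two reported statistics can be computed along the following lines. This is a small sketch under stated assumptions: regret is taken to be the horizon length times the best per-step average reward minus the rewards actually collected (one common convention, assumed here), the percentile uses the nearest-rank method, and the function names are hypothetical.

```python
import math


def regret(rewards, best_avg_reward):
    """Cumulative regret of one run: T * (best per-step reward) - total reward."""
    return len(rewards) * best_avg_reward - sum(rewards)


def mean_and_percentile(per_run_regrets, q):
    """Mean and q-th percentile (nearest-rank) of regrets across sample runs."""
    ordered = sorted(per_run_regrets)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return sum(ordered) / len(ordered), ordered[rank - 1]
```

Reporting an upper percentile next to the mean shows whether a low average hides occasional runs with very large regret.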

## 7 Related Work

A line of recent work (Agrawal and Goyal, 2012; Kaufmann et al., 2012; Korda et al., 2013; Agrawal and Goyal, 2013; Gopalan et al., 2014) has demonstrated that Thompson sampling enjoys near-optimal regret guarantees for multi-armed bandits, a widely studied subclass of reinforcement learning problems.

The work of Osband et al. (2013), perhaps the most relevant to ours, studies the Bayesian regret of Thompson sampling for MDPs. In this setting, the true MDP is assumed to have been drawn from the same prior used by the algorithm; consequently, the Bayesian regret becomes the standard frequentist regret averaged across the entire parameter space w.r.t. the prior. While this is useful, it is arguably weaker than the standard frequentist notion of regret in that it is an averaged notion of standard regret (w.r.t. the specific prior), and it does not indicate exactly how the structure of the MDP influences regret performance. Moreover, the learning model considered in their work is episodic with fixed-length episodes and resets, as opposed to the non-episodic learning setting treated in this work, where we are able to show the first known structural (“gap-dependent”) regret bounds for Thompson sampling in fixed but unknown parameterized MDPs.

Prior to this, Ortega and Braun (2010) investigate the consistency performance of posterior-sampling based control rules, again in the fully Bayesian setting where nature’s prior is known.

Several deterministic algorithms relying on the “optimism under uncertainty” philosophy have been proposed for RL in the frequentist setup considered here (Brafman and Tennenholtz, 2003; Jaksch et al., 2010; Bartlett and Tewari, 2009). These algorithms work by maintaining confidence intervals for each transition probability and reward, computing the most optimistic MDP satisfying all confidence intervals and adaptively shrinking the confidence intervals each time the relevant state transition occurs. This strategy is potentially inefficient in parameterized MDPs, where observing a particular state transition can give information about other parts of the MDP as well.

The parameterized MDP setting we consider in this work has been previously studied by other authors. Dyagilev et al. (2008) investigate learning parameterized MDPs for finite parameter spaces in the discounted setting (we consider the average-reward setting), and demonstrate sample-complexity results under the Probably-Approximately-Correct (PAC) learning model, which is different from the notion of regret.

The certainty equivalence approach to learning MDPs (Kumar and Varaiya, 1986) – building the most plausible model given available data and using the optimal policy for it – is perhaps natural, but it suffers from a serious lack of adequate exploration necessary to achieve low regret (Kaelbling et al., 1996).

A noteworthy related work is the seminal paper of Agrawal et al. (1989), which gives fundamental lower bounds on the asymptotic regret scaling for general, parameterized reinforcement learning problems. The bound is also tight, in the sense that for finite parameter spaces, the authors show a learning algorithm that achieves the bound. Even though our analytical results also hold for the setting of a finite parameter space, the strategy of Agrawal et al. (1989) relies crucially on the finiteness assumption. This is in sharp contrast to Thompson sampling, which can be defined for any kind of parameter space. In fact, Thompson sampling has previously been shown to enjoy favorable regret guarantees with continuous priors in linear bandit problems (Agrawal and Goyal, 2013).

## 8 Conclusion and Future Work

We have proposed the TSMDP algorithm in this paper for solving parameterized RL problems, and have derived regret-style bounds for the algorithm under significantly general initial priors. This supports the increasing evidence for the success of Thompson sampling and pseudo-Bayesian methods for reinforcement learning/bandit problems.

Moving forward, it would be useful to extend the performance results for Thompson sampling to continuous parameter spaces, as well as understand what happens when feedback can be delayed. Specific applications to reinforcement learning problems with additional structure would also prove insightful. In particular, studying the regret of Thompson Sampling for MDPs with linear function approximation (Melo et al., 2008) would be of interest – in this setting, the parameterization of the MDP is in terms of linear weights corresponding to a known basis of state-action value functions, and one could develop a variant of Thompson sampling which uses information from sample paths to update its posterior over the space of weights.

Supplementary Material (Appendices and References)

## Appendix A Proof of Theorem 1

### A.1 Expressing the “posterior” distribution

At time $t$, the “posterior distribution” that TSMDP uses can be expressed by iterating Bayes’ rule:

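$$\forall M \subseteq \Theta: \qquad \pi_t(M) = \frac{W_t(M)}{W_t(\Theta)} = \frac{\int_M W_t(\theta)\,\pi(d\theta)}{\int_\Theta W_t(\theta)\,\pi(d\theta)},$$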

with the posterior density or weight $W_t(\theta)$ simply being the likelihood ratio of the entire observed history up to time $t$ under the MDPs with parameters $\theta$ and $\theta^\star$, i.e.,

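$$
\begin{aligned}
W_t(\theta) &:= \prod_{i=0}^{t-1} \frac{p_\theta(S_i,A_{i+1},S_{i+1})}{p_{\theta^\star}(S_i,A_{i+1},S_{i+1})} \\
&= \exp\left(\sum_{c\in\mathcal{C}} \sum_{i=0}^{t-1} \mathbb{1}\{C_{e(i)}=c\}\, \log\frac{p_\theta(S_i,A_{i+1},S_{i+1})}{p_{\theta^\star}(S_i,A_{i+1},S_{i+1})}\right) \\
&= \exp\left(\sum_{c\in\mathcal{C}} \sum_{(s_1,s_2)\in\mathcal{S}^2} \sum_{i=0}^{t-1} \mathbb{1}\{C_{e(i)}=c,\; (S_i,S_{i+1})=(s_1,s_2)\}\, \log\frac{p_\theta(s_1,c(s_1),s_2)}{p_{\theta^\star}(s_1,c(s_1),s_2)}\right) \\
&= \exp\left(-\sum_{c\in\mathcal{C}} V_c(t) \sum_{(s_1,s_2)\in\mathcal{S}^2} \frac{\sum_{i=0}^{t-1} \mathbb{1}\{C_{e(i)}=c,\; (S_i,S_{i+1})=(s_1,s_2)\}}{V_c(t)}\, \log\frac{p_{\theta^\star}(s_1,c(s_1),s_2)}{p_\theta(s_1,c(s_1),s_2)}\right),
\end{aligned}
\tag{4}
$$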

where $V_c(t) := \sum_{i=0}^{t-1} \mathbb{1}\{C_{e(i)}=c\}$ is the total number of time instants up to $t$ for which the epoch policy $c$ was used.
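For a finite parameter set, the weights and the resulting posterior can be computed directly from the observed transitions. The sketch below is illustrative only: the function names, the nested-list layout `kernels[j][a][s][s_next]`, and the assumption that every observed transition has strictly positive probability under every candidate model are all choices made here, not part of the source.

```python
import math


def log_weights(path, kernels, ref_index):
    """log W_t(theta_j) for every candidate j, relative to the model ref_index.

    path:    list of observed transitions (s, a, s_next)
    kernels: kernels[j][a][s][s_next] = transition probability under theta_j
    ASSUMPTION: all observed transitions have positive probability under
    every candidate, so each logarithm is finite.
    """
    ref = kernels[ref_index]
    return [sum(math.log(k[a][s][s2]) - math.log(ref[a][s][s2])
                for s, a, s2 in path)
            for k in kernels]


def posterior(log_w, prior):
    """Normalize prior-weighted likelihood ratios into posterior masses."""
    m = max(log_w)  # subtract the maximum for numerical stability
    unnorm = [p * math.exp(lw - m) for lw, p in zip(log_w, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical two-state, one-action example with two candidate models.
kernels = [
    [[[0.5, 0.5], [0.5, 0.5]]],   # candidate 0
    [[[0.9, 0.1], [0.2, 0.8]]],   # candidate 1
]
path = [(0, 0, 1), (1, 0, 1), (1, 0, 0)]
post = posterior(log_weights(path, kernels, ref_index=0), [0.5, 0.5])
```

Dividing by the likelihood under $\theta^\star$ is a device of the analysis: the factor is common to all candidates and cancels in the normalization, so choosing any other `ref_index` yields the same posterior, and an implementation never needs to know the true parameter.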

We will find it convenient in the sequel to introduce the following decomposition of the number of epochs up to epoch $k$ for which $c$ was chosen to be the epoch policy:

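$$N_c(k) := \sum_{l=1}^{k} \mathbb{1}\{\theta_l \in S_c\} = N'_c(k) + N''_c(k), \tag{5}$$

$$N'_c(k) := \sum_{l=1}^{k} \mathbb{1}\{\theta_l \in S'_c\}, \qquad N''_c(k) := \sum_{l=1}^{k} \mathbb{1}\{\theta_l \in S''_c\}.$$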

### A.2 An alternative probability space

In order to analyze the dynamics of the TSMDP algorithm, it is useful to work in an equivalent probability space defined as follows. Define a random matrix with elements in . The rows of are indexed by sampling indices , and the columns by policies in . For each , independently generate the -th column of by applying the stationary policy