## 1 Introduction

Reinforcement Learning (RL) is concerned with studying how an agent learns by repeated interaction with its environment. The goal of the agent is to act optimally to maximize some notion of performance, typically its net reward, in an environment modeled by a Markov Decision Process (MDP) comprising states, actions and state transition probabilities.

The difficulty of reinforcement learning stems primarily from the learner’s uncertainty in knowing the environment. When the environment is perfectly known, finding optimal behavior essentially becomes a dynamic programming or planning task. Without this knowledge, the learner faces a conflict between the need to explore the environment to discover its structure (e.g., reward/state transition behavior), and the need to exploit accumulated information. The trade-off is compounded by the fact that the agent’s current action influences future information. Thus, one has to strike the right balance between exploration and exploitation in order to learn efficiently.

Several modern reinforcement learning algorithms, such as UCRL2 (Jaksch et al., 2010), REGAL (Bartlett and Tewari, 2009) and R-max (Brafman and Tennenholtz, 2003), learn MDPs using the well-known "optimism under uncertainty" principle. The underlying strategy is to maintain high-probability confidence intervals for each state-action transition probability distribution and reward, shrinking the confidence interval corresponding to the current state transition/reward at each instant. Thus, observing a particular state transition/reward is assumed to provide information for only that state and action.

However, one often encounters learning problems in complex environments, often with some form of lower-dimensional structure. Parameterized MDPs, in which the entire structure of the MDP is determined by a parameter with only a few degrees of freedom, are a typical example. With such MDPs, observing a state transition at an instant can be informative about other, unobserved transitions. As a motivating example, consider the problem of learning to control a queue, where the state represents the occupancy of the queue at each instant (#packets), and the action is either FAST or SLOW, denoting the (known) rate of service that can be provided. The state transitions are governed by (a) the type of service (FAST/SLOW) chosen by the agent, together with (b) the arrival rate of packets to the queue, and the cost at each step is the sum of a (known) cost for the type of service and a holding cost per queued packet. Suppose that packets arrive to the system with a fixed, unknown rate $\lambda$ that alone parameterizes the underlying MDP. Then, every state transition is informative about $\lambda$, and only a few transitions are necessary to pinpoint $\lambda$ accurately and learn the MDP fully. A more general example is a system with several queues having potentially state-dependent arrival rates of a known parametric form.

A conceptually simple approach to learning MDPs with complex, parametric structure is posterior or Thompson sampling (Thompson, 1933), in which the learner starts by imposing a fictitious "prior" probability distribution over the uncertain parameters (thus, over all possible MDPs). A parameter is then sampled from this prior, the optimal behavior for that particular parameter is computed, and the action prescribed by that behavior for the current state is taken. After the resulting reward/state transition is observed, the prior is updated using Bayes' rule, and the process repeats.
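The loop just described can be sketched concretely for a finite candidate set of parameters. The helper functions `optimal_action`, `observe` and `likelihood` below are hypothetical stand-ins for the planner and the environment, not part of the paper:

```python
import random

def thompson_step(prior, candidates, optimal_action, observe, likelihood):
    """One round of Thompson sampling over a finite parameter set.

    prior: dict mapping parameter -> probability, updated in place via Bayes' rule.
    """
    # 1. Sample a parameter (candidate MDP) from the current prior.
    theta = random.choices(candidates, weights=[prior[c] for c in candidates])[0]
    # 2. Act optimally as if the sampled parameter were the truth.
    action = optimal_action(theta)
    # 3. Observe the environment's (random) response.
    obs = observe(action)
    # 4. Bayes' rule: reweight every candidate by its likelihood of the observation.
    for c in candidates:
        prior[c] *= likelihood(c, action, obs)
    z = sum(prior.values())
    for c in candidates:
        prior[c] /= z
    return action, obs
```

With, say, Bernoulli observations and a grain-of-truth prior, the posterior mass typically concentrates on the true parameter after a few hundred rounds.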

### 1.1 Contributions

The main contribution of this work is to present and analyze Thompson Sampling for MDPs (TSMDP) – an algorithm for
undiscounted, online, non-episodic reinforcement learning in general,
parameterized MDPs. The algorithm operates in cycles demarcated
by visits to a reference state, samples from the posterior once every
cycle and applies the optimal policy for the sample throughout the
cycle. Our primary result is a structural, problem-dependent
regret^{1}^{1}1more precisely, pseudo-regret
(Audibert and Bubeck, 2010) bound for TSMDP that holds for
sufficiently general parameter spaces and initial priors. The result
shows that for priors that put sufficiently large probability mass in
neighborhoods of the underlying parameter, with high probability the
TSMDP algorithm follows the optimal policy for all but a logarithmic
(in the time horizon) number of time instants. To our knowledge, these
are the first logarithmic gap-dependent bounds for Thompson sampling
in the MDP setting, without using any specific/closed form prior
structure. Furthermore, using a novel sample-path based concentration
analysis, we provide an explicit bound for the constant factor in this
logarithmic scaling which admits interpretation as a measure of the
“information complexity” of the RL problem. The constant factor
arises as the solution to an optimization problem involving the
Kullback-Leibler geometry of the parameter space^{2}^{2}2more
precisely, involving marginal KL divergences – weighted
KL-divergences that measure disparity between the true underlying
MDP and other candidate MDPs. We discuss this in detail in Sections
5, 3., and encodes in a natural
fashion the interdependencies among elements of the MDP induced by the
parametric structure^{3}^{3}3In fact, the constant factor is similar
in spirit to the notion of eluder dimension coined by Russo
and Van Roy (Russo and Van Roy, 2013) in their fully Bayesian analysis
of Thompson sampling for the bandit setting.. This results in
significantly improved regret scaling in settings when the
state/policy space is potentially large but where the space of
uncertain parameters is relatively much smaller (Section
4.3), and represents an advantage over decoupled
algorithms like UCRL2 which ignore the possibility of generalization
across states, and explore each state transition in isolation.

We also implement and evaluate the numerical performance of the TSMDP algorithm for a queue MDP with unknown, state-dependent, parameterized arrival rates; its performance is significantly better than that of the generic UCRL2 strategy.

The analysis of a distribution-based algorithm like Thompson sampling poses difficulties of a flavor unlike those encountered in the analysis of algorithms using point estimates and confidence regions (Jaksch et al., 2010; Bartlett and Tewari, 2009). In the latter class of algorithms, the focus is on (a) theoretically constructing tight confidence sets within which the algorithm uses the most optimistic parameter, and (b) tracking how the size of these confidence sets diminishes with time. In contrast, Thompson sampling, by design, is completely divorced from analytically tailored confidence intervals or point estimates. Understanding its performance requires tracking how the (posterior) distribution, driven by heterogeneous and history-dependent observations, concentrates with time. The problem of quantifying how the prior in Thompson sampling evolves in a general parameter space, with potentially complex structure or coupling between elements, where the posterior may not even be expressible in a convenient closed form, poses unique challenges that we address here. Almost all existing analyses of Thompson sampling for the multi-armed bandit (a degenerate special case of MDPs) rely heavily on specific properties of the problem, especially independence across actions' rewards, and/or specific structure of the prior, such as membership in a closed-form conjugate prior family (Agrawal and Goyal, 2012; Kaufmann et al., 2012; Korda et al., 2013; Agrawal and Goyal, 2013) or finite support (Gopalan et al., 2014).

Additional technical complications arise when generalizing from the bandit case – where the environment is stateless and IID (independent and identically distributed) – to state-based reinforcement learning in MDPs, in which the state evolution is coupled across time and evolves as a function of the decisions made. This makes tracking the evolution of the posterior and the algorithm's decisions especially challenging.

There is relatively little work on the rigorous performance analysis of Thompson sampling schemes for reinforcement learning. To the best of our knowledge, the only known regret analyses of Thompson sampling for reinforcement learning are those of Osband et al. (2013) and Osband and Van Roy (2014), which study the (purely) Bayesian setting, in which nature draws the true MDP from a prior that is also completely known to the algorithm. The former work establishes Bayesian regret bounds for Thompson sampling in the canonical parameterization setup (i.e., each state-action pair having independent transition/reward parameters), whereas the latter considers the same for parameterized MDPs as we do here. Our interest, however, is in the continuous (non-episodic) learning setting, and more importantly in the frequentist notion of regret performance, where the "prior" plays the role of merely a parameter used by the algorithm operating in an unknown, fixed environment. We are also interested in problem- (or "gap"-) dependent regret bounds depending on the explicit structure of the MDP parameterization.

In this work, we overcome these hurdles to derive the first regret-type bounds for TSMDP at the level of a general parameter space and prior. First, we directly consider the posterior density in its general form of a normalized, exponentiated, empirical Kullback-Leibler divergence. This is reminiscent of approaches to posterior consistency in the statistics literature (Shen and Wasserman, 2001; Ghosal et al., 2000), but we go beyond them in accounting for partial information from adaptively gathered samples. We then extend self-normalized, maximal concentration inequalities (de la Peña et al., 2007) for sums of sub-exponential random variables to Markov chain cycles, which may be of independent interest in the analysis of MDP-based algorithms. These permit us to show sample-path based bounds on the concentration of the posterior distribution, and help bound the number of cycles in which suboptimal policies are played – a measure of regret.

## 2 Preliminaries

Let $\Theta$ be a space of parameters, where each $\theta \in \Theta$ parameterizes an MDP $(\mathcal{S}, \mathcal{A}, R, p_\theta)$. Here, $\mathcal{S}$ and $\mathcal{A}$ represent finite state and action spaces, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function and $p_\theta$ is the probability transition kernel of the MDP (i.e., $p_\theta(s' \mid s, a)$ is the probability of the next state being $s'$ when the current state is $s$ and action $a$ is played). We assume that the learner is presented with an MDP $\theta^* \in \Theta$ where $\theta^*$ is initially unknown. In the canonical parameterization, the parameter factors into separate components for each state and action (Dearden et al., 1999).

We restrict ourselves to the case where the reward function is completely known, with the only uncertainty being in the transition kernel of the unknown MDP. The extension to problems with unknown rewards is standard (Bartlett and Tewari, 2009; Tewari and Bartlett, 2008).

A (stationary) policy or control is a prescription to (deterministically) play an action at every state of the MDP, i.e., a map $\pi : \mathcal{S} \to \mathcal{A}$. Let $\Pi$ denote the set of all stationary policies (note that $\Pi$ is finite since $\mathcal{S}, \mathcal{A}$ are finite; in general, $\Pi$ can be a subset of the set of all stationary policies, containing optimal policies for every $\theta \in \Theta$, which serves to model policies with specific kinds of structure, e.g., threshold rules); these are the "reference policies" to compete with. Each policy $\pi \in \Pi$, together with an MDP $\theta \in \Theta$, induces the discrete-time stochastic process $(s_t, a_t, r_t)_{t \geq 1}$, with $s_t$, $a_t$ and $r_t$ denoting the state, action taken and reward obtained respectively at time $t$. In particular, the sequence of visited states $(s_t)_{t \geq 1}$ becomes a discrete-time Markov chain.

[Algorithm 1: TSMDP]
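The algorithm float did not survive in this version, so here is a minimal runnable sketch of the epoch structure TSMDP uses (resampling only on returns to the start state). The helpers `env`, `optimal_policy` and `likelihood` are hypothetical stand-ins, and for simplicity the posterior is updated at every step rather than batched to the epoch's end:

```python
import random

def tsmdp(env, prior, candidates, optimal_policy, likelihood, s0, horizon):
    """Epoch-based Thompson sampling for MDPs: draw a fresh parameter sample
    only when the chain returns to the start state s0 (an epoch marker), and
    follow that sample's optimal policy for the whole epoch."""
    def sample():
        theta = random.choices(candidates, weights=[prior[c] for c in candidates])[0]
        return optimal_policy(theta)

    s, policy = s0, sample()
    for _ in range(horizon):
        a = policy[s]
        s_next = env(s, a)                       # one transition of the true MDP
        for c in candidates:                     # Bayes' rule on (s, a) -> s_next
            prior[c] *= likelihood(c, s, a, s_next)
        z = sum(prior.values())
        for c in candidates:
            prior[c] /= z
        s = s_next
        if s == s0:                              # epoch boundary: resample
            policy = sample()
    return prior
```

The key design point, as in the paper, is that the sampled policy is held fixed for an entire recurrence cycle, so each epoch yields a well-defined block of observations under a single policy.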

For each policy $\pi$, MDP $\theta \in \Theta$ and time horizon $T$, we define the $T$-step value function over initial states $s \in \mathcal{S}$ to be $V^\pi_\theta(s, T) := \mathbb{E}^\pi_\theta\!\left[\sum_{t=1}^{T} r_t \mid s_1 = s\right]$, with the subscripts (often dropped when convenient for the sake of clarity in notation) indicating the stochasticity induced by $\pi$ in the MDP $\theta$. Denote by $\pi^*(\theta) := \arg\max_{\pi \in \Pi} \lim_{T \to \infty} V^\pi_\theta(s, T)/T$ the policy with the best long-term average reward in $\Pi$ (we assume that the limiting average reward is well-defined; if not, one can restrict to the limit inferior; ties are assumed to be broken in a fixed fashion). Correspondingly, let $\rho^*(\theta)$ be the best attainable long-term average reward for $\theta$. We will overload notation and use $\pi^* \equiv \pi^*(\theta^*)$ and $\rho^* \equiv \rho^*(\theta^*)$.

In general, $x_i$ denotes the $i$-th coordinate of the vector $x$, and $\langle x, y \rangle$ is taken to mean the standard inner product of the vectors $x$ and $y$. $D(P \,\|\, Q)$ denotes the standard Kullback-Leibler divergence between probability distributions $P$ and $Q$ on a common finite alphabet. The notation $\mathbb{1}\{A\}$ is employed to denote the indicator random variable corresponding to event $A$.

The TSMDP Algorithm. TSMDP (Algorithm 1) operates in contiguous intervals of time called epochs, induced in turn by an increasing sequence of stopping times. We will analyze the version that uses the return times to the start state $s_0$ as epoch markers, i.e., epoch $k$ begins at the $k$-th return of the state process to $s_0$. The algorithm maintains a "prior" probability distribution (denoted by $\lambda_t$ at time $t$) over the parameter space $\Theta$, from which it samples a parameterized MDP at the beginning of each epoch (if the prior is analytically tractable, accurate sampling may be feasible; if not, a variety of schemes for sampling approximately from a posterior distribution, e.g., Gibbs/Metropolis-Hastings samplers, can be used). It then uses an average-reward optimal policy for the sampled MDP throughout the epoch, and updates the prior to a "posterior" distribution via Bayes' rule, effectively at the end of each epoch.

## 3 Assumptions Required for the Main Result

We describe in this section our main result for the TSMDP algorithm (Algorithm 1), driven by the intuition presented in Section 5. We begin by stating and explaining the assumptions needed for our results to hold.

###### Assumption 1 (Recurrence).

The start state $s_0$ is recurrent for the true MDP $\theta^*$ under each policy $\pi^*(\theta)$ for $\theta$ in the support of the prior. (Recall that a state $s$ is said to be recurrent in a discrete-time Markov chain if, starting from $s$, the chain returns to $s$ with probability $1$; see Levin et al., 2006.)

Assumption 1 is satisfied, for instance, if $\theta^*$ is an ergodic Markov chain under every stationary policy (a Markov chain is ergodic if it is irreducible, i.e., it is possible to go from every state to every state, not necessarily in one move) – a condition commonly used in prior work on MDP learning (Tewari and Bartlett, 2008; Burnetas and Katehakis, 1997). Define $\tau(\pi)$ to be the expected recurrence time to state $s_0$, starting from $s_0$, when policy $\pi$ is used in the true MDP $\theta^*$.

###### Assumption 2 (Bounded Log-likelihood ratios).

Log-likelihood ratios of transition probabilities are upper-bounded by a constant $\Gamma < \infty$: $\sup_{\theta \in \Theta} \max_{s, a, s'} \log \frac{p_{\theta^*}(s' \mid s, a)}{p_\theta(s' \mid s, a)} \leq \Gamma$.

Assumption 2 is primarily technical; it helps control the convergence of sample KL divergences to (expected) true KL divergences, and it is commonly employed in the statistics literature, e.g., (Shen and Wasserman, 2001).

###### Assumption 3 (Unique average-reward-optimal policy).

For the true MDP $\theta^*$, $\pi^* := \pi^*(\theta^*)$ is the unique average-reward optimal policy: $\rho^\pi(\theta^*) < \rho^*$ for every $\pi \neq \pi^*$.

The uniqueness assumption is made merely for ease of exposition; our results continue to hold with suitable redefinition otherwise.

The remaining assumptions (4 and 5) concern the behavior of the prior and the posterior distribution under "near-ideal" trajectories of the MDP. In order to introduce them, we will need a few definitions. Let $\nu_\pi(s)$ (resp. $\nu_\pi(s, s')$) be the stationary probability of state $s$ (resp. joint probability of $s$ immediately followed by $s'$) when the policy $\pi$ is applied to the true MDP $\theta^*$; correspondingly, let $\tau(\pi)$ be the expected first return time to state $s_0$. We denote by $D_\pi(\theta^* \| \theta)$ the important marginal Kullback-Leibler divergence for $\theta$ under $\pi$ (the marginal KL divergence appears as a fundamental quantity in the lower bound for regret in parameterized MDPs established by Agrawal et al. (1989)):

$$D_\pi(\theta^* \| \theta) := \sum_{s \in \mathcal{S}} \nu_\pi(s) \, D\big(p_{\theta^*}(\cdot \mid s, \pi(s)) \,\big\|\, p_\theta(\cdot \mid s, \pi(s))\big).$$

The marginal KL divergence is thus a convex combination of the KL divergences between the transition probability kernels of $\theta^*$ and $\theta$, with the weights of the convex combination being the appropriate invariant probabilities induced by policy $\pi$ under $\theta^*$. If $D_\pi(\theta^* \| \theta)$ is positive, then the MDPs $\theta^*$ and $\theta$ can be "resolved apart" using samples from the policy $\pi$. Denote by $D(\theta^* \| \theta)$ the vector of $D_\pi(\theta^* \| \theta)$ values across all policies $\pi \in \Pi$, with the convention that the final coordinate is associated with the optimal policy $\pi^*$.
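As a concrete illustration, the marginal KL divergence can be computed directly from a policy's stationary distribution and the two transition kernels. The data structures here (dicts keyed by state and state-action pair) are illustrative choices, not the paper's notation:

```python
import math

def marginal_kl(nu, P_true, P_alt, policy):
    """Stationary-weighted KL divergence between two MDPs under one policy:
    sum over states s of nu(s) * KL(p*(.|s, pi(s)) || p(.|s, pi(s)))."""
    total = 0.0
    for s, weight in nu.items():
        a = policy[s]
        p, q = P_true[(s, a)], P_alt[(s, a)]
        total += weight * sum(pi * math.log(pi / qi)
                              for pi, qi in zip(p, q) if pi > 0)
    return total
```

A positive value means that samples gathered while playing the policy eventually resolve the two MDPs apart; a value of zero means they are indistinguishable under that policy.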

For each policy $\pi \in \Pi$, define $\Theta_\pi := \{\theta \in \Theta : \pi^*(\theta) = \pi\}$ to be the decision region corresponding to $\pi$, i.e., the set of parameters/MDPs for which the average-reward optimal policy is $\pi$. Fixing $\epsilon > 0$, let $\Theta'_\pi := \{\theta \in \Theta_\pi : D_{\pi^*}(\theta^* \| \theta) < \epsilon\}$. In other words, $\Theta'_\pi$ comprises all the parameters (resp. MDPs) with average-reward optimal policy $\pi$ that "appear similar" to $\theta^*$ (resp. the true MDP) under the true optimal policy $\pi^*$. Correspondingly, put $\Theta''_\pi := \Theta_\pi \setminus \Theta'_\pi$ as the remaining set of parameters (resp. MDPs) in the decision region that are separated by at least $\epsilon$ w.r.t. $D_{\pi^*}$.

Let $k(t)$ denote the epoch to which time instant $t$ belongs. Let $N_k(\pi)$ be the number of epochs, up to and including epoch $k$, in which the policy applied by the algorithm was $\pi$. Let $N_k(s, s'; \pi)$ denote the total number of time instants at which the state transition $s \to s'$ occurred in the first $k$ epochs while policy $\pi$ was used.

The next assumption controls the posterior probability of playing the true optimal policy $\pi^*$ during any epoch, preventing it from falling arbitrarily close to $0$. Note that at the beginning of epoch $k$, the posterior measure of any legal subset of $\Theta$ can be expressed solely as a function of the sample state-pair counts $N_k(s, s'; \pi)$. The assumption requires that the posterior probability of the decision region of $\pi^*$ is uniformly bounded away from $0$ whenever the empirical state-pair frequencies are "near" their corresponding expected values (expectation w.r.t. the state transitions of $\theta^*$).

###### Assumption 4 (Posterior probability of the optimal policy under “near-ideal” trajectories).

For any scalars $\epsilon, \delta > 0$, there exists $\zeta > 0$ such that the posterior probability of the decision region $\Theta_{\pi^*}$ at the start of any epoch is at least $\zeta$ whenever the empirical state-pair counts $N_k(s, s'; \pi)$ are within a factor $(1 \pm \epsilon)$ of their expected values.

The final assumption we make is a "grain of truth" condition on the prior, requiring it to put sufficient probability on/around the true parameter $\theta^*$. Specifically, we require the prior probability mass in weighted marginal-KL neighborhoods of $\theta^*$ not to decay too fast as a function of the total weighting. This form of local prior property is analogous to the Kullback-Leibler condition (Barron, 1998; Choi and Ramamoorthi, 2008; Ghosal et al., 1999) used to establish consistency of Bayesian procedures, and in fact can be thought of as an extension of the standard condition to the partial-observations setting of this paper.

###### Assumption 5 (Prior mass on KL-neighborhoods of ).

(A) There exist $\beta > 0$ and $\alpha \geq 0$ such that the prior mass of the weighted marginal-KL neighborhood $\{\theta \in \Theta : \sum_{\pi} n_\pi D_\pi(\theta^* \| \theta) \leq 1\}$ is at least $\beta \big(\sum_\pi n_\pi\big)^{-\alpha}$, for all choices of nonnegative integers $(n_\pi)_{\pi \in \Pi}$.

(B) The bound in (A) holds with a possibly smaller exponent $\alpha' \leq \alpha$ for all choices of nonnegative integers $(n_\pi)_{\pi \in \Pi}$ that place most of their total weight on the coordinate of the optimal policy $\pi^*$.

The key factor that will be shown to influence the regret scaling with time is the exponent $\alpha$ above, which bounds the (polynomial) decay rate of the prior mass around, essentially, the marginal KL neighborhood of $\theta^*$ corresponding to always playing the policy $\pi^*$.

We show later how these assumptions are satisfied in finite parameter spaces (Section 4.1) and in continuous parameter spaces (Section 4.2). In particular, in finite parameter spaces, the assumptions can be shown to be satisfied with $\alpha = 0$, while for smooth (continuous) priors, the typical rate of $1/2$ per independent parameter dimension holds, i.e., $\alpha = d/2$ for $d$ independent parameter dimensions.

## 4 Main Result

We are now in a position to state the main, top-level result of this paper (due to space constraints, the proofs of all results are deferred to the appendix).

###### Theorem 1 (Regret-type bound for TSMDP).

Discussion. Theorem 1 gives a high-probability, logarithmic-in-$T$ bound on the number of time instants in $\{1, \ldots, T\}$ at which a suboptimal choice of action (w.r.t. $\pi^*$) is made. This can be interpreted as a natural regret-minimization property of the algorithm: in the case of a stochastic multi-armed bandit ($|\mathcal{S}| = 1$ and rewards IID across time) with rewards bounded in $[0, 1]$, for instance, this quantity serves as an upper bound on the standard pseudo-regret (Audibert and Bubeck, 2010); a bound on a suitably defined version of pseudo-regret for MDPs – see, e.g., Jaksch et al. (2010) – can easily be obtained from our main result (Theorem 1) by appropriate weighting, and we leave the details to the reader. The optimization problem (3) and the bound (2) can be interpreted as a multi-dimensional "game" in the space of (epoch) play counts of policies, with the following "rules":

1. Start growing the non-negative $|\Pi|$-dimensional vector $n$ of epoch play counts of all policies, with initial value $0$ (the coordinate of $n$ representing the number of plays of the optimal policy $\pi^*$ is irrelevant as far as regret is concerned, and is thus pegged to $0$ throughout).
2. Wait until the first time that some suboptimal policy $\pi$ is "eliminated", in the sense that the accumulated marginal-KL information $\langle n, D(\theta^* \| \theta) \rangle$ crosses a threshold of order $\log T$ for every $\theta$ in the decision region of $\pi$.
3. Record the play count vector at this point.
4. Impose the constraint that no further growth is allowed to occur in $n$ along dimension $\pi$ in the future.
5. Repeat growing the play count vector until all suboptimal policies are eliminated, and aim to maximize the final total count when this occurs.

An overview of how this optimization naturally arises as a regret bound for Thompson sampling is provided in Section 5.
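The "game" above can be made concrete by brute force on a toy instance. Below, `D[i][j]` stands for the (worst-case) marginal KL divergence separating the decision region of suboptimal policy `j` from the true parameter when policy `i` is played, and `L` plays the role of the $\log T$ threshold; all numbers are made up for illustration:

```python
def worst_case_plays(D, L):
    """Value of the elimination game: grow per-policy play counts n, one unit
    at a time, only along not-yet-eliminated policies; policy j is eliminated
    once sum_i n[i] * D[i][j] >= L.  Return the max total plays achievable."""
    m = len(D)
    best = 0

    def live(n):
        return [j for j in range(m)
                if sum(n[i] * D[i][j] for i in range(m)) < L]

    def dfs(n, total):
        nonlocal best
        alive = live(n)
        if not alive:
            best = max(best, total)
            return
        for i in alive:            # grow only coordinates of live policies
            n[i] += 1
            dfs(n, total + 1)
            n[i] -= 1

    dfs([0] * m, 0)
    return best
```

With decoupled policies (an identity-like `D`), each suboptimal policy must be eliminated by its own plays, while shared divergences let one policy's plays eliminate others – the source of the improved scaling discussed in Section 4.3.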

We also have the following square-root scaling for the usual notion of regret for MDPs (Jaksch et al., 2010):

###### Theorem 2 (Regret bound for TSMDP).

Under the hypotheses of Theorem 1, for the TSMDP algorithm there exists a problem-dependent constant $c$ such that, with probability at least $1 - \delta$, the cumulative regret satisfies $\mathrm{Regret}(T) \leq c\sqrt{T}$ for all sufficiently large $T$.

This can be compared with the probability-at-least-$(1-\delta)$ regret bound of order $D |\mathcal{S}| \sqrt{|\mathcal{A}| T}$ for UCRL2 (Jaksch et al., 2010, Theorem 4), with $D$ being the diameter of the true MDP (the diameter is the maximum, over ordered pairs of states $(s, s')$, of the time it takes to move from $s$ to $s'$ using an appropriate policy for each pair).
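For reference, the diameter of a small MDP can be computed by value iteration on the stochastic-shortest-path equations. This is an illustrative sketch, with the kernel given as a dict `P[(s, a)] = {next_state: prob}`:

```python
def diameter(P, states, actions, iters=2000):
    """Diameter D = max over ordered pairs (s, goal) of the minimum (over
    policies) expected time to reach goal from s, via value iteration on
    h(s) = 1 + min_a sum_{s'} P(s' | s, a) h(s'), with h(goal) = 0."""
    best = 0.0
    for goal in states:
        h = {s: 0.0 for s in states}
        for _ in range(iters):
            h = {s: 0.0 if s == goal else
                 1.0 + min(sum(p * h[s2] for s2, p in P[(s, a)].items())
                           for a in actions)
                 for s in states}
        best = max(best, max(h.values()))
    return best
```

The inner fixed point is the minimum expected hitting time to `goal`, so the outer maximum over state pairs is exactly the diameter quantity appearing in the UCRL2 bound.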

The following sections show how the conclusions of Theorem 1 are applicable to various MDPs and illustrate the behavior of the scaling constant , showing that significant gains are obtained in the presence of correlated parameters.

### 4.1 Application: Discrete Parameter Spaces

We show here how the conclusion of Theorem 1 holds in a setting where the true MDP is known to be one among finitely many candidate models (MDPs).

###### Assumption 6 (Finitely many parameters, “Grain of truth” prior).

The prior probability distribution $\lambda_0$ is supported on finitely many parameters: $|\Theta| < \infty$. Moreover, it puts positive mass on the true parameter: $\lambda_0(\theta^*) > 0$.

###### Theorem 3 (Regret-type bound for TSMDP, Finite parameter setting).

Suppose Assumptions 1, 2, 3 and 6 hold. Then, (a) Assumption 4 holds, and (b) Assumption 5 holds with $\alpha = 0$. Consequently, the conclusion of Theorem 1 holds, namely: let $\pi^*$ be the unique optimal stationary policy for the true MDP $\theta^*$. For the TSMDP algorithm, with probability at least $1 - \delta$, the number of suboptimal time instants up to any horizon $T$ is bounded by a problem- and prior-dependent quantity independent of $T$, plus $C \log T$, where $C$ is the value of the optimization problem (3).

### 4.2 Application: Continuous Parameter Spaces

To illustrate the generality of our result, we apply our main result (Theorem 1) to obtain a regret bound for Thompson sampling with a continuous prior, i.e., a probability density on $\Theta$ (by a probability density on $\Theta$, we mean a probability measure absolutely continuous w.r.t. Lebesgue measure). For ease of exposition, let us consider a $2$-state, $2$-action MDP: $\mathcal{S} = \{0, 1\}$, $\mathcal{A} = \{a_1, a_2\}$ (the theory can be applied in general to finite-state, finite-action MDPs). The (known) reward in state $s$ is $r_s$, irrespective of the action played, with $r_1 > r_0$. All the uncertainty is in the transition kernel of the MDP, parameterized by the canonical parameters $p_\theta(1 \mid s, a)$ for each state-action pair (note that we retain only the independent parameters of the MDP model), so that the parameter space is a subset of $[0, 1]^4$. It follows that the optimal policy for a parameter $\theta$ is one that maximizes, in each state, the probability of moving to (or staying at) state $1$.

Imagine that the TSMDP algorithm is run with initial/recurrence state $s_0$ and prior equal to the uniform density on a sub-cube of the parameter space containing the true parameter $\theta^*$. Also, without loss of generality, suppose that the optimal policy is to always play action $a_1$. It can be checked that under this setup, Assumptions 1, 2 and 3 hold. The following result establishes the validity of Assumptions 4 and 5 in this continuous prior setting.
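For this $2$-state example, the optimal policy can be computed coordinate-wise: the stationary probability of the high-reward state is monotone in each "move-to-that-state" probability, so picking the best action in each state separately is average-reward optimal. The function below is a sketch under that observation, with `p1[(s, a)]` denoting the (canonical) probability of landing in state 1 from state `s` under action `a`:

```python
def optimal_policy_2state(p1, rewards):
    """Average-reward optimal policy for a 2-state MDP whose reward depends
    only on the state: in each state, pick the action maximizing the
    probability of landing in the higher-reward state."""
    hi = max((0, 1), key=lambda s: rewards[s])      # the higher-reward state
    actions = sorted({a for (_, a) in p1})
    if hi == 1:
        return {s: max(actions, key=lambda a: p1[(s, a)]) for s in (0, 1)}
    # symmetric case: maximize the probability of landing in state 0 instead
    return {s: max(actions, key=lambda a: 1 - p1[(s, a)]) for s in (0, 1)}
```

The monotonicity fact used here is easy to verify: with $q_s$ the chosen probability of landing in state 1 from state $s$, the stationary mass of state 1 is $q_0 / (1 + q_0 - q_1)$, which is nondecreasing in both $q_0$ and $q_1$.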

### 4.3 Dependence of the Regret Scaling on MDP and Parameter Structure

We derive the following consequence of Theorem 1, useful in its own right, that explicitly guarantees an improvement in regret directly based on the Kullback-Leibler resolvability of parameters in the parameter space – a measure of the coupling across policies in the MDP.

###### Theorem 5 (Explicit Regret Improvement due to shared Marginal KL-Divergences).

Suppose that $\theta \in \Theta$ and the integer $m$ are such that at least $m$ coordinates of $D(\theta^* \| \theta)$ are at least $\epsilon$ (the coordinate corresponding to the optimal policy $\pi^*$ is excluded from this count). Then, the multiplicative scaling factor $C$ in (2) decreases additively in $m$.

The result assures a non-trivial additive reduction of the regret constant from the naive decoupled regret, whenever any suboptimal model in $\Theta$ can be resolved apart from $\theta^*$ by at least $m$ actions in the sense of the marginal KL divergences of their observations.

Although the net number of decision vectors in (3) can be as large as the number of policies, the scale of $C$ can be significantly smaller, owing to the fact that the posterior probability of several parameters is driven down simultaneously via the marginal KL divergence terms $D_\pi(\theta^* \| \theta)$. Put differently, naively using a standard bandit algorithm (e.g., UCB) with each arm being a stationary policy would perform much worse, with regret scaling linearly in the number of policies. We show (Appendix E) an example of an MDP in which the number of states can be arbitrarily large but which has only one uncertain scalar parameter, for which Thompson sampling achieves a much better regret scaling than frequentist counterparts like UCRL2 (Jaksch et al., 2010), which are forced to explore all possible state transitions in isolation.

## 5 Sketch of Proof and Techniques used to show Theorem 1

At the outset, TSMDP is a randomized algorithm whose decisions are based on random samples from the parameter space $\Theta$. The essence of Thompson sampling's performance lies in understanding how the posterior distribution evolves as time progresses.

Let us assume, for ease of exposition, that we have finitely many parameters: $|\Theta| < \infty$. Writing out the expression for the posterior density at time $t$ using Bayes' rule, we see that each $\theta \in \Theta$ receives weight proportional to its prior weight times an exponentiated sum of log-likelihood terms. This sum in the exponent can be rearranged into a sum, over policies $\pi$ and state pairs $(s, s')$, of the empirical transition counts $N_{k(t)}(s, s'; \pi)$ times the log-likelihood ratios $\log \frac{p_{\theta^*}(s' \mid s, \pi(s))}{p_\theta(s' \mid s, \pi(s))}$. This is an empirical quantity depending on the (random) sample path. To gain a clear understanding of the posterior evolution, let us replace the empirical terms in the sum by their "ergodic averages" (i.e., expected values under the respective invariant distributions) under the respective policies. In other words, for each $\pi$ and $s$, we approximate the empirical frequency of state $s$ under policy $\pi$ by $\nu_\pi(s)$, the stationary probability of state $s$ when the policy $\pi$ is applied to the true MDP $\theta^*$; in the same way, we approximate the empirical frequency of the transition $(s, s')$ by $\nu_\pi(s, s')$. With these "typical" estimates, our approximation to the posterior density simply becomes proportional to the prior weight times $\exp\big(-\sum_{\pi} N_\pi D_\pi(\theta^* \| \theta)\big)$, with $N_\pi$ the (suitably scaled) number of plays of policy $\pi$ so far.

Expression (5) is the result of effectively eliminating one of the two sources of randomness in the dynamics of the TSMDP algorithm: the variability of the environment, i.e., the state transitions. The other source of randomness arises from the algorithm's sampling behavior from the posterior distribution. We use approximation (5) to extract two basic insights that determine the posterior shrinkage and regret performance of TSMDP even for general parameter spaces. For a total time horizon of $T$ steps, we claim: Property 1. The true model always has "high" posterior mass. Assuming $\lambda_0(\theta^*) > 0$ (the discrete "grain of truth" property), observe that (5) implies that the posterior weight of $\theta^*$ stays bounded below at all times $t$, since $D_\pi(\theta^* \| \theta^*) = 0$ for every $\pi$ while every other parameter's weight can only shrink. Thus, roughly, the true parameter $\theta^*$ is sampled by TSMDP with at least a constant frequency during the entire horizon.
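Property 1 can be sanity-checked numerically in a stripped-down setting: IID Bernoulli observations standing in for MDP transitions, a finite parameter grid, and a grain-of-truth prior. This toy setup is our own illustration, not the paper's experiment:

```python
import math, random

def posterior_path(theta_star, candidates, prior, T, rng):
    """Posterior mass of the true parameter along one sample path of IID
    Bernoulli(theta_star) observations, tracked in log space for stability."""
    logw = {c: math.log(prior[c]) for c in candidates}
    path = []
    for _ in range(T):
        x = 1 if rng.random() < theta_star else 0
        for c in candidates:
            logw[c] += math.log(c if x == 1 else 1 - c)
        zmax = max(logw.values())
        norm = sum(math.exp(v - zmax) for v in logw.values())
        path.append(math.exp(logw[theta_star] - zmax) / norm)
    return path
```

On typical runs, the mass of the true parameter never collapses and tends to 1, matching the prediction that $\theta^*$ keeps being sampled at a non-vanishing frequency.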

We also have: Property 2. Suboptimal models are sampled only as long as their posterior probability is above $1/T$. The total expected number of times that parameters with posterior mass less than $1/T$ can be picked in $T$ rounds of Thompson sampling is at most a constant, which is irrelevant as far as the scaling of the regret with $T$ is concerned.

With these two insights, we can now estimate the net number of times bad parameters may be chosen. To this end, partition the parameter space into the optimal decision regions $\Theta_\pi$, $\pi \in \Pi$. Now, for each suboptimal $\pi$ and each $\theta \in \Theta''_\pi$ (the parameters in the region separated from $\theta^*$ under $\pi^*$), the divergence $D_{\pi^*}(\theta^* \| \theta)$ is positive; thus, since $\Theta$ is finite, there exists $\epsilon > 0$ such that $D_{\pi^*}(\theta^* \| \theta) \geq \epsilon$ uniformly across all such $\theta$. But this in turn implies, using Property 1 and (5), that the posterior probability of such $\theta$ decays exponentially with time $t$. Hence, such parameters are sampled at most a constant number of times in any time horizon with high probability, and do not contribute to the overall regret scaling.

The interesting and non-trivial contribution to the regret comes from the number of times that parameters from the "similar-looking" sets $\Theta'_\pi$, $\pi \neq \pi^*$, are sampled. To see this, let us follow the vector of play counts of policies as it grows from the all-zeros vector, increasing by $1$ in some coordinate at each epoch. By Property 2 above, once the posterior mass of $\Theta'_\pi$ drops below $1/T$, sampling from $\Theta'_\pi$ effectively ceases. Thus, considering the "worst-case" path that the play count vector can follow to delay this condition for the longest time across all $\pi \neq \pi^*$, we arrive (approximately) at the optimization problem (3) stated in Theorem 1.

Though the argument above was based on rather coarse approximations to empirical, path-based quantities, the underlying intuition holds true and is made rigorous (Appendix A), showing that this is indeed the right scaling of the regret. This involves several technical tools tailored to the analysis of Thompson sampling in MDPs, including (a) the development of self-normalized concentration inequalities for sums of IID sub-exponential random variables (epoch-related quantities), and (b) control of the posterior probability using properties of the prior in Kullback-Leibler neighborhoods of the true parameter, via techniques analogous to those used to establish frequentist consistency of Bayesian procedures (Ghosal et al., 2000; Choi and Ramamoorthi, 2008).

## 6 Numerical Evaluation

MDP and Parameter Structure: Along the lines of the motivating example in the Introduction, we model a single-buffer, discrete-time queueing system with a maximum occupancy of $N$ packets/customers. The state of the MDP is simply the number of packets in the queue at any given time, i.e., $\mathcal{S} = \{0, 1, \ldots, N\}$. At any given time, one of two actions – SLOW service or FAST service – may be chosen. Applying SLOW (resp. FAST) service results in serving one packet from the queue, if it is not empty, with the corresponding (known) service probability; i.e., the service model is Bernoulli, with the packet processing probability determined by the chosen service type. The two service types incur (known) per-instant costs. In addition to this cost, there is a holding cost per packet in the queue at all times, and the system gains a fixed reward whenever a packet is served from the queue. (A candidate physical interpretation of such a queueing system is a restaurant with $N$ tables, with the possibility to add more "chefs" or staff into service when desired (service rate control); however, adding staff costs the restaurant, as do customers waiting long until their orders materialize (holding cost).)

The arrival rate of the queueing system, i.e., the probability with which a new packet enters the buffer at each instant, is modeled as state-dependent. Most importantly, the function mapping a state $s$ to its packet arrival probability $\lambda(s)$ is parameterized by a Gaussian-shaped curve (note that this describes only the shape of the rate curve, not a probability distribution over states): $\lambda(s) = C \exp\left(-\frac{(s - w_1)^2}{2 w_2^2}\right)$. Here, $(w_1, w_2)$ is the two-dimensional (mean, width) parameter of the arrival rate curve, and $C$ is a constant chosen so that $\lambda(s) \in [0, 1]$ for all states $s$ (to ensure valid Bernoulli packet arrival distributions). The true, unknown MDP corresponds to a fixed parameter $(w_1^*, w_2^*)$, and the candidate parameter space is the Cartesian product of discretized ranges for $w_1$ and $w_2$. Figure 1 depicts (a) the optimal policy over the state space, (b) the stationary distribution under the optimal policy, and (c) the (parameterized) mean arrival rate curve over the state space.

Simulation Results: We simulate both TSMDP and the UCRL2 algorithm (Jaksch et al., 2010) for the parameterized queueing MDP above. For UCRL2, we run the algorithm both with (a) fixed confidence intervals and (b) horizon-dependent confidence intervals.[22] We initialize TSMDP with a uniform prior over the normalized parameter on the discretized parameter space.

[22] This choice of confidence interval is used by Jaksch et al. to show a logarithmic expected regret bound for UCRL2 (Jaksch et al., 2010).
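The Gaussian-shaped arrival rate curve described above can be sketched as follows; the parameter names and the default scale constant are illustrative assumptions, not the paper's values.

```python
import math

def arrival_prob(s, mean, width, scale=0.9):
    """Gaussian-shaped arrival rate curve lambda(s).

    `scale` plays the role of the constant C that keeps lambda(s)
    within [0, 1], so that Bernoulli arrivals are well-defined.
    """
    return scale * math.exp(-((s - mean) ** 2) / (2.0 * width ** 2))
```

The curve peaks at the mean state (where it equals the scale constant) and decays symmetrically on either side, so arrivals are most likely at intermediate occupancy levels.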

Figure 2 shows the results of running the TSMDP and UCRL2 algorithms over a range of time horizons, across multiple independent sample runs. We report both the average regret (with respect to the best achievable per-step average reward) and a high percentile of the regret across the runs. Thompson sampling is seen to significantly outperform UCRL2 as the horizon length increases. This advantage is presumably due to the fact that TSMDP can exploit the parameterized structure of the MDP better than UCRL2, which updates each confidence interval only when the associated state is visited.
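A minimal sketch of the posterior bookkeeping that a discretized Thompson sampling scheme of this kind might use is shown below. This is an illustrative outline, not the paper's implementation: the grid, the likelihood function, and the weight representation are all assumptions.

```python
import random

def update_posterior(weights, likelihood):
    """One Bayes update over a finite grid of candidate parameters.

    weights: dict mapping candidate parameter -> current posterior weight
    likelihood: dict mapping candidate parameter -> likelihood of the
        newly observed transition under that candidate
    """
    for p in weights:
        weights[p] *= likelihood[p]
    total = sum(weights.values())
    for p in weights:
        weights[p] /= total
    return weights

def sample_parameter(weights, rng=random):
    """Draw one candidate parameter from the posterior (the Thompson step)."""
    r = rng.random()
    acc = 0.0
    last = None
    for p, w in weights.items():
        acc += w
        last = p
        if r <= acc:
            return p
    return last  # guard against floating-point round-off
```

At the start of each epoch, the algorithm samples a parameter from the current posterior, acts according to the optimal policy of the sampled MDP, and folds the observed transitions back into the weights; this is how a single observation can shift belief over the entire parameter grid at once, unlike per-state confidence intervals.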

## 7 Related Work

A line of recent work (Agrawal and Goyal, 2012; Kaufmann et al., 2012; Korda et al., 2013; Agrawal and Goyal, 2013; Gopalan et al., 2014) has demonstrated that Thompson sampling enjoys near-optimal regret guarantees for multi-armed bandits, a widely studied subclass of reinforcement learning problems.

The work of Osband et al. (2013), perhaps the most relevant to ours, studies the Bayesian regret of Thompson sampling for MDPs. In that setting, the true MDP is assumed to be drawn from the same prior used by the algorithm; consequently, the Bayesian regret is the standard frequentist regret averaged over the parameter space with respect to the prior. While useful, this is arguably weaker than the standard frequentist notion of regret: it is an averaged version of standard regret (with respect to the specific prior), and it is not indicative of how the structure of the MDP influences regret performance. Moreover, the learning model considered in their work is episodic, with fixed-length episodes and resets, as opposed to the non-episodic setting treated in this work, where we show the first known structural (“gap-dependent”) regret bounds for Thompson sampling in fixed but unknown parameterized MDPs.

Prior to this, Ortega and Braun (2010) investigate the consistency of posterior-sampling-based control rules, again in the fully Bayesian setting where nature’s prior is known.

Several deterministic algorithms relying on the “optimism under uncertainty” philosophy have been proposed for RL in the frequentist setup considered here (Brafman and Tennenholtz, 2003; Jaksch et al., 2010; Bartlett and Tewari, 2009). These algorithms work by maintaining confidence intervals for each transition probability and reward, computing the most optimistic MDP satisfying all confidence intervals, and adaptively shrinking a confidence interval each time the relevant state transition occurs. This strategy can be inefficient in parameterized MDPs, where observing a particular state transition may give information about other parts of the MDP as well.

The parameterized MDP setting we consider in this work has been previously studied by other authors. Dyagilev et al. (2008) investigate learning parameterized MDPs over finite parameter spaces in the discounted setting (we consider the average-reward setting), and establish sample-complexity results under the Probably-Approximately-Correct (PAC) learning model, which differs from the notion of regret.

The certainty equivalence approach to learning MDPs (Kumar and Varaiya, 1986) – building the most plausible model given available data and using the optimal policy for it – is perhaps natural, but it suffers from a serious lack of adequate exploration necessary to achieve low regret (Kaelbling et al., 1996).

A noteworthy related work is the seminal paper of Agrawal et al. (1989), which gives fundamental lower bounds on the asymptotic regret scaling for general parameterized reinforcement learning problems. The bound is tight in the sense that, for finite parameter spaces, the authors exhibit a learning algorithm that achieves it. Even though our analytical results also hold for the setting of a finite parameter space, the strategy in (Agrawal et al., 1989) relies crucially on the finiteness assumption. This is in sharp contrast to Thompson sampling, which can be defined for any kind of parameter space. In fact, Thompson sampling has previously been shown to enjoy favorable regret guarantees with continuous priors in linear bandit problems (Agrawal and Goyal, 2013).

## 8 Conclusion and Future Work

We have proposed the TSMDP algorithm in this paper for solving parameterized RL problems, and have derived regret-style bounds for the algorithm under fairly general initial priors. This adds to the growing evidence for the success of Thompson sampling and pseudo-Bayesian methods in reinforcement learning and bandit problems.

Moving forward, it would be useful to extend the performance results for Thompson sampling to continuous parameter spaces, as well as understand what happens when feedback can be delayed. Specific applications to reinforcement learning problems with additional structure would also prove insightful. In particular, studying the regret of Thompson Sampling for MDPs with linear function approximation (Melo et al., 2008) would be of interest – in this setting, the parameterization of the MDP is in terms of linear weights corresponding to a known basis of state-action value functions, and one could develop a variant of Thompson sampling which uses information from sample paths to update its posterior over the space of weights.

Supplementary Material (Appendices and References)

Thompson Sampling for Learning Parameterized Markov Decision Processes

## Appendix A Proof of Theorem 1

### A.1 Expressing the “posterior” distribution

At time $t$, the “posterior distribution” that TSMDP uses can be expressed by iterating Bayes’ rule:

with the posterior density or weight simply being the likelihood ratio of the entire observed history up to time $t$ under the candidate and true MDPs, i.e.,

(4)

where the count appearing in (4) is the total number of time instants up to time $t$ for which the corresponding epoch policy was used.
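As a point of reference, a likelihood-ratio posterior of this kind generically takes the following form; the notation below is chosen for this sketch and may differ from the paper’s own symbols:

```latex
\pi_t(\theta) \;\propto\; \pi_0(\theta)\,
  \prod_{s=0}^{t-1} \theta\bigl(x_{s+1} \mid x_s, a_s\bigr),
\qquad\text{so that}\qquad
\frac{\pi_t(\theta)}{\pi_t(\theta^*)}
  = \frac{\pi_0(\theta)}{\pi_0(\theta^*)}
    \prod_{s=0}^{t-1}
    \frac{\theta\bigl(x_{s+1} \mid x_s, a_s\bigr)}
         {\theta^*\bigl(x_{s+1} \mid x_s, a_s\bigr)}.
```

Here $\theta(x' \mid x, a)$ denotes the transition probability under candidate parameter $\theta$, and $\theta^*$ is the true parameter; grouping identical transitions yields the exponent counts referred to above.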

We will find it convenient in the sequel to introduce the following decomposition of the number of epochs, up to a given epoch, for which a particular policy was chosen as the epoch policy:

(5)

### A.2 An alternative probability space

In order to analyze the dynamics of the TSMDP algorithm, it is useful to work in an equivalent probability space defined as follows. Define a random matrix whose elements take values in the state space, whose rows are indexed by sampling indices, and whose columns are indexed by stationary policies. For each policy, independently generate the corresponding column of the matrix by applying that stationary policy to the MDP,