Thompson sampling (Thompson, 1933), or posterior sampling for reinforcement learning (PSRL), is a conceptually simple approach for dealing with unknown MDPs (Strens, 2000; Osband et al., 2013). PSRL begins with a prior distribution over the MDP model parameters (transitions and/or rewards) and typically works in episodes. At the start of each episode, an MDP model is sampled from the posterior belief, and the agent follows the policy that is optimal for that sampled MDP until the end of the episode. The posterior is updated at the end of every episode based on the observed actions, states, and rewards. PSRL has recently been studied extensively for a special class of MDPs with state resetting, either explicit or implicit. Specifically, in (Osband et al., 2013; Osband and Van Roy, 2014) the considered MDPs are assumed to have fixed-length episodes, and at the end of each episode the MDP’s state is reset according to a fixed state distribution. In (Gopalan and Mannor, 2015), the environment is assumed to be ergodic, with a recurrent state under any policy. Both lines of work developed variants of PSRL algorithms under their respective assumptions, together with state-of-the-art regret bounds: Bayesian in (Osband et al., 2013; Osband and Van Roy, 2014) and frequentist in (Gopalan and Mannor, 2015).
However, many real-world problems are of a continuing and non-resetting nature. These include sequential recommendations and other common examples found in controlled mechanical systems (e.g., control of manufacturing robots) and process optimization (e.g., controlling a queuing system), where ‘resets’ are rare or unnatural. Many of these real-world examples could easily be parametrized with a scalar parameter, where each value of the parameter specifies a complete model. These types of domains do not have the luxury of state resetting, and the agent needs to learn to act without necessarily revisiting states. Extensions of the PSRL algorithms to general MDPs without state resetting have so far produced impractical algorithms and, in some cases, flawed theoretical analysis. This is due to the difficulty of analyzing regret under policy switching schedules that depend on various dynamic statistics produced by the true underlying model (e.g., doubling the visitations of state and action pairs and uncertainty reduction of the parameters). Next we summarize the literature for this general case PSRL.
The earliest such general case was analyzed as Bayes regret in a ‘lazy’ PSRL algorithm (Abbasi-Yadkori and Szepesvári, 2015). In this approach a new model is sampled, and a new policy is computed from it, every time the uncertainty over the underlying model is sufficiently reduced; however, the corresponding analysis was shown to contain a gap (Osband and Van Roy, 2016).
A recent general case PSRL algorithm with Bayes regret analysis was proposed in (Ouyang et al., 2017b). At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria: a new episode starts either when the length of the current episode exceeds the previous length by one, or when the number of visits to any state-action pair is doubled. They establish $\tilde{O}(HS\sqrt{AT})$ bounds on expected regret under a Bayesian setting, where $S$ and $A$ are the sizes of the state and action spaces, $T$ is time, $H$ is the bound of the span, and the $\tilde{O}$ notation hides logarithmic factors. However, despite the state-of-the-art regret analysis, the algorithm is not well suited for large and continuous state and action spaces, because it must count visitations for all state-action pairs.
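The two stopping criteria described above can be sketched as a small predicate. This is a minimal illustration under stated assumptions, not the implementation of Ouyang et al.: the function name `tsde_should_switch` and its arguments are hypothetical, and visit counts are assumed to be kept in plain dictionaries keyed by state-action pair.

```python
def tsde_should_switch(steps_in_episode, prev_episode_len, counts_now, counts_at_start):
    """Return True when either dynamic-episode stopping criterion fires.

    steps_in_episode: steps elapsed in the current episode (hypothetical name)
    prev_episode_len: length of the previous episode
    counts_now:       dict mapping (state, action) -> visits so far
    counts_at_start:  dict with the same counts at the start of this episode
    """
    # Criterion 1: the current episode has outgrown the previous one.
    if steps_in_episode > prev_episode_len:
        return True
    # Criterion 2: some state-action visit count has more than doubled.
    for pair, visits in counts_now.items():
        if visits > 2 * counts_at_start.get(pair, 0):
            return True
    return False
```

Note that the predicate has to scan a table of per-pair counts, which is the tabular bookkeeping that becomes impractical in large or continuous spaces.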
In another recent work (Agrawal and Jia, 2017), the authors present a general case PSRL algorithm that achieves near-optimal worst-case regret bounds when the underlying Markov decision process is communicating with a finite, though unknown, diameter. Their main result is a high probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$, when $T \geq S^5 A$. Despite the nice form of the regret bound, this algorithm suffers from practicality issues similar to those of the algorithm in (Ouyang et al., 2017b). The epochs are computed by doubling the visitations of state and action pairs, which implies tabular representations. In addition, it employs a stricter assumption than previous work, namely a fully communicating MDP with some unknown diameter. Finally, for the bound to hold, $T \geq S^5 A$ is required, which would be impractical for large scale problems.
Neither of the above two recent state-of-the-art algorithms (Ouyang et al., 2017b; Agrawal and Jia, 2017) uses generalization, in that they learn separate parameters for each state-action pair. For this non-parametrized case, there are several other modern reinforcement learning algorithms, such as UCRL2 (Jaksch et al., 2010), REGAL (Bartlett and Tewari, 2009), and R-max (Brafman and Tennenholtz, 2002), which learn MDPs using the well-known ‘optimism under uncertainty’ principle. In these approaches a confidence interval is maintained for each state-action pair, and observing a particular state transition and reward provides information for only that state and action. Such approaches are inefficient in cases where the whole structure of the MDP can be determined with a scalar parameter.
Because these algorithms focus on tabular reinforcement learning, they are sample inefficient for many practical problems with exponentially large or even continuous state/action spaces. On the other hand, in many practical RL problems, the MDPs are parametrized in the sense that system dynamics and reward/loss functions are assumed to lie in a known parametrized low-dimensional manifold (Gopalan and Mannor, 2015). Such model parametrization (i.e., model generalization) allows researchers to develop sample efficient algorithms for large-scale RL problems. Our paper belongs to this line of research. Specifically, we propose a novel general case PSRL algorithm, referred to as DS-PSRL, that exploits model parametrization (generalization). We prove an $\tilde{O}(\sqrt{T})$ Bayes regret bound for DS-PSRL, assuming we can model every MDP with a single smooth parameter.
DS-PSRL also has lower computational and space complexities than the algorithms proposed in (Ouyang et al., 2017b; Agrawal and Jia, 2017). In the case of (Ouyang et al., 2017b) the number of policy switches in the first $T$ steps is $O(\sqrt{2SAT\log T})$; on the other hand, DS-PSRL adopts a deterministic schedule and its number of policy switches is $O(\log T)$. Since the major computational burden of PSRL algorithms is to solve a sampled MDP at each policy switch, DS-PSRL is computationally more efficient than the algorithm proposed in (Ouyang et al., 2017b). As to the space complexity, both algorithms proposed in (Ouyang et al., 2017b; Agrawal and Jia, 2017) need to store counts of state and action visitations. In contrast, DS-PSRL uses a model-independent schedule and as a result does not need to store such statistics.
In the rest of the paper we will describe the DS-PSRL algorithm, and derive a state-of-the-art Bayes regret analysis. We will demonstrate and compare our algorithm with state-of-the-art on standard problems from the literature. Finally, we will show how the assumptions of our algorithm satisfy a sensible parametrization for a large class of problems in sequential recommendations.
2 Problem Formulation
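We consider the reinforcement learning problem in a parametrized Markov decision process (MDP) $(\mathcal{X}, \mathcal{A}, \ell, P^{\theta_*})$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $\ell : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the instantaneous loss function, and $P^{\theta_*}$ is an MDP transition model parametrized by $\theta_*$. We assume that the learner knows $\mathcal{X}$, $\mathcal{A}$, $\ell$, and the mapping from the parameter $\theta_*$ to the transition model $P^{\theta_*}$, but does not know $\theta_*$. Instead, the learner has a prior belief $P_0$ on $\theta_*$ at time $t = 0$, before it starts to interact with the MDP. We also use $\Theta$ to denote the support of the prior belief $P_0$. Note that in this paper, we do not assume $\mathcal{X}$ or $\mathcal{A}$ to be finite; they can be infinite or even continuous. For any time $t = 1, 2, \ldots$, let $x_t \in \mathcal{X}$ be the state at time $t$ and $a_t \in \mathcal{A}$ be the action at time $t$. Our goal is to develop an algorithm (controller) that adaptively selects an action $a_t$ at every time step $t$ based on prior information and past observations to minimize the long-run Bayes average loss
$$\mathbb{E}\left[\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \ell(x_t, a_t)\right].$$
We measure performance by the Bayes regret
$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \big(\ell(x_t, a_t) - J(\theta_*)\big)\right], \qquad (1)$$
where $J(\theta_*)$ is the average loss of running the optimal policy under the true model $\theta_*$. Note that under the mild ‘weakly communicating’ assumption, $J(\theta_*)$ is independent of the initial state.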
The Bayes regret analysis of PSRL relies on the key observation that at each stopping time $\tau$ the true MDP model $\theta_*$ and the sampled model $\tilde{\theta}_\tau$ are identically distributed (Ouyang et al., 2017b). This fact allows us to relate quantities that depend on the true, but unknown, MDP $\theta_*$ to those of the sampled MDP $\tilde{\theta}_\tau$, which is fully observed by the agent. This is formalized by the following Lemma 1.
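Lemma 1 (Posterior Sampling (Ouyang et al., 2017b)). Let $(\mathcal{F}_s)_{s=1}^{\infty}$ be a filtration ($\mathcal{F}_s$ can be thought of as the historic information until current time $s$) and let $\tau$ be an almost surely finite $\mathcal{F}_s$-stopping time. Then, for any measurable function $g$,
$$\mathbb{E}[g(\theta_*) \mid \mathcal{F}_\tau] = \mathbb{E}[g(\tilde{\theta}_\tau) \mid \mathcal{F}_\tau].$$
Additionally, the above implies that $\mathbb{E}[g(\theta_*)] = \mathbb{E}[g(\tilde{\theta}_\tau)]$ through the tower property.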
3 The Proposed Algorithm: Deterministic Schedule PSRL
In this section, we propose a PSRL algorithm with a deterministic policy update schedule, shown in Figure 1. The algorithm changes the policy in an exponentially rare fashion; if the length of the current episode is $L$, the next episode would be $2L$. This switching policy ensures that the total number of switches is $O(\log T)$. We also note that, when sampling a new parameter $\tilde{\theta}_t$, the algorithm finds the optimal policy assuming that the sampled parameter is the true parameter of the system. Any planning algorithm can be used to compute this optimal policy (Sutton and Barto, 1998). In our analysis, we assume that we have access to the exact optimal policy, although it can be shown that this computation need not be exact and a near-optimal policy suffices (see (Abbasi-Yadkori and Szepesvári, 2015)).
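The deterministic schedule can be sketched as follows. This is a minimal illustration, assuming that episode lengths double (so switches happen at $t = 1, 2, 4, 8, \ldots$) and that the posterior is updated after every transition; the four callbacks `sample_parameter`, `plan`, `act_and_observe`, and `update_posterior` are hypothetical placeholders for problem-specific code, not part of any library.

```python
def ds_psrl(horizon, sample_parameter, plan, act_and_observe, update_posterior):
    """Run a doubling-schedule posterior sampling loop.

    Returns the list of time steps at which a new model was sampled.
    All four callbacks are hypothetical, user-supplied functions.
    """
    switch_times = []
    next_switch = 1                      # the first model is sampled at t = 1
    policy = None
    for t in range(1, horizon + 1):
        if t == next_switch:
            theta = sample_parameter()   # draw a parameter from the current posterior
            policy = plan(theta)         # optimal policy for the sampled model
            switch_times.append(t)
            next_switch *= 2             # schedule depends only on t, not on the data
        transition = act_and_observe(policy)
        update_posterior(transition)     # assumed: posterior updated every step
    return switch_times
```

With this schedule the number of switches in the first $T$ steps is $\lfloor \log_2 T \rfloor + 1$, and no visit counts need to be stored to decide when to switch.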
To measure the performance of our algorithm we use the Bayes regret $R_T$ defined in Equation 1. The slower the regret grows, the closer the performance of the learner is to that of an optimal policy. If the growth rate of $R_T$ is sublinear ($R_T = o(T)$), the average loss per time step will converge to the optimal average loss as $T$ gets large, and in this sense we can say that the algorithm is asymptotically optimal. Our main result shows that, under certain conditions, the construction of such asymptotically optimal policies can be reduced to efficiently sampling from the posterior of $\theta_*$ and solving classical (non-Bayesian) optimal control problems.
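The link between sublinear regret and asymptotic optimality can be written out in one line. The notation is an assumption made for illustration: $R_T$ is the Bayes regret, $\ell$ the instantaneous loss, and $J(\theta_*)$ the optimal average loss under the true parameter.

```latex
R_T = \mathbb{E}\Big[\sum_{t=1}^{T} \big(\ell(x_t, a_t) - J(\theta_*)\big)\Big] = o(T)
\quad\Longrightarrow\quad
\frac{R_T}{T}
  = \mathbb{E}\Big[\frac{1}{T}\sum_{t=1}^{T} \ell(x_t, a_t)\Big] - \mathbb{E}\big[J(\theta_*)\big]
  \;\longrightarrow\; 0
  \quad \text{as } T \to \infty .
```

For example, a regret of order $\sqrt{T}$ gives an average excess loss per step of order $1/\sqrt{T}$.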
First we state our assumptions. We assume that the MDP is weakly communicating. This is a standard assumption, and under it the optimal average loss satisfies the Bellman equation. Further, we assume that the dynamics are parametrized by a scalar parameter and satisfy a smoothness condition.
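Assumption A1 (Lipschitz Dynamics) There exists a constant $C$ such that for any state $x$ and action $a$ and parameters $\theta, \theta' \in \Theta$,
$$\| P(\cdot \mid x, a, \theta) - P(\cdot \mid x, a, \theta') \|_1 \leq C\, |\theta - \theta'|.$$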
We also make a concentrating posterior assumption, which states that the variance of the difference between the true parameter and the sampled parameter gets smaller as more samples are gathered.
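Assumption A2 (Concentrating Posterior) Let $N_j$ be one plus the number of steps in the first $j$ episodes. Let $\tilde{\theta}_j$ be sampled from the posterior at the current episode $j$. Then there exists a constant $C'$ such that
$$\max_j \mathbb{E}\big[ N_{j-1}\, |\theta_* - \tilde{\theta}_j|^2 \big] \leq C' \log T.$$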
Assumption A2 simply says that the variance of the posterior decreases given more data. In other words, we assume that the problem is learnable and not a degenerate case. Assumption A2 was shown to hold for two general categories of problems, finite MDPs and linearly parametrized problems with Gaussian noise (Abbasi-Yadkori and Szepesvári, 2015). In addition, in Section 6 we show that this assumption is satisfied for a large class of practical problems, such as smoothly parametrized sequential recommendation systems.
Now we are ready to state the main theorem. We show a sketch of the analysis in the next section. More details are in the appendix.
Notice that the regret bound in Theorem 1 does not directly depend on $S$ or $A$, the sizes of the state and action spaces. Moreover, notice that the regret bound is smaller if the Lipschitz constant $C$ is smaller or the posterior concentrates faster (i.e., $C'$ is smaller).
4 Sketch of Analysis
To analyze the algorithm shown in Figure 1, first we decompose the regret into a number of terms, which are then bounded one by one. Let , i.e. an imaginary next state sample assuming we take action in state when parameter is . Also let and . By the average cost Bellman optimality equation (Bertsekas, 1995), for a system parametrized by , we can write
Here is the differential value function for a system with parameter . We assume there exists such that for any . Because the algorithm takes the optimal action with respect to parameter and is the action at time , the right-hand side of the above equation is minimized and thus
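For reference, the average-cost Bellman optimality equation has the following standard form. The notation here is an assumption made for illustration and may differ from the paper's own: $J(\theta)$ is the optimal average loss, $h(\cdot\,; \theta)$ the differential value function, $\ell$ the instantaneous loss, and $P(\cdot \mid x, a, \theta)$ the transition model.

```latex
J(\theta) + h(x; \theta)
  = \min_{a \in \mathcal{A}}
    \Big\{ \ell(x, a) + \int h(y; \theta)\, P(dy \mid x, a, \theta) \Big\},
  \qquad x \in \mathcal{X}.
```

When the action played at time $t$ attains the minimum for the sampled parameter, the equation holds with equality at that action, which is what allows the per-step loss to be expressed through $J$ and differences of $h$.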
The regret decomposes into two terms as shown in Lemma 2.
Lemma 2. We can decompose the regret as follows:
where denotes the event that the algorithm has changed its policy at time t.
The first term is related to the sequential changes in the differential value functions, . We control this term by keeping the number of switches small; as long as the same parameter is used. Notice that under DS-PSRL, always holds. Thus, the first term can be bounded by .
The second term
is related to how fast the posterior concentrates around the true parameter vector. To simplify the exposition, we define
Recall that while , thus, from the tower rule, we have
Lemma 3. Under Assumption A1, let $K_T$ be the number of schedules up to time $T$. Then we can show:
where $M_j$ is the number of steps in the $j$th episode.
Lemma 4. Given Assumption A2 we can show:
Combining the above results, we have
This concludes the proof.
5 Experiments
In this section we compare, through simulations, the performance of the DS-PSRL algorithm with the latest PSRL algorithm, Thompson Sampling with dynamic episodes (TSDE) (Ouyang et al., 2017b). We experimented with the RiverSwim environment (Strehl and Littman, 2008), which was the domain used in Ouyang et al. (2017b) to show that TSDE outperforms all known existing algorithms. The RiverSwim example models an agent swimming in a river who can choose to swim either left or right. The MDP consists of states arranged in a chain, with the agent starting in the leftmost state. If the agent decides to move left, i.e., with the river current, it always succeeds, but if it decides to move right it might ‘fail’ with some probability. The reward is small when the agent moves left in the leftmost state, large when it moves right in the rightmost state, and zero otherwise.
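A simplified chain environment of this kind can be sketched as below. This is an illustration under assumptions, not the exact RiverSwim specification: a failed move right is assumed to leave the agent in place, and the reward values `small_reward` and `large_reward` are hypothetical defaults.

```python
import random

LEFT, RIGHT = 0, 1

def chain_step(state, action, num_states, fail_prob, rng):
    """One transition of a simplified RiverSwim-style chain (states 0..num_states-1)."""
    if action == LEFT:
        return max(state - 1, 0)              # moving with the current always succeeds
    if rng.random() < fail_prob:
        return state                          # assumed: a failed move right stays put
    return min(state + 1, num_states - 1)

def chain_reward(state, action, num_states, small_reward=0.005, large_reward=1.0):
    """Small reward at the left end, large reward at the right end, zero elsewhere."""
    if state == 0 and action == LEFT:
        return small_reward
    if state == num_states - 1 and action == RIGHT:
        return large_reward
    return 0.0
```

Passing a seeded `random.Random` instance as `rng` keeps simulated runs reproducible.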
5.1 Scalar Parametrization
For scalar parametrization, a scalar value defines the transition dynamics of the whole MDP. We did two types of experiments. In the first experiment, the transition dynamics (or fail probability) were the same for all states for a given scalar value. In the second experiment, we allowed a single scalar value to define different fail probabilities for different states. We assumed two probabilities of failure, a high probability $P_1$ and a low probability $P_2$, and two scalar values $\{\theta_1, \theta_2\}$. We compared the TSDE and DS-PSRL algorithms with an algorithm that switches every time step, which we call t-mod-1. We fixed one of the two scalar values as the true model of the world and assumed that the agent starts in the left-most state.
In the first experiment, $\theta_1$ sets $P_1$ to be the fail probability for all states and $\theta_2$ sets $P_2$ to be the fail probability for all states. For $\theta_1$ the optimal policy was to go left for the states closer to the left and right for the states closer to the right. For $\theta_2$ the optimal policy was to always go right. The results are shown in Figure 2, where all schedules quickly learn to optimize the reward.
In the second experiment, $\theta_1$ sets $P_1$ to be the fail probability for all states, while $\theta_2$ sets $P_1$ for the first few states on the left end and $P_2$ for the remaining states. The optimal policies were similar to those of the first experiment. However, the transition dynamics are the same for states closer to the left end, while the policies contradict each other: for states closer to the left end, the optimal policy is to go left under $\theta_1$ and to go right under $\theta_2$. This leads to oscillating behavior when uncertainty about the true parameter is high and policy switching is done frequently. The results are shown in Figure 2, where t-mod-1 and TSDE underperform significantly. In contrast, when the policy is switched only after multiple interactions, the agent is likely to end up in parts of the space where it becomes easy to identify the true model of the world. The second experiment is an example where multi-step exploration is necessary.
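With only two candidate scalar values, the posterior update reduces to Bayes' rule over a finite set. The sketch below assumes the only evidence is whether a single attempt to move right succeeded; the function name and the dictionary layout are hypothetical.

```python
def update_posterior(posterior, fail_probs, succeeded):
    """Bayes update over a finite set of candidate parameters.

    posterior:  dict mapping parameter name -> current probability
    fail_probs: dict mapping parameter name -> fail probability that the
                candidate model assigns to the state just visited
    succeeded:  True if the attempted move right succeeded
    """
    unnormalized = {}
    for theta, prob in posterior.items():
        fail = fail_probs[theta]
        likelihood = (1.0 - fail) if succeeded else fail
        unnormalized[theta] = prob * likelihood
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under every candidate")
    return {theta: weight / total for theta, weight in unnormalized.items()}
```

When both candidates assign the same fail probability to the visited state, the likelihoods cancel and the posterior does not move. This is the situation near the left end in the second experiment, which is why an agent that keeps switching policies there learns little.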
5.2 Multiple Parameters
Even though our theoretical analysis does not account for the case with multiple parameters, we tested our algorithm empirically with multiple parameters. We assumed a Dirichlet prior for every state and action pair. The initial parameters of the priors were set to one (uniform) for the non-zero transition probabilities of the RiverSwim problem and zero otherwise. Updating the posterior in this case is equivalent to updating the parameters after every transition. We did not compare with the t-mod-1 schedule, due to the computational cost of sampling and solving an MDP at every time step. Unlike the scalar case, we cannot define a small finite number of values for which we can pre-compute the MDP policies. The ground truth model used was from the second scalar experiment. Our results are shown in Figure 3. DS-PSRL performs better than TSDE as we increase the number of parameters.
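The Dirichlet bookkeeping described above can be sketched with the standard library alone. This is a minimal illustration, assuming one parameter vector per state-action pair stored as a list; the function names are hypothetical. Sampling uses the usual construction of a Dirichlet draw from normalized Gamma draws.

```python
import random

def update_counts(alpha, next_state):
    """Posterior update: add one to the parameter of the observed next state."""
    updated = list(alpha)
    updated[next_state] += 1.0
    return updated

def sample_transition_row(alpha, rng):
    """Draw one transition distribution from Dirichlet(alpha).

    Entries with alpha == 0 encode impossible transitions and stay at zero.
    At least one entry of alpha must be positive.
    """
    draws = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]
```

Sampling a full model means calling `sample_transition_row` once per state-action pair, so the number of parameters grows with the size of the table.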
5.3 Continuous Domains
In a final experiment we tested the DS-PSRL algorithm in continuous state and action domains. Specifically, we implemented the discrete infinite horizon linear quadratic (LQ) problem in Abbasi-Yadkori and Szepesvári (2015, 2011):
$$x_{t+1} = A_* x_t + B_* u_t + w_{t+1}, \qquad c_t = x_t^\top Q x_t + u_t^\top R u_t,$$
where $u_t$ is the control at time $t$, $x_t$ is the state at time $t$, $c_t$ is the cost at time $t$, $w_{t+1}$ is the ‘noise’, $A_*$ and $B_*$ are unknown matrices while $Q$ and $R$ are known (positive definite) matrices. The problem is to design a controller based on past observations to minimize the average expected cost. Uncertainty is modeled as a multivariate normal distribution. In our experiment we set and .
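To make the LQ setup concrete, the sketch below simulates a scalar version of a linear system with quadratic cost under a fixed linear controller. It is an illustration only: the scalar parameters `a`, `b`, `q`, `r`, the feedback `gain`, and Gaussian noise with standard deviation `noise_std` are all assumptions, and the experiment itself uses matrices.

```python
import random

def simulate_scalar_lq(a, b, q, r, gain, x0, steps, noise_std, rng):
    """Average cost of the controller u_t = -gain * x_t on a scalar LQ system.

    Dynamics: x_{t+1} = a * x_t + b * u_t + noise,  cost: q * x_t^2 + r * u_t^2.
    """
    x = x0
    total_cost = 0.0
    for _ in range(steps):
        u = -gain * x                                   # linear state feedback
        total_cost += q * x * x + r * u * u             # quadratic stage cost
        x = a * x + b * u + rng.gauss(0.0, noise_std)   # noisy linear dynamics
    return total_cost / steps
```

A posterior sampling controller would replace the fixed `gain` with the optimal gain for the currently sampled pair of system parameters, recomputed at each policy switch.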
We compared DS-PSRL with t-mod-1 and a recent TSDE algorithm for learning-based control of unknown linear systems with Thompson sampling (Ouyang et al., 2017a). This version of TSDE uses two dynamic conditions. The first condition is the same as in the discrete case, and activates when the episode length exceeds that of the previous episode by one. The second condition activates when the determinant of the sample covariance matrix is less than half of its previous value. All algorithms quickly learn the optimal $A_*$ and $B_*$, as shown in Figure 3(a). The fact that switching every time step works well indicates that this problem does not require multi-step exploration.
6 Application to Sequential Recommendations
With ‘sequential recommendations’ we refer to the problem where a system recommends various ‘items’ to a person over time to achieve a long-term objective. One example is a recommendation system at a website that recommends various offers. Another example is a tutorial recommendation system, where the sequence of tutorials is important in advancing the user from novice to expert over time. Finally, consider a point-of-interest (POI) recommendation system, where the system recommends various locations for a person to visit in a city, or attractions in a theme park. Personalized sequential recommendations are not sufficiently discussed in the literature and are practically non-existent in industry. This is due to the increased difficulty of accurately modeling long-term user behavior and non-myopic decision making. Part of the difficulty arises from the fact that there may not be a previously deployed sequential recommendation system from which to collect data, otherwise known as the cold start problem.
Fortunately, there is an abundance of sequential data in the real world. These data are usually ‘passive’ in that they do not include past recommendations. A practical approach that learns from passive data was proposed in Theocharous et al. (2017). The idea is to first learn a model from passive data that predicts the next activity given the history of activities. This can be thought of as the ‘no-recommendation’ or passive model. To create actions for recommending the various activities, the authors perturb the passive model. Each perturbed model increases the probability of following the recommendations by a different amount. This leads to a set of models, each one with a different ‘propensity to listen’. In effect, they used the single ‘propensity to listen’ parameter to turn a passive model into a set of active models. When there are multiple models, one can use online algorithms, such as posterior sampling for reinforcement learning (PSRL), to identify the best model for a new user (Strens, 2000; Osband et al., 2013). In fact, the algorithm used in Theocharous et al. (2017) was a deterministic schedule PSRL algorithm. However, there was no theoretical analysis. The perturbation function used was the following:
where is a POI, a history of POIs, and is a normalizing factor. Here we show how this model satisfies both assumptions of our regret analysis.
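The idea of turning a passive model into an active one with a single ‘propensity to listen’ parameter can be sketched as below. The functional form is an assumption chosen for illustration: the recommended item's passive probability is raised to the power $1/\theta$ with $\theta \geq 1$, and the remaining probabilities are rescaled so that the distribution still sums to one. The function name `perturb` and the dictionary representation are hypothetical.

```python
def perturb(passive, recommended, theta):
    """Boost the recommended item's probability using a scalar theta >= 1.

    passive:     dict mapping item -> passive (no-recommendation) probability
    recommended: the item being recommended
    theta:       propensity to listen; theta == 1 leaves the model unchanged
    """
    if theta < 1.0:
        raise ValueError("theta must be at least 1")
    p_rec = passive[recommended]
    boosted = p_rec ** (1.0 / theta)         # never smaller than p_rec, since p_rec <= 1
    rest = 1.0 - p_rec
    # Rescale the other items so the probabilities still sum to one.
    scale = (1.0 - boosted) / rest if rest > 0.0 else 0.0
    return {
        item: (boosted if item == recommended else prob * scale)
        for item, prob in passive.items()
    }
```

Under this assumed form, each value of `theta` yields one complete active model, so a finite set of values gives the finite set of candidate models that posterior sampling chooses among.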
We first prove that the dynamics are Lipschitz continuous:
Lemma 5 (Lipschitz Continuity). Assume the dynamics are given by Equation 5. Then for all and all and , we have
Please refer to Appendix D for the proof of this lemma.
7 Summary and Conclusions
We proposed a practical general case PSRL algorithm, called DS-PSRL, with provable guarantees. The algorithm has regret similar to the state of the art. However, our result is more generally applicable to continuous state-action problems; when the dynamics of the system are parametrized by a scalar, our regret is independent of the number of states. In addition, our algorithm is practical. It provides for generalization and uses a deterministic policy switching schedule of logarithmic order, which is independent of the true model of the world. This leads to efficiency in sample, space and time complexities. We demonstrated empirically how the algorithm outperforms state-of-the-art PSRL algorithms. Finally, we showed how the assumptions are satisfied by a sensible parametrization for a large class of problems in sequential recommendations.
- Abbasi-Yadkori and Szepesvári  Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In COLT, 2011.
- Abbasi-Yadkori and Szepesvári  Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1–11, 2015.
- Agrawal and Jia  Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, 2017.
- Bartlett and Tewari  Peter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42, 2009.
- Bertsekas  Dimitri P Bertsekas. Dynamic programming and optimal control, volume 2. Athena Scientific Belmont, MA, 1995.
- Brafman and Tennenholtz  Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Gopalan and Mannor  Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In COLT, pages 861–898, 2015.
- Jaksch et al.  Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Osband and Van Roy  Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In NIPS, pages 1466–1474, 2014.
- Osband and Van Roy  Ian Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731, 2016.
- Osband et al.  Ian Osband, Dan Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011, 2013.
- Ouyang et al. [2017a] Yi Ouyang, Mukul Gagrani, and Rahul Jain. Learning-based control of unknown linear systems with thompson sampling. arXiv preprint arXiv:1709.04047, 2017.
- Ouyang et al. [2017b] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A thompson sampling approach. In NIPS, 2017.
- Strehl and Littman  Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008. Learning Theory 2005.
- Strens  Malcolm Strens. A Bayesian framework for reinforcement learning. In ICML, pages 943–950, 2000.
- Sutton and Barto  Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135. MIT Press Cambridge, 1998.
- Theocharous et al.  Georgios Theocharous, Nikos Vlassis, and Zheng Wen. An interactive points of interest guidance system. In IUI Companion, pages 49–52. ACM, 2017.
- Thompson  William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
Appendix A Proof of Lemma 2
Proof. For deterministic schedule,
Thus we can write
Thus, we can bound the regret using
where the second inequality follows because and . Let denote the event that the algorithm has changed its policy at time t. We can write
Appendix B Proof of Lemma 3
Proof. By Cauchy-Schwarz inequality and Lipschitz dynamics assumption,
Recall that . Let be the length of episode . Because we have episodes, we can write
where is the number of steps in the th episode. Thus
Appendix C Proof of Lemma 4
Proof. Let . Let $N_j$ be one plus the number of steps in the first $j$ episodes. So $N_j = 2^j$ and $N_0 = 1$. We write
where (a) follows from the fact that for all , (b) follows from
and , and (c) follows from Assumption A2.
Appendix D Proof of Lemma 5
Proof. To simplify the expositions, we use to denote in this proof. Notice that . Based on the definition of , we have
We also define . Based on calculus, we have
The first equation implies that is strictly increasing in , and the second equation implies that for all , is maximized by setting . This implies that for all , we have
Hence, for all , we have . Consequently, as a function of is globally -Lipschitz continuous for . So we have
Appendix E Posterior Concentration for POI Recommendation
Recall that the parameter space is a finite set, and is the true parameter. Notice that if is close to or , then the DS-PSRL will not learn much about at time , since in such cases ’s are roughly the same for all . Hence, to derive the concentration result, we make the following simplifying assumption:
for some . Moreover, we assume that all the elements in are distinct, and define
as the minimum gap between and another . To simplify the exposition, we also define
Then we have the following lemma about the concentrating posterior of this problem:
Lemma 6 (Concentration). Assume that is sampled from at time step , then under the above assumptions, for any , we have
where , , and are constants defined above. Note that they only depend on and
Notice that Lemma 6 implies that
for any . This directly implies that . Q.E.D.
E.1 Proof of Lemma 6
Proof. We use to denote the prior over , and use to denote the posterior distribution over at the end of time . Note that by Bayes rule, we have
We also define the posterior log-likelihood of at time as
for all and all . Notice that always holds, and by definition. We also define to simplify the exposition. Note that by definition, we have
Define the indicator , then we have
Since is -adaptive, we have
where the last inequality follows from Pinsker’s inequality. Notice that function is a strictly convex function of , and , we have
Similarly, we have . Consequently, we have
where the last inequality follows from the fact . Since function is concave on and , we have . Define
then we have . Hence we have
Furthermore, we define