1 Introduction
Thompson sampling (Thompson, 1933), or posterior sampling for reinforcement learning (PSRL), is a conceptually simple approach to dealing with unknown MDPs (Strens, 2000; Osband et al., 2013). PSRL begins with a prior distribution over the MDP model parameters (transitions and/or rewards) and typically works in episodes. At the start of each episode, an MDP model is sampled from the posterior belief and the agent follows the policy that is optimal for that sampled MDP until the end of the episode. The posterior is updated at the end of every episode based on the observed actions, states, and rewards. A special class of MDPs under which PSRL has recently been studied extensively is MDPs with state resetting, either explicit or implicit. Specifically, in (Osband et al., 2013; Osband and Van Roy, 2014) the considered MDPs are assumed to have fixed-length episodes, and at the end of each episode the MDP's state is reset according to a fixed state distribution. In (Gopalan and Mannor, 2015), there is an assumption that the environment is ergodic and that there exists a recurrent state under any policy. Both lines of work have developed variants of PSRL algorithms under their respective assumptions, as well as state-of-the-art regret bounds, Bayesian in (Osband et al., 2013; Osband and Van Roy, 2014) and frequentist in (Gopalan and Mannor, 2015).
However, many real-world problems are of a continuing and non-resetting nature. These include sequential recommendations and other common examples found in controlled mechanical systems (e.g., control of manufacturing robots) and process optimization (e.g., controlling a queuing system), where 'resets' are rare or unnatural. Many of these real-world examples could easily be parametrized with a scalar parameter, where each value of the parameter specifies a complete model. These types of domains do not have the luxury of state resetting, and the agent needs to learn to act without necessarily revisiting states. Extensions of PSRL algorithms to general MDPs without state resetting have so far produced impractical algorithms and, in some cases, buggy theoretical analysis. This is due to the difficulty of analyzing regret under policy switching schedules that depend on various dynamic statistics produced by the true underlying model (e.g., doubling of the visitations of state and action pairs, and uncertainty reduction of the parameters). Next we summarize the literature on PSRL for this general case.
The earliest such general case was analyzed in terms of Bayes regret for a 'lazy' PSRL algorithm (Abbasi-Yadkori and Szepesvári, 2015). In this approach a new model is sampled, and a new policy is computed from it, every time the uncertainty over the underlying model is sufficiently reduced; however, the corresponding analysis was shown to contain a gap (Osband and Van Roy, 2016).
A recent general case PSRL algorithm with a Bayes regret analysis was proposed in (Ouyang et al., 2017b). At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria. A new episode starts either when the length of the current episode exceeds the previous length by one, or when the number of visits to any state-action pair is doubled. They establish an $\tilde{O}(HS\sqrt{AT})$ bound on expected regret under a Bayesian setting, where $S$ and $A$ are the sizes of the state and action spaces, $T$ is time, $H$ is the bound of the span, and the $\tilde{O}$ notation hides logarithmic factors. However, despite the state-of-the-art regret analysis, the algorithm is not well suited for large and continuous state and action spaces due to the requirement to count state and action visitations for all state-action pairs.
In another recent work (Agrawal and Jia, 2017), the authors present a general case PSRL algorithm that achieves near-optimal worst-case regret bounds when the underlying Markov decision process is communicating with a finite, though unknown, diameter. Their main result is a high-probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions, and diameter $D$, when $T \ge S^5 A$. Despite the nice form of the regret bound, this algorithm suffers from similar practicality issues as the algorithm in (Ouyang et al., 2017b). The epochs are computed based on doubling the visitations of state and action pairs, which implies tabular representations. In addition, it employs a stricter assumption than previous work, namely a fully communicating MDP with some unknown diameter. Finally, the bound only holds when $T \ge S^5 A$, which would be impractical for large-scale problems.

Both of the above recent state-of-the-art algorithms (Ouyang et al., 2017b; Agrawal and Jia, 2017) do not use generalization, in that they learn separate parameters for each state-action pair. In such a non-parametrized case, there are several other modern reinforcement learning algorithms, such as UCRL2 (Jaksch et al., 2010), REGAL (Bartlett and Tewari, 2009), and R-max (Brafman and Tennenholtz, 2002), which learn MDPs using the well-known 'optimism under uncertainty' principle. In these approaches a confidence interval is maintained for each state-action pair, and observing a particular state transition and reward provides information for only that state and action. Such approaches are inefficient in cases where the whole structure of the MDP can be determined with a scalar parameter.
Despite the elegant regret bounds for the general case PSRL algorithms developed in (Ouyang et al., 2017b; Agrawal and Jia, 2017), both of them focus on tabular reinforcement learning and hence are sample inefficient for many practical problems with exponentially large or even continuous state/action spaces. On the other hand, in many practical RL problems the MDPs are parametrized, in the sense that system dynamics and reward/loss functions are assumed to lie in a known parametrized low-dimensional manifold (Gopalan and Mannor, 2015). Such model parametrization (i.e., model generalization) allows researchers to develop sample-efficient algorithms for large-scale RL problems. Our paper belongs to this line of research. Specifically, we propose a novel general case PSRL algorithm, referred to as DS-PSRL, that exploits model parametrization (generalization). We prove an $\tilde{O}(\sqrt{T})$ Bayes regret bound for DS-PSRL, assuming we can model every MDP with a single smooth parameter.

DS-PSRL also has lower computational and space complexities than the algorithms proposed in (Ouyang et al., 2017b; Agrawal and Jia, 2017). In the case of (Ouyang et al., 2017b) the number of policy switches in the first $T$ steps grows with $\sqrt{T}$; in contrast, DS-PSRL adopts a deterministic schedule and its number of policy switches is $O(\log T)$. Since the major computational burden of PSRL algorithms is to solve a sampled MDP at each policy switch, DS-PSRL is computationally more efficient than the algorithm proposed in (Ouyang et al., 2017b). As for space complexity, both algorithms proposed in (Ouyang et al., 2017b; Agrawal and Jia, 2017) need to store counts of state and action visitations. In contrast, DS-PSRL uses a model-independent schedule and as a result does not need to store such statistics.
In the rest of the paper we describe the DS-PSRL algorithm and derive a state-of-the-art Bayes regret analysis. We demonstrate and compare our algorithm with the state of the art on standard problems from the literature. Finally, we show how the assumptions of our algorithm are satisfied by a sensible parametrization for a large class of problems in sequential recommendations.
2 Problem Formulation
We consider the reinforcement learning problem in a parametrized Markov decision process (MDP) $(\mathcal{X}, \mathcal{A}, \ell, P_{\theta^*})$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $\ell$ is the instantaneous loss function, and $P_{\theta^*}$ is an MDP transition model parametrized by $\theta^*$. We assume that the learner knows $\mathcal{X}$, $\mathcal{A}$, $\ell$, and the mapping from the parameter $\theta$ to the transition model $P_\theta$, but does not know $\theta^*$. Instead, the learner has a prior belief $P_0$ on $\theta^*$ at time $t = 0$, before it starts to interact with the MDP. We also use $\Theta$ to denote the support of the prior belief $P_0$. Note that in this paper, we do not assume $\mathcal{X}$ or $\mathcal{A}$ to be finite; they can be infinite or even continuous. For any time $t = 1, 2, \ldots$, let $x_t$ be the state at time $t$ and $a_t$ be the action at time $t$. Our goal is to develop an algorithm (controller) that adaptively selects an action $a_t$ at every time step $t$ based on prior information and past observations to minimize the long-run Bayes average loss
$$\limsup_{n \to \infty} \mathbb{E}\left[\frac{1}{n} \sum_{t=1}^{n} \ell(x_t, a_t)\right].$$
Similarly to the existing literature (Osband et al., 2013; Ouyang et al., 2017b), we measure the performance of such an algorithm using its Bayes regret:
$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \left(\ell(x_t, a_t) - J_{\theta^*}\right)\right], \qquad (1)$$
where $J_{\theta^*}$ is the average loss of running the optimal policy under the true model $\theta^*$. Note that under the mild 'weakly communicating' assumption, $J_{\theta^*}$ is independent of the initial state.
The Bayes regret analysis of PSRL relies on the key observation that at each stopping time the true MDP model $\theta^*$ and the sampled model $\tilde\theta$ are identically distributed (Ouyang et al., 2017b). This fact allows us to relate quantities that depend on the true, but unknown, MDP $\theta^*$ to those of the sampled MDP $\tilde\theta$ that is fully observed by the agent. This is formalized by the following Lemma 1.
Lemma 1
(Posterior Sampling (Ouyang et al., 2017b)). Let $(\mathcal{F}_t)_{t \ge 0}$ be a filtration ($\mathcal{F}_t$ can be thought of as the historic information until current time $t$) and let $\tau$ be an almost surely finite $(\mathcal{F}_t)$-stopping time. Then, for any measurable function $g$,
$$\mathbb{E}\left[g(\theta^*) \mid \mathcal{F}_\tau\right] = \mathbb{E}\left[g(\tilde\theta_\tau) \mid \mathcal{F}_\tau\right], \qquad (2)$$
where $\tilde\theta_\tau$ is the parameter sampled from the posterior at time $\tau$. Additionally, the above implies that $\mathbb{E}[g(\theta^*)] = \mathbb{E}[g(\tilde\theta_\tau)]$ through the tower property.
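Lemma 1 can be checked numerically in a simple conjugate model. The following sketch is our own Beta-Bernoulli example (not from the paper): it draws $\theta^*$ from a Beta(2, 2) prior, observes Bernoulli data, samples $\tilde\theta$ from the resulting posterior, and compares $\mathbb{E}[g(\theta^*)]$ with $\mathbb{E}[g(\tilde\theta)]$ for the test function $g(\theta) = \theta^2$:

```python
import random

def posterior_sample_check(trials=20000, n_obs=10, seed=0):
    """Monte Carlo check of the posterior sampling identity:
    marginally, theta* (drawn from the prior) and theta~ (drawn from
    the posterior after observing data generated by theta*) are
    identically distributed, so E[g(theta*)] = E[g(theta~)].
    Uses a Beta(2, 2) prior with Bernoulli observations (conjugate)."""
    rng = random.Random(seed)
    g = lambda th: th * th          # any measurable test function
    sum_true, sum_sampled = 0.0, 0.0
    for _ in range(trials):
        theta_star = rng.betavariate(2, 2)             # draw from prior
        heads = sum(rng.random() < theta_star for _ in range(n_obs))
        # conjugacy: posterior is Beta(2 + heads, 2 + n_obs - heads)
        theta_tilde = rng.betavariate(2 + heads, 2 + n_obs - heads)
        sum_true += g(theta_star)
        sum_sampled += g(theta_tilde)
    return sum_true / trials, sum_sampled / trials
```

For Beta(2, 2), $\mathbb{E}[\theta^2] = 0.3$, and both Monte Carlo averages should agree with it up to sampling noise.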
3 The Proposed Algorithm: Deterministic Schedule PSRL
In this section, we propose a PSRL algorithm with a deterministic policy update schedule, shown in Figure 1. The algorithm changes the policy in an exponentially rare fashion: if the length of the current episode is $L$, the length of the next episode is $2L$. This switching schedule ensures that the total number of switches up to time $T$ is $O(\log T)$. We also note that, when sampling a new parameter, the algorithm finds the optimal policy assuming that the sampled parameter is the true parameter of the system. Any planning algorithm can be used to compute this optimal policy (Sutton and Barto, 1998). In our analysis, we assume that we have access to the exact optimal policy, although it can be shown that this computation need not be exact and a near-optimal policy suffices (see (Abbasi-Yadkori and Szepesvári, 2015)).
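The deterministic schedule can be sketched as follows. This is a minimal sketch, not the paper's pseudocode: `sample_posterior`, `solve_mdp`, and `step` are hypothetical caller-supplied callbacks standing in for the posterior sampler, the planner, and the environment interaction.

```python
import math

def dspsrl_schedule(T):
    """Deterministic doubling schedule: episode lengths 1, 2, 4, ...
    Returns the list of episode lengths that covers T steps."""
    lengths, total, L = [], 0, 1
    while total < T:
        lengths.append(L)
        total += L
        L *= 2  # next episode is twice as long
    return lengths

def run_dspsrl(T, sample_posterior, solve_mdp, step):
    """Skeleton of the DS-PSRL loop: one posterior sample and one
    planning call per episode, regardless of what the data look like."""
    t = 0
    for L in dspsrl_schedule(T):
        theta = sample_posterior()     # policy switch: once per episode
        policy = solve_mdp(theta)
        for _ in range(L):
            if t >= T:
                return
            step(policy, t)            # act and update the posterior data
            t += 1
```

With episode lengths $1, 2, 4, \ldots$, covering $T$ steps takes at most about $\log_2 T + 1$ episodes, which is where the $O(\log T)$ switching bound comes from.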
To measure the performance of our algorithm we use the Bayes regret defined in Equation 1. The slower the regret grows, the closer the performance of the learner is to that of an optimal policy. If the growth rate of $R_T$ is sublinear ($R_T = o(T)$), the average loss per time step will converge to the optimal average loss as $T$ gets large, and in this sense we can say that the algorithm is asymptotically optimal. Our main result shows that, under certain conditions, the construction of such asymptotically optimal policies can be reduced to efficiently sampling from the posterior of $\theta^*$ and solving classical (non-Bayesian) optimal control problems.
First we state our assumptions. We assume that the MDP is weakly communicating. This is a standard assumption, and under it the optimal average loss satisfies the Bellman equation. Further, we assume that the dynamics are parametrized by a scalar parameter and satisfy a smoothness condition.

Assumption A1 (Lipschitz Dynamics) There exists a constant $L$ such that for any state $x$ and action $a$ and parameters $\theta_1, \theta_2 \in \Theta$,
$$\left\| P_{\theta_1}(\cdot \mid x, a) - P_{\theta_2}(\cdot \mid x, a) \right\|_1 \le L \left| \theta_1 - \theta_2 \right|.$$
We also make a concentrating posterior assumption, which states that the variance of the difference between the true parameter and the sampled parameter gets smaller as more samples are gathered.

Assumption A2 (Concentrating Posterior) Let $N_k$ be one plus the number of steps in the first $k$ episodes, and let $\tilde\theta_k$ be the parameter sampled from the posterior at the start of episode $k$. Then there exists a constant $C'$ such that
$$\mathbb{E}\left[\max_{k} N_{k-1} \left(\theta^* - \tilde\theta_k\right)^2\right] \le C' \log T.$$
Assumption A2 simply says that the variance of the posterior decreases given more data. In other words, we assume that the problem is learnable and not a degenerate case. A2 was actually shown to hold for two general categories of problems, finite MDPs and linearly parametrized problems with Gaussian noise (Abbasi-Yadkori and Szepesvári, 2015). In addition, in this paper we prove how this assumption is satisfied for a large class of practical problems, such as smoothly parametrized sequential recommendation systems, in Section 6.
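The flavor of Assumption A2 can be illustrated in a conjugate Beta-Bernoulli model (our own example, not the paper's setting): the posterior variance decays like $1/N$, so $N \cdot \mathbb{E}[(\theta^* - \tilde\theta)^2]$ remains bounded as data accumulate.

```python
import random

def posterior_concentration(seed=1, horizon=4096):
    """Illustrates the concentrating-posterior property in a conjugate
    Beta-Bernoulli model: as the number of observations N grows, the
    posterior variance decays like 1/N, so N * E[(theta* - theta~)^2]
    stays bounded. Returns the scaled errors at N = 1, 2, 4, ..."""
    rng = random.Random(seed)
    theta_star = 0.7                       # fixed true parameter
    a, b = 1.0, 1.0                        # Beta(1, 1) prior
    scaled_errors = []
    n = 0
    checkpoint = 1
    while n < horizon:
        x = 1 if rng.random() < theta_star else 0
        a, b = a + x, b + (1 - x)          # conjugate posterior update
        n += 1
        if n == checkpoint:                # record at N = 1, 2, 4, 8, ...
            # average squared error of posterior samples, scaled by N
            sq = sum((theta_star - rng.betavariate(a, b)) ** 2
                     for _ in range(2000)) / 2000
            scaled_errors.append(n * sq)
            checkpoint *= 2
    return scaled_errors
```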
Now we are ready to state the main theorem. We show a sketch of the analysis in the next section. More details are in the appendix.
Theorem 1
Under Assumptions A1 and A2, the Bayes regret of DS-PSRL is $\tilde{O}(\sqrt{T})$, where the constant hidden by the $\tilde{O}$ notation depends on the Lipschitz constant $L$, the concentration constant $C'$, and the span bound $H$.
Notice that the regret bound in Theorem 1 does not directly depend on the sizes of the state and action spaces. Moreover, notice that the regret bound is smaller if the Lipschitz constant $L$ is smaller or the posterior concentrates faster (i.e., $C'$ is smaller).
4 Sketch of Analysis
To analyze the algorithm shown in Figure 1, we first decompose the regret into a number of terms, which are then bounded one by one. Let $x'_{t+1} \sim P_{\tilde\theta_t}(\cdot \mid x_t, a_t)$, i.e., an imaginary next-state sample assuming we take action $a_t$ in state $x_t$ when the parameter is $\tilde\theta_t$. By the average cost Bellman optimality equation (Bertsekas, 1995), for a system parametrized by $\tilde\theta_t$, we can write
$$J_{\tilde\theta_t} + h_{\tilde\theta_t}(x_t) = \min_a \left\{ \ell(x_t, a) + \mathbb{E}_{x' \sim P_{\tilde\theta_t}(\cdot \mid x_t, a)}\left[ h_{\tilde\theta_t}(x') \right] \right\}. \qquad (3)$$
Here $h_{\tilde\theta_t}$ is the differential value function for a system with parameter $\tilde\theta_t$. We assume there exists $H > 0$ such that $0 \le h_\theta(x) \le H$ for any $x$ and $\theta$. Because the algorithm takes the optimal action with respect to the parameter $\tilde\theta_t$ and $a_t$ is the action at time $t$, the right-hand side of the above equation is minimized by $a_t$, and thus
$$J_{\tilde\theta_t} + h_{\tilde\theta_t}(x_t) = \ell(x_t, a_t) + \mathbb{E}\left[ h_{\tilde\theta_t}(x'_{t+1}) \mid \mathcal{F}_t \right]. \qquad (4)$$
The regret decomposes into two terms as shown in Lemma 2.
Lemma 2
We can decompose the regret as follows:
$$R_T \le H\, \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{A_t\}\right] + H\, \mathbb{E}\left[\sum_{t=1}^{T} \left\| P_{\theta^*}(\cdot \mid x_t, a_t) - P_{\tilde\theta_t}(\cdot \mid x_t, a_t) \right\|_1\right],$$
where $A_t$ denotes the event that the algorithm has changed its policy at time $t$.
The first term is related to the sequential changes in the differential value functions, $h_{\tilde\theta_t}$. We control this term by keeping the number of switches small; $h_{\tilde\theta_{t+1}} = h_{\tilde\theta_t}$ as long as the same parameter is used. Notice that under DS-PSRL, $\sum_{t=1}^{T} \mathbb{1}\{A_t\} \le \log_2 T + 1$ always holds. Thus, the first term can be bounded by $O(H \log T)$.
The second term is related to how fast the posterior concentrates around the true parameter vector. To simplify the exposition, we define
$$\Delta_t = \left\| P_{\theta^*}(\cdot \mid x_t, a_t) - P_{\tilde\theta_t}(\cdot \mid x_t, a_t) \right\|_1.$$
Recall that $\tilde\theta_t$ is measurable with respect to $\mathcal{F}_t$ while $\theta^*$ is not; thus, from the tower rule, we have the following.
Lemma 3
Under Assumption A2, letting $K_T$ be the number of episodes (policy switches) up to time $T$, we can show:
$$\mathbb{E}\left[\sum_{t=1}^{T} \Delta_t\right] \le L \sqrt{C' \log T}\; \mathbb{E}\left[\sum_{k=1}^{K_T} \frac{T_k}{\sqrt{N_{k-1}}}\right],$$
where $T_k$ is the number of steps in the $k$th episode.
Lemma 4
Given the doubling episode schedule, we can show:
$$\mathbb{E}\left[\sum_{k=1}^{K_T} \frac{T_k}{\sqrt{N_{k-1}}}\right] = O\left(\sqrt{T}\right).$$
Thus, the second term can be bounded by $O\left(HL\sqrt{C'\, T \log T}\right)$.
Combining the above results, we have
$$R_T = O\left(H \log T + H L \sqrt{C'\, T \log T}\right) = \tilde{O}\left(\sqrt{T}\right).$$
This concludes the proof.
5 Experiments
In this section we compare, through simulations, the performance of the DS-PSRL algorithm with the latest PSRL algorithm, called Thompson Sampling with Dynamic Episodes (TSDE) (Ouyang et al., 2017b). We experimented with the RiverSwim environment (Strehl and Littman, 2008), which was the domain used in Ouyang et al. (2017b) to show how TSDE outperforms all known existing algorithms. The RiverSwim example models an agent swimming in a river who can choose to swim either left or right. The MDP consists of $N$ states arranged in a chain, with the agent starting in the leftmost state. If the agent decides to move left, i.e., with the river current, then he is always successful, but if he decides to move right he might 'fail' with some probability. The reward function is given by: $r(x, a) = 5/1000$ if $x$ is the leftmost state and $a$ is left; $r(x, a) = 1$ if $x$ is the rightmost state and $a$ is right; and $r(x, a) = 0$ otherwise.
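A minimal construction of such a fail-probability-parametrized chain might look as follows. The "stay in place on a failed right move" dynamics are a simplifying assumption for illustration; published RiverSwim variants also allow slipping left on a failed right move.

```python
def riverswim(n_states=6, p_fail=0.3):
    """Transition and reward tables for a RiverSwim-style chain MDP,
    parametrized by a single fail probability p_fail.
    Action 0 (left, with the current) always succeeds; action 1 (right,
    against the current) fails with probability p_fail, in which case
    the agent stays put (simplified fail model).
    Returns P[s][a] as a dict {next_state: prob} and R[s][a]."""
    P = [[dict(), dict()] for _ in range(n_states)]
    R = [[0.0, 0.0] for _ in range(n_states)]
    for s in range(n_states):
        P[s][0] = {max(s - 1, 0): 1.0}                 # left always works
        right = min(s + 1, n_states - 1)
        P[s][1] = {right: 1.0 - p_fail}                # right may fail
        P[s][1][s] = P[s][1].get(s, 0.0) + p_fail
        R[s][0] = 5.0 / 1000.0 if s == 0 else 0.0      # small reward, left end
        R[s][1] = 1.0 if s == n_states - 1 else 0.0    # large reward, right end
    return P, R
```

Each value of `p_fail` specifies a complete MDP, so a scalar parameter (or a scalar index into a set of fail-probability profiles) determines the whole model, matching the scalar parametrization used in the experiments.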
5.1 Scalar Parametrization
For scalar parametrization, a single scalar value defines the transition dynamics of the whole MDP. We did two types of experiments. In the first experiment the transition dynamics (or fail probabilities) were the same for all states for a given scalar value. In the second experiment we allowed a single scalar value to define different fail probabilities for different states. We assumed two probabilities of failure, a high probability and a low probability, and two possible scalar values for the parameter. We compared TSDE and DS-PSRL with an algorithm that switches every time step, which we call tmod1. We assumed one of the two scalar values to be the true model of the world, and that the agent starts in the leftmost state.
In the first experiment, one parameter value sets the high fail probability for all states and the other sets the low fail probability for all states. Under the high fail probability, the optimal policy was to go left in the states closer to the left end and right in the states closer to the right end. Under the low fail probability, the optimal policy was to always go right. The results are shown in Figure 2, where all schedules quickly learn to optimize the reward.
In the second experiment, one parameter value sets the same fail probability for all states, while the other sets that fail probability only for the first few states on the left end and a different one for the remaining states. The optimal policies were similar to the first experiment. However, while the transition dynamics are the same for the states closer to the left end, the policies there are contradicting: for the states closer to the left end, the optimal policy is to go left under one parameter and to go right under the other. This leads to oscillating behavior when uncertainty about the true parameter is high and policy switching is done frequently. The results are shown in Figure 2, where tmod1 and TSDE underperform significantly. Nonetheless, when the policy is switched only after multiple interactions, the agent is likely to end up in parts of the state space where it becomes easy to identify the true model of the world. The second experiment is an example where multi-step exploration is necessary.
5.2 Multiple Parameters
Even though our theoretical analysis does not account for the case of multiple parameters, we tested our algorithm empirically with multiple parameters. We assumed a Dirichlet prior for every state-action pair. The initial parameters of the priors were set to one (uniform) for the non-zero transition probabilities of the RiverSwim problem and zero otherwise. Updating the posterior in this case is equivalent to incrementing the corresponding parameters after every transition. We did not compare with the tmod1 schedule, due to the computational cost of sampling and solving an MDP at every time step. Unlike the scalar case, we cannot define a small finite number of parameter values for which we can precompute the MDP policies. The ground-truth model used was the one from the second scalar experiment. Our results are shown in Figure 3. DS-PSRL performs better than TSDE as we increase the number of parameters.
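Posterior updating with a Dirichlet prior reduces to incrementing a count and then sampling via normalized Gamma draws. The following sketch handles a single transition row, assuming strictly positive Dirichlet parameters (entries fixed at zero, as for impossible RiverSwim transitions, are simply never sampled):

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma draws.
    All entries of alpha must be strictly positive."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def update_and_sample(alpha, next_state, rng):
    """Conjugate update for one transition row: observing a transition
    to `next_state` increments the corresponding Dirichlet parameter,
    then a fresh transition distribution is sampled from the posterior."""
    alpha = list(alpha)
    alpha[next_state] += 1.0
    return alpha, sample_dirichlet(alpha, rng)
```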
5.3 Continuous Domains
In a final experiment we tested the ability of the DS-PSRL algorithm in continuous state and action domains. Specifically, we implemented the discrete-time infinite-horizon linear quadratic (LQ) problem of Abbasi-Yadkori and Szepesvári (2015, 2011):
$$x_{t+1} = A x_t + B u_t + w_{t+1}, \qquad c_t = x_t^\top Q x_t + u_t^\top R u_t,$$
where $u_t$ is the control at time $t$, $x_t$ is the state at time $t$, $c_t$ is the cost at time $t$, $w_{t+1}$ is the 'noise', $A$ and $B$ are unknown matrices, while $Q$ and $R$ are known (positive definite) matrices. The problem is to design a controller based on past observations to minimize the average expected cost. Uncertainty over the unknown dynamics is modeled as a multivariate normal distribution. In our experiment we set $Q$ and $R$ to fixed known matrices. We compared DS-PSRL with tmod1 and a recent TSDE algorithm for learning-based control of unknown linear systems with Thompson sampling (Ouyang et al., 2017a). This version of TSDE uses two dynamic conditions: the first condition is the same as in the discrete case, activating when the episode length increases by one from the previous episode; the second condition activates when the determinant of the sample covariance matrix drops below half of its previous value. All algorithms quickly learn the optimal controller, as shown in Figure 3(a). The fact that switching every time step works well indicates that this problem does not require multi-step exploration.
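For intuition, once the dynamics are known the LQ problem reduces to a Riccati fixed point. A scalar sketch (our own illustration; the experiments above use the matrix case):

```python
def lqr_scalar(a, b, q, r, iters=1000):
    """Average-cost LQ control for the scalar system
    x_{t+1} = a*x_t + b*u_t + w_{t+1}, with cost q*x^2 + r*u^2.
    Solves the Riccati fixed point
        p = q + a^2*p - (a*b*p)^2 / (r + b^2*p)
    by iteration and returns (p, k) with optimal control u = -k*x."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)   # optimal feedback gain
    return p, k
```

For $a = b = q = r = 1$ the fixed point is the golden ratio $p = (1 + \sqrt{5})/2$, with gain $k = p/(1+p) \approx 0.618$, giving a stable closed loop $|a - bk| < 1$.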
6 Application to Sequential Recommendations
By ‘sequential recommendations’ we refer to the problem where a system recommends various ‘items’ to a person over time to achieve a long-term objective. One example is a recommendation system at a website that recommends various offers. Another example is a tutorial recommendation system, where the sequence of tutorials is important in advancing the user from novice to expert over time. Finally, consider a points of interest (POI) recommendation system, where the system recommends various locations for a person to visit in a city, or attractions in a theme park. Personalized sequential recommendations are not sufficiently discussed in the literature and are practically nonexistent in the industry. This is due to the increased difficulty of accurately modeling long-term user behaviors and of non-myopic decision making. Part of the difficulty arises from the fact that there may not be a previous sequential recommendation system deployed for data collection, otherwise known as the cold start problem.
Fortunately, there is an abundance of sequential data in the real world. These data are usually ‘passive’ in that they do not include past recommendations. A practical approach that learns from passive data was proposed in Theocharous et al. (2017). The idea is to first learn a model from passive data that predicts the next activity given the history of activities. This can be thought of as the ‘no-recommendation’ or passive model. To create actions for recommending the various activities, the authors perturb the passive model. Each perturbed model increases the probability of following the recommendations by a different amount. This leads to a set of models, each one with a different ‘propensity to listen’. In effect, they used the single ‘propensity to listen’ parameter to turn a passive model into a set of active models. When there are multiple models, one can use online algorithms, such as posterior sampling for reinforcement learning (PSRL), to identify the best model for a new user (Strens, 2000; Osband et al., 2013). In fact, the algorithm used in Theocharous et al. (2017) was a deterministic schedule PSRL algorithm. However, there was no theoretical analysis. The perturbation function used was the following:
(5) 
where $y$ is a POI, $h$ is a history of POIs, and $Z$ is a normalizing factor. Here we show how this model satisfies both assumptions of our regret analysis.
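As an illustration of the 'propensity to listen' idea, the following sketch perturbs a passive next-POI distribution by boosting the recommended item and renormalizing. The exponential-boost form used here is our own assumption for illustration, not necessarily the exact form of Equation 5:

```python
import math

def perturbed_model(passive_probs, recommended, theta):
    """A hypothetical 'propensity to listen' perturbation of the form
    P_theta(y | h, a) proportional to P(y | h) * exp(theta * 1{y == a}):
    the recommended item's passive probability is boosted by a factor
    e^theta and the row is renormalized. theta = 0 recovers the passive
    model; larger theta means a higher propensity to listen."""
    weights = [p * (math.exp(theta) if y == recommended else 1.0)
               for y, p in enumerate(passive_probs)]
    z = sum(weights)                 # normalizing factor
    return [w / z for w in weights]
```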
Lipschitz Dynamics
We first prove that the dynamics are Lipschitz continuous:
Lemma 5
(Lipschitz Continuity) Assume the dynamics are given by Equation 5. Then for all parameters $\theta_1, \theta_2 \in \Theta$ and all histories $h$ and actions $a$, we have
Please refer to Appendix D for the proof of this lemma.
Concentrating Posterior
7 Summary and Conclusions
We proposed a practical general case PSRL algorithm, called DS-PSRL, with provable guarantees. The algorithm has regret similar to the state of the art. However, our result applies more generally to continuous state-action problems: when the dynamics of the system are parametrized by a scalar, our regret is independent of the number of states. In addition, our algorithm is practical. It provides for generalization, and uses a deterministic policy-switching schedule of logarithmic order, which is independent of the true model of the world. This leads to efficiency in sample, space, and time complexity. We demonstrated empirically how the algorithm outperforms state-of-the-art PSRL algorithms. Finally, we showed how the assumptions are satisfied by a sensible parametrization for a large class of problems in sequential recommendations.
References
 Abbasi-Yadkori and Szepesvári [2011] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In COLT, 2011.
 Abbasi-Yadkori and Szepesvári [2015] Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1–11, 2015.
 Agrawal and Jia [2017] Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worstcase regret bounds. In NIPS, 2017.
 Bartlett and Tewari [2009] Peter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42, 2009.
 Bertsekas [1995] Dimitri P Bertsekas. Dynamic programming and optimal control, volume 2. Athena Scientific Belmont, MA, 1995.

 Brafman and Tennenholtz [2002] Ronen I Brafman and Moshe Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
 Gopalan and Mannor [2015] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In COLT, pages 861–898, 2015.
 Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Nearoptimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
 Osband and Van Roy [2014] Ian Osband and Benjamin Van Roy. Modelbased reinforcement learning and the eluder dimension. In NIPS, pages 1466–1474, 2014.
 Osband and Van Roy [2016] Ian Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731, 2016.
 Osband et al. [2013] Ian Osband, Dan Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011, 2013.
 Ouyang et al. [2017a] Yi Ouyang, Mukul Gagrani, and Rahul Jain. Learning-based control of unknown linear systems with Thompson sampling. arXiv preprint arXiv:1709.04047, 2017.
 Ouyang et al. [2017b] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In NIPS, 2017.

 Strehl and Littman [2008] Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
 Strens [2000] Malcolm Strens. A Bayesian framework for reinforcement learning. In ICML, pages 943–950, 2000.
 Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
 Theocharous et al. [2017] Georgios Theocharous, Nikos Vlassis, and Zheng Wen. An interactive points of interest guidance system. In IUI Companion, pages 49–52. ACM, 2017.
 Thompson [1933] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
Appendix A Proof of lemma 2
Proof. For deterministic schedule,
Thus we can write
Thus, we can bound the regret using
where the second inequality follows because $0 \le h_\theta \le H$. Let $A_t$ denote the event that the algorithm has changed its policy at time $t$. We can write
Appendix B Proof of lemma 3
Proof. By the Cauchy-Schwarz inequality and the Lipschitz dynamics assumption,
Recall that $N_k$ is one plus the number of steps in the first $k$ episodes. Let $T_k$ be the length of episode $k$. Because we have $K_T$ episodes, we can write
where $T_k$ is the number of steps in the $k$th episode. Thus
Appendix C Proof of lemma 4
Proof. Let $K_T$ be the number of episodes up to time $T$, and let $N_k$ be one plus the number of steps in the first $k$ episodes, so that $N_0 = 1$. We write
where (a) follows from the fact that for all , (b) follows from
and the definitions above, and (c) follows from Assumption A2.
Appendix D Proof of lemma 5
Proof. To simplify the expositions, we use to denote in this proof. Notice that . Based on the definition of , we have
(6) 
We also define the following auxiliary function. By direct calculation, we have
(7) 
The first equation implies that is strictly increasing in , and the second equation implies that for all , is maximized by setting . This implies that for all , we have
Hence, for all , we have . Consequently, as a function of is globally Lipschitz continuous for . So we have
Appendix E Posterior Concentration for POI Recommendation
Recall that the parameter space $\Theta$ is a finite set, and $\theta^* \in \Theta$ is the true parameter. Notice that if the perturbed transition probability is close to $0$ or $1$, then DS-PSRL will not learn much about $\theta^*$ at time $t$, since in such cases the likelihoods are roughly the same for all $\theta \in \Theta$. Hence, to derive the concentration result, we make the following simplifying assumption:
for some constant. Moreover, we assume that all the elements in $\Theta$ are distinct, and define the minimum gap between $\theta^*$ and any other element of $\Theta$. To simplify the exposition, we also define
Then we have the following lemma about the concentrating posterior of this problem:
Lemma 6
(Concentration) Assume that $\tilde\theta_t$ is sampled from the posterior at time step $t$; then under the above assumptions, for any $t$, we have
where the constants are as defined above and depend only on the quantities introduced in this section.
e.1 Proof of lemma 6
Proof. We use $P_0$ to denote the prior over $\theta^*$, and use $P_t$ to denote the posterior distribution over $\theta^*$ at the end of time $t$. Note that by Bayes' rule, we have
We also define the posterior loglikelihood of at time as
for all and all . Notice that always holds, and by definition. We also define to simplify the exposition. Note that by definition, we have
Define the indicator , then we have
Since is adaptive, we have
where the last inequality follows from Pinsker’s inequality. Notice that function is a strictly convex function of , and , we have
Similarly, we have . Consequently, we have
where the last inequality follows from the fact . Since function is concave on and , we have . Define
(8) 
then we have . Hence we have
Furthermore, we define