Monte Carlo Rollout Policy for Recommendation Systems with Dynamic User Behavior

by Rahul Meshram, et al.

We model online recommendation systems using the hidden Markov multi-state restless multi-armed bandit problem. To solve it, we present a Monte Carlo rollout policy. We illustrate numerically that the Monte Carlo rollout policy performs better than the myopic policy for arbitrary transition dynamics with no specific structure. However, when some structure is imposed on the transition dynamics, the myopic policy performs better than the Monte Carlo rollout policy.




I Introduction

Online recommendation systems (RS) are extensively used by multimedia hosting platforms, e.g., YouTube and Spotify, and by entertainment services, e.g., Netflix and Amazon Prime. These systems create personalized playlists for users based on behavioral information from individual watch history and on information harvested from social networking sites. In this paper we provide new models for user behavior and algorithms for recommendation.

Most often, playlists are generated by solving a “matrix completion” problem, and items are recommended to users based on their past preferences. It is implicitly assumed that user interest is static and that the current recommendation does not influence the future behavior of user interest. A playlist generated this way therefore does not take into account the dynamic behavior of, or changes in, user interest triggered by the current recommendation. In this paper we study a playlist generation system as a recommendation system, where a playlist is generated using the immediate dynamic behavior of user interest. The user responds to different items differently. This behavior depends on the play history along with some element of randomness in the preferences.

We consider a Markov model for user interest or preferences, where a state describes the intensity level of preferences. (The Markov model is an approximation of the dynamic behavior of user interest that makes the analysis tractable. In general, user behavior can be more complex and requires further investigation.) A higher state means a higher level of interest in an item. The user behavior for an item is determined by the transition dynamics for that item. We assume that the user provides binary feedback when an item is played, and no feedback when it is not played. (In general there can be other forms of feedback, such as the user stopping a video partway through.)

The likelihood of observing feedback is state dependent. After an item is played, user interest moves to different states with different probabilities. User interest for an item returns to a fixed state whenever it is not played. An item for which user interest stays in a higher state with high probability after its play is referred to as a viral item. For certain items, user interest drops immediately after play; these are referred to as normal items. Our objective is to model and analyze the diverse behavior of user interest for different items, and to generate a dynamic playlist using binary feedback. Note that the state of user interest is not observable by the recommendation system. This is an example of a multi-state hidden Markov model. Our model is a generalization of the two-state hidden Markov model in [1]. This paper studies a playlist generation system using a multi-state hidden Markov model.

We make the following contributions in this paper.

  1. We model a playlist generation (recommendation) system as a hidden Markov multi-state restless multi-armed bandit problem. We present a four-state model. An item in an RS is modeled using a POMDP. This is given in Section II.

  2. We present the following solution approaches in Section III: myopic policy, Monte Carlo rollout policy and Whittle-index policy. The Whittle-index policy has limited applicability due to the lack of an explicit index formula for multi-state hidden Markov bandits.

  3. We discuss numerical examples in Section IV. We present numerical results with the myopic and Monte Carlo rollout policies. Our first numerical example illustrates that the myopic policy performs better than the Monte Carlo rollout policy whenever the transition probabilities of the interest states have a specific structure such as stochastic dominance. The second numerical example demonstrates that the myopic policy performs poorly compared to the Monte Carlo rollout policy when no such structure is imposed on the model. In the third example, we compare the Monte Carlo rollout policy with the Whittle index policy and observe that the Monte Carlo rollout policy performs better, which we attribute to the approximations involved in the index calculations.

I-A Related Work

Recommendation systems are often studied using collaborative filtering methods [2, 3, 4]. Matrix factorization (MF) is one such method employed in collaborative filtering [5, 6]. The idea is to arrange users and items as a matrix in which each entry describes a user's rating for an item. The MF method then transforms this high-dimensional matrix into a lower-dimensional representation. Machine learning techniques are used in MF and collaborative filtering [7]. Recommendation system ideas are also inspired by work on the matrix completion problem [8]. All of these models are based on data obtained from previous recommendations or historical data. These works assume that user preferences are static and do not take into account the dynamic behavior of the user based on feedback from preceding recommendations.

Recently, another body of work on modeling online recommendation systems has emerged, inspired by online learning with bandit algorithms [9, 10, 11]. This line of work uses, for example, contextual epsilon-greedy algorithms for news recommendation. Another way to model online recommendation systems, such as playlist generation systems, is as restless multi-armed bandits (RMAB) [1]. In these systems, user interest evolves dynamically, and this evolution depends on whether an item is recommended or not.

We now describe some related work on RMAB, hidden Markov RMAB and their solution methodologies. RMAB has been studied extensively for applications in communication systems, queueing networks and resource allocation problems [12, 13]. The RMAB problem is PSPACE-hard [14], so heuristic index-based policies are used in practice. Using such an index-based policy requires structure on the dynamics of the restless bandits. This can be a limitation for hidden Markov RMAB, where each restless bandit is modeled using a POMDP and structural results are very difficult to obtain. This motivates us to look for an alternative policy, and the Monte Carlo rollout policy is studied in this work. Monte Carlo rollout policies have been developed for complex Markov decision processes in [15, 16, 17, 18].

II Online Recommendation System as Restless Multi-armed Bandits

We present models of online recommendation systems (RS). There are different types of items to be recommended. A model for each type describes the specific user behavior for that type of item. We consider a four-state model where a state represents the user interest for an item. The states are called Low interest (L), Medium interest (M), High interest (H) and Very high interest (V). Thus, the state space is $\mathcal{S} = \{L, M, H, V\}$. RS can play an item or not play that item. The state evolution of user interest for an item depends on the action. There are two actions for each item, play or not play, i.e., $a \in \{0, 1\}$, where $a = 0$ corresponds to not playing and $a = 1$ corresponds to playing the item.

We suppose that RS gets a binary observation signal, i.e., $1$ for like and $0$ for dislike. (These observations reflect the actions of the user based on their interest. For example, the user may skip an item when they dislike it, or watch it completely when they like it. More signals would correspond to more user actions.) In general, RS can have more than two signals as observations, but for simplicity we consider only two. RS cannot directly observe user interest for items, and hence the state of each item is not observable. When an item is played, the user clicks on either the like or the dislike (skip) button. The click-through probability $\rho_i$ depends on the current state $i$ of user interest for that item but not on the state of any other item. Whenever the user clicks, RS accrues a unit reward with probability $\rho_i$ for $i \in \mathcal{S}$. Further, we assume [...]. Thus, each item can be modeled as a partially observable Markov decision process (POMDP) with finitely many states and two actions. Following the literature on POMDPs [19, 20, 21], a belief vector $\pi = (\pi(L), \pi(M), \pi(H), \pi(V))$ is maintained for each item, where $\pi(i)$ is the probability that user interest for the item is in state $i$ and $\sum_{i \in \mathcal{S}} \pi(i) = 1$. The immediate expected reward to RS from play of an item with belief $\pi$ is $r(\pi) = \sum_{i \in \mathcal{S}} \pi(i) \rho_i$.

When an item is not played, another item is played to the user instead. In this way the items compete at RS for each time slot. The evolution of the user interest state for each item depends on whether that item is played or not. RS is thus an example of a restless multi-armed bandit (RMAB) problem [12].

Suppose there are $N$ independent items, each with the same number of states. After each play of an item, a unit reward is obtained by RS based on the user click. Further, RS can play only one arm (item) at each time instant. The objective of RS is to maximize the long-term discounted cumulative reward (the sum of rewards from play of all items over the long term) subject to the constraint that only one item is played at a time. Because RS does not observe the state of user interest for each item at each time step, we refer to this as a hidden Markov RMAB [22]. This is a constrained optimization problem, and the items are coupled through the integer constraint on RS. Such problems are also called weakly coupled POMDPs.
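The objective described above can be written compactly as follows. This is only a sketch in assumed notation, none of which is fixed by the surrounding text: $N$ is the number of items, $\phi$ a policy, $\beta \in (0,1)$ the discount parameter, $a_{j,t} \in \{0,1\}$ the action on item $j$ at time $t$, and $R_{j,t}$ the reward obtained from item $j$ at time $t$ if it is played.

```latex
\max_{\phi} \; \mathbb{E}^{\phi}\!\left[ \sum_{t=1}^{\infty} \beta^{t-1} \sum_{j=1}^{N} a_{j,t}\, R_{j,t} \right]
\quad \text{subject to} \quad
\sum_{j=1}^{N} a_{j,t} = 1 \;\; \text{for all } t \ge 1 .
```

The equality constraint is what couples the otherwise independent items.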

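III Solution Approach

We discuss the following solution approaches: myopic policy, Monte Carlo rollout policy and Whittle-index policy.

We first describe the belief update rule for an item. Let $\pi_t$ be the belief vector at time $t$ and let $P^a$ be the transition probability matrix of the item under action $a \in \{0, 1\}$. After play of an item, the belief $\pi_{t+1}$ is obtained from $\pi_t$ by conditioning on the observed signal (like or dislike) and then propagating the result through $P^1$. When the item is not played, no signal is observed, and hence the posterior belief is $\pi_{t+1} = \pi_t P^0$.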
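The belief update can be illustrated with a short Python sketch. It assumes the standard Bayesian filter for a hidden Markov model with binary feedback: a like is observed with probability `rho[i]` in state `i`, the belief is conditioned on the signal, and the result is then propagated through the transition matrix. The function name `belief_update` and its arguments are hypothetical, and this form is an assumption rather than the exact expression used by the authors.

```python
import numpy as np

def belief_update(belief, P_play, P_rest, rho, played, liked=False):
    """Return the next belief vector for a single item (assumed Bayes filter).

    belief : length-K probability vector over interest states
    P_play : K x K transition matrix when the item is played
    P_rest : K x K transition matrix when the item is not played
    rho    : length-K vector, rho[i] = probability of a like in state i
    """
    belief = np.asarray(belief, dtype=float)
    rho = np.asarray(rho, dtype=float)
    if not played:
        # No signal is observed: propagate the belief through the dynamics only.
        return belief @ np.asarray(P_rest, dtype=float)
    # Condition on the observed signal (like or dislike) ...
    likelihood = rho if liked else 1.0 - rho
    posterior = belief * likelihood
    total = posterior.sum()
    if total <= 0.0:
        raise ValueError("observed signal has zero probability under this belief")
    # ... then propagate the conditioned belief through the play dynamics.
    return (posterior / total) @ np.asarray(P_play, dtype=float)
```

Conditioning before propagating reflects the assumption that the signal is generated by the state in which the item was played.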

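III-A Myopic policy

This is the simplest policy for RMAB with hidden states. In any given slot, the item with the highest immediate expected payoff is played. Let $\pi_{j,t}$ be the belief vector for item $j$ at time $t$. A unit reward is obtained from playing item $j$ depending on its state $i$, with probability $\rho_{j,i}$. Thus, the immediate expected payoff from play of item $j$ is $r_j(\pi_{j,t}) = \sum_{i \in \mathcal{S}} \pi_{j,t}(i) \rho_{j,i}$. The myopic policy plays an item $j^{*}_t \in \arg\max_{1 \le j \le N} r_j(\pi_{j,t})$.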
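A minimal Python sketch of the myopic rule follows. The function name `myopic_item` and the array layout (one belief row per item) are illustrative assumptions.

```python
import numpy as np

def myopic_item(beliefs, rho):
    """Pick the item with the largest immediate expected payoff.

    beliefs : N x K array, row j is the belief vector of item j
    rho     : length-K vector shared by all items, or N x K array with
              per-item reward probabilities
    Returns (index of the chosen item, vector of expected payoffs).
    """
    beliefs = np.asarray(beliefs, dtype=float)
    rho = np.asarray(rho, dtype=float)
    # Expected payoff of item j is the inner product of its belief and rho.
    payoffs = (beliefs * rho).sum(axis=1)
    return int(np.argmax(payoffs)), payoffs
```

Ties are broken in favor of the lowest item index, which is the behavior of `np.argmax`.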

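III-B Look-ahead Policy using Monte Carlo Method

We study the Monte Carlo rollout policy. $L$ trajectories are simulated for a fixed horizon length $H$ using a known transition and reward model. Along each trajectory, a fixed policy $\phi$ is employed, according to which one item is played at each time step. The information obtained from a single trajectory up to horizon length $H$ is

$$\{\pi_{j,h,l},\ a_{j,h,l},\ R^{\phi}_{j,h,l}\}_{j=1,\,h=1}^{N,\,H}$$

under policy $\phi$. Here, $l$ denotes a trajectory, and $\pi_{j,h,l}$, $a_{j,h,l}$ and $R^{\phi}_{j,h,l}$ are the belief, action and reward for item $j$ at step $h$ of that trajectory. The value estimate of trajectory $l$ starting from belief state $\pi = (\pi_1, \dots, \pi_N)$ for the $N$ items and initial action $a$ is

$$Q^{\phi}_{H,l}(\pi, a) = \sum_{h=1}^{H} \beta^{h-1} \sum_{j=1}^{N} R^{\phi}_{j,h,l},$$

where $\beta \in (0, 1)$ is the discount parameter. Then, the value estimate for state $\pi$ and action $a$ over $L$ trajectories under policy $\phi$ is

$$\tilde{Q}^{\phi}_{H,L}(\pi, a) = \frac{1}{L} \sum_{l=1}^{L} Q^{\phi}_{H,l}(\pi, a).$$

Here, the policy $\phi$ implemented along a trajectory can be a uniform random policy or a myopic (greedy) policy. Next, a one-step policy improvement is performed, and the action is selected according to the following rule:

$$j^{*}(\pi) \in \arg\max_{1 \le j \le N} \tilde{Q}^{\phi}_{H,L}(\pi, a = j),$$

where $a = j$ means that item $j$ is played at the first step of the trajectory.

In each time step, an item is played based on the above rule. A detailed discussion of the rollout policy for RMAB is given in [18].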
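The rollout procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. Trajectories are simulated in belief space, so a like is sampled with the belief-predicted probability. The base policy is myopic, a like yields a unit reward, and the reward vector `rho` and the not-play matrix `P_rest` are shared by all items. All function and parameter names (`rollout_value`, `rollout_policy`, `n_traj`, `horizon`) are hypothetical.

```python
import numpy as np

def step_belief(belief, P_play, rho, liked):
    # Condition on the simulated signal, then propagate through the play dynamics.
    post = belief * (rho if liked else 1.0 - rho)
    return (post / post.sum()) @ P_play

def rollout_value(beliefs, first_item, P_play, P_rest, rho, beta, horizon, rng):
    """Discounted reward of one simulated trajectory that plays `first_item`
    at the first step and follows the myopic base policy afterwards."""
    rho = np.asarray(rho, dtype=float)
    b = np.array(beliefs, dtype=float)          # N x K, copied
    total, item = 0.0, first_item
    for h in range(horizon):
        p_like = float(b[item] @ rho)           # belief-predicted like probability
        liked = rng.random() < p_like
        total += (beta ** h) * float(liked)     # unit reward on a like
        nxt = b @ P_rest                        # every item rests ...
        nxt[item] = step_belief(b[item], P_play[item], rho, liked)  # ... except the played one
        b = nxt
        item = int(np.argmax(b @ rho))          # myopic base policy
    return total

def rollout_policy(beliefs, P_play, P_rest, rho, beta=0.9, horizon=5, n_traj=50, seed=0):
    """One-step policy improvement: average n_traj rollouts per first item
    and return (best item, vector of value estimates)."""
    rng = np.random.default_rng(seed)
    rho = np.asarray(rho, dtype=float)
    P_rest = np.asarray(P_rest, dtype=float)
    q = np.array([
        np.mean([rollout_value(beliefs, j, P_play, P_rest, rho, beta, horizon, rng)
                 for _ in range(n_traj)])
        for j in range(len(beliefs))
    ])
    return int(np.argmax(q)), q
```

The cost per decision grows with the product of the number of items, the number of trajectories and the horizon length, which is the main tuning trade-off of the method.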

III-C Whittle-index policy

Another popular approach for RMAB (and weakly coupled POMDPs) is the Whittle-index policy [12], in which the constrained optimization problem is solved by relaxing the integer constraint. The problem is transformed into an optimization problem with a discounted constraint. Then, using a Lagrangian technique, the relaxed constrained RMAB problem is decoupled into single-armed restless bandit (SARB) problems. In a SARB problem, a subsidy for not playing the item is introduced. A SARB with hidden states is an example of a POMDP with a two-action model [22]. To use the Whittle-index policy, one needs to study the structural properties of the SARB, show the existence of a threshold policy, establish indexability for each item, and compute the indices for all items. In each time step, the item with the highest index is played.
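For orientation, the relaxation and the single-armed problem described above are commonly written as follows for discounted restless bandits, in the spirit of [12] and [22]. The notation is assumed here: $\beta$ is the discount parameter, $a_{j,t} \in \{0,1\}$ the action on item $j$ at time $t$, $N$ the number of items, $w$ the subsidy for not playing, $r(\pi)$ the expected reward of playing with belief $\pi$, $P^{0}$ the not-play transition matrix, and $\pi'$ the random next belief after a play.

```latex
% Relaxed (discounted) constraint replacing "exactly one item per slot":
\mathbb{E}\!\left[ \sum_{t=1}^{\infty} \beta^{t-1} \sum_{j=1}^{N} a_{j,t} \right] = \frac{1}{1-\beta}

% Single-armed restless bandit with subsidy w for not playing:
V_w(\pi) = \max\Big\{\, r(\pi) + \beta\, \mathbb{E}\big[ V_w(\pi') \mid \pi,\ \text{play} \big],\;\;
                      w + \beta\, V_w(\pi P^{0}) \Big\}

% Whittle index of belief \pi (defined when the arm is indexable):
W(\pi) = \inf\{\, w : \text{not playing is optimal at } \pi \text{ under subsidy } w \,\}
```

Indexability means that the set of beliefs at which not playing is optimal grows monotonically as $w$ increases, which is what makes $W(\pi)$ well defined.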

In our model, it is very difficult to establish indexability and to obtain a closed-form index formula. The indexability condition requires us to show threshold policy behavior for each item. In [21], the authors have shown the existence of a threshold policy for a specialized POMDP model. For such a specialized model, it is possible to show indexability (details are omitted) and to use a Monte Carlo based index computation algorithm; see [18, Section IV]. Note that this algorithm is computationally expensive and time consuming, because the Monte Carlo algorithm has to run for each restless bandit until its value function converges.

IV Numerical Results for Model

We now present numerical examples that illustrate the performance of the myopic policy and the Monte Carlo (MC) rollout policy. In the first example we observe that the myopic policy performs better than the MC rollout policy under some structural assumptions on the transition probabilities and reward probabilities. In the second example there are no structural assumptions on the transition probabilities, and here the MC rollout policy performs better than the myopic policy.

IV-1 Example 1

In this example, we introduce structure on the transition probability matrices of the items, as in [21]. When an item is played, user interest evolves according to a transition matrix specific to that item. For items that are not played, user interest evolves according to a common transition matrix that is the same for all items. The parameter set consists of the number of items, the number of states, the transition probability matrix of each item when it is played, the common transition probability matrix when an item is not played, a reward vector that is the same for all items, the initial belief, and the initial states of the items, which differ across items.

Fig. 1: Comparison of myopic policy and Monte Carlo rollout policy for Example 1.

From Fig. 1, we find that the myopic policy performs better than the Monte Carlo rollout policy. The Monte Carlo rollout policy is run with a fixed number of trajectories and a fixed horizon length.
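One common way to formalize structure on a transition matrix is first-order stochastic dominance between its rows: starting from a higher interest state makes higher next states at least as likely. The Python sketch below checks this property. It is an illustration of that notion only, and the exact structural condition used in this example (as in [21]) may be stronger. The function name `rows_stochastically_increasing` is hypothetical.

```python
import numpy as np

def rows_stochastically_increasing(P, tol=1e-12):
    """Return True if each row of P first-order stochastically dominates
    the row above it, i.e. all tail sums are nondecreasing down the rows."""
    P = np.asarray(P, dtype=float)
    # tails[i, k] = sum of P[i, k'] over k' >= k
    tails = np.cumsum(P[:, ::-1], axis=1)[:, ::-1]
    return bool(np.all(np.diff(tails, axis=0) >= -tol))
```

A matrix that fails this check corresponds to the unstructured setting, where the ordering of interest states carries no guarantee about the next state.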

IV-2 Example 2

We consider a general transition probability matrix for each action, with no structural assumption. Hence, we do not impose a stochastic dominance condition on the transition probability matrix of each item. When an item is played, user interest evolves according to a transition matrix specific to that item, whereas for items that are not played it evolves according to a common transition matrix. The parameter set again consists of the number of items, the number of states, the transition probability matrix of each item when it is played, the common transition probability matrix when an item is not played, and a reward vector that is the same for all items.

The initial belief vector and initial states are the same as in Example 1. We compare the expected discounted cumulative reward under the myopic and Monte Carlo rollout policies in Fig. 2. The Monte Carlo rollout policy is parameterized by the length of a trajectory and the number of trajectories. The myopic policy and the MC rollout policy also differ in which items they play most frequently.

Fig. 2: Comparison of myopic policy and Monte Carlo rollout policy for Example 2.
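A comparison of this kind can be produced by simulating the hidden interest states and accumulating discounted rewards. The Python sketch below does this for the myopic policy only. It is a hypothetical harness, not the authors' experimental code: the name `simulate_myopic`, the argument layout and the unit reward on a like are assumptions, and no parameter values from the paper are reproduced.

```python
import numpy as np

def simulate_myopic(P_play, P_rest, rho, init_states, init_beliefs,
                    beta, horizon, seed=0):
    """Simulate one sample path of the hidden states under the myopic policy
    and return the discounted cumulative reward over `horizon` slots."""
    rng = np.random.default_rng(seed)
    rho = np.asarray(rho, dtype=float)
    P_rest = np.asarray(P_rest, dtype=float)
    P_play = [np.asarray(P, dtype=float) for P in P_play]
    states = list(init_states)                 # true hidden state of each item
    b = np.array(init_beliefs, dtype=float)    # N x K beliefs held by the RS
    total = 0.0
    for t in range(horizon):
        item = int(np.argmax(b @ rho))         # myopic choice from beliefs
        liked = rng.random() < rho[states[item]]  # feedback from the true state
        total += (beta ** t) * float(liked)
        # Belief update: Bayes step for the played item, propagation for the rest.
        post = b[item] * (rho if liked else 1.0 - rho)
        if post.sum() <= 0.0:
            raise ValueError("belief assigns zero probability to the observed signal")
        nxt = b @ P_rest
        nxt[item] = (post / post.sum()) @ P_play[item]
        b = nxt
        # Hidden state transitions for all items.
        for j in range(len(states)):
            P = P_play[j] if j == item else P_rest
            states[j] = int(rng.choice(len(rho), p=P[states[j]]))
    return total
```

Averaging the returned value over many seeds gives an estimate of the expected discounted cumulative reward; replacing the myopic choice with a rollout-based choice would give the corresponding estimate for the rollout policy.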

IV-3 Example 3

In this example we use the same transition probability matrices as in Example 1 when an item is played. When an item is not played, however, user interest transitions to a fixed state. This differs from the earlier examples, and the rewards are also different, with a reward vector that is the same for all items. We use the same initial belief and initial states as in Example 1, together with a fixed discount parameter.

Fig. 3: Comparison of myopic policy, Monte Carlo rollout policy and Whittle index policy for Example 3.

We observe from Fig. 3 that the myopic policy and the Monte Carlo rollout policy perform better than the Whittle index policy. This may be due to the approximation used for index computation, the lack of an explicit index formula, or the structure of the problem. As stated earlier, when there is no explicit closed-form formula for a hidden Markov bandit, the index policy is more computationally expensive than the Monte Carlo rollout policy and the myopic policy. In such examples, the Monte Carlo rollout policy is a good alternative.

IV-A Example

In this example we consider [...] items with [...] states. We compare only the myopic policy and the Monte Carlo rollout policy. We do not assume a monotonicity structure on the transition matrices. The comparison is illustrated in Fig. 4. We observe that the Monte Carlo rollout policy performs better than the myopic policy, by up to [...]. The Monte Carlo rollout policy again uses a fixed number of trajectories and a fixed horizon length.

Fig. 4: Comparison of myopic policy and Monte Carlo rollout policy for this example.

V Conclusions

We have studied an online recommendation system problem using hidden Markov RMAB and provided numerical results for the Monte Carlo rollout policy and the myopic policy. We observed that the Monte Carlo rollout policy performs better for arbitrary transition dynamics, and that the myopic policy performs better than the Monte Carlo rollout policy whenever there is structure on the state transition dynamics. We also presented the performance of the Whittle index policy and compared it with the Monte Carlo rollout policy for a specialized model.

The objective of this paper was to describe a new Monte Carlo rollout algorithm for RS with a Markov model. We have demonstrated the performance of the algorithm on small-scale examples. This study can be extended to large-scale examples, e.g., a large number of items, up to a few hundred. Regarding scalability, even though an RS might have millions of items in its database, it may only recommend items from a small subset, considering the cognitive limitations of humans and the problem of information overload.


  • [1] R. Meshram, A. Gopalan, and D. Manjunath, “Restless bandits that hide their hand and recommendation systems,” in Proc. IEEE COMSNETS, 2017.
  • [2] C. C. Aggarwal, Recommender Systems, Springer, 2016.
  • [3] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl, “GroupLens: Applying collaborative filtering to Usenet news,” Communications of the ACM, 1997.
  • [4] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in Proc. WWW10, 2001, pp. 285–295.
  • [5] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” IEEE Computer, 2009.
  • [6] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen, “Collaborative filtering recommender systems,” The Adaptive Web, Lecture Notes in Computer Science, 2007.
  • [7] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua, “Neural collaborative filtering,” arXiv, 2017.
  • [8] E. Candes and T. Tao, “The power of convex relaxation: Near optimal matrix completion,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2053–2080, May 2010.
  • [9] J. Langford and T. Zhang, “The epoch-greedy algorithm for contextual multi-armed bandits,” in Proc. NIPS, 2007.
  • [10] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proc. ACM WWW, 2010.
  • [11] D. Glowacka, Bandit Algorithms in Information Retrieval, Foundations and Trends in Information Retrieval NOW, 2019.
  • [12] P. Whittle, “Restless bandits: Activity allocation in a changing world,” Journal of Applied Probability, vol. 25, no. A, pp. 287–298, 1988.
  • [13] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed Bandit Allocation Indices, John Wiley and Sons, New York, 2nd edition, 2011.
  • [14] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of optimal queueing network control,” Mathematics of Operations Research, vol. 24, no. 2, pp. 293–305, May 1999.
  • [15] G. Tesauro and G. R. Galperin, “On-line policy improvement using Monte Carlo search,” in Proc. NIPS, 1996, pp. 1–7.
  • [16] H. S. Chang, R. Givan, and E. K. P. Chong, “Parallel rollout for online solution of partially observable Markov decision processes,” Discrete Event Dynamic Systems, vol. 14, pp. 309–341, 2004.
  • [17] D. P. Bertsekas, Distributed Reinforcement Learning, Rollout, and Approximate Policy Iteration, Athena Scientific, 2020.
  • [18] R. Meshram and K. Kaza, “Simulation based algorithms for Markov decision processes and multi-action restless bandits,” arXiv, 2020.
  • [19] R. D. Smallwood and E. J. Sondik, “The optimal control of partially observable Markov processes over a finite horizon,” Operations Research, vol. 21, no. 5, pp. 1071–1088, Sept.–Oct. 1973.
  • [20] E. J. Sondik, “The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs,” Operations Research, vol. 26, no. 2, pp. 282–304, March–April 1978.
  • [21] W. S. Lovejoy, “Some monotonicity results for partially observed Markov decision processes,” Operations Research, vol. 35, no. 5, pp. 736–743, October 1987.
  • [22] R. Meshram, D. Manjunath, and A. Gopalan, “On the Whittle index for restless multi-armed hidden Markov bandits,” IEEE Transactions on Automatic Control, vol. 63, no. 9, pp. 3046–3053, 2018.