MBCAL: A Simple and Efficient Reinforcement Learning Method for Recommendation Systems

11/06/2019 ∙ by Fan Wang, et al. ∙ 0

It has been widely recognized that considering only the immediate user feedback is not sufficient for modern industrial recommendation systems. Many previous works attempt to maximize the long-term rewards with Reinforcement Learning (RL). However, model-free RL suffers from problems including significant gradient variance, long convergence periods, and the requirement of sophisticated online infrastructure. While model-based RL provides a sample-efficient alternative, the cost of planning in an online system is unacceptable. To achieve high sample efficiency in practical situations, we propose a novel model-based reinforcement learning method, namely Model-Based Counterfactual Advantage Learning (MBCAL). In the proposed method, a masking item is introduced into the environment model learning. With the masking item and the environment model, we introduce the counterfactual future advantage, which eliminates most of the noise in long-term rewards. The proposed method selects actions by approximating the immediate reward and the future advantage separately. It is easy to implement, and it incurs reasonable cost in both training and inference. In carefully designed experiments, we compare our method with several baselines, including supervised learning, model-free RL, and other model-based RL methods. Results show that our method surpasses all the baselines in both sample efficiency and asymptotic performance.








In recommendation systems (RS), user interests and intents may change according to the interaction history, which influences user behavior toward subsequent recommendations. For example, exploration inspires a user's interest in new areas, while repeating similar content weakens the user's intent to read. It is therefore essential to maximize the long-term utility of the interactive process between the users and the system. Applying Reinforcement Learning (RL) to pursue long-term utility in RS is of interest to many recent studies (Dulac-Arnold et al., 2015; Zheng et al., 2018; Zhao et al., 2018; Chen et al., 2019a), which treat this interactive process as a Markov Decision Process (MDP). However, in realistic recommendation systems, applying Model-Free RL (MFRL) suffers from the following problems. Firstly, MFRL is sample inefficient compared with supervised learning and requires a large amount of feedback. It also requires high-risk exploration, while showing completely random recommendations to real users at the beginning is intolerable. Secondly, the rewards in a recommendation system contain extensive noise, which originates from the unobserved parts of the environment. For example, some users naturally behave more actively than others, and user behavior is more active on vacation than on a busy workday. These unobserved factors bring substantial noise into the learning targets of MFRL and result in considerable bias for the function approximator.

Recently, there has been increasing attention on model-based RL (MBRL), which uses far fewer samples during training. Vanilla MBRL adopts an environment model that predicts immediate feedback and state transitions, followed by a planning module that plans an optimal trajectory based on the predictions. However, this requires substantial, and usually unacceptable, inference cost. Moreover, the available items for selection (candidates) need to be known in advance, which is typically not the case in realistic RS. A potential solution is to use an environment model to generate virtual experiences that augment the learning of MFRL (Sutton, 1991; Liu et al., 2019). Though this setting alleviates the sample-efficiency issue to some extent, the problem of noisy learning targets remains unsolved.

In this paper, we propose a novel MBRL algorithm for RS, namely Model-Based Counterfactual Advantage Learning (MBCAL). It comprises a two-stage learning process. Firstly, to eliminate the noisy reward signals in learning, we introduce a masked environment model that can produce a counterfactual baseline. The environment model is trained by artificially replacing some of the actions in the trajectory with a masking item. With the counterfactual baseline, we calculate the so-called counterfactual future advantage, a low-noise indicator of the long-term rewards. Secondly, we introduce another model, the counterfactual future advantage approximator, which predicts this future advantage. The predictions of the immediate reward and the future advantage are finally combined to select the item to display. To summarize, the proposed method has the following characteristics:

  • It achieves high sample efficiency through environment modeling.

  • The counterfactual future advantage eliminates the intractable noises in the rewards, through which the model learns with a much lower variance.

  • The proposed method involves only two-stage supervised learning, the implementation cost of which is relatively low.

Theoretical analysis and experimental results validating the proposed method are presented. The experimental results show that our method surpasses the baseline methods, including supervised learning, MFRL, and MBRL, in both asymptotic performance and sample efficiency.

Related Work

Non-RL-based Recommendation Systems

Traditional RS adopts either item-based or user-based collaborative filtering (Schafer et al., 2007; Koren et al., 2009) to select the most relevant content (items) for the users. Recently, an increasing number of works use deep learning models to predict user behavior (Cheng et al., 2016; Hidasi and Karatzoglou, 2018; He et al., 2017). A typical RS is not a one-off interaction but a sequence of correlated interactions between the system and the user. Recurrent neural structures are therefore frequently employed to predict user behavior by making use of the user's interaction history (Hidasi et al., 2016; Hidasi and Karatzoglou, 2018; Li et al., 2017). However, these works focus on maximizing the immediate rewards, which is inadequate for optimizing the user experience in the long run.

MFRL-based Recommendation Systems

Following the emergence of deep reinforcement learning algorithms (Mnih et al., 2015; Lillicrap et al., 2016), prior works have attempted to apply MFRL to recommendation systems. DRN (Zheng et al., 2018) applies Deep Q-Learning (Mnih et al., 2015) to a real production environment by online training with real-time user feedback. To enhance timeliness in the actual system, DRN implements so-called minor and major updates, which require substantial infrastructure reforming costs. Another issue in MFRL-based RS is that maintaining a value distribution over the large discrete item space is typically costly. Some authors propose a continuous action space to select items (Dulac-Arnold et al., 2015). This is followed by DRR (Liu et al., 2018) and DeepPage (Zhao et al., 2018), both of which adopt a continuous action space with actor-critic algorithms.

Since it is typically not practical to update online in realistic systems, most previous works choose off-policy learning algorithms, including DQN (Zhao et al., 2018), DDQN, and DDPG (Dulac-Arnold et al., 2015; Liu et al., 2018; Zhao et al., 2018). Other works focus on on-policy updates with an off-policy correction (Chen et al., 2019a). However, estimating the unknown behavioral policy that generated the user log is quite tricky.

MBRL-based Recommendation Systems

Model-based RL (MBRL) (Deisenroth and Rasmussen, 2011) uses an environment model and a planner to search. It is sample efficient in training, as the planner can interact with the environment model rather than the real environment. However, it suffers from costly inference processes and poor asymptotic performance due to the accumulation of errors caused by the environment model.

Combinations of an environment model and MFRL, such as Dyna-Q (Sutton, 1991), have been proposed to combine the advantages of both MBRL and MFRL. In Dyna-Q, the environment model generates virtual interaction experience to augment the realistic interaction records. In RS, works following the Dyna-Q setting to pursue long-term rewards have also been reported (Chen et al., 2019b; Liu et al., 2019; Wang et al., 2019).


Problem Settings

Figure 1. Illustration of the Interaction between the Recommendation System and the User.

We suppose that each user interacts with the system for a horizon of T steps, namely an episode. At each step, the recommendation system selects an item from the candidates and pushes it to the user. The user then exposes his behavior (feedback) to the system, and a predefined reward function determines the reward given that behavior. This process continues until step T. An illustration of the interaction between the recommendation system and the user is shown in Fig. 1. The objective is to maximize the overall rewards of an episode.

Figure 2. A Sketch of Growing Batch Reinforcement Learning

In practical systems, a complete online update is typically not achievable. This has drawn attention to settings such as Batch RL (Lange et al., 2012), where the policy is optimized according to a fixed dataset. A realistic online updating system is more of an intermediate state between complete online updating and Batch RL, also called Growing Batch RL (Lange et al., 2012): data are collected over a certain period and then used for policy updates. Recent RL-based RS, including DRN (Zheng et al., 2018), DeepPage (Zhao et al., 2018), and off-policy corrected policy gradient (Chen et al., 2019a), also use a setting similar to Growing Batch RL. In this work, we mainly focus on this setting. More specifically, Growing Batch RL includes two periodically interleaving stages:

Data Collection. The RS uses the current policy to interact with the environment (the users). The policy is kept static during this process, and the interaction records are collected.

Policy Update. The collected interaction records are used to update the policy for the next round.

The difference between Batch RL and Growing Batch RL is that Batch RL applies a single policy update, while Growing Batch RL updates the policy iteratively. A sketch of Growing Batch RL is shown in Fig. 2.
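The two interleaving stages can be sketched as a minimal toy loop. This is a sketch under simplifying assumptions (a stateless reward environment and a hypothetical greedy `policy_update`), not the paper's actual training code:

```python
import random

def run_episode(policy, env, horizon=5):
    """Data collection: roll out one episode with the policy held fixed."""
    history = []
    for _ in range(horizon):
        action = policy(history)
        history.append((action, env(action)))  # record (action, reward)
    return history

def policy_update(batch):
    """Toy policy update: act greedily on the average observed reward."""
    totals, counts = {}, {}
    for episode in batch:
        for action, reward in episode:
            totals[action] = totals.get(action, 0.0) + reward
            counts[action] = counts.get(action, 0) + 1
    best = max(totals, key=lambda a: totals[a] / counts[a])
    return lambda history: best

def growing_batch_rl(env, actions, n_rounds=3, n_episodes=20):
    """Alternate data collection (policy static) with batch policy updates."""
    policy = lambda history: random.choice(actions)  # initial exploratory policy
    for _ in range(n_rounds):
        batch = [run_episode(policy, env) for _ in range(n_episodes)]  # collect
        policy = policy_update(batch)                                  # update
    return policy
```

The key property mirrored here is that the policy never changes inside a data-collection stage; it is only replaced between rounds.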


  • Action: the action at step t is the item selected by the RS from the candidate set of all items available at that step.

  • User behavior: the user's behavior toward the exposure of a certain item, drawn from a predefined list of possible behaviors.

  • Reward: the reward at step t is selected from a predefined reward list, in which each entry corresponds to one possible user behavior.

  • State: the state of the recommendation system represents the unknown user interests and contextual information, which is invisible to the system.

  • Observation history: the observation history of an episode is the sequence of per-step observations, each consisting of the action and the corresponding user behavior toward that action. A superscript marks the observation history an action or behavior belongs to, and a subscript range denotes the piece of history between two steps (both inclusive).
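As a concrete illustration of the behavior and reward notation above, a 5-class rating setup (as used later for MovieLens and Netflix) might map behaviors to rewards as follows; the specific reward values are illustrative assumptions, not taken from the paper:

```python
# Hypothetical 5-class rating behaviors and a predefined reward list;
# each reward entry corresponds to one possible user behavior.
BEHAVIORS = ["1-star", "2-star", "3-star", "4-star", "5-star"]
REWARDS = [0.0, 0.25, 0.5, 0.75, 1.0]  # illustrative values only

def reward(behavior):
    """Look up the predefined reward for an observed user behavior."""
    return REWARDS[BEHAVIORS.index(behavior)]
```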

In addition to the symbols mentioned above, we also adopt other commonly used notations: the transition probability of states; the static user features as well as other observable contextual features; the policy; and the state-value and action-value functions for a policy, with a decay factor for future rewards. The dataset is a collection of user profiles and interaction histories, where a superscript denotes a specific interactive episode.

Model-Free RL Solutions

Most previous works in MFRL-based RS directly treat the interaction history as the state (Zhao et al., 2019; Liu et al., 2018) and regard the RS as a Fully Observable MDP (FOMDP), approximating the state with the observation history (Eq. 1). One can then replace the state in the value functions with the observation history. The function approximator, with its trainable parameters, is optimized by minimizing the temporal difference loss (Eq. 2).


Here the target in Eq. 2 represents some value-backup paradigm (backup: updating the value of current states with that of future states (Sutton et al., 1998)). In this paper, we compare the following widely used algorithms.

Monte Carlo Policy Evaluation (MCPE) (Sutton et al., 1998; Dimitrakakis and Lagoudakis, 2008) takes the discounted return of a complete trajectory as the value backup in Eq. 2. It is straightforward but has very large variance due to the full backup.

Deep Q Network (DQN) (Mnih et al., 2015; Zheng et al., 2018) is a popular off-policy value approximation paradigm, which typically uses a 1-step off-policy value backup: the target is the immediate reward plus the discounted maximum of the target value function over actions at the next step. The target value function periodically copies the parameters from the learned value function during optimization (Mnih et al., 2015).

Double Deep Q Network (DDQN) (van Hasselt et al., 2016; Zheng et al., 2018) extends DQN by decoupling target action selection from target value estimation: the target action is selected by the learned network, while the target network evaluates it.

Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016; Dulac-Arnold et al., 2015) is an off-policy actor-critic method with a continuous action space; the backup target evaluates the action generated by a policy network. For inference, the continuous action is mapped to its nearest neighbors in the actual discrete action space. For more details, please refer to (Dulac-Arnold et al., 2015).
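The discrete-action backup paradigms above differ only in how the target is formed. A minimal sketch (plain Python, with Q-values passed in as lists, which is an assumption made for illustration):

```python
def mcpe_target(rewards, t, gamma=0.9):
    """MCPE: full Monte Carlo backup, the discounted return from step t on."""
    return sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))

def dqn_target(r_t, next_q_target, gamma=0.9):
    """DQN: 1-step backup, r_t plus the discounted max of the target network."""
    return r_t + gamma * max(next_q_target)

def ddqn_target(r_t, next_q_online, next_q_target, gamma=0.9):
    """DDQN: the online network selects the action, the target net evaluates it."""
    a_star = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return r_t + gamma * next_q_target[a_star]
```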

Environment Modeling

The environment model in MBRL predicts the transition and the reward given the state and the action. For a discrete state space, the environment model commonly includes two components: one approximating the state transition and one approximating the reward function.


In the following, we first explain the MBCAL procedure in detail. We then give a theoretical validation of the proposed method, along with remarks on its strengths and limitations.

Model Based Counterfactual Advantage Learning

Masked Environment Model Learning


Following model-based RL, we build an environment model that predicts the transition and the immediate rewards. Following Eq. 1, the transition is rewritten in terms of the observation history (Eq. 3).


We introduce an environment model with trainable parameters to predict the transition, i.e.,


For the immediate rewards, the reward is likewise rewritten in terms of the observation history, and the environment model predicts the reward according to Eq. 5, where the prediction is given by the approximation of the reward function based on the encoded history.

We then introduce the masking item (MI), which replaces some of the items in the interaction history. It is represented by a trainable embedding vector that is optimized together with the other parameters during training. Given an observation trajectory and a set of masking positions, we denote by the masked trajectory the trajectory in which the actions at those positions are replaced by the masking item. For example, masking positions 2 and 4 yields a trajectory in which the actions at steps 2 and 4 are replaced by the masking item.

Given the interaction history of a user, we sample a masking position set by masking every time step independently with some constant probability, which is fixed in our experiments. We then collect the masked trajectories and apply self-supervised learning for the Masked Environment Model (MEM) by maximizing the likelihood, i.e., minimizing the negative log-likelihood (NLL, Eq. 6).
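The random masking step can be sketched as below; the `MASK` token stands in for the trainable masking-item embedding, and the default probability is an illustrative assumption (the paper fixes a constant value):

```python
import random

MASK = "<MASK>"  # placeholder for the trainable masking-item embedding

def mask_trajectory(actions, p=0.2, rng=random):
    """Mask each time step independently with probability p.
    Returns the masked action sequence and the sampled position set."""
    masked, positions = [], []
    for t, action in enumerate(actions):
        if rng.random() < p:
            masked.append(MASK)
            positions.append(t)
        else:
            masked.append(action)
    return masked, positions
```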


The environment model requires encoding sequential observations. Following session-based recurrent RS (Hidasi et al., 2016; Hidasi and Karatzoglou, 2018), we use a gated recurrent network (Graves et al., 2013) to encode the interaction history. An illustration of the MEM structure and learning process is shown in Fig. 3. We make a slight modification to the recurrent structure: as the action and the user behavior need to be encoded at the same time, we concatenate the inputs in a staggered way, similar to the setting of (Santoro et al., 2016). At each step, the model takes the current action and the previous behavior as input to predict the distribution of the current behavior, with a special behavior introduced to indicate the start of the sequence. The MEM is formulated as Eq. 7, where Emb denotes a representation layer, Concat denotes a concatenation operation, MLP denotes a multilayer perceptron, and GRU denotes a Gated Recurrent Unit.
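The staggered encoding can be sketched with a hand-rolled GRU cell in NumPy; the weight shapes and embedding inputs are assumptions for illustration (biases omitted), not the paper's exact parameterization:

```python
import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One standard GRU step with update gate z and reset gate r."""
    z = 1.0 / (1.0 + np.exp(-(Wz @ x + Uz @ h)))
    r = 1.0 / (1.0 + np.exp(-(Wr @ x + Ur @ h)))
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1.0 - z) * h + z * h_tilde

def encode_staggered(action_embs, behavior_embs, params, d_hidden):
    """Feed the current action embedding concatenated with the *previous*
    behavior embedding at each step, mirroring the staggered MEM input;
    behavior_embs[0] plays the role of the start-of-sequence behavior."""
    h = np.zeros(d_hidden)
    states = []
    for a_emb, b_emb in zip(action_embs, behavior_embs):
        x = np.concatenate([a_emb, b_emb])  # staggered concatenation
        h = gru_cell(x, h, *params)
        states.append(h)  # h_t is used to predict the distribution of b_t
    return states
```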

Figure 3. Illustration of the model-based counterfactual learning algorithm: (a) learning of the masked environment model (MEM); (b) structure of the Counterfactual Future Advantage Approximator (CFA2); (c) calculating the Counterfactual Future Advantage (CFA) based on the trained MEM.

Counterfactual Future Advantage Approximation

The key to MBCAL is that we use the MEM to approximate the future advantage of an action. Given the trained MEM, we first define the Counterfactual Future Reward (CFR) of an observation trajectory at each time step by Eq. 8.


We then define the Counterfactual Future Advantage (CFA) for each trajectory at each time step by subtracting a counterfactual baseline from the CFR. The counterfactual baseline is calculated simply by replacing the action at that step with the masking item. Formally, it is defined as Eq. 9.


Following the definition of CFA, we introduce another model, namely the Counterfactual Future Advantage Approximator (CFA2), which approximates the CFA based on the current observation history. It is trained by minimizing the mean square error (Eq. 10) with respect to its parameters.


CFA2 takes the same input as the MEM. Thus, we use a similar architecture with different parameters, except that the last layer of CFA2 predicts a scalar value instead of a vector. An illustration of the model architecture, as well as an explanation of CFA and CFA2, can be found in Fig. 3.
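Given a trained MEM reward head, computing the CFR and CFA reduces to two rollouts of predicted rewards, one real and one with the action masked. A sketch under the assumption that the CFR sums the (optionally discounted) predicted rewards of the steps after t; `predict_reward` is a hypothetical stand-in for the MEM's reward prediction:

```python
MASK = "<MASK>"  # placeholder for the masking item

def counterfactual_future_reward(predict_reward, actions, behaviors, t, gamma=1.0):
    """CFR: discounted sum of MEM-predicted rewards for the steps after t,
    conditioned on the (possibly masked) action sequence."""
    return sum(gamma ** (j - t - 1) * predict_reward(actions, behaviors, j)
               for j in range(t + 1, len(actions)))

def counterfactual_future_advantage(predict_reward, actions, behaviors, t, gamma=1.0):
    """CFA: CFR of the real trajectory minus the counterfactual baseline,
    in which the action at step t is replaced by the masking item."""
    cfr = counterfactual_future_reward(predict_reward, actions, behaviors, t, gamma)
    masked = list(actions)
    masked[t] = MASK
    baseline = counterfactual_future_reward(predict_reward, masked, behaviors, t, gamma)
    return cfr - baseline
```

Note how the baseline only changes the action at step t, so the difference isolates that action's contribution to the predicted future rewards.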

It is worth taking a closer look at the learning targets in Eq. 2 and Eq. 10. The two losses share a similar formulation except for the targets. Both targets account for the long-term rewards, but the CFA target is remarkable in three aspects. Firstly, it accounts for the advantage only, while the temporal difference target involves the baseline value function. Secondly, it accounts for the residual of the advantage function by excluding the immediate reward of the current step, which has already been captured by the environment model and need not be involved again. Thirdly, while the temporal difference target uses value backups from real experiences, the CFA target comes from the MEM, which already has lower variance than the real interactions. These features enable CFA2 to capture the long-term rewards with lower noise and higher sample efficiency.

The Overall MBCAL Process

Finally, we select the item (action) based on both the MEM and CFA2. Given the user information and the observation trajectory, we pick the next action according to Eq. 12.


For exploration, we adopt an ε-greedy strategy (Sutton and Barto, 2018): a random action from the candidate set is selected with probability ε, while the best action based on Eq. 12 is selected with probability 1 − ε. Eq. 12 with exploration is written as Eq. 13.
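Action selection then combines the two learned predictors under ε-greedy exploration. In this sketch, `predict_reward` stands in for the MEM's immediate-reward estimate and `predict_cfa` for CFA2; the exact way Eq. 12 weights the two terms is not shown in this text, so a plain sum is assumed:

```python
import random

def select_action(candidates, predict_reward, predict_cfa, history, epsilon=0.1):
    """Pick a random candidate with probability epsilon; otherwise take the
    candidate maximizing predicted immediate reward plus future advantage."""
    if random.random() < epsilon:
        return random.choice(candidates)  # explore
    return max(candidates,
               key=lambda a: predict_reward(history, a) + predict_cfa(history, a))
```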


As mentioned before, Growing Batch RL collects data for a certain period and then uses that data for policy updates. The summary of Model Based Counterfactual Advantage Learning in Growing Batch RL setting is shown in Alg. 1, with the details of PolicyUpdate shown in Alg. 2.

1:Initialize the policy.
2:for k = 0,1,2,… until convergence do
3:     Use the current policy to interact with users for a batch of trajectories, and record the interaction history.
4:     Update the MEM and CFA2 with PolicyUpdate (Alg. 2) on the recorded history.
5:     Set the next policy according to Eq. 13.
6:Return the policy from the last iteration.
Algorithm 1 Model Based Counterfactual Advantage Learning(MBCAL)
1:Input: interaction history.
2:Randomly mask the interaction history to acquire the masked trajectories.
3:Minimize the NLL (Eq. 6) on the masked trajectories to optimize the MEM.
4:For each step, calculate the CFA with the trained MEM by Eq. 5, Eq. 8, and Eq. 9.
5:Minimize the mean square error (Eq. 10) on the CFA targets to optimize CFA2.
6:Return the MEM and CFA2.
Algorithm 2 PolicyUpdate()

Remarks. The training and inference costs of MBCAL are both reasonable. Training consists of two stages of supervised learning, and the inference cost is comparable to that of a traditional supervised-learning-based recommendation system.

Convergence Analysis

In this part, we show that the convergence of MBCAL can be ensured if the following Hypothesis 1 holds.

Hypothesis 1.

In the recommender system, given an observation record collected with a behavioral policy, if we re-sample the action at a single step and replace the original action with it, the distribution of the subsequent actions in the record is not strongly perturbed.

Based on Hypothesis 1, the following proposition can be proved.

Proposition 1.

Given the MEM and CFA2 sufficiently optimized on a dataset collected with the behavioral policy, selecting the optimal action based on Eq. 12 is equivalent to selecting the action that maximizes the action-value function.

We leave the proof of Proposition 1 to the Appendices. Notice that the remaining terms in the equation are independent of the action, so they can be treated as constants at each step. Proposition 1 thus shows that the action selection is equivalent to maximizing the action-value function. According to policy iteration theory (Sutton et al., 1998), if the policy is determined based on the value function of the previous policy, the value function of every state is guaranteed to improve. Thus, the iteration in Algorithm 1 converges to the optimal policy.

We also want to clarify the limitations of this method. We do not claim MBCAL to be a universal solution to POMDP problems; two properties unique to RS ensure its applicability. Firstly, RS is a special POMDP in which the observations include only the actions and the user feedback history, and direct observation of the state is naturally missing. In other problems, such as Atari games, direct observation of the state is essential, and the proposed MEM cannot be used. Secondly, we argue that Hypothesis 1 tends to hold for RS only: it is reasonable to suppose that a recommendation policy depends only weakly on any single piece of the historical interaction record. For other POMDP problems, this may not be true.



| Dataset | # of Users | # of Items | # of Training | # of Validation | Item Features | User Behaviors |
| MovieLens (ml-20m) | 130,000+ | 20,000+ | 2,486,840 | 1,270,879 | movie-id, genres, tags | ratings (5 classes) |
| Netflix | 480,000+ | 17,000+ | 4,533,213 | 2,275,125 | movie-id | ratings (5 classes) |
| NewsFeed | 920,000+ | 110,000+ | 9,408,891 | 4,700,894 | news-id, tags, title, category, topics, news-type, news-source | dwelling time (12 classes) |
Table 1. Properties of the used datasets.

In our experiments, we use three datasets, as shown in Tab. 1: MovieLens ml-20m (Harper and Konstan, 2016; http://files.grouplens.org/datasets/movielens/ml-20m-README.html), Netflix Prize (https://www.kaggle.com/netflix-inc/netflix-prize-data), and NewsFeed (data collected from the Baidu App News Feed System).

MovieLens ml-20m

The dataset (ml-20m) describes 5-star rating activities from MovieLens. The user behavior corresponds to the star rating, with the reward defined accordingly.

Netflix Prize

The dataset is a 5-star rating dataset from Netflix. The user behavior and the reward follow those of MovieLens.


NewsFeed

The dataset is collected from a real online news recommendation system. We focus on predicting the dwelling time on clicked news. The dwelling time is partitioned into 12 levels, corresponding to 12 different user behaviors with corresponding rewards.

Experiment Settings

Notice that many previous works on RL-based RS use offline evaluation criteria such as NDCG and Precision (Zhao et al., 2018; Liu et al., 2018). However, such criteria are incapable of indicating the ability of an RS to pursue long-term rewards. The ultimate evaluation for RS is online A/B testing, but it is quite expensive. To thoroughly study the performance of the proposed systems, we use simulators to evaluate the proposed methods and the baselines, a practice widely adopted in evaluating RL systems (Dulac-Arnold et al., 2015; Zhao et al., 2018; Liu et al., 2019). In our case, given the interaction history collected from the users, the simulator predicts the future user behavior. We utilize an LSTM structure and the features shown in Tab. 1 to train the simulator.

We set up two kinds of interaction rounds between the evaluated system and the simulators: training rounds and testing rounds. In a training round, the evaluated system produces actions with an ε-greedy policy (with ε fixed throughout all experiments) and interacts with the simulator; the system then updates its policy using the collected feedback and rewards. In a testing round, the system produces actions with a greedy policy and is evaluated by the simulator. The data used in testing rounds are not used in training rounds. For each episode in a training or testing round, we allow T steps of interaction between the simulator and the system. Each training or testing round includes 256,000 episodes.

To thoroughly investigate the performance of different systems, we carry out two different evaluation processes: the Batch-RL evaluation and the online evaluation. Each experiment is repeated three times, and we report the mean and variance of the results.

Batch RL Evaluation

The RS is only allowed to learn from the logged data; interactions with the simulator are not allowed, similar to supervised learning. This evaluates the performance of a system when it is launched online for the first time. We first split the original user log into a training set and a validation set. For each RS, we use the training set to learn the model or update the policy, and the validation set to pick the epoch with the best performance. For RL-based methods, we perform only one policy update. In the end, we apply a testing round to evaluate the performance (Fig. 4).


Online Evaluation

One training round of the online evaluation includes a data collection and a policy update stage. In each round, the RS updates its policy based on the interaction history from the data collection stage. The process will run for 40 training rounds, and we plot the changes in the mean and variance of the performance score in each corresponding testing round(Fig. 4).

Batch-RL evaluation
Online evaluation
Figure 4. Illustration of the evaluation process.

Methods for Comparison

We compare methods ranging from supervised learning (GRU4Rec) and bandits (GRU4Rec (ε-greedy)) to MFRL (MCPE, DQN, DDQN, and DDPG) and MBRL (Dyna-Q). For bandits, LinUCB (Chu et al., 2011) is a common baseline. However, in our environments an item is mainly represented by sparse features (including the item id, with few dense representation features), where LinUCB performs poorly while neural networks do much better owing to their representational power. We therefore report results for ε-greedy exploration on top of the neural models (GRU4Rec (ε-greedy)) instead of LinUCB.

The methods for comparison include:

  • GRU4Rec follows (Hidasi et al., 2016) to predict the immediate reward by encoding the session history. Items are selected greedily in both the training and testing rounds.

  • GRU4Rec (ε-greedy) applies ε-greedy item selection in GRU4Rec during the training rounds.

  • DQN (Zheng et al., 2018). We modify the model architecture of DQN to match ours, ensuring a fair comparison between the learning algorithms.

  • DDQN (Zheng et al., 2018). Double DQN sharing the same model architecture with DQN.

  • DDPG (Zhao et al., 2018). An off-policy actor-critic method for continuous action spaces with an actor network and a critic network. The inferred continuous action is used to select the closest item for display. We use a neural structure similar to the other methods for both the actor and critic networks.

  • MCPE. Monte Carlo Policy Evaluation(Dimitrakakis and Lagoudakis, 2008) uses exactly the same model architecture as DQN and DDQN.

  • Dyna-Q (Sutton, 1991; Liu et al., 2019) is an MBRL method that maintains an environment model and a DQN at the same time. In addition to the interaction between the environment and the DQN, virtual interaction records between the DQN and the environment model are aggregated into the dataset to accelerate learning. The ratio of virtual to realistic experience is 1:1.

  • MBCAL is our full method, using the model architecture shown in Fig. 3.

  • MBCAL (w/o variance reduction) is an ablative version of MBCAL. In Eq. 10, the counterfactual future reward (CFR) is used as the regression target instead of the counterfactual future advantage (CFA).

All the models are optimized by the Adam optimizer with the same learning rate and hyper-parameters. The decay factor in the long-term reward is fixed across methods, and the default hidden size is set to 32.

To evaluate the methods fairly and avoid "hacking" the simulator, all compared systems are allowed to use only a subset of the features in Tab. 1, i.e., the systems predict with partial information. In MovieLens and Netflix, only the movie-id is used. In NewsFeed, 4 of the 7 features are used: the news-id, category, news-type, and news-source. For the user representation, we use the user click history and the user id. (The code used to reproduce the experiments will be released, along with the trained simulator and the processed dataset.)

Experimental Results

Table 2. Batch-RL evaluation results (per-episode rewards of each algorithm on Movielens, Netflix, and NewsFeed).

Batch-RL Evaluation Results

The results of the Batch-RL evaluation are shown in Tab. 2. We evaluate the reward of an episode according to the reward generated by the simulator. The results show that MFRL cannot compete with MBRL in any of the three environments: being sample inefficient, MFRL tends to have poor startup performance. Surprisingly, DDPG has the poorest performance in all three environments. By closely investigating its value functions, we found that DDPG substantially overestimates the value function compared with the other MFRL methods. We attribute this to DDPG using value backups from a continuous action space in which actions may not correspond to actual items, leading to severe overestimation. The overestimation problem in actor-critic methods has also been thoroughly investigated in (Fujimoto et al., 2018).

Moreover, MBCAL leads all the tested systems, including Dyna-Q and MBCAL (w/o variance reduction), by solid margins, demonstrating its sample efficiency. However, on Movielens and Netflix our method earns a smaller margin over the supervised learning method than on NewsFeed. It is likely that the long-term reward plays a more significant role in NewsFeed than in the other two environments. Furthermore, learning to approximate long-term rewards requires more training samples than learning the immediate reward, so the advantage of RL-based methods is not yet fully revealed in the Batch-RL setting.

Figure 5. Online evaluation results.

Online Evaluation Results

Fig. 5 shows the results of the online evaluation in the three environments. GRU4Rec (ε-greedy) surpasses the purely supervised GRU4Rec by a small margin in every environment, showing the benefit of exploration in online systems. DDPG again performs surprisingly poorly in all three environments; in Fig. 5, we show the DDPG curve only for the NewsFeed environment, because in the other two environments DDPG lags too far behind all the other methods. We believe a continuous action space cannot work well for recommendation systems with a dynamic discrete item space.

In the Movielens and Netflix environments, MCPE, DQN, DDQN, and Dyna-Q lag entirely behind the other methods, including the supervised learning baselines, in both sample efficiency (rising speed) and asymptotic performance. With the assistance of the environment model, Dyna-Q gains some advantage at the beginning, but it gradually subsides as learning continues. This is in line with expectations, as virtual experience quickly loses its benefit once sufficient realistic experience accumulates. Investigating the large gap between the DQN-type methods and the others, we found that in those two environments, classifying behaviors has a clear advantage over regressing rewards, which reveals another benefit of model-based systems. In the NewsFeed environment, RL-based methods including DQN and DDQN achieve a significant margin over supervised learning and bandits, showing that the long-term reward is more significant in this environment than in the other two.

MBCAL again stays ahead of the other methods in all the environments. Even on Netflix and Movielens, where the other RL-based systems fail to yield obvious benefits, MBCAL wins by a considerable margin. On NewsFeed, where long term rewards play a more critical role, MBCAL extends its leading edge.

Mean Square Error                 Movielens   Netflix   NewsFeed
DQN                                    1.50      1.22       4.29
MCPE                                  17.1       9.21      46.9
Dyna-Q                                 0.94      1.04       7.87
MBCAL(w/o variance reduction)          3.45      3.29       3.07
Table 3. The mean square error in the testing round of the Batch-RL evaluation.

Analysis of the variance

The key point of MBCAL is variance reduction through the proposed counterfactual future advantage. According to classical theory (Geman et al., 1992), the mean square error of a well-trained model decomposes into the model bias and the variance (noise) of the learning targets. Since we use equivalent model structures across methods, the mean square loss can serve as an indicator of the noise of the learning targets. We compare the mean square error of MCPE, DQN, Dyna-Q, MBCAL(w/o variance reduction), and MBCAL, based on the interaction records in the testing round of the Batch-RL evaluation. The average MSE is reported in Tab. 3; notice that these losses are directly comparable to each other. The regression error of MCPE is the largest, which is in line with expectation because MCPE uses full value backup: noise dominates its gradients, so it requires many samples to avoid bias. Similarly, without variance reduction, the CFR has the second-largest variance; it is smaller than that of MCPE owing to the use of the environment model. The variances of DQN and Dyna-Q are smaller still because they use only one-step value backups. Compared with the other methods, the mean square error of MBCAL is much smaller, showing that MBCAL indeed benefits from low variance during its learning process.
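The argument above rests on the fact that an unbiased model's residual MSE is bounded below by the variance of its learning targets, so a noisier target (e.g. a full Monte-Carlo value backup) leaves a larger irreducible loss than a low-variance one. The following minimal NumPy sketch illustrates this; it is not the paper's code, and the synthetic regression setup is purely our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def converged_model_mse(noise_std, n=5000):
    """Measure the MSE of a zero-bias model against noisy targets.

    We pretend the model has converged exactly to the true conditional
    mean, so the remaining MSE is pure target noise. The loss therefore
    acts as an indicator of how noisy the learning target is.
    """
    x = rng.uniform(-1.0, 1.0, size=n)
    true_mean = np.sin(np.pi * x)                       # what the model should learn
    y = true_mean + rng.normal(0.0, noise_std, size=n)  # noisy learning target
    return np.mean((y - true_mean) ** 2)

# A high-variance target (full value backup) leaves a larger
# irreducible MSE than a low-variance target (one-step backup
# or counterfactual advantage), even with a perfect model.
assert converged_model_mse(noise_std=0.1) < converged_model_mse(noise_std=1.0)
```

With zero model bias, each measured MSE concentrates around `noise_std**2`, which is why Tab. 3 can be read as a ranking of target noise across methods.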


In this work, we study the problem of long-term reward maximization in recommendation systems. To enhance sample efficiency and improve final performance, we propose model-based counterfactual advantage learning, which consists of a two-stage learning process with a masked environment model and a counterfactual future advantage approximator. The critical point of the proposed method is the reduction of variance in the learning process. Experiments on three different environments show that the proposed method achieves state-of-the-art performance with high sample efficiency. We look forward to the proposed framework being launched online in various real-world RS products.


Appendix A Proof of Proposition 1

To start, we assume that the collection of interactions is sufficient. We use to denote all the observation trajectories in , which are generated by the user . For convenience, we make the following additional definitions.

Definition 1 ().
Definition 2 ().
Definition 3 ().

Directly from these definitions, we obtain the following equations.


We now present lemmas concerning the MEM and the CFA2. By minimizing the cross-entropy or mean square error, the output of the neural network approximates the expectation of its target, which gives Lemma A.1 and Lemma A.2.
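The fact that the squared-loss minimizer is the conditional expectation is standard; a brief derivation (in our own notation, with $Y$ the target and $f(X)$ the network output) runs as follows:

```latex
\mathbb{E}\!\left[(Y - f(X))^2 \mid X\right]
  = \mathrm{Var}(Y \mid X) + \big(f(X) - \mathbb{E}[Y \mid X]\big)^2 ,
```

so the pointwise minimizer is $f(X) = \mathbb{E}[Y \mid X]$, and the residual loss is exactly the target variance. An analogous argument with cross-entropy yields the conditional class probabilities.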

Lemma A.1 ().

By minimizing the target function Eq. 6, the masked environment model with the masking item approximates the following requirements: Given any from the user feedback data , and and , MEM is supposed to satisfy

Lemma A.2 ().

By minimizing the target function Eq. 10, the CFA2 satisfies the following property: Given any from the user feedback data , and ,

Combining Lemma A.1, Lemma A.2, Eq. 8 and Eq. 9, we can rewrite as follows:

Notice that if and , we have . Then, and differ only in that re-samples . We can apply Hypothesis 1 and directly substitute with in the second term, which finally gives the following equation.


Combining Eq. 15, Eq. 14, and Lemma A.1, we can further write


This completes the proof of Proposition 1.