1 Introduction
Recommendation systems have become a crucial part of almost all online service platforms. In a typical interaction between the system and its users, users are recommended a page of items and provide feedback, and then the system recommends a new page of items. A common way of building recommendation systems is to estimate a model which minimizes the discrepancy between the model prediction and the immediate user response according to some loss function. In other words, these models do not explicitly take into account the long-term user interest. However, a user’s interest can evolve over time based on what she observes, and the recommender’s action may significantly influence such evolution. In some sense, the recommender is guiding users’ interest by displaying particular items and hiding the rest. Thus, a recommendation strategy which takes users’ long-term interest into account is more favorable.
Reinforcement learning (RL) is a learning paradigm in which a policy is learned to guide actions in an environment so as to maximize the expected long-term reward. Although the RL framework has been successfully applied to many game settings, such as Atari (Mnih et al., 2015) and Go (Silver et al., 2016), it faces a few challenges in the recommendation system setting because the environment corresponds to the logged online user.
First, a user’s interest (reward function) driving her behavior is typically unknown, yet it is critically important for the use of RL algorithms. In existing RL algorithms for recommendation systems, the reward functions are manually designed (e.g. a fixed value for click/no-click), which may not reflect a user’s preference over different items (Zhao et al., 2018a; Zheng et al., 2018).
Second, model-free RL typically requires many interactions with the environment in order to learn a good policy. This is impractical in the recommendation system setting: an online user will quickly abandon the service if the recommendations look random and do not meet her interests. Thus, to avoid the large sample complexity of the model-free approach, a model-based RL approach is preferable. In a related but different setting where one wants to train a robot policy, recent works showed that model-based RL is much more sample efficient (Nagabandi et al., 2017; Deisenroth et al., 2015; Clavera et al., 2018). The advantage of model-based approaches is that a potentially large amount of off-policy data can be pooled and used to learn a good environment dynamics model, whereas model-free approaches can only use expensive on-policy data for learning. However, previous model-based approaches are typically designed based on physics or Gaussian processes, and are not tailored for complex sequences of user behaviors.
To address the above challenges, we propose a novel model-based RL framework for recommendation systems, where a user behavior model and the associated reward function are learned in a unified minimax framework, and then RL policies are learned using this model. Our main technical innovations are:

We develop a generative adversarial learning (GAN) formulation to model user behavior dynamics and recover her reward function. These two components are estimated simultaneously via a joint minimax optimization algorithm. The benefits of our formulation are: (i) a more predictive user model can be obtained, and the reward function is learned in a way consistent with the user model; (ii) the learned reward allows later reinforcement learning to be carried out in a more principled way, rather than relying on a manually designed reward; (iii) the learned user model allows us to perform model-based RL and online adaptation for new users to achieve better results.

Using this model as the simulation environment, we also develop a cascading DQN algorithm to obtain a combinatorial recommendation policy. The cascading design of the action-value function allows us to find the best subset of items to display from a large pool of candidates with time complexity only linear in the number of candidates.
In our experiments with real data, we show that this generative adversarial model is a better fit to user behavior in terms of held-out likelihood and click prediction. Based on the learned user model and reward, we show that the estimated recommendation policy leads to better cumulative long-term reward for the user. Furthermore, in the case of model mismatch, our model-based policy can also quickly adapt to the new dynamics with far fewer user interactions than model-free approaches.
2 Related Work
Commonly used recommendation algorithms typically use a simple user model. For instance, Wide&Deep networks (Cheng et al., 2016) and other methods such as XGBoost (Chen & Guestrin, 2016) and DFM (Guo et al., 2017) based on logistic regression essentially assume that a user chooses each item independently; collaborative competitive filtering (Yang et al., 2011) takes into account the context in which a user makes her choice but assumes that the user’s behaviors in each page view are independent. Session-based RNN (Hidasi et al., 2016) and session-based KNN (Jannach & Ludewig, 2017) improve upon previous approaches by modeling users’ history, but these models do not recover a user’s reward function and cannot be used subsequently for reinforcement learning. Bandit-based approaches, such as LinUCB (Li et al., 2010), can deal with adversarial user behaviors, but the reward is updated in a Bayesian framework and cannot be directly used by a reinforcement learning framework. Zhao et al. (2018b, a) and Zheng et al. (2018) used model-free RL for recommender systems, which may require many user interactions, and the reward function is manually designed. Model-based reinforcement learning has been commonly used in robotics applications and resulted in reduced sample complexity to obtain a good policy (Deisenroth et al., 2015; Nagabandi et al., 2017; Clavera et al., 2018). However, these approaches cannot be used in the recommendation setting, as a user behavior model typically consists of sequences of discrete choices under a complex session context.
3 Setting and RL Formulation
We will focus on a simple yet typical setting where the recommendation system and its user interact as follows: a user is shown a page of items and she provides feedback by clicking on one or none of these items, and then the system recommends a new page of items. Our model can be extended to settings with more complex page views and user interactions, but these settings are left for future studies.
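Since reinforcement learning can take into account long-term reward, it holds the promise to improve users’ long-term engagement with an online platform. In the RL framework, a recommendation system wants to find a policy $\pi(s^t, I^t)$ to choose a set of items $A^t$ based on user state $s^t$, such that the long-term expected reward to the user is maximized, i.e.
$$\pi^* = \arg\max_{\pi(s^t, I^t)} \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \, r(s^t, A^t, a^t)\Big], \quad (1)$$
where $s^0 \sim p^0$, $A^t \sim \pi(s^t, I^t)$, $s^{t+1} \sim P(\cdot \mid s^t, A^t)$ and $a^t \in A^t$. Several key aspects of this RL framework are as follows: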

Environment: corresponds to a logged online user who can click on one of the $k$ items displayed by the recommendation system in each page view (or interaction);

State $s^t$: corresponds to an ordered sequence of a user’s historical clicks;

Action $A^t$ of the recommender: corresponds to a subset of $k$ items chosen by the recommender from $I^t$ to display to the user. $\binom{I^t}{k}$ denotes the set of all subsets of $k$ items of $I^t$, and $I^t \subset I$ is the subset of items available to recommend at time $t$ among all items $I$;

State Transition $P(\cdot \mid s^t, A^t)$: corresponds to a user behavior model which returns the transition probability for $s^{t+1}$ given the previous state $s^t$ and the set of items $A^t$ displayed by the system. It is equivalent to the distribution $\phi(s^t, A^t)$ over a user’s actions, which is defined in our user model in section 4.1;

Reward Function $r(s^t, A^t, a^t)$: corresponds to a user’s utility or satisfaction after making her choice $a^t \in A^t$ in state $s^t$. Here we assume that the reward to the recommendation system is the same as the user’s utility. Thus, a recommendation algorithm which optimizes its long-term reward is designed to satisfy the user in the long run. One can also include the company’s benefit in the reward, but in this paper we will focus on users’ satisfaction;

Policy $\pi(s^t, I^t)$: corresponds to a recommendation strategy which takes a user’s state $s^t$ and returns the probability of displaying a subset $A^t$ of $I^t$.
Remark. We note that in the above mapping, Environment, State and State Transition are associated with the user, the Action and Policy are associated with the recommendation system, and the Reward Function is associated with both the recommendation system and the user. Here we use the notation $r(s^t, A^t, a^t)$ to emphasize the dependency of the reward on the recommendation action, as the user can only choose from the display set. However, the value of the reward is actually determined by the user’s state and the clicked item once the item occurs in the display set $A^t$. In fact, $r(s^t, A^t, a^t) = r(s^t, a^t) \cdot \mathbf{1}(a^t \in A^t)$. Thus, in section 4.1 where we discuss the user model, we simply denote $r(s^t, a^t) = r(s^t, A^t, a^t)$ and assume $a^t \in A^t$ is true. The overall RL framework for recommendation is illustrated in Figure 1.
Since neither the reward function nor the state transition model is provided, we need to learn them from data. Once these quantities are learned, the optimal policy in Eq. (1) can be estimated by repeatedly querying the model using algorithms such as Q-learning (Watkins, 1989). In the next two sections, we will explain our formulation for estimating the user behavior model as well as the reward function, and design an efficient algorithm for learning the RL recommendation policy.
4 Generative Adversarial User Model
In this section, we propose a model to imitate users’ sequential choices and discuss its parameterization and estimation. The formulation of our user model is inspired by imitation learning, which is a powerful tool for learning sequential decision-making policies from expert demonstrations (Abbeel & Ng, 2004; Ho et al., 2016; Ho & Ermon, 2016; Torabi et al., 2018). We will formulate a unified minimax optimization to learn the user behavior model and reward function simultaneously based on sample trajectories.
4.1 User Behavior As Reward Maximization
We model user behavior based on two realistic assumptions. (i) Users are not passive. Instead, when a user is shown a set of items, she will make a choice to maximize her own reward. The reward measures how much she will be satisfied with or interested in an item. Alternatively, the user can choose not to click on any items. Then she will receive the reward of not wasting time on boring items. (ii) The reward depends not only on the selected item but also on the user’s history. For example, a user may not be interested in Taylor Swift’s songs at the beginning, but once she happens to listen to one, she may like it and then become interested in her other songs. Also, a user can get bored after listening to Taylor Swift’s songs repeatedly. In other words, a user’s evaluation of the items varies in accordance with her personal experience.
To formalize the model, we consider both the clicked item and the state of the user as the inputs to the reward function $r(s^t, a^t)$, where the clicked item is the user’s action $a^t$ and the user’s history is captured in her state $s^t$ (non-click is treated as a special item/action). Suppose in session $t$, the user is presented with a set of $k$ items $A^t = \{a_1, \dots, a_k\}$ and their associated features $\{f_1^t, \dots, f_k^t\}$ by the recommendation system. She will take an action $a^t \in A^t$ according to a strategy $\phi^*$ which maximizes her expected reward. More specifically, this strategy is a probability distribution over the set of candidate actions $A^t$, which is the result of the following optimization problem
$$\phi^*(s^t, A^t) = \arg\max_{\phi \in \Delta^{k-1}} \; \mathbb{E}_{\phi}\big[r(s^t, a^t)\big] - \frac{1}{\eta} R(\phi), \quad (2)$$
where $\Delta^{k-1}$ is the probability simplex, $R(\phi)$ is a convex regularization function to encourage exploration, and $\eta$ controls the strength of the regularization.
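Model Interpretation. A widely used regularization is the negative Shannon entropy, with which we can obtain an interpretation of our user model from the perspective of the exploration-exploitation trade-off (see Appendix A for a proof).
Lemma 1.
Let the regularization term in Eq. (2) be $R(\phi) = \sum_{i=1}^{k} \phi_i \log \phi_i$ and let $\phi \in \Delta^{k-1}$ be allowed to be an arbitrary mapping. Then the optimal solution $\phi^*$ for the problem in Eq. (2) has a closed form
$$\phi^*(s^t, A^t)_i = \frac{\exp\big(\eta \, r(s^t, a_i)\big)}{\sum_{a_j \in A^t} \exp\big(\eta \, r(s^t, a_j)\big)}. \quad (3)$$
Furthermore, in each session $t$, the user’s decision according to her optimal policy $\phi^*$ is equivalent to the following discrete choice model, where $\varepsilon^t$ follows a Gumbel distribution:
$$a^t = \arg\max_{a \in A^t} \; \eta \, r(s^t, a) + \varepsilon^t. \quad (4)$$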
Essentially, this lemma makes it clear that the user greedily picks an item according to the reward function (exploitation), and yet the Gumbel noise allows the user to deviate and explore other less rewarding items. Similar models have also appeared in the econometric choice model (Manski, 1975; McFadden, 1973), but previous econometric models did not take into account diverse features and user state evolution. The regularization parameter $\eta$ is revealed to be an exploration-exploitation trade-off parameter. It can be easily seen that with a smaller $\eta$, the user is more exploratory. Thus, $\eta$ reveals a part of users’ character. In practice, we simply fix $\eta$ to a constant in our experiments, since it is implicitly learned in the reward $r$, which is a function of various features of a user.
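The equivalence stated in Lemma 1 can be checked numerically. The following is a minimal sketch, not the authors’ code: the reward values are hypothetical, `eta` stands for the regularization parameter, and the Gumbel noise is assumed to be standard Gumbel(0, 1) drawn independently for each candidate item. It compares the softmax choice probabilities with the empirical frequencies of the noisy argmax.

```python
import math
import random


def choice_probabilities(rewards, eta=1.0):
    """Softmax of eta * reward: the closed-form strategy under entropy regularization."""
    shift = max(eta * r for r in rewards)  # subtract the max for numerical stability
    weights = [math.exp(eta * r - shift) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def gumbel_choice(rewards, eta=1.0, rng=random):
    """Discrete choice: argmax of eta * reward plus independent Gumbel(0, 1) noise."""
    def gumbel():
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        return -math.log(-math.log(u))

    noisy = [eta * r + gumbel() for r in rewards]
    return max(range(len(rewards)), key=noisy.__getitem__)


def empirical_frequencies(rewards, eta=1.0, n_samples=20000, seed=0):
    """Fraction of simulated sessions in which each item is chosen."""
    rng = random.Random(seed)
    counts = [0] * len(rewards)
    for _ in range(n_samples):
        counts[gumbel_choice(rewards, eta, rng)] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    rewards = [1.0, 0.5, -0.2, 0.0]  # hypothetical rewards of four displayed items
    print(choice_probabilities(rewards))
    print(empirical_frequencies(rewards))
```

Calling `choice_probabilities` with a smaller `eta` flattens the distribution, which matches the reading of the parameter as an exploration-exploitation trade-off.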
Remark. (i) Other regularizations can also be used in our framework, which may induce different user behaviors. In these cases, the relations between $\phi^*$ and $r$ are also different, and may not appear in closed form. (ii) The case where the user does not click any items can be regarded as a special item which is always in the display set $A^t$. It can be defined as an item with a zero feature vector, or, alternatively, its reward value can be defined as a constant to be learned.
4.2 Model Parameterization
In this section, we will define the state $s^t$ as an embedding of the historical sequence of items clicked by the user before session $t$, and then we will define the reward function $r(s^t, a^t)$ based on the state and the embedding of the current action $a^t$.
First, we will define the state of the user as $s^t := h(F_*^{1:t-1})$ with $F_*^{1:t-1} := [f_*^1, \dots, f_*^{t-1}]$, where each $f_*^\tau \in \mathbb{R}^d$ is the feature vector of the clicked item at session $\tau$ and $h(\cdot)$ is an embedding function. One can also define a truncated $m$-step sequence as $F_*^{t-m:t-1} := [f_*^{t-m}, \dots, f_*^{t-1}]$. For the state embedding function $h(\cdot)$, we propose a simple and effective position weighting scheme. Let $W \in \mathbb{R}^{m \times n}$ be a matrix where the number of rows $m$ corresponds to a fixed number of historical steps, and each of the $n$ columns corresponds to one set of importance weights on positions. Then the embedding function $h$ can be designed as
$$s^t = h(F_*^{t-m:t-1}) := \mathrm{vec}\big[\sigma\big(F_*^{t-m:t-1} W + B\big)\big], \quad (5)$$
where $B \in \mathbb{R}^{d \times n}$ is a bias matrix, $\sigma(\cdot)$ is a nonlinear activation function such as ReLU or ELU, and $\mathrm{vec}[\cdot]$ turns the input matrix into a long vector by concatenating the matrix columns. Alternatively, one can also use an LSTM to capture the history. However, the advantage of the position weighting parameterization is that the history embedding is obtained by a shallow network, which is more efficient for forward computation and gradient backpropagation than an RNN.
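The position weighting scheme can be illustrated with a small sketch in plain Python. The layout is an assumption made for illustration: the history matrix `F` holds one clicked-item feature vector per column (shape d x m), `W` has one row per historical step (m x n), `B` is a d x n bias, and ReLU is used as the activation. The function and variable names are hypothetical.

```python
def relu(x):
    """Rectified linear unit applied to a scalar."""
    return x if x > 0.0 else 0.0


def position_weight_embedding(F, W, B):
    """Return vec(relu(F W + B)) as a flat list of length d * n.

    F: d x m matrix, column p is the feature vector of the p-th remembered click.
    W: m x n matrix, each column is one set of importance weights over positions.
    B: d x n bias matrix.
    """
    d, m, n = len(F), len(W), len(W[0])
    assert all(len(row) == m for row in F), "F must be d x m"
    assert len(B) == d and all(len(row) == n for row in B), "B must be d x n"
    H = [
        [relu(sum(F[i][p] * W[p][j] for p in range(m)) + B[i][j]) for j in range(n)]
        for i in range(d)
    ]
    # vec[.]: concatenate the columns of H into one long vector
    return [H[i][j] for j in range(n) for i in range(d)]


if __name__ == "__main__":
    F = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]   # d = 2 features, m = 3 past clicks
    W = [[0.0, 1.0], [0.0, 0.0], [1.0, -1.0]]  # column 0: most recent click only
    B = [[0.0, 0.0], [0.0, 0.0]]
    print(position_weight_embedding(F, W, B))
```

In this toy input the first weight column simply copies the most recent click, while the second contrasts the oldest and newest clicks; learned weights would mix positions more freely.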
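Next, we define the reward function and the user behavior model. A user’s choice $a^t \in A^t$ corresponds to an item with feature $f^t_{a^t}$. Thus we will use $f^t_{a^t}$ as the surrogate for $a^t$ and parameterize the reward function and user behavior model as
$$r(s^t, a^t) := v^\top \sigma\big(V [(s^t)^\top, (f^t_{a^t})^\top]^\top + b\big), \qquad \phi(s^t, A^t) \propto \exp\big(v'^\top \sigma\big(V' [(s^t)^\top, (f^t_{a^t})^\top]^\top + b'\big)\big), \quad (6)$$
where $V$ and $V'$ are weight matrices, $b$ and $b'$ are bias vectors, and $v$ and $v'$ are the final regression parameters. See Figure 2 for an illustration of the overall parameterization. For simplicity of notation, we will denote the set of all parameters in the reward function as $\theta$ and the set of all parameters in the user model as $\alpha$, and hence use the notation $r_\theta$ and $\phi_\alpha$ respectively.
4.3 Generative Adversarial Training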
In practice, both the user reward function $r$ and the behavior model $\phi$ are unknown and need to be estimated from the data. The behavior model $\phi$ tries to mimic the action sequences provided by a real user who acts to maximize her reward function $r$. In analogy to generative adversarial networks, (i) $\phi$ acts as a generator which generates the user’s next action based on her history, and (ii) $r$ acts as a discriminator which tries to differentiate the user’s actual actions from those generated by the behavior model $\phi$. Thus, inspired by the GAN framework, we estimate $\phi$ and $r$ simultaneously via a minimax formulation.
More precisely, given a trajectory of $T$ observed actions $\{a^1_{true}, \dots, a^T_{true}\}$ of a user and the corresponding clicked item features $\{f^1_*, \dots, f^T_*\}$, we learn the user behavior model and reward function jointly by solving the following minimax optimization
$$\min_{\theta} \max_{\alpha} \Big(\mathbb{E}_{\phi_\alpha}\Big[\sum_{t=1}^{T} r_\theta(s^t_{true}, a^t)\Big] - \frac{1}{\eta} R(\phi_\alpha)\Big) - \sum_{t=1}^{T} r_\theta(s^t_{true}, a^t_{true}), \quad (7)$$
where we use $s^t_{true}$ to emphasize that this is observed in the data. From the above optimization, one can see that the learned reward function $r_\theta$ will extract some statistics from both real user actions and model user actions, and try to magnify their difference (or make their negative gap larger). In contrast, the learned user behavior model will try to make the difference smaller, and hence become more similar to the real user behavior. Alternatively, the minimax optimization can also be interpreted as a game between an adversary and a learner, where the adversary tries to minimize the reward of the learner by adjusting $r_\theta$, while the learner tries to maximize its reward by adjusting $\phi_\alpha$ to counteract the adversarial moves. This gives the user behavior training process a large-margin training flavor, where we want to learn the best model even for the worst scenario.
For a general regularization function $R(\phi_\alpha)$, the minimax optimization problem in Eq. (7) does not have a closed form, and typically needs to be solved by alternately updating $\phi_\alpha$ and $r_\theta$, e.g.
$$\alpha \leftarrow \alpha + \gamma_1 \nabla_\alpha \mathbb{E}_{\phi_\alpha}\Big[\sum_{t=1}^{T} r_\theta\Big] - \frac{\gamma_1}{\eta} \nabla_\alpha R(\phi_\alpha); \qquad \theta \leftarrow \theta - \gamma_2 \, \mathbb{E}_{\phi_\alpha}\Big[\sum_{t=1}^{T} \nabla_\theta r_\theta\Big] + \gamma_2 \sum_{t=1}^{T} \nabla_\theta r_\theta(s^t_{true}, a^t_{true}). \quad (8)$$
The process may be unstable due to the non-convex nature of the problem. To stabilize the training process, we will leverage a special regularization for initializing the training process. More specifically, for entropy regularization, we can obtain a closed-form solution to the inner maximization for the user behavior model, which makes the learning of the reward function easy (see Lemma 2 and Appendix A for a proof). Once the reward function is learned for entropy regularization, it can be used to initialize the learning in the case of other regularization functions, which may induce different user behavior models and final rewards.
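To make the entropy-regularized case concrete, the sketch below works out what the objective looks like once the inner maximization is replaced by its softmax solution. It is derived here from the closed form under entropy regularization rather than quoted from the paper’s lemma: the optimal value of the inner problem at each step is a scaled log-sum-exp of the rewards of the displayed items, so the remaining minimization over the reward is a negative log-likelihood divided by the regularization parameter. The names `reward_table`, `clicked` and `eta` are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def entropy_regularized_loss(reward_table, clicked, eta=1.0):
    """Objective for the reward after solving the inner maximization in closed form.

    reward_table[t][i]: reward of the i-th item displayed at step t.
    clicked[t]: index of the item the real user actually clicked at step t.
    """
    loss = 0.0
    for rewards, c in zip(reward_table, clicked):
        inner_max = log_sum_exp([eta * r for r in rewards]) / eta
        loss += inner_max - rewards[c]
    return loss


def negative_log_likelihood(reward_table, clicked, eta=1.0):
    """Negative log-likelihood of the observed clicks under the softmax user model."""
    nll = 0.0
    for rewards, c in zip(reward_table, clicked):
        nll -= eta * rewards[c] - log_sum_exp([eta * r for r in rewards])
    return nll


if __name__ == "__main__":
    table = [[0.2, 1.0, -0.5], [0.0, 0.3, 0.1]]  # hypothetical rewards, two steps
    clicks = [1, 2]
    print(entropy_regularized_loss(table, clicks, eta=2.0))
    print(negative_log_likelihood(table, clicks, eta=2.0) / 2.0)
```

The two printed values agree up to rounding, which is why a reward learned this way is a convenient, stable initialization before switching to another regularizer.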
5 Cascading Q-networks for RL Recommendation Policy
Using the estimated user behavior model and the corresponding reward function as the simulation environment, we can then use reinforcement learning to obtain a recommendation policy. Note that the recommendation policy needs to deal with a combinatorial action space $\binom{I}{k}$, where each action is a subset of $k$ items chosen from a larger set $I$ of candidates. Two challenges associated with this problem are the potentially high computational complexity of the combinatorial action space and the development of a framework for estimating the long-term reward (the Q function) from a combination of items. Our contribution is designing a novel cascade of Q-networks to handle the combinatorial action space. We also design an algorithm to estimate this cascade of Q-networks from interaction with the environment.
5.1 Cascading Q-Networks
We assume that each time a user visits the online platform, the recommendation system needs to choose a subset $A$ of $k$ items from $I$. We will use the Q-learning framework where an optimal action-value function $Q^*(s, A)$ will be learned and satisfies $Q^*(s^t, A^t) = \mathbb{E}\big[r(s^t, A^t, a^t) + \gamma \max_{A' \subset I} Q^*(s^{t+1}, A')\big]$, $a^t \in A^t$. Once the action-value function is learned, an optimal policy for recommendation can be obtained as
$$\pi^*(s^t, I^t) = \arg\max_{A^t \subset I^t} Q^*(s^t, A^t), \quad (10)$$
where $I^t \subset I$ is the set of items available at time $t$. The challenge is that the action space contains $\binom{K}{k}$ many choices, which can be very large even for a moderate number of candidates $K$ (e.g. 1,000) and display size $k$ (e.g. 5). Furthermore, an item put in different combinations can have different probabilities of being clicked, which is indicated by the user model and is in line with reality. For instance, interesting items may compete with each other for a user’s attention. Thus, the policy in Eq. (10) will be very expensive to compute. To address this challenge, we will design not just one but a set of $k$ related Q-functions which will be used in a cascading fashion for finding the maximum in Eq. (10).
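Denote the recommender actions as $A = \{a_{1:k}\} \subset I$ and the optimal action as $A^* = \{a^*_{1:k}\} = \arg\max_{A} Q^*(s, A)$. Our cascading Q-networks are inspired by the key fact that:
$$\max_{a_{1:k}} Q^*(s, a_{1:k}) = \max_{a_1} \Big(\max_{a_{2:k}} Q^*(s, a_{1:k})\Big), \quad (11)$$
which also implies that there is a cascade of mutually consistent $Q^{1*}, Q^{2*}, \dots, Q^{k*}$ such that:
$$a_1^* = \arg\max_{a_1} \big\{ Q^{1*}(s, a_1) := \max_{a_{2:k}} Q^*(s, a_{1:k}) \big\},$$
$$a_2^* = \arg\max_{a_2} \big\{ Q^{2*}(s, a_1^*, a_2) := \max_{a_{3:k}} Q^*(s, a_1^*, a_{2:k}) \big\},$$
$$\cdots$$
$$a_k^* = \arg\max_{a_k} \big\{ Q^{k*}(s, a^*_{1:k-1}, a_k) := Q^*(s, a^*_{1:k-1}, a_k) \big\}.$$
Thus, we can obtain an optimal action in $O(k|I|)$ computations by applying these functions in a cascading manner. See Algorithm 1 and Figure 3 for a summary. However, this cascade of $Q^{j*}$ functions is usually not available and needs to be estimated from the data.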
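5.2 Parameterization and Estimation of Cascading Q-Networks
Each $Q^{j*}$ function is estimated by a neural network parameterized as
$$\hat{Q}^j(s, a_{1:j-1}, a_j; \Theta_j) = q_j^\top \sigma\big(L_j [s^\top, f_{a_1}^\top, \dots, f_{a_j}^\top]^\top + c_j\big), \quad \forall j, \quad (12)$$
where $L_j$, $c_j$ and $q_j$ are the set $\Theta_j$ of parameters, and we use the same embedding for the state $s$ as in Eq. (5). Now the problem left is how we can estimate these functions $\hat{Q}^j$. Note that the set of $\hat{Q}^j$ functions needs to satisfy a large set of constraints. At the optimal point, the value of $Q^{j*}$ is the same as $Q^*$ for all $j$, i.e.,
$$Q^{j*}(s, a_1^*, \dots, a_j^*) = Q^*(s, a_1^*, \dots, a_k^*), \quad \forall j = 1, \dots, k. \quad (13)$$
Since it may not be easy to strictly enforce these constraints, we take them into account in a soft and approximate way in our model fitting process as stated below.
Different from standard Q-learning, our cascading Q-learning process learns a set of $k$ parameterized functions $\hat{Q}^j(s^t, a_{1:j}; \Theta_j)$ as approximations of $Q^{j*}$. To enforce the constraints in Eq. (13) in a soft and approximate way, we can define the loss as
$$\big(y - \hat{Q}^j\big)^2, \quad \text{where } y = r(s^t, A^t, a^t) + \gamma \, \hat{Q}^k(s^{t+1}, a^*_{1:k}; \Theta_k), \quad \forall j. \quad (14)$$
That is, all $k$ networks are fitted against the same target $y$. Then the parameters $\Theta_j$ can be updated by performing gradient steps over the above loss. We observe in our experiments that the set of learned $\hat{Q}^j$ networks satisfies the constraints well, with a small error.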
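The idea of fitting every network in the cascade against one shared target can be shown with scalars standing in for the network outputs. This is only a toy sketch under that simplifying assumption: `q_estimates` holds one number per cascade level instead of a neural network, the target is held fixed during the update, and all names are hypothetical.

```python
def shared_target(reward, next_state_q_k, gamma):
    """Target shared by all cascade levels: reward plus discounted last-level value."""
    return reward + gamma * next_state_q_k


def cascade_losses(q_estimates, target):
    """Squared error of every cascade level against the same target."""
    return [(target - q) ** 2 for q in q_estimates]


def gradient_step(q_estimates, target, lr):
    """One gradient-descent step on each squared error, treating the target as constant.

    d/dq (target - q)^2 = -2 (target - q), so descending moves q toward the target.
    """
    return [q + lr * 2.0 * (target - q) for q in q_estimates]


if __name__ == "__main__":
    y = shared_target(reward=1.0, next_state_q_k=2.0, gamma=0.5)
    q = [0.0, 1.0, 3.0]  # hypothetical current outputs of a 3-level cascade
    for _ in range(50):
        q = gradient_step(q, y, lr=0.1)
    print(y, q, cascade_losses(q, y))
```

Because every level chases the same value, the levels end up agreeing with one another, which is the soft version of the consistency constraints.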
The overall cascading Q-learning algorithm is summarized in Algorithm 2 in Appendix B, where we employ the cascading Q functions to search for the optimal action efficiently. In addition, both experience replay (Mnih et al., 2013) and exploration techniques are applied.
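Algorithm 1
function argmax_Q($s$, $A$, $\Theta_1, \dots, \Theta_k$)
    Let $A^*$ be empty.
    $I = A \setminus s$ (remove clicked items).
    for $j = 1$ to $k$ do
        $a_j^* = \arg\max_{a_j \in I \setminus A^*} \hat{Q}^j(s, a^*_{1:j-1}, a_j; \Theta_j)$
        Update $A^* = A^* \cup \{a_j^*\}$
    end for
    return $A^* = (a_1^*, \dots, a_k^*)$
end function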
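The cascading search can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors’ implementation: each cascade level is passed in as a plain callable `q(state, prefix, item)`, the state is represented simply as a tuple of already clicked item ids, and the toy value function `full_q` with its item table is invented for the example. To show that the cascade recovers the true maximizer when the levels are exact, the demo builds the exact levels by brute force on a tiny item pool; in practice these levels would be the learned networks, and only the linear-time loop in `cascade_argmax` would run.

```python
from itertools import combinations


def cascade_argmax(state, candidates, q_functions):
    """Pick len(q_functions) items greedily, one cascade level per slot."""
    chosen = []
    remaining = [c for c in candidates if c not in state]  # remove clicked items
    for q in q_functions:
        best = max(remaining, key=lambda item: q(state, tuple(chosen), item))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical item pool: id -> (score, category).
ITEMS = {"a": (1.0, "news"), "b": (0.9, "news"), "c": (0.6, "sport"),
         "d": (0.5, "music"), "e": (0.2, "sport")}


def full_q(state, subset):
    """Toy set value: summed scores minus a penalty for repeated categories."""
    score = sum(ITEMS[i][0] for i in subset)
    categories = [ITEMS[i][1] for i in subset]
    return score - 0.7 * (len(categories) - len(set(categories)))


def make_exact_cascade(k):
    """Exact level j: best achievable value given the first j items (brute force)."""
    def make(j):
        def q_j(state, prefix, item):
            fixed = prefix + (item,)
            pool = [i for i in ITEMS if i not in fixed and i not in state]
            return max(full_q(state, fixed + rest) for rest in combinations(pool, k - j))
        return q_j
    return [make(j) for j in range(1, k + 1)]


if __name__ == "__main__":
    state = ("e",)  # item "e" was already clicked
    picked = cascade_argmax(state, list(ITEMS), make_exact_cascade(3))
    print(picked, full_q(state, tuple(picked)))
```

With learned levels the loop evaluates each level once per remaining candidate, so the cost grows linearly in the number of candidates instead of with the number of subsets.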
6 Experiments
We conduct three sets of experiments to evaluate our generative adversarial user model (called the GAN user model) and the resulting RL recommendation policy. Our experiments are designed to investigate the following questions: (1) Can the GAN user model lead to better user behavior prediction? (2) Can the GAN user model lead to higher user reward and click rate? (3) Can the GAN user model help reduce the sample complexity of reinforcement learning?
6.1 Dataset and Feature Description
We experimented with 6 real-world datasets: (1) The Ant Financial News dataset contains click records from 50,000 users for one month, involving tens of thousands of news articles. On average each display set contains 5 news articles. It also contains user-item cross features which are widely used in this online platform; (2) MovieLens contains a large number of movie ratings, from which we randomly sample 1,000 active users. Each display set is simulated by collecting 39 movies released near the time the movie is rated. Movie features are collected from IMDB. Categorical and descriptive features are encoded as sparse and dense vectors respectively; (3) Last.fm contains listening records from 359,347 users. Each display set is simulated by collecting 9 songs with the nearest timestamp; (4) Yelp contains users’ reviews of various businesses. Each display set is simulated by collecting 9 businesses with the nearest location; (5) RecSys15 contains click-streams that sometimes end with purchase events; (6) Taobao contains the clicking and buying records of users over 22 days. We consider the buying records as positive events. (More details are given in Appendix C.)
6.2 Predictive Performance of User Model
To assess the predictive accuracy of the GAN user model with position weight (GAN-PW) and LSTM (GAN-LSTM), we choose a series of widely used or state-of-the-art methods as baselines, including: (1) W&D-LR (Cheng et al., 2016), a wide & deep model with a logistic regression loss function; (2) CCF (Yang et al., 2011), an advanced collaborative filtering model which takes into account the context information in the loss function; we further augment it with a wide & deep feature layer (W&D-CCF); (3) IKNN (Hidasi et al., 2015), one of the most popular item-to-item solutions, which calculates item similarity according to the number of co-occurrences in sessions; (4) S-RNN (Hidasi et al., 2016), a session-based RNN model with a pairwise ranking loss; (5) SCKNNC (Jannach & Ludewig, 2017), a strong method which unifies session-based RNN and KNN by cascading combination; (6) XGBOOST (Chen & Guestrin, 2016), a parallel tree boosting method; (7) DFM (Guo et al., 2017), a deep neural factorization machine based on wide & deep features.
Top-$k$ precision (Prec@$k$) is employed as the evaluation metric. It is the proportion of top-$k$ ranked items at each page view that are actually clicked by the user, averaged across test page views and users. Users are randomly divided into train (50%), validation (12.5%) and test (37.5%) subsets 3 times. The results are reported in Table 1, which shows that the GAN model performs significantly better than the baseline models. Moreover, GAN-PW performs nearly as well as GAN-LSTM, but it is more efficient to train. Thus we use GAN-PW for later experiments and simply refer to it as GAN.
Table 1. Prec@1 and Prec@2 of the compared user models.
(1) Ant Financial news dataset  (2) MovieLens dataset  (3) LastFM
Model  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2
IKNN  20.6(0.2)  32.1(0.2)  38.8(1.9)  40.3(1.9)  20.4(0.6)  32.5(1.4)
S-RNN  32.2(0.9)  40.3(0.6)  39.3(2.7)  42.9(3.6)  9.4(1.6)  17.4(0.9)
SCKNNC  34.6(0.7)  43.2(0.8)  49.4(1.9)  51.8(2.3)  21.4(0.5)  26.1(1.0)
XGBOOST  41.9(0.1)  65.4(0.2)  66.7(1.1)  76.0(0.9)  10.2(2.6)  19.2(3.1)
DFM  41.7(0.1)  64.2(0.2)  63.3(0.4)  75.9(0.3)  10.5(0.4)  20.4(0.1)
W&D-LR  37.5(0.2)  60.9(0.1)  61.5(0.7)  73.8(1.2)  7.6(2.9)  16.6(3.3)
W&D-CCF  37.7(0.1)  61.1(0.1)  65.7(0.8)  75.2(1.1)  15.4(2.4)  25.7(2.6)
GAN-PW  41.9(0.1)  65.8(0.1)  66.6(0.7)  75.4(1.3)  24.1(0.8)  34.9(0.7)
GAN-LSTM  42.1(0.2)  65.9(0.2)  67.4(0.5)  76.3(1.2)  24.0(0.9)  34.9(0.8)
(4) Yelp  (5) Taobao  (6) RecSys15: YooChoose
Model  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2
IKNN  57.7(1.8)  73.5(1.8)  32.8(2.6)  46.6(2.6)  39.3(1.5)  69.8(2.1)
S-RNN  67.8(1.4)  73.2(0.9)  32.7(1.7)  47.0(1.4)  41.8(1.2)  69.9(1.9)
SCKNNC  60.3(4.5)  71.6(1.8)  35.7(0.4)  47.9(2.1)  40.8(2.5)  70.4(3.8)
XGBOOST  64.1(2.1)  79.6(2.4)  30.2(2.5)  51.3(2.6)  60.8(0.4)  80.3(0.4)
DFM  72.1(2.1)  80.3(2.1)  30.1(0.8)  48.5(1.1)  61.3(0.3)  82.5(1.5)
W&D-LR  62.7(0.8)  86.0(0.9)  34.0(1.1)  54.6(1.5)  51.9(0.8)  75.8(1.5)
W&D-CCF  73.2(1.8)  88.1(2.2)  34.9(1.1)  53.3(1.3)  52.1(0.5)  76.3(1.5)
GAN-PW  72.0(0.2)  92.5(0.5)  34.7(0.6)  54.1(0.7)  52.9(0.7)  75.7(1.4)
GAN-LSTM  73.0(0.2)  88.7(0.4)  35.9(0.6)  55.0(0.7)  52.7(0.3)  75.9(1.2)
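The top-k precision metric used in Table 1 can be sketched as below. Two assumptions are made here and are not taken from the source: since a user clicks at most one item per page view and the reported Prec@2 values exceed 50%, the metric is read as the fraction of page views whose clicked item appears among the top-k ranked items; and page views are pooled before averaging instead of being averaged per user first. All names are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring items (ties broken by lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def precision_at_k(page_views, k):
    """Fraction of page views whose clicked item is ranked within the top k.

    page_views: list of (scores, clicked_index) pairs, where scores[i] is the
    model's score for the i-th displayed item.
    """
    hits = sum(1 for scores, clicked in page_views if clicked in top_k(scores, k))
    return hits / len(page_views)


if __name__ == "__main__":
    views = [
        ([0.9, 0.1, 0.3], 0),  # clicked item is ranked first
        ([0.2, 0.5, 0.4], 2),  # clicked item is ranked second
        ([0.1, 0.2, 0.7], 0),  # clicked item is ranked last
    ]
    print(precision_at_k(views, 1), precision_at_k(views, 2))
```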
We also tested different types of regularization (Table 2). In general, Shannon entropy performs well and it is also favored for its closed-form solution. However, on the Yelp dataset, we find that a different regularization leads to a better user model. It is noteworthy that the user model with this alternative regularization is trained with the Shannon entropy initialization scheme proposed in section 4.3.
Table 2. GAN-LSTM with Shannon entropy (SE) regularization versus the alternative regularization on Yelp.
Split 1  Split 2  Split 3
Model  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2  prec(%)@1  prec(%)@2
GAN-LSTM-SE  73.1  88.8  72.8  89.0  73.1  88.2
GAN-LSTM (alternative regularization)  73.5  89.0  78.8  91.5  76.1  91.1
Another interesting result on MovieLens is shown in Figure 4 (see Appendix D.1 for similar figures). The blue curve represents a user’s actual choices over time. The orange curves are trajectories predicted by GAN and W&D-CCF. Each data point represents a time step and the category of the clicked item. The upper subfigure shows that GAN performs much better as time goes by, while the items predicted by W&D-CCF in the lower subfigure are concentrated on several categories. This indicates a drawback of static models: they fail to capture the evolution of a user’s interests.
6.3 Recommendation Policies Generated from User Models
With a learned user model, we can immediately derive a greedy policy to recommend the $k$ items with the highest estimated likelihood. We will compare the strongest baseline methods W&D-LR and W&D-CCF with GAN-Greedy in this setting. Furthermore, we will learn an RL policy using the cascading Q-networks from section 5 (GAN-CDQN). We will compare it with two RL methods: a cascading Q-network trained with a fixed click-based reward (GAN-RWD1), and an additive Q-network policy (He et al., 2016), $Q(s^t, a_{1:k}) := \sum_{j=1}^{k} Q(s^t, a_j)$, trained with the learned reward (GAN-GDQN).
Since we cannot perform online experiments at this moment, we use collected data from the online news platform to fit a user model, and then use it as a test environment. To make the experimental results trustworthy, we fit the test model based on a randomly sampled test set of 1,000 users and keep this set isolated. The RL policies are learned from another set of 2,500 users that does not overlap with the test set. The performances are evaluated by two metrics: (1) Cumulative reward: for each recommendation action, we can observe a user’s behavior and compute her reward using the test model. Note that we never use the reward of test users when we train the RL policy. The numbers shown in Table 3 are the cumulative rewards averaged over the time horizon first and then averaged over all users. It can be formulated as $\frac{1}{N} \sum_{u=1}^{N} \frac{1}{T} \sum_{t=1}^{T} r_u^t$, where $r_u^t$ is the reward received by user $u$ at time $t$. (2) CTR (click-through rate): the ratio of the number of clicks to the number of steps the policy is run. The values displayed in Table 3 are also averaged over the 1,000 test users.
Table 3. Reward and CTR of the recommendation policies; the three reward/CTR column pairs correspond to the three page-view sizes.
model  reward  CTR  reward  CTR  reward  CTR
W&D-LR  11.82(0.38)  0.38(0.012)  14.46(0.42)  0.46(0.013)  15.18(0.38)  0.48(0.011)
W&D-CCF  17.15(1.16)  0.53(0.034)  19.93(1.09)  0.62(0.031)  20.94(1.03)  0.65(0.029)
GAN-Greedy  19.17(1.20)  0.58(0.042)  21.37(1.24)  0.67(0.038)  22.97(1.22)  0.71(0.034)
GAN-RWD1  22.37(0.87)  0.68(0.035)  22.17(1.07)  0.68(0.031)  25.15(1.04)  0.78(0.029)
GAN-GDQN  21.88(0.92)  0.66(0.037)  23.60(1.06)  0.72(0.034)  23.19(1.17)  0.70(0.033)
GAN-CDQN  22.76(0.90)  0.69(0.037)  24.05(0.98)  0.74(0.032)  25.36(1.10)  0.77(0.031)
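The two evaluation metrics reported in Table 3 amount to nested averages. The sketch below shows that arithmetic for hypothetical logs, assuming one list of per-step rewards and one list of 0/1 click indicators per test user; the function names and input layout are illustrative only.

```python
def average_cumulative_reward(rewards_per_user):
    """Average each user's rewards over time, then average over users."""
    per_user = [sum(rewards) / len(rewards) for rewards in rewards_per_user]
    return sum(per_user) / len(per_user)


def click_through_rate(clicks_per_user):
    """Per-user ratio of clicks to steps, averaged over users."""
    per_user = [sum(clicks) / len(clicks) for clicks in clicks_per_user]
    return sum(per_user) / len(per_user)


if __name__ == "__main__":
    rewards = [[1.0, 3.0], [2.0, 2.0, 5.0]]  # two users, different horizons
    clicks = [[1, 0, 1, 1], [0, 0, 1, 0]]    # 1 = click, 0 = no click
    print(average_cumulative_reward(rewards), click_through_rate(clicks))
```

Averaging over time before averaging over users keeps users with long sessions from dominating the reported numbers.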
Three sets of experiments with different numbers of items in each page view are conducted and the results are summarized in Table 3. Since users’ behaviors are not deterministic, each policy is evaluated 50 times on the test users. The results show that: (1) The greedy policy built on the GAN model is significantly better than the policies built on other models. (2) The RL policy learned from GAN is better than the greedy policy. (3) Although GAN-CDQN is trained to optimize the cumulative reward, the recommendation policy also achieves a higher CTR than GAN-RWD1, which directly optimizes a click-based reward, in two of the three settings. The learning of GAN-CDQN may have benefited from the well-known reward shaping effects of the learned continuous reward (Mataric, 1994; Ng et al., 1999; Matignon et al., 2006). (4) While the computational cost of GAN-CDQN is about the same as that of GAN-GDQN (both are linear in the total number of items), our proposed GAN-CDQN is a more flexible parameterization and achieved better results, especially when $k$ is larger.
Since Table 3 only shows average values taken over test users, we also compare the policies at the user level; the results are shown in Figure 5. The GAN-CDQN policy results in higher averaged cumulative reward for most users. A similar figure which compares the CTR is deferred to Appendix D. Figure 6 shows that the learned cascading Q-networks satisfy the constraints in Eq. (13) well for the display-set size plotted.
6.4 User Model Assisted Policy Adaptation
The results in sections 6.2 and 6.3 have demonstrated that GAN is a better user model and that an RL policy based on it can achieve higher CTR compared to other user models, but this user model may be misspecified. In this section, we show that our GAN model can help an RL policy quickly adapt to a new user. The RL policy assisted by the GAN user model is compared with other policies that are learned from and adapted to online users: (1) CDQN with GAN: cascading Q-networks which are first trained using the GAN user model learned from other users and then adapted online to a new user using MAML (Finn et al., 2017). (2) CDQN model free: cascading Q-networks without pre-training by the GAN model, which interact with and adapt to online users directly. (3) LinUCB: a classic contextual bandit algorithm which assumes adversarial user behavior. We compare against its stronger version, LinUCB with hybrid linear models (Li et al., 2010).
The experiment setting is similar to that of Section 6.3. All policies are evaluated on a set of 1,000 test users associated with a test model. Three sets of results corresponding to different sizes of the display set are plotted in Figure 7, which shows how the CTR increases as each policy interacts with and adapts to users over time. The users’ cumulative reward under the different policies follows a similar pattern, and the corresponding figure is deferred to Appendix D.3.
The CDQN policy pretrained on a GAN user model quickly achieves a high CTR even when it is applied to a new set of users (Figure 7). Without the user model, CDQN can also adapt to the users during its interaction with them. However, it takes around 1,000 iterations (i.e., 100,000 interactive data points) to achieve performance similar to that of the CDQN policy assisted by the GAN user model. LinUCB (hybrid) also captures users’ interests during its interaction with users, but it likewise requires too many interactions. Appendix D.3 includes a further figure that compares the cumulative reward received by the user instead of the CTR. Generally speaking, the GAN user model provides a dynamic environment for RL policies to interact with, which helps the policy reach a more satisfactory state before it is applied to online users.
7 Conclusion and Future Work
We proposed a novel model-based reinforcement learning framework for recommendation systems, in which we developed a GAN formulation to model the user’s behavior dynamics and her associated reward function. Using this user model as the simulation environment, we developed a novel cascading Q-network for combinatorial recommendation policies which can handle a large number of candidate items efficiently. Although the experiments show clear benefits of our method in an offline and realistic simulation setting, even stronger results could be obtained via future online A/B testing.
References
 Abbeel & Ng (2004) Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
 Chen & Guestrin (2016) Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, 2016.

 Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016.
 Clavera et al. (2018) Ignasi Clavera, Anusha Nagabandi, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt: Meta-learning for model-based control. arXiv preprint arXiv:1803.11347, 2018.
 Deisenroth et al. (2015) M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015. ISSN 0162-8828.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.
 He et al. (2016) Ji He, Mari Ostendorf, Xiaodong He, Jianshu Chen, Jianfeng Gao, Lihong Li, and Li Deng. Deep reinforcement learning with a combinatorial action space for predicting popular Reddit threads. In EMNLP, 2016.
 Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.

 Hidasi et al. (2016) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. In ICLR, 2016.
 Ho & Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, 2016.
 Ho et al. (2016) Jonathan Ho, Jayesh K. Gupta, and Stefano Ermon. Model-free imitation learning with policy optimization. In ICML, 2016.
 Jannach & Ludewig (2017) Dietmar Jannach and Malte Ludewig. When recurrent neural networks meet the neighborhood for session-based recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 306–310. ACM, 2017.
 Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.
 Manski (1975) Charles F. Manski. Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics, pp. 205–228, 1975. ISSN 0304-4076.
 Mataric (1994) Maja J. Mataric. Reward functions for accelerated learning. In ICML, 1994.
 Matignon et al. (2006) Laëtitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Reward function and initial values: Better choices for accelerated goal-directed reinforcement learning. In ICANN, 2006.

 McFadden (1973) D. McFadden. Conditional logit analysis of qualitative choice behaviour. In P. Zarembka (ed.), Frontiers in Econometrics, pp. 105–142. Academic Press New York, 1973.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Nagabandi et al. (2017) Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017.
 Ng et al. (1999) Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, 1999.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Torabi et al. (2018) Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In IJCAI, 2018.
 Watkins (1989) C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, May 1989. (To be reprinted by MIT Press.)
 Yang et al. (2011) Shuang-Hong Yang, Bo Long, Alexander J Smola, Hongyuan Zha, and Zhaohui Zheng. Collaborative competitive filtering: learning recommender using context of user choice. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 295–304. ACM, 2011.
 Zhao et al. (2018a) Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. Deep reinforcement learning for page-wise recommendations. 2018a.
 Zhao et al. (2018b) Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang. Deep reinforcement learning for list-wise recommendations. CoRR, 2018b.
 Zheng et al. (2018) Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. DRN: A deep reinforcement learning framework for news recommendation. 2018.
Appendix A Lemma
A.1 Proof of Lemma 1
See Lemma 1.
Proof.
First, recall the problem defined in Eq. (2):
Denote . Since can be an arbitrary mapping (i.e., is not limited in a specific parameter space), can be an arbitrary vector in . Recall the notation
. Then the expectation taken over random variable
can be written as (15)
By simple computation, the optimal vector which maximizes Eq. (15) is
(16) 
which is equivalent to Eq. (2). Next, we show the equivalence of Eq. (16) to the discrete choice model interpreted by Eq. (4).
The cumulative distribution function for the Gumbel distribution is
and the probability density is . Using the definition of the Gumbel distribution, the probability of the event where is defined in Eq. (4) is
Suppose we know the random variable . Then we can compute the choice probability conditioned on this information. Let and be the conditional probability; then we have
In fact, we only know the density of
. Hence, using Bayes’ theorem, we can express
as
Now, let us look at the product itself.
Hence
where .
Next, we make a change of variable . The Jacobian of the inverse transform is . Since , the absolute value of the Jacobian is . Therefore,
∎
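The equivalence argued in this proof can also be checked numerically. The Python sketch below is an illustration only and not part of the original derivation. It assumes that the closed form in Eq. (16) is the softmax of the rewards scaled by a regularization weight (called eta here), that the regularizer is the negative Shannon entropy, and that the noise is standard Gumbel; all function names are hypothetical.

```python
import math
import random


def softmax(scores, eta=1.0):
    """Softmax of eta * scores, computed in a numerically stable way."""
    scaled = [eta * s for s in scores]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    """Draw a standard Gumbel sample via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_frequencies(scores, eta=1.0, n_samples=100_000, seed=0):
    """Empirical choice frequencies of argmax_i (eta * score_i + Gumbel noise)."""
    rng = random.Random(seed)
    counts = [0] * len(scores)
    for _ in range(n_samples):
        noisy = [eta * s + sample_gumbel(rng) for s in scores]
        counts[noisy.index(max(noisy))] += 1
    return [c / n_samples for c in counts]


def regularized_objective(phi, scores, eta=1.0):
    """Expected reward plus entropy / eta (assumed form of the objective)."""
    expected_reward = sum(p * s for p, s in zip(phi, scores))
    neg_entropy = sum(p * math.log(p) for p in phi if p > 0.0)
    return expected_reward - neg_entropy / eta
```

Under these assumptions, the frequencies returned by `gumbel_max_frequencies` should approach `softmax(scores, eta)` as the number of samples grows, and the softmax vector should attain an objective value of log(sum_i exp(eta * r_i)) / eta, which no other distribution on the simplex exceeds.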
A.2 Proof of Lemma 2
See Lemma 2.
Proof.
This lemma is a straightforward consequence of Lemma 1. First, recall the problem defined in Eq. (7):
We make an assumption that there is no repeated pair in Eq. (7). This is a very mild assumption because is updated over time, and is in fact representing its feature vector , which is in space . With this assumption, we can let map each pair to the optimal vector which maximizes since there is no repeated pair. Using Eq. (16), we have
Eq. (7) can then be written as
which is the negative log-likelihood function and is equivalent to Lemma 2. ∎
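To make the negative log-likelihood in this proof concrete, here is a minimal Python sketch. It assumes that the choice probabilities take the softmax form associated with Eq. (16); the session format, the scaling parameter eta, and the function names are illustrative assumptions rather than the paper's implementation.

```python
import math


def log_softmax(rewards, eta=1.0):
    """Log choice probabilities under an assumed softmax choice model."""
    scaled = [eta * r for r in rewards]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def negative_log_likelihood(sessions, eta=1.0):
    """Sum of -log P(clicked item | display set) over all sessions.

    Each session is a pair (rewards, clicked_index), where rewards holds the
    (hypothetical) reward of every displayed item and clicked_index is the
    position of the item the user actually chose.
    """
    total = 0.0
    for rewards, clicked_index in sessions:
        total -= log_softmax(rewards, eta)[clicked_index]
    return total
```

For example, a single session with four equally rewarded items contributes log 4 to the loss, and the loss shrinks as the reward of the clicked item grows relative to the others.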
Appendix B Algorithm box
The following algorithm learns the cascading deep Q-networks. We employ the cascading functions to search for the optimal action efficiently (line 9). In addition, both experience replay (Mnih et al., 2013) and exploration techniques are applied. The system’s experiences at each time step are stored in a replay memory set (line 11), and a minibatch of data is then sampled from the replay memory for the update (lines 13 and 14). Exploration of the action space is executed with some probability (line 8).
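The main ingredients described above (cascading action search, exploration, and experience replay) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the Q-functions are arbitrary callables with a hypothetical signature `q(state, chosen_items, candidate)`, exploration picks a uniformly random set of items, and the Q-network update itself is omitted.

```python
import random
from collections import deque


def cascading_argmax(q_functions, state, candidates):
    """Select one item per stage; each stage conditions on earlier choices.

    Stage j scores every remaining candidate with the j-th Q-function and
    keeps the best one, so each stage is a single pass over the candidates.
    """
    chosen = []
    remaining = list(candidates)
    for q in q_functions:
        best = max(remaining, key=lambda item: q(state, tuple(chosen), item))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def epsilon_greedy(q_functions, state, candidates, epsilon, rng):
    """With probability epsilon explore at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.sample(list(candidates), len(q_functions))
    return cascading_argmax(q_functions, state, candidates)


class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng):
        """Draw a minibatch without replacement (smaller if buffer is short)."""
        items = list(self.buffer)
        return rng.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self.buffer)
```

In a training loop, each step would call `epsilon_greedy` to choose the displayed items, store the resulting transition with `store`, and then draw a minibatch with `sample` to update the Q-functions.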
Appendix C Dataset description
(1) MovieLens public dataset (https://grouplens.org/datasets/movielens/) contains a large number of movie ratings collected from the MovieLens website. We randomly sample 1,000 active users from this dataset. On average, each of these active users rated more than 500 movies (including short films), so we assume they rated almost every movie that they watched and thus equate their rating behavior with watching behavior. The MovieLens dataset is the most suitable public dataset for our experiments, but it is still not perfect. In fact, none of the public datasets provides the context in which a user’s choice is made. Thus, we simulate this missing information in a reasonable way. For each movie watched (rated) on a given date, we collect a list of movies released within a month before that day. On average, movies run for about four weeks in theaters. Even though we do not know the actual context of the user’s choice, at least the user decided to watch the rated movie instead of the other movies in theaters. In addition, we cap the size of each display set at 40. Features: In the MovieLens dataset, only the titles and IDs of the movies are given, so we collect detailed movie information from the Internet Movie Database (IMDB). Categorical features are encoded as sparse vectors and descriptive features are encoded as dense vectors. Combining these two types of vectors produces 722-dimensional raw feature vectors. To further reduce dimensionality, we use logistic regression to fit a wide & deep network (Cheng et al., 2016) and use the learned input and hidden layers to reduce the features to 10 dimensions.
(2) An online news article recommendation dataset from Ant Financial is anonymously collected from the Ant Financial online news article platform. It consists of 50,000 users’ click and impression logs for one month, involving tens of thousands of news articles. It is a time-stamped dataset which contains user features, news article features and the context in which the user clicks the articles. The size of the display set is not fixed, since a user can browse the news article platform as she likes. On average a display set contains 5 news articles, but its size varies from 2 to 10. Features: The raw news article features have a dimension of approximately 100 million because they summarize the keywords in the article. Clearly, it is too expensive to use these raw features in practice. The features we use in the experiments are 20-dimensional dense vector embeddings produced from the raw features by wide & deep networks. The reduced 20-dimensional features are widely used on this online platform and have proven effective in practice.
(3) Last.fm (https://www.last.fm/api) contains listening records from 359,347 users. Each display set is simulated by collecting the 9 songs with the nearest timestamps.
(4) Yelp (https://www.yelp.com/dataset/) contains users’ reviews of various businesses. Each display set is simulated by collecting the 9 businesses with the nearest locations.
(5) RecSys15 (https://2015.recsyschallenge.com/) contains clickstreams that sometimes end with purchase events.
(6) Taobao (https://tianchi.aliyun.com/datalab) contains the clicking and buying behavior of users over 22 days. We consider the buying behaviors as positive events.
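The display-set simulation described for MovieLens in item (1) can be sketched in Python as follows. This is a rough illustration, not the authors' preprocessing code: it assumes that "a month" means 30 days, that the watched movie is always placed in its own display set, and that when more than 40 movies qualify the most recently released ones are kept. The function name and data layout are hypothetical.

```python
from datetime import date, timedelta


def build_display_set(watched_id, watch_date, release_dates,
                      window_days=30, max_size=40):
    """Simulate the set of movies a user could choose from on watch_date.

    release_dates maps movie id -> release date. Movies released within
    window_days before watch_date are treated as being in theaters.
    """
    window_start = watch_date - timedelta(days=window_days)
    in_theaters = [
        (released, movie_id)
        for movie_id, released in release_dates.items()
        if window_start <= released <= watch_date and movie_id != watched_id
    ]
    in_theaters.sort(reverse=True)  # newest releases first (assumed rule)
    others = [movie_id for _, movie_id in in_theaters[:max_size - 1]]
    return [watched_id] + others
```

The same pattern would apply to the Last.fm and Yelp simulations by replacing the release-date window with a nearest-timestamp or nearest-location criterion.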
Appendix D More figures for experimental results
D.1 Figures for Section 6.2
An interesting comparison is shown in Figure 4, and more similar figures are provided here. The blue curve is the trajectory of a user’s actual choices of movies over time. The orange curves are simulated trajectories predicted by GAN and CCF, respectively. Consistent with our conclusion in Section 6.2, these figures show that the GAN user model performs well at capturing the evolution of users’ interest.
D.2 Figures for Section 6.3
Figure 5 demonstrates the policy performance at the user level by comparing the cumulative reward. Here we attach the figure which compares the click rate. In each subfigure, the red curve represents the GANDQN policy and the blue curve represents the other policy. The GANDQN policy yields a higher average click rate for most users.
D.3 Figures for Section 6.4
This figure shows three sets of results corresponding to different sizes of the display set. It reveals how users’ cumulative reward (averaged over 1,000 users) increases as each policy interacts with and adapts to the 1,000 users over time. It can easily be seen that the CDQN policy pretrained on a GAN user model adapts to online users much faster than the model-free policies and can reduce the risk of losing the user at the beginning. The experiment setting is similar to that of Section 6.3. All policies are evaluated on a separate set of 1,000 users associated with a test model. We emphasize that the GAN model which assists the CDQN policy is learned from a training set of users that does not overlap with the test users. It is different from the test model, which fits the 1,000 test users.