1. Introduction
Recommender systems are intelligent e-commerce applications. They assist users in their information-seeking tasks by suggesting items (products, services, or information) that best fit their needs and preferences. Recommender systems have become increasingly popular in recent years, and have been utilized in a variety of domains including movies, music, books, search queries, and social tags (Resnick and Varian, 1997; Ricci et al., 2011). Most existing recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed greedy strategy. However, these approaches may fail given the dynamic nature of users’ preferences. Furthermore, the majority of existing recommender systems are designed to maximize the immediate (short-term) reward of recommendations, i.e., to make users order the recommended items, while completely overlooking whether these recommended items will lead to more likely or more profitable (long-term) rewards in the future (Shani et al., 2005).
In this paper, we consider the recommendation procedure as sequential interactions between users and a recommender agent, and leverage Reinforcement Learning (RL) to automatically learn the optimal recommendation strategies. Recommender systems based on reinforcement learning have two advantages. First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy, i.e., the one that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is obtained by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify items with small immediate rewards but large contributions to the rewards of future recommendations.
Efforts have been made on utilizing reinforcement learning for recommender systems, such as POMDP (Shani et al., 2005) and Q-learning (Taghipour and Kardan, 2008). However, these methods may become inflexible as the number of items for recommendation increases, which prevents them from being adopted by practical recommender systems. Thus, we leverage Deep Reinforcement Learning (Lillicrap et al., 2015) with (adapted) artificial neural networks as nonlinear approximators to estimate the action-value function in RL. This model-free reinforcement learning method neither estimates the transition probabilities nor stores the Q-value table, which makes it flexible enough to support a huge number of items in recommender systems.
1.1. Listwise Recommendations
Users in practical recommender systems are typically recommended a list of items at one time. Listwise recommendations are more desirable in practice since they allow the system to provide diverse and complementary options to its users. For listwise recommendations, we have a listwise action space, where each action is a set of multiple interdependent sub-actions (items). Existing reinforcement learning recommendation methods can also recommend a list of items. For example, DQN (Mnih et al., 2013) can calculate the Q-values of all recalled items separately and recommend the list of items with the highest Q-values. However, these approaches recommend items based on one and the same state and ignore the relationships among the recommended items. As a consequence, the recommended items are similar to one another. In practice, a bundle of complementary items may receive a higher reward than a list of all-similar items. For instance, in real-time news feed recommendation, a user may want to read about diverse topics of interest, so an action (i.e., a recommendation) from the recommender agent would consist of a set of news articles that are not all similar in topic (Yue and Guestrin, 2011). Therefore, in this paper, we propose a principled approach to capture the relationships among recommended items and generate a list of complementary items to enhance the performance.
1.2. Architecture Selection
Generally, there exist two deep Q-learning architectures, shown in Fig. 1(a) and (b). Traditional deep Q-learning adopts the first architecture, shown in Fig. 1(a), which takes only the state as input and outputs the Q-values of all actions. This architecture is suitable for scenarios with a large state space and a small action space, like playing Atari (Mnih et al., 2013). However, it cannot handle scenarios with a large and dynamic action space, like recommender systems. The second Q-learning architecture, shown in Fig. 1(b), takes both the state and an action as the input of a neural network and outputs the Q-value corresponding to that action. This architecture does not need to store each Q-value in memory and thus can deal with a large or even continuous action space. A challenge of leveraging the second architecture is its time complexity, i.e., it must compute the Q-value for every potential action separately. To tackle this problem, our recommending policy builds upon the Actor-Critic framework (Sutton and Barto, 1998), shown in Fig. 1(c). The Actor takes the current state as input and outputs the parameters of a state-specific scoring function. The RA then scores all items and selects the item with the highest score. Next, the Critic uses an approximation architecture to learn a value function (Q-value), which is a judgment of whether the selected action matches the current state. Note that the Critic shares the same architecture as the DQN in Fig. 1(b). Finally, according to the judgment from the Critic, the Actor updates its policy parameters in the direction of improving recommendation performance, so as to output better actions in the following iterations. This architecture is suitable for a large action space and, at the same time, reduces redundant computation.
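To make the data flow of the Actor-Critic architecture in Fig. 1(c) concrete, here is a minimal numeric sketch. All functions and parameter values are illustrative placeholders (simple linear maps), not the paper's actual networks:

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def actor(state, weight_matrix):
    """Actor: map the current state to a state-specific scoring vector."""
    return [dot(row, state) for row in weight_matrix]

def select_item(scoring_vector, item_embeddings):
    """Score every candidate once and return the index of the best item."""
    scores = [dot(scoring_vector, e) for e in item_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

def critic(state, action_embedding, v):
    """Critic: judge how well the chosen action matches the state (a Q-value)."""
    return dot(v, state + action_embedding)

state = [1.0, 0.0]                      # hypothetical state embedding
W = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical actor parameters
items = [[0.2, 0.9], [0.9, 0.1]]        # hypothetical item embeddings
w = actor(state, W)                     # one actor forward pass ...
best = select_item(w, items)            # ... then one cheap score per item,
                                        # instead of one Q-network pass per item
q = critic(state, items[best], v=[0.5, 0.5, 0.5, 0.5])
```

The design point is visible in the last three lines: architecture (b) would need a full network evaluation per candidate item, while the Actor-Critic variant needs one actor pass plus a cheap scoring sweep, with the Critic evaluated only on the chosen action.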
1.3. Online Environment Simulator
Unlike deep Q-learning methods applied to online games like Atari, which can take an arbitrary action and obtain timely feedback/reward, a recommender system's online reward is hard to obtain before the system is deployed online. In practice, it is necessary to pre-train the parameters offline and evaluate the model before applying it online; thus, how to train our framework and evaluate its performance offline is a challenging task. To tackle this challenge, we propose an online environment simulator, which takes the current state and a selected action as input and outputs a simulated online reward, enabling the framework to train its parameters offline on the simulated rewards. More specifically, we build the simulator from users’ historical records. The intuition is that, no matter what algorithm a recommender system adopts, given the same state (a user’s historical records) and the same action (recommending the same items to the user), the user will give the same feedback on the items.
To evaluate the performance of a recommender system before applying it online, a practical way is to test it on users’ historical clicking/ordering records. However, we only have ground-truth feedback (rewards) for the items that appear in users’ historical records, which are sparse compared with the enormous item space of a modern recommender system. Thus we cannot get the feedback (rewards) for items that are not in users’ historical records, which may result in inconsistency between offline and online measurements. Our proposed online environment simulator also mitigates this challenge by producing a simulated online reward for any state-action pair, so that the recommender system can rate items from the whole item space. Based on offline training and evaluation, the well-trained parameters can be used as the initial parameters when we launch our framework online, where they can be further updated and improved via on-policy exploitation and exploration.
1.4. Our Contributions
We summarize our major contributions as follows:

We build an online user-agent interaction environment simulator, which is suitable for offline parameter pre-training and evaluation before applying a recommender system online;

We propose a LIstwise Recommendation framework based on Deep reinforcement learning (LIRD), which can be applied in scenarios with large and dynamic item spaces and can reduce redundant computation significantly; and

We demonstrate the effectiveness of the proposed framework on a real-world e-commerce dataset and validate the importance of listwise recommendation for accurate recommendations.
The rest of this paper is organized as follows. In Section 2, we first formally define the problem of recommendation via reinforcement learning. Then, we model the recommending procedure as sequential user-agent interactions and introduce details of employing the Actor-Critic framework to automatically learn the optimal recommendation strategies via an online simulator. Section 3 carries out experiments on data from a real-world e-commerce site and presents the experimental results. Section 4 briefly reviews related work. Finally, Section 5 concludes this paper and discusses future work.
2. The Proposed Framework
In this section, we first formally define the notations and the problem of recommendation via reinforcement learning. Then we build an online user-agent interaction environment simulator. Next, we propose an Actor-Critic-based reinforcement learning framework under this setting. Finally, we discuss how to train the framework via users’ behavior logs and how to utilize the framework for listwise recommendations.
2.1. Problem Statement
We study the recommendation task in which a recommender agent (RA) interacts with the environment (or users) by sequentially choosing recommendation items over a sequence of time steps, so as to maximize its cumulative reward. We model this problem as a Markov Decision Process (MDP), which includes a sequence of states, actions, and rewards. More formally, the MDP consists of a tuple of five elements (S, A, P, R, γ) as follows:

State space S: A state s_t ∈ S is defined as the browsing history of a user, i.e., the previous N items that the user browsed before time t. The items in s_t are sorted in chronological order.

Action space A: An action a_t ∈ A is to recommend a list of K items to a user at time t based on the current state s_t, where K is the number of items the RA recommends to the user each time.

Reward R: After the recommender agent takes an action a_t at the state s_t, i.e., recommends a list of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or order these items, and the agent receives an immediate reward r(s_t, a_t) according to the user’s feedback.

Transition probability P: The transition probability p(s_{t+1} | s_t, a_t) defines the probability of the state transitioning from s_t to s_{t+1} when the RA takes action a_t. We assume that the MDP satisfies the Markov property, i.e., p(s_{t+1} | s_t, a_t, …, s_1, a_1) = p(s_{t+1} | s_t, a_t). If the user skips all the recommended items, then the next state s_{t+1} = s_t; if the user clicks/orders part of the items, then the next state is updated. More details will be given in the following subsections.

Discount factor γ: γ ∈ [0, 1] defines the discount factor when we measure the present value of future rewards. In particular, when γ = 0, the RA considers only the immediate reward; when γ = 1, all future rewards are counted fully into that of the current action.
In practice, using only discrete indexes to denote items is not sufficient, since we cannot learn the relations between different items from indexes alone. One common way is to represent items with extra information, for instance attribute information like brand, price, monthly sales, etc. Instead of extra item information, in this paper we use the user-agent interaction information, i.e., users’ browsing history. We treat each item as a word and the clicked items in one recommendation session as a sentence. Then we can obtain dense and low-dimensional vector representations for items via word embedding (Levy and Goldberg, 2014). Figure 2 illustrates the agent-user interactions in the MDP. By interacting with the environment (users), the recommender agent takes actions (recommends items) in such a way as to maximize the expected return, which includes the delayed rewards. We follow the standard assumption that delayed rewards are discounted by a factor of γ per time step.
With the notations and definitions above, the problem of listwise item recommendation can be formally defined as follows: given the historical MDP, i.e., (S, A, P, R, γ), the goal is to find a recommendation policy π: S → A that maximizes the cumulative reward for the recommender system.
2.2. Online User-Agent Interaction Environment Simulator
To tackle the challenge of training our framework and evaluating its performance offline, in this subsection we propose an online user-agent interaction environment simulator. In the online recommendation procedure, given the current state s_t, the RA recommends a list of items a_t to a user; the user browses these items and provides her feedback, i.e., skips/clicks/orders part of the recommended items, and the RA receives an immediate reward r_t according to this feedback. To simulate the aforementioned online interaction procedure, the task of the simulator is to predict a reward based on the current state and a selected action, i.e., to predict r_t given the pair (s_t, a_t).
According to collaborative filtering techniques, users with similar interests make similar decisions on the same item. With this intuition, we match the current state-action pair to existing historical state-action pairs and stochastically generate a simulated reward. To be more specific, we first build a memory M = {m_1, m_2, …} to store users’ historical browsing history, where each m_i is a user-agent interaction triple (s_i, a_i, r_i). The procedure to build the online simulator memory is illustrated in Algorithm 1. Given a historical recommendation session, we first observe the initial state from the previous sessions (line 2). Each time we observe K items in temporal order (line 3), i.e., in each iteration we move a window forward by K items. We observe the current state (line 4), the current K items (line 5), and the user’s feedback on these items (line 6). Then we store the triple (s_i, a_i, r_i) in the memory (line 7). Finally, we update the state (lines 8–13) and move on to the next K items. Since we keep a fixed-length state, each time the user clicks/orders some items in the recommended list, we add these items to the end of the state and remove the same number of items from the top of the state. For example, suppose the RA recommends a list of five items {a_1, …, a_5} to a user whose current state is s_t = {i_1, …, i_5}; if the user clicks, say, a_2 and orders a_5, then the updated state is s_{t+1} = {i_3, i_4, i_5, a_2, a_5}.
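The sliding-window state update above can be sketched as follows; the item names and the feedback encoding (0 = skip, 1 = click, 5 = order) are illustrative:

```python
def update_state(state, recommended, feedback):
    """Sliding-window state update: append clicked/ordered items, keep length fixed.

    feedback[k] > 0 means the user clicked or ordered recommended[k]
    (a simplified stand-in for the skip/click/order signal in Algorithm 1).
    """
    positive = [item for item, r in zip(recommended, feedback) if r > 0]
    new_state = state + positive
    return new_state[len(positive):]   # drop the same number from the front

state = ["i1", "i2", "i3", "i4", "i5"]
recommended = ["a1", "a2", "a3", "a4", "a5"]
feedback = [0, 1, 0, 0, 5]             # clicked a2, ordered a5
print(update_state(state, recommended, feedback))
# → ['i3', 'i4', 'i5', 'a2', 'a5']
```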
Then we calculate the similarity of the current state-action pair p_t = (s_t, a_t) to each existing historical state-action pair m_i = (s_i, a_i) in the memory. In this work, we adopt cosine similarity:
(1)  Sim(p_t, m_i) = α (s_t · s_i)/(‖s_t‖‖s_i‖) + (1 − α) (a_t · a_i)/(‖a_t‖‖a_i‖)
where the first term measures the state similarity and the second term evaluates the action similarity, and the parameter α controls the balance between the two similarities. Intuitively, the higher the similarity between p_t and m_i, the higher the chance of mapping p_t to the reward r_i. Thus the probability of mapping p_t to r_i can be defined as follows:
(2)  P(p_t → r_i) = Sim(p_t, m_i) / Σ_j Sim(p_t, m_j)
so that we can map the current state-action pair to a reward according to the above probability. The major challenge of this mapping is its computational complexity: we must compute the pairwise similarity between p_t and every m_i. To tackle this challenge, we first group users’ historical records according to their rewards. Note that the number of reward permutations is typically limited. For example, if the RA recommends two items each time and the reward for a skipped/clicked/ordered item is 0/1/5, then there are 9 permutations of the two items’ rewards, i.e., {(0,0), (0,1), (0,5), (1,0), (1,1), (1,5), (5,0), (5,1), (5,5)}, which is much smaller than the total number of historical records. The probability of mapping p_t to a reward list r_x can then be computed as follows:
(3)  P(p_t → r_x) = N_x · Sim(p_t, (s̄_x, ā_x)) / Σ_y N_y · Sim(p_t, (s̄_y, ā_y))
where r_x is a reward list containing the user’s feedback on the recommended items, for instance r_x = (0, 1); N_x is the size of the group U_x of users’ historical records whose reward list equals r_x; and s̄_x and ā_x are the average state vector and average action vector of U_x. The simulator only needs to precompute N_x, s̄_x and ā_x, and can then map p_t to a reward list according to the probability in Eq. (3). In practice, the RA updates N_x, s̄_x and ā_x every 1,000 episodes. Since the number of groups is much smaller than the total number of historical records, Eq. (3) maps p_t to a reward list efficiently.
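A small sketch of the similarity-based reward mapping, covering the pairwise similarity, the reward-mapping probability, and its grouped approximation; the memory contents and the α value are illustrative:

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def pair_similarity(s_t, a_t, s_i, a_i, alpha=0.5):
    """Eq. (1): alpha-weighted mix of state and action cosine similarity."""
    return alpha * cosine(s_t, s_i) + (1 - alpha) * cosine(a_t, a_i)

def reward_probabilities(s_t, a_t, memory, alpha=0.5):
    """Eq. (2): normalize pairwise similarities into a distribution over rewards."""
    sims = [pair_similarity(s_t, a_t, s_i, a_i, alpha) for s_i, a_i, _ in memory]
    total = sum(sims)
    return [s / total for s in sims]

def group_probabilities(s_t, a_t, groups, alpha=0.5):
    """Eq. (3): the grouped version -- similarity to each group's average
    state/action vectors, weighted by group size N_x, then normalized."""
    weights = [n * pair_similarity(s_t, a_t, s_bar, a_bar, alpha)
               for n, s_bar, a_bar, _reward_list in groups]
    total = sum(weights)
    return [w / total for w in weights]
```

The grouped version iterates over the handful of reward permutations instead of the whole memory, which is where the efficiency gain comes from.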
In practice, the reward is usually a number rather than a vector. Thus, if p_t is mapped to a reward list r_x, we calculate the overall reward of the whole recommended list as follows:
(4)  r_t = Σ_{k=1}^{K} Γ^{k−1} r_t^k
where k is the position of an item in the recommended list, K is the length of the recommended list, and Γ ∈ (0, 1). The intuition of Eq. (4) is that rewards near the top of the recommended list contribute more to the overall reward, which forces the RA to arrange items that the user is likely to order at the top of the recommended list.
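The position-discounted overall reward can be sketched as follows; gamma_pos is a hypothetical position discount in (0, 1):

```python
def list_reward(rewards, gamma_pos=0.9):
    """Position-discounted sum of per-item rewards, as in Eq. (4):
    items near the top of the list contribute more to the overall reward."""
    return sum(gamma_pos ** k * r for k, r in enumerate(rewards))
```

For example, an order (reward 5) at the top of a three-item list yields a higher overall reward than the same order at the bottom: `list_reward([5, 0, 1]) > list_reward([1, 0, 5])`.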
2.3. The Actor Framework
In this subsection, we propose the listwise item recommending procedure, which consists of two steps: 1) state-specific scoring function parameter generation, and 2) action generation. Current practical recommender systems rely on a scoring or rating system that is averaged across all users, ignoring the specific demands of each user. These approaches perform poorly in tasks where there is large variation in users’ interests. To tackle this problem, we present a state-specific scoring function, which rates items according to the user’s current state.
In the previous section, we defined the state as the whole browsing history, which can be extremely long and inefficient to process. A better way is to consider only the positive items, e.g., the previous N clicked/ordered items. A good recommender system should recommend the items that users prefer the most, and the positive items capture the key information about users’ preferences, i.e., which items the users prefer. Thus, we only consider them for the state-specific scoring function.
Our state-specific scoring function parameter generation step maps the current state s_t to a list of K weight vectors as follows:
(5)  {w_t^1, w_t^2, …, w_t^K} = f_{θ^π}(s_t)
where f_{θ^π} is a function parametrized by θ^π, mapping from the state space to the weight representation space. Here we choose a deep neural network as the parameter-generating function.
Next we present the action-generating step based on the aforementioned scoring function parameters. Without loss of generality, we assume that the scoring function parameter w_t^k and the embedding e_i of item i from the item space are linearly related as:
(6)  score_i^k = w_t^k e_i^T
Note that it is straightforward to extend this to nonlinear relations. After computing the scores of all items, the RA selects the item with the highest score as the k-th sub-action of the action a_t. We present the listwise item recommendation procedure in Algorithm 2.
The Actor first generates a list of weight vectors (line 1). For each weight vector, the RA scores all items in the item space (line 3), selects the item with the highest score (line 4), and adds this item to the end of the recommendation list. Finally, the RA removes this item from the item space, which prevents the same item from appearing in the recommendation list more than once.
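The select-and-remove loop of Algorithm 2 can be sketched as follows; the linear scoring and the embeddings are illustrative:

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def generate_action(weight_vectors, item_embeddings):
    """For each of the K weight vectors: score all remaining items (Eq. (6)-style
    linear scoring), pick the best one, and remove it so it cannot recur."""
    pool = dict(enumerate(item_embeddings))   # remaining candidate items
    action = []
    for w in weight_vectors:
        best = max(pool, key=lambda i: dot(w, pool[i]))
        action.append(best)
        del pool[best]                        # never recommend the same item twice
    return action
```

With two identical weight vectors, the second pick falls to the runner-up item because the top item has already been removed from the pool.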
2.4. The Critic Framework
The Critic is designed to leverage an approximator to learn an action-value function Q(s_t, a_t), which is a judgment of whether the action a_t generated by the Actor matches the current state s_t. Then, according to Q(s_t, a_t), the Actor updates its parameters in the direction of improving performance, so as to generate better actions in the following iterations. Many applications in reinforcement learning make use of the optimal action-value function Q*(s_t, a_t). It is the maximum expected return achievable by the optimal policy, and it should follow the Bellman equation (Bellman, 2013):
(7)  Q*(s_t, a_t) = E_{s_{t+1}}[ r_t + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) | s_t, a_t ]
In practice, selecting an optimal a_{t+1} requires |A| evaluations for the inner operation max_{a_{t+1}}. This prevents Eq. (7) from being adopted in practical recommender systems with enormous action spaces. However, the Actor architecture proposed in Section 2.3 outputs a deterministic action for the Critic, which avoids the aforementioned cost of |A| evaluations, as follows:
(8)  Q(s_t, a_t) = E_{s_{t+1}}[ r_t + γ Q(s_{t+1}, f_{θ^π}(s_{t+1})) | s_t, a_t ]
where the Q-value function Q(s_t, a_t) is the expected return given the state s_t and the action a_t.
In real recommender systems, the state and action spaces are enormous, so estimating the action-value function for each state-action pair is infeasible. In addition, many state-action pairs may never appear in the real trace, making it hard to update their values. Therefore, it is more flexible and practical to use an approximator function to estimate the action-value function, i.e., Q(s, a; θ^μ) ≈ Q*(s, a). In practice, the action-value function is usually highly nonlinear, and deep neural networks are known to be excellent approximators for nonlinear functions. In this paper, we refer to a neural network function approximator with parameters θ^μ as a deep Q-network (DQN). A DQN can be trained by minimizing a sequence of loss functions:

(9)  L(θ^μ) = E_{s_t, a_t, r_t, s_{t+1}}[ (y_t − Q(s_t, a_t; θ^μ))² ]

where y_t = r_t + γ Q(s_{t+1}, f_{θ^π′}(s_{t+1}); θ^μ′) is the target for the current iteration. The parameters θ^μ′ from the previous iteration are fixed when optimizing the loss function L(θ^μ). In practice, it is often computationally efficient to optimize the loss function by stochastic gradient descent, rather than computing the full expectation above.
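The target and loss computation can be sketched numerically as follows; the toy actor/critic callables are placeholders for the actual networks:

```python
def td_target(r, s_next, actor_target, critic_target, gamma=0.9):
    """y = r + gamma * Q'(s', f'(s')): the target uses the deterministic action
    from the target Actor, so no max over the action space is needed."""
    return r + gamma * critic_target(s_next, actor_target(s_next))

def td_loss(batch, actor_target, critic, critic_target, gamma=0.9):
    """Mean squared TD error over a minibatch of (s, a, r, s') transitions,
    in the spirit of the DQN loss in Eq. (9)."""
    errors = [(td_target(r, s2, actor_target, critic_target, gamma) - critic(s, a)) ** 2
              for s, a, r, s2 in batch]
    return sum(errors) / len(errors)
```

In a real implementation the gradient of this loss is taken only with respect to the online critic's parameters; the target networks are held fixed, exactly as the fixed-parameter condition above describes.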
2.5. The Training Procedure
An illustration of the proposed user-agent online interaction simulator and the deep reinforcement learning recommendation framework LIRD is given in Figure 3. Next, we discuss the parameter training procedure. In this work, we utilize the DDPG algorithm (Lillicrap et al., 2015) to train the parameters of the proposed framework. The training algorithm for the proposed framework LIRD is presented in Algorithm 3.
In each iteration, there are two stages: 1) the transition-generating stage (lines 8–20) and 2) the parameter-updating stage (lines 21–28). In the transition-generating stage (line 8): given the current state s_t, the RA first recommends a list of items a_t according to Algorithm 2 (line 9); then the agent observes the reward r_t from the simulator (line 10) and updates the state to s_{t+1} (lines 11–17) following the same strategy as in Algorithm 1; finally, the recommender agent stores the transition (s_t, a_t, r_t, s_{t+1}) into the replay memory (line 19) and sets s_t = s_{t+1} (line 20). In the parameter-updating stage: the recommender agent samples a minibatch of transitions from the replay memory (line 22), and then updates the parameters of the Actor and the Critic (lines 23–28) following a standard DDPG procedure (Lillicrap et al., 2015).
In the algorithm, we introduce widely used techniques to train our framework. For example, we utilize a technique known as experience replay (Lin, 1993) (lines 3, 22), and introduce separate evaluation and target networks (Mnih et al., 2013) (lines 2, 23), which help smooth the learning and avoid divergence of the parameters. For the soft updates of the target networks (lines 27, 28), we use a small update rate τ. Moreover, we leverage a prioritized sampling strategy (Moore and Atkeson, 1993) to help the framework learn from the most important historical transitions.
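The soft target update mentioned above can be sketched as follows; the τ value is illustrative (small values such as 0.001 are common in DDPG-style training):

```python
def soft_update(target_params, online_params, tau=0.001):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target.

    The target network drifts slowly toward the online network, which
    smooths learning and helps avoid divergence of the parameters.
    """
    return [tau * p + (1 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```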
2.6. The Testing Procedure
After the training stage, the RA obtains well-trained parameters θ^π and θ^μ. Then we can test the framework in the simulator environment. The model testing also follows Algorithm 3, i.e., the parameters are continuously updated during the testing stage; the major difference from the training stage is that before each recommendation session, we reset the parameters back to θ^π and θ^μ for the sake of a fair comparison between sessions. We can artificially control the length of the recommendation sessions to study short-term and long-term performance.
3. Experiments
In this section, we conduct extensive experiments with a dataset from a real ecommerce site to evaluate the effectiveness of the proposed framework. We mainly focus on two questions: (1) how the proposed framework performs compared to representative baselines; and (2) how the listwise strategy contributes to the performance. We first introduce experimental settings. Then we seek answers to the above two questions. Finally, we study the impact of important parameters on the performance of the proposed framework.
3.1. Experimental Settings
We evaluate our method on a dataset from July 2017, collected from a real e-commerce site. We randomly collect 100,000 recommendation sessions (1,156,675 items) in temporal order, and use the first 70% of the sessions as the training set and the last 30% as the testing set. For a given session, the initial state is collected from the user's previous sessions. In this paper, we leverage previously clicked/ordered items as the positive state. Each time, the RA recommends a list of K items to the user. The rewards of skipped/clicked/ordered items are empirically set as 0, 1, and 5, respectively. The dimension of the item embeddings is 50, and we set the discount factor γ empirically. For the other parameters of the proposed framework, such as α and K, we select them via cross-validation. Correspondingly, we also do parameter tuning for the baselines for a fair comparison. We will discuss more details about parameter selection for the proposed framework in the following subsections.
To evaluate the performance of the proposed framework, we select MAP (Turpin and Scholer, 2006) and NDCG (Järvelin and Kekäläinen, 2002) as the metrics. The difference between our setting and traditional learning-to-rank methods is that we rank clicked and ordered items together and assign them different rewards, rather than ranking only clicked items as in learning-to-rank problems.
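A minimal sketch of NDCG over graded rewards (click = 1, order = 5) as used in this evaluation; standard NDCG applies a log2 position discount:

```python
import math

def dcg(rewards):
    """Discounted cumulative gain over graded rewards in ranked order."""
    return sum(r / math.log2(k + 2) for k, r in enumerate(rewards))

def ndcg(rewards):
    """NDCG: DCG normalized by the DCG of the ideal (descending) ordering,
    so a list that already ranks the order above the click scores 1.0."""
    ideal = dcg(sorted(rewards, reverse=True))
    return dcg(rewards) / ideal if ideal > 0 else 0.0
```

Ranking both clicked and ordered items with graded rewards, rather than binary relevance, is what distinguishes this setup from plain learning-to-rank evaluation.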
3.2. Performance Comparison for Item Recommendations
To answer the first question, we compare the proposed framework with the following representative baseline methods:

CF: Collaborative filtering (Breese et al., 1998) makes automatic predictions about the interests of a user by collecting preference information from many users; it is based on the hypothesis that people often get the best recommendations from someone with tastes similar to their own.

FM: Factorization Machines (Rendle, 2010) combine the advantages of support vector machines with factorization models. Compared with matrix factorization, higher-order interactions can be modeled using the dimensionality parameter.

DNN: We choose a deep neural network trained with backpropagation as a baseline to recommend items in a given session. The input of the DNN is the embeddings of the user's historically clicked/ordered items, and the DNN is trained to output the next recommended item.

RNN: This baseline utilizes a basic RNN to predict what a user will buy next based on the clicking/ordering history. To minimize the computation cost, it only keeps a finite number of the latest states.
3.3. Performance of Listwise Recommendations
To validate the effectiveness of the listwise recommendation strategy, we investigate how the proposed framework LIRD performs as the length of the recommendation list, K, changes in long-term sessions, while fixing the other parameters. Note that K = 1 corresponds to item-wise recommendation.
3.4. Performance of Simulator
The online simulator has one key parameter, α, which controls the trade-off between state similarity and action similarity in the simulator; see Eq. (3). To study the impact of this parameter, we investigate how the proposed framework LIRD performs as α changes in long-term sessions, while fixing the other parameters.
4. Related Work
In this section, we briefly review works related to our study. In general, the related work can be mainly grouped into the following categories.
The first category related to this paper is traditional recommendation techniques. Recommender systems assist users by supplying a list of items that might interest them. Efforts have been made on offering meaningful recommendations to users. Collaborative filtering (Linden et al., 2003) is the most successful and most widely used technique; it is based on the hypothesis that people often get the best recommendations from someone with tastes similar to their own (Breese et al., 1998). Another common approach is content-based filtering (Mooney and Roy, 2000), which tries to recommend items with properties similar to those a user ordered in the past. Knowledge-based systems (Akerkar and Sajja, 2010) recommend items based on specific domain knowledge about how certain item features meet users’ needs and preferences and how an item is useful for the user. Hybrid recommender systems combine two or more of the above-mentioned types of techniques (Burke, 2002). Another topic closely related to this category is deep learning based recommender systems, which can effectively capture nonlinear and nontrivial user-item relationships and enable the codification of more complex abstractions as data representations in the higher layers (Zhang et al., 2017). For instance, Nguyen et al. (Nguyen et al., 2017) proposed a personalized tag recommender system based on CNNs. It utilizes convolutional and max-pooling layers to extract visual features from patches of images. Wu et al. (Wu et al., 2016) designed a session-based recommendation model for a real-world e-commerce website. It utilizes a basic RNN to predict what a user will buy next based on the click history. This method helps balance the trade-off between computation cost and prediction accuracy.

The second category is reinforcement learning for recommendations, which differs from traditional item recommendation. In this paper, we consider the recommending procedure as sequential interactions between users and a recommender agent, and leverage reinforcement learning to automatically learn the optimal recommendation strategies. Indeed, reinforcement learning has been widely examined in the recommendation field. The MDP-based CF model in Shani et al. (Shani et al., 2005) can be viewed as approximating a partially observable MDP (POMDP) by using a finite rather than unbounded window of past history to define the current state. To reduce the high computational and representational complexity of POMDPs, three strategies have been developed: value function approximation (Hauskrecht, 1997), policy-based optimization (Ng and Jordan, 2000; Poupart and Boutilier, 2005), and stochastic sampling (Kearns et al., 2002). Furthermore, Mahmood et al. (Mahmood and Ricci, 2009) adopted reinforcement learning to observe the responses of users in a conversational recommender, with the aim of maximizing a numerical cumulative reward function modeling the benefit users get from each recommendation session. Taghipour et al. (Taghipour et al., 2007; Taghipour and Kardan, 2008) modeled web page recommendation as a Q-learning problem and learned to make recommendations from web usage data as the actions, rather than discovering explicit patterns from the data. The system inherits the intrinsic characteristic of reinforcement learning of being in a constant learning process.
Sunehag et al. (Sunehag et al., 2015) introduced agents that successfully address sequential decision problems with high-dimensional combinatorial slate-action spaces.
5. Conclusion
In this paper, we propose a novel framework LIRD, which models a recommendation session as a Markov Decision Process and leverages deep reinforcement learning to automatically learn the optimal recommendation strategies. Reinforcement learning based recommender systems have two advantages: (1) they can continuously update their strategies during the interactions, and (2) they are able to learn a strategy that maximizes the long-term cumulative reward from users. Different from previous work, we propose a listwise recommendation framework, which can be applied in scenarios with large and dynamic item spaces and can reduce redundant computation significantly. Notably, we design an online user-agent interaction environment simulator, which is suitable for offline parameter pre-training and evaluation before applying a recommender system online. We evaluate our framework with extensive experiments based on data from a real e-commerce site. The results show that (1) our framework can improve recommendation performance, and (2) the listwise strategy outperforms item-wise strategies.
There are several interesting research directions. First, in addition to the positional order of items used in this work, we would like to investigate other orders, such as temporal order. Second, we would like to validate our framework with more agent-user interaction patterns, e.g., adding items to the shopping cart, and investigate how to model them mathematically for recommendations. Finally, the framework proposed in this work is quite general, and we would like to investigate more of its applications, especially those with both positive and negative (skip) signals.
References
 Akerkar and Sajja (2010) Rajendra Akerkar and Priti Sajja. 2010. Knowledge-based systems. Jones & Bartlett Publishers.
 Bellman (2013) Richard Bellman. 2013. Dynamic programming. Courier Corporation.
 Breese et al. (1998) John S Breese, David Heckerman, and Carl Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 43–52.
 Burke (2002) Robin Burke. 2002. Hybrid recommender systems: Survey and experiments. User modeling and user-adapted interaction 12, 4 (2002), 331–370.
 Hauskrecht (1997) Milos Hauskrecht. 1997. Incremental methods for computing bounds in partially observable Markov decision processes. In AAAI/IAAI. 734–739.
 Järvelin and Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
 Kearns et al. (2002) Michael Kearns, Yishay Mansour, and Andrew Y Ng. 2002. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine learning 49, 2 (2002), 193–208.
 Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems. 2177–2185.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
 Lin (1993) Long-Ji Lin. 1993. Reinforcement learning for robots using neural networks. Technical Report. Carnegie Mellon University, Pittsburgh, PA, School of Computer Science.
 Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 7, 1 (2003), 76–80.
 Mahmood and Ricci (2009) Tariq Mahmood and Francesco Ricci. 2009. Improving recommender systems with adaptive conversational strategies. In Proceedings of the 20th ACM conference on Hypertext and hypermedia. ACM, 73–82.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
 Mooney and Roy (2000) Raymond J Mooney and Loriene Roy. 2000. Content-based book recommending using learning for text categorization. In Proceedings of the fifth ACM conference on Digital libraries. ACM, 195–204.
 Moore and Atkeson (1993) Andrew W Moore and Christopher G Atkeson. 1993. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning 13, 1 (1993), 103–130.
 Ng and Jordan (2000) Andrew Y Ng and Michael Jordan. 2000. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 406–415.
 Nguyen et al. (2017) Hanh TH Nguyen, Martin Wistuba, Josif Grabocka, Lucas Rego Drumond, and Lars Schmidt-Thieme. 2017. Personalized Deep Learning for Tag Recommendation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 186–197.
 Poupart and Boutilier (2005) Pascal Poupart and Craig Boutilier. 2005. VDCBPI: an approximate scalable algorithm for large POMDPs. In Advances in Neural Information Processing Systems. 1081–1088.
 Rendle (2010) Steffen Rendle. 2010. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 995–1000.
 Resnick and Varian (1997) Paul Resnick and Hal R Varian. 1997. Recommender systems. Commun. ACM 40, 3 (1997), 56–58.
 Ricci et al. (2011) Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to recommender systems handbook. In Recommender systems handbook. Springer, 1–35.
 Shani et al. (2005) Guy Shani, David Heckerman, and Ronen I Brafman. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.
 Sunehag et al. (2015) Peter Sunehag, Richard Evans, Gabriel Dulac-Arnold, Yori Zwols, Daniel Visentin, and Ben Coppin. 2015. Deep Reinforcement Learning with Attention for Slate Markov Decision Processes with High-Dimensional States and Actions. arXiv preprint arXiv:1512.01124 (2015).
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. Vol. 1. MIT press Cambridge.
 Taghipour and Kardan (2008) Nima Taghipour and Ahmad Kardan. 2008. A hybrid web recommender system based on Q-learning. In Proceedings of the 2008 ACM symposium on Applied computing. ACM, 1164–1168.
 Taghipour et al. (2007) Nima Taghipour, Ahmad Kardan, and Saeed Shiry Ghidary. 2007. Usage-based web recommendations: a reinforcement learning approach. In Proceedings of the 2007 ACM conference on Recommender systems. ACM, 113–120.
 Turpin and Scholer (2006) Andrew Turpin and Falk Scholer. 2006. User performance versus precision measures for simple search tasks. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 11–18.
 Wu et al. (2016) Sai Wu, Weichao Ren, Chengchao Yu, Gang Chen, Dongxiang Zhang, and Jingbo Zhu. 2016. Personal recommendation using deep recurrent neural networks in NetEase. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on. IEEE, 1218–1229.
 Yue and Guestrin (2011) Yisong Yue and Carlos Guestrin. 2011. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems. 2483–2491.
 Zhang et al. (2017) Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. arXiv preprint arXiv:1707.07435 (2017).