Deep Reinforcement Learning for List-wise Recommendations

12/30/2017 ∙ by Xiangyu Zhao, et al. ∙ Association for Computing Machinery, Inc. Michigan State University 0

Recommender systems play a crucial role in mitigating the problem of information overload by suggesting users' personalized items or services. The vast majority of traditional recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during the interactions with users. We model the sequential interactions between users and a recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn the optimal strategies via recommending trial-and-error items and receiving reinforcements of these items from users' feedbacks. In particular, we introduce an online user-agent interacting environment simulator, which can pre-train and evaluate model parameters offline before applying the model online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework LIRD for list-wide recommendations. The experimental results based on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Recommender systems are intelligent E-commerce applications. They assist users in their information-seeking tasks by suggesting items (products, services, or information) that best fit their needs and preferences. Recommender systems have become increasingly popular in recent years, and have been utilized in a variety of domains including movies, music, books, search queries, and social tags(Resnick and Varian, 1997; Ricci et al., 2011). Most existing recommender systems consider the recommendation procedure as a static process and make recommendations following a fixed greedy strategy. However, these approaches may fail given the dynamic nature of the users’ preferences. Furthermore, the majority of existing recommender systems are designed to maximize the immediate (short-term) reward of recommendations, i.e., to make users order the recommended items, while completely overlooking whether these recommended items will lead to more likely or more profitable (long-term) rewards in the future (Shani et al., 2005).

In this paper, we consider the recommendation procedure as sequential interactions between users and recommender agent; and leverage Reinforcement Learning (RL) to automatically learn the optimal recommendation strategies. Recommender systems based on reinforcement learning have two advantages. First, they are able to continuously update their strategies during the interactions, until the system converges to the optimal strategy that generates recommendations best fitting users’ dynamic preferences. Second, the optimal strategy is made by maximizing the expected long-term cumulative reward from users. Therefore, the system can identify the item with a small immediate reward but making big contribution to the rewards for future recommendations.

Efforts have been made on utilizing reinforcement learning for recommender systems, such as POMDP(Shani et al., 2005) and Q-learning(Taghipour and Kardan, 2008). However, these methods may become inflexible with the increasing number of items for recommendations. This prevents them to be adopted by practical recommender systems. Thus, we leverage Deep Reinforcement Learning(Lillicrap et al., 2015)

with (adapted) artificial neural networks as the non-linear approximators to estimate the action-value function in RL. This model-free reinforcement learning method does not estimate the transition probability and not store the Q-value table. This makes it flexible to support huge amount of items in recommender systems.

1.1. List-wise Recommendations

Users in practical recommender systems are typically recommended a list of items at one time. List-wise recommendations are more desired in practice since they allow the systems to provide diverse and complementary options to their users. For list-wise recommendations, we have a list-wise action space, where each action is a set of multiple interdependent sub-actions (items). Existing reinforcement learning recommender methods also could recommend a list of items. For example, DQN(Mnih et al., 2013) can calculate Q-values of all recalled items separately, and recommend a list of items with highest Q-values. However, these approaches recommend items based on one same state, and ignore relationship among the recommended items. As a consequence, the recommended items are similar. In practice, a bundling with complementary items may receive higher rewards than recommending all similar items. For instance, in real-time news feed recommendations, a user may want to read diverse topics of interest, and an action (i.e. recommendation) from the recommender agent would consist of a set of news articles that are not all similar in topics(Yue and Guestrin, 2011). Therefore, in this paper, we propose a principled approach to capture relationship among recommended items and generate a list of complementary items to enhance the performance.

Figure 1. DQN architecture selection.

1.2. Architecture Selection

Generally, there exist two Deep Q-learning architectures, shown in Fig.1 (a)(b). Traditional deep Q-learning adopts the first architecture as shown in Fig.1(a), which inputs only the state space and outputs Q-values of all actions. This architecture is suitable for the scenario with high state space and small action space, like playing Atari(Mnih et al., 2013). However, one drawback is that it cannot handle large and dynamic action space scenario, like recommender systems. The second Q-learning architecture, shown Fig.1(b), treats the state and the action as the input of Neural Networks and outputs the Q-value corresponding to this action. This architecture does not need to store each Q-value in memory and thus can deal with large action space or even continuous action space. A challenging problem of leveraging the second architecture is temporal complexity, i.e., this architecture computes Q-value for all potential actions, separately. To tackle this problem, in this paper, our recommending policy builds upon the Actor-Critic framework(Sutton and Barto, 1998), shown in Fig.1 (c). The Actor inputs the current state and aims to output the parameters of a state-specific scoring function. Then the RA scores all items and selects an item with the highest score. Next, the Critic uses an approximation architecture to learn a value function (Q-value), which is a judgment of whether the selected action matches the current state. Note that Critic shares the same architecture with the DQN in Fig.1(b). Finally, according to the judgment from Critic, the Actor updates its’ policy parameters in a direction of recommending performance improvement to output properer actions in the following iterations. This architecture is suitable for large action space, while can also reduce redundant computation simultaneously.

1.3. Online Environment Simulator

Unlike the Deep Q-learning method applied in playing Online Game like Atari, which can take arbitrary action and obtain timely feedback/reward, the online reward is hard to obtain before the recommender system is applied online. In practice, it is necessary to pre-train parameters offline and evaluate the model before applying it online, thus how to train our framework and evaluate the performance of our framework offline is a challenging task. To tackle this challenge, we propose an online environment simulator, which inputs current state and a selected action and outputs a simulated online reward, which enables the framework to train the parameters offline based on the simulated reward. More specifically, we build the simulator by users’ historical records. The intuition is no matter what algorithms a recommender system adopt, given the same state ( or a user’s historical records) and the same action (recommending the same items to the user), the user will make the same feedbacks to the items.

To evaluate the performance of a recommender system before applying it online, a practical way is to test it based on users’ historical clicking/ordering records. However, we only have the ground truth feedbacks (rewards) of the existing items in the users’ historical records, which are sparse compared with the enormous item space of current recommender system. Thus we cannot get the feedbacks (rewards) of items that are not in users’ historical records. This may result in inconsistent results between offline and online measurements. Our proposed online environment simulator can also mitigate this challenge by producing simulated online rewards given any state-action pair, so that the recommender system can rate items from the whole item space. Based on offline training and evaluation, the well trained parameters can be utilized as the initial parameters when we launch our framework online, which can be updated and improved via on-policy exploitation and exploration.

1.4. Our Contributions

We summarize our major contributions as follows:

  • We build an online user-agent interacting environment simulator, which is suitable for offline parameters pre-training and evaluation before applying a recommender system online;

  • We propose a LIst-wise Recommendation framework based on Deep reinforcement learning LIRD, which can be applied in scenarios with large and dynamic item space and can reduce redundant computation significantly; and

  • We demonstrate the effectiveness of the proposed framework in a real-world e-commerce dataset and validate the importance of list-wise recommendation for accurate recommendations.

The rest of this paper is organized as follows. In Section 2, we first formally define the problem of recommender system via reinforcement learning. Then, we provide approaches to model the recommending procedure as a sequential user-agent interactions and introduce details about employing Actor-Critic framework to automatically learn the optimal recommendation strategies via a online simulator. Section 3 carries out experiments based on real-word e-commerce site and presents experimental results. Section 4 briefly reviews related work. Finally, Section 5 concludes this paper and discusses our future work.

2. The Proposed Framework

In this section, we first formally define notations and the problem of recommender system via reinforcement learning. Then we build an online user-agent interaction environment simulator. Next, we propose an Actor-Critic based reinforcement learning framework under this setting. Finally, we discuss how to train the framework via users’ behavior log and how to utilize the framework for list-wise recommendations.

2.1. Problem Statement

We study the recommendation task in which a recommender agent (RA) interacts with environment (or users) by sequentially choosing recommendation items over a sequence of time steps, so as to maximize its cumulative reward. We model this problem as a Markov Decision Process (MDP), which includes a sequence of states, actions and rewards. More formally, MDP consists of a tuple of five elements as follows:

  • State space : A state is defined as the browsing history of a user, i.e., previous items that a user browsed before time . The items in are sorted in chronological order.

  • Action space : An action is to recommend a list of items to a user at time based on current state , where is the number of items the RA recommends to user each time.

  • Reward : After the recommender agent takes an action at the state , i.e., recommending a list of items to a user, the user browses these items and provides her feedback. She can skip (not click), click, or order these items, and the agent receives immediate reward according to the user’s feedback.

  • Transition probability : Transition probability defines the probability of state transition from to when RA takes action . We assume that the MDP satisfies . If user skips all the recommended items, then the next state ; while if the user clicks/orders part of items, then the next state updates. More details will be shown in following subsections.

  • Discount factor : defines the discount factor when we measure the present value of future reward. In particular, when , RA only considers the immediate reward. In other words, when , all future rewards can be counted fully into that of the current action.

In practice, only using discrete indexes to denote items is not sufficient since we cannot know the relations between different items only from indexes. One common way is to use extra information to represent items. For instance, we can use the attribute information like brand, price, sale per month, etc. Instead of extra item information, in this paper, we use the user-agent interaction information, i.e., users’ browsing history. We treat each item as a word and the clicked items in one recommendation session as a sentence. Then, we can obtain dense and low-dimensional vector representations for items via word embedding

(Levy and Goldberg, 2014).

Figure 2. The agent-user interactions in MDP.

Figure 2 illustrates the agent-user interactions in MDP. By interacting with the environment (users), recommender agent takes actions (recommends items) to users in such a way that maximizes the expected return, which includes the delayed rewards. We follow the standard assumption that delayed rewards are discounted by a factor of per time-step.

With the notations and definitions above, the problem of list-wise item recommendation can be formally defined as follows: Given the historical MDP, i.e., , the goal is to find a recommendation policy , which can maximize the cumulative reward for the recommender system.

2.2. Online User-Agent Interaction Environment Simulator

To tackle the challenge of training our framework and evaluating the performance of our framework offline, in this subsection, we propose an online user-agent interaction environment simulator. In the online recommendation procedure, given the current state , the RA recommends a list of items to a user, and the user browses these items and provides her feedbacks, i.e., skip/click/order part of the recommended items. The RA receives immediate reward according to the user’s feedback. To simulate the aforementioned online interaction procedures, the task of simulator is to predict a reward based on current state and a selected action, i.e., .

According to collaborative filtering techniques, users with similar interests will make similar decisions on the same item. With this intuition, we match the current state and action to existing historical state-action pairs, and stochastically generate a simulated reward. To be more specific, we first build a memory to store users’ historical browsing history, where is a user-agent interaction triple . The procedure to build the online simulator memory is illustrated in Algorithm 1. Given a historical recommendation session , we can observe the initial state from the previous sessions (line 2). Each time we observe items in temporal order (line 3), where “” means that each iteration we will move forward a window of . We can observe the current state (line 4), current items (line 5), and the user’s feedbacks for these items (line 6). Then we store triple in memory(line-7). Finally we update the state (lines 8-13), and move to the next items. Since we keep a fixed length state , each time a user clicked/ordered some items in the recommended list, we add these items to the end of state and remove the same number of items in the top of the state. For example, the RA recommends a list of five items to a user, if the user clicks and orders , then update .

Input: Users’ historical sessions , and the length of recommendation list .
Output:Simulator Memory

1:  for  do
2:     Observe initial state
3:     for  do
4:        Observe current state
5:        Observe current action list
6:        Observe current reward list
7:        Add the triple in
8:        for  do
9:           if  then
10:              Remove the first item in
11:              Add the item in the bottom of
12:           end if
13:        end for
14:     end for
15:  end for
16:  return  
Algorithm 1 Building Online Simulator Memory.

Then we calculated the similarity of the current state-action pair, say

, to each existing historical state-action pair in the memory. In this work, we adopt cosine similarity as:


where the first term measures the state similarity and the second term evaluates the action similarity. Parameter controls the balance of two similarities. Intuitively, with the increase of similarity between and , there is a higher chance mapping to the reward . Thus the probability of can be defined as follows:


then we can map the current state-action pair to a reward according the above probability. The major challenge of this projection is the computation complexity, i.e., we must compute pair-wise similarity between and each . To tackle this challenge, we first group users’ historical browsing history according to the rewards. Note that the number of reward permutation is typically limited. For example, the RA recommends two items to user each time, and the reward of user skip/click/order an item is 0/1/5, then the permutation of two items’ rewards is 9, i.e., , which is much smaller than the total number of historical records. Then probability of mapping to can be computed as follows:


where we assume that is a reward list containing user’s feedbacks of the recommended items, for instance . is the size of users’ historical browsing history group that . and are the average state vector and average action vector for , i.e., , and . The simulator only needs to pre-compute the , and , and can map to a reward list according to the probability in Eq.(3). In practice, RA updates , and every 1000 episodes. As is much smaller than the total number of historical records, Eq.(3) can map to a reward list efficiently.

In practice, the reward is usually a number, rather than a vector. Thus if the is mapped to , we calculate the overall reward of the whole recommended list as follows:


where is the order that an item in the recommended list and is the length of the recommended list, and . The intuition of Eq.(4) is that reward in the top of recommended list has a higher contribution to the overall rewards, which force RA arranging items that user may order in the top of the recommended list.

Figure 3. An illustration of the proposed framework with online simulator.

2.3. The Actor Framework

In this subsection, we propose the list-wise item recommending procedure, which consists of two steps, i.e., 1) state-specific scoring function parameter generating, and 2) action generating. Current practical recommender systems rely on a scoring or rating system which is averaged across all users ignoring specific demands of a user. These approaches perform poorly in tasks where there is large variation in users’ interests. To tackle this problem, we present a state-specific scoring function, which rates items according to user’s current state.

In the previous section, we have defined the state as the whole browsing history, which can be infinite and inefficient. A better way is to only consider the positive items, e.g., previous 10 clicked/ ordered items. A good recommender system should recommend the items that users prefer the most. The positive items represent key information about users’ preferences, i.e., which items the users prefer to. Thus, we only consider them for state-specific scoring function.

Our state-specific scoring function parameter generating step maps the current state to a list of weight vectors as follows:


where is a function parametrized by , mapping from the state space to the weight representation space. Here we choose deep neural networks as the parameter generating function.

Next we present the action-generating step based on the aforementioned scoring function parameters. Without the loss of generality, we assume that the scoring function parameter and the embedding of item from the item space is linear-related as:


Note that it is straightforward to extend it with non-linear relations. Then after computing scores of all items, the RA selects an item with highest score as the sub-action of action . We present list-wise item recommendation algorithm in Algorithm 2.

Input: Current state , Item space , the length of recommendation list .
Output:Recommendation list .

1:  Generate according Eq.(5)
2:  for do
3:   Score items in according Eq.(6)
4:   Select the an item with highest score as
5:   Add item in the bottom of
6:   Remove item from
7:  end for
8:  return
Algorithm 2 List-Wise Item Recommendation Algorithm.

The Actor first generates a list of weight vectors (line 1). For each weight vector, the RA scores all items in the item space (line 3), selects the item with highest score (line 4), and then adds this item at the end of the recommendation list. Finally the RA removes this item from the item space, which prevents recommending the same item to the recommendation list.

2.4. The Critic Framework

The Critic is designed to leverage an approximator to learn an action-value function , which is a judgment of whether the action generated by Actor matches the current state . Then, according , the Actor updates its’ parameters in a direction of improving performance to generate proper actions in the following iterations. Many applications in reinforcement learning make use of the optimal action-value function . It is the maximum expected return achievable by the optimal policy, and should follow the Bellman equation (Bellman, 2013) as:


In practice, to select an optimal , evaluations are necessary for the inner operation . This prevents Eq.(7) to be adopted in practical recommender systems with the enormous action space. However, the Actor architectures proposed in Section 2.3 outputs a deterministic action for Critic, which avoids the aforementioned computational cost of evaluations in Eq.(7) as follows:


where the Q-value function is the expected return based on state and the action .

In real recommender systems, the state and action spaces are enormous, thus estimating the action-value function for each state-action pair is infeasible. In addition, many state-action pairs may not appear in the real trace such that it is hard to update their values. Therefore, it is more flexible and practical to use an approximator function to estimate the action-value function, i.e., . In practice, the action-value function is usually highly nonlinear. Deep neural networks are known as excellent approximators for non-linear functions. In this paper, We refer to a neural network function approximator with parameters as deep

-network (DQN). A DQN can be trained by minimizing a sequence of loss functions



where is the target for the current iteration. The parameters from the previous iteration are fixed when optimizing the loss function

. In practice, it is often computationally efficient to optimize the loss function by stochastic gradient descent, rather than computing the full expectations in the above gradient.

2.5. The Training Procedure

An illustration of the proposed user-agent online interaction simulator and deep reinforcement recommending LIRD framework is demonstrated in Figure 3. Next, we discuss the parameters training procedures. In this work, we utilize DDPG algorithm(Lillicrap et al., 2015) to train the parameters of the proposed framework. The training algorithm for the proposed framework DEV is presented in Algorithm 3.

1:  Initialize actor network and critic network with random weights
2:  Initialize target network and with weights
3:  Initialize the capacity of replay memory
4:  for  do
5:     Reset the item space
6:     Initialize state from previous sessions
7:     for  do
8:        Stage 1: Transition Generating Stage
9:        Select an action according Alg.2
10:        Execute action and observe the reward list for each item in
11:        Set
12:        for  do
13:           if  then
14:              Add to the end of
15:              Remove the first item of
16:           end if
17:        end for
18:        Compute overall reward according Eq. (4)
19:        Store transition in
20:        Set
21:        Stage 2: Parameter Updating Stage
22:        Sample minibatch of transitions from
23:        Generate by target Actor network according Alg.2
24:        Set
25:        Update Critic by minimizing according to:
26:        Update the Actor using the sampled policy gradient:
27:        Update the Critic target networks:
28:        Update the Actor target networks:
29:     end for
30:  end for
Algorithm 3 Parameters Training for DEV with DDPG.

In each iteration, there are two stages, i.e., 1) transition generating stage (lines 8-20), and 2) parameter updating stage (lines 21-28). For transition generating stage (line 8): given the current state , the RA first recommends a list of items according to Algorithm 2 (line 9); then the agent observes the reward from simulator (line 10) and updates the state to (lines 11-17) following the same strategy in Algorithm 1; and finally the recommender agent stores transitions into the memory (line 19), and set (line 20). For parameter updating stage: the recommender agent samples mini-batch of transitions from (line 22), and then updates parameters of Actor and Critic (lines 23-28) following a standard DDPG procedure (Lillicrap et al., 2015).

In the algorithm, we introduce widely used techniques to train our framework. For example, we utilize a technique known as experience replay (Lin, 1993) (lines 3,22), and introduce separated evaluation and target networks (Mnih et al., 2013)(lines 2,23), which can help smooth the learning and avoid the divergence of parameters. For the soft target updates of target networks(lines 27,28), we used . Moreover, we leverage prioritized sampling strategy (Moore and Atkeson, 1993) to assist the framework learning from the most important historical transitions.

2.6. The Testing Procedure

After framework training stage, RA gets well-trained parameters, say and . Then we can do framework testing on simulator environment. The model testing also follows Algorithm 3, i.e., the parameters continuously updates during the testing stage, while the major difference from training stage is before each recommendation session, we reset the parameters back to and , for the sake of fair comparison between each session. We can artificially control the length of recommendation session to study the short-term and long-term performance.

3. Experiments

In this section, we conduct extensive experiments with a dataset from a real e-commerce site to evaluate the effectiveness of the proposed framework. We mainly focus on two questions: (1) how the proposed framework performs compared to representative baselines; and (2) how the list-wise strategy contributes to the performance. We first introduce experimental settings. Then we seek answers to the above two questions. Finally, we study the impact of important parameters on the performance of the proposed framework.

3.1. Experimental Settings

We evaluate our method on a dataset of July, 2017 from a real e-commerce site. We randomly collect 100,000 recommendation sessions (1,156,675 items) in temporal order, and use the first 70% sessions as the training set and the later 30% sessions as the testing set. For a given session, the initial state is collected from the previous sessions of the user. In this paper, we leverage previously clicked/ordered items as the positive state. Each time the RA recommends a list of items to users. The reward of skipped/clicked/ordered items are empirically set as 0, 1, and 5, respectively. The dimension of the item embedding is 50, and we set the discounted factor . For the parameters of the proposed framework such as and , we select them via cross-validation. Correspondingly, we also do parameter-tuning for baselines for a fair comparison. We will discuss more details about parameter selection for the proposed framework in the following subsections.

To evaluate the performance of the proposed framework, we select MAP (Turpin and Scholer, 2006) and NDCG (Järvelin and Kekäläinen, 2002) as the metrics to measure the performance. The difference of ours from traditional Learn-to-Rank methods is that we rank both clicked and ordered items together, and set them by different rewards, rather than only rank clicked items as that in Learn-to-Rank problems.

3.2. Performance Comparison for Item Recommendations

To answer the the first question, we compare the proposed framework with the following representative baseline methods:

  • CF: Collaborative filtering(Breese et al., 1998) is a method of making automatic predictions about the interests of a user by collecting preference information from many users, which is based on the hypothesis that people often get the best recommendations from someone with similar tastes to themselves.

  • FM: Factorization Machines(Rendle, 2010)

    combine the advantages of support vector machines with factorization models. Compared with matrix factorization, higher order interactions can be modeled using the dimensionality parameter.

  • DNN: We choose a deep neural network with back propagation technique as a baseline to recommend the items in a given session. The input of DNN is the embeddings of users’ historical clicked/ordered items. We train the DNN to output the next recommended item.

  • RNN: This baseline utilizes the basic RNN to predict what user will buy next based on the clicking/ordering histories. To minimize the computation costs, it only keeps a finite number of the latest states.

  • DQN: We use a Deep Q-network(Mnih et al., 2013) with embeddings of users’ historical clicked/ordered items (state) and a recommended item (action) as input, and train this baseline following Eq. 7. Note that the DQN shares the same architecture with the Critic in our framework.

3.3. Performance of List-Wise Recommendations

To validate the effectiveness of the list-wise recommendation strategy, we investigate how the proposed framework LIRD performs with the changes of the length of the recommendation list, i.e., , in long-term sessions, while fixing other parameters. Note that is the item-wise recommendation.

3.4. Performance of Simulator

The online simulator has one key parameter, i.e., , which controls the trade-off between state and action similarity in simulator, see Eq.(3). To study the impact of this parameter, we investigate how the proposed framework LIRD works with the changes of in long-term sessions, while fixing other parameters.

4. Related Work

In this section, we briefly review works related to our study. In general, the related work can be mainly grouped into the following categories.

The first category related to this paper is traditional recommendation techniques. Recommender systems assist users by supplying a list of items that might interest users. Efforts have been made on offering meaningful recommendations to users. Collaborative filtering(Linden et al., 2003) is the most successful and the most widely used technique, which is based on the hypothesis that people often get the best recommendations from someone with similar tastes to themselves(Breese et al., 1998). Another common approach is content-based filtering(Mooney and Roy, 2000), which tries to recommend items with similar properties to those that a user ordered in the past. Knowledge-based systems(Akerkar and Sajja, 2010) recommend items based on specific domain knowledge about how certain item features meet users’ needs and preferences and how the item is useful for the user. Hybrid recommender systems are based on the combination of the above mentioned two or more types of techniques(Burke, 2002)

. The other topic closely related to this category is deep learning based recommender system, which is able to effectively capture the non-linear and non-trivial user-item relationships, and enables the codification of more complex abstractions as data representations in the higher layers

(Zhang et al., 2017). For instance, Nguyen et al.(Nguyen et al., 2017)

proposed a personalized tag recommender system based on CNN. It utilizes constitutional and max-pooling layer to get visual features from patches of images. Wu et al.

(Wu et al., 2016) designed a session-based recommendation model for real-world e-commerce website. It utilizes the basic RNN to predict what user will buy next based on the click histories. This method helps balance the tradeoff between computation costs and prediction accuracy.

The second category is about reinforcement learning for recommendations, which is different with the traditional item recommendations. In this paper, we consider the recommending procedure as sequential interactions between users and recommender agent; and leverage reinforcement learning to automatically learn the optimal recommendation strategies. Indeed, reinforcement learning have been widely examined in recommendation field. The MDP-Based CF model in Shani et al.(Shani et al., 2005) can be viewed as approximating a partial observable MDP (POMDP) by using a finite rather than unbounded window of past history to define the current state. To reduce the high computational and representational complexity of POMDP, three strategies have been developed: value function approximation(Hauskrecht, 1997), policy based optimization (Ng and Jordan, 2000; Poupart and Boutilier, 2005), and stochastic sampling (Kearns et al., 2002). Furthermore, Mahmood et al.(Mahmood and Ricci, 2009) adopted the reinforcement learning technique to observe the responses of users in a conversational recommender, with the aim to maximize a numerical cumulative reward function modeling the benefit that users get from each recommendation session. Taghipour et al.(Taghipour et al., 2007; Taghipour and Kardan, 2008) modeled web page recommendation as a Q-Learning problem and learned to make recommendations from web usage data as the actions rather than discovering explicit patterns from the data. The system inherits the intrinsic characteristic of reinforcement learning which is in a constant learning process. Sunehag et al.(Sunehag et al., 2015) introduced agents that successfully address sequential decision problems with high-dimensional combinatorial slate-action spaces.

5. Conclusion

In this paper, we propose a novel framework LIRD, which models the recommendation session as a Markov Decision Process and leverages Deep Reinforcement Learning to automatically learn the optimal recommendation strategies. Reinforcement learning based recommender systems have two advantages: (1) they can continuously update strategies during the interactions, and (2) they are able to learn a strategy that maximizes the long-term cumulative reward from users. Different from previous work, we propose a list- wise recommendation framework, which can be applied in scenarios with large and dynamic item space and can reduce redundant computation significantly. Note that we design an online user-agent interacting environment simulator, which is suitable for offline parameters pre-training and evaluation before applying a recommender system online. We evaluate our framework with extensive experiments based on data from a real e-commerce site. The results show that (1) our framework can improve the recommendation performance; and (2) list-wise strategy outperforms item-wise strategies.

There are several interesting research directions. First, in addition to positional order of items we used in this work, we would like to investigate more orders like temporal order. Second, we would like to validate with more agent-user interaction patterns, e.g., adding items into shopping cart, and investigate how to model them mathematically for recommendations. Finally, the framework proposed in the work is quite general, and we would like to investigate more applications of the proposed framework, especially for those applications with both positive and negative(skip) signals.


  • (1)
  • Akerkar and Sajja (2010) Rajendra Akerkar and Priti Sajja. 2010. Knowledge-based systems. Jones & Bartlett Publishers.
  • Bellman (2013) Richard Bellman. 2013. Dynamic programming. Courier Corporation.
  • Breese et al. (1998) John S Breese, David Heckerman, and Carl Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In

    Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence

    . Morgan Kaufmann Publishers Inc., 43–52.
  • Burke (2002) Robin Burke. 2002. Hybrid recommender systems: Survey and experiments. User modeling and user-adapted interaction 12, 4 (2002), 331–370.
  • Hauskrecht (1997) Milos Hauskrecht. 1997. Incremental methods for computing bounds in partially observable Markov decision processes. In AAAI/IAAI. 734–739.
  • Järvelin and Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
  • Kearns et al. (2002) Michael Kearns, Yishay Mansour, and Andrew Y Ng. 2002. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine learning 49, 2 (2002), 193–208.
  • Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems. 2177–2185.
  • Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
  • Lin (1993) Long-Ji Lin. 1993. Reinforcement learning for robots using neural networks. Technical Report. Carnegie-Mellon Univ Pittsburgh PA School of Computer Science.
  • Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon. com recommendations: Item-to-item collaborative filtering. IEEE Internet computing 7, 1 (2003), 76–80.
  • Mahmood and Ricci (2009) Tariq Mahmood and Francesco Ricci. 2009. Improving recommender systems with adaptive conversational strategies. In Proceedings of the 20th ACM conference on Hypertext and hypermedia. ACM, 73–82.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  • Mooney and Roy (2000) Raymond J Mooney and Loriene Roy. 2000. Content-based book recommending using learning for text categorization. In Proceedings of the fifth ACM conference on Digital libraries. ACM, 195–204.
  • Moore and Atkeson (1993) Andrew W Moore and Christopher G Atkeson. 1993. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning 13, 1 (1993), 103–130.
  • Ng and Jordan (2000) Andrew Y Ng and Michael Jordan. 2000. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 406–415.
  • Nguyen et al. (2017) Hanh TH Nguyen, Martin Wistuba, Josif Grabocka, Lucas Rego Drumond, and Lars Schmidt-Thieme. 2017. Personalized Deep Learning for Tag Recommendation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 186–197.
  • Poupart and Boutilier (2005) Pascal Poupart and Craig Boutilier. 2005. VDCBPI: an approximate scalable algorithm for large POMDPs. In Advances in Neural Information Processing Systems. 1081–1088.
  • Rendle (2010) Steffen Rendle. 2010. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 995–1000.
  • Resnick and Varian (1997) Paul Resnick and Hal R Varian. 1997. Recommender systems. Commun. ACM 40, 3 (1997), 56–58.
  • Ricci et al. (2011) Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to recommender systems handbook. In Recommender systems handbook. Springer, 1–35.
  • Shani et al. (2005) Guy Shani, David Heckerman, and Ronen I Brafman. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, Sep (2005), 1265–1295.
  • Sunehag et al. (2015) Peter Sunehag, Richard Evans, Gabriel Dulac-Arnold, Yori Zwols, Daniel Visentin, and Ben Coppin. 2015. Deep Reinforcement Learning with Attention for Slate Markov Decision Processes with High-Dimensional States and Actions. arXiv preprint arXiv:1512.01124 (2015).
  • Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. Vol. 1. MIT press Cambridge.
  • Taghipour and Kardan (2008) Nima Taghipour and Ahmad Kardan. 2008. A hybrid web recommender system based on q-learning. In Proceedings of the 2008 ACM symposium on Applied computing. ACM, 1164–1168.
  • Taghipour et al. (2007) Nima Taghipour, Ahmad Kardan, and Saeed Shiry Ghidary. 2007. Usage-based web recommendations: a reinforcement learning approach. In Proceedings of the 2007 ACM conference on Recommender systems. ACM, 113–120.
  • Turpin and Scholer (2006) Andrew Turpin and Falk Scholer. 2006. User performance versus precision measures for simple search tasks. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 11–18.
  • Wu et al. (2016) Sai Wu, Weichao Ren, Chengchao Yu, Gang Chen, Dongxiang Zhang, and Jingbo Zhu. 2016.

    Personal recommendation using deep recurrent neural networks in NetEase. In

    Data Engineering (ICDE), 2016 IEEE 32nd International Conference on. IEEE, 1218–1229.
  • Yue and Guestrin (2011) Yisong Yue and Carlos Guestrin. 2011. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems. 2483–2491.
  • Zhang et al. (2017) Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. arXiv preprint arXiv:1707.07435 (2017).