Model-Based Reinforcement Learning for Whole-Chain Recommendations

02/11/2019 ∙ by Xiangyu Zhao, et al. ∙ Association for Computing Machinery, Inc. ∙ Michigan State University

With the recent prevalence of Reinforcement Learning (RL), there has been tremendous interest in developing RL-based recommender systems. In practical recommendation sessions, users will sequentially access multiple scenarios, such as the entrance pages and the item detail pages, and each scenario has its own recommendation strategy. However, the majority of existing RL-based recommender systems focus on separately optimizing each strategy, which could lead to sub-optimal overall performance, because independently optimizing each scenario (i) overlooks the sequential correlation among scenarios, (ii) ignores users' behavior data from other scenarios, and (iii) only optimizes its own objective but neglects the overall objective of a session. Therefore, in this paper, we study the recommendation problem with multiple (consecutive) scenarios, i.e., whole-chain recommendations. We propose a multi-agent reinforcement learning based approach (DeepChain), which can capture the sequential correlation among different scenarios and jointly optimize multiple recommendation strategies. To be specific, all recommender agents share the same memory of users' historical behaviors, and they work collaboratively to maximize the overall reward of a session. Note that optimizing multiple recommendation strategies jointly faces two challenges: (i) it requires huge amounts of user behavior data, and (ii) the distribution of reward (users' feedback) is extremely unbalanced. In this paper, we introduce model-based reinforcement learning techniques to reduce the training data requirement and execute more accurate strategy updates. The experimental results based on data from a real e-commerce platform demonstrate the effectiveness of the proposed framework. Further experiments have been conducted to validate the importance of each component of DeepChain.




1. Introduction

With the recent tremendous development in Reinforcement Learning (RL), there has been increasing interest in adapting RL for recommendations. RL-based recommender systems treat the recommendation procedures as sequential interactions between users and a recommender agent (RA). They aim to automatically learn an optimal recommendation strategy (policy) that maximizes cumulative reward from users without any specific instructions. RL-based recommender systems can achieve two key advantages: (i) the recommender agent can continuously learn its recommendation strategy based on users' real-time feedback during the user-agent interactions; and (ii) the optimal strategies aim at maximizing the long-term reward from users (e.g. the overall revenue of a recommendation session). Therefore, numerous efforts have been made to develop RL-based recommender systems (Dulac-Arnold et al., 2015; Zhao et al., 2018b, 2017, a; Zheng et al., 2018).

Figure 1. An example of whole-chain recommendations.

In reality, as shown in Figure 1, users often sequentially interact with multiple recommender agents (RAs) of different scenarios in one recommendation session. First, a user usually starts a recommendation session by browsing the items recommended in the entrance page of the e-commerce platform, which suggests diverse and complementary items according to the user's browsing history, where the user can: (i) skip the recommended items and continue browsing the new recommendations, or (ii) go to the item detail page if she clicks one preferred item. Second, the item detail page shows the details of the clicked item, and the recommender agent of this page recommends a set of items related to the clicked item, where the user can (i) go back to the entrance page, (ii) go to another item detail page if she clicks one recommended item, or (iii) add the item to the shopping cart and go to the shopping cart page. Third, the shopping cart page lists all items that the user has added, and a recommender agent generates recommendations associated with the items stored in the shopping cart, where the user can (i) return to the last item detail page, (ii) click one recommended item and go to the item detail page, or (iii) go to the order page if she decides to purchase some items. Finally, after the user purchases items in the order page, a recommender agent will recommend a set of items related to the purchased items. Note that (i) the user will be navigated to an item detail page wherever she clicks a recommended item, and (ii) the user can leave the platform in any scenario (we only show one "leave" behavior in Figure 1).

The real example suggests that there is a chain of recommendation scenarios and these scenarios are sequentially related. However, the majority of traditional methods usually independently optimize each recommendation strategy, which could result in sub-optimal overall performance. First, separate optimization ignores the sequential correlation and dependency of users’ behaviors among different scenarios. Second, optimizing one strategy within a specific scenario only leverages the user-agent interaction data within this scenario, while completely ignoring the information (users’ behaviors) from other scenarios. Third, independent optimization of one scenario only maximizes its own objective, which may negatively affect the overall objective of the whole recommendation session. In other words, recommending an item in one specific scenario may negatively influence user’s click/purchase behaviors in other scenarios. Thus, in this paper, we formulate the recommendation tasks within multiple consecutive scenarios as a whole-chain recommendation problem, and leverage multi-agent reinforcement learning (MARL) to jointly optimize multiple recommendation strategies, which is capable of maximizing the overall performance of the whole recommendation session. The designed whole-chain recommendation framework (DeepChain) has three advantages. First, recommender agents are sequentially activated to capture the sequential dependency of users’ behaviors among different scenarios. Second, all recommender agents in different scenarios share the same memory of historical user behavior data, in other words, an agent in one scenario can leverage user behavior data from other scenarios to make more accurate decisions. Third, all recommender agents can work collaboratively to maximize the overall performance of the whole recommendation session.

In order to optimize recommendation strategies, existing model-free reinforcement learning based recommender systems typically require a large amount of user-agent interaction data (Dulac-Arnold et al., 2015; Zhao et al., 2018b; Zheng et al., 2018). The whole-chain setting with multiple scenarios demands even more data. However, this requirement is challenging in practical recommendation systems, since real-world users will leave the platforms quickly if the systems randomly recommend items that do not fit users' preferences (Chen et al., 2018c). Furthermore, the distributions of users' immediate feedback (reward) on the recommended items are extremely unbalanced in users' historical logs, since users' click/purchase behaviors (with positive reward) occur much less frequently than users' skip behaviors (with zero reward). This leads to inaccurate updates of the action-value function of RL (Hu et al., 2018). Therefore, to tackle these challenges, in this paper, we propose a model-based reinforcement learning framework for the MARL-based recommender systems (DeepChain), which approximates the user behaviors (environment) to reduce the amount of training data required and performs accurate optimization of the action-value function. We summarize our major contributions as follows:


  • We identify the sequential correlation and dependency of users’ behaviors in different scenarios of one recommendation session and propose a principled approach to capture them for recommendations;

  • We propose a multi-agent, model-based reinforcement learning framework (DeepChain) for the whole-chain recommendation problem, which can jointly optimize multiple recommendation strategies (agents) for different scenarios with a model-based RL schema;

  • We demonstrate the effectiveness of the proposed framework on a real-world dataset from an e-commerce platform and validate the importance of the components in DeepChain for accurate recommendations.

The rest of this paper is organized as follows. In Section 2, we formally define the whole-chain recommendation problem with the Markov Decision Process of user-agent interactions. In Section 3, we provide details about employing multi-agent model-based reinforcement learning techniques to collaboratively learn multiple recommendation strategies under the whole-chain recommendation setting. Section 4 carries out experiments based on a real-world dataset from an e-commerce platform and provides experimental results. Section 5 briefly reviews related work. Finally, Section 6 concludes this work and discusses some future work.

2. Problem Statement

We formulate the whole-chain recommendation task as a multi-agent model-based reinforcement learning problem. To be specific, there exist several recommender agents (RAs) corresponding to different recommendation scenarios. Each RA serves as a recommendation strategy that recommends items to a user (the environment $\mathcal{E}$) in a specific scenario according to the user's browsing history. Furthermore, the RAs sequentially interact with the user by recommending items over a sequence of time steps; thus the RAs are sequentially activated according to the user's behaviors, and only one RA is activated at each time step. All RAs work cooperatively to maximize the cumulative reward of a recommendation session. In this paper, we model the above multi-agent model-based reinforcement learning problem as a Markov Decision Process (MDP), which contains a sequence of states, actions and rewards. Formally, the MDP is a tuple with five elements $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ as:

Figure 2. The MDP of user-agent interactions.

  • State space $\mathcal{S}$: The state $s_t \in \mathcal{S}$ is defined as a chronologically sorted sequence of a user's historical clicked or purchased items before time $t$, which represents the user's preference at time $t$.

  • Action space $\mathcal{A}$: An action $a_t \in \mathcal{A}$ of an RA is recommending a list of relevant items corresponding to state $s_t$. Without loss of generality, in this paper an RA recommends only one item to the user each time; it is straightforward to extend the setting to recommend multiple items.

  • Reward $\mathcal{R}$: When an RA recommends an item to a user at time $t$ (i.e. taking action $a_t$), the user will browse the recommended item and provide corresponding feedback (such as skip, click, purchase or leave), and then the RA will receive an immediate reward $r(s_t, a_t)$ based on the user's feedback.

  • Transition probability $\mathcal{P}$: Transition probability $p(s_{t+1} \mid s_t, a_t)$ is defined as the probability of the state transiting from $s_t$ to $s_{t+1}$ when action $a_t$ is executed by an RA. The MDP is assumed to satisfy the Markov property $p(s_{t+1} \mid s_t, a_t, \dots, s_1, a_1) = p(s_{t+1} \mid s_t, a_t)$. In our setting, the transition probability is equivalent to the user behavior probability, which is also associated with the activation of RAs.

  • Discount factor $\gamma$: The reward discount factor $\gamma \in [0, 1]$ is leveraged to calculate the present value of future reward. When $\gamma = 1$, all future rewards are fully counted into the current action; when $\gamma = 0$, only the immediate reward is considered.

The MDP of agent-user interactions is illustrated in Figure 2. With the aforementioned definitions and descriptions, we formally define the whole-chain recommendation problem as follows: Given the historical MDP, i.e., $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, the goal is to find a set of recommendation policies $\{\pi\}: \mathcal{S} \to \mathcal{A}$ for multiple recommender agents of different recommendation scenarios, which can maximize the cumulative reward of the whole recommendation session.
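The role of the discount factor in the cumulative reward can be illustrated with a small sketch. The function name and the example session below are hypothetical; the reward values (click 1, skip 0, leave -2) follow the experimental settings reported in Section 4.1.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over one session, accumulated back to front."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical session: the user skips, clicks, then leaves.
session = [0.0, 1.0, -2.0]
full_credit = discounted_return(session, 1.0)     # all future rewards count fully
immediate_only = discounted_return(session, 0.0)  # only the first reward counts
```

With a discount factor of 1 the later "leave" penalty is charged in full to the first action, while a discount factor of 0 ignores everything after the immediate feedback.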

3. The Proposed Framework

In this section, we propose a deep reinforcement learning approach for the whole-chain recommendation problem, which can simultaneously learn multiple recommendation strategies for different scenarios by a model-based learning algorithm. As discussed in Section 1, developing a whole-chain recommendation framework is challenging, because (i) a conventional one-scenario recommendation framework neglects the sequential correlation among scenarios and the information from other scenarios, and solely optimizes its own objective, which may lead to sub-optimal overall performance of the whole session, (ii) jointly optimizing multiple recommendation strategies requires substantial user behavior data, and (iii) the users' feedback (reward) distributions are extremely unbalanced. To address these challenges, we propose a multi-agent model-based reinforcement learning framework. Note that for the sake of simplicity, we only discuss the recommendations within two scenarios, i.e., the entrance page and the item detail page; however, it is straightforward to extend the setting to more scenarios. In the following, we first give an overview of the proposed framework, then introduce the architectures of the recommender agents (actors) and the critic separately, and finally discuss the objective function with the optimization algorithm.

Figure 3. An overview of the proposed framework.

3.1. An Overview of the Proposed Framework

The multi-agent reinforcement learning framework with Actor-Critic architecture is illustrated in Figure 3. In our setting, the proposed framework has two recommender agents (actors), i.e., $\mathrm{Actor}_m$ providing recommendations in the entrance page and $\mathrm{Actor}_d$ providing recommendations in the item detail page. Actors aim to generate recommendations (action) according to users' browsing histories (state). As mentioned in Section 1: (i) the recommender agents are sequentially activated to interact with users, (ii) the recommender agents share the same memory of users' historical behavior data (state), and (iii) the recommender agents work collaboratively to maximize the overall performance, which is evaluated by a global action-value function (critic). In other words, a global critic controls all actors to enable them to work collaboratively to optimize the same overall performance. To be specific, the critic takes users' historical behavior data and one recommended item (the current state-action pair), and outputs an action-value $Q(s_t, a_t)$ evaluating the long-term future rewards corresponding to the current state and action. Next, we will discuss their architectures in detail.

3.2. The Actor Architecture

The goal of the actors is to suggest recommendations (action) based on users’ historical browsing behaviors (state), which should address two challenges: (i) how to capture users’ dynamic preference in one recommendation session, and (ii) how to generate recommendations according to the learned users’ preference. To tackle these challenges, we develop an actor framework with the Encoder-Decoder architecture. Note that all actors share the same architecture with different parameters.

3.2.1. Encoder to Capture Users’ Preference

The sub-figure under the dashed line of Figure 4 illustrates the Encoder architecture, which aims to learn users' dynamic preference during the recommendation session. The Encoder takes the item representations (see Footnote 1) of users' last $N$ clicked/purchased items (in sequential order) as inputs $\{e_1, \dots, e_N\}$, and outputs the representation of users' dynamic preference in the form of a dense and low-dimensional vector. We introduce a recurrent neural network (RNN) with Gated Recurrent Units (GRU) to capture users' sequential browsing behaviors. We leverage GRU rather than other architectures like Long Short-Term Memory (LSTM) since GRU is easier to train, with a simpler architecture and fewer parameters. Unlike LSTM, which leverages an input gate and a forget gate to produce a new state, GRU has an update gate $z_t$:

$$z_t = \sigma(W_z e_t + U_z h_{t-1}) \tag{1}$$

GRU uses a reset gate $r_t$ to control the former state $h_{t-1}$:

$$r_t = \sigma(W_r e_t + U_r h_{t-1}) \tag{2}$$

The candidate activation $\hat{h}_t$ is computed as:

$$\hat{h}_t = \tanh\left[ W e_t + U (r_t \odot h_{t-1}) \right] \tag{3}$$

Finally, the activation of GRU is a linear interpolation between the candidate activation $\hat{h}_t$ and the previous activation $h_{t-1}$:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t \tag{4}$$

We leverage the final hidden state $h_t$ as the low-dimensional representation of the user's current preference:

$$p_t = h_t \tag{5}$$

Footnote 1: The item representations are dense and low-dimensional vectors, which are pre-trained based on users' browsing history by a real e-commerce company. The item representations are trained via word embedding (Levy and Goldberg, 2014), where the clicked items in one recommendation session are treated as a sentence, and each item is treated as a word. The effectiveness of these item representations is demonstrated in the company's business scenarios such as search, recommendation and advertisement.

Figure 4. The architecture of the actors.
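The GRU-based Encoder described in Section 3.2.1 can be sketched in a few lines of dependency-free Python. This is only an illustration of the standard GRU update (update gate, reset gate, candidate activation, interpolation); the parameter names `Wz`, `Uz`, `Wr`, `Ur`, `W`, `U` and the helper functions are assumptions made for this sketch, and bias terms are omitted.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(e, h_prev, params):
    """One GRU update for item embedding e and previous hidden state h_prev."""
    Wz, Uz, Wr, Ur, W, U = (params[k] for k in ("Wz", "Uz", "Wr", "Ur", "W", "U"))
    # Update gate and reset gate.
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, e), matvec(Uz, h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, e), matvec(Ur, h_prev))]
    # Candidate activation uses the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a + b) for a, b in zip(matvec(W, e), matvec(U, gated))]
    # Linear interpolation between previous and candidate activation.
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def encode(item_embeddings, params, hidden_dim):
    """Run the GRU over the item sequence; the final state is the preference vector."""
    h = [0.0] * hidden_dim
    for e in item_embeddings:
        h = gru_step(e, h, params)
    return h
```

In practice a deep learning library would supply the GRU layer; the explicit version is shown only to make each gate visible.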

3.2.2. Decoder to Generate Recommendations

The sub-figure above the dashed line of Figure 4 illustrates the Decoder architecture, which aims to generate recommendations according to users' preference learned by the Encoder. There are several choices for the Decoder architecture; here we adopt the bi-linear decoding architecture (Li et al., 2017) because of its fewer parameters and good performance in recommendation tasks. To be specific, a bi-linear similarity function between the representation of the user's current preference $p_t$ and each candidate item $e_i$ is used to calculate a similarity score $\mathrm{score}_i$:

$$\mathrm{score}_i = e_i^{\top} W p_t \tag{6}$$

where $W \in \mathbb{R}^{|E| \times |H|}$, $|E|$ is the dimension of the item representation and $|H|$ is the dimension of $p_t$. Then the similarity scores of all candidate items are fed into a softmax layer to obtain the probability that each item is recommended. We select the item with the highest probability as the output of the Decoder, i.e., the next item to be recommended according to users' current preference. Next we will discuss the architecture of the critic.
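As a minimal sketch of the bi-linear Decoder, the snippet below scores each candidate item against the preference vector, applies a softmax, and returns the index of the most probable item. The function names and the toy weight matrix are hypothetical and are not taken from the paper.

```python
import math


def bilinear_score(item, weight, preference):
    """Compute item^T * weight * preference for one candidate item."""
    projected = [sum(w * p for w, p in zip(row, preference)) for row in weight]
    return sum(i * v for i, v in zip(item, projected))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def recommend(candidates, weight, preference):
    """Return (index of the most probable candidate, probabilities of all candidates)."""
    probs = softmax([bilinear_score(c, weight, preference) for c in candidates])
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy usage: two 2-d candidate items, identity weight, 2-d preference vector.
best, probs = recommend([[1.0, 0.0], [0.0, 1.0]],
                        [[1.0, 0.0], [0.0, 1.0]],
                        [0.2, 0.9])
```

The weight matrix has one row per item dimension and one column per preference dimension, so the two vectors do not need to share a size.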

3.3. The Critic Architecture

The recommender agents introduced in Section 3.2 should work collaboratively to optimize the same overall objective, which is measured by a global critic network (action-value function). In other words, the global critic controls all recommender agents so that they work cooperatively to maximize the global performance. Specifically, the input of the critic is the current state-action pair, i.e., users' historical browsing behaviors and one recommended item, and the output is an action-value $Q(s_t, a_t)$ that evaluates the future cumulative rewards starting from the current state and action. According to $Q(s_t, a_t)$, the actors update their parameters to generate more accurate recommendations (actions), which will be discussed in the following subsections. The global critic should tackle one challenge, i.e., how to approximate the action-value of one recommended item in different scenarios. In other words, recommending the same item in different scenarios should lead to different long-term future rewards.

Figure 5. The architecture of the critic.

Figure 5 illustrates the architecture of the global critic, which takes the current state $s_t$ and action $a_t$ as input. We follow the same strategy from Eq (1) to Eq (4) to generate users' dynamic preference $p_t$ by an RNN with GRU based on state $s_t$. Note that the RNNs in the actors and the global critic share the same architecture with independent parameters. In order to tackle the aforementioned challenge, we introduce a scenario layer, whose intuition is that if users are currently in the entrance page, we transform their current preference $p_t$ into one form, while if the users are currently in the item detail page, $p_t$ is transformed into another form. In other words, users in different scenarios have different preferences, even though they have the same browsing histories. The scenario layer takes users' dynamic preference $p_t$, and outputs the transformed users' preference $\tilde{p}_t$ as follows:

$$\tilde{p}_t = \tanh\left[ 1_m (W_m p_t + b_m) + 1_d (W_d p_t + b_d) \right] \tag{7}$$

where the first term corresponds to users' preference in the entrance page, while the second term is associated with users' preference in the item detail page. The mutually exclusive indicators $1_m$ and $1_d$ control the activation of the two terms, i.e., only one term is activated each time. Next, we concatenate the transformed users' preference $\tilde{p}_t$ and the current action $a_t$, and feed them into several fully connected layers as a nonlinear approximator to estimate the action-value function $Q(s_t, a_t)$.
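The scenario layer can be pictured as a switch that picks one of two transformations of the same preference vector. The sketch below assumes each scenario has its own weight matrix and bias followed by a tanh; the keys `"m"` (entrance page) and `"d"` (item detail page), the bias terms and the function names are assumptions made for illustration.

```python
import math


def scenario_layer(preference, scenario, weights, biases):
    """Transform the preference with the parameters of the active scenario only."""
    if scenario not in weights:
        raise ValueError("scenario must be one of: " + ", ".join(sorted(weights)))
    W, b = weights[scenario], biases[scenario]
    return [math.tanh(sum(w * p for w, p in zip(row, preference)) + bi)
            for row, bi in zip(W, b)]


def critic_input(preference, scenario, action, weights, biases):
    """Concatenate the transformed preference with the action embedding."""
    return scenario_layer(preference, scenario, weights, biases) + list(action)


# Toy parameters: the two scenarios map the same preference to different vectors.
weights = {"m": [[1.0, 0.0], [0.0, 1.0]], "d": [[-1.0, 0.0], [0.0, -1.0]]}
biases = {"m": [0.0, 0.0], "d": [0.0, 0.0]}
pref = [0.5, -0.5]
in_entrance = scenario_layer(pref, "m", weights, biases)
in_detail = scenario_layer(pref, "d", weights, biases)
```

The concatenated vector returned by `critic_input` is what would be passed on to the fully connected layers of the critic.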


3.4. The Optimization Task

In this subsection, we discuss the objective functions with the optimization algorithm. As mentioned in Section 1, the majority of existing model-free RL-based recommender systems need huge amounts of users' behavior data. This requirement is hard to satisfy in practice, because real users will leave the system quickly if they receive almost randomly recommended items, which frequently happens in the initial model training (exploration) stage. Furthermore, since users' skip behaviors (with zero reward) occur much more frequently than users' click/purchase behaviors (with positive reward), the distributions of the immediate reward are extremely unbalanced, which can result in inaccurate updates of the action-value function. Therefore, we propose a model-based reinforcement learning framework for the whole-chain recommendation system, which can approximate the environment (user behaviors) to reduce the amount of training data required and perform more accurate optimization of the action-value function (Brafman and Tennenholtz, 2002; Kearns and Singh, 2002; Hu et al., 2018).

Under our setting with two scenarios, i.e., the entrance page and the item detail page, users have three types of behaviors in each scenario. In the entrance page, given a recommended item $a_t$ based on the current state $s_t$, users can: (i) skip the item and continue browsing in the entrance page with probability $p_m^s(s_t, a_t)$, (ii) click the item and go to the item detail page with probability $p_m^c(s_t, a_t)$, or (iii) leave the session with probability $p_m^l(s_t, a_t)$. Similarly, in the item detail page, given a state-action pair, users can: (i) click the item and go to another item detail page with probability $p_d^c(s_t, a_t)$, (ii) skip the item and go back to the entrance page with probability $p_d^s(s_t, a_t)$, or (iii) leave the session with probability $p_d^l(s_t, a_t)$. Then the approximation of the action-value function, referred to as $y_t$, can be formulated in a model-based form as follows:

$$
\begin{aligned}
y_t = \big[\, & p_m^s(s_t, a_t)\, \gamma\, Q_{\mu'}\big(s_{t+1}, \pi_{\theta_m'}(s_{t+1})\big) \\
+\, & p_m^c(s_t, a_t)\, \big( r_t + \gamma\, Q_{\mu'}\big(s_{t+1}, \pi_{\theta_d'}(s_{t+1})\big) \big) \\
+\, & p_m^l(s_t, a_t)\, r_t \,\big]\, 1_m \\
+ \big[\, & p_d^c(s_t, a_t)\, \big( r_t + \gamma\, Q_{\mu'}\big(s_{t+1}, \pi_{\theta_d'}(s_{t+1})\big) \big) \\
+\, & p_d^s(s_t, a_t)\, \gamma\, Q_{\mu'}\big(s_{t+1}, \pi_{\theta_m'}(s_{t+1})\big) \\
+\, & p_d^l(s_t, a_t)\, r_t \,\big]\, 1_d
\end{aligned} \tag{8}
$$

where the mutually exclusive indicators $1_m$ and $1_d$ control the activation of the two scenarios. Notations $\theta_m'$, $\theta_d'$ and $\mu'$ represent the parameters of the target networks of $\mathrm{Actor}_m$, $\mathrm{Actor}_d$ and the Critic, respectively, in the DDPG framework (Lillicrap et al., 2015). In Eq (8), the first row corresponds to the "skip" behavior in the entrance page, which leads to a nonzero Q-value, and $\mathrm{Actor}_m$ continues recommending the next item according to the new state $s_{t+1}$; the second row corresponds to the "click" behavior in the entrance page, which leads to a positive immediate reward and a nonzero Q-value, and $\mathrm{Actor}_d$ is activated to recommend the next item; the third row corresponds to the "leave" behavior in the entrance page, which leads to a negative immediate reward, and the session ends; the fourth row corresponds to the "click" behavior in the item detail page, which leads to a positive immediate reward and a nonzero Q-value, and $\mathrm{Actor}_d$ continues recommending the next item; the fifth row corresponds to the "skip" behavior in the item detail page, which leads to a nonzero Q-value, and $\mathrm{Actor}_m$ is re-activated to generate the next recommendation; the last row corresponds to the "leave" behavior in the item detail page, which leads to a negative immediate reward, and the session ends.
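The six-row target described above can be sketched as an expectation over the three user behaviors of the active scenario. In this illustration the per-behavior rewards are passed explicitly (click 1 and leave -2, as in the settings of Section 4.1, with skip contributing no immediate reward); that split, the dictionary layout and all names are assumptions of the sketch. Note that in both scenarios a "skip" hands control to the entrance-page actor and a "click" hands control to the item-detail actor, so only the probabilities differ between scenarios.

```python
def model_based_target(scenario, probs, q_next_m, q_next_d, gamma,
                       r_click=1.0, r_leave=-2.0):
    """Expected target value for one state-action pair.

    scenario : "m" (entrance page) or "d" (item detail page); selects which
               behavior probabilities are active.
    probs    : {"m": {...}, "d": {...}} with keys "skip", "click", "leave".
    q_next_m : target-critic value if the entrance-page actor acts next.
    q_next_d : target-critic value if the item-detail actor acts next.
    """
    p = probs[scenario]
    skip_term = p["skip"] * gamma * q_next_m                # no immediate reward
    click_term = p["click"] * (r_click + gamma * q_next_d)  # reward plus future value
    leave_term = p["leave"] * r_leave                       # session ends
    return skip_term + click_term + leave_term


# Toy usage with hypothetical probabilities and target-critic values.
probs = {
    "m": {"skip": 0.5, "click": 0.3, "leave": 0.2},
    "d": {"skip": 0.2, "click": 0.5, "leave": 0.3},
}
y_entrance = model_based_target("m", probs, q_next_m=1.0, q_next_d=2.0, gamma=0.9)
y_detail = model_based_target("d", probs, q_next_m=1.0, q_next_d=2.0, gamma=0.9)
```

Because the target averages over all three behaviors instead of using only the single logged outcome, rare clicks still contribute to every update, which is the motivation given for the model-based form.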

We leverage the off-policy DDPG algorithm (Lillicrap et al., 2015) to update the parameters of the proposed Actor-Critic framework based on the samples stored in a replay buffer (Mnih et al., 2015), and we introduce separated evaluation and target networks (Mnih et al., 2013) to help smooth the learning and avoid the divergence of parameters. Next, we will discuss the optimization of user behavior probabilities, actors and critic, separately.

3.4.1. Optimizing the State Transition Probability

In fact, the user behavior probabilities of the entrance page, $p_m(\cdot)$, and of the item detail page, $p_d(\cdot)$, are the state transition probabilities introduced in Section 2. In other words, users' different behaviors result in different state transitions. We develop one probability network to estimate the state transition probabilities. The architecture of this network is similar to that of the critic network shown in Figure 5: it takes the current state-action pair as input, and the only difference is that the output consists of two separate softmax layers that predict the state transition probabilities of the two scenarios. To update the parameters of the probability network, we leverage supervised learning, as in standard model-based reinforcement learning, and minimize the cross-entropy loss between the predicted probability vector and the ground-truth one-hot vector (in which, e.g., the entry for the "click" behavior is set to 1).
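The supervised objective for the probability network is an ordinary cross-entropy between a predicted behavior distribution and a one-hot label. A minimal sketch follows; the ordering of behaviors in `BEHAVIORS` and the helper names are assumptions for illustration.

```python
import math

BEHAVIORS = ("skip", "click", "leave")  # assumed ordering of the softmax outputs


def one_hot(behavior):
    """Ground-truth vector with a 1 at the observed behavior."""
    return [1.0 if b == behavior else 0.0 for b in BEHAVIORS]


def cross_entropy(predicted, target, eps=1e-12):
    """Cross-entropy between a predicted distribution and a one-hot target."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(predicted, target))


# The loss is small when the network puts high probability on the logged behavior.
confident = cross_entropy([0.1, 0.8, 0.1], one_hot("click"))
unsure = cross_entropy([0.6, 0.2, 0.2], one_hot("click"))
```

Only the probability assigned to the logged behavior enters the loss, so minimizing it pushes that probability towards 1.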

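3.4.2. Optimizing the Critic Parameters

The critic, i.e., the action-value function $Q_{\mu}(s_t, a_t)$, can be optimized by minimizing the loss function $L(\mu)$ as follows:

$$L(\mu) = \mathbb{E}_{s_t, a_t, r_t, s_{t+1}} \left[ \big( y_t - Q_{\mu}(s_t, a_t) \big)^2 \right] \tag{9}$$

where $\mu$ represents all the parameters of the critic (evaluation network), and $y_t$ is defined in Eq (8). The target-network parameters $\theta_m'$, $\theta_d'$ and $\mu'$ learned from the previous iteration and the state transition probabilities in Eq (8) are fixed when optimizing the loss function $L(\mu)$. The derivative of the loss function $L(\mu)$ with respect to the parameters $\mu$ is as follows:

$$\nabla_{\mu} L(\mu) = -2\, \mathbb{E}_{s_t, a_t, r_t, s_{t+1}} \left[ \big( y_t - Q_{\mu}(s_t, a_t) \big)\, \nabla_{\mu} Q_{\mu}(s_t, a_t) \right] \tag{10}$$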
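To make the critic update concrete, the sketch below uses a deliberately simple linear critic (the value is a dot product of parameters and state-action features) in place of the neural network, and treats the targets as fixed numbers. It illustrates the squared-error loss and its gradient only; the linear form and all names are assumptions for this example.

```python
def q_value(mu, features):
    """Linear stand-in for the critic: dot product of parameters and features."""
    return sum(m * x for m, x in zip(mu, features))


def critic_loss(mu, batch):
    """Mean squared error between fixed targets y and critic outputs."""
    return sum((y - q_value(mu, x)) ** 2 for x, y in batch) / len(batch)


def critic_gradient(mu, batch):
    """Gradient of the mean squared error with respect to mu."""
    grad = [0.0] * len(mu)
    for x, y in batch:
        error = y - q_value(mu, x)
        for i, xi in enumerate(x):
            grad[i] += -2.0 * error * xi / len(batch)
    return grad


def sgd_step(mu, batch, learning_rate):
    """One gradient-descent step on the critic parameters."""
    grad = critic_gradient(mu, batch)
    return [m - learning_rate * g for m, g in zip(mu, grad)]


# Toy minibatch of (state-action features, target y) pairs.
batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
mu = [0.0, 0.0]
loss_before = critic_loss(mu, batch)
mu = sgd_step(mu, batch, learning_rate=0.5)
loss_after = critic_loss(mu, batch)
```

The targets stay constant during the step, mirroring the statement that the target-network parameters and transition probabilities are held fixed while the critic is optimized.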


3.4.3. Optimizing the Actor Parameters

In the typical on-policy setting, the actors can be updated by maximizing the action-value $Q_{\mu}(s_t, \pi_{\theta}(s_t))$ using the policy gradient:

$$\nabla_{\theta} J \approx \mathbb{E}_{s_t} \left[ \nabla_{a} Q_{\mu}(s_t, a) \big|_{a = \pi_{\theta}(s_t)}\, \nabla_{\theta} \pi_{\theta}(s_t) \right] \tag{11}$$

where $\theta$ can represent the parameters of $\mathrm{Actor}_m$ or $\mathrm{Actor}_d$. However, in the off-policy setting based on users' historical behavior data, the recommended item (ground-truth action $a_t$) and the corresponding reward $r_t$ are given in the historical data. Thus, whatever action the actors output, the ground-truth action is fixed in the historical data, which disconnects the outputs of the actors from the inputs of the critic. Motivated by existing work (Lillicrap et al., 2015; Dulac-Arnold et al., 2015; Zhao et al., 2018a), the action output by the actors and the action input to the critic should be similar. Therefore, before updating the actors and the critic following the traditional on-policy schema, we first minimize the cross-entropy loss between (i) the probability vector output by the actors, referred to as $a_t^{\pi}$, and (ii) the one-hot vector of the ground-truth recommended item in the historical data, $a_t$. This updates the actors' parameters in the direction of pushing $a_t^{\pi}$ and $a_t$ to be similar, which connects the actors and the critic for off-policy training. The objective function can be written as follows:

$$L(\theta) = -\,\mathbb{E}_{s_t, a_t} \left[ \sum_{i} a_t^{(i)} \log a_t^{\pi,(i)} \right] \tag{12}$$

where the sum runs over the candidate items and the superscript $(i)$ denotes the $i$-th entry of a vector.

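1:  Randomly initialize actor and critic networks $\pi_{\theta_m}$, $\pi_{\theta_d}$, $Q_{\mu}$
2:  Initialize target networks $\pi_{\theta_m'} \leftarrow \pi_{\theta_m}$, $\pi_{\theta_d'} \leftarrow \pi_{\theta_d}$, $Q_{\mu'} \leftarrow Q_{\mu}$
3:  Initialize the capacity of replay buffer $\mathcal{D}$
4:  for session $= 1, \dots, M$ do
5:     Receive initial observation state $s_1$
6:     for $t = 1, \dots, T$ do
7:        Observe $(s_t, a_t, r_t, s_{t+1})$ following off-policy $b(s_t)$
8:        Store transition $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
9:        Sample minibatch of transitions $(s, a, r, s')$ from $\mathcal{D}$
10:        Update $p_m(\cdot)$ and $p_d(\cdot)$ according to Section 3.4.1
11:        Update Actors $\pi_{\theta_m}$, $\pi_{\theta_d}$ by minimizing Eq (12)
12:        Compute $y$ according to Eq (8)
13:        Update Critic by minimizing $L(\mu)$ according to Eq (10)
14:        Update Actors $\pi_{\theta_m}$, $\pi_{\theta_d}$ using the sampled policy gradient according to Eq (11)
15:        Update the target networks: $\theta_m' \leftarrow \tau \theta_m + (1 - \tau) \theta_m'$, $\theta_d' \leftarrow \tau \theta_d + (1 - \tau) \theta_d'$, $\mu' \leftarrow \tau \mu + (1 - \tau) \mu'$
16:     end for
17:  end for
Algorithm 1 Off-policy Training for DeepChain with DDPG.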

3.4.4. The Training Algorithm

The off-policy training algorithm for DeepChain is presented in Algorithm 1. In each iteration, there are two stages, i.e., 1) the transition generating stage (lines 7-8), and 2) the parameter updating stage (lines 9-15). In the transition generating stage, we first observe the transition $(s_t, a_t, r_t, s_{t+1})$ following the offline behavior policy $b(s_t)$ that generated the historical behavior data (line 7), then we store the transition in the replay buffer $\mathcal{D}$ (line 8). In the parameter updating stage, we first sample a mini-batch of transitions from $\mathcal{D}$ (line 9), then we update the state transition probabilities by supervised learning as mentioned in Section 3.4.1 (line 10), next we connect the actors and the critic by minimizing Eq (12) for off-policy training (line 11), and finally we update the critic and the actors (lines 12-15) following a standard DDPG procedure (Lillicrap et al., 2015). Note that it is straightforward to extend the off-policy training to on-policy training: (i) in the transition generating stage, we can collect transitions during the interactions with real users; and (ii) we can remove line 11, because in the on-policy setting the action output by the actors is the same as the action input to the critic.
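Two generic building blocks of the DDPG-style training procedure, a bounded replay buffer and the soft update of target-network parameters, can be sketched as follows. This is a plain illustration of those two techniques, not the authors' implementation; the class and function names are hypothetical and parameters are represented as flat lists of floats.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        """Uniformly sample up to batch_size transitions without replacement."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def soft_update(target, online, tau):
    """Move each target parameter a fraction tau towards the online parameter."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


# Toy usage: transitions are (state, action, reward, next_state) tuples.
buffer = ReplayBuffer(capacity=2)
for step in range(3):
    buffer.store((step, "item", 0.0, step + 1))
minibatch = buffer.sample(batch_size=2)
new_target = soft_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=0.1)
```

A small `tau` keeps the target networks changing slowly, which is the smoothing effect attributed to separate evaluation and target networks.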

3.5. The Test Tasks

In this subsection, we will present the test tasks of the DeepChain framework. We propose two test tasks, i.e., (i) Offline test: testing the proposed framework based on user’s historical behavior data; and (ii) Online test: testing the proposed framework in real online environment where the agents interact with real-world users and receive immediate reward (real-time feedback) of the recommended items from users. Note that offline test is necessary because recommendation algorithms should be pre-trained (by the off-policy algorithm in Section 3.4) and evaluated offline before launching them in the real online system, which ensures the recommendation quality and mitigates the negative influence on user experience.

1:  Initialize actors with well-trained parameters $\theta_m$ and $\theta_d$
2:  Observe the initial state $s_1$
3:  for $t = 1, \dots, T$ do
4:     if the user is in the entrance (main) page then
5:        Execute an action $a_t$ following policy $\pi_{\theta_m}(s_t)$
6:     else
7:        Execute an action $a_t$ following policy $\pi_{\theta_d}(s_t)$
8:     end if
9:     Observe the reward $r_t$ and transition to new state $s_{t+1}$
10:  end for
Algorithm 2 Online Test of DeepChain.

3.5.1. Online Test

The online test algorithm is presented in Algorithm 2. In each iteration of a recommendation session, given the current state $s_t$ and scenario, one actor is activated to recommend an item to the user following policy $\pi_{\theta_m}$ or $\pi_{\theta_d}$ (line 5 or 7). Then the system observes the reward $r_t$ from the user and updates the state to $s_{t+1}$ (line 9).
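The online loop amounts to dispatching each step to the actor of the current scenario and accumulating the observed reward. The sketch below assumes a hypothetical `user_step` callback standing in for the real user, which returns the reward, the new state and the next scenario (or `None` when the user leaves); none of these names come from the paper.

```python
def run_online_session(state, scenario, policies, user_step, max_steps):
    """Activate one actor per step according to the current scenario.

    policies  : {"m": policy_fn, "d": policy_fn}; each maps a state to an action.
    user_step : hypothetical stand-in for the real user; maps
                (state, scenario, action) to (reward, new_state, next_scenario),
                with next_scenario set to None when the user leaves.
    """
    total_reward = 0.0
    for _ in range(max_steps):
        action = policies[scenario](state)
        reward, state, scenario = user_step(state, scenario, action)
        total_reward += reward
        if scenario is None:  # the session ended
            break
    return total_reward
```

The returned total corresponds to the overall reward of one recommendation session, which is the metric used for the online test.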

Input: Item list $I = \{i_1, i_2, \dots\}$ and related reward list $R = \{r_1, r_2, \dots\}$ of a session.
Output: Re-ranked recommendation list $L$

1:  Initialize actor with well-trained parameters $\theta$
2:  Receive initial observation state $s_1$
3:  while $|I| > 0$ do
4:     Execute an action $a_t$ following policy $\pi_{\theta}(s_t)$
5:     Add $a_t$ to the end of $L$
6:     Observe reward $r_t$ from users (historical data)
7:     Observe new state $s_{t+1}$
8:     Remove $a_t$ from $I$
9:  end while
Algorithm 3 Offline Test of DeepChain.

3.5.2. Offline Test

The intuition of the offline test is that, given the data of a historical recommendation session, if DeepChain works well, it can re-rank the items in this session so that the ground-truth clicked items are placed at the top of the new list. DeepChain only re-ranks the items in this session rather than all items from the item space, because the offline data only contains the ground-truth rewards of the items in this session. The offline test algorithm is presented in Algorithm 3. In each iteration, given the state $s_t$, the actor recommends an item to the user following policy $\pi_{\theta}$ (line 4): we calculate, for every item remaining in the item list $I$, the probability of being recommended at the next position, and select the item with the highest probability as $a_t$. Then we add $a_t$ to the new recommendation list $L$ (line 5), and record the reward $r_t$ from the historical data (line 6). Next we update the state to $s_{t+1}$ (line 7). Finally, we remove $a_t$ from $I$ (line 8), which avoids recommending the same item repeatedly. Note that in the offline test setting, we collect user behavior data in the two scenarios separately and re-rank the items in each scenario.
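The offline re-ranking loop can be sketched as a greedy selection over the items of one logged session. The `score` callback is a hypothetical stand-in for the actor's recommendation probability, items are assumed to be unique and hashable, and the state is assumed to grow only when the logged reward is positive (a click or purchase); all of these are assumptions of the sketch.

```python
def rerank_session(items, rewards, initial_state, score):
    """Greedily re-rank the items of one logged session.

    score(state, item) stands in for the actor's probability of recommending
    `item` next; higher means more likely.
    Returns the re-ranked items and their logged rewards in the new order.
    """
    remaining = dict(zip(items, rewards))  # item -> logged reward
    state = list(initial_state)
    reranked, observed = [], []
    while remaining:
        best = max(remaining, key=lambda item: score(state, item))
        reward = remaining.pop(best)       # remove so it is never repeated
        reranked.append(best)
        observed.append(reward)
        if reward > 0:                     # clicked/purchased items extend the state
            state.append(best)
    return reranked, observed


# Toy usage with fixed, hypothetical scores.
toy_scores = {"a": 0.1, "b": 0.9, "c": 0.5}
order, logged = rerank_session(["a", "b", "c"], [0, 1, 0], [],
                               lambda state, item: toy_scores[item])
```

The logged rewards in the new order are what a ranking metric is then computed on.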

4. Experiment

In this section, we conduct extensive experiments to evaluate the effectiveness of the proposed framework based on a real-world dataset from an e-commerce platform. We mainly focus on two questions: (1) how the proposed framework performs compared to the state-of-the-art baselines; and (2) how the components in the framework contribute to the performance. We first introduce experimental settings. Then we seek answers to the above two questions. Finally, we study the impact of key parameters on the performance of the proposed framework.

4.1. Experimental Settings

We evaluate the DeepChain framework on a real-world dataset collected in December 2018 from an e-commerce platform. We randomly collect 500,000 recommendation sessions (with 19,667,665 items) in temporal order, and use the first 80% of the sessions as the training/validation datasets and the remaining 20% as the test dataset. For a new session, the initial state (users’ historical clicked items) consists of the items clicked in the user’s previous sessions. The immediate rewards of click, skip, and leave behaviors are empirically set to 1, 0, and -2, respectively. The dimensions of item representation vector and hidden state of RNN are and . The discounted factor , and the rate for soft updates of target networks . We select the parameters of the DeepChain framework via cross-validation, and tune the parameters of the baselines for a fair comparison. More details about the parameter analysis are discussed in the following subsections. For the offline test, we select NDCG (Järvelin and Kekäläinen, 2002) and MAP (Turpin and Scholer, 2006) as the metrics. For the online test, we use the overall reward in one recommendation session as the metric.
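For reference, the two offline metrics can be computed from a re-ranked list of per-position relevance labels as in the sketch below. This is one common formulation (linear gain with a log2 position discount for NDCG, binary relevance for average precision); whether the experiments use exactly this variant is not stated here, so it should be read as an assumption. All function names are illustrative.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG of the given order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def average_precision(relevances: Sequence[float]) -> float:
    """Mean of precision@k over the positions k that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:  # assumption: any positive label (e.g. a click) is relevant
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(sessions: Sequence[Sequence[float]]) -> float:
    """MAP: average precision averaged over all re-ranked sessions."""
    if not sessions:
        return 0.0
    return sum(average_precision(s) for s in sessions) / len(sessions)
```

Here each input list holds the logged labels in the order produced by the re-ranking, so a list that places all clicked items first obtains NDCG and average precision of 1.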

Figure 6. Overall performance comparison in offline test.

4.2. Performance Comparison for Offline Test

We compare the proposed framework with the following representative baseline methods:

  • FM (Rendle, 2010): Factorization Machines combine the advantages of support vector machines with factorization models. Compared with matrix factorization, higher order interactions can be modeled using the dimensionality parameter.

  • GRU (Hidasi et al., 2015): GRU4Rec leverages the RNN with GRU to predict what a user will click next based on the clicking history. We also keep clicked items as the state for fair comparison.

  • DDPG (Dulac-Arnold et al., 2015): This baseline uses conventional Deep Deterministic Policy Gradient with five fully connected layers in both Actor and Critic. The input for Actor is the concatenation of embeddings of users’ historical clicked items (state). The input for Critic is the concatenation of state and a recommended item (action).

  • MA (Feng et al., 2018): MA-RDPG is a multi-agent model-free RL model, which employs continuous actions, deterministic policies, and recurrent message encodings by a centralized critic, private actors (agents), and a communication component.

The results are shown in Figure 6. Note that in the offline test, we separately collect user behavior data from two scenarios and re-rank the items in each scenario by the corresponding agent. We make the following observations:

  • GRU outperforms FM, since GRU can capture the temporal sequence within one recommendation session, while FM neglects it. This result also demonstrates the advantage of deep learning techniques in the recommendation task.

  • DDPG achieves better performance than GRU, since DDPG can optimize overall performance of one recommendation session, but GRU only maximizes the immediate reward. This result validates the advantage of RL techniques in recommendations.

  • FM, GRU and DDPG perform worse than MA and DeepChain, because the first three baselines are one-agent models where each agent is trained in one scenario separately, while MA and DeepChain are multi-agent models where agents are jointly trained on two scenarios (the whole dataset) to optimize the global performance.

  • DeepChain outperforms MA, since a model-based RL model such as DeepChain can optimize the action-value function more accurately with less training data.

To sum up, DeepChain outperforms representative baselines, which demonstrates its effectiveness in recommendations.

Figure 7. Overall performance comparison in online test.

4.3. Performance Comparison for Online Test

Following (Zhao et al., 2018b, a), we evaluate the proposed framework on a simulated online environment, where a model-based deep neural network is pre-trained to estimate users’ behaviors based on any state-action pairs. We compare DeepChain with FM, GRU, DDPG and MA as in offline test. Furthermore, to answer the second question, we systematically eliminate the corresponding components of DeepChain by defining the following variants:

  • DC-o: This variant is a one-agent version of DeepChain. In other words, only one recommender agent is trained to generate recommendations in both the entrance page and item detail page.

  • DC-f: This variant is a model-free version of DeepChain, which does not estimate the user behavior probabilities as mentioned in Section 3.4.1.

The results are shown in Figure 7. It can be observed:

  • In Figure 7(a), we observe comparison results between DeepChain and the state-of-the-art baselines similar to those in the offline test.

  • DC-o performs worse than DeepChain, since DC-o trains only one recommender agent for both scenarios. This result indicates that users’ interests differ across scenarios. Thus, developing separate recommender agents for different scenarios is necessary.

  • DC-f achieves worse performance than DeepChain. The key reasons include: (i) the model-free version DC-f requires more training data; and (ii) DC-f optimizes the action-value function less accurately than the model-based DeepChain. This result validates the effectiveness of model-based RL in recommendations.

In summary, appropriately developing separate recommender agents and introducing model-based techniques to update action-value function can boost the recommendation performance.
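The online metric, i.e., the overall reward of one recommendation session in the simulated environment, can be sketched as follows. The reward values for click, skip, and leave (1, 0, and -2) are those of Section 4.1. Everything else is an assumption for illustration: `policy` and `simulator` are hypothetical callables standing in for a recommender agent and the pre-trained user-behavior model, the total is an undiscounted sum, the session ends when the simulated user leaves, and `max_steps` is an arbitrary cap.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[str, ...]

# Immediate rewards from Section 4.1.
REWARD = {"click": 1.0, "skip": 0.0, "leave": -2.0}


def session_reward(
    initial_state: Sequence[str],
    policy: Callable[[State], str],
    simulator: Callable[[State, str], str],
    max_steps: int = 20,  # hypothetical cap on session length
) -> float:
    """Roll out one simulated session and return its overall reward."""
    state: State = tuple(initial_state)
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)               # agent recommends an item
        behavior = simulator(state, action)  # simulated user feedback
        total += REWARD[behavior]
        if behavior == "leave":
            break                            # the session ends
        if behavior == "click":
            # Assumption: the state is a window of clicked items.
            state = state[1:] + (action,)
    return total
```

Averaging this quantity over many simulated sessions gives a single number per method, which is how a comparison such as the one in Figure 7 could be produced under these assumptions.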

Figure 8. Parameter sensitiveness.

4.4. Parameter Sensitivity Analysis

In this subsection, we investigate how the proposed framework DeepChain performs as the length of users’ browsing history (state) changes, while fixing the other parameters.

Figure 8 (a) shows the sensitivity to the history length in the online test: as the history length increases, the overall performance improves. Figure 8 (b) shows the sensitivity to the history length in the offline test. The recommendation performance on the entrance page is more sensitive to the history length than that on the item detail page. The reason is that users’ interests differ between the two scenarios. On the entrance page, users’ preferences are diverse, so a longer browsing history better reveals their various interests. On a specific item’s detail page, users’ attention focuses mainly on items similar to that item, i.e., users would like to compare it with similar ones, so a longer browsing history does not significantly improve the performance.

5. Related Work

In this section, we briefly review work related to our study, i.e., reinforcement learning based recommender systems, which typically formulate the recommendation task as a Markov Decision Process (MDP) and model the recommendation procedure as sequential interactions between users and the recommender system. Practical recommender systems typically have millions of items (discrete actions) to recommend, and most RL-based models become inefficient because they cannot handle such a large discrete action space. A Deep Deterministic Policy Gradient (DDPG) algorithm is introduced to mitigate the large action space issue in practical RL-based recommender systems (Dulac-Arnold et al., 2015). To avoid the inconsistency of DDPG and improve recommendation performance, a tree-structured policy gradient is proposed in (Chen et al., 2018b). A biclustering technique is also introduced to model recommender systems as grid-world games so as to reduce the state/action space (Choi et al., 2018). To solve the unstable reward distribution problem in dynamic recommendation environments, an approximate regretted reward technique is proposed with Double DQN to obtain a reference baseline from individual customer samples (Chen et al., 2018d). Users’ positive and negative feedback, i.e., purchase/click and skip behaviors, are jointly considered in one framework to boost recommendations, since both types of feedback can represent part of users’ preferences (Zhao et al., 2018b); improvements in both architecture and formulation are introduced to capture positive and negative feedback in a unified RL framework. A page-wise recommendation framework is proposed to jointly recommend a page of items and display them within a 2-D page (Zhao et al., 2017, 2018a), where a CNN is introduced to capture the item display patterns and users’ feedback on each item in the page.
In the news feed scenario, a DQN-based framework is proposed to handle the challenges of conventional models, i.e., (1) modeling only the current reward such as CTR, (2) not considering click/skip labels, and (3) feeding similar news to users (Zheng et al., 2018). An RL framework for explainable recommendation is proposed in (Wang et al., 2018), which can explain any recommendation model and flexibly control the explanation quality based on the application scenario. A policy gradient-based top-K recommender system for YouTube is developed in (Chen et al., 2018a), which addresses biases in logged data by incorporating a learned logging policy and a novel top-K off-policy correction. Other applications include sellers’ impression allocation (Cai et al., 2018a), fraudulent behavior detection (Cai et al., 2018b), and user state representation (Liu et al., 2018).

6. Conclusion

In this paper, we propose a novel multi-agent model-based reinforcement learning framework (DeepChain) for the whole-chain recommendation problem. It collaboratively trains multiple recommender agents for different scenarios with a model-based optimization algorithm. Multi-agent RL-based recommender systems have three advantages: (i) the recommender agents are sequentially activated to capture the sequential dependency of users’ behaviors among different scenarios; (ii) the recommender agents share the same memory of users’ historical behavior information to make more accurate decisions; and (iii) the recommender agents work collaboratively to maximize the global performance of one recommendation session. In addition, the model-based RL optimization algorithm we design reduces the requirement for training data and optimizes the action-value function more accurately than model-free algorithms. We conduct extensive experiments based on a real-world dataset from an e-commerce platform. The results show that (i) DeepChain can significantly enhance the recommendation performance; and (ii) multi-agent techniques and model-based RL can assist the recommendation task.

There are several interesting research directions. First, beyond the whole-chain recommendation task in e-commerce platforms, we would like to develop one framework that jointly addresses more tasks, such as search, recommendation, and advertisement, under a whole-chain schema. Second, we would like to incorporate agent-user interactions from more scenarios, e.g., the shopping cart page and the order page, and investigate how to model them mathematically for recommendations. Finally, the DeepChain framework proposed in this work is quite general, and we would like to investigate more applications beyond e-commerce.