Toward Simulating Environments in Reinforcement Learning Based Recommendations

06/27/2019 ∙ by Xiangyu Zhao, et al. ∙ Association for Computing Machinery, Inc. ∙ Michigan State University

With the recent advances in Reinforcement Learning (RL), there has been tremendous interest in employing RL for recommender systems. RL-based recommender systems have two key advantages: (i) they can continuously update their recommendation strategies according to users' real-time feedback, and (ii) the optimal strategy maximizes the long-term reward from users, such as the total revenue of a recommendation session. However, directly training and evaluating a new RL-based recommendation algorithm requires collecting users' real-time feedback in the real system, which is time-consuming and labor-intensive and could negatively impact users' experiences. This calls for a user simulator that can mimic real users' behaviors, on which we can pre-train and evaluate new recommendation algorithms. Simulating users' behaviors in a dynamic system faces immense challenges: (i) the underlying item distribution is complex, and (ii) historical logs for each user are limited. In this paper, we develop a user simulator based on Generative Adversarial Network (GAN). To be specific, we design the generator to capture the underlying distribution of users' historical logs and generate realistic logs that can be considered as augmentations of real logs, while the discriminator is developed to not only distinguish real and fake logs but also predict users' behaviors. The experimental results based on real-world e-commerce data demonstrate the effectiveness of the proposed simulator. Further experiments have been conducted to understand the importance of each component in the simulator.




1. Introduction

With the explosive growth of the world-wide web, huge amounts of data have been generated, resulting in an increasingly severe information overload problem in which users are overwhelmed by massive amounts of information (Chang et al., 2006). Recommender systems can mitigate the information overload problem by suggesting personalized items (products, services, or information) that best match users’ preferences (Linden et al., 2003; Breese et al., 1998; Mooney and Roy, 2000; Zhao et al., 2016; Guo et al., 2016). The majority of existing recommender systems, e.g., content-based, learning-to-rank and collaborative filtering, often face several common challenges. First, most traditional recommender systems consider the recommendation task as a static procedure and generate recommendations via a fixed greedy strategy. However, these approaches may fail to capture the evolution of users’ preferences over time. Second, most current recommender systems have been developed to maximize the short-term reward of recommendations (e.g., the immediate revenue from users purchasing the recommended items), but completely neglect whether these recommendations will result in more profitable rewards in the long run (Shani et al., 2005).

Figure 1. An example of system-user interactions.

Recently, with the rapid development of Reinforcement Learning techniques, a wide range of complex tasks such as the game of Go (Silver et al., 2016; Silver et al., 2017), video games (Mnih et al., 2013, 2016) and robotics (Kretzschmar et al., 2016) have seen unprecedented advances. In the reinforcement learning framework, an intelligent agent learns to solve a complex task by acquiring experience from trial-and-error interactions with a dynamic environment. The goal is to automatically learn an optimal policy (solution) for the complex task without any specific instructions (Kaelbling et al., 1996). Applying reinforcement learning to recommendation tasks can naturally address the aforementioned challenges, where recommendation procedures are considered as sequential interactions between users and a recommender system (Zhao et al., 2017, 2018b, 2019, 2018a, 2018c) as shown in Figure 1. In each iteration, the recommender system suggests a set of items to the user; then the user browses the recommended items and provides her/his real-time feedback; and next the system updates its recommendation strategy according to the user’s feedback. First, considering the recommendation task as sequential interactions between an agent (recommender system) and environment (users), the agent can continuously update its strategies according to users’ real-time feedback (reward function) during the interactions, until the system generates items that best fit users’ preferences (the optimal policy). Second, the RL mechanism is developed to maximize the cumulative (long-term) reward from users, e.g., the total revenue of a recommendation session. Therefore, the agent is able to identify items that yield smaller immediate rewards but benefit the cumulative reward in the long run.
Given the advantages of reinforcement learning, there has very recently been tremendous interest in developing RL-based recommender systems (Dulac-Arnold et al., 2015; Xia et al., 2017; Zhao et al., 2018c, b, 2017; Zheng et al., 2018).

RL-based recommendation algorithms should ideally be trained and evaluated based on users’ real-time feedback (reward function). The most practical way is an online A/B test, where a new recommendation algorithm is trained based on the feedback from real users and its performance is compared against that of the previous algorithm via randomized experiments. However, online A/B tests are expensive and inefficient: (1) online A/B tests usually take several weeks to collect statistically sufficient data, and (2) substantial engineering effort is required to deploy the new algorithm in the real system (Yang et al., 2018; Gilotte et al., 2018; Li et al., 2015). Furthermore, online A/B tests often lead to a bad user experience in the initial stage, when the new recommendation algorithms have not yet been well trained (Li et al., 2012). These reasons prevent us from quickly training and testing new RL-based recommendation algorithms. One successful solution to these challenges in the RL community is to first build a simulator to approximate the environment (e.g., OpenAI Gym for video games), and then use it to train and evaluate the RL algorithms (Gao, 2014). Thus, following this best practice, we aim to build a user simulator based on users’ historical logs in this work, which can be utilized to pre-train and evaluate new recommendation algorithms before launching them online.

However, simulating users’ behaviors (preferences) in a dynamic recommendation environment is very challenging. First, there are millions of items in practical recommender systems, so the underlying distribution of recommended item sequences in historical logs is widely spread and extremely complex. Second, learning a robust simulator typically requires large-scale historical logs from each user as training data; though massive historical logs are often available overall, the data available for each user is rather limited. In an attempt to tackle these two challenges, we propose a simulator (RecSimu) for reinforcement learning based recommendations, built on Generative Adversarial Network (GAN) (Goodfellow et al., 2014). We summarize our major contributions as follows:

  • We introduce a principled approach to capture the underlying distribution of recommended item sequences in historical logs, and generate realistic item sequences;

  • We propose a user behavior simulator RecSimu, which can be utilized to simulate environments to pre-train and evaluate reinforcement learning based recommender systems; and

  • We conduct experiments based on real-world data to demonstrate the effectiveness of the proposed simulator and validate the effectiveness of its components.

Figure 2. A common setting for RL-based recommendations.

2. Problem Statement

Following one common setting as shown in Figure 2, we first formally define reinforcement learning based recommendations (Zhao et al., 2018c, b) and then present the problem we aim to solve based on this setting. In this setting, we treat the recommendation task as sequential interactions between a recommender system (agent) and users (environment $\mathcal{E}$), and use a Markov Decision Process (MDP) to model it. It consists of a sequence of states, actions and rewards. Typically, an MDP involves four elements $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, and below we introduce how to set them. Note that there are other settings (Dulac-Arnold et al., 2015; Zhao et al., 2017; Zheng et al., 2018) and we leave further investigation of them as future work.

  • State space $\mathcal{S}$: We define the state $s = \{i_1, \cdots, i_N\} \in \mathcal{S}$ as a sequence of $N$ items that a user browsed and the user’s corresponding feedback for each item. The items in $s$ are chronologically sorted.

  • Action space $\mathcal{A}$: An action $a \in \mathcal{A}$ from the recommender system perspective is defined as recommending a set of items to a user. Without loss of generality, we suppose that each time the recommender system suggests one item to the user, but it is straightforward to extend this setting to recommending more items.

  • Reward $\mathcal{R}$: When the system takes an action $a$ based on the state $s$, the user will browse the recommended item and provide her feedback on the item. In this paper, we assume a user could skip, click or purchase the recommended item. Then the recommender system will receive a reward $r(s, a)$ solely according to the type of feedback.

  • State transition probability $\mathcal{P}$: The state transition probability $p(s' \mid s, a)$ is defined as the probability that the state transits from $s$ to $s'$ when action $a$ is executed. We assume that the MDP satisfies the Markov property $p(s' \mid s, a, \ldots, s_1, a_1) = p(s' \mid s, a)$, and the state transition is deterministic: we always remove the first item $i_1$ from $s$ and add the action $a$ at the bottom of $s$, i.e., $s' = \{i_2, \cdots, i_N, a\}$.
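The deterministic transition and the feedback-only reward described in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Interaction` container, the item ids, and in particular the numeric reward values are assumptions, since the text only states that the reward depends solely on the feedback type.

```python
from collections import namedtuple

# Hypothetical container: one browsed item and the user's feedback on it.
Interaction = namedtuple("Interaction", ["item_id", "feedback"])

# Assumed reward values per feedback type (illustrative only; the text
# does not specify the numbers).
REWARD_BY_FEEDBACK = {"skip": 0.0, "click": 1.0, "purchase": 5.0}


def reward(feedback):
    """The reward depends only on the type of feedback."""
    return REWARD_BY_FEEDBACK[feedback]


def transition(state, item_id, feedback):
    """Deterministic transition: drop the oldest item, append the new one."""
    return state[1:] + (Interaction(item_id, feedback),)


state = (
    Interaction("i1", "click"),
    Interaction("i2", "skip"),
    Interaction("i3", "purchase"),
)
next_state = transition(state, "i4", "click")
```

The state keeps a fixed length, so it behaves like a sliding window over the most recent interactions.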

With the aforementioned definitions and notations, in this paper, we aim to build a simulator to imitate users’ feedback (behavior) on a recommended item according to the user’s preference learned from the user’s browsing history, as demonstrated in Figure 2. In other words, the simulator aims to mimic the reward function $r(s, a)$. More formally, the goal of a simulator can be defined as follows: Given a state-action pair $(s, a)$, the goal is to find a reward function $r(s, a)$ that imitates the user’s behaviors as closely as possible.

Figure 3. An overview of the proposed simulator.
Figure 4. The proposed generator with Encoder-Decoder architecture.

3. The Proposed Simulator

In this section, we propose a simulator framework that imitates users’ feedback (behavior) on a recommended item according to the user’s current preference learned from her browsing history. As discussed in Section 1, building a user simulator is challenging, since (1) the underlying distribution of recommended item sequences in users’ historical logs is complex, and (2) historical data for each user is usually limited.

Recent efforts have demonstrated that Generative Adversarial Network (GAN) and its variants are able to generate fake but realistic images (Goodfellow et al., 2014), which implies their potential in modeling complex distributions. Furthermore, the generated images can be considered as augmentations of real-world images to enlarge the data space. Driven by these advantages, we propose to build the simulator based on GAN to capture the complex distribution of users’ browsing logs and generate realistic logs to enrich the training dataset. Another challenge with a GAN-based simulator is that the discriminator should not only distinguish real logs from generated logs, but also predict the user’s feedback on a recommended item. To address these challenges, we propose a recommendation simulator as shown in Figure 3, where the generator with a supervised component is designed to learn the data distribution and generate indistinguishable logs, and the discriminator with a supervised component can simultaneously distinguish real/generated logs and predict the user’s feedback on a recommended item. In the following, we will first introduce the architectures of the generator and discriminator separately, and then discuss the objective functions with the optimization algorithm.

3.1. The Generator Architecture

The goal of the generator is to learn the data distribution and then generate indistinguishable logs (actions) based on users’ browsing history (states), i.e., to imitate the recommendation policy of the recommender system that generated the historical logs. Figure 4 illustrates the generator with the Encoder-Decoder architecture. The Encoder component aims to learn the user’s preference according to the items browsed by the user and the user’s feedback. The input is the state $s = \{i_1, \cdots, i_N\}$ that is observed in the historical logs, i.e., the sequence of $N$ items that a user browsed and the user’s corresponding feedback for each item. The output is a low-dimensional representation of the user’s current preference, referred to as $p^E$. Each item $i_n \in s$ involves two types of information:

$$i_n = (e_n, f_n), \tag{1}$$

where $e_n$ is a low-dimensional and dense item-embedding of the recommended item, and $f_n$ is a one-hot vector representation denoting the user’s feedback on the recommended item. (The item-embeddings are pre-trained by an e-commerce company via word embedding (Levy and Goldberg, 2014) based on users’ historical browsing logs, where each item is considered as a word and the item sequence of a recommendation session as a sentence. The effectiveness of these item representations has been demonstrated in the company’s business, such as search, recommendation and advertisement.) The intuition behind selecting these two types of information is that we not only want to learn the information of each item in the sequence, but also want to capture the user’s interest (feedback) in each item. We use an embedding layer to transform $f_n$ into a low-dimensional and dense vector: $F_n = \tanh(W_F f_n + b_F)$. Note that we use the “tanh” activation function since $e_n \in (-1, 1)$. Then, we concatenate $e_n$ and $F_n$, and get a low-dimensional and dense vector $I_n$ (with $|I_n| = |e_n| + |F_n|$) as:

$$I_n = \mathrm{concat}(e_n, F_n). \tag{2}$$

Note that all embedding layers share the same parameters $W_F$ and $b_F$, which reduces the number of parameters and improves generalization. We introduce a Recurrent Neural Network (RNN) with Gated Recurrent Units (GRU) to capture the sequential patterns of items in the logs. We choose GRU rather than Long Short-Term Memory (LSTM) because of GRU’s fewer parameters and simpler architecture. Unlike LSTM, which utilizes an input gate and a forget gate to yield a new state, GRU uses an update gate $z_n$:

$$z_n = \sigma(W_z I_n + U_z h_{n-1}). \tag{3}$$

GRU employs a reset gate $r_n$ to control the former state $h_{n-1}$:

$$r_n = \sigma(W_r I_n + U_r h_{n-1}). \tag{4}$$

The candidate activation $\hat{h}_n$ is computed as:

$$\hat{h}_n = \tanh\left[W I_n + U (r_n \odot h_{n-1})\right]. \tag{5}$$

Finally, the activation of GRU is a linear interpolation between the candidate activation $\hat{h}_n$ and the previous activation $h_{n-1}$:

$$h_n = (1 - z_n) \odot h_{n-1} + z_n \odot \hat{h}_n. \tag{6}$$

We consider the final hidden state $h_N$ as the output of the Encoder component, i.e., the low-dimensional representation of the user’s current preference:

$$p^E = h_N. \tag{7}$$

The goal of the Decoder component is to predict the item that will be recommended according to the user’s current preference. Therefore, the input is the user’s preference representation $p^E$, while the output is the item-embedding of the item that is predicted to appear at the next position in the log, referred to as $G_\theta(s)$. For simplicity, we leverage several fully-connected layers as the Decoder to directly transform $p^E$ to $G_\theta(s)$. Note that it is straightforward to leverage other methods to generate the next item, such as using a softmax layer to compute relevance scores of all items, and selecting the item with the highest score as the next item. So far, we have delineated the architecture of the Generator, which aims to imitate the recommendation policy of the existing recommender system, and generate realistic logs to augment the historical data. In addition, we add a supervised component to encourage the generator to yield items that are close to the ground truth items, which will be discussed in Section 3.3. Next, we will discuss the architecture of the discriminator.
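The Encoder recurrence described above can be illustrated with a small pure-Python sketch of a standard GRU step (update gate, reset gate, candidate activation, linear interpolation). It is only a sketch under stated assumptions: bias terms are omitted, the weights are random rather than trained, and the sizes (`input_size=4`, `hidden_size=3`, five browsed items) and helper names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_params(input_size, hidden_size, rng):
    """Random weights: W* act on the input, U* act on the previous state."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = mat(hidden_size, input_size)
        params["U" + gate] = mat(hidden_size, hidden_size)
    return params


def gru_step(x, h_prev, p):
    """One GRU step for input vector x and previous hidden state h_prev."""
    # Update gate and reset gate.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    # Candidate activation uses the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    # Linear interpolation between previous state and candidate.
    return [(1.0 - zi) * hp + zi * ci for zi, hp, ci in zip(z, h_prev, cand)]


def encode(inputs, p, hidden_size):
    """Run the GRU over the item sequence; the final state is the preference."""
    h = [0.0] * hidden_size
    for x in inputs:
        h = gru_step(x, h, p)
    return h


rng = random.Random(0)
params = init_params(input_size=4, hidden_size=3, rng=rng)
browsing_history = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
preference = encode(browsing_history, params, hidden_size=3)
```

In the described architecture, each input vector would be the concatenation of an item embedding and the embedded feedback, and the resulting preference vector would then be passed through the fully-connected Decoder layers.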

Figure 5. The discriminator architecture.

3.2. The Discriminator Architecture

The discriminator aims to not only distinguish real historical logs from generated logs, but also predict the class of the user’s feedback on a recommended item according to her browsing history. Thus we consider the problem as a classification problem with $2 \times K$ classes, i.e., $K$ classes of real feedback for the recommended items observed from historical logs, and $K$ classes of fake feedback for the recommended items yielded by the generator.

Figure 5 illustrates the architecture of the discriminator. Similar to the generator, we introduce an RNN with GRU to capture the user’s dynamic preference. Note that the architecture is the same as that of the RNN in the generator, but the two have different parameters. The input of the RNN is the state $s = \{i_1, \cdots, i_N\}$ observed in the historical logs, where $i_n = (e_n, f_n)$, and the output is the dense representation of the user’s current preference, referred to as $p^D$. Meanwhile, we feed the item-embedding of the recommended item (real $a$ or fake $G_\theta(s)$) into fully-connected layers, which encode the recommended item into a low-dimensional representation, referred to as $e^D$. Then we concatenate $p^D$ and $e^D$, and feed the concatenation into fully-connected layers, whose goal is to (1) judge whether the recommended items are real or fake, and (2) predict users’ feedback on these items. Therefore, the final fully-connected layer outputs a $2 \times K$ dimensional vector of logits, which represent $K$ classes of real feedback and $K$ classes of fake feedback respectively:

$$output = [l_1, \cdots, l_K, l_{K+1}, \cdots, l_{2K}], \tag{8}$$

where we include $K$ classes of fake feedback in the output layer rather than only one fake class, since fine-grained distinction on fake samples can increase the power of the discriminator (more details in the following subsections). These logits can be transformed to class probabilities through a softmax layer, and the probability corresponding to the $j$-th class is:

$$p_{model}(l = l_j \mid s, a) = \frac{\exp(l_j)}{\sum_{k=1}^{2K} \exp(l_k)}, \tag{9}$$

where $l$ is the result of classification. The objective function is based on these class probabilities. In addition, a supervised component is introduced to enhance the user’s feedback prediction; more details about this component will be discussed in Section 3.3.
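As a small illustration of the output layer described above, the following sketch turns a vector of logits into class probabilities with a softmax and splits them into the real-feedback and fake-feedback halves. The logit values and the choice of two feedback classes are made up for the example, and `split_real_fake` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_real_fake(probs, num_feedback_classes):
    """First half: real-feedback classes; second half: fake-feedback classes."""
    return probs[:num_feedback_classes], probs[num_feedback_classes:]


# Assumed example with two feedback classes, hence four logits.
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)
real_probs, fake_probs = split_real_fake(probs, 2)
```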

3.3. The Objective Function

In this subsection, we introduce the objective functions of the proposed simulator. The discriminator has two goals: (1) distinguishing real-world historical logs from generated logs, and (2) predicting the class of the user’s feedback on a recommended item according to the browsing history. The first goal corresponds to an unsupervised problem, just like a standard GAN that distinguishes real and fake images, while the second goal is a supervised problem that minimizes the class difference between users’ ground truth feedback and the predicted feedback. Therefore, the loss function $L^D$ of the discriminator consists of two components.

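For the unsupervised component that distinguishes real-world historical logs from generated logs, we need to calculate the probability that a state-action pair is real or fake. From Eq (9), we know that the probability that a state-action pair $(s, a)$ observed from historical logs is classified as real, referred to as $D_\phi(s, a)$, is the summation of the probabilities of the $K$ real feedback classes:

$$D_\phi(s, a) = \sum_{k=1}^{K} p_{model}(l = l_k \mid s, a), \tag{10}$$

while the probability that a fake state-action pair, whose action is produced by the generator, say $G_\theta(s)$, is classified as real is likewise the summation of the probabilities of the real feedback classes:

$$D_\phi(s, G_\theta(s)) = \sum_{k=1}^{K} p_{model}(l = l_k \mid s, G_\theta(s)). \tag{11}$$

Then, the unsupervised component of the loss function $L^D$ is defined as follows:

$$L^D_{unsup} = -\left\{ \mathbb{E}_{s,a \sim p_{data}} \log D_\phi(s, a) + \mathbb{E}_{s \sim p_{data}} \log\left[1 - D_\phi(s, G_\theta(s))\right] \right\}, \tag{12}$$

where both $s$ and $a$ are sampled from the historical logs distribution $p_{data}$ in the first term; in the second term, only $s$ is sampled from $p_{data}$, while the action $G_\theta(s)$ is yielded by the generator policy $G_\theta$.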

The supervised component aims to predict the class of the user’s feedback, which is formulated as a supervised problem that minimizes the class difference (i.e., the cross-entropy loss) between users’ ground truth feedback and the predicted feedback. Thus it also has two terms: the first term is the cross-entropy loss between the ground truth class and the predicted class for a real state-action pair sampled from the real historical data distribution $p_{data}$, while the second term is the cross-entropy loss between the ground truth class and the predicted class for a fake state-action pair, where the action is yielded by the generator. Thus the supervised component of the loss function is defined as follows:

$$L^D_{sup} = -\left\{ \mathbb{E}_{s,a,r \sim p_{data}} \left[\log p_{model}(l_k \mid s, a, k \le K)\right] + \lambda \cdot \mathbb{E}_{s,r \sim p_{data},\, a \sim G_\theta} \left[\log p_{model}(l_k \mid s, a, K < k \le 2K)\right] \right\}, \tag{13}$$

where $l_k$ denotes the ground truth class and $\lambda$ controls the contribution of the second term. The first term is a standard cross-entropy loss of a supervised problem. The intuition behind the second term of Eq (13) is as follows: in order to tackle the data limitation challenge mentioned in Section 1, we consider fake state-action pairs as augmentations of real state-action pairs; fine-grained distinction on fake state-action pairs then increases the power of the discriminator, which in turn forces the generator to output more indistinguishable actions. The overall loss function of the discriminator is defined as follows:

$$L^D = L^D_{unsup} + \alpha \cdot L^D_{sup}, \tag{14}$$

where the parameter $\alpha$ is introduced to control the contribution of the supervised component.
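To make the two components of the discriminator loss concrete, here is a single-sample sketch in plain Python. It is illustrative only and rests on several assumptions: the logits are ordered with the real-feedback classes first and the fake-feedback classes second, the ground-truth class of a generated pair is taken to be the fake counterpart of the logged feedback class, and the function name, argument names and numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_loss(real_logits, fake_logits, feedback_class, k, alpha, lam):
    """Single-sample estimate of the unsupervised plus weighted supervised loss.

    real_logits: 2k logits for a logged (state, action) pair.
    fake_logits: 2k logits for the same state with a generated action.
    feedback_class: index in [0, k) of the logged feedback (assumed label).
    """
    p_real = softmax(real_logits)
    p_fake = softmax(fake_logits)
    # Probability of "real" = sum over the k real-feedback classes.
    d_real = sum(p_real[:k])
    d_fake = sum(p_fake[:k])
    # Unsupervised part: real pairs judged real, generated pairs judged fake.
    unsup = -(math.log(d_real) + math.log(1.0 - d_fake))
    # Supervised part: cross-entropy on the real class and, weighted by lam,
    # on the corresponding fake class.
    sup = -(math.log(p_real[feedback_class])
            + lam * math.log(p_fake[k + feedback_class]))
    return unsup + alpha * sup


# A confident, correct discriminator versus a confidently wrong one.
good = discriminator_loss([10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 10.0, 0.0],
                          feedback_class=0, k=2, alpha=1.0, lam=0.5)
bad = discriminator_loss([0.0, 0.0, 10.0, 0.0], [10.0, 0.0, 0.0, 0.0],
                         feedback_class=0, k=2, alpha=1.0, lam=0.5)
```

In practice the expectation would be estimated by averaging over a minibatch rather than a single pair.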

The target of the generator is to output realistic recommended items that can fool the discriminator, which tackles the complex data distribution problem mentioned in Section 1. To achieve this goal, we design two components for the loss function $L^G$ of the generator. The first component aims to maximize $L^D_{unsup}$ in Eq (12) with respect to $\theta$. In other words, the first component minimizes the probability that fake state-action pairs are classified as fake, thus we have:

$$L^G_{unsup} = \mathbb{E}_{s \sim p_{data}} \left[\log\left(1 - D_\phi(s, G_\theta(s))\right)\right], \tag{15}$$

where $s$ is sampled from the real historical logs distribution $p_{data}$ and the action $G_\theta(s)$ is yielded by the generator policy $G_\theta$. Inspired by a supervised version of GAN (Luc et al., 2016), we introduce a supervised loss as the second component of $L^G$, which is the distance between the ground truth item $a$ and the generated item $G_\theta(s)$:

$$L^G_{sup} = \mathbb{E}_{s,a \sim p_{data}} \left\| a - G_\theta(s) \right\|_2^2, \tag{16}$$

where $s$ and $a$ are sampled from the historical logs distribution $p_{data}$. This supervised component encourages the generator to yield items that are close to the ground truth items. The overall loss function of the generator is defined as follows:

$$L^G = L^G_{unsup} + \beta \cdot L^G_{sup}, \tag{17}$$

where $\beta$ controls the contribution of the supervised component.
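The generator objective described above can be sketched for a single sample as follows. This is a hedged illustration: the squared Euclidean distance is an assumed choice for the "distance" between the ground truth and generated item embeddings, and the function and argument names are hypothetical.

```python
import math


def generator_loss(d_fake, true_item, generated_item, beta):
    """Single-sample generator loss.

    d_fake: probability in (0, 1) that the discriminator judges the
        generated (state, action) pair to be real.
    true_item, generated_item: item embeddings as equal-length lists.
    beta: weight of the supervised (distance) component.
    """
    # Adversarial part: decreases as the discriminator is fooled more often.
    unsup = math.log(1.0 - d_fake)
    # Supervised part: squared Euclidean distance (an assumption).
    sup = sum((t - g) ** 2 for t, g in zip(true_item, generated_item))
    return unsup + beta * sup


fooled = generator_loss(0.9, [1.0, 0.0], [1.0, 0.0], beta=1.0)
caught = generator_loss(0.1, [1.0, 0.0], [1.0, 0.0], beta=1.0)
```

Here `fooled` is lower than `caught`, reflecting that the generator is rewarded when its output is judged real.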

1:  Initialize the generator $G_\theta$ and discriminator $D_\phi$ with random weights $\theta$ and $\phi$
2:  Sample a pre-training dataset of $(s, a) \sim p_{data}$
3:  Pre-train $G_\theta$ by minimizing $L^G_{sup}$ in Eq (16)
4:  Generate fake actions $G_\theta(s)$ for training $D_\phi$
5:  Pre-train $D_\phi$ by minimizing $L^D_{sup}$ in Eq (13)
6:  repeat
7:     for d-steps do
8:        Sample minibatch of $(s, a) \sim p_{data}$
9:        Use current $G_\theta$ to generate minibatch of $G_\theta(s)$
10:        Update $D_\phi$ by minimizing $L^D$ in Eq (14)
11:     end for
12:     for g-steps do
13:        Sample minibatch of $(s, a) \sim p_{data}$
14:        Update $G_\theta$ by minimizing $L^G$ in Eq (17)
15:     end for
16:  until simulator converges
Algorithm 1 A Training Algorithm for the Proposed Simulator.

We present our simulator training algorithm in detail in Algorithm 1. At the beginning of the training stage, we use standard supervised methods to pre-train the generator (line 3) and discriminator (line 5). After the pre-training stage, the discriminator (lines 7-11) and generator (lines 12-15) are trained alternately. For training the discriminator, the state $s$ and real action $a$ are sampled from real historical logs, while fake actions $G_\theta(s)$ are generated through the generator. To keep the balance in each d-step, we generate as many fake actions $G_\theta(s)$ as there are real actions $a$.
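The alternating schedule can be sketched as a plain-Python skeleton. It is not the authors' code: pre-training is omitted, the convergence check is replaced by a fixed number of rounds, and `generate`, `update_d` and `update_g` are hypothetical callables standing in for the generator forward pass and the two parameter updates.

```python
import random


def train_simulator(logs, generate, update_d, update_g,
                    rounds, d_steps, g_steps, batch_size, seed=0):
    """Alternate discriminator and generator updates.

    logs: list of (state, action, feedback) tuples from historical data.
    generate: callable mapping a state to a generated (fake) action.
    update_d: callable(batch, fake_actions) performing one discriminator step.
    update_g: callable(batch) performing one generator step.
    rounds: fixed number of outer iterations (assumed stand-in for
        "until simulator converges").
    """
    rng = random.Random(seed)
    for _ in range(rounds):
        for _ in range(d_steps):
            batch = rng.sample(logs, batch_size)
            # One fake action per real action keeps the two sets balanced.
            fake_actions = [generate(state) for state, _, _ in batch]
            update_d(batch, fake_actions)
        for _ in range(g_steps):
            batch = rng.sample(logs, batch_size)
            update_g(batch)
```

A caller would plug in framework-specific update functions; the skeleton only fixes the order and balance of the updates.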

4. Experiments

In this section, we conduct extensive experiments to evaluate the effectiveness of the proposed simulator with a real-world dataset from an e-commerce site. We mainly focus on two questions: (1) how the proposed simulator performs compared to the state-of-the-art baselines; and (2) how the components in the generator and discriminator contribute to the performance. We first introduce experimental settings. Then we seek answers to the above two questions. Finally, we study the impact of important parameters on the performance of the proposed framework.

4.1. Experimental Settings

We evaluate our method on a dataset from July 2018 provided by a real e-commerce company. We randomly collect 272,250 recommendation sessions, and each session is a sequence of item-feedback pairs. After filtering out items that appear fewer than 5 times, 1,355,255 items remain. For each session, we use the first $N$ items and corresponding feedback as the initial state and the $(N+1)$-th item as the first action; we can then collect a sequence of (state, action, reward) tuples following the MDP defined in Section 2. We collect the last (state, action, reward) tuples from all recommendation sessions as the test set, while using the other tuples as the training set.
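The construction of tuples from a session can be sketched as a sliding window. This is a minimal illustration under assumptions: a session is represented as a chronological list of `(item, feedback)` pairs, the window length `n` and the toy session are made up, and the feedback on the recommended item is stored in place of a numeric reward.

```python
def session_to_tuples(session, n):
    """Turn one session into (state, action, feedback) tuples.

    session: chronological list of (item, feedback) pairs.
    n: number of item-feedback pairs in a state (assumed window length).
    """
    tuples = []
    for t in range(n, len(session)):
        state = tuple(session[t - n:t])   # the n most recent pairs
        action, feedback = session[t]     # the next recommended item
        tuples.append((state, action, feedback))
    return tuples


toy_session = [("a", "click"), ("b", "skip"), ("c", "skip"),
               ("d", "purchase"), ("e", "click")]
toy_tuples = session_to_tuples(toy_session, n=3)
```

Each successive state drops its oldest pair and appends the previous action with its feedback, which matches the deterministic transition of the MDP.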

In this paper, we leverage items that a user browsed and user’s corresponding feedback for each item as state . The dimension of the item-embedding is , and the dimension of action representation is ( is a 2-dimensional one-hot vector: when feedback is negative, while when feedback is positive). The output of discriminator is a dimensional vector of logits, and each logit represents real-positive, real-negative, fake-positive and fake-negative respectively:


$$output = [l_{rp}, l_{rn}, l_{fp}, l_{fn}],$$

where real denotes that the recommended item is observed from historical logs; fake denotes that the recommended item is yielded by the generator; positive denotes that a user clicks/purchases the recommended item; and negative denotes that a user skips the recommended item. Note that though we only simulate two types of user behaviors (i.e., positive and negative), it is straightforward to extend the simulator to more types of behaviors. The Adam optimizer is used for optimization; the learning rate for both Generator and Discriminator is 0.001, and the batch size is 500. The hidden size of the RNN is 128. For the parameters of the proposed framework, such as $\alpha$, $\beta$ and $\lambda$, we select them via cross-validation. Correspondingly, we also do parameter-tuning for the baselines for a fair comparison. We will discuss more details about parameter selection for the proposed simulator in the following subsections.

In the test stage, given a state-action pair, the simulator predicts the class of the user’s feedback for the action (recommended item), and the prediction is then compared with the ground truth feedback observed from the historical log. For this classification task, we select the commonly used F1-score as the metric, which combines precision and recall as their harmonic mean. Moreover, we leverage $p_{model}(l = l_{rp} \mid s, a)$ (i.e., the probability that the user will provide positive feedback to a real recommended item) as the score, and use AUC (Area under the ROC Curve) as the metric to evaluate the performance.
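For reference, the two metrics can be computed as in the following sketch, which uses only the standard definitions (F1 as the harmonic mean of precision and recall, AUC as the probability that a positive example is scored above a negative one, with ties counted as one half). The labels and scores are made-up toy values, and the pairwise AUC computation is chosen for clarity rather than efficiency.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0]
predictions = [1, 0, 0, 0]
positive_scores = [0.9, 0.1, 0.4, 0.6]
toy_f1 = f1_score(labels, predictions)
toy_auc = auc(labels, positive_scores)
```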

4.2. Comparison of the Overall Performance

To answer the first question, we compare the proposed simulator with the following state-of-the-art baseline methods:

  • Random: This baseline assigns each recommended item a random score, and uses a threshold value on this score to classify items as positive or negative; this score is also used to calculate AUC.

  • LR: Logistic Regression (Menard, 2002) uses a logistic function to model a binary dependent variable by minimizing a loss over the training examples; we concatenate all item representations in the state as the feature vector for each recommended item, and set the binary ground truth label according to whether the feedback is positive.

  • GRU: This baseline utilizes an RNN with GRU to predict the class of the user’s feedback on a recommended item. The input of each unit is the item representation $i_n = (e_n, f_n)$, and the output of the RNN is the representation of the user’s preference; we then concatenate this representation with the embedding of a recommended item, and leverage a softmax layer to predict the class of the user’s feedback on this item (the output is a 2-dimensional vector).

  • GAN: This baseline is based on Generative Adversarial Network (Goodfellow et al., 2014), where the generator takes state-action pairs (the browsing histories and the recommended items) and outputs user’s feedback (reward) to the items, while the discriminator takes (state, action, reward) tuples and distinguishes between real tuples (whose rewards are observed from historical logs) and fake ones. Note that we also use an RNN with GRU to capture user’s sequential preference.

  • GAN-s: This baseline is a supervised version of GAN (Luc et al., 2016), where the setting is similar to the above GAN baseline, while a supervised component is added on the output of the generator, which minimizes the difference between real feedback and predicted feedback.

Figure 6. The results of overall performance comparison.

The results are shown in Figure 6. We make the following observations:

  • LR achieves worse performance than GRU, since LR neglects the temporal sequence within users’ browsing history, while GRU can capture the temporal patterns within the item sequences and users’ feedback for each item. This result demonstrates that it is important to capture the sequential patterns of users’ browsing history when learning user’s dynamic preference.

  • GAN-s performs better than GRU and GAN, since GAN-s benefits from not only the advantages of the GAN framework (the unsupervised component), but also the advantages of the supervised component that directly minimizes the cross-entropy between the ground truth feedback and the predicted feedback.

  • RecSimu outperforms GAN-s because the generator imitates the recommendation policy that generates the historical logs, and the generated logs can be considered as augmentations of real logs, which solves the data limitation challenge; while the discriminator can distinguish real and generated logs (unsupervised component), and simultaneously predict user’s feedback of a recommended item (supervised component). In other words, RecSimu takes advantage of both the unsupervised and supervised components. The contributions of model components of RecSimu will be studied in the following subsection.

To sum up, the proposed framework outperforms the state-of-the-art baselines, which validates its effectiveness in simulating users’ behaviors in recommendation tasks.

Figure 7. The results of component analysis.

4.3. Component Analysis

To answer the second question, we systematically eliminate the corresponding components of the simulator by defining following variants of RecSimu:

  • RecSimu-1: This variant is a simplified version of the simulator, which has the same architecture except that the output of the discriminator is a 3-dimensional vector $[l_{rp}, l_{rn}, l_{f}]$, where each logit represents real-positive, real-negative and fake respectively, i.e., this variant does not distinguish between generated positive and negative items.

  • RecSimu-2: In this variant, we evaluate the contribution of the supervised component $L^G_{sup}$ of the generator, so we eliminate the impact of $L^G_{sup}$ by setting $\beta = 0$.

  • RecSimu-3: This variant evaluates the effectiveness of the competition between the generator and discriminator; hence, we remove the unsupervised (adversarial) components $L^D_{unsup}$ and $L^G_{unsup}$ from the loss function.

The results are shown in Figure 7. It can be observed:

  • RecSimu performs better than RecSimu-1, which demonstrates that distinguishing the generated positive and negative items can enhance the performance. This also validates that the generated data from the generator can be considered as augmentations of real-world data, which resolves the data limitation challenge.

  • RecSimu-2 performs worse than RecSimu, which suggests that the supervised component is helpful for the generator to produce more indistinguishable items.

  • RecSimu-3 first trains a generator, then uses real data and generated data to train the discriminator; while RecSimu updates the generator and discriminator iteratively. RecSimu outperforms RecSimu-3, which indicates that the competition between the generator and discriminator can enhance the power of both the generator (to capture complex data distribution) and the discriminator (to classify real and fake samples).

Figure 8. The results of parametric sensitivity analysis.

4.4. Parametric Sensitivity Analysis

Our method has two key parameters, i.e., (1) $N$, which controls the length of the state, and (2) $\lambda$, which controls the contribution of the second term in Eq (13), which classifies the generated items into the positive or negative class. To study the impact of these parameters, we investigate how the proposed framework RecSimu performs as one parameter changes while the other parameters are fixed. The results are shown in Figure 8. We make the following observations:

  • Figure 8 (a) shows the sensitivity to the state length. As the state length increases, the performance improves significantly at first and then becomes relatively stable. This result indicates that introducing a longer browsing history can enhance the performance.

  • Figure 8 (b) shows the sensitivity to the weight of the second term in Eq (13). The performance of the simulator peaks at a particular setting of this weight. In other words, the second term in Eq (13) indeed improves the performance of the simulator; however, the performance mainly depends on the first term in Eq (13), which classifies the real items into positive and negative classes.
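
The role of the second parameter can be sketched as a weighted sum of two classification terms. This is only an illustration under stated assumptions: Eq (13) is not reproduced in this section, so the cross-entropy form, the additive weighting, and the names `weighted_loss`, `real_batch`, `generated_batch` and `weight` are hypothetical, and the toy probabilities are made up for the example.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the target class index."""
    return -math.log(probs[target])


def mean_cross_entropy(batch):
    """Average cross-entropy over (probs, target) pairs; 0.0 for an empty batch."""
    if not batch:
        return 0.0
    return sum(cross_entropy(p, t) for p, t in batch) / len(batch)


def weighted_loss(real_batch, generated_batch, weight):
    """First term scores real items; second term scores generated items, scaled by `weight`."""
    return mean_cross_entropy(real_batch) + weight * mean_cross_entropy(generated_batch)


# Toy predictions over two classes (index 0 = positive, index 1 = negative).
real_batch = [([0.8, 0.2], 0), ([0.3, 0.7], 1)]
generated_batch = [([0.6, 0.4], 0)]

loss_without_second_term = weighted_loss(real_batch, generated_batch, weight=0.0)
loss_with_second_term = weighted_loss(real_batch, generated_batch, weight=0.5)
```

Setting `weight` to zero leaves only the term on real items, which corresponds to one end of the sweep over this parameter; larger values give the generated items more influence on the classifier.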

5. Related Work

In this section, we briefly review works related to our study. In general, the related work can be mainly grouped into the following categories.

The first category related to this paper is reinforcement learning based recommender systems, which typically formulate the recommendation task as a Markov Decision Process (MDP) and model the recommendation procedure as sequential interactions between users and the recommender system. Practical recommender systems typically have millions of items (discrete actions) to recommend, and most RL-based models become inefficient because they cannot handle such a large discrete action space. A Deep Deterministic Policy Gradient (DDPG) algorithm is introduced to mitigate the large action space issue in practical RL-based recommender systems (Dulac-Arnold et al., 2015), where an Actor produces the optimal action based on the current state, and a Critic outputs the action-value (Q-value) for this state-action pair. To avoid the inconsistency of DDPG and improve recommendation performance, a tree-structured policy gradient is proposed in (Chen et al., 2018a), which constructs a balanced hierarchical clustering tree over items and picks an item by seeking a path to a specific leaf of the tree. A biclustering technique is also introduced to model recommender systems as grid-world games so as to reduce the state/action space (Choi et al., 2018). To solve the unstable reward distribution problem in dynamic recommendation environments, an approximate regretted reward technique is proposed with Double DQN to obtain a reference baseline from individual customer samples, which can effectively stabilize the reward value estimation and enhance the recommendation quality (Chen et al., 2018b). Users’ positive and negative feedback, i.e., purchase/click and skip behaviors, are jointly considered in one framework to boost recommendations, since both types of feedback can represent part of users’ preferences (Zhao et al., 2018c). Improvements in both architecture and formulation are introduced to capture positive and negative feedback in a unified RL framework. A page-wise recommendation framework is proposed to jointly recommend a page of items and display them within a 2-D page (Zhao et al., 2017, 2018b). A CNN technique is introduced to capture the item display patterns and users’ feedback on each item in the page. In the news feed scenario, a DQN based framework is proposed to handle the challenges of conventional models, i.e., (1) only modeling the current reward such as CTR, (2) not considering click/skip labels, and (3) feeding similar news to users (Zheng et al., 2018). Other applications include sellers’ impression allocation (Cai et al., 2018a), fraudulent behavior detection (Cai et al., 2018b) and user state representation (Liu et al., 2018).

The second category related to this paper is behavior simulation. Reinforcement learning and supervised learning algorithms typically learn experts’ behavior with the guidance of rewards, feedback or labels from the real-world environment. However, deploying algorithms in a real environment costs money and time, which calls for an estimate of the environment, so that the algorithms can be trained to learn experts’ behavior in a simulation of the environment before they are launched online. One of the most effective approaches is Learning from Demonstration (LfD), which estimates an implicit reward function from the state-to-action mappings of an expert’s behavior. Successful LfD applications include autonomous helicopter maneuvers (Ross et al., 2013), self-driving cars (Bojarski et al., 2016), playing table tennis (Calinon et al., 2010), object manipulation (Pastor et al., 2009) and making coffee (Sung et al., 2018). For example, Ross et al. (Ross et al., 2013) develop a method that autonomously navigates a small helicopter at low altitude in a natural forest environment. Given the demonstrations of a small group of human pilots, the authors leverage LfD techniques to train a controller to learn how a human expert would control a helicopter in a similar environment to successfully avoid collisions with trees and leaves, using only low-weight visual sensors. Bojarski et al. (Bojarski et al., 2016) train a CNN to directly map the raw pixels of a single front-facing camera to steering commands. With minimal human expert data, the system can automatically learn a representation of the environment, such as useful road features, using only the human steering angle as a training signal, and then learn to drive on local roads and highways with or without lane markings. Calinon et al. (Calinon et al., 2010) propose a probabilistic method to train robust models of human motion by imitation, e.g., for playing table tennis. The combination of HMM, Gaussian mixture regression and dynamical systems enables the method to extract redundancy from multiple demonstrations and develop time-independent models that mimic the dynamic nature of the demonstrated behaviors. Pastor et al. (Pastor et al., 2009) present a general method to learn robot motor skills from human demonstrations. To represent an observed motion, the model learns a nonlinear differential equation that reproduces the motion. Based on this representation, a library of movements is developed that labels each recorded motion according to its task and context, such as grasping, placing, and releasing. Sung et al. (Sung et al., 2018) propose a manipulation planning approach based on the assumption that many household items share similar operational components. Thus, manipulation planning is formulated as a structured prediction problem, and a DNN-based model is developed that can deal with large noise in manipulation demonstrations and can learn characteristics from three different modalities: point cloud, language, and trajectory. To gather a large number of manipulation demonstrations of different objects, the authors develop a new crowd-sourcing platform.

6. Conclusion

In this paper, we propose a novel user simulator, RecSimu, based on the Generative Adversarial Network (GAN) framework, which models real users’ behaviors from users’ historical logs and tackles two challenges: (i) the recommended item distribution is complex within users’ historical logs, and (ii) labeled training data from each user is limited. The GAN-based user simulator can naturally resolve these two challenges and can be used to pre-train and evaluate new recommendation algorithms before launching them online. To be specific, the generator captures the underlying item distribution of users’ historical logs and generates indistinguishable fake logs that can be used as augmentations of real logs; the discriminator predicts users’ feedback on a recommended item based on users’ browsing logs, taking advantage of both supervised and unsupervised learning techniques. To validate the effectiveness of the proposed user simulator, we conduct extensive experiments on a real-world e-commerce dataset. The results show that the proposed user simulator improves user behavior prediction performance in the recommendation task by a significant margin compared with several state-of-the-art baselines.

There are several interesting research directions. First, for the sake of generalization, we do not consider the dependency between consecutive actions in this paper; in other words, we split one recommendation session into multiple independent state-action pairs. Some recent imitation learning techniques, such as Inverse Reinforcement Learning and Generative Adversarial Imitation Learning, consider a sequence of state-action pairs as a whole trajectory, in which prior actions can influence later actions. We leave incorporating this idea to future work. Second, positive (click/purchase) and negative (skip) feedback is extremely unbalanced in users’ historical logs, which makes it even harder to collect sufficient positive feedback data. In this paper, we leverage traditional up-sampling techniques to generate more training data with positive feedback. In the future, we plan to leverage the GAN framework to automatically generate more positive feedback data. Finally, users skip items for many reasons, such as (1) users indeed do not like the item, (2) users do not look at the item in detail and skip it by mistake, or (3) there is a better item in a nearby position. These reasons make predicting skip behavior even harder. Thus, we will introduce explainable recommendation techniques to identify the reasons why users skip items.
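
The two data-preparation choices mentioned above, splitting a session into independent state-action pairs and up-sampling positive feedback, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' pipeline: the state is taken to be a fixed-length window of previously browsed items, up-sampling is plain random duplication of positive examples, and the names `split_session`, `upsample_positives` and `state_len` are hypothetical.

```python
import random


def split_session(items, feedback, state_len):
    """Split one session into independent (state, action, feedback) triples.

    The state is the `state_len` items browsed just before the action.
    """
    pairs = []
    for i in range(state_len, len(items)):
        state = tuple(items[i - state_len:i])
        pairs.append((state, items[i], feedback[i]))
    return pairs


def upsample_positives(pairs, rng):
    """Randomly duplicate positive examples until both classes are balanced."""
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    if not positives or len(positives) >= len(negatives):
        return list(pairs)
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    return pairs + extra


# Toy session: 1 = click/purchase, 0 = skip.
items = ["a", "b", "c", "d", "e", "f"]
feedback = [0, 1, 0, 0, 1, 0]

pairs = split_session(items, feedback, state_len=2)
balanced = upsample_positives(pairs, random.Random(0))
```

Because each triple is treated independently, any influence of earlier actions on later ones within the same session is discarded, which is the limitation the first research direction refers to.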