Imagine that a customer visits a retail shop to purchase a dress to her liking. As the customer walks in, a business assistant is present to help by answering questions on fashion trends or suggesting related dresses. In online e-commerce applications, more business units are adding a component that plays a role similar to the business assistant in a shop. In this paper, we are interested in a particular component, commonly known as a search story, that has become popular among e-commerce search engines on many online platforms. For instance, in news feed platforms and web and image search platforms, each search story is a display of recommended high-quality content that is relevant to a user’s personal interests. In e-commerce search platforms, on the other hand, a search story is instead a display of a sponsored article that gives an overview and comparison of several product items. Figure 1 illustrates an example of a search story in a real e-commerce search engine, embedded within the organic search results. In this example, when clicked, the search story displays a short survey that summarizes and compares a list of selected product items and related styles.
The search story recommendation can be naturally formulated as a conventional recommendation or ranking problem that aims to suggest relevant items to users based on search keywords. For instance, one may model the problem as a click-through prediction task and recommend the search story with the highest predicted click-through rate. However, compared with conventional recommendation systems or search engines, recommendation of search stories focuses more on guiding users to figure out their own preferences and personal intents. Consider the following concrete example that illustrates a multitude of objectives of a search story recommender.
As shown in Figure 1, suppose a customer wants to purchase a “dress outfit” for a party, but does not know what exact style she is looking for (e.g., “sleeveless loose plain dress”). The purpose of a search story recommender is to assist and guide the customer within each search session, as if it were playing the role of an assistant in a shop. On the one hand, a user’s search session history can be leveraged to learn the user’s intent and subsequently to build a better recommendation model for future search stories. On the other hand, the recommended search story guides the user to figure out her preferences and personal intents, which affects not only her immediate behavior (e.g., clicking or ordering product items from the current page of the search story in Figure 1(b)), but also her long-term behavior (e.g., clicking or ordering product items in a future search session in Figure 1(a)).
As this example illustrates, the ultimate goal of search story recommendation in e-commerce search applications is to recommend the best search story that maximizes both short-term reward (e.g., purchasing a product shown in the landing page of a search story) and long-term reward (e.g., returning to start another search session in a week). Compared with organic search results, search stories risk disrupting users’ current search in order to achieve better long-term benefit in their following searches. Therefore, search story recommendation requires a solution that considers both immediate and future benefits. Although we consider direct feedback (i.e., users clicking or ordering product items in the landing page of search stories), indirect feedback (i.e., users clicking or ordering product items in the search page) is more important. Such a cross-channel effect is hard to model using the conventional supervised learning framework. This motivates us to propose a novel reinforcement learning framework for personalized search story recommendation.
Concretely, we formulate the personalized search story recommendation problem as a Markov decision process and propose a deep reinforcement learning architecture with (1) a combination of both imitation learning and reinforcement learning, as well as (2) a combination of both model-based and model-free reinforcement learning.
Our deep reinforcement learning solution, named DRESS (Deep REinforcement learning for Search Story recommendation), consists of three components: a dynamic model parameterized by a recurrent neural network, an actor network with a proximal policy optimizer, and a critic network. The dynamic model is used to infer user behavior patterns (i.e., the environment) and is applied as a virtual environment for controller learning. Such a model-based method complements model-free methods in terms of data efficiency, which is critical when only offline data is available. The actor network provides the policy (i.e., the distribution over recommended search stories) based on the state, the user’s behavior history, and the current query. We use both imitation learning and a critic network (reinforcement learning) to tune the actor network. The imitation learning procedure fits the actor network to the offline data, so that, on the one hand, the stochastic logging policy behind the offline data is estimated, and, on the other hand, the actor network is warmed up for further tuning. The critic network estimates the long-term reward (i.e., the state value function or advantage) of the logging policy, which is used to tune the actor network based on the idea of safe policy iteration.
The main contributions of this work can be summarized as follows:
- Novel Problem. We study an emerging search story recommendation problem and develop a solution based on a deep reinforcement learning framework, addressing the challenges that arise from its cross-channel and long-term properties.
- Sound Methodology. We propose an architecture that combines model-free and model-based reinforcement learning, as well as imitation learning and reinforcement learning, as a strategy to apply safe policy iteration to offline data.
- Practical Solution. Experiments on real-life datasets from JD.com empirically demonstrate the effectiveness of our proposed solution.
The remainder of the paper is organized as follows. We review related literature in Section 2 and formulate the problem in Section 3. Section 4 provides a brief overview of our proposed solution. We present the dynamic model and controller in Section 5 and Section 6 respectively, and introduce imitation and imagination learning in Section 7. The experimental results are shown in Section 8. We finally conclude our work in Section 9.
2 Related Work
In this section, we briefly review two topics relevant to our work, namely reinforcement learning and recommendation/ranking.
2.1 Reinforcement Learning
In the general reinforcement learning framework, an agent sequentially interacts with the environment and learns to achieve the best return, in the form of accumulated immediate rewards. In the partially observable Markov decision process (POMDP) model, at each time step $t$, the agent receives an observation $o_t$ of the environment and takes an action $a_t$ to obtain a reward $r_t$ from the environment. As the environment is partially observable, the state $s_t$ of the environment at time $t$ can only be inferred from the whole history up to time $t$, which can be denoted as $h_t = (o_1, a_1, \ldots, a_{t-1}, o_t)$. The goal of the reinforcement learning problem is to learn an optimal policy, a sequence of decisions mapping state $s_t$ to action $a_t$, that maximizes the expected accumulated long-term reward.
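The notation above can be made concrete with a schematic interaction loop. The `_ToyEnv`, `run_episode`, and `policy` names below are purely illustrative and are not part of the paper's system:

```python
# A schematic POMDP interaction loop: the agent conditions on the whole
# history h_t = (o_1, a_1, ..., o_t), since the state is not observed directly.

def run_episode(env, policy, horizon=10):
    """Roll out one episode, returning the history and the rewards."""
    history = []   # all the agent can condition on
    rewards = []
    obs = env.reset()
    for _ in range(horizon):
        history.append(obs)
        action = policy(history)          # pi maps (inferred) state to action
        obs, reward, done = env.step(action)
        history.append(action)
        rewards.append(reward)
        if done:
            break
    return history, rewards

class _ToyEnv:
    """Two-step toy environment used only to exercise the loop."""
    def reset(self):
        self.t = 0
        return "o0"
    def step(self, action):
        self.t += 1
        return f"o{self.t}", 1.0, self.t >= 2

history, rewards = run_episode(_ToyEnv(), policy=lambda h: "a")
```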
| Category | Method | References |
|----------|--------|------------|
| Value-based | Deep Q-learning Network (DQN) | [24, 37, 39] |
| Policy-based | Actor-Critic | [23, 28, 29] |
| Policy-based | Deterministic Policy Gradient (DPG) | [31, 20, 40] |
| Model-free (all of the above) | | [24, 37, 39, 23, 28, 29, 31, 20, 40] |
Deep reinforcement learning has achieved remarkable success in various tasks including, but not limited to, game playing [24, 30], search and recommendation [34, 44], robotics and autonomous vehicles [15, 22], online advertising [4, 5], several NLP tasks [16, 3], and database management systems [18, 35, 42]. A list of representative (though not exhaustive) works is summarized in Table 1, and we refer readers to the surveys [19, 2] for more details.
2.2 Recommendation and Ranking
To build a high-quality recommendation system, one needs to understand and characterize the profiles and behaviors of individual users, the items, and their interactions. Commonly used factorization models [41, 14] learn latent factors for users and items by decomposing user-item interaction matrices. Neighborhood methods [27, 13, 9] rely on similarities between users and items derived from content or co-occurrence. These popular methods often ignore or under-exploit the important temporal dynamics and sequential properties of user-item interactions.
In addition to these popular methods, deep feed-forward networks have been successfully applied in recommender systems. Early work used restricted Boltzmann machines for collaborative filtering and achieved remarkable results. Other feed-forward models [36, 38, 7] (e.g., convolutional neural networks, stacked denoising auto-encoders) have also been used to extract feature representations from items to improve recommendation.
To exploit temporal dynamics and sequential information, subsequent work introduced recurrent neural networks (RNNs) to recommender systems for the task of session-based recommendation, devising GRU-based RNNs that achieved good performance with one-hot encoded item inputs and ranking-based loss functions. Further improvements on session-based recommendation include exploiting rich features such as images, as well as data augmentation.
2.3 Reinforcement Learning in Recommendation and Ranking
All of the above works on recommendation still focus on one-round, static optimization of the recommendation model. To better incorporate users' real-time feedback, several contextual-bandit-based ranking/recommendation approaches [25, 17, 43] were proposed to update the selection strategy based on user click feedback so as to maximize total user clicks.
However, a major assumption of bandit approaches is that actions (i.e., the choice of arms) do not affect environment state transitions. This assumption fails in the personalized search story recommendation scenario, where the environment state, i.e., users' preferences and intents, is affected by the recommended search story. Hence we turn to the reinforcement learning (RL) framework, which takes into account the long-term effect of current actions.
There are some pioneering works applying RL to different tasks in recommendation and ranking, such as cross-channel recommendation, personalized news recommendation, impression allocation of advertisements, and learning-to-rank for search sessions. Their motivations to use RL are all based on the long-term effect of current actions in the corresponding problems. For example, in personalized news recommendation, the currently recommended piece may shape the user’s interests, so it can affect later recommendation results.
3 Problem Definition
For ease of presentation, we first introduce the notation and basic concepts used throughout this work. Specifically, we use lower-case symbols $u$, $q$, $a$, $d$ to represent a single user, query, story item, and item from another channel (e.g., a product item), respectively. Upper-case symbols $U$, $Q$, $A$, $D$ are used to represent the sets of users, queries, story items, and product items, respectively. Let $f$ denote the search story recommendation function that maps a context $c$ to a selected story $a = f(c)$. The context $c$ can be a specific query for general search, a specific user for recommendation, or a user-query pair $(u, q)$ for personalized search. With the above notation, we define the concepts of search session and search episode as follows.
Definition 1 (Search Session)
A search session $x_t$ consists of the feedback $e_t$ (e.g., click, order, page view) by the user $u$ at time $t$ towards the returned page, which includes a search story $a_t$ addressing a given query $q_t$. Formally, we use a tuple $x_t = (u, t, q_t, a_t, e_t)$ to denote a search session.
Definition 2 (Search Episode)
A search episode $X$ is a temporal sequence of search sessions by the user $u$, denoted as $X = (x_1, x_2, \ldots, x_T)$. We add a subscript to $X$ (i.e., $X_u$) to denote the search episode of a specific user $u$.
3.2 Problem Formulation
As introduced earlier, in this work we focus on reinforcement learning for personalized search story recommendation. Specifically, we aim to find a strategy that updates the search story recommendation function of a search engine along search episodes to achieve the best reward for each user.
When casting personalized search story learning into the general reinforcement learning framework, the corresponding observation $o_t$, action $a_t$, state $s_t$, transition $T$, and reward $r_t$ are defined as follows:
- Observation $o_t$ is the user-dependent and query-dependent feature representation of the candidate story items.
- Action $a_t$ is the selection of the search story from the candidates.
- State $s_t$ is the combination of the user's search episode up to time $t$ (i.e., the history $h_t$) and the observation $o_t$.
- Transition $T$ is the state transition function, dependent on $s_t$ and $a_t$.
- Reward $r_t$ can be quantified as the number of clicks, the number of orders, or the gross merchandise volume received from users when they are in state $s_t$ and the search story recommender performs action $a_t$. In this work, we set the reward to a binary indicator of whether the user clicks any product in the search session $x_t$.
Therefore, in this work, we aim to solve the following problem:
Given the entire search episode $X_u$ of a user $u$, we aim to sequentially refine the action $a_t$ towards each search session based on the observed features $o_t$ and a policy $\pi$. Specifically, at each time step $t$, the objective is to find the best policy $\pi^*$ to maximize the estimated cumulative rewards. That is:
$\pi^* = \arg\max_\pi \mathbb{E}_\pi\big[R_t\big], \qquad R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \quad (1)$
where $R_t$ is the discounted cumulative reward, $\gamma$ is the discount factor, and $\mathbb{E}_\pi[\cdot]$ denotes the expectation under policy $\pi$.
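The discounted cumulative reward can be computed by a simple backward recursion; a generic sketch follows (the default discount of 0.7 matches the hyperparameter value later reported in Table 3):

```python
def discounted_return(rewards, gamma=0.7):
    """Compute R = sum_k gamma^k * r_k by backward recursion."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # fold in one step: r_t + gamma * R_{t+1}
    return g
```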
4 Deep Reinforcement Learning for Search Story Recommendation
In this section, we give an overview of DRESS, our deep reinforcement learning framework for personalized search story recommendation. Given limited offline data, we propose to combine both model-based augmentation and imitation learning with conventional reinforcement learning. Model-based reinforcement learning requires much less training data than model-free reinforcement learning; this data efficiency provides additional benefits such as faster model iteration and less storage of logging data, both of which are important for industrial applications. Imitation learning, on the other hand, estimates from the offline data the logging policy that produced it, which serves both as the initialization of the actor network and as a critical component of the safe-policy-iteration controller learning algorithm.
Input: logging data $\mathcal{D}$
Output: the search story recommender
1: $\mathcal{M}$ = Dynamic_Model_Training($\mathcal{D}$) (Section 5)
2: Initialize the critic network $V$ and the actor network $\pi$
// imitation learning
3: $\pi$ = Controller_Imitation($\pi$, $\mathcal{D}$) (Section 7.1)
// one step of reinforcement learning on $\mathcal{D}$
4: $\pi$, $V$ = Controller_Learning($\pi$, $V$, $\mathcal{D}$) (Section 6)
5: repeat
6: $\mathcal{D}'$ = Imagine($\mathcal{M}$, $\pi$) (Section 7.2)
// reinforcement learning on $\mathcal{D}'$ (Section 6)
7: $\pi$, $V$ = Controller_Learning($\pi$, $V$, $\mathcal{D}'$)
8: until $\pi$ converges
9: return $\mathcal{M}$, $\pi$, $V$
The approach is outlined in Algorithm 1. Randomly sampled logging search session data are collected into a dataset $\mathcal{D}$, which is used to train the dynamic model $\mathcal{M}$ as proposed in Section 5 (Line 1). The dynamic model serves as a virtual environment that interacts with our search story recommendation controller to learn a better recommendation policy. The controller is built upon the actor-critic framework and is parametrized as the actor network $\pi$ and the critic network $V$ (Line 2). Next, instead of directly performing reinforcement learning against the environment, an initial policy is learned from the logging data with controller imitation learning (Line 3). We then further improve the initial policy with a standard proximal policy gradient approach on the logging data (Line 4). Ideally, the controller would gather new on-policy data and iteratively learn a better policy in an on-policy manner. In this application, however, our "on-policy" data are generated by the virtual environment, i.e., the dynamic model $\mathcal{M}$. We thus repeatedly perform the following procedure to learn a better policy: 1) perform controller imagination to gather new session data into a separate dataset $\mathcal{D}'$; 2) perform controller reinforcement learning to improve the recommendation policy on $\mathcal{D}'$ (Lines 5-8).
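The control flow of Algorithm 1 can be sketched as follows; the component functions are injected as callables so the flow can be read (and exercised) in isolation, and all names are placeholders for the real procedures of Sections 5-7:

```python
def dress_training(logging_data, dynamic_model_training, init_networks,
                   controller_imitation, controller_learning, imagine,
                   n_imagination_rounds=3):
    """Schematic of Algorithm 1 with injected components."""
    model = dynamic_model_training(logging_data)                      # Line 1
    actor, critic = init_networks()                                   # Line 2
    actor = controller_imitation(actor, logging_data)                 # Line 3
    actor, critic = controller_learning(actor, critic, logging_data)  # Line 4
    for _ in range(n_imagination_rounds):       # Lines 5-8, until convergence
        imagined = imagine(model, actor)                              # Line 6
        actor, critic = controller_learning(actor, critic, imagined)  # Line 7
    return model, actor, critic                                       # Line 9

# Exercise the control flow with trivial counters standing in for training.
model, actor, critic = dress_training(
    logging_data="D",
    dynamic_model_training=lambda d: "M",
    init_networks=lambda: (0, 0),
    controller_imitation=lambda a, d: a + 1,
    controller_learning=lambda a, c, d: (a + 1, c + 1),
    imagine=lambda m, a: "D_imagined",
    n_imagination_rounds=2,
)
```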
5 The Neural Network Dynamic Model
5.1 Illustrative Overview
As introduced earlier, we parameterize the dynamic model as a neural network function, whose parameters are the weights of the network. As illustrated in Figure 2, our dynamic model consists of two units: a reward model and a transition model. The transition model updates the user hidden feature and predicts the next query, taking the current user search session and the previous user hidden feature as inputs. The user hidden feature is the hidden state obtained by recurrently applying the model to the user's search sessions up to the current timestamp, and the initial user hidden feature is determined by the user profile.
The reward model can be intuitively interpreted as a click-through rate (CTR) prediction model. Its inputs are the user hidden feature, the query, the product item, and the search story, and its output is the predicted reward. The state submodule takes the user hidden feature and the query as inputs and outputs the user state, representing the user intent; the user state is then combined with the story and product item features as inputs to the core submodule that predicts the reward.
In the following, we introduce the detailed implementation of the transition and reward model architectures, including featurization, loss functions, and optimizers. Although we introduce them separately, the two units are implemented within the same architecture and share various layers (variables). For instance, the user hidden feature is shared across the transition model and the reward model.
5.2 Transition Model
We outline the detailed architecture of the transition model on the left side of Figure 3.
5.2.1 Featurization
The hidden feature is represented as a user vector constructed from both the user's long-term profile and real-time profile. As defined in Definition 1, the user search session consists of the query, the story, and the feedback. Each query, as shown in Figure 3, is represented as an aggregated vector of its token embeddings (yellow boxes). Each story is first represented as raw tokens plus dense human-crafted features. The raw tokens are obtained from the title and description of the story itself as well as from those of the product items within the story; they are fed into the embedding layer (shared with the query embedding) and transformed into an aggregated vector of token embeddings (red boxes). This aggregated embedding vector (red box) is concatenated with the dense feature vector (pink box) as the final representation of the story.
The feedback is represented as the concatenation of two one-hot session-level engagement indicator vectors for the search story and product items (green boxes) and the aggregated vector of token embeddings from the user-engaged product items (light blue boxes).
5.2.2 Layers of Model
The transition model is implemented as a traditional encoder-decoder architecture using the gated recurrent unit (GRU). The inputs are the concatenation of the story, query, and feedback feature vectors, together with the hidden state. The output is the feature representation of the predicted next query.
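The GRU recurrence at the heart of the encoder can be sketched with a minimal NumPy cell. The gate convention, dimensions, and random initialization below are illustrative only; a production model would use a deep learning framework:

```python
import numpy as np

def gru_cell(x, h, params):
    """One GRU step: consumes the session feature vector x and the previous
    user hidden feature h, returns the updated hidden feature."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate hidden state
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
d_in, d_h = 4, 3   # toy dimensions
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0
          else rng.standard_normal((d_h, d_h)) for i in range(6)]
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):   # recurrently encode 5 sessions
    h = gru_cell(x, h, params)
```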
5.2.3 Loss Functions
We use the mean square error (MSE) between the predicted feature vector of the next query and its ground truth feature vector as the loss function for the transition model.
5.3 Reward Model
The architecture of the reward model is outlined on the right side of Figure 3.
5.3.1 Featurization
The featurization of the search story is the same as in the transition model. The product item, similar to the search story, is represented as an aggregated vector of token embeddings (orange box). The user intent (dark blue box) is featurized as a hidden representation learned by the state submodule, which takes as input the hidden history (shared with the transition model) and the observed query (same featurization as in the transition model, yellow box) and outputs the user state.
5.3.2 Layers of Model
We use a multilayer perceptron (MLP) that takes the user, search story, and product item features as input and predicts the feedback for the search story and product items. The output layer is a classification layer for search story feedback prediction, and a combination of a classification layer and a regression layer for product item feedback prediction.
5.3.3 Loss function
For the classification layers, we use the cross entropy loss (CE), while for the regression layer, we use the conditional square error (CSE). Specifically, let $y^s$ and $y^d$ denote the ground truth feedback labels for the search story and the product item, let $v^d$ denote the ground truth product representation, and let $\hat{y}^s$, $\hat{y}^d$, $\hat{v}^d$ denote the corresponding predictions. The loss function is defined as:
$L_r = \mathrm{CE}(y^s, \hat{y}^s) + \mathrm{CE}(y^d, \hat{y}^d) + \mathrm{CSE}(y^d, v^d, \hat{v}^d)$
where the cross entropy loss is defined as $\mathrm{CE}(y, \hat{y}) = -y \log \hat{y} - (1 - y)\log(1 - \hat{y})$ and the conditional square error is defined as $\mathrm{CSE}(y, v, \hat{v}) = y\,\lVert v - \hat{v}\rVert^2$, i.e., the square error is counted only when the product receives positive feedback.
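The two losses can be sketched as plain functions. Reading "conditional" as counting the square error only for positively engaged products is our interpretation, and all names below are illustrative:

```python
import math

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross entropy for the feedback classification heads."""
    p = min(max(p_pred, eps), 1.0 - eps)   # clip for numerical safety
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def conditional_square_error(y_true, v_true, v_pred):
    """Square error on the product representation, counted only when the
    product was actually engaged (y_true = 1), hence 'conditional'."""
    if y_true == 0:
        return 0.0
    return sum((a - b) ** 2 for a, b in zip(v_true, v_pred))
```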
5.4 Dynamic Model Training
Given the logging data $\mathcal{D}$, we train the dynamic model by optimizing the following loss function:
$L = \lambda_1\,\mathrm{MSE} + \lambda_2\,\mathrm{CE}(y^s, \hat{y}^s) + \lambda_3\,\mathrm{CE}(y^d, \hat{y}^d) + \lambda_4\,\mathrm{CSE}(y^d, v^d, \hat{v}^d) \quad (4)$
where each coefficient $\lambda_i$ is proportional to the contribution of the corresponding loss term. For ease of presentation, we use $\mathcal{M}$ = Dynamic_Model_Training($\mathcal{D}$) to denote the procedure of training the dynamic model with the architecture shown in Figure 3.
6 Controller Reinforcement Learning
Our reinforcement learning controller is designed under the traditional actor-critic architecture. Specifically, the controller is a multi-head neural network used as the function approximator for choosing the best story from the story embedding pool. Figure 4 illustrates the network structure of the reinforcement learning controller, which consists of a state-value head (i.e., the critic network) and a policy head (i.e., the actor network) with a shared input state representation: the user hidden feature from the transition model. The details are presented as follows.
6.1 Critic Network
As shown in Figure 4, the value network is jointly learned with the policy network. The input is the user hidden feature from the transition model, representing the state $s_t$, and the output is the value of the state under policy $\pi$. Without ambiguity, we write $V(s_t)$, omitting the policy subscript. Our value network uses a neural network to learn the value function, which is updated by a gradient descent optimizer with the following loss function:
$L(V) = \big(r_t + \gamma V(s_{t+1}) - V(s_t)\big)^2 \quad (5)$
The update formula of the parameters with regard to Equation 5 is the stochastic version of the Bellman equation.
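In tabular form, the gradient step on this loss reduces to the familiar TD(0) backup, which makes the "stochastic Bellman equation" remark concrete; the toy sketch below uses illustrative names:

```python
def td0_update(V, s, r, s_next, gamma=0.7, lr=0.1):
    """One stochastic Bellman update on a dict-based value table V."""
    target = r + gamma * V.get(s_next, 0.0)   # bootstrapped target
    td_error = target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + lr * td_error      # gradient step on squared loss
    return td_error

V = {}
td = td0_update(V, "s0", 1.0, "s1")
```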
6.2 Actor Network
Our policy optimization is designed based on the state-of-the-art Proximal Policy Optimization (PPO) controller. Our policy $\pi$ is again parametrized as a neural network function, with parameters distinct from those of the dynamic model. The architecture of our policy neural network is shown on the right side of Figure 4.
Input: data $\mathcal{D}$, the current actor network $\pi$, and the critic network $V$
Output: the updated actor network $\pi$ and critic network $V$
1: repeat: sample a mini-batch of search sessions from $\mathcal{D}$
2: update the critic network by minimizing Eq. 5
3: update the actor network by minimizing Eq. 6
4: return $\pi$, $V$
In the controller reinforcement learning procedure, the policy is learned by maximizing the accumulated state value, averaged over the state distribution of the search session history:
$L(\pi) = -\mathbb{E}_t\Big[\min\big(\rho_t \hat{A}_t,\ \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\,\hat{A}_t\big) + \beta\, H\big(\pi(\cdot \mid s_t)\big)\Big] \quad (6)$
where $\rho_t = \pi(a_t \mid s_t) / \pi_{\mathrm{old}}(a_t \mid s_t)$ is the policy ratio and $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the estimated advantage function. This advantage estimator is the same as setting $\lambda = 0$ in the GAE estimate used in the original PPO paper, as our experiments suggest no better performance with a non-zero $\lambda$. $H(\pi(\cdot \mid s_t))$ is the entropy of the policy given state $s_t$, and $\beta$ is its weight.
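Per sample, the clipped surrogate and the one-step advantage can be sketched as follows (names are illustrative; the clipping factor and entropy weight defaults match Table 3):

```python
def one_step_advantage(r, v_s, v_s_next, gamma=0.7):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t), i.e. GAE with lambda = 0."""
    return r + gamma * v_s_next - v_s

def ppo_clip_objective(pi_new, pi_old, advantage, entropy,
                       eps=0.2, entropy_weight=0.01):
    """min(rho * A, clip(rho, 1-eps, 1+eps) * A) + beta * H for one sample."""
    rho = pi_new / pi_old                           # policy ratio
    clipped = max(min(rho, 1.0 + eps), 1.0 - eps)   # clip(rho, 1-eps, 1+eps)
    return min(rho * advantage, clipped * advantage) + entropy_weight * entropy
```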
7 Imitation and Imagination
7.1 Imitation Learning
In our search story recommendation task, as in most other real-world decision-making problems (e.g., finance and health care), we have access to logging data from the system operated by its previous controller, but we do not have access to an accurate simulator of the system. The goal of imitation learning is thus to learn to imitate the previous controller with a fixed policy $\pi_I$.
Specifically, we learn the policy from $\mathcal{D}$ by optimizing the likelihood of the actions chosen. Formally, imitation learning can be formulated as the following optimization task:
$\pi_I = \arg\max_\pi L(\pi; \mathcal{D}) \quad (7)$
where $L(\pi; \mathcal{D})$ is the likelihood of observing the actions in $\mathcal{D}$ under the policy $\pi$, together with an entropy penalty:
$L(\pi; \mathcal{D}) = \sum_{(s_t, a_t) \in \mathcal{D}} \log \pi(a_t \mid s_t) + \beta_I\, H\big(\pi(\cdot \mid s_t)\big) \quad (8)$
where $\beta_I$ is the weight for the entropy regularizer $H$.
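The imitation objective can be sketched as a negative log-likelihood with an entropy bonus; the batching and parameterization below are illustrative, and the default entropy weight matches Table 3:

```python
import math

def imitation_loss(action_probs, full_dists, entropy_weight=1e-4):
    """Negative log-likelihood of logged actions minus weighted entropy.

    action_probs: pi(a_t | s_t) for each logged (s_t, a_t) pair
    full_dists:   the full distribution pi(. | s_t) at each step
    """
    nll = -sum(math.log(max(p, 1e-12)) for p in action_probs)
    entropy = -sum(p * math.log(max(p, 1e-12))
                   for dist in full_dists for p in dist)
    return nll - entropy_weight * entropy   # minimize: fit data, keep entropy
```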
7.2 Controller Imagination
It is not data-efficient to apply only a model-free reinforcement learning method to the logging data, especially since the preceding controller reinforcement learning step (Section 6) is a single iteration of the PPO algorithm. The goal of controller imagination is thus to use the trained dynamic model to further improve the actor network.
Specifically, we use randomly selected sessions in the logging data as starting sessions; from each of these, the dynamic model and the current actor network are applied to roll out fictional search sessions, which are stored in a separate dataset. The imagined data is then used in controller reinforcement learning (Section 6) to further tune the actor network. This is essentially the original PPO controller learning, except that the real environment is replaced by the dynamic model.
8 Experimental Validation
In this section, we conduct extensive experiments with a dataset from a real e-commerce company and evaluate the effectiveness of DRESS.
8.1 Experimental Setup
We evaluate our methods on a dataset collected between Apr 2018 and Jul 2018 from JD.com. We sampled all search sessions related to the category “women dress” and filtered out search episodes with either very few or a huge number of sessions; specifically, as shown in Fig. 5, we only keep search episodes whose length is within the range [11, 200]. Our dataset is carefully pre-processed and anonymized. The distributions of episode length and of the number of search sessions in which each search story appears are visualized in Fig. 5.
Other statistics of our dataset are summarized in Table 2. We randomly divide the dataset into 5 folds by users, so that each fold contains an equal number of complete search episodes. We perform 5-fold cross-validation experiments, with one random fold as testing data for each experiment.
The processed feature dimensions are summarized as follows. Each query is represented as an aggregation of 200-dimensional word embedding vectors of segmented query words. Each product is represented as an aggregation of 200-dimensional word embedding vectors of words from product titles. Each story is featurized as a concatenation of 200-dimensional word embedding vectors of words from story titles, 200-dimensional word embedding vectors of title words from the products embedded within the story, and 13 human-crafted story features.
| # users | # stories | # products | # sessions |
|---------|-----------|------------|------------|
| 122,886 | 2,185 | 304,780 | 1,842,879 |
8.1.2 Comparable Methods
We compare the proposed method DRESS as described in Algorithm 1 with the following baseline methods:
ORIGIN: The state-of-the-art production implementation of search story recommendation currently used by the company, which generated the offline data.
DNNC (Deep Neural Network Classifier): Without considering the cross-channel effect, this method is trained to recommend the search story that is most likely to be clicked, given the story feedback data. For a fair comparison, DNNC uses the same architecture as the actor network and is initialized with the imitation policy, the same as DRESS.
DRESS-m: A myopic version of DRESS that only considers the immediate short-term reward, implemented by setting the discount factor $\gamma = 0$.
DRESS-s: A simplified version of DRESS with the controller imagination module (Section 7.2) removed.
8.1.3 Evaluation Metric
The goal of a search story recommendation is to assist users during their search for products. Therefore, we use search-session-based user feedback on products as the main performance measure. In particular, we use the percentage of search sessions in which the user clicked a product (Click-Through Rate):
$\mathrm{CTR} = \frac{\sum_t c_t}{\#\,\text{sessions}} \quad (9)$
where $c_t$ is a binary indicator of whether the user clicked a product in search session $x_t$, which coincides with the reward in the RL framework. Similarly, we also use the Conversion Rate:
$\mathrm{CVR} = \frac{\sum_t v_t}{\#\,\text{sessions}} \quad (10)$
where $v_t$ is a binary indicator of whether the user ordered a product in search session $x_t$.
It is risky to evaluate a learned policy on a real-life system. Therefore, we use a statistical estimation method, Truncated Weighted Importance Sampling (TWIS), to estimate performance from the offline test data as follows:
$\widehat{\mathrm{CTR}}_\pi = \frac{\sum_{X} \sum_{x_t \in \mathrm{last}_H(X)} \rho_t\, c_t}{\sum_{X} \sum_{x_t \in \mathrm{last}_H(X)} \rho_t}, \qquad \rho_t = \frac{\pi(a_t \mid s_t)}{\pi_I(a_t \mid s_t)} \quad (11)$
where $X$ is an episode, $H$ is the horizon, i.e., the number of latest sessions used per episode, the indicator can be the click indicator $c_t$ (or order indicator $v_t$) so that the estimator yields CTR (or CVR, respectively), and $\pi(a_t \mid s_t)$ and $\pi_I(a_t \mid s_t)$ are the probabilities of the observed action under the evaluated policy and the logging policy. With only offline data available, the imitation policy $\pi_I$ is used as the logging policy. This is justified by the following factors:
- All compared methods share the same imitation policy at initialization.
- The imitation policy is trained to fit the offline data generated by the logging policy.
- Following the idea of importance sampling, the estimator approximates the CTR (or CVR) of the policy reweighted by the ratio between the evaluated policy and the true logging policy, which is valid when the imitation policy and the true logging policy are close enough. Even when the true logging policy differs from the imitation policy, the policy ratio still reflects the adjustment that reinforcement learning contributes on top of the imitation policy, and it is this adjustment that gets evaluated.
This evaluation metric is invariant to an arbitrary constant scaling of either policy's action probabilities. The truncated setting gives equal importance to users whose episodes have different lengths.
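A minimal sketch of a TWIS-style estimator follows. Whether the weights are per-session ratios (as below) or per-episode products of ratios is an assumption of this sketch, and all names are illustrative:

```python
def twis(episodes, pi_eval, pi_log, horizon=15):
    """Truncated, self-normalized importance-sampling estimate of the reward.

    episodes: list of episodes, each a list of (state, action, reward) tuples
    pi_eval, pi_log: callables (state, action) -> probability
    """
    num, den = 0.0, 0.0
    for episode in episodes:
        for s, a, r in episode[-horizon:]:   # keep only the latest H sessions
            w = pi_eval(s, a) / pi_log(s, a)
            num += w * r
            den += w
    return num / den if den > 0 else 0.0     # self-normalization: scale-invariant
```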
One potential downside of importance-sampling-based evaluation methods is their large variance when the target policy and the logging policy are very different. In our case, however, with safe policy iteration and/or limited model-based improvement, the two policies remain close. The exception is DNNC, a supervised learning method, for which there is no guarantee that the trained classifier policy stays close to the imitation policy used at initialization. To obtain valid TWIS results, we therefore add a KL-divergence regularizer to the negative log-likelihood loss of DNNC.
Most hyperparameters are tuned on the validation set for each experiment. For reproducibility, the values of important hyperparameters are summarized in Table 3.
| Hyperparameter | Value |
|----------------|-------|
| discount factor (Eq. 1) | 0.7 |
| transition loss weight (Eq. 4) | 1.0 |
| story loss weight (Eq. 4) | 1.0 |
| product CE loss weight (Eq. 4) | 1.0 |
| product CSE loss weight (Eq. 4) | 1.0 |
| entropy weight for controller learning (Eq. 6) | 0.01 |
| entropy weight for controller imitation (Eq. 8) | 0.0001 |
| clipping factor (Eq. 6) | 0.2 |
| evaluation horizon (Eq. 11) | 15 |
8.2 Empirical Results
In this section, we conduct different groups of experiments to empirically validate the proposed approaches. Specifically, we aim to answer the following questions: (1) Is it necessary to formulate the search story recommendation problem as a reinforcement learning problem? (2) Does model-based reinforcement learning lead to better search story recommendation performance? (3) What is the advantage of combining imitation learning and reinforcement learning?
Question 1 (Justification of Reinforcement Learning): Is it necessary to use the reinforcement learning framework to solve the personalized search story recommendation problem?
As shown in Table 4, DNNC performs the worst among all methods; it is also the only method that fails to obtain a significant improvement over ORIGIN. Furthermore, as Table 5 shows, DNNC even performs worse than ORIGIN in CVR. Unlike the other methods, DNNC only considers direct feedback on search stories and ignores feedback on organic search from the other channel. This result therefore highlights the necessity of modeling the cross-channel effect for effective search story recommendation, as DRESS does. In addition, Tables 4 and 5 show that DRESS-m performs much worse than DRESS. This is expected, because DRESS-m only considers short-term rewards while ignoring the influence of current actions (i.e., engagement with the recommended search story) on users' long-term behavior. The result strongly suggests that the long-term effect must be taken into consideration for effective search story recommendation.
As discussed in Section 1, the reinforcement learning framework naturally supports both the cross-channel effect and the long-term effect. The empirical results thus provide strong justification for formulating search story recommendation as a reinforcement learning problem.
Question 2. Model-based versus Model-free: Does the model-based controller imagination help improve the performance?
In Table 4, the superiority of DRESS over DRESS-s shows the contribution of the model-based controller imagination sub-module, and the gap is even more obvious in Table 5. In addition, Figure 6 shows the improvement rate for different values of the evaluation horizon. The model-based controller imagination module decreases the short-term performance (i.e., at smaller evaluation horizons) but increases the long-term performance (i.e., at larger evaluation horizons). Hence, since the horizon is effectively infinite in the real-world situation, the combination with the model-based sub-module is expected to produce better results in the long run.
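The horizon-dependent evaluation above amounts to accumulating discounted rewards only up to a cutoff. A minimal sketch, using the discount factor 0.7 from Table 3 and a hypothetical function name (not the paper's evaluation code):

```python
def discounted_return(rewards, gamma=0.7, horizon=None):
    """Sum of discounted rewards, optionally truncated at `horizon` steps.

    A small evaluation horizon credits only the short-term effect of a
    recommendation; a large (effectively infinite) horizon also credits
    long-term user behavior, which is where controller imagination helps.
    """
    if horizon is not None:
        rewards = rewards[:horizon]
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Example: a constant reward stream evaluated at different horizons.
rewards = [1.0] * 20
short = discounted_return(rewards, horizon=2)   # 1 + 0.7 = 1.7
long = discounted_return(rewards, horizon=15)   # approaches 1 / (1 - 0.7)
```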
Question 3. Imitation + Safe Policy Improvement: What is the advantage of using the safe policy iteration reinforcement learning algorithm?
The imitation policy is an estimate of the current online policy that generates the offline data. We expect the resulting policy to stay close to the latter in order to ensure the stability of the online system. The same holds when the policy ratio is applied to reweight the current online policy, as argued in Section 8.1.3. We calculate the following measures of distribution difference:
-  Log probability ratio: $\sum_{t}\left(\log \pi_\theta(a_t \mid s_t) - \log \hat{\pi}(a_t \mid s_t)\right)$ for a session;
-  Total variation divergence: $D_{\mathrm{TV}}(\pi_\theta, \hat{\pi}) = \frac{1}{2}\sum_{a}\left|\pi_\theta(a \mid s_t) - \hat{\pi}(a \mid s_t)\right|$, averaged over the steps of a session.
We compute the average of each difference measure over the sessions in the test data, and use the uniform distribution for comparison. The results are shown in Table 6. Compared with the uniform policy, both DRESS and DRESS-s stay close to the imitation policy. As expected, the policy obtained by DRESS deviates more from the imitation policy than that of DRESS-s, because of the additional controller imagination sub-module (Section 7.2). Hence, in an application, a trade-off should be made between performance gain and stability by controlling how much controller imagination is included (i.e., by limiting the number of iterations of steps 6-7 in Algorithm 1).
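For discrete action distributions, these distribution-difference measures can be computed along the following lines. This is an illustrative sketch with hypothetical numbers; the paper's exact definitions (e.g., how quantities are aggregated over a session) may differ.

```python
import math

def log_prob_ratio(p, q, actions):
    """Log probability ratio of an action sequence under two policies p, q."""
    return sum(math.log(p[a]) - math.log(q[a]) for a in actions)

def total_variation(p, q):
    """Total variation divergence between two discrete distributions."""
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)

# Example: a learned policy vs. the imitation policy and a uniform baseline
# over 4 candidate search stories (numbers are made up for illustration).
learned = {0: 0.50, 1: 0.30, 2: 0.15, 3: 0.05}
imitation = {0: 0.45, 1: 0.30, 2: 0.15, 3: 0.10}
uniform = {a: 0.25 for a in range(4)}

# The learned policy stays much closer to the imitation policy than to
# the uniform baseline, mirroring the comparison reported in Table 6.
assert total_variation(learned, imitation) < total_variation(learned, uniform)
```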
Deep reinforcement learning has been successfully used to capture a wide variety of non-trivial user behavior on online platforms (e.g., news feed recommendation, e-commerce search). Following these successes, we applied the reinforcement learning framework to the challenging problem of cross-channel search story recommendation by casting it as a Markov decision process. We further proposed a unified deep learning architecture that employs both imitation learning and reinforcement learning. Comprehensive empirical validation indicates that our proposal, DRESS, is effective in improving the conversion rate on real-world data sets from JD.com.
-  N. Abe, N. Verma, C. Apte, and R. Schroko. Cross channel optimized marketing by reinforcement learning. In SIGKDD, pages 767–772. ACM, 2004.
-  K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.
-  D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.
-  L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. JMLR, 14(1):3207–3260, 2013.
-  H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo. Real-time bidding by reinforcement learning in display advertising. In WSDM, pages 661–670. ACM, 2017.
-  Q. Cai, A. Filos-Ratsikas, P. Tang, and Y. Zhang. Reinforcement mechanism design for e-commerce. In WWW, pages 1339–1348, 2018.
-  P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In RecSys, pages 191–198. ACM, 2016.
-  S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration. In ICML, pages 2829–2838, 2016.
-  R. Guerraoui, A.-M. Kermarrec, T. Lin, and R. Patra. Heterogeneous recommendations: what you might like to read after watching interstellar. Proceedings of the VLDB Endowment, 10(10):1070–1081, 2017.
-  B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.
-  B. Hidasi, M. Quadrana, A. Karatzoglou, and D. Tikk. Parallel recurrent neural network architectures for feature-rich session-based recommendations. In RecSys. ACM, 2016.
-  Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. arXiv preprint arXiv:1803.00710, 2018.
-  Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In SIGKDD, pages 426–434. ACM, 2008.
-  Y. Koren, R. Bell, C. Volinsky, et al. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
-  S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(1):1334–1373, 2016.
-  J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
-  L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670. ACM, 2010.
-  T. Li, Z. Xu, J. Tang, and Y. Wang. Model-free control for distributed stream data processing using deep reinforcement learning. Proceedings of the VLDB Endowment, 11(6):705–718, 2018.
-  Y. Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
-  T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill, and Z. Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, pages 1077–1084, 2014.
-  J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In ICML, pages 593–600. ACM, 2005.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
-  F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, pages 784–791. ACM, 2008.
-  R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted boltzmann machines for collaborative filtering. In ICML, pages 791–798. ACM, 2007.
-  B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, pages 285–295. ACM, 2001.
-  J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
-  D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
-  R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning, pages 216–224. Elsevier, 1990.
-  Y. K. Tan, X. Xu, and Y. Liu. Improved recurrent neural networks for session-based recommendations. arXiv preprint arXiv:1606.08117, 2016.
-  G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In IJCAI, pages 1806–1812, 2015.
-  I. Trummer, S. Moseley, D. Maram, S. Jo, and J. Antonakakis. Skinnerdb: regret-bounded query evaluation via reinforcement learning. Proceedings of the VLDB Endowment, 11(12):2074–2077, 2018.
-  A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In NIPS, pages 2643–2651, 2013.
-  H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. In AAAI, volume 2, page 5. Phoenix, AZ, 2016.
-  H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In SIGKDD, pages 1235–1244. ACM, 2015.
-  Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
-  M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, pages 2746–2754, 2015.
-  M. Weimer, A. Karatzoglou, Q. V. Le, and A. Smola. Maximum margin matrix factorization for collaborative ranking. NIPS, pages 1–8, 2007.
-  J. Zhang, Y. Liu, K. Zhou, G. Li, Z. Xiao, B. Cheng, J. Xing, Y. Wang, T. Cheng, L. Liu, M. Ran, and Z. Li. An end-to-end automatic cloud database tuning system using deep reinforcement learning. In SIGMOD, 2019.
-  X. Zhao, W. Zhang, and J. Wang. Interactive collaborative filtering. In CIKM, pages 1411–1420. ACM, 2013.
-  G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li. Drn: A deep reinforcement learning framework for news recommendation. In WWW, pages 167–176, 2018.
-  L. Zou, L. Xia, Z. Ding, J. Song, W. Liu, and D. Yin. Reinforcement learning to optimize long-term user engagement in recommender systems. In SIGKDD. ACM, 2019.