Generator and Critic: A Deep Reinforcement Learning Approach for Slate Re-ranking in E-commerce

05/25/2020 ∙ by Jianxiong Wei, et al. ∙ Taobao

The slate re-ranking problem considers the mutual influences between items to improve user satisfaction in e-commerce, in contrast to point-wise ranking. Previous works either directly rank items by an end-to-end model, or rank items by a score function that trades off the point-wise score and the diversity between items. However, two main challenges are not well studied: (1) the evaluation of the slate is hard due to the complex mutual influences between items of one slate; (2) even given the optimal evaluation, searching for the optimal slate is challenging as the action space is exponentially large. In this paper, we present a novel Generator and Critic slate re-ranking approach, where the Critic evaluates the slate and the Generator ranks the items by reinforcement learning. We propose a Full Slate Critic (FSC) model that considers the real impressed items and avoids the impressed bias of existing models. For the Generator, to tackle the problem of the large action space, we propose a new exploration reinforcement learning algorithm, called PPO-Exploration. Experimental results show that the FSC model significantly outperforms the state-of-the-art slate evaluation methods, and the PPO-Exploration algorithm outperforms the existing reinforcement learning methods substantially. The Generator and Critic approach improves both the slate efficiency (4% GMV and 5% number of orders) and diversity in live experiments on one of the largest e-commerce websites in the world.




1. Introduction

In a typical e-commerce website, when a user searches for a keyword, the website returns a list of items to the user by the ranking algorithm. This process is usually done by predicting the scores of user-item pairs and sorting items based on the point-wise scores. However, the point-wise ranking algorithm does not consider the mutual influences between items in one page, and tends to return similar items with the highest scores. Figure 1 gives an example: a buyer searches for “smart watch” in an e-commerce app, and the app returns similar watches, which decreases user satisfaction and the efficiency of the list.

Figure 1. The return list when searching “smart watch”

To improve the diversity of the list, a series of research works, MMR, IA-Select, xQuAD and DUM (Agrawal et al., 2009; Ashkan et al., 2015; Carbonell and Goldstein, 1998; Santos et al., 2010), propose to rank items by weighted functions that trade off the user-item scores and the diversity of items. However, these methods ignore the impact of diversity on the efficiency of the list. Wilhelm et al. (2018) present a deep determinantal point process model to unify the user-item scores and the distances of items, and devise a practical personalized approach. Nevertheless, the determinantal point process uses separate structures to model the user-item scores and the diversity of items, which limits the representation ability of the user interest over the slate.

Ai et al. (2018, 2019) and Zhuang et al. (2018) propose models that take the n point-wise ranked items as input and output refined scores, by adding features between items to the model. This approach selects the top-k items from the n refined scores, and is commonly called “Slate Re-ranking” or “Post Ranking”. Pei et al. (2019) employ the efficient Transformer structure and introduce personalized user encoding to the structure, in contrast to the RNN structure used in two classical re-ranking methods, DLCM (Ai et al., 2018) and GlobalRank (Zhuang et al., 2018). However, the main concern with this approach is that only k items are shown (impressed) to the user in reality, whereas the models score all n candidate items. Thus these models cannot reflect the true preference of the user over the slate, which causes the “impressed bias”.

Indeed, the slate re-ranking problem presents two main challenges:

(1) The evaluation of the slate is difficult, as it is crucial for a model to tackle the “position bias”, the “impressed bias”, and the mutual influence of items, where the “position bias” means the impact of the positions of items on the preference of the user.

(2) Given an evaluation model for any top-k slate, the number of possible top-k slates is extremely large.

To solve these two challenges, we propose a novel Generator and Critic slate re-ranking approach, shown in Figure 2. The Full Slate Critic module evaluates any candidate top-k list. The Generator learns the optimal top-k list by reinforcement learning algorithms (Sutton and Barto, 2018). That is, the Generator module is the reinforcement learning agent and the Critic module is the environment. The environment is defined as follows: in each episode (a user and a query arrive), the action of the Generator at each step is to pick an item from the n candidate items, and the Critic returns the evaluation score (reward) of the whole slate after the episode ends, i.e., after k items have been selected. At each step, the Generator follows a reinforcement learning policy that takes the state (the user, the query, the features of the candidate items, and the selected items) as input and outputs an action (the new item to be added to the list).

For the Critic module, to tackle the challenge of evaluating the slate exactly, we present a Full Slate Critic (FSC) model. The FSC model takes the features of the k impressed items as input, rather than all n candidate items, which avoids the “impressed bias”. The model consists of four components: Su Attention Net, Pair Influence Net, Bi-GRU Net and Feature Compare Net. The Su Attention Net encodes the real-time impressed items and the click and pay behaviors of the user in the same search session, to capture the real-time interest of the user. The Pair Influence Net calculates the pair-wise influences of other items on each item, and aggregates the influences by the attention method (Wang et al., 2017), which models the mutual influences between items. The Bi-GRU Net uses a Bidirectional GRU structure (Schuster and Paliwal, 1997) to capture the influence of the nearby items over each item, and is designed to reduce the “position bias”. The Feature Compare Net compares the feature values of each item with those of other items, and represents the impact of the difference of feature values on the user interest.

For the Generator module, we embrace the recent advancements of combining deep reinforcement learning with e-commerce (Cai et al., 2018a, b; Chen et al., 2019b, 2018a, 2018b; Hu et al., 2018; Zhao et al., 2018; Zheng et al., 2018). Chen et al. (2019b) propose a top-K off-policy correction method that uses the reinforcement learning (RL) policy to generate the top-k list, and learns the policy from real rewards. However, this approach is not practical, as it costs millions of samples to train a good policy in a model-free reinforcement learning style, and the e-commerce site cannot afford the exploration risk. As model-based RL methods are significantly more sample efficient than model-free RL methods (Deisenroth and Rasmussen, 2011; Levine et al., 2016), we train the RL policy with model-based samples, which are generated by the RL policy and evaluated by the Critic module. We present a new RL method, called PPO-Exploration, which builds on the state-of-the-art reinforcement learning method, proximal policy optimization (PPO) (Schulman et al., 2017). PPO-Exploration drives the policy to generate diverse items by adding a diversity score to the reward function at each step. The diversity score is defined as the distance between the currently picked item and the already selected items, and serves as the exploration bonus (Bellemare et al., 2016).

To summarize, the main contribution of this paper is as follows:

1. We present a new Generator and Critic slate re-ranking approach to tackle the two challenges: the exact evaluation of slates and the efficient generation of optimal slate.

2. We design a Full Slate Critic (FSC) model that avoids the “impressed bias” that exists in current methods, and better represents the mutual influences between items. Experimental results show that the FSC model outperforms both the point-wise Learning to Rank (LTR) and the DLCM method substantially. We show that the FSC model is sufficiently correct to evaluate any slate generation policy.

3. We propose a new model-based RL method for generating slates efficiently, called PPO-Exploration, which encourages the policy to pick diverse items in the slate generation. Experimental results show that PPO-Exploration performs significantly better than the Reinforce algorithm and the PPO algorithm. We also validate the effectiveness of the model-based method compared with the model-free method that trains the RL policy with real rewards.

4. The Generator and Critic approach has been successfully applied in one of the largest e-commerce websites in the world, and improves the number of orders by 5% and GMV by 4% during the A/B test. The diversity of slates is also improved.

1.1. Related Work

Deep Reinforcement Learning (DRL) has recently achieved cutting-edge progress in Atari games (Mnih et al., 2015) and continuous control tasks (Lillicrap et al., 2015). Motivated by the success of deep reinforcement learning on these tasks, applying DRL to recommendation and search tasks has become a hot research topic (Bai et al., 2019; Chen et al., 2019a, b, 2018b; Gong et al., 2019; Shi et al., 2019; Takanobu et al., 2019; Zhang et al., 2019; Zou et al., 2019). The most representative work in this research line is (Hu et al., 2018), where Hu et al. apply reinforcement learning to the Learning to Rank task in e-commerce.

The main difference between our work and previous works is that we do not solve the MDP in which the agent is the recommendation system and the action of the agent at each state is to recommend items to the user. In our MDP, there is one new user and one query at the initial state of each episode, and the action at each step is to pick one item from the original LTR ranked items. To the best of our knowledge, we are one of the first to apply reinforcement learning to the slate re-ranking problem. We also build a more precise evaluation model of the slate, which enables efficient model-based reinforcement learning.

Different from slate re-ranking works that focus on the refined scores of items, Jiang et al. (2018) propose List-CVAE to pick items by item embedding. List-CVAE uses a deep generative model to learn the desired embeddings, and picks the items closest to each desired embedding from the original items. However, this approach cannot guarantee the efficiency of the output list, as there may not exist items that are close enough to the desired embedding.

2. Preliminary

In this section we introduce the basic concepts of slate re-ranking in e-commerce, and reinforcement learning.

2.1. Slate Re-ranking

When a user $u$ searches a query $q$, the ranking system returns $n$ items sorted by $n$ user-item pair scores, denoted by $\mathbf{o} = (o_1, \dots, o_n)$.

Let $\phi$ denote the slate re-ranking function. The slate re-ranking function outputs a top-$k$ list $\phi(u, q, \mathbf{o}) = (a_1, \dots, a_k)$, given the input of the query, the user feature, and the $n$ items.

Given an estimation function $E(u, q, \phi(u, q, \mathbf{o}))$, the score of the top-$k$ list with the query $q$ and the user $u$, the objective of slate re-ranking is to find a slate re-ranking function $\phi$ that maximizes the total user satisfaction over $m$ queries, $\sum_{i=1}^{m} E(u_i, q_i, \phi(u_i, q_i, \mathbf{o}_i))$.

2.2. Reinforcement Learning

Reinforcement learning focuses on solving the Markov Decision Process (MDP) problem (Sutton and Barto, 2018). In an MDP, the agent observes a state $s_t$ in the state space $\mathcal{S}$, plays an action $a_t$ in the action space $\mathcal{A}$, and gets the immediate reward $r_t$ from the environment. After the agent plays the action, the environment changes the state according to the distribution $P(s_{t+1} \mid s_t, a_t)$, which represents the probability of the next state $s_{t+1}$ given the current state $s_t$ and action $a_t$. The policy is defined as $\pi_\theta(a \mid s)$, which outputs an action distribution for any state $s$ and is parameterized by $\theta$. The objective of the agent is to find $\theta$ maximizing the expected discounted long-term rewards of the policy: $J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t \ge 0} \gamma^{t} r_t \mid s_0\right]$, where $s_0$ denotes the initial state and $\gamma$ denotes the discount factor.

The well-known DQN algorithm (Mnih et al., 2015) applies the maximum operator over the action space, which is not suitable for the slate re-ranking problem with its large action space. Thus in this paper we focus on stochastic reinforcement learning algorithms. The Reinforce algorithm (Williams, 1992) optimizes a stochastic policy by the gradient of the objective with respect to the policy, and is applied in (Chen et al., 2019b):

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R_t\right],$$

where $R_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$ denotes the discounted rewards from the $t$-th step. However, the Reinforce algorithm suffers from high variance, as it directly uses Monte-Carlo sampling to estimate the long-term return. Thus actor-critic algorithms such as A2C (Mnih et al., 2016) choose to update the policy by

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A(s_t, a_t)\right], \tag{1}$$

where $A(s_t, a_t)$ is the advantage function, defined as the difference between the action-value function $Q(s_t, a_t)$ and the value function $V(s_t)$. But the policy gradient of Eq.(1) may incur large updates of the policy, that is, the performance of the policy is unstable during training. Schulman et al. (2017) propose the proximal policy optimization (PPO) algorithm, which optimizes a clipped surrogate objective to stabilize the policy update:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta) A_t, \ \mathrm{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right], \tag{2}$$

where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the probability ratio between the new and the old policy, $A_t$ is the advantage at step $t$, and $\epsilon$ is the clipping range. The PPO algorithm enables efficient computation compared with the trust-region policy optimization method (Schulman et al., 2015), and achieves state-of-the-art performance on standard control tasks. In this paper we adapt the principle of the PPO algorithm to train the slate generation policy.
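As a minimal sketch of the clipped surrogate objective of PPO, the following function evaluates it on a batch of samples given log-probabilities under the new and old policies. The function name, the list-based inputs, and the default clipping range of 0.2 are illustrative assumptions and are not taken from this paper.

```python
import math


def ppo_clipped_objective(new_logps, old_logps, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch of (state, action) samples.

    new_logps / old_logps: log-probabilities of the taken actions under the
    new and old policy. eps is the clipping range (0.2 is an assumed default).
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)                 # probability ratio
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)          # pessimistic bound
    return total / len(advantages)
```

With a positive advantage the gain from increasing the ratio is capped at 1 + eps, while with a negative advantage the unclipped (worse) value is kept, which is what discourages large policy updates.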

3. The GCR Model: The Generator and Critic Ranking Approach

In this section we introduce the Generator and Critic ranking framework, called the GCR model. The GCR model works as follows, shown in Figure 2:

Figure 2. The GCR framework
  • When a user query comes to the system, the LTR model passes the top n candidate items to the GCR model.

  • The Generator module picks the top-k items from the n items by the reinforcement learning policy.

  • The Critic module evaluates both the picked top-k slate and the original top-k slate.

  • The GCR model chooses the slate with the larger score from the two candidate slates, the original slate and the re-ranked slate, and exposes that slate to the user.
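The serving steps above can be sketched as follows. This is only an illustration: `gcr_serve`, `toy_critic` and `toy_generator` are hypothetical names, and the toy Critic (counting distinct brands) merely stands in for the learned models.

```python
def gcr_serve(candidates, generator, critic, k=10):
    """Return the slate with the larger Critic score: original top-k or re-ranked."""
    original = candidates[:k]              # first k items of the LTR ranking
    reranked = generator(candidates, k)    # slate proposed by the Generator
    return reranked if critic(reranked) > critic(original) else original


# Toy stand-ins (hypothetical): items are (item_id, brand) pairs.
def toy_critic(slate):
    return len({brand for _, brand in slate})   # more distinct brands scores higher


def toy_generator(candidates, k):
    seen, picked = set(), []
    for item in candidates:                      # prefer items with unseen brands
        if item[1] not in seen:
            picked.append(item)
            seen.add(item[1])
    rest = [item for item in candidates if item not in picked]
    return (picked + rest)[:k]


candidates = [("a", "x"), ("b", "x"), ("c", "y"), ("d", "z")]
slate = gcr_serve(candidates, toy_generator, toy_critic, k=2)
```

Falling back to the original slate whenever the re-ranked one does not score higher means the re-ranking step can only replace a slate that the Critic considers worse.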

The Critic module is trained with real user feedback on the slate, and serves as the environment. The Generator module, i.e., the agent, generates slates, receives the evaluation scores of slates from the Critic, and updates the reinforcement learning policy based on these evaluation scores.

4. The Critic Module

In this section we present the Full Slate Critic model that avoids the “impressed bias”, reduces the “position bias”, and handles the mutual influences of items precisely.

4.1. The Bias of Slate Evaluation

First of all, we discuss the limitations of previous works on slate evaluation. We claim that it is crucial for a slate evaluation model to consider the exact items that are impressed to the users in reality.

Figure 3. The bias of slate evaluation

As shown in Figure 3, given n candidate items, the slate re-ranking process outputs k items to the user. In most e-commerce websites, k is equal to 10. Ai et al. (2018, 2019) and Zhuang et al. (2018) propose models that take the n items as input and output n refined scores by considering features between items. But only the top-k items are shown to the user, and an evaluation model built upon n items cannot exactly reflect the true user preference over any real impressed slate. That is, there exists the “impressed bias”. Also, picking the top-k items from the n refined scores limits the search space of the slate generation.

4.2. The Full Slate Critic Model

Now we introduce the critic module, named the Full Slate Critic (FSC) model. The FSC model takes the user behavior sequences in one search session, the features of the $k$ items, and the query as input, and outputs the predicted conversion probabilities of the $k$ items, denoted by $(p_1, \dots, p_k)$. For simplicity, we let $p_i$ denote the predicted conversion probability of the $i$-th item $a_i$ of the slate.

As the FSC model already considers the influence of other items over each item when predicting the score of each item, we assume that the conversion probabilities of items are independent. Then the conversion probability of the whole slate (at least one item in the slate is purchased) predicted by the model is:

$$p(\text{slate}) = 1 - \prod_{i=1}^{k} (1 - p_i). \tag{3}$$

The objective of the model is to minimize the cross entropy between the predicted scores of items, $p_i$, and the true labels of items, $y_i \in \{0, 1\}$:

$$L = -\sum_{i=1}^{k} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right].$$
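Under the independence assumption stated above, the probability that at least one item of the slate is purchased is one minus the probability that no item is purchased. A minimal sketch (the function name is an illustrative assumption):

```python
def slate_conversion_probability(item_probs):
    """Probability that at least one item is purchased, assuming independence."""
    no_purchase = 1.0
    for p in item_probs:
        no_purchase *= 1.0 - p     # probability that this item is not purchased
    return 1.0 - no_purchase


# Two items with conversion probability 0.5 each: 1 - 0.5 * 0.5
example = slate_conversion_probability([0.5, 0.5])
```

Because each factor (1 - p) shrinks as p grows, the slate probability is increasing in the conversion probability of every item.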

Figure 4. The Full Slate Critic Model

For the architecture of the Full Slate Critic(FSC) model, there are four parts: Su Attention Net, Pair-Influence Net, Bi-GRU Net and Feature Compare Net, shown in Figure 4.

Su Attention Net: The samples are firstly processed by the model, where the static ID features (Item ID, Category ID, Brand ID, Seller ID) and statistics (cvr,ctr) of 10 items are concatenated with the processed user real-time behavior sequences by the Su Attention Net.

Figure 5. Su Attention Net

The details of the Su Attention Net are shown in Figure 5. The impact of the products exposed in the same search session and of the real-time behavior of the user on the current product is aggregated using the attention technique. We separately process the impression (Pv), click, add to cart (Atc), and purchasing (Pay) item feature lists to get the Pv influence embedding, Click influence embedding, Atc influence embedding, and Pay influence embedding respectively. Then these embeddings are concatenated with the 10 item features for use by subsequent networks.

Pair Influence Net: We use the Pair Influence Net to model mutual influences between items inside one slate. For example, if the price of one item among 10 otherwise identical items is higher than that of the other 9 items, the probability of this item being purchased will decrease. For each item $a_i$, we first obtain the pair-wise influence of the other items over this item, which models the impact of both features and relative positions. Let $h_{ij}$ denote the influence of item $a_j$ over the item $a_i$. The total influence of other items over the item $a_i$, $h_i$, is represented by aggregating the pair-wise influences with the attention technique:

$$h_i = \sum_{j \neq i} w_{ij} \, h_{ij}, \tag{4}$$

where $w_{ij}$ denotes the weight of the pair $(i, j)$, and is obtained by the softmax distribution. The detailed structure of the Pair Influence Net is shown in Figure 6.

Figure 6. Pair Influence Net
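The attention aggregation in the Pair Influence Net can be illustrated with a small sketch: raw scores for the pairs are turned into weights by a softmax, and the pair-wise influence vectors are summed with those weights. How the raw scores and the pair-wise influence vectors are computed is not specified here, so they are plain inputs in this sketch, and the function names are illustrative assumptions.

```python
import math


def softmax(scores):
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_influence(pair_influences, pair_scores):
    """Weighted sum of pair-wise influence vectors for one item.

    pair_influences: one vector per other item in the slate.
    pair_scores: one raw attention score per other item.
    """
    weights = softmax(pair_scores)
    dim = len(pair_influences[0])
    return [
        sum(w * h[d] for w, h in zip(weights, pair_influences))
        for d in range(dim)
    ]


# Equal scores give equal weights, so the result is the mean vector.
total_influence = aggregate_influence([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```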

Bi-GRU Net: The position bias of items is handled by the Bi-GRU Net, which uses a Bidirectional Gated Recurrent Unit (GRU) to model the influence of nearby items over each item $a_i$, as shown in Figure 7. The position influence of each item $a_i$ is the concatenation of the outputs of the $i$-th forward GRU and the $i$-th backward GRU.

Figure 7. Bi-GRU Net

Feature Compare Net: The features of items fall into two categories, discrete ID-type features (item ID, category ID, brand ID, etc.) and real-value features (statistics, user preference scores, etc.). The Feature Compare Net in Figure 8 takes these feature values as input, and outputs, for each item and each feature, the result of comparing its value with the values of the other items. The Feature Compare Net enables more efficient structures for slate evaluation, as it directly encodes the differences between items in the dimension of feature values.

Figure 8. Feature Compare Net

5. The Generator Module

In this section, we adapt the reinforcement learning method to search for the optimal slate given the above Full Slate Critic model. We first formally define the slate re-ranking MDP, then discuss three main challenges of this MDP, and finally propose the PPO-Exploration algorithm that tackles these challenges.

5.1. The Slate Re-ranking MDP

Recall the slate re-ranking function defined in Section 2, which maps the user $u$, the query $q$, and the $n$ candidate items $\mathbf{o} = (o_1, \dots, o_n)$ to a top-$k$ list. We choose to generate the slate by picking items one by one from the candidate list $\mathbf{o}$. Formally, the slate re-ranking MDP is defined as follows:

State: There are $k$ steps in one episode, and the state at each step $t$, $s_t$, is defined as a tuple $(u, q, \mathbf{o}, A_t)$, where $A_t$ denotes the set of items selected before the $t$-th step.

Initial state: At the first step of each episode, a query $q$ and a user $u$ are sampled from a fixed distribution, then the LTR model generates $n$ candidate items. The initial state is the tuple $s_1 = (u, q, \mathbf{o}, A_1)$, where $A_1 = \emptyset$ is empty in the initial state.

Action: The action of the policy at each state is the index of a candidate item, denoted by $a_t \in \{1, \dots, n\}$.

State transition: The state transition is defined as

$$s_{t+1} = (u, q, \mathbf{o}, A_t \cup \{a_t\}).$$

The set $A_{t+1} = A_t \cup \{a_t\}$ represents the set of items that are selected before the $(t+1)$-th step.

Reward: The objective of the generator is to maximize the total conversion probability of the whole slate, Eq.(3). The immediate reward is 0 before the whole slate is generated, and is the predicted conversion probability of the whole slate at the last step of the episode.

Done: Each episode ends after $k$ steps.
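The MDP above can be sketched as a small environment class. The class name, the 0-based indices, and the rule that an item cannot be picked twice are assumptions made for this illustration; the user and query are fixed within an episode, so the sketch only returns the selected items as the changing part of the state. The toy critic uses fixed per-item probabilities, whereas the Critic described in Section 4 scores each item in the context of the whole slate.

```python
class SlateEnv:
    """Episode of k picks from n candidates with a terminal-only reward."""

    def __init__(self, num_candidates, k, critic):
        self.n = num_candidates
        self.k = k
        self.critic = critic          # maps a list of picked indices to a score
        self.selected = []

    def reset(self):
        self.selected = []            # no item is selected in the initial state
        return tuple(self.selected)

    def step(self, action):
        if not 0 <= action < self.n:
            raise ValueError("action must index a candidate item")
        if action in self.selected:   # assumption: no repeated items in a slate
            raise ValueError("item already selected")
        self.selected.append(action)
        done = len(self.selected) == self.k
        reward = self.critic(self.selected) if done else 0.0
        return tuple(self.selected), reward, done


probs = [0.1, 0.2, 0.3]               # hypothetical per-item probabilities


def toy_critic(selected):
    no_purchase = 1.0
    for i in selected:
        no_purchase *= 1.0 - probs[i]
    return 1.0 - no_purchase


env = SlateEnv(num_candidates=3, k=2, critic=toy_critic)
env.reset()
_, r1, done1 = env.step(2)            # reward is 0 before the slate is complete
state, r2, done2 = env.step(0)        # terminal step returns the slate score
```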

5.2. Challenges of the Slate Re-ranking MDP

There are three challenges to solve the slate re-ranking MDP:

State representation: Modeling the influence of selected items over the remaining items is critical for the RL policy to pick up diverse items in subsequent steps.

Sparse reward: By the definition of the slate re-ranking MDP, the reward is zero at every step except the last one, which prevents RL algorithms from learning better slates. That is, to learn better policies, a more appropriate reward function is required.

Efficient exploration: The number of possible slates, $n!/(n-k)!$ for ordered slates of $k$ items picked from $n$ candidates, is extremely large. Although the PPO algorithm enables stable policy updates, it tends to get stuck in a local optimum and cannot explore new slates. Thus it is vital to improve the exploration ability of the RL algorithms.
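To give a sense of scale, the number of ordered slates can be computed directly. The values of 50 candidates and 10 slots are the ones used in the experimental setup of Section 6.1 (top-50 candidate items, top-10 list); `math.perm` requires Python 3.8 or later.

```python
import math

n, k = 50, 10
num_slates = math.perm(n, k)     # n! / (n - k)!, ordered picks without repetition
print(f"{num_slates:.2e}")       # roughly 3.7e16 possible slates
```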

5.3. The PPO-Exploration Algorithm

Now we present our design to tackle these three challenges.

Figure 9. The Policy Network

State representation: The design of the policy network is shown in Figure 9. By definition, the state at each time step $t$ is $s_t = (u, q, \mathbf{o}, A_t)$, consisting of the user, the query, the candidate items, and the set of selected items. The policy processes the state into the input vector ((query, su, sc, $x_1$, $sg_1$), …, (query, su, sc, $x_n$, $sg_n$)), where su is the status of the user behavior sequences, sc denotes the features of candidate items, $x_i$ denotes the feature of the candidate item $i$, and $sg_i$ represents the information of selected items generated by the Sg cell. Then the policy outputs the weights of the candidate items, $(w_1, \dots, w_n)$, and samples an item from the softmax distribution, i.e., $\pi_\theta(a_t = i \mid s_t) = \exp(w_i) / \sum_{j=1}^{n} \exp(w_j)$. Note that during the training of the RL policy, the softmax distribution is used to sample actions and improve the exploration ability, while the policy selects the item with the maximum score in testing.
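The difference between action selection in training and in testing can be sketched as follows: sampling from the softmax over the output weights during training, and taking the highest-weight item in testing. The function names and the `rng` parameter are illustrative assumptions.

```python
import math
import random


def softmax(weights):
    m = max(weights)                                 # subtract max for stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def select_item(weights, training, rng=random):
    """Sample an index from the softmax in training, take the argmax in testing."""
    if training:
        probs = softmax(weights)
        return rng.choices(range(len(weights)), weights=probs, k=1)[0]
    return max(range(len(weights)), key=lambda i: weights[i])


weights = [0.1, 2.0, 0.3]
greedy_pick = select_item(weights, training=False)   # index of the largest weight
sampled_pick = select_item(weights, training=True, rng=random.Random(0))
```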

Now we introduce the details of how the Sg cell generates the information of selected items. Assume that there are $m$ features of each item, such as Brand, Price and Category. The Sg cell first encodes these features and gets $m$ encoding matrices, one per feature. The selected items are then represented by the encoding $e$ obtained from these matrices; in the implementation, the size of each encoding matrix is fixed in advance. Figure 10 shows an example of the encoding when some items have already been selected. Besides the encoding representation, the Sg cell also compares the encoding of each candidate item $i$ with the encoding of the selected items, $e$, to get the diversity information $d_i$. The diversity information provides effective signals for the RL algorithm to select items with new feature values. The output of the Sg cell for each item $i$ is the combination of both the encoding information of selected items and the comparison of the diversity of the item over selected items. That is, $sg_i = (e, d_i)$.

Figure 10. Sg cell
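One simple way to picture the diversity information is a per-feature indicator of whether a candidate brings a feature value that no selected item has yet. This is only a hypothetical stand-in for the comparison performed by the Sg cell, whose exact encoding is not reproduced here; the feature names and the dictionary-based item representation are assumptions.

```python
def diversity_info(candidate, selected, features=("brand", "category", "price_bucket")):
    """Per-feature novelty of a candidate relative to the selected items.

    Returns 1.0 for a feature whose value is new to the slate, else 0.0.
    Hypothetical simplification, not the Sg cell's actual encoding.
    """
    seen = {f: {item[f] for item in selected} for f in features}
    return [0.0 if candidate[f] in seen[f] else 1.0 for f in features]


selected = [{"brand": "A", "category": "watch", "price_bucket": 3}]
candidate = {"brand": "B", "category": "watch", "price_bucket": 3}
d = diversity_info(candidate, selected)    # only the brand is new
```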

Reward Design: We observe that directly training the RL policy to solve the MDP fails to learn a good generation policy, as the agent only receives the signal of the slate at the last step. As the objective function Eq.(3) is increasing in the conversion probability of every item, we choose the conversion probability of each picked item as the immediate reward, $r_t = p_{a_t}$, where $p_{a_t}$ is the conversion probability predicted by the Critic for the item picked at step $t$.

Efficient Exploration: To improve the exploration ability of RL algorithms, Bellemare et al. (2016) propose count-based exploration methods. Count-based exploration methods give an exploration bonus to encourage the RL algorithm to explore unseen states and actions: $r^{+}(s, a) = r(s, a) + \beta B(N(s, a))$, where $\beta$ is a positive constant, $N(s, a)$ counts the number of times the pair $(s, a)$ has been visited by the RL policy during training, and $B$ denotes the bonus function, which is decreasing in $N(s, a)$. However, it is impossible to apply the count-based exploration method to our problem, as the state space and the action space are both huge.
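For illustration, a tabular count-based bonus might look like the sketch below. The inverse square root is one common choice of a decreasing bonus function and is an assumption here, as are the class name and the value of beta. The table of counts grows with the number of distinct state-action pairs, which makes the limitation noted above concrete.

```python
import math
from collections import Counter


class CountBonus:
    """Adds beta / sqrt(N(s, a)) to the reward of each visited (state, action)."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = Counter()       # one entry per distinct (state, action)

    def reward(self, state, action, r):
        self.counts[(state, action)] += 1
        n_visits = self.counts[(state, action)]
        return r + self.beta / math.sqrt(n_visits)


bonus = CountBonus(beta=0.1)
first = bonus.reward("s0", 1, 0.0)    # first visit: full bonus
for _ in range(2):
    bonus.reward("s0", 1, 0.0)
fourth = bonus.reward("s0", 1, 0.0)   # fourth visit: half the bonus
```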

Pathak et al. (2017) propose to learn a model that predicts the state transition function, and use the prediction error as the exploration bonus, to motivate the policy to explore novel states that are the hardest for the model to predict. But the state transition in our setting is determined only by the policy, so the prediction error of the model is small for any state. Thus this method cannot be applied to the slate re-ranking problem.

Based on the intuition that improving the diversity of items helps to explore new slates, we propose a new exploration method for the slate generation, called PPO-Exploration. At each step, we set the norm of the diversity of the picked item over selected items, $\|d_{a_t}\|$, as the exploration bonus of the RL policy. That is,

$$r_t = p_{a_t} + \beta \, \|d_{a_t}\|, \tag{6}$$

where $p_{a_t}$ is the conversion probability of the item picked at step $t$, $d_{a_t}$ is its diversity information over the selected items, and $\beta$ is the factor of the exploration bonus. The new reward function Eq.(6) trades off between the conversion probability of the picked item at step $t$, $p_{a_t}$, and the degree of the diversity of the item over selected items, $\|d_{a_t}\|$. Now we propose the PPO-Exploration algorithm in Algorithm 1, which enables an efficient exploration and exploitation trade-off.

Initialize the actor network $\pi_\theta$, buffer $D$
for batch = 1,…,$M$ do
   Sample a batch $B$ from the log data; clear the buffer $D$
   for i = 1,…,$|B|$ do
      Initialize the state $s_1$ to be $(u_i, q_i, \mathbf{o}_i, \emptyset)$
      for t = 1,…,$k$ do
         Sample action $a_t$ according to the softmax distribution given by the current policy, $\pi_\theta(\cdot \mid s_t)$
         Execute action $a_t$, observe new state $s_{t+1}$
         The Sg cell outputs the diversity bonus, $\|d_{a_t}\|$
      end for
      The critic module evaluates the generated slate, giving $p_{a_1}, \dots, p_{a_k}$
      Initialize $R_{k+1} = 0$
      for t = $k$,…,1 do
         Set $r_t = p_{a_t} + \beta \|d_{a_t}\|$, and $R_t = r_t + \gamma R_{t+1}$
         Store transition $(s_t, a_t, R_t)$ in $D$
      end for
   end for
   Update the policy $\pi_\theta$ by minimizing the PPO loss on the buffer $D$
end for
Algorithm 1 The PPO-Exploration algorithm
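The backward pass over one generated slate in Algorithm 1, which combines the Critic's per-item probabilities with the diversity bonuses and accumulates returns from the last step to the first, can be sketched as follows. The function name and the discount factor default are assumptions; the default bonus factor of 1 matches the value reported in Section 6.1.

```python
def shaped_returns(item_probs, diversity_bonuses, beta=1.0, gamma=1.0):
    """Return-to-go for each step of one episode.

    item_probs[t]: conversion probability of the item picked at step t,
                   as evaluated by the Critic on the finished slate.
    diversity_bonuses[t]: norm of the diversity information at step t.
    gamma: discount factor (assumed value, not reported in the text).
    """
    k = len(item_probs)
    returns = [0.0] * k
    running = 0.0                          # return after the last step is 0
    for t in reversed(range(k)):
        reward = item_probs[t] + beta * diversity_bonuses[t]
        running = reward + gamma * running
        returns[t] = running
    return returns


# Step rewards are 1.1 and 0.2, so the returns are about 1.3 and 0.2.
returns = shaped_returns([0.1, 0.2], [1.0, 0.0])
```

Each stored return can then be paired with its state and action when the policy is updated on the buffer.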

6. Experiments

Now we design experiments to validate the power of the Generator and Critic approach. We aim to answer these questions:

(1) What is the comparison result of the Full Slate Critic model with other slate evaluation methods? What is the effect of each component in the FSC model?

(2) Can the FSC model characterize the mutual influences of items?

(3) How does the PPO-Exploration algorithm compare with state-of-the-art reinforcement learning algorithms such as PPO and Reinforce?

(4) Is the FSC model sufficiently correct to be used as the evaluation for the RL policy?

(5) Can the Generator and Critic approach be applied to online e-commerce?

6.1. Experimental Setup

Our dataset is obtained from one of the largest e-commerce websites in the world. For the FSC model, the training data contains 100 million samples (user, query, and top-10 list). We resample negative samples to raise the ratio of the number of positive samples (a positive sample is one in which at least one item is purchased) to the number of negative samples to 1/50. Each sample consists of 23 features about the historical record of each of the 10 items, 7 ID features of each of the 10 items, and the real-time features of the items the user viewed, clicked, and purchased in the same search session (category, brand, shop, seller, etc.). For each item in the slate, we set the label of the item to 1 if the item is purchased (pay) or added to cart (atc), and to 0 otherwise. To tackle the problem of imbalanced samples, we use a weighted sum of the cross-entropy losses of items, with weights pay (50), atc (4), click (1), and impression (0.05). The testing dataset for the critic model contains 3 million new samples. For the generator module, the number of samples in the training dataset is 10 million, and each sample contains (user, query, top-50 candidate items). In our experiment, we set n = 50 and k = 10. In the implementation of Algorithm 1, the batch size is 64 and the factor of the exploration bonus is 1.
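The weighted loss described above can be sketched per item as follows. The weights and the labeling rule come from the text; applying the weight according to the strongest behavior observed for the item is an assumption of this sketch, as are the names used.

```python
import math

# Loss weights per observed behavior, as listed in the setup above.
BEHAVIOR_WEIGHTS = {"pay": 50.0, "atc": 4.0, "click": 1.0, "impression": 0.05}


def weighted_item_loss(pred, behavior, eps=1e-12):
    """Weighted cross-entropy of one item.

    behavior: strongest observed behavior for the item (assumed convention).
    The label is 1 for pay or atc, and 0 otherwise.
    """
    label = 1.0 if behavior in ("pay", "atc") else 0.0
    p = min(max(pred, eps), 1.0 - eps)            # avoid log(0)
    ce = -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
    return BEHAVIOR_WEIGHTS[behavior] * ce


def slate_loss(preds, behaviors):
    return sum(weighted_item_loss(p, b) for p, b in zip(preds, behaviors))


loss = slate_loss([0.5, 0.5], ["pay", "impression"])   # (50 + 0.05) * log(2)
```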

6.2. Comparison Result of the FSC Model with Other Methods

We compare the FSC model with the LTR method and the DLCM method in terms of the AUC of three aspects: pv-pay, click-pay, and slate-pay. Pv-pay refers to whether an item is purchased given that it is impressed, and click-pay to whether an item is purchased given that it is clicked. Pv-pay and click-pay are both point-wise metrics. Slate-pay refers to whether at least one item of the slate is purchased given one impression of the whole slate. LTR(pv-pay) means training LTR on the same dataset as the FSC model, while LTR(click-pay) means training LTR on the same dataset restricted to clicked items. The labels of both LTR models are pay behaviors. The DLCM model is trained on the same dataset as the FSC model, except that the input of the DLCM model is 50 items rather than 10 items.

Model                      pv-pay AUC   click-pay AUC   slate-pay AUC
LTR(pv-pay)                0.861        0.757           0.800
LTR(click-pay)             0.855        0.785           0.809
DLCM                       0.862        0.764           0.812
DNN                        0.876        0.790           0.825
DNN+FCN                    0.878        0.791           0.826
DNN+FCN+PIN                0.884        0.796           0.827
DNN+FCN+PIN+Bi-GRU         0.887        0.799           0.831
DNN+FCN+PIN+Bi-GRU+SAN     0.896        0.813           0.841
Table 1. Comparisons of FSC with other models

The comparison results on the testing dataset are shown in Table 1. The last row “DNN+FCN+PIN+Bi-GRU+SAN” in Table 1 represents the FSC model. Results show that FSC significantly outperforms both LTR and DLCM in terms of both the slate-wise pay AUC and the point-wise item pay AUC. To analyze the effect of each component of the FSC model, we also train and test its variants, from “DNN” to “DNN+FCN+PIN+Bi-GRU”. “DNN” means that we exclude the four components discussed in Section 4.2. Note that “DNN” outperforms DLCM substantially; the improvement comes from the fact that “DNN” avoids the “impressed bias”, while DLCM takes 50 items as input rather than the 10 real impressed items. Comparing FSC with its variants, we find that each component helps to improve the slate-wise pay AUC. The most critical component is SAN, which shows that the FSC model successfully makes use of real-time user features and captures the real-time user behaviors.

6.3. Visualization of the Mutual Influence of Items in the FSC Model

Figure 11. Attention weight matrix in Pair Influence Net

Now we validate that the Full Slate Critic (FSC) model is able to characterize the influence of the position and the features of other items over each item. The visualization result of the Pair Influence Net is shown in Figure 11. We input a specific slate with 10 similar items to the model, where the only difference is the price: (4, 3, 5, 5, 5, 5, 5, 5, 4, 5). Then we plot the attention weight matrix as defined in Eq.(4). The x-axis denotes the items being influenced. The weights in the 2nd column are the largest among all columns, that is, the influence of other items over the 2nd item is the largest. This comparison shows that the FSC model is able to capture the effect of a distinct feature value on the user interest. It can also be observed that the weight between item 9 and item 2, 0.11, is less than the weight between item 1 and item 2, 0.12. Note that the prices of item 1 and item 9 are the same, so the result can be explained by the fact that the positions of item 1 and item 2 are much closer than the positions of item 9 and item 2. This validates that the FSC model also takes the positions of items into account.

6.4. Performance Comparison of PPO-Exploration with RL Algorithms

Algorithm                                  Replacement ratio
Reinforce(no Sg cell, real reward)         0.335
Reinforce(no Sg cell, model reward)        0.598
Reinforce(Sg cell, model reward)           0.77
PPO(Sg cell, model reward)                 0.784
PPO-Exploration(Sg cell, model reward)     0.819
Table 2. Comparison results of RL algorithms

We compare PPO-Exploration with the state-of-the-art RL algorithms Reinforce and PPO. Note that the GCR framework selects the slate with the larger score from two candidates: the original slate (the first 10 items of the top-50 candidate items) and the re-ranked slate. Thus we use the replacement ratio to evaluate each RL algorithm, which is the frequency with which the evaluation score of the slate generated by the RL algorithm is higher than that of the original slate during testing. Results are shown in Table 2. The replacement ratio of PPO-Exploration is significantly higher than that of both PPO and Reinforce, which confirms that the exploration method helps to improve the performance. We also do an ablation study. Comparing “Reinforce(no Sg cell, model reward)” with “Reinforce(Sg cell, model reward)” in Table 2 (0.598 vs. 0.77), we validate the critical effect on performance of the Sg cell, which generates the information of selected items. Directly training the Reinforce algorithm with real rewards (the immediate reward of each item is drawn from the historical dataset) achieves the worst performance. This is because the number of historical samples is limited and model-free RL algorithms suffer from high sample complexity.
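The replacement ratio as defined above is simply the fraction of test samples on which the re-ranked slate receives a strictly higher evaluation score than the original slate. A minimal sketch with made-up scores (the function name and the numbers are illustrative only):

```python
def replacement_ratio(original_scores, reranked_scores):
    """Fraction of samples where the re-ranked slate scores strictly higher."""
    wins = sum(
        1 for orig, new in zip(original_scores, reranked_scores) if new > orig
    )
    return wins / len(original_scores)


# Hypothetical Critic scores for four test queries.
ratio = replacement_ratio([0.1, 0.2, 0.3, 0.4], [0.2, 0.1, 0.5, 0.9])
```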

6.5. Validation of the Effectiveness of the FSC Model on the Slate Generation

As the FSC model is not perfect, one may doubt the correctness of using it to evaluate a slate generation policy. Now we validate that the FSC model can be used as the evaluation for the RL policy. We choose to apply the unbiased inverse propensity score (IPS) estimator (Swaminathan et al., 2017) to evaluate any RL policy with the real reward. Here the real reward of a slate is the number of items purchased according to the dataset, and the estimator relies on the probability that the slate generated by a policy is picked by that policy.

Algorithm IPS wIPS
Reinforce(no Sg cell, real reward)
Reinforce(Sg cell, model reward)
PPO-Exploration(Sg cell, model reward)
Table 3. The importance sampling evaluation of RL algorithms

We choose “Reinforce(no Sg cell, real reward)” in Table 2 as the base policy to evaluate other policies. We also take the weighted inverse propensity score (wIPS) estimator as an evaluation metric, which has lower variance than IPS and asymptotically zero bias. As shown in Table 3, the comparison results of the RL algorithms in terms of IPS and wIPS are consistent with those of the replacement ratio (evaluation results from the FSC model). That is, the FSC model is appropriate for evaluating the RL policy.
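For readers unfamiliar with these estimators, the following is a minimal sketch of the textbook IPS and wIPS forms. It is not necessarily the exact estimator used in the experiments. All names (`ips_and_wips`, `rewards`, `target_probs`, `base_probs`) are assumptions introduced for illustration, and the toy numbers are made up. Each logged slate contributes its real reward weighted by the ratio of the probability that the evaluated policy picks it to the probability that the base policy picks it. IPS averages over the number of samples, while wIPS normalizes by the sum of the weights.

```python
from typing import Sequence, Tuple


def ips_and_wips(rewards: Sequence[float],
                 target_probs: Sequence[float],
                 base_probs: Sequence[float]) -> Tuple[float, float]:
    """Return (IPS, wIPS) estimates of the evaluated policy's expected reward.

    rewards[i]      : real reward of logged slate i (e.g. items purchased)
    target_probs[i] : probability that the evaluated policy picks slate i
    base_probs[i]   : probability that the base (logging) policy picks slate i
    """
    n = len(rewards)
    if n == 0 or len(target_probs) != n or len(base_probs) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(q <= 0 for q in base_probs):
        raise ValueError("base policy probabilities must be positive")

    # Importance weight of each logged slate.
    weights = [p / q for p, q in zip(target_probs, base_probs)]
    weighted_reward = sum(w * r for w, r in zip(weights, rewards))
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("evaluated policy assigns zero probability to all logged slates")

    ips = weighted_reward / n              # unbiased, higher variance
    wips = weighted_reward / total_weight  # self-normalized, lower variance
    return ips, wips


# Toy logged data (illustrative values only).
toy_ips, toy_wips = ips_and_wips(
    rewards=[1, 0, 2, 0],
    target_probs=[0.4, 0.1, 0.2, 0.3],
    base_probs=[0.2, 0.2, 0.4, 0.2],
)
print(toy_ips, toy_wips)
```

Dividing by the sum of the weights instead of the sample count is what trades a small finite-sample bias for lower variance in wIPS.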

6.6. Live Experiments

Bucket brand(entropy) price(entropy)
Base 1.342 1.618
Test 1.471 1.630
Table 4. Comparisons on the diversity of slates

We apply the Generator and Critic approach to one of the largest e-commerce websites in the world. The Critic and Generator modules work as in Figure 2 in online experiments, and are updated weekly. During a one-month A/B test, the GCR approach improves the number of orders by 5.5%, GMV by 4.3%, and the conversion rate by 2.03% (the main metrics of ranking algorithms in e-commerce), compared with the base bucket. We also compare the entropy of the brand (price) distribution of slates from the experiment bucket and the base bucket. As shown in Table 4, the GCR approach also improves the diversity of slates in addition to their efficiency.
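As an illustration of the diversity metric, the sketch below computes the Shannon entropy of the brand distribution within one slate. Several details are assumptions, since they are not specified here: the natural logarithm is used, entropy is computed per slate, and the helper name `attribute_entropy` and the toy brand labels are hypothetical. For price, the values would presumably first be mapped to discrete buckets.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def attribute_entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of the empirical distribution of `values`."""
    if not values:
        raise ValueError("need at least one item")
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Toy slates of 10 items each (illustrative brand labels only).
toy_base_brands = ["A"] * 8 + ["B"] * 2
toy_test_brands = ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"]

toy_base_entropy = attribute_entropy(toy_base_brands)
toy_test_entropy = attribute_entropy(toy_test_brands)
print(toy_base_entropy, toy_test_entropy)
```

A slate dominated by one brand has entropy near zero, while a slate spread evenly over many brands has higher entropy, which is why a larger value in Table 4 is read as more diverse.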

7. Conclusion

In this paper, we propose the Generator and Critic approach to solve the main challenges in the slate re-ranking problem. For the Critic module, we present a Full Slate Critic model that avoids the “impressed bias” of previous slate evaluation models and outperforms other models significantly. For the Generator module, we propose a new model-based algorithm called PPO-Exploration. Results show that PPO-Exploration outperforms state-of-the-art RL methods substantially. We apply the GCR approach to one of the largest e-commerce websites, and the A/B test result shows that the GCR approach improves both the slate efficiency and diversity.

It is promising to apply the GCR approach to generate multiple slates with different objectives, and to output the final slate by some exploration method. For future work, it would be interesting to study how to combine the model-based reward output by the Critic model with the model-free real reward, in order to improve the performance of the PPO-Exploration algorithm.

References


  • R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong (2009) Diversifying search results. In Proceedings of the second ACM international conference on web search and data mining, pp. 5–14. Cited by: §1.
  • Q. Ai, K. Bi, J. Guo, and W. B. Croft (2018) Learning a deep listwise context model for ranking refinement. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 135–144. Cited by: §1, §4.1.
  • Q. Ai, X. Wang, N. Golbandi, M. Bendersky, and M. Najork (2019) Learning groupwise scoring functions using deep neural networks. Cited by: §1, §4.1.
  • A. Ashkan, B. Kveton, S. Berkovsky, and Z. Wen (2015) Optimal greedy diversity for recommendation. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §1.
  • X. Bai, J. Guan, and H. Wang (2019) A model-based reinforcement learning with adversarial training for online recommendation. In Advances in Neural Information Processing Systems, pp. 10734–10745. Cited by: §1.1.
  • M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In Advances in neural information processing systems, pp. 1471–1479. Cited by: §1, §5.3.
  • Q. Cai, A. Filos-Ratsikas, P. Tang, and Y. Zhang (2018a) Reinforcement mechanism design for e-commerce. In Proceedings of the 2018 World Wide Web Conference, pp. 1339–1348. Cited by: §1.
  • Q. Cai, A. Filos-Ratsikas, P. Tang, and Y. Zhang (2018b) Reinforcement mechanism design for fraudulent behaviour in e-commerce. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
  • J. Carbonell and J. Goldstein (1998) The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 335–336. Cited by: §1.
  • H. Chen, X. Dai, H. Cai, W. Zhang, X. Wang, R. Tang, Y. Zhang, and Y. Yu (2019a) Large-scale interactive recommendation with tree-structured policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3312–3320. Cited by: §1.1.
  • M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi (2019b) Top-k off-policy correction for a reinforce recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 456–464. Cited by: §1.1, §1, §2.2.
  • S. Chen, Y. Yu, Q. Da, J. Tan, H. Huang, and H. Tang (2018a) Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1187–1196. Cited by: §1.
  • X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song (2018b) Generative adversarial user model for reinforcement learning based recommendation system. arXiv preprint arXiv:1812.10613. Cited by: §1.1, §1.
  • M. Deisenroth and C. E. Rasmussen (2011) PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp. 465–472. Cited by: §1.
  • Y. Gong, Y. Zhu, L. Duan, Q. Liu, Z. Guan, F. Sun, W. Ou, and K. Q. Zhu (2019) Exact-k recommendation via maximal clique optimization. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 617–626. Cited by: §1.1.
  • Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu (2018) Reinforcement learning to rank in e-commerce search engine: formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 368–377. Cited by: §1.1, §1.
  • R. Jiang, S. Gowal, T. A. Mann, and D. J. Rezende (2018) Beyond greedy ranking: slate optimization via list-cvae. arXiv preprint arXiv:1803.01682. Cited by: §1.1.
  • S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.1.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §2.2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.1, §2.2.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17. Cited by: §5.3.
  • C. Pei, Y. Zhang, Y. Zhang, F. Sun, X. Lin, H. Sun, J. Wu, P. Jiang, J. Ge, W. Ou, et al. (2019) Personalized re-ranking for recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 3–11. Cited by: §1.
  • R. L. Santos, C. Macdonald, and I. Ounis (2010) Exploiting query reformulations for web search result diversification. In Proceedings of the 19th international conference on World wide web, pp. 881–890. Cited by: §1.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §2.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §2.2.
  • M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE transactions on Signal Processing 45 (11), pp. 2673–2681. Cited by: §1.
  • J. Shi, Y. Yu, Q. Da, S. Chen, and A. Zeng (2019) Virtual-taobao: virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4902–4909. Cited by: §1.1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1, §2.2.
  • A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudik, J. Langford, D. Jose, and I. Zitouni (2017) Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems, pp. 3632–3642. Cited by: §6.5.
  • R. Takanobu, T. Zhuang, M. Huang, J. Feng, H. Tang, and B. Zheng (2019) Aggregating e-commerce search results from heterogeneous sources via hierarchical reinforcement learning. In The World Wide Web Conference, pp. 1771–1781. Cited by: §1.1.
  • F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §1.
  • M. Wilhelm, A. Ramanathan, A. Bonomo, S. Jain, E. H. Chi, and J. Gillenwater (2018) Practical diversified recommendations on youtube with determinantal point processes. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2165–2173. Cited by: §1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §2.2.
  • R. Zhang, T. Yu, Y. Shen, H. Jin, and C. Chen (2019) Text-based interactive recommendation via constraint-augmented reinforcement learning. In Advances in Neural Information Processing Systems, pp. 15188–15198. Cited by: §1.1.
  • X. Zhao, L. Xia, L. Zhang, Z. Ding, D. Yin, and J. Tang (2018) Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 95–103. Cited by: §1.
  • G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li (2018) DRN: a deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pp. 167–176. Cited by: §1.
  • T. Zhuang, W. Ou, and Z. Wang (2018) Globally optimized mutual influence aware ranking in e-commerce search. arXiv preprint arXiv:1805.08524. Cited by: §1, §4.1.
  • L. Zou, L. Xia, Z. Ding, J. Song, W. Liu, and D. Yin (2019) Reinforcement learning to optimize long-term user engagement in recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2810–2818. Cited by: §1.1.