1. Introduction
In a typical ecommerce website, when a user searches for a keyword, the website returns a list of items ranked by the ranking algorithm. This ranking is usually produced by predicting scores for user-item pairs and sorting items by these pointwise scores. However, pointwise ranking does not consider the mutual influences between items on one page, and tends to return the similar items with the highest scores. Figure 1 gives an example: a buyer searches for "smart watch" in an ecommerce app, and the app returns a list of similar watches, which decreases both user satisfaction and the efficiency of the list.
To improve the diversity of the list, a series of research works, MMR, IASelect, xQuAD and DUM (Agrawal et al., 2009; Ashkan et al., 2015; Carbonell and Goldstein, 1998; Santos et al., 2010), propose to rank items by weighted functions that trade off the user-item scores and the diversities of items. However, these methods ignore the impact of diversity on the efficiency of the list. Wilhelm et al. (Wilhelm et al., 2018) present a deep determinantal point process model to unify the user-item scores and the distances between items, and devise a practical personalized approach. Nevertheless, the determinantal point process uses separate structures to model the user-item scores and the diversities of items, which limits its ability to represent the user's interest over the slate.
(Ai et al., 2018, 2019; Zhuang et al., 2018) propose models that input the $n$ pointwise ranked items and output refined scores, by adding intra-item features to the model. This approach selects the top $k$ items from the $n$ refined scores, and is commonly called "Slate Reranking" or "Post Ranking". (Pei et al., 2019) employ the efficient Transformer structure and introduce personalized user encoding into the structure, compared with the RNN structure in two classical reranking methods, DLCM (Ai et al., 2018) and GlobalRank (Zhuang et al., 2018). However, the main concern with this approach is that only the top $k$ items are actually shown (impressed) to the user in reality. Thus these models cannot reflect the true preference of the user over the slate, which causes the "impressed bias".
Indeed, the slate reranking problem presents two main challenges:
(1) The evaluation of a slate is difficult, as it is crucial for a model to tackle the "position bias", the "impressed bias", and the mutual influences of items, where the "position bias" refers to the impact of an item's position on the user's preference.
(2) Given an evaluation model for any top-$k$ slate, the number of possible top-$k$ slates is extremely large: selecting and ordering $k$ of $n$ items yields $n!/(n-k)!$ candidate slates.
To solve these two challenges, we propose a novel Generator and Critic slate reranking approach, shown in Figure 2. The Full Slate Critic module evaluates any candidate top-$k$ list. The Generator learns the optimal top-$k$ list by reinforcement learning algorithms (Sutton and Barto, 2018). That is, the Generator module is the reinforcement learning agent and the Critic module is the environment. The environment is defined as follows: at each episode (a user and a query come), the action of the Generator at each step is to pick an item from the candidate items, and the Critic returns the evaluation score (reward) of the whole slate after the episode ends, i.e., after $k$ items have been selected. At each step, the Generator follows a reinforcement learning policy that inputs the state (the user, the query, the features of the items, and the selected items) and outputs an action (the next item to be added to the list).
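The episode protocol just described can be summarized in a minimal sketch. The `policy` and `critic` callables below are toy stand-ins for the Generator and Critic modules (their names and signatures are hypothetical, not from the paper):

```python
def run_episode(candidates, policy, critic, k=10):
    """One episode of the Generator-Critic interaction: the agent picks
    k items one by one; the Critic scores the finished slate (the reward
    arrives only after the whole slate is generated)."""
    selected = []
    remaining = list(candidates)
    for _ in range(k):
        # The policy maps (selected, remaining) to the index of the next item.
        idx = policy(selected, remaining)
        selected.append(remaining.pop(idx))
    reward = critic(selected)
    return selected, reward

# Toy stand-ins: a policy that always picks the first remaining item,
# and a critic that rewards slates covering diverse "categories".
greedy_policy = lambda sel, rem: 0
toy_critic = lambda slate: len({c for _, c in slate}) / len(slate)

items = [(i, cat) for i, cat in enumerate("AABBCCDDEE")]
slate, r = run_episode(items, greedy_policy, toy_critic, k=4)
```

Note that the reward is delayed until the slate is complete, which is the sparse-reward issue the paper addresses later with reward shaping.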
For the Critic module, to tackle the exact evaluation challenge of the slate, we present a Full Slate Critic (FSC) model. The FSC model inputs the features of the $k$ impressed items, rather than all $n$ items, which avoids the "impressed bias". The model consists of four components: the Su Attention Net, the Pair Influence Net, the BiGRU Net and the Feature Compare Net. The Su Attention Net encodes the real-time impressed items and the click and pay behaviors of the user in the same search session, to capture the real-time interest of the user. The Pair Influence Net calculates the pairwise influences of other items on each item and aggregates the influences by the attention method (Wang et al., 2017), which models the mutual influences between items. The BiGRU Net uses a Bidirectional GRU structure (Schuster and Paliwal, 1997) to capture the influence of the nearby items on each item, and is designed to reduce the "position bias". The Feature Compare Net compares the feature values of each item with those of the other items, and represents the impact of the differences in feature values on the user's interest.
For the Generator module, we embrace the recent advancements in combining deep reinforcement learning with ecommerce (Cai et al., 2018a, b; Chen et al., 2019b, 2018a, 2018b; Hu et al., 2018; Zhao et al., 2018; Zheng et al., 2018). Chen et al. (Chen et al., 2019b) propose a top-$k$ off-policy correction method that uses a reinforcement learning (RL) policy to generate the top-$k$ list, and learns the policy from real rewards. However, this approach is not practical as it costs millions of samples to train a good policy in a model-free reinforcement learning style, and an ecommerce site cannot afford the exploration risk. As model-based RL methods are significantly more sample-efficient than model-free RL methods (Deisenroth and Rasmussen, 2011; Levine et al., 2016), we train the RL policy with model-based samples, which are generated by the RL policy and evaluated by the Critic module. We present a new RL method, called PPOExploration, which builds on the state-of-the-art reinforcement learning method, proximal policy optimization (PPO) (Schulman et al., 2017). PPOExploration drives the policy to generate diverse items by adding a diversity score to the reward function at each step. The diversity score is defined as the distance between the currently picked item and the already selected items, and serves as an exploration bonus (Bellemare et al., 2016).
To summarize, the main contributions of this paper are as follows:
1. We present a new Generator and Critic slate reranking approach to tackle the two challenges: the exact evaluation of slates and the efficient generation of optimal slates.
2. We design a Full Slate Critic (FSC) model that avoids the "impressed bias" present in current methods and better represents the mutual influences between items. Experimental results show that the FSC model substantially outperforms both the pointwise Learning to Rank (LTR) and the DLCM method. We show that the FSC model is sufficiently accurate to evaluate any slate generation policy.
3. We propose a new model-based RL method for generating slates efficiently, called PPOExploration, which encourages the policy to pick diverse items during slate generation. Experimental results show that PPOExploration performs significantly better than the Reinforce algorithm and the PPO algorithm. We also validate the effectiveness of the model-based method compared with the model-free method that trains the RL policy with real rewards.
4. The Generator and Critic approach has been successfully applied in one of the largest ecommerce websites in the world, and improves the number of orders by 5% and GMV by 4% during the A/B test. The diversity of slates is also improved.
1.1. Related Work
Deep Reinforcement Learning (DRL) has recently achieved cutting-edge progress in Atari games (Mnih et al., 2015) and continuous control tasks (Lillicrap et al., 2015). Motivated by this success, applying DRL to recommendation and search tasks has become a hot research topic (Bai et al., 2019; Chen et al., 2019a, b, 2018b; Gong et al., 2019; Shi et al., 2019; Takanobu et al., 2019; Zhang et al., 2019; Zou et al., 2019). The most representative work in this research line is (Hu et al., 2018), where Hu et al. apply reinforcement learning to the Learning to Rank task in ecommerce.
The main difference between our work and previous works is that we do not solve the MDP in which the agent is the recommendation system and the action at each state is to recommend items to the user. In our MDP, a new user and a new query arrive at the initial state of each episode, and the action at each step is to pick one item from the LTR-ranked candidate items. To the best of our knowledge, we are among the first to apply reinforcement learning to the slate reranking problem. We also build a more precise evaluation model of the slate, which enables efficient model-based reinforcement learning.
Different from slate reranking works that focus on the refined scores of items, Jiang et al. (Jiang et al., 2018) propose ListCVAE to pick items by item embedding. ListCVAE uses a deep generative model to learn desired embeddings, and picks the items closest to each desired embedding from the original items. However, this approach cannot guarantee the efficiency of the output list, as there may not exist items that are close enough to the desired embeddings.
2. Preliminary
In this section we introduce the basic concepts of slate reranking in ecommerce, and of reinforcement learning.
2.1. Slate Reranking
When a user $u$ searches a query $q$, the ranking system returns $n$ items $(x_1, \dots, x_n)$ sorted by the $n$ user-item pair scores.
Let $f$ denote the slate reranking function. Given the query, the user feature, and the $n$ items as input, $f$ outputs a top-$k$ list $(x_{i_1}, \dots, x_{i_k})$.
Given an estimation function $R(q, u, L)$, the score of a top-$k$ list $L$ with the query $q$ and the user $u$, the objective of slate reranking is to find a slate reranking function $f$ that maximizes the total user satisfaction over $m$ queries, $\sum_{j=1}^{m} R(q_j, u_j, f(q_j, u_j))$.

2.2. Reinforcement Learning
Reinforcement learning focuses on solving the Markov Decision Process problem (Sutton and Barto, 2018). In a Markov Decision Process (MDP), the agent observes a state $s$ in the state space $\mathcal{S}$, plays an action $a$ in the action space $\mathcal{A}$, and receives the immediate reward $r(s, a)$ from the environment. After the agent plays the action, the environment changes the state according to the transition distribution $P(s'|s, a)$, which represents the probability of the next state $s'$ given the current state $s$ and action $a$. The policy $\pi_{\theta}(a|s)$ outputs an action distribution for any state $s$, and is parameterized by $\theta$. The objective of the agent is to find $\theta$ maximizing the expected discounted long-term reward of the policy: $J(\theta) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\big]$, where $s_0$ denotes the initial state and $\gamma$ denotes the discount factor.

The well-known DQN algorithm (Mnih et al., 2015) applies a maximum operator over the action space, which is not suitable for the slate reranking problem with its large action space. Thus in this paper we focus on stochastic reinforcement learning algorithms. The Reinforce algorithm (Williams, 1992) optimizes a stochastic policy by the gradient of the objective with respect to the policy, and is applied in (Chen et al., 2019b): $\nabla_{\theta} J(\theta) = \mathbb{E}\big[\sum_{t} G_t \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\big]$, where $G_t$ denotes the discounted rewards from the $t$-th step. However, the Reinforce algorithm suffers from high variance, as it directly uses Monte-Carlo sampling to estimate the long-term return. Thus actor-critic algorithms such as A2C (Mnih et al., 2016) choose to update the policy by

$$\nabla_{\theta} J(\theta) = \mathbb{E}\big[\sum_{t} A(s_t, a_t)\, \nabla_{\theta} \log \pi_{\theta}(a_t|s_t)\big], \quad (1)$$

where $A(s, a)$ is the advantage function, defined as the difference between the action-value function $Q(s, a)$ and the value function $V(s)$. But the policy gradient of Eq.(1) may incur large updates of the policy, that is, the performance of the policy is unstable during training. Schulman et al. (Schulman et al., 2017) propose the proximal policy optimization (PPO) algorithm, which optimizes a clipped surrogate objective to stabilize the policy update:

$$L^{CLIP}(\theta) = \mathbb{E}_{t}\big[\min\big(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\big)\big], \quad (2)$$

where $r_t(\theta) = \pi_{\theta}(a_t|s_t) / \pi_{\theta_{\mathrm{old}}}(a_t|s_t)$. The PPO algorithm enables efficient computation compared with the trust-region policy optimization method (Schulman et al., 2015), and achieves state-of-the-art performance on standard control tasks. In this paper we adapt the principle of the PPO algorithm to train the slate generation policy.
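The clipped surrogate of Eq.(2) is easy to compute per sample. A minimal sketch (the function name is ours, not from any library) of the ratio clipping that bounds the size of a policy update:

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate of Eq.(2):
    min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r is the probability ratio pi_new / pi_old."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds $1+\epsilon$; with a negative advantage, it stops shrinking below $1-\epsilon$, so the incentive to move far from the old policy vanishes in both directions.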
3. The GCR Model: The Generator and Critic Ranking Approach
In this section we introduce the Generator and Critic ranking framework, called the GCR model. As shown in Figure 2, the GCR model works as follows:

1. When a user query comes to the system, the LTR model sends $n$ candidate items to the GCR model.

2. The Generator module picks the top $k$ items from the $n$ items by the reinforcement learning policy.

3. The Critic module evaluates both the picked top-$k$ slate and the original top-$k$ slate.

4. The GCR model chooses the slate with the larger score from the two candidate slates, the original slate and the reranked slate, and exposes it to the user.
The Critic module is trained with real user feedback on the slate, and serves as the environment. The Generator module, i.e., the agent, generates slates, receives the evaluation scores of slates from the Critic, and updates the reinforcement learning policy based on these evaluation scores.
4. The Critic Module
In this section we present the Full Slate Critic model that avoids the “impressed bias”, reduces the “position bias”, and handles the mutual influences of items precisely.
4.1. The Bias of Slate Evaluation
First of all, we discuss the limitation of previous works on slate evaluation. We claim that it is crucial for a slate evaluation model to consider the exact items that are impressed to users in reality.
As shown in Figure 3, given $n$ candidate items, the slate reranking process outputs $k$ items to the user. In most ecommerce websites, $k$ is equal to 10. (Ai et al., 2018, 2019; Zhuang et al., 2018) propose models that input the $n$ items and output $n$ refined scores by considering features between items. But only the top $k$ items are shown to the user, so an evaluation model built upon $n$ items cannot exactly reflect the true user preference over any real impressed slate. That is, there exists an "impressed bias". Also, picking the top $k$ items from refined scores limits the search space of the slate generation.
4.2. The Full Slate Critic Model
Now we introduce the Critic module, named the Full Slate Critic (FSC) model. The FSC model takes as input the user behavior sequences in one search session, the features of the $k$ items, and the query, and outputs the predicted conversion probabilities of the $k$ items. For simplicity, we let $p_i$ denote the predicted conversion probability of item $i$.
As the FSC model already considers the influence of other items over each item when predicting the score of each item, we assume that the conversion probabilities of items are independent. Then the conversion probability of the whole slate (at least one item in the slate is purchased) predicted by the model is:
$$p(\text{slate}) = 1 - \prod_{i=1}^{k} (1 - p_i) \quad (3)$$
The objective of the model is to minimize the cross entropy between the predicted scores and the true labels of the items, $\mathcal{L} = -\sum_{i=1}^{k} \big[y_i \log p_i + (1 - y_i) \log(1 - p_i)\big]$, where $y_i$ is the label of item $i$.
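Eq.(3) and the per-item cross entropy can be sketched directly; the function names below are ours, not from the paper:

```python
import math

def slate_conversion_prob(item_probs):
    """Eq.(3): probability that at least one item in the slate is purchased,
    assuming independent per-item conversion probabilities."""
    prod = 1.0
    for p in item_probs:
        prod *= 1.0 - p
    return 1.0 - prod

def slate_cross_entropy(item_probs, labels):
    """Training loss of the Critic: sum of per-item binary cross-entropies
    between predicted conversion probabilities and 0/1 labels."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(item_probs, labels))
```

Note that `slate_conversion_prob` is monotonically increasing in every $p_i$, which is the property the Generator's reward shaping relies on later.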
The architecture of the Full Slate Critic (FSC) model has four parts: the Su Attention Net, the Pair Influence Net, the BiGRU Net and the Feature Compare Net, shown in Figure 4.
Su Attention Net: The samples are first processed by this net, where the static ID features (Item ID, Category ID, Brand ID, Seller ID) and statistics (CVR, CTR) of the 10 items are concatenated with the processed real-time user behavior sequences.
The detail of the Su Attention Net is shown in Figure 5. The impact of the products exposed in the same search session and the real-time behavior of the user on the current product is aggregated using the attention technique. We separately process the impression (Pv), click, add-to-cart (Atc), and purchase (Pay) item feature lists to get the Pv influence embedding, Click influence embedding, Atc influence embedding, and Pay influence embedding respectively. These embeddings are then concatenated with the 10 item features for use by subsequent networks.
Pair Influence Net: We use the Pair Influence Net to model the mutual influences between items inside one slate. For example, if the price of one item among 10 otherwise similar items is higher than that of the other 9 items, the probability of this item being purchased will decrease. For each item $i$, we first obtain the pairwise influence of every other item on this item, which models the impact of both features and relative positions. Let $e_{ij}$ denote the influence of item $j$ on item $i$. The total influence of the other items on item $i$ is represented by aggregating the pairwise influences with the attention technique:
$$\mathrm{PI}_i = \sum_{j \neq i} w_{ij}\, e_{ij}, \quad (4)$$
where $w_{ij}$ denotes the attention weight of the pair $(i, j)$ and is obtained by a softmax distribution. The detailed structure of the Pair Influence Net is shown in Figure 6.
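The aggregation of Eq.(4) can be sketched as follows, with the softmax taken over unnormalized pair scores; the exact form of the scores and influence vectors is learned in the paper, so the inputs here are placeholders:

```python
import math

def aggregate_pair_influence(influences, logits):
    """Eq.(4) sketch: the total influence on item i aggregates the pairwise
    influence vectors e_ij with softmax attention weights w_ij.
    `influences[j]` is the vector e_ij; `logits[j]` is the unnormalized
    attention score of the pair (i, j)."""
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]          # softmax over the pairs
    dim = len(influences[0])
    return [sum(w * e[d] for w, e in zip(weights, influences))
            for d in range(dim)]
```

With equal logits this reduces to a plain average; a large logit concentrates the aggregate on the corresponding pair, which is how the net can single out one dominant influence (e.g., the distinctly cheaper item in Section 6.3).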
BiGRU Net:
The position bias of items is handled by the BiGRU Net, which uses a Bidirectional Gate Recurrent Unit (GRU) to model the influence of nearby items on each item, as shown in Figure 7. The position influence of item $i$ is the concatenation of the outputs of the $i$-th forward GRU and the $i$-th backward GRU.

Feature Compare Net: The features of items fall into two categories, discrete ID-type features (item ID, category ID, brand ID, etc.) and real-value features (statistics, user preference scores, etc.). The Feature Compare Net in Figure 8 takes these feature values as input and outputs, for each item and each feature, the comparison of its value with the values of the other items. The Feature Compare Net enables more efficient structures for slate evaluation, as it directly encodes the differences between items in the dimension of feature values.
5. The Generator Module
In this section, we adapt reinforcement learning methods to search for the optimal slate given the above Full Slate Critic model. We first formally define the slate reranking MDP, discuss three main challenges of this MDP, and propose the PPOExploration algorithm that tackles these challenges.
5.1. The Slate Reranking MDP
Recall the slate reranking function $f$ defined in Section 2. We choose to generate the slate by picking items one by one from the candidate list $(x_1, \dots, x_n)$. Formally, the slate reranking MDP is defined as follows:
State: There are $k$ steps in one episode, and the state at each step $t$ is defined as a tuple $s_t = (q, u, (x_1, \dots, x_n), H_t)$, where $H_t$ denotes the set of items selected before the $t$-th step.
Initial state: At the first step of each episode, a query $q$ and a user $u$ are sampled from a fixed distribution, then the LTR model generates the $n$ candidate items. The initial state is the tuple $s_1 = (q, u, (x_1, \dots, x_n), H_1)$, where $H_1$ is empty.
Action: The action of the policy at each state is the index of a candidate item, denoted by $a_t \in \{1, \dots, n\}$.
State transition: The state transition is defined as

$$s_{t+1} = (q, u, (x_1, \dots, x_n), H_{t+1}), \qquad H_{t+1} = H_t \cup \{a_t\}, \quad (5)$$

where the set $H_{t+1}$ represents the set of items selected before the $(t+1)$-th step.
Reward: The objective of the Generator is to maximize the conversion probability of the whole slate, Eq.(3). The immediate reward is 0 before the whole slate is generated, and equals the slate conversion probability of Eq.(3) at the last step of the episode.
Done: Each episode ends after $k$ steps.
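The MDP above can be sketched as a tiny environment class. The class and method names are ours (loosely following the common `step` convention of RL environments), and the critic is any callable that scores a finished slate:

```python
class SlateRerankMDP:
    """Minimal sketch of the slate reranking MDP of Section 5.1. The state
    is (query, user, candidates, selected); an action is the index of a
    not-yet-selected candidate; the Critic's slate score is the terminal
    reward (sparse: zero before the episode ends)."""

    def __init__(self, query, user, candidates, critic, k=10):
        self.query, self.user = query, user
        self.candidates, self.critic, self.k = candidates, critic, k
        self.selected = []

    def state(self):
        return (self.query, self.user, tuple(self.candidates),
                tuple(self.selected))

    def step(self, action):
        assert action not in self.selected, "each item is picked at most once"
        self.selected.append(action)
        done = len(self.selected) == self.k
        slate = [self.candidates[i] for i in self.selected]
        reward = self.critic(slate) if done else 0.0
        return self.state(), reward, done

# Toy usage: 3 candidates, k = 2, and `len` standing in for the Critic.
env = SlateRerankMDP("smart watch", "user_42", ["w1", "w2", "w3"],
                     critic=len, k=2)
```

The sparse terminal reward visible here is exactly the issue motivating the reward design of Section 5.3.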
5.2. Challenges of the Slate Reranking MDP
There are three challenges to solve the slate reranking MDP:
State representation: Modeling the influence of selected items over the remaining items is critical for the RL policy to pick up diverse items in subsequent steps.
Sparse reward: By the definition of the slate reranking MDP, the reward is zero except at the last step, which prevents RL algorithms from learning better slates. That is, to learn better policies, a more appropriate reward function is required.
Efficient exploration: The number of possible slates is extremely large. Although the PPO algorithm enables stable policy updates, it tends to get stuck in local optima and does not explore new slates. Thus it is vital to improve the exploration ability of the RL algorithms.
5.3. The PPOExploration Algorithm
Now we present our design to tackle these three challenges.
State representation: The design of the policy network is shown in Figure 9. By definition, the state at time step $t$ is $s_t = (q, u, (x_1, \dots, x_n), H_t)$. The policy processes the state into the vector ((query, su, sc, $x_1$, $sg_1$), …, (query, su, sc, $x_n$, $sg_n$)), where su is the status of the user behavior sequences, sc denotes the features of the candidate items, $x_i$ denotes the features of the $i$-th candidate item, and $sg_i$ represents the information of the selected items generated by the Sg cell. The policy then outputs weights of the candidate items and samples an item from the softmax distribution over these weights. Note that during the training of the RL policy, the softmax distribution is used to sample actions and improve exploration, while in testing the policy selects the item with the maximum score.

Now we introduce the details of generating the information of selected items by the Sg cell. Assume each item has several features, such as Brand, Price and Category. The Sg cell first encodes these features to get an encoding matrix, and the encoding of the selected items is represented by aggregating the encodings of the items in $H_t$. In the implementation, we choose an encoding matrix of fixed size. Figure 10 shows an example of the encoding in the case that several items have already been selected. Besides the encoding representation, the Sg cell also compares the encoding of each candidate item with the encoding of the selected items, to get the diversity information $d_i$. The diversity information provides effective signals for the RL algorithm to select items with new feature values. The output of the Sg cell for each item $i$ is the combination of the encoding information of the selected items and the comparison of the diversity of the item with respect to the selected items.
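A possible sketch of the Sg cell's two outputs, assuming (hypothetically, since the paper's exact aggregation is not spelled out here) that the selected items are summarized by the mean of their feature encodings and that diversity is the element-wise difference from that summary:

```python
def sg_cell(candidate_enc, selected_encs):
    """Hypothetical Sg-cell sketch: summarize the already selected items by
    the mean of their feature encodings, and compute a diversity signal as
    the element-wise difference between the candidate's encoding and that
    summary. Returns the concatenation fed to the policy network."""
    if not selected_encs:
        summary = [0.0] * len(candidate_enc)     # nothing selected yet
    else:
        summary = [sum(col) / len(selected_encs)
                   for col in zip(*selected_encs)]
    diversity = [c - s for c, s in zip(candidate_enc, summary)]
    return summary + diversity
```

A candidate whose encoding differs strongly from the summary yields a large-magnitude diversity vector, which is the signal the exploration bonus of Section 5.3 builds on.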
Reward Design: We observe that directly training the RL policy on this MDP fails to learn a good generation policy, as the agent only receives the signal of the slate at the last step. Since the objective function Eq.(3) is increasing in the conversion probability of every item, we choose the conversion probability of each picked item as the immediate reward.
Efficient Exploration: To improve the exploration ability of RL algorithms, (Bellemare et al., 2016) propose count-based exploration methods, which add an exploration bonus to encourage the RL algorithm to explore unseen states and actions: $r^{+}(s, a) = r(s, a) + \beta\, B(N(s, a))$, where $\beta$ is a positive constant, $N(s, a)$ counts the number of times the pair $(s, a)$ has been visited by the RL policy during training, and $B$ denotes the bonus function, which is decreasing in $N(s, a)$. However, it is impossible to apply the count-based exploration method to our problem, as the state space and the action space are very large.
(Pathak et al., 2017) propose to learn a model that predicts the state transition function, and use the prediction error as the exploration bonus, to motivate the policy to explore novel states that are the hardest for the model to predict. But since the state transition in our setting is determined entirely by the policy, the prediction error of the model is small for any state. Thus this method does not apply to the slate reranking problem either.
Based on the intuition that improving the diversity of items helps to explore new slates, we propose a new exploration method for slate generation, called PPOExploration. At each step, we set the norm of the diversity of the picked item with respect to the selected items as the exploration bonus of the RL policy. That is,

$$r_t = p_{a_t} + \beta\, \|d_{a_t}\|, \quad (6)$$

where $p_{a_t}$ is the predicted conversion probability of the item picked at step $t$ and $d_{a_t}$ is its diversity information with respect to the selected items. The new reward function Eq.(6) trades off between the conversion probability of the picked item at step $t$ and the degree of diversity of the item with respect to the selected items. We summarize the PPOExploration algorithm in Algorithm 1, which enables an efficient exploration and exploitation trade-off.
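The shaped per-step reward of Eq.(6) is then simply the item's predicted conversion probability plus a norm-scaled diversity bonus; a minimal sketch (the function name is ours):

```python
import math

def shaped_reward(item_prob, diversity_vec, beta=1.0):
    """Eq.(6) sketch: immediate reward = predicted conversion probability of
    the picked item plus an exploration bonus proportional to the Euclidean
    norm of its diversity vector w.r.t. the selected items. `beta` weighs
    the bonus against the conversion probability."""
    bonus = math.sqrt(sum(d * d for d in diversity_vec))
    return item_prob + beta * bonus
```

With `beta = 0` this degenerates to pure reward shaping on conversion probability; larger `beta` pushes the policy toward items whose features differ from what is already in the slate.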
6. Experiments
Now we design experiments to validate the power of the Generator and Critic approach. We aim to answer these questions:
(1) What is the comparison result of the Full Slate Critic model with other slate evaluation methods? What is the effect of each component in the FSC model?
(2) Can the FSC model characterize the mutual influences of items?
(3) How does the PPOExploration algorithm compare with state-of-the-art reinforcement learning algorithms such as PPO and Reinforce?
(4) Is the FSC model sufficiently accurate to be used as the evaluation for the RL policy?
(5) Can the Generator and Critic approach be applied to online ecommerce?
6.1. Experimental Setup
Our dataset is obtained from one of the largest ecommerce websites in the world. For the FSC model, the training data contains 100 million samples (user, query, and top-10 list). We resample negative samples to improve the ratio of the number of positive samples (samples in which at least one item is purchased) to the number of negative samples to 1/50. Each sample consists of 23 features on the historical record of each of the 10 items, 7 ID features of each of the 10 items, and the real-time features of the items the user viewed, clicked, and purchased in the same search session (category, brand, shop, seller, etc.). For each item in the slate, we set the label of the item to 1 if the item is purchased (pay) or added to cart (atc), and 0 otherwise. To tackle the problem of imbalanced samples, we use a weighted sum of the cross-entropy losses of items, with weights pay (50), atc (4), click (1), and impression (0.05). The testing dataset for the Critic model contains 3 million new samples. For the Generator module, the number of samples in the training dataset is 10 million, and each sample contains (user, query, top-50 candidate items). In our experiment, we set $n = 50$ and $k = 10$. In the implementation of Algorithm 1, the batch size is 64 and the factor of the exploration bonus is 1.
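The weighted cross-entropy described above can be sketched as follows; the behavior names and the helper itself are ours, while the weights pay (50), atc (4), click (1), impression (0.05) come from the setup:

```python
import math

# Per-behavior weights from the experimental setup.
WEIGHTS = {"pay": 50.0, "atc": 4.0, "click": 1.0, "impression": 0.05}

def weighted_slate_loss(item_probs, behaviors):
    """Weighted binary cross-entropy over the items of one slate: positive
    items (pay / atc, label 1) get large weights to counter the heavy label
    imbalance; clicks and plain impressions are negatives (label 0)."""
    loss = 0.0
    for p, b in zip(item_probs, behaviors):
        w = WEIGHTS[b]
        y = 1.0 if b in ("pay", "atc") else 0.0
        loss += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss
```

A purchased item thus contributes 50 times the gradient signal of a clicked-but-not-bought item at the same predicted probability.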
6.2. Comparison Result of the FSC Model with Other Methods
We compare the FSC model with the LTR method and the DLCM method in terms of the AUC of three metrics: pv-pay, click-pay, and slate-pay. The pv-pay metric refers to purchased or not given the item impression; click-pay denotes purchased or not given the item click. Pv-pay and click-pay are both pointwise metrics. The slate-pay metric means at least one item of the slate is purchased or not, given one impression of the whole slate. LTR (pv-pay) means training LTR on the same dataset as the FSC model, while LTR (click-pay) denotes training LTR on the subset of that dataset that only contains clicked items. The labels of both LTR models are pay behaviors. The DLCM model is trained on the same dataset as the FSC model, except that the input of the DLCM model is 50 items rather than 10 items.
| Model | pv-pay AUC | click-pay AUC | slate-pay AUC |
| --- | --- | --- | --- |
| LTR (pv-pay) | 0.861 | 0.757 | 0.800 |
| LTR (click-pay) | 0.855 | 0.785 | 0.809 |
| DLCM | 0.862 | 0.764 | 0.812 |
| DNN | 0.876 | 0.790 | 0.825 |
| DNN+FCN | 0.878 | 0.791 | 0.826 |
| DNN+FCN+PIN | 0.884 | 0.796 | 0.827 |
| DNN+FCN+PIN+BiGRU | 0.887 | 0.799 | 0.831 |
| DNN+FCN+PIN+BiGRU+SAN | 0.896 | 0.813 | 0.841 |
The comparison results on the testing dataset are shown in Table 1. The last row "DNN+FCN+PIN+BiGRU+SAN" in Table 1 represents the full FSC model. Results show that FSC significantly outperforms both LTR and DLCM in terms of both the slate-wise pay AUC and the pointwise item pay AUC. To analyze the effect of each component of the FSC model, we also train and test its variants, from "DNN" to "DNN+FCN+PIN+BiGRU". "DNN" means that we exclude all four components discussed in Section 4.2. Note that "DNN" outperforms DLCM substantially, and the improvement comes from the fact that "DNN" avoids the "impressed bias", while DLCM inputs 50 items rather than the 10 real-impressed items. Comparing FSC with its variants, we find that each component helps to improve the slate-wise pay AUC. The most critical component is SAN, which shows that the FSC model successfully makes use of real-time user features and captures the real-time user behaviors.
6.3. Visualization of the Mutual Influence of Items in the FSC Model
Now we validate that the Full Slate Critic (FSC) model is able to characterize the influence of the positions and features of other items on each item. The visualization result of the Pair Influence Net is shown in Figure 11. We input to the model a specific slate with 10 similar items whose only difference is the price: (4, 3, 5, 5, 5, 5, 5, 5, 4, 5). We then plot the attention weight matrix as defined in Eq.(4). The x-axis denotes the items being influenced. The weights of the 2nd column are the largest among all columns, that is, the influence of the other items on the 2nd item (the cheapest one) is the largest. This comparison shows that the FSC model is able to capture the effect of a distinct feature value on the user's interest. It can also be observed that the weight of the pair (item 9, item 2), 0.11, is less than the weight of the pair (item 1, item 2), 0.12. Note that the prices of item 1 and item 9 are the same, so the result can be explained by the fact that the positions of item 1 and item 2 are much closer than the positions of item 9 and item 2. This validates that the FSC model also carefully considers the positions of items.
6.4. Performance Comparison of PPOExploration with RL Algorithms
| Algorithm | Replacement ratio |
| --- | --- |
| Reinforce (no Sg cell, real reward) | 0.335 |
| Reinforce (no Sg cell, model reward) | 0.598 |
| Reinforce (Sg cell, model reward) | 0.77 |
| PPO (Sg cell, model reward) | 0.784 |
| PPOExploration (Sg cell, model reward) | 0.819 |
We compare PPOExploration with state-of-the-art RL algorithms, Reinforce and PPO. Recall that the GCR framework selects the slate with the larger score from two candidates: the original slate (the first 10 of the top-50 candidate items) and the reranked slate. Thus we use the replacement ratio to evaluate each RL algorithm, which is the frequency with which the evaluation score of the slate generated by the RL algorithm is higher than that of the original slate during testing. Results are shown in Table 2. The replacement ratio of PPOExploration significantly outperforms both PPO and Reinforce, which confirms that the exploration method helps to improve performance. We also perform an ablation study. Comparing row 3 with row 4 in Table 2, we validate the critical effect on performance of the Sg cell that generates the information of selected items. Directly training the Reinforce algorithm with real rewards (the immediate reward of each item is drawn from the historical dataset) achieves the worst performance. This is because the number of historical samples is limited and model-free RL algorithms suffer from high sample complexity.
6.5. Validation of the Effectiveness of the FSC Model on the Slate Generation
As the FSC model is not perfect, one may doubt the correctness of using it as the evaluation of a slate generation policy. Now we validate that the FSC model can be used as the evaluation for the RL policy. We apply the unbiased inverse propensity score estimator (Swaminathan et al., 2017) to evaluate any RL policy $\pi$ with the real reward:

$$\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{N} \sum_{i=1}^{N} r(A_i)\, \frac{\pi(A_i)}{\pi_0(A_i)} \quad (7)$$
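The IPS estimator of Eq.(7), and its weighted variant used below, can be sketched as follows; the function and argument names are ours:

```python
def ips_estimates(slates, base_probs, target_probs, rewards):
    """Sketch of the IPS and weighted-IPS (wIPS) estimators for evaluating a
    target policy from slates logged under a base policy. `base_probs` and
    `target_probs` hold the probability each policy assigns to each logged
    slate; `rewards` holds the observed real reward of each slate."""
    ratios = [pt / pb for pt, pb in zip(target_probs, base_probs)]
    weighted = sum(w * r for w, r in zip(ratios, rewards))
    ips = weighted / len(slates)
    # wIPS normalizes by the sum of the importance ratios instead of N:
    # lower variance, asymptotically zero bias.
    wips = weighted / sum(ratios)
    return ips, wips
```

When the target policy equals the base policy all ratios are 1 and both estimators reduce to the empirical mean reward of the logged slates.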
| Algorithm | IPS | wIPS |
| --- | --- | --- |
| Reinforce (no Sg cell, real reward) | | |
| Reinforce (Sg cell, model reward) | | |
| PPOExploration (Sg cell, model reward) | | |
Here $r(A)$ denotes the real reward of the slate $A$, that is, the number of items purchased according to the dataset; $A_i$ is a slate generated by the policy, and $\pi(A_i)$ is the probability that the slate is picked by the policy $\pi$. We choose "Reinforce (no Sg cell, real reward)" in Table 2 as the base policy $\pi_0$ to evaluate the other policies. We also take the weighted inverse propensity score (wIPS) estimator as an evaluation metric, which has lower variance than IPS and asymptotically zero bias. As shown in Table 3, the comparison results of the RL algorithms in terms of IPS and wIPS are consistent with those of the replacement ratio (the evaluation results from the FSC model). That is, the FSC model is appropriate for evaluating the RL policy.

6.6. Live Experiments
| Bucket | brand (entropy) | price (entropy) |
| --- | --- | --- |
| Base | 1.342 | 1.618 |
| Test | 1.471 | 1.630 |
We apply the Generator and Critic approach to one of the largest ecommerce websites in the world. The Critic and Generator modules work as in Figure 2 in the online experiments, and are updated weekly. During a one-month A/B test, the GCR approach improved the number of orders by 5.5%, GMV by 4.3%, and the conversion rate by 2.03% (the main metrics of ranking algorithms in ecommerce), compared with the base bucket. We also compare the entropy of the brand (price) distribution of slates from the experiment bucket and the base bucket. As shown in Table 4, the GCR approach also improves the diversity of slates in addition to the efficiency improvement.
7. Conclusion
In this paper, we propose the Generator and Critic approach to solve the main challenges in the slate reranking problem. For the Critic module, we present a Full Slate Critic model that avoids the "impressed bias" of previous slate evaluation models and outperforms other models significantly. For the Generator module, we propose a new model-based algorithm called PPOExploration. Results show that PPOExploration outperforms state-of-the-art RL methods substantially. We apply the GCR approach to one of the largest ecommerce websites, and the A/B test result shows that the GCR approach improves both the efficiency and the diversity of slates.
It is promising to apply the GCR approach to generate multiple slates with different objectives, and output the final slate by some exploration methods. For the future work, it is interesting to study the combination of modelbased reward output by the Critic model with modelfree real reward, to improve the performance of the PPOExploration algorithm.
References
 Diversifying search results. In Proceedings of the second ACM international conference on web search and data mining, pp. 5–14. Cited by: §1.
 Learning a deep listwise context model for ranking refinement. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 135–144. Cited by: §1, §4.1.

 Learning groupwise scoring functions using deep neural networks. Cited by: §1, §4.1.
 Optimal greedy diversity for recommendation. In Twenty-Fourth International Joint Conference on Artificial Intelligence. Cited by: §1.
 A model-based reinforcement learning with adversarial training for online recommendation. In Advances in Neural Information Processing Systems, pp. 10734–10745. Cited by: §1.1.
 Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479. Cited by: §1, §5.3.
 Reinforcement mechanism design for ecommerce. In Proceedings of the 2018 World Wide Web Conference, pp. 1339–1348. Cited by: §1.
 Reinforcement mechanism design for fraudulent behaviour in ecommerce. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1.
 The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–336. Cited by: §1.
 Large-scale interactive recommendation with tree-structured policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3312–3320. Cited by: §1.1.
 Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 456–464. Cited by: §1.1, §1, §2.2.
 Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1187–1196. Cited by: §1.
 Generative adversarial user model for reinforcement learning based recommendation system. arXiv preprint arXiv:1812.10613. Cited by: §1.1, §1.

 PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472. Cited by: §1.
 Exact-k recommendation via maximal clique optimization. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 617–626. Cited by: §1.1.
 Reinforcement learning to rank in ecommerce search engine: formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 368–377. Cited by: §1.1, §1.
 Beyond greedy ranking: slate optimization via List-CVAE. arXiv preprint arXiv:1803.01682. Cited by: §1.1.
 End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §1.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.1.
 Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §2.2.
 Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.1, §2.2.

 Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17. Cited by: §5.3.
 Personalized reranking for recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 3–11. Cited by: §1.
 Exploiting query reformulations for web search result diversification. In Proceedings of the 19th international conference on World wide web, pp. 881–890. Cited by: §1.
 Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §2.2.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §2.2.
 Bidirectional recurrent neural networks. IEEE transactions on Signal Processing 45 (11), pp. 2673–2681. Cited by: §1.
 Virtual-Taobao: virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4902–4909. Cited by: §1.1.
 Reinforcement learning: an introduction. MIT press. Cited by: §1, §2.2.
 Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems, pp. 3632–3642. Cited by: §6.5.
 Aggregating ecommerce search results from heterogeneous sources via hierarchical reinforcement learning. In The World Wide Web Conference, pp. 1771–1781. Cited by: §1.1.
 Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §1.
 Practical diversified recommendations on youtube with determinantal point processes. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2165–2173. Cited by: §1.
 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256. Cited by: §2.2.
 Text-based interactive recommendation via constraint-augmented reinforcement learning. In Advances in Neural Information Processing Systems, pp. 15188–15198. Cited by: §1.1.
 Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 95–103. Cited by: §1.
 DRN: a deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pp. 167–176. Cited by: §1.
 Globally optimized mutual influence aware ranking in ecommerce search. arXiv preprint arXiv:1805.08524. Cited by: §1, §4.1.
 Reinforcement learning to optimize long-term user engagement in recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2810–2818. Cited by: §1.1.