Learning-to-rank (LTR) is a central problem in online search engines and recommendation systems, and it directly affects their profits. Many previous LTR approaches assume that the relevance of an item (or document) to a query is inherent, and they train models to capture this relevance accurately from a set of labeled data. The labels are commonly collected from customers’ implicit feedback and are treated as the ground truth for training in many approaches. In this setting, it is reasonable to focus on data-based ranking metrics, such as the widely adopted Area Under Curve (AUC) and Normalized Discounted Cumulative Gain (NDCG). This focus leads to LTR models that tightly match the labeled data, and the models are then used to find the most relevant items.
However, in E-commerce scenarios, the conversion rate of an item does not depend solely on the item itself. For instance, if an item is surrounded by similar but expensive items, the likelihood of buying it increases, which is known as the decoy effect. Figure 1 shows an illustrative example in which the context changes customers’ behavior. The number of possible surrounding contexts of an item is extremely large, since the contexts are combinations of all the items, which can number in the billions. To overcome the difficulty of this huge combinatorial space, the re-ranking strategy was proposed [Zhuang et al.2018], which differs from the classical LTR approaches. The learning process first finds a small set of candidate items relevant to the query, and then, in the re-ranking phase, determines the order of the candidate items. The re-ranking strategy drastically reduces the combinatorial space and thus enables a comprehensive understanding of the candidates for finding a proper order. The group-wise scoring functions (GSF) framework [Ai et al.2019] also pays attention to the influences between items, and it shares the same setting as re-ranking.
Even though the re-ranking strategy reduces the number of candidates so that searching the combinatorial space becomes feasible, we still need an accurate evaluator that can score any item list in order to find the best order. However, many orders of existing lists often do not appear in the collected data. With these issues in mind, we notice that previous approaches that employ supervised learning with data-based metrics have two major limitations. First, we discovered that data-based metrics are often inconsistent with online performance, so they can mislead the learning. Second, the supervised learning paradigm can hardly explore the combinatorial space, so it is hard to directly optimize the final performance metrics such as Conversion Rate (CR) and Gross Merchandise Volume (GMV). Therefore, it is appealing to have an evaluation approach beyond the dataset and an exploration approach beyond the supervised learning paradigm.
In this paper, we present an evaluator-generator framework, EG-Rerank, for E-commerce group-wise LTR. EG-Rerank learns an evaluator to predict the purchase probability of ordered item lists using the information of the items and the context. In addition, we introduce a discriminator that works as a self-confidence scoring function; it is learned by adversarial training and tells how confidently the evaluator can score a list. We use the discriminator to lead the generator to output orders within the space that the discriminator regards as confident. EG-Rerank then trains the LTR model with a reinforcement learning approach, which is able to explore item orders under the guidance of the evaluator. The contributions of this paper are summarized below:
Through experiments on AliExpress Search, one of the world’s largest international retail platforms, we show that some commonly used data-based metrics can be inconsistent with online performance, which confirms that data-based metrics can mislead LTR model learning.
We show that the learned evaluator can be a much more robust objective and can serve as a substitute for data-based metrics.
We present EG-Rerank and EG-Rerank+, two approaches under the evaluator-generator framework. In online A/B tests, EG-Rerank+ steadily improves the conversion rate by 2% over a fine-tuned industrial-level pairwise re-ranking scoring model, which is considered a significant improvement on a large and mature platform.
2 Related Works
Learning-to-rank (LTR) solves ranking problems with machine learning algorithms. There are several types of classic models: point-wise, pair-wise and list-wise models. Point-wise models [Cossock and Zhang2008, Li et al.2007] treat the ranking problem as a classification or regression task for each item. Pair-wise models [Joachims2002, Burges et al.2005, Burges2010] convert the original problem into ranking the items within each pair. List-wise models [Cao et al.2007, Xia et al.2008, Ai et al.2018] use well-designed loss functions to directly optimize the ranking criteria. More fine-grained settings, such as the group-wise model [Ai et al.2019] and the page-wise model [Zhao et al.2018], have been proposed in recent years and share the same intuition as the re-ranking method [Zhuang et al.2018]. The pointer network has also been used as a solution for re-ranking [Bello et al.2018]. However, the above-cited methods focus on data-based metrics such as AUC and NDCG, which are inconsistent with the actual online improvement, as we will show.
Slate optimization is a topic closely related to learning-to-rank. Recently, there have been some slate optimization works for recommendation systems that do not focus entirely on data-based ranking metrics. [Wang et al.2019] uses a sequential evaluation and generation framework for slate optimization, which is similar to ours. However, their reinforcement learning solution performs poorly and they prefer to apply supervised learning, which does not solve the inconsistency issue. In addition, repeated sampling and heuristic search (such as beam search [Zhuang et al.2018]) are very time-consuming for online services. Exact-K Recommendation [Gong et al.2019] uses a similar framework for slate optimization by imitating the outputs with good feedback. Our method encourages models to outperform the experts (offline data) by generative adversarial imitation learning, which has been shown to be a better choice for imitation learning than behavior cloning methods [Ho and Ermon2016, Duan et al.2017]. List-CVAE [Jiang et al.2019] is a conditional variational autoencoder framework for optimizing a list based on a fixed candidate set of items, which makes it difficult to deploy in a real large system.
Reinforcement learning [Sutton and Barto1998] algorithms aim to find the optimal policy that maximizes the expected return. Proximal Policy Optimization (PPO) [Schulman et al.2017] is one of them; it optimizes a surrogate objective function via stochastic gradient ascent. PPO retains some benefits of Trust Region Policy Optimization (TRPO) [Schulman et al.2015], but with a much simpler implementation and better sample complexity. Moreover, we introduce Generative Adversarial Imitation Learning (GAIL) [Ho and Ermon2016] into our framework. GAIL is intimately connected to generative adversarial networks [Goodfellow et al.2014]: it uses a discriminator to train a policy that generates data as close as possible to the experts’ data. A GAIL framework for generating virtual users was tried in a recent work [Shi et al.2019], but it did not model the contextual information in the final order. Some other interesting works [Ie et al.2019b, Chen et al.2019] use a pure reinforcement learning method for slate optimization and perform well in online environments. Their methods focus on a series of slates and optimize the long term value (LTV) of users’ feedback, whereas we focus on the other fundamental challenge of optimizing a single slate without further interactions.
Our evaluator-generator framework for group-wise LTR includes a generator, an evaluator and a discriminator. Reinforcement learning is a natural way to optimize the generator with feedback from the evaluator in our framework.
3.1 Problem definition
Our objective is to find the best permutation among all permutations of candidate items in the item set . Let denote a customer and denote the arrangement of first items and is the shortening of . For simplicity, we assume customers browse in a top-down manner, then the probability that customer purchases item in order can be written as
, where the random variable denotes the event of purchasing item . Recall that the number of candidates is so small in re-ranking that re-ranking models can include information of , while an ordinary LTR model scores items without knowledge of the complete candidate set.
We use another random variable to denote the number of purchases, then we have
The last equality holds under the assumption of top-down browsing. The goal of ranking is to find a permutation that maximizes the expected number of purchases:
where is the set of all permutations of candidate items. The purchase probability here acts as the evaluator in the framework, which plays an important role in the training, and can be learned by any discriminative machine learning method. When choosing item is considered as an action and the feedback of evaluator is considered as the reward, the training of generator can be connected with the evaluator by reinforcement learning algorithms. Permutation can be represented by a trajectory which is sampled by a re-ranking policy and is considered as reward where and are the state and item is the action.
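As a sketch of the formulation described in this subsection, the objective can be written as follows. The symbols below ($o$, $u$, $I$, $O$, $C_{o_i}$, $C_o$) are illustrative assumptions chosen for this sketch; the first equality in the second line is linearity of expectation and the second uses the top-down browsing assumption.

```latex
% Assumed notation: o is a permutation of the candidate set I, o_{1:i} its first i items,
% u the customer, C_{o_i} \in \{0, 1\} the purchase event of item o_i,
% C_o the number of purchases in the list, and O the set of all permutations.
\begin{align}
  \Pr\left(\text{$u$ purchases $o_i$ in order $o$}\right)
    &= p\left(C_{o_i} = 1 \mid o_{1:i}, u\right), \\
  \mathbb{E}\left[C_o \mid u\right]
    &= \sum_{i=1}^{|I|} \mathbb{E}\left[C_{o_i} \mid o, u\right]
     = \sum_{i=1}^{|I|} p\left(C_{o_i} = 1 \mid o_{1:i}, u\right), \\
  o^{*} &= \operatorname*{arg\,max}_{o \in O} \; \mathbb{E}\left[C_o \mid u\right].
\end{align}
```

Read in reinforcement learning terms, the arranged prefix together with the customer plays the role of the state, the next chosen item is the action, and the per-step purchase probability is the reward.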
The evaluator is the key model in our framework and it predicts the expected number of purchases of lists, which is the objective we want to maximize. The evaluator is expected to judge performances of orders and can provide appropriate rewards. In our setting, the reward of an order is the expectation of the number of purchases , which has been introduced in the last section.
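To make the role of this reward concrete, the following minimal sketch scores an order as the sum of context-dependent purchase probabilities and finds the best order by exhaustive search. The items, prices and the decoy-style boost rule are hypothetical stand-ins for a learned evaluator, not the model used in this paper.

```python
from itertools import permutations

# Hypothetical toy items: name -> (base purchase probability, price).
ITEMS = {"a": (0.10, 30.0), "b": (0.08, 45.0), "c": (0.05, 60.0)}


def purchase_prob(item, prefix):
    """Toy context-dependent purchase probability (an assumption, not a learned model).

    An item gets a boost for every more expensive item shown above it,
    mimicking the decoy effect under top-down browsing.
    """
    base, price = ITEMS[item]
    decoys = sum(1 for other in prefix if ITEMS[other][1] > price)
    return min(1.0, base * (1.0 + 0.5 * decoys))


def expected_purchases(order):
    """Reward of an order: the sum of per-position purchase probabilities."""
    return sum(purchase_prob(item, order[:i]) for i, item in enumerate(order))


def best_order(candidates):
    """Exhaustive search, feasible only because the candidate set is small."""
    return max(permutations(candidates), key=expected_purchases)


best = best_order(list(ITEMS))
```

Exhaustive search grows factorially with the number of candidates, which is why the framework trains a generator instead of enumerating orders.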
The structure of our evaluator is shown in Figure 2. The input includes the features of a list of items and the scenario feature . The scenario feature is independent of the items but provides abundant information, such as the date, the language and the public profiles of users. We use to extract the hidden feature for each item:
where represents a network with fully connected layers. To characterize the context of an item, we use an LSTM cell to process the state:
Finally we use another to regress the probability of purchasing item under state :
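As a sketch under assumed notation, the three steps above (item encoding, context tracking with the LSTM, and purchase regression) can be written as follows. The symbols $x_i$, $c$, $h_i$, $s_i$, $\hat{p}_i$, the subscripts on the two networks and the exact inputs of each network are illustrative assumptions.

```latex
% Assumed notation: x_i are the features of the i-th item in the list, c is the scenario
% feature, h_i the hidden item feature, s_i the LSTM state after the first i items,
% and \hat{p}_i the predicted purchase probability of the i-th item.
\begin{align}
  h_i &= \mathrm{DNN}_1\left([x_i, c]\right), \\
  s_i &= \mathrm{LSTM}\left(s_{i-1}, h_i\right), \\
  \hat{p}_i &= \sigma\left(\mathrm{DNN}_2\left(s_i\right)\right).
\end{align}
```

Under this reading, the predicted number of purchases of a list is the sum $\sum_i \hat{p}_i$ over its positions.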
To overcome the sparseness of purchase samples, the evaluator is also co-trained with click labels. This helps the model learn knowledge shared by the click prediction task and the purchase prediction task.
The loss function is a weighted sum of both objectives, and the weighting parameter should be chosen according to the ratio between the number of purchases and clicks.
where CE is short for standard cross entropy loss.
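A minimal sketch of such a co-training loss is shown below. The name `alpha` for the weighting parameter, and the choice to put the weight on the click term, are assumptions made for illustration.

```python
import math


def cross_entropy(label, prob, eps=1e-12):
    """Binary cross entropy for one prediction, clipped for numerical safety."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def co_training_loss(purchase_labels, purchase_probs, click_labels, click_probs, alpha):
    """Weighted sum of the purchase loss and the click loss, averaged over items.

    `alpha` is a hypothetical name for the weighting parameter.
    """
    n = len(purchase_labels)
    purchase = sum(cross_entropy(y, p) for y, p in zip(purchase_labels, purchase_probs)) / n
    click = sum(cross_entropy(y, p) for y, p in zip(click_labels, click_probs)) / n
    return purchase + alpha * click
```

Since clicks are far more frequent than purchases, a small `alpha` keeps the click task from dominating the gradient.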
With a reliable evaluator, the generator can autonomously explore the best order. The structure of our generator is similar to the pointer network, simplified for faster online prediction.
Our encoder divides the input into two parts. The first part characterizes the current state of the list and is processed in the same way as in the evaluator:
where is item , which is the item picked in step . The second part is to extract the feature of actions (next chosen item):
Note that is the same for all steps and can be reused.
Finally the output of encoder is
The output has vectors where includes the features of a candidate item and the current hidden state.
Then the decoder receives pairs of actions and the hidden states as above. Due to the time bottleneck of the online system, we use to decode the probabilities of actions. The output is sampled from the for unpicked items, which can be done by simple masking.
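The masking step can be sketched as follows, assuming the decoder produces one logit per candidate and that actions are drawn from a softmax restricted to unpicked items. The function names and the fixed toy scorer are hypothetical.

```python
import math
import random


def masked_softmax(logits, picked):
    """Softmax over unpicked items only; picked items get probability 0."""
    live = [i for i in range(len(logits)) if i not in picked]
    top = max(logits[i] for i in live)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - top) for i in live}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def generate_order(score_fn, n_items, rng):
    """Pick items one by one; score_fn(prefix) returns one logit per candidate."""
    order, picked = [], set()
    for _ in range(n_items):
        probs = masked_softmax(score_fn(order), picked)
        live = [i for i in range(n_items) if i not in picked]
        choice = rng.choices(live, weights=[probs[i] for i in live], k=1)[0]
        order.append(choice)
        picked.add(choice)
    return order


def toy_scores(prefix):
    """Hypothetical scorer: fixed logits that ignore the prefix."""
    return [2.0, 0.5, 1.0, -1.0]


order = generate_order(toy_scores, 4, random.Random(0))
```

Because every step samples only among unpicked items, the result is always a valid permutation of the candidates.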
We use the PPO algorithm to optimize the generator through the evaluator’s feedback on the generator’s output. However, standard PPO failed to train a stable critic network on our offline data. In our experiments, the critic network always output random values and could hardly help the training with the state produced by the encoder. Instead of training a critic network, we sample a few trajectories to estimate the value of a state. Concretely, we sample trajectories from a state to calculate the estimate of the state value, which is denoted as .
Moreover, we calculate the standard deviation of the value estimate and apply it to the loss function to make training more stable. The standard deviation of can be formulated as
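The critic-free value estimate can be sketched as a Monte Carlo average over sampled completions of the current list. In this sketch `policy_sample` and `evaluator` are hypothetical callables standing in for the generator and the evaluator, and the reward is assumed to be the evaluator score of the completed list.

```python
import random


def rollout_return(prefix, candidates, policy_sample, evaluator):
    """Complete the list from `prefix` with the policy and score it with the evaluator."""
    order = list(prefix)
    remaining = [c for c in candidates if c not in prefix]
    while remaining:
        nxt = policy_sample(order, remaining)
        order.append(nxt)
        remaining.remove(nxt)
    return evaluator(order)


def estimate_state_value(prefix, candidates, policy_sample, evaluator, k):
    """Average return of k trajectories sampled from the state given by `prefix`."""
    returns = [rollout_return(prefix, candidates, policy_sample, evaluator) for _ in range(k)]
    return sum(returns) / len(returns)


_rng = random.Random(0)


def uniform_policy(order, remaining):
    """Hypothetical policy: choose the next item uniformly at random."""
    return _rng.choice(remaining)


def toy_evaluator(order):
    """Hypothetical evaluator: position-discounted sum of item values."""
    return sum(value / (rank + 1) for rank, value in enumerate(order))


value = estimate_state_value([3], [1, 2, 3], uniform_policy, toy_evaluator, k=50)
```

More sampled trajectories reduce the variance of the estimate at the cost of more evaluator calls per state.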
The loss function of EG-Rerank can be written as
Here is equivalent to where is in our experiments. Policy is the one collecting rewards and is the current policy. The motivation for involving is to prevent the policy from changing too fast, which is helpful in practice. With the above loss function, the generator can update its parameters by sampling trajectories and interacting with the evaluator to get rewards.
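The following sketch shows a PPO-style clipped surrogate loss that uses a sampled value estimate instead of a critic. It follows the standard clipped objective of [Schulman et al.2017]; dividing the advantage by the standard deviation of the value estimate, the default `eps=0.2` and all argument names are assumptions for illustration rather than the exact loss used here.

```python
def clipped_surrogate_loss(new_probs, old_probs, rewards, value, std, eps=0.2, tiny=1e-8):
    """Negative PPO clipped objective averaged over sampled actions.

    new_probs / old_probs: action probabilities under the current policy and
    under the policy that collected the rewards.
    value, std: sampled estimate of the state value and its standard deviation.
    Normalizing the advantage by `std` is an assumption of this sketch.
    """
    total = 0.0
    for new_p, old_p, reward in zip(new_probs, old_probs, rewards):
        ratio = new_p / old_p
        advantage = (reward - value) / (std + tiny)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(ratio, 1 - eps, 1 + eps)
        total += min(ratio * advantage, clipped * advantage)
    return -total / len(rewards)
```

With a positive advantage the ratio stops contributing once it exceeds `1 + eps`, so a single batch cannot push the policy arbitrarily far from the one that collected the rewards.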
The framework is supposed to work smoothly when the evaluator is well trained. Nevertheless, an obvious weakness appears, since the evaluator models customers’ actions from only a narrow region of the combinatorial space of items. Figure 3 shows this phenomenon in the simulated environment.
Our solution is to introduce a sequential discriminator , where denotes the parameters of the discriminator. We denote this framework as EG-Rerank+.
The gives a confidence score of whether the list is a real logged list. For an ordered item list , is the summation of item rank scores, which are produced by a sequential structure
The discriminator is trained by classifying the generated list and the labeled list , and it is updated with the gradient
We take the output of the discriminator as part of the reward for EG-Rerank+ learning,
This modification leads the generator to output orders whose source the discriminator cannot easily distinguish. Therefore, the feedback of the evaluator becomes more confident. Figure 4 shows the distribution of lists in the logged data and in the outputs of EG-Rerank and EG-Rerank+. We use t-SNE for dimensionality reduction (without knowledge about groups) and visualization. The data contains thousands of real lists whose original query in the online system is ‘phone screen protectors’. It is clear that the outputs of EG-Rerank+ are closer to the logged lists than those of EG-Rerank.
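A minimal sketch of the two ingredients above follows: a binary cross-entropy objective that separates logged lists from generated ones, and a reward that mixes the evaluator score with the discriminator output. Each list score is assumed to be the raw (pre-sigmoid) sum of item rank scores; the additive mixing form and the name `weight` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_loss(logged_scores, generated_scores):
    """Binary cross entropy: logged lists are labeled 1, generated lists 0."""
    loss = -sum(math.log(sigmoid(s)) for s in logged_scores)
    loss -= sum(math.log(1.0 - sigmoid(s)) for s in generated_scores)
    return loss / (len(logged_scores) + len(generated_scores))


def mixed_reward(evaluator_score, discriminator_score, weight=1.0):
    """Evaluator reward plus a weighted confidence term (`weight` is hypothetical)."""
    return evaluator_score + weight * sigmoid(discriminator_score)
```

A generated list that looks like logged data earns a larger confidence term, which pulls the generator toward the region where the evaluator was trained.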
We first share two real cases in AliExpress Search with two types of average Group AUC [Zhou et al.2018]:
offline GAUC: the one computed before a model changes the order (old labels);
online GAUC: the one computed after a model changes the order (new labels).
Table 1 shows the offline GAUC of two models in a week. Pages (lists of items) are considered as groups and purchased items have a positive label.
| Methods | Update | Offline GAUC | CR gap |
|---|---|---|---|
| EG-Rerank | Daily | 0.512 ± 0.007 | +2.022% ± 0.015 |
RankNet* is the industrial-level pair-wise model we mentioned; it follows the design of RankNet and has the best offline performance in long-term experiments. We can see that EG-Rerank receives a poor Group AUC but increases the number of purchases greatly (more than 2%).
Next, Table 2 shows the online GAUC and actual performances of three online experiments in August 2019.
|online GAUC||Aug 17||Aug 18||Aug 19||Aug 20|
|CR gap||Aug 17||Aug 18||Aug 19||Aug 20|
From Table 2, it seems that model 3 is the best one among them according to GAUC. In fact, model 3 is the worst one in CR gap. Consider an extreme case: a model always picks the best item from its view and then adds irrelevant items to the list. Its online GAUC can approach 1 (since only the first item may have a positive label), but it obviously performs poorly online.
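The extreme case can be reproduced with a small Group AUC computation. This sketch takes each page as a group and uses an unweighted mean over pages that contain both labels, which is an assumption; [Zhou et al.2018] weights groups, and the function names here are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_auc(pages):
    """Unweighted mean AUC over pages that contain both labels."""
    values = [a for a in (auc(labels, scores) for labels, scores in pages) if a is not None]
    return sum(values) / len(values)


# Extreme case: one attractive item on top, followed by irrelevant items.
extreme_page = ([1, 0, 0, 0], [4.0, 3.0, 2.0, 1.0])
# A page whose purchases happen lower in the list.
mixed_page = ([0, 1, 0, 1], [4.0, 3.0, 2.0, 1.0])
```

The extreme page reaches a perfect Group AUC with a single purchase, while the mixed page has two purchases but a much lower Group AUC, so the metric alone says little about the number of purchases.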
| Method groups | Method | Ranking policy | NDCG | Group AUC | AUC on list pairs | Evaluator score | True score |
|---|---|---|---|---|---|---|---|
| Point-wise method | miDNN (MSE loss) | By scores | 0.971020 | 0.946146 | 0.725314 | 5.530002 | 5.553024 |
| | miDNN (CE loss) | By scores | 0.971223 | 0.946613 | 0.726313 | 5.521028 | 5.532152 |
| | miDNN (Hinge loss) | By scores | 0.956350 | 0.893319 | 0.695156 | 5.559538 | 5.529714 |
| Pair-wise method | RankNet (Logistic loss) | By scores | 0.969761 | 0.944799 | 0.714187 | 5.565940 | 5.581728 |
| | RankNet* (Hinge loss) | By scores | 0.948557 | 0.902552 | 0.701231 | 5.718276 | 5.681378 |
| List-wise method | ListNet | By scores | 0.971562 | 0.947294 | 0.718952 | 5.542377 | 5.552450 |
| Group-wise method | GSF(3) | By scores | 0.972133 | 0.947882 | 0.727182 | 5.529502 | 5.539089 |
| Pointer network | seq2slate (CE loss) | Generate orders | 0.871331 | 0.839300 | - | 5.802512 | 5.769682 |
| | seq2slate (Hinge loss) | Generate orders | 0.966575 | 0.936420 | - | 5.668325 | 5.664184 |
From the above two facts, we can see that neither offline Group AUC nor online Group AUC, nor the other traditional ranking metrics that are computed from the original labels, can properly reflect the models’ performance in the real E-commerce system. Therefore, we would like to emphasize the importance of the evaluator: it is not only a module in the framework but also a novel metric for ranking, which is much more closely related to online performance than data-based metrics.
4.1 Offline experiment: under a simulated environment
We are going to show that classical ranking metrics are inconsistent with our objective even in a simple LTR scenario. We set up the offline experiments under a simulated rule-based environment, which has mutual influences between items and is easy to reproduce.
4.1.1 Simulated environment
Our simulated environment borrows some ideas from the design of Google’s RecSim [Ie et al.2019a]. We prepare items and each item has a random feature of length . Subsets of size are sampled uniformly from items for , and then we repeatedly sample lists at random from each subset for times. Compared with , is so large that distribution bias exists naturally. We send all these sampled lists to the simulated environment, which labels the data that models may access. To label the data, the environment first scores each item by summing the output of a randomly weighted DNN and the mutual conversion benefit derived from the cosine distance between the item’s feature and the average prefix feature. Then, by sampling from these scores, each item gets a 0-1 label that indicates whether the item is clicked. The score of a list is its expected number of clicks, which is exactly the sum of the scores from the above step.
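The labeling rule can be sketched as follows. To stay self-contained, the randomly weighted DNN is replaced by a hypothetical `base_score` callable, and the `benefit` coefficient and the clipping of scores to [0, 1] are assumptions of this sketch.

```python
import math
import random


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def item_scores(features, base_score, benefit=0.1):
    """Click probability of each item given the items placed above it."""
    scores, prefix_sum = [], [0.0] * len(features[0])
    for i, feat in enumerate(features):
        score = base_score(feat)
        if i > 0:
            mean_prefix = [v / i for v in prefix_sum]
            score += benefit * cosine_distance(feat, mean_prefix)
        scores.append(min(max(score, 0.0), 1.0))  # keep the score a valid probability
        prefix_sum = [p + f for p, f in zip(prefix_sum, feat)]
    return scores


def label_list(features, base_score, rng):
    """Return sampled 0-1 click labels and the list score (expected clicks)."""
    scores = item_scores(features, base_score)
    labels = [1 if rng.random() < s else 0 for s in scores]
    return labels, sum(scores)


def constant_base(feat):
    """Hypothetical stand-in for the randomly weighted DNN."""
    return 0.2


labels, list_score = label_list([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], constant_base, random.Random(0))
```

Because each score depends on the prefix, reordering the same items changes the list score, which is the mutual influence the environment is meant to capture.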
We take five indicators (NDCG, Group AUC, AUC on list pairs, Evaluator score and Environment score) into consideration. ‘AUC on list pairs’ indicates how precisely a model can find the better list in a pair of lists, where the label of each list is the number of clicks on it. The evaluator is required to have a high AUC on list pairs so that it can properly supervise the generator to find efficient orders. Since some LTR models are independent of the order of items, we add a position discount weight to their output on each item, as in NDCG, and this weighted sum is the score of the list. The evaluator already has the contextual information, so there is no need to design a heuristic weight for it. Evaluator score is the prediction from the evaluator and is also used to train EG-Rerank and EG-Rerank+, while the environment score is the one we want to maximize.
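The two list-level computations above can be sketched as follows. The logarithmic discount is the one commonly used in NDCG and is assumed here, as is the choice to skip pairs of lists with equal click counts; the function names are hypothetical.

```python
import math
from itertools import combinations


def discounted_list_score(item_scores):
    """Position-discounted sum of per-item scores, using the NDCG discount."""
    return sum(s / math.log2(rank + 2) for rank, s in enumerate(item_scores))


def auc_on_list_pairs(list_scores, list_clicks):
    """Fraction of list pairs with different click counts that are ordered correctly."""
    correct, total = 0.0, 0
    for i, j in combinations(range(len(list_scores)), 2):
        if list_clicks[i] == list_clicks[j]:
            continue  # no better list in this pair
        total += 1
        better, worse = (i, j) if list_clicks[i] > list_clicks[j] else (j, i)
        if list_scores[better] > list_scores[worse]:
            correct += 1.0
        elif list_scores[better] == list_scores[worse]:
            correct += 0.5
    return correct / total
```

For an order-independent model, `discounted_list_score` turns its per-item outputs into one list score, which `auc_on_list_pairs` can then compare against the click counts.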
4.1.2 Models for comparison
Offline experiments are divided into two groups. The first group contains classical methods, which generate orders by their scores. All features contain the global feature extension in miDNN [Zhuang et al.2018]. We use miDNN as a base model and apply mean squared error (MSE) loss, cross entropy (CE) loss and hinge loss to represent point-wise methods, and apply pair-wise logistic loss and hinge loss to represent pair-wise methods, which are similar to RankNet [Burges et al.2005]. For list-wise methods, we include ListNet [Cao et al.2007] and ListMLE [Ai et al.2018]. The group-wise scoring framework [Ai et al.2019] is a novel choice for re-ranking, and we examine GSF(3) and GSF(10) with the sampling trick introduced in its paper.
The second group contains the methods that directly generate the orders. A pointer network solution [Bello et al.2018], which is trained by supervised learning, was recently shown to be efficient. The two loss functions used in that paper, namely cross entropy loss and hinge loss, are added to the experiment. EG-Rerank and EG-Rerank+ also appear in this group. We did not try a behavior cloning method like [Gong et al.2019], because it is hard to define good lists for model cloning in our virtual task.
4.1.3 Result analysis
We show the complete results in Table 3. Classic methods get high NDCG and Group AUC in the simulated environment, but they are less likely to generate orders with a high score in the environment. In contrast, the EG-Rerank series have low NDCG and Group AUC but output more satisfying orders. We can clearly see that data-based ranking metrics are inconsistent with the actual performance.
Another observation is that the ‘Evaluator score’ column is much more consistent with ‘True score’ than the data-based ranking metrics are. However, the evaluator still fails to capture the true score accurately. EG-Rerank and EG-Rerank+ both get a high evaluator score, but the gap in true scores between them and the traditional methods is not as significant as the gap in evaluator scores. This again implies that the evaluator cannot correctly predict the score of orders that are far from the distribution of the labeled data. Intuitively, EG-Rerank+ should not get a better evaluator score, since its reward is mixed with the distribution reward, but in the experiment EG-Rerank+ has both a better evaluator score and a better true score than EG-Rerank. We conjecture that the discriminator makes the generator more likely to find better local optima.
4.2 Online A/B tests
We set up a few online A/B tests on AliExpress Search, where each model serves a random portion of search queries. Models can access data from the last two weeks, which contain billions of displayed lists and millions of purchase records. The conversion rate of purchases is the main criterion of online performance. The online environment varies so rapidly that the gaps may differ from day to day. All A/B tests were held for a week, so that the variances are acceptable and the better method can be clearly determined.
Since online testing resources are expensive, only models with a significant offline improvement can be examined online. In our long-term trials, the fine-tuned RankNet* has been proven to have both the best offline performance (GAUC 0.78) and a great online improvement. RankNet* is well integrated with the system and can update itself almost in real time, whereas EG-Rerank (EG-Rerank+) is incrementally trained every day. Thanks to the discriminator strategy, EG-Rerank+ has a higher average offline Group AUC (about 0.63) than EG-Rerank (about 0.51). The results for online metrics are shown in Table 4.
| Methods | Update | Online GAUC | Evaluator* | CR gap |
|---|---|---|---|---|
| No Re-rank | - | 0.758 ± 0.004 | +0.000% | +0.000% |
| RankNet* | Real-time | 0.789 ± 0.002 | +15.16% | +6.559% ± 0.013 |
| EG-Rerank | Daily | 0.783 ± 0.003 | +9.938% | +2.022% ± 0.015 |
| EG-Rerank+ | Daily | 0.796 ± 0.007 | +2.108% | +2.282% ± 0.005 |
| EG-Rerank+ | Daily | 0.786 ± 0.003 | +0.809% | +0.626% ± 0.008 |
The results demonstrate the inconsistency between online Group AUC and online improvement. On the other hand, ‘Evaluator*’ in the table is another model that has the same structure as the evaluator in the framework, while it takes real-time training and can better predict models’ performance. It can even capture the slight gap between EG-Rerank and EG-Rerank+. According to the results in the table, we can see that it may be a potentially consistent metric. Although it is hard to predict the accurate gap in A/B tests, the online evaluator can find the winner with high probability. Therefore, we can regard it as a useful metric that can predict actual performances of models offline.
In E-commerce ranking tasks, most learning-to-rank methods may not work well in practice, since their models lose effectiveness when the order changes. We propose the evaluator-generator group-wise LTR framework EG-Rerank+, which consists of an evaluator, a generator and a discriminator. We demonstrate its strong performance in both offline and online settings, and we believe the framework is valuable for a variety of real-world ranking scenarios and may significantly improve business goals in different E-commerce tasks.
- [Ai et al.2018] Qingyao Ai, Keping Bi, Jiafeng Guo, and W Bruce Croft. Learning a deep listwise context model for ranking refinement. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 135–144. ACM, 2018.
- [Ai et al.2019] Qingyao Ai, Xuanhui Wang, Sebastian Bruch, Nadav Golbandi, Michael Bendersky, and Marc Najork. Learning groupwise multivariate scoring functions using deep neural networks. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pages 85–92. ACM, 2019.
- [Bello et al.2018] Irwan Bello, Sayali Kulkarni, Sagar Jain, Craig Boutilier, Ed Huai-hsin Chi, Elad Eban, Xiyang Luo, Alan Mackey, and Ofer Meshi. Seq2slate: Re-ranking and slate optimization with rnns. CoRR, abs/1810.02019, 2018.
- [Burges et al.2005] Christopher J. C. Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. Learning to rank using gradient descent. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), pages 89–96, 2005.
- [Burges2010] Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11(23-581):81, 2010.
- [Cao et al.2007] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, pages 129–136, 2007.
- [Chen et al.2019] Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne 11-15, 2019, pages 456–464, 2019.
- [Cossock and Zhang2008] David Cossock and Tong Zhang. Statistical analysis of bayes optimal subset ranking. IEEE Trans. Information Theory, 54(11):5140–5154, 2008.
- [Duan et al.2017] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in neural information processing systems, pages 1087–1098, 2017.
- [Gong et al.2019] Yu Gong, Yu Zhu, Lu Duan, Qingwen Liu, Ziyu Guan, Fei Sun, Wenwu Ou, and Kenny Q. Zhu. Exact-k recommendation via maximal clique optimization. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pages 617–626, 2019.
- [Goodfellow et al.2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, pages 2672–2680, 2014.
- [Ho and Ermon2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4565–4573, 2016.
- [Ie et al.2019a] Eugene Ie, Chih-wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. Recsim: A configurable simulation platform for recommender systems. arXiv preprint arXiv:1909.04847, 2019.
- [Ie et al.2019b] Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Chandra, and Craig Boutilier. Slateq: A tractable decomposition for reinforcement learning with … In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, August 10-16, 2019, pages 2592–2599, 2019.
- [Jiang et al.2019] Ray Jiang, Sven Gowal, Yuqiu Qian, Timothy A. Mann, and Danilo J. Rezende. Beyond greedy ranking: Slate optimization via list-cvae. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
- [Joachims2002] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada, pages 133–142, 2002.
- [Li et al.2007] Ping Li, Christopher J. C. Burges, and Qiang Wu. Mcrank: Learning to rank using multiple classification and gradient boosting. In Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 897–904, 2007.
- [Schulman et al.2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pages 1889–1897, 2015.
- [Schulman et al.2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
- [Shi et al.2019] Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4902–4909, 2019.
- [Sutton and Barto1998] RS Sutton and AG Barto. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
- [Wang et al.2019] Fan Wang, Xiaomin Fang, Lihang Liu, Yaxue Chen, Jiucheng Tao, Zhiming Peng, Cihang Jin, and Hao Tian. Sequential evaluation and generation framework for combinatorial recommender system. CoRR, abs/1902.00245, 2019.
- [Xia et al.2008] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, pages 1192–1199, 2008.
- [Zhao et al.2018] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 95–103. ACM, 2018.
- [Zhou et al.2018] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1059–1068. ACM, 2018.
- [Zhuang et al.2018] Tao Zhuang, Wenwu Ou, and Zhirong Wang. Globally optimized mutual influence aware ranking in e-commerce search. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 3725–3731, 2018.