1. Introduction
Information retrieval and machine learning applications play an important role in industrial and commercial scenarios, ranging from web search engines (Google.com, Baidu.com, etc.) to e-commerce websites (Taobao.com, Amazon.com). The major applications, i.e., search and recommendation, usually require ranking a large set of data items in response to users' requests in an online setting. To support these applications, two issues must generally be addressed: a) effectiveness, i.e., how accurate and reliable the results in the final ranking list are, and b) efficiency, i.e., how quickly the search engine responds to users' queries and whether the computational burden of ranking is as low as possible from a system's perspective. Addressing both issues in large-scale applications, so as to provide an excellent user experience with an efficient solution, is a challenge.
Generally speaking, to cope with the high computational cost of deeper models as well as large-scale traffic, a search engine system has to degrade its service level in terms of effectiveness, e.g., reducing the number of recalled items or taking some non-essential services offline, in order to avoid access delays or even unavailability, which severely affect the user experience. Though effective, these methods merely strike a hard compromise between processing performance and service availability, which means unnecessary sacrifice of revenue in real-world business practice. Consequently, this raises the question of whether we can design a "soft" or "intelligent" solution that achieves both effectiveness and efficiency.
The answer seems to be promising. Liu et al. proposed a cascade ranking model to address the trade-off between effectiveness and efficiency in large-scale e-commerce search applications (Liu et al., 2017). Their method mainly focuses on reducing the number of items in the ranking process; however, it suggests another possible way to optimize an e-commerce search engine. In a search engine, a set of factors is applied in the ranking process, and we conjecture that not all of those factors are necessary in real-world applications. After a thorough investigation of our real-world operational environment, we discovered that there are still relatively high correlations between the ranking factors in our system (see Figure 1 for details). Therefore, on the one hand, there exists redundancy among the factors in our online operational environment. On the other hand, we also observe that conversion rates on items vary under different contexts (here, the user-query pair denotes the context). For instance, users with higher purchasing power tend to have a higher conversion rate under some long-tail (low-frequency) queries. Based on this analysis, we consider that a few computationally efficient factors may be sufficient to achieve effectiveness under such contexts. These observations show the possibility of preserving effectiveness by carefully selecting a subset of all factors under certain circumstances, which is a standard combinatorial optimization problem, but with an auxiliary context description.
Combinatorial optimization is a fundamental problem in computer science. Recently, Bello et al. showed that reinforcement learning is capable of solving combinatorial optimization problems such as the traveling salesman problem via pointer networks (Vinyals et al., 2015; Bello et al., 2016). In this paper, we address the above challenges by designing an innovative model based on reinforcement learning algorithms. We formally define our optimization problem with a general framework and a loss function that reflects both ranking effectiveness and efficiency. Then, we transform the contextual combinatorial optimization problem into a sequential decision-making one by incorporating the contextual setting and the factor selection into the state and action of an MDP, respectively. The reward is designed to encourage saving the computational cost of factors while ensuring that the ranking results remain effective. The final solution can be obtained via state-of-the-art reinforcement learning algorithms, such as Asynchronous Advantage Actor-Critic (A3C) in this paper. Exploiting the correlation among factors as well as the context dependency in our system, our method handles contextual factor selection in terms of user and query while minimizing the influence on business indicators, e.g., gross merchandise value (GMV) and click-through rate (CTR).
We show that our algorithm outperforms comparative algorithms in both offline and online evaluations. We also demonstrate the capabilities of the new method in a real-world large-scale system during the Singles' Day Shopping Festival.
The contributions of this paper can be summarized as: i) injecting contextual factor selection into the search engine ranking score for engine acceleration; ii) formulating contextual factor selection for ranking as a contextual combinatorial optimization problem; iii) deriving a reinforcement learning based solution to the proposed optimization problem; iv) demonstrating the effectiveness of our technique in both offline and online environments.
The rest of the paper is organized as follows: Section 2 introduces the background; Section 3 reviews related work; Section 4 defines our problem from an optimization point of view; Section 5 proposes an actor-critic method to solve the problem of Section 4; Section 6 presents the experimental results. Finally, Section 7 concludes the paper.
2. Ranking in ECommerce Search
Suppose that $\mathcal{X}$ is the set of all available items in the database, $\mathcal{Q}$ is the set of all possible queries and $\mathcal{U}$ denotes the set of all users' information. Let $\{(u_i, q_i)\}_{i=1}^{N}$ be a set of user-query pairs, where $(u_i, q_i)$ denotes the $i$-th user-query pair from the search requests. $X_i = \{x_{i,1}, \dots, x_{i,n_i}\}$ is the set of items associated with the $i$-th user-query request, where $n_i$ is the number of related items. The ranking problem in the e-commerce scenario can then be formally defined as the task of generating a permutation $\pi \in \Pi_{n_i}$, where $\pi$ is a one-to-one correspondence from $\{1, \dots, n_i\}$ to itself and $\Pi_{n_i}$ denotes the set of all possible permutations. The goal is to maximize the probability of purchase under the permutation. The permutation is usually generated by a ranking function $f$ which scores each item for the request. Let $\boldsymbol{\phi}(x \mid u, q) = (\phi_1, \dots, \phi_m)$ be the corresponding factor vector of item $x$ under the user-query pair $(u, q)$, where $m$ is the number of factors. Some factors in the factor vector depend on the user-query pair and the item. Without loss of generality, the ranking model is defined by

(1)  $\mathrm{score}(x) = f(\boldsymbol{\phi}(x \mid u, q))$

where $f$ is the ranking function. It could be any function, such as a linear model, a deep neural network or a tree model.
The ranking function is usually trained from a dataset logged from the real system, where each training example consists of an item's factor vector and a label representing the user's feedback on that item. The training can be conducted in a pointwise (Cooper et al., 1992; Li et al., 2008), pairwise (Freund et al., 2003; Burges et al., 2005; Zheng et al., 2007), or listwise (Xu and Li, 2007; Cao et al., 2007; Burges, 2010) manner. It is worth noting that in this paper we assume a trained ranking function is given and consider the general case where the ranking function is provided as a black box, i.e., with no access to the gradient or even the Hessian matrix.
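To make the black-box assumption above concrete, the following minimal sketch treats the trained ranking function as an opaque callable that maps a factor vector to a score, and derives the permutation by sorting. The weights and factor values are purely illustrative, not the production model's.

```python
# Minimal sketch of Section 2's setting: a trained ranking function f is a
# black box mapping an item's factor vector to a score; the permutation is
# obtained by sorting items by descending score. All values are illustrative.

def rank_items(f, factor_vectors):
    """Score each item's factor vector with the black-box f and return the
    permutation (item indices sorted by descending score)."""
    scores = [f(phi) for phi in factor_vectors]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Example: a linear black box with fixed, made-up weights.
weights = [0.5, 1.0, 0.2]
f = lambda phi: sum(w * x for w, x in zip(weights, phi))

items = [[1.0, 0.0, 3.0],   # score 1.1
         [0.0, 2.0, 0.0],   # score 2.0
         [2.0, 1.0, 1.0]]   # score 2.2
print(rank_items(f, items))  # [2, 1, 0]
```

Note that only inputs and outputs of `f` are touched, which is all the black-box assumption permits.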
3. Related Work
There is a large body of work attempting to resolve the effectiveness-efficiency challenge, and we review some of it here.
Cascade learning was originally proposed to address the effectiveness-efficiency issue in traditional classification and detection problems such as fast visual object detection (Bourdev and Brandt, 2005; Schneiderman, 2004; Viola and Jones, 2003). Liu et al. developed a cascade ranking model for a large-scale e-commerce search system and deployed it in Taobao.com (Liu et al., 2017). However, whereas they only optimize the number of ranked items, we mainly focus on factor usage during the ranking process.
Feature selection tries to remove irrelevant and/or redundant features to improve learning performance (Guyon and Elisseeff, 2003). Traditional feature selection techniques roughly fall into two categories, i.e., filter methods and wrapper methods. Filter methods use learner-independent measurements to evaluate and select features, such as information gain and Relief (Kira and Rendell, 1992). Wrapper methods involve the final learner in the feature selection process, e.g., using its accuracy as the evaluation criterion for the goodness of features. Liu et al. proposed the TEFE (Time-Efficient Feature Extraction) approach, which balances test accuracy against test-time cost by extracting a proper subset of features for each test object (Liu et al., 2008). In the learning-to-rank literature, feature selection is a common strategy to improve efficiency. In general, a set of crucial factors is selected from the complete set of all possible factors according to some criterion, such as importance to the ranking (Geng et al., 2007; Wang et al., 2010a, b). Geng et al. propose a selection method based on factor importance in a query-free manner, but they consider neither the real computational cost nor query-dependent factors (Geng et al., 2007). Some methods are query-dependent, taking the cost (delay) of the query into account (Wang et al., 2010a, b). In contrast, we consider the computational cost (delay) of each individual factor.

Ensemble pruning is a class of approaches that selects a subset of learners (factors) to comprise the ensemble learner (Zhou, 2012). Recently, Benbouzid et al. applied a Q-learning algorithm to ensemble pruning, in which a reinforcement learning agent decides whether or not to skip each base learner. However, their method is context-free and lacks evaluation in a real-world large-scale application.
4. Contextual Factor Selection for Ranking
4.1. The CFS framework
In this subsection, we describe a general framework of Contextual Factor Selection (CFS) for constructing a search engine optimizer that achieves both effectiveness and efficiency in an e-commerce search engine. As mentioned above, a factor vector $\boldsymbol{\phi}(x) = (\phi_1, \dots, \phi_m)$ is assigned to each item $x$, where each dimension of the factor vector is calculated online and varies in computational cost. Let the factor vector be associated with a cost vector $\mathbf{c} = (c_1, \dots, c_m)$, where $c_j$ denotes the computational cost of the $j$-th factor. Let $\mathcal{F} = \{\phi_1, \dots, \phi_m\}$ be the set of all factors and $S$ a subset of $\mathcal{F}$. The indicator function of a subset $S$ of the set $\mathcal{F}$ is defined as

(2)  $\mathbb{I}_S(j) = 1$ if $\phi_j \in S$, and $\mathbb{I}_S(j) = 0$ otherwise.
From a practical point of view, some factors are unnecessary for ranking. For example, given the set of factors $\mathcal{F}$, a subset $S$ of $\mathcal{F}$ containing highly informative factors might be sufficient under some contexts. Therefore, given an item $x$ and the indicator function $\mathbb{I}_S$, the computational cost function can be written as $c(x, \mathbb{I}_S) = \sum_{j=1}^{m} \mathbb{I}_S(j)\, c_j$, where the indicator function determines whether or not each factor participates in the ranking process (we mainly consider the computational cost of the factors while ignoring other costs). Thus, given a set of items $X$ with $|X| = n$, the total computational cost is

(3)  $C(X, \mathbb{I}_S) = \sum_{x \in X} \sum_{j=1}^{m} \mathbb{I}_S(j)\, c_j = n \sum_{j=1}^{m} \mathbb{I}_S(j)\, c_j$
As defined in Equation 1, the ranking model with all factors can be written as $f(\boldsymbol{\phi}(x))$ and the one with a subset $S$ is written as

(4)  $f_S(x) = f(\boldsymbol{\phi}(x) \odot \mathbb{I}_S)$

where $\odot$ denotes elementwise masking of the factor vector by the indicator.
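A minimal sketch of the subset machinery just defined: a 0/1 indicator vector stands for the factor subset, its cost is the summed cost of the kept factors, and the subset ranking function scores the masked factor vector. The costs and the trivial ranking function are illustrative stand-ins, not values from the paper.

```python
# Sketch of the indicator/cost/masked-ranking formulation (Eqs. 2-4).
# All numeric values are hypothetical.

def subset_cost(indicator, costs, n_items):
    """Total cost of computing the kept factors for n_items items (Eq. 3)."""
    per_item = sum(i * c for i, c in zip(indicator, costs))
    return n_items * per_item

def masked_score(f, phi, indicator):
    """Apply the black-box f to the factor vector with dropped factors
    zeroed out (Eq. 4)."""
    return f([p * i for p, i in zip(phi, indicator)])

costs = [1.0, 4.0, 0.5]          # hypothetical per-factor computation costs
indicator = [1, 0, 1]            # drop the expensive second factor
print(subset_cost(indicator, costs, n_items=10))    # 15.0

f = lambda phi: sum(phi)         # trivial stand-in ranking function
print(masked_score(f, [2.0, 9.0, 1.0], indicator))  # 3.0
```

Dropping the second factor here removes most of the per-item cost while perturbing the score, which is exactly the trade-off the objective below formalizes.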
Intuitively, we can treat the permutation generated by $f$ as the optimal one, since it uses all the factors available during the ranking process. Thus, given a request, the objective is

(5)  $\min_{S \subseteq \mathcal{F}} \; D(f, f_S; X) + \lambda\, C(X, \mathbb{I}_S)$

where $D(f, f_S; X)$ denotes the distance between the functions $f$ and $f_S$ over the item set $X$, which could be any distance between two functions, e.g., the Kullback-Leibler divergence (Kullback and Leibler, 1951); the second term is the computational cost of the factors in the set $S$; $\lambda$ is the trade-off parameter and $n$ is the number of items for the query. Intuitively, the objective reduces the usage of factors as much as possible while approximating the original ranking function $f$ by the function $f_S$ as closely as possible.

However, Equation 5 is intractable even for a single request, as it can be reduced to the optimal subset selection problem and is therefore NP-hard in general (Davis et al., 1997; Natarajan, 1995). Moreover, we need to perform contextual factor selection, i.e., solve a general NP-hard problem for every user-query pair, which is impractical in a large-scale system even with a small number of contexts. To overcome this challenge, we generalize the solution of Equation 5 at the contextual level. That is, we do not directly search for the optimal subset, and instead define:
(6)  $\mathbb{I} = h_\theta(u, q)$

where $h_\theta$ is a model parameterized by $\theta$ and the user-query pair $(u, q)$ characterizes the context. This formulation reduces the solution space from the original multiple optimal-subset-selection problems to a single global parameter vector, based on the assumption that similar context representations should have similar optimal subset structures. Thus, our goal is to search for the global parameter vector $\theta$ that minimizes the loss defined in Equation 5 over all requests.
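To see why a learned contextual policy is preferable to direct search, note that a brute-force solver of Equation 5 must enumerate all $2^m$ factor subsets per request. The sketch below does exactly that for a tiny $m$, with toy stand-in loss and cost functions (not the paper's); even modest factor counts make this enumeration infeasible online.

```python
# Brute-force illustration of the subset-selection objective (Eq. 5):
# enumerate every 0/1 mask and keep the one minimizing loss + lam * cost.
# The loss and cost functions are toy stand-ins.

from itertools import product

def best_subset(m, loss_of, cost_of, lam):
    """Minimize loss(mask) + lam * cost(mask) over all 2**m masks."""
    best = None
    for mask in product([0, 1], repeat=m):
        obj = loss_of(mask) + lam * cost_of(mask)
        if best is None or obj < best[1]:
            best = (mask, obj)
    return best

# Toy stand-ins: loss grows with each dropped factor, cost with each kept one.
loss_of = lambda mask: mask.count(0) * 0.3
cost_of = lambda mask: sum(mask)
mask, obj = best_subset(4, loss_of, cost_of, lam=0.1)
print(mask, obj)  # (1, 1, 1, 1) 0.4 -- with these toy weights, keeping all wins
```

With $m = 30$ factors this loop already visits over a billion masks per request, which is the intractability that motivates amortizing the search into the parameters $\theta$.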
To illustrate our method, we adopt linear ranking functions as a demonstration; other representations, e.g., deep neural network and tree-based ranking functions, can be handled in a similar way. In the linear setting, the score of item $x$ under the user-query pair $(u, q)$ is

(7)  $f(x \mid u, q) = \sum_{j=1}^{m} w_j\, \phi_j(x)$

where $w_j$ is the corresponding weight of factor $\phi_j$.
From another point of view, the permutation depends significantly on which factors are used to calculate the scores. Formally, given a user-query pair $(u, q)$ and a corresponding weight vector $\mathbf{w}$, the linear ranking function becomes

(8)  $f_{\mathbb{I}}(x \mid u, q) = \sum_{j=1}^{m} w_j\, \mathbb{I}_j(u, q)\, \phi_j(x)$

where $\mathbb{I}_j(u, q) \in \{0, 1\}$ is the indicator function, which depends on the user-query pair $(u, q)$. For convenience, $\mathbb{I}(u, q)$ denotes the binary vector with respect to the factor vector. Therefore, the ranking permutation depends on both the ranking function and the indicator function, assuming the weight vector is fixed once the ranking model is given. Thus, the crucial part of ranking optimization is to learn an indicator function that determines the utilization of the factors (see Figure 2 for an illustration). To simplify notation, we write $\mathbb{I} = h_\theta(u, q)$, where the parameter $\theta$ characterizes the factor subset. Hence, the ranking permutation is induced by the ranking function and the indicator function, and we can rewrite the distance function as $D(\pi, \pi^*; X)$, where $\pi$ and $\pi^*$ are the permutations induced by the ranking functions $f_{\mathbb{I}}$ and $f$, respectively.
4.2. CFS with Pairwise Ranking Loss
With the optimal ranking permutation $\pi^*$ defined above, we define the distance over an item set $X$ between a permutation $\pi$ and the optimal ranking permutation $\pi^*$ as

(9)  $D(\pi, \pi^*; X) = \frac{2}{n(n-1)} \sum_{i < j} \mathbb{1}\big[(\pi(i) - \pi(j))(\pi^*(i) - \pi^*(j)) < 0\big]$

where $\mathbb{1}[\cdot]$ equals 1 if its argument holds and 0 otherwise. This definition is the analogue of the averaged pairwise loss in the learning-to-rank literature. The distance measures how far the induced permutation is from the optimal one in terms of misordered ranking pairs.
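The pairwise distance above can be sketched directly: count the item pairs the two rankings order differently and divide by the number of pairs. Rankings here are represented as lists of ranks per item; the example data is illustrative.

```python
# Averaged pairwise loss between two permutations (Eq. 9): the fraction of
# item pairs (i, j) that the candidate ranking and the optimal ranking order
# differently.

def pairwise_distance(rank_a, rank_b):
    """Fraction of pairs ordered differently by the two rankings; rank_x[i]
    is the rank of item i under ranking x."""
    n = len(rank_a)
    disagreements = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
                disagreements += 1
    return disagreements / (n * (n - 1) / 2)

optimal = [0, 1, 2, 3]          # ranks under the full-factor ranking
approx  = [0, 2, 1, 3]          # ranks under a factor subset: one swapped pair
print(round(pairwise_distance(approx, optimal), 4))  # 0.1667 (1 of 6 pairs)
```

Identical rankings give distance 0 and fully reversed rankings give 1, so the value is directly comparable across page views of the same size.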
With the distance and total cost function defined above, our goal is, given a user-query pair $(u, q)$, the corresponding item set $X$ and the ranking function $f$, to learn an indicator function that minimizes both the distance function and the total computational cost. Formally, the objective in Equation 5 can be rewritten as

(10)  $\min_{\theta} \; D(\pi_{h_\theta(u,q)}, \pi^*; X) + \lambda \sum_{j=1}^{m} c_j\, [h_\theta(u, q)]_j$
5. RankCFS: A Reinforcement Learning Approach
As mentioned in Section 4.1, the optimization problem defined in Equation 10 is NP-hard in the general case and finding the exact solution is computationally intractable. Inspired by recent work (Bello et al., 2016; Benbouzid et al., 2012), we propose to optimize factor usage within a reinforcement learning framework in order to learn an indicator function, by transforming the assignment of each element of the indicator vector into a sequential decision-making problem. We call the resulting method RankCFS.
5.1. Reinforcement Learning and ActorCritic Methods
In this subsection, we review some basic concepts of reinforcement learning. Readers familiar with reinforcement learning may skip it.
In reinforcement learning, an agent must sequentially select actions to maximize its total expected payoff. These problems are typically formalized as Markov decision processes (MDPs) defined by a tuple $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $P(s_{t+1} \mid s_t, a_t)$ represents the transition probability governing the dynamics of the system, $r(s_t, a_t)$ is the reward function quantifying the performance of the agent, and $\gamma \in [0, 1)$ is a discount factor specifying the degree to which rewards are discounted over time. At each step $t$, the agent is in state $s_t$ and must choose an action $a_t$, transitioning to a successor state $s_{t+1}$ as given by $P$ and yielding a reward $r_t$. A policy $\pi_\theta(a \mid s)$ is defined as a probability distribution over state-action pairs, where $\pi_\theta(a \mid s)$ denotes the probability of choosing action $a$ in state $s$.

Policy gradients (Kober and Peters, 2011; Sutton and Barto, 1998) are a class of reinforcement learning algorithms that have shown success in solving complex robotic problems (Kober and Peters, 2011). Such methods represent the policy by an unknown vector of parameters $\theta$. The goal is to determine the optimal parameter vector $\theta^*$ that maximizes the expected discounted cumulative reward:
(11)  $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big]$

where $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$ denotes a trajectory over a possibly finite horizon $T$. The probability of acquiring a trajectory under the policy parameterization $\theta$ and the discounted cumulative reward are given by:

(12)  $p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$

(13)  $R(\tau) = \sum_{t=0}^{T} \gamma^{t}\, r_t$

with an initial state distribution $p(s_0)$. Policy gradient methods, such as episodic REINFORCE (Williams, 1992) and Natural Actor-Critic (Bhatnagar et al., 2009; Peters and Schaal, 2008), typically employ a lower bound on the expected return for fitting the unknown policy parameters $\theta$. To achieve this, such algorithms generate trajectories using the current policy $\pi_\theta$, and then compare performance with a new parameterization. As detailed in (Kober and Peters, 2011), the policy gradient of $J(\theta)$ can be estimated using the likelihood-ratio trick as

(14)  $\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]$
which is usually approximated with an empirical estimate over sample trajectories under the policy $\pi_\theta$. The gradient can be applied at every step and further improved by introducing a learned baseline $b_{\theta_v}(s_t)$ to reduce the variance of the estimate, as in (Mnih et al., 2016):

(15)  $\nabla_\theta J(\theta) \approx \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(R_t - b_{\theta_v}(s_t)\big)$

where $R_t = \sum_{t' \ge t} \gamma^{t'-t}\, r_{t'}$ is the discounted cumulative reward from step $t$ and $b_{\theta_v}$ is the function approximation of the state value parameterized by $\theta_v$, whose gradient is

(16)  $\nabla_{\theta_v} \tfrac{1}{2}\big(R_t - b_{\theta_v}(s_t)\big)^2 = -\big(R_t - b_{\theta_v}(s_t)\big)\, \nabla_{\theta_v} b_{\theta_v}(s_t)$
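A small numeric sketch of the likelihood-ratio estimate with a baseline: for a one-step, two-action softmax policy with a fixed baseline standing in for the critic, the Monte-Carlo gradient estimate pushes probability toward the higher-reward action. Everything here (rewards, baseline, sample count) is illustrative.

```python
# Numeric illustration of the baselined policy gradient (Eqs. 14-15) for a
# one-step bandit with a two-action softmax policy. The constant baseline is
# a stand-in for the learned critic value; all numbers are illustrative.

import math, random

def softmax_probs(theta):
    e = [math.exp(t) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def grad_log_pi(theta, a):
    """d/d theta_k of log pi(a) for a softmax policy: 1[k == a] - pi_k."""
    p = softmax_probs(theta)
    return [(1.0 if k == a else 0.0) - p[k] for k in range(len(theta))]

random.seed(0)
theta = [0.0, 0.0]
rewards = [1.0, 0.0]        # action 0 is better
baseline = 0.5              # stand-in critic value b(s)

# Monte-Carlo estimate of sum_t grad log pi(a_t|s_t) * (R_t - b(s_t)).
n = 20000
grad = [0.0, 0.0]
for _ in range(n):
    a = 0 if random.random() < softmax_probs(theta)[0] else 1
    advantage = rewards[a] - baseline
    g = grad_log_pi(theta, a)
    grad = [gi + advantage * gk / n for gi, gk in zip(grad, g)]

print(grad[0] > 0 and grad[1] < 0)  # True: probability mass moves to action 0
```

Subtracting the baseline does not change the expectation of the estimate (the score function has zero mean under the policy) but can sharply reduce its variance, which is the role the critic plays in Equations 15 and 16.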
5.2. Converting CFS to MDP Setting
It is possible to learn a factor subset, in which a subset of factors is chosen for the ranking process, with a reinforcement learning policy, instead of approximating the indicator vector directly. However, doing so in one shot results in a combinatorial action space, which leads to computational intractability and, with high probability, search failure.
To reduce the action space, we introduce a fixed factor sequence so that a policy can sequentially determine the utilization of each factor. Formally, for each user-query request, the indicator vector $\mathbb{I}$ can be determined in $m$ steps, where at the $t$-th step ($1 \le t \le m$) we decide whether the $t$-th factor should be applied to the ranking function for this request, i.e., $a_t \in \{0, 1\}$ is the action taken at step $t$ and $\{0, 1\}$ is the action space. The action $a_t$ is obtained through a policy

(17)  $a_t \sim \pi_\theta(\cdot \mid s_t)$

where $s_t$ is the state representation at the $t$-th step. Then we obtain

(18)  $\mathbb{I} = (a_1, a_2, \dots, a_m)$
After $m$ steps, $\mathbb{I}$ is determined and so is the ranking permutation $\pi_{\mathbb{I}}$. We can then directly calculate the loss to evaluate the result of the selected actions, which can further be used to define the total reward of the actions generated in the $m$ steps of the episode (see Figure 3 for an illustration). The key components are the state design (based on which the action is generated), the reward design (how each action is evaluated) and the optimization method for this reinforcement learning problem (how to find the optimal policy).
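The sequential decision process above can be sketched as a simple episode loop: the indicator starts as all ones and the policy flips one entry per step. The policy here is a deterministic stand-in (keep a factor when its hypothetical cost is low), not the learned actor.

```python
# Sketch of Section 5.2: the indicator vector is built in m sequential binary
# decisions, one per factor. The policy below is a hand-written stand-in for
# the learned actor; costs and the context are illustrative.

def run_episode(policy, context, m):
    """Roll out m steps; the state carries the context, the step index and
    the decisions made so far (initialized to all ones, cf. Section 5.3)."""
    indicator = [1] * m
    trajectory = []
    for t in range(m):
        state = (context, t, tuple(indicator))
        action = policy(state)           # 0 = drop factor t, 1 = keep it
        indicator[t] = action
        trajectory.append((state, action))
    return indicator, trajectory

costs = [0.5, 3.0, 0.2, 2.5]             # hypothetical per-factor costs
policy = lambda state: 1 if costs[state[1]] < 1.0 else 0
indicator, _ = run_episode(policy, context=("user", "query"), m=4)
print(indicator)  # [1, 0, 1, 0]
```

The trajectory of (state, action) tuples is exactly what the actor-critic update later consumes, one episode per page view.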
5.3. The State and Reward Design
The optimal policy should generalize over the state space, and the optimal actions for an episode depend only on the request, so ideally the state can be designed as

(19)  $s_t = (\psi(u, q),\, t)$

where $\psi(u, q)$ is the representation of the user-query pair $(u, q)$. The corresponding reward is then defined as

(20)  $r_t = 0$ for $t < m$, and $r_m = -L$

where $L$ denotes the objective value of Equation 10. The agent obtains a reward of 0 while the episode has not terminated, i.e., $t < m$, and a reward of $-L$ when the episode ends, as in goal-directed tasks. With the above definition and setting $\gamma$ to 1, we can conclude that the objective of this reinforcement learning problem is exactly the negative of the objective in Equation 10:

(21)  $\sum_{t=1}^{m} \gamma^{t-1} r_t = -L$
This means that maximizing $J(\theta)$ directly minimizes $L$, allowing us to find the optimal solution with the power of deep reinforcement learning.

However, two practical issues make learning the optimal policy for the above reinforcement learning problem difficult. One is that the reward is sparse over states, known as the sparse feedback problem (Kulkarni et al., 2016). The other is that the reward itself ($-L$) is widely distributed in a continuous space, making the critic model difficult to converge. Inspired by the reward shaping technique (Ng et al., 1999), we slightly change the representations of states and rewards to alleviate these issues.
We first initialize $\mathbb{I}^{(0)}$ as an all-one vector, and at step $t$ update it as

(22)  $\mathbb{I}^{(t)}_j = a_j$ for $j \le t$, and $\mathbb{I}^{(t)}_j = 1$ for $j > t$

Then we extend our state vector to

(23)  $s_t = (\psi(u, q),\, \mathbb{I}^{(t-1)})$

Thus our state memorizes the decisions made earlier in the episode. At each step $t$, the reward is calculated based on $\mathbb{I}^{(t)}$, i.e., the decisions made so far are pre-evaluated at each step, assuming the remaining decisions are all ones by default. We decompose each reward $r_t$ into an effectiveness part $r^{e}_t$ and an efficiency part $r^{c}_t$, i.e., $r_t = r^{e}_t + r^{c}_t$. For the efficiency part, we simply add a penalty when keeping the $t$-th factor:

(24)  $r^{c}_t = -\lambda\, c_t\, a_t$

This part is consistent with Equation 10. For the effectiveness part, we give a constant penalty $-\rho$ if the ranking loss under $\mathbb{I}^{(t)}$ exceeds a predefined threshold $\delta$:

(25)  $r^{e}_t = -\rho \cdot \mathbb{1}\big[D(\pi_{\mathbb{I}^{(t)}}, \pi^*; X) > \delta\big]$
rather than the loss itself as in Equation 10. This design helps the critic distinguish good from bad ranking results much more easily. Moreover, we can avoid generating poor ranking results by using a large penalty.
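The shaped per-step reward can be sketched as the sum of the two parts just described: a cost term paid whenever a factor is kept, and a flat penalty once the ranking loss under the current partial indicator crosses the threshold. The constants here are illustrative, not tuned production values.

```python
# Sketch of the shaped per-step reward in Section 5.3: efficiency part
# (pay for the factor you keep) plus effectiveness part (flat penalty when
# the ranking loss exceeds a threshold). Constants are illustrative.

def step_reward(action, factor_cost, ranking_loss, lam=0.1, rho=5.0, delta=0.05):
    efficiency = -lam * factor_cost * action               # cost of keeping
    effectiveness = -rho if ranking_loss > delta else 0.0  # threshold penalty
    return efficiency + effectiveness

# Keeping a cheap factor while the ranking stays accurate: small cost only.
print(step_reward(action=1, factor_cost=0.5, ranking_loss=0.01))  # -0.05
# A decision that pushes the loss past the threshold: the flat penalty fires.
print(step_reward(action=0, factor_cost=0.5, ranking_loss=0.2))   # -5.0
```

The flat penalty gives the critic a bimodal, easy-to-separate signal instead of a continuously distributed loss, which is the point of the shaping.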
5.4. Learning the Policy
After transforming the original problem into a reinforcement learning one, we can apply any reinforcement learning method. In this paper, we choose the well-known policy gradient method with actor-critic models as described in (Mnih et al., 2016), and we call the resulting algorithm RankCFS. It is worth noting that the difficulty of the original optimization problem does not decrease with the introduction of reinforcement learning techniques. The RL-based approach acts as a solver whose solution space contains the optimum, and provides an efficient search path toward it through trial and error.
Algorithm 1 shows the training details. The page-view data from the online search system, the reward discount factor $\gamma$, the parameters used in the reward definition and the maximal number of training steps are given as input to the algorithm. The parameters of the actor network are its output. We first initialize the parameters of the actor and critic networks, as well as the step counter, in Lines 1 and 2. The training phase iterates over page views, for each of which an episode is generated in Lines 6-10. Standard policy gradient updates are then conducted in Lines 14-15, where the tuples are processed in backward order so that the discounted cumulative reward can be updated incrementally as in Line 13. The training process ends when the number of steps exceeds the given threshold.
6. Experimental Results
In this section, we provide empirical results of our approach in offline evaluation and commercial online evaluation. We show the results of the offline setting in order to justify our algorithm. Then, we test our method in a real online commercial web search engine to reveal the performance improvement with respect to resource consumption. Finally, we demonstrate the performance of our method during the Singles' Day shopping festival.
6.1. Offline Comparison
In this subsection, we compare our method with the Norm Elimination method, $\ell_1$-based feature selection, Tree-based feature selection and F-test feature selection in an offline evaluation setting.

Norm Elimination removes those factors whose absolute weight values are less than a positive constant.

$\ell_1$-based feature selection is a model-based feature selection method that selects factors according to an $\ell_1$ regularizer. The basic idea is to eliminate those factors whose corresponding coefficients are zero. Since this method must be based on a supervised machine learning model, we need to convert our ranking problem into a supervised one. We define the training data set as follows: each factor vector is labeled with the score the ranking function assigns to it, so the set of training examples consists of (factor vector, score) pairs. We can therefore train a regressor on this set and select factors based on the trained model. We adopt Lasso as our comparison method.
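The conversion described above can be sketched in a few lines: each factor vector is labeled with the black-box ranking score it receives, yielding a standard regression dataset on which an $\ell_1$-regularized model can be fit. The ranking function and data below are illustrative, not the production model.

```python
# Sketch of the ranking-to-regression conversion used for L1-based selection:
# label each factor vector with the score the black-box ranking function
# assigns to it. The weights and data are illustrative stand-ins.

def build_regression_set(f, factor_vectors):
    """Return (X, y) where y_i = f(phi_i), the black-box ranking score."""
    X = [list(phi) for phi in factor_vectors]
    y = [f(phi) for phi in factor_vectors]
    return X, y

weights = [0.5, 0.0, 2.0]     # the zero-weight factor is redundant here
f = lambda phi: sum(w * p for w, p in zip(weights, phi))

X, y = build_regression_set(f, [[1, 5, 0], [0, 2, 1], [3, 1, 2]])
print(y)  # [0.5, 2.0, 5.5]
# A Lasso fit on (X, y) would drive the second coefficient to zero,
# flagging that factor for elimination.
```

In practice the fit itself would use an off-the-shelf Lasso implementation; the sketch only shows how the supervised targets are produced.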
Table 1. Offline comparison of averaged pairwise loss, averaged factor usage and weighted factor usage for the F-test (three settings), Norm Elimination, Lasso and Tree-based methods and RankCFS (three settings).
Tree-based feature selection is similar to $\ell_1$-based feature selection; the difference is that we replace the Lasso model with a nonlinear regression tree model.

F-test feature selection is a model-free feature selection method that selects the top factors based on F-test scores.

Rank Contextual Factor Selection (RankCFS) is our actor-critic method, which is capable of adjusting the usage of factors by context.
For $\ell_1$-based, Tree-based and F-test feature selection, we adopt the implementations in scikit-learn (Pedregosa et al., 2011). We implement RankCFS with TensorFlow (Abadi et al., 2016). For the optimal ranking model in the offline evaluation, we select one of the linear ranking models of Taobao.com and treat it as a black box, so that only the input and output of the ranking model are considered during the experiments. We set the threshold constant in the Norm Elimination method and the constant that multiplies the $\ell_1$ term in the Lasso model accordingly. We choose the default ExtraTreeRegressor in the scikit-learn package as our tree model. The actor and critic are constructed as two deep neural networks (DNNs) with three fully connected layers each. We adopt ReLU as the activation function for the hidden layers and Adam as our optimizer, with separate learning rates for the actor and the critic, respectively.

We sample a data set as described above, then train the $\ell_1$-based, Tree-based and F-test approaches on part of the examples and test them on the rest. (It is not necessary to train the Norm Elimination method, since it only removes those factors whose absolute weight values are less than a positive constant.) For the $\ell_1$-based, Tree-based and F-test methods, the feature selection is determined after training; that is, we use a fixed feature-selection policy during the testing stage. Since our method must consider the computational costs of factors, the computational cost vector is obtained from the online operational environment of Taobao.com.
We test our methods on held-out page views, each of which contains a fixed number of items. We evaluate the averaged pairwise loss defined in Equation 9 and the factor usage over these page views; Figures 4-6 show the experimental results. The Norm Elimination method removes those factors whose absolute weight values are small, so indicators such as loss and factor usage may vary over page views. Figure 4 demonstrates that RankCFS with the appropriate threshold outperforms all other methods in terms of pairwise loss, and Figure 5 shows that the averaged factor usage of the RankCFS algorithm is close to that of the Tree-based method, the lowest. Figure 6 demonstrates the weighted factor usage, in which the weights are the corresponding computational costs. This metric describes factor usage more accurately in terms of efficiency, due to the variation in computational costs among factors. For example, the Norm Elimination method considers only the absolute values of weights, while RankCFS tends to eliminate those factors with high computational costs. Although RankCFS and the Tree-based method have similar averaged factor usage, RankCFS has a much lower weighted factor usage. This is evidence that our approach relieves the computational burden in an intelligent way and saves more computational resources. Empirically, our method is capable of exploring better solutions in a combinatorial solution space with less factor usage. The F-test method suffers from a high pairwise loss since it is model-free and does not take into account the ranking model we adopt. Overall, the experiments show that our RankCFS algorithm successfully explores the function space and finds an excellent approximation of the optimal ranking function. We summarize the complete experimental results under varying parameters in Table 1.
Note that one RankCFS setting outperforms another; it is possible that the latter falls into a worse local optimum and fails to escape from it.
6.2. Online Evaluation in Operational Environment
In this subsection and the following one, we present the experimental results of online evaluation in the real-world large-scale operational environment of Taobao.com with a standard A/B testing setting. For the online evaluation, we adopt the same learning structure as in the offline one, but with a more complex nonlinear optimal ranking model. The training is conducted with a large number of training samples on a distributed streaming system in an online learning fashion. The system configuration of the computers in the clusters on which we conducted our experiments is listed in Table 2.
Table 2. System configuration of the experimental cluster.

Hardware        | Configuration
CPU             | 2x 16-core Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.5GHz
RAM             | 256 GB
Hyper-threading | Yes
Networking      | 10 Gbps
OS              | AliOS7 Linux 3.10.0 x86_64
The search engine of Taobao.com is a complex system, processing billions of items and hundreds of millions of user queries every day. As a core system of Taobao.com, the search engine needs to respond to user queries in a timely manner. The search traffic may increase significantly during promotional campaigns such as the Singles' Day shopping festival; therefore, system efficiency is always an important issue. Furthermore, the system is still required to provide high-quality search service to users, adding to the computational burden on the whole system.
We conduct a standard A/B test in our operational environment, where roughly 6% of users are randomly selected for testing. The reward parameters are tuned with respect to GMV and search latency. The goal is to minimize the impact on GMV while reducing latency as much as possible, compared with the control group. Figure 6(a) shows the best result. Our method noticeably reduces the average search latency compared with the control group, and also reduces the maximum search latency. The system performance (GMV) is almost the same as, or slightly lower than, that of the control group.
6.3. Singles’ Day Evaluation
The Alibaba Singles' Day shopping festival is one of the biggest shopping extravaganzas in the world and is the Chinese counterpart of Black Friday. In 2017, by the end of the day (November 11), sales hit a new record of $25.3 billion, substantially higher than sales on Singles' Day 2016, attracting hundreds of millions of users from more than 200 countries. The infrastructure managed to handle 0.325 million orders per second at peak (https://techcrunch.com/2017/11/11/alibabasmashesitssinglesdayrecord/). The e-commerce search system played a crucial role in this event.
On November 11th, the traffic burden on the e-commerce search engine abruptly increases to several times that of a regular day. On the one hand, the search engine faces a high-traffic challenge, which might lead to system degradation. On the other hand, it is still crucial to provide high search accuracy even during the shopping festival.
In this event, our method is combined with previous work (Liu et al., 2017) to maximize optimization at the search-engine system level. Their algorithm, called CLOES, mainly focuses on reducing the number of items in the ranking process via a cascading model, while our method concentrates on selecting the set of ranking factors used during ranking; thus, both approaches can be applied to the search engine simultaneously. We use the CLOES approach as the control group and CLOES+RankCFS as the experimental group (due to limited online resources, we are only able to use CLOES as the control group). Figure 6(b) depicts the average latency during the experiments: our method further reduces both average latency and peak latency on top of CLOES, while the system performance (GMV) remains almost the same as that of CLOES.
Our method and CLOES worked together on the very day of Singles' Day 2017, and succeeded in providing much better search performance than in the previous year.
7. Conclusion and Future Work
In this paper, we thoroughly investigate the effectiveness and efficiency issues in a real-world large-scale e-commerce search system and propose an intelligent optimization solution based on reinforcement learning. We formally define the learning-to-rank problem in an e-commerce scenario and characterize effectiveness and efficiency, showing that the resulting optimization problem is NP-hard. We then convert it into a reinforcement learning problem through reward design and solve it with an actor-critic method. We empirically test our method in offline and online evaluation scenarios, demonstrating that it is a practical solution for a real-world large-scale e-commerce search system. In the future, we plan to explore other system-level optimization objectives, such as memory usage and load balancing, in combination with search latency. Moreover, the DNN representation in our current setting is not an end-to-end solution, so end-to-end representation approaches (e.g., pointer networks (Vinyals et al., 2015)) will be considered.
References
 Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, Vol. 16. 265–283.
 Bello et al. (2016) Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940 (2016).
 Benbouzid et al. (2012) Djalel Benbouzid, Róbert Busa-Fekete, and Balázs Kégl. 2012. Fast classification using sparse decision DAGs. In Proceedings of the 29th International Conference on International Conference on Machine Learning. Omnipress, 747–754.
 Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. 2009. Natural actor–critic algorithms. Automatica 45, 11 (2009), 2471–2482.
 Bourdev and Brandt (2005) Lubomir Bourdev and Jonathan Brandt. 2005. Robust object detection via soft cascade. In Computer Vision and Pattern Recognition, 2005. IEEE Computer Society Conference on, Vol. 2. IEEE, 236–243.
 Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to Rank Using Gradient Descent. In Proceedings of the 22nd International Conference on Machine Learning. ACM, New York, NY, USA, 89–96.
 Burges (2010) Chris J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report. https://www.microsoft.com/enus/research/publication/fromranknettolambdaranktolambdamartanoverview/
 Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to Rank: From Pairwise Approach to Listwise Approach. Technical Report. https://www.microsoft.com/enus/research/publication/learningtorankfrompairwiseapproachtolistwiseapproach/
 Cooper et al. (1992) William S. Cooper, Fredric C. Gey, and Daniel P. Dabney. 1992. Probabilistic Retrieval Based on Staged Logistic Regression. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 198–210.
 Davis et al. (1997) Geoff Davis, Stephane Mallat, and Marco Avellaneda. 1997. Adaptive greedy approximations. Constructive Approximation 13, 1 (1997), 57–98.
 Freund et al. (2003) Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. 2003. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4, 6 (2003), 170–178.
 Geng et al. (2007) Xiubo Geng, Tie-Yan Liu, Tao Qin, and Hang Li. 2007. Feature Selection for Ranking. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 407–414.
 Guyon and Elisseeff (2003) Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of machine learning research 3, Mar (2003), 1157–1182.
 Kira and Rendell (1992) Kenji Kira and Larry A Rendell. 1992. The feature selection problem: traditional methods and a new algorithm. In Proceedings of the 10th National Conference on Artificial Intelligence. AAAI Press, 129–134.
 Kober and Peters (2011) Jens Kober and Jan Peters. 2011. Policy search for motor primitives in robotics. Machine Learning 84, 1 (2011), 171–203.
 Kulkarni et al. (2016) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. 2016. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 3675–3683.
 Kullback and Leibler (1951) Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The annals of mathematical statistics 22, 1 (1951), 79–86.
 Li et al. (2008) Ping Li, Chris J.C. Burges, and Qiang Wu. 2008. Learning to Rank Using Classification and Gradient Boosting. In Advances in Neural Information Processing Systems 20.
 Liu et al. (2008) LiPing Liu, Yang Yu, Yuan Jiang, and ZhiHua Zhou. 2008. TEFE: A TimeEfficient Approach to Feature Extraction. In Proceedings of the 8th IEEE International Conference on Data Mining. IEEE Computer Society, 423–432.
 Liu et al. (2017) Shichen Liu, Fei Xiao, Wenwu Ou, and Luo Si. 2017. Cascade Ranking for Operational Ecommerce Search. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, 1557–1565.
 Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P Lillicrap, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, Vol. 48. JMLR.org, 1928–1937.
 Natarajan (1995) Balas Kausik Natarajan. 1995. Sparse approximate solutions to linear systems. SIAM journal on computing 24, 2 (1995), 227–234.
 Ng et al. (1999) Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. 1999. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 278–287.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikitlearn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
 Peters and Schaal (2008) Jan Peters and Stefan Schaal. 2008. Natural actor-critic. Neurocomputing 71, 7 (2008), 1180–1190.
 Schneiderman (2004) Henry Schneiderman. 2004. Featurecentric evaluation for efficient cascaded object detection. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, Vol. 2. IEEE, II–II.
 Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning (1st ed.). MIT Press, Cambridge, MA, USA.
 Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 2692–2700.
 Viola and Jones (2003) Paul Viola and Michael Jones. 2003. Rapid Object Detection using a Boosted Cascade of Simple Features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. I–511–I–518 vol.1.
 Wang et al. (2010a) Lidan Wang, Jimmy Lin, and Donald Metzler. 2010a. Learning to Efficiently Rank. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 138–145.
 Wang et al. (2010b) Lidan Wang, Donald Metzler, and Jimmy Lin. 2010b. Ranking Under Temporal Constraints. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, New York, NY, USA, 79–88.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3-4 (1992), 229–256.
 Xu and Li (2007) Jun Xu and Hang Li. 2007. AdaRank: A Boosting Algorithm for Information Retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 391–398.
 Zheng et al. (2007) Zhaohui Zheng, Keke Chen, Gordon Sun, and Hongyuan Zha. 2007. A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 287–294.
 Zhou (2012) ZhiHua Zhou. 2012. Ensemble methods: foundations and algorithms. CRC press.