Accelerating E-Commerce Search Engine Ranking by Contextual Factor Selection

by Yusen Zhan, et al.
Nanjing University

In industrial large-scale search systems, such as commodity search, the quality of the ranking result is continually improved by introducing factors computed by complex procedures, e.g., deep neural networks for extracting image factors. Meanwhile, the increasing number of factors demands more computation resources and raises the system response latency. It has been observed that a search instance usually requires only a small set of effective factors rather than all of them; removing ineffective factors therefore significantly improves system efficiency. This paper studies Contextual Factor Selection (CFS), which selects only a subset of effective factors for every search instance, striking a balance between search quality and response latency. We inject CFS into the search engine ranking score to accelerate the engine, considering both ranking effectiveness and efficiency. Learning the CFS model involves a combinatorial optimization, which we transform into a sequential decision-making problem. Solving this problem with reinforcement learning, we propose RankCFS, which has been assessed in an off-line environment as well as a real-world on-line environment. The empirical results show that the proposed CFS approach outperforms several existing supervised/unsupervised feature-selection methods in the off-line environment, and also achieves significant real-world performance improvement, in terms of service latency, in daily operation as well as during the Singles' Day Shopping Festival in 2017.






1. Introduction

Information retrieval and machine learning applications play an important role in industrial and commercial scenarios, ranging from web search engines to e-commerce websites. The major applications, i.e., search and recommendation, usually require ranking a large set of data items in response to users' requests in an on-line setting. Supporting these applications raises two general issues: a) effectiveness, i.e., how accurate and reliable the search results in the final ranking list are; and b) efficiency, i.e., how fast the search engine responds to user queries and whether the computational burden of ranking is as low as possible from a system's perspective. Addressing both issues in large-scale applications, so as to provide an excellent user experience with an efficient solution, is challenging.

Generally speaking, to cope with the high computational cost of deeper models as well as large-scale traffic, a search engine system has to degrade its service level in the aspect of effectiveness, e.g., reducing the number of recalled items or taking unnecessary services off-line, in order to avoid access delays or even unavailability, which severely affect the user experience. Though workable, such methods only trade search quality for service availability in a hard-coded way, entailing unnecessary sacrifice of revenue in real-world business practice. Consequently, a question arises: can we design a "soft", or "intelligent", solution that achieves both effectiveness and efficiency?

The answer seems promising. Liu et al. proposed a cascade ranking model to address the trade-off between effectiveness and efficiency in large-scale e-commerce search applications (Liu et al., 2017). Their method mainly focuses on reducing the number of items entering the ranking process; nevertheless, it suggests another way to optimize an e-commerce search engine. In a search engine, a set of factors is applied during ranking, and we conjecture that not all of these factors are necessary in real-world applications. After a thorough investigation of our real-world operational environment, we find relatively high correlations between the ranking factors in our system (see Figure 1 for details), so there exists redundancy among factors in our on-line operational environment. We also observe that the conversion rates of items vary under different contexts (here, a user-query pair denotes the context). For instance, users with higher purchasing power tend to have a higher conversion rate under some long-tail (low-frequency) queries. Based on this analysis, we consider that a few computationally efficient factors may be sufficient for achieving effectiveness under such contexts. These observations show that it is possible to maintain effectiveness by carefully selecting a subset of all factors under certain circumstances, which is a standard combinatorial optimization problem, but with an auxiliary context description.

Combinatorial optimization is a fundamental problem in computer science. Recently, Bello et al. showed that reinforcement learning is capable of solving combinatorial optimization problems such as the TSP via pointer networks (Vinyals et al., 2015; Bello et al., 2016). In this paper, we address the above challenges by designing a novel model based on reinforcement learning. We formally define our optimization problem with a general framework and a loss function that reflects both ranking effectiveness and efficiency. Then, we transform the contextual combinatorial optimization problem into a sequential decision-making one by incorporating the contextual setting and the factor selection into the state and action of an MDP, respectively. The reward is designed to encourage saving the computational cost of factors while ensuring that the ranking results remain effective. The final solution can then be obtained via state-of-the-art reinforcement learning algorithms, in this paper Asynchronous Advantage Actor-Critic (A3C). Exploiting the correlation among factors as well as the context dependency in our system, our method handles contextual factor selection in terms of user and query, while minimizing the influence on business indicators such as gross merchandise value (GMV) and click-through rate (CTR).

We show that our algorithm outperforms comparative algorithms in both off-line and on-line evaluation. We also demonstrate the capabilities of the new method in a real-world large-scale system during the Singles' Day Shopping Festival.

The contributions of this paper can be summarized as: i) injecting contextual factor selection into the search engine ranking score for engine acceleration; ii) formulating contextual factor selection for ranking as a contextual combinatorial optimization problem; iii) deriving a reinforcement-learning-based solution to the proposed optimization problem; iv) demonstrating the effectiveness of our technique in both off-line and on-line environments.

The rest of the paper is organized as follows: Section 2 introduces the background of ranking in e-commerce search; Section 3 reviews related work; Section 4 defines our problem from an optimization viewpoint; Section 5 proposes an actor-critic method to solve it; Section 6 presents the experimental results and summarizes our findings.

Figure 1. The element-wise Pearson product-moment correlation coefficients between selected factors. The darker the block, the less correlated the corresponding factors, and vice versa.

2. Ranking in E-Commerce Search

Suppose that $\mathcal{I}$ is the set of all available items in the database, $\mathcal{Q}$ is the set of all possible queries and $\mathcal{U}$ denotes the set of all users' information. Let $\{(u_i, q_i)\}_{i=1}^{n} \subseteq \mathcal{U} \times \mathcal{Q}$ be a set of user-query pairs, where $(u_i, q_i)$ denotes the $i$-th user-query pair from the search requests. $D_i = \{d_{i,1}, \dots, d_{i,m_i}\} \subseteq \mathcal{I}$ is the set of items associated with the $i$-th user-query request, where $m_i$ is the number of related items. The ranking problem in the e-commerce scenario can then be formally defined as the task of generating a permutation $\pi_i \in \Pi_{m_i}$, where $\pi_i$ is a one-to-one correspondence from $\{1, \dots, m_i\}$ to itself and $\Pi_{m_i}$ denotes the set of all possible permutations on $\{1, \dots, m_i\}$. The goal is to maximize the probability of purchase under the permutation. The permutation is usually generated by a ranking function $f$, which scores each item $d_{i,j}$ for the request $(u_i, q_i)$. Let $\mathbf{x}_{i,j} = (x_{i,j}^1, \dots, x_{i,j}^M)$ be the corresponding factor vector of item $d_{i,j}$ under query $q_i$, where $j = 1, \dots, m_i$ and $M$ is the number of factors. Some factors in the factor vector depend on the user-query pair $(u_i, q_i)$ and the item $d_{i,j}$. Without loss of generality, the ranking model is defined by

$$f(d_{i,j}; u_i, q_i) = h(\mathbf{x}_{i,j}), \qquad (1)$$

where $h$ is the ranking function. It could be any function, such as a linear model, a deep neural network or a tree model.

The ranking function is usually trained from a dataset $S = \{(\mathbf{x}_k, y_k)\}_{k=1}^{N}$ logged from the real system, where $N$ is the number of training examples and $y_k$ denotes the label associated with the $k$-th item, i.e., the feedback of the user on that item. The training can be conducted in a point-wise way (Cooper et al., 1992; Li et al., 2008), a pair-wise way (Freund et al., 2003; Burges et al., 2005; Zheng et al., 2007), or a list-wise way (Xu and Li, 2007; Cao et al., 2007; Burges, 2010). It is worth noting that in this paper we assume a trained ranking function is given, and we consider the general case where the ranking function is provided as a black box, i.e., with no access to its gradient or Hessian matrix.
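For concreteness, a black-box linear ranking function can be sketched as follows; the weights and factor vectors here are illustrative placeholders, not production values:

```python
# Minimal sketch of a black-box ranking model: score each item with a
# linear function of its factor vector and sort by descending score.

def rank_items(weights, factor_vectors):
    """Return the permutation: item indices sorted by descending score."""
    scores = [sum(w * x for w, x in zip(weights, xs)) for xs in factor_vectors]
    return sorted(range(len(scores)), key=lambda j: -scores[j])

weights = [0.5, 1.0, 0.2]          # one weight per ranking factor (illustrative)
items = [[1.0, 0.2, 3.0],          # factor vector of item 0
         [0.4, 2.0, 0.1],          # item 1
         [2.0, 0.5, 0.5]]          # item 2
print(rank_items(weights, items))  # permutation over item indices
```

CFS only observes the input and output of such a function, never its internals, which is exactly the black-box assumption made above.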

3. Related Work

There is a lot of work attempting to resolve the effectiveness-and-efficiency challenge, and we review some of it here.

Cascade learning was originally proposed to address the effectiveness and efficiency issue in traditional classification and detection problems such as fast visual object detection (Bourdev and Brandt, 2005; Schneiderman, 2004; Viola and Jones, 2003). Liu et al. develop a cascade ranking model for a large-scale e-commerce search system and deploy it in production (Liu et al., 2017). However, they only optimize the number of items to be ranked, while we mainly focus on factor usage during the ranking process.

Feature selection tries to remove irrelevant and/or redundant features to improve learning performance (Guyon and Elisseeff, 2003). Traditional feature selection techniques roughly fall into two categories, i.e., filter methods and wrapper methods. Filter methods use learner-independent measurements to evaluate and select features, such as information gain and Relief (Kira and Rendell, 1992). Wrapper methods involve the final learner in the feature selection process, e.g., using its accuracy as the criterion for the goodness of features. Liu et al. proposed the TEFE (Time-Efficient Feature Extraction) approach, which balances test accuracy against test-time cost by extracting a proper subset of features for each test object (Liu et al., 2008). In the learning-to-rank literature, feature selection is a common strategy for improving efficiency. In general, a set of crucial factors is selected from the complete set of all possible factors according to some criterion, such as importance to the ranking (Geng et al., 2007; Wang et al., 2010a, b). Geng et al. propose a selection method based on factor importance in a query-free manner, but they consider neither the real computational cost nor query-dependent factors (Geng et al., 2007). Other methods are query-dependent and take the cost (delay) of the query into account (Wang et al., 2010a, b). In contrast, we consider the computational cost (delay) of each individual factor.

Ensemble pruning is a class of approaches that selects a subset of learners (factors) to form the ensemble (Zhou, 2012). Recently, Benbouzid et al. applied a Q-learning algorithm to ensemble pruning, in which a reinforcement learning agent decides whether or not to skip each base learner (Benbouzid et al., 2012). However, their method is context-free and lacks evaluation in a real-world large-scale application.

4. Contextual Factor Selection for Ranking

4.1. The CFS framework

In this subsection, we describe a general framework of Contextual Factor Selection (CFS) for constructing a search engine optimizer that achieves both effectiveness and efficiency for an e-commerce search engine. As mentioned above, a factor vector $\mathbf{x} = (x^1, \dots, x^M)$ is assigned to the corresponding item $d$, in which each dimension of the factor vector is calculated on-line and varies in computational cost. Let $\mathbf{c} = (c^1, \dots, c^M)$ be the cost vector associated with the factor vector, where $c^m$ denotes the computational cost of the $m$-th factor. Let $F = \{1, \dots, M\}$ be the set of all factors and $F' \subseteq F$ be a subset of $F$. The indicator function of the subset $F'$ of the set $F$ is defined as

$$\mathbb{1}_{F'}(m) = \begin{cases} 1, & m \in F', \\ 0, & \text{otherwise}. \end{cases}$$

From a practical point of view, some factors are not necessary for ranking. For example, given a set of factors $F$, a subset $F'$ of highly confident factors might be sufficient under some contexts. Therefore, given an item $d$ and the indicator function $\mathbb{1}_{F'}$, the computational cost function can be written as $c(d; F') = \sum_{m=1}^{M} \mathbb{1}_{F'}(m)\, c^m$, where the indicator function determines whether or not the $m$-th factor participates in the scoring process (we mainly consider the computational cost of the factors while ignoring other costs). Thus, given a set of items $D$, the total computational cost is

$$C(D; F') = \sum_{d \in D} c(d; F') = |D| \sum_{m=1}^{M} \mathbb{1}_{F'}(m)\, c^m.$$
As defined in Equation 1, the ranking model with all factors can be written as $f_F$, and the one with a subset $F'$ is written as

$$f_{F'}(d; u, q) = h(\mathbb{1}_{F'} \odot \mathbf{x}),$$

where $\odot$ denotes element-wise masking of the factor vector. Intuitively, we can treat the permutation generated by $f_F$ as the optimal one, since it includes all the factors we have during the ranking process. Thus, given a request, the objective is

$$\min_{F' \subseteq F} \; D\big(f_F, f_{F'}; D_i\big) + \frac{\lambda}{m_i}\, C(D_i; F'), \qquad (5)$$

where $D(f_F, f_{F'}; D_i)$ denotes the distance between the functions $f_F$ and $f_{F'}$ over the item set $D_i$, which could be any distance between two functions, e.g., the Kullback-Leibler divergence (Kullback and Leibler, 1951); the second term is the computational cost of the factors in the set $F'$; $\lambda$ is the trade-off parameter and $m_i$ is the number of items for query $q_i$. Intuitively, the objective reduces the usage of factors as much as possible, while keeping the function $f_{F'}$ as close as possible to the original ranking function.

However, Equation 5 is intractable even for a single request, since it can be reduced to the optimal subset selection problem and is therefore NP-hard in general (Davis et al., 1997; Natarajan, 1995). Moreover, we need to perform contextual factor selection, i.e., solve a general NP-hard problem for every user-query pair, which is impractical in a large-scale system even with a small number of contexts. To overcome this challenge, we generalize the solution of Equation 5 at the contextual level. That is, instead of directly searching for the optimal subset, we define

$$F'_{u,q} = g_{\theta}(u, q),$$

where $g_{\theta}$ is a model parameterized by $\theta$ and the user-query pair $(u, q)$ characterizes the context. This formulation reduces the solution space from the original multiple optimal-subset-selection problems to a single global parameter, based on the assumption that similar context representations should have similar optimal subset structures. Thus, our goal is to search for the global parameter vector $\theta$ that minimizes the loss defined in Equation 5 over all the requests.

To illustrate our method, we adopt linear ranking functions as a demonstration; other representations, e.g., deep neural networks and tree-based ranking functions, can be handled in a similar way. In the linear setting, the score of item $d$ under user-query pair $(u, q)$ is

$$f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} = \sum_{m=1}^{M} w^m x^m,$$

where $w^m$ is the corresponding weight of factor $x^m$.

From another point of view, the permutation significantly depends on which factors are used to calculate the scores. Formally, given a user-query pair $(u, q)$ and a corresponding weight vector $\mathbf{w}$, the linear ranking function becomes

$$f_{F'}(\mathbf{x}; u, q) = \sum_{m=1}^{M} \mathbb{1}(m; u, q)\, w^m x^m,$$

where $\mathbb{1}(\cdot; u, q)$ is the indicator function, which depends on the user-query pair $(u, q)$. For convenience, $\mathbf{b} \in \{0, 1\}^M$ denotes the binary vector of indicator values with respect to the factor vector $\mathbf{x}$. Therefore, the ranking permutation highly depends on the ranking function $f$ and the indicator function $\mathbb{1}$, assuming the weight vector $\mathbf{w}$ is fixed once the ranking model is given. Thus, the crucial part of ranking optimization is to learn an indicator function that determines the utilization of the factors; see Figure 2 for an illustration. To simplify notation, we write the indicator function as $\mathbb{1}_{\theta}$, where the parameter $\theta$ characterizes the factor subset. Hence, the ranking permutation is induced by the ranking function $f$ and the indicator function $\mathbb{1}_{\theta}$, and we can rewrite the distance function as $D(\pi_{\theta}, \pi^*; D_i)$, where $\pi_{\theta}$ and $\pi^*$ are the permutations induced by the masked ranking function and the full ranking function, respectively.
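The indicator-masked linear scoring can be sketched as follows; the weights, mask, and factor vectors are illustrative:

```python
def masked_scores(weights, mask, factor_vectors):
    """Score items using only the factors whose mask bit is 1
    (sketch of the indicator-masked linear ranking function)."""
    return [sum(b * w * x for b, w, x in zip(mask, weights, xs))
            for xs in factor_vectors]

weights = [0.5, 1.0, 0.2]
items = [[1.0, 0.2, 3.0], [0.4, 2.0, 0.1]]
print(masked_scores(weights, [1, 1, 1], items))  # all factors kept
print(masked_scores(weights, [1, 0, 1], items))  # second factor dropped
```

Dropping a factor can change the induced permutation (here item 1 leads with the full mask but trails with the reduced one), which is exactly what the distance term in the objective penalizes.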

Figure 2. Ranking optimization illustration.

4.2. CFS with Pairwise Ranking Loss

With the optimal ranking permutation defined above, we define the distance over an item set $D_i$ between a permutation $\pi$ and the optimal ranking permutation $\pi^*$ as

$$D(\pi, \pi^*; D_i) = \frac{2}{m_i(m_i - 1)} \sum_{j < k} \mathbb{1}\big[(\pi(j) - \pi(k))(\pi^*(j) - \pi^*(k)) < 0\big], \qquad (9)$$

where the indicator $\mathbb{1}[\cdot]$ equals $1$ if its argument holds and $0$ otherwise, i.e., it counts the item pairs ordered differently by the two permutations. This definition is the analogue of the averaged pairwise loss in the learning-to-rank literature: the distance measures how far the induced permutation is from the optimal one in terms of ranking pairs.
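The averaged pairwise distance can be sketched in a few lines, assuming each permutation is given as a list of item indices in ranked order:

```python
def pairwise_distance(perm, perm_opt):
    """Fraction of item pairs ordered differently by the two permutations
    (Equation 9 sketch). perm and perm_opt map rank position -> item index."""
    pos = {item: r for r, item in enumerate(perm)}          # item -> rank
    pos_opt = {item: r for r, item in enumerate(perm_opt)}
    items = list(pos)
    n, discordant = len(items), 0
    for a in range(n):
        for b in range(a + 1, n):
            i, j = items[a], items[b]
            if (pos[i] < pos[j]) != (pos_opt[i] < pos_opt[j]):
                discordant += 1
    return discordant / (n * (n - 1) / 2)

print(pairwise_distance([0, 1, 2], [2, 1, 0]))  # fully reversed -> 1.0
```

The distance is 0 for identical permutations and 1 for a full reversal, matching the normalization by the number of pairs.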

With the distance and total cost function defined above, our goal is, given a user-query pair $(u_i, q_i)$, the corresponding item set $D_i$ and the ranking function $f$, to learn an indicator function $\mathbb{1}_{\theta}$ that minimizes both the distance function and the total computational cost. Formally, the objective in Equation 5 can be further rewritten as

$$\min_{\theta} \; \sum_{i} \Big[ D(\pi_{\theta}, \pi^*; D_i) + \frac{\lambda}{m_i}\, C(D_i; \theta) \Big]. \qquad (10)$$
5. RankCFS: A Reinforcement Learning Approach

As mentioned in Section 4.1, the optimization problem defined in Equation 10 is NP-hard in the general case, and finding the exact solution is computationally intractable. Inspired by recent work (Bello et al., 2016; Benbouzid et al., 2012), we propose to optimize factor usage within a reinforcement learning framework in order to learn an indicator function, by treating the assignment of each element of the indicator vector as a step in a sequential decision-making problem. We call the resulting method RankCFS.

5.1. Reinforcement Learning and Actor-Critic Methods

In this subsection, we review some basic concepts of reinforcement learning. It can be skipped if the reader is familiar with reinforcement learning.

In reinforcement learning, an agent must sequentially select actions to maximize its total expected pay-off. These problems are typically formalized as Markov decision processes (MDPs), given by a tuple $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $P(s_{t+1} \mid s_t, a_t)$ represents the transition probability governing the dynamics of the system, $r(s_t, a_t)$ is the reward function quantifying the performance of the agent, and $\gamma \in [0, 1)$ is a discount factor specifying the degree to which rewards are discounted over time. At each step $t$, the agent is in state $s_t$ and must choose an action $a_t$, transitioning to a successor state $s_{t+1}$ drawn from $P(\cdot \mid s_t, a_t)$ and yielding a reward $r_t = r(s_t, a_t)$. A policy $\pi(a \mid s)$ is defined as a probability distribution over actions given states, where $\pi(a_t \mid s_t)$ denotes the probability of choosing action $a_t$ in state $s_t$.

Policy gradients (Kober and Peters, 2011; Sutton and Barto, 1998) are a class of reinforcement learning algorithms that have shown success in solving complex robotic problems (Kober and Peters, 2011). Such methods represent the policy by an unknown vector of parameters $\boldsymbol{\theta}$. The goal is to determine the optimal parameter vector $\boldsymbol{\theta}^*$ that maximizes the expected discounted cumulative reward:

$$J(\boldsymbol{\theta}) = \mathbb{E}_{\tau \sim p_{\boldsymbol{\theta}}}\big[R(\tau)\big] = \mathbb{E}_{\tau \sim p_{\boldsymbol{\theta}}}\Big[\sum_{t=0}^{H} \gamma^{t} r_t\Big],$$

where $\tau = (s_0, a_0, s_1, a_1, \dots)$ denotes a trajectory over a possibly finite horizon $H$. The probability of acquiring a trajectory under the policy parameterization $\boldsymbol{\theta}$ is given by

$$p_{\boldsymbol{\theta}}(\tau) = p(s_0) \prod_{t=0}^{H} P(s_{t+1} \mid s_t, a_t)\, \pi_{\boldsymbol{\theta}}(a_t \mid s_t),$$

with an initial state distribution $p(s_0)$. Policy gradient methods, such as episodic REINFORCE (Williams, 1992) and Natural Actor-Critic (Bhatnagar et al., 2009; Peters and Schaal, 2008), typically employ a lower bound on the expected return for fitting the unknown policy parameters $\boldsymbol{\theta}$. To achieve this, such algorithms generate trajectories using the current policy $\pi_{\boldsymbol{\theta}}$ and then compare performance with a new parameterization. As detailed in (Kober and Peters, 2011), the policy gradient of $J(\boldsymbol{\theta})$ can be estimated using the likelihood-ratio trick as

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \mathbb{E}_{\tau \sim p_{\boldsymbol{\theta}}}\Big[\sum_{t=0}^{H} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t)\, R(\tau)\Big],$$

which is usually approximated with an empirical estimate over sample trajectories drawn from the policy $\pi_{\boldsymbol{\theta}}$. The gradient can be applied at every step and further improved by introducing a learned baseline $V_{\boldsymbol{\phi}}(s_t)$ to reduce the variance of this estimate, as in (Mnih et al., 2016):

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \approx \sum_{t} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t)\, \big(R_t - V_{\boldsymbol{\phi}}(s_t)\big),$$

where $R_t = \sum_{t' \geq t} \gamma^{t'-t} r_{t'}$ is the discounted cumulative reward from step $t$ and $V_{\boldsymbol{\phi}}$ is a function approximation of the state value, parameterized by $\boldsymbol{\phi}$, whose gradient is obtained from the squared error

$$\nabla_{\boldsymbol{\phi}} \big(R_t - V_{\boldsymbol{\phi}}(s_t)\big)^2.$$
5.2. Converting CFS to MDP Setting

It would be possible to let a reinforcement learning policy output the whole factor subset at once, instead of approximating the indicator vector element by element. However, this results in a combinatorial action space, which leads to computational intractability and, with high probability, search failure.

To reduce the action space, we introduce a fixed factor sequence so that a policy can sequentially decide the utilization of each factor. Formally, for each user-query request, the indicator vector can be determined in $M$ steps, where at the $m$-th step ($m = 1, \dots, M$) we decide whether the $m$-th factor should be applied to the ranking function for this request, i.e., $a_m \in \{0, 1\}$ is the action taken at step $m$ and $\{0, 1\}$ is the action space. The action is obtained through a policy

$$a_m \sim \pi_{\boldsymbol{\theta}}(\cdot \mid s_m),$$

where $s_m$ is the state representation at the $m$-th step. Then we can set

$$\mathbb{1}(m; u, q) = a_m.$$

After $M$ steps, the indicator vector is determined, and so is the ranking permutation $\pi_{\theta}$. We can then directly calculate the loss to evaluate the selected actions, which can further be used to define the total reward of the actions generated in the $M$ steps of the episode; see Figure 3 for an illustration. The key design questions are the state (based on which the action is generated), the reward (how to evaluate each action) and the optimization method for this reinforcement learning problem (how to find the optimal policy).

Figure 3. An example of the pruning procedure via reinforcement learning. Our goal is to select a subset of factors to optimize the objective defined in Equation 5.
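The sequential keep/drop procedure above can be sketched as follows; the policy here is a placeholder standing in for the learned actor:

```python
def rollout(policy, num_factors, context):
    """Sequentially decide keep/drop for each factor, producing the indicator
    vector. `policy(state) -> 0 or 1` is a stand-in for the learned actor."""
    mask = []
    for m in range(num_factors):
        state = (context, m, tuple(mask))  # context plus decisions made so far
        mask.append(policy(state))
    return mask

# Toy policy for illustration: keep even-indexed factors only.
mask = rollout(lambda s: 1 if s[1] % 2 == 0 else 0,
               num_factors=5, context="user-query")
print(mask)
```

One episode therefore has exactly as many steps as there are factors, and the completed mask induces the ranking permutation whose loss defines the episode's reward.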

5.3. The State and Reward Design

The optimal policy should generalize over the state space, and the optimal actions for an episode depend only on the request; ideally, the state can therefore be designed as

$$s_m = \big(\phi(u, q),\, m\big),$$

where $\phi(u, q)$ is the representation of the user-query pair $(u, q)$. The corresponding reward is then defined as

$$r_m = \begin{cases} 0, & m < M, \\ -\mathcal{L}, & m = M, \end{cases}$$

where $\mathcal{L}$ is the loss of Equation 10 evaluated on the episode's decisions. The agent obtains a reward of $0$ while the episode has not terminated and a reward of $-\mathcal{L}$ when the episode ends, as in goal-directed tasks. With this definition, setting the discount factor $\gamma$ to $1$, the objective of this reinforcement learning problem is exactly the negative of the objective in Equation 10:

$$\mathbb{E}\Big[\sum_{m=1}^{M} r_m\Big] = -\mathcal{L}.$$

This means that maximizing the cumulative reward directly minimizes $\mathcal{L}$, allowing us to find the optimal solution with the power of deep reinforcement learning.

However, there are empirically two issues that make learning the optimal policy for the above reinforcement learning problem difficult. One is that the reward is sparse over states, known as the sparse feedback problem (Kulkarni et al., 2016). The other is that the reward itself (the negative loss) is widely distributed in a continuous range, making the critic model difficult to converge. Inspired by the reward shaping technique (Ng et al., 1999), we slightly change the representations of states and rewards to alleviate these issues.

We first initialize a binary vector $\mathbf{b}$ as an all-one vector, and at step $m$ update it as

$$\mathbf{b}[m] = a_m.$$

Then we extend our state vector to

$$s_m = \big(\phi(u, q),\, m,\, \mathbf{b}\big). \qquad (23)$$
0:  Input: training data set $\mathcal{D}$ of logged page views; the ranking function $f$; the reward discount factor $\gamma$; the reward parameters; the maximal number of training steps $T$
0:  Output: parameters $\boldsymbol{\theta}$ of the actor model
1:  Initialize the actor network params $\boldsymbol{\theta}$ and the critic network params $\boldsymbol{\phi}$
2:  Initialize the step counter $t \leftarrow 0$
3:  repeat
4:     for each page view $(u, q, D)$ in $\mathcal{D}$ do
5:        Compute the context representation of the user-query pair
6:        Initialize the initial state $s_1$ as in Eq. 23
7:        for $m = 1, \dots, M$ do
8:           Take action $a_m$ on the $m$-th factor based on $\pi_{\boldsymbol{\theta}}(\cdot \mid s_m)$; observe $r_m$ and $s_{m+1}$; set $t \leftarrow t + 1$
9:           Cache the tuple $(s_m, a_m, r_m, s_{m+1})$
10:        end for
11:        $R \leftarrow 0$
12:        for $m = M, \dots, 1$ do
13:           $R \leftarrow r_m + \gamma R$
14:           Update the critic params $\boldsymbol{\phi}$ along $\nabla_{\boldsymbol{\phi}} (R - V_{\boldsymbol{\phi}}(s_m))^2$
15:           Update the actor params $\boldsymbol{\theta}$ along $\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_m \mid s_m)\,(R - V_{\boldsymbol{\phi}}(s_m))$
16:        end for
17:     end for
18:  until $t > T$
Algorithm 1 RankCFS

Thus our state memorizes the decisions made so far during an episode. At each step $m$, the reward is calculated based on the current decisions, i.e., the decisions made so far are pre-evaluated, assuming the remaining decisions are all ones by default. We decompose each reward $r_m$ into an effectiveness part $r_m^{\text{eff}}$ and an efficiency part $r_m^{\text{cost}}$, i.e., $r_m = r_m^{\text{eff}} + r_m^{\text{cost}}$. For the efficiency part, we simply add a penalty when keeping the $m$-th factor:

$$r_m^{\text{cost}} = -\lambda\, c^m\, a_m.$$
This part is consistent with Equation 10. For the effectiveness part, we give a constant penalty $-\eta$ if the ranking loss under the current decisions exceeds a pre-defined threshold $\epsilon$:

$$r_m^{\text{eff}} = \begin{cases} -\eta, & D(\pi_m, \pi^*; D_i) > \epsilon, \\ 0, & \text{otherwise}, \end{cases}$$

rather than the loss itself as in Equation 10. This design helps the critic distinguish good from bad ranking results much more easily. Moreover, a large penalty $\eta$ helps avoid generating poor ranking performance.
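The shaped per-step reward described above can be sketched as follows; the trade-off weight, threshold, and penalty values are illustrative, not the production settings:

```python
def step_reward(action, factor_cost, ranking_loss,
                lam=0.1, threshold=0.05, penalty=1.0):
    """Shaped reward sketch: the efficiency part penalizes keeping a costly
    factor; the effectiveness part gives a constant penalty whenever the
    pre-evaluated ranking loss exceeds the threshold."""
    r_cost = -lam * factor_cost if action == 1 else 0.0
    r_eff = -penalty if ranking_loss > threshold else 0.0
    return r_cost + r_eff

print(step_reward(action=1, factor_cost=2.0, ranking_loss=0.01))  # cost penalty only
print(step_reward(action=0, factor_cost=2.0, ranking_loss=0.20))  # loss penalty only
```

Because the effectiveness penalty is a constant rather than the raw loss, the critic only needs to separate "acceptable" from "unacceptable" rankings, which keeps the reward scale bounded.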

5.4. Learning the Policy

After transforming the original problem into a reinforcement learning one, we can apply any reinforcement learning method. In this paper, we choose the well-known policy gradient method with actor-critic models as described in (Mnih et al., 2016), and we call the resulting algorithm RankCFS. It is worth noting that the difficulty of the original optimization problem does not decrease with the introduction of reinforcement learning techniques; the RL-based approach acts as a solver whose solution space contains the optimum and provides an efficient search path towards it through trial and error.

Algorithm 1 shows the training details. The page-view data from the on-line search system, the reward discount factor, the parameters used in the reward definition and the maximal number of training steps are given as input. The parameters of the actor network are the output. We first initialize the parameters of the actor and critic networks, as well as the step counter, in Lines 1 and 2. Training iterates over the page views, for each of which an episode is generated in Lines 6-10. Standard policy gradient updates are then conducted in Lines 14-15, where the cached tuples are traversed backwards so that the discounted cumulative reward can be updated incrementally in Line 13. The training process ends when the number of steps exceeds the given threshold.
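As a minimal stand-in for the A3C machinery, the following sketch trains independent Bernoulli keep/drop probabilities with a REINFORCE-style update and a running-average baseline. The toy reward only penalizes kept factors (no ranking-loss term), and all hyperparameters are illustrative:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(theta, baseline, episode_reward, lr=0.1):
    """One policy-gradient step: theta[m] parameterizes
    P(keep factor m) = sigmoid(theta[m])."""
    actions = [1 if random.random() < sigmoid(t) else 0 for t in theta]
    R = episode_reward(actions)
    # grad of log pi for a Bernoulli(sigmoid(theta)) policy is (a - sigmoid(theta))
    theta = [t + lr * (R - baseline) * (a - sigmoid(t))
             for t, a in zip(theta, actions)]
    baseline = 0.9 * baseline + 0.1 * R  # running average as a critic stand-in
    return theta, baseline

random.seed(0)
theta, baseline = [0.0] * 4, 0.0
for _ in range(2000):
    theta, baseline = train_step(theta, baseline, lambda a: -float(sum(a)))
print([round(sigmoid(t), 2) for t in theta])  # keep-probabilities pushed toward 0
```

Under this reward every factor is pure cost, so the learned keep-probabilities collapse towards zero; the full RankCFS reward adds the effectiveness penalty that keeps informative factors alive.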

6. Experimental Results

In this section, we provide empirical results for our approach in both off-line evaluation and commercial on-line evaluation. We show the results in an off-line setting in order to justify our algorithm. Then, we test our method in a real on-line commercial web search engine to reveal the performance improvement with respect to resource consumption. Finally, we demonstrate the performance of our method during the Singles' Day shopping festival.

6.1. Off-line Comparison

Figure 4. Pairwise loss vs. page view in the off-line comparison. Lower is better.
Figure 5. Factor usage vs. page view in the off-line comparison. Lasso, the tree-based method and F-test (k=8) have the same averaged factor usage. Lower is better.
Figure 6. Weighted averaged factor usage vs. page view in the off-line comparison. Lower is better.

In this subsection, we compare our method with the norm elimination method, $\ell_1$-based feature selection, tree-based feature selection and $F$-test feature selection in an off-line evaluation setting.

Norm elimination removes those factors whose absolute weight values are less than a positive constant $\epsilon$.
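This baseline can be sketched in a few lines; the threshold and weights are illustrative:

```python
def norm_eliminate(weights, eps=0.1):
    """Keep (1) only the factors whose weight magnitude reaches the
    threshold eps; drop (0) the rest."""
    return [1 if abs(w) >= eps else 0 for w in weights]

print(norm_eliminate([0.5, -0.02, 0.3, 0.01], eps=0.1))  # -> [1, 0, 1, 0]
```

Note that this ignores computational cost entirely, which is exactly the weakness the weighted-factor-usage metric later exposes.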

$\ell_1$-based feature selection is a model-based feature selection method that selects factors according to the $\ell_1$ regularizer. The basic idea is to eliminate those factors whose corresponding coefficients are zero. Since this method must be based on a supervised machine learning model, we need to convert our ranking problem into a supervised one. We define the training data set as follows: for each item, we take the factor vector as the input and the corresponding ranking score as the label, so the training set consists of (factor vector, score) pairs. We can then train a regressor on this set and select factors based on the trained model. We adopt the Lasso as our comparison method.

Algorithm | Averaged Pairwise Loss | Averaged Factor Usage | Weighted Factor Usage
Norm Elimination 5
RankCFS 2
Table 1. Results Summary

Tree-based feature selection is similar to the $\ell_1$-based feature selection, with the Lasso model replaced by a non-linear regression tree model.

$F$-test feature selection is a model-free feature selection method that selects the top $k$ factors based on $F$-test scores.

Rank Contextual Factor Selection (RankCFS) is our actor-critic method, which is capable of adjusting factor usage by context.

For the $\ell_1$-based, tree-based and $F$-test feature selection, we adopt the implementations in scikit-learn (Pedregosa et al., 2011). We implement RankCFS with TensorFlow (Abadi et al., 2016). As the optimal ranking model in the off-line evaluation, we select one of our linear ranking models and treat it as a black box, so that only the input and output of the ranking model are accessed during the experiments. The constant $\epsilon$ in the norm elimination method and the coefficient multiplying the $\ell_1$ term in the Lasso model are set to fixed values; we choose the default ExtraTreeRegressor in the scikit-learn package as our tree model. The actor and critic are each constructed as deep neural networks (DNNs) with three fully connected layers. We adopt ReLU as the activation function for the hidden layers and Adam as the optimizer, with separate learning rates for the actor and the critic.

We sample a data set of examples as described above, then train the $\ell_1$-based, tree-based and $F$-test approaches on part of the examples and test them on the remaining page views (it is not necessary to train the norm elimination method, since it only removes factors whose absolute weight values are below the threshold $\epsilon$). For the $\ell_1$-based, tree-based and $F$-test methods, the feature selection is determined after training, i.e., a fixed feature selection policy is used during the testing stage. Since our method needs to consider the computational costs of factors, the computational cost vector is obtained from the on-line operational environment.

We test our methods on page views and each page view contains items so that there are testing examples. Then, we evaluate the averaged pairwise loss defined in Equation 9 and factor usage over page views. Figure 4-6 show our experimental results. Generally, the Norm Elimination method removes those factors whose absolute values are small under different contexts, therefore indicators such as loss, factor usage may vary over page views. Figure 4 demonstrates that our RankCFS with the threshold outperforms all other methods in terms of pairwise loss. And in Figure 6, it shows that the averaged factor usage of RankCFS algorithm is also close to the lowest Tree-based method. Figure 6 demonstrates the weighted factor usage, in which the weights are the corresponding computational costs. This metric is more accurate to describe the factor usage in terms of efficiency due to the variety in computational costs among factors. For example , it only considers absolute values of weights in the Norm Elimination method, while RankCFS tends to eliminate those factors with high computational costs. Although, RankCFS and Tree-based method have similar averaged factor usage, RankCFS has much lower weighted factor usage. It is the evidence that our approach relieves the computational burden in an intelligent way and save more computational resources. Empirically, our method is capable of exploring better solution in a combinatorial solution space with less factor usage. The F-test method suffers high pairwise loss since it is a model-free method, it is not required to consider the ranking model we adopt. Overall, the experiments shows that our RankCFS algorithm successfully explores the function space and find an excellent approximation to the optimal ranking function. We summarized the complete experimental results in Table 1 with varieties in parameters. 
Note that RankCFS outperforms RankCFS ; it is possible that the latter falls into a worse local optimum and fails to escape from it.

(a) Latency for the regular operational environment
(b) Latency for Singles’ Day
Figure 7. Latency in a real-world large-scale e-commerce search engine. Lower is better.

6.2. On-line Evaluation in Operational Environment

In this subsection and the following one, we present the results of on-line evaluation in the real-world large-scale operational environment with a standard A/B testing setting. For the on-line evaluation, we adopt the same learning structure as in the off-line case, but with a more complex nonlinear optimal ranking model. Training is conducted with more than training samples on a distributed streaming system in an on-line learning fashion. The configuration of the machines in the clusters on which we conducted our experiments is listed in Table 2.

Hardware Configuration
CPU: 2x 16-core Intel(R) Xeon(R) E5-2682 v4 @ 2.5GHz
RAM: 256 GB
Hyperthreading: Yes
Networking: 10 Gbps
OS: ALIOS7 Linux 3.10.0 x86_64
Table 2. System information

The search engine is a complex system, processing billions of items and hundreds of millions of user queries every day. As a core system, the search engine needs to respond to user queries in a timely manner. The search traffic may increase significantly during promotional campaigns such as the Singles' Day shopping festival, so system efficiency is always an important issue. Meanwhile, the system is still required to provide high-quality search service to users, which imposes a heavy computational burden on the whole system.

We conduct a standard A/B test in our operational environment, where roughly 6% of users are randomly selected for testing. The parameters and are tuned according to the GMV and the search latency: the goal is to minimize the impact on GMV while reducing the latency as much as possible, compared to the control group. Figure 7(a) shows the best result, obtained with and . Our method saves approximately of the average search latency compared to the control group, and reduces the maximum search latency by roughly . The system performance (GMV) is almost the same as, or slightly lower ( to ) than, that of the control group.
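The reported savings are relative reductions against the control group. The following sketch illustrates the computation with made-up millisecond figures, not the production measurements:

```python
def relative_saving(control: float, treatment: float) -> float:
    """Relative latency reduction of the treatment group vs. the control."""
    return (control - treatment) / control

# Illustrative millisecond figures, not actual production data.
avg_saving = relative_saving(control=100.0, treatment=80.0)
max_saving = relative_saving(control=250.0, treatment=175.0)
print(f"avg: {avg_saving:.0%}, max: {max_saving:.0%}")  # avg: 20%, max: 30%
```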

6.3. Singles’ Day Evaluation

Alibaba's Singles' Day shopping festival is one of the biggest shopping extravaganzas in the world, the Chinese counterpart of Black Friday. In 2017, by the end of the day (November 11), sales hit a new record of $25.3 billion, substantially higher than sales on Singles' Day 2016, attracting hundreds of millions of users from more than 200 countries. The infrastructure handled 0.325 million orders per second at peak. The e-commerce search system played a crucial role in this event.

On November 11th, the search traffic of the e-commerce search engine abruptly increases to multiple times that of a regular day. On the one hand, the search engine faces the challenge of high traffic, which might lead to system degradation. On the other hand, it is still crucial to provide high search accuracy even during the shopping festival.

In this event, our method collaborates with previous work (Liu et al., 2017) to maximize optimization at the search-engine system level. Their algorithm, called CLOES, mainly focuses on optimizing the number of items entering the ranking process via a cascading model, while our method concentrates on optimizing the set of ranking factors used during ranking, so both approaches can be applied to the search engine simultaneously. We use CLOES as the control group and CLOES+RankCFS as the experimental group, with the parameter (due to the limited on-line resources, we were only able to use CLOES as the control group). Figure 7(b) depicts the average latency during the experiments: our method saves additional average latency on top of CLOES, and reduces the peak latency by approximately relative to CLOES. The system performance (GMV) is almost the same as that of CLOES alone.

Our method and CLOES worked together on Singles' Day 2017 and succeeded in providing much better search performance than in the previous year.

7. Conclusion and Future Work

In this paper, we thoroughly investigate the effectiveness and efficiency issues in a real-world large-scale e-commerce search system and propose an intelligent optimization solution based on reinforcement learning. We formally define the learning-to-rank problem in an e-commerce scenario, characterizing both effectiveness and efficiency, and show the resulting problem to be NP-hard. We then convert it into a reinforcement learning problem through reward design and solve it with an actor-critic method. We empirically evaluate our method in both off-line and on-line scenarios, demonstrating that it is a practical solution for a real-world large-scale e-commerce search system. In the future, we plan to explore other dimensions of system optimization, such as memory usage and load balancing, in combination with search latency. Moreover, the DNN representation in our current setting is not end-to-end, so end-to-end representation solutions (e.g., pointer networks (Vinyals et al., 2015)) will be considered.


  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning.. In OSDI, Vol. 16. 265–283.
  • Bello et al. (2016) Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940 (2016).
  • Benbouzid et al. (2012) Djalel Benbouzid, Röbert Busa-Fekete, and Balázs Kégl. 2012. Fast classification using sparse decision DAGs. In Proceedings of the 29th International Conference on International Conference on Machine Learning. Omnipress, 747–754.
  • Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. 2009. Natural actor–critic algorithms. Automatica 45, 11 (2009), 2471–2482.
  • Bourdev and Brandt (2005) Lubomir Bourdev and Jonathan Brandt. 2005. Robust object detection via soft cascade. In Computer Vision and Pattern Recognition, 2005. IEEE Computer Society Conference on, Vol. 2. IEEE, 236–243.
  • Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to Rank Using Gradient Descent. In Proceedings of the 22nd International Conference on Machine Learning. ACM, New York, NY, USA, 89–96.
  • Burges (2010) Chris J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report.
  • Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to Rank: From Pairwise Approach to Listwise Approach. Technical Report.
  • Cooper et al. (1992) William S. Cooper, Fredric C. Gey, and Daniel P. Dabney. 1992. Probabilistic Retrieval Based on Staged Logistic Regression. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 198–210.
  • Davis et al. (1997) Geoff Davis, Stephane Mallat, and Marco Avellaneda. 1997. Adaptive greedy approximations. Constructive Approximation 13, 1 (1997), 57–98.
  • Freund et al. (2003) Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. 2003. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4, 6 (2003), 170–178.
  • Geng et al. (2007) Xiubo Geng, Tie-Yan Liu, Tao Qin, and Hang Li. 2007. Feature Selection for Ranking. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 407–414.
  • Guyon and Elisseeff (2003) Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of machine learning research 3, Mar (2003), 1157–1182.
  • Kira and Rendell (1992) Kenji Kira and Larry A Rendell. 1992. The feature selection problem: traditional methods and a new algorithm. In Proceedings of the 10th National Conference on Artificial Intelligence. AAAI Press, 129–134.
  • Kober and Peters (2011) Jens Kober and Jan Peters. 2011. Policy search for motor primitives in robotics. Machine Learning 1, 84 (2011), 171–203.
  • Kulkarni et al. (2016) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. 2016. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 3675–3683.
  • Kullback and Leibler (1951) Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The annals of mathematical statistics 22, 1 (1951), 79–86.
  • Li et al. (2008) Ping Li, Chris J.C. Burges, and Qiang Wu. 2008. Learning to Rank Using Classification and Gradient Boosting. In Advances in Neural Information Processing Systems 20.
  • Liu et al. (2008) Li-Ping Liu, Yang Yu, Yuan Jiang, and Zhi-Hua Zhou. 2008. TEFE: A Time-Efficient Approach to Feature Extraction. In Proceedings of the 8th IEEE International Conference on Data Mining. IEEE Computer Society, 423–432.
  • Liu et al. (2017) Shichen Liu, Fei Xiao, Wenwu Ou, and Luo Si. 2017. Cascade Ranking for Operational E-commerce Search. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, 1557–1565.
  • Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P Lillicrap, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, Vol. 48. JMLR. org, 1928–1937.
  • Natarajan (1995) Balas Kausik Natarajan. 1995. Sparse approximate solutions to linear systems. SIAM journal on computing 24, 2 (1995), 227–234.
  • Ng et al. (1999) Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. 1999. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 278–287.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • Peters and Schaal (2008) Jan Peters and Stefan Schaal. 2008. Natural actor-critic. Neurocomputing 71, 7 (2008), 1180–1190.
  • Schneiderman (2004) Henry Schneiderman. 2004. Feature-centric evaluation for efficient cascaded object detection. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, Vol. 2. IEEE, II–II.
  • Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning (1st ed.). MIT Press, Cambridge, MA, USA.
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 2692–2700.
  • Viola and Jones (2003) Paul Viola and Michael Jones. 2003. Rapid Object Detection using a Boosted Cascade of Simple Features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. I–511–I–518 vol.1.
  • Wang et al. (2010a) Lidan Wang, Jimmy Lin, and Donald Metzler. 2010a. Learning to Efficiently Rank. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 138–145.
  • Wang et al. (2010b) Lidan Wang, Donald Metzler, and Jimmy Lin. 2010b. Ranking Under Temporal Constraints. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, New York, NY, USA, 79–88.
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4 (1992), 229–256.
  • Xu and Li (2007) Jun Xu and Hang Li. 2007. AdaRank: A Boosting Algorithm for Information Retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 391–398.
  • Zheng et al. (2007) Zhaohui Zheng, Keke Chen, Gordon Sun, and Hongyuan Zha. 2007. A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 287–294.
  • Zhou (2012) Zhi-Hua Zhou. 2012. Ensemble methods: foundations and algorithms. CRC press.