U-rank: Utility-oriented Learning to Rank with Implicit Feedback

11/01/2020 ∙ by Xinyi Dai, et al. ∙ HUAWEI Technologies Co., Ltd., Shanghai Jiao Tong University, UCL

Learning to rank with implicit feedback is one of the most important tasks in many real-world information systems where the objective is some specific utility, e.g., clicks and revenue. However, we point out that existing methods based on the probabilistic ranking principle do not necessarily achieve the highest utility. To this end, we propose a novel ranking framework called U-rank that directly optimizes the expected utility of the ranking list. With a position-aware deep click-through rate prediction model, we address the attention bias considering both query-level and item-level features. Due to the item-specific attention bias modeling, the optimization for expected utility corresponds to a maximum weight matching on the item-position bipartite graph. We base the optimization of this objective on an efficient LambdaLoss framework, which is supported by both theoretical and empirical analysis. We conduct extensive experiments for both web search and recommender systems over three benchmark datasets and two proprietary datasets, where the performance gain of U-rank over state-of-the-art methods is demonstrated. Moreover, U-rank has been deployed on a large-scale commercial recommender, and a large improvement over the production baseline has been observed in online A/B testing.




1. Introduction

Ranking is the core of information retrieval. In the traditional web search scenario, learning to rank (LETOR) methods are proposed to optimize the ranked list based on human-annotated relevance labels (Liu, 2011). Typically, these methods sort the documents by their probability of relevance in descending order, following the well-known probabilistic ranking principle (PRP) (Robertson, 1977). Due to the lack of annotated labels, many recent works have focused on learning to rank via implicit feedback, such as users' click data, which is timely and personalized. These works are also based on PRP, where the relevance is estimated from implicit feedback through counterfactual methods (Wang et al., 2016; Joachims et al., 2017; Wang et al., 2018a; Ai et al., 2018; Hu et al., 2019). Besides the traditional web search scenario, ranking is now also an important part of many real-world applications, including recommender systems (Karatzoglou et al., 2013), online advertising (Tagami et al., 2013) and product search (Karmaker Santu et al., 2017). In these applications, specific utility metrics (such as clicks, conversions, and revenue) are used to evaluate the quality of a ranked list.

In many existing works, LETOR algorithms are derived on the basis of PRP and then evaluated on some utility metrics (Karmaker Santu et al., 2017; Zhao et al., 2019). However, we find that the PRP-based ranking framework does not necessarily bring the highest utility in reality. To be more specific, PRP is basically correct for items with large differences in relevance estimation. However, for two items with close relevance estimation, putting the item that is more sensitive to position change at the higher position will bring a higher expected utility, even if it is slightly less relevant. As a concrete example, we show the average click curves of five popular apps from a mainstream App Store in the right panel of Figure 1. For simplicity, consider a non-personalized case in which we recommend App 1, App 2, and App 3 to one user. If the apps are sorted by PRP, the ranked list will be App 1, App 2, and App 3, by their relevance in terms of click-through rate (CTR). However, the optimal ranked list with the maximum utility should be App 1, App 3, and App 2, since the utility gain of promoting App 3 from the 3rd to the 2nd position is 0.019, which is larger than the utility loss of 0.010 from demoting App 2 from the 2nd to the 3rd position, for a net gain of 0.009. As can be seen, sorting items by relevance may fail to achieve the highest utility in some situations, which is quite common in industrial scenarios. Therefore, we aim to optimize objectives that are directly related to the utility, based on users' implicit feedback.
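The App 1/2/3 example can be reproduced with a few lines of Python. This is a minimal sketch: the per-position CTR values in `CTR` are hypothetical numbers chosen only so that promoting App 3 from position 3 to 2 gains 0.019 and demoting App 2 from position 2 to 3 loses 0.010, as stated in the text; they are not read from Figure 1. Relevance for the PRP ordering is assumed to be the CTR at the first position.

```python
from itertools import permutations

# Hypothetical CTR of each app at positions 1, 2, 3 (illustrative values only).
CTR = {
    "App 1": [0.120, 0.100, 0.085],
    "App 2": [0.095, 0.080, 0.070],  # position 2 -> 3 loses 0.010
    "App 3": [0.090, 0.075, 0.056],  # position 3 -> 2 gains 0.019
}

def expected_clicks(order):
    """Sum of the CTRs of the apps at the positions they are shown in."""
    return sum(CTR[app][pos] for pos, app in enumerate(order))

# PRP: sort by relevance (assumed here to be the CTR at the top position).
prp_order = sorted(CTR, key=lambda app: CTR[app][0], reverse=True)

# Utility-optimal: try every ordering and keep the one with most expected clicks.
best_order = list(max(permutations(CTR), key=expected_clicks))

net_gain = expected_clicks(best_order) - expected_clicks(prp_order)
```

Under these assumed numbers, the exhaustive search prefers App 1, App 3, App 2 over the PRP ordering, and `net_gain` equals the 0.019 gain minus the 0.010 loss.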

Figure 1. The CTR analysis w.r.t. query/item features. The data is collected through a 120 days’ click log on random recommendation traffic in a mainstream App Store.

Optimizing a utility metric like the CTR of the whole ranked list is not as easy as it seems. One direct solution might be to estimate a single CTR for each query-item pair and then optimize a given metric w.r.t. the whole list using the estimated CTR (Wu et al., 2018). However, this solution introduces bias, since CTR is not a static property like relevance: for the same query-item pair, the CTR might change with its presented position. As shown in Figure 1, the CTR decreases as the presented position goes from top to bottom, and moreover, the magnitude of the decrease differs among items and device types.

In order to design effective utility-oriented algorithms, we need to figure out why this phenomenon happens and then investigate how to deal with it. The decrease in users' CTR mainly results from the decrease in user attention, which is supported by eye-tracking studies (Lorigo et al., 2008, 2006). Most existing works have treated such attention bias as position bias (Wang et al., 2016; Joachims et al., 2017), i.e., more attention is paid to the top positions than to the bottom ones. In the literature, position bias is considered to be independent of the ranked items, i.e., it has the same effect on all items (Richardson et al., 2007; Craswell et al., 2008), which is generally correct in the traditional ten-blue-links scenario. Under this assumption, following PRP achieves the highest expected utility, since the click curves of different items across positions have the same shape despite their different scales.

However, we argue that in many real-world applications, a user's attention on items depends not only on the positions but also on the item attributes and the user contexts. In the App recommendation case demonstrated in Figure 1, visual differences in the thumbnail of a product or the preview frame of a video lead to item-specific attention bias. In web search, an example of item-specific attention bias is vertical bias, commonly observed when the page contains vertical search results (such as images, videos, and maps). For example, Metrikov et al. (2014) found that an image in a search result can raise CTR and flatten the click curve at the same time. Visually attractive content, like a vertical search result or an item with a fancy thumbnail, can still attract the user's attention even if it is placed at a lower position, leading to a flatter click curve. In other words, such visually attractive results are less sensitive to position change. In the example above, the CTR of App 2 is less sensitive than that of App 3 to whether it is placed at position 2 or position 3. Placing items whose CTR is more sensitive to position change at top positions often leads to a higher utility. Besides, query-level features like device type, as shown in Figure 1, also lead to different attention biases. Hence, to obtain unbiased CTR estimation, we need to exploit both the item-level and the query-level features to model the dependency between click and position.

Based on these considerations, in this work, we propose a ranking framework called U-rank that directly optimizes expected utility from implicit feedback. Instead of ranking according to PRP, we first derive a new list-wise learning objective, of which the expectation is the utility metric we want to maximize. Then, to obtain an unbiased estimation of the expected utility, we address the attention bias considering both the query-level and item-level features with a position-aware deep CTR model. Finally, to efficiently optimize the expected utility, we formulate it as an item-position matching problem as shown in Figure 2, and learn a scoring function towards the best matching through pairwise permutations inspired by the LambdaLoss framework (Wang et al., 2018b), which reduces the complexity in inference stage from to . Theoretical analysis demonstrates that we solve an upper bound problem of the matching problem.

We conduct thorough experiments on three benchmark LETOR datasets and a large-scale real-world commercial recommendation dataset to verify the effectiveness of U-rank. Further, U-rank has been deployed on the recommender system of a mainstream App Store, where a 10-day online A/B test shows that U-rank achieves an average improvement of 19.2% on CTR and 20.8% on conversion rate over the production baseline.

2. Related Work

Generative click models are introduced to study user browsing behavior and extract unbiased relevance feedback from click data. For example, the position-based model (PBM) (Richardson et al., 2007) assumes that a click only depends on the position and the relevance of the document. The cascade model (Craswell et al., 2008) assumes that the user browses a search web page sequentially from top to bottom until a relevant document is found. Following these two classical click models, more sophisticated ones (e.g., UBM (Dupret and Piwowarski, 2008), DBN (Chapelle and Zhang, 2009), CCM (Guo et al., 2009) and NCM (Borisov et al., 2016)) have been proposed. These click models estimate the relevance of each item in a point-wise manner instead of considering the relative order of the items as in pairwise or listwise approaches.

Recently, a new line of research, referred to as counterfactual methods, utilizes inverse propensity score (IPS) weighting to address position bias in a learning to rank framework. Wang et al. (2016) and Joachims et al. (2017) proposed IPS-based frameworks for debiasing click data in learning to rank. In both works, the propensity estimation relies on randomizing the search results displayed to users, which degrades users' search experience. Considering this, Agarwal et al. (2019) proposed a PBM-based approach to estimate propensity without intrusive interventions. CPBM (Fang et al., 2019), on the basis of PBM, learns a query-dependent propensity estimation. However, these methods require multiple rankers, which makes them inconvenient to deploy in real-world applications. Besides, another branch of unbiased learning to rank works (Wang et al., 2018a; Ai et al., 2018; Hu et al., 2019) jointly learns the propensity model with a relevance model, which results in biased estimation of propensity unless the relevance estimation is very accurate.

3. Problem Formulation

Figure 2. The maximization of the utility can be seen as solving the maximum-weight matching on the item-position bipartite graph, where the edge weight between an item and a position denotes the utility of placing the item at this position, i.e., the product of item’s CTR at this position and the utility value of this item.

When a user issues a new request , the system delivers a ranked list of items to the user according to a ranking model over all the candidate items. The feature vector of each item consists of item features, context features, and user/query features. The scalar denotes the utility value related to each item , e.g., the watch time of each video in video recommendation, or the bid price of each ad in sponsored search.

The users’ click logs are a set , where is the position of item , and is the users’ implicit feedback on item when displaying at position , i.e., for click and for non-click. To distinguish between the position of item in users’ click logs and in the current ranking model, in the following parts, we use to denote the position of item in click logs and keep as the position of item in the current ranking model.

The ultimate goal of this system is to find the best permutation of candidate items for each query to maximize the utility. The utility is defined as the expected sum of weighted clicks of each item over the whole ranked list, as follows,


where is the probability of the item being clicked if displayed at position . Maximizing utility is equivalent to solving a maximum weight matching problem on the item-position bipartite graph, where is the edge weight between the item and the position in the graph, as shown in Figure 2.
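To make the matching view concrete, the following sketch builds the edge weights (the CTR of an item at a position times the item's utility value) and searches all item-to-position assignments exhaustively. The function names and the toy CTR table are illustrative assumptions. Brute force is used only because it is easy to verify for tiny inputs; it is not the procedure proposed in this paper.

```python
from itertools import permutations

def edge_weights(ctr, values):
    """weights[i][k] = CTR of item i at position k times the utility value of item i."""
    return [[p * v for p in row] for row, v in zip(ctr, values)]

def best_matching(weights):
    """Exhaustive maximum-weight item-position matching (tiny inputs only).

    Returns (positions, utility), where positions[i] is the 0-based
    position assigned to item i.
    """
    n = len(weights)

    def utility(pos):
        return sum(weights[i][pos[i]] for i in range(n))

    best = max(permutations(range(n)), key=utility)
    return list(best), utility(best)

# Hypothetical CTR table: rows are items, columns are positions (top first).
ctr = [[0.10, 0.06],
       [0.08, 0.05]]

clicks_only, _ = best_matching(edge_weights(ctr, [1.0, 1.0]))
with_bids, _ = best_matching(edge_weights(ctr, [1.0, 2.0]))
```

With unit utility values, item 0 takes the top position. Doubling the utility value of item 1 flips the assignment, which shows that the best list depends on the edge weights rather than on the CTR alone.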

4. Model Framework

In this section, we present a general ranking framework, i.e., U-rank, to maximize the utility in Eq. (1) directly. Firstly, we derive an unbiased metric of utility from click logs, the expectation of which is the utility. Secondly, we design an efficient learning to rank method to optimize this metric, of which the loss function is an upper bound of the utility regret.

4.1. Unbiased Estimation of the Utility

The main difficulty of existing methods for learning to rank via implicit feedback lies in estimating the underlying attention bias (or position bias), since it is not directly observed in the data. With the new learning objective, we do not need to infer relevance or the attention bias explicitly. Instead, we have to deal with another mismatch problem, namely the mismatch between the CTR at the historically presented position and the CTR at the finally presented position. For example, if an item is ranked first in the click logs but presented at the 10th position in the final ranking, then its utility is overestimated. To correct this bias, we need an accurate model of users' CTR at different positions.
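The position mismatch can be illustrated with a small sketch. It assumes a ratio-style correction, in which each logged click is re-weighted by the modeled CTR at the new position divided by the modeled CTR at the logged position. This form is an assumption made for illustration and is not claimed to be the exact estimator of this paper. The CTR table, the function names, and the premise that the CTR model is exact are all hypothetical.

```python
def reweighted_utility(clicks, logged_pos, new_pos, ctr, values):
    """Utility estimate from logged clicks, corrected for the change of position.

    ctr[i][k] is the (assumed exact) CTR of item i at 0-based position k.
    """
    total = 0.0
    for i, click in enumerate(clicks):
        ratio = ctr[i][new_pos[i]] / ctr[i][logged_pos[i]]
        total += click * ratio * values[i]
    return total

def target_utility(new_pos, ctr, values):
    """Expected utility of the new ranking: sum of CTR at the new position times value."""
    return sum(ctr[i][new_pos[i]] * values[i] for i in range(len(values)))

# Hypothetical example with two items and three positions.
ctr = [[0.20, 0.10, 0.05],
       [0.12, 0.10, 0.09]]
values = [1.0, 2.0]
logged_pos = [0, 1]  # positions in the click log
new_pos = [2, 0]     # positions in the new ranking

# A click logged at the top counts for less once the item is moved down.
single_click = reweighted_utility([1, 0], logged_pos, new_pos, ctr, values)

# The estimate is linear in the clicks, so its expectation is obtained by
# replacing each click with its probability at the logged position.
expected_clicks = [ctr[i][logged_pos[i]] for i in range(len(values))]
expected_estimate = reweighted_utility(expected_clicks, logged_pos, new_pos, ctr, values)
```

In this toy setting, the expectation of the re-weighted estimate coincides with the utility of the new ranking. Counting the raw click on item 0 at full weight would overstate its contribution at the lower position.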

Estimating the CTR at different positions is an instance of one of the most well-studied tasks in recommender systems, i.e., CTR estimation. Deep CTR models (Zhang et al., 2016; Qu et al., 2018) can take the position and the rich query-item features as input to model the complex user interaction in feature space from the click logs. Position has been shown to be a very important feature in CTR estimation (Bai et al., 2019; Guo et al., 2019). However, if we directly use a CTR estimation model as the ranking model, the position feature is unavailable at the inference stage. Therefore, we design a position-aware deep CTR model as a debiasing module instead of directly using it as a ranking model. Assume that the probability function of item displayed at position is a function of item feature and position , i.e., . Then we can estimate the parameter via the standard cross-entropy minimization:


where is the cross-entropy loss.

Based on users’ click logs and the estimated CTR, we derive an unbiased metric of utility as


We prove that our derived utility is unbiased w.r.t. in Eq. (1), by showing that the expectation of is equivalent to , as


4.2. Learning to Optimize the Utility

One straightforward way to optimize is to perform maximum-weight matching algorithm (e.g., Kuhn-Munkres algorithm (Kuhn, 1955; Munkres, 1957)) on the bipartite graph directly (each query corresponds to one graph), given . However, the complexity for such a graph matching algorithm to run in the inference stage is ( denotes the number of candidate items), which is unacceptable in a production system. Therefore, in this section, we propose a parameterized scoring function to approximate the maximum-weight matching procedure on each query, still aiming at maximizing the utility, so that the complexity at the inference stage can be reduced to . For each item , the scoring function gives a ranking score as . For each query , we compute the score of each item , and the result list is generated by sorting their scores in descending order.

According to Eq. (3), we define the utility of displaying item at position as . With being the optimal position assigned to item by the graph matching algorithm, the regret of the utility is defined as


Minimizing the regret of the utility directly is infeasible as ’s are discrete values. Therefore, we adapt the LambdaLoss framework (Wang et al., 2018b) to learn a ranking model towards the optimal ranking by optimizing our proposed loss function (presented in Eq. (7)) with iterative pairwise permutation. As in LambdaLoss, we follow an EM procedure: in the E step, we obtain the ranked list based on the current scoring function, and in the M step, we re-estimate the scoring function to minimize our loss function. The learning procedure of U-rank is as follows.

We first initialize the scoring function with random initialization of . Inspired by the re-weighting technique used in LambdaRank (Burges et al., 2007), we compute the difference between the unbiased utility when the positions of two items and are swapped, as


Then this difference value is used as the weight in the pairwise loss for each pair of items. Following (Burges et al., 2005; Burges et al., 2007), we design our loss function in the form of logistic loss, as


where and denote the position assigned to item and by ranking model at the last step (by the scoring function ). This loss is minimized, so that we get a new scoring function . Then the process is repeated until convergence.

Notice that in a standard LambdaLoss framework, the LambdaLoss is defined as


Note that the differences between our objective (7) and the LambdaLoss objective (8) lie in (i) the subscript of the summation symbol and (ii) the absolute value symbol of the difference term . In the LambdaLoss framework, the pairwise label of each item pair is determined. The optimal ranking order is known to us by ranking all the items according to relevance label or click label (denoted by for item ), in descending order. However, in our framework, we cannot obtain the explicit label for item . An item is treated as the positive item if it is placed at a lower position by scoring function and the swap of the item pair brings utility gain, and vice versa. We cannot access the optimal ranking order of each query beforehand; instead, the optimal order is approached through iterative pairwise permutation.
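The pairing rule described here can be sketched as follows. The code is one possible reading under stated assumptions, not the exact objective of this paper: positions come from sorting the current scores (E step), and a pair contributes a logistic loss weighted by the swap gain only when the lower-placed item would bring a utility gain if swapped upward. The names `positions_from_scores`, `swap_gain`, `pairwise_loss` and the table `util` are hypothetical.

```python
import math

def positions_from_scores(scores):
    """E step: rank items by score in descending order (position 0 is the top)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = [0] * len(scores)
    for rank, item in enumerate(order):
        pos[item] = rank
    return pos

def swap_gain(i, j, pos, util):
    """Utility change if items i and j exchange positions.

    util[i][k] is the estimated utility of showing item i at position k.
    """
    before = util[i][pos[i]] + util[j][pos[j]]
    after = util[i][pos[j]] + util[j][pos[i]]
    return after - before

def pairwise_loss(scores, util):
    """Swap-gain-weighted logistic loss over item pairs (assumed form)."""
    pos = positions_from_scores(scores)
    n = len(scores)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if pos[i] > pos[j]:  # item i currently sits below item j
                gain = swap_gain(i, j, pos, util)
                if gain > 0:  # swapping helps, so item i is the positive item
                    loss += gain * math.log1p(math.exp(-(scores[i] - scores[j])))
    return loss

# Hypothetical utilities: item 1 benefits more from the top position.
util = [[0.10, 0.06],
        [0.16, 0.10]]
loss_when_item0_on_top = pairwise_loss([1.0, 0.0], util)
loss_when_item1_on_top = pairwise_loss([0.0, 1.0], util)
```

When the scores already place item 1 on top, no swap brings a gain and the loss is zero. Otherwise, the loss pushes the score of item 1 above that of item 0 in the next M step.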

4.3. Theoretical Analysis

In this section, we theoretically prove that our proposed training objective is an upper bound of the utility regret . To make the proof easier to understand, we construct a function:


We start with several lemmas which will be used in our proof.

Lemma 4.1.

Given an indicator function and a function where is a constant in , it holds that for all .

Lemma 4.2.

Given an indicator function and a function where is a constant in , it holds that for .

Lemma 4.3.

Given a sum function and a max function , it holds that for .

Now we are ready to derive the main theoretical result.

Theorem 4.4.

Assume is a monotonic decreasing function w.r.t and the ranking score is bounded in the range of [-C,C]. Let and . Then we have .


where the inequality holds due to Lemma 4.1 and Lemma 4.2. ∎

Theorem 4.4 states that is upper bounded by our objective plus . is a constant since in the M step only depends on the current scoring function . Notice that the assumptions in the theorem are not restrictive in practice. As illustrated in Figure 1, the real utility basically satisfies the monotonic decreasing assumption. Moreover, the ranking score is often clipped in implementation to avoid explosion in exponential function.

Theorem 4.5.

Assume the utility is a monotonic decreasing function w.r.t . Then is upper bounded by .


where the first inequality holds due to Lemma 4.3. ∎

                             Yahoo! LETOR set 1              MSLR-WEB10K                     Istella-S LETOR
Ranking model                MAP     nDCG    # Click CTR     MAP     nDCG    # Click CTR     MAP     nDCG    # Click CTR
SVMRank       None           0.702   0.845   0.599   0.0641  0.498   0.735   0.834   0.0827  0.773   0.808   0.931   0.0939
              Randomization  0.639   0.787   0.544   0.0574  0.433   0.686   0.820   0.0799  0.742   0.787   0.909   0.0910
              CPBM           0.701   0.843   0.594   0.0637  0.477   0.721   0.751   0.0746  0.752   0.793   0.923   0.0919
              Groundtruth    0.718   0.859   0.612   0.0656  0.515   0.748   0.872   0.0870  0.775   0.816   0.958   0.0952
LambdaRank    None           0.700   0.847   0.606   0.0641  0.498   0.736   0.834   0.0828  0.776   0.810   0.945   0.0941
              Randomization  0.680   0.828   0.582   0.0621  0.451   0.700   0.813   0.0808  0.748   0.793   0.924   0.0923
              CPBM           0.718   0.857   0.613   0.0651  0.514   0.744   0.836   0.0836  0.779   0.813   0.941   0.0941
              Groundtruth    0.719   0.859   0.618   0.0657  0.521   0.748   0.882   0.0883  0.781   0.815   0.948   0.0948
DNN           DLA            0.639   0.782   0.553   0.0589  0.430   0.682   0.830   0.0839  0.676   0.703   0.828   0.0824
              CTR-1          0.647   0.792   0.551   0.0577  0.477   0.722   0.829   0.0814  0.733   0.771   0.894   0.0895
              U-rank         0.719   0.861*  0.618*  0.0659* 0.492   0.725   0.903*  0.0915* 0.783*  0.816*  0.959*  0.0953*
KM (oracle model)            0.935   0.987   0.684   0.0737  0.710   0.723   0.976   0.0969  0.993   0.995   1.132   0.1126

Table 1. Comparison of different unbiased learning to rank models on three benchmark datasets.

Theorem 4.6.

Under the assumption of Theorem 4.4 and Theorem 4.5, we have that .

The proof of Theorem 4.6 is trivial due to Theorem 4.4 and Theorem 4.5. Theorem 4.6 demonstrates that the utility regret is bounded by our proposed objective plus a constant. It implies that optimizing our proposed objective is actually minimizing the upper bound of the utility regret, which guarantees the effectiveness of U-rank theoretically.

5. Semi-synthetic Experiments

The semi-synthetic setup is widely applied in unbiased learning to rank (Fang et al., 2019), which allows us to explore different settings. Code for our experiments is available at https://github.com/xydaisjtu/U-rank.

5.1. Datasets

5.2. Click Data Generation

We mainly follow Fang et al. (2019) to generate synthetic click data with item-specific attention bias for the three datasets. In the following, the oracle model refers to this click generation model. Similar to Fang et al. (2019), the attention bias, which is related to both the position and the item features, is calculated by . In our setting, is the set of item features, while in the setting of Fang et al. (2019), is the set of query features, which is shared among all the items for the same query. The parameter vector is drawn from a uniform distribution over and is normalized such that . Following Hu et al. (2019), the relevance probability is defined as , where denotes the relevance label of and is the highest level of relevance. is set to 0.1, which denotes the CTR of irrelevant documents. The overall CTR is calculated by . The maximal position is set to be 10.
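A click generator of this kind can be sketched in a few lines. The functional forms below are assumptions in the style of the cited works, not formulas taken from this text: a power-law attention term whose exponent depends on the item features, a relevance probability that grows exponentially with the label, and their product as the overall CTR. The default `y_max=4` (five relevance levels) and all function names are likewise illustrative.

```python
import random

def attention(k, w, x):
    """Assumed attention bias at 1-based position k for item features x.

    The exponent is clipped at zero, so an item can have a flat click curve.
    """
    exponent = max(sum(wi * xi for wi, xi in zip(w, x)) + 1.0, 0.0)
    return (1.0 / k) ** exponent

def relevance_prob(y, y_max=4, eps=0.1):
    """Assumed relevance probability; eps is the CTR of irrelevant documents."""
    return eps + (1.0 - eps) * (2 ** y - 1) / (2 ** y_max - 1)

def click_prob(k, w, x, y, y_max=4, eps=0.1):
    """Overall CTR: attention at the position times relevance probability."""
    return attention(k, w, x) * relevance_prob(y, y_max, eps)

def simulate_click(k, w, x, y, rng):
    """Draw one synthetic click (1) or non-click (0)."""
    return 1 if rng.random() < click_prob(k, w, x, y) else 0

# Hypothetical parameter vector and item features.
w = [0.5, -0.2]
x = [1.0, 1.0]
curve = [click_prob(k, w, x, y=3) for k in range(1, 11)]  # positions 1..10
click = simulate_click(1, w, x, 3, random.Random(0))
```

Because the exponent depends on the item features, two items with the same relevance label can have click curves of different steepness, which is the item-specific attention bias studied in these experiments.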

Figure 3. Comparison of the result lists of different methods on the first query of MSLR-WEB10K.

5.3. Baselines

We implement eight baselines that explore the performance of two standard learning to rank methods (i.e., SVMRank (Joachims, 2006) and LambdaRank (Burges et al., 2007)) combined with four propensity estimation methods, which are detailed as follows. (1) None uses the original click data without debiasing. (2) Randomization (Joachims et al., 2017) estimates propensity with online randomized experiments. (3) CPBM (Fang et al., 2019) estimates the examination probability w.r.t. different queries based on intervention harvesting. (4) Groundtruth uses the groundtruth examination probability of the oracle model as propensity. The result of this method is the upper bound of the results of other IPS approaches based on the same ranking model. Other methods we implement include (5) CTR-1, the position-aware click model used in our framework, which assigns position 1 to each item during online inference, and (6) DLA (Ai et al., 2018), a dual learning algorithm that jointly learns an unbiased ranker and an unbiased propensity model. We also explore the performance of KM (oracle model), which solves the maximum-weight graph matching problem via the Kuhn-Munkres (KM) algorithm (Kuhn, 1955; Munkres, 1957), given the groundtruth CTR. It is supposed to produce the best utility we can achieve on the test data.

5.4. Overall Performance

In this section, we assume the utility value of each item to be 1 in order to consistently and fairly compare U-rank with existing (unbiased) learning to rank methods. We evaluate the performance of the baseline approaches and U-rank in terms of relevance-based metrics, i.e., MAP and nDCG (nDCG denotes nDCG@10), and utility-based metrics, i.e., # Click and CTR. Here, # Click and CTR are utility metrics based on the oracle click model, denoting clicks per query and CTR per document, respectively. The overall performance on the three benchmark datasets is shown in Table 1. Firstly, U-rank consistently achieves the best performance among the state-of-the-art baseline approaches on the utility-based metrics, i.e., # Click and CTR. For example, U-rank achieves a 1% improvement on Yahoo! LETOR set 1 and an 8.3% improvement on MSLR-WEB10K on CTR compared to the best baseline methods (the baselines in italics use information from the oracle click model, so they are not included in the comparison). Secondly, U-rank also outperforms most of the baselines in terms of the relevance-based metrics, i.e., MAP and nDCG, though it does not always perform the best, especially on the MSLR-WEB10K dataset, where the disagreement between the utility-based metrics and the relevance-based metrics is larger than on the other two datasets. Thirdly, the method Groundtruth achieves the best utility among the counterfactual learning approaches, demonstrating the effectiveness of the IPS-based framework when the propensity estimation is accurate. Randomization fails to perform well because it assumes that the user's attention only relates to the position, which is not true in our setting, where the user's attention relies on both the position and the item features. CPBM achieves the second-best utility among the IPS-based methods since it models the propensity of each query by taking query features into consideration. Lastly, U-rank and CTR-1 share the same click model. However, U-rank outperforms CTR-1, mainly because CTR-1 ranks items by their estimated CTR at position 1, which is suboptimal in the case of item-specific attention bias. U-rank also outperforms DLA, since DLA relies heavily on the accuracy of the estimated propensity, which is hard to achieve.

5.5. Empirical Analysis

RQ1: How does our model achieve higher CTR? In Figure 4, we show the average CTR at each position for U-rank and LambdaRank (Groundtruth), the upper bound of the counterfactual learning methods in Table 1. We also plot the results of KM (oracle model) for reference.

Figure 4. Average CTR on each position.

Firstly, comparing the results of the two datasets in Figure 4, we observe that the average CTR of the KM (oracle model) method declines more steeply with position on the Yahoo dataset than on the MSLR dataset. This suggests that on the Yahoo dataset, positions have a very strong impact on users' clicks. Thus, to optimize the utility, a well-performing approach should put more relevant items at higher positions. In MSLR-WEB10K, on the other hand, the average CTR of the optimal matching tends to be more evenly distributed over the positions than on the Yahoo dataset. We find that our method adapts to different severities of position bias. On the Yahoo dataset, our model focuses more on top positions than LambdaRank, while on the MSLR dataset, our model learns a flatter distribution. Notably, on both datasets, our model achieves a larger sum of click probabilities over all the positions than LambdaRank.

Secondly, we analyze the result of a single query in detail. The experiment is conducted on the first query of the MSLR-WEB10K dataset. Figure 3 shows the ten items for this query and their click probabilities if placed at each position according to our oracle click data generation model. The position of each item assigned by different methods is denoted in orange. We can see that although LambdaRank with a groundtruth propensity performs better in nDCG, it achieves a lower CTR than our method U-rank. This is because, similar to KM (oracle model), U-rank takes the position sensitivity of different items into consideration. For example, document 6 is of high relevance and relatively insensitive to position change. LambdaRank displays it at the second position, while our method and KM both display it at a lower position, so that the second position is kept for an item that is more sensitive to position change.

# click
train test validation
A1 0.695 0.684 0.687 0.903 0.915
A2 0.701 0.693 0.692 0.852 0.847
Table 2. Comparison of two different architecture of the position aware click estimation.

RQ2: What kind of architecture should we use to implement the position-aware click estimation?

We implement two kinds of architecture for the position-aware click estimation. A1 is a neural network, with the item features as input and a -dim vector as output, where the -th dimension denotes the CTR of the item at position , and denotes the number of positions. A2 is also a neural network, with the concatenation of the item features and the position in one-hot encoding as its input and a single value as output, representing the CTR of the item at the given position. The result on MSLR-WEB10K is presented in Table 2. Although A2 achieves better AUC, we utilize A1 as the click model to pursue higher utility.
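The difference between the two designs lies in their input and output shapes, which the following sketch makes explicit. A single linear layer with a sigmoid stands in for each neural network, which is a simplifying assumption; the weights and features are hypothetical, and the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_hot(k, num_positions):
    """One-hot encoding of a 0-based position."""
    return [1.0 if j == k else 0.0 for j in range(num_positions)]

def a1_forward(x, weights):
    """A1: item features in, one CTR per position out (weights has one row per position)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def a2_forward(x, k, num_positions, weights):
    """A2: item features concatenated with one-hot position in, a single CTR out."""
    z = list(x) + one_hot(k, num_positions)
    return sigmoid(sum(w * zi for w, zi in zip(weights, z)))

# Hypothetical item features and weights for three positions.
x = [0.5, -1.0]
K = 3
a1_weights = [[0.2, 0.1], [0.1, 0.1], [0.0, 0.1]]
a2_weights = [0.2, 0.1, 0.3, 0.0, -0.3]

ctr_a1 = a1_forward(x, a1_weights)                            # one pass, K outputs
ctr_a2 = [a2_forward(x, k, K, a2_weights) for k in range(K)]  # K passes, one output each
```

In this sketch, A1 yields the whole click curve of an item in one forward pass, whereas A2 must be queried once per position.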

6. Real-world Deployment

                         Scenario 1 (without bid)                    Scenario 2 (with bid)
Ranking model            # click    # click@1  # click@3  # click@5  Revenue    Revenue@1  Revenue@3  Revenue@5
SVMRank     None         1.586      0.500      1.107      1.348      3.602      0.959      2.177      2.788
            Groundtruth  1.617      0.536      1.154      1.386      3.619      1.015      2.229      2.825
LambdaRank  None         1.750      0.701      1.327      1.556      3.586      0.964      2.178      2.774
            Groundtruth  1.826      0.781      1.429      1.640      3.637      1.009      2.245      2.837
DLA                      1.665      0.624      1.338      1.520      3.631      0.958      2.233      2.827
DeepFM                   1.790      0.762      1.379      1.593      3.753      1.131      2.289      2.881
U-rank                   1.859*     0.841*     1.474*     1.676*     3.966*     1.264*     2.607*     3.214*
* denotes statistically significant improvement (measured by t-test with p-value < 0.05) over all baselines.

Table 3. Comparison of different unbiased learning to rank models on real-world recommendation scenarios.

In order to verify the effectiveness of our proposed model in real-world applications, we conduct experiments in two recommendation scenarios in Company X’s App Store. This App Store has hundreds of millions of daily active users who create hundreds of billions of user logs every day in the form of implicit feedback such as browsing, clicking, and downloading behaviors.

6.1. Offline Evaluation

Setups. We conduct offline experiments based on two recommendation scenarios with different utility settings. In Scenario 1, we only consider the downloads of the Apps as the utility, while in Scenario 2, the bid price of each App download needs to be considered. In both scenarios, we use seven consecutive days' data for training and the eighth day's data for testing. As in the semi-synthetic experiments, we implement two LETOR methods, i.e., SVMRank and LambdaRank, as baselines. The propensity estimation method Randomization is not applicable here, since we are not allowed to randomly swap two items of a ranked list in a live recommender system. Similarly, CPBM is not applicable either, since in practice we cannot obtain the ranking results of the same user from multiple rankers at the same time. Thus, we only compare U-rank with the ranker learned with biased click data, i.e., None, and the ranker with groundtruth propensity, i.e., Groundtruth. The groundtruth propensity is the same as the propensity that we use in evaluation in the next paragraph. DeepFM is included as a baseline, as it is the production baseline in this online App recommendation system. It is trained with the position feature and takes the default position 1 in the inference stage. To make a fair comparison, we use the DeepFM architecture for both the click model and the ranking model in U-rank.

Metrics. Unlike in the semi-synthetic experiments, here we do not know the underlying user click model. Thus, we have to debias the click data generated by a historical ranker with a pre-estimated propensity to obtain the click signals on the new positions. We estimate the propensity for each category of items from 120 days’ click data on random traffic. This category-wise propensity estimation is a coarse approximation of the groundtruth propensity, which is not available. The propensity is defined as , where denotes the category of item . This propensity is only used for evaluation except in the methods in Table 3, where this propensity is used for debiasing.

The utility in Scenario 1 is defined as the expected number of debiased clicks at the top-k positions in a session, i.e., . We use to denote the case when . The utility in Scenario 2 is defined as the expected revenue at the top-k positions in a session after debiasing, i.e., where is the bid of item . We use to denote the case when .
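As a rough illustration of these two offline metrics, the sketch below computes a debiased click count and a debiased revenue over the top-k positions of a re-ranked session. It assumes an inverse-propensity-style correction in which a click logged at an item's old position is reweighted by the ratio of the propensities at its new and old positions. That reweighting rule, the `propensity` table, the record fields and the function name are illustrative assumptions and not necessarily the exact definitions used in the experiments.

```python
from typing import Dict, List, Tuple

# Hypothetical propensity table: (position, category) -> examination propensity.
Propensity = Dict[Tuple[int, str], float]


def debiased_utility_at_k(
    session: Dict[str, dict],
    new_order: List[str],
    propensity: Propensity,
    k: int,
    use_bid: bool = False,
) -> float:
    """Debiased clicks (or revenue, if use_bid) over the top-k of new_order.

    session maps item id -> {"old_pos", "category", "click", "bid"}, as logged
    by the historical ranker. new_order is the list produced by the new ranker.
    """
    total = 0.0
    for new_pos, item_id in enumerate(new_order[:k], start=1):
        rec = session[item_id]
        if not rec["click"]:
            continue  # items without a logged click contribute nothing
        # Assumed correction: transfer the click from the old position to the
        # new one via the ratio of category-wise propensities.
        ratio = (
            propensity[(new_pos, rec["category"])]
            / propensity[(rec["old_pos"], rec["category"])]
        )
        weight = rec["bid"] if use_bid else 1.0
        total += ratio * weight
    return total


# Toy session with made-up numbers, for illustration only.
toy_propensity = {(1, "game"): 1.0, (2, "game"): 0.5, (3, "game"): 0.25}
toy_session = {
    "a": {"old_pos": 1, "category": "game", "click": 1, "bid": 2.0},
    "b": {"old_pos": 2, "category": "game", "click": 1, "bid": 4.0},
    "c": {"old_pos": 3, "category": "game", "click": 0, "bid": 1.0},
}
toy_order = ["b", "a", "c"]
clicks_at_2 = debiased_utility_at_k(toy_session, toy_order, toy_propensity, k=2)
revenue_at_2 = debiased_utility_at_k(
    toy_session, toy_order, toy_propensity, k=2, use_bid=True
)
```

Setting `k` to the full list length gives the session-level variant of either metric.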

Results. The overall performance on the two real-world datasets is shown in Table 3. We make the following observations. Firstly, U-rank achieves the best performance compared with the state-of-the-art baselines. Specifically, in Scenario 1, U-rank achieves 1.8%, 7.7%, 3.1% and 2.2% improvement over the best baseline method in terms of #click, #click@1, #click@3 and #click@5, respectively. In the experiment with bids in Scenario 2, the improvements are 2.5%, 13%, 6.7% and 3.9% in terms of Revenue, Revenue@1, Revenue@3 and Revenue@5, respectively. These results demonstrate the superiority of our approach over the baselines in optimizing the utility, which motivates us to deploy U-rank in the live recommender system. Secondly, U-rank performs better than DeepFM because U-rank considers item-specific attention bias while DeepFM learns from biased data. We do not elaborate on the other results since they are consistent with the results in the semi-synthetic experiments.

Figure 5. Online experimental results of click through rate.
Figure 6. Online experimental results of conversion rate.

6.2. Online Evaluation

Setups. We conduct A/B testing in a recommendation scenario in Company X's App store, comparing the proposed model U-rank with the current production baseline DeepFM (Guo et al., 2017), which supports multiple scenarios such as "Must-have Apps" and "Novel and Fun". The whole online experiment lasts 24 days, from May 6, 2020 to May 29, 2020. We monitor the results of A/A testing for the first seven days, conduct A/B testing for the following ten days, and conduct A/A testing again for the last seven days. 15% of the users are randomly selected as the experimental group and another 15% of the users form the control group. During A/A testing, all users are served by the DeepFM model (Guo et al., 2017). During A/B testing, users in the control group are presented with recommendations from DeepFM, while users in the experimental group are presented with recommendations from our proposed model U-rank. Note that the click model of U-rank shares the same network architecture and parameter complexity as DeepFM, so that any improvement can be attributed to the objective function design of the ranker in U-rank.
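The experimental design above can be sketched in a few lines. The text only states that users are randomly selected into the two groups; the salted-hash bucketing below is a common way to make such an assignment stable across sessions and is an assumption here, as are the function names, the salt and the treatment of users outside both groups.

```python
import hashlib


def assign_group(user_id: str, salt: str = "u-rank-ab") -> str:
    """Map a user to a stable bucket in [0, 100) and then to a group.

    Hash-based bucketing is an illustrative assumption; only the 15% / 15%
    split comes from the experimental setup.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < 15:
        return "experimental"
    if bucket < 30:
        return "control"
    return "other"  # users outside the experiment


def model_for(group: str, day: int) -> str:
    """Return the model serving a group on a given experiment day (1 to 24).

    Days 1-7 and 18-24 are A/A testing, days 8-17 are A/B testing. Traffic
    outside the experimental group is assumed to stay on the baseline.
    """
    if group == "experimental" and 8 <= day <= 17:
        return "U-rank"
    return "DeepFM"


# Example: which model serves a given (hypothetical) user on day 10?
example_group = assign_group("user-12345")
example_model = model_for(example_group, day=10)
```

Because the assignment depends only on the user id and the salt, a user stays in the same group for the whole 24-day window, which is what makes the A/A periods before and after the A/B period comparable.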

To deploy U-rank, we utilize a single node with a 48-core Intel Xeon CPU E5-2670 (2.30 GHz), 400 GB RAM and 2 NVIDIA Tesla V100 GPU cards, which is the same as the training environment of the baseline DeepFM. For model training, U-rank requires minor changes to the current training procedure due to the pair-wise loss function. For model inference, U-rank shares the same pipeline as DeepFM, which means that no extra engineering work is needed at inference time to upgrade a DeepFM model (or other similar deep models) to U-rank.

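Metrics. We examine two metrics in the online evaluation: click-through rate, $\mathrm{CTR} = \frac{\#\text{downloads}}{\#\text{impressions}}$, and conversion rate, $\mathrm{CVR} = \frac{\#\text{downloads}}{\#\text{users}}$, where #downloads, #impressions and #users are the numbers of downloads, impressions and visited users, respectively.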
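For concreteness, the following sketch computes the two online metrics and the relative improvement of one group over another. It assumes, based on the quantities named above, that CTR is downloads divided by impressions and CVR is downloads divided by visited users; the function names and the toy counts are hypothetical.

```python
def ctr(downloads: int, impressions: int) -> float:
    """Click-through rate, assumed here to be downloads per impression."""
    return downloads / impressions


def cvr(downloads: int, users: int) -> float:
    """Conversion rate, assumed here to be downloads per visited user."""
    return downloads / users


def relative_improvement(experimental: float, control: float) -> float:
    """Relative lift of the experimental group over the control group."""
    return (experimental - control) / control


# Toy daily counts (made up for illustration only).
control_ctr = ctr(downloads=50, impressions=1000)
experimental_ctr = ctr(downloads=60, impressions=1000)
daily_lift = relative_improvement(experimental_ctr, control_ctr)
```

Averaging such a daily lift over the days of the A/B period gives a single summary figure per metric.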

Results. Figure 5 and Figure 6 show the improvement of the experimental group over the control group with respect to CTR and CVR, respectively. We can see that the system is rather stable, with both CTR and CVR fluctuating within 8% during the A/A testing. Our U-rank model is launched to the live system on Day 8. From Day 8, we observe a significant improvement over the baseline model with respect to both CTR and CVR. The average improvement of CTR is 19.2% and the average improvement of CVR is 20.8% over the ten days of A/B testing. These results clearly demonstrate the effectiveness of our proposed model in improving the total utility, which in this scenario is the number of downloads. From Day 18, we conduct A/A testing again, replacing our U-rank model with the baseline model in the experimental group. We observe a sharp drop in the performance of the experimental group, which once more verifies that the improvement in online performance in the experimental group is indeed introduced by our proposed model.

7. Conclusion

In this paper, we propose a novel framework, U-rank, which directly optimizes the expected utility of the ranked list without any extra assumption on relevance or examination. Specifically, U-rank first uses a position-aware deep CTR model to perform an unbiased estimation of the expected utility, and then optimizes the objective with an efficient algorithm based on a LambdaRank-like objective. Extensive studies on three benchmark datasets and two real-world datasets based on different scenarios have shown the effectiveness of our work. We also deploy this ranking framework on a commercial recommender system and observe a large utility improvement over the production baseline via online A/B testing. In future work, we plan to consider other biases, such as selection bias, and propose a more general debiasing framework.


The corresponding author Weinan Zhang thanks the support of NSFC (61702327, 61772333, 61632017). The work is also sponsored by Huawei Innovation Research Program.


  • Agarwal et al. (2019) Aman Agarwal, Ivan Zaitsev, Xuanhui Wang, Cheng Li, Marc Najork, and Thorsten Joachims. 2019. Estimating Position Bias without Intrusive Interventions. In WSDM.
  • Ai et al. (2018) Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W. Bruce Croft. 2018. Unbiased Learning to Rank with Unbiased Propensity Estimation. In SIGIR.
  • Bai et al. (2019) Xiao Bai, Reza Abasi, Bora Edizel, and Amin Mantrach. 2019. Position-aware deep character-level CTR prediction for sponsored search. TKDE (2019).
  • Borisov et al. (2016) Alexey Borisov, Ilya Markov, Maarten de Rijke, and Pavel Serdyukov. 2016. A Neural Click Model for Web Search. In WWW.
  • Burges et al. (2005) Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N Hullender. 2005. Learning to rank using gradient descent. In ICML.
  • Burges et al. (2007) Christopher J Burges, Robert Ragno, and Quoc V Le. 2007. Learning to rank with nonsmooth cost functions. In NeurIPS.
  • Chapelle and Zhang (2009) Olivier Chapelle and Ya Zhang. 2009. A Dynamic Bayesian Network Click Model for Web Search Ranking.
  • Craswell et al. (2008) Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An experimental comparison of click position-bias models. In WSDM.
  • Dupret and Piwowarski (2008) Georges E. Dupret and Benjamin Piwowarski. 2008. A User Browsing Model to Predict Search Engine Click Data from Past Observations. In SIGIR.
  • Fang et al. (2019) Zhichong Fang, Aman Agarwal, and Thorsten Joachims. 2019. Intervention Harvesting for Context-Dependent Examination-Bias Estimation. In SIGIR.
  • Guo et al. (2009) Fan Guo, Chao Liu, Anitha Kannan, Tom Minka, Michael Taylor, Yi-Min Wang, and Christos Faloutsos. 2009. Click Chain Model in Web Search. In WWW.
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In IJCAI.
  • Guo et al. (2019) Huifeng Guo, Jinkai Yu, Qing Liu, Ruiming Tang, and Yuzhou Zhang. 2019. PAL: A Position-Bias Aware Learning Framework for CTR Prediction in Live Recommender Systems. In RecSys.
  • Hu et al. (2019) Ziniu Hu, Yang Wang, Qu Peng, and Hang Li. 2019. Unbiased LambdaMART: An Unbiased Pairwise Learning-to-Rank Algorithm. In WWW.
  • Joachims (2006) Thorsten Joachims. 2006. Training Linear SVMs in Linear Time. In KDD.
  • Joachims et al. (2017) Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In WSDM.
  • Karatzoglou et al. (2013) Alexandros Karatzoglou, Linas Baltrunas, and Yue Shi. 2013. Learning to rank for recommender systems. In RecSys.
  • Karmaker Santu et al. (2017) Shubhra Kanti Karmaker Santu, Parikshit Sondhi, and ChengXiang Zhai. 2017. On application of learning to rank for e-commerce search. In SIGIR.
  • Kuhn (1955) Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Naval research logistics quarterly (1955).
  • Liu (2011) Tie-Yan Liu. 2011. Learning to rank for information retrieval. Springer Science & Business Media.
  • Lorigo et al. (2008) Lori Lorigo, Maya Haridasan, Hrönn Brynjarsdóttir, Ling Xia, Thorsten Joachims, Geri Gay, Laura Granka, Fabio Pellacini, and Bing Pan. 2008. Eye tracking and online search: Lessons learned and challenges ahead. Journal of the American Society for Information Science and Technology (2008).
  • Lorigo et al. (2006) Lori Lorigo, Bing Pan, Helene Hembrooke, Thorsten Joachims, Laura Granka, and Geri Gay. 2006. The influence of task and gender on search and evaluation behavior using Google. Information Processing & Management (2006).
  • Lucchese et al. (2016) Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri, and Salvatore Trani. 2016. Post-Learning Optimization of Tree Ensembles for Efficient Ranking. In SIGIR.
  • Metrikov et al. (2014) Pavel Metrikov, Fernando Diaz, Sebastien Lahaie, and Justin Rao. 2014. Whole Page Optimization: How Page Elements Interact with the Position Auction.
  • Munkres (1957) James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics (1957).
  • Qu et al. (2018) Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2018. Product-based neural networks for user response prediction over multi-field categorical data. TOIS (2018).
  • Richardson et al. (2007) Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting clicks: Estimating the click-through rate for new ads. In WWW.
  • Robertson (1977) Stephen E Robertson. 1977. The probability ranking principle in IR. Journal of documentation (1977).
  • Tagami et al. (2013) Yukihiro Tagami, Shingo Ono, Koji Yamamoto, Koji Tsukamoto, and Akira Tajima. 2013. Ctr prediction for contextual advertising: Learning-to-rank approach. In Proceedings of the Seventh International Workshop on Data Mining for Online Advertising.
  • Wang et al. (2016) Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to Rank with Selection Bias in Personal Search. In SIGIR.
  • Wang et al. (2018a) Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. 2018a. Position Bias Estimation for Unbiased Learning to Rank in Personal Search. In WSDM.
  • Wang et al. (2018b) Xuanhui Wang, Cheng Li, Nadav Golbandi, Mike Bendersky, and Marc Najork. 2018b. The LambdaLoss Framework for Ranking Metric Optimization. In CIKM.
  • Wu et al. (2018) Liang Wu, Diane Hu, Liangjie Hong, and Huan Liu. 2018. Turning clicks into purchases: Revenue optimization for product search in e-commerce. In SIGIR.
  • Zhang et al. (2016) Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep learning over multi-field categorical data. In ECIR.
  • Zhao et al. (2019) Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. 2019. Recommending what video to watch next: a multitask ranking system. In RecSys.