1. Introduction
Ranking is the core of information retrieval. In the traditional web search scenario, learning to rank (LETOR) methods are proposed to optimize the ranked list based on human-annotated relevance labels (Liu, 2011). Typically, these methods sort the documents by their probability of relevance in descending order, following the well-known probabilistic ranking principle (PRP) (Robertson, 1977). Because annotated labels are scarce, many recent works have focused on learning to rank from implicit feedback, such as users' click data, which is timely and personalized. These works are also based on PRP, where the relevance is estimated from implicit feedback through counterfactual methods (Wang et al., 2016; Joachims et al., 2017; Wang et al., 2018a; Ai et al., 2018; Hu et al., 2019). Besides the traditional web search scenario, ranking is now also an important part of many real-world applications, including recommender systems (Karatzoglou et al., 2013), online advertising (Tagami et al., 2013) and product search (Karmaker Santu et al., 2017). In these applications, specific utility metrics (such as clicks, conversions and revenue) are proposed, by which the quality of a ranked list is evaluated.
In many existing works, LETOR algorithms are derived on the basis of PRP and then evaluated on some utility metrics (Karmaker Santu et al., 2017; Zhao et al., 2019). However, we find that the PRP-based ranking framework does not necessarily bring the highest utility in reality. To be more specific, PRP is basically correct for items with large differences in estimated relevance. However, for two items with close relevance estimates, putting the item that is more sensitive to position change at the higher position brings a higher expected utility, even if it is slightly less relevant. As a concrete example, we show the average click curves of five popular apps from a mainstream App Store in the right panel of Figure 1. For simplicity, consider a non-personalized case in which we recommend App 1, App 2, and App 3 to one user. If the apps are sorted by PRP, the ranked list will be App 1, App 2, and App 3, ordered by their relevance in terms of click-through rate (CTR). However, the optimal ranked list with the maximum utility is App 1, App 3, and App 2, since the utility gain of promoting App 3 from the 3rd to the 2nd position is 0.019, which is larger than the utility loss of 0.010 from demoting App 2 from the 2nd to the 3rd position. As can be seen, sorting items by relevance may fail to achieve the highest utility in some situations, which are actually quite common in industrial scenarios. Therefore, we aim to optimize objectives that are directly related to the utility, based on users' implicit feedback.
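The App Store example can be replayed with a small exhaustive search. The click curves below are hypothetical: the actual curves of Figure 1 are not reproduced in the text, so the numbers were chosen only to match the stated gain of 0.019 and loss of 0.010. The names `ctr` and `expected_clicks` are illustrative.

```python
from itertools import permutations

# Hypothetical click curves: ctr[item][position], positions listed top to bottom.
ctr = [
    [0.300, 0.250, 0.220],  # App 1
    [0.200, 0.150, 0.140],  # App 2: nearly flat between positions 2 and 3
    [0.180, 0.139, 0.120],  # App 3: steeper between positions 2 and 3
]

def expected_clicks(order):
    """Sum of click probabilities when order[pos] is the item shown at pos."""
    return sum(ctr[item][pos] for pos, item in enumerate(order))

# PRP-style ordering: sort by CTR at the top position, used here as a relevance proxy.
prp_order = tuple(sorted(range(len(ctr)), key=lambda i: -ctr[i][0]))

# Utility-optimal ordering found by trying every permutation (fine for 3 items).
best_order = max(permutations(range(len(ctr))), key=expected_clicks)

print("PRP order:", prp_order, "expected clicks:", expected_clicks(prp_order))
print("Best order:", best_order, "expected clicks:", expected_clicks(best_order))
```

With these assumed curves, App 2 is more relevant than App 3 at every position, yet swapping them at positions 2 and 3 raises the expected number of clicks.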
Optimizing a utility metric such as the CTR of the whole ranked list is not as easy as it seems. One direct solution might be to estimate a single CTR for each query-item pair and then optimize the chosen metric w.r.t. the whole list using the estimated CTR (Wu et al., 2018). However, this solution introduces bias, since a user's CTR is not a static property like relevance: for the same query-item pair, the CTR may change with the presented position. As shown in Figure 1, the CTR decreases as the presented position goes from top to bottom, and moreover, the magnitude of the decrease differs among items and device types.
To design effective utility-oriented algorithms, we need to understand why this phenomenon happens and how to deal with it. The decrease of a user's CTR mainly results from the decrease of user attention, which is supported by eye-tracking studies (Lorigo et al., 2008, 2006). Most existing works have treated such attention bias as position bias (Wang et al., 2016; Joachims et al., 2017), i.e., more attention is paid to the top positions than to the bottom ones. In the literature, position bias is considered to be independent of the ranked items, i.e., it has the same effect on all items (Richardson et al., 2007; Craswell et al., 2008), which is generally correct in the traditional ten-blue-links scenario. Under this assumption, following PRP achieves the highest expected utility, since the click curves of different items have the same shape across positions and differ only in scale.
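The statement that PRP is optimal when all items share one click-curve shape can be checked on a toy position-based model, in which the click probability is the product of a position-only examination term and an item-only relevance term. The vectors `examination` and `relevance` below are invented for illustration.

```python
from itertools import permutations

# Hypothetical position bias shared by all items (top position first).
examination = [1.0, 0.6, 0.3]
# Hypothetical relevance of three items.
relevance = [0.20, 0.50, 0.35]

def utility(order):
    """Expected clicks when order[k] is the item placed at position k."""
    return sum(examination[k] * relevance[i] for k, i in enumerate(order))

# Ranking by relevance in descending order (PRP).
prp = tuple(sorted(range(len(relevance)), key=lambda i: -relevance[i]))

# Exhaustive search over all orderings.
best = max(permutations(range(len(relevance))), key=utility)

print(prp, best)
```

Here the exhaustive optimum coincides with the relevance-sorted list, because every item's curve is the same examination vector scaled by its relevance. Once the curve shape depends on the item, this coincidence is no longer guaranteed.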
However, we argue that in many real-world applications, a user's attention to items depends not only on the positions but also on the item attributes and the user contexts. For the App recommendation case demonstrated in Figure 1, visual differences in the thumbnail of a product or the preview frame of a video lead to item-specific attention bias. In web search, an example of item-specific attention bias is vertical bias, commonly observed when the page contains vertical search results (such as images, videos and maps). For example, Metrikov et al. (2014) found that an image in a search result can raise the CTR and flatten the click curve at the same time. Visually attractive content, like a vertical search result or an item with a fancy thumbnail, can still attract a user's attention even if it is placed at a lower position, leading to a flatter click curve. In other words, such visually attractive results are less sensitive to position change. In the example above, the CTR of App 2 is less sensitive than that of App 3 to whether it is placed at position 2 or position 3. Placing items whose CTR is more sensitive to position change at the top positions often leads to a higher utility. Besides, query-level features such as device type, as shown in Figure 1, also lead to different attention biases. Hence, to obtain unbiased CTR estimation, we need to exploit both the item-level and the query-level features to model the dependency between click and position.
Based on these considerations, in this work, we propose a ranking framework called Urank that directly optimizes expected utility from implicit feedback. Instead of ranking according to PRP, we first derive a new listwise learning objective whose expectation is the utility metric we want to maximize. Then, to obtain an unbiased estimation of the expected utility, we address the attention bias by considering both the query-level and item-level features with a position-aware deep CTR model. Finally, to efficiently optimize the expected utility, we formulate it as an item-position matching problem, as shown in Figure 2, and learn a scoring function towards the best matching through pairwise permutations inspired by the LambdaLoss framework (Wang et al., 2018b), which avoids running a matching algorithm in the inference stage. Theoretical analysis demonstrates that we solve an upper bound problem of the matching problem.
We conduct thorough experiments on three benchmark LETOR datasets and a large-scale real-world commercial recommendation dataset to verify the effectiveness of Urank. Further, Urank has been deployed on the recommender system of a mainstream App Store, where a 10-day online A/B test shows that Urank achieves an average improvement of 19.2% on CTR and 20.8% on conversion rate over the production baseline.
2. Related Work
Generative click models are introduced to study user browsing behavior and extract unbiased relevance feedback from click data. For example, the position-based model (PBM) (Richardson et al., 2007) assumes that a click depends only on the position and the relevance of the document. The cascade model (Craswell et al., 2008) assumes that the user browses a search result page sequentially from top to bottom until a relevant document is found. Following these two classical click models, more sophisticated ones (e.g., UBM (Dupret and Piwowarski, 2008), DBN (Chapelle and Zhang, 2009), CCM (Guo et al., 2009) and NCM (Borisov et al., 2016)) have been proposed. These click models estimate the relevance of each item in a pointwise manner instead of considering the relative order of the items as in pairwise or listwise approaches. Recently, a new line of research, referred to as counterfactual methods, utilizes inverse propensity score (IPS) weighting to address position bias in a learning to rank framework. Wang et al. (2016) and Joachims et al. (2017) proposed the IPW-based framework for debiasing click data in learning to rank. In both works, the propensity estimation relies on randomizing the search results displayed to users, which degrades users' search experience. Considering this, Agarwal et al. (2019) proposed to estimate the propensities of the PBM without intrusive interventions. CPBM (Fang et al., 2019), on the basis of PBM, learns a query-dependent propensity estimation. However, these methods require multiple rankers, which makes them inconvenient to deploy in real-world applications. Besides, another branch of unbiased learning to rank works (Wang et al., 2018a; Ai et al., 2018; Hu et al., 2019) jointly learns the propensity model with a relevance model, which results in biased estimation of the propensity unless the relevance estimation is very accurate.
3. Problem Formulation
When a user issues a new request, the system delivers a ranked list of items to the user according to a ranking model over all the candidate items. The feature vector of each item consists of item features, context features, and user/query features. A scalar utility value is associated with each item, e.g., the watch time of each video in video recommendation, or the bid price of each ad in sponsored search.
The users' click logs are a set of records, each of which contains the position at which an item was displayed and the user's implicit feedback on that item at that position, i.e., a binary indicator of click or non-click. To distinguish between the position of an item in the users' click logs and its position under the current ranking model, in the following parts we use separate symbols for the logged position and for the position assigned by the current ranking model.
The ultimate goal of this system is to find the best permutation of candidate items for each query to maximize the utility. The utility is defined as the expected sum of weighted clicks of each item over the whole ranked list, as follows,
(1) 
where the click term is the probability of an item being clicked if it is displayed at a given position. Maximizing the utility is equivalent to solving a maximum-weight matching problem on the item-position bipartite graph, where the edge weight between an item and a position is the expected weighted click of that item at that position, as shown in Figure 2.
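The matching view of the utility can be sketched as an exhaustive search over item-to-position assignments. This is only an illustration for tiny inputs: `best_assignment`, `values` and `click_prob` are hypothetical names, and the numbers are made up. The edge weight used here is the item's utility value times its click probability at the position, which is an assumed reading of the definition above.

```python
from itertools import permutations

def best_assignment(values, click_prob):
    """Exhaustively search item-to-position assignments.

    values[i]        : utility value of item i (e.g., a bid price)
    click_prob[i][k] : click probability of item i at position k
    Returns (utility, order), where order[k] is the item shown at position k.
    """
    n_items = len(values)
    n_positions = len(click_prob[0])
    best = None
    # Each ordered selection of n_positions items is one matching.
    for order in permutations(range(n_items), n_positions):
        utility = sum(values[i] * click_prob[i][k] for k, i in enumerate(order))
        if best is None or utility > best[0]:
            best = (utility, order)
    return best

# Three candidate items, two display positions (hypothetical numbers).
values = [1.0, 3.0, 2.0]
click_prob = [
    [0.30, 0.20],
    [0.12, 0.08],
    [0.15, 0.05],
]
utility, order = best_assignment(values, click_prob)
print(order, utility)
```

The search is factorial in the number of items, so it only serves to make the objective concrete. A polynomial-time matching algorithm or a learned scoring function is needed at realistic scale.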
4. Model Framework
In this section, we present a general ranking framework, i.e., Urank, to maximize the utility in Eq. (1) directly. Firstly, we derive an unbiased metric of utility from click logs, the expectation of which is the utility in Eq. (1). Secondly, we design an efficient learning to rank method to optimize this metric, whose loss function is an upper bound of the utility regret.
4.1. Unbiased Estimation of the Utility
The main difficulty of existing methods for learning to rank via implicit feedback lies in estimating the underlying attention bias (or position bias), since it cannot be observed directly from the data. With the new learning objective, we do not need to infer relevance or the attention bias explicitly. Instead, we have to deal with another mismatch problem, namely the mismatch between the CTR at the historical presented position and the CTR at the final presented position. For example, if one item is ranked first in the click logs but presented at the 10th position in the final ranking, then its utility is overestimated. To correct this bias, we need an accurate model of a user's CTR at different positions.
Estimating the CTR at different positions is an instance of one of the most well-studied tasks in recommender systems, i.e., CTR estimation. Deep CTR models (Zhang et al., 2016; Qu et al., 2018) can take the position and the rich query-item features as input to model the complex user interaction in feature space from the click logs. It has been pointed out that position is a very important feature in CTR estimation (Bai et al., 2019; Guo et al., 2019). However, if we directly use a CTR estimation model as the ranking model, the position feature is unavailable at the inference stage. Therefore, we design a position-aware deep CTR model as a debiasing module instead of directly using it as a ranking model. Assume that the click probability of an item displayed at a position is a parameterized function of the item feature and the position. Then we can estimate the parameters via standard cross-entropy minimization:
(2) 
where the loss is the cross-entropy loss between the predicted CTR and the observed click.
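To make the inputs and the training signal of a position-aware CTR model concrete, here is a deliberately small stand-in: a logistic model with one weight per feature and one bias per position, fitted by stochastic gradient descent on the cross-entropy loss. It is not the deep model used in the paper. Because its logit is additive in item and position, it cannot express item-specific position sensitivity, which a deep model with feature interactions is meant to capture. All names and data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_ctr(w_item, w_pos, x, k):
    """Predicted click probability of an item with features x at position k."""
    return sigmoid(sum(w * xi for w, xi in zip(w_item, x)) + w_pos[k])

def cross_entropy(logs, w_item, w_pos):
    """Mean log loss over click logs of (features, logged position, click)."""
    total = 0.0
    for x, k, c in logs:
        p = predict_ctr(w_item, w_pos, x, k)
        total -= c * math.log(p) + (1 - c) * math.log(1.0 - p)
    return total / len(logs)

def sgd_epoch(logs, w_item, w_pos, lr=0.5):
    """One pass of stochastic gradient descent; updates the weights in place."""
    for x, k, c in logs:
        err = predict_ctr(w_item, w_pos, x, k) - c  # gradient of the loss w.r.t. the logit
        for d, xi in enumerate(x):
            w_item[d] -= lr * err * xi
        w_pos[k] -= lr * err

# Toy click logs: (item features, logged position, click indicator).
logs = [
    ([1.0, 0.0], 0, 1),
    ([1.0, 0.0], 2, 0),
    ([0.0, 1.0], 0, 1),
    ([0.0, 1.0], 1, 0),
]
w_item, w_pos = [0.0, 0.0], [0.0, 0.0, 0.0]
loss_before = cross_entropy(logs, w_item, w_pos)
for _ in range(50):
    sgd_epoch(logs, w_item, w_pos)
loss_after = cross_entropy(logs, w_item, w_pos)
print(loss_before, loss_after)
```

After fitting, `predict_ctr` can be queried at any position for any item, which is what the debiasing step requires.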
Based on users’ click logs and the estimated CTR, we derive an unbiased metric of utility as
(3) 
We prove that the derived metric is unbiased w.r.t. the utility in Eq. (1) by showing that its expectation is equal to that utility, as
(4)  
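One estimator consistent with the description above reweights each logged click by the ratio of the estimated CTR at the new position to the estimated CTR at the logged position. The exact form of Eq. (3) is not reproduced in this text, so the sketch below is an assumption rather than the paper's formula. The function names and numbers are hypothetical.

```python
def reweighted_utility(values, clicks, ctr_logged, ctr_new):
    """Reweight each logged click by (CTR at new position) / (CTR at logged position)."""
    return sum(
        v * c * p_new / p_old
        for v, c, p_old, p_new in zip(values, clicks, ctr_logged, ctr_new)
    )

def expected_estimate(values, ctr_logged, ctr_new):
    """Exact expectation of the estimator over the logged clicks.

    Each logged click is Bernoulli with mean ctr_logged, and the estimator is
    linear in the clicks, so the mean can be plugged in for the click indicator.
    """
    return reweighted_utility(values, ctr_logged, ctr_logged, ctr_new)

values = [1.0, 2.0, 0.5]        # utility value of each item
ctr_logged = [0.30, 0.10, 0.20]  # CTR at the position shown in the logs
ctr_new = [0.20, 0.15, 0.25]     # CTR at the position in the new ranking

true_utility = sum(v * p for v, p in zip(values, ctr_new))
print(expected_estimate(values, ctr_logged, ctr_new), true_utility)
```

The check only shows unbiasedness when the CTR model is exact. With an estimated model, the quality of the correction depends on how well the model predicts the CTR at both positions.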
4.2. Learning to Optimize the Utility
One straightforward way to optimize the unbiased utility metric is to run a maximum-weight matching algorithm (e.g., the Kuhn-Munkres algorithm (Kuhn, 1955; Munkres, 1957)) on the bipartite graph directly (each query corresponds to one graph), given the estimated click probabilities. However, the complexity of running such a graph matching algorithm in the inference stage is cubic in the number of candidate items, which is unacceptable in a production system. Therefore, in this section, we propose a parameterized scoring function to approximate the maximum-weight matching procedure on each query, still aiming at maximizing the utility, so that inference only requires scoring the items and sorting them. For each item, the scoring function maps its feature vector to a ranking score. For each query, we compute the score of each item, and the result list is generated by sorting the items by score in descending order.
According to Eq. (3), we define the utility of displaying an item at a given position as the corresponding term of the unbiased metric. With the optimal position of each item given by the graph matching algorithm, the regret of the utility is defined as
(5) 
Minimizing the regret of the utility directly is infeasible because positions are discrete values. Therefore, we adapt the LambdaLoss framework (Wang et al., 2018b) to learn a ranking model towards the optimal ranking by optimizing our proposed loss function (presented in Eq. (7)) with iterative pairwise permutation. As in LambdaLoss, we follow an EM procedure: in the E step we obtain the ranked list based on the current scoring function, and in the M step we re-estimate the scoring function to minimize our loss function. The learning procedure of Urank is as follows.
We first initialize the scoring function with random parameters. Inspired by the reweighting technique used in LambdaRank (Burges et al., 2007), we compute the difference in the unbiased utility when the positions of two items are swapped, as
(6) 
Then this difference value is used as the weight in the pairwise loss for each pair of items. Following (Burges et al., 2005; Burges et al., 2007), we design our loss function in the form of logistic loss, as
(7) 
where the two position variables denote the positions assigned to the two items by the ranking model at the last step (i.e., by the current scoring function). Minimizing this loss yields a new scoring function, and the process is repeated until convergence.
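The swap-weighted pairwise loss can be sketched as follows. The exact forms of Eq. (6) and Eq. (7) are not reproduced in this text, so this is an assumed reading of the description: only pairs whose swap would increase utility contribute, the lower-ranked item of such a pair acts as the positive item, and the logistic loss is weighted by the utility gain. `swap_gain`, `pairwise_loss` and the matrix `util` are hypothetical.

```python
import math

def swap_gain(util, pos, i, j):
    """Utility change if items i and j exchange positions; util[item][position]."""
    before = util[i][pos[i]] + util[j][pos[j]]
    after = util[i][pos[j]] + util[j][pos[i]]
    return after - before

def pairwise_loss(scores, util):
    # E step: positions induced by sorting the current scores in descending order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = {item: k for k, item in enumerate(order)}
    loss = 0.0
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            high, low = order[a], order[b]  # low is currently ranked below high
            gain = swap_gain(util, pos, high, low)
            if gain > 0:
                # Swapping would raise utility, so low acts as the positive item:
                # the logistic term pushes its score above that of high.
                loss += gain * math.log1p(math.exp(-(scores[low] - scores[high])))
    return loss

# Hypothetical per-item utility at each position (value times estimated CTR).
util = [
    [0.30, 0.24, 0.22],
    [0.20, 0.15, 0.14],
    [0.18, 0.139, 0.12],
]
print(pairwise_loss([3.0, 2.0, 1.0], util))  # one beneficial swap remains
print(pairwise_loss([3.0, 1.0, 2.0], util))  # no beneficial swap remains
```

In a full training loop, the M step would take gradient steps on this loss with respect to the parameters of the scoring function, and the E step would then recompute the positions.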
Notice that in a standard LambdaLoss framework, the LambdaLoss is defined as
(8) 
Note that the differences between our objective (7) and the LambdaLoss objective (8) lie in (i) the subscript of the summation symbol and (ii) the absolute value symbol around the difference term. In the LambdaLoss framework, the pairwise label of each item pair is determined: the optimal ranking order is known by ranking all the items according to their relevance labels or click labels in descending order. In our framework, however, we cannot obtain an explicit label for an item. An item is treated as the positive item if it is placed at a lower position by the scoring function and swapping the item pair brings a utility gain, and vice versa. We cannot access the optimal ranking order of each query beforehand; instead, the optimal order is approached through iterative pairwise permutation.
4.3. Theoretical Analysis
In this section, we theoretically prove that our proposed training objective is an upper bound of the utility regret. To make the proof easier to understand, we construct a function:
(9) 
We start with several lemmas which will be used in our proof.
Lemma 4.1.
Given an indicator function and a function where is a constant in , it holds that for all .
Lemma 4.2.
Given an indicator function and a function where is a constant in , it holds that for .
Lemma 4.3.
Given a sum function and a max function , it holds that for .
Now we are ready to derive the main theoretical result.
Theorem 4.4.
Assume is a monotonic decreasing function w.r.t and the ranking score is bounded in the range of [-C, C]. Let and . Then we have .
Theorem 4.4 states that is upper bounded by our objective plus . is a constant since in the M step only depends on the current scoring function . Notice that the assumptions in the theorem are not restrictive in practice. As illustrated in Figure 1, the real utility basically satisfies the monotonic decreasing assumption. Moreover, the ranking score is often clipped in implementation to avoid explosion in exponential function.
Theorem 4.5.
Assume the utility is a monotonic decreasing function w.r.t . Then is upper bounded by .
Proof.
(12) 
where the first inequality holds due to Lemma 4.3. ∎
Ranking model  |  Method  |  Yahoo! set 1 MAP  |  Yahoo! set 1 nDCG  |  Yahoo! set 1 # Click  |  Yahoo! set 1 CTR  |  MSLR-WEB10K MAP  |  MSLR-WEB10K nDCG  |  MSLR-WEB10K # Click  |  MSLR-WEB10K CTR  |  Istella-S MAP  |  Istella-S nDCG  |  Istella-S # Click  |  Istella-S CTR
SVMRank  |  None  |  0.702  |  0.845  |  0.599  |  0.0641  |  0.498  |  0.735  |  0.834  |  0.0827  |  0.773  |  0.808  |  0.931  |  0.0939
SVMRank  |  Randomization  |  0.639  |  0.787  |  0.544  |  0.0574  |  0.433  |  0.686  |  0.820  |  0.0799  |  0.742  |  0.787  |  0.909  |  0.0910
SVMRank  |  CPBM  |  0.701  |  0.843  |  0.594  |  0.0637  |  0.477  |  0.721  |  0.751  |  0.0746  |  0.752  |  0.793  |  0.923  |  0.0919
SVMRank  |  Groundtruth  |  0.718  |  0.859  |  0.612  |  0.0656  |  0.515  |  0.748  |  0.872  |  0.0870  |  0.775  |  0.816  |  0.958  |  0.0952
LambdaRank  |  None  |  0.700  |  0.847  |  0.606  |  0.0641  |  0.498  |  0.736  |  0.834  |  0.0828  |  0.776  |  0.810  |  0.945  |  0.0941
LambdaRank  |  Randomization  |  0.680  |  0.828  |  0.582  |  0.0621  |  0.451  |  0.700  |  0.813  |  0.0808  |  0.748  |  0.793  |  0.924  |  0.0923
LambdaRank  |  CPBM  |  0.718  |  0.857  |  0.613  |  0.0651  |  0.514  |  0.744  |  0.836  |  0.0836  |  0.779  |  0.813  |  0.941  |  0.0941
LambdaRank  |  Groundtruth  |  0.719  |  0.859  |  0.618  |  0.0657  |  0.521  |  0.748  |  0.882  |  0.0883  |  0.781  |  0.815  |  0.948  |  0.0948
DNN  |  DLA  |  0.639  |  0.782  |  0.553  |  0.0589  |  0.430  |  0.682  |  0.830  |  0.0839  |  0.676  |  0.703  |  0.828  |  0.0824
DNN  |  CTR1  |  0.647  |  0.792  |  0.551  |  0.0577  |  0.477  |  0.722  |  0.829  |  0.0814  |  0.733  |  0.771  |  0.894  |  0.0895
DNN  |  Urank  |  0.719  |  0.861*  |  0.618*  |  0.0659*  |  0.492  |  0.725  |  0.903*  |  0.0915*  |  0.783*  |  0.816*  |  0.959*  |  0.0953*
KM (oracle model)  |  -  |  0.935  |  0.987  |  0.684  |  0.0737  |  0.710  |  0.723  |  0.976  |  0.0969  |  0.993  |  0.995  |  1.132  |  0.1126
The proof of Theorem 4.6 is trivial due to Theorem 4.4 and Theorem 4.5. Theorem 4.6 demonstrates that the utility regret is bounded by our proposed objective plus a constant. It implies that optimizing our proposed objective is actually minimizing the upper bound of the utility regret, which guarantees the effectiveness of Urank theoretically.
5. Semisynthetic Experiments
The semi-synthetic setup is widely applied in unbiased learning to rank (Fang et al., 2019), which allows us to explore different settings. Code for our experiments is available at https://github.com/xydaisjtu/Urank.
5.1. Datasets

Yahoo! LETOR set 1 (https://webscope.sandbox.yahoo.com) is used in the Yahoo! Learning-to-Rank Challenge. The dataset consists of 700 features normalized in [0, 1], which are extracted from query-document pairs.

MSLR-WEB10K (https://www.microsoft.com/enus/research/project/mslr/) is a large-scale dataset released by Microsoft Research. It contains 10,000 queries and 1,200,193 documents. There are 136 features extracted from query-document pairs.

Istella-S LETOR (http://quickrank.isti.cnr.it/istelladataset/) (Lucchese et al., 2016) is one of the largest publicly available datasets. Istella-S is composed of 33,018 queries, where for each query-document pair there are 220 features.
5.2. Click Data Generation
We mainly follow Fang et al. (2019) to generate synthetic click data with item-specific attention bias for the three datasets. In the following, the oracle model refers to this click generation model. Similar to Fang et al. (2019), the attention bias is a function of both the position and a feature vector weighted by a parameter vector. In our setting, the features are item features, while in the setting of Fang et al. (2019), they are query features shared among all the items of the same query. The parameter vector is drawn from a uniform distribution and then normalized. Following Hu et al. (2019), the relevance probability is defined as a function of the relevance label of the document and the highest level of relevance, with a noise parameter set to 0.1, which denotes the CTR of irrelevant documents. The overall CTR is calculated from the attention bias and the relevance probability. The maximal position is set to be 10.
5.3. Baselines
We implement eight baselines that combine two standard learning to rank methods (i.e., SVMRank (Joachims, 2006) and LambdaRank (Burges et al., 2007)) with four propensity estimation methods, detailed as follows. (1) None uses the original click data without debiasing. (2) Randomization (Joachims et al., 2017) estimates the propensity with online randomized experiments. (3) CPBM (Fang et al., 2019) estimates the examination probability w.r.t. different queries based on intervention harvesting. (4) Groundtruth uses the ground-truth examination probability of the oracle model as the propensity. The result of this method is the upper bound of the results of other IPS approaches based on the same ranking model. Other methods we implement include (5) CTR1, the position-aware click model used in our framework, which assigns position 1 to each item during online inference, and (6) DLA (Ai et al., 2018), a dual learning algorithm that jointly learns an unbiased ranker and an unbiased propensity model. We also explore the performance of KM (oracle model), which solves the maximum-weight graph matching problem via the Kuhn-Munkres (KM) algorithm (Kuhn, 1955; Munkres, 1957), given the ground-truth CTR. It is supposed to produce the best utility we can achieve on the test data.
5.4. Overall Performance
In this section, we assume the utility value of each item to be 1 in order to compare Urank consistently and fairly with existing (unbiased) learning to rank methods. We evaluate the performance of the baseline approaches and Urank in terms of relevance-based metrics, i.e., MAP and nDCG (nDCG denotes nDCG@10), and utility-based metrics, i.e., # Click and CTR. Here, # Click and CTR are utility metrics based on the oracle click model, denoting clicks per query and CTR per document, respectively. The overall performance on the three benchmark datasets is shown in Table 1. Firstly, Urank consistently achieves the best performance over the state-of-the-art baseline approaches on the utility-based metrics, i.e., # Click and CTR. For example, Urank achieves a 1% improvement on Yahoo! LETOR set 1 and an 8.3% improvement on MSLR-WEB10K in CTR compared to the best baseline methods (the baselines in italics use information from the oracle click model, so they are not included in this comparison). Secondly, Urank also outperforms most of the baselines in terms of the relevance-based metrics, i.e., MAP and nDCG, though it does not always perform the best, especially on the MSLR-WEB10K dataset, where the disagreement between the utility-based metrics and the relevance-based metrics is larger than on the other two datasets. Thirdly, the method Groundtruth achieves the best utility among the counterfactual learning approaches, demonstrating the effectiveness of the IPS-based framework when the propensity estimation is accurate. Randomization fails to perform well because it assumes that the user's attention depends only on the position, which is not true in our setting, where the user's attention depends on both the position and the item feature. CPBM achieves the second-best utility among the IPS-based methods since it models the propensity of each query by taking query features into consideration. Lastly, Urank and CTR1 share the same click model. However, Urank outperforms CTR1, mainly because CTR1 ranks items by their estimated CTR at position 1, which is suboptimal in the presence of item-specific attention bias. Urank also outperforms DLA, since DLA relies heavily on the accuracy of the estimated propensity, which is hard to achieve.
5.5. Empirical Analysis
RQ1: How does our model achieve higher CTR? In Figure 4, we show the average CTR at each position for Urank and LambdaRank (Groundtruth), the upper bound of the counterfactual learning methods in Table 1. We also plot the results of KM (oracle model) for reference.
Firstly, comparing the results of the two datasets in Figure 4, we observe a steeper decline of the average CTR over positions for the KM (oracle model) method on the Yahoo dataset than on the MSLR dataset. This suggests that on the Yahoo dataset, positions have a very strong impact on users' clicks. Thus, to optimize the utility, a well-performing approach should put more relevant items at higher positions. In MSLR-WEB10K, on the other hand, the average CTR of the optimal matching tends to be more evenly distributed over the positions than in the Yahoo dataset. We find that our method adapts to different severities of position bias. On the Yahoo dataset, our model focuses more on the top positions than LambdaRank does, while on the MSLR dataset, our model learns a flatter distribution. Notably, on both datasets, our model achieves a larger sum of click probabilities over all the positions than LambdaRank.
Secondly, we analyze the result of a single query in detail. The experiment is conducted on the first query of the MSLR-WEB10K dataset. Figure 3 shows the ten items of this query and their click probabilities if placed at each position according to our oracle click data generation model. The position assigned to each item by the different methods is marked in orange. We can see that although LambdaRank with the ground-truth propensity performs better in nDCG, it achieves a lower CTR than our method Urank. This is because, similar to KM (oracle model), Urank takes the position sensitivity of different items into consideration. For example, document 6 is of high relevance and relatively insensitive to position change. LambdaRank displays it at the second position, while our method and KM both display it at a lower position, so that the second position is kept for an item that is more sensitive to position change.
AUC 

CTR  

train  test  validation  
A1  0.695  0.684  0.687  0.903  0.915  
A2  0.701  0.693  0.692  0.852  0.847 
RQ2: What kind of architecture should we use to implement the position-aware click estimation?
We implement two kinds of architecture for the position-aware click estimation. A1 is a neural network that takes the item features as input and outputs a vector with one entry per position, where each entry denotes the CTR of the item at the corresponding position. A2 is also a neural network, which takes the concatenation of the item features and the one-hot encoded position as input and outputs a single value, representing the CTR of the item at the given position. The result on MSLR-WEB10K is presented in Table 2. Although A2 achieves better AUC, we use A1 as the click model to pursue higher utility.
6. Real-world Deployment
Ranking model  |  Method  |  Scenario 1 (without bid) # click  |  # click@1  |  # click@3  |  # click@5  |  Scenario 2 (with bid) Revenue  |  Revenue@1  |  Revenue@3  |  Revenue@5
SVMRank  |  None  |  1.586  |  0.500  |  1.107  |  1.348  |  3.602  |  0.959  |  2.177  |  2.788
SVMRank  |  Groundtruth  |  1.617  |  0.536  |  1.154  |  1.386  |  3.619  |  1.015  |  2.229  |  2.825
LambdaRank  |  None  |  1.750  |  0.701  |  1.327  |  1.556  |  3.586  |  0.964  |  2.178  |  2.774
LambdaRank  |  Groundtruth  |  1.826  |  0.781  |  1.429  |  1.640  |  3.637  |  1.009  |  2.245  |  2.837
DLA  |  -  |  1.665  |  0.624  |  1.338  |  1.520  |  3.631  |  0.958  |  2.233  |  2.827
DeepFM  |  -  |  1.790  |  0.762  |  1.379  |  1.593  |  3.753  |  1.131  |  2.289  |  2.881
Urank  |  -  |  1.859*  |  0.841*  |  1.474*  |  1.676*  |  3.966*  |  1.264*  |  2.607*  |  3.214*

* denotes statistically significant improvement (measured by t-test with p-value < 0.05) over all baselines.
In order to verify the effectiveness of our proposed model in realworld applications, we conduct experiments in two recommendation scenarios in Company X’s App Store. This App Store has hundreds of millions of daily active users who create hundreds of billions of user logs every day in the form of implicit feedback such as browsing, clicking, and downloading behaviors.
6.1. Offline Evaluation
Setups. We conduct offline experiments based on two recommendation scenarios with different utility settings. In Scenario 1, we only consider the downloads of the Apps as the utility, while in Scenario 2, the bid price of each App download is also considered. In both scenarios, we use seven consecutive days' data for training and the eighth day's data for testing. As in the semi-synthetic experiments, we implement two LETOR methods, i.e., SVMRank and LambdaRank, as baselines. The propensity estimation method Randomization is not applicable here, since we are not allowed to randomly swap two items of a ranked list in a live recommender system. Similarly, CPBM is not applicable either, since in practice we cannot obtain the ranking results of the same user from multiple rankers at the same time. Thus, we only compare Urank with the ranker learned from biased click data, i.e., None, and the ranker with ground-truth propensity, i.e., Groundtruth. The ground-truth propensity is the same as the propensity used for evaluation, which is described in the next paragraph. DeepFM is included as a baseline because it is the production baseline in this online App recommendation system. It is trained with the position feature and uses the default position 1 in the inference stage. To make a fair comparison, we use the DeepFM architecture for both the click model and the ranking model in Urank.
Metrics. Unlike in the semi-synthetic experiments, here we do not know the underlying user click model. Thus, we have to debias the click data generated by a historical ranker with a pre-estimated propensity to obtain the click signals at the new positions. We estimate the propensity for each category of items from 120 days' click data on random traffic. This category-wise propensity estimation is a coarse approximation of the ground-truth propensity, which is not available. The propensity is defined per category, i.e., as a function of the position and of the category to which the item belongs. This propensity is used only for evaluation, except for the Groundtruth methods in Table 3, where it is also used for debiasing.
The utility in Scenario 1 is defined as the expected number of debiased clicks at the top positions in a session, reported as # click@1, # click@3 and # click@5; # click denotes the case without a cutoff. The utility in Scenario 2 is defined as the expected revenue at the top positions in a session after debiasing, where each debiased click is weighted by the bid of the item; it is reported as Revenue@1, Revenue@3 and Revenue@5, and Revenue denotes the case without a cutoff.
Results. The overall performance on the two real-world datasets is shown in Table 3. We have the following observations. Firstly, Urank achieves the best performance compared to the state-of-the-art baselines. Specifically, in Scenario 1, Urank achieves 1.8%, 7.7%, 3.1% and 2.2% improvement over the best baseline method in terms of # click, # click@1, # click@3 and # click@5, respectively. In the experiment with bid in Scenario 2, the improvements are 2.5%, 13%, 6.7% and 3.9% in terms of Revenue, Revenue@1, Revenue@3 and Revenue@5, respectively. These results demonstrate the superiority of our approach over the baselines in optimizing the utility, which motivates us to deploy Urank in the live recommender system. Secondly, Urank performs better than DeepFM because Urank considers item-specific attention bias, while DeepFM learns from biased data. We do not elaborate on the other results since they are consistent with the results of the semi-synthetic experiments.
6.2. Online Evaluation
Setups. We conduct A/B testing in a recommendation scenario in Company X’s App store, comparing the proposed model Urank with the current production baseline DeepFM (Guo et al., 2017) that supports multiple scenarios such as “Musthave Apps” and “Novel and Fun”. The whole online experiment lasts 24 days, from May 6, 2020 to May 29, 2020. We monitor the results of A/A testing for the first seven days, conduct A/B testing for the following ten days, and conduct A/A testing again in the last seven days. 15% of the users are randomly selected as the experimental group and another 15% of the users are in the control group. During A/A testing, all the users are served by DeepFM model (Guo et al., 2017). During A/B testing, users in the control group are presented with recommendation by DeepFM, while users in the experimental group are presented with the recommendation by our proposed model Urank. Note that the click model of Urank shares the same network architecture and parameter complexity with DeepFM in order to verify whether the improvement is brought by the objective function design of the ranker in Urank.
To deploy Urank, we utilize a single node with a 48-core Intel Xeon E5-2670 CPU (2.30 GHz), 400 GB RAM and 2 NVIDIA Tesla V100 GPU cards, which is the same as the training environment of the baseline DeepFM. For model training, Urank requires only minor changes to the current training procedure due to the pairwise loss function. For model inference, Urank shares the same pipeline as DeepFM, which means that no extra engineering work is needed at inference time to upgrade the DeepFM model (or other similar deep models) to Urank.
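The A/A, A/B, A/A schedule and the 15%/15% traffic split described above can be sketched as follows. This is a minimal illustration, not the production setup: the hash-based bucketing, the `salt` value, and the function names are hypothetical, and serving users outside both groups with DeepFM is an assumption (the text only states that users are selected randomly).

```python
import hashlib
from datetime import date

START = date(2020, 5, 6)  # Day 1 of the 24-day online experiment


def phase(day: date) -> str:
    """Return the test phase: A/A on Days 1-7, A/B on Days 8-17, A/A on Days 18-24."""
    n = (day - START).days + 1
    if not 1 <= n <= 24:
        raise ValueError("day is outside the 24-day experiment")
    return "A/B" if 8 <= n <= 17 else "A/A"


def group(user_id: str, salt: str = "urank-exp") -> str:
    """Hypothetical deterministic split: 15% experimental, 15% control, rest unassigned."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < 15:
        return "experimental"
    if bucket < 30:
        return "control"
    return "none"


def model_for(user_id: str, day: date) -> str:
    """Urank serves only the experimental group during A/B; DeepFM serves everyone else."""
    if group(user_id) == "experimental" and phase(day) == "A/B":
        return "Urank"
    return "DeepFM"
```

Hashing the user identifier keeps each user in the same group for the whole experiment, which is what allows the two A/A periods to act as a before-and-after check on the same populations.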
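Metrics. We examine two metrics in the online evaluation: click-through rate, $\mathrm{CTR} = \frac{\#\text{downloads}}{\#\text{impressions}}$, and conversion rate, $\mathrm{CVR} = \frac{\#\text{downloads}}{\#\text{users}}$, where #downloads, #impressions and #users are the numbers of downloads, impressions and visiting users, respectively.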
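As a small worked sketch of these metrics, the snippet below computes CTR, CVR and the relative improvement of the experimental group over the control group. It assumes CTR is downloads per impression and CVR is downloads per visiting user, which is how the three counts named above naturally combine; the function names and the daily counts are made up for illustration and are not taken from the experiment.

```python
def ctr(downloads: int, impressions: int) -> float:
    """Click-through rate, assumed here to be downloads per impression."""
    return downloads / impressions


def cvr(downloads: int, users: int) -> float:
    """Conversion rate, assumed here to be downloads per visiting user."""
    return downloads / users


def relative_improvement(experimental: float, control: float) -> float:
    """Relative lift of the experimental group over the control group."""
    return (experimental - control) / control


# Illustrative (made-up) counts for one day of traffic in each group.
control = {"downloads": 100, "impressions": 2000, "users": 500}
experimental = {"downloads": 120, "impressions": 2000, "users": 500}

ctr_lift = relative_improvement(
    ctr(experimental["downloads"], experimental["impressions"]),
    ctr(control["downloads"], control["impressions"]),
)
cvr_lift = relative_improvement(
    cvr(experimental["downloads"], experimental["users"]),
    cvr(control["downloads"], control["users"]),
)
print(f"CTR lift: {ctr_lift:.1%}, CVR lift: {cvr_lift:.1%}")
```

Averaging such daily lifts over the A/B window gives a single summary figure per metric, while the same computation during A/A periods indicates how much the two groups differ by chance alone.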
Results. Figure 6 shows the improvement of the experimental group over the control group with respect to CTR and CVR. We can see that the system is rather stable, with both CTR and CVR fluctuating within 8% during the A/A testing. Our Urank model is launched to the live system on Day 8. From Day 8, we observe a significant improvement over the baseline model with respect to both CTR and CVR. The average improvement of CTR is 19.2% and the average improvement of CVR is 20.8% over the ten days of A/B testing. These results clearly demonstrate the effectiveness of our proposed model in improving the total utility, which in this scenario is the number of downloads. From Day 18, we conduct A/A testing again, replacing our Urank model with the baseline model in the experimental group. We observe a sharp drop in the performance of the experimental group, which once more verifies that the improvement in online performance in the experimental group is indeed brought by our proposed model.
7. Conclusion
In this paper, we propose a novel framework, Urank, which directly optimizes the expected utility of the ranked list without any extra assumption on relevance or examination. Specifically, Urank first uses a position-aware deep CTR model to perform an unbiased estimation of the expected utility, and then optimizes the objective with an efficient algorithm based on a LambdaRank-like objective. Extensive studies on three benchmark datasets and two real-world datasets based on different scenarios have shown the effectiveness of our work. We also deploy this ranking framework on a commercial recommender system and observe a large utility improvement over the production baseline via online A/B testing. In future work, we plan to consider other biases, such as selection bias, and propose a more general debiasing framework.
Acknowledgement
The corresponding author Weinan Zhang thanks the support of NSFC (61702327, 61772333, 61632017). The work is also sponsored by Huawei Innovation Research Program.
References
 Agarwal et al. (2019) Aman Agarwal, Ivan Zaitsev, Xuanhui Wang, Cheng Li, Marc Najork, and Thorsten Joachims. 2019. Estimating Position Bias without Intrusive Interventions. In WSDM.
 Ai et al. (2018) Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W. Bruce Croft. 2018. Unbiased Learning to Rank with Unbiased Propensity Estimation. In SIGIR.
 Bai et al. (2019) Xiao Bai, Reza Abasi, Bora Edizel, and Amin Mantrach. 2019. Position-aware deep character-level CTR prediction for sponsored search. TKDE (2019).
 Borisov et al. (2016) Alexey Borisov, Ilya Markov, Maarten de Rijke, and Pavel Serdyukov. 2016. A Neural Click Model for Web Search. In WWW.
 Burges et al. (2005) Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N Hullender. 2005. Learning to rank using gradient descent. In ICML.
 Burges et al. (2007) Christopher J Burges, Robert Ragno, and Quoc V Le. 2007. Learning to rank with nonsmooth cost functions. In NeurIPS.

 Chapelle and Zhang (2009) Olivier Chapelle and Ya Zhang. 2009. A Dynamic Bayesian Network Click Model for Web Search Ranking. In WWW.
 Craswell et al. (2008) Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An experimental comparison of click position-bias models. In WSDM.
 Dupret and Piwowarski (2008) Georges E. Dupret and Benjamin Piwowarski. 2008. A User Browsing Model to Predict Search Engine Click Data from Past Observations. In SIGIR.
 Fang et al. (2019) Zhichong Fang, Aman Agarwal, and Thorsten Joachims. 2019. Intervention Harvesting for Context-Dependent Examination-Bias Estimation. In SIGIR.
 Guo et al. (2009) Fan Guo, Chao Liu, Anitha Kannan, Tom Minka, Michael Taylor, Yi-Min Wang, and Christos Faloutsos. 2009. Click Chain Model in Web Search. In WWW.
 Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In IJCAI.
 Guo et al. (2019) Huifeng Guo, Jinkai Yu, Qing Liu, Ruiming Tang, and Yuzhou Zhang. 2019. PAL: A Position-Bias Aware Learning Framework for CTR Prediction in Live Recommender Systems. In RecSys.
 Hu et al. (2019) Ziniu Hu, Yang Wang, Qu Peng, and Hang Li. 2019. Unbiased LambdaMART: An Unbiased Pairwise Learning-to-Rank Algorithm. In WWW.
 Joachims (2006) Thorsten Joachims. 2006. Training Linear SVMs in Linear Time. In KDD.
 Joachims et al. (2017) Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In WSDM.
 Karatzoglou et al. (2013) Alexandros Karatzoglou, Linas Baltrunas, and Yue Shi. 2013. Learning to rank for recommender systems. In RecSys.
 Karmaker Santu et al. (2017) Shubhra Kanti Karmaker Santu, Parikshit Sondhi, and ChengXiang Zhai. 2017. On application of learning to rank for e-commerce search. In SIGIR.
 Kuhn (1955) Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Naval research logistics quarterly (1955).
 Liu (2011) Tie-Yan Liu. 2011. Learning to rank for information retrieval. Springer Science & Business Media.
 Lorigo et al. (2008) Lori Lorigo, Maya Haridasan, Hrönn Brynjarsdóttir, Ling Xia, Thorsten Joachims, Geri Gay, Laura Granka, Fabio Pellacini, and Bing Pan. 2008. Eye tracking and online search: Lessons learned and challenges ahead. Journal of the American Society for Information Science and Technology (2008).
 Lorigo et al. (2006) Lori Lorigo, Bing Pan, Helene Hembrooke, Thorsten Joachims, Laura Granka, and Geri Gay. 2006. The influence of task and gender on search and evaluation behavior using Google. Information Processing & Management (2006).
 Lucchese et al. (2016) Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri, and Salvatore Trani. 2016. Post-Learning Optimization of Tree Ensembles for Efficient Ranking. In SIGIR.
 Metrikov et al. (2014) Pavel Metrikov, Fernando Diaz, Sebastien Lahaie, and Justin Rao. 2014. Whole Page Optimization: How Page Elements Interact with the Position Auction.
 Munkres (1957) James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics (1957).
 Qu et al. (2018) Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2018. Product-based neural networks for user response prediction over multi-field categorical data. TOIS (2018).
 Richardson et al. (2007) Matthew Richardson, Ewa Dominowska, and Robert Ragno. 2007. Predicting clicks: Estimating the click-through rate for new ads. In WWW.
 Robertson (1977) Stephen E Robertson. 1977. The probability ranking principle in IR. Journal of documentation (1977).
 Tagami et al. (2013) Yukihiro Tagami, Shingo Ono, Koji Yamamoto, Koji Tsukamoto, and Akira Tajima. 2013. CTR prediction for contextual advertising: Learning-to-rank approach. In Proceedings of the Seventh International Workshop on Data Mining for Online Advertising.
 Wang et al. (2016) Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to Rank with Selection Bias in Personal Search. In SIGIR.
 Wang et al. (2018a) Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. 2018a. Position Bias Estimation for Unbiased Learning to Rank in Personal Search. In WSDM.
 Wang et al. (2018b) Xuanhui Wang, Cheng Li, Nadav Golbandi, Mike Bendersky, and Marc Najork. 2018b. The LambdaLoss Framework for Ranking Metric Optimization. In CIKM.
 Wu et al. (2018) Liang Wu, Diane Hu, Liangjie Hong, and Huan Liu. 2018. Turning clicks into purchases: Revenue optimization for product search in e-commerce. In SIGIR.
 Zhang et al. (2016) Weinan Zhang, Tianming Du, and Jun Wang. 2016. Deep learning over multi-field categorical data. In ECIR.
 Zhao et al. (2019) Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi. 2019. Recommending what video to watch next: a multitask ranking system. In RecSys.