Counterfactual Learning to Rank using Heterogeneous Treatment Effect Estimation

07/19/2020
by   Mucun Tian, et al.
Pandora Media, Inc.

Learning-to-Rank (LTR) models trained from implicit feedback (e.g. clicks) suffer from inherent biases. A well-known one is the position bias – documents in top positions are more likely to receive clicks due in part to their position advantages. To unbiasedly learn to rank, existing counterfactual frameworks first estimate the propensity (probability) of missing clicks with intervention data from a small portion of search traffic, and then use inverse propensity score (IPS) to debias LTR algorithms on the whole data set. These approaches often assume the propensity only depends on the position of the document, which may cause high estimation variance in applications where the search context (e.g. query, user) varies frequently. While context-dependent propensity models reduce variance, accurate estimations may require randomization or intervention on a large amount of traffic, which may not be realistic in real-world systems, especially for long tail queries. In this work, we employ heterogeneous treatment effect estimation techniques to estimate position bias when intervention click data is limited. We then use such estimations to debias the observed click distribution and re-draw a new de-biased data set, which can be used for any LTR algorithms. We conduct simulations with varying experiment conditions and show the effectiveness of the proposed method in regimes with long tail queries and sparse clicks.



1. Introduction

Learning-to-rank (LTR) models have been widely used in information retrieval and recommender systems. These models are often trained in the offline setting with implicit feedback (e.g. clicks) collected from production systems. While implicit feedback is an attractive training source (e.g. abundant, privacy preserving), its inherent biases hinder the effectiveness of learning-to-rank (Joachims et al., 2007). One such bias is the position bias. To mitigate position bias, much traditional effort has gone into modeling bias-aware relevance (Joachims, 2002; Chuklin et al., 2015; Craswell et al., 2008). However, accurately inferring individual relevance requires each query-document pair to appear multiple times at multiple positions, which is not realistic in many search systems. Instead of modeling individual relevance, recent counterfactual frameworks (Wang et al., 2016; Joachims et al., 2017; Agarwal et al., 2019c; Fang et al., 2019) attempt to estimate the examination probability under the Position-Based Model (PBM) (Richardson et al., 2007) and use the estimate as an inverse propensity score (IPS) to weight pairwise or listwise ranking. While IPS weighting provides unbiased LTR under the PBM, it has several limitations.

First, all the existing approaches follow a direct method to estimate propensities, which requires the same set of query-document pairs appearing in at least two different positions. This can be implemented by randomizing the top-k results (Wang et al., 2016), swapping pairs (Joachims et al., 2017; Wang et al., 2018), or integrating multiple loggers (Agarwal et al., 2019c; Fang et al., 2019). For long tail queries, however, we rarely observe their intervention counterparts even with randomization, causing biased propensity estimations for these queries. This could potentially break the unbiasedness of IPS weighted LTR. Second, the examination propensity can only be estimated on observed clicks when using the PBM. In other words, we cannot infer from un-clicked documents whether they are irrelevant or simply not examined. Therefore, applying IPS to pointwise learning is not as effective as to pairwise or listwise learning (Wang et al., 2018). Third, IPS only weights clicked documents in the empirical loss function. Novel items (e.g. new music releases, products) for which we have not observed any clicks, due to the lack of exposure to users, are treated as negative examples under IPS weighted LTR. This undermines LTR's ability to promote novel documents.

In this paper, we employ heterogeneous treatment effect (HTE) estimation methods (Athey et al., 2019; Künzel et al., 2019) to address these limitations. We first estimate causal effects of click probabilities between two positions. Based on these estimations, we then debias click distributions of observational data and draw clicks for unbiased LTR. Finally, we evaluate the effectiveness of the proposed method under varying experiment conditions using semi-synthetic data simulated from the Microsoft Learning to Rank dataset (Qin and Liu, 2013). The objective of this work is twofold: i) compare the proposed heterogeneous treatment effect methodology to existing ones, ii) evaluate the effectiveness of the proposed method on long tail queries.

Estimating the heterogeneous treatment effect requires neither intervention data nor a click model. Instead, it utilizes "collaborative information" in the feature space, allowing position bias estimation for long tail queries. Under the unconfoundedness assumption (Rosenbaum and Rubin, 1983), the estimator is unbiased for any given context (Athey et al., 2019), so drawing from debiased click distributions can provide reliable clicks for documents with unknown relevance information in observational data.

2. Related Work

There are several lines of research on estimating and/or debiasing click data for LTR. One approach is to infer relevance with heuristics or modeling, including SkipAbove (Joachims et al., 2005), the Position-Based Model (PBM) (Richardson et al., 2007), the Cascade Model (Craswell et al., 2008), and other extensions (Chuklin et al., 2015). These approaches attempt to derive absolute or relative relevance by taking into account users' search behavior. While relative relevance is found to be more accurate on average, LTR trained from this relevance is likely to reverse the presented order without additional heuristics (Joachims et al., 2007, 2017).

Another approach is Online Learning (Yue and Joachims, 2009; Schuth et al., 2016). Online learning is robust to bias and noise, but it learns from randomization data. This can hurt users’ experience during the initial deployment stage (Jagerman et al., 2019).

Counterfactual LTR frameworks (Wang et al., 2016; Joachims et al., 2017; Agarwal et al., 2019a) seek to estimate how likely a document is to be examined and use the inverse propensity score (IPS) to weight pairwise or listwise LTR. Counterfactual LTR does not need randomization in the learning process and is proven to be unbiased under the PBM (Joachims et al., 2017). However, it is sensitive to selection bias and noise (Jagerman et al., 2019). Some attempts have been made to keep the IPS estimator "doubly robust", such as adding an imputation term (regression on the complete data) (Su et al., 2019), an inclusion propensity (the propensity of a new document being exposed to users) (Carterette and Chandar, 2018), or noise-aware parameters (Agarwal et al., 2019b). The robustness of these estimators relies on accurate imputation or noise modeling. These approaches also often assume the examination propensity only depends on the position. Several techniques have been proposed to estimate the context-dependent propensity (Wang et al., 2016; Chandar and Carterette, 2018; Fang et al., 2019), yet all of these require intervention data. Another line of work jointly estimates the relevance and the position bias (Wang et al., 2018; Ai et al., 2018; Hu et al., 2019) on observational data. But coupling the relevance and the bias together, without controlling for either one of them, calls into question the unbiasedness of the estimator (Agarwal et al., 2019c; Fang et al., 2019).

Recent advances in heterogeneous treatment effect estimation provide promising techniques for identifying individual treatment effects in observational studies. This line of work follows the potential outcomes framework (Rubin, 1974; Imbens and Rubin, 2015) to estimate the treatment effect using the Robinson (1988) transformation under the unconfoundedness assumption (Rosenbaum and Rubin, 1983). Künzel et al. (2019) introduced meta-learners that indirectly predict the heterogeneous treatment effect using imputation of unobserved outcomes. Athey and Imbens (2016) proposed recursive partitioning, namely causal trees, to assess heterogeneity in the treatment effect. Follow-up work by Wager and Athey (2018) developed causal forests to consistently estimate the true treatment effect. These tree-based methods require manually-designed criteria for parameter tuning due to the fundamental problem of causal inference: we observe an individual in either the treatment or the control group, but never both. Nie and Wager (2017) proposed the R-learner, which separates confounding factors from the treatment effect estimator, enabling traditional cross-validation for goodness-of-fit. This motivates the parameter tuning of generalized random forests (Athey et al., 2019).

3. Methods

In this section, we briefly revisit the existing counterfactual LTR framework, then introduce the heterogeneous treatment effect estimation and describe our experiment protocol.

3.1. Counterfactual LTR

Counterfactual LTR frameworks (Wang et al., 2016; Joachims et al., 2017; Agarwal et al., 2019a) assume that documents at higher positions are more likely to be examined by a user than those at lower positions. Therefore, observed clicks are missing with certain propensities (probabilities) for documents at position k. Given these propensities, we can use the inverse propensity score (IPS) technique to weight the positive examples for unbiased LTR.

3.1.1. Position-Based Propensity Estimation

The propensity is often unknown in practice. To estimate it, existing literature follows the position-based model (PBM) (Richardson et al., 2007), which assumes the observed click depends on the examination and the relevance in the following way,

    P(C = 1 | x, k) = P(E = 1 | k) · P(R = 1 | x),

where x is the feature vector that encodes the query, user and document, and k is the document position. This assumes the examination only depends on the position. We can also relax the examination to be context-dependent as

    P(C = 1 | x, k) = P(E = 1 | x, k) · P(R = 1 | x).

We can then fit the examination and the average relevance models on the intervention data to consistently estimate the propensity by minimizing the cross-entropy loss (Wang et al., 2018; Fang et al., 2019),

    L = − Σ_{(q,d,k)} [ c̄_{q,d,k} log p̂(x_{q,d}, k) + (1 − c̄_{q,d,k}) log(1 − p̂(x_{q,d}, k)) ],    (1)

where (q, d, k) represents a unique tuple of (query, document, position) and c̄_{q,d,k} is the click rate of a unique query-document pair at position k.
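The PBM factorization above can be sketched numerically. The examination probabilities below are made-up values for illustration only; in the paper they are estimated from intervention data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical examination probabilities for the top-5 positions
# (position bias: higher positions are examined more often).
exam = np.array([1.0, 0.6, 0.4, 0.3, 0.2])

def pbm_click_prob(relevance, position):
    """P(C=1 | x, k) = P(E=1 | k) * P(R=1 | x) under the PBM."""
    return exam[position] * relevance

def sample_click(relevance, position):
    """Draw one click according to the PBM click probability."""
    return rng.random() < pbm_click_prob(relevance, position)

# A relevant document (r = 0.9) at position 4 is clicked far less
# often than at position 0, even though its relevance is unchanged.
p_top = pbm_click_prob(0.9, 0)
p_low = pbm_click_prob(0.9, 4)
```

The gap between `p_top` and `p_low` is exactly the position bias that the propensity model must recover.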

3.1.2. IPS Weighted LTR

IPS weighting is found to be effective for pairwise or listwise LTR (Wang et al., 2018). In the IPS weighted pairwise LTR setting, we have N ranked lists of size k, and we want to learn a score function f from the following loss,

    L_IPS(f) = Σ_lists Σ_{i : c_i = 1} λ(i | f) / p_{k_i},

where c_i is the click on the i-th document in a ranking list, p_{k_i} is the propensity score at position k_i, and λ is the pairwise loss that approximates or bounds a ranking metric (e.g. DCG, Relevance Rank) (Joachims et al., 2017; Agarwal et al., 2019a).
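A minimal sketch of an IPS-weighted pairwise loss, assuming a hinge pairwise loss and per-position propensities (both illustrative choices; the paper's experiments use the tensorflow-ranking implementation):

```python
import numpy as np

def ips_pairwise_hinge(scores, clicks, positions, propensity, margin=1.0):
    """IPS-weighted pairwise hinge loss for one ranked list.

    Each clicked document i contributes hinge(margin - (s_i - s_j))
    against every other document j, weighted by 1 / p_{k_i}.
    """
    loss = 0.0
    for i, clicked in enumerate(clicks):
        if not clicked:
            continue  # IPS only weights clicked documents
        w = 1.0 / propensity[positions[i]]
        for j in range(len(clicks)):
            if j != i:
                loss += w * max(0.0, margin - (scores[i] - scores[j]))
    return loss

# Toy list: one click at position 1, whose propensity is 0.5,
# so its pairwise violations are up-weighted by a factor of 2.
propensity = np.array([1.0, 0.5, 0.25])
scores = np.array([2.0, 1.0, 0.0])
clicks = [False, True, False]
positions = [0, 1, 2]
loss = ips_pairwise_hinge(scores, clicks, positions, propensity)
```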

3.1.3. Propensity Model Implementation

We select the context-dependent position-based model (CPBM) (Fang et al., 2019) as the contextual propensity model, implemented using TensorFlow (Abadi et al., 2015). Specifically, the examination model is implemented with a 3-layer neural network; the input corresponds to the context feature vector, and the output corresponds to the top-k positions to be estimated. The average relevance is also modeled by a 3-layer neural network; the input corresponds to the context feature vector, and the output corresponds to the intervention sets for the position pairs. An intervention set is composed of documents that appear in at least two different positions. To account for the fact that, given the context, the average relevance for intervention documents at a position pair equals the average relevance at the reversed pair, the output layer is the arithmetic mean of the previous output and its transpose, making the final output a symmetric matrix (see (Fang et al., 2019) for the details).

Figure 1. Simulation architecture: (a) position bias estimation; (b) unbiased learning-to-rank.

3.2. Heterogeneous Treatment Effect Estimation

Another way to achieve unbiased LTR is to place every document at the first position and collect the data for offline training. This is obviously impractical, so we instead seek to unbiasedly estimate the conditional incremental effect: how much the click probability of a document at position k would increase had it been placed at the first position. With this estimate, we can compensate the click probability of the document at position k during offline training.

3.2.1. Heterogeneous Treatment Effect Estimation

To estimate the conditional incremental effect, we employ the potential outcomes framework (Rubin, 1974) to formulate this problem. In LTR, we observe n i.i.d. examples (X_i, Y_i, W_i), where X_i is the query-document feature, Y_i is the observed outcome (e.g. click, grade), and W_i is the treatment variable indicating whether a document is observed at position 1 or position k: W_i = 1 if the position is 1, and W_i = 0 if the position is k. We assume there are two potential outcomes, Y_i(1) and Y_i(0), corresponding to the treatment and control group, so Y_i = Y_i(1) if W_i = 1; otherwise, Y_i = Y_i(0). We then want to estimate the conditional average treatment effect (CATE) between positions 1 and k,

    τ(x) = E[Y(1) − Y(0) | X = x].

To estimate the CATE, we assume unconfoundedness given any specific context (Rosenbaum and Rubin, 1983),

    {Y_i(0), Y_i(1)} ⊥ W_i | X_i.    (2)

Intuitively, this assumes data points surrounding a specific context x are missing at random, so that we can estimate τ(x) without bias.[2]

[2] In practice, this assumption can be met by conducting randomization experiments. For example, we map each unique query to a random seed, and then randomly shuffle the top-k list based on this seed. In this way, the ranking is still deterministic, but documents' positions are now independent of the clicks they will receive. This reduces the harm to the user experience compared to full randomization.
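The per-query seeded shuffle described in the footnote can be sketched as follows; the hashing scheme is one possible implementation, not necessarily the one used in the production system:

```python
import hashlib
import random

def seeded_shuffle(query, ranked_list, k=5):
    """Deterministically shuffle the top-k of a ranked list using a
    per-query seed: the same query always gets the same ranking, but
    positions within the top-k are independent of relevance."""
    seed = int(hashlib.sha256(query.encode()).hexdigest(), 16) % (2**32)
    top = ranked_list[:k]
    random.Random(seed).shuffle(top)
    return top + ranked_list[k:]

docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
a = seeded_shuffle("best jazz albums", docs)
b = seeded_shuffle("best jazz albums", docs)
# a == b: the ranking is still deterministic per query,
# and documents below the top-k keep their original positions.
```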

3.2.2. Click Distribution Correction and Data Resampling

With the estimator τ̂ in hand, we compute a potential click rate at position 1 for each unique query-document pair (q, d) observed at position k by

    r̂_1(x_{q,d}) = r_k(x_{q,d}) + τ̂(x_{q,d}),    (3)

where r_k(x_{q,d}) is the observed click rate of the query-document pair with feature x_{q,d} at position k.

Based on (3), we re-draw clicks/non-clicks for each unique query-document pair (q, d) from

    Binomial(n_{q,d}, r̂_1(x_{q,d})),

where n_{q,d} is the number of times the query-document pair is observed in the training data. We truncate r̂_1 to the range [0, 1] before sampling clicks.

In this way, we reconstruct clicks as if each document had been placed in the first position. A nice property of our approach is that relevant documents without any observed clicks due to position bias can become positive training examples after the correction and resampling. This is particularly useful for long tail queries and long result lists. However, this is not achievable using existing counterfactual frameworks, since IPS only weights the clicked documents.
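The correction in (3) and the Binomial re-draw can be sketched as follows (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def corrected_click_rate(observed_rate, tau_hat):
    """Potential click rate at position 1: observed rate at position k
    plus the estimated treatment effect, truncated to [0, 1]."""
    return float(np.clip(observed_rate + tau_hat, 0.0, 1.0))

def redraw_clicks(observed_rate, tau_hat, n_impressions):
    """Re-draw clicks from Binomial(n, corrected rate), as if the
    document had always been shown at position 1."""
    p = corrected_click_rate(observed_rate, tau_hat)
    return rng.binomial(n_impressions, p)

# A document observed at position 5 with no clicks can become a
# positive example after the correction.
p = corrected_click_rate(0.0, 0.35)
```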

3.2.3. Heterogeneous Treatment Effect Implementation

In the following we describe the two methods we used to estimate the heterogeneous treatment effect τ: causal forests (Wager and Athey, 2018; Athey et al., 2019) and X-Learner (Künzel et al., 2019).

Causal forests (Wager and Athey, 2018) estimate the heterogeneous treatment effect at each leaf node by

    τ̂ = mean{ Y_i : W_i = 1, X_i ∈ leaf } − mean{ Y_i : W_i = 0, X_i ∈ leaf }.    (4)

The tree-growing procedure recursively solves (4) and selects a cut in the feature space that maximizes the difference of τ̂ between the two child nodes. To speed up the tree building process, Athey et al. (2019) use a gradient method to optimize a linear approximation of the difference (see (Athey et al., 2019) for the details). We train causal forests to estimate the bias in the top-k positions.
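A toy illustration of the leaf estimate in (4) and the split criterion, reduced to a single greedy cut on one feature (real causal forests additionally use honest splitting and the gradient approximation of Athey et al. (2019)):

```python
import numpy as np

def leaf_tau(y, w):
    """Treatment-effect estimate in a leaf: difference of mean outcomes
    between treated (w == 1) and control (w == 0) examples, as in (4)."""
    return y[w == 1].mean() - y[w == 0].mean()

def best_split(x, y, w):
    """Greedy search for the single cut that maximizes the squared
    difference of leaf estimates between the two children: a one-split
    caricature of causal-tree growing."""
    best, best_gain = None, -np.inf
    for cut in np.unique(x)[1:]:
        left, right = x < cut, x >= cut
        # Both children need treated and control examples.
        if min(w[left].sum(), (1 - w[left]).sum(),
               w[right].sum(), (1 - w[right]).sum()) == 0:
            continue
        gain = (leaf_tau(y[left], w[left]) - leaf_tau(y[right], w[right])) ** 2
        if gain > best_gain:
            best, best_gain = cut, gain
    return best
```

On data where the effect is 1 for x = 0 and 0 for x = 1, the search recovers the cut separating the two regimes.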

X-Learner (Künzel et al., 2019) estimates the heterogeneous treatment effect by fitting models on imputed outcomes. It has three steps:

  1. Fit two regression models, μ̂_0 and μ̂_1, to estimate the average outcomes of the control and treatment groups, E[Y(0) | X = x] and E[Y(1) | X = x], respectively.

  2. Impute the individual treatment effects in the treatment and control groups by D_i^1 = Y_i − μ̂_0(X_i) and D_i^0 = μ̂_1(X_i) − Y_i, and fit two regression models, τ̂_1 and τ̂_0, to estimate the imputed treatment effects in the two groups, respectively.

  3. Estimate the heterogeneous treatment effect by

     τ̂(x) = g(x) τ̂_0(x) + (1 − g(x)) τ̂_1(x),

     where g(x) ∈ [0, 1] is often set to the propensity of treatment assignment, that is, g(x) = P(W = 1 | X = x).

Similarly, we train X-Learners for the top-k positions using Causal ML (Chen et al., 2020). We select the tree boosting method (Chen and Guestrin, 2016) as the base regression model.
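The three steps above can be sketched with scikit-learn's gradient-boosted trees as base learners (a stand-in for the XGBoost models used via Causal ML; g is fixed at 0.5 here, a simplifying assumption matching balanced arms):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_x_learner(X, y, w, g=0.5):
    """Minimal X-Learner sketch (Künzel et al., 2019).
    g is the treatment-assignment propensity weight; with balanced
    arms, g = 0.5 is a reasonable constant."""
    # Step 1: outcome models for the control and treatment groups.
    mu0 = GradientBoostingRegressor().fit(X[w == 0], y[w == 0])
    mu1 = GradientBoostingRegressor().fit(X[w == 1], y[w == 1])
    # Step 2: imputed individual effects, then effect models.
    d1 = y[w == 1] - mu0.predict(X[w == 1])
    d0 = mu1.predict(X[w == 0]) - y[w == 0]
    tau1 = GradientBoostingRegressor().fit(X[w == 1], d1)
    tau0 = GradientBoostingRegressor().fit(X[w == 0], d0)
    # Step 3: blend the two effect models.
    return lambda Xq: g * tau0.predict(Xq) + (1 - g) * tau1.predict(Xq)
```

On synthetic data with a known effect τ(x) = x, the blended estimate recovers the effect up to regression error.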

3.3. Simulation Protocol

To evaluate the accuracy of position bias estimations and the ranking effectiveness, we simulate the entire search system, intervention experiments, and model training and evaluation. Figure 1 shows our simulation architecture. We begin by generating intervention clicks and estimating position bias (Figure 1(a)), and then apply the bias estimations to unbiased learning to rank (Figure 1(b)). In the following we detail the different steps and components of the simulation architecture.

3.3.1. Query Sampler

We use the Microsoft Learning-to-rank data set (Qin and Liu, 2013) as the query corpus, since it provides real-world context features and human-annotated 5-grade relevance judgments. The data set contains 31K unique queries and their corresponding candidate documents, and it is split into train (60%), validation (20%), and test (20%) sets. To simulate long tail and popular search queries, we draw uniformly from the query corpus with different sample sizes. We also train a linear pairwise ranker on a sample of the train and validation queries to simulate the production ranker. For each incoming query, the production ranker outputs a ranked list for further experiments.

3.3.2. Intervention Simulation

Recent work collects intervention clicks from randomization (Wang et al., 2016; Joachims et al., 2017; Wang et al., 2018) or multiple loggers (Agarwal et al., 2019c; Fang et al., 2019). To compare the proposed method with the existing propensity estimation methods, we adopt an intervention simulation similar to the randomized swap (Joachims et al., 2017; Wang et al., 2018). We randomly assign each ranked query list to one of k arms to create intervention rankings for position bias estimation. In the control group (Arm-1), incoming query lists remain unchanged; in Arm-k, we always swap the document at the first position with the one at the k-th position. The random swap assignment not only creates the intervention sets but also satisfies the unconfoundedness assumption (2), so that we can compare the two methods without violating their respective assumptions.
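The arm assignment and swap intervention can be sketched as follows (the per-query seeding is an illustrative choice, used so assignment stays stable across repeated searches):

```python
import random

def assign_arm(query, n_arms, seed=13):
    """Randomly, but deterministically per query, assign a query to
    one of n_arms arms (arm 0 is the control)."""
    return random.Random(f"{seed}:{query}").randrange(n_arms)

def apply_intervention(ranked_list, arm):
    """Arm 0 leaves the list unchanged; in arm k we swap the document
    at the first position with the one at position k + 1."""
    out = list(ranked_list)
    if arm > 0:
        out[0], out[arm] = out[arm], out[0]
    return out
```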

3.3.3. Click Model

Before generating clicks for query lists, we follow (Joachims et al., 2017) to binarize the 5-grade relevance and truncate lists to the top-k. We then model the contextual examination following (Fang et al., 2019), with its parameter drawn uniformly at random from a fixed range. To select the context features, we trained random forests (Breiman, 2001) on the train and validation queries with normalized features and binary relevance, and selected the most important features as our context features.[3] We also add click noise to the final click model, following (Joachims et al., 2017; Jagerman et al., 2019; Fang et al., 2019), modeling that a user can mistakenly click an irrelevant document after examining it.

[3] These context features encode the dependency between relevance and the context. Reducing the number of features in simulation also makes the examination model more stable at producing enough variation in examination values.
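The noisy click model can be sketched as follows: an examined relevant document is always clicked, while an examined irrelevant one is mis-clicked with a small probability (0.1 here, an illustrative value):

```python
import numpy as np

rng = np.random.default_rng(3)

def click_prob(examination, relevant, noise=0.1):
    """Examined relevant documents are clicked; examined irrelevant
    documents are mis-clicked with probability `noise`."""
    return examination * (1.0 if relevant else noise)

def sample_clicks(examinations, relevances, noise=0.1):
    """Draw one click per document according to the noisy click model."""
    p = np.array([click_prob(e, r, noise)
                  for e, r in zip(examinations, relevances)])
    return rng.random(len(p)) < p
```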

3.3.4. Position Bias Estimation

To feed CPBM and the HTE estimators (i.e. Causal Forests and X-Learner), we process the simulation clicks in the following two ways:

  • CPBM: merge clicks from all arms, select query-document pairs shown in at least two positions (the intervention sets) (Agarwal et al., 2019c; Fang et al., 2019), and estimate the examination propensity using the method described in Section 3.1.

  • HTE Estimator: combine clicks from Arm-1 and Arm-k, randomly sample one position from the intervention sets for each unique query-document pair shown at position 1 and/or k, compute the click rate at the sampled position, and estimate the heterogeneous treatment effect between positions 1 and k with the method described in Section 3.2.

From this data processing, we can see that not all interventions are required for fitting the HTE estimators. We train models on the simulation clicks generated from the train and validation sets of the query corpus. The training features are the same as those selected for the click model described in Section 3.3.3. After this phase, we have trained CPBM and the HTE estimators; these models are used in the next phase (Figure 1(b)).

3.3.5. Unbiased Learning-to-rank

Figure 1 illustrates the architecture of unbiased LTR. We implemented two types of LTR models for the comparison. The main steps are:

  • CPBM LTR: estimate the examination propensity with CPBM and use the inverse propensity to weight the pairwise LTR.

  • HTE LTR: compute the click rate at the observed position, estimate the heterogeneous treatment effects with the HTE estimators trained in the previous phase, compute the potential click rate at position 1 as the sum of the observed click rate and the estimated treatment effect, draw clicks/non-clicks from a Bernoulli distribution parameterized by the potential click rate at position 1, and train the pairwise LTR.

The LTR models were implemented using the tensorflow-ranking (Pasumarthi et al., 2019) library. We used a linear scoring function, pairwise hinge loss and L2 regularization. We used all features in the query corpus to train LTR models. Hyper-parameter tuning was conducted on the validation sets.

3.3.6. Evaluation Metrics

We evaluate the accuracy of the position bias estimation by computing the Root Mean Square Error (RMSE) between the estimated and the true treatment effect τ on queries in the test set for the top-k positions. CPBM does not output τ directly. Instead, it predicts the examination probability ê(x, k) and the average relevance r̂(x) separately. We compute τ̂ under CPBM by

    τ̂(x) = (ê(x, 1) − ê(x, k)) · r̂(x).

To evaluate ranking effectiveness, we computed nDCG@10 on the test set using binary relevance. We rerun the entire simulation experiment, including simulation click generation, position bias estimation, and LTR training and evaluation, 3 times.[4]

[4] Full experiment code is publicly available at https://github.com/KimuraTian/sigir-eCom20-counterfactual-ltr-using-hte.
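The CPBM-implied treatment effect and its RMSE can be computed as follows (the ê and r̂ values below are made up for illustration; CPBM would predict them per query-document pair):

```python
import numpy as np

def tau_from_cpbm(exam_hat, rel_hat, k):
    """Implied treatment effect between positions 1 and k under CPBM:
    (e_hat(x, 1) - e_hat(x, k)) * r_hat(x)."""
    return (exam_hat[:, 0] - exam_hat[:, k - 1]) * rel_hat

def rmse(est, true):
    """Root Mean Square Error between estimated and true effects."""
    return float(np.sqrt(np.mean((est - true) ** 2)))

# Two hypothetical queries; columns are examination at positions 1 and 2.
exam_hat = np.array([[1.0, 0.5], [1.0, 0.25]])
rel_hat = np.array([0.8, 0.4])
tau = tau_from_cpbm(exam_hat, rel_hat, k=2)
```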

4. Results

In this section, we detail and analyze the experiment results comparing CPBM and the proposed methodology based on heterogeneous treatment effect estimation methods.

Avg. Searches/Query = 5 Avg. Searches/Query = 10 Avg. Searches/Query = 25 Avg. Searches/Query = 50
Percentage of Training Queries 1% 10% 50% 100% 1% 10% 50% 100% 1% 10% 50% 100% 1% 10% 50% 100%
CPBM 0.360 0.256 0.239 0.237 0.325 0.249 0.239 0.238 0.304 0.246 0.238 0.237 0.295 0.245 0.239 0.237
Causal Forests 0.335 0.276 0.255 0.249 0.317 0.270 0.251 0.246 0.310 0.266 0.250 0.245 0.307 0.265 0.249 0.244
X-Learner 0.374 0.315 0.275 0.260 0.350 0.302 0.265 0.253 0.341 0.291 0.260 0.249 0.341 0.286 0.256 0.247
Table 1. Position bias estimation RMSE.
Figure 2. Position bias estimation RMSE, broken out by the true relevance of documents. The columns and rows are the average searches per query and the true relevance of documents, respectively. The x-axis represents the percentage of training queries used in the intervention simulation.

4.1. Position Bias Estimation

Table 1 shows the RMSE of heterogeneous treatment effect estimations on the test set; models with the best RMSE are in bold. The RMSE gives us a picture of how accurately each estimator captures the heterogeneity of the position bias. Among the three estimation methods, X-Learner has the largest estimation error. This may be because X-Learner is often more efficient when the treatment and control groups are imbalanced or when there are structural assumptions on the heterogeneous treatment effects (Künzel et al., 2019); our experiment, however, has balanced treatment and control groups, and the structure of the true relevance is unknown. This suggests that future work should consider the ratio of treatment to control group and the choice of base learners for specific applications when picking HTE estimators. Under the extremely sparse condition (percentage of training queries = 1% and avg. searches per query = 5), the causal forests method exhibits the smallest estimation error. This is to be expected, since causal forests can utilize similarities in the context feature space, reducing the high sample variance faced by CPBM due to the lack of intervention clicks. As we increase the density of intervention data points, CPBM has the best estimation accuracy. We conjecture this is related to the fact that causal forests use the average value of the data points in a leaf node to make predictions, which introduces prediction noise when there is high feature heterogeneity within the leaf. When a large amount of intervention clicks is available, all three algorithms converge to a comparable level of estimation accuracy.

Figure 2 shows the RMSE of heterogeneous treatment effect estimations for truly relevant (r = 1) and irrelevant (r = 0) documents. For relevant documents, CPBM achieves the best RMSE except under the extremely sparse condition (avg. searches/query = 5 and percentage of training queries = 1%), while the two HTE estimators are comparable to each other. For irrelevant documents, Causal Forests has the best estimation accuracy. Since our simulation models click noise, heterogeneous treatment effects for irrelevant documents are probabilities of misclicks. Although a good estimator should accurately capture heterogeneous treatment effects for both relevant and irrelevant documents, we speculate that the ranking performance of HTE LTR is less sensitive to the estimation noise of irrelevant documents than IPS weighted LTR is, when click noise is not high. We think this is also related to the fact that misclicked irrelevant documents contribute a small proportion of the clicked documents (e.g. at most 0.1 under our simulation). Furthermore, in our HTE LTR approach the debiased clicks are generated by resampling, while IPS weighting may amplify misclicks.

4.2. Unbiased LTR

Figure 3 shows the box plot of ranking accuracy results, while Table 2 shows ranking accuracy measured with nDCG@10. Overall, the HTE LTR approaches (Causal Forests LTR and X-Learner LTR) outperform the IPS weighted LTR baseline (CPBM LTR). There are two possible reasons: i) noisy propensity estimations cause extremely high IPS weights when propensity scores are close to 0, which explodes the weights in the IPS weighted LTR loss function and makes the learning process unstable; ii) the Bernoulli sampling can create positive clicks for documents without any observed clicks. To mitigate the high variance problem encountered by IPS weighted LTR, we trained an IPS weighted LTR with propensity estimations truncated to a fixed range (CPBM Clipped IPS LTR). CPBM Clipped IPS LTR improved the ranking performance over the baseline in some cases (e.g. avg. searches/query = 10 and percentage of training queries = 10%, 50%, 100%; avg. searches/query = 25 and percentage of training queries = 1%, 10%), but it does not beat the HTE LTR approaches. When clicks become abundant (i.e. avg. searches/query = 50), the performance of the IPS weighted LTRs improves significantly. For example, CPBM LTR beats Causal Forests LTR when 100% of training queries are used for propensity estimation. This may suggest that in applications where a large amount of intervention clicks is available, CPBM LTR is preferred, as it requires simpler pre-processing. But when the lack of clicks is a main concern (e.g. personal file search, long tail queries), heterogeneous treatment effect estimation LTR methods may be better alternatives. Within the HTE LTR group, Causal Forests LTR and X-Learner LTR have similar ranking performance (cases with p > 0.05 under a t-test with unequal variance on data from three runs: percentage of training queries = 1%, 10%, 100% and avg. searches per query = 5; percentage of training queries = 10% and avg. searches/query = 10).

Avg. Searches/Query = 5 Avg. Searches/Query = 10 Avg. Searches/Query = 25 Avg. Searches/Query = 50
Percentage of Training Queries 1% 10% 50% 100% 1% 10% 50% 100% 1% 10% 50% 100% 1% 10% 50% 100%
CPBM LTR 0.77 0.80 0.78 0.76 0.78 0.77 0.77 0.76 0.77 0.77 0.79 0.77 0.80 0.81 0.82 0.82
CPBM Clipped IPS LTR 0.77 0.78*** 0.78 0.75** 0.77*** 0.78*** 0.79*** 0.78*** 0.78* 0.80*** 0.77*** 0.77 0.83*** 0.79*** 0.82 0.79***
Causal Forests LTR 0.82*** 0.81*** 0.81*** 0.81*** 0.82*** 0.81*** 0.81*** 0.82*** 0.83*** 0.82*** 0.82*** 0.84*** 0.81*** 0.85*** 0.84*** 0.79***
X-Learner LTR 0.82*** 0.82*** 0.82*** 0.81*** 0.81*** 0.81*** 0.82*** 0.84*** 0.84*** 0.84*** 0.85*** 0.82*** 0.84*** 0.83*** 0.83** 0.86***
Table 2. Ranking performance measured by nDCG@10. The notations *, **, and *** mean statistically significant with p < 0.05, p < 0.01, and p < 0.001, respectively, compared to CPBM LTR.
Figure 3. Ranking performance of unbiased LTRs.
Figure 4. Distribution of click rate in the training data.
Figure 5. Distribution of click rate for unobserved true relevant documents.
Figure 6. Distribution of click rate for mis-clicked true irrelevant documents.

4.3. Position Bias Results Analysis

The core idea of counterfactual LTR is to estimate the bias in observational data in order to debias the training data and learn a better LTR model. To better understand the impact of the bias estimation on ranking performance, we look at the distributions of corrected CTR and observed CTR in the training data. Figure 4 illustrates the distribution of click rate for relevant and irrelevant documents. True Model CTR is the click probability defined in our click model (which encodes position bias); Observation CTR is computed from simulation clicks; Causal Forests Theta Truncated and X-Learner Theta Truncated are the corrected CTRs (the parameters of the Bernoulli sampling); CPBM IPS Weighted Observation CTR is the product of the IPS estimated by CPBM and the observation CTR; the horizontal line marks a reference level. For relevant documents, we expect the corrected CTR to be as close to 1 as possible, since these documents are truly relevant. For irrelevant documents, we expect the corrected CTR to be as low as possible and not to over-estimate the false CTR. Looking at the bottom part of Figure 4, we can see that the observation CTR covers almost the full range (high position bias), but it approaches 1 after the CTR adjustments. The HTE LTR methods exhibit this property under all simulation conditions, while IPS weighted LTR only works well when more clicks are available (small variance when avg. searches/query = 25, 50). On the other hand (top part of Figure 4), we can see that the HTE estimators, while having larger estimation variance on irrelevant documents, can still confine the corrected CTR of the majority of irrelevant documents under a certain level (e.g. 0.5), making the noisy estimation less detrimental to LTR. As long as relevant documents are sampled much more often than irrelevant ones, pairwise LTR can still differentiate relevant documents from irrelevant ones effectively.

We also investigate how counterfactual LTR helps debias observational data. Specifically, we focus on two cases: false negatives (relevant documents without clicks) and false positives (mis-clicks) in observational data. Figure 5 and Figure 6 show the respective distributions. From Figure 5, HTE LTR recovers the majority of unobserved relevant documents through the click distribution correction. However, IPS weighted LTR does not help much, since it only reweights observed clicks (which remain 0 for unobserved ones). For the false positive case (Figure 6), large estimation variance (e.g. extremely high means when avg. searches/query <= 25 and the percentage of training queries = 1%) can be detrimental to IPS weighted LTR, as it may severely amplify the noise.
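The false-negative limitation of IPS can be made concrete with a toy example (the propensity values below are hypothetical): reweighting rescales each observed click by the inverse examination probability, so a document that was never clicked contributes exactly zero regardless of its weight.

```python
import numpy as np

def ips_weighted_clicks(clicks, propensity):
    """IPS reweights observed clicks by inverse examination propensity.

    An unobserved relevant document (click = 0) stays at 0 no matter
    how large its weight is, so IPS alone cannot recover false
    negatives; re-drawing from a corrected click distribution can.
    """
    return clicks / propensity

clicks = np.array([1.0, 0.0, 0.0])       # 2nd/3rd docs never clicked
propensity = np.array([0.9, 0.2, 0.05])  # examination prob. by position
weighted = ips_weighted_clicks(clicks, propensity)
```

Here `weighted` is approximately `[1.11, 0.0, 0.0]`: the unclicked documents stay at zero, matching the behavior seen in Figure 5.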

5. Conclusions and Future Work

In this work we show how heterogeneous treatment effect estimation techniques can be used to address the position bias in search results ranking. To utilize estimated incremental causal effects for unbiased LTR, we drew clicks from debiased click distributions of observational data. We compared the proposed method with an existing contextual position-based model (Fang et al., 2019) under varying simulation conditions. Our results showed that the heterogeneous treatment effect estimation method is particularly effective for long tail queries with high click sparsity.

The usage of sampled clicks is not limited to pairwise LTR; it would be interesting to see how it performs when applied to other LTR methods. There is also a variety of estimation methods for heterogeneous treatment effects, such as the T-learner (Künzel et al., 2019) and the R-learner (Nie and Wager, 2017). In the future, we can explore the multivariate extension of the R-learner (Nie and Wager, 2017), so that a single model estimates the bias at multiple positions instead of training multiple models with a binary treatment indicator. The treatment position we chose in this work is the first position; we conjecture that the estimation could be improved by extending to other anchor positions when there is a large imbalance of clicks across positions. The simulation of user search behavior used in this work (uniform sampling) is simple; future work may capture more complex user behavior using other sampling distributions (e.g. a Pareto distribution).
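The meta-learners discussed above share a common two-stage recipe. As an illustration, here is a minimal sketch of the X-learner of Künzel et al. (2019) under simplifying assumptions not made in the paper: scikit-learn gradient boosting as the base regressor and a constant treatment propensity.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def x_learner_effects(X, w, y, prop=0.5):
    """Minimal X-learner sketch (Künzel et al., 2019) for a binary
    treatment w (e.g. shown at the anchor position or not).

    Stage 1: fit separate outcome models for treated and control.
    Stage 2: impute individual treatment effects and regress them
    on X; combine the two effect models using the propensity.
    """
    mu0 = GradientBoostingRegressor().fit(X[w == 0], y[w == 0])
    mu1 = GradientBoostingRegressor().fit(X[w == 1], y[w == 1])

    d1 = y[w == 1] - mu0.predict(X[w == 1])   # imputed effects, treated
    d0 = mu1.predict(X[w == 0]) - y[w == 0]   # imputed effects, control

    tau1 = GradientBoostingRegressor().fit(X[w == 1], d1)
    tau0 = GradientBoostingRegressor().fit(X[w == 0], d0)

    g = prop  # treatment propensity; constant here for simplicity
    return g * tau0.predict(X) + (1 - g) * tau1.predict(X)

# Synthetic check: a known constant effect of 2.0 should be recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
w = rng.integers(0, 2, size=400)
y = X[:, 0] + 2.0 * w + rng.normal(scale=0.1, size=400)
tau_hat = x_learner_effects(X, w, y)
```

The two-model structure is exactly what a multivariate R-learner would collapse into a single model across positions.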

Acknowledgements.
We thank Tao Ye, Jenny Lin, Oliver Bembom, Filip Korzeniowski, Ali Goli, Oscar Celma, and the Pandora Science Team, for their support.

References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: footnote 1.
  • A. Agarwal, K. Takatsu, I. Zaitsev, and T. Joachims (2019a) A general framework for counterfactual learning-to-rank. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’19, New York, NY, USA, pp. 5–14. External Links: ISBN 9781450361729, Link, Document Cited by: §2, §3.1.2, §3.1.
  • A. Agarwal, X. Wang, C. Li, M. Bendersky, and M. Najork (2019b) Addressing trust bias for unbiased learning-to-rank. In The World Wide Web Conference, WWW ’19, New York, NY, USA, pp. 4–14. External Links: ISBN 9781450366748, Link, Document Cited by: §2.
  • A. Agarwal, I. Zaitsev, X. Wang, C. Li, M. Najork, and T. Joachims (2019c) Estimating position bias without intrusive interventions. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, New York, NY, USA, pp. 474–482. External Links: ISBN 9781450359405, Link, Document Cited by: §1, §1, §2, 1st item, §3.3.2.
  • Q. Ai, K. Bi, C. Luo, J. Guo, and W. B. Croft (2018) Unbiased learning to rank with unbiased propensity estimation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, New York, NY, USA, pp. 385–394. External Links: ISBN 9781450356572, Link, Document Cited by: §2.
  • S. Athey and G. Imbens (2016) Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences of the United States of America 113 (27), pp. 7353–7360. External Links: ISSN 00278424, 10916490, Link Cited by: §2.
  • S. Athey, J. Tibshirani, and S. Wager (2019) Generalized random forests. Ann. Statist. 47 (2), pp. 1148–1178. External Links: Document, Link Cited by: §1, §1, §2, §3.2.3, §3.2.3.
  • L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §3.3.3.
  • B. Carterette and P. Chandar (2018) Offline comparative evaluation with incremental, minimally-invasive online feedback. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, New York, NY, USA, pp. 705–714. External Links: ISBN 9781450356572, Link, Document Cited by: §2.
  • P. Chandar and B. Carterette (2018) Estimating clickthrough bias in the cascade model. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, New York, NY, USA, pp. 1587–1590. External Links: ISBN 9781450360142, Link, Document Cited by: §2.
  • H. Chen, T. Harinen, J. Lee, M. Yung, and Z. Zhao (2020) CausalML: python package for causal machine learning. External Links: 2002.11631 Cited by: §3.2.3.
  • T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 785–794. External Links: ISBN 9781450342322, Link, Document Cited by: §3.2.3.
  • A. Chuklin, I. Markov, and M. de Rijke (2015) Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services 7 (3), pp. 1–115. External Links: Document, Link, https://doi.org/10.2200/S00654ED1V01Y201507ICR043 Cited by: §1, §2.
  • N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey (2008) An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, New York, NY, USA, pp. 87–94. External Links: ISBN 9781595939272, Link, Document Cited by: §1, §2.
  • Z. Fang, A. Agarwal, and T. Joachims (2019) Intervention harvesting for context-dependent examination-bias estimation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’19, New York, NY, USA, pp. 825–834. External Links: ISBN 9781450361729, Link, Document Cited by: §1, §1, §2, 1st item, §3.1.1, §3.1.3, §3.3.2, §3.3.3, §5.
  • Z. Hu, Y. Wang, Q. Peng, and H. Li (2019) Unbiased lambdamart: an unbiased pairwise learning-to-rank algorithm. In The World Wide Web Conference, WWW ’19, New York, NY, USA, pp. 2830–2836. External Links: ISBN 9781450366748, Link, Document Cited by: §2.
  • G. W. Imbens and D. B. Rubin (2015) Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge University Press. External Links: Document Cited by: §2.
  • R. Jagerman, H. Oosterhuis, and M. de Rijke (2019) To model or to intervene: a comparison of counterfactual and online learning to rank from user interactions. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’19, New York, NY, USA, pp. 15–24. External Links: ISBN 9781450361729, Link, Document Cited by: §2, §2, §3.3.3.
  • T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay (2005) Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’05, New York, NY, USA, pp. 154–161. External Links: ISBN 1595930345, Link, Document Cited by: §2.
  • T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay (2007) Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Trans. Inf. Syst. 25 (2), pp. 7–es. External Links: ISSN 1046-8188, Link, Document Cited by: §1, §2.
  • T. Joachims, A. Swaminathan, and T. Schnabel (2017) Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM ’17, New York, NY, USA, pp. 781–789. External Links: ISBN 9781450346757, Link, Document Cited by: §1, §1, §2, §2, §3.1.2, §3.1, §3.3.2, §3.3.3.
  • T. Joachims (2002) Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, New York, NY, USA, pp. 133–142. External Links: ISBN 158113567X, Link, Document Cited by: §1.
  • S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu (2019) Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences 116 (10), pp. 4156–4165. External Links: Document, ISSN 0027-8424, Link, https://www.pnas.org/content/116/10/4156.full.pdf Cited by: §1, §2, §3.2.3, §3.2.3, §4.1, §5.
  • X. Nie and S. Wager (2017) Quasi-oracle estimation of heterogeneous treatment effects. External Links: 1712.04912 Cited by: §2, §5.
  • R. K. Pasumarthi, S. Bruch, X. Wang, C. Li, M. Bendersky, M. Najork, J. Pfeifer, N. Golbandi, R. Anil, and S. Wolf (2019) TF-ranking: scalable tensorflow library for learning-to-rank. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, New York, NY, USA, pp. 2970–2978. External Links: ISBN 9781450362016, Link, Document Cited by: §3.3.5.
  • T. Qin and T. Liu (2013) Introducing LETOR 4.0 datasets. CoRR abs/1306.2597. External Links: Link Cited by: §1, §3.3.1.
  • M. Richardson, E. Dominowska, and R. Ragno (2007) Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, New York, NY, USA, pp. 521–530. External Links: ISBN 9781595936547, Link, Document Cited by: §1, §2, §3.1.1.
  • P. M. Robinson (1988) Root-n-consistent semiparametric regression. Econometrica 56 (4), pp. 931–954. External Links: ISSN 00129682, 14680262, Link Cited by: §2.
  • P. R. Rosenbaum and D. B. Rubin (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1), pp. 41–55. External Links: ISSN 0006-3444, Document, Link, https://academic.oup.com/biomet/article-pdf/70/1/41/662954/70-1-41.pdf Cited by: §1, §2, §3.2.1.
  • D. B. Rubin (1974) Estimating causal effects of treatments in randomized and nonrandomized studies.. Journal of educational Psychology 66 (5), pp. 688. Cited by: §2, §3.2.1.
  • A. Schuth, H. Oosterhuis, S. Whiteson, and M. de Rijke (2016) Multileave gradient descent for fast online learning to rank. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM ’16, New York, NY, USA, pp. 457–466. External Links: ISBN 9781450337168, Link, Document Cited by: §2.
  • Y. Su, L. Wang, M. Santacatterina, and T. Joachims (2019) CAB: continuous adaptive blending for policy evaluation and learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 6005–6014. External Links: Link Cited by: §2.
  • S. Wager and S. Athey (2018) Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113 (523), pp. 1228–1242. External Links: Document, Link, https://doi.org/10.1080/01621459.2017.1319839 Cited by: §2, §3.2.3, §3.2.3.
  • X. Wang, M. Bendersky, D. Metzler, and M. Najork (2016) Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’16, New York, NY, USA, pp. 115–124. External Links: ISBN 9781450340694, Link, Document Cited by: §1, §1, §2, §3.1, §3.3.2.
  • X. Wang, N. Golbandi, M. Bendersky, D. Metzler, and M. Najork (2018) Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, New York, NY, USA, pp. 610–618. External Links: ISBN 9781450355810, Link, Document Cited by: §1, §2, §3.1.1, §3.1.2, §3.3.2.
  • Y. Yue and T. Joachims (2009) Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, New York, NY, USA, pp. 1201–1208. External Links: ISBN 9781605585161, Link, Document Cited by: §2.