Correcting for Selection Bias in Learning-to-rank Systems

Click data collected by modern recommendation systems are an important source of observational data that can be utilized to train learning-to-rank (LTR) systems. However, these data suffer from a number of biases that can result in poor performance for LTR systems. Recent methods for bias correction in such systems mostly focus on position bias, the fact that higher-ranked results (e.g., top search engine results) are more likely to be clicked even if they are not the most relevant results given a user's query. Less attention has been paid to correcting for selection bias, which occurs because clicked documents reflect what documents were shown to the user in the first place. Here, we propose new counterfactual approaches that adapt Heckman's two-stage method and account for both selection and position bias in LTR systems. Our empirical evaluation shows that our proposed methods are much more robust to noise and have better accuracy compared to existing unbiased LTR algorithms, especially when there is moderate to no position bias.





1. Introduction

The abundance of data found online has inspired new lines of inquiry about human behavior and the development of machine-learning algorithms that learn individual preferences from such data. Patterns in such data are often driven by the underlying algorithms supporting online platforms, rather than naturally-occurring user behavior. For example, interaction data from social media news feeds, such as user clicks and comments on posts, reflect not only latent user interests but also news feed personalization and what the underlying algorithms chose to show to users in the first place. Such data in turn are used to train new news feed algorithms, propagating the bias further (chaney-recsys18). This can lead to phenomena such as filter bubbles and echo chambers and can challenge the validity of social science research that relies on found data (japec-poq15; lazer-science14).

One of the places where these biases surface is in personalized recommender systems whose goal is to learn user preferences from available interaction data. These systems typically rely on learning procedures to estimate the parameters of

new ranking algorithms that are capable of ranking items based on inferred user preferences, in a process known as learning-to-rank (LTR) (liu-springer11). Much of the work on unbiasing the parameter estimation for learning-to-rank systems has focused on position bias (joachims-wsdm17), the bias caused by the position where a result was displayed to a user. Position bias makes higher ranked results (e.g., top search engine results) more likely to be clicked even if they are not the most relevant.

Algorithms that correct for position bias typically assume that all relevant results have a non-zero probability of being observed (and thus clicked) by the user, and focus on boosting the relevance of lower-ranked relevant results (joachims-wsdm17). However, users rarely have the chance to observe all relevant results, either because the system chose to show a truncated list of top recommended results or because users do not spend the time to peruse tens to hundreds of ranked results. In this case, lower-ranked relevant results have zero probability of being observed (and clicked) and never get the chance to be boosted in LTR systems. This leads to selection bias in clicked results, which is the focus of our work.

Here, we frame the problem of learning to rank as a counterfactual problem of predicting whether a document would have been clicked had it been observed. In order to recover from selection bias for clicked documents, we focus on identifying the relevant documents that were never shown to users. Our formulation is different from previous counterfactual formulations which correct for position bias and study the likelihood of a document being clicked had it been placed in a higher position given that it was placed in a lower position (joachims-wsdm17).

Here, we propose a general framework for recovering from selection bias that stems from both limited choices given to users and position bias. First, we propose Heckman^{rank}, an algorithm for addressing selection bias in the context of learning-to-rank systems. By adapting Heckman's two-stage method, an econometric tool for addressing selection bias, we account for the limited choice given to users and the fact that some items are more likely to be shown to a user than others. Because this correction method is very general, it is applicable to any type of selection bias in which the system's decision to show documents can be learned from features. Because Heckman^{rank} treats selection as a binary variable, we propose two bias-correcting ensembles that account for the nuanced probability of being selected due to position bias and combine Heckman^{rank} with existing position-bias correction methods.

Our experimental evaluations demonstrate the utility of our proposed methods when compared to state-of-the-art algorithms for unbiased learning-to-rank. Our ensemble methods have better accuracy compared to existing unbiased LTR algorithms under realistic selection bias assumptions, especially when the position bias is not severe. Moreover, Heckman^{rank} is more robust to noise than both ensemble methods and position-bias-correcting methods across different position bias assumptions. The experiments also show that selection bias affects the performance of LTR systems even in the absence of position bias, and Heckman^{rank} is able to correct for it.

2. Related work

Here, we provide the context for our work and present the three areas that best encompass our problem: bias in recommender systems, selection bias correction, and unbiased learning-to-rank.

Bias in recommender systems. Many technological platforms, such as recommendation systems, tailor items to users by filtering and ranking information according to user history. This process influences the way users interact with the system and how the data collected from users are fed back to the system, and it can lead to several types of biases. chaney-recsys18 explore a closely related problem called algorithmic confounding bias, where live systems are retrained to incorporate data that were influenced by the recommendation algorithm itself. Their study highlights the fact that training recommendation platforms with naive data that are not debiased can cause a severe decrease in the utility of such systems. For example, "echo chambers" are a consequence of this problem (fleder2007recommender; dan2013long), where users are limited to an increasingly narrower choice set over time, which can lead to a phenomenon called polarization (dandekar2013biased). Popularity bias is another bias affecting recommender systems, studied by celma2008hits; it refers to the tendency of a recommender system to display the most popular items to a user, even if they are not the most relevant to the user's query. Recommender systems can also affect users' decision-making processes, a phenomenon known as decision bias, and chen2013human show how understanding this bias can improve recommender systems. Position bias is yet another type of bias studied in the context of learning-to-rank systems; it refers to the fact that higher-ranked documents are more likely to be selected regardless of their relevance. joachims-wsdm17 focus on this bias, and we compare our results to theirs throughout.

Selection bias correction. Selection bias occurs when a data sample is not representative of the underlying data distribution. Selection bias can have various underlying causes, such as participants self-selecting into a study based on certain criteria, or subjects choosing over a choice set that is restricted in a non-random way. Selection bias could also encompass the biases listed above. Various studies attempt to correct for selection bias in different contexts.

Heckman correction and, more generally, bivariate selection models control for the probability of being selected into the sample when predicting outcomes (heckman-econometrica79). smith-kdd04 study Heckman correction for different types of selection bias through Bayesian networks, but not in the context of learning-to-rank systems.


Others study selection bias in the context of well-known classifiers, where the outcome is binary rather than continuous as with ranking algorithms. Selection bias has also been studied in the context of causal graphs (bareinboim-aaai14; bareinboim-pnas16; correa-aaai18; bareinboim2015recovering; correa2017causal). For example, if an underlying data-generation model is assumed, bareinboim-aistats12 show that selection bias can be removed even in the presence of confounding bias, i.e., when a variable can affect both treatment and control. We leverage this work in our discussion of identifiability under selection bias.

The studies most closely related to our context are schnabel2016recommendations, wang2018deconfounded, and hernandez2014probabilistic. Both schnabel2016recommendations and hernandez2014probabilistic use a matrix factorization model to represent data (ratings by users) that are missing not-at-random, with schnabel2016recommendations outperforming hernandez2014probabilistic. More recently, joachims-wsdm17 proposed a position-debiasing approach in the context of learning-to-rank systems that is more general than schnabel2016recommendations. Throughout, we compare our results to joachims-wsdm17, although it should be noted that the latter deals with a more specific bias (position bias) than what we address here. Finally, wang2018deconfounded address selection bias due to confounding, whereas we address selection bias that is treatment-dependent only.

Unbiased learning-to-rank. The problem we study here investigates debiasing data in learning-to-rank systems. There are two approaches to LTR systems, offline and online, and the work we propose here falls in the category of offline LTR systems.

Offline LTR systems learn a ranking model from historical click data and interpret clicks as absolute relevance indicators (joachims2005accurately; joachims-wsdm17; ai-sigir18; borisov2016neural; chapelle2009dynamic; craswell-wsdm08; joachims2002optimizing; richardson2007predicting; wang-wsdm18; wang-sigir16; hu2019unbiased). Offline approaches must contend with the many biases that found data are subject to, including position and selection bias, among others. For example, wang-sigir16 use a propensity-weighting approach to overcome position bias. Similarly, joachims-wsdm17 propose a method to correct for position bias by augmenting learning with an inverse propensity score defined for clicks rather than queries. They demonstrate that their propensity-weighted ranker outperforms a standard one by accounting for position bias. More recently, agarwal2019general proposed an nDCG-based method that outperforms the propensity-weighted approach of joachims-wsdm17, but only when position bias is severe. We show that our proposed algorithm outperforms joachims-wsdm17 when position bias is not severe; thus, we do not compare our results to agarwal2019general.

Other studies aim to improve on joachims-wsdm17, such as wang-wsdm18 and ai-sigir18, but mainly in the ease of their methodology. wang-wsdm18 propose a regression-based Expectation-Maximization method for estimating the click position bias; its main advantage over joachims-wsdm17 is that it does not require randomized tests to estimate the propensity model. Similarly, the Dual Learning Algorithm (DLA) proposed by ai-sigir18 jointly learns the propensity model and the ranking model without randomization tests. hu2019unbiased introduce a method that jointly estimates position bias and trains a ranker using a pairwise loss function. The focus of these latter studies is position bias and not selection bias, namely the fact that some relevant documents may not be exposed to users at all, which is what we study here.

In contrast to offline LTR systems, online LTR algorithms intervene during click collection by interactively updating a ranking model after each user interaction (hofmann2013reusing; oosterhuis2018differentiable; schuth2014multileaved; yue2009interactively; chapelle2012large; raman2013learning; schuth2016multileave; jagerman2019model). This can be costly, as it requires intervening with users' experience of the system. The main study in this context is jagerman2019model, who compare the online learning approach of oosterhuis2018differentiable with the offline LTR approach of joachims-wsdm17 under selection bias. The study shows that the method of oosterhuis2018differentiable outperforms joachims-wsdm17 both when selection bias and moderate position bias exist and when there is no selection bias but severe position bias. One advantage of our offline algorithms over online LTR ones is that they do not negatively impact user experience while learning.

3. Problem description

In this section, we review the definition of learning-to-rank systems, position and selection bias in recommender systems, as well as our framing of bias-corrected ranking with counterfactuals.

3.1. Learning-to-Rank Systems

We first describe learning-to-rank systems assuming knowledge of true relevances (the full-information setting), following (joachims-wsdm17). Given a sample of i.i.d. queries $x_i \sim P(x)$ with relevance scores $r_i(y)$ for all documents $y$, we denote by $\Delta(\mathbf{y} \mid x_i, r_i)$ the loss of a ranking $\mathbf{y}$ for query $x_i$. The risk of a ranking system $S$ that returns ranking $S(x)$ for query $x$ is given by:

$$R(S) = \int \Delta(S(x) \mid x, r) \, dP(x)$$

Since the distribution of queries is not known in practice, $R(S)$ cannot be computed directly; it is often estimated empirically as follows:

$$\hat{R}(S) = \frac{1}{N} \sum_{i=1}^{N} \Delta(S(x_i) \mid x_i, r_i)$$

The goal of learning-to-rank systems is to find a ranking function $S$ that minimizes the risk $R(S)$. Learning-to-rank systems are a special case of recommender systems where an appropriate ranking is learned.
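As a concrete illustration, the empirical risk is just an average of per-query losses. The short sketch below uses hypothetical helper names and assumes the loss of a ranking is the sum of the 1-based ranks of its relevant documents (the metric used later in Section 3.2):

```python
import numpy as np

def empirical_risk(rankings, relevances):
    """Empirical risk: the average per-query loss, where the loss of a
    ranking is the sum of the 1-based ranks of its relevant documents."""
    losses = []
    for ranking, rel in zip(rankings, relevances):
        # pos + 1 is the 1-based rank of document `doc` in this ranking
        losses.append(sum(pos + 1 for pos, doc in enumerate(ranking) if rel[doc] == 1))
    return np.mean(losses)
```

Lower values are better under this loss: a system that places relevant documents near the top of each ranking attains a smaller empirical risk.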

The relevance score $r_i(y)$ denotes the true relevance of document $y$ for a specific query $x_i$. It is typically obtained via human annotation and is necessary for the full-information setting. Despite being reliable, true relevance assignments are frequently impossible or expensive to obtain because they require a manual evaluation of every possible document given a query.

Due to the cost of annotation, recommender-system training often relies on implicit feedback from users, in what is known as the partial-information setting. Click logs collected from users are easily observable in any recommender system and can serve as a proxy for the relevance of a document. For this reason, clicks are frequently used to train new recommendation algorithms. Unfortunately, there is a cost to using click-log data because of noise (e.g., people can click on items that are not relevant) and the various biases that the data are subject to, including position bias and selection bias, which we discuss next.

3.2. Position bias

Implicit feedback (clicks) in LTR systems is inherently biased. Position bias refers to the notion that higher ranked results are more likely to be clicked by a user even if they are not the most relevant results given a user’s query.

Previous work (joachims-wsdm17; wang-sigir16) has focused on tempering the effects of position bias via inverse propensity weighting (IPW). IPW re-weights the relevance of documents using a factor inversely related to the documents' position on a page. For a given query instance $x_i$, the relevance of document $y$ to query $x_i$ is $r_i(y)$, and $o_i$ is a vector indicating whether each document $y$ is observed. Suppose the performance metric of interest is the sum of the ranks of relevant documents:

$$\Delta(\mathbf{y} \mid x_i, r_i) = \sum_{y \in \mathbf{y}} \mathrm{rank}(y \mid \mathbf{y}) \cdot r_i(y)$$

Due to position bias, given a presented ranking $\bar{\mathbf{y}}_i$, clicks are more likely to occur for top-ranked documents. Therefore, the goal is to obtain an unbiased estimate of $\Delta(\mathbf{y} \mid x_i, r_i)$ for a new ranking $\mathbf{y}$.

There are existing approaches that address position bias in LTR systems. For example, the ranker proposed by joachims-wsdm17 uses inverse propensity weights (IPW) to counteract the effects of position bias:

$$\hat{\Delta}_{IPW}(\mathbf{y} \mid x_i, \bar{\mathbf{y}}_i, o_i) = \sum_{y:\, o_i(y)=1 \,\wedge\, r_i(y)=1} \frac{\mathrm{rank}(y \mid \mathbf{y})}{Q(o_i(y)=1 \mid x_i, \bar{\mathbf{y}}_i, r_i)}$$

where the propensity weight $Q(o_i(y)=1 \mid x_i, \bar{\mathbf{y}}_i, r_i)$ denotes the marginal probability of observing the relevance $r_i(y)$ of result $y$ for query $x_i$, when the user is presented with ranking $\bar{\mathbf{y}}_i$. joachims-wsdm17 estimated the IPW to be:

$$p_r = \left(\frac{1}{r}\right)^{\eta}$$

where $r$ is the position at which the result was displayed and $\eta$ is the severity of position bias. The IPW has two main properties. First, it is computed only for documents that are observed and clicked; documents that are never clicked do not contribute to the IPW calculation. Second, as shown by Joachims et al. (joachims-wsdm17), a ranking model trained with clicks and the IPW method will converge to a model trained with true relevance labels, rendering an LTR framework robust to position bias.
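A minimal sketch of this correction, assuming the propensity model $p_r = (1/r)^{\eta}$ above (function names are illustrative, not from the paper): each clicked document contributes its rank under the new ranking, re-weighted by the inverse of the propensity at the position where it was originally presented:

```python
def propensity(rank, eta):
    """Observation propensity for a result displayed at `rank` (1-based);
    eta controls the severity of position bias (eta = 0 means no bias)."""
    return (1.0 / rank) ** eta

def ipw_loss(clicked_ranks_old, ranks_new, eta):
    """IPW estimate of the sum-of-ranks metric over clicked documents:
    each clicked document's rank under the new ranking is divided by the
    propensity of observation at its originally presented position."""
    return sum(r_new / propensity(r_old, eta)
               for r_old, r_new in zip(clicked_ranks_old, ranks_new))
```

With `eta = 0` the weights are all 1 and the estimate reduces to the naive sum of ranks; larger `eta` up-weights clicks that occurred at low original positions.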

3.3. Selection bias

LTR systems rely on implicit feedback (clicks) to improve their performance. However, a sample of relevant documents from click data does not reflect the true distribution of all possible relevant documents, because a user observes a limited choice of documents. This can occur because i) a recommender system ranks relevant documents too low for a user to feasibly see them, or ii) a user can examine only a truncated list of top recommended items. As a result, clicked documents are not randomly selected for LTR systems to be trained on, and therefore cannot reveal the relevance of documents that were excluded from the ranking. This leads to selection bias.

Selection bias and position bias are closely related. Besides selection bias due to unobserved relevant documents, selection bias can also arise from position bias: lower-ranked results are less likely to be observed, and thus less likely to be selected, than higher-ranked ones. Previous work on LTR algorithms that correct for position bias assigns a non-zero observation probability to all documents, and proofs of debiasing are based on this assumption (joachims-wsdm17). However, in practice it is rarely realistic to assume that all documents can be observed by a user. When there is a large list of potentially relevant documents, the system may choose to show only the top results, and a user can only act on these results. Therefore, lower-ranked results are never observed, which leads to selection bias. Here, we consider the selection bias that arises when some documents have zero probability of being observed because they are ranked below a certain cutoff. The objective of this paper is to propose a ranking algorithm that corrects for both selection and position bias, and is therefore a better tool for training future LTR systems (see Section 4.1).

3.4. Ranking with counterfactuals

We define the problem of ranking documents as a counterfactual problem (pearl2016causal). Let $O_{q,d}$ denote a treatment variable indicating whether a user observed document $d$ given query $q$. Let $C_{q,d}$ represent the click counterfactual, indicating whether document $d$ would have been clicked had it been observed under query $q$. The goal of ranking with counterfactuals is to reliably estimate the probability of click counterfactuals for all documents:

$$P(C_{q,d} = 1)$$

and then rank the documents according to this probability. Solving the ranking-with-counterfactuals problem would allow us to find a ranking system $S$ that returns a ranking for query $q$ that is robust to selection bias.

Current techniques that correct for position bias aim to provide reliable estimates of this probability by taking into consideration the rank-dependent probability of being observed. However, this approach is only effective for documents that have a non-zero probability of being observed:

$$P(C_{q,d} = 1 \mid O_{q,d} = 1), \qquad P(O_{q,d} = 1) > 0$$
The challenge with selection bias is to estimate this probability for documents that have neither been observed nor clicked in the first place:

$$P(C_{q,d} = 1 \mid O_{q,d} = 0)$$
To address this challenge, in the following Section 4 we turn to econometric methods, which have a long history of addressing selection bias.

Note that in order to recover from selection bias we must address the concept of identifiability and whether causal estimates can even be obtained in the context of our setup. A bias is recoverable if the treatment is known (bareinboim-aaai14). In our context the treatment is whether a document enters the training-data pool (is clicked). While it is difficult to guarantee that a user observed a document that was shown to them (i.e., we cannot know whether the absence of a click is due to non-observance or to non-relevance), it is easier to guarantee that a document was not observed by a user if it was not shown to them in the first place (e.g., it is below a cutoff for top results, or the user never scrolled down to that document in a ranked list). Our proposed solution, therefore, identifies the treatment first as a binary variable (whether the document is shown versus not shown) and then as a continuous variable that takes position bias into account.

4. Bias-corrected ranking with counterfactuals

In this section we adapt a well-known sample selection correction method, known as Heckman’s two-stage correction, to the context of LTR systems. Integrating the latter framework requires a detailed understanding of how LTR systems generally process and cut interaction data to train new recommendation algorithms, and at what stages in that process selection biases are introduced. Thus, while the methodology we introduce is a well established tool in the causal inference literature, integrating it within the multiple stages of training a machine learning algorithm is a complex translational problem. We then introduce two aggregation methods to combine our proposed , correcting for selection bias, with existing methods for position bias to further improve the accuracy in ranking prediction.

4.1. Selection bias correction with Heckman^{rank}

Econometrics, the application of statistical methods to economic problems, has long been concerned with confounded or held-out data in the context of consumer choices. Economists are interested in estimating models of consumer choice both to learn consumers' preferences and to predict their outcomes. Frequently, the data used to estimate these models are observational, not experimental. As such, the outcomes observed in the data are based on a limited and self-selected sample. A quintessential example of this problem is estimating the impact of education on workers' wages based only on those workers who are employed (heckman-econometrica79). Those who are employed are a self-selected sample, and estimates of education's effect on wages will therefore be biased.

A broad class of models in econometrics that deal with such selection biases are known as bivariate sample selection models. A well-known method for correcting these biases is the Heckman correction, or two-step Heckman. In the first stage, the probability of self-selection is estimated; in the second stage, that probability is accounted for. As heckman-econometrica79 pointed out, self-selection bias can occur for two reasons: "First, there may be self selection by the individuals or data units being investigated (as in the labor example). Second, sample selection decisions by analysts or data processors operate in much the same fashion as self selection (by individuals)."

Adapting a sample selection model, such as Heckman’s, to LTR systems requires an understanding of when and how data are progressively truncated when training a recommender algorithm. We introduce notation and a framework to outline this problem here.

Let $c_{q,d}$ denote whether document $d$ is selected (e.g., clicked) under query $q$ for each query-document pair; $x_{q,d}$ represents the features of the pair, and $\epsilon_{q,d}$ is a normally distributed error term. The same query can produce multiple query-document pairs, where the documents are then ordered by an LTR algorithm. However, it is important to note that an LTR algorithm will not rank every single document in the data given a query. Unranked documents are typically discarded when training future algorithms. Herein lies the selection bias. Documents that are not shown to the user can then never be predicted as a potential choice. Moreover, documents far down in the rankings may still be kept in future training data, but will appear infrequently. Both these points contribute to generating increasingly restrictive data that new algorithms are trained on.

If we fail to account for the repercussions of these selection biases, then modeling whether a document is selected will be based only upon the features of documents that were ranked and shown to the user, which can be written as:

$$c_{q,d} = x_{q,d}\,\beta + \epsilon_{q,d} \qquad (10)$$

In this setup we only consider a simple linear model; however, future research will incorporate nonlinear models. In estimating (10), we refer to the feature-weights estimator $\hat{\beta}$ as biased, because the feature design matrix will only reflect documents that were shown to the user. But documents that were not shown to the user could also have been selected. Thus, (10) reflects the limitation outlined in (7). When we discard unseen documents, we can only predict clicks for documents that were shown, while our objective is to predict the unconditional probability that a document is clicked, regardless of whether it was shown.

To address this point, we first explicitly model an algorithm's document-selection process. Let $s_{q,d}$ denote a binary variable that indicates whether a document is shown and observed ($s_{q,d} = 1$) or not shown and not observed ($s_{q,d} = 0$). For now, we assume that if a document is shown to the user, the user also sees the document. We relax this assumption in Section 4.2. $w_{q,d}$ is a set of explanatory variables that determine whether a document is shown, which includes the features in $x_{q,d}$ but can also include external features, including characteristics of the algorithm that first generated the data:

$$s_{q,d} = \mathbb{1}[\,w_{q,d}\,\gamma + u_{q,d} > 0\,] \qquad (11)$$

where $\mathbb{1}[\cdot]$ is the indicator function and $u_{q,d}$ is a normally distributed error term.
In the first stage of Heckman^{rank}, we estimate the probability of a document being observed using a Probit model:

$$P(s_{q,d} = 1 \mid w_{q,d}) = \Phi(w_{q,d}\,\gamma) \qquad (12)$$

where $\Phi$ denotes the standard normal CDF. Note that a crucial assumption here is that we use both seen and unseen documents for a given query in estimating (12). Therefore, the dimensions of our data will be far larger than if we had discarded unseen documents, as most LTR systems typically do. After estimating (12) we can compute what is known as the Inverse Mills ratio for every query-document pair:

$$\lambda_{q,d} = \frac{\phi(w_{q,d}\,\hat{\gamma})}{\Phi(w_{q,d}\,\hat{\gamma})} \qquad (13)$$

where $\phi$ is the standard normal density. $\lambda_{q,d}$ reflects the severity of selection bias and corresponds to our desire to condition on $s_{q,d} = 1$ versus $s_{q,d} = 0$, as described in Equations 7 and 9, but using a continuous variable reflecting the probability of selection.

In the second stage of Heckman^{rank}, we estimate the probability that a user will click on a document. Heckman's seminal work showed that if we condition our estimates on the Inverse Mills ratio, our estimated feature weights will be statistically unbiased in expectation. This can improve our predictions if we believe that the Inverse Mills ratio is relevant in predicting clicks. We assume joint normality of the errors, and our setup naturally implies that the error terms of the click equation (10) and the selection equation (11) are correlated, namely that clicking on a document depends upon whether a document is observed by users and, therefore, has the potential for being selected.

The conditional expectation of clicking on a document, given that the document is shown, is:

$$E[c_{q,d} \mid x_{q,d}, s_{q,d} = 1] = x_{q,d}\,\beta + \rho\,\sigma_{\epsilon}\,\lambda_{q,d} \qquad (14)$$

where $\rho$ is the correlation between the error terms of (10) and (11) and $\sigma_{\epsilon}$ is the standard deviation of the click-equation error. We can see that if the error terms in (10) and (11) are correlated, then $\rho \neq 0$, and estimating (14) without accounting for this correlation will lead to biased estimates of $\beta$. Thus, in the second stage, we correct for selection bias to obtain an unbiased estimate of $\beta$ by controlling for $\lambda_{q,d}$:

$$c_{q,d} = x_{q,d}\,\beta + \beta_{\lambda}\,\lambda_{q,d} + \nu_{q,d} \qquad (15)$$

Estimation of (15) allows us to predict click probabilities $\hat{c}_{q,d}$ for all query-document pairs. This click probability refers to our ability to estimate (9), the unconditional click probability. We then compute document rankings for a given query by sorting documents according to their predicted click probabilities. Note that our main equation (15) has a binary outcome. Thus, in this selection-correction setup we follow a Heckprobit model, as opposed to the initial model Heckman proposed in heckman-econometrica79, where the main outcome is a continuous variable.

Our setup helps account for the inherent selection bias that can occur in any LTR system, as all LTR systems must make a choice in what documents they show to a user. What is unique to our formulation of the problem is our use of a two stage estimation process to account for the two stage document selection process: namely, whether the document is shown, and whether the document is then selected. Accounting for the truncation of the data is critical for training a LTR system, and previously has not been considered. In order to improve a system’s ranking accuracy it must be able to predict document selection for both unseen as well as seen documents. If not, the choice set of documents that are available to a user can only become progressively smaller. Our correction is a simple method to counteract such a trend in found data.
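The two stages above can be sketched as follows. This is an illustrative implementation under the stated assumptions (a probit selection equation fit by maximum likelihood on all documents, then a second-stage probit on shown documents with the Inverse Mills ratio appended as a regressor); the function names and the choice of optimizer are ours, not the paper's:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def fit_probit(W, s):
    """Probit MLE of P(s = 1 | w) = Phi(w @ gamma)."""
    def nll(gamma):
        z = W @ gamma
        # norm.logcdf is numerically stable for large |z|
        return -np.sum(s * norm.logcdf(z) + (1 - s) * norm.logcdf(-z))
    return minimize(nll, np.zeros(W.shape[1]), method="BFGS").x

def inverse_mills(W, gamma):
    """Inverse Mills ratio: lambda = phi(w @ gamma) / Phi(w @ gamma)."""
    z = W @ gamma
    return norm.pdf(z) / norm.cdf(z)

def heckman_two_stage(W, s, X, c):
    """Stage 1: fit the selection (shown vs. not shown) equation on ALL
    documents. Stage 2: fit the click equation on shown documents only,
    with the Inverse Mills ratio appended as an extra regressor to absorb
    the correlation between the selection and click error terms."""
    gamma = fit_probit(W, s)
    lam = inverse_mills(W, gamma)
    shown = s == 1
    X_aug = np.column_stack([X[shown], lam[shown]])
    beta = fit_probit(X_aug, c[shown])  # last coefficient multiplies lambda
    return gamma, beta
```

At prediction time, click probabilities for all documents (shown or not) can be obtained by applying the second-stage coefficients to each document's features together with its Inverse Mills ratio, and documents are then ranked by those probabilities.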

4.2. Bias-correcting ensembles

Biased data limit the ability to accurately train an LTR algorithm on click logs. In this section, we present methods for addressing two types of selection bias, one stemming from truncated recommendations and the other from position bias. One deficiency of using Heckman^{rank} to deal with biased data is that it assumes that all documents shown to a user are also observed by the user. However, due to position bias that is not necessarily the case, and lower-ranked shown documents have a lower probability of being observed. Therefore, it is natural to consider combining Heckman^{rank}, which focuses on recovering from selection bias due to unobserved documents, with a ranking algorithm that accounts for the nuanced observation probability of shown documents due to position bias.

Algorithms that rely on IPW (wang-sigir16; joachims-wsdm17; agarwal2019general) consider the propensity of observation for any document given a ranking for a certain query, where the propensity depends strongly on the rank of the document in the given ranking. This is clearly different from our approach for recovering from selection bias, where we model the observation probability as either 0 or 1, depending on the document's position relative to the cutoff.

Ensemble ranking objective. In order to harness the power of correcting for these biases in a collective manner, we propose to use ensembles that combine the results produced by Heckman^{rank} and any position-bias-correcting method. We refer to the algorithm correcting for selection bias and the algorithm correcting for position bias as the two base rankers, and to the rankings they generate for a certain query as the selection-corrected and position-corrected rankings, respectively. Our goal is to produce an ensemble ranking based on these two rankings for all queries that is more accurate than either ranking alone.

There is wide scope for designing an appropriate ensemble method to serve our objective. We propose two simple but powerful approaches, as our experimental evaluation shows. The two approaches differ in their fundamental intuition. The intuition behind the first approach is to model the value of individual ranking algorithms through a linear combination of the rankings they produce. We can learn the coefficients of that linear combination using linear models on the training data. We call this method Linear Combination. The second approach is a standard approach for combining ranking algorithms using Borda counts (dwork-www01). It works as a post-processing step: after the candidate algorithms produce their respective rankings, we apply a rank-aggregation algorithm over the two rankings to produce the ensemble ranking for a given query for evaluation. Next, we discuss each of the approaches in the context of our problem.

4.2.1. Linear Combination

A simple aggregation method for combining the two rankings is to estimate the value of each algorithm in predicting a click. After training the two algorithms, we use the same training data to learn the weights of a linear model that considers the rank of each document produced by each algorithm. For any given query, each document thus has two ranks, one produced by the selection bias corrector and one by the position bias corrector. We also consider the relevance of each document, which is either 0 (not relevant) or 1 (relevant), modeled through clicks.

We train a binary classifier to predict the relevance (click) of documents, which incorporates the estimated value of the individual algorithms. We select logistic regression as the binary classifier in our implementation, but any other standard classification method should work as well. We model the relevance r_d of a document d, given its two ranks h_d and p_d in the two rankings, as the following logistic function:

P(r_d = 1 | h_d, p_d) = 1 / (1 + e^{-(w_0 + w_1·h_d + w_2·p_d)})

Upon training the logistic regression model we learn the parameters w_0, w_1, and w_2, where w_1 and w_2 represent the estimated impact of the selection bias corrector and the position bias corrector, respectively. During evaluation we predict the counterfactual click probability for each query-document pair using the trained classifier. Then we sort the documents for each query according to these probability values to generate the final ensemble ranking.
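As an illustration, the Linear Combination ensemble might be sketched as follows, with hypothetical toy ranks and click labels, and scikit-learn's logistic regression standing in for any standard binary classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: for each (query, document) pair, the rank
# assigned by the selection bias corrector (r_h), the rank assigned by the
# position bias corrector (r_p), and a binary click label.
r_h = np.array([1, 2, 3, 4, 1, 2, 3, 4])
r_p = np.array([2, 1, 4, 3, 1, 3, 2, 4])
clicks = np.array([1, 1, 0, 0, 1, 0, 1, 0])

# Learn weights w_1, w_2 estimating the impact of each component ranking.
clf = LogisticRegression().fit(np.column_stack([r_h, r_p]), clicks)

def ensemble_ranking(ranks_h, ranks_p):
    """Sort candidate documents by predicted (counterfactual) click probability."""
    scores = clf.predict_proba(np.column_stack([ranks_h, ranks_p]))[:, 1]
    return np.argsort(-scores)  # document indices, best first

order = ensemble_ranking([1, 2, 3, 4], [2, 1, 4, 3])
```

Documents ranked highly by both components receive the highest predicted click probability and lead the ensemble ranking.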

4.2.2. Rank Aggregation

Rank aggregation aims to combine rankings generated by multiple ranking algorithms. In a typical rank aggregation problem, we are given a set of rankings of a set of objects for a query. The objective is to find a single ranking that agrees as much as possible with all the given rankings. Many aggregation methods have been proposed (lin-wir2010). A commonly used approach is the Borda count, which scores documents based on their relative position in a ranking, and then totals the scores across all rankings for a given query (dwork-www01).

In our scenario, we have two rankings for each query, one from the selection bias corrector and one from the position bias corrector, over the same set of documents. The score given to a document in a specific ranking is simply the number of documents it beats in that ranking, and the total score for a document is the sum of its scores in the two rankings. Based on these total scores, we sort the documents in non-ascending order to produce the ensemble ranking. For example, given a certain query over n documents, if a document is ranked first in one ranking and third in the other, its total score would be (n-1) + (n-3). This very simple scheme reflects the power of the combined method to recover from different biases in LTR systems.
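A minimal sketch of this Borda-count aggregation, with hypothetical document ids:

```python
def borda_score(ranking):
    """Score = number of documents a document beats in one ranking.
    `ranking` maps document id -> rank position (1 = best)."""
    n = len(ranking)
    return {doc: n - rank for doc, rank in ranking.items()}

def borda_aggregate(rank_a, rank_b):
    """Combine two rankings of the same documents by total Borda score."""
    s_a, s_b = borda_score(rank_a), borda_score(rank_b)
    total = {doc: s_a[doc] + s_b[doc] for doc in rank_a}
    # Sort documents in non-ascending order of their total scores.
    return sorted(total, key=lambda d: -total[d])

# 'd1' is ranked 1st by one component algorithm and 3rd by the other.
order = borda_aggregate({'d1': 1, 'd2': 2, 'd3': 3},
                        {'d1': 3, 'd2': 1, 'd3': 2})
```

Here 'd2' wins with total score (3-2) + (3-1) = 3, ahead of 'd1' with score 2.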

5. Experiments

In this section, we evaluate our proposed approach for addressing selection bias under several conditions:

  • Varying the number of observed documents given a fixed position bias (Section 5.2)

  • Varying position bias with no noise (Section 5.2.1)

  • Varying position bias with noisy clicks (Section 5.2.2)

  • Varying noise level in click sampling (Section 5.2.3)

The parameter values are summarized in Table 1.

5.1. Experimental setup

Next, we describe the dataset we use, the process for click data generation, and the evaluation framework.

5.1.1. Base dataset

In order to explore selection bias in LTR systems, we conduct several experiments using semi-synthetic datasets based on set 1 and set 2 from the Yahoo! Learning to Rank Challenge (C14B). Set 1 contains train and test queries with their associated documents; set 2 is substantially smaller, containing train queries and their documents (chapelle-11yahoo). Each query is represented by an id, and each ⟨query, document⟩ pair is represented by a feature vector with normalized feature values. The dataset contains true relevance of rankings based on an expert-annotated relevance score associated with each pair, with 0 meaning least relevant and 4 most relevant. We binarized the relevance score following joachims-wsdm17, such that 0 denotes irrelevant (a relevance score of 0, 1, or 2) and 1 relevant (a score of 3 or 4).

We first conduct extensive experiments on the train portion of the smaller set 2, where we randomly sample a portion of the queries as training data and reserve the rest as test data, with which LTR algorithms can be trained and evaluated respectively (Section 5.2). To confirm the performance of our proposed method with out-of-sample test data, we conduct experiments on the larger set 1, where we train LTR algorithms on set 1 train data and evaluate them on set 1 test data (Section 5.3).

5.1.2. Semi-synthetic data generation

We use the real-world base dataset to generate semi-synthetic datasets that contain document clicks for rankings. The main motivation behind using the Yahoo! Learning to Rank dataset is that it provides unbiased ground truth for relevant results, thus enabling unbiased evaluation of ranking algorithms. In real-world scenarios, unbiased ground truth is hard to come by, and LTR algorithms are typically trained on biased click data, which does not allow for unbiased evaluation. To mimic real-world scenarios for LTR, the synthetic data generation creates such biased click data.

We follow the data-generation process of joachims-wsdm17. We train a base ranker with a fraction of the training dataset that contains true relevances, and then use the trained model to generate rankings for the remaining queries in the training dataset. The second step of the data-generation process generates clicks on the ranked documents in the training dataset. The click probability of document d_i for a given query is calculated as

P(c_i = 1 | r_i, ȳ) = r_i · (1/rank(d_i | ȳ))^η,

where c_i and r_i represent whether a document is clicked and relevant respectively, rank(d_i | ȳ) denotes the rank of document d_i for the query when the user is presented the ranking ȳ, and η indicates the severity of position bias. Note that clicks are not generated for documents below a certain rank cutoff, which incorporates the selection bias.

In a single pass over the entire training data we generate clicks following the above click probability; we refer to this as one sampling pass. For both the smaller set 2 and the larger set 1, we generate clicks over multiple sampling passes. This click-generation process reflects a common user behavior where some relevant documents do not receive any clicks, and other relevant documents receive multiple clicks. The process as described captures the generation of noiseless clicks, where users only click on relevant documents. We also consider a click-generation process with noisy clicks, in which a small percentage of clicks occur on irrelevant documents.
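The click-generation process above can be sketched as follows; the function name and interface are ours, assuming binary relevance and the (1/rank)^η observation model:

```python
import random

def simulate_clicks(relevance, eta, cutoff, noise=0.0, seed=0):
    """Generate clicks for one ranked list (one sampling pass).

    relevance : list of 0/1 labels in ranked order (index 0 = rank 1)
    eta       : severity of position bias
    cutoff    : documents ranked below it are never shown (selection bias)
    noise     : probability that an examined irrelevant document is clicked
    """
    rng = random.Random(seed)
    clicks = []
    for i, rel in enumerate(relevance):
        rank = i + 1
        if rank > cutoff:               # selection bias: never observed
            clicks.append(0)
            continue
        p_obs = (1.0 / rank) ** eta     # position bias: observation propensity
        p_click = p_obs * (1.0 if rel else noise)
        clicks.append(1 if rng.random() < p_click else 0)
    return clicks
```

With eta = 0 and noise = 0, every shown relevant document is clicked and everything below the cutoff receives no clicks, which is exactly the pure selection bias regime.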

5.1.3. Evaluation

We explore the performance of the Naive and Propensity rankers and our Heckman-based method, along with the two ensemble methods Linear Combination (CombinedW) and Rank Aggregation (RankAgg), with two different metrics: the Average Rank of Relevant Results (ARRR) and the Normalized Discounted Cumulative Gain,

nDCG@k = DCG@k / IDCG@k,

where k is the rank position up to which we evaluate, DCG@k represents the discounted cumulative gain of the given ranking, and IDCG@k refers to the ideal discounted cumulative gain. With binary relevance labels rel_i, we can compute DCG@k = Σ_{i=1}^{k} rel_i / log_2(i+1), and IDCG@k is computed with the same formula over the relevant documents (ordered by their relevance) up to position k for a given query. In our evaluation we chose k = 10 for the nDCG metric, and we refer to it as nDCG@10.
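As a concrete sketch, the two evaluation metrics can be computed as follows for a single query's ranked list, assuming binary relevance labels as in our setup:

```python
import math

def average_rank_of_relevant(ranking_rel):
    """ARRR for one query: mean rank position of the relevant documents.
    ranking_rel: list of 0/1 relevance labels in ranked order."""
    ranks = [i + 1 for i, rel in enumerate(ranking_rel) if rel]
    return sum(ranks) / len(ranks)

def ndcg_at_k(ranking_rel, k=10):
    """nDCG@k with binary gains: DCG@k divided by the ideal DCG@k."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranking_rel[:k]))
    ideal = sorted(ranking_rel, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

For the overall score, both metrics are averaged across queries; lower is better for ARRR and higher is better for nDCG@10.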

Each figure in the experiments depicts how the ARRR or nDCG@10 (y axis) changes when the user only observes the first k documents (x axis). Note that k reflects the severity of selection bias, as we model selection bias by assigning a zero observation probability to documents below cutoff k. In contrast, position bias is modeled by assigning a non-zero observation probability (1/rank)^η to every shown document, where η represents the severity of the position bias. We vary the severity of both selection bias and position bias, with and without noise in click generation.

parameter | values | description | section
k | 1-30 | number of observed docs (selection bias) | 5.2
η | 0, 0.5, 1, 1.5, 2 | position bias severity | 5.2.1
noise | 0%, 10%, 20%, 30% | clicks on irrelevant docs | 5.2.2, 5.2.3
Table 1. Experimental Parameters
Figure 9. The performance of LTR algorithms on set 2; panels (a)-(d) show ARRR and panels (e)-(h) show nDCG@10 for increasing position bias. Lower is better for ARRR, higher is better for nDCG@10.

When training the Propensity ranker, we apply an Inverse Propensity Score for clicked documents, following joachims-wsdm17: each clicked document is weighted by 1/Q(o_i = 1 | ȳ), where o_i and r_i represent whether a document is observed and relevant respectively, and

Q(o_i = 1 | ȳ) = (1/rank(d_i | ȳ))^η

is the propensity score denoting the marginal probability of observing the relevance of result d_i for the query when the user is presented the ranking ȳ, with η indicating the severity of position bias.
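A minimal sketch of the inverse propensity weight under this position bias model (the function name is ours):

```python
def ipw_weight(rank, eta):
    """Inverse propensity weight for a clicked document shown at `rank`.

    Under the position bias model, the observation propensity is
    Q(o = 1) = (1/rank)**eta, so a clicked document contributes to the
    training loss with weight 1/Q = rank**eta."""
    propensity = (1.0 / rank) ** eta
    return 1.0 / propensity
```

Clicks on low-ranked documents are thus up-weighted, compensating for their smaller chance of being observed; at eta = 0 all weights reduce to 1.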

Our Heckman-based method is implemented following the steps described in Section 4. In step 1, the documents that appear among the shown results for each query are considered observed (1), and the remainder not-observed (0). It is important to note that other LTR algorithms throw away the not-observed documents in training, while we do not. Our implementation only includes the common feature set. For the ensemble methods, the selection bias recovery algorithm is our Heckman-based method and the position bias recovery algorithm is the Propensity ranker.
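For intuition, a rough sketch of a Heckman-style two-stage correction is shown below. This is not the exact implementation of Section 4: a logistic model stands in for the probit selection equation, and the data interface is hypothetical:

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LinearRegression, LogisticRegression

def heckman_two_stage(X, shown, clicks):
    """Two-stage selection bias correction, sketched.

    X      : feature matrix for all (query, document) pairs
    shown  : 1 if the document was among the shown results, else 0
    clicks : click labels, meaningful only where shown == 1
    """
    # Stage 1 (selection equation): model the probability that a document
    # is shown. Heckman's method uses a probit; a logistic fit stands in here.
    sel = LogisticRegression().fit(X, shown)
    score = X @ sel.coef_.ravel() + sel.intercept_[0]
    # Inverse Mills ratio: the Heckman correction term.
    imr = norm.pdf(score) / np.clip(norm.cdf(score), 1e-9, None)
    # Stage 2 (outcome equation): regress clicks on features over the shown
    # documents only, with the correction term as an extra regressor.
    mask = shown == 1
    out = LinearRegression().fit(
        np.column_stack([X[mask], imr[mask]]), clicks[mask])
    return sel, out, imr
```

The learned outcome model can then score and rank all candidate documents, including those that were never shown.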

Given the model learned during training, each algorithm ranks the documents in the test set. In the following subsections, we evaluate each algorithm's performance under different scenarios. For evaluation, the (noiseless) clicked documents in the test set are considered to be relevant documents, and the average rank of relevant results (ARRR) across queries, along with nDCG@10, are our metrics for evaluating each algorithm's performance.

5.2. Experimental results on set 2

Here, we evaluate the performance of each algorithm under different levels of position bias (η) and when clicks are noiseless or noisy (with the noise levels listed in Table 1). In each case, we use ARRR and nDCG@10.

5.2.1. Effect of position bias

Figure 9 illustrates the performance of all LTR algorithms and ensembles for varying degrees of position bias η. Panels (a)-(d) show the performance as ARRR; panels (e)-(h) show nDCG@10. Due to space, we omit the figures for the remaining values of η, since the shown panels already capture the trend of the Propensity ranker starting to work better than the other methods; our Heckman-based method suffers when there is severe position bias.

Panels (a)-(d) of Figure 9 illustrate that our Heckman-based method outperforms the Propensity ranker in the absence of position bias (η = 0), as well as when position bias is low or moderate. This advantage vanishes with increased position bias: at a high position bias level, the Heckman-based method falls behind the Propensity ranker, but still outperforms the Naive ranker. The reason for this is that a high position bias results in a high click frequency for top-ranked documents, leaving low-ranked documents with a very small chance of being clicked. Our Heckman-based method is designed to control for the probability of a document being observed; if top-ranked documents have a disproportionately higher density in the click data relative to low-ranked documents, then the predicted probabilities will also reflect this imbalance. In terms of algorithms that address both position bias and selection bias, panels (a) and (b) show that under low position bias, CombinedW and RankAgg outperform both the Propensity ranker and the Heckman-based method for small values of k, while panels (c) and (d) show that under higher position bias, RankAgg outperforms its component algorithms for almost all values of k.

Figure 18. The performance of LTR algorithms on set 2 under noisy clicks; panels (a)-(d) show ARRR and panels (e)-(h) show nDCG@10 for different position bias levels.

When nDCG@10 is the metric of interest, panels (e)-(h) of Figure 9 illustrate that our Heckman-based method outperforms the Propensity ranker in the absence of position bias (η = 0) and when position bias is low to moderate, while it falls behind the Propensity ranker as position bias increases. Compared to the ARRR results in panels (a)-(d), the Heckman-based method appears to start lagging behind at a lower level of position bias. For the ensemble methods, panels (e) and (f) illustrate that under low position bias, CombinedW and RankAgg outperform their component algorithms for small values of k. Panel (g) demonstrates that under moderate position bias, RankAgg performs better than its component algorithms for all values of k. However, for a severe position bias, CombinedW and RankAgg do not outperform their component algorithms for any value of k, though RankAgg becomes the second best algorithm. Among the ensemble methods, RankAgg is more robust to position bias than CombinedW.

Our main takeaways from this experiment are:

  • Under small to no position bias, our Heckman-based method outperforms the Propensity ranker for both metrics.

  • Under moderate position bias, while the Heckman-based method outperforms the Propensity ranker for ARRR, it lags behind it for nDCG@10.

  • Under severe position bias, the Heckman-based method falls behind the Propensity ranker for both ARRR and nDCG@10.

  • RankAgg performs better than CombinedW for all selection bias levels, and it is more robust to position bias. The ensembles surpass the Heckman-based method under severe selection bias (small k).

5.2.2. Effect of click noise

Thus far, we have considered noiseless clicks that are generated only over relevant documents. However, this is not a realistic assumption, as users may also click on irrelevant documents. We now relax this assumption and allow a fixed percentage of the clicked documents to be irrelevant.

When ARRR is the preferred metric, panels (a)-(d) of Figure 18 illustrate that our Heckman-based method outperforms the Propensity ranker for low to moderate position bias, while under a higher position bias level it falls behind the Propensity ranker. Comparing the noisy click performance to the noiseless one (Figure 9), one can conclude that the Propensity ranker is highly affected by noise, while the Heckman-based method is much less affected. For example, under moderate position bias, the advantage of the Heckman-based method over the Propensity ranker is much more noticeable than in the noiseless case. Interestingly, neither ensemble, RankAgg nor CombinedW, outperforms the most successful algorithm in the presence of noisy clicks.

When nDCG@10 is the preferred metric, one can draw the same conclusions: while our Heckman-based method is more robust to noise and outperforms the Propensity ranker for low to moderate position bias, it fails to beat the Propensity ranker under severe position bias. Another interesting point is that the Propensity ranker is severely affected by noise when selection bias is high (low values of k), to the point that it even falls behind the Naive ranker. This exemplifies how much selection bias can degrade the performance of LTR systems that do not correct for it.

Figure 21. Effect of noisy clicks for high selection bias and moderate position bias; panels show ARRR and nDCG@10.
Figure 24. Effect of noisy clicks for high selection bias and high position bias; panels show ARRR and nDCG@10.
Figure 29. The performance (ARRR) of LTR algorithms on set 1 for different position bias levels.

In the presence of noisy clicks, the main takeaways are:

  • Under severe to moderate selection bias, the Propensity ranker suffers greatly from noise and even falls behind the Naive ranker for both ARRR and nDCG@10.

  • Our Heckman-based method outperforms the Propensity ranker for both metrics when position bias is not severe.

  • Just like in the noiseless case, the Heckman-based method cannot surpass the Propensity ranker under severe position bias.

  • RankAgg and CombinedW surpass the Heckman-based method for severe selection bias (small k) for both ARRR and nDCG@10. However, they cannot beat the Propensity ranker under high position bias.

5.2.3. Effect of varying the noise level

In this section, we investigate whether our proposed models are robust to noise. Toward this goal, we varied the noise level from 0% to 30%. Figure 21 shows the performance of the LTR algorithms for different levels of noise, under high selection bias and moderate position bias. Under increasing noise, the performance of our Heckman-based method is relatively stable and even improves, while the performance of all other LTR algorithms degrades. Even the Naive ranker is more robust to noise than the Propensity ranker, which differs from the results of joachims-wsdm17, where no selection bias was considered. The reason could be that their evaluation assumes that all documents have a non-zero probability of being observed, while Figure 21 is generated under the condition that documents ranked below a certain cutoff k have a zero probability of being observed.

We also investigate the performance of LTR algorithms with respect to noise when position bias is severe. As shown in Figure 24, irrespective of the metric of interest, our Heckman-based method is robust to varying noise, while the performance of all other algorithms degrades as the noise level increases. The Propensity ranker falls behind all other algorithms at high noise levels. This implies that even though the Heckman-based method cannot surpass the Propensity ranker under severe position bias in noiseless environments, it clearly outperforms it in the presence of selection bias with noise. This is an extremely useful property, since in real-world applications we cannot assume a noiseless environment.

5.3. Experimental results on set 1

To confirm the performance of our proposed methods on the larger set 1 with out-of-sample test data, we ran experiments varying position bias under noiseless clicks. The results on this dataset were even more promising, especially for high position bias. Figure 29 illustrates the ARRR performance of all algorithms. Our Heckman-based method outperforms the Propensity ranker for all position bias levels, though its advantage decreases with increasing position bias. This is unlike set 2, where it did not outperform the Propensity ranker under high position bias. The ensemble RankAgg outperforms both the Heckman-based method and the Propensity ranker for all position and selection bias levels, while CombinedW outperforms the Propensity ranker but does not surpass RankAgg. Moreover, the stronger performance of the ensembles over their components is much more pronounced compared to set 2.

6. Conclusion

In this work, we formalized the problem of selection bias in learning-to-rank systems and proposed an adaptation of Heckman's two-stage method as an approach for correcting for selection bias. We also presented two ensemble methods that correct for both selection and position bias by combining the rankings of our Heckman-based method and the Propensity ranker. Our extensive experiments on semi-synthetic datasets show that selection bias affects the performance of LTR systems, and that our method performs better than existing approaches that correct for position bias but do not address selection bias. Nonetheless, this advantage decreases as position bias increases. At the same time, our method is more robust to noisy clicks even under severe position bias, while the Propensity ranker is adversely affected by noisy clicks in the presence of selection bias and even falls behind the Naive ranker. The ensemble methods, CombinedW and RankAgg, outperform the Heckman-based method for severe selection bias and zero to small position bias.

Our initial study of selection bias suggests a number of promising future avenues for research. For example, our initial work considers only linear models but a Heckman-based solution to selection bias can be adapted to non-linear algorithms as well, including extensions that consider bias correction mechanisms specific to each learning-to-rank algorithm. Our experiments suggest that studying correction methods that jointly account for position bias and selection bias can potentially address the limitations of methods that only account for one. Finally, even though we specifically studied selection bias in the context of learning-to-rank systems, we expect that our methodology will have broader applications beyond LTR systems.