
Learning More From Less: Towards Strengthening Weak Supervision for Ad-Hoc Retrieval

The limited availability of ground truth relevance labels has been a major impediment to the application of supervised methods to ad-hoc retrieval. As a result, unsupervised scoring methods, such as BM25, remain strong competitors to deep learning techniques which have brought on dramatic improvements in other domains, such as computer vision and natural language processing. Recent works have shown that it is possible to take advantage of the performance of these unsupervised methods to generate training data for learning-to-rank models. The key limitation to this line of work is the size of the training set required to surpass the performance of the original unsupervised method, which can be as large as 10^13 training examples. Building on these insights, we propose two methods to reduce the amount of training data required. The first method takes inspiration from crowdsourcing, and leverages multiple unsupervised rankers to generate soft, or noise-aware, training labels. The second identifies harmful, or mislabeled, training examples and removes them from the training set. We show that our methods allow us to surpass the performance of the unsupervised baseline with far fewer training examples than previous works.





1. Introduction

Classical ad-hoc retrieval methods have relied primarily on unsupervised signals such as BM25, TF-IDF, and PageRank as inputs to learning-to-rank (LTR) models. Supervision for these models is often supplied in the form of click-stream logs or hand-curated rankings, both of which come with their own issues and limitations. First, both sources are typically limited in availability and are often proprietary company resources. Second, click-stream data is typically biased towards the first few elements in the ranking presented to the user (Ai et al., 2018) and is noisy in general. Finally, such logs are only available after the fact, leading to a cold start problem. These issues motivate the search for an alternate source of “ground truth” ranked lists on which to train our LTR model.

In (Dehghani et al., 2017c), Dehghani et al. show that the output of an unsupervised document retrieval method can be used to train a supervised ranking model that outperforms the original unsupervised ranker. More recently, (Nie et al., 2018) proposed a hierarchical interaction based model that is trained on a similarly generated training set. These works have shown the potential of leveraging unsupervised methods as sources of weak supervision for the retrieval task. However, they require training on as many as $10^{13}$ training examples to surpass the performance of the unsupervised baseline (Dehghani et al., 2017c; Nie et al., 2018).

In this work, we substantially reduce this number by making more effective use of the generated training data. We present two methods that make improvements in this direction, and beat the unsupervised method using fewer than 10% of the training rankings compared to previous techniques.

In the first method, we take a crowdsourcing approach and collect the output of multiple unsupervised retrieval models. Following (Ratner et al., 2017), we learn a joint distribution over the outputs of said retrieval models and generate a new training set of soft labels. We call this the noise-aware model. The noise-aware model does not require access to any gold labels. (To differentiate them from labels originating from weak supervision sources, we refer to relevance scores assigned by a human as “gold” labels.)

Our second method builds on the idea of dataset debugging and identifies training examples with the most harmful influence (Koh and Liang, 2017) (the labels most likely to be incorrect) and drops them from the training set. We call this the influence-aware model.

2. Related Work

Much of the prior work in handling noisy datasets has been in the context of learning a classifier from noisy labels. In the binary classification context, noise is seen as a class-conditional probability that an observed label is the opposite of the true label (Jiang et al., 2017; Northcutt et al., 2017). In the ranking context, we typically expect that models trained using pairwise or listwise loss functions will far outperform pointwise approaches (Liu, 2009). Since the label of a pair is determined by the ordering of the documents within the pair, it is not immediately obvious how the class-conditional flip probabilities translate to this formulation. The relationship to listwise objectives is not straightforward either.

In (Dehghani et al., 2017a) and (Dehghani et al., 2017b), Dehghani et al. introduce two semi-supervised student-teacher models where the teacher weights the contribution of each sample in the student model’s training loss based on its confidence in the quality of the label. They train the teacher on a small subset of gold labels and use the model’s output as confidence weights for the student model. In (Dehghani et al., 2017a), they show that this approach can beat the unsupervised ranker using ~75% of the data required when training directly on the noisy data. They train a cluster of 50 Gaussian processes to form the teacher annotations, which are used to generate soft labels to fine-tune the student model.

In (Ratner et al., 2017), Ratner et al. transform a set of weak supervision sources, that may disagree with each other, into soft labels used to train a discriminative model. They show experimentally that this approach outperforms the naïve majority voting strategy for generating the target labels. This inspires our noise-aware approach.

In (Koh and Liang, 2017), Koh and Liang apply classical results from regression analysis to approximate the change in loss at a test point caused by removing a specific point from the training set. They show experimentally that their method approximates this change in loss well, even for highly non-linear models, such as GoogLeNet. They also apply their method to prioritize training examples to check for labeling errors. Our influence-aware approach uses influence functions (Koh and Liang, 2017) to identify mislabeled training examples.

3. Proposed Methods

3.1. Model Architecture

In this work, we only explore pairwise loss functions since they typically lead to better performing models than the pointwise approach. Listwise approaches, although typically the most effective, tend to have high training and inference time computational complexity due to their inherently permutation based formulations (Liu, 2009).

We consider a slight variant of the Rank model proposed in (Dehghani et al., 2017c) as our baseline model. We represent the tokens in the query as $t^q_j$ and the tokens in the document as $t^d_j$. We embed these tokens in a low dimensional space with a mapping $E: V \to \mathbb{R}^l$ where $V$ is the vocabulary and $l$ is the embedding dimension. We also learn token dependent weights $W: V \to \mathbb{R}$. Our final representation for a query is a weighted sum of the word embeddings: $v_q = \sum_j \tilde{W}_q(t^q_j)\, E(t^q_j)$, where $\tilde{W}_q$ indicates that the weights are normalized to sum to 1 across tokens in the query $q$ using a softmax operation. The vector representation for documents is defined similarly.

In addition, we take the difference and elementwise products of the document and query vectors and concatenate them into a single vector $v_{q,d} = [v_q, v_d, v_q - v_d, v_q \odot v_d]$. We compute the relevance score of a document, $d$, to a query, $q$, by passing $v_{q,d}$ through a feed-forward network with ReLU activations and scalar output. We use a $\tanh$ at the output of the rank model and use the raw logit scores otherwise. We represent the output of our model parameterized by $\theta$ as $f(x; \theta)$.
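The representation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `embed_text` and `pair_features`, the dictionary-based lookup tables, the inclusion of the raw query and document vectors in the concatenation, and the query-minus-document order of the difference are all choices made for the example, not details confirmed by the text.

```python
import math

def softmax(xs):
    """Normalise raw weights so they are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def embed_text(tokens, embeddings, token_weights):
    """Weighted sum of token embeddings for a non-empty token list.

    embeddings: token -> list of floats (hypothetical lookup table)
    token_weights: token -> raw scalar weight (hypothetical lookup table)
    """
    weights = softmax([token_weights[t] for t in tokens])
    dim = len(embeddings[tokens[0]])
    vec = [0.0] * dim
    for w, t in zip(weights, tokens):
        for k in range(dim):
            vec[k] += w * embeddings[t][k]
    return vec

def pair_features(v_q, v_d):
    """Concatenate query, document, difference and elementwise product.

    Assumption: the raw vectors are kept alongside the interaction terms.
    """
    diff = [a - b for a, b in zip(v_q, v_d)]
    prod = [a * b for a, b in zip(v_q, v_d)]
    return list(v_q) + list(v_d) + diff + prod
```

The resulting feature vector would then be fed to the feed-forward scorer; with equal raw weights the softmax reduces the weighted sum to a plain average of the token embeddings.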

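Our training set $Z$ is a set of tuples $z = (q, d_1, d_2, s_{q,d_1}, s_{q,d_2})$ where $s_{q,d_i}$ is the relevance score of $d_i$ to $q$ given by the unsupervised ranker. The pairwise objective function we minimize is given by:

$$\mathcal{L}(Z; \theta) = \sum_{z \in Z} L\big(f(v_{q,d_1}; \theta) - f(v_{q,d_2}; \theta),\ \mathrm{rel}_{q,(d_1,d_2)}\big) \tag{1}$$

$$L_{ce}(x, y) = -\big[\, y \log \sigma(x) + (1 - y) \log(1 - \sigma(x)) \,\big] \tag{2}$$

$$L_{hinge}(x, y) = \max\{0,\ \epsilon - y\, x\} \tag{3}$$

where $\mathrm{rel}_{q,(d_1,d_2)}$ gives the relative relevance of $d_1$ and $d_2$ to $q$, $\sigma$ is the logistic sigmoid and $\epsilon$ is the margin. $L$ is either $L_{ce}$ or $L_{hinge}$ for cross-entropy or hinge loss, respectively. The key difference between the rank and noise-aware models is how $\mathrm{rel}_{q,(d_1,d_2)}$ is determined. As in (Dehghani et al., 2017c), we train the rank model by minimizing the max-margin loss and compute $\mathrm{rel}_{q,(d_1,d_2)}$ as $\mathrm{sign}(s_{q,d_1} - s_{q,d_2})$.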
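To make the two pairwise losses concrete, the following is a minimal sketch of both, applied to the score difference between the two documents of a pair. It assumes the standard forms of the binary cross-entropy (with a soft target in [0, 1]) and the max-margin loss (with a target of +1 or -1); the function names and the numerically stable log-sigmoid helper are choices made for this example.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def cross_entropy_loss(score_diff, rel):
    """Pairwise cross-entropy.

    score_diff: model score of d1 minus model score of d2
    rel: target in [0, 1], the (soft) probability that d1 is more relevant
    """
    # log(1 - sigmoid(x)) equals log(sigmoid(-x))
    return -(rel * log_sigmoid(score_diff) + (1.0 - rel) * log_sigmoid(-score_diff))

def hinge_loss(score_diff, rel_sign, margin=0.1):
    """Pairwise max-margin loss.

    rel_sign: +1 if the weak ranker prefers d1, -1 if it prefers d2
    margin: 0.1 is the value reported in the text for the rank model
    """
    return max(0.0, margin - rel_sign * score_diff)
```

For instance, with a margin of 0.1 a correctly ordered pair whose score difference is 0.2 already incurs zero hinge loss, whereas a margin of 1.0 would still penalise it by 0.8. The cross-entropy, by contrast, is minimised when the predicted probability matches the soft target.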

Despite the results in (Zamani and Croft, 2018) showing that the max-margin loss exhibits stronger empirical risk guarantees for ranking tasks using noisy training data, we minimize the cross-entropy loss in each of our proposed models for the following reasons. In the case of the noise-aware model, each of our soft training labels is a distribution over {0, 1}, so we seek to learn a calibrated model rather than one which maximizes the margin (as would be achieved using a hinge loss objective). For the influence-aware model, we minimize the cross-entropy rather than the hinge loss since the method of influence functions relies on having a twice differentiable objective.

3.2. Noise-aware model

In this approach, $\mathrm{rel}_{q,(d_1,d_2)}$ are soft relevance labels. For each of the queries in the training set, we rank the top documents by relevance using $k$ unsupervised rankers. Considering ordered pairs of these documents, each ranker gives a value of $+1$ if it agrees with the order, $-1$ if it disagrees and $0$ if neither document appears in the top 10 positions of the ranking. We collect these values into a matrix $\Lambda \in \{-1, 0, +1\}^{m \times k}$ for $m$ document pairs. The joint distribution over these pairwise preferences and the true pairwise orderings $y$ is given by:

$$P_w(\Lambda, y) = \frac{1}{Z(w)} \exp\Big(\sum_{i=1}^{m} w^\top \phi(\Lambda_i, y_i)\Big) \tag{4}$$

where $w$ is a vector of learned parameters and $Z(w)$ is the partition function. A natural choice for $\phi$ is to model the accuracy of each individual ranker in addition to the pairwise correlations between each of the rankers. So for the $i$-th document pair, we have the following expression for $\phi(\Lambda_i, y_i)$:

$$\phi(\Lambda_i, y_i) = \big[\ \{\mathbb{1}[\Lambda_{ij} = y_i]\}_{1 \le j \le k}\ \Vert\ \{\mathbb{1}[\Lambda_{ij} = \Lambda_{il}]\}_{j \ne l}\ \big]$$

Since the true relevance preferences are unknown, we treat them as latent. We learn the parameters for this model without any gold relevance labels by maximizing the marginal likelihood (as in (Ratner et al., 2017)) given by:

$$\max_{w}\ \log \sum_{y} P_w(\Lambda, y) \tag{5}$$

We use the Snorkel library to optimize Equation (5) by stochastic gradient descent, where we perform Gibbs sampling to estimate the gradient at each step. Once we have determined the parameters of the model, we can evaluate the posterior probabilities $P_w(y_i \mid \Lambda_i)$, which we use as our soft training labels.
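As a rough illustration of the labeling step, the sketch below builds one row of the vote matrix for a document pair and turns it into a soft label. It deliberately does not use Snorkel: the accuracy weights are hypothetical fixed numbers rather than learned parameters, and the posterior is the simplified form obtained when only per-ranker accuracy factors depend on the latent ordering (factors that compare two rankers' votes do not involve the latent label, so they cancel once the votes are observed). The treatment of pairs where only one document is in the top 10 is also an assumption.

```python
import math

def pair_vote(ranking, d1, d2, top_k=10):
    """Vote of one ranker on the ordered pair (d1, d2).

    Returns +1 if the ranker places d1 above d2, -1 if it places d2 above d1,
    and 0 if neither document is in its top_k. Assumption: a document inside
    the top_k is treated as ranked above one that is outside it.
    """
    pos = {doc: i for i, doc in enumerate(ranking[:top_k])}
    in1, in2 = d1 in pos, d2 in pos
    if not in1 and not in2:
        return 0
    if in1 and in2:
        return 1 if pos[d1] < pos[d2] else -1
    return 1 if in1 else -1

def soft_label(votes, accuracy_weights):
    """Posterior probability that d1 is truly more relevant than d2.

    votes: one entry per ranker, each in {-1, 0, +1}
    accuracy_weights: hypothetical per-ranker weights (learned in the paper)
    """
    score = sum(w * v for w, v in zip(accuracy_weights, votes))
    return 1.0 / (1.0 + math.exp(-score))
```

With equal weights, two rankers that disagree and one that abstains give a label of 0.5, while unanimous agreement pushes the label towards 1. This is the behaviour that lets an uncertain pair contribute less to the cross-entropy objective than a confidently ordered one.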

3.3. Influence Aware Model

In this approach, we identify training examples that hurt the generalization performance of the trained model. We expect that many of these will be incorrectly labeled, and that our model will perform better if we drop them from the training set. The influence of removing a training example $z_i$ on the trained model’s loss at a test point $z_{test}$ is computed as (Koh and Liang, 2017):

$$\mathcal{I}(z_i, z_{test}) = \frac{1}{n}\, \nabla_\theta L(z_{test}; \theta)^\top H_\theta^{-1}\, \nabla_\theta L(z_i; \theta) \tag{7}$$

where $H_\theta$ is the Hessian of the objective function and $n$ is the number of training examples. If $\mathcal{I}(z_i, z_{test})$ is negative, then $z_i$ is a harmful training example for $z_{test}$ since its inclusion in the training set causes an increase in the loss at that point. Summing this value over the entire test set gives us $\mathcal{I}(z_i)$. We compute $\mathcal{I}(z_i)$ for each training example $z_i$, expecting it to represent $z_i$’s impact on the model’s performance at test time. In our setup, we know that some of our training examples are mislabeled; we expect that these points will have a large negative value for $\mathcal{I}(z_i)$. Of course, for a fair evaluation, the $z_{test}$ points are taken from the development set used for hyperparameter tuning (see the section on data preprocessing and model training).

We address the computational constraints of computing Equation (7) by treating our trained model as a logistic regression on the bottleneck features. We freeze all model parameters except the last layer of the feed-forward network and compute the gradient with respect to these parameters only. These gradients can be computed in closed form in an easily parallelizable way, allowing us to avoid techniques that rely on autodifferentiation operations (Pearlmutter, 1994). We compute $H_\theta^{-1} \nabla_\theta L(z_{test}; \theta)$ for every $z_{test}$ using the method of conjugate gradients following (Shewchuk, 1994). We also add a small damping term to the diagonal of the Hessian to ensure that it is positive definite (Martens, 2010).
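The last-layer approximation can be illustrated with a small, dependency-free sketch. It treats each training pair as a feature vector with a target in [0, 1] and a logistic-regression parameter vector `theta`, uses closed-form gradients, and solves the damped Hessian system with conjugate gradients. The function names, the damping value of 0.01, and the choice to solve once for the summed development-set gradient (equivalent by linearity to solving per point and summing) are assumptions made for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def grad_loss(x, y, theta):
    # Closed-form cross-entropy gradient for logistic regression.
    p = sigmoid(dot(theta, x))
    return [(p - y) * xk for xk in x]

def conjugate_gradient(matvec, b, max_iter=100, tol=1e-12):
    # Solve A s = b for symmetric positive definite A, given only v -> A v.
    s = [0.0] * len(b)
    r = list(b)
    p = list(b)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        s = [si + alpha * pi for si, pi in zip(s, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return s

def influence_scores(train, dev, theta, damping=0.01):
    """Summed influence of each training example on the dev-set loss.

    train, dev: lists of (features, target) pairs; negative scores flag
    examples whose presence is estimated to increase the dev loss.
    """
    n = len(train)
    probs = [sigmoid(dot(theta, x)) for x, _ in train]

    def hessian_vector_product(v):
        # (mean of p(1-p) x x^T + damping * I) applied to v, without forming H.
        out = [damping * vk for vk in v]
        for (x, _), p in zip(train, probs):
            c = p * (1.0 - p) * dot(x, v) / n
            out = [ok + c * xk for ok, xk in zip(out, x)]
        return out

    g_dev = [0.0] * len(theta)
    for x, y in dev:
        g_dev = [a + b for a, b in zip(g_dev, grad_loss(x, y, theta))]
    s_dev = conjugate_gradient(hessian_vector_product, g_dev)
    return [dot(s_dev, grad_loss(x, y, theta)) / n for x, y in train]
```

In a toy setting with two identical feature vectors, one labeled consistently with the development point and one labeled the opposite way, the consistent example receives a positive score and the contradictory one a negative score, which is the signal used to decide which examples to drop.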

4. Data Preprocessing and Model Training

We evaluate the application of our methods to ad-hoc retrieval on the Robust04 corpus with the associated test queries and relevance labels. As in (Dehghani et al., 2017c), our training data comes from the AOL query logs (Pass et al., 2006) on which we perform similar preprocessing. We use the Indri search engine to conduct indexing and retrieval and use the default parameters for the query likelihood (QL) retrieval model (Ponte and Croft, 1998) which we use as the weak supervision source. We fetch only the top 10 documents from each ranking, in comparison to previous works which trained on as many as the top 1000 documents for each query. To compensate for this difference, we randomly sample additional documents from the rest of the corpus for each of these 10 documents. We train our model on a random subset of 100k rankings generated by this process. This is fewer than 10% of the number of rankings used in previous works (Nie et al., 2018; Dehghani et al., 2017c), and each of our rankings contains far fewer document pairs.

For the word embedding representations, $E$, we use the 840B.300d GloVe (Pennington et al., 2015) pretrained word embedding set. The feed-forward network hidden layer sizes are chosen from {512, 256, 128, 64} with up to 5 layers. We use the first 50 queries in the Robust04 dataset as our development set for hyperparameter selection, computation of the influence scores, and early stopping. The remaining 200 queries are used for evaluation.

During inference, we rank documents by the output of the feed-forward network. Since it is not feasible to rank all the documents in the corpus, we fetch the top 100 documents using the QL retrieval model and then rerank using the trained model’s scores.
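The two-stage inference procedure amounts to truncating the first-stage ranking and re-sorting the survivors by model score. A minimal sketch, assuming a hypothetical `score_fn` that maps a document identifier to the trained model's score for the current query:

```python
def rerank(first_stage, score_fn, depth=100):
    """Rerank the top `depth` first-stage results by model score.

    first_stage: document ids in first-stage (e.g. QL) order, best first
    score_fn: hypothetical callable, document id -> model relevance score
    Ties keep their first-stage order because Python's sort is stable.
    """
    candidates = first_stage[:depth]
    return sorted(candidates, key=score_fn, reverse=True)
```

Documents outside the first-stage cutoff can never be returned, so the reranker can only reorder what the first stage retrieves.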

4.1. Model Specific Details

For the noise-aware model, we generate separate rankings for each query using the following retrieval methods: Okapi BM25, TF-IDF, QL, and QL+RM3 (Abdul-Jaleel et al., 2004), all run in Indri with the default parameters.

For the influence-aware model, we train the model once on the full dataset and then compute $\mathcal{I}(z_i)$ for each training point $z_i$, dropping all training examples with a negative value for $\mathcal{I}(z_i)$, which we find to typically be around half of the original training set. We then retrain the model on this subset.

Interestingly, we find that using a smaller margin, $\epsilon$, in the training loss of the rank model leads to improved performance. Using a smaller margin incurs 0 loss for a smaller difference in the model’s relative preference between the two documents. Intuitively, this allows for less overfitting to the noisy data. We use a margin of 0.1 chosen by cross-validation.

The noise-aware and influence-aware models train end-to-end in around 12 and 15 hours respectively on a single NVIDIA Titan Xp.

5. Experimental Results

We compare our two methods against two baselines, the unsupervised ranker (QL) and the rank model. Compared to the other unsupervised rankers (see section 4.1) used as input to the noise-aware model, the QL ranker performs the best on all metrics. Training the rank model on the results of the majority vote of the set of unsupervised rankers used for the noise-aware model performed very similarly to the rank model, so we only report results of the rank model. We also compare the results after smoothing with the normalized QL document scores by linear interpolation.
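The smoothing step can be sketched as follows. The text does not specify how the QL scores are normalized or what interpolation weight is used, so min-max normalization over the candidate list and the weight `alpha` are assumptions made purely for illustration.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def interpolate(model_scores, ql_scores, alpha=0.5):
    """Linear interpolation of model scores with normalized QL scores.

    alpha: hypothetical weight on the model score (not given in the text)
    """
    ql_norm = min_max_normalize(ql_scores)
    return [alpha * m + (1.0 - alpha) * q for m, q in zip(model_scores, ql_norm)]
```

Setting `alpha` to 1 recovers the unsmoothed model ranking, and setting it to 0 recovers the QL ranking, so the weight controls how far the reranker is allowed to depart from its weak supervision source.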

Rank Model
NDCG@10    0.3881    0.3952    0.4008    0.3843
Prec@10    0.3535    0.3621    0.3657    0.3515
MAP        0.2675    0.2774    0.2792    0.2676
Table 1. Results comparison with smoothing.

           Rank Model    Noise-Aware    Influence-Aware
NDCG@10    0.2610        0.2886         0.2966
Prec@10    0.2399        0.2773         0.2742
MAP        0.1566        0.1831         0.1839
Table 2. Results comparison without smoothing.

Figure 1. Test NDCG@10 during training

The results in tables 1 and 2 show that the noise-aware and influence-aware models perform similarly, with both outperforming the unsupervised baseline. Bold items are the largest in their row and daggers indicate statistically significant improvements over the rank model at a level of 0.05 using Bonferroni correction. Figure 1 shows that the rank model quickly starts to overfit. This does not contradict the results in (Dehghani et al., 2017c) since in our setup we train on far fewer pairs of documents for each query, so each relevance label error has much greater impact. For each query, our distribution over documents is uniform outside the results from the weak supervision source, so we expect to perform worse than if we had a more faithful relevance distribution. Our proposed approaches use an improved estimate of the relevance distribution at the most important positions in the ranking, allowing them to perform well.

We now present two representative training examples showing how our methods overcome the limitations of the rank model.

Example (noise-aware model).

The method in section 3.2 used to create labels for the noise-aware model gives the following training example an unconfident label (~0.5) rather than a relevance label of 1 or 0: ($q$=“town of davie post office”, ($d_1$=FBIS3-25584, $d_2$=FT933-13328)) where $d_1$ is ranked above $d_2$. Both of these documents are about people named “Davie” rather than about a town or a post office, so it is reasonable to avoid specifying a hard label indicating which one is explicitly more relevant.

Example (influence-aware model).

One of the most harmful training points as determined by the method described in section 3.3 is the pair ($q$=“pictures of easter mice”, ($d_1$=FT932-15650, $d_2$=LA041590-0059)) where $d_1$ is ranked above $d_2$. $d_1$ discusses the computer input device and $d_2$ is about pictures that are reminiscent of the holiday. The incorrect relevance label explains why the method identifies this as a harmful training example.

6. Conclusions and Future Work

We have presented two approaches to reduce the amount of weak data needed to surpass the performance of the unsupervised method that generates the training data. The noise-aware model does not require ground truth labels, but has an additional data dependency on multiple unsupervised rankers. The influence-aware model requires a small set of gold labels and a second round of training, although empirically only around half the dataset is used when training the second time around.

Interesting paths for future work involve learning a better joint distribution for training the noise-aware model, or leveraging ideas from (Zamani et al., 2018) to construct soft training labels rather than using them for the query performance prediction task. Similarly, we could apply ideas from unsupervised learning-to-rank (Bhowmik and Ghosh, 2017) to form better noise-aware labels. For the influence-aware model, we could use the softrank loss (Baeza-Yates et al., 2007) rather than cross-entropy, and compute set influence rather than the influence of a single training example (Khanna et al., 2018).