1. Introduction
Classical ad-hoc retrieval methods have relied primarily on unsupervised signals such as BM25, TF-IDF, and PageRank as inputs to learning-to-rank (LTR) models. Supervision for these models is often supplied in the form of click-stream logs or hand-curated rankings, both of which come with their own limitations. First, both sources are typically limited in availability and are often proprietary company resources. Second, click-stream data is typically biased towards the first few elements in the ranking presented to the user (Ai et al., 2018) and is noisy in general. Finally, such logs are only available after the fact, leading to a cold-start problem. These issues motivate the search for an alternative source of “ground truth” ranked lists on which to train an LTR model.
In (Dehghani et al., 2017c), Dehghani et al. show that the output of an unsupervised document retrieval method can be used to train a supervised ranking model that outperforms the original unsupervised ranker. More recently, (Nie et al., 2018) proposed a hierarchical interaction-based model trained on a similarly generated training set. These works have shown the potential of leveraging unsupervised methods as sources of weak supervision for the retrieval task. However, they require training on a large number of examples to surpass the performance of the unsupervised baseline (Dehghani et al., 2017c; Nie et al., 2018).
In this work, we substantially reduce this number by making more effective use of the generated training data. We present two methods that make improvements in this direction and beat the unsupervised method using fewer than 10% of the training rankings required by previous techniques.
In the first method, we take a crowdsourcing-style approach and collect the output of multiple unsupervised retrieval models. Following (Ratner et al., 2017), we learn a joint distribution over the outputs of these retrieval models and generate a new training set of soft labels. We call this the noise-aware model. The noise-aware model does not require access to any gold labels (to differentiate them from labels originating from weak supervision sources, we refer to relevance scores assigned by a human as “gold” labels).

Our second method builds on the idea of dataset debugging: it identifies the training examples with the most harmful influence (Koh and Liang, 2017), i.e., the labels most likely to be incorrect, and drops them from the training set. We call this the influence-aware model.
2. Related Work
Much of the prior work on handling noisy datasets has been in the context of learning a classifier from noisy labels. In the binary classification setting, noise is modeled as a class-conditional probability that an observed label is the opposite of the true label
(Jiang et al., 2017; Northcutt et al., 2017). In the ranking context, we typically expect that models trained using pairwise or listwise loss functions will far outperform pointwise approaches
(Liu, 2009). Since the label of a pair is determined by the ordering of the documents within the pair, it is not immediately obvious how the class-conditional flip probabilities translate to this formulation. The relationship to listwise objectives is not straightforward either.

In (Dehghani et al., 2017a) and (Dehghani et al., 2017b), Dehghani et al. introduce two semi-supervised student-teacher models in which the teacher weights the contribution of each sample to the student model’s training loss based on its confidence in the quality of the label. They train the teacher on a small subset of gold labels and use its output as confidence weights for the student model. (Dehghani et al., 2017a) shows that with this approach, they can beat the unsupervised ranker using roughly 75% of the data required when training directly on the noisy data. They train a cluster of 50 Gaussian processes to form the teacher annotations, which are used to generate soft labels to fine-tune the student model.
In (Ratner et al., 2017), Ratner et al. transform a set of weak supervision sources that may disagree with each other into soft labels used to train a discriminative model. They show experimentally that this approach outperforms the naïve majority-voting strategy for generating the target labels. This inspires our noise-aware approach.
In (Koh and Liang, 2017), Koh and Liang apply classical results from regression analysis to approximate the change in loss at a test point caused by removing a specific point from the training set. They show experimentally that their method approximates this change in loss well, even for highly non-linear models such as GoogLeNet. They also apply their method to prioritizing training examples when checking for labeling errors. Our influence-aware approach uses influence functions (Koh and Liang, 2017) to identify mislabeled training examples.

3. Proposed Methods
3.1. Model Architecture
In this work, we only explore pairwise loss functions, since they typically lead to better-performing models than the pointwise approach. Listwise approaches, although typically the most effective, tend to have high training and inference computational complexity due to their inherently permutation-based formulations (Liu, 2009).
We consider a slight variant of the Rank model proposed in (Dehghani et al., 2017c) as our baseline model. We represent the tokens in the query $q$ as $\{t^q_1, \dots, t^q_{|q|}\}$ and the tokens in the document $d$ as $\{t^d_1, \dots, t^d_{|d|}\}$. We embed these tokens in a low-dimensional space with a mapping $E : V \to \mathbb{R}^m$, where $V$ is the vocabulary and $m$ is the embedding dimension. We also learn token-dependent weights $W : V \to \mathbb{R}$. Our final representation for a query is a weighted sum of the word embeddings, $\vec{q} = \sum_{i=1}^{|q|} \hat{W}(t^q_i)\, E(t^q_i)$, where $\hat{W}$ indicates that the weights are normalized to sum to 1 across the tokens in the query using a softmax operation. The vector representation $\vec{d}$ for documents is defined similarly.
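A minimal NumPy sketch of this softmax-weighted embedding representation (all function and variable names here are our own illustration, not the paper's):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def text_representation(token_ids, embeddings, token_weights):
    """Weighted sum of token embeddings; weights normalized with a softmax.

    token_ids:     indices of the tokens in the query or document
    embeddings:    |V| x m embedding matrix (E in the text)
    token_weights: length-|V| vector of learned scalar weights (W in the text)
    """
    w = softmax(token_weights[token_ids])  # normalize across this text's tokens
    return w @ embeddings[token_ids]       # m-dimensional representation

# Toy example: vocabulary of 5 tokens, embedding dimension 3.
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 3))
W = rng.normal(size=5)
vec = text_representation(np.array([0, 2, 4]), E, W)
```

The same function serves for both queries and documents, with separate learned weight vectors.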
In addition, we take the difference and elementwise product of the document and query vectors and concatenate them into a single vector $x_{q,d} = [\vec{q} - \vec{d};\ \vec{q} \odot \vec{d}]$. We compute the relevance score of a document $d$ to a query $q$ by passing $x_{q,d}$ through a feedforward network with ReLU activations and a scalar output. We use a $\tanh$ at the output of the rank model and use the raw logit scores otherwise. We represent the output of our model parameterized by $\theta$ as $f_\theta(q, d)$.

Our training set $\mathcal{D}$ is a set of tuples $(q, d_1, d_2, s_{d_1}, s_{d_2})$, where $s_{d_i}$ is the relevance score of $d_i$ to $q$ given by the unsupervised ranker. The pairwise objective function we minimize is given by:
(1) $\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(q, d_1, d_2) \in \mathcal{D}} \ell\big(z_{q,d_1,d_2},\ f_\theta(q, d_1) - f_\theta(q, d_2)\big)$

(2) $\ell_{\mathrm{CE}}(z, \delta) = -z \log \sigma(\delta) - (1 - z) \log\big(1 - \sigma(\delta)\big)$

(3) $\ell_{\mathrm{hinge}}(z, \delta) = \max\big(0,\ \epsilon - \operatorname{sign}(z - \tfrac{1}{2})\, \delta\big)$
where $z_{q,d_1,d_2}$ gives the relative relevance of $d_1$ and $d_2$ to $q$, and $\ell$ is either $\ell_{\mathrm{CE}}$ or $\ell_{\mathrm{hinge}}$ for cross-entropy or hinge loss, respectively. The key difference between the rank and noise-aware models is how $z_{q,d_1,d_2}$ is determined. As in (Dehghani et al., 2017c), we train the rank model by minimizing the max-margin loss and compute $z_{q,d_1,d_2}$ as $\mathbb{1}[s_{d_1} > s_{d_2}]$.
Despite the results in (Zamani and Croft, 2018) showing that the max-margin loss exhibits stronger empirical risk guarantees for ranking tasks with noisy training data, we minimize the cross-entropy loss in each of our proposed models, for the following reasons. In the case of the noise-aware model, each soft training label is a distribution over {0, 1}, so we seek to learn a calibrated model rather than one that maximizes the margin (as a hinge-loss objective would). For the influence-aware model, we minimize the cross-entropy rather than the hinge loss because the method of influence functions relies on a twice-differentiable objective.
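As an illustrative sketch, the two pairwise losses over the score difference can be written as follows (the function names and exact parameterization are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pairwise_cross_entropy(z, delta):
    """Cross-entropy on the score difference delta = f(q, d1) - f(q, d2).

    z is the (possibly soft) probability that d1 is more relevant than d2,
    so soft labels near 0.5 contribute a deliberately uncertain target.
    """
    p = sigmoid(delta)
    return -z * np.log(p) - (1 - z) * np.log(1 - p)

def pairwise_hinge(z, delta, margin=0.1):
    """Max-margin loss with a hard preference derived from z."""
    sign = np.sign(z - 0.5)
    return np.maximum(0.0, margin - sign * delta)

# A confident label with a correctly ordered pair yields a small loss;
# the same label with a reversed pair yields a large one.
small = pairwise_cross_entropy(1.0, 3.0)
large = pairwise_cross_entropy(1.0, -3.0)
```

With a soft label of exactly 0.5, the cross-entropy gradient with respect to the score difference vanishes at `delta = 0`, which is the calibration behavior the text argues for.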
3.2. Noise-Aware Model
In this approach, the $z_{q,d_1,d_2}$ are soft relevance labels. For each of the queries in the training set, we rank the top documents by relevance using $k$ unsupervised rankers. Considering ordered pairs of these documents, each ranker gives a value of $+1$ if it agrees with the order, $-1$ if it disagrees, and $0$ if neither document appears in the top 10 positions of its ranking. We collect these values into a matrix $\Lambda \in \{-1, 0, 1\}^{p \times k}$ for the $p$ document pairs. The joint distribution over these pairwise preferences $\Lambda$ and the true pairwise orderings $y$ is given by:

(4) $p_w(\Lambda, y) = \frac{1}{Z_w} \exp\big(w^\top \phi(\Lambda, y)\big)$
where $w$ is a vector of learned parameters and $Z_w$ is the partition function. A natural choice for $\phi$ is to model the accuracy of each individual ranker in addition to the pairwise correlations between the rankers. So for the $i$-th document pair, we have the following expression for $\phi$, concatenating accuracy and correlation indicators:

$\phi_i(\Lambda, y) = \Big(\{\mathbb{1}[\Lambda_{ij} = y_i]\}_{j=1}^{k},\ \{\mathbb{1}[\Lambda_{ij} = \Lambda_{il}]\}_{1 \le j < l \le k}\Big)$
Since the true relevance preferences are unknown, we treat them as latent. We learn the parameters of this model without any gold relevance labels by maximizing the marginal likelihood (as in (Ratner et al., 2017)):

(5) $\hat{w} = \arg\max_w \log \sum_{y} p_w(\Lambda, y)$
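A minimal NumPy sketch of building the matrix of pairwise votes that this model consumes (the handling of documents outside the top-$k$, and all names here, are our own assumptions for illustration):

```python
import numpy as np

def pairwise_votes(rankings, pairs, k=10):
    """Build the vote matrix: one row per document pair, one column per ranker.

    rankings: list of ranked document-id lists, one per unsupervised ranker
    pairs:    list of (d1, d2) tuples, each meaning "d1 ranked above d2"
    A ranker votes +1 if it agrees with the ordering, -1 if it disagrees,
    and abstains with 0 when neither document appears in its top k. A
    document outside the top k is treated as ranked below one inside it.
    """
    votes = np.zeros((len(pairs), len(rankings)), dtype=int)
    for j, ranking in enumerate(rankings):
        pos = {d: i for i, d in enumerate(ranking[:k])}
        for i, (d1, d2) in enumerate(pairs):
            if d1 not in pos and d2 not in pos:
                continue  # abstain: neither document in the top k
            r1 = pos.get(d1, k)  # missing documents fall below rank k
            r2 = pos.get(d2, k)
            votes[i, j] = 1 if r1 < r2 else -1
    return votes

# Three toy rankers voting on two pairs.
rankers = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "d"]]
L = pairwise_votes(rankers, [("a", "b"), ("c", "d")])
# L[0] == [1, -1, 1]: ranker 0 agrees a>b, ranker 1 disagrees,
# ranker 2 has "a" but not "b" in its list, so it also agrees.
```

Learning the weights $w$ from this matrix is then delegated to Snorkel's generative model, as described next.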
We use the Snorkel library (https://github.com/HazyResearch/snorkel) to optimize Equation 5 by stochastic gradient descent, performing Gibbs sampling to estimate the gradient at each step. Once we have determined the parameters of the model, we evaluate the posterior probabilities $p_{\hat{w}}(y \mid \Lambda)$, which we use as our soft training labels.

3.3. Influence-Aware Model
In this approach, we identify training examples that hurt the generalization performance of the trained model. We expect that many of these are incorrectly labeled and that our model will perform better if we drop them from the training set. The influence of removing a training example $z_i$ on the trained model's loss at a test point $z_{\mathrm{test}}$ is computed as (Koh and Liang, 2017):
(6) $\mathcal{I}(z_i, z_{\mathrm{test}}) = -\nabla_\theta L(z_{\mathrm{test}}, \hat{\theta})^\top\, H_{\hat{\theta}}^{-1}\, \nabla_\theta L(z_i, \hat{\theta})$

(7) $L(z_{\mathrm{test}}, \hat{\theta}_{-z_i}) - L(z_{\mathrm{test}}, \hat{\theta}) \approx -\frac{1}{n}\, \mathcal{I}(z_i, z_{\mathrm{test}})$
where $H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat{\theta})$ is the Hessian of the objective function. If the approximated change in loss $-\frac{1}{n}\mathcal{I}(z_i, z_{\mathrm{test}})$ is negative, then $z_i$ is a harmful training example for $z_{\mathrm{test}}$, since its inclusion in the training set causes an increase in the loss at that point. Summing this value over the entire test set gives a total influence score for each training example $z_i$, which we expect to represent $z_i$'s impact on the model's performance at test time. In our setup, we know that some of our training examples are mislabeled; we expect that these points will have a large negative influence score. Of course, for a fair evaluation, the test points are taken from the development set used for hyperparameter tuning (see Section 4).

We address the computational cost of computing (7) by treating our trained model as a logistic regression on the bottleneck features. We freeze all model parameters except the last layer of the feedforward network and compute the gradients with respect to these parameters only. These gradients can be computed in closed form in an easily parallelizable way, allowing us to avoid techniques that rely on auto-differentiation operations (Pearlmutter, 1994). We compute the required Hessian-inverse-vector products using the method of conjugate gradients, following (Shewchuk, 1994). We also add a small damping term to the diagonal of the Hessian to ensure that it is positive definite (Martens, 2010).

4. Data Preprocessing and Model Training
We evaluate the application of our methods to ad-hoc retrieval on the Robust04 corpus with the associated test queries and relevance labels. As in (Dehghani et al., 2017c), our training data comes from the AOL query logs (Pass et al., 2006), on which we perform similar preprocessing. We use the Indri search engine (https://www.lemurproject.org/indri.php) for indexing and retrieval, with the default parameters for the query likelihood (QL) retrieval model (Ponte and Croft, 1998), which we use as the weak supervision source. We fetch only the top 10 documents from each ranking, in comparison to previous works, which trained on as many as the top 1000 documents for each query. To compensate for this difference, we randomly sample additional documents from the rest of the corpus for each of these 10 documents. We train our model on a random subset of 100k rankings generated by this process. This is fewer than 10% of the number of rankings used in previous works (Nie et al., 2018; Dehghani et al., 2017c), and each of our rankings contains far fewer document pairs.
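As an illustration of this pair-construction step, the following sketch pairs each weakly supervised top-10 document with randomly sampled corpus documents treated as non-relevant (the exact sampling scheme, counts, and names here are our assumptions):

```python
import random

def training_triples(query, top_docs, corpus_ids, n_negatives=1, seed=0):
    """Build (query, d1, d2) training triples from one weak ranking.

    top_docs:   document ids returned by the QL ranker for this query
    corpus_ids: all document ids; used to sample random "negatives"
    Each retrieved document d1 is paired with sampled documents d2 from
    the rest of the corpus, which are assumed to be less relevant.
    """
    rng = random.Random(seed)
    retrieved = set(top_docs)
    candidates = [d for d in corpus_ids if d not in retrieved]
    triples = []
    for d1 in top_docs:
        for d2 in rng.sample(candidates, n_negatives):
            triples.append((query, d1, d2))  # d1 preferred over d2
    return triples

corpus = [f"doc{i}" for i in range(100)]
triples = training_triples("q1", corpus[:10], corpus)
```

Because documents outside the weak ranking are sampled uniformly, the induced relevance distribution is uniform outside the retrieved results, which matters for the discussion of Figure 1 later.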
For the word embedding representation $E$, we use the 840B.300d GloVe (Pennington et al., 2015) pretrained word embedding set (https://nlp.stanford.edu/projects/glove/). The feedforward network hidden layer sizes are chosen from {512, 256, 128, 64}, with up to 5 layers. We use the first 50 queries in the Robust04 dataset as our development set for hyperparameter selection, computation of the influence scores, and early stopping. The remaining 200 queries are used for evaluation.
During inference, we rank documents by the output of the feedforward network. Since it is not feasible to rank all the documents in the corpus, we fetch the top 100 documents using the QL retrieval model and then rerank them using the trained model's scores.
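A minimal sketch of this retrieve-then-rerank step; the toy term-overlap scorer below merely stands in for the trained network and is not part of the paper's method:

```python
def rerank(query, candidates, model_score):
    """Re-rank a QL candidate list by a model's relevance scores."""
    return sorted(candidates, key=lambda d: model_score(query, d), reverse=True)

# Toy stand-in scorer: count shared terms between query and document.
def overlap_score(query, doc):
    return len(set(query.split()) & set(doc.split()))

docs = ["easter mice pictures", "computer mice", "easter eggs"]
reranked = rerank("pictures of easter mice", docs, overlap_score)
```

Restricting reranking to the QL top 100 keeps inference cost proportional to the candidate pool rather than the corpus size.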
4.1. Model-Specific Details
For the noise-aware model, we generate separate rankings for each query using the following retrieval methods: Okapi BM25, TF-IDF, QL, and QL+RM3 (Abdul-Jaleel et al., 2004), using Indri with the default parameters.
For the influence-aware model, we train the model once on the full dataset, compute the influence score for each training point, and drop all training examples with a negative value, which we find to typically be around half of the original training set. We then retrain the model on this subset.
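This filter-and-retrain procedure can be sketched end to end on toy data. This is a simplified illustration under the paper's own simplification (logistic regression on frozen bottleneck features); a direct linear solve stands in for conjugate gradients, and all names are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad(theta, x, y):
    """Closed-form gradient of the logistic loss at one (features, label)."""
    return (sigmoid(x @ theta) - y) * x

def hessian(theta, X, damping=1e-3):
    """Damped Hessian of the mean logistic loss over the training set."""
    p = sigmoid(X @ theta)
    H = (X.T * (p * (1.0 - p))) @ X / len(X)
    return H + damping * np.eye(X.shape[1])  # keep it positive definite

def removal_effects(theta, X, Y, x_test, y_test):
    """Approximate change in test loss from removing each training point.

    Negative values mark harmful examples (Koh and Liang, 2017). The
    linear solve here would be replaced by conjugate gradients at scale.
    """
    s_test = np.linalg.solve(hessian(theta, X), grad(theta, x_test, y_test))
    return np.array([s_test @ grad(theta, x, y) for x, y in zip(X, Y)]) / len(X)

# Toy bottleneck features: one informative dimension, one inert. The last
# point's label is flipped, mimicking weak-supervision noise.
X = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0], [3.0, 0.0]])
Y = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # last label is wrong
theta = np.zeros(2)
for _ in range(200):  # plain gradient descent on the mean logistic loss
    theta -= 0.5 * np.mean([grad(theta, x, y) for x, y in zip(X, Y)], axis=0)

eff = removal_effects(theta, X, Y, x_test=np.array([2.0, 0.0]), y_test=1.0)
keep = eff >= 0  # drop harmful examples, then retrain on the remainder
```

On this toy data the mislabeled point receives a negative score and would be dropped before the second round of training.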
Interestingly, we find that using a smaller margin $\epsilon$ in the training loss of the rank model leads to improved performance. A smaller margin incurs zero loss for a smaller difference in the model's relative preference between the two documents; intuitively, this allows for less overfitting to the noisy data. We use a margin of 0.1, chosen by cross-validation.
The noise-aware and influence-aware models train end-to-end in around 12 and 15 hours, respectively, on a single NVIDIA Titan Xp.
5. Experimental Results
We compare our two methods against two baselines: the unsupervised ranker (QL) and the rank model. Among the unsupervised rankers (see Section 4.1) used as input to the noise-aware model, the QL ranker performs best on all metrics. Training the rank model on the majority vote of this set of unsupervised rankers performed very similarly to the rank model itself, so we only report results for the rank model. We also compare the results after smoothing with the normalized QL document scores by linear interpolation.
Table 1. Results after linear interpolation with the normalized QL scores.

          Rank Model   Noise-Aware   Influence-Aware   QL
NDCG@10   0.3881       0.3952        0.4008            0.3843
Prec@10   0.3535       0.3621        0.3657            0.3515
MAP       0.2675       0.2774        0.2792            0.2676
Table 2. Results without interpolation.

          Rank Model   Noise-Aware   Influence-Aware
NDCG@10   0.2610       0.2886        0.2966
Prec@10   0.2399       0.2773        0.2742
MAP       0.1566       0.1831        0.1839
The results in Tables 1 and 2 show that the noise-aware and influence-aware models perform similarly, with both outperforming the unsupervised baseline. Bold items are the largest in their row, and daggers indicate statistically significant improvements over the rank model at a level of 0.05 using the Bonferroni correction. Figure 1 shows that the rank model quickly starts to overfit. This does not contradict the results in (Dehghani et al., 2017c), since in our setup we train on far fewer pairs of documents per query, so each relevance label error has a much greater impact. For each query, our distribution over documents is uniform outside the results from the weak supervision source, so we expect to perform worse than if we had a more faithful relevance distribution. Our proposed approaches use an improved estimate of the relevance distribution at the most important positions in the ranking, allowing them to perform well.
We now present two representative training examples showing how our methods overcome the limitations of the rank model.
Example 5.1.

The method in Section 3.2 used to create labels for the noise-aware model gives the following training example an unconfident label (~0.5) rather than a relevance label of 1 or 0: ($q$ = “town of davie post office”, $d_1$ = FBIS3-25584, $d_2$ = FT933-13328), where $d_1$ is ranked above $d_2$. Both of these documents are about people named “Davie” rather than about a town or a post office, so it is reasonable to avoid specifying a hard label indicating which one is explicitly more relevant.
Example 5.2.

One of the most harmful training points, as determined by the method described in Section 3.3, is the pair ($q$ = “pictures of easter mice”, $d_1$ = FT932-15650, $d_2$ = LA041590-0059), where $d_1$ is ranked above $d_2$. $d_1$ discusses the computer input device, while $d_2$ is about pictures that are reminiscent of the holiday. The incorrect relevance label explains why the method identifies this as a harmful training example.
6. Conclusions and Future Work
We have presented two approaches to reduce the amount of weak data needed to surpass the performance of the unsupervised method that generates the training data. The noise-aware model does not require ground-truth labels but has an additional data dependency on multiple unsupervised rankers. The influence-aware model requires a small set of gold labels and a second round of training, although empirically only around half of the dataset is used when training the second time.
Interesting paths for future work include learning a better joint distribution for training the noise-aware model, and leveraging the ideas of (Zamani et al., 2018), developed there for query performance prediction, to construct soft training labels instead. Similarly, we could apply ideas from unsupervised LTR (Bhowmik and Ghosh, 2017) to form better noise-aware labels. For the influence-aware model, we could use a smoothed ranking loss (Baeza-Yates et al., 2007) rather than cross-entropy, and compute the influence of sets of training examples rather than of a single example (Khanna et al., 2018).
References
 Abdul-Jaleel et al. (2004) Nasreen Abdul-Jaleel, James Allan, Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Donald Metzler, Mark D. Smucker, Trevor Strohman, Howard Turtle, and Courtney Wade. 2004. UMass at TREC 2004: Notebook. (2004).
 Ai et al. (2018) Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W Bruce Croft. 2018. Unbiased Learning to Rank with Unbiased Propensity Estimation. In The 41st International ACM SIGIR Conference. ACM Press, New York, New York, USA, 385–394. https://doi.org/10.1145/3209978.3209986
 Baeza-Yates et al. (2007) Ricardo Baeza-Yates, Berthier de Araújo Neto Ribeiro, et al. 2007. Learning to rank with nonsmooth cost functions. NIPS (2007).
 Bhowmik and Ghosh (2017) Avradeep Bhowmik and Joydeep Ghosh. 2017. LETOR Methods for Unsupervised Rank Aggregation. In the 26th International Conference. ACM Press, New York, New York, USA, 1331–1340. https://doi.org/10.1145/3038912.3052689
 Dehghani et al. (2017a) Mostafa Dehghani, Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf. 2017a. FidelityWeighted Learning. arXiv.org (Nov. 2017). arXiv:cs.LG/1711.02799v2
 Dehghani et al. (2017b) Mostafa Dehghani, Aliaksei Severyn, Sascha Rothe, and Jaap Kamps. 2017b. Learning to Learn from Weak Supervision by Full Supervision. arXiv.org (Nov. 2017), 1–8. arXiv:1711.11383
 Dehghani et al. (2017c) Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W Bruce Croft. 2017c. Neural Ranking Models with Weak Supervision. In the 40th International ACM SIGIR Conference. ACM Press, New York, New York, USA, 65–74. https://doi.org/10.1145/3077136.3080832
 Jiang et al. (2017) Xinxin Jiang, Shirui Pan, Guodong Long, Fei Xiong, Jing Jiang, and Chengqi Zhang. 2017. Costsensitive learning with noisy labels. JMLR (2017).
 Khanna et al. (2018) Rajiv Khanna, Been Kim, Joydeep Ghosh, and Oluwasanmi Koyejo. 2018. Interpreting Black Box Predictions using Fisher Kernels. arXiv.org (Oct. 2018). arXiv:cs.LG/1810.10118v1
 Koh and Liang (2017) Pang Wei Koh and Percy Liang. 2017. Understanding Blackbox Predictions via Influence Functions. arXiv.org (March 2017), 1–11. arXiv:1703.04730
 Liu (2009) Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009), 225–331. https://doi.org/10.1561/1500000016
 Martens (2010) James Martens. 2010. Deep learning via Hessianfree optimization. (2010).
 Nie et al. (2018) Yifan Nie, Alessandro Sordoni, and Jian-Yun Nie. 2018. Multi-level Abstraction Convolutional Model with Weak Supervision for Information Retrieval. In The 41st International ACM SIGIR Conference. ACM Press, New York, New York, USA, 985–988. https://doi.org/10.1145/3209978.3210123
 Northcutt et al. (2017) Curtis G Northcutt, Tailin Wu, and Isaac L Chuang. 2017. Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels. arXiv.org (May 2017). arXiv:1705.01936
 Pass et al. (2006) Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A Picture of Search. Infoscale (2006), 1–es. https://doi.org/10.1145/1146847.1146848
 Pearlmutter (1994) Barak Pearlmutter. 1994. Fast exact multiplication by the Hessian. MIT Press 6, 1 (Jan. 1994), 147–160. https://doi.org/10.1162/neco.1994.6.1.147
 Pennington et al. (2015) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2015. GloVe: Global Vectors for Word Representation.
 Ponte and Croft (1998) Jay M Ponte and W Bruce Croft. 1998. A Language Modeling Approach to Information Retrieval. SIGIR (1998), 275–281. https://doi.org/10.1145/290941.291008
 Ratner et al. (2017) Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel. Proceedings of the VLDB Endowment 11, 3 (Nov. 2017), 269–282. https://doi.org/10.14778/3157794.3157797
 Shewchuk (1994) Jonathan R Shewchuk. 1994. An introduction to the conjugate gradient method without the agonizing pain. (1994).
 Zamani and Croft (2018) Hamed Zamani and W Bruce Croft. 2018. On the Theory of Weak Supervision for Information Retrieval. ACM, New York, New York, USA. https://doi.org/10.1145/3234944.3234968
 Zamani et al. (2018) Hamed Zamani, W Bruce Croft, and J Shane Culpepper. 2018. Neural Query Performance Prediction using Weak Supervision from Multiple Signals. In The 41st International ACM SIGIR Conference. ACM Press, New York, New York, USA, 105–114. https://doi.org/10.1145/3209978.3210041