Modeling Document Interactions for Learning to Rank with Regularized Self-Attention

by   Shuo Sun, et al.
Johns Hopkins University

Learning to rank is an important task that has been successfully deployed in many real-world information retrieval systems. Most existing methods compute relevance judgments of documents independently, without holistically considering the entire set of competing documents. In this paper, we explore modeling document interactions with self-attention based neural networks. Although self-attention networks have achieved state-of-the-art results in many NLP tasks, we find empirically that self-attention provides little benefit over a baseline neural learning to rank architecture. To improve the learning of self-attention weights, we propose simple yet effective regularization terms designed to model interactions between documents. Evaluations on publicly available Learning to Rank (LETOR) datasets show that training a self-attention network with our proposed regularization terms can significantly outperform existing learning to rank methods.




1. Introduction

Learning to rank has attracted much attention in the research community, where the focus has been on developing loss objectives that effectively optimize information retrieval (IR) metrics. The general idea is to fit a global relevance function on a training set that consists of queries, sets of documents, and their desired rankings. In particular, given a query $q$ and a set of documents $D$, existing methods learn a function $f(q, d_i)$ that gives higher scores to documents $d_i \in D$ that rank better. There are three common approaches:

  1. Pointwise approaches optimize the relevance score of a single document without considering other documents in the same set.

  2. Pairwise approaches optimize the ranking between pairs of documents, such that $f(q, d_i) > f(q, d_j)$ if $d_i$ ranks better than $d_j$.

  3. Listwise approaches directly attempt to optimize the target IR metric (such as MAP or NDCG), which is based on the entire set of document scores.

Importantly, all these approaches focus on the loss objective during the training phase. Whether the objective is pairwise or listwise, the function $f(q, d_i)$ computes relevance scores for each document independently at inference time.

We propose to formulate the relevance function based on the set of documents to be ranked: $f(q, d_i, D)$. (The notation will be described more precisely later. For now, the point is to illustrate the difference between modeling $f(q, d_i)$ independently for each $d_i$, versus adding the full document set $D$ in $f(q, d_i, D)$.) Suppose some competing documents are dropped: $f(q, d_i)$ will output the same relevance score, whereas $f(q, d_i, D)$ will automatically adapt. This is more similar to how humans might rank documents at test time: multiple competing documents are reviewed before assigning the relevance score to $d_i$.
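As a toy illustration of this distinction, the sketch below contrasts a scorer that sees only one document with a scorer that also sees the competing set. Both scoring rules and the one-dimensional features are hypothetical and chosen only to make the behavior visible; they are not the model proposed in this paper.

```python
def score_independent(x):
    """Toy stand-in for f(q, d_i): uses only the document's own feature."""
    return 2.0 * x


def score_set_aware(x, all_x):
    """Toy stand-in for f(q, d_i, D): also uses the competing documents.

    Here the set enters through its mean feature value (an arbitrary choice).
    """
    return 2.0 * x - sum(all_x) / len(all_x)


full_set = [0.9, 0.8, 0.1]   # hypothetical features of three competing documents
reduced_set = [0.9, 0.1]     # the same query after one competitor is dropped

# The independent score of the first document never changes ...
print(score_independent(0.9), score_independent(0.9))
# ... while the set-aware score adapts to the remaining competitors.
print(score_set_aware(0.9, full_set), score_set_aware(0.9, reduced_set))
```

With these toy rules, dropping a strong competitor raises the set-aware score of the first document, while its independent score stays fixed.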

Recently, self-attention has been successfully applied to many tasks such as machine translation (Vaswani et al., 2017) and natural language inference (Devlin et al., 2018; Liu et al., 2019; Lan et al., 2019). As self-attention can establish direct connections among elements within a set, it is a suitable mechanism for modeling interactions among documents. Applying self-attention to a set of documents allows the model to adjust scores based on other competing documents.

However, experimental results on benchmark datasets show that ListNet with self-attention performs only marginally better than ListNet without it. Deeper analyses of attention weights reveal that self-attention alone is not effective at modeling document interactions. In section 2, we propose regularization terms that push the model towards learning meaningful weights that better model interactions between documents.

We evaluate our model on the popular Yahoo, MSLR-WEB and Istella LETOR datasets and show that neural networks with properly regularized self-attention weights can significantly outperform existing strong ensemble tree and neural network baselines.

2. Model Description

Given a query $q$, a set of documents $D = \{d_1, d_2, \ldots, d_n\}$ and a feature extraction function $\phi$, the input to a learning to rank model is a set of feature vectors:

$$X = \{x_1, x_2, \ldots, x_n\}, \quad x_i = \phi(q, d_i)$$

We want to model a ranking function $f$ such that:

$$f(X) = [f_1, f_2, \ldots, f_n]$$

where $f_i$ is the predicted relevance score for document $d_i$. In this notation, $f$ now has a vector output of dimension $n$, where each element represents the relevance score for a document.

Ideally, we want $f(X)$ to be sorted in the same order as the desired ranking. At inference time, we compute all relevance scores and then sort the documents according to these scores.
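The inference step described above (score every document, then sort) can be sketched as follows. The function name and the example scores are hypothetical; any model that returns one score per document could supply them.

```python
def rank_documents(scores):
    """Return document indices ordered from highest to lowest predicted score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical predicted relevance scores for four documents of one query.
scores = [0.2, 1.5, -0.3, 0.9]
print(rank_documents(scores))  # [1, 3, 0, 2]
```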

2.1. ListNet

Our starting point for modeling is the ListNet (Cao et al., 2007) algorithm. ListNet is a strong neural learning to rank algorithm which optimizes a listwise objective function. Due to the combinatorial nature of ranking tasks, popular metrics such as NDCG (Järvelin and Kekäläinen, 2002) and ERR (Chapelle et al., 2009) are not differentiable with respect to model parameters, and consequently gradient descent based learning algorithms cannot be directly used for optimization. Therefore, ListNet optimizes a surrogate loss function, which is defined below.

Given predicted relevance judgments $f = [f_1, \ldots, f_n]$ and ground truth $R = [R_1, \ldots, R_n]$, the top one probability of document $d_i$ based on $f$ is:

$$P_f(d_i) = \frac{\exp(f_i)}{\sum_{j=1}^{n} \exp(f_j)}$$

and the top one probability of document $d_i$ based on $R$ is:

$$P_R(d_i) = \frac{\exp(R_i)}{\sum_{j=1}^{n} \exp(R_j)}$$

The loss is defined as the cross entropy between the top one probability distribution of the predicted scores and the top one probability distribution of the ground truth:

$$L_{ListNet} = -\sum_{i=1}^{n} P_R(d_i) \log P_f(d_i)$$
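A minimal sketch of the ListNet top one loss in plain Python, assuming the standard softmax form of the top one probability from Cao et al. (2007). The function names are illustrative; a real implementation would use a tensor library and batch over queries.

```python
import math


def top_one_probabilities(values):
    """Softmax over a list of scores (shifted by the max for numerical stability)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(scores, relevance):
    """Cross entropy between top one distributions of labels and predictions."""
    p_true = top_one_probabilities(relevance)
    p_pred = top_one_probabilities(scores)
    return -sum(pt * math.log(pp) for pt, pp in zip(p_true, p_pred))


relevance = [2, 1, 0]                          # hypothetical graded labels
good = listnet_loss([2.0, 1.0, 0.0], relevance)  # scores agree with the labels
bad = listnet_loss([0.0, 1.0, 2.0], relevance)   # scores reverse the labels
print(good < bad)  # True
```

The loss is smallest when the predicted top one distribution matches the label distribution, which is why the reversed scores incur a larger value.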


2.2. Self-Attention (SA)

Self-attention is an attention mechanism which learns to represent every element in a set by a weighted sum of the elements within the same set. Self-attention based neural networks have found success in many NLP tasks such as machine reading, machine translation and sentence representation learning (Lin et al., 2017; Vaswani et al., 2017; Cheng et al., 2016; Devlin et al., 2018).

The input to the self-attention layer is a set of vector representations:

$$V = [v_1, v_2, \ldots, v_n]$$

Here $v_i$ is a $d$-dimensional vector representation of the $i$-th document, $v_i \in \mathbb{R}^{d}$. $V$ is an $n \times d$ matrix, which concatenates the vector representations of the $n$ documents. The output of the self-attention layer is:

$$SA(V) = \sigma(QK^{\top})V \qquad (7)$$

where $Q = VW_Q$, $K = VW_K$, and $\sigma$ is the sigmoid function. $W_Q$ and $W_K$ are trainable weight matrices.
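The sketch below shows one way a sigmoid-gated self-attention layer over a document set could be computed. It assumes the layer has the form $\sigma(QK^{\top})V$ with $Q = VW_Q$ and $K = VW_K$; that exact form, the absence of any scaling factor, and all names (`self_attention`, `w_q`, `w_k`) are assumptions made for illustration. Plain Python lists are used to keep the example dependency-free.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sigmoid(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def self_attention(v, w_q, w_k):
    """Return (output, attention) for an n x d document matrix v.

    attention[i][j] is the weight document i places on document j.
    """
    q = matmul(v, w_q)
    k = matmul(v, w_k)
    logits = matmul(q, transpose(k))                      # n x n
    attention = [[sigmoid(x) for x in row] for row in logits]
    return matmul(attention, v), attention                # n x d, n x n


# Three hypothetical documents with two features each, identity projections.
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, attention = self_attention(v, identity, identity)
print(len(attention), len(attention[0]))  # 3 3
```

Because the sigmoid is applied elementwise, each attention weight lies in (0, 1) independently of the others, unlike a softmax whose rows sum to one.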

2.3. ListNet + Self-Attention (SA)

Figure 1. A document encoder consisting of two feed forward layers and a self-attention layer, joined by highway connections (Srivastava et al., 2015).

ListNet uses a single layer feed forward neural network without a bias term or nonlinear activation function. We improve the original architecture with recent techniques such as layer normalization (Ba et al., 2016), highway connections (Srivastava et al., 2015) and exponential linear units (Clevert et al., 2015). Inspired by the transformer (Vaswani et al., 2017) architecture, we insert a self-attention layer between two feed forward layers. We will refer to this architecture as the document encoder (DE). Figure 1 shows the architecture of the document encoder.
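For readers unfamiliar with the three building blocks named above, the following sketch gives their standard textbook definitions on plain Python lists. How the document encoder wires them together is given by Figure 1 and is not reproduced here; the function names and the choice to pass the highway gate in as an argument (rather than computing it from learned parameters) are simplifications for illustration.

```python
import math


def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def highway(x, h, gate):
    """Highway connection: gate * h + (1 - gate) * x, elementwise.

    x is the layer input, h the transformed input, and gate holds values in
    [0, 1]; in a highway network the gate is itself a learned sigmoid layer.
    """
    return [g * hi + (1.0 - g) * xi for xi, hi, g in zip(x, h, gate)]


x = [1.0, 2.0, 3.0]
h = [elu(v) for v in layer_norm(x)]      # a toy "transformed" representation
print(highway(x, h, [0.5, 0.5, 0.5]))    # halfway blend of input and transform
```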

2.4. ListNet + Regularized Self-Attention (RSA)

Figure 2. (a) Self-attention layer. (b) ListNet + Regularized Self-Attention (RSA).

We observe that certain document interactions are embedded in the datasets: 1) relative orderings between documents and 2) arithmetic differences in relevance judgments between documents. We hypothesize that this information can provide powerful supervision for the learning of the self-attention weights. We explore four different document encoders, each of which is supervised by a different regularization term:

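  • The first document encoder enhances vector representations of documents by paying attention to other documents that are more relevant, i.e., for a given document $d_i$, the attention weight for $d_j$ is:

    $$a_{ij} = \begin{cases} 1 & \text{if } R_j > R_i \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

  • The second document encoder is similar to the first, except that it assigns exponentially higher attention weights to documents with higher relevance judgments (equation (9)).

  • The third document encoder does the opposite of the first. It assigns positive attention weights to documents that are less relevant:

    $$a_{ij} = \begin{cases} 1 & \text{if } R_j < R_i \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

  • The fourth document encoder is similar to the third, except that it assigns exponentially higher attention weights to documents with lower relevance judgments (equation (11)).

In equations (9) and (11), $k$ refers to the maximum relevance judgment. In this paper, $k=4$ for all the datasets.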
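To make the idea of an ideal attention matrix concrete, the sketch below builds target matrices for the two encoders that attend to more relevant and to less relevant documents. It assumes binary targets with a strict comparison of relevance judgments, which is an interpretation of the prose description rather than a confirmed form; the exponentially weighted variants are not sketched because their exact formulas are not reproduced here. Function names are hypothetical.

```python
def ideal_attention_more_relevant(relevance):
    """target[i][j] = 1 when document j is strictly more relevant than document i."""
    n = len(relevance)
    return [[1.0 if relevance[j] > relevance[i] else 0.0 for j in range(n)]
            for i in range(n)]


def ideal_attention_less_relevant(relevance):
    """target[i][j] = 1 when document j is strictly less relevant than document i."""
    n = len(relevance)
    return [[1.0 if relevance[j] < relevance[i] else 0.0 for j in range(n)]
            for i in range(n)]


relevance = [0, 2, 1]  # hypothetical relevance judgments for three documents
print(ideal_attention_more_relevant(relevance))
# [[0.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(ideal_attention_less_relevant(relevance))
# [[0.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

Under this assumption the two matrices are transposes of each other, and documents with equal judgments never attend to one another.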

The outputs from the four document encoders are concatenated and then converted to scores via another feedforward layer. The final scores are used to rank the documents.

2.5. Regularization Terms

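We introduce regularization terms which encourage the document encoders to learn attention weights close to the values given in equations (8), (9), (10) and (11).

Rewrite equation (7) as:

$$SA(V) = AV, \quad A = \sigma(QK^{\top})$$

with $Q$ and $K$ as in Section 2.2. $A$ is the attention matrix of a document encoder, $A \in \mathbb{R}^{n \times n}$.

The regularization terms are defined as the average binary cross entropy between the attention weight matrices and the ideal attention weight matrices defined in equations (8), (9), (10) and (11):

$$L_{reg}^{(m)} = -\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ a_{ij}^{(m)} \log A_{ij}^{(m)} + \left(1 - a_{ij}^{(m)}\right) \log\left(1 - A_{ij}^{(m)}\right) \right]$$

for $m \in \{1, 2, 3, 4\}$, one term per document encoder, where $a_{ij}^{(m)}$ is the ideal attention weight and $A_{ij}^{(m)}$ is the corresponding entry of the learned attention matrix.

The final objective function is the summation of the ListNet loss function and the regularization terms:

$$L = L_{ListNet} + \sum_{m=1}^{4} L_{reg}^{(m)} \qquad (18)$$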
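A small sketch of the regularization term and the combined objective, assuming the average is taken over all $n^2$ entries of the attention matrix and that the terms are summed with equal weight. The function names, the clipping constant `eps`, and the example values are illustrative only.

```python
import math


def average_bce(attention, target, eps=1e-12):
    """Mean binary cross entropy between a learned and an ideal attention matrix."""
    total, count = 0.0, 0
    for a_row, t_row in zip(attention, target):
        for a, t in zip(a_row, t_row):
            a = min(max(a, eps), 1.0 - eps)   # avoid log(0)
            total += -(t * math.log(a) + (1.0 - t) * math.log(1.0 - a))
            count += 1
    return total / count


def total_loss(listnet_loss_value, regularization_terms):
    """ListNet loss plus the (unweighted) sum of the regularization terms."""
    return listnet_loss_value + sum(regularization_terms)


target = [[0.0, 1.0], [0.0, 0.0]]          # hypothetical ideal attention
uninformed = [[0.5, 0.5], [0.5, 0.5]]      # sigmoid outputs at zero logits
close = [[0.01, 0.99], [0.01, 0.01]]       # attention near the ideal values
print(average_bce(uninformed, target))     # equals log(2)
print(average_bce(close, target) < average_bce(uninformed, target))  # True
```

In a training loop, one such term would be computed per document encoder and added to the ListNet loss before backpropagation.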


3. Experimental Setup

3.1. Datasets

| Dataset | Year | # Features | Type | # Queries (Q) | # Documents (D) | Average # D/Q |
|---|---|---|---|---|---|---|
| Yahoo LETOR | 2010 | 700 | Train | 19944 | 473134 | 23.72 |
| | | | Validation | 2994 | 71083 | 23.74 |
| | | | Test | 6983 | 165660 | 23.72 |
| MSLR-WEB10K | 2010 | 136 | Train | 6000 | 723412 | 120.57 |
| | | | Validation | 2000 | 235259 | 117.63 |
| | | | Test | 2000 | 241521 | 120.76 |
| MSLR-WEB30K | 2010 | 136 | Train | 18919 | 2270296 | 120.00 |
| | | | Validation | 6306 | 747218 | 118.49 |
| | | | Test | 6306 | 753611 | 119.51 |
| Istella-S LETOR | 2016 | 220 | Train | 19245 | 2043304 | 106.17 |
| | | | Validation | 7211 | 684076 | 94.87 |
| | | | Test | 6562 | 681250 | 103.82 |
| Istella LETOR | 2016 | 220 | Train | 17331 | 5459701 | 315.03 |
| | | | Validation | 5888 | 1865924 | 316.90 |
| | | | Test | 9799 | 3129004 | 319.32 |

Table 1. Characteristics of the datasets.

We conduct evaluations on the Yahoo LETOR (Chapelle and Chang, 2011), MSLR-WEB30K (Qin and Liu, 2013) and Istella LETOR (Dato et al., 2016) datasets shown in Table 1. We also include results on the MSLR-WEB10K and Istella-S LETOR datasets, which are sampled from MSLR-WEB30K and Istella LETOR respectively.

Due to privacy regulations, all datasets contain only extracted feature vectors; the raw texts of queries and documents are not publicly available. Every dataset has five levels of relevance judgment, from 0 (not relevant) to 4 (highly relevant).

3.2. Baseline Systems and Parameters Tuning

All neural models were implemented with PyTorch. We also provide results of two strong learning to rank algorithms based on ensembles of regression trees: MART (Friedman, 2002) and LambdaMART (Burges, 2010). We used RankLib to train and evaluate these models and did hyperparameter tuning on the number of trees and the number of leaves per tree. We omit the results of other learning to rank algorithms in RankLib as they perform significantly worse than MART and LambdaMART.

The models with the highest NDCG@10 scores on the validation sets were used to obtain final results on the test sets, and significance tests were conducted using a paired t-test.
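As a rough sketch of how a paired t-test compares two systems, the code below computes the t statistic from per-query metric values of two models on the same queries. Pairing at the query level and the example values are assumptions made for illustration; converting the statistic into a p-value additionally requires the t-distribution CDF, for which a library routine such as `scipy.stats.ttest_rel` can be used.

```python
import math
import statistics


def paired_t_statistic(metric_a, metric_b):
    """Return (t statistic, degrees of freedom) for paired per-query scores.

    Raises ZeroDivisionError when all paired differences are identical.
    """
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)   # sample std / sqrt(n)
    return mean_diff / std_err, n - 1


# Hypothetical NDCG@10 values of two systems on the same four queries.
system_a = [0.5, 0.6, 0.7, 0.8]
system_b = [0.4, 0.6, 0.5, 0.7]
t_stat, dof = paired_t_statistic(system_a, system_b)
print(round(t_stat, 3), dof)
```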

3.3. Evaluation Metrics

We consider two popular ranking metrics which support multiple levels of relevance judgment:

  1. Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002) sums the relevance judgments (gain) of ranked documents, which are discounted by their positions in the ranking and normalized by the discounted cumulative gain of the ideal document ordering.

  2. Expected Reciprocal Rank (ERR) (Chapelle et al., 2009) measures the expected reciprocal rank at which a user will stop their search.

We report results at positions 1, 3, 5 and 10 for both metrics.
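The two metrics can be sketched as below. The code assumes the commonly used exponential gain $2^{rel} - 1$ with a $\log_2$ position discount for NDCG, and the cascade formulation of ERR with stopping probability $(2^{rel} - 1) / 2^{g_{max}}$ and $g_{max} = 4$; evaluation tools differ in such details, so these choices are assumptions rather than the exact configuration used for the reported numbers.

```python
import math


def dcg_at_k(rels, k):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


def err_at_k(rels, k, max_grade=4):
    """Expected reciprocal rank under the cascade user model."""
    err, p_continue = 0.0, 1.0
    for rank, r in enumerate(rels[:k], start=1):
        p_stop = (2 ** r - 1) / (2 ** max_grade)   # chance the user is satisfied here
        err += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return err


ranked_labels = [0, 4, 2]   # hypothetical labels in the order a model ranked them
print(ndcg_at_k(ranked_labels, 3))
print(err_at_k([0, 4], 2))  # 0.46875
```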

4. Results and Analyses

4.1. Results

| Algorithm | ERR@1 | NDCG@1 | ERR@3 | NDCG@3 | ERR@5 | NDCG@5 | ERR@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|---|
| MART | 0.3440 | 0.6837 | 0.4196 | 0.6877 | 0.4400 | 0.7072 | 0.4553 | 0.7468 |
| LambdaMART | 0.3460 | 0.6870 | 0.4205 | 0.6853 | 0.4409 | 0.7040 | 0.4553 | 0.7468+ |
| ListNet | 0.3394 | 0.6703 | 0.4140 | 0.6764 | 0.4348 | 0.6974 | 0.4496 | 0.7420* |
| ListNet + SA | 0.3381* | 0.6723* | 0.4127* | 0.6765* | 0.4338* | 0.6967* | 0.4486* | 0.7422* |
| ListNet + RSA | 0.3418 | 0.6733 | 0.4171 | 0.6839 | 0.4382 | 0.7066 | 0.4526 | 0.7499 |

(a) Results on Yahoo LETOR

| Algorithm | ERR@1 | NDCG@1 | ERR@3 | NDCG@3 | ERR@5 | NDCG@5 | ERR@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|---|
| MART | 0.2172* | 0.4149* | 0.2971* | 0.4161* | 0.3213* | 0.4260* | 0.3399* | 0.4479* |
| LambdaMART | 0.2261* | 0.4278* | 0.3061* | 0.4244* | 0.3291* | 0.4305* | 0.3479* | 0.4510* |
| ListNet | 0.2108* | 0.4095* | 0.2845* | 0.3978* | 0.3081* | 0.4072* | 0.3276* | 0.4292* |
| ListNet + SA | 0.2127* | 0.4019* | 0.2904* | 0.4050* | 0.3127* | 0.4104* | 0.3315* | 0.4310* |
| ListNet + RSA | 0.2305 | 0.4386 | 0.3099 | 0.4347 | 0.3319 | 0.4383 | 0.3502 | 0.4568 |

(b) Results on MSLR-WEB10K

| Algorithm | ERR@1 | NDCG@1 | ERR@3 | NDCG@3 | ERR@5 | NDCG@5 | ERR@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|---|
| MART | 0.2217* | 0.4364+ | 0.3032* | 0.4259 | 0.3276 | 0.4352 | 0.3468* | 0.4574* |
| LambdaMART | 0.2395* | 0.4580+ | 0.3213* | 0.4461 | 0.3442* | 0.4512* | 0.3630* | 0.4711* |
| ListNet | 0.2288* | 0.4289* | 0.3072* | 0.4241* | 0.3295* | 0.4290* | 0.3481* | 0.4489* |
| ListNet + SA | 0.2270* | 0.4286* | 0.3040* | 0.4215* | 0.3269* | 0.4272* | 0.3460* | 0.4492* |
| ListNet + RSA | 0.2494 | 0.4643 | 0.3264 | 0.4524 | 0.3489 | 0.4568 | 0.3669 | 0.4775 |

(c) Results on MSLR-WEB30K

| Algorithm | ERR@1 | NDCG@1 | ERR@3 | NDCG@3 | ERR@5 | NDCG@5 | ERR@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|---|
| MART | 0.5563* | 0.6251* | 0.6739* | 0.6054* | 0.6922* | 0.6334* | 0.6998* | 0.7038* |
| LambdaMART | 0.5868* | 0.6579* | 0.6983* | 0.6309* | 0.7146* | 0.6561* | 0.7213* | 0.7193* |
| ListNet | 0.5855* | 0.6566* | 0.6981* | 0.6316* | 0.7141* | 0.6577* | 0.7203* | 0.7190* |
| ListNet + SA | 0.5921* | 0.6633* | 0.7025* | 0.6345* | 0.7189* | 0.6618* | 0.7252* | 0.7232* |
| ListNet + RSA | 0.5986 | 0.6714 | 0.7113 | 0.6490 | 0.7264 | 0.6757 | 0.7322 | 0.7394 |

(d) Results on Istella-S LETOR

| Algorithm | ERR@1 | NDCG@1 | ERR@3 | NDCG@3 | ERR@5 | NDCG@5 | ERR@10 | NDCG@10 |
|---|---|---|---|---|---|---|---|---|
| MART | 0.5629* | 0.6208* | 0.6683* | 0.5667* | 0.6864* | 0.5838* | 0.6946* | 0.6323* |
| LambdaMART | 0.5938* | 0.65436* | 0.6962* | 0.5958* | 0.7123* | 0.6104* | 0.7194* | 0.6574* |
| ListNet | 0.5760* | 0.6327* | 0.6805* | 0.5775* | 0.6964* | 0.5888* | 0.7041* | 0.6317* |
| ListNet + SA | 0.5911* | 0.6499* | 0.6955* | 0.5959* | 0.7108* | 0.6098* | 0.7182* | 0.6556* |
| ListNet + RSA | 0.6035 | 0.6646 | 0.7085 | 0.6153 | 0.7236 | 0.6293 | 0.7302 | 0.6777 |

(e) Results on Istella LETOR

Table 2. Evaluation results. * and + indicate results which are statistically significantly different from the results of ListNet + RSA at $p < 0.01$ and $0.01 \le p < 0.05$ respectively.

Tables 2(a), 2(b), 2(c), 2(d) and 2(e) present our main results on the datasets. We observe that ListNet with an additional self-attention layer does only marginally better than ListNet without a self-attention layer. However, ListNet with regularized self-attention consistently achieves very strong ERR@k and NDCG@k scores at various positions k. In particular, ListNet with regularized self-attention is the single best system in all metrics measured in four out of the five datasets. For example, in MSLR-WEB10K, our model achieves 0.2305 ERR@1, 0.4386 NDCG@1, 0.3502 ERR@10, and 0.4568 NDCG@10, all outperforming the next-best model LambdaMART (which achieved 0.2261 ERR@1, 0.4278 NDCG@1, 0.3479 ERR@10, 0.4510 NDCG@10). This trend holds true for the MSLR-WEB30K, Istella-S, and Istella datasets as well. The only exception where ListNet + RSA does not win on all metrics is the Yahoo LETOR dataset, but even there ListNet + RSA ranks second or third in most cases and still achieves the best NDCG@10.

These consistent improvements confirm that the proposed regularized self-attention mechanism is an effective way to improve learning to rank results.

4.2. Impact of Regularization Terms

Figure 3. Plots of NDCG@10 scores against training epochs on all validation sets. Curves of models with regularization terms are almost always above the curves of models without regularization terms.

Figure 3 shows the plots of NDCG@10 scores against training epochs on all validation sets. As seen in the plots, the curves of models with the regularization terms are almost always above the curves of models without the regularization terms. Further, the former always converge to significantly higher NDCG@10 values than the latter. These phenomena clearly show that our proposed regularization terms are effective at improving the performance of ListNet with a self-attention layer. In fact, models without the regularization terms perform worse than MART and LambdaMART on all datasets.

4.3. Attention Visualization

Figure 4. Top row: attention weight matrices of a model trained with the regularization terms. Bottom row: attention weight matrices of a model trained without the regularization terms.

We sample query and document pairs from the Istella-S dataset and plot the heatmaps of the attention weights of the four different document encoders in Figure 4.

The bottom row of Figure 4 shows attention heatmaps of a model trained without the regularization terms. We are unable to observe any explainable pattern in the attention matrices. From the results of our experiments, self-attention alone is not effective at figuring out attention weights that are useful for modeling document interactions.

In contrast, the top row of the visualization suggests that our model can learn better attention weights with the supervision from the regularization terms: two of the document encoders place more attention weight on rows with higher relevance judgments, while the other two place more attention weight on columns with higher relevance judgments.

4.4. Impact of the Document Encoders

Figure 5. ERR@10 and NDCG@10 scores on the MSLR-WEB10K test set for different document encoders.

We train four separate models, each of which uses only one of the four document encoders. Figure 5 presents our results.

We observe that the first and third document encoders perform better than the second and fourth, despite the fact that the second and fourth are trained to put exponentially higher attention weights on documents based on their relevance judgments. We suspect the regularization terms for the second and fourth encoders are more difficult to optimize than the regularization terms for the first and third.

We also observe that the NDCG@10 score from a model with four document encoders is around 1 to 2.3 points higher than the models with individual document encoders. This shows that ensembling the document encoders is effective at improving results on learning to rank tasks.

5. Related Work

There are two general research directions in learning to rank. In the traditional setting, machine learning algorithms are employed to re-rank documents based on preprocessed feature vectors. In the end-to-end setting, models are designed to extract features and rank documents simultaneously.

5.1. Traditional Learning to Rank

As there can be tens of millions of candidate documents for every query in real-world contexts, information retrieval systems usually employ a two-phase approach. In the first phase, a smaller set of candidate documents is shortlisted from the bigger pool of documents using simpler models such as the vector space model (Salton et al., 1975) and BM25 (Robertson et al., 2009). In the second phase, shortlisted documents are converted into feature vectors and more accurate learning to rank models are used to re-rank the feature vectors. Examples of commonly used features are term frequencies, BM25 scores, URL click rate and length of documents.

Predicting the relevance scores of documents in pointwise approaches can be treated as a regression problem. Popular regression algorithms such as random forests (Breiman, 2001) and stochastic gradient boosting (Friedman, 2002) are often directly used to estimate relevance judgments of documents. Pairwise approaches such as (Adomavicius and Tuzhilin, 2005; Joachims, 2002; Burges et al., 2007) treat learning to rank as a binary classification problem. The common theme in these papers is to learn a classifier which can determine the correct ordering given a pair of documents. Ensemble trees are generally recognized as the strongest systems, e.g. an ensemble of LambdaMART and other lambda-gradient models (Burges et al., 2011) won the Yahoo Learning to Rank challenge (Chapelle and Chang, 2011). Neural networks such as RankNet (Burges et al., 2005) and ListNet (Cao et al., 2007) are also effective.

Optimizing listwise objectives can be difficult. This is because popular IR metrics such as MAP, ERR and NDCG are not differentiable, which means gradient descent based methods cannot be directly used for optimization. Various surrogate loss functions have been proposed over the years. For example, Cao et al. (2007) proposed ListNet, which uses the cross entropy between the permutation probability distributions of the predicted ranking and the ground truth as the loss function. Taylor et al. (2008) proposed SoftRank, which uses smoothed approximations to ranking metrics. Burges (2010) proposed LambdaRank and LambdaMART, which approximate gradients by the directions of swapping two documents, scaled by the change in ranking metrics. Although these loss functions demonstrate varying degrees of success on learning to rank tasks, most of the papers only use them to train global ranking models which predict relevance scores of every document independently. In contrast, our model is designed specifically to model the interdependence between documents. Our ListNet loss function can also be replaced with any of the above existing loss objectives.

More recently, Ai et al. (2018) proposed the DLCM, which uses a recurrent neural network to sequentially encode documents in the order returned by strong baseline learning to rank algorithms such as LambdaMART. The authors find that incorporating local ranking context can further fine-tune the initial results of baseline systems. Unlike DLCM, which relies on the ranking results from other learning to rank algorithms, our model is a self-contained learning to rank algorithm. Therefore, direct comparison between DLCM and our model is not possible.

5.2. End-to-End Learning to Rank

As traditional learning to rank systems rely heavily on handcrafted feature engineering that can be tedious and often incomplete, there is growing interest in end-to-end learning to rank tasks among both NLP and IR researchers. Systems in this category focus on generating feature vectors automatically using deep neural networks, without the need for manual feature extraction.

End-to-end models can be further classified under two broad categories: 1) representation-based models and 2) interaction-based models. Representation-based models try to generate good representations of query and document independently before conducting relevance matching, e.g., (Huang et al., 2013; Hu et al., 2014), while interaction-based models focus on learning local interactions between query text and document text before aggregating the local matching signals, e.g., (Guo et al., 2016; Pang et al., 2017; McDonald et al., 2018).

Since the aforementioned models focus primarily on learning better vector representations of query-document pairs from raw texts, the output representations from those models can be directly fed as inputs to our model, which is designed to learn the interactions among the documents. As end-to-end learning to rank is not the focus of this paper, we will explore end-to-end models in future work.

6. Conclusions

This paper explores the possibility of modeling document interactions with a self-attention mechanism. We show that a self-attentional neural network with properly regularized attention weights can outperform state-of-the-art learning to rank algorithms on publicly available benchmark datasets.

We believe that different interpolation weights in the regularization terms of the loss function in Equation 18 may affect performance; we will explore this in future work. Another line of future work is to combine the idea of self-attention proposed here with the various end-to-end deep learning ranking models proposed in the literature.


  • G. Adomavicius and A. Tuzhilin (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering (6), pp. 734–749. Cited by: §5.1.
  • Q. Ai, K. Bi, J. Guo, and W. B. Croft (2018) Learning a deep listwise context model for ranking refinement. In Proceedings of SIGIR ’18. Cited by: §5.1.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. stat 1050, pp. 21. Cited by: §2.3.
  • L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §5.1.
  • C. J. Burges, R. Ragno, and Q. V. Le (2007) Learning to rank with nonsmooth cost functions. In Advances in neural information processing systems, pp. 193–200. Cited by: §5.1.
  • C. J. Burges (2010) From ranknet to lambdarank to lambdamart: an overview. Cited by: §3.2, §5.1.
  • C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. N. Hullender (2005) Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine learning (ICML-05), pp. 89–96. Cited by: §5.1.
  • C. Burges, K. Svore, P. Bennett, A. Pastusiak, and Q. Wu (2011) Learning to rank using an ensemble of lambda-gradient models. In Proceedings of the Learning to Rank Challenge, pp. 25–35. Cited by: §5.1.
  • Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pp. 129–136. Cited by: §2.1, §5.1, §5.1.
  • O. Chapelle and Y. Chang (2011) Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge, pp. 1–24. Cited by: §3.1, §5.1.
  • O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan (2009) Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management, pp. 621–630. Cited by: §2.1, item 2.
  • J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733. Cited by: §2.2.
  • D. Clevert, T. Unterthiner, and S. Hochreiter (2015) Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289. Cited by: §2.3.
  • D. Dato, C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, N. Tonellotto, and R. Venturini (2016) Fast ranking with additive ensembles of oblivious and non-oblivious regression trees. ACM Transactions on Information Systems (TOIS) 35 (2), pp. 15. Cited by: §3.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.2.
  • J. H. Friedman (2002) Stochastic gradient boosting. Computational Statistics & Data Analysis 38 (4), pp. 367–378. Cited by: §3.2, §5.1.
  • J. Guo, Y. Fan, Q. Ai, and W. B. Croft (2016) A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 55–64. Cited by: §5.2.
  • B. Hu, Z. Lu, H. Li, and Q. Chen (2014) Convolutional neural network architectures for matching natural language sentences. In Advances in neural information processing systems, pp. 2042–2050. Cited by: §5.2.
  • P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck (2013) Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pp. 2333–2338. Cited by: §5.2.
  • K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS) 20 (4), pp. 422–446. Cited by: §2.1, item 1.
  • T. Joachims (2002) Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 133–142. Cited by: §5.1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §1.
  • Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130. Cited by: §2.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. External Links: 1907.11692 Cited by: §1.
  • R. McDonald, G. Brokos, and I. Androutsopoulos (2018) Deep relevance ranking using enhanced document-query interactions. arXiv preprint arXiv:1809.01682. Cited by: §5.2.
  • L. Pang, Y. Lan, J. Guo, J. Xu, J. Xu, and X. Cheng (2017) Deeprank: a new deep architecture for relevance ranking in information retrieval. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 257–266. Cited by: §5.2.
  • T. Qin and T. Liu (2013) Introducing LETOR 4.0 datasets. CoRR abs/1306.2597. External Links: Link Cited by: §3.1.
  • S. Robertson, H. Zaragoza, et al. (2009) The probabilistic relevance framework: bm25 and beyond. Foundations and Trends® in Information Retrieval 3 (4), pp. 333–389. Cited by: §5.1.
  • G. Salton, A. Wong, and C. Yang (1975) A vector space model for automatic indexing. Communications of the ACM 18 (11), pp. 613–620. Cited by: §5.1.
  • R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: Figure 1, §2.3.
  • M. Taylor, J. Guiver, S. Robertson, and T. Minka (2008) Softrank: optimizing non-smooth rank metrics. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 77–86. Cited by: §5.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §2.2, §2.3.