1. Introduction
Learning to rank has attracted much attention in the research community, where the focus has been on developing loss objectives that effectively optimize information retrieval (IR) metrics. The general idea is to fit a global relevance function on a training set that consists of queries, sets of documents, and their desired rankings. In particular, given a query and a set of documents, existing methods learn a function that gives higher scores to documents that rank better. There are three common approaches:

Pointwise approaches optimize the relevance score of a single document without considering other documents in the same set.

Pairwise approaches optimize the ranking between pairs of documents, such that a document receives a higher score than another document whenever it ranks better.

Listwise approaches directly attempt to optimize the target IR metric (such as MAP or NDCG), which is based on the entire set of document scores.
Importantly, all these approaches focus on the loss objective during the training phase. Whether the objective is pairwise or listwise, the function computes relevance scores for each document independently at test inference time.
We propose to formulate the relevance function based on the set of documents to be ranked. (Footnote 1: The notation will be described more precisely later.) For now, the point is to illustrate the difference between scoring each document independently and scoring it with the full document set as an additional input. Suppose some competing documents are dropped: the independent function will output the same relevance score, whereas the set-aware function will automatically adapt. This is more similar to how humans might rank documents at test time: multiple competing documents are reviewed before a relevance score is assigned to any one of them.
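The contrast between the two formulations can be illustrated with a toy sketch. Both scoring functions below are hypothetical stand-ins invented for illustration (they are not the model proposed in this paper); the only point is that a set-aware score changes when a competing document is removed, while an independent score does not.

```python
def score_independent(doc_feature):
    # Hypothetical pointwise scorer: depends only on the document itself.
    return 2.0 * doc_feature


def score_with_context(doc_feature, all_features):
    # Hypothetical set-aware scorer: the document's signal is compared
    # against the mean signal of all documents in the candidate set.
    mean_feature = sum(all_features) / len(all_features)
    return 2.0 * doc_feature - mean_feature


full_set = [0.9, 0.5, 0.1]
reduced_set = [0.9, 0.5]  # the weakest competing document was dropped

independent_before = score_independent(full_set[0])
independent_after = score_independent(reduced_set[0])
contextual_before = score_with_context(full_set[0], full_set)
contextual_after = score_with_context(reduced_set[0], reduced_set)
```

Here the independent score of the first document is identical in both cases, while the set-aware score drops once the weak competitor is gone and the remaining set is stronger on average.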
Recently, self-attention has been successfully applied to many tasks such as machine translation (Vaswani et al., 2017) and natural language inference (Devlin et al., 2018; Liu et al., 2019; Lan et al., 2019). As self-attention has the capability to establish direct connections among elements within a set, it is a suitable mechanism for modeling interactions among documents. The use of self-attention on a set of documents allows the model to adjust scores based on other competing documents.
However, experimental results on benchmark datasets show that ListNet with self-attention performs only marginally better than ListNet without it. Deeper analyses of attention weights reveal that self-attention alone is not effective at modeling document interactions. In Section 2, we propose regularization terms that push the model towards learning meaningful weights that can better model interactions between documents.
We evaluated our model on the popular Yahoo, MSLR-WEB and Istella LETOR datasets and show that neural networks with properly regularized self-attention weights can significantly outperform existing strong ensemble tree and neural network baselines.
2. Model Description
Given a query $q$, a set of documents $D = \{d_1, \dots, d_n\}$ and a feature extraction function $\phi$, the input to a learning to rank model is a set of feature vectors:
(1) $X = \{x_1, \dots, x_n\}, \quad x_i = \phi(q, d_i)$
We want to model a ranking function $f$ such that:
(2) $f(X) = [f_1, \dots, f_n]$
where $f_i$ is the predicted relevance score for document $d_i$. In this notation, $f$ now has a vector output of dimension $n$, where each element represents the relevance score for a document.
Ideally, we want $f(X)$ to be sorted in the same order as the desired ranking. At test inference time, we compute all relevance scores and then sort the documents according to these scores.
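The inference step described above (score every document, then sort by score) can be sketched as follows. The function name and the example identifiers are illustrative assumptions, not part of the paper.

```python
def rank_documents(doc_ids, scores):
    # Sort document indices by predicted relevance score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [doc_ids[i] for i in order]


ranking = rank_documents(["d1", "d2", "d3"], [0.2, 1.5, -0.3])
```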
2.1. ListNet
Our starting point for modeling is the ListNet (Cao et al., 2007) algorithm. ListNet is a strong neural learning to rank algorithm which optimizes a listwise objective function. Due to the combinatorial nature of ranking tasks, popular metrics such as NDCG (Järvelin and Kekäläinen, 2002) and ERR (Chapelle et al., 2009) are not differentiable with respect to model parameters, and consequently gradient descent based learning algorithms cannot be directly used for optimization. Therefore, ListNet optimizes a surrogate loss function, which is defined below:
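Given predicted relevance judgments $f = [f_1, \dots, f_n]$ and ground truth $R = [R_1, \dots, R_n]$, the top one probability of document $d_i$ based on $f$ is:
(3) $P_f(d_i) = \frac{\exp(f_i)}{\sum_{j=1}^{n} \exp(f_j)}$
and the top one probability of document $d_i$ based on $R$ is:
(4) $P_R(d_i) = \frac{\exp(R_i)}{\sum_{j=1}^{n} \exp(R_j)}$
Loss is defined as the cross entropy between the top one probability distribution of predicted scores and the top one probability distribution of ground truth:
(5) $L_{ListNet}(f, R) = -\sum_{i=1}^{n} P_R(d_i) \log P_f(d_i)$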
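As a minimal sketch of the loss just described, the top one probabilities can be computed with a softmax over the scores, and the loss is the cross entropy between the two resulting distributions. The function names are assumptions made for illustration, and the softmax is applied to the raw relevance labels, which is one common reading of ListNet rather than a confirmed detail of the implementation used here.

```python
import math


def top_one_probabilities(values):
    # Softmax over a list of scores; subtracting the max keeps exp() stable.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(predicted_scores, relevance_labels):
    # Cross entropy between the top one distribution of the ground truth
    # and the top one distribution of the predicted scores.
    p_true = top_one_probabilities([float(r) for r in relevance_labels])
    p_pred = top_one_probabilities(predicted_scores)
    return -sum(pt * math.log(pp) for pt, pp in zip(p_true, p_pred))


labels = [2, 1, 0]
loss_good = listnet_loss([2.0, 1.0, 0.0], labels)  # scores agree with labels
loss_bad = listnet_loss([0.0, 1.0, 2.0], labels)   # scores reverse the labels
```

The loss is smallest when the predicted distribution matches the ground truth distribution, so `loss_good` is lower than `loss_bad`.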
2.2. Self-Attention (SA)
Self-attention is an attention mechanism which learns to represent every element in a set by a weighted sum of the other elements within the same set. Self-attention based neural networks have found success in many NLP tasks such as machine reading, machine translation and sentence representation learning (Lin et al., 2017; Vaswani et al., 2017; Cheng et al., 2016; Devlin et al., 2018).
The input to the self-attention layer is a set of vector representations:
(6) $V = [v_1, \dots, v_n]$
Here $v_i$ is a $d$-dimensional vector representation of the $i$-th document, $i \in \{1, \dots, n\}$. $V$ is an $n \times d$ matrix, which concatenates the vector representations of the $n$ documents. The output of the self-attention layer is:
(7) $SA(V) = \sigma(QK^{T})V$
where $Q = VW_Q$, $K = VW_K$, and $\sigma$ is the sigmoid function. $W_Q$ and $W_K$ are trainable weight matrices.
2.3. ListNet + Self-Attention (SA)
ListNet uses a single layer feed-forward neural network without a bias term or a nonlinear activation function. We improve the original architecture with recent techniques such as layer normalization (Ba et al., 2016), highway connections (Srivastava et al., 2015) and exponential linear units (Clevert et al., 2015). Inspired by the transformer (Vaswani et al., 2017) architecture, we insert a self-attention layer between two feed-forward layers. We will refer to this architecture as the document encoder (DE). Figure 1 shows the architecture of the document encoder.
2.4. ListNet + Regularized Self-Attention (RSA)
We observe that certain document interactions are embedded in the datasets: 1) relative orderings between documents and 2) arithmetic differences in relevance judgments between documents. We hypothesize that this information can provide powerful supervision for learning the self-attention weights. We explore four different document encoders, each of which is supervised by a different regularization term:

The first document encoder enhances vector representations of documents by paying attention to other documents that are more relevant, i.e., for a given document, the attention weight for each other document is:
(8) 
The second document encoder is similar to the first, except that it assigns exponentially higher attention weights to documents with higher relevance judgments:
(9) 
The third document encoder does the opposite of the first. It assigns positive attention weights to documents that are less relevant:
(10) 
The fourth document encoder is similar to the third, except that it assigns exponentially higher attention weights to documents with lower relevance judgments:
(11)
In equations (9) and (11), k refers to the maximum relevance judgment. In this paper, k=4 for all the datasets.
The outputs from the four document encoders are concatenated and then converted to scores via another feedforward layer. The final scores are used to rank the documents.
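To make the idea of relevance-derived attention targets concrete, the sketch below builds target matrices for the two non-exponential encoders. It assumes binary targets (1 where the other document is strictly more, or strictly less, relevant, and 0 otherwise); this is an illustrative assumption, since the exact form of equations (8) and (10) is not reproduced here, and the exponentially scaled variants of equations (9) and (11) are deliberately left out.

```python
def attend_to_more_relevant(labels):
    # Assumed target: entry [i][j] is 1.0 when document j has a strictly
    # higher relevance label than document i, else 0.0.
    n = len(labels)
    return [[1.0 if labels[j] > labels[i] else 0.0 for j in range(n)]
            for i in range(n)]


def attend_to_less_relevant(labels):
    # Assumed target: entry [i][j] is 1.0 when document j has a strictly
    # lower relevance label than document i, else 0.0.
    n = len(labels)
    return [[1.0 if labels[j] < labels[i] else 0.0 for j in range(n)]
            for i in range(n)]


labels = [2, 0, 1]
more_relevant_target = attend_to_more_relevant(labels)
less_relevant_target = attend_to_less_relevant(labels)
```

With labels `[2, 0, 1]`, the most relevant document has an all-zero row in the first matrix (nothing is more relevant than it), and the least relevant document has an all-zero row in the second.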
2.5. Regularization Terms
We introduce regularization terms which encourage the document encoders to learn attention weights close to the values mentioned in equations (8), (9), (10) and (11):
Rewrite equation (7) as:
(12) $SA(V) = AV$
where:
(13) $A = \sigma(QK^{T})$
is the attention matrix of a document encoder, $A \in \mathbb{R}^{n \times n}$.
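The following is a minimal pure-Python sketch of a sigmoid-based self-attention layer that exposes its attention matrix. It assumes that queries and keys are linear projections of the document vectors and that the output is the attention matrix multiplied by the document vectors; the helper names (`matmul`, `attention_matrix`, `self_attention`) and the absence of any scaling factor are assumptions made for illustration.

```python
import math


def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def attention_matrix(V, W_q, W_k):
    # Assumed form: element-wise sigmoid of the query-key dot products.
    Q = matmul(V, W_q)
    K = matmul(V, W_k)
    logits = matmul(Q, transpose(K))
    return [[sigmoid(x) for x in row] for row in logits]


def self_attention(V, W_q, W_k):
    # Returns the n x n attention matrix and the n x d attended output.
    A = attention_matrix(V, W_q, W_k)
    return A, matmul(A, V)


V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three documents, d = 2
identity = [[1.0, 0.0], [0.0, 1.0]]        # placeholder weight matrices
A, output = self_attention(V, identity, identity)
```

Because a sigmoid is applied to each entry independently, the rows of `A` need not sum to one, unlike softmax attention; every entry is a separate value in (0, 1).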
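The regularization terms are defined as the average binary cross entropy between the attention weight matrices and the ideal attention weight matrices defined in equations (8), (9), (10) and (11):
(14) $L_{reg}^{(e)} = -\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ \hat{A}^{(e)}_{ij} \log A^{(e)}_{ij} + (1 - \hat{A}^{(e)}_{ij}) \log (1 - A^{(e)}_{ij}) \right]$
for $e \in \{1, 2, 3, 4\}$, where $A^{(e)}$ is the attention matrix of the $e$-th document encoder and $\hat{A}^{(e)}$ is its ideal attention weight matrix.
The final objective function is the summation of the ListNet loss function and the regularization terms:
(15) $L = L_{ListNet} + \sum_{e=1}^{4} L_{reg}^{(e)}$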
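A hedged sketch of the regularizer and the combined objective follows. It averages an element-wise binary cross entropy over all entries of one attention matrix and then adds the per-encoder terms to the ranking loss without weights. The function names, the clipping constant `eps` and the example numbers are assumptions for illustration only.

```python
import math


def attention_regularizer(attention, target, eps=1e-12):
    # Average binary cross entropy between a predicted attention matrix
    # (entries in (0, 1)) and a target matrix (entries in [0, 1]).
    total = 0.0
    count = 0
    for a_row, t_row in zip(attention, target):
        for a, t in zip(a_row, t_row):
            a = min(max(a, eps), 1.0 - eps)  # avoid log(0)
            total += -(t * math.log(a) + (1.0 - t) * math.log(1.0 - a))
            count += 1
    return total / count


def total_objective(ranking_loss, regularizer_values):
    # Unweighted sum of the ranking loss and all regularization terms.
    return ranking_loss + sum(regularizer_values)


target = [[0.0, 1.0], [0.0, 0.0]]
close = [[0.1, 0.9], [0.1, 0.1]]   # attention near the target
far = [[0.9, 0.1], [0.9, 0.9]]     # attention far from the target
reg_close = attention_regularizer(close, target)
reg_far = attention_regularizer(far, target)
objective = total_objective(1.0, [reg_close, reg_close, reg_far, reg_far])
```

Attention matrices that are closer to their targets incur a smaller penalty, so the gradient of this term pushes each encoder towards its prescribed attention pattern.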
3. Experimental Setup
3.1. Datasets
Dataset          Year  # Features  Type        # Queries (Q)  # Documents (D)  Average # D/Q
Yahoo LETOR      2010  700         Train       19944          473134           23.72
Yahoo LETOR      2010  700         Validation  2994           71083            23.74
Yahoo LETOR      2010  700         Test        6983           165660           23.72
MSLR-WEB10K      2010  136         Train       6000           723412           120.57
MSLR-WEB10K      2010  136         Validation  2000           235259           117.63
MSLR-WEB10K      2010  136         Test        2000           241521           120.76
MSLR-WEB30K      2010  136         Train       18919          2270296          120.00
MSLR-WEB30K      2010  136         Validation  6306           747218           118.49
MSLR-WEB30K      2010  136         Test        6306           753611           119.51
Istella-S LETOR  2016  220         Train       19245          2043304          106.17
Istella-S LETOR  2016  220         Validation  7211           684076           94.87
Istella-S LETOR  2016  220         Test        6562           681250           103.82
Istella LETOR    2016  220         Train       17331          5459701          315.03
Istella LETOR    2016  220         Validation  5888           1865924          316.90
Istella LETOR    2016  220         Test        9799           3129004          319.32
We conduct evaluations on the Yahoo LETOR (Chapelle and Chang, 2011), MSLR-WEB30K (Qin and Liu, 2013) and Istella LETOR (Dato et al., 2016) datasets shown in table 1. We also include results on the MSLR-WEB10K and Istella-S LETOR datasets, which are sampled from MSLR-WEB30K and Istella LETOR respectively.
Due to privacy regulations, all datasets contain only extracted feature vectors; the raw texts of queries and documents are not publicly available. Every dataset has five levels of relevance judgment, from 0 (not relevant) to 4 (highly relevant).
3.2. Baseline Systems and Parameters Tuning
All neural models were implemented with PyTorch (https://pytorch.org/). We also provide results of two strong learning to rank algorithms based on ensembles of regression trees: MART (Friedman, 2002) and LambdaMART (Burges, 2010). We used RankLib (https://sourceforge.net/p/lemur/wiki/RankLib/) to train and evaluate these models and did hyperparameter tuning on the number of trees and the number of leaves per tree. We omit the results of other learning to rank algorithms in RankLib as they perform significantly worse than MART and LambdaMART.
Models with the highest NDCG@10 scores on the validation sets were used to obtain final results on the test sets, and significance tests were conducted using the paired t-test.
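For reference, the paired t-test compares two systems through their per-query score differences. The sketch below computes only the test statistic; how the p-value was obtained and which per-query scores were paired are not stated in the text, so the function name and the example numbers are purely illustrative assumptions.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    # t statistic for paired samples: mean difference divided by its
    # standard error (sample variance with n - 1 in the denominator).
    # Requires at least two pairs and non-identical differences.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / math.sqrt(variance / n)


# Hypothetical per-query NDCG@10 values for two systems.
system_a = [0.5, 0.6, 0.7, 0.8]
system_b = [0.4, 0.6, 0.5, 0.7]
t_value = paired_t_statistic(system_a, system_b)
```

The statistic has n - 1 degrees of freedom, where n is the number of paired queries.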
3.3. Evaluation Metrics
We consider two popular ranking metrics which support multiple levels of relevance judgment:

Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002) sums the relevance judgments (gain) of ranked documents, which are discounted by their positions in the ranking and normalized by the discounted cumulative gain of the ideal document ordering.

Expected Reciprocal Rank (ERR) (Chapelle et al., 2009) measures the expected reciprocal rank at which a user will stop searching.
We report results at positions 1, 3, 5 and 10 for both metrics.
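Both metrics can be sketched in a few lines. The version below assumes the commonly used exponential gain (2^rel - 1), a log2 position discount for NDCG, and the cascade formulation of ERR with a maximum label of 4; these conventions are assumptions, since the exact evaluation settings are not spelled out in the text.

```python
import math


def dcg_at_k(ranked_labels, k):
    # Discounted cumulative gain over the top k positions.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(ranked_labels[:k]))


def ndcg_at_k(ranked_labels, k):
    # DCG normalized by the DCG of the ideal (label-sorted) ordering.
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


def err_at_k(ranked_labels, k, max_label=4):
    # Expected reciprocal rank: the user scans down the list and stops at
    # position r with a probability that grows with the label at r.
    err = 0.0
    p_continue = 1.0
    for pos, rel in enumerate(ranked_labels[:k], start=1):
        p_stop = (2 ** rel - 1) / 2 ** max_label
        err += p_continue * p_stop / pos
        p_continue *= 1.0 - p_stop
    return err


perfect = ndcg_at_k([4, 2, 0], 3)
reversed_order = ndcg_at_k([0, 2, 4], 3)
err_example = err_at_k([0, 4], 2)
```

The inputs are the relevance labels of the documents in the order produced by the ranker, so a perfectly ordered list yields an NDCG of 1.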
4. Results and analyses
4.1. Results





Tables 2(a), 2(b), 2(c), 2(d) and 2(e) present our main results on the datasets. We observe that ListNet with an additional self-attention layer does only marginally better than ListNet without a self-attention layer. However, ListNet with regularized self-attention consistently achieves very strong ERR and NDCG scores at the various cutoff positions. In particular, ListNet with regularized self-attention is the single best system in all metrics measured in four out of the five datasets. For example, on MSLR-WEB10K, our model achieves 0.2305 ERR@1, 0.4386 NDCG@1, 0.3502 ERR@10, and 0.4568 NDCG@10, all outperforming the next-best model, LambdaMART (which achieved 0.2261 ERR@1, 0.4278 NDCG@1, 0.3479 ERR@10, and 0.4510 NDCG@10). This trend holds true for the MSLR-WEB30K, Istella-S, and Istella datasets as well. The only exception where ListNet + RSA does not win on all metrics is the Yahoo LETOR dataset, but even there ListNet + RSA ranks second or third in most cases, and still outperforms the other systems on NDCG@10.
These consistent improvements confirm that the proposed regularized self-attention mechanism is an effective way to improve learning to rank results.
4.2. Impact of Regularization Terms
Figure 3 shows the plots of NDCG@10 scores against training epochs on all validation sets. As seen in the plots, the curves of models with the regularization terms are almost always above the curves of models without the regularization terms. Further, the former always converge to significantly higher NDCG@10 values than the latter. These observations show that our proposed regularization terms are effective at improving the performance of ListNet with a self-attention layer. In fact, models without the regularization terms perform worse than MART and LambdaMART on all datasets.
4.3. Attention Visualization
We sample query and document pairs from the Istella-S dataset and plot the heatmaps of the attention weights of the four different document encoders in figure 4.
The bottom row of figure 4 shows attention heatmaps of a model trained without the regularization terms. We are unable to observe any explainable pattern in the attention matrices. From the results of our experiments, self-attention alone is not effective at figuring out attention weights that are useful for modeling document interactions.
In contrast, the top row of the visualization suggests that our model can learn better attention weights with the supervision from the regularization terms: two of the document encoders place more attention weights on rows with higher relevance judgments, while the other two place more attention weights on columns with higher relevance judgments.
4.4. Impact of the document encoders
We train four separate models, each of which uses only one of the four document encoders. Figure 5 presents our results.
We observe that the document encoders supervised by equations (8) and (10) perform better than those supervised by equations (9) and (11), despite the fact that the latter are trained with exponentially scaled attention weights. We suspect the regularization terms for the exponential encoders are more difficult to optimize than the regularization terms for the other two.
We also observe that the NDCG@10 score from a model with four document encoders is around 1 to 2.3 points higher than the models with individual document encoders. This shows that ensembling the document encoders is effective at improving results on learning to rank tasks.
5. Related Work
There are two general research directions in learning to rank. In the traditional setting, machine learning algorithms are employed to rerank documents based on preprocessed feature vectors. In the end-to-end setting, models are designed to extract features and rank documents simultaneously.
5.1. Traditional Learning to Rank
As there can be tens of millions of candidate documents for every query in real-world contexts, information retrieval systems usually employ a two-phase approach. In the first phase, a smaller set of candidate documents is shortlisted from the bigger pool of documents using simpler models such as the vector space model (Salton et al., 1975) and BM25 (Robertson et al., 2009). In the second phase, shortlisted documents are converted into feature vectors and more accurate learning to rank models are used to rerank the feature vectors. Examples of commonly used features are term frequencies, BM25 scores, URL click rate and length of documents.
Predicting the relevance scores of documents in pointwise approaches can be treated as a regression problem. Popular regression algorithms such as random forests (Breiman, 2001) and stochastic gradient boosting (Friedman, 2002) are often directly used to estimate relevance judgments of documents. Pairwise approaches such as (Adomavicius and Tuzhilin, 2005; Joachims, 2002; Burges et al., 2007) treat learning to rank as a binary classification problem. The common theme in these papers is to learn a classifier which can determine the correct ordering given a pair of documents. Ensemble trees are generally recognized as the strongest systems, e.g. an ensemble of LambdaMART and other lambda-gradient models (Burges et al., 2011) won the Yahoo Learning to Rank challenge (Chapelle and Chang, 2011). Neural networks such as RankNet (Burges et al., 2005) and ListNet (Cao et al., 2007) are also effective.
Optimizing listwise objectives can be difficult. This is because popular IR metrics such as MAP, ERR and NDCG are not differentiable, which means gradient descent based methods cannot be directly used for optimization. Various surrogate loss functions have been proposed over the years. For example, Cao et al. (2007) proposed ListNet, which uses the cross entropy between the permutation probability distributions of the predicted ranking and the ground truth as the loss function. Taylor et al. (2008) proposed SoftRank, which uses smoothed approximations to ranking metrics. Burges (2010) proposed LambdaRank and LambdaMART, which approximate gradients by the directions of swapping two documents, scaled by the change in ranking metrics. Although these loss functions demonstrate various degrees of success on learning to rank tasks, most of the papers only use them to train global ranking models which predict relevance scores of every document independently. In contrast, our model is designed specifically to model the interdependence between documents. Our ListNet loss function can also be replaced with any of the above existing loss objectives.
More recently, Ai et al. (2018) proposed the DLCM, which uses a recurrent neural network to sequentially encode documents in the order returned by strong baseline learning to rank algorithms such as LambdaMART. The authors find that incorporating local ranking context can further fine-tune the initial results of baseline systems. Unlike DLCM, which relies on the ranking results from other learning to rank algorithms, our model is a self-contained learning to rank algorithm. Therefore, direct comparison between DLCM and our model is not possible.
5.2. End-to-end learning to rank
As traditional learning to rank systems rely heavily on handcrafted feature engineering that can be tedious and often incomplete, there is growing interest in end-to-end learning to rank tasks among both NLP and IR researchers. Systems in this category focus on generating feature vectors automatically using deep neural networks, without the need for manual feature extraction.
End-to-end models can be further classified under two broad categories: 1) representation-based models and 2) interaction-based models. Representation-based models try to generate good representations of the query and the document independently before conducting relevance matching, e.g., (Huang et al., 2013; Hu et al., 2014), while interaction-based models focus on learning local interactions between query text and document text before aggregating the local matching signals, e.g., (Guo et al., 2016; Pang et al., 2017; McDonald et al., 2018).
Since the aforementioned models focus primarily on learning better vector representations of query-document pairs from raw texts, the output representations from those models can be directly fed as inputs to our model, which is designed to learn the interactions among the documents. As end-to-end learning to rank is not the focus of this paper, we will explore end-to-end models in future work.
6. Conclusions
This paper explores the possibility of modeling document interactions with a self-attention mechanism. We show that a self-attentional neural network with properly regularized attention weights can outperform state-of-the-art learning to rank algorithms on publicly available benchmark datasets.
We believe that different interpolation weights in the regularization terms of the loss function in Equation 15 may affect performance; we will explore this in future work. Another line of future work is to combine the idea of self-attention proposed here with the various end-to-end deep learning ranking models proposed in the literature.
References
Adomavicius and Tuzhilin (2005). Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering (6), pp. 734–749.
Ai et al. (2018). Learning a deep listwise context model for ranking refinement. In Proceedings of SIGIR ’18.
Ba et al. (2016). Layer normalization. stat 1050, pp. 21.
Breiman (2001). Random forests. Machine learning 45 (1), pp. 5–32.
Burges et al. (2007). Learning to rank with nonsmooth cost functions. In Advances in neural information processing systems, pp. 193–200.
Burges (2010). From ranknet to lambdarank to lambdamart: an overview.
Burges et al. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine learning (ICML05), pp. 89–96.
Burges et al. (2011). Learning to rank using an ensemble of lambda-gradient models. In Proceedings of the Learning to Rank Challenge, pp. 25–35.
Cao et al. (2007). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pp. 129–136.
Chapelle and Chang (2011). Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge, pp. 1–24.
Chapelle et al. (2009). Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management, pp. 621–630.
Cheng et al. (2016). Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.
Clevert et al. (2015). Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289.
Dato et al. (2016). Fast ranking with additive ensembles of oblivious and non-oblivious regression trees. ACM Transactions on Information Systems (TOIS) 35 (2), pp. 15.
Devlin et al. (2018). Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Friedman (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis 38 (4), pp. 367–378.
Guo et al. (2016). A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 55–64.
Hu et al. (2014). Convolutional neural network architectures for matching natural language sentences. In Advances in neural information processing systems, pp. 2042–2050.
Huang et al. (2013). Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pp. 2333–2338.
Järvelin and Kekäläinen (2002). Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS) 20 (4), pp. 422–446.
Joachims (2002). Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 133–142.
Lan et al. (2019). ALBERT: a lite bert for self-supervised learning of language representations. arXiv:1909.11942.
Lin et al. (2017). A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
Liu et al. (2019). RoBERTa: a robustly optimized bert pretraining approach. arXiv:1907.11692.
McDonald et al. (2018). Deep relevance ranking using enhanced document-query interactions. arXiv preprint arXiv:1809.01682.
Pang et al. (2017). Deeprank: a new deep architecture for relevance ranking in information retrieval. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 257–266.
Qin and Liu (2013). Introducing LETOR 4.0 datasets. CoRR abs/1306.2597.
Robertson et al. (2009). The probabilistic relevance framework: bm25 and beyond. Foundations and Trends® in Information Retrieval 3 (4), pp. 333–389.
Salton et al. (1975). A vector space model for automatic indexing. Communications of the ACM 18 (11), pp. 613–620.
Srivastava et al. (2015). Highway networks. arXiv preprint arXiv:1505.00387.
Taylor et al. (2008). Softrank: optimizing non-smooth rank metrics. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 77–86.
Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.