Self-Attentive Document Interaction Networks for Permutation Equivariant Ranking

10/21/2019 ∙ by Rama Kumar Pasumarthi, et al.

How to leverage cross-document interactions to improve ranking performance is an important topic in information retrieval (IR) research. However, this topic has not been well-studied in the learning-to-rank setting and most of the existing work still treats each document independently while scoring. The recent development of deep learning shows strength in modeling complex relationships across sequences and sets. It thus motivates us to study how to leverage cross-document interactions for learning-to-rank in the deep learning framework. In this paper, we formally define the permutation-equivariance requirement for a scoring function that captures cross-document interactions. We then propose a self-attention based document interaction network and show that it satisfies the permutation-equivariant requirement, and can generate scores for document sets of varying sizes. Our proposed methods can automatically learn to capture document interactions without any auxiliary information, and can scale across large document sets. We conduct experiments on three ranking datasets: the benchmark Web30k, a Gmail search, and a Google Drive Quick Access dataset. Experimental results show that our proposed methods are both more effective and efficient than baselines.




1. Introduction

Ranking is a central problem in many applications of information retrieval (IR) such as search, recommender systems, and question answering. The purpose of a ranking algorithm is to sort a set of items into a ranked list such that the utility of the entire list is maximized. For example, in search, a set of documents is ranked to answer a user’s query. The utility of the entire list depends heavily on the top-ranked documents.

Learning-to-rank employs machine learning techniques to solve ranking problems. The common formulation is to find a function that produces scores for the list of documents of a query. The scores can then be used to sort the documents. Many early approaches to learning-to-rank cast a ranking problem as regression or classification (Burges, 2010; Joachims, 2006). In such methods, the loss function being minimized incurs a cost for an incorrect prediction of relevance labels (“pointwise” loss) or pairwise preferences (“pairwise” loss). Such formulations are, however, misaligned with the ranking objective, where the utility is often defined over the entire list of documents. Indeed, the so-called “listwise” methods that optimize a loss function defined over the entire list have been shown to learn better ranking functions (Burges et al., 2006; Wu et al., 2010; Cao et al., 2007).

While much research has been devoted to the evolution of loss functions, the nature of the learned scoring function has largely remained the same: a univariate scoring function that computes a relevance score for a document in isolation. How to capture cross-document interactions is the motivation behind several previous works (Diaz, 2007; Qin et al., 2008; Dehghani et al., 2017; Ai et al., 2018; Bello et al., 2018; Ai et al., 2019). Early methods such as the score regularization technique (Diaz, 2007) and the conditional random field based models (Qin et al., 2008) use the similarity between documents to smooth or regulate ranking scores. These methods, however, assume the existence of document similarity information from another source such as document clusters. More recently, neural learning-to-rank algorithms (Ai et al., 2018; Bello et al., 2018) and click models (Borisov et al., 2016) capture document interactions using recurrent neural networks over document lists. These methods, however, belong to the re-ranking setting because they assume that the input is an ordered list, not a set.

Another work that investigates the effect of document interactions on ranking quality is RankProb (Dehghani et al., 2017). It is a bivariate neural scoring function that takes a pair of documents as input and predicts the preference of one over the other. More recently, a more general framework (Ai et al., 2019) was proposed to learn multivariate “groupwise” scoring functions (GSFs). Although both methods can model document interactions, they are highly inefficient at inference time. These models also suffer from a training-serving discrepancy: the function learned during training is different from the scoring function used in serving. For example, RankProb learns a bivariate function during training but uses average pooling over that function as the scoring function during serving. For higher-order interaction models (such as GSFs), the pooling is over an intractable number of permutations, and hence approximation via sub-sampling is used, which worsens the training-serving discrepancy and makes the inference unstable.

In this paper, we identify a generic requirement for scoring functions with document interactions: permutation-equivariance. We analyze the existing approaches with respect to this requirement. Based on this requirement, we propose a class of neural network models and show that they not only satisfy this requirement precisely, but are also more efficient in modeling document interactions and do not have the training-serving discrepancy. Our proposed method is based on self-attention at the document level, which naturally captures cross-document interactions. To the best of our knowledge, our work is the first to use self-attention to model document interactions for learning-to-rank.

Our contributions can be summarized as follows:

  • We propose the permutation-equivariance requirement for any document interaction model and analyze existing methods with respect to this requirement.

  • We identify a generic class of permutation-equivariant functions, instantiate it using a self-attentive document interaction network, and incorporate it into learning-to-rank.

  • We empirically demonstrate the effectiveness and efficiency of our proposed methods on both search and recommendation tasks using three data sets.

This paper is organized as follows. We begin with a review of the literature in Section 2, and formalize the problem we wish to study in Section 3. In Section 4, we present a detailed description of our proposed methods. We examine the effectiveness of our methods empirically and summarize our findings in Section 5. Finally, we conclude this work and discuss future directions in Section 6.

2. Related Work

In the learning-to-rank literature, a common approach is “score and sort”. For capturing the loss between the list of scores for documents and relevance labels, pointwise (Fuhr, 1989; Chu and Ghahramani, 2005), pairwise (Burges et al., 2005; Burges, 2010) and listwise (Xia et al., 2008b, a; Bruch et al., 2019a) losses have been extensively studied. Scoring functions have been parameterized by boosted decision trees (Ke et al., 2017), SVMs (Joachims, 2002), and neural networks (Burges, 2010; Pang et al., 2017; Pasumarthi et al., 2019).

In the context of scoring query-document pairs, the recent neural ranking models have been broadly classified (Guo et al., 2019) into two categories: representation-focused and interaction-focused. The representation-focused methods (Huang et al., 2013; Pang et al., 2017, 2016) learn optimal vector space representations for queries and documents, and then combine them using dot product or cosine similarity. The interaction-focused methods learn a joint representation based on interaction networks between queries and documents. These approaches, along with hybrid variants that mix representation and interaction (Mitra et al., 2017), are univariate approaches, i.e., they score a single query-document pair and do not capture cross-document interactions. Note that the attention mechanism (Romeo et al., 2016) has also been explored in this line of work, but mainly at the word or paragraph level, not the document level.

Recent work on modeling document interactions in learning-to-rank has focused on the re-ranking scenario (Ai et al., 2018; Bello et al., 2018; Pei et al., 2019), where the input is an ordered list of documents, not a set. These methods are not applicable to full set ranking, which is the focus of our work. Score regularization (Diaz, 2007) and a CRF approach (Qin et al., 2008), both of which use document cluster information to augment the training loss, have also been explored; they are complementary to our proposed approach.

3. Problem Formulation

In this section, we formulate our problem in the learning-to-rank setting.

3.1. Learning-to-Rank

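Learning-to-rank solves ranking problems using machine learning techniques. In such a setting, the training data consists of a set of queries, each with a list of documents that we would like to rank. Formally, let $\Psi = \{(q, \mathbf{d}, \mathbf{y})\}$ be a training data set where $q$ is a query, $\mathbf{d}$ is the list of documents for $q$, and $\mathbf{y}$ is the list of relevance labels for $\mathbf{d}$. We use $d_i$ and $y_i$ to refer to the $i$-th elements in $\mathbf{d}$ and $\mathbf{y}$ respectively. A scoring function $f$ takes both $q$ and $\mathbf{d}$ as input and computes a vector of scores $\hat{\mathbf{y}}$:

$$\hat{\mathbf{y}} = f(q, \mathbf{d}) \tag{1}$$

A loss function for query $q$ can be defined between the predicted scores and the labels:

$$l(\mathbf{y}, \hat{\mathbf{y}})$$

The goal of a learning-to-rank task is to find a scoring function $f$ that minimizes the empirical loss over the training data:

$$L(f) = \frac{1}{|\Psi|} \sum_{(q, \mathbf{d}, \mathbf{y}) \in \Psi} l(\mathbf{y}, f(q, \mathbf{d})) \tag{2}$$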

Typical examples of the hypothesis space for a scoring function $f$ are support vector machines (Joachims, 2006; Joachims et al., 2017), boosted weak learners (Xu and Li, 2007), gradient-boosted trees (Friedman, 2001; Burges, 2010), and neural networks (Burges et al., 2005).

3.2. Ranking Loss Functions

Given a formulation of the scoring function, there are various definitions of ranking loss functions (Liu, 2009). In this paper, we focus on the following two listwise loss functions, as they have been shown to be closely related to the ranking metrics (Bruch et al., 2019b; Qin et al., 2010a; Bruch et al., 2019a). The first one is the Softmax Cross-Entropy loss (denoted as Softmax), which has been shown to be a proper bound of ranking metrics over binary relevance labels, such as MRR (Bruch et al., 2019b):

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i=1}^{n} y_i \log \frac{e^{\hat{y}_i}}{\sum_{j=1}^{n} e^{\hat{y}_j}} \tag{3}$$

where $n$ is the number of documents in the list and the subscripts $i$ and $j$ denote the $i$-th and $j$-th elements of the label vector $\mathbf{y}$ or the score vector $\hat{\mathbf{y}}$.
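The Softmax loss can be illustrated with a minimal sketch. The function name and the toy labels and scores below are assumptions made for this example, not part of our implementation; the only non-obvious step is subtracting the maximum score before exponentiating, which is a standard numerical-stability trick.

```python
import math

def softmax_cross_entropy(labels, scores):
    """Listwise softmax cross-entropy: -sum_i y_i * log(softmax(scores)_i)."""
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -sum(y * (s - log_z) for y, s in zip(labels, scores))

labels = [0.0, 1.0, 0.0]                                   # hypothetical binary relevance labels
good = softmax_cross_entropy(labels, [0.1, 2.0, -1.0])     # relevant document scored highest
bad = softmax_cross_entropy(labels, [2.0, 0.1, -1.0])      # relevant document scored second
```

In this toy list the loss is lower when the clicked document receives the highest score, which is the behavior that ties the loss to metrics such as MRR.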

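The second one is the Approximate NDCG loss (denoted as ApproxNDCG) (Qin et al., 2010b; Bruch et al., 2019a). It is more suitable for graded relevance labels and is derived from the NDCG metric, but uses scores to approximate the ranks so that the objective is smooth:

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{\mathrm{DCG}(\pi^*, \mathbf{y})} \sum_{i=1}^{n} \frac{2^{y_i} - 1}{\log_2\big(1 + \pi_{\hat{\mathbf{y}}}(i)\big)} \tag{4}$$

where $1/\mathrm{DCG}(\pi^*, \mathbf{y})$ is the normalization term of NDCG, with $\pi^*$ the ideal ranking, and $\pi_{\hat{\mathbf{y}}}(i)$ is the approximate rank defined as

$$\pi_{\hat{\mathbf{y}}}(i) = 1 + \sum_{j \neq i} \frac{1}{1 + e^{-\alpha(\hat{y}_j - \hat{y}_i)}}$$

where $\alpha$ is the parameter that controls the closeness of the approximation. When $\pi_{\hat{\mathbf{y}}}(i)$ is replaced by the rank $\pi(i)$ obtained by sorting on the scores $\hat{\mathbf{y}}$, Equation 4 becomes the NDCG metric (up to the sign). A larger $\alpha$ makes $\pi_{\hat{\mathbf{y}}}(i)$ closer to $\pi(i)$, but it also makes ApproxNDCG less smooth and thus harder to optimize. We tune $\alpha$ in our experiments and fix it at the value that gives the best results.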
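To make the role of the smoothing parameter concrete, the following sketch computes sigmoid-based approximate ranks and the resulting ApproxNDCG loss. It assumes the common gain $2^{y} - 1$ and discount $\log_2(1 + \text{rank})$; the function names, the toy list, and the two values of `alpha` are illustrative assumptions, not the settings used in our experiments.

```python
import math

def approx_ranks(scores, alpha):
    """Smooth rank of each item: 1 + sum_{j != i} sigmoid(alpha * (s_j - s_i))."""
    ranks = []
    for i, s_i in enumerate(scores):
        r = 1.0
        for j, s_j in enumerate(scores):
            if j != i:
                r += 1.0 / (1.0 + math.exp(-alpha * (s_j - s_i)))
        ranks.append(r)
    return ranks

def dcg(labels, ranks):
    """DCG with gain 2^y - 1 and discount log2(1 + rank)."""
    return sum((2.0 ** y - 1.0) / math.log2(1.0 + r) for y, r in zip(labels, ranks))

def approx_ndcg_loss(labels, scores, alpha):
    """Negative smoothed NDCG; assumes at least one label is non-zero."""
    ideal = sorted(labels, reverse=True)
    ideal_dcg = dcg(ideal, range(1, len(labels) + 1))
    return -dcg(labels, approx_ranks(scores, alpha)) / ideal_dcg

labels = [2.0, 0.0, 1.0]                   # hypothetical graded relevance labels
scores = [3.0, 1.0, 2.0]                   # scores that already order the list perfectly
soft = approx_ranks(scores, alpha=1.0)     # smooth, but far from integer ranks
sharp = approx_ranks(scores, alpha=50.0)   # very close to the true ranks 1, 3, 2
```

With the larger `alpha` the approximate ranks nearly coincide with the true ranks and the loss approaches the negated NDCG, while the smaller `alpha` gives a smoother but looser surrogate.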

3.3. Permutation-Equivariance Requirement

Our focus in this paper is on scoring functions. We postulate that it is preferable that the scoring function is permutation equivariant, so that the resulting ranking will be independent of the original ordering induced over the items by the candidate generation process (e.g., a base ranker, or a retrieval mechanism). This ensures that the learned ranker will not be biased by any errors in the candidate generation process. We first give the general mathematical definition of permutation-equivariant functions.

Definition 3.1 (Permutation-Equivariant Functions).

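Let $\mathbf{x} \in \mathcal{X}^n$ be a vector of $n$ elements $x_i$, where $x_i \in \mathcal{X}$, and let $\pi$ be a permutation of the indices $\{1, \ldots, n\}$, so that $\pi(\mathbf{x}) = [x_{\pi(1)}, \ldots, x_{\pi(n)}]$. A function $f: \mathcal{X}^n \to \mathcal{Y}^n$ is permutation-equivariant iff

$$f(\pi(\mathbf{x})) = \pi(f(\mathbf{x}))$$

That is, a permutation applied to the input vector will result in the same permutation applied to the output of the function.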
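The definition can be checked mechanically on small inputs. The sketch below is only an illustration: the helper names and the two toy functions are assumptions for this example. An element-wise function commutes with any reordering, whereas a function that depends on the input position (as a scorer biased by the candidate-generation order would) does not.

```python
def permute(items, perm):
    """Apply a permutation: position i of the result holds items[perm[i]]."""
    return [items[p] for p in perm]

def is_equivariant(f, x, perm):
    """Check f(pi(x)) == pi(f(x)) for one input and one permutation."""
    return f(permute(x, perm)) == permute(f(x), perm)

def elementwise_double(x):
    # Treats every element independently, so it commutes with any reordering.
    return [2 * v for v in x]

def position_weighted(x):
    # Depends on the input position, so reordering the input changes the values.
    return [(i + 1) * v for i, v in enumerate(x)]

x = [3, 1, 2]
perm = [2, 0, 1]
equivariant_ok = is_equivariant(elementwise_double, x, perm)
position_ok = is_equivariant(position_weighted, x, perm)
```

A single passing check does not prove equivariance for all inputs, but a single failing check, as for `position_weighted`, is enough to rule it out.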

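For a scoring function $f(q, \mathbf{d})$, the input domain is defined by the representation of $q$ and $\mathbf{d}$ (e.g., $\mathbb{R}^{k_q} \times \mathbb{R}^{n \times k_d}$, where $k_q$ and $k_d$ are the dimensions of their vector representations) and the output domain is $\mathbb{R}^n$. It is permutation-equivariant iff

$$f(q, \pi(\mathbf{d})) = \pi(f(q, \mathbf{d}))$$

We analyze some existing work in terms of this requirement.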

The vast majority of learning-to-rank algorithms assume a univariate scoring function that computes a score for each document independently of the others. With slight abuse of notation, we also use $f$ to represent the scoring function on each individual document:

$$f(q, \mathbf{d})|_i = f(q, d_i) \tag{5}$$

where $d_i$ is an individual document in the list $\mathbf{d}$ and $f(q, \mathbf{d})|_i$ is the $i$-th value of the score vector $\hat{\mathbf{y}}$. A univariate scoring function is permutation-equivariant because

$$f(q, \pi(\mathbf{d}))|_i = f(q, d_{\pi(i)}) = \pi(f(q, \mathbf{d}))|_i$$

The Groupwise Scoring Functions (GSFs) (Ai et al., 2019) reduce to univariate scoring functions when the group size is 1. A larger group size is needed to model cross-document interactions. For example, for groups of size 2, the scoring of the $i$-th document is:

$$f(q, \mathbf{d})|_i = \sum_{j \neq i} \big( g(q, d_i, d_j)|_1 + g(q, d_j, d_i)|_2 \big) \tag{6}$$

where $g$ is the sub-scoring function in GSF, with $g(\cdot)|_k$ denoting its output for the $k$-th document of the group, and is implemented using feed forward networks. Higher-order interactions are explicitly captured when the group size is larger, but it becomes impractical to implement a GSF precisely due to the combinatorial number of groups. Monte Carlo sampling methods are used to approximate the pooling, and this can make GSFs unstable. In this sense, GSFs are only approximately permutation-equivariant.

The RankProb approach in (Dehghani et al., 2017) trains a bivariate interaction scoring function $g(q, d_i, d_j)$ by concatenating the features of the two documents as the input for a feed forward network. The loss function is a logistic regression on the pairwise preference of the two documents. For inference, it uses the average pooling in Equation 6. This model is similar to the GSFs with group size 2, and it has a training-serving discrepancy. The average pooling makes the scoring function permutation-equivariant, but directly using it has $O(n^2)$ time complexity, which is not scalable.
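The cost of exact pairwise pooling can be seen in a small sketch. The bivariate function `toy_preference` below is a hypothetical stand-in for a learned network, and the scalar document features are invented for illustration; the point is only that scoring $n$ documents requires on the order of $n^2$ calls, and that the pooled scores follow the documents when the input is reordered.

```python
def pairwise_pooled_scores(docs, g):
    """Score each document by averaging a bivariate function over all partners.

    Assumes at least two documents.
    """
    n = len(docs)
    scores = []
    for i in range(n):                      # n outer iterations ...
        total = 0.0
        for j in range(n):                  # ... times n inner ones: O(n^2) calls to g
            if j != i:
                total += g(docs[i], docs[j])
        scores.append(total / (n - 1))
    return scores

def toy_preference(a, b):
    # Hypothetical stand-in for a learned bivariate network: prefers larger feature values.
    return a - b

docs = [0.2, 0.9, 0.5]                      # one hypothetical scalar feature per document
scores = pairwise_pooled_scores(docs, toy_preference)
shuffled = pairwise_pooled_scores([docs[2], docs[0], docs[1]], toy_preference)
```

Here `shuffled` contains the same scores as `scores`, reordered in the same way as the input, which is the permutation-equivariance property; the quadratic loop is what makes this exact form expensive for long lists.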

Figure 1. Self-Attentive Document Interaction Network.

4. Proposed Methods

In this section, we first present a general class of permutation-equivariant functions and outline how we build a permutation equivariant scoring function using deep neural networks for our proposed approach.

4.1. A Class of Permutation Equivariant Functions

Our permutation-equivariant functions are based on permutation-invariant functions. We start with the formal definition of permutation-invariant functions.

Definition 4.1 (Permutation-Invariant Functions).

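Let $\mathbf{x} \in \mathcal{X}^n$ be a vector of $n$ elements $x_i$, where $x_i \in \mathcal{X}$, and let $\pi$ be a permutation of the indices of $\mathbf{x}$. A function $h: \mathcal{X}^n \to \mathcal{Y}$ is permutation-invariant iff

$$h(\pi(\mathbf{x})) = h(\mathbf{x})$$

That is, any permutation of the input has the same output.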

The work in (Zaheer et al., 2017) provides a general characterization of permutation-invariant functions as follows:

Theorem 4.2.

A function $h(\mathbf{x})$ is permutation-invariant iff it can be decomposed in the form $\rho\big(\sum_{i} \phi(x_i)\big)$, for a suitable choice of $\rho$ and $\phi$.

Though simple, Theorem 4.2 is not very constructive. Ilse et al. (Ilse et al., 2018) proposed a mechanism to extend the form in Theorem 4.2 (called a pooling function) to a weighted pooling, based on the attention mechanism. We shall refer to this as the attention pooling function. Given a generic context $c$, a pooling function can be extended to attention pooling as follows:

$$h(\mathbf{x}; c) = \rho\Big(\sum_{i} a(c, x_i)\, \phi(x_i)\Big) \tag{7}$$

Here, $a(c, x_i)$ is the popular attention mechanism, which is used to capture the similarity between the context $c$ and the item $x_i$.
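As an illustration of attention pooling, the sketch below weights each item by a softmax over its dot product with the context and sums the weighted items. The choice of dot-product softmax weights, the identity defaults for the inner and outer functions, and all names and toy vectors are assumptions for this example.

```python
import math

def attention_pool(context, items, phi=lambda v: v, rho=lambda v: v):
    """Pool items as rho(sum_i a(context, x_i) * phi(x_i)) with softmax dot-product weights."""
    logits = [sum(c * x for c, x in zip(context, item)) for item in items]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(phi(items[0]))
    pooled = [0.0] * dim
    for w, item in zip(weights, items):
        for k, v in enumerate(phi(item)):
            pooled[k] += w * v
    return rho(pooled)

context = [1.0, 0.0]                                   # hypothetical context vector
items = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]           # hypothetical item vectors
pooled = attention_pool(context, items)
pooled_reversed = attention_pool(context, list(reversed(items)))
```

Because the weights depend only on each item and the context, and the weighted items are summed, reordering the items leaves the result unchanged up to floating-point rounding.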

The class of permutation-equivariant functions in this paper is based on self-attention (Lin et al., 2017). The key idea is to instantiate the context $c$ by an item $x_i$ in $\mathbf{x}$. Based on Equation 7, we form a function $f$ that can be verified to be permutation-equivariant as follows:

$$f(\mathbf{x})|_i = \rho\Big(\sum_{j} a(x_i, x_j)\, \phi(x_j)\Big) \tag{8}$$

4.2. Self-Attentive Document Interaction Networks

We instantiate the permutation-equivariant functions using the scaled dot-product attention proposed in the work on Transformer (Vaswani et al., 2017).

4.2.1. Self-Attention Layers

The attention layer in Transformer is defined based on three matrices, $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, and $V \in \mathbb{R}^{m \times d_v}$, where $d_k$ is the dimension of keys in the $K$ matrix, as follows:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\Big(\frac{Q K^\top}{\sqrt{d_k}}\Big) V \tag{9}$$

The output of the attention is a matrix in $\mathbb{R}^{n \times d_v}$. The self-attention is a special case of the attention where we use $Q = K = V = X$. In our setting, we implement each row of $X$ as the concatenation of the vector representation of $q$ and each $d_i$. The self-attention is permutation-equivariant: each row of $X$ plays the role of $x_i$, and the matrix of softmax attention weights plays the role of $a$ in Equation 8. Similar to the work on Transformer (Vaswani et al., 2017), we use layer normalization (Ba et al., 2016) and residual connections (He et al., 2016) over the output of the self-attention, and these operations form the outer function $\rho$ in Equation 8.
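A minimal single-head sketch of scaled dot-product self-attention is given below. It deliberately omits learned projections, layer normalization, and residual connections, and the feature rows are invented for illustration, so it should be read as a check of the equivariance property rather than as our model.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def self_attention(x):
    """Scaled dot-product attention with Q = K = V = x (a list of n row vectors)."""
    d_k = len(x[0])
    out = []
    for q_row in x:
        # Similarity of this row to every row, scaled by sqrt(d_k).
        logits = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in x]
        weights = softmax(logits)
        # Weighted average of the value rows.
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, x)) for c in range(d_k)])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical query-document feature rows
perm = [2, 0, 1]
out = self_attention(x)
out_perm = self_attention([x[p] for p in perm])
```

Row `i` of `out_perm` matches row `perm[i]` of `out` up to floating-point rounding: permuting the input rows permutes the output rows in the same way, while each output row still depends on every document in the list.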

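Furthermore, we use the multi-headed attention mechanism, which allows the model to learn multiple attention weights per layer:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O, \quad \mathrm{head}_h = \mathrm{Attn}(Q W_h^Q, K W_h^K, V W_h^V) \tag{10}$$

where the matrices $W_h^Q$, $W_h^K$, and $W_h^V$ are the weight matrices for each head. The head outputs are concatenated row by row and projected by $W^O$. Again, we obtain a self-attention layer by setting $Q = K = V = X$, and this layer can be shown to be permutation-equivariant.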

We note that a single self-attention layer mostly captures pairwise document interactions. Since permutation-equivariance is preserved under function composition, we can stack multiple self-attention layers. Stacking multiple layers can potentially capture higher-order interactions better.

4.2.2. Scoring Layers

However, our goal is to derive a permutation-equivariant scoring function whose output is a score vector in $\mathbb{R}^n$. We propose to use a univariate scoring function on top of the output of the self-attention layers. Let $Z$ be the output of the self-attention layers and $z_i$ be the $i$-th row of the output, corresponding to document $d_i$. We propose a “wide and deep” scoring function to combine self-attention based features with query and document features:

$$f(q, \mathbf{d})|_i = s(q, d_i, z_i)$$

where $s$ denotes the univariate scoring function. We refer to this as a “wide and deep” architecture, where the output of the “deep” layers, a stack of self-attention layers, is combined in a “wide” univariate scoring function with query and document features to generate a score per document.

We show that this “wide and deep” scoring function is still permutation-equivariant while capturing cross-document interactions: the rows of $Z$ permute together with the documents, and $s$ is applied to each document independently.
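The combination step can be sketched as follows. The interaction embeddings are taken as given (hypothetical values standing in for the output of the self-attention stack), and a linear scorer stands in for the feedforward network; all names and numbers are assumptions for this example.

```python
def wide_and_deep_scores(features, embeddings, weights, bias=0.0):
    """Score each document from its own features concatenated with its interaction embedding."""
    scores = []
    for feat, emb in zip(features, embeddings):
        combined = feat + emb              # list concatenation of the two per-document vectors
        scores.append(sum(w * v for w, v in zip(weights, combined)) + bias)
    return scores

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical query-document features
embeddings = [[0.5], [0.2], [0.9]]                 # hypothetical outputs of the self-attention stack
weights = [0.3, -0.1, 2.0]                         # hypothetical linear scorer in place of the feedforward stack
scores = wide_and_deep_scores(features, embeddings, weights)

perm = [2, 0, 1]
scores_perm = wide_and_deep_scores([features[p] for p in perm],
                                   [embeddings[p] for p in perm], weights)
```

Since each score is computed from one document's own feature row and its own embedding row, reordering both inputs consistently reorders the scores in the same way.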

We call our method Self-Attentive Document Interaction Networks (denoted as attn-DIN); the structure of the score computation for a given document is shown in Figure 1. The self-attention layers can be stacked sequentially without losing the permutation-equivariance property. In “wide and deep” fashion, the output of these layers, the document interaction embeddings, is combined with query and document features and fed as input to a univariate scoring function. The univariate scoring function captures interactions between features using a stack of feedforward layers. Specifically, for each feedforward layer, the input is passed through dropout regularization (Srivastava et al., 2014) to prevent overfitting, and the output is passed through a batch normalization layer (Ioffe and Szegedy, 2015), followed by a non-linear ReLU (Nair and Hinton, 2010) activation, where $\mathrm{ReLU}(x) = \max(x, 0)$. We refer to this combination as FC-BN-ReLU in Figure 1. The final output is projected to a single score for a document.

Method                                                            NDCG@1   NDCG@5   NDCG@10
GSF(m=64) with Softmax (best reported (Ai et al., 2019))          44.21    44.46    46.77
GSF(m=1) with ApproxNDCG (best reported (Bruch et al., 2019a))    46.64    45.38    47.31
GSF(m=1) with ApproxNDCG (finetuned)                              46.81    45.59    47.39
attn-DIN with ApproxNDCG (proposed approach)                      48.16    46.62    48.21
LambdaMART (RankLib)                                              45.35    44.59    46.46
LambdaMART (lightGBM)                                             50.33    49.20    51.05

Table 1. Comparison of NDCG between GSF models and attn-DIN for cross-document interactions on Web30k data. Statistical significance is measured against the best GSF model (p-value < 0.05).

5. Experiments

In this section, we first outline several learning-to-rank datasets and baseline methods used in our experiments. We then report the comparisons on both model effectiveness and inference efficiency.

5.1. Datasets

5.1.1. MSLR Web30k

The Microsoft Learning to Rank (MSLR) Web30k (Burges et al., 2006) public dataset comprises 30,000 queries. We use Fold1, which has 3 partitions: train, validation, and test. Each query-document pair has 136 dense features. Each query has a variable number of documents, and we use at most 200 documents per query for training the baseline and proposed methods. During evaluation, we use the test partition and consider all the documents present for a query. We discard queries with no relevant documents during both training and evaluation.

5.1.2. Quick Access

In Google Drive, Quick Access (Tata et al., 2017) is a zero-state recommendation system that surfaces relevant documents that users might click on when they visit the Drive home. The features are all dense, as described in Tata et al. (Tata et al., 2017), and user clicks are used as relevance labels. We collect around 30 million recommended documents and their click information. Each session has up to 100 documents, along with user features as contextual information for training and evaluation. We use a 90%-10% train-test split on this dataset.

5.1.3. Gmail Search

In Gmail, we look at search over e-mails, where a user types in a query, looks for a relevant e-mail, and clicks on one of the six results returned by the system. The list of six e-mails is considered as the candidate set, and the clicks are used as the relevance labels. To preserve privacy, we remove personal information and apply $k$-anonymization. Around 200 million queries are collected, with a 90%-10% train-test split. The features comprise both dense and sparse features. The sparse features comprise character and word level n-grams with $k$-anonymization applied.

5.2. Baselines

On the public Web30k dataset, we compare with LambdaMART’s implementation in RankLib (Croft et al., 2013) and lightGBM (Ke et al., 2017), and with multivariate Groupwise Scoring Functions (Ai et al., 2019). Since the labels consist of graded relevance, for evaluation measures, we use Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002) for top 1, 5, and 10 documents ordered by the scores.
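For reference, NDCG@k for a single query can be computed as in the sketch below. It assumes the common gain $2^{y} - 1$ and discount $\log_2(1 + \text{rank})$, and it returns 0.0 for a query without relevant documents (we discard such queries instead); the function names and toy inputs are assumptions for this example.

```python
import math

def dcg_at_k(labels_in_rank_order, k):
    """DCG over the top-k positions with gain 2^y - 1 and discount log2(1 + rank)."""
    return sum((2.0 ** y - 1.0) / math.log2(i + 2.0)
               for i, y in enumerate(labels_in_rank_order[:k]))

def ndcg_at_k(labels, scores, k):
    """NDCG@k for one query; returns 0.0 when no document is relevant."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([2, 0, 1], [3.0, 1.0, 2.0], k=3)    # documents already in ideal order
swapped = ndcg_at_k([0, 1], [2.0, 1.0], k=2)            # relevant document ranked second
```

The reported numbers are this quantity averaged over queries and multiplied by 100.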

On the private datasets of Quick Access and Gmail, we compare only with Groupwise Scoring Functions. Given the massive scale of the datasets and the heterogeneous nature of the features (dense and sparse), the open source implementations of LambdaMART do not scale on these datasets. Furthermore, prior work demonstrated that GSFs are superior to LambdaMART when sparse features are present (Ai et al., 2019). We evaluate using Mean Reciprocal Rank (Craswell, 2009) and Average Relevance Position (Zhu, 2004), as the labels are binary clicks, for which these two measures are most suitable.
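Both click-based measures can be computed from the ranks of the clicked documents. The sketch below uses our reading of the two definitions (reciprocal rank of the first clicked document, and mean rank of the clicked documents); the function names and the two toy queries are assumptions for this example.

```python
def ranks_of_clicked(labels, scores):
    """1-based ranks of the clicked (label > 0) documents after sorting by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [rank for rank, i in enumerate(order, start=1) if labels[i] > 0]

def reciprocal_rank(labels, scores):
    """Reciprocal rank of the highest-ranked clicked document; 0.0 if nothing was clicked."""
    ranks = ranks_of_clicked(labels, scores)
    return 1.0 / min(ranks) if ranks else 0.0

def average_relevance_position(labels, scores):
    """Mean rank of the clicked documents; None if nothing was clicked."""
    ranks = ranks_of_clicked(labels, scores)
    return sum(ranks) / len(ranks) if ranks else None

queries = [
    ([0, 1, 0], [0.2, 0.9, 0.1]),   # clicked document ranked first
    ([1, 0, 0], [0.3, 0.8, 0.5]),   # clicked document ranked third
]
mrr = sum(reciprocal_rank(l, s) for l, s in queries) / len(queries)
arp_second = average_relevance_position(*queries[1])
```

Higher MRR is better, whereas lower ARP is better, since ARP is an average position in the list.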

5.3. Hyperparameters

On the Web30k dataset, to encode document interactions, we use one self-attention layer with 100 neurons and a single attention head. The univariate scoring function that combines the output of self-attention with query and document features comprises an input batch normalization layer, followed by 7 feedforward fully-connected layers (FC-BN-ReLU layers, as shown in Figure 1). The model is trained using a training batch size of 128 and the Adagrad (Duchi et al., 2011) optimizer, with a learning rate of 0.005, to minimize the ApproxNDCG ranking loss. We use a similar setup for Gmail and Quick Access, with the Softmax loss minimized using the Adagrad optimizer, trained for 10 million and 5 million steps respectively. For Gmail, we use 5 layers of self-attention with 100 neurons each and 4 heads for encoding document interactions. For Quick Access, we use 3 layers of self-attention with 100 neurons each and 5 heads for encoding document interactions.

Figure 2. Comparison of ranking metrics between GSF models and attn-DIN on (left to right): Web30k, Quick Access and Gmail datasets.

5.4. Model Effectiveness

(a) Quick Access                   ΔMRR               ΔARP
GSF(m=1) (univariate scoring)      –                  –
GSF(m=4)                           -0.440 ± 0.177     -0.659 ± 0.141
attn-DIN (proposed approach)       +0.312 ± 0.113     +0.413 ± 0.124

(b) Gmail Search                   ΔMRR               ΔARP
GSF(m=1) (univariate scoring)      –                  –
GSF(m=3)                           +1.006 ± 0.247     +1.308 ± 0.246
attn-DIN (proposed approach)       +1.245 ± 0.228     +1.430 ± 0.247

Table 2. Model performance on Quick Access and Gmail data. Note that ΔMRR and ΔARP denote % relative improvement. Statistical significance is measured against the best GSF model (p-value < 0.05).

Figure 3. Comparison of normalized inference time per query and number of parameters between GSF models and attn-DIN.

In Table 1, on Web30k data, we compare the proposed attn-DIN approach with LambdaMART and GSFs. For LambdaMART, we consider the lightGBM implementation and the older RankLib implementation, and list the best reported metrics on test data. For the GSFs, we list the best reported metrics, and also an improved model based on our finetuning experiments. In Figure 2, we compare the attn-DIN model with multivariate GSF models for varying group sizes. Since the list size is large for the Web30k dataset (around 200), we increase the group size on an exponential scale from 1 to 128. We also show 95% bootstrapped confidence intervals for each of the models.

We observe that the proposed approach significantly outperforms both the best reported and finetuned GSFs, giving around 1 point improvement for NDCG@5 (measured on a 0-100 scale), which is statistically significant under a paired t-test with a p-value threshold of 0.05. These gains are not just from using a deeper network or from more neural network parameters, as shown in Figure 3. The increase in the number of parameters over univariate scoring is the smallest for the attn-DIN model, compared to any of the GSF models, while the improvement in the ranking measure is significant. Our attn-DIN model captures similarity using the dot-product attention mechanism and pooling to combine feature values, while GSFs explicitly model cross-document interactions using feedforward networks. As the group size increases, the number of parameters needed to capture cross-document interactions also increases. This also leads to an increase in inference time, as discussed in Section 5.5.

The proposed approach outperforms RankLib’s LambdaMART, but not the lightGBM implementation. We believe this is because Gradient Boosted Decision Trees are a very powerful class of machine learning models when the feature set consists purely of dense features and the training datasets are smaller.

In most real-world scenarios, the input tends to contain both dense and sparse features. Queries, document titles, and metadata naturally contain textual descriptions, which play a key role in a user’s relevance judgment and hence are powerful signals for training ranking models. We look at two real-world datasets, Gmail Search and Quick Access, with a large amount of data and a variety of features, as described in Section 5.1. Because these datasets are private, Table 2 reports only relative improvements. For statistical significance, we use a paired t-test over relative improvements in MRR, with a p-value threshold of 0.05.

On the Quick Access dataset (Table 2(a)), we analyze the relative improvements in MRR, and observe that the proposed approach does significantly better than the univariate model, which is in fact, the best performing GSF model. While the GSF models fail to produce any improvements from cross-document interactions on this dataset, our proposed approach effectively captures them.

On the Gmail dataset (Table 2(b)), the proposed approach is significantly better than the univariate model, and is superior to the best GSF model (m=3). We conducted a paired t-test between attn-DIN and the best GSF model, and we observe a statistically significant relative improvement in MRR. Note that in Gmail, we consider smaller document candidate sets (6 documents per query), whereas in Web30k and Quick Access, we use much larger candidate sets (200 documents per query and 100 documents per user request, respectively). For larger group sizes (> 8), the performance of GSF models deteriorates, whereas the proposed approach is able to capture cross-document interactions effectively.

5.5. Model Efficiency

In Figure 3, we compare the inference time and number of parameters for various GSF models, normalized by the values for the proposed approach. Relative to univariate scoring functions, the proposed approach has an increase in inference time similar to that of a GSF model with group size 8, despite capturing interactions across the entire document set (up to 200 documents for Web30k). For the GSF models, the inference time increases with the group size. GSFs use an approximation during inference. For group size 2, a rolling window of 2 over a shuffled list is used to reduce the time complexity to $O(n)$ (Ai et al., 2019). However, this approximation is not guaranteed to be permutation-equivariant and may be unstable during inference. Exact inference uses Equation 6, the same as RankProb (Dehghani et al., 2017), which has $O(n^2)$ time complexity. In our experiments, exact inference is drastically slower per query than the attn-DIN approach.

For the Web30k dataset, from Figures 2 and 3, we can observe that the proposed approaches are significantly better than univariate approaches, and are faster during inference than GSF models at large group sizes while capturing cross-document interactions across the list. From Figure 3, we can also observe that the proposed model has fewer parameters than multivariate GSF models; hence the gain in ranking metrics comes not from a larger number of parameters, but from effectively capturing similarity via cross-document attention pooling of the document features.

6. Conclusion

In this paper, we study how to leverage document interactions in the learning-to-rank setting. We proposed the permutation-equivariance requirement for a scoring function that takes document interactions into consideration. We show that self-attention mechanism can be used to implement such a permutation-equivariant function, and that any univariate query-document scoring function can be extended to capture cross-document interactions using this proposed self-attention mechanism. We choose the attention method used in Transformer (Vaswani et al., 2017) in our paper and combine the output of self-attention layers with a feed forward network in a wide and deep architecture. We conducted our experiments on three datasets and the results show that our proposed methods can capture document interactions effectively in a statistically significant manner, and can efficiently scale to large document sets.


  • Q. Ai, K. Bi, J. Guo, and W. B. Croft (2018) Learning a deep listwise context model for ranking refinement. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 135–144. Cited by: §1, §2.
  • Q. Ai, X. Wang, S. Bruch, N. Golbandi, M. Bendersky, and M. Najork (2019) Learning groupwise multivariate scoring functions using deep neural networks. In Proceedings of the 5th ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR), Cited by: §1, §1, §3.3, Table 1, §5.2, §5.2, §5.5.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.2.1.
  • I. Bello, S. Kulkarni, S. Jain, C. Boutilier, E. Chi, E. Eban, X. Luo, A. Mackey, and O. Meshi (2018) Seq2slate: re-ranking and slate optimization with rnns. arXiv preprint arXiv:1810.02019. Cited by: §1, §2.
  • A. Borisov, I. Markov, M. de Rijke, and P. Serdyukov (2016) A neural click model for web search. In Proc. of WWW, pp. 531–541. Cited by: §1.
  • S. N. Bruch, M. Zoghi, M. Bendersky, and M. Najork (2019a) Revisiting approximate metric optimization in the age of deep neural networks. Cited by: §2, §3.2, §3.2, Table 1.
  • S. Bruch, X. Wang, M. Bendersky, and M. Najork (2019b) An analysis of the softmax cross entropy loss for learning-to-rank with binary relevance. In Proceedings of the 2019 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2019), Cited by: §3.2.
  • C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender (2005) Learning to rank using gradient descent. In 22nd International Conference on Machine Learning, pp. 89–96. Cited by: §2, §3.1.
  • C. J. C. Burges, R. Ragno, and Q. V. Le (2006) Learning to rank with nonsmooth cost functions. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS, pp. 193–200. Cited by: §1, §5.1.1.
  • C. J. C. Burges (2010) From RankNet to LambdaRank to LambdaMART: an overview. Technical Report MSR-TR-2010-82, Microsoft Research. Cited by: §1, §2, §3.1.
  • Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In 24th International Conference on Machine Learning, pp. 129–136. Cited by: §1.
  • W. Chu and Z. Ghahramani (2005) Preference learning with Gaussian processes. In 22nd International Conference on Machine Learning, pp. 137–144. Cited by: §2.
  • N. Craswell (2009) Mean reciprocal rank. In Encyclopedia of Database Systems, pp. 1703–1703. Cited by: §5.2.
  • W. B. Croft, J. Callan, J. Allan, C. Zhai, D. Fisher, T. Avrahami, T. Strohman, D. Metzler, P. Ogilvie, M. Hoy, et al. (2013) The Lemur project. Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts Amherst 140. Cited by: §5.2.
  • M. Dehghani, H. Zamani, A. Severyn, J. Kamps, and W. B. Croft (2017) Neural ranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, pp. 65–74. Cited by: §1, §1, §3.3, §5.5.
  • F. Diaz (2007) Regularizing query-based retrieval scores. Information Retrieval 10 (6), pp. 531–562. Cited by: §1, §2.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159. Cited by: §5.3.
  • J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics 29 (5), pp. 1189–1232. Cited by: §3.1.
  • N. Fuhr (1989) Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems 7 (3), pp. 183–204. Cited by: §2.
  • J. Guo, Y. Fan, L. Pang, L. Yang, Q. Ai, H. Zamani, C. Wu, W. B. Croft, and X. Cheng (2019) A deep look into neural ranking models for information retrieval. arXiv preprint arXiv:1903.06902. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.2.1.
  • P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck (2013) Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM, pp. 2333–2338. Cited by: §2.
  • M. Ilse, J. M. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712. Cited by: §4.1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.2.2.
  • K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20 (4), pp. 422–446. Cited by: §5.2.
  • T. Joachims, A. Swaminathan, and T. Schnabel (2017) Unbiased learning-to-rank with biased feedback. In 10th ACM International Conference on Web Search and Data Mining, pp. 781–789. Cited by: §3.1.
  • T. Joachims (2002) Optimizing search engines using clickthrough data. In 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142. Cited by: §2.
  • T. Joachims (2006) Training linear SVMs in linear time. In 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–226. Cited by: §1, §3.1.
  • G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, pp. 3146–3154. Cited by: §2, §5.2.
  • Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130. Cited by: §4.1.
  • T. Liu (2009) Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3 (3), pp. 225–331. Cited by: §3.2.
  • B. Mitra, F. Diaz, and N. Craswell (2017) Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, pp. 1291–1299. Cited by: §2.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In 27th International Conference on Machine Learning, pp. 807–814. Cited by: §4.2.2.
  • L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng (2016) Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §2.
  • L. Pang, Y. Lan, J. Guo, J. Xu, J. Xu, and X. Cheng (2017) DeepRank: a new deep architecture for relevance ranking in information retrieval. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management, CIKM, pp. 257–266. Cited by: §2, §2.
  • R. K. Pasumarthi, S. Bruch, X. Wang, C. Li, M. Bendersky, M. Najork, J. Pfeifer, N. Golbandi, R. Anil, and S. Wolf (2019) TF-Ranking: scalable TensorFlow library for learning-to-rank. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2970–2978. Cited by: §2.
  • C. Pei, Y. Zhang, Y. Zhang, F. Sun, X. Lin, H. Sun, J. Wu, P. Jiang, J. Ge, W. Ou, et al. (2019) Personalized re-ranking for recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 3–11. Cited by: §2.
  • T. Qin, T. Liu, and H. Li (2010a) A general approximation framework for direct optimization of information retrieval measures. Information Retrieval 13 (4), pp. 375–397. Cited by: §3.2.
  • T. Qin, T. Liu, X. Zhang, D. Wang, and H. Li (2008) Global ranking using continuous conditional random fields. In Proc. of the 21st International Conference on Neural Information Processing Systems, pp. 1281–1288. Cited by: §1, §2.
  • S. Romeo, G. Da San Martino, A. Barrón-Cedeño, A. Moschitti, Y. Belinkov, W. Hsu, Y. Zhang, M. Mohtarami, and J. Glass (2016) Neural attention for learning to rank questions in community question answering. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1734–1745. Cited by: §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §4.2.2.
  • S. Tata, A. Popescul, M. Najork, M. Colagrosso, J. Gibbons, A. Green, A. Mah, M. Smith, D. Garg, C. Meyer, et al. (2017) Quick Access: building a smart experience for Google Drive. In 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1643–1651. Cited by: §5.1.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §4.2.1, §4.2, §6.
  • Q. Wu, C. J. Burges, K. M. Svore, and J. Gao (2010) Adapting boosting for information retrieval measures. Information Retrieval 13 (3), pp. 254–270. Cited by: §1.
  • F. Xia, T. Liu, J. Wang, W. Zhang, and H. Li (2008a) Listwise approach to learning to rank: theory and algorithm. In 25th International Conference on Machine Learning, pp. 1192–1199. Cited by: §2.
  • J. Xu and H. Li (2007) AdaRank: a boosting algorithm for information retrieval. In 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 391–398. Cited by: §3.1.
  • M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401. Cited by: §4.1.
  • M. Zhu (2004) Recall, precision and average precision. Technical report Department of Statistics and Actuarial Science, University of Waterloo. Cited by: §5.2.