1. Introduction
Ranking is a central problem in many applications of information retrieval (IR) such as search, recommender systems, and question answering. The purpose of a ranking algorithm is to sort a set of items into a ranked list such that the utility of the entire list is maximized. For example, in search, a set of documents is to be ranked to answer a user's query. The utility of the entire list depends heavily on the top-ranked documents.
Learning-to-rank employs machine learning techniques to solve ranking problems. The common formulation is to find a function that produces scores for the list of documents of a query. The scores can then be used to sort the documents. Many early approaches to learning-to-rank cast the ranking problem as regression or classification
(Burges, 2010; Joachims, 2006). In such methods, the loss function being minimized incurs a cost for an incorrect prediction of relevance labels ("pointwise" loss) or pairwise preferences ("pairwise" loss). Such formulations are, however, misaligned with the ranking objective, where the utility is often defined over the entire list of documents. Indeed, the so-called "listwise" methods that optimize a loss function defined over the entire list have been shown to learn better ranking functions
(Burges et al., 2006; Wu et al., 2010; Cao et al., 2007). While much research has been devoted to the evolution of loss functions, the nature of the learned scoring function has largely remained the same: a univariate scoring function that computes a relevance score for a document in isolation. How to capture cross-document interactions is the motivation behind several previous works (Diaz, 2007; Qin et al., 2008; Dehghani et al., 2017; Ai et al., 2018; Bello et al., 2018; Ai et al., 2019). Early methods such as the score regularization technique (Diaz, 2007) and the conditional random field based models (Qin et al., 2008) use the similarity between documents to smooth or regulate ranking scores. These methods, however, assume the existence of document similarity information from another source such as document clusters. More recently, neural learning-to-rank algorithms (Ai et al., 2018; Bello et al., 2018) and click models (Borisov et al., 2016)
capture document interactions using recurrent neural networks over document lists. These methods, however, belong to the
re-ranking setting because they assume that the input is an ordered list, not a set. Another work that investigates the effect of document interactions on ranking quality is RankProb (Dehghani et al., 2017). It is a bivariate neural scoring function that takes a pair of documents as input and predicts the preference of one over the other. More recently, a more general framework was proposed in (Ai et al., 2019) to learn multivariate "groupwise" scoring functions (GSFs). Though able to model document interactions, both methods are highly inefficient at inference time. These models suffer from a training-serving discrepancy: the function learned during training is different from the scoring function used in serving. For example, average pooling over the bivariate function learned during training is used as the scoring function in RankProb during serving. For higher-order interaction models (such as GSFs), the pooling is over an intractable number of permutations, and hence approximation via subsampling is used, which worsens the training-serving discrepancy and makes the inference unstable.
In this paper, we identify a generic requirement for scoring functions with document interactions: permutation-equivariance. We analyze the existing approaches with respect to this requirement. Based on this requirement, we propose a class of neural network models and show that they not only satisfy this requirement precisely, but are also more efficient in modeling document interactions and do not have the training-serving discrepancy. Our proposed method is based on self-attention at the document level. It naturally captures the cross-document interactions via the self-attention mechanism. To the best of our knowledge, our work is the first to use it to model document interactions for learning-to-rank.
Our contributions can be summarized as follows:

We propose the permutation-equivariance requirement for any document interaction model and analyze existing methods with respect to this requirement.

We identify a generic class of permutation-equivariant functions, instantiate it using a self-attentive document interaction network, and incorporate it into learning-to-rank.

We empirically demonstrate the effectiveness and efficiency of our proposed methods on both search and recommendation tasks using three data sets.
This paper is organized as follows. We begin with a review of the literature in Section 2, and formalize the problem we wish to study in Section 3. In Section 4, we present a detailed description of our proposed methods. We examine the effectiveness of our methods empirically and summarize our findings in Section 5. Finally, we conclude this work and discuss future directions in Section 6.
2. Related Work
In the learning-to-rank literature, a common approach is called "score and sort". To capture the discrepancy between the list of document scores and the relevance labels, pointwise (Fuhr, 1989; Chu and Ghahramani, 2005), pairwise (Burges et al., 2005; Burges, 2010) and listwise (Xia et al., 2008b, a; Bruch et al., 2019a)
losses have been extensively studied. Scoring functions have been parameterized by boosted decision trees
(Ke et al., 2017), SVMs (Joachims, 2002), and neural networks (Burges, 2010; Pang et al., 2017; Pasumarthi et al., 2019). In the context of scoring query-document pairs, recent neural ranking models have been broadly classified
(Guo et al., 2019) into two categories: representation-focused and interaction-focused. The representation-focused methods (Huang et al., 2013; Pang et al., 2017, 2016) learn optimal vector space representations for queries and documents, and then combine them using dot product or cosine similarity. The interaction-focused methods learn a joint representation based on interaction networks between queries and documents. These approaches, along with hybrid variants between representation-focused and interaction-focused
(Mitra et al., 2017), are univariate approaches, i.e., they deal with scoring a query-document pair, and do not capture cross-document interactions. Note that the attention mechanism (Romeo et al., 2016) has also been explored in this line of work, but it is mainly used at the word or paragraph level, not the document level. Recent work on modeling document interactions in learning-to-rank has focused on the re-ranking scenario (Ai et al., 2018; Bello et al., 2018; Pei et al., 2019), where the input is an ordered list of documents, not a set. These are not applicable to full set ranking, which is the focus of our work. Regularizing scores (Diaz, 2007) and a CRF approach (Qin et al., 2008) using document cluster information to augment the training loss have been explored; these are complementary to our proposed approach.
3. Problem Formulation
In this section, we formulate our problem in the learning-to-rank setting.
3.1. Learning-to-Rank
Learning-to-rank solves ranking problems using machine learning techniques. In such a setting, the training data consists of a set of queries, each with a list of documents that we would like to rank. Formally, let $\Psi = \{(q, D, y)\}$ be a training data set, where $q$ is a query, $D = [d_1, d_2, \dots, d_n]$ is the list of documents for $q$, and $y = [y_1, y_2, \dots, y_n]$ is the list of relevance labels for $D$. We use $d_i$ and $y_i$ to refer to the $i$-th elements in $D$ and $y$ respectively. A scoring function $f$ takes both $q$ and $D$ as input and computes a vector of scores $s = [s_1, s_2, \dots, s_n]$:
$s = f(q, D) \qquad (1)$
A loss function $\ell(y, s)$ for query $q$ can be defined between the predicted scores $s$ and the labels $y$.
The goal of a learning-to-rank task is to find a scoring function $f$ that minimizes the empirical loss over the training data:
$\min_{f} \frac{1}{|\Psi|} \sum_{(q, D, y) \in \Psi} \ell(y, f(q, D)) \qquad (2)$
Typical examples of the hypothesis space for a scoring function $f$ include linear models
(Joachims, 2006; Joachims et al., 2017), boosted weak learners (Xu and Li, 2007), gradient-boosted trees
(Friedman, 2001; Burges, 2010), and neural networks (Burges et al., 2005).
3.2. Ranking Loss Functions
Given a formulation of the scoring function, there are various definitions of ranking loss functions (Liu, 2009). In this paper, we focus on the following two listwise loss functions, as they have been shown to be closely related to the ranking metrics (Bruch et al., 2019b; Qin et al., 2010a; Bruch et al., 2019a). The first one is the Softmax Cross-Entropy loss (denoted as Softmax), which has been shown to be a proper bound of ranking metrics over binary relevance labels like MRR (Bruch et al., 2019b):
$\ell(y, s) = -\sum_{j} y_j \log\left(\frac{e^{s_j}}{\sum_{j'} e^{s_{j'}}}\right) \qquad (3)$
where the subscripts $j$ and $j'$ denote the $j$-th or $j'$-th element of a vector.
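As a minimal numpy sketch of this loss (the function name and the toy score lists are ours, not from the paper):

```python
import numpy as np

def softmax_cross_entropy_loss(labels, scores):
    """Listwise softmax cross-entropy loss (the form of Equation 3)."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Numerically stable log-softmax of the scores.
    shifted = scores - scores.max()
    log_softmax = shifted - np.log(np.exp(shifted).sum())
    return -(labels * log_softmax).sum()

# A list that scores the relevant document highest incurs a lower
# loss than a mis-ordered one.
good = softmax_cross_entropy_loss([1, 0, 0], [3.0, 1.0, 0.5])
bad = softmax_cross_entropy_loss([1, 0, 0], [0.5, 1.0, 3.0])
assert good < bad
```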
The second one is the Approximate NDCG loss (denoted as ApproxNDCG) (Qin et al., 2010b; Bruch et al., 2019a). It is more suitable for graded relevance labels, and is derived from the NDCG metric, but uses scores to approximate the ranks to make the objective smooth:
$\ell(y, s) = -\frac{1}{Z} \sum_{j} \frac{2^{y_j} - 1}{\log_2(1 + \hat{\pi}(j))} \qquad (4)$
where $Z$ is the normalization term of NDCG (the ideal DCG) and $\hat{\pi}(j)$ is the approximate rank defined as
$\hat{\pi}(j) = 1 + \sum_{j' \neq j} \frac{1}{1 + e^{-\alpha(s_{j'} - s_j)}}$
where $\alpha$ is the parameter that controls the closeness of the approximation. When $\hat{\pi}(j)$ is replaced by the true rank sorted by scores $s$, Equation 4 becomes the negative of the NDCG metric. A larger $\alpha$ makes $\hat{\pi}(j)$ closer to the true rank, but it also makes ApproxNDCG less smooth and thus harder to optimize. We tune $\alpha$ in our experiments and set it to the value that gives the best results.
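A small numpy sketch of the approximate ranks and the resulting loss may make this concrete (function names and toy inputs are ours; the paper's implementation may differ):

```python
import numpy as np

def approx_ranks(scores, alpha=10.0):
    """Differentiable ranks: pi_hat(j) = 1 + sum_{j'!=j} sigmoid(alpha*(s_j' - s_j))."""
    scores = np.asarray(scores, dtype=float)
    diff = scores[None, :] - scores[:, None]  # diff[j, j'] = s_j' - s_j
    sig = 1.0 / (1.0 + np.exp(-alpha * diff))
    np.fill_diagonal(sig, 0.0)
    return 1.0 + sig.sum(axis=1)

def approx_ndcg_loss(labels, scores, alpha=10.0):
    """Negative ApproxNDCG (the form of Equation 4)."""
    labels = np.asarray(labels, dtype=float)
    gains = 2.0 ** labels - 1.0
    dcg = (gains / np.log2(1.0 + approx_ranks(scores, alpha))).sum()
    # Z: the ideal DCG, computed from the exact ranks of the sorted labels.
    ideal = np.sort(gains)[::-1]
    z = (ideal / np.log2(2.0 + np.arange(len(ideal)))).sum()
    return -dcg / z

# With a larger alpha, the approximate ranks approach the true ranks.
assert np.allclose(np.round(approx_ranks([3.0, 1.0, 2.0], alpha=20.0)), [1, 3, 2])
```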
3.3. PermutationEquivariance Requirement
Our focus in this paper is on scoring functions. We postulate that it is preferable for the scoring function to be permutation-equivariant, so that the resulting ranking is independent of the original ordering induced over the items by the candidate generation process (e.g., a base ranker or a retrieval mechanism). This ensures that the learned ranker will not be biased by any errors in the candidate generation process. We first give the general mathematical definition of permutation-equivariant functions.
Definition 3.1 (Permutation-Equivariant Functions).
Let $\mathbf{x} = [x_1, \dots, x_n]$ be a vector of $n$ elements $x_i$, where $x_i \in \mathcal{X}$, and let $\pi$ be a permutation of the indices $\{1, \dots, n\}$. A function $f: \mathcal{X}^n \to \mathcal{Y}^n$ is permutation-equivariant iff $f(\pi(\mathbf{x})) = \pi(f(\mathbf{x}))$.
That is, a permutation applied to the input vector will result in the same permutation applied to the output of the function.
For a scoring function $f(q, D)$, the input domain is defined by the representations of $q$ and $D$ (e.g., $\mathcal{X} = \mathbb{R}^{a} \times \mathbb{R}^{b}$, where $a$ and $b$ are the dimensions of the query and document vector representations) and the output domain is $\mathbb{R}^n$. It is permutation-equivariant iff $f(q, \pi(D)) = \pi(f(q, D))$.
We analyze some existing work in terms of this requirement.
The vast majority of learning-to-rank algorithms assume a univariate scoring function that computes a score for each document independently of the others. With a slight abuse of notation, we also use $f$ to represent the scoring function on each individual document:
$s_i = f(q, d_i) \qquad (5)$
where $d_i$ is an individual document in the list and $s_i$ is the $i$-th value of the score vector $s$. A univariate scoring function is permutation-equivariant because each score depends only on $q$ and the corresponding document, so permuting the input documents simply permutes the resulting scores: $f(q, \pi(D)) = \pi(f(q, D))$.
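This property is easy to verify numerically. The sketch below uses a toy linear scorer (the weights and shapes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def univariate_scores(query, docs, w):
    """Univariate scoring (the form of Equation 5): each document scored in isolation."""
    return np.array([w @ np.concatenate([query, d]) for d in docs])

query = rng.normal(size=4)
docs = rng.normal(size=(5, 8))
w = rng.normal(size=12)

perm = rng.permutation(5)
# Scoring a permuted list equals permuting the scores: equivariance.
assert np.allclose(univariate_scores(query, docs[perm], w),
                   univariate_scores(query, docs, w)[perm])
```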
The Groupwise Scoring Functions (GSFs) (Ai et al., 2019) boil down to univariate scoring functions when the group size is 1. A larger group size is needed to model cross-document interactions. For example, for groups of size 2, the score of the $i$-th document is:
$s_i = \frac{1}{n} \sum_{j=1}^{n} g(q, d_i, d_j) \qquad (6)$
where $g$ is the sub-scoring function in GSF, implemented using feed-forward networks. Higher-order interactions are explicitly captured when the group size is larger, but it becomes impractical to implement a GSF precisely due to the combinatorial number of groups. Monte Carlo sampling methods are used to approximate the full sum, and this can make GSFs unstable. In this sense, GSFs are only approximately permutation-equivariant.
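A sketch of this average pooling over a toy bivariate sub-scorer (our stand-in for the feed-forward network $g$; it omits the query features for simplicity) illustrates both the form of Equation 6 and its quadratic cost:

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_pooled_scores(docs, g):
    """Average pooling over a bivariate sub-scorer g (the form of Equation 6).

    Exact inference evaluates g on all n^2 ordered pairs, which is what
    makes this family of models expensive at serving time.
    """
    n = len(docs)
    return np.array([np.mean([g(docs[i], docs[j]) for j in range(n)])
                     for i in range(n)])

# A toy linear sub-scorer standing in for the feed-forward network.
w = rng.normal(size=8)
g = lambda a, b: w @ np.concatenate([a, b])

docs = rng.normal(size=(6, 4))
scores = pairwise_pooled_scores(docs, g)
assert scores.shape == (6,)

# Average pooling is permutation-equivariant (at quadratic cost).
perm = rng.permutation(6)
assert np.allclose(pairwise_pooled_scores(docs[perm], g), scores[perm])
```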
The RankProb approach in (Dehghani et al., 2017) trains a bivariate interaction scoring function by concatenating the features of a pair of documents as the input of a feed-forward network. The loss function is a logistic regression on the pairwise preference of the two documents. For inference, it uses the average pooling in Equation 6. This model is similar to a GSF with group size 2, and it has a training-serving discrepancy. The average pooling makes the scoring function permutation-equivariant, but computing it directly has an $O(n^2)$ time complexity, which is not scalable.
4. Proposed Methods
In this section, we first present a general class of permutation-equivariant functions and then outline how we build a permutation-equivariant scoring function using deep neural networks for our proposed approach.
4.1. A Class of Permutation-Equivariant Functions
Our permutation-equivariant functions are based on permutation-invariant functions. We start with the formal definition of permutation-invariant functions.
Definition 4.1 (Permutation-Invariant Functions).
Let $\mathbf{x} = [x_1, \dots, x_n]$ be a vector of $n$ elements $x_i$, where $x_i \in \mathcal{X}$, and let $\pi$ be a permutation of the indices of $\mathbf{x}$. A function $f: \mathcal{X}^n \to \mathcal{Y}$ is permutation-invariant iff $f(\pi(\mathbf{x})) = f(\mathbf{x})$.
That is, any permutation of the input has the same output.
The work in (Zaheer et al., 2017) provides a general characterization of permutation-invariant functions as follows:
Theorem 4.2 (Zaheer et al., 2017).
A function $f$ is permutation-invariant iff it can be decomposed in the form $f(\mathbf{x}) = \rho\left(\sum_{i=1}^{n} \phi(x_i)\right)$, for suitable choices of $\rho$ and $\phi$.
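A quick numerical illustration of this decomposition, with arbitrary toy choices of $\phi$ and $\rho$ (the particular functions below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(16, 4))
w2 = rng.normal(size=16)

def phi(x):
    # Per-element embedding.
    return np.tanh(W1 @ x)

def rho(z):
    # Transform applied to the pooled (summed) embeddings.
    return float(w2 @ np.tanh(z))

def f(xs):
    # rho(sum_i phi(x_i)): permutation-invariant by construction.
    return rho(sum(phi(x) for x in xs))

xs = rng.normal(size=(7, 4))
perm = rng.permutation(7)
# Any reordering of the inputs yields the same output.
assert np.isclose(f(xs), f(xs[perm]))
```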
Though simple, Theorem 4.2 is not constructive. Ilse et al. (Ilse et al., 2018) proposed a mechanism to extend the sum in Theorem 4.2 (called a pooling function) to a weighted pooling, based on the attention mechanism. We shall refer to this as the attention pooling function. Given a generic context $c$, a pooling function can be extended to attention pooling as follows:
$f(\mathbf{x}) = \rho\left(\sum_{i=1}^{n} a(c, x_i)\, \phi(x_i)\right) \qquad (7)$
Here, $a(\cdot, \cdot)$ is the popular attention mechanism, which is used to capture the similarity between the context $c$ and each item $x_i$. Using each item $x_k$ in turn as the context yields a class of permutation-equivariant functions, with one output per item:
$f(\mathbf{x})_k = \psi\!\left(x_k, \sum_{i=1}^{n} a(x_k, x_i)\, \phi(x_i)\right) \qquad (8)$
4.2. Self-Attentive Document Interaction Networks
We instantiate the permutation-equivariant functions using the scaled dot-product attention proposed in the work on the Transformer (Vaswani et al., 2017).
4.2.1. Self-Attention Layers
The attention layer in the Transformer is defined based on three matrices, the queries $Q$, the keys $K$, and the values $V$, where $d_k$ is the dimension of the keys in $K$, as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \qquad (9)$
The output of the attention is an $n \times d_v$ matrix, where $d_v$ is the dimension of the value vectors. Self-attention is the special case of attention where we set $Q = K = V = X$. In our setting, we implement each row of $X$ as the concatenation of the vector representations of $q$ and each $d_i$. The self-attention is permutation-equivariant: each row of $X$ serves as the context $c$, and the softmax weights give the matrix form of $a(\cdot, \cdot)$ in Equation 8. Similar to the work on the Transformer (Vaswani et al., 2017), we use layer normalization (Ba et al., 2016) and residual connections (He et al., 2016) over the output of the self-attention, and these operations form the function $\psi$ in Equation 8. Furthermore, we use the multi-headed attention mechanism, which allows the model to learn multiple attention weights per layer:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \qquad (10)$
where the matrices $W_i^Q$, $W_i^K$, and $W_i^V$ are the weight matrices for each head. Heads are concatenated along rows and projected by $W^O$. Again, we obtain a self-attention layer by setting $Q = K = V = X$, and the same argument shows that this is permutation-equivariant.
We note that such a self-attention layer mostly captures pairwise document interactions. Since permutation-equivariance is preserved under function composition, we can stack multiple self-attention layers. Multiple layers can enhance the representations and potentially capture higher-order interactions better.
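The permutation-equivariance of a single-head self-attention layer can be checked directly. The sketch below implements the scaled dot-product form of Equation 9 with $Q$, $K$, $V$ as linear projections of $X$ (shapes and weights are illustrative, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention (the form of Equation 9)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[1]
    logits = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax over the document axis.
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

d, d_k = 8, 4
Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))
X = rng.normal(size=(5, d))   # one row per (query, document) representation

perm = rng.permutation(5)
# Permuting the input rows permutes the output rows: equivariance.
assert np.allclose(self_attention(X[perm], Wq, Wk, Wv),
                   self_attention(X, Wq, Wk, Wv)[perm])
```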
4.2.2. Scoring Layers
Our goal, however, is to derive a permutation-equivariant scoring function whose output is in $\mathbb{R}^n$. We propose to use a univariate scoring function on top of the output of the self-attention layers. Let $H$ be the output of the self-attention layers and $h_i$ be the $i$-th row of $H$, corresponding to document $d_i$. We propose a "wide and deep" scoring function $s_i = u(q, d_i, h_i)$ that combines the self-attention based features with the query and document features.
We refer to this as a "wide and deep" architecture, where the output of the "deep" layers, a stack of self-attention layers, is combined in a "wide" univariate scoring function $u$ with query and document features to generate a score per document.
This "wide and deep" scoring function is still permutation-equivariant, since $h_i$ is produced by permutation-equivariant self-attention layers and $u$ is applied to each document independently, while capturing cross-document interactions.
We call our method Self-Attentive Document Interaction Networks (denoted as attnDIN), and the structure of the score for a given document is shown in Figure 1. The self-attention layers can be stacked sequentially without losing the permutation-equivariance property. In "wide and deep" fashion, the output of these layers, the document interaction embeddings, is combined with query and document features and fed as input to a univariate scoring function. This univariate scoring function captures interactions between features using a stack of feed-forward layers. Specifically, for each feed-forward layer, the input is passed through dropout regularization (Srivastava et al., 2014)
(to prevent overfitting), and the output is passed through a batch normalization layer
(Ioffe and Szegedy, 2015), followed by a nonlinear ReLU
(Nair and Hinton, 2010) activation, where $\mathrm{ReLU}(x) = \max(0, x)$. We refer to this combination as FC-BN-ReLU in Figure 1. The final output is projected to a single score for a document.

Table 1. Comparison on the Web30k test set.
Method  NDCG@1  NDCG@5  NDCG@10
GSF(m=64) with Softmax (best reported (Ai et al., 2019))  44.21  44.46  46.77 
GSF(m=1) with ApproxNDCG (best reported (Bruch et al., 2019a))  46.64  45.38  47.31 
GSF(m=1) with ApproxNDCG (finetuned)  46.81  45.59  47.39 
attnDIN with ApproxNDCG (proposed approach)  48.16  46.62  48.21 
LambdaMART (RankLib)  45.35  44.59  46.46 
LambdaMART (lightGBM)  50.33  49.20  51.05 
5. Experiments
In this section, we first outline several learning-to-rank datasets and the baseline methods used in our experiments. We then report comparisons of both model effectiveness and inference efficiency.
5.1. Datasets
5.1.1. MSLR Web30k
The Microsoft Learning to Rank (MSLR) Web30k (Burges et al., 2006) public dataset comprises 30,000 queries. We use Fold1, which has three partitions: train, validation, and test. For each query-document pair, it has 136 dense features. Each query has a variable number of documents, and we use at most 200 documents per query for training the baseline and proposed methods. During evaluation, we use the test partition and consider all the documents present for a query. We discard queries with no relevant documents during both training and evaluation.
5.1.2. Quick Access
In Google Drive, Quick Access (Tata et al., 2017) is a zero-state recommendation system that surfaces relevant documents users might click on when they visit the Drive home. The features are all dense, as described in Tata et al. (Tata et al., 2017), and user clicks are used as relevance labels. We collect around 30 million recommended documents and their click information. Each session has up to 100 documents, along with user features as contextual information, for training and evaluation. We use a 90%-10% train-test split on this dataset.
5.1.3. Gmail Search
In Gmail, we look at search over emails, where a user types a query, looks for a relevant email, and clicks on one of the six results returned by the system. The list of six emails is considered the candidate set, and the clicks are used as relevance labels. To preserve privacy, we remove personal information and apply anonymization. Around 200 million queries are collected, with a 90%-10% train-test split. The features comprise both dense and sparse features; the sparse features comprise character- and word-level n-grams with anonymization applied.
5.2. Baselines
On the public Web30k dataset, we compare with LambdaMART’s implementation in RankLib (Croft et al., 2013) and lightGBM (Ke et al., 2017), and with multivariate Groupwise Scoring Functions (Ai et al., 2019). Since the labels consist of graded relevance, for evaluation measures, we use Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002) for top 1, 5, and 10 documents ordered by the scores.
On the private datasets of Quick Access and Gmail, we compare only with Groupwise Scoring Functions. Given the massive scale of the datasets, and the heterogeneous nature of features (dense and sparse), the open source implementations of LambdaMART do not scale on these datasets. Furthermore, prior work demonstrated that GSFs are superior to LambdaMART when sparse features are present
(Ai et al., 2019). We evaluate using Mean Reciprocal Rank (MRR) (Craswell, 2009) and Average Relevance Position (ARP) (Zhu, 2004), as the labels are binary clicks, for which these two measures are most suitable.
5.3. Hyperparameters
On the Web30k dataset, to encode document interactions, we use one self-attention layer with 100 neurons and a single attention head. The univariate scoring function that combines the output of self-attention with query and document features comprises an input batch normalization layer, followed by 7 feed-forward fully-connected layers (FC-BN-ReLU layers, as shown in Figure 1). The model is trained with a batch size of 128 and the Adagrad (Duchi et al., 2011) optimizer, with a learning rate of 0.005, to minimize the ApproxNDCG ranking loss. We use a similar setup for Gmail and Quick Access, with the Softmax loss minimized using the Adagrad optimizer, trained for 10 million and 5 million steps respectively. For Gmail, we use 5 self-attention layers with 100 neurons each and 4 heads to encode document interactions. For Quick Access, we use 3 self-attention layers with 100 neurons each and 5 heads.
5.4. Model Effectiveness
Table 2. Relative improvements over univariate scoring on (a) Quick Access and (b) Gmail Search.

(a) Quick Access  ΔMRR  ΔARP
GSF(m=1) (univariate scoring)  –  –
GSF(m=4)  −0.440 ±0.177  −0.659 ±0.141
attnDIN (proposed approach)  +0.312 ±0.113  +0.413 ±0.124

(b) Gmail Search  ΔMRR  ΔARP
GSF(m=1) (univariate scoring)  –  –
GSF(m=3)  +1.006 ±0.247  +1.308 ±0.246
attnDIN (proposed approach)  +1.245 ±0.228  +1.430 ±0.247
In Table 1, on Web30k data, we compare the proposed attnDIN approach with LambdaMART and GSFs. For LambdaMART, we consider the lightGBM implementation and the older RankLib implementation, and list the best reported metrics on test data. For the GSFs, we list the best reported metrics, as well as an improved model based on our fine-tuning experiments. In Figure 2, we compare the attnDIN model with multivariate GSF models of varying group sizes. Since the list size is large for the Web30k dataset (around 200), we increase the group size on an exponential scale from 1 to 128. We also show 95% bootstrapped confidence intervals for each of the models.
We observe that the proposed approach significantly outperforms both the best reported and the fine-tuned GSFs, giving around a 1 point improvement in NDCG@5 (measured on a 0-100 scale), which is statistically significant under a paired t-test. These gains are not just from using a deeper network or more neural network parameters, as shown in Figure 3. The increase in the number of parameters over univariate scoring is the smallest for the attnDIN model compared to any of the GSF models, while the improvement in ranking measures is significant. Our attnDIN model captures similarity using the dot-product attention mechanism and pooling to combine feature values, while GSFs explicitly model cross-document interactions using feed-forward networks. As the group size increases, the number of parameters needed to capture cross-document interactions also increases. This also leads to an increase in inference time, as discussed in Section 5.5.
The proposed approach outperforms RankLib's LambdaMART, but not the lightGBM implementation. We believe this is because gradient-boosted decision trees are a very powerful class of machine learning models when the feature set consists purely of dense features and the training dataset is smaller.
In most real-world scenarios, input features tend to include both dense and sparse features. Queries, document titles, and metadata naturally carry textual descriptions, which play a key role in users' relevance judgments and hence are powerful signals for training ranking models. We look at two real-world datasets, Gmail Search and Quick Access, with a large amount of data and a variety of features, as described in Section 5.1. In Table 2, we report relative improvements in MRR, due to the private nature of these datasets. For statistical significance, we use a paired t-test over relative improvements in MRR.
On the Quick Access dataset (Table 2(a)), we analyze the relative improvements in MRR and observe that the proposed approach does significantly better than the univariate model, which is, in fact, the best performing GSF model. While the GSF models fail to produce any improvements from cross-document interactions on this dataset, our proposed approach effectively captures them.
On the Gmail dataset (Table 2(b)), the proposed approach is significantly better than the univariate model and is superior to the best GSF model. We conducted a paired t-test between attnDIN and the best GSF model and observe a statistically significant relative improvement in MRR. Note that in Gmail, we consider smaller document candidate sets (6 documents per query), whereas in Web30k and Quick Access, we use much larger candidate sets (200 documents per query and 100 documents per user request, respectively). For larger group sizes (> 8), the performance of GSF models deteriorates, whereas the proposed approach is able to capture cross-document interactions effectively.
5.5. Model Efficiency
In Figure 3, we compare the inference time and the number of parameters for various GSF models, normalized by the values for the proposed approach. Over univariate scoring functions, the proposed approach has an increase in inference time similar to a GSF with group size 8, despite capturing interactions across the entire document set of size 200 for Web30k. For the GSF models, the inference time increases with the group size. GSFs use an approximation during inference. For group size 2, a GSF uses a rolling window of 2 over a shuffled list to reduce the time complexity to $O(n)$ (Ai et al., 2019). However, this is not guaranteed to be permutation-equivariant and may be unstable during inference. Exact inference uses Equation 6, the same as RankProb (Dehghani et al., 2017), which has an $O(n^2)$ time complexity. In our experiments, exact inference is drastically slower per query than the attnDIN approach.
For the Web30k dataset, from Figures 2 and 3, we can observe that the proposed approaches are significantly better than univariate approaches and are faster during inference than GSF models at large group sizes, while capturing cross-document interactions across the list. From Figure 3, we can also observe that the proposed model has fewer parameters than multivariate GSF models; hence the gain in ranking metrics comes not from using a larger number of parameters, but from effectively capturing similarity via cross-document attention pooling of the document features.
6. Conclusion
In this paper, we study how to leverage document interactions in the learning-to-rank setting. We propose the permutation-equivariance requirement for a scoring function that takes document interactions into consideration. We show that the self-attention mechanism can be used to implement such a permutation-equivariant function, and that any univariate query-document scoring function can be extended to capture cross-document interactions using this mechanism. We choose the attention method used in the Transformer (Vaswani et al., 2017) and combine the output of the self-attention layers with a feed-forward network in a wide and deep architecture. We conducted experiments on three datasets, and the results show that our proposed methods capture document interactions effectively, with statistically significant improvements, and scale efficiently to large document sets.
References
 Learning a deep listwise context model for ranking refinement. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 135–144. Cited by: §1, §2.
 Learning groupwise multivariate scoring functions using deep neural networks. In Proceedings of the 5th ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR), Cited by: §1, §1, §3.3, Table 1, §5.2, §5.2, §5.5.
 Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.2.1.
 Seq2Slate: re-ranking and slate optimization with RNNs. arXiv preprint arXiv:1810.02019. Cited by: §1, §2.
 A neural click model for web search. In Proc. of WWW, pp. 531–541. Cited by: §1.
 Revisiting approximate metric optimization in the age of deep neural networks. Cited by: §2, §3.2, §3.2, Table 1.
 An analysis of the softmax cross entropy loss for learning-to-rank with binary relevance. In Proceedings of the 2019 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2019), Cited by: §3.2.
 Learning to rank using gradient descent. In 22nd International Conference on Machine Learning, pp. 89–96. Cited by: §2, §3.1.
 Learning to rank with nonsmooth cost functions. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS, pp. 193–200. Cited by: §1, §5.1.1.
 From RankNet to LambdaRank to LambdaMART: an overview. Technical Report MSR-TR-2010-82, Microsoft Research. Cited by: §1, §2, §3.1.
 Learning to rank: from pairwise approach to listwise approach. In 24th International Conference on Machine Learning, pp. 129–136. Cited by: §1.
 Preference learning with gaussian processes. In 22nd International Conference on Machine Learning, pp. 137–144. Cited by: §2.
 Mean reciprocal rank. In Encyclopedia of Database Systems, pp. 1703–1703. Cited by: §5.2.
 The Lemur project. Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts Amherst 140. Cited by: §5.2.
 Neural ranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, pp. 65–74. Cited by: §1, §1, §3.3, §5.5.
 Regularizing querybased retrieval scores. Information Retrieval 10 (6), pp. 531–562. Cited by: §1, §2.
 Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159. Cited by: §5.3.
 Greedy function approximation: a gradient boosting machine. Annals of Statistics 29 (5), pp. 1189–1232. Cited by: §3.1.

 Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems 7 (3), pp. 183–204. Cited by: §2.
 A deep look into neural ranking models for information retrieval. arXiv preprint arXiv:1903.06902. Cited by: §2.

 Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.2.1.
 Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM, pp. 2333–2338. Cited by: §2.
 Attentionbased deep multiple instance learning. arXiv preprint arXiv:1802.04712. Cited by: §4.1.
 Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.2.2.
 Cumulated gainbased evaluation of IR techniques. ACM Transactions on Information Systems 20 (4), pp. 422–446. Cited by: §5.2.
 Unbiased learning-to-rank with biased feedback. In 10th ACM International Conference on Web Search and Data Mining, pp. 781–789. Cited by: §3.1.
 Optimizing search engines using clickthrough data. In 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142. Cited by: §2.
 Training linear svms in linear time. In 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–226. Cited by: §1, §3.1.
 LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, pp. 3146–3154. Cited by: §2, §5.2.
 A structured selfattentive sentence embedding. arXiv preprint arXiv:1703.03130. Cited by: §4.1.
 Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3 (3), pp. 225–331. Cited by: §3.2.

 Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, pp. 1291–1299. Cited by: §2.
 Rectified linear units improve restricted Boltzmann machines. In 27th International Conference on Machine Learning, pp. 807–814. Cited by: §4.2.2.

 Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §2.
 DeepRank: a new deep architecture for relevance ranking in information retrieval. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management, CIKM, pp. 257–266. Cited by: §2, §2.

 TF-Ranking: scalable TensorFlow library for learning-to-rank. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2970–2978. Cited by: §2.
 Personalized re-ranking for recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 3–11. Cited by: §2.
 A general approximation framework for direct optimization of information retrieval measures. Information Retrieval 13 (4), pp. 375–397. Cited by: §3.2.
 A general approximation framework for direct optimization of information retrieval measures. Information Retrieval 13 (4), pp. 375–397. Cited by: §3.2.
 Global ranking using continuous conditional random fields. In Proc. of the 21st International Conference on Neural Information Processing Systems, pp. 1281–1288. Cited by: §1, §2.
 Neural attention for learning to rank questions in community question answering. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1734–1745. Cited by: §2.
 Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §4.2.2.
 Quick Access: building a smart experience for Google Drive. In 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1643–1651. Cited by: §5.1.2.
 Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.2.1, §4.2, §6.
 Adapting boosting for information retrieval measures. Information Retrieval 13 (3), pp. 254–270. Cited by: §1.
 Listwise approach to learning to rank: theory and algorithm. In 25th International Conference on Machine Learning, pp. 1192–1199. Cited by: §2.
 Listwise approach to learning to rank: theory and algorithm. In Proc. of the 25th International Conference on Machine Learning, pp. 1192–1199. Cited by: §2.
 AdaRank: a boosting algorithm for information retrieval. In 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 391–398. Cited by: §3.1.
 Deep sets. In Advances in neural information processing systems, pp. 3391–3401. Cited by: §4.1.
 Recall, precision and average precision. Technical report Department of Statistics and Actuarial Science, University of Waterloo. Cited by: §5.2.