1. Introduction
Ranking is at the heart of many information retrieval (IR) problems, including ad hoc document retrieval, question answering, recommender systems, and many others. Unlike the traditional classification or regression setting, the main goal of a ranking problem is not to assign a label or a value to each individual document, but to determine the relative preference among a list of them. The relevance of the top-ranked documents influences the perceived utility of the entire list more than that of the others (Chapelle et al., 2009). Thus, the notion of relativity is crucial in a ranking problem.
How to model relativity in ranking has been extensively studied, especially in the learning-to-rank setting (Liu, 2009). In this setting, a scoring function that maps document feature vectors to real-valued scores is learned from training data. Documents are then ranked according to the predictions of the scoring function. To learn such a scoring function, the majority of learning-to-rank algorithms use pairwise or listwise loss functions to capture the relative relevance between documents (Burges et al., 2005; Burges et al., 2006; Cao et al., 2007; Xu and Li, 2007). Such a loss guides the learning of the scoring function to optimize preferences between documents or an IR metric such as NDCG (Joachims, 2002; Taylor et al., 2008; Burges, 2010).

Though effective, most existing learning-to-rank frameworks are restricted to the paradigm of pointwise scoring functions: the relevance score of a document to a query is computed independently of the other documents in the list. This setting can be suboptimal for ranking problems for multiple reasons.¹
¹It is interesting to note the connection of pointwise scoring to Robertson's probability ranking principle (PRP) (Robertson, 1977). PRP states that documents should be ranked by their probability of relevance to the query. While highly cited and influential, this principle was recognized to be problematic by Robertson himself: "The PRP works document-by-document, whereas the results should be evaluated request-by-request" (Robertson, 1977).

First, pointwise scoring has limited power to model cross-document comparison. Consider a search scenario where a user is searching for the name of a musical artist. If all the results returned for the query (e.g., calvin harris) are recent, the user may be interested in the latest news or tour information. If, on the other hand, most of the query results are older (e.g., frank sinatra), it is more likely that the user wants to learn about the artist's discography or biography. Thus, the relevance of each document depends on the distribution of the whole list. Second, user interaction with search results shows strong comparison patterns. Prior research suggests that preference judgments obtained by comparing a pair of documents are faster to collect and more consistent than absolute ratings (Bennett et al., 2008; Ye and Doermann, 2013; Koczkodaj, 1996). Also, better predictive capability is achieved when user actions are modeled in a relative fashion (e.g., Skip-Above) (Joachims et al., 2005; Borisov et al., 2016; Yilmaz et al., 2014; Craswell et al., 2008; Dupret and Piwowarski, 2008; Chapelle et al., 2009). These findings indicate that users compare a clicked document to its surrounding documents prior to a click, and a ranking model that uses a direct comparison mechanism can be more effective, as it mimics user behavior more faithfully.

In this paper, we hypothesize that the relevance score of a document to a query should be computed by comparison with other documents at the feature level. Specifically, we explore a general setting of groupwise scoring functions (GSFs) for learning-to-rank.
A GSF takes multiple documents as input and scores them together by leveraging a joint set of features from these documents. It outputs a relative relevance score for each document with respect to the other documents in its group, and the final score of each document is computed by aggregating all of its relative relevance scores in a voting mechanism. The proposed GSF setting is general, as we can define an arbitrary number of input documents and any ranking loss in it. We show that several representative learning-to-rank models can be formulated as special cases of the GSF framework.
While it is easy to define a GSF, it is not obvious how to learn such a function efficiently from training data, and how to use it to score a new list of documents during inference remains a challenge as well. To solve these challenges, we propose a novel deep neural network architecture in the learning-to-rank setting. We design a sampling method that allows us to train GSFs efficiently with backpropagation, and we obtain a scalar value per document during inference using a Monte Carlo method.
We evaluate our proposed models in a standard learning-to-rank setting. Specifically, we evaluate GSFs using the LETOR data set MSLR-WEB30K (Qin and Liu, 2013), in which the relevance judgments are obtained from human ratings and a predefined feature set is provided. We show that our GSFs perform reasonably well by themselves and can improve state-of-the-art tree-based models in a hybrid setting where the output of GSFs is used as a feature for tree-based models.
2. Related Work
Learning-to-rank refers to algorithms that solve ranking problems using machine learning techniques. There is a plethora of learning-to-rank work (Joachims, 2006; Friedman, 2001; Burges, 2010; Burges et al., 2005; Cao et al., 2007; Xia et al., 2008), which mainly differs in its definitions of loss functions. Almost all of it uses the setting of pointwise scoring functions. To the best of our knowledge, there are only a few exceptions.

The first are the score regularization technique (Diaz, 2007) and the CRF-based model (Qin et al., 2008), which use document similarities to smooth the initial ranking scores or as additional features for each query-document pair. When computing ranking scores, however, they take only one document at a time and do not consider comparisons between features from different documents.
The second exception is a pairwise scoring function that takes a pair of documents together and predicts the preference between them (Dehghani et al., 2017). We demonstrate that this pairwise scoring function can be instantiated in the proposed groupwise framework, and compare against it in our experiments.
The third exception comprises the neural click model (Borisov et al., 2016) and the deep listwise context model (Ai et al., 2018), which build an LSTM model on top of document lists. The neural click model summarizes the other documents into a hidden state and eventually uses a pointwise loss (e.g., log loss) for each document. Such a pointwise loss is difficult to adapt to the unbiased learning-to-rank setting (Wang et al., 2018), whereas our GSF models can be adapted to it. The deep listwise context model is similar to our models in terms of loss functions, but our GSF is more flexible, as it can take any document list as input, whether ordered or not.
Search result diversification is related to our GSF models, since it also takes into account subsets of documents by maximizing objectives such as maximal marginal relevance (Carbonell and Goldstein, 1998) or subtopic relevance (Agrawal et al., 2009). Recently, several deep learning algorithms were proposed with losses corresponding to these objectives (Jiang et al., 2017; Xia et al., 2016). In contrast to this line of work, the goal of our paper is to improve relevance modeling through groupwise comparisons, not diversity.
Pseudo-relevance feedback (Manning et al., 2008) is a classic retrieval method that uses query expansion from the first-round top retrieved documents to improve second-round retrieval. Our GSFs consider document relationships in the learning-to-rank setting, not the retrieval setting, and do not require two rounds of retrieval. We also do not assume a preexisting initial ordering of the document list.
Note that our work is complementary to the recently proposed neural IR techniques (Pang et al., 2017; Guo et al., 2016; Mitra et al., 2017; Pang et al., 2016; Dehghani et al., 2017). While these techniques focus on advanced representations of document and query text using standard pointwise or pairwise scoring and loss functions, our work focuses on the nature of the scoring function and the combination of multiple feature types while employing a relatively simple matching model.
3. Problem Formulation
In this section, we formulate our problem in the learning-to-rank framework. Let $(q, D, Y)$ represent a user query string $q$, its list of $n$ documents $D = (d_1, \ldots, d_n)$, and their respective relevance labels $Y = (y_1, \ldots, y_n)$. For each document $d_i$, we have a relevance label $y_i$. Let $\Psi = \{(q, D, Y)\}$ be the training data set. The goal of learning-to-rank is to find a scoring function $f$ that minimizes the average loss over the training data:

$\mathcal{L}(f) = \frac{1}{|\Psi|} \sum_{(q, D, Y) \in \Psi} \ell(Y, f(q, D)) \quad (1)$

Without loss of generality, we assume that $f$ takes both $q$ and $D$ as input, and produces a score $\hat{y}_i$ for each document $d_i$:

$\hat{Y} = (\hat{y}_1, \ldots, \hat{y}_n) = f(q, D)$

The local loss function $\ell$ is computed based on the relevance label list $Y$ and the scores $\hat{Y}$ produced by the scoring function $f$:

$\ell(Y, \hat{Y}) = \ell(Y, f(q, D)) \quad (2)$
The main difference among the various learning-to-rank algorithms lies in how the scoring function and the loss function are defined. While there is much work on different types of loss functions, categorized as pointwise, pairwise, or listwise losses (Liu, 2009), the majority of work assumes a pointwise scoring function (PSF) that takes a single document at a time as its input and produces a score for every document separately:

$f(q, D)_i = u(q, d_i), \quad 1 \le i \le n$

where we write $u$ for the pointwise scoring function for brevity.

In this paper, we explore a generic groupwise scoring function (GSF), denoted $g$, to better compare documents at the feature level. A GSF takes a group of $m$ documents and scores them together:

$g(q, d_1, \ldots, d_m) \in \mathbb{R}^m$

A PSF is the special case of a GSF with $m = 1$. Intuitively, a requirement for a GSF is to be invariant to the input document order. However, it is not immediately clear how to define $f$ based on the function $g$, nor how to learn $g$ from training data. This is what we address in the next section.
4. Methods
4.1. DNN-based Scoring Function
Because a GSF takes multiple documents as input, its feature vectors can be much larger than those for a PSF. Therefore, the scoring function $g$ in a GSF should have good scalability and be easily applicable to different types of features with large dimensionality. In this paper, we use deep neural networks (DNNs) as the building block for constructing the scoring function $g$. Feed-forward DNN models have been widely used for learning-to-rank problems (e.g., (Edizel et al., 2017; Huang et al., 2013; Zamani et al., 2017)). Compared to tree-based models (Friedman, 2001), they show better scalability in terms of input dimensionality.
Let $\mathbf{x}_i$ be the feature vector for a document $d_i$ given a specific query $q$. For simplicity, we use the concatenation of the feature vectors of the $m$ documents as the input layer of the DNN-based implementation of $g$, while leaving the exploration of more sophisticated input representations for future work. Specifically, let

$\mathbf{h}_0 = \mathrm{concat}(\mathbf{x}_1, \ldots, \mathbf{x}_m)$

and a multi-layer feed-forward network with 3 hidden layers is defined as

$\mathbf{h}_k = \sigma(\mathbf{W}_k^T \mathbf{h}_{k-1} + \mathbf{b}_k), \quad k = 1, 2, 3 \quad (3)$

where $\mathbf{W}_k$ and $\mathbf{b}_k$ denote the weight matrix and the bias vector in the $k$-th layer, and $\sigma$ is an activation function; we use the hyperbolic tangent function in our paper:

$\sigma(t) = \frac{e^{2t} - 1}{e^{2t} + 1}$
Our DNN-based function $g$ is defined as

$g(q, d_1, \ldots, d_m) = \mathbf{W}_o \mathbf{h}_3 + \mathbf{b}_o \quad (4)$

where $\mathbf{W}_o$ and $\mathbf{b}_o$ are the weight matrix and the bias in the output layer. The output dimension is set to $m$ in the output layer, so that $g$ produces one score per document in the group.
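To make Equations (3) and (4) concrete, the following NumPy sketch implements the 3-hidden-layer scorer $g$; the random initialization scheme and helper names are our illustrative assumptions, not the paper's implementation (which uses TensorFlow).

```python
import numpy as np

def init_gsf_params(input_dim, m, hidden=(256, 128, 64), seed=0):
    """Randomly initialize the 3-hidden-layer scorer (illustrative init scheme)."""
    rng = np.random.default_rng(seed)
    dims = [input_dim] + list(hidden)
    # One (W_k, b_k) pair per hidden layer, k = 1, 2, 3.
    layers = [(rng.normal(0, 0.1, (dims[k], dims[k + 1])),
               np.zeros(dims[k + 1])) for k in range(3)]
    # Output layer: one score per group slot, i.e. output dimension m (Eq. (4)).
    W_o = rng.normal(0, 0.1, (hidden[-1], m))
    b_o = np.zeros(m)
    return layers, W_o, b_o

def gsf_score(params, doc_features):
    """Score a group of m documents jointly, following Eqs. (3)-(4)."""
    layers, W_o, b_o = params
    h = np.concatenate(doc_features)      # h_0: concatenated feature vectors
    for W, b in layers:                   # h_k = tanh(W_k^T h_{k-1} + b_k)
        h = np.tanh(h @ W + b)
    return h @ W_o + b_o                  # m intermediate scores [phi_1, ..., phi_m]
```

For a group of two documents with 5 features each, `gsf_score` returns a vector of two intermediate relevance scores, one per document in the group.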
4.2. Groupwise Scoring Architecture
Given a set of training data $\Psi$, a straightforward approach to learn a groupwise scoring function is to concatenate all features of the documents in a document list together and build a DNN with output size $n$ as the relevance prediction for the whole list. In this case, the group size $m$ is set to $n$ and $f = g$. Such an approach, however, has two drawbacks: (1) the learned model is sensitive to document order; (2) the comparison is among all documents, which can make it difficult to learn useful local comparison patterns. To overcome these drawbacks, we propose to limit comparisons to small groups and make the model invariant to document order.
Let the input of a GSF model consist of a document list with $n$ documents. For simplicity, we assume that each list has exactly $n$ documents, i.e., $|D| = n$. Notice that such an assumption trivially holds in the click data setting where the top $n$ documents per query are used, or it can be ensured with a padding or sampling strategy. For example, we can randomly sample $n$ documents per query (without replacement) to form a new list. When $n = 2$, such a sampling strategy is similar to standard pairwise approaches, where document pairs of a query are formed as training examples.

Let $\pi = (d_{\pi_1}, \ldots, d_{\pi_m})$ be a list of $m$ documents that forms a group. In each group, documents are compared and scored based on their feature vectors to predict which documents are more relevant:

$[\phi_1, \ldots, \phi_m] = g(q, \pi) \quad (5)$

where $\phi_k$ is the $k$-th dimension of $g(q, \pi)$, and also the intermediate relevance score for the $k$-th document in $\pi$. Note that we use this slightly different notation to make the following discussion easier.
The intermediate relevance scores from each individual group are accumulated to generate the final score for each document. There are $n!/(n-m)!$ permutations of size $m$ in a list of size $n$, which means that there are as many possible inputs for a groupwise scoring function on a list of documents $D$. Let $\Pi_m(D)$ be the set of all size-$m$ permutations of documents from $D$. We then compute the final ranking score for a document $d$ as the accumulation of intermediate relevance scores from each individual group:

$f(q, D)_d = \sum_{\pi \in \Pi_m(D)} \sum_{k=1}^{m} \mathbb{1}(\pi_k = d) \, g_k(q, \pi) \quad (6)$

where $\mathbb{1}(\pi_k = d)$ is an indicator function that denotes whether $d$ is the $k$-th document in $\pi$, and $g_k$ is the $k$-th dimension of the output of $g$. The final output of a GSF model is the score vector $\hat{Y} = f(q, D)$. An example of such an architecture is shown in Figure 1.
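The voting aggregation in Equation (6) can be sketched as follows; the toy comparison function standing in for the learned $g$ and the helper names are our assumptions.

```python
from itertools import permutations

def aggregate_scores(g, docs, m):
    """Eq. (6): sum each document's intermediate scores over all size-m permutations."""
    final = {d: 0.0 for d in docs}
    for pi in permutations(docs, m):      # all n!/(n-m)! ordered groups
        phis = g(pi)                      # [phi_1, ..., phi_m] for this group
        for k, d in enumerate(pi):
            final[d] += phis[k]           # d is the k-th document in pi
    return final

# Toy stand-in for g: score each document by its value minus the group mean,
# so a document's score reflects comparison with the other group members.
def toy_g(group):
    mean = sum(group) / len(group)
    return [d - mean for d in group]
```

With documents represented by scalar "relevance" features, a more relevant document accumulates a higher final score across all of its groups.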
4.3. Loss Functions
Intuitively, we can define any loss function between the scores and the labels to train GSF models. In this paper, we focus on a simple loss function for graded relevance, leaving other loss functions to future studies.
Graded relevance is a multi-level relevance judgment and is commonly used in human-rated data sets. The loss for graded relevance is generally defined over document pairs. We extend the commonly used pairwise logistic loss to a list as

$\ell(Y, \hat{Y}) = \sum_{y_j > y_k} \log\left(1 + e^{-(\hat{y}_j - \hat{y}_k)}\right) \quad (7)$

where $y_j > y_k$ means that the $j$-th document is more relevant than the $k$-th document, and the loss of a list is the sum of the losses of all such pairs in the list. Please note that when $m = 1$, GSF is actually a pointwise scoring function. In this case, with $n = 2$, the loss in Equation (7) boils down to a pairwise loss function. However, it becomes a listwise loss for the general GSF models when $n > 2$.
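A direct transcription of the loss in Equation (7) (our sketch, not the authors' code):

```python
import math

def pairwise_logistic_loss(labels, scores):
    """Eq. (7): sum of log(1 + exp(-(s_j - s_k))) over all pairs with y_j > y_k."""
    loss = 0.0
    for y_j, s_j in zip(labels, scores):
        for y_k, s_k in zip(labels, scores):
            if y_j > y_k:                 # document j should rank above document k
                loss += math.log1p(math.exp(-(s_j - s_k)))
    return loss
```

The loss decreases when scores order the preferred pairs correctly with a large margin, and vanishes entirely when the label list contains no preferred pairs.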
4.4. Training and Inference
While a GSF extends traditional pointwise scoring functions (PSFs) by scoring each document according to its comparisons with others, it loses the PSF property that the model produces exactly one score per document, which can cause several efficiency issues in practice. In this section, we propose sampling strategies for the efficient training and inference of GSF models.
4.4.1. Speed Up Training
As shown in Equation (6), the final ranking score of a GSF model is the aggregation of intermediate relevance scores from all possible size-$m$ permutations of a list of size $n$. Suppose that the computational complexity of a DNN forward pass with $m$ input documents is $C_m$; then such a scoring paradigm has a complexity of $O\left(\frac{n!}{(n-m)!} \cdot C_m\right)$, which is prohibitive in real systems. To speed up the training process, we conduct the following group sampling to reduce the permutation set $\Pi_m(D)$. For each training instance with document list $D$, we randomly shuffle it before feeding it into the model. Then, we take only the subsequences (instead of subsets) of size $m$ from the shuffled input to form the group set. In this way, we reduce the size from $\frac{n!}{(n-m)!}$ to $n$ subsequences: the $i$-th subsequence starts at the $i$-th position and ends at the $(i+m-1)$-th position in a circular manner. This leads to a reduced training complexity of $O(n \cdot C_m)$. Also, because each document appears in exactly $m$ groups, all documents have equal probability to be compared to others and to be placed as the $k$-th document in a group. With enough training examples, the GSF trained with our sampling strategy asymptotically approaches the GSF trained with all permutations, and it is insensitive to document order.
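The circular subsequence construction can be sketched as follows (the helper name is ours; the upstream shuffle is omitted):

```python
def circular_subsequences(docs, m):
    """Form the n circular subsequences of size m used during training.

    Reduces the group set from n!/(n-m)! permutations to n groups; each
    document appears in exactly m of them, once per within-group position.
    """
    n = len(docs)
    return [[docs[(i + j) % n] for j in range(m)] for i in range(n)]
```

For a shuffled list of four documents and $m = 2$, this yields four groups, wrapping around at the end of the list so every document appears exactly twice.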
4.4.2. Efficient Inference
At inference time, the desired output is a ranked list. This can be produced in a straightforward manner for pointwise scoring functions, but becomes nontrivial for a GSF. To tackle this challenge, we propose inference methods for two scenarios: fixed list size and varying list size.
Inference with Fixed List Size. Given a fixed list size at inference time, the most straightforward solution is to train the DNN model with the same list size $n$. The score for a document can then be computed directly with Equation (6).
Inference with Varying List Size. When it is impossible to fix $n$, or the list size at inference time cannot be determined beforehand, it is hard to use the training-time GSF architecture directly. In this case, we extract the group comparison function $g$ from the GSF and use it for inference. For each document $d$ in the inference list $D$, we compute its score by the expected marginal score

$\hat{y}(d) = \mathbb{E}_{\pi \ni d}\left[g_k(q, \pi)\right] \quad (8)$

where the expectation is over all size-$m$ permutations $\pi$ of $D$ that contain document $d$, $k$ is the position of $d$ in $\pi$, and $g_k$ is the $k$-th value of the function output (also referred to as $\phi_k$ in Equation (5)). For example, when $m = 2$, $\hat{y}(d)$ can be rewritten as

$\hat{y}(d) = \mathbb{E}_{d' \in D}\left[g_1(q, (d, d')) + g_2(q, (d', d))\right]$
The expectation in Equation (8) can be approximated effectively by Monte Carlo methods (Robert and Casella, 2005). For example, we can randomly sample a set of companion documents and average the marginal scores over the sampled groups. The larger the sample size, the better this approximation becomes. In our experiments, we found that a small constant number of sampled groups per document is good enough. Thus, inference can be done efficiently in $O(n)$ time, since the sample size is a predefined constant.
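For $m = 2$, the Monte Carlo approximation of Equation (8) can be sketched as follows; the sample size, seed handling, and toy stand-in for the learned scorer are our illustrative assumptions.

```python
import random

def mc_score(g, d, others, num_samples=8, seed=0):
    """Approximate Eq. (8) for m = 2: average d's marginal score over sampled partners."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        d2 = rng.choice(others)
        total += g((d, d2))[0]        # d in position 1 -> take phi_1
        total += g((d2, d))[1]        # d in position 2 -> take phi_2
    return total / num_samples

# Toy stand-in for the learned scorer g: value minus group mean.
def toy_g(group):
    mean = sum(group) / len(group)
    return [x - mean for x in group]
```

Summing over both positions of $d$ in each sampled pair keeps the estimate invariant to the order in which the pair is fed to $g$.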
In fact, the fixed-list-size case is a special case of the varying-list-size case. When the group set contains all adjacent documents as groups, it is equivalent to sampling $m$ groups for each document, with the document's position varying from 1 to $m$ across these groups. The score of the document is then proportional to the sum of the corresponding values in these groups.
5. Relationship with Existing Models
In this section, we discuss the relationship between existing learning-to-rank algorithms and our proposed models. There are two important hyperparameters for a GSF: the list size $n$ and the group size $m$. By varying $n$ and $m$, we show that several representative algorithms can be formulated as special cases in our framework.
5.1. Pointwise Scoring and Pairwise Loss
Pairwise loss functions with pointwise scoring functions are popular in learning-to-rank algorithms. In these algorithms, the scoring function takes a single document as input and outputs a single score. The loss function takes the scores of two documents and defines the loss based on the consistency between the scores and the preferences of the two documents. In our GSF model, let $n = 2$ and $m = 1$, let $d_1$ and $d_2$ be the two documents in $D$, $y_1$ and $y_2$ be their graded relevance judgments, and $\hat{y}_1$ and $\hat{y}_2$ be the output scores. Then the logistic loss in our GSF model becomes

$\ell = \mathbb{1}(y_1 > y_2) \log\left(1 + e^{-(\hat{y}_1 - \hat{y}_2)}\right) + \mathbb{1}(y_2 > y_1) \log\left(1 + e^{-(\hat{y}_2 - \hat{y}_1)}\right)$

Such a loss is very similar to the one used in LambdaMART (Burges, 2010). Other pairwise losses, such as the hinge loss used in Ranking SVMs (Joachims, 2002), can be defined similarly in Equation (7).
5.2. Pointwise Scoring and Listwise Loss
A traditional listwise model uses a pointwise scoring function with a listwise loss computed over all documents in the candidate list. We show the connection between our method and a representative traditional listwise method, the ListNet method (Cao et al., 2007).
In our GSF model, let $m = 1$ and $n$ be the length of the list. Then each group $\pi = (d_i)$ contains a single document, $g(q, \pi)$ is computed based on document $d_i$ only, and we denote it by $\hat{y}_i$. The softmax loss is

$\ell(Y, \hat{Y}) = -\sum_{i=1}^{n} \frac{e^{y_i}}{\sum_{j=1}^{n} e^{y_j}} \log \frac{e^{\hat{y}_i}}{\sum_{j=1}^{n} e^{\hat{y}_j}}$

In the ListNet approach, a distribution over all permutations of the documents is defined based on the scoring function, and another one is defined using the labels. The loss is the cross entropy between these two distributions. Because such a loss is computationally expensive, a simplified version considers only the marginal probability that a document appears at the first position in the two distributions over permutations. The resultant cross-entropy loss is then:

$\ell = -\sum_{i=1}^{n} P_Y(i) \log P_{\hat{Y}}(i), \quad P_Y(i) = \frac{e^{y_i}}{\sum_{j=1}^{n} e^{y_j}}$

which coincides with the softmax loss above.
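The simplified top-one cross entropy can be sketched as follows (a standard transcription of the ListNet top-one loss; function names are ours):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of values."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # subtract max for stability
    s = sum(exps)
    return [e / s for e in exps]

def listnet_top_one_loss(labels, scores):
    """Cross entropy between the top-one distributions induced by labels and scores."""
    p = softmax(labels)                    # target top-one distribution
    q = softmax(scores)                    # model top-one distribution
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is minimized exactly when the score-induced top-one distribution matches the label-induced one.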
5.3. Pairwise Scoring Functions
The pairwise scoring model was recently proposed by Dehghani et al. (Dehghani et al., 2017); it predicts the preference between two documents based on their features. As far as we know, this is the only published learning-to-rank model that takes more than one document as the input of its scoring function.
In the GSF model (see Sections 4.1 and 4.2), let $n = 2$ and $m = 2$, and let $\pi = (d_1, d_2)$ be a pair of documents. Then

$[\phi_1, \phi_2] = g(q, d_1, d_2)$

If we define the aggregation step (Equation (6)) as

$f(q, D)_d = \sum_{\pi \in \Pi_2(D)} \mathbb{1}(\pi_1 = d) \, g_1(q, \pi) \quad (9)$

where $\mathbb{1}(\pi_1 = d)$ is an indicator function that denotes whether $d$ is the first document in $\pi$, and use a sigmoid cross-entropy loss

$\ell = -\sum_{y_j > y_k} \log \sigma(\hat{y}_j - \hat{y}_k), \quad \sigma(t) = \frac{1}{1 + e^{-t}} \quad (10)$
then it is equivalent to (Dehghani et al., 2017). However, there is a slight difference between this pairwise-input model and our GSF with $n = m = 2$ (GSF(2, 2)). While (Dehghani et al., 2017) used only the first score from each group (pair) of documents, our GSF uses the scores of all documents from the two groups in Equation (6). In addition, our GSF model has the following two mathematically attractive properties that do not hold for (Dehghani et al., 2017).

Reflexive. When the input consists of two identical documents $(d, d)$, the two output scores are always equal.

Antisymmetric. If $[\phi_1, \phi_2]$ are the scores for input $(d_1, d_2)$, then $[\phi_2, \phi_1]$ are the scores for input $(d_2, d_1)$.
6. Experimental Setup
To evaluate our proposed methods, we compare different types of learning-to-rank models on a standard learning-to-rank dataset. In this section, we describe our experimental design and setup.
6.1. Learning-to-Rank Models
The learning-to-rank models we compare in our experiments include both DNN models and tree-based models.
6.1.1. DNN Models
Table 1. DNN models compared in our experiments.

Pointwise Scoring Functions:
- RankNet, GSF(2, 1): GSF with $n = 2$ and $m = 1$. This model is comparable to a standard neural network model with pointwise scoring and pairwise loss, e.g., RankNet (Burges et al., 2005).
- ListNet, GSF($n$, 1): GSF with list size $n$ and $m = 1$. This is closely related to existing models with pointwise scoring and listwise loss, e.g., ListNet (Cao et al., 2007).

Pairwise Scoring Functions:
- GSF(2, 2): GSF with $n = 2$ and $m = 2$. This is a DNN model with pairwise scoring and our proposed list loss functions.

Generic Groupwise Scoring Functions:
- GSF($n$, $m$): The GSF model with list size $n$ and group size $m$.
Table 1 lists all the DNN models that we consider in our experiments. We group these models based on the input to their scoring functions as pointwise, pairwise, and groupwise. In this table, GSF(2, 1) and GSF($n$, 1) represent the existing DNN models with pointwise scoring functions in the current literature. GSF(2, 2) is a GSF model that takes a pair of documents in its scoring function, and GSF($n$, $m$) is the generic GSF model with list size $n$ and group size $m$. We instantiate $n \in \{2, 5\}$ and $m \in \{1, 2\}$ in our experiments.
All DNN models in this paper were built on the 3-layer feed-forward network described in Section 4.1. The hidden layer sizes are set to 256, 128, and 64 for $\mathbf{h}_1$, $\mathbf{h}_2$, and $\mathbf{h}_3$, respectively. We used the TensorFlow toolkit for model implementations.
6.1.2. Tree-based Models
For tree-based models, we primarily use the state-of-the-art MART and LambdaMART (Burges, 2010) models. Furthermore, we also explore a hybrid approach in which the predictions of DNN models are used as input features for tree-based models. We compare hybrid models with both standalone DNN and tree-based models in our experiments.
6.2. Data Sets
The data set used in our experiments is the public LETOR data set MSLR-WEB30K (Qin and Liu, 2013). This is a large-scale learning-to-rank data set that contains 30K queries. On average there are 120 documents per query, and each document has 136 numeric features. All documents are labeled with graded relevance from 0 to 4. We report results on Fold1 as the test set, while using the remaining folds for training.
On the LETOR data set, there are more than a hundred documents per query. Instead of setting the list size $n$ to the largest possible number of documents per query, we conduct the following sampling to form training data for the DNN models. Given an $n$ ($n = 2$ and $n = 5$ are used in our paper), we randomly shuffle the list of documents for each query and then use a rolling window of size $n$ over the shuffled document list to obtain lists of documents with size $n$. For robustness, we reshuffle ten times, and thus obtain a collection of training data for the DNN models.
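This shuffle-and-rolling-window construction can be sketched as follows (function name and seed handling are ours):

```python
import random

def make_training_lists(docs, n, num_shuffles=10, seed=0):
    """Build size-n training lists via repeated shuffling and a rolling window."""
    rng = random.Random(seed)
    lists = []
    for _ in range(num_shuffles):
        shuffled = docs[:]
        rng.shuffle(shuffled)
        for i in range(len(shuffled) - n + 1):   # rolling window of size n
            lists.append(shuffled[i:i + n])
    return lists
```

For a query with 120 documents and $n = 5$, each shuffle yields 116 overlapping lists, and ten reshuffles yield 1,160 training lists for that query.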
6.3. Evaluation Metrics
We train the models on the LETOR dataset with the logistic loss in Equation (7), given its graded relevance labels. The evaluation metric for this data set is the commonly used Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002). In this paper, we use NDCG@5, which truncates the cumulative sum to the top 5 positions. Significance tests are conducted using Student's t-test.
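NDCG@5 with the standard graded-relevance gain $2^y - 1$ and logarithmic position discount can be sketched as follows (our transcription of the standard formula, not the evaluation script used in the paper):

```python
import math

def dcg_at_k(labels, k):
    """DCG@k with gain 2^y - 1 and position discount log2(i + 2) for 0-based i."""
    return sum((2 ** y - 1) / math.log2(i + 2)
               for i, y in enumerate(labels[:k]))

def ndcg_at_k(labels_in_ranked_order, k=5):
    """NDCG@k: DCG of the given ranking divided by the ideal (sorted) DCG."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    return dcg_at_k(labels_in_ranked_order, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0; any inversion among the top positions lowers the metric, and an all-irrelevant list is conventionally scored 0.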
7. Results
In this section, we describe our experimental results in detail. We compare GSFs with pointwise scoring models and with pairwise scoring models trained with pairwise losses. We also compare GSFs with state-of-the-art tree models, both standalone and in a hybrid setting.
Table 2 shows the results of different models on the LETOR data set. For reproducibility, we use the learning-to-rank models RankNet, MART, and LambdaMART as implemented in the open-source RankLib toolkit², and report the actual value of NDCG@5 for each model.

²https://sourceforge.net/p/lemur/wiki/RankLib/
In Table 2(a), we observe that all GSF models outperform RankNet, another DNN-based learning-to-rank model, by a very large margin. Among the different variants of GSF, GSF(2, 2) is better than GSF(2, 1), and GSF(5, 2) is better than GSF(5, 1), which indicates that a group size larger than 1 indeed improves the performance of a GSF model.
In Table 2(b), tree-based models are more competitive than the DNN models in the standalone setting. However, we observe that the hybrid LambdaMART+GSF models outperform the state-of-the-art LambdaMART algorithm. For example, we achieve a 1.5% improvement using GSF(5, 2) over LambdaMART, a statistically significant result. Also, in the hybrid mode, GSFs with $m = 2$ consistently outperform GSFs with $m = 1$. This, again, confirms that, in general, a groupwise scoring function is better than a pointwise scoring function.
Table 2. NDCG@5 on the LETOR data set.

(a) GSF models:
RankNet: 32.28; GSF(2, 1): 40.40; GSF(2, 2): 41.10; GSF(5, 1): 41.10; GSF(5, 2): 41.50

(b) LambdaMART+GSF hybrid models:
MART: 43.51; LambdaMART: 44.23; +GSF(2, 1): 44.51; +GSF(2, 2): 44.69; +GSF(5, 1): 44.60; +GSF(5, 2): 44.90
8. Conclusion
In this paper, we went beyond the traditional pointwise scoring functions and introduced a novel setting of groupwise scoring functions (GSFs) in the learning-to-rank framework. We implemented GSFs using a deep neural network (DNN) that can efficiently handle large input spaces. We showed that GSFs can include several existing learning-to-rank models as special cases. We compared both GSF models and tree-based models on a standard learning-to-rank data set. Experimental results show that GSFs significantly benefit several state-of-the-art DNN and tree-based models, owing to their ability to combine listwise losses and groupwise scoring functions.
Our work also opens up a few interesting future research directions: how to do inference with GSFs in a more principled way using techniques from (Wauthier et al., 2013); how to define GSFs using more sophisticated DNN architectures, such as CNNs, rather than simple concatenation; and how to leverage the more advanced DNN matching techniques proposed in (Mitra et al., 2017; Pang et al., 2016; Guo et al., 2016; Pang et al., 2017), in addition to the standard learning-to-rank features, in our GSFs.
References
 Agrawal et al. (2009) Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying Search Results. In Proc. of the 2nd ACM International Conference on Web Search and Data Mining (WSDM). 5–14.
 Ai et al. (2018) Qingyao Ai, Keping Bi, Jiafeng Guo, and W. Bruce Croft. 2018. Learning a Deep Listwise Context Model for Ranking Refinement. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18). 135–144.
 Bennett et al. (2008) Paul N. Bennett, Ben Carterette, Olivier Chapelle, and Thorsten Joachims. 2008. Beyond Binary Relevance: Preferences, Diversity, and Set-level Judgments. SIGIR Forum 42, 2 (2008), 53–58.
 Borisov et al. (2016) Alexey Borisov, Ilya Markov, Maarten de Rijke, and Pavel Serdyukov. 2016. A Neural Click Model for Web Search. In Proc. of the 25th International Conference on World Wide Web (WWW). 531–541.
 Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proc. of the 22nd International Conference on Machine Learning (ICML). 89–96.
 Burges (2010) Christopher J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report MSR-TR-2010-82. Microsoft Research.
 Burges et al. (2006) Christopher J. C. Burges, Robert Ragno, and Quoc Viet Le. 2006. Learning to Rank with Non-smooth Cost Functions. In Proc. of the 19th International Conference on Neural Information Processing Systems (NIPS). 193–200.
 Cao et al. (2007) Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proc. of the 24th International Conference on Machine Learning (ICML). 129–136.
 Carbonell and Goldstein (1998) Jaime Carbonell and Jade Goldstein. 1998. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 335–336.
 Chapelle et al. (2009) Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In Proc. of the 18th ACM Conference on Information and Knowledge Management (CIKM). 621–630.
 Craswell et al. (2008) Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An experimental comparison of click position-bias models. In Proc. of the 2008 International Conference on Web Search and Data Mining (WSDM). 87–94.
 Dehghani et al. (2017) Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. 2017. Neural Ranking Models with Weak Supervision. In Proc. of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 65–74.
 Diaz (2007) Fernando Diaz. 2007. Regularizing query-based retrieval scores. Information Retrieval 10, 6 (2007), 531–562.
 Dupret and Piwowarski (2008) Georges E. Dupret and Benjamin Piwowarski. 2008. A user browsing model to predict search engine click data from past observations. In Proc. of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 331–338.
 Edizel et al. (2017) Bora Edizel, Amin Mantrach, and Xiao Bai. 2017. Deep Character-Level Click-Through Rate Prediction for Sponsored Search. In Proc. of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 305–314.

 Friedman (2001) Jerome H. Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics 29, 5 (2001), 1189–1232.
 Guo et al. (2016) Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. In Proc. of the 25th ACM International Conference on Information and Knowledge Management (CIKM). 55–64.
 Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data. In Proc. of the 22nd ACM International Conference on Information and Knowledge Management (CIKM). 2333–2338.
 Järvelin and Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20, 4 (2002), 422–446.
 Jiang et al. (2017) Zhengbao Jiang, Ji-Rong Wen, Zhicheng Dou, Wayne Xin Zhao, Jian-Yun Nie, and Ming Yue. 2017. Learning to Diversify Search Results via Subtopic Attention. In Proc. of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 545–554.
 Joachims (2002) Thorsten Joachims. 2002. Optimizing Search Engines Using Clickthrough Data. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 133–142.
 Joachims (2006) Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 217–226.
 Joachims et al. (2005) Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2005. Accurately Interpreting Clickthrough Data As Implicit Feedback. In Proc. of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 154–161.
 Koczkodaj (1996) Waldemar W. Koczkodaj. 1996. Statistically accurate evidence of improved error rate by pairwise comparisons. Perceptual and Motor Skills 82, 1 (1996), 43–48.
 Liu (2009) Tie-Yan Liu. 2009. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3, 3 (2009), 225–331.
 Manning et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
 Mitra et al. (2017) Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to Match Using Local and Distributed Representations of Text for Web Search. In Proc. of the 26th International Conference on World Wide Web (WWW). 1291–1299.
 Pang et al. (2016) Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, and Xueqi Cheng. 2016. A Study of MatchPyramid Models on Ad-hoc Retrieval. (2016). arXiv:1606.04648
 Pang et al. (2017) Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. 2017. DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval. In Proc. of the 2017 ACM Conference on Information and Knowledge Management (CIKM). 257–266.
 Qin and Liu (2013) Tao Qin and Tie-Yan Liu. 2013. Introducing LETOR 4.0 Datasets. (2013). arXiv:1306.2597
 Qin et al. (2008) Tao Qin, Tie-Yan Liu, Xu-Dong Zhang, De-Sheng Wang, and Hang Li. 2008. Global ranking using continuous conditional random fields. In Proc. of the 21st International Conference on Neural Information Processing Systems (NIPS). 1281–1288.
 Robert and Casella (2005) Christian P. Robert and George Casella. 2005. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag.
 Robertson (1977) Stephen E. Robertson. 1977. The probability ranking principle in IR. Journal of Documentation 33, 4 (1977), 294–304.
 Taylor et al. (2008) Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. 2008. SoftRank: Optimizing Non-smooth Rank Metrics. In Proc. of the 2008 International Conference on Web Search and Data Mining (WSDM). 77–86.
 Wang et al. (2018) Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. 2018. Position Bias Estimation for Unbiased Learning to Rank in Personal Search. In Proc. of the 11th International Conference on Web Search and Data Mining (WSDM). 610–618.
 Wauthier et al. (2013) Fabian L. Wauthier, Michael I. Jordan, and Nebojsa Jojic. 2013. Efficient Ranking from Pairwise Comparisons. In Proc. of the 30th International Conference on Machine Learning (ICML). 109–117.
 Xia et al. (2008) Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise approach to learning to rank: theory and algorithm. In Proc. of the 25th International Conference on Machine Learning (ICML). 1192–1199.
 Xia et al. (2016) Long Xia, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2016. Modeling Document Novelty with Neural Tensor Network for Search Result Diversification. In Proc. of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 395–404.
 Xu and Li (2007) Jun Xu and Hang Li. 2007. AdaRank: A Boosting Algorithm for Information Retrieval. In Proc. of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 391–398.
 Ye and Doermann (2013) Peng Ye and David Doermann. 2013. Combining preference and absolute judgements in a crowdsourced setting. In ICML 2013 Workshop on Machine Learning Meets Crowdsourcing.
 Yilmaz et al. (2014) Emine Yilmaz, Manisha Verma, Nick Craswell, Filip Radlinski, and Peter Bailey. 2014. Relevance and effort: An analysis of document utility. In Proc. of the 23rd ACM International Conference on Information and Knowledge Management (CIKM). 91–100.
 Zamani et al. (2017) Hamed Zamani, Michael Bendersky, Xuanhui Wang, and Mingyang Zhang. 2017. Situational Context for Ranking in Personal Search. In Proc. of the 26th International Conference on World Wide Web (WWW). 1531–1540.