1. Introduction
Evaluation is crucial in Information Retrieval (IR). The development of models, tools and methods has significantly benefited from the availability of reusable test collections formed through a standardized and thoroughly tested methodology, known as the Cranfield paradigm [9]. Under the Cranfield paradigm the evaluation of retrieval systems typically involves assembling a document collection, creating a set of information needs (topics), and identifying a set of documents relevant to the topics.
One of the simplifying assumptions made by the Cranfield paradigm is that the relevance judgments are complete, i.e. for each topic all relevant documents in the collection have been identified. When the document collection is large, identifying all relevant documents is difficult due to the immense human labor required. To avoid judging the entire document collection, depth-k pooling [25] is used: a set of retrieval systems (also called runs) ranks the document collection against each topic, and only the union of the top k documents retrieved by each run is assessed by human assessors. Documents outside the depth-k pool are considered irrelevant. Pooling aims at being fair to all runs and relies on a diverse set of submitted runs to provide good coverage of all relevant documents. Nevertheless, the underestimation of recall [30] and the pooling bias generated when reusing these pooled collections to evaluate novel systems that retrieve relevant but unjudged documents [30, 6, 27, 18] are well-known problems.
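As a minimal illustration of depth-k pooling, the sketch below builds the pool for one topic as the union of the top k documents of every run. The function name `depth_k_pool`, the run names and the document ids are hypothetical and serve only as an example.

```python
def depth_k_pool(runs, k):
    """Return the set of documents ranked in the top k of at least one run.

    `runs` maps a run name to its ranked list of document ids (best first).
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool


# Toy example: three hypothetical runs for a single topic.
runs = {
    "runA": ["d1", "d2", "d3", "d4"],
    "runB": ["d2", "d5", "d1", "d6"],
    "runC": ["d7", "d1", "d2", "d3"],
}
pool = depth_k_pool(runs, k=2)
# Documents outside the pool (here d3, d4 and d6) would be treated as irrelevant.
unjudged = {d for ranking in runs.values() for d in ranking} - pool
```

In this toy case a new run that ranks d3 first would be penalized if d3 is in fact relevant, which is the pooling bias discussed above.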
The literature suggests a number of approaches to cope with missing judgments (an overview can be found in [23] and [13]): (1) Defining IR measures that are robust to missing judgments, like bpref [7]. The resulting measures, however, may not precisely capture the notion of retrieval effectiveness one requires, and some have been shown to remain biased [28]. (2) Running a meta-experiment where runs are “left out” from contributing to the pool and measuring the bias experienced by these left-out runs compared to the original pool, which is then used to correct measurements over new retrieval systems [27, 18, 15, 16]. (3) Leaving the design of the evaluation measure unrestricted, but instead introducing a document selection methodology that carefully chooses which documents are to be judged. Methods proposed under this approach belong to two categories: (a) sample-based methods [4, 28, 20, 29, 24], and (b) active-selection methods [10, 2, 19, 17].
Sample-based methods devise a sampling strategy that randomly selects a subset of documents to be assessed; evaluation measures are then inferred on the basis of the obtained sample. Different methods employ different sampling distributions. Aslam et al. [4] and Yilmaz and Aslam [28] use a uniform distribution over the ranked document collection, while Pavlu and Aslam [20] and Yilmaz et al. [29] recognize that relevant documents typically reside at the top of the ranked lists returned by participating runs and use stratified sampling to draw a larger sample from the top ranks. Schnabel et al. [24] also use a weighted importance sampling method on documents, with the sampling distribution optimized for a comparative evaluation between runs. In all the aforementioned work, the experiment that dictates the probability distribution under which documents are sampled is designed in such a way that evaluation measures can be defined as the expected outcome of this experiment. Evaluation measures can then be estimated from the judged documents in the sample. In all cases the sampling distribution is defined at the beginning of the sampling process and remains fixed throughout the experiment. Sample-based methods have the following desirable properties: (1) on average, estimates have no systematic error, (2) past data can be reused by new, novel runs without introducing bias, and (3) sampling distributions can be designed to optimize the number of judgments needed to confidently and accurately estimate a measure.
On the other hand, active-selection methods recognize that systems contributing documents to the pool vary in quality. Based on this observation they bias the selection of documents towards those retrieved by good retrieval systems. The selection process is deterministic and depends on how accurately the methods can estimate the quality of each retrieval system. Judging is performed in multiple rounds: at each round the best system is identified, and the next unjudged document in the ranked list of this system is selected to be judged. The quality of systems is recalculated at the end of each round, as soon as a new judgment becomes available. Active-selection methods include Move-to-Front [10], Fixed-Budget Pooling [17], and Multi-Armed Bandits [19]. Losada et al. [19] consider the problem as an exploration-exploitation dilemma, balancing between selecting documents from the best-quality run and exploring the possibility that the quality of some runs might be underestimated at different rounds of the experiment. The advantage of active-selection methods compared to sample-based methods is that they are designed to identify as many relevant documents as possible, by selecting documents with the highest relevance probability. The disadvantage is that the judging process is not fair to all runs, with the selection of documents being biased towards good-performing runs.
In this paper, we follow a sample-based approach for an efficient large-scale evaluation. Different from past sample-based approaches, we account for the fact that some systems are of higher quality than others, and we design our sampling distribution to over-sample documents from these systems. At the same time, given that our approach is sample-based, the estimated evaluation measures are, by construction, unbiased on average, and judgments can be used to evaluate new, novel systems without introducing any systematic error. The method we propose is therefore an active sampling method, with the probability distribution over documents changing at every round of judgments through the re-estimation of the quality of the participating runs. Accordingly, our solution consists of a sampling step and an estimation step. In the sampling step, we construct a distribution over runs and a distribution over documents in a ranked list, and calculate a joint distribution over documents to sample from. In the estimation step, we use the Horvitz-Thompson estimator to correct for the bias in the sampling distribution and estimate evaluation measure values for all the runs. The estimated measures then dictate the new sampling distribution over systems, and hence a new joint distribution over the ranked collection of documents.
Therefore, the contribution of this paper is a new sampling methodology for large-scale retrieval evaluation that combines the advantages of the sample-based and the active-selection approaches. We demonstrate that the proposed method outperforms state-of-the-art methods in terms of effectiveness, efficiency, and reusability.
2. Active sampling
In this section we introduce our new sampling method.
Symbol  Description 

document collection  
Sample set  
Subset of , only containing unique documents  
Total number of unique documents in  
Total number of contributing runs  
Number of sampling rounds  
Number of unique documents sampled in round  
Number of documents sampled in round  
th document  
Relevance of document  
Rank of document  
Probability of th system run being sampled  
Probability of the document ranked in th system run being sampled  
2.1. Active sampling algorithm
The key idea underlying our sampling strategy is to place a probability distribution over runs and a probability distribution over documents in the ranked lists of the runs, and iteratively sample documents from the joint distribution. At each round, we sample a set of documents from the joint probability distribution (batch sampling) and request relevance judgments by human assessors. The judged documents are then used to update the probability distribution over runs. The process is repeated until we reach a fixed budget of human assessments (Figure 1).
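The iterative procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `joint_distribution`, `active_sampling`, `judge` and `update_run_probs` are hypothetical, the budget is counted in unique judged documents, and the last batch may overshoot the budget by a few documents.

```python
import random


def joint_distribution(run_probs, rank_probs, runs):
    """Selection probability of each document under the joint distribution.

    Sums, over the runs that retrieved the document, P(run) * P(rank | run).
    """
    p = {}
    for name, ranking in runs.items():
        for rank, doc in enumerate(ranking, start=1):
            p[doc] = p.get(doc, 0.0) + run_probs[name] * rank_probs[name][rank - 1]
    return p


def active_sampling(runs, rank_probs, judge, budget, batch_size,
                    update_run_probs=None, seed=0, max_rounds=10_000):
    """Sample batches with replacement until `budget` unique documents are judged."""
    rng = random.Random(seed)
    run_probs = {name: 1.0 / len(runs) for name in runs}  # uniform prior over runs
    all_docs = {d for ranking in runs.values() for d in ranking}
    budget = min(budget, len(all_docs))
    judged = {}    # document -> relevance label
    history = []   # (selection probabilities, batch size) of every round
    for _ in range(max_rounds):
        if len(judged) >= budget:
            break
        p = joint_distribution(run_probs, rank_probs, runs)
        docs = list(p)
        batch = rng.choices(docs, weights=[p[d] for d in docs], k=batch_size)
        for doc in batch:
            if doc not in judged:          # a document is judged only once
                judged[doc] = judge(doc)
        history.append((p, batch_size))
        if update_run_probs is not None:   # assumed hook: re-estimate run quality
            run_probs = update_run_probs(runs, judged, history)
    return judged, history


# Toy usage with two hypothetical runs and a simulated assessor.
runs = {"runA": ["d1", "d2", "d3"], "runB": ["d2", "d4", "d1"]}
rank_probs = {name: [0.5, 0.3, 0.2] for name in runs}
relevant = {"d1", "d4"}
judged, history = active_sampling(
    runs, rank_probs, judge=lambda d: int(d in relevant), budget=3, batch_size=3)
```

The per-round selection probabilities are kept in `history` because the inclusion probabilities needed at estimation time depend on the probabilities used in every round, not only the last one.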
The process is illustrated in Algorithm 1, while Table 1 shows the notation used throughout the paper. Initially, we provide a prior distribution over runs, a prior distribution over the ranks of the documents for each run, and the document collection. Given that we have no prior knowledge of the system quality, it is reasonable to use a uniform probability distribution over runs, i.e. every run is assigned the same probability. At each round, we calculate the selection probability of each document (that is, the probability that the document is selected at each sampling time), and then sample documents on the basis of this distribution. We use sampling with replacement with varying probabilities to sample documents, which is closely related to how we calculate the unbiased estimators and is described in Section 3. The sampled documents are then judged by human assessors, and the newly judged documents are added to the sample set, which is used to update the posterior distribution over runs.

2.2. Distribution over runs
The distribution over runs determines the probability of sampling documents from each run. Similar to active-selection methods, we make the assumption that good systems retrieve more relevant documents at the top of their rankings compared to bad systems. Based on this assumption we wish to over-sample from the rankings of good systems.
Any distribution that assigns a higher probability to better-performing systems could be used here. In this work we use, as system weights, the estimated performance of the retrieval systems on the basis of the relevance judgments accumulated at each round of assessments, and normalize these weights to obtain a probability distribution over runs. Different evaluation measures can be used to estimate the performance of each run after every sampling round. Here we define a probability distribution proportional to the estimated average precision introduced in Section 3.2.
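The normalization step can be sketched as below. The function name `run_distribution` and the small `floor` value are assumptions made for this illustration: the floor keeps a run with an estimated average precision of zero from being excluded from all later rounds, and it is not taken from the paper.

```python
def run_distribution(estimated_ap, floor=1e-6):
    """Turn per-run estimated average precision into a distribution over runs.

    `estimated_ap` maps a run name to its current estimate. `floor` is an
    assumed safeguard so that every run keeps a non-zero probability.
    """
    weights = {run: max(ap, floor) for run, ap in estimated_ap.items()}
    total = sum(weights.values())
    return {run: w / total for run, w in weights.items()}


# Toy usage: the run with the higher estimate is sampled more often.
probs = run_distribution({"runA": 0.3, "runB": 0.1})
```

With no judgments yet, all estimates are zero and the result falls back to the uniform prior.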
Figure 2 demonstrates the accuracy of the estimated (normalized) average precision at the end of four sampling rounds compared to the (normalized) average precision when the entire document collection (or, to be more accurate, the depth-100 pool for topic 251 in TREC 5) is used. At every round the estimates (denoted with circular markers of different sizes for different rounds) better approximate the target values (denoted with a line). The details of the measure approximations are provided in Section 3.
2.3. Distribution over document ranks
The distribution over document ranks for a system determines the probability of sampling a document at a certain rank of the ordered list returned by a run. The underlying assumption that defines this probability distribution is that runs satisfy the Probability Ranking Principle (PRP) [22], which dictates that the probability of relevance monotonically decreases with the rank of the document. Hence, it is natural to assume that the probability of sampling a document at a given rank is a function of that rank and monotonically decreases with it. Once again, any distribution that agrees with PRP can be used; researchers have used a number of such distributions (e.g. see Aslam et al. [3], Pavlu and Aslam [20], Hofmann et al. [11]).
In this work we consider the AP-prior distribution proposed by Aslam et al. [3] and Pavlu and Aslam [20], which aims to define the probability at each rank on the basis of the contribution of this rank to the calculation of average precision. The intuition is that, when average precision is rewritten as a sum over pairs of ranks, the implicit weight associated with a rank can be obtained by summing the weights associated with all pairs involving that rank. The AP-prior distribution is then a function of the rank of a document and the total number of documents in the collection. Similar to Aslam et al. [3], Pavlu and Aslam [20] and all other sample-based methods, this distribution is defined at the beginning of the sampling process and remains fixed throughout the experiment.
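As an illustration of such a rank distribution, the sketch below assumes the AP-prior takes the form W(r) = (1 / 2N) (1 + 1/r + 1/(r+1) + ... + 1/N), which is the weight obtained when the sum of precisions is written over pairs of ranks in the cited AP-prior work. The exact expression used in this paper is not recoverable from the text and may differ (for example, a logarithmic approximation), so the function `ap_prior` is an assumption.

```python
def ap_prior(n):
    """Assumed AP-prior over ranks 1..n.

    W(r) = (1 / (2n)) * (1 + 1/r + 1/(r+1) + ... + 1/n).
    The returned list holds W(1), ..., W(n) and sums to one.
    """
    probs = [0.0] * n
    tail = 0.0  # running value of 1/r + 1/(r+1) + ... + 1/n
    for r in range(n, 0, -1):
        tail += 1.0 / r
        probs[r - 1] = (1.0 + tail) / (2.0 * n)
    return probs


# Toy usage: probabilities over the top 100 ranks of a run.
rank_probs = ap_prior(100)
```

The weights decrease monotonically with the rank, in agreement with the Probability Ranking Principle, and no extra normalization is needed because the sum of W(r) over all ranks equals one.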
3. Retrieval performance estimator
In this section, we discuss the estimation of evaluation measures on the basis of the sampling procedure described in Algorithm 1. We first calculate the inclusion probabilities of each document in the collection, and then demonstrate how these probabilities can be used by a Horvitz-Thompson estimator to produce unbiased estimators of the population mean, and subsequently of some popular evaluation measures. The Horvitz-Thompson estimator, together with the calculated inclusion probabilities, can be used to calculate the majority of the evaluation measures used in IR; in this paper we focus on three of them: precision at a cutoff, average precision, and R-precision. Other measures can be derived in similar ways (e.g. see Table 1 in Schnabel et al. [24]).
3.1. Sampling with replacement with varying probabilities
Sampling procedure. At each round of the iterative sampling process described in Algorithm 1, a batch of documents is sampled, with replacement, from the document collection. At each round, the unconditional probability of sampling a document (selection probability) is the one defined in Step 2 of Algorithm 1; within a round, the selection probabilities form a probability distribution over the documents in the collection.
The sample set consists of all the documents drawn over all rounds. The probability of a document being sampled at least once by the end of the sampling process (first-order inclusion probability) accounts for the varying selection probabilities across different rounds, and so does the probability of any two different documents both being sampled (second-order inclusion probability).
For the details of the derivation of the inclusion probabilities the reader can refer to Thompson [26]. Using these inclusion probabilities together with the Horvitz-Thompson estimator allows us to construct unbiased estimators for different evaluation measures in IR.
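A minimal sketch of the two inclusion probabilities follows, assuming the standard expressions for independent draws with replacement: a document is missed in a round with probability (1 - p)^n, where p is its selection probability and n the number of draws in that round. The function names and the `history` structure (one pair of selection probabilities and number of draws per round) are hypothetical.

```python
def first_order_inclusion(history, doc):
    """Probability that `doc` is drawn at least once over all rounds.

    `history` is a list of (selection_probabilities, number_of_draws) pairs,
    one per round. pi_i = 1 - prod_k (1 - p_i^k) ** n_k.
    """
    miss = 1.0
    for p, n_draws in history:
        miss *= (1.0 - p.get(doc, 0.0)) ** n_draws
    return 1.0 - miss


def second_order_inclusion(history, doc_i, doc_j):
    """Probability that both `doc_i` and `doc_j` are drawn at least once.

    pi_ij = pi_i + pi_j - (1 - prod_k (1 - p_i^k - p_j^k) ** n_k).
    """
    miss_both = 1.0  # probability that neither document is ever drawn
    for p, n_draws in history:
        miss_both *= (1.0 - p.get(doc_i, 0.0) - p.get(doc_j, 0.0)) ** n_draws
    return (first_order_inclusion(history, doc_i)
            + first_order_inclusion(history, doc_j)
            - (1.0 - miss_both))


# Toy usage: one round with two draws over a two-document collection.
history = [({"a": 0.5, "b": 0.5}, 2)]
pi_a = first_order_inclusion(history, "a")
pi_ab = second_order_inclusion(history, "a", "b")
```

The second-order probability follows from inclusion-exclusion: the probability that at least one of the two documents is drawn equals one minus the probability that neither is drawn.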
Horvitz-Thompson estimator of population total. Horvitz and Thompson [12] propose a general sampling theory for constructing unbiased estimators of population totals. With any sampling design, with or without replacement, the unbiased Horvitz-Thompson estimator of the population total is the sum, over the subset of the sample set that contains only unique documents, of each document's value divided by its first-order inclusion probability.
An unbiased estimator of the variance of the population total estimator can be computed from the first-order and second-order inclusion probabilities. For the details of these derivations the reader can refer to Thompson [26].
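The estimator of the total and its variance estimator can be sketched as below, assuming the standard textbook forms. The function names `ht_total` and `ht_variance` and their argument layout are hypothetical; `pi2` is any callable that returns the second-order inclusion probability of a pair of documents, which must be positive for every pair in the sample.

```python
def ht_total(values, pi):
    """Horvitz-Thompson estimate of the population total.

    `values` maps each unique sampled document to its observed value and
    `pi` maps a document to its first-order inclusion probability.
    """
    return sum(values[d] / pi[d] for d in values)


def ht_variance(values, pi, pi2):
    """Estimate of the variance of `ht_total` (standard unbiased form)."""
    docs = list(values)
    var = sum((1.0 - pi[d]) / pi[d] ** 2 * values[d] ** 2 for d in docs)
    for a in range(len(docs)):
        for b in range(a + 1, len(docs)):
            i, j = docs[a], docs[b]
            pij = pi2(i, j)
            # Each unordered pair appears twice in the double sum over i != j.
            var += 2.0 * (pij - pi[i] * pi[j]) / (pi[i] * pi[j] * pij) * values[i] * values[j]
    return var


# Toy usage: two unique sampled documents, both relevant (value 1).
pi = {"a": 0.75, "c": 0.36}
sample_values = {"a": 1.0, "c": 1.0}
total_estimate = ht_total(sample_values, pi)
variance_estimate = ht_variance(sample_values, pi, lambda i, j: 0.2)
```

Dividing by the inclusion probability up-weights documents that were unlikely to be sampled, which is what removes the bias introduced by over-sampling the top ranks of good runs.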
3.2. Evaluation metrics
In this work we consider three of the most popular evaluation measures in IR: precision at a certain cutoff, PC(r), average precision, AP, and R-precision, RP. We first clarify the exact expressions of the evaluation metrics with regard to the population, then introduce the estimators of these evaluation metrics on the sample set. Consider a population of documents, and let each document have an indicator variable that equals 1 if the document is relevant and 0 otherwise. The population total is the sum of all indicator variables, i.e. the total number of relevant documents in the collection, while the population mean is the population total divided by the population size. If the population of documents considered is the set of documents ranked in the top r for some run, then the population mean is the precision at cutoff r. Based on this definition, precision at cutoff r is the fraction of the top r documents that are relevant, average precision is the sum of the precisions at the ranks of the relevant documents divided by the total number of relevant documents R, and R-precision is the precision at rank R.
Suppose that we have sampled a set of documents, with associated relevance labels. We wish to estimate the total number of relevant documents in the collection, R, as well as PC(r), AP and RP. Note that AP and RP, like many other normalized evaluation measures, are ratios. For these measures, similar to previous work [20], we can estimate the numerator and denominator separately; while this ratio estimator is not guaranteed to be unbiased, the bias tends to be small and to decrease with an increasing sample size [26, 21].
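The estimators of the four aforementioned quantities (R, PC(r), AP and RP) are obtained by replacing each population total in their definitions with its Horvitz-Thompson estimate; those of R and PC(r) are unbiased, while those of AP and RP are ratio estimators.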
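One way to assemble these estimators is sketched below. It is an illustration under assumptions, not the paper's exact formulas: the function name `estimate_measures` and its arguments are hypothetical, precision inside AP is itself replaced by its estimate, and the estimated R is rounded to the nearest integer to obtain a cutoff for R-precision.

```python
def estimate_measures(ranking, judged, pi, cutoff):
    """Horvitz-Thompson style estimates of R, PC(cutoff), AP and RP for one run.

    ranking: list of document ids returned by the run (best first).
    judged:  maps each unique sampled document to 1 (relevant) or 0.
    pi:      maps each sampled document to its first-order inclusion probability.
    """
    # Estimated number of relevant documents in the whole collection.
    r_hat = sum(y / pi[d] for d, y in judged.items())

    def pc_hat(r):
        # Estimated precision at cutoff r: estimated total in the top r, over r.
        return sum(judged[d] / pi[d] for d in ranking[:r] if d in judged) / r

    rank_of = {d: i for i, d in enumerate(ranking, start=1)}
    # Numerator of AP: estimated sum of precisions at relevant retrieved documents.
    ap_num = sum(y / pi[d] * pc_hat(rank_of[d])
                 for d, y in judged.items() if d in rank_of)
    ap_hat = ap_num / r_hat if r_hat > 0 else 0.0
    # Assumption: round the estimated R to get an integer cutoff for RP.
    rp_hat = pc_hat(max(1, round(r_hat))) if r_hat > 0 else 0.0
    return {"R": r_hat, "PC": pc_hat(cutoff), "AP": ap_hat, "RP": rp_hat}


# Toy usage: every document judged with inclusion probability one,
# so the estimates coincide with the exact measures.
ranking = ["d1", "d2", "d3", "d4"]
judged = {"d1": 1, "d2": 0, "d3": 1, "d4": 0}
pi = {d: 1.0 for d in ranking}
measures = estimate_measures(ranking, judged, pi, cutoff=2)
```

Because AP and RP divide by the estimated R, they are ratio estimators and inherit the small bias discussed above.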
4. Experiment setup
In this section we introduce our research questions, the statistics we use to evaluate the performance of the proposed estimators, and the data sets and baselines used in our experiments. The implementation of the algorithm and the experiments can be found at https://github.com/dli1/activesampling. The batch size for all the experiments has been set to 3.
4.1. Research questions
In the remainder of the paper we aim to answer the following research questions:
RQ1: How does active sampling perform compared to other sample-based and active-selection methods regarding bias and variance in the calculated effectiveness measures?
RQ2: How fast do active sampling estimators approximate the actual evaluation measures compared to other sample-based and active-selection methods?
RQ3: Is the test collection generated by active sampling reusable for new runs that do not contribute to the construction of the collection?
The aforementioned questions allow us to have a thorough examination of the effectiveness as well as the robustness of the proposed method.
4.2. Statistics
To answer the research questions put forward in the previous section, we need to quantify the performance of different document selection methods.
Our first goal is to measure how close the estimate of an evaluation measure is to its actual value when the full judgment set is used. Assume that a document selection algorithm chooses a set of documents to calculate an evaluation measure, yielding an estimated value of the measure for each run. Assume also that the actual value of that evaluation measure, computed on the full judgment set, is known for each run. The root mean squared error (rms) of the estimator over a sample set measures how close on average the estimated and the actual values are. We follow the definition in [20]: the rms error is the square root of the mean squared difference between the estimated and the actual values.
To further decompose the estimation errors made by different methods we also calculate the bias and the variance, decomposed from the mean squared error (mse) between the estimator and the corresponding actual value. Bias expresses the extent to which the average estimator over all sample sets differs from the actual value of a measure, while variance expresses the extent to which the estimator is sensitive to the particular choice of a sample set (see [5]). The mse, i.e. the expected squared difference between the estimator and the actual value, can be rewritten by adding and subtracting the expected value of the estimator inside the square, which yields a sum of two terms. The first term is the squared bias and the second the variance of the estimator. Both terms are then aggregated over all runs.
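The decomposition can be checked numerically with the sketch below. The function name `error_statistics` is hypothetical, and averaging the per-run terms over runs is an assumption about how the statistics are aggregated.

```python
import math


def error_statistics(estimates, actual):
    """Squared bias, variance, mse and rms of an estimator, averaged over runs.

    estimates: one dict per sample set, mapping a run to its estimated value.
    actual:    maps a run to the value computed on the full judgment set.
    """
    n_samples = len(estimates)
    bias_sq = variance = mse = 0.0
    for run, true_value in actual.items():
        values = [e[run] for e in estimates]
        mean = sum(values) / n_samples
        bias_sq += (mean - true_value) ** 2
        variance += sum((v - mean) ** 2 for v in values) / n_samples
        mse += sum((v - true_value) ** 2 for v in values) / n_samples
    n_runs = len(actual)
    return {
        "bias2": bias_sq / n_runs,
        "variance": variance / n_runs,
        "mse": mse / n_runs,
        "rms": math.sqrt(mse / n_runs),
    }


# Toy usage: two sample sets, two runs.
estimates = [{"runA": 0.2, "runB": 0.5}, {"runA": 0.4, "runB": 0.7}]
actual = {"runA": 0.3, "runB": 0.5}
stats = error_statistics(estimates, actual)
```

In this toy case runA is estimated without bias but with some variance, while runB is over-estimated on average, and the mse equals the sum of the squared bias and the variance.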
A second measurement we are interested in is how far the inferred ranking of systems when estimating an evaluation measure is from the actual ranking of systems when the entire judged collection is used. Following previous work [2, 4, 20, 28, 29] we also report Kendall’s τ between estimated and actual rankings. Even though Kendall’s τ is an important measure when it comes to comparative evaluation, rms error remains our focus, since test collections have found use not only in the evaluation of retrieval systems but also in learning retrieval functions [14]. In the latter case, for some algorithms, the accuracy of the estimated values is more important than just the correct ordering of systems.
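For reference, a minimal sketch of Kendall's τ between estimated and actual system scores is given below. It implements the simple tau-a variant, which ignores ties in the numerator; whether the experiments use this or a tie-corrected variant is not stated, so this choice is an assumption, as is the function name `kendall_tau`.

```python
def kendall_tau(estimated, actual):
    """Kendall's tau-a between two equally long lists of system scores."""
    n = len(estimated)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (estimated[i] - estimated[j]) * (actual[i] - actual[j])
            if sign > 0:
                concordant += 1   # the pair is ordered the same way in both lists
            elif sign < 0:
                discordant += 1   # the pair is swapped
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy usage: estimated and actual MAP of four hypothetical runs.
tau = kendall_tau([0.30, 0.25, 0.10, 0.20], [0.32, 0.21, 0.12, 0.22])
```

A value of 1 means the estimated scores rank the systems exactly as the actual scores do, and -1 means the ranking is fully reversed.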
4.3. Test collections
We conduct our experiments on the TREC 5–8 Ad-Hoc and TREC 9–11 Web tracks. The details of the data sets can be found in Table 2. In our experiments we did not exclude any participating run, and we considered the relevance judgments released by NIST (qrels) as the complete set of judgments over which the actual values of the measures are computed.
TREC     Task type  Topics   # runs  # rel docs  # judgments  # rel docs per query  # judgments per query
TREC 5   Ad-hoc     251–300  61      5524        133681       110.48                2673.6
TREC 6   Ad-hoc     301–350  74      4611        72270        92.22                 1445.4
TREC 7   Ad-hoc     351–400  103     4674        80345        93.48                 1606.9
TREC 8   Ad-hoc     401–450  129     4728        86830        94.56                 1736.6
TREC 9   Web        451–500  104     2617        70070        52.34                 1401.4
TREC 10  Web        501–550  97      3363        70400        67.26                 1408.0
TREC 11  Web        551–600  69      1574        56650        31.48                 1133.0
4.4. Baselines
We use two active-selection methods and one sample-based method as baselines:
Move-to-Front (MTF) [10]. MTF is a deterministic, iterative selection method. At the first round, all runs are given equal priorities. At each round, the method selects the run with the highest priority and obtains the judgment of the first unjudged document in the ranked list of the given run. If the document is relevant, the method selects the next unjudged document, and continues until a non-relevant document is judged. When that happens, the priority of the current run is reduced and the run with the highest priority is selected next.
Multi-armed Bandits (MAB) [19]. Similar to MTF, MAB aims to find as many relevant documents as possible. MAB casts document selection as a multi-armed bandit problem and, different from MTF, it randomly decides whether to select documents from the best run at the current stage or to sample a document across the entire collection. For the MAB baseline we used the best method, MM-NS, with its default setting reported in [19] (code available at http://tec.citius.usc.es/ir/code/poolingbandits.html).
Stratified sampling [20]. Stratified sampling is a stochastic method based on importance sampling. The probability distribution over documents used is the AP-prior distribution, which remains unchanged throughout the sampling process. Similar to our approach, the Horvitz-Thompson estimator is used to estimate the evaluation metrics. The stratified sampling approach proposed by Pavlu and Aslam [20] has been used in the construction of the TREC Million Query track collection [8]; it outperforms methods using uniform random sampling [4, 28] and demonstrates similar performance to Yilmaz et al. [29].
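To make the contrast with sampling concrete, the Move-to-Front description above can be sketched as a short deterministic loop. This follows the textual description only: the function name `move_to_front`, the choice of reducing the priority by one, and breaking ties by run order are assumptions of the sketch and are not taken from [10].

```python
def move_to_front(runs, judge, budget):
    """Select up to `budget` documents for judging, following the MTF idea.

    runs:  maps a run name to its ranked list of document ids (best first).
    judge: callable returning a truthy value for relevant documents.
    """
    priority = {name: 0 for name in runs}   # equal priorities at the start
    position = {name: 0 for name in runs}   # next rank to inspect in each run
    judged = {}
    while len(judged) < budget:
        active = [m for m in runs if position[m] < len(runs[m])]
        if not active:
            break  # every ranked list has been exhausted
        current = max(active, key=lambda m: priority[m])  # ties: first run wins
        while len(judged) < budget and position[current] < len(runs[current]):
            doc = runs[current][position[current]]
            position[current] += 1
            if doc in judged:
                continue  # skip documents already judged through another run
            judged[doc] = judge(doc)
            if not judged[doc]:
                priority[current] -= 1  # a non-relevant document demotes the run
                break
    return judged


# Toy usage with two hypothetical runs and a simulated assessor.
runs = {"runA": ["d1", "d2", "d3"], "runB": ["d4", "d5", "d6"]}
relevant = {"d1", "d2", "d4"}
judged = move_to_front(runs, judge=lambda d: d in relevant, budget=4)
```

No randomness is involved, so repeating the procedure yields the same judged set, which is why the variance of such a method is zero while its estimates carry a systematic error.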
5. Results and analysis
5.1. Bias and Variance
This first experiment is designed to answer RQ1 and is conducted on TREC 5.
We reduce the retrieved document lists of all runs to the top-100 ranks (so that all documents in the ranked lists are judged) and consider these the ground truth rankings, based on which the actual values of MAP, RP and P@30 are calculated. The judgment effort is set to 10% of the depth-100 pool for each query, and different methods are used to obtain the corresponding subset and calculate the estimated MAP, RP and P@30 for each run. For any stochastic method (i.e. the sampling methods and MAB) the experiment is repeated 30 times. Based on the estimated and actual values we calculate the mse and its decomposition into bias and variance for each estimator.
Figure 3 shows a number of scatter plots for MTF, MAB, Stratif (stratified sampling), and our method denoted as Active (active sampling). Each point in the plots corresponds to a given run. To declutter the figure, the points shown for the sample-based methods are computed over a single sample. An unbiased estimator should lead to points that lie on the x=y line. As can be observed, the active sampling estimates are the ones closest to the diagonal. As expected, and by construction, precision is unbiased, while the bias introduced in the ratio estimators of AP and RP is smaller than that of all active-selection methods, and comparable to that of the stratified sampling method.
A decomposition of the mse into bias and variance can be found in Figure 4. As expected, the variance of the active-selection methods is zero (or close to zero), since MTF is a deterministic method, while the randomness of MAB lies only in the decision between exploration and exploitation. Active sampling has a much lower variance than stratified sampling, which demonstrates one of the main contributions of our sampling method: biasing the sampling distribution towards good-performing runs improves the estimation of the evaluation measures. The bias of the sample-based methods, as expected, is near zero, while it is smaller than zero for the active-selection methods, since they do not correct for their preference to select documents from good-performing runs. For example, the bias on P@30 of the active-selection methods is much smaller than zero, because the greedy strategies only count the number of relevant documents and thus underestimate P@30, while the sampling methods avoid the problem by using unbiased estimators. This demonstrates the second main contribution of our approach: using sampling avoids any systematic error in the calculation of measures. Therefore, the proposed sampling method indeed combines the advantages of both sample-based and active-selection methods that have been proposed in the literature.
5.2. Effectiveness
This second experiment is designed to answer RQ2 and is conducted on TREC 5–11. In this experiment we vary the judgment effort from 1% to 20% of the depth-100 pool. At each sampling percentage, when sample-based methods are used, we first calculate the rms error and Kendall’s τ values for a given sample and then average these values over 30 sample sets.
Figure 5 shows the average rms error and Kendall’s τ values at different sample sizes. For all TREC tracks active sampling demonstrates a lower rms error than stratified sampling, MTF, and MAB for sampling rates greater than 3–5%. At lower sampling rates active-selection methods show an advantage compared to sample-based methods, which suffer from high variance. Regarding Kendall’s τ, active sampling outperforms all methods for TREC 5–8 for sampling rates greater than 5%, while for TREC 10 and 11 it picks up at sampling rates greater than 10%. TREC 10 and 11 are the two collections with the smallest number of relevant documents per query, hence finding these documents using active-selection methods leads to a better ordering of systems when the percentage of judged documents is very small. For those small percentages the sample-based methods demonstrate high variance, and the outcome depends heavily on the particular sample of documents drawn. The variance of rms error and Kendall’s τ across the 30 different samples drawn in this experiment for the estimation of MAP on TREC 11 can be seen in Figure 6.
Overall, when comparing active sampling with MTF and MAB, we find that our method outperforms them regarding rms error. This indicates once again that the calculated inclusion probabilities and the Horvitz-Thompson estimator allow active sampling to produce an unbiased estimation of the actual value of the evaluation measures. When comparing active sampling with stratified sampling, both of which use the Horvitz-Thompson estimator, we find that our method outperforms stratified sampling regarding Kendall’s τ. This indicates that the dynamic strategy we employ is beneficial compared to a static sampling distribution. Therefore, active sampling indeed combines the advantages of both methods.
5.3. Reusability
Constructing a test collection is a laborious task, hence it is very important that the proposed document selection methods construct test collections that can be used to evaluate new, novel algorithms without introducing any systematic errors. This experiment is designed to answer RQ3 and is conducted on TREC 5–11. In this experiment we split the runs into contributing runs and left-out runs. Using the contributing runs we construct a test collection for each different document selection method. We then calculate the estimated measures for all runs, including those that were left out of the collection construction experiment. In our experiment, we use a one-group-out split of the runs. Runs that contributed to the sampling procedure come from different participating groups. Groups often submit different versions of the same retrieval algorithm, hence, typically, all the runs submitted by the same participating group differ very little in the ranking of the documents. To ensure that left-out runs are as novel as possible we leave out all runs of a given group. We compute rms error and Kendall’s τ considering both participating and left-out runs.
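The one-group-out split can be sketched as follows. The function name `one_group_out_splits` and the mapping from runs to groups are hypothetical, and iterating over every group in turn is an assumption about how the experiment is repeated.

```python
def one_group_out_splits(run_groups):
    """Yield (group, contributing_runs, left_out_runs) for each group.

    `run_groups` maps a run name to the participating group that submitted it.
    All runs of the held-out group are left out together.
    """
    for group in sorted(set(run_groups.values())):
        left_out = [run for run, g in run_groups.items() if g == group]
        contributing = [run for run, g in run_groups.items() if g != group]
        yield group, contributing, left_out


# Toy usage: three runs submitted by two hypothetical groups.
run_groups = {"run1": "groupA", "run2": "groupA", "run3": "groupB"}
splits = list(one_group_out_splits(run_groups))
```

Holding out whole groups rather than single runs keeps near-duplicate runs of the same group from contributing documents that would make the left-out run look less novel than it is.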
Figure 7 shows the average rms error and Kendall’s τ values at different sample sizes, using the latter afore-described option to isolate the effect of the different document selection methods on new, novel systems. In general, the trends observed in Figure 5 can also be observed in Figure 7, with active sampling outperforming all other methods regarding rms error and Kendall’s τ for sampling rates greater than 5%. For sampling rates lower than 5% in collections with very few relevant documents per topic (such as TREC 10 and 11) the active-selection methods perform better than the sample-based methods; however, we can also conclude that at these low sampling rates none of the methods leads to reliably reusable collections.
6. Conclusion
In this paper we consider the problem of large-scale retrieval evaluation. We tackle the problem by devising a sample-based approach, active sampling. Our method consists of a sampling step and an unbiased estimation step. In the sampling step, we construct two distributions, one over retrieval systems that is updated at every round of relevance judgments, giving larger probabilities to better-quality runs, and one over document ranks that is defined at the beginning of the sampling process and remains static throughout the experiment. Document samples are drawn from the joint probability distribution, and inclusion probabilities are computed at the end of the entire sampling process, accounting for varying probabilities across sampling rounds. In the estimation step, we use the well-known Horvitz-Thompson estimator to estimate evaluation metrics for all system runs.
The proposed method is designed to combine the advantages of two different families of methods that have appeared in the literature: sample-based and active-selection approaches. Similar to the former, our method leads to estimators of evaluation measures that are unbiased by construction, and can safely be used to evaluate new, novel runs that have not contributed to the generation of the test collection. Similar to the latter, our method focuses on good-quality runs, with the hope of identifying more relevant documents and reducing the variability naturally introduced in the estimation of a measure due to sampling.
To examine the performance of the proposed method, we tested it against state-of-the-art sample-based and active-selection methods over seven TREC Ad-Hoc and Web collections, TREC 5–11. Compared to sample-based approaches, such as stratified sampling, our method indeed demonstrated lower variance, while compared to active-selection approaches, such as Move-to-Front and Multi-Armed Bandits, our method, as expected, has lower, near-zero bias. For sampling rates as low as 5% of the entire depth-100 pool, the proposed method outperforms all other methods regarding effectiveness and efficiency and leads to reusable test collections.
Acknowledgements.
This research was supported by the Google Faculty Research Award program. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

References
 Aslam et al. [2003] Javed A. Aslam, Virgiliu Pavlu, and Robert Savell. 2003. A Unified Model for Metasearch, Pooling, and System Evaluation. In Proceedings of the Twelfth International Conference on Information and Knowledge Management (CIKM ’03). ACM, New York, NY, USA, 484–491. DOI:http://dx.doi.org/10.1145/956863.956953
 Aslam et al. [2005] Javed A Aslam, Virgiliu Pavlu, and Emine Yilmaz. 2005. Measurebased metasearch. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 571–572.
 Aslam et al. [2006] Javed A Aslam, Virgil Pavlu, and Emine Yilmaz. 2006. A statistical method for system evaluation using incomplete judgments. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 541–548.
 Bishop [2006] Christopher M Bishop. 2006. Pattern recognition. Machine Learning 128 (2006), 1–58.
 Buckley et al. [2007] Chris Buckley, Darrin Dimmick, Ian Soboroff, and Ellen Voorhees. 2007. Bias and the Limits of Pooling for Large Collections. Inf. Retr. 10, 6 (Dec. 2007), 491–508. DOI:http://dx.doi.org/10.1007/s107910079032x
 Buckley and Voorhees [2004] Chris Buckley and Ellen M. Voorhees. 2004. Retrieval Evaluation with Incomplete Information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’04). ACM, New York, NY, USA, 25–32. DOI:http://dx.doi.org/10.1145/1008992.1009000
 Carterette et al. [2009] Ben Carterette, Virgiliu Pavlu, Evangelos Kanoulas, Javed A. Aslam, and James Allan. 2009. If I Had a Million Queries. In Advances in Information Retrieval, 31th European Conference on IR Research, ECIR 2009, Toulouse, France, April 6-9, 2009. Proceedings (Lecture Notes in Computer Science), Mohand Boughanem, Catherine Berrut, Josiane Mothe, and Chantal Soulé-Dupuy (Eds.), Vol. 5478. Springer, 288–300. DOI:http://dx.doi.org/10.1007/978-3-642-00958-7_27
 Cleverdon [1967] Cyril Cleverdon. 1967. The Cranfield tests on index language devices. In Aslib proceedings, Vol. 19. MCB UP Ltd, 173–194.
 Cormack et al. [1998] Gordon V Cormack, Christopher R Palmer, and Charles LA Clarke. 1998. Efficient construction of large test collections. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 282–289.
 Hofmann et al. [2011] Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2011. A Probabilistic Method for Inferring Preferences from Clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM ’11). ACM, New York, NY, USA, 249–258. DOI:http://dx.doi.org/10.1145/2063576.2063618
 Horvitz and Thompson [1952] Daniel G Horvitz and Donovan J Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 260 (1952), 663–685.
 Kanoulas [2015] Evangelos Kanoulas. 2015. A Short Survey on Online and Offline Methods for Search Quality Evaluation. In Information Retrieval - 9th Russian Summer School, RuSSIR 2015, Saint Petersburg, Russia, August 24-28, 2015, Revised Selected Papers (Communications in Computer and Information Science), Pavel Braslavski, Ilya Markov, Panos M. Pardalos, Yana Volkovich, Dmitry I. Ignatov, Sergei Koltsov, and Olessia Koltsova (Eds.), Vol. 573. Springer, 38–87. DOI:http://dx.doi.org/10.1007/978-3-319-41718-9_3

 Li [2014] Hang Li. 2014. Learning to Rank for Information Retrieval and Natural Language Processing, Second Edition. Morgan & Claypool Publishers. DOI:http://dx.doi.org/10.2200/S00607ED2V01Y201410HLT026
 Lipani et al. [2016a] Aldo Lipani, Mihai Lupu, and Allan Hanbury. 2016a. The Curious Incidence of Bias Corrections in the Pool. Springer International Publishing, Cham, 267–279. DOI:http://dx.doi.org/10.1007/978-3-319-30671-1_20
 Lipani et al. [2016b] Aldo Lipani, Mihai Lupu, Evangelos Kanoulas, and Allan Hanbury. 2016b. The Solitude of Relevant Documents in the Pool. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM ’16). ACM, New York, NY, USA, 1989–1992. DOI:http://dx.doi.org/10.1145/2983323.2983891
 Lipani et al. [2017] Aldo Lipani, Mihai Lupu, Joao Palotti, Guido Zuccon, and Allan Hanbury. 2017. Fixed Budget Pooling Strategies based on Fusion Methods. In Proc. of SAC.
 Lipani et al. [2016] Aldo Lipani, Guido Zuccon, Mihai Lupu, Bevan Koopman, and Allan Hanbury. 2016. The Impact of Fixed-Cost Pooling Strategies on Test Collection Bias. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR ’16). ACM, New York, NY, USA, 105–108. DOI:http://dx.doi.org/10.1145/2970398.2970429
 Losada et al. [2016] David E Losada, Javier Parapar, and Álvaro Barreiro. 2016. Feeling lucky?: multi-armed bandits for ordering judgements in pooling-based evaluation. In Proceedings of the 31st annual ACM symposium on applied computing. ACM, 1027–1034.
 Pavlu and Aslam [2007] V Pavlu and J Aslam. 2007. A practical sampling strategy for efficient retrieval evaluation. Technical Report. Northeastern University.
 Raj [1964] Des Raj. 1964. A note on the variance of the ratio estimate. J. Amer. Statist. Assoc. 59, 307 (1964), 895–898.
 Robertson [1997] S. E. Robertson. 1997. Readings in Information Retrieval. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, Chapter The Probability Ranking Principle in IR, 281–286. http://dl.acm.org/citation.cfm?id=275537.275701
 Sanderson [2010] Mark Sanderson. 2010. Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval 4, 4 (2010), 247–375. DOI:http://dx.doi.org/10.1561/1500000009
 Schnabel et al. [2016] Tobias Schnabel, Adith Swaminathan, Peter I. Frazier, and Thorsten Joachims. 2016. Unbiased Comparative Evaluation of Ranking Functions. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (ICTIR ’16). ACM, New York, NY, USA, 109–118. DOI:http://dx.doi.org/10.1145/2970398.2970410
 Sparck Jones and van Rijsbergen [1975] K. Sparck Jones and C.J. van Rijsbergen. 1975. Report on the need for and provision of an ‘ideal’ information retrieval test collection. Technical Report. Computer Laboratory, Cambridge University.
 Thompson [2012] Steven K. Thompson. 2012. Sampling (3 ed.). John Wiley & Sons, Inc., Hoboken, New Jersey.
 Webber and Park [2009] William Webber and Laurence A. F. Park. 2009. Score Adjustment for Correction of Pooling Bias. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’09). ACM, New York, NY, USA, 444–451. DOI:http://dx.doi.org/10.1145/1571941.1572018
 Yilmaz and Aslam [2006] Emine Yilmaz and Javed A. Aslam. 2006. Estimating Average Precision with Incomplete and Imperfect Judgments. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM ’06). ACM, New York, NY, USA, 102–111. DOI:http://dx.doi.org/10.1145/1183614.1183633
 Yilmaz et al. [2008] Emine Yilmaz, Evangelos Kanoulas, and Javed A. Aslam. 2008. A Simple and Efficient Sampling Method for Estimating AP and NDCG. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’08). ACM, New York, NY, USA, 603–610. DOI:http://dx.doi.org/10.1145/1390334.1390437
 Zobel [1998] Justin Zobel. 1998. How Reliable Are the Results of Large-scale Information Retrieval Experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98). ACM, New York, NY, USA, 307–314. DOI:http://dx.doi.org/10.1145/290941.291014