Efficient Test Collection Construction via Active Learning

01/17/2018 ∙ by Md Mustafizur Rahman, et al. ∙ Qatar University ∙ The University of Texas at Austin

To create a new IR test collection at minimal cost, we must carefully select which documents merit human relevance judgments. Shared task campaigns such as NIST TREC determine this by pooling search results from many participating systems (and often interactive runs as well), thereby identifying the most likely relevant documents in a given collection. While effective, it would be preferable to be able to build a new test collection without needing to run an entire shared task. Toward this end, we investigate multiple active learning (AL) strategies which, without reliance on system rankings: 1) select which documents human assessors should judge; and 2) automatically classify the relevance of remaining unjudged documents. Because scarcity of relevant documents tends to yield highly imbalanced training data for model estimation, we investigate sampling strategies to mitigate class imbalance. We report experiments on four TREC collections with varying scarcity of relevant documents, reporting labeling accuracy achieved, as well as rank correlation when evaluating participant systems using these labels vs. NIST judgments. Results demonstrate the effectiveness of our approach, coupled with further analysis showing how varying relevance scarcity, within and across collections, impacts findings.




1. Introduction

Test collections provide the foundation for Cranfield-based evaluation of information retrieval (IR) systems (Cleverdon, 1967; Sanderson et al., 2010). However, it has become increasingly expensive to manually judge so many documents as collection sizes have grown. At the same time, failing to collect sufficient relevance judgments can compromise evaluation reliability (Voorhees, 2006). Even commercial search engines, despite their query logs, reportedly continue to depend on large teams of human assessors (Google, 2016). Consequently, there has been tremendous interest in developing scalable yet reliable IR evaluation methodology.

To create a new IR test collection at minimal cost, we must carefully select which documents merit human relevance judgments. The dominant way to do this is to run a shared task campaign, such as NIST TREC, and pool search results from many participating systems (and often interactive runs as well) in order to identify the most likely relevant documents for human assessors to judge (Cormack et al., 1998; Carterette et al., 2006; Pavlu and Aslam, 2007). While this approach has been long canonized in IR practice, running a shared task is difficult, slow, and expensive. If one’s goal is merely to build a new test collection, it would be nice if one could do this without needing to run a shared task (Soboroff, 2013).

Toward this end, we investigate active learning (AL) (Settles, 2012) to support test collection construction without reliance on shared task document rankings. In particular, we learn a topic-specific document classification model for each search topic. We investigate two distinct applications of AL. Firstly, we investigate use of AL to select which documents assessors should judge, and we explore two document selection strategies (Cormack and Grossman, 2014): continuous active learning (CAL) and simple active learning (SAL). Secondly, while IR evaluation typically ignores unjudged documents or assumes them to be non-relevant, our supervised AL model gives us the opportunity to automatically classify relevance of all unjudged documents. This ability to use any hybrid combination of human and automatic relevance judgments in evaluation provides a very flexible tradeoff space for balancing cost vs. accuracy in relevance judging.

While we are not the first to pursue automatic or semi-automatic relevance labeling (Carterette and Allan, 2007; Büttcher et al., 2007; Hui and Berberich, 2015; Cormack and Grossman, 2014; Nguyen et al., 2015), prior studies either do not use AL (Carterette and Allan, 2007; Büttcher et al., 2007; Hui and Berberich, 2015) or do not apply it to constructing IR test collections (Cormack and Grossman, 2014; Nguyen et al., 2015). Our study is further distinguished from prior work in our attention to label imbalance. Firstly, we show the deleterious effect highly-skewed training data can have on classifier accuracy (largely ignored in prior work) and investigate sampling strategies (Liu, 2004; Błaszczyński et al., 2013) as a means to ameliorate this. Secondly, we not only measure and study the effects of imbalance across a diverse set of test collections, but also report experiments studying the effect of varying label imbalance across topics within each individual test collection.

Because AL is supervised, an initial seed set of labeled documents is needed to bootstrap AL. We consider two distinct scenarios for how these seed judgments might be obtained: interactive search (IS) and Rank-based Document Selection (RDS). We emphasize that IS and RDS are not competing methods, but alternative scenarios. IS assumes a traditional NIST TREC process in which topic assessors utilize an IS system during lengthy topic creation, producing seed judgments as a by-product. RDS, on the other hand, assumes a scenario like the TREC Million Query Track (Carterette et al., 2009) in which topic formation is extremely brief and assessors are not provided an IS system in which to explore the collection. In this scenario, an off-the-shelf IR system is used instead to produce a single document ranking; assessors then judge documents in this rank-order until enough seed judgments have been collected to kickstart AL.

Reported experiments span four TREC collections, three document selection methods, two scenarios for selecting seed data, and two applications of AL: document selection and automatic document labeling. We study effects of label imbalance in detail, both across diverse collections and within individual collections, and we investigate sampling approaches to ameliorate imbalance. For our hybrid human-automatic labeling approach, we evaluate both labeling accuracy, as well as rank correlation in evaluating IR systems, vs. “ground-truth” NIST judgments. Finally, we investigate choice of classification metric, i.e., how precision vs. recall should be weighted in evaluating classifier effectiveness if our ultimate end-goal is to maximize evaluation reliability (i.e., rank correlation).

The remainder of this paper is organized as follows. Section 2 reviews related work. We then present our approach in Section 3. Next, we describe our experimental setup (Section 4) and results (Section 5). We conclude and discuss future work in Section 6.

2. Related Work

Ever-larger document collections challenge systems-based Cranfield (Cleverdon, 1967) evaluation of IR systems due to needing to collect so many relevance judgments. In response to this challenge, many metrics and methods have been proposed to support incomplete and minimal judging (Buckley and Voorhees, 2004; Aslam et al., 2006; Yilmaz and Aslam, 2006; Carterette et al., 2006; Aslam and Yilmaz, 2007; Pavlu and Aslam, 2007; Yilmaz and Aslam, 2008). However, while many methods now exist to intelligently select which documents to judge, these methods typically assume a shared task context (e.g., TREC) in which document ranking information from many participating systems is available. In contrast, we want to be able to construct a new test collection without needing to run a shared task (Soboroff, 2013).

Büttcher et al. (2007) label relevance using SVM and logistic regression models, and we both report results for the same 2006 Terabyte track run on Gov2. Our results are not directly comparable to theirs because they assume a traditional machine learning setup, whereas we motivate and adopt a finite-pool setting (Section 4.2). However, we effectively reproduce their method as a baseline, using logistic regression, random document selection, and no correction for class imbalance. We show strong improvement over this baseline.

While Hui and Berberich (2015) use document ranking information in their own proposed method, they also reproduce Büttcher et al. (2007)’s SVM method as a baseline, reporting results on the same WebTrack 2013 and 2014 collections we use in this study. However, as with Büttcher et al. (2007), they do not assume a finite-pool scenario and therefore, our results are not directly comparable. That said, our same baseline configuration described above roughly reproduces their SVM-based automatic labeling approach on these collections.

For AL document selection, we evaluate the CAL and SAL methods that Cormack and Grossman (2014) assess in the domain of e-discovery, where they focus on set-based rather than ranked retrieval. They find that CAL is more effective than SAL, but they neither discuss nor use sampling to address imbalance in classifier training data, nor do they investigate how class imbalance contributes to CAL’s superior performance. Judging cost is also measured differently in this domain: no document can be “screened in” automatically since all must be reviewed for privilege following discovery. Finally, they truncate each document to its first bytes.

Similarly, Nguyen et al. (2015) investigate AL-based relevance judging in the domain of systematic-review in medicine, which bears much in common with e-discovery (Lease et al., 2016). As above, AL is used to reduce labeling costs but without intent to construct a test collection or evaluate IR systems based on automatic labels. They also adopt a finite-pool evaluation setting, but unlike us, they use both trusted judges and crowds in combination for human judging.

Baruah et al. (2016) propose two AL-based approaches to reduce the labor that is required for annotating nugget relevance. By computing the likelihood of a sentence containing a nugget, their AL-based approaches develop an ordered-list of the sentence and nugget pairs to be annotated. However, AL is employed here to ease the nugget annotation task without intending to develop a test collection.

Rajput et al. (2012) develop a framework for constructing a test collection using an iterative process between updating nuggets and annotating documents. This process consists of: i) selecting documents to judge based on existing nuggets and document ranking information; ii) extracting nuggets automatically from documents judged relevant; and iii) updating nugget weights based on relevant and non-relevant documents judged so far. However, because their automatic nugget extraction fails to extract nuggets from documents which are difficult to parse (e.g. TREC Web Track), the authors fall back to using document rankings from participating systems. Thus this approach also depends on a shared task evaluation.

Various IR evaluation metrics have been proposed to provide resiliency to incomplete relevance judgments. For example, bpref (Buckley and Voorhees, 2004) approximates Mean Average Precision (MAP) by making clear distinctions between judged non-relevant documents vs. unjudged documents assumed to be non-relevant. Yilmaz and Aslam (2006) propose induced average precision (indAP) which discards unjudged documents, and inferred average precision (infAP) which estimates precision when unjudged documents are encountered. Grönqvist (2005) proposes RankEff to address similar concerns.

A variety of methods have also been proposed to intelligently select which documents retrieved in a shared task should be judged. Pooling is a simple example of this. Move-to-front (MTF) pooling (Cormack et al., 1998) prioritizes for annotation those documents ranked by systems that have already contributed relevant documents. Pavlu and Aslam (2007) reduce the complexity of the method of Aslam et al. (2006) by introducing a sampling distribution which favors documents ranked highly by many systems. Carterette et al. (2006) propose constructing a minimal test collection with which ranking systems can still be judged with confidence.

Crowdsourcing studies have sought to improve work efficiency by shifting the judging burden to online crowds that are cheaper and more scalable (Alonso et al., 2008). Rather than incomplete judging, crowdsourcing studies have largely focused on a different cost vs. quality evaluation tradeoff: the impact of judging errors on evaluation stability. Automatic prediction extends this tradeoff even further.

3. Approach

In describing our approach, we begin by defining our task and learning model (Section 3.1). Following this, Section 3.2 describes the active learning (AL) approaches we pursue. Finally, we describe our sampling strategy for addressing class imbalance in Section 3.3.

3.1. Task Definition and Learning Model

We assume the Cranfield model of system-based IR evaluation that is based on pre-defined search topics and relevance judgments (Voorhees, 2001). In order to construct a hybrid human-machine system for binary relevance judging of collection documents, we induce a topic-specific binary classifier for each search topic $j$ in the topic set. Assume we have a collection of documents, each document $x_i$ represented by extracted features. Let $y_{ij}$ denote the binary relevance judgment for ⟨document $x_i$, topic $j$⟩. The training data for topic $j$ is comprised of a set $R_j$ of pairs $\langle x_i, y_{ij} \rangle$.

For each arbitrary search topic $j$ for which we wish to train a topic-specific classifier, we must collect topic-specific training data. As such, we are not in a “big data” setting in which the recent wave of neural approaches would be well suited. Instead, we adopt logistic regression as our learning model to infer the probability of relevance $P(y_{ij} = 1 \mid x_i)$ for each document $x_i$ for topic $j$:

$$P(y_{ij} = 1 \mid x_i) = \frac{1}{1 + \exp(-\theta_j^{\top} x_i)}$$

with model parameters $\theta_j$. We represent each document by a canonical TF-IDF (Salton and Buckley, 1988) feature representation (see Section 4.1).
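As a minimal sketch of the learning model described above, the following snippet computes a logistic probability of relevance for one sparse document vector. The weight dictionary, the bias term, and the example terms are hypothetical and chosen only for illustration; they do not come from the experiments reported here.

```python
import math

def relevance_probability(weights, bias, features):
    """Logistic regression: sigmoid of the weighted feature sum.

    weights and features are dicts mapping a term to a float (sparse vectors).
    All names are illustrative assumptions.
    """
    score = bias + sum(weights.get(term, 0.0) * value
                       for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical topic-specific weights and one document's TF-IDF features.
topic_weights = {"solar": 1.2, "energy": 0.8, "football": -1.5}
doc_features = {"solar": 0.5, "energy": 0.5}
p = relevance_probability(topic_weights, 0.0, doc_features)
```

A separate weight vector would be learned for each topic, since each classifier is topic-specific.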

Input:  Document collection, batch size u, total budget b
Output: Relevance judgments R_j for each topic j
1   for each topic j do
2       Select seed documents S for topic j
3       R_j ← {⟨x_i, y_ij⟩ | x_i ∈ S}          ▷ Collect initial judgments
4       Learn relevance classifier for topic j using R_j
5       b ← b − |S|                            ▷ Update remaining budget
6   end for
7   for each topic j do
8       if b < u then return                   ▷ Budget exhausted
9       Predict topical relevance of each unjudged document x_i using the classifier for topic j
10      Select u documents S to judge next for topic j
11      R_j ← R_j ∪ {⟨x_i, y_ij⟩ | x_i ∈ S}    ▷ Collect judgments
12      Re-estimate relevance classifier using expanded R_j
13      b ← b − u                              ▷ Update remaining budget
14  end for
Algorithm 1. Active Learning Algorithm

3.2. Active Learning

Traditional (passive) learning assumes that examples to label for training are drawn uniformly at random from the population domain being modeled. While such IID random selection is both simple and guaranteed to faithfully represent the population with increasing fidelity as sample size increases, it is not particularly efficient with regard to optimizing annotator effort. In particular, some examples are far more informative for model updates than others.

In contrast, active learning (Settles, 2012) iteratively selects which document should be labeled next in order to maximize the classifier’s learning curve for each topic. This reduces the amount of human effort required to induce an effective model. Algorithm 1 describes our active learning strategy to develop a test collection. The first loop (Lines 1-6) collects the seed document labels for each topic using either the interactive search or the rank-based document selection strategy (Section 3.2.1). Using these seed documents, a topic-specific document classifier (Section 3.1) is trained. In the second loop (Lines 7-14), the learned classifier is used to select documents for further annotation. Those further annotated documents are employed to re-train the topic-specific classifier. This process continues iteratively until we exhaust the allocated budget.
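The loop described above can be sketched for a single topic as follows. This is an illustration under stated assumptions rather than the implementation used in our experiments: the callables `judge`, `train`, and `select`, and the toy data in the usage lines, are hypothetical stand-ins for the human assessor, the classifier training step, and the document selection criterion.

```python
def active_learning(pool, seeds, judge, train, select, batch_size, budget):
    """Single-topic sketch of the seed-then-iterate loop (names are assumptions).

    pool: list of document ids; seeds: ids judged first;
    judge(doc) -> 0/1 human label; train(judgments) -> scoring function;
    select(scores, k) -> list of up to k document ids to judge next.
    """
    judgments = {doc: judge(doc) for doc in seeds}    # collect seed labels
    budget -= len(seeds)                              # pay for seed labels
    model = train(judgments)
    while budget >= batch_size:
        unjudged = [d for d in pool if d not in judgments]
        if not unjudged:
            break                                     # finite pool fully judged
        scores = {d: model(d) for d in unjudged}      # predicted relevance
        chosen = select(scores, batch_size)
        for doc in chosen:
            judgments[doc] = judge(doc)               # collect new labels
        budget -= len(chosen)                         # update remaining budget
        model = train(judgments)                      # re-estimate classifier
    return judgments, model

# Toy usage: documents are integers and those >= 15 are relevant.
truth = {d: int(d >= 15) for d in range(20)}

def toy_train(judgments):
    rel = [d for d, y in judgments.items() if y == 1]
    center = sum(rel) / len(rel) if rel else 0.0
    return lambda d: 1.0 / (1.0 + abs(d - center))    # nearer to relevant = higher

def top_k(scores, k):                                 # pick highest-scoring documents
    return sorted(scores, key=scores.get, reverse=True)[:k]

labels, _ = active_learning(list(range(20)), [0, 19], truth.get,
                            toy_train, top_k, batch_size=2, budget=8)
```

Passing the selection criterion in as a function keeps the loop identical across selection strategies.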

3.2.1. Seed Document Selection

In order to learn a topic-specific document relevance classifier, topic-specific training data is needed. We assume that no such labeled data for each topic exists in advance (i.e., one cannot anticipate every possible search topic of interest for which a user might wish to search). Consequently, we must collect an initial seed set of human relevance judgments for each search topic in order to initialize our AL model. While we could simply select a (uniform) random sample, assuming large class imbalance, it is unlikely that such random selection would find any relevant documents (imagine randomly sampling documents from the Web in order to find a relevant document for a particular topic).

Instead, we consider two scenarios, motivated by NIST TREC processes from past tracks: 1) interactive search (also known as search-guided assessment (Oard et al., 2004)); and 2) rank-based document selection.

Interactive Search (IS). Consider traditional NIST TREC practice for constructing search topics (Voorhees, 2001, 2016):

  • For the traditional ad hoc tasks, assessors … would create a query and judge about 100 documents… The judging was an intrinsic part of the topic development routine because we needed to know that the topic had sufficiently many (but not too many) relevant … (These judgments made during the topic development phase were then discarded…) Creating a set of 50 topics for a newswire ad hoc collection was budgeted at about 175-225 assessor hours, which works out to about 4 hours per final topic.

Our IS scenario is naturally accommodated by this above process of topic formation. We assume either (i) an assessor has a search topic in mind of their own devising; or (ii) the assessor selects a real user query from some search engine’s query log and then backfits a mental search topic to the observed query. Either way, the assessor then searches the document collection in order to find some minimal set of relevant and non-relevant documents needed to establish the topic as viable. If not, the topic is discarded (topics with too few relevant or non-relevant documents provide little information for A/B comparison of alternative search algorithms). Whereas the NIST process above would traditionally discard these initial judgments, we would instead keep them as the seed set for active learning. As such, we would essentially get seed documents for AL for free as a by-product of topic development.

Rank-based Document Selection (RDS). In contrast with the TREC ad hoc topic creation process described above, the TREC Million Query (MQ) Track (Carterette et al., 2009) used a rather different procedure to develop topics. Queries were sampled from a large search engine query log, and the assessment system showed 10 randomly selected queries to each assessor, who then selected one and converted it into a standard TREC topic by back-fitting a topic description. Carterette et al. (2009) reported that median time for viewing a list of queries was 22 seconds and back-fitting a topic description was 76 seconds. Critically, the MQ track skipped the extended IS process of topic creation described earlier for relevance calibration (Scholer et al., 2013) and instead began immediate judging of assigned documents.

The RDS scenario is meant to support seed data selection for a case like this in which there is no interactive search interface or free judgments available from the preceding topic formation process. Instead, we assume access to some moderately effective (off-the-shelf or in-house) IR system which takes a search query as input, searches the given document collection, and produces a ranked list of documents. The assessor is then asked to proceed down the ranked list until the required minimum numbers of relevant and non-relevant documents have been found, or some maximum effort is reached without success, in which case the topic is discarded (just as NIST traditionally abandoned topics for which IS failed to find a sufficient mix of relevant and non-relevant documents). While RDS resembles traditional pooling in selecting the top-ranked documents for judging, pooling involves fusing results from many participating systems (and often manual runs as well) in order to find enough relevant documents to create a robust topic for reliable IR experimentation. In contrast, we select only seed documents via a single system ranking, then rely on AL to identify relevant documents to create a robust topic.
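A minimal sketch of the RDS procedure follows, assuming a precomputed ranking and a `judge` callable standing in for the assessor. The function name, the thresholds `min_rel` and `min_nonrel`, and the effort cap `max_effort` are hypothetical parameters supplied by the caller; the values used in our experiments are not reproduced here.

```python
def rds_seed_judgments(ranking, judge, min_rel, min_nonrel, max_effort):
    """Judge documents in rank order until enough of each class is found.

    Returns the seed judgments, or None if the effort cap is reached first
    (in which case the topic would be discarded).
    """
    seeds = {}
    n_rel = n_nonrel = 0
    for doc in ranking[:max_effort]:
        label = judge(doc)
        seeds[doc] = label
        if label:
            n_rel += 1
        else:
            n_nonrel += 1
        if n_rel >= min_rel and n_nonrel >= min_nonrel:
            return seeds
    return None

# Toy usage with a made-up ranking and made-up labels.
toy_truth = {"d3": 1, "d7": 1, "d1": 0, "d9": 0}
toy_ranking = ["d3", "d7", "d1", "d9"]
seeds = rds_seed_judgments(toy_ranking, toy_truth.get, 1, 1, 10)
```

Every judgment made while walking down the list is kept as seed data, so the cost of RDS is the number of documents inspected.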

A question for our study is how we should measure and report the different costs of these alternative IS and RDS scenarios. While we assume topic creation is performed as a separate process, and so largely beyond the scope of our study, the ad hoc vs. MQ processes clearly vary greatly in the human effort each requires (i.e., hours vs. minutes per topic). IS conceptually provides seed documents as a free by-product, but only because this cost is incurred earlier. In contrast, RDS involves far less initial overhead in topic creation, but effectively shifts some of this cost downstream to our stage of collecting seed judgments. We return to this issue in Section 4.4.

3.2.2. Pool and Batch Learning

In this work, we assume pool-based active learning (Lewis and Gale, 1994), in which the learner selects the next most informative instance to label from a fixed pool of unlabeled instances (i.e., the set of unlabeled documents for a given collection). We also assume batch learning, in which at each time step, we select a batch of $u$ unlabeled examples to be labeled next.

3.2.3. Document Selection Criteria

We consider three document selection strategies (Cormack and Grossman, 2014): Simple Passive Learning (SPL), Simple Active Learning (SAL), and Continuous Active Learning (CAL).

SPL selects documents uniformly at random. As such, it corresponds to traditional supervised learning in which training data is assumed to be IID from the domain. We include SPL as a baseline against which active selection criteria are benchmarked.

SAL (more commonly referred to as uncertainty sampling (Lewis and Gale, 1994)) selects the document to label next for which the current model is maximally uncertain of its correct label, such that labeling this document is expected to maximally inform the current model. We adopt a common uncertainty function based on entropy (Tang et al., 2002):

$$\mathrm{Uncertainty}(x_i) = -\sum_{y} P(y \mid x_i) \log P(y \mid x_i)$$

where $P(y \mid x_i)$ is the a posteriori probability from the classifier and $y$ is relevant ($y = 1$) or non-relevant ($y = 0$). With binary relevance, SAL selects:

$$x^{*} = \arg\min_{x_i} \left| P(y = 1 \mid x_i) - 0.5 \right|$$

With CAL, the learning algorithm selects the unlabeled document which the current model predicts is most likely to be relevant:

$$x^{*} = \arg\max_{x_i} P(y = 1 \mid x_i)$$
While SAL is more commonly used in AL, Cormack and Grossman (2014) find that CAL is more effective. However, their finding is motivated by their task goal of finding as many relevant documents as possible, rather than from a modeling standpoint of how to best train a learner. We believe label imbalance is key here. When there is large class imbalance (i.e., label skew), a primary concern is finding examples of the rare class to which a learner can be exposed. With IR, relevant documents are the proverbial needles in the haystack of class imbalance. IR researchers have long been familiar with using relevance feedback (Rocchio, 1971) to automatically expand a query (and thereby better model relevance for the underlying topic), and CAL appears to follow a similar strategy. Section 5.1 reports analysis of CAL’s effectiveness as a function of class imbalance.
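The three selection criteria can be sketched as functions over predicted relevance probabilities. This is an illustrative sketch only: the function names and the example probabilities are hypothetical, and batch selection is approximated here by simply taking the top k documents under each criterion.

```python
import math
import random

def entropy(p):
    """Binary entropy of a relevance probability p (0 when p is 0 or 1)."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def select_spl(probs, k, rng=random):
    """SPL: k documents chosen uniformly at random."""
    return rng.sample(sorted(probs), k)

def select_sal(probs, k):
    """SAL: the k documents with the highest predictive entropy."""
    return sorted(probs, key=lambda d: entropy(probs[d]), reverse=True)[:k]

def select_cal(probs, k):
    """CAL: the k documents most likely to be relevant."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

# Made-up probabilities of relevance for four unjudged documents.
probs = {"a": 0.95, "b": 0.55, "c": 0.10, "d": 0.40}
sal_choice = select_sal(probs, 2)   # documents nearest 0.5
cal_choice = select_cal(probs, 2)   # documents nearest 1.0
```

With two classes, ranking by entropy is equivalent to ranking by closeness of the predicted probability to 0.5, which is why SAL picks "b" and "d" here while CAL picks "a" and "b".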

3.2.4. Stopping criteria

We run the active learning algorithm until we exhaust our allocated budget. One can then inspect the average gain curves across the diverse test collections evaluated in order to determine expected classifier effectiveness for any given annotation budget. While we do not explore it here, algorithmic stopping criteria could also be pursued. For example, Cormack (Cormack) propose a heuristic stopping criterion based on “knee-detection” of the classifier gain curve in identifying relevant documents.

3.3. Handling Class Imbalance

Given a training set with large class imbalance, a classifier induced from that training data is unlikely to correctly classify the minority class (Wallace et al., 2011) (e.g., relevant documents). One popular technique to address this is undersampling (Liu, 2004), in which instances from the majority class are discarded in order to restore class balance. However, discarding data risks losing useful information regarding the majority class. An improved version of this, undersampling with bagging (Błaszczyński et al., 2013), builds an ensemble of multiple models trained from different undersampling trials, then predicts the class of an input according to aggregate votes of the multiple models. In a third technique, oversampling (Liu, 2004), minority class instances are duplicated in order to balance the class distribution.

In unreported experiments with all three approaches, oversampling was found to perform best most consistently, and so we adopt it for reported results. We apply it at each stage of AL, after new labels are collected and before the model is retrained.
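One simple form of oversampling is random duplication of minority-class examples until the two classes are the same size, sketched below. The function name, the fixed random seed, and the choice to balance the classes exactly are assumptions made for illustration; the text above does not specify these details.

```python
import random

def oversample(examples, labels, seed=0):
    """Duplicate random minority-class examples until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    if not pos or not neg:
        return list(examples), list(labels)   # nothing to balance against
    minority, label = (pos, 1) if len(pos) < len(neg) else (neg, 0)
    extra = [rng.choice(minority) for _ in range(abs(len(pos) - len(neg)))]
    return list(examples) + extra, list(labels) + [label] * len(extra)

# Toy usage: one relevant document among four non-relevant ones.
X, y = oversample(["r1", "n1", "n2", "n3", "n4"], [1, 0, 0, 0, 0])
```

In an AL loop, such a function would be called on the judged set immediately before each retraining step, leaving the underlying judgments themselves unchanged.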

As a baseline for evaluating the utility of sampling, we also report direct use of full (imbalanced) training data without any sampling, as done by prior work (Büttcher et al., 2006; Cormack and Grossman, 2014; Hui and Berberich, 2015). Section 5.1 reports hybrid labeling results with and without oversampling.

Track      Collection    Topics     #Docs         #Judged    %Rel
WT’14      ClueWeb12     251-300    52,343,021    14,432     39.2%
WT’13      ClueWeb12     201-250    52,343,021    14,474     28.7%
TB’06      Gov2          801-850    25,205,179    31,984     18.4%
Adhoc’99   Disks45-CR    401-450    528,155       86,830     5.4%
Table 1. Test collection statistics. As collections have grown larger, judging budgets have also shrunk, leading to increased prevalence of relevant documents in later tracks.

4. Experimental Setup

4.1. Datasets and Preprocessing

We conduct our experiments on four TREC tracks and datasets (see Table 1): the 2013-2014 Web Tracks (Collins-Thompson et al., 2013, 2015) on ClueWeb12 (lemurproject.org/clueweb12/), the 2006 Terabyte track (Büttcher et al., 2006) on Gov2 (ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm), and the 1999 TREC-8 ad hoc track (Voorhees and Harman, 2000) on TIPSTER disks 4-5 (trec.nist.gov/data/docs_eng.html), excluding the congressional record. Because we assume binary relevance in this work, we collapse NIST graded relevance judgments to binary.

As shown in Table 1, later tracks show increasing prevalence of relevant documents in judged pools, from approximately 5% to almost 40%. This increase stems from the confluence of larger collections having more relevant documents and judging budgets shrinking to shallower pooling across participating system runs.

We use IndriBuildIndex (www.lemurproject.org/indri.php) to parse documents, perform text normalization, remove standard stopwords (Lewis et al., 2004), perform Krovetz stemming (Krovetz, 1993), and output text statistics. We compute unigram TF-IDF features for each document using TfidfVectorizer from Python’s scikit-learn library (sklearn.feature_extraction). Collection documents are finally represented as TF-IDF feature vectors.

While AL experiments would ideally allow the learner to request a label for any document in a given collection, we are limited by only having NIST relevance judgments for those documents in the existing TREC pool for each topic. While we could collect new relevance judgments for documents outside the pool, this could be problematic in that our secondary assessors would likely disagree markedly with the original NIST primary assessor in their conception of relevance criteria for each topic (Voorhees, 2000; Al-Harbi and Smucker, 2014; Wakeling et al., 2016), biasing any evaluation based on those labels. Similarly, simply assuming any unjudged document is non-relevant would also be problematic and introduce noise into our classifier evaluation. (While IR evaluations commonly assume unjudged documents are non-relevant, it is typically assumed that all highly ranked documents have been judged, and that evaluation metrics minimize penalties on low-ranked documents. Evaluating classifier accuracy on unlabeled test data, in contrast, would only introduce unhelpful noise.) Consequently, we limit AL document selection and automatic labeling in our experiments to those documents found in the existing TREC pool for each track. This assumption is least constraining for the TREC-8 Ad Hoc track (Voorhees and Harman, 2000), in which deep pooling resulted in many non-relevant documents being judged (approximately 95% of the pool, per Table 1), and more constraining for later tracks, where larger collections and smaller judging budgets have conspired to yield far higher prevalence of relevant documents in pools (Table 1) than found in general collections.

4.2. Finite-pool Evaluation

Our task goal is to combine human assessors and automatic classification in whatever combination enables us to best judge the relevance of collection documents with maximum accuracy at minimal cost. Not to be confused with pooling of system runs (Voorhees, 2001), this evaluation setting is known as a finite-pool (Wallace et al., 2010; Nguyen et al., 2015) or technology-assisted review (TAR) scenario (Cormack and Grossman, 2014). It differs markedly from a typical, fully-automated machine learning (ML) evaluation. We begin with a finite set of documents needing to be assessed for relevance (i.e., the document collection). As discussed earlier, no labeled data exists in advance for training a topic-specific classifier for each arbitrary search topic. There is also no partition of training vs. testing data; whereas typical ML assumes we wish to create a reusable classifier, using held-out testing data to assess classifier generalization to future unseen examples, our goal here is merely to label the finite pool of collection documents before us (and no others). Classifier reusability and generalization to unseen data are irrelevant.

In fact, use of an automatic classifier is completely optional; one could simply assign all documents to human assessors for manual judging and skip automatic classification altogether. While this would yield perfect classification accuracy (we assume human assessors are infallible in this work), it would also incur steep cost (since we must pay our assessors for each manual judgment). At the other extreme, human assessors could be asked to label only a minimal seed set of documents to train the classifier, which could then be used to automatically label all remaining unjudged documents. While this would provide the lowest cost solution, such spartan training data is unlikely to yield sufficient classifier accuracy to be useful. We present learning curves which map the space between these extremes in balancing accuracy vs. cost (see Section 4.5).
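To make the cost vs. accuracy tradeoff concrete, the sketch below scores a hybrid labeling over a finite pool: human labels are used for judged documents and classifier predictions for all others. The function names and the toy data are hypothetical, and human labels are taken directly from the reference judgments, mirroring the assumption above that assessors are infallible.

```python
def hybrid_labels(truth, judged, predict):
    """Human labels for judged documents, classifier predictions for the rest."""
    return {d: (truth[d] if d in judged else predict(d)) for d in truth}

def accuracy(labels, truth):
    """Fraction of pool documents whose label matches the reference judgment."""
    return sum(labels[d] == truth[d] for d in truth) / len(truth)

# Toy pool of four documents; a trivial "classifier" predicts non-relevant.
truth = {"d1": 1, "d2": 0, "d3": 0, "d4": 1}
judged = {"d1", "d2"}                       # cost = 2 human judgments
labels = hybrid_labels(truth, judged, lambda d: 0)
acc = accuracy(labels, truth)               # d4 is mislabeled
```

Judging the whole pool gives accuracy 1.0 at maximum cost, while each unjudged document trades one unit of cost for the risk of a classifier error.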

4.3. Human-only vs. Hybrid Judging

We evaluate two potential applications of AL to IR test collection construction: 1) selecting which documents human assessors should judge; and 2) further automatically labeling unjudged documents.

Human-only Judging. Because exhaustive manual judging of an entire collection would be cost-prohibitive, only a subset of documents are typically judged by human assessors. Traditional practice uses only these human judgments to evaluate IR systems (Voorhees, 2001). Consequently, we first evaluate the use of active learning (AL) (Settles, 2012) for document selection only, i.e., determining the best set of documents for human assessors to judge in order to best evaluate IR systems on a limited assessment budget.

Hybrid Judging. We also consider using AL to further automatically label the relevance of remaining unjudged documents. We then evaluate whether such hybrid judging improves evaluation of IR systems vs. traditional practice of using only human judgments.

4.4. Seed Document Selection

Section 3.2.1 introduced two scenarios, IS and RDS, for selecting AL seed documents. We now discuss how we implement them and measure cost in our evaluation. As discussed there, each scenario assumes a different preceding process of topic creation, and that preceding process has implications for our downstream process in collecting judgments for those topics. Consequently, we emphasize that it is difficult (and likely inappropriate) to directly compare cost vs. effectiveness of IS and RDS. Our intent is therefore not to present them as competing methods, but to evaluate how AL behaves under each condition.

Interactive Search (IS). As noted earlier, IS conceptually generates seed documents for free as part of topic formation. However, it seems somewhat odd to report no cost for seed judgments when the topic creation process may involve hours of human effort per topic (Voorhees, 2001, 2016). We report cost in this study with regard to the number of human judgments, and while we could assume the cited average of 100 documents judged per topic (Voorhees, 2016), this would still only include the judgments, and not the IS time spent by the assessor. Finally, because NIST has discarded those judgments made during topic formation, we neither have those judgments nor know how many actual judgments were made per topic. There is thus no clearly correct way to account for the cost of IS seed judgments. Our solution here is to be maximally transparent about this challenge and how we handle it. In particular: 1) we assume 5 relevant and 5 non-relevant seed judgments for all topics are produced during topic creation; 2) we randomly sample final relevant and non-relevant NIST judgments to select the ones to use; and 3) we report the cost of these judgments like any other judgments collected during AL (i.e., cost of 10 here). Over all 4 collections, only 5 total topics were found to have fewer than 5 relevant documents, and so only these 5 topics were discarded (consistently across all reported experiments).

Track       MAP     Track Avg.   Track STD.
WT’14       0.181   0.165        0.065
WT’13       0.111   0.115        0.041
TB’06       0.350   0.276        0.089
Adhoc’99    0.260   0.234        0.096
Table 2. MAP scores of systems used for Rank-based Document Selection (RDS) vs. track average and std. deviation.

Rank-based Document Selection (RDS). Recall RDS assumes use of some IR system to rank documents for each topic. The assessor then judges documents in rank order to create labeled seed data to initialize AL. In this work, rather than running our own IR system, we simply sample an existing ranking from each TREC track uniformly at random from the set of participating systems (see Table 2 for statistics of our randomly-selected system vs. statistics of other participating systems). We assume the assessor proceeds down the ranked list until at least 1 relevant and 1 non-relevant document are found, after which we proceed using AL. We report cost as the total number of judgments made down the ranked list until this condition is met. Because the resulting seed labels are typically imbalanced, we apply sampling to remedy this for classifier training.
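The RDS stopping rule and its cost accounting can be sketched as follows. This is an illustrative sketch only: `rds_seed_judgments` and the `is_relevant` callable (standing in for the human assessor, or for a lookup into NIST judgments) are hypothetical names, and the ranking is assumed to contain no duplicate document IDs.

```python
def rds_seed_judgments(ranked_doc_ids, is_relevant):
    """Judge documents in rank order until at least one relevant and one
    non-relevant document have been found.

    Returns (seed_labels, cost), where cost is the number of judgments made.
    """
    seed_labels = {}
    seen_relevant = False
    seen_nonrelevant = False
    for doc_id in ranked_doc_ids:
        label = bool(is_relevant(doc_id))
        seed_labels[doc_id] = label
        seen_relevant = seen_relevant or label
        seen_nonrelevant = seen_nonrelevant or not label
        if seen_relevant and seen_nonrelevant:
            break
    return seed_labels, len(seed_labels)


# Illustrative usage with made-up judgments.
qrels = {"d3": False, "d7": False, "d1": True, "d9": True}
seed, cost = rds_seed_judgments(["d3", "d7", "d1", "d9"], qrels.get)
print(seed, cost)
```

If the ranking is exhausted before both classes are seen, the sketch simply returns everything judged so far; how such topics should be handled is left open here.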

Figure 1. Human judging cost (x-axis) vs. F1 classification accuracy (y-axis) for hybrid human-machine judging of document pools for four TREC Tracks. The % of human judgments on x-axis is wrt. #Judged for each collection (Table 1).

4.5. Evaluation Metrics

Learning Curves and AUC. We present our results as learning curves showing cost vs. effectiveness of each method being evaluated. In particular, we report method effectiveness at varying cost points, as well as the overall Area Under Curve (AUC) effectiveness across all cost points, approximated via the trapezoid rule (computed via Python’s numpy.trapz).
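For readers unfamiliar with the trapezoid rule, the following hand-rolled sketch computes the area under a learning curve from paired cost and effectiveness values. It is a plain-Python equivalent written for illustration (the function name `trapezoid_auc` is our own), and the F1 values in the usage example are made up rather than taken from our results.

```python
def trapezoid_auc(cost, effectiveness):
    """Approximate the area under a learning curve with the trapezoid rule.

    cost: increasing x-values (e.g., fraction of the pool judged by humans)
    effectiveness: y-values measured at each cost point (e.g., F1)
    """
    if len(cost) != len(effectiveness) or len(cost) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    area = 0.0
    for i in range(1, len(cost)):
        width = cost[i] - cost[i - 1]
        mean_height = (effectiveness[i] + effectiveness[i - 1]) / 2.0
        area += width * mean_height
    return area


# Illustrative (made-up) learning curve at 0%, 10%, ..., 100% human judging.
cost_points = [i / 10 for i in range(11)]
f1_scores = [0.40, 0.62, 0.75, 0.84, 0.90, 0.93, 0.95, 0.97, 0.98, 0.99, 1.00]
print(trapezoid_auc(cost_points, f1_scores))
```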

Cost. We measure cost with regard to manual judging budget, i.e., the cost of human judgments (assuming automatic classification is free). We report cost in batch size increments. Specifically, we use 10% of the pool size for each topic as the batch size, reporting results at {0%, 10%, 20%, …, 100%} human judging of each topic’s pool. Note that we assume cost of each human label as constant, whether it be in seed judging (IS or RDS) or during AL.

Labeling Accuracy. We measure our hybrid (human + automatic) AL labeling accuracy in terms of $F_\beta$ (i.e., F-measure), the harmonic mean of precision and recall, as averaged across topics:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \qquad (5)$$

While $\beta = 1$ is typically used by convention, we also analyze results for other settings of $\beta$ in order to establish which value is best aligned with maximizing rank correlation (see Section 5.2.1). This allows us to investigate how precision vs. recall should be weighted in evaluating classifier effectiveness if our ultimate end-goal is to maximize reliability in evaluating IR systems.
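To make the role of β concrete, the sketch below computes the F-measure for one topic and its average across topics, assuming the standard definition F_β = (1 + β²) · P · R / (β² · P + R). The function names and the toy document sets are illustrative assumptions, and returning 0 when there are no true positives is a convention chosen for this sketch.

```python
def f_beta(predicted_relevant, true_relevant, beta=1.0):
    """F-measure for a single topic from sets of document IDs."""
    true_positives = len(predicted_relevant & true_relevant)
    if true_positives == 0:
        return 0.0  # convention for this sketch: no hits means zero score
    precision = true_positives / len(predicted_relevant)
    recall = true_positives / len(true_relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_f_beta(per_topic, beta=1.0):
    """Average F-measure over (predicted_relevant, true_relevant) pairs."""
    scores = [f_beta(pred, truth, beta) for pred, truth in per_topic]
    return sum(scores) / len(scores)


# Illustrative usage: beta < 1 weights precision more, beta > 1 recall more.
predicted = {"a", "b", "c", "d"}
truth = {"a", "b", "e"}
for beta in (0.25, 1.0, 5.0):
    print(beta, round(f_beta(predicted, truth, beta), 3))
```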

Rank Correlation. We also assess the reliability of using our labeling methods to evaluate IR systems. As ground truth, we calculate MAP scores for participating systems using all pool NIST judgments, which induces a relative performance ranking of the participating systems in each track. Each proposed method is used to induce another relative system ranking (discussed below), and we then compute Kendall’s τ rank correlation (Kendall, 1938) between the ground-truth system ranking and our proposed method’s ranking. By convention, τ ≥ 0.9 is assumed to constitute an acceptable correlation level for reliable IR evaluation (Voorhees, 2000).
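The rank correlation computation can be illustrated with a small pairwise implementation of Kendall's τ over two sets of per-system scores. This sketch uses the simple tau-a variant (tied pairs count as neither concordant nor discordant), which is an assumption on our part; the system names and scores are made up.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau (tau-a) between the system rankings induced by two
    dictionaries mapping system name -> effectiveness score."""
    systems = sorted(scores_a)
    if set(systems) != set(scores_b) or len(systems) < 2:
        raise ValueError("both score sets must cover the same >= 2 systems")
    concordant = 0
    discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Illustrative usage: ground-truth MAP vs. scores under a proposed method.
ground_truth = {"s1": 0.30, "s2": 0.25, "s3": 0.20, "s4": 0.10}
proposed = {"s1": 0.28, "s2": 0.29, "s3": 0.15, "s4": 0.05}
print(kendall_tau(ground_truth, proposed))
```

In this toy example, only the pair (s1, s2) is ordered differently by the two score sets, so five of the six pairs are concordant.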

Incomplete Pool Judging. In this setting, we use human judgments only to evaluate IR systems, scored via bpref (Buckley and Voorhees, 2004), which ignores the documents for which no judgment is available. We adopt Soboroff’s corrected bpref formulation (“This… corrects a bug in (Buckley and Voorhees, 2004) and follows the actual implementation in trec_eval…” (Soboroff, 2006)):

$$\text{bpref} = \frac{1}{R} \sum_{r} \left( 1 - \frac{|n \text{ ranked higher than } r|}{\min(R, N)} \right)$$

where we have R documents judged relevant and N judged non-relevant, r is a relevant retrieved document, and n is a member of the first R non-relevant retrieved documents. We compute all IR evaluation metrics via standard trec_eval (trec.nist.gov/trec_eval/).
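As an illustration of how bpref ignores unjudged documents, the sketch below scores a single ranked list given sets of judged relevant and judged non-relevant documents. It assumes the form bpref = (1/R) Σ_r (1 − |n ranked higher than r| / min(R, N)), with the count of non-relevant documents capped at R. It is our own illustrative re-implementation, not the trec_eval code, and the function name and toy data are hypothetical.

```python
def bpref(ranking, relevant, nonrelevant):
    """bpref for one topic; documents in neither set are unjudged and skipped.

    ranking: list of retrieved document IDs, best first
    relevant / nonrelevant: sets of judged document IDs
    """
    num_rel = len(relevant)
    num_nonrel = len(nonrelevant)
    if num_rel == 0:
        return 0.0
    total = 0.0
    nonrel_seen = 0  # judged non-relevant documents ranked above this point
    for doc_id in ranking:
        if doc_id in nonrelevant:
            nonrel_seen += 1
        elif doc_id in relevant:
            if nonrel_seen == 0:
                total += 1.0
            else:
                # Only the first R non-relevant documents are counted.
                penalty = min(nonrel_seen, num_rel) / min(num_rel, num_nonrel)
                total += 1.0 - penalty
    return total / num_rel


# Illustrative usage: "d5" is unjudged and therefore has no effect.
print(bpref(["d1", "d2", "d3", "d4", "d5"], {"d1", "d4"}, {"d2", "d3"}))
```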

Complete Pool Judging. In this other setting, we automatically classify all unjudged pool documents via AL, then evaluate IR systems using MAP (since the pool is completely labeled).

5. Results and Discussion

We now present experimental results. We first evaluate labeling accuracy of our hybrid AL approaches (Section 5.1). We then report Kendall’s τ rank correlation results using AL for (i) human-only (incomplete) judging of pool documents; and (ii) hybrid human-machine labeling of all pool documents (Section 5.2). Finally, we discuss the correlation between F-measure and Kendall’s τ (Section 5.2.1).

5.1. Hybrid Labeling Results

We begin by reporting the F-measure labeling performance of our hybrid AL approaches in the finite-pool scenario. As discussed in Section 3.2.1, we consider two scenarios for how seed documents are selected to initialize AL: IS and RDS. Because scarcity of relevant documents tends to yield highly imbalanced training data for automatic labeling, we also compare oversampling (Section 3.3) vs. using imbalanced data without correction.

Figure 1 presents F1 performance results of the three document selection approaches: SPL, SAL, and CAL. The x-axis of each plot indicates the percentage of pool documents manually judged (using NIST labels), with the remainder of the pool automatically labeled by the classifier. NIST judges are treated as infallible, so all methods ultimately converge to 100% F1 at the right-end of each plot, corresponding to complete manual judging of the entire pool.

Seed Selection Scenarios. As noted earlier, it is difficult (and perhaps inappropriate) to directly compare IS vs. RDS since each depends on different underlying scenario assumptions of how topics are created, and the challenge of fairly reporting comparable costs in the two cases. As a consequence, IS results may appear artificially better than those of RDS since we pay less in our stage of the process to collect seed judgments. Instead, we suggest each scenario be considered separately in the context of its topic creation process. We do not report full analyses of other experimental questions under both scenarios later in the paper simply due to lack of space.

Oversampling vs. Direct Use of Training Data. Comparing the middle and bottom rows of Figure 1, we see that oversampling with IS provides superior performance (in 11 out of 12 cases). This strong performance can be explained by the fact that oversampling the minority class (i.e., relevant documents) helps the classifier develop a more sophisticated model of the relevant class, thus improving classifier performance. Recall that prior work on automatic document labeling (Büttcher et al., 2006; Hui and Berberich, 2015) used training data directly, without any correction for class imbalance.
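One simple way to correct class imbalance is random oversampling with replacement, sketched below: minority-class examples are duplicated until both classes are equally represented in the training data. This is an illustrative assumption about the general technique; the exact oversampling procedure of Section 3.3 may differ, and `oversample_minority` is a hypothetical name.

```python
import random


def oversample_minority(examples, labels, seed=0):
    """Duplicate randomly chosen minority-class examples (with replacement)
    until both classes have the same number of training examples."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y]
    negatives = [i for i, y in enumerate(labels) if not y]
    if not positives or not negatives:
        # Cannot balance a single-class training set; return it unchanged.
        return list(examples), list(labels)
    if len(positives) < len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    indices = list(range(len(labels))) + extra
    return [examples[i] for i in indices], [labels[i] for i in indices]


# Illustrative usage: one relevant document among six judged documents.
docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
rels = [True, False, False, False, False, False]
balanced_docs, balanced_rels = oversample_minority(docs, rels)
print(sum(balanced_rels), len(balanced_rels) - sum(balanced_rels))
```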

Active vs. Passive Learning. Comparing active learning (SAL and CAL) against passive learning (SPL) methods, for 11 of the 12 different plots of Figure 1, SAL and CAL consistently outperform SPL in terms of AUC. We also see that for TB’06 and Adhoc’99 collections, both CAL and SAL with IS and oversampling perform comparably, requiring around 30% (TB’06) and 40% (Adhoc’99) of human judgments to achieve 90% F1 measure. In contrast, SPL requires 80% (TB’06) and 70% (Adhoc’99). Interestingly, in later rank correlation experiments (Section 5.2), SPL will be seen to fare far better when IR systems are evaluated using only human judgments (but not when using hybrid judgments).

CAL vs. SAL. It is clearly evident from Figure 1 that CAL consistently provides better performance than SAL in terms of AUC. For example, using only 10% to 20% of human judgments, CAL achieves higher F1 in almost every plot. This finding is consistent with that of Cormack and Grossman (2014), despite the various differences between our studies discussed earlier.

Figure 2. Varying % of relevant documents per topic in each collection’s document pool. Table 1 also reports the mean.

Figure 3. F1 labeling accuracy across test collections when binning topics by % of relevant documents in each topic’s judgment pool. All results use CAL, IS, and oversampling.

Varying Scarcity of Relevant Documents. Prevalence of relevant documents can vary widely across different test collections, as well as across topics within a single test collection. For example, Figure 2 shows that WT’14 has the highest average prevalence (around 40%), while Adhoc’99 has only 5% average prevalence of relevant documents. Looking at Rows 1-2 of Figure 1, we see that across the 4 collections, varying prevalence plays an important role in explaining the differing performance of AL vs. passive learning. For TB’06 and Adhoc’99, where we have low prevalence rate (less than 20%), SAL and CAL outperform SPL by a large margin. However, with the higher prevalence rates in WT’13 and WT’14, SPL performs much better, though it is still outperformed by CAL. Even with a very low prevalence rate, AL with oversampling can be very effective. Another notable observation is that as we move from higher prevalence collections (e.g., WT’14) to lower prevalence collections (e.g., Adhoc’99), CAL’s AUC consistently increases; the same does not always hold for SAL.

To further investigate effects of prevalence, we binned topics for each collection according to their prevalence rate. This yielded five clusters for WT’14, WT’13, and TB’06, and three clusters for Adhoc’99 (due to its narrower range of prevalence across topics). Figure 3 plots F1 labeling accuracy as a function of prevalence bin, assuming CAL, IS, and oversampling. Surprisingly, we see the highest AUC achieved when we have the lowest prevalence for all collections except WT’13. It appears that as there are more relevant documents to be found, it becomes increasingly difficult to model the relevant class’s nuances without further training data.
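A binning step of this kind can be sketched as follows, assuming equal-width bins over the observed range of per-topic prevalence. The binning rule, the function name `bin_topics_by_prevalence`, and the toy prevalence values are all assumptions made for illustration.

```python
def bin_topics_by_prevalence(prevalence, n_bins):
    """Group topics into n_bins equal-width bins by prevalence.

    prevalence: dict mapping topic ID -> fraction of pool documents relevant
    Returns a list of n_bins lists of topic IDs (lowest prevalence first).
    """
    low = min(prevalence.values())
    high = max(prevalence.values())
    width = (high - low) / n_bins or 1.0  # guard against zero range
    bins = [[] for _ in range(n_bins)]
    for topic, value in sorted(prevalence.items()):
        # The topic with maximum prevalence falls into the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        bins[index].append(topic)
    return bins


# Illustrative usage with made-up prevalence values.
toy = {"t1": 0.02, "t2": 0.10, "t3": 0.21, "t4": 0.38, "t5": 0.42}
print(bin_topics_by_prevalence(toy, 4))
```

Labeling accuracy would then be averaged over the topics within each bin.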

Figure 4. Kendall’s rank correlation achieved with and without automatic labeling. Top Row: hybrid human-classifier labeling, evaluating systems by MAP. Bottom Row: human judgments only, evaluating systems by bpref (Buckley and Voorhees, 2004) due to incomplete pool judging. Ground truth ranking is induced by system MAP scores using all NIST judgments.

5.2. Rank Correlation Results

As has been discussed, we consider two applications of AL for aiding IR test collection construction: 1) selecting documents for human judging (only); and 2) automatically labeling unjudged documents. We evaluate these two approaches here in turn. All results in this section assume IS and oversampling.

Using Only Human Judgments. We first consider evaluating IR systems using only human judgments of the documents selected by AL. Since only a subset of the document pool is judged, systems are scored using bpref (Buckley and Voorhees, 2004). Figure 4’s bottom row presents Kendall’s rank correlation results. The x-axis indicates the percentage of the pool judged, and the y-axis indicates correlation. We plot results for CAL and SAL AL strategies, as well as baseline SPL. Surprisingly, SPL achieves highest correlation for 3 of the 4 collections, excepting only Adhoc’99, where CAL initially performs best before all methods eventually converge.

Using Hybrid Labeling. We next consider the second condition (2) of automatically labeling unjudged documents. Since all pool documents are labeled either manually or automatically, we evaluate IR systems using MAP. Results are presented in the top row of Figure 4. Unlike human-only judging, here we do see superior performance of the AL methods. In fact, for low prevalence collections (e.g. TB’06 and Adhoc’99), AL with hybrid labeling far exceeds performance of the SPL passive learning. Recall that prior work on automatic labeling assumed passive learning (Büttcher et al., 2006; Hui and Berberich, 2015).

Human vs. Hybrid Labeling. Table 3 collects the best AUC performance among the three protocols in each plot in Figure 4. Interestingly, results are mixed, suggesting that neither human nor hybrid labeling is always superior to the other. Moreover, varying prevalence does not appear to explain these mixed results: the hybrid approach works better for the collections at the two extremes of prevalence (WT’14 and Adhoc’99). Consequently, it would seem that further experimentation and analysis are needed.

Labeling      WT’14   WT’13   TB’06   Adhoc’99
Hybrid        87.9    82.4    84.7    87.8
Human-only    83.0    85.6    84.8    85.6
Table 3. Average (AUC) Kendall’s rank correlation achieved, with and without automatic labeling, by the best performing AL methods in Figure 4 (see its AUC results).

5.2.1. Which F-Measure Should We Maximize?

Finally, we consider the following question. If one is ultimately interested in providing labels which are maximally useful for evaluating IR systems, but one does not have system rankings or ground-truth labels for directly measuring (and optimizing) rank correlation, which surrogate classification metric should one strive to optimize? While we reported labeling accuracy via F1 in Section 5.1, we largely chose F1 because it is a simple and canonical classification metric. However, it is not obvious whether its equal weighting of precision vs. recall is actually the optimal metric to maximize if one’s true goal is to maximize rank correlation in IR evaluation.

To investigate the above question, we computed F-Measure (Equation 5) for several different settings of β, varying its emphasis on precision vs. recall, assuming CAL, IS, and oversampling. We then measured Pearson correlation between resulting labeling scores vs. Kendall’s τ rank correlation. Due to space constraints, we do not plot the Pearson correlation achieved at each cost point, but instead just report the Pearson AUC for each setting of β.
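The correlation step can be illustrated with a plain Pearson correlation between two paired series, e.g., F-measure values and Kendall's τ values measured at the same cost points. This is a generic sketch of Pearson's r only; it does not reproduce the AUC aggregation described above, and the paired values in the example are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Illustrative (made-up) paired measurements at five cost points.
f_scores = [0.55, 0.70, 0.82, 0.90, 0.96]
tau_scores = [0.61, 0.72, 0.80, 0.88, 0.93]
print(round(pearson(f_scores, tau_scores), 3))
```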

Surprisingly, results in Table 4 show that the canonical setting of β = 1 in F1 appears to achieve the highest average correlation with τ across collections (though the range of average scores is fairly small). Based on this result, we recommend that any follow-on work proposing a potentially better hybrid labeling system and wanting to establish empirical improvement, but without measuring rank correlation, should optimize F1, as it is most correlated with τ on average; one can thus reasonably expect that improvements in F1 will translate to better correlation in evaluating IR systems.

Collection   β=0.25   β=0.50   β=1.0   β=3.0   β=5.0
WT’14        0.787    0.782    0.769   0.680   0.602
WT’13        0.969    0.971    0.976   0.978   0.955
TB’06        0.857    0.889    0.941   0.990   0.993
Adhoc’99     0.926    0.945    0.969   0.988   0.990
Average      0.885    0.897    0.914   0.909   0.885
Table 4. Pearson correlation between Kendall’s τ rank correlation and classification F-measure when varying β, assuming CAL, IS, and oversampling.

6. Conclusion and Future Work

While many AL strategies (Settles, 2012) have been proposed for general text classification, little research has considered the utility of AL for helping tackle the constant, large-scale work of collecting IR relevance judgments (Cormack and Grossman, 2014; Nguyen et al., 2015). Nearly all existing test collection construction algorithms (Aslam et al., 2006; Carterette and Allan, 2005; Carterette et al., 2006; Cormack et al., 1998; Pavlu and Aslam, 2007) presume availability of shared task system (and possibly interactive) runs in order to identify potentially relevant documents for human assessors to judge. As has been recently discussed (Soboroff, 2013), it would be nice to be able to construct new IR test collections without having to go to all of the trouble of organizing a shared task evaluation.

Our approach presents various limitations and opportunities for further improvement. Because we only have NIST relevance judgments for pool documents, our experiments were constrained to these existing pools. On one hand, this constraint under-estimates AL potential since there is less flexibility in document selection. On the other hand, since the pools themselves derive from existing shared task evaluations, our goal of avoiding any reliance on a shared task is somewhat undermined by these data constraints. A separate concern is that predicting relevance via a classifier introduces the obvious risk that the choice of classifier might favor some ranking systems whose underlying ranking strategy is the same as that of the classifier used for predicting relevance.

An interesting direction for future work would be to explore methods for using non-pool documents to further augment training data, thereby improving classifier accuracy, but without changing pool data used for testing. This might involve collecting secondary assessor judgments for documents outside the pool, inferring document relevance from fused participant rankings, simply assuming non-relevance outside the pool, or forgoing the need for relevance labels altogether by adopting a semi-supervised approach. Another interesting direction would be to investigate varying the budget allocated across topics based on expected topic difficulty, e.g., by the estimated number of relevant documents for each topic vs. actual budget expenditures thus far.

7. Acknowledgments

This work was made possible by NPRP grant# NPRP 7-1313-1-245 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors. We thank the Texas Advanced Computing Center (TACC) at the University of Texas at Austin for computing resources enabling this research.


  • Al-Harbi and Smucker (2014) Aiman L Al-Harbi and Mark D Smucker. 2014. A qualitative exploration of secondary assessor relevance judging behavior. In Proceedings of the 5th Information Interaction in Context Symposium. ACM, 195–204.
  • Alonso et al. (2008) Omar Alonso, Daniel E. Rose, and Benjamin Stewart. 2008. Crowdsourcing for relevance evaluation. ACM SIGIR Forum 42, 2 (2008), 9–15.
  • Aslam et al. (2006) Javed A Aslam, Virgil Pavlu, and Emine Yilmaz. 2006. A statistical method for system evaluation using incomplete judgments. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 541–548.
  • Aslam and Yilmaz (2007) Javed A Aslam and Emine Yilmaz. 2007. Inferring document relevance from incomplete information. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. ACM, 633–642.
  • Baruah et al. (2016) Gaurav Baruah, Haotian Zhang, Rakesh Guttikonda, Jimmy Lin, Mark D Smucker, and Olga Vechtomova. 2016. Optimizing Nugget Annotations with Active Learning. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 2359–2364.
  • Błaszczyński et al. (2013) Jerzy Błaszczyński, Jerzy Stefanowski, and Łukasz Idkowiak. 2013. Extending bagging for imbalanced data. In Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013. Springer, 269–278.
  • Buckley and Voorhees (2004) Chris Buckley and Ellen M Voorhees. 2004. Retrieval evaluation with incomplete information. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 25–32.
  • Büttcher et al. (2006) Stefan Büttcher, Charles LA Clarke, and Ian Soboroff. 2006. The TREC 2006 Terabyte Track. In TREC, Vol. 6. 39.
  • Büttcher et al. (2007) Stefan Büttcher, Charles LA Clarke, Peter CK Yeung, and Ian Soboroff. 2007. Reliable information retrieval evaluation with incomplete and biased judgements. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 63–70.
  • Carterette and Allan (2005) Ben Carterette and James Allan. 2005. Incremental test collections. In Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 680–687.
  • Carterette and Allan (2007) Ben Carterette and James Allan. 2007. Semiautomatic evaluation of retrieval systems using document similarities. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. ACM, 873–876.
  • Carterette et al. (2006) Ben Carterette, James Allan, and Ramesh Sitaraman. 2006. Minimal test collections for retrieval evaluation. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 268–275.
  • Carterette et al. (2009) Ben Carterette, Virgil Pavlu, Evangelos Kanoulas, Javed A Aslam, and James Allan. 2009. If I had a million queries. In European conference on information retrieval. Springer, 288–300.
  • Cleverdon (1967) Cyril Cleverdon. 1967. The Cranfield tests on index language devices. In Aslib proceedings, Vol. 19. MCB UP Ltd, 173–194.
  • Collins-Thompson et al. (2013) Kevyn Collins-Thompson, Paul N. Bennett, Fernando Diaz, Charlie Clarke, and Ellen M. Voorhees. 2013. TREC 2013 Web Track Overview. In Proceedings of The Twenty-Second Text REtrieval Conference, TREC 2013, Gaithersburg, Maryland, USA, November 19-22, 2013.
  • Collins-Thompson et al. (2015) Kevyn Collins-Thompson, Craig Macdonald, Paul Bennett, Fernando Diaz, and Ellen M Voorhees. 2015. TREC 2014 web track overview. Technical Report. DTIC Document.
  • Cormack (Cormack) Gordon V Cormack. Waterloo (Cormack) Participation in the TREC 2015 Total Recall Track.
  • Cormack and Grossman (2014) Gordon V Cormack and Maura R Grossman. 2014. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 153–162.
  • Cormack et al. (1998) Gordon V Cormack, Christopher R Palmer, and Charles LA Clarke. 1998. Efficient construction of large test collections. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 282–289.
  • Google (2016) Google. 2016. Search Quality Rating Guidelines. Inside Search: How Search Works (March 28 2016). www.google.com/insidesearch/.
  • Grönqvist (2005) Leif Grönqvist. 2005. Evaluating latent semantic vector models with synonym tests and document retrieval. In ELECTRA workshop: Methodologies and evaluation of lexical cohesion techniques in real-world applications beyond bag of words. Citeseer, 86–88.
  • Hui and Berberich (2015) Kai Hui and Klaus Berberich. 2015. Selective labeling and incomplete label mitigation for low-cost evaluation. In International Symposium on String Processing and Information Retrieval. Springer, 137–148.
  • Kendall (1938) Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1/2 (1938), 81–93.
  • Krovetz (1993) Robert Krovetz. 1993. Viewing morphology as an inference process. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 191–202.
  • Lease et al. (2016) Matthew Lease, Gordon V. Cormack, An Thanh Nguyen, Thomas A. Trikalinos, and Byron C. Wallace. 2016. Systematic Review is e-Discovery in Doctor’s Clothing. In Proceedings of the Medical Information Retrieval (MedIR) Workshop at the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Lewis and Gale (1994) David D Lewis and William A Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval. Springer-Verlag New York, Inc., 3–12.
  • Lewis et al. (2004) David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. SMART stopword list. Journal of Machine Learning Research (2004).
  • Liu (2004) Alexander Yun-chung Liu. 2004. The effect of oversampling and undersampling on classifying imbalanced text datasets. Ph.D. Dissertation. Citeseer.
  • Nguyen et al. (2015) An Thanh Nguyen, Byron C Wallace, and Matthew Lease. 2015. Combining crowd and expert labels using decision theoretic active learning. In Third AAAI Conference on Human Computation and Crowdsourcing.
  • Oard et al. (2004) Douglas W Oard, Dagobert Soergel, David Doermann, Xiaoli Huang, G Craig Murray, Jianqiang Wang, Bhuvana Ramabhadran, Martin Franz, Samuel Gustman, James Mayfield, et al. 2004. Building an information retrieval test collection for spontaneous conversational speech. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 41–48.
  • Pavlu and Aslam (2007) V Pavlu and J Aslam. 2007. A practical sampling strategy for efficient retrieval evaluation. Technical Report. College of Computer and Information Science, Northeastern University.
  • Rajput et al. (2012) Shahzad Rajput, Matthew Ekstrand-Abueg, Virgil Pavlu, and Javed A Aslam. 2012. Constructing test collections by inferring document relevance via extracted relevant information. In Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 145–154.
  • Rocchio (1971) Joseph John Rocchio. 1971. Relevance feedback in information retrieval. (1971).
  • Salton and Buckley (1988) Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information processing & management 24, 5 (1988), 513–523.
  • Sanderson et al. (2010) Mark Sanderson et al. 2010. Test collection based evaluation of information retrieval systems. Foundations and Trends® in Information Retrieval 4, 4 (2010), 247–375.
  • Scholer et al. (2013) Falk Scholer, Diane Kelly, Wan-Ching Wu, Hanseul S Lee, and William Webber. 2013. The effect of threshold priming and need for cognition on relevance calibration and assessment. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 623–632.
  • Settles (2012) Burr Settles. 2012. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6, 1 (2012), 1–114.
  • Soboroff (2006) Ian Soboroff. 2006. Dynamic test collections: measuring search effectiveness on the live web. In Proc. SIGIR. 276–283.
  • Soboroff (2013) Ian M Soboroff. 2013. Building Test Collections (without running a community evaluation). In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. 1132–1132. https://isoboroff.github.io/Test-Colls-Tutorial/Tutorial-slides/.
  • Tang et al. (2002) Min Tang, Xiaoqiang Luo, and Salim Roukos. 2002. Active learning for statistical natural language parsing. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 120–127.
  • Voorhees (2006) E.M. Voorhees. 2006. Overview of the TREC 2005 robust retrieval track. In Proc. TREC.
  • Voorhees (2000) Ellen M Voorhees. 2000. Variations in relevance judgments and the measurement of retrieval effectiveness. Information processing & management 36, 5 (2000), 697–716.
  • Voorhees (2001) Ellen M Voorhees. 2001. The philosophy of information retrieval evaluation. In Workshop of the Cross-Language Evaluation Forum for European Languages. Springer, 355–370.
  • Voorhees (2016) Ellen M Voorhees. 2016. Personal communication.
  • Voorhees and Harman (2000) Ellen M. Voorhees and Donna Harman. 2000. Overview of the Eighth Text REtrieval Conference (TREC-8). 1–24.
  • Wakeling et al. (2016) Simon Wakeling, Martin Halvey, Robert Villa, and Laura Hasler. 2016. A comparison of primary and secondary relevance judgements for real-life topics. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval. ACM, 173–182.
  • Wallace et al. (2010) Byron C. Wallace, Kevin Small, Carla E. Brodley, and Thomas A. Trikalinos. 2010. Active Learning for Biomedical Citation Screening. In Proceedings of 16th ACM SIGKDD Intl. Conference on Knowledge Discovery & Data Mining. 173–182.
  • Wallace et al. (2011) Byron C Wallace, Kevin Small, Carla E Brodley, and Thomas A Trikalinos. 2011. Class imbalance, redux. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. 754–763.
  • Yilmaz and Aslam (2006) Emine Yilmaz and Javed A Aslam. 2006. Estimating average precision with incomplete and imperfect judgments. In Proceedings of the 15th ACM international conference on Information and knowledge management. ACM, 102–111.
  • Yilmaz and Aslam (2008) Emine Yilmaz and Javed A Aslam. 2008. Estimating average precision when judgments are incomplete. Knowledge and Information Systems 16, 2 (2008), 173–211.