There are several application domains—including legal e-discovery and systematic review for evidence-based medicine—where finding all, or substantially all, relevant documents is crucial. Current state-of-the-art methods for achieving high recall rely on machine-learning methods that learn to discriminate relevant from non-relevant documents based on large numbers of human relevance assessments. In many instances, thousands of assessments may be required. These human assessments represent the primary cost of such methods, which can be prohibitive when expert assessments are required. In this work, we examine whether it is possible to use sentence-level assessments in place of document-level assessments to reduce the time needed to make judgments, the number of judgments needed, or both. We present a novel strategy to evaluate this hypothesis, and show simulation results using standard test collections which indicate that assessment effort can be reduced to judging a single sentence from a document without meaningful reduction in recall. Replacing documents with sentences has the potential to reduce the cost and burden associated with achieving high recall in many important applications.
Simulation methods have long been a staple of information-retrieval (IR) evaluation. The dominant methodology of studies reported in the literature derives from Sparck Jones’ “ideal” test collection (Sparck Jones and Van Rijsbergen, 1975), in which the results of ad hoc searches for each of a set of topics within a dataset are compared to relevance labels for a subset of the documents, rendered after the fact by human assessors. This approach is generally considered to yield reliable comparisons of the relative effectiveness of ad hoc IR systems that do not rely on relevance feedback.
To simulate relevance feedback, we require a substantially complete set of relevance labels prior to the simulation; the reviewer’s response to any particular document during the simulation is determined by consulting these previously determined labels. Furthermore, to simulate the presentation of isolated sentences rather than documents to the reviewer for feedback, we require a prior relevance label for each sentence in every document, with respect to every topic.
In the current study, we augment four publicly available test collections with sentence-level relevance labels derived using a combination of the available relevance labels, new assessments, heuristics, and machine learning (Section 5). We use the available labels to simulate document-level relevance feedback, and the newly created labels to simulate sentence-level relevance feedback (Section 4). Both are evaluated in terms of document-level recall, the fraction of relevant documents presented in whole or in part to the reviewer, as a function of reviewer effort. Effort is measured in two ways (Section 6): as the total number of assessments rendered by the reviewer, and as the total number of sentences viewed by the reviewer in order to render those assessments. We assume that the reviewer’s actual time and effort is likely to fall somewhere between these two bounds.
In addition to choosing whether to present a full document or an isolated sentence to the reviewer for feedback, it is necessary to choose the manner in which the document or sentence is selected. As a baseline, we used the Baseline Model Implementation (“BMI”) of the AutoTAR Continuous Active Learning (“CAL”) method (see Section 2), which repeatedly uses supervised learning to select and present to the reviewer for labeling the next-most-likely relevant document, which is then added to the training set. We extended BMI to incorporate three binary choices: (1) whether to present full documents or sentences to the reviewer for feedback; (2) whether to train the learning algorithm using full documents or isolated sentences; and (3) whether to select the highest-scoring document, and the highest-scoring sentence within that document, or to select the highest-scoring sentence, and the document containing that sentence. We evaluated all eight combinations of these three binary choices in Section 4.
We conjectured that while sentence-level feedback might be less accurate than document-level feedback, yielding degraded recall for a given number of assessments, sentence-level feedback could be rendered more quickly, potentially yielding higher recall for a given amount of reviewer time and effort. We further conjectured that selecting the highest-scoring sentence (as opposed to the highest-scoring document) and/or using sentences (as opposed to documents) for training might help to improve the accuracy and hence efficiency of sentence-level feedback.
Contrary to our conjecture, we found that sentence-level feedback resulted in no meaningful degradation in accuracy, and that the methods intended to mitigate the anticipated degradation proved counterproductive (Section 7). Our results suggest that relevance feedback based on isolated sentences can yield higher recall with less time and effort, under the assumption that sentences can be assessed, on average, more quickly than full documents.
2 Related Work
While the problem of High-Recall Information Retrieval (HRIR) has been of interest since the advent of electronic records, it currently commands only a small fraction of contemporary IR research. The most pertinent body of recent HRIR research derives from efforts to improve the effectiveness and efficiency of Technology-Assisted Review (“TAR”) for electronic discovery (eDiscovery) in legal, regulatory, and access-to-information contexts, where the need is to find substantially all documents that meet formally specified criteria within a finite corpus. A similar problem has been addressed within the context of systematic review for evidence-based medicine and software engineering, where the need is to find reports of substantially all studies measuring a particular effect. Constructing an ideal test collection for IR evaluation entails a similar need: to identify substantially all of the relevant documents for each topic. Although the focus of TREC has diversified since its inception in 1992, and methods to achieve high recall have evolved, the original impetus for TREC was to support the needs of information analysts, who “were willing to look at many documents and repeatedly modify queries in order to get high recall” (Voorhees et al, 2005).
The method of conducting multiple searches with the aim of achieving high recall, dubbed Interactive Search and Judging (ISJ), while common, has rarely been evaluated with respect to how well it achieves its overall purpose. The initial TREC tasks evaluated one single search, assuming that improvements would contribute to an end-to-end process involving multiple searches. An early study by Blair and Maron (1985) indicated that searchers employing ISJ on an eDiscovery task believed they had achieved 75% recall when in fact they had achieved only 20%. Within the context of the TREC 6 ad hoc task, Cormack et al (1998) used ISJ to achieve 80% recall with 2.1 hours of effort, on average, for each of 50 topics. A principal difference between the two studies is that Cormack et al. used “shortest substring ranking and an interface that displayed relevant passages and allowed judgments to be recorded,” whereas Blair and Maron used Boolean searches and reviewed printed versions of entire documents.
The current states of the art for HRIR and for its evaluation are represented by the tools and methods of the TREC Total Recall Track, which ran in 2015 and 2016 (Roegiest et al, 2015; Grossman et al, 2016), and form the baseline for this study. The Total Recall Track protocol simulates a human in the loop conducting document-level relevance assessments, and measures recall as a function of the number of assessments, where recall is the fraction of all relevant documents presented to the reviewer for assessment. BMI, an HRIR implementation conforming to the Total Recall protocol, was supplied to Total Recall Track participants in advance, and used as the baseline for comparison.
No method evaluated in the TREC Total Recall Track surpassed the overall effectiveness of BMI (Roegiest et al, 2015; Grossman et al, 2016; Zhang et al, 2015). A prior implementation of the same method had been shown to surpass the effectiveness of the ISJ results of (Cormack and Grossman, 2015) on the TREC 6 data shown in Figure 1, as well as a similar method independently contrived and used successfully by Soboroff and Robertson (2003) to construct relevance labels for the TREC 11 Filtering Track (Robertson and Soboroff, 2002). Recently, BMI and a method independently derived from CAL have produced results that compare favorably to competing methods for systematic review (Kanoulas et al, 2017; Cormack and Grossman, 2017c; Baruah et al, 2016). BMI has shown effectiveness that compares favorably with exhaustive manual review in categorizing 402,000 records from Governor Tim Kaine’s administration as Governor of Virginia (Cormack and Grossman, 2017a).
BMI is an implementation of CAL, which is effectively a relevance-feedback (RF) method, albeit with a different objective and implementation than constructing the ultimate query by selecting and weighting search terms, as typically reported in the RF literature (Aalbersberg, 1992; Ruthven and Lalmas, 2003). CAL uses supervised machine-learning algorithms that have been found to be effective for text categorization, but with the goal of retrieving every relevant document in a finite corpus, rather than constructing the ultimate automatic classifier for a hypothetical infinite population. Given these differences, results from RF and text categorization should not be assumed to apply to CAL. In particular, relevance feedback for non-relevant documents has been shown to be important for CAL (Pickens et al, 2015), whereas uncertainty sampling has shown no effectiveness benefit over relevance sampling, while incurring added complexity (Cormack and Grossman, 2014).
The TREC Legal Track (2006–2011) (Baron et al, 2006; Tomlinson et al, 2007; Oard et al, 2008; Hedin et al, 2009; Cormack et al, 2010; Grossman et al, 2011) investigated HRIR methods for eDiscovery, which are now described as TAR. The main task from 2006 through 2008 evaluated the suitability of ad hoc IR methods for this task, with unexceptional results. A number of RF and text categorization tasks were also run, each of which involved categorizing or ranking the corpus based on a fixed set of previously labeled training examples, begging the question of how this training set would be identified and labeled within the course of an end-to-end review effort starting with zero knowledge. 2008 saw the introduction of the interactive task, reprised in 2009 and 2010, for which teams conducted end-to-end reviews using technology and processes of their own choosing, and submitted results that were evaluated using relevance assessments on a non-uniform statistical sample of documents. In 2008 and 2009, San Francisco e-discovery service provider H5 achieved superior results using a rule-based approach (Hogan et al, 2008); in 2009 the University of Waterloo employed a combination of ISJ and CAL to achieve comparable results (Cormack and Mojdeh, 2009). In a retrospective study using secondary data from TREC 2009 (Grossman and Cormack, 2010), two of the authors of the current study concluded that the rule-based and ISJ+CAL approaches both yielded results that compared favorably to the human assessments used for evaluation. It was not possible, however, given the design of the TREC task, to determine the relative contributions of the technology, the process, and the quality and quantity of human input to the H5 or Waterloo results.
Prior to CLEF 2017, the systematic review literature described primarily text categorization efforts similar to those employed by the TREC Legal Track, in which the available data were partitioned into training and test sets, and effectiveness was evaluated with respect to classification or ranking of the test set (Hersh and Bhupatiraju, 2003; Wallace et al, 2010, 2013). One notable exception is Yu et al (2016), which affirms the effectiveness of CAL for systematic review.
Contemporary interactive search tools—including the tools employed for ISJ—typically display search results as document surrogates (Hearst, 2009), which consist of excerpts or summaries from which the reviewer can decide whether or not to view a document, or whether or not to mark it relevant. For example, the ISJ method described above used the result rendering shown in Figure 1, which consists of a fragment of text from the document, accompanied by radio buttons for the reviewer to render a relevance assessment. Typically, the surrogate consists in whole or in part of a query-biased summary or excerpt of the full document.
Tombros and Sanderson (1998) found that reviewers could identify more relevant documents for each query by reviewing the extracted summary, while at the same time, making fewer labeling errors. In a subsequent study, Sanderson (1998) found that “[t]he results reveal that reviewers can judge the relevance of documents from their summary almost as accurately as if they had had access to the document’s full text.” An assessor took, on average, seconds to assess each summary and seconds to assess each full document. Smucker and Jethani (2010) also used query-biased snippets of documents for relevance judgment in a user-study setting. The results show that the average time to judge a summary was around seconds while the time to judge a document was around seconds. Smucker and Jethani also found that reviewers were less likely to judge summaries relevant than documents.
In passage retrieval, the goal is to accurately identify fragments of documents, as opposed to entire documents, that contain relevant information. Some studies (Allan, 2005; Salton et al, 1993) have shown that passage retrieval can help to identify relevant documents and hence to improve the effectiveness of document retrieval. Liu and Croft (2002) used passage retrieval in a language model and found that passages can provide more reliable retrieval than full documents.
To evaluate the effectiveness of passage retrieval systems, the TREC 2004 HARD Track employed an adapted form of test collection, in which assessors were asked to partition each relevant document to separate regions of text containing relevant information from regions containing no relevant information.
The accuracy and completeness of relevance assessments in test collections has been an ongoing concern since their first use in IR evaluation (Voorhees et al, 2005). It is well understood that it is typically impractical to have a human assessor label every document in a realistically sized corpus; it is further understood that human assessments are not perfectly reliable. Nonetheless, it has been observed that it is possible to select enough documents for assessment, and that human assessment is reliable enough to measure the relative effectiveness of IR systems, under the assumption that unassessed documents are not relevant. The pooling method suggested by Sparck Jones (Sparck Jones and Van Rijsbergen, 1975) and pioneered at TREC (Voorhees et al, 2005) appears to yield assessments that are sufficiently complete, given the level of assessment effort and the size of the initial TREC collections, to reliably rank the relative effectiveness of different IR systems. Other methods of selecting documents for review, including ISJ, have been observed to be similarly effective, while entailing less assessment effort (Cormack and Mojdeh, 2009; Soboroff and Robertson, 2003).
Evaluation measures, such as bpref (Clarke et al, 2005), have been proposed that avoid the assumption that unassessed documents are not relevant, gauging system effectiveness only on the basis of documents for which relevance labels are available. Büttcher et al (2007) achieved reliable evaluation results by using an SVM classifier to label all of the unlabeled documents in the TREC GOV2 collection (Büttcher et al, 2006), using the labeled documents as a training set. Systems were then evaluated assuming both the human-assessed and machine-classified labels to be authoritative.
The problem of evaluating user-in-the-loop systems has been investigated using human subjects as well as simulated human responses (Voorhees et al, 2005). For a decade, experiments using human subjects were the subject of the TREC Interactive Track and related efforts (Over, 2001), which exposed many logistical hurdles in conducting powerful, controlled, and realistic experiments to compare system effectiveness (Dumais, 2005; Voorhees et al, 2005). Not the least of these hurdles was the fact that the human subjects frequently disagreed with each other, and with the labels used for assessment, raising the issue of how the results of different subjects should be compared, and how human-in-the-loop results should be compared to fully automated results. To the authors’ knowledge, no controlled end-to-end evaluation of the effectiveness of HRIR methods has been conducted using human subjects.
Simulated responses are more easily controlled, at the expense of realism. The simplest assumption is that the human is infallible, and will assess a document exactly as specified by the relevance label in the test collection. This assumption was made in relevance-feedback studies by Drucker et al (2001), the TREC Spam Track (Cormack and Lynam, 2005b), and the TREC Total Recall Track (Roegiest et al, 2015; Grossman et al, 2016). Cormack and Grossman (2014) used a “training standard” to simulate relevance feedback, separate from the “gold standard” used to estimate recall. Cormack and Grossman (2017b) used the results of “secondary” assessors from TREC 4 to simulate feedback, separate from the primary assessor whose labels are used to evaluate recall. In the same study, Cormack and Grossman used assessments rendered previously by the Virginia Senior State Archivist to simulate relevance feedback, and post-hoc blind assessments by the same archivist to estimate recall. Cormack and Grossman distinguish “system recall,” which denotes the fraction of all relevant documents presented to the reviewer, from “user recall,” which denotes the fraction of all relevant documents that are presented to the reviewer and assessed as relevant by the reviewer.
A second simplifying assumption in simulation experiments is to quantify reviewer effort by the number of documents or surrogates presented to the reviewer for review. We know that the reviewer’s speed depends on a number of factors …
The challenge of acquiring a complete set of labels for relevance assessment was addressed within the context of the TREC Spam Track (Cormack and Lynam, 2005b). The Track coordinators used an iterative process (Cormack and Lynam, 2005a) in which a number of spam classifiers were applied to the corpus, and disagreements between the classifiers and a provisional labeling were adjudicated by the coordinators. The process was repeated several times, until substantially all labels were adjudicated in favor of the provisional gold standard. At this point, the provisional gold standard was adopted as ground truth, and its labels were used to simulate human feedback and, subsequently, to measure effectiveness. A later study by Kolcz and Cormack (2009) measured the error rate of the gold standard, according to the majority vote of a crowdsourced labeling effort. The observed error rate for the gold standard (1.5%) was considerably lower than the observed error rate of 10% for individual crowdsource workers.
The TREC 11 Filtering Track coordinators used a method similar to CAL to identify documents which were assessed for relevance; after TREC, further documents selected using the pooling method were assessed for relevance, and used to estimate the recall of the original effort to have been 78% (Sanderson and Joho, 2004; Cormack and Grossman, 2015). The additional assessments did not materially affect the evaluation results.
Cormack and Grossman (2014) found CAL to be superior to classical supervised learning and active learning protocols (dubbed “simple passive learning” (SPL) and “simple active learning” (SAL), respectively) for HRIR. Cormack and Grossman observed comparable results for the TREC Legal Track collection detailed above, as well as four private datasets derived from real legal matters.
The TREC Total Recall Track used a total of seven test collections (Roegiest et al, 2015; Grossman et al, 2016). For five of the collections, including the collections used in the current study, the Track coordinators used ISJ and CAL with two different feature engineering techniques and two different base classifiers to identify and label substantially all relevant documents prior to running the task. These labels were used to simulate reviewer feedback and to evaluate the results. For the 2016 Track, an alternate gold standard was formed by having three different assessors label each of a non-uniform statistical sample of documents for each topic (Grossman et al, 2016; Zhang et al, 2016). The alternate assessments yielded substantially the same evaluation results as the full gold standard. Subsequently, Cormack and Grossman (2017b) used a revised gold standard for both simulation and evaluation, and found no material difference in results.
Relevance labels for the TREC 2004 HARD Track, which were not used at the time to simulate reviewer feedback, but which are used for that purpose in the current study, were rendered for a set of documents selected using the classical pooling method.
This study addresses the question of whether sentence-level relevance feedback can achieve high recall more efficiently than document-level relevance feedback, and, if so, how.
To investigate this question, we apply an extended version of BMI to augmented versions of four public test collections, so as to simulate eight variants of sentence-level and document-level feedback.
To model effort, we count the number of assessments rendered by the simulated reviewer, as well as the number of sentences viewed by the simulated reviewer in rendering those assessments.
4 Continuous Active Learning with Sentences and Documents
BMI implements the AutoTAR CAL method (Cormack and Grossman, 2015), shown in Algorithm 1. The topic statement is labeled as a relevant document and 100 randomly selected documents are labeled as “non-relevant” in the training set (Steps 1 and 3). A logistic regression classifier is trained on this training set in Step 4. The highest-scoring documents are selected from the documents not yet reviewed and appended to the system output in Steps 6 and 7. The system output records the list of reviewed documents. The documents labeled by the reviewer are then added to the training set in Step 9. The randomly selected documents coded as non-relevant in the training set are replaced by newly selected random documents in Steps 3 and 5. The classifier is re-trained using the new training set, and selects the next highest-scoring unreviewed documents for review in the new batch. This process repeats until enough relevant documents have been found.
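The loop described above can be sketched in a few lines of Python. This is a minimal illustration and not BMI itself: `train_toy_classifier` is a hypothetical word-count stand-in for the logistic regression classifier, the batch size is fixed, and the loop runs until every document has been reviewed instead of applying a stopping criterion. All function and parameter names are assumptions made for the example.

```python
import random
from collections import Counter


def train_toy_classifier(training_set):
    """Toy stand-in for logistic regression (an assumption, not BMI's learner).

    Each word gains +1 per relevant example and -1 per non-relevant example
    containing it; a text is scored by summing the weights of its words.
    """
    weights = Counter()
    for text, label in training_set:
        for word in set(text.lower().split()):
            weights[word] += 1 if label else -1
    return lambda text: sum(weights[w] for w in set(text.lower().split()))


def simulate_cal(topic, docs, qrels, batch_size=2, n_random=100, seed=0):
    """Simulate CAL with an infallible reviewer.

    docs:  dict mapping doc_id -> text
    qrels: set of relevant doc_ids (consulted to simulate the reviewer)
    Returns the system output: doc_ids in the order they were reviewed.
    """
    rng = random.Random(seed)
    doc_ids = sorted(docs)
    reviewed = {}   # doc_id -> label given by the simulated reviewer
    output = []     # system output
    while len(output) < len(doc_ids):
        # The topic statement is a relevant example; fresh random documents
        # are presumed non-relevant and replaced on every iteration.
        sample = rng.sample(doc_ids, min(n_random, len(doc_ids)))
        training = [(topic, True)]
        training += [(docs[d], False) for d in sample if d not in reviewed]
        training += [(docs[d], label) for d, label in reviewed.items()]
        score = train_toy_classifier(training)
        # Present the highest-scoring unreviewed documents for labeling.
        unreviewed = [d for d in doc_ids if d not in reviewed]
        batch = sorted(unreviewed, key=lambda d: -score(docs[d]))[:batch_size]
        for d in batch:
            output.append(d)
            reviewed[d] = d in qrels
    return output
```

Passing `n_random=0` disables the presumed-negative sample, which makes the toy run fully deterministic and easy to inspect.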
We modified BMI to use either sentences or documents at various stages of its processing. As part of this modification, we consider the document collection to be the union of documents and sentences, and choose documents or sentences at each step, depending on a configuration parameter. For example, a single document of $n$ sentences becomes $n+1$ documents, where one document is the original document and the other $n$ documents are the document’s sentences.
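As a rough illustration of this union, the sketch below expands each document into itself plus one entry per sentence. The regular-expression splitter and the `doc_id#sN` naming scheme are assumptions made for the example; the study itself used a trained sentence tokenizer.

```python
import re


def expand_collection(docs):
    """Return the union of documents and their sentences.

    docs: dict mapping doc_id -> text
    A document with n sentences contributes n + 1 entries: the document
    itself plus one entry per sentence (ids of the form 'doc_id#sN').
    """
    expanded = {}
    for doc_id, text in docs.items():
        expanded[doc_id] = text
        # Naive splitter: break after '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        for i, sentence in enumerate(sentences):
            expanded[f"{doc_id}#s{i}"] = sentence
    return expanded
```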
BMI uses logistic regression as implemented by Sofia-ML (https://code.google.com/archive/p/sofia-ml/) as its classifier. The logistic regression classifier was configured with logistic loss with Pegasos updates, L2 normalization on feature vectors with $\lambda$ as the regularization parameter, AUC-optimized training, and training iterations. The features used for training the classifier were word-based tf-idf:
$$w = (1 + \log(tf)) \cdot \log(N / df) \qquad (1)$$
where $w$ is the weight of the word, $tf$ is the term frequency, $N$ is the total number of documents and sentences, and $df$ is the document frequency, where both documents and sentences are counted as documents. The word feature space consisted of words occurring at least twice in the collection, and all words were downcased and stemmed with the Porter stemmer.
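The feature computation can be sketched as follows. The sketch assumes the common log-scaled weighting $(1 + \log tf) \cdot \log(N/df)$ followed by L2 normalization, and it omits stemming and the minimum-frequency filter for brevity; the function name and input format are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(units):
    """Compute L2-normalized tf-idf vectors.

    units: list of token lists, one per document or sentence; both kinds
           of unit count toward N and toward document frequency.
    Assumed weighting: (1 + log(tf)) * log(N / df).
    """
    n = len(units)
    df = Counter()
    for tokens in units:
        df.update(set(tokens))
    vectors = []
    for tokens in units:
        tf = Counter(tokens)
        vec = {w: (1 + math.log(c)) * math.log(n / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        # Leave all-zero vectors untouched to avoid dividing by zero.
        vectors.append({w: v / norm for w, v in vec.items()} if norm else vec)
    return vectors
```

A word that occurs in every unit receives weight zero, since $\log(N/df) = 0$ in that case.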
Algorithm 2 illustrates our modified BMI, which enables either sentence-level or document-level feedback, training, and ranking. The system output in Step 6 records the documents that have been labeled by the reviewer. The system output also preserves the order in which documents were judged by the reviewer, so that we can use it to measure the recall achieved at a given amount of effort.
Steps 3, 5, 8 and 10 involve choices; we explored two possibilities for each choice, for a total of eight combinations. The principal choice occurs in Step 8: whether to present to the reviewer the best_sent or the best_doc in the pair. We label these alternatives 1s and 1d, respectively. In support of this choice, it is necessary to choose how to build the training set in Steps 3 and 10, and how to use the classifier to identify the top (best_sent, best_doc) pairs in Step 5. In Step 10, we choose as newly added training examples either: (2s) the best_sent with its corresponding label; or (2d) the best_doc with its corresponding label. In Step 3, the 100 randomly selected non-relevant training examples are chosen as either: (2s) 100 random sentences; or (2d) 100 random documents. In Step 5, we choose the (best_sent, best_doc) pair as either: (3s) the highest-scoring sentence contained in any document not yet in system output, and the document containing that sentence; or (3d) the highest-scoring document not yet in system output, and the highest-scoring sentence within that document. In both cases, sentences were scored by the same classifier that was also used for document scoring. More formally, if we denote system output by $O$, the pair (best_sent, best_doc) is defined by Equations 2 and 3:
$$\text{(3s)}\quad \mathit{best\_sent} = \arg\max_{sent \in doc,\ doc \notin O} Score(sent), \qquad \mathit{best\_doc} = doc \ \text{such that}\ \mathit{best\_sent} \in doc \qquad (2)$$
$$\text{(3d)}\quad \mathit{best\_doc} = \arg\max_{doc \notin O} Score(doc), \qquad \mathit{best\_sent} = \arg\max_{sent \in \mathit{best\_doc}} Score(sent) \qquad (3)$$
Using documents for each stage of the process (choosing 1d, 2d, and 3d) is our baseline, and replicates BMI, except for the use of the union of documents and sentences to compute word features. For brevity, we use the notation ddd to represent this combination of choices, and more generally, we use XYZ to denote 1X, 2Y, and 3Z, where X, Y, Z ∈ {d, s}. The choices for all eight combinations are shown in Table 1.
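The two ranking alternatives in Step 5 can be sketched as a single selection function. The names `select_pair`, `scores`, `sentences_of`, and `rank_by` are hypothetical, and the scores are assumed to have been produced already by one classifier applied to both documents and sentences.

```python
def select_pair(scores, sentences_of, output, rank_by="doc"):
    """Pick the next (best_sent, best_doc) pair.

    scores:       dict mapping every doc id and sentence id to a classifier score
    sentences_of: dict mapping doc id -> list of its sentence ids
    output:       set of doc ids already in the system output
    rank_by:      "doc"  -> highest-scoring document, then its best sentence
                  "sent" -> highest-scoring sentence, then its document
    Returns None when no unreviewed document remains.
    """
    candidates = [d for d in sentences_of if d not in output and sentences_of[d]]
    if not candidates:
        return None
    if rank_by == "doc":
        best_doc = max(candidates, key=lambda d: scores[d])
        best_sent = max(sentences_of[best_doc], key=lambda s: scores[s])
    else:
        # Consider every sentence of every unreviewed document.
        pairs = ((s, d) for d in candidates for s in sentences_of[d])
        best_sent, best_doc = max(pairs, key=lambda p: scores[p[0]])
    return best_sent, best_doc
```

The example in the test below shows that the two alternatives can disagree: a document with a high overall score need not contain the single highest-scoring sentence.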
5 Test Collections
We use four test collections to evaluate the eight different variations of continuous active learning. We use the three test collections from the TREC 2015 Total Recall track: Athome1, Athome2, and Athome3. We also use the test collection from the TREC 2004 HARD track (Allan, 2005; Voorhees and Harman, 2000). For each collection, we used NLTK’s Punkt Sentence Tokenizer (http://www.nltk.org/api/nltk.tokenize.html) to break all documents into sentences. Corpus statistics for the four collections are shown in Table 2.
In order to compare sentence-level feedback with document-level feedback strategies, we needed complete relevance labels for all sentences as well as for all documents in the collections.
The 2004 HARD track’s collection provided pooled assessments with complete relevance labels for all documents in the pool. In addition, for 25 topics, every relevant document was divided by the TREC assessors into relevant and non-relevant passages identified by character offsets. For the HARD collection, we only use the 25 topics with passage judgments. We considered a sentence to be relevant if it overlapped with a relevant passage. Sentences that did not overlap with a relevant passage were labeled non-relevant.
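The overlap rule can be sketched as an interval-intersection test on character offsets. The sketch assumes half-open `(start, end)` spans for both sentences and passages; the function name and input format are hypothetical.

```python
def label_sentences(sentence_spans, relevant_passages):
    """Label each sentence relevant iff it overlaps a relevant passage.

    sentence_spans:    list of (start, end) character offsets, end exclusive
    relevant_passages: list of (start, end) character offsets, end exclusive
    Returns a list of booleans aligned with sentence_spans.
    """
    return [
        any(s_start < p_end and p_start < s_end
            for p_start, p_end in relevant_passages)
        for s_start, s_end in sentence_spans
    ]
```

Under this rule a passage that straddles a sentence boundary marks both neighboring sentences as relevant.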
For both the HARD track collection and the Total Recall collections, sentences from non-relevant and unjudged documents were labeled as non-relevant.
The Total Recall collections provided complete document-level relevance judgments, i.e., the relevance of every document is known. Each relevant document is composed of one or more relevant sentences and zero or more non-relevant sentences. To label the sentences as relevant or non-relevant the first author employed “Scalable CAL” (“S-CAL”) (Cormack and Grossman, 2016) to build a calibrated high-accuracy classifier that was used to label every sentence within every relevant document. Our total effort to train the S-CAL classifier was to review 610, 453, and 376 sentences, on average, per topic, for each of the three Athome datasets, respectively.
While neither of these methods yields a perfect labeling, their purpose is to simulate human feedback, which is likewise imperfect. The internal calibration of our S-CAL classifier indicated its recall and precision both to be above ( for Athome1, Athome2, and Athome3, respectively), which is comparable to human accuracy (Cormack and Grossman, 2016) and, we surmised, would be good enough to test the effectiveness of sentence-level feedback. Similarly, we surmised that overlap between sentences and relevant passages in the HARD collection would yield labels that were good enough for this purpose.
The results of our sentence labeling are shown in Table 3. The average position of the first relevant sentence in each relevant document is shown in the fifth column, while the distribution of such positions is shown in Figure 2. On three of the datasets (Athome1, Athome2, and HARD), more than 50% of the relevant documents have their first relevant sentence in the first position. However, the average position of the first relevant sentence in a relevant document is larger than 2 for all four datasets. This means that the reviewer needs to read more than two sentences, on average, to find the first relevant sentence in each relevant document, under the assumption that the reviewer reads the document sequentially. The sixth column shows the fraction of relevant documents containing at least one sentence labeled relevant. It shows that nearly every relevant document contains at least one relevant sentence.
The human-in-the-loop CAL process simulated by the TREC Total Recall track evaluation apparatus, which the current study adopts and extends, proceeds as follows. Starting with a standard test collection consisting of a set of documents, topic statements, and relevance assessments (“qrels”), the most-likely relevant document is presented to the reviewer for assessment. The reviewer’s response is simulated by consulting the qrels, and fed back to the system, which chooses the next-most-likely-relevant document to present. The process continues until a formal or informal stopping criterion is met, suggesting that substantially all relevant documents have been presented to the reviewer.
To model sentence-level feedback it was necessary to extend the evaluation apparatus to incorporate a sentence dataset and sentence qrels. The sentence dataset consists of all sentences extracted from documents in the document dataset, and the sentence qrels consist of relevance assessments for each sentence. To simulate sentence-level feedback, the apparatus presents to the simulated reviewer a single sentence, as determined by the system under test, and communicates the reviewer’s assessment to the system, which then selects the next sentence for review. The “system-selected documents” used for evaluation consist of the sequence of documents from which the sentences presented to the reviewer were extracted. In our paper, the “system-selected documents” are recorded in the system output ($O$) mentioned in Step 6 of Algorithm 2. The same apparatus is used to simulate document-level feedback, except that here, the system selects a document for presentation to the reviewer, and the reviewer’s feedback is simulated by consulting the document qrels. In document-level-feedback mode, the apparatus is operationally equivalent to the TREC Total Recall apparatus.
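A minimal sketch of such an apparatus is given below. The class and attribute names are hypothetical; the only behavior taken from the description above is that feedback comes from the sentence qrels or the document qrels depending on the mode, and that the containing document is recorded in the system output in either mode.

```python
class SimulatedReviewer:
    """Simulates an infallible reviewer by consulting qrels."""

    def __init__(self, doc_qrels, sent_qrels, sentence_level):
        self.doc_qrels = doc_qrels        # set of relevant document ids
        self.sent_qrels = sent_qrels      # set of relevant sentence ids
        self.sentence_level = sentence_level
        self.system_output = []           # documents in presentation order

    def assess(self, sent_id, doc_id):
        """Record the presented document and return the simulated label."""
        self.system_output.append(doc_id)
        if self.sentence_level:
            return sent_id in self.sent_qrels
        return doc_id in self.doc_qrels
```

Because the document is recorded even when only one of its sentences is shown, document-level recall can be computed from `system_output` in both modes.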
Recall is the number of relevant documents presented to the reviewer for assessment, as a fraction of the total number of relevant documents ($R$), regardless of whether document- or sentence-level feedback is employed. In our paper, the documents presented to the reviewer are recorded in the system output ($O$). We measure the recall at effort $E$ using Equation 6:

$$\mathrm{Recall}@E = \frac{|\{O@E\} \cap \{\text{relevant documents}\}|}{|\{\text{relevant documents}\}|} \tag{6}$$

where $O@E$ is the system output truncated at effort $E$. The sets of relevant documents are the gold-standard relevance assessments (“qrels”) provided by the TREC Total Recall 2015 Track and the HARD 2004 Track for the corresponding datasets and topics.
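As an illustration of the recall measure just defined, the following sketch truncates a system output and computes the fraction of relevant documents it contains. It assumes, for simplicity, that effort is counted in assessments, so truncating at effort `effort` means keeping the first `effort` documents; the function and variable names are hypothetical.

```python
def recall_at_effort(system_output, relevant, effort):
    """Recall of the system output truncated after `effort` assessments.

    system_output: doc ids in the order they were presented
    relevant: set of doc ids labeled relevant in the qrels
    effort: number of assessments at which the output is truncated
    """
    truncated = system_output[:effort]
    return len(set(truncated) & relevant) / len(relevant)


toy_output = ["d1", "d4", "d3", "d2"]
toy_relevant = {"d1", "d2", "d4"}
recall_after_2 = recall_at_effort(toy_output, toy_relevant, 2)
recall_after_4 = recall_at_effort(toy_output, toy_relevant, 4)
```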
The Total Recall Track measured recall as a function of effort, where effort was measured by the number of assessments rendered by the reviewer. Gain curves were used to illustrate the overall shape of the function, and recall at particular effort levels $aR + b$ was tabulated, where $R$ is the number of relevant documents, $a$ is the constant 1, 2, or 4, and $b$ is the constant 0, 100, or 1000. Intuitively, these measures show the recall that can be achieved with effort proportional to the number of relevant documents, plus some fixed overhead amount.
We also measure recall as a function of effort, but in this paper we measure effort as a linear combination of the number of assessments rendered by the reviewer and the number of sentences that the reviewer must read to render a judgment. If a simulated reviewer provides an assessment on a single sentence, the reviewer reads one sentence and makes one assessment. When a full document is presented for assessment, we simulate the reviewer reading the document sequentially from the beginning up to the first relevant sentence and then making one assessment. If the document is non-relevant, the assessor needs to read all of the sentences in the document.
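The reading model above can be written down directly. In this sketch a document is represented, hypothetically, as a list of per-sentence relevance labels in reading order. As an added assumption, a document with no sentence labeled relevant is read in full, which covers non-relevant documents.

```python
def sentences_read(sentence_labels):
    """Number of sentences a sequential reader reads before judging.

    sentence_labels: list of booleans, one per sentence in document order,
    True where the sentence is labeled relevant.
    """
    for position, is_relevant in enumerate(sentence_labels, start=1):
        if is_relevant:
            # Reading stops at the first relevant sentence.
            return position
    # No relevant sentence found: the whole document was read.
    return len(sentence_labels)


relevant_doc_cost = sentences_read([False, False, True, False])
non_relevant_doc_cost = sentences_read([False] * 5)
```

Under sentence-level feedback the corresponding cost is always one sentence per assessment.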
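The ratio of the effort required to make an assessment to the effort required to read a sentence is not necessarily one. To explore different ratios of effort, we express effort, $\mathit{Effort}_{\lambda}$, as a linear combination of $\mathit{Effort}_{judge}$ and $\mathit{Effort}_{sent}$:

$$\mathit{Effort}_{\lambda} = (1 - \lambda) \cdot \mathit{Effort}_{judge} + \lambda \cdot \mathit{Effort}_{sent} \tag{7}$$

where $\mathit{Effort}_{judge}$ is the number of assessments and $\mathit{Effort}_{sent}$ is the number of sentences read. At one extreme, we only care about the number of assessments, i.e., $\lambda = 0$ and $\mathit{Effort}_{0} = \mathit{Effort}_{judge}$. At the other extreme, we only count reading effort, i.e., $\lambda = 1$ and $\mathit{Effort}_{1} = \mathit{Effort}_{sent}$. For sentence-level feedback, $\mathit{Effort}_{\lambda} = \mathit{Effort}_{judge} = \mathit{Effort}_{sent}$, regardless of $\lambda$. For document-level feedback, $\mathit{Effort}_{sent} \ge \mathit{Effort}_{judge}$, and $\mathit{Effort}_{\lambda} \ge \mathit{Effort}_{judge}$ where $0 \le \lambda \le 1$.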
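For a single assessment on each document $d$, the number of assessments on $d$ is 1. Writing $\lambda$ for the weight on reading effort in Equation 7, we can simplify the assessment effort for a single document to $\mathit{Effort}_{\lambda}(d) = (1 - \lambda) + \lambda \cdot \mathit{Effort}_{sent}(d)$. If $\mathit{Effort}_{sent}(d) = 1$ for the document $d$, then $\mathit{Effort}_{\lambda}(d) = 1$. As the number of sentences that must be read to review this document increases, $\mathit{Effort}_{\lambda}(d)$ also increases (for $\lambda > 0$).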
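A short numerical sketch of the combined effort measure may help. It assumes the weighting in which `lam` multiplies the number of sentences read and `1 - lam` multiplies the number of assessments; the function name and the toy counts are hypothetical.

```python
def combined_effort(n_assessments, n_sentences_read, lam):
    """Linear combination of judging effort and reading effort.

    lam = 0 counts assessments only; lam = 1 counts sentences read only.
    """
    return (1 - lam) * n_assessments + lam * n_sentences_read


# Document-level feedback: 3 assessments that required reading 3 + 5 + 1
# sentences in total.
doc_level = [combined_effort(3, 9, lam) for lam in (0.0, 0.5, 1.0)]

# Sentence-level feedback: 3 assessments, one sentence read for each.
sent_level = [combined_effort(3, 3, lam) for lam in (0.0, 0.5, 1.0)]
```

In this toy case the two modes cost the same at `lam = 0`, and the document-level cost pulls ahead as soon as reading carries any weight, while the sentence-level cost does not depend on `lam`.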
We compared the sentence-level feedback strategies with the document-level feedback strategies on three different dimensions, giving the eight combinations shown in Table 1. As explained in Section 6, we measure performance as recall versus effort. At one extreme, we can measure effort purely as the number of assessments (judgments) made by the reviewer. At the other extreme, we can measure effort purely as the number of sentences read.
Figures 3 and 4 show recall vs. effort for the HARD test collection. Figure 3 measures effort as a function of the number of judgments, where the horizontal axis reports judgments in multiples of the number of relevant documents, $R$. For example, an effort of $2R$ means that twice as many judgments have been made as there are relevant documents. Figure 4 measures effort as a function of the number of sentences read. The equivalent plots for the Athome collections are found at the end of the paper in Figures 10 – 10.
In general, when effort is measured in terms of judgments only, we find that training on and selecting documents is superior to the other methods, regardless of whether the reviewer judged documents or sentences, across all eight combinations and for all four datasets. We also find that the strategies that train on sentences and select documents are worse than the strategies that train on documents and select documents on all datasets, but superior to the other four strategies. Overall, by judgment effort, the two strategies that train on and select documents rank first, the two that train on sentences and select documents rank second, and the remaining four rank last.
When effort is measured in terms of sentences read only, all of the sentence-level feedback strategies, in which the reviewer judges sentences, achieve much higher recall than the document-level feedback strategies, in which the reviewer judges documents, for a given level of effort. Among the four sentence-level feedback strategies, the one that trains on and selects documents is superior, and the relative effectiveness among the sentence-based strategies is consistent with the result when effort is measured by the number of assessments. In that ranking, the strategy that trains on sentences and selects documents comes next, followed by the two strategies that select sentences.
These results suggest that training using documents and selecting the highest-ranking document from the document-rank list to review will lead to superior results over other strategies, regardless of whether sentences or documents are presented to the reviewer for feedback. At the same time, the choice of using sentences or documents for feedback has very little impact on the recall that can be achieved for a given number of assessments.
|Athome1||(-0.025, 0.006)||(-0.012, 0.003)||(-0.009, -0.0003)³|
|Athome2||(-0.008, 0.014)||(-0.005, 0.003)||(-0.007, 0.002)|
|Athome3||(-0.043, 0.016)||(-0.015, 0.008)||(-0.005, 0.011)|
|HARD||(-0.074, 0.02)||(-0.071, 0.007)||(-0.122, 0.009)|
|Overall||(-0.037, 0.006)||(-0.034, 0.002)||(-0.056, 0.003)|

³ The mean difference between recall and recall[ddd] equals 0.0046 and at effort = on Athome1.

|Athome1||(0.178, 0.42)||(0.181, 0.508)||(0.107, 0.514)|
|Athome2||(0.308, 0.508)||(0.352, 0.574)||(0.266, 0.545)|
|Athome3||(0.292, 0.537)||(0.244, 0.605)||(0.148, 0.499)|
|HARD||(0.121, 0.279)||(0.222, 0.41)||(0.297, 0.516)|
|Overall||(0.242, 0.348)||(0.307, 0.428)||(0.306, 0.442)|

|Athome1||(0.121, 0.344)||(0.127, 0.393)||(0.041, 0.445)|
|Athome2||(0.21, 0.36)||(0.227, 0.401)||(0.163, 0.39)|
|Athome3||(0.25, 0.373)||(0.246, 0.394)||(0.178, 0.349)|
|HARD||(0.092, 0.225)||(0.193, 0.365)||(0.249, 0.445)|
|Overall||(0.194, 0.29)||(0.247, 0.356)||(0.238, 0.365)|
The actual recall achieved by each method at multiples of $R$, the number of relevant documents, is reported in Tables 4, 5, and 6. These tables also report effort as an equal combination of the number of judgments and the number of sentences read, with both given the same weight. In each table, we compare document-level feedback with sentence-level feedback, both training on and selecting documents, and if the difference in recall is statistically significant, we bold the greater value. We measure statistical significance with a two-sided Student’s t-test and treat p-values less than 0.05 as significant. For example, in Table 4, when effort is equal to the number of relevant documents and is measured by the number of sentences read (1R_Sent) on Athome1, the document-level (recall=0.42) and sentence-level (recall=0.72) methods differ at a statistically significant level.
The most interesting observation to be made from Tables 4, 5, and 6 is that when effort is measured in number of judgments, document-level and sentence-level feedback are usually equivalent, and when effort is measured in number of sentences read, sentence-level feedback is vastly superior to document-level feedback. What this means is that for essentially the same number of judgments, we can achieve the same level of recall by judging only the best sentence from a document; we do not have to examine the entire document to judge its relevance.
As defined in Equation 7, effort is a function of the number of assessments and the number of sentences read. We also calculate the confidence interval for the difference in recall between sentence-level and document-level feedback. We find that sentence-level feedback is significantly better than document-level feedback at every reported effort level whenever the number of sentences read contributes to the effort measure. We show the confidence intervals of the difference for the different effort measurements in Tables 7, 8, and 9.
To get a better sense of when sentence-level feedback becomes superior to document-level feedback, we varied the weight $\lambda$ on reading effort in Equation 7 stepwise and plotted in Figure 11 the confidence interval for the difference in recall between the two. As can be seen, once the cost of reading sentences starts to carry some weight, sentence-level feedback becomes superior to document-level feedback. The difference in recall grows as $\lambda$ increases.
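For a single assessment on each document $d$, we can simplify the effort for document $d$ to $\mathit{Effort}_{\lambda}(d) = (1 - \lambda) + \lambda \cdot \mathit{Effort}_{sent}(d)$, where $\lambda$ is the weight on reading effort in Equation 7 and $\mathit{Effort}_{sent}(d)$ is the number of sentences read in $d$. As shown in Table 3, the average position of the first relevant sentence in a relevant document exceeds two for every dataset. Based on our assumption that the reviewer reads the document sequentially from the beginning to the first relevant sentence, we can infer that document-level feedback requires, on average, reading more than one sentence per assessed document. To make this more concrete, if more than one sentence must be read per document under document-level feedback, sentence-level feedback can be superior in terms of the effort needed to achieve the same level of recall. In other words, if the time to judge a document is substantively more than the time to judge a sentence, sentence-level feedback is more effective than document-level feedback.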
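The per-document comparison can be made explicit with a short worked derivation. This is a sketch under the assumption that reading effort is weighted by $\lambda$ and assessments by $1 - \lambda$; the symbol $s$, for the number of sentences read when a full document is presented, is introduced here only for illustration.

```latex
\begin{align*}
\text{document presented:} \quad
  & \mathit{Effort}_{\lambda} = (1 - \lambda) + \lambda s \\
\text{single sentence presented:} \quad
  & \mathit{Effort}_{\lambda} = (1 - \lambda) + \lambda \cdot 1 = 1 \\
\text{difference:} \quad
  & (1 - \lambda) + \lambda s - 1 = \lambda (s - 1)
\end{align*}
```

The difference $\lambda (s - 1)$ is positive whenever $\lambda > 0$ and $s > 1$. For instance, with $s = 3$ and $\lambda = 0.5$, presenting the document costs $2$ units of effort against $1$ for the single sentence.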
8 Discussion and Conclusions
This simulation study suggests that an active learning method can identify a single sentence from each document that contains sufficient information for a user to assess the relevance of the document for the purpose of relevance feedback. The best-performing active learning method selected for assessment the highest-scoring sentence from the highest-scoring document, based on a model trained using entire documents whose labels were determined exclusively from a single sentence.
Under the assumption that the user can review a sentence more quickly than an entire document, the results of our study suggest that a system in which only sentences were presented to the user would achieve very high recall more quickly than a system in which entire documents were presented.
The synthetic labels used to simulate user feedback were imperfect, but of comparable quality, according to recall and precision, to what has been observed for human users (Voorhees, 2000).
The results of this study provided impetus for an extensive human study by the authors (Zhang et al, 2017).
- Aalbersberg (1992) Aalbersberg IJ (1992) Incremental relevance feedback. In: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 11–22
- Allan (2005) Allan J (2005) Hard track overview in trec 2003 high accuracy retrieval from documents. Tech. rep., DTIC Document
- Baron et al (2006) Baron JR, Lewis DD, Oard DW (2006) Trec 2006 legal track overview. In: TREC
- Baruah et al (2016) Baruah G, Zhang H, Guttikonda R, Lin J, Smucker MD, Vechtomova O (2016) Optimizing nugget annotations with active learning. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ACM, pp 2359–2364
- Blair and Maron (1985) Blair DC, Maron ME (1985) An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM 28(3):289–299
- Büttcher et al (2006) Büttcher S, Clarke CL, Soboroff I (2006) The trec 2006 terabyte track. In: TREC, vol 6, p 39
- Büttcher et al (2007) Büttcher S, Clarke CLA, Yeung PCK, Soboroff I (2007) Reliable information retrieval evaluation with incomplete and biased judgements. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, SIGIR ’07, pp 63–70, DOI 10.1145/1277741.1277755, URL http://doi.acm.org/10.1145/1277741.1277755
- Clarke et al (2005) Clarke CL, Scholer F, Soboroff I (2005) The trec 2005 terabyte track. In: TREC
- Cormack and Grossman (2014) Cormack GV, Grossman MR (2014) Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, ACM, pp 153–162
- Cormack and Grossman (2015) Cormack GV, Grossman MR (2015) Autonomy and reliability of continuous active learning for technology-assisted review. CoRR abs/1504.06868, URL http://arxiv.org/abs/1504.06868
- Cormack and Grossman (2016) Cormack GV, Grossman MR (2016) Scalability of continuous active learning for reliable high-recall text classification. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ACM, pp 1039–1048
- Cormack and Grossman (2017a) Cormack GV, Grossman MR (2017a) Navigating imprecision in relevance assessments on the road to total recall: Roger and me. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp 5–14
- Cormack and Grossman (2017b) Cormack GV, Grossman MR (2017b) Navigating imprecision in relevance assessments on the road to total recall: Roger and me. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, ACM
- Cormack and Grossman (2017c) Cormack GV, Grossman MR (2017c) Technology-assisted review in empirical medicine: Waterloo participation in CLEF ehealth 2017. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017.
- Cormack and Lynam (2005a) Cormack GV, Lynam TR (2005a) Spam corpus creation for trec. In: CEAS
- Cormack and Lynam (2005b) Cormack GV, Lynam TR (2005b) Trec 2005 spam track overview. In: TREC, pp 500–274
- Cormack and Mojdeh (2009) Cormack GV, Mojdeh M (2009) Machine learning for information retrieval: Trec 2009 web, relevance feedback and legal tracks. In: TREC
- Cormack et al (1998) Cormack GV, Palmer CR, Clarke CL (1998) Efficient construction of large test collections. In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 282–289
- Cormack et al (2010) Cormack GV, Grossman MR, Hedin B, Oard DW (2010) Overview of the trec 2010 legal track. In: Proc. 19th Text REtrieval Conference, p 1
- Drucker et al (2001) Drucker H, Shahrary B, Gibbon DC (2001) Relevance feedback using support vector machines. In: ICML, pp 122–129
- Dumais (2005) Dumais S (2005) The Interactive TREC Track: Putting the User Into Search. MIT Press, URL https://www.microsoft.com/en-us/research/publication/interactive-trec-track-putting-user-search/
- Grossman et al (2016) Grossman M, Cormack G, Roegiest A (2016) Trec 2016 total recall track overview. Proc TREC-2016
- Grossman and Cormack (2010) Grossman MR, Cormack GV (2010) Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review. Rich JL & Tech 17:1
- Grossman et al (2011) Grossman MR, Cormack GV, Hedin B, Oard DW (2011) Overview of the trec 2011 legal track. In: TREC, vol 11
- Hearst (2009) Hearst M (2009) Search user interfaces. Cambridge University Press
- Hedin et al (2009) Hedin B, Tomlinson S, Baron JR, Oard DW (2009) Overview of the trec 2009 legal track. Tech. rep., NATIONAL ARCHIVES AND RECORDS ADMINISTRATION COLLEGE PARK MD
- Hersh and Bhupatiraju (2003) Hersh WR, Bhupatiraju RT (2003) Trec genomics track overview. In: TREC, vol 2003, pp 14–23
- Hogan et al (2008) Hogan C, Reinhart J, Brassil D, Gerber M, Rugani SM, Jade T (2008) H5 at trec 2008 legal interactive: user modeling, assessment & measurement. Tech. rep., H5 SAN FRANCISCO CA
- Kanoulas et al (2017) Kanoulas E, Li D, Azzopardi L, Spijker R (2017) Clef 2017 technologically assisted reviews in empirical medicine overview. Working Notes of CLEF pp 11–14
- Kolcz and Cormack (2009) Kolcz A, Cormack GV (2009) Genre-based decomposition of email class noise. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 427–436
- Liu and Croft (2002) Liu X, Croft WB (2002) Passage retrieval based on language models. In: Proceedings of the eleventh international conference on Information and knowledge management, ACM, pp 375–382
- Oard et al (2008) Oard DW, Hedin B, Tomlinson S, Baron JR (2008) Overview of the trec 2008 legal track. Tech. rep., MARYLAND UNIV COLLEGE PARK COLL OF INFORMATION STUDIES
- Over (2001) Over P (2001) The trec interactive track: an annotated bibliography. Information Processing & Management 37(3):369–381
- Pickens et al (2015) Pickens J, Gricks T, Hardi B, Noel M (2015) A constrained approach to manual total recall. In: Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015, URL http://trec.nist.gov/pubs/trec24/papers/catres-TR.pdf
- Robertson and Soboroff (2002) Robertson SE, Soboroff I (2002) The trec 2002 filtering track report. In: TREC, vol 2002, p 5
- Roegiest et al (2015) Roegiest A, Cormack G, Grossman M, Clarke C (2015) Trec 2015 total recall track overview. Proc TREC-2015
- Ruthven and Lalmas (2003) Ruthven I, Lalmas M (2003) A survey on the use of relevance feedback for information access systems. The Knowledge Engineering Review 18(2):95–145
- Salton et al (1993) Salton G, Allan J, Buckley C (1993) Approaches to passage retrieval in full text information systems. In: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 49–58
- Sanderson (1998) Sanderson M (1998) Accurate user directed summarization from existing tools. In: Proceedings of the seventh international conference on Information and knowledge management, ACM, pp 45–51
- Sanderson and Joho (2004) Sanderson M, Joho H (2004) Forming test collections with no system pooling. In: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 33–40
- Smucker and Jethani (2010) Smucker MD, Jethani CP (2010) Human performance and retrieval precision revisited. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 595–602
- Soboroff and Robertson (2003) Soboroff I, Robertson S (2003) Building a filtering test collection for trec 2002. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, ACM, pp 243–250
- Sparck Jones and Van Rijsbergen (1975) Sparck Jones K, Van Rijsbergen C (1975) Report on the need for and provision of an 'ideal' information retrieval test collection. Computer Laboratory
- Tombros and Sanderson (1998) Tombros A, Sanderson M (1998) Advantages of query biased summaries in information retrieval. In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 2–10
- Tomlinson et al (2007) Tomlinson S, Oard DW, Baron JR, Thompson P (2007) Overview of the trec 2007 legal track. In: TREC
- Voorhees (2000) Voorhees EM (2000) Variations in relevance judgments and the measurement of retrieval effectiveness. Information processing & management 36(5):697–716
- Voorhees and Harman (2000) Voorhees EM, Harman D (2000) Overview of the eighth text retrieval conference (trec-8). pp 1–24
- Voorhees et al (2005) Voorhees EM, Harman DK, et al (2005) TREC: Experiment and evaluation in information retrieval, vol 1. MIT press Cambridge
- Wallace et al (2010) Wallace BC, Small K, Brodley CE, Trikalinos TA (2010) Active learning for biomedical citation screening. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 173–182
- Wallace et al (2013) Wallace BC, Dahabreh IJ, Schmid CH, Lau J, Trikalinos TA (2013) Modernizing the systematic review process to inform comparative effectiveness: tools and methods. Journal of comparative effectiveness research 2(3):273–282
- Yu et al (2016) Yu Z, Kraft NA, Menzies T (2016) How to read less: Better machine assisted reading methods for systematic literature reviews. arXiv preprint arXiv:1612.03224
- Zhang et al (2015) Zhang H, Lin W, Wang Y, Clarke CLA, Smucker MD (2015) Waterlooclarke: TREC 2015 total recall track. In: Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015
- Zhang et al (2016) Zhang H, Lin J, Cormack GV, Smucker MD (2016) Sampling strategies and active learning for volume estimation. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, ACM, pp 981–984
- Zhang et al (2017) Zhang H, Abualsaud M, Ghelani N, Ghosh A, Smucker MD, Cormack GV, Grossman MR (2017) UWaterlooMDS at the TREC 2017 Common Core Track. TREC 2017