1 Introduction
Contrary to Active Learning, Active Search does not aim at building the best possible classifier with the minimum number of labelled instances, but simply aims at discovering virtually all the instances of the positive class (assuming a binary classification problem) with the minimum “reviewing” effort or cost. Active Search strategies most often include the incremental training of some classifier as a means to speed up and control the search, but this step is not strictly necessary. The collection (or pool) of objects to search in is assumed to be known in advance, and the setting is in some ways similar to Transductive Learning, but with an online, incremental, “recall-oriented” perspective. Consequently, Active Search algorithms can be very different from traditional active learning algorithms. Active Search applications can be found in numerous domains: fraud detection, compliance monitoring, e-discovery, systematic medical reviews, prior art search when filing a patent, etc.
Recently, some pieces of work [8, 9, 19, 22] have focused on developing Active Search strategies that significantly depart from active learning, by emphasizing the “Total Recall” aspect and the “Continuous Active Learning” setting. The main idea of this family of works is to greedily select the next instances as the ones with the largest estimated probabilities of belonging to the relevant class (these probabilities are given by a classifier incrementally trained from all instances labelled up to the current time). This selection strategy makes sense, as these instances are both the most promising ones, given the data reviewed up to the current time, and the ones that can most reduce, at least locally and in the short run, the uncertainty over the parameters of the classifier. Interestingly, the authors of [12] give a Bayesian rationale for this greedy strategy, showing that it maximizes the expected utility in the one-shot setting (i.e. when only one query remains).
In addition to belonging to a low-prevalence class, relevant objects can take multiple forms or facets: the landscape of the positive class is often “multimodal”. Given these two challenges (unbalanced class distribution and multimodality of the relevant class), there is a clear need to control the exploration-exploitation trade-off if we want to improve the “baseline” greedy approach and make it more robust. Typically, the search starts with a small number of “seed” instances, and these seeds rarely cover all modes and facets of the positive class. A greedy selection approach, which basically selects the next instances in the vicinity of the positive seeds through the classifier, therefore runs the risk of missing large areas of relevant instances when the corresponding facets are not covered by (or are hardly reachable from) the instances reviewed and labelled as positive up to the current time. At the early stage of the search, it can be useful to spend some effort exploring diverse regions of the instance space, provided that these regions could offer potentially relevant elements in the long run.
The use of Multi-Armed Bandits (MAB) appears as a natural choice to solve this exploration-exploitation trade-off. Instead of considering the problem as an instance recommendation problem with a binary response, as in [6, 5], and solving it with contextual bandit strategies, we follow an alternative strategy that turns out to be more efficient in our use cases. This alternative consists in discretizing the structure of the feature space into a finite set of clusters and in relying on the cluster structure to manage the exploration-exploitation trade-off. More precisely, the idea is to consider each cluster as an arm of a MAB and to focus on the most promising ones, while ensuring that the selection strategy covers all facets or clusters of the instance space. However, we then face the problem of dealing with non-stationary (or restless) mortal bandits (see for instance [1, 11, 15]). Indeed, each time the system selects a cluster (i.e. an arm), it consumes one instance inside the cluster and, as the cluster size is finite, the cluster can be exhausted. Moreover, even if the system chooses the most likely relevant instances inside a cluster, the reward distribution (the reward being 1 if the chosen instance is relevant and 0 otherwise) is clearly time-varying: roughly speaking, the chance of getting additional relevant instances inside a cluster decreases over time. To deal with this non-stationary environment, the method proposed here introduces a forgetting factor when updating the reward posterior distributions of the arms (i.e. clusters).
2 Related work
Even if there is a large literature on active learning, the case of active search has not received the same attention. One of the first works on this problem was presented by [12], who proposed a Bayesian approach that requires computations exponential in the number of lookahead steps (the number of future steps considered), which makes this method impractical for large collections. Several pieces of work have approached active search as exploration on graphs [20, 16] or as graph learning [18]. However, the most widely used state-of-the-art method remains the greedy (one single step ahead) approach: one of the most striking examples is the “Continuous Active Learning” concept of [9], implemented in the form of the AutoTAR system in the field of Technology-Assisted Review for e-discovery [9] (extended further in [22]). Our method is quite different from these active search approaches, as it explicitly controls the exploration-exploitation trade-off, does not use any graph structure, and does not rely on a finite-horizon approach.
Our proposed method shares some similarity with introducing diversity in recommendation, retrieval and active learning: instead of proposing the instance predicted as the most relevant (or the most uncertain) given the feedback received up to the current round, the selection criterion involves an extra term based on diversity (see for instance [21] and, more fundamentally, the concept of Maximal Marginal Relevance ranking [4]). In our method, we are promoting diversity in a different way, by exploiting the cluster structure of the data, and by using MAB in a suitable way.
Our approach to active search builds on ideas from cluster-based active learning, which relies on the “cluster assumption” (instances inside the same cluster are likely to have the same label). [17] and [13] approached active learning by clustering unlabelled instances and selecting instances to query according to a criterion that prefers clusters on the decision boundary and the most representative instances in these clusters.
The use of MAB in Active Learning is not new and is relatively well studied. A relatively common way to solve active learning with MAB is to cluster the instances of the pool and to consider each cluster as an arm [3, 7]. In this case, the payoff distribution of each arm is non-stationary, since the probability of finding relevant instances in a cluster decreases as the cluster is exploited. The approach of [2], for instance, deals with this non-stationarity by assuming a known fixed trend.
Our work differs from these lines of work in several respects. First, these methods tackle the Active Learning problem, while we target the Active Search problem. Secondly, they often rely on a simple classification model, namely that instances inside a cluster all have the same posterior probability of being relevant. Thirdly, we use MAB to explicitly control the exploration-exploitation trade-off. Lastly, we use soft clustering, which allows acquired information to propagate outside a cluster and speeds up the search.
Finally, our work is not the only one to use MAB for Active Search. Even if initially formulated as an item recommendation problem, the MAB-based approach of [5, 6] could be used as such for active search. In that work, the “next item selection” problem is expressed as a contextual bandit, where each instance is an arm; the expected reward (or relevance label) of an instance is expressed as a logistic regression model, whose parameters are sampled, following a Thompson sampling strategy, from the posterior distribution updated each time a new label is collected. We have implemented this method for our collections (see the experimental section) and the results were very poor, because the method is not adapted to sparse high-dimensional data and requires a lot of exploration, often exceeding the budget or simply the dataset size.
3 Proposed Method
Traditional active search methods are typically greedy: they look for the most promising instances, i.e. the ones with the highest conditional probabilities of relevance ($P(y=1 \mid x)$) as estimated by a classifier trained on the instances reviewed up to the current time. As the class of relevant instances often has a very low prevalence and could be considered as a “rare event”, strategies of this kind make sense, as these instances are actually the ones minimizing the risk of reviewing a non-relevant instance while, simultaneously, reducing the uncertainty as much as possible. For instance, with a logistic regression classifier, most of the instances in the pool set will have probability scores not larger than the prevalence and only a few will have a larger probability, closer to 0.5 ([14] provides details about these observations). Entropy-based or uncertainty-based active learning strategies will favour these instances. Note that these are also the instances that are the most statistically informative with respect to the variance matrix of the parameter estimates, as an instance contributes to the inverse of this matrix through a factor $p(1-p)$, where $p$ denotes its estimated probability of relevance [14].
However, greedy strategies are likely to fail when the relevant class is “multimodal” or “multifaceted”, in other words when there are multiple, well distinct and possibly unbalanced ways of being relevant. As the greedy strategies introduce a high selection bias when incrementally building the training set, the selection algorithm may miss important sectors of the relevant category, because they are relatively far from the positively labelled training instances, which constitute a homogeneous set by construction. In practice, this risk strongly depends on the quality of the seed set: the seed set should be diverse enough and have a good coverage of the different facets of relevance, but unfortunately this guarantee is hard to obtain.
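As a reminder of why instances with scores near 0.5 are the most statistically informative, the standard asymptotic variance matrix of logistic regression estimates (the form discussed in [14]) can be written as below. The notation is assumed here for illustration and is not taken from the original text: $\hat{\beta}$ denotes the parameter estimates, $x_i$ the feature vector of instance $i$ and $p_i$ its estimated probability of relevance.

```latex
V(\hat{\beta}) = \Big[ \sum_{i} p_i (1 - p_i)\, x_i x_i^{\top} \Big]^{-1},
\qquad
p_i = \frac{1}{1 + e^{-x_i^{\top}\hat{\beta}}}
```

The weight $p_i(1-p_i)$ peaks at $p_i = 0.5$ and vanishes as $p_i$ approaches 0 or 1. In a low-prevalence setting, the few instances with scores well above the prevalence are therefore the ones that contribute most to shrinking this variance.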
There are several ways of translating the active search problem into a MAB problem. For instance, each instance could be considered as an arm, but once it has been selected, it does not make sense to choose this arm again. The alternative we adopt in this work is to choose clusters of pool instances as arms. This choice has the particularity that the algorithm can “exhaust” a cluster and that, consequently, the related arm will “die” once every instance in the cluster has been reviewed and labelled. If we consider the reward of an arm as the binary label of the instance that the algorithm chooses inside the corresponding cluster (1 if the instance is relevant and 0 otherwise), the reward distribution of an arm is obviously non-stationary, as an “exhausted” empty cluster switches abruptly to a single-value discrete distribution (0 reward with probability 1). Intuitively, we face a diminishing-returns issue: the retrieval rate of relevant objects decreases as we exploit the cluster. Note that, in our approach, we rely on soft clustering, so that an instance can belong to multiple clusters with different degrees of membership. This makes the approach more robust with respect to the particular clustering method but, on the other hand, we have to adapt the MAB algorithms to a non-standard reward scheme: the reward obtained for a particular instance must be reassigned to multiple clusters (or arms) with an appropriate weighting.
It is rather usual, in active search, to proceed by batches of (polynomially or exponentially) increasing size: instead of proposing one single instance to review at each iteration, the active search algorithm provides batches of instances to be reviewed, which grow larger and larger as the confidence in the classifier performance increases [9]. We also have to adapt the MAB algorithms so that they can provide us with a “batch” of arm trials (some kind of MAB with multiple plays) and update the arms’ sufficient statistics (i.e. the posterior distribution of the reward) accordingly.
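To make the idea of exponentially increasing batches concrete, here is a minimal sketch of one possible schedule. The function name `batch_sizes`, the starting size of 1 and the 10% growth rate are illustrative assumptions, not values stated in this paper.

```python
import math


def batch_sizes(pool_size, first=1, growth=0.1):
    """Yield batch sizes that grow geometrically until the pool is covered.

    `first` and `growth` are hypothetical defaults chosen for illustration.
    """
    remaining = pool_size
    size = first
    while remaining > 0:
        current = min(size, remaining)  # never ask for more than what is left
        yield current
        remaining -= current
        size += math.ceil(size * growth)  # grow by at least one instance


if __name__ == "__main__":
    print(list(batch_sizes(10)))
```

With such a schedule, the classifier is retrained often while it is still unreliable, and less often once larger batches can be trusted.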
Basically, our algorithm consists of the following steps:
(1) Create an initial training set from the seed set, consisting of a synthetic “positive” instance built from the initial user query or of a few positive instances discovered by any means, and a random sample from the pool set, temporarily labelled as “negative” (typically, the random sample size is 100 instances). As the relevant class has low prevalence, most of the instances in the random sample are indeed non-relevant and the label noise introduced by this approximation is negligible.
(2) Iteratively, create a batch of instances from the pool set using a non-stationary, batch (or multiple-plays) extension of the “Thompson sampling” MAB algorithm; ask the reviewer to label them; remove them from the pool set; update the reward distribution estimates; and retrain the classifier based on the labelled documents and a random sample redrawn from the pool set, temporarily labelled as negative.
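The training-set construction used in both steps can be sketched as follows. This is only an illustration under assumptions: the function name `build_training_set` and the dictionary-based data layout are hypothetical, while the default sample size of 100 follows the typical value quoted in step (1).

```python
import random


def build_training_set(labelled, pool_ids, n_pseudo_neg=100, rng=None):
    """Combine reviewed labels with a random pool sample treated as negative.

    labelled: dict mapping instance id -> 0/1 label given by the reviewer.
    pool_ids: iterable of ids still in the pool (unlabelled).
    Returns a list of (instance id, label) pairs.
    """
    rng = rng or random.Random()
    # Sorting only makes the sample reproducible for a given seed.
    candidates = sorted(set(pool_ids) - set(labelled))
    sample = rng.sample(candidates, min(n_pseudo_neg, len(candidates)))
    # Pseudo-negatives are temporary: they are redrawn at every iteration.
    return list(labelled.items()) + [(i, 0) for i in sample]


if __name__ == "__main__":
    seeds = {"doc-a": 1, "doc-b": 1}
    pool = [f"doc-{n}" for n in range(1000)]
    training = build_training_set(seeds, pool, rng=random.Random(0))
    print(len(training))
```

Redrawing the pseudo-negative sample at each round limits the impact of any relevant instance that happens to be drawn by chance.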
Actually, the MAB algorithm is a two-level process: the algorithm first samples $B$ times ($B$ being the batch size) a “conversion rate” for each cluster/arm (the conversion rate is the probability that a high-score member of the cluster is annotated as a relevant instance) from a Beta posterior distribution and, secondly, selects an instance that maximizes the probability of being relevant, given the sampled conversion rates of the clusters it belongs to.
The Thompson sampling extension, designed to deal with the non-stationary reward of “mortal” multi-armed bandits with multiple plays (batch of trials), is based on the following ideas:

The posterior distribution of the “conversion rate” of each arm/cluster $k$ is a Beta distribution with parameters ($\alpha_k, \beta_k$), initialised with $\alpha_k = \beta_k = 1/2$ for all arms (Jeffreys’s prior). The parameters $\alpha_k$ and $\beta_k$ can be considered as the “equivalent” numbers of successes and failures of a binomial distribution (the cluster reward distribution).

When receiving the labels (or, equivalently, the rewards) of the instances selected at the previous round, each binary reward is redistributed over the clusters with a weight equal to the membership value of the instance with respect to the cluster.

Updating the posterior is done using a forgetting factor, discounting the previous (weighted) success/failure counts by a factor $\gamma$. Alternatively, it can be done using a sliding window of size $W$; in this case, only the (weighted) success/failure counts of the last $W$ iterations are taken into account in updating the posterior distributions.

For a batch of size $B$, we repeat $B$ times the following steps: for each arm/cluster $k$, draw a value $\theta_k$ from the Beta distribution associated with the cluster, $\theta_k \sim \mathrm{Beta}(\alpha_k, \beta_k)$; this value should be interpreted as the parameter (the mean) of a Bernoulli distribution modelling the arm reward distribution. In this work, we use an “Optimistic Bayesian sampling” variant, where $\theta_k$ is not allowed to be smaller than the empirical discounted mean of the arm reward (based on observations up to the current round).

When generating the batch of instances, once an instance has been selected, it is removed from the pool set and will not be selected again.
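The sampling step described in the list above can be sketched in a few lines with the standard library. The function names are hypothetical, and the sketch makes one explicit assumption: the “empirical discounted mean” is approximated by the posterior mean of the Beta distribution, which is only one possible reading of the text.

```python
import random

JEFFREYS_PRIOR = 0.5  # initial value of both Beta parameters for every arm


def optimistic_draw(alpha, beta, rng):
    """One optimistic Thompson sample for an arm with a Beta(alpha, beta) posterior."""
    theta = rng.betavariate(alpha, beta)
    # Assumption: the posterior mean stands in for the discounted empirical mean.
    mean = alpha / (alpha + beta)
    return max(theta, mean)


def draw_batch(alphas, betas, batch_size, rng):
    """Return batch_size lists, each holding one optimistic draw per arm."""
    return [
        [optimistic_draw(a, b, rng) for a, b in zip(alphas, betas)]
        for _ in range(batch_size)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    alphas = [JEFFREYS_PRIOR, 3.0]  # second arm has already seen some successes
    betas = [JEFFREYS_PRIOR, 1.0]
    for draws in draw_batch(alphas, betas, batch_size=3, rng=rng):
        print(draws)
```

Because a draw can never fall below the mean, the variant only adds optimism: a poorly explored arm with a wide posterior still has a fair chance of producing a high sample and being tried.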
Let us be more precise in the description of the update of the cluster posterior distribution. Assuming that, at round $t$, the conversion rate of cluster $k$ (i.e. the probability that a “high score” member of the cluster will be annotated as relevant) follows a Beta distribution with parameters ($\alpha_k^{(t)}, \beta_k^{(t)}$), then the posterior distribution of the conversion rate at round $t+1$ is $\mathrm{Beta}(\alpha_k^{(t+1)}, \beta_k^{(t+1)})$ with:
$$\alpha_k^{(t+1)} = \gamma\,\alpha_k^{(t)} + \sum_{i \in \mathcal{B}_t} m_{ik}\, r_i, \qquad \beta_k^{(t+1)} = \gamma\,\beta_k^{(t)} + \sum_{i \in \mathcal{B}_t} m_{ik}\,(1 - r_i) \tag{1}$$
In these equations, $\gamma$ is the forgetting factor used to cope with the non-stationarity of the conversion rate distribution, $\mathcal{B}_t$ is the batch of instances selected at round $t$, $m_{ik}$ is the membership value of instance $i$ in cluster $k$ (interpreted as $p(k \mid x_i)$, the probability that instance $i$ with feature vector $x_i$ belongs to cluster $k$) and $r_i$ is the binary reward (i.e. 0/1 label) of instance $i$.
As far as the selection criterion is concerned, at each round $t$, we repeat $B$ times the following steps: for each cluster $k$, sample $\theta_k$ from $\mathrm{Beta}(\alpha_k^{(t)}, \beta_k^{(t)})$ and compute $\tilde{\theta}_k = \max(\theta_k, \hat{\mu}_k^{(t)})$, where $\hat{\mu}_k^{(t)}$ is the empirical discounted mean reward of cluster $k$ (optimistic Thompson sampling, replacing the sampled value by the empirical mean if the former is smaller than the latter); then choose the instance $i^*$ such that:
$$i^* = \arg\max_{i \in \mathcal{U}} \; p_t(y=1 \mid x_i) \sum_{k=1}^{K} m_{ik}\, \tilde{\theta}_k \tag{2}$$
with $\mathcal{U}$ the set of unlabelled instances (i.e. the pool), $K$ the number of clusters and $p_t(y=1 \mid x_i)$ the probability that instance $i$ is relevant, as estimated by the current classifier using labelled instances up to round $t$. Intuitively, this criterion selects the instance that has the best “optimistic” chance of being converted into a real relevant instance: this “chance” is measured by the product of the marginal “optimistic” conversion rate of “high score” instances (marginalised over the clusters the instance belongs to) and, roughly speaking, the probability of being a “high score” instance (which is trivially estimated by $p_t(y=1 \mid x_i)$). The exploration effect relies on the sampling from $\mathrm{Beta}(\alpha_k^{(t)}, \beta_k^{(t)})$, which can potentially favour less explored clusters as their posterior distribution is less peaked.
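The discounted posterior update and the selection criterion described above can be sketched as follows. This is a minimal illustration under assumptions: the function names and the list/dictionary data layout are hypothetical, and the optimistic conversion rates are taken as given inputs.

```python
def discounted_update(alphas, betas, batch, gamma):
    """Discount old counts by gamma, then add membership-weighted rewards.

    batch: list of (memberships, reward) pairs, where memberships[k] is the
    soft-clustering weight of the instance in cluster k and reward is 0 or 1.
    """
    new_alphas = [gamma * a for a in alphas]
    new_betas = [gamma * b for b in betas]
    for memberships, reward in batch:
        for k, weight in enumerate(memberships):
            new_alphas[k] += weight * reward        # weighted success
            new_betas[k] += weight * (1 - reward)   # weighted failure
    return new_alphas, new_betas


def select_instance(pool, thetas):
    """Pick the pool instance with the best optimistic chance of relevance.

    pool: dict mapping instance id -> (classifier probability, memberships).
    thetas: one (optimistic) sampled conversion rate per cluster.
    """
    def score(item):
        probability, memberships = item[1]
        marginal_rate = sum(m * t for m, t in zip(memberships, thetas))
        return probability * marginal_rate

    return max(pool.items(), key=score)[0]


if __name__ == "__main__":
    alphas, betas = [0.5, 0.5], [0.5, 0.5]
    batch = [([1.0, 0.0], 1), ([0.5, 0.5], 0)]
    print(discounted_update(alphas, betas, batch, gamma=0.9))
    pool = {"a": (0.9, [1.0, 0.0]), "b": (0.5, [0.0, 1.0])}
    print(select_instance(pool, thetas=[0.1, 0.9]))
```

In the small example, instance "b" is preferred despite its lower classifier score, because it sits in a cluster whose sampled conversion rate is much higher. When building a full batch, the selected instance would be removed from `pool` before the next draw.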
Concretely, the method we propose is summarised by Algorithm 1.
4 Experiments
4.1 Datasets
We have chosen two representative datasets that contain multimodal low-prevalence classes. Both are collections of text documents, but the method is not restricted to textual data (we are currently applying it to movie recommendation problems, based on previous ratings and on item/user features). The first dataset is the Reuters RCV1 Corpus (approximately 807,000 documents), considering only classes at the first level of the hierarchy with a prevalence of less than 10% and having at least three children subclasses whose cumulated size is at least 50% of the parent class. These classes are: C17 (Funding/Capital, 5.18% prevalence), C18 (Ownership Changes, 6.38% prevalence), E14 (Consumer Finance, 0.26% prevalence), E31 (Output/Capacity, 0.29% prevalence), E51 (Trade/Reserves, 2.57% prevalence) and G15 (European Community, 2.37% prevalence). The reason for selecting these classes is that, by construction, they consist of multiple diverse subclasses and, consequently, are “multifaceted”.
The second collection is the ENRON collection and a set of 8 associated “topics” (i.e. a set of 8 different relevance classes). These 8 topics are the ones used in the TREC 2010 Legal Track benchmark (Topics 200 to 207: see [10] for details on these topics). These topics have a prevalence well below 0.2% (0.1% on average). We conjecture that, as in many realistic e-discovery tasks, these topics are also multimodal/multifaceted. As most of the corpus is not labelled with respect to these 8 classes, we considered the subset (subsequently called “ENRON-subset”) of documents assessed for at least one of these topics (resulting in approximately 60,000 documents). In the TREC Legal Track benchmarks, the selection of the documents to be assessed for each topic is done by stratified sampling, the strata being defined by the degree of “agreement” on the relevance among the different systems/teams participating in the benchmark. In particular, the last stratum typically consists of documents never returned by any participating system and, even if it constitutes the main part of the collection, this stratum has a low sampling rate. All performance measures take this stratified sampling process into account by weighting each document with the inverse of the sampling rate of the stratum it belongs to. This means that retrieving a relevant document in the last stratum is considered more important (and, consequently, is rewarded much more) than retrieving relevant documents in the first stratum, i.e. the stratum of “easy to find” relevant documents that any participating system discovered. In the results shown hereafter, we have taken these stratum weights into account when computing the achieved recall.
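The stratum-weighted recall described above can be illustrated with a short sketch. The function name `weighted_recall` and the dictionary layout are hypothetical, and the sampling rates in the example are made up; only the inverse-sampling-rate weighting comes from the text.

```python
def weighted_recall(found_ids, assessed):
    """Recall where each relevant document counts 1 / (sampling rate of its stratum).

    found_ids: ids of the documents retrieved so far.
    assessed: dict mapping id -> (is_relevant, stratum sampling rate).
    """
    total = sum(1.0 / rate for relevant, rate in assessed.values() if relevant)
    if total == 0:
        return 0.0
    found = sum(
        1.0 / assessed[doc][1]
        for doc in found_ids
        if doc in assessed and assessed[doc][0]
    )
    return found / total


if __name__ == "__main__":
    assessed = {
        "easy": (True, 1.0),     # stratum fully sampled
        "hard": (True, 0.1),     # rarely sampled stratum, so weight 10
        "other": (False, 0.1),
    }
    print(weighted_recall({"easy"}, assessed))
    print(weighted_recall({"hard"}, assessed))
```

In this toy case, finding the single "hard" document is worth ten times more recall than finding the "easy" one, which mirrors the emphasis on documents that no participating system returned.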
4.2 Results
The classifier used in all experiments is an L2-regularised Logistic Regression, based on the tokenised bag-of-words representation of the collections (stopwords included, but words with a document frequency of less than 3 filtered out); the TF-IDF weighting scheme and L2-normalisation were applied to each document vector. As the soft clustering method, we used LDA (Latent Dirichlet Allocation).
We fixed the discounting factor $\gamma$ (used when updating the posteriors of the cluster reward distributions) to 0.99 for Reuters RCV1 and to 0.95 for ENRON-subset. The number of arms/clusters ($K$), or equivalently the number of latent components in LDA, was fixed to 200 for Reuters RCV1 and to 50 for ENRON-subset. These hyperparameters were tuned on “unused” topics: the “Commodity Markets” class (M14) of Reuters RCV1 and topic 301 of the TREC 2010 Legal Interactive Task.
The performance measure is simply the proportion of the collection that must be reviewed to reach certain levels of recall, focusing on the high recall values. For each collection and each topic, we have performed 10 different runs with different seed sets; each seed set consisted of three random relevant instances of the class or topic. We have limited the reviewing budget to 40% of the collection. The values given in Table 1 (for the ENRON-subset dataset) and Table 2 (for the Reuters RCV1 dataset) are averages over these 10 runs.
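The performance measure above can be sketched as a small function. The name `effort_at_recall` is hypothetical, and two simplifying assumptions are made: recall is computed without stratum weights, and the budget value is returned when the target recall is not reached within the budget.

```python
def effort_at_recall(review_order, relevant, collection_size, target, budget=0.4):
    """Fraction of the collection reviewed when recall first reaches `target`.

    review_order: document ids in the order they were shown to the reviewer.
    relevant: set of ids of all relevant documents.
    Returns `budget` if the target is not reached within the budget (assumption).
    """
    needed = target * len(relevant)
    max_reviews = int(budget * collection_size)
    found = 0
    for n_reviewed, doc in enumerate(review_order[:max_reviews], start=1):
        if doc in relevant:
            found += 1
            if found >= needed:
                return n_reviewed / collection_size
    return budget


if __name__ == "__main__":
    order = list(range(10))
    relevant = {1, 3, 8}
    print(effort_at_recall(order, relevant, 10, target=0.5, budget=1.0))
    print(effort_at_recall(order, relevant, 10, target=1.0, budget=0.4))
```

Averaging this value over several runs with different seed sets gives one number per topic and recall level.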
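Table 1: ENRON-subset dataset. Proportion of the collection to be reviewed to reach each recall level (average over 10 runs).

| Topic | Baseline, Recall=0.5 | Baseline, Recall=0.85 | Baseline, Recall=0.90 | Baseline, Recall=0.975 | Proposed Method, Recall=0.5 | Proposed Method, Recall=0.85 | Proposed Method, Recall=0.90 | Proposed Method, Recall=0.975 |
|---|---|---|---|---|---|---|---|---|
| 200 | 0.29% | 2.34% | 2.51% | 4.86% | 0.31% | 2.41% | 2.52% | 3.88% |
| 201 | 1.01% | 2.23% | 2.36% | 3.16% | 0.74% | 1.87% | 1.99% | 2.97% |
| 202 | 0.55% | 4.9% | 6.68% | 10.3% | 0.54% | 4.48% | 6.23% | 8.23% |
| 203 | 0.86% | 40% | 40% | 40% | 0.99% | 13.52% | 13.61% | 13.8% |
| 204 | 1.45% | 8.26% | 10.45% | 14.81% | 1.41% | 7.64% | 9.88% | 13.36% |
| 205 | 4.38% | 18.2% | 40% | 40% | 4.78% | 18.8% | 19.4% | 40% |
| 206 | 1.07% | 40% | 40% | 40% | 1.15% | 11.6% | 14.26% | 40% |
| 207 | 0.29% | 1.68% | 2.42% | 40% | 0.32% | 1.71% | 2.2% | 3.89% |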
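Table 2: Reuters RCV1 dataset. Proportion of the collection to be reviewed to reach each recall level (average over 10 runs).

| Topic | Baseline, Recall=0.9 | Baseline, Recall=0.95 | Baseline, Recall=0.99 | Proposed Method, Recall=0.9 | Proposed Method, Recall=0.95 | Proposed Method, Recall=0.99 |
|---|---|---|---|---|---|---|
| C17 | 7.92% | 12.27% | 25.97% | 8.15% | 11.9% | 19.48% |
| C18 | 6.81% | 8.96% | 15.87% | 6.91% | 8.75% | 11.21% |
| E14 | 1.19% | 2.99% | 12.63% | 1.48% | 2.81% | 10.71% |
| E31 | 0.93% | 1.67% | 17.06% | 1.13% | 1.61% | 12.34% |
| E51 | 6.67% | 10.42% | 19.85% | 6.81% | 10.01% | 13.45% |
| G15 | 2.39% | 2.98% | 5.36% | 2.5% | 2.92% | 4.12% |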
There are several important observations that we can make from these experimental results:

If the requested level of recall is relatively low, the baseline is still the best choice. But, for a sufficiently high recall, the exploration effort spent during the early phases of the search starts to pay off and our method outperforms the baseline. The “break-even” point between the two strategies depends on the collection and on the topic. In some way, our method can be considered as an “insurance” for reaching a high recall efficiently without missing significant segments of relevant instances; the cost of this insurance is the extra effort spent exploring diverse clusters during the search.

The beneficial effect seems to decrease, and even to disappear, for extreme values of recall, especially in the case of the ENRON corpus. The most likely reason for this sudden decline is label noise: some irrelevant instances, incorrectly labelled as relevant, are virtually unreachable by any classifier built from (correctly labelled) positive instances. We suspect that noisy labels are more numerous in the ENRON dataset (where the relevant class is more “fuzzy” and subject to interpretation) than in the Reuters corpus.
5 Conclusions and Future Works
This paper considers the Active Search problem as a resource allocation task in an uncertain environment and handles it in a way similar to what is done for petroleum drilling and ore mining projects. By soft-clustering the landscape of the instance feature space and using sampling strategies based on MAB, the method proposed here should be considered as an insurance that decreases the risk of missing a significant amount of relevant objects when the task is to achieve high recall of a low-prevalence, multifaceted relevant class. Future work will focus on analysing the cost/benefit ratio depending on the task and the collection to be processed.
Acknowledgment: This work was partially funded by the French Government under the grant ANR-13-CORD-0020 (ALICIA Project).
References

[1] Robin Allesiardo and Raphaël Féraud. EXP3 with drift detection for the switching bandit problem. In 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, Campus des Cordeliers, Paris, France, October 19-21, 2015, pages 1–7, 2015.
[2] Djallel Bouneffouf and Raphael Féraud. Multi-armed bandit problem with known trend. Neurocomputing, 205:16–21, 2016.
[3] Djallel Bouneffouf, Romain Laroche, Tanguy Urvoy, Raphael Feraud, and Robin Allesiardo. Contextual bandit for active learning: Active Thompson sampling. In ICONIP 2014, pages 405–412, 2014.
[4] Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR’98, 1998.
[5] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In NIPS, pages 2249–2257, 2011.
[6] Olivier Chapelle, Eren Manavoglu, and Rómer Rosales. Simple and scalable response prediction for display advertising. ACM TIST, 5(4), 2014.
[7] Timothé Collet and Olivier Pietquin. Optimism in active learning. Computational Intelligence and Neuroscience, 2015:94:94–94:94, January 2015.
[8] Gordon V. Cormack and Maura R. Grossman. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of SIGIR’2014, pages 153–162, 2014.
[9] Gordon V. Cormack and Maura R. Grossman. Scalability of continuous active learning for reliable high-recall text classification. In Proceedings of CIKM’2016, pages 1039–1048, 2016.
[10] Gordon V. Cormack, Maura R. Grossman, Bruce Hedin, and Douglas W. Oard. Overview of the TREC 2010 legal track. In Proceedings of The Nineteenth Text REtrieval Conference, TREC 2010, 2010.
[11] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for non-stationary bandit problems. In Algorithmic Learning Theory, 2011.
[12] Roman Garnett, Yamuna Krishnamurthy, Xuehan Xiong, Jeff G. Schneider, and Richard P. Mann. Bayesian optimal active search and surveying. In Proceedings of ICML’2012, 2012.
[13] Dezhi Hong, Hongning Wang, and Kamin Whitehouse. Clustering-based active learning on sensor type classification in buildings. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pages 363–372, New York, NY, USA, 2015. ACM.
[14] Gary King and Langche Zeng. Logistic regression in rare events data. Political Analysis, 9, 2001.
[15] Nir Levine, Koby Crammer, and Shie Mannor. Rotting bandits. CoRR, abs/1702.07274, 2017.
[16] Yifei Ma, Tzu-Kuo Huang, and Jeff G. Schneider. Active search and bandits on graphs using sigma-optimality. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, UAI 2015, pages 542–551, 2015.
[17] Hieu T. Nguyen and Arnold Smeulders. Active learning using pre-clustering. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, pages 79–, New York, NY, USA, 2004. ACM.
[18] Joseph John Pfeiffer, III, Jennifer Neville, and Paul N. Bennett. Active exploration in networks: Using probabilistic relationships for learning and inference. In Proceedings of CIKM’2014, pages 639–648, 2014.
[19] David van Dijk, Zhaochun Ren, Evangelos Kanoulas, and Maarten de Rijke. The University of Amsterdam (ILPS) at TREC 2015 total recall track. In Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015, 2015.
[20] Xuezhi Wang, Roman Garnett, and Jeff G. Schneider. Active search on graphs. In Proceedings of KDD 2013, pages 731–738, 2013.
[21] Zuobing Xu, Ram Akella, and Yi Zhang. Incorporating diversity and density in active learning for relevance feedback. In Advances in Information Retrieval, 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2-5, 2007, Proceedings, 2007.
[22] Haotian Zhang, Wu Lin, Yipeng Wang, Charles L. A. Clarke, and Mark D. Smucker. Waterlooclarke: TREC 2015 total recall track. In Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, 2015.