1 Introduction
We consider online classification of streaming instances within the framework of statistical learning theory. Let be a sequence of instances drawn independently at random from an unknown underlying distribution over an instance space . Each instance has a hidden binary label that relates probabilistically to the instance according to an unknown conditional distribution . At each time , the decision maker decides whether to query the label of the current instance . If yes, is revealed. Otherwise, the decision maker produces a label and incurs a classification error if . The objective is to minimize the expected number of queries over a horizon of length while constraining the total number of classification errors. The tension between label complexity and classification error rate needs to be carefully balanced through a sequential strategy governing the query and labeling decisions at each time.
The above problem arises in applications such as spam detection, misinformation detection on social media, and event detection in real-time surveillance. The key characteristics of these applications are the high-volume streaming of instances and the nuanced definition of labels (e.g., misinformation and misleading content are complex concepts to define). While the latter necessitates human intervention to provide annotations for selected instances, such human annotations are time-consuming and expensive to obtain and should be sought sparingly to ensure scalability. See the compelling case made in (Strickland, 2018) for detecting false content.
1.1 Previous Work on Active Learning
The above problem falls under the general framework of active learning. In contrast to passive learning, where labeled examples are given a priori or drawn at random, active learning asserts control over which labeled examples to learn from by actively querying the labels of carefully selected instances. The hope is that by learning from the most informative examples, the same level of classification accuracy can be achieved with far fewer labels than in passive learning.
Offline Active Learning:
Active learning has been studied extensively under the Probably Approximately Correct (PAC) model, where the objective is to output a near-optimal classifier with high probability using as few labels as possible. The PAC model pertains to offline learning since the decision maker does not need to self-label any instances during the learning process. An equivalent view is that classification errors that might have been incurred during the learning process are inconsequential, and the tension between label complexity and classification errors is absent. If measured purely by label complexity, the decision maker has the luxury of skipping, at no cost, as many instances as needed to wait for the most informative instance to emerge.

A much celebrated active learning algorithm was given by Cohn, Atlas, and Ladner (Cohn et al., 1994). Named after its inventors, the CAL algorithm is applicable to a general hypothesis space. It, however, relies on the strong assumption of realizability (a.k.a. separability), that is, the existence of an error-free classifier in the hypothesis space. In this realizable case, hypotheses inconsistent with a single label can be safely eliminated from further consideration. Based on this key fact, CAL operates by maintaining two sets at each time: the version space consisting of all surviving hypotheses (i.e., those that are consistent with all past labels), and the region of disagreement, the subset of the instance space on which hypotheses in the current version space disagree regarding the label. CAL queries a label if and only if the instance falls inside the current region of disagreement. Each queried label reduces the version space, which in turn may shrink the region of disagreement, and the algorithm iterates indefinitely. Note that instances outside the region of disagreement are classified with the same label by all the hypotheses in the current version space. It is thus easy to see that CAL represents a rather conservative approach: it only disregards instances whose labels can already be inferred from past labels.
Quite surprisingly, by merely avoiding querying labels that carry no additional information, an exponential reduction in label complexity can be achieved in a broad class of problems (see, for example, the survey by Dasgupta (Dasgupta, 2011) and the monograph by Hanneke (Hanneke et al., 2014)).
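As an illustration of the CAL query rule described above, the following is a minimal sketch for one-dimensional threshold classifiers in the realizable case. The hypothesis class, the label convention in {0, 1}, the function name `cal_thresholds`, and the oracle are all assumptions made for this example; they are not taken from (Cohn et al., 1994). For thresholds, the version space is an interval of surviving thresholds and the region of disagreement is its interior.

```python
import random

def cal_thresholds(stream, label_oracle):
    """CAL for thresholds h_z(x) = 1 if x >= z else 0 (realizable case).

    The version space is the set of thresholds z in (lo, hi] consistent with
    all queried labels; the region of disagreement is the open interval (lo, hi).
    """
    lo, hi = 0.0, 1.0
    n_queries = 0
    predictions = []
    for x in stream:
        if lo < x < hi:
            # Surviving hypotheses disagree on x: query the label.
            y = label_oracle(x)
            n_queries += 1
            if y == 1:
                hi = x   # label 1 implies z <= x
            else:
                lo = x   # label 0 implies z > x
        else:
            # All surviving hypotheses agree: the label can be inferred.
            y = 1 if x >= hi else 0
        predictions.append(y)
    return n_queries, predictions, (lo, hi)

# Hypothetical usage: uniform instances on [0, 1) and a true threshold of 0.3.
rng = random.Random(0)
true_z = 0.3
stream = [rng.random() for _ in range(2000)]
n_queries, predictions, interval = cal_thresholds(
    stream, lambda x: 1 if x >= true_z else 0
)
print(n_queries, interval)
```

In this toy setting every inferred label is correct, and the number of queries grows only slowly with the stream length because the region of disagreement shrinks with each queried label.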
The CAL algorithm was extended to the agnostic setting by Balcan, Beygelzimer, and Langford (Balcan et al., 2009). In the agnostic setting, labels are probabilistic, and even the best classifier in the hypothesis space experiences a nonzero error rate. The main challenge in extending CAL to the agnostic case is the update of the version space: a single inconsistent label can no longer disqualify a hypothesis, and the algorithm needs to balance the desire to quickly shrink the version space against the irreversible risk of eliminating the best classifier. Referred to as A² (agnostic active), the algorithm developed by Balcan, Beygelzimer, and Langford explicitly maintains a neighborhood of the best classifier in the version space by examining the empirical errors of each hypothesis. In a follow-up work, Dasgupta, Hsu, and Monteleoni (Dasgupta et al., 2008) introduced an improved algorithm, referred to as DHM after the authors, that simplifies the maintenance of the region of disagreement through a reduction to supervised learning.
(Beygelzimer et al., 2010, 2011) developed approaches that avoid maintaining the version space by determining directly whether the current instance lies in the region of disagreement.

The above conservative approach, which originated from the CAL algorithm, is referred to as the disagreement-based approach. Its design methodology focuses on avoiding querying labels that provide little or no additional information. More aggressive approaches that actively seek out more informative labels to query have been considered in the literature. One such approach is the so-called margin-based approach. It is specialized for learning homogeneous (i.e., through the origin) linear separators of instances on the unit Euclidean sphere and adopts a specific noise model that assumes linearity in terms of the inner product with the Bayes optimal classifier. In this case, the informativeness of a potential label can be measured by how close the instance is to the current decision boundary. Representative work on the margin-based approach includes (Balcan et al., 2007; Dasgupta et al., 2005). Another approach is Query-by-Committee (QBC) (Seung et al., 1992), where the query decision for an instance is determined by whether a committee (e.g., hypotheses drawn at random from the current version space) agrees on its label. The work in (Freund et al., 1997) analyzed the QBC algorithm by focusing on a hypothesis space of homogeneous linear separators and instances drawn from the unit Euclidean sphere with a known uniform distribution.
Besides the stream-based model where instances arrive one at a time, active learning has also been considered under the membership query synthesis and the pool-based sampling models (Settles, 2012). In the former, the learner may request label membership for any unlabeled data instance in the input space (Angluin, 1988, 2001; Cohn et al., 1996). In the latter, a large pool of unlabeled data is assumed available in one shot prior to the query process (Lewis & Gale, 1994; Settles & Craven, 2008; Tong & Chang, 2001). These models are less relevant to the online setting considered in this work.
Online Active Learning:
Active learning in the online setting has received much less attention. The work of (Cesa-Bianchi et al., 2003) and (Cavallanti et al., 2009) extended the margin-based approach to the online setting, focusing, as in the offline case, on homogeneous linear separators of instances on the unit Euclidean sphere. A specific noise model was adopted, which assumes that the underlying conditional distribution of the labels is fully determined by the Bayes optimal classifier. We recently found the work by (Yang, 2011) that extended the offline DHM algorithm to a stream-based setting with a drifting distribution. In this setting, the decision maker first predicts the label of every instance and then decides whether to ask for the true label. In the discussion section, it was remarked that the algorithm can also be applied to the query-before-prediction setting considered in this work. The algorithm proposed by (Yang, 2011) has a different epoch structure and threshold design from the algorithm developed here. Its label complexity is higher than that of the algorithm proposed here under the Massart noise condition.

Online active learning has also been studied in the adversarial setting, where instances are generated by an adversary and the algorithms and analysis focus on the worst-case scenario (see, for example, (Cesa-Bianchi et al., 2006; Dekel et al., 2012; Cesa-Bianchi et al., 2009)). The resulting problem is fundamentally different from the statistical learning framework considered in this work.
1.2 Main Results
We study online active learning under a general instance space and a general hypothesis space of VC dimension under the agnostic model. We develop an OnLine Active (OLA) learning algorithm and establish its label complexity and near-optimal classification accuracy with respect to the best classifier in the hypothesis space under the Massart bounded noise condition. More specifically, the total expected number of classification errors in excess of over a horizon of length is bounded below independent of , demonstrating that the proposed algorithm offers practically the same level of classification accuracy as with a label complexity that is polylogarithmic in and linear in .
Rooted in the design principle of the disagreement-based approach, this work draws inspiration from the A² algorithm (Balcan et al., 2009) and the DHM algorithm (Dasgupta et al., 2008) developed under the offline setting. While these offline algorithms can be directly applied to the online setting and circumvent a linear order of excessive classification errors (which would occur with a fixed outage probability ) by setting , this work improves upon and departs from such a direct extension in both algorithm design and performance analysis.
In terms of algorithm design, the proposed OLA algorithm incorporates both structural changes and a tight design of parameters, which, while subtle, lead to a significant reduction in label complexity. Specifically, OLA operates in stages under an epoch structure, where an epoch ends when a fixed number of labels have been queried. This structure is different from those of A² and DHM. In particular, the epochs in A² are determined by the time instants when the size of the current region of disagreement shrinks by half due to newly obtained labels. Such an epoch structure, however, requires knowledge of the marginal distribution of the instances for evaluating the size of the region of disagreement. The epoch structure of OLA obviates the need for this prior knowledge. The more deterministic epoch structure of OLA also results in a more tractable relation between the number of queried labels and the size of the current region of disagreement, which facilitates the label complexity analysis. A more subtle improvement in OLA is the design of the threshold on the empirical error rate of a hypothesis for determining whether to eliminate it from the version space. By focusing only on empirical errors incurred over significant examples determined by the current region of disagreement, we obtain a tighter concentration inequality and a more aggressive threshold design, which leads to a significant reduction in label complexity as compared with A² and DHM.
A more fundamental departure from the existing results under the offline setting is in the analysis of the label complexity. Under the offline PAC setting, the label complexity of an algorithm is analyzed in terms of the suboptimality gap and the outage probability . Under the online setting, however, the label complexity of an algorithm is measured in terms of the horizon length , which counts both labeled and unlabeled instances. In the analysis of the label complexity of A² (Balcan et al., 2009; Hanneke, 2007), unlabeled instances are assumed to be cost free, and bounds on the number of unlabeled instances skipped by the algorithm prior to termination are missing and likely intractable. Without a bound on the unlabeled data usage, the offline label complexity cannot be translated to its online counterpart. Dasgupta et al. (Dasgupta et al., 2008) provided a bound on the unlabeled data usage in DHM. The bound, however, appears to be loose and translates to a linear label complexity in the online setting.
In this work, we adopt new techniques in analyzing the online label complexity of OLA. The key idea is to construct a submartingale given by the difference of an exponential function of the total queried labels up to and a linear function of . The optional stopping theorem for submartingales then leads to an upper bound on the exponential function of the label complexity. A bound on the label complexity thus follows from Jensen’s inequality.
We have chosen the disagreement-based design methodology in designing the online active learning algorithm. While approaches that more aggressively seek out informative labels may have an advantage in the offline setting, where the learner can skip unlabeled instances at no cost and with no undesired consequences, such approaches may be less suitable in the online setting. The reason is that in the online setting, self-labeling is required in the event of no query, classification errors need to be strictly constrained, and no feedback on the predicted labels is available (thus learning has to rely solely on queried labels). These new challenges in the online learning setting are perhaps better addressed by the more conservative disagreement-based design principle that skips instances more cautiously. While this assessment has been confirmed in simulation examples (see Sec. 4), a full quantitative understanding of the problem requires substantial effort beyond this work. We discuss several open problems in the conclusion section.
2 Problem Formulation
2.1 Instances and Hypotheses
Let be a sequence of instances drawn independently at random from an unknown distribution over a general instance space . Each instance has a hidden binary label , relating probabilistically to the instance according to an unknown conditional distribution . Let denote the unknown joint distribution of , where denotes the label space.

Let be a set of measurable functions mapping from to . We refer to as the hypothesis space and assume that it has a finite VC dimension . Each element is a hypothesis (i.e., a classifier).
Let be the Bayes optimal classifier under . In other words, for all , is the label that minimizes the probability of classification error:
(1) 
where is the indicator function. Let
(2) 
It is easy to see that
(3) 
We assume that . It is thus the best classifier in , offering the minimum error rate for a randomly generated example under the joint distribution :
(4) 
2.2 Noise Conditions
The function given in (2) is an indicator of the noise level in the labels governed by . A special case, referred to as the realizable/separable case, is when the labels are deterministic with probability , and ( as well) assumes only values of and . In this case, the Bayes optimal classifier is error-free.
In a general agnostic case with an arbitrary , even has a positive error rate. A particular case, referred to as the Massart bounded noise condition (Massart et al., 2006), is when has a positive gap between positive examples (i.e., those with ) and negative examples (those with ). Specifically, there exists such that for all .
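To make the bounded noise condition concrete, the following is a minimal sketch of a label model that satisfies it. The sketch assumes the common form of the Massart condition, in which the conditional probability of a positive label stays at least a fixed margin away from 1/2 at every instance. The names `eta`, `GAMMA`, `Z_STAR`, the label set {0, 1}, and the one-dimensional instance space are assumptions introduced for this example.

```python
import random

GAMMA = 0.2    # hypothetical noise margin: |eta(x) - 1/2| >= GAMMA for all x
Z_STAR = 0.5   # hypothetical decision boundary of the Bayes optimal classifier

def eta(x):
    """Assumed conditional probability P(Y = 1 | X = x)."""
    return 0.5 + GAMMA if x >= Z_STAR else 0.5 - GAMMA

def bayes_label(x):
    """Bayes optimal prediction: the more likely label at x."""
    return 1 if eta(x) >= 0.5 else 0

def sample_label(x, rng):
    """Draw a noisy label according to eta(x)."""
    return 1 if rng.random() < eta(x) else 0

# Even the Bayes optimal classifier errs at a rate close to 1/2 - GAMMA here.
rng = random.Random(0)
xs = [rng.random() for _ in range(20000)]
bayes_error = sum(sample_label(x, rng) != bayes_label(x) for x in xs) / len(xs)
print(round(bayes_error, 3))
```

The margin keeps every instance informative: a label queried at any point is correct with probability at least 1/2 plus the margin, which is what allows suboptimal hypotheses to be eliminated from a bounded number of labels.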
2.3 Actions and Performance Measures
At each time , the learner decides whether to query the label of the current instance . If yes, is revealed. Otherwise, the learner predicts the label. A learning strategy consists of a sequence of query rules and a sequence of labeling rules , where and map from causally available information consisting of past actions, instances, and queried labels to, respectively, the query decision of (no query) or (query) and a predicted label at time . With a slight abuse of notation, we also let and denote the resulting query decision and the predicted label at time under these respective rules.
The performance of policy over a horizon of length is measured by the expected number of queries and the expected number of classification errors in excess of that of the Bayes optimal classifier . Specifically, we define the label complexity and the regret as follows.
(5)  
(6) 
where denotes the expectation with respect to the stochastic process induced by . The subscript of will be omitted from the notations when there is no ambiguity in terms of which strategy is under discussion.
The objective is a learning algorithm that minimizes the label complexity while ensuring a bounded regret .
3 Main Results
In this section, we present the proposed OLA algorithm and establish its label complexity and regret performance.
3.1 The OLA Algorithm
The OLA algorithm operates under an epoch structure. When a fixed number of labels have been queried in the current epoch, this epoch ends and the next one starts.
The algorithm maintains two sets: the version space and the region of disagreement , where is the epoch index. is defined as follows:
(7) 
where the operator for an arbitrary hypothesis set is defined as
(8) 
The initial version space is set to the entire hypothesis space , and the initial region of disagreement is the instance space . At the end of epoch , these two sets are updated using the set of examples queried in this epoch .
At each time of epoch , the query and prediction decisions are as follows. If , the algorithm queries the label. Otherwise, the algorithm predicts the label of using an arbitrary hypothesis in . At the end of the epoch, and are updated as follows to obtain and . For a hypothesis in the current version space , define its empirical error over as
(9) 
Let be the best hypothesis in in terms of empirical error over , i.e.,
(10) 
The version space is then updated by eliminating each hypothesis whose empirical error over exceeds that of by a threshold that is specific to , , and . Specifically,
(11) 
The new region of disagreement is then determined by :
(12) 
A detailed description of the algorithm is given in Algorithm 1.
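The epoch structure and update rules described above can be sketched as follows for a finite grid of threshold classifiers. This is only an illustrative sketch under stated assumptions: the finite hypothesis grid, the label set {0, 1}, the normalization of the empirical error by the epoch length, and in particular the constant elimination threshold `elim_threshold` are stand-ins chosen for this example. The constant does not reproduce the hypothesis-dependent threshold of (16).

```python
import math
import random

def predict(z, x):
    """Threshold hypothesis h_z(x) = 1 if x >= z else 0."""
    return 1 if x >= z else 0

def ola_sketch(stream, label_oracle, hypotheses, labels_per_epoch, elim_threshold):
    version_space = list(hypotheses)   # initial version space: all hypotheses
    epoch_examples = []                # examples queried in the current epoch
    n_queries = 0
    predictions = []                   # None marks an instance that was queried
    for x in stream:
        votes = {predict(z, x) for z in version_space}
        if len(votes) > 1:
            # x lies in the region of disagreement: query its label.
            epoch_examples.append((x, label_oracle(x)))
            n_queries += 1
            predictions.append(None)
            if len(epoch_examples) == labels_per_epoch:
                # Epoch ends: compare empirical errors on this epoch's examples.
                err = {
                    z: sum(predict(z, xi) != yi for xi, yi in epoch_examples)
                    / labels_per_epoch
                    for z in version_space
                }
                best = min(err.values())
                version_space = [
                    z for z in version_space if err[z] - best <= elim_threshold
                ]
                epoch_examples = []
        else:
            # All surviving hypotheses agree: predict with any of them.
            predictions.append(predict(version_space[0], x))
    return n_queries, predictions, version_space

# Hypothetical usage: uniform instances and labels flipped with probability 0.1.
rng = random.Random(1)
T, M, z_star = 20000, 400, 0.5
stream = [rng.random() for _ in range(T)]
oracle = lambda x: predict(z_star, x) if rng.random() < 0.9 else 1 - predict(z_star, x)
grid = [i / 20 for i in range(21)]
stand_in = math.sqrt(2 * math.log(T) / M)   # stand-in threshold, not (16)
n_queries, predictions, survivors = ola_sketch(stream, oracle, grid, M, stand_in)
print(n_queries, survivors)
```

Because the region of disagreement is recomputed only at epoch boundaries, the query rule within an epoch is fixed, which is the property the analysis exploits when treating the queried examples of an epoch as an i.i.d. sample.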
The algorithm parameter is set to , where is a positive integer whose value will be discussed in Sec. 3.4. We point out that while the horizon length is used as an input parameter to the algorithm, the standard doubling trick can be applied when is unknown.
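For reference, the standard doubling trick mentioned above restarts a horizon-dependent algorithm on consecutive blocks whose assumed horizons double. The sketch below only generates the block schedule; the generator name `doubling_schedule` is hypothetical, and how OLA would be restarted within each block is left to the caller.

```python
def doubling_schedule(total_rounds):
    """Yield (start, horizon_guess, block_length) for blocks of guessed
    horizons 1, 2, 4, ... until total_rounds instances are covered."""
    start, guess = 0, 1
    while start < total_rounds:
        # The last block is truncated when the stream ends early.
        length = min(guess, total_rounds - start)
        yield start, guess, length
        start += length
        guess *= 2

# Hypothetical usage: a stream that turns out to contain 10 instances.
schedule = list(doubling_schedule(10))
print(schedule)
```

In each block, the algorithm would be run afresh with the guessed horizon as its input parameter, so the true horizon never needs to be known in advance.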
3.2 Threshold Design
We now discuss the key issue of designing the threshold for eliminating suboptimal hypotheses. This threshold function controls the tradeoff between two conflicting objectives: quickly shrinking the region of disagreement (thus reducing label complexity) and managing the irreversible risk of eliminating good classifiers (thus increasing future classification errors). A good choice of this threshold function hinges on how tightly we can bound the difference between the empirical error rate and the ensemble error rate under the (unknown) distribution .
We define the error rate of hypothesis under distribution as
(13) 
which is the probability that misclassifies a random instance.
For a pair of hypotheses , define
(14) 
which is the probability that the first hypothesis misclassifies a random instance while the second classifies it correctly. For a finite set of samples, the empirical excess error of over on is defined as
(15) 
The threshold in OLA is defined as:
(16)  
where . Here is the th shattering coefficient of an arbitrary hypothesis space . By Sauer’s lemma (Bousquet et al., 2004), with being the VC dimension of .
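As a numerical illustration of Sauer's lemma, the sketch below evaluates the classical binomial-sum bound on the shattering coefficient together with its well-known closed-form relaxation. Which of these forms is used in the threshold above is not recoverable from the text, so both are shown; the function names are hypothetical.

```python
from math import comb, e

def sauer_bound(n, d):
    """Sauer's lemma: a class of VC dimension d induces at most
    sum_{i=0}^{d} C(n, i) distinct labelings on any n points."""
    return sum(comb(n, i) for i in range(d + 1))

def sauer_closed_form(n, d):
    """Relaxation (e * n / d) ** d of the binomial sum, valid for n >= d >= 1."""
    return (e * n / d) ** d

# Thresholds on the line have VC dimension 1 and realize n + 1 labelings.
print(sauer_bound(10, 1), sauer_bound(10, 2), sauer_closed_form(10, 2))
```

The polynomial growth in the sample size is what keeps the logarithm of the shattering coefficient, and hence the elimination threshold, small relative to the number of labels per epoch.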
The choice of this specific threshold function will become clear in Sec. 3.3 when the relationship between the empirical error difference of two hypotheses and the ensemble error rate difference under is analyzed.
3.3 Regret
Here we prove that OLA achieves bounded regret.
First we introduce the following normalized uniform convergence VC bound (Vapnik & Chervonenkis, 2015), which is used in our analysis.
Lemma 1.
(Vapnik & Chervonenkis, 2015) Let be a family of measurable functions . Let be a fixed distribution over . Define
(17) 
For a finite set , define
(18) 
as the empirical average of over . If is an i.i.d. sample of size from , then, with probability at least , for all :
(19) 
where .
Based on Lemma 1, we develop the following concentration inequality in Lemma 2 to show the relationship between the empirical error difference of two hypotheses and the ensemble error rate difference.
Lemma 2.
Let be a set of i.i.d. samples under distribution . For all , we have, with probability at least ,
(20)  
where .
Proof.
Define
(21) 
which is a mapping from to . It is not hard to see that and .
Then, we apply the normalized VC bound to a family of measurable functions defined in (21), which gives us
(22) 
and
(23) 
∎
Since all samples in are queried at epoch in the proposed OLA algorithm, we can see that is an i.i.d. sample of size from distribution , which is defined as
(25) 
where for .
Therefore, we can apply Lemma 2 to each epoch with and , which gives us the following corollary.
Corollary 1.
Let . With probability at least , for all and for all , we have
(26)  
Proof.
Apply Lemma 2 with , , , and for each . Since and , those bounds hold simultaneously for all with probability at least . ∎
Here we can see that the design of the threshold corresponds to the right hand side of (26). The fact that (26) holds simultaneously for all ensures that the Bayes optimal classifier is always in the version space. Next, we show that the expected regret of the proposed OLA algorithm is bounded.
Theorem 1.
The expected regret of the OLA algorithm is bounded as follows:
Proof.
First we show that if the inequalities in Corollary 1 hold simultaneously for all (which occurs with probability ), then for all . This can be proved by induction. First, clearly . Assume ; applying the inequality in Corollary 1 with , we have
(27) 
Note that . Therefore,
(28)  
which indicates that by the querying rule of the online active learning algorithm. Hence, we have . By the labeling rule of the online active learning algorithm, implies . Therefore, .
Hence,
(29) 
as desired. ∎
3.4 Label Complexity
For the purpose of label complexity analysis, we define the following online disagreement coefficient, which is slightly different from the disagreement coefficient defined for offline active learning in (Hanneke, 2007). In the offline PAC setting, the disagreement coefficient is a function of . It has been shown in (Hanneke, 2007) that when consists of homogeneous linear separators and is uniform, is upper bounded by a constant.
The disagreement metric on is defined by . The online disagreement coefficient is defined as
(30) 
where is a "hypothesis ball" centered at with radius .
The quantity bounds the rate at which the disagreement mass of the ball grows as a function of the radius . It is bounded by when consists of -dimensional homogeneous separators (Hanneke, 2007).
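As a worked example of how the disagreement mass of a hypothesis ball scales with its radius, consider threshold classifiers on the unit interval with uniformly distributed instances. The sketch below uses the standard ratio of disagreement mass to radius from the offline literature; the online variant in (30) is not reproduced here, and the function names and the restriction to thresholds are assumptions. The distance between two thresholds is taken to be the probability that they disagree, which under the uniform distribution is the absolute difference of the thresholds.

```python
def disagreement_mass(z_star, r):
    """Uniform mass of the region where thresholds within distance r of
    z_star disagree, i.e. the interval [z_star - r, z_star + r) clipped to [0, 1]."""
    lo = max(0.0, z_star - r)
    hi = min(1.0, z_star + r)
    return hi - lo

def disagreement_ratio(z_star, radii):
    """Largest ratio of disagreement mass to radius over the given radii."""
    return max(disagreement_mass(z_star, r) / r for r in radii)

# For an interior threshold the ball of radius r disagrees on mass 2r.
print(disagreement_ratio(0.5, [0.01, 0.05, 0.1, 0.25]))
```

In this example the ratio equals 2 for an interior threshold regardless of the radius, so shrinking the version space shrinks the query region at the same rate.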
Next we upper bound the label complexity for the proposed online active learning algorithm.
Theorem 2.
Let be the expected label complexity of OLA. If , then,
(31) 
where is the disagreement coefficient.
Proof.
Let
(32) 
By the definition of we can show that
(33) 
Next we show that
(34) 
Let
(35) 
If , then
(36) 
By Corollary 1, for all , we have
(37)  
with probability at least . Therefore for all we have
(38)  
with probability . This indicates that for all , . By the definition of , we have with probability at least . Therefore as desired. Furthermore, we have
(39) 
Define . Next we show that is a submartingale. Let be the epoch index at time . Since , we have
(40)  
Therefore,
(41)  
as desired. Then, by the optional stopping theorem,
(42)  
Hence,
(43) 
Since is concave, by Jensen’s Inequality,
(44) 
Since , and , we have
(45) 
as desired.
∎
4 Simulation Examples
We first compare the label complexity of OLA with that of two offline disagreement-based active learning algorithms: A² and DHM.
In Figure 1, we consider a one-dimensional instance space and threshold classifiers with where . Note that the VC dimension . In Figure 2, we consider the same instance space but a hypothesis space consisting of all intervals . Note that in this case, the VC dimension . In Figure 3, we consider a two-dimensional instance space and a hypothesis space that includes all the boxes . The VC dimension . In all cases, we set to be uniform and
(46) 
The significant reduction in label complexity offered by OLA is evident from Figures 1-3. The simulated classification errors are near zero for all three algorithms.
Next we compare OLA with the online margin-based algorithm CBCG proposed in (Cesa-Bianchi et al., 2003) (and analyzed in (Cavallanti et al., 2009)). It is specialized in learning homogeneous separators where is the surface of the unit Euclidean sphere in under a specific noise model: there exists a fixed and unknown vector with Euclidean norm 1 such that . Then, the Bayes optimal classifier . Shown in Figures 4 and 5 are the label complexity and classification error comparisons under this specific noise model with , and uniform . They show that even under this special setting, OLA offers a considerable reduction in label complexity and a drastic improvement in classification accuracy. This is consistent with the assessment discussed in Sec. 1 that the more conservative disagreement-based approach is more suitable in the online setting than the more aggressive margin-based approach.

5 Conclusion and Discussion
Online active learning has received considerably less attention than its offline counterpart. Many real-time stream-based applications, however, necessitate a better understanding of this problem. This work is still limited, especially in light of the work by (Yang, 2011) that we recently discovered. A number of problems remain open. We discuss below a representative few.
We have adopted a rather limited Massart bounded noise condition. An immediate next step is to consider the Tsybakov low noise condition (Tsybakov et al., 2004), of which the Massart bounded noise condition is a special case. The algorithm developed by (Yang, 2011) was analyzed under the Tsybakov noise condition, which provides a closely related reference point. A thorough comparison on equal footing is needed.
Lower bounds on the label complexity in the online setting remain open. Inspirations may be drawn from the results on the lower bounds in the offline PAC setting, for example, (Kulkarni et al., 1993).
We have not considered the implementation complexity of the proposed algorithm. The computational complexity of maintaining the version space and the region of uncertainty for a general hypothesis space is an issue inherent to the disagreement-based approach. In the offline setting, several techniques have been employed to reduce the implementation complexity. In particular, reductions to supervised learning (Dasgupta et al., 2008) and neural network approximation (Cohn et al., 1994) are considered to simplify the maintenance of the region of uncertainty. These techniques can be borrowed for the online setting.

References
 Angluin (1988) Angluin, D. Queries and concept learning. Machine learning, 2(4):319–342, 1988.
 Angluin (2001) Angluin, D. Queries revisited. In International Conference on Algorithmic Learning Theory, pp. 12–31. Springer, 2001.

 Balcan et al. (2007) Balcan, M.-F., Broder, A., and Zhang, T. Margin based active learning. In International Conference on Computational Learning Theory, pp. 35–50. Springer, 2007.
 Balcan et al. (2009) Balcan, M.-F., Beygelzimer, A., and Langford, J. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
 Beygelzimer et al. (2010) Beygelzimer, A., Hsu, D. J., Langford, J., and Zhang, T. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems, pp. 199–207, 2010.
 Beygelzimer et al. (2011) Beygelzimer, A., Hsu, D., Karampatziakis, N., Langford, J., and Zhang, T. Efficient active learning. In ICML 2011 Workshop on Online Trading of Exploration and Exploitation, 2011.
 Bousquet et al. (2004) Bousquet, O., Boucheron, S., and Lugosi, G. Introduction to statistical learning theory. In Advanced lectures on machine learning, pp. 169–207. Springer, 2004.
 Cavallanti et al. (2009) Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. Linear classification and selective sampling under low noise conditions. In Advances in Neural Information Processing Systems, pp. 249–256, 2009.
 Cesa-Bianchi et al. (2003) Cesa-Bianchi, N., Conconi, A., and Gentile, C. Learning probabilistic linear-threshold classifiers via selective sampling. In Learning Theory and Kernel Machines, pp. 373–387. Springer, 2003.
 Cesa-Bianchi et al. (2006) Cesa-Bianchi, N., Gentile, C., and Zaniboni, L. Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7(Jul):1205–1230, 2006.
 Cesa-Bianchi et al. (2009) Cesa-Bianchi, N., Gentile, C., and Orabona, F. Robust bounds for classification via selective sampling. In Proceedings of the 26th annual international conference on machine learning, pp. 121–128. ACM, 2009.
 Cohn et al. (1994) Cohn, D., Atlas, L., and Ladner, R. Improving generalization with active learning. Machine learning, 15(2):201–221, 1994.

 Cohn et al. (1996) Cohn, D. A., Ghahramani, Z., and Jordan, M. I. Active learning with statistical models. Journal of artificial intelligence research, 4:129–145, 1996.
 Dasgupta (2011) Dasgupta, S. Two faces of active learning. Theoretical computer science, 412(19):1767–1781, 2011.

 Dasgupta et al. (2005) Dasgupta, S., Kalai, A. T., and Monteleoni, C. Analysis of perceptron-based active learning. In International Conference on Computational Learning Theory, pp. 249–263. Springer, 2005.
 Dasgupta et al. (2008) Dasgupta, S., Hsu, D. J., and Monteleoni, C. A general agnostic active learning algorithm. In Advances in neural information processing systems, pp. 353–360, 2008.
 Dekel et al. (2012) Dekel, O., Gentile, C., and Sridharan, K. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13(Sep):2655–2697, 2012.
 Freund et al. (1997) Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. Selective sampling using the query by committee algorithm. Machine learning, 28(2-3):133–168, 1997.
 Hanneke (2007) Hanneke, S. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th international conference on Machine learning, pp. 353–360. ACM, 2007.
 Hanneke et al. (2014) Hanneke, S. et al. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
 Kulkarni et al. (1993) Kulkarni, S. R., Mitter, S. K., and Tsitsiklis, J. N. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.
 Lewis & Gale (1994) Lewis, D. D. and Gale, W. A. A sequential algorithm for training text classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 3–12. Springer-Verlag New York, Inc., 1994.
 Massart et al. (2006) Massart, P., Nédélec, É., et al. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
 Settles (2012) Settles, B. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.

 Settles & Craven (2008) Settles, B. and Craven, M. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the conference on empirical methods in natural language processing, pp. 1070–1079. Association for Computational Linguistics, 2008.
 Seung et al. (1992) Seung, H. S., Opper, M., and Sompolinsky, H. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, pp. 287–294. ACM, 1992.
 Strickland (2018) Strickland, E. AI-human partnerships tackle "fake news": Machine learning can get you only so far - then human judgment is required. IEEE Spectrum, 55(9):12–13, 2018.
 Tong & Chang (2001) Tong, S. and Chang, E. Support vector machine active learning for image retrieval. In Proceedings of the ninth ACM international conference on Multimedia, pp. 107–118. ACM, 2001.
 Tsybakov et al. (2004) Tsybakov, A. B. et al. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
 Vapnik & Chervonenkis (2015) Vapnik, V. N. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pp. 11–30. Springer, 2015.
 Yang (2011) Yang, L. Active learning with a drifting distribution. In Advances in Neural Information Processing Systems, pp. 2079–2087, 2011.