I Introduction
The key barrier to scaling or applying supervised learning in practice is often the cost of obtaining sufficient annotation. Active Learning (AL) aims to address this by designing query algorithms that effectively predict which points will be useful to annotate, thus enabling efficient allocation of human annotation effort. There are many different AL algorithms, each designed around an appealing – yet completely different – motivation for what constitutes a good question to ask. For example, uncertainty or margin-based sampling [1, 2] queries the most uncertain or ambiguous point, i.e., the point closest to the decision boundary. Expected error reduction [3, 4] queries points that the current model predicts will most reduce its future error. Another typical approach is to label the most representative samples [5, 6, 7] to ensure the major clusters within the dataset are correctly estimated. Besides these approaches, query-by-committee active learning queries points based on the disagreement between a committee of classifiers [8, 9, 10]. More recent studies investigate hybrid criteria that balance multiple motivations [11, 12, 13]. These are all good ideas, yet there are situations where each is ineffective. For example, if the classes overlap heavily in a region of feature space, uncertainty sampling will be tied up querying points in an impossible-to-solve region. If the current model is very poor, expected error reduction cannot accurately estimate its own future error. If the main data clusters are already well classified, representativeness-focused approaches may not fine-tune the boundary between them.
These thought experiments are reflected empirically. The best algorithm for pool-based AL in practice varies both across datasets and with the progress of learning within a given dataset [14, 15]. This observation has motivated research into learning dataset- and time-specific weightings for an ensemble of AL algorithms. [16, 17] developed heuristics for switching between AL algorithms that are typically good at early versus late stage learning. In contrast, [14, 15] developed methods for rapid online learning of a dataset-specific weighting for the algorithms within an AL ensemble. The key insight of the Combination of Active Learning Online (COMB) [14] and Active Learning by Learning (ALBL) [15] algorithms is to formalise the query criterion selection task as a multi-armed bandit (MAB) problem. MAB problems have been well studied and many powerful algorithms with optimality guarantees exist. For example, if each query criterion in the ensemble is considered to be a bandit arm, and the learning improvement after executing a criterion is considered to be the arm's reward, then MAB algorithms such as EXP3 (Exponential-weight algorithm for Exploration and Exploitation) [18] can be applied to quickly learn the efficacy of the arms (AL criteria), with a guarantee of achieving near-optimal overall reward (learning improvement). A variant of this is to consider data points to be arms, and AL criteria to be experts providing advice about promising arms. Then MAB with expert advice algorithms such as EXP4.P (Exponential-weight algorithm for Exploration and Exploitation using Expert advice, with a high-probability regret bound) [19] optimise exploration and exploitation of experts, and achieve provably near-optimal reward. The fundamental limitation of existing MAB-based approaches to AL is that their underlying MAB algorithms do not take into account the temporal dynamics of active learning: different criteria are effective at different learning stages [16, 17]. This issue is illustrated by Fig. 1(a,c,e), where the most effective criterion varies across the time horizon. On fourclass, density (DE) sampling is slightly better at first and uncertainty sampling (US) is consistently good later on. Similarly, on ILPD and german, representative (RS) and density (DE) sampling are better at the crucial early stage before uncertainty sampling becomes better. A second issue is that the scale of an accuracy-based reward falls dramatically over time (Fig. 1(b,d,f)). Because of this, stationary bandit learners will be unduly biased by the high reward from an initial observation and fail to adapt subsequently. For example, on ILPD a stationary learner may fail to make the switch from DE to US because later rewards in favour of US are small in scale compared to the initial reward in favour of DE.
Therefore there are non-stationary aspects both in reward scale and in reward distribution per arm (MAB perspective) or per expert (MAB with expert advice perspective). Thus the MAB problem is formally non-stationary, violating a fundamental assumption required to guarantee existing MAB algorithms' optimality bounds.
Here we develop a performance-guaranteed stochastic MAB with expert advice algorithm for a non-stationary environment. (We use terminology from [18]; this setting also has other names, including 'contextual bandit' [19, 20], 'partial-label problem' [21], and 'associative bandit problem' [22].) Applying this to AL means that, like [15], if there is a single best (but a priori unknown) AL algorithm for a dataset, we are able to quickly discover it and thus approach the performance of an oracle that knows the best algorithm for each dataset. More importantly, when different algorithms' efficacies vary over time within one dataset, we can adapt to this and approach the performance of an oracle that knows the best AL algorithm at each iteration.
II Background and Related Work
II-A Active Learning
We denote the pool of data with $n$ samples as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where the instances are $x_i$ and the labels are $y_i$. In an active learning scenario, the data are initially divided into a labelled set $\mathcal{L}$ and an unlabelled set $\mathcal{U}$, where $\mathcal{D} = \mathcal{L} \cup \mathcal{U}$. Training an initial classifier $f_0$ on the samples in the initial labelled set $\mathcal{L}$, the algorithm then queries instances from $\mathcal{U}$ over iterations $t = 1, \dots, T$. After the supervision $y_q$ of instance $x_q$ is obtained, $x_q$ is removed from the unlabelled set $\mathcal{U}$ and added to the labelled set $\mathcal{L}$, from which the classifier $f_t$ is retrained.
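This query loop can be sketched minimally as follows (assuming numpy; the toy pool, the nearest-centroid stand-in for the base classifier, and the random query rule are all illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class pool: Gaussian blobs standing in for a real dataset D.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def fit_predict(Xl, yl, Xq):
    # Nearest-class-centroid classifier as a stand-in for f_t.
    cents = np.stack([Xl[yl == c].mean(axis=0) for c in (0, 1)])
    d = ((Xq[:, None, :] - cents[None]) ** 2).sum(-1)
    return d.argmin(axis=1)

labelled = [0, 100]                       # one seed label per class (L)
unlabelled = [i for i in range(200) if i not in labelled]  # U

for t in range(10):                       # T query iterations
    pred = fit_predict(X[labelled], y[labelled], X)         # retrain f_t on L
    q = unlabelled[int(rng.integers(len(unlabelled)))]      # placeholder criterion
    unlabelled.remove(q); labelled.append(q)                # move x_q from U to L

acc = (fit_predict(X[labelled], y[labelled], X) == y).mean()
```

Any of the query criteria discussed in Sec. I would replace the random choice of `q`.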
II-B Bandit Algorithms
Multi-armed Bandit. In multi-armed bandit (MAB) problems, a player pulls a lever from a set of $K$ slot machines over a sequence of $T$ time steps to maximise her payoff. During the game, she only observes the reward of the specific arm pulled at each time step $t$. The aim of the player is to maximise her return, i.e., the sum of the rewards over the sequence of pulls. This requires a trade-off between exploration (collecting information to estimate which arm has the highest return) and exploitation (focusing on the arm with the highest estimated return). Training a bandit learner to solve a MAB problem is then formalised as minimising the regret between the actions chosen by the player's strategy $\pi$ and the best arm.
For example, the EXP3 algorithm [18] minimises, for any finite $T$, the "static regret" between the best arm in retrospect and the player's reward: $\max_{k} \mathbb{E}\big[\sum_{t=1}^{T} r_t^{k}\big] - \mathbb{E}^{\pi}\big[\sum_{t=1}^{T} r_t^{\pi}\big]$.
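To make the exploration/exploitation trade-off concrete, here is a minimal numpy sketch of EXP3 on a toy stationary problem (the arm means, horizon, and mixing rate below are hypothetical values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, gamma = 3, 2000, 0.1
mu = np.array([0.2, 0.5, 0.8])     # stationary Bernoulli arm means (toy)
w = np.ones(K)                     # exponential weights over arms

for t in range(T):
    # Mix weight-proportional play with uniform exploration.
    p = (1 - gamma) * w / w.sum() + gamma / K
    k = rng.choice(K, p=p)
    r = float(rng.random() < mu[k])            # reward of the pulled arm only
    w[k] *= np.exp(gamma * (r / p[k]) / K)     # importance-weighted update

best = int(np.argmax(w))                       # EXP3's estimate of the best arm
```

With a stationary reward distribution, the weight of the highest-mean arm comes to dominate, which is exactly the property that fails once the means drift over time.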
Contextual Multi-armed Bandit. The goal of contextual bandits is to build a relationship between available context information and the reward distribution over all arms. For example, LinUCB [23] makes the linear realizability assumption that there exists an unknown weight vector $\theta^{*}$ such that the expected reward of arm $a$ with context $x_{t,a}$ is $\mathbb{E}[r_{t,a} \mid x_{t,a}] = x_{t,a}^{\top}\theta^{*}$. However, learning to predict the reward for each data point accurately appears to be an even harder problem given the limited information from only expert suggestions (Fig. 1). More importantly, given the changing reward distribution over time, there is no constant relation between context and reward.

Multi-armed Bandit with Expert Advice. Expert information about the likely efficacy of each arm is often available. [18] thus introduced an adversarial MAB with expert advice algorithm, EXP4, that exploits experts giving advice vectors (probabilities over levers) to the learner at each time step. In contrast to MAB without expert advice, the goal is now to identify the best expert rather than the best arm. In this setting the regret to minimise is the difference between the return of the best expert in retrospect and the player's return:
$$R^{\mathrm{static}}(T) \;=\; \max_{m=1,\dots,N} \sum_{t=1}^{T} \mu_t^{m} \;-\; \mathbb{E}^{\pi}\Big[\sum_{t=1}^{T} r_t^{\pi}\Big] \qquad (1)$$

where $\mu_t^{m}$ is the expected reward of expert $m$ at time $t$ and $\mathbb{E}^{\pi}[r_t^{\pi}]$ is the expected reward of our policy.
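An EXP4-style update for this setting can be sketched as follows (a toy numpy illustration under assumed constants, not the paper's experiments): each expert supplies an advice vector over arms, advice is mixed by expert weight, and an importance-weighted reward estimate updates every expert in proportion to how strongly it recommended the pulled arm.

```python
import numpy as np

rng = np.random.default_rng(2)
K, N, T, gamma = 4, 2, 3000, 0.1        # arms, experts, rounds (toy values)
mu = np.array([0.1, 0.2, 0.3, 0.9])     # Bernoulli arm means
# Expert 0 always recommends arm 3 (good); expert 1 recommends arm 0 (bad).
advice = np.array([[0., 0., 0., 1.],
                   [1., 0., 0., 0.]])   # N x K, rows sum to 1
w = np.ones(N)                          # weights over experts, not arms

for t in range(T):
    p = (1 - gamma) * (w @ advice) / w.sum() + gamma / K  # arm probabilities
    k = rng.choice(K, p=p)
    r = float(rng.random() < mu[k])
    xhat = np.zeros(K); xhat[k] = r / p[k]    # importance-weighted arm reward
    yhat = advice @ xhat                       # each expert's estimated gain
    w *= np.exp(gamma * yhat / K)

best_expert = int(np.argmax(w))
```

Note that one observed reward updates all experts that expressed an opinion about the pulled arm, which is the advantage over treating criteria as arms discussed in Sec. II-C.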
II-C Bandits for Active Learning
For active learning using a MAB with expert advice algorithm, the experts correspond to our ensemble of active learning criteria and the arms are the available points in the pool. Each expert (criterion) provides a probability vector encoding its preference over arms (instances). Active learners based on MAB with expert advice aim to learn the best criterion for a specific dataset. In COMB [14], the authors propose to use MAB with expert advice in active learning and heuristically design the classification entropy maximization (CEM) score as the reward of the EXP4 bandit algorithm [18]. A more recent paper [15] (ALBL) proposed to replace the CEM reward with an unbiased estimate of test accuracy, Importance Weighted Accuracy (IWA), and used an upgraded bandit algorithm, EXP4.P [19], which improves on the earlier EXP4 method. Similarly, another recent paper [24] applied the linear upper confidence bound contextual bandit algorithm (LinUCB) to train an ensemble and transferred the knowledge to other datasets. All of these algorithms enable the selection of a suitable active learning criterion for a given dataset. Our contribution is also to perform AL in a dataset-specific way by optimally tuning the exploration and exploitation of an ensemble of AL algorithms; but more importantly to do so dynamically, thus allowing the optimal tuning to vary as learning progresses. Unlike [14, 15, 24] we are able to deal with the non-stationary nature of this process. And unlike the heuristics in [16, 17], we have theoretical guarantees, and can work with more than two criteria.

II-D Non-stationary Property of Active Learning
Demonstration of Non-stationarity. We describe a preliminary experiment to demonstrate empirically the existence of non-stationary reward distributions in a MAB formalisation of AL. Following the learning trajectory of our method, we use an oracle to score all the available query points at each iteration (i.e., hypothetically label each point, update the classifier, and check the test accuracy). Using the actual test accuracy as the reward, we can obtain the true expected reward of each expert at each time step. Fig. 1 summarises the resulting average reward obtained in every 10 iterations of AL. Based on this, we can further compute the proportion of times that each criterion obtains the highest reward. It can be seen that the MAB problem is non-stationary, as the rewards vary systematically and there is not a single criterion (expert) which obtains the highest proportion of wins throughout learning. Additionally, the ideal combination of criteria varies across datasets. For example, as illustrated in Fig. 1, density and uncertainty sampling are more complementary on ILPD, while representative and uncertainty sampling are more complementary on the german dataset.
Existing MAB ensembles are not robust to non-stationarity. The non-stationary property of the MAB formalisation of AL also highlights the key weakness of COMB and ALBL: they use the EXP4/EXP4.P [18, 19] expert advice bandit algorithms, which provide guarantees against an inappropriate (static) regret that is only relevant in a stationary problem. In a non-stationary problem, even an algorithm that perfectly estimates the best single expert (optimal w.r.t. the static oracle of Eq. 1) can be arbitrarily worse than one which chooses the best expert at each step (optimal w.r.t. the dynamic oracle). In this paper, we develop a non-stationary stochastic MAB algorithm, REXP4 (Restarting Exponential-weight algorithm for Exploration and Exploitation using Expert advice), with bounds against a stricter dynamic oracle notion of optimality better suited to (non-stationary) AL.
Prior attempts at non-stationary active learners. A few previous active learning studies also observed that different algorithms are effective at different stages of learning, and proposed heuristics for switching between two base query criteria (e.g., density sampling at an early stage and uncertainty sampling later on) [16, 17]. But these adapt only two criteria (density and uncertainty), unlike MAB ensembles which learn to combine many criteria, and their heuristics do not provide a principled and optimal way to learn when to switch.
Prior attempts at non-stationary MABs. Some previous studies have extended MAB learning without expert advice to the non-stationary setting [25, 26] and provided regret bounds to guarantee the algorithms' performance. However, bandits with expert advice are preferable because they can achieve tighter learning bounds [18, 15] and they do not treat each criterion as a black box, so one observation can be informative about many arms. Consider an AL situation where two criteria prefer the same instance. In the plain MAB interpretation (criteria = arms), after observing a reward the learner only learns about the criterion/arm chosen at that iteration. In the MAB with expert advice interpretation (criteria = experts), the observed reward updates the estimated efficacy of all criteria that expressed opinions about the point.
Those few MABs extended to the non-stationary setting make other, stronger assumptions. For example, the discounted/sliding-window UCB algorithm [25] assumes that the non-stationarity is piecewise-stationary and that the number of changes is known. Similarly, [27] makes the easier piecewise assumption, and also assumes that retrospective rewards for unpulled arms are available – but they are not in active learning. In [28], the authors proposed to measure the total statistical variance of the consecutive distributions in each time interval. Their result provides a big picture of the regret landscape for the full-information and bandit settings. However, their proposed method addresses non-stationary environments only for the regular MAB problem: despite the use of the term 'expert' in the title, it addresses arms rather than experts over arms, and so does not cover the expert-advice variant of the MAB problem relevant to us.
We propose a non-stationary MAB with expert advice algorithm that has performance guarantees, and validate its practical application to active learning.
III Non-Stationary Multi-Armed Bandit with Expert Advice for Active Learning
III-A Non-Stationary Multi-Armed Bandit with Expert Advice: REXP4
To formalise the problem, we assume the expected reward $\mu_t^{m}$ of each expert $m$ can change at any time step $t$. The total variation of the expected reward over all $T$ steps is

$$\sum_{t=1}^{T-1} \max_{m=1,\dots,N} \big| \mu_{t}^{m} - \mu_{t+1}^{m} \big| \qquad (2)$$

Following [29, 26], we assume this total variation in expected reward is bounded by a variation budget $V_T$. The variation budget captures our assumed constraints on the non-stationary environment. It allows a wide variety of reward changes – from continuous drift to discrete jumps – yet provides sufficient constraint to permit a bandit algorithm to learn in a non-stationary environment. The temporal uncertainty set $\mathcal{V}$ is defined as the set of expected-reward sequences whose total variation is within the budget $V_T$ over all $T$ steps.
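As a concrete check of the definition, the total variation of a toy expected-reward sequence can be computed directly (the sequences below are hypothetical, chosen only to illustrate drift versus a jump):

```python
import numpy as np

# Hypothetical expected-reward sequences for N = 2 experts over T = 5 steps:
# expert 0 drifts down smoothly, expert 1 jumps up once.
mu = np.array([[0.9, 0.8, 0.7, 0.6, 0.5],    # expert 0
               [0.2, 0.2, 0.2, 0.8, 0.8]])   # expert 1

# Total variation as in Eq. (2): sum over t of the largest per-expert change.
V = sum(np.max(np.abs(mu[:, t + 1] - mu[:, t]))
        for t in range(mu.shape[1] - 1))
```

Both the smooth drift (0.1 per step) and the single jump (0.6) are charged against the same budget, which is why the formulation covers both kinds of change.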
To bound the performance of a bandit learner in a non-stationary environment, we work with the regret between the learner and a dynamic oracle. This regret is defined as the worst-case difference between the return of using the best expert at each time $t$ and the expected return of the policy.
Definition 1.
Dynamic Regret for Multi-Armed Bandit with Expert Advice
$$R^{\pi}(V_T, T) \;=\; \sup_{\mu \in \mathcal{V}} \left\{ \sum_{t=1}^{T} \mu_t^{*} \;-\; \mathbb{E}^{\pi}\Big[\sum_{t=1}^{T} r_t^{\pi}\Big] \right\} \qquad (3)$$

where $\mu_t^{*} = \max_{m} \mu_t^{m}$ is the best possible expected reward among all experts at time $t$. Our regret is against this dynamic oracle, in contrast to prior MABs' static oracle (Eq. 1).
Our non-stationary MAB with expert advice algorithm REXP4 minimises the dynamic regret in Eq. 3. As shown in Algorithm 1, it trades off the need to remember against the need to forget by breaking the task into batches of $\Delta_T$ iterations and applying EXP4 [18] within each batch. As the reward distribution changes, it adapts by re-estimating each expert's reward distribution in each batch. We show a worst-case bound on the regret between this REXP4 procedure and the dynamic oracle.
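The restart scheme can be illustrated with a toy numpy sketch (not the paper's implementation; the flipping-means problem and all constants below are hypothetical): EXP4 expert weights are re-initialised at the start of every batch, letting the learner track a mid-run change in which expert is best.

```python
import numpy as np

rng = np.random.default_rng(3)
K, N, T, gamma, batch = 3, 2, 3000, 0.2, 300
# Expert 0 backs arm 0; expert 1 backs arm 2.
advice = np.array([[1., 0., 0.],
                   [0., 0., 1.]])

def arm_means(t):
    # Non-stationary toy problem: the best arm flips halfway through.
    return np.array([0.9, 0.5, 0.1]) if t < T // 2 else np.array([0.1, 0.5, 0.9])

w = np.ones(N)
picked = []
for t in range(T):
    if t % batch == 0:
        w = np.ones(N)                 # restart: forget, re-estimate experts
    p = (1 - gamma) * (w @ advice) / w.sum() + gamma / K
    k = rng.choice(K, p=p)
    r = float(rng.random() < arm_means(t)[k])
    xhat = np.zeros(K); xhat[k] = r / p[k]
    w *= np.exp(gamma * (advice @ xhat) / K)
    picked.append(int(np.argmax(w)))   # current estimate of the best expert

# In the final batch (after the change point) the learner backs expert 1.
late_choice = int(np.argmax(np.bincount(picked[-batch:])))
```

Without the restart line, the large weight accumulated by expert 0 in the first half would suppress the switch, which is the failure mode of stationary EXP4 described above.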
III-B Regret Bound for REXP4
The regret bound for REXP4 is given in the following theorem. The theorem is proved by following the proof structure of [26] and replacing the per-arm expected reward term in [26] with the per-expert expected reward term of our setting.
Theorem 1.
Let $\pi$ be the REXP4 policy with a batch size $\Delta_T = \big\lceil (K \ln K)^{1/3} (T/V_T)^{2/3} \big\rceil$ and $\gamma = \min\big\{1, \sqrt{K \ln K / ((e-1)\Delta_T)}\big\}$. Then, there is some constant $\bar{C}$ such that for every $T \geq 1$ and $V_T \in [K^{-1}, K^{-1}T]$,
$$R^{\pi}(V_T, T) \;\leq\; \bar{C}\, (K \ln K)^{1/3}\, V_T^{1/3}\, T^{2/3} \qquad (4)$$

where $K$ indicates the smaller of the number of experts or arms.
The result is an upper bound on the regret between our REXP4 policy and the dynamic oracle. As $K$ is the smaller of the two counts, the bound is favourable if either the number of experts or the number of arms is small. This also means it is relatively robust to many arms (as in AL, where arms = data points). If $V_T$ is sublinear in $T$ (i.e., the total variation in reward grows more slowly than the number of time steps), then the average regret vanishes and performance converges to that of the dynamic oracle.
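One plausible reading of the batch-size prescription in Theorem 1, $\Delta_T = \lceil (K \ln K)^{1/3} (T/V_T)^{2/3} \rceil$ (a reconstruction; treat the exact constants as assumptions), can be computed directly; the numbers below are an illustrative example, not the paper's settings:

```python
import math

def rexp4_batch_size(K, T, V_T):
    """Batch size Delta_T = ceil((K ln K)^(1/3) * (T / V_T)^(2/3)),
    with K the smaller of the number of experts or arms (assumed form)."""
    return math.ceil((K * math.log(K)) ** (1 / 3) * (T / V_T) ** (2 / 3))

# Example: 5 criteria (experts, fewer than pool points so K = 5),
# T = 1000 queries, and an assumed variation budget V_T = 2.
delta = rexp4_batch_size(K=5, T=1000, V_T=2.0)
```

A larger budget $V_T$ (faster-changing rewards) yields smaller batches, i.e., more frequent restarts, matching the remember/forget trade-off described above.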

1. Get the scores $s_i^{m}$ of each unlabelled instance $x_i$ from the $M$ criteria.
2. Normalise each score vector via exponential ranking normalisation.
3. Obtain the advice vectors $\xi^{m}(t)$ for $m = 1, \dots, M$.
4. Set $W_t = \sum_{m} w_m(t)$ and for each instance $i$ set $p_i(t) = (1-\gamma)\sum_{m} w_m(t)\,\xi_i^{m}(t)/W_t + \gamma/K$.
5. Query the label of an instance $x_q$ drawn randomly from $\mathcal{U}$ according to probability $p(t)$.
6. Move the instance $x_q$ from $\mathcal{U}$ to $\mathcal{L}$.
7. Retrain the classifier and receive reward $r(t)$.
8. For each instance $i$ set $\hat{x}_i(t) = r(t)/p_i(t)$ if $i = q$, and $\hat{x}_i(t) = 0$ otherwise.
9. For each criterion $m$ set $\hat{y}_m(t) = \xi^{m}(t) \cdot \hat{x}(t)$ and $w_m(t+1) = w_m(t)\exp\big(\gamma\,\hat{y}_m(t)/K\big)$.
III-C Dynamic Ensemble Active Learning
Based on our REXP4 algorithm for MAB with expert advice, we present DEAL-REXP4 (Dynamic Ensemble Active Learning). Our dynamic ensemble learner updates both the base learner and the active criteria weights iteratively. More specifically, each ensemble criterion predicts scores for all unlabelled instances. We use exponential ranking normalisation to avoid the issue of differing criterion scales, applying a Gibbs measure over rank positions whose sharpness is controlled by a temperature parameter. The rank denotes the position of an instance's score, where the ranking order is determined by the criterion's strategy. For example, the entropy criterion prefers points with maximum entropy, so the maximum-entropy point has rank 1; similarly, the minimum-margin criterion prefers points with low distance to the margin, so the minimum-distance point has rank 1. Based on the current suggestions from the criteria members, the active learning ensemble selects an instance for label querying. Then the base learner is updated with the new labelled data, and the active learner is updated in turn based on the performance improvement of the updated base learner. To track the non-stationary reward distribution, we use our proposed REXP4 algorithm to learn the weights of the active learning criteria in an online adaptive way by introducing the restart scheme: the weights are reset whenever the within-batch index indicates the start of a new batch of $\Delta_T$ iterations, and otherwise updates follow the EXP4 rule. The details are described in Algorithm 2, with an illustration in Fig. 2.
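The exponential ranking normalisation described above can be sketched as follows (the sharpness parameter `beta` is a hypothetical value; the paper's setting is not reproduced here):

```python
import numpy as np

def advice_from_scores(scores, beta=0.1):
    """Convert one criterion's raw scores into a probability vector over
    instances via a Gibbs measure on rank position (sketch). The criterion's
    preferred instance gets rank 1 and hence the highest probability;
    `beta` controls the sharpness of the resulting distribution."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)                 # descending: best first
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    g = np.exp(-beta * ranks)
    return g / g.sum()

p = advice_from_scores([0.3, 0.9, 0.1])
```

Because only ranks enter the measure, criteria with wildly different score scales (entropy vs. distances) produce comparable advice vectors, which is the point of the normalisation.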
In DEAL-REXP4 we set the reward as the resulting accuracy after a classifier update. Thus, in the context of active learning, the bound given in Eq. 4 means that the total area under the reward curve obtained by DEAL-REXP4 is within a bound of the best-case scenario that would occur only if we had known the best criterion to use at each iteration. Moreover, if the variation budget grows sublinearly with $T$, DEAL-REXP4 converges towards this best-expert-per-iteration upper bound.
Single Criterion

Algorithm | Motivation | Stationarity | Importance of Criterion | Ensemble Members | Property
US [1, 30, 31] | Query the least confident point | Stationary | Fixed | US | Static
RS [32] | Query a cluster within the margin | Stationary | Fixed | RS | Static
DE [16] | Query the major cluster | Stationary | Fixed | DE | Static

Multiple Criteria

Algorithm | Motivation | Stationarity | Importance of Criterion | Ensemble Members | Property
QUIRE [11] | Combining informativeness and representativeness | Stationary | Equal effect | QUIRE | Static
BMDR [12] | Combining discriminativeness and representativeness | Stationary | Equal effect | BMDR | Static
LAL [33] | Combining multiple motivations | Stationary | Equal effect | Any criteria | Static
DUAL [16] | Switching from DE to US once | Non-stationary | Varying | US, DE | Dynamic
ALGD [17] | Switching between DE and US | Non-stationary | Varying | US, DE | Dynamic

Bandit Ensemble Algorithms

Algorithm | Bandit | Regret | Stationarity | Importance of Criterion | Ensemble Members | Property
COMB [14] | EXP4 [18] | Static oracle | Stationary | Single best | Any criteria | Static
ALBL [15] | EXP4.P [19] | Static oracle | Stationary | Single best | Any criteria | Static
LSA [24] | LinUCB [23] | Static oracle | Stationary | Single best combination | Any criteria | Static
DEAL | REXP4 | Dynamic oracle | Non-stationary | Dynamic best | Any criteria | Dynamic
III-D Discussion of Static and Dynamic Active Learning
We divide active learning algorithms into static and dynamic based on whether they make a stationary or non-stationary assumption about the importance of each criterion over time.
Static Active Learning. Single-criterion algorithms are all static, since they solve active learning with only one criterion. Active learning algorithms with multiple motivations are also static if they are formalised as a single fixed mixture of criteria: since the coefficients of the different motivations are fixed over all time steps, they assume a single weighted combination is suitable at any learning stage. For example, Querying Informative and Representative Examples (QUIRE) [11], Learning Active Learning (LAL) [33], and Discriminative and Representative Queries for Batch Mode Active Learning (BMDR) [12] are static active learning algorithms with multiple motivations.
Previously proposed ensemble algorithms ALBL [15], COMB [14], and Linear Strategy Aggregation (LSA) [24] are also static in the sense that, although the weighting of their ensemble members changes as data are gathered, their underlying bandit learner is a stationary one, assuming there is a single best expert or best linear combination over all time.
Dynamic Active Learning. In our dynamic active learning setting, we avoid any stationarity assumption on criteria importance over time. A non-stationary algorithm should adapt its weighting over time in response to learning progress. Prior attempts propose heuristics for switching or re-weighting [16, 17] between density and uncertainty sampling. Our DEAL-REXP4 improves on these in that it can use an arbitrary number of criteria of any type, rather than two pre-specified criteria; and in contrast to prior heuristics, it contains a principled underlying learner with theoretical guarantees. We provide a summary of related prior active learning algorithms in Table I, where the generality and strong notion of regret of DEAL-REXP4 are clear.
IV Experiments and Results
[Table II: Wins/ties/losses of DEAL versus the individual ensemble member of each rank (1st–4th), with totals.]
[Table III: Wins/draws/losses of ALBL, COMB, DUAL and DEAL against the alternatives, grouped into non-stationary and stationary datasets, with totals.]
To evaluate our algorithm, we use 13 datasets from the UCI (https://archive.ics.uci.edu/ml/datasets.html) and LibSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) repositories. These datasets were selected following previous relevant papers [24, 15, 11, 6]. We use a linear SVM [34] as the base learner. If a dataset does not include a predefined training/testing split, we randomly sample a training split and use the rest for testing. In each trial, we start with one randomly labelled point per class. Each experiment is repeated 200 times and the average testing accuracy is reported.
Criteria Ensemble: The ensemble of base criteria includes: US: picking the maximum-entropy (minimum-margin) instance in binary datasets [1, 30], or the instance with minimum Best-versus-Second-Best (BvSB) margin [31] in multi-class datasets. RS: clustering the points near the margin [32], then scoring unlabelled points by their distance to the largest centroid. Distance-Furthest-First (DFF): focuses on exploration by selecting the unlabelled instance furthest from its nearest labelled instance [35]; we use DFF in place of RS on multi-class datasets, as RS was originally designed for binary-class datasets. Both are motivated by exploring the dataset, but DFF does not depend on a binary classifier. Density Estimation (DE): picking the instance with maximum density under a GMM with 20 diagonal-covariance components [16]. RAND: randomly selecting points, which can be hard to beat on datasets unsuited to a given criterion; moreover, including a random expert (for exploration) is necessary to guarantee the performance of the EXP4 subroutine [18, 19].
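Two of these criteria admit compact sketches (numpy; both functions are illustrative implementations of the stated definitions, not the authors' code):

```python
import numpy as np

def us_scores(probs):
    """Uncertainty sampling: entropy of the predicted class posteriors
    (for binary problems, maximum entropy coincides with minimum margin)."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -(probs * np.log(probs)).sum(axis=1)

def dff_scores(X_unlabelled, X_labelled):
    """Distance-Furthest-First: score each unlabelled point by its distance
    to the nearest labelled point; the furthest point is queried first."""
    d = ((X_unlabelled[:, None, :] - X_labelled[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d.min(axis=1))

u = us_scores([[0.5, 0.5], [0.9, 0.1]])                      # ambiguous vs confident
f = dff_scores(np.array([[0., 0.], [5., 5.]]), np.array([[0., 1.]]))
```

Each such score vector would then be passed through the ranking normalisation of Sec. III-C to become an expert's advice vector.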
Competitors: We compare our method to ALBL [15], COMB [14] and DUAL [16]. For COMB, we follow their recommended settings with the CEM reward. For ALBL, we use their settings with the importance-weighted accuracy reward. For direct comparison, ALBL, COMB and REXP4 use the same ensemble of criteria described above. DUAL is engineered for a specific pair of criteria, so we apply its original version using uncertainty sampling and density-weighted uncertainty sampling. It is also only defined for binary classification problems, unlike the others.
DEAL-REXP4 Settings: For the reward, we follow [15, 24] in using IWA as an unbiased estimate of test accuracy. To produce probabilistic preferences over points from all AL criteria, we use exponential ranking normalisation with a Gibbs measure. We use a fixed batch size $\Delta_T$ throughout, chosen by observing the typical coarse duration of performance gaps among different criteria; for example, RS wins the first 20 iterations in Fig. 3(b). The reason for parameterising in terms of $\Delta_T$ rather than $V_T$ is that the batch size has an intuitive meaning in the AL context, yet implies a corresponding variation budget $V_T$ for any given $T$ (Theorem 1).
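An importance-weighted accuracy reward in the spirit of [15] can be sketched as follows (the exact estimator and normalisation are assumptions, and `iwa_reward` is a hypothetical helper, not the authors' implementation):

```python
import numpy as np

def iwa_reward(correct, q_probs, n_pool):
    """Importance-weighted accuracy (sketch): each queried point's
    correctness is reweighted by 1/p, the inverse of the probability with
    which it was queried, correcting for the non-uniform query distribution
    so the estimate is unbiased for pool accuracy. `n_pool` is the pool size."""
    correct = np.asarray(correct, dtype=float)
    w = 1.0 / np.asarray(q_probs, dtype=float)
    return float((w * correct).sum() / n_pool)

# Two queried points, drawn with probabilities 0.5 and 0.25, both classified
# correctly by the updated model, in a pool of 10:
r = iwa_reward(correct=[1, 1], q_probs=[0.5, 0.25], n_pool=10)
```

The inverse-probability weighting is what removes the bias introduced by the bandit's non-uniform sampling of query points.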
Characterising dataset (non)stationarity: We first investigate each dataset to characterise its (non)stationarity. We follow our DEAL trajectory and use an oracle to measure the wins of each criterion at each batch in terms of performance increase. A dataset with a stationary reward distribution would tend to have a consistent winner, and vice versa. Although (non)stationarity is a continuum, we describe a dataset as non-stationary if at least two criteria have a fraction of wins above a threshold $\delta$, and as stationary otherwise.
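This labelling rule can be sketched as follows (the threshold value is hypothetical):

```python
import numpy as np

def is_nonstationary(win_fractions, delta=0.3):
    """Label a dataset non-stationary (sketch) if at least two criteria each
    win a fraction of batches above threshold `delta` (assumed value), i.e.,
    no single criterion dominates throughout learning."""
    return int(np.sum(np.asarray(win_fractions) >= delta) >= 2)

a = is_nonstationary([0.80, 0.10, 0.10])   # one dominant winner
b = is_nonstationary([0.45, 0.40, 0.15])   # two frequent winners
```

A single dominant winner marks the dataset stationary; two or more frequent winners mark it non-stationary.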
DEAL versus Individual Criteria. Examples comparing the performance of DEAL and individual criteria in the ensemble are shown in Fig. 3. No single criterion works best for all datasets; moreover, different criteria are effective at different stages of learning. While DEAL is not best across all datasets and all time steps (this would require the actual dynamic oracle upper bound), it performs well overall. This is summarised quantitatively across all 13 datasets in Tab. II. Each method's performance is evaluated by the area under the learning curve at different proportions of added instances. The results show the number of wins/ties/losses of DEAL versus the ensemble member of the specified rank, according to a two-sided t-test. This shows, for example, that DEAL often ties with the top-ranked ensemble member (30 draws vs. the 1st rank), is usually at least as good as the second-ranked member (50 wins and 45 ties vs. only 35 losses), and is never the worst (0 losses vs. the 4th rank).
Comparison vs. State-of-the-Art. We compare DEAL-REXP4 with state-of-the-art alternatives for tuning an AL ensemble. Sometimes DUAL performs well, but it is highly variable, depending on whether its criterion-switch heuristic makes a good choice or not, as seen in Fig. 4. Tab. III summarises the results across all datasets in terms of AUC wins/draws/losses of each approach against the alternatives. DUAL has a lower row total as it is defined for binary problems only, so it is not evaluated on the wine and letter datasets. The main observation is that DEAL outperforms the alternatives, particularly on non-stationary datasets. On stationary datasets we are only slightly worse than ALBL. This is expected, as REXP4 performs forgetting in order to adapt to changes in expert efficacy, meaning that we cannot exploit the best criterion as aggressively as ALBL's EXP4.P MAB learner. Nevertheless, overall DEAL is fairly robust on stationary datasets (a small margin behind ALBL), while ALBL is not robust on non-stationary datasets (a larger margin behind DEAL).
V Conclusion
We proposed a non-stationary multi-armed bandit with expert advice algorithm, REXP4, and demonstrated its application to online learning of a criterion ensemble in active learning. The theoretical results provide bounds on REXP4's optimality. The empirical results show that active learning with DEAL-REXP4 tends to perform near the best criterion in the ensemble. It performs comparably to state-of-the-art alternative ensembles on stationary datasets, and outperforms them on non-stationary datasets.
References
 [1] D. D. Lewis and W. A. Gale, “A sequential algorithm for training text classifiers,” in Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’94. New York, NY, USA: SpringerVerlag New York, Inc., 1994, pp. 3–12.

[2]
S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,”
Proc. 17th International Conf. on Machine Learning"
, vol. 2, pp. 45–66, Mar. 2002.  [3] N. Roy and A. McCallum, “Toward optimal active learning through sampling estimation of error reduction,” in Proc. 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2001, pp. 441–448.

[4]
T. M. Hospedales, S. Gong, and T. Xiang, “A unifying theory of active
discovery and learning,” in
European Conference on Computer Vision
. Berlin, Heidelberg: SpringerVerlag, 2012, pp. 453–466.  [5] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, “Active learning with statistical models,” in Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, Eds. MIT Press, 1995, pp. 705–712.

[6]
R. Chattopadhyay, Z. Wang, W. Fan, I. Davidson, S. Panchanathan, and J. Ye, “Batch mode active sampling based on marginal probability distribution matching,”
ACM Trans. Knowl. Discov. Data, vol. 7, no. 3, pp. 13:1–13:25, Sep. 2013.  [7] K. Yu, J. Bi, and V. Tresp, “Active learning via transductive experimental design,” in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML ’06. New York, NY, USA: ACM, 2006, pp. 1081–1088.

[8]
H. S. Seung, M. Opper, and H. Sompolinsky, “Query by committee,” in
Proceedings of the Fifth Annual Workshop on Computational Learning Theory
, ser. COLT ’92. New York, NY, USA: ACM, 1992, pp. 287–294. [Online]. Available: http://doi.acm.org/10.1145/130385.130417  [9] N. Abe and H. Mamitsuka, “Query learning strategies using boosting and bagging,” in Proceedings of the Fifteenth International Conference on Machine Learning, ser. ICML ’98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 1–9.

[10]
C. C. Loy, T. M. Hospedales, T. Xiang, and S. Gong, “Streambased joint
explorationexploitation active learning,” in
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, 2012.  [11] S.J. Huang, R. Jin, and Z.H. Zhou, “Active learning by querying informative and representative examples.” in Advances in Neural Information Processing Systems 23, 2010, pp. 892–900.
 [12] Z. Wang and J. Ye, “Querying discriminative and representative samples for batch mode active learning,” ACM Trans. Knowl. Discov. Data, vol. 9, no. 3, pp. 17:1–17:23, Feb. 2015.
 [13] Z. Wang, B. Du, L. Zhang, L. Zhang, and X. Jia, “A novel semisupervised activelearning algorithm for hyperspectral image classification,” IEEE Trans. Geoscience and Remote Sensing, vol. 55, no. 6, pp. 3071–3083, 2017.
 [14] Y. Baram, R. ElYaniv, and K. Luz, “Online choice of active learning algorithms,” Journal of Machine Learning Research, vol. 5, pp. 255–291, Dec. 2004.

[15]
W.N. Hsu and H.T. Lin, “Active learning by learning,” in
Proceedings of the TwentyNinth AAAI Conference on Artificial Intelligence
, ser. AAAI’15. AAAI Press, 2015, pp. 2659–2665.  [16] P. Donmez, J. G. Carbonell, and P. N. Bennett, “Dual strategy active learning,” in Proceedings of ECML 2007. Springer Verlag, September 2007.
[17] T. M. Hospedales, S. Gong, and T. Xiang, “Finding rare classes: Active learning with generative and discriminative models,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 2, pp. 374–386, 2013.

[18] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM J. Comput., vol. 32, no. 1, pp. 48–77, 2002.
[19] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire, “Contextual bandit algorithms with supervised learning guarantees,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Gordon, D. Dunson, and M. Dudík, Eds., vol. 15. Fort Lauderdale, FL, USA: PMLR, 11–13 Apr 2011, pp. 19–26.

[20] J. Langford and T. Zhang, “The epoch-greedy algorithm for multi-armed bandits with side information,” in Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, Eds. Curran Associates, Inc., 2008, pp. 817–824.

[21] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, “Efficient bandit algorithms for online multiclass prediction,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08. New York, NY, USA: ACM, 2008, pp. 440–447.

[22] A. L. Strehl, C. Mesterharm, M. L. Littman, and H. Hirsh, “Experience-efficient learning in associative bandit problems,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 889–896.
 [23] W. Chu, L. Li, L. Reyzin, and R. E. Schapire, “Contextual bandits with linear payoff functions,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Gordon, D. Dunson, and M. Dudík, Eds., vol. 15. Fort Lauderdale, FL, USA: PMLR, 11–13 Apr 2011, pp. 208–214.
[24] H. Chu and H. Lin, “Can active learning experience be transferred?” in IEEE 16th International Conference on Data Mining, ICDM 2016, December 12–15, 2016, Barcelona, Spain, 2016, pp. 841–846.

[25] A. Garivier and E. Moulines, “On upper-confidence bound policies for non-stationary bandit problems,” in Algorithmic Learning Theory, 2008.

[26] O. Besbes, Y. Gur, and A. Zeevi, “Stochastic multi-armed-bandit problem with non-stationary rewards,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 199–207.

[27] J. Y. Yu and S. Mannor, “Piecewise-stationary bandit problems with side observations,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09, 2009, pp. 1177–1184.

[28] C.-Y. Wei, Y.-T. Hong, and C.-J. Lu, “Tracking the best expert in non-stationary stochastic environments,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 3972–3980.

[29] O. Besbes, Y. Gur, and A. J. Zeevi, “Non-stationary stochastic optimization,” Operations Research, vol. 63, pp. 1227–1244, 2015.
 [30] B. Settles, “Active learning literature survey,” University of Wisconsin–Madison, Computer Sciences Technical Report 1648, 2009.
[31] A. J. Joshi, F. Porikli, and N. Papanikolopoulos, “Multi-class active learning for image classification,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 [32] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang, “Representative sampling for text classification using support vector machines,” in Advances in Information Retrieval. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 393–407.
 [33] K. Konyushkova, R. Sznitman, and P. Fua, “Learning active learning from data,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4225–4235.
[34] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Jun. 2008.

[35] D. S. Hochbaum and D. B. Shmoys, “A best possible heuristic for the k-center problem,” Math. Operations Research, vol. 10, no. 2, pp. 180–184, 1985.