Dynamic Ensemble Active Learning: A Non-Stationary Bandit with Expert Advice

by Kunkun Pang, et al.

Active learning aims to reduce annotation cost by predicting which samples are useful for a human teacher to label. However, it has become clear that there is no single best active learning algorithm. Inspired by differing philosophies about what constitutes a good criterion, different algorithms perform well on different datasets. This has motivated research into ensembles of active learners that learn what constitutes a good criterion in a given scenario, typically via multi-armed bandit algorithms. Though algorithm ensembles can lead to better results, they overlook the fact that algorithm efficacy varies not only across datasets but also during a single active learning session. That is, the best criterion is non-stationary. This breaks existing algorithms' guarantees and hampers their performance in practice. In this paper, we propose dynamic ensemble active learning as a more general and promising research direction. We develop a dynamic ensemble active learner based on a non-stationary multi-armed bandit with expert advice algorithm. Our dynamic ensemble selects the right criterion at each step of active learning. It has theoretical guarantees, and shows encouraging results on 13 popular datasets.







I Introduction

The key barrier to scaling or applying supervised learning in practice is often the cost of obtaining sufficient annotation. Active Learning (AL) aims to address this by designing query algorithms that effectively predict which points will be useful to annotate, thus enabling efficient allocation of human annotation effort. There are many different AL algorithms, each designed around an appealing, yet completely different, motivation for what constitutes a good question to ask. For example, uncertainty or margin-based sampling [1, 2] queries the most uncertain or ambiguous point, i.e., the point closest to the decision boundary. Expected error reduction [3, 4] queries points that the current model predicts will reduce its future error. Another typical approach is to label the most representative samples [5, 6, 7] to ensure the major clusters within the dataset are correctly estimated. Besides these approaches, query-by-committee active learning queries points based on the disagreement between a committee of classifiers [8, 9, 10]. More recent studies investigated hybrid criteria that balance multiple motivations [11, 12, 13].

These are all good ideas, yet there are situations where each is ineffective. For example, if the classes heavily overlap in some region of feature space, uncertainty sampling will be tied up querying points in an impossible-to-solve region. If the current model is very poor, expected error reduction cannot accurately estimate its own future error. If the main data clusters are already well classified, representativeness-focused approaches may fail to fine-tune the boundaries between them.

These thought experiments are reflected empirically. The best algorithm for pool-based AL in practice varies both across datasets and with the progress of learning within a given dataset [14, 15]. This observation has motivated research into learning both dataset-specific and time-specific weightings for an AL algorithm ensemble. [16, 17] developed heuristics for switching between AL algorithms that are typically good at early versus late stage learning. In contrast, [14, 15] developed methods for rapid online learning of a dataset-specific weighting for algorithms within an AL ensemble.

The key insight of the Combination of Active Learning Online (COMB) [14] and Active Learning by Learning (ALBL) [15] algorithms is to formalise the query criterion selection task as a multi-armed bandit (MAB) problem. MAB problems have been well studied and many powerful algorithms with optimality guarantees exist. For example, if each query criterion in the ensemble is considered to be a bandit arm, and the learning improvement after executing a criterion is considered to be the arm's reward, then MAB algorithms such as EXP3 (Exponential-weight algorithm for Exploration and Exploitation) [18] can be applied to quickly learn the efficacy of the arms (AL criteria), and are guaranteed to achieve near optimal overall reward (learning improvement). A variant of this is to consider data points to be arms, and AL criteria to be experts providing advice about promising arms. Then MAB with expert advice algorithms such as EXP4.P (Exponential-weight algorithm for Exploration and Exploitation using Expert advice with a high-probability regret bound) [19] optimise exploration and exploitation of experts, and achieve provably near optimal reward.

The fundamental limitation of existing MAB-based approaches to AL is that their underlying MAB algorithms do not take into account the temporal dynamics of active learning: different criteria are effective at different learning stages [16, 17]. This issue is illustrated by Fig. 1(a,c,e), where the most effective criterion varies across the time horizon. On fourclass, density (DE) sampling is slightly better at first and uncertainty sampling (US) is consistently good later on. Similarly, in ILPD or german, representative (RS) and density (DE) sampling are better at the crucial early stage before uncertainty sampling becomes better. A second issue is that the scale of an accuracy-based reward falls dramatically over time (Fig. 1(b,d,f)). Because of this, stationary bandit learners will be unduly biased by the high reward from an initial observation and fail to adapt subsequently. For example, in ILPD a stationary learner may fail to make the switch from DE to US because later rewards in favour of US are small in scale compared to the initial reward in favour of DE.

Therefore there are non-stationary aspects both in reward scale, and in reward distribution per-arm (MAB perspective) or per-expert (MAB with expert advice perspective). Thus the MAB problem is formally non-stationary, violating a fundamental assumption required to guarantee existing MAB algorithms’ optimality bounds.

Here we develop a performance-guaranteed stochastic MAB with expert advice algorithm for a non-stationary environment (we use the terminology of [18]; this setting also has other names, including 'contextual bandit' [19, 20], 'partial-label problem' [21], and 'associative bandit problem' [22]). Applying this to AL means that, like [15], if there is a single best (but a priori unknown) AL algorithm for a dataset, we are able to quickly discover it and thus approach the performance of an oracle that knows the best algorithm for each dataset. But importantly, when different algorithms' efficacies vary over time within one dataset, we can adapt to this and approach the performance of an oracle that knows the best AL algorithm at each iteration.

Fig. 1: Examples of non-stationary AL on UCI datasets "fourclass", "ILPD", and "german" using four algorithms/criteria: US, RS, DE, and RAND. Panels (a,c,e): the proportion of times each criterion generates the largest increase in accuracy. Panels (b,d,f): the relative increase in accuracy, where all increments are re-scaled by subtracting the minimum accuracy increment over all criteria in each bin.

II Background and Related Work

II-A Active Learning

We denote the pool of data with $N$ samples as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where the instances are $x_i$ and the labels are $y_i$. In an active learning scenario, the data initially comprise a labelled set $\mathcal{L}$ and an unlabelled set $\mathcal{U}$, where $\mathcal{D} = \mathcal{L} \cup \mathcal{U}$. Training an initial classifier $f_0$ on the samples in the initial labelled set, the algorithm queries instances from $\mathcal{U}$ during iterations $t = 1, \dots, T$. After the supervision (label) of instance $x^{(t)}$ is obtained, $x^{(t)}$ is removed from the unlabelled set $\mathcal{U}$ and added to the labelled set $\mathcal{L}$, from which classifier $f_t$ is retrained.
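The notation above corresponds to the standard pool-based AL loop, which can be sketched as follows (the `fit` and `score` callables are hypothetical stand-ins for the base classifier and a query criterion; an illustrative sketch, not the paper's code):

```python
import numpy as np

def active_learning_loop(X, y, n_init, n_queries, fit, score):
    """Generic pool-based AL loop: fit on L, score U, query a point, refit.

    fit(X_l, y_l) -> model; score(model, X_u) -> utility per unlabelled point.
    """
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    labelled = list(idx[:n_init])      # initial labelled set L
    unlabelled = list(idx[n_init:])    # unlabelled pool U
    model = fit(X[labelled], y[labelled])
    for _ in range(n_queries):
        utilities = score(model, X[unlabelled])
        # move the highest-utility point from U to L (oracle provides its label)
        pick = unlabelled.pop(int(np.argmax(utilities)))
        labelled.append(pick)
        model = fit(X[labelled], y[labelled])
    return labelled, model
```

Any single criterion from the paper (uncertainty, density, representativeness) slots in as `score`; the ensemble methods discussed below replace this fixed `score` with a learned mixture.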

II-B Bandit Algorithms

Multi-armed Bandit In multi-armed bandit (MAB) problems, a player pulls a lever from a set of $K$ slot machines over a sequence of $T$ time steps to maximise her payoff. During the game, she only observes the reward $r_t$ of the specific arm pulled at time step $t$. The aim of the player is to maximise her return, which is the sum of the rewards over the sequence of pulls. This requires a trade-off between exploration (collecting information to estimate the arm with the highest return) and exploitation (focusing on the arm with the highest estimated return). Training a bandit learner to solve a MAB problem is then formalised as minimising the regret between the actions chosen by the player's strategy $\pi$ and the best arm.

For example, the EXP3 algorithm [18] minimises, for any finite $T$, the "static regret" between the player's reward and that of the best arm in retrospect: $R^{s}_{T} = \max_{k} \mathbb{E}\big[\sum_{t=1}^{T} r_{k,t}\big] - \mathbb{E}\big[\sum_{t=1}^{T} r^{\pi}_{t}\big]$.
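As an illustration of this setup, a minimal EXP3 sketch in the standard form of [18] (rewards assumed in [0, 1]; parameter names are ours):

```python
import numpy as np

def exp3(rewards, gamma=0.1, seed=0):
    """EXP3 on a K-armed bandit; rewards[t, k] in [0, 1].

    Returns the sequence of pulled arms and the final arm weights.
    """
    rng = np.random.default_rng(seed)
    T, K = rewards.shape
    w = np.ones(K)
    pulls = []
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / K  # mix in uniform exploration
        arm = rng.choice(K, p=p)
        x_hat = rewards[t, arm] / p[arm]           # importance-weighted reward
        w[arm] *= np.exp(gamma * x_hat / K)        # exponential weight update
        pulls.append(int(arm))
    return pulls, w
```

On a stationary problem the weights concentrate on the best arm, which is exactly the behaviour whose guarantee breaks once the reward distribution drifts.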

Contextual Multi-armed Bandit The goal of contextual bandits is to relate available context information to the reward distribution over the arms. For example, LinUCB [23] makes the linear realizability assumption that there exists an unknown weight vector $\theta^{*}$ with $\mathbb{E}[r_{t,a} \mid x_{t,a}] = x_{t,a}^{\top}\theta^{*}$, where $x_{t,a}$ is the context of arm $a$ at time $t$, so that regret against the best context-dependent arm is minimised. However, learning to predict the reward for each data point accurately appears to be an even harder problem given the limited information available from only expert suggestions (Fig. 1). More importantly, given the changing reward distribution over time, there is no constant relation between context and reward.

Multi-armed Bandit with Expert Advice Expert information about the likely efficacy of each arm is often available. [18] thus introduced an adversarial MAB with expert advice algorithm, EXP4, that exploits experts giving advice vectors (probabilities over levers) to the learner at each time step. In contrast to MAB without expert advice, the goal is now to identify the best expert rather than the best arm. In this setting the regret to minimise is the difference between the return of the best expert in retrospect and that of the player:

$R^{s}_{T} = \max_{k} \sum_{t=1}^{T} \mathbb{E}[r_{k,t}] - \sum_{t=1}^{T} \mathbb{E}[r^{\pi}_{t}]$     (1)

where $\mathbb{E}[r_{k,t}]$ is the expected reward of expert $k$ and $\mathbb{E}[r^{\pi}_{t}]$ is the expected reward of our policy $\pi$.
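A single EXP4-style step, which mixes the expert advice vectors into one sampling distribution and then credits every expert in proportion to how strongly it recommended the pulled arm, can be sketched as (a schematic rendering of EXP4 [18], not the authors' code):

```python
import numpy as np

def exp4_step(w, advice, reward_fn, gamma, rng):
    """One EXP4 step: mix M expert advice vectors over K arms, pull, update.

    advice: (M, K) matrix whose rows are probability vectors over arms;
    reward_fn(arm) -> reward in [0, 1]. Returns updated weights and the arm.
    """
    M, K = advice.shape
    p = (1 - gamma) * (w / w.sum()) @ advice + gamma / K  # mixture policy
    arm = int(rng.choice(K, p=p / p.sum()))
    x_hat = np.zeros(K)
    x_hat[arm] = reward_fn(arm) / p[arm]   # unbiased per-arm reward estimate
    y_hat = advice @ x_hat                 # per-expert reward estimate
    w = w * np.exp(gamma * y_hat / K)      # trust rewarding experts more
    return w, arm
```

Note that one observed reward updates every expert that expressed an opinion about the pulled arm; this shared credit assignment is the advantage over treating criteria as opaque arms discussed later in Section II-D.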

II-C Bandits for Active Learning

For active learning using a MAB with expert advice algorithm, the experts correspond to our ensemble of active learning criteria and the arms are the available points in the pool. Each expert (criterion) provides a probability vector encoding its preference over arms (instances). Active learners based on MAB with expert advice aim to learn the best criterion for a specific dataset. In COMB [14], the authors propose to use MAB with expert advice in active learning, heuristically designing the classification entropy maximization (CEM) score as the reward of the EXP4 bandit algorithm [18]. A more recent paper [15] (ALBL) proposed to replace the CEM reward with Importance Weighted Accuracy (IWA), an unbiased estimate of test accuracy, and used an upgraded bandit algorithm, EXP4.P [19], which improves the earlier EXP4 method. Similarly, another recent paper [24] applied a linear upper confidence bound contextual bandit algorithm (LinUCB) to train an ensemble and transferred the knowledge to other datasets. All of these algorithms enable the selection of a suitable active learning criterion for a given dataset. Our contribution is also to perform AL in a dataset-specific way by optimally tuning the exploration and exploitation of an ensemble of AL algorithms; but more importantly, to do so dynamically, thus allowing the optimal tuning to vary as learning progresses. Unlike [14, 15, 24], we are able to deal with the non-stationary nature of this process. And unlike the heuristics in [16, 17], we have theoretical guarantees and can work with more than two criteria.

II-D Non-stationary Property of Active Learning

Demonstration of Non-stationarity We describe a preliminary experiment to demonstrate empirically the existence of non-stationary reward distributions in a MAB formalisation of AL. Following the learning trajectory of our method, we use an oracle to score all the available query points at each iteration (i.e., hypothetically label each point, update the classifier, and check the test accuracy). Using the actual test accuracy as the reward, we can obtain the true expected reward of the $k$th expert at each time step $t$. Fig. 1 summarises the resulting average reward obtained in every 10 iterations of AL. Based on this, we can further compute the proportion of times that each criterion would obtain the highest reward. It can be seen that the MAB problem is non-stationary: the rewards vary systematically, and no single criterion (expert) obtains the highest proportion of wins throughout learning. Additionally, the ideal combination of criteria varies across datasets. For example, as illustrated in Fig. 1, density and uncertainty sampling are more complementary in ILPD, while representative and uncertainty sampling are more complementary in the german dataset.

Existing MAB ensembles are not robust to non-stationarity The non-stationarity of the MAB formalisation of AL also highlights the key weakness of COMB and ALBL: they use the EXP4/EXP4.P [18, 19] expert advice bandit algorithms, which provide guarantees against an inappropriate (static) regret that is only relevant in a stationary problem. In a non-stationary problem, even an algorithm that perfectly estimates the best single expert (optimal w.r.t. the static oracle, Eq. 1) can be arbitrarily worse than one which chooses the best expert at each step (optimal w.r.t. the dynamic oracle). In this paper, we develop a non-stationary stochastic MAB algorithm, REXP4 (Restarting Exponential-weight algorithm for Exploration and Exploitation using Expert advice), with bounds against a stricter dynamic oracle notion of optimality better suited to (non-stationary) AL.

Prior attempts at non-stationary active learners A few previous active learning studies also observed that different algorithms are effective at different stages of learning and proposed heuristics for switching between two base query criteria (e.g., density sampling at an early stage, and uncertainty sampling later on) [16, 17]. But these adapt only two criteria (density and uncertainty), unlike MAB ensembles, which learn to combine many criteria; and their heuristics do not provide a principled and optimal way to learn when to switch.

Prior attempts at non-stationary MABs Some previous studies have extended MAB learning (without expert advice) to the non-stationary setting [25, 26] and provided regret bounds to guarantee the algorithms' performance. However, bandits with expert advice are preferable because they can achieve tighter learning bounds [18, 15] and they do not treat each criterion as a black box, so one observation can be informative about many arms. Consider an AL situation where two criteria prefer the same instance. In the MAB interpretation (criteria = arms), after observing a reward, we only learn about the criterion/arm chosen at that iteration. In the MAB with expert advice interpretation (criteria = experts), the observed reward generates updates about the efficacy of all criteria that expressed opinions about the point.

Those few MABs extended to the non-stationary setting make other, stronger assumptions. For example, the discounted/sliding-window UCB algorithm [25] assumes that the reward distribution is piecewise stationary and that the number of changes is known. Similarly, [27] makes the easier piecewise assumption, and also assumes that retrospective rewards for un-pulled arms are available, which they are not in active learning. In [28], the authors proposed to measure the total statistical variance of the consecutive distributions at each time interval. Their result provides a big picture of the regret landscape for the full-information and bandit settings. However, their method addresses non-stationary environments only for the regular MAB problem: despite the use of the term expert in the title, it addresses arms rather than experts over arms, and so does not cover the expert-advice variant of the MAB problem relevant to us.

We propose a non-stationary MAB with expert advice algorithm that has performance guarantees, and validate its practical application to active learning.

III Non-stationary Multi-Armed Bandit with Expert Advice for Active Learning

III-A Non-stationary Multi-Armed Bandit with Expert Advice: REXP4

To formalise the problem, we assume the expected reward $\mu_{k,t}$ of each expert $k$ can change at any time step $t$. Following [29, 26], we assume the total variation of the expected reward over all $T$ steps is bounded by a variation budget $V_T$:

$\sum_{t=2}^{T} \max_{k} \, |\mu_{k,t} - \mu_{k,t-1}| \le V_T$     (2)

The variation budget captures our assumed constraints on the non-stationary environment. It allows a wide variety of reward changes, from continuous drift to discrete jumps, yet provides sufficient constraint to permit a bandit algorithm to learn in a non-stationary environment. The temporal uncertainty set $\mathcal{V}$ is defined as the set of expected reward sequences that are subject to the variation budget $V_T$ over all $T$ steps.
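The total variation in Eq. 2 is straightforward to compute for a known sequence of expected rewards; a small helper illustrating the definition (illustrative, array shapes are our convention):

```python
import numpy as np

def total_variation(mu):
    """Total variation of expected rewards: sum_t max_k |mu[t, k] - mu[t-1, k]|.

    mu: (T, M) array of expected reward for each of M experts at each step.
    """
    return float(np.abs(np.diff(mu, axis=0)).max(axis=1).sum())
```

A single abrupt swap of the best expert costs a constant amount of budget, while slow drift accumulates many small increments; both are admissible under Eq. 2.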

To bound the performance of a bandit learner in a non-stationary environment, we work with the regret between the learner and a dynamic oracle. The regret is defined as the worst-case difference between the expected policy return and the return of using the best expert at each time .

Definition 1.

Dynamic Regret for Multi-Armed Bandit with Expert Advice

$R^{d}_{T}(\pi) = \sup_{\mu \in \mathcal{V}} \Big\{ \sum_{t=1}^{T} \mu^{*}_{t} - \mathbb{E}\big[\sum_{t=1}^{T} r^{\pi}_{t}\big] \Big\}$     (3)

where $\mu^{*}_{t} = \max_{k} \mu_{k,t}$ is the best possible expected reward among all experts at time $t$. Our regret is against this dynamic oracle, in contrast to prior MABs' static oracle (Eq. 1).
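A toy computation makes the gap between the two oracles concrete (illustrative only):

```python
import numpy as np

def oracle_returns(mu):
    """Return (static, dynamic) oracle returns for expected rewards mu (T, M)."""
    static = mu.sum(axis=0).max()   # best single expert in retrospect (Eq. 1)
    dynamic = mu.max(axis=1).sum()  # best expert at every step (Eq. 3)
    return float(static), float(dynamic)
```

On a problem where expert 0 is best for the first half of the horizon and expert 1 for the second, the static oracle earns only half of what the dynamic oracle earns, so matching the static oracle says little about performance under non-stationarity.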

Our non-stationary MAB with expert advice algorithm REXP4 minimises the dynamic regret in Eq. 3. As shown in Algorithm 1, it trades off the need to remember against the need to forget by breaking the task into batches and applying EXP4 [18] to each batch. As the reward distribution changes, it adapts by re-estimating each expert's reward distribution in each batch. We show a worst-case bound on the regret between this REXP4 procedure and the dynamic oracle.

III-B Regret Bound for REXP4

The regret bound for REXP4 is stated in the following theorem. The theorem is proved by following the proof structure of [26] and replacing the per-arm expected reward term in [26] with the per-expert expected reward term used in our paper.

Theorem 1.

Let $\pi$ be the REXP4 policy with a batch size $\Delta_T = \lceil (m \ln m)^{1/3} (T/V_T)^{2/3} \rceil$ and $\gamma = \min\{1, \sqrt{\frac{m \ln m}{(e-1)\Delta_T}}\}$. Then, there is some constant $\bar{C}$ such that for every $T \ge 1$, $m \ge 2$ and $V_T \in [m^{-1}, m^{-1}T]$:

$R^{d}_{T}(\pi) \le \bar{C} \, (m \ln m)^{1/3} \, V_T^{1/3} \, T^{2/3}$     (4)

where $m = \min\{M, K\}$ indicates the smaller of the number of experts $M$ or arms $K$.

The result is an upper bound on the regret between our REXP4 policy and the dynamic oracle. As the bound scales with $m = \min\{M, K\}$, it is favourable if either the number of experts or the number of arms is small. This also means it is relatively robust to many arms (as in AL, where arms = data points). If $V_T$ is sub-linear in $T$ (the total variation in reward grows more slowly than the number of timesteps), then the average per-step performance converges to that of the dynamic oracle.
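The tuning in Theorem 1 can be computed directly; the sketch below follows the [26]-style recipe described above (symbol and function names are ours):

```python
import math

def rexp4_params(T, V_T, m):
    """Batch size and mixing rate from the Theorem-1 style tuning.

    Delta_T ~ (m ln m)^(1/3) * (T / V_T)^(2/3); gamma as in the EXP4 subroutine.
    """
    delta = math.ceil((m * math.log(m)) ** (1 / 3) * (T / V_T) ** (2 / 3))
    gamma = min(1.0, math.sqrt(m * math.log(m) / ((math.e - 1) * delta)))
    return delta, gamma
```

Intuitively, a larger variation budget (faster-changing environment) yields a smaller batch size, i.e., more frequent forgetting.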


Inputs: a parameter $\gamma \in (0, 1]$ and an epoch size $\Delta_T$

  1. Set epoch index $j = 1$

  2. Repeat while $j \le \lceil T / \Delta_T \rceil$:

    • Set $\tau = (j - 1)\Delta_T$

    • Initialisation: for any expert $k$ set weight $w_k = 1$

    • Repeat for $t = \tau + 1, \dots, \min\{T, \tau + \Delta_T\}$: call the EXP4 algorithm [18]

    • Set $j = j + 1$ and return to the beginning of step 2

Algorithm 1 Pseudocode of algorithm REXP4
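A runnable sketch of the restart scheme: an EXP4-style inner update wrapped in periodic weight resets (a schematic of REXP4 under our reading of the algorithm, with hypothetical `advice_fn`/`reward_fn` interfaces):

```python
import numpy as np

def rexp4(advice_fn, reward_fn, T, delta, gamma, M, K, seed=0):
    """REXP4 sketch: run EXP4, resetting expert weights every `delta` steps.

    advice_fn(t) -> (M, K) advice matrix; reward_fn(t, arm) -> reward in [0, 1].
    Returns the total reward collected over T steps.
    """
    rng = np.random.default_rng(seed)
    w = np.ones(M)
    total = 0.0
    for t in range(T):
        if t % delta == 0:
            w = np.ones(M)                 # forget: restart the epoch
        advice = advice_fn(t)
        p = (1 - gamma) * (w / w.sum()) @ advice + gamma / K
        arm = int(rng.choice(K, p=p / p.sum()))
        r = reward_fn(t, arm)
        total += r
        x_hat = np.zeros(K)
        x_hat[arm] = r / p[arm]            # importance-weighted arm estimate
        w = w * np.exp(gamma * (advice @ x_hat) / K)
    return total
```

On a toy problem where the best expert switches halfway through the horizon, the restarts let the learner re-concentrate on the newly best expert rather than remaining anchored to the stale one.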
Inputs: $\gamma$, initial weights $w_k = 1$ for $k = 1, \dots, M$, temperature $\lambda$, batch size $\Delta_T$, labelled set $\mathcal{L}$, unlabelled set $\mathcal{U}$, initial classifier $f_0$
for $t = 1, \dots, T$ do
  1. Get the score of each instance $x_i \in \mathcal{U}$ from criteria $k = 1, \dots, M$

  2. Normalise the score vectors by exponential ranking normalisation

  3. Obtain the advice vectors $\xi^{k}_{t}$ with $\xi^{k}_{t}(i) \propto \exp(-\lambda \cdot \mathrm{rank}_k(i))$

  4. Set $W_t = \sum_{k} w_k$ and for $i = 1, \dots, |\mathcal{U}|$ set $p_t(i) = (1 - \gamma) \sum_{k} w_k \, \xi^{k}_{t}(i) / W_t + \gamma / |\mathcal{U}|$

  5. Query the label of an instance $x_{i_t}$ drawn randomly from $\mathcal{U}$ according to probability $p_t$

  6. Move the instance $x_{i_t}$ from $\mathcal{U}$ to $\mathcal{L}$

  7. Retrain the classifier and receive reward $r_t$

  8. For each $i$ set $\hat{r}_t(i) = r_t / p_t(i)$ if $i = i_t$, and $\hat{r}_t(i) = 0$ otherwise

  9. For each $k$ set $\hat{y}_t(k) = \xi^{k}_{t} \cdot \hat{r}_t$ and $w_k \leftarrow w_k \exp(\gamma \hat{y}_t(k) / |\mathcal{U}|)$

     if $t \bmod \Delta_T = 0$ then reset $w_k = 1$ for all $k$
     end if
end for
Algorithm 2 DEAL: Dynamic Ensemble Active Learning

III-C Dynamic Ensemble Active Learning

Fig. 2: Illustration of the DEAL system. Light blue: taking the unlabelled set as input, each expert outputs a score that is normalised before input to the DEAL active learner; $s_k(i)$ denotes the $k$th criterion's score of the $i$th instance. Orange: the active learner makes a decision. Green: updating the labelled set, unlabelled set, and the classifier. Light yellow: the restart detection scheme. Ensemble weights are then updated differently between restarts (light red) and at restarts (dark red).

Based on our REXP4 algorithm for MAB with expert advice, we present DEAL-REXP4 (Dynamic Ensemble Active Learning), which updates both the base learner and the active criteria weights iteratively. More specifically, each ensemble criterion predicts scores for all unlabelled instances. To avoid the issue of different criterion score scales, we use exponential ranking normalisation, applying the Gibbs measure $\xi^{k}_{t}(i) \propto \exp(-\lambda \cdot \mathrm{rank}_k(i))$, where the temperature parameter $\lambda$ controls the sharpness of the distribution. Here $\mathrm{rank}_k(i)$ denotes the ranking position of the $i$th instance's score, where the ranking order is determined by the criterion's strategy. For example, the entropy criterion prefers points with maximum entropy, so the maximum-entropy point has rank 1. Similarly, the minimum-margin criterion prefers points with low distance to the margin, so the minimum-distance point has rank 1. Based on the current suggestions from the criteria members, the active learning ensemble selects an instance for label querying. Then the base learner is updated with the newly labelled data, and the active learner is subsequently updated based on the performance improvement of the updated base learner. To track the non-stationary reward distribution, our REXP4 algorithm learns the weights of the active learning criteria in an online, adaptive way via the restart scheme: a restart is activated whenever $t \bmod \Delta_T = 0$; otherwise updates follow the EXP4 rule. The details are described in Algorithm 2 with an illustration in Fig. 2.
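The exponential ranking normalisation described above can be sketched as follows (`lam` plays the role of the temperature parameter; an illustrative sketch, not the authors' code):

```python
import numpy as np

def advice_from_scores(scores, lam=0.1, descending=True):
    """Turn raw criterion scores into a probability vector over instances.

    Exponential ranking normalisation: the criterion's preferred instance gets
    rank 1 and probability proportional to exp(-lam * rank), a Gibbs measure
    whose sharpness is controlled by lam.
    """
    order = np.argsort(-scores if descending else scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)  # best instance -> rank 1
    p = np.exp(-lam * ranks)
    return p / p.sum()
```

Working on ranks rather than raw values makes advice vectors comparable across criteria with arbitrarily different score scales, which is what lets a single bandit learner weigh them against each other.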

In DEAL-REXP4 we set the reward as the resulting accuracy after a classifier update. Thus, in the context of active learning, the bound given in Eq. 4 means that the total area under the reward curve obtained by DEAL-REXP4 is within a bound of the best-case scenario that would occur only if we had known the best criterion to use at each iteration. Moreover, if the variation budget $V_T$ grows sub-linearly with $T$, DEAL-REXP4 converges towards this best-expert-per-iteration upper bound.

TABLE I: Summary of Active Learning Algorithms

Single Criterion
  Algorithm | Motivation | Stationarity | Importance of Criterion | Ensemble Members | Property
  US [1, 30, 31] | Query the least confident instance | Stationary | Fixed | US | Static
  RS [32] | Query a cluster within the margin | Stationary | Fixed | RS | Static
  DE [16] | Query the major cluster | Stationary | Fixed | DE | Static

Multiple Criteria
  Algorithm | Motivation | Stationarity | Importance of Criterion | Ensemble Members | Property
  QUIRE [11] | Combine informativeness and representativeness | Stationary | Equal effect | QUIRE | Static
  BMDR [12] | Combine discriminativeness and representativeness | Stationary | Equal effect | BMDR | Static
  LAL [33] | Combine multiple motivations | Stationary | Equal effect | Any criteria | Static
  DUAL [16] | Switch from DE to US once | Non-stationary | Varying | US, DE | Dynamic
  ALGD [17] | Switch between DE and US | Non-stationary | Varying | US, DE | Dynamic

Bandit Ensemble Algorithms
  Algorithm | Bandit | Stationarity | Importance of Criterion | Ensemble Members | Property
  COMB [14] | EXP4 [18] | Stationary | Single best | Any criteria | Static
  ALBL [15] | EXP4.P [19] | Stationary | Single best | Any criteria | Static
  LSA [24] | LinUCB [23] | Stationary | Single best combination | Any criteria | Static
  DEAL (ours) | REXP4 | Non-stationary | Dynamic best | Any criteria | Dynamic

III-D Discussion of Static and Dynamic Active Learning

We divide active learning algorithms into static and dynamic according to whether they make a stationary or non-stationary assumption about the importance of each criterion over different time periods.

Static Active Learning Single-criterion algorithms are all static, since they solve active learning with only one criterion. Regarding active learning algorithms with multiple motivations: if they are formalised as a single fixed mixture of criteria, they are also static. Since the coefficients of the different motivations are fixed over all time steps, they assume that a single weighted combination is suitable at any learning stage. For example, Querying Informative and Representative Examples (QUIRE) [11], Learning Active Learning (LAL) [33], and Discriminative and Representative Queries for Batch Mode Active Learning (BMDR) [12] are static active learning algorithms with multiple motivations.

Previously proposed ensemble algorithms ALBL [15], COMB [14], and Linear Strategy Aggregation (LSA) [24] are also static in the sense that, although the weight proportion of their ensemble members changes as data is gathered, their underlying bandit learner is a stationary one, assuming there is only one best expert or best linear combination over all time.

Dynamic Active Learning In our dynamic active learning research question, we avoid a stationarity assumption on criteria importance over time. A non-stationary algorithm should adapt its weighting proportions over time in response to learning progress. Prior attempts propose heuristics for criterion switching or reweighting between density and uncertainty sampling [16, 17]. Our DEAL-REXP4 improves on these in that it can use an arbitrary number of criteria of any type, beyond two specified criteria; and in contrast to prior heuristics, it contains a principled underlying learner with theoretical guarantees. We provide a summary of related prior active learning algorithms in Table I, where the generality and strong notion of regret in DEAL-REXP4 are clear.

Fig. 3: Comparison of DEAL-REXP4 versus individual ensemble members on (a) fourclass and (b) ILPD.

Fig. 4: Comparison of active learning with our DEAL-REXP4 versus alternative state-of-the-art bandit algorithms on (a) fourclass, (b) german, (c) ILPD, and (d) letter.

IV Experiments and Results

TABLE II: Win/Tie/Loss counts of DEAL-REXP4 versus ensemble members in terms of AUC at specified learning stages.

TABLE III: Win/Tie/Loss counts of DEAL-REXP4 and state-of-the-art alternatives at specified learning stages, on non-stationary and stationary datasets.

To evaluate our algorithm, we use 13 datasets from the UCI (https://archive.ics.uci.edu/ml/datasets.html) and LibSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) repositories. These datasets are selected following previous relevant papers [24, 15, 11, 6]. We use a linear SVM [34] as the base learner. If a dataset does not include a pre-defined training/testing split, we randomly sample a training split and use the rest for testing. In each trial, we start with 1 randomly labelled point per class. Each experiment is repeated 200 times and the average testing accuracy is reported.

Criteria Ensemble: The ensemble of query criteria includes: US: picking the max-entropy (min-margin) instance in binary-class datasets [1, 30], or the instance with minimum Best-versus-Second-Best (BvSB) margin [31] in multiclass datasets. RS: clustering the points near the margin [32], then scoring unlabelled points by their distances to the largest centroid. Distance-Furthest-First (DFF): focuses on exploration by selecting the unlabelled instance furthest from its nearest labelled instance [35]; we use DFF to replace RS in multiclass datasets, since RS was originally designed for binary-class datasets. Both are motivated by exploring the dataset, but DFF does not depend on binary classifiers. Density Estimation (DE): picking the instance with maximum density under a GMM with 20 diagonal-covariance components [16]. RAND: randomly selecting points, which can be hard to beat on datasets unsuited to a given criterion. Moreover, including a random expert (for exploration) is necessary to guarantee the performance of the EXP4 subroutine [18, 19].
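For concreteness, two of the ensemble's scoring rules can be sketched as follows (illustrative implementations of entropy-based US and DFF under our reading of their descriptions, not the authors' code):

```python
import numpy as np

def us_entropy(probs):
    """Uncertainty sampling score: predictive entropy per unlabelled instance.

    probs: (n, C) class-probability matrix; higher entropy = more uncertain.
    """
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def dff(X_unlabelled, X_labelled):
    """Distance-Furthest-First score: distance from each unlabelled point to
    its nearest labelled point; the furthest such point is preferred."""
    d = np.linalg.norm(
        X_unlabelled[:, None, :] - X_labelled[None, :, :], axis=2)
    return d.min(axis=1)
```

Each returns one score per unlabelled point, so both plug directly into the rank-based normalisation that produces the experts' advice vectors.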

Competitors: We compare our method to ALBL [15], COMB [14], and DUAL [16]. For COMB, we follow their recommended settings with the CEM reward. For ALBL, we use their settings with the importance-weighted accuracy reward.

For direct comparison, ALBL, COMB and REXP4 use the same ensemble of criteria described above. DUAL is engineered for a specific pair of criteria, so we apply its original version using Uncertainty Sampling and Density-Weighted Uncertainty Sampling. It is also only defined for binary classification problems unlike the others.

DEAL-REXP4 Settings: For the reward, we follow [15, 24] in using IWA for unbiased estimation of test accuracy. To produce probabilistic preferences over points from all AL criteria, we use exponential ranking normalisation with a Gibbs measure of temperature $\lambda$. We use a fixed batch size $\Delta_T$ throughout. The choice is based on observing the typical coarse duration of performance gaps among different criteria; for example, RS wins the first 20 iterations in Fig. 3(b). The reason for parameterising in terms of $\Delta_T$ rather than $V_T$ is that $\Delta_T$ has an intuitive meaning in the AL context (batch size), yet implies a corresponding variation budget $V_T$ for any given $T$ (Theorem 1).

Characterising dataset (non)stationarity: We first investigate each dataset to characterise its (non)stationarity. We use our DEAL trajectory, and use an oracle to measure the wins of each criterion at each batch in terms of performance increase. A dataset with a stationary reward distribution would tend to have a consistent winner, and vice-versa. Although (non)stationarity is a continuum, we describe a dataset as non-stationary if at least two criteria have a fraction of wins above a threshold, and as stationary otherwise.
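The win-fraction bookkeeping behind this characterisation can be sketched as follows (the threshold value is an assumption for illustration; the paper's exact threshold is not given here):

```python
import numpy as np

def win_fractions(batch_rewards):
    """Fraction of batches in which each criterion attains the highest reward.

    batch_rewards: (n_batches, M) average oracle reward per criterion per batch.
    """
    winners = np.argmax(batch_rewards, axis=1)
    return np.bincount(winners, minlength=batch_rewards.shape[1]) / len(winners)

def criteria_above(batch_rewards, eta=0.4):
    """Number of criteria whose win fraction exceeds the threshold eta; the
    paper uses such a count to separate stationary from non-stationary data."""
    return int((win_fractions(batch_rewards) > eta).sum())
```

A dataset with one dominant criterion yields a count of 1 (a consistent winner), while a dataset whose best criterion changes over the session yields a count of 2 or more.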

DEAL versus Individual Criteria Examples comparing the performance of DEAL and individual criteria in the ensemble are shown in Fig. 3. No single criterion works best for all datasets; moreover, different criteria are effective at different stages of learning. While DEAL is not best across all datasets and all time steps (this would require the actual dynamic oracle upper bound), it performs well overall. This is summarised quantitatively across all 13 datasets in Tab. II. Each method's performance is evaluated by the area under the learning curve at different proportions of added instances. The results show the number of wins/ties/losses of DEAL versus the ensemble member of each specified rank, according to a two-sided t-test. This shows, for example, that DEAL often ties with the top-ranked ensemble member (30 draws vs. the 1st rank), is usually at least as good as the second-ranked member (50 wins and 45 ties vs. only 35 losses), and is never the worst (0 losses vs. the 4th rank).

Comparison vs State-of-the-Art We compare our DEAL-REXP4 with state-of-the-art alternatives for tuning an AL ensemble. Sometimes DUAL performs well, but it is highly variable depending on whether the criterion-switch heuristic makes a good choice or not, as seen in Fig. 4. Tab. III summarises the results across all datasets in terms of AUC wins/draws/losses of each approach against the alternatives. DUAL has a lower row total as it is defined for binary problems only, so it is not evaluated on the wine and letter datasets. The main observation is that DEAL outperforms the alternatives, particularly on non-stationary datasets. On stationary datasets we are only slightly worse than ALBL. This is expected, as REXP4 performs forgetting in order to adapt to changes in expert efficacy, meaning that we cannot exploit the best criterion as aggressively as ALBL's EXP4.P MAB learner. Nevertheless, overall DEAL is fairly robust on stationary datasets (a small margin behind ALBL), while ALBL is not robust on non-stationary datasets (a larger margin behind DEAL).

V Conclusion

We proposed REXP4, a non-stationary multi-armed bandit with expert advice algorithm, and demonstrated its application to online learning of a criterion ensemble in active learning. The theoretical results provide bounds on REXP4's optimality. The empirical results show that active learning with DEAL-REXP4 tends to perform near the best criterion in the ensemble. It performs comparably to state-of-the-art alternative ensembles on stationary datasets, and outperforms them on non-stationary datasets.

