
GALAXY: Graph-based Active Learning at the Extreme

by   Jifan Zhang, et al.

Active learning is a label-efficient approach to train highly effective models while interactively selecting only small subsets of unlabelled data for labelling and training. In "open world" settings, the classes of interest can make up a small fraction of the overall dataset – most of the data may be viewed as an out-of-distribution or irrelevant class. This leads to extreme class-imbalance, and our theory and methods focus on this core issue. We propose a new strategy for active learning called GALAXY (Graph-based Active Learning At the eXtrEme), which blends ideas from graph-based active learning and deep learning. GALAXY automatically and adaptively selects more class-balanced examples for labeling than most other methods for active learning. Our theory shows that GALAXY performs a refined form of uncertainty sampling that gathers a much more class-balanced dataset than vanilla uncertainty sampling. Experimentally, we demonstrate GALAXY's superiority over existing state-of-art deep active learning algorithms in unbalanced vision classification settings generated from popular datasets.





1 Introduction

Training deep learning systems can require enormous amounts of labeled data. Active learning aims to reduce this burden by sequentially and adaptively selecting examples for labeling, with the goal of obtaining a relatively small dataset of especially informative examples. The most common approach to active learning is uncertainty sampling. The idea is to train a model based on an initial set of labeled data and then to select unlabeled examples that the model cannot classify with certainty. These examples are then labeled, the model is re-trained using them, and the process is repeated. Uncertainty sampling and its variants can work well when the classes are balanced. However, in many applications datasets may be very unbalanced, containing very rare classes or one very large majority class. As an example, suppose an insurance company would like to train an image-based machine learning system to classify various types of damage to the roofs of buildings (Conathan et al., 2018). It has a large corpus of unlabeled roof images, but the vast majority contain no damage of any sort.

Unfortunately, under extreme class imbalance, uncertainty sampling tends to select examples mostly from the dominant class(es), often leading to very slow learning. In this paper, we take a novel approach specifically targeting the class imbalance problem. Our method is guaranteed to select examples that are both uncertain and class-diverse; i.e., the selected examples are relatively balanced across the classes even if the overall dataset is extremely unbalanced. In a nutshell, our algorithm sorts the examples by their softmax uncertainty scores and applies a bisection procedure to find consecutive pairs of points with differing labels. This procedure encourages finding uncertain points from a diverse set of classes. In contrast, uncertainty sampling focuses on sampling around the model’s decision boundary and therefore will collect a biased sample if this model decision boundary is strongly skewed towards one class. Figure 1 displays the results of one of our experiments, showing that our proposed GALAXY algorithm learns much more rapidly and collects a significantly more diverse dataset than uncertainty sampling.

We make the following contributions in this paper:

  • we develop a novel, scalable algorithm GALAXY, tailored to the extreme class imbalance setting, which is frequently encountered in practice,

  • GALAXY is easy to implement, requiring relatively simple modifications to commonplace uncertainty sampling approaches,

  • we conduct extensive experiments showing that GALAXY outperforms a wide collection of deep active learning algorithms in the imbalanced settings, and

  • we give a theoretical analysis showing that GALAXY selects much more class-balanced batches of uncertain examples than traditional uncertainty sampling strategies.

Figure 1: These plots depict results on a modified version of CIFAR100 with a class imbalance of . Left: The plot displays the balanced accuracy of the methods where the per-class accuracy is weighted by the class size. Right: The plot displays the percentage of labels queried from the minority class.

2 Related Work

Deep Active Learning: There are two main algorithmic approaches in deep active learning: uncertainty and diversity sampling. Uncertainty sampling queries the unlabeled examples that are most uncertain. Often, uncertainty is quantified by distance to the decision boundary of the current model (e.g., Tong and Koller, 2001; Kremer et al., 2014; Balcan et al., 2009). Several variants of uncertainty sampling have been proposed for deep learning (e.g., Gal et al., 2017; Ducoffe and Precioso, 2018; Beluch et al., 2018).

In a batch setting, uncertainty sampling often leads to querying a set of very similar examples, giving much redundant information. To deal with this issue, diversity sampling queries a batch of diverse examples that are representative of the unlabeled pool. Sener and Savarese (2017) propose a Coreset approach for diversity sampling for deep learning. Others include (Gissin and Shalev-Shwartz, 2019; Geifman and El-Yaniv, 2017). However, under the class imbalance scenarios where collecting minority class examples is crucial, previous work by Coleman et al. (2020) has shown such methods to be less effective as they are expected to collect a subset with equal imbalance to the original dataset.

Recently, significant attention has been given to designing hybrid methods that query a batch of informative and diverse examples. Ash et al. (2019) balances uncertainty and diversity by representing each example by its last layer gradient and aiming to select a batch of examples with large Gram determinant. Citovsky et al. (2021) uses hierarchical agglomerative clustering to cluster the examples in the feature space and then cycles over the clusters querying the examples with smallest margin. Finally, Ash et al. (2021) uses experimental design to find a batch of diverse and uncertain examples.

Class Imbalance Deep Active Learning: A number of recent works have studied active learning in the presence of class imbalance. Coleman et al. (2020) proposes SEALS, a method for the setting of class imbalance and an enormous pool of unlabeled examples. Kothawade et al. (2021) proposes SIMILAR, which picks examples that are most similar to the collected in-distribution examples and most different from the known out-of-distribution examples. Their method achieves this by maximizing the conditional mutual information. Our setting is closest to their out-of-distribution imbalance scenario. Finally, Emam et al. (2021) tackles the class imbalance issue by proposing BASE, which queries the same number of examples for each predicted class based on margin from the decision boundary, where the margin is defined by the distance to the model boundary in the feature space of the neural network.

In contrast to the above methods, our method searches adaptively in the output space within each batch for the best threshold separating two classes and provably produces a class-balanced set of labeled examples. Adaptively searching for the best threshold is especially helpful in the extreme class imbalance setting where the decision boundary of the model is often skewed. If all labels were known, it may theoretically be possible to modify the training algorithm to obtain a model without any skew towards one class. However, this is not possible in active learning where we do not know the labels a priori. In addition, it is expensive and undesirable in practical cases to modify the training algorithm (Roh et al., 2020), which makes methods that work with any off-the-shelf training algorithm (like ours) attractive.

Graph-based active learning: There have been a number of proposed graph-based adaptive learning algorithms (e.g., Zhu et al., 2003a, b; Cesa-Bianchi et al., 2013; Gu and Han, 2012; Dasarathy et al., 2015). Our work is most closely related to Dasarathy et al. (2015), which proposed $S^2$, a graph-based active learning algorithm with strong theoretical guarantees. While $S^2$ assumes a graph as an input and will perform badly on difficult graphs, our work builds a framework that combines the ideas of $S^2$ with deep learning to perform active learning while continually improving the graph. We review this work in more detail in Section 4.

3 Problem Statement and Notation

We investigate the pool-based batched active learning setting, where the learner has access to a large pool of unlabeled data examples $X = \{x_1, \dots, x_N\}$ and there is an unknown ground truth label function $f^\star : X \to \{1, \dots, K\}$ giving the label of each example. At each iteration, the learner selects a small batch of $B$ unlabeled examples from the pool, observes their labels, and adds the examples to $L$, the set of all the examples that have been labeled so far. After each batch of examples is queried, the learner updates the deep learning model by training on all of the examples in $L$ and uses this model to inform which batch of examples is selected in the next round.

We are particularly interested in the extreme class imbalance problem where one class is significantly larger than the other classes. Mathematically,

$$\epsilon = \frac{\sum_{k=1}^{K-1} N_k}{N_K},$$

where $N_k$ denotes the number of examples that belong to the $k$-th class. Here, $\epsilon$ is some small class imbalance factor and each of the in-distribution classes $1$, …, $K-1$ contains far fewer examples than the out-of-distribution $K$-th class. This models scenarios, for example in roof damage classification, self-driving and medical applications, in which a small fraction of the unlabeled examples are from classes of interest and the rest of the examples may be lumped into an “other” or “irrelevant” category.

Henceforth, we denote to be a model trained on the labeled set , where is a training algorithm that trains on a labeled set and outputs a classifier.

4 Review of Graph-based Active Learning

Figure 2: All three graphs contain the same eight numbered examples but connected in different ways. The ground truth binary labels are represented by the black and white coloring of the examples. As a result, each of their cut boundaries are: (a) , (b) and (c) .

To begin, we introduce some notation. With slight abuse of notation, we define an undirected graph over the pool as $G = (X, E)$ with vertex set $X$ and edge set $E$, where each node in the graph is also an example in the pool $X$. Let $P_{xy}$ denote the shortest path connecting $x$ and $y$ in the graph $G$, and let $|P_{xy}|$ denote its length. (In the special case when $x$ and $y$ are not connected, we define $|P_{xy}| = \infty$.)

Dasarathy et al. (2015) proposed a graph-based active learning algorithm, $S^2$ (see Algorithm 1), that aims to identify all the cuts, namely every edge that connects an oppositely labeled pair of examples. In particular, if one labeled all of the examples in the cut boundary, one would be able to classify every example in the pool correctly. As an example, take the linear graph in Figure 2(a) where each node represents a numbered example and is associated with a binary label (black/white). It is thus necessary to query at least seven examples to identify the cut boundary and therefore the labeling.

$S^2$ performs an alternating two-phase procedure. First, if every pair of connected examples has the same label, $S^2$ queries an unlabeled example uniformly at random. Second, whenever there exist paths connecting examples with different labels, the algorithm bisects along the shortest among these paths until it identifies a cut. The algorithm then removes the identified edge from the graph.

Dasarathy et al. (2015) have shown that the PAC sample complexity to identify all cuts depends heavily on the input graph’s structural properties. As an example, consider Figure 2, which depicts several graphs on the same set of examples. In graph (a), one needs to query at least seven examples, while in graph (b) one need only query two examples. Indeed, such a difference can be made arbitrarily large. The work of Dasarathy et al. (2015), however, did not address the major problem of how to obtain an “easier” graph for active learning.

  Input: Graph , total budget
  Initialize: Labeled set where are uniform random samples from
  for  do
     if  then
        Query the mid point of
     end if
     Update labeled set:
     Remove cuts from current graph:
  end for
  Return: Labeled set
Algorithm 1 : Shortest Shortest Path
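
The two-phase procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the reference implementation: the graph is assumed to be an adjacency dictionary over integer node ids, `oracle` is a hypothetical labeling function, the initial labeled nodes are passed in explicitly (rather than sampled) so that the run is reproducible, and the "mid point" is taken to be the middle node of the path found by breadth-first search.

```python
import random
from collections import deque


def shortest_cut_path(adj, labels):
    """Shortest path joining two oppositely labeled nodes, or None if none exists."""
    best = None
    for src, src_label in labels.items():
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if u in labels and labels[u] != src_label:
                path = []
                while u is not None:  # walk back to src
                    path.append(u)
                    u = parent[u]
                if best is None or len(path) < len(best):
                    best = path
                break
            for v in sorted(adj[u]):
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
    return best


def remove_cuts(adj, labels, node):
    """Delete every edge between `node` and an oppositely labeled neighbor."""
    for v in list(adj[node]):
        if v in labels and labels[v] != labels[node]:
            adj[node].discard(v)
            adj[v].discard(node)


def s2(adj, oracle, budget, initial, seed=0):
    """Sketch of S2: bisect shortest paths, else query uniformly at random."""
    rng = random.Random(seed)
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    labels = {u: oracle(u) for u in initial}
    for u in initial:
        remove_cuts(adj, labels, u)
    while len(labels) < min(budget, len(adj)):
        path = shortest_cut_path(adj, labels)
        if path is None:
            query = rng.choice(sorted(u for u in adj if u not in labels))
        else:
            query = path[len(path) // 2]  # mid point of the shortest shortest path
        labels[query] = oracle(query)
        remove_cuts(adj, labels, query)
    return labels


# Linear graph on 16 nodes with a single cut between nodes 4 and 5.
line = {i: {j for j in (i - 1, i + 1) if 0 <= j < 16} for i in range(16)}
found = s2(line, oracle=lambda u: int(u >= 5), budget=6, initial=[0, 15])
```

On this toy linear graph, starting from the two end points, the sketch spends its four remaining queries bisecting toward the cut, so both end points of the cut edge end up labeled.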

5 Galaxy

Our algorithm GALAXY, shown in Algorithm 4, blends graph-based active learning and deep active learning through the following two alternating steps:

  • Given a trained neural network, we construct a graph and apply a modified version of $S^2$ to it, efficiently collecting an informative batch of labels (Algorithm 4).

  • Given a new batch of labeled examples, we train a better neural network model that will be used to construct a better graph for active learning (Algorithm 2).

To construct graphs from a learned neural network in multi-class settings, we take a one-vs-all approach on the output (softmax) space as shown in Algorithm 2. For each class $k$, we build a linear graph $G^{(k)}$ by ranking the model’s confidence margin $\delta^{(k)}_i$ on each example $x_i$. For a neural network $f_\theta$, the confidence margin is simply defined as $\delta^{(k)}_i = \max_{k'} [f_\theta(x_i)]_{k'} - [f_\theta(x_i)]_k$, where $[\cdot]_k$ denotes the $k$-th element of the softmax vector. We break ties by the confidence scores $[f_\theta(x_i)]_k$ themselves (equivalently by $\max_{k'} [f_\theta(x_i)]_{k'}$). Intuitively, for each graph $G^{(k)}$, we sort examples according to their likelihood to belong to class $k$. Indeed, when $f_\theta$ is a perfect classifier on the pool, each linear graph constructed behaves like Figure 2(b), i.e., every example in class $k$ is perfectly separated from all other classes with only one cut in between.

  Input: Pool , neural network
  Confidence for each :
  for  do
     Compute margins
     Sort by margin and break ties by confidence: is a permutation of such that
     Connect edges
  end for
  Return: Graphs , rankings
Algorithm 2 Build Graph
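
To make the graph construction concrete, here is a minimal Python sketch of the one-vs-all ranking step. It rests on assumptions that the extracted text does not fully confirm: the margin for class k is taken to be the maximum softmax score minus the class-k score, examples are sorted by increasing margin, and ties are broken by decreasing class-k confidence. The function name `build_graphs` and the list-of-lists input format are illustrative choices only.

```python
def build_graphs(softmax):
    """Build one linear graph per class from softmax outputs.

    softmax: list of N probability vectors, each of length K.
    Returns (graphs, rankings), where rankings[k] is a permutation of
    range(N) and graphs[k] lists the edges between consecutive examples.
    """
    n = len(softmax)
    num_classes = len(softmax[0])
    graphs, rankings = [], []
    for k in range(num_classes):
        # Assumed margin: zero exactly when class k is the predicted class.
        margin = [max(p) - p[k] for p in softmax]
        # Sort by margin; break ties by higher class-k confidence first.
        order = sorted(range(n), key=lambda i: (margin[i], -softmax[i][k]))
        edges = [(order[j], order[j + 1]) for j in range(n - 1)]
        rankings.append(order)
        graphs.append(edges)
    return graphs, rankings


scores = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
graphs, rankings = build_graphs(scores)
```

In this toy example, the three examples predicted as class 0 all have zero margin in the class-0 graph, so the tie-break by confidence alone decides their order.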

Our algorithm GALAXY, shown in Algorithm 4, proceeds in a batched style. For each batch, GALAXY first trains a neural network to obtain graphs constructed by the procedure described above. It then performs $S^2$-style bisection-like queries on all of the graphs but with two major differences.

  • To accommodate multiple graphs, we treat each linear graph as a binary one-vs-all graph, where we gather all shortest paths that connect a queried example in class $k$ and a queried example in any other class. If such shortest paths exist, we then find the shortest of these shortest paths across all $k$ and bisect the resulting shortest shortest path as in $S^2$.

  • When no such shortest path exists, instead of querying an example uniformly at random as in $S^2$, we increase the order of the graphs by Algorithm 3 and perform bisection procedures on the updated graphs. Here, we refer to an $m$-th order linear graph as one where each example is connected to all of its $m$ neighboring examples from each side. For example, Figure 2(c) shows a graph of order 2 as opposed to the order 1 graph shown in Figure 2(b). Intuitively, bisecting after the Connect procedure is equivalent to querying around the discovered cuts. For example, in the case of Figure 2(b), after querying the two examples on either side of the cut, our algorithm will connect second order edges and query exactly the two examples adjacent to them as the next two queries.

  Input: Graphs , rankings , edge order
  for  do
  end for
  Return: Graphs
Algorithm 3 Connect: build higher order edges
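
The body of Algorithm 3 did not survive extraction, so the following Python sketch is only one plausible reading of the Connect step based on the prose description of higher-order linear graphs: every example is linked to its `order` nearest neighbors on each side of the ranking. The function name `connect` is hypothetical, and the sketch deliberately ignores the bookkeeping of cut edges that were already removed.

```python
def connect(rankings, order):
    """Build linear graphs of the given order from per-class rankings.

    rankings: list of permutations, one per class.
    order: each example is linked to its `order` nearest neighbors
           on each side of the ranking.
    """
    graphs = []
    for ranking in rankings:
        edges = set()
        for j in range(len(ranking)):
            for step in range(1, order + 1):
                if j + step < len(ranking):
                    edges.add((ranking[j], ranking[j + step]))
        graphs.append(edges)
    return graphs


first_order = connect([[0, 3, 1, 2]], order=1)[0]
second_order = connect([[0, 3, 1, 2]], order=2)[0]
```

Raising the order only ever adds edges, which is why bisecting on the higher-order graph amounts to querying the next examples on either side of a cut that has already been found.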
  Input: Pool , neural network training algorithm , number of rounds , batch size ()
  Initialize: Uniformly sample elements without replacement from to form
  for  do
     Train neural network:
     Graph order:
     for  do
        Find shortest shortest path among all graphs:
        if  then
           Recompute by (1)
        end if
        Query the mid point of
        Update labeled set:
        Remove cuts for each :
     end for
  end for
  Return: Final classifier
Algorithm 4 GALAXY

6 Analysis

Figure 3: (a) and (b) denotes two different linear graphs generated from two different classifiers by ranking their corresponding confidence scores. The ground truth label of each example is represented by its border – solid blue lines for class ID and dotted red lines for class OOD. The linear graph in (a) is a separable graph where all examples in class ID are of low confidence scores while class OOD examples have higher confidence scores. By contrast, the linear graph in (b) is non-separable.

6.1 Galaxy at the Extreme

In this section, we analyze the behavior of GALAXY in the two-class setting and specifically when class OOD (out-of-distribution) has many more examples than class ID (in-distribution). In a binary separable case, we bound the expected batch balancedness of both the bisection procedure and GALAXY, whereas uncertainty sampling could fail to sample any ID examples at all. At the end we also show a noise tolerance guarantee that GALAXY will find the optimal uncertainty threshold with high probability. For proper indexing below, we let

and .

Reduction to single linear graph. Recall in Algorithm 2, we build a graph for each class by sorting the margin scores of that class on the pool. In the binary classification case, it is sufficient to consider one graph generated from sorting confidence scores. This follows due to the symmetry of the two graphs in the binary case.

Universal approximator and region of uncertainty. Since neural networks are universal approximators, we make the following assumption.

Assumption 6.1.

Given a labeled subset of , let be the neural network classifier trained on . We assume classifies every example in perfectly. Namely, .

Definition 6.2.

Let denote the labeled example in class ID with the highest confidence and denote the labeled example in class OOD with the lowest confidence. We then define all examples in between, i.e. , to be the region of uncertainty.

In practice, since the neural network model should be rather certain in its predictions on the labeled set, we expect the region of uncertainty to be relatively large. We show an example in Figure 3 where filled circles represent the labeled examples. The filled blue example on the left is and the filled red example on the right is . The region of uncertainty are then all of the examples in between.

In the following, we first derive our balancedness results in the separable case such as in Figure 3(a) and turn to noise tolerance analysis in the end. First in the separable case, we let denote the number of in-distribution examples and denote the number of out-of-distribution examples both in the region of uncertainty. First we analyze the bisection following procedure that adaptively finds the true uncertainty threshold (cut in the separable linear graph).

Definition 6.3.

Our bisection procedure works as follows when given region of uncertainty with examples.

  • Let represent the number of examples in the latest region of uncertainty, query the example based on the sorted uncertainty scores. Here, or with equal probability.

  • If observe ID, update the region of uncertainty to be examples ranked based on uncertainty scores. Recurse on the new region of uncertainty.

    Similarly, if observe OOD, update the region of uncertainty to be examples ranked based on uncertainty scores. Recurse on the new region of uncertainty.

  • Terminate once the region of uncertainty is empty ().
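
The bisection procedure can be illustrated with a short Python sketch. Several details are assumptions, since the exact expressions were lost from the definition above: the region of uncertainty is represented as a list of true labels sorted by confidence with all ID examples first (the separable case of Figure 3(a)), and the queried position is taken to be the floor or ceiling of (n + 1) / 2, chosen with equal probability. The function name `bisect_region` is hypothetical.

```python
import random


def bisect_region(region_labels, seed=0):
    """Bisect a separable region of uncertainty.

    region_labels: true labels ("ID" or "OOD") sorted by confidence,
                   with every "ID" example before every "OOD" example.
    Returns (queried, threshold): the queried (index, label) pairs and
    the index of the first OOD example.
    """
    rng = random.Random(seed)
    lo, hi = 0, len(region_labels)  # current region is [lo, hi)
    queried = []
    while lo < hi:
        n = hi - lo
        # Assumed mid point: floor or ceil of (n + 1) / 2, 1-indexed.
        m = rng.choice([(n + 1) // 2, (n + 2) // 2])
        idx = lo + m - 1
        label = region_labels[idx]
        queried.append((idx, label))
        if label == "ID":
            lo = idx + 1  # keep only the examples ranked after the query
        else:
            hi = idx      # keep only the examples ranked before the query
    return queried, lo


region = ["ID"] * 3 + ["OOD"] * 13
runs = [bisect_region(region, seed=s) for s in range(20)]
```

Whatever the coin flips, the procedure stops only after it has queried both the last ID example and the first OOD example, so a 16-example region with 3 ID examples is resolved in at most 5 queries.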

The exact number of labels collected from the ID and OOD classes depends on the specific numbers of examples in the region of uncertainty. We characterize the generic behavior of the bisection process with a simple probabilistic model, which yields the following theorem showing that the method tends to find balanced examples among both classes. Proofs of the following results appear in the Appendices.

Theorem 6.4 (Sample Balancedness of Bisection).

Assume for some and that the examples labeled in the first bisection steps are all from class OOD. At least examples remain in the region of uncertainty and suppose that . If we let and be the number of queries in each of the ID and OOD classes made by the bisection procedure described in Definition 6.3, we must have

where the expectations are with respect to the uniform distribution above.

The unbalancedness factor of the region of uncertainty is at most . When is large, we must have . Thus, the bisection procedure collects a batch that improves on the unbalanced factor exponentially.

Next, we characterize the balancedness of the full GALAXY algorithm. When running GALAXY on a separable linear graph, it is equivalent with first running bisection procedure to find the optimal uncertainty threshold, followed by querying around the two sides of the threshold equally. We therefore incorporate our previous analysis on the bisection procedure and especially focus on the second part where one queries around the optimal uncertainty threshold.

Corollary 6.5 (Sample Balancedness of Batched Galaxy, Proof in Appendix B).

Assume and are under same noiseless setting as in Theorem 6.4. If GALAXY takes an additional queries after the bisection procedure terminates, so that examples are labeled in total and if we let and be the number of queries in each class made by GALAXY, we must have

where and the expectations are with respect to the uniform distribution in .

In the above theorem, since when is large, we can then recover a constant factor of balancedness. On the other hand, uncertainty sampling does not enjoy the same balancedness guarantees when the model decision boundary is biased towards the OOD class.

Proposition 6.6 (Sample Balancedness of Uncertainty Sampling).

Assume and are under same noiseless setting as in Theorem 6.4. If we let and be the number of queries in each of the ID and OOD classes made by an uncertainty sampling procedure with batch size steps, we have in the worst case

where the expectations are with respect to .

The above proposition can been seen as demonstrated by Figure 3, where when training a model under extreme imbalance, the model could be biased towards OOD and thus the true confidence threshold . Since , in the worst case, uncertainty sampling could have selected a batch all in OOD regardless of the value takes. Therefore, in such cases, we have .

We will now show GALAXY’s robustness in non-separable graphs. We model the noises by randomly flipping the true labels of a separable graph.

Theorem 6.7 (Noise Tolerance of Galaxy, Proof in Appendix C).

Let . Suppose the true label of each example in the region of uncertainty is corrupted independently with probability . Let denote the batch size of GALAXY, and be the number of queries in each class made by GALAXY, with probability at least we have

where the expectations are with respect to .

Note in practice batch size in active learning is usually small. When , the above result also implies that with about labels corrupted at random, GALAXY collects a balanced batch with probability at least .

6.2 Time Complexity

We compare the per-batch running time of GALAXY with confidence sampling, showing that they are comparable in practice. Recall that is the batch size, is the pool size and is the number of classes. Let denote the forward inference time of a neural network on a single example.

Confidence sampling has running time , where comes from forward passes on the entire pool, comes from computing the maximum confidence of each example and is the time complexity of choosing the top-B examples according to uncertainty scores. On the other hand, our algorithm GALAXY has time . Here, is the complexity of constructing linear graphs (Algorithm 2) by sorting through margin scores and comes from finding the shortest shortest path, for elements among graphs.

In practice, the cost of the forward passes on the pool dominates all of the other terms, making these running times comparable. Indeed, in all of our experiments conducted in Section 7.2, GALAXY is less than 5% slower when compared to confidence sampling.

7 Experiments

We conduct experiments under different class imbalance settings. These settings are generated from three image classification datasets with various class imbalance factors. If the classes are balanced in the dataset, then most active learning strategies (including GALAXY) perform similarly, with relatively small differences in performance, so we focus our presentation on unbalanced situations. We will first describe the setups (Section 7.1) before turning to the results in Section 7.2. Finally, we present a comparison with the vanilla $S^2$ algorithm and demonstrate the importance of reconstructing the graphs in Section 7.3.

7.1 Setup

We use the following metric and training algorithm to reweight each class by its number of examples. By doing this, we downweight significantly the large “other” class while not ignoring it completely. More formally, we state our metric and training objective below.

Metric: Given a fixed batch size and after iterations, let denote the labeled set after the final iteration. Let be a model trained on the labeled set. We wish to maximize the balanced accuracy over the pool

Recall that is the number of examples in class . In all of our experiments, we set and .
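
Because the displayed formula for the metric was lost in extraction, the sketch below uses the standard definition of balanced accuracy, namely the unweighted mean of the per-class accuracies, as an assumption about what the metric computes. The function name `balanced_accuracy` is an illustrative choice.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean over classes of the fraction of that class predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


# A classifier that always predicts the majority class 0.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

The toy example shows why this metric suits the imbalanced setting: always predicting the large class gives a plain accuracy of 0.8 but a balanced accuracy of only 0.5.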

Remark 7.1.

Finding good active classifiers on the pool is closely related to finding good classifiers that generalize. See Boucheron et al. (2005) for standard generalization bounds or Katz-Samuels et al. (2021) for a detailed discussion.

Training Algorithm: Our training algorithm takes a labeled set $L$ as input. Letting $N_k(L)$ denote the number of labeled examples in class $k$, we use a cross entropy loss weighted by $1/N_k(L)$ for each class $k$. Note that, unlike the evaluation metric, we do not directly reweight the classes by $1/N_k$, as the active learning algorithms only have knowledge of the labels of $L$ in practice. Furthermore, for all experiments, we use the ResNet-18 model in PyTorch pretrained on ImageNet for initialization and cold-start the training for every labeled set $L$. We use the Adam optimization algorithm with a fixed learning rate and a fixed number of epochs for each $L$.
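
As a rough illustration of the reweighting idea, the following framework-free Python sketch computes inverse-frequency class weights from the labeled set and applies them in a cross entropy sum. It assumes the weight for a class is the reciprocal of its labeled count and that the loss is an unnormalized weighted sum; both choices, along with the function names, are assumptions for illustration rather than the exact training objective.

```python
import math
from collections import Counter


def class_weights(labels, num_classes):
    """Weight each class by the inverse of its count in the labeled set."""
    counts = Counter(labels)
    return [1.0 / counts[k] if counts[k] else 0.0 for k in range(num_classes)]


def weighted_cross_entropy(probs, labels, weights):
    """Sum of -w[y] * log p[y] over the labeled examples."""
    return sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))


labeled = [0, 0, 0, 1]                 # three majority labels, one minority
weights = class_weights(labeled, 2)    # [1/3, 1.0]
uniform = [[0.5, 0.5]] * len(labeled)  # an uninformative model
loss = weighted_cross_entropy(uniform, labeled, weights)
```

With these weights each class contributes the same total amount to the loss regardless of how many labeled examples it has, which is the intended effect of downweighting the large "other" class.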


Figure 4: Performance of GALAXY against baselines on selected settings: (a) CIFAR-10, 3 classes; (b) CIFAR-100, 10 classes; (c) SVHN, 2 classes. Legend shown in (c) is shared across all three plots.

7.2 Results on Extremely Unbalanced Datasets

We generate the extremely unbalanced settings for both binary and multi-class classification from popular vision datasets CIFAR-10, CIFAR-100 and SVHN. CIFAR-10 and SVHN both initially have 10 balanced classes while CIFAR-100 has 100 balanced classes. We construct the large “other” class by grouping the majority of the original classes into one out-of-distribution class. Suppose there are originally $M$ ($M = 10$ or $100$) balanced classes in the original dataset. We form a $K$-class ($K < M$) extremely unbalanced dataset by reusing classes $1, \dots, K-1$ as in the original dataset, whereas class $K$ contains all examples in classes $K, \dots, M$ in the original dataset. Table 1 shows the detailed sizes of the extremely unbalanced datasets.
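
The relabeling described above is simple enough to sketch directly. The snippet below uses 0-indexed classes and hypothetical helper names; the imbalance factor is computed as the ratio of in-distribution examples to out-of-distribution examples, which is an assumption about the definition in Section 3.

```python
def make_unbalanced_labels(labels, num_kept):
    """Keep classes 0..num_kept-1 and merge all others into class num_kept."""
    return [y if y < num_kept else num_kept for y in labels]


def imbalance_factor(labels, other_class):
    """Assumed definition: in-distribution count over out-of-distribution count."""
    n_other = sum(1 for y in labels if y == other_class)
    n_in = len(labels) - n_other
    return n_in / n_other


# Toy stand-in for a balanced 10-class dataset with 5 examples per class.
original = list(range(10)) * 5
merged = make_unbalanced_labels(original, num_kept=2)  # 3 classes in total
factor = imbalance_factor(merged, other_class=2)
```

Here two original classes are kept and the remaining eight are merged, so 10 in-distribution examples sit against 40 out-of-distribution ones.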

Name # Classes
Table 1: Dataset details for each extremely unbalanced scenario. denotes the number of images in the out-of-distribution class while is the total number of images in all in-distribution classes. is the class imbalance factor defined in Section 3.

Comparison Algorithms: We compare our algorithm GALAXY against eight baselines. SIMILAR (Kothawade et al., 2021), Cluster Margin (Citovsky et al., 2021), BASE (Emam et al., 2021), BADGE (Ash et al., 2019) and BAIT (Ash et al., 2021) have all been described in Section 2. For SIMILAR, we use the FLQMI relaxation of the submodular mutual information (SMI). We are unable to compare to the FLCMI relaxation of the submodular conditional mutual information (SCMI) due to excessively high memory usage required by the submodular maximization at pool size . As demonstrated in Kothawade et al. (2021) however, one should expect only marginal improvement over FLQMI relaxation of the SMI. For Cluster Margin

we choose clustering hyperparameters so there are exactly

clusters. We choose margin batch size to be while the target batch size is set to .

In addition to the above methods, Confidence Sampling (Settles, 2009) is a type of uncertainty sampling that queries the least confident examples in terms of $\max_k [f_\theta(x)]_k$. Here, $f_\theta$ is a classifier that outputs softmax scores and the maximization is taken with respect to classes. Most Likely Positive (Jiang et al., 2018; Warmuth et al., 2001, 2003) is a heuristic often used in active search, where the algorithm selects the examples most likely to be in the in-distribution classes by its predictive probabilities. Lastly, Random is the naive uniform random strategy. For each setting, we average over individual runs for each of GALAXY, Cluster Margin, BASE, Confidence Sampling, Most Likely Positive and Random. Due to computational constraints, we are only able to have single runs for each of SIMILAR, BADGE and BAIT. For algorithms with multiple runs, the standard error is also plotted as the confidence intervals. To demonstrate the active gains more clearly, all of our curves are smoothed by a moving average.
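
Two of the simpler baselines can be written down in a few lines. The Python sketch below is an illustration under assumptions: softmax outputs are given as a list of probability vectors, Confidence Sampling picks the examples with the smallest maximum softmax score, and Most Likely Positive picks the examples with the largest total probability on the in-distribution classes. The function names are hypothetical.

```python
def confidence_sampling(softmax, unlabeled, batch_size):
    """Query the examples whose top softmax score is smallest."""
    ranked = sorted(unlabeled, key=lambda i: max(softmax[i]))
    return ranked[:batch_size]


def most_likely_positive(softmax, unlabeled, id_classes, batch_size):
    """Query the examples with the largest in-distribution probability."""
    def id_prob(i):
        return sum(softmax[i][k] for k in id_classes)
    ranked = sorted(unlabeled, key=id_prob, reverse=True)
    return ranked[:batch_size]


scores = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]
pool = [0, 1, 2, 3]
uncertain = confidence_sampling(scores, pool, batch_size=2)
positive = most_likely_positive(scores, pool, id_classes=[0], batch_size=2)
```

The two rules pull in opposite directions: the first returns the examples nearest the decision boundary, while the second returns the examples the model is already most sure about.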


As shown in Figure 4, to achieve any balanced accuracy, GALAXY outperforms all baselines in terms of the number of labels requested, saving up to queries in some cases when comparing to the second best method. For example in unbalanced SVHN with 2 classes, to achieve , GALAXY takes queries while the second best algorithm takes queries. In unbalanced CIFAR-100 with 3 classes, to reach accuracy, GALAX takes queries while the second best algorithm takes queries. As expected, Cluster Margin and BASE are competitive in many settings as they also target unbalanced settings. BAIT and BADGE tend to perform less well primarily due to their focus on collecting data-diverse examples, which has roughly the same class-imbalance as the pool. Full experimental results on all settings are presented in Appendix D.

Figure 5: Number of in-distribution labels for CIFAR-10, 3 classes

As shown in Figure 5, GALAXY’s success relies on its inherent feature of collecting a more balanced labeled set of uncertain examples. In particular, GALAXY collects significantly more in-distribution examples than most baseline algorithms, including uncertainty sampling. On the other hand, although SIMILAR and Most Likely Positive both collect more examples in the in-distribution classes, their inferiority in balanced accuracy suggests that the examples are not representative enough. Indeed, both methods are inherently collecting labels for examples that are certain. This suggests the importance of collecting batches that are not only balanced but also uncertain.

7.3 Comparison: $S^2$ vs Galaxy

Figure 6: Comparison of GALAXY with vanilla $S^2$ with 1-nearest-neighbor and neural network classifiers. We use the CIFAR-100, 10 classes data setting for comparison.

In this section, we conduct an experiment to compare the original S2 approach (Dasarathy et al., 2015) against our method. For S2, we construct a 10-nearest-neighbor graph from feature vectors of a ResNet-18 model pretrained on ImageNet. We show two curves of S2 using two different models: 1-nearest-neighbor prediction on the graph and the neural network training described in Section 7.1. We note that for S2 the model training does not affect the active queries, whereas GALAXY repeatedly reconstructs its graphs from the updated models. As shown in Figure 6, GALAXY outperforms S2 with both models by a significant margin, showing the necessity of learning and constructing better graphs (Algorithm 2).
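To make the graph-based baseline concrete, the sketch below builds a k-nearest-neighbor graph from feature vectors and predicts each node's label from the closest labeled node in hop distance. It is an illustration under stated assumptions: the features are toy 1-D points rather than ResNet-18 embeddings, hop distance is one possible reading of "1-nearest-neighbor prediction on the graph", and the function names are hypothetical.

```python
from collections import deque
import numpy as np

def knn_graph(features, k):
    """Symmetric k-nearest-neighbor adjacency sets from Euclidean distances."""
    features = np.asarray(features, dtype=float)
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a node is not its own neighbor
    neighbors = [set() for _ in range(len(features))]
    for i, row in enumerate(dists):
        for j in np.argsort(row)[:k]:
            neighbors[i].add(int(j))
            neighbors[int(j)].add(i)  # symmetrize the edge
    return neighbors

def nearest_labeled_prediction(neighbors, labeled):
    """Give each reachable node the label of its closest labeled node (multi-source BFS)."""
    pred = dict(labeled)
    queue = deque(labeled)
    while queue:
        u = queue.popleft()
        for v in sorted(neighbors[u]):
            if v not in pred:
                pred[v] = pred[u]
                queue.append(v)
    return pred

# Toy features forming two well-separated groups; nodes 0 and 5 are labeled
features = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
graph = knn_graph(features, k=2)
pred = nearest_labeled_prediction(graph, {0: 0, 5: 1})
```

Nodes in a connected component without any labeled node receive no prediction in this sketch. Because the graph is built once from fixed features, retraining a classifier would not change which nodes are adjacent, which is the limitation of the static-graph baseline noted above.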

8 Future Direction

In this paper, we propose a novel graph-based approach to deep active learning that specifically targets extreme class imbalance. We show that our algorithm GALAXY outperforms all baselines we evaluated by collecting a mixture of balanced yet uncertain examples. GALAXY has similar time complexity to other uncertainty-based methods, since it retrains the neural network model only after each batch. However, it still requires sequential and synchronous labelling within each batch, which means the human labelling effort cannot be parallelized across multiple annotators. For future work, we would like to incorporate asynchronous labelling and investigate its effect on our algorithm.


References

  • J. T. Ash, S. Goel, A. Krishnamurthy, and S. Kakade (2021) Gone fishing: neural active learning with fisher embeddings. arXiv preprint arXiv:2106.09675. Cited by: §2, §7.2.
  • J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal (2019) Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671. Cited by: §2, §7.2.
  • M. Balcan, A. Beygelzimer, and J. Langford (2009) Agnostic active learning. Journal of Computer and System Sciences 75 (1), pp. 78–89. Cited by: §2.
  • W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler (2018) The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9368–9377. Cited by: §2.
  • S. Boucheron, O. Bousquet, and G. Lugosi (2005) Theory of classification: a survey of some recent advances. ESAIM: probability and statistics 9, pp. 323–375. Cited by: Remark 7.1.
  • N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella (2013) Active learning on trees and graphs. arXiv preprint arXiv:1301.5112. Cited by: §2.
  • G. Citovsky, G. DeSalvo, C. Gentile, L. Karydas, A. Rajagopalan, A. Rostamizadeh, and S. Kumar (2021) Batch active learning at scale. Advances in Neural Information Processing Systems 34. Cited by: §2, §7.2.
  • C. Coleman, E. Chou, J. Katz-Samuels, S. Culatana, P. Bailis, A. C. Berg, R. Nowak, R. Sumbaly, M. Zaharia, and I. Z. Yalniz (2020) Similarity search for efficient active learning and search of rare concepts. arXiv preprint arXiv:2007.00077. Cited by: §2, §2.
  • D. Conathan, U. Oswal, and R. Nowak (2018) Active sparse feature selection using deep convolutional features for image retrieval. SIAM International Conference on Data Mining. First workshop on AI in insurance. Cited by: §1.
  • G. Dasarathy, R. Nowak, and X. Zhu (2015) S2: an efficient graph based active learning algorithm with application to nonparametric classification. In Conference on Learning Theory, pp. 503–522. Cited by: §2, §4, §4, §7.3.
  • M. Ducoffe and F. Precioso (2018) Adversarial active learning for deep networks: a margin based approach. arXiv preprint arXiv:1802.09841. Cited by: §2.
  • Z. A. S. Emam, H. Chu, P. Chiang, W. Czaja, R. Leapman, M. Goldblum, and T. Goldstein (2021) Active learning at the imagenet scale. arXiv preprint arXiv:2111.12880. Cited by: §2, §7.2.
  • Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep bayesian active learning with image data. In International Conference on Machine Learning, pp. 1183–1192. Cited by: §2.
  • Y. Geifman and R. El-Yaniv (2017) Deep active learning over the long tail. arXiv preprint arXiv:1711.00941. Cited by: §2.
  • D. Gissin and S. Shalev-Shwartz (2019) Discriminative active learning. arXiv preprint arXiv:1907.06347. Cited by: §2.
  • Q. Gu and J. Han (2012) Towards active learning on graphs: an error bound minimization approach. In 2012 IEEE 12th International Conference on Data Mining, pp. 882–887. Cited by: §2.
  • S. Jiang, G. Malkomes, M. Abbott, B. Moseley, and R. Garnett (2018) Efficient nonmyopic batch active search. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Cited by: §7.2.
  • J. Katz-Samuels, J. Zhang, L. Jain, and K. Jamieson (2021) Improved algorithms for agnostic pool-based active classification. arXiv preprint arXiv:2105.06499. Cited by: Remark 7.1.
  • S. Kothawade, N. Beck, K. Killamsetty, and R. Iyer (2021) Similar: submodular information measures based active learning in realistic scenarios. Advances in Neural Information Processing Systems 34. Cited by: §2, §7.2.
  • J. Kremer, K. Steenstrup Pedersen, and C. Igel (2014) Active learning with support vector machines. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4 (4), pp. 313–326. Cited by: §2.
  • Y. Roh, K. Lee, S. E. Whang, and C. Suh (2020) Fairbatch: batch selection for model fairness. arXiv preprint arXiv:2012.01696. Cited by: §2.
  • O. Sener and S. Savarese (2017) Active learning for convolutional neural networks: a core-set approach. arXiv preprint arXiv:1708.00489. Cited by: §2.
  • B. Settles (2009) Active learning literature survey. Cited by: §7.2.
  • S. Tong and D. Koller (2001) Support vector machine active learning with applications to text classification. Journal of machine learning research 2 (Nov), pp. 45–66. Cited by: §2.
  • M. K. Warmuth, J. Liao, G. Rätsch, M. Mathieson, S. Putta, and C. Lemmen (2003) Active learning with support vector machines in the drug discovery process. Journal of chemical information and computer sciences 43 (2), pp. 667–673. Cited by: §7.2.
  • M. K. Warmuth, G. Rätsch, M. Mathieson, J. Liao, and C. Lemmen (2001) Active learning in the drug discovery process.. In NIPS, pp. 1449–1456. Cited by: §7.2.
  • X. Zhu, Z. Ghahramani, and J. D. Lafferty (2003a) Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pp. 912–919. Cited by: §2.
  • X. Zhu, J. Lafferty, and Z. Ghahramani (2003b) Combining active learning and semi-supervised learning using gaussian fields and harmonic functions. In ICML 2003 workshop on the continuum from labeled to unlabeled data in machine learning and data mining, Vol. 3. Cited by: §2.

Appendix A Proof of Theorem 6.4


First, when , it’s easy to see by induction that after queries, the region of uncertainty shrinks to have at least examples. Therefore, after steps, we must have .

Next, let denote the number of OOD labels queried after bisecting steps, namely . Since in the last examples, the number of ID examples is , we must have the number of OOD examples to be symmetrically . Therefore, due to symmetry of distribution and the bisection procedure, in expectation the bisection procedure queries equal numbers of ID and OOD examples, i.e. . Together we must have

where the last inequality follows from the total number of queries so . ∎
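The bisection procedure referred to in this proof can be sketched on a linear ordering of examples. The sketch assumes the separable case in which all in-distribution (ID) examples lie to the left of a single cut and all out-of-distribution (OOD) examples lie to its right. The function name and the toy pool are hypothetical; this illustrates the bisection step only, not the authors' full algorithm.

```python
def bisect_cut(labels):
    """Locate the first OOD position in a list ordered as ID...ID, OOD...OOD.

    labels[i] is True for ID and False for OOD; each lookup models one label query.
    Returns (cut_index, queried), where queried maps index -> revealed label.
    """
    queried = {}
    lo, hi = 0, len(labels)  # the cut lies somewhere in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        queried[mid] = labels[mid]
        if labels[mid]:
            lo = mid + 1  # midpoint is ID, so the cut is to its right
        else:
            hi = mid      # midpoint is OOD, so the cut is at or left of it
    return lo, queried

# Toy pool of 16 ordered examples, only the first 3 of which are ID
labels = [True] * 3 + [False] * 13
cut, queried = bisect_cut(labels)
num_id = sum(queried.values())
num_ood = len(queried) - num_id
```

Each query halves the region of uncertainty, so the number of queries grows logarithmically with the pool size, and the queried examples concentrate around the cut rather than following the pool's class proportions.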

Appendix B Proof of Corollary 6.5


As shown in Theorem 6.4, even without the additional queries, we must have . Now, for the process of querying two sides of the cut, with queries we can guarantee that at least examples to the left of the cut must have been queried and are in ID. Therefore, . As a result, we then have and , so

Appendix C Proof of Theorem 6.7

Lemma C.1 (Noise Tolerance of Bisection).

Let . If the true label of each example in the region of uncertainty is corrupted independently with probability , the bisection procedure recovers the true uncertainty threshold with probability at least .


The bisection procedure will make queries, and for each query the label could be corrupted with probability . Therefore, by the union bound, we must then have

Now we start to prove Theorem 6.7.


By Lemma C.1, we know that with probability , none of the bisection queries is corrupted. Furthermore, as proved in Theorem 6.4, we at least take number of queries in class ID, so . As a result, with probability at least we have the desired balancedness bound. ∎
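The union-bound step in Lemma C.1 can be checked numerically. In the sketch below, the number of bisection queries is assumed to be on the order of the base-2 logarithm of the region size, and the corruption probability `p` is an arbitrary illustrative value; neither is taken from the paper's statement.

```python
import math

def prob_any_corrupted(num_queries, p):
    """Exact probability that at least one of num_queries independent labels is corrupted."""
    return 1.0 - (1.0 - p) ** num_queries

def union_bound(num_queries, p):
    """Union bound on the same event: the sum of the per-query corruption probabilities."""
    return min(1.0, num_queries * p)

n = 1024                           # hypothetical size of the region of uncertainty
q = math.ceil(math.log2(n))        # assumed number of bisection queries
p = 0.01                           # hypothetical per-query corruption probability
exact = prob_any_corrupted(q, p)
bound = union_bound(q, p)
```

The exact failure probability never exceeds the union bound, so with probability at least one minus the bound, every bisection query returns an uncorrupted label.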

Appendix D Full Experimental Results on CIFAR-10, CIFAR-100 and SVHN

Figure 7: CIFAR-10, 2 classes. Panel (b): number of in-distribution labels.
Figure 8: CIFAR-10, 3 classes. Panel (b): number of in-distribution labels.
Figure 9: CIFAR-100, 2 classes. Panel (b): number of in-distribution labels.
Figure 10: CIFAR-100, 3 classes. Panel (b): number of in-distribution labels.
Figure 11: CIFAR-100, 10 classes. Panel (b): number of in-distribution labels.
Figure 12: SVHN, 2 classes. Panel (b): number of in-distribution labels.
Figure 13: SVHN, 3 classes. Panel (b): number of in-distribution labels.