In real-life applications, large amounts of unlabeled data are easy to obtain, but labeling them is expensive and time-consuming (Shen et al., 2004). AL is an effective way to solve this problem: it achieves greater accuracy with less training data by selecting the most informative or representative instances from which to learn, and then querying their labels from oracles/annotators (Zhan et al., 2021b). Current AL sampling strategies have been tested on relatively simple, clean, and well-studied datasets (Kothawade et al., 2021) such as MNIST (Deng, 2012) and CIFAR10 (Krizhevsky et al., 2009). However, in realistic scenarios, when collecting unlabeled data samples, unrelated data samples (i.e., out-of-domain data) might be mixed in with the task-related data, e.g., images of letters when the task is to classify images of digits (Du et al., 2021). Most AL methods are not robust to OOD data scenarios. For instance, Karamcheti et al. (2021) demonstrated empirically that collective outliers hurt AL performance in Visual Question Answering (VQA) tasks. Meanwhile, selecting and querying OOD samples that are invalid for the target model wastes the labeling cost (Du et al., 2021) and makes the AL sampling process less effective.
There is a natural conflict between uncertainty-based AL sampling and OOD data detection: most AL sampling schemes, especially uncertainty-based measures, prefer selecting the data that are hardest for the current basic learner/classifier to classify, but these data might also be OOD samples. For example, a typical uncertainty-based AL method is Maximum Entropy (ENT) (Lewis and Catlett, 1994; Shannon, 2001), which selects data whose predicted class probabilities have the largest entropy. However, ENT is also a typical OOD detection method: high entropy of the predicted class distribution (high predictive uncertainty) suggests that the input may be OOD (Ren et al., 2019). Therefore, when naively applying uncertainty-based AL in OOD data scenarios, OOD samples are more likely to be selected for labeling than ID data. Fig. 1 shows an example on the EX8 dataset (Ng, 2008) that illustrates this conflict: almost all of the OOD data has high ENT, and thus the AL algorithm is more likely to select OOD data for labeling, wasting the labeling budget.
Although the OOD problem has been demonstrated to affect AL in real-life applications (Karamcheti et al., 2021), there are only a few studies on it (Kothawade et al., 2021; Du et al., 2021). In Kothawade et al. (2021), OOD is only considered as a sub-task, and its sub-modular mutual information based sampling scheme is both time and memory consuming. The method of Du et al. (2021) needs to pre-train extra self-supervised models such as SimCLR (Chen et al., 2020), and also introduces hyper-parameters to trade off between semantic and distinctive scores, whose values affect the final performance (see Section 4.3 in Du et al. (2021)). These two factors limit the range of its application. We compare our work with (Du et al., 2021) and (Chen et al., 2020) in detail in Appendix A.4.1.
In this paper, to address the above issues, we advocate simultaneously considering both the AL criterion and ID confidence when designing AL sampling strategies. Since these two objectives conflict, we define the AL sampling process under OOD data scenarios as a multi-objective optimization problem (Seferlis and Georgiadis, 2004). Unlike traditional methods for handling multiple-criteria based AL, such as weighted-sum optimization (Zhan et al., 2022) or two-stage optimization (first select a subset based on one objective, then make final decisions based on the other objective) (Shen et al., 2004), we propose a novel and flexible batch-mode Pareto Optimization Active Learning (POAL) framework with fixed batch size. The contributions of this paper are as follows:
We propose AL under OOD data scenarios within the framework of a multi-objective optimization problem.
Our framework is flexible and can accommodate different combinations of AL methods and OOD detection methods according to various target tasks. In this paper, we use ENT as the AL objective and Mahalanobis distance based ID confidence scores.
Naively applying Pareto optimization to AL will result in a Pareto Front with non-fixed size, which can introduce large computational cost. To enable efficient Pareto optimization, we propose a Monte-Carlo (MC) Pareto optimization algorithm for fixed-size batch-mode AL.
Our framework works well on both classical ML and DL tasks, and we propose pre-selecting and early-stopping techniques to reduce the computational cost on large-scale datasets.
Our framework has no trade-off hyper-parameters for balancing the AL and OOD objectives. This is crucial since i) AL is data-insufficient, and there might be no extra validation set for tuning hyper-parameters; or ii) hyper-parameter tuning in AL can be label expensive, since every change of the hyper-parameters causes AL to label new examples, thus provoking substantial labeling inefficiency (Ash et al., 2020).
2 Related Work
Pool-based Active Learning.
Pool-based AL has been widely adopted in various tasks such as natural language processing (Dor et al., 2020), semantic parsing (Duong et al., 2018), object detection (Haussmann et al., 2020), image classification/segmentation (Gal et al., 2017; Yoo and Kweon, 2019), etc. Most pool-based AL sampling schemes rely on fixed heuristic sampling strategies, which follow two main branches: uncertainty-based measures and representative/diversity-based measures (Ren et al., 2021; Zhan et al., 2022). Uncertainty-based approaches aim to select data that maximally reduce the uncertainty of the target basic learner (Ash et al., 2020). Typical uncertainty-based measures that perform well on classical ML tasks (e.g., Query-by-Committee (Seung et al., 1992), ENT (Lewis and Catlett, 1994), Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011)) have also been generalized to DL tasks (Wang and Shang, 2014; Gal et al., 2017; Beluch et al., 2018; Zhan et al., 2022). Representative/diversity-based methods select batches of unlabeled data that are most representative of the unlabeled set. Typical methods include Hierarchical sampling (Dasgupta and Hsu, 2008), k-Means, and the Core-Set approach (Sener and Savarese, 2018). However, these methods measure representativeness/diversity using pair-wise distances/similarities calculated across the whole data pool, which is computationally expensive. Neither of these two branches of AL methods is guaranteed to work well under OOD data scenarios. As discussed in Section 1, uncertainty-based measures like ENT are not robust to OOD samples, since OOD samples are naturally difficult to classify. Representative/diversity-based methods are also not robust to OOD scenarios, since outliers/OOD samples that are far from other data are more likely to be selected first as representative data. To address these issues, we propose our POAL framework, which can be adapted to various AL sampling schemes and OOD data scenarios.
Detecting OOD data is of vital importance in ensuring the reliability and safety of ML systems in real-life applications (Yang et al., 2021), since the OOD problem severely influences real-world predictions/decisions. For example, in medical diagnosis, the trained classifier could wrongly classify a healthy OOD sample as pathogenic (Ren et al., 2019). Existing methods compute ID confidence scores based on the predictions of (ensembles of) classifiers trained on ID data, e.g., the ID confidence can be the entropy of the predictive class distribution (Ren et al., 2019). Hendrycks and Gimpel (2017) observed that a well-trained neural network assigns higher softmax scores to ID data than to OOD data. The follow-up work ODIN (Liang et al., 2017) amplifies the effectiveness of OOD detection with the softmax score by considering temperature scaling and input pre-processing. Others (Lee et al., 2018; Cui et al., 2020) proposed a simple and effective method for detecting both OOD and adversarial data samples based on the Mahalanobis distance. Due to its superior performance compared with other OOD strategies (see Table 1 in (Ren et al., 2019)), our POAL framework uses the Mahalanobis distance as the basic criterion for calculating the ID confidence score. Nonetheless, our framework is general and any ID confidence score could be adopted.
Many real-life applications require optimizing multiple objectives that are conflicting with each other. For example, in the sensor placement problem, the goal is to maximize sensor coverage while minimizing deployment costs (Watson et al., 2004). Since there is no single solution that simultaneously optimizes each objective, Pareto optimization can be used to find a set of “Pareto optimal” solutions with optimal trade-offs of the objectives (Miettinen, 2012). A decision maker can then select a final solution based on their requirements. Compared with traditional optimization methods for multiple objectives (e.g., weighted-sum (Marler and Arora, 2010)), Pareto optimization algorithms are designed via different meta-heuristics without any trade-off parameters (Zhou et al., 2011; Liu et al., 2021).
Besides optimization of the two conflicting objectives (AL score and ID confidence), we also need to perform batch-mode subset selection. In previous work on subset selection, Pareto Optimization Subset Selection (POSS) (Qian et al., 2015) solves the subset selection problem by optimizing two objectives simultaneously: maximizing the criterion function, and minimizing the subset size. However, POSS does not support more than one criterion and cannot work with a fixed subset size, and thus is not suitable for our task. Therefore, we propose Monte-Carlo POAL, which achieves: i) optimization of multiple objectives; ii) no extra trade-off parameters for tuning; iii) subset selection with fixed size.
In this section, we introduce our definition of pool-based AL under OOD data scenarios, the overview of POAL, and its detailed implementations.
3.1 Problem Definition
We consider a general pool-based AL process for a $K$-class classification task with feature space $\mathcal{X}$ and label space $\mathcal{Y} = \{1, \dots, K\}$ under OOD data scenarios. We assume that the oracle can provide a fixed number of labels, and when queried with an OOD sample the oracle will return an “OOD” label (in our implementation, the OOD label is -1) to represent data outside of the specified classification task. Let $B$ be the number of labels that the oracle can provide, i.e., the budget. Our goal is to select the most informative instances with fewer OOD samples so as to obtain the highest classification accuracy. To reduce the computational cost, we consider batch-mode AL where batches of samples with fixed size $b$ are selected and queried. We denote the AL acquisition function as $\alpha(x; \mathcal{A})$, where $\mathcal{A}$ refers to the AL sampling strategy, and the basic learner/classifier as $f_\theta$. We denote the current labeled set as $\mathcal{D}_l$ and the large unlabeled data pool as $\mathcal{D}_u$. The labeled data are sampled i.i.d. over the data space $\mathcal{X} \times \mathcal{Y}$. Under OOD data scenarios, active learners may query OOD samples whose labels are not in $\mathcal{Y}$. To simplify the problem settings, only ID samples are added to $\mathcal{D}_l$.
3.2 Pareto Optimization for Active Learning
In this section we introduce our proposed POAL framework, as outlined in Figure 2. In each AL iteration, we first use the clean labeled set, which contains only ID data, to train a basic classifier and to construct class-conditional Gaussian Mixture Models (GMMs) for detecting OOD samples in the unlabeled pool. Based on the classifier, the AL acquisition function, and the GMMs, we calculate the informativeness/uncertainty score and the ID confidence score for each unlabeled sample in the pool. We then apply our proposed Pareto optimization algorithm for fixed-size subset selection, which outputs a batch of unlabeled samples with both higher informativeness/uncertainty scores and higher ID confidence scores. We next introduce each component of our POAL framework.
3.3 AL sampling strategy
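Typical pool-based AL sampling methods select data by maximizing the acquisition function over the unlabeled pool, i.e., $x^* = \arg\max_{x \in \mathcal{D}_u} \alpha(x; \mathcal{A})$, where $\mathcal{D}_u$ is the unlabeled pool and $\alpha(x; \mathcal{A})$ is the acquisition function of AL strategy $\mathcal{A}$. Different from the typical approach, in our work we use the AL scores for the whole unlabeled pool in the subsequent Pareto optimization. Thus, we convert the acquisition function to a querying density function over the unlabeled pool. In this paper we use entropy (ENT) as our basic AL sampling strategy. ENT selects data samples whose predicted class probabilities have the largest entropy, and its acquisition function is $\alpha_{\mathrm{ENT}}(x) = -\sum_{k} p_\theta(y = k \mid x) \log p_\theta(y = k \mid x)$, where $p_\theta(y \mid x)$ is the posterior class probability given by the classifier.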
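As a minimal sketch of the ENT scoring step described above, the following pure-Python snippet computes an entropy score for every sample in a toy unlabeled pool. The function names and the example probabilities are illustrative assumptions, not part of our implementation; in practice the probabilities would come from the trained classifier.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_scores(pool_probs):
    """Per-sample AL scores for the whole unlabeled pool (higher = more uncertain)."""
    return [entropy(p) for p in pool_probs]

# Hypothetical posterior class probabilities for three unlabeled samples.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction -> low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction -> high entropy
    [0.60, 0.30, 0.10],
]
scores = entropy_scores(pool_probs)
# Rank the pool from most to least uncertain.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Keeping the full score vector, rather than only the arg-max sample, is what allows the scores to be reused as one objective in the subsequent Pareto optimization.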
Our framework is flexible because it can easily incorporate other AL sampling strategies. A basic requirement is that the acquisition function can be easily converted to a querying density function. In general, AL methods that explicitly provide per-sample scores (e.g., class prediction uncertainty) or inherently provide pair-wise rankings among the unlabeled pool (e.g., k-Means) can be easily converted to a querying density. Thus, many uncertainty-based measures like Query-by-Committee (Seung et al., 1992), Bayesian Active Learning by Disagreements (BALD) (Houlsby et al., 2011; Gal et al., 2017), and Loss Prediction Loss (LPL) (Yoo and Kweon, 2019) are applicable since they explicitly provide uncertainty information per sample. Also, k-Means and Core-Set (Sener and Savarese, 2018) approaches are applicable since they provide pair-wise similarity information for ranking. On the other hand, AL approaches that only provide overall scores for the candidate subsets, such as Determinantal Point Processes (Bıyık et al., 2019; Zhan et al., 2021a), cannot be used in our framework.
3.4 ID confidence score via Mahalanobis distance
The motivation for employing the Mahalanobis distance for the ID confidence score in AL comes from (Lee et al., 2018). Intuitively, ID unlabeled data can be distinguished from OOD unlabeled data since the ID unlabeled data should be closer to the ID labeled data in feature space. One possible solution is to calculate the minimum distance between an unlabeled sample and the labeled data of its predicted label (the pseudo label provided by the classifier) as in (Du et al., 2021): if this minimum distance is large then the unlabeled sample is likely to be OOD, and vice versa.
However, calculating pair-wise distances between all labeled and unlabeled data is computationally expensive.
A more efficient method is to summarize the ID labeled data via a data distribution (probability density function) and then compute distances of the unlabeled data to the ID distribution. Since we adopt GMMs to represent the data distribution, our ID confidence score is based on the Mahalanobis distance.
AL for classical ML models focuses on training a basic learner with a fixed feature representation, while AL for DL jointly optimizes the feature representation and the classifier (Zhan et al., 2021b; Ren et al., 2021). Considering that the differences between AL for ML and DL can influence the estimation of the data distributions, and thus the Mahalanobis distance calculation, we adopt different settings as follows.
Classical ML tasks.
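We represent the labeled data distribution as a GMM, which is estimated from the labeled set. Specifically, since we have labels in the labeled set, we represent each class $k$ with its own class-conditional GMM,
$$p(x \mid y = k) = \sum_{c=1}^{C_k} \pi_c^{(k)} \, \mathcal{N}\big(x \mid \mu_c^{(k)}, \Sigma_c^{(k)}\big),$$
where $\pi_c^{(k)}$, $\mu_c^{(k)}$ and $\Sigma_c^{(k)}$ are the mixing coefficient, mean vector and covariance matrix of the $c$-th Gaussian component of the $k$-th class, $C_k$ is the number of components for class $k$, and $\mathcal{N}(x \mid \mu, \Sigma)$ is a multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$. The parameters of the class-conditional GMM for class $k$ are estimated from all the labeled data for class $k$, using maximum likelihood estimation (MLE) (Reynolds, 2009) and the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Finally, inspired by (Lee et al., 2018), we define the ID confidence score via the Mahalanobis distance between the unlabeled sample and its closest Gaussian component in any class,
$$\mathcal{M}(x) = \max_{k} \max_{c} \; -D\big(x \mid \mu_c^{(k)}, \Sigma_c^{(k)}\big),$$
where the Mahalanobis distance is $D(x \mid \mu, \Sigma) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$.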
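The distance-to-closest-component idea can be sketched as follows. This is a simplified illustration under stated assumptions: features are 2-dimensional so the covariance can be inverted by hand, each class is summarized by a single Gaussian component fitted by MLE (rather than a full GMM fitted with EM), and the confidence is taken as the negative squared Mahalanobis distance. All names and data are hypothetical.

```python
def fit_gaussian(points):
    """MLE mean and covariance of 2-D points (one Gaussian component)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_sq(x, mean, cov):
    """(x - mean)^T cov^{-1} (x - mean) for a 2x2 covariance.

    Assumes the covariance is non-singular (non-zero determinant).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det

def id_confidence(x, components):
    """Negative distance to the closest component: higher means more ID-like."""
    return -min(mahalanobis_sq(x, mean, cov) for mean, cov in components)

# Toy labeled ID data for two classes.
class0 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class1 = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
components = [fit_gaussian(class0), fit_gaussian(class1)]

conf_id = id_confidence((0.5, 0.6), components)      # close to class 0
conf_ood = id_confidence((20.0, -20.0), components)  # far from both classes
```

A sample near either class receives a confidence close to zero, while a sample far from all components receives a large negative value.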
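For DL, we follow (Lee et al., 2018) to calculate the ID confidence score, since it makes good use of both low- and high-level features generated by the deep neural network (DNN) and also applies calibration techniques, e.g., input pre-processing (Liang et al., 2018) and feature ensemble (Lee et al., 2018), to improve the OOD detection capability. Specifically, denote $f_\ell(x)$ as the output of the $\ell$-th layer of a DNN for input $x$, and let $D(\cdot \mid \mu, \Sigma)$ denote the Mahalanobis distance with mean $\mu$ and covariance $\Sigma$. For feature ensemble, class-conditional Gaussian distributions with shared covariance are estimated for each layer $\ell$ and for each class $k$ by calculating the empirical mean $\hat{\mu}_k^{\ell}$ and shared covariance $\hat{\Sigma}^{\ell}$:
$$\hat{\mu}_k^{\ell} = \frac{1}{N_k} \sum_{i: y_i = k} f_\ell(x_i), \qquad \hat{\Sigma}^{\ell} = \frac{1}{N} \sum_{k} \sum_{i: y_i = k} \big(f_\ell(x_i) - \hat{\mu}_k^{\ell}\big)\big(f_\ell(x_i) - \hat{\mu}_k^{\ell}\big)^\top,$$
where $N_k$ is the number of labeled samples of class $k$ and $N = \sum_k N_k$. Next we find the closest class to an $x$ for each layer via $\hat{k}_\ell = \arg\min_k D\big(f_\ell(x) \mid \hat{\mu}_k^{\ell}, \hat{\Sigma}^{\ell}\big)$. An input pre-processing step is applied to make the ID/OOD data more separable by adding a small perturbation to $x$: $\hat{x} = x - \varepsilon \, \mathrm{sign}\big(\nabla_x D(f_\ell(x) \mid \hat{\mu}_{\hat{k}_\ell}^{\ell}, \hat{\Sigma}^{\ell})\big)$, where $\varepsilon$ controls the perturbation magnitude (in our experiments we follow (Lee et al., 2018) in setting $\varepsilon$). Finally, the ID confidence score is obtained by averaging the Mahalanobis distances for $\hat{x}$ over the $L$ layers,
$$\mathcal{M}(x) = -\frac{1}{L} \sum_{\ell=1}^{L} D\big(f_\ell(\hat{x}) \mid \hat{\mu}_{\hat{k}_\ell}^{\ell}, \hat{\Sigma}^{\ell}\big).$$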
3.5 Monte-Carlo Pareto Optimization Active Learning
Our framework for AL under OOD data scenarios contains two criteria, the AL informativeness score and the ID confidence score. In normal AL, there are two typical strategies to perform sampling with multiple AL criteria: i) weighted-sum optimization and ii) two-stage optimization (Zhan et al., 2022). However, both of these optimization strategies have drawbacks in our AL for OOD scenario. In weighted-sum optimization, the two objective functions are summed with a trade-off weight. This introduces an extra hyper-parameter for tuning, but the true proportion of OOD data samples in the unlabeled pool is unknown, and thus we cannot tell which criterion is more important. Furthermore, due to the lack of data, there is not enough validation data for properly tuning the weight. For two-stage optimization, a subset of possible ID samples is first selected with a threshold, and then the most informative samples in the subset are selected by maximizing the AL score. This yields a similar problem to weighted-sum optimization: if the threshold has not been properly selected, we might i) select OOD samples in the first stage (when the threshold is too large), and these OOD samples will be more likely to be selected first due to their higher informativeness score in the second stage; or ii) select ID samples that are close to existing samples (when the threshold is too small), and due to the natural conflict between the two criteria, the resulting subset will be non-informative, thus harming the AL performance.
Considering this peculiar contradiction between the AL and OOD criteria, developing a combined optimization method that does not require manual tuning of this trade-off is of vital importance. Therefore, we propose POAL for balancing the AL score and the ID confidence score automatically without requiring hyper-parameters. Consider one iteration of the AL process, where we need to select a batch of $b$ samples from an unlabeled pool of $n$ samples. The size of the search space is the number of combinations $n$ choose $b$, $\binom{n}{b}$, and the search is an NP-hard problem in general.
Inspired by (Qian et al., 2015), we use a binary vector to represent a candidate subset: $s \in \{0, 1\}^n$, where $n$ is the size of the unlabeled set, $s_i = 1$ represents that the $i$-th sample is selected, and $s_i = 0$ otherwise. For the unlabeled set, we denote the vector of AL scores as $p$ and the vector of ID confidence scores as $m$. The two criteria scores for the subset $s$ are then computed as $o_p(s) = s^\top p$ and $o_m(s) = s^\top m$. (To keep the two criteria consistent, we normalize the ID confidence scores.) We next define the following rank relationships between two candidate subsets $s$ and $s'$ (Qian et al., 2015):
$s \preceq s'$ denotes that $s$ is dominated by $s'$, such that both scores for $s$ are no better than those of $s'$, i.e., $o_p(s) \le o_p(s')$ and $o_m(s) \le o_m(s')$.
$s \prec s'$ denotes that $s$ is strictly dominated by $s'$, such that $s$ has one strictly smaller score (e.g., $o_p(s) < o_p(s')$) and one score that is not better (e.g., $o_m(s) \le o_m(s')$).
$s$ and $s'$ are incomparable if both $s$ is not dominated by $s'$ and $s'$ is not dominated by $s$.
Our goal is to find optimal subset solution(s) $s^*$ of the fixed batch size $b$, i.e., $|s^*| = b$, that dominate the remaining subset solutions $s$ that also satisfy $|s| = b$. Due to the large search space, it is impossible to traverse all possible subset solutions. Thus we propose Monte-Carlo POAL for fixed-size subset selection. Monte-Carlo POAL iteratively generates a candidate solution $s$ at random, and checks it against the current Pareto set $P$. If there is no candidate solution in $P$ that strictly dominates $s$, then $s$ is added to $P$ and all candidate solutions in $P$ that are dominated by $s$ are removed.
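The update rule above can be sketched in a few lines of pure Python. This is an illustrative sketch rather than our implementation: subsets are stored as frozensets of indices instead of binary vectors, the subset scores are plain sums of per-sample scores, a fixed iteration count replaces early stopping, and the toy scores are hypothetical.

```python
import random

def subset_scores(subset, al, idc):
    """(AL score, ID confidence score) of a subset as sums of per-sample scores."""
    return (sum(al[i] for i in subset), sum(idc[i] for i in subset))

def dominated(u, v):
    """u is dominated by v: both scores of u are no better than those of v."""
    return u[0] <= v[0] and u[1] <= v[1]

def strictly_dominated(u, v):
    """u is dominated by v and at least one score of u is strictly smaller."""
    return dominated(u, v) and (u[0] < v[0] or u[1] < v[1])

def monte_carlo_pareto(al, idc, b, iters, seed=0):
    """Maintain a Pareto set of fixed-size subsets via random sampling."""
    rng = random.Random(seed)
    n = len(al)
    pareto = {}  # frozenset of indices -> (AL score, ID confidence score)
    for _ in range(iters):
        cand = frozenset(rng.sample(range(n), b))  # every candidate has size b
        if cand in pareto:
            continue
        sc = subset_scores(cand, al, idc)
        # Reject the candidate if some current solution strictly dominates it.
        if any(strictly_dominated(sc, v) for v in pareto.values()):
            continue
        # Otherwise add it and drop every solution it dominates.
        pareto = {s: v for s, v in pareto.items() if not dominated(v, sc)}
        pareto[cand] = sc
    return pareto

# Hypothetical per-sample scores for a pool of six samples.
al = [0.9, 0.8, 0.1, 0.2, 0.5, 0.3]
idc = [0.1, 0.3, 0.9, 0.7, 0.6, 0.2]
pareto = monte_carlo_pareto(al, idc, b=2, iters=2000)
```

Because every candidate is drawn with exactly `b` elements, all solutions in the Pareto set have the same fixed size, which is the property needed for batch-mode AL.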
Our proposed Monte-Carlo POAL method is inspired by POSS (Qian et al., 2015), but there are essential differences. POSS only supports one criterion (e.g., just the AL score), while the other criterion is that the subset size does not exceed the batch size. This results in different subset sizes for the candidate solutions in the Pareto set when using POSS, and POSS can finally pick one solution from the Pareto set by maximizing/minimizing the only criterion. Moreover, in POSS, the solution generated in each iteration is not completely random, as it is related to candidate solutions in the Pareto set. POSS first randomly selects one solution in the Pareto set, then generates a new solution by randomly flipping each bit of the selected solution with equal probability, which may result in a candidate subset whose size differs from the batch size. However, in our task, we need fixed-size subset solutions and we have two conflicting criteria, and thus POSS cannot satisfy our requirements. Due to these differences in how candidate solutions are generated and in the number of criteria, the theoretical bound on the number of iterations needed for convergence of POSS (Qian et al., 2015) cannot be applied to our Monte-Carlo POAL method. Thus, when should the iterations of our Monte-Carlo POAL terminate? We noticed in previous research (Qian et al., 2017) that the Pareto set empirically converges much faster than this bound (see Fig. 2 in (Qian et al., 2017)). Therefore, we propose an early-stopping technique. Detailed comparisons between POAL and POSS are in Appendix A.4.2.
The Monte-Carlo POAL should terminate when there is no significant change in the Pareto set after many successive iterations, which indicates that a randomly generated candidate has little probability of changing the Pareto set since most non-dominated solutions are already included in it. We propose an early-stopping strategy with reference to automatic termination techniques in Multi-objective Evolutionary Algorithms (Saxena et al., 2016). First, we need to define the difference between two Pareto sets, $P$ and $P'$. We represent each candidate solution $s$ as a 2-dimensional feature vector $v(s)$ consisting of its AL score and its ID confidence score. The two Pareto sets can then be compared via their feature vector distributions. In our paper, we employ maximum mean discrepancy (MMD) (Gretton et al., 2006) to measure the difference:
$$\mathrm{MMD}(P, P') = \Big\| \frac{1}{|P|} \sum_{s \in P} \phi\big(v(s)\big) - \frac{1}{|P'|} \sum_{s' \in P'} \phi\big(v(s')\big) \Big\|_{\mathcal{H}},$$
where $\phi$ is the kernel feature map. The distance is based on the notion of embedding probabilities in a reproducing kernel Hilbert space $\mathcal{H}$. In our work, we utilize the RBF kernel with various bandwidths. Second, at each iteration, we compute the mean and standard deviation (SD) of the MMD scores in a sliding window of the previous iterations. If there is no significant difference in the mean and SD after several iterations (e.g., the mean and SD do not change within 2 decimal points), then we assume that the Pareto set has converged and we can stop iterating.
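The early-stopping test can be illustrated with the following sketch. It is written under assumptions that the text does not fix: the MMD is computed with the standard biased estimator, the RBF bandwidths `(0.5, 1.0, 2.0)`, the window length `w` and the `patience` parameter are hypothetical choices, and the SD is the population SD over the window.

```python
import math

def rbf(u, v, bandwidths=(0.5, 1.0, 2.0)):
    """Sum of RBF kernels with several (assumed) bandwidths."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return sum(math.exp(-d2 / (2.0 * h * h)) for h in bandwidths)

def mmd(xs, ys):
    """Biased MMD estimate between two sets of 2-D score vectors."""
    kxx = sum(rbf(a, b) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(rbf(a, b) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(rbf(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return math.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))

def window_stats(values, w):
    """Mean and population SD of the last w values."""
    window = values[-w:]
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    return mean, math.sqrt(var)

def has_converged(mmd_history, w, patience, decimals=2):
    """True if the rounded window mean and SD stayed constant for `patience` steps."""
    if len(mmd_history) < w + patience:
        return False
    stats = [
        tuple(round(x, decimals) for x in window_stats(mmd_history[:t], w))
        for t in range(len(mmd_history) - patience, len(mmd_history) + 1)
    ]
    return len(set(stats)) == 1

# Two toy Pareto sets described by (AL score, ID confidence score) vectors.
front_a = [(1.7, 0.4), (1.4, 0.7), (0.6, 1.5)]
front_b = [(1.7, 0.4), (1.4, 0.7), (0.3, 1.6)]
gap = mmd(front_a, front_b)
```

In a full loop, `mmd` would be evaluated between the Pareto sets of successive iterations and appended to `mmd_history`, and sampling would stop once `has_converged` returns `True`.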
Finally, after obtaining the converged Pareto set, we need to pick one non-dominated solution in it as the final subset selection, according to a final selection criterion. In our work, we select the subset with maximum intersection with the other non-dominated solutions. The maximum intersection operation implies a weighting process, where the final selected subset contains samples that were commonly used in other non-dominated solutions. The whole process of our Monte-Carlo POAL with early stopping is detailed in Appendix Algorithm A1.
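One way to read the maximum-intersection criterion is sketched below: each non-dominated subset is scored by the total number of samples it shares with the other subsets, and the highest-scoring subset is returned. Summing pairwise overlaps is an assumption made for illustration; the function name and toy subsets are hypothetical.

```python
def select_final(pareto_subsets):
    """Pick the subset sharing the most samples with the other subsets."""
    def overlap(s):
        # Total size of the intersections with every other subset in the list.
        return sum(len(s & t) for t in pareto_subsets if t is not s)
    return max(pareto_subsets, key=overlap)

# Three toy non-dominated subsets of sample indices.
pareto_subsets = [
    frozenset({0, 1, 2}),
    frozenset({1, 2, 3}),
    frozenset({2, 3, 4}),
]
final = select_final(pareto_subsets)
```

Here the middle subset wins because its samples recur most often across the other solutions, which matches the implicit weighting described above.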
Pre-selection for large-scale data.
We next consider using POAL on large datasets. The search space is extremely large for a large unlabeled data pool. We propose an optional pre-selecting technique to decrease the search space. Our goal is to select a promising candidate subset from the unlabeled pool. We conduct naive Pareto optimization on the original unlabeled data pool. First, an optimal Pareto front, i.e., a set of non-dominated data points, is selected by normal Pareto optimization with the AL score and the ID confidence score as objectives. Since the size of the Pareto front is not fixed and might not meet our requirement on the minimum pre-selected subset size, we iteratively update the pre-selected subset by first adding the data points on the current Pareto front to it, and then excluding these points from the pool for the next round of naive Pareto optimization. This iteration terminates when the pre-selected subset reaches the minimum size. The pseudo code of the pre-selecting process is presented in Appendix Algorithm A2. The pre-selected subset size is set according to personal requirements (i.e., the computing resources and time budget).
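The iterative front-peeling procedure described above can be sketched as follows. This is a simplified quadratic-time illustration with hypothetical names and toy scores, not the pseudo code of Algorithm A2.

```python
def pareto_front(indices, al, idc):
    """Indices of points that no other point strictly dominates."""
    front = []
    for i in indices:
        is_dominated = any(
            al[j] >= al[i] and idc[j] >= idc[i]
            and (al[j] > al[i] or idc[j] > idc[i])
            for j in indices if j != i
        )
        if not is_dominated:
            front.append(i)
    return front

def preselect(al, idc, min_size):
    """Peel successive Pareto fronts until at least min_size points are kept."""
    remaining = list(range(len(al)))
    selected = []
    while remaining and len(selected) < min_size:
        front = pareto_front(remaining, al, idc)
        selected.extend(front)
        front_set = set(front)
        # Exclude the current front before the next round.
        remaining = [i for i in remaining if i not in front_set]
    return selected

# Hypothetical per-sample AL and ID confidence scores.
al = [0.9, 0.1, 0.5, 0.4, 0.2]
idc = [0.1, 0.9, 0.5, 0.4, 0.2]
candidates = preselect(al, idc, min_size=4)
```

Since a whole front is added at a time, the returned subset may be slightly larger than `min_size`; the fixed-size batch is then chosen from these candidates by the Monte-Carlo search.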
In this section, we evaluate the effectiveness of our proposed Monte-Carlo POAL on both classical ML and DL tasks.
4.1 Experimental Design
For classical ML, we use pre-processed data from LIBSVM (Chang and Lin, 2011):
Synthetic dataset EX8 uses EX8a as ID data and EX8b as OOD data (Ng, 2008).
For DL, we adopt the following image datasets:
CIFAR10 (Krizhevsky et al., 2009) has 10 classes, and we construct two datasets: CIFAR10-04 splits the classes with an ID:OOD ratio of 6:4, and CIFAR10-06 splits the classes with a ratio of 4:6.
CIFAR100 (Krizhevsky et al., 2009) has 100 classes, and we construct CIFAR100-04 and CIFAR100-06 using the same ratios as for CIFAR10.
Our work helps existing AL methods select more informative ID data samples while preventing OOD data selection. Thus, in our experiments, we adopt ENT as the basic AL sampling strategy for our POAL framework. We compare against baseline methods without OOD detection: 1) normal ENT; 2) random sampling (RAND); 3) Mahalanobis distance as the sampling strategy (MAHA). To show the effectiveness of our MC Pareto optimization, we compare against two alternative combination strategies: 1) weighted-sum optimization (WeightedSum) with fixed weights; 2) two-stage optimization (TwoStage) with a fixed threshold. We also compare against several widely adopted AL techniques, k-Means, LPL (Yoo and Kweon, 2019) and BALD (Gal et al., 2017), to observe their performance under OOD data scenarios. Finally, we report the oracle results of ENT where only ID data is selected first (denoted as IDEAL-ENT), which serves as an upper bound for our POAL.
The training/test splits of the datasets are fixed in the experiments, while the initial label set and unlabeled set are randomly generated from the training set. Experiments are repeated over multiple random trials for both classical ML and DL tasks. We use the same basic classifier for each dataset (details in Appendix A.2.2). To evaluate model performance, we measure accuracy and plot accuracy vs. budget curves to present the performance change with increasing labeled samples. To calculate average performance, we compute the area under the accuracy-budget curve (AUBC) (Zhan et al., 2021b), with higher values reflecting better performance under varying budgets. More details about the experiments are in Appendix A.2, including dataset splits, ratios, basic learner settings, and AL hyper-parameters (budget, batch size, and initial label set size), etc.
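For concreteness, an area-under-the-curve summary of an accuracy-budget curve can be computed as in the sketch below. It assumes trapezoidal integration and normalization by the budget range so that the value stays on the accuracy scale; both choices, as well as the function name and numbers, are illustrative assumptions rather than the exact AUBC definition of (Zhan et al., 2021b).

```python
def aubc(budgets, accuracies):
    """Trapezoidal area under the accuracy-budget curve, normalized by budget range."""
    area = 0.0
    for i in range(1, len(budgets)):
        width = budgets[i] - budgets[i - 1]
        area += 0.5 * (accuracies[i] + accuracies[i - 1]) * width
    return area / (budgets[-1] - budgets[0])

# Hypothetical curve: accuracy measured after 0, 10 and 20 queried labels.
score = aubc([0, 10, 20], [0.5, 0.7, 0.9])
```

A method whose accuracy rises earlier along the budget axis obtains a larger value even if the final accuracy is the same.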
4.2 Results on classical ML tasks
The overall performance on various datasets is shown in Figure 6. We first compare the performance of POAL against each single component score, ENT and MAHA. POAL is closest to the oracle performance of IDEAL-ENT, and neither ENT nor MAHA is effective when used as a single selection strategy under OOD scenarios. Next we compare POAL against other combination methods using the same AL and ID confidence scores. On all datasets, POAL is superior to WeightedSum and TwoStage, demonstrating that joint Pareto optimization better handles the two conflicting criteria.
We next analyze the results for each dataset. The synthetic dataset EX8 has two properties: the ID data is non-linearly separable and the OOD data is far from the ID data (see data distribution in Appendix Figure A1-a). These properties make the calculation of the ID confidence score more accurate. Meanwhile, both OOD data and ID data close to the decision boundaries have high entropy scores (see Figure 1). As shown in Figure 6-a, ENT always selects OOD data with higher entropy until all OOD data are selected (there are 210 OOD data in EX8), and it performs even worse than RAND. MAHA prefers to select ID samples first (see Appendix Figure A3), but the data samples that have smaller Mahalanobis distance are also easily classified and far from decision boundaries, and thus not informative. This confirms the claim that neither informativeness/uncertainty-based AL measures nor ID confidence can be directly used in AL under OOD scenarios. The curve of our POAL is close to the oracle IDEAL-ENT, which indicates that POAL selects the most informative ID data while excluding the OOD samples.
For real-life datasets, POAL also demonstrates its superiority on vowel (see Figure 6-b), a multi-class classification task. In letter, we fix the ID data (letters a-j) and gradually increase the OOD data (letters k-z). From Figure 6e, we observe that, except for IDEAL-ENT, all methods are influenced by increasing OOD data (compare, e.g., the AUBC of RAND in Figure 6c and in Figure 6d). Although our method is also influenced by increasing OOD data, it is still superior to all baselines. MAHA performs better when the ID and OOD distributions are distinct. However, if the two distributions overlap, MAHA does not perform very well (see Appendix Figures A1 and A3). Similar conclusions were also reported in (Ren et al., 2019), where MAHA does not perform well on genomic datasets, since OOD bacteria classes are interlaced under the same taxonomy.
4.3 Results on DL tasks
We next present experiments showing that our POAL also works on large-scale datasets together with deep AL. The accuracy-budget curves for CIFAR10 and CIFAR100 with ID:OOD ratios 6:4 and 4:6, together with the number of OOD samples selected during the AL processes, are presented in Figure 4. POAL outperforms both uncertainty- and representative/diversity-based baselines like LPL, BALD and k-Means on all tasks. Although LPL performs fairly well on the normal CIFAR10 and CIFAR100 datasets (Yoo and Kweon, 2019; Zhan et al., 2022), under OOD data scenarios, OOD samples have larger predicted loss values, and thus influence the selection decision. A similar phenomenon is observed for BALD and ENT. The performance of k-Means is close to RAND, since both methods sample ID/OOD data with the same ratio. Although k-Means, unlike uncertainty-based measures, does not select OOD samples first, it still fails to provide comparable performance. POAL performs well by selecting fewer OOD samples during the AL iterations, as shown in Figure 4(e-h). Although the pre-selecting strategy in POAL makes the search space smaller, i.e., changes the global solutions to local solutions, POAL still performs significantly better than baseline AL methods. Finally, POAL outperforms the other combination strategies, WeightedSum and TwoStage, by large margins.
5 Conclusion, Discussion and Future Work
In this paper, we proposed an effective and flexible Monte-Carlo POAL framework for improving AL when encountering OOD data. Our POAL framework is able to i) incorporate various AL sampling strategies; ii) handle large-scale datasets; iii) support batch-mode AL with fixed batch size; iv) avoid trade-off hyper-parameter tuning; and v) work well on both classical ML and DL tasks. Our work has two limitations: i) for the ID confidence score calculation, the Mahalanobis distance only performs well on distinct ID/OOD distributions, which influences POAL performance; ii) we need many iterations for the Pareto set to converge based on MC sampling (see Appendix A.3.2 for a timing cost comparison). Future work could improve the OOD detection criterion by introducing OOD-specific feature representations and other techniques for better distinguishing ID/OOD distributions. Although MC sampling requires more iterations than other sampling schemes like importance sampling and Metropolis-Hastings, it is efficient and keeps the method free from initialization problems. We will try to find more efficient sampling schemes while reducing the impact of initialization.
- Theoretical foundations and algorithms for outlier ensembles. Acm sigkdd explorations newsletter 17 (1), pp. 24–47. Cited by: 2nd item, 2nd item.
- Deep batch active learning by diverse, uncertain gradient lower bounds. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: 3rd item, 4th item, §A.2.2, item 5, §2, Figure 4.
- UCI machine learning repository. Irvine, CA, USA. Cited by: 2nd item, 3rd item, 2nd item, 3rd item.
- The power of ensembles for active learning in image classification. In , pp. 9368–9377. Cited by: §2.
- Batch active learning using determinantal point processes. arXiv preprint arXiv:1906.07975. Cited by: §3.3.
- LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST) 2 (3), pp. 1–27. Cited by: §4.1.
- A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §A.3.2, §1.
- Accelerating monte carlo bayesian prediction via approximating predictive uncertainty over the simplex. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §2.
- Hierarchical sampling for active learning. In Proceedings of the 25th international conference on Machine learning, pp. 208–215. Cited by: §2.
- Adaptive greedy approximations. Constructive approximation 13 (1), pp. 57–98. Cited by: §A.4.2.
- Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22. Cited by: §3.4.
- The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29 (6), pp. 141–142. Cited by: §1.
- Active learning for bert: an empirical study. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7949–7962. Cited by: §2.
- Contrastive coding for active learning under class distribution mismatch. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8927–8936. Cited by: 2nd item, §A.3.2, §A.4.1, §A.4.1, §A.5, §1, §1, §3.4, Figure 4.
- UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: 2nd item, 3rd item.
- Active learning for deep semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 43–48. Cited by: §2.
- Letter recognition using holland-style adaptive classifiers. Machine learning 6 (2), pp. 161–182. Cited by: 3rd item, 3rd item.
- Deep bayesian active learning with image data. In International Conference on Machine Learning, pp. 1183–1192. Cited by: 1st item, 6th item, §2, §3.3, §4.1.
- A kernel method for the two-sample-problem. Advances in neural information processing systems 19. Cited by: §3.5.
- Scalable active learning for object detection. In 2020 IEEE intelligent vehicles symposium (iv), pp. 1430–1435. Cited by: §2.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: 10th item, §A.2.2, §A.3.2.
- A baseline for detecting misclassified and out-of-distribution examples in neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §2.
- Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745. Cited by: 1st item, 6th item, §2, §3.3.
- DeepAL: deep active learning in python. arXiv preprint arXiv:2111.15258. Cited by: 6th item, §A.2.2.
- Submodular combinatorial information measures with applications in machine learning. In Algorithmic Learning Theory, pp. 722–754. Cited by: §A.4.1.
- Billion-scale similarity search with gpus. IEEE Transactions on Big Data 7 (3), pp. 535–547. Cited by: 7th item, §A.2.2.
- Mind your outliers! investigating the negative impact of outliers on active learning for visual question answering. arXiv preprint arXiv:2107.02331. Cited by: §1, §1.
- Prism: a unified framework of parameterized submodular information measures for targeted data subset selection and summarization. arXiv preprint arXiv:2103.00128. Cited by: §A.4.1.
- Learning active learning from data. Advances in neural information processing systems 30. Cited by: 11st item, §A.3.1.
- Similar: submodular information measures based active learning in realistic scenarios. Advances in Neural Information Processing Systems 34. Cited by: 3rd item, §A.3.2, §A.4.1, §A.4.1, §1, §1, Figure 4.
- Learning multiple layers of features from tiny images. Cited by: 4th item, §1, 1st item, 2nd item.
- A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems 31. Cited by: 8th item, §2, §3.4, §3.4, §3.4, footnote 2.
- Heterogeneous uncertainty sampling for supervised learning. In Machine Learning, Proceedings of the Eleventh International Conference, Rutgers University, New Brunswick, NJ, USA, July 10-13, 1994, pp. 148–156. Cited by: 6th item, §1, §2.
- Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §3.4.
- Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §2.
- HydraText: multi-objective optimization for adversarial textual attack. arXiv preprint arXiv:2111.01528. Cited by: §2.
- The weighted sum method for multi-objective optimization: new insights. Structural and multidisciplinary optimization 41 (6), pp. 853–862. Cited by: §2.
- Nonlinear multiobjective optimization. Vol. 12, Springer Science & Business Media. Cited by: §2.
- Stanford cs229 - machine learning - andrew ng. . Cited by: 1st item, §1, 1st item.
- Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: 1st item.
- Scikit-learn: machine learning in python. the Journal of machine Learning research 12, pp. 2825–2830. Cited by: §A.2.2.
- Optimizing ratio of monotone set functions.. In IJCAI, pp. 2606–2612. Cited by: §A.4.2, §3.5.
- Subset selection by pareto optimization. Advances in neural information processing systems 28. Cited by: §A.4.2, §2, §3.5, §3.5.
- Likelihood ratios for out-of-distribution detection. Advances in Neural Information Processing Systems 32. Cited by: §A.3.1, §1, §2, §4.2.
- A survey of deep active learning. ACM Computing Surveys (CSUR) 54 (9), pp. 1–40. Cited by: §2, §3.4.
- Gaussian mixture models.. Encyclopedia of biometrics 741 (659-663). Cited by: §3.4.
- Entropy-based termination criterion for multiobjective evolutionary algorithms. IEEE Transactions on Evolutionary Computation 20 (4), pp. 485–498. Cited by: §3.5.
- The integration of process design and control. Elsevier. Cited by: §1.
- Active learning for convolutional neural networks: A core-set approach. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §2, §3.3.
- Active learning literature survey. Cited by: §2.
- Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, pp. 287–294. Cited by: 11st item, §A.3.1, §2, §3.3.
- A mathematical theory of communication. ACM SIGMOBILE mobile computing and communications review 5 (1), pp. 3–55. Cited by: §1.
- Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 21-26 July, 2004, Barcelona, Spain, pp. 589–596. Cited by: §1, §1.
- ALiPy: active learning in python. Technical report Nanjing University of Aeronautics and Astronautics. Note: available as arXiv preprint https://arxiv.org/abs/1901.03802 Cited by: 11st item.
- A new active labeling method for deep learning. In 2014 International joint conference on neural networks (IJCNN), pp. 112–119. Cited by: 6th item, §A.3.2, §2.
- A multiple-objective analysis of sensor placement optimization in water networks. In Critical transitions in water and environmental resources management, pp. 1–10. Cited by: §2.
- Generalized out-of-distribution detection: a survey. arXiv preprint arXiv:2110.11334. Cited by: §2.
- Learning loss for active learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 93–102. Cited by: 2nd item, 5th item, §A.2.2, §2, §3.3, §4.1, §4.3.
- Multiple-criteria based active learning with fixed-size determinantal point processes. arXiv preprint arXiv:2107.01622. Cited by: §3.3.
- A comparative survey: benchmarking for pool-based active learning. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 4679–4686. Cited by: §A.3.1, §1, §2, §3.4, §4.1.
- A comparative survey of deep active learning. arXiv preprint arXiv:2203.13450. Cited by: §1, §2, §3.5, §4.3.
- Multiobjective evolutionary algorithms: a survey of the state of the art. Swarm and evolutionary computation 1 (1), pp. 32–49. Cited by: §2.
Appendix A Appendix
A.1.1 Pseudo code of Monte-Carlo POAL framework.
Our Monte-Carlo POAL with early stopping is detailed in Algorithm 1. MMD is implemented with RBF kernels of various bandwidths, using an open-source implementation (https://github.com/easezyc/deep-transfer-learning/blob/master/MUDA/MFSAN/MFSAN_3src/mmd.py).
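As a rough illustration of what a multi-bandwidth RBF MMD computes, the sketch below estimates the squared MMD between two samples with NumPy. It is not the linked implementation: the function name `mmd_rbf`, the fixed bandwidth list and the biased (V-statistic) estimator are assumptions made for this example.

```python
import numpy as np

def mmd_rbf(x, y, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Biased estimate of the squared MMD between samples x and y.

    The kernel is a sum of RBF kernels, one per bandwidth
    (the bandwidth values are an illustrative assumption).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = np.vstack([x, y])
    # Pairwise squared Euclidean distances between all points in z.
    sq_norms = np.sum(z ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative values
    k = sum(np.exp(-d2 / (2.0 * s ** 2)) for s in bandwidths)
    n = len(x)
    k_xx, k_yy, k_xy = k[:n, :n], k[n:, n:], k[:n, n:]
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()
```

Summing kernels over several bandwidths avoids having to pick a single length scale; samples from the same distribution should give a value close to zero, while shifted samples give a clearly larger value.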
A.1.2 Pseudo code of pre-selecting technique.
The detailed pre-selecting process is presented in Algorithm 2.
A.2 Experimental settings
We summarize the datasets used in our experiments in Table 1, including how each dataset is split (initial labeled data, unlabeled data pool and test set), dataset information (number of categories, number of feature dimensions) and task-specific information (maximum budget, batch size, number of repeated trials and the basic learners adopted in each task). In particular, we record the ID:OOD ratio of each task. We down-sampled the letter dataset so that each category has only 50 data samples in the training set.
We visualize the datasets of the classical ML tasks with t-Distributed Stochastic Neighbor Embedding (t-SNE; https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), as shown in Figure 5. OOD data samples are drawn as semitransparent grey dots so that the distinction between the ID and OOD data distributions is easier to observe.
|Dataset||# of ID classes||# of feature dimensions||# of initial labeled set||# of unlabeled pool||# of test set||Maximum budget||ID:OOD ratio||Batch size||# of repeated trials||Basic learner|
A.2.2 Baselines and implementations
This section introduces the baselines and their implementations.
For the basic learner/classifier, we adopted the Gaussian Process Classifier (GPC), Logistic Regression (LR) and ResNet18 in our experiments. We did not use GPC on the letter dataset, since the accuracy-budget curves based on GPC are not monotonically increasing. The only requirement on a basic classifier is that it provides soft outputs, that is, predictive class probabilities from which the uncertainty of each unlabeled sample (e.g., entropy) can be calculated. We use the scikit-learn [Pedregosa et al., 2011] implementations of GPC with an RBF kernel (https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html) and LR (https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) with default settings. For DL tasks, we employed ResNet18 [He et al., 2016] based on PyTorch with the Adam optimizer (learning rate: ) as the basic learner. In the CIFAR10 and CIFAR100 tasks, we set the number of training epochs to 30, and the kernel size of the first convolutional layer in ResNet18 is (consistent with the PyTorch CIFAR implementation, https://github.com/kuangliu/pytorch-cifar/blob/master/models/resnet.py). Input pre-processing steps include random crop (), random horizontal flip () and normalization.
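The soft-output requirement on the basic classifier can be illustrated with a small, library-free sketch: given predicted class probabilities, it computes the entropy of each unlabeled sample and returns the most uncertain ones. The function names and the plain-list input format are assumptions of this example; in practice the probabilities would come from the classifier's probability output.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_max_entropy(prob_rows, batch_size):
    """Return the indices of the batch_size samples with the largest entropy."""
    scores = [predictive_entropy(row) for row in prob_rows]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:batch_size]
```

A classifier that only returns hard labels cannot be used this way, because every prediction would have zero entropy.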
The implementations of the classical ML baselines are described in the main paper.
The baselines in DL tasks, namely k-Means, ENT, BALD, LPL and BADGE, are implemented in PyTorch. For ENT, BALD and RAND, we use the implementation of DeepAL (https://github.com/ej0cl6/deep-active-learning) [Huang, 2021]. For k-Means, to speed up the process, we use the faiss-gpu version (https://github.com/facebookresearch/faiss/tree/26abede81245800c026bd96a386c51430c1f4170) [Johnson et al., 2019] instead of the scikit-learn version used in [Huang, 2021]. For LPL [Yoo and Kweon, 2019], we adopted a re-implementation (https://github.com/Mephisto405/Learning-Loss-for-Active-Learning). For BADGE [Ash et al., 2020], we adopted the original version provided by the authors (https://github.com/JordanAsh/badge). We provide brief introductions of BALD, LPL and BADGE as follows:
Loss Prediction Loss (LPL) [Yoo and Kweon, 2019]: a loss prediction strategy that attaches a small parametric module to the target model. The module is trained to predict the loss of unlabeled inputs with respect to the target model by minimizing the loss prediction loss between the predicted loss and the target loss. LPL picks the top data samples with the highest predicted loss.
Batch Active learning by Diverse Gradient Embeddings (BADGE) [Ash et al., 2020]: in the first stage, it measures uncertainty as the gradient magnitude with respect to the parameters of the output layer; in the second stage, it clusters the samples with k-Means++.
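The two stages of BADGE described above can be sketched in NumPy as follows. This is a simplified illustration rather than the authors' code: it assumes that the class probabilities and penultimate-layer features are already available as arrays, uses the predicted class as a pseudo label, and the function names are hypothetical.

```python
import numpy as np

def gradient_embeddings(probs, feats):
    """Last-layer gradient embeddings under the predicted (pseudo) label."""
    probs = np.asarray(probs, dtype=float)
    feats = np.asarray(feats, dtype=float)
    pseudo = np.argmax(probs, axis=1)
    # Gradient of the cross-entropy w.r.t. the logits: p - one_hot(pseudo).
    delta = probs.copy()
    delta[np.arange(len(probs)), pseudo] -= 1.0
    # Outer product with the features, flattened to one vector per sample.
    return (delta[:, :, None] * feats[:, None, :]).reshape(len(probs), -1)

def kmeanspp_select(emb, batch_size, seed=0):
    """k-Means++ seeding: sample points proportionally to squared distance."""
    rng = np.random.default_rng(seed)
    # Start from the embedding with the largest norm (highest uncertainty).
    chosen = [int(np.argmax(np.linalg.norm(emb, axis=1)))]
    d2 = np.sum((emb - emb[chosen[0]]) ** 2, axis=1)
    while len(chosen) < batch_size and d2.sum() > 0:
        idx = int(rng.choice(len(emb), p=d2 / d2.sum()))
        chosen.append(idx)
        # Keep, for every point, the distance to its nearest chosen center.
        d2 = np.minimum(d2, np.sum((emb - emb[idx]) ** 2, axis=1))
    return chosen
```

Confident predictions yield embeddings with small norm, so they are rarely sampled, while the distance-proportional sampling spreads the batch over different regions of the embedding space.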
We ran all our experiments on a single Tesla V100-SXM2 GPU with 16GB memory, except for the SIMILAR-related experiments. Since SIMILAR needs much more memory, we ran those experiments (SIMILAR on down-sampled CIFAR10) on another Tesla V100-SXM2 GPU workstation with 94GB memory in total.
A.2.3 Licenses of datasets and methods
We list the licenses of the datasets used in our experiments:
We list the licenses of the libraries used in our experiments:
PyTorch [Paszke et al., 2019]: Modified BSD.
CCAL [Du et al., 2021]: Not listed.
SCMI (DISTIL, https://decile-team-distil.readthedocs.io/en/latest/) [Kothawade et al., 2021]: MIT License.
BADGE [Ash et al., 2020]: Not listed.
LPL [Yoo and Kweon, 2019]: Not listed.
KMeans (faiss library [Johnson et al., 2019] implementation): MIT License.
MAHA (https://github.com/pokaxpoka/deep_Mahalanobis_detector) [Lee et al., 2018]: Not listed.
GMM, GPC, LR (scikit-learn library): BSD License.
ResNet18 [He et al., 2016]: MIT License.
A.3 Experimental results
A.3.1 Additional experimental results of classical ML tasks
This section gives additional results for the classical ML experiments of the main paper, including accuracy vs. budget curves and number of OOD samples selected vs. budget curves.
We present the complete accuracy vs. budget curves and the number of OOD samples selected vs. budget curves during the AL processes in Figure 6 and Figure 7, respectively. Figure 5, Figure 6 and Figure 7 show that the capability of the Mahalanobis distance is limited by how distinct the ID and OOD data distributions are. EX8 has distinct ID/OOD data distributions (see Figure 5-a), so the Mahalanobis distance distinguishes ID and OOD data well and reaches the optimal performance (in Figure 7-a, MAHA has the same curve as IDEAL-ENT). For the vowel and letter datasets, however, the ID/OOD data distributions are not as distinct as in EX8 (see Figure 5 b-r), which degrades the performance of MAHA and, in turn, of our POAL: POAL performs less well on the vowel and letter datasets than on EX8. Similar conclusions appear in [Ren et al., 2019]. Finding better ways to differentiate ID and OOD data distributions is left as future work.
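To make the role of the Mahalanobis distance concrete, the following is a minimal sketch of a Mahalanobis-based ID confidence score in the style of Lee et al. [2018], assuming class-conditional Gaussians with a shared covariance. The function names, the use of raw feature vectors and the pseudo-inverse are assumptions of this example; it is not the MAHA implementation used in the experiments.

```python
import numpy as np

def fit_class_gaussians(feats, labels):
    """Estimate per-class means and a shared (tied) precision matrix."""
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    means = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # Center every sample by the mean of its own class.
    centered = feats - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(feats)
    return means, np.linalg.pinv(cov)

def id_confidence(x, means, precision):
    """Negative squared Mahalanobis distance to the closest class mean."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - means[None, :, :]  # shape (n, classes, dim)
    d2 = np.einsum("ncd,de,nce->nc", diff, precision, diff)
    return -d2.min(axis=1)
```

A sample far from every class mean receives a very negative score. When the OOD data overlaps with the ID classes, as in the vowel and letter datasets, the scores of ID and OOD samples overlap as well, which is the limitation discussed above.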
Flexibility: POAL with more sampling strategies, i.e., LAL and QBC.
To demonstrate the flexibility of POAL, i.e., its ability to incorporate more AL sampling schemes, we conduct a simple experiment on classical ML tasks in which we incorporate QBC [Seung et al., 1992] and LAL [Konyushkova et al., 2017]. These basic AL sampling strategies were chosen according to the comparative survey [Zhan et al., 2021b], in which both QBC and LAL show competitive performance across multiple tasks. We conduct the experiments on the EX8 and vowel datasets and repeat the trials 10 times in each experiment. The results are presented in Figure 9. This experiment shows that POAL is flexible enough to handle various AL sampling strategies.
A.3.2 Additional experimental results on DL tasks.
POAL vs. CCAL.
CCAL [Du et al., 2021] is described in detail in Section A.4.1. Here we compare POAL and CCAL on the CIFAR10-04 and CIFAR10-06 datasets, using the same experimental design as in our other experiments. We use the open-source implementation of CCAL (https://github.com/RUC-DWBI-ML/CCAL/). We train SimCLR [Chen et al., 2020], the semantic/distinctive feature extractor provided by CCAL's source code; the two feature extractors are trained for 700 epochs with a batch size of 32 on a single V100 GPU. The comparative results are shown in Figure 8. POAL performs moderately better than CCAL on the CIFAR10-04 dataset: POAL reaches an AUBC of 0.762 and CCAL 0.754. Note that in Figure 8-c, POAL is about as effective as CCAL at avoiding the selection of OOD samples on CIFAR10-04. On the CIFAR10-06 task, as the OOD ratio increases, POAL outperforms CCAL more clearly: POAL reaches an AUBC of 0.84, while CCAL reaches only 0.819. POAL also selects fewer OOD samples than CCAL, as shown in Figure 8-d. Both POAL and CCAL perform better than normal AL sampling schemes (i.e., ENT, LPL, BALD, KMeans, BADGE and Random) on both CIFAR10-04 and CIFAR10-06.
POAL vs. SIMILAR.
Since SIMILAR (with the SCMI function FLCMI as acquisition function, see Section A.4.1) is both time and memory consuming on large-scale datasets, we could not run it with the same settings as our other experiments. In the original paper [Kothawade et al., 2021], the authors down-sample the CIFAR10 training set from to , and set classes 0 to 8 as ID classes and 9 to 10 as OOD classes. They first randomly select ID data samples as the initial labeled set, and then randomly select ID samples which, together with the remaining OOD samples, form the unlabeled data pool (size is ). We followed the implementation details provided in [Kothawade et al., 2021] and the open-source code (https://github.com/decile-team/distil). Specifically, we use ResNet18 [He et al., 2016] with the same implementation as theirs, and an SGD optimizer with an initial learning rate of 0.01, a momentum of 0.9 and a weight decay of . The learning rate is decayed using cosine annealing at each epoch. Following [Kothawade et al., 2021], the batch size is 250 and the total budget is . However, the random seed and the number of repeated experiments are not available, so we do not know how the dataset was split/down-sampled and cannot completely reproduce the original experimental environment of [Kothawade et al., 2021]. In our experiments, we set the random seed to 4666 and repeat the experiment 3 times. We use the source code of the FLCMI submodular function (https://github.com/decile-team/distil/blob/main/distil/active_learning_strategies/scmi.py).
The comparison is shown in Figure 10. Besides POAL and SIMILAR, we also provide further baselines as reference, i.e., IDEAL-ENT, ENT, Margin [Wang and Shang, 2014] and Random. As shown in Figure 10, both SIMILAR and POAL perform better than normal AL sampling strategies. In terms of the AUBC evaluation metric, our model is comparable to SIMILAR (0.667 vs. 0.669), but POAL has a lower standard deviation than SIMILAR, that is, POAL is more stable. In terms of the accuracy vs. budget curves, POAL is better than SIMILAR in the early stages (e.g., Budget < 2,500), and SIMILAR exceeds POAL in the later stages. The reason is that SIMILAR calculates the similarities between the ID labeled set and the unlabeled pool, and the dissimilarities between the OOD labeled set and the unlabeled pool. In the early stages, the labeled OOD data is insufficient, so SIMILAR selects more OOD samples, as shown in Figure 10-b. This experiment suggests that SIMILAR only performs well once it has enough information about both ID and OOD data samples, which requires selecting more OOD samples. POAL only considers the distance between unlabeled samples and ID labeled samples, so it is more effective at preventing OOD sample selection. Additionally, POAL is more widely applicable, since it can be used on large-scale datasets, whereas SIMILAR is limited by the available computational resources. Our method is also more time efficient than SIMILAR, as shown in Table 2: SIMILAR runs about five times longer than POAL.
|Time||6419.0 (109.9)||32837.7 (897.7)||2225.0 (12.6)||1690.0 (15.3)||1507.7 (5.8)||1573.7 (9.8)|
Hyper-parameter sensitivity in POAL with early stopping conditions.
In our main paper and Algorithm 1, a hyper-parameter (the sliding window size) controls the early stopping condition. A larger means a stricter early stopping condition, and vice versa. If there is no significant change within iterations/populations, we can stop early. In our experiments, we set . In this section, we conduct an ablation study to see whether a stricter or less strict condition influences the final performance, as shown in Table 3. We found that (less strict) does not affect the model performance (there is no significant difference in AUBC performance between and ). A less strict early stopping condition can also require less running time, e.g., on the CIFAR100-04 dataset, the average running time is 66880.7 seconds with and 4932.3 seconds with .
|Dataset||Method||AUBC||Time (seconds)|
|CIFAR10-04||POAL with||0.7620 (0.0033)||62082.7 (7371.9)|
|CIFAR10-04||POAL with||0.7613 (0.0024)||56107.0 (3364.2)|
|CIFAR10-06||POAL with||0.8400 (0.0029)||34946.0 (4860.2)|
|CIFAR10-06||POAL with||0.8403 (0.0029)||47362.0 (188.7)|
|CIFAR100-04||POAL with||0.4807 (0.0009)||66880.7 (1583.5)|
|CIFAR100-04||POAL with||0.4797 (0.0017)||4932.3 (1891.2)|
|CIFAR100-06||POAL with||0.5253 (0.0005)||62762.0 (4277.4)|
|CIFAR100-06||POAL with||0.5270 (0.0008)||55955.3 (5162.4)|
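The sliding-window early stopping rule discussed in this subsection can be sketched as follows. The sketch assumes that the state of the Pareto set is summarized by one scalar per iteration and that "no significant change" means that the range of this scalar within the window stays below a tolerance; the class name, the scalar summary and the tolerance are illustrative assumptions, not the exact criterion of Algorithm 1.

```python
from collections import deque

class SlidingWindowStopper:
    """Signal a stop once a monitored value stays stable for `window` updates."""

    def __init__(self, window, tol=1e-4):
        self.window = window
        self.tol = tol
        self.history = deque(maxlen=window)  # keeps only the latest values

    def update(self, value):
        """Record a new value and return True if the search can stop."""
        self.history.append(value)
        if len(self.history) < self.window:
            return False  # not enough observations yet
        return max(self.history) - min(self.history) <= self.tol
```

A larger window delays the stop signal, which corresponds to the stricter condition described above.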
Summarizing: comparison of running time and AUBC performance across all models and DL tasks.
To compare the efficiency of the AL sampling strategies in our experiments, we summarize the overall performance across all models and datasets, including the mean and standard deviation (SD) of the AUBC performance and of the running time. The running time is measured from the start of the AL process to the output of the final basic learner. Table 4 shows that our model outperforms all baselines (except for the ideal model, IDEAL-ENT) in terms of AUBC performance. In terms of running time, POAL and CCAL are both considerably slower than normal AL sampling strategies. Among POAL, CCAL and SIMILAR, which are designed specifically for AL under OOD data scenarios, the running cost of POAL is affordable. Moreover, as mentioned in the preceding discussion of hyper-parameter sensitivity, the running time of POAL can be reduced by loosening the early stopping condition (less ).
|Method||CIFAR10-04 AUBC||CIFAR10-04 Time (seconds)||CIFAR10-06 AUBC||CIFAR10-06 Time (seconds)|
|RAND||0.7500 (0.0014)||10805.3 (148.0)||0.8080 (0.0008)||5217.7 (348.0)|
|IDEAL-ENT||0.8007 (0.0029)||14578.0 (838.2)||0.8737 (0.0019)||9398.0 (696.4)|
|POAL ()||0.7620 (0.0033)||62082.7 (7371.9)||0.8400 (0.0029)||34946.0 (4860.2)|
|POAL ()||0.7613 (0.0024)||56107.0 (3364.2)||0.8403 (0.0029)||47362.0 (188.7)|
|CCAL||0.7543 (0.0009)||76129.3 (5853.7)||0.8190 (0.0037)||35715.3 (328.5)|
|ENT||0.7350 (0.0029)||10155.3 (872.3)||0.7960 (0.0057)||5485.7 (655.4)|
|LPL||0.7510 (0.0071)||12384.7 (1555.9)||0.7873 (0.0103)||3783.3 (326.9)|
|BADGE||0.7437 (0.0029)||41646.0 (12723.5)||0.8063 (0.0037)||17539.3 (628.0)|
|KMeans||0.7440 (0.0014)||18785.7 (2571.3)||0.8100 (0.0022)||8627.3 (407.2)|
|BALD||0.7480 (0.0022)||10478.3 (679.7)||0.8060 (0.0022)||4158.3 (431.1)|
|TwoStage||0.7300 (0.0029)||8342.3 (1059.4)||0.7970 (0.0028)||5524.7 (821.3)|
|WeightedSum-1.0||0.7337 (0.0019)||9489.7 (454.5)||0.8033 (0.0033)||2006.3 (13.8)|
|WeightedSum-5.0||0.7327 (0.0029)||11892.3 (1733.9)||0.8060 (0.0045)||7998.3 (81.6)|
|WeightedSum-0.2||0.7340 (0.0045)||8249.3 (2019.2)||0.8063 (0.0012)||5751.0 (795.3)|
|Method||CIFAR100-04 AUBC||CIFAR100-04 Time (seconds)||CIFAR100-06 AUBC||CIFAR100-06 Time (seconds)|
|RAND||0.4560 (0.0016)||11563.3 (985.5)||0.4453 (0.0026)||11075.3 (1198.2)|
|IDEAL-ENT||0.5250 (0.0008)||17736.7 (1694.7)||0.5707 (0.0017)||19098.0 (1458.8)|
|POAL ()||0.4807 (0.0009)||66880.7 (1583.5)||0.5253 (0.0005)||62762.0 (4277.4)|
|POAL ()||0.4797 (0.0017)||4932.3 (1891.2)||0.5270 (0.0008)||55955.3 (5162.4)|
|ENT||0.4267 (0.0034)||11202.3 (2484.7)||0.4100 (0.0036)||10940.0 (1318.4)|
|LPL||0.4140 (0.0037)||17797.0 (4093.8)||0.4087 (0.0042)||5633.0 (252.3)|
|BADGE||0.4530 (0.0000)||58877.7 (13876.3)||0.4430 (0.0008)||58046.3 (17296.9)|
|KMeans||0.4527 (0.0005)||34944.3 (4121.3)||0.4413 (0.0021)||17465.0 (3570.3)|
|BALD||0.4467 (0.0040)||14699.0 (2675.9)||0.4313 (0.0061)||4574.3 (140.5)|
|TwoStage||0.4347 (0.0021)||11484.3 (4360.1)||0.4137 (0.0017)||3234.3 (91.1)|
|WeightedSum-1.0||0.4267 (0.0025)||12722.0 (4164.8)||0.4103 (0.0005)||9455.3 (468.5)|
|WeightedSum-5.0||0.4287 (0.0039)||14822.3 (6249.7)||0.4073 (0.0050)||11309.3 (1303.4)|
|WeightedSum-0.2||0.4290 (0.0022)||11736.7 (4394.3)||0.4097 (0.0033)||8865.7 (1230.1)|
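For readers who want to compute a metric comparable to the AUBC values in the tables above, a minimal sketch is given below. It assumes that AUBC is the area under the accuracy vs. budget curve, computed with the trapezoidal rule and normalized by the budget range so that the result lies on the accuracy scale; the function name and the normalization are assumptions of this example.

```python
def aubc(budgets, accuracies):
    """Normalized trapezoidal area under an accuracy vs. budget curve."""
    if len(budgets) != len(accuracies) or len(budgets) < 2:
        raise ValueError("need at least two (budget, accuracy) pairs")
    area = 0.0
    for i in range(1, len(budgets)):
        width = budgets[i] - budgets[i - 1]
        # Trapezoid between two consecutive evaluation points.
        area += 0.5 * width * (accuracies[i] + accuracies[i - 1])
    return area / (budgets[-1] - budgets[0])
```

With this normalization, a method that reaches high accuracy early obtains a larger value than one that reaches the same final accuracy only at the end of the budget.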
A.4 Related Work
A.4.1 Comparison between POAL and CCAL, SIMILAR
In this section, we discuss related work on AL under OOD data scenarios. As mentioned in the main paper, there is little such work. To the best of our knowledge, only two published papers discuss AL under OOD data scenarios: Contrastive Coding for Active Learning (CCAL) [Du et al., 2021] and Submodular Information Measures Based Active Learning In Realistic Scenarios (SIMILAR) [Kothawade et al., 2021]. We introduce these two papers in detail and compare them with our work.
Du et al. [2021] address the class distribution mismatch problem in AL; their goal is to select the most informative samples with matched categories. CCAL is the first work on AL under class distribution mismatch. It proposes a contrastive coding based method that extracts semantic and distinctive features by contrastive learning. Semantic features are category-level features that can be exploited to filter out invalid samples with mismatched categories. Distinctive features describe the individual level; they are AL task-specific features used to select the most representative and informative class-matched unlabeled data samples. CCAL achieves good performance on well-studied datasets, e.g., CIFAR10 and CIFAR100. CCAL relies on self-supervised models such as SimCLR and CSI, which extract semantic and distinctive features better than standard feature representations such as the output of the penultimate layer of a neural network. However, this also brings a limitation: training a self-supervised model has high computational and time costs, especially on large-scale datasets. Based on the released code of CCAL (https://github.com/RUC-DWBI-ML/CCAL), it took us 3 days to train a distinctive feature extraction model (700 epochs, batch size 32, on the CIFAR10 dataset) on a V100 GPU. Additionally, CCAL adopts weighted-sum optimization to combine the semantic and distinctive scores, with the following acquisition function:
where the threshold is used to selectively narrow the semantic scores of samples, and controls the slope of the tanh function. This introduces two hyper-parameters for balancing the semantic score and the distinctive score. As discussed in Section 4.3 of [Du et al., 2021], the choice of the two hyper-parameters influences the final performance. The final shortcoming comes from the calculation of the distinctive score: to obtain representativeness information, CCAL needs pair-wise comparisons over the whole data pool, which is very time and memory consuming on large-scale datasets.
SIMILAR is an active learning framework that uses previously proposed submodular information measures (SIM) as acquisition functions. In effect, [Kothawade et al., 2021] is an application of SIM [Iyer et al., 2021, Kaushal et al., 2021]. Kothawade et al. [2021] proposed many variants of SIM to deal with various realistic scenarios, e.g., data imbalance, rare classes and OOD data. AL under OOD data scenarios is only one sub-task of their work. For this sub-task, they adopted the submodular conditional mutual information (SCMI) function that best matches it, namely Facility Location Conditional Mutual Information (FLCMI), which is defined as:
where the currently labeled OOD points serve as the conditioning set , the currently labeled in-distribution (ID) points serve as the query set , and is the unlabeled data set. The function jointly models the similarity between and and their dissimilarity with . The advantages and disadvantages of SIMILAR are both clear. Its main strength is strong theoretical support [Iyer et al., 2021, Kaushal et al., 2021]. However, FLCMI is both time and memory consuming and cannot handle large-scale datasets. For instance, we could not run experiments on the full CIFAR10 dataset with one V100 GPU with 32GB memory, since it reports a memory error. Indeed, in [Kothawade et al., 2021] the OOD-related experiments are conducted on a down-sampled CIFAR10 dataset of size (the size of the whole CIFAR10 training set is ).
To sum up, existing research on AL under OOD data scenarios leaves the following problems open: 1) time/memory cost on large-scale datasets (in both CCAL and SIMILAR); 2) additional trade-off hyper-parameters for balancing different (even conflicting) criteria, which need to be tuned by hand or on an extra validation set. Our POAL framework addresses both issues. For the first problem, the pre-selection and early stopping techniques save time and memory by reducing the search space, so the cost can be adapted to the available computational resources. For the second problem, Pareto optimization balances the various (even conflicting) objectives in a self-adaptive way, without trade-off hyper-parameters. Therefore, our proposed framework can be adapted more easily to various tasks.
The empirical comparison between POAL and CCAL/SIMILAR is presented in Section A.3.2.
A.4.2 Differences between POAL and POSS
In this part, we focus on the differences between our proposed POAL and Pareto Optimization for Subset Selection (POSS) [Qian et al., 2015]. POSS is originally applied to the subset selection problem, which is defined as follows:
where is a set of variables, is a criterion function and is a positive integer. The subset selection problem aims to select a subset such that is optimized with the constraint , where denotes the size of a set. The subset selection problem is NP-hard in general [Davis et al., 1997].
POSS solves the subset selection problem by separating it into two objectives, namely optimizing the criterion function $f$ and minimizing the subset size $|S|$. The two objectives usually conflict, so the problem is transformed into a multi-objective optimization problem:
$$\min_{S \subseteq V} \big( f(S),\ |S| \big).$$
Because the objectives can be traded off in different ways, a multi-objective optimization algorithm needs to find a Pareto set $P$ containing Pareto optimal solutions. Specifically, POSS first initializes $P$ with a random solution, and then selects a solution from $P$ to generate a new solution by flipping each bit with probability $1/n$ ($n$ is the number of variables). The new solution is added to $P$ if it is not strictly dominated by any solution in $P$, and the solutions in $P$ that are dominated by the new solution are removed. POSS repeats this iteration $T$ times, and is proved to achieve the optimal approximation guarantee with an expected running time of $E[T] \le 2ek^2n$ [Qian et al., 2015].
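The POSS iteration described above can be illustrated with a short Python sketch. It is not the reference implementation of [Qian et al., 2015]: the function names (`poss_sketch`, `dominates`, `weakly_dominates`) are hypothetical, both objectives are minimized, the parent is assumed to be drawn uniformly at random from the Pareto set, and the search is assumed to start from the empty subset (rather than a random one) so that a solution of size at most $k$ is always available.

```python
import random


def weakly_dominates(a, b):
    """True if objective tuple a is no worse than b on every minimized objective."""
    return all(x <= y for x, y in zip(a, b))


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return weakly_dominates(a, b) and any(x < y for x, y in zip(a, b))


def poss_sketch(f, n, k, iterations, seed=0):
    """Bi-objective bit-flip search over subsets encoded as 0/1 tuples of length n."""
    rng = random.Random(seed)

    def objectives(s):
        # (criterion value, subset size); both are minimized
        return (f(s), sum(s))

    start = (0,) * n  # assumption: start from the empty subset
    pareto = {start: objectives(start)}

    for _ in range(iterations):
        # assumption: pick the parent uniformly at random from the Pareto set
        parent = rng.choice(list(pareto))
        # flip each bit independently with probability 1/n
        child = tuple(b ^ int(rng.random() < 1.0 / n) for b in parent)
        obj = objectives(child)
        # discard the child if an existing solution strictly dominates it
        if any(dominates(o, obj) for o in pareto.values()):
            continue
        # drop solutions that the child (weakly) dominates, then add the child
        pareto = {s: o for s, o in pareto.items() if not weakly_dominates(obj, o)}
        pareto[child] = obj

    # report the best criterion value among solutions that respect |S| <= k
    feasible = [s for s in pareto if sum(s) <= k]
    return min(feasible, key=f), pareto
```

Note that the bit-flip step can change the subset size from one iteration to the next; the size constraint is only enforced when the final solution is read off the Pareto set.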
Our POAL differs from POSS in three aspects. First, although POSS takes the form of a bi-objective optimization, it actually supports only one criterion function, whereas POAL must optimize two different criterion functions, which is a much harder problem. Second, POSS solves the problem under the constraint $|S| \le k$, while POAL needs fixed-size subset solutions, so the search space is different. Third, the two algorithms generate new solutions differently. POSS generates new solutions by flipping bits of the solutions from previous iterations. This operation may change the subset size, which violates our requirement of fixed-size solutions. We therefore adopt a Monte-Carlo approach that generates fixed-size solutions independently of the solutions from previous iterations.
Given these differences, we cannot rely on the theoretical bound on the number of iterations proved for POSS. Nevertheless, previous research [Qian et al., 2017] shows that the Pareto set empirically converges much faster than this theoretical bound (see Figure 2 in [Qian et al., 2017]). Therefore, we propose an early-stopping technique to terminate our Monte-Carlo POAL algorithm.
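To make the contrast with POSS concrete, the following Python sketch shows one way a Monte-Carlo search over fixed-size subsets with early stopping could look. It is an illustration under stated assumptions, not the POAL implementation: the names `monte_carlo_pareto` and `dominates_max` are hypothetical, each of the two objectives is assumed to be the sum of per-sample scores (an AL score and an ID confidence score) and is maximized, and the stopping rule (halt after `patience` consecutive samples that leave the Pareto set unchanged) is a simple stand-in for the early-stopping criterion.

```python
import random


def dominates_max(a, b):
    """True if a is no worse than b on every maximized objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def monte_carlo_pareto(al_scores, id_scores, batch_size,
                       max_iters=10000, patience=200, seed=0):
    """Sample fixed-size subsets and keep the non-dominated ones."""
    rng = random.Random(seed)
    pool = range(len(al_scores))
    pareto = {}  # sorted index tuple -> (AL objective, ID objective)
    stale = 0    # consecutive samples that did not change the Pareto set
    iters = 0

    for iters in range(1, max_iters + 1):
        # each candidate is drawn independently of earlier ones, so its size is fixed
        subset = tuple(sorted(rng.sample(pool, batch_size)))
        # assumption: both objectives are sums of per-sample scores
        obj = (sum(al_scores[i] for i in subset),
               sum(id_scores[i] for i in subset))

        if subset in pareto or any(dominates_max(o, obj) for o in pareto.values()):
            stale += 1
        else:
            # remove solutions dominated by the new candidate, then keep it
            pareto = {s: o for s, o in pareto.items() if not dominates_max(obj, o)}
            pareto[subset] = obj
            stale = 0

        # assumed early-stopping rule: the Pareto set has stopped changing
        if stale >= patience:
            break

    return pareto, iters
```

Because every candidate is sampled from scratch, no bit-flip step can alter the subset size, which is the property the fixed-size setting requires. The price is that candidates do not build on earlier good solutions.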
a.5 Limitations and negative societal impacts
As mentioned in Section 5 of the main paper, the limitations of our work come from two aspects: (1) the Mahalanobis-distance-based ID confidence score calculation; and (2) the Monte-Carlo sampling scheme that POAL uses to search for non-dominated fixed-size subset solutions, which is not highly efficient. In future work, we will try more suitable feature representations that yield more distinct ID and OOD data distributions, such as the semantic and distinctive feature representations in [Du et al., 2021]. Regarding the efficiency of the Monte-Carlo sampling scheme, many sampling schemes, such as (adaptive) importance sampling and Metropolis-Hastings sampling, would be more efficient than Monte-Carlo sampling, but they might suffer from initialization problems, which Monte-Carlo sampling does not. In future work, we will try to propose more efficient sampling methods for finding non-dominated subset solutions.
The negative societal impacts of our work include the possibility that using POAL with the pre-selection strategy (POAL-PS) on large-scale datasets introduces bias: pre-selection first filters out some unlabeled data samples in each AL iteration, and these samples then have no chance of being selected. Our work also has positive societal impacts. Thanks to its simplicity and compatibility, our framework can easily be adapted to various data scenarios, and it places no restrictions on the downstream tasks it is applied to. The framework is also flexible: it can incorporate various AL sampling strategies, various OOD detection methods, different scales of target data and various user requirements (e.g., maximum budget, batch size).