Supervised deep learning has achieved remarkable results across a variety of domains by leveraging large, labeled datasets LeCun et al. (2015). However, our ability to collect data far outstrips our ability to label it, and this gap continues to grow. The problem is especially stark in applications such as medical imaging, where the ground truth must be provided by a highly trained specialist. Even in cases where labeled data is sufficient, there may be reasons to limit the amount of data used to train a model, e.g., time, financial constraints, or the desire to minimize the model’s carbon footprint.
Fortunately, the relationship between a model’s performance and the amount of training data is not linear—there often exists a small subset of highly informative samples that can provide most of the information needed to learn to solve a task. In this case, we can achieve nearly the same performance by labeling (and training on) only those informative samples, rather than the entire dataset. The challenge, of course, is that the true usefulness of a sample can only be established a posteriori, after we have used it to train our model.
The growing field of active learning (AL) is concerned with automatically predicting which samples from an unlabeled dataset are most worth labeling. (As noted in Sinha et al. (2019), active learning can also refer to approaches that generate or synthesize novel samples. In this paper, however, we will only be concerned with sampling-based active learning.) In the standard AL framework, a selector identifies an initial set of promising samples; these are then labeled by an oracle (e.g., a human expert) and used to train a task network Gal et al. (2017). The selector then progressively requests labels for additional batches of samples, either up to a percentage threshold (e.g., 40% of the total data) or until a performance target is met. In short, an active learning system seeks to construct the smallest possible training set that will produce the highest possible performance on the underlying task(s).
In this paper, we formulate active learning as an open-set recognition (OSR) problem, a generalization of the standard classification paradigm. In OSR only some of the inputs are from one of the known classes; the classifier must label the remaining inputs as out-of-distribution (OOD) or unknown. Intuitively, our hypothesis is that the samples most worth labeling are those that are most different from the currently labeled pool. Training on these samples will allow the network to learn features that are underrepresented in the existing training data. In short, our AL selection mechanism consists of picking unlabeled samples that are OOD relative to the labeled pool.
Figure 1 illustrates our proposed approach. In more detail, our classifier is a variational neural network (VNN) Mundt et al. (2019b), which produces high-confidence (i.e., low-entropy) outputs only for inputs that are highly similar to the training set. We use the inverse of this confidence measure to select which unlabeled samples to query next. In other words, our selector requests labels for the samples that the classifier is least confident about, because low confidence implies that the existing training set does not contain items that are similar to them. As we detail in Sec. 4, our OSR-based approach achieved state-of-the-art results on a number of datasets and AL variations, far surpassing existing methods.
2 Related Work
Recent approaches to the problem of Active Learning can be categorized as query-acquiring or query-synthesizing Sinha et al. (2019). The distinction lies in whether the unlabeled OOD samples are immediately accessible (pool-based), or are instead synthesized using a generative model Mahapatra et al. (2018); Mayer and Timofte (2018); Zhu and Bento (2017). Assuming access to a pool of unlabeled OOD data, a strategy must be devised which selects only the most useful or informative samples from that distribution. It has been routinely demonstrated that training samples do not contain equal amounts of useful information Settles (2010). In other words, some training distributions result in better task performance than others. Thus, the aim of an active learning system is to minimize the amount of training data required to achieve the highest possible performance on an underlying task, e.g. image classification. This is a form of learning efficiency, which we wish to improve, or maximize. As in Gal et al. (2017), we therefore wish to learn the acquisition function that chooses the data points for which a label should be requested. The learned acquisition function is, in essence, an intelligent sampling function, the aim of which is to outperform random iid sampling of the unlabeled distribution, thereby maximizing the learning efficiency of the system. As such, various sampling strategies have been proposed which can typically be grouped into three broad categories Sinha et al. (2019). They include uncertainty-based techniques, representation-based models Sener and Savarese (2017), and hybrid approaches Nguyen and Smeulders (2004).
Open Set Recognition (OSR), on the other hand, refers to the ability of a system to distinguish between data it has already seen (the training distribution) and data to which it has not yet been exposed (OOD data). Though OSR has been scrutinized for decades, recent progress has come about via careful design of the heuristics used to quantify the similarity between historical and current data distributions Mundt et al. (2019a, b); Higgins et al. (2017). Since such measures have been shown to substantially improve OSR performance, and since OSR is an inherent necessity for Active Learning, it stands to reason that AL systems would benefit from the integration of such techniques. This is intuitive, since it would be unhelpful for a system to request labels for data points that are nearly identical to those it has already seen. Redundancy, or excess similarity, in the training distribution can therefore be said to decrease the learning efficiency of the system. One of the most promising approaches to OSR incorporates ideas from Extreme Value Theory (EVT) in order to quantify the epistemic uncertainty of the model Mundt et al. (2019a).
Though the two fields are seemingly complementary, very little work has been done to merge Active Learning and OSR. In this work, we explicitly merge them by adopting the heuristics used in Mundt et al. (2019a) to quantify a model’s predictive uncertainty w.r.t. newly acquired unlabeled data. From this uncertainty we infer the degree to which the data is likely to improve the performance of a classifier if it were integrated into the labeled training distribution. In the process, we demonstrate how EVT-inspired heuristics can assist in improving the learning efficiency of deep learning systems.
As noted above, our active learning approach iteratively selects samples from an unlabeled pool based on the confidence level of its OSR classifier. Below, we first formalize the active learning paradigm we are tackling, then detail our proposed system. In particular, we provide an overview of VNNs and explain how we use their outputs to select new data points to label.
3.1 Formal problem definition
We now describe our active learning protocol and introduce some notation. Each active learning problem is denoted as $P = (C, D_{\text{train}}, D_{\text{test}})$. It is dedicated to the classification of the classes in a set $C$ and comes with two sets of examples: the first, $D_{\text{train}}$, is used to infer a prediction model, and the second, $D_{\text{test}}$, is used to evaluate the inferred model.
Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ be a dataset consisting of $N$ i.i.d. data points, where each sample $x_i$ is a $d$-dimensional feature vector and $y_i$ represents the target label. The samples in $D$ are partitioned into two disjoint subsets: a labeled set $D_L$ and an unlabeled set $D_U$. We denote the state of a subset at a given iteration of our algorithm as $D_L^t$ ($D_U^t$, resp.), for $t \in \{0, 1, \ldots\}$. For simplicity, we assume that .
At $t = 0$, we randomly select $K$ data samples from the unlabeled pool $D_U$ and request the oracle to provide the labels for these points. We then remove the selected data samples from $D_U$ and add them to the labeled pool $D_L$ along with their labels. Finally, we train our classifier on the labeled pool $D_L$. In each subsequent iteration, we use our OSR criterion (see Sec. 3.2) to select $b$ additional data samples from $D_U$. We query the labels of these new samples, add them to $D_L$, and train our classifier on all the labeled data. We continue this process until the size of $D_L$ reaches a predefined limit (40% of the training set in our experiments).
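The protocol above can be sketched as a short loop. This is a minimal illustration rather than the implementation used in our experiments: the callables `train`, `score`, and `oracle` are hypothetical placeholders for the classifier training routine, the selection criterion, and the labeling oracle, and samples are referred to by integer index.

```python
import random


def active_learning_loop(pool_size, initial_k, budget_b, limit_fraction,
                         train, score, oracle, seed=0):
    """Pool-based active learning: random start, then score-driven queries.

    train(labeled)   -> model          (hypothetical training routine)
    score(model, i)  -> float utility  (higher = more worth labeling)
    oracle(i)        -> label          (hypothetical labeling oracle)
    """
    rng = random.Random(seed)
    unlabeled = set(range(pool_size))
    labeled = {}  # sample index -> label

    # Initial stage: a random batch of K samples is labeled by the oracle.
    for i in rng.sample(sorted(unlabeled), initial_k):
        labeled[i] = oracle(i)
        unlabeled.discard(i)
    model = train(labeled)

    limit = round(limit_fraction * pool_size)
    while len(labeled) < limit and unlabeled:
        # Rank the remaining samples by utility, most useful first.
        ranked = sorted(unlabeled, key=lambda i: score(model, i), reverse=True)
        take = min(budget_b, limit - len(labeled))
        for i in ranked[:take]:
            labeled[i] = oracle(i)
            unlabeled.discard(i)
        # Retrain on all labeled data gathered so far.
        model = train(labeled)
    return model, labeled, unlabeled
```

Keeping the selection criterion behind a single `score` callable means the same loop can be reused for each sampling strategy and for the random-sampling baseline.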
Importantly, other formulations of AL assume access to task boundaries and an i.i.d. distribution. This has clear limitations when the i.i.d. assumption is not satisfied or when the task boundaries are not available. In our experimental setup, we assume the unlabeled pool $D_U$ can contain training data from multiple tasks. In addition, we assume no task IDs. Our OSR selection criterion allows our system to learn multiple tasks without specifying the current task.
3.2 Active learning system
Our AL system (Fig. 1) has two main components: a variational neural network Mundt et al. (2019b), which serves as our classifier, and an entropy-based selection mechanism. We discuss each component below.
3.2.1 Variational Neural Networks (VNNs)
Variational neural networks (VNNs) Mundt et al. (2019b) are a supervised variant of $\beta$-variational autoencoders ($\beta$-VAE) Higgins et al. (2017). The latter is itself a variant of VAEs Doersch (2016) but with a regularized cost function. That is, the cost function for a $\beta$-VAE consists of two terms: the reconstruction error, as with a regular VAE, and an entanglement penalty on the latent vector. This penalty forces the dimensions of the latent space to be as uncorrelated as possible, making them easier to interpret.
A VNN combines the encoder-decoder architecture of a $\beta$-VAE with a probabilistic linear classifier (see Fig. 1 for a visual representation). As such, its loss function includes a classification error, i.e., a supervised signal, in addition to the reconstruction and entanglement terms:
$$\mathcal{L}(\theta, \phi, \xi) = \mathbb{E}_{q_\theta(z|x)}\left[\log p_\phi(x|z) + \log p_\xi(y|z)\right] - \beta\, \mathrm{KL}\left(q_\theta(z|x) \,\|\, p(z)\right) \quad (1)$$
As detailed in Mundt et al. (2019b), $\theta$, $\phi$, and $\xi$ are the parameters of the encoder, decoder, and classifier, resp., while $p_\phi(x|z)$ and $p_\xi(y|z)$ are the reconstruction and classification terms. The last term is the entanglement penalty, which is given by the Kullback-Leibler divergence between the latent vector distribution $q_\theta(z|x)$ and an isotropic Gaussian distribution $p(z)$.
In this work, we evaluated both the full framework discussed above, which uses the loss function in Eq. 1, and a simplified version without the reconstruction error:
$$\mathcal{L}(\theta, \xi) = \mathbb{E}_{q_\theta(z|x)}\left[\log p_\xi(y|z)\right] - \beta\, \mathrm{KL}\left(q_\theta(z|x) \,\|\, p(z)\right) \quad (2)$$
Following the variational formulation shown in Mundt et al. (2019b), both models have natural means to capture epistemic uncertainty. As our experiments show, both versions outperform the state of the art, with one of the two achieving better results overall.
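To make the roles of the three loss terms concrete, the following is a minimal per-sample sketch written as a quantity to minimize (the negated objective). It assumes a diagonal Gaussian posterior parameterized by a mean and log-variance, a squared-error reconstruction term, and plain Python lists in place of tensors; the function names are illustrative and are not taken from Mundt et al. (2019b).

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def cross_entropy(class_probs, label):
    """Classification term: negative log-likelihood of the true label."""
    return -math.log(class_probs[label])


def squared_error(x, x_hat):
    """Reconstruction term (assumed Gaussian likelihood, up to constants)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vnn_loss(x, label, mu, log_var, class_probs, x_hat=None, beta=1.0):
    """Per-sample loss; leave x_hat as None for the classifier-only variant."""
    loss = cross_entropy(class_probs, label)
    loss += beta * kl_to_standard_normal(mu, log_var)
    if x_hat is not None:
        loss += squared_error(x, x_hat)  # only the full model reconstructs
    return loss
```

Passing `x_hat=None` drops the reconstruction term, which corresponds to the simplified variant; the weight `beta` scales the penalty on the latent distribution in both cases.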
3.2.2 Sample Selection
Motivated by the class disentanglement ability of the loss in Eq. 1, we aim to select informative samples from the unlabeled pool $D_U$. However, instead of using information about extreme distance values in the penultimate layer activations to modify a Softmax prediction’s confidence, we propose to employ EVT based on the class-conditional posterior. In this sense, any unlabeled sample is regarded as containing useful information if its distance to the classes’ latent means is extreme with respect to what has been observed for the majority of correctly predicted data instances. That is, the sample falls into a region of low density under the aggregate posterior and is more likely to carry information that is unknown to the neural network with its current parameters.
We therefore define the following two sampling algorithms.
1. Uncertainty Sampling: This is the conventional sampling method, in which the samples picked from the unlabeled pool $D_U$ are those for which the underlying model is most uncertain. Model uncertainty can be measured in several ways. One approach, shown in (2), is to capture the model’s predictions for any given input from the unlabeled pool $D_U$. We compute the utility of the unlabeled instances in $D_U$ to collect $b$ informative samples, where the utility is the model uncertainty, which captures most of the epistemic uncertainty. The selected informative samples are sent to the oracle to obtain their labels. Note that the closer the probability is to zero, the more likely it is that the model is very uncertain about that sample.
2. Weibull Distribution Sampling: Our second sampling technique, shown in Algorithm (3), is based on the heavy-tail Weibull distribution. Consider any stage of the active learning lifecycle, at which $D_L$ is the data used for training the model. We first obtain each class’s mean latent vector over all correctly predicted seen data instances $m = 1, \ldots, M_c$ to construct a statistical meta-recognition model, as shown in Eq. 3, which quantifies the per-class latent means of the correctly classified training samples present in $D_L$:
$$\bar{z}_c = \frac{1}{M_c} \sum_{m=1}^{M_c} \mathbb{E}_{q_\theta(z|x_c^{(m)})}[z] \quad (3)$$
and the respective set of latent distances of the correctly classified points to the class means as
$$\Delta_c = \left\{ f_d\left(\bar{z}_c, \mathbb{E}_{q_\theta(z|x_c^{(m)})}[z]\right) \right\}_{m=1}^{M_c} \quad (4)$$
where $f_d$ signifies the choice of distance metric. We proceed to fit a per-class heavy-tail Weibull distribution on $\Delta_c$ for a given tail size $\eta$. Since the distances are based on the individual class-conditional approximate posteriors, the fit bounds the latent space regions of high density. The tightness of the bound is characterized through $\eta$, which can be seen as a prior belief with respect to the quantity of outliers inherently present in the data. The choice of $f_d$ determines the dimensionality of the obtained distance distributions. For our experiments, we find that the cosine distance, and thus a univariate Weibull distance distribution per class, is sufficient.
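The Weibull-based criterion can be illustrated with a small, dependency-free sketch. It is not the meta-recognition code of Mundt et al. (2019a): here we assume a two-parameter Weibull with its location fixed at zero, fitted by maximum likelihood (bisection on the shape) to the largest cosine distances of each class, and we score an unlabeled latent vector by the smallest Weibull CDF value across classes. All function names, and the choice of taking the minimum over classes, are assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def class_mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_weibull(samples, iters=100):
    """Two-parameter Weibull MLE (location 0) via bisection on the shape."""
    xs = [max(x, 1e-12) for x in samples]
    x_max = max(xs)
    mean_log = sum(math.log(x) for x in xs) / len(xs)

    def g(k):  # increasing in k; its root is the MLE shape
        w = [(x / x_max) ** k for x in xs]  # rescaled to avoid overflow
        weighted = sum(wi * math.log(x) for wi, x in zip(w, xs)) / sum(w)
        return weighted - 1.0 / k - mean_log

    lo, hi = 1e-3, 1e3
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect in log space
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    shape = math.sqrt(lo * hi)
    mean_pow = sum((x / x_max) ** shape for x in xs) / len(xs)
    return shape, x_max * mean_pow ** (1.0 / shape)


def weibull_cdf(d, shape, scale):
    if d <= 0.0:
        return 0.0
    t = shape * math.log(d / scale)
    return 1.0 if t > 50.0 else 1.0 - math.exp(-math.exp(t))


def fit_class_models(latents_by_class, tail_fraction=0.05):
    """latents_by_class maps a class to the latent vectors of its
    correctly classified training samples."""
    models = {}
    for c, vectors in latents_by_class.items():
        mean = class_mean(vectors)
        dists = sorted(cosine_distance(mean, z) for z in vectors)
        tail = dists[-max(2, int(tail_fraction * len(dists))):]
        models[c] = (mean, *fit_weibull(tail))
    return models


def outlier_score(z, models):
    """Near 1 when z is extreme with respect to every class mean."""
    return min(weibull_cdf(cosine_distance(mean, z), shape, scale)
               for mean, shape, scale in models.values())
```

Under these assumptions, unlabeled samples with the highest `outlier_score` would be the ones sent to the oracle, since they lie far from every class mean relative to the distances observed for correctly classified training data.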
3.3 Noisy Oracle and Non-i.i.d. Setup
When applying active learning to real-world applications, human experts traditionally function as oracles that provide labels for the requested samples. When a user or model requests labels for the selected data samples, the quality of those labels depends on the expertise of the oracle. However, humans make mistakes, and these mistakes lead to noisy labels.
We consider both types of oracle: an ideal oracle, which provides labels for the requested samples with no error, and a noisy oracle, which provides labels for the requested images with some percentage of error. This error may arise from a lack of expertise on the part of the oracle or from the oracle confusing similar classes of images, since some classes are ambiguous. To replicate this ambiguity, we also applied noise to related classes.
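A noisy oracle of this kind can be simulated as follows. This is a sketch under stated assumptions: the mapping `superclass_of` (class to group of related classes) and the parameter `noise_rate` are hypothetical names, and a wrong answer is always drawn from the same group as the true class.

```python
import random


def make_noisy_oracle(true_labels, superclass_of, noise_rate, seed=0):
    """Build oracle(i) that, with probability noise_rate, answers with a
    different class drawn from the same super-class as the true label."""
    rng = random.Random(seed)
    siblings = {}
    for cls, sup in superclass_of.items():
        siblings.setdefault(sup, []).append(cls)

    def oracle(i):
        y = true_labels[i]
        others = [c for c in siblings[superclass_of[y]] if c != y]
        if others and rng.random() < noise_rate:
            return rng.choice(others)  # confusion between related classes
        return y                       # correct answer

    return oracle
```

Setting `noise_rate=0.0` recovers the ideal oracle, so the same interface covers both cases considered here.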
4 Experimental Results
In all our experiments, we start with a random acquisition of $K$ samples from the unlabeled pool $D_U$, where $K$ is the initial budget and is set to 10% of the training dataset. The acquired data samples are sent to the oracle for labeling, and the annotated data are added to the labeled pool $D_L$, which serves as training data for our model. The unlabeled pool $D_U$ consists of the remaining data samples, for which labels are unknown. After the model is trained, we sample an additional $b$ samples from the unlabeled pool $D_U$ based on the utility estimates provided for each sample by the sampling function, where $b$ is the budget and is set to 5% of the training dataset at each stage. Once the oracle has labeled the requested data samples, the newly labeled samples are added to the labeled pool $D_L$, and the model is retrained using the new labeled pool. The annotation and training process continues until 40% of the training set has been annotated. We consider both the case where the unlabeled pool consists of samples from the same distribution as the training dataset and the case where it also contains samples from an unknown dataset. We presume the oracle is perfect unless stated otherwise.
We evaluated our models on standard image classification tasks, namely CIFAR-10 and CIFAR-100, both with 60k images of size 32 by 32. We measured the performance of both of our models as the average accuracy over 5 runs. We trained our models at each stage with 10%, 15%, 20%, 25%, 30%, 35%, and 40% of the training set annotated, as labels became available from the oracle.
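As a quick check of the labeling schedule, the cumulative number of annotated samples per stage can be computed directly. The sketch assumes a training set of 50,000 images (the 60k images minus the standard 10k test split); the function name is illustrative.

```python
def labeling_schedule(train_size, initial_fraction, budget_fraction,
                      limit_fraction):
    """Cumulative number of labeled samples at each stage."""
    sizes = [round(initial_fraction * train_size)]
    step = round(budget_fraction * train_size)
    limit = round(limit_fraction * train_size)
    while sizes[-1] + step <= limit:
        sizes.append(sizes[-1] + step)
    return sizes


print(labeling_schedule(50_000, 0.10, 0.05, 0.40))
# [5000, 7500, 10000, 12500, 15000, 17500, 20000]
```

The list has one initial stage followed by six budgeted stages, ending at 20,000 labeled images, i.e., 40% of the training set.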
4.1 Implementation Details and Baselines
Baselines: We compare our method against various competitive methods, including Variational Adversarial Active Learning (VAAL) Sinha et al. (2019), Core-Set Sener and Savarese (2017), Monte-Carlo Dropout Gal and Ghahramani (2016), and Ensembles using Variation Ratios (Ensembles w. VarR) Freeman (1965); Beluch et al. (2018). We also show the performance of deep Bayesian AL (DBAL) Gal et al. (2017), performing sampling with their proposed max-entropy scheme to measure the uncertainty. Finally, we show the results achieved using uniform random sampling, in which samples are picked from the unlabeled pool at random. Random sampling still serves as a competitive baseline in the field of active learning.
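For readers unfamiliar with the uncertainty scores used by some of these baselines, the sketch below computes two of them from the class probabilities of several stochastic forward passes (or ensemble members) for a single input. These are the textbook definitions of max (predictive) entropy and variation ratio, not the baselines' original code; the input format is an assumption.

```python
import math
from collections import Counter


def predictive_entropy(prob_samples):
    """Entropy of the class distribution averaged over T stochastic passes."""
    t = len(prob_samples)
    mean = [sum(col) / t for col in zip(*prob_samples)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)


def variation_ratio(prob_samples):
    """1 - (number of passes voting for the modal class) / T."""
    votes = Counter(max(range(len(p)), key=p.__getitem__)
                    for p in prob_samples)
    return 1.0 - votes.most_common(1)[0][1] / len(prob_samples)
```

Both scores are zero when every pass agrees with full confidence and grow as the passes disagree, so samples with the highest scores are the ones queried.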
We used VGG16 Simonyan and Zisserman (2014) as the encoder for both our probabilistic model and our joint model. The optional decoder architecture is based on 14-layer wide residual networks Higgins et al. (2017); Zagoruyko and Komodakis (2016), in the variational cases with a latent dimensionality of size 60. The encoder is followed by a classifier that consists of a single linear layer. We optimize all models with a mini-batch size of 128 using SGD or ADAM Kingma and Ba (2014), with a learning rate of 0.001 and weight decay. For the EVT-based outlier rejection, we fit Weibull models with a tail size set to 5% of the training data examples per class, and the distance metric used is cosine. Training continues for 200 epochs for all the datasets. The initial labeled pool size for all the experiments is 10% of the training set, which is equivalent to 5000 samples for both CIFAR-10 Krizhevsky et al. (2009) and CIFAR-100 Krizhevsky et al. (2009). The budget size is set to 5% of the training set, which is equivalent to 2500 samples for both CIFAR-10 and CIFAR-100. For clarity of nomenclature, we define our models as below.
4.2 Performance on image classification benchmarks
Fig. 2 (a) (left) shows the performance of our models compared to existing methods. On CIFAR-10, our methods achieve mean accuracies of [84.4%, 89.24%, 89.97%, 91.4%] using 40% of the annotated data, after 6 stages with a budget of 2500 per stage following the initial budget, whereas the baseline accuracy is 92.63% using the entire dataset, denoted as Top-1 accuracy in Fig. 2 (a) (left). The methods that perform closest to our models are VAAL with an accuracy of 80.71%, Core-set with an accuracy of 80.37%, and Ensemble w. VarR with an accuracy of 79.465%. The mean accuracy of both of our models consistently outperforms all other methods shown in Fig. 2 (a) (left), including random sampling, DBAL, and MC-Dropout. To test the effect of the choice of optimizer, we ran all of our models with both SGD and ADAM and found that our Model 1 trained with ADAM significantly outperforms our Model 1 trained with SGD.
To evaluate the scalability of our approach, we evaluate it on the CIFAR-100 dataset, which has a larger number of classes. The maximum achievable mean accuracy on CIFAR-100 is 63.14% using 100% of the data, denoted as Top-1 accuracy in Fig. 2 (a) (right). Our models achieve mean accuracies of [54.47%, 60.68%, 61.25%, 61.93%] over 5 runs using 40% of the annotated data, after 6 stages with a budget of 2500 per stage. As shown in Fig. 2 (b) (right), the methods that perform closest to our models are VAAL with an accuracy of 54.47%, Core-set, and Ensemble w. VarR with an accuracy of 46.78%. Moreover, as shown in Fig. 2 (b) (right), one of our proposed models reaches the top performance of VAAL using only 20% of the annotated training data, and the other using 30%. The proposed models consistently outperform the existing baselines.
4.3 Effect of biased initial pool
We investigated the performance of our models in the setting where the initial labeled pool is biased. A good AL system is expected to discover data samples of unknown classes at an early stage. Intuitively, initial bias can affect model training because the initially labeled samples are not representative of the underlying data distribution and fail to cover most regions of the latent space. We create this bias by intentionally removing the data samples of a chosen number of classes, such that the initial labeled pool does not contain any data sample belonging to these classes. We performed our experiments on CIFAR-100 with 10 and 20 excluded classes, where the data of the randomly chosen classes is removed from the initial labeled pool (stage 1), to see how this affects the performance of the model. We compare it to the case where samples are randomly selected from all classes. As shown in Figure 6, our method is superior to VAAL, Core-set, and random sampling in selecting informative samples from the classes that were underrepresented in the initial labeled set. Our models achieve accuracies of [53.35%, 60.54%, 61.36%, 61.55%] with 20 excluded classes and [54.72%, 60.79%, 61.53%, 61.57%] with 10 excluded classes, whereas the closest methods to our approach, VAAL and Core-set, achieve accuracies of [46.91%, 46.55%] with 20 excluded classes and [47.10%, 47.63%] with 10 excluded classes. Random sampling achieves an accuracy of 45.33% with 10 excluded classes and 45.87% with 20 excluded classes.
4.4 Effect of budget size on performance
We repeated the experiments described in Sec. 4.1 to test the effect of different budget sizes on our model compared to the most competitive baselines on CIFAR-100. We tested our model with budget sizes of $b$ = 5% and $b$ = 10%. As shown in Figure 3, our model outperforms VAAL, Core-Set, Ensemble, and random sampling for both budget sizes. VAAL comes second, followed by Core-set and Ensemble. Our models achieve accuracies of [61.52%, 61.57%, 61.07%, 61.82%] for $b$ = 10 and [54.32%, 60.68%, 61.29%, 61.9%] for $b$ = 20, while the closest approaches fall in the range of 46% accuracy.
4.5 Noisy Oracle and Noisy UnLabeled Pool
In this analysis, we investigate the performance of our models in the presence of noisy data caused by an inaccurate oracle instead of an ideal one. Similar to the VAAL setup Sinha et al. (2019), we assume that erroneous labels are due to the ambiguity between some classes and are not adversarial attacks. In CIFAR-100, each image has a fine label (its class) and a coarse label (the super-class to which it belongs). As shown in the noisy-oracle figure, our models consistently outperform existing models such as VAAL and Core-set, whose sampling is independent of the task learner. In the case of VAAL, a separate VAE and discriminator are used as part of the sampling strategy.
We also consider an extreme case of active learning in which the i.i.d. assumption is relaxed. We intentionally added 20% more data, equivalent to 10,000 images from other datasets, to our existing unlabeled pool. Our network must therefore not only distinguish informative from non-informative samples but also distinguish samples of the current distribution from out-of-distribution ones. Whenever a model selects a sample from another dataset and sends it to the oracle, the human expert discards it. This affects the overall budget, and the discarded samples are placed back in the unlabeled pool, which means that at any given stage the total number of images from other datasets remains 10,000. As the end of the curve in the corresponding figure shows, the increase in accuracy is small, since the noisy unlabeled pool has a higher impact on the sampling methodology. We specifically used our sampling strategy 2 to handle this scenario.
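The budget accounting in this setting can be sketched as a single query step. The names `ranked_ids` and `is_ood` are hypothetical: the former is the list of unlabeled sample indices sorted by decreasing utility, and the latter stands in for the human expert recognizing that a sample comes from another dataset.

```python
def query_with_ood_rejection(ranked_ids, is_ood, budget_b):
    """Spend one stage's budget; out-of-distribution picks are discarded."""
    queried = ranked_ids[:budget_b]
    # In-distribution picks are labeled and join the labeled pool.
    accepted = [i for i in queried if not is_ood(i)]
    # Discarded picks consume budget and return to the unlabeled pool.
    rejected = [i for i in queried if is_ood(i)]
    return accepted, rejected
```

Every rejected sample uses up part of the stage's budget without adding a label, which is why a selector that cannot tell the two distributions apart gains less accuracy per stage.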
4.6 Sampling Time Analysis
The sampling method in an active learning system plays a major role in time-efficient training. We compared our sampling time against the other baseline frameworks. The analysis was done on the CIFAR-10 dataset using a single NVIDIA 1080 Ti, and the overall time required for each model is shown in the table. Our method has sampling times close to those of VAAL and DBAL. Our sampling time is slightly higher than that of VAAL because we need to pass our latent vector through the linear classifier, whereas in the case of VAAL the discriminator outputs the probability directly. However, our training time is better than that of VAAL: VAAL contains a classifier (VGG16, similar to ours), a VAE, and a discriminator, where the VAE and the discriminator are trained with a min-max game, and optimizing all of them together takes far more time than optimizing a single model as in our case. MC-Dropout measures uncertainty using multiple forward passes with 10 dropout masks, which leads to its increased sampling time.
5 Conclusions and Future Work
We present a novel deep learning approach for active learning using open-set recognition techniques. Extensive experiments conducted over standard datasets verify the effectiveness of the proposed approach.
References
- Beluch et al. (2018) The power of ensembles for active learning in image classification. pp. 9368–9377.
- Doersch (2016) Tutorial on variational autoencoders. arXiv e-prints, arXiv:1606.05908.
- Freeman (1965) Elementary applied statistics: for students in behavioral science. John Wiley & Sons.
- Gal and Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. International Conference on Machine Learning, pp. 1050–1059.
- Gal et al. (2017) Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1183–1192.
- Higgins et al. (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6.
- Kingma and Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Krizhevsky et al. (2009) Learning multiple layers of features from tiny images.
- LeCun et al. (2015) Deep learning. Nature 521 (7553), pp. 436–444.
- Mahapatra et al. (2018) Efficient active learning for image classification and segmentation using a sample selection and conditional generative adversarial network. CoRR abs/1806.05473.
- Mayer and Timofte (2018) Adversarial sampling for active learning. CoRR abs/1808.06671.
- Mundt et al. (2019a) Unified probabilistic deep continual learning through generative replay and open set recognition. CoRR abs/1905.12019.
- Mundt et al. (2019b) Open set recognition through deep neural network uncertainty: does out-of-distribution detection require generative classifiers? 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 753–757.
- Nguyen and Smeulders (2004) Active learning using pre-clustering. In ICML.
- Sener and Savarese (2017) Active learning for convolutional neural networks: a core-set approach. arXiv preprint arXiv:1708.00489.
- Settles (2010) Active learning literature survey.
- Simonyan and Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Sinha et al. (2019) Variational adversarial active learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5972–5981.
- Zagoruyko and Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146.
- Zhu and Bento (2017) Generative adversarial active learning. CoRR abs/1702.07956.