The term Active Learning describes the field of selecting samples from a given pool of data in order to subsequently train a Machine Learning algorithm on them. This can be done for two major reasons. Firstly, to decide which subset of collected data will be annotated in order to create training and validation sets for a supervised Machine Learning task. While it can be comparatively easy and inexpensive to record and gather sensor data, and ever-decreasing costs make it affordable to largely neglect storage expenses, reliable ground-truth annotation still requires manual labour and is therefore the crucial cost factor. In industrial applications of Machine Learning, budget and time constraints play a significant role, and performance can depend on choosing the best samples to train on.
Secondly, one could think of using Active Learning methods as a form of regularization. While increasing the number of available training samples is generally regarded as helpful, certain factors can impair performance when doing so. The more standardized the objects in a recognition task are, the more redundant information is potentially added to the dataset with each new sample, which can result in worsened generalization. Active Learning methods can also be applied to sanitize a dataset of falsely labelled samples, as a suitable strategy will not pick samples with a conspicuous difference between label and prediction.
However, despite its great potential, Active Learning also bears significant risks. If applied incorrectly, it can lead to a sub-optimal sample selection and, in the worst case, render the complete Machine Learning task unsuccessful. In order to point out how to avoid these pitfalls, we examine a set of known Active Learning query strategies, as well as some extensions of our own, and their performance on several different image classification datasets. We then examine their behaviour under different aspects, including changing hyperparameters, the influence of falsely labelled data, and the replaceability of varying CNN architectures. Finally, we consider the performance of the same strategies when applied to hierarchical classifiers. Our main contributions are:
1.) A robustness investigation of state-of-the-art Active Learning strategies with respect to the impact of falsely labelled data, of hyperparameters, and of changing the classifier model during the selection phase. 2.) An extension of the Entropy-based Active Learning method using the Simpson diversity. 3.) Theoretical insights and experimental results for Active Learning on hierarchical neural networks.
0.2 Related Work
An overview of methods from the pre-Deep-Learning era can be found in the very comprehensive review of . Many approaches originating from that time (e.g. Uncertainty Sampling, Margin-based Sampling, Entropy Sampling, …) have later been adapted to neural networks. Additional examples include the approach of , who applied a Monte Carlo method to compute an estimated error reduction that can be used for sample selection, as well as clustering approaches like those described in  and .
 and  propose a semi-supervised approach. They use Active Learning to query samples which the network has not yet understood, and use label propagation to also utilize well-understood samples with "pseudo-labels".
In the field of supervised learning,  used a Bayes approach to distil a Monte Carlo approximation of the posterior predictive density for sample selection. In the theoretical work of , Active Learning was rephrased as a convex optimisation problem, and the balance between selecting samples with high diversity and samples that are very representative of a subset is discussed. Unlike many other methods, the core-set approach of  does not use the output layer of a network for Active Learning. Instead, they solve a relaxed k-centres problem to minimize the maximal distance of each sample to the closest cluster centre in a space spanned by the neurons of a hidden layer of the network. As discussed later, this approach is largely independent of the actual classes of a network, which can be helpful when dealing with hierarchical networks, for example.
 introduced the concept of live dropout to Active Learning: the idea is to approximate the behaviour of an ensemble of Bayesian estimators by activating dropout during inference over multiple forward passes. They furthermore developed an Active Learning framework which is able to use this and other deep Bayesian methods. In the same line of thought,  investigated live dropout and Query-by-Committee methods. However,  used ensembles of CNNs with identical architectures but different weight initializations to show that ensembles work better than "ensemble approximation methods" like the above-mentioned MC dropout of  or approaches based on geometric distributions like .
Some recent approaches also utilize "meta" knowledge for Active Learning.  introduced "Policy-based Active Learning", where reinforcement learning is used for stream-based Active Learning in a language-processing setting. This is very similar to the approach of , who proposed "Learning Algorithms for Active Learning". They also used Reinforcement Learning to jointly learn a data representation, an item selection heuristic, and a method for constructing prediction functions from labelled training sets.  reuse knowledge from previously annotated datasets to improve Active Learning performance.
0.3 Query Strategies
In the following we review existing methods from the field of pool-based Active Learning and propose an extension of our own. Given a classification model  and a dataset , consisting of feature and label pairs , such an algorithm has the following structure:
Considering a large dataset, one can query numerous samples at once. The set of chosen samples is denoted by .
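The pool-based loop described above can be sketched as follows. This is an illustrative sketch only: the scikit-learn-style `fit`/`predict_proba` interface, the strategy signature, and the batch size are assumptions, not the paper's implementation.

```python
import numpy as np

def active_learning_loop(model, strategy, X_pool, y_pool, X_train, y_train,
                         batch_size=100, iterations=10):
    """Generic pool-based Active Learning loop (illustrative sketch).

    `model` is assumed to expose scikit-learn-style fit/predict_proba;
    `strategy` maps class probabilities to indices of the samples to query.
    """
    for _ in range(iterations):
        model.fit(X_train, y_train)                 # retrain on current set
        probs = model.predict_proba(X_pool)         # scores for the whole pool
        query_idx = strategy(probs, batch_size)     # query a batch of samples at once
        # "annotate" the queried samples and move them into the training set
        X_train = np.concatenate([X_train, X_pool[query_idx]])
        y_train = np.concatenate([y_train, y_pool[query_idx]])
        keep = np.setdiff1d(np.arange(len(X_pool)), query_idx)
        X_pool, y_pool = X_pool[keep], y_pool[keep]
    return X_train, y_train
```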
We take a closer look at uncertainty sampling, a strategy that selects samples the classifier is uncertain about. In this context, uncertainty means a low confidence for the predicted class, given by . We consider three commonly used uncertainty measures:
(a): Considering only one class label, the sample with the least confident prediction is selected. (b): Margin sampling additionally includes information about the second most confident prediction; the algorithm queries the sample with the smallest difference between the two most probable class labels. (c): For multi-class tasks it is relevant to consider all label confidences. For each sample, every class probability is weighted with its information content and summed up; the algorithm queries the sample with the highest entropy.
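The three measures (a)-(c) can be written compactly over a matrix of per-class probabilities (one row per sample). The sketch below is illustrative; the exact normalisation used in the experiments is not specified here.

```python
import numpy as np

def least_confident(probs):
    # (a) 1 - max class probability; higher value = more uncertain
    return 1.0 - probs.max(axis=1)

def margin(probs):
    # (b) gap between the two most probable classes; lower value = more uncertain
    part = np.sort(probs, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(probs, eps=1e-12):
    # (c) Shannon entropy over all class probabilities; higher value = more uncertain
    return -(probs * np.log(probs + eps)).sum(axis=1)
```

A confident prediction such as (0.9, 0.05, 0.05) scores low on (a) and (c) and high on (b), while a uniform prediction scores the opposite way on all three.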
For the following experiments we implement eight query strategies.
Based on Least Confident (a):
- Naive Certainty (NC) Low:
Select samples with the minimal maximal activation in the classifier logits. Since basing the decision only on the single highest activated neuron is a very straightforward approach, we call this family of strategies the "Naive" methods.
- NC Range:
Select samples within a certain range of the classifier logits’ activation (e.g. ).
- NC Diversity:
Select samples with the minimal maximal activation in the classifier logits and additionally prevent similar samples from being chosen, by calculating the diversity of the samples below the threshold compared to those already included in the training set.
- NC Balanced:
Select samples with the minimal maximal activation in classifier logits and balance the class distribution using the reciprocal value of the classification confusion matrix obtained with the previous training set. Terminates if one class contains no more samples to be drawn.
Based on Margin (b):
- Margin:
Select samples with the smallest difference of the two highest firing logits.
Based on Entropy (c):
- Entropy High:
Select samples with the highest entropy.
- Sum of Squared Logits (SOSL):
Select samples with the highest Simpson diversity of the classifier outputs, i.e. with the lowest sum of squared logits (cf. section 0.3.1).
- Core Set Greedy:
A similarity measure in the embedding space. Creates a core set by approximating the problem of distributing k centres among n points, such that the minimal distance of all points to the nearest centre is maximized. Select samples for which the minimum distance to all samples which are already part of the training set is maximized (cf. ).
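As an illustration of the last strategy, a minimal greedy k-centre selection in a hypothetical embedding space might look like the sketch below. The Euclidean distance and the data layout are assumptions for illustration, not the exact implementation of the cited core-set approach.

```python
import numpy as np

def core_set_greedy(embeddings, train_idx, n_query):
    """Greedy k-centre selection in the embedding space (sketch of the
    Core Set Greedy strategy). Repeatedly adds the pool sample whose
    distance to the nearest already-selected sample is largest."""
    selected = list(train_idx)
    # distance of every sample to its closest already-selected centre
    dists = np.min(
        np.linalg.norm(embeddings[:, None] - embeddings[selected][None], axis=2),
        axis=1)
    queried = []
    for _ in range(n_query):
        idx = int(np.argmax(dists))      # farthest point from all current centres
        queried.append(idx)
        # update min-distances to account for the newly added centre
        new_d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new_d)
    return queried
```

Note that the selection uses only hidden-layer embeddings, never the output logits, which is why this strategy remains applicable to hierarchical classifiers later on.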
0.3.1 Sum of Squared Logits (SOSL) Method
In Active Learning, we require a measure of how sure the classifier is that its class decision during inference is accurate. One possibility for such an accuracy-of-inference measure is to analyze the distribution of logits. Within the trained model of the classifier, the logits can be interpreted as probabilities that the inferred sample belongs to the class associated with the respective logit. If the logits are strongly biased in favour of a certain class, it is very likely that the given sample belongs to the class corresponding to the strongest logit. On the contrary, if the logits do not show a clear preference for a certain class, there is a high risk that taking the class of the strongest logit results in a false prediction. In other words, the degree to which the distribution of logits tends towards peaks rather than an equipartition indicates how accurate the inference is going to be.
In previous literature, the Shannon entropy  has been frequently used as a measure of how peaked or equipartitioned a distribution is. A valid strategy for Active Learning could then be to sort out those samples for which the Shannon entropy $H = -\sum_i p_i \log p_i$, with $p_i$ being the values of the logits, is particularly high. However, a shortcoming of this approach is that it does not adequately account for the situation when the distribution of logits is admittedly strongly peaked, but with peaks on more than one class logit. Such a situation can easily arise for samples that belong to classes showing similarities, when the classifier's model does not yet feature a clear decision boundary between them. In such a case, the distribution of logits is still far from an equipartition, resulting in a relatively low value for the Shannon entropy $H$. Thus, although labelling these samples would be particularly valuable for fleshing out the decision boundary and allowing the classifier to better separate between classes, they would not be added to the Active Learning training set.
To overcome these shortcomings of the Shannon entropy as a measure for characterizing the distribution of logits $p_i$, we propose to use the Simpson diversity index $D = 1 - \sum_i p_i^2$ instead. The closer the distribution is to an equipartition, the larger $D$ becomes. If the $p_i$ show a strong peak at a certain class, $D$ is close to zero. Finally, if the $p_i$ are strongly peaked among several classes, $D$ will have a small-to-moderate value between zero and one. The latter property of $D$ in particular allows selecting those samples for labelling for which the classifier can narrow the class decision down to a few classes among which it is still unsure. The Active Learning strategy is then to select in each iteration the samples with the highest $D$.
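The contrast between the two measures can be made concrete numerically. The sketch below assumes the Gini-Simpson form D = 1 - sum(p_i^2), which matches the behaviour described above: relative to its maximum, D rates a two-peak distribution as markedly closer to "uncertain" than the Shannon entropy does.

```python
import numpy as np

def shannon_entropy(p):
    # H = -sum(p_i * log(p_i)), skipping zero entries
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -(nz * np.log(nz)).sum()

def simpson_diversity(p):
    # Gini-Simpson index D = 1 - sum(p_i^2)
    p = np.asarray(p, dtype=float)
    return 1.0 - (p ** 2).sum()

uniform = np.full(10, 0.1)                     # equipartition over 10 classes
two_peak = np.array([0.5, 0.5] + [0.0] * 8)    # two rivalling classes
one_peak = np.array([1.0] + [0.0] * 9)         # confident single-class prediction
```

For the two-peak distribution, H/H_max is about 0.30 while D/D_max is about 0.56, so an entropy-based ranking pushes such samples much further down the query list than the Simpson-based one.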
0.4 Experiments and Results
We conduct a series of experiments with the query strategies presented in section 0.3 on six different datasets for image classification (cf. Table 1). These comprise the well-known digit classification set MNIST , the thereby inspired Latin-alphabet dataset CoMNIST  and the clothing classification set Fashion-MNIST , as well as the general object classification set CIFAR-10  and the house number collection SVHN . We furthermore evaluate the strategies on a private dataset of different classes of traffic signs (TSR), represented through small grey-scale images.
Figure 1: Classification accuracy over training set size for all strategies on Fashion-MNIST (top left), TSR (top right), MNIST (bottom left) and CIFAR-10 (bottom right). The plotted value is the median of five runs and the shaded area denotes one standard deviation.
0.4.1 General Performance
Before we analyse the robustness of the presented query strategies, we compare their general performance on the datasets presented above. For each dataset we use a distinct plain feed-forward CNN; only for CIFAR-10 do we use an implementation of ResNet50 . As we are not aiming to find the best architecture for a certain problem but to identify the most promising samples, we choose the number of layers and channels according to the approximate complexity of the task and select learning rates and batch sizes in commonly used ranges.
For all of these experiments, we start with a training set of  samples per class of the particular dataset. We train the CNN for up to  epochs with an early stopping patience of . For this purpose we split a portion of the training set off into an additional "development set". It is not used for training but to validate classification performance over the course of the training; this is done to obviate an overfitting-like bias from the use of early stopping. The validation accuracy is then of course determined on the original test set of the respective dataset, using the best network weights acquired during training according to the development set accuracy.
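This training protocol can be sketched as follows. The split fraction, epoch limit, patience value and the model interface (`fit_epoch`, `predict`, `get_state`, `set_state`) are all placeholders; the actual values are specific to each experiment and are not reproduced here.

```python
import numpy as np

def train_with_dev_split(model, X, y, dev_fraction=0.1, max_epochs=50, patience=5):
    """Early stopping against a held-out development set (illustrative sketch;
    fraction, epoch limit and patience are placeholder values)."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    n_dev = max(1, int(len(X) * dev_fraction))
    dev, tr = idx[:n_dev], idx[n_dev:]          # split off the development set
    best_acc, best_state, since_best = -1.0, None, 0
    for epoch in range(max_epochs):
        model.fit_epoch(X[tr], y[tr])           # hypothetical one-epoch training step
        acc = (model.predict(X[dev]) == y[dev]).mean()
        if acc > best_acc:                      # track the best dev-set weights
            best_acc, best_state, since_best = acc, model.get_state(), 0
        else:
            since_best += 1
            if since_best >= patience:          # early stopping
                break
    model.set_state(best_state)                 # restore the best weights
    return model
```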
This network is also the one used to select the new samples to be added to the training set via the query strategies. With each iteration we increase the number of samples in the training set by . In all cases we conduct five repetitions per strategy and dataset for statistical significance. To reduce the computational burden, we iteratively draw new samples until we have reached approximately a third of the full size of the respective training set.
Figure 1 illustrates the results of the evaluation of all query strategies. Nearly all findings show a benefit of Active Learning methods, and at least some of the query strategies either hit the baseline or come close to it around the  mark. For CIFAR-10, however, this is not true: none of the methods shows any profit for this dataset; they are in line with random sample selection, resulting in a nearly perfectly linear increase in accuracy. This does not come as a surprise, as CIFAR-10 has very diverse representations of its classes and seems to contain no redundant information.
0.4.2 Changing Hyperparameters and Falsely Labelled Data
As hyperparameter optimisation is very important in fine-tuning the performance of Machine Learning algorithms, we analyse how much changes in these parameters influence the usability of the Active Learning methods shown.
Figure 2 shows the effect of altering the learning rate over two orders of magnitude and the batch size by up to a factor of , for experiments on MNIST. All methods behave very robustly and are not noticeably influenced by these alterations.
Since it can be expected that human annotation, especially in large-scale labelling of sensor data, is never perfectly accurate, it is interesting to investigate how this might interfere with the applicability of Active Learning. In Figure 3 (left) we show results for an experiment in which we purposely introduced false labels into the Fashion-MNIST training set. It can clearly be seen that methods relying on a diversity criterion (NC Diversity, Core Set) suffer the most: since their selection process prevents similar samples from being chosen, it can be harder to correct the negative impact of the selection of a wrongly labelled sample. Please note that these strategies also show the highest sensitivity to changes in dropout (cf. Figure 3, right).
0.4.3 Replaceability of Classifiers
In the application of Machine Learning, especially in a product context, successive refinement of the algorithm is very common. A CNN architecture might be adjusted several times over the course of development or a production process, to optimise the performance or to adapt to changes in the dataset or to external restrictions like computational resources. We investigate how the usability of Active Learning might be influenced if data selection is done by a different network than the one eventually targeted for classification. For this purpose we implemented three CNNs of different capacity, referred to as ,  and  in the following, to iteratively select samples from Fashion-MNIST with the query strategies described above. We then perform a cross-training, where every network is trained with the selections of the others as well as its own. To ensure comparability, we use the same initial dataset of  samples per class for all classifiers and repeat the calculations five times.
Figure 4 shows the results for selected strategies. Apart from information about the replaceability of classifiers, these results show how the classifier capacity itself influences the applicability of Active Learning strategies. For the example of NC Balanced, we note a bias towards the own selection performing best with the  and  classifiers, while the medium-sized one shows indifference. The "weaker" the network gets, the better the performance of the random selection becomes. For SOSL this becomes even clearer: while the selection of the  classifier is still clearly the best for itself, the smaller networks show the best performance with the randomized set. The results with Entropy High are very similar, but the gaps become even more obvious:  now shows a very clear preference for its own selection compared to any other, and the performance of the Active Learning strategy selection on the  network is now more than three percentage points behind random.
0.4.4 Hierarchical Classifiers
To complete our Active Learning robustness study, we examine a neural network structure different from the straightforward CNNs in the preceding sections.
Hierarchical or cascaded classifiers do not use a single label per sample but a whole label tree (cf. ). Consequently, label vectors consist of one of the three following options per class: "1", "0" or "not applicable", and each sample belongs to exactly one class per hierarchy level. Furthermore, during the learning phase each class is treated independently of all others: for an n-class classification problem, n "1-vs-all" classifiers are trained.
This renders all Active Learning strategies that rely on quantifying the uncertainty of the logits useless. All of them (e.g. Naive Certainty, Margin) implicitly rely on the assumption that labels with two possible states are used. As the neurons that belong to classes marked as "not applicable" are not considered during backpropagation (cf. ), they can take arbitrarily high values and thus confuse the mentioned Active Learning methods. As can be seen in Figure 5, this can even result in worse performance than random sampling. However, we can show that methods which work in the embedding space (like the Core Set method) are not affected and thus are also employable for hierarchical neural networks.
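The exclusion of "not applicable" neurons from backpropagation can be sketched with a masked per-neuron loss. The `NA = -1` marker and the sigmoid 1-vs-all formulation below are illustrative assumptions, not the exact training code used in the experiments.

```python
import numpy as np

NA = -1  # illustrative marker for "not applicable" in the label vector

def masked_bce(logits, labels):
    """Binary cross-entropy per 1-vs-all output neuron, skipping
    'not applicable' entries so they contribute no loss or gradient
    (illustrative numpy sketch)."""
    probs = 1.0 / (1.0 + np.exp(-logits))          # sigmoid per 1-vs-all neuron
    mask = labels != NA                            # valid label positions only
    y = np.where(mask, labels, 0).astype(float)    # placeholder target for masked neurons
    eps = 1e-12
    per_neuron = -(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))
    return (per_neuron * mask).sum() / mask.sum()  # average over valid neurons only
```

Because masked neurons never enter the loss, their activations can drift to arbitrarily high values without affecting training, which is exactly what misleads logit-based uncertainty measures.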
In all experiments with the hierarchical classifier we use a private dataset consisting of 12 classes which depict different poses of a human hand (e.g. "One finger", "Two fingers", "Fist Thumb Left", etc.). We use a training set of , a development set of  and a test set of  grey-scale images of size .
As depicted in Figure 6, we use three levels of hierarchy: 1.) "Hand"/"No hand", 2.) Class, 3.) Subclass. A sample of "Fist Thumb Left", for example, would have the labels "Hand" + "Fist Thumb" + "Fist Thumb Left". The neurons of the subclasses in particular often have the label "not applicable", as each subclass belongs to only one class.
0.5 Conclusion
We have presented a study on the robustness of Active Learning. While we show that even plain methods can bring a notable profit in different image classification applications, we emphasise that prior knowledge about the data and the Machine Learning algorithm in use is essential for successful application. As seen in section 0.4.1, methods that work well on a number of datasets might suddenly fail on a different one, and certain data collections might be inherently unsuitable for this kind of active data selection. While many changes in hyperparameters and erroneous labels might not influence the performance of particular strategies (cf. section 0.4.2), classifier changes by all means can (cf. section 0.4.3). Critical alterations in the way a Machine Learning task is tackled, like switching from a straightforward to a hierarchical classifier (cf. section 0.4.4), can turn all previous findings upside down.
These findings underline that Active Learning can be a helpful tool in data science, but it has to be used with knowledge about the targeted utilisation. We aim to continue our endeavours in this field and to expand our considerations to segmentation problems, as well as to ways of automatically assessing promising combinations of data, Machine Learning algorithms and Active Learning strategies, in order to avoid possible pitfalls like the ones presented in this work.
References
- (2017) Learning algorithms for active learning. CoRR abs/1708.00088.
- (2018) The power of ensembles for active learning in image classification. In .
- (2008) Hierarchical sampling for active learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, New York, NY, USA, pp. 208–215.
- (2017) Active learning strategy for CNN combining batchwise dropout and query-by-committee. In 25th European Symposium on Artificial Neural Networks, ESANN 2017, Bruges, Belgium, April 26–28, 2017.
- (2017) Learning how to active learn: a deep reinforcement learning approach. CoRR abs/1708.02383.
- (2017) Deep Bayesian active learning with image data. In ICML.
- (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, pp. 770–778.
- (2018) What do I annotate next? An empirical study of active learning for action localization. In The European Conference on Computer Vision (ECCV).
- (2016) DCNNs on a diet: sampling strategies for reducing the training set size. CoRR abs/1606.04232.
- (2015) Bayesian dark knowledge. In NIPS.
- (2009) Learning multiple layers of features from tiny images. Technical Report 1648, University of Toronto.
- (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
- (2004) Active learning using pre-clustering. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, New York, NY, USA, pp. 79–.
- Deep Bayesian active semi-supervised learning. CoRR abs/1803.01216.
- (2001) Toward optimal active learning through sampling estimation of error reduction. In ICML.
- Active learning for convolutional neural networks: a core-set approach. In International Conference on Learning Representations.
- (2010) Active learning literature survey. Technical report, University of Wisconsin-Madison.
- (1948) A mathematical theory of communication. The Bell System Technical Journal 27 (3), pp. 379–423.
- (1949) Measurement of diversity. Nature 163.
- (2017) CoMNIST: Cyrillic-oriented MNIST. GitHub.
- (2017) Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology 27 (12), pp. 2591–2600.
- (2018) Driver state monitoring with hierarchical classification. pp. 3239–3244.
- (2017) (Website).