Deep learning models improve when trained with more labeled data (Goodfellow et al., 2016). A standard deep learning development procedure involves constructing a large-scale labeled dataset and optimizing a model on it. Yet, in many real-world scenarios, large-scale labeled datasets can be very costly to acquire, especially when expert annotators are required, as in medical diagnosis and loan prediction, or when some labels correspond to rare cases, as in self-driving cars. An ideal framework would integrate data labeling and model training to improve model performance with a minimal amount of labeled data.
Active learning (AL) (Balcan et al., 2009) assists the learning procedure by judicious selection of unlabeled samples for labeling, with the goal of maximizing the ultimate model performance with minimal labeling cost. We focus on pool-based AL where an unlabeled data pool is given initially and the AL mechanism iteratively selects batches to label in conjunction with training. As the name “learning-based AL selection” suggests, each batch is selected with guidance from the previously-trained model, is labeled, and then added into the labeled dataset on which the model is trained.
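The pool-based procedure described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the function name and the `oracle`, `train`, and `score` callables are hypothetical placeholders for the annotator, the training routine, and the selection metric.

```python
import random


def active_learning_loop(pool, oracle, train, score,
                         start_size, batch_size, cycles, seed=0):
    """Pool-based AL: uniform start, then model-guided batch selection.

    pool   : list of unlabeled (hashable) samples
    oracle : sample -> label, stands in for a human annotator
    train  : (labeled dict, unlabeled list) -> model
    score  : (model, sample) -> float, higher means more worth labeling
    All four names are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Uniformly sampled starting subset.
    labeled = {x: oracle(x) for x in unlabeled[:start_size]}
    unlabeled = unlabeled[start_size:]
    model = train(labeled, unlabeled)

    for _ in range(cycles):
        if not unlabeled:
            break
        # Rank the remaining pool using the previously trained model.
        ranked = sorted(unlabeled, key=lambda x: score(model, x), reverse=True)
        batch = ranked[:batch_size]
        for x in batch:
            labeled[x] = oracle(x)
        chosen = set(batch)
        unlabeled = [x for x in unlabeled if x not in chosen]
        # Retrain on the enlarged labeled set (and the remaining pool).
        model = train(labeled, unlabeled)
    return model, labeled
```

Passing the unlabeled pool to `train` leaves room for a semi-supervised objective; a purely supervised routine can simply ignore that argument.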
Maximizing performance with minimal labeled data requires properly leveraging both model learning and AL sample selection, especially in early AL cycles. Since the unlabeled pool already serves as the candidate set in the AL sample selection phase, it is natural for pool-based AL methods to integrate semi-supervised learning (SSL) objectives and improve performance by learning meaningful data representations from the unlabeled pool (Zhu et al., 2003; Tomanek and Hahn, 2009). However, fewer existing AL methods consider SSL during training (Drugman et al., 2019a; Rhee et al., 2017; Zhu et al., 2003; Sener and Savarese, 2018) than utilize only labeled samples. Moreover, we believe that the AL selection criterion should be coherent with the SSL objectives in order to select the most valuable samples, since (1) unsupervised losses could significantly alter the learned representation and decision manifolds (Oliver et al., 2018), and the AL sample selection should reflect that; and (2) SSL already embodies knowledge from unlabeled data in a meaningful way, so AL selection should reflect the extra value of the labeled data on top of it. Motivated by these observations, we propose an AL framework that combines SSL with AL, together with a selection metric that is strongly related to the training objective.
In the absence of labeled data, a common practice to initiate AL is to uniformly select a small starting subset of data for labeling. Learning-based AL selection is then used in subsequent cycles. The size of the starting subset affects AL performance: when the start size is not sufficiently large, the models learned in subsequent AL cycles are highly skewed and result in biased selection, a phenomenon commonly known as the cold start problem (Konyushkova et al., 2017; Houlsby et al., 2014). When cold start issues arise, learning-based selection yields samples that lead to lower performance improvement than naive uniform sampling (Konyushkova et al., 2017). Increasing the start size alleviates the cold start problem, but consumes a larger portion of the labeling budget before learning-based AL selection is utilized. With a better understanding of the data, our method relieves this problem by allowing learning-based sample selection to be initialized from a much smaller start size. However, an ideal solution is to determine a proper start size that is large enough to avoid cold start problems, yet sufficiently small to minimize the labeling cost. To this end, we propose a measure that is empirically shown to be helpful in estimating the proper start size.
Contributions: We propose a simple yet effective selection metric that is coherent with the training objectives in SSL. The proposed AL method is based on an insight that has driven recent advances in SSL (Berthelot et al., 2019; Verma et al., 2019; Xie et al., 2019): a model should be consistent in its decisions between a sample and its meaningfully distorted versions. This motivates us to introduce an AL selection principle: a sample whose distorted variants yield low consistency in predictions indicates that the SSL model is incapable of distilling useful information from that unlabeled sample, and thus human labeling is needed. Experiments demonstrate that our proposed metric outperforms previous methods integrated with SSL. With various quantitative and qualitative analyses, we demonstrate the rationale behind why such a selection criterion is highly effective in AL. In addition, in an exploratory analysis, we propose a measure that can be used to assist in determining the proper start size to mitigate cold start problems.
2 Related Work
Traditional AL methods can be roughly classified into three categories: uncertainty-based methods, diversity-based methods, and expected model change-based methods. Among uncertainty-based ones, methods based on max entropy (Lewis and Catlett, 1994; Lewis and Gale, 1994) and max margin (Roth and Small, 2006; Balcan et al., 2007; Joshi et al., 2009) are popular for their simplicity. Some other uncertainty-based methods measure distances between samples and the decision boundary (Tong and Koller, 2001; Brinker, 2003). Most uncertainty-based methods use heuristics, while recent work (Yoo and Kweon, 2019a) directly learns the target loss of inputs jointly with the training phase and shows promising results. Diversity-based methods select diverse samples that span the input space the most (Nguyen and Smeulders, 2004; Mac Aodha et al., 2014; Hasan and Roy-Chowdhury, 2015; Sener and Savarese, 2018). There are also methods that consider both uncertainty and diversity (Guo, 2010; Elhamifar et al., 2013; Yang et al., 2015). The third category estimates the future model status and selects samples that encourage optimal model improvement (Roy and McCallum, 2001; Settles et al., 2008; Freytag et al., 2014).
Both AL and SSL aim to improve learning with limited labeled data, so they are naturally related. Only a few works have considered combining AL and SSL in different tasks. In Drugman et al. (2019b), joint application of SSL and AL is considered for speech understanding, and significant error reduction is demonstrated with limited labeled speech data. For AL, their selection criterion is based on a confidence score that quantifies the observed probabilities of words being correct. Rhee et al. (2017) propose an active semi-supervised learning system which demonstrates superior performance in the pedestrian detection task. Zhu et al. (2003) combine AL and SSL using Gaussian fields and validate their method on synthetic datasets. Sener and Savarese (2018) also consider SSL during AL cycles. However, in their setting, the performance improvement from adding SSL is marginal in comparison to their supervised counterpart, potentially due to the suboptimal SSL method.
Agreement-based methods, also referred to as “query-by-committee”, base the selection decisions on the opinions of a committee which consists of independent AL metrics or models (Seung et al., 1992; Cohn et al., 1994; McCallum and Nigam, 1998; Iglesias et al., 2011; Beluch et al., 2018; Cortes et al., 2019a). Our method is related to agreement-based AL, where samples are selected based on the conformity of different metrics or models. Specifically, our method selects the data whose predictions disagree most with the predictions of their augmentations.
3 Consistency-based Semi-supervised AL
3.1 Proposed Method
We consider the setting of pool-based AL, where an unlabeled data pool is available for selection of samples to label. To minimize the labeling cost, we propose a method that unifies selection and model updates, overviewed in Algorithm 1. The proposed method has two key aspects.
Most conventional AL methods base model learning only on the available labeled data, which ignores the useful information in the unlabeled data. Our first contribution is incorporating a semi-supervised learning (SSL) objective in the training phases of AL. Specifically, as shown in Algorithm 1, each model is learned by minimizing an objective that combines a supervised loss on the labeled data with an unsupervised loss on the unlabeled data. The model should both fit the labeled data well and obtain a good representation of the unlabeled data.
The design of the selection criterion plays a crucial role in integrating SSL and AL. To this end, our second contribution is a selection criterion that better integrates the AL selection mechanism into the SSL training framework.
It has been observed that predictions of deep neural networks are sensitive to small perturbations of the input data (Zheng et al., 2016; Azulay and Weiss, 2018). Recent successes in SSL (Athiwaratkun et al., 2019; Berthelot et al., 2019; Verma et al., 2019) are based on minimizing this sensitivity to perturbations by inducing “consistency”, i.e., imposing similarity in predictions when the input is perturbed in a way that would not change its perceptual content. For consistency-based semi-supervised training, a common choice of loss is $\mathcal{L}_u(x, M) = D\big(P(\hat{Y} \mid x, M),\, P(\hat{Y} \mid \tilde{x}, M)\big)$, where $D$ is a distance function such as KL divergence (Xie et al., 2019) or L2 norm (Laine and Aila, 2017; Berthelot et al., 2019), and $\tilde{x}$ denotes a perturbation (augmentation) of the input $x$. Our proposal is motivated by the following intuition. First, the unsupervised objective exploits unlabeled data by encouraging consistent predictions across slightly distorted versions of each unlabeled sample. Labeling samples with highly inconsistent predictions is valuable, since the consistency loss $\mathcal{L}_u$ is hard to minimize on these samples. Thus, they need human annotations to provide further useful supervision for model training. Second, the data that yields a large model performance gain is not necessarily the data with the highest uncertainty, since neural networks prefer learning with a particular curriculum (Bengio et al., 2009). The most uncertain data could be too hard to learn, and including them in training would be misleading. Thus, we argue that labeling samples that can be recognized to some extent, but not consistently, should benefit learning more than labeling the most uncertain ones.
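A consistency loss of the kind described above compares the prediction on an input with the prediction on a perturbed copy. The sketch below is illustrative only: `predict` and `perturb` are hypothetical callables standing in for the model and the augmentation, and the two distances are plain textbook definitions rather than any particular SSL library's API.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def squared_l2(p, q):
    """Squared L2 distance between two probability vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def consistency_loss(predict, x, perturb, distance=squared_l2):
    """Distance between predictions on x and on a perturbed copy of x."""
    return distance(predict(x), predict(perturb(x)))


# Toy usage: a 1-D "model" whose class probabilities depend on x.
toy_predict = lambda x: softmax([x, -x])
toy_perturb = lambda x: x + 0.5
loss_l2 = consistency_loss(toy_predict, 1.0, toy_perturb)
loss_kl = consistency_loss(toy_predict, 1.0, toy_perturb, distance=kl_divergence)
```

Both distances are zero when the perturbation leaves the prediction unchanged and grow as the two predictions diverge.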
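Specifically, we propose a simple metric that measures the inconsistency of predictions across perturbations. There are various ways to quantify consistency. Due to its empirically observed superior performance, we choose
$$\mathcal{E}(x, M) = \sum_{\ell=1}^{J} \mathrm{Var}\big[P(\hat{Y} = \ell \mid x, M),\, P(\hat{Y} = \ell \mid \tilde{x}_1, M),\, \ldots,\, P(\hat{Y} = \ell \mid \tilde{x}_N, M)\big], \tag{1}$$
where $J$ is the number of response classes and $N$ is the number of perturbed samples of the original input data $x$, $\{\tilde{x}_1, \ldots, \tilde{x}_N\}$, which can be obtained by standard augmentation operations (e.g., random crops and horizontal flips for image data). Our method selects data samples with high $\mathcal{E}(x, M)$ values for labeling, which may possess varying levels of difficulty for the model to classify.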
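One way to compute an inconsistency score of this kind is to take, for each class, the variance of the predicted probability across the original input and its perturbed views, and then sum over classes. The sketch below assumes that variance-based form and uses the population variance (dividing by the number of views); the function names are hypothetical.

```python
def inconsistency(view_probs):
    """Sum over classes of the variance of predicted probabilities.

    view_probs: list of class-probability vectors, one per view
                (the original input plus its perturbed copies).
    Population variance is an assumption made for this sketch.
    """
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    total = 0.0
    for c in range(n_classes):
        column = [p[c] for p in view_probs]
        mean = sum(column) / n_views
        total += sum((v - mean) ** 2 for v in column) / n_views
    return total


def select_for_labeling(pool_probs, k):
    """Return the k sample ids with the highest inconsistency.

    pool_probs: dict mapping sample id -> list of per-view probability vectors.
    """
    scores = {i: inconsistency(p) for i, p in pool_probs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy usage: "stable" keeps its prediction across views, "flippy" does not.
pool_probs = {
    "stable": [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]],
    "flippy": [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],
}
picked = select_for_labeling(pool_probs, k=1)
```

A sample with identical predictions on every view scores zero, so it is never preferred over a sample whose predicted class changes under augmentation.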
[Table 1. Columns: Setting | Methods | # of labeled samples in total]
3.2 Comparisons with Baselines
The practical performance of our method is demonstrated on two commonly used image classification datasets, CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). Both datasets have 60K images in total, of which 10K images are for testing. CIFAR-10 consists of 10 classes and CIFAR-100 has 100 fine-grained classes. Different variants of SSL methods encourage consistency in different ways. In our implementation, we adopt the recently proposed state-of-the-art method MixMatch (Berthelot et al., 2019), which uses a specific loss term to encourage consistency. Following Berthelot et al. (2019), we use Wide ResNet-28 (Oliver et al., 2018) with filters as the base model and keep the default hyper-parameters for different settings from Berthelot et al. (2019). In each cycle, is initialized with . We select samples for labeling by default. augmentations of each image are obtained by horizontally flipping and random cropping, but we observe that augmentations can produce comparable results. For a fair comparison, different selection methods start from the same initial model () and the reported results are over 5 trials.
We consider three representative selection methods for comparison:
Uniform indicates random selection (no AL).
k-center (Sener and Savarese, 2018) selects representative samples by maximizing the distance between a selected sample and its nearest neighbor in the labeled pool. The feature from the last fully connected layer of the target model is used to calculate distances between samples.
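The k-center baseline is commonly implemented with a greedy rule: repeatedly pick the pool point that is farthest from its nearest already-chosen center. The sketch below shows that greedy rule on plain feature vectors; it is an illustrative approximation, not the reference implementation of Sener and Savarese (2018), and the function name is hypothetical.

```python
import math


def k_center_greedy(features, labeled_idx, k):
    """Greedy k-center selection on a list of feature vectors.

    features    : list of equal-length numeric vectors (e.g. last-layer features)
    labeled_idx : non-empty list of indices that are already labeled
    k           : number of new indices to select (assumed not to exceed
                  the number of distinct unlabeled points)
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Distance from every point to its nearest labeled point.
    min_dist = [min(dist(f, features[j]) for j in labeled_idx) for f in features]

    selected = []
    for _ in range(k):
        # Farthest point from the current set of centers.
        i = max(range(len(features)), key=lambda t: min_dist[t])
        selected.append(i)
        # The new center may now be the nearest one for some points.
        min_dist = [min(d, dist(f, features[i])) for d, f in zip(min_dist, features)]
    return selected


# Toy usage on a line: index 0 is labeled; the farthest points are picked first.
toy_features = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0]]
picked_centers = k_center_greedy(toy_features, labeled_idx=[0], k=2)
```

Because the rule looks only at distances in feature space, it spreads selections over the pool regardless of how uncertain the model is about each point.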
As shown in Table 1, our method significantly outperforms the baseline methods which only learn from labeled data at each cycle. When only samples are labeled, our method outperforms k-center by accuracy. Next, we focus on comparing different methods in the SSL framework. Figure 1 shows the effectiveness of our consistency-based selection in the SSL setting by comparing it with the baselines when they are integrated into SSL. Our method outperforms the baselines by a clear margin: on CIFAR-10, with labeled images, our method outperforms uniform (passive selection) by and outperforms k-center, the state-of-the-art method, by . As the number of labels increases, it is harder to improve model performance, but our method outperforms uniform selection with K labels using only K labels, halving the labeled data requirement for similar performance. Given access to all the labels (K) for the entire training set, a fully supervised model achieves an accuracy of (Berthelot et al., 2019). Our method has comparable performance ( lower accuracy) with only K labeled samples. CIFAR-100 is a more challenging dataset as it has more categories. On CIFAR-100, our method consistently outperforms the baselines at all AL cycles.
[Table 2. Columns: Methods | # of labeled samples in total]
3.3 Analyses of consistency-based selection
To explain its performance, we analyze the samples selected by our method from several perspectives, which are known to be important for AL.
Uncertainty and overconfident misclassification: Uncertainty-based AL methods query the data samples close to the decision boundary. However, deep neural networks yield poorly calibrated uncertainty estimates when the raw outputs are considered: they tend to be overconfident even when they are wrong (Guo et al., 2017; Lakshminarayanan et al., 2017). Entropy-based AL metrics cannot distinguish such overconfident misclassifications, and thus result in suboptimal selection. Figure 2 (left) demonstrates that our consistency-based selection is superior to entropy in detecting misclassifications made with high confidence. In Figure 2 (middle), we use entropy to measure the uncertainty of the samples selected by different methods. It shows that the uniform and k-center methods do not base selection on uncertainty at all, whereas consistency tends to select highly uncertain samples, but not necessarily the most uncertain ones. Such samples should contribute to the performance gap with entropy. Figure 2 (right) illustrates some selected samples that are misclassified with high confidence.
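The blind spot of entropy-based selection can be seen with two hand-made predictions. The probabilities below are invented for illustration only: an overconfident prediction has low entropy whether or not it is correct, so a top-entropy ranking places it below a merely borderline prediction.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical predictions for two unlabeled samples.
predictions = {
    "overconfident_but_wrong": [0.98, 0.02],
    "borderline": [0.55, 0.45],
}

# Rank samples from most to least uncertain, as max-entropy selection would.
ranking = sorted(predictions, key=lambda s: entropy(predictions[s]), reverse=True)
```

With a labeling budget of one sample, max-entropy selection would take the borderline sample and leave the confident error unqueried.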
Diversity: Diversity has been proposed as a key factor for AL (Yang et al., 2015). k-center is a state-of-the-art AL method based on diversity (it prefers to select data points that span the whole input space). To examine this factor, Figure 3 (right) visualizes the diversity of samples selected by different methods. We use principal component analysis to reduce the dimensionality of the embedded samples to a two-dimensional space. Uniform chooses samples with equal probability from the unlabeled pool. Samples selected by entropy are clustered in certain regions. On the other hand, consistency selects data samples as diverse as those selected by k-center. The average distances between the top samples selected by different methods are shown in Figure 3 (top-left). We can see that entropy chooses samples having small average distances, while consistency has a much larger average distance, which is comparable to uniform and k-center.
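An average-distance diversity measure like the one referred to above can be computed as the mean Euclidean distance over all pairs of selected samples. This is a minimal sketch under that assumption; the paper does not specify the exact distance or embedding used, and the toy points are made up.

```python
import math
from itertools import combinations


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two points")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Toy usage: a tight cluster versus a spread-out selection.
clustered = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
spread = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]
d_clustered = mean_pairwise_distance(clustered)
d_spread = mean_pairwise_distance(spread)
```

A selection concentrated in one region of the embedding space yields a small value, while a selection that spans the space yields a large one.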
Class distribution complies with classification error: Figure 3 (bottom-left) shows the per-class classification error and the class distribution of samples selected by different metrics. The class distributions of samples selected by entropy and consistency are correlated with the per-class classification error, unlike those of samples selected by uniform and k-center.
4 When can we start learning-based AL selection?
4.1 Cold-start failure
When the size of the initial labeled dataset is too small, the learned decision boundaries could be skewed and AL selection based on the model outputs could be biased. To illustrate the problem, Figure 4 shows the toy two-moons dataset, using a simple support vector machine (in the supervised setting with the RBF kernel) to learn the decision boundary (Oliver et al., 2018). As can be seen, the naive uniform sampling approach achieves better predictive accuracy by exploring the whole space. On the other hand, the samples selected by max entropy concentrate around a poorly learned boundary. In another example, we study the effects of cold start using deep neural networks on CIFAR-10, shown in Figure 5. Using uniform sampling to select different starting sizes, AL methods achieve different predictive accuracy. For example, the model starting with data points clearly under-performs the model starting with samples, when both models reach 150 labeled samples. This may be due to the cold start problem encountered when . However, given a limited labeling budget, naively choosing a large start size is also not practically desirable, because it may lead to under-utilization of learning-based selection. For example, our method starting from labeled samples has better performance than starting from 150 or 200, since we have more AL cycles in the former case given the same label budget.
The semi-supervised nature of our learning proposal encourages the practice of initiating learning-based sample selection from a much smaller start size. However, the learned model can still be skewed at extreme early AL stages. These observations motivate us to propose a systematic way of inferring a proper starting size. We analyze this problem and propose an approach to assist in determining the start size in practice.
4.2 An exploratory analysis in start size selection
Recall from the last step of Algorithm 1, if is set such that , i.e., if the entire dataset has been labeled, then the final model is trained to minimize the purely supervised loss on the total labeled dataset . Consider the cross-entropy loss function for any classifier , which we call the AL target loss:
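As a point of reference, the cross-entropy loss of a classifier over a fully labeled training set is conventionally written as below. The notation is an assumption made for this sketch and may differ from the authors' own: $\mathcal{D}$ is the full training set with labels, $M$ is the classifier, and $P(\hat{Y} = y \mid x, M)$ is the probability that $M$ assigns to the true label $y$ of input $x$.

```latex
\mathcal{L}_{\text{target}}(M)
  = -\frac{1}{|\mathcal{D}|} \sum_{(x,\, y) \in \mathcal{D}} \log P(\hat{Y} = y \mid x, M)
```

Evaluating this quantity requires a label for every element of $\mathcal{D}$, which is why it cannot be computed directly at an intermediate AL cycle.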
Note that the goal of an AL method can be viewed as minimizing the AL target loss with a small subset of the entire training set (Zhu et al., 2003). In any intermediate AL step, we expect the loss on the current labeled subset to mimic the target loss. If cold start problems occur, the model does a poor job in approximating and minimizing equation 2. The quality of the samples selected in the subsequent AL cycles would be consequently poor. Therefore, it is crucial to understand the performance of the currently-learned model in minimizing the criterion in equation 2. However, since the labeled data set at cycle is a strict subset of the total training set , it is impossible to simply plug the most recently learned model in equation 2 for direct calculation.
Our approximation to the target loss is based on the following proposition, which gives upper and lower bounds on the expected loss, to which the target loss approximates:
Proposition 1. For any given distribution of , and any learned model , we have
where is the cross-entropy between two distributions and , is the entropy of the random variable
is the entropy of the random variable, and .
Proposition 1 indicates that the expected cross-entropy loss can be both upper and lower bounded. In particular, both bounds involve the quantity , which suggests that could potentially be tracked to analyze for different numbers of samples. Unlike the unavailable target loss on the entire training set, does not need all data to be labeled. In fact, to compute , we just need to specify a distribution for , which could be assumed from prior knowledge or estimated using all of the labels in the starting cycle.
In Figure 6, we observe a strong correlation between the target loss and , where is assumed to be uniform. We see how can be used to identify the trend when the actual target is minimized. In particular, in the SSL setting, a practitioner may set the starting set size to or labeled samples on CIFAR-10, as the value of essentially ceases decreasing, which coincides with the oracle stopping points if we were given access to the target loss. In contrast, a start size of has much higher , which leads to less favorable performance. A similar pattern in the supervised learning setting is shown in Figure 6.
5 Conclusion and Future work
We presented a simple pool-based AL selection metric to select data for labeling by leveraging unsupervised information from unlabeled data during training. Experiments show that our method outperforms previous state-of-the-art AL methods under the SSL setting. Our proposed metric implicitly balances uncertainty and diversity when making selections. The design of our method focuses on the principle of consistency in SSL. For alternative SSL methods based on other principles, it is necessary to revisit AL selection with respect to their training objectives, which will be considered in future work. In addition, we study and address a practically valuable yet challenging question: “When can we start learning-based AL selection?”. We present a measure to assist in determining a proper start size. Experimental analysis demonstrates that the proposed measure correlates well with the AL target loss (i.e., the ultimate supervised loss on all labeled data). In practice, it can be tracked to examine the model without requiring a large validation set.
Discussions with Giulia DeSalvo, Chih-kuan Yeh, Kihyuk Sohn, Chen Xing, and Wei Wei are gratefully acknowledged.
References
- There are many consistent explanations of unlabeled data: why you should average. International Conference on Learning Representations.
- Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177.
- Agnostic active learning. Journal of Computer and System Sciences 75 (1), pp. 78–89.
- Margin based active learning. International Conference on Computational Learning Theory, pp. 35–50.
- The power of ensembles for active learning in image classification. pp. 9368–9377.
- Curriculum learning. Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48.
- MixMatch: a holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249.
- Incorporating diversity in active learning with support vector machines. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 59–66.
- Improving generalization with active learning. Machine Learning 15 (2), pp. 201–221.
- Active learning with disagreement graphs. In International Conference on Machine Learning, pp. 1379–1387.
- Agnostic active learning without constraints. In International Conference on Machine Learning.
- A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pp. 353–360.
- Hierarchical sampling for active learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 208–215.
- Active and semi-supervised learning in ASR: benefits on the acoustic and language models. arXiv preprint arXiv:1903.02852.
- Active and semi-supervised learning in ASR: benefits on the acoustic and language models. arXiv preprint arXiv:1903.02852.
- A convex optimization framework for active learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 209–216.
- Selecting influential examples: active learning with expected model output changes. In European Conference on Computer Vision, pp. 562–577.
- Deep learning. MIT Press. Note: http://www.deeplearningbook.org
- On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330.
- Active instance sampling via matrix partition. In Advances in Neural Information Processing Systems, pp. 802–810.
- Context aware active learning of activity recognition models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4543–4551.
- Cold-start active learning with robust ordinal matrix factorization. In International Conference on Machine Learning, pp. 766–774.
- Combining generative and discriminative models for semantic segmentation of CT scans via active learning. In Biennial International Conference on Information Processing in Medical Imaging, pp. 25–36.
- Multi-class active learning for image classification. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2372–2379.
- Learning active learning from data. In Advances in Neural Information Processing Systems, pp. 4225–4235.
- Learning multiple layers of features from tiny images. Technical report, Citeseer.
- Temporal ensembling for semi-supervised learning. International Conference on Learning Representations.
- Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.
- Training confidence-calibrated classifiers for detecting out-of-distribution samples. Neural Information Processing Systems.
- Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994, pp. 148–156.
- A sequential algorithm for training text classifiers. In SIGIR’94, pp. 3–12.
- Hierarchical subquery evaluation for active learning on a graph. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 564–571.
- Employing EM and pool-based active learning for text classification. In Proc. International Conference on Machine Learning (ICML), pp. 359–367.
- Active learning using pre-clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 79.
- Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pp. 3235–3246.
- Active and semi-supervised learning for object detection with imperfect data. Cognitive Systems Research 45, pp. 109–123.
- Margin-based active learning for structured output spaces. In European Conference on Machine Learning, pp. 413–424.
- Toward optimal active learning through Monte Carlo estimation of error reduction. ICML, Williamstown, pp. 441–448.
- Active learning for convolutional neural networks: a core-set approach. International Conference on Learning Representations.
- Multiple-instance active learning. In Advances in Neural Information Processing Systems, pp. 1289–1296.
- Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 287–294.
- Semi-supervised active learning for sequence labeling. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pp. 1039–1047.
- Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2 (Nov), pp. 45–66.
- Interpolation consistency training for semi-supervised learning. International Joint Conferences on Artificial Intelligence.
- Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
- Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision 113 (2), pp. 113–127.
- Learning loss for active learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 93–102.
- Learning loss for active learning. In 2019 IEEE Conference on Computer Vision and Pattern Recognition.
- Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4480–4488.
- Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, Vol. 3.
Appendix A Proof of Proposition 1
Appendix B More Discussion
B.1 Consistency-based AL in supervised learning
We are also curious about how well our method performs under supervised learning using only labeled samples. Following Yoo and Kweon (2019b), we start with 1000 labeled samples on CIFAR-10. As shown in Table 3, after 4 AL cycles (, totaling 3000 labels), uniform, k-center, entropy and our method (consistency) achieve accuracy of , , and , respectively. This shows that consistency still works even if our model is trained using only labeled samples. However, the improvement of consistency over the other baseline methods (especially entropy) is marginal.
[Table 3. Columns: Methods | # of labeled samples in total]
B.2 Out-of-distribution and challenging samples
In real-world scenarios, it is very likely that not all unlabeled data are relevant to the task. Therefore, if a sample remains highly uncertain under arbitrary perturbations, it is probably an out-of-distribution example (Lee et al., 2018). In addition, selecting the hardest samples is not preferred, because they could be “over-challenging” for the current model, as suggested by the study of curriculum learning (Bengio et al., 2009). It can be easily inferred that our proposed selection can avoid such cases (see equation 1). More exploration of active learning with out-of-distribution samples is left for future work.