. Thus, it is very attractive to propose a proper labeling scheme to reduce the number of labels required in order to train a classifier.
Active learning has been put forward to overcome the above labeling problem. The main assumption behind active learning is that an active learner that can freely select the samples it wants can outperform random sampling while requiring fewer labels. Thus, the main task of active learning is to query as little data as possible, minimizing the annotation cost while maximizing the learning performance. Active learning tries to achieve this by selecting the most valuable samples. However, it is difficult to define or measure the value of one instance to the learning problem. We can view it as the amount of information the instance carries that potentially improves the learning performance once its true label is known. Because no exact measure of this value exists, a great number of selection criteria have been proposed, each estimating the usefulness of a sample from a different perspective.
, variance reduction [16, 17, 18, 19] and “Min-max” view active learning [20, 21]. Query-by-committee trains multiple models as a committee and selects the samples on which the committee members disagree most. Uncertainty sampling prefers the instances with maximum uncertainty. Depending on how uncertainty is measured, uncertainty sampling can be roughly divided into two categories: maximum entropy of the estimated label and minimum distance from the decision boundary [6, 7]. For example, Tong and Koller [6] proposed to query the instance that is closest to the current decision boundary of a support vector machine. Campbell et al. [7] shared the same idea.
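The two flavours of uncertainty sampling described above can be sketched in a few lines. This is only an illustration: the function names and the toy posteriors and decision values are assumptions made for the example, not taken from any of the cited methods.

```python
import numpy as np

def entropy_query(proba):
    """Index of the pool instance whose predicted label distribution has maximum entropy."""
    p = np.clip(proba, 1e-12, 1.0)                 # avoid log(0)
    entropy = -np.sum(p * np.log(p), axis=1)
    return int(np.argmax(entropy))

def margin_query(decision_values):
    """Index of the pool instance closest to the decision boundary (smallest |f(x)|)."""
    return int(np.argmin(np.abs(decision_values)))

# Hypothetical posteriors P(y | x) for four pool instances (columns: y = -1, y = +1).
proba = np.array([[0.95, 0.05],
                  [0.60, 0.40],
                  [0.48, 0.52],
                  [0.10, 0.90]])
# Hypothetical signed distances f(x) of the same instances to the boundary.
decision_values = np.array([-2.9, -0.4, 0.08, 2.2])

print(entropy_query(proba), margin_query(decision_values))
```

On this toy input both rules pick the third instance, whose posterior is nearest to 0.5 and whose decision value is nearest to zero.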
Roy and McCallum [8] proposed expected error reduction (EER), which is a popular active learning method. EER aims to reduce the generalization error when labeling a new instance. Since we do not have access to the test data, Roy and McCallum suggested computing the “future error” on the unlabeled pool, under the assumption that the unlabeled data set is representative of the test distribution. In other words, the unlabeled pool can be viewed as a validation set. Moreover, since the true labels of the unlabeled samples are unknown, EER estimates the average-case potential loss instead. Expected model change follows the idea of EER, but selects the instance that leads to the maximum change of the current model. The variance reduction methods try to minimize the output variance. Schein and Ungar [19] extended this approach to an expected variance reduction method for logistic regression by following the idea of EER. “Min-max” view active learning was originally proposed by Hoi et al. [20], where “min-max” indicates that the worst-case criterion is adopted. The key idea is to select the sample that minimizes the gain of the objective function no matter which label it is assigned. Huang et al. [21] extended this framework by taking into account all the unlabeled data when calculating the objective function.
Current active learning methods can be split into two classes: retraining-based and retraining-free active learning. Retraining-based methods measure the informativeness of an unlabeled sample by assigning it each possible label in turn, adding it to the training set, and retraining the classification model. Then, some appropriate criterion can be evaluated and used for the sample selection. The second class, retraining-free active learning, contains the remaining methods, which do not need to retrain the model for each unlabeled instance during a single selection step. For example, uncertainty sampling and query-by-committee belong to this category.
However, since the true label of a candidate unlabeled instance is unknown, retraining-based methods resort to calculating average-case or worst-case criteria with respect to the unknown label. In this paper, we propose a different criterion for retraining-based methods. We incorporate the uncertainty information (measured by the posterior probabilities) within the min-max framework for the selection. The proposed criterion can be seen as a trade-off between exploration and exploitation. The uncertainty information plays the role of exploitation while the retraining-based models act as the exploration part. We concentrate on the pool-based active learning setting, which assumes that a large pool of unlabeled data is available along with a small set of labeled data. We consider myopic active learning, which sequentially selects one unlabeled instance at a time.
The rest of this paper is organized as follows. Section II first reviews the framework of retraining-based active learning and then briefly describes two state-of-the-art methods under this framework. Section III presents the primary motivation of the proposed method and derives a general algorithm for retraining-based active learning in detail. It also illustrates how to apply the proposed criterion to current methods. Experimental design and results are reported in Section IV; Section V concludes this work and discusses some future issues.
II Retraining-based Active Learning
In this section, we summarize a general framework of retraining-based active learning. Then we demonstrate two examples under this framework: Expected error reduction and Minimum Loss Increase.
II-A Retraining-based Active Learning
First, let us introduce some preliminaries and notation. Let $\mathcal{L} = \{(x_i, y_i)\}_{i=1}^{l}$ represent the training data set that consists of $l$ labeled instances and $\mathcal{U}$ be the pool of unlabeled instances $\{x_i\}_{i=l+1}^{n}$. Each $x_i \in \mathbb{R}^{d}$ is a $d$-dimensional feature vector, and $y_i \in C = \{+1, -1\}$ is the class label of $x_i$. In this paper, we focus on the binary classification problem first; it is easy to extend this work to the multi-class problem by extending $C$ to a multi-label set. We denote by $P_{\mathcal{L}}(y \mid x)$ the conditional probability of $y$ given $x$ according to a classifier trained on $\mathcal{L}$.
The framework of retraining-based active learning is summarized in Algorithm 1, where $V(x, y)$ represents any selection criterion associated with a candidate $x$ and its assigned label $y$. The main procedure contains loops that check all the points in the unlabeled pool over all the possible labels. For example, we first select one instance from the unlabeled pool and assign it one of the possible labels. Then we update the labeled set (since we acquire a new labeled sample) and retrain the classifier. Based on the newly trained classifier, we can measure some selection criterion (e.g., the generalization error in EER [8]). However, since the true label of the selected sample is unknown, we need to aggregate the criterion over the possible labels, e.g., the average case in [8, 19, 13], the worst case in [20, 21], or even the best case. Finally, we query the instance that leads to the maximum or minimum value of the criterion we are interested in.
EER is one example of retraining-based active learning, which uses the generalization error as $V(x, y)$. We get expected model change [12, 13, 14, 15] by adopting model change as the criterion. By adopting the output variance as the criterion and logistic regression as the classifier, we get expected variance reduction [19]. Similarly, if we want to minimize the value of the objective function after labeling a new instance and use the worst-case performance (corresponding to the min-max framework), we obtain the methods of [20, 21]. Clearly, retraining-based approaches may suffer from high computational cost, since they need to go over all the unlabeled data and all the possible labels: a single query requires one retraining per candidate and per label.
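The retraining loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the classifier is a small gradient-descent logistic regression (a stand-in for any probabilistic classifier), the criterion is the summed predictive entropy over the remaining pool (the log-loss proxy used by EER), the aggregation is the average case, and all function names, hyperparameters and the toy data are hypothetical.

```python
import numpy as np

def fit_logreg(X, y, lam=0.01, steps=300, lr=0.5):
    """Gradient descent for L2-regularized logistic regression, labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        coef = y / (1.0 + np.exp(y * (X @ w)))     # per-sample gradient weight
        w -= lr * (lam * w - X.T @ coef) / len(y)
    return w

def predict_proba(w, X):
    """Columns hold P(y = -1 | x) and P(y = +1 | x)."""
    p_pos = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.column_stack([1.0 - p_pos, p_pos])

def pool_entropy(w, X_pool):
    """Sum of predictive entropies over the pool (proxy for the future error)."""
    p = np.clip(predict_proba(w, X_pool), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def retraining_based_query(X_lab, y_lab, X_pool, labels=(-1, +1)):
    w = fit_logreg(X_lab, y_lab)
    prior = predict_proba(w, X_pool)               # P_L(y | x) before retraining
    scores = np.zeros(len(X_pool))
    for j in range(len(X_pool)):                   # loop over candidates
        rest = np.delete(X_pool, j, axis=0)        # assumption: candidate leaves the pool
        X_plus = np.vstack([X_lab, X_pool[j]])
        for k, y in enumerate(labels):             # loop over possible labels
            w_plus = fit_logreg(X_plus, np.append(y_lab, y))   # retrain on L+
            scores[j] += prior[j, k] * pool_entropy(w_plus, rest)  # average case
    return int(np.argmin(scores)), scores

# Toy data: two Gaussian blobs with a constant bias feature appended.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+1.0, size=(8, 2))
X_neg = rng.normal(loc=-1.0, size=(8, 2))
add_bias = lambda X: np.column_stack([X, np.ones(len(X))])
X_lab = add_bias(np.vstack([X_pos[:1], X_neg[:1]]))
y_lab = np.array([+1, -1])
X_pool = add_bias(np.vstack([X_pos[1:], X_neg[1:]]))

best, scores = retraining_based_query(X_lab, y_lab, X_pool)
print("query index:", best)
```

Swapping the criterion or the aggregation line yields the other members of the family; the two nested loops make the cost of one query explicit.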
II-B Expected Error Reduction
Expected error reduction has demonstrated its effectiveness in the text classification domain. There is also follow-up work on EER contributed by other researchers. EER aims to select the sample that will reduce the future generalization error. Since we cannot see the test data, the unlabeled pool is used as a validation set to predict the future test error. A further problem is that we do not know the true labels of the pool. Roy and McCallum [8] suggested that, in practice, the error can be estimated approximately using the expected log-loss or 0/1 loss over the pool. For example, if we adopt the log loss, EER can be written as follows:

$$x^{*} = \arg\min_{x \in \mathcal{U}} \sum_{y \in C} P_{\mathcal{L}}(y \mid x) \left( -\sum_{x_i \in \mathcal{U}} \sum_{y_i \in C} P_{\mathcal{L}^{+}}(y_i \mid x_i) \log P_{\mathcal{L}^{+}}(y_i \mid x_i) \right),$$

where $\mathcal{L}^{+} = \mathcal{L} \cup (x, y)$ means that the selected instance $x$ is labeled $y$ and added to $\mathcal{L}$. Note that the first term, $P_{\mathcal{L}}(y \mid x)$, contains the pre-trained label information. The second term is the sum of potential entropy over the unlabeled data set $\mathcal{U}$.
II-C Minimum Loss Increase
EER attempts to reduce the future generalization error; however, this is not easy because the test data and the true labels of the unlabeled data are missing. Some researchers try to solve this problem from a different perspective. Hoi et al. [20] presented a so-called “min-max” view of active learning. It prefers the instance that results in a small value of an objective function regardless of its assigned label. This is because the smaller the value of the objective function, the better the learning model, at least with high probability. Assume $G_{\mathcal{L}}$ is the value of an objective function on the current labeled data $\mathcal{L}$. When we label a new instance and update the training data to $\mathcal{L}^{+} = \mathcal{L} \cup (x, y)$, we get a new value of the objective function, $G_{\mathcal{L}^{+}}$. What we want is the minimum increase of the objective function, i.e., $G_{\mathcal{L}^{+}} - G_{\mathcal{L}}$, when adding one more labeled sample. Because the second term is independent of the next queried instance, we can ignore it and focus on minimizing $G_{\mathcal{L}^{+}}$. Since we expect a minimum value of $G_{\mathcal{L}^{+}}$ regardless of the assigned label of $x$, we adopt the worst-case performance as follows, instead of the average-case version:

$$x^{*} = \arg\min_{x \in \mathcal{U}} \max_{y \in C} G_{\mathcal{L}^{+}}.$$

Note that we can view $G_{\mathcal{L}^{+}}$ as one choice of $V(x, y)$ mentioned in Algorithm 1.
Let us consider an unconstrained optimization problem using an $\ell_2$-regularized classifier with an arbitrary loss $\ell$: $\min_{w} \frac{\lambda}{2} \|w\|_2^2 + \sum_{(x_i, y_i) \in \mathcal{L}} \ell(w; x_i, y_i)$, where $w$ is the parameter of the classifier. If we adopt the hinge loss $\ell(w; x, y) = \max(0, 1 - y\, w^{\top} x)$, we derive the same model as the “min-max” view active learning described in [20], but without extending it to the batch-mode setting. If we use the square loss $\ell(w; x, y) = (y - w^{\top} x)^2$, we get the same model as [21]. Note that, although [21] includes all the unlabeled data when calculating the objective function, the unlabeled examples play no role since [21] relaxes the constraint on the labels of the unlabeled pool in the end. This operation guarantees zero contribution of the unlabeled data to the objective function. Thus, [21] is also a special case using the square loss. Moreover, we can conclude that the main idea of min-max view active learning is to minimize the increase of the value of an objective function.
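In our paper, we consider the logistic loss $\ell(w; x, y) = \log(1 + \exp(-y\, w^{\top} x))$, which results in:

$$x^{*} = \arg\min_{x \in \mathcal{U}} \max_{y \in C} \; \frac{\lambda}{2} \|w_{\mathcal{L}^{+}}\|_2^2 + \sum_{(x_i, y_i) \in \mathcal{L}^{+}} \log\left(1 + \exp(-y_i\, w_{\mathcal{L}^{+}}^{\top} x_i)\right), \tag{1}$$

where $w_{\mathcal{L}^{+}}$ is the estimated parameter of the $\ell_2$-regularized logistic regression model trained on $\mathcal{L}^{+}$. Logistic regression is chosen as the base classifier since it is widely used in many fields and directly outputs the conditional probability, which can be used in active learning. We call this method Minimum Loss Increase (MLI) in this paper. EER tries to minimize the error on the unlabeled data while MLI aims to minimize the loss on the data already labeled.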
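A minimal sketch of the MLI selection rule follows: for every candidate and every possible label, the regularized logistic objective is evaluated on the enlarged labeled set, and the candidate with the smallest worst-case value is queried. The gradient-descent solver is a simple stand-in for a dedicated solver, and the value of `lam`, the function names and the toy data are assumptions made for illustration.

```python
import numpy as np

def fit_logreg(X, y, lam, steps=500, lr=0.5):
    """Gradient descent on lam/2 * ||w||^2 + sum_i log(1 + exp(-y_i w^T x_i))."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        coef = y / (1.0 + np.exp(y * (X @ w)))
        w -= lr * (lam * w - X.T @ coef) / len(y)
    return w

def objective(w, X, y, lam):
    """Value of the regularized logistic objective at w."""
    return float(0.5 * lam * (w @ w) + np.sum(np.logaddexp(0.0, -y * (X @ w))))

def mli_query(X_lab, y_lab, X_pool, lam=0.01):
    worst = np.empty(len(X_pool))
    for j, x in enumerate(X_pool):
        X_plus = np.vstack([X_lab, x])             # enlarged labeled set L+
        values = []
        for y in (-1, +1):                         # both possible labels of x
            y_plus = np.append(y_lab, y)
            w_plus = fit_logreg(X_plus, y_plus, lam)
            values.append(objective(w_plus, X_plus, y_plus, lam))
        worst[j] = max(values)                     # worst case over labels
    return int(np.argmin(worst)), worst            # smallest worst-case objective

# Toy data with a constant bias feature in the last column.
X_lab = np.array([[1.5, 1.0, 1.0], [-1.2, -0.8, 1.0]])
y_lab = np.array([+1, -1])
X_pool = np.array([[2.0, 2.5, 1.0], [0.1, -0.2, 1.0], [-2.2, -1.9, 1.0], [0.8, 0.3, 1.0]])

query, worst = mli_query(X_lab, y_lab, X_pool)
print("query index:", query)
```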
III A New Retraining-based Active Learner
In this section, we motivate our proposed method and, subsequently, describe a general adaptation for retraining-based active learning models.
Obviously, not knowing the true labels of the unlabeled data complicates calculating the final score of each instance in step 10 of Algorithm 1. One simple possibility is to compute the average-case or worst-case performance, or even the best-case criterion. These choices, however, may fail to take into account some potentially valuable information. Firstly, although the average-case criterion makes use of the label distribution information already known, the expectation can hide or underestimate some outstanding samples due to the re-weighting by $P_{\mathcal{L}}(y \mid x)$. For example, suppose the true label of instance $x$ is $y$ but the estimated $P_{\mathcal{L}}(y \mid x)$ is small, and $V(x, y)$ has a maximum value compared with other instances. Then the average-case criterion of $x$, namely $\sum_{y \in C} P_{\mathcal{L}}(y \mid x)\, V(x, y)$, is highly likely to be surpassed by other instances. Secondly, the worst-case criterion suffers from not taking advantage of the label distribution information at all. Worst-case analysis is safe since it never underestimates. However, ignoring the available label information can lose sight of some valuable information.
Thus, to overcome the shortcomings mentioned, a new criterion for retraining-based active learning is proposed. The main motivation is that we want to incorporate the uncertainty information (e.g., known label distribution information) within the min-max framework for retraining-based models. The proposed criterion is therefore as follows:

$$x^{*} = \arg\min_{x \in \mathcal{U}} \max_{y \in C} P_{\mathcal{L}}(y \mid x)\, V(x, y), \tag{2}$$

where $P_{\mathcal{L}}(y \mid x)$ contains the pre-trained label information and $V(x, y)$ represents any criterion we are interested in. Note that for some classifiers like logistic regression, we can use the estimated posterior probability as $P_{\mathcal{L}}(y \mid x)$. For classifiers that do not produce a probabilistic output, e.g., SVMs, we can transform their output into a probability using Platt’s [23] or Duin & Tax’s method [24]. For $V(x, y)$, various choices are possible, such as the test error on the unlabeled pool as in EER, the output variance, or the value of an objective function.
The proposed method can be interpreted as follows. It utilizes the pre-trained label information; although this information might be inaccurate due to the limited labeled data we have, it still provides potentially useful clues that may promote active learning. Firstly, it improves upon the average-case criterion since it does not compute the expected value. The calculation of the expectation tends to wash out the discriminative information contained in the data because of its averaging. Secondly, it improves upon the worst-case criterion because it takes advantage of the knowledge of the potential label distribution, which worst-case analysis does not use at all. Thus, it avoids the disadvantages of the average-case and worst-case criteria and can be seen as a trade-off between them. Lastly, it can be considered as incorporating uncertainty sampling (encoded by the posterior probabilities) into retraining-based models. If all $V(x, y)$ become one constant term like 1 or $P_{\mathcal{L}}(y \mid x)$ itself, then the proposed method turns into exactly uncertainty sampling. More specifically, $\arg\min_{x \in \mathcal{U}} \max_{y \in C} P_{\mathcal{L}}(y \mid x)$ or $\arg\min_{x \in \mathcal{U}} \max_{y \in C} P_{\mathcal{L}}(y \mid x)^2$ acts exactly like uncertainty sampling, since both select the instance whose posterior probability comes closest to 0.5 on the binary problem. This shows that our proposed method actually fuses uncertainty sampling with retraining-based models.
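The difference between the three ways of handling the unknown label can be made concrete with a small numerical sketch. The arrays below are made-up values chosen only so that the rules disagree; `P[j, k]` stands for the pre-trained posterior of label `k` for candidate `j`, `V[j, k]` for the criterion value after retraining with that label (smaller is better), and the function name `select` is hypothetical.

```python
import numpy as np

def select(P, V, rule):
    """Return the index of the candidate minimizing the aggregated criterion."""
    if rule == "average":
        scores = np.sum(P * V, axis=1)     # expectation over labels
    elif rule == "worst":
        scores = np.max(V, axis=1)         # worst case, ignores the posteriors
    elif rule == "proposed":
        scores = np.max(P * V, axis=1)     # worst case of posterior-weighted values
    else:
        raise ValueError(rule)
    return int(np.argmin(scores))

P = np.array([[0.50, 0.50],
              [0.95, 0.05],
              [0.60, 0.40]])
V = np.array([[2.0, 2.0],
              [1.0, 10.0],
              [1.5, 2.2]])

for rule in ("average", "worst", "proposed"):
    print(rule, select(P, V, rule))

# With a constant criterion the proposed rule reduces to uncertainty sampling:
# it picks the candidate whose largest posterior is closest to 0.5.
print(select(P, np.ones_like(V), "proposed"))
```

On these values the average-case rule picks candidate 1, the worst-case rule candidate 0, and the posterior-weighted worst case candidate 2; with the constant criterion the selected candidate is the one with posterior 0.5.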
III-B Two Examples of the Proposed Method
To provide valuable insights on the underlying characteristic of the proposed method, we apply it to two state-of-the-art retraining-based models EER and MLI. We also demonstrate its advantage on a synthetic data set in Figure 1.
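Since our method tries to make use of the uncertainty information, the following adapted methods are termed uncertainty retraining-based active learners. It is easy to extend EER to uncertainty-based error reduction by adopting our method in Equation 2 as follows:

$$x^{*} = \arg\min_{x \in \mathcal{U}} \max_{y \in C} P_{\mathcal{L}}(y \mid x) \left( -\sum_{x_i \in \mathcal{U}} \sum_{y_i \in C} P_{\mathcal{L}^{+}}(y_i \mid x_i) \log P_{\mathcal{L}^{+}}(y_i \mid x_i) \right).$$

This method is called UEER for short. We can also apply our proposed criterion to MLI. The new approach is called UMLI in this paper. Note that the regularization parameter in Equation 1 is usually quite small, so we ignore the regularization term in our adapted criterion:

$$x^{*} = \arg\min_{x \in \mathcal{U}} \max_{y \in C} P_{\mathcal{L}}(y \mid x) \sum_{(x_i, y_i) \in \mathcal{L}^{+}} \log\left(1 + \exp(-y_i\, w_{\mathcal{L}^{+}}^{\top} x_i)\right).$$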
As shown in Figure 1, we construct a synthetic binary data set in which the two colours represent the two classes. We show the behaviour of the four retraining-based active learners EER, UEER, MLI and UMLI in the four corners of the figure, respectively. In each corner, the black triangle and the black circle mark the two initial labeled points. When we compare UEER with EER, it is obvious that UEER selects a number of instances near the decision boundary while EER explores points in a wider range. This is because our method lets UEER make use of the uncertainty information, which makes UEER focus on the region about which the model is least certain. Similar results can be found for UMLI and MLI. MLI explores the data space and queries points around the border, while UMLI balances exploration and exploitation: it concentrates on the central part (exploitation) and also searches around the edge. Therefore, our method enhances retraining-based models by balancing exploration and exploitation.
In this section, we investigate the performance of our proposed methods to examine the effectiveness and robustness of our new criterion. The following experiments are limited to binary classification problems. Firstly, we show the experimental setting, then present the extensive experiment results, followed by further discussion and analysis.
IV-A Experimental setting
We compare our proposed methods UEER and UMLI against their original versions EER and MLI, respectively. Random sampling is also included in this comparison. In all the experiments, we use the $\ell_2$-regularized logistic regression included in the LIBLINEAR package [25] as the default classifier, with the same regularization parameter for all methods.
The classification accuracy is used as the comparison criterion in our experiments. However, since active learning is an iterative labeling procedure, we care about the performance during the whole learning process. Thus, it is not reasonable to merely compare the accuracy at a few single points. Instead, we generate the learning curve of classification accuracy versus the number of labeled instances. Then, we calculate the area under the learning curve (ALC) as the evaluation measure.
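As an illustration, the area under the learning curve can be computed with the trapezoidal rule. The sketch below divides the area by the range of the x-axis so that the result stays on the accuracy scale; this normalization is an assumption, as are the function name and the example curve.

```python
import numpy as np

def area_under_learning_curve(n_labeled, accuracy, normalize=True):
    """Trapezoidal area under accuracy vs. number of labeled instances."""
    n = np.asarray(n_labeled, dtype=float)
    acc = np.asarray(accuracy, dtype=float)
    area = np.sum(0.5 * (acc[1:] + acc[:-1]) * np.diff(n))
    return float(area / (n[-1] - n[0])) if normalize else float(area)

# Hypothetical learning curve: accuracy after 2, 3, ..., 6 labeled instances.
n_labeled = [2, 3, 4, 5, 6]
accuracy = [0.5, 0.6, 0.7, 0.8, 0.8]

alc = area_under_learning_curve(n_labeled, accuracy)
print(round(alc, 4))  # 0.6875
```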
We test on a total of 49 real-world data sets from various real-life applications, including many UCI data sets [26], the MNIST handwritten digit data set [27] and the 20 Newsgroups data set [28]. There are 39 data sets from the UCI benchmark collection, such as breast, vehicle and heart. These data sets are pre-processed according to [29]. For the wine data set, we take class 2 against classes 1 and 3 as the binary problem. For the glass data set, we also split the classes into two groups (classes 1-3 vs. classes 5-7) to build a binary problem. We randomly sub-sample 1000 instances from mushroom for computational efficiency. We select six pairs of letters from the Letter Recognition Data Set, i.e., D vs. P, E vs. F, I vs. J, M vs. N, V vs. Y and U vs. V, since the letters in each pair look similar and distinguishing them is somewhat challenging. 3 vs. 5, 5 vs. 8 and 7 vs. 9 are three difficult pairs taken from the MNIST data set (http://yann.lecun.com/exdb/mnist/) and used as binary classification data sets. We randomly sub-sample 1500 instances from each of the three data sets for computational efficiency. We also test the performance on the 20 Newsgroups data set (http://qwone.com/~jason/20Newsgroups/), which is a common benchmark for text classification. Following previous work, we evaluate three binary tasks from the 20 Newsgroups data set: baseball vs. hockey, pc vs. mac, and religion.misc vs. alt.atheism. The three pairs represent easy, moderate and difficult classification problems, respectively. We apply PCA to reduce the dimensionality of these three data sets to 500 for computational efficiency. We also use the pre-processed data sets autos, motorcycles, baseball and hockey.
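The three preparation steps mentioned above (building a binary problem from a multi-class one, random sub-sampling, and PCA) can be sketched with NumPy alone. The helper names, the choice of which group becomes the positive class, and the synthetic stand-in data are assumptions for illustration; the target dimensionality here is 3 rather than 500 only to keep the example small.

```python
import numpy as np

def one_vs_rest_labels(y, positive):
    """Map multi-class labels to {-1, +1}: classes listed in `positive` become +1."""
    return np.where(np.isin(y, list(positive)), 1, -1)

def subsample(X, y, n, rng):
    """Keep at most n randomly chosen instances (without replacement)."""
    keep = rng.choice(len(y), size=min(n, len(y)), replace=False)
    return X[keep], y[keep]

def pca_reduce(X, k):
    """Project centered data onto its top-k principal directions via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))                  # synthetic stand-in for a real data set
y_multi = rng.integers(1, 4, size=60)          # classes 1, 2, 3 (as in wine)

y_bin = one_vs_rest_labels(y_multi, {2})       # class 2 vs. classes 1 and 3
X_sub, y_sub = subsample(X, y_bin, 25, rng)
Z = pca_reduce(X_sub, 3)
print(Z.shape, sorted(set(y_bin.tolist())))
```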
To objectively evaluate the performance, each data set is randomly divided into training and test sets of equal size. At the very beginning of active learning, we assume that only two randomly picked instances from the training data are labeled, one from the positive class and the other from the negative class. We run each active learning algorithm 20 times on each real-world data set. The average performance of each active learning method is reported in the following section.
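The evaluation protocol described above can be sketched as follows: an equal-size random split, one randomly chosen labeled seed per class, and 20 repetitions. The function name, the use of the run index as random seed and the balanced toy labels are assumptions for illustration.

```python
import numpy as np

def split_and_seed(y, rng):
    """Equal-size train/test split plus one random labeled seed from each class."""
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train, test = idx[:half], idx[half:]
    pos = train[y[train] == +1]
    neg = train[y[train] == -1]
    seed = np.array([rng.choice(pos), rng.choice(neg)])   # initially labeled set
    pool = np.setdiff1d(train, seed)                      # remaining unlabeled pool
    return train, test, seed, pool

y = np.array([+1] * 20 + [-1] * 20)        # toy labels for a 40-instance data set

for run in range(20):                      # 20 independent repetitions
    rng = np.random.default_rng(run)
    train_idx, test_idx, seed_idx, pool_idx = split_and_seed(y, rng)
    # ... run each active learner on (seed_idx, pool_idx), evaluate on test_idx ...

print(len(train_idx), len(test_idx), len(pool_idx), y[seed_idx])
```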
| Data set (# Ins, # Fea) | Data set (# Ins, # Fea) | Data set (# Ins, # Fea) |
|---|---|---|
| ac-inflam (120, 6) | acute (120, 6) | australian (690, 14) |
| blood (748, 4) | breast (683, 10) | credit (690, 15) |
| cylinder (512, 35) | diabetes (768, 8) | fertility (100, 9) |
| german (1000, 24) | glass (214, 9) | haberman (306, 3) |
| heart (270, 13) | hepatitis (255, 19) | hill (606, 100) |
| ionosphere (351, 34) | liver (345, 6) | mushrooms (1000, 112) |
| mammographic (961, 5) | musk1 (476, 166) | ooctris2f (912, 25) |
| ozone (1000, 72) | parkinsons (195, 22) | pima (768, 8) |
| planning (182, 12) | sonar (208, 60) | splice (1000, 60) |
| tictactoe (958, 9) | vc2 (310, 6) | vehicle (435, 18) |
| wine (178, 13) | wisc (699, 9) | wdbc (569, 31) |
| d vs p (1608, 16) | e vs f (1543, 16) | i vs j (1502, 16) |
| m vs n (1575, 16) | v vs y (1577, 16) | u vs v (1550, 16) |
| 3 vs 5 (1500, 784) | 5 vs 8 (1500, 784) | 7 vs 9 (1500, 784) |
| base-hockey (1993, 500) | pc-mac (1945, 500) | misc-atheism (1427, 500) |
| autos (3970, 8014) | motorcycles (3970, 8014) | baseball (3970, 8014) |
| hockey (3970, 8014) | | |
Table II shows the experimental results on the 49 data sets. The data sets in Table II are sorted with respect to the performance of random sampling. The comparison covers data sets that range from very difficult problems (e.g., hill) to easy tasks (e.g., acute). To clearly demonstrate the advantage of the proposed method, we perform pairwise comparisons between each original algorithm and its counterpart, i.e., EER vs. UEER and MLI vs. UMLI. On each data set, a paired t-test at a fixed significance level is used to determine which method has the best performance or provides a comparable outcome. These methods are highlighted in boldface. The average performances over all the experiments are also reported in Table II. “Average Rank” shows the average rank of each method with regard to its performance over all the experiments. The lower the average rank, the better the method. The “win/tie/loss counts” give the number of data sets, out of 49, on which each proposed method wins against, ties with, or loses to its counterpart.
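The two summary statistics can be sketched as follows. Win/tie/loss counts compare the paired t statistic of two methods on each data set against a critical value, and the average rank assigns rank 1 to the best method on each data set, with tied methods sharing the mean rank. All numbers below are made up; the critical value 3.182 is the two-sided 5% threshold for 3 degrees of freedom and is used only because the toy example has four runs, not because it is the level used in the experiments.

```python
import numpy as np

def paired_t_statistic(a, b):
    """t statistic of the paired differences a - b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

def win_tie_loss(runs_a, runs_b, t_crit):
    """runs_*: (n_datasets, n_runs) ALC values; counts are from method a's view."""
    wins = ties = losses = 0
    for a, b in zip(runs_a, runs_b):
        t = paired_t_statistic(a, b)
        if t > t_crit:
            wins += 1
        elif t < -t_crit:
            losses += 1
        else:
            ties += 1
    return wins, ties, losses

def average_rank(mean_scores):
    """mean_scores: (n_datasets, n_methods); higher score is better, rank 1 is best."""
    s = np.asarray(mean_scores, dtype=float)
    greater = (s[:, None, :] > s[:, :, None]).sum(axis=2)   # methods strictly better
    equal = (s[:, None, :] == s[:, :, None]).sum(axis=2)    # includes the method itself
    return (1 + greater + 0.5 * (equal - 1)).mean(axis=0)

runs_a = np.array([[0.80, 0.82, 0.81, 0.83],
                   [0.70, 0.71, 0.69, 0.70],
                   [0.60, 0.58, 0.61, 0.59]])
runs_b = np.array([[0.75, 0.76, 0.77, 0.76],
                   [0.70, 0.70, 0.70, 0.71],
                   [0.66, 0.65, 0.67, 0.66]])
counts = win_tie_loss(runs_a, runs_b, t_crit=3.182)

ranks = average_rank([[0.82, 0.80, 0.80],
                      [0.70, 0.75, 0.72]])
print(counts, ranks)
```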
As shown in Table II, our proposed methods UEER and UMLI outperform their counterparts EER and MLI, respectively. UEER surpasses EER in terms of average accuracy, improving it from 0.812 to 0.822. UEER also outperforms EER in terms of “average rank”, which demonstrates the effectiveness of our method. Similar results can be found for UMLI and MLI: UMLI is superior to MLI in overall performance. Moreover, it is interesting to observe that UEER attains the best overall performance among all the active learning methods. According to the “win/tie/loss” counts of UEER versus EER over all the experimental data sets, UEER is the preferred active learner in over half the cases. With regard to UMLI and MLI, the “win/tie/loss” counts nonetheless also show a clear benefit of our scheme. We also notice that random sampling can even surpass all the other methods, e.g., on the blood data set, indicating that, generally, one should not use active learners blindly.
To investigate the robustness of our method, we also apply the worst-case criterion to EER and the average-case criterion to MLI. Due to the lack of space, we omit the results on each data set and only report the average performances. The average performance (ALC) of the worst-case criterion on EER is 0.771 while that of the average-case criterion on MLI is 0.710. To our surprise, both perform clearly worse than our method and even worse than random sampling. A possible reason is the following. EER computes the error on the unlabeled data, none of whose true labels are known, so the average-case criterion is a safe choice for EER. MLI estimates the loss on the enlarged labeled set, where only the true label of the newly added instance is unknown, so the worst-case criterion is more appropriate for MLI than the average-case criterion. However, since the proposed method is a trade-off between the two criteria, it can adapt to both settings and shows robust performance for different retraining-based models.
In this paper, we propose a new general method for retraining-based active learning. The proposed method balances the average-case and worst-case criteria by incorporating uncertainty information (carried by the pre-trained posterior probabilities) within the min-max framework. It drives current retraining-based models to pay more attention to exploitation. We apply the new idea to two state-of-the-art methods to investigate its effectiveness. The synthetic data demonstrates that, in comparison with the original retraining-based approaches, our method prefers to select instances near the decision boundary. Moreover, extensive experiments on 49 real-world data sets show that the proposed method is a promising approach for improving retraining-based active learners.
References
[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
[2] B. Settles, “Active learning literature survey,” University of Wisconsin, Madison, vol. 52, no. 55-66, p. 11, 2010.
[3] R. Chattopadhyay, Z. Wang, W. Fan, I. Davidson, S. Panchanathan, and J. Ye, “Batch mode active sampling based on marginal probability distribution matching,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 7, no. 3, p. 13, 2013.
[4] H. S. Seung, M. Opper, and H. Sompolinsky, “Query by committee,” in Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992, pp. 287–294.
[5] D. D. Lewis and J. Catlett, “Heterogeneous uncertainty sampling for supervised learning,” in Proceedings of the eleventh international conference on machine learning, 1994, pp. 148–156.
[6] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” The Journal of Machine Learning Research, vol. 2, pp. 45–66, 2002.
[7] C. Campbell, N. Cristianini, A. Smola et al., “Query learning with large margin classifiers,” in ICML, 2000, pp. 111–118.
[8] N. Roy and A. McCallum, “Toward optimal active learning through monte carlo estimation of error reduction,” ICML, Williamstown, 2001.
[9] Y. Guo and R. Greiner, “Optimistic active-learning using mutual information.” in IJCAI, vol. 7, 2007, pp. 823–829.
[10] A. Holub, P. Perona, and M. C. Burl, “Entropy-based active learning for object recognition,” in Computer Vision and Pattern Recognition Workshops, 2008. CVPRW’08. IEEE Computer Society Conference on. IEEE, 2008, pp. 1–8.
[11] Y. Guo and D. Schuurmans, “Discriminative batch mode active learning,” in Advances in neural information processing systems, 2008, pp. 593–600.
[12] B. Settles, M. Craven, and S. Ray, “Multiple-instance active learning,” in Advances in neural information processing systems, 2008, pp. 1289–1296.
[13] A. Freytag, E. Rodner, and J. Denzler, “Selecting influential examples: Active learning with expected model output changes,” in Computer Vision–ECCV 2014. Springer, 2014, pp. 562–577.
[14] W. Cai, Y. Zhang, S. Zhou, W. Wang, C. Ding, and X. Gu, “Active learning for support vector machines with maximum model change,” in Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 211–226.
[15] C. Kading, A. Freytag, E. Rodner, P. Bodesheim, and J. Denzler, “Active learning and discovery of object categories in the presence of unnameable instances,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015, pp. 4343–4352.
[16] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu, “Batch mode active learning and its application to medical image classification,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 417–424.
[17] T. Zhang and F. Oles, “The value of unlabeled data for classification problems,” in Proceedings of the Seventeenth International Conference on Machine Learning, (Langley, P., ed.). Citeseer, 2000, pp. 1191–1198.
[18] K. Yu, J. Bi, and V. Tresp, “Active learning via transductive experimental design,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 1081–1088.
[19] A. I. Schein and L. H. Ungar, “Active learning for logistic regression: an evaluation,” Machine Learning, vol. 68, no. 3, pp. 235–265, 2007.
[20] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu, “Semi-supervised svm batch mode active learning for image retrieval,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–7.
[21] S.-J. Huang, R. Jin, and Z.-H. Zhou, “Active learning by querying informative and representative examples,” in Advances in neural information processing systems, 2010, pp. 892–900.
[22] Y. Yang and M. Loog, “A benchmark and comparison of active learning methods for logistic regression,” arXiv preprint, 2016.
[23] J. Platt et al., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” 1999.
[24] R. P. Duin and D. M. Tax, “Classifier conditional posterior probabilities,” in Advances in pattern recognition. Springer, 1998, pp. 611–619.
[25] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “Liblinear: A library for large linear classification,” The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[26] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[28] K. Lang, “Newsweeder: Learning to filter netnews,” in Proceedings of the Twelfth International Conference on Machine Learning, 1995, pp. 331–339.
[29] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3133–3181, 2014.
[30] X. Zhu, J. Lafferty, and Z. Ghahramani, “Combining active learning and semi-supervised learning using gaussian fields and harmonic functions,” in ICML 2003 workshop on the continuum from labeled to unlabeled data in machine learning and data mining, 2003, pp. 58–65.