Supervised learning methods assume a large, well-annotated training set. However, in many real-world applications, labeled data is difficult and expensive to obtain, while a large pool of unlabeled data may be available. In such settings, active learning can be used: the machine learns an initial model from the labeled data and then repeatedly asks a human to annotate instances from the unlabeled pool.
Active learning has been extensively studied for classification and is most useful in settings with limited training annotations [1, 2, 3, 4, 5, 7]. However, existing active learning work has several limitations: 1) Most work focuses on classification, with relatively little on regression. 2) Most work on active regression is model-focused: it uses the learned regression function to select the next query point. Regression is under-studied by the research community yet extensively used in practice due to its ease of use and interpretability.
The limitations of being model-focused are subtle but important. For example, since each query depends on the model, if a previously queried point is incorrectly labeled by the domain expert, the results can be undesirable, even increasing model error. Most active learning studies for regression [8, 9, 10, 11, 12] assume that annotations are accurate and focus on selecting the most informative point to query based on the newly learned function at each iteration. However, real-world labelers usually produce noisy annotations, especially for regression tasks [13, 14]. Furthermore, model-based methods can be highly biased when the initial labeled data is limited, as we discuss in Section 6. We use the term challenging regression settings to denote regression problems where the initial labeled data is limited and the annotations are noisy because they come from humans.
In this paper, we propose a feature-focused active sampling strategy that tackles active learning for regression without using the regression function. Instead, we formulate the active learning problem as a bipartite graph optimization problem that reduces uncertainty in the unlabeled instances. One set of nodes corresponds to labeled points and the other to unlabeled points. The choice of the best points to move from the unlabeled set to the labeled set is analogous to the classic k-Median and k-Center problems. Even though these problems are known to be computationally intractable in general, we can adopt a known efficient approximation algorithm for the k-Median problem to solve the batch mode query active learning problem. To demonstrate the versatility of our approach, we explore classic regression settings, including classic linear regression and polynomial regression. A core challenge is how to estimate label uncertainty; we propose a distance-based measure and show that optimizing this measure is equivalent to optimizing the model's uncertainty upper bound. Our new approach has been evaluated on several datasets and shown to outperform many state-of-the-art methods (Section 6). The main contributions and novelty of this paper are summarized as follows.
We experimentally demonstrate the effectiveness of our proposed algorithm and the tightness of our bound in both normal and challenging regression settings (Section 6).
The rest of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we formulate our active learning algorithm as a general bipartite graph optimization problem. In Section 4 we connect our formulation with well-known graph problems and provide complexity results for the general version of our formulation. In Section 5 we introduce both sequential and batch mode query algorithms that minimize the total uncertainty upper bound for linear regression, and then extend them to polynomial regression for complex data. In Section 6 we demonstrate our algorithm's performance on various domains with noise-free and noisy annotations. Section 7 provides concluding remarks.
2 Related Work
There are three main categories of methods for querying unlabeled points in active learning. The first category finds the most informative or discriminative example for the current model. Cohn et al. proposed an algorithm that minimizes the learner's error by minimizing its variance to reduce generalization error, assuming a well-specified model and an unbiased learning function and data. Burbidge et al. proposed an adaptation of Query-By-Committee to active learning for regression. Sugiyama introduced a theoretically optimal active learning algorithm that attempts to minimize the generalization error in the pool-based setting. Yu provided passive sampling heuristics that shrink the space of candidate models based on the samples' geometric characteristics in the feature space. Cai et al. presented a sampling method for regression that queries the point leading to the largest model change. Although our approach is based on uncertainty sampling, which looks for informative points, we look not for the single most uncertain point but for one that brings the largest overall uncertainty reduction.
The second category finds the most representative points for the overall patterns of the unlabeled data while preserving the data distribution [17, 18]. In particular, clustering for better sampling of representative points has been explored [3, 4]. Considering representative points gives better performance when there is little or no initially labeled data. However, such approaches have two major weaknesses: their performance depends heavily on the quality of the clustering results, and their effectiveness (in comparison with methods that use informative points) degrades as the number of labeled points increases. The idea of a representative point is present in our approach, as we look for overall uncertainty reduction, but we also seek informative points based on our uncertainty measure.
The third category of methods considers informativeness and representativeness simultaneously. Existing work motivated by this goal has achieved excellent performance [7, 19]. Our work belongs to this category but differs from previous work in that we select valuable points based on the instances themselves rather than on the learned regression function.
Standard active learning methods usually assume an oracle that can provide accurate annotations for each query. In the real world, the annotations can be noisy. There has been much work on active learning for classification under different noise models and with diverse labelers [20, 14, 21]. Instead of making assumptions about specific noise types and labeler diversity, our method uses a model-free sampling strategy that makes the active learning model less vulnerable to potential noise in the annotated data.
3 Our Idea and Problem Definition
In this section, we study how to generate active queries for regression methods using only feature-space properties of the points. We calculate the uncertainty of each unlabeled point and then ask the oracle to label a subset of points that provides the maximum reduction in the overall uncertainty.
Assume that we have a small labeled set L with n_l instances (x_i, y_i), where y_i is the label of the i-th instance x_i, and a larger pool U of unlabeled data with n_u instances. At each iteration of active learning, the algorithm selects a subset of unlabeled instances and queries the oracle (i.e., a domain expert) to obtain their labels. We aim to choose a subset of points that will reduce the uncertainty by the maximum amount. Uncertainty is reduced both directly (the selected unlabeled points are given labels) and indirectly (the uncertainty of the remaining unlabeled points can be further reduced based on the newly labeled points).
This query decision is made based on an evaluation function f, which measures the total reduction of uncertainty after querying the new unlabeled points. We first formulate our proposed algorithmic framework as a bipartite graph optimization problem. We then propose a simple yet effective way to calculate f and finally present our active learning algorithm.
Consider a graph where each node represents a data point. In pool-based active learning, the nodes are partitioned into two groups, L and U, corresponding to labeled and unlabeled points respectively. For each node u in U, we add an edge (u, v) to the node v in L whose point most reduces the label uncertainty of the point corresponding to u. The weight w(u, v) of this edge is the uncertainty of u's labeling. This creates a weighted bipartite graph G = (L, U, E) between L and U with edge set E.
Given G, our goal is to reduce the total estimated uncertainty by choosing a subset S of U to add to L, where the total estimated uncertainty of G is defined as the sum of the weights of all edges in E.
However, our aim is not merely to choose the most uncertain points. We also want the chosen subset S to reduce the uncertainty of the other points in U (i.e., those that were not chosen). After querying S we have a new graph G', and the total uncertainty associated with all unlabeled points becomes the sum of the edge weights in G'.
Letting Δ(S) denote the total uncertainty reduction from G to G', our goal is to choose a subset S of size b that maximizes Δ(S).
We illustrate the basic intuition with a toy example in Figure 1, where the queried set S consists of a single unlabeled point. Querying an unlabeled point moves it from U to L and changes the edges in the graph. After the query, that point's own uncertainty is removed, since the point is now labeled; this is an example of directly reducing uncertainty. However, the queried point is also the nearest neighbor of another unlabeled point, whose uncertainty is therefore reduced as well; this is an example of indirectly reducing uncertainty. The estimated total reduction in uncertainty is the sum of the two. Our approach aims to find a set S that maximizes the estimated total reduction in uncertainty.
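To make the objective concrete, the following sketch computes the estimated total uncertainty and the reduction achieved by querying a candidate subset, assuming (as proposed in Section 5) that an unlabeled point's uncertainty is its Euclidean distance to the nearest labeled point. The function names are ours, not from the paper's pseudocode.

```python
import numpy as np

def total_uncertainty(labeled, unlabeled):
    """Sum, over unlabeled points, of the distance to the nearest labeled point."""
    return sum(
        min(np.linalg.norm(x - l) for l in labeled)
        for x in unlabeled
    )

def uncertainty_reduction(labeled, unlabeled, query_idx):
    """Estimated total uncertainty reduction from querying the points at query_idx."""
    before = total_uncertainty(labeled, unlabeled)
    queried = [unlabeled[i] for i in query_idx]
    remaining = [x for i, x in enumerate(unlabeled) if i not in set(query_idx)]
    after = total_uncertainty(labeled + queried, remaining)
    return before - after
```

For instance, with a labeled point at 0 and unlabeled points at 1 and 2 (in one dimension), querying the point at 1 removes its own uncertainty (direct reduction) and halves the uncertainty of the point at 2 (indirect reduction).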
4 Complexity Results
In this section, we first formulate a general version of our bipartite graph optimization problem and establish its complexity. This theoretical analysis guides us in developing approximation algorithms for active learning which will be introduced in Section 5. This section can be skipped on the first reading of the paper.
A bipartite graph G = (L, U, E) is a nearest neighbor bipartite graph (or NN-BG) if it satisfies the following two conditions.
A distance value d(u, v) is given for each pair of nodes u and v, where u ∈ U and v ∈ L.
For each node u ∈ U, E contains exactly one edge (u, v), where v is a nearest neighbor of u among all the nodes in L (i.e., for each node v' ∈ L, d(u, v) ≤ d(u, v')).
Thus, in any NN-BG G, each node in U has exactly one edge incident on it; therefore, |E| = |U|.
Given an NN-BG G, suppose we move a non-empty subset S of nodes from U to L. After this modification, we can find the nearest neighbor for each node remaining in U and obtain another NN-BG denoted by G'. We consider the following two problems involving such modifications of NN-BGs.
I. Modification to Minimize the Maximum Distance (MMMD)
Instance: An NN-BG G = (L, U, E), an integer k ≥ 1, and a non-negative number β.
Question: Is there a subset S ⊆ U such that (i) |S| = k and (ii) in the NN-BG G', the distance value on each edge is at most β?
II. Modification to Minimize the Total Distance (MMTD)
Instance: An NN-BG G = (L, U, E), an integer k ≥ 1, and a non-negative number β.
Question: Is there a subset S ⊆ U such that (i) |S| = k and (ii) in the NN-BG G', the sum of the distance values over all the edges is at most β?
We can show that both MMMD and MMTD are NP-complete even when the distance function is a metric. The detailed proofs are shown in the supplementary material.
Theorem 4.1. Problems MMMD and MMTD are NP-complete even when the distance function is a metric.
5 Putting it All Together - Our Algorithms
Here, we first show how to calculate an uncertainty upper bound for each unlabeled point and then use it as the weight of the corresponding edge in the bipartite graph G. We then propose two algorithms, for sequential and batch mode active learning, which are different approximation algorithms for the bipartite graph optimization problem. Finally, we extend our active learning strategy from linear regression to polynomial regression, which suits more complex data sets.
5.1 Motivating the Use of Distance as an Uncertainty Measure
We now show that the optimal way to calculate the edge weights w(u, v) is equivalent to constructing a distance measure. Uncertainty sampling is one of the most popular algorithms for active learning in classification. For example, in margin-based methods such as SVMs, we can use the distance between an unlabeled point and the decision boundary as an uncertainty measure. Similarly, in probabilistic models, we can use the entropy of the predicted distribution for an unlabeled point. Generally speaking, this method computes a measure of the classifier's uncertainty for each example in the unlabeled set and then returns the most uncertain one. However, measuring uncertainty in regression is not as straightforward as in classification.
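For intuition, a minimal sketch of entropy-based uncertainty sampling in classification (the helper names are ours): each unlabeled example's predicted class distribution is scored by its Shannon entropy, and the highest-entropy example is returned.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(prob_rows):
    """Index of the unlabeled example whose prediction has maximum entropy."""
    return max(range(len(prob_rows)), key=lambda i: entropy(prob_rows[i]))
```

A near-uniform prediction such as (0.5, 0.5) is maximally uncertain, while a confident prediction such as (0.9, 0.1) is not; regression offers no such ready-made score, which motivates the measure below.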
We define an intuitive uncertainty measure as follows. Let the data be d-dimensional. Suppose the current regression model (learned with the labeled set L) predicts, for a point x,

ŷ(x) = w · x + b,   (Eq. 1)

and the ideal model (learned with all points labeled) predicts

y*(x) = w* · x + b*.   (Eq. 2)
We can rewrite the prediction error at an arbitrary unlabeled point x_u by adding and subtracting the ŷ and y* values for a labeled point x_l. Note that the positive/negative bias terms (b and b*) cancel each other out.
We now attempt to answer the question of which labeled point to use. Here we make a classic machine learning assumption that the model space matches the data, so that ŷ(x_l) matches y*(x_l) on labeled points. Now we calculate the approximate error (due to not having labels for the unlabeled data) for x_u as the difference between ŷ in Eq. 1 and y* in Eq. 2:

err(x_u) = ŷ(x_u) − y*(x_u) = (w − w*) · (x_u − x_l).   (Eq. 3)
We now propose an uncertainty measure u(x_u) for an unlabeled point x_u and show in Theorem 5.1 that C · u(x_u) upper bounds the estimation error, where C is a constant:

u(x_u) = ||x_u − NN(x_u)||,   (Eq. 4)

where NN(x_u) denotes the nearest labeled neighbor of x_u.
Eq. 4 means that the current model's uncertainty for a point is based on its nearest labeled point NN(x_u), measured in Euclidean distance. The following theorem shows the relationship between Eq. 3 (an unlabeled point's estimation error) and Eq. 4 (the distance between the unlabeled point and its nearest labeled point).
Theorem 5.1. For any unlabeled point x_u, minimizing u(x_u) is equivalent to minimizing the upper bound of the estimation error for x_u.
Proof: From Eq. 3 and the Cauchy–Schwarz inequality we have

|err(x_u)| = |(w − w*) · (x_u − x_l)| ≤ ||w − w*|| · ||x_u − x_l|| = C ||x_u − x_l||,

where C = ||w − w*|| is an unknown constant. Hence, the tightest upper bound for |err(x_u)| is achieved when x_l is the nearest neighbor of x_u in Euclidean distance.
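As a numerical sanity check of this kind of bound, the sketch below fits a least-squares model on noisy labeled data and verifies that the prediction-error decomposition through a labeled point obeys the Cauchy–Schwarz/triangle-inequality bound. The synthetic weights, noise level, and query point are our own assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])        # weights of the hypothetical "ideal" model

# Labeled sample with noisy targets, so the fitted weights differ from w_true.
X_l = rng.normal(size=(20, 2))
y_l = X_l @ w_true + 0.1 * rng.normal(size=20)
w_hat, *_ = np.linalg.lstsq(X_l, y_l, rcond=None)   # the "current" model

x_u = np.array([1.5, 0.5])                              # an unlabeled point
x_nn = min(X_l, key=lambda x: np.linalg.norm(x_u - x))  # its nearest labeled point

# Decomposition: x_u.(w_hat - w_true) = (x_u - x_nn).(w_hat - w_true) + x_nn.(w_hat - w_true),
# so |err(x_u)| <= ||w_hat - w_true|| * ||x_u - x_nn|| + |err(x_nn)|.
err_u = abs(x_u @ (w_hat - w_true))
err_l = abs(x_nn @ (w_hat - w_true))
bound = np.linalg.norm(w_hat - w_true) * np.linalg.norm(x_u - x_nn) + err_l
assert err_u <= bound + 1e-12
```

The bound holds for any labeled point x_l, and shrinks as ||x_u − x_l|| shrinks, which is why the nearest labeled neighbor gives the tightest version.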
5.2 Proposed Active Selection Strategy for Sequential Query
We now propose an evaluation function for active selection that chooses one point to query at each round. We first calculate the current uncertainty of each unlabeled point x_j as u(x_j), based on Eq. 4. We then calculate the uncertainty each unlabeled point x_j would have if x_i had been queried, which we refer to as u_i(x_j). Next, we calculate the reduction of uncertainty for each unlabeled point as r_i(x_j) = u(x_j) − u_i(x_j). Finally, we calculate the difference between the graphs defined in Section 3 after querying x_i as

f(x_i) = u(x_i) + Σ_{j ≠ i} r_i(x_j),

where the first term is the direct reduction and the sum is the indirect reduction. We then select the point x* that maximizes this difference:

x* = argmax_{x_i ∈ U} f(x_i).
The pseudocode for our proposed algorithm is summarized in Algorithm 1. At each iteration of active learning, our algorithm selects the point that maximizes our active selection function. After that, we retrain our model and test the newly trained model on the hold-out test set. The whole process is repeated until the number of querying rounds reaches the chosen maximum value. To speed up our proposed algorithm, the outer loop's computations in Algorithm 1 can be run in parallel, substantially reducing the total time needed for one query.
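A minimal sketch of this selection rule (not a reproduction of the paper's Algorithm 1) under the distance-based uncertainty of Eq. 4; the gain for a candidate combines its own uncertainty (direct reduction) with the drop in other points' nearest-neighbor distances (indirect reduction):

```python
import numpy as np

def nearest_dist(x, pool):
    """Distance from x to its nearest neighbor in pool."""
    return min(np.linalg.norm(x - p) for p in pool)

def select_next_query(labeled, unlabeled):
    """Pick the unlabeled point whose querying maximizes estimated uncertainty reduction."""
    best_i, best_gain = None, -np.inf
    for i, x_i in enumerate(unlabeled):
        # Direct reduction: x_i's own uncertainty disappears once it is labeled.
        gain = nearest_dist(x_i, labeled)
        # Indirect reduction: other unlabeled points may now be closer to x_i.
        for j, x_j in enumerate(unlabeled):
            if j == i:
                continue
            d_old = nearest_dist(x_j, labeled)
            d_new = min(d_old, np.linalg.norm(x_j - x_i))
            gain += d_old - d_new
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i
```

For example, with a labeled point at 0 and unlabeled points at 10, 11, and 9, the point at 10 wins: it is both highly uncertain itself and the best new anchor for its two neighbors.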
5.3 Proposed Active Selection Strategy for Batch Mode Query
Most active learning methods have focused on sequential active learning, which selects a single point to query in each iteration. In this setting, the model has to be retrained after each new example is queried, which is unrealistic in practice, where multiple experts often annotate data in a parallel labeling system. Moreover, using our sequential active learning method to optimize the overall uncertainty reduction in the bipartite graph is less accurate than optimizing over a group of points. We demonstrate the performance advantage of batch mode query in Section 6.
Our goal for batch mode query is to select a batch of data points that provides the largest overall uncertainty reduction. Searching for an optimal solution is computationally expensive (as suggested by the hardness results in Section 4). To reduce the computational overhead, we propose a local search approximation algorithm for the batch mode query problem.
The local search algorithm with a single swap is a popular algorithm for the k-Median problem. We apply this idea to our active learning strategy. Let the batch size be b. We choose an initial set of b query points using our sequential active learning method (Algorithm 1). We then repeatedly check the unlabeled points in the pool and swap them with the query points whenever the newly added point reduces the total uncertainty. Arya et al. proved that local search with p swaps is a (3 + 2/p)-approximation algorithm, which gives an approximation ratio of 5 in the single-swap case while keeping the runtime low. The pseudocode for the batch mode query algorithm is summarized in Algorithm 2.
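A sketch of the single-swap local search for the batch query, again using nearest-labeled-point distance as the uncertainty; the helper names are ours, and in the paper the initial batch would come from the sequential method rather than being passed in directly:

```python
import numpy as np

def batch_uncertainty(labeled, unlabeled, batch):
    """Total remaining uncertainty if the points at indices `batch` were queried."""
    anchors = labeled + [unlabeled[i] for i in batch]
    return sum(
        min(np.linalg.norm(x - a) for a in anchors)
        for i, x in enumerate(unlabeled) if i not in batch
    )

def local_search_batch(labeled, unlabeled, init_batch):
    """Single-swap local search: exchange a batch member for an outside point
    as long as the swap strictly lowers the total remaining uncertainty."""
    batch = set(init_batch)
    improved = True
    while improved:
        improved = False
        for out in set(range(len(unlabeled))) - batch:
            for inside in list(batch):
                cand = (batch - {inside}) | {out}
                if batch_uncertainty(labeled, unlabeled, cand) < \
                   batch_uncertainty(labeled, unlabeled, batch):
                    batch, improved = cand, True
                    break
            if improved:
                break
    return sorted(batch)
```

Because the objective strictly decreases with every accepted swap, the loop terminates; the result is a local optimum with the approximation guarantee cited above.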
5.4 Extension to Other Forms of Regression
Linear regression is popular in many real-world applications due to its simplicity and interpretability. So far, we have discussed sequential and batch mode query strategies for linear regression. However, a linear model is often too simple for complex data sets and does not generalize well to them. In this section, we show how our current formulation can be extended to other forms of regression. To fit complex data, one common approach in machine learning is to use linear models trained on non-linear functions of the data. This approach maintains the generally fast performance and high interpretability of linear models while allowing them to fit a much wider range of data.
Let an instance x with d dimensions have the coordinates (x_1, …, x_d). We construct the p-th order polynomial features for x by using the multinomial theorem. Thus we have polynomial features like x_1^p and interaction terms like x_1 x_2. A toy example of second-order polynomial features with interaction terms for two-dimensional data is as follows: given the original data (x_1, x_2), the new feature vector is (x_1, x_2, x_1^2, x_1 x_2, x_2^2), and the new regression model is y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2.
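The feature construction can be sketched as follows (scikit-learn's PolynomialFeatures offers the same expansion; the helper below is illustrative): every monomial of the coordinates up to the chosen degree is enumerated via combinations with replacement.

```python
from itertools import combinations_with_replacement

def poly_features(x, degree):
    """All monomials of the coordinates of x up to the given degree (bias excluded)."""
    feats = []
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            m = 1.0
            for i in idx:
                m *= x[i]
            feats.append(m)
    return feats
```

For x = (x_1, x_2) and degree 2, this yields (x_1, x_2, x_1^2, x_1 x_2, x_2^2), matching the toy example above; the linear regression machinery is then applied unchanged to the expanded features.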
To prevent overfitting, we also add a regularization term to polynomial regression, as in ridge regression. By considering linear fits within a higher-dimensional space built with these basis functions, the model can be used for a much broader range of data. We also tested our approach with ridge regression; due to limited space, those experimental results appear in the supplementary material.
6 Experimental Evaluation
Here we aim to demonstrate the effectiveness of our approach empirically. In particular, we address the following three questions.
How do our results compare with baseline active learning methods for linear regression? This directly tests the tightness of our bound (Theorem 5.1).
How does our proposed method perform when the query results are noisy?
How does our proposed method perform on different forms of regression?
The first and second questions test the usefulness of our proposed algorithms in practice while the last question addresses the effectiveness of applying our proposed method to other forms of regressions.
To compare the effectiveness of our proposed methods and answer the above questions, we compare our algorithms with four representative baselines.
Random: Randomly selects an instance from the unlabeled pool; this is widely used as an active learning baseline.
Greedy passive sampling: Selects the instance from the unlabeled pool that has the largest minimum Euclidean distance from the labeled set in feature space.
Query by committee algorithm: Selects the instance that has the largest variance among the committee's predictions. The committee is constructed on bootstrap examples, and the number of committee members is set as in the original paper.
Expected model change maximization: Quantifies change as the difference between the current model parameters and the new model parameters, and chooses the unlabeled instance that results in the greatest change.
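As an illustration of the QBC baseline above, here is a bootstrap-committee sketch for regression with our own simplified defaults (committee size, seed, and intercept handling are assumptions, not the original paper's settings):

```python
import numpy as np

def qbc_select(X_l, y_l, X_u, n_members=4, seed=0):
    """Query-by-committee for regression: pick the unlabeled point with the
    largest prediction variance across members fit on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(X_l)
    preds = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)           # bootstrap resample of the labeled set
        Xb = np.c_[np.ones(len(idx)), X_l[idx]]    # prepend an intercept column
        w, *_ = np.linalg.lstsq(Xb, y_l[idx], rcond=None)
        preds.append(np.c_[np.ones(len(X_u)), X_u] @ w)
    return int(np.argmax(np.var(preds, axis=0)))
```

On noisy 1-D data confined to [0, 1], the committee members disagree most on a far-away extrapolation point, so QBC selects it.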
For all the baselines, we use the same parameters as in the original papers. For brevity, we denote the above baselines as Random, Greedy, QBC, and EMCM; our sequential query method is denoted Ours-Sequential and our batch query method Ours-Batch.
Table 1: Data sets (name, # instances, # features, source).
Data Description. We used six benchmark data sets which are chosen from the UCI machine learning repository  and CMU’s StatLib : Housing, Concrete, Yacht, PM10, Redwine, Whitewine. These data sets were collected from various domains and have been extensively used for testing regression models. Their statistics and descriptions are shown in Table 1.
Experimental Configuration. For each dataset, 1% of the instances are sampled to initialize the labeled set L. We use a minimal initial labeled set to demonstrate the advantages of our proposed algorithm. To evaluate the regression model, in each run 30% of the instances are held out as the test set; the rest of the instances are used for active learning. In each run we query a fixed percentage of the unlabeled instances; we do not query all of them because the performance of most methods converges after some number of queries. In our plots the x-axis is the query round, where each round corresponds to querying one additional increment of the unlabeled instances. For the batch mode query, we directly choose a single batch consisting of all points to be queried.
We report the average results over the runs of the experiments. For each dataset, we normalize the features using the standard score (Z-score), z = (x − μ)/σ, where μ is the mean and σ is the standard deviation of the population. To simulate the noisy annotation setting, we first calculate the standard deviation σ_y of the current labels and then add zero-mean Gaussian noise with standard deviation σ_y to each newly added query point.
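The preprocessing and the noisy-oracle simulation just described can be sketched as follows; the function names are ours, and scaling the noise by the labels' standard deviation is our reading of the setup above.

```python
import numpy as np

def z_score(X):
    """Standardize each feature column: (x - mean) / std."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma

def noisy_oracle(y_true, labeled_y, rng):
    """Return a label corrupted with zero-mean Gaussian noise whose standard
    deviation equals that of the currently labeled targets."""
    sigma_y = np.std(labeled_y)
    return y_true + rng.normal(0.0, sigma_y)
```

After z-scoring, every feature column has mean 0 and standard deviation 1, so Euclidean distances in the uncertainty measure are not dominated by any single feature's scale.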
Results for Non-Noisy Setting. Figure 8 plots the RMSE (Root Mean Squared Error) curves against the total number of queries for all the compared approaches. We first discuss the noise-free results in parts (a) to (f) of Figure 8. Generally speaking, active learning methods usually achieve lower RMSE than the Random method. As can be seen from the plots, both QBC and EMCM perform better than Random and their performances are close to each other. This is reasonable because their ideas are similar; QBC builds a committee to find the unlabeled point which has the largest prediction variance while EMCM builds an ensemble to find the instance which causes the largest changes in the model’s parameters.
The EMCM method tends to fluctuate at the beginning of training, since its optimization objective only looks for unlabeled instances that bring the largest model change, regardless of the direction of that change. Greedy sampling achieves decent performance on some of the data sets, but its performance is not consistent; a possible reason is that differing distributions of the labeled and unlabeled data can fool the greedy method into querying outliers. Our proposed sequential query approach consistently achieves lower RMSE than the other methods in most cases.
For the batch mode query, one can see that querying a batch is always better than our sequential greedy query across all the data sets. This result is expected, since the local search approximation algorithm reduces the overall uncertainty of the unlabeled points by a larger amount. The better performance of batch mode query also shows that optimizing the overall uncertainty upper bound is useful in practice. We have not compared against model-based batch query algorithms, since their performance is worse than that of their sequential counterparts. For example, batch-EMCM performs worse than sequential EMCM; this is because in the sequential case the model is updated after each new example is chosen and added to the training set, so each example is selected with more information.
Results for Noisy Setting. Parts (a) to (f) of Figure 15 show the RMSE curves with noisy annotations. Both QBC and EMCM perform worse than in the noise-free setting (see Figure 8). This result is expected, because we introduce Gaussian noise into each active query's feedback, which harms the performance of the model-based methods. Both QBC and EMCM assume the labels are noise-free and depend heavily on the quality of the learned functions. The greedy method still performs inconsistently, but it is less vulnerable to the noise because it uses a feature-based sampling strategy. Furthermore, our proposed approach achieves a lower average RMSE in the noisy setting. This is to be expected, as our method does not directly depend on the regression function.
Table 2: Ranking counts of our method versus the other methods as the number of queries increases (noise-free setting).
The overall performance of our sequential query method for linear regression is shown in Table 2, which summarizes the ranking counts of our method versus the other methods in the noise-free setting. Note that an active learning algorithm's performance varies with the distribution of the initial labeled and unlabeled points. Even so, our method performs best in more than half of the tests and behaves consistently as the number of queries increases. Significantly, this shows that our method not only performs better on average but also consistently outperforms the competitors.
Active Learning Experiments on Different Forms of Regression. We evaluated our extension for ridge regression, and the results are in line with the linear regression results; the details are summarized in the supplementary material. To adapt to complex data, we discuss experiments for polynomial regression in this section. We chose three datasets, namely Housing, Concrete, and Yacht, because linear regression does not perform well enough on them. Figure 19 plots the RMSE curves for polynomial regression. We set the same default regularization parameter across all the data sets.
Compared to the linear regression results in Figure 8, we find that polynomial regression works much better on these three datasets, with much lower RMSE. Greedy behaves more erratically; it achieves decent performance on the Concrete and Yacht datasets but loses to Random on the Housing dataset. This is plausible because, with larger feature dimensions, Greedy is more likely to pick outliers, which are less informative. The EMCM and QBC methods perform similarly across all data sets; they perform worse at the beginning and gradually improve as the number of labeled points increases. Our proposed method performs consistently better than the baseline methods, especially in the early active learning rounds.
7 Conclusion
We proposed a new graph-based approach for active learning in regression. We formulate the active learning problem as a bipartite graph optimization problem that maximizes the overall reduction in uncertainty caused by moving points from the unlabeled collection to the labeled collection. Experimental results on benchmark data show that the proposed approach can efficiently find valuable points to improve the active learning method. We explored both sequential and batch mode learning. Experimental results show that the proposed uncertainty measure and the resulting method achieve promising results for different forms of regression. A limitation of our method is its high computational overhead, due to searching for instances' neighbors in each active learning round. In future work, we propose to speed up our algorithms by using more efficient data structures such as KD-trees and advanced hashing methods [5, 6]. Such techniques will enable us to apply our approach to large-scale regression tasks.
We thank the SDM 2020 reviewers for providing helpful suggestions. This work was supported in part by NSF Grants IIS-1908530, OAC-1916805, IIS-1633028 and IIS-1910306.
-  D. A. Cohn, Z. Ghahramani, and M. I. Jordan, “Active learning with statistical models,” JAIR, 1996.
-  S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” JMLR, 2001.
-  H. T. Nguyen and A. Smeulders, “Active learning using pre-clustering,” in ICML, 2004.
-  S. Dasgupta and D. Hsu, “Hierarchical sampling for active learning,” in ICML, 2008.
-  B. Qian, X. Wang, J. Wang, H. Li, N. Cao, W. Zhi, and I. Davidson, “Fast pairwise query selection for large-scale active learning to rank,” in ICDM, 2013.
-  S. Gilpin, B. Qian, and I. Davidson, “Efficient hierarchical clustering of large high dimensional datasets,” in CIKM, 2013.
-  S.-J. Huang, R. Jin, and Z.-H. Zhou, “Active learning by querying informative and representative examples,” TPAMI, 2014.
-  R. Burbidge, J. J. Rowland, and R. D. King, “Active learning for regression based on query by committee,” in IDEAL, 2007.
-  M. Sugiyama and S. Nakajima, “Pool-based active learning in approximate linear regression,” Machine Learning, vol. 75, no. 3, pp. 249–274, 2009.
-  W. Cai, Y. Zhang, and J. Zhou, “Maximizing expected model change for active learning in regression,” in ICDM, 2013.
-  S. Sabato and R. Munos, “Active regression by stratification,” in NIPS, 2014.
-  C. Riquelme, R. Johari, and B. Zhang, “Online active linear regression via thresholding,” in AAAI, 2017.
-  L. Malago, N. Cesa-Bianchi, and J. Renders, “Online active learning with strong and weak annotators,” in NIPS Workshop on Learning from the Wisdom of Crowds, 2014.
-  C. Zhang and K. Chaudhuri, “Active learning from weak and strong labelers,” in NIPS, 2015.
-  W. Cai, M. Zhang, and Y. Zhang, “Batch mode active learning for regression with expected model change,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 7, pp. 1668–1681, 2016.
-  H. Yu and S. Kim, “Passive sampling for regression,” in ICDM, IEEE, 2010.
-  K. Yu, J. Bi, and V. Tresp, “Active learning via transductive experimental design,” in ICML, 2006.
-  R. Chattopadhyay, Z. Wang, W. Fan, I. Davidson, S. Panchanathan, and J. Ye, “Batch mode active sampling based on marginal probability distribution matching,” TKDD, 2013.
-  Z. Wang and J. Ye, “Querying discriminative and representative samples for batch mode active learning,” TKDD, 2015.
-  Y. Yan, R. Rosales, G. Fung, and J. G. Dy, “Active learning from crowds,” in ICML, 2011.
-  S.-J. Huang, J.-L. Chen, X. Mu, and Z.-H. Zhou, “Cost-effective active learning from diverse labelers,” in IJCAI, 2017.
-  M. R. Garey and D. S. Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness,” 1979.
-  B. Settles, “Active learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.
-  B. Settles, M. Craven, and S. Ray, “Multiple-instance active learning,” in NIPS, 2008.
-  V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit, “Local search heuristics for k-Median and facility location problems,” SIAM Journal on Computing, vol. 33, no. 3, pp. 544–562, 2004.
-  K. Bache and M. Lichman, “UCI machine learning repository,” 2013.
-  P. Vlachos, “StatLib project repository,” Carnegie Mellon University, 2000.