Active learning algorithms aim to learn parameters with less data by querying labels adaptively. However, since such algorithms change the sampling distribution, they can introduce bias into the learned parameters. While there has been some work to understand this bias (Schütze et al., 2006; Bach, 2007; Dasgupta & Hsu, 2008; Beygelzimer et al., 2009), a theoretical understanding of the most common algorithm, “uncertainty sampling” (Lewis & Gale, 1994; Settles, 2010), remains elusive. One of the oddities of uncertainty sampling is that sometimes the bias is helpful: uncertainty sampling with a subset of the data can yield lower error than random sampling on all the data (Schohn & Cohn, 2000; Bordes et al., 2005; Chang et al., 2017). But sometimes, uncertainty sampling can vastly underperform, and in general, different initializations can yield different parameters asymptotically. Despite the wealth of theory on active learning (Balcan et al., 2006; Hanneke et al., 2014), a theoretical account of uncertainty sampling is lacking.
In this paper, we characterize the dynamics of a streaming variant of uncertainty sampling to explain the bias introduced. We introduce a smoothed version of the zero-one loss which approximates and converges to the zero-one loss. We show that uncertainty sampling, which minimizes a convex surrogate loss on all the points so far, is asymptotically performing a preconditioned stochastic gradient step on the smoothed (non-convex) population zero-one loss. Furthermore, each uncertainty sampling iterate in expectation moves in a descent direction of the smoothed population zero-one loss, unless the parameters are at an approximate stationary point. In addition, uncertainty sampling converges to a stationary point of the smoothed population zero-one loss. This explains why uncertainty sampling sometimes achieves lower zero-one loss than random sampling, since that is approximately the quantity it implicitly optimizes. At the same time, as the zero-one loss is non-convex, we can get stuck in a local minimum with higher zero-one loss (see Figure 1).
Empirically, we validate the properties of uncertainty sampling on a simple synthetic dataset for intuition as well as 22 real-world datasets. Our new connection between uncertainty sampling and zero-one loss minimization clarifies the importance of a sufficiently large seed set, rather than using a single point per class, as is commonly done in the literature (Tong & Koller, 2001; Yang & Loog, 2016).
We focus on binary classification. Let $z = (x, y)$ be a data point, where $x$ is the input and $y \in \{-1, +1\}$ is the label, drawn from some unknown true data distribution $p^*$. Assume we have a scoring function $S(x; \theta)$, where $\theta$ are the parameters; for linear models, we have $S(x; \theta) = \theta \cdot \phi(x)$, where $\phi$ is the feature map.
Given parameters $\theta$, we predict $+1$ if $S(x; \theta) \ge 0$ and $-1$ otherwise, and therefore err when $S(x; \theta)$ and $y$ have opposite signs. Define
$Z(\theta)$ to be the zero-one loss (misclassification rate) over the data distribution, the evaluation metric of interest:
$Z(\theta) = \mathbb{E}_{(x, y) \sim p^*}[H(-y \, S(x; \theta))]$, where $H$ is the Heaviside step function: $H(u) = 1$ if $u > 0$ and $H(u) = 0$ otherwise.
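As a concrete illustration (a minimal sketch with hypothetical helper names, not the paper's code), the empirical zero-one loss of a linear scorer can be computed directly from the definition:

```python
def heaviside(u):
    # H(u) = 1 for u > 0, else 0; the value at u = 0 is immaterial when
    # the score distribution is continuous
    return 1.0 if u > 0 else 0.0

def score(theta, x):
    # linear model: S(x; theta) = theta . phi(x), with phi the identity here
    return sum(t * xi for t, xi in zip(theta, x))

def zero_one_loss(theta, data):
    # fraction of points where y and S(x; theta) have opposite signs,
    # i.e. the mean of H(-y * S(x; theta))
    return sum(heaviside(-y * score(theta, x)) for x, y in data) / len(data)

points = [([1.0, 2.0], +1), ([-1.0, 0.5], -1), ([-0.5, -2.0], +1)]
print(zero_one_loss([1.0, 0.0], points))  # 0.333...: only the third point errs
```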
Note that the training zero-one loss is piecewise constant, and its gradient is $0$ almost everywhere. However, assuming the probability density function (PDF) of the data distribution is smooth, the population zero-one loss is differentiable at most parameters, a fact that will be shown later.
Since minimizing the zero-one loss is computationally intractable (Feldman et al., 2012), it is common to define a convex surrogate loss $\ell(z; \theta)$ which upper bounds the zero-one loss; for example, the logistic loss is given by $\ell(z; \theta) = \log(1 + e^{-y S(x; \theta)})$. Given a labeled dataset
$D$, we can define the estimator that minimizes the sum of the loss plus quadratic regularization:
This can often be solved efficiently via convex optimization.
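To make the surrogate-minimization step concrete, here is a minimal sketch (hypothetical function names; plain gradient descent stands in for whatever convex solver one would actually use) of fitting the $\ell_2$-regularized logistic loss:

```python
import math

def regularized_logistic_grad(theta, data, lam):
    # gradient of sum_i log(1 + exp(-y_i * theta.x_i)) + (lam/2) * ||theta||^2
    g = [lam * t for t in theta]
    for x, y in data:
        s = sum(t * xi for t, xi in zip(theta, x))
        coef = -y / (1.0 + math.exp(y * s))  # d/ds log(1 + e^{-y s})
        for j in range(len(theta)):
            g[j] += coef * x[j]
    return g

def fit(data, dim, lam=1.0, lr=0.1, steps=3000):
    # plain gradient descent; the objective is strongly convex, so this converges
    theta = [0.0] * dim
    for _ in range(steps):
        g = regularized_logistic_grad(theta, data, lam)
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta

train = [([1.0, 0.0], +1), ([2.0, 1.0], +1), ([-1.0, 0.0], -1), ([-2.0, -1.0], -1)]
theta_hat = fit(train, dim=2)
print(theta_hat)  # points in the direction separating the two classes
```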
Passive learning: random sampling.
Define the population surrogate loss as $L(\theta) = \mathbb{E}_{z \sim p^*}[\ell(z; \theta)]$.
In standard passive learning, we sample $z$ randomly from the population and compute the regularized estimator (3). As the number of samples grows, the parameters generally converge to the minimizer of the population surrogate loss $L$, which is in general distinct from the minimizer of the zero-one loss $Z$.
Active learning: uncertainty sampling.
In this work, we consider the streaming setting (Settles, 2010) where a learner receives a stream of unlabeled examples ($x$ known, drawn from $p^*$, with the label $y$ unknown) and must decide whether to label each point or not. We analyze uncertainty sampling in this setting (Lewis & Gale, 1994; Settles, 2010), which is widely used for its simplicity and efficacy (Yang & Loog, 2016).
Let us denote our label budget as $n$, the number of points we label. Uncertainty sampling (Algorithm 1) begins with $n_{\text{seed}}$ labeled points drawn from the beginning of the stream and minimizes the regularized loss (3) to obtain initial parameters. Then, the algorithm takes a point $x$ from the stream and labels it with probability $a(S(x; \theta) / \delta)$ for some acceptance function $a$ and scalar $\delta > 0$. One example is the box acceptance $a(s) = \mathbb{I}[|s| \le 1]$, which corresponds to labeling points from the stream if and only if $|S(x; \theta)| \le \delta$. As $\delta$ gets smaller, we choose points closer to the decision boundary.
If we decide to label $x$, then we obtain the corresponding label $y$ and add $(x, y)$ to the labeled set. Finally, we update the model by optimizing (3). The process continues until we have labeled $n$ points in total.
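The loop above can be sketched compactly. This is a hypothetical one-dimensional setup: the box acceptance function and simple gradient-descent fitting are illustrative choices under the paper's framework, not its actual implementation.

```python
import math
import random

def fit_logistic(data, lam=1.0, lr=0.05, steps=1500):
    # minimize sum_i log(1 + exp(-y_i * theta * x_i)) + (lam/2) * theta^2
    # (one-dimensional sketch, plain gradient descent)
    theta = 0.0
    for _ in range(steps):
        g = lam * theta + sum(-y * x / (1.0 + math.exp(y * theta * x))
                              for x, y in data)
        theta -= lr * g
    return theta

def uncertainty_sampling(stream, n_seed, n_budget, delta=0.5):
    # seed with the first n_seed labeled points, then accept a streamed point
    # iff |S(x; theta)| <= delta, i.e. the box acceptance a(s) = 1{|s| <= 1}
    # applied to S(x; theta) / delta
    labeled = [next(stream) for _ in range(n_seed)]
    theta = fit_logistic(labeled)
    while len(labeled) < n_budget:
        x, y = next(stream)          # the label y is only used upon acceptance
        if abs(theta * x) <= delta:
            labeled.append((x, y))
            theta = fit_logistic(labeled)  # re-minimize the regularized loss
    return theta

random.seed(0)
def stream():
    while True:
        x = random.gauss(0.0, 1.0)
        yield (x, 1 if x > 0 else -1)   # noiseless threshold labels

theta_hat = uncertainty_sampling(stream(), n_seed=5, n_budget=15)
print(theta_hat > 0)  # recovers the correct sign of the decision rule
```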
We present four types of theoretical results. First, in Section 3.1, we show how the optimal parameters change with the addition of a single point to the convex surrogate (e.g., logistic) loss. Then, in Section 3.2, we introduce a smoothed version of the zero-one loss and show that uncertainty sampling is preconditioned stochastic gradient descent on this smoothed zero-one loss. Finally, we show that uncertainty sampling iterates move in expectation in a descent direction (Section 3.3) and that uncertainty sampling converges to a stationary point of the smoothed zero-one loss (Section 3.4).
3.1 Incremental Parameter Updates
First, we analyze how the sample convex surrogate loss minimizer changes with a single additional point, showing that the change is a preconditioned[1] gradient step on the additional point. ([1] “Preconditioned” refers to multiplying the (stochastic) gradient by a symmetric positive semidefinite matrix in (stochastic) gradient descent (Li, 2018; Klein et al., 2011); this matrix is often chosen to approximate the inverse Hessian.) Let us assume the loss is convex and thrice differentiable with bounded derivatives:
Assumption 1 (Convex Loss).
The loss $\ell(z; \theta)$ is convex in $\theta$.
Assumption 2 (Loss Regularity).
The loss is continuously thrice differentiable in $\theta$, and the first three derivatives are bounded by some constant in the Frobenius norm.
Consider any iterative algorithm that adds a single point $z_t$ each time and minimizes the regularized training loss at each iteration $t$:
to produce the new parameters $\theta_{t+1}$. Since the datasets differ by only one point, we expect $\theta_t$ and $\theta_{t+1}$ to also be close. We can make this formal using Taylor's theorem. First, since $\theta_{t+1}$ is a minimizer of the new regularized objective, the gradient of that objective vanishes at $\theta_{t+1}$. Then, since the loss is continuously twice-differentiable:
Let $P_t$ be the value of the integral. Since the loss is convex and the regularizer is quadratic, the integrand is symmetric positive definite, and thus $P_t$ is symmetric positive definite and thus invertible. Therefore, we can solve for $\theta_{t+1} - \theta_t$:
Since $\theta_t$ minimizes the previous objective, its gradient vanishes at $\theta_t$. Also note that the two objectives differ only by the loss on the new point $z_t$. Thus,
The update above holds for any choice of the new point $z_t$; in particular, when $z_t$ is chosen by either random sampling or uncertainty sampling.
For random sampling, $z_t \sim p^*$, so we have
from which one can interpret the iterates of random sampling as preconditioned SGD on the population surrogate loss $L$.
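The preconditioned-step identity is easiest to see with a squared loss, where the Taylor expansion is exact and the preconditioner is just the Hessian of the new regularized objective. The following self-contained sketch (a hypothetical one-dimensional ridge-regression setup rather than the logistic loss) verifies it numerically:

```python
# With squared loss l((x, y); theta) = (theta*x - y)^2 / 2 and ridge
# regularization, the incremental-minimizer update satisfies exactly
#   theta_{t+1} - theta_t = -P^{-1} * l'((x, y); theta_t),
# where P is the Hessian of the NEW regularized objective.
lam = 1.0
old = [(1.0, 1.0), (2.0, 1.5), (-1.0, -0.8)]  # (x, y) pairs
new = (0.5, 2.0)

def minimizer(data):
    # argmin_theta sum (theta*x - y)^2/2 + (lam/2)*theta^2, in closed form
    return sum(x * y for x, y in data) / (lam + sum(x * x for x, _ in data))

theta_t = minimizer(old)
theta_t1 = minimizer(old + [new])

x, y = new
grad_new = (theta_t * x - y) * x                 # l' on the new point at theta_t
P = lam + sum(xi * xi for xi, _ in old + [new])  # Hessian of the new objective
print(theta_t1 - theta_t, -grad_new / P)         # the two sides coincide
```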
3.2 Parameter Updates of Uncertainty Sampling
Whereas random sampling is preconditioned SGD on the population surrogate loss $L$, we will now show that uncertainty sampling is preconditioned SGD on a smoothed version of the population zero-one loss $Z$.
Recall that $a$ is the acceptance function. Define $\bar{a}$ as a normalized anti-derivative of $a$, which converges pointwise to the Heaviside function when the domain is scaled by $\delta$ and $\delta \to 0$. First, we need an assumption that ensures $a$ has a finite integral over the real line.
Assumption 3 (Continuous, bounded, even).
$a$ is continuous,
$a$ has bounded support ($a(s) = 0$ for all $|s|$ sufficiently large), and
$a$ is even ($a(-s) = a(s)$).
Now we are ready to define the smoothed zero-one loss $Z_\delta$, which is made by replacing the Heaviside step function with $\bar{a}$ and scaling the domain by $\delta$.
We now show that $Z_\delta$ converges to $Z$ pointwise:
For all $\theta$, $\lim_{\delta \to 0} Z_\delta(\theta) = Z(\theta)$.
This follows from noticing that $\bar{a}(u / \delta) \to H(u)$ for all $u \neq 0$ and applying the Dominated Convergence Theorem. ∎
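As a sketch of this convergence (assuming the box acceptance function $a(s) = \mathbb{I}[|s| \le 1]$, whose normalized anti-derivative is a clipped linear ramp), the pointwise limit can be observed numerically on a tiny empirical distribution:

```python
def a_bar(s):
    # normalized anti-derivative of the box acceptance a(s) = 1{|s| <= 1}:
    # a ramp from 0 to 1 over [-1, 1]; a_bar(u / delta) tends to the
    # Heaviside step H(u) as delta -> 0 (for u != 0)
    return min(1.0, max(0.0, (s + 1.0) / 2.0))

def smoothed_zero_one(theta, data, delta):
    # Z_delta(theta): the zero-one loss with H replaced by a_bar and the
    # argument scaled by 1/delta
    return sum(a_bar(-y * theta * x / delta) for x, y in data) / len(data)

data = [(1.0, +1), (2.0, +1), (-0.5, +1), (-1.0, -1)]  # 1-D scores S = theta*x
for delta in (1.0, 0.1, 0.001):
    print(delta, smoothed_zero_one(1.0, data, delta))
# as delta -> 0 the value approaches the zero-one loss, here 0.25
# (only the point x = -0.5, y = +1 is misclassified at theta = 1)
```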
Before stating the dynamics of uncertainty sampling, we must first define another quantity: the probability of accepting a random point $x \sim p^*$:
We must further assume that the loss is exactly linear around zero margin, which is satisfied by the hinge loss and smoothed versions of the hinge loss.
Assumption 5 (Locally Linear Loss).
There is some neighborhood of $0$ on which the loss, as a function of the margin $y S(x; \theta)$, is exactly linear:
We also assume the score is smooth and the support of $p^*$ is bounded:
Assumption 6 (Smooth Score).
The score $S(x; \theta)$ is smooth; that is, all derivatives with respect to $x$ and $\theta$ exist.
Assumption 7 (Bounded Support).
The support of $p^*$ is bounded.
We are ready to state the relationship between the uncertainty sampling iterates (8), which are governed by the gradient on the selected point, and the smoothed zero-one loss $Z_\delta$:
First, note that
Because $\bar{a}$ is continuously differentiable, the score $S$ is smooth, and $p^*$ has bounded support, by the Leibniz Integral Rule we can exchange the integral and the derivative:
Because $a$ is even,
Now we are ready to evaluate the expected gradient on a point accepted by uncertainty sampling. From the definition of uncertainty sampling,
Notice that, by Assumption 5, the derivative of the loss is constant on small margins, and that the acceptance function vanishes on points whose scores are large relative to $\delta$. Thus, for $\delta$ small enough,
Thus, if $z_t$ is drawn using uncertainty sampling, the expected gradient term is in the direction of $\nabla Z_\delta(\theta_t)$, since the quantity in front of it is a scalar that is positive for all common losses.[2] ([2] The remaining factors are all positive, and the derivative of the loss at zero margin is negative for all reasonable losses, so the overall scalar is positive.) Similar to how we showed random sampling is preconditioned SGD on the population surrogate loss (9), uncertainty sampling is preconditioned SGD on the smoothed population zero-one loss $Z_\delta$.
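This proportionality can be checked exactly on a finite "population" (a hypothetical one-dimensional setup with the box acceptance $a(s) = \mathbb{I}[|s| \le 1]$ and the hinge loss, whose gradient is exactly $-y \cdot x$ on the small margins that get accepted):

```python
# finite population, so both expectations below are exact sums
pop = [(0.3, +1), (0.8, +1), (-0.2, +1), (-0.6, -1), (0.1, +1), (-0.9, -1)]
theta, delta = 1.0, 0.5

def a(s):
    # box acceptance: the point is labeled iff |S(x; theta)| <= delta
    return 1.0 if abs(s) <= 1.0 else 0.0

# expected hinge-loss gradient over accepted points; accepted points have
# margin |theta * x| <= delta < 1, where the hinge gradient is exactly -y*x
lhs = sum(a(theta * x / delta) * (-y * x) for x, y in pop) / len(pop)

# derivative of the smoothed zero-one loss Z_delta(theta), using
# a_bar'(s) = a(s) / 2 (the box integrates to 2) and the chain rule
rhs = sum((a(-y * theta * x / delta) / 2.0) * (-y * x / delta)
          for x, y in pop) / len(pop)

print(lhs, 2.0 * delta * rhs)  # equal: the expected update is along grad Z_delta
```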
The only unorthodox assumption is Assumption 5, which holds for a smoothed hinge loss but not the logistic loss. If we remove this assumption, we incur only a small additive vector term, which goes to $0$ quickly as $\delta \to 0$.
3.3 Descent Direction
So far, we have shown that uncertainty sampling is preconditioned SGD on the smoothed population zero-one loss $Z_\delta$ by analyzing the expected gradient term. To show that these updates are descent directions on $Z_\delta$, we also need to consider the preconditioner appearing in (8). Due to the quadratic regularization (5), the preconditioner is positive definite. However, we need to be careful, since the preconditioner depends on the resulting iterate $\theta_{t+1}$. Because of this snag, we need to ensure that a single point doesn't change the preconditioner too much, which we can accomplish with large enough regularization.
Theorem 9 (Uncertainty Sampling Descent Direction).
It might seem that the required regularization above is rather large. However, note that we don't optimize the regularized loss until after the $n_{\text{seed}}$ random seed points, so if the regularization parameter is at most the seed set size, then the regularization is always less than the number of data points that contribute to the loss, and the regularization will not dominate. Thus, this constraint on the regularization can be intuitively thought of as a constraint on the seed set size.
Having shown that uncertainty sampling iterates move in descent directions of $Z_\delta$ in expectation, we now turn to showing that they also converge to a stationary point of $Z_\delta$. To prove convergence, we need to stay (with high probability) in regions where the assumption of Theorem 8 holds, and also ensure that the parameters stay bounded (with high probability).
As is standard in stochastic gradient convergence analyses, instead of showing convergence of the final parameter iterate, we show the convergence of the parameters from a random iteration. For a budget $n$, let $t$ be drawn uniformly from the iterations, and define $\theta_R = \theta_t$ to be the randomized parameters. We will show that $\theta_R$ converges to a stationary point of $Z_\delta$ as $n \to \infty$.
Define the failure probability that any parameter iterate is too large or has zero acceptance probability:
We will assume that this failure probability converges to $0$ as $n \to \infty$. As $n$ grows, the seed set size grows, the regularization grows, and the effective step size shrinks; intuitively, we might expect the parameter iterates to become more stable. Unless the parameters diverge, we can choose the bounding radius large enough to contain the parameter iterates. Furthermore, the acceptance probability should be non-zero as long as there are no large regions of the space with zero probability density. Assuming the probability of these failure events goes to zero, the randomized parameters converge to a stationary point:
Theorem 10 (Convergence to Stationary Points).
These results shed light on the mysterious dynamics of uncertainty sampling which motivated this paper. In particular, uncertainty sampling can achieve lower zero-one loss than random sampling because it is implicitly descending on the smoothed zero-one loss . Furthermore, since is non-convex, uncertainty sampling can converge to different values depending on the initialization.
It is important to note that the actual uncertainty sampling algorithm is unchanged—it is still performing gradient updates on the convex surrogate loss. But because its sampling distribution is skewed towards the decision boundary, we can interpret its updates as being on the smoothed zero-one loss with respect to the original data-generating distribution.
We run uncertainty sampling on a simple synthetic dataset to illustrate the dynamics (Section 4.1) as well as on 22 real-world datasets (Section 4.2). In both cases, we show how uncertainty sampling converges to different parameters depending on the initialization, and how it can achieve lower asymptotic zero-one loss than minimizing the surrogate loss on all the data. Note that most active learning experiments measure the rate of convergence (data efficiency), whereas this paper focuses exclusively on asymptotic values and the variation obtained from different seed sets. We evaluate only on the zero-one loss, but all algorithms optimize the logistic loss.
4.1 Synthetic Data
Figure 2 (left) shows a mixture of Gaussian distributions in two dimensions. All the Gaussians are isotropic, and the circles indicate the variance (one standard deviation for the inner circle, and two standard deviations for the outer circle). The points drawn from the two red Gaussian distributions share one label, and the points drawn from the two blue ones share the other. The percentages refer to the mixture proportions of the clusters. We see that there are four local minima of the population zero-one loss, indicated by the green dashed lines. Each minimum misclassifies one of the Gaussian clusters, yielding error rates of about 10%, 20%, 30%, and 40%. The black dotted line corresponds to the parameters that minimize the logistic loss, which yields an error of about 20%.
Figure 2 (right) shows learning curves for different seed sets, which consist of two points, one from each class. We see that the uncertainty sampling learning curves converge to four different asymptotic losses, corresponding to the four local minima of the zero-one loss mentioned earlier. The thick black dashed line is the zero-one loss for random sampling. We see that uncertainty sampling can actually achieve lower loss than random sampling, since the global optimum of the logistic loss does not correspond to the global minimum of the zero-one loss.
4.2 Real-World Datasets
We collected 22 datasets from OpenML (retrieved August 2017) that had a large number of data points and where logistic regression outperformed the baseline classifier that always predicts the majority label. We further subsampled each dataset to have 10,000 points, which we divided into 7000 training points and 3000 test points. We created a stream of points by randomly selecting points one-by-one with replacement from the dataset. We ran uncertainty sampling on each dataset with random seed sets of sizes that are powers of two from 2 to 4096, and then 7000. We stopped when uncertainty sampling either could not select a point (the acceptance probability was zero) or when a point had been selected more than a fixed number of times. For each dataset and seed set size, we ran uncertainty sampling 10 times, for a total of 2,860 runs.
In Figure 3, we see scatter plots of the asymptotic zero-one loss of 130 points: 13 seed set sizes, each with 10 runs. The dataset on the left was chosen to exhibit the wide range of convergence values of uncertainty sampling, some with lower zero-one loss than with the full dataset. In both plots, we see that the variance of the zero-one loss of uncertainty sampling decreases as the seed set grows. This is expected from theory since the initialization has less variance for larger seed set sizes (as the seed set size goes to infinity, the parameters converge). For most of the datasets, the behavior was more similar to the plot on the right, where uncertainty sampling has a higher mean zero-one loss than random sampling for most seed sizes.
To gain a more quantitative understanding across all the datasets, we summarized the asymptotic zero-one loss of uncertainty sampling for various random seed set sizes. In Figure 5, we show the proportion of runs over the datasets where uncertainty sampling converges to a lower zero-one loss than using the entire dataset. In Figure 5, we also show a “violin plot” of the distribution of the ratio between the asymptotic zero-one loss of uncertainty sampling and the zero-one loss using the full dataset. We note that the mean and variance of this ratio drop significantly as the seed set grows larger. The initial parameters are poor if the seed set is small, and it is well known that poor initializations when locally optimizing non-convex functions can yield poor results, as seen here.
5 Related Work and Discussion
The phenomenon that uncertainty sampling can achieve lower error with a subset of the data rather than the entire dataset has been observed multiple times in the literature. In fact, the original uncertainty sampling paper (Lewis & Gale, 1994) notes that “For 6 of 10 categories, the mean [F-score] for a classifier trained on a uncertainty sample of 999 examples actually exceeds that from training on the full training set of 319,463”. Schohn & Cohn (2000) define a heuristic that selects the point closest to the decision boundary of an SVM, which is similar to uncertainty sampling in our formulation. In their abstract, the authors note, “We observe... that a SVM trained on a well-chosen subset of the available corpus frequently performs better than one trained on all available data”. More recently, Chang et al. (2017) developed an “active bias” technique that emphasizes uncertain points and found that it increases performance compared to using a fully-labeled dataset.
There is also work showing that the bias of active learning can harm final performance. Schütze et al. (2006) note the “missed cluster effect”, where active learning can ignore clusters in the data and never query points from them; this is seen in our synthetic experiment. Dasgupta & Hsu (2008) have a section on the bias of uncertainty sampling and provide another example where uncertainty sampling fails due to sampling bias, which we can explain as convergence to a local minimum of the zero-one loss. Bach (2007) and Beygelzimer et al. (2009) note this bias issue and propose different importance sampling schemes to re-weight points and correct for the bias.
In this work, we showed that uncertainty sampling updates are preconditioned SGD steps on the smoothed population zero-one loss and move in descent directions for parameters that are not approximate stationary points. Note that this does not give any global optimality guarantees. In fact, for linear classifiers, it is NP-hard to optimize the training zero-one loss to below $\frac{1}{2} - \epsilon$ (for any $\epsilon > 0$) even when there is a linear classifier that achieves just $\epsilon$ training zero-one loss (Feldman et al., 2012).
One of the key questions in light of this work is when optimizing a convex surrogate loss yields a good zero-one loss. If the loss function corresponds to the negative log-likelihood of a well-specified model, then the zero-one loss will have a local minimum at the parameters that optimize the log-likelihood. If the loss function is “classification-calibrated” (which holds for most common surrogate losses), Bartlett et al. (2006) show that if the convex surrogate loss of the estimated parameters converges to the optimal convex surrogate loss, then the zero-one loss of the estimated parameters converges to the global minimum of the zero-one loss (Bayes error). This holds only for universal classifiers (Micchelli et al., 2006), and in practice, these assumptions are unrealistic. For instance, several papers show how outliers and noise can cause linear classifiers learned on convex surrogate losses to suffer high zero-one loss (Nguyen & Sanner, 2013; Wu & Liu, 2007; Long & Servedio, 2010).
Other works connect active learning with optimization in rather different ways. Ramdas & Singh (2013) use active learning as a subroutine to improve stochastic convex optimization. Guillory et al. (2009) show how performing online active learning updates corresponds to online optimization updates of non-convex functions, specifically truncated convex losses. In this work, we analyze active learning with offline optimization and show the connection between uncertainty sampling and one particularly important non-convex loss, the zero-one loss.
In summary, our work is the first to show a connection between the zero-one loss and the commonly-used uncertainty sampling. This provides an explanation and understanding of the various empirical phenomena observed in the active learning literature. Uncertainty sampling simultaneously offers the hope of converging to lower error and the danger of converging to local minima (an issue that can possibly be avoided with larger seed sizes). We hope this connection can lead to improved active learning and optimization algorithms.
The code, data, and experiments for this paper are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0xf8dfe5bcc1dc408fb54b3cc15a5abce8/.
This research was supported by an NSF Graduate Fellowship to the first author.
- Bach (2007) Bach, F. R. Active learning for misspecified generalized linear models. In Advances in neural information processing systems, pp. 65–72, 2007.
- Balcan et al. (2006) Balcan, M.-F., Beygelzimer, A., and Langford, J. Agnostic active learning. In Proceedings of the 23rd international conference on Machine learning, pp. 65–72. ACM, 2006.
- Bartlett et al. (2006) Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- Beygelzimer et al. (2009) Beygelzimer, A., Dasgupta, S., and Langford, J. Importance weighted active learning. In Proceedings of the 26th annual international conference on machine learning, pp. 49–56. ACM, 2009.
- Bordes et al. (2005) Bordes, A., Ertekin, S., Weston, J., and Bottou, L. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6(Sep):1579–1619, 2005.
- Chang et al. (2017) Chang, H.-S., Learned-Miller, E., and McCallum, A. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, pp. 1003–1013, 2017.
- Dasgupta & Hsu (2008) Dasgupta, S. and Hsu, D. Hierarchical sampling for active learning. In Proceedings of the 25th international conference on Machine learning, pp. 208–215. ACM, 2008.
- Feldman et al. (2012) Feldman, V., Guruswami, V., Raghavendra, P., and Wu, Y. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558–1590, 2012.
- Guillory et al. (2009) Guillory, A., Chastain, E., and Bilmes, J. Active learning as non-convex optimization. In Artificial Intelligence and Statistics, pp. 201–208, 2009.
- Hanneke et al. (2014) Hanneke, S. et al. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
- Hoveijn (2007) Hoveijn, I. Differentiability of the volume of a region enclosed by level sets. arXiv preprint arXiv:0712.0915, 2007.
- Klein et al. (2011) Klein, S., Staring, M., Andersson, P., and Pluim, J. P. Preconditioned stochastic gradient descent optimisation for monomodal image registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 549–556. Springer, 2011.
- Lewis & Gale (1994) Lewis, D. D. and Gale, W. A. A sequential algorithm for training text classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 3–12. Springer-Verlag New York, Inc., 1994.
- Li (2018) Li, X.-L. Preconditioned stochastic gradient descent. IEEE transactions on neural networks and learning systems, 29(5):1454–1466, 2018.
- Long & Servedio (2010) Long, P. M. and Servedio, R. A. Random classification noise defeats all convex potential boosters. Machine learning, 78(3):287–304, 2010.
- Micchelli et al. (2006) Micchelli, C. A., Xu, Y., and Zhang, H. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.
- Nguyen & Sanner (2013) Nguyen, T. and Sanner, S. Algorithms for direct 0–1 loss optimization in binary classification. In International Conference on Machine Learning, pp. 1085–1093, 2013.
- Ramdas & Singh (2013) Ramdas, A. and Singh, A. Algorithmic connections between active learning and stochastic convex optimization. In International Conference on Algorithmic Learning Theory, pp. 339–353. Springer, 2013.
- Schohn & Cohn (2000) Schohn, G. and Cohn, D. Less is more: Active learning with support vector machines. In ICML, pp. 839–846. Citeseer, 2000.
- Schütze et al. (2006) Schütze, H., Velipasaoglu, E., and Pedersen, J. O. Performance thresholding in practical text classification. In Proceedings of the 15th ACM international conference on Information and knowledge management, pp. 662–671. ACM, 2006.
- Settles (2010) Settles, B. Active learning literature survey. Computer Sciences Technical Report, 1648, 2010.
- Tong & Koller (2001) Tong, S. and Koller, D. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
- Wu & Liu (2007) Wu, Y. and Liu, Y. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974–983, 2007.
- Yang & Loog (2016) Yang, Y. and Loog, M. A benchmark and comparison of active learning for logistic regression. arXiv preprint arXiv:1611.08618, 2016.
6.1 Descent Direction and Convergence
We prove two lemmas about the parameter updates. First, we show that the parameter iterates change little in a single step, due to regularization (Lemma 11). Second, we show that the parameter update is approximately a preconditioned gradient step (Lemma 12), and we bound the approximation error. This is important because it makes the update depend on the iteration only through the gradient of the loss at $z_t$, the point selected at iteration $t$. With these two lemmas and Theorem 8, the descent direction result (Theorem 9) is straightforward, and the convergence result (Theorem 10) follows a standard SGD convergence argument.
6.1.1 Parameter Update Lemmas
As in the main text, we have
Thus, and further . Together, this implies that .
Using the Taylor expansion,
Since the loss is convex with quadratic regularization,
From a Taylor expansion,
We want to solve for the parameter difference $\theta_{t+1} - \theta_t$, but in order to do this, we need to bound the remaining terms.
Using Lemma 11
Solving for $\theta_{t+1} - \theta_t$ in the Taylor expansion,
Looking at the theorem statement, we are almost done. The only difference between the theorem statement and the equation above is the point at which the gradient of the loss is evaluated ($\theta_t$ versus $\theta_{t+1}$). We can use the triangle inequality to bound the difference.
6.1.2 Descent Direction
The first thing to note is that if , then .
Because has bounded support and since is smooth, there exists a constant such that .
And thus, if , then .
This will allow us to use Theorem 8 later in the proof.
Using Lemma 12,
Note that the only part that depends on the iteration is the expectation term, which we can evaluate by Theorem 8. Thus,
If the last term is positive then the whole expression is less than and the theorem is proved. A sufficient condition for this to be the case is that,
which are both satisfied for
Assume that the parameter iterates are bounded and that the acceptance probability is non-zero for all parameter iterates. This occurs with probability going to $1$, so if we can show convergence in probability under this condition, then unconditional convergence in probability follows. The set of bounded parameters is compact, and thus the relevant continuous quantities are bounded by some constant.
From a Taylor expansion, for some point between $\theta_t$ and $\theta_{t+1}$,
Taking an expectation conditioned on the current iterate,