
Active Learning in the Overparameterized and Interpolating Regime

05/29/2019
by Mina Karzand, et al.
University of Wisconsin-Madison

Overparameterized models that interpolate training data often display surprisingly good generalization properties. Specifically, minimum norm solutions have been shown to generalize well in the overparameterized, interpolating regime. This paper introduces a new framework for active learning based on the notion of minimum norm interpolators. We analytically study its properties and behavior in the kernel-based setting and present experimental studies with kernel methods and neural networks. In general, active learning algorithms adaptively select examples for labeling that (1) rule out as many (incompatible) classifiers as possible at each step and/or (2) discover cluster structure in unlabeled data and label representative examples from each cluster. We show that our new active learning approach based on a minimum norm heuristic automatically exploits both these strategies.


1 Overparameterized Models and Active Learning

The success of deep learning systems has sparked interest in understanding how and why overparameterized models that interpolate the training data often display surprisingly good generalization properties [27, 11, 38, 12, 8, 1, 10, 23]. Notably, it is now understood that minimum norm solutions have the potential to generalize well in the overparameterized, interpolating regime [11, 9, 23, 25]. These findings suggest that the theory of active learning also requires careful reconsideration. This paper introduces a new framework for active learning based on the notion of minimum norm interpolators. We propose a new active learning heuristic for kernel machines inspired by the recent breakthroughs mentioned above.

Active learning algorithms adaptively select examples for labeling based on two general strategies [15]: 1) selecting examples that rule out as many (incompatible) classifiers as possible at each step; and 2) discovering cluster structure in unlabeled data and labeling representative examples from each cluster. We show that our new active learning approach based on a minimum norm heuristic automatically exploits both these strategies. Figure 1 depicts where samples are selected using the new criterion based on minimum norm interpolators.

Figure 1: Active learning using minimum norm interpolators strategically selects examples for labeling (red points). (a) reduces to binary search in simple 1-d threshold problem setting; (b) labeling is focused near decision boundary in multidimensional setting; (c) automatically discovers clusters and labels representative examples from each.

1.1 Review of Active Learning

Active learning involves selectively labeling examples based on some measure of informativeness and/or label uncertainty; see [34, 35] for a broad overview of the field. Here we discuss the key ideas exploited in active learning and put our contribution in context. We emphasize that this is not a comprehensive literature review.

Perhaps the simplest example of active learning is binary search or bisection to identify a single decision threshold in a one-dimensional classification setting. More general “version space” approaches [13, 3, 31, 22] extend this idea to arbitrary classes. Alternatively, the one-dimensional binary search problem naturally generalizes to the problem of learning a multi-dimensional linear classifier [4, 17, 5] and its kernel-based extensions [37]. These methods are “margin-based,” selecting examples for labeling in the vicinity of the learned decision boundary. However, in the overparameterized and interpolating regime, potentially every labeled example is near the learned decision boundary. This calls for a more refined sample selection criterion. Roughly speaking, our new selection criteria favor examples that are close to the current decision boundary and closest to oppositely labeled examples. This yields optimal bisection sampling in simple low-dimensional settings.

Exploiting the geometric structure of data (e.g., clustering) is also a powerful heuristic for active learning [15]. Graph-based active learning inspired by semi-supervised learning [41, 14] and hierarchical clustering methods [16] are two common geometric approaches to active learning. Bayesian methods based on notions of information gain can also exploit distributional structure [18, 20, 24, 26], but require appropriate specifications of prior distributions. The new active learning criteria proposed in this paper are capable of automatically exploiting geometric structure in data, such as clusters, without specifying priors, forming a graph, or clustering.

Inspired by the past theory and methods discussed above, several proposals have been recently made for deep active learning [33, 19, 39, 36]. None of this work, however, directly addresses overparameterized and interpolating active learners.

2 A New Active Learning Criterion

At each iteration of the active learning algorithm, given the currently labeled set of samples, a new unlabeled point is selected to be labeled. The criterion we propose for picking the samples to be labeled is based on a ‘max-min’ operator. We describe the criterion in its most general form along with the intuition behind this choice. In the remainder of the paper, we present theoretical results on the properties of variations of this criterion in various setups, along with additional descriptive numerical evaluations and simulations.

Let $L = \{x_1, \dots, x_n\}$ with labels $y_1, \dots, y_n$ be the set of labeled examples so far. We assume binary valued labels, $y_i \in \{-1,+1\}$, but our approach can be generalized to the multiclass setting. Let $U$ be the set of unlabeled samples. In other words, we have a partially labeled training set. In the interpolating regime, the goal is to correctly label all the points in $U$ so that the training error is zero. Passive learning generally requires labeling every point in $U$. Active learning sequentially selects points in $U$ for labeling, with the aim of learning a correct classifier without necessarily labeling all of $U$. Our setting can be viewed as an instance of pool-based active learning.

Let $\mathcal{F}$ be a rich collection of functions capable of interpolating the training data. For example, $\mathcal{F}$ could be a nonparametric infinite-dimensional Reproducing Kernel Hilbert Space (RKHS) or an overparameterized neural network representation. Let $\hat f$ be the minimum norm function in $\mathcal{F}$ that interpolates the labeled examples. Note that $\hat f$ depends on the set $L$ and its labels, but we do not show that dependency in our notation for brevity. Define $\hat f_{x,y}$ to be the minimum norm interpolating function based on $L$ and the point $x \in U$ with label $y \in \{-1,+1\}$. For each $x \in U$, we select a label according to one of the following criteria:

$\hat y(x) = \arg\min_{y \in \{-1,+1\}} \|\hat f_{x,y}\|$  or  $\hat y(x) = \arg\min_{y \in \{-1,+1\}} \|\hat f_{x,y} - \hat f\|$    (1)

The rationale for the definition of $\hat y(x)$ is that, operating in the interpolating regime, we select the label that yields the minimum norm interpolant (i.e., the “smoother” of the two interpolants). Intuitively, one could argue that our best guess of the label of $x$ is $\mathrm{sign}(\hat f(x))$ (where $\hat f$ is the interpolation function based only on $L$). We will show that if $\mathcal F$ is an RKHS, then $\hat y(x) = \mathrm{sign}(\hat f(x))$ for all $x$ (assuming the Hilbert norm is used in Eq. (1)). In general, the definition is more flexible since one can choose an arbitrary norm in Eq. (1); e.g., a variety of norms could be considered if $\mathcal F$ is represented by a neural network. Define $\hat f_x := \hat f_{x, \hat y(x)}$. The norm of this minimizer or the norm of the difference $\hat f_x - \hat f$ are viewed as score functions for the point $x$:

$s_1(x) = \|\hat f_x\|$  or  $s_2(x) = \|\hat f_x - \hat f\|$    (2)

Then, the next point to be labeled by our active learning algorithm is $\arg\max_{x \in U} s(x)$, where $s$ is either of the score functions above. We show in Section 3.2 that if $\mathcal F$ is an RKHS and the Hilbert norm is used, the two selection criteria above are equivalent. These selection criteria are motivated by the aim of interpolating all the data.

Interpolating functions with larger scores are less smooth (in terms of norm) or change more from the previous interpolator ($s_1$ and $s_2$, respectively). The intuition is that attacking the most challenging points in the input space first may eliminate the need to label other “easier” examples later. The distinction between the two definitions of the score function in Eq. (2) is as follows. Scoring unlabeled points according to $s_1$ prioritizes labeling the examples which result in minimum norm interpolating functions with the largest norm. Since the norm of a function can be associated with its smoothness, roughly speaking, this means that this criterion picks the points which give the least smooth interpolating functions. We expect these points to be the most “informative” samples under some measure of informativeness. Scoring unlabeled points according to $s_2$ prioritizes labeling the samples which result in minimum norm interpolating functions with the largest change from the current interpolator. Again, we expect these points to be the most informative. In Section 3 we present theoretical results supporting this intuition.

We purposely did not specify the norms in Equations (1) and (2), so that we have the freedom to pick different norms in each of them in different variations of the score function. If these norms are the same, then using $s_1$ the next point is chosen such that

$x^{\mathrm{next}} = \arg\max_{x \in U} \; \min_{y \in \{-1,+1\}} \|\hat f_{x,y}\| .$

This is the reason we call this a ‘max-min’ operator.
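For concreteness, the following Python sketch implements this max-min selection rule for a kernel interpolator. It is a minimal illustration, not the paper's implementation: the Laplace kernel, bandwidth, ridge term, and function names are our own choices.

```python
import numpy as np

def laplace_kernel(A, B, h=1.0):
    """Laplace kernel matrix between the rows of A and B (L1 distances)."""
    d = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-d / h)

def min_norm_interpolator(X, y, h=1.0, reg=1e-10):
    """Minimum RKHS-norm interpolant of (X, y); returns the function and its squared norm."""
    K = laplace_kernel(X, X, h) + reg * np.eye(len(X))   # tiny ridge for numerical stability
    alpha = np.linalg.solve(K, y)                        # coefficients of Eq. (3) below
    return (lambda Z: laplace_kernel(Z, X, h) @ alpha), float(y @ alpha)  # ||f||^2 = y^T K^{-1} y

def max_min_select(XL, yL, XU, h=1.0):
    """Return the index of the unlabeled point with the largest max-min score."""
    scores = []
    for x in XU:
        norms = []
        for y_guess in (-1.0, 1.0):                      # hypothesize each label for x
            Xa = np.vstack([XL, x[None, :]])
            ya = np.append(yL, y_guess)
            norms.append(min_norm_interpolator(Xa, ya, h)[1])
        scores.append(min(norms))                        # min over the two labels ...
    return int(np.argmax(scores))                        # ... max over the unlabeled pool

# Toy usage: a 1-d threshold problem embedded as (n, 1) arrays.
XL = np.array([[0.1], [0.9]])
yL = np.array([-1.0, 1.0])
XU = np.linspace(0.0, 1.0, 21)[:, None]
print(XU[max_min_select(XL, yL, XU, h=0.2)])             # picks a point near the midpoint 0.5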

3 Interpolating Active Learners in an RKHS

In this section, we focus on minimum norm interpolating functions in a Reproducing Kernel Hilbert Space (RKHS). We present theoretical properties for general RKHS settings, detailed analytical results in the one-dimensional setting, and numerical studies in multiple dimensions. Broadly speaking, we establish the following properties:

  • the proposed score functions tend to select examples near the decision boundary of $\hat f$, the current interpolator;

  • the score is largest for unlabeled examples near the decision boundary and close to oppositely labeled examples, in effect searching for the boundary in the most likely region of the input space;

  • in one dimension, the interpolating active learner coincides with an optimal binary search procedure;

  • using data-based function norms, rather than the RKHS norm, the interpolating active learner executes a tradeoff between sampling near the current decision boundary and sampling in regions far away from currently labeled examples, thus exploiting cluster structure in the data.

3.1 Kernel Methods

A Hilbert space $\mathcal H$ is associated with an inner product $\langle f, g \rangle_{\mathcal H}$ for $f, g \in \mathcal H$. This induces a norm defined by $\|f\|_{\mathcal H} = \sqrt{\langle f, f \rangle_{\mathcal H}}$. A symmetric bivariate function $K(\cdot, \cdot)$ is positive semidefinite if for all $n$ and elements $x_1, \dots, x_n$, the matrix $\mathbf K$ with elements $K_{ij} = K(x_i, x_j)$ is positive semidefinite (PSD). These functions are called PSD kernel functions. A PSD kernel constructs a Hilbert space $\mathcal H_K$ of functions on the input space. For any $x$ and any $f \in \mathcal H_K$, the function $K(x, \cdot) \in \mathcal H_K$ and $\langle f, K(x, \cdot) \rangle_{\mathcal H_K} = f(x)$ (the reproducing property).

For labeled points $x_1, \dots, x_n$ and labels $y = [y_1, \dots, y_n]^T$, we define $\hat f$ to be the interpolating function (i.e., $\hat f(x_i) = y_i$ for all $i$) with minimum norm (according to the norm associated with $\mathcal H_K$). Define $\mathbf K$ to be the $n \times n$ matrix such that $K_{ij} = K(x_i, x_j)$, and let $k(x) = [K(x, x_1), \dots, K(x, x_n)]^T$. Then, one can show that there exists a vector $\alpha = \mathbf K^{-1} y$ such that $\hat f$ can be decomposed as

$\hat f(\cdot) = \sum_{i=1}^n \alpha_i K(x_i, \cdot).$    (3)

Using the reproducing property $\langle K(x_i, \cdot), K(x_j, \cdot) \rangle_{\mathcal H_K} = K(x_i, x_j)$, we then have $\|\hat f\|_{\mathcal H_K}^2 = \alpha^T \mathbf K \alpha = y^T \mathbf K^{-1} y$ and $\hat f(x) = k(x)^T \mathbf K^{-1} y$.
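As a quick sanity check of this representation (our own illustration, with an arbitrary Laplace kernel and bandwidth), the snippet below verifies numerically that $\alpha = \mathbf K^{-1} y$ interpolates the labels and that $\|\hat f\|_{\mathcal H_K}^2 = \alpha^T \mathbf K \alpha = y^T \mathbf K^{-1} y$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 2))                       # labeled inputs
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])          # labels

def laplace_K(A, B, h=0.5):
    return np.exp(-np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2) / h)

K = laplace_K(X, X)
alpha = np.linalg.solve(K, y)                      # coefficients in Eq. (3)
f = lambda Z: laplace_K(Z, X) @ alpha              # f(z) = sum_i alpha_i K(x_i, z)

print(np.allclose(f(X), y))                                          # interpolates the labels
print(np.isclose(alpha @ K @ alpha, y @ np.linalg.solve(K, y)))      # ||f||^2 = y^T K^{-1} y
```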

3.2 Properties of General Kernels for Active Learning

We first show that, using an RKHS for interpolation and the Hilbert norm in (1), the selected label is $\hat y(x) = \mathrm{sign}(\hat f(x))$. We then show that the two score functions in (2) are equivalent in the RKHS setting.

Proposition 1.

For $x \in U$ and $y \in \{-1,+1\}$, define $\hat f$ and $\hat f_{x,y}$ according to the definitions in Section 2, using the Hilbert norm. Then, $\hat y(x) = \mathrm{sign}(\hat f(x))$.

Proof.

Let $y = [y_1, \dots, y_n]^T$ and $k(x) = [K(x, x_1), \dots, K(x, x_n)]^T$. Let $\mathbf K$ be the kernel matrix for the elements in $L$ and $\tilde{\mathbf K}$ be the kernel matrix for the elements in $L \cup \{x\}$ (hence, $\tilde{\mathbf K}$ can be constructed from $\mathbf K$, $k(x)$, and $K(x,x)$). Then, for $y' \in \{-1,+1\}$,

$\|\hat f_{x,y'}\|_{\mathcal H}^2 = \begin{bmatrix} y \\ y' \end{bmatrix}^T \tilde{\mathbf K}^{-1} \begin{bmatrix} y \\ y' \end{bmatrix} \overset{(a)}{=} y^T \mathbf K^{-1} y + \dfrac{\big(y' - k(x)^T \mathbf K^{-1} y\big)^2}{K(x,x) - k(x)^T \mathbf K^{-1} k(x)} \overset{(b)}{=} \|\hat f\|_{\mathcal H}^2 + \dfrac{\big(y' - \hat f(x)\big)^2}{K(x,x) - k(x)^T \mathbf K^{-1} k(x)},$

where Schur’s complement formula and the Woodbury identity with a bit of algebra give (a), and (b) uses (3) for the minimum norm interpolating function. Hence, $\|\hat f_{x,+1}\|_{\mathcal H} \le \|\hat f_{x,-1}\|_{\mathcal H}$ iff $\hat f(x) \ge 0$. ∎

Proposition 2.

If the norm used in the definition of the score functions in (2) is the Hilbert norm, then $s_1(x)^2 = \|\hat f\|_{\mathcal H}^2 + s_2(x)^2$ for all $x \in U$. Hence, the two score functions can be used interchangeably.

We omit the proof for brevity. It is based on some simple algebra which shows that for any $x \in U$ and $y \in \{-1,+1\}$,

$\|\hat f_{x,y}\|_{\mathcal H}^2 - \|\hat f\|_{\mathcal H}^2 = \|\hat f_{x,y} - \hat f\|_{\mathcal H}^2 = \dfrac{\big(y - \hat f(x)\big)^2}{K(x,x) - k(x)^T \mathbf K^{-1} k(x)}.$
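Both identities are easy to check numerically. The sketch below (ours, using an arbitrary Laplace kernel and random data) confirms that the label minimizing the interpolant norm agrees with $\mathrm{sign}(\hat f(x))$ and that the Hilbert-norm identity of Proposition 2 holds to numerical precision.

```python
import numpy as np

def laplace_K(A, B, h=0.5):
    return np.exp(-np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2) / h)

def min_norm(X, y, h=0.5):
    a = np.linalg.solve(laplace_K(X, X, h), y)
    return (lambda Z: laplace_K(Z, X, h) @ a), a              # interpolant and its coefficients

rng = np.random.default_rng(1)
XL = rng.uniform(size=(6, 2))
yL = np.sign(rng.standard_normal(6))
x = rng.uniform(size=(1, 2))                                  # a candidate unlabeled point

f_hat, a = min_norm(XL, yL)
sq_norm = float(yL @ a)                                       # ||f_hat||^2

norms = {}
for y_guess in (-1.0, 1.0):
    _, a_aug = min_norm(np.vstack([XL, x]), np.append(yL, y_guess))
    norms[y_guess] = float(np.append(yL, y_guess) @ a_aug)    # ||f_{x,y}||^2

# Proposition 1: the label with the smaller interpolant norm is sign(f_hat(x)).
y_star = min(norms, key=norms.get)
print(y_star == np.sign(f_hat(x))[0])

# Proposition 2 (Hilbert norm): ||f_{x,y}||^2 = ||f_hat||^2 + ||f_{x,y} - f_hat||^2.
_, a_aug = min_norm(np.vstack([XL, x]), np.append(yL, y_star))
cross = float(a_aug @ laplace_K(np.vstack([XL, x]), XL) @ a)  # <f_{x,y}, f_hat>_H
diff_sq = norms[y_star] - 2.0 * cross + sq_norm               # ||f_{x,y} - f_hat||^2
print(np.isclose(norms[y_star], sq_norm + diff_sq))
```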

3.3 Laplace Kernel in One Dimension

To develop some intuition, we consider active learning in one dimension. The sort of target function we have in mind is a multiple threshold classifier. Optimal active learning in this setting coincides with binary search. We now show that the proposed score function with the Hilbert norm and the Laplace kernel results in an optimal active learning procedure in one dimension (proof in Appendix A).

Proposition 3.

[Max-min criterion in one dimension] Let $K(x, x') = e^{-|x - x'|/h}$ be the Laplace kernel in one dimension. Then the following statements hold for any value of the bandwidth $h > 0$:

  1. Let $t_1 < t_2$ be two neighboring labeled points in $L$. Then $s\big(\tfrac{t_1 + t_2}{2}\big) \ge s(x)$ for all $x \in (t_1, t_2)$.

  2. Let $(t_1, t_2)$ and $(t_3, t_4)$ be two pairs of neighboring labeled points such that $t_2 - t_1 \le t_4 - t_3$, then

    • if both pairs are oppositely labeled, then $s\big(\tfrac{t_1 + t_2}{2}\big) \ge s\big(\tfrac{t_3 + t_4}{2}\big)$.

    • if both pairs are similarly labeled, then $s\big(\tfrac{t_1 + t_2}{2}\big) \le s\big(\tfrac{t_3 + t_4}{2}\big)$.

    • if $(t_1, t_2)$ is oppositely labeled and $(t_3, t_4)$ is similarly labeled, then $s\big(\tfrac{t_1 + t_2}{2}\big) \ge s\big(\tfrac{t_3 + t_4}{2}\big)$.

    • if $(t_1, t_2)$ is similarly labeled and $(t_3, t_4)$ is oppositely labeled, then $s\big(\tfrac{t_1 + t_2}{2}\big) \le s\big(\tfrac{t_3 + t_4}{2}\big)$.

The key conclusion drawn from these properties is that the midpoints between the closest oppositely labeled neighboring examples have the highest score. If there are no oppositely labeled neighbors, then the score is largest at the midpoint of the largest gap between consecutive samples. Thus, the score results in a binary search for the thresholds defining the classifier. Using the proposition above, it is easy to show the following result, proved in Appendix B.

Corollary 1.

Consider $n$ points uniformly distributed in the unit interval, labeled according to a $k$-piecewise constant function whose pieces have lengths roughly on the order of $1/k$. Then, running the proposed active learning algorithm with the Laplace kernel and any bandwidth, after $O(k \log n)$ queries the sign of the resulting interpolant correctly labels all $n$ examples (i.e., the training error is zero).
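The bisection behavior can be reproduced with a short simulation. The sketch below is our own illustration of the corollary's setting: the thresholds, sample size, bandwidth, seed labels (we start from the two extreme points), and helper names are arbitrary choices, not the paper's experimental setup.

```python
import numpy as np

def laplace_K(A, B, h=0.1):
    return np.exp(-np.abs(A[:, None] - B[None, :]) / h)

def fit(X, y, h=0.1):
    a = np.linalg.solve(laplace_K(X, X, h) + 1e-10 * np.eye(len(X)), y)
    return (lambda Z: laplace_K(Z, X, h) @ a), float(y @ a)      # interpolant, squared norm

rng = np.random.default_rng(0)
n = 200
X = np.sort(rng.uniform(size=n))
thresholds = [0.3, 0.55, 0.8]                                    # a 4-piece constant labeling
y = np.where(np.searchsorted(thresholds, X) % 2 == 0, 1.0, -1.0)

labeled = [0, n - 1]                                             # seed with the two extreme points
while True:
    f, _ = fit(X[labeled], y[labeled])
    if np.all(np.sign(f(X)) == y):                               # zero training error reached
        break
    pool = [i for i in range(n) if i not in labeled]
    scores = [min(fit(np.append(X[labeled], X[i]), np.append(y[labeled], yy))[1]
                  for yy in (-1.0, 1.0)) for i in pool]          # max-min score of each candidate
    labeled.append(pool[int(np.argmax(scores))])

print(f"{len(labeled)} labels out of n = {n}")                   # far fewer labels than n
```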

3.4 General Radial-Basis Kernels in One Dimension

For $q \ge 1$, we define the $\ell_q$ norm of a $d$-dimensional vector $v$ to be $\|v\|_q = \big(\sum_{i=1}^d |v_i|^q\big)^{1/q}$. We define the radial basis kernel with parameter $q$ and bandwidth $h$ to be

$K(x, x') = e^{-\|x - x'\|_q^q / h}.$    (4)

In the next proposition, we look at the special case of radial basis kernels applied to one-dimensional functions with only three initial points. We show that maximizing the score is equivalent to picking the zero-crossing point of our current interpolator.

Proposition 4 (One Dimensional Functions with Radial Basis Kernels).

Assume that any pair of labeled samples is separated by at least a constant distance $c > 0$, and assume the bandwidth $h$ is sufficiently small relative to $c$. Let $t_1 < t_2$ be neighboring labeled points with opposite labels. For $x$ such that $t_1 < x < t_2$, we have $\arg\max_{x \in (t_1, t_2)} s(x) = x^*$, where $x^*$ is the point satisfying $\hat f(x^*) = 0$.

The proof is rather tedious and appears in Appendix C, but the idea is based on showing that, with small enough bandwidth, the score is increasing in $x$ on the portion of the interval before the zero crossing and decreasing after it. This shows that the maximum occurs at the point $x^*$ such that $\hat f(x^*) = 0$, i.e., at the zero crossing of the current interpolator.
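A quick numerical illustration of this zero-crossing behavior, using the Laplace kernel (i.e., $q = 1$ in Eq. (4)) with three labeled points and the ratio form of the score derived in the previous subsections; the specific locations and bandwidth are our own choices.

```python
import numpy as np

def K(A, B, h=0.2):                       # Laplace kernel, i.e. q = 1 in Eq. (4)
    return np.exp(-np.abs(A[:, None] - B[None, :]) / h)

X = np.array([0.0, 0.6, 1.0])             # three labeled points ...
y = np.array([1.0, 1.0, -1.0])            # ... the right-hand pair is oppositely labeled
Kinv = np.linalg.inv(K(X, X))

grid = np.linspace(0.0, 1.0, 501)
grid = grid[np.abs(grid[:, None] - X[None, :]).min(axis=1) > 1e-3]   # drop labeled locations

kx = K(grid, X)                                            # k(x) for every candidate x
f_hat = kx @ Kinv @ y                                      # current interpolator, Eq. (3)
cond = 1.0 - np.einsum("ij,jk,ik->i", kx, Kinv, kx)        # K(x,x) - k(x)^T K^{-1} k(x)
score_sq = (1.0 - np.abs(f_hat)) ** 2 / cond               # squared difference-norm score

print("highest score at x =", grid[np.argmax(score_sq)])       # about 0.8, between the +/- pair
print("zero crossing at x =", grid[np.argmin(np.abs(f_hat))])  # also about 0.8 (Proposition 4)
```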

3.5 Data-Based Norm

As mentioned above, one can consider norms other than the RKHS norm. One natural option is to consider the empirical $\ell_2$ norm of the output map over the unlabeled data. Consider the score

$s(x) = \dfrac{1}{|U|} \sum_{z \in U} \big(\hat f_x(z) - \hat f(z)\big)^2.$    (5)

Intuitively, this score measures the expected change in the squared values of the interpolant over all unlabeled examples if $x$ is selected as the next point. This norm is sensitive to the particular distribution of the data, which is important if the data are clustered. This behavior will be demonstrated in the multidimensional setting discussed next.
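The sketch below implements this data-based score under the assumption that Eq. (5) averages the squared change of the interpolant over the unlabeled pool and that the hypothesized label is chosen by the sign rule of Proposition 1; the clustered toy data, kernel, bandwidth, and function names are our own.

```python
import numpy as np

def laplace_K(A, B, h=0.5):
    return np.exp(-np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2) / h)

def min_norm_fn(X, y, h=0.5):
    a = np.linalg.solve(laplace_K(X, X, h) + 1e-10 * np.eye(len(X)), y)
    return lambda Z: laplace_K(Z, X, h) @ a

def data_based_scores(XL, yL, XU, h=0.5):
    """Average squared change of the interpolant over the unlabeled pool, cf. Eq. (5).
    The hypothesized label of each candidate is sign(f_hat(x)), per Proposition 1."""
    f_hat = min_norm_fn(XL, yL, h)
    base = f_hat(XU)
    scores = np.empty(len(XU))
    for i, x in enumerate(XU):
        y_hat = 1.0 if base[i] >= 0 else -1.0
        f_new = min_norm_fn(np.vstack([XL, x[None]]), np.append(yL, y_hat), h)
        scores[i] = np.mean((f_new(XU) - base) ** 2)
    return scores

# Toy usage: two clusters; only the left cluster has a labeled representative.
rng = np.random.default_rng(0)
XL = np.array([[0.2, 0.5]]); yL = np.array([1.0])
XU = np.vstack([rng.normal([0.2, 0.5], 0.05, size=(30, 2)),
                rng.normal([0.8, 0.5], 0.05, size=(30, 2))])
print(XU[np.argmax(data_based_scores(XL, yL, XU))])   # tends to pick a point in the unlabeled right cluster
```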

3.6 Multidimensional Setting

The properties and behavior found in the one-dimensional setting carry over to higher dimensions. In particular, the max-min norm criterion tends to select unlabeled examples near the decision boundary and close to oppositely labeled examples. This is illustrated in Figure 2 below. The input points (training examples) are uniformly distributed in a square region. We trained a Laplace kernel machine to perfectly interpolate four training points with locations and binary labels as depicted in Figure 2(a). The color depicts the magnitude of the learned interpolating function: dark blue indicates the “decision boundary” and bright yellow the largest magnitudes. Figure 2(b) shows the score for selecting a point at each location based on the RKHS norm criterion. Figure 2(c) shows the score for selecting a point at each location based on the data-based norm criterion discussed above. Both criteria select a point on the decision boundary, but the RKHS norm favors points that are closest to oppositely labeled examples whereas the data-based norm favors points on the boundary further from labeled examples.

Figure 2: Data selection of the Laplace kernel active learner. (a) Magnitude of the output map of a kernel machine trained to interpolate four data points as indicated (dark blue indicates the learned decision boundary). (b) Max-min RKHS norm selection of the next point to label. Brightest yellow is the location of the highest score and the selected example. (c) Max-min selection of the next point to label using the data-based norm. Both select a point on the decision boundary, but the RKHS norm favors points that are closest to oppositely labeled examples.

Next we present a modified scenario in which the examples are not uniformly distributed over the input space, but instead concentrated only in certain regions indicated by the magenta highlights in Figure 3(a). In this setting, the example selection criteria differ more significantly for the two norms. The RKHS norm selection criterion remains unchanged, but is applied only to regions where there are examples. Areas without examples to select are indicated by dark blue in Figure 3(b)-(c). The data-based norm is sensitive to the non-uniform input distribution, and it scores examples near the lower portion of the decision boundary highest.

Figure 3: Data selection of Laplace kernel active learner. (a) Unlabeled examples are only available in magenta shaded regions. (b) Max-Min selection map using RKHS norm. (c) Max-Min selection map using data-based norm defined in Equation (5).

The distinction between the max-min selection criterion using the RKHS vs. data-based norm is also apparent in an experiment in which a curved decision boundary in two dimensions is actively learned using a Laplace kernel machine, as depicted in Figure 4 below. Figure 4 shows the max-min RKHS norm criterion at progressive stages of the learning process (from left to right), along with the criterion based on the data-based norm defined in Equation (5). Both dramatically outperform a passive (random sampling) scheme and both demonstrate how active learning automatically focuses sampling near the decision boundary between the oppositely labeled data (yellow vs. blue). However, the data-based norm does more exploration away from the decision boundary. As a result, the data-based norm requires slightly more labels to perfectly predict all unlabeled examples, but has a more graceful error decay, as shown on the right of the figure.

Figure 4: Uniform distribution of samples, smooth boundary, Laplace kernel. On left, sampling behavior of the RKHS norm and data-based norm criteria at progressive stages (left to right). On right, error probabilities as a function of the number of labeled examples.

Before moving on to active neural network learners, we demonstrate how the data-based norm also tends to automatically select representative examples from clusters when such structure exists in the unlabeled dataset. Figure 5 compares the behavior of selection based on the RKHS norm and on the data-based norm when the data are clustered and each cluster is homogeneously labeled. We see that the data-based norm quickly identifies the clusters and labels a representative from each, leading to faster error decay as shown on the right.

Figure 5: Clustered distribution of samples, Laplace kernel. On left, sampling behavior of the RKHS norm and data-based norm criteria at progressive stages (left to right). On right, error probabilities as a function of the number of labeled examples.

4 Interpolating Neural Network Active Learners

Here we briefly examine the extension of the max-min criterion and its variants to neural network learners. Neural network complexity or capacity can be controlled by limiting the magnitude of the network weights [6, 29, 40]. A number of weight norms and related measures have been recently proposed in the literature [30, 7, 21, 2, 28]. For example, ReLU networks with a single hidden layer and minimum $\ell_2$ norm weights coincide with linear spline interpolation [32]. With this in mind, we provide empirical evidence showing that defining the max-min criterion with the norm of the network weights yields a neural network active learning algorithm with properties analogous to those obtained in the RKHS setting.

Consider a single hidden layer network with ReLU activation units trained using the MSE loss. In Figure 6 we show the results of an experiment implemented in PyTorch in the same settings considered above for kernel machines in Figures 2 and 3. We trained an overparameterized network with a large number of hidden layer units to perfectly interpolate four training points with locations and binary labels as depicted in Figure 6(a). The color depicts the magnitude of the learned interpolating function: dark blue indicates the “decision boundary” and bright yellow the largest magnitudes. Figure 6(b) shows the score with the weight norm (i.e., the norm of the resulting network weights when a new sample is selected at that location). The brightest yellow indicates the highest score and the location of the next selection. Figure 6(c) shows the score with the data-based norm defined in Section 3.5. In both cases, the max occurs at roughly the same location, which is near the current decision boundary and closest to oppositely labeled points. The data-based norm also places higher scores on points further away from the labeled examples. Thus, the data selection behavior of the neural network is analogous to that of the kernel-based active learner (compare with Figure 2).
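The following rough PyTorch sketch illustrates the weight-norm variant. As a simplification of the training procedure used in the experiments, it approximates the minimum weight-norm interpolant by training a one-hidden-layer ReLU network from a fixed initialization to near-zero MSE and recording the squared $\ell_2$ norm of its weights; the architecture size, optimizer settings, and candidate locations are our own choices.

```python
import torch

def weight_norm_after_fit(X, y, hidden=128, steps=2000, lr=0.05, seed=0):
    """Train a one-hidden-layer ReLU net to (near-)zero MSE from a fixed initialization
    and return the squared l2 norm of its weights (a rough proxy for the minimum-norm fit)."""
    torch.manual_seed(seed)                          # same initialization for every candidate
    net = torch.nn.Sequential(torch.nn.Linear(2, hidden),
                              torch.nn.ReLU(),
                              torch.nn.Linear(hidden, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((net(X).squeeze(-1) - y) ** 2)
        loss.backward()
        opt.step()
    return sum((p ** 2).sum() for p in net.parameters()).item()

XL = torch.tensor([[0.2, 0.2], [0.2, 0.8], [0.8, 0.2], [0.8, 0.8]])
yL = torch.tensor([1.0, 1.0, -1.0, -1.0])            # boundary is roughly the line x1 = 0.5
candidates = torch.tensor([[0.5, 0.5], [0.5, 0.1], [0.1, 0.5], [0.9, 0.5]])

scores = []
for x in candidates:
    norms = []
    for y_guess in (-1.0, 1.0):                      # hypothesize each label for the candidate
        Xa = torch.cat([XL, x[None]])
        ya = torch.cat([yL, torch.tensor([y_guess])])
        norms.append(weight_norm_after_fit(Xa, ya))
    scores.append(min(norms))                        # min over labels ...
best = candidates[int(torch.tensor(scores).argmax())]    # ... max over candidates
print(best)   # the max-min heuristic favors candidates near the boundary x1 = 0.5
```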

Next we present a modified scenario in which the examples are not uniformly distributed over the input space, but instead concentrated only in certain regions indicated by the magenta highlights in Figure 7(a). In this setting, the example selection criteria differ more significantly for the two norms. The weight norm selection criterion remains unchanged, but is applied only to regions where there are examples. Areas without examples to select are indicated by dark blue in Figure 7(b)-(c). The data-based norm is sensitive to the non-uniform input distribution, and it scores examples near the lower portion of the decision boundary highest. Again, this is quite similar to the behavior of the kernel active learner (compare with Figure 3).

Figure 6: Data selection of the neural network active learner. (a) Magnitude of the output map of a single hidden layer ReLU network trained to interpolate four data points as indicated (dark blue indicates the learned decision boundary). (b) Max-min selection of the next point to label using the network weight norm. (c) Max-min selection of the next point to label using the data-based norm. Both select the point on the decision boundary that is closest to oppositely labeled examples.

Figure 7: Data selection of neural network active learner. (a) Unlabeled examples are only available in magenta shaded regions. (b) Max-Min selection map using network weight norm. (c) Max-Min selection map using data-based norm.

5 MNIST Experiment

Here we illustrate the performance of the proposed active learning method on the MNIST dataset. The goal of the classifier is to detect whether an image depicts a digit greater than or equal to a fixed threshold. We used a Laplace kernel on the vectorized images, so the dimensionality of each sample is $28 \times 28 = 784$. For comparison, we implemented a passive learning algorithm which selects points at random, the active learning algorithm using the RKHS norm score, and the active learning algorithm using the data-based function norm described in Section 3.5. Figure 8 depicts the decay of the probability of error for each of these algorithms. The behavior is similar to the theory and experiments above: both the RKHS norm and the data-based norm dramatically outperform the passive learning baseline, and the data-based norm results in a somewhat more graceful error decay. Note that the purpose of this experiment is to compare the performance of various selection algorithms. Hence, we train our classification algorithm on a small subset of the MNIST dataset, and that is the reason for the nonnegligible error level on the test set.

Figure 8: Probability of error for learning a classification task on the MNIST dataset, under three selection criteria for labeling the samples: random selection, active selection based on the RKHS norm score, and active selection based on the data-based norm score. The left plot depicts the probability of error on the training set and the right plot the probability of error on the test set.
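A compact sketch of this kind of experiment is given below. Everything not stated in the text above is an assumption on our part: the threshold digit (5), the pool and test sizes, the bandwidth, the seed labels, the query budget, and the use of torchvision and SciPy to load the data and compute $\ell_1$ distances.

```python
import numpy as np
from scipy.spatial.distance import cdist
from torchvision.datasets import MNIST

# Assumed setup: threshold digit 5, pool of 300 images, 1000 test images,
# Laplace-kernel bandwidth h = 50, 100 label queries.
data = MNIST(root="./data", train=True, download=True)
X = data.data.numpy().reshape(-1, 784).astype(np.float64) / 255.0
y = np.where(data.targets.numpy() >= 5, 1.0, -1.0)
pool_idx, test_idx = np.arange(300), np.arange(300, 1300)

h = 50.0
K_pool = np.exp(-cdist(X[pool_idx], X[pool_idx], "cityblock") / h)
K_test = np.exp(-cdist(X[test_idx], X[pool_idx], "cityblock") / h)

def sq_norm(idx, labels):
    """||f_hat||^2 = y^T K^{-1} y for the min-norm interpolant of the given subset."""
    K = K_pool[np.ix_(idx, idx)] + 1e-8 * np.eye(len(idx))
    return float(labels @ np.linalg.solve(K, labels))

labeled = [0, 1]                                  # arbitrary seed labels
for _ in range(100):                              # query budget
    unlabeled = [i for i in range(len(pool_idx)) if i not in labeled]
    scores = [min(sq_norm(labeled + [i], np.append(y[pool_idx[labeled]], yy))
                  for yy in (-1.0, 1.0)) for i in unlabeled]
    labeled.append(unlabeled[int(np.argmax(scores))])

# Evaluate the interpolant trained on the queried labels on held-out images.
K_L = K_pool[np.ix_(labeled, labeled)] + 1e-8 * np.eye(len(labeled))
alpha = np.linalg.solve(K_L, y[pool_idx[labeled]])
pred = np.sign(K_test[:, labeled] @ alpha)
print("test error:", np.mean(pred != y[test_idx]))
```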

References

Appendix A Max Min criteria in one dimension with Laplace Kernel

Proof of Proposition 3.

Let us look at the Gram matrix of three neighboring points $t_1 < t_2 < t_3$ according to the Laplace kernel, where we define $r_1 = e^{-(t_2 - t_1)/h}$ and $r_2 = e^{-(t_3 - t_2)/h}$. It can be shown that, with the above structure, the inverse of the Gram matrix is tridiagonal. In general, if we look at the Gram matrix of sorted points $t_1 < \dots < t_n$ and define $r_i = e^{-(t_{i+1} - t_i)/h}$ for $i = 1, \dots, n-1$, we observe the same structure: the inverse is tridiagonal, and the remaining elements of the matrix are zero, i.e., $(\mathbf K^{-1})_{ij} = 0$ for $|i - j| > 1$.

Looking at three neighboring labeled points, we can compute the score of a candidate point placed between them. Two cases arise, depending on whether the neighboring labels agree or disagree.

  1. Consider two neighboring points $t_1 < t_2$ with opposite labels. Let $x$ be such that $t_1 < x < t_2$, and define $d = t_2 - t_1$ and $a = x - t_1$. Computing the score of $x$ from the structure of the Gram matrix above and maximizing over $a$ shows that the maximum over the interval is attained at the midpoint $x = (t_1 + t_2)/2$. This gives the following statement: for neighboring labeled points with opposite labels separated by a gap $d$, the maximum score over the interval between them is attained at the midpoint and is given by an explicit function of $d$,

    (6)

    and this function is decreasing in $d$.

  2. Consider two neighboring points $t_1 < t_2$ with the same label. Let $x$ be such that $t_1 < x < t_2$, and again define $d = t_2 - t_1$ and $a = x - t_1$. An analogous computation gives the following statement: for neighboring labeled points with the same label separated by a gap $d$, the maximum score over the interval between them is attained at the midpoint and is given by an explicit function of $d$,

    (7)

    and this function is increasing in $d$.

  3. Equations (6) and (7) together give the second part of the proposition.

Appendix B Max Min criteria Binary Search Corollary

Proof of Binary Search Corollary.
Figure 9: Uniform distribution of samples in the unit interval, multiple thresholds between labels, and active learning using the Laplace kernel. Probability of error of the interpolated function shown on right.

According to the last property in Proposition 3, the first sample selected will be at the midpoint of the unit interval, and the second will be at the midpoint of one of the two resulting subintervals. If the labels agree, then the next sample will be at the midpoint of the largest remaining subinterval. Sampling at the midpoints of the largest subinterval between consecutive pairs of labeled points continues until a point with the opposite label is found. Once a point with the opposite label has been found, Proposition 3 implies that subsequent samples repeatedly bisect the subinterval between the closest pair of oppositely labeled points. This bisection process terminates with two neighboring points having opposite labels, thus identifying one boundary/threshold of the target function. The total number of labels collected by this bisection process is on the order of $\log n$. After this, the algorithm alternates between the two situations above. It performs bisection on the subinterval between the closest pair of oppositely labeled points, if such an interval exists. If not, it samples at the midpoint of the largest subinterval between consecutive pairs of labeled points. The stated label complexity bound follows from the assumptions that there are $k$ thresholds and the length of each piece (subinterval between thresholds) is on the order of $1/k$. ∎

The bisection process is illustrated experimentally in Figure 9 below, where the score uses the RKHS norm. For comparison, we also show the behavior of the algorithm using the data-based norm discussed in Section 3.5. Data selection using either score drives the error to zero faster than random sampling. We clearly see the bisection behavior of the RKHS norm criterion, locating one decision boundary/threshold and then another, as the proof of the corollary above suggests. Also, we see that the data-based norm does more exploration away from the decision boundaries. As a result, the data-based norm has a faster and more graceful error decay, as shown on the right of the figure. Similar behavior is observed in the multidimensional setting shown in Figure 4.

Appendix C One Dimensional Functions with Radial Basis Kernels

Proof of Proposition 4 on maximum score with radial basis kernels.

For ease of notation, for fixed samples and bandwidth $h$, we work with normalized distances, i.e., distances between samples divided by $h$. For a candidate point $x$, we let $a$ denote the normalized distance between $x$ and its left labeled neighbor, as in Figure 10. The proposition is based on the assumptions that the pairwise distances between samples are bounded below by a constant and that the bandwidth is sufficiently small relative to that constant.

We want to show that the maximum score occurs at the zero crossing of the current interpolator $\hat f$. Since we normalized all pairwise distances by $h$, we will instead show that there exists a constant such that if the bandwidth is smaller than this constant times the minimum pairwise distance, then the maximum score is achieved at the zero crossing.

Figure 10: The samples $t_1$ and $t_2$ are the labeled samples, labeled $+1$ and $-1$ respectively; the candidate point $x$ lies between them at normalized distance $a$ from $t_1$.

Note that the score depends on the location of the point $x$, characterized by the normalized distance $a$ between $x$ and $t_1$. We want to prove that, for small enough bandwidth, the score is increasing in $a$ up to the zero crossing. We can use a similar argument to show that it is decreasing in $a$ beyond the zero crossing. This implies that $\arg\max_{x \in (t_1, t_2)} s(x) = x^*$, with $x^*$ defined to be the point at which $\hat f(x^*) = 0$.

To do so, we will show that the derivative of the score with respect to $a$ is positive in this interval. In the proof of Proposition 2, we showed the following form for the score function:

$s(x)^2 = \dfrac{\big(\hat y(x) - \hat f(x)\big)^2}{K(x,x) - k(x)^T \mathbf K^{-1} k(x)},$

where $\mathbf K$ is the kernel matrix for the labeled points and $k(x)$ is the vector with entries $K(x, t_i)$. The term $\hat f(x)$ is the minimum norm interpolating function based on the labeled points and their labels, evaluated at $x$. Equation (3) shows that $\hat f(x) = k(x)^T \mathbf K^{-1} y$.

First, we look at $\mathbf K$ and its inverse in the setup explained above. Using the definition of the radial basis kernel in Equation (4), the kernel matrix of the labeled points takes the form

(8)

and its determinant and inverse can be written explicitly in terms of the normalized pairwise distances. Since the normalized pairwise distances are bounded below, there exists a constant such that, if the bandwidth is smaller than this constant times the minimum pairwise distance, the relevant terms in these expressions are bounded as required. The vector $k(x)$ can be written in the same way.

Next, we compute