1 Overparameterized Models and Active Learning
The success of deep learning systems has sparked interest in understanding how and why overparameterized models that interpolate the training data often display surprisingly good generalization properties [27, 11, 38, 12, 8, 1, 10, 23]. Notably, it is now understood that minimum norm solutions have the potential to generalize well in the overparameterized, interpolating regime [11, 9, 23, 25]. These findings suggest that the theory of active learning also requires careful reconsideration. This paper introduces a new framework for active learning based on the notion of minimum norm interpolators, and we propose a new active learning heuristic for kernel machines inspired by the recent breakthroughs mentioned above.
Active learning algorithms adaptively select examples for labeling based on two general strategies [15]: 1) selecting examples that rule out as many (incompatible) classifiers as possible at each step; and 2) discovering cluster structure in unlabeled data and labeling representative examples from each cluster. We show that our new active learning approach based on a minimum norm heuristic automatically exploits both strategies; Figure 1 depicts where samples are selected using the new criterion based on minimum norm interpolators.
1.1 Review of Active Learning
Active learning involves selectively labeling examples based on some measure of informativeness and/or label uncertainty; see [34, 35] for a broad overview of the field. Here we discuss the key ideas exploited in active learning and put our contribution in context. We emphasize that this is not a comprehensive literature review.
Perhaps the simplest example of active learning is binary search or bisection to identify a single decision threshold in a one-dimensional classification setting. More general “version space” approaches [13, 3, 31, 22] extend this idea to arbitrary classes. Alternatively, the one-dimensional binary search problem naturally generalizes to the problem of learning a multidimensional linear classifier [4, 17, 5] and its kernel-based extensions [37]. These methods are “margin-based,” selecting examples for labeling in the vicinity of the learned decision boundary. However, in the overparameterized and interpolating regime, potentially every labeled example is near the learned decision boundary, which calls for a more refined sample selection criterion. Roughly speaking, our new selection criteria favor examples that are close to the current decision boundary and closest to oppositely labeled examples. This yields optimal bisection sampling in simple low-dimensional settings.
Exploiting the geometric structure of data (e.g., clustering) is also a powerful heuristic for active learning [15]. Graph-based active learning inspired by semi-supervised learning [41, 14] and hierarchical clustering methods [16] are two common geometric approaches to active learning. Bayesian methods based on notions of information gain can also exploit distributional structure [18, 20, 24, 26], but require appropriate specifications of prior distributions. The new active learning criteria proposed in this paper are capable of automatically exploiting geometric structure in data, such as clusters, without specifying priors, forming a graph, or clustering.
2 A New Active Learning Criterion
At each iteration of the active learning algorithm, a new unlabeled point is selected for labeling based on the currently labeled set of samples. The criterion we propose for picking the samples to be labeled is based on a ‘max-min’ operator. We describe the criterion in its most general form along with the intuition behind it. In the remainder of the paper, we present theoretical results on the properties of variations of this criterion in various setups, along with additional descriptive numerical evaluations and simulations.
Let the examples labeled so far, together with their labels, form the labeled set, and let the remaining samples form the unlabeled set. We assume binary valued labels, but our approach can be generalized to the multiclass setting. In other words, we have a partially labeled training set. In the interpolating regime, the goal is to correctly label all the points in the unlabeled set so that the training error is zero. Passive learning generally requires labeling every point in it. Active learning sequentially selects points from the unlabeled set for labeling, with the aim of learning a correct classifier without necessarily labeling all of them. Our setting can be viewed as an instance of pool-based active learning.
Let the hypothesis space be a rich collection of functions capable of interpolating the training data. For example, it could be a nonparametric infinite-dimensional Reproducing Kernel Hilbert Space (RKHS) or an overparameterized neural network representation. Let the current interpolator be the minimum norm function in this space that interpolates the labeled examples; it depends on the labeled set and its labels, but we suppress that dependency in our notation for brevity. For each unlabeled point and candidate label, also define the minimum norm function interpolating the labeled examples together with that point and label. For each unlabeled point, we select a label according to one of the following criteria:
(1) 
The rationale for this definition is that, operating in the interpolating regime, we select the label that yields the minimum norm interpolant (i.e., the “smoother” of the two interpolants). Intuitively, one could argue that our best guess of the label of an unlabeled point is the sign of the current interpolator at that point. We will show that if the hypothesis space is constrained to be an element of an RKHS, then the criterion selects exactly this label (assuming the Hilbert norm is used in Eq. (1)). In general, the definition is more flexible, since one can choose an arbitrary norm; e.g., a variety of norms could be considered if the function is represented by a neural network. The norm of the minimizer or the norm of its difference from the current interpolator are viewed as score functions for an unlabeled point:
(2) 
Then, the next point to be labeled by our active learning algorithm is the unlabeled point maximizing the score. We show in Section 3.2 that if the hypothesis space is an RKHS and the Hilbert norm is used, the two selection criteria above are equivalent. These selection criteria are motivated by the aim of interpolating all the data.
Interpolating functions with larger scores are less smooth (in terms of norm) or change more from the previous interpolator (under the first and second score, respectively). The intuition is that attacking the most challenging points in the input space first may eliminate the need to label other “easier” examples later. The distinction between the two score definitions in Eq. (2) is as follows. The first prioritizes labeling the examples that result in minimum norm interpolating functions with the largest norm. Since the norm of a function can be associated with its smoothness, roughly speaking, this criterion picks the points that give the least smooth interpolating functions. The second prioritizes labeling the samples that result in the largest change in the minimum norm interpolating function. In either case, we expect the selected points to be the most “informative” under some measure of informativeness, and in Section 3 we present theoretical results supporting this intuition.
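As a concrete illustration, the max-min selection rule can be sketched in a few lines for a kernel machine. This is a minimal sketch under our own assumptions (a one-dimensional Laplace kernel, labels in {−1, +1}, and the squared Hilbert norm of the minimum norm interpolant as the score), not the paper's implementation:

```python
import numpy as np

def laplace_kernel(a, b, h=1.0):
    # k(x, z) = exp(-|x - z| / h); one convenient choice of kernel
    return np.exp(-np.abs(a[:, None] - b[None, :]) / h)

def interp_norm_sq(X, y, h=1.0):
    # Squared RKHS norm of the minimum norm interpolant: y^T K^{-1} y
    K = laplace_kernel(X, X, h)
    return float(y @ np.linalg.solve(K, y))

def max_min_select(X_lab, y_lab, X_pool, h=1.0):
    """Return the index into X_pool chosen by the max-min criterion.

    For each candidate x we try both labels and keep the smaller
    interpolant norm (the 'min'); we then query the candidate whose
    resulting norm is largest (the 'max')."""
    scores = []
    for x in X_pool:
        norms = [interp_norm_sq(np.append(X_lab, x),
                                np.append(y_lab, s), h)
                 for s in (-1.0, 1.0)]
        scores.append(min(norms))      # score of x under its best label
    return int(np.argmax(scores))      # query the highest-scoring point
```

On a pool lying between two oppositely labeled points, this rule selects the midpoint, matching the bisection behavior analyzed in Section 3.3.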
3 Interpolating Active Learners in an RKHS
In this section, we focus on minimum norm interpolating functions in a Reproducing Kernel Hilbert Space (RKHS). We present theoretical properties for general RKHS settings, detailed analytical results in the one-dimensional setting, and numerical studies in multiple dimensions. Broadly speaking, we establish the following properties: the proposed score functions tend to select examples near the decision boundary of the current interpolator; the score is largest for unlabeled examples near the decision boundary and close to oppositely labeled examples, in effect searching for the boundary in the most likely region of the input space; in one dimension the interpolating active learner coincides with an optimal binary search procedure; and using data-based function norms, rather than the RKHS norm, the interpolating active learner executes a tradeoff between sampling near the current decision boundary and sampling in regions far away from currently labeled examples, thus exploiting cluster structure in the data.
3.1 Kernel Methods
A Hilbert space is associated with an inner product, which induces a norm. A symmetric bivariate function is positive semidefinite if, for any finite set of elements, the matrix K of pairwise evaluations is positive semidefinite (PSD). Such functions are called PSD kernel functions. A PSD kernel constructs a Hilbert space of functions on the input domain in which the kernel satisfies the reproducing property.
For a labeled set of points, we define the interpolating function with minimum norm (according to the norm associated with the RKHS). Define K to be the kernel matrix of the labeled points. Then, one can show that the minimum norm interpolant can be decomposed as a linear combination of kernel functions centered at the labeled points, with the coefficient vector obtained by applying the inverse kernel matrix to the label vector (3). Using the reproducing property, the squared norm of the interpolant is then the quadratic form of the inverse kernel matrix applied to the label vector.
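These formulas can be sketched directly (an illustration under the assumption of an invertible kernel matrix, not the paper's code): the coefficient vector solves K a = y, the interpolant is the corresponding kernel expansion, and its squared Hilbert norm is yᵀK⁻¹y.

```python
import numpy as np

def fit_min_norm(X, y, kernel):
    # Solve K a = y; the minimum norm interpolant is f(x) = sum_i a_i k(x, x_i)
    K = kernel(X[:, None], X[None, :])
    alpha = np.linalg.solve(K, y)
    def f(x):
        return kernel(np.asarray(x)[:, None], X[None, :]) @ alpha
    norm_sq = float(y @ alpha)         # ||f||_H^2 = y^T K^{-1} y
    return f, norm_sq

# Laplace kernel with bandwidth 1, for illustration only
laplace = lambda a, b: np.exp(-np.abs(a - b))
```

By construction the returned function interpolates: f evaluated at the training points reproduces the labels exactly.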
3.2 Properties of General Kernels for Active Learning
We first show that, using an RKHS for interpolation and the Hilbert norm in (1), the selected label of any unlabeled point is the sign of the current minimum norm interpolant at that point. We then show that the two score functions in (2) are equivalent in the RKHS setting.
Proposition 1.
For an unlabeled point and candidate labels, define the interpolants according to the definitions in Section 2, using the Hilbert norm. Then the norm-minimizing label equals the sign of the current minimum norm interpolant at that point.
Proof.
Let K be the kernel matrix for the labeled points, and consider the augmented kernel matrix for the labeled points together with the candidate (hence the latter can be constructed from K, the candidate's kernel vector, and its self-similarity). Then, for each candidate label,
where Schur’s complement formula gives (a) and the Woodbury identity with a bit of algebra gives (b); (c) uses (3) for the minimum norm interpolating function. Hence, the norm is minimized by the label matching the sign of the current interpolant. ∎
Proposition 2.
If the norm used in the definition of the two scores is the Hilbert norm, then the scores induce the same ordering over unlabeled points. Hence, they can be used interchangeably.
We omit the proof for brevity. It is based on simple algebra showing that, for any labeled set and candidate point, the squared Hilbert norm of the augmented interpolant equals the squared norm of the current interpolant plus the squared norm of their difference; since the current norm does not depend on the candidate, the two scores order the candidates identically.
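Proposition 1 is easy to check numerically. The sketch below (our own illustration, not the paper's code, with an assumed bandwidth) verifies on a small one-dimensional example that the norm-minimizing label of each candidate matches the sign of the current interpolant:

```python
import numpy as np

def laplace(a, b, h=0.5):
    return np.exp(-np.abs(a[:, None] - b[None, :]) / h)

def norm_sq(X, y, h=0.5):
    # Squared Hilbert norm of the minimum norm interpolant
    return float(y @ np.linalg.solve(laplace(X, X, h), y))

def check_sign_property(X_lab, y_lab, X_pool, h=0.5):
    """Empirical check of Proposition 1: the norm-minimizing label of a
    candidate x matches the sign of the current interpolant at x."""
    alpha = np.linalg.solve(laplace(X_lab, X_lab, h), y_lab)
    for x in X_pool:
        fhat = laplace(np.array([x]), X_lab, h) @ alpha   # current prediction
        best = min((-1.0, 1.0),
                   key=lambda s: norm_sq(np.append(X_lab, x),
                                         np.append(y_lab, s), h))
        if np.sign(fhat[0]) != best:
            return False
    return True
```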
3.3 Laplace Kernel in One Dimension
To develop some intuition, we consider active learning in one dimension. The sort of target function we have in mind is a multiple threshold classifier. Optimal active learning in this setting coincides with binary search. We now show that the proposed score function with the Hilbert norm and the Laplace kernel results in optimal active learning in one dimension (proof in Appendix A).
Proposition 3.
[Max-min criterion in one dimension] Let the kernel be the Laplace kernel in one dimension. Then the following statements hold for any value of the bandwidth:

Let and be two neighboring labeled points in . Then for all .

Let and be two pairs of neighboring labeled points such that , then

if and . Then .

if and . Then .

if and . Then .

if and . Then .

The key conclusion drawn from these properties is that the midpoints between the closest oppositely labeled neighboring examples have the highest score. If there are no oppositely labeled neighbors, then the score is largest at the midpoint of the largest gap between consecutive samples. Thus, the score results in a binary search for the thresholds defining the classifier. Using the proposition above, it is easy to show the following result, proved in Appendix B.
Corollary 1.
Consider points uniformly distributed in the unit interval, labeled according to a piecewise constant function whose pieces have length roughly on the order of the inverse of the number of pieces. Then, running the proposed active learning algorithm with the Laplace kernel and any bandwidth, after the prescribed number of queries the sign of the resulting interpolant correctly labels all examples (i.e., the training error is zero).
3.4 General Radial Basis Kernels in One Dimension
For a given exponent, we define the corresponding norm of a vector in the usual way. We define the radial basis kernel with this shape parameter and a bandwidth to be
(4) 
In the next proposition, we consider the special case of radial basis kernels applied to one-dimensional functions with only three initial points. We show that maximizing the score is equivalent to picking the zero-crossing point of the current interpolator.
Proposition 4 (One Dimensional Functions with Radial Basis Kernels).
Assume that for any pair of samples we have . Assume for a constant value of . Let , and . For such that , we have where is the point satisfying .
The proof is rather tedious and appears in Appendix C, but the idea is to show that, with a small enough bandwidth, the score is increasing before the zero crossing of the current interpolant and decreasing after it on the relevant interval. This shows that the maximum occurs at the zero crossing, which we showed is equivalent to the stated condition.
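The zero-crossing behavior is easy to observe numerically. The sketch below (our own illustration with an assumed Laplace kernel and bandwidth, not the paper's code) computes the max-min score on a grid between the two closest oppositely labeled points and compares the maximizer with the interpolant's zero crossing:

```python
import numpy as np

# Three labeled points with one sign change; with a small bandwidth the
# interpolant's zero crossing lies near the midpoint of the (1, 2) pair.
X_lab = np.array([0.0, 1.0, 2.0])
y_lab = np.array([-1.0, -1.0, 1.0])
h = 0.2                                    # small bandwidth (assumed)

def laplace(a, b):
    return np.exp(-np.abs(a[:, None] - b[None, :]) / h)

def score(x):
    # max-min score: norm of the min-norm interpolant under the better label
    best = np.inf
    for s in (-1.0, 1.0):
        X = np.append(X_lab, x); y = np.append(y_lab, s)
        best = min(best, float(y @ np.linalg.solve(laplace(X, X), y)))
    return best

grid = np.linspace(1.05, 1.95, 91)
x_star = grid[np.argmax([score(x) for x in grid])]

# Zero crossing of the current interpolant on the same grid
alpha = np.linalg.solve(laplace(X_lab, X_lab), y_lab)
x_zero = grid[np.argmin(np.abs(laplace(grid, X_lab) @ alpha))]
```

With this configuration the score maximizer and the zero crossing both land near the midpoint of the oppositely labeled pair.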
3.5 DataBased Norm
As mentioned above, one can consider norms other than the RKHS norm. One natural option is the norm of the output map over the data. Consider
(5) 
Intuitively, this score measures the expected change in the squared norm over all unlabeled examples if the candidate is selected as the next point. This norm is sensitive to the particular distribution of the data, which is important if the data are clustered. This behavior is demonstrated in the multidimensional setting discussed next.
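One plausible reading of this score, sketched below under our own assumptions (a Laplace kernel, and the candidate labeled by the sign of the current interpolant per Proposition 1; not the paper's code), is the average squared change of the interpolant over the unlabeled pool:

```python
import numpy as np

def laplace(a, b, h=1.0):
    return np.exp(-np.abs(a[:, None] - b[None, :]) / h)

def predict(X_lab, y_lab, X_eval, h=1.0):
    # Evaluate the minimum norm interpolant of (X_lab, y_lab) at X_eval
    alpha = np.linalg.solve(laplace(X_lab, X_lab, h), y_lab)
    return laplace(X_eval, X_lab, h) @ alpha

def data_based_score(X_lab, y_lab, x, X_pool, h=1.0):
    """Average squared change of the interpolant over the unlabeled pool
    when x is added with its norm-minimizing label."""
    f_old = predict(X_lab, y_lab, X_pool, h)
    # Norm-minimizing label = sign of the current interpolant at x (Prop. 1)
    y_x = 1.0 if predict(X_lab, y_lab, np.array([x]), h)[0] >= 0 else -1.0
    f_new = predict(np.append(X_lab, x), np.append(y_lab, y_x), X_pool, h)
    return float(np.mean((f_new - f_old) ** 2))
```

A point far from the labeled examples perturbs the interpolant over much of the pool and scores high, while a point next to a labeled example barely changes it.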
3.6 Multidimensional Setting
The properties and behavior found in the one-dimensional setting carry over to higher dimensions. In particular, the max-min norm criterion tends to select unlabeled examples near the decision boundary and close to oppositely labeled examples. This is illustrated in Figure 2 below. The input points (training examples) are uniformly distributed in the square. We trained a Laplace kernel machine to perfectly interpolate four training points with locations and binary labels as depicted in Figure 2(a). The color depicts the magnitude of the learned interpolating function: dark blue indicates the “decision boundary” and bright yellow the largest magnitudes. Figure 2(b) shows the score for selecting a point at each location based on the RKHS norm criterion, and Figure 2(c) shows the score based on the data-based norm criterion discussed above. Both criteria select a point on the decision boundary, but the RKHS norm favors points that are closest to oppositely labeled examples, whereas the data-based norm favors points on the boundary further from labeled examples.
Next we present a modified scenario in which the examples are not uniformly distributed over the input space, but instead concentrated only in certain regions, indicated by the magenta highlights in Figure 3(a). In this setting, the example selection criteria differ more significantly for the two norms. The RKHS norm selection criterion remains unchanged, but is applied only to regions where there are examples; areas without examples to select are indicated by dark blue in Figure 3(b)-(c). The data-based norm is sensitive to the nonuniform input distribution, and it scores examples near the lower portion of the decision boundary highest.
The distinction between the max-min selection criterion using the RKHS vs. the data-based norm is also apparent in an experiment in which a curved decision boundary in two dimensions is actively learned using a Laplace kernel machine, as depicted in Figure 4 below, which shows the max-min RKHS norm criterion at progressive stages of the learning process (from left to right). The data-based norm used is defined in Equation (5). Both dramatically outperform a passive (random sampling) scheme, and both demonstrate how active learning automatically focuses sampling near the decision boundary between the oppositely labeled data (yellow vs. blue). However, the data-based norm does more exploration away from the decision boundary. As a result, it requires slightly more labels to perfectly predict all unlabeled examples, but has a more graceful error decay, as shown on the right of the figure.
Before moving on to active neural network learners, we demonstrate how the data-based norm also tends to automatically select representative examples from clusters when such structure exists in the unlabeled dataset. Figure 5 compares the behavior of selection based on the RKHS norm and the data-based norm when the data are clustered and each cluster is homogeneously labeled. We see that the data-based norm quickly identifies the clusters and labels a representative from each, leading to faster error decay as shown on the right.
4 Interpolating Neural Network Active Learners
Here we briefly examine the extension of the max-min criterion and its variants to neural network learners. Neural network complexity or capacity can be controlled by limiting the magnitude of the network weights [6, 29, 40]. A number of weight norms and related measures have been recently proposed in the literature [30, 7, 21, 2, 28]. For example, ReLU networks with a single hidden layer and minimum norm weights coincide with linear spline interpolation [32]. With this in mind, we provide empirical evidence showing that defining the max-min criterion with the norm of the network weights yields a neural network active learning algorithm with properties analogous to those obtained in the RKHS setting.
Consider a single hidden layer network with ReLU activation units trained using the MSE loss. In Figure 6 we show the results of an experiment, implemented in PyTorch, in the same settings considered above for kernel machines in Figures 2 and 3. We trained an overparameterized network to perfectly interpolate four training points with locations and binary labels as depicted in Figure 6(a). The color depicts the magnitude of the learned interpolating function: dark blue indicates the “decision boundary” and bright yellow the largest magnitudes. Figure 6(b) shows the score with the weight norm (i.e., the norm of the resulting network weights when a new sample is selected at that location); the brightest yellow indicates the highest score and the location of the next selection. Figure 6(c) shows the score with the data-based norm defined in Section 3.5. In both cases, the max occurs at roughly the same location, which is near the current decision boundary and closest to oppositely labeled points. The data-based norm also places higher scores on points further away from the labeled examples. Thus, the data selection behavior of the neural network is analogous to that of the kernel-based active learner (compare with Figure 2).
Next we present a modified scenario in which the examples are not uniformly distributed over the input space, but instead concentrated only in certain regions, indicated by the magenta highlights in Figure 7(a). In this setting, the example selection criteria differ more significantly for the two norms. The weight norm selection criterion remains unchanged, but is applied only to regions where there are examples; areas without examples to select are indicated by dark blue in Figure 7(b)-(c). The data-based norm is sensitive to the nonuniform input distribution, and it scores examples near the lower portion of the decision boundary highest. Again, this is quite similar to the behavior of the kernel active learner (compare with Figure 3).
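To make the weight-norm criterion concrete, here is a minimal self-contained sketch in plain NumPy (our own illustration, not the PyTorch code used for the figures; the width, learning rate, and step count are arbitrary choices): train a small one-hidden-layer ReLU network with each candidate label and score the candidate by the smaller resulting weight norm.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_relu_net(X, y, width=64, steps=2000, lr=0.02):
    """Fit a one-hidden-layer ReLU network to (X, y) with full-batch
    gradient descent on the MSE loss; returns the weights and final loss."""
    n, d = X.shape
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), (width, d)); b1 = np.zeros(width)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(width), width);  b2 = 0.0
    for _ in range(steps):
        H = np.maximum(X @ W1.T + b1, 0.0)        # hidden activations
        g = 2.0 * (H @ w2 + b2 - y) / n           # dMSE / dprediction
        gH = np.outer(g, w2) * (H > 0)            # backprop through ReLU
        W1 -= lr * (gH.T @ X); b1 -= lr * gH.sum(axis=0)
        w2 -= lr * (H.T @ g);  b2 -= lr * g.sum()
    H = np.maximum(X @ W1.T + b1, 0.0)
    loss = float(np.mean((H @ w2 + b2 - y) ** 2))
    return (W1, b1, w2, b2), loss

def weight_norm_sq(params):
    # Squared Euclidean norm of all multiplicative weights
    W1, _, w2, _ = params
    return float((W1 ** 2).sum() + (w2 ** 2).sum())

def maxmin_weight_score(X_lab, y_lab, x):
    # Try both labels for candidate x; keep the smaller resulting weight norm
    norms = []
    for s in (-1.0, 1.0):
        params, _ = train_relu_net(np.vstack([X_lab, x]), np.append(y_lab, s))
        norms.append(weight_norm_sq(params))
    return min(norms)
```

The structure mirrors the kernel version: the inner "min" chooses the label whose interpolating network has the smaller weight norm, and the outer loop (not shown) would query the candidate maximizing this score.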
5 MNIST Experiment
Here we illustrate the performance of the proposed active learning method on the MNIST dataset. The goal of the classifier is to detect whether an image belongs to the set of digits greater than or equal to a given value. We used a Laplace kernel on the vectorized version of a subset of the images. For comparison, we implemented a passive learning algorithm that selects points randomly, the active learning algorithm using the RKHS norm score, and the active learning algorithm using the data-based function norm score described in Section 3.5. Figure 8 depicts the decay of the probability of error for each of these algorithms. The behavior is similar to the theory and experiments above: both the RKHS norm and the data-based norm scores dramatically outperform the passive learning baseline, and the data-based norm results in somewhat more graceful error decay. Note that the purpose of this experiment is to compare the performance of various selection algorithms; since we train on a small subset of the MNIST dataset, the error level on the test set is non-negligible.
References
[1] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
[2] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263, 2018.
[3] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.

[4] Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In International Conference on Computational Learning Theory, pages 35–50. Springer, 2007.
[5] Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288–316, 2013.
 [6] Peter L Bartlett. For valid generalization the size of the weights is more important than the size of the network. In Advances in neural information processing systems, pages 134–140, 1997.
[7] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
[8] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance tradeoff. arXiv preprint arXiv:1812.11118, 2018.
 [9] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.
 [10] Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, pages 2300–2311, 2018.
 [11] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 540–548, 2018.
 [12] Mikhail Belkin, Alexander Rakhlin, and Alexandre B Tsybakov. Does data interpolation contradict statistical optimality? arXiv preprint arXiv:1806.09471, 2018.
 [13] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine learning, 15(2):201–221, 1994.
 [14] Gautam Dasarathy, Robert Nowak, and Xiaojin Zhu. S2: An efficient graph based active learning algorithm with application to nonparametric classification. In Conference on Learning Theory, pages 503–522, 2015.
 [15] Sanjoy Dasgupta. Two faces of active learning. Theoretical computer science, 412(19):1767–1781, 2011.
 [16] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th international conference on Machine learning, pages 208–215. ACM, 2008.

[17] Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. In International Conference on Computational Learning Theory, pages 249–263. Springer, 2005.
[18] Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
[19] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1183–1192. JMLR.org, 2017.
 [20] Daniel Golovin, Andreas Krause, and Debajyoti Ray. Nearoptimal bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems, pages 766–774, 2010.
 [21] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Sizeindependent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299, 2018.
[22] Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
 [23] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in highdimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.

[24] Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Active learning with Gaussian processes for object categorization. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
[25] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. arXiv preprint arXiv:1808.00387, 2018.
 [26] Michael Lindenbaum, Shaul Markovitch, and Dmitry Rusakov. Selective sampling for nearest neighbor classifiers. Machine learning, 54(2):125–152, 2004.
 [27] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern overparametrized learning. In International Conference on Machine Learning, pages 3331–3340, 2018.
[28] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.
 [29] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
[30] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
 [31] Robert D Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.
 [32] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? arXiv preprint arXiv:1902.05040, 2019.
[33] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
[34] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.

[35] Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
[36] Yanyao Shen, Hyokun Yun, Zachary C Lipton, Yakov Kronrod, and Animashree Anandkumar. Deep active learning for named entity recognition. arXiv preprint arXiv:1707.05928, 2017.
 [37] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
 [38] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for overparameterized models and an accelerated perceptron. arXiv preprint arXiv:1810.07288, 2018.
 [39] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Costeffective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2017.
 [40] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[41] Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, volume 3, 2003.
Appendix A Max-Min Criterion in One Dimension with Laplace Kernel
Proof of Proposition 3.
Let us look at the Gram matrix for three neighboring points according to the Laplace Kernel.
where we define and . It can be shown that with the above structure
In general, if we look into the Gram matrix of , and define for , we observe:
and the remaining elements of the matrix are zero.
Looking at three neighboring points labeled we can compute :

Consider two neighboring points labeled . Let be such that and . Define and . Then,
Hence,
Without loss of generality, assume or , then
Note that for all such that , we have and is constant. So
which is attained when or .
This gives the following statement: For neighboring labeled points such that , we have
(6) Note that the above function is decreasing in .

Consider two neighboring points labeled . Let such that and . Define and .
This gives the following statement: For neighboring labeled points such that , we have
(7) Note that the above function is increasing in .
∎
Appendix B Max Min criteria Binary Search Corollary
Proof of Binary Search Corollary.
According to the last property in Proposition 3, the first sample selected will be at the midpoint of the unit interval, and the second point will lie in one of the two halves. If the labels agree, then the next sample will be at the midpoint of the largest subinterval. Sampling at the midpoints of the largest subinterval between consecutive pairs of labeled points continues until a point with the opposite label is found. Once a point with the opposite label has been found, Proposition 3 implies that subsequent samples repeatedly bisect the subinterval between the closest pair of oppositely labeled points. This bisection process terminates with two neighboring points having opposite labels, thus identifying one boundary/threshold of the target classifier; the number of labels it collects is at most logarithmic in the number of points. After this, the algorithm alternates between the two situations above: it performs bisection on the subinterval between the closest pair of oppositely labeled points, if such an interval exists; if not, it samples at the midpoint of the largest subinterval between consecutive pairs of labeled points. The stated label complexity bound follows from the assumptions that there are finitely many thresholds and that the length of each piece (subinterval between thresholds) is on the order of the inverse of the number of pieces. ∎
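The proof's bisection dynamic can be reproduced in simulation. The sketch below (our own illustration, assuming a Laplace kernel with a small bandwidth and a single-threshold labeling; not the paper's code) seeds the learner with the two endpoints and runs the max-min rule until the interpolant's sign matches every label:

```python
import numpy as np

h = 0.05                                    # small bandwidth (assumed)

def laplace(a, b):
    return np.exp(-np.abs(a[:, None] - b[None, :]) / h)

def norm_sq(X, y):
    # Squared Hilbert norm of the minimum norm interpolant
    return float(y @ np.linalg.solve(laplace(X, X), y))

def run_active(grid, labels, max_queries=10):
    """Seed with the two endpoints, then greedily query max-min points
    until sign(interpolant) matches every label; return #queries used."""
    lab = [0, len(grid) - 1]
    for _ in range(max_queries):
        X, y = grid[lab], labels[lab]
        alpha = np.linalg.solve(laplace(X, X), y)
        pred = np.sign(laplace(grid, X) @ alpha)
        pred[lab] = labels[lab]             # labeled points are fit exactly
        if np.all(pred == labels):
            return len(lab) - 2             # queries beyond the two seeds
        cand = [i for i in range(len(grid)) if i not in lab]
        scores = [min(norm_sq(np.append(X, grid[i]), np.append(y, s))
                      for s in (-1.0, 1.0)) for i in cand]
        lab.append(cand[int(np.argmax(scores))])
    return max_queries
```

On a grid of 33 points with a single threshold, the queried points halve the bracketing interval at each step, mirroring the binary search argument above.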
The bisection process is illustrated experimentally in Figure 9 below, which uses the RKHS norm score. For comparison, we also show the behavior of the algorithm using the data-based norm discussed in Section 3.5. Data selection using either score drives the error to zero faster than random sampling (as shown on the left). We clearly see the bisection behavior, locating one decision boundary/threshold and then another, as the proof of the corollary above suggests. We also see that the data-based norm does more exploration away from the decision boundaries; as a result, it has a faster and more graceful error decay, as shown on the right of the figure. Similar behavior is observed in the multidimensional setting shown in Figure 4.
Appendix C One Dimensional Functions with Radial Basis Kernels
Proof of Proposition 4 on maximum score with radial basis kernels.
For ease of notation, for fixed samples and bandwidth, we define the normalized distance between neighboring samples, and for a candidate point we define its normalized distance to its neighbor, as in Figure 10. The proposition is based on the assumption that the bandwidth is sufficiently small relative to all pairwise distances.
We want to show that the maximum score occurs at the zero crossing of the current interpolant. Since we normalized all pairwise distances by the bandwidth, we will instead show that there exists a constant such that, if the bandwidth is small enough, the maximum score is achieved at the zero crossing.
Note that the score depends on the location of the candidate point, characterized by its normalized distance to its neighbor. We want to prove that, for a small enough bandwidth, the score is increasing in this distance up to the zero crossing; a similar argument shows that it is decreasing beyond that point. This implies
with defined to be the point in which .
To do so, we will show that in this interval. In the proof of Proposition 2, we showed the following form for the function ,
where K is the kernel matrix for the labeled points together with the candidate, the vector collects the kernel evaluations between the candidate and the labeled points, and the remaining term is the minimum norm interpolating function based on the labeled points and their labels, evaluated at the candidate. Equation (3) shows that
First, we look at K and its inverse in the setup explained above. Using the definition of Radial basis kernels in Equation (4),
(8) 
Hence,
The determinant of matrix K is
where we defined
Note that since for , we have , then . Also, there exists a constant such that if , then .
The vector is
Next, we compute