1 Overparameterized Models and Active Learning
The success of deep learning systems has sparked interest in understanding how and why overparameterized models that interpolate the training data often display surprisingly good generalization properties [27, 11, 38, 12, 8, 1, 10, 23]. Notably, it is now understood that minimum norm solutions have the potential to generalize well in the overparameterized, interpolating regime [11, 9, 23, 25]. These findings suggest that the theory of active learning also requires careful reconsideration. This paper introduces a new framework for active learning based on the notion of minimum norm interpolators. We propose a new active learning heuristic for kernel machines inspired by the recent breakthroughs mentioned above.
Active learning algorithms adaptively select examples for labeling based on two general strategies: 1) selecting examples that rule out as many (incompatible) classifiers as possible at each step; and 2) discovering cluster structure in unlabeled data and labeling representative examples from each cluster. We show that our new active learning approach based on a minimum norm heuristic automatically exploits both these strategies: Figure 1 depicts where samples are selected using the new criterion based on minimum norm interpolators.
1.1 Review of Active Learning
Active learning involves selectively labeling examples based on some measure of informativeness and/or label uncertainty; see [34, 35] for a broad overview of the field. Here we discuss the key ideas exploited in active learning and put our contribution in context. We emphasize that this is not a comprehensive literature review.
Perhaps the simplest example of active learning is binary search or bisection to identify a single decision threshold in a one-dimensional classification setting. More general “version space” approaches [13, 3, 31, 22] extend this idea to arbitrary classes. Alternatively, the one-dimensional binary search problem naturally generalizes to the problem of learning a multi-dimensional linear classifier [4, 17, 5] and its kernel-based extensions. These methods are “margin-based,” selecting examples for labeling in the vicinity of the learned decision boundary. However, in the overparameterized and interpolating regime, potentially every labeled example is near the learned decision boundary. This calls for a more refined sample selection criterion. Roughly speaking, our new selection criteria favor examples close to the current decision boundary and closest to oppositely labeled examples. This yields optimal bisection sampling in simple low-dimensional settings.
Exploiting the geometric structure of data (e.g., clustering) is also a powerful heuristic for active learning. Graph-based active learning inspired by semi-supervised learning [41, 14] and hierarchical clustering methods are two common geometric approaches to active learning. Bayesian methods based on notions of information-gain can also exploit distributional structure [18, 20, 24, 26], but require appropriate specifications of prior distributions. The new active learning criteria proposed in this paper are capable of automatically exploiting geometric structure in data, such as clusters, without specifying priors, forming a graph, or clustering.
2 A New Active Learning Criterion
At each iteration of the active learning algorithm, a new unlabeled point is selected for labeling based on the currently labeled set of samples. The criterion we propose for picking these samples is based on a ‘max-min’ operator. We describe the criterion in its most general form along with the intuition behind this choice. In the remainder of the paper, we present theoretical results about the properties of variations of this criterion in various setups, along with descriptive numerical evaluations and simulations.
Let $\mathcal{L} = \{x_1, \dots, x_L\}$ with labels $y_1, \dots, y_L$ be the set of labeled examples so far. We assume binary valued labels $y_i \in \{-1, +1\}$, but our approach can be generalized to the multiclass setting. Let $\mathcal{U}$ be the set of unlabeled samples. In other words, we have a partially labeled training set. In the interpolating regime, the goal is to correctly label all the points in $\mathcal{U}$ so that the training error is zero. Passive learning generally requires labeling every point in $\mathcal{U}$. Active learning sequentially selects points in $\mathcal{U}$ for labeling, with the aim of learning a correct classifier without necessarily labeling all of $\mathcal{U}$. Our setting can be viewed as an instance of pool-based active learning.
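Let $\mathcal{F}$ be a rich collection of functions capable of interpolating the training data. For example, $\mathcal{F}$ could be a nonparametric infinite dimensional Reproducing Kernel Hilbert Space (RKHS) or an overparameterized neural network representation. Let $f \in \mathcal{F}$ be the minimum norm function that interpolates the labeled examples. Note that $f$ depends on the set $\mathcal{L}$ and the corresponding labels, but we do not show that dependency in our notation for brevity. For $u \in \mathcal{U}$ and $\ell \in \{-1, +1\}$, define $f^u_\ell$ to be the minimum norm interpolating function based on $\mathcal{L}$ and the point $u$ with label $\ell$. For each $u \in \mathcal{U}$, we select a label according to one of the following criteria:
$$\ell_1(u) = \arg\min_{\ell \in \{-1,+1\}} \|f^u_\ell\|, \qquad \ell_2(u) = \arg\min_{\ell \in \{-1,+1\}} \|f^u_\ell - f\|. \tag{1}$$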
The rationale for the definition of $\ell_1(u)$ is that, operating in the interpolating regime, we select the label that yields the minimum norm interpolant (i.e., the “smoother” of the two interpolants). Intuitively, one could argue that our best guess of the label of $u$ is $\mathrm{sign}(f(u))$ (where $f$ is the interpolation function based only on $\mathcal{L}$). We will show that if $f$ is constrained to be an element of an RKHS, then $\ell_1(u) = \mathrm{sign}(f(u))$ for all $u$ (assuming the Hilbert norm is used in Eq. (1)). In general, the definition in Eq. (1) is more flexible since one can choose an arbitrary norm; e.g., a variety of norms could be considered if $f$ is represented by a neural network. Define $f^u := f^u_{\ell(u)}$, where $\ell(u)$ is the label selected by either criterion in Eq. (1). The norm of the minimizer or the norm of the difference are viewed as score functions for the point $u$:
$$\mathrm{score}^{(1)}(u) = \|f^u\|, \qquad \mathrm{score}^{(2)}(u) = \|f^u - f\|. \tag{2}$$
Then, the next point to be labeled by our active learning algorithm is
$$u^* = \arg\max_{u \in \mathcal{U}} \mathrm{score}(u).$$
We show in Section 3.2 that if $f$ is an element of an RKHS and the Hilbert norm is used, the two selection criteria above are equivalent. These selection criteria are motivated by the aim of interpolating all the data.
Interpolating functions with larger scores are less smooth (in terms of norm) or have greater changes from the previous interpolator (for the norm-based score $\mathrm{score}^{(1)}$ and the difference-based score $\mathrm{score}^{(2)}$, respectively). The intuition is that attacking the most challenging points in the input space first may eliminate the need to label other “easier” examples later. The distinction between the two definitions of the score function in Eq. (2) is as follows. Scoring unlabeled points according to $\mathrm{score}^{(1)}$ prioritizes labeling the examples that result in minimum norm interpolating functions with the largest norm. Since the norm of a function can be associated with its smoothness, this criterion roughly picks the points that give the least smooth interpolating functions. Scoring unlabeled points according to $\mathrm{score}^{(2)}$ prioritizes labeling the examples that result in minimum norm interpolating functions with the largest change from the current interpolator. In both cases, we expect the selected points to be the most “informative” under some measure of informativeness. Section 3 presents theoretical results supporting this intuition.
3 Interpolating Active Learners in an RKHS
In this section, we will focus on minimum norm interpolating functions in a Reproducing Kernel Hilbert Space (RKHS). We present theoretical properties for general RKHS settings, detailed analytical results in the one-dimensional setting, and numerical studies in multiple dimensions. Broadly speaking, we establish the following properties:
the proposed score functions tend to select examples near the decision boundary of $f$, the current interpolator;
the score is largest for unlabeled examples near the decision boundary and close to oppositely labeled examples, in effect searching for the boundary in the most likely region of the input space;
in one dimension the interpolating active learner coincides with an optimal binary search procedure;
using data-based function norms, rather than the RKHS norm, the interpolating active learner executes a tradeoff between sampling near the current decision boundary and sampling in regions far away from currently labeled examples, thus exploiting cluster structure in the data.
3.1 Kernel Methods
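A Hilbert space $\mathcal{H}$ is associated with an inner product: $\langle f, g \rangle_{\mathcal{H}}$ for $f, g \in \mathcal{H}$. This induces a norm defined by $\|f\|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$. A symmetric bivariate function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive semidefinite if for all $n \geq 1$ and elements $x_1, \dots, x_n \in \mathcal{X}$, the matrix K with elements $K_{ij} = k(x_i, x_j)$ is positive semidefinite (PSD). These functions are called PSD kernel functions. A PSD kernel constructs a Hilbert space, $\mathcal{H}$, of functions on $\mathcal{X}$. For any $x \in \mathcal{X}$ and any $f \in \mathcal{H}$, the function $k(\cdot, x) \in \mathcal{H}$ and $\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$.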
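For labeled points $x_1, \dots, x_L$ and labels $y_1, \dots, y_L$, we define $f$ to be the interpolating function (i.e., $f(x_i) = y_i$ for all $i$) with minimum norm (according to the norm associated with $\mathcal{H}$). Define K to be the $L$ by $L$ matrix such that $K_{ij} = k(x_i, x_j)$ and $y = [y_1, \dots, y_L]^T$. Then, one can show that there exists a vector $\alpha$ such that $f$ can be decomposed as
$$f(x) = \sum_{i=1}^{L} \alpha_i \, k(x_i, x), \qquad \alpha = K^{-1} y. \tag{3}$$
Using the property $\langle k(x_i, \cdot), k(x_j, \cdot) \rangle_{\mathcal{H}} = k(x_i, x_j)$, then
$$\|f\|_{\mathcal{H}}^2 = \alpha^T K \alpha = y^T K^{-1} y.$$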
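The construction above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes the Laplace kernel $\exp(-\|x - x'\|/h)$, and the helper names `laplace_gram` and `min_norm_interpolator`, the bandwidth `h`, and the toy data are all hypothetical choices made for this example.

```python
import numpy as np

def laplace_gram(A, B, h=1.0):
    """Gram matrix of the Laplace kernel exp(-||a - b|| / h) between rows of A and B."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists / h)

def min_norm_interpolator(X, y, h=1.0):
    """Return (f, alpha, squared RKHS norm) for the minimum norm interpolant of (X, y)."""
    K = laplace_gram(X, X, h)
    alpha = np.linalg.solve(K, y)      # alpha = K^{-1} y
    sq_norm = float(y @ alpha)         # ||f||^2 = alpha^T K alpha = y^T K^{-1} y

    def f(Z):
        # f(z) = sum_i alpha_i k(x_i, z), evaluated at every row z of Z
        return laplace_gram(Z, X, h) @ alpha

    return f, alpha, sq_norm

# Toy data (hypothetical): four labeled points in the plane.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
f, alpha, sq_norm = min_norm_interpolator(X, y, h=0.5)
print(f(X))      # reproduces the labels up to rounding
print(sq_norm)   # squared RKHS norm of the interpolant
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of K explicitly, which is usually the more stable choice when labeled points lie close together.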
3.2 Properties of General Kernels for Active Learning
For and , define and according to the definitions in Section 2. Then, .
Let , and . Let K be the kernel matrix for the elements in and be the kernel matrix for the elements in (Hence, can be constructed by and ). Then, for
where Schur’s complement formula gives (a) and the Woodbury identity with a bit of algebra gives (b). Step (c) uses (3) for the minimum norm interpolating function. Hence, iff . ∎
If the norm used in definition of and is the Hilbert norm, then for all . Hence, and can be used interchangeably.
We omit the proof for brevity. It is based on some simple algebra which shows that for any and ,
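The two claims of this section can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: it uses a Laplace kernel with an arbitrary bandwidth and random toy data, and the function names are hypothetical. It checks that the label giving the smaller RKHS norm agrees with the sign of the current interpolator at the candidate point, and that the squared norm of the change equals the difference of squared norms. The latter identity follows because the change vanishes at every labeled point and is therefore orthogonal to the current interpolator.

```python
import numpy as np

def laplace_gram(A, B, h=1.0):
    """Gram matrix of the Laplace kernel exp(-||a - b|| / h) between rows of A and B."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists / h)

def check_point(X, y, u, h=1.0):
    """Return f(u), ||f||^2 and, per candidate label, (||f_l||^2, ||f_l - f||^2)."""
    K = laplace_gram(X, X, h)
    alpha = np.linalg.solve(K, y)
    f_u = (laplace_gram(u[None, :], X, h) @ alpha)[0]   # current prediction f(u)
    norm_f_sq = float(y @ alpha)                        # ||f||^2
    Xu = np.vstack([X, u])
    Ku = laplace_gram(Xu, Xu, h)
    out = {}
    for label in (-1.0, 1.0):
        yu = np.append(y, label)
        beta = np.linalg.solve(Ku, yu)                  # coefficients of the new interpolant
        norm_sq = float(yu @ beta)                      # squared norm of the new interpolant
        diff = beta - np.append(alpha, 0.0)             # coefficients of (new - current)
        out[label] = (norm_sq, float(diff @ Ku @ diff)) # squared norm of the change
    return f_u, norm_f_sq, out

# Toy data (hypothetical): six labeled points and twenty candidates in the unit square.
rng = np.random.default_rng(0)
X = rng.uniform(size=(6, 2))
y = np.array([-1.0, 1.0, -1.0, 1.0, 1.0, -1.0])
pool = rng.uniform(size=(20, 2))
results = [check_point(X, y, u, h=0.5) for u in pool]
for f_u, norm_f_sq, out in results[:3]:
    best = min(out, key=lambda l: out[l][0])
    print(f"f(u) = {f_u:+.3f}, min-norm label = {best:+.0f}")
```

Because the two squared quantities differ only by the constant $\|f\|^2$, ranking candidates by either one selects the same point, which is the sense in which the two criteria can be used interchangeably.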
3.3 Laplace Kernel in One Dimension
To develop some intuition, we consider active learning in one dimension. The sort of target function we have in mind is a multiple threshold classifier. Optimal active learning in this setting coincides with binary search. We now show that the proposed score function with the Hilbert norm and the Laplace kernel results in optimal active learning in one dimension (proof in Appendix A).
Proposition 3 (Max-min criterion in one dimension). Define $k(x, x') = \exp(-|x - x'|/h)$ to be the Laplace kernel in one dimension. Then the following statements hold for any value of $h > 0$:
Let and be two neighboring labeled points in . Then for all .
Let and be two pairs of neighboring labeled points such that , then
if and . Then .
if and . Then .
if and . Then .
if and . Then .
The key conclusion drawn from these properties is that the midpoints between the closest oppositely labeled neighboring examples have the highest score. If there are no oppositely labeled neighbors, then the score is largest at the midpoint of the largest gap between consecutive samples. Thus, the score results in a binary search for the thresholds defining the classifier. Using the proposition above, it is easy to show the following result, proved in Appendix B.
Consider points uniformly distributed in the interval labeled according to a -piecewise constant function so that and length of the pieces are roughly on the order of . Then by running the proposed active learning algorithm with the Laplace kernel and any bandwidth, after queries the sign of the resulting interpolant correctly labels all examples (i.e., the training error is zero).
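The bisection behavior described in this section can be observed on small examples. The sketch below is an illustration only: the candidate grid, the bandwidth `h = 0.2`, the labeled configurations, and the helper names `maxmin_scores` and `select` are assumptions made for the example. It uses the squared RKHS norm as the score, which ranks candidates in the same order as the norm itself.

```python
import numpy as np

def maxmin_scores(x_lab, y_lab, x_pool, h=1.0):
    """Max-min score (squared RKHS norm) of each pool point, 1-D Laplace kernel."""
    scores = np.empty(len(x_pool))
    for j, u in enumerate(x_pool):
        xs = np.append(x_lab, u)
        K = np.exp(-np.abs(xs[:, None] - xs[None, :]) / h)
        norms = []
        for label in (-1.0, 1.0):
            ys = np.append(y_lab, label)
            norms.append(ys @ np.linalg.solve(K, ys))
        scores[j] = min(norms)          # inner "min" over the candidate label
    return scores

grid = np.arange(1, 100) / 100.0        # candidate locations 0.01, ..., 0.99

def select(x_lab, y_lab, h=0.2):
    """Return the candidate with the largest max-min score (outer "max")."""
    pool = grid[~np.isclose(grid[:, None], x_lab[None, :]).any(axis=1)]
    return pool[np.argmax(maxmin_scores(x_lab, y_lab, pool, h))]

# Oppositely labeled neighbors at 0.3 and 0.5: the midpoint 0.4 is selected,
# even though the gap between 0.5 and 1.0 is larger.
pick_opposite = select(np.array([0.0, 0.3, 0.5, 1.0]),
                       np.array([-1.0, -1.0, 1.0, 1.0]))

# All labels agree: the midpoint 0.6 of the largest gap (0.2, 1.0) is selected.
pick_same = select(np.array([0.0, 0.2, 1.0]), np.array([1.0, 1.0, 1.0]))

print(pick_opposite, pick_same)
```

Candidates that coincide with labeled points are removed from the pool because they would make the Gram matrix singular.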
3.4 General Radial-Basis Kernels in One Dimension
For , we define norm of dimensional vector to be . We define radial basis kernel with parameter and bandwidth to be
In the next proposition, we look at the special case of radial basis kernels applied to one dimensional functions with only three initial points. We next show how maximizing is equivalent to picking the zero-crossing point of our current interpolator.
Proposition 4 (One Dimensional Functions with Radial Basis Kernels).
Assume that for any pair of samples we have . Assume for a constant value of . Let , and . For such that , we have where is the point satisfying .
The proof is rather tedious and appears in Appendix C. But the idea is based on showing that with small enough bandwidth, is increasing in in the interval and is decreasing in in the same interval. This shows that occurs at such that . We showed that this is equivalent to the condition .
3.5 Data-Based Norm
As mentioned above, one can consider norms other than the RKHS norm. One natural option is to consider the -norm of the output map. Consider
Intuitively, this score measures the expected change in the squared norm over all unlabeled examples if is selected as the next point. This norm is sensitive to the particular distribution of the data, which is important if the data are clustered. This behavior will be demonstrated in the multidimensional setting discussed next.
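A data-based score of this kind can be sketched as follows. This is one possible reading of the description above, not the authors' definition: it assumes the score is the mean squared change of the interpolator over the unlabeled pool, that the candidate label is the sign of the current interpolator, and that a Laplace kernel is used. The function names, bandwidth, and clustered toy data are hypothetical.

```python
import numpy as np

def laplace_gram(A, B, h=1.0):
    """Gram matrix of the Laplace kernel exp(-||a - b|| / h) between rows of A and B."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists / h)

def interpolate(X, y, Z, h):
    """Evaluate the minimum norm interpolant of (X, y) at the rows of Z."""
    return laplace_gram(Z, X, h) @ np.linalg.solve(laplace_gram(X, X, h), y)

def data_based_scores(X_lab, y_lab, X_pool, h=1.0):
    """Mean squared change of the interpolator over the pool for each candidate."""
    f_pool = interpolate(X_lab, y_lab, X_pool, h)       # current interpolator on the pool
    scores = np.empty(len(X_pool))
    for j, u in enumerate(X_pool):
        label = 1.0 if f_pool[j] >= 0 else -1.0         # assumed label: sign of f(u)
        Xu = np.vstack([X_lab, u])
        yu = np.append(y_lab, label)
        fu_pool = interpolate(Xu, yu, X_pool, h)        # updated interpolator on the pool
        scores[j] = np.mean((fu_pool - f_pool) ** 2)    # empirical squared change
    return scores

# Toy data (hypothetical): two labeled points and a pool made of two tight clusters.
rng = np.random.default_rng(1)
X_lab = np.array([[0.0, 0.0], [1.0, 1.0]])
y_lab = np.array([-1.0, 1.0])
X_pool = np.vstack([rng.normal([0.2, 0.8], 0.05, size=(30, 2)),
                    rng.normal([0.8, 0.2], 0.05, size=(30, 2))])
scores = data_based_scores(X_lab, y_lab, X_pool, h=0.3)
next_index = int(np.argmax(scores))
print(next_index, scores[next_index])
```

Because the average runs over the pool itself, a candidate whose labeling changes the interpolator across a densely populated region receives a high score, which is how this variant becomes sensitive to cluster structure.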
3.6 Multidimensional Setting
The properties and behavior found in the one dimensional setting carry over to higher dimensions. In particular, the max-min norm criterion tends to select unlabeled examples near the decision boundary and close to oppositely labeled examples. This is illustrated in Figure 2 below. The input points (training examples) are uniformly distributed in the square . We trained a Laplace kernel machine to perfectly interpolate four training points with locations and binary labels as depicted in Figure 2(a). The color depicts the magnitude of the learned interpolating function: dark blue is indicating the “decision boundary” and bright yellow is approximately . Figure 2(b) shows the score for selecting a point at each location based on the RKHS norm criterion. Figure 2(c) shows the score for selecting a point at each location based on the data-based norm criterion discussed above. Both criteria select the point on the decision boundary, but the RKHS norm favors points that are closest to oppositely labeled examples whereas the data-based norm favors points on the boundary further from labeled examples.
Next we present a modified scenario in which the examples are not uniformly distributed over the input space, but instead concentrated only in certain regions indicated by the magenta highlights in Figure 3(a). In this setting, the example selection criteria differ more significantly for the two norms. The RKHS norm selection criterion remains unchanged, but is applied only to regions where there are examples. Areas without examples to select are indicated by dark blue in Figure 3(b)-(c). The data-based norm is sensitive to the non-uniform input distribution, and it scores examples near the lower portion of the decision boundary highest.
The distinction between the max-min selection criterion using the RKHS vs. data-based norm is also apparent in the experiment in which a curved decision boundary in two dimensions is actively learned using a Laplace kernel machine, as depicted in Figure 4 below. is the max-min RKHS norm criterion at progressive stages of the learning process (from left to right). The data-based norm is used in defined in Equation (5). Both dramatically outperform a passive (random sampling) scheme and both demonstrate how active learning automatically focuses sampling near the decision boundary between the oppositely labeled data (yellow vs. blue). However, the data-based norm does more exploration away from the decision boundary. As a result, the data-based norm requires slightly more labels to perfectly predict all unlabeled examples, but has a more graceful error decay, as shown on the right of the figure.
Before moving on to active neural network learners, we demonstrate how the data-based norm also tends to automatically select representative examples from clusters when such structure exists in the unlabeled dataset. Figure 5 compares the behavior of selection based on with the RKHS norm and with data-based norm, when the data are clustered and each cluster is homogeneously labeled. We see that the data-based norm quickly identifies the clusters and labels a representative from each, leading to faster error decay as shown on the right.
4 Interpolating Neural Network Active Learners
Here we briefly examine the extension of the max-min criterion and its variants to neural network learners. Neural network complexity or capacity can be controlled by limiting the magnitude of the network weights [6, 29, 40]. A number of weight norms and related measures have been recently proposed in the literature [30, 7, 21, 2, 28]. For example, ReLU networks with a single hidden layer and minimum norm weights coincide with linear spline interpolation. With this in mind, we provide empirical evidence showing that defining the max-min criterion with the norm of the network weights yields a neural network active learning algorithm with properties analogous to those obtained in the RKHS setting.
Consider a single hidden layer network with ReLU activation units trained using MSE loss. In Figure 6 we show the results of an experiment implemented in PyTorch in the same settings considered above for kernel machines in Figures 2 and 3. We trained an overparameterized network with hidden layer units to perfectly interpolate four training points with locations and binary labels as depicted in Figure 6(a). The color depicts the magnitude of the learned interpolating function: dark blue is indicating the “decision boundary” and bright yellow is approximately . Figure 6(b) shows the score with the weight norm (i.e., the norm of the resulting network weights when a new sample is selected at that location). The brightest yellow indicates the highest score and the location of the next selection. Figure 6(c) shows the score with the data-based norm defined in Section 3.5. In both cases, the max occurs at roughly the same location, which is near the current decision boundary and closest to oppositely labeled points. The data-based norm also places higher scores on points further away from the labeled examples. Thus, the data selection behavior of the neural network is analogous to that of the kernel-based active learner (compare with Figure 2).
Next we present a modified scenario in which the examples are not uniformly distributed over the input space, but instead concentrated only in certain regions indicated by the magenta highlights in Figure 7(a). In this setting, the example selection criteria differ more significantly for the two norms. The weight norm selection criterion remains unchanged, but is applied only to regions where there are examples. Areas without examples to select are indicated by dark blue in Figure 7(b)-(c). The data-based norm is sensitive to the non-uniform input distribution, and it scores examples near the lower portion of the decision boundary highest. Again, this is quite similar to the behavior of the kernel active learner (compare with Figure 3).
5 MNIST Experiment
Here we illustrate the performance of the proposed active learning method on the MNIST dataset. The goal of the classifier is to detect whether an image belongs to the set of numbers greater than or equal to or not. We used a Laplace kernel with bandwidth on the vectorized version of a dataset of images. Hence, the dimensionality of each sample is . For comparison, we implemented a passive learning algorithm which selects the points randomly, the active learning algorithm using with RKHS norm, and the active learning algorithm using with data-based function norm described in Section 3.5. Figure 8 depicts the decay of probability of error for each of these algorithms. The behavior is similar to the theory and experiments above. Both the RKHS norm and the data-based norm, score and score respectively, dramatically outperform the passive learning baseline, and the data-based norm results in somewhat more graceful error decay. Note that the purpose of this experiment is to compare the performance of various selection algorithms. Hence, we train our classification algorithm on a small subset of the MNIST dataset, which is the reason for the non-negligible error level on the test set.
References
-  Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
-  Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263, 2018.
-  Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
-  Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In International Conference on Computational Learning Theory, pages 35–50. Springer, 2007.
-  Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288–316, 2013.
-  Peter L Bartlett. For valid generalization the size of the weights is more important than the size of the network. In Advances in neural information processing systems, pages 134–140, 1997.
-  Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
-  Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
-  Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.
-  Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, pages 2300–2311, 2018.
-  Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 540–548, 2018.
-  Mikhail Belkin, Alexander Rakhlin, and Alexandre B Tsybakov. Does data interpolation contradict statistical optimality? arXiv preprint arXiv:1806.09471, 2018.
-  David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine learning, 15(2):201–221, 1994.
-  Gautam Dasarathy, Robert Nowak, and Xiaojin Zhu. S2: An efficient graph based active learning algorithm with application to nonparametric classification. In Conference on Learning Theory, pages 503–522, 2015.
-  Sanjoy Dasgupta. Two faces of active learning. Theoretical computer science, 412(19):1767–1781, 2011.
-  Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th international conference on Machine learning, pages 208–215. ACM, 2008.
-  Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. In International Conference on Computational Learning Theory, pages 249–263. Springer, 2005.
-  Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine learning, 28(2-3):133–168, 1997.
-  Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1183–1192. JMLR. org, 2017.
-  Daniel Golovin, Andreas Krause, and Debajyoti Ray. Near-optimal bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems, pages 766–774, 2010.
-  Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299, 2018.
-  Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
-  Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
-  Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Active learning with gaussian processes for object categorization. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
-  Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv preprint arXiv:1808.00387, 2018.
-  Michael Lindenbaum, Shaul Markovitch, and Dmitry Rusakov. Selective sampling for nearest neighbor classifiers. Machine learning, 54(2):125–152, 2004.
-  Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. In International Conference on Machine Learning, pages 3331–3340, 2018.
-  Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.
-  Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
-  Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
-  Robert D Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.
-  Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? arXiv preprint arXiv:1902.05040, 2019.
-  Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
-  Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
-  Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
-  Yanyao Shen, Hyokun Yun, Zachary C Lipton, Yakov Kronrod, and Animashree Anandkumar. Deep active learning for named entity recognition. arXiv preprint arXiv:1707.05928, 2017.
-  Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
-  Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. arXiv preprint arXiv:1810.07288, 2018.
-  Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2017.
-  Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
-  Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-supervised learning using gaussian fields and harmonic functions. In ICML 2003 workshop on the continuum from labeled to unlabeled data in machine learning and data mining, volume 3, 2003.
Appendix A Max Min criteria in one dimension with Laplace Kernel
Proof of Proposition 3.
Let us look at the Gram matrix for three neighboring points according to the Laplace Kernel.
where we define and . It can be shown that with the above structure
In general, if we look into the Gram matrix of , and define for , we observe:
and the remaining elements of the matrix are zero, i.e., for .
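The sparsity of the inverse Gram matrix can be checked numerically. The sketch below assumes that the statement above refers to entries more than one position off the diagonal, with the sample locations sorted. The locations and bandwidth are toy values chosen for illustration.

```python
import numpy as np

x = np.array([0.0, 0.1, 0.35, 0.6, 1.0])            # sorted 1-D sample locations (toy values)
h = 0.5                                              # bandwidth (assumed)
K = np.exp(-np.abs(x[:, None] - x[None, :]) / h)     # Laplace Gram matrix
K_inv = np.linalg.inv(K)

i, j = np.indices(K.shape)
off_band = K_inv[np.abs(i - j) > 1]                  # entries with |i - j| > 1
print(np.max(np.abs(off_band)))                      # numerically zero
print(np.round(K_inv, 3))                            # tridiagonal structure is visible
```

This tridiagonal structure is what makes the score of a candidate depend only on its two neighboring labeled points, which the remainder of the proof exploits.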
Looking at three neighboring points labeled we can compute :
Consider two neighboring points labeled . Let be such that and . Define and . Then,
Without loss of generality, assume or , then
Note that for all such that , we have and is constant. So
which is attained when or .
This gives the following statement: For neighboring labeled points such that , we have
Note that the above function is decreasing in .
Consider two neighboring points labeled . Let such that and . Define and .
This gives the following statement: For neighboring labeled points such that , we have
Note that the above function is increasing in .
Appendix B Max Min criteria Binary Search Corollary
Proof of Binary Search Corollary.
According to the last property in Proposition 3, the first sample selected will be at the midpoint of the unit interval and the second point will be at or . If the labels agree, then the next sample will be at the midpoint of the largest subinterval (e.g., at if was sampled first). Sampling at the midpoint of the largest subinterval between consecutive pairs of labeled points continues until a point with the opposite label is found. Once a point with the opposite label has been found, Proposition 3 implies that subsequent samples repeatedly bisect the subinterval between the closest pair of oppositely labeled points. This bisection process will terminate with two neighboring points having opposite labels, thus identifying one boundary/threshold of . The total number of labels collected by this bisection process is at most . After this, the algorithm alternates between the two situations above. It performs bisection on the subinterval between the closest pair of oppositely labeled points, if such an interval exists. If not, it samples at the midpoint of the largest subinterval between consecutive pairs of labeled points. The stated label complexity bound follows from the assumptions that there are thresholds and the length of each piece (subinterval between thresholds) is on the order of . ∎
The bisection process is illustrated experimentally in Figure 9 below. uses the RKHS norm. For comparison, we also show the behavior of the algorithm using and the data-based norm discussed in Section 3.5. Data selection using either score drives the error to zero faster than random sampling (as shown on the left). We clearly see the bisection behavior of , locating one decision boundary/threshold and then another, as the proof of the corollary above suggests. Also, we see that the data-based norm does more exploration away from the decision boundaries. As a result, the data-based norm has a faster and more graceful error decay, as shown on the right of the figure. Similar behavior is observed in the multidimensional setting shown in Figure 4.
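The bisection process in the proof can also be simulated directly. The following sketch is a toy illustration, not the experiment behind Figure 9: it assumes a pool of 129 equally spaced points on the unit interval, a three-piece labeling with thresholds at 0.3 and 0.7, a Laplace kernel with bandwidth 0.1, and an initial labeled set consisting of the two endpoints. All of these choices, and the function names, are hypothetical.

```python
import numpy as np

def maxmin_scores(x_lab, y_lab, x_pool, h):
    """Max-min score (squared RKHS norm) of each pool point, 1-D Laplace kernel."""
    scores = np.empty(len(x_pool))
    for j, u in enumerate(x_pool):
        xs = np.append(x_lab, u)
        K = np.exp(-np.abs(xs[:, None] - xs[None, :]) / h)
        norms = []
        for label in (-1.0, 1.0):
            ys = np.append(y_lab, label)
            norms.append(ys @ np.linalg.solve(K, ys))
        scores[j] = min(norms)
    return scores

def run_active_learner(x, labels, h=0.1):
    """Query labels by the max-min rule until sign(f) is correct on every point of x."""
    labeled = [0, len(x) - 1]            # assumption: start from the two endpoints
    n_queries = 0
    while True:
        x_lab, y_lab = x[labeled], labels[labeled]
        K = np.exp(-np.abs(x_lab[:, None] - x_lab[None, :]) / h)
        alpha = np.linalg.solve(K, y_lab)
        f = np.exp(-np.abs(x[:, None] - x_lab[None, :]) / h) @ alpha
        errors = int(np.sum(np.sign(f) != labels))
        unlabeled = np.setdiff1d(np.arange(len(x)), labeled)
        if errors == 0 or len(unlabeled) == 0:
            return n_queries, errors, sorted(labeled)
        scores = maxmin_scores(x_lab, y_lab, x[unlabeled], h)
        labeled.append(int(unlabeled[np.argmax(scores)]))   # query the top-scoring point
        n_queries += 1

x = np.arange(129) / 128.0                                   # pool of candidate locations
labels = np.where((x >= 0.3) & (x < 0.7), 1.0, -1.0)         # three-piece toy labeling
n_queries, errors, labeled = run_active_learner(x, labels)
print(n_queries, errors)
print(x[labeled])                                            # queried locations cluster at the thresholds
```

In this toy setting the queried locations concentrate around the two thresholds, and the number of queries is a small fraction of the pool size, in line with the bisection argument above.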
Appendix C One Dimensional Functions with Radial Basis Kernels
Proof of Proposition 4 on maximum score with radial basis kernels.
For ease of notation, for fixed and , we define as the normalized distance between samples such that and . For , we define such that the distance between the point and is , as in Figure 10. The proposition is based on the assumption that for any pair of points, and is sufficiently small that .
We want to show that the max score happens at the zero crossing of function . Since we normalized all pairwise distances by , instead we will show that there exists a constant such that if , then the max score is achieved at the zero crossing.
Note that depends on the location of point , characterized by the normalized distance between and denoted by . We want to prove that for small enough bandwidth, is increasing in for . We can use a similar argument to show that is decreasing in . This implies
with defined to be the point at which .
To do so, we will show that in this interval. In the proof of Proposition 2, we showed the following form for the function ,
where K is the kernel matrix for the points and . The vector is defined to be . The term is the minimum norm interpolating function based on the points and and their labels evaluated at . Equation 3 shows that
First, we look at K and its inverse in the setup explained above. Using the definition of Radial basis kernels in Equation (4),
The determinant of matrix K is
where we defined
Note that since for , we have , then . Also, there exists a constant such that if , then .
The vector is
Next, we compute