DeepNNK: Explaining deep models and their generalization using polytope interpolation

07/20/2020 · Sarath Shekkizhar, et al. · University of Southern California

Modern machine learning systems based on neural networks have shown great success in learning complex data patterns while being able to make good predictions on unseen data points. However, the limited interpretability of these systems hinders further progress and their application to several real world domains. This predicament is exemplified by time-consuming model selection and the difficulties faced in predictive explainability, especially in the presence of adversarial examples. In this paper, we take a step towards a better understanding of neural networks by introducing a local polytope interpolation method. The proposed Deep Non Negative Kernel regression (NNK) interpolation framework is non parametric, theoretically simple and geometrically intuitive. We demonstrate instance based explainability for deep learning models and develop a method to identify models with good generalization properties using leave one out estimation. Finally, we offer a rationalization of adversarial and generative examples, which are inevitable under an interpolation view of machine learning.



Code repository: DeepNNK_polytope_interpolation (non parametric, polytope interpolation framework for use with deep learning models).

I Introduction

The goal of any learning system is to identify a mapping from input data space to output classification or regression space based on a finite set of training data with a basic generalization requirement: Models trained to perform well on a given dataset (empirical performance) should perform well on future examples (expected performance), i.e., the gap between expected and empirical performance must be small.

Today, deep neural networks are at the core of several recent advances in machine learning. An appropriate deep architecture is closely tied to the dataset on which it is trained and is selected with significant manual engineering by practitioners or by random search based on subjective heuristics [7]. Approaches based on resubstitution (training) error, which is often near zero in deep learning systems, can be misleading, while held out (validation) metrics introduce possible selection bias, and the data they use might be more valuable if used to train the model [1]. Nevertheless, these methods have steadily improved state of the art metrics on several datasets, even though only a limited understanding of generalization is available [36] and it is generally not known whether a smaller model trained for fewer epochs could have achieved the same performance [10].

A model is typically said to suffer from overfitting when it performs poorly on test (validation) data while performing well on the training data. The conventional approach to avoid overfitting with error minimization is to avoid training an over-parameterized model to zero loss, for example by penalizing the training process with methods such as weight regularization or early stopping [24, 33]. This perspective has been questioned by recent research, which has shown that a model with a number of parameters several orders of magnitude larger than the dataset size, trained to zero loss, generalizes to new data as well as a constrained model [46]. Thus, while conventional wisdom about interpolating estimators [24, 33] is that they can achieve zero training error but generally exhibit poor generalization, Belkin and others [3, 4] propose and theoretically study specific interpolation based methods, such as simplicial interpolation and weighted interpolated nearest neighbors (wiNN), that can achieve generalization with theoretical guarantees. [3] suggests that neural networks perform interpolation in a transformed space and that this could help explain their generalization performance. Though this view has spurred renewed interest in interpolating estimators [28, 23], there have been no studies of interpolation based classifiers integrated with a complete neural network. This is due in part to their complexity: working with $d$-simplices [3] is impractical when the dimension of the data space is high, as is the case for the problems of interest where neural networks are used. In contrast, a simpler method such as wiNN does not have the same geometric properties as the simplex approach.

In this paper, we propose a practical and realizable interpolation framework based on local polytopes obtained using Non Negative Kernel regression (NNK) [40] on neural network architectures. As shown in a simple setup in Figure 1(a), simplicial interpolation, even when feasible, constrains itself to a simplex structure (triangles in $\mathbb{R}^2$) around each test query, which leads to an arbitrary choice of the containing simplex when data lies on one of the simplicial faces. Thus, in the example of Figure 1(a) only one of the triangles can be used, and only two out of the four points in the neighborhood contribute to the interpolation. This situation becomes increasingly common in high dimensions, worsening interpolation complexity. By relaxing the simplex constraint, one can better formulate the interpolation using generalized convex polytope structures, such as those obtained using NNK, that depend on the positions of the sampled training data in the classification space. While our proposed method uses $k$ nearest neighbors (KNN) as a starting point, it differs from other KNN based approaches, such as wiNN schemes [16, 6, 3] and DkNN [35, 42]. In particular, these KNN based algorithms can be biased when data instances have different densities in different directions in space. Instead, as shown in Figure 1(b), NNK automatically selects the data points most influential for interpolation based on their relative position, i.e., only those neighboring representations that provide new (orthogonal) information for data reconstruction are selected for functional interpolation. In summary, our proposed method combines some of the best features of existing methods, providing a geometrical interpretation and performance guarantees like simplicial interpolation [3], with much lower complexity, of an order comparable to KNN based schemes.

Fig. 1: (a) Comparison of simplicial and polytope interpolation methods. In the simplex case, the label for the query node can be approximated based on different triangles (simplices), one of which must be chosen. With the chosen triangle, two out of the three points are used for interpolation, so that in this example only half the neighboring points contribute to the interpolation. Instead, polytope interpolation based on NNK uses all four data points, which together form a polytope. (b) KRI plane (dashed orange line) corresponding to a chosen neighbor $x_j$. Data points to the right of this plane will not be selected by NNK as neighbors of $x_i$. (c) Convex polytope associated with the KRI boundary, formed by the NNK neighbors of $x_i$.

To integrate our interpolation framework with a neural network, we replace the final classification layer, typically some type of support vector machine (SVM), with our NNK interpolator during evaluation at training and at test time, while relying on the loss obtained with the original SVM-like layer for backpropagation. This strategy of using a different classifier at the final layer is not uncommon in deep learning [27, 29, 8] and is motivated by the intuition that each layer of a neural network corresponds to an abstract transformation of the input data space catered to the machine task at hand. Note that, unlike the explicit parametric boundaries generally defined by an SVM-like final layer, local interpolation methods produce boundaries that are implicit, i.e., based on the relative positions of the training data in a transformed space. In other words, the proposed DeepNNK procedure allows us to characterize the network by the output classification space rather than relying on a global boundary defined on that space.
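As a concrete illustration, the sketch below shows what such an evaluation-time swap might look like. The names `feature_extractor` (the trained network up to its penultimate layer) and `nnk_interpolate` (the NNK routine sketched in Section III) are our own placeholders, not the paper's released API.

```python
# Hedged sketch of DeepNNK evaluation: classify with NNK interpolation over the
# network's penultimate-layer embeddings instead of the SVM-like last layer.
# `feature_extractor` and `nnk_interpolate` are hypothetical names (see Section III).
import numpy as np

def deepnnk_predict(feature_extractor, x_test, X_train, Y_train_onehot, k=50):
    """Predict the class of x_test by NNK interpolation in feature space."""
    z_test = feature_extractor(x_test)      # embedding of the test point
    Z_train = feature_extractor(X_train)    # embeddings of the training set
    y_soft = nnk_interpolate(z_test, Z_train, Y_train_onehot, k=k)
    return int(np.argmax(y_soft))           # interpolated soft label -> class
```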

Equipped with these tools, we tackle model selection in neural networks from an interpolation perspective using data dependent stability: a model is stable for a training set $D$ if changing a single point in $D$ does not affect (or yields a very small change in) the output hypothesis [19, 44]. This definition is similar to, but distinct from, algorithmic stability obtained using jackknifing [9, 32] and related statistical procedures such as cross validation [1]. While the latter repeatedly uses the entire training dataset but one point to compute many estimators that are combined at the end, the former is concerned with the output estimate at a point not used for its prediction, and is the focus of our work. Direct evaluation of algorithmic stability in the context of deep learning is impractical for two reasons. First, there is the increased runtime complexity associated with training the algorithm on different sets. Second, even when computationally feasible, the assessment within each setting is obscured by randomness in training, for example in weight initialization and batch sampling, which requires repeated evaluation to reduce the variance of the performance estimate. Unlike these methods [1, 9], by focusing on stability to input perturbations at interpolation, our method achieves a practical methodology for model selection that does not involve repeated training.

Another challenging issue that prevents the application of deep neural networks in sensitive domains, such as medicine and defense, is the absence of explanations for the predictions obtained [20]. Explainability or interpretability can be defined as the degree to which a human can understand or rationalize a prediction obtained from a learning algorithm. A model is more interpretable than another if its decisions are easier for a human to comprehend, e.g., for a health care technician looking at a flagged retinal scan [15], than decisions from the other model. Example based explanations can help alleviate the problem of interpretability by allowing humans to understand complex predictions by analogical reasoning with a small set of influential training instances [26, 27].

Our proposed DeepNNK classifier is a neighborhood based approach that makes very few modeling assumptions on the data. Each prediction of the NNK classifier comes with a set of training data points (the neighbors selected by NNK) whose interpolation produces the classification or regression. In contrast to earlier methods such as DkNN [35, 42], which rely on hyperparameters such as $k$ that directly impact explainability and confidence characterizations, our approach adapts to the local data manifold by identifying a stable set of training instances that most influence an estimate. Further, the gains in interpretability using our framework do not incur a penalty in performance: unlike earlier methods [27, 35], there is no loss in overall performance from using an interpolative last layer, and in some cases there are gains, compared to the performance achieved with standard SVM-like last layer classifiers. Indeed, we demonstrate performance improvements over standard architectures with SVM-like last layers in cases where there is overfitting.

Finally, this paper presents an empirical explanation of generative and adversarial examples, which have gained growing attention in modern machine learning. We show that these instances fall in distinct interpolation regions, surrounded by fewer NNK neighbors on average than real images.

II Preliminaries and Background

II-A Statistical Setup

The goal of machine learning is to find a function $\hat{f}: \mathcal{X} \to \mathcal{Y}$ that minimizes the probability of error on samples drawn from the joint distribution $P_{X,Y}$ over $\mathcal{X} \times \mathcal{Y}$. Let $\mu$ be the marginal distribution of $X$, with support denoted by $\mathcal{X}$. Let $f(x)$ denote the conditional mean $\mathbb{E}[Y \mid X = x]$. The risk or error associated with a predictor $\hat{f}$ in a regression setting is given by $R(\hat{f}) = \mathbb{E}[(\hat{f}(X) - Y)^2]$. The Bayes estimator $f$, obtained as the expected value of the conditional, is the best predictor, and its risk lower bounds that of any other predictor, $R(f) \le R(\hat{f})$. Unfortunately, the joint distribution is not known a priori, and thus a good estimator must be designed based on $N$ labelled samples drawn from $P_{X,Y}$ in the form of training data $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$. Further, each $y_i$ is assumed to be corrupted by i.i.d. noise and hence can deviate from the Bayes estimate $f(x_i)$. For a binary classification problem, the domain of $Y$ is reduced to $\{0, 1\}$, with the plug-in Bayes classifier defined as $h(x) = \mathbb{1}\{f(x) > 1/2\}$, where $f(x) = \mathbb{E}[Y \mid X = x]$. The risk associated with a classifier $\hat{h}$ is defined as $R(\hat{h}) = \mathbb{P}(\hat{h}(X) \ne Y)$ and is related to the Bayes risk as $R(h) \le R(\hat{h})$. Note that the excess risk associated with $\hat{f}$ in both the regression and classification settings is measured relative to the risk of $f$ and $h$ respectively, and is the subject of our work. The generalization risk defined above depends on the data distribution, while in practice one uses the empirical error, defined as $\hat{R}(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{E}(\hat{f}(x_i), y_i)$, where $\mathcal{E}$ is the error associated with the regression or classification setting. We denote by $D_{\setminus i}$ the training set obtained by removing the point $(x_i, y_i)$ from $D$.

II-B Deep Kernels

Given data $x_1, x_2, \dots, x_N$, kernel based methods observe similarities in a non linearly transformed feature space $\mathcal{H}$, referred to as the Reproducing Kernel Hilbert Space (RKHS) [2]. One of the key ingredients of kernel machines is the kernel trick: inner products in the feature space can be efficiently computed using kernel functions. Due to the non linear nature of the data mapping, linear operations in the RKHS correspond to non linear operations in the input data space.

Definition 1.

If $\kappa$ is a continuous symmetric kernel of a positive integral operator on a space of functions, then by Mercer's theorem there exist a space $\mathcal{H}$ and a mapping $\phi: \mathcal{X} \to \mathcal{H}$ such that $\kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product in $\mathcal{H}$.

Kernels satisfying the above definition are known as Mercer kernels and have a wide range of applications in machine learning [25]. In this work, we center our experiments around the range normalized cosine kernel, defined as

$$\kappa(x_i, x_j) = \frac{1}{2}\left(1 + \frac{\langle x_i, x_j \rangle}{\|x_i\|_2\, \|x_j\|_2}\right), \qquad (1)$$

though our theoretical statements and claims make no assumption on the type of kernel, other than that it be positive with range $[0, 1]$. Similar to [43], we combine kernel definitions with neural networks to incorporate the expressive power of neural networks. Given a kernel function $\kappa$, we transform the input data using the non linear mapping $\hat{h}_\Theta$ corresponding to a deep neural network (DNN) parameterized by $\Theta$:

$$\hat{\kappa}(x_i, x_j) = \kappa\big(\hat{h}_\Theta(x_i), \hat{h}_\Theta(x_j)\big). \qquad (2)$$

The normalized cosine kernel of equation (1) is then rewritten as

$$\hat{\kappa}(x_i, x_j) = \frac{1}{2}\left(1 + \frac{\langle \hat{h}_\Theta(x_i), \hat{h}_\Theta(x_j) \rangle}{\|\hat{h}_\Theta(x_i)\|_2\, \|\hat{h}_\Theta(x_j)\|_2}\right). \qquad (3)$$
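A minimal sketch of how the deep kernel of equations (2)-(3) could be computed is given below, where `model_features` stands in for the DNN mapping $\hat{h}_\Theta$ (a placeholder of ours, not the paper's code):

```python
import numpy as np

def deep_cosine_kernel(model_features, x_i, x_j):
    """Range normalized cosine kernel (3) on DNN embeddings; output lies in [0, 1]."""
    h_i, h_j = model_features(x_i), model_features(x_j)
    cos = np.dot(h_i, h_j) / (np.linalg.norm(h_i) * np.linalg.norm(h_j))
    return 0.5 * (1.0 + cos)
```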

II-C Non Negative Kernel regression (NNK)

The starting point for our interpolation based classifier is our previous work on graph construction using non negative kernel regression (NNK) [40]. NNK formulates graph construction as a signal representation problem, where each data point is approximated by a weighted sum of functions from a dictionary formed by its neighbors. The NNK objective for graph construction can be written as

$$\theta_i = \arg\min_{\theta \ge 0}\; \|\phi_i - \Phi_S\, \theta\|_2^2, \qquad (4)$$

where $\phi_i$ is a lifting of $x_i$ from observation to similarity space and $\Phi_S$ contains the transformed neighbors.

Unlike nearest neighbor approaches, which select the neighbors having the largest inner products and can be viewed as a thresholding based representation, NNK is an improved basis selection procedure in kernel space leading to a stable and robust representation. Geometrically, NNK can be characterized in the form of the kernel ratio interval (KRI), as shown in Figures 1(b) and 1(c). The KRI theorem states that, for any positive definite kernel with range $[0, 1]$ (e.g., the cosine kernel (3)), the necessary and sufficient condition for two data points $x_j$ and $x_k$ to both be NNK neighbors of $x_i$ is

$$\kappa(x_j, x_k) < \frac{\kappa(x_i, x_j)}{\kappa(x_i, x_k)} < \frac{1}{\kappa(x_j, x_k)}. \qquad (5)$$

Inductive application of the KRI produces a closed decision boundary around the data point to be approximated ($x_i$), with the identified neighbors forming a convex polytope around it (Figure 1(c)). As with the simplicial interpolation of [3], the local geometry of our NNK classifier can be leveraged to obtain theoretical performance bounds, as discussed next.
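The KRI condition (5) can be checked directly from three kernel values; the toy function below (our own illustration, not from the paper's code) makes the pruning behavior concrete: the more similar two candidates are to each other, the narrower the admissible interval, and the more likely one of them is pruned.

```python
def kri_admits_both(K_ij, K_ik, K_jk):
    """True iff x_j and x_k can both be NNK neighbors of x_i per equation (5)."""
    return K_jk < (K_ij / K_ik) < (1.0 / K_jk)

# Two candidates that are very similar to each other (K_jk = 0.95) fail the test:
print(kri_admits_both(K_ij=0.9, K_ik=0.8, K_jk=0.95))  # False, since 0.9/0.8 > 1/0.95
```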

III Local Polytope Interpolation

In this section, we propose and analyze a polytope interpolation scheme based on local neighbors that asymptotically approaches the 1-nearest neighbor algorithm [12]. (All proofs of the theoretical statements in this section are included in the supplementary material.) Like 1-nearest neighbor, the proposed method is not statistically consistent in the presence of label noise but, unlike the former, its risk can be studied in the non-asymptotic case with data dependent bounds under mild assumptions on smoothness.

Proposition 1.

Given the $k$ nearest neighbors of a sample $x$, the following NNK estimate at $x$ is a valid interpolation function:

$$\hat{y}(x) = \sum_{j:\, \theta_j > 0} \theta_j\, y_j, \qquad (6)$$

where $\theta$ are the non zero weights obtained from the minimization of equation (4), that is,

$$\theta = \arg\min_{\theta \ge 0}\; \frac{1}{2}\, \theta^\top K_{S,S}\, \theta - \theta^\top K_{S,x}, \qquad (7)$$

where $K_{S,S}$ corresponds to the kernel space representation of the $k$ nearest neighbors and $K_{S,x}$ denotes the kernel similarities with respect to $x$.

The interpolator of Proposition 1 is biased and can be bias corrected by normalizing the interpolation weights. The unbiased NNK interpolation estimate is thus obtained as

$$\hat{y}(x) = \frac{\sum_{j} \theta_j\, y_j}{\sum_{j} \theta_j}. \qquad (8)$$

In other words, NNK starts with a crude approximation of the neighborhood in the form of the $k$ nearest neighbors but, instead of directly using these points as sources of interpolation, optimizes and reweighs the selection (most of the weights being zero) using equation (7) to obtain a stable set of neighbors.
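The sketch below puts Proposition 1 and equation (8) together, under the cosine kernel of equation (3) and with embeddings assumed precomputed. It recasts the quadratic program (7) as a non negative least squares problem so that an off-the-shelf solver applies; the function names and the small ridge term are our choices, not the paper's released implementation.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import nnls

def cosine_kernel(A, B):
    """Range normalized cosine kernel (1)/(3); entries lie in [0, 1]."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 0.5 * (1.0 + A @ B.T)

def nnk_interpolate(z, Z_train, y_train, k=50, ridge=1e-8):
    """Unbiased NNK estimate (8) at query embedding z."""
    sims = cosine_kernel(z[None, :], Z_train)[0]
    S = np.argsort(-sims)[:k]                 # crude kNN initialization
    K_SS = cosine_kernel(Z_train[S], Z_train[S])
    k_Sz = sims[S]
    # (7) is min_{theta >= 0} 0.5 theta' K theta - theta' k; with A = K^{1/2} and
    # b = A^{-1} k, this equals the NNLS problem min_{theta >= 0} ||A theta - b||^2.
    A = np.real(sqrtm(K_SS + ridge * np.eye(len(S))))
    theta, _ = nnls(A, np.linalg.solve(A, k_Sz))
    return (theta @ y_train[S]) / theta.sum() # normalized weights, equation (8)
```

In practice only a few entries of `theta` are non zero, so the returned estimate is supported on the small polytope of NNK neighbors.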

III-A A general bound on the NNK classifier

We present a theoretical analysis based on the simplicial interpolation analysis of [3], adapted to the proposed NNK interpolation. We first study the NNK framework in a regression setting and then adapt the results for classification. Let $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$ be the training data made available to NNK. Further, assume each $y_i$ is corrupted by independent noise and hence can deviate from the Bayes estimate $f(x_i)$.

Theorem 1.

For a conditional distribution estimate obtained using unbiased NNK interpolation given training data $D$, the excess mean square risk is given by

(9)

under the following assumptions:

  1. $\mu$ is the marginal distribution of $X$. Let $\hat{C}$ be the convex hull of the training data in the transformed kernel space.

  2. The conditional mean $f$ is Hölder smooth in kernel space.

  3. Similarly, the conditional variance $\mathrm{Var}[Y \mid X = x]$ satisfies a smoothness condition.

  4. Let $\hat{P}(x)$ denote the convex polytope around $x$ formed by the neighbors identified by NNK with non zero weights, and let $\delta$ denote the maximum diameter of the polytope formed with NNK neighbors for any data point in $\hat{C}$.

Remark 1.

Theorem 1 provides a non-asymptotic upper bound for the excess squared risk associated with unbiased NNK interpolation using a data dependent bound. The first term of the bound is associated with extrapolation, where the test data falls outside the interpolation area for the given training data, while the last term corresponds to label noise. Of interest are the second and third terms, which reflect the dependence of the interpolation on the size of each polytope defined for the test data and the associated smoothness of the labels over this region of interpolation. In particular, when all test samples are covered by smaller polytopes, the corresponding risk is closer to optimal. Note that the NNK approach leads to the polytope having the smallest diameter (or volume) for the number of points ($\hat{k}$) selected from the set of $k$ neighbors; from the theorem, this corresponds to a better risk bound. The bound associated with simplicial interpolation is a special case, where each simplex enclosing the data point has a fixed $d+1$ vertices, corresponding to a $(d+1)$-sized polytope. Thus, in our approach the number of points forming the polytope is variable (dependent on the local data topology), while in the simplicial case it is fixed and depends on the dimension of the space. Though the latter bound may seem better (the excess risk is inversely related to the number of polytope points), the diameter of the simplex increases with the dimension, making the excess risk possibly sub-optimal.

Corollary 1.1.

Under the additional assumption that the support belongs to a simple polytope, the excess mean square risk converges asymptotically as

(10)
Remark 2.

The asymptotic risk of the proposed NNK interpolation method is bounded, like that of the 1-nearest neighbor method in the regression setting, by twice the Bayes risk. The rate of convergence of the proposed method depends on the convergence of the kernel functions centered at the data points.

We now extend our analysis to classification using the plug-in NNK classifier $\hat{h}(x) = \mathbb{1}\{\hat{y}(x) > 1/2\}$, using the relationship between classification and regression risk [6].

Corollary 1.2.

A plug-in NNK classifier under the assumptions of Corollary 1.1 has excess classification risk bounded as

(11)
Remark 3.

The classification bound presented here makes no assumptions on the margin associated with the classification boundary and is thus only a weak bound. The bound can be improved exponentially, as in [3], when additional assumptions such as the h-hard margin condition [31] are made.

III-B Leave one out stability

The leave one out (LOO) procedure (also known as the deleted estimate or U-method) is an important statistical measure with a long history in machine learning [21]. Unlike the empirical error, it is almost unbiased [30] and has often been used for model (hyperparameter) selection. Formally, it is given by

$$R_{\mathrm{loo}} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{E}\big(\hat{y}_{D_{\setminus i}}(x_i),\, y_i\big), \qquad (12)$$

where the NNK interpolation estimator $\hat{y}_{D_{\setminus i}}$ in the summation for $x_i$ is based on all training points except $(x_i, y_i)$. We focus our attention on LOO in the context of model stability and generalization as defined in [21, 17]. A system is stable when small perturbations (LOO) to the input data do not affect its output predictions, i.e., the estimator is stable when

$$\hat{y}_{D}(x) \approx \hat{y}_{D_{\setminus i}}(x). \qquad (13)$$
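A direct, if naive, way to compute the LOO estimate (12) for NNK is sketched below, reusing `nnk_interpolate` from the sketch in Section III (so this block assumes that sketch); it amounts to $N$ NNK solves and is meant only to illustrate that no retraining of the network is involved.

```python
import numpy as np

def loo_nnk_error(Z_train, Y_train_onehot, k=50):
    """Fraction of training points misclassified by NNK when held out in turn (eq. 12)."""
    N = Z_train.shape[0]
    errors = 0
    for i in range(N):
        keep = np.arange(N) != i             # leave point i out
        y_hat = nnk_interpolate(Z_train[i], Z_train[keep], Y_train_onehot[keep], k=k)
        errors += int(np.argmax(y_hat) != np.argmax(Y_train_onehot[i]))
    return errors / N
```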

Theoretical results by Rogers, Devroye and Wagner [38, 19, 18] on the generalization of $k$-nearest neighbor methods using LOO performance are very relevant to our proposed NNK algorithm. The number of neighbors in our method depends on the relative positions of the points, and hence replaces the fixed $k$ from their results by its expectation.

Theorem 2.

The leave one out performance of the unbiased NNK classifier, given $\gamma$, the maximum number of distinct points that can share the same nearest neighbor, is bounded as

Remark 4.

The NNK classifier weighs its neighbors based on RKHS interpolation but obtains the initial set of neighbors based on the input embedding space. This means that the value of $\gamma$ in the NNK setting depends on the dimension of the space in which the data is embedded, and not on the possibly infinite dimension of the RKHS. The above bound is difficult to compute in practice due to $\gamma$, but bounds for this measure exist in the convex covering literature [37, 11]. The theorem allows us to relate the stability of a model, via its LOO error, to generalization. Unlike a bound based on a fixed hyperparameter $k$, the bound presented here is training data dependent, due to the data dependent selection of neighbors.

More practically, to characterize the smoothness of the classification surface, we introduce the variation or spread of the LOO interpolation score of the training dataset as

(14)

where $\hat{k}_i$ is the number of non zero weighted neighbors identified by NNK and $\hat{y}$ is the unbiased NNK interpolation estimate of equation (8). A smooth interpolation region will have a variation close to zero, while a spread close to one corresponds to a noisy classification region.

IV Experiments

In this section, we present an experimental evaluation of DeepNNK for model selection, robustness and interpretability of neural networks. We focus on experiments with the CIFAR-10 dataset to validate our analysis and intuitions on generalization and interpretability.

(a) Underparameterized
(b) Regularized
(c) Overfit
Fig. 2: Misclassification error using a fully connected softmax classifier model and interpolating classifiers (weighted KNN, NNK) for different values of the parameter $k$, at each training epoch on CIFAR-10. Each column shows the training data (top) and test data (bottom) performance for one of three model settings. NNK classification consistently performs as well as the actual model, with classification error decreasing slightly as $k$ increases. On the contrary, the weighted KNN error increases with increasing $k$, showing robustness issues. The gap in classification error between the DNN model and the leave one out DeepNNK model on training data is suggestive of underfitting in (a) and overfitting in (c). We claim that a good model is one where the performance of the model agrees with the local NNK model.

We consider a simple 7 layer network, comprising 4 convolution layers with ReLU activations, 2 max-pool layers and 1 fully connected softmax layer, to demonstrate model selection. We evaluate the test performance and stability of the proposed NNK classifier and compare it to the weighted KNN (wiNN) approach for different values of $k$, and to a 5-fold cross validated linear SVM (as with the neighborhood methods, the last layer is replaced and trained at each evaluation, using a LIBLINEAR SVM [22] with minimal regularization and default library settings for the other parameters), for three different network settings:

  • Regularized model training: We use 32 depth channels for each convolution layer, with dropout (keep probability 0.9) at each convolution layer. The data is augmented with random horizontal flips.

  • Underparametrized model: We keep the same model structure and regularization as in the regularized model but reduce the number of depth channels to 16, equivalently reducing the number of parameters of the model by half.

  • Overfit model: To simulate overfitting, we remove the data augmentation and dropout regularization of the regularized model while training for the same number of epochs.

Fig. 3: Histogram (normalized) of the leave one out interpolation score after 20 epochs on CIFAR-10. While the network performance on the training dataset differs considerably in each setting, the change in the interpolation (classification) landscape associated with the input data is minimal, which suggests a small difference in the generalization of the models. The spread is shifted more towards zero for the regularized model, indicative of a smoother classification surface.

Figure 2 shows the difference in performance between our method and weighted KNN (wiNN): while the proposed DeepNNK method improves marginally for larger values of $k$, the wiNN approach degrades. This can be explained by the fact that NNK accommodates new neighbors only if they belong to a new direction in space that improves its interpolation, unlike its KNN counterparts, which simply interpolate with all $k$ neighbors. More importantly, we observe that while the NNK method performs on par with, if not better than, the original classifier with an SVM last layer, its LOO performance is a better indicator of generalization than the empirical model performance on training data. One can clearly identify a regularized model with better stability by observing the deviation between the training performance and the LOO estimate obtained with our proposed method. Note that the cross validated linear SVM model performed sub-optimally in all settings, which suggests that it is unable to capture the complexity of the input data or the difference in generalization across models. The choice of the better model is reinforced in Figure 3, where we observe that the histogram of the interpolation spread for the regularized model is shifted more towards zero relative to the under-parameterized and overfit models. The shift is minimal, which is expected since the difference in test error between the models is small as well.

Fig. 4: Two test examples (first image in each set) with identified NNK neighbors from CIFAR-10. We show the assigned and predicted label for each test sample, and the assigned label and NNK weight for the neighboring (and influential) training instances. Though the correct label was identified for each test sample, one might want to question such duplicates in the dataset for downstream applications.

We next present a few interpretability results, showing our framework's ability to identify the training instances that are influential in a prediction. The neighbors selected by DeepNNK from the training data for interpolation can be used as examples to explain the neural network's decision. This interpretability can be crucial in problems with transparency requirements, by allowing an observer to interpret the region around a test representation as evidence.

In Figure 4, we show examples from the training dataset that are responsible for a prediction made by the simple regularized model defined previously. Machine models and the datasets used for their training often contain biases, such as repeated instances with small perturbations for class balance, which are often undesirable in applications where fairness is important. The DeepNNK framework can help understand and eliminate sources of bias by allowing practitioners to identify the limitations of their current system in a semi supervised fashion. Figure 5 shows another application of NNK, where the fragile nature of a model on certain training images is brought to light using the interpolation spread of equation (14). These experiments show the potential of the DeepNNK framework as a debugging tool in deep learning.

Fig. 5: Two training set examples (first image in each set) observed to have maximum discrepancy in the LOO NNK interpolation score, together with their respective neighbors. We show the assigned and predicted label for the image being classified, and the assigned label and NNK weight for the neighbors. These instances exemplify the possible brittleness of the classification model and can better inform a user about the limits of the model they are working with.

Finally, we present an experimental analysis of generative and adversarial images from the perspective of NNK interpolation. We study these methods using our DeepNNK framework applied to a Wide-ResNet-28-10 [45] architecture trained with AutoAugment [14]. (DeepNNK achieves test accuracy on CIFAR-10 similar to that of the original network.)

Fig. 6: Histogram (normalized) of the number of NNK neighbors for (a) generated images [41] and (b) black box adversarial images [13], compared to actual CIFAR-10 images. Generated and adversarial images on average have fewer neighbors than real images, suggestive of the fact that these examples often fall in interpolating regions spanned by few training images. An adversarial image is born when these areas of interpolation belong to unstable regions of the classification surface.
Fig. 7: Selected black box adversarial examples (first image in each set) and their NNK neighbors from the CIFAR-10 training dataset. Though the changes in the input image are imperceptible to the human eye, one can better characterize a prediction using NNK by observing the interpolation region of the test instance.

Generative and adversarial examples leverage interpolation regions where a model (the discriminator in the case of generative images, or the classifier in the case of black box attacks) is influenced by a small number of neighboring points. This is made evident in Figure 6, where we see that the number of neighbors for generative and adversarial images is on average smaller than for real images. We conjecture that this is a property of interpolation: realistic images are obtained in compact interpolation neighborhoods, while perturbations along extrapolating, mislabeled sample directions produce adversarial images. Though an adversarial perturbation in the input image space is visually indistinguishable, the change in the embedding of the adversarial image in the interpolation space is significantly larger, in some cases, as in Figure 7, belonging to regions completely different from its class.
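The diagnostic behind Figure 6 reduces to counting the NNK support size per query. A sketch is given below, reusing `cosine_kernel` from the sketch in Section III; the thresholding tolerance is our own choice.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import nnls

def nnk_support_size(z, Z_train, k=50, tol=1e-8):
    """Number of training points receiving non zero NNK weight for query z."""
    sims = cosine_kernel(z[None, :], Z_train)[0]
    S = np.argsort(-sims)[:k]
    K_SS = cosine_kernel(Z_train[S], Z_train[S])
    A = np.real(sqrtm(K_SS + tol * np.eye(len(S))))
    theta, _ = nnls(A, np.linalg.solve(A, sims[S]))
    return int(np.count_nonzero(theta > tol))
```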

V Discussion and Future Work

We discussed various ideas, theoretical and practical, ranging from model interpretability and generalization to adversarial and generative examples. Underlying each of these applications is a single common tool: a local polytope interpolation whose neighborhood support is determined automatically and depends on the input data. DeepNNK provides a way to incorporate recent theoretical work on interpolation and leads to a better understanding of deep learning models by tracing their predictions back to the input data on which they were trained. We hope these attempts help bring neural networks to more real world scenarios and motivate further study of methods for diagnosing machine models through the lens of their training data.

We conclude with a few open thoughts and questions.

  • Leave one out is a particular instance of the more general problem of how a learning system responds to perturbations of its parameters and data. We believe other kinds of perturbations could help better understand neural networks, statistically as well as numerically.

  • The error in the data interpolation of equation (7) can be viewed as arising from data noise or, alternatively, from the absence of examples in some directions (extrapolation). In either scenario, this error can be used to characterize a notion of distance between the data being interpolated and the data available for interpolation. We believe such a measure could help identify dataset shifts in an unsupervised manner, with possible applications in domain adaptation and transfer learning.

References

  • [1] U. Anders and O. Korn (1999) Model selection in neural networks. Neural networks 12 (2), pp. 309–323. Cited by: §I, §I.
  • [2] N. Aronszajn (1950) Theory of reproducing kernels. Trans. of the American mathematical society 68 (3), pp. 337–404. Cited by: §II-B.
  • [3] M. Belkin, D. J. Hsu, and P. Mitra (2018) Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In Advances in Neural Inf. Process. Syst., pp. 2300–2311. Cited by: §I, §I, §II-C, §III-A, §V-B, Remark 3.
  • [4] M. Belkin, A. Rakhlin, and A. Tsybakov (2018) Does data interpolation contradict statistical optimality?. arXiv preprint arXiv:1806.09471. Cited by: §I.
  • [5] H. Bernau (1990) Active constraint strategies in optimization. In Geophysical Data Inversion Methods and Applications, pp. 15–31. Cited by: footnote 4.
  • [6] G. Biau and L. Devroye (2015) Lectures on the nearest neighbor method. Springer. Cited by: §I, §III-A.
  • [7] A. Blum and M. Hardt (2015) The ladder: a reliable leaderboard for machine learning competitions. In Int. Conf. on Mach. Learn., pp. 1006–1014. Cited by: §I.
  • [8] M. Bontonou, C. Lassance, G. B. Hacene, V. Gripon, J. Tang, and A. Ortega (2019) Introducing graph smoothness loss for training deep learning architectures. arXiv preprint arXiv:1905.00301. Cited by: §I.
  • [9] O. Bousquet and A. Elisseeff (2002) Stability and generalization. J. of Mach. Learn. research 2 (Mar), pp. 499–526. Cited by: §I.
  • [10] D. Castelvecchi (2016) Can we open the black box of ai?. Nature News 538 (7623), pp. 20. Cited by: §I.
  • [11] L. Chen (2005) New analysis of the sphere covering problems and optimal polytope approximation of convex bodies. J. of Approximation Theory 133 (1), pp. 134–145. Cited by: Remark 4.
  • [12] T. Cover and P. Hart (1967) Nearest neighbor pattern classification. IEEE Trans. on Inf. theory 13 (1), pp. 21–27. Cited by: §III, §V-C.
  • [13] F. Croce and M. Hein (2020) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. arXiv preprint arXiv:2003.01690. Cited by: Fig. 6.
  • [14] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proc. of the IEEE Conf. on computer vision and pattern recognition, pp. 113–123. Cited by: §IV.
  • [15] J. De Fauw, J. R. Ledsam, B. Romera-Paredes, S. Nikolov, N. Tomasev, S. Blackwell, H. Askham, X. Glorot, B. O’Donoghue, D. Visentin, et al. (2018) Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature medicine 24 (9), pp. 1342–1350. Cited by: §I.
  • [16] L. Devroye, L. Györfi, and A. Krzyżak (1998) The hilbert kernel regression estimate. J. of Multivariate Analysis 65 (2), pp. 209–227. Cited by: §I.
  • [17] L. Devroye, L. Györfi, and G. Lugosi (2013) A probabilistic theory of pattern recognition. Vol. 31, Springer Science & Business Media. Cited by: §III-B.
  • [18] L. Devroye and T. Wagner (1979) Distribution-free inequalities for the deleted and holdout error estimates. IEEE Trans. on Inf. Theory 25 (2), pp. 202–207. Cited by: §III-B, §V-E.
  • [19] L. Devroye and T. Wagner (1979) Distribution-free performance bounds for potential function rules. IEEE Trans. on Inf. Theory 25 (5), pp. 601–604. Cited by: §I, §III-B.
  • [20] F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: §I.
  • [21] A. Elisseeff, M. Pontil, et al. (2003) Leave-one-out error and stability of learning algorithms with applications. NATO science series sub series iii computer and systems sciences 190, pp. 111–130. Cited by: §III-B.
  • [22] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin (2008) LIBLINEAR: a library for large linear classification. J. of Mach. Learn. research 9 (Aug), pp. 1871–1874. Cited by: footnote 2.
  • [23] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani (2019) Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560. Cited by: §I.
  • [24] T. Hastie, R. Tibshirani, and J. Friedman (2009) The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media. Cited by: §I.
  • [25] T. Hofmann, B. Schölkopf, and A. J. Smola (2008) Kernel methods in machine learning. The annals of statistics, pp. 1171–1220. Cited by: §II-B.
  • [26] B. Kim, R. Khanna, and O. O. Koyejo (2016) Examples are not enough, learn to criticize! criticism for interpretability. In Advances in Neural Inf. Process. Syst., pp. 2280–2288. Cited by: §I.
  • [27] P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730. Cited by: §I, §I, §I.
  • [28] T. Liang and A. Rakhlin (2018) Just interpolate: kernel” ridgeless” regression can generalize. arXiv preprint arXiv:1808.00387. Cited by: §I.
  • [29] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Inf. Process. Syst., pp. 4765–4774. Cited by: §I.
  • [30] A. Luntz (1969) On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica 3. Cited by: §III-B.
  • [31] P. Massart, É. Nédélec, et al. (2006) Risk bounds for statistical learning. The Annals of Statistics 34 (5), pp. 2326–2366. Cited by: Remark 3.
  • [32] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin (2006) Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics 25 (1-3), pp. 161–193. Cited by: §I.
  • [33] K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: §I.
  • [34] T. T. Nguyen, J. Idier, C. Soussen, and E. Djermoune (2019) Non-negative orthogonal greedy algorithms. IEEE Transactions on Signal Processing 67 (21), pp. 5643–5658. Cited by: §V-A.
  • [35] N. Papernot and P. McDaniel (2018) Deep k-nearest neighbors: towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765. Cited by: §I, §I.
  • [36] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar (2018) Do cifar-10 classifiers generalize to cifar-10?. arXiv preprint arXiv:1806.00451. Cited by: §I.
  • [37] C. Rogers (1963) Covering a sphere with spheres. Mathematika 10 (2), pp. 157–164. Cited by: Remark 4.
  • [38] W. H. Rogers and T. J. Wagner (1978) A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, pp. 506–514. Cited by: §III-B.
  • [39] S. Shekkizhar and A. Ortega (2019) Graph construction from data using non negative kernel regression (nnk graphs). arXiv preprint arXiv:1910.09383. Cited by: Lemma 1.
  • [40] S. Shekkizhar and A. Ortega (2020) Graph construction from data by non-negative kernel regression. In 2020 IEEE Int. Conf. on Acoustics, Speech and Signal Process. (ICASSP), pp. 3892–3896. Cited by: §I, §II-C.
  • [41] Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. In Advances in Neural Inf. Process. Syst., pp. 11895–11907. Cited by: Fig. 6.
  • [42] E. Wallace, S. Feng, and J. Boyd-Graber (2018) Interpreting neural networks with nearest neighbors. arXiv preprint arXiv:1809.02847. Cited by: §I, §I.
  • [43] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing (2016) Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378. Cited by: §II-B.
  • [44] B. Yu et al. (2013) Stability. Bernoulli 19 (4), pp. 1484–1500. Cited by: §I.
  • [45] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §IV.
  • [46] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §I.

Supplementary Material

V-A Proof of Proposition (1)

Lemma 1 ([39]).

The quadratic optimization problem of (7) satisfies the active constraint set property. (In constrained optimization problems, some constraints will be strongly binding, i.e., the solution to the optimization at these elements will be zero to satisfy the KKT condition of optimality. These constraints are referred to as active constraints, knowledge of which helps reduce the problem size, as one can focus on the inactive subset that requires optimization. Constraints that are active at a current feasible solution remain active in the optimal solution [5].) Given a partition $\{S, S^c\}$ of the $k$ neighbors, where $S$ is the inactive set and $S^c$ the active set, the solution $\theta_S = K_{S,S}^{-1} K_{S,i}$, $\theta_{S^c} = 0$ is the optimal solution provided $\theta_S > 0$ and the KKT conditions hold on $S^c$. Moreover, the set $S$ corresponds to the non zero support of the constrained problem if and only if $K_{S,S}$ is full rank and $\theta_S > 0$ [34]. Thus, the solution to (7) is obtained as

$$\theta_S = K_{S,S}^{-1} K_{S,i}, \qquad \theta_{S^c} = 0. \qquad (15)$$
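Under our assumed cosine kernel, the closed form (15) can be checked numerically against an NNLS solve of (7); the snippet below is a sanity check on synthetic embeddings, not part of the paper's derivation.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import nnls

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm neighbor embeddings
q = rng.normal(size=16); q /= np.linalg.norm(q)  # unit-norm query embedding
K = 0.5 * (1.0 + X @ X.T)                        # kernel matrix K_{S,S}
k_i = 0.5 * (1.0 + X @ q)                        # similarities K_{S,i}

A = np.real(sqrtm(K + 1e-10 * np.eye(8)))
theta, _ = nnls(A, np.linalg.solve(A, k_i))      # solves the QP in (7)
S = theta > 1e-6                                 # inactive (non zero) support
closed_form = np.linalg.solve(K[np.ix_(S, S)], k_i[S])
assert np.allclose(theta[S], closed_form, atol=1e-5)
```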
Proof of Proposition 1.

Let $\Phi_S$ correspond to the matrix containing the neighbors with non zero data interpolation weights, and let $y_S$ be the associated labels. The kernel space linear interpolation estimator is obtained by solving

(16)

Therefore, using the matrix inversion identity and equation (15) resulting from Lemma 1, the estimate at $x$ reduces to the form given in equation (6). ∎

V-B Proof of Theorem (1)

Proof.

The proof follows an argument similar to the simplicial interpolation bound of [3]. The expected excess mean squared risk can be partitioned based on disjoint sets as

(17)

(All expectations in this proof are conditioned on the training data; for brevity, we do not make this conditioning explicit in our statements.) For points outside the convex hull, NNK extrapolates labels and no guarantees can be made on the regression without further assumptions. This reduces the first term on the left of equation (17) to that of the theorem.

Let $\theta$ be the solution to the NNK interpolation objective (7), and let $\bar{\theta}$ denote the weight normalized values, $\bar{\theta}_j = \theta_j / \sum_l \theta_l$. The normalized weights follow a Dirichlet$(1, 1, \dots, 1)$ distribution, i.e., one with all concentration parameters equal to one.

(18)

where the first term corresponds to the Bayes estimator errors in the training data and the second is related to bias. By the smoothness assumption on $f$ we have

(19)

Since the noise terms and the interpolation weights are independent, we have

(20)

By Jensen's inequality and the bound in equation (19),

(21)

Under the independence assumption on the noise, the remaining term in equation (20) can be rewritten as

where we use the fact that $\bar{\theta}$ follows a Dirichlet distribution. Now, the smoothness assumption on the conditional variance allows us to bound

(22)
(23)

Combining with equation (21), the risk bound for points within the convex hull of the training data is obtained as

(24)

Equation (24), together with the reduction for points outside the convex hull obtained earlier, gives the excess risk bound and concludes the proof. ∎

V-C Proof of Corollary (1.1)

Proof.

The nearest neighbor convergence lemma of [12] states that, for an i.i.d. sequence of random variables $X_1, X_2, \dots, X_N$ in $\mathcal{X}$, the nearest neighbor of $x$ from the set $\{X_1, \dots, X_N\}$ converges to $x$ in probability as $N \to \infty$. Equivalently, this corresponds to convergence in the kernel representation of the data points. Thus, the solution to the NNK data interpolation objective reduces to 1-nearest neighbor interpolation, with a single neighbor receiving weight one. Now, under the assumption that the support belongs to a simple polytope, the first term on the right of equation (9) vanishes, i.e., the measure of the region outside the convex hull goes to zero. ∎

V-D Proof of Corollary (1.2)

Proof.

The excess classification risk associated with this classifier is related to the regression risk as

(25)

From Corollary 1.1, we have the asymptotic bound on the excess regression risk. By Jensen's inequality,

(26)

Combining with equation (25) gives the required risk bound. ∎

V-E Proof of Theorem (2)

Proof.

The proof is based on the $k$-nearest neighbor result from Theorem 1 in [18], which states that

(27)

As in [18], where the result is extended for the $k$-nearest neighbor rule, here it suffices to replace the fixed $k$ by the expected NNK neighborhood size, since each data point on average cannot be an NNK neighbor to more than $\mathbb{E}[\hat{k}]\,\gamma$ data points. ∎