Non parametric, polytope interpolation framework for use with deep learning models
Modern machine learning systems based on neural networks have shown great success in learning complex data patterns while being able to make good predictions on unseen data points. However, the limited interpretability of these systems hinders further progress and application to several domains in the real world. This predicament is exemplified by time consuming model selection and the difficulties faced in predictive explainability, especially in the presence of adversarial examples. In this paper, we take a step towards a better understanding of neural networks by introducing a local polytope interpolation method. The proposed Deep Non Negative Kernel regression (NNK) interpolation framework is non parametric, theoretically simple and geometrically intuitive. We demonstrate instance based explainability for deep learning models and develop a method to identify models with good generalization properties using leave one out estimation. Finally, we offer a rationale for adversarial and generative examples, which are inevitable under an interpolation view of machine learning.
The goal of any learning system is to identify a mapping from input data space to output classification or regression space based on a finite set of training data with a basic generalization requirement: Models trained to perform well on a given dataset (empirical performance) should perform well on future examples (expected performance), i.e., the gap between expected and empirical performance must be small.
Today, deep neural networks are at the core of several recent advances in machine learning.
An appropriate deep architecture is closely tied to the dataset on which it is trained and is selected with significant manual engineering by practitioners or by random search based on subjective heuristics. Approaches based on resubstitution (training) error, which is often near zero in deep learning systems, can be misleading, while held out (validation) metrics introduce possible selection bias, and the data they use might be more valuable if it could be used to train the model. Nevertheless, these methods have steadily improved state of the art metrics on several datasets even though only limited understanding of generalization is available, and generally it is not known whether a smaller model trained for fewer epochs could have achieved the same performance.
A model is typically said to suffer from overfitting when it performs poorly on test (validation) data while performing well on the training data. The conventional approach to avoid overfitting is to avoid training an over-parameterized model to zero loss, for example by penalizing the training process with methods such as weight regularization or early stopping [24, 33]. This perspective has been questioned by recent research, which has shown that a model with a number of parameters several orders of magnitude larger than the dataset size, and trained to zero loss, generalizes to new data as well as a constrained model. Thus, while conventional wisdom about interpolating estimators [24, 33] is that they can achieve zero training error but generally exhibit poor generalization, Belkin and others [3, 4] propose and theoretically study specific interpolation based methods, such as simplicial interpolation and kernel weighted and interpolated nearest neighbors (wiNN), that can achieve generalization with theoretical guarantees. Related work suggests that neural networks perform interpolation in a transformed space and that this could help explain their generalization performance. Though this view has spurred renewed interest in interpolating estimators [28, 23], there have been no studies of interpolation based classifiers integrated with a complete neural network. This is due in part to their complexity: working with simplices would be impractical if the dimension of the data space is high, as is the case for problems of interest where neural networks are used. In contrast, a simpler method such as wiNN does not have the same geometric properties as the simplex approach.
In this paper, we propose a practical and realizable interpolation framework based on local polytopes obtained using Non Negative Kernel regression (NNK) on neural network architectures. As shown in the simple setup of Figure 0(a), simplicial interpolation, even when feasible, constrains itself to a simplex structure (triangles in two dimensions) around each test query, which leads to an arbitrary choice of the containing simplex when data lies on one of the simplicial faces. Thus, in the example of Figure 0(a), only one of the triangles can be used, and only two out of the 4 points in the neighborhood contribute to the interpolation. This situation becomes increasingly common in high dimensions, worsening interpolation complexity. By relaxing the simplex constraint, one can better formulate the interpolation using generalized convex polytope structures, such as those obtained using NNK, that depend on the positions of the sampled training data in the classification space. While our proposed method uses k nearest neighbors (KNN) as a starting point, it differs from other KNN based approaches, such as wiNN schemes [16, 6, 3] and DkNN [35, 42]. In particular, these KNN based algorithms can be biased if data instances have different densities in different directions in space. Instead, as shown in Figure 0(b), NNK automatically selects the data points most influential to interpolation based on their relative positions, i.e., only those neighboring representations that provide new (orthogonal) information for data reconstruction are selected for functional interpolation. In summary, our proposed method combines some of the best features of existing methods, providing the geometric interpretation and performance guarantees of simplicial interpolation with much lower complexity, of an order comparable to KNN based schemes.
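The selection behavior described above can be illustrated numerically. The sketch below uses illustrative data and a Cholesky-plus-NNLS rewriting of the NNK objective of our own choosing (not necessarily the paper's implementation) to show NNK assigning zero weight to a neighbor that duplicates the direction of a closer one:

```python
import numpy as np
from scipy.optimize import nnls

def cosine_kernel(A, B):
    """Range-normalized cosine kernel with values in [0, 1]."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 0.5 * (1.0 + An @ Bn.T)

def nnk_weights(K_SS, K_Sq, reg=1e-10):
    """NNK weights for a query given its k neighbors.

    Solves min_{theta >= 0} 1/2 theta^T K_SS theta - K_Sq^T theta by
    rewriting it as a non-negative least squares problem through the
    Cholesky factor of the (slightly regularized) kernel sub-matrix.
    """
    n = K_SS.shape[0]
    L = np.linalg.cholesky(K_SS + reg * np.eye(n))
    b = np.linalg.solve(L, K_Sq)   # chosen so ||L^T theta - b||^2 matches the objective
    theta, _ = nnls(L.T, b)
    return theta

# Query and three candidate neighbors; x1 and x2 point in nearly the
# same direction, so they carry redundant information about q.
q = np.array([[1.0, 1.0]])
X = np.array([[1.0, 0.0],    # x1: redundant with x2 and slightly farther from q
              [0.9, 0.1],    # x2: nearly the same direction as x1
              [0.0, 1.0]])   # x3: orthogonal direction

theta = nnk_weights(cosine_kernel(X, X), cosine_kernel(X, q).ravel())
print(theta)  # x1 receives zero weight; x2 and x3 are kept
```

Plain KNN weighting would use all three points; NNK drops x1 because x2 already spans its direction relative to the query.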
To integrate our interpolation framework with a neural network, we replace the final classification layer, typically some type of support vector machine (SVM), with our NNK interpolator during evaluation at training and at test time, while relying on the loss obtained with the original SVM-like layer for backpropagation. This strategy of using a different classifier at the final layer is not uncommon in deep learning [27, 29, 8] and is motivated by the intuition that each layer of a neural network corresponds to an abstract transformation of the input data space catered to the machine learning task at hand. Note that, unlike the explicit parametric boundaries generally defined by an SVM-like final layer, local interpolation methods produce boundaries that are implicit, i.e., based on the relative positions of the training data in a transformed space. In other words, the proposed DeepNNK procedure allows us to characterize the network by the output classification space rather than relying on a global boundary defined on that space.
Equipped with these tools, we tackle model selection in neural networks from an interpolation perspective using data dependent stability: a model is stable for a training set if changing a single point in that set does not affect (or yields a very small change in) the output hypothesis [19, 44]. This definition is similar to, but different from, algorithmic stability obtained using jackknifing [9, 32] and related statistical procedures such as cross validation. While the latter involves repeatedly using the entire training dataset but one point to compute many estimators that are combined at the end, the former is concerned with the output estimate at a point not used for its prediction, and is the focus of our work. Direct evaluation of algorithmic stability in the context of deep learning is impractical for two reasons: first, the increased runtime complexity associated with training the algorithm on different sets; second, even if computationally feasible, the assessment within each setting is obscured by randomness in training, for example in weight initialization and batch sampling, which requires repeated evaluation to reduce variance in the performance. Unlike these methods [1, 9], by focusing on stability to input perturbations at interpolation, our method achieves a practical methodology for model selection that does not involve repetitive training.
Another challenging issue that prevents the application of deep neural networks in sensitive domains, such as medicine and defense, is the absence of explanations for the predictions obtained. Explainability or interpretability can be defined as the degree to which a human can understand or rationalize a prediction obtained from a learning algorithm. A model is more interpretable than another if its decisions are easier for a human to comprehend, e.g., for a health care technician looking at a flagged retinal scan, than the decisions of the other model. Example based explanations can alleviate the problem of interpretability by allowing humans to understand complex predictions through analogical reasoning with a small set of influential training instances [26, 27].
Our proposed DeepNNK classifier is a neighborhood based approach that makes very few modeling assumptions about the data. Each prediction of the NNK classifier comes with a set of training data points (the neighbors selected by NNK) that interpolate to produce the classification or regression output. In contrast to earlier methods such as DkNN [35, 42], which rely on hyperparameters such as the number of neighbors k that directly impact explainability and confidence characterizations, our approach adapts to the local data manifold by identifying a stable set of training instances that most influence an estimate. Further, the gains in interpretability using our framework do not incur a penalty in performance: unlike earlier methods [27, 35], there is no loss in overall performance from using an interpolative last layer, and in some cases there are gains, compared to the performance achieved with standard SVM-like last layer classifiers. Indeed, we demonstrate performance improvements over standard architectures with SVM-like last layers in cases where there is overfitting.
Finally, this paper presents an empirical account of generative and adversarial examples, which have gained growing attention in modern machine learning. We show that these instances fall in distinct interpolation regions surrounded by fewer NNK neighbors on average than real images.
The goal of machine learning is to find a function $h: \mathcal{X} \to \mathcal{Y}$ mapping the input space $\mathcal{X}$ to the output space $\mathcal{Y}$. Assume $\mu$ to be the marginal distribution of $x$ with its support denoted $\Omega$. Let $\eta(x)$ denote the conditional mean $\mathbb{E}[y \mid x]$. The risk associated with a predictor $\hat{h}$ in a regression setting is given by $R(\hat{h}) = \mathbb{E}[(y - \hat{h}(x))^2]$. The Bayes estimator $h^*(x) = \eta(x)$, obtained as the expected value of the conditional distribution, is the best predictor, i.e., $R(h^*) \le R(\hat{h})$ for any $\hat{h}$. Unfortunately, the joint distribution is not known a priori, and thus a good estimator must be designed from labelled samples drawn from it in the form of training data $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$. Further, assume each $y_i$ is corrupted by i.i.d. noise and hence can deviate from the Bayes estimate $\eta(x_i)$. For a binary classification problem, the domain of $y$ is reduced to $\{0, 1\}$, with the plug-in Bayes classifier defined as $f^*(x) = \mathbb{1}\{\eta(x) \ge 1/2\}$. The risk associated with a classifier $\hat{f}$ is defined as $R(\hat{f}) = \Pr(\hat{f}(x) \ne y)$ and satisfies $R(\hat{f}) \ge R(f^*)$. The excess risks $R(\hat{h}) - R(h^*)$ and $R(\hat{f}) - R(f^*)$ in the regression and classification settings are the subject of our work. Note that the generalization risk defined above depends on the data distribution, while in practice one uses the empirical error, defined as $\hat{R}(\hat{h}) = \frac{1}{n}\sum_{i=1}^{n} \ell(\hat{h}(x_i), y_i)$, where $\ell$ is the error associated with the regression or classification setting. We denote by $\mathcal{D}_{-i}$ the training set obtained by removing the point $(x_i, y_i)$ from $\mathcal{D}$.
Unlike methods that operate directly on input features, kernel based methods observe similarities in a non linearly transformed feature space referred to as the Reproducing Kernel Hilbert Space (RKHS). One of the key ingredients of kernel machines is the kernel trick: inner products in the feature space can be efficiently computed using kernel functions. Due to the non linear nature of the data mapping, linear operations in the RKHS correspond to non linear operations in the input data space.
If $k(x, x')$ is a continuous symmetric kernel of a positive integral operator on a space of functions, then by Mercer's theorem there exists a space $\mathcal{H}$ and a mapping $\phi: \mathcal{X} \to \mathcal{H}$ such that
$$k(x, x') = \langle \phi(x), \phi(x') \rangle,$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product in $\mathcal{H}$.
Kernels satisfying the above definition are known as Mercer kernels and have a wide range of applications in machine learning. In this work, we center our experiments around the range normalized cosine kernel, defined as
$$k(x_i, x_j) = \frac{1}{2}\left(1 + \frac{\langle x_i, x_j \rangle}{\|x_i\|\,\|x_j\|}\right),$$
though our theoretical statements and claims make no assumption on the type of kernel other than that it be positive with range $[0, 1]$. Similar to previous work, we combine kernel definitions with neural networks to incorporate the expressive power of neural networks. Given a kernel function, we transform the input data using the non linear mapping $\hat{\phi}$ corresponding to a deep neural network (DNN) parameterized by its weights.
Our normalized cosine kernel of equation (1) is then rewritten in terms of the DNN mapping as
$$k(x_i, x_j) = \frac{1}{2}\left(1 + \frac{\langle \hat{\phi}(x_i), \hat{\phi}(x_j) \rangle}{\|\hat{\phi}(x_i)\|\,\|\hat{\phi}(x_j)\|}\right).$$
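As a concrete sketch, the range normalized cosine kernel can be computed on top of a network embedding. The toy one-layer "network" below is only a stand-in for the trained DNN mapping (its weights are random and illustrative); the kernel computation itself follows the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((16, 8)), rng.standard_normal((8, 4))

def dnn_embed(X):
    """Stand-in for the network's penultimate-layer mapping phi_hat."""
    return np.maximum(X @ W1, 0.0) @ W2   # one ReLU layer, then linear

def nnk_kernel(X, Y):
    """Range-normalized cosine kernel on DNN embeddings, values in [0, 1]."""
    A, B = dnn_embed(X), dnn_embed(Y)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 0.5 * (1.0 + A @ B.T)

X = rng.standard_normal((5, 16))
K = nnk_kernel(X, X)
# Every entry lies in [0, 1] and each point is maximally similar to itself,
# so the diagonal of K is exactly 1.
```

The normalization maps cosine similarity from [-1, 1] to [0, 1], which is the range the theoretical statements assume.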
The starting point for our interpolation based classifier is our previous work on graph construction using non negative kernel regression (NNK). NNK formulates graph construction as a signal representation problem, where each data point is approximated by a weighted sum of functions from a dictionary formed by its neighbors. The NNK objective for graph construction can be written as
$$\min_{\theta \ge 0} \; \|\phi(x_q) - \Phi_S\, \theta\|^2,$$
where $\phi(x_q)$ is a lifting of $x_q$ from observation to similarity space and $\Phi_S$ contains the transformed neighbors.
Unlike nearest neighbor approaches, which select neighbors having the largest inner products and can be viewed as a thresholding based representation, NNK is an improved basis selection procedure in kernel space leading to a stable and robust representation. Geometrically, NNK can be characterized in the form of the kernel ratio interval (KRI), as shown in Figures 0(b) and 0(c). The KRI theorem states that, for any positive definite kernel with range in $[0, 1]$ (e.g., the cosine kernel (3)), the necessary and sufficient condition for two data points $x_i$ and $x_j$ to both be NNK neighbors of a query $x_q$ is
$$K_{i,j} < \frac{K_{q,i}}{K_{q,j}} < \frac{1}{K_{i,j}},$$
where $K_{a,b}$ denotes the kernel similarity between $x_a$ and $x_b$.
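The KRI condition can be checked numerically against the NNK optimization itself. In the sketch below (data, kernel and solver choices are illustrative), the KRI predicate agrees with which of two candidate neighbors receive positive NNK weight:

```python
import numpy as np
from scipy.optimize import nnls

def cosine_kernel(a, b):
    """Range-normalized cosine kernel between two vectors."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    return 0.5 * (1.0 + float(a @ b))

def both_nnk_neighbors(q, xi, xj):
    """KRI check: xi and xj can both receive positive NNK weight
    iff K_ij < K_qi / K_qj < 1 / K_ij."""
    K_ij = cosine_kernel(xi, xj)
    r = cosine_kernel(q, xi) / cosine_kernel(q, xj)
    return K_ij < r < 1.0 / K_ij

def nnk_weights_pair(q, xi, xj, reg=1e-10):
    """Two-neighbor NNK weights via non-negative least squares."""
    K = np.array([[1.0, cosine_kernel(xi, xj)],
                  [cosine_kernel(xi, xj), 1.0]])
    k = np.array([cosine_kernel(q, xi), cosine_kernel(q, xj)])
    L = np.linalg.cholesky(K + reg * np.eye(2))
    theta, _ = nnls(L.T, np.linalg.solve(L, k))
    return theta

q = np.array([1.0, 1.0])
x1, x2, x3 = np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])

# x1 and x3 are well separated: KRI holds and both get positive weight.
print(both_nnk_neighbors(q, x1, x3), nnk_weights_pair(q, x1, x3))
# x1 and x2 are nearly collinear: KRI fails and one weight is zero.
print(both_nnk_neighbors(q, x1, x2), nnk_weights_pair(q, x1, x2))
```

In both cases the closed-form KRI test and the constrained optimization reach the same conclusion, which is the content of the theorem.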
Inductive application of the KRI produces a closed decision boundary around the point to be approximated, with the identified neighbors forming a convex polytope around it. Similar to simplicial interpolation, the local geometry of our NNK classifier can be leveraged to obtain theoretical performance bounds, as discussed next.
In this section, we propose and analyze a polytope interpolation scheme based on local neighbors that asymptotically approaches the 1-nearest neighbor algorithm (all proofs of the theoretical statements in this section are included in the supplementary material). Like 1-nearest neighbor, the proposed method is not statistically consistent in the presence of label noise, but, unlike the former, its risk can be studied in the non-asymptotic case with data dependent bounds under mild smoothness assumptions.
Given the $k$ nearest neighbors $S$ of a sample $x$, the following NNK estimate at $x$ is a valid interpolation function:
$$\hat{h}(x) = \sum_{i \in S} \theta_i\, y_i,$$
where $\theta$ are the non zero weights obtained from the minimization of equation (4), that is,
$$\theta = \arg\min_{\theta \ge 0} \; \frac{1}{2}\,\theta^\top K_{S,S}\,\theta - K_{S,x}^\top \theta,$$
where $K_{S,S}$ corresponds to the kernel space representation of the $k$ nearest neighbors and $K_{S,x}$ denotes their kernel similarities with respect to $x$.
The interpolator from Proposition 1 is biased and can be bias-corrected by normalizing the interpolation weights. Thus, the unbiased NNK interpolation estimate is obtained as
$$\hat{h}(x) = \frac{\sum_{i \in S} \theta_i\, y_i}{\sum_{i \in S} \theta_i}.$$
In other words, NNK starts with a crude approximation of the neighborhood in the form of nearest neighbors but, instead of directly using these points as sources of interpolation, optimizes and reweights the selection (most weights turn out to be zero) using equation (7) to obtain a stable set of neighbors.
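Putting the pieces together, the following is a hedged sketch of unbiased NNK interpolation for classification (k-NN prefilter, non negative reweighting, then weight-normalized label interpolation) on synthetic two-cluster data; the data, names and solver details are illustrative rather than the paper's implementation:

```python
import numpy as np
from scipy.optimize import nnls

def cosine_kernel(A, B):
    """Range-normalized cosine kernel with values in [0, 1]."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 0.5 * (1.0 + A @ B.T)

def nnk_predict(X_train, Y_onehot, query, k=8, reg=1e-10):
    """Unbiased NNK interpolation: k-NN prefilter, non negative kernel
    regression reweighting, then weight-normalized label interpolation."""
    sims = cosine_kernel(X_train, query[None, :]).ravel()
    S = np.argsort(-sims)[:k]                        # crude k-NN neighborhood
    K_SS = cosine_kernel(X_train[S], X_train[S])
    L = np.linalg.cholesky(K_SS + reg * np.eye(len(S)))
    theta, _ = nnls(L.T, np.linalg.solve(L, sims[S]))
    theta = theta / theta.sum()                      # normalization removes the bias
    return theta @ Y_onehot[S]                       # class-probability estimate

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2.0, 0.0], 0.3, size=(20, 2)),   # class 0 cluster
               rng.normal([0.0, 2.0], 0.3, size=(20, 2))])  # class 1 cluster
Y = np.repeat(np.eye(2), 20, axis=0)
p = nnk_predict(X, Y, np.array([0.1, 2.0]))
print(p.argmax())  # query points in the class-1 direction, so prints 1
```

The indices with non zero weight in `theta` are exactly the training instances one would show a user as the example based explanation for this prediction.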
We present a theoretical analysis based on the simplicial interpolation analysis of Belkin et al., adapted to the proposed NNK interpolation. We first study the NNK framework in a regression setting and then adapt the results for classification. Let $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ be the training data made available to NNK. Further, assume each $y_i$ is corrupted by independent noise and hence can deviate from the Bayes estimate $\eta(x_i)$.
For a conditional distribution obtained using unbiased NNK interpolation given training data $\mathcal{D}$, the excess mean squared risk admits a data dependent upper bound under the following assumptions:
$\mu$ is the marginal distribution of $x$, and $\mathcal{C}$ is the convex hull of the training data in the transformed kernel space.
The conditional distribution $\eta$ is Hölder smooth in kernel space.
Similarly, the conditional variance satisfies a smoothness condition.
Let $C(x)$ denote the convex polytope around $x$ formed by the neighbors identified by NNK with non zero weights, and let $\delta$ denote the maximum diameter of the polytopes formed with NNK neighbors over the data in $\mathcal{D}$.
Theorem 1 provides a non-asymptotic, data dependent upper bound for the excess squared risk associated with unbiased NNK interpolation. The first term in the bound is associated with extrapolation, where the test data falls outside the interpolation area for the given training data, while the last term corresponds to label noise. Of interest are the second and third terms, which reflect the dependence of the interpolation on the size of each polytope defined for the test data and the associated smoothness of the labels over this region of interpolation. In particular, when all test samples are covered by smaller polytopes, the corresponding risk is closer to optimal. Note that the NNK approach leads to the polytope with the smallest diameter, or volume, for the number of points selected from a set of neighbors; by the theorem, this corresponds to a better risk bound. The bound associated with simplicial interpolation is a special case, where each simplex enclosing the data point has a fixed number of vertices, namely a polytope of size $d+1$ in a $d$-dimensional space. Thus, in our approach the number of points forming the polytope is variable (dependent on the local data topology), while in the simplicial case it is fixed and depends on the dimension of the space. Though the latter bound may seem better (the excess risk is inversely related to the number of polytope points), the diameter of the simplex increases with dimension, making the excess risk possibly sub optimal.
Under the additional assumption that the query belongs to a simple polytope formed by the training data, the excess mean squared risk converges asymptotically.
The asymptotic risk of the proposed NNK interpolation method is, like that of the 1-nearest neighbor method in the regression setting, bounded by twice the Bayes risk. The rate of convergence of the proposed method depends on the convergence of the kernel functions centered at the data points.
We now extend our analysis to classification using the plug-in NNK classifier, relying on the known relationship between classification and regression risks.
A plug-in NNK classifier under the assumptions of Corollary 1.1 has its excess classification risk bounded in terms of the excess regression risk above.
The leave one out (LOO) procedure (also known as the deleted estimate or U-method) is an important statistical measure with a long history in machine learning. Unlike the empirical error, it is almost unbiased and has often been used for model (hyperparameter) selection. Formally, it is represented by
$$R_{\mathrm{loo}} = \frac{1}{n} \sum_{i=1}^{n} \ell\big(\hat{h}_{-i}(x_i), y_i\big),$$
where the NNK interpolation estimator $\hat{h}_{-i}$ in the summation is based on all training points except $(x_i, y_i)$. We focus our attention on LOO in the context of model stability and generalization as defined in [21, 17]. A system is stable when small perturbations (such as leaving one point out) of the input data do not affect its output predictions, i.e., when the leave one out estimate remains close to the empirical error.
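A leave one out estimate with the NNK classifier requires no retraining: each training point is simply excluded from its own neighborhood. A minimal sketch on synthetic data follows (the embeddings stand in for penultimate-layer DNN features; data and names are illustrative):

```python
import numpy as np
from scipy.optimize import nnls

def cosine_kernel(A, B):
    """Range-normalized cosine kernel with values in [0, 1]."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 0.5 * (1.0 + A @ B.T)

def nnk_predict(X, Y, q, k=8, reg=1e-10):
    """Unbiased NNK class-probability estimate for a query point."""
    sims = cosine_kernel(X, q[None, :]).ravel()
    S = np.argsort(-sims)[:k]
    K_SS = cosine_kernel(X[S], X[S])
    L = np.linalg.cholesky(K_SS + reg * np.eye(len(S)))
    theta, _ = nnls(L.T, np.linalg.solve(L, sims[S]))
    return (theta / theta.sum()) @ Y[S]

def loo_error(X, Y, k=8):
    """Leave one out 0/1 error: each point is classified from all others."""
    errs = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i       # drop point i from its own fit
        p = nnk_predict(X[mask], Y[mask], X[i], k=k)
        errs += int(p.argmax() != Y[i].argmax())
    return errs / len(X)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([2.0, 0.0], 0.3, size=(20, 2)),
               rng.normal([0.0, 2.0], 0.3, size=(20, 2))])
Y = np.repeat(np.eye(2), 20, axis=0)
print(loo_error(X, Y))  # expect a small LOO error on well-separated clusters
```

Because only the last layer is interpolative, this LOO sweep costs one neighborhood solve per training point rather than one network retraining per point.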
Theoretical results by Rogers, Devroye and Wagner [38, 19, 18] on the generalization of $k$-nearest neighbor methods using LOO performance are very relevant to our proposed NNK algorithm. The number of neighbors in our method depends on the relative positions of the points, and hence the fixed $k$ from their results is replaced by its expectation.
The leave one out performance of the unbiased NNK classifier is bounded in terms of the maximum number of distinct points that can share the same nearest neighbor.
The NNK classifier weighs its neighbors based on RKHS interpolation but obtains the initial set of neighbors in the input embedding space. This means that the maximum number of points sharing the same nearest neighbor depends on the dimension of the space in which the data is embedded, not on the possibly infinite dimension of the RKHS. The above bound is difficult to compute in practice because of this quantity, but bounds for it exist in the convex covering literature [37, 11]. The theorem allows us to relate the stability of a model, measured by its LOO error, to its generalization. Unlike a bound based on a fixed hyperparameter $k$, the bound presented here is training data dependent owing to the data dependent selection of neighbors.
More practically, to characterize the smoothness of the classification surface, we introduce the variation, or spread, of the LOO interpolation score over the training dataset, defined in terms of the number of non zero weighted neighbors identified by NNK and the unbiased NNK interpolation estimate of equation (8). A smooth interpolation region will have a variation close to zero, while a spread close to one corresponds to a noisy classification region.
In this section, we present an experimental evaluation of DeepNNK for model selection, robustness and interpretability of neural networks. We focus on experiments with the CIFAR-10 dataset to validate our analysis and intuitions on generalization and interpretability.
We consider a simple 7 layer network comprising 4 convolution layers with ReLU activations, 2 max-pool layers and 1 fully connected softmax layer to demonstrate model selection. We evaluate the test performance and stability of the proposed NNK classifier and compare it to the weighted KNN (wiNN) approach for different values of $k$, as well as to a 5-fold cross validated linear SVM (similar to the neighborhood methods, the last layer is replaced and trained at each evaluation, here using a LIBLINEAR SVM with minimal regularization; we use the default library settings for the other SVM parameters), for three different network settings:
Regularized model: We use 32 depth channels for each convolution layer, with dropout (keep probability 0.9) at each convolution layer. The data is augmented with random horizontal flips.
Underparametrized model: We keep the same model structure and regularization as in the regularized model but reduce the number of depth channels to 16, halving the number of parameters of the model.
Overfit model: To simulate overfitting, we remove data augmentation and dropout regularization in the regularized model while training for the same number of epochs.
Figure 2 shows the difference in performance between our method and weighted KNN (wiNN): while the proposed DeepNNK method improves marginally with larger values of $k$, the wiNN approach degrades in performance. This can be explained by the fact that NNK accommodates new neighbors only if they belong to a new direction in space that improves its interpolation, unlike its KNN counterparts, which simply interpolate with all neighbors. More importantly, we observe that while the NNK method performs on par with, if not better than, the original classifier with an SVM last layer, its LOO performance is a better indicator of generalization than the empirical model performance on training data. One can clearly identify the regularized model as having better stability by observing the deviation between the training performance and the LOO estimate obtained with our proposed method. Note that the cross validated linear SVM performed sub-optimally in all settings, which suggests that it is unable to capture the complexity of the input data or the generalization differences between models. The choice of the better model is reinforced in Figure 3, where we observe that the histogram of interpolation spread for the regularized model is shifted towards zero relative to the under-parameterized and overfit models. Note that the shift is minimal, which is expected since the difference in test error between the models is small as well.
We next present a few interpretability results showing our framework's ability to identify training instances that are influential in a prediction. The neighbors selected from the training data for interpolation by DeepNNK can be used as examples to explain the neural network's decision. This interpretability can be crucial in problems with transparency requirements, allowing an observer to interpret the region around a test representation as evidence.
In Figure 4, we show examples in the training dataset that are responsible for a prediction using the simple regularized model defined previously. Machine learning models and the datasets used for their training often contain biases, such as repeated instances with small perturbations for class balance, which are often undesirable in applications where fairness is important. The DeepNNK framework can help understand and eliminate sources of bias by allowing practitioners to identify the limitations of their current system in a semi supervised fashion. Figure 5 shows another application of NNK, where the fragile nature of a model on certain training images is brought to light using the interpolation spread of equation (14). These experiments show the potential of the DeepNNK framework as a debugging tool in deep learning.
Finally, we present an experimental analysis of generative and adversarial images from the perspective of NNK interpolation. We study these methods using our DeepNNK framework applied to a Wide-ResNet-28-10 architecture trained with AutoAugment (DeepNNK achieves test accuracy on CIFAR-10 similar to that of the original network).
Generative and adversarial examples leverage interpolation regions where a model (the discriminator in the case of generative images, or a classifier in the case of black box attacks) is influenced by a smaller number of neighboring points. This is made evident in Figure 6, where we see that the number of neighbors for generative and adversarial images is on average smaller than for real images. We conjecture that this is a property of interpolation, where realistic images are obtained in compact interpolation neighborhoods, while perturbations along extrapolating, mislabeled sample directions produce adversarial images. Though an adversarial perturbation in the input image space is visually indistinguishable, the change in the embedding of the adversarial image in the interpolation space is significantly larger, in some cases, as in Figure 7, placing it in regions belonging to a completely different class.
We discussed various theoretical and practical ideas, from model interpretability and generalization to adversarial and generative examples. Underlying each of these applications is a single common tool, a local polytope interpolation, whose neighborhood support is determined automatically and depends on the input data. DeepNNK provides a way to incorporate recent theoretical work on interpolation and leads to a better understanding of deep learning models by tracing their predictions back to the data they were trained on. We hope these attempts help bring neural networks to more real world scenarios and motivate further study of methods for diagnosing machine learning models through the lens of their training data.
We conclude with a few open thoughts and questions.
Leave one out is a particular instance of the more general problem of how a learning system responds to perturbations of its parameters and data. We believe other kinds of perturbations could help better understand neural networks, statistically as well as numerically.
The error in the data interpolation of equation (7) can be viewed as arising from data noise or, alternatively, from the absence of examples in some directions (extrapolation). In either scenario, this error can be used to characterize a notion of distance between the data being interpolated and the data available for interpolation. We believe such a measure could help identify dataset shifts in an unsupervised manner, with possible applications in domain adaptation and transfer learning.
The quadratic optimization problem of (7) satisfies an active constraint set property. (In constrained optimization problems, some constraints are strongly binding, i.e., the solution at these elements is zero so as to satisfy the KKT conditions of optimality. These constraints are referred to as active constraints, knowledge of which helps reduce the problem size, as one can focus on the inactive subset that requires optimization. The constraints that are active at a current feasible solution remain active in the optimal solution.) Given a partition of the neighbor set into an inactive and an active subset, the solution restricted to the inactive subset is the optimal solution provided the KKT conditions hold for both subsets.
Let $\Phi_S$ correspond to the matrix containing the neighbors with non zero data interpolation weights and $y_S$ the associated labels. The kernel space linear interpolation estimator is obtained by solving the associated constrained least squares problem.
The proof follows an argument similar to that for the simplicial interpolation bound. The expected excess mean squared risk can be partitioned based on disjoint sets. (All expectations in this proof are conditioned on the training data; for brevity, we do not make this conditioning explicit in our statements.)
For points outside the convex hull, NNK extrapolates labels and no guarantees can be made on the regression without further assumptions. This reduces the first term on the left of equation (17) to the corresponding term of the theorem.
Let $\theta$ be the solution to the NNK interpolation objective (7), and let $\bar{\theta}$ denote the weight normalized values. The normalized weights follow a Dirichlet$(1, 1, \dots, 1)$ distribution, i.e., one with unit concentration parameters.
where the error terms correspond to Bayesian estimator errors in the training data and are related to the bias. By the smoothness assumption on the conditional mean, we have
Since the noise terms and the interpolation weights are independent, we have
By Jensen's inequality and the bound in equation (19),
Under the independence assumption on the noise, the noise term in equation (20) can be rewritten as
where we use the fact that the normalized weights follow a Dirichlet distribution. Now, the smoothness assumption on the conditional variance allows us to bound
Combining with equation (21), the risk bound for points within the convex hull of the training data is obtained as
Equation (24), along with the reduction for points outside the convex hull obtained earlier, gives the excess risk bound and concludes the proof. ∎
The nearest neighbor convergence lemma states that, for an i.i.d. sequence of random variables, the nearest neighbor of a query point from the set converges to it in probability. Equivalently, this corresponds to convergence in the kernel representation of the data points. Thus, the solution to the NNK data interpolation objective reduces to 1-nearest neighbor interpolation, with a single neighbor receiving all the weight. Now, under the assumption that the query belongs to a polytope formed by the training data, the first term on the right of equation (9) vanishes. ∎