Stabilized Nearest Neighbor Classifier and Its Statistical Properties

05/26/2014, by Wei Sun, et al.
Purdue University and Binghamton University

The stability of statistical analysis is an important indicator of reproducibility, one of the main principles of the scientific method. It entails that similar statistical conclusions can be reached based on independent samples from the same underlying population. In this paper, we introduce a general measure of classification instability (CIS) to quantify the sampling variability of the prediction made by a classification method. Interestingly, the asymptotic CIS of any weighted nearest neighbor classifier turns out to be proportional to the Euclidean norm of its weight vector. Based on this concise form, we propose a stabilized nearest neighbor (SNN) classifier, which distinguishes itself from other nearest neighbor classifiers by taking stability into consideration. In theory, we prove that SNN attains the minimax optimal convergence rate in risk, and a sharp convergence rate in CIS. The latter rate result is established for general plug-in classifiers under a low-noise condition. Extensive simulated and real examples demonstrate that SNN achieves a considerable improvement in CIS over existing nearest neighbor classifiers, with comparable classification accuracy. We implement the algorithm in the publicly available R package snn.

1 Introduction

Data science has become a driving force for many scientific studies. As datasets get bigger and the methods of analysis become more complex, the need for reproducibility has increased significantly (Stodden et al., 2014). A minimal requirement of reproducibility is that one can reach similar results based on independently generated datasets. The issue of reproducibility has drawn much attention in the scientific community (see the special issue of Nature at http://www.nature.com/nature/focus/reproducibility/); Marcia McNutt, the Editor-in-Chief of Science, pointed out that “reproducing an experiment is one important approach that scientists use to gain confidence in their conclusions.” In other words, if conclusions cannot be reproduced, the credibility of the researchers, along with the scientific conclusions themselves, will be in jeopardy.

1.1 Stability

Statistics as a subject can help improve reproducibility in many ways. One particular aspect we stress in this article is the stability of the statistical procedure used in the analysis. According to Yu (2013), “reproducibility manifests itself in stability of statistical results relative to ‘reasonable’ perturbations to data and to the model used.” An unstable statistical method raises the possibility that a correct scientific conclusion will not be reproduced, and hence not be recognized, or even be falsely discredited.

Stability has indeed received much attention in statistics. However, little work has focused on stability itself; most studies instead treat stability as a tool for other purposes. For example, in clustering problems, Ben-Hur et al. (2002) introduced clustering instability to assess the quality of a clustering algorithm, and Wang (2010) used clustering instability as a criterion to select the number of clusters. In high-dimensional regression, Meinshausen and Bühlmann (2010) proposed stability selection procedures for variable selection, while Liu et al. (2010) and Sun et al. (2013) applied stability to tuning parameter selection. For more applications, see the use of stability in model selection (Breiman, 1996), in analyzing the effect of bagging (Bühlmann and Yu, 2002), and in deriving generalization error bounds (Bousquet and Elisseeff, 2002; Elisseeff et al., 2005). While successes of stability have been reported in the aforementioned works, to the best of our knowledge, there has been little systematic methodological and theoretical study of stability itself in the classification context.

On the other hand, we are aware that “a study can be reproducible but still be wrong” (see http://simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important/). Likewise, a classification method can be stable but inaccurate. Thus, in this article, stability is not meant to replace classification accuracy, which remains the primary goal of most research on classification. However, an irreproducible or unstable study will surely have a reduced chance of being accepted by the scientific community, no matter how accurate it is. Hence, it is ideal for a method to be both accurate and stable, which is the goal of the current article.

Moreover, in certain practical domains of classification, stability can be as important as accuracy, because providing stable predictions plays a crucial role in users’ trust in a system. For example, the Internet streaming service provider Netflix has a movie recommendation system based on complex supervised learning algorithms. In this application, if two consecutively recommended movies are from two totally different genres, viewers can immediately perceive such instability and have a bad user experience with the service (Adomavicius and Zhang, 2010).

1.2 Overview

The k-nearest neighbor (kNN) classifier (Fix and Hodges, 1951; Cover and Hart, 1967) is one of the most popular nonparametric classification methods, due to its conceptual simplicity and powerful prediction capability. In the literature, extensive research has been done to justify various nearest neighbor classifiers based on the risk, which measures the inaccuracy of a classifier (Devroye and Wagner, 1977; Stone, 1977; Györfi, 1981; Devroye et al., 1994; Snapp and Venkatesh, 1998; Biau et al., 2010). We refer the readers to Devroye et al. (1996) for a comprehensive study. Recently, Samworth (2012) proposed an optimal weighted nearest neighbor (OWNN) classifier. Like most other existing nearest neighbor classifiers, OWNN focuses on the risk without paying attention to classification stability.

Figure 1: Regret and CIS of the kNN classifier. From top to bottom, each circle represents the kNN classifier for a particular value of k. The red square corresponds to the classifier with the minimal regret, and the classifier depicted by the blue triangle improves on it with a lower CIS.

In this article, we define a general measure of stability for a classification method, named classification instability (CIS). It characterizes the sampling variability of the prediction. An important result we show is that the asymptotic CIS of any weighted nearest neighbor classifier (a generalization of kNN), denoted as WNN, turns out to be proportional to the Euclidean norm of its weight vector. This rather concise form is crucial in our methodological development and theoretical analysis. To illustrate the relation between risk and CIS, we apply the kNN classifier to a toy example (see details in Section 7.1) and plot in Figure 1 the regret (that is, the risk minus a constant known as the Bayes risk) versus the CIS, calculated according to Proposition 1 and Theorem 1 in Section 3, for different k. As k increases, the classifier becomes more and more stable, while the regret first decreases and then increases. In view of the kNN classifier with the minimal regret, marked as the red square in Figure 1, one may notice that other values of k attain a similar regret but a much smaller CIS, such as the one marked by the blue triangle in the plot.

Inspired by Figure 1, we propose a novel method called the stabilized nearest neighbor (SNN) classifier, which takes stability into consideration. The SNN procedure is constructed by minimizing the CIS of WNN over an acceptable region where the regret is small, indexed by a tuning parameter. SNN encompasses the OWNN classifier as a special case.

To understand the theoretical properties of SNN, we establish a sharp convergence rate of CIS for general plug-in classifiers. This sharp rate is slower than, but can be arbitrarily close to, n^{-1/2}, and is obtained by adapting the framework of Audibert and Tsybakov (2007). Furthermore, the proposed SNN method is shown to achieve both the minimax optimal rate in regret established in the literature and the sharp rate in CIS established in this article.

Figure 2: Regret and CIS of the kNN, OWNN, and SNN procedures for a bivariate normal example. The top three lines represent the CIS’s of kNN, OWNN, and SNN. The bottom three lines represent the regrets of kNN, SNN, and OWNN. The sample size shown on the x-axis is on the log scale.

To further illustrate the advantages of the SNN classifier, we offer a comprehensive asymptotic comparison among various classifiers, through which new insights are obtained. It is theoretically verified that the CIS of our SNN procedure is much smaller than those of its competitors. Figure 2 shows the regret and CIS of kNN, OWNN, and SNN for a bivariate example (see details in Section 7.1). Although OWNN is theoretically the best in regret, its regret curve appears to overlap with that of SNN. On the other hand, the SNN procedure has a noticeably smaller CIS than OWNN. A compelling message is that, with almost the same accuracy, SNN can greatly improve stability. In the finite sample case, extensive experiments confirm that SNN achieves a significant improvement in CIS, and sometimes even improves accuracy slightly. Such appealing results are supported by our theoretical finding (in Corollary 1) that the regret of SNN approaches that of OWNN at a faster rate than the rate at which the CIS of OWNN approaches that of SNN, where both rates are shown to be sharp. As a by-product, we also show that OWNN is more stable than the kNN and bagged nearest neighbor (BNN) classifiers.

The rest of the article is organized as follows. Section 2 defines CIS for a general classification method. In Section 3, we study the stability of the nearest neighbor classifier, and propose a novel SNN classifier. The SNN classifier is shown to achieve an established sharp rate in CIS and the minimax optimal rate in regret in Section 4. Section 5 presents a thorough theoretical comparison of regret and CIS between the SNN classifier and other nearest neighbor classifiers. Section 6 discusses the issue of tuning parameter selection, followed by numerical studies in Section 7. We conclude the article in Section 8. The appendix and supplementary materials are devoted to technical proofs.

2 Classification Instability

Let (X, Y) be a random couple with joint distribution P. We regard X as a d-dimensional vector of features for an object and Y ∈ {1, 2} as a label indicating that the object belongs to one of two classes. Denote the prior class probability as π = P(Y = 1), where P is the probability with respect to P, and the distribution of X given Y = r as P_r, with r = 1, 2. The marginal distribution of X can be written as P̄ = π P1 + (1 − π) P2. For a classifier φ: R^d → {1, 2}, the risk of φ is defined as R(φ) = P(φ(X) ≠ Y). It is well known that the Bayes rule, denoted as φ*, minimizes the above risk. Specifically, φ*(x) = 1 + 1{η(x) < 1/2}, where η(x) = P(Y = 1 | X = x) and 1{·} is the indicator function. In practice, a classification procedure Ψ is applied to a training data set D = {(X1, Y1), …, (Xn, Yn)} to produce a classifier φ̂ = Ψ(D). We define the risk of the procedure as R(Ψ) = E[P(φ̂(X) ≠ Y | D)], and the regret of Ψ as Regret(Ψ) = R(Ψ) − R(φ*), where E denotes the expectation with respect to the distribution of D, and R(φ*) is called the Bayes risk. Both the risk and the regret describe the inaccuracy of a classification method. In practice, for a classifier φ̂, the classification error on a test data set can be calculated as an empirical version of the risk.
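It is useful to recall a classical pointwise representation of the regret (a standard identity from the literature, e.g., Devroye et al. (1996), stated here for reference):

\[
R(\phi) - R(\phi^{*}) \;=\; \mathbb{E}\left[\, \left|2\eta(X) - 1\right| \, \mathbf{1}\{\phi(X) \neq \phi^{*}(X)\} \,\right],
\]

so a classifier is penalized only where it disagrees with the Bayes rule, in proportion to the local margin |2η(x) − 1|. This representation underlies both the regret expansions of Section 3 and the margin-condition analysis of Section 4.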

For a classification procedure, it is desired that, with high probability, classifiers trained on different samples yield the same prediction for the same object. Our first step in formalizing classification instability is to define the distance between two generic classifiers φ1 and φ2, which measures the level of disagreement between them.

Definition 1.

(Distance between Classifiers) Define the distance between two classifiers φ1 and φ2 as d(φ1, φ2) = P(φ1(X) ≠ φ2(X)).

We next define the classification instability (CIS). Throughout the article, we denote D1 and D2 as two i.i.d. copies of the training sample D. For ease of notation, we have suppressed the dependence of D on the sample size n.

Definition 2.

(Classification Instability) Define the classification instability of a classification procedure Ψ as

CIS(Ψ) = E[ d(φ̂1, φ̂2) ] = E[ P( φ̂1(X) ≠ φ̂2(X) | D1, D2 ) ],   (1)

where φ̂1 = Ψ(D1) and φ̂2 = Ψ(D2) are the classifiers obtained by applying the classification procedure Ψ to samples D1 and D2.

Intuitively, CIS is the average probability that the same object is classified into two different classes in two separate runs of a learning algorithm. By definition, 0 ≤ CIS(Ψ) ≤ 1, and a small CIS(Ψ) represents a stable classification procedure Ψ.
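To make the definition concrete, the following sketch estimates the CIS of the kNN procedure by Monte Carlo: train on two independent samples and average the disagreement of the resulting classifiers on fresh test points. It uses the knn function from the standard R package class; the Gaussian toy distribution and all sample sizes are illustrative assumptions of this sketch, not settings from the paper.

library(class)

## Draw a labeled sample: class 2 is shifted by mu in every coordinate.
simulate_sample <- function(n, d = 2, mu = 1) {
  y <- rbinom(n, 1, 0.5) + 1
  x <- matrix(rnorm(n * d), n, d) + mu * (y == 2)
  list(x = x, y = factor(y, levels = c(1, 2)))
}

## Monte Carlo estimate of CIS for kNN: average disagreement of two
## classifiers trained on i.i.d. copies of the training sample.
estimate_cis_knn <- function(k, n = 200, n_test = 500, reps = 100) {
  mean(replicate(reps, {
    d1   <- simulate_sample(n)
    d2   <- simulate_sample(n)
    test <- simulate_sample(n_test)
    p1 <- knn(d1$x, test$x, d1$y, k = k)
    p2 <- knn(d2$x, test$x, d2$y, k = k)
    mean(p1 != p2)       # empirical distance between the two classifiers
  }))
}

set.seed(1)
sapply(c(5, 25, 125), estimate_cis_knn)   # CIS decreases as k grows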

3 Stabilized Nearest Neighbor Classifier

3.1 Review of WNN

For any fixed x, let (X_(1), Y_(1)), …, (X_(n), Y_(n)) denote the training observations ordered in ascending distance of X_(i) to x. For a nonnegative weight vector w_n = (w_{n1}, …, w_{nn}) satisfying Σ_{i=1}^n w_{ni} = 1, a WNN classifier assigns x to class 1 if Σ_{i=1}^n w_{ni} 1{Y_(i) = 1} ≥ 1/2. Samworth (2012) revealed a nice asymptotic expansion formula for the regret of WNN.
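The following sketch implements this rule directly for a single query point; it is our own illustration of the WNN definition above, with class labels coded as 1/2 and with uniform weights recovering kNN as a special case.

## Weighted nearest neighbor prediction at a query point x0.
## w must be nonnegative and sum to one; its i-th entry weights the
## i-th closest training point.
wnn_predict <- function(x0, x_train, y_train, w) {
  stopifnot(all(w >= 0), abs(sum(w) - 1) < 1e-8)
  d2   <- colSums((t(x_train) - x0)^2)      # squared distances to x0
  ord  <- order(d2)                         # ascending distance
  vote <- sum(w * (y_train[ord] == 1))      # weighted fraction of class-1 neighbors
  if (vote >= 0.5) 1L else 2L               # assign class 1 iff vote >= 1/2
}

## kNN is the special case with uniform weights on the first k neighbors.
knn_weights <- function(n, k) c(rep(1 / k, k), rep(0, n - k))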

Proposition 1.

(Samworth, 2012) Under Assumptions (A1)–(A4) defined in Appendix A.I, for each β ∈ (0, 1/2), we have, as n → ∞,

Regret(WNN) = { B1 Σ_{i=1}^n w_{ni}^2 + B2 ( n^{-2/d} Σ_{i=1}^n α_i w_{ni} )^2 } (1 + o(1)),   (2)

uniformly for w_n ∈ W_{n,β} with W_{n,β} defined in Appendix A.II, where α_i = i^{1+2/d} − (i−1)^{1+2/d}, and the constants B1 and B2 are defined in Appendix A.II.

Samworth (2012) further derived the weight vector that minimizes the asymptotic regret, which leads to the optimal weighted nearest neighbor (OWNN) classifier.

3.2 Asymptotically Equivalent Formulation of CIS

Denote the two WNN classifiers trained on D1 and D2 as φ̂1 and φ̂2, respectively. With a slight abuse of notation, we denote the CIS of a WNN classification procedure by CIS(w_n). According to the definition in (1), the classification instability of a WNN procedure is CIS(w_n) = E[P(φ̂1(X) ≠ φ̂2(X) | D1, D2)]. Theorem 1 provides an asymptotic expansion formula for the CIS of WNN in terms of its weight vector w_n.

Theorem 1.

(Asymptotic CIS) Under Assumptions (A1)–(A4) defined in Appendix A.I, for each β ∈ (0, 1/2), we have, as n → ∞,

CIS(WNN) = B3 ( Σ_{i=1}^n w_{ni}^2 )^{1/2} (1 + o(1)),   (3)

uniformly for all w_n ∈ W_{n,β} with W_{n,β} defined in Appendix A.II, where the constant B3 is defined in Appendix A.II.

Theorem 1 demonstrates that the asymptotic CIS of a WNN procedure is proportional to the Euclidean norm of its weight vector, ‖w_n‖_2. For example, for the kNN procedure (that is, the WNN procedure with w_{ni} = 1/k for i ≤ k and 0 otherwise), the CIS is asymptotically B3 k^{-1/2}. Therefore, a larger value of k leads to a more stable kNN procedure, as was seen in Figure 1. Furthermore, we note that the CIS expansion in (3) is related to the first term in (2). The expansions in (2) and (3) allow precise calibration of regret and CIS; this delicate connection is important in the development of our SNN procedure.

3.3 Stabilized Nearest Neighbor Classifier

To stabilize WNN, we consider the weight vector that minimizes the CIS over an acceptable region where the classification regret is less than some constant c0 > 0, that is,

min_{w_n} CIS(w_n)   (4)
subject to Regret(w_n) ≤ c0,  Σ_{i=1}^n w_{ni} = 1,  w_{ni} ≥ 0.

By a non-decreasing transformation, we change the objective function in (4) to Σ_{i=1}^n w_{ni}^2. Furthermore, considering the Lagrangian formulation, we can see that (4) is equivalent to minimizing λ1 Σ_{i=1}^n w_{ni}^2 + ( n^{-2/d} Σ_{i=1}^n α_i w_{ni} )^2 subject to the constraints that Σ_{i=1}^n w_{ni} = 1 and w_{ni} ≥ 0. The equivalence is ensured by the expansions (2) and (3) in Proposition 1 and Theorem 1, and the fact that both the objective function and the constraints are convex in the variable vector w_n. The resulting optimization is

min_{w_n}  λ1 Σ_{i=1}^n w_{ni}^2 + ( n^{-2/d} Σ_{i=1}^n α_i w_{ni} )^2   (5)
subject to  Σ_{i=1}^n w_{ni} = 1,  w_{ni} ≥ 0,

where λ1 depends on the constants B1 and B2 and on the Lagrange multiplier λ ≥ 0 associated with the stability term. When λ → ∞, (5) leads to the most stable but trivial classifier with uniform weights w_{ni} = 1/n. The classifier in (5) with λ = 0 (i.e., λ1 = B1/B2) approaches the OWNN classifier considered in Samworth (2012). Note that the two terms ( n^{-2/d} Σ_{i=1}^n α_i w_{ni} )^2 and Σ_{i=1}^n w_{ni}^2 in (5) represent the bias and variance terms of the regret expansion given in Proposition 1 (Samworth, 2012). By varying the weights of these two terms through λ1, we are able to stabilize a nearest neighbor classifier. Moreover, the stabilized classifier achieves desirable convergence rates in both regret and CIS; see Section 4.

Theorem 2 gives the optimal weight w* that solves the optimization (5). We formally define the stabilized nearest neighbor (SNN) classifier as the WNN classifier with the optimal weight w*.

Theorem 2.

(Optimal Weight) For any fixed λ1 > 0, the minimizer w* of (5) is

w*_{ni} = (1/k*) [ 1 + d/2 − (d/(2 (k*)^{2/d})) α_i ]  for i = 1, …, k*,  and  w*_{ni} = 0  for i > k*,

where α_i = i^{1+2/d} − (i−1)^{1+2/d} and k* = ⌈ { d(d+4)/(2(d+2)) }^{d/(d+4)} λ1^{d/(d+4)} n^{4/(d+4)} ⌉.

The SNN classifier encompasses the OWNN classifier as a special case when λ = 0.

The computational complexity of our SNN classifier is comparable to that of existing nearest neighbor classifiers. If we preselect a value for λ, SNN requires no training at all. The testing time consists of two parts: an O(nd) cost for computing the distances to a query point, where n is the size of the training data, and an O(n log n) cost for sorting these distances. The kNN classifier, for example, shares the same computational complexity. In practice, λ is not predetermined, and we may treat it as a tuning parameter whose optimal value is selected via cross validation. See Algorithm 1 in Section 6 for details. We will show in Section 6 that the complexity of tuning SNN is also comparable to that of existing methods.
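For illustration, the sketch below computes the weight profile in our reconstruction of Theorem 2; the exact constants in k* (an OWNN-type profile in the spirit of Samworth (2012), with B1/B2 replaced by λ1) should be read as assumptions of this sketch rather than a verbatim restatement of the theorem.

## SNN-style weight vector: linearly decaying OWNN-form weights on the
## first k.star ordered neighbors, with k.star ~ lambda1^{d/(d+4)} n^{4/(d+4)}.
snn_weights <- function(n, d, lambda1) {
  B      <- (d * (d + 4) / (2 * (d + 2)))^(d / (d + 4))
  k.star <- min(n, ceiling(B * lambda1^(d / (d + 4)) * n^(4 / (d + 4))))
  i      <- seq_len(k.star)
  alpha  <- i^(1 + 2 / d) - (i - 1)^(1 + 2 / d)
  w      <- (1 + d / 2 - d * alpha / (2 * k.star^(2 / d))) / k.star
  w      <- pmax(w, 0)                      # guard against a tiny negative tail
  c(w / sum(w), rep(0, n - k.star))         # renormalize to sum to one
}

These weights can be passed directly to the wnn_predict sketch of Section 3.1; increasing lambda1 enlarges k* and flattens the profile toward uniform weights, trading a little regret for stability.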

4 Theoretical Properties

4.1 A Sharp Rate of CIS

Motivated by Audibert and Tsybakov (2007), we establish a sharp convergence rate of CIS for a general plug-in classifier. A plug-in classification procedure first estimates the regression function η(x) = P(Y = 1 | X = x) by η̂(x), and then plugs it into the Bayes rule, that is, φ̂(x) = 1 + 1{η̂(x) < 1/2}.

The following margin condition (Tsybakov, 2004) is assumed for deriving the upper bound of the convergence rate, while two additional conditions are required for showing the lower bound. A distribution P satisfies the margin condition if there exist constants C0 > 0 and α ≥ 0 such that for any t > 0,

P̄( |η(X) − 1/2| ≤ t ) ≤ C0 t^α.   (6)

The parameter α characterizes the behavior of the regression function η near 1/2, and a larger α implies a lower noise level and hence an easier classification scenario. For instance, if η crosses 1/2 with a nonvanishing gradient and the marginal density of X is bounded, then (6) holds with α = 1.

The second condition is on the smoothness of η. Specifically, we assume that η belongs to a Hölder class of functions (for some fixed β, L > 0) containing the functions g that are ⌊β⌋ times continuously differentiable and satisfy, for any x and x′, |g(x′) − g_x(x′)| ≤ L ‖x − x′‖^β, where ⌊β⌋ is the largest integer not greater than β, g_x is the Taylor polynomial of degree ⌊β⌋ of g at x, and ‖·‖ is the Euclidean norm.

Our last condition assumes that the marginal distribution satisfies the strong density assumption, defined in Supplementary S.III.

We first derive the rate of convergence of CIS by assuming an exponential convergence rate of the corresponding regression function estimator.

Theorem 3.

(Upper Bound) Let η̂ be an estimator of the regression function η and let R be a compact d-dimensional set. Let P be a set of probability distributions supported on R × {1, 2} such that for some constants C1, C2 > 0, some positive sequence a_n, and almost all x with respect to P̄,

P_D( |η̂(x) − η(x)| ≥ δ ) ≤ C1 exp( −C2 a_n δ^2 )   (7)

holds for any n ≥ 1 and δ > 0, where P_D is the probability with respect to the distribution of the training sample D. Furthermore, if all the distributions in P satisfy the margin condition (6) for a constant α > 0, then the plug-in classification procedure Ψ̂ corresponding to η̂ satisfies

sup_{P ∈ P} CIS(Ψ̂) ≤ C a_n^{-α/2}

for any n ≥ 1 and some constant C depending only on α, C0, C1, and C2.
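To see where this rate comes from, here is a heuristic two-step sketch (our own reconstruction, using the forms of (6) and (7) given above):

\[
\mathrm{CIS}(\hat{\Psi}) \;\le\; 2\,\mathbb{E}\,\mathbb{P}\big(\hat{\phi}(X) \neq \phi^{*}(X)\big)
\;\le\; 2\,\mathbb{E}\,\mathbb{P}\big(|\hat{\eta}(X) - \eta(X)| \ge |\eta(X) - 1/2|\big),
\]

since the two independently trained classifiers can disagree at a point only if at least one of them disagrees with the Bayes rule there, and a plug-in classifier can disagree with the Bayes rule at x only if the estimation error exceeds the margin |η(x) − 1/2|. Splitting the last probability over the shells |η(X) − 1/2| ∈ (2^{j−1} δ_n, 2^j δ_n] with δ_n ≍ a_n^{-1/2}, the margin condition (6) bounds the probability mass of each shell by C0 (2^j δ_n)^α, while (7) makes the misclassification probability on the j-th shell decay like exp(−C2 2^{2(j−1)}); summing over j yields the a_n^{-α/2} bound.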

It is worth noting that the condition in (7) holds for various types of estimators. For example, Theorem 3.2 in Audibert and Tsybakov (2007) showed that the local polynomial estimator satisfies (7) with a_n = n^{2β/(2β+d)} when the bandwidth is of the order n^{-1/(2β+d)}. In addition, Theorem 5 in Section 4.2 implies that (7) holds for the newly proposed SNN classifier with the same a_n. Hence, in both cases, the upper bound is of the order n^{-αβ/(2β+d)}.

We next derive the lower bound of CIS in Theorem 4. As will be seen, this lower bound implies that the obtained rate of CIS, that is, n^{-αβ/(2β+d)}, cannot be further improved for plug-in classification procedures.

Theorem 4.

(Lower Bound) Let P be a set of probability distributions supported on R × {1, 2} such that for any P in P, P satisfies the margin condition (6), the regression function η belongs to the Hölder class, and the marginal distribution satisfies the strong density assumption. Suppose further that η̂ satisfies (7) with a_n = n^{2β/(2β+d)} and αβ ≤ d. We have

sup_{P ∈ P} CIS(Ψ̂) ≥ C′ n^{-αβ/(2β+d)}

for any n ≥ 1 and some constant C′ independent of n.

Theorems 3 and 4 together establish a sharp convergence rate of the CIS for the general plug-in classification procedure on the set P. The requirement αβ ≤ d in Theorem 4 implies that α and β cannot both be large. As pointed out in Audibert and Tsybakov (2007), this is intuitively true because a very large β implies a very smooth regression function η, while a large α implies that η cannot stay very long near 1/2: when η hits 1/2, it should take off quickly. Lastly, we note that this rate is slower than n^{-1/2}, but can approach n^{-1/2} as the dimension d increases when αβ = d.

As a reminder, Audibert and Tsybakov (2007) established the minimax optimal rate of the regret as n^{-β(1+α)/(2β+d)}.

4.2 Optimal Convergence Rates of SNN

In this subsection, we demonstrate that SNN attains the sharp convergence rate in CIS established in the previous subsection, as well as the minimax optimal convergence rate in regret. We further show the asymptotic difference between SNN and OWNN.

In Theorem 5 and Corollary 1 below, we consider SNN with λ1 ≍ B1/B2 in Theorem 2, where a_n ≍ b_n means that the ratio sequence a_n/b_n stays away from zero and infinity as n → ∞. Note that under Assumptions (A1)–(A4) defined in Appendix A.I, the ratio B1/B2 is bounded away from zero and infinity, and hence so is λ1, which agrees with the formulation in Theorem 2.

Theorem 5.

For any α > 0 and β ∈ (0, 2], the SNN procedure with any fixed λ ≥ 0 satisfies

sup_{P ∈ P} Regret(SNN) ≤ C1 n^{-β(1+α)/(2β+d)}   and   sup_{P ∈ P} CIS(SNN) ≤ C2 n^{-αβ/(2β+d)},

for any n ≥ 1 and some constants C1 and C2, where P is defined in Theorem 4.

Corollary 1 below further investigates the difference between the SNN procedure (with fixed λ) and the OWNN procedure in terms of both regret and CIS.

Corollary 1.

For any α and β as in Theorem 5, we have, as n → ∞,

(8)

where P is defined in Theorem 4.

Corollary 1 implies that the regret of SNN approaches that of OWNN (from above) at a faster rate than the CIS of OWNN approaches that of SNN (from above). This means that SNN can achieve a substantial improvement in CIS over the OWNN procedure while attaining comparable classification accuracy. This observation will be supported by the experimental results in Section 7.2.

Remark 1.

Under Assumptions (A1)–(A4) in Section A.I, which implicitly imply that α = 1, and under an additional scaling assumption on λ, the conclusion in (8) can be strengthened to an exact-order statement, indicating that SNN’s improvement in CIS is at least of a fixed polynomial order in this scenario.

5 Asymptotic Comparisons

In this section, we first conduct an asymptotic comparison of CIS among existing nearest neighbor classifiers, and then demonstrate that SNN significantly improves on OWNN in CIS.

5.1 CIS Comparison of Existing Methods

We compare kNN, OWNN, and the bagged nearest neighbor (BNN) classifier. The kNN classifier is a special case of the WNN classifier with weight w_{ni} = 1/k for i ≤ k and 0 otherwise. Another special case of the WNN classifier is the BNN classifier. After generating subsamples from the original data set, the BNN classifier applies the 1-nearest neighbor classifier to each bootstrapped subsample and returns the final prediction by majority voting. If the resample size m is sufficiently smaller than n, i.e., m → ∞ and m/n → 0, the BNN classifier is consistent (Hall and Samworth, 2005). In particular, Hall and Samworth (2005) showed that, for large n, the BNN classifier (with or without replacement) is approximately equivalent to a WNN classifier with geometrically decaying weights w_{ni} ≈ q(1 − q)^{i−1}, where q is the resampling ratio m/n.

We denote the CIS of the above classification procedures as CIS(kNN), CIS(BNN), and CIS(OWNN). Here k in the kNN classifier is selected as the one minimizing the regret (Hall et al., 2008). The optimal q in the BNN classifier and the optimal weight in the OWNN classifier are both calculated based on their asymptotic relations with the optimal k in kNN, which were given in (2.9) and (3.5) of Samworth (2012). Corollary 2 gives the pairwise CIS ratios of these classifiers. Note that these ratios depend only on the feature dimension d.

Corollary 2.

Under Assumptions (A1)–(A4) defined in Appendix A.I and the assumption that the constant B2 defined in Appendix A.II is positive, we have, as n → ∞,

Figure 3: Pairwise CIS ratios between kNN, BNN, and OWNN for different feature dimensions d.

The limiting CIS ratios in Corollary 2 are plotted in Figure 3. A major message is that the OWNN procedure is more stable than the kNN and BNN procedures for every d. The improvement of the OWNN procedure over kNN is largest in low dimensions and diminishes as d grows. The CIS ratio of BNN over kNN is less than 1 except for the smallest d, which is consistent with the common perception that bagging can generally reduce the variability of nearest neighbor classifiers. A similar phenomenon has been shown for the ratio of their regrets (Samworth, 2012). Therefore, bagging can be used to improve the kNN procedure in terms of both accuracy and stability in all but the smallest dimensions. Furthermore, the CIS ratio of OWNN over BNN is less than 1 for all d, but quickly converges to 1 as d increases. This implies that although the BNN procedure is asymptotically less stable than the OWNN procedure, their difference vanishes as d increases.

5.2 Comparisons between SNN and OWNN

Corollary 1 in Section 4.2 implies that OWNN and SNN share the same convergence rates of regret and CIS (note that OWNN is a special case of SNN). Hence, it is of more interest to compare their relative magnitudes. The asymptotic comparisons between SNN and OWNN are characterized in Corollary 3.

Corollary 3.

Under Assumptions (A1)–(A4) defined in Appendix A.I and the assumption that the constant B2 defined in Appendix A.II is positive, we have, as n → ∞,

where the constants B1 and B2 are defined in Appendix A.II.

The second formula in Corollary 3 suggests that as λ increases, the SNN classifier becomes more and more stable. In Corollary 3, both ratios of the SNN procedure over the OWNN procedure depend on d, λ, and the two unknown constants B1 and B2. Substituting the relevant constants into Corollary 3, we further have the following ratios,

(9)
(10)

For any λ > 0, SNN improves on OWNN in CIS. As a mere illustration, we consider the case where the regret and the squared CIS are given equal weight, that is, λ = 1. In this case, the ratios in (9) and (10) depend only on d and B1/B2.

Figure 4: Regret ratio and CIS ratio of SNN over OWNN as functions of d and B1/B2. The darker the color, the larger the value.

Figure 4 shows 3D plots of these two ratios as functions of d and B1/B2. As expected, the CIS of the SNN procedure is universally smaller than that of OWNN (ratios less than 1 in the right panel), while the OWNN procedure has a smaller regret (ratios greater than 1 in the left panel). For a fixed B1/B2, as the dimension d increases, the regret of SNN approaches that of OWNN, while the advantage of SNN in terms of CIS grows. For a fixed dimension d, as B1/B2 increases, the regret ratio between SNN and OWNN gets larger, but the CIS advantage of SNN also grows. According to the definitions in Appendix A.II, a greater value of B1/B2 indicates a harder classification problem; see the discussion after Theorem 1 of Samworth (2012).

Figure 5: Logarithm of the relative gain of SNN over OWNN as a function of d and B1/B2 when λ = 1. The grey (white) color represents the region where the logarithm of the relative gain is greater (less) than 0.

Since SNN improves on OWNN in CIS but has a greater regret, it is of interest to know when the improvement in CIS exceeds the loss in regret. We thus consider the relative gain, defined as the absolute ratio of the percentage of CIS reduction to the percentage of regret increase. As an illustration, when λ = 1, the relative gain converges to a limit depending only on d and B1/B2. Figure 5 shows the log(relative gain) as a function of d and B1/B2. For most combinations of d and B1/B2, the logarithm is greater than 0 (shown in grey in Figure 5), indicating that the improvement of SNN in CIS is greater than its loss in regret. In particular, when d is large, the log(relative gain) is positive for all the values of B1/B2 considered.

6 Tuning Parameter Selection

To select the parameter λ for the SNN classifier, we first identify a set of values of λ whose corresponding (estimated) risks are among the smallest, and then choose from this set the value with the minimal estimated CIS. Let φ̂_λ denote an SNN classifier with parameter λ trained on sample D. Given a predetermined set of tuning parameter values, λ is selected using Algorithm 1 below, which involves estimating the CIS and risk in Steps 1–3 and a two-stage selection in Steps 4 and 5.
Algorithm 1:
Step 1. Randomly partition D into five subsets I1, …, I5.
Step 2. Let I5 be the test set, and form two training sets from the remaining subsets, for example I1 ∪ I2 and I3 ∪ I4. Train one classifier with parameter λ on each training set, and obtain the two predicted labels for each observation in the test set. Estimate the CIS of the classifier with parameter λ by the proportion of test observations on which the two predictions disagree, and estimate the risk by the average test error of the two classifiers.
Step 3. Repeat Step 2 with each of the remaining subsets serving as the test set and the rest forming the training sets. The final estimated CIS and risk are the averages over the five splits.
Step 4. Perform Steps 2 and 3 for each candidate λ. Denote the set of tuning parameters with top accuracy (estimated risks within a small threshold of the minimum) as Λ0.
Step 5. Output the optimal tuning parameter λ̂ as the member of Λ0 with the minimal estimated CIS.

In our experiments, the predetermined set of tuning parameters is of a fixed size. In Step 1, the sample sizes of the five subsets are chosen to be roughly equal. In Step 4, the threshold reflects how the set of the most accurate classifiers is defined. Based on our limited experiments, the final result is very robust to the choice of this threshold within a suitable range.
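In code, the two-stage selection can be sketched as follows. The routine is written generically for any classifier constructor (for instance, one wrapping the wnn_predict and snn_weights sketches above), and the 10% accuracy threshold is an illustrative assumption, since the paper leaves the threshold unspecified.

## Algorithm 1, sketched generically. fit_factory(x, y, lambda) must return
## a prediction function taking a matrix of test points.
tune_two_stage <- function(x, y, lambdas, fit_factory, threshold = 0.10) {
  n    <- nrow(x)
  fold <- sample(rep(1:5, length.out = n))        # Step 1: five random subsets
  stats <- sapply(lambdas, function(lam) {
    per_fold <- sapply(1:5, function(j) {         # Steps 2-3: rotate the test fold
      test <- which(fold == j)
      tr   <- which(fold != j)
      half <- sample(tr, length(tr) %/% 2)        # two training sets from the rest
      f1 <- fit_factory(x[half, , drop = FALSE], y[half], lam)
      f2 <- fit_factory(x[setdiff(tr, half), , drop = FALSE], y[setdiff(tr, half)], lam)
      p1 <- f1(x[test, , drop = FALSE])
      p2 <- f2(x[test, , drop = FALSE])
      c(cis  = mean(p1 != p2),                    # disagreement on the test fold
        risk = mean(c(p1 != y[test], p2 != y[test])))
    })
    rowMeans(per_fold)
  })
  ok <- stats["risk", ] <= min(stats["risk", ]) * (1 + threshold)  # Step 4
  lambdas[ok][which.min(stats["cis", ok])]                          # Step 5
}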

Compared with the tuning method for the kNN classifier, which minimizes the estimated risk only, Algorithm 1 additionally requires estimating the CIS. However, the CIS is estimated concurrently with the risk in Step 2. Therefore, the complexity of tuning our SNN classifier is of the same order as that for kNN. As will be seen in the numerical experiments below, the additional effort of estimating the CIS leads to improvements over existing nearest neighbor methods in both accuracy and stability.

7 Numerical Studies

We first verify our theoretical findings using an example, and then illustrate the improvements of the SNN classifier over existing nearest neighbor classifiers based on simulations and real examples.

7.1 Validation of Asymptotically Equivalent Forms

This subsection aims to support the asymptotically equivalent forms of CIS derived in Theorem 1 and the CIS and regret ratios in Corollary 3. We focus on a multivariate Gaussian example in which regret and CIS have explicit expressions.

Assume that the underlying distributions of the two classes are multivariate normal with a common covariance matrix and a fixed prior class probability. We choose a compact sampling region that covers most of the probability mass of both classes. In addition, a test set with 1000 observations was independently generated. The estimated risk and CIS were calculated based on 100 replications. In this example, some calculus exercises lead to explicit values of the constants B1, B2, and B3. According to Proposition 1 and Theorems 1 and 2, we obtain that

(11)
(12)

with λ1 determined by λ as in Theorem 2. For a mere illustration, we choose λ = 1, which fixes the corresponding k*, and the resulting asymptotic regret and CIS then follow from (11) and (12).

Similarly, the asymptotic regret and CIS of OWNN are given by (11) and (12) with λ1 = B1/B2, due to the optimal weight derived in Samworth (2012).

Figure 6: Asymptotic CIS (red curve) and estimated CIS (box plots over 100 simulations) for the OWNN (left) and SNN (right) procedures. These plots show that the estimated CIS converges to its asymptotic value as n increases.

In Figure 6, we plot the asymptotic CIS of the SNN and OWNN classifiers computed using the above formulae, shown as red curves, along with the estimated CIS based on the simulated data, shown as box plots over 100 replications. As the sample size increases, the estimated CIS approximates its asymptotic value very well. For example, at the largest sample size considered, the asymptotic CIS of the SNN (OWNN) classifier is 0.078 (0.085), while the estimated CIS is 0.079 (0.086).

Figure 7: Asymptotic risk (regret + the Bayes risk; red curves) and estimated risk (black box plots) for the OWNN (left) and SNN (right) procedures. The blue horizontal line indicates the Bayes risk, 0.215. These plots show that the estimated risk converges to its asymptotic version (and also to the Bayes risk) as n increases.

Similarly, in Figure 7, we plot the asymptotic risk, that is, the asymptotic regret in (11) plus the true Bayes risk (0.215 in this example), for the SNN and OWNN classifiers, along with the estimated risk. Here we compute the Bayes risk by Monte Carlo integration. Again, the difference between the estimated risk and the asymptotic risk decreases as the sample size grows.

Furthermore, according to (10), the asymptotic CIS ratio of the SNN classifier over the OWNN classifier has an explicit value in this example, and the empirically estimated CIS ratios converge to it as n increases. However, by (9), the asymptotic regret ratio of the SNN classifier over the OWNN classifier can also be computed, and the estimated regret ratios match it only for small sample sizes; they differ for large n. This may be caused by the fact that the classification errors are very close to the Bayes risk for large n, and hence the estimated regret ratio suffers from a numerical issue. For example, for the largest n, the average errors of the SNN and OWNN classifiers are both nearly indistinguishable from the Bayes risk of 0.215 (see Figure 7). A similar issue was previously reported in Samworth (2012).

7.2 Simulations

In this section, we compare SNN with the kNN, OWNN, and BNN classifiers. The parameter k in kNN was tuned over a grid of equally spaced values starting from 5. For a fair comparison, the parameter λ in the SNN classifier was tuned so that the corresponding parameter k* (see Theorem 2) took equally spaced values falling roughly in the same range.

In Simulation 1, we assumed that the two classes were generated from two multivariate normal distributions with identity covariance, a fixed prior probability, and a range of dimensions d. We set the sample size n and chose the mean separation between the two classes such that the resulting ratio B1/B2 was fixed at the same value for each d. Specifically, in Supplementary S.VII we derive the expression (13) for this ratio; accordingly, the mean separation was set to a different value for each d.

In Simulation 2, the training data were generated from a bimodal distribution for each class, with two choices of the dimension d and two choices of the prior class probability.

Simulation 3 has the same setting as Simulation 2, except that the covariance matrix of each component is a Toeplitz matrix whose jth entry of the first row decays in j.

Figure 8: Average test errors and CIS’s (with standard error bars marked) of the kNN, BNN, OWNN, and SNN methods in Simulation 1. The x-axis indicates the settings with various dimensions. Within each setting, the four methods are horizontally lined up (from the left: kNN, BNN, OWNN, and SNN).

Simulation 1 is a relatively easy classification problem; Simulation 2 examines the bimodal effect, and Simulation 3 combines bimodality with dependence between variables. In each simulation setting, a large test data set is independently generated, and the average classification error and average estimated CIS on the test set are reported over 100 replications. To estimate the CIS, for each replication we build two classifiers based on a random division of the training data, and then estimate the CIS by the average disagreement of these two classifiers on the test data.

Figure 8 shows the average error (left) and CIS (right) for Simulation 1. As a first impression, the test error is similar across the classification methods, while the CIS differs a lot. In terms of stability, SNN always has the smallest CIS; in particular, as d increases, the improvement of SNN over all other procedures becomes even larger. This agrees with the asymptotic findings in Section 5.2. For the largest dimension considered, the kNN, BNN, and OWNN procedures are all at least five times more unstable than SNN. In terms of accuracy, SNN attains the minimal test error in all five scenarios, although the improvement in accuracy is not significant in some settings. This result suggests that although SNN is asymptotically less accurate than OWNN in theory, the actual empirical difference in test error is often negligible.

Figure 9: Average test errors and CIS’s (with standard error bars marked) of the kNN, BNN, OWNN, and SNN methods in Simulation 2. The ticks on the x-axis indicate the dimension and prior class probability for each setting. Within each setting, the four methods are horizontally lined up (from the left: kNN, BNN, OWNN, and SNN).
Figure 10: Average test errors and CIS’s (with standard error bars marked) of the kNN, BNN, OWNN, and SNN methods in Simulation 3. The ticks on the x-axis indicate the dimension and prior class probability for each setting. Within each setting, the four methods are horizontally lined up (from the left: kNN, BNN, OWNN, and SNN).

Figures 9 and 10 summarize the results for Simulations 2 and 3. As in Simulation 1, the difference in CIS is much more pronounced than the difference in error. The SNN procedure attains the minimal CIS in all 8 cases; interestingly, the improvements are significant in all four higher-dimensional cases. Moreover, in 3 of the 8 cases, SNN achieves the smallest test error and the improvements are significant. Even in the cases where its error is not the smallest, the accuracy of SNN is close to that of the best classifier.

7.3 Real Examples

Figure 11: Average test errors and CIS’s (with standard error bars marked) of the kNN, BNN, OWNN, and SNN methods on four data examples. The ticks on the x-axis indicate the names of the examples. Within each example, the four methods are horizontally lined up (from the left: kNN, BNN, OWNN, and SNN).

We extend the comparison to four real data sets publicly available in the UCI Machine Learning Repository (Bache and Lichman, 2013). The first is the breast cancer data set collected by Wolberg and Mangasarian (1990). There are 683 samples and 10 experimental measurement variables; the binary class label indicates whether a sample is benign or malignant. These 683 samples arrived periodically: in total, there are 8 groups of samples reflecting the chronological order of the data, and a good classification procedure is expected to produce a stable classifier across these groups. The second is the credit approval data set. It consists of credit card applications, each described by attributes reflecting the applicant's information; the binary class label refers to whether the application outcome is positive or negative. The third is Haberman's survival data set, which contains cases from a study on the survival of patients who had undergone surgery for breast cancer. It has three attributes: age, the patient's year of operation, and the number of positive axillary nodes detected. The response variable indicates the survival status: either the patient survived 5 years or longer, or the patient died within 5 years. The last is the SPECT heart data set, which describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images. Each image set (patient) has binary feature patterns and is classified into one of two classes: normal and abnormal.

We randomly split each data set into training and test sets of equal size. The same tuning procedure as in the simulations is applied here. We compute the test error and the estimated CIS on the test set. The procedure is repeated 100 times, and the average error and CIS are reported in Figure 11.

Similar to the simulation results, the SNN procedure attains the minimal CIS in all four real data sets, and the improvements in CIS are significant. The errors of OWNN and SNN show no significant difference, although OWNN is theoretically better in accuracy. These real-data experiments further illustrate that, with almost the same classification accuracy, our SNN procedure can achieve a significant improvement in stability, which promotes reproducibility.

8 Conclusion

Stability is an important and desirable property of a statistical procedure. It provides a foundation for reproducibility and reflects the credibility of those who use the procedure. To the best of our knowledge, our work is the first to propose a measure that quantifies classification instability. The proposed SNN classification procedure enjoys improved classification stability with classification accuracy comparable to OWNN.

For classification problems, classification accuracy is the primary concern, while stability is secondary. In many real cases, however, different classifiers enjoy comparable classification accuracy, and a classifier with better stability stands out. The observation that our method can improve stability while maintaining similar accuracy suggests that there may be much more room for improving stability than for improving accuracy. This may be explained by the faster convergence rate of the regret than that of the CIS (Theorem 5).

In theory, our SNN is shown to achieve the minimax optimal convergence rate in regret and a sharp convergence rate in CIS. Extensive experiments illustrate that SNN attains a significant improvement in stability over existing nearest neighbor classifiers, and sometimes even improves accuracy. We implement the algorithm in the publicly available R package snn.

Our proposed SNN method is motivated by an asymptotic expansion of the CIS. Such a clean expansion may not exist for other, more general classification methods. Hence, it is unclear how the stabilization idea can be carried over to other classifiers in a similar manner. That being said, the CIS measure can be used as a criterion for tuning parameter selection. There is work in the literature that uses variable selection stability to select tuning parameters (Sun et al., 2013). Classification stability and variable selection stability complement each other and together provide a comprehensive description of the reliability of a statistical procedure.

For simplicity, we focus on binary classification in this article. The generalization of the SNN classifier to multicategory classification problems (Lee et al., 2004; Liu and Shen, 2006; Liu and Yuan, 2011) is an interesting topic to pursue in the future. Moreover, stability for high-dimensional, low-sample-size data is another important topic. Furthermore, in analyzing big data sets, a popular scheme is divide-and-conquer; how to divide the data and choose the tuning parameter wisely to ensure the optimal stability of the combined classifier is an interesting research question.

Appendices

A.I Assumptions (A1)–(A4)

For a smooth function g, we denote ġ(x) as its gradient vector at x. We assume the following conditions throughout the article.

(A1) The set R is a compact d-dimensional manifold with boundary ∂R.

(A2) The set S = {x ∈ R: η(x) = 1/2} is nonempty. There exists an open subset U0 of R^d which contains S such that: (i) η is continuous on U, with U an open set containing the closure of U0; (ii) the restrictions of the conditional distributions of X given Y = 1 and Y = 2 to U are absolutely continuous with respect to Lebesgue measure, with twice continuously differentiable Radon–Nikodym derivatives f1 and f2.

(A3) There exists ρ > 0 such that ∫ ‖x‖^ρ dP̄(x) < ∞. Moreover, for sufficiently small δ > 0, inf_{x ∈ R} P̄(B_δ(x))/(a_d δ^d) ≥ C0 > 0, where B_δ(x) is the ball of radius δ centered at x, a_d = π^{d/2}/Γ(1 + d/2), Γ is the gamma function, and C0 is a constant independent of δ.

(A4) For all x ∈ S, we have η̇(x) ≠ 0, and for all x ∈ S ∩ ∂R, the gradient of the restriction of η to ∂R is nonzero at x.

Remark 2.

Assumptions (A1)–(A4) have also been employed to show the asymptotic expansion of the regret of the kNN classifier (Hall et al., 2008). The gradient condition in (A4) is equivalent to the margin condition (6) with α = 1; see (2.1) in Samworth (2012). Furthermore, these assumptions ensure that the class-conditional densities are bounded away from zero and infinity on S.

A.II Definitions of B1, B2, B3, and W_{n,β}

For a smooth function g: R^d → R, let g_j(x) denote its jth partial derivative at x, g̈(x) the Hessian matrix of g at x, and g_{jk}(x) the (j, k)th element of g̈(x). Let f̄ = π f1 + (1 − π) f2. Define

We further define two distribution-related constants, B1 and B2,

where Vol^{d−1} is the natural (d − 1)-dimensional volume measure that S inherits, with S defined in Appendix A.I. Based on Assumptions (A1)–(A4) in Appendix A.I, B1 and B2 are finite, with B1 > 0 and B2 ≥ 0, where B2 = 0 only when the integrand defining it equals zero on S.

In addition, for β ∈ (0, 1/2), we denote W_{n,β} as the set of weight vectors w_n satisfying (w.1)–(w.5).

(w.1) ,

(w.2) , where ,

(w.3) with ,

(w.4) ,

(w.5) .

For the kNN classifier, (w.1)–(w.5) reduce to a simple range condition on k. See Samworth (2012) for a detailed discussion of W_{n,β}.

A.III Proof of Theorem 1

Note that