## 1 Introduction

In this article, we study binary classification for large-scale data sets. Nearest neighbor (NN) is a very popular class of classification methods. The NN method searches for the nearest neighbors of a query point and classifies it to the majority class among the neighbors. NN methods do not require sophisticated training that involves optimization but are memory-based (all the data are loaded into memory when predictions are made). In the era of big data, the volume of data is growing at an unprecedented rate. Yet computing power is limited by space and time, and it may not keep pace with the growth of the data volume. For many applications, one of the main challenges of NN is that it is impossible to load the data into the memory of a single machine (Muja and Lowe, 2014). In addition to the memory limitation, there are other important concerns. For example, if the data are collected and stored at several distant locations, it is challenging to transmit the data to a central location. Moreover, privacy and ownership issues may prohibit sharing local data with other locations. Therefore, a distributed approach that avoids transmitting or sharing the local raw data is appealing.

New algorithms have been designed to analyze data in a distributed manner. For example, Muja and Lowe (2014) proposed distributed algorithms to approximate the nearest neighbors in a scalable fashion. However, their approach requires subtle and careful algorithm design, which is not generally accessible to most users, who may work on many different platforms. In light of this, a more general, simple and user-friendly approach is the divide-and-conquer strategy. Assume that the total sample size in the training data is . We divide the data set into a number of subsets (or the data are already collected from locations and stored in machines to begin with). For simplicity, we assume that each subset size is so that . We may allow the number of subsets to grow with the total sample size , in the fashion of . For each query point, NN classification is conducted on each subset, and these local predictions are pooled together by majority voting. Moreover, since the NN predictions are made at local locations, and only the predictions (instead of the raw data) are transmitted to the central location, this approach is more efficient in terms of communication cost and it reduces (if not eliminates) much of the privacy and ownership concerns. The resulting classifier, which we call the big Nearest Neighbor (bigNN) classifier, can be easily implemented by a variety of users on different platforms with minimal effort. We point out that this is not a new idea, as it is essentially an ensemble classifier using the majority vote combiner (Chawla et al., 2004). Moreover, a few recent works on distributed regression and principal component analysis follow this direction (Zhang et al., 2013; Chen and Xie, 2014; Battey et al., 2015; Lee et al., 2017; Fan et al., 2017; Shang and Cheng, 2017; Zhao et al., 2016). However, distributed classification results are much less developed, especially in terms of the statistical performance of the ensemble learner.

In practice, NN methods are often implemented by algorithms capable of dealing with large-scale data sets, such as the kd-tree (Bentley, 1975) or random partition trees (Dasgupta and Sinha, 2013). However, even these methods have limitations. As the definition of “large-scale” evolves, it would be good to know if the divide-and-conquer scheme described above may be used in conjunction with these methods. There are related methods such as those that rely on efficient NNs or approximate nearest neighbor (ANN) methods (Anastasiu and Karypis, 2017; Alabduljalil et al., 2013; Indyk and Motwani, 1998; Slaney and Casey, 2008). However, little theoretical understanding of the classification accuracy of these approximate methods has been obtained (with rare exceptions such as Gottlieb et al. (2014)).

The asymptotic consistency and convergence rates of NN classification have been studied in detail. See, for example, Fix and Hodges Jr (1951); Cover and Hart (1967); Devroye et al. (1994); Chaudhuri and Dasgupta (2014). However, there has been little theoretical understanding of bigNN classification. In particular, one may ask whether bigNN classification performs as well as the oracle NN method, that is, the NN method applied to the entire data set (which is difficult in practice due to the aforementioned constraints and limitations, hence the name “oracle”). To our knowledge, our work is the first to address the classification accuracy of the ensemble NN method from a statistical learning theory point of view, and to build its relation to that of its oracle counterpart. Much progress has been made for the latter.

Cover (1968), Wagner (1971), and Fritz (1975) provided distribution-free convergence rates for NN classifiers. Later works (Kulkarni and Posner, 1995; Gyorfi, 1981) gave rates of convergence in terms of the smoothness of the class conditional probability , such as Hölder’s condition. Recently, Chaudhuri and Dasgupta (2014) studied the convergence rate under a condition more general than the Hölder’s condition of . In particular, a smooth measure was proposed which measures the change in with respect to probability mass rather than distance. More recently, Kontorovich and Weiss (2015) proposed a strongly Bayes consistent margin-regularized 1-NN; Kontorovich et al. (2017) proved that a sample-compressed 1-NN based multiclass learning algorithm is Bayes consistent. In addition, under the so-called margin condition, and assumptions on the density function of the covariates, Audibert and Tsybakov (2007) showed a faster rate of convergence. See Kohler and Krzyzak (2007) for results without assuming the density exists. Some other related works about NN methods include Hall et al. (2008), which gave an asymptotic regret formula in terms of the number of neighbors, and Samworth (2012), which gave a similar formula in terms of the weights for a weighted nearest neighbor classifier. Sun et al. (2016) took the stability measure into consideration and proposed a classification instability (CIS) measure. They gave an asymptotic formula of the CIS for the weighted NN classifier.

In this article, we give the convergence rate of the bigNN method under the smoothness condition for established by Chaudhuri and Dasgupta (2014), and the margin condition. It turns out that this rate is the same as that of the oracle NN method. That is, by divide and conquer, one does not lose classification accuracy in terms of the convergence rate. We show that the rate has a minimax property. That is, with some density assumptions, this rate cannot be improved. We find that the optimal choice of the number of neighbors must scale with the overall sample size and the number of splits, and there is an upper limit on how many splits one may use. To further shorten the prediction time, we study the use of the denoising technique (Xue and Kpotufe, 2018), which allows a significant reduction of the prediction time at a negligible loss in accuracy under certain conditions, which are related to the upper bound on the number of splits. Lastly, we verify the results using an extensive simulation study. As a side product, we also show that the convergence rate of the CIS for the bigNN method is the same as that for the oracle NN method, which is a sharp rate previously proven. All these theoretical results hold as long as the number of divisions does not grow too fast, i.e., slower than some rate determined by the smoothness of .

## 2 Background and Key Assumptions

Let be a separable metric space. For any , let and be the open and closed balls respectively of radius centered at . Let be a Borel regular probability measure on from which are drawn. We focus on binary classification in which ; given , is distributed according to the class conditional probability function (also known as the regression function) , defined as , where is with respect to the joint distribution of .

### Bayes classifier, regret, and classification instability

For any classifier , the risk is . The Bayes classifier, defined as , has the smallest risk among all measurable classifiers. The risk for the Bayes classifier is denoted as .

The excess risk of classifier compared to the Bayes classifier is , which is also called the regret of . Note that since the classifier is often driven by a training data set that is by itself random, both the regret and the risk are random quantities. Hence, sometimes we may be interested in the expected value of the risk , where the expectation is with respect to the distribution of the training data .

Sometimes, we call the algorithm that maps a training data set to the classifier function a “classifier”. In this sense, classification instability (CIS) was proposed to measure how sensitive a classifier is to sampling of the data. In particular, the CIS of a classifier is defined as

where and are the classification functions trained based on and , which are independent copies of the training data (Sun et al., 2016).
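The definition above suggests a simple Monte Carlo estimate of the CIS: train the same procedure on two independent training samples and record how often the two resulting classifiers disagree on test points. The following Python sketch illustrates the idea. The function names (`empirical_cis`, `one_nn`), the callable interface, and the default number of replications are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def one_nn(X_train, y_train, X_test):
    """Plain 1-NN prediction, used here only as an example procedure."""
    # Squared Euclidean distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

def empirical_cis(fit_predict, sample_train, X_test, n_pairs=20, seed=0):
    """Average disagreement rate between classifiers trained on independent samples.

    fit_predict(X_train, y_train, X_test) -> predicted labels on X_test
    sample_train(rng) -> (X, y), a fresh independent training sample
    Both callables are hypothetical interfaces chosen for this sketch.
    """
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_pairs):
        X1, y1 = sample_train(rng)  # first independent copy of the training data
        X2, y2 = sample_train(rng)  # second independent copy
        p1 = fit_predict(X1, y1, X_test)
        p2 = fit_predict(X2, y2, X_test)
        rates.append(np.mean(p1 != p2))  # fraction of test points with disagreement
    return float(np.mean(rates))
```

A procedure that ignores the training data has an empirical CIS of zero under this estimate, while an unstable procedure such as 1-NN on noisy labels yields a larger value.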

### Big Nearest Neighbor Classifiers

In practice, we have a large training data set , and it may be evenly divided into subsamples with observations in each. For any query point , its nearest neighbors in the th subsample are found, and the average of their class labels is denoted as . Denote as the th binary NN classifier based on the th subset. Finally, a majority voting scheme is carried out so that the final bigNN classifier is . In this article, we are interested in the risk of , denoted as , its corresponding regret, and its CIS.
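The construction just described (even split, local nearest neighbor vote on each subsample, majority vote across subsamples) can be sketched in a few lines of Python. This is a minimal single-machine illustration rather than a distributed implementation. The function names, the random even split, the brute-force distance computation, and the rule that ties go to class 1 are all assumptions made for this sketch.

```python
import numpy as np

def knn_label_mean(X_train, y_train, X_query, k):
    """Average class label of the k nearest training points for each query point."""
    # Brute-force squared Euclidean distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k smallest distances per row (order within the k is arbitrary).
    nn_idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return y_train[nn_idx].mean(axis=1)

def bignn_predict(X, y, X_query, n_subsets, k, seed=0):
    """Majority vote over local k-NN classifiers fitted on an even random split.

    Requires k to be no larger than the smallest subset size.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    votes = np.zeros(len(X_query))
    for idx in np.array_split(perm, n_subsets):
        local_mean = knn_label_mean(X[idx], y[idx], X_query, k)
        votes += (local_mean >= 0.5)  # local decision; ties go to class 1 (assumption)
    # Final decision: class 1 if at least half of the local classifiers vote for it.
    return (votes / n_subsets >= 0.5).astype(int)
```

In a truly distributed setting, only the local decisions (one bit per query point and subset) would need to be sent to the central location, which is the communication saving discussed in the introduction.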

### Key Assumptions

Many results in this article rely on the following commonly used assumptions. The -smoothness assumption ensures the smoothness of the regression function . In particular, is -smooth if for all , there exist , such that

Chaudhuri and Dasgupta (2014) pointed out that this is more general than, and is closely related to, the Hölder’s condition when ( is the dimension), which states that is -Hölder continuous if there exist such that . Moreover, Hölder’s continuity implies -smoothness, with the equality

$$\alpha = \alpha_H / d \tag{1}$$

This transition formula will be useful in comparing our theoretical results with the existing ones that are based on the Hölder’s condition.

## 3 Main results

Our first main theorem concerns the regret of the bigNN classifier.

###### Theorem 1.

Set as where is a constant. Under the -smoothness assumption of and the -margin condition, we have

The rate of convergence here appears to be independent of the dimension . However, the theorem can be stated in terms of the Hölder’s condition instead, which leads to the rate , due to equality (1), which now depends on . It would be insightful to compare the bound derived here for the bigNN classifier with the oracle NN classifier. Theorem 7 in Chaudhuri and Dasgupta (2014) showed that under almost the same assumptions, with a scaled choice of in the oracle NN method, the convergence rate for the oracle method is also . This means that divide and conquer does not compromise the rate of convergence of the regret when it is used on the NN classifier.

As a matter of fact, the best known rate among nonparametric classification methods under the margin assumption was (Theorems 3.3 and 3.5 in Audibert and Tsybakov (2007)), which, according to (1), was the same rate for bigNN here and for oracle NN derived in Chaudhuri and Dasgupta (2014). In other words, the rate we have is sharp. It was proved that the optimal weighted nearest neighbor classifiers (OWNN) Samworth (2012), bagged nearest neighbor classifiers Hall and Samworth (2005) and the stabilized nearest neighbor classifier Sun et al. (2016) can achieve this rate. See Theorem 2 of the supplementary materials of Samworth (2012) and Theorem 5 of Sun et al. (2016).

Our next theorem concerns the CIS of the bigNN classifier.

###### Theorem 2.

Set the same as in Theorem 1 (). Under the -smoothness assumption and the -margin condition, we have

Again, we remark that the best known rate for CIS for a nonparametric classification method (oracle NN included) is (Theorem 5 in Sun et al. (2016)), where is the power parameter in the Hölder’s condition. This is exactly the rate we derived here for the bigNN classifier by noting (1).

We remark that the optimal number of neighbors for the oracle NN is at the order of , while the optimal number of neighbors for each local classifier in bigNN is at the order of which is not equal to (that is, the optimal choice of for the oracle above with replaced by .) In other words, the best choice of in bigNN will lead to suboptimal performance for each local classifier. However, due to the aggregation via majority voting, these suboptimal local classifiers will actually ensemble an optimal bigNN classifier.

Moreover, should grow as grows. In view of the facts that and , this implies an upper bound on . In particular, should be less than . Conceptually, there exist notions of bias due to small sample size and bias/variance trade-off for ensembles. If is too large and too small, then the ‘bias’ of the base classifier on each subsample tends to increase, which cannot be averaged away by the subsamples.

Lastly, bigNN may be seen as comparable to the bagged NN method (Hall and Samworth, 2005). In that context, the sampling fraction is . Hall and Samworth (2005) suggested that when the sampling fractions converge to 0 but the resample sizes diverge to infinity, the bagged NN converges to the Bayes rule. Our work gives a convergence rate in addition to the fact that the regret will vanish as grows.

## 4 Pre-training acceleration by denoising

While the oracle NN method or the bigNN method can achieve significantly better performance than 1-NN when and are chosen properly, in practice, many of the commercial tools for nearest neighbor search are optimized for 1-NN only. It is known that for statistical consistency, should grow as to infinity. This poses a practical challenge for the oracle NN to search for the nearest neighbors from the training data, in which could potentially be a very large number. Even in a bigNN in which at each subsample is set to be 1, to achieve statistical consistency one requires to grow with . These naturally lead to the practical difficulty that the prediction time is very large for growing or in the presence of large-scale data sets. Xue and Kpotufe (2018) proposed a denoising technique to shift the time complexity from the prediction stage to the training stage. In a nutshell, denoising means pre-training the data points in the training set by re-labelling each data point with its global NN prediction (for a given ). After each data point is pre-trained, at the time of prediction, the nearest neighbors of the query point from among a small number (say ) of subsamples of the training data are identified, and the majority class among these 1-NNs becomes the final prediction for the query point. Note that under this acceleration scheme, at the prediction stage, one only needs to conduct a 1-NN search times, each time from a subsample with size ; hence the prediction time is significantly reduced to almost the same as that of 1-NN. Xue and Kpotufe (2018) further proved that at some vanishing subsampling ratio, the denoised NN can achieve prediction accuracy of the same order as that of the NN.

In this section we consider using the same technique to accelerate the prediction time for large data sets in conjunction with the bigNN method. The pre-training step in Xue and Kpotufe (2018) was based on, by default, the oracle NN, which is not realistic for very large data sets. We consider using the bigNN to conduct the pre-training, followed by the same 1-NN searches at the prediction stage. Our work and the work of Xue and Kpotufe (2018) can be viewed as complementary to each other: Xue and Kpotufe (2018) shifted the computational time from the prediction stage to the training stage, and we reduce the computational burden at the training stage by using bigNN instead of the oracle NN.

###### Definition 1.

Denote as a subsample of the entire training data with sample size . Denote as the nearest neighbor of among . The denoised BigNN classifier is . Note that this is the same as the 1-NN prediction for a pre-trained data set in which the data points are re-labeled using the bigNN classifier .
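Definition 1 can be illustrated with a short Python sketch: draw a prediction subsample, re-label its points with the bigNN classifier fitted on the full training data (pre-training), and then answer each query with a 1-NN search over the re-labeled subsample. This is a sketch under stated assumptions rather than the authors' implementation. The function names, the random even split, the tie rule (ties go to class 1), and the brute-force distance computation are all illustrative choices.

```python
import numpy as np

def _sq_dists(A, B):
    """Pairwise squared Euclidean distances, shape (len(A), len(B))."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def bignn_predict(X, y, X_query, n_subsets, k, rng):
    """Majority vote over local k-NN classifiers fitted on a random even split."""
    votes = np.zeros(len(X_query))
    for idx in np.array_split(rng.permutation(len(X)), n_subsets):
        nn = np.argpartition(_sq_dists(X_query, X[idx]), k - 1, axis=1)[:, :k]
        votes += (y[idx][nn].mean(axis=1) >= 0.5)  # local vote, ties to class 1
    return (votes / n_subsets >= 0.5).astype(int)

def denoised_bignn_predict(X, y, X_query, n_subsets, k, m, seed=0):
    """1-NN prediction on a size-m subsample whose labels come from bigNN."""
    rng = np.random.default_rng(seed)
    # Prediction subsample of size m, drawn without replacement (assumption).
    sub = rng.choice(len(X), size=m, replace=False)
    # Pre-training: replace the observed labels by bigNN predictions.
    relabeled = bignn_predict(X, y, X[sub], n_subsets, k, rng)
    # Prediction: a single 1-NN search within the re-labeled subsample.
    nearest = _sq_dists(X_query, X[sub]).argmin(axis=1)
    return relabeled[nearest]
```

The pre-training cost is paid once per subsample, whereas each later query costs only one nearest neighbor search over `m` points, which is the source of the speedup discussed above.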

We need additional assumptions to prove Theorem 3. We assume there exists some integer , named the intrinsic dimension, and constant , such that for all and , . We will also use the VC dimension technique Vapnik and Chervonenkis (1971). Although the proof makes use of some results in Xue and Kpotufe (2018), the generalization is not trivial due to the majority voting aggregation in the bigNN classifier.

###### Theorem 3.

Let . Assume VC dimension , intrinsic dimension and constant , the -Hölder continuity of and the -margin condition, with probability at least over ,

Under the Hölder’s condition, the regret of the bigNN classifier has been established to be at the rate of . Theorem 3 suggests that the pre-training step has introduced an additional error at the order of , ignoring logarithmic factors. Assume for the moment that the intrinsic dimension equals the ambient dimension for simplicity. In this case, the additional error is relatively small compared to the original regret of bigNN, provided that the size of each subsample is at the order at least . When the intrinsic dimension is indeed smaller than the ambient dimension , then the additional error is even smaller.

In principle, the subsamples at the training stage (with size ) and the subsamples at the prediction stage (with size ; from which we search for the 1NN of ) do not have to be the same. In practice, at the prediction stage, we may use the subsamples that are already divided up by bigNN. In other words, we do not have to conduct two different data divisions, one at the pre-training stage, the other at the prediction stage. In this case, to continue the discussion in the last paragraph, the additional error due to this pre-training acceleration is negligible as long as the number of total subsamples is no larger than . Incidentally, this matches the upper bound on of previously.

Xue and Kpotufe (2018) suggested obtaining multiple pre-trained 1-NN estimates from subsamples repeatedly, and conducting a majority vote among them to improve the performance. For example, they use in the simulation study. The theoretical result does not depend on the number of subsamples . In our simulation studies, we have tried a few values of to compare their empirical performance.

## 5 Simulations

All numerical studies are conducted on HPC clusters with two 12-core Intel Xeon Gold Skylake processors and four 10-core Xeon-E5 processors, with memory between 64 and 128 GB.

Simulation 1: We choose the split coefficient and . The number of neighbors is chosen as as stated in the theorems with , truncated at 1. The two classes are generated as and with the prior class probability . The value is chosen to be since the corresponding Hölder exponent . In addition, the test set was independently generated with 1000 observations.

We repeat the simulation 1000 times for each and . Here both the empirical risk (test error) and the empirical CIS are calculated for both the bigNN and the oracle NN methods. The empirical regret is calculated as the empirical risk minus the Bayes risk, calculated using the known underlying distribution. Note that due to numerical issues and the greater precision needed for large and , the empirical risk and CIS can present some instability. The R environment is used in this study.

The results are reported in Figure 1. The regret and CIS lines for different values are parallel to each other and are linearly decreasing as grows at the log-log scale, which verifies that the convergence rates are power functions of with negative exponents.

Inspired by the fact that the convergence rate for regret is and that for CIS is , we fit the following two linear regression models:

using all the dots in Figure 1. If the regression coefficients for are significant, then the convergence rates of regret and CIS are power functions of , and the coefficients themselves are the exponent terms. That is, they should be approximately and . Since the term is categorical, for different values, the regression lines share the common slopes, but have different intercepts. Very accurate predictions are obtained from these regressions. In particular, the correlation between the observed and fitted (, resp.) is 0.9916 (0.9896, resp.). The scatter plots between the observed and fitted values are shown in Figure 2, displaying almost perfect fits. These results verify the rates obtained in our theorems.
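A regression of this kind, with one intercept per level of a categorical term and a slope shared across levels, can be fitted by ordinary least squares. The Python sketch below assumes that the response and the sample size both enter on the log scale, as the log-log description above indicates. The function name `fit_common_slope` and its argument names are hypothetical and chosen for illustration only.

```python
import numpy as np

def fit_common_slope(log_n, log_y, group):
    """Least-squares fit of log_y = intercept[group] + slope * log_n.

    log_n, log_y : 1-D arrays of log sample sizes and log responses
    group        : 1-D array of categorical levels (one intercept per level)
    Returns (dict mapping level -> intercept, common slope).
    """
    log_n = np.asarray(log_n, dtype=float)
    log_y = np.asarray(log_y, dtype=float)
    levels, level_idx = np.unique(np.asarray(group), return_inverse=True)
    # One indicator column per level gives level-specific intercepts.
    dummies = np.eye(len(levels))[level_idx]
    design = np.column_stack([dummies, log_n])
    coef, *_ = np.linalg.lstsq(design, log_y, rcond=None)
    return dict(zip(levels.tolist(), coef[:-1])), float(coef[-1])
```

The fitted common slope plays the role of the empirical exponent, which can then be compared with the exponent predicted by the theorems.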

Figure 1 in the supplementary materials shows that bigNN has significantly shorter computing time than the oracle method.

Simulation 2: Now suppose we intentionally fix to be a constant (this may not be the optimal ). After a straightforward modification of the proofs, the rates of convergence for regret and for CIS become and respectively for , and both regret and CIS should decrease as increases. We fix the number of neighbors , let range from 0 to 0.7, and let . The rest of the settings are the same as in Simulation 1.

The results are shown in Figure 3. Both lines decay linearly in (both plots are on the log scale for the -axis). We note that the expected slopes in these two plots should be and respectively, which is verified by the figures, where larger means steeper lines.

Simulation 3: In Simulation 3, we compare denoised bigNN with the bigNN method. For denoised bigNN, we merge a few pre-training subsamples into one prediction subsample, leading to the size of each prediction subsample . We set , , the pre-training split coefficient , number of prediction subsampling repeats , and the prediction subsample size coefficient . The two classes are generated as and with the prior class probability . The number of neighbors in the oracle NN is chosen as . The number of local neighbors in bigNN is chosen as where , a small constant that we find works well in this example. In addition, the test set was independently generated with 1000 observations. We repeat the simulation 300 times for each , and . The results are reported in Figure 4. In each figure, the black diamond shows the prediction time and regret for the bigNN method without acceleration. Different curves represent different numbers of subsamples queried at the prediction stage, and their performance is similar. We see that the performance changes greatly with the size of the subsamples at prediction . Small (or ), corresponding to the top-left end of each curve, is fast, but introduces too much bias. Large (or ) values (bottom-right) are reasonably accurate and much faster. The computing times are shown in seconds. For this example, it seems that or will work well.

## 6 Real Data Examples

The OWNN method (Samworth, 2012) gives the same optimal convergence rate of regret as oracle NN and bigNN, and additionally enjoys an optimal constant factor asymptotically. A ‘big’OWNN (where the base classifier is OWNN instead of NN) is technically doable but is omitted in this paper in favor of a more straightforward implementation. The goal of this section is to check how much (or how little) statistical accuracy bigNN will lose even if we do not use OWNN (which is optimal in risk, both in rate and in constant) in each subset. In particular, we compare the finite-sample performance of bigNN, oracle NN and oracle OWNN using real data. We note that bigNN has the same convergence rate as the other two, but with much less computing time. We deliberately choose not to include other state-of-the-art algorithms (such as SVM or random forest) in the comparison. The impact of divide and conquer for those algorithms is an interesting future research topic.

We have retained benchmark data sets HTRU2 (Lyon et al., 2016), Gisette (Guyon et al., 2005), Musk 1 (Dietterich et al., 1994), Musk 2 (Dietterich et al., 1997), Occupancy (Candanedo and Feldheim, 2016), Credit (Yeh and Lien, 2009), and SUSY (Baldi et al., 2014), from the UCI machine learning repository (Lichman, 2013). The test sample sizes are set as . Parameters in NN and OWNN are tuned using cross-validation, and the parameter in bigNN for each subsample is the optimally chosen for the oracle NN divided by . In Table 1, we compare the average empirical risk (test error), the empirical CIS, and the speedup of bigNN relative to oracle NN, over 500 replications (OWNN typically has similar computing time as NN and hence the speed comparison with OWNN is omitted). From Table 1, one can see that the three methods typically yield very similar risk and CIS (no single method always wins), while bigNN has a computational advantage. Moreover, it seems that larger values tend to have slightly worse performance for bigNN.

| Data | size | dim | split coef. | R.BigNN | R.kNN | R.OWNN | C.BigNN | C.kNN | C.OWNN | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| htru2 | | | 0.1 | 2.0385 | 2.1105 | 2.1188 | 0.3670 | 0.6152 | 0.5528 | 2.72 |
| htru2 | | | 0.2 | 2.0929 | 2.1105 | 2.1188 | 0.6323 | 0.6152 | 0.5528 | 7.65 |
| htru2 | | | 0.3 | 2.1971 | 2.1105 | 2.1188 | 0.5003 | 0.6152 | 0.5528 | 21.65 |
| gisette | | | 0.2 | 3.9344 | 3.5020 | 3.4749 | 4.4261 | 4.4752 | 4.3317 | 5.13 |
| musk1 | | | 0.1 | 14.7619 | 14.9767 | 14.9757 | 24.2362 | 23.0664 | 23.2707 | 1.79 |
| musk2 | | | 0.2 | 3.8250 | 3.4400 | 3.2841 | 4.7575 | 5.1925 | 4.1615 | 5.73 |
| occup | | | 0.1 | 0.6207 | 0.6205 | 0.6037 | 0.3790 | 0.4431 | 0.5795 | 2.93 |
| occup | | | 0.2 | 0.6119 | 0.6205 | 0.6037 | 0.3717 | 0.4431 | 0.5795 | 6.97 |
| occup | | | 0.3 | 0.6548 | 0.6205 | 0.6037 | 0.3081 | 0.4431 | 0.5795 | 19.19 |
| credit | | | 0.1 | 18.8300 | 18.8681 | 18.8414 | 2.7940 | 3.5292 | 3.4392 | 3.36 |
| credit | | | 0.2 | 18.8467 | 18.8681 | 18.8414 | 4.3917 | 3.5292 | 3.4392 | 7.86 |
| credit | | | 0.3 | 18.9250 | 18.8681 | 18.8414 | 4.2496 | 3.5292 | 3.4392 | 23.22 |
| SUSY | | | 0.1 | 19.3103 | 21.0381 | 20.7752 | 7.7034 | 7.4011 | 7.5921 | 4.59 |
| SUSY | | | 0.2 | 21.6149 | 21.0381 | 20.7752 | 7.9073 | 7.4011 | 7.5921 | 16.76 |
| SUSY | | | 0.3 | 22.3197 | 21.0381 | 20.7752 | 4.6716 | 7.4011 | 7.5921 | 88.22 |

In Figure 2 of the supplementary materials, we allow to grow to . As mentioned earlier, when grows too fast (e.g. in this example), the performance of bigNN starts to deteriorate, due to increased ‘bias’ of the base classifier, despite faster computing.

## 7 Conclusion

Due to computation, communication, privacy and ownership limitations, sometimes it is impossible to conduct NN classification at a central location. In this paper, we study the bigNN classifier, which distributes the computation to different locations. We show that the convergence rates of regret and CIS for bigNN are the same as the ones for the oracle NN methods, and both rates are sharp. We also show that the prediction time for bigNN can be further improved, by using the denoising acceleration technique, and it is possible to do so at a negligible loss in the statistical accuracy.

Convergence rates are only the first step to understand bigNN. The sharp rates give reassurance about worst-case behavior; however, they do not lead naturally to optimal splitting schemes or quantifications of the relative performance of two NN classifiers attaining the same rate (such as bigNN and oracle NN). Achieving these goals is left for future work. Another direction for future work is to prove the sharp upper bound on .

#### Acknowledgments

This work was supported by the National Science Foundation under Grants DMS-1712907, DMS-1811812, DMS-1821183, and Office of Naval Research, (ONR N00014-18-2759).

Supplementary Materials to: Distributed Nearest Neighbor Classification

Xingye Qiao, Jiexin Duan, and Guang Cheng

## 1 Notations and Preliminary Results

We first define some helpful notations. The support of is . The function , defined for points , can be extended to measurable set with as

To build a natural connection between the geometric radius and the probability measure, we define

is the smallest radius so that an open ball centered at has probability mass at least . Intuitively, a greater leads to a greater .

We are now ready to define the so-called effective interiors of the two classes. The effective interior of class 1 is the set of points with on which the -NN classifier is more likely to be correct (than on its complement):

To see this, note that for a sample with points and for , because there are roughly speaking at most points in , suggests that the average of the class labels of those points in is greater than by at least ; hence one can easily get a correct classification using -NN at point if is .

Similarly, the effective interior for class 0 is defined as

and the effective boundary is defined as

This is the region on which the Bayes classifier and the -NN are very likely to disagree.

###### Theorem A.1.

With probability at least ,

where

and

This theorem essentially says that the probability that the bigNN classification is different from the Bayes rule is about the size of the effective boundary, which can be calibrated and controlled. The proof of the theorem starts with the evaluation of the probability that each classifier on a local machine disagrees with the Bayes rule (Chaudhuri and Dasgupta, 2014), and bounds the disagreement probability of the ensemble classifier using concentration inequalities due to the Chernoff bound.
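The role of concentration in this argument can be seen numerically. If each local classifier independently disagrees with the Bayes rule with probability below one half, then the probability that at least half of them disagree decays exponentially in the number of local classifiers. The Python sketch below compares the exact binomial tail with the standard Hoeffding bound. The Hoeffding form is used here for simplicity and is not the specific inequality invoked in the proof, and the function names are illustrative.

```python
from math import comb, exp

def majority_failure_prob(s, p):
    """Exact P(at least half of s independent base classifiers fail),
    where each fails independently with probability p."""
    threshold = (s + 1) // 2  # smallest integer that is >= s / 2
    return sum(comb(s, j) * p**j * (1 - p) ** (s - j)
               for j in range(threshold, s + 1))

def hoeffding_bound(s, p):
    """Hoeffding upper bound on P(Binomial(s, p) >= s / 2), valid for p < 1/2."""
    return exp(-2 * s * (0.5 - p) ** 2)
```

For a fixed failure probability below one half, both quantities shrink as the number of base classifiers grows, which is why the majority vote can be accurate even when the individual local classifiers are not.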

To bound the excess risk, we first consider pointwise conditional risk. The Bayes classifier has pointwise risk . The pointwise risk for the th base -NN classifier (the bigNN classifier, resp.) is denoted as (, resp.). Next, we prove a lemma that gives an upper bound for the expected pointwise regret (the expectation is with respect to the training data) under the -smoothness assumption of .

###### Lemma 1.

Set and . Pick any with . Under the -smoothness assumption of ,

We are ready to prove the convergence rate of the regret for the bigNN classifier.

## 2 Proof for Theorem A.1

###### Proof.

Pick any and any , . Let

where is the th nearest neighbor of in the th subsample. Intuitively, is the ball that includes the nearest neighbors. Let denote the mean of the ’s for points . Then

(S.1) |

This is the event that the th base classifier does not agree with the Bayes classifier. Define the th “bad event” as

Now for the main event of interest here, we have

To see this, suppose . Then without loss of generality, lies in , on which for all . Next, suppose further that less than or equal to bad events occur, which means that more than “good” events (the complements of the bad events) occur, that is, for more than subsets, it holds that

and

The first inequality above means that

These suggest that . Recall that since lies in . We can conclude that on the th “good” event the th base -NN classifier has made the same decision as the Bayes classifier. If more than half of the base -NN classifiers agree with the Bayes classifier, then the bigNN classifier also agrees with the Bayes classifier.

Note that ’s are independent and identically distributed Bernoulli random variables. Denote

where the expectation is taken with respect to the distribution of the training data. can be bounded using Lemmas 8 and 9 of Chaudhuri and Dasgupta (2014): for

Using concentration inequalities due to the Chernoff bound (cf. Theorem 1.1, Slud, 1977), we have

where .

Taking expectation over , we have

Markov’s inequality leads to that

that is, with probability at least ,

In conclusion, with probability at least ,

∎

## 3 Proof for Lemma 1

###### Proof.

Assume without loss of generality that . The -smooth assumption of implies

for all . Next, for all , we have

Hence and .

The probability of a bad event is bounded by invoking Lemma 9 and Lemma 10 in Chaudhuri and Dasgupta (2014),

(S.2) |

where we substitute to obtain the last equality.

Similarly, for the pointwise risk of the bigNN classifier,

By taking expectation over the training data, we can then conclude that

∎

## 4 Proof for Theorem 1

###### Proof.

We define . Pick any . Lemma 2 bounds the pointwise regret on the set . On the set of , recall that

so that the pointwise regret is always bounded by . Then we have

For the first term,

For the second term,

The last term is decomposed into the sum of the following terms with :

If we set ,

Therefore,