One-Nearest-Neighbor Search is All You Need for Minimax Optimal Regression and Classification

02/05/2022
by   J. Jon Ryu, et al.
University of California, San Diego

Recently, Qiao, Duan, and Cheng (2019) proposed a distributed nearest-neighbor classification method, in which a massive dataset is split into smaller groups, each processed with a k-nearest-neighbor classifier, and the final class label is predicted by a majority vote among these groupwise class labels. This paper shows that the distributed algorithm with k=1 over a sufficiently large number of groups attains a minimax optimal error rate up to a multiplicative logarithmic factor under some regularity conditions, for both regression and classification problems. Roughly speaking, distributed 1-nearest-neighbor rules with M groups has a performance comparable to standard Θ(M)-nearest-neighbor rules. In the analysis, alternative rules with a refined aggregation method are proposed and shown to attain exact minimax optimal rates.


1 Introduction

Arguably among the most primitive, yet powerful, nonparametric approaches for various statistical problems, k-nearest-neighbor (k-NN) based algorithms have been one of the essential toolkits in data science since their inception. They have been extensively studied and analyzed over several decades for canonical statistical procedures including classification (Fix & Hodges, 1951; Cover & Hart, 1967), regression (Cover, 1968a, b), density estimation (Loftsgaarden & Quesenberry, 1965; Fukunaga & Hostetler, 1973; Mack & Rosenblatt, 1979), and density functional estimation (Kozachenko & Leonenko, 1987; Leonenko et al., 2008). They remain attractive even in this modern age due to their simplicity, decent performance, and the rich understanding of their statistical properties.

There exist, however, clear limitations that hinder their wider deployment in practice. First, and most importantly, standard k-NN based algorithms are often deemed inherently infeasible for large-scale data, as they need to store and process the entire dataset in a single machine for NN search. Second, although the number of neighbors k generally needs to grow to infinity with the sample size to achieve statistical consistency (Biau & Devroye, 2015), a small k is highly preferred in practice to avoid the possibly demanding time complexity of large-k NN search; see Section 3.1 for an in-depth discussion.

Recently, specifically for regression and classification, a few ensemble-based methods (Xue & Kpotufe, 2018; Qiao et al., 2019; Duan et al., 2020) have been proposed that aim to reduce the computational complexity while achieving the accuracy of the optimal standard k-NN regression and classification rules; however, the theoretical guarantees of these solutions still require large-k NN search. Xue & Kpotufe (2018) proposed an idea dubbed denoising, which is to draw (multiple) subsample(s) and preprocess them with the standard large-k NN rule over the entire data in the training phase, so that the k-NN information can be hashed effectively by 1-NN searches in the testing phase. Though the resulting algorithm is provably optimal with a small statistical overhead, the denoising step may still suffer prohibitively large complexity for large sample size and/or large k in principle. More recently, to address the computational and storage complexity of the standard k-NN classifier with large k, Qiao et al. (2019) proposed the bigNN classifier, which splits the data into subsets, applies the standard k-NN classifier to each, and aggregates the labels by a majority vote. This ensemble method works without any coordination among data splits, and thus naturally fits large-scale data which may be inherently stored and processed on distributed machines. However, they showed its minimax optimality only when both the number of splits and the base k increase as the sample size increases, and gave only a strictly suboptimal guarantee for fixed k. Given the optimality only for increasingly large k, they suggested using the bigNN classifier in the preprocessing phase of the denoising framework. A more recent work (Duan et al., 2020) on an optimally weighted version of the bigNN classifier still assumes increasingly large k.

In this paper, we complete the missing theory for small k and show that the bigNN classifier with k=1 suffices for minimax rate-optimal classification. More generally, we analyze a variant of the bigNN classifier, called the M-split k-NN classifier, which is defined as the majority vote over the total Mk nearest-neighbor labels obtained after running k-NN search over the M data splits. Roughly put, we show that the M-split k-NN classification rule behaves almost equivalently to the standard k-NN rule with Θ(Mk) neighbors, for any fixed k. In particular, the M-split 1-NN rule, which is equivalent to the bigNN classifier with k=1, is shown to attain a minimax optimal rate up to logarithmic factors under smooth measure conditions. We also provide a minimax-rate-optimal guarantee for regression with an analogously defined M-split k-NN regression rule.

Albeit both the algorithm and the analysis are simple in nature, the practical implication of the theoretical guarantees provided herein, together with the divide-and-conquer framework, is significant: while running faster than the standard 1-NN rule by processing smaller data with small-k NN search in parallel, the M-split k-NN rules can achieve the same statistical guarantee as the optimal standard k-NN rules run over the entire dataset. Moreover, when deploying the rules in practice, we only need to tune the number of splits M while fixing k, say, simply k=1. We experimentally demonstrate that the split 1-NN rules indeed perform on par with the optimal standard k-NN rules, as expected by theory, while running faster than the standard 1-NN rules.

The key technique in our analysis is to analyze intermediate rules that selectively aggregate the small-k NN estimates from each data split based on the (k+1)-th-NN distances from a query point. The intuition is that these intermediate distance-selective rules achieve minimax optimal rates for any fixed k by averaging only neighbors close enough to the query point, and the M-split k-NN rules can be approximated by the intermediate rules up to a logarithmic overhead. The intermediate rules with the distance-selective aggregation scheme attain exact minimax optimal rates for the respective problems, at the cost of the additional complexity of ordering the NN distances.

Organization

The rest of the paper is organized as follows. We conclude this section by discussing related work. Section 2 presents the main results, with the formal definitions of the split NN rules and their theoretical guarantees for regression and classification. In Section 3, we discuss the computational complexity of the standard k-NN algorithms and the M-split k-NN rules, a refined aggregation scheme that removes the logarithmic factors in the previous guarantees, and a comparison with the bigNN classifier. We demonstrate the convergence rates of the split NN rules and their practicality over the standard k-NN rules with experimental results in Section 4. All proofs can be found in the Appendix.

1.1 Related Work

The asymptotic Bayes consistency and convergence rates of the k-NN classifier have been studied extensively in the last century (Fix & Hodges, 1951; Cover & Hart, 1967; Cover, 1968a, b; Wagner, 1971; Fritz, 1975; Gyorfi, 1981; Devroye et al., 1994; Kulkarni & Posner, 1995). More recent theoretical breakthroughs include a strongly consistent margin-regularized 1-NN classifier (Kontorovich & Weiss, 2015), a universally consistent sample-compression based NN classifier over a general metric space (Kontorovich et al., 2017; Hanneke et al., 2020), nonasymptotic analyses over Euclidean spaces (Gadat et al., 2016) and over doubling spaces (Dasgupta & Kpotufe, 2014), optimally weighted schemes (Samworth, 2012), stability (Sun et al., 2016), robustness against adversarial attacks (Wang et al., 2018; Bhattacharjee & Chaudhuri, 2020), and optimal classification with a query-dependent k (Balsubramani et al., 2019). For NN-based regression (Cover, 1968a, b; Dasgupta & Kpotufe, 2014, 2019), we mostly extend the analysis techniques of (Xue & Kpotufe, 2018; Dasgupta & Kpotufe, 2019); we refer the interested reader to the recent survey of Chen et al. (2018) for more refined analyses. For a more comprehensive treatment of k-NN based procedures, see (Devroye et al., 1996; Biau & Devroye, 2015) and the references therein.

The most closely related work is (Qiao et al., 2019) as mentioned above. In a similar spirit, Duan et al. (2020) analyzed a distributed version of the optimally weighted NN classifier of Samworth (2012). More recently, Liu et al. (2021) studied a distributed version of an adaptive NN classification rule of Balsubramani et al. (2019).

The idea of an ensemble predictor for enhancing the statistical power of a base classifier has long been known and extensively studied; see, e.g., (Hastie et al., 2009) for an overview. Among many ensemble techniques, bagging (Breiman, 1996) and pasting (Breiman, 1999) are the most closely related to this work. The goal of bagging, however, is mostly to improve accuracy by reducing variance when the sample size is small, and its bootstrapping step is computationally demanding in general; see (Hall & Samworth, 2005; Biau et al., 2010) for the properties of bagged 1-NN rules. The motivation and idea of pasting are similar to the split NN rules, but pasting iteratively evolves an ensemble classifier based on a prediction error estimated from random subsampling rather than from splitting samples. The split NN rules analyzed in this paper are non-iterative and specific to NN-based rules, and assume essentially no additional processing beyond splitting and averaging.

Beyond ensemble methods, there are other attempts to make NN-based rules scalable based on quantization (Kontorovich et al., 2017; Gottlieb et al., 2018; Kpotufe & Verma, 2017; Xue & Kpotufe, 2018; Hanneke et al., 2020) or regularization (Kontorovich & Weiss, 2015), where the common theme is to carefully select a subsample and/or preprocess the labels. We remark, however, that they typically involve onerous and rather complex preprocessing steps, which may not be suitable for large-scale data. Approximate NN (ANN) search algorithms (Indyk & Motwani, 1998; Slaney & Casey, 2008; Har-Peled et al., 2012) are yet another practical solution to reduce the query complexity, but ANN-search-based rules such as (Alabduljalil et al., 2013; Anastasiu & Karypis, 2019) hardly have any statistical guarantee (Dasgupta & Kpotufe, 2019), with few exceptions (Gottlieb et al., 2014; Efremenko et al., 2020). Gottlieb et al. (2014) proposed an ANN-based classifier for general doubling spaces with generalization bounds. More recently, Efremenko et al. (2020) proposed a locality sensitive hashing (Datar et al., 2004) based classifier with Bayes consistency and a strictly suboptimal rate guarantee. In contrast, this paper focuses on exact-NN-search based algorithms.

2 Main Results

Let (𝒳, ρ) be a metric space and let 𝒴 be the outcome (or label) space, i.e., 𝒴 = {0, 1} for binary classification; for regression, we assume 𝒴 ⊆ ℝ. We denote by P a joint distribution over 𝒳 × 𝒴, and we denote by μ and η the marginal distribution on 𝒳 and the regression function η(x) := E[Y | X = x], respectively.

We denote the open ball of radius r centered at x by B(x, r) and the closed ball by B̄(x, r). The support of a measure μ is denoted by supp(μ).

Given a sample D and a point x, we use X_(i)(x; D) to denote the i-th nearest neighbor of x among the sample instances in D and Y_(i)(x; D) to denote the corresponding i-th-NN label; any tie is broken arbitrarily. The i-th-NN distance of x is denoted by ρ_(i)(x; D) := ρ(x, X_(i)(x; D)). We will omit the reference set D whenever it is clear from the context.

Throughout the paper, we use n, M, and m = n/M to denote the size of the entire dataset, the number of data splits, and the size of each data split, respectively, assuming that M divides n.

2.1 Regression

2.1.1 Problem Setting

Given paired data (X_1, Y_1), ..., (X_n, Y_n) drawn independently from P, the goal of regression is to design an estimator η̂ based on the data such that the estimate η̂(x) is close to the conditional expectation η(x) = E[Y | X = x], where the closeness between η̂ and η is typically measured by the L_p-norm under μ for some p ≥ 1 or by the sup norm.

2.1.2 The Proposed Rule

Given a query x, the standard k-NN regression rule outputs the average of the k-NN labels, i.e., η̂_k(x) := (1/k) Σ_{i=1}^{k} Y_(i)(x).

Instead of running k-NN search over the entire data, given the number of splits M, we first split the data of size n into M subsets of equal size at random. Let D_1, ..., D_M denote the random subsets, where D_j corresponds to the j-th split. After finding the k-NN labels within each data split, the M-split k-NN (or (M, k)-NN in short) regression rule is defined as the average of all Mk returned labels, i.e., η̂_(M,k)(x) := (1/(Mk)) Σ_{j=1}^{M} Σ_{i=1}^{k} Y_(i)(x; D_j).
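For concreteness, the following is a minimal sketch of the (M, k)-NN regression rule just described, using scikit-learn's exact NN search; the function name and interface are illustrative and not taken from the paper's code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def split_knn_regress(X_train, y_train, X_query, M, k=1, rng=None):
    """M-split k-NN regression: split the data into M random groups of equal
    size, run k-NN search within each group, and average all M*k labels."""
    rng = np.random.default_rng(rng)
    splits = np.array_split(rng.permutation(len(X_train)), M)
    preds = np.zeros((len(X_query), M))
    for j, idx in enumerate(splits):
        nn = NearestNeighbors(n_neighbors=k).fit(X_train[idx])
        _, nbrs = nn.kneighbors(X_query)               # k-NN indices within split j
        preds[:, j] = y_train[idx][nbrs].mean(axis=1)  # average of the k labels
    # every split contributes exactly k labels, so the mean of the per-split
    # averages equals the average over all M*k returned labels
    return preds.mean(axis=1)
```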

2.1.3 Performance Guarantees

We can show that the (M, k)-NN regression rule is nearly optimal in terms of its error rate, under standard regularity conditions, for any fixed k.

For a formal statement, we borrow some standard assumptions on the metric measure space from the literature on analyzing k-NN algorithms (Dasgupta & Kpotufe, 2019).

Assumption 2.1 (Doubling and homogeneous measure).

The measure μ on the metric space (𝒳, ρ) is doubling with exponent d, i.e., for any x ∈ supp(μ) and r > 0, μ(B(x, 2r)) ≤ 2^d μ(B(x, r)).

The measure μ is homogeneous, i.e., μ(B(x, r)) ≥ C_d r^d for some constant C_d > 0, for any x ∈ supp(μ) and any sufficiently small r > 0.

Note that a measure is homogeneous if it is doubling and its support is bounded. The doubling exponent d can be interpreted as an intrinsic dimension of the measure space.

Assumption 2.2 (Hölder continuity).

The conditional expectation function η is (α, A)-Hölder continuous for some α ∈ (0, 1] and A > 0 in the metric space (𝒳, ρ), i.e., for any x, x' ∈ 𝒳, |η(x) − η(x')| ≤ A ρ(x, x')^α.

Assumption 2.3 (Bounded conditional expectation and variance).

The conditional expectation function η and the conditional variance function σ²(x) := Var[Y | X = x] are bounded, i.e., ‖η‖_∞ < ∞ and ‖σ²‖_∞ < ∞.

The following, stronger condition allows us to establish a high-probability bound.

Assumption 2.4.

The collection of closed balls in (𝒳, ρ) has finite VC dimension, and the outcome space 𝒴 is contained in a bounded interval.

The main goal of this paper is to demonstrate that the distributed (M, k)-NN rules can attain almost statistically equivalent performance to the optimal standard k-NN rules. Hence, our statements in what follows are written in parallel to the known results for the standard k-NN rules, to which we include pointers after "cf." for the interested reader.

Theorem 2.1 (cf. Dasgupta & Kpotufe (2019, Theorem 1.3); Xue & Kpotufe (2018, Theorem 1)).

Under Assumptions 2.1, 2.2, and 2.3, the following statements hold for any fixed k, where the constants are independent of the ambient dimension.

  (a) For any positive integers M and k such that Mk ≤ n, we have

  (b) If Assumption 2.4 further holds, then for any δ ∈ (0, 1), with probability at least 1 − δ over the data, we have

Remark 2.1 (Minimax optimality).

If we set M ≍ n^{2α/(2α+d)} (with k fixed) in Theorem 2.1, the rates become of order n^{−2α/(2α+d)}, up to logarithmic factors. This rate is known to be minimax optimal under Hölder continuity of order α (Dasgupta & Kpotufe, 2019).

2.2 Classification

2.2.1 Problem Setting

We consider binary classification with 𝒴 = {0, 1}. Given paired data (X_1, Y_1), ..., (X_n, Y_n) drawn independently from P, the goal of binary classification is to design a (data-dependent) classifier ĝ: 𝒳 → {0, 1} that minimizes the classification error P(ĝ(X) ≠ Y). For a classifier g, we define its pointwise risk at x as R(g; x) := P(g(X) ≠ Y | X = x), and define its (expected) risk as R(g) := P(g(X) ≠ Y). Let g* denote the Bayes-optimal classifier, i.e., g*(x) := 1{η(x) ≥ 1/2} for all x, and let R*(x) and R* denote the pointwise Bayes risk and the (expected) Bayes risk, respectively. The canonical performance measure of a classifier ĝ is its excess risk R(ĝ) − R*.

Another important performance criterion is the classification instability (CIS) proposed by Sun et al. (2016), which quantifies the stability of a classification procedure with respect to independent realizations of the training data. With a slight abuse of notation, denote by ĝ a classification procedure that maps a dataset D of size n to a classifier ĝ_D. The classification instability of the procedure is defined as CIS(ĝ) := P(ĝ_D(X) ≠ ĝ_{D'}(X)), where D and D' are independent, i.i.d. datasets of size n.

2.2.2 The Proposed Rule

The standard k-NN classifier is defined as the plug-in classifier of the standard k-NN regression estimate, ĝ_k(x) := 1{η̂_k(x) ≥ 1/2}.

It can be equivalently viewed as the majority vote over the k-NN labels given a query.

Similarly, we define the (M, k)-NN classification rule as the plug-in classifier ĝ_(M,k)(x) := 1{η̂_(M,k)(x) ≥ 1/2}.
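A minimal sketch of the corresponding plug-in classifier, reusing the hypothetical split_knn_regress helper above; for labels in {0, 1}, thresholding the averaged labels at 1/2 is exactly the majority vote over the Mk neighbor labels.

```python
def split_knn_classify(X_train, y_train, X_query, M, k=1, rng=None):
    """Plug-in (M, k)-NN classifier: threshold the (M, k)-NN regression
    estimate at 1/2 (equivalently, a majority vote over the M*k labels)."""
    eta_hat = split_knn_regress(X_train, y_train, X_query, M, k=k, rng=rng)
    return (eta_hat >= 0.5).astype(int)
```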

2.2.3 Performance Guarantees

As for regression in the previous section, we can show that the proposed (M, k)-NN classifier behaves nearly identically to the standard k-NN rule with Θ(Mk) neighbors, for any fixed k. Here, we focus on guarantees on the rates of the excess risk and the classification instability, but asymptotic Bayes consistency can also be established under a mild condition; see Theorem C.9 in the Appendix.

To establish rates of convergence for classification, we recall the following notion of smoothness for the conditional probability η, defined in Chaudhuri & Dasgupta (2014), which takes into account the underlying measure to better capture the nature of classification than the standard Hölder continuity in Assumption 2.2.

Assumption 2.5 (Smoothness).

For α > 0 and A > 0, η is (α, A)-smooth in the metric measure space (𝒳, ρ, μ), i.e., for all x ∈ supp(μ) and r > 0, |η(B(x, r)) − η(x)| ≤ A μ(B(x, r))^α, where η(B) denotes the conditional mean of Y over the ball B.

We further assume the following, stronger condition on the behavior of the measure around the decision boundary of the Bayes classifier, so that we can establish a fast rate of convergence.

Assumption 2.6 (Margin condition (Audibert et al., 2007)).

For β ≥ 0, (μ, η) satisfies the β-margin condition, i.e., there exists a constant C > 0 such that μ(∂_t) ≤ C t^β for all t > 0, where ∂_t := {x ∈ supp(μ) : |η(x) − 1/2| ≤ t} denotes the decision boundary with margin t.

Theorem 2.2 (cf. (Chaudhuri & Dasgupta, 2014, Theorem 4)).

Under Assumptions 2.5 and 2.6, the following statements hold for any fixed k, where the constants depend only on the parameters in the assumptions.

  (a) Pick any δ ∈ (0, 1) and any sufficiently large M. With probability at least 1 − δ over the data,

  (b) Pick any n and choose M accordingly. Then

Remark 2.2 (Minimax optimality).

Note that α-Hölder continuity with mild regularity conditions implies the smoothness of Assumption 2.5 (Chaudhuri & Dasgupta, 2014, Lemma 2). Hence, if we set M in Theorem 2.2(b) as in Remark 2.1, the resulting rates for the excess risk and the classification instability are known to be minimax optimal, up to logarithmic factors, under Hölder continuity (Chaudhuri & Dasgupta, 2014; Sun et al., 2016).

Remark 2.3 (Reduction to regression).

For a regression estimate η̂, let ĝ be the plug-in classifier with respect to η̂. Then, via the inequality R(ĝ) − R* ≤ 2 E|η̂(X) − η(X)|, the guarantees for the (M, k)-NN regression rule in Theorem 2.1 readily imply convergence rates for the excess risk (Dasgupta & Kpotufe, 2019), even in a multiclass classification scenario, by adapting the guarantee to a multivariate regression setting. The current statements, however, are more general results for binary classification that apply beyond smooth distributions, following the analysis of Chaudhuri & Dasgupta (2014).

3 Discussion

3.1 Computational Complexity

The standard k-NN rules are known to be asymptotically consistent only if k → ∞ (with k/n → 0) as n → ∞. Specifically, to attain minimax rate-optimality, k must grow polynomially in n, e.g., k ≍ n^{2α/(2α+d)} under Hölder continuity of order α; see Remarks 2.1 and 2.2. As alluded to earlier, this large-k requirement on the standard k-NN rules for statistical optimality may be problematic in practice.

To examine the complexity more carefully, consider the Euclidean space ℝ^d for a moment. Let T_k(m) denote the test-time complexity of a k-NN search algorithm for data of size m. The simplest baseline is brute-force search, whose time complexity per query is O(md), regardless of k. (Given a query point: (1) compute the distances from the dataset to the query, O(md); (2) find the k-th-NN distance using the introselect algorithm, O(m); (3) pick the k nearest neighbors, O(m).) For extremely large-scale data, however, even this linear scan may be unwieldy in practice. To reduce the complexity, several data structures specialized for NN search, such as KD-Trees (Bentley, 1975) for Euclidean data, and Metric Trees (Uhlmann, 1991) and Cover Trees (Beygelzimer et al., 2006) for non-Euclidean data, have been developed; see (Dasgupta & Kpotufe, 2019; Kibriya & Frank, 2007) for an overview and a comparison of the empirical performance of these specialized data structures for k-NN search. These are preferred over brute-force search for their better test-time complexity in moderate dimensions, but for much higher-dimensional data it is known that brute-force search may be faster. In particular, the most popular choice, a KD-Tree based search, typically answers 1-NN queries much faster than a linear scan in low dimensions. The time complexity of exact k-NN search remains comparable for moderately small k (one possible implementation of exact k-NN search with a KD-Tree is to repeatedly find and remove 1-NN points until k points are found; after the search, the removed points may be reinserted into the KD-Tree without affecting the overall complexity for moderate k), but for large k the time complexity could be worse than that of brute-force search.
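As a concrete illustration of the brute-force baseline described above, the sketch below computes exact k-NN indices with a linear scan plus an introselect-based partial selection (NumPy's argpartition); the function name is illustrative.

```python
import numpy as np

def brute_force_knn(X, query, k):
    """Exact k-NN by brute force: O(md) for the distances plus O(m) selection."""
    diffs = X - query                                 # differences to all m points
    sq_dists = np.einsum('ij,ij->i', diffs, diffs)    # squared Euclidean distances
    idx = np.argpartition(sq_dists, k - 1)[:k]        # introselect-based top-k (unordered)
    return idx[np.argsort(sq_dists[idx])]             # order the k neighbors by distance
```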

Thanks to its fully distributed nature, the (M, k)-NN classifier has a computational advantage over a standard NN classifier of nearly the same statistical power run over the entire data. Suppose that we split the data into M groups of equal size and that they can be processed by P parallel processors, where each processor ideally manages M/P data splits. Given the time complexity T_k(m) of a base k-NN search algorithm on m points, the (M, k)-NN algorithms have test-time complexity on the order of (M/P) T_k(n/M).

As stated in Section 2, the (M, k)-NN rules with P parallel units may attain the performance of the optimal standard k-NN rules run over the entire data on a single machine, with a relative speedup of roughly T_{Mk}(n) / ((M/P) T_k(n/M)); for instance, this amounts to about a factor of P with a brute-force search, and it can be even larger with a KD-Tree based search. Hence, most of the benefit of the proposed algorithm comes from its distributed nature, which reduces both time and storage complexity.

3.2 A Refined Aggregation Scheme

As alluded to earlier, we can remove the logarithmic factors in the guarantees of Theorems 2.1 and 2.2 with a refined aggregation scheme, which we call the distance-selective aggregation. With an additional hyperparameter L such that 1 ≤ L ≤ M, we take L estimates out of the total M values based on the (k+1)-th-NN distances from the query point to the instances of each data split. Formally, if D_(1), ..., D_(L) denote the data splits with the L smallest (k+1)-th-NN distances to the query point x, we take the partial average of the corresponding regression estimates: η̂_(M,L,k)(x) := (1/L) Σ_{l=1}^{L} η̂_k(x; D_(l)).

We call the resulting rule the M-split L-selective k-NN (or (M, L, k)-NN in short) regression rule and analogously define the (M, L, k)-NN classifier as the induced plug-in classifier, i.e., ĝ_(M,L,k)(x) := 1{η̂_(M,L,k)(x) ≥ 1/2}.

Intuitively, the scheme is designed to filter out possible outliers based on the (k+1)-th-NN distances, since a larger (k+1)-th-NN distance to the query point likely indicates that the returned estimate from the corresponding group is less reliable. (We remark that k-th-NN-distance-based aggregation also works for regression; the choice of the (k+1)-th-NN distance instead of the k-th-NN distance is due to a technical reason for classification; see Lemma C.4 in the Appendix.)
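A minimal sketch of the (M, L, k)-NN regression rule as described above, using the (k+1)-th-NN distance as the selection score; function and variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def split_select_knn_regress(X_train, y_train, X_query, M, L, k=1, rng=None):
    """(M, L, k)-NN rule: among the M splits, keep the L splits whose
    (k+1)-th-NN distance to the query is smallest, then average the
    k-NN regression estimates of those L splits only."""
    rng = np.random.default_rng(rng)
    splits = np.array_split(rng.permutation(len(X_train)), M)
    est = np.zeros((len(X_query), M))       # per-split k-NN estimates
    score = np.zeros((len(X_query), M))     # per-split (k+1)-th-NN distances
    for j, idx in enumerate(splits):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train[idx])
        dists, nbrs = nn.kneighbors(X_query)
        est[:, j] = y_train[idx][nbrs[:, :k]].mean(axis=1)
        score[:, j] = dists[:, k]           # distance to the (k+1)-th neighbor
    keep = np.argsort(score, axis=1)[:, :L] # L splits with the smallest scores
    return np.take_along_axis(est, keep, axis=1).mean(axis=1)
```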

We analyze the distance-selective schemes in disguise when proving the main statements, and thus the proofs of the following statements are omitted.

Proposition 3.1.

Under Assumptions 2.1, 2.2, and 2.3, for any fixed k and any L and M such that L ≤ M,

Proposition 3.2.

Under Assumptions 2.5 and 2.6, if we choose L and M appropriately, then for any fixed k,

Note that the refined schemes are indeed minimax rate-optimal without the extra logarithmic factors.

3.3 Comparison to (Qiao et al., 2019)

The bigNN classifier proposed by Qiao et al. (2019) takes the majority vote over M labels, each of which is the output of the standard k-NN classifier on one data split. Qiao et al. (2019) showed that the bigNN classifier is minimax rate-optimal only when k grows to infinity. Their argument is based on the observation that the k-NN classification result from each subset of data becomes consistent as k grows, and that taking a majority vote over consistent guesses will likely result in a consistent guess.

In contrast, the (M, k)-NN classifier analyzed in this paper takes the majority vote over all Mk returned labels. We remark, however, that since the two algorithms become equivalent in the most practical case of k = 1, and both schemes showed similar performance for small k's in our experiments (data not shown), the key contribution is in our analysis rather than in the algorithmic details. Unlike Qiao et al. (2019), we establish rate optimality for any fixed k, as long as M grows properly, by directly showing that the k-NN labels collected over the M subsets are almost statistically equivalent to the NN labels over the entire data.

Figure 1: Summary of excess risks from the mixture of two Gaussians experiments.

4 Experiments

The goal of the experiments in this section is twofold. First, using a synthetic dataset, we show that the simulated convergence rates of the (M, k)-NN rules for small k, say k = 1, are polynomial, as predicted by theory. Second, we demonstrate with real-world datasets that their practical performance is competitive with that of the standard k-NN rules, while generally reducing both the validation complexity for model selection and the test complexity. In both experiments, we also report the performance of the (M, L, k)-NN rules (as alluded to earlier, we used the k-th-NN distance instead of the (k+1)-th-NN distance in the experiments for the distance-selective classification rule, for simplicity) to examine the effect of distance-selective aggregation.

Computing resources  For each experiment, we used a single machine with one of the following CPUs: (1) Intel(R) Core(TM) i7-9750H CPU 2.60GHz with 12 (logical) cores or (2) Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz with 28 (logical) cores.

Implementation  All implementations were based on Python 3.8; we used the NN search algorithms implemented in the scikit-learn package (Pedregosa et al., 2011), ver. 0.24.1, and utilized multiple processors via the Python standard package multiprocessing. The code for the experiments can be found in the Supplementary Material.


Dataset                                      |   Error (% for classification)   |      Test time (s)        | Valid. time (s)
                                             | 1-NN   k-NN   (M,1)-NN           | 1-NN  k-NN  (M,1)-NN      | k-NN  (M,1)-NN
GISETTE (Guyon et al., 2004)                 | 7.26   4.54   5.11 (4.86)        | 6.13  5.75  6.79 (6.18)   | 52    262 (270)
  w/ brute-force                             | -      -      -                  | 0.30  0.26  1.20 (2.06)   | 38    200 (207)
HTRU2 (Lyon et al., 2016)                    | 2.91   2.18   2.08 (2.28)        | 0.18  0.18  0.04 (0.04)   | 18    8 (10)
Credit (Dua & Graff, 2019)                   | 26.73  18.68  18.65 (18.93)      | 0.85  1.2   0.2 (0.2)     | 122   25 (29)
MiniBooNE (Dua & Graff, 2019)                | 13.72  10.63  10.69 (10.62)      | 1.68  2.42  0.98 (0.94)   | 264   88 (92)
SUSY (Baldi et al., 2014)                    | 28.27  20.32  20.55 (20.52)      | 32    35    14 (13)       | 3041  1338 (1362)
BNG(letter,1000,1) (Vanschoren et al., 2013) | 46.13  40.88  41.53 (40.72)      | 379   350   17 (14)       | 2868  619 (959)
YearPredictionMSD (Dua & Graff, 2019)        | 7.22   6.72   6.79 (6.75)        | 33    31    40 (34)       | 1616  431 (412)
  w/ brute-force                             | -      -      -                  | 15    18    3.5 (3.6)     | 1529  300 (336)

Table 1: Summary of experiments with the benchmark datasets. YearPredictionMSD in the last rows is a regression dataset, so its error is not a percentage. Recall that (M, 1)-NN is shorthand for the M-split 1-NN rules. The values in parentheses correspond to the distance-selective (M, L, 1)-NN rules.

4.1 Simulated Dataset

We first evaluated the performance of the proposed classifier on synthetic data, following Qiao et al. (2019). We consider an equal-weight mixture of two isotropic Gaussians, N(μ₀, σ²I) and N(μ₁, σ²I), where I denotes the identity matrix. We tested three settings of the mixture parameters with five different sample sizes. We evaluated the (M, 1)-NN rule and the (M, L, 1)-NN rule with M chosen as a function of the sample size, and for comparison we also ran the standard k-NN algorithm with a correspondingly chosen k. We repeated the experiments with 10 different random seeds and report the averages and standard deviations.
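For illustration, here is a minimal sketch of this type of experiment, reusing the hypothetical split_knn_classify helper from Section 2.2.2; the dimension, mean separation, and sample sizes below are placeholders, not the values used in the paper.

```python
import numpy as np

def sample_two_gaussians(n, d=5, delta=2.0, seed=None):
    """Equal-weight mixture of two isotropic Gaussians whose means differ
    by delta in the first coordinate; the label is the mixture component."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    means = np.zeros((2, d))
    means[1, 0] = delta
    X = rng.standard_normal((n, d)) + means[y]
    return X, y

X_tr, y_tr = sample_two_gaussians(10_000, seed=0)
X_te, y_te = sample_two_gaussians(2_000, seed=1)
M = int(round(len(X_tr) ** 0.6))   # placeholder scaling of M with the sample size
pred = split_knn_classify(X_tr, y_tr, X_te, M=M, k=1, rng=0)
print("(M,1)-NN test error:", np.mean(pred != y_te))
```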

The excess risks are plotted in Figure 1. We note that the (M, 1)-NN classifier performs similarly to the baseline k-NN classifier across the different settings, and that the performance can be further improved by the (M, L, 1)-NN classifier. This implies that discarding possibly noisy information in the aggregation can actually improve the performance of the ensemble classifier. Note also that the convergence of the excess risks of the standard k-NN classifier and the split NN classifiers is polynomial, as indicated by the straight lines, as predicted by theory.

Figure 2: Validation error profiles from 10-fold cross validation.

4.2 Real-world Datasets

We evaluated the proposed rules on publicly available benchmark datasets from the UCI machine learning repository (Dua & Graff, 2019) and the OpenML repository (Vanschoren et al., 2013), which were also used in (Xue & Kpotufe, 2018) and (Qiao et al., 2019); see Table 2 for the size, feature dimension, and number of classes of each dataset. All data were standardized to have zero mean and unit variance; the details of the data preprocessing can be found in the attached code.


Dataset                                      | # training | # dim. | # classes
GISETTE (Guyon et al., 2004)                 | 7k         | 5k     | 2
HTRU2 (Lyon et al., 2016)                    | 18k        | 8      | 2
Credit (Dua & Graff, 2019)                   | 30k        | 23     | 2
MiniBooNE (Dua & Graff, 2019)                | 130k       | 50     | 2
SUSY (Baldi et al., 2014)                    | 5000k      | 18     | 2
BNG(letter,1000,1) (Vanschoren et al., 2013) | 1000k      | 17     | 26
YearPredictionMSD (Dua & Graff, 2019)        | 463k       | 90     | 1 (regression)

Table 2: Summary of the benchmark datasets (training size, feature dimension, and number of classes).

We tested four algorithms. The first two are (1) the standard 1-NN rule and (2) the standard k-NN rule with 10-fold cross-validation (CV) of k over an exponential grid that scales with the size of the training data. The other two are (3) the (M, 1)-NN rule and (4) the (M, L, 1)-NN rule, both with 10-fold CV over an analogous grid of M. We repeated the experiments with 10 different random (0.95, 0.05) train-test splits and evaluated only an initial portion of the test data to reduce the simulation time. Table 1 summarizes the test errors, test times, and validation times. (Here, we used a KD-Tree based NN search by default. However, since a KD-Tree based algorithm suffers from the curse of dimensionality (recall Section 3.1), we ran additional trials with a brute-force search for the high-dimensional datasets GISETTE and YearPredictionMSD, whose feature dimensions are 5000 and 90, respectively, and report the corresponding times in the subsequent rows of Table 1.)

The optimally tuned (M, 1)-NN rules consistently performed as well as the optimal standard k-NN rules, even running faster than the standard 1-NN rules in the test phase. We remark that the optimally tuned (M, L, 1)-NN rules (i.e., with the distance-selective aggregation) performed almost identically to the (M, 1)-NN rules, except for slight error improvements observed on the high-dimensional datasets.
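The following sketch illustrates how the 10-fold CV over the number of splits M can be organized; the grid and helper names (including the hypothetical split_knn_classify above) are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_select_M(X, y, M_grid, k=1, n_folds=10, seed=0):
    """Return the 10-fold CV error of the (M, k)-NN classifier for each candidate M."""
    cv_errors = {}
    for M in M_grid:
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
        errs = []
        for tr, va in kf.split(X):
            pred = split_knn_classify(X[tr], y[tr], X[va], M=M, k=k, rng=seed)
            errs.append(np.mean(pred != y[va]))
        cv_errors[M] = float(np.mean(errs))
    return cv_errors  # choose the M minimizing the CV error

# e.g., an exponential grid of candidate M values (illustrative):
# M_grid = [2 ** j for j in range(1, 12)]
```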

We additionally present Figure 2, which summarizes the validation error profiles from the 10-fold CV procedures; here, as expected, the optimal M chosen for the (M, 1)-NN rules is of the same order as the optimal k for the standard k-NN rules.

5 Concluding Remarks

In this paper, we established the near statistical optimality of the (M, k)-NN rules in the fixed-k regime, which makes the sample-splitting-based NN rules more appealing for practical scenarios with large-scale data. We also removed the logarithmic factors via the distance-selective aggregation and exhibited some performance boost in experimental results; it is an open question whether the logarithmic factor is fundamental for the vanilla (M, k)-NN rules or can be removed by a tighter analysis. Supported by both theoretical guarantees and empirical results, we believe that the (M, k)-NN rules, especially for k = 1, can be widely deployed in practical systems and deserve further study, including an optimally weighted version of the classifier as studied in (Duan et al., 2020).

We conclude with remarks on a seeming connection between the proposed distance-selective aggregation and k-NN based outlier detection methods. Ramaswamy et al. (2000) and Angiulli & Pizzuti (2002) proposed to use the k-NN distance, or basic statistics such as the mean or median of the k-NN distances to a query point, as an outlier score; a recent paper (Gu et al., 2019) analyzed these schemes. In view of this line of work, the split select NN rules can be understood as a selective ensemble of inliers based on the NN distances. It would be an interesting direction to investigate an NN-based outlier detection method for large-scale datasets, extending the idea of the distance-selective aggregation.

Acknowledgements

The authors appreciate insightful feedback from anonymous reviewers to improve earlier versions of the manuscript. JR would like to thank Alankrita Bhatt, Sanjoy Dasgupta, Yung-Kyun Noh, and Geelon So for their discussion and comments on the manuscript. This work was supported in part by the National Science Foundation under Grant CCF-1911238.

Overview of Appendix

In the Appendix, we prove the statements (Theorem 2.1 for regression and Theorem 2.2 for classification) from the main text. For both regression and classification, the key idea in our analysis of the (M, k)-NN rules is to consider the (M, L, k)-NN rules of Section 3.2 as a proof device. The analysis relies on the observations that (1) the (M, k)-NN rules can be closely approximated by the (M, L, k)-NN rules up to a logarithmic overhead, and (2) the (M, L, k)-NN rules attain minimax optimality for any fixed k and suitably chosen L, as long as M is chosen properly.

The rest of the Appendix is organized as follows. In Appendix A, we state and prove a key technical lemma for analyzing the distributed NN rules. As the regression rules are easier to analyze, we prove Theorem 2.1 in Appendix B. The proof of Theorem 2.2 is presented in Appendix C, including an additional statement on Bayes consistency.

Appendix A A Key Technical Lemma

We first restate a simple yet important observation on the k nearest neighbors by Chaudhuri & Dasgupta (2014): with high probability, the k nearest neighbors of a point x lie in a ball centered at x of small probability mass. We define the probability radius of mass p centered at x as the minimum possible radius of a closed ball containing probability mass at least p, that is, r_p(x) := inf{ r ≥ 0 : μ(B̄(x, r)) ≥ p }.

Lemma A.1 (Chaudhuri & Dasgupta, 2014, Lemma 8).

Pick any x, any p ∈ (0, 1), any γ ∈ (0, 1), and any positive integers k and m such that k ≤ (1 − γ)mp. If X_1, ..., X_m are drawn i.i.d. from μ, then

We now state an analogous version of the above lemma for our analysis of the split NN rules. The following lemma quantifies that, with probability exponentially high in M over the split instances, the k nearest neighbors of the query point within the data splits selected by the (k+1)-th-NN distances will lie within a ball of small probability mass around the query point.

Lemma A.2.

Pick any positive integer k, and set the probability mass p accordingly. If the data splits D_1, ..., D_M are independent, we have

for any x in the support of μ.

Proof.

Define

so that we can write . Note that for any and .

For each data split indexed by j, we define a bad event E_j as the event that the (k+1)-th-NN distance from the query point within D_j exceeds the probability radius r_p. Observe that E_j occurs if and only if the closed ball of probability mass p centered at the query point contains fewer than k + 1 points from D_j. By Lemma A.1, the probability of each bad event is upper bounded by a quantity, say q, determined by the choice of p. Now, since the data splits are independent, the indicators of E_1, ..., E_M form a sequence of independent Bernoulli random variables with parameter at most q. Hence, we have

where B denotes a binomial random variable with parameters M and q. Another application of the multiplicative Chernoff bound to the right-hand side concludes the desired bound. ∎
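For reference, a standard form of the multiplicative Chernoff bound that suffices for this kind of step (stated as a generic fact; the exact constants used in the original proof may differ):

```latex
% Multiplicative Chernoff bounds for B ~ Binomial(M, q) and 0 < \gamma \le 1:
\Pr[B \ge (1+\gamma) M q] \le \exp\!\left(-\tfrac{\gamma^2 M q}{3}\right),
\qquad
\Pr[B \le (1-\gamma) M q] \le \exp\!\left(-\tfrac{\gamma^2 M q}{2}\right).
```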

With this lemma, analyzing the distance-selective aggregation method only requires a slight modification of the existing analyses of the standard -NN algorithms.

Appendix B Regression: Proof of Theorem 2.1

B.1 Proof of Theorem 2.1(a)

This analysis extends the proof of (Dasgupta & Kpotufe, 2019, Theorem 1.3). Let D denote the entire dataset and let D_1, ..., D_M denote its splits, each of size m = n/M.

Step 1. Error decomposition

Recall that we wish to bound the expected squared error of the (M, k)-NN regression estimate; here, the inequality follows from Jensen's inequality. We will consider the (M, L, k)-NN regression rule, with L to be determined, as a proof device. Pick any query point x. We denote the conditional expectation of the regression estimate by η̃(x), where the expectation is over the Y-values given the data splits. Then, we decompose the squared error into three terms using an elementary inequality for the square of a sum. Taking the expectation over the Y-values given the data splits, we have

(B.1)

We now bound the three terms separately in the next steps.

Step 2(A). Variance term

Consider the variance term, which is bounded as in (B.2). Here, (a) follows from the independence of the Y-values conditioned on the splits, and (b) follows from the assumption that the conditional variance is bounded for all x.

Step 2(B). Approximation term

We claim that the second term, which captures the approximation error, can be bounded as follows. We have