1 Introduction
1.1 The $k$-NN Algorithm
The $k$-nearest neighbor algorithm ($k$-NN) is a non-parametric method used for classification and regression. For a given sample of pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ and a query point $x$, the $k$-NN algorithm outputs
(1.1)  $\hat f_k(x) \;=\; \frac{1}{k}\sum_{i \in N_k(x)} Y_i$
as an estimate of $\mathbb{E}[Y \mid X = x]$, where $N_k(x)$ is the set of indices of the $k$ nearest neighbors of $x$ among the $X_i$'s.
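As an illustration, here is a minimal base-R sketch of the estimator in (1.1); the function name knn_predict, the Euclidean metric and the toy data are our own choices, and ties are broken by position rather than uniformly at random:

    # k-NN regression estimate as in (1.1): average the responses of the k
    # training points closest to the query x (Euclidean distance).
    knn_predict <- function(x, X, Y, k) {
      d  <- sqrt(colSums((t(X) - x)^2))  # distances from x to every X_i
      nn <- order(d)[seq_len(k)]         # indices of the k nearest neighbors
      mean(Y[nn])                        # the k-NN estimate at x
    }

    # toy usage with made-up data
    set.seed(1)
    X <- matrix(runif(200), ncol = 2)        # 100 points in [0,1]^2
    Y <- rowSums(X) + rnorm(100, sd = 0.1)
    knn_predict(c(0.5, 0.5), X, Y, k = 5)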
The choice of $k$ is very important. For small values of $k$, the $k$-NN estimator has high variance and may overfit the noise. As $k$ grows, the estimator becomes less flexible and therefore more biased.
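To make the tradeoff concrete, here is the standard conditional bias–variance decomposition at a fixed query point $x$, under the regression model $Y_i = f(X_i) + \varepsilon_i$ of Section 2, specialized to homoskedastic noise with variance $\sigma^2$ that is independent of the $X_i$'s:
$\mathbb{E}\Big[\big(\hat f_k(x) - f(x)\big)^2 \,\Big|\, X_1, \ldots, X_n\Big] \;=\; \Big(\frac{1}{k}\sum_{i \in N_k(x)} f(X_i) - f(x)\Big)^2 \;+\; \frac{\sigma^2}{k} .$
The variance term $\sigma^2/k$ shrinks as $k$ grows, while the squared bias term typically grows, because more distant neighbors enter the average.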
1.2 Related Literature
The consistency and asymptotic behaviour of $k$-NN regression and classification have been studied by many researchers. In [2, 1], the authors provide necessary and sufficient conditions on $k$ for the $k$-NN estimator to be consistent. In [12], the author has shown asymptotic normality of the $k$-NN estimator. The rate of convergence of this estimator has also been studied under assumptions on the density of the $X_i$'s [3, 7, 5] or under a Lipschitz assumption on the unknown function [9].
In applications, an optimal choice of $k$ depends on the sample. Assuming that the variance of the noise, $\sigma^2$, is known and that $f$ is Lipschitz with a known constant, Guerre [4] suggested a choice of $k$ as a function of the sample and provided a non-asymptotic bound on the mean squared error of the proposed $k$-NN estimator conditional on the $X_i$'s. Although he did not assume independence of the $X_i$'s, the assumptions on $\sigma^2$ and $f$ seem too strong for many real-data applications.
In practice, a common approach for choosing the value of $k$ is cross-validation [8]. In [10], Li showed that the $k$-NN estimator using the $k$ chosen by leave-one-out cross-validation (LOOCV) is asymptotically consistent. This result holds under the assumption that the distribution of the $X_i$'s fulfills two regularity conditions that together imply that the $X_i$'s are dense in their support in a uniform way. Although this result is stronger than the results on the consistency of the $k$-NN estimator for non-random choices of $k$, it does not show why the LOOCV choice of $k$ is a competitive one. More precisely, given other values of $k$ that give a consistent $k$-NN estimate, [10] does not show why one should prefer the value of $k$ chosen by LOOCV.
1.3 Our Work
In this work, we study the $k$-NN estimator with $k$ chosen by LOOCV. Although it has been shown previously that this choice gives a consistent estimator, it has not been shown that this choice of $k$ is optimal. In this paper we compare the mean squared error of the resulting $k$-NN estimator with the minimum mean squared error achievable by the $k$-NN algorithm, where the minimum is taken over all choices of $k$, and we show that with high probability the two are very close. In Section 2 we describe the setting and state the main result. In Section 3 we discuss a simulated example. Finally, we provide all the proofs in Section 4.
1.4 Notation
We use boldface for vectors and matrices, e.g.
$\mathbf{A}$ for a matrix and $\mathbf{v}$ for a vector. For simplicity, we write $[n]$ for $\{1, \ldots, n\}$, the set of natural numbers less than or equal to $n$. For given points $x_1, \ldots, x_n \in \mathbb{R}^d$, a query point $x \in \mathbb{R}^d$, and $k \in [n]$, we define $N_k(x)$ to be the set of indices of the $k$ nearest neighbors of $x$ among the $x_i$'s. Ties are broken uniformly at random. For a matrix $\mathbf{A} = (a_{ij})$, the $\ell_2$-operator norm and the Frobenius norm are defined as
$\|\mathbf{A}\|_2 \;=\; \sup_{\|\mathbf{v}\|_2 = 1} \|\mathbf{A}\mathbf{v}\|_2, \qquad \|\mathbf{A}\|_F \;=\; \Big(\sum_{i,j} a_{ij}^2\Big)^{1/2}.$
Throughout the paper, $c$, $c'$ and $C$ denote positive absolute constants.
2 Main result
Let $\{(X_i, Y_i)\}$ be a set of pairs of observations for $i \in [n]$, with $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$. For each $i$, we assume that $Y_i = f(X_i) + \varepsilon_i$, where $f$ is an unknown continuous function and the $X_i$'s are drawn independently from an unknown distribution. For simplicity, let $D_n = \{(X_i, Y_i) : i \in [n]\}$ denote the sample. We assume that the $X_i$'s are sub-Gaussian with sub-Gaussian norm bounded by $K_X$.
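One standard convention for the sub-Gaussian norm appearing in these assumptions (equivalent, up to absolute constants, to the moment-based definition used in [11]) is
$\|Z\|_{\psi_2} \;=\; \inf\big\{ t > 0 \,:\, \mathbb{E}\exp\big(Z^2/t^2\big) \le 2 \big\}$
for a real-valued random variable $Z$, with $\|X\|_{\psi_2} = \sup_{\|u\|_2 = 1} \big\|\langle X, u\rangle\big\|_{\psi_2}$ for a random vector $X \in \mathbb{R}^d$.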
The noise variables $\varepsilon_i$ are independent, mean-zero, sub-Gaussian random variables, independent of the $X_i$'s, with sub-Gaussian norm upper bounded by $K_\varepsilon$; we write $\sigma_i^2 = \mathbb{E}[\varepsilon_i^2]$. For $k \in [n]$, the $k$-NN estimate of $f(x)$ at a point $x \in \mathbb{R}^d$, given the sample $D_n$, is
(2.1)  $\hat f_k(x) \;=\; \frac{1}{k}\sum_{j \in N_k(x)} Y_j .$
The mean squared error of this estimate is
(2.2)  $MSE(k) \;=\; \mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^{n}\big(\hat f_k(X_i) - f(X_i)\big)^2\Big],$
where the expectation is with respect to the joint distribution of $(X_1, \ldots, X_n)$ and $(\varepsilon_1, \ldots, \varepsilon_n)$. In practice, we do not know the probability distribution of the $X_i$'s, so we cannot compute $MSE(k)$. Instead, we can use the given data to estimate it. For each $i \in [n]$, let $N_{k,(i)}$ be the set of indices of the $k$ nearest neighbors of $X_i$ among $\{X_j : j \neq i\}$. Note that in defining $N_{k,(i)}$ we exclude $X_i$ itself from the set of points among which its neighbors are searched.
Define
(2.3)  $\hat f_{k,(i)}(X_i) \;:=\; \frac{1}{k}\sum_{j \in N_{k,(i)}} Y_j .$
One may wish to find the best value of $k$ for the $k$-NN estimate, namely $k^* := \operatorname{arg\,min}_{k \in [n]} MSE(k)$. But since the distribution of the $X_i$'s, the distribution of the $\varepsilon_i$'s and the function $f$ are all unknown, in practice we cannot find $k^*$. Instead, for each $k \in [n]$ we define
(2.4)  $\widehat{MSE}(k) \;:=\; \frac{1}{n}\sum_{i=1}^{n}\big(Y_i - \hat f_{k,(i)}(X_i)\big)^2 ,$
and we set
(2.5)  $\hat k \;:=\; \operatorname{arg\,min}_{k \in [n]} \widehat{MSE}(k).$
In the Statistics and Machine Learning literature, $\widehat{MSE}(k)$ is known as the leave-one-out cross-validation (LOOCV) estimate of the mean squared error. Note that $\widehat{MSE}(\cdot)$ is a random function (the randomness comes from the dependence of $\hat f_{k,(i)}$ on the $(X_j, Y_j)$'s), and therefore $\hat k$ is a random variable. For each given sample, we can compute $\widehat{MSE}(k)$ and hence $\hat k$. Therefore a simple idea is to use $k = \hat k$ in the $k$-NN algorithm. Note that in practice the distance between $\hat k$ and $k^*$ is not of main importance to us. The main question is: how far is $MSE(\hat k)$ from $MSE(k^*)$? Theorem 2.1 gives a probability tail bound on $MSE(\hat k) - MSE(k^*)$ that answers this question.
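To make definitions (2.3)–(2.5) concrete, here is a small base-R sketch that computes the LOOCV curve $\widehat{MSE}(k)$ and the selected $\hat k$ by brute force; the function name loocv_knn and the use of Euclidean distance are our own choices:

    # LOOCV estimate of the mean squared error for k-NN regression, as in
    # (2.3)-(2.5): for each i the neighbors of X_i are searched among the
    # remaining points only, so X_i never counts as its own neighbor.
    loocv_knn <- function(X, Y, kmax = nrow(X) - 1) {
      n <- nrow(X)
      D <- as.matrix(dist(X))          # pairwise Euclidean distances
      diag(D) <- Inf                   # exclude each point from its own neighbor set
      ord <- apply(D, 1, order)        # column i: neighbors of X_i, nearest first
      mse_hat <- sapply(seq_len(kmax), function(k) {
        pred <- sapply(seq_len(n), function(i) mean(Y[ord[1:k, i]]))  # LOO prediction at X_i
        mean((Y - pred)^2)             # LOOCV error for this k
      })
      list(k_hat = which.min(mse_hat), mse_hat = mse_hat)             # selected k and the curve
    }

For moderate $n$ this brute-force implementation is quadratic in $n$; in practice one would use tree-based or approximate nearest-neighbor searches.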
Theorem 2.1.
For a set of observations $\{(X_i, Y_i)\}_{i \in [n]}$ from an unknown joint distribution, the choice $k = k^*$ for the $k$-NN estimate gives the minimum mean squared error over all possible choices of $k$, but $k^*$ is typically not computable. Instead, Theorem 2.1 guarantees that using $\hat k$ gives, with high probability, an $MSE(\hat k)$ that is very close to $MSE(k^*)$. This shows that not only does $\hat k$ give us a consistent estimator, but it is an optimal choice as well.
3 Discussion and Simulations
We should emphasize that in computing $\widehat{MSE}(k)$ we exclude each $X_i$ from the set of points among which its neighbors are searched, so that $X_i$ is never counted as one of its own nearest neighbors. This is in fact very important and prevents us from choosing a value of $k$ that suffers from overfitting. The following example helps to see this better.
Example 3.1.
For $n = 1000$ we generated the $X_i$'s i.i.d. from a fixed distribution and the $\varepsilon_i$'s i.i.d. from a mean-zero noise distribution, and set $Y_i = f(X_i) + \varepsilon_i$ for a known function $f$. For this sample we computed $\hat k$ and $\widehat{MSE}(\hat k)$. Now let $N_k(X_i)$ be the set of the $k$ nearest neighbors of $X_i$ among all of $X_1, \ldots, X_n$ (so that $X_i$ is allowed to be its own nearest neighbor) and define
(3.1)  $\hat k_0 \;:=\; \operatorname{arg\,min}_{2 \le k \le n} \frac{1}{n}\sum_{i=1}^{n}\Big(Y_i - \frac{1}{k}\sum_{j \in N_k(X_i)} Y_j\Big)^2 .$
Note that we take the minimum over $k \ge 2$, since clearly for $k = 1$ the sum on the right-hand side of 3.1 is equal to zero. For our sample we compared $\widetilde{MSE}(1)$, $\widetilde{MSE}(\hat k_0)$ and $\widetilde{MSE}(\hat k)$, where $\widetilde{MSE}(k)$ is the empirical estimate of $MSE(k)$,
$\widetilde{MSE}(k) \;=\; \frac{1}{n}\sum_{i=1}^{n}\big(\hat f_{k,(i)}(X_i) - f(X_i)\big)^2 ,$
which is computable here because $f$ is known.
It is clear that by choosing $k = 1$ we overfit to the noise, and therefore the estimated mean squared error at $k = 1$ is much higher than that of the two other values of $k$.
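The following base-R sketch runs a simulation in the spirit of Example 3.1; the design distribution, the noise level and the function $f$ below are illustrative choices of ours, not necessarily those used in the example:

    # Compare the LOOCV choice of k with the choice obtained by minimizing the
    # resubstitution error (3.1), in which X_i may be its own nearest neighbor.
    set.seed(42)
    n <- 1000
    X <- matrix(runif(n), ncol = 1)        # illustrative design
    f <- function(x) sin(2 * pi * x)       # illustrative regression function
    Y <- f(X[, 1]) + rnorm(n, sd = 0.5)    # illustrative noise level

    D <- as.matrix(dist(X))
    D_loo <- D; diag(D_loo) <- Inf
    ord_loo <- apply(D_loo, 1, order)      # neighbors excluding the point itself
    ord_all <- apply(D, 1, order)          # neighbors including the point itself

    kmax <- 100
    pred <- function(ord, k) sapply(seq_len(n), function(i) mean(Y[ord[1:k, i]]))
    mse_hat   <- sapply(1:kmax, function(k) mean((Y - pred(ord_loo, k))^2))          # LOOCV error (2.4)
    mse_resub <- sapply(1:kmax, function(k) mean((Y - pred(ord_all, k))^2))          # criterion in (3.1)
    mse_emp   <- sapply(1:kmax, function(k) mean((pred(ord_loo, k) - f(X[, 1]))^2))  # empirical MSE

    k_hat  <- which.min(mse_hat)           # LOOCV choice
    k_hat0 <- which.min(mse_resub[-1]) + 1 # resubstitution choice over k >= 2
    c(k_hat = k_hat, k_hat0 = k_hat0,
      emp_mse_at_1 = mse_emp[1], emp_mse_at_k_hat = mse_emp[k_hat])

The resubstitution error is exactly zero at $k = 1$, which is why (3.1) restricts the minimization to $k \ge 2$.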

In Figure 1 we have plotted $\widehat{MSE}(k)$ (solid line) and $\widetilde{MSE}(k)$ (dashed line) over a range of values of $k$. It can be seen that the behaviour of these two curves is very similar. In fact, as we expect from equality 4.1 in Section 4, they differ essentially only by a constant: when the $\varepsilon_i$'s all have the same variance $\sigma^2$, the point-wise difference of these two curves is approximately equal to $\sigma^2$. This shows that looking at the curve of $\widehat{MSE}(k)$ gives us almost the same information as looking at the curve of $\widetilde{MSE}(k)$. Therefore computing $\widehat{MSE}(k)$ is enough for finding the optimal choice of $k$.
[Figure 1: $\widehat{MSE}(k)$ (solid) and $\widetilde{MSE}(k)$ (dashed) plotted against $k$.]
4 Proofs
Proof of Theorem 2.1.
By writing $Y_i = f(X_i) + \varepsilon_i$, we have
$\widehat{MSE}(k) \;=\; \frac{1}{n}\sum_{i=1}^{n}\big(f(X_i) - \hat f_{k,(i)}(X_i)\big)^2 \;+\; \frac{2}{n}\sum_{i=1}^{n}\varepsilon_i\big(f(X_i) - \hat f_{k,(i)}(X_i)\big) \;+\; \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 .$
Since the $\varepsilon_i$'s are independent with mean zero (and independent of the $X_i$'s), taking the expectation of the above equality gives us
(4.1)  $\mathbb{E}\big[\widehat{MSE}(k)\big] \;=\; \mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^{n}\big(f(X_i) - \hat f_{k,(i)}(X_i)\big)^2\Big] \;+\; \frac{1}{n}\sum_{i=1}^{n}\sigma_i^2 .$
Define $g(k) := \mathbb{E}\big[\widehat{MSE}(k)\big]$ for $k \in [n]$. For each $k$, $g(k)$ is a deterministic quantity and does not depend on a given sample. Remember that $\hat k$ is a function of the given sample and is therefore random. So $g(\hat k)$ depends on the given sample as well and is therefore a random variable. By the definition of $\hat k$, $\widehat{MSE}(\hat k) \le \widehat{MSE}(k)$ for all $k \in [n]$, and in particular $\widehat{MSE}(\hat k) \le \widehat{MSE}(k^*)$. This gives us
$g(\hat k) - g(k^*) \;\le\; \big(g(\hat k) - \widehat{MSE}(\hat k)\big) + \big(\widehat{MSE}(k^*) - g(k^*)\big).$
Also, by the definition of $k^*$, we have $MSE(k^*) \le MSE(\hat k)$. Putting these two together gives us
(4.2)
Therefore, to obtain a tail bound for $MSE(\hat k) - MSE(k^*)$, it is enough to obtain a tail bound for $\big|\widehat{MSE}(k) - g(k)\big|$ for any arbitrary $k \in [n]$.
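For intuition, the comparison step above is an instance of the following elementary inequality: with $g(k) = \mathbb{E}\big[\widehat{MSE}(k)\big]$ and $\hat k$ a minimizer of $\widehat{MSE}$, for any fixed $k^* \in [n]$,
$g(\hat k) - g(k^*) \;\le\; \big(g(\hat k) - \widehat{MSE}(\hat k)\big) + \big(\widehat{MSE}(k^*) - g(k^*)\big) \;\le\; 2\max_{k \in [n]}\big|\widehat{MSE}(k) - g(k)\big| ,$
so a deviation bound for $\widehat{MSE}(k)$ around its mean, combined with a union bound over $k \in [n]$, controls how suboptimal the selected $\hat k$ is with respect to $g$.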
Lemma 4.1.
For any $k \in [n]$ and any $t > 0$,
where $C$ is a constant that depends only on the dimension $d$, and $K_X$ and $K_\varepsilon$ are upper bounds on the sub-Gaussian norms of the $X_i$'s and the $\varepsilon_i$'s, respectively.
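For reference, the Hanson–Wright inequality of [11], the main tool invoked in the proof below, states that if $\boldsymbol{\xi} = (\xi_1, \ldots, \xi_n)$ has independent mean-zero components with $\max_i \|\xi_i\|_{\psi_2} \le K$ and $\mathbf{M}$ is a fixed $n \times n$ matrix, then for every $t \ge 0$,
$\mathbb{P}\Big(\big|\boldsymbol{\xi}^T \mathbf{M} \boldsymbol{\xi} - \mathbb{E}\,\boldsymbol{\xi}^T \mathbf{M} \boldsymbol{\xi}\big| > t\Big) \;\le\; 2\exp\Big(-c \min\Big\{\frac{t^2}{K^4 \|\mathbf{M}\|_F^2},\; \frac{t}{K^2 \|\mathbf{M}\|_2}\Big\}\Big),$
where $c$ is an absolute constant. This is why uniform bounds on the Frobenius and operator norms of the relevant matrix are needed.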
Proof.
Define the nonsymmetric matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ in the following way:
(4.3)  $a_{ij} \;=\; \frac{1}{k}\,\mathbb{1}\{j \in N_{k,(i)}\}, \qquad i, j \in [n].$
Let $\mathbf{B} = \mathbf{I} - \mathbf{A}$. Also let $\mathbf{f} = (f(X_1), \ldots, f(X_n))^T$ and $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^T$. Then we can rewrite $\widehat{MSE}(k)$ in the following vector product form:
(4.4)  $\widehat{MSE}(k) \;=\; \frac{1}{n}\big\|\mathbf{B}(\mathbf{f} + \boldsymbol{\varepsilon})\big\|_2^2 .$
Using the triangle inequality, we have
$\big\|\mathbf{B}(\mathbf{f} + \boldsymbol{\varepsilon})\big\|_2 \;\le\; \|\mathbf{B}\mathbf{f}\|_2 + \|\mathbf{B}\boldsymbol{\varepsilon}\|_2 .$
Therefore it is enough to find probability tail bounds on $\|\mathbf{B}\boldsymbol{\varepsilon}\|_2$ and $\|\mathbf{B}\mathbf{f}\|_2$. Note that $\mathbf{A}$ (and hence $\mathbf{B}$) is random, since it depends on the $X_i$'s. To find such bounds we need information on the norms of $\mathbf{A}$. Lemmas 4.2 and 4.3 provide uniform bounds on $\|\mathbf{A}\|_F$ and $\|\mathbf{A}\|_2$.
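To see where a quadratic form in $\boldsymbol{\varepsilon}$ enters, one may expand the squared norm in (4.4) (an algebraic identity under the notation just introduced):
$\big\|\mathbf{B}(\mathbf{f}+\boldsymbol{\varepsilon})\big\|_2^2 \;=\; \mathbf{f}^T\mathbf{B}^T\mathbf{B}\,\mathbf{f} \;+\; 2\,\mathbf{f}^T\mathbf{B}^T\mathbf{B}\,\boldsymbol{\varepsilon} \;+\; \boldsymbol{\varepsilon}^T\mathbf{B}^T\mathbf{B}\,\boldsymbol{\varepsilon},$
so, conditionally on the $X_i$'s, the last term is a quadratic form in independent sub-Gaussian variables (handled by the Hanson–Wright inequality recalled above) and the middle term is linear in $\boldsymbol{\varepsilon}$.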
Lemma 4.2.
Lemma 4.3.
- Bound on the first term. Conditional on $X_1, \ldots, X_n$ (and therefore on $\mathbf{A}$), by the Hanson–Wright inequality [11] and Lemmas 4.2 and 4.3 we have
(4.7)
Also note that
(4.8)
The right-hand side of 4.8 does not depend on the sample. Therefore the left-hand side of 4.8 is almost surely constant,
(4.9)
Putting 4.9 and 4.7 together gives us
(4.10)
Note that the bound in 4.10 does not depend on the given sample, and therefore
(4.11)
- Bound on the second term.
(4.17)
∎
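For completeness, the standard concentration bound for a linear form in independent sub-Gaussian variables, which is the natural tool for the term that is linear in $\boldsymbol{\varepsilon}$, reads as follows: for a fixed vector $\mathbf{a} \in \mathbb{R}^n$ and independent mean-zero $\xi_i$ with $\max_i \|\xi_i\|_{\psi_2} \le K$,
$\mathbb{P}\Big(\Big|\sum_{i=1}^{n} a_i \xi_i\Big| > t\Big) \;\le\; 2\exp\Big(-\frac{c\, t^2}{K^2 \|\mathbf{a}\|_2^2}\Big) \qquad \text{for all } t \ge 0,$
with $c$ an absolute constant (a Hoeffding-type inequality for sub-Gaussian variables).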
4.1 Proof of Lemmas 4.2 and 4.3
Proof of Lemma 4.2.
Now note that for each $j \in [n]$, the number of indices $i$ with $j \in N_{k,(i)}$ cannot be large. In fact, $j \in N_{k,(i)}$ can hold at most for those $i$'s such that
(4.20)
By definition of $N_{k,(i)}$, and for any $j \in [n]$, by Corollary 6.1 in [6] there are at most $k\gamma_d$ indices $i$ such that $j \in N_{k,(i)}$, where $\gamma_d$ is a constant that depends only on $d$. Therefore
(4.21)
This gives us the claimed bound.
∎
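One standard way to turn the column-count estimate from Corollary 6.1 of [6] into norm bounds for $\mathbf{A}$ is the following sketch, under the assumption that $a_{ij} = \frac{1}{k}\mathbb{1}\{j \in N_{k,(i)}\}$ as in (4.3), so that every row of $\mathbf{A}$ has exactly $k$ entries equal to $1/k$ and every column has at most $k\gamma_d$ nonzero entries. Writing $\|\mathbf{A}\|_\infty$ and $\|\mathbf{A}\|_1$ for the maximum absolute row and column sums,
$\|\mathbf{A}\|_\infty = 1, \qquad \|\mathbf{A}\|_1 \le \frac{k\gamma_d}{k} = \gamma_d, \qquad \|\mathbf{A}\|_2 \le \sqrt{\|\mathbf{A}\|_1 \|\mathbf{A}\|_\infty} \le \sqrt{\gamma_d}, \qquad \|\mathbf{A}\|_F^2 = \sum_{i,j} a_{ij}^2 = \frac{n}{k}.$
Bounds of exactly this form are what the Hanson–Wright inequality requires.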
Proof of Lemma 4.3. Note that
(4.22)
Therefore, for any arbitrary $\mathbf{v} \in \mathbb{R}^n$ such that $\|\mathbf{v}\|_2 = 1$,
(4.23)
(4.24)
(4.25)
(4.26)
Therefore the claimed bound follows.
∎
An R package, knnopt, will soon be made available on the CRAN repository.
Acknowledgement
The author is very grateful to her advisor Sourav Chatterjee for his constant encouragement and insightful conversations and comments.
References
- [1] Devroye, L., Györfi, L., Krzyżak, A. and Lugosi, G. (1994). On the Strong Universal Consistency of Nearest Neighbor Regression Function Estimates. Ann. Statist. 22, no. 3, 1371–1385.
- [2] Devroye, L. (1982). Necessary and Sufficient Conditions for the Pointwise Convergence of Nearest Neighbor Regression Function Estimates. Z. Wahrsch. Verw. Gebiete 61, no. 4, 467–481.
- [3] Fan, J. (1993). Local Linear Regression Smoothers and Their Minimax Efficiencies. Ann. Statist. 21, no. 1, 196–216.
- [4] Guerre, E. (2000). Design Adaptive Nearest Neighbor Regression Estimation. J. Multivariate Anal. 75, no. 2, 219–244.
- [5] Györfi, L. (1981). The Rate of Convergence of $k_n$-NN Regression Estimates and Classification Rules. IEEE Trans. Inform. Theory 27, no. 3, 362–364.
- [6] Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.
- [7] Hall, P., Marron, J. S., Neumann, M. H. and Titterington, D. M. (1997). Curve Estimation When the Design Density Is Low. Ann. Statist. 25, no. 2, 756–770.
- [8] James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.
- [9] Kulkarni, S. R. and Posner, S. E. (1995). Rates of Convergence of Nearest Neighbor Estimation Under Arbitrary Sampling. IEEE Trans. Inform. Theory 41, no. 4, 1028–1039.
- [10] Li, K. C. (1984). Consistency for Cross-Validated Nearest Neighbor Estimates in Nonparametric Regression. Ann. Statist. 12, no. 1, 230–240.
- [11] Rudelson, M. and Vershynin, R. (2013). Hanson–Wright Inequality and Sub-Gaussian Concentration. Electron. Commun. Probab. 18 (2013).
- [12] Stute, W. (1984). Asymptotic Normality of Nearest Neighbor Regression Function Estimates. Ann. Statist. 12, no. 3, 917–929.