
# Optimal choice of k for k-nearest neighbor regression

The k-nearest neighbor algorithm (k-NN) is a widely used non-parametric method for classification and regression. We study the mean squared error of the k-NN estimator when k is chosen by leave-one-out cross-validation (LOOCV). Although it was known that this choice of k is asymptotically consistent, it was not previously known that it is an optimal k. We show that, with high probability, the mean squared error of this estimator is close to the minimum mean squared error of the k-NN estimate, where the minimum is taken over all choices of k.

07/01/2021


## 1 Introduction

### 1.1 k-NN Algorithm

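The $k$-nearest neighbor algorithm ($k$-NN) is a non-parametric method used for classification and regression. For a given sample of $n$ pairs $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ and a point $x \in \mathbb{R}^d$, the $k$-NN algorithm outputs

$$\hat{y} = \hat{m}_{k,n}(x) = \frac{1}{k} \sum_{j \in N_k(x)} y_j, \tag{1.1}$$

as an estimate of $y$, where $N_k(x)$ is the set of indices of the $k$ nearest neighbors of $x$ among the $x_i$'s.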

The choice of $k$ is very important. For small values of $k$, the $k$-NN estimator would have high variance and may overfit to the noise. As $k$ grows, the estimator becomes less flexible and therefore more biased.
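As an illustration of the estimator in 1.1, here is a minimal Python sketch. The function name `knn_regress` and the use of NumPy are our assumptions, not part of the paper. The sketch uses Euclidean distance and breaks ties by index order rather than uniformly at random, which is a simplification.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k):
    """Return the k-NN estimate: the mean of y_j over the k nearest x_j."""
    y_train = np.asarray(y_train, dtype=float)
    x_train = np.asarray(x_train, dtype=float).reshape(len(y_train), -1)
    x_query = np.asarray(x_query, dtype=float).reshape(-1)
    # Euclidean distance from the query point to every training point.
    dists = np.linalg.norm(x_train - x_query, axis=1)
    # Indices of the k smallest distances (stable sort: ties go to lower index,
    # an assumption that replaces uniform random tie-breaking).
    neighbors = np.argsort(dists, kind="stable")[:k]
    return float(y_train[neighbors].mean())

# Hypothetical one-dimensional sample, for illustration only.
x_demo = [0.0, 1.0, 2.0, 3.0]
y_demo = [0.0, 1.0, 4.0, 9.0]
estimate_small_k = knn_regress(x_demo, y_demo, 1.2, k=1)  # follows the nearest point
estimate_large_k = knn_regress(x_demo, y_demo, 1.2, k=4)  # global average of y
```

With $k = 1$ the estimate copies a single response, while with $k = n$ it ignores the query point entirely, which mirrors the variance and bias trade-off described above.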

### 1.2 Related Literature

The consistency and asymptotic behaviour of $k$-NN regression and classification have been studied by many researchers. In [2, 1], the authors provide the necessary and sufficient conditions on $k$ for the $k$-NN estimator to be consistent. In [12], the author has shown asymptotic normality of the $k$-NN estimator. The rate of convergence of this estimator has also been studied under assumptions on the density of the $x_i$'s [3, 7, 5] or the Lipschitz property of the unknown function $m$ [9].

In applications, an optimal choice of $k$ depends on the sample. Assuming that the variance of the noise is known and that $m$ is Lipschitz with a known constant, Guerre [4] suggested a choice of $k$ as a function of the sample and provided a non-asymptotic bound on the mean squared error of the proposed $k$-NN estimator conditional on the $x_i$'s. Although he did not assume independence of the $x_i$'s, the assumptions on the noise variance and on $m$ seem to be very strong for many real data applications.

In applications, a common approach for choosing the value of $k$ is cross-validation [8]. In [10], Li showed that the $k$-NN estimator using the $k$ chosen by LOOCV is asymptotically consistent. This result holds under the assumption that the distribution of the $x_i$'s fulfills two regularity conditions that together imply that the $x_i$'s are dense in their support in a uniform way. Although this result is stronger than the results on the consistency of the $k$-NN estimator for a non-random choice of $k$, it does not show why the LOOCV choice of $k$ is a competitive choice. More precisely, given other values of $k$ that give a consistent $k$-NN estimate, [10] does not show why one should use the value of $k$ chosen by LOOCV.

### 1.3 Our Work

In this work, we study the LOOCV $k$-NN estimator. Although it has been shown previously that this gives us a consistent estimator, it was not shown that this choice of $k$ is optimal. In this paper we compare the mean squared error of the proposed $k$-NN estimator with the minimum mean squared error using the $k$-NN algorithm, where the minimum is taken over all choices of $k$, and we show that with high probability they are very close. In Section 2 we discuss the setting and provide the main result. In Section 3 we discuss a simulated example. Finally we provide all the proofs in Section 4.

### 1.4 Notation

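We use boldface for vectors and matrices, e.g. $\mathbf{A}$ for a matrix and $\mathbf{x}$ for a vector. For simplicity, we use $[n]$ for $\{1, \dots, n\}$, the set of natural numbers less than or equal to $n$. For given points $x$ and $x_1, \dots, x_n$ in $\mathbb{R}^d$, and $k \in [n]$, we define $N_k(x)$ to be the set of indices of the $k$ nearest neighbors of $x$ among the $x_i$'s. Ties are broken uniformly at random. For a matrix $\mathbf{A}$, the $\ell_2$-norm and the Frobenius norm are defined as

$$\|\mathbf{A}\|_2 = \max_{\|\mathbf{x}\|_2 = 1} \|\mathbf{A}\mathbf{x}\|_2, \qquad \|\mathbf{A}\|_F = \sqrt{\mathrm{Tr}(\mathbf{A}^T \mathbf{A})}.$$

Throughout the paper, $c$, $C$ and $K$ denote positive absolute constants.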

## 2 Main result

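Let $\{(x_i, y_i)\}_{i \in [n]}$ be a set of pairs of observations with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. For each $i \in [n]$, we assume that $y_i = m(x_i) + \epsilon_i$, where $m$ is an unknown continuous function and the $x_i$ are drawn independently from an unknown distribution. For simplicity let $\mu_i = m(x_i)$. We assume that the $\mu_i$ are sub-Gaussian with sub-Gaussian norm bounded by $C$,

$$E \exp\left( (\mu_i / C)^2 \right) \le 2,$$

and we write $M = E[m(x)^2]$.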

The noise variables $\epsilon_i$ are independent sub-Gaussian mean zero random variables with sub-Gaussian norm upper bounded by $K$,

$$E \exp\left( (\epsilon_i / K)^2 \right) \le 2.$$

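For $k \in [n]$, the $k$-NN estimate of $m(x)$ given the sample $\{(x_i, y_i)\}_{i \in [n]}$ is

$$\hat{m}_{k,n}(x) = \frac{1}{k} \sum_{j \in N_k(x)} y_j. \tag{2.1}$$

The mean squared error of this estimate is

$$\mathrm{MSE}(k) = E\left[ \left( m(x) - \hat{m}_{k,n}(x) \right)^2 \right], \tag{2.2}$$

where the expectation is with respect to the joint distribution of $\{(x_i, y_i)\}_{i \in [n]}$ and $x$.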

In practice, we do not know the probability distribution function of the $x_i$'s. Therefore we can not compute $\mathrm{MSE}(k)$. Instead, we can use the given data to estimate $\mathrm{MSE}(k)$.

For each $i \in [n]$, let $N_k(i)$ be the set of the $k$ nearest neighbors of $x_i$ among $\{x_j\}_{j \neq i}$. Note that in defining $N_k(i)$ we are excluding $x_i$ itself from the set of candidate neighbors.

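Define

$$k^* := \operatorname*{arg\,min}_{k \in \{1, \dots, n-1\}} \mathrm{MSE}(k). \tag{2.3}$$

This $k^*$ is the best value of $k$ for the $k$-NN estimate. But since the distributions of $x_i$ and $\epsilon_i$ and the function $m$ are all unknown, in practice we can not find $k^*$. Instead, for each $k \in [n-1]$ we define

$$f(k) := \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \frac{1}{k} \sum_{j \in N_k(i)} y_j \Big)^2, \tag{2.4}$$

and

$$\tilde{k} := \operatorname*{arg\,min}_{k \in \{1, \dots, n-1\}} f(k). \tag{2.5}$$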

In the Statistics and Machine Learning literature $f(k)$ is known as the leave-one-out cross-validation (LOOCV) estimate of the mean squared error. Note that $f$ is a random function (randomness comes from the dependence of $f$ on the $(x_i, y_i)$'s) and therefore $\tilde{k}$ is a random variable. For each given sample $\{(x_i, y_i)\}_{i \in [n]}$, we can compute $\tilde{k}$. Therefore a simple idea is to use $\tilde{k}$ in the $k$-NN algorithm. Note that in practice the distance between $k^*$ and $\tilde{k}$ is not of the main importance for us. The main question is: how far is $\mathrm{MSE}(\tilde{k})$ from $\mathrm{MSE}(k^*)$? Theorem 2.1 gives a probability tail bound on $|\mathrm{MSE}(k^*) - \mathrm{MSE}(\tilde{k})|$ to answer this question.
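The definitions 2.4 and 2.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `loocv_curve` and `loocv_k` are ours, distances are Euclidean, ties are broken by index order instead of uniformly at random, and the full distance matrix is stored, so the sketch suits moderate $n$ only.

```python
import numpy as np

def loocv_curve(x, y):
    """Return the array (f(1), ..., f(n-1)) of leave-one-out errors from 2.4."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    n = len(y)
    # Pairwise squared distances; the diagonal is set to +inf so that
    # x_i is never counted among its own neighbors.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    order = np.argsort(d2, axis=1, kind="stable")[:, : n - 1]
    # Row i holds the responses sorted by distance from x_i; running means
    # give the leave-one-out k-NN prediction for every k at once.
    running_mean = np.cumsum(y[order], axis=1) / np.arange(1, n)
    return ((y[:, None] - running_mean) ** 2).mean(axis=0)

def loocv_k(x, y):
    """Return the minimizer of f over {1, ..., n-1}, as in 2.5."""
    return int(np.argmin(loocv_curve(x, y))) + 1

# Hypothetical toy sample, for illustration only.
x_toy = [0.0, 1.0, 2.0, 3.0]
y_toy = [0.0, 0.0, 1.0, 1.0]
f_toy = loocv_curve(x_toy, y_toy)
k_toy = loocv_k(x_toy, y_toy)
```

Computing all $n - 1$ values of $f$ from one sorted neighbor list avoids re-running the neighbor search for each $k$.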

###### Theorem 2.1.

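With $k^*$ and $\tilde{k}$ defined in 2.3 and 2.5,

$$\begin{aligned} P\left( |\mathrm{MSE}(k^*) - \mathrm{MSE}(\tilde{k})| \ge t \right) \le{} & 2(n+1) \exp\left[ -nc \min\left( \frac{t^2}{32 K^4 (1 + 4\gamma_d)}, \frac{t}{16 K^2 (1 + \gamma_d)^2} \right) \right] \\ & + 2(n+1) \exp\left( -\frac{n t^2}{2048 (1 + \gamma_d)^4 (M + t)} \right) \\ & + (n+1) \exp\left( -\frac{2 n t^2}{C} \right), \end{aligned}$$

where $\gamma_d$ is a constant that only depends on $d$. Constants $K$ and $C$ are upper bounds on the sub-Gaussian norms of $\epsilon_i$ and $\mu_i$ respectively, and $M = E[m(x)^2]$.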

For the set of observations $\{(x_i, y_i)\}_{i \in [n]}$ from an unknown joint distribution, the choice of $k^*$ for the $k$-NN estimate gives us the minimum mean squared error over all possible choices of $k$, which is typically not computable. Instead, Theorem 2.1 guarantees that using $\tilde{k}$, with high probability, gives $\mathrm{MSE}(\tilde{k})$ that is very close to $\mathrm{MSE}(k^*)$. This shows that not only does $\tilde{k}$ give us a consistent estimator but it is an optimal choice as well.
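To get a feel for how the right-hand side of Theorem 2.1 behaves as $n$ grows, one can evaluate it numerically. The sketch below is illustrative only: the function name is ours, and the constants $c$ and $\gamma_d$ are not given numerically in the paper, so every value passed in is a placeholder assumption rather than a value the theorem supplies.

```python
import math

def theorem_2_1_bound(n, t, K, C, M, gamma_d, c):
    """Evaluate the right-hand side of Theorem 2.1, capped at 1."""
    first = 2 * (n + 1) * math.exp(
        -n * c * min(
            t**2 / (32 * K**4 * (1 + 4 * gamma_d)),
            t / (16 * K**2 * (1 + gamma_d) ** 2),
        )
    )
    second = 2 * (n + 1) * math.exp(
        -n * t**2 / (2048 * (1 + gamma_d) ** 4 * (M + t))
    )
    third = (n + 1) * math.exp(-2 * n * t**2 / C)
    # A probability bound above 1 carries no information, so cap it.
    return min(1.0, first + second + third)

# Placeholder constants (assumptions): K = C = M = gamma_d = c = 1, t = 1.
bound_small_n = theorem_2_1_bound(10, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
bound_1e6 = theorem_2_1_bound(10**6, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
bound_1e7 = theorem_2_1_bound(10**7, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
```

For fixed $t$ each term is a polynomial factor in $n$ times an exponentially decaying factor, so the bound tends to zero as $n$ grows, although with these placeholder constants it only becomes non-trivial for very large $n$.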

## 3 Discussion and Simulations

We should emphasize that in computing $f(k)$ we exclude each $x_i$ from the whole set $\{x_j\}_{j \in [n]}$ and we do not consider $x_i$ as one of its own nearest neighbors. This is in fact very important and prevents us from choosing a value of $k$ that suffers from overfitting. The following example helps to see this better.

###### Example 3.1.

For $n = 1000$ we have generated ’s i.i.d from and ’s i.i.d from . Let and . For this sample we have and . Now let $N_k^{\dagger}(x_i)$ be the set of $k$ nearest neighbors of $x_i$ among all $x_1, \dots, x_n$, including $x_i$ itself, and define

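$$k^{\dagger} := \operatorname*{arg\,min}_{k \in \{2, \dots, n\}} \sum_{i \in [n]} \Big( y_i - \frac{1}{k} \sum_{j \in N_k^{\dagger}(x_i)} y_j \Big)^2. \tag{3.1}$$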

Note that we are taking the minimum over $k \in \{2, \dots, n\}$, since clearly for $k = 1$ the sum on the right-hand side of 3.1 is equal to zero. For our sample , and , where $\widehat{\mathrm{MSE}}$ is the empirical estimate of MSE,

$$\widehat{\mathrm{MSE}}(k) = \frac{1}{n} \sum_{i=1}^{n} \left( m(x_i) - \hat{m}_{k,n}(x_i) \right)^2.$$

It is clear that by choosing $k^{\dagger}$, we will overfit to the noise and therefore the estimated mean squared error is much higher than that of the other two values, $k^*$ and $\tilde{k}$.
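The effect of letting each point be its own neighbor can be reproduced on synthetic data. The sketch below does not use the distributions of Example 3.1: the design, signal and noise are assumptions chosen only for illustration, and the function name is ours.

```python
import numpy as np

def self_inclusive_curve(x, y):
    """Mean squared residual for k = 1..n when x_i may be its own neighbor."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    n = len(y)
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    # The diagonal is left at zero, so each point is its own nearest neighbor.
    order = np.argsort(d2, axis=1, kind="stable")
    running_mean = np.cumsum(y[order], axis=1) / np.arange(1, n + 1)
    return ((y[:, None] - running_mean) ** 2).mean(axis=0)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)                   # assumed design
y = np.sin(3.0 * x) + rng.normal(0.0, 1.0, size=200)   # assumed signal and noise
curve = self_inclusive_curve(x, y)
```

Here `curve[0]` is exactly zero because each $y_i$ is predicted by itself, which is why the minimum in 3.1 is restricted to $k \ge 2$; at the other extreme `curve[-1]` equals the empirical variance of the $y_i$'s.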

In Figure 1 we have plotted the (solid line) and (dashed line) for . It can be seen that the behaviour of these two is very similar. In fact, as we expect from equality 4.1 in Section 4,

$$E[f(k)] = \frac{1}{n} \sum_{i=1}^{n} E[\epsilon_i^2] + \mathrm{MSE}(k),$$

they differ only by a constant. Therefore the point-wise difference of these two curves, when the $\epsilon_i$'s have the same variance, is approximately equal to that common variance $E[\epsilon_i^2]$. This shows that looking at the curve of $f(k)$ gives us almost the same information as looking at the curve of $\widehat{\mathrm{MSE}}(k)$. Therefore computing $f(k)$ is enough for finding the optimal choice of $k$.

From Figure 1(a) and 1(b) it can be seen that $k$-NN has almost recovered the signal in the presence of strong noise.

## 4 Proofs

###### Proof.

Theorem 2.1.

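By writing $y_i = \mu_i + \epsilon_i$, we have

$$\begin{aligned} f(k) &= \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \frac{1}{k} \sum_{j \in N_k(i)} y_j \Big)^2 \\ &= \frac{1}{n} \sum_{i=1}^{n} \Big( \mu_i + \epsilon_i - \frac{1}{k} \sum_{j \in N_k(i)} y_j \Big)^2 \\ &= \frac{1}{n} \sum_{i=1}^{n} \Big[ \Big( \mu_i - \frac{1}{k} \sum_{j \in N_k(i)} y_j \Big)^2 + \epsilon_i^2 + 2 \epsilon_i \Big( \mu_i - \frac{1}{k} \sum_{j \in N_k(i)} y_j \Big) \Big]. \end{aligned}$$

Since the $\epsilon_i$'s are independent with mean zero, taking the expectation of the above equality gives us

$$E[f(k)] = \frac{1}{n} \sum_{i=1}^{n} E[\epsilon_i^2] + \mathrm{MSE}(k). \tag{4.1}$$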

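Define $g(k) := E[f(k)]$, for $k \in [n-1]$. The function $g$ is deterministic and does not depend on a given sample. Remember that $\tilde{k}$ is a function of the given sample and therefore is random. So $g(\tilde{k})$ depends on the given sample as well and therefore is random. By definition of $k^*$, $\mathrm{MSE}(k^*) \le \mathrm{MSE}(k)$ for all $k \in [n-1]$, and therefore $\mathrm{MSE}(k^*) \le \mathrm{MSE}(\tilde{k})$. This gives us $g(k^*) \le g(\tilde{k})$. Also by definition of $\tilde{k}$, we have $f(\tilde{k}) \le f(k^*)$. Putting these two together gives us,

$$\begin{aligned} P\left( |\mathrm{MSE}(k^*) - \mathrm{MSE}(\tilde{k})| \ge t \right) &= P\left( |g(k^*) - g(\tilde{k})| \ge t \right) \\ &\le P\left( |g(k^*) - f(k^*)| \ge t/2 \right) + P\left( |g(\tilde{k}) - f(\tilde{k})| \ge t/2 \right) \\ &= P\left( |E[f(k^*)] - f(k^*)| \ge t/2 \right) + P\left( |E[f(\tilde{k})] - f(\tilde{k})| \ge t/2 \right). \end{aligned} \tag{4.2}$$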
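The inequality in 4.2 can be spelled out as follows. This is a sketch of the omitted step, written with $g(k) = E[f(k)]$, and it uses only the two facts $g(k^*) \le g(\tilde{k})$ and $f(\tilde{k}) \le f(k^*)$ that follow from the definitions of $k^*$ and $\tilde{k}$:

$$\begin{aligned} 0 \le g(\tilde{k}) - g(k^*) &= \left[ g(\tilde{k}) - f(\tilde{k}) \right] + \left[ f(\tilde{k}) - f(k^*) \right] + \left[ f(k^*) - g(k^*) \right] \\ &\le \left| g(\tilde{k}) - f(\tilde{k}) \right| + \left| f(k^*) - g(k^*) \right|. \end{aligned}$$

The middle bracket is non-positive and is dropped. Hence if $|g(k^*) - g(\tilde{k})| \ge t$, at least one of the two absolute values on the right is at least $t/2$, and a union bound gives the two probabilities in 4.2. The first equality in 4.2 holds because, by 4.1, $g(k)$ and $\mathrm{MSE}(k)$ differ by a term that does not depend on $k$.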

Therefore to find an upper bound on $P(|\mathrm{MSE}(k^*) - \mathrm{MSE}(\tilde{k})| \ge t)$, it's enough to find an upper bound on $P(|f(k) - E[f(k)]| \ge t)$ for any arbitrary $k$.

###### Lemma 4.1.

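For any $k \in [n-1]$ and any $t, \lambda > 0$,

$$\begin{aligned} P\left( |f(k) - E[f(k)]| \ge t \right) \le{} & 2 \exp\left[ -nc \min\left( \frac{t^2}{8 K^4 (1 + 4\gamma_d)}, \frac{t}{8 K^2 (1 + \gamma_d)^2} \right) \right] \\ & + 2 \exp\left( -\frac{n t^2}{512 (1 + \gamma_d)^4 (M + \lambda)} \right) + \exp\left( -\frac{2 n \lambda^2}{C} \right), \end{aligned}$$

where $\gamma_d$ is a constant that only depends on the dimension $d$, $K$ and $C$ are upper bounds on the sub-Gaussian norms of $\epsilon_i$ and $\mu_i$, and $M = E[m(x)^2]$.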

###### Proof.

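Define the nonsymmetric matrix $\mathbf{B} = (b_{ij})_{i,j \in [n]}$ in the following way,

$$b_{ij} := \begin{cases} 1 & i = j, \\ -1/k & j \in N_k(i), \\ 0 & j \neq i,\ j \notin N_k(i). \end{cases} \tag{4.3}$$

Let $\mathbf{A} = \mathbf{B}^T \mathbf{B} / n$. Also let $\boldsymbol{\epsilon} = (\epsilon_1, \dots, \epsilon_n)^T$ and $\boldsymbol{\mu} = (\mu_1, \dots, \mu_n)^T$. Then we can rewrite $f(k)$ in the following vector product form

$$f(k) = (\boldsymbol{\epsilon} + \boldsymbol{\mu})^T \mathbf{A} (\boldsymbol{\epsilon} + \boldsymbol{\mu}). \tag{4.4}$$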
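The quadratic form 4.4 can be checked numerically. In the sketch below the helper `build_B`, the synthetic data and the tie-breaking by index order are all our assumptions. The check relies on the fact that the $i$-th entry of $\mathbf{B}\mathbf{y}$ is $y_i - \frac{1}{k} \sum_{j \in N_k(i)} y_j$, so that $f(k) = \|\mathbf{B}\mathbf{y}\|_2^2 / n$ with $\mathbf{y} = \boldsymbol{\epsilon} + \boldsymbol{\mu}$.

```python
import numpy as np

def build_B(x, k):
    """Matrix B of 4.3: 1 on the diagonal, -1/k on leave-one-out neighbors."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(x.shape[0], -1)
    n = x.shape[0]
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)            # x_i is never its own neighbor
    neighbors = np.argsort(d2, axis=1, kind="stable")[:, :k]
    B = np.eye(n)
    B[np.arange(n)[:, None], neighbors] = -1.0 / k
    return B, neighbors

rng = np.random.default_rng(1)
n, k = 50, 5
x = rng.normal(size=(n, 2))                 # assumed design
y = x[:, 0] ** 2 + rng.normal(size=n)       # assumed signal plus noise

B, neighbors = build_B(x, k)
A = B.T @ B / n
f_direct = np.mean((y - y[neighbors].mean(axis=1)) ** 2)   # definition 2.4
f_quadratic = y @ A @ y                                     # quadratic form 4.4
```

The two values agree up to floating point error. Every row of `B` has squared Euclidean norm $1 + 1/k$, a quantity that reappears in the proof of Lemma 4.2.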

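Using the triangle inequality, we have

$$\begin{aligned} |f(k) - E[f(k)]| &= \left| \boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} - E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}] + 2 \boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\mu} \right| \\ &\le \left| \boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} - E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}] \right| + 2 \left| \boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\mu} \right|. \end{aligned}$$

Therefore it's enough to find probability tail bounds on $|\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} - E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}]|$ and $|\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\mu}|$. Note that $\mathbf{A}$ is random. To find such bounds we need to have information on the norm of $\mathbf{A}$. Lemmas 4.2 and 4.3 provide uniform bounds on $\|\mathbf{A}\|_F$ and $\|\mathbf{A}\|_2$.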

###### Lemma 4.2.

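For $\mathbf{A} = \mathbf{B}^T \mathbf{B} / n$, where the entries of $\mathbf{B}$ are defined in 4.3, we have

$$\|\mathbf{A}\|_F^2 \le \frac{2 (1 + 4\gamma_d)}{n}, \tag{4.5}$$

where $\gamma_d$ is a constant that only depends on $d$.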

###### Lemma 4.3.

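For $\mathbf{A} = \mathbf{B}^T \mathbf{B} / n$, where the entries of $\mathbf{B}$ are defined in 4.3, we have

$$\|\mathbf{A}\|_2 \le \frac{4}{n} (1 + \gamma_d)^2, \tag{4.6}$$

where $\gamma_d$ is a constant that only depends on $d$.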

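1. Bound on $|\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} - E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}]|$. Conditional on the given sample $\{x_i\}_{i \in [n]}$, and therefore given $\mathbf{A}$, by the Hanson-Wright inequality [11] and Lemmas 4.2 and 4.3 we have

$$\begin{aligned} P\left( \left| \boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} - E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} \mid \{x_i\}_{i \in [n]}] \right| \ge t \,\middle|\, \{x_i\}_{i \in [n]} \right) &\le 2 \exp\left[ -c \min\left( \frac{t^2}{K^4 \|\mathbf{A}\|_F^2}, \frac{t}{K^2 \|\mathbf{A}\|_2} \right) \right] \\ &\le 2 \exp\left[ -nc \min\left( \frac{t^2}{2 K^4 (1 + 4\gamma_d)}, \frac{t}{4 K^2 (1 + \gamma_d)^2} \right) \right]. \end{aligned} \tag{4.7}$$

Also note that

$$\begin{aligned} E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} \mid \{x_i\}_{i \in [n]}] &= \frac{1}{n} E\left[ \langle \boldsymbol{\epsilon}^T \mathbf{B}, \boldsymbol{\epsilon}^T \mathbf{B} \rangle \right] \\ &= \frac{1}{n} E\Big[ \Big\langle \sum_{i=1}^{n} \epsilon_i \mathbf{B}_i, \sum_{i=1}^{n} \epsilon_i \mathbf{B}_i \Big\rangle \Big] \\ &= \frac{(1 + 1/k)}{n} \sum_{i=1}^{n} E[\epsilon_i^2] \\ &= \Big( 1 + \frac{1}{k} \Big) \frac{E\|\boldsymbol{\epsilon}\|_2^2}{n}. \end{aligned} \tag{4.8}$$

The right side of 4.8 does not depend on the sample $\{x_i\}_{i \in [n]}$. Therefore $E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} \mid \{x_i\}_{i \in [n]}]$ is almost surely constant,

$$E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} \mid \{x_i\}_{i \in [n]}] = E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}]. \tag{4.9}$$

Putting 4.9 and 4.7 together gives us

$$\begin{aligned} P\left( \left| \boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} - E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}] \right| \ge t \,\middle|\, \{x_i\}_{i \in [n]} \right) &= P\left( \left| \boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} - E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} \mid \{x_i\}_{i \in [n]}] \right| \ge t \,\middle|\, \{x_i\}_{i \in [n]} \right) \\ &\le 2 \exp\left[ -nc \min\left( \frac{t^2}{2 K^4 (1 + 4\gamma_d)}, \frac{t}{4 K^2 (1 + \gamma_d)^2} \right) \right]. \end{aligned} \tag{4.10}$$

Note that the bound in 4.10 does not depend on $\{x_i\}_{i \in [n]}$, therefore

$$P\left( \left| \boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} - E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}] \right| \ge t \right) \le 2 \exp\left[ -nc \min\left( \frac{t^2}{2 K^4 (1 + 4\gamma_d)}, \frac{t}{4 K^2 (1 + \gamma_d)^2} \right) \right]. \tag{4.11}$$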
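2. Bound on $|\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\mu}|$.

By Hoeffding's inequality, for any $t > 0$,

$$\begin{aligned} P\left( |\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\mu}| \ge t \,\middle|\, \{x_i\}_{i \in [n]} \right) &= P\Big( \Big| \sum_{j=1}^{n} \Big( \sum_{i=1}^{n} a_{ji} \mu_i \Big) \epsilon_j \Big| \ge t \,\Big|\, \{x_i\}_{i \in [n]} \Big) \\ &\le 2 \exp\left( -\frac{t^2}{2 \sum_{j=1}^{n} \left( \sum_{i=1}^{n} a_{ji} \mu_i \right)^2} \right). \end{aligned} \tag{4.12}$$

Note that

$$\sum_{j=1}^{n} \Big( \sum_{i=1}^{n} a_{ji} \mu_i \Big)^2 \le \|\mathbf{A}\|_2^2 \|\boldsymbol{\mu}\|_2^2. \tag{4.13}$$

Inequalities 4.12 and 4.13 together give

$$P\left( |\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\mu}| \ge t \,\middle|\, \{x_i\}_{i \in [n]} \right) \le 2 \exp\left( -\frac{t^2}{2 \|\mathbf{A}\|_2^2 \|\boldsymbol{\mu}\|_2^2} \right). \tag{4.14}$$

Now by Lemma 4.3

$$P\left( |\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\mu}| \ge t \,\middle|\, \{x_i\}_{i \in [n]} \right) \le 2 \exp\left( -\frac{n^2 t^2}{32 (1 + \gamma_d)^4 \|\boldsymbol{\mu}\|_2^2} \right) = 2 \exp\left( -\frac{n t^2}{32 (1 + \gamma_d)^4 \|\boldsymbol{\mu}\|_2^2 / n} \right).$$

By the SLLN, $\|\boldsymbol{\mu}\|_2^2 / n \to E[m(x)^2]$ almost surely. For any $\lambda > 0$, Hoeffding's inequality gives

$$P\left( \|\boldsymbol{\mu}\|_2^2 / n \ge E[m(x)^2] + \lambda \right) \le \exp\left( -\frac{2 n \lambda^2}{C} \right).$$

Therefore

$$\begin{aligned} P\left( |\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\mu}| \ge t \right) &= E\left[ P\left( |\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\mu}| \ge t \,\middle|\, \{x_i\}_{i \in [n]} \right) \right] \\ &\le 2 \exp\left( -\frac{n t^2}{32 (1 + \gamma_d)^4 (M + \lambda)} \right) + \exp\left( -\frac{2 n \lambda^2}{C} \right). \end{aligned} \tag{4.15}$$

Combining 4.11 and 4.15 gives

$$\begin{aligned} P\left( |f(k) - E[f(k)]| \ge t \right) &\le P\left( \left| \boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon} - E[\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\epsilon}] \right| \ge t/2 \right) + P\left( |\boldsymbol{\epsilon}^T \mathbf{A} \boldsymbol{\mu}| \ge t/4 \right) \\ &\le 2 \exp\left[ -nc \min\left( \frac{t^2}{8 K^4 (1 + 4\gamma_d)}, \frac{t}{8 K^2 (1 + \gamma_d)^2} \right) \right] \\ &\quad + 2 \exp\left( -\frac{n t^2}{512 (1 + \gamma_d)^4 (M + \lambda)} \right) + \exp\left( -\frac{2 n \lambda^2}{C} \right). \end{aligned} \tag{4.17}$$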

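Using Lemma 4.1 for $k^*$ and $\tilde{k}$ and the union bound we have

$$\begin{aligned} P\left( |\mathrm{MSE}(k^*) - \mathrm{MSE}(\tilde{k})| \ge t \right) &\le P\left( |f(\tilde{k}) - E[f(\tilde{k})]| \ge t/2 \right) + P\left( |f(k^*) - E[f(k^*)]| \ge t/2 \right) \\ &\le \sum_{k=1}^{n} P\left( |f(k) - E[f(k)]| \ge t/2 \right) + P\left( |f(k^*) - E[f(k^*)]| \ge t/2 \right) \\ &\le 2(n+1) \exp\left[ -nc \min\left( \frac{t^2}{32 K^4 (1 + 4\gamma_d)}, \frac{t}{16 K^2 (1 + \gamma_d)^2} \right) \right] \\ &\quad + 2(n+1) \exp\left( -\frac{n t^2}{2048 (1 + \gamma_d)^4 (M + \lambda)} \right) + (n+1) \exp\left( -\frac{2 n \lambda^2}{C} \right). \end{aligned}$$

Note that $\lambda > 0$ is arbitrary, therefore we simply set $\lambda = t$, and this completes the proof of Theorem 2.1,

$$\begin{aligned} P\left( |\mathrm{MSE}(k^*) - \mathrm{MSE}(\tilde{k})| \ge t \right) \le{} & 2(n+1) \exp\left[ -nc \min\left( \frac{t^2}{32 K^4 (1 + 4\gamma_d)}, \frac{t}{16 K^2 (1 + \gamma_d)^2} \right) \right] \\ & + 2(n+1) \exp\left( -\frac{n t^2}{2048 (1 + \gamma_d)^4 (M + t)} \right) + (n+1) \exp\left( -\frac{2 n t^2}{C} \right). \end{aligned}$$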

### 4.1 Proof of Lemmas 4.2 and 4.3

###### Proof.

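Lemma 4.2. Let $\mathbf{b}_i$ be the $i$-th row of the matrix $\mathbf{B}$. Then

$$\begin{aligned} \|\mathbf{A}\|_F^2 &= \mathrm{Tr}(\mathbf{A}^T \mathbf{A}) = \frac{1}{n^2} \mathrm{Tr}(\mathbf{B}^T \mathbf{B} \mathbf{B}^T \mathbf{B}) = \frac{1}{n^2} \mathrm{Tr}(\mathbf{B} \mathbf{B}^T \mathbf{B} \mathbf{B}^T) = \frac{1}{n^2} \sum_{i,j} \langle \mathbf{b}_i, \mathbf{b}_j \rangle^2 \\ &= \frac{1}{n} \Big( 1 + \frac{1}{k} \Big) + \frac{1}{n^2} \sum_{i \neq j} \langle \mathbf{b}_i, \mathbf{b}_j \rangle^2 = \frac{1}{n} \Big( 1 + \frac{1}{k} \Big) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j \neq i} \langle \mathbf{b}_i, \mathbf{b}_j \rangle^2. \end{aligned}$$

For each $i \neq j$,

$$\langle \mathbf{b}_i, \mathbf{b}_j \rangle = \Big( -\frac{1}{k} \Big) \left[ \mathbb{1}\{i \in N_k(j)\} + \mathbb{1}\{j \in N_k(i)\} \right] + \frac{1}{k^2} |N_k(i) \cap N_k(j)|,$$

and therefore

$$|\langle \mathbf{b}_i, \mathbf{b}_j \rangle| \le \frac{2}{k}. \tag{4.18}$$

This gives

$$\sum_{j \neq i} \langle \mathbf{b}_i, \mathbf{b}_j \rangle^2 \le \frac{4}{k^2} \left| \{ j : \langle \mathbf{b}_i, \mathbf{b}_j \rangle \neq 0 \} \right|. \tag{4.19}$$

Now note that for each $i$, $|\{ j : \langle \mathbf{b}_i, \mathbf{b}_j \rangle \neq 0 \}|$ can not be large. In fact $\langle \mathbf{b}_i, \mathbf{b}_j \rangle \neq 0$ at most for those $j$'s that satisfy

$$i \in N_k(j), \quad \text{or} \quad j \in N_k(i), \quad \text{or} \quad N_k(i) \cap N_k(j) \neq \emptyset. \tag{4.20}$$

By definition $|N_k(i)| = k$, and for any $\ell \in [n]$, by Corollary 6.1 in [6], there are at most $\gamma_d k$ indices $j$ such that $\ell \in N_k(j)$, where $\gamma_d$ is a constant that depends only on $d$. Therefore

$$\left| \{ j : \langle \mathbf{b}_i, \mathbf{b}_j \rangle \neq 0 \} \right| \le \gamma_d k (k+1). \tag{4.21}$$

This gives us

$$\|\mathbf{A}\|_F^2 \le (1 + 4\gamma_d) \Big( 1 + \frac{1}{k} \Big) \frac{1}{n} \le \frac{2 (1 + 4\gamma_d)}{n}.$$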

###### Proof.

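Lemma 4.3. Note that

$$\|\mathbf{B}\|_2 = \max_{\|\mathbf{x}\|_2 = 1} \|\mathbf{B}\mathbf{x}\|_2. \tag{4.22}$$

Therefore for any arbitrary $\mathbf{x}$ such that $\|\mathbf{x}\|_2 = 1$,

$$\begin{aligned} \|\mathbf{B}\mathbf{x}\|_2^2 &= \sum_{i=1}^{n} \langle \mathbf{b}_i, \mathbf{x} \rangle^2 && (4.23) \\ &\le 2 \sum_{i=1}^{n} x_i^2 + \frac{2}{k^2} \sum_{i=1}^{n} \sum_{j \in N_k(i)} x_j^2 && (4.24) \\ &= 2 \|\mathbf{x}\|_2^2 + \frac{2}{k^2} \sum_{j=1}^{n} \sum_{\{i : j \in N_k(i)\}} x_j^2 && (4.25) \\ &\le 2 \Big( 1 + \frac{\gamma_d}{k^2} \Big) \|\mathbf{x}\|_2^2. && (4.26) \end{aligned}$$

Therefore

$$\|\mathbf{A}\|_2 = \frac{\|\mathbf{B}\|_2^2}{n} \le \frac{4}{n} \Big( 1 + \frac{\gamma_d}{k^2} \Big)^2 \le \frac{4}{n} (1 + \gamma_d)^2.$$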

An R language package knnopt will soon be made available on the CRAN repository.

## Acknowledgement

The author is very grateful to her advisor Sourav Chatterjee for his constant encouragement and insightful conversations and comments.

## References

• [1] Devroye, L., Györfi, L., Krzyżak, A. and Lugosi, G. (1994). On the Strong Universal Consistency of Nearest Neighbor Regression Function Estimates. Ann. Statist. 22 no. 3, 1371–1385. 1311980
• [2] Devroye, L. (1982). Necessary and Sufficient Conditions For The Pointwise Convergence of Nearest Neighbor Regression Function Estimates. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 61 no. 4, 467–481. 0682574
• [3] Fan, J. (1993). Local Linear Regression Smoothers and Their Minimax Efficiencies. Ann. Statist. 21 no. 1, 196–216. 1212173
• [4] Guerre, E. (2000). Design Adaptive Nearest Neighbor Regression Estimation. J. Multivariate Anal. 75 no. 2, 219–244. 1802549
• [5] Györfi, L. (1981). The Rate of Convergence of $k_n$-NN Regression Estimates and Classification Rules. IEEE Trans. Inform. Theory 27 no. 3, 362–364.
• [6] Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Non-parametric Regression. Springer.
• [7] Hall, P., Marron, J. S., Neumann, M. H. and Titterington, D. M. (1997). Curve Estimation When The Design Density Is Low. Ann. Statist. 25 no. 2, 756–770. 1439322
• [8] James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.
• [9] Kulkarni, S. R. and Posner, S. E. (1995). Rates of Convergence of Nearest Neighbor Estimation Under Arbitrary Sampling. IEEE Trans. Inform. Theory 41 no. 4, 1028–1039. 1366756
• [10] Li, K. C. (1984). Consistency For Cross-validated Nearest Neighbor Estimates in Non-parametric Regression. Ann. Statist. 12 no. 1, 230–240. 0733510
• [11] Rudelson, M. and Vershynin, R. (2013). Hanson-Wright Inequality and sub-Gaussian Concentration. Electron. Commun. Probab. 18. 3125258
• [12] Stute, W. (1984). Asymptotic Normality of Nearest Neighbor Regression Function Estimates. Ann. Statist. 12 no. 3, 917–929. 1397508