# Classification with the nearest neighbor rule in general finite dimensional spaces: necessary and sufficient conditions

Given an n-sample of random vectors $(X_i, Y_i)_{1 \le i \le n}$ whose joint law is unknown, the long-standing problem of supervised classification aims to optimally predict the label $Y$ of a given new observation $X$. In this context, the nearest neighbor rule is a popular, flexible and intuitive method in non-parametric situations. Even if this algorithm is commonly used in the machine learning and statistics communities, less is known about its prediction ability in general finite dimensional spaces, especially when the support of the density of the observations is $\mathbb{R}^d$. This paper is devoted to the study of the statistical properties of the nearest neighbor rule in various situations. In particular, attention is paid to the marginal law of $X$, as well as the smoothness and margin properties of the regression function $\eta(X) = \mathbb{E}[Y \mid X]$. We identify two necessary and sufficient conditions to obtain uniform consistency rates of classification and to derive sharp estimates in the case of the nearest neighbor rule. Some numerical experiments are proposed at the end of the paper to help illustrate the discussion.


## 1 Introduction

The supervised classification model has been at the core of numerous contributions to the statistical literature in recent years. It continues to provide interesting problems, from both the theoretical and practical points of view. The classical task in supervised classification is to predict a feature $Y \in \mathcal{Y}$ when a variable of interest $X \in \mathbb{R}^d$ is observed, the set $\mathcal{Y}$ being finite. In this paper, we focus on the binary classification problem where $\mathcal{Y} = \{0,1\}$.

In order to provide a prediction of the label $Y$ of $X$, it is assumed that a training set $(X_i, Y_i)_{1 \le i \le n}$ is at our disposal, where the pairs $(X_i, Y_i)$ are i.i.d. with a common law $\mathbb{P}_{X,Y}$. This training set makes it possible to retrieve some information on the joint law of $(X, Y)$ and to provide, depending on some technical conditions, a pertinent prediction. In particular, the regression function defined as:

$$\eta(x) = \mathbb{E}[Y \mid X = x], \quad \forall x \in \mathbb{R}^d$$

appears to be of primary interest for the statistician (see Section 2 for a formal description of the model). Indeed, given $x \in \mathbb{R}^d$, the term $\eta(x)$ provides the probability that $X$ is assigned the label $1$, conditionally on the event $\{X = x\}$. Since this function is unknown in practice, prediction rules are based on the training sample.

Several algorithms have been proposed over the years but we do not intend to provide an exhaustive list of the associated papers. For an extended introduction to the supervised classification theory, we refer to [BBL05] or [DGL96]. Among available classification procedures, we can, roughly speaking, divide them into (at least) three families:

• Approaches based on pure entropy considerations and Empirical Risk Minimization (ERM):

Given a classifier, the misclassification error can be empirically estimated from the learning sample. The ERM algorithm then selects the classifier that minimizes this empirical risk among a given family of candidates. Several studies, such as [MT99], [BM06], [AT07] and [LM14], now provide an almost complete description of their statistical performance. In an almost similar context, some aggregation schemes, first proposed in Boosting procedures by [FS97], have been analyzed in depth in [Lec07] and shown to be adaptive to margin and complexity.

• Methods derived from geometric interpretation or information theory:

For example, the Support Vector Machine classifier (SVM) aims to maximize the margin of the classification rule. It has been intensively studied in the last two decades because of its low computational cost and excellent statistical performances (see [Vap98], [Ste05] or [BBM08] among others). Classification and Regression Tree is another intuitive standard method that relies on a recursive dyadic partition of the state space, introduced in [BFOS84] and greatly improved by an averaging procedure in [AG97] and [Bre01], which is usually referred to as Random Forest, and later theoretically developed in [BDL08].

• Plug-in rules: The main idea is to mimic the Bayes optimal classifier using a plug-in rule after a preliminary estimation of the function $\eta$. We refer to [GKKW02] for a general overview, and to [BR03] and [AT07] for some recent statistical results within this framework. The main motivation behind plug-in rules is to transfer properties related to the classical regression problem (estimation of $\eta$ from the sample) to a quantitative control of the misclassification error.

In this general overview, the nearest neighbor rule (see Section 2.2 for a complete description) belongs to the last two classes. It corresponds to a plug-in classifier with a simple geometrical interpretation. It has attracted a great deal of attention over the past few decades, starting with the seminal works of [FH51] and [CH67]. Given an integer $k$, the corresponding classifier is based on a feature average of the $k$ closest observations to $X$ in the training set. We also refer to [Sto77], [Győ78], [Gyö81], [DW77] and [DGKL94] for seminal contributions on this prediction rule (both for classification and regression). Recently, this algorithm has received even further attention in mathematical statistics, and is still at the core of several studies: [CG06] examines the situation of general metric space and identifies the importance of the so-called Besicovitch assumption, [HPS08] is concerned with the influence of the integer $k$ on the excess risk of the nearest neighbor rule as well as the two notions of the sample structure while [Sam12] describes an improvement of the standard algorithm.

Most of the results obtained for penalized ERM, SVM or plug-in classifiers are based on complexity considerations (metric entropy or Vapnik dimension). In this paper, we mainly use the asymptotic behavior of the small ball probabilities instead (see [Lia11] and the references therein), which can be seen as a dual quantity of the entropy (see [LS01]). We also deal with the more intricate situation of densities that are not bounded away from zero (especially for non-compactly supported measures). For this purpose, we work with both smoothness and minimal mass assumptions (see Section 2.3 for more details) that will provide a pertinent estimation of the function $\eta$. In particular, it is assumed that we will be able to take advantage of some smoothness properties of the function $\eta$ in order to improve the prediction of the label $Y$. According to previous existing studies (see, e.g., [Gyö81]), the associated classification rates appear to be comparable to those obtained in an “estimation” framework and, hence, always greater than $n^{-1/2}$. However, it has been proven in [MT99] that fast rates (i.e., faster than $n^{-1/2}$) can be obtained under some additional margin assumption. It is in fact possible to take advantage of the behavior of the law of $(X, Y)$ around the boundary $\{\eta = 1/2\}$ in order to improve the properties of the classification process.

In this paper, we investigate the nearest neighbor rule with a margin assumption and a marginal distribution for the variable $X$ that is not necessarily compactly supported or bounded away from zero. The contributions proposed below can be broken down into three different categories.

##### Consistency rate for bounded from below densities

Our first result concerns the optimality of the nearest neighbor classifier in the compact case. We prove that this classification rule reaches the minimax rate of convergence for the excess risk obtained by [AT07] (see Theorem 3.2 below). In particular, under some classical assumptions about the distribution of the couple $(X, Y)$ (which will be illustrated below), we show that:

$$\sup_{F \in \mathcal{F}} \left[ R(\Phi_n) - R(\Phi^*) \right] \le C n^{-\frac{1+\alpha}{2+d}},$$

where $\alpha$ denotes the margin parameter, $d$ the dimension of the problem, $R(\Phi)$ the misclassification error of a given classifier $\Phi$ and $\Phi^*$ the Bayes classifier. (This result has also been established in the recent work of [Sam12].) We obtain this result for both Poisson and Binomial sample-size models. In particular, such a result appears to be a generalization of the ones given in [HPS08], which do not take the margin into account.

##### Consistency rate for general densities

In a second step, we investigate the behavior of the nearest neighbor classifier when the marginal density $\mu$ (w.r.t. the Lebesgue measure) of $X$ is not bounded from below on its support. Such an improvement is not of secondary importance since it corresponds to the commonly encountered situation of vanishing or non-compactly supported densities. To do this, we use an additional assumption on the tail of this distribution and prove that generically:

$$\sup_{F \in \mathcal{F}} \left[ R(\Phi_n) - R(\Phi^*) \right] \le C n^{-\frac{1+\alpha}{2+\alpha+d}},$$

as soon as the bandwidth $k_n$ involved in the classifier is allowed to depend on the spatial position of $X$. The tail assumption on the marginal distribution of $X$ involved in this result will describe the behavior of the density $\mu$ near the set $\{\mu = 0\}$.
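To make the gap between the two announced rates concrete, the following minimal sketch evaluates the exponents $(1+\alpha)/(2+d)$ and $(1+\alpha)/(2+\alpha+d)$ for a few dimensions. The function name `rate_exponent` and its arguments are illustrative choices, not notation from the paper, and the constants $C$ are ignored.

```python
def rate_exponent(alpha: float, d: int, bounded_below: bool) -> float:
    """Exponent r such that the announced upper bound scales as n**(-r).

    bounded_below=True  -> (1 + alpha) / (2 + d)          (density bounded from below)
    bounded_below=False -> (1 + alpha) / (2 + alpha + d)  (general densities)
    """
    if alpha < 0 or d < 1:
        raise ValueError("alpha must be non-negative and d a positive integer")
    denominator = 2 + d if bounded_below else 2 + alpha + d
    return (1 + alpha) / denominator


if __name__ == "__main__":
    alpha = 1.0  # illustrative margin parameter
    for d in (1, 2, 5, 20):
        compact = rate_exponent(alpha, d, bounded_below=True)
        general = rate_exponent(alpha, d, bounded_below=False)
        print(f"d={d:2d}  bounded below: n^-{compact:.3f}  general: n^-{general:.3f}")
```

For any $\alpha > 0$ the general exponent is strictly smaller, and both exponents shrink as $d$ grows.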

##### Lower bounds

Finally, we derive some lower bounds for the supervised classification problem, which extend the results obtained in [AT07] in a slightly different context. We prove that our Tail Assumption is unavoidable to ensure uniform consistency rates for classification in a non-compact case, regardless of the dimension $d$. We then see how these upper and lower bounds are linked. In particular, we show that a very unfavorable situation of classification occurs when the regression function oscillates in the tail of the distribution $\mu$, i.e., we establish that it is even impossible in these situations to obtain uniform consistency rates and thus elucidate two open questions in [Can13].

The paper is organized as follows. In Section 2, we precisely describe the statistical setting related to the classification problem. Some attention is paid to the nearest neighbor rule. Section 3 is devoted to the bounded from below case where we prove that the nearest neighbor classifier reaches the minimax rate of convergence for the excess risk under mild assumptions. We then extend our study to the general (typically non-compact) case in Section 4. This section is supplemented with some supporting numerical results and a glossary of typical situations of location models. We conclude with a discussion of our results, and potential problems. Proofs and technical results are included in Appendix A. The paper is completed by an adaptation to the smooth discriminant analysis model in Appendix B (see, e.g. [MT99] or [HPS08] for another comparison between the so-called Poisson and Binomial models). In particular, although the variables of interest are strongly dependent in this case, we derive (using a Poissonization argument) results similar to those obtained in the classical binary classification model.

We use the following notations throughout the paper. $\mathbb{P}_{X,Y}$ denotes the distribution of the couple $(X, Y)$ and $\mathbb{P}_X$ the marginal distribution of $X$, which will be assumed to admit a density $\mu$ with respect to the Lebesgue measure. Similarly, we set and . In the same spirit, , and will hereafter correspond to the expectations w.r.t. the measures , and , respectively. Finally, given two real sequences $(a_n)$ and $(b_n)$, we write $a_n \lesssim b_n$ (resp. $a_n \gtrsim b_n$) if a real constant $C > 0$ exists such that $a_n \le C b_n$ (resp. $a_n \ge C b_n$) for all $n$.

## 2 Statistical setting and nearest neighbor classifier

### 2.1 Statistical Classification problem

In this paper, we study the classical binary supervised classification model (see, e.g., [DGL96] for a complete introduction). An i.i.d. sample $(X_i, Y_i)_{1 \le i \le n}$ with values in $\Omega \times \{0,1\}$, whose common distribution is $\mathbb{P}_{X,Y}$ and where $\Omega$ is an open set of $\mathbb{R}^d$, is at our disposal. Given a new incoming observation $X$, our goal is to predict its corresponding label $Y$. To do this, we use a classifier that provides a decision rule for this problem. Formally, a classifier is a measurable mapping $\Phi$ from $\mathbb{R}^d$ to $\{0,1\}$. Given a classifier $\Phi$, its corresponding misclassification error is then defined as:

$$R(\Phi) = \mathbb{P}(\Phi(X) \neq Y).$$

In practice, the most interesting classifiers are those associated with the smallest possible error. In this context it is well known (see, e.g., [BBL05]) that the Bayes classifier defined as:

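$$\Phi^*(X) = \mathbf{1}_{\{\eta(X) > \frac{1}{2}\}}, \quad \text{where } \eta(x) := \mathbb{E}[Y \mid X = x] \quad \forall x \in \Omega, \tag{2.1}$$

minimizes the misclassification error, i.e.,

$$R(\Phi^*) \le R(\Phi), \quad \forall \Phi : \mathbb{R}^d \longrightarrow \{0,1\}.$$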

The classifier $\Phi^*$ provides the best decision rule in the sense that it leads to the lowest possible misclassification error. Unfortunately, $\Phi^*$ is not available since the regression function $\eta$ explicitly depends on the underlying distribution of $(X, Y)$. In some sense, the Bayes classifier can be considered as an oracle that provides a benchmark error. Hence, the main challenge in this supervised classification setting is to construct a classifier whose misclassification error will be as close as possible to the smallest possible one. In particular, the excess risk (also referred to as the regret) defined as

$$R(\Phi) - R(\Phi^*),$$

appears to be of primary importance. We are interested here in the statistical properties of the nearest neighbor classifier (see Section 2.2 below for more details) based on the sample. In particular, we investigate the asymptotic properties of the excess risk through the minimax paradigm. Given a set $\mathcal{F}$ of possible distributions for $(X, Y)$, the minimax risk is defined as:

$$\delta_n(\mathcal{F}) := \inf_{\Phi} \sup_{F \in \mathcal{F}} \left[ R(\Phi) - R(\Phi^*) \right],$$

where the infimum in the above formula is taken over all measurable classifiers. A classifier $\Phi_n$ is then said to be minimax over the set $\mathcal{F}$ if:

$$\sup_{F \in \mathcal{F}} \left[ R(\Phi_n) - R(\Phi^*) \right] \le C \delta_n(\mathcal{F}),$$

for some constant $C > 0$. The considered set $\mathcal{F}$ will be detailed later on and will depend on the behavior of $\mu$ and $\eta$ over $\Omega$ through some smoothness, margin and minimal mass hypotheses.
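The misclassification error and the excess risk can be computed exactly in a toy model. The sketch below assumes (purely for illustration, none of this is taken from the paper) that $X$ is uniform on $(-1, 1)$ and $\eta(x) = (1+x)/2$, so that the Bayes classifier is $\mathbf{1}_{\{x > 0\}}$, its risk is $1/4$, and the threshold rule $\mathbf{1}_{\{x > t\}}$ with $t \in [0,1]$ has excess risk $t^2/4$. The helper names `eta`, `risk`, `bayes` and `threshold_rule` are hypothetical.

```python
def eta(x: float) -> float:
    """Toy regression function P(Y = 1 | X = x) on (-1, 1) (illustrative assumption)."""
    return 0.5 * (1.0 + x)


def risk(classifier, n_grid: int = 100_000) -> float:
    """R(Phi) = P(Phi(X) != Y) for X ~ Uniform(-1, 1), via a midpoint rule."""
    step = 2.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        x = -1.0 + (i + 0.5) * step
        # If we predict 1, we err when Y = 0 (prob. 1 - eta); otherwise when Y = 1.
        total += (1.0 - eta(x)) if classifier(x) == 1 else eta(x)
    return total / n_grid


def bayes(x: float) -> int:
    """Bayes classifier of equation (2.1) for the toy model."""
    return 1 if eta(x) > 0.5 else 0


def threshold_rule(t: float):
    """Sub-optimal rule that predicts 1 only when x > t."""
    return lambda x: 1 if x > t else 0


if __name__ == "__main__":
    bayes_risk = risk(bayes)
    excess = risk(threshold_rule(0.4)) - bayes_risk
    print(f"Bayes risk ~ {bayes_risk:.6f}, excess risk of threshold 0.4 ~ {excess:.6f}")
```

In this toy model the excess risk is quadratic in the threshold error, which reflects the fact that mistakes made close to $\{\eta = 1/2\}$ cost little.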

### 2.2 The nearest neighbor rule

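In this paper, we focus on the nearest neighbor classifier, which is perhaps one of the most widespread and simplest classification procedures. Suppose that the state space is $(\mathbb{R}^d, \|\cdot\|)$ where $\|\cdot\|$ is a reference distance. Given any sample $(X_i, Y_i)_{1 \le i \le n}$ and for any $x \in \mathbb{R}^d$, we first build the reordered sample $(X_{(j)}(x), Y_{(j)}(x))_{1 \le j \le n}$ with respect to the distances $\|X_i - x\|$, namely:

$$\|X_{(1)}(x) - x\| \le \|X_{(2)}(x) - x\| \le \ldots \le \|X_{(n)}(x) - x\|.$$

In this context, $X_{(m)}(x)$ is the $m$-th nearest neighbor of $x$ w.r.t. the distance $\|\cdot\|$ and $Y_{(m)}(x)$ its corresponding label. Given any integer $k$ in $\{1, \ldots, n\}$, the principle of the nearest neighbor algorithm is to construct a decision rule based on the $k$ nearest neighbors of the input $X$: the classifier $\Phi_{n,k}$, which is measurable with respect to the sample, is:

$$\Phi_{n,k}(X) = \begin{cases} 1 & \text{if } \dfrac{1}{k} \sum_{j=1}^{k} Y_{(j)}(X) > \dfrac{1}{2}, \\ 0 & \text{otherwise.} \end{cases} \tag{2.2}$$

For all $x \in \Omega$, the term $\frac{1}{k} \sum_{j=1}^{k} Y_{(j)}(x)$ appears to be an estimator of the regression function $\eta(x)$. In particular, we can write the classifier $\Phi_{n,k}$ as

$$\Phi_{n,k}(X) = \mathbf{1}_{\{\hat{\eta}_n(X) > 1/2\}} \quad \text{where} \quad \hat{\eta}_n(x) = \frac{1}{k} \sum_{j=1}^{k} Y_{(j)}(x) \quad \forall x \in \Omega. \tag{2.3}$$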

Hence, the nearest neighbor procedure can be considered as a plug-in classifier, i.e., a preliminary estimator of the regression function $\eta$ is plugged into our decision rule. It is worth noting that the integer $k$ is a regularization parameter. Indeed, if $k$ is too small, the classifier will only use a small number of the neighbors of $X$, leading to a large variance during the classification process. On the other hand, large values of $k$ will introduce some bias into the decision rule since we use observations that may be far away from the input $X$. In other words, the statistical performance of $\Phi_{n,k}$ will depend on a careful choice of the integer $k$. In particular, the number of neighbors considered should carefully grow to $+\infty$ with respect to $n$.
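The plug-in rule (2.2)-(2.3) can be sketched in a few lines. This is a minimal illustration assuming the Euclidean distance and points stored as tuples; the function names `knn_estimate` and `knn_classify` are hypothetical, and ties between equidistant neighbors are broken by sample index, a convention the text does not specify.

```python
import math
from typing import Sequence


def knn_estimate(x_train: Sequence[Sequence[float]],
                 y_train: Sequence[int],
                 x: Sequence[float],
                 k: int) -> float:
    """Estimator of eta(x): average label of the k nearest neighbors of x."""
    if len(x_train) != len(y_train):
        raise ValueError("x_train and y_train must have the same length")
    if not 1 <= k <= len(x_train):
        raise ValueError("k must satisfy 1 <= k <= n")
    # Reorder the sample by increasing Euclidean distance to x.
    # sorted() is stable, so ties are broken by the original index.
    order = sorted(range(len(x_train)), key=lambda i: math.dist(x_train[i], x))
    return sum(y_train[i] for i in order[:k]) / k


def knn_classify(x_train, y_train, x, k: int) -> int:
    """Plug-in rule: predict 1 when the local label average exceeds 1/2."""
    return 1 if knn_estimate(x_train, y_train, x, k) > 0.5 else 0


if __name__ == "__main__":
    xs = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
    ys = [0, 0, 0, 1, 1, 1]
    print(knn_classify(xs, ys, (0.5,), k=3))   # neighbors all labeled 0
    print(knn_classify(xs, ys, (11.2,), k=3))  # neighbors all labeled 1
    print(knn_estimate(xs, ys, (6.0,), k=6))   # k = n averages every label
```

With $k = n$ the estimate ignores the query point entirely, which is the extreme form of the bias described above; with $k = 1$ the prediction copies a single label and has the largest variance.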

For this purpose, we introduce some baseline assumptions in the following section that will make it possible to characterize an optimal value for this regularization parameter.

### 2.3 Baseline assumptions

It is well known that no reliable prediction can be made in a distribution-free setting (see [DGL96]). We restrict the class of possible distributions of $(X, Y)$ below.

Since the nearest neighbor rule is a plug-in classification rule, we expect to take advantage of some smoothness properties of $\eta$ in order to improve the classification process. In fact, when $\eta$ is smooth, the respective values of $\eta(x_1)$ and $\eta(x_2)$ are comparable for close enough $x_1$ and $x_2$. In other words, we can infer the sign of $\eta(x) - 1/2$ from those of the neighbors of $x$.

##### Assumption A1

(Smoothness) The regression function $\eta$ belongs to the Hölder class of parameter $1$ with a radius $L$, which is denoted $\mathcal{C}^{1,0}(\Omega, L)$ and corresponds to the set of functions such that

$$\forall (x_1, x_2) \in \Omega^2, \quad |\eta(x_1) - \eta(x_2)| \le L |x_1 - x_2|.$$

###### Remark 2.1.

It would be tempting to consider some more general smoothness classes for the regression function $\eta$. Nevertheless, the standard nearest neighbor algorithm does not make it possible to use smoothness indexes greater than . An alternative procedure has been proposed in [Sam12]: the idea is then to balance the with a suitable monotone weighting sequence. However, this modification complicates the statistical analysis and may alter the ideas developed below. We therefore chose to fix the smoothness of $\eta$ to $1$ (i.e. restrict our study to $\mathcal{C}^{1,0}(\Omega, L)$).

Our second assumption was introduced by [Tsy04] in the binary supervised classification model (see [MT99] in a smooth discriminant analysis setting).

##### Assumption A2

(Margin assumption) For any $\alpha \ge 0$, a constant $C > 0$ exists such that:

$$\mathbb{P}_X\left( 0 < \left| \eta(X) - \frac{1}{2} \right| < \epsilon \right) \le C \epsilon^{\alpha}, \quad \forall \epsilon > 0.$$

In such a case, we write $(\mu, \eta) \in \mathcal{M}_{\alpha}$.

The Bayes classifier depends on the sign of $\eta - 1/2$. Intuitively, it would be easier to mimic the behavior of this classifier when the mass around the set $\{\eta = 1/2\}$ is small. On the other hand, the decision process may be more complicated when $\eta(X)$ is close to $1/2$ with a large probability. Quantifying this closeness is the purpose of this margin assumption.
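As a worked illustration of the margin assumption (a toy model chosen here for convenience, not one taken from the paper), assume that $X$ is uniform on $(-1, 1)$ and that $\eta_\gamma(x) = \frac{1}{2} + \frac{1}{2}\,\mathrm{sign}(x)\,|x|^{\gamma}$ for some $\gamma \ge 1$. Then $|\eta_\gamma(x) - 1/2| = |x|^{\gamma}/2$ and, for every $\epsilon > 0$,

$$\mathbb{P}_X\left( 0 < \left| \eta_\gamma(X) - \tfrac{1}{2} \right| < \epsilon \right) = \mathbb{P}_X\left( 0 < |X| < (2\epsilon)^{1/\gamma} \right) = \min\left\{ (2\epsilon)^{1/\gamma},\, 1 \right\} \le 2^{1/\gamma}\, \epsilon^{1/\gamma}.$$

In this family the margin condition therefore holds with $\alpha = 1/\gamma$ and $C = 2^{1/\gamma}$: the flatter $\eta_\gamma$ is near the decision boundary (large $\gamma$), the smaller the margin index. The linear case $\gamma = 1$ gives $\alpha = 1$.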

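For the sake of convenience, we use the set $\mathcal{F}_{L,\alpha}$ throughout the paper, which contains distributions that satisfy both Assumptions A1 and A2, namely:

$$\mathcal{F}_{L,\alpha} := \left\{ \mathbb{P}_{(X,Y)} : \mathbb{P}_X(dx) = \mu(x)\,dx \ \text{ and } \ \mathcal{L}(Y \mid X) \sim \mathcal{B}(\eta(X)) \ \text{ with } \ \eta \in \mathcal{C}^{1,0}(\Omega, L) \ \text{ and } \ (\mu, \eta) \in \mathcal{M}_{\alpha} \right\}$$

We now turn to our last assumption that involves the marginal distribution of the variable $X$.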

### 2.4 Minimal Mass Assumption

In the sequel, this type of hypothesis will play a very important role.

##### Assumption A3

(Strong Minimal Mass Assumption) There exists $\kappa > 0$ such that the marginal distribution of $X$ satisfies $\mathbb{P}_X \in \mathcal{M}_{mma}(\Omega, \kappa)$ where

$$\mathcal{M}_{mma}(\Omega, \kappa) := \left\{ \mathbb{P}_X : \mathbb{P}_X(dx) = \mu(x)\,dx \ \middle|\ \exists \delta_0 > 0,\ \forall \delta \le \delta_0,\ \forall x \in \Omega : \ \mathbb{P}_X(X \in B(x, \delta)) \ge \kappa\, \mu(x)\, \delta^d \right\}.$$

This assumption guarantees that $\mathbb{P}_X$ possesses a minimal amount of mass on each ball $B(x, \delta)$, this lower bound being balanced by the level of the density at $x$. In some sense, distributions in $\mathcal{M}_{mma}(\Omega, \kappa)$ will make it possible to obtain reliable predictions of the regression function according to its Lipschitz property. The Strong Minimal Mass Assumption A3 may be seen as a refinement of the so-called Besicovitch assumption that is quite popular in the statistical literature (see, e.g., [Dev81] for a version of the Besicovitch assumption used for pointwise consistency or [CG06] for a general discussion on this hypothesis in finite or infinite dimension). It is worth pointing out that the Besicovitch assumption introduced in [CG06] states that $\eta$ satisfies the following $\mu$-continuity property:

$$\forall \epsilon > 0, \quad \lim_{\delta \to 0} \mathbb{P}_X\left\{ x : \frac{1}{\mu(B(x, \delta))} \int_{B(x, \delta)} |\eta(z) - \eta(x)| \, d\mu(z) > \epsilon \right\} = 0 \tag{2.4}$$

In our setting, since $\eta$ is $L$-Lipschitz (Assumption A1), we can check that for all $x \in \Omega$

$$\int_{B(x, \delta)} |\eta(z) - \eta(x)| \, \mu(z)\,dz \le L \int_{B(x, \delta)} |x - z| \, \mu(z)\,dz \le L\, \delta\, \mu(B(x, \delta)),$$

which implies that the probability in (2.4) vanishes as soon as $L\delta \le \epsilon$. We will see that Assumption A3 is necessary to obtain quantitative estimates for any finite dimensional classification problem in a general setting.

In a slightly different framework, our Assumption A3 is similar to the Strong Density Assumption used in [AT07] when the density is lower bounded on its (compact) support, which is assumed to possess some geometrical properties ($(c_0, r_0)$-regularity). This setting is at the core of the study presented in Section 3 below. Assumption A3 also recalls the notion of standard sets used in [Cas07] for the estimation of compact support sets. More generally, the following examples present some standard distributions that satisfy Assumption A3.

###### Example 2.1.

• In , it is not difficult to check that Gaussian measures with non-degenerated covariance matrices satisfy . As a simple example, consider a standard Gaussian law . For any and , if belongs to a compact set , then a constant exists such that . Now, if , we can check that:

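$$(2\pi)^{-1/2} \int_{x-\delta}^{x+\delta} (2\pi)^{-1/2} e^{-t^2/2} \, dt \sim (2\pi)^{-1/2} e^{-x^2/2} \left[ \frac{e^{x\delta}}{x - \delta} - \frac{e^{-x\delta}}{x + \delta} \right] e^{-\delta^2/2}.$$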

The bracket above is always greater than when . Now, if , a simple Taylor expansion yields

$$(2\pi)^{-1/2} \int_{x-\delta}^{x+\delta} (2\pi)^{-1/2} e^{-t^2/2} \, dt \sim \mu(x)\, \frac{1 + 2x\delta}{x} \gtrsim \mu(x)\, \delta.$$

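• The same computations are still possible for symmetric Laplace distributions ( when is small. Thus, any Laplace distribution satisfies Assumption A3. In the same way, when $\mu$ is a standard Cauchy distribution, we can check that:

$$\int_{x-\delta}^{x+\delta} \frac{dt}{1 + t^2} = \frac{1}{1 + x^2} \int_{-\delta}^{\delta} \frac{1}{1 + h\,\frac{2x + h}{1 + x^2}} \, dh \sim \frac{1}{1 + x^2} \left[ 2\delta - \frac{2}{3} \frac{\delta^3}{1 + x^2} + \frac{8 \delta^3 x^2}{(1 + x^2)^2} + o(\delta^3) \right] \gtrsim \frac{\delta}{1 + x^2}$$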
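The Gaussian case of the examples above can also be checked numerically. For the standard Gaussian density $\mu$ in dimension $d = 1$, symmetrizing the integral gives $\mathbb{P}_X(B(x,\delta)) = \mu(x) \int_{-\delta}^{\delta} e^{-h^2/2} \cosh(xh)\,dh \ge 2 e^{-\delta^2/2}\, \mu(x)\, \delta$, so one admissible constant is $\kappa = 2e^{-\delta_0^2/2}$. The sketch below evaluates the ratio $\mathbb{P}_X(B(x,\delta)) / (\mu(x)\,\delta)$ on a grid; the function names are hypothetical, and `math.erfc` is used instead of `math.erf` to keep accuracy in the tails.

```python
import math


def gaussian_ball_mass(x: float, delta: float) -> float:
    """P(|Z - x| <= delta) for Z ~ N(0, 1), written with erfc for tail accuracy."""
    a = abs(x)  # the standard Gaussian is symmetric
    lower = (a - delta) / math.sqrt(2.0)
    upper = (a + delta) / math.sqrt(2.0)
    return 0.5 * (math.erfc(lower) - math.erfc(upper))


def mass_ratio(x: float, delta: float) -> float:
    """Ratio P_X(B(x, delta)) / (mu(x) * delta) in dimension d = 1."""
    density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return gaussian_ball_mass(x, delta) / (density * delta)


if __name__ == "__main__":
    for delta in (0.01, 0.1, 0.5):
        smallest = min(mass_ratio(0.5 * i, delta) for i in range(21))  # x in [0, 10]
        bound = 2.0 * math.exp(-0.5 * delta * delta)
        print(f"delta={delta}: min ratio on grid = {smallest:.6f}, lower bound = {bound:.6f}")
```

On this grid the ratio is smallest at $x = 0$ and grows with $|x|$, in line with the idea that the lower bound in Assumption A3 is scaled by the local level of the density.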

Typically, distributions that do not satisfy the Strong Minimal Mass Assumption (A3) possess some important oscillations in their tails (when the density is close to $0$). In such a setting, the alternative set , defined as follows, may be considered:

The interest of the weaker compared to is that the statistical abilities of the nearest neighbor rule are still the same with or . Moreover, an analytic criterion that ensures can be found (see Proposition 4.1). This is not the case for the uniform assumption (it is indeed more difficult to ensure the lower bound on the global set ).

Although all the subsequent results may be established for a weaker version of the minimal mass assumption (based on the set ), we will restrict ourselves to its strong formulation (Assumption A3). In Section 3, we prove that the nearest neighbor rule is optimal in the minimax sense provided that the margin and smoothness assumptions hold, with a marginal density of the variable $X$ bounded away from $0$ and a suitable choice of $k_n$. In Section 4, we will see that Assumption A3 is not yet sufficient to derive consistent classifiers for non-compactly supported densities, and a last additional hypothesis is needed.

## 3 Bounded away from zero densities

### 3.1 Minimax consistency of the nearest neighbor rule

In this section, we are interested in the special case of a marginal density bounded from below by a strictly positive constant $\mu_-$. In this context, we can state an upper bound on the consistency rate of the nearest neighbor rule.

###### Theorem 3.1.

Assume that Assumptions A1-A3 hold. The nearest neighbor classifier with satisfies

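$$\sup_{\mathbb{P}_{X,Y} \in \mathcal{F}_{L,\alpha} \cap \mathcal{M}^{\mu_-}_{mma}(\Omega, \kappa)} \left[ R(\Phi_{n,k_n}) - R(\Phi^*) \right] \lesssim n^{-\frac{1+\alpha}{2+d}},$$

where $\mathcal{M}^{\mu_-}_{mma}(\Omega, \kappa)$ denotes the subset of densities of $\mathcal{M}_{mma}(\Omega, \kappa)$ that are bounded from below by $\mu_-$.

Theorem 3.1 establishes a consistency rate of the nearest neighbor rule over $\mathcal{F}_{L,\alpha} \cap \mathcal{M}^{\mu_-}_{mma}(\Omega, \kappa)$. A detailed proof is presented in Section A.2. Implicitly, we restrict our analysis to compactly supported observations, this assumption being at the core of several statistical analyses (see, e.g., [GKKW02], [BBL05], [MT99] or [HPS08] among others). It is worth pointing out that this setting falls into the framework considered in [AT07].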
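The exponent in Theorem 3.1 can be recovered by a standard bias-variance heuristic. What follows is only an informal sketch under simplifying assumptions (constants dropped, deviations treated as deterministic); it is not the argument of Section A.2. With $k$ neighbors, the average of $k$ Bernoulli labels fluctuates at the scale $k^{-1/2}$. When $\mu \ge \mu_-$, the minimal mass property suggests that the $k$-th neighbor lies at a distance of order $(k/(n \mu_-))^{1/d}$, so the Lipschitz property gives a bias of order $L (k/n)^{1/d}$:

$$|\hat{\eta}_n(x) - \eta(x)| \lesssim k^{-1/2} + L \left( \frac{k}{n} \right)^{1/d}.$$

Balancing the two terms yields

$$k^{-1/2} \asymp \left( \frac{k}{n} \right)^{1/d} \iff k \asymp n^{\frac{2}{2+d}}, \qquad \text{so that} \qquad \epsilon_n := k^{-1/2} \asymp n^{-\frac{1}{2+d}}.$$

The plug-in rule can only disagree with the Bayes rule where $|\eta(x) - 1/2| \lesssim \epsilon_n$, and each such disagreement costs at most $|2\eta(x) - 1| \lesssim \epsilon_n$. The margin assumption A2 then gives

$$R(\Phi_{n,k}) - R(\Phi^*) \lesssim \epsilon_n \, \mathbb{P}_X\left( 0 < \left| \eta(X) - \tfrac{1}{2} \right| \lesssim \epsilon_n \right) \lesssim \epsilon_n^{1+\alpha} \asymp n^{-\frac{1+\alpha}{2+d}},$$

which matches the rate displayed above.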

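###### Definition 3.1 (Strong Density Assumption (SDA), [AT07]).

The marginal distribution of the variable $X$ satisfies the Strong Density Assumption if

• it admits a density $\mu$ w.r.t. the Lebesgue measure of $\mathbb{R}^d$,

• the density $\mu$ satisfies:

$$\mu_- \le \mu(x) \le \mu_+, \quad \forall x \in \mathrm{Supp}(\mu)$$

for some constants $0 < \mu_- \le \mu_+ < +\infty$.

• The support of $\mu$ is $(c_0, r_0)$-regular, namely:

$$\lambda\left[ \mathrm{Supp}(\mu) \cap B(x, r) \right] \ge c_0\, \lambda\left[ B(x, r) \right], \quad \forall r \le r_0,$$

for some positive constants $c_0$ and $r_0$.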

As soon as the marginal density is bounded from below by a strictly positive constant, then both SDA and Strong Minimal Mass Assumption (A3) are equivalent, as stated in the following proposition.

###### Proposition 3.1.

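For a density bounded away from zero, the SDA is equivalent to the Strong Minimal Mass Assumption.

###### Proof.

As soon as the support of $\mu$ is $(c_0, r_0)$-regular and the density is lower bounded by $\mu_-$, the SDA implies a minimal mass type assumption since, with $\gamma_d$ denoting the Lebesgue measure of the unit ball of $\mathbb{R}^d$:

$$\mathbb{P}_X(B(x, \delta)) = \int_{B(x, \delta)} \mu(z)\,dz \ge \mu_- \times \lambda\left[ B(x, \delta) \cap \mathrm{Supp}(\mu) \right] \ge c_0\, \gamma_d\, \mu_-\, \delta^d.$$

Conversely, we can also check that the Strong Minimal Mass Assumption (A3) implies the SDA (including the $(c_0, r_0)$-regularity of the support of $\mu$). Indeed, since for any $x \in \Omega$ and $\delta \le \delta_0$:

$$1 \ge \int_{B(x, \delta)} \mu(z)\,dz \ge C \mu(x)\, \delta^d,$$

then the density is upper bounded and we obtain that:

$$\int_{B(x, \delta)} \mu(z)\,dz \le \|\mu\|_{\infty}\, \lambda\left[ \mathrm{Supp}(\mu) \cap B(x, \delta) \right].$$

We therefore obtain:

$$\lambda\left[ \mathrm{Supp}(\mu) \cap B(x, \delta) \right] \ge \frac{C \mu(x)}{\|\mu\|_{\infty}}\, \delta^d \ge \frac{C \mu_-}{\|\mu\|_{\infty}}\, \delta^d.$$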

This concludes the proof of this proposition. ∎

It is possible to link the constants involved in the SDA with $\kappa$ involved in $\mathcal{M}_{mma}(\Omega, \kappa)$, but we have omitted their relationships here for the sake of simplicity. Minimax rates of excess risk under the SDA are established in [AT07]. A consequence of Proposition 3.1 is that the same lower bound is still valid under the Strong Minimal Mass Assumption.

###### Theorem 3.2 (Theorem 3.3, [AT07]).

Assume that Assumptions A1-A3 hold and a $\mu_- > 0$ exists such that $\mu(x) \ge \mu_-$ for all $x \in \Omega$. Then, the minimax classification rate is lower bounded as follows:

$$\inf_{\Phi} \sup_{\mathbb{P}_{X,Y} \in \mathcal{F}_{L,\alpha} \cap \mathcal{M}^{\mu_-}_{mma}(\Omega, \kappa)} \left[ R(\Phi) - R(\Phi^*) \right] \gtrsim n^{-\frac{1+\alpha}{2+d}}.$$

Thanks to the previous lower bound, we can conclude that the nearest neighbor rule achieves the minimax rate of convergence in the particular case where the density is lower bounded on its (compact) support. As already discussed in [MT99] or [AT07], the higher the margin index $\alpha$ is, the smaller the excess risk will be. On the other hand, the performance deteriorates as the dimension of the considered problem increases. This corresponds to the classical curse of dimensionality. The lower bound obtained by [AT07] is based on an adaptation of standard tools from nonparametric statistics (Assouad’s Lemma). This proof is of primary importance for the subsequent lower bound results. It is recalled in Appendix A for the sake of convenience.

### 3.2 The Smooth discriminant analysis model (Binomial sample-size)

While the supervised classification model (also referred to as the Poisson sample-size model) has been intensively studied in the last decades, the smooth discriminant analysis model has been considered as an alternative approach. This model is presented in [MT99] and is referred to as a binomial model in [HPS08]. It assumes that we have two independent samples of i.i.d. random variables at our disposal, one per class, each with its own density. Given a new incoming observation, the goal is then to predict its corresponding label, namely to determine which of the two densities it comes from.

In the classification setting, the positions $X_i$ are drawn according to $\mathbb{P}_X$ and the labels $Y_i$ are then sampled using $\eta(X_i)$, which makes the labels completely independent of each other, conditionally on their positions. This key observation is no longer true in smooth discriminant analysis: conditionally on the ordered spatial inputs induced by the nearest neighbor rule, the labels are not independent. This significantly complicates the analysis of the nearest neighbor rule and is a major difference with the standard classification task.

We briefly provide our main result on the nearest neighbor rule with the smooth discriminant analysis below. More complete details can be found in Appendix B.

###### Theorem 3.3.

The nearest neighbor classifier with satisfies

$$\sup_{\mathbb{P}_{X,Y} \in \mathcal{F}_{L,\alpha} \cap \mathcal{M}^{\mu_-}_{mma}(\Omega, \kappa)} \left[ R_{Binom}(\Phi_{n,k_n}) - R_{Binom}(\Phi^*) \right] \lesssim \log(n)\, n^{-\frac{1+\alpha}{2+d}},$$

where $R_{Binom}$ denotes the risk in the smooth discriminant analysis setting.

To the best of our knowledge, the performance of the nearest neighbor classifier in the binomial sample-size model has only been studied in [HPS08]. In their paper, the difference between the Poisson and the binomial model is studied through Rényi’s representation of order statistics. In contrast, we directly compute an upper bound for the binomial model. Our main argument relies on a Poissonization of the sample size (see, e.g., [Kac49]). Even if it is a standard way to cope with dependencies in probability, such a method has not yet been applied to smooth discriminant analysis.

Regarding the obtained consistency rates now, our result misses a log term in the smooth discriminant analysis setting. In [HPS08], the authors show that the difference of the excess risk between the classification and the smooth discriminant analysis is on the order of for twice differentiable functions (instead of only the Lipschitz situation in our case) and their resulting rate is for the optimal choice . Following their argument with a Lipschitz regression function , their excess risk becomes for the binomial model. Hence, for a margin , our result in Theorem 3.3 is weaker than the one in [HPS08] (because of our log term). This is not yet the case as soon as the margin since the result of [HPS08] does not take this parameter, which may be central to obtain fast rates, into account. Moreover, the approach of [HPS08] does not seem to simply manage the margin information of the classification.

Finally, our Poissonization method also applies to general densities that are not necessarily bounded from below (see Appendix B). This is a major difference with the results of [HPS08], which are valid for a compactly supported density $\mu$ that is bounded away from zero.

## 4 General finite dimensional case

### 4.1 The Tail Assumption

Results of the previous section are designed for the problem of supervised binary classification with compactly supported inputs and lower bounded densities. Such an assumption is an important prior on the problem that may be improper in several practical settings. Various situations involve Gaussian, Laplace, Cauchy or Pareto distributions on the observations, and both the compactness and the boundedness away from zero assumptions may seem to be very unrealistic. This is even more problematic when dealing with functional classification with a Gaussian White Noise model (GWN). In such a case, observations are described through an infinite sequence of Gaussian random variables and the SDA or the Minimal Mass Assumption are far from being well-tailored for this situation (see [Lia11] for a discussion and further references).

This section is dedicated to a more general case of binary supervised classification problems where the marginal density $\mu$ of $X$ is no longer assumed to be lower bounded on its support. The main problem related to such a setting is that we have to predict labels in places where few (or even no) observations are available in the training set. In order to address this problem, we make the following assumption.

##### Assumption A4

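(Tail Assumption) A function $\psi$ that satisfies $\psi(\epsilon) \to 0$ as $\epsilon \to 0$ and that increases in a neighborhood of $0$ exists such that

$$\mathbb{P}_{(X,Y)} \in \mathcal{P}_{\mathcal{T},\psi} := \left\{ \mathbb{P}_X : \exists \epsilon_0 \in \mathbb{R}_+^* : \forall \epsilon < \epsilon_0, \ \mathbb{P}_X(\{\mu < \epsilon\}) \le \psi(\epsilon) \right\},$$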

where corresponds to the particular case where .

The aim of this Tail Assumption is to ensure that the set where $\mu$ is small has a small mass. We use the notation $\mathcal{P}_{\mathcal{T},\psi}$ because of the interpretation in terms of the tail of $\mu$, but this is not just an assumption on the tail of the distribution. It is, in fact, an assumption on the behavior of $\mu$ near the set $\{\mu = 0\}$. We provide some examples of marginal distributions below that satisfy this tail requirement. In Section 4.2 below, we prove that the Tail Assumption (A4) is unavoidable in this setting. In Section 4.3, we investigate the performance of the nearest neighbor rule in this setting.

###### Example 4.1.

Following are several families of densities in .

• Laplace distributions obviously satisfy

, and a straightforward integration by parts shows that Gamma distributions

satisfy with (the term around is on the order of and thus negligible compared to the term around ).

• An immediate computation shows that the family of Pareto distributions of parameters satisfies where , regardless of the value of .

• The family of Cauchy distributions satisfies with .

• Univariate Gaussian laws with mean and variance satisfy

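$$\gamma_{m,\sigma^2}(x) \le \epsilon \iff |x - m| \ge t_{\sigma,\epsilon} := \sqrt{2}\,\sigma \sqrt{\log\left(\frac{1}{\epsilon}\right) + \log\left(\frac{1}{\sigma\sqrt{2\pi}}\right)},$$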

and a standard result on the size of Gaussian tails (see [BNC89]) yields

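$$\gamma_{m,\sigma^2}\left(\gamma_{m,\sigma^2} \le \epsilon\right) = \frac{\epsilon}{t_{\sigma,\epsilon}} \left[ 1 - \frac{1}{t_{\sigma,\epsilon}^2} + \frac{1 \cdot 3}{t_{\sigma,\epsilon}^4} \ldots \right] \lesssim \frac{\epsilon}{\sqrt{\log\left(\frac{1}{\epsilon}\right)}}.$$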

Hence, univariate Gaussian laws satisfy with .

• If is any real vector of and a covariance matrix whose spectrum is :

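$$\gamma_{m,\Sigma^2}\left(\gamma_{m,\Sigma^2} \le \epsilon\right) = \gamma_{0,\Sigma^2}\left(\gamma_{0,\Sigma^2} \le \epsilon\right) \lesssim \gamma_{0,\Sigma^2}\left( \|X\| \ge \sqrt{2 \lambda_1 \log\left(\frac{1}{\epsilon}\right)} \right).$$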

Careful inspection of Theorem 1 of [HLS02] now yields

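$$\gamma_{0,\Sigma^2}\left( \|X\| \ge \sqrt{2 \lambda_1 \log\left(\frac{1}{\epsilon}\right)} \right) \sim C_{\Sigma^2} \log\left(\frac{1}{\epsilon}\right)^{r/2 - 1} \epsilon,$$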

where is a constant that only depends on the spectrum of and

is the multiplicity of the eigenvalue

. In particular, satisfy where .
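The tail masses in the univariate examples above can be checked numerically. The following is a minimal sketch, assuming centered Laplace and Gaussian densities; the function names `laplace_tail_mass` and `gaussian_tail_mass` are hypothetical and are not taken from the paper. It computes the mass of the set where the density is below a level `eps`, which is the quantity that the Tail Assumption bounds by a function of `eps`.

```python
import math

def laplace_tail_mass(eps, b=1.0):
    """Mass of {mu < eps} for the Laplace density mu(x) = exp(-|x|/b) / (2b)."""
    if eps >= 1.0 / (2.0 * b):
        return 1.0  # eps is above the mode value, so the set has full mass
    # mu(x) < eps  <=>  |x| > t  with  t = b * log(1 / (2 b eps))
    t = b * math.log(1.0 / (2.0 * b * eps))
    return math.exp(-t / b)  # P(|X| > t) = exp(-t/b), i.e. 2 * b * eps

def gaussian_tail_mass(eps, sigma=1.0):
    """Mass of {gamma < eps} for the centered Gaussian density of variance sigma^2."""
    peak = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    if eps >= peak:
        return 1.0
    # Threshold t_{sigma,eps} from the displayed equivalence in the Gaussian example
    t = math.sqrt(2.0) * sigma * math.sqrt(math.log(1.0 / eps) + math.log(peak))
    return math.erfc(t / (sigma * math.sqrt(2.0)))  # P(|X| >= t)

if __name__ == "__main__":
    for eps in (1e-2, 1e-3, 1e-4):
        ratio = gaussian_tail_mass(eps) / (eps / math.sqrt(math.log(1.0 / eps)))
        print(eps, laplace_tail_mass(eps), gaussian_tail_mass(eps), ratio)
```

In this sketch the Laplace mass is linear in `eps`, while the Gaussian mass divided by `eps / sqrt(log(1/eps))` stays bounded as `eps` decreases, in line with the Gaussian bound displayed in the example.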

### 4.2 Non-consistency results

We first justify the introduction of the sets and and discuss their influences regarding uniform lower bounds and even consistency of any estimator. To do this, we first state that the Minimal Mass Assumption (A3) is necessary to obtain uniformly consistent classification rules. Second, we assert that the Tail Assumption (A4) is also unavoidable.

###### Theorem 4.1.

Assume that the law belongs to , then:

• No classification rule can be universally consistent if Assumptions A1-A3 hold and not A4. For any discrimination rule and for any , a distribution in exists such that:

$$\mathcal{R}(\Phi_n) - \mathcal{R}(\Phi^*) \ge \epsilon.$$

• No classification rule can be universally consistent if Assumption A1, A2, A4 hold and not A3. For any discrimination rule and for any , a distribution in exists such that:

$$\mathcal{R}(\Phi_n) - \mathcal{R}(\Phi^*) \ge \epsilon.$$

The first result asserts that even if the Minimal Mass Assumption A3 holds for the underlying density on $\mathbb{R}^d$, it is not possible to expect a uniform consistency result over the entire class of non-compactly supported densities considered. In some sense, the support of the variable $X$ seems to be too large to obtain reliable predictions with any classifier without additional assumptions. As discussed above, the Tail Assumption A4 may make it possible to counterbalance this curse-of-support effect (see next section). Such statistical damage has also been observed for the estimation of densities that are supported on the real line instead of being compactly supported, even though such dramatic consequences are not shown here. We refer to [RBRTM11] and the references therein for a more detailed description.

The second result states that the Strong Minimal Mass Assumption A3 cannot be skipped if uniform consistency rates are sought for non-compactly supported densities. This is in line with the former studies of [Győ78] and [DGKL94]. In particular, Lemma 2.2 of [DGKL94] takes advantage of some of the positive consequences of this type of assumption. Our proof relies on the construction of a sample-size-dependent law on that violates our Minimal Mass Assumption A3 but that keeps the regression function $\eta$ in our smoothness class $\mathcal{F}_{L,\alpha}$. This is a major difference with former counterexamples built in [DGL96], where the non-uniform consistency is obtained with a family of non-smooth regression functions $\eta$. In our study, we also obtain a family of smooth regression functions for which such phenomena occur. Even in this case, it is still possible to keep the excess risk strictly positive for any classifier (and no longer only for nearest neighbor rules).

Finally, it should be noted that our inconsistency results always arise from building a network of regression functions that oscillate around the value $1/2$ in the neighborhood of the set $\{\mu = 0\}$. In a sense, Theorem 4.1 contributes to the understanding of one of the open questions put forth in [Can13] on the behavior of the nearest neighbor rule when $\eta$ is oscillating about $1/2$ in the tail.

### 4.3 Minimax rates of convergence

However, when Assumptions A2, A3 and A4 all hold, we are able to precisely describe the corresponding minimax rate of convergence.

#### 4.3.1 Minimax lower bound

###### Theorem 4.2.

Assume that Assumptions A1-A4 hold. Then

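$$\inf_{\Phi_n} \sup_{\mathbb{P}_{(X,Y)} \in \mathcal{F}_{L,\alpha} \cap \mathcal{M}_{mma}(\Omega,\kappa) \cap \mathcal{P}_{T,\mathrm{Id}}} \left[ \mathcal{R}(\Phi_n) - \mathcal{R}(\Phi^*) \right] \gtrsim n^{-\frac{1+\alpha}{2+\alpha+d}}.$$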

For the sake of convenience, we briefly outline the proof of Theorem 3.2 borrowed from [AT07] in Section A.1. It is then adapted to our new set of assumptions.

Theorem 4.5 below provides some lower bounds for different tails of distributions (through the function $\psi$). It should be noted that we recover the known rate for compactly supported densities under the so-called Mild Density Assumption of [AT07] in the particular case $\psi = \mathrm{Id}$. This implies that in the non-compact case, the rate cannot be improved compared to the compact setting, even with an additional Tail Assumption.

#### 4.3.2 An upper bound for the nearest neighbor rule

When the density is no longer bounded away from $0$, the integer $k_n$ will be chosen in order to counterbalance the vanishing probability of the small balls in the tail of the distributions. For example, when $\psi = \mathrm{Id}$, we show that a suitable choice of the integer $k_n$ is:

$$k_n := \left\lfloor n^{\frac{2}{3+\alpha+d}} \right\rfloor,$$

which appears to be quite different from the one in the previous section.

###### Theorem 4.3.

Assume that A1-A3 hold. If the Tail Assumption A4 is driven by $\psi = \mathrm{Id}$, the choice $k_n = \lfloor n^{2/(3+\alpha+d)} \rfloor$ yields:

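$$\sup_{\mathbb{P}_{(X,Y)} \in \mathcal{F}_{L,\alpha} \cap \mathcal{P}_{T,\mathrm{Id}} \cap \mathcal{M}_{mma}(\Omega,\kappa)} \left[ \mathcal{R}(\Phi_{n,k_n}) - \mathcal{R}(\Phi^*) \right] \lesssim n^{-\frac{1+\alpha}{3+\alpha+d}}.$$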

The proof of Theorem 4.3 is provided in Section A.3. The above results indicate that the price to pay for the classification from entries in compact sets to arbitrary large sets of is translated by the degradation from to at least (see, e.g., Theorem 4.2 when ). Our upper bound for the nearest neighbor rule does not exactly match this lower bound since we obtain in a similar situation . At this step, obtaining the appropriate minimax rate requires slight changes inside the construction of the nearest neighbor rule. This is the purpose of the next paragraph.
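To make the gap between the two rates concrete, here is a minimal sketch that evaluates the number of neighbors used in Theorem 4.3 and the two rate exponents for the identity tail. The function names are hypothetical helpers introduced for illustration only; the formulas are the ones displayed in Theorems 4.2 and 4.3.

```python
import math

def k_n(n, alpha, d):
    """Number of neighbors k_n = floor(n^(2 / (3 + alpha + d))) from Theorem 4.3."""
    return math.floor(n ** (2.0 / (3.0 + alpha + d)))

def lower_bound_exponent(alpha, d):
    """Exponent r of the minimax lower bound n^(-r) in Theorem 4.2."""
    return (1.0 + alpha) / (2.0 + alpha + d)

def knn_upper_exponent(alpha, d):
    """Exponent r of the nearest neighbor upper bound n^(-r) in Theorem 4.3."""
    return (1.0 + alpha) / (3.0 + alpha + d)

if __name__ == "__main__":
    n, alpha = 1000, 1.0
    for d in (1, 2, 5, 10):
        print(d, k_n(n, alpha, d),
              lower_bound_exponent(alpha, d),
              knn_upper_exponent(alpha, d))
```

For every admissible pair of parameters the upper exponent is strictly smaller than the lower-bound exponent, and the difference shrinks as the dimension grows.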

#### 4.3.3 Minimax upper bound for an optimal nearest neighbor rule

The upper bound proposed in Theorem 4.3 can be improved if we change the way in which the regularization parameter $k_n$ is constructed. We use a nearest neighbor algorithm with a number of neighbors that depends on the position of the observation $x$ according to the value of the density $\mu(x)$. More formally, we define for all

$$\Omega_{n,0} := \left\{ x \in \mathbb{R}^d : \ \mu(x) \ge n^{-\frac{\alpha}{2+\alpha+d}} \right\},$$

and

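$$\Omega_{n,j} = \left\{ x \in \mathbb{R}^d : \ \frac{n^{-\frac{\alpha}{2+\alpha+d}}}{2^{j}} \le \mu(x) < \frac{n^{-\frac{\alpha}{2+\alpha+d}}}{2^{j-1}} \right\}.$$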

Setting , we then use for all

$$k_n(x) = \left\lfloor k_{n,0} \, 2^{-2j/(2+d)} \right\rfloor \vee 1 \quad \text{when } x \in \Omega_{n,j}. \tag{4.1}$$

According to (4.1), the number of neighbors involved in the decision process depends on the spatial position of the input $x$. In some sense, this position is linked to the tail. The statistical performance of the corresponding nearest neighbor classifier is given below. Such a construction of this sequence of “slices” may be interpreted as a spatially adaptive bandwidth selection. This bandwidth is smaller at points $x$ such that $\mu(x)$ is small. In a sense, this idea is close to the one introduced in [GL14], which provides a similar slicing procedure to obtain an adaptive minimax density estimation on $\mathbb{R}^d$.
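The slicing rule (4.1) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the density value `mu_x` is assumed to be known and positive, the slices are assumed to be dyadic (the slice of index `j >= 1` collects the points whose density lies between the threshold divided by `2**j` and the threshold divided by `2**(j-1)`), and the base number of neighbors `k_n0` is passed in as a parameter. The names `slice_index`, `local_k` and `knn_predict` are hypothetical.

```python
import math

def slice_index(mu_x, n, alpha, d):
    """Index j of the slice Omega_{n,j} that contains a point of density mu_x > 0."""
    if mu_x <= 0.0:
        raise ValueError("mu_x must be positive")
    threshold = n ** (-alpha / (2.0 + alpha + d))
    j = 0
    # Smallest j such that threshold / 2^j <= mu_x (j = 0 means mu_x >= threshold)
    while mu_x < math.ldexp(threshold, -j):
        j += 1
    return j

def local_k(mu_x, n, alpha, d, k_n0):
    """Number of neighbors floor(k_n0 * 2^(-2j/(2+d))) v 1, as in (4.1)."""
    j = slice_index(mu_x, n, alpha, d)
    return max(math.floor(k_n0 * 2.0 ** (-2.0 * j / (2.0 + d))), 1)

def knn_predict(x, sample, k):
    """Majority vote among the k nearest sample points; ties are resolved as label 0.

    sample is a list of (x_i, y_i) pairs, x_i a tuple of floats, y_i in {0, 1}.
    """
    nearest = sorted(sample, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0
```

A prediction at `x` would then use `knn_predict(x, sample, local_k(mu_x, n, alpha, d, k_n0))`. In practice the density is unknown, so `mu_x` would have to be replaced by an estimate; that step is outside this sketch.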

###### Theorem 4.4.

Assume that A1-A3 hold and that the Tail Assumption A4 is driven by $\psi = \mathrm{Id}$. Then, if $\Phi^*_{n,k_n}$ is the classifier associated with (4.1), we have:

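$$\sup_{\mathbb{P}_{(X,Y)} \in \mathcal{F}_{L,\alpha} \cap \mathcal{P}_{T,\mathrm{Id}} \cap \mathcal{M}_{mma}(\Omega,\kappa)} \left[ \mathcal{R}(\Phi^*_{n,k_n}) - \mathcal{R}(\Phi^*) \right] \lesssim n^{-\frac{1+\alpha}{2+\alpha+d}} (\log n)^{\frac{1}{2} + \frac{1}{d}}.$$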

We stress that the upper bound obtained in Theorem 4.4 nearly matches the lower bound proposed in Theorem 4.2, up to a log-term. This log-term can be removed by the use of additional technicalities that are omitted in our proof. Hence, Theorems 4.2 and 4.4 make it possible to identify the exact minimax rate of classification when the Tail Assumption is driven by $\psi = \mathrm{Id}$, that is:

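$$\inf_{\Phi_n} \sup_{\mathbb{P}_{(X,Y)} \in \mathcal{F}_{L,\alpha} \cap \mathcal{P}_{T,\mathrm{Id}} \cap \mathcal{M}_{mma}(\Omega,\kappa)} \left[ \mathcal{R}(\Phi_n) - \mathcal{R}(\Phi^*) \right] \sim n^{-\frac{1+\alpha}{2+\alpha+d}}.$$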

#### 4.3.4 Generalizations

We propose several extensions of our previous results (lower and upper bounds) to more general tails of distribution. We also propose to weaken the Minimal Mass Assumption $\mathcal{M}_{mma}(\Omega,\kappa)$.

##### Effect of the tail: from $\mathcal{P}_{T,\mathrm{Id}}$ to $\mathcal{P}_{T,\psi}$

###### Theorem 4.5.

Assume that Assumptions A1-A4 hold. For any tail parameterized by a function $\psi$, we obtain the following results:

• Lower bound: the minimax classification rate satisfies:

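$$\inf_{\Phi_n} \sup_{\mathbb{P}_{(X,Y)} \in \mathcal{F}_{L,\alpha} \cap \mathcal{P}_{T,\psi} \cap \mathcal{M}_{mma}(\Omega,\kappa)} \left[ \mathcal{R}(\Phi_n) - \mathcal{R}(\Phi^*) \right] \gtrsim \epsilon_{n,\alpha,d}^{1+\alpha},$$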

where $\epsilon_{n,\alpha,d}$ satisfies the balance

$$n^{-1} = \{\epsilon_{n,\alpha,d}\}^{2+d} \times \psi^{-1}\left( \{\epsilon_{n,\alpha,d}\}^{\alpha} \right). \tag{4.2}$$

• Upper bound: the nearest neighbor rule satisfies

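$$\sup_{\mathbb{P}_{(X,Y)} \in \mathcal{F}_{L,\alpha} \cap \mathcal{P}_{T,\psi} \cap \mathcal{M}_{mma}(\Omega,\kappa)} \left[ \mathcal{R}(\Phi_{n,k_n}) - \mathcal{R}(\Phi^*) \right] \le C \nu_{n,\alpha,d}^{1+\alpha}$$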

with where fulfills the balance:

$$n^{-1} = \psi^{-1}\left( \{\nu_{n,\alpha,d}\}^{1+\alpha} \right) \{\nu_{n,\alpha,d}\}^{2+d}. \tag{4.3}$$

It would also be possible to propose some generalizations using the sliced nearest neighbor rule presented in Sections 4.3.2 and 4.3.3 for tails driven by a general function , even if we do not include this additional result for the purpose of clarity.
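The balance equations (4.2) and (4.3) generally have no closed-form solution, but they can be solved numerically. The sketch below uses bisection on a logarithmic scale; it assumes that the supplied inverse tail function `psi_inv` is non-decreasing and defined on (0, 1], and that the solution lies in the interval (1e-12, 1). The helper names `solve_balance`, `epsilon_n` and `nu_n` are hypothetical.

```python
import math

def solve_balance(n, power, psi_inv, arg_power, iterations=200):
    """Solve 1/n = v**power * psi_inv(v**arg_power) for v in (1e-12, 1)."""
    target = 1.0 / n
    lo, hi = 1e-12, 1.0
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)  # geometric midpoint: bisection in log scale
        if mid ** power * psi_inv(mid ** arg_power) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def epsilon_n(n, alpha, d, psi_inv):
    """Solution of the lower-bound balance (4.2)."""
    return solve_balance(n, 2.0 + d, psi_inv, alpha)

def nu_n(n, alpha, d, psi_inv):
    """Solution of the upper-bound balance (4.3)."""
    return solve_balance(n, 2.0 + d, psi_inv, 1.0 + alpha)

if __name__ == "__main__":
    identity = lambda u: u
    n, alpha, d = 10 ** 4, 1.0, 2
    # Rates epsilon^(1+alpha) and nu^(1+alpha) for the identity tail
    print(epsilon_n(n, alpha, d, identity) ** (1.0 + alpha))
    print(nu_n(n, alpha, d, identity) ** (1.0 + alpha))
```

With the identity tail, (4.2) reduces to $n^{-1} = \epsilon^{2+\alpha+d}$ and (4.3) to $n^{-1} = \nu^{3+\alpha+d}$, so the two computed rates coincide with the exponents of Theorems 4.2 and 4.3, which gives a quick sanity check of the solver.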

##### Meeting the Minimal Mass Assumption ˜Mmma(Ω,κ)

We now obtain similar rates when using the weaker assumption $\widetilde{\mathcal{M}}_{mma}(\Omega,\kappa)$ instead of $\mathcal{M}_{mma}(\Omega,\kappa)$: the lower bound is only needed at points $x$ such that $\mu(x)$ is large enough. We can state the next corollary.

###### Corollary 4.1.

Assume that A1,A2,A4 hold and , then

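$$\sup_{\mathbb{P}_{(X,Y)} \in \mathcal{F}_{L,\alpha} \cap \mathcal{P}_{T,\psi} \cap \widetilde{\mathcal{M}}_{mma}(\Omega,\kappa)} \left[ \mathcal{R}(\Phi_{n,k_n}) - \mathcal{R}(\Phi^*) \right] \lesssim \nu_{n,\alpha,d}^{1+\alpha},$$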

with where satisfies the balance

$$n^{-1} = \psi^{-1}\left( \{\nu_{n,\alpha,d}\}^{1+\alpha} \right) \{\nu_{n,\alpha,d}\}^{2+d}.$$

The condition cannot be easily described through an analytical condition because of its uniform nature over . In contrast, is more tractable in view of the criterion given by the next result (Proposition 4.1). Using a log-density model, we write the density as

$$\mu(x) = e^{-\varphi(x)}, \quad \forall x \in \mathbb{R}^d.$$

###### Proposition 4.1.

Let and assume that a real number exists such that:

$$\lim_{x : \mu(x) \to 0} \frac{\|\nabla \varphi(x)\|}{\varphi(x)^a} = 0,$$

then a suitable can be found such that .

###### Proof.

For any , we compute a lower bound of

$$\mathbb{P}_X(B(x,\delta)) = \int_{B(x,\delta)} e^{-\varphi(z)} \, dz.$$

The Jensen Inequality applied to the normalized Lebesgue measure over , which is denoted , yields

 ∫B(x,δ)e−φ(z)dz≥πd/2δdΓ(