The Nearest Neighbor Information Estimator is Adaptively Near Minimax Rate-Optimal

We analyze the Kozachenko–Leonenko (KL) nearest neighbor estimator for differential entropy. We obtain the first uniform upper bound on its performance over Hölder balls on a torus without assuming any conditions on how close the density can be to zero. Together with a new minimax lower bound over the Hölder ball, we show that the KL estimator achieves the minimax rates up to logarithmic factors without cognizance of the smoothness parameter s of the Hölder ball for s ∈ (0, 2] and arbitrary dimension d, rendering it the first estimator that provably satisfies this property.


1 Introduction

Information theoretic measures such as entropy, Kullback–Leibler divergence and mutual information quantify the amount of information among random variables. They have many applications in modern machine learning tasks, such as classification [48], clustering [46, 58, 10, 41] and feature selection [1, 17]. Information theoretic measures and their variants can also be applied in several data science domains such as causal inference [18], sociology [49] and computational biology [36]. Estimating information theoretic measures from data is a crucial sub-routine in the aforementioned applications and has attracted much interest in the statistics community. In this paper, we study the problem of estimating the Shannon differential entropy, which is the basis of estimating other information theoretic measures for continuous random variables.

Suppose we observe

independent identically distributed random vectors

drawn from density function where . We consider the problem of estimating the differential entropy

(1)

from the empirical observations . The fundamental limit of estimating the differential entropy is given by the minimax risk

(2)

where the infimum is taken over all estimators that is a function of the empirical data . Here denotes a (nonparametric) class of density functions.
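For concreteness, the following standard worked example (an illustration of (1), not taken from the paper) evaluates the differential entropy of the uniform density on $[0,a]^d$ and shows that, unlike discrete Shannon entropy, differential entropy can be negative.

```latex
% Standard worked example (illustration, not from the paper):
% the uniform density f(x) = a^{-d} on [0,a]^d has
\[
  h(f) \;=\; -\int_{[0,a]^d} a^{-d} \ln\!\left(a^{-d}\right) \mathrm{d}x
        \;=\; d \ln a ,
\]
% which is negative whenever a < 1, unlike discrete Shannon entropy.
```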

The problem of differential entropy estimation has been investigated extensively in the literature. As discussed in [2], there exist two main approaches: one is based on kernel density estimators [30], and the other is based on nearest neighbor methods [56, 53, 52, 11, 3], which were pioneered by the work of [33].

The problem of differential entropy estimation lies within the general problem of estimating nonparametric functionals. Unlike the parametric counterparts, the problem of estimating nonparametric functionals is challenging even for smooth functionals. Initial efforts focused on inference of linear, quadratic, and cubic functionals in Gaussian white noise and density models, and laid the foundation for the ensuing research. We do not attempt to survey the extensive literature in this area, but instead refer the interested reader to, e.g., [24, 5, 12, 16, 6, 32, 37, 47, 8, 9, 54] and the references therein. For non-smooth functionals such as entropy, there is some recent progress [38, 26, 27] on designing theoretically minimax optimal estimators, but these estimators typically require knowledge of the smoothness parameters, and their practical performance is not yet known.

The $k$-nearest neighbor differential entropy estimator, or Kozachenko–Leonenko (KL) estimator, is computed in the following way. Let $R_{i,k}$ be the distance between $X_i$ and its $k$-nearest neighbor among $\{X_1, \ldots, X_n\} \setminus \{X_i\}$. Precisely, $R_{i,k}$ equals the $k$-th smallest number in the list $\{\|X_i - X_j\| : j \neq i\}$. Let $B(x, \rho)$ denote the closed ball centered at $x$ of radius $\rho$ and $\lambda$ be the Lebesgue measure on $\mathbb{R}^d$. The KL differential entropy estimator is defined as

(3)   $\hat{h}_{n,k}(X) = \frac{1}{n} \sum_{i=1}^{n} \ln \frac{n \cdot \lambda(B(X_i, R_{i,k}))}{e^{\psi(k)}},$

where $\psi(x)$ is the digamma function with $\psi(1) = -\gamma$, and $\gamma = 0.5772\ldots$ is the Euler–Mascheroni constant.
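A minimal numerical sketch of the estimator in (3) follows, using $\lambda(B(X_i, R_{i,k})) = V_d R_{i,k}^d$ with $V_d$ the volume of the unit Euclidean ball and plain (non-toroidal) Euclidean neighbor distances; the function name `kl_entropy` is our own and the snippet is an illustration under these assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln


def kl_entropy(x, k=1):
    """Fixed-k Kozachenko-Leonenko differential entropy estimate, in nats.

    Sketch of (3): (1/n) * sum_i [ ln(n * V_d * R_{i,k}^d) - psi(k) ],
    where R_{i,k} is the Euclidean distance from X_i to its k-th nearest
    neighbor and V_d is the volume of the unit ball in d dimensions.
    Assumes the n-by-d sample has no duplicate points (so R_{i,k} > 0).
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    # Query k+1 neighbors: the query point itself comes back at distance 0.
    dist, _ = cKDTree(x).query(x, k=k + 1)
    r_k = dist[:, -1]
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # ln V_d
    return np.mean(np.log(n) + log_vd + d * np.log(r_k)) - digamma(k)
```

For $k = 1$, the correction $-\psi(1) = \gamma$ recovers the Euler–Mascheroni term of the original Kozachenko–Leonenko estimator (up to the $\ln n$ versus $\ln(n-1)$ convention).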

There exists an intuitive explanation behind the construction of the KL differential entropy estimator. Writing informally, we have

(4)   $h(f) = \mathbb{E}_f\big[-\ln f(X)\big] \approx \frac{1}{n} \sum_{i=1}^{n} -\ln f(X_i) \approx \frac{1}{n} \sum_{i=1}^{n} -\ln \hat{f}(X_i),$

where the first approximation is based on the law of large numbers, and in the second approximation we have replaced $f$ by a nearest neighbor density estimator $\hat{f}$. The nearest neighbor density estimator follows from the "intuition"¹ that

(5)   $\hat{f}(X_i) = \frac{k/n}{\lambda(B(X_i, R_{i,k}))} \approx \frac{P(B(X_i, R_{i,k}))}{\lambda(B(X_i, R_{i,k}))} \approx f(X_i).$

Plugging (5) into (4) yields (3) up to the final additive bias correction term $\ln k - \psi(k)$, which follows from a detailed analysis of the bias of the KL estimator and will become apparent later.

¹ Precisely, we have $P(B(X_i, R_{i,k})) \sim \mathrm{Beta}(k, n-k)$ [4, Chap. 1.2]. A $\mathrm{Beta}(k, n-k)$ distributed random variable has mean $k/n$.

We focus on the regime where $k$ is fixed: in other words, it does not grow as the number of samples $n$ increases. The fixed-$k$ version of the KL estimator is widely applied in practice and enjoys smaller computational complexity; see [52].
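As a quick sanity check of the fixed-$k$ regime (an illustrative experiment of ours, not reported in the paper), one can compare the `kl_entropy` sketch above against the closed-form differential entropy $\frac{d}{2}\ln(2\pi e \sigma^2)$ of an isotropic Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 2, 20000, 1.0
x = rng.normal(scale=sigma, size=(n, d))

true_h = 0.5 * d * np.log(2 * np.pi * np.e * sigma ** 2)  # Gaussian differential entropy in nats
est_h = kl_entropy(x, k=1)  # fixed k = 1, as in the regime discussed above
print(f"KL estimate: {est_h:.3f}  vs  true entropy: {true_h:.3f}")
```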

There exists extensive literature on the analysis of the KL differential entropy estimator; we refer to [4] for a recent survey. One of the major difficulties in analyzing the KL estimator is that the nearest neighbor density estimator exhibits a huge bias when the density is small. Indeed, it was shown in [42] that the bias of the nearest neighbor density estimator does not vanish even as $n \to \infty$ and deteriorates as the density gets close to zero. In the literature, a large body of work assumes that the density is uniformly bounded away from zero [23, 29, 57, 30, 53], while others impose various assumptions quantifying on average how close the density is to zero [25, 40, 56, 14, 20, 52, 11]. In this paper, we focus on removing assumptions on how close the density is to zero.

1.1 Main Contribution

Let the Hölder ball on the unit cube (torus) be as formally defined in Definition 2 in Appendix A, with Hölder smoothness parameter $s$. The worst-case risk of the fixed $k$-nearest neighbor differential entropy estimator over this Hölder ball is controlled by the following theorem.

Theorem 1

Let $X_1, \ldots, X_n$ be i.i.d. samples from a density function $f$ in the Hölder ball. Then, for $s \in (0, 2]$, the fixed $k$-nearest neighbor KL differential entropy estimator in (3) satisfies

(6)

where the constant depends only on $s$, $k$, $d$ and the parameters of the Hölder ball.

The KL estimator is in fact nearly minimax rate-optimal, up to logarithmic factors, as shown in the following lower bound from [26].

Theorem 2

[26] Let $X_1, \ldots, X_n$ be i.i.d. samples from a density function $f$ in the Hölder ball. Then, there exists a constant depending only on $s$ and $d$ such that whenever the Hölder ball radius exceeds this constant,

(7)

where the constant in the lower bound depends only on $s$, $d$, and the Hölder ball radius.

Remark 1

We emphasize that one cannot remove the condition on the Hölder ball radius in Theorem 2. Indeed, if the Hölder ball has too small a width, then the density itself is bounded away from zero, which makes the differential entropy a smooth functional that can be estimated at the parametric rate [51, 50, 43].

Theorems 1 and 2 imply that, for any fixed $k$, the KL estimator achieves the minimax rates up to logarithmic factors without knowing the smoothness parameter $s$ for all $s \in (0, 2]$, which implies that it is near minimax rate-optimal (within logarithmic factors) for arbitrary dimension $d$. We cannot expect the vanilla version of the KL estimator to adapt to higher orders of smoothness, since the nearest neighbor density estimator can be viewed as a variable-width kernel density estimator with the box kernel, and it is well known in the literature (see, e.g., [55, Chapter 1]) that any positive kernel cannot exploit smoothness beyond $s = 2$. We refer to [26] for a more detailed discussion of this difficulty and potential solutions. The Jackknife idea, such as the one presented in [11, 3], might be useful for adapting to $s > 2$.

The significance of our work is two-fold:

  • We obtain the first uniform upper bound on the performance of the fixed $k$-nearest neighbor KL differential entropy estimator over Hölder balls without assuming how close the density can be to zero. We emphasize that assumptions of this type, such as the density being bounded away from zero, could make the problem significantly easier. For example, if the density is assumed to satisfy $f(x) \ge c > 0$ for some constant $c$, then the differential entropy becomes a smooth functional and, consequently, the general technique for estimating smooth nonparametric functionals [51, 50, 43] can be directly applied here to achieve the parametric minimax rate. The main technical tools that enable us to remove the conditions on how close the density can be to zero are the Besicovitch covering lemma (Lemma 4) and the generalized Hardy–Littlewood maximal inequality.

  • We show that, for any fixed $k$, the $k$-nearest neighbor KL entropy estimator nearly achieves the minimax rates without knowing the smoothness parameter $s$. In the functional estimation literature, designing estimators that provably adapt to unknown levels of smoothness is usually achieved via the Lepski method [39, 22, 45, 44, 27], which is not known to perform well in practice in general. On the other hand, a simple plug-in approach can achieve the minimax rate, but only when $s$ is known [26]. The KL estimator is well known to exhibit excellent empirical performance, but existing theory had not demonstrated its near-"optimality" when the smoothness parameter is unknown. Recent works [3, 52, 11] analyzed the performance of the KL estimator under various assumptions on how close the density can be to zero, with no matching lower bound up to logarithmic factors in general. Our work makes a step towards closing this gap and provides a theoretical explanation for the wide use of the KL estimator in practice.

The rest of the paper is organized as follows. Section 2 is dedicated to the proof of Theorem 1. We discuss some future directions in Section 3.

1.2 Notations

For positive sequences $a_n, b_n$, we use the notation $a_n \lesssim_{c} b_n$ to denote that there exists a constant $C$ that depends only on $c$ such that $a_n \le C\, b_n$, and $a_n \gtrsim_{c} b_n$ is equivalent to $b_n \lesssim_{c} a_n$. The notation $a_n \asymp_{c} b_n$ is equivalent to $a_n \lesssim_{c} b_n$ and $a_n \gtrsim_{c} b_n$. We write $a_n \lesssim b_n$ if the constant is universal and does not depend on any parameters. The notation $a_n \ll b_n$ means that $\limsup_{n \to \infty} a_n / b_n = 0$, and $a_n \gg b_n$ is equivalent to $b_n \ll a_n$. We write $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$.

2 Proof of Theorem 1

In this section, we prove that

(8)

for any density in the Hölder ball and any fixed $k$. The proof consists of two parts: (i) an upper bound on the bias, and (ii) an upper bound on the variance. Below we present the bias proof and relegate the variance proof to Appendix B.

First, we introduce the following notation:

(9)   $f_t(x) \triangleq \frac{P(B(x,t))}{\lambda(B(x,t))} = \frac{P(B(x,t))}{V_d\, t^d}.$

Here $P$ is the probability measure specified by the density function $f$ on the torus, $\lambda$ is the Lebesgue measure on the torus, and $V_d$ is the Lebesgue measure of the unit ball in $d$-dimensional Euclidean space. Hence $f_t(x)$ is the average density of a neighborhood near $x$. We first state two main lemmas about $f_t(x)$ which will be used later in the proof.

Lemma 1

If for some , then for any and , we have

(10)
Lemma 2

If for some and for all , then for any and any , we have

(11)

Furthermore, .

We relegate the proofs of Lemma 1 and Lemma 2 to Appendix C. Now we investigate the bias of the estimator. The following argument reduces the bias analysis to a function-analytic problem. For notational simplicity, we introduce a new random variable, independent of the sample, and study the bias through it. For every point, denote the $k$-nearest neighbor distance from that point to the sample under the toroidal distance, i.e., the $k$-nearest neighbor distance on the torus. Then,

(12)
(13)
(14)
(15)

We first show that the second term can be universally controlled regardless of the smoothness of $f$. Indeed, the distribution of the relevant random variable is known exactly [4, Chap. 1.2], and it was shown in [4, Theorem 7.2] that there exists a universal constant such that

(16)

Hence, it suffices to bound the first term, i.e., to show that

(17)

We split our analysis into two parts: Section 2.1 and Section 2.2 each establish an upper bound on one of the two remaining terms, which completes the proof.

2.1 Upper bound on

By the fact that for any , we have

(18)
(19)

Here the expectation is taken with respect to the randomness in the sample. Define the function

(20)

which intuitively is the distance around a point within which the probability mass equals a prescribed level. Then, for any point, we can split the quantity of interest into three terms as

(21)
(22)
(23)
(24)

Now we handle the three terms separately. Our goal is to show that each term is suitably bounded for every point; then, taking the integral with respect to $x$ leads to the desired bound.

  1. Term : whenever , by Lemma 1, we have

    (25)

    which implies that

    (26)
  2. Term : whenever satisfies that , by definition of , we have , which implies that

    (27)

    It follows from Lemma 2 that in this case

    (28)
    (29)

    Hence, we have

    (30)
    (31)
    (32)
  3. Term : we have

    (33)

    For any such that , we have

    (34)

    and by Lemma 2,

    (35)
    (36)

    Hence,

    (38)
    (39)

    where in the last step we have used the fact that since . Finally, we have

    (40)
    (41)

    Note that , and if , we have

    (42)

    Notice that . Hence, we have

    (43)
    (44)

2.2 Upper bound on

By splitting the term into two parts, we have

(45)
(46)
(47)
(48)

here we denote for simplicity of notation. For the term , we have

(49)
(50)
(51)
(52)

In the proof of the upper bound in Section 2.1, we have shown that the corresponding pointwise bound holds. Similarly, the analogous bound holds here for every point. Therefore, we have

(53)

Now we consider . We conjecture that in this case, but we were not able to prove it. Below we prove that . Define the function

(54)

Since , we have . Denote for any , therefore, we have that

(55)
(56)
(57)
(58)
(59)

where the last inequality uses the fact for all . As for , since , and for , we have

(60)
(61)
(62)
(63)
(64)

where in the last inequality we used the fact that for any . Hence,

(65)

Now we introduce the following lemma, which is proved in Appendix C.

Lemma 3

Let be two Borel measures that are finite on the bounded Borel sets of . Then, for all and any Borel set ,

(66)

Here is a constant that depends only on the dimension and

(67)

Applying the second part of Lemma 3 with being the Lebesgue measure and being the measure specified by on the torus, we can view the function as

(68)

Taking , then , so we know that