1 Introduction
Information theoretic measures such as entropy, Kullback–Leibler divergence and mutual information quantify the amount of information among random variables. They have many applications in modern machine learning tasks, such as classification [48], clustering [46, 58, 10, 41] and feature selection [1, 17]. Information theoretic measures and their variants can also be applied in several data science domains such as causal inference
[18], sociology [49] and computational biology [36]. Estimating information theoretic measures from data is a crucial sub-routine in the aforementioned applications and has attracted much interest in the statistics community. In this paper, we study the problem of estimating Shannon differential entropy, which is the basis for estimating other information theoretic measures for continuous random variables.
Suppose we observe $n$ independent identically distributed random vectors $\mathbf{X} = \{X_1, \ldots, X_n\}$ drawn from density function $f$, where $X_i \in \mathbb{R}^d$. We consider the problem of estimating the differential entropy
$$h(f) = -\int f(x)\ln f(x)\,dx \qquad (1)$$
from the empirical observations $\mathbf{X}$. The fundamental limit of estimating the differential entropy is given by the minimax risk
$$\inf_{\hat h}\sup_{f\in\mathcal F_d}\Big(\mathbb{E}_f\big(\hat h(\mathbf{X})-h(f)\big)^2\Big)^{1/2}, \qquad (2)$$
where the infimum is taken over all estimators $\hat h$ that are functions of the empirical data $\mathbf{X}$. Here $\mathcal F_d$ denotes a (nonparametric) class of density functions.
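For concreteness, two standard closed-form values of the differential entropy (1) (well-known facts, recorded here only for illustration) are
$$h\big(\mathrm{Unif}([0,1]^d)\big)=0,\qquad h\big(\mathcal N(0,\sigma^2 I_d)\big)=\frac{d}{2}\ln\big(2\pi e\sigma^2\big),$$
so, for instance, any consistent estimator applied to standard Gaussian samples in $d=2$ dimensions should converge to $\ln(2\pi e)\approx 2.84$.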
The problem of differential entropy estimation has been investigated extensively in the literature. As discussed in [2], there exist two main approaches: one is based on kernel density estimators [30], and the other is based on nearest neighbor methods [56, 53, 52, 11, 3], pioneered by the work of [33].

The problem of differential entropy estimation lies within the general problem of estimating nonparametric functionals. Unlike the parametric counterparts, the problem of estimating nonparametric functionals is challenging even for smooth functionals. Initial efforts have focused on inference of linear, quadratic, and cubic functionals in Gaussian white noise and density models and have laid the foundation for the ensuing research. We do not attempt to survey the extensive literature in this area, but instead refer the interested reader to, e.g., [24, 5, 12, 16, 6, 32, 37, 47, 8, 9, 54] and the references therein. For non-smooth functionals such as entropy, there is some recent progress [38, 26, 27] on designing theoretically minimax optimal estimators, although these estimators typically require knowledge of the smoothness parameters, and the practical performance of these estimators is not yet known.

The $k$-nearest neighbor differential entropy estimator, or Kozachenko–Leonenko (KL) estimator, is computed in the following way. Let $R_{i,k}$ be the distance between $X_i$ and its $k$-nearest neighbor among $\{X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n\}$. Precisely, $R_{i,k}$ equals the $k$-th smallest number in the list $\{\|X_i - X_j\| : j \neq i\}$, here $j \in \{1, \ldots, n\}$. Let $B(x, \rho)$ denote the closed ball centered at $x$ of radius $\rho$ and $\lambda$ be the Lebesgue measure on $\mathbb{R}^d$. The KL differential entropy estimator is defined as
$$\hat h_{n,k}(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^n \ln\left(\frac{n\cdot\lambda\big(B(X_i, R_{i,k})\big)}{e^{\psi(k)}}\right), \qquad (3)$$
where $\psi(x)$ is the digamma function with $\psi(1) = -\gamma$, and $\gamma = 0.5772\ldots$ is the Euler–Mascheroni constant.
There exists an intuitive explanation behind the construction of the KL differential entropy estimator. Writing informally, we have
$$h(f) = \mathbb{E}\big[-\ln f(X)\big] \approx \frac{1}{n}\sum_{i=1}^n \ln\frac{1}{f(X_i)} \approx \frac{1}{n}\sum_{i=1}^n \ln\frac{1}{\hat f(X_i)}, \qquad (4)$$
where the first approximation is based on the law of large numbers, and in the second approximation we have replaced $f(X_i)$ by a nearest neighbor density estimator $\hat f(X_i)$. The nearest neighbor density estimator follows from the “intuition”¹ that
$$\hat f(X_i) = \frac{\mu\big(B(X_i,R_{i,k})\big)}{\lambda\big(B(X_i,R_{i,k})\big)} \approx \frac{k/n}{\lambda\big(B(X_i,R_{i,k})\big)}. \qquad (5)$$
¹ Precisely, we have $\mu(B(X_i,R_{i,k})) \sim \mathrm{B}(k, n-k)$ [4, Chap. 1.2]. A $\mathrm{B}(k, n-k)$ distributed random variable has mean $k/n$.
Here the final additive bias correction term $\ln k - \psi(k)$ follows from a detailed analysis of the bias of the KL estimator, which will become apparent later.
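For concreteness, here is a minimal Python sketch of the estimator in (3) (an illustration only: it uses Euclidean rather than torus distances, assumes `numpy`/`scipy`, and the function name `kl_entropy` is ours, not from the literature).

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x: np.ndarray, k: int = 1) -> float:
    """Fixed-k Kozachenko-Leonenko estimate of h(f) in nats, following (3).

    x : array of shape (n, d) holding the samples X_1, ..., X_n (assumed distinct).
    k : number of nearest neighbors (fixed; does not grow with n).
    """
    n, d = x.shape
    # R_{i,k}: distance from X_i to its k-th nearest neighbor among the other samples.
    # query() returns each point itself at distance 0, hence we ask for k + 1 neighbors.
    r_k = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # log lambda(B(X_i, R_{i,k})) = log V_d + d * log R_{i,k}, with V_d = pi^{d/2} / Gamma(d/2 + 1).
    log_ball_volume = 0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1) + d * np.log(r_k)
    # (1/n) * sum_i log( n * lambda(B(X_i, R_{i,k})) / exp(psi(k)) ).
    return float(np.mean(np.log(n) + log_ball_volume - digamma(k)))

# Sanity check against the closed form h(N(0, I_d)) = (d/2) * log(2 * pi * e).
rng = np.random.default_rng(0)
d = 2
samples = rng.standard_normal((100_000, d))
print(kl_entropy(samples, k=1), 0.5 * d * np.log(2 * np.pi * np.e))
```

On the torus model used in the theorems below, the only change would be to measure distances modulo 1 in each coordinate (scipy's `cKDTree` accepts a `boxsize` argument for exactly this periodic geometry).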
We focus on the regime where $k$ is fixed: in other words, it does not grow as the number of samples $n$ increases. The fixed-$k$ version of the KL estimator is widely applied in practice and enjoys smaller computational complexity; see [52].
There exists an extensive literature on the analysis of the KL differential entropy estimator; we refer to [4] for a recent survey. One of the major difficulties in analyzing the KL estimator is that the nearest neighbor density estimator exhibits a huge bias when the density is small. Indeed, it was shown in [42] that the bias of the nearest neighbor density estimator in fact does not vanish even as $n \to \infty$ and deteriorates as the density gets close to zero. In the literature, a large collection of works assumes that the density is uniformly bounded away from zero [23, 29, 57, 30, 53], while others impose various assumptions quantifying on average how close the density is to zero [25, 40, 56, 14, 20, 52, 11]. In this paper, we focus on removing assumptions on how close the density is to zero.
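To make the concern concrete, here is a small simulation sketch (ours, for illustration; it assumes `numpy`/`scipy`) that compares the $k$-nearest neighbor density estimate from (5), evaluated at a fixed query point, with the true standard normal density at a moderate-density point ($x=0$) and at a low-density point ($x=4$).

```python
import numpy as np
from scipy.stats import norm

def knn_density_1d(samples: np.ndarray, x: float, k: int = 3) -> float:
    """k-NN density estimate in d = 1, as in (5): f_hat(x) = (k/n) / lambda(B(x, R_k(x)))."""
    r_k = np.sort(np.abs(samples - x))[k - 1]   # distance to the k-th nearest sample
    return (k / len(samples)) / (2.0 * r_k)     # lambda(B(x, r)) = 2 r in one dimension

rng = np.random.default_rng(0)
for x in (0.0, 4.0):                            # moderate-density vs. low-density point
    for n in (10_000, 100_000):
        ratios = [knn_density_1d(rng.standard_normal(n), x) / norm.pdf(x) for _ in range(200)]
        print(f"x={x}, n={n}: mean ratio f_hat / f = {np.mean(ratios):.2f}")
```

At the low-density point the nearest neighbor ball has to reach back toward the bulk of the distribution, so the local-uniformity intuition behind (5) breaks down and the ratio $\hat f/f$ can stay far from one even for large $n$; this is precisely the regime that boundedness-away-from-zero assumptions are designed to exclude.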
1.1 Main Contribution
Let $\mathcal{H}_d^s(L;[0,1]^d)$ be the Hölder ball in the unit cube (torus) $[0,1]^d$ (formally defined later in Definition 2 in Appendix A), where $s > 0$ is the Hölder smoothness parameter. Then, the worst-case risk of the fixed $k$-nearest neighbor differential entropy estimator over $\mathcal{H}_d^s(L;[0,1]^d)$ is controlled by the following theorem.
Theorem 1
Let $\mathbf{X} = \{X_1, \ldots, X_n\}$ be i.i.d. samples from density function $f \in \mathcal{H}_d^s(L;[0,1]^d)$. Then, for $0 < s \le 2$, the fixed $k$-nearest neighbor KL differential entropy estimator $\hat h_{n,k}(\mathbf{X})$ in (3) satisfies
$$\sup_{f\in\mathcal{H}_d^s(L;[0,1]^d)}\Big(\mathbb{E}_f\big(\hat h_{n,k}(\mathbf{X})-h(f)\big)^2\Big)^{1/2}\le C\Big(n^{-\frac{s}{s+d}}\ln n+\frac{\ln n}{\sqrt n}\Big), \qquad (6)$$
where $C$ is a constant that depends only on $s$, $L$, $k$ and $d$.
The KL estimator is in fact nearly minimax up to logarithmic factors, as shown in the following result from [26].
Theorem 2
[26] Let $\mathbf{X} = \{X_1, \ldots, X_n\}$ be i.i.d. samples from density function $f \in \mathcal{H}_d^s(L;[0,1]^d)$. Then, there exists a constant $L_0$ depending on $s, d$ only such that for all $L \ge L_0$,
$$\inf_{\hat h}\sup_{f\in\mathcal{H}_d^s(L;[0,1]^d)}\Big(\mathbb{E}_f\big(\hat h(\mathbf{X})-h(f)\big)^2\Big)^{1/2}\ge c\Big((n\ln n)^{-\frac{s}{s+d}}+n^{-\frac12}\Big), \qquad (7)$$
where $c > 0$ is a constant that depends only on $s$, $L$ and $d$.
Remark 1
Theorems 1 and 2 imply that for any fixed $k$, the KL estimator achieves the minimax rates up to logarithmic factors without knowing $s$ for all $s \in (0, 2]$, which implies that it is near minimax rate-optimal (within logarithmic factors) when the dimension $d \ge 2$. We cannot expect the vanilla version of the KL estimator to adapt to higher orders of smoothness, since the nearest neighbor density estimator can be viewed as a variable-width kernel density estimator with the box kernel, and it is well known in the literature (see, e.g., [55, Chapter 1]) that any positive kernel cannot exploit smoothness $s > 2$. We refer to [26] for a more detailed discussion of this difficulty and potential solutions. The jackknife idea, such as the one presented in [11, 3], might be useful for adapting to $s > 2$.
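To see when the nonparametric term dominates in the bounds above, note the elementary comparison (constants and logarithmic factors ignored)
$$n^{-\frac{s}{s+d}}\ \ge\ n^{-\frac12}\iff\frac{s}{s+d}\le\frac12\iff s\le d,$$
so for every $s\in(0,2]$ and $d\ge2$ the term $n^{-s/(s+d)}$ is the slower, dominant one, while for $d=1$ and $s>1$ the parametric term $n^{-1/2}$ takes over.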
The significance of our work is two-fold:
-
We obtain the first uniform upper bound on the performance of the fixed $k$-nearest neighbor KL differential entropy estimator over Hölder balls without assuming how close the density could be to zero. We emphasize that assuming conditions of this type, such as the density being bounded away from zero, could make the problem significantly easier. For example, if the density $f$ is assumed to satisfy $f(x) \ge c$ for some constant $c > 0$, then the differential entropy becomes a smooth functional and, consequently, the general technique for estimating smooth nonparametric functionals [51, 50, 43] can be directly applied here to achieve the minimax rates. The main technical tools that enable us to remove the conditions on how close the density could be to zero are the Besicovitch covering lemma (Lemma 4) and the generalized Hardy–Littlewood maximal inequality.
-
We show that, for any fixed $k$, the $k$-nearest neighbor KL entropy estimator nearly achieves the minimax rates without knowing the smoothness parameter $s$. In the functional estimation literature, designing estimators that can be theoretically proved to adapt to unknown levels of smoothness is usually achieved using the Lepski method [39, 22, 45, 44, 27], which is not known to perform well in practice in general. On the other hand, a simple plug-in approach can achieve the rate $n^{-s/(s+d)}$, but only when $s$ is known [26]. The KL estimator is well known to exhibit excellent empirical performance, but existing theory has not yet demonstrated its near-“optimality” when the smoothness parameter $s$ is not known. Recent works [3, 52, 11] analyzed the performance of the KL estimator under various assumptions on how close the density could be to zero, with no matching lower bound up to logarithmic factors in general. Our work makes a step towards closing this gap and provides a theoretical explanation for the wide usage of the KL estimator in practice.
1.2 Notations
For positive sequences $\{a_n\}, \{b_n\}$, we use the notation $a_n \lesssim_{\alpha} b_n$ to denote that there exists a universal constant $C$ that only depends on $\alpha$ such that $\sup_n \frac{a_n}{b_n} \le C$, and $a_n \gtrsim_{\alpha} b_n$ is equivalent to $b_n \lesssim_{\alpha} a_n$. Notation $a_n \asymp_{\alpha} b_n$ is equivalent to $a_n \lesssim_{\alpha} b_n$ and $b_n \lesssim_{\alpha} a_n$. We write $a_n \lesssim b_n$ if the constant is universal and does not depend on any parameters. Notation $a_n \gg b_n$ means that $\lim_{n\to\infty} \frac{b_n}{a_n} = 0$, and $a_n \ll b_n$ is equivalent to $b_n \gg a_n$. We write $a \wedge b = \min\{a,b\}$ and $a \vee b = \max\{a,b\}$.
2 Proof of Theorem 1
In this section, we will prove that
$$\Big(\mathbb{E}_f\big(\hat h_{n,k}(\mathbf{X})-h(f)\big)^2\Big)^{1/2}\lesssim_{s,L,k,d}\ n^{-\frac{s}{s+d}}\ln n+\frac{\ln n}{\sqrt n} \qquad (8)$$
for any $f \in \mathcal{H}_d^s(L;[0,1]^d)$ and $s \in (0,2]$. The proof consists of two parts: (i) an upper bound on the bias of the form $\big|\mathbb{E}_f[\hat h_{n,k}(\mathbf{X})]-h(f)\big|\lesssim_{s,L,k,d} n^{-\frac{s}{s+d}}\ln n$; (ii) an upper bound on the variance of the form $\mathrm{Var}_f\big(\hat h_{n,k}(\mathbf{X})\big)\lesssim_{s,L,k,d}\frac{(\ln n)^2}{n}$. Below we show the bias proof and relegate the variance proof to Appendix B.

First, we introduce the following notation
$$f(x;r) \triangleq \frac{\mu(B(x,r))}{\lambda(B(x,r))} = \frac{\mu(B(x,r))}{V_d r^d}. \qquad (9)$$
Here $\mu$ is the probability measure specified by density function $f$ on the torus, $\lambda$ is the Lebesgue measure on $\mathbb{R}^d$, and $V_d$ is the Lebesgue measure of the unit ball in $d$-dimensional Euclidean space. Hence $f(x;r)$ is the average density of a neighborhood near $x$. We first state two main lemmas about $f(x;r)$ which will be used later in the proof.

Lemma 1
If $f \in \mathcal{H}_d^s(L;[0,1]^d)$ for some $0 < s \le 2$, then for any $x \in [0,1]^d$ and any $r > 0$, we have
$$\big|f(x;r)-f(x)\big|\lesssim_{s,d} L\,r^{s}. \qquad (10)$$
Lemma 2
If $f \in \mathcal{H}_d^s(L;[0,1]^d)$ for some $0 < s \le 2$ and $f(x) \ge 0$ for all $x \in [0,1]^d$, then for any $x \in [0,1]^d$ and any $r > 0$, we have
(11)
Furthermore, .
We relegate the proofs of Lemma 1 and Lemma 2 to Appendix C. Now we investigate the bias of $\hat h_{n,k}(\mathbf{X})$. The following argument reduces the bias analysis of $\hat h_{n,k}(\mathbf{X})$ to a function-analytic problem. For notational simplicity, we introduce a new random variable $X \sim f$ independent of $\mathbf{X} = \{X_1,\ldots,X_n\}$ and study the bias $\mathbb{E}_f[\hat h_{n,k}(\mathbf{X})]-h(f)$ through $X$. For every $x \in [0,1]^d$, denote by $R_k(x)$ the $k$-nearest neighbor distance from $x$ to $\{X_1,\ldots,X_{n-1}\}$ under the distance $d_T$, i.e., the $k$-nearest neighbor distance on the torus. Then,
$$
\begin{aligned}
\mathbb{E}_f[\hat h_{n,k}(\mathbf{X})]-h(f)
&=\mathbb{E}\big[\ln\big(n\lambda(B(X_1,R_{1,k}))\big)\big]-\psi(k)+\mathbb{E}[\ln f(X_1)] \qquad (12)\\
&=\mathbb{E}\big[\ln\big(n\lambda(B(X,R_k(X)))\big)\big]-\psi(k)+\mathbb{E}[\ln f(X)] \qquad (13)\\
&=\mathbb{E}\Big[\ln\frac{\lambda(B(X,R_k(X)))\,f(X)}{\mu(B(X,R_k(X)))}\Big]+\Big(\mathbb{E}\big[\ln\big(n\mu(B(X,R_k(X)))\big)\big]-\psi(k)\Big) \qquad (14)\\
&=\mathbb{E}\Big[\ln\frac{f(X)}{f(X;R_k(X))}\Big]+\Big(\mathbb{E}\big[\ln\big(n\mu(B(X,R_k(X)))\big)\big]-\psi(k)\Big), \qquad (15)
\end{aligned}
$$
where (13) uses that $(X_1,R_{1,k})$ and $(X,R_k(X))$ have the same distribution, and (15) uses the definition (9) of $f(x;r)$.
We first show that the second term can be universally controlled regardless of the smoothness of $f$. Indeed, the random variable $\mu(B(X,R_k(X)))$ follows the $\mathrm{B}(k,n-k)$ distribution [4, Chap. 1.2], and it was shown in [4, Theorem 7.2] that there exists a universal constant $C_1 > 0$ such that
$$\Big|\,\mathbb{E}\big[\ln\big(n\mu(B(X,R_k(X)))\big)\big]-\psi(k)\,\Big|\le\frac{C_1}{n}. \qquad (16)$$
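For intuition, here is a short calculation consistent with the Beta distribution fact above (a sketch under the assumption that $\mu(B(X,R_k(X)))\sim\mathrm{B}(k,n-k)$ holds exactly; the proof itself only uses the cited bound from [4, Theorem 7.2]). Recalling that $\mathbb{E}[\ln Z]=\psi(a)-\psi(a+b)$ for $Z\sim\mathrm{B}(a,b)$, we get
$$\mathbb{E}\big[\ln\big(n\,\mu(B(X,R_k(X)))\big)\big]-\psi(k)=\ln n+\psi(k)-\psi(n)-\psi(k)=\ln n-\psi(n),$$
and the standard digamma bounds $\ln n-\tfrac1n\le\psi(n)\le\ln n$ give $0\le\ln n-\psi(n)\le\tfrac1n$, which is a bound of exactly the form (16).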
Hence, it suffices to show that for $f \in \mathcal{H}_d^s(L;[0,1]^d)$, $0 < s \le 2$,
$$\Big|\mathbb{E}\Big[\ln\frac{f(X)}{f(X;R_k(X))}\Big]\Big|\lesssim_{s,L,k,d}\ n^{-\frac{s}{s+d}}\ln n. \qquad (17)$$
We split our analysis into two parts: writing $y_+ \triangleq \max\{y,0\}$, Section 2.1 shows that $\mathbb{E}\big[\big(\ln\frac{f(X;R_k(X))}{f(X)}\big)_+\big]\lesssim_{s,L,k,d} n^{-\frac{s}{s+d}}\ln n$ and Section 2.2 shows that $\mathbb{E}\big[\big(\ln\frac{f(X)}{f(X;R_k(X))}\big)_+\big]\lesssim_{s,L,k,d} n^{-\frac{s}{s+d}}\ln n$, which completes the proof of (17).
2.1 Upper bound on $\mathbb{E}\big[\big(\ln\frac{f(X;R_k(X))}{f(X)}\big)_+\big]$
By the fact that $\ln y \le y-1$ for any $y > 0$, we have
$$\mathbb{E}\Big[\Big(\ln\frac{f(X;R_k(X))}{f(X)}\Big)_+\Big]\le\mathbb{E}\Big[\frac{\big(f(X;R_k(X))-f(X)\big)_+}{f(X)}\Big] \qquad (18)$$
$$\le\int_{[0,1]^d}\mathbb{E}\Big[\big(f(x;R_k(x))-f(x)\big)_+\Big]\,dx. \qquad (19)$$
Here the expectation is taken with respect to the randomness in $\{X_1,\ldots,X_{n-1}\}$. Define the function
$$r(x,u)\triangleq\inf\{r\ge 0:\mu(B(x,r))\ge u\}; \qquad (20)$$
intuitively, $r(x,u)$ means the distance such that the probability mass within $B(x,r(x,u))$ is $u$. Then for any $x\in[0,1]^d$, we can split $\mathbb{E}\big[\big(f(x;R_k(x))-f(x)\big)_+\big]$ into three terms as
(21)–(24)
Now we handle the three terms separately. Our goal is to show that, for every $x\in[0,1]^d$, each of the three terms is bounded by $n^{-\frac{s}{s+d}}$ up to logarithmic factors and constants depending only on $s,L,k,d$; then, taking the integral with respect to $x$ leads to the desired bound.
-
Term: whenever satisfies that , by definition of , we have , which implies that
(27)
It follows from Lemma 2 that in this case
(28)–(29)
Hence, we have
(30)–(32)
-
Term: we have
(33)
For any such that , we have
(34)
and by Lemma 2,
(35)–(36)
Hence,
(38)–(39)
where in the last step we have used the fact that since . Finally, we have
(40)–(41)
Note that , and if , we have
(42)
Notice that . Hence, we have
(43)–(44)
2.2 Upper bound on $\mathbb{E}\big[\big(\ln\frac{f(X)}{f(X;R_k(X))}\big)_+\big]$
By splitting the term into two parts, we have
(45)–(48)
here we denote for simplicity of notation. For the term , we have
(49)–(52)
In the proof of the upper bound of , we have shown that for any . Similarly to the proof of the upper bound of , we have for every . Therefore, we have
(53)
Now we consider . We conjecture that in this case, but we were not able to prove it. Below we prove that . Define the function
(54)
Since , we have . Denote for any ; therefore, we have that
(55)–(59)
where the last inequality uses the fact that for all . As for , since , and for , we have
(60)–(64)
where in the last inequality we used the fact that for any . Hence,
(65)
Now we introduce the following lemma, which is proved in Appendix C.
Lemma 3
Let $\mu, \nu$ be two Borel measures that are finite on the bounded Borel sets of $\mathbb{R}^d$. Then, for all $t > 0$ and any Borel set $A \subseteq \mathbb{R}^d$,
(66)
Here $C_d$ is a constant that depends only on the dimension $d$, and
(67)
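For orientation (the precise statement and constant in (66)–(67) may differ), a standard Besicovitch-based maximal inequality of this type reads: for Borel measures $\mu,\nu$ that are finite on the bounded Borel sets of $\mathbb{R}^d$, any Borel set $A\subseteq\mathbb{R}^d$ and any $t>0$,
$$\nu\Big(\Big\{x\in A:\ \sup_{r>0}\frac{\mu(B(x,r))}{\nu(B(x,r))}>t\Big\}\Big)\ \le\ \frac{C_d}{t}\,\mu\big(\mathbb{R}^d\big),$$
where $C_d$ depends only on $d$: one covers the super-level set by balls on which $\mu(B)>t\,\nu(B)$ and extracts a subfamily of bounded overlap via the Besicovitch covering lemma (Lemma 4).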
Applying the second part of Lemma 3 with one of the measures being the Lebesgue measure $\lambda$ and the other being the probability measure $\mu$ specified by $f$ on the torus, we can view the function as
(68)
Taking , then , so we know that