# Analysis of KNN Information Estimators for Smooth Distributions

The KSG mutual information estimator, which is based on the distance from each sample to its k-th nearest neighbor, is widely used to estimate mutual information between two continuous random variables. Existing work has analyzed the convergence rate of this estimator for random variables whose densities are bounded away from zero on their support. In practice, however, the KSG estimator also performs well for a much broader class of distributions, including not only those with bounded support and densities bounded away from zero, but also those with bounded support and densities approaching zero, and those with unbounded support. In this paper, we analyze the convergence rate of the error of the KSG estimator for smooth distributions whose densities can have either bounded or unbounded support. As the KSG mutual information estimator can be viewed as an adaptive recombination of KL entropy estimators, our analysis also provides a convergence analysis of the KL entropy estimator for a broad class of distributions.


## I Introduction

Information theoretic quantities, such as Shannon entropy and mutual information, have a broad range of applications in statistics and machine learning, such as clustering [2, 3, 4, 5, 6], test of normality [7], etc. These quantities are determined by the distributions of random variables, which are usually unknown in real applications. Hence, the problem of nonparametric estimation of entropy and mutual information using samples drawn from an unknown distribution has attracted significant research interest [8, 9, 10, 11, 12, 13, 14, 15].

Depending on whether the underlying distribution is discrete or continuous, the estimation methods are crucially different. In the discrete setting, there exist efficient methods that attain rate optimal estimation of functionals including entropy and mutual information in the minimax sense [16, 17, 10]. For continuous distributions, many interesting methods have been proposed. Roughly speaking, these methods can be categorized into three different types.

The first type of method seeks to convert the continuous distribution to a discrete one by assigning data points to bins, and then estimates the entropy or mutual information of the resulting discrete distribution. An improvement of this type was proposed in [12], which uses adaptive bin sizes at different locations. These methods usually require extensive parameter tuning, especially when the dimensions of the random variables are high. Moreover, numerical experiments show that their accuracy is not as good as that of other methods [18, 19].

The second type of method tries to learn the underlying distribution first, and then calculates the entropy or mutual information functionals [20, 14, 15]. The probability density function (pdf) is typically estimated using kernel methods. It has been shown that local linear or local Gaussian approximation can improve the performance [14, 15]. However, the computational cost of these methods is usually high, since we need to calculate the pdf at a large number of points in order to get a good estimate of the information theoretic functionals. These methods also involve non-trivial parameter tuning when the dimensions of the random variables are high.

The third type, which is the focus of this paper, estimates entropy and mutual information directly based on the $k$-th nearest neighbor (kNN) distances of each sample. A typical example is the Kozachenko-Leonenko (KL) differential entropy estimator [8]. Since the mutual information between two random variables is the sum of the entropies of the two marginal distributions minus the joint entropy, the KL estimator can also be used to estimate mutual information. However, the KL estimator is then used three times, and the errors may not cancel out. Based on the KL estimator, Kraskov, Stögbauer and Grassberger [11] proposed a new mutual information estimator, called the KSG estimator, which can be viewed as an adaptive recombination of three KL estimators. [11] shows that the empirical performance of the KSG estimator is better than estimating the marginal and joint entropies separately. Compared with the other types of methods, the KL entropy estimator and the KSG mutual information estimator are computationally fast and do not require much parameter tuning. In addition, numerical experiments show that these kNN methods can achieve the best empirical performance for a large variety of distributions [19, 21, 18]. As a result, the KL and KSG estimators are commonly used to estimate entropy and mutual information.

Despite their widespread use, the theoretical properties of the KL and KSG estimators, especially the latter, still need further exploration. In practice, the KL and KSG estimators perform well for a broad class of distributions, but the theoretical convergence rates of these estimators have only been established under restrictive assumptions. For the KL entropy estimator, one of the major difficulties is that the tail effect is hard to control if the pdf approaches zero. As a result, most of the previous literature focuses only on distributions that either have bounded support or satisfy some strict tail assumptions [21, 22, 23, 24]. An exception is [25], which analyzed the convergence rate of the KL estimator for one dimensional random variables with unbounded support. Their analysis was restricted to a truncated 1-nearest-neighbor estimator.

For the KSG mutual information estimator, the analysis is even more challenging, as KSG is an adaptive recombination of KL estimators, and this adaptivity makes the problem much more difficult. [21] made significant progress in understanding the properties of the KSG estimator. In particular, [21] shows that the estimator is consistent under some mild assumptions (Assumption 2 of [21]). Furthermore, [21] provides the convergence rate of an upper bound of the bias and variance under some more restrictive assumptions (Assumption 3 of [21]). However, although not stated explicitly in [21], one can show that, for a pdf satisfying Assumption 3 of [21], the support set must be bounded. Moreover, its joint, marginal and conditional pdfs must all be bounded both from above and away from zero on their supports. As a result, the analysis of [21] does not hold for some commonly seen pdfs, e.g. those with unbounded support such as the Gaussian. Therefore, it is important to extend the analysis of kNN information estimators to other types of distributions. This understanding may also help us design better estimators. For example, we may come up with a modified adaptive kNN estimator that uses a smaller $k$ in the tail region, in order to avoid large bias.

In this paper, we analyze kNN information estimators for variables with both bounded and unbounded support. In particular, we make the following contributions:

Firstly, we analyze the convergence rate of the KL entropy estimator. Similar to previous work on the KL estimator [25, 21], we truncate the kNN distances to ensure robustness. We then provide a rule for selecting the truncation parameter, and give an upper bound on the bias and variance of the truncated KL estimator. Our result improves on [25] in the following aspects: 1) Using a different truncation threshold, we achieve a better convergence rate of the bias for one dimensional random variables; 2) We weaken the assumptions. In particular, our analysis places no restriction on the boundedness of the support of the distribution, and holds for arbitrary fixed $k$ and arbitrary dimensionality. Some regularity conditions are also relaxed. As a result, our analysis holds for a broad class of distributions. Since we use weakened assumptions, some techniques in [25] cannot be used directly to analyze the scenario addressed in this paper. Hence, we use a new approach to derive the bias and variance of the KL estimator.

Secondly, building on the analysis of the KL estimator, we derive the convergence rate of an upper bound on the bias and variance of the KSG mutual information estimator for smooth distributions that satisfy a weak tail assumption. Our results hold mainly for two types of distributions. The first type includes distributions that have unbounded support, such as Gaussian distributions. The second type includes distributions that have bounded support but whose density functions approach zero. This is different from the case analyzed in [21], which focuses on distributions with bounded support and density bounded away from zero. To the best of our knowledge, this is the first attempt to analyze the convergence rate of the KSG estimator for these two types of distributions. Our technique for bounding the bias is significantly different from that of [21]. In [21], the distribution is assumed to be smooth almost everywhere, but to have a non-smooth boundary, which is the main cause of the bias. To deal with the boundary effect, the support of the density was divided into an interior region and a boundary region, and the bias in these two regions was bounded separately; it turns out that the boundary bias is dominant. On the contrary, in our analysis, by requiring that the density is smooth, we avoid the boundary effect, but we allow the density to be arbitrarily close to zero on its support. In the regions where the density is low, the kNN distances are large, and hence a larger local bias occurs. To deal with this situation, we divide the whole support of the density into a central region, on which the density is relatively high, and a tail region, on which the density is lower. We then bound the bias in these two regions separately, and let the threshold dividing the central and tail regions decay with the sample size at a proper speed, so that the bias in these two regions decays at approximately the same rate. The overall convergence rate can then be determined.

The remainder of the paper is organized as follows. In Section II, we provide our main result of the analysis of KL entropy estimator, and then compare with [25]. In Section III, we analyze KSG mutual information estimator, and then compare with [21]. In these two sections, we show the basic ideas of the proofs of our main results and relegate the detailed proofs to Appendices. In Section IV, we provide numerical examples to illustrate the analytical results. Finally, in Section V, we offer concluding remarks.

## II KL Entropy Estimator

As KSG mutual information estimator depends on KL entropy estimator, in this section, we first derive convergence results for KL estimator.

Consider a continuous random variable $X$ with unknown pdf $f$. The differential entropy of $X$ is

$$
h(X)=-\int f(x)\log f(x)\,dx.
$$

Given $N$ i.i.d. samples drawn from this pdf, the goal of the KL estimator is to give a nonparametric estimate of $h(X)$. The expression of the KL estimator is given by [8]:

$$
\hat{h}(X)=-\psi(k)+\psi(N)+\log c_{d_x}+\frac{d_x}{N}\sum_{i=1}^{N}\log \epsilon(i), \qquad (1)
$$

in which $\psi$ is the digamma function, defined as $\psi(t)=\Gamma'(t)/\Gamma(t)$ with

$$
\Gamma(t)=\int_0^\infty u^{t-1}e^{-u}\,du,
$$

and $\epsilon(i)$ is the distance from the $i$-th sample to its $k$-th nearest neighbor. The distance between two points $x$ and $x'$ is defined as $\lVert x-x'\rVert$, in which $\lVert\cdot\rVert$ can be any norm; the $\ell_2$ and $\ell_\infty$ norms are commonly used. $c_{d_x}$ is the volume of the corresponding unit norm ball.
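To make (1) concrete, here is a minimal Python sketch of the estimator (the helper name `kl_entropy` is ours, the $\ell_2$ norm is assumed, and SciPy's `cKDTree` is used for the neighbor search):

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.spatial import cKDTree

def kl_entropy(samples, k=1):
    """KL entropy estimate (1) from an (N, d) sample array, using the L2 norm."""
    n, d = samples.shape
    # distance from each sample to its k-th nearest neighbor;
    # column 0 of the query result is the point itself, so take column k
    eps = cKDTree(samples).query(samples, k=k + 1)[0][:, k]
    # log-volume of the d-dimensional unit L2 ball
    log_c_d = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return -digamma(k) + digamma(n) + log_c_d + d * np.mean(np.log(eps))
```

For standard Gaussian samples in one dimension, the output approaches the true entropy $\frac{1}{2}\log(2\pi e)\approx 1.419$ as $N$ grows.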

The original KL estimator may not be robust: a few outliers in the sample can lead to very large kNN distances, which may deteriorate the performance. To address this problem, a truncated estimator was proposed [21, 25]:

$$
\hat{h}(X)=-\psi(k)+\psi(N)+\log c_{d_x}+\frac{d_x}{N}\sum_{i=1}^{N}\log \rho(i), \qquad (2)
$$

in which

$$
\rho(i)=\min\{\epsilon(i),a_N\},
$$

with $a_N$ being a truncation radius that depends on the sample size $N$. A particular choice of $a_N$ was used in [25]. In this paper, in order to achieve a better convergence rate, we propose to use a different truncation threshold:

$$
a_N=AN^{-\beta}, \qquad (3)
$$

in which $A$ is a constant and

$$
\beta=\begin{cases}\dfrac{1}{3} & \text{if } d_x=1,\\[4pt] \dfrac{1}{2d_x} & \text{if } d_x\ge 2.\end{cases} \qquad (4)
$$
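The truncation rule (3)-(4) is simple to apply in code; a minimal sketch (the helper name and the default $A=1$ are illustrative):

```python
import numpy as np

def truncated_log_dist(eps, n, d, A=1.0):
    """Truncated log-distances log rho(i) = log min(eps(i), a_N), with a_N = A * N^(-beta)."""
    beta = 1.0 / 3.0 if d == 1 else 1.0 / (2.0 * d)  # the choice (4)
    a_n = A * n ** (-beta)
    return np.log(np.minimum(eps, a_n))
```

Replacing the $\log\epsilon(i)$ terms of (1) with these values yields the truncated estimator (2).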

With this choice of truncation threshold, the following theorem gives an upper bound on the bias and variance of the truncated KL entropy estimator defined in (2).

###### Theorem 1.

Suppose that the pdf $f$ satisfies the following assumptions:
(a) The Hessian of $f$ is bounded everywhere, i.e. there exists a constant $M$ such that

$$
\left\lVert\nabla^2 f(x)\right\rVert_{op}\le M,
$$

in which $\lVert\cdot\rVert_{op}$ denotes the operator norm;
(b) There exists a constant $C$ such that

$$
\int f(x)\exp(-bf(x))\,dx\le Cb^{-1}
$$

for any $b>0$.

Then for sufficiently large $N$, the bias of the truncated KL estimator is bounded by:

$$
\left|\mathbb{E}[\hat{h}(X)]-h(X)\right|=\begin{cases}O\!\left(N^{-2/3}\log N\right) & \text{if } d_x=1,\\ O\!\left(N^{-1/2}\log N\right) & \text{if } d_x=2,\\ O\!\left(N^{-1/d_x}\right) & \text{if } d_x\ge 3.\end{cases} \qquad (5)
$$
###### Proof.

As was already discussed in [11], the correction term $-\psi(k)$ in (2) corrects the bias caused by the assumption that the average pdf in the ball $B(x,\epsilon)$ equals the pdf at its center, i.e. $P(B(x,\epsilon))=f(x)c_{d_x}\epsilon^{d_x}$, which does not hold in general. Hence, the bias of the original KL estimator (1) is caused by the local non-uniformity of the density. If $\epsilon$ is large, the average pdf in $B(x,\epsilon)$ can significantly deviate from $f(x)$. By substituting $\epsilon$ with $\rho$, which is upper bounded by $a_N$, we can control the bias caused by large kNN distances. This type of bias is lower if we use a small $a_N$. However, the truncation also induces additional bias, which can be serious if $a_N$ is too small. Therefore we need to select $a_N$ carefully to obtain a tradeoff between these two bias terms.

In our proof, we divide the support of $f$ into a central region (called $S_1$, which has a relatively high density) and a tail region (called $S_2$, which has a relatively low density), and then decompose the bias of the truncated KL estimator (2) into three parts:

$$
\begin{aligned}
\mathbb{E}[\hat{h}(X)]-h(X) &= -\psi(k)+\psi(N)+\log c_{d_x}+\frac{d_x}{N}\sum_{i=1}^{N}\mathbb{E}[\log\rho(i)]-h(X)\\
&\overset{(a)}{=} -\mathbb{E}[\log P(B(X,\epsilon))]+\mathbb{E}\left[\log\left(f(X)c_{d_x}\rho^{d_x}\right)\right]\\
&= -\mathbb{E}\left[\log\frac{P(B(X,\epsilon))}{P(B(X,\rho))}\mathbf{1}(X\in S_1)\right]-\mathbb{E}\left[\log\frac{P(B(X,\rho))}{f(X)c_{d_x}\rho^{d_x}}\mathbf{1}(X\in S_1)\right]\\
&\quad\; -\mathbb{E}\left[\log\frac{P(B(X,\epsilon))}{f(X)c_{d_x}\rho^{d_x}}\mathbf{1}(X\in S_2)\right], \qquad (6)
\end{aligned}
$$

in which (a) comes from order statistics [26, 23] and has been shown in [11]. All three terms converge to zero. The first term in (6) is the additional bias caused by truncation in the central region. Note that $\epsilon$ and $\rho$ differ only when $\epsilon>a_N$, thus if $a_N$ does not decay to zero too fast, then $\epsilon=\rho$ holds with high probability, and the first term converges to zero. The second term is the bias caused by local non-uniformity of the pdf in the central region. Recall that $a_N=AN^{-\beta}$, so $\rho$ converges to zero, and the local non-uniformity gradually disappears as $N$ increases. The last term is the bias in the tail region. We let the tail region shrink and the central region expand as $N$ increases, so that the third term also converges to zero. These three terms are bounded separately, and the results depend on the selection of the truncation parameter $a_N$. The overall convergence rate is determined by the slowest of these three terms. In our proof, we carefully select $\beta$ to optimize the overall rate.

For detailed proof, please refer to Appendix A. ∎

We now compare our result with that of [25]. The assumptions in Theorem 1 are similar to but less strict than assumptions (A0)-(A2) of [25]. In particular, (A0) of [25] requires that

$$
\int f(x)\left|\log f(x)\right|dx<\infty,
$$

and (A1) of [25] requires that the pdf is strictly positive everywhere. We remove these assumptions, while keeping the other assumptions of [25] the same. [25] gave a bound on the bias of the truncated KL estimator for the case $k=1$ and $d_x=1$. Under the less strict assumptions of Theorem 1, we improve the convergence rate of the bias for the case $d_x=1$, and further derive a bound for general fixed $k$ and $d_x$.

The next theorem gives an upper bound on the variance of $\hat{h}(X)$:

###### Theorem 2.

Assume the following conditions:
(c) The pdf $f$ is continuous almost everywhere;
(d) For all $N\ge 2$, with $x'$ ranging over the samples other than $x$,

$$
\int f(x)\,\mathbb{E}\left[\log^2\inf_{x'}\lVert x'-x\rVert\right]dx<\infty
$$

and

$$
\int f(x)\,\mathbb{E}\left[\log^2\sup_{x'}\lVert x'-x\rVert\right]dx<\infty.
$$

Under assumptions (c) and (d), for fixed $k$, the variance of the truncated KL estimator is bounded by:

$$
\mathrm{Var}\left[\hat{h}(X)\right]=O\!\left(\frac{1}{N}\right). \qquad (9)
$$
###### Proof.

Our proof uses some techniques from [23], which proved the convergence of the variance of the KL estimator with $k=1$ for one dimensional distributions with bounded support. We generalize the result to arbitrary fixed $k$ and $d_x$, and to support sets that can be either bounded or unbounded, as long as the distribution satisfies assumptions (c) and (d) in Theorem 2. However, since our assumptions are weaker, we need some additional techniques to ensure that the derivation is valid.

For detailed proof, please see Appendix B. ∎

Our assumptions (c) and (d) are also weaker than the corresponding assumptions (B1) and (B2) in [25]. To show this, we can find a sufficient condition for (c) and (d): both conditions are satisfied if S1) the pdf is Lipschitz or Hölder continuous, and S2) a corresponding second moment condition on the log-distances holds. (B1) in [25] required that the pdf is Lipschitz, and (B2) required a strictly stronger version of condition S2). Hence our assumptions for the bias and variance of the KL estimator are both weaker than those in [25].

(5) and (9) show that when $d_x\le 2$, the mean square error of the KL estimator decays approximately as $1/N$, which is the parametric rate of convergence.

## III KSG Mutual Information Estimator

In this section, we focus on the KSG mutual information estimator. Consider two continuous random variables $X$ and $Y$ with unknown joint pdf $f(x,y)$. The mutual information between $X$ and $Y$ is

$$
I(X;Y)=h(X)+h(Y)-h(X,Y). \qquad (10)
$$

Define the joint variable $Z=(X,Y)$ with $d_z=d_x+d_y$, and define the metric in the joint space as

$$
d(z,z')=\max\left\{\lVert x-x'\rVert,\;\lVert y-y'\rVert\right\}. \qquad (11)
$$

The KSG estimator proposed in [11] can be expressed as

$$
\hat{I}(X;Y)=\psi(N)+\psi(k)-\frac{1}{N}\sum_{i=1}^{N}\psi(n_x(i)+1)-\frac{1}{N}\sum_{i=1}^{N}\psi(n_y(i)+1), \qquad (12)
$$

with

$$
n_x(i)=\sum_{j=1}^{N}\mathbf{1}\left(\lVert x(j)-x(i)\rVert<\epsilon(i)\right),\qquad n_y(i)=\sum_{j=1}^{N}\mathbf{1}\left(\lVert y(j)-y(i)\rVert<\epsilon(i)\right),
$$

in which $\epsilon(i)$ is the distance from $z(i)$ to its $k$-th nearest neighbor using the distance metric defined in (11).
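For concreteness, the estimator (12) can be sketched as follows (a minimal Python sketch; the function name is ours, and the strict inequality in $n_x(i)$, $n_y(i)$ is approximated by slightly shrinking the query radius):

```python
import numpy as np
from scipy.special import digamma
from scipy.spatial import cKDTree

def ksg_mi(x, y, k=3):
    """KSG mutual information estimate (12); x, y are (N, d_x) and (N, d_y) arrays."""
    n = x.shape[0]
    z = np.hstack([x, y])
    # eps(i): distance to the k-th neighbor of z(i) under the max-norm (11)
    eps = cKDTree(z).query(z, k=k + 1, p=np.inf)[0][:, k]
    tx, ty = cKDTree(x), cKDTree(y)
    # n_x(i), n_y(i): marginal neighbors strictly within eps(i), excluding the point itself
    nx = [len(tx.query_ball_point(x[i], eps[i] * (1 - 1e-10), p=np.inf)) - 1 for i in range(n)]
    ny = [len(ty.query_ball_point(y[i], eps[i] * (1 - 1e-10), p=np.inf)) - 1 for i in range(n)]
    return digamma(n) + digamma(k) - np.mean(digamma(np.array(nx) + 1) + digamma(np.array(ny) + 1))
```

For jointly Gaussian $(X,Y)$ with correlation $\rho$, the true value is $-\frac{1}{2}\log(1-\rho^2)$, which the sketch approaches for moderate sample sizes.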

We make the following assumptions:

###### Assumption 1.

There exist finite constants $C_a$, $C_b$, $C_c$, $C'_c$, $C_d$ and $C_e$, such that
(a) $f(x,y)\le C_a$ almost everywhere;
(b) The two marginal pdfs are both bounded, i.e. $f(x)\le C_b$ and $f(y)\le C_b$;
(c) The joint and marginal densities satisfy

$$
\begin{aligned}
\int f(x,y)\exp(-bf(x,y))\,dx\,dy &\le C_c/b, \qquad (13)\\
\int f(x)\exp(-bf(x))\,dx &\le C'_c/b,\\
\int f(y)\exp(-bf(y))\,dy &\le C'_c/b
\end{aligned}
$$

for all $b>0$;

(d) The Hessians of the joint pdf and the marginal pdfs are bounded everywhere, i.e. $\lVert\nabla^2 f(x,y)\rVert_{op}\le C_d$, $\lVert\nabla^2 f(x)\rVert_{op}\le C_d$ and $\lVert\nabla^2 f(y)\rVert_{op}\le C_d$;
(e) The two conditional pdfs are both bounded, i.e. $f(x|y)\le C_e$ and $f(y|x)\le C_e$.

It was proved in [21] that under its Assumption 2, the KSG estimator is consistent, but the convergence rate was unknown. Note that distributions satisfying Assumption 2 of [21] may have arbitrarily slow convergence rates, especially heavy tailed distributions. Our assumptions are stronger than Assumption 2 of [21], in which (a)-(c) were not required. In [21], the convergence rate was derived under its Assumption 3, which also strengthens its Assumption 2. The main difference between Assumption 3 of [21] and our assumptions is that [21] requires

$$
\int f(x,y)\exp(-bf(x,y))\,dx\,dy\le C_c e^{-C_0 b}. \qquad (14)
$$

One can show that a joint pdf satisfying (14) is bounded away from zero and the distribution must have bounded support (for completeness, we provide a proof of this statement in Appendix C). On the contrary, we only require this integral to decay inversely with $b$, see (13). This new assumption is valid for distributions whose joint pdf can be arbitrarily close to zero, thus our analysis holds for distributions with both bounded and unbounded support. Another difference is that we strengthen the Hessian condition from bounded almost everywhere to bounded everywhere, to ensure the smoothness of the density and thus avoid the boundary effect. Figure 1 illustrates the difference between [21] and our analysis. [21] holds for type (a), such as the uniform distribution, while our analysis holds for types (b) and (c), such as the Gaussian distribution. In addition, we do not truncate the kNN distances as in [21].

To deal with these differences in assumptions, our derivation is significantly different from that of [21]. Theorem 3 gives an upper bound on the bias under these assumptions:

###### Theorem 3.

Under Assumption 1, for fixed $k$ and sufficiently large $N$, the bias of the KSG estimator is bounded by

$$
\left|\mathbb{E}[\hat{I}(X;Y)]-I(X;Y)\right|=O\!\left(\frac{\log N}{N^{1/d_z}}\right). \qquad (15)
$$
###### Proof.

Recall that the KSG estimator is an adaptive combination of two adaptive KL estimators that estimate the marginal entropies, and one original KL estimator that estimates the joint entropy. We express the KSG estimator in the following way:

$$
\hat{I}(X;Y)=\frac{1}{N}\sum_{i=1}^{N}T(i)=\frac{1}{N}\sum_{i=1}^{N}\left[T_x(i)+T_y(i)-T_z(i)\right],
$$

in which

$$
T(i):=\psi(N)+\psi(k)-\psi(n_x(i)+1)-\psi(n_y(i)+1),
$$

and

$$
\begin{aligned}
T_z(i) &:= -\psi(k)+\psi(N)+\log c_{d_z}+d_z\log\rho(i),\\
T_x(i) &:= -\psi(n_x(i)+1)+\psi(N)+\log c_{d_x}+d_x\log\rho(i),\\
T_y(i) &:= -\psi(n_y(i)+1)+\psi(N)+\log c_{d_y}+d_y\log\rho(i).
\end{aligned}
$$

We bound the bias of these three KL estimators separately. Note that $\frac{1}{N}\sum_{i=1}^{N}T_z(i)$ is exactly the KL estimator of the joint entropy, so its bias can be bounded using Theorem 1. For the marginal entropy estimators $\frac{1}{N}\sum_{i=1}^{N}T_x(i)$ and $\frac{1}{N}\sum_{i=1}^{N}T_y(i)$, we only need to analyze $T_x$; the bound for $T_y$ can be obtained in the same manner. Note that

$$
\mathbb{E}[T_x]-h(X)=\mathbb{E}\left[\mathbb{E}[T_x|X]+\log f(X)\right],
$$

and we call $\mathbb{E}[T_x|X=x]+\log f(x)$ the local bias. The overall convergence rate is slower than the pointwise convergence rate of the local bias. In the setting discussed in [21], the boundary bias is dominant. In our case, by dividing the whole support into a central region and a tail region, with the threshold selected carefully, we let the bias in these two regions decay at approximately the same rate.

For detailed proof, please see Appendix D. ∎

###### Theorem 4.

The variance of the KSG estimator is bounded by

$$
\mathrm{Var}\left[\hat{I}(X;Y)\right]=O\!\left(\frac{(\log N)^2}{N}\right). \qquad (16)
$$
###### Proof.

The proof of the variance bound follows closely that of [21]. For detailed proof, please see Appendix E. ∎

The rate of convergence of the mean square error of the KSG estimator approximately attains the parametric rate for $d_z=2$. Although our result and [21] hold for different types of distributions, the resulting convergence rates are similar.

## IV Numerical Examples

In this section we provide numerical experiments to illustrate the analytical results obtained in this paper.

### IV-A KL estimator

Since our analysis holds for smooth distributions with both unbounded and bounded support, we conduct two experiments to test the performance of KL entropy estimator for these two cases respectively.

In the first experiment, we use standard Gaussian distributions with $d_x=1,2,3$, i.e. $f=\mathcal{N}(0,\mathbf{I}_{d_x})$, in which $\mathbf{I}_{d_x}$ is the $d_x$ dimensional identity matrix. These distributions have unbounded support.

In the second experiment, we use the following distribution that has a bounded support:

$$
f(x)=\prod_{i=1}^{d_x}g(x_i). \qquad (17)
$$

We conduct numerical simulations with $d_x=1,2,3$, and

$$
g(x)=\begin{cases}\dfrac{\sqrt{2}}{2}\left(1-x^2\right) & |x|<\dfrac{\sqrt{2}}{2};\\[6pt] \dfrac{\sqrt{2}}{2}\left(|x|-\sqrt{2}\right)^2 & \dfrac{\sqrt{2}}{2}\le |x|\le\sqrt{2}.\end{cases} \qquad (18)
$$

It can be shown that the distribution described by (17) and (18) satisfies assumptions (a), (b) in Theorem 1, and (c), (d) in Theorem 2. We fix $k$ for all of the experiments. To show how the bias and variance converge as the sample size increases, the logarithms of the estimated bias and variance are plotted against $\log N$. The results of the first and second experiments are shown in Figure 2 and Figure 3, respectively. The values of the data points in these figures are averaged over 100,000 trials. The bias is estimated by the difference between the average of the estimated entropy over all trials and the true entropy, while the variance is estimated by the sample variance over all trials.
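For reproducibility, the density (18) can be sampled by rejection sampling, since its maximum is $g(0)=\sqrt{2}/2$; a minimal sketch (function names are ours):

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def g(x):
    """The marginal density in (18): smooth on [-sqrt(2), sqrt(2)], zero at the endpoints."""
    ax = np.abs(x)
    return np.where(ax < SQRT2 / 2, (SQRT2 / 2) * (1 - ax ** 2),
                    np.where(ax <= SQRT2, (SQRT2 / 2) * (ax - SQRT2) ** 2, 0.0))

def sample_g(n, rng):
    """Rejection sampling with a uniform proposal on [-sqrt(2), sqrt(2)] (acceptance rate 1/2)."""
    out = np.empty(0)
    while out.size < n:
        x = rng.uniform(-SQRT2, SQRT2, size=2 * n)
        u = rng.uniform(0.0, SQRT2 / 2, size=2 * n)  # uniform under the envelope height
        out = np.concatenate([out, x[u < g(x)]])
    return out[:n]
```

Drawing $d_x$ independent coordinates from `sample_g` gives samples from the product density (17).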

We then calculate the empirical convergence rates of the bias and variance, which are estimated by the negative slope of each curve in Figures 2 and 3 through linear regression. These results are compared with the theoretical convergence rates obtained from Theorems 1 and 2. The results are shown in Tables I and II. In these two tables, we say that the theoretical convergence rate of the bias or variance is $\alpha$ if it decays as $N^{-\alpha}$ or $N^{-\alpha}\log N$.
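The regression step is a one-liner; a minimal sketch of how the empirical rates in the tables can be computed (names are illustrative):

```python
import numpy as np

def empirical_rate(sample_sizes, errors):
    """Fit log|error| = -rate * log N + c by least squares and return the rate."""
    slope = np.polyfit(np.log(sample_sizes), np.log(np.abs(errors)), 1)[0]
    return -slope
```

For example, feeding it errors that decay exactly as $N^{-1/2}$ recovers a rate of $0.5$.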

The convergence rate of the bias for $d_x=1$ is faster than our theoretical prediction. Note that the result of Theorem 1 is an upper bound on the error; for particular distributions satisfying more restrictive assumptions, the convergence can be faster. For $d_x=2$ and $d_x=3$, the empirical results basically agree with Theorem 1. We also observe that the variance of the KL estimator increases with the dimension $d_x$; however, the asymptotic convergence rates are approximately equal to $1$ for fixed $d_x$. This agrees with our analysis in Theorem 2.

### IV-B KSG estimator

Similar to the numerical examples for the KL estimator, we now use two experiments to test the performance of the KSG mutual information estimator for smooth distributions with unbounded and bounded support, respectively.

In the first experiment, we simulate the convergence of the bias and variance of the KSG estimator for a jointly Gaussian distribution. We set $d_x=d_y=1$ and $(X,Y)\sim\mathcal{N}(0,\mathbf{K})$, in which

$$
\mathbf{K}=\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}.
$$

In this numerical experiment, we use a fixed correlation coefficient $\rho$.

In the second experiment, we focus on the performance of the KSG estimator for bounded distributions. We estimate the mutual information between $X$ and $Y$ for the following case: $X$ is a random variable whose distribution is described by (17) and (18) with $d_x=1$, $W$ is another random variable that is i.i.d. with $X$, and $Y=X+W$. This joint distribution has a bounded support, on which the density approaches zero. It can be shown that this distribution satisfies Assumption 1 (a)-(e).

Similar to the experiments on the KL entropy estimator, we fix $k$ for both experiments, and plot the logarithms of the estimated bias and variance against $\log N$. The results are shown in Figures 4 and 5, in which the value of each data point is averaged over 20,000 trials. To get a better understanding of the convergence rate of the KSG estimator, we plot the curves not only for the KSG estimator, but also for the three entropy estimators decomposed from it:

$$
\begin{aligned}
\hat{h}(X) &= -\frac{1}{N}\sum_{i=1}^{N}\psi(n_x(i)+1)+\psi(N)+\log c_{d_x}+\frac{d_x}{N}\sum_{i=1}^{N}\log\epsilon(i),\\
\hat{h}(Y) &= -\frac{1}{N}\sum_{i=1}^{N}\psi(n_y(i)+1)+\psi(N)+\log c_{d_y}+\frac{d_y}{N}\sum_{i=1}^{N}\log\epsilon(i),\\
\hat{h}(X,Y) &= -\psi(k)+\psi(N)+\log c_{d_z}+\frac{d_z}{N}\sum_{i=1}^{N}\log\epsilon(i),
\end{aligned}
$$

in which $\hat{h}(X,Y)$ is exactly the original KL entropy estimator for the joint distribution, while $\hat{h}(X)$ and $\hat{h}(Y)$ are two adaptive KL estimators for the marginal distributions.

The comparison of the theoretical and empirical convergence rates is shown in Tables III and IV. As for the KL estimator, we say that the theoretical convergence rate of the bias or variance is $\alpha$ if it decays as $N^{-\alpha}$ or $N^{-\alpha}\log N$. The theoretical convergence rates of the bias and variance of the KSG estimator come from Theorems 3 and 4. The result for the joint entropy estimator comes from Theorems 1 and 2. For the two adaptive marginal entropy estimators, the result comes from Lemma 12 in the appendix. The empirical convergence rate is obtained by calculating the slope of the curves in Figures 4 and 5 through linear regression, as in the experiments on the KL estimator.

The results show that the empirical convergence rate of the variance agrees with our theoretical results. The empirical convergence of the bias of the KSG estimator and the decomposed KL estimators is a bit faster than our theoretical prediction. As in the discussion of the KL estimator, the convergence rates of the bias differ across types of distributions, and our theoretical result considers the worst case among distributions that satisfy our assumptions. The numerical results illustrate that there is room for further improving the analysis of the convergence rate; this is left for future study. From the simulation results, we can also observe that the blue lines in the left panels of Figures 4 and 5 are both lower than the purple ones, which shows that the biases of the marginal and joint entropy estimators decomposed from the KSG estimator can partially cancel out, and thus reduce the estimation error.

## V Conclusion

In this paper we have analyzed the convergence rates of the bias and variance of the KL entropy estimator and the KSG mutual information estimator for smooth distributions. For the KL estimator, we have improved the convergence rate of the bias in [25], and have extended the analysis to arbitrary fixed $k$ and arbitrary dimensions. For the KSG estimator, we have derived an upper bound on the convergence rate of the bias for two types of distributions that were not analyzed before: distributions with unbounded support, i.e. with long tails, and distributions with bounded support whose density can approach zero. We have also used numerical examples to show that the practical performance of the KL and KSG estimators generally agrees with our analysis, although sometimes the empirical convergence rate is faster than the upper bound we derived. It is of interest to investigate whether a tighter bound can be derived, probably with more restrictive assumptions.

## Appendix A Proof of Theorem 1: the bias of KL entropy estimator

In this section, we analyze the bias of the truncated KL estimator

$$
\hat{h}(X)=-\psi(k)+\psi(N)+\log c_{d_x}+\frac{d_x}{N}\sum_{i=1}^{N}\log\rho(i),
$$

in which

$$
\rho(i)=\min\{\epsilon(i),a_N\}, \qquad (19)
$$

and the truncation threshold is set to be

$$
a_N=AN^{-\beta},
$$

in which $\beta>0$. We hope to select a $\beta$ that optimizes the convergence rate of the bias.

We begin with deriving two lemmas based on Assumptions (a) and (b) in the theorem statement.

###### Lemma 1.

For distributions satisfying assumption (a), there exists a constant $C_1$ such that

$$
\left|P(B(x,r))-f(x)c_{d_x}r^{d_x}\right|\le C_1 r^{d_x+2}, \qquad (20)
$$

in which $C_1$ depends only on $M$, $d_x$ and the norm being used.

###### Proof.
$$
\begin{aligned}
\left|P(B(x,r))-f(x)c_{d_x}r^{d_x}\right| &= \left|\int_{u\in B(x,r)}(f(u)-f(x))\,du\right|\\
&= \left|\int_{u\in B(x,r)}\left[(\nabla f(x))^T(u-x)+(u-x)^T H_f(\xi(u))(u-x)\right]du\right|\\
&= \left|\int_{u\in B(x,r)}(u-x)^T H_f(\xi(u))(u-x)\,du\right|\\
&\le \int_{u\in B(x,r)} M\lVert u-x\rVert_2^2\,du\\
&\le \int_{\lVert u-x\rVert_2<\alpha r} M\lVert u-x\rVert_2^2\,du\\
&= \frac{d_x\pi^{d_x/2}M}{(d_x+2)\Gamma\!\left(\frac{d_x}{2}+1\right)}\alpha^{d_x+2}r^{d_x+2},
\end{aligned}
$$

in which the third equality holds because the first order term integrates to zero over the ball, $\alpha$ is the $\ell_2$ radius of the unit norm ball, i.e. $\alpha=\sup_{\lVert u\rVert\le 1}\lVert u\rVert_2$, and $H_f$ denotes the Hessian of $f$. ∎

Assumption (b) controls the tail of the distribution. We can show that the following lemma holds:

###### Lemma 2.

(1) There exists $\mu>0$ such that

$$
\Pr(f(X)\le r)\le \mu r,\quad \forall r>0. \qquad (22)
$$

(2) For any positive integer $m$, there exists a constant $K_m$, so that

$$
\int f^m(x)\exp(-bf(x))\,dx\le K_m b^{-m}. \qquad (23)
$$
###### Proof.

(22) can be proved using Markov's inequality. For (23), note that when $m=1$, (23) is exactly assumption (b); we generalize this result to arbitrary $m$. The detailed proofs of (22) and (23) can be found in Section A-A. ∎

Now we analyze the convergence rate of the KL estimator in (2):

$$
\begin{aligned}
\mathbb{E}[\hat{h}(X)]-h(X) &\overset{(a)}{=} -\psi(k)+\psi(N)+\mathbb{E}\left[\log\left(c_{d_x}\rho^{d_x}\right)\right]-h(X) \qquad (24)\\
&\overset{(b)}{=} -\mathbb{E}[\log P(B(X,\epsilon))]+\mathbb{E}\left[\log\left(c_{d_x}\rho^{d_x}\right)\right]-h(X)\\
&\overset{(c)}{=} -\mathbb{E}[\log P(B(X,\epsilon))]+\mathbb{E}\left[\log\left(f(X)c_{d_x}\rho^{d_x}\right)\right]\\
&\overset{(d)}{=} -\mathbb{E}\left[\log\frac{P(B(X,\epsilon))}{P(B(X,\rho))}\mathbf{1}(X\in S_1)\right]-\mathbb{E}\left[\log\frac{P(B(X,\rho))}{f(X)c_{d_x}\rho^{d_x}}\mathbf{1}(X\in S_1)\right]-\mathbb{E}\left[\log\frac{P(B(X,\epsilon))}{f(X)c_{d_x}\rho^{d_x}}\mathbf{1}(X\in S_2)\right]\\
&:= -I_1-I_2-I_3,
\end{aligned}
$$

in which (a) uses the fact that the $\rho(i)$'s are identically distributed for all $i$, thus

$$
\mathbb{E}\left[\frac{d_x}{N}\sum_{i=1}^{N}\log\rho(i)\right]=\mathbb{E}\left[d_x\log\rho(i)\right]
$$

for any $i$. From now on, we omit the index $i$ for convenience.

In (b), we use the pdf of the $k$-th nearest neighbor distance [26]:

$$
f_\epsilon(r)=\frac{(N-1)!}{(k-1)!(N-k-1)!}P^{k-1}(B(x,r))\left(1-P(B(x,r))\right)^{N-k-1}\frac{dP(B(x,r))}{dr}. \qquad (25)
$$

Then we show that $\mathbb{E}[\log P(B(X,\epsilon))]=\psi(k)-\psi(N)$, which has been proved in [11]. We repeat it here for the reader's understanding and for completeness:

$$
\begin{aligned}
\mathbb{E}[\log P(B(x,\epsilon))\,|\,x] &= \int\log P(B(x,r))\,f_\epsilon(r)\,dr \qquad (26)\\
&\overset{u=P(B(x,r))}{=} \frac{(N-1)!}{(k-1)!(N-k-1)!}\int_0^1 u^{k-1}(1-u)^{N-k-1}\log u\,du\\
&= \frac{(N-1)!}{(k-1)!(N-k-1)!}B(k,N-k)\left(\psi(k)-\psi(N)\right)\\
&= \psi(k)-\psi(N),
\end{aligned}
$$

in which $B(\cdot,\cdot)$ denotes the Beta function, with $B(k,N-k)=\frac{\Gamma(k)\Gamma(N-k)}{\Gamma(N)}=\frac{(k-1)!(N-k-1)!}{(N-1)!}$. (26) holds for all $x$, hence we can take the expectation over $X$, and thus (b) is valid.

(c) holds because $h(X)=-\mathbb{E}[\log f(X)]$.

In (d), $S_1$ and $S_2$ are defined as:

$$
S_1=\left\{x\,\Big|\,f(x)\ge \frac{\lambda C_1}{c_{d_x}}A^2N^{-\gamma}\right\}, \qquad (27)
$$

$$
S_2=\left\{x\,\Big|\,f(x)< \frac{\lambda C_1}{c_{d_x}}A^2N^{-\gamma}\right\}, \qquad (28)
$$

in which $\gamma$ satisfies

$$
\gamma\le\min\left\{2\beta,\,1-\beta d_x\right\}, \qquad (29)
$$

and $\lambda$ is a properly chosen constant. Roughly speaking, $S_1$ is the region where $f$ is relatively large, while $S_2$ corresponds to the tail region. Regarding the two regions $S_1$ and $S_2$, we have the following lemma.

###### Lemma 3.

There exist constants $C_2$ and $C_3$, such that for sufficiently large $N$,

$$
P(\epsilon>a_N,\,X\in S_1)\le C_2 N^{-(1-\beta d_x)}, \qquad (31)
$$

$$
P(\epsilon>a_N)\le C_3 N^{-\min\left\{1-\beta d_x,\;\frac{2}{d_x+2}\right\}}. \qquad (32)
$$
###### Proof.

Please see Section A-B for the detailed proof. ∎

From (24), we know that the bias of the KL estimator can be bounded by bounding $I_1$, $I_2$ and $I_3$ separately.

#### A-1 Bound of I1

$$
\begin{aligned}
I_1 &= \mathbb{E}\left[\left(\log P(B(X,\epsilon))-\log P(B(X,\rho))\right)\mathbf{1}(X\in S_1)\right]\\
&\overset{(a)}{=} \mathbb{E}\left[\left(\log P(B(X,\epsilon))-\log P(B(X,\rho))\right)\mathbf{1}(X\in S_1,\epsilon>a_N)\right]\\
&\overset{(b)}{\le} \mathbb{E}\left[-\log P(B(X,\rho))\mathbf{1}(X\in S_1,\epsilon>a_N)\right]\\
&\overset{(c)}{=} \mathbb{E}\left[-\log P(B(X,a_N))\mathbf{1}(X\in S_1,\epsilon>a_N)\right]\\
&\overset{(d)}{\le} -\log\left[(k+1)N^{-(\gamma+\beta d_x)}\right]P(X\in S_1,\epsilon>a_N)\\
&\overset{(e)}{=} O\!\left(N^{-(1-\beta d_x)}\log N\right).
\end{aligned}
$$

Here (a) uses the definition of $\rho$ in (19), which implies that $\epsilon$ and $\rho$ differ only when $\epsilon>a_N$. (b) uses $P(B(X,\epsilon))\le 1$. (c) uses the definition of $\rho$ again: $\rho=a_N$ if $\epsilon>a_N$. (d) uses the lower bound of $P(B(X,a_N))$ derived in (59). (e) uses (31) in Lemma 3.

Since $\epsilon\ge\rho$ always holds, $I_1$ is always nonnegative, therefore $I_1$ is bounded by:

$$
0\le I_1=O\!\left(N^{-(1-\beta d_x)}\log N\right). \qquad (33)
$$

#### A-2 Bound of I2

An upper bound of $I_2$ can be obtained via

$$
\begin{aligned}
I_2 &= \mathbb{E}\left[\log\left(\frac{P(B(X,\rho))}{f(X)c_{d_x}\rho^{d_x}}\right)\mathbf{1}(X\in S_1)\right] \qquad (34)\\
&\overset{(a)}{\le} \mathbb{E}\left[\log\left(\frac{f(X)c_{d_x}\rho^{d_x}+C_1\rho^{d_x+2}}{f(X)c_{d_x}\rho^{d_x}}\right)\mathbf{1}(X\in S_1)\right]\\
&\overset{(b)}{\le} \mathbb{E}\left[\frac{C_1\rho^2}{f(X)c_{d_x}}\mathbf{1}(X\in S_1)\right]\\
&\overset{(c)}{=} O(N^\gamma)\,\mathbb{E}\left[\rho^2\mathbf{1}(X\in S_1)\right],
\end{aligned}
$$

in which (a) uses Lemma 1, (b) uses the fact that $\log(1+t)\le t$ for $t\ge 0$, and (c) is based on the definition of $S_1$ in (27): we substitute $f(X)$ in (34) with its lower bound on $S_1$.

A lower bound can be obtained similarly:

$$
\begin{aligned}
I_2 &\ge \mathbb{E}\left[\log\left(\frac{f(X)c_{d_x}\rho^{d_x}-C_1\rho^{d_x+2}}{f(X)c_{d_x}\rho^{d_x}}\right)\mathbf{1}(X\in S_1)\right] \qquad (35)\\
&\overset{(a)}{=} -\mathbb{E}\left[\frac{1}{\xi(X)}\cdot\frac{C_1\rho^2}{f(X)c_{d_x}}\mathbf{1}(X\in S_1)\right]\\
&\overset{(b)}{\ge} -2\,\mathbb{E}\left[\frac{C_1\rho^2}{f(X)c_{d_x}}\mathbf{1}(X\in S_1)\right]\\
&= -O(N^\gamma)\,\mathbb{E}\left[\rho^2\mathbf{1}(X\in S_1)\right],
\end{aligned}
$$

in which we use the Lagrange mean value theorem in (a), with $\xi(X)$ between $1-\frac{C_1\rho^2}{f(X)c_{d_x}}$ and $1$; (b) holds because from (57), we know that $\frac{C_1\rho^2}{f(X)c_{d_x}}\le\frac{1}{2}$ on $S_1$, thus $\xi(X)\ge\frac{1}{2}$.

Now it remains to bound $\mathbb{E}\left[\rho^2\mathbf{1}(X\in S_1)\right]$. We have the following lemma:

###### Lemma 4.

(1) For any positive integer $d'$, there exist constants $C_4$ and $C_5$ that depend on $d'$, such that

$$
\mathbb{E}[\rho^{d'}]\le C_4 N^{-\frac{d'}{d_x}}+C_5 N^{-(\beta d'+\min\{2\beta,\,1-\beta d_x\})}, \qquad (36)
$$

for $\rho$ defined in (19).

(2) Furthermore, there exists a constant $M'$, such that

$$
\mathbb{E}[\rho^2]\le\begin{cases} M'N^{-(1+\beta)}\log N & \text{if } d_x=1,\ \beta\ge 1/3;\\[2pt] M'\dfrac{\log N}{N} & \text{if } d_x=2,\ \beta\ge 1/4;\\[2pt] M'N^{-\frac{2}{d_x}} & \text{if } d_x\ge 3,\ \beta\ge 1/(2d_x). \end{cases} \qquad (37)
$$
###### Proof.

Please see Section A-C for detailed proof. ∎

With (34), (35) and Lemma 4, $I_2$ can be bounded by

$$
|I_2|=\begin{cases} O\!\left(N^{-(1+\beta-\gamma)}\log N\right) & \text{if } d_x=1,\ \beta\ge 1/3;\\[2pt] O\!\left(N^{-(1-\gamma)}\log N\right) & \text{if } d_x=2,\ \beta\ge 1/4;\\[2pt] O\!\left(N^{-\left(\frac{2}{d_x}-\gamma\right)}\right) & \text{if } d_x\ge 3,\ \beta\ge 1/(2d_x). \end{cases} \qquad (38)
$$

#### A-3 Bound of I3

$$
\begin{aligned}
I_3 &= \mathbb{E}\left[\log\left(\frac{P(B(X,\epsilon))}{f(X)c_{d_x}\rho^{d_x}}\right)\mathbf{1}(X\in S_2)\right] \qquad (39)\\
&= \mathbb{E}\left[\log P(B(X,\epsilon))\mathbf{1}(X\in S_2)\right]-\mathbb{E}\left[\log f(X)\mathbf{1}(X\in S_2)\right]-\mathbb{E}\left[\log\left(c_{d_x}\rho^{d_x}\right)\mathbf{1}(X\in S_2)\right].
\end{aligned}
$$

The first term of (39) can be bounded using (26).

 E[log