# Scalable Hash-Based Estimation of Divergence Measures

We propose a scalable divergence estimation method based on hashing. Consider two continuous random variables X and Y whose densities have bounded support. We consider a particular locality-sensitive random hashing, and consider the ratio of samples in each hash bin having non-zero numbers of Y samples. We prove that the weighted average of these ratios over all of the hash bins converges to f-divergences between the two sample sets. We show that the proposed estimator is optimal in terms of both MSE rate and computational complexity. We derive the MSE rates for two families of smooth functions: the Hölder smoothness class and differentiable functions. In particular, it is proved that if the density functions have bounded derivatives up to the order d/2, where d is the dimension of the samples, the optimal parametric MSE rate of O(1/N) can be achieved. The computational complexity is shown to be O(N), which is optimal. To the best of our knowledge, this is the first empirical divergence estimator that has optimal computational complexity and achieves the optimal parametric MSE estimation rate.


## 1 Introduction

Information theoretic measures such as Shannon entropy, mutual information, and the Kullback-Leibler (KL) divergence have a broad range of applications in information theory, statistics and machine learning [1, 2, 3]. When we have two or more data sets and we are interested in finding the correlation or dissimilarity between them, Shannon mutual information or KL-divergence is often used. Rényi and f-divergence measures are two well studied generalizations of KL-divergence which comprise many important divergence measures such as KL-divergence, total variation distance, and α-divergence [4, 5].

Nonparametric estimators are a major class of divergence estimators, for which only minimal assumptions on the density functions are made. Some of the non-parametric divergence estimators are based on density plug-in estimators such as k-NN [6], KDE [7], and histogram [8]. A few researchers, on the other hand, have proposed direct estimation methods such as the graph theoretic nearest neighbor ratio (NNR) estimator [9]. In general, plug-in estimation methods suffer from high computational complexity, which makes them unsuitable for large scale applications.

Recent advances in non-parametric divergence estimation have focused on the MSE convergence rates of the estimators. Singh et al in [7] proposed a plug-in KDE estimator for Rényi divergence that achieves the parametric MSE rate of O(1/N) when the densities are at least d times differentiable and the support boundaries are sufficiently smooth. Kandasamy et al proposed a similar plug-in KDE estimator and extended the optimal MSE rate to densities that are at least d/2 times differentiable [10]. However, they ignore a major source of error due to the boundaries. Moon et al proposed a weighted ensemble method to improve the MSE rate of plug-in KDE estimators [11]. Their estimator for f-divergence achieves the optimal MSE rate of O(1/N) when the densities are at least (d+1)/2 times differentiable. They also assume stringent smoothness conditions at the support set boundary.

Noshad et al proposed a graph theoretic direct estimation method based on nearest neighbor ratios (NNR) [9]. Their estimator is simple and computationally more tractable than other competing estimators, and can achieve the optimal MSE rate of O(1/N) for densities that are at least d times differentiable. Although their basic estimator does not require any smoothness assumptions on the support set boundary, the ensemble variant of their estimator does.

In spite of achieving the optimal theoretical MSE rate, the aforementioned estimators face serious practical challenges. The first challenge is the high computational complexity of the estimators. Most KDE based estimators have runtime complexity of O(N^2), which is not suitable for large scale applications. The NNR estimator proposed in [9] has a lower runtime complexity, which makes it faster than the previous estimators. However, in [9] the number of neighbors k is required to grow sub-linearly with N, which results in much higher than linear runtime complexity. The other issue is the smoothness assumptions made on the support set boundary. Almost all previously proposed estimators assume extra smoothness conditions on the boundaries, which may not hold in practical applications. For example, the method proposed in [7] assumes that the density derivatives up to a certain order vanish at the boundary. It also requires numerous computations at the support boundary, which become complicated as the dimension increases. The Ensemble NNR estimator in [9] likewise assumes that the density derivatives vanish at the boundary. To circumvent this issue, Moon et al [11] assumed smoothness conditions at the support set boundary. However, these conditions may not hold in practice.

In this paper we propose a low complexity divergence estimator that can achieve the optimal parametric MSE rate of O(1/N) for densities with bounded derivatives up to order d/2. Our estimator has optimal runtime complexity of O(N), which makes it an appropriate tool for large scale applications. Also, in contrast to other competing estimators, our estimator does not require stringent smoothness assumptions on the support set boundary.

The structure of the proposed estimator borrows ideas from hash based methods for k-NN search and graph construction problems [12, 13], as well as from the NNR estimator proposed in [9]. The advantage of hash based methods is that they can find approximate nearest neighbor points with lower complexity than exact k-NN search methods. This suggests that fast and accurate algorithms for divergence estimation may be derived from hashing approximations of k-NN search. Noshad et al [9] consider the k-NN graph of the Y points in the joint data set (X, Y), and show that the average exponentiated ratio of the number of X points to the number of Y points among the k-NN points is proportional to the Rényi divergence between the X and Y densities. It turns out that for estimation of the density ratio around each point we do not really need to find the exact k-NN points; we only need sufficiently many local samples from X and Y around each point. By using a randomized locality sensitive hashing (LSH), we group points that are close in Euclidean space. In this manner, applying ideas from NNR estimation and from hashing techniques for the k-NN search problem, we obtain a more efficient divergence estimator. Consider two sample sets X and Y with a bounded density support. We use a particular two-level locality sensitive random hashing, and consider the ratio of samples in each bin with a non-zero number of Y samples. We prove that the weighted average of these ratios over all of the bins can be made to converge almost surely to f-divergences between the two sample populations. We also argue that, using the ensemble estimation technique provided in [2], we can achieve the optimal parametric MSE rate of O(1/N). Furthermore, a simple extension yields an online estimation method with O(N) total complexity and O(1/N) MSE rate, which is the first optimal online estimator of its type.

## 2 Hash-Based Estimation

In this section, we first introduce the f-divergence measure and propose a hash-based estimator. We then outline the main theoretical results, which are proven in Section 4.

Consider two density functions f_1 and f_2 with common bounded support set X ⊂ R^d.

The f-divergence is defined as follows [5].

$$D_g\left(f_1(x)\,\|\,f_2(x)\right) := \int g\!\left(\frac{f_1(x)}{f_2(x)}\right) f_2(x)\, dx = \mathbb{E}_{f_2}\!\left[g\!\left(\frac{f_1(x)}{f_2(x)}\right)\right], \tag{1}$$

where g is a smooth and convex function such that g(1) = 0. KL-divergence, Hellinger distance and total variation distance are particular cases of this family. Note that for estimation we do not need the convexity of g; we only require the following smoothness conditions. Assume that the densities are lower bounded by C_L and upper bounded by C_U, and that f_1 and f_2 belong to the Hölder smoothness class with parameter γ:

• Given a support X ⊆ R^d, a function f : X → R is called Hölder continuous with parameter 0 < γ ≤ 1, if there exists a positive constant G_f, possibly depending on f, such that

$$|f(y) - f(x)| \le G_f\, \|y - x\|^{\gamma}, \tag{2}$$

for every x, y ∈ X.

The function g in (1) is also assumed to be Lipschitz continuous; i.e. g is Hölder continuous in the sense of (2) with γ = 1.

###### Remark 1

The γ-Hölder smoothness family comprises a large class of continuous functions, including continuously differentiable functions and Lipschitz continuous functions. Also note that for γ > 1, any γ-Hölder continuous function on any bounded and connected support is constant.
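As a quick numerical illustration of the divergence (1), the generator g can be instantiated for the KL, total variation, and squared Hellinger cases and the integral evaluated by a simple Riemann sum. This is our own standalone sketch (the function names and the truncated-Gaussian example are ours, not part of the paper's method):

```python
import numpy as np

def trunc_gauss(mu, sigma, a=0.0, b=1.0, n=200_000):
    """Gaussian density truncated to [a, b] and renormalized."""
    x = np.linspace(a, b, n)
    dx = (b - a) / (n - 1)
    norm = np.exp(-0.5 * ((x - mu) / sigma) ** 2).sum() * dx
    return lambda t: np.exp(-0.5 * ((t - mu) / sigma) ** 2) / norm

def f_divergence(f1, f2, g, a=0.0, b=1.0, n=200_000):
    """Riemann-sum approximation of D_g(f1||f2) = integral of g(f1/f2) f2 dx."""
    x = np.linspace(a, b, n)
    dx = (b - a) / (n - 1)
    return float((g(f1(x) / f2(x)) * f2(x)).sum() * dx)

# Convex generators with g(1) = 0:
g_kl = lambda t: t * np.log(t)            # KL divergence
g_tv = lambda t: 0.5 * np.abs(t - 1.0)    # total variation distance
g_h2 = lambda t: (np.sqrt(t) - 1.0) ** 2  # squared Hellinger distance

f1 = trunc_gauss(0.4, 0.2)
f2 = trunc_gauss(0.6, 0.2)
print(f_divergence(f1, f2, g_kl))  # positive, since f1 != f2
print(f_divergence(f1, f1, g_kl))  # 0, since g(1) = 0
```

Note that the self-divergence vanishes exactly because the ratio is identically one and g(1) = 0, matching the normalization condition above.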

• Consider the N i.i.d samples X = {X_1, ..., X_N} drawn from f_1 and the M i.i.d samples Y = {Y_1, ..., Y_M} drawn from f_2. Define the fraction η := M/N, and define the joint set Z := X ∪ Y. Let ε be a positive real valued constant, a user-selectable parameter of the estimator defined in (5). We define the hash function H_1 : R^d → Z^d as

$$H_1(x) = \left[h_1(x_1),\, h_1(x_2),\, \ldots,\, h_1(x_d)\right], \tag{3}$$

where x_i is the projection of x on the ith coordinate, and h_1 : R → Z is defined as

$$h_1(x) = \left\lfloor \frac{x + b}{\epsilon} \right\rfloor, \tag{4}$$

for a fixed b ∈ [0, ε]. Let F := cN, where c ≥ 1 is a fixed real number. We define a random hash function H_2 : Z^d → {1, ..., F} with a uniform density on the output, and consider the combined hashing H(x) := H_2(H_1(x)), which maps the points in R^d to {1, ..., F}.

Consider the mappings of the sets X and Y under the hash function H, and define the vectors N and M to respectively contain the number of collisions, for each output bucket, from the sets X and Y. We denote the bins of the vectors N and M respectively by N_i and M_i, for 1 ≤ i ≤ F.

The hash based f-divergence estimator is defined as

$$\hat{D}_g(X,Y) := \max\left\{ \frac{1}{M} \sum_{\substack{i \le F \\ M_i > 0}} M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right),\; 0 \right\}, \tag{5}$$

where g̃ is a truncated version of g, clipped so that its argument lies within the interval of density ratios implied by the bounds C_L and C_U.
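For concreteness, a minimal single-level sketch of the estimator (5) might look as follows. This is our own illustrative implementation, not the paper's algorithm: it uses only the ε-grid hash (4) with dictionary keys standing in for the second-level hash H_2, and it crudely skips empty-numerator bins in place of the truncation g̃:

```python
import numpy as np
from collections import defaultdict

def hb_divergence(X, Y, eps, g, b=0.0):
    """Hash-based f-divergence sketch: bin both sample sets on an eps-grid
    (as in (4)), then average M_i * g(eta * N_i / M_i) over the occupied
    bins, as in (5). X, Y: arrays of shape (num_samples, d)."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    N, M = len(X), len(Y)
    eta = M / N  # corrects for unequal sample sizes
    counts = defaultdict(lambda: [0, 0])
    for x in X:
        counts[tuple(np.floor((x + b) / eps).astype(int))][0] += 1
    for y in Y:
        counts[tuple(np.floor((y + b) / eps).astype(int))][1] += 1
    total = 0.0
    for n_i, m_i in counts.values():
        if m_i > 0 and n_i > 0:  # crude stand-in for the truncation g~
            total += m_i * g(eta * n_i / m_i)
    return max(total / M, 0.0)

rng = np.random.default_rng(0)
g_kl = lambda t: t * np.log(t)
X = rng.uniform(0, 1, size=(20_000, 1))
Y = rng.uniform(0, 1, size=(20_000, 1))
d_same = hb_divergence(X, Y, eps=0.05, g=g_kl)
print(d_same)  # close to 0, since X and Y come from the same density
```

For identically distributed samples the per-bin ratios concentrate around one, so the estimate is near zero, while samples from different densities give a strictly larger value.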

Note that if the densities f_1 and f_2 are almost equal, then for each bin i we have ηN_i/M_i ≈ 1, and thus g̃(ηN_i/M_i) ≈ g(1) = 0, so the estimate tends to zero, as required. In the following theorems we state upper bounds on the bias and variance rates. Let B[T̂] and V[T̂] respectively represent the bias and variance of T̂, an estimator of the parameter T. Then the following theorem provides a bound on the bias of the proposed estimator.

###### Theorem 2.1

Assume that f_1 and f_2 are density functions with bounded common support set X satisfying γ-Hölder smoothness. The bias of the proposed estimator (5) for the f-divergence with function g can be bounded as

$$B\left[\hat{D}_g(X,Y)\right] = O\!\left(\epsilon^{\gamma}\right) + O\!\left(\frac{1}{N\epsilon^{d}}\right),$$

where ε is the positive real constant used in the hash function (4).

###### Remark 2

In order for the estimator to be asymptotically unbiased, ε needs to be a function of N. The optimum bias rate of O(N^(-γ/(γ+d))) is achieved for ε = N^(-1/(γ+d)).
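The rate in Remark 2 follows from balancing the two bias terms of Theorem 2.1:

$$\epsilon^{\gamma} \asymp \frac{1}{N\epsilon^{d}} \;\Longleftrightarrow\; \epsilon^{\gamma+d} \asymp N^{-1} \;\Longleftrightarrow\; \epsilon \asymp N^{-1/(\gamma+d)}, \qquad\text{giving}\qquad B\left[\hat{D}_g(X,Y)\right] = O\!\left(N^{-\gamma/(\gamma+d)}\right).$$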

###### Theorem 2.2

Let ε be fixed. The variance of the estimator (5) can be bounded as

$$V\left[\hat{D}_g(X,Y)\right] \le O\!\left(\frac{1}{N}\right). \tag{6}$$
###### Remark 3

The same variance bound holds for the argument of the max in (5). The bias and variance results easily extend to Rényi divergence estimation.

We next show that, when f_1 and f_2 belong to the family of differentiable densities, we can improve the bias rate by applying the ensemble estimation approach in [11, 3]. The Ensemble Hash-Based (EHB) estimator is defined as follows.

• Assume that the density functions are in the Hölder class of order q, which consists of functions on X with continuous derivatives up to order q whose qth partial derivatives are Hölder continuous with exponent γ. Let T := {t_1, ..., t_|T|} be a set of index values with t_i < c, where c > 0 is a constant, and let ε(t) := t N^(-1/(2d)). The weighted ensemble estimator is defined as

$$\hat{D}_w := \sum_{t \in T} w(t)\, \hat{D}_{\epsilon(t)}, \tag{7}$$

where D̂_ε(t) denotes the hash based estimator (5) of the f-divergence, computed with hashing parameter ε(t).

###### Theorem 2.3

Let q ≥ d and let w be the solution to:

$$\min_{w}\ \|w\|_2 \quad \text{subject to} \quad \sum_{t \in T} w(t) = 1, \qquad \sum_{t \in T} w(t)\, t^{i/d} = 0,\ \ i \in \mathbb{N},\ i \le d. \tag{8}$$

Then the MSE rate of the ensemble estimator (7) with weights w is O(1/N).
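A minimum-norm weight vector satisfying the linear constraints of (8) can be computed with a pseudoinverse. The sketch below is our own illustration (the index set T and the value of d are arbitrary example choices, not values from the paper):

```python
import numpy as np

def ensemble_weights(ts, d):
    """Min-ell2-norm weights solving (8): sum w(t) = 1 and
    sum w(t) t^(i/d) = 0 for i = 1..d. Assumes len(ts) > d + 1."""
    ts = np.asarray(ts, dtype=float)
    rows = [np.ones_like(ts)] + [ts ** (i / d) for i in range(1, d + 1)]
    A = np.stack(rows)             # (d+1) x |T| constraint matrix
    b = np.zeros(d + 1)
    b[0] = 1.0                     # the normalization constraint
    return np.linalg.pinv(A) @ b   # minimum-norm feasible solution

ts = np.arange(1, 11)              # example index set T with |T| = 10
w = ensemble_weights(ts, d=4)
print(np.dot(w, np.ones(10)))      # ~1.0: first constraint satisfied
```

Since the constraints are linear and underdetermined, the pseudoinverse returns exactly the minimum-ell2-norm feasible point, which is what (8) asks for.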

## 3 Online Divergence Estimation

In this section we study the problem of online divergence estimation. In this setting we consider two data streams X and Y with i.i.d samples, and we are interested in estimating the divergence between the two data sets. The number of samples increases over time, and an efficient update of the divergence estimate is desired. The time complexity of a batch update, which recomputes the estimate from the entire data set at each time point, is O(N) per update, which may not be effective in cases where we need quick detection of changes in the divergence.

Algorithm 2 updates the divergence with amortized runtime complexity of order O(1) per new pair of samples. Define the sets X_N and Y_N, the numbers N_i and M_i of X and Y samples in each partition, and the divergence estimate between X_N and Y_N. Consider updating the estimator with new samples X_{N+1} and Y_{N+1}. In the first and second lines of Algorithm 2, the new samples are added to the datasets, and the values of N_i and M_i of the bins in which the new samples fall are incremented. We can find these bins in O(1) time using simple hashing. Note that once N_i and M_i are updated, the divergence measure can be updated; however, since the number of bins is not increased, by Theorem 2.1 the bias will not be reduced. Since increasing the number of bins requires recomputing the bin partitions, a brute force rebinning approach would have order O(N) complexity, and if it were performed at each of the N updates, the total complexity would be O(N^2). Here we use a trick and update the hash function only when N is a power of two. In the following theorem, which is proved in the appendix, we show that the MSE rate of this algorithm is of order O(1/N) and the total rebinning computational complexity is of order O(N).

###### Theorem 3.1

The MSE rate of the online divergence estimator shown in Algorithm 2 is of order O(1/N), and its total computational complexity is of order O(N).
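The power-of-two rebinning trick can be sketched as follows. This is our own simplified illustration of the online scheme, not a reproduction of Algorithm 2: the ε schedule follows Remark 2 with γ = 1, and all class and method names are ours:

```python
import numpy as np
from collections import defaultdict

class OnlineHBEstimator:
    """Sketch: add sample pairs in amortized O(1) time, rebinning with a
    finer eps only when the sample count reaches a power of two."""

    def __init__(self, d, gamma=1.0):
        self.d, self.gamma = d, gamma
        self.X, self.Y = [], []
        self.eps = 1.0
        self.counts = defaultdict(lambda: [0, 0])

    def _key(self, pt):
        return tuple(np.floor(np.asarray(pt) / self.eps).astype(int))

    def _rebin(self):
        n = max(len(self.X), 1)
        self.eps = n ** (-1.0 / (self.gamma + self.d))  # Remark 2 schedule
        self.counts = defaultdict(lambda: [0, 0])
        for x in self.X: self.counts[self._key(x)][0] += 1
        for y in self.Y: self.counts[self._key(y)][1] += 1

    def add(self, x, y):
        self.X.append(x); self.Y.append(y)
        n = len(self.X)
        if n & (n - 1) == 0:   # power of two: full O(n) rebin
            self._rebin()
        else:                  # otherwise an O(1) incremental update
            self.counts[self._key(x)][0] += 1
            self.counts[self._key(y)][1] += 1

    def estimate(self, g):
        M = len(self.Y)
        eta = M / max(len(self.X), 1)
        s = sum(m * g(eta * n / m) for n, m in self.counts.values()
                if m > 0 and n > 0)
        return max(s / M, 0.0)

rng = np.random.default_rng(1)
est = OnlineHBEstimator(d=1)
for _ in range(4096):
    est.add(rng.uniform(0, 1, 1), rng.uniform(0, 1, 1))
print(est.estimate(lambda t: t * np.log(t)))  # near 0 for equal densities
```

Rebinning at sizes 1, 2, 4, ..., N costs about N + N/2 + N/4 + ... = O(N) in total, which is the amortization argument behind Theorem 3.1.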

## 4 Proofs

In this section we derive the bias bound for the densities in Hölder smoothness class, stated in Theorem 2.1. For the proofs of variance bound in Theorem 2.2, convergence rate of EHB estimator in Theorem 2.3, and online divergence estimator in Theorem 3.1, we refer the reader to the Appendix, provided as a supplementary pdf file.

Consider the mapping of the X and Y points by the hash function H_1, and let the vectors V_1, ..., V_L represent the distinct mappings of the X and Y points under H_1. Here L is the number of distinct outputs of H_1. In the following lemma we prove an upper bound on L.

###### Lemma 4.1

Let f be a density function with bounded support X ⊂ R^d. If L denotes the number of distinct outputs of the hash function H_1 (defined in (3)) applied to N i.i.d points with density f, then we have

$$L \le O\!\left(\frac{1}{\epsilon^{d}}\right). \tag{9}$$
• Define X_I as the region

$$X_I := \left\{x \,\middle|\, -c_X \le x_i \le c_X,\ 1 \le i \le d\right\}, \tag{10}$$

where c_X is a constant such that X ⊆ X_I.

L is clearly not greater than the total number of bins created by splitting the region X_I into partitions of volume ε^d. So we have

$$L \le \frac{(2 c_X)^{d}}{\epsilon^{d}}. \tag{11}$$

Proof of Theorem 2.1. Let N_i and M_i respectively denote the numbers of X and Y points falling in bin i, 1 ≤ i ≤ F, under the hash function H. Let E_i stand for the event that there is no collision in bin i for the hash function H_2 with inputs V_1, ..., V_L. We have

$$P(E_i) = \left(1 - \frac{1}{F}\right)^{L} + L \cdot \frac{1}{F}\left(\frac{F-1}{F}\right)^{L-1} = 1 - O\!\left(\frac{L}{F}\right). \tag{12}$$
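Formula (12) can be sanity-checked numerically (our own check, not part of the proof): under a uniform hash of the L distinct vectors into F buckets, the number of vectors landing in a fixed bucket is Binomial(L, 1/F), and E_i is the event that this count is at most one:

```python
import numpy as np

# Monte Carlo check of (12): P(E_i) for a uniform hash of L items into
# F buckets, where E_i = {bucket i receives at most one item}.
L, F = 10, 1000
exact = (1 - 1/F)**L + L * (1/F) * ((F - 1)/F)**(L - 1)

rng = np.random.default_rng(0)
trials = 200_000
hashes = rng.integers(0, F, size=(trials, L))       # uniform bucket draws
empirical = np.mean(np.sum(hashes == 0, axis=1) <= 1)
print(exact, empirical)  # both close to 1, consistent with 1 - O(L/F)
```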

By definition,

$$\hat{D}_g(X,Y) := \frac{1}{M} \sum_{\substack{i \le F \\ M_i > 0}} M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right).$$

Therefore

$$\begin{aligned} \mathbb{E}\left[\hat{D}_g(X,Y)\right] &= \frac{1}{M}\, \mathbb{E}\left[\sum_{\substack{i \le F \\ M_i > 0}} M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right)\right] \\ &= \frac{1}{M} \sum_{\substack{i \le F \\ M_i > 0}} P(E_i)\, \mathbb{E}\left[M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right) \,\middle|\, E_i\right] + \frac{1}{M} \sum_{\substack{i \le F \\ M_i > 0}} P(\bar{E}_i)\, \mathbb{E}\left[M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right) \,\middle|\, \bar{E}_i\right]. \end{aligned} \tag{13}$$

We denote the second term in (13) by B_H, which has the interpretation of the bias error due to collisions in hashing. Recall that the complement of E_i is the event that there is a collision at bin i for the hash function H_2 with inputs V_1, ..., V_L. To prove an upper bound on B_H, we first need an upper bound on the conditional expectations of the M_i given a collision, stated in the following lemma.

###### Lemma 4.2

We have

$$\sum_{\substack{i \le F \\ M_i > 0}} \mathbb{E}\left[M_i \,\middle|\, \bar{E}_i\right] \le O(L). \tag{14}$$
• Define A_i := {j ≤ L : H_2(V_j) = i}, and let M'_j denote the number of Y points mapped to V_j under H_1. For each i we can rewrite M_i as

$$M_i = \sum_{j=1}^{L} \mathbb{1}_{A_i}(j)\, M'_j. \tag{15}$$

Thus,

$$\begin{aligned} \sum_{\substack{i \le F \\ M_i > 0}} \mathbb{E}\left[M_i \,\middle|\, \bar{E}_i\right] &= \sum_{\substack{i \le F \\ M_i > 0}} \mathbb{E}\left[\sum_{j=1}^{L} \mathbb{1}_{A_i}(j)\, M'_j \,\middle|\, \bar{E}_i\right] = \sum_{\substack{i \le F \\ M_i > 0}} \sum_{j=1}^{L} M'_j\, \mathbb{E}\left[\mathbb{1}_{A_i}(j) \,\middle|\, \bar{E}_i\right] \\ &= \sum_{\substack{i \le F \\ M_i > 0}} \sum_{j=1}^{L} M'_j\, P\!\left(j \in A_i \,\middle|\, \bar{E}_i\right) = \sum_{\substack{i \le F \\ M_i > 0}} \sum_{j=1}^{L} M'_j\, \frac{P\!\left(j \in A_i,\ \bar{E}_i\right)}{P\!\left(\bar{E}_i\right)}, \end{aligned} \tag{16}$$

where the joint and marginal probabilities in the last expression can be derived as

$$P\!\left(j \in A_i,\ \bar{E}_i\right) = \frac{1}{F}\left(1 - \left(\frac{F-1}{F}\right)^{L-1}\right) = O\!\left(\frac{L}{F^{2}}\right), \tag{17}$$

and

$$P\!\left(\bar{E}_i\right) = 1 - P(E_i) = O\!\left(\frac{L}{F}\right). \tag{18}$$

Plugging (17) and (18) into (16) results in

$$\sum_{\substack{i \le F \\ M_i > 0}} \mathbb{E}\left[M_i \,\middle|\, \bar{E}_i\right] = \sum_{\substack{i \le F \\ M_i > 0}} \sum_{j=1}^{L} M'_j\, O\!\left(\frac{1}{F}\right) = \sum_{\substack{i \le F \\ M_i > 0}} O\!\left(\frac{M}{F}\right) = O(L), \tag{19}$$

where in the second equality we use the fact that the M'_j sum to M, and in the last equality we use the facts that the number of non-empty bins is at most L and that M/F = O(1). Now in the following lemma we prove a bound on B_H.

###### Lemma 4.3

Let L denote the number of distinct outputs of the hash function H_1 applied to the X and Y sample points. The bias of the estimator (5) due to hashing collisions can be upper bounded by

$$B_H \le O\!\left(\frac{L^{2}}{N^{2}}\right). \tag{20}$$
• From the definition of B_H we can write

$$\begin{aligned} B_H &:= \frac{1}{M} \sum_{\substack{i \le F \\ M_i > 0}} P\!\left(\bar{E}_i\right) \mathbb{E}\left[M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right) \,\middle|\, \bar{E}_i\right] = \frac{P\!\left(\bar{E}_1\right)}{M} \sum_{\substack{i \le F \\ M_i > 0}} \mathbb{E}\left[M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right) \,\middle|\, \bar{E}_i\right] \\ &\le \frac{P\!\left(\bar{E}_1\right) \tilde{g}(R_{\max})}{M} \sum_{\substack{i \le F \\ M_i > 0}} \mathbb{E}\left[M_i \,\middle|\, \bar{E}_i\right] = \frac{P\!\left(\bar{E}_1\right) \tilde{g}(R_{\max})}{M}\, O(L) = O\!\left(\frac{L^{2}}{N^{2}}\right), \end{aligned} \tag{21}$$

where in the second equality we used the fact that the collision probability is the same for every bin, in the inequality we used the upper bound g̃(R_max) on g̃, and in the last equality we used the result in (19) together with (18) and the fact that M and F are both of order N.

Now we are ready to continue the proof of the bias bound from (13). Let E be defined as the event that there is no collision for the hash function H_2, that is, all of its outputs are distinct: E is the intersection of the events E_i.

Then (13) can be written as

$$\begin{aligned} \mathbb{E}\left[\hat{D}_g(X,Y)\right] &= \frac{1}{M} \sum_{\substack{i \le F \\ M_i > 0}} P(E_i)\, \mathbb{E}\left[M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right) \,\middle|\, E_i\right] + O\!\left(\frac{L}{F}\right) \\ &= \frac{P(E_1)}{M} \sum_{\substack{i \le F \\ M_i > 0}} \mathbb{E}\left[M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right) \,\middle|\, E_i\right] + O\!\left(\frac{L}{F}\right) \\ &= \frac{P(E_1)}{M} \sum_{\substack{i \le F \\ M_i > 0}} \mathbb{E}\left[M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right) \,\middle|\, E\right] + O\!\left(\frac{L}{F}\right) \qquad\qquad (22) \\ &= \frac{P(E_1)}{M}\, \mathbb{E}\left[\sum_{\substack{i \le F \\ M_i > 0}} M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right) \,\middle|\, E\right] + O\!\left(\frac{L}{F}\right) \qquad\qquad (23) \\ &= \frac{1 - O(L/F)}{M}\, \mathbb{E}\left[\sum_{i=1}^{M} \tilde{g}\!\left(\frac{\eta N'_i}{M'_i}\right)\right] + O\!\left(\frac{L}{F}\right) \qquad\qquad\quad\ (24) \\ &= \mathbb{E}_{Y_1 \sim f_2}\!\left[\mathbb{E}\left[\tilde{g}\!\left(\frac{\eta N'_1}{M'_1}\right) \,\middle|\, Y_1\right]\right] + O\!\left(\frac{L}{F}\right), \qquad\qquad\qquad (25) \end{aligned}$$

where in (22) we used the fact that, conditioned on E_i, the variables N_i and M_i are independent of E_j for j ≠ i. In (23), since there is no collision under E, the counts N_i and M_i are equal to N'_j and M'_j for some j, where N'_j and M'_j denote the numbers of X and Y points mapped to V_j under H_1. Equation (24) holds because the values N'_j and M'_j are independent of the hash function H_2 and its outputs, and finally in (25) we used the fact that the samples in each of the sets X and Y are i.i.d.

At this point, assuming that the variance of ηN'_1/M'_1 is upper bounded by O(1/N), and using Lemma 3.2 in [9], we only need to derive E[N'_1/M'_1 | Y_1], from which we can find the RHS of (25). Note that, conditioned on Y_1, the variables N'_1 and M'_1 − 1 are independent and have binomial distributions, with respective means determined by the probabilities that points with densities f_1 and f_2 map into the bin containing Y_1. Hence,

$$\mathbb{E}\left[\frac{N'_1}{M'_1} \,\middle|\, Y_1\right] = \mathbb{E}\left[N'_1 \,\middle|\, Y_1\right]\, \mathbb{E}\left[\frac{1}{M'_1} \,\middle|\, Y_1\right]. \tag{26}$$

Let B_i denote the region of points that map to the same vector V_i under H_1. E[N'_i] can be written as:

$$\begin{aligned} \mathbb{E}\left[N'_i\right] &= N \int_{x \in B_i} f_1(x)\, dx = N \int_{x \in B_i} \left(f_1(Y_i) + O\!\left(\|x - Y_i\|^{\gamma}\right)\right) dx \\ &= N \epsilon^{d} f_1(Y_i) + N \int_{x \in B_i} O\!\left(\|x - Y_i\|^{\gamma}\right) dx = N \epsilon^{d} f_1(Y_i) + N \int_{x \in B_i - Y_i} O\!\left(\|x\|^{\gamma}\right) dx, \end{aligned} \tag{27}$$

where in the second equality we have used the definition in (2). Define B'_i := (B_i − Y_i)/ε and

$$C_{\gamma}(Y_i) := \int_{x \in B'_i} \|x\|^{\gamma}\, dx. \tag{28}$$

Note that C_γ(Y_i) is a constant independent of ε, since the volume of B'_i is independent of ε. By the change of variables x' := x/ε we can write

$$\int_{x \in B_i - Y_i} \|x\|^{\gamma}\, dx = \int_{x' \in B'_i} \epsilon^{\gamma} \|x'\|^{\gamma} \left(\epsilon^{d}\, dx'\right) = C_{\gamma}(Y_i)\, \epsilon^{\gamma + d}. \tag{29}$$

Also note that, since the numbers of X and Y points in each bin are independent, we have E[N'_i | Y_i] = E[N'_i], and therefore

$$\mathbb{E}\left[N'_i \,\middle|\, Y_i\right] = N \epsilon^{d} f_1(Y_i) + O\!\left(N \epsilon^{\gamma + d}\, C_{\gamma}(Y_i)\right). \tag{30}$$

Next, note that M'_i, conditioned on being non-zero, has a truncated binomial distribution, for which the first order inverse moment can be written as [14]:

$$\begin{aligned} \mathbb{E}\left[\frac{1}{M'_i} \,\middle|\, Y_i\right] &= \left[M \epsilon^{d} f_2(Y_i) + O\!\left(M \epsilon^{\gamma + d}\, C_{\gamma}(Y_i)\right)\right]^{-1} \left(1 + O\!\left(\frac{1}{M \epsilon^{d} f_2(Y_i)}\right)\right) \\ &= \left(M \epsilon^{d} f_2(Y_i)\right)^{-1} \left(1 + O\!\left(\epsilon^{\gamma}\right) + O\!\left(\frac{1}{M \epsilon^{d}}\right)\right). \end{aligned} \tag{31}$$

Thus, (26) can be simplified as

$$\mathbb{E}\left[\frac{N'_1}{M'_1} \,\middle|\, Y_1\right] = \frac{f_1(Y_1)}{\eta\, f_2(Y_1)} + O\!\left(\epsilon^{\gamma}\right) + O\!\left(\frac{1}{M \epsilon^{d}}\right). \tag{32}$$

We use (Lemma 3.2 in [9]) and Remark 3, and obtain

$$\mathbb{E}\left[\tilde{g}\!\left(\frac{\eta N'_1}{M'_1}\right) \,\middle|\, Y_1\right] = g\!\left(\frac{f_1(Y_1)}{f_2(Y_1)}\right) + O\!\left(\epsilon^{\gamma}\right) + O\!\left(\frac{1}{M \epsilon^{d}}\right) + O\!\left(N^{-1/2}\right). \tag{33}$$

Finally from (25) we get

$$B\left[\hat{D}_g(X,Y)\right] = O\!\left(\epsilon^{\gamma}\right) + O\!\left(\frac{1}{M \epsilon^{d}}\right) + O\!\left(N^{-1/2}\right) + O\!\left(\frac{L}{F}\right) = O\!\left(\epsilon^{\gamma}\right) + O\!\left(\frac{1}{N \epsilon^{d}}\right), \tag{34}$$

where in the second equality we have used the upper bound on L in Lemma 4.1, together with F = O(N) and the fact that M grows proportionally to N (η = M/N is fixed). Finally, note that a similar argument with the same steps proves the convergence of an analogous estimator of Rényi divergence.
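The first-order inverse-moment approximation for the truncated binomial used in (31), namely that E[1/M'_i | M'_i > 0] equals the reciprocal of the mean up to a 1 + O(1/(Mε^d)) factor [14], can be sanity-checked by simulation. This is our own check, not part of the proof; the parameter values are arbitrary:

```python
import numpy as np

# Monte Carlo check of the first-order inverse-moment approximation
# E[1/X | X > 0] ~ 1/(np) for X ~ Binomial(n, p), as used in (31).
rng = np.random.default_rng(0)
n, p = 100_000, 0.001            # mean count np = 100
draws = rng.binomial(n, p, size=1_000_000)
draws = draws[draws > 0]         # condition on the non-zero event
empirical = np.mean(1.0 / draws)
first_order = 1.0 / (n * p)
print(empirical, first_order)    # agree up to a relative O(1/(np)) term
```

By Jensen's inequality the true inverse moment slightly exceeds 1/(np), and the gap is of order 1/(np), consistent with the correction factor in (31).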

## 5 Discussion and Experiments

In this section we compare and contrast the advantages of the proposed estimator with competing estimators, and provide numerical results. These show the efficiency of our estimator in terms of MSE rate and computational complexity.

Table 1 summarizes the differences between the proposed optimum estimator (EHB) and other competing estimators: Ensemble NNR [9], Ensemble KDE [11] and Mirror KDE [15]. In terms of MSE rate, all of these estimators can achieve the optimal parametric MSE rate of O(1/N). In terms of computational complexity, our estimator has the best runtime compared to the others. The smoothness parameter required for the optimum MSE rate is stated in terms of the number of required derivatives of the density functions. The proposed estimator is the first divergence estimator that requires no extra smoothness at the boundaries. It is also the first divergence estimator that is directly applicable to online settings, retaining both the accuracy and the linear total runtime. Finally, similar to the NNR and Ensemble KDE estimators, the proposed estimator does not require any prior knowledge of the support of the densities.

We next compare the empirical performance of EHB to the NNR and Ensemble KDE estimators. The experiments are done for two different types of f-divergence: the KL-divergence and the α-divergence defined in [16]. Assume that X and Y are i.i.d. samples from independent truncated Gaussian densities. Figure 1 shows the MSE rate of the α-divergence estimate between two truncated Gaussian densities with different expectations and equal variances, for different numbers of samples. For each sample size we repeat the experiment over multiple trials and compute the MSE of each estimator. While all of the estimators have the same asymptotic MSE rate, in practice the proposed estimator performs better. The runtime of this experiment is shown in Figure 2. The runtime experiment confirms the advantage of the EHB estimator compared to the previous estimators in terms of computational complexity.

Figure 3 shows a comparison of the estimators of the KL-divergence between two truncated Gaussian densities with different expectations and equal covariance matrices, in terms of their mean value and confidence band. The confidence bands get narrower for greater values of N, and the EHB estimator has the narrowest confidence band.

In Figure 4 the MSE rates of the three α-divergence estimators are compared in a higher dimension, for two independent truncated Gaussian densities with fixed expectations and covariances, versus different numbers of samples.

## 6 Conclusion

In this paper we proposed a fast hash based estimation method for f-divergence. We obtained bias and variance convergence rates, and validated our results by numerical experiments. Extending the method to hash-based mutual information estimation is a worthwhile topic for future work.

## References

• [1] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.
• [2] K. R. Moon and A. O. Hero, “Ensemble estimation of multivariate f-divergence,” in Information Theory (ISIT), 2014 IEEE International Symposium on, pp. 356–360, IEEE, 2014.
• [3] K. R. Moon, M. Noshad, S. Y. Sekeh, and A. O. Hero III, “Information theoretic structure learning with confidence,” in Proc IEEE Int Conf Acoust Speech Signal Process, 2017.
• [4] A. Rényi, “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 547–561, University of California Press, 1961.
• [5] S. M. Ali and S. D. Silvey, “A general class of coefficients of divergence of one distribution from another,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142, 1966.
• [6] B. Póczos and J. G. Schneider, “On the estimation of alpha-divergences.,” in AISTATS, pp. 609–617, 2011.
• [7] S. Singh and B. Póczos, “Exponential concentration of a density functional estimator,” in Advances in Neural Information Processing Systems, pp. 3032–3040, 2014.
• [8] Q. Wang, S. R. Kulkarni, and S. Verdú, “Divergence estimation for multidimensional densities via-nearest-neighbor distances,” IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2392–2405, 2009.
• [9] M. Noshad, K. R. Moon, S. Y. Sekeh, and A. O. Hero III, “Direct estimation of information divergence using nearest neighbor ratios,” arXiv preprint arXiv:1702.05222, 2017.
• [10] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, et al., “Nonparametric Von Mises estimators for entropies, divergences and mutual informations,” in NIPS, pp. 397–405, 2015.
• [11] K. R. Moon, K. Sricharan, K. Greenewald, and A. O. Hero, “Improving convergence of divergence functional ensemble estimators,” in IEEE International Symposium Inf Theory, pp. 1133–1137, IEEE, 2016.
• [12] Y.-m. Zhang, K. Huang, G. Geng, and C.-l. Liu, “Fast kNN graph construction with locality sensitive hashing,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 660–674, Springer, 2013.
• [13] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe LSH: efficient indexing for high-dimensional similarity search,” in Proceedings of the 33rd international conference on Very large data bases, pp. 950–961, VLDB Endowment, 2007.
• [14] M. Znidaric, “Asymptotic expansion for inverse moments of binomial and Poisson distributions,” arXiv preprint math/0511226, 2005.
• [15] S. Singh and B. Póczos, “Generalized exponential concentration inequality for Renyi divergence estimation.,” in ICML, pp. 333–341, 2014.
• [16] A. Cichocki, H. Lee, Y.-D. Kim, and S. Choi, “Non-negative matrix factorization with α-divergence,” Pattern Recognition Letters, vol. 29, no. 9, pp. 1433–1440, 2008.

## A. Variance Proof

• The proof is based on the Efron-Stein inequality. We follow steps similar to those used to prove the variance of the NNR estimator in [9]. Note that the proof of the variance bound for the ensemble estimator is contained in the variance proof for the base estimator. Assume that we have two sets of nodes X = {X_1, ..., X_N} and Y = {Y_1, ..., Y_M}. Here for simplicity we assume that N = M; the extension of the proof to the case when N and M are not equal is straightforward, by considering a number of virtual points, as in [9]. Define Z as the collection of the N pairs (X_i, Y_i). To apply the Efron-Stein inequality to Z, we consider another independent copy of Z, denoted Z', and define Z^(i) as the collection obtained from Z by replacing the pair (X_i, Y_i) with (X'_i, Y'_i). By applying the Efron-Stein inequality we have

$$\begin{aligned} V\left[\hat{D}_g(Z)\right] &\le \frac{1}{2} \sum_{i=1}^{N} \mathbb{E}\left[\left(\hat{D}_g(Z) - \hat{D}_g(Z^{(i)})\right)^{2}\right] = \frac{N}{2}\, \mathbb{E}\left[\left(\hat{D}_g(Z) - \hat{D}_g(Z^{(1)})\right)^{2}\right] \\ &\le \frac{N}{2}\, \mathbb{E}\left[\left(\frac{1}{N} \sum_{\substack{i \le F \\ M_i > 0}} M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right) - \frac{1}{N} \sum_{\substack{i \le F \\ M_i > 0}} M^{(1)}_i\, \tilde{g}\!\left(\frac{\eta N^{(1)}_i}{M^{(1)}_i}\right)\right)^{2}\right] \\ &= \frac{1}{2N}\, \mathbb{E}\left[\left(\sum_{\substack{i \le F \\ M_i > 0}} \left(M_i\, \tilde{g}\!\left(\frac{\eta N_i}{M_i}\right) - M^{(1)}_i\, \tilde{g}\!\left(\frac{\eta N^{(1)}_i}{M^{(1)}_i}\right)\right)\right)^{2}\right] = \frac{1}{2N}\, O(1) = O\!\left(\frac{1}{N}\right), \end{aligned} \tag{35}$$

where in the last line we used the fact that the bin counts for Z and Z^(1) can differ in at most two of the bins, and each of those terms changes by at most a bounded amount. So the proof is complete.

## B. Proof of Theorem 2.3

Assume that the densities have bounded derivatives up to order q. Then the Taylor expansion of f around x is as follows:

$$f(y) = f(x) + \sum_{0 < |i| \le q} \frac{D^{i} f(x)}{i!}\, (y - x)^{i} + o\!\left(\|y - x\|^{q}\right),$$

where i is a multi-index.