Information theoretic measures such as Shannon entropy, mutual information, and the Kullback-Leibler (KL) divergence have a broad range of applications in information theory, statistics, and machine learning [1, 2, 3]. When we have two or more data sets and we are interested in finding the correlation or dissimilarity between them, Shannon mutual information or KL-divergence is often used. Rényi and f-divergence measures are two well studied generalizations of KL-divergence, which comprise many important divergence measures such as KL-divergence, total variation distance, and α-divergence [4, 5].
Nonparametric estimators are a major class of divergence estimators, for which minimal assumptions on the density functions are made. Some non-parametric divergence estimators are based on density plug-in estimators such as k-NN, KDE, and histogram estimators. A few researchers, on the other hand, have proposed direct estimation methods such as the graph theoretic nearest neighbor ratio (NNR) estimator [9]. In general, plug-in estimation methods suffer from high computational complexity, which makes them unsuitable for large scale applications.
Recent advances in non-parametric divergence estimation have focused on the MSE convergence rates of the estimators. Singh et al. [15] proposed a plug-in KDE estimator for Rényi divergence that achieves the parametric MSE rate of $O(1/N)$ when the densities are at least $d$ times differentiable and the support boundaries are sufficiently smooth. Kandasamy et al. [10] proposed a similar plug-in KDE estimator and extended the optimal MSE rate to densities that are at least $d/2$ times differentiable; however, they ignore a major source of error arising at the support boundaries. Moon et al. [2, 11] proposed a weighted ensemble method to improve the MSE rate of plug-in KDE estimators. Their estimator for f-divergence achieves the optimal MSE rate when the densities are at least $(d+1)/2$ times differentiable, but they also assume stringent smoothness conditions at the support set boundary.
Noshad et al. [9] proposed a graph theoretic direct estimation method based on nearest neighbor ratios (NNR). Their estimator is simple and computationally more tractable than other competing estimators, and it can achieve the optimal MSE rate of $O(1/N)$ for densities that are at least $d$ times differentiable. Although their basic estimator does not require any smoothness assumptions on the support set boundary, the ensemble variant of their estimator does.
In spite of achieving the optimal theoretical MSE rate, the aforementioned estimators face serious practical challenges. The first challenge is the high computational complexity of the estimators: most KDE based estimators require $O(N^2)$ runtime, which is not suitable for large scale applications. The NNR estimator proposed in [9] has runtime complexity of $O(kN)$, which is faster than the previous estimators; however, its ensemble variant requires $k$ to grow sub-linearly with $N$, which results in much higher than linear runtime complexity. The other issue is the smoothness assumptions made on the support set boundary. Almost all previously proposed estimators assume extra smoothness conditions at the boundaries, which may not hold in practical applications. For example, the method proposed in [15] assumes that the density derivatives up to a given order vanish at the boundary, and it requires numerous computations at the support boundary, which become complicated as the dimension increases. The ensemble NNR estimator in [9] likewise assumes that the density derivatives vanish at the boundary. To circumvent this issue, Moon et al. [11] assumed smoothness conditions at the support set boundary; however, these conditions may not hold in practice.
In this paper we propose a low complexity divergence estimator that can achieve the optimal MSE rate of $O(1/N)$ for densities with bounded derivatives up to order $d$. Our estimator has optimal runtime complexity of $O(N)$, which makes it an appropriate tool for large scale applications. Also, in contrast to other competing estimators, our estimator does not require stringent smoothness assumptions on the support set boundary.
The structure of the proposed estimator borrows ideas from hash based methods for KNN search and graph construction problems [12, 13], as well as from the NNR estimator proposed in [9]. The advantage of hash based methods is that they can find approximate nearest neighbor points with lower complexity than exact k-NN search methods. This suggests that fast and accurate algorithms for divergence estimation may be derived from hashing approximations of k-NN search. Noshad et al. [9] consider the k-NN graph of Y in the joint data set, and show that the average exponentiated ratio of the number of X points to the number of Y points among all k-NN points is proportional to the Rényi divergence between the X and Y densities. It turns out that for estimating the density ratio around each point we do not actually need the exact k-NN points; we only need sufficiently many local samples from X and Y around each point. By using a randomized locality sensitive hashing (LSH), we find the close points in Euclidean space. In this manner, applying ideas from NNR estimation and hashing techniques for the KNN search problem, we obtain a more efficient divergence estimator. Consider two sample sets X and Y with a bounded density support. We use a particular two-level locality sensitive random hashing, and consider the ratio of the number of X samples to the number of Y samples in each bin. We prove that a weighted average of these ratios over all of the bins can be made to converge almost surely to the f-divergence between the two sample populations. We also argue that, using the ensemble estimation technique provided in [11], we can achieve the optimal parametric MSE rate of $O(1/N)$. Furthermore, a simple online variant of the algorithm attains $O(N)$ total complexity and the $O(1/N)$ convergence rate, making it the first optimal online estimator of its type.
2 Hash-Based Estimation
In this section, we first introduce the f-divergence measure and propose a hash-based estimator. We outline the main theoretical results which will be proven in section 4.
Consider two density functions $f_1$ and $f_2$ with a common bounded support set $X \subseteq \mathbb{R}^d$. The f-divergence is defined as follows [5]:
$$D_f(f_1 \| f_2) := \int_{X} f\!\left(\frac{f_1(x)}{f_2(x)}\right) f_2(x)\, dx,$$
where $f$ is a smooth and convex function such that $f(1) = 0$. KL-divergence, Hellinger distance, and total variation distance are particular cases of this family. Note that for estimation we do not need the convexity of $f$. We make the following assumptions. Assume that the densities are lower bounded and upper bounded by fixed positive constants. Assume $f_1$ and $f_2$ belong to the Hölder smoothness class with parameter $\gamma$:
Given a support set $X$, a function $f : X \to \mathbb{R}$ is called Hölder continuous with parameter $0 < \gamma \le 1$ if there exists a positive constant $G_f$, possibly depending on $f$, such that
$$|f(y) - f(x)| \le G_f \|y - x\|^{\gamma}$$
for every $x, y \in X$.
The function $f$ in (2) is also assumed to be Lipschitz continuous, i.e. Hölder continuous with $\gamma = 1$.
The $\gamma$-Hölder smoothness family comprises a large class of continuous functions, including continuously differentiable functions and Lipschitz continuous functions. Also note that for $\gamma > 1$, any $\gamma$-Hölder continuous function on a bounded and connected support is constant.
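For intuition, the square root function on $[0,1]$ illustrates the definition: it is Hölder continuous with $\gamma = 1/2$ and constant $G_f = 1$, since $(\sqrt{x} - \sqrt{y})^2 \le |x - y|$, but it is not Lipschitz, because its derivative is unbounded near the origin. A quick numerical check of both facts (a sketch; the helper name `holder_ratio` is ours):

```python
import math
import random

def holder_ratio(f, x, y, gamma):
    """|f(x) - f(y)| / |x - y|^gamma, the quantity bounded by G_f for Holder functions."""
    return abs(f(x) - f(y)) / (abs(x - y) ** gamma)

random.seed(0)
pairs = [(random.random(), random.random()) for _ in range(10000)]
pairs = [(x, y) for x, y in pairs if x != y]

# sqrt is 1/2-Holder on [0,1] with constant G_f = 1: (sqrt(x) - sqrt(y))^2 <= |x - y|.
max_half = max(holder_ratio(math.sqrt, x, y, 0.5) for x, y in pairs)
# But it is not Lipschitz (gamma = 1): the ratio explodes near the origin.
lip_near_zero = holder_ratio(math.sqrt, 1e-12, 0.0, 1.0)

print(max_half <= 1.0)      # True: the 1/2-Holder constant 1 is respected
print(lip_near_zero > 1e5)  # True: no finite Lipschitz constant works
```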
Consider the i.i.d. samples $X = \{X_1, \ldots, X_N\}$ drawn from $f_1$ and $Y = \{Y_1, \ldots, Y_M\}$ drawn from $f_2$. Define the fraction $\eta := M/N$. We define the set $Z := X \cup Y$. We define a positive real valued constant $\epsilon$ as a user-selectable parameter of the estimator to be defined in (5). We define the hash function $H_1 : \mathbb{R}^d \to \mathbb{Z}^d$ as
where $x_i$ is the projection of $x$ onto the $i$th coordinate, and $h_1 : \mathbb{R} \to \mathbb{Z}$ is defined as
$$h_1(t) := \left\lfloor \frac{t + b}{\epsilon} \right\rfloor$$
for fixed $b$. Let $F := cN$, where $c$ is a fixed real number. We define a random hash function $H_2 : \mathbb{Z}^d \to \{1, \ldots, F\}$ with a uniform density on the output and consider the combined hashing $H(x) := H_2(H_1(x))$, which maps the points in $\mathbb{R}^d$ to $\{1, \ldots, F\}$.
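A minimal sketch of this two-level hashing in Python follows; `eps` and `b` play the roles of the quantization width and the fixed shift, and the second-level hash $H_2$ is emulated with a salted built-in hash, which is an illustrative stand-in rather than the paper's exact construction:

```python
import math
import random

def H1(x, eps, b):
    """First-level grid hash: quantize each coordinate, h1(t) = floor((t + b) / eps)."""
    return tuple(math.floor((t + b) / eps) for t in x)

def make_H2(F, seed=0):
    """Second-level random hash mapping grid cells roughly uniformly into {0, ..., F-1}."""
    rng = random.Random(seed)
    salt = rng.getrandbits(64)
    return lambda cell: hash((salt,) + cell) % F

eps, b, F = 0.5, 0.0, 64
H2 = make_H2(F)
H = lambda x: H2(H1(x, eps, b))   # combined hash H = H2 o H1

# Points in the same eps-cell collide under H1 (and hence under H).
assert H1((0.1, 0.2), eps, b) == H1((0.3, 0.4), eps, b)
# Points separated by more than eps in some coordinate land in different H1 cells.
assert H1((0.1, 0.2), eps, b) != H1((0.9, 0.2), eps, b)
```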
Consider the mappings of the sets $X$ and $Y$ under the hash function $H$, and define the vectors $N := (N_1, \ldots, N_F)$ and $M := (M_1, \ldots, M_F)$ to respectively contain the number of collisions for each output bucket from the sets $X$ and $Y$. We represent the bins of these vectors by $N_i$ and $M_i$, $1 \le i \le F$.
The hash based f-divergence estimator is defined as
$$\widehat{D}_f(X, Y) := \frac{1}{M} \sum_{i:\, M_i > 0} M_i\, f\!\left(\frac{\eta N_i}{M_i}\right). \qquad (5)$$
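The estimator can be sketched in a few lines for the special case of the KL-divergence, for which $f(t) = t \log t$. The direct use of the grid cells of $H_1$ as buckets (skipping the second-level hash) and the parameter choices below are simplifying assumptions for illustration:

```python
import math
import random
from collections import Counter

def hb_divergence(X, Y, eps, f=lambda t: t * math.log(t)):
    """Hash based f-divergence estimate: bin both samples on an eps-grid and
    average f(eta * N_i / M_i) over the Y samples, where N_i and M_i are
    the per-bin counts of X and Y points."""
    h = lambda x: tuple(math.floor(t / eps) for t in x)
    Nc = Counter(h(x) for x in X)   # N_i: X-sample count per bin
    Mc = Counter(h(y) for y in Y)   # M_i: Y-sample count per bin
    N, M = len(X), len(Y)
    eta = M / N
    total = 0.0
    for cell, Mi in Mc.items():
        Ni = Nc.get(cell, 0)
        if Ni > 0:                  # f(t) = t log t vanishes as t -> 0, so skip empty ratios
            total += Mi * f(eta * Ni / Mi)
    return total / M

# Identical uniform densities on [0,1]: the estimate should be close to 0.
random.seed(1)
X = [(random.random(),) for _ in range(20000)]
Y = [(random.random(),) for _ in range(20000)]
est = hb_divergence(X, Y, eps=0.1)
```

With equal densities every per-bin ratio concentrates around 1, and since $f(1) = 0$ the estimate tends to zero as the sample size grows.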
Note that if the densities $f_1$ and $f_2$ are almost equal, then for each bin $i$, $\eta N_i / M_i \approx 1$, and thus $f(\eta N_i / M_i)$ and the estimate $\widehat{D}_f(X, Y)$ tend to zero, as required. In the following theorems we state upper bounds on the bias and variance rates. Let $\mathbb{B}[\widehat{D}]$ and $\mathbb{V}[\widehat{D}]$ respectively represent the bias and variance of $\widehat{D}$, an estimator of the parameter $D$. Then the following theorem provides a bound on the bias of the proposed estimator.
Assume that $f_1$ and $f_2$ are density functions with a common bounded support set $X$ satisfying the $\gamma$-Hölder smoothness condition. The bias of the proposed estimator for the f-divergence with function $f$ can be bounded as
$$\mathbb{B}\left[\widehat{D}_f(X, Y)\right] \le O\left(\epsilon^{\gamma}\right) + O\left(\frac{1}{N\epsilon^{d}}\right),$$
where the implied constants are positive reals.
In order for the estimator to be asymptotically unbiased, $\epsilon$ needs to be a function of $N$ that tends to zero while $N\epsilon^{d} \to \infty$. The optimum bias rate of $O\big(N^{-\gamma/(\gamma+d)}\big)$ can be achieved for $\epsilon = O\big(N^{-1/(\gamma+d)}\big)$.
Let $\epsilon$ be fixed. The variance of the estimator (5) can be bounded as
$$\mathbb{V}\left[\widehat{D}_f(X, Y)\right] \le O\left(\frac{1}{N}\right).$$
The same variance bound holds for the underlying random variable, and the bias and variance results easily extend to Rényi divergence estimation.
We next show that, when $f_1$ and $f_2$ belong to a family of differentiable densities, we can improve the bias rate by applying the ensemble estimation approach of [11, 3]. The Ensemble Hash-Based (EHB) estimator is defined as follows.
Assume that the density functions are in the Hölder class with smoothness parameter $q$, which consists of functions with continuous derivatives up to order $q$ whose $q$th partial derivatives are Hölder continuous. Let $T := \{t_1, \ldots, t_T\}$ be a set of index values with $t_i < c$, where $c$ is a constant, and let $\epsilon(t) := t N^{-1/2d}$. The weighted ensemble estimator is defined as
$$\widehat{D}_w := \sum_{t \in T} w(t)\, \widehat{D}_{\epsilon(t)},$$
where $\widehat{D}_{\epsilon(t)}$ is the hash based estimator of the f-divergence with hashing parameter $\epsilon(t)$.
Let $|T| > d$ and let $w_0$ be the solution to
$$\min_{w}\ \|w\|_2 \quad \text{subject to} \quad \sum_{t \in T} w(t) = 1, \qquad \sum_{t \in T} w(t)\, t^{i} = 0,\ \ i \in \{1, \ldots, d-1\}.$$
Then the MSE rate of the ensemble estimator $\widehat{D}_{w_0}$ is $O(1/N)$.
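The weight optimization in the theorem is a minimum-norm problem under linear constraints and has the closed form solution $w = A^{\top}(AA^{\top})^{-1}b$. The sketch below assumes the constraint set $\sum_t w(t) = 1$ and $\sum_t w(t)\,t^i = 0$ for $i = 1, \ldots, d-1$, in the spirit of the ensemble method of [11]; the index set `ts` is hypothetical:

```python
import numpy as np

def ensemble_weights(ts, d):
    """Minimum-L2-norm weights w with sum(w) = 1 and sum(w * t^i) = 0 for
    i = 1..d-1, so that the low-order bias terms of the ensemble members cancel."""
    ts = np.asarray(ts, dtype=float)
    # Constraint matrix A: row i is (t^i for t in ts); first row enforces sum(w) = 1.
    A = np.vstack([ts ** i for i in range(d)])   # shape (d, |T|)
    b = np.zeros(d)
    b[0] = 1.0
    # Minimum-norm solution of the underdetermined system A w = b.
    return A.T @ np.linalg.solve(A @ A.T, b)

ts = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]   # hypothetical index set T, |T| > d
w = ensemble_weights(ts, d=3)
# By construction the weights sum to 1 and cancel the t and t^2 bias terms.
```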
3 Online Divergence Estimation
In this section we study the problem of online divergence estimation. In this setting we consider two data streams $X$ and $Y$ of i.i.d samples, and we are interested in estimating the divergence between the two data sets. The number of samples increases over time, and an efficient update of the divergence estimate is desired. A batch update, which uses the entire data set to recompute the estimate at each time point, costs $O(N)$ per step, and it may not be effective in cases in which we need quick detection of changes in the divergence function.
Algorithm 2 updates the divergence with constant amortized runtime per new sample. Define the sets $X$ and $Y$, the numbers $N_i$ and $M_i$ of $X$ and $Y$ samples in each partition, and the divergence estimate between $X$ and $Y$. Consider updating the estimator with new samples $x$ and $y$. In the first and second lines of Algorithm 2, the new samples are added to the datasets, updating the values of $N_i$ and $M_i$ for the bins in which the new samples fall. We can find these bins in $O(1)$ using a simple hashing. Note that once $N_i$ and $M_i$ are updated, the divergence measure can be updated; however, since the number of bins is not increased, by Theorem 2.1 it is clear that the bias will not be reduced. Since increasing the number of bins requires recomputing the bin partitions, a brute force rebinning approach would have order $O(N)$ complexity, and if it were applied at every update, the total complexity would be $O(N^2)$. Here we use a trick and update the hash function only when $N$ is a power of $2$. In the following theorem, which is proved in the appendix, we show that the MSE rate of this algorithm is of order $O(1/N)$ and the total rebinning computational complexity is of order $O(N)$.
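The update scheme described above can be sketched as follows; the class and parameter names are ours, and the divergence evaluation itself is omitted to keep the focus on the amortized rebinning (the hash parameter is halved only when the stream length reaches a power of two):

```python
import math
import random
from collections import Counter

class OnlineHB:
    """Maintain eps-grid bin counts for two streams; rehash only when the
    stream length doubles, so the total rebinning work stays O(N)."""

    def __init__(self, eps0=1.0):
        self.eps = eps0
        self.X, self.Y = [], []
        self.Nc, self.Mc = Counter(), Counter()

    def _bin(self, x):
        return tuple(math.floor(t / self.eps) for t in x)

    def add(self, x, y):
        self.X.append(x)
        self.Y.append(y)
        self.Nc[self._bin(x)] += 1          # O(1) incremental count update
        self.Mc[self._bin(y)] += 1
        n = len(self.X)
        if (n & (n - 1)) == 0:              # n is a power of 2: shrink bins, rehash
            self.eps /= 2.0
            self.Nc = Counter(self._bin(p) for p in self.X)
            self.Mc = Counter(self._bin(p) for p in self.Y)

random.seed(2)
est = OnlineHB()
for _ in range(256):
    est.add((random.random(),), (random.random(),))

# Sanity check: incremental counts match a from-scratch rebinning at the current eps.
fresh = Counter(est._bin(p) for p in est.X)
```

Because rebinning at stream length $n$ costs $O(n)$ and happens only at $n = 1, 2, 4, \ldots, N$, the total rebinning cost is the geometric sum $O(1 + 2 + 4 + \cdots + N) = O(N)$.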
The MSE rate of the online divergence estimator shown in Algorithm 2 is of order $O(1/N)$ and the total computational complexity is of order $O(N)$.
4 Bias Proof
In this section we derive the bias bound for densities in the Hölder smoothness class, as stated in Theorem 2.1. For the proofs of the variance bound in Theorem 2.2, the convergence rate of the EHB estimator in Theorem 2.3, and the online divergence estimator in Theorem 3.1, we refer the reader to the Appendix, provided as a supplementary PDF file.
Consider the mapping of the $X$ and $Y$ points by the hash function $H_1$, and let the vectors $v_1, \ldots, v_F$ represent the distinct mappings of the $X$ and $Y$ points under $H_1$. Here $F$ is the number of distinct outputs of $H_1$. In the following lemma we prove an upper bound on $F$.
Let $f$ be a density function with bounded support $X \subseteq \mathbb{R}^d$. If $F$ denotes the number of distinct outputs of the hash function $H_1$ (defined in (3)) applied to $N$ i.i.d points with density $f$, then we have
$$F \le O\left(\epsilon^{-d}\right).$$
Let $\delta > 0$, and define $S$ as the region
where is a constant such that .
$F$ is clearly not greater than the total number of bins created by splitting this region into partitions of volume $\epsilon^d$. So we have
$$F \le O\left(\epsilon^{-d}\right).$$
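The lemma is easy to check empirically: for samples from a density supported on $[0,1]^d$, the number of occupied grid cells never exceeds the $O(\epsilon^{-d})$ cells needed to cover the support, regardless of the sample size. A quick sketch (parameter values are arbitrary):

```python
import math
import random

random.seed(4)
d, eps, N = 2, 0.1, 100000
points = [(random.random(), random.random()) for _ in range(N)]
# Distinct outputs of the grid hash H1 over the sample.
occupied = {tuple(math.floor(t / eps) for t in p) for p in points}

# The support [0,1]^2 is covered by ceil(1/eps)^d grid cells, an O(eps^{-d}) cap
# on the number of distinct hash outputs, independent of N.
cells_covering_support = math.ceil(1 / eps) ** d
```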
Proof of Theorem 2.1. Let $N_i$ and $M_i$ respectively denote the number of collisions of the $X$ and $Y$ points in bin $i$ under the hash function $H$. Let $E_i$ stand for the event that there is no collision in bin $i$ for the hash function $H_2$ with inputs $v_1, \ldots, v_F$. We have
We represent the second term in (4) by $B_H$, which has the interpretation of the bias error due to collisions in hashing. Recall that the complement $\overline{E_i}$ is the event that there is a collision at bin $i$ for the hash function $H_2$. For proving an upper bound on $B_H$, we first need to compute an upper bound on the collision probability. This is stated in the following lemma.
Define . For each we can rewrite as
where and can be derived as
where in the third line we used the stated bounds. Now in the following lemma we prove a bound on the bias due to hashing collisions.
Let $F$ denote the number of distinct outputs of the hash function $H_1$ over the $X$ and $Y$ sample points. The bias of the estimator (5) due to hashing collisions can be upper bounded as
From the definition of we can write
where in the second line we used the fact that . In the third line we used the upper bound for , and in the fourth line we used the result in equation (4).
Now we are ready to continue the proof of the bias bound in (4). Let $E$ be defined as the event that there is no collision for the hash function $H_2$ and all of its outputs are distinct; that is,
Equation (4) can then be written as
where in (22) we have used the fact that, conditioned on $E$, the counts in bin $i$ are independent of those in the other bins. In (23), since there is no collision under $E$, the bin counts $N_i$ and $M_i$ equal the corresponding grid-cell counts. Equation (24) holds because the values of the counts are independent of the hash function and its outputs, and finally in equation (25) we used the fact that the samples in each of the sets $X$ and $Y$ are i.i.d random variables.
$N_i$ and $M_i$ are independent and have binomial distributions with respective means $N p_i$ and $M q_i$, where $p_i$ and $q_i$ are the probabilities of mapping points with the respective densities $f_1$ and $f_2$ into bin $i$. Hence,
Let $B_i$ denote the region whose points all map to the same vector $v_i$. Then $p_i$ can be written as:
where in the second equality we have used the definition in (2). Now define
Note that is a constant independent of , since the volume of is independent of . By defining we can write
Also note that since the numbers of $X$ and $Y$ points in each bin are independent, the expectation factorizes, and therefore
Next, note that $M_i$ restricted to $M_i > 0$ has a non-zero binomial distribution, for which the first order inverse moment can be written as [14]:
$$\mathbb{E}\left[\frac{1}{M_i}\,\middle|\, M_i > 0\right] = \frac{1}{M q_i}\left(1 + O\left(\frac{1}{M q_i}\right)\right).$$
Thus, (26) can be simplified as
Finally from (25) we get
where in the second equation we have used the upper bound on $F$ from Lemma 4.1 and the fact that the densities are bounded. Note that a similar method with the same steps proves the convergence of the corresponding estimator for Rényi divergence.
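The first order inverse moment expansion for a binomial random variable [14], used in the proof above, can be checked by simulation; the parameter values below are arbitrary:

```python
import random

random.seed(3)
n, p = 500, 0.2                      # np = 100, in the regime where the expansion applies
reps = 4000

samples = []
for _ in range(reps):
    x = sum(1 for _ in range(n) if random.random() < p)
    if x > 0:                        # condition on X > 0 (P(X = 0) is negligible here)
        samples.append(1.0 / x)

mc_inverse_moment = sum(samples) / len(samples)
first_order = 1.0 / (n * p)          # the leading term 1/(np)
# The Monte Carlo estimate sits slightly above 1/(np); the small excess is the
# O(1/(np)^2) correction term of the expansion.
```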
5 Discussion and Experiments
In this section we compare and contrast the advantages of the proposed estimator with competing estimators, and provide numerical results, which show the efficiency of our estimator in terms of MSE rate and computational complexity.
| Estimator | HB | NNR | Ensemble KDE | Mirror KDE |
|---|---|---|---|---|
| Required Smoothness ($\gamma$) | $d$ | $d$ | $(d+1)/2$ | $d$ |
| Extra Smooth Boundaries | No | Yes | Yes | Yes |
| Knowledge about Boundary | No | No | No | Yes |
Table 1 summarizes the differences between the proposed optimum estimator (EHB) and other competing estimators: Ensemble NNR [9], Ensemble KDE [11], and Mirror KDE [15]. In terms of MSE rate, all of these estimators can achieve the optimal parametric MSE rate of $O(1/N)$. In terms of computational complexity, our estimator has the best runtime compared to the others. The smoothness parameter required for the optimum MSE rate is stated in terms of the number of required derivatives of the density functions. The proposed estimator is the first divergence estimator that requires no extra smoothness at the boundaries. It is also the first divergence estimator that is directly applicable to online settings, retaining both the accuracy and the linear total runtime. Finally, similar to the NNR and Ensemble KDE estimators, the proposed estimator does not require any prior knowledge of the support of the densities.
We next compare the empirical performance of EHB to the NNR and Ensemble KDE estimators. The experiments are done for two different types of f-divergence: the KL-divergence and the α-divergence defined in [16]. Assume that $X$ and $Y$ are i.i.d. samples from independent truncated Gaussian densities. Figure 1 shows the MSE estimation rate of the α-divergence between two Gaussian densities with different expectations and equal variances, for different numbers of samples. For each sample size we repeat the experiment multiple times and compute the MSE of each estimator. While all of the estimators have the same asymptotic MSE rate, in practice the proposed estimator performs better. The runtime of this experiment is shown in Figure 2; it confirms the advantage of the EHB estimator over the previous estimators in terms of computational complexity.
Figure 3 shows a comparison of the estimators of the KL-divergence between two truncated Gaussian densities with different expectations and equal covariance matrices, in terms of their mean value and confidence band. The confidence band gets narrower for greater values of $N$, and the EHB estimator has the narrowest confidence band.
In Figure 4 the MSE rates of the three α-divergence estimators are compared in a higher dimension, for two independent truncated Gaussian densities, versus the number of samples.
6 Conclusion
In this paper we proposed a fast hash based estimation method for f-divergence. We obtained bias and variance convergence rates, and validated our results by numerical experiments. Extending the method to hash-based mutual information estimation is a worthwhile topic for future work.
[1] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[2] K. R. Moon and A. O. Hero, "Ensemble estimation of multivariate f-divergence," in Information Theory (ISIT), 2014 IEEE International Symposium on, pp. 356–360, IEEE, 2014.
[3] K. R. Moon, M. Noshad, S. Y. Sekeh, and A. O. Hero III, "Information theoretic structure learning with confidence," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2017.
[4] A. Rényi, "On measures of entropy and information," in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 547–561, University of California Press, 1961.
[5] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, Series B (Methodological), pp. 131–142, 1966.
[6] B. Póczos and J. G. Schneider, "On the estimation of alpha-divergences," in AISTATS, pp. 609–617, 2011.
[7] S. Singh and B. Póczos, "Exponential concentration of a density functional estimator," in Advances in Neural Information Processing Systems, pp. 3032–3040, 2014.
[8] Q. Wang, S. R. Kulkarni, and S. Verdú, "Divergence estimation for multidimensional densities via k-nearest-neighbor distances," IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2392–2405, 2009.
[9] M. Noshad, K. R. Moon, S. Y. Sekeh, and A. O. Hero III, "Direct estimation of information divergence using nearest neighbor ratios," arXiv preprint arXiv:1702.05222, 2017.
[10] K. Kandasamy, A. Krishnamurthy, B. Póczos, L. Wasserman, et al., "Nonparametric von Mises estimators for entropies, divergences and mutual informations," in NIPS, pp. 397–405, 2015.
[11] K. R. Moon, K. Sricharan, K. Greenewald, and A. O. Hero, "Improving convergence of divergence functional ensemble estimators," in IEEE International Symposium on Information Theory, pp. 1133–1137, IEEE, 2016.
[12] Y.-M. Zhang, K. Huang, G. Geng, and C.-L. Liu, "Fast kNN graph construction with locality sensitive hashing," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 660–674, Springer, 2013.
[13] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: efficient indexing for high-dimensional similarity search," in Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 950–961, VLDB Endowment, 2007.
[14] M. Znidaric, "Asymptotic expansion for inverse moments of binomial and Poisson distributions," arXiv preprint math/0511226, 2005.
[15] S. Singh and B. Póczos, "Generalized exponential concentration inequality for Rényi divergence estimation," in ICML, pp. 333–341, 2014.
[16] A. Cichocki, H. Lee, Y.-D. Kim, and S. Choi, "Non-negative matrix factorization with α-divergence," Pattern Recognition Letters, vol. 29, no. 9, pp. 1433–1440, 2008.
A. Variance Proof
The proof is based on the Efron–Stein inequality. We follow steps similar to those used to prove the variance of the NNR estimator in [9]. Note that the variance proof for the auxiliary random variable is contained in the variance proof for the estimator. Assume that we have two sets of nodes $X = \{X_1, \ldots, X_N\}$ and $Y = \{Y_1, \ldots, Y_M\}$. Here for simplicity we assume that $N = M$; however, the extension of the proof to the case when $N$ and $M$ are not equal is straightforward, by considering a number of virtual points, as in [9]. Define $Z := (X, Y)$. To use the Efron–Stein inequality on $Z$, we consider another independent copy $Z'$ of $Z$, and define $Z^{(i)}$ as $Z$ with its $i$th sample replaced by the corresponding sample of $Z'$. By applying the Efron–Stein inequality we have
where in the last line we used the fact that the bin counts of $Z$ and $Z^{(i)}$ can differ in at most two of the bins, and each such difference changes the estimate by at most $O(1/M)$. So the proof is complete.
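For reference, the Efron–Stein inequality invoked above states that for a function $Z = g(Z_1, \ldots, Z_{2N})$ of independent random variables, with $Z^{(i)}$ denoting $Z$ recomputed after resampling the $i$th argument from an independent copy,

```latex
\operatorname{Var}(Z) \;\le\; \frac{1}{2} \sum_{i=1}^{2N} \mathbb{E}\left[ \left( Z - Z^{(i)} \right)^{2} \right].
```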
B. Proof of Theorem 2.3
Assume that the densities have bounded derivatives up to order $q$. Then the Taylor expansion of the density around a given point is as follows.
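In multi-index notation, a standard Taylor expansion of this form (the exact remainder form shown is our assumption, since the original display is missing) is

```latex
f(y) \;=\; \sum_{|\alpha| < q} \frac{D^{\alpha} f(x)}{\alpha!}\,(y - x)^{\alpha} \;+\; O\left(\|y - x\|^{q}\right).
```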