I Introduction
In data science and modeling, it is often necessary to test whether two random variables are independent. Of the several measures that quantify dependencies between random variables [1, 2, 3, 4], the most widely used are mutual information, Pearson’s correlation, and distance correlation.
Mutual information is generally regarded as a benchmark for quantifying dependencies between random variables; however, it can only be computed by first estimating the joint and marginal probability density functions (pdfs). Pearson’s correlation can be estimated directly from data, but it does not capture nonlinear dependencies. Distance correlation can also be estimated directly from data and can capture nonlinear dependencies, but it is in general slow to compute (its standard estimator has a cost quadratic in the number of samples). Further, distance correlation often does not reflect nonlinear dependencies correctly in the sense of Rényi’s axioms [5], which were slightly improved upon by Granger, Maasoumi and Racine [6] (see Table I). Specifically, distance correlation is not invariant under strictly monotonic transformations (6th axiom in Table I).
An “ideal measure” should satisfy the axioms given in Table I and should be directly estimable from the data. A less popular and unnamed measure uses the Bhattacharyya distance between the joint pdf and the product of the marginals as a measure of dependence between two random variables [7, 8]. It has been shown that this measure satisfies all six axioms. Importantly, this measure is invariant under continuous and strictly increasing transformations [9, 6]. It is also closely related to mutual information, k-class entropy and copulas [10, 11, 12]. In this paper, we call this measure mutual dependence.
#  Property  
1  ✓  ✓  ✓  ✓  
2  iff the two random variables are independent  ✓  ✓  ✓  
3  ✓  ✓  
4  if there is a strict dependence between the two random variables  ✓  ✓  
5  if the joint distribution of the two random variables is normal  ✓  ✓  ✓  ✓  
6  ✓  ✓  
Mutual dependence has not been widely used because, like mutual information, it requires nonparametric density estimation of the marginal and joint pdfs, which are then substituted into the theoretical measure and numerically integrated to yield estimates. This process is both computationally expensive and inaccurate. In this paper, we develop an estimator that computes mutual dependence directly from the data. It uses our recently proposed Band-Limited Maximum Likelihood (BLML) estimator, which maximizes the data likelihood function over a set of bandlimited pdfs with a known cutoff frequency. The BLML estimator is consistent, efficiently computable, and results in a smooth pdf [13]. It also has a faster rate of convergence and lower computational complexity than other widely used nonparametric methods such as kernel density estimators. Moreover, if the BLML estimator is substituted into the expression for mutual dependence (see (5)), the mutual dependence can be computed directly from the data without numerical integration, which is often inaccurate and inefficient. We show through simulations that the mutual dependence estimator converges faster than the estimators of Pearson’s and distance correlation for various data sets with different types of linear and nonlinear dependencies, and that its convergence rate is maintained across different types of nonlinearity. Mutual dependence is also faster to compute than distance correlation: its time complexity is governed by the number of bins containing a nonzero number of samples, which is always less than or equal to the number of data samples.
The paper is organized as follows. Section II discusses how the different measures vary as a function of mutual information and nonlinearity. Section III introduces the notion of mutual dependence and its estimator. Section IV uses simulation to compare the convergence of mutual dependence with that of Pearson’s and distance correlation for different nonlinear dependencies and marginal pdfs. We end the paper with conclusions and future work in Section V.
II A motivating example
Consider two random variables whose dependence is controlled by a nonlinearity and a ‘spread’ parameter. The underlying variables follow either a bandlimited pdf or a normal pdf, and the nonlinearity is one of four types of dependence (linear, quadratic, cubic, or sinusoidal). The ‘spread’ is varied to obtain different degrees of dependence. Figure 1 illustrates the data generated in this example.
The goal of a dependency measure is to quantify the dependence between the two variables given the data. When the underlying pdfs are known, this dependence is captured well by mutual information.
Therefore, in Figure 2, we plot the theoretical values of Pearson’s and distance correlation as a function of mutual information for the four nonlinearity types and the two generating pdfs.

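For random variables $X$ and $Y$ with joint pdf $f_{XY}$ and marginal pdfs $f_X$ and $f_Y$, the three measures are defined as follows.
Mutual information
$$I(X;Y) = \iint f_{XY}(x,y)\,\log\frac{f_{XY}(x,y)}{f_X(x)\,f_Y(y)}\,dx\,dy \tag{2}$$
Pearson’s correlation
$$\rho(X,Y) = \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y} \tag{3}$$
Distance correlation
$$\mathcal{R}^2(X,Y) = \frac{\mathcal{V}^2(X,Y)}{\sqrt{\mathcal{V}^2(X,X)\,\mathcal{V}^2(Y,Y)}},\qquad \mathcal{V}^2(X,Y) = \int_{\mathbb{R}^{p+q}} \frac{\left|\phi_{XY}(t,s) - \phi_X(t)\,\phi_Y(s)\right|^2}{|t|^{1+p}\,|s|^{1+q}}\,dt\,ds \tag{4}$$
Here $\phi_{XY}$, $\phi_X$, and $\phi_Y$ are the respective characteristic functions, and $p$ and $q$ are the dimensions of $X$ and $Y$. For details see [3]. (Note that we have eliminated the constants from the definition of $\mathcal{V}$, as they are not needed to define $\mathcal{R}$.)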
For a given value of mutual information, both Pearson’s and distance correlation depend strongly on the type of nonlinearity. This variability arises because, unlike mutual information, neither correlation measure is invariant to strictly monotonic transformations. Therefore, changing the type of nonlinearity changes the values of both Pearson’s and distance correlation, while the mutual information remains the same. Such variation is undesirable, as it may lead to incorrect inferences when comparing dependencies between data sets with different types of nonlinear dependence. A measure that is invariant to strictly monotonic transformations is therefore desirable.
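The lack of invariance can be illustrated numerically for Pearson’s correlation. The sketch below is not taken from the paper: the data model (a linear relation with additive Gaussian noise), the sample size, and the two strictly increasing transformations (cube and exponential) are assumptions chosen for illustration. Because the transformations are strictly increasing, they leave the dependence structure unchanged, yet the sample correlation drops noticeably.

```python
import numpy as np

def pearson(x, y):
    """Sample Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
n = 5000                                   # illustrative sample size (assumption)
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)           # linear dependence plus noise (assumption)

# Strictly increasing transformations of y do not change which pairs
# (x, y) occur together, but they do change Pearson's correlation.
r_linear = pearson(x, y)
r_cubed = pearson(x, y ** 3)
r_exp = pearson(x, np.exp(y))

print(f"identity: {r_linear:.3f}  cube: {r_cubed:.3f}  exp: {r_exp:.3f}")
```

A measure satisfying the 6th axiom would return the same value in all three cases.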
III Mutual dependence and its estimation
In this section, we introduce mutual dependence, which is based on an existing unnamed measure, and show several of its properties. Then, we derive an estimator of mutual dependence that is computed directly from data generated from bandlimited pdfs. Finally, we describe efficient algorithms to compute this estimator.
III-A Mutual dependence
Consider two random variables $X$ and $Y$, their joint distribution $f_{XY}$, and their marginal distributions $f_X$ and $f_Y$. These random variables are independent if and only if $f_{XY} = f_X f_Y$. It is therefore natural to measure dependence as the distance (in the space of pdfs) between the joint distribution and the product of the marginal distributions. A good candidate is the Bhattacharyya distance (closely related to the Hellinger distance). See [6, 9] for details.
Definition 1
The mutual dependence between two random variables $X$ and $Y$ is defined as the Bhattacharyya distance between their joint distribution $f_{XY}$ and the product of their marginal distributions $f_X$ and $f_Y$, that is,
(5) 
with
(6) 
We call this measure ‘mutual dependence’ as it represents mutual information most closely. For a given value of mutual information, the value of mutual dependence remains almost the same irrespective of the nonlinearity type, which is not true for Pearson’s and distance correlation measures.
III-B Properties of mutual dependence
By the symmetry of its definition, mutual dependence is symmetric in its two arguments. When the two random variables are partially dependent, the measure takes an intermediate value that quantifies the degree of dependence between them. In the extreme cases, it attains its lower bound if and only if the two variables are independent, and its upper bound if either variable is a Borel-measurable function of the other. It can also be easily established that mutual dependence is invariant under strictly monotonic transformations of either variable. A detailed description of these properties can be found in [6, 9].
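The invariance property can be checked with a short change-of-variables argument. The derivation below is a sketch under two assumptions that are not stated in the text: that the measure depends on the pdfs only through the Bhattacharyya coefficient $\iint \sqrt{f_{XY}\,f_X\,f_Y}$, and that the strictly monotonic transformations $g$ and $h$ are also differentiable. The symbols $U$, $V$, $g$, and $h$ are notation introduced here, with $x = g^{-1}(u)$ and $y = h^{-1}(v)$.

```latex
\begin{align*}
U &= g(X), \qquad V = h(Y), \\
f_{UV}(u,v) &= \frac{f_{XY}(x,y)}{|g'(x)|\,|h'(y)|}, \qquad
f_U(u) = \frac{f_X(x)}{|g'(x)|}, \qquad
f_V(v) = \frac{f_Y(y)}{|h'(y)|}, \\
\iint \sqrt{f_{UV}(u,v)\,f_U(u)\,f_V(v)}\;du\,dv
&= \iint \frac{\sqrt{f_{XY}(x,y)\,f_X(x)\,f_Y(y)}}{|g'(x)|\,|h'(y)|}\;|g'(x)|\,|h'(y)|\;dx\,dy \\
&= \iint \sqrt{f_{XY}(x,y)\,f_X(x)\,f_Y(y)}\;dx\,dy .
\end{align*}
```

The Jacobian factors cancel exactly, so any quantity built from this coefficient takes the same value before and after the transformation.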
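For jointly normal data, the mutual dependence can be computed by first calculating the Bhattacharyya distance between two multivariate Gaussian distributions [14]
$$D_B = \frac{1}{8}(\mu_1-\mu_2)^{T}\,\Sigma^{-1}(\mu_1-\mu_2) + \frac{1}{2}\ln\frac{\det\Sigma}{\sqrt{\det\Sigma_1\,\det\Sigma_2}},\qquad \Sigma = \frac{\Sigma_1+\Sigma_2}{2} \tag{7}$$
where $\mu_1$ and $\mu_2$ are the mean vectors and $\Sigma_1$ and $\Sigma_2$ are the covariance matrices. Then substituting the parameters of the joint distribution and of the product of the marginals gives
(8) 
This shows that mutual dependence satisfies axiom 5 (see Table I).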
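As a numerical sanity check of the jointly normal case, the sketch below evaluates the Bhattacharyya coefficient between a standard bivariate normal with correlation `rho` and the product of its N(0, 1) marginals, once in closed form and once by a Riemann sum on a grid. It is an illustration, not the paper's equation (8): unit variances and zero means are assumed, the function names and grid settings are hypothetical, and the closed form follows from the standard Gaussian Bhattacharyya distance with equal means. The point is only that the coefficient depends on the data through the correlation alone.

```python
import numpy as np

def bc_closed_form(rho):
    """Bhattacharyya coefficient between a standard bivariate normal with
    correlation rho and the product of its N(0, 1) marginals."""
    return (1.0 - rho ** 2) ** 0.25 / np.sqrt(1.0 - rho ** 2 / 4.0)

def bc_numeric(rho, half_width=8.0, m=801):
    """The same coefficient by a Riemann sum on a square grid."""
    g = np.linspace(-half_width, half_width, m)
    dx = g[1] - g[0]
    x, y = np.meshgrid(g, g, indexing="ij")
    joint = np.exp(-(x ** 2 - 2 * rho * x * y + y ** 2) / (2 * (1 - rho ** 2)))
    joint /= 2 * np.pi * np.sqrt(1 - rho ** 2)
    product = np.exp(-(x ** 2 + y ** 2) / 2) / (2 * np.pi)
    return float(np.sum(np.sqrt(joint * product)) * dx * dx)

for rho in (0.0, 0.5, 0.9):
    print(rho, bc_closed_form(rho), bc_numeric(rho))
```

The coefficient equals one for independent variables and decreases as the magnitude of the correlation grows.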
III-C Estimation of mutual dependence
To estimate mutual dependence, we use the BLML method [13], which maximizes the likelihood of the observed data samples over the set of bandlimited pdfs. The BLML estimator has been shown to outperform kernel density estimators (KDE) in both convergence rate and computational time, and hence provides a better alternative for nonparametric estimation of pdfs. In addition, the structure of the BLML estimator is well suited for evaluating the integral in (5), resulting in an estimate that is a direct function of the observed data and hence avoids numerical integration errors.
Below we briefly describe the BLML estimator.
Theorem III.1
Consider independent samples of an unknown bandlimited (BL) pdf with an assumed cutoff frequency. Then the BLML estimator of this pdf is given as:
(9) 
where the estimator is expressed in terms of the assumed cutoff frequency, the data samples, and a vector that is given by
(10) 
Here with .
See [13] for details. We now introduce the estimator for mutual dependence in the following theorem.
Theorem III.2
Let the data be paired independent and identically distributed observations, and let the cutoff frequency parameter be given. Then the estimator for mutual dependence is given as:
(11) 
where is given by:
is:
and is:
III-D Computation of mutual dependence
As described in [13], solving exactly for the vector in (10) requires exponential time. Therefore, the heuristic algorithms also described in [13], such as BLMLBQP and BLMLTrivial, can be used to compute the quantities required by (11) approximately for small-scale and large-scale problems, respectively. To further improve the computational time, the BLMLQuick algorithm [13] can also be used. BLMLQuick uses binning and estimates the same quantities approximately. It is also shown in [13] that both the BLMLTrivial and BLMLQuick algorithms yield consistent estimates of pdfs if the true pdf is strictly positive; therefore, in cases where the joint pdf is strictly positive, the estimate of mutual dependence is also consistent.
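To give a feel for why binning helps, the sketch below counts how many bins actually contain at least one sample. It is only an illustration of the counting step: the function name `count_occupied_bins`, the fixed `bin_width`, and the regular grid are assumptions made here, and the binning rule actually used by BLMLQuick is the one specified in [13].

```python
import numpy as np

def count_occupied_bins(samples, bin_width):
    """Number of distinct grid cells (side length bin_width) holding >= 1 sample.

    samples: array of shape (n,) or (n, dims); bin_width: positive float.
    """
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]            # treat 1-D input as n scalar samples
    cell_index = np.floor(samples / bin_width).astype(np.int64)
    return len(np.unique(cell_index, axis=0))

rng = np.random.default_rng(1)
pairs = rng.normal(size=(10_000, 2))          # dense paired data (assumption)
occupied = count_occupied_bins(pairs, bin_width=0.25)
print(f"{occupied} occupied bins for {len(pairs)} samples")
```

The count can never exceed the number of samples, and for dense data it is far smaller, which is where the computational saving comes from.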
IV Performance of mutual dependence
In this section, we evaluate the performance of our estimator for mutual dependence by first comparing its empirical distribution with the empirical distributions of the estimators for Pearson’s and distance correlation for different mutual information values, nonlinearities, and generating pdfs. We then compare the convergence of these estimators to the true values for different sample sizes. Finally, we compare the computational complexity of our estimator with that of the estimator for distance correlation to evaluate the relative computational time needed to implement each estimator.
IV-A Comparison of convergence rate for different nonlinearities
Figures 4 and 5 plot the three estimated measures (Pearson’s correlation, distance correlation, and mutual dependence) from about 50 Monte Carlo runs as a function of mutual information for different nonlinearities (linear, quadratic, cubic and sinusoidal) and generating pdfs (bandlimited and normal). Underlaid are the respective theoretical values. Specifically, the first row shows about 50 Monte Carlo computations of the first measure for different mutual information values, nonlinearities and generating pdfs. In both figures, the estimator in the first row works best for linear and sinusoidal data, but it has a larger variance for quadratic data and a larger bias for cubic data in the bandlimited case. The second row shows the corresponding 50 Monte Carlo computations of the second measure. In both figures, the estimator in the second row works best for linear data, but it has a larger bias for quadratic and sinusoidal data and a larger variance for cubic data. The bottom row shows 50 Monte Carlo computations of mutual dependence for different mutual information values, nonlinearities and generating pdfs. The mutual dependence estimator works equally well for all nonlinearities and shows less bias and variance than the other two estimators.
Figure 6 plots the integrated mean squared error (IMSE) between the theoretical and estimated measures, integrated over different mutual information values and computed from about 50 Monte Carlo runs, for different nonlinearities and generating pdf types.
(15) 
The definition in (15) involves the number of Monte Carlo simulations and the dependency metric being evaluated. It can be seen from Figure 6 that the convergence rate is fastest for mutual dependence, irrespective of nonlinearity type and generating pdf. Pearson’s and distance correlation show an equally fast convergence rate for linear and normal data, but their rate is slower for nonlinear and non-normal data. Specifically, the first row shows that the convergence of the corresponding estimator to the theoretical values is fastest for linear data. For nonlinear data, the convergence is slow due to either the large bias or the large variance discussed previously. The second row shows that the corresponding estimator does well for linear data, but its rate slows down and saturates for nonlinear data, again due to either large bias or large variance. In particular, for cubic and bandlimited data, the IMSE of this estimator does not decrease as the number of samples increases; this is due to the non-decreasing variance of the estimator (see Figure 4). The bottom row shows the convergence of mutual dependence, which converges equally well for all data types and generating pdfs.
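One way to compute such an error from simulation output is sketched below. The exact form of (15) is not reproduced here, so this is an assumption: the squared error is averaged over Monte Carlo runs at each grid point of mutual information and then integrated over the grid with the trapezoidal rule. The function name `imse` and its argument layout are hypothetical.

```python
import numpy as np

def imse(estimates, truth, grid):
    """Integrated mean squared error of a dependency estimator.

    estimates: (M, K) array, M Monte Carlo runs at K grid points
    truth:     (K,) theoretical values at the same grid points
    grid:      (K,) increasing grid (e.g. mutual information values)
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    grid = np.asarray(grid, dtype=float)
    mse = np.mean((estimates - truth) ** 2, axis=0)   # MSE at each grid point
    # Trapezoidal rule over the grid.
    return float(np.sum(0.5 * (mse[1:] + mse[:-1]) * np.diff(grid)))

# Toy check: two runs that are off by +0.1 and -0.1 everywhere on [0, 1].
grid = np.linspace(0.0, 1.0, 5)
truth = np.sqrt(grid)
runs = np.stack([truth + 0.1, truth - 0.1])
print(imse(runs, truth, grid))
```

In the toy check the squared error is the same constant at every grid point, so the result is that constant times the length of the grid interval.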
IV-B Comparison of computational time
Pearson’s correlation is the cheapest measure to compute, with a cost linear in the number of samples, whereas distance correlation is the most expensive, with a cost quadratic in the number of samples. The cost of computing mutual dependence is the same as that of the BLMLQuick algorithm, which depends on the number of bins containing a nonzero number of samples; this number is always less than or equal to the number of samples. For dense data, therefore, mutual dependence is much quicker to compute than distance correlation.
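The difference in cost between the two correlation measures is visible in a direct implementation. The sketch below uses the standard sample distance correlation based on double-centered pairwise distance matrices [3], restricted to one-dimensional data; it is a naive illustration rather than the implementation used for the simulations, and the test data (uniform samples with a linear and a quadratic relation) are assumptions chosen here.

```python
import numpy as np

def centered_distances(v):
    """Double-centered n x n matrix of pairwise absolute differences."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Naive sample distance correlation for 1-D data: O(n^2) time and memory."""
    a = centered_distances(np.asarray(x, dtype=float))
    b = centered_distances(np.asarray(y, dtype=float))
    dcov2 = max((a * b).mean(), 0.0)               # guard against round-off
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0

def pearson(x, y):
    """Sample Pearson correlation: a single O(n) pass over the data."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
print("linear   :", pearson(x, 2 * x + 1), distance_correlation(x, 2 * x + 1))
print("quadratic:", pearson(x, x ** 2), distance_correlation(x, x ** 2))
```

The two n-by-n matrices are what make the cost quadratic; Pearson’s correlation needs only running sums. The quadratic example also shows distance correlation detecting a dependence that Pearson’s correlation largely misses.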
V Conclusions
In this paper, we introduced a novel estimator for measuring dependency that can be computed directly from the data. Our estimator computes the mutual dependence, which is an “ideal” measure of dependence between two random variables [6]. Our estimator has an advantage over mutual information estimators in that it does not require a separate step of estimating the pdfs and numerically integrating them. It also has an advantage over Pearson’s and distance correlation estimators in that the measure it estimates is invariant under strictly monotonic transformations. Further, we showed in simulation that the estimators of both Pearson’s and distance correlation require more samples than our mutual dependence estimator to achieve the same integrated mean squared error (IMSE), that is, they converge more slowly. The slower convergence of the estimators of Pearson’s and distance correlation was due to their higher variance and bias for nonlinearly dependent data. Such nonlinearities did not affect our estimator, which showed a uniform decrease in IMSE with increasing sample size for all tested nonlinearities. Finally, the computational time complexity of our estimator depends on the number of occupied bins, which is superior to the quadratic time complexity of distance correlation and makes it much faster when the data are dense.
V-A Future work
Although our estimator for mutual dependence showed some desirable properties in simulation, it remains to be established, through a theoretical proof, that it is consistent for any nonlinearity. Further, in this paper we assumed throughout that we knew the cutoff frequency of the bandlimited pdf, or an approximate cutoff frequency for the normal pdf (the band in which most of the power of the pdf lies, since it is not bandlimited). In general, however, this cutoff frequency is not known. A more in-depth analysis is needed to understand the behavior of our estimator as a function of the cutoff frequency.
References
 [1] J. Lee Rodgers and W. A. Nicewander, “Thirteen ways to look at the correlation coefficient,” The American Statistician, vol. 42, no. 1, pp. 59–66, 1988.
 [2] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 379–423, 623–656, 1948.
 [3] G. J. Székely, M. L. Rizzo, N. K. Bakirov, et al., “Measuring and testing dependence by correlation of distances,” The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
 [4] G. J. Székely and M. L. Rizzo, “Brownian distance covariance,” Ann. Appl. Stat., vol. 3, pp. 1236–1265, 12 2009.
 [5] A. Renyi, “On measures of dependence,” Acta. Math. Acad. Sci. Hung., vol. 10, pp. 441–451, 1959.
 [6] C. W. Granger, E. Maasoumi, and J. Racine, “A dependence metric for possibly nonlinear processes,” Journal of Time Series Analysis, vol. 25, no. 5, 2004.
 [7] T. Kailath, “The divergence and bhattacharyya distance measures in signal selection,” Communication Technology, IEEE Transactions on, vol. 15, no. 1, pp. 52–60, 1967.

 [8] R. Beran, “Minimum hellinger distance estimates for parametric models,” The Annals of Statistics, pp. 445–463, 1977.
 [9] H. Skaug and D. Tjøstheim, “Testing for serial independence using measures of distance between densities,” in Athens Conference of Applied Probability and Time Series (Robinson P and Rosenblatt M, eds.), Springer, 1996.
 [10] C. Genest and R. J. MacKay, “The joy of copulas: Bivariate distributions with uniform marginals,” The American Statistician, vol. 40, pp. 280–3, 1986.
 [11] R. Nelsen, An Introduction to Copulas. Springer-Verlag, Berlin, 1999.
 [12] J. Havrda and F. Charvat, “Quantification method of classification processes: concept of structural entropy,” Kybernetika Cislo I. Rocnik, vol. 3, pp. 30–4, 1967.
 [13] R. Agarwal, Z. Chen, and S. S. V, “Nonparametric estimation of bandlimited probability density functions,” arXiv:1503.06236v1, http://arxiv.org/pdf/1503.06236v1.pdf, 2015.
 [14] L. Pardo Llorente, Statistical inference based on divergence measures, vol. 185 of Statistics, textbooks and monographs. Boca Raton, FL: Chapman & Hall/CRC, 2006.