 # On mutual information estimation for mixed-pair random variables

We study mutual information estimation for mixed-pair random variables, in which one random variable is discrete and the other is continuous. We develop a kernel method to estimate the mutual information between the two random variables. The estimates satisfy a central limit theorem under some regularity conditions on the distributions. The theoretical results are demonstrated by a simulation study.


## 1 Introduction

The entropy of a discrete random variable $X$ with countable support $\{x_1, x_2, \dots\}$ and $p_i = P(X = x_i)$ is defined to be

$$H(X) = -\sum_i p_i \log p_i,$$

and the (differential) entropy of a continuous random variable $Y$ with probability density function $f(y)$, $y \in \mathbb{R}^d$, is defined as

$$H(Y) = -\int_{\mathbb{R}^d} f(y) \log f(y)\,dy.$$

If $d \ge 2$, $H(X)$ or $H(Y)$ is also called the joint entropy of the components in $X$ or $Y$. Entropy is a measure of distribution uncertainty, and it naturally has applications in information theory, statistical classification, pattern recognition and so on.
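As a numerical illustration of the two definitions, the following minimal Python sketch evaluates the discrete entropy of a probability vector and approximates the differential entropy of a one-dimensional density by a Riemann sum. The function names, the integration grid and the standard normal example are illustrative assumptions, not part of the paper.

```python
import numpy as np

def discrete_entropy(p):
    """H(X) = -sum_i p_i log p_i, skipping zero-probability atoms."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def differential_entropy_1d(pdf, lo, hi, n=200_001):
    """Riemann-sum approximation of -int f(y) log f(y) dy on [lo, hi]."""
    y = np.linspace(lo, hi, n)
    f = pdf(y)
    integrand = np.zeros_like(f)
    nz = f > 0
    integrand[nz] = f[nz] * np.log(f[nz])
    return float(-np.sum(integrand) * (y[1] - y[0]))

def std_normal_pdf(y):
    return np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)

# Uniform distribution on four points: entropy log 4.
h_uniform4 = discrete_entropy([0.25, 0.25, 0.25, 0.25])
# Standard normal: closed form 0.5 * log(2 * pi * e).
h_normal = differential_entropy_1d(std_normal_pdf, -12.0, 12.0)
```

Both quantities are in nats because the natural logarithm is used.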

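Let $P_X$, $P_Y$ be probability measures on some arbitrary measure spaces $\mathcal{X}$ and $\mathcal{Y}$ respectively. Let $P_{XY}$ be the joint probability measure on the space $\mathcal{X} \times \mathcal{Y}$. If $P_{XY}$ is absolutely continuous with respect to the product measure $P_X \times P_Y$, let $\frac{dP_{XY}}{d(P_X \times P_Y)}$ be the Radon-Nikodym derivative. Then the general definition of the mutual information is given by

$$I(X,Y) = \int_{\mathcal{X} \times \mathcal{Y}} dP_{XY} \log \frac{dP_{XY}}{d(P_X \times P_Y)}. \tag{1}$$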

If two random variables $X$ and $Y$ are either both discrete or both continuous, then the mutual information of $X$ and $Y$ can be expressed in terms of entropies as

$$I(X,Y) = H(X) + H(Y) - H(X,Y). \tag{2}$$
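For the both-discrete case, formula (2) can be checked directly on a joint probability table. The sketch below is a minimal illustration; the function names and the two example tables are assumptions made for this example.

```python
import numpy as np

def entropy(p):
    """Entropy of a probability table (any shape), in nats."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information_discrete(joint):
    """I(X,Y) = H(X) + H(Y) - H(X,Y) for a joint pmf given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)  # marginal pmf of X (rows)
    py = joint.sum(axis=0)  # marginal pmf of Y (columns)
    return entropy(px) + entropy(py) - entropy(joint)

# Independent pair: joint pmf is the outer product of the marginals.
independent = np.outer([0.3, 0.7], [0.5, 0.5])
# Perfectly dependent pair: Y is a copy of a fair binary X.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

mi_indep = mutual_information_discrete(independent)
mi_dep = mutual_information_discrete(dependent)
```

The independent table gives a value that is zero up to rounding, and the perfectly dependent table gives $\log 2$.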

However, in practice we often need to work with a mixture of continuous and discrete random variables. There are several kinds of mixture: 1) one random variable is discrete and the other random variable is continuous; 2) a random variable has both discrete and continuous components, i.e., it equals a discrete random variable with some probability strictly between 0 and 1 and a continuous random variable with the complementary probability; 3) a random vector with each component being discrete, continuous or a mixture as in 2).

Nair, Prabhakar and Shah extend the definition of the joint entropy to the first kind of mixture, i.e., to a pair of random variables in which the first random variable is discrete and the second one is continuous. Our goal is to study the mutual information in that case and to estimate it from a given i.i.d. sample $\{(X_k, Y_k)\}_{k=1}^N$.

Gao, Kannan, Oh and Viswanath applied the $k$-nearest neighbor method to estimate the Radon-Nikodym derivative and, therefore, the mutual information for all three mixed cases. In the literature, if the random variables $X$ and $Y$ are either both discrete or both continuous, the mutual information is usually estimated through the three entropies in (2). The estimation of a differential entropy has been well studied. An incomplete list of the related research includes the nearest-neighbor estimator, the kernel estimator and the orthogonal projection estimator; see the References. Basharin studied the plug-in entropy estimator for the finite value discrete case and obtained the mean, the variance and the central limit theorem of this estimator. Vu, Yu and Kass studied the coverage-adjusted entropy estimator with unobserved values for the infinite value discrete case.

## 2 Main results

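Consider a random vector $Z = (X, Y)$. We call $Z$ a mixed-pair if $X$ is a discrete random variable with countable support $\mathcal{X} = \{x_1, x_2, \dots\}$ while $Y \in \mathbb{R}^d$ is a continuous random variable. Observe that $Z$ induces, for each $x_i$, a measure $A \mapsto P(X = x_i, Y \in A)$ on the Borel sets $A$ in $\mathbb{R}^d$ that is absolutely continuous with respect to the Lebesgue measure. There exists a non-negative function $g(x, y)$ with $h(x) = \int_{\mathbb{R}^d} g(x, y)\,dy$ being the probability mass function on $\mathcal{X}$ and $f(y) = \sum_i g(x_i, y)$ being the marginal density function of $Y$. Here, $x \in \mathcal{X}$, $y \in \mathbb{R}^d$. In particular, denote $g_i(y) = g(x_i, y)$ and $p_i = h(x_i)$. We have that

$$f_i(y) = \frac{1}{p_i} g_i(y)$$

is the probability density function of $Y$ conditioned on $X = x_i$. Nair, Prabhakar and Shah gave the following regularity condition for a mixed-pair and then defined the joint entropy of a mixed-pair.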

###### Definition 2.1

(Good mixed-pair). A mixed-pair of random variables $Z = (X, Y)$ is called good if the following condition is satisfied:

$$\int_{\mathcal{X} \times \mathbb{R}^d} |g(x,y) \log g(x,y)|\,dx\,dy = \sum_i \int_{\mathbb{R}^d} |g_i(y) \log g_i(y)|\,dy < \infty.$$

Essentially, a mixed-pair is good when, restricted to any of the values of $X$, the conditional differential entropy of $Y$ is well-defined.

###### Definition 2.2

(Entropy of a mixed-pair). The entropy of a good mixed-pair random variable $Z = (X, Y)$ is defined by

$$H(Z) = -\int_{\mathcal{X} \times \mathbb{R}^d} g(x,y) \log g(x,y)\,dx\,dy = -\sum_i \int_{\mathbb{R}^d} g_i(y) \log g_i(y)\,dy.$$

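As $g_i(y) = p_i f_i(y)$, we have that

$$
\begin{aligned}
H(Z) &= -\sum_i \int_{\mathbb{R}^d} g_i(y) \log g_i(y)\,dy \\
&= -\sum_i \int_{\mathbb{R}^d} p_i f_i(y) \log [p_i f_i(y)]\,dy \\
&= -\sum_i p_i \log p_i \int_{\mathbb{R}^d} f_i(y)\,dy - \sum_i p_i \int_{\mathbb{R}^d} f_i(y) \log f_i(y)\,dy \\
&= -\sum_i p_i \log p_i - \sum_i p_i \int_{\mathbb{R}^d} f_i(y) \log f_i(y)\,dy \\
&= H(X) + \sum_i p_i H(Y \mid X = x_i).
\end{aligned}
\tag{3}
$$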

We take the conventions $0 \log 0 = 0$ and $0 \log (0/0) = 0$. From the general formula of the mutual information (1), we get that

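$$
\begin{aligned}
I(X,Y) &= \int_{\mathcal{X} \times \mathbb{R}^d} g(x,y) \log \frac{g(x,y)}{h(x) f(y)}\,dx\,dy \\
&= \sum_i \int_{\mathbb{R}^d} g_i(y) \log \frac{g_i(y)}{p_i f(y)}\,dy \\
&= \sum_i \int_{\mathbb{R}^d} g_i(y) \log g_i(y)\,dy - \sum_i \int_{\mathbb{R}^d} g_i(y) \log p_i\,dy - \sum_i \int_{\mathbb{R}^d} g_i(y) \log f(y)\,dy \\
&= \sum_i \int_{\mathbb{R}^d} p_i f_i(y) \log [p_i f_i(y)]\,dy - \sum_i p_i \log p_i \int_{\mathbb{R}^d} f_i(y)\,dy - \int_{\mathbb{R}^d} f(y) \log f(y)\,dy \\
&= \sum_i p_i \log p_i \int_{\mathbb{R}^d} f_i(y)\,dy + \sum_i p_i \int_{\mathbb{R}^d} f_i(y) \log f_i(y)\,dy - \sum_i p_i \log p_i - \int_{\mathbb{R}^d} f(y) \log f(y)\,dy \\
&= -H(Z) + H(X) + H(Y) \\
&= H(Y) - \sum_i p_i H(Y \mid X = x_i) := H(Y) - \sum_i I_i.
\end{aligned}
\tag{4}
$$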
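The last line of (4) can be evaluated numerically when the conditional densities are known. The following minimal sketch does so for a one-dimensional mixed-pair by Riemann sums. The two normal components, the equal weights, the grid and all function names are assumptions chosen for illustration; they are not the examples studied in the paper.

```python
import numpy as np

def normal_pdf(y, mu, sigma=1.0):
    z = (y - mu) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

def diff_entropy(f, dy):
    """Riemann-sum approximation of -int f log f from grid values f."""
    out = np.zeros_like(f)
    nz = f > 0
    out[nz] = f[nz] * np.log(f[nz])
    return float(-np.sum(out) * dy)

def mixed_pair_mi(p, component_pdfs, grid):
    """I(X,Y) = H(Y) - sum_i p_i H(Y | X = x_i), as in the last line of (4)."""
    p = np.asarray(p, dtype=float)
    dy = grid[1] - grid[0]
    comps = np.array([pdf(grid) for pdf in component_pdfs])  # f_i on the grid
    marginal = p @ comps                                     # f = sum_i p_i f_i
    h_y = diff_entropy(marginal, dy)
    h_cond = sum(pi * diff_entropy(fi, dy) for pi, fi in zip(p, comps))
    return h_y - h_cond

grid = np.linspace(-15.0, 17.0, 200_001)
shifted = [lambda y: normal_pdf(y, 0.0), lambda y: normal_pdf(y, 2.0)]
identical = [lambda y: normal_pdf(y, 0.0), lambda y: normal_pdf(y, 0.0)]

mi_shifted = mixed_pair_mi([0.5, 0.5], shifted, grid)      # dependent pair
mi_identical = mixed_pair_mi([0.5, 0.5], identical, grid)  # independent pair
```

When the two conditional densities coincide, $Y$ carries no information about $X$ and the computed value is zero up to rounding. For a binary $X$ the value can never exceed $H(X) \le \log 2$.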

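Let $(X_1, Y_1), \dots, (X_N, Y_N)$ be a random sample drawn from a mixed distribution with discrete component having support $\{0, 1, \dots, m\}$, and let $p_i = P(X = i)$, $0 \le i \le m$, with $\sum_{i=0}^m p_i = 1$. Also suppose that the continuous component has pdf $f(y)$. Denote $\hat{p}_i = N^{-1} \sum_{k=1}^N I(X_k = i)$, and let

$$\bar{I}_i = -N^{-1} \sum_{k=1}^N I(X_k = i) \log f_i(Y_k), \quad 0 \le i \le m, \tag{5}$$

and

$$\bar{H} = \bar{H}(Y) = -N^{-1} \sum_{k=1}^N \log f(Y_k) \tag{6}$$

be the estimators of $I_i$, $0 \le i \le m$, and $H(Y)$ respectively, where $f_i(y)$ is the probability density function of $Y$ conditioned on $X = i$, $0 \le i \le m$. Denote $a = (1, -1, \dots, -1)^\intercal$. Let $\Sigma$ be the covariance matrix of $(\log f(Y), I(X=0) \log f_0(Y), \dots, I(X=m) \log f_m(Y))^\intercal$.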
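The estimators (5) and (6) use the true densities, so they can only be computed in a simulation where $f$ and $f_i$ are known. The sketch below is one minimal way to do that, assuming a two-component normal mixture with equal weights; the sample size, the components and the function names are illustrative assumptions.

```python
import numpy as np

def normal_pdf(y, mu):
    return np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2.0 * np.pi)

def oracle_mi(x, y, component_pdfs, p):
    """bar H - sum_i bar I_i, built from (5) and (6) with known densities."""
    n = len(y)
    comps = np.array([pdf(y) for pdf in component_pdfs])  # f_i(Y_k), shape (m+1, N)
    f_marginal = np.asarray(p, dtype=float) @ comps       # f(Y_k) = sum_i p_i f_i(Y_k)
    h_bar = -np.mean(np.log(f_marginal))                  # equation (6)
    i_bar = [-np.sum(np.log(comps[i][x == i])) / n        # equation (5)
             for i in range(len(p))]
    return h_bar - sum(i_bar)

rng = np.random.default_rng(0)
n = 20_000
p = np.array([0.5, 0.5])
means = np.array([0.0, 2.0])
x = rng.choice(2, size=n, p=p)             # discrete component X_k
y = rng.normal(loc=means[x], scale=1.0)    # Y_k | X_k = i ~ N(means[i], 1)

pdfs = [lambda t: normal_pdf(t, 0.0), lambda t: normal_pdf(t, 2.0)]
mi_bar = oracle_mi(x, y, pdfs, p)
```

If all conditional densities equal the marginal density, the two sums cancel term by term and the oracle estimate is zero up to rounding, whatever the sample.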

###### Theorem 2.1

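$a^\intercal \Sigma a > 0$ if and only if $X$ and $Y$ are dependent. For the estimator

$$\bar{I}(X,Y) = \bar{H} - \sum_{i=0}^m \bar{I}_i \tag{7}$$

of $I(X,Y)$ we have that

$$\sqrt{N}\,(\bar{I}(X,Y) - I(X,Y)) \to N(0, a^\intercal \Sigma a) \tag{8}$$

given that $X$ and $Y$ are dependent. Furthermore, the variance $a^\intercal \Sigma a$ can be calculated by

$$
\begin{aligned}
a^\intercal \Sigma a ={}& \operatorname{var}(\log f(Y)) + \sum_{i=0}^m p_i E_i\big[(\log f_i(Y))^2\big] - \sum_{i=0}^m p_i^2 \big(E_i[\log f_i(Y)]\big)^2 \\
&- 2 \sum_{i=0}^m p_i \big[E_i \log f_i(Y) \log f(Y) - E_i \log f_i(Y)\, E \log f(Y)\big] \\
&- 2 \sum_{0 \le i < j \le m} p_i p_j \big[E_i \log f_i(Y)\big]\big[E_j \log f_j(Y)\big],
\end{aligned}
\tag{9}
$$

where $E_i$ is the conditional expectation of $Y$ given $X = i$, $0 \le i \le m$.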

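Proof. First of all, $a^\intercal \Sigma a \ge 0$ since $\Sigma$ is the variance covariance matrix. If $a^\intercal \Sigma a = 0$ then

$$\operatorname{var}\Big(\log f(Y) - \sum_{i=0}^m I(X=i) \log f_i(Y)\Big) = a^\intercal \Sigma a = 0$$

and $\log f(Y) - \sum_{i=0}^m I(X=i) \log f_i(Y)$ is almost surely equal to some constant. But

$$\log f(Y) - \sum_{i=0}^m I(X=i) \log f_i(Y) = \sum_{i=0}^m I(X=i) \log \frac{f(Y)}{f_i(Y)}.$$

Hence $\sum_{i=0}^m I(X=i) \log \frac{f(Y)}{f_i(Y)}$ is almost surely constant. Then the ratio $f(y)/f_i(y)$ equals the same constant for all $0 \le i \le m$. But $f$ and each $f_i$ are densities and integrate to one. Hence the constant equals one and $f_i = f$ for all $0 \le i \le m$. Then $X$ and $Y$ are independent. On the other hand, if $X$ and $Y$ are independent, then $f_i = f$ for all $0 \le i \le m$. Therefore, $\log f(Y) - \sum_{i=0}^m I(X=i) \log f_i(Y) = 0$ and $a^\intercal \Sigma a = 0$. Hence, $a^\intercal \Sigma a = 0$ if and only if $X$ and $Y$ are independent.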

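Notice that the vector $(\bar{H}, \bar{I}_0, \dots, \bar{I}_m)^\intercal$ is the sample mean of a sequence of i.i.d. random vectors

$$\big\{-(\log f(Y_k), I(X_k=0) \log f_0(Y_k), \cdots, I(X_k=m) \log f_m(Y_k))^\intercal\big\}_{k=1}^N$$

with mean $(H(Y), I_0, \dots, I_m)^\intercal$ and covariance matrix $\Sigma$ (the sign change does not affect the covariance). Then, by the central limit theorem, we have

$$
\sqrt{N} \left( \begin{pmatrix} \bar{H} \\ \bar{I}_0 \\ \vdots \\ \bar{I}_m \end{pmatrix} - \begin{pmatrix} H(Y) \\ I_0 \\ \vdots \\ I_m \end{pmatrix} \right) \to N(\bar{0}, \Sigma),
$$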

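and, given $a^\intercal \Sigma a > 0$, we have (8). By the formula for variance decomposition, we have

$$
\begin{aligned}
\operatorname{Var}(I(X=i) \log f_i(Y)) &= E\big[\operatorname{Var}(I(X=i) \log f_i(Y) \mid X)\big] + \operatorname{Var}\big(E[I(X=i) \log f_i(Y) \mid X]\big) \\
&= p_i \operatorname{Var}_i(\log f_i(Y)) + p_i (1 - p_i) \big(E_i \log f_i(Y)\big)^2 \\
&= p_i E_i\big[(\log f_i(Y))^2\big] - p_i^2 \big(E_i \log f_i(Y)\big)^2
\end{aligned}
\tag{10}
$$

for all $0 \le i \le m$. Here $\operatorname{Var}_i$ is the conditional variance of $Y$ when $X = i$. By similar calculation,

$$\operatorname{Cov}(I(X=i) \log f_i(Y), I(X=j) \log f_j(Y)) = -p_i p_j [E_i \log f_i(Y)][E_j \log f_j(Y)], \tag{11}$$

for all $i \ne j$, and

$$\operatorname{Cov}(I(X=i) \log f_i(Y), \log f(Y)) = p_i \big[E_i \log f_i(Y) \log f(Y) - E_i \log f_i(Y)\, E \log f(Y)\big]. \tag{12}$$

Thus, the covariance matrix $\Sigma$ of $(\log f(Y), I(X=0) \log f_0(Y), \dots, I(X=m) \log f_m(Y))^\intercal$ and therefore $a^\intercal \Sigma a$ can be calculated by (10)-(12). We then have (9).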
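The asymptotic variance can also be approximated by Monte Carlo when the densities are known, which gives a quick check on the identity $a^\intercal \Sigma a = \operatorname{var}(\log f(Y) - \sum_i I(X=i) \log f_i(Y))$ used in the proof. The sketch below is illustrative only: the normal mixture, the sample size and the variable names are assumptions, and the paper itself computes these quantities with Mathematica.

```python
import numpy as np

def normal_pdf(y, mu):
    return np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(1)
n = 50_000
p = np.array([0.5, 0.5])
means = np.array([0.0, 2.0])
x = rng.choice(2, size=n, p=p)
y = rng.normal(loc=means[x], scale=1.0)

comps = np.array([normal_pdf(y, mu) for mu in means])  # f_i(Y_k)
log_f = np.log(p @ comps)                              # log f(Y_k)

# Rows of v are (log f(Y), I(X=0) log f_0(Y), I(X=1) log f_1(Y)).
v = np.column_stack([log_f] + [(x == i) * np.log(comps[i]) for i in range(2)])

a = np.array([1.0, -1.0, -1.0])
sigma_hat = np.cov(v, rowvar=False)           # sample version of Sigma
avar_matrix = float(a @ sigma_hat @ a)        # a' Sigma a from the matrix
avar_direct = float(np.var(v @ a, ddof=1))    # variance of the linear combination
```

The two numbers agree up to rounding because a quadratic form in the sample covariance matrix is the sample variance of the corresponding linear combination. A strictly positive value is what Theorem 2.1 predicts for a dependent pair.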

We consider the case when the random variables $X$ and $Y$ are dependent. Note that in this case $a^\intercal \Sigma a > 0$ and we have (8). However, $\bar{I}(X,Y)$ is not a practical estimator since the density functions involved are not known.

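Now let $K(\cdot)$ be a kernel function in $\mathbb{R}^d$ and let $h$ be the bandwidth. Then

$$\hat{f}_{ik}(y) = \{(N \hat{p}_i - 1) h^d\}^{-1} \sum_{j \ne k} I(X_j = i) K\{(y - Y_j)/h\}$$

are the "leave-one-out" estimators of the functions $f_i$, $0 \le i \le m$, and

$$\hat{I}_i = -N^{-1} \sum_{k=1}^N I(X_k = i) \log \hat{f}_{ik}(Y_k) \tag{13}$$

are estimators of $I_i$, $0 \le i \le m$. Also

$$\hat{H} = -N^{-1} \sum_{k=1}^N \log \hat{f}_k(Y_k) \tag{14}$$

is an estimator of $H(Y)$, where

$$
\begin{aligned}
\hat{f}_k(y) &= \{(N-1) h^d\}^{-1} \sum_{j \ne k} K\{(y - Y_j)/h\} \\
&= \{(N-1) h^d\}^{-1} \sum_{j \ne k} \Big[\sum_{i=0}^m I(X_j = i)\Big] K\{(y - Y_j)/h\} \\
&= \sum_{i=0}^m \frac{N \hat{p}_i - 1}{N - 1} \hat{f}_{ik}(y).
\end{aligned}
\tag{15}
$$

###### Theorem 2.2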

Assume that the tails of are decreasing like , respectively, as . Also assume that the kernel function has appropriately heavy tails as in . If and are all greater than in the case , greater than in the case and greater than in the case , then for the estimator

$$\hat{I}(X,Y) = \hat{H} - \sum_{i=0}^m \hat{I}_i, \tag{16}$$

we have

$$\sqrt{N}\,(\hat{I}(X,Y) - I(X,Y)) \to N(0, a^\intercal \Sigma a). \tag{17}$$

Proof. Under the conditions in the theorem, applying the formula (3.1) or (3.2) from , we have

$$\hat{H} = \bar{H} + o(N^{-1/2}), \quad \hat{I}_0 = \bar{I}_0 + o(N^{-1/2}), \quad \cdots, \quad \hat{I}_m = \bar{I}_m + o(N^{-1/2}).$$

Together with Theorem 2.1, we have (17).
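The kernel estimator (13)-(16) can be sketched in a few lines for scalar $Y$ ($d = 1$). The code below is a minimal illustration under stated assumptions, not the authors' implementation: it uses the Student $t$ density with 3 degrees of freedom as the kernel, the bandwidth $h = N^{-1/5}$ is a common rule-of-thumb choice assumed here for the example, and the function names are hypothetical. It also assumes every observed value of $X$ occurs at least twice, so that each leave-one-out count $N\hat{p}_i - 1$ is positive.

```python
import numpy as np

def t3_kernel(u):
    """Density of the Student t distribution with 3 degrees of freedom."""
    return 2.0 / (np.pi * np.sqrt(3.0)) * (1.0 + u * u / 3.0) ** -2

def kernel_mi(x, y, h, kernel=t3_kernel):
    """hat I = hat H - sum_i hat I_i from (13)-(16), for scalar Y."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    n = len(y)
    kmat = kernel((y[:, None] - y[None, :]) / h)  # K((Y_k - Y_j) / h)
    np.fill_diagonal(kmat, 0.0)                   # leave-one-out: drop j = k
    f_all = kmat.sum(axis=1) / ((n - 1) * h)      # hat f_k(Y_k), equation (15)
    h_hat = -np.mean(np.log(f_all))               # equation (14)
    same = x[:, None] == x[None, :]               # indicator of X_j = X_k
    counts = same.sum(axis=1) - 1                 # N hat p_i - 1 with i = X_k
    f_cond = (kmat * same).sum(axis=1) / (counts * h)  # hat f_{ik}(Y_k), i = X_k
    i_hat_sum = -np.mean(np.log(f_cond))          # sum over i of equation (13)
    return h_hat - i_hat_sum                      # equation (16)

# Illustrative data: X is a fair coin, Y | X = i is normal with mean 0 or 2.
rng = np.random.default_rng(0)
n = 1500
x = rng.choice(2, size=n, p=[0.5, 0.5])
y = rng.normal(loc=np.array([0.0, 2.0])[x], scale=1.0)
mi_hat = kernel_mi(x, y, h=n ** (-1.0 / 5.0))
```

The sum over $i$ in (16) collapses to a single average because, for each $k$, only the term with $i = X_k$ has a nonzero indicator. The pairwise kernel matrix needs memory of order $N^2$, which is acceptable for samples of a few thousand points.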

We may take the probability density function of a Student $t$ distribution with a proper degree of freedom, instead of the normal density function, as the kernel function. On the other hand, if $X$ and $Y$ are independent then $a^\intercal \Sigma a = 0$ and the limiting distribution in (17) is degenerate at zero.

## 3 Simulation study

In this section we conduct a simulation study with $m = 1$, i.e., the random variable $X$ takes two possible values 0 and 1, to confirm the main results stated in (17) for the kernel mutual information estimation of good mixed-pairs. First we study some one dimensional examples. Let $t(\nu, \mu, \sigma)$ be the Student $t$ distribution with degree of freedom $\nu$, location parameter $\mu$ and scale parameter $\sigma$ and let pareto be the Pareto distribution with density function . We study the mixture for the following four cases: 1) $t(3,0,1)$ and $t(12,0,1)$; 2) $t(3,0,1)$ and $t(3,2,1)$; 3) $t(3,0,1)$ and $t(3,0,3)$; 4) pareto$(1,2)$ and pareto$(1,10)$. For each case, for the first distribution and for the second distribution.

The second row of Table 1 lists the Mathematica calculation of the mutual information (MI) as stated in (4) for each case. The third row of Table 1 gives the average of 400 estimates based on formula (16). For each estimate, we use the probability density function of the Student $t$ distribution with degree of freedom 3, i.e. $t(3,0,1)$, as the kernel function. We also ran simulations with kernel functions satisfying the conditions in the main results and obtained similar results. We take as the bandwidth for the first three cases and for the last case. The data size for each estimate is in each case. The Pareto distributions pareto$(1,2)$ and pareto$(1,10)$ are highly concentrated just to the right of 1. This is the reason that we take a relatively small bandwidth for this case. To apply the kernel method in estimation, one should select an optimal bandwidth based on some criterion, for example, minimizing the mean squared error. It is interesting to investigate the bandwidth selection problem from both theoretical and application viewpoints. However, the study in this direction seems to be very difficult, and we leave it as an open question for future study. The average of the estimates is close to the true value of the mutual information in each case.

We apply Mathematica to calculate the covariance matrix of

$$(\log f(Y), I(X=0) \log f_0(Y), I(X=1) \log f_1(Y))^\intercal$$

and, therefore, the value of $a^\intercal \Sigma a$ for each case by formulae (10)-(12) or (9). The values of $a^\intercal \Sigma a$ are , , and respectively for the four cases. The fourth row of Table 1 lists the values of $(a^\intercal \Sigma a / N)^{1/2}$, which serves as the asymptotic approximation of the standard deviation of the estimator $\hat{I}(X,Y)$ in the central limit theorem (17). The last row gives the sample standard deviation from the $M = 400$ estimates. These two values also match well.

Figure 1: The histograms with kernel density fits of M=400 estimates. Top left: t(3,0,1) and t(12,0,1). Top right: t(3,0,1) and t(3,2,1). Bottom left: t(3,0,1) and t(3,0,3). Bottom right: pareto(1,2) and pareto(1,10).

Figure 2: The Q-Q plots of M=400 estimates. Top left: t(3,0,1) and t(12,0,1). Top right: t(3,0,1) and t(3,2,1). Bottom left: t(3,0,1) and t(3,0,3). Bottom right: pareto(1,2) and pareto(1,10).

Figures 1 and 2 show the histograms with kernel density fits and normal Q-Q plots of the 400 estimates for each case. It is clear that the values of $\hat{I}(X,Y)$ follow a normal distribution.

We study two examples in the two dimensional case. Let $t_\nu(\mu, S)$ be the two dimensional Student $t$ distribution with degree of freedom $\nu$, mean $\mu$ and shape matrix $S$. We study the mixture in two cases: 1) $t_5(0, I)$ and $t_{25}(0, I)$; 2) $t_5(0, I)$ and $t_5(0, 3I)$. Here $I$ is the identity matrix. For each case, for the first distribution and for the second distribution. Table 2 summarizes $M = 200$ estimates of the mutual information with and sample size for each estimate. We take as the kernel function. As in the one dimensional case, we apply Mathematica to calculate the true value of MI and $a^\intercal \Sigma a$, which is given in formula (9). Figure 3 shows the histograms with kernel density fits and normal Q-Q plots of the 200 estimates for each example. It is clear that the values of $\hat{I}(X,Y)$ also follow a normal distribution in the two dimensional case. In summary, the simulation study confirms the central limit theorem as stated in (17).

Figure 3: The histograms and Q-Q plots of M=200 estimates. Left: t5(0,I) and t25(0,I). Right: t5(0,I) and t5(0,3I).
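A small replication study in the spirit of this section can be sketched as follows. Everything specific in the code is an assumption made for illustration and does not reproduce the settings of Tables 1 and 2: the mixed-pair is a fair binary $X$ with normal conditional densities, the sample size is $N = 400$, there are 100 replications, the bandwidth is $h = N^{-1/5}$, and $a^\intercal \Sigma a$ is approximated by Monte Carlo rather than computed with Mathematica.

```python
import numpy as np

def normal_pdf(y, mu):
    return np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2.0 * np.pi)

def t3_kernel(u):
    """Student t density with 3 degrees of freedom, used as the kernel."""
    return 2.0 / (np.pi * np.sqrt(3.0)) * (1.0 + u * u / 3.0) ** -2

def kernel_mi(x, y, h):
    """Leave-one-out kernel estimate (16) for scalar Y."""
    n = len(y)
    kmat = t3_kernel((y[:, None] - y[None, :]) / h)
    np.fill_diagonal(kmat, 0.0)
    same = x[:, None] == x[None, :]
    f_all = kmat.sum(axis=1) / ((n - 1) * h)
    f_cond = (kmat * same).sum(axis=1) / ((same.sum(axis=1) - 1) * h)
    return -np.mean(np.log(f_all)) + np.mean(np.log(f_cond))

def draw(rng, n, p, means):
    """Sample (X_k, Y_k): X from pmf p, then Y | X = i ~ N(means[i], 1)."""
    x = rng.choice(len(p), size=n, p=p)
    return x, rng.normal(loc=means[x], scale=1.0)

rng = np.random.default_rng(2024)
p = np.array([0.5, 0.5])
means = np.array([0.0, 2.0])
n, reps = 400, 100
h = n ** (-1.0 / 5.0)

# Replicated kernel estimates of the mutual information.
estimates = np.array([kernel_mi(*draw(rng, n, p, means), h) for _ in range(reps)])
sample_sd = float(estimates.std(ddof=1))

# Monte Carlo approximation of a' Sigma a with the true densities:
# the variance of log f(Y) - sum_i I(X = i) log f_i(Y).
xb, yb = draw(rng, 200_000, p, means)
comps = np.array([normal_pdf(yb, mu) for mu in means])
w = np.log(p @ comps) - np.log(comps[xb, np.arange(len(yb))])
asym_sd = float(np.sqrt(np.var(w, ddof=1) / n))
```

Comparing `sample_sd` with `asym_sd` mirrors the comparison between the last two rows of Table 1, and a histogram or normal Q-Q plot of `estimates` plays the role of Figures 1 to 3.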

## Acknowledgement

The authors thank the editor and the referees for carefully reading the manuscript and for the suggestions that improved the presentation. This research is supported by the College of Liberal Arts Faculty Grants for Research and Creative Achievement at the University of Mississippi. The research of Hailin Sang is also supported by the Simons Foundation Grant 586789.

## References

•  Ahmad, I. A. and Lin, P. E. 1976. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Information Theory. 22, 372-375.
•  Basharin, G. P. 1959. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability and Its Applications. 4, 333-336.
•  Gao, W., Kannan, S., Oh, S. and Viswanath, P. 2017. Estimating mutual information for discrete-continuous mixtures. Advances in Neural Information Processing Systems. 5988-5999.
•  Hall, P. 1987. On Kullback-Leibler Loss and Density Estimation. Ann. Statist. 15, no. 4, 1491-1519.
•  Hall, P. and Morton, S. 1993. On the estimation of entropy. Ann. Inst. Statist. Math. 45, 69-88.
•  Joe, H. 1989. On the estimation of entropy and other functionals of a multivariate density. Ann. Inst. Statist. Math. 41, 683-697.
•  Kozachenko, L. F. and Leonenko, N. N. 1987. Sample estimate of entropy of a random vector. Problems of Information Transmission, 23, 95-101.
•  Laurent, B. 1996. Efficient estimation of integral functionals of a density. Ann. Statist. 24, 659-681.
•  Laurent, B. 1997. Estimation of integral functionals of a density and its derivatives. Bernoulli 3, 181-211.
•  Leonenko, N., Pronzato, L. and Savani, V. 2008. A class of Rényi information estimators for multidimensional densities. Ann. Statist. 36, 2153–2182. Corrections, Ann. Statist. 38 (2010), 3837-3838.
•  Nair, C., Prabhakar, B. and Shah, D. On entropy for mixtures of discrete and continuous variables. arXiv:cs/0607075
•  Tsybakov, A. B. and van der Meulen, E. C. 1994. Root-n consistent estimators of entropy for densities with unbounded support. Scand. J. Statist., 23, 75-83.
•  Vu, V. Q., Yu, B. and Kass, R. E. 2007. Coverage-adjusted entropy estimation. Statist. Med., 26, 4039-4060.