The entropy of a discrete random variable $X$ with countable support $\mathcal{X}$ and probability mass function $p$ is defined to be
$$H(X)=-\sum_{x\in\mathcal{X}}p(x)\log p(x),$$
and the (differential) entropy of a continuous random variable $Y$ with probability density function $f$ is defined as
$$h(Y)=-\int f(y)\log f(y)\,dy.$$
If $X$, or respectively $Y$, is a random vector, $H(X)$ or $h(Y)$ is also called the joint entropy of the components in $X$ or $Y$.
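As an illustrative numerical check (ours, not part of the original text), both definitions can be evaluated directly; the function names below are our own, and the Riemann-sum integrator is only a sketch:

```python
import math

def discrete_entropy(pmf):
    # H(X) = -sum p(x) log p(x), with the convention 0 log 0 = 0
    return -sum(p * math.log(p) for p in pmf if p > 0)

def numeric_differential_entropy(pdf, lo, hi, n=200000):
    # h(Y) = -integral f(y) log f(y) dy, midpoint Riemann sum on [lo, hi]
    dy = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * dy
        f = pdf(y)
        if f > 0:
            total -= f * math.log(f) * dy
    return total

# a fair coin has entropy log 2; N(0,1) has h = (1/2) log(2 pi e)
H_coin = discrete_entropy([0.5, 0.5])
std_normal = lambda y: math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)
h_num = numeric_differential_entropy(std_normal, -10.0, 10.0)
```

Both values agree with the closed forms to high accuracy.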
Let $P_X$, $P_Y$ be probability measures on some arbitrary measure spaces $\mathcal{X}$ and $\mathcal{Y}$ respectively. Let $P_{XY}$ be the joint probability measure on the space $\mathcal{X}\times\mathcal{Y}$. If $P_{XY}$ is absolutely continuous with respect to the product measure $P_X\otimes P_Y$, let $\frac{dP_{XY}}{d(P_X\otimes P_Y)}$ be the Radon-Nikodym derivative. Then the general definition of the mutual information (e.g., Gao, Kannan, Oh and Viswanath, 2017) is given by
$$I(X;Y)=\int_{\mathcal{X}\times\mathcal{Y}}\log\frac{dP_{XY}}{d(P_X\otimes P_Y)}\,dP_{XY}. \qquad (1)$$
If two random variables $X$ and $Y$ are either both discrete or both continuous, then the mutual information of $X$ and $Y$ can be expressed in terms of entropies as
$$I(X;Y)=H(X)+H(Y)-H(X,Y), \qquad (2)$$
where the entropies are differential entropies in the continuous case.
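The identity (2) can be verified on a small example. The following sketch (our own, with a hypothetical $2\times2$ joint pmf) computes the mutual information both from the definition and from the entropy identity:

```python
import math

def H(probs):
    # Shannon entropy of a pmf given as a list of probabilities
    return -sum(p * math.log(p) for p in probs if p > 0)

# hypothetical 2x2 joint pmf for (X, Y); rows index X, columns index Y
joint = [[0.3, 0.1],
         [0.2, 0.4]]
px = [sum(row) for row in joint]                               # marginal of X
py = [sum(joint[i][j] for i in range(2)) for j in range(2)]    # marginal of Y

# direct definition: I = sum p(x,y) log( p(x,y) / (p(x) p(y)) )
I_direct = sum(joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]))
               for i in range(2) for j in range(2) if joint[i][j] > 0)

# entropy identity (2): I = H(X) + H(Y) - H(X, Y)
I_entropy = H(px) + H(py) - H([joint[i][j] for i in range(2) for j in range(2)])
```

The two computations agree up to floating-point rounding.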
However, in practice and application, we often need to work with a mixture of continuous and discrete random variables. Such mixtures arise in several ways: 1) one random variable is discrete and the other random variable is continuous; 2) a single random variable has both discrete and continuous components, i.e., $Z=X$ with probability $p$ and $Z=Y$ with probability $1-p$, where $0<p<1$, $X$ is a discrete random variable and $Y$ is a continuous random variable; 3) a random vector whose components may each be discrete, continuous, or mixed as in 2).
In Nair, Prabhakar and Shah (2006), the authors extend the definition of the joint entropy to the first type of mixture, i.e., to a pair of random variables in which the first random variable is discrete and the second one is continuous. Our goal is to study the mutual information for that case and to estimate the mutual information from a given i.i.d. sample $\{(X_j,Y_j)\}_{j=1}^n$.
In Gao, Kannan, Oh and Viswanath (2017), the authors applied the $k$-nearest neighbor method to estimate the Radon-Nikodym derivative and, therefore, the mutual information in all three mixed cases. In the literature, if the random variables $X$ and $Y$ are either both discrete or both continuous, the estimation of mutual information is usually performed via the estimation of the three entropies in (2). The estimation of a differential entropy has been well studied. An incomplete list of the related research includes the nearest-neighbor estimator (Kozachenko and Leonenko, 1987; Tsybakov and van der Meulen, 1994; Leonenko, Pronzato and Savani, 2008), the kernel estimator (Ahmad and Lin, 1976; Hall, 1987; Joe, 1989; Hall and Morton, 1993) and the orthogonal projection estimator (Laurent, 1996, 1997). Basharin (1959) studied the plug-in entropy estimator for the case of a discrete random variable with finitely many values and obtained the mean, the variance and the central limit theorem of this estimator. Vu, Yu and Kass (2007) studied the coverage-adjusted entropy estimator with unobserved values for the case of a discrete random variable with countably infinite support.
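For intuition on the nearest-neighbor approach mentioned above, here is a minimal sketch of the Kozachenko-Leonenko estimator in dimension one with $k=1$; the form of the estimator follows the standard formulation, and all numerical choices are ours:

```python
import math, random

def kl_entropy_1d(sample):
    # Kozachenko-Leonenko nearest-neighbor entropy estimator in dimension 1:
    #   h_hat = (1/n) * sum_i log( 2 * (n-1) * rho_i ) + gamma,
    # where rho_i is the distance from Y_i to its nearest neighbor and
    # gamma = 0.5772... is the Euler-Mascheroni constant.
    ys = sorted(sample)
    n = len(ys)
    gamma = 0.5772156649015329
    total = 0.0
    for i in range(n):
        if i == 0:
            rho = ys[1] - ys[0]
        elif i == n - 1:
            rho = ys[-1] - ys[-2]
        else:
            rho = min(ys[i] - ys[i - 1], ys[i + 1] - ys[i])
        total += math.log(2.0 * (n - 1) * rho)
    return total / n + gamma

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(4000)]
h_hat = kl_entropy_1d(sample)
h_true = 0.5 * math.log(2 * math.pi * math.e)  # differential entropy of N(0,1)
```

With a sorted sample the nearest-neighbor distances are read off from adjacent spacings, so the whole estimate costs $O(n\log n)$.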
2 Main results
Consider a random vector $Z=(X,Y)$. We call $Z$ a mixed-pair if $X$ is a discrete random variable with countable support $\mathcal{X}=\{x_1,x_2,\dots\}$ while $Y$ is a continuous random variable taking values in $\mathbb{R}^d$. Observe that $Z$ induces measures $\{\mu_i\}$ that are absolutely continuous with respect to the Lebesgue measure, where $\mu_i(A)=P(X=x_i,\,Y\in A)$, for every Borel set $A$ in $\mathbb{R}^d$. There exists a non-negative function $g:\mathcal{X}\times\mathbb{R}^d\to\mathbb{R}$ such that $p(x)=\int_{\mathbb{R}^d}g(x,y)\,dy$ is the probability mass function of $X$ and $f(y)=\sum_{x\in\mathcal{X}}g(x,y)$ is the marginal density function of $Y$. Here, $p_i=p(x_i)$, $g_i(y)=g(x_i,y)$, $i=1,2,\dots$. In particular, denote $f_i(y)=g_i(y)/p_i$. We have that $f_i(y)$
is the probability density function of $Y$ conditioned on $X=x_i$. In Nair, Prabhakar and Shah (2006), the authors gave the following regularity condition on mixed-pairs and then defined the joint entropy of a mixed-pair.
(Good mixed-pair). A mixed-pair of random variables $Z=(X,Y)$ is called good if the following condition is satisfied:
$$\sum_{i}\int_{\mathbb{R}^d}g_i(y)\left|\log g_i(y)\right|\,dy<\infty.$$
Essentially, we have a good mixed-pair of random variables when, restricted to any of the values $x_i$, the conditional differential entropy of $Y$ is well-defined.
(Entropy of a mixed-pair). The entropy of a good mixed-pair random variable is defined by
$$H(Z)=-\sum_i\int_{\mathbb{R}^d}g_i(y)\log g_i(y)\,dy.$$
As $g_i(y)=p_if_i(y)$, we have that
$$H(Z)=-\sum_i p_i\log p_i-\sum_i p_i\int_{\mathbb{R}^d}f_i(y)\log f_i(y)\,dy.$$
We take the conventions $0\log0=0$ and $0\log\frac00=0$. From the general formula (1) for the mutual information, we get that
$$I(X;Y)=\sum_i\int_{\mathbb{R}^d}g_i(y)\log\frac{g_i(y)}{p_if(y)}\,dy. \qquad (4)$$
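The mixed-pair mutual information can be computed numerically from this formula. The following sketch (our construction: $X\in\{0,1\}$ with Gaussian conditional densities, parameters hypothetical) checks that (4) vanishes under independence and is positive, but below $\log 2=H(X)$, under dependence:

```python
import math

def normal_pdf(y, mu):
    # N(mu, 1) density
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def mixed_pair_mi(p0, mu0, mu1, lo=-12.0, hi=12.0, n=120000):
    # I(X;Y) = sum_i integral g_i(y) log( g_i(y) / (p_i f(y)) ) dy, formula (4),
    # with g_i(y) = p_i * f_i(y) and f(y) = g_0(y) + g_1(y)
    p = [p0, 1.0 - p0]
    mu = [mu0, mu1]
    dy = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * dy
        g = [p[i] * normal_pdf(y, mu[i]) for i in range(2)]
        f = g[0] + g[1]
        for i in range(2):
            if g[i] > 0 and f > 0:
                total += g[i] * math.log(g[i] / (p[i] * f)) * dy
    return total

I_indep = mixed_pair_mi(0.5, 0.0, 0.0)   # identical conditionals: independence
I_dep = mixed_pair_mi(0.5, -2.0, 2.0)    # well-separated conditionals
```

As the separation of the conditional densities grows, the value approaches the upper bound $H(X)=\log2$.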
Let $\{(X_j,Y_j)\}_{j=1}^n$ be a random sample drawn from a mixed distribution with discrete component having support $\{x_1,\dots,x_m\}$, and let $p_i=P(X=x_i)$, with $\sum_{i=1}^m p_i=1$. Also suppose that the continuous component has pdf $f(y)$. Denote $g_i(y)=p_if_i(y)$, and let
$$\hat H(X)=-\frac1n\sum_{j=1}^n\log p_{X_j},\qquad \hat h(Y)=-\frac1n\sum_{j=1}^n\log f(Y_j),\qquad \hat H(Z)=-\frac1n\sum_{j=1}^n\log g_{X_j}(Y_j)$$
be the estimators of $H(X)$, $h(Y)$ and $H(Z)$ respectively, where $f_i(y)$ is the probability density function of $Y$ conditioned on $X=x_i$, $i=1,\dots,m$, and where $p_{X_j}=p_i$ and $g_{X_j}=g_i$ whenever $X_j=x_i$. Denote $W_j=\bigl(-\log p_{X_j},\,-\log f(Y_j),\,-\log g_{X_j}(Y_j)\bigr)$. Let $\Sigma$ be the covariance matrix of $W_1$.
Then $\sigma^2=a^{T}\Sigma a>0$, where $a=(1,1,-1)^{T}$, if and only if $X$ and $Y$ are dependent. For the estimator
$$\hat I(X;Y)=\hat H(X)+\hat h(Y)-\hat H(Z)$$
of $I(X;Y)$ we have that
$$\sqrt n\bigl(\hat I(X;Y)-I(X;Y)\bigr)\xrightarrow{d}N(0,\sigma^2), \qquad (8)$$
given that $X$ and $Y$ are dependent. Furthermore, the variance can be calculated by
$$\sigma^2=\sum_i p_i\operatorname{Var}(\xi\mid X=x_i)+\operatorname{Var}(m_X), \qquad (9)$$
where $\xi=\log\frac{g_X(Y)}{p_Xf(Y)}$ and $m_i=E[\xi\mid X=x_i]$ is the conditional expectation of $\xi$ given $X=x_i$.
Proof. First of all, $\sigma^2=a^{T}\Sigma a\ge0$ since $\Sigma$ is the variance-covariance matrix. If $\sigma^2=0$ then
$$\operatorname{Var}\left(\log\frac{g_X(Y)}{p_Xf(Y)}\right)=0$$
and $\log\frac{g_X(Y)}{p_Xf(Y)}=c$ a.s. for some constant $c$. Then $g_i(y)=e^{c}p_if(y)$ for that constant $c$, for all $i$ and almost all $y$. But
$$\sum_i\int_{\mathbb{R}^d}g_i(y)\,dy=1=e^{c}\sum_ip_i\int_{\mathbb{R}^d}f(y)\,dy=e^{c}.$$
Hence $c=0$, and $g_i(y)=p_if(y)$ for all $i$ and almost all $y$. Then $X$ and $Y$ are independent. On the other hand, if $X$ and $Y$ are independent, then $g_i(y)=p_if(y)$ for all $i$ and almost all $y$. Therefore, $\log\frac{g_X(Y)}{p_Xf(Y)}=0$ a.s. and $\sigma^2=0$. Hence, $\sigma^2=0$ if and only if $X$ and $Y$ are independent.
Notice that the vector $\bigl(\hat H(X),\hat h(Y),\hat H(Z)\bigr)$ is the sample mean of the sequence of i.i.d. random vectors
$$W_j=\bigl(-\log p_{X_j},\,-\log f(Y_j),\,-\log g_{X_j}(Y_j)\bigr),\qquad j=1,\dots,n,$$
with mean $\bigl(H(X),h(Y),H(Z)\bigr)$. Then, by the central limit theorem, we have
$$\sqrt n\Bigl(\bigl(\hat H(X),\hat h(Y),\hat H(Z)\bigr)-\bigl(H(X),h(Y),H(Z)\bigr)\Bigr)\xrightarrow{d}N(0,\Sigma),$$
and, applying the vector $a=(1,1,-1)^{T}$, we have (8). By the formula for variance decomposition, we have
$$\sigma^2=\operatorname{Var}(\xi)=E\bigl[\operatorname{Var}(\xi\mid X)\bigr]+\operatorname{Var}\bigl(E[\xi\mid X]\bigr),$$
where $\xi=\log\frac{g_X(Y)}{p_Xf(Y)}$. Here $\operatorname{Var}(\xi\mid X=x_i)$ is the conditional variance of $\xi$ when $X=x_i$. By similar calculation, $E[\xi\mid X=x_i]=\int_{\mathbb{R}^d}f_i(y)\log\frac{f_i(y)}{f(y)}\,dy$ for all $i$, and the stated formula for $\sigma^2$ follows.
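The variance decomposition used in the proof can be checked by simulation. The sketch below (a toy example of ours, unrelated to the paper's data) verifies $\operatorname{Var}(\xi)=E[\operatorname{Var}(\xi\mid X)]+\operatorname{Var}(E[\xi\mid X])$ by Monte Carlo:

```python
import random

# toy mixed pair: X ~ Bernoulli(0.3); given X=0, xi ~ N(0, 1);
# given X=1, xi ~ N(3, 4) (all parameters hypothetical)
random.seed(2)
n = 200000
xs = [1 if random.random() < 0.3 else 0 for _ in range(n)]
xi = [random.gauss(3.0, 2.0) if x == 1 else random.gauss(0.0, 1.0) for x in xs]

# left-hand side, by Monte Carlo
mean = sum(xi) / n
var_mc = sum((v - mean) ** 2 for v in xi) / n

# right-hand side, computed exactly:
# E[Var(xi|X)] = 0.7*1 + 0.3*4;  Var(E[xi|X]) = 0.3*0.7*3^2
var_exact = 0.7 * 1.0 + 0.3 * 4.0 + 0.3 * 0.7 * 9.0
```

The empirical variance matches the exact decomposition up to Monte Carlo error.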
We consider the case when the random variables $X$ and $Y$ are dependent. Note that in this case $\sigma^2>0$ and we have (8). However, $\hat I(X;Y)$ is not a practical estimator since the density functions involved are not known.
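To see why this estimator is an oracle quantity, the sketch below (our construction; the sample-mean form of $\hat I(X;Y)$ is as reconstructed above) plugs the true densities into the estimator and compares with formula (4) evaluated by numerical integration:

```python
import math, random

def phi(y, mu):
    # N(mu, 1) density
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2.0 * math.pi)

# toy good mixed-pair (our choice): P(X=0) = P(X=1) = 1/2,
# Y | X=0 ~ N(-2, 1), Y | X=1 ~ N(2, 1)
P = [0.5, 0.5]
MU = [-2.0, 2.0]

def f_marginal(y):
    return P[0] * phi(y, MU[0]) + P[1] * phi(y, MU[1])

def true_mi(lo=-12.0, hi=12.0, steps=100000):
    # I(X;Y) by numerical integration of formula (4)
    dy = (hi - lo) / steps
    s = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * dy
        f = f_marginal(y)
        for i in range(2):
            gi = P[i] * phi(y, MU[i])
            s += gi * math.log(gi / (P[i] * f)) * dy
    return s

random.seed(0)
n = 20000
xs = [1 if random.random() < P[1] else 0 for _ in range(n)]
ys = [random.gauss(MU[x], 1.0) for x in xs]

# oracle estimator: sample mean of xi_j = log( g_{X_j}(Y_j) / (p_{X_j} f(Y_j)) );
# with the true densities plugged in, the ratio simplifies to f_{X_j}(Y_j) / f(Y_j)
I_oracle = sum(math.log(phi(y, MU[x]) / f_marginal(y)) for x, y in zip(xs, ys)) / n
```

The sample mean sits close to the integral, but it requires $f$ and the $f_i$, which are unknown in practice.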
Now let $K$ be a kernel function on $\mathbb{R}^d$ and let $b=b_n$ be the bandwidth. Then
$$\hat g_i^{(j)}(y)=\frac{1}{(n-1)b^d}\sum_{k\ne j}\mathbf 1(X_k=x_i)\,K\!\left(\frac{y-Y_k}{b}\right),\qquad \hat f^{(j)}(y)=\frac{1}{(n-1)b^d}\sum_{k\ne j}K\!\left(\frac{y-Y_k}{b}\right)$$
are the “leave-one-out” estimators of the functions $g_i(y)$ and $f(y)$, and
$$\hat p_i=\frac1n\sum_{j=1}^n\mathbf 1(X_j=x_i)$$
are estimators of $p_i$, $i=1,\dots,m$. Also
$$\hat I_n=\frac1n\sum_{j=1}^n\log\frac{\hat g_{X_j}^{(j)}(Y_j)}{\hat p_{X_j}\hat f^{(j)}(Y_j)} \qquad (16)$$
is an estimator of $I(X;Y)$, where $\hat g_{X_j}^{(j)}$ and $\hat p_{X_j}$ denote $\hat g_i^{(j)}$ and $\hat p_i$ with $x_i=X_j$.
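A direct implementation of (16) for $d=1$ and $m=2$ can be sketched as follows; the Student-$t_3$ kernel, the bandwidth value and the test distributions are illustrative choices of ours:

```python
import math, random

def t3_pdf(u):
    # density of the Student-t distribution with 3 degrees of freedom;
    # a heavy-tailed kernel of the kind allowed by the theorem
    return 6.0 * math.sqrt(3.0) / (math.pi * (3.0 + u * u) ** 2)

def kernel_mi(xs, ys, b, kernel=t3_pdf):
    # leave-one-out kernel estimator (16) for d = 1:
    # I_hat_n = (1/n) sum_j log( g_hat^(j)_{X_j}(Y_j) / (p_hat_{X_j} f_hat^(j)(Y_j)) )
    n = len(ys)
    counts = {}
    for x in xs:
        counts[x] = counts.get(x, 0) + 1
    total = 0.0
    for j in range(n):
        g = 0.0  # leave-one-out estimate of g_{X_j}(Y_j)
        f = 0.0  # leave-one-out estimate of f(Y_j)
        for k in range(n):
            if k == j:
                continue
            w = kernel((ys[j] - ys[k]) / b) / b
            f += w
            if xs[k] == xs[j]:
                g += w
        g /= n - 1
        f /= n - 1
        p = counts[xs[j]] / n  # p_hat_{X_j}
        total += math.log(g / (p * f))
    return total / n

# dependent mixed pair: Y | X=0 ~ N(-2,1), Y | X=1 ~ N(2,1), P(X=1) = 1/2
random.seed(1)
n = 800
xs = [1 if random.random() < 0.5 else 0 for _ in range(n)]
ys = [random.gauss(2.0 if x else -2.0, 1.0) for x in xs]
I_dep = kernel_mi(xs, ys, b=0.3)

# independent pair: the same marginal for Y regardless of X
ys_ind = [random.gauss(0.0, 1.0) for _ in range(n)]
I_ind = kernel_mi(xs, ys_ind, b=0.3)
```

The estimate is clearly positive in the dependent case and close to zero in the independent one.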
Assume that the tails of the densities $f_1,\dots,f_m$ are decreasing at polynomial rates $|y|^{-\alpha_1},\dots,|y|^{-\alpha_m}$, respectively, as $|y|\to\infty$. Also assume that the kernel function $K$ has appropriately heavy tails, as in Hall and Morton (1993). If the tail indices $\alpha_1,\dots,\alpha_m$ all exceed dimension-dependent thresholds (one threshold each for the cases $d=1$, $d=2$ and $d\ge3$), then for the estimator $\hat I_n$ in (16) we have
$$\sqrt n\bigl(\hat I_n-I(X;Y)\bigr)\xrightarrow{d}N(0,\sigma^2). \qquad (17)$$
Proof. Under the conditions in the theorem, applying the formula (3.1) or (3.2) from Hall and Morton (1993), we have
$$\hat I_n=\hat I(X;Y)+o_P(n^{-1/2}),$$
and (17) follows from (8).
We may take the probability density function of the Student-$t$ distribution with a proper degree of freedom, instead of the normal density function, as the kernel function. On the other hand, if $X$ and $Y$ are independent, then $\sigma^2=0$ and we have that $\sqrt n\bigl(\hat I_n-I(X;Y)\bigr)\to0$ in probability.
3 Simulation study
In this section we conduct a simulation study with $m=2$, i.e., the random variable $X$ takes two possible values 0 and 1, to confirm the main results stated in (17) for the kernel mutual information estimation of good mixed-pairs. First we study some one-dimensional examples. Let $t(\nu,\mu,\sigma)$ be the Student $t$ distribution with degrees of freedom $\nu$, location parameter $\mu$ and scale parameter $\sigma$, and let $\mathrm{Par}(\alpha,\gamma)$ be the Pareto distribution with density function $\alpha\gamma^{\alpha}/y^{\alpha+1}$ for $y\ge\gamma$. We study the mixture in four cases, each pairing two distributions from these two families. In each case, $Y$ follows the first distribution of the pair conditioned on $X=0$ and the second distribution conditioned on $X=1$.
The second row of Table 1 lists the Mathematica calculation of the mutual information (MI) as stated in (4) for each case. The third row of Table 1 gives the average of 400 estimates based on formula (16). For each estimate, we use the probability density function of the Student $t$ distribution with degree of freedom 3, i.e. $t(3,0,1)$, as the kernel function. We also carried out the simulation study with other kernel functions satisfying the conditions in the main results and obtained similar results. We take the same bandwidth for the first three cases and a relatively smaller bandwidth for the last case, with the same data size for each estimate in every case. The Pareto distributions have a very dense area to the right of 1; this is the reason that we take a relatively small bandwidth for that case. To apply the kernel method in estimation, one should select an optimal bandwidth based on some criterion, for example, minimizing the mean squared error. It would be interesting to investigate the bandwidth selection problem from both the theoretical and the applied viewpoints. However, the study in this direction appears to be very difficult, and we leave it as an open question for future work. It is clear that the average of the estimates matches the true value of the mutual information.
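A compact version of such a simulation can be sketched as follows; the component distributions (normals with means $\pm2$), the number of replicates (40 rather than 400), the sample size and the bandwidth are all our illustrative choices, not the paper's:

```python
import math, random

def t3_pdf(u):
    # Student-t(3) density, used as an illustrative heavy-tailed kernel
    return 6.0 * math.sqrt(3.0) / (math.pi * (3.0 + u * u) ** 2)

def kernel_mi(xs, ys, b):
    # leave-one-out kernel estimator (16), d = 1
    n = len(ys)
    total = 0.0
    for j in range(n):
        g = f = 0.0
        for k in range(n):
            if k != j:
                w = t3_pdf((ys[j] - ys[k]) / b) / b
                f += w
                if xs[k] == xs[j]:
                    g += w
        p = sum(1 for x in xs if x == xs[j]) / n
        total += math.log((g / (n - 1)) / (p * (f / (n - 1))))
    return total / n

def phi(y, mu):
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2.0 * math.pi)

def true_mi(lo=-12.0, hi=12.0, steps=60000):
    # formula (4) for P(X=0)=P(X=1)=1/2, Y|X=0 ~ N(-2,1), Y|X=1 ~ N(2,1)
    dy = (hi - lo) / steps
    s = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * dy
        g0, g1 = 0.5 * phi(y, -2.0), 0.5 * phi(y, 2.0)
        f = g0 + g1
        s += (g0 * math.log(g0 / (0.5 * f)) + g1 * math.log(g1 / (0.5 * f))) * dy
    return s

random.seed(3)
estimates = []
for _ in range(40):  # 40 replicates for speed; the paper averages 400 estimates
    n = 200
    xs = [1 if random.random() < 0.5 else 0 for _ in range(n)]
    ys = [random.gauss(2.0 if x else -2.0, 1.0) for x in xs]
    estimates.append(kernel_mi(xs, ys, b=0.4))
mean_est = sum(estimates) / len(estimates)
```

Averaged over replicates, the kernel estimates cluster around the numerically integrated true value, mirroring the comparison reported in Table 1.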
We apply Mathematica to calculate the covariance matrix $\Sigma$ and hence $\sigma$, which serves as the asymptotic approximation of the standard deviation of the estimator $\hat I_n$ in the central limit theorem (17). The last row of Table 1 gives the sample standard deviation of the 400 estimates. These two values also match well.
| | case 1 | case 2 | case 3 | case 4 |
| mean of estimates | 0.01167391 | 0.1991132 | 0.1014199 | 0.2010447 |
The histograms with kernel density fits and the normal Q-Q plots of the estimates show that the values of $\hat I_n$ follow a normal distribution.
We study two examples in the two-dimensional case. Let $t_2(\nu,\mu,\Sigma_0)$ be the two-dimensional Student $t$ distribution with degrees of freedom $\nu$, mean $\mu$ and shape matrix $\Sigma_0$. We study the mixture in two cases; in each case, $Y$ follows the first distribution of a pair of such distributions conditioned on $X=0$ and the second conditioned on $X=1$. Here $I_2$ denotes the $2\times2$ identity matrix used as a shape matrix. Table 2 summarizes the estimates of the mutual information, with a fixed kernel function, bandwidth and sample size for each estimate. As in the one-dimensional case, we apply Mathematica to calculate the true value of the MI and the asymptotic standard deviation $\sigma$, which is given in formula (9). Figure 3 shows the histograms with kernel density fits and normal Q-Q plots of 200 estimates for each example. It is clear that the values of $\hat I_n$ also follow a normal distribution in the two-dimensional case. In summary, the simulation study confirms the central limit theorem stated in (17).
| | case 1 | case 2 |
| mean of estimates | 0.0112381 | 0.2022715 |
The authors thank the editor and the referees for carefully reading the manuscript and for the suggestions that improved the presentation. This research is supported by the College of Liberal Arts Faculty Grants for Research and Creative Achievement at the University of Mississippi. The research of Hailin Sang is also supported by the Simons Foundation Grant 586789.
-  Ahmad, I. A. and Lin, P. E. 1976. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Information Theory. 22, 372-375.
-  Basharin, G. P. 1959. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability and Its Applications. 4, 333-336.
-  Gao, W., Kannan, S., Oh, S. and Viswanath, P. 2017. Estimating mutual information for discrete-continuous mixtures. Advances in Neural Information Processing Systems. 5988-5999.
-  Hall, P. 1987. On Kullback-Leibler loss and density estimation. Ann. Statist. 15, no. 4, 1491-1519.
-  Hall, P. and Morton, S. 1993. On the estimation of entropy. Ann. Inst. Statist. Math. 45, 69-88.
-  Joe, H. 1989. On the estimation of entropy and other functionals of a multivariate density. Ann. Inst. Statist. Math. 41, 683-697.
-  Kozachenko, L. F. and Leonenko, N. N. 1987. Sample estimate of entropy of a random vector. Problems of Information Transmission, 23, 95-101.
-  Laurent, B. 1996. Efficient estimation of integral functionals of a density. Ann. Statist. 24, 659-681.
-  Laurent, B. 1997. Estimation of integral functionals of a density and its derivatives. Bernoulli 3, 181-211.
-  Leonenko, N., Pronzato, L. and Savani, V. 2008. A class of Rényi information estimators for multidimensional densities. Ann. Statist. 36, 2153–2182. Corrections, Ann. Statist. 38 (2010), 3837-3838.
-  Nair, C., Prabhakar, B. and Shah, D. 2006. On entropy for mixtures of discrete and continuous variables. arXiv:cs/0607075
-  Tsybakov, A. B. and van der Meulen, E. C. 1994. Root-n consistent estimators of entropy for densities with unbounded support. Scand. J. Statist., 23, 75-83.
-  Vu, V. Q., Yu, B. and Kass, R. E. 2007. Coverage-adjusted entropy estimation. Statist. Med., 26, 4039-4060.