MI-NEE
Mutual Information Neural Entropic Estimation
We point out a limitation of the mutual information neural estimation (MINE): the network fails to learn in the initial training phase, leading to slow convergence in the number of training iterations. To solve this problem, we propose a faster method called the mutual information neural entropic estimation (MI-NEE). Our solution first generalizes MINE to estimate the entropy using a custom reference distribution. The entropy estimate can then be used to estimate the mutual information. We argue that the seemingly redundant intermediate step of entropy estimation allows one to improve the convergence by an appropriate choice of reference distribution. In particular, we show that MI-NEE reduces to MINE in the special case when the reference distribution is the product of marginal distributions, but faster convergence is possible by choosing the uniform distribution as the reference distribution instead. Compared to the product of marginals, the uniform distribution introduces more samples in low-density regions and fewer samples in high-density regions, which appears to lead to an overall larger gradient for faster convergence.
The measure of mutual information [25] has significant applications in data mining [11, 8]. An advantage of mutual information over other distances or similarity measures is that, in addition to linear correlation, it also captures non-linear functional or statistical dependency between different features. Therefore, it has been used to select, extract and cluster features [21, 13] in an unsupervised way. The measure has firm theoretic ground in information theory, and can be understood as the fundamental limits of the rate-distortion function [26], channel capacity [25], and secrecy capacity [1].
To apply mutual information to practical scenarios in data mining, one has to estimate it from data samples with limited or no knowledge of the underlying distribution. Mutual information estimation is a well-known difficult problem, especially when the feature vectors are continuous or lie in a high-dimensional space [4, 21]. Despite the limitations of the well-known histogram approach [28, 20], there are various other estimation methods, including density estimation using a kernel [18] and the nearest-neighbor distance [15].
A more recent work considers iterative estimation using a neural network, called the mutual information neural estimation (MINE) [2]. Compared to other approaches, MINE appears to inherit the generalization capability of neural networks and can work without a careful choice of parameters. However, as the neural network needs to be trained iteratively by a gradient descent algorithm, one has to monitor the convergence of the estimate and decide when to stop. If the convergence rate is slow, one may have to wait for a long time or terminate prematurely, which can result in underfitting. Indeed, we discovered a simple bivariate mixed Gaussian distribution where MINE converged very slowly, and the problem is more serious in higher-dimensional cases. The objective of this work is to understand and resolve this shortcoming, which is essential before applying the neural estimation to real-world datasets that often have very high dimensions. Despite the huge success in the use of neural networks for various machine learning applications [16, 9, 27, 22, 24], the current understanding of neural networks is limited. A proof of the generalization capability is known only for a very simple model [3].
We propose an alternative route of neural estimation, called the mutual information neural entropic estimation (MI-NEE), that drastically improves the convergence rate. Roughly speaking, MINE uses a neural network to estimate the divergence from the joint distribution to the product of marginal distributions. If we replace the product of the marginal distributions by a known uniform reference distribution, we obtain an estimate of the joint entropy instead of the mutual information, but the convergence rate turns out to be much faster. Since the mutual information can be computed simply from the joint and marginal entropies, and the marginal entropies can be estimated more easily than the joint entropy, we can obtain a faster mutual information estimate than MINE.
Our approach, in the use of a custom reference distribution, may resemble contrastive / ratio estimation methods [12, Sec. 12.2.4, pp 495–497], [10], which provide a neural estimation of the ratio between the unknown and the reference distributions (often by casting the unsupervised problem as a classification problem). However, the objective here is to estimate the KL divergence between the unknown distribution and the reference distribution by maximizing a lower bound, namely, the KL divergence between the neural network’s parameterized distribution and the reference distribution. For more details on the contrastive approach and its relation to MINE, see [19, 23]. Other neural network approaches to estimating density with respect to a reference distribution exist, e.g., in [5, 6], a neural network is used to obtain a deterministic map between a latent random variable with a known distribution and the data.
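We use a sans serif capital letter, e.g., $\mathsf{Z}$, to denote a random vector/variable and the same character in the normal math font, $Z$, for its alphabet set. $p_{\mathsf{Z}}$ denotes the distribution of $\mathsf{Z}$, which is a pdf if $\mathsf{Z}$ is continuous. The support of (the distribution of) $\mathsf{Z}$ is the subset of values in $Z$ with strictly positive probability density. $E$ denotes the expectation operation. For simplicity, all the logarithms are natural logarithms, and so information quantities such as entropy and mutual information are measured in nats instead of bits.
Given continuous random vectors/variables $\mathsf{X}$ and $\mathsf{Y}$ with unknown pdf $p_{\mathsf{X}\mathsf{Y}}(x,y)$ for $x \in X$, $y \in Y$, the goal is to estimate the following Shannon’s mutual information from i.i.d. samples of $(\mathsf{X},\mathsf{Y})$:
\begin{align}
I(\mathsf{X} \wedge \mathsf{Y}) &:= D(p_{\mathsf{X}\mathsf{Y}} \,\|\, p_{\mathsf{X}} p_{\mathsf{Y}}) \tag{1a} \\
&= H(\mathsf{X}) + H(\mathsf{Y}) - H(\mathsf{X},\mathsf{Y}) \tag{1b}
\end{align}
where $D$ and $H$ denote the information divergence and entropy respectively defined as
\begin{align}
D(p_{\mathsf{Z}} \,\|\, p_{\mathsf{Z}'}) &:= E\left[\log \frac{p_{\mathsf{Z}}(\mathsf{Z})}{p_{\mathsf{Z}'}(\mathsf{Z})}\right] \tag{2} \\
H(\mathsf{Z}) &:= E\left[-\log p_{\mathsf{Z}}(\mathsf{Z})\right] \tag{3}
\end{align}
Given a continuous random vector/variable $\mathsf{Z}$ with unknown pdf $p_{\mathsf{Z}}$, we want to estimate $H(\mathsf{Z})$ from i.i.d. samples of $\mathsf{Z}$.
With $\mathsf{Z}$ chosen to be $\mathsf{X}$, $\mathsf{Y}$, and $(\mathsf{X},\mathsf{Y})$ respectively, we obtain estimates of all the entropy terms in (1b), and therefore, the desired estimate of the mutual information.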
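As a quick sanity check of the entropy decomposition of mutual information, the following sketch evaluates the three entropies in closed form for a bivariate Gaussian with unit variances and correlation `rho` (an illustrative choice, not one of the experiments), and compares the result with the known closed form $-\frac{1}{2}\log(1-\rho^2)$.

```python
import math

def gaussian_entropy(cov_det: float, dim: int) -> float:
    """Differential entropy in nats of a dim-dimensional Gaussian
    whose covariance matrix has determinant cov_det."""
    return 0.5 * math.log((2 * math.pi * math.e) ** dim * cov_det)

rho = 0.9  # illustrative correlation, chosen only for this example

h_x = gaussian_entropy(1.0, 1)              # H(X), unit variance
h_y = gaussian_entropy(1.0, 1)              # H(Y), unit variance
h_xy = gaussian_entropy(1.0 - rho ** 2, 2)  # H(X,Y), det = 1 - rho^2

mi_from_entropies = h_x + h_y - h_xy        # entropy decomposition
mi_closed_form = -0.5 * math.log(1.0 - rho ** 2)
print(mi_from_entropies, mi_closed_form)
```

The two printed values agree up to floating point error, which is the identity that lets entropy estimates be combined into a mutual information estimate.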
We derive a neural estimation of the entropy using a custom reference distribution. The desired mutual information estimation then follows from (1b), where MINE is argued to be a special case when the reference distribution is the product of marginal distributions. We end the section by discussing estimations using the uniform reference distribution.
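To estimate the entropy of $\mathsf{Z}$ using a neural network, we rewrite the entropy in terms of the divergence between $p_{\mathsf{Z}}$ and a custom reference distribution $p_{\mathsf{Z}'}$ as:
\begin{align}
H(\mathsf{Z}) = E\left[-\log p_{\mathsf{Z}'}(\mathsf{Z})\right] - D(p_{\mathsf{Z}} \,\|\, p_{\mathsf{Z}'}) \tag{4}
\end{align}
Note that the first term (the cross entropy term) in (4) can be estimated using the sample average
\begin{align}
-\frac{1}{n} \sum_{i=1}^{n} \log p_{\mathsf{Z}'}(\mathsf{Z}_i) \tag{5}
\end{align}
which is an unbiased estimate because $p_{\mathsf{Z}'}$ is a known pdf. For the formula to be valid, the divergence should be bounded, which requires
\begin{align}
\operatorname{supp}(p_{\mathsf{Z}}) \subseteq \operatorname{supp}(p_{\mathsf{Z}'}) \tag{6}
\end{align}
Other than the above restriction, however, one is free to choose any reference $p_{\mathsf{Z}'}$ in the calculation of the entropy in (4). Indeed, not only is there no requirement for $p_{\mathsf{Z}'}$ to be close to $p_{\mathsf{Z}}$, but we will also argue that there is a benefit in choosing $p_{\mathsf{Z}'}$ to be different from $p_{\mathsf{Z}}$, namely, that it can lead to faster convergence of the neural estimate of the divergence.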
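The decomposition of the entropy into a cross entropy term minus a divergence can be checked numerically in a case where everything is known. The sketch below assumes, purely for illustration, a standard Gaussian for the data and a wider Gaussian reference with standard deviation `sigma_ref`; the divergence is then available in closed form, so only the cross entropy is estimated from samples.

```python
import math
import random

random.seed(0)
n = 20000
sigma_ref = 2.0  # std of the illustrative reference N(0, sigma_ref^2)

def log_p_ref(z: float) -> float:
    """Log density of the known reference distribution."""
    return -0.5 * math.log(2 * math.pi * sigma_ref ** 2) - z * z / (2 * sigma_ref ** 2)

# Samples of Z ~ N(0, 1); in practice the pdf of Z would be unknown.
samples = [random.gauss(0.0, 1.0) for _ in range(n)]

# Sample-average estimate of the cross entropy term.
cross_entropy = -sum(log_p_ref(z) for z in samples) / n

# Closed-form divergence D(N(0,1) || N(0, sigma_ref^2)), known here by construction.
divergence = math.log(sigma_ref) + 1.0 / (2 * sigma_ref ** 2) - 0.5

entropy_estimate = cross_entropy - divergence
entropy_true = 0.5 * math.log(2 * math.pi * math.e)
print(entropy_estimate, entropy_true)
```

In the actual method the divergence is not known and is replaced by a neural estimate; this example only illustrates that the reference need not be close to the data distribution for the identity to hold.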
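As in MINE [2], to apply a neural network to estimate the entropy, we rewrite the divergence using the variational formula [7] as follows:
\begin{align}
D(p_{\mathsf{Z}} \,\|\, p_{\mathsf{Z}'}) &= D(p_{\mathsf{Z}} \,\|\, p_{\mathsf{Z}'}) - \inf_{p_{\mathsf{Z}''}} D(p_{\mathsf{Z}} \,\|\, p_{\mathsf{Z}''}) \tag{7a} \\
&= \sup_{p_{\mathsf{Z}''}} E\left[\log \frac{p_{\mathsf{Z}''}(\mathsf{Z})}{p_{\mathsf{Z}'}(\mathsf{Z})}\right] \tag{7b} \\
&= \sup_{f} \left\{ E\left[f(\mathsf{Z})\right] - \log E\left[e^{f(\mathsf{Z}')}\right] \right\} \tag{7c}
\end{align}
In the first equality (7a), the infimum is over the choices of a distribution $p_{\mathsf{Z}''}$ for a random variable $\mathsf{Z}''$ with alphabet set $Z$. Equality holds because the divergence $D(p_{\mathsf{Z}} \,\|\, p_{\mathsf{Z}''})$ is non-negative and equal to $0$ if and only if $p_{\mathsf{Z}''} = p_{\mathsf{Z}}$. The seemingly redundant infimum term plays an important role in the neural estimation. As can be seen in the equality (7b), the term $\log p_{\mathsf{Z}}$ involving the unknown distribution no longer appears inside the expectation. Instead, we have the term $\log p_{\mathsf{Z}''}$, which will be evaluated and optimized by a neural network. More precisely, suppose the neural network computes the function $f$. It can be turned into a probability distribution by the formula
\begin{align}
p_{\mathsf{Z}''}(z) = \frac{e^{f(z)}\, p_{\mathsf{Z}'}(z)}{E\left[e^{f(\mathsf{Z}')}\right]}
\end{align}
which is non-negative and integrates over $z \in Z$ to $1$. Applying this formula to the supremum in (7b) gives the last equality (7c). Since the supremum is achieved uniquely by $p_{\mathsf{Z}''} = p_{\mathsf{Z}}$, it follows that $f$ is optimal to (7c) if and only if
\begin{align}
f(z) = \log \frac{p_{\mathsf{Z}}(z)}{p_{\mathsf{Z}'}(z)} + c \tag{8}
\end{align}
for some constant $c$. In summary, combining (4) and (7c), we have
\begin{align}
H(\mathsf{Z}) = E\left[-\log p_{\mathsf{Z}'}(\mathsf{Z})\right] - \sup_{f} \left\{ E\left[f(\mathsf{Z})\right] - \log E\left[e^{f(\mathsf{Z}')}\right] \right\} \tag{9}
\end{align}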
Note that the objective function in (7c) can be estimated from the samples as
\begin{align}
\frac{1}{n} \sum_{i=1}^{n} f(\mathsf{Z}_i) - \log \frac{1}{n'} \sum_{j=1}^{n'} e^{f(\mathsf{Z}'_j)} \tag{10}
\end{align}
where $\mathsf{Z}_1, \dots, \mathsf{Z}_n$ are i.i.d. samples of $\mathsf{Z}$ and $\mathsf{Z}'_1, \dots, \mathsf{Z}'_{n'}$ are i.i.d. samples of $\mathsf{Z}'$. Although the estimate may have bias from the estimate of the log expectation term $\log E[e^{f(\mathsf{Z}')}]$, we can reduce such bias by choosing $n'$ sufficiently large, which is possible since $p_{\mathsf{Z}'}$ is a known pdf. MINE also has a similar log expectation term but the bias in the estimation of the term and its corresponding gradient may be non-negligible, as the expectation there is with respect to an unknown pdf, namely, the product of the marginal distributions. (We will briefly revisit this in Section 3.3.)
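The sample estimate of the variational objective can be sketched in a few lines. The example below is illustrative only: it assumes a standard Gaussian for the data, a Gaussian reference with standard deviation 2, and plugs in the known optimal function (the log density ratio plus an arbitrary constant) instead of a trained network. The helper names `log_mean_exp` and `divergence_estimate` are hypothetical, and the log of the sample mean is computed with the usual max-shift for numerical stability.

```python
import math
import random

def log_mean_exp(values):
    """Numerically stable log of the average of exp(v) over values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def divergence_estimate(f, data, ref):
    """Sample average of f over the data minus the log-mean-exp of f over the reference."""
    return sum(f(z) for z in data) / len(data) - log_mean_exp([f(z) for z in ref])

def f_opt(z):
    """Optimal function for N(0,1) against N(0,4): log density ratio plus c = 5."""
    return math.log(2.0) - 3.0 * z * z / 8.0 + 5.0

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(20000)]  # samples of Z
ref = [random.gauss(0.0, 2.0) for _ in range(20000)]   # samples of Z'

estimate = divergence_estimate(f_opt, data, ref)
estimate_shifted = divergence_estimate(lambda z: f_opt(z) - 5.0, data, ref)
true_divergence = math.log(2.0) + 1.0 / 8.0 - 0.5       # closed form for this pair
print(estimate, estimate_shifted, true_divergence)
```

The estimate is unchanged by the additive constant, in line with the optimality condition holding only up to a constant.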
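To estimate the supremum in (7c), we apply a neural network as in MINE with parameters $\theta$ that outputs
\begin{align}
f_{\theta}(z) \tag{11}
\end{align}
Define the loss function as the negation of the objective in (7c) with $f = f_{\theta}$,
\begin{align}
L(\theta) := -E\left[f_{\theta}(\mathsf{Z})\right] + \log E\left[e^{f_{\theta}(\mathsf{Z}')}\right] \tag{12}
\end{align}
We can iteratively optimize $\theta$ to maximize the objective in (7c), i.e., to minimize $L(\theta)$, by updating $\theta$ with standard gradient descent algorithms that use minibatch estimates of the gradient
\begin{align}
\nabla L(\theta) = -E\left[\nabla f_{\theta}(\mathsf{Z})\right] + \frac{E\left[e^{f_{\theta}(\mathsf{Z}')}\, \nabla f_{\theta}(\mathsf{Z}')\right]}{E\left[e^{f_{\theta}(\mathsf{Z}')}\right]} \tag{13}
\end{align}
Again, the expectations in the second term can be estimated by any number of samples from the known reference distribution $p_{\mathsf{Z}'}$, and so the bias from the estimate of the expectation in the denominator can be made negligible if desired. In practice, the stochasticity involved in the minibatch estimates somehow avoids overfitting even with an over-parameterized neural network [29, 3], and one can often converge to a good minimum using a small batch size [17]. To maintain such stochasticity for large $n'$, one can simply generate new samples of $\mathsf{Z}'$ for each step of the descent algorithm, which is possible as $p_{\mathsf{Z}'}$ is known.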
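The gradient of the loss has a simple structure: the reference term is a weighted average of the gradients of the function at the reference samples, with weights proportional to the exponential of the function value. The sketch below checks this against central finite differences. It uses a two-parameter quadratic function as a hypothetical stand-in for the neural network, and a uniform reference on an arbitrary interval; both are assumptions made only to keep the example self-contained.

```python
import math
import random

def f(theta, z):
    """Toy stand-in for the network: theta[0]*z + theta[1]*z^2."""
    return theta[0] * z + theta[1] * z * z

def grad_f(z):
    """Gradient of f with respect to theta (independent of theta here)."""
    return (z, z * z)

def loss(theta, data, ref):
    """Sample version of the loss: minus data average plus log-mean-exp over the reference."""
    vals = [f(theta, z) for z in ref]
    m = max(vals)
    lme = m + math.log(sum(math.exp(v - m) for v in vals) / len(vals))
    return -sum(f(theta, z) for z in data) / len(data) + lme

def grad_loss(theta, data, ref):
    """Gradient of the sample loss using exponential weights on the reference samples."""
    vals = [f(theta, z) for z in ref]
    m = max(vals)
    w = [math.exp(v - m) for v in vals]  # the common factor exp(-m) cancels in the ratio
    total = sum(w)
    grads = []
    for k in range(2):
        data_term = sum(grad_f(z)[k] for z in data) / len(data)
        ref_term = sum(wi * grad_f(z)[k] for wi, z in zip(w, ref)) / total
        grads.append(-data_term + ref_term)
    return grads

random.seed(2)
data = [random.gauss(0.0, 1.0) for _ in range(256)]
ref = [random.uniform(-4.0, 4.0) for _ in range(1024)]
theta = [0.3, -0.2]

analytic = grad_loss(theta, data, ref)
eps = 1e-6
numeric = []
for k in range(2):
    up, down = list(theta), list(theta)
    up[k] += eps
    down[k] -= eps
    numeric.append((loss(up, data, ref) - loss(down, data, ref)) / (2 * eps))
print(analytic, numeric)
```

With an automatic differentiation framework this gradient would be obtained directly from the loss; the explicit form is shown only to make the role of the reference samples visible.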
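Altogether, an estimate of $H(\mathsf{Z})$ can be obtained as follows using the estimate (5) of the cross entropy in (4) and the estimate (10) of the divergence in (4), where $f = f_{\theta}$ is optimized by training the neural network (11) for some $N$ times using the loss function (12).
The estimate of the entropy is given by
\begin{align}
\hat{H}(\mathsf{Z}) := -\frac{1}{n} \sum_{i=1}^{n} \log p_{\mathsf{Z}'}(\mathsf{Z}_i) - \frac{1}{n} \sum_{i=1}^{n} f_{\theta_N}(\mathsf{Z}_i) + \log \frac{1}{n'} \sum_{j=1}^{n'} e^{f_{\theta_N}(\mathsf{Z}'_j)} \tag{14}
\end{align}
where $\theta_N$ is the parameter after $N$ steps of the gradient descent algorithm.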
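The whole entropy estimation procedure can be sketched end to end on a toy problem. The example below is a minimal illustration under strong assumptions: the data are standard Gaussian, the reference is uniform on the range of the data, and the neural network is replaced by a one-parameter function `theta * z**2`, which happens to contain the optimal solution for this data. Step size, batch sizes and the number of steps are arbitrary illustrative values, not the settings of the experiments.

```python
import math
import random

random.seed(3)
n = 5000
data = [random.gauss(0.0, 1.0) for _ in range(n)]  # samples of Z
lo, hi = min(data), max(data)
volume = hi - lo                                    # uniform reference density is 1/volume

def f(theta, z):
    """One-parameter stand-in for the network."""
    return theta * z * z

theta, lr = 0.0, 0.05
for step in range(500):
    batch = random.sample(data, 100)
    # Fresh reference samples at every step, possible because the reference is known.
    ref = [random.uniform(lo, hi) for _ in range(400)]
    weights = [math.exp(f(theta, z)) for z in ref]
    grad = (-sum(z * z for z in batch) / len(batch)
            + sum(w * z * z for w, z in zip(weights, ref)) / sum(weights))
    theta -= lr * grad                              # plain stochastic gradient descent

# Final estimate: cross entropy minus the estimated divergence.
ref = [random.uniform(lo, hi) for _ in range(20000)]
cross_entropy = math.log(volume)                    # -log(1/volume) at every data point
divergence = (sum(f(theta, z) for z in data) / n
              - math.log(sum(math.exp(f(theta, z)) for z in ref) / len(ref)))
entropy_estimate = cross_entropy - divergence
entropy_true = 0.5 * math.log(2 * math.pi * math.e)
print(theta, entropy_estimate, entropy_true)
```

For this toy model the optimal parameter is close to $-1/2$, since the optimal function is the log density of the data up to a constant when the reference is uniform.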
We remark that the above estimate is neither a lower nor an upper bound on the entropy because of the possibilities of underfitting due to insufficient training and overfitting due to the use of sample estimates for the training objective. The same issue applies to MINE. Nevertheless, while one can check whether overfitting occurs using a separate validation set, it is hard to tell if there is underfitting without knowing the ground truth. Indeed, the convergence rate of the parameters may be so slow that one may falsely think that the parameters have converged even if they have not. We found that such a situation may be avoided by an appropriate choice of the reference distribution.
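By expressing the mutual information in terms of the entropies in (1b), it is straightforward to obtain a mutual information estimate by estimating the entropies as explained in the previous section. We simplify the estimate further by choosing the reference distributions appropriately so that the cross entropy terms in the entropy estimates cancel out:
For any continuous random vectors/variables $\mathsf{X}$ and $\mathsf{Y}$,
\begin{align}
I(\mathsf{X} \wedge \mathsf{Y}) &= D(p_{\mathsf{X}\mathsf{Y}} \,\|\, p_{\mathsf{X}'} p_{\mathsf{Y}'}) - D(p_{\mathsf{X}} \,\|\, p_{\mathsf{X}'}) - D(p_{\mathsf{Y}} \,\|\, p_{\mathsf{Y}'}) \tag{15a} \\
&= \sup_{f} \left\{ E\left[f(\mathsf{X},\mathsf{Y})\right] - \log E\left[e^{f(\mathsf{X}',\mathsf{Y}')}\right] \right\} - \sup_{g} \left\{ E\left[g(\mathsf{X})\right] - \log E\left[e^{g(\mathsf{X}')}\right] \right\} - \sup_{h} \left\{ E\left[h(\mathsf{Y})\right] - \log E\left[e^{h(\mathsf{Y}')}\right] \right\} \tag{15b}
\end{align}
where $\mathsf{X}'$ and $\mathsf{Y}'$ are independent random variables/vectors with larger support than $\mathsf{X}$ and $\mathsf{Y}$, i.e.,
\begin{align}
\operatorname{supp}(p_{\mathsf{X}}) \subseteq \operatorname{supp}(p_{\mathsf{X}'}) \quad \text{and} \quad \operatorname{supp}(p_{\mathsf{Y}}) \subseteq \operatorname{supp}(p_{\mathsf{Y}'}) \tag{16}
\end{align}
Furthermore, the optimal $f$, $g$ and $h$ satisfy
\begin{align}
f(x,y) - g(x) - h(y) = i(x \wedge y) + c, \quad \text{where } i(x \wedge y) := \log \frac{p_{\mathsf{X}\mathsf{Y}}(x,y)}{p_{\mathsf{X}}(x)\, p_{\mathsf{Y}}(y)} \tag{17}
\end{align}
for some constant $c$.
The desired mutual information estimate can be obtained from the sample estimate of (15b) with $f$, $g$, and $h$ optimized independently using three neural networks as described in the previous section, i.e., with the loss functions chosen as (12) with $\mathsf{Z}$ set to $(\mathsf{X},\mathsf{Y})$, $\mathsf{X}$, and $\mathsf{Y}$ respectively.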
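The cancellation of the cross entropy terms can be verified in closed form for Gaussians. The sketch below assumes, for illustration only, a bivariate Gaussian with unit variances and correlation `rho`, and independent Gaussian references with common variance `ref_var`. It computes the joint divergence minus the two marginal divergences and compares the result with the known mutual information.

```python
import math

def kl_gauss_1d(var: float, ref_var: float) -> float:
    """D(N(0, var) || N(0, ref_var)) in nats."""
    return 0.5 * (var / ref_var - 1.0 + math.log(ref_var / var))

def kl_joint(rho: float, ref_var: float) -> float:
    """D(N(0, [[1, rho], [rho, 1]]) || N(0, ref_var * I)) in nats."""
    return 0.5 * (2.0 / ref_var - 2.0 + math.log(ref_var ** 2 / (1.0 - rho ** 2)))

rho, ref_var = 0.9, 4.0  # illustrative values

# Joint divergence minus the two (identical) marginal divergences.
mi_via_divergences = kl_joint(rho, ref_var) - 2.0 * kl_gauss_1d(1.0, ref_var)
mi_closed_form = -0.5 * math.log(1.0 - rho ** 2)
print(mi_via_divergences, mi_closed_form)
```

The result does not depend on `ref_var`, which reflects the freedom in choosing the reference distribution.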
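Alternatively, one can train a single neural network with three outputs, one for each of $f$, $g$ and $h$. More precisely, construct a neural network with parameters $\theta$, two inputs $x$ and $y$, and three outputs $f_{\theta}(x,y)$, $g_{\theta}(x)$ and $h_{\theta}(y)$. With
\begin{align}
L_f(\theta) &:= -E\left[f_{\theta}(\mathsf{X},\mathsf{Y})\right] + \log E\left[e^{f_{\theta}(\mathsf{X}',\mathsf{Y}')}\right] \\
L_g(\theta) &:= -E\left[g_{\theta}(\mathsf{X})\right] + \log E\left[e^{g_{\theta}(\mathsf{X}')}\right] \\
L_h(\theta) &:= -E\left[h_{\theta}(\mathsf{Y})\right] + \log E\left[e^{h_{\theta}(\mathsf{Y}')}\right]
\end{align}
we update the parameters $\theta$ to minimize the sum of the loss functions (12) evaluated for the three choices of $\mathsf{Z}$, i.e.,
\begin{align}
L_f(\theta) + L_g(\theta) + L_h(\theta)
\end{align}
The mutual information can then be estimated with
\begin{align}
I(\mathsf{X} \wedge \mathsf{Y}) \approx{} & \frac{1}{n} \sum_{i=1}^{n} f_{\theta_N}(\mathsf{X}_i,\mathsf{Y}_i) - \log \frac{1}{n'} \sum_{j=1}^{n'} e^{f_{\theta_N}(\mathsf{X}'_j,\mathsf{Y}'_j)} \tag{18a} \\
& - \frac{1}{n} \sum_{i=1}^{n} g_{\theta_N}(\mathsf{X}_i) + \log \frac{1}{n'} \sum_{j=1}^{n'} e^{g_{\theta_N}(\mathsf{X}'_j)} \tag{18b} \\
& - \frac{1}{n} \sum_{i=1}^{n} h_{\theta_N}(\mathsf{Y}_i) + \log \frac{1}{n'} \sum_{j=1}^{n'} e^{h_{\theta_N}(\mathsf{Y}'_j)} \tag{18c}
\end{align}
where $\theta_N$ is the parameter after training the neural network $N$ times.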
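MINE can be viewed as the special case of the mutual information estimation in the last section when the reference distribution is chosen as the product of the marginal distributions of $\mathsf{X}$ and $\mathsf{Y}$, i.e.,
\begin{align}
p_{\mathsf{X}'} = p_{\mathsf{X}} \quad \text{and} \quad p_{\mathsf{Y}'} = p_{\mathsf{Y}} \tag{19}
\end{align}
In this case, both $D(p_{\mathsf{X}} \,\|\, p_{\mathsf{X}'})$ and $D(p_{\mathsf{Y}} \,\|\, p_{\mathsf{Y}'})$ in (15a) are zero, and so
\begin{align}
I(\mathsf{X} \wedge \mathsf{Y}) &= D(p_{\mathsf{X}\mathsf{Y}} \,\|\, p_{\mathsf{X}} p_{\mathsf{Y}}) \tag{20a} \\
&= \sup_{f} \left\{ E\left[f(\mathsf{X},\mathsf{Y})\right] - \log E\left[e^{f(\mathsf{X}',\mathsf{Y}')}\right] \right\} \tag{20b}
\end{align}
and the optimal solution $f$ satisfies
\begin{align}
f(x,y) = i(x \wedge y) + c \tag{21}
\end{align}
for some constant $c$, where $i(x \wedge y) := \log \frac{p_{\mathsf{X}\mathsf{Y}}(x,y)}{p_{\mathsf{X}}(x)\, p_{\mathsf{Y}}(y)}$ is as defined in (17). With the optimal $f$, the first term in (20b) becomes $I(\mathsf{X} \wedge \mathsf{Y}) + c$, namely a constant shift of the mutual information, while the second term becomes $c$, which cancels out the constant shift to give the desired mutual information.
Note that there is no need to train the neural network for the outputs $g$ and $h$ because the corresponding terms (18b) and (18c) do not appear in (20b). To train the remaining output $f$, one cannot sample from the unknown pdf’s $p_{\mathsf{X}}$ and $p_{\mathsf{Y}}$. Instead, as done in MINE, the samples $\mathsf{X}'_j$’s and $\mathsf{Y}'_j$’s can be obtained by resampling the samples $\mathsf{X}_i$’s and $\mathsf{Y}_i$’s independently. As a result, one cannot arbitrarily reduce the bias in estimating the log expectation term and its gradient in (20b). Different from MINE, we choose the following uniform distribution.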
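The resampling step used when the reference is the product of marginals can be sketched as follows. The function name `resample_product_of_marginals` is hypothetical, and drawing with replacement is one of several reasonable choices (a random permutation of one coordinate is another); the point is only that the two coordinates are drawn independently of each other, so the original pairing is destroyed.

```python
import random

def resample_product_of_marginals(xs, ys, rng):
    """Draw each coordinate independently from its own empirical marginal.

    The returned pairs approximate samples from the product of the marginal
    distributions, but they are limited to values already seen in the data.
    """
    xs_new = [rng.choice(xs) for _ in xs]
    ys_new = [rng.choice(ys) for _ in ys]
    return list(zip(xs_new, ys_new))

rng = random.Random(0)
xs = list(range(1000))
ys = list(range(1000))  # perfectly dependent toy data: y equals x in every pair

pairs = resample_product_of_marginals(xs, ys, rng)
matches = sum(1 for x, y in pairs if x == y)
print(len(pairs), matches)
```

In this toy data every original pair has equal coordinates, while after resampling only about one pair in a thousand does. Since no new values can be generated, the number of effective reference samples is tied to the data, which is the limitation noted above.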
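We obtain the mutual information estimate with the uniform reference
\begin{align}
p_{\mathsf{X}'}(x)\, p_{\mathsf{Y}'}(y) := \frac{1}{V} \quad \text{for } (x,y) \in B \tag{22a}
\end{align}
where $B$ is a bounding box with volume $V$ and containing all the values of $(x,y)$ with
\begin{align}
\min_{i} \mathsf{X}_i \le x \le \max_{i} \mathsf{X}_i \quad \text{and} \quad \min_{i} \mathsf{Y}_i \le y \le \max_{i} \mathsf{Y}_i \tag{22b}
\end{align}
If $\mathsf{X}$ and $\mathsf{Y}$ are vectors, the above minimization, maximization, and inequalities are elementwise.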
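Constructing the uniform reference from data amounts to computing elementwise minima and maxima and sampling each coordinate independently. The sketch below uses hypothetical helper names (`bounding_box`, `sample_uniform`) and synthetic Gaussian data; each data point is the concatenation of a sample of X and the paired sample of Y.

```python
import math
import random

def bounding_box(points):
    """Elementwise minimum and maximum over a list of equal-length tuples."""
    dims = range(len(points[0]))
    lower = [min(p[k] for p in points) for k in dims]
    upper = [max(p[k] for p in points) for k in dims]
    return lower, upper

def sample_uniform(lower, upper, count, rng):
    """Draw count points uniformly from the axis-aligned box [lower, upper]."""
    return [tuple(rng.uniform(a, b) for a, b in zip(lower, upper))
            for _ in range(count)]

rng = random.Random(0)
# Synthetic stand-in for the data: 1000 pairs (x, y).
points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]

lower, upper = bounding_box(points)
volume = math.prod(b - a for a, b in zip(lower, upper))
reference = sample_uniform(lower, upper, 4000, rng)  # any number of reference samples
log_density = -math.log(volume)                      # log of the uniform density on the box
print(lower, upper, volume, log_density)
```

Because the box is a product of intervals, the uniform distribution on it factorizes into independent uniform distributions for the two coordinates, as the reference is required to do.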
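There is, however, a technical issue with the above choice of uniform reference. (15b) is valid only if (16) holds, which requires $B$ to contain all $(x,y)$ with $p_{\mathsf{X}\mathsf{Y}}(x,y) > 0$. However, such a requirement may not be satisfied as $p_{\mathsf{X}\mathsf{Y}}$ is unknown and may have unbounded support. Nevertheless, we argue that the above choice of $B$ can still give a good estimate if the density outside $B$ has negligible contribution to the mutual information. More precisely, define $(\tilde{\mathsf{X}}, \tilde{\mathsf{Y}})$ with density
\begin{align}
p_{\tilde{\mathsf{X}}\tilde{\mathsf{Y}}}(x,y) := \frac{p_{\mathsf{X}\mathsf{Y}}(x,y)}{P\left[(\mathsf{X},\mathsf{Y}) \in B\right]} \quad \text{for } (x,y) \in B,
\end{align}
namely the conditional density of $(\mathsf{X},\mathsf{Y})$ given $(\mathsf{X},\mathsf{Y}) \in B$. Note that $p_{\tilde{\mathsf{X}}\tilde{\mathsf{Y}}}$ goes to $p_{\mathsf{X}\mathsf{Y}}$ as $n$ goes to infinity by (22b). We therefore make the mild assumption that
\begin{align}
I(\tilde{\mathsf{X}} \wedge \tilde{\mathsf{Y}}) \approx I(\mathsf{X} \wedge \mathsf{Y}) \tag{23}
\end{align}
for sufficiently large $n$. Since $(\mathsf{X}_i, \mathsf{Y}_i)$ can also be viewed as samples of $(\tilde{\mathsf{X}}, \tilde{\mathsf{Y}})$, we can estimate $I(\tilde{\mathsf{X}} \wedge \tilde{\mathsf{Y}})$ using the same formula (18). In particular, it is valid to use a uniform reference because its support covers that of $p_{\tilde{\mathsf{X}}\tilde{\mathsf{Y}}}$.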
To evaluate the convergence rate, we plotted the mutual information estimates (18) with uniform reference (22) against the number of training steps and compared the curve to that of MINE. We first consider a simple bivariate mixed gaussian distribution and show that MINE has much slower convergence than our approach even in this low dimensional example. We then consider the higher dimensional case using a basic gaussian distribution and show that our approach can achieve significantly faster convergence rate even with a moderate increase in the dimension.
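The bivariate mixed Gaussian distribution is defined as
\begin{align}
p_{\mathsf{X}\mathsf{Y}} = \frac{1}{2} \mathcal{N}\left(0, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right) + \frac{1}{2} \mathcal{N}\left(0, \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}\right) \tag{24}
\end{align}
where $\mathcal{N}(\mu, \Sigma)$ denotes the multivariate Gaussian distribution over $X \times Y$ with mean $\mu$ and covariance matrix $\Sigma$, and $\rho$ is a model parameter that specifies the positive and negative correlations of $\mathsf{X}$ and $\mathsf{Y}$ for each Gaussian component. The higher dimensional Gaussian distribution is defined as
\begin{align}
p_{\mathsf{X}\mathsf{Y}} = \mathcal{N}\left(0, \begin{bmatrix} I_d & \rho I_d \\ \rho I_d & I_d \end{bmatrix}\right) \tag{25}
\end{align}
In addition to the correlation coefficient $\rho$, there is an additional parameter $d$ that specifies the dimension of $\mathsf{X}$ and $\mathsf{Y}$.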
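For readers who wish to experiment with synthetic data of this kind, the sketch below draws samples from an equal-weight mixture of two zero-mean, unit-variance bivariate Gaussians with correlations `+rho` and `-rho`, and evaluates the closed-form mutual information of a `2d`-dimensional Gaussian whose `d` coordinate pairs are independent with correlation `rho`. The zero means, unit variances, equal mixture weights and the pairwise-independent structure are assumptions of this sketch, and the function names are hypothetical.

```python
import math
import random

def sample_mixed_gaussian(n, rho, rng):
    """Draw n pairs from 0.5*N(0, [[1,rho],[rho,1]]) + 0.5*N(0, [[1,-rho],[-rho,1]])."""
    pairs = []
    for _ in range(n):
        sign = 1.0 if rng.random() < 0.5 else -1.0  # pick a mixture component
        x = rng.gauss(0.0, 1.0)
        # y has unit variance and correlation sign*rho with x.
        y = sign * rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs

def gaussian_mi(rho, d):
    """Mutual information in nats for d independent coordinate pairs with correlation rho."""
    return -0.5 * d * math.log(1.0 - rho * rho)

rng = random.Random(0)
pairs = sample_mixed_gaussian(20000, 0.9, rng)
mean_xy = sum(x * y for x, y in pairs) / len(pairs)  # overall linear correlation
mean_yy = sum(y * y for _, y in pairs) / len(pairs)  # variance of y
print(mean_xy, mean_yy, gaussian_mi(0.9, 1), gaussian_mi(0.9, 6))
```

The overall linear correlation of the mixture is close to zero even though the two variables are strongly dependent, which illustrates why a measure such as mutual information is needed to detect the dependency.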
For the mixed Gaussian model with sample size points, Figure 1 plots the mutual information estimates after training with a batch size of and learning rate of . For MINE, we follow [2] to use moving average in the gradient estimate, where the moving average rate is set to be . For our approach, instead of using a moving average in the gradient estimate, we increase the reference sample size to times the data sample size . For both MINE and our approach, we further apply a moving average of rate to smooth out jitters in the estimates. Figure 1(a) shows that our approach converges to within of the ground truth close to iterations. Figure 1(b) shows that MINE requires close to iterations. Furthermore, MINE exhibits a staircase convergence with two distinct jumps. The estimate remains close to until the first jump at around iterations. The estimate then remains stagnant at a value smaller than of the ground truth until the second jump at around iterations. We remark that the staircase convergence may mislead one to think that the neural network has converged while it has not. We found that the issue can be more serious for smaller values of .
For the higher dimensional Gaussian distribution, we consider with again a sample size of and a batch size of . The learning rate is reduced to to avoid excessive jitters. For our approach, we increase the reference sample size to times the data sample size to reduce the effect of overfitting the reference. Figure 2(a) shows that our approach converges to within of the ground truth close to iterations. However, Figure 2(b) shows that MINE is unable to converge to within of the ground truth even after iterations. (Footnote 1: In [2, Fig. 1], MINE reaches about of the ground truth; however, we were unable to reproduce this result since, to the best of our knowledge, the authors’ parameter choices / code are not publicly available. Nevertheless, our observations remain valid since the comparisons made here between MINE and MI-NEE are performed under comparable parameters / neural network architecture.) Indeed, MINE terminates before iterations due to a numerical instability issue, but further reducing the learning rate causes an excessive slowdown in convergence. In contrast, our approach has slight overfitting as the estimate can go above the ground truth. We found that this issue is more pronounced for higher dimensions, but can be alleviated by increasing the reference sample size at the expense of more computations for each training step. One can also use a separate validation set to terminate the training of each neural network before significant overfitting.
The above results can be reproduced by running the corresponding jupyter notebooks using binder [14] at the GitHub repository below:
Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20(2):189–201, 2009.
Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318, 1995.
NeurIPS Workshop on Bayesian Deep Learning, 2018.