Inferring the statistical dependencies between two variables from a few measured samples is an ubiquitous task in many areas of study. Variables are often linked through non-linear relations, which contain stochastic components. The standard measure employed to quantify the amount of dependency is the mutual information, defined as the reduction in entropy of one of the variables when conditioning the other variable [1, 2]
. If the states of the joint distribution are well-sampled, the joint probabilities can be estimated by the observed frequencies, yielding the maximum-likelihood estimator of mutual information. However, this procedure on average over-estimates the mutual information[3, 4, 5], so that independent variables may appear to be correlated, especially when the number of samples is small.
The search for an estimator of mutual information that remains approximately unbiased even with small data samples is an open field of research [6, 7, 8, 9, 10, 11]. Here we focus on discrete variables, and assume it is not possible to overcome the scarceness of samples by grouping elements that are close according to some metric. In addition to corrections that only work in the limit of large samples , the state of the art for this problem corresponds to quasi-Bayesian methods that estimate mutual information indirectly through measures of the entropies of the involved variables [8, 13, 14]. These approaches have the drawback of not being strictly Bayesian, since the linear combination of two or more Bayesian estimates of entropies does not, in general, yield a Bayesian estimator of the combination of entropies . The concern is not so much to remain within theoretical Bayesian purity, but rather, to avoid frameworks that may be unnecessarily biased, or where negative estimates of information may arise.
Here we propose a new method for estimating mutual information that is valid in the specific case in which there is an asymmetry between the two variables: One of them has a large number of effective states, and the other only a few. No hypotheses are made about the probability distribution of the large-entropy variable, but the marginal distribution of the low-entropy variable is assumed to be well sampled. The prior is chosen so as to accurately represent the amount of dispersion of the conditional distribution of the low-entropy variable around its marginal distribution. The main finding is that our estimator has very low bias, even in the severely under-sampled regime where there are few coincidences, that is, when a given state of the large-entropy variable is only seldom sampled more than once. The key data statistics that determine the estimated information is the inhomogeneity of the distribution of the low-entropy variable in those states of the high-entropy variable where two or more samples are observed. In addition to providing a practical algorithm to estimate mutual information, our approach sheds light on the way in which just a few samples reveal those specific properties of the underlying joint probability distribution that determine the amount of mutual information.
2 Bayesian approaches to the estimation of entropies
We seek a low-bias estimate of the mutual information between two discrete variables. Let
be a random variable with a large numberof effective states with probabilities , and be a variable that varies in a small set , with . Given the conditional probabilities , the marginal and joint probabilities are and , respectively. The entropy is
and can be interpreted as the average number of well-chosen yes/no questions required to guess the sampled value of (when using a logarithm of base two). The conditional entropy is the average uncertainty of the variable once is known,
The mutual information is the reduction in uncertainty of one variable once we know the other 
Our aim is to estimate when is well sampled, but is severely undersampled, in particular, when the sampled data contain few coincidences in . Hence, for most values , the number of samples is too small to estimate the conditional probability from the frequencies . In fact, when , the maximum likelihood estimator typically underestimates severely , and consequently leads to an overestimation of .
One possibility is to estimate and using a Bayesian estimator, and then plug the obtained values in Eq. 3 to estimate the mutual information. We now discuss previous approaches to Bayesian estimators for entropy, to later analyze the case of information. For definiteness, we focus on , but the same logic applies to , or .
The Bayesian estimator is the expected value of , where are unknown probabilities , and represents the number of sampled data obtained in each state. That is,
Since is the multinomial distribution
and since the normalization constant can be calculated from the integral
the entire gist of the Bayesian approach is to find an adequate prior to plug into Eqs. 4, 5 and 6. For the sake of analytical tractability, is often decomposed into a weighted combination of distributions that can be easily integrated, each tagged by one or or a few parameters, here generically called , that vary within a certain domain,
The decomposition requires to introduce a prior . Hence, the former search for an adequate prior is now replaced by the search for an adequate prior . The replacement implies an assumption and also a simplification. The family of priors that can be generated by Eq. 7 does not encompass the entire space of possible priors. The decomposition relies on the assumption that the remaining family is still rich enough to make good inference about the quantity of interest, in this case, the entropy. The simplification stems from the fact that the search for is more restricted than the search for , because the space of possible alternatives is smaller (the dimensionality of is typically high, whereas the one of is low). Two popular proposals of Bayesian estimators for entropies are NSB  and PYM . In NSB, the functions are Dirichlet distributions, in which takes the role of a concentration parameter. In PYM, these functions are Pitman-Yor processes, and stands for two parameters: one accounting for the concentration, and the other for the so-called discount. In both cases, the Bayesian machinery implies
where is the weight of each in the estimation of the expected entropy
When choosing the family of functions , it is convenient to select them in such a way that the weight can be solved analytically. However, this is not the only requirement. In order to calculate the integral in , the prior also plays a role. The decomposition of Eq. 7 becomes most useful when the arbitrariness in the choice of is less serious than the arbitrariness in the choice of . This assumption is justified when is peaked around a specific value, so that in practice, the shape of hardly has an effect. In these cases, a narrow range of relevant
values is selected by the sampled data, and all assumptions about the prior probability outside this range play a minor role. For the choices of the familiesproposed by NSB and PYM, can be calculated analytically, and one can verify that indeed, a few coincidences in the data suffice for a peak to develop. In both cases, the selected is one for which favours a range of values that are compatible with the measured data (as assessed by ), and also produce non-negligible entropies (Eq. 9).
When the chosen Bayesian estimates of the entropies are plugged into Eq. 3 to obtain an estimate of the information, each term is dominated by its own preferred . Since the different entropies are estimated independently, the values selected by the data to dominate the priors and need not be compatible with the ones dominating the priors of the joint or the conditional distributions. As a consequence, the estimation of the mutual information is no longer Bayesian, and can suffer from theoretical issues, as for example, yield a negative estimate .
A first alternative would be to consider an integrable prior containing a single for the joint probability distribution , and then replace by in the equations above, to calculate . This procedure was tested by Archer et al. , and the results were only good when the collection of values governing the data were well described by a distribution that was contained in the family of proposed priors . The authors concluded that mixtures of Dirichlet priors do not provide a flexible enough family of priors for highly-structured joint distributions, at least for the purpose of estimating mutual information.
To make progress, we note that can be written as
where and stand for the
-dimensional vectorsand , and
represents the Kullback-Leibler divergence. The average divergence betweenand captures a notion of spread. Therefore, the mutual information is sensitive not so much to the value of the probabilities , but rather, to their degree of scatter around the marginal . The parameters controlling the prior should hence be selected in order to match the width of the distribution of values, and not so much each probability. With this intuition in mind, in this paper we put forward a new prior for the whole ensemble of conditional probabilities obtained for different values. In this prior, the parameter controls the spread of the conditionals around the marginal .
3 Prior distribution for conditional entropies
Our approach is valid when the total number of samples is at least of the order of magnitude of , since in this regime, some of the states are expected to be sampled more than once [15, 16]. In addition, the marginal distribution must be well sampled. This regime is typically achieved when has a much larger set of available states than . In this case, the maximum likelihood estimators of the marginal probabilities can be assumed to be accurate, that is,
In this paper, we put forward a Dirichlet prior distribution centered at , that is,
where contains the conditional probabilities corresponding to different values. Large values select conditional probabilities close to , while small values imply a large spread, that pushes the selection towards the border of the -simplex.
For the moment, for simplicity we work with a priordefined on the conditional probabilities , and make no effort to model the prior probability of the vector . In practice, we estimate the values of with the maximum likelihood estimator . Since is assumed to be severely undersampled, this is a poor procedure to estimate . Still, the effect on the mutual information turns out to be negligible, since the only role of in Eq. 10 is to weigh each of the Kullback-Leibler divergences appearing in the average. If is large, each value will appear in several terms of the sum, rendering the individual value of the accompanying irrelevant, only the sum of them matters. In Sect. 6
, we tackle the full problem of making Bayesian inference both inand .
The choice of prior of Eq. 12 is inspired in three facts. First, captures the spread of around , as implied by the Kullback-Leibler divergence in Eq. 12. Admittedly, this divergence is not exactly the one governing the mutual information (Eq. 10), since and are interchanged. Yet, it is still a measure of spread. The exchange, as well as the denominator in Eq. 12, were introduced for the sake of the second fact, namely, analytical tractability. The third fact regards the emergence of a single relevant when the sampled data begin to register coincidences. If we follow the Bayesian rationale of the previous section, now replacing the entropy by the mutual information, we can again define a weight for the parameter
where can be obtained analytically, and is a well behaved function of its arguments, whereas
For each , the vector varies in a -dimensional simplex. For we take the multinomial
The important point here, is that the ratio of the Gamma functions of Eq. 13 develops a peak in as soon as the collected data register a few coincidences in . Hence, with few samples, the prior proposed in Eq. 12 renders the choice of inconsequential.
Assuming that the marginal probability of is well-sampled, the entropy is well approximated by the maximum-likelihood estimator . For each , the expected posterior information can be calculated analytically,
where is the digamma function. When the system is well sampled, , so the effect of becomes negligible, the Digamma functions tend to logarithms, and the frequencies match the probabilities. In this limit, Eq. 15 coincides with the maximal likelihood estimator, which is consistent. The rest of the paper focuses on the case in which the marginal probability of is severely undersampled.
4 A closer look on the case of a symmetric and binary -variable
In this section, for simplicity we take , such that nats. In this case, the Dirichlet prior of Eq. 12 becomes a Beta distribution
Large values of mostly select conditional probabilities close to . If all conditional probabilities are similar, and similar to the marginal, the mutual information is low, since the probability of sampling a specific value hardly depends on . Instead, small values of produce conditional probabilities around the borders ( or ). In this case, is strongly dependent on (see Fig. 1 b), so the mutual information is large. The expected prior mutual information can be calculated using the analytical approach developed by [17, 14],
The prior information is a slowly-varying function of the order of magnitude of , namely of . Therefore, if a uniform prior in information is desired, it suffices to choose a prior on such that ,
When , the expected posterior information (Eq. 15) becomes
The marginal likelihood of the data given is also analytically tractable. The likelihood is binomial for each , so
The posterior for can be obtained by adding a prior , as . The role of the prior becomes relevant when the number of coincidences is too low for the posterior to develop a peak (see below).
In order to gain intuition about the statistical dependence between variables with few samples, we here highlight the specific aspects of the data that influence the estimator of Eq. 19. Grouping together the terms of Eq. (20) that are equal, the marginal likelihood can be rewritten in terms of the multiplicities , that is, the number of states with specific occurrences or ,
The posterior for is independent from states with just a single count, as . Only states with coincidences matter. In order to see how the sampled data favor a particular , we search for the value that maximizes in the particular case where at most two samples coincide on the same , obtaining
Denoting the fraction of -count states that have one count for each value as , Eq. 23 implies that if , and , otherwise. If the -values are independent of , we expect . This case corresponds to a large and, consequently, to a low information. On the other side, for small , the parameter is also small and the information grows.
In Eq. 23, the data only intervene through and , which characterize the degree of asymmetry of the values throughout the different states. This asymmetry, hence, constitutes a sufficient statistics for . If a prior is included, the that maximizes the posterior may shift, but the effect becomes negligible as the number of coincidences grows.
We now discuss the role of the selected in the estimation of information, Eq. (19), focusing on the conditional entropy . First, in terms of the multiplicities, the conditional entropy can be rewritten as
where is the fraction of the samples that fall in states with counts, and is the fraction of all states with counts that have for one -value (whichever) and for the other. Finally, is the estimation of the entropy of a binary variable after samples,
A priori, , as in Fig. 1d. Surprisingly, from the property , it turns out that (in fact, ). Hence, if only a single count breaks the symmetry between the two
values, there is no effect on the conditional entropy. This is a reasonable result, since a single extra count is no evidence of an imbalance between the underlying conditional probabilities, it is just the natural consequence of comparing the counts falling on an even number of states (2) when taking an odd number of samples. Expanding the first terms for the conditional entropy,
In the severely under-sampled regime, these first terms are the most important ones. Typically, takes most of the weight, and Eq. (26) implies that the estimation is close to the prior evaluated in the value of that maximizes the marginal likelihood (or the posterior).
Finally, we mention that when dealing with few samples, it is important to have not just a good estimate of the mutual information, but also, a confidence interval. Even a small information may be relevant, if the evidence attests that it is strictly above zero. The theory developed here also allows us to estimate the posterior variance of the mutual information, as shown in the Appendix. The variance (Eq.33) is shown to be inversely proportional to the number of states , thereby implying that our method benefits from a large number of available states , even if undersampled.
5 Testing the estimator
We now analyze the performance of our estimator in three examples where the number of samples is below or in the order of the effective size of the system . In this regime, most observed states have very few samples. In each example, we define the probabilities and with three different criteria, giving rise to collections of probabilities that can be described with varying success by the prior proposed in this paper, Eq. 16. Once the probabilities are defined, the true value of the mutual information can be calculated, and compared to the one estimated by our method, as well as by three other estimators employed in the literature, in 50 different sets of samples of the measured data. As our estimator we use from Eq. (19) evaluated in the that maximizes the marginal likelihood . We did not observe any improvement when integrating over the whole posterior with the prior of Eq. 18, except when or were of order 1. This fact implies the existence of a well-defined peak in the marginal likelihood.
In the first example (Fig. 2a, d), the probabilities are obtained by sampling a Pitman-Yor distribution with concentration parameter and tail parameter . These values correspond to a PYM prior with a heavy tail. The conditional probabilities are defined by sampling a symmetric Beta distribution , as in Eq. 16. In Fig. 2a, we use . Once the joint probability is defined, 50 sets of samples are generated. The effective size of the system is . We compare our estimator to maximum likelihood (ML), NSB and PYM when applied to and (all methods coincide in the estimation of ). Our estimator has a low bias, even when the number of samples per effective state is as low as . The variance is larger than ML, comparable to NSB and smaller than PYM. All the other methods (ML, NSB and to a lesser extend PYM) overestimate the mutual information. In Fig. 2d, the performance of the estimators is also tested for different values of the exact mutual information , which we explore by varying . For each , the conditional probabilities are sampled once. Each vector contains samples, and is sampled times. Our estimates have very low bias, even as the mutual information goes to zero —namely, for independent variables.
Secondly, we analyze an example where the statistical relation between and is remarkably intricate (example inspired by ), which underscores the fact that making inference about the mutual information does not require inferences on the joint probability distribution. The variable is a binary vector of dimension . Each component represents the presence or absence of one of a maximum of delta functions equally spaced on the surface of a sphere. There are possible vectors, and they are governed by a uniform prior probability: . The conditional probabilities are generated in such a way that they be invariant under rotations of the sphere, that is, , where is a rotation. Using a spherical harmonic representation , the frequency components of the spherical spectrum are obtained, where is the combination of deltas. The conditional probabilities are defined as a sigmoid function of . The offset of the sigmoid is chosen such that , and the gain such that nats. In this example, and unlike the Dirichlet prior implied by our estimator, has some level of roughness (inset in Fig. 2b), due to peaks coming from the invariant classes in . Hence, the example does not truly fit into the hypothesis of our method. With these settings, the effective size of the system is . Our estimator has little bias (Figs. 2b, e), even with samples per effective state. In this regime, around of the samples fall on states that occur only once (), on states that occur twice and on states with counts, or maybe . As mentioned above, in such cases, the value of is very similar to the one that would be obtained by evaluating the prior information of (Eq. 17) at the that maximizes the marginal likelihood , which in turn is mainly determined by . In Fig. 2e, the estimator is tested with a fixed number of samples for different values of the mutual information, which we explore by varying the gain of the sigmoid. The bias of the estimate is small in the entire range of mutual informations.
In the third place, we consider an example where the conditional probabilities are generated from a distribution that is poorly approximated by a Dirichlet prior. The conditional probabilities are sampled from three Dirac deltas, as , with . The delta placed in could be approximated by a Dirichlet prior with a large , while the other two deltas could be approximated by a small , but there is no single value of that can approximate all three deltas at the same time. The states are generated as Bernoulli () binary vectors of dimension , while the conditional probabilities depend on the parity of the sum of the components of the vector . When the sum is even, we assign , and when it is odd, we assign or , both options with equal probability. Although in this case our method has some degree of bias, it still preserves a good performance in relation to the other approaches (see Fig. 2c, f). The marginal likelihood contains a single peak in an intermediate value of , coinciding with none of the deltas in , but still capturing the right value of the mutual information. As in the previous examples we also test the performance of the estimator for different values of the mutual information, varying in this case the value of (with ). Our method performs acceptably for all values of mutual information. The other methods, instead, are challenged more severely, probably because a large fraction of the states have a very low probability, and are therefore difficult to sample. Those states, however, provide a crucial contribution to the relative weight of each of the three values of . PYM, in particular, sometimes produces a negative estimate for .
Finally, we check numerically the accuracy of the analytically predicted mean posterior information (Eq. 19) and variance (Eq. 33) in the severely under-sampled regime. The test is performed in a different spirit than the numerical evaluations of Fig. 2. There, averages were taken for multiple samples of the vector , from a fixed choice of the probabilities and . The averages of Eqs. 19 and Eq. 33, however, must be interpreted in the Bayesian sense. The square brackets in and represent averages taken for a fixed data sample , and unknown underlying probability distributions and . We generate many such distributions with (a Dirichlet Process with concentration parameter ) and . A total of distributions are produced, with sampled from Eq. 18, and three equiprobable values of . For each of these distributions we generate five (5) sets of just samples, thereby constructing a list of cases, each case characterized by specific values of and . Following the Bayesian rationale, we partition this list in classes, each class containing all the cases that end up in the same set of multiplicities —for example, . For each of the most occurring sets of multiplicities (which together cover of all the cases), we calculate the mean and the standard deviation of the mutual information of the corresponding class, and compare them with our predicted estimates and , using the prior from Eq. (18). Figure 3 shows a good match between the numerical (-axis) and analytical (-axis) averages that define the mean information (panel a) and the standard deviation (b). The small departures from the diagonal stem from the fact that the analytical average contains all the possible and , even if some of them are highly improbable for one given set of multiplicities. The numerical average, instead, includes the subset of the 13,500 explored cases that produced the tested multiplicity. All the depicted subsets contained many cases, but still, they remained unavoidably below the infinity covered by the theoretical result.
We have also tested cases where takes more than two values, and where the marginal distribution is not uniform, observing similar performance of our estimator.
6 A prior distribution for the large entropy variable
The prior considered so far did not model the probability of the large-entropy variable . Throughout the calculation, the probabilities were approximated by the maximum likelihood estimator . Here we justify such procedure by demonstrating that proper Bayesian inference on hardly modifies the estimation of the mutual information. To that end, we replace the prior of Eq. 12 by another prior that depends on both and .
The simplest hypothesis is to assume that the prior factorizes as , implying that the marginal probabilities are independent of the conditional probabilities . We propose , so that the marginal probabilities are drawn from a Dirichlet Process with concentration parameter , associated to the total number of pseudo-counts. After integrating in and in , the mean posterior mutual information for fixed hyper-parameters and is
Before including the prior , in the severely undersampled regime the mean posterior information was approximately equal to the prior information evaluated in the best (Eq. 15). The new calculation (Eq. 27) contains the prior information explicitly, weighted by , that is, the ratio between the number of pseudo-counts from the prior and the total number of counts. Thereby, the role of the non-observed (but still inferred) states is established.
The independence assumed between and implies that
The inference over coincides with the one of PYM with the tail parameter as , since
where is the number of states with at least one sample. With few coincidences in , develops a peak around a single value that represents the number of effective states. Compared to the present Bayesian approach, maximum likelihood underestimates the number of effective states (or entropy) in . Since the expected variance of the mutual information decreases with the square root of the number of effective states, the Bayesian variance is reduced with respect to the one of ML.
In this work we propose a novel estimator for mutual information of discrete variables and , which is adequate when has a much larger number of effective states than . If this condition does not hold, the performance of the estimator breaks down. We inspire our proposal in the Bayesian framework, in which the core issue can be boiled down to finding an adequate prior. The more the prior is dictated by the data, the less we need to assume from outside. Equation 10 implies that the mutual information is the spread of the conditional probabilities of one of the variables (for example, , but the same holds for ) around the corresponding marginal ( or , respectively). This observation inspires the choice of our prior (Eq. 12), which is designed to capture the same idea, and in addition, to be analytically tractable. We choose to work with an hyper-parameter that regulates the scatter of around , and not the scatter of around , because the asymmetry in the number of available states of the two variables makes the of the first option (and not the second) strongly modulated by the data, by the emergence of a peak in .
Although our proposal is inspired in previous Bayesian studies, the procedure described here is not strictly Bayesian, since our prior (Eq. 12) requires the knowledge of , which depends on the sampled data. However, in the limit in which is well sampled, this is a pardonable crime, since is defined by a negligible fraction of the measured data. Still, Bayesian purists should employ a two-step procedure to define their priors. First, they should perform Bayesian inference on the center of the Dirichlet distribution of Eq. 12 by maximizing , and then replace in Eq. 12 by the inferred . For all practical purposes, however, if the conditions of validity of our method hold, both procedures lead to the same result.
By confining the set or possible priors to those generated by Eq. 12 we relinquish all aspiration to model the prior of, say, , in terms of the observed frequencies at . In fact, the preferred value is totally blind to the specific value of each sampled datum. Only the number of -values containing different counts of each -value matters. Hence, the estimation of mutual information is performed without attempting to infer the specific way the variables and are related, a property named equitability , and that is shared also by other methods [13, 8, 14]. Although this fact may be seen as a disadvantage, deriving a functional relation between the variables can actually bias the inference on mutual information . Moreover, fitting a relation is unreasonable in the severe under-sampled regime, in which not all -states are observed, most sampled -states contain a single count, and few -states contain more than two counts. At least, without a strong assumption about the probability space. In fact, if the space of probabilities of the involved variables has some known structure or smoothness condition, other approaches that estimate information by fitting the relation first may perform well [9, 10, 11]. Part of the approach developed here could be extended to continuous variables or spaces with a determined metric. This extension is left for future work.
The main result of the paper, is that our estimator has small bias, even in the severely under-sampled regime. It outperforms other estimators discussed in the literature (at least, when the conditions of validity hold), and by construction, it never produces negative values. More importantly, it even works in cases where the collection of true conditional probabilities is not contained in the family of priors generated by , as demonstrated by the second and third examples of Sect. 5. In these cases, the success of the method relies on the peaked nature of the posterior distribution for . Even if the selected provides a poor description of the actual collection of probabilities, the dominant captures the right value of mutual information. This is the sheer instantiation of the equitability property discussed above.
Our method provides also a transparent way to identify the statistics that matter, out of all the measured data. Quite naturally, the states that have not been sampled provide no evidence in shaping , as indicated by Eq. 13, and only shift the posterior information towards the prior (Eq. 27). More interestingly, the states with just a single count are also irrelevant, both in shaping
and in modifying the posterior information away from the prior. These states are unable to provide evidence about the existence of either flat or skewed conditional probabilities. Only the states that have been sampled at least twice contribute to the formation of a peak in , and in deviating the posterior information away from the prior.
Several fields can benefit from the application of our estimator of mutual information. Examples can be found in neuroscience, when studying whether neural activity (a variable with many possible states) correlates with a few selected stimuli or behavioral responses [12, 21, 22], or in genomics, to understand associations between genes (large-entropy variable) and a few specific phenotypes . The method can also shed light into the development of rate-distortion methods to be employed in situations in which only a few samples are available [24, 25]. The possibility of detecting statistical dependences with only few samples is of key importance, not just for analyzing data sets, but also to understand how living organisms quickly infer dependencies in their environments and adapt accordingly .
This research was funded by CONICET, CNEA, ANPCyT Raíces 2016 grant number 1004.
We thank Ilya Nemenman for his fruitful comments and discussions.
Appendix A Expected variance for a symmetric, binary -variable
The posterior variance of the mutual information is
In the first place, we demonstrate that this quantity is proportional to , implying that our estimator becomes increasingly accurate as the number of states of the -variable increases. Given that
it is easy to show that
In other words, if the marginal entropy is well sampled, the variance in the information is mainly due to the variance in the conditional entropy. In turn, is defined as an average of terms (Eq. 32). The independence hypothesis implied in the prior of Eq. 12, and in the way different and factor out in (Eq. 5), imply that the different terms of (Eq. 32) are all independent of each other. The average of independent terms has a variance proportional to 1/, so the estimator proposed here becomes increasingly accurate as grows.
We now derive the detailed dependence of on the sampled data . The mean conditional entropy can be written in terms of , that is, of the entropy of the variable for a particular state with counts at fixed
Similarly, for the second moment,