The Design of Mutual Information

by Nicholas Carrara, et al.

We derive the functional form of mutual information (MI) from a set of design criteria and a principle of maximal sufficiency. The (MI) between two sets of propositions is a global quantifier of correlations and is implemented as a tool for ranking joint probability distributions with respect to said correlations. The derivation parallels the derivations of relative entropy with an emphasis on the behavior of independent variables. By constraining the functional I according to special cases, we arrive at its general functional form and hence establish a clear meaning behind its definition. We also discuss the notion of sufficiency and offer a new definition which broadens its applicability.








I Introduction

The concept of correlation in inference is as old as the subject itself. From the very beginning, statisticians have been interested in quantifying relationships between propositions when they have incomplete information. Intuitively, one would say that these relationships are quantified by a joint, or a conditional, probability distribution. Given some propositions $x$ and $y$, the conditional distribution $p(x|y)$ tells me what I should believe about $x$ given information about $y$. It is true, however, that correlations tend to be discussed in more qualitative terms, rather than strictly in terms of a joint density. In this respect it is often something we think of as being some global property of a whole system. Given an entire joint distribution, we would like to assign it one number that quantifies its global correlations. For generalized distributions, however, the behavior of the joint density can be vastly different in different regions. So, if we think correlation is some global description, how does one assign a meaning to the correlations in these generalized distributions when the behavior is local? If we change correlations locally, how does that affect the definition of global correlations? Is there a way to assign a consistent quantitative description to the idea of global correlations so that we can compare two joint distributions with respect to them?

In statistics, the term correlation typically refers to Pearson’s correlation coefficient [4], which is defined as a weighted covariance between two sets of variables. While a useful quantity, the correlation coefficient only captures linear relationships between variables, and hence more complicated situations can lead one to think that there are no correlations present, even when the variables are maximally correlated (Footnote 1: The easiest example to see this is a circle: the covariance of the two coordinates vanishes, yet the coordinates are completely dependent.). Being linear, it depends on the choice of coordinates, and hence is not coordinate invariant, which is highly problematic. One attempt at fixing the correlation coefficient was suggested by Szekely et al. [5] in their paper on distance correlation. While the distance correlation seems to solve the problems that occur with correlation coefficients, it still seemingly depends on the underlying geometry, which is undesirable.
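The circle example can be checked numerically. The following is a minimal sketch, assuming evenly spaced points on the unit circle (an illustrative choice, not taken from the paper): the Pearson coefficient comes out at zero to floating-point precision even though every sample satisfies the same deterministic constraint.

```python
import numpy as np

# Evenly spaced points on the unit circle (an illustrative assumption).
theta = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
x, y = np.cos(theta), np.sin(theta)

# Pearson's coefficient only sees linear structure, so it vanishes here.
pearson_r = np.corrcoef(x, y)[0, 1]

# The variables are nevertheless tied together: x**2 + y**2 == 1 on every sample.
max_radius_error = np.max(np.abs(x**2 + y**2 - 1.0))

print(f"Pearson r = {pearson_r:.2e}, max |x^2 + y^2 - 1| = {max_radius_error:.2e}")
```

A vanishing coefficient therefore says nothing about whether the two coordinates are dependent, which is the motivation for a measure that is not tied to linear relationships.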

Mutual information (MI) as a measure of correlation has a long history, beginning with Shannon’s seminal work on communication theory [6] in which he first defines it. While Shannon provided arguments for the functional form of his entropy [6], he did not provide a derivation of (MI). Beginning with Jaynes [7], there have been many improvements on the derivations of entropy in the decades since, most notably by Shore and Johnson [3], Skilling [2], and Caticha [1]. More recently, Vanslette has provided a derivation for both the standard relative entropy and the quantum relative entropy [8]. Despite this, there has still been no principled approach to the design of (MI). In this paper, we suggest a set of criteria for deriving (MI) as a tool for ranking correlations. The relative entropy allows one to quantify the notion of information by defining it operationally; information is what changes your mind. In much the same way, the mutual information allows one to quantify the notion of correlation. Unlike entropy, (MI) is only useful in a relative setting. It is not the number that you compute that matters, but rather the comparison between two numbers, hence the emphasis on ranking, which was actually part of the motivation for entropy in [2].

Some of the first practical applications of (MI) were in Rate-Distortion theory (RDT) [9], and later in the infoMax principle [10]. (RDT) considers the problem of reconstructing a signal from a noisy channel, and by defining a distortion function, one minimizes the expected distortion with respect to the mutual information. Modern applications of this idea appear in the Information-Bottleneck Method (IBM) [11, 12], which in place of the distortion constraint places a constraint on a particular (MI). The (IBM) also gets inspiration from the infoMax principle, which states that one should attempt to maximize the mutual information between some set of input variables and the output of a machine learning algorithm.

The relative entropy [1] addresses the singular question of how to update one’s beliefs when new information becomes available. As we will show in the next section, entropic updating accounts for one of many different families of transformations that are typically used in inference tasks. What we will attempt to derive is in some sense more general in that it helps one quantify many more types of transformations, although it is not necessarily a tool for how to change one’s mind. While the relative entropy addresses the question of how to incorporate new information, the (MI) addresses the question of whether a transformation of a statistical manifold preserves correlations.

In this paper we adopt the view of Cox [13] in that probabilities are degrees of rational belief. Inductive inference is the practice of quantifying beliefs; it is the assigning of probabilities to propositions. Probabilities manifest due to our lack of information about said set of propositions. Using the Cox approach, one can derive the sum and product rules from first principles [13, 14] and hence arrive at Bayes’ theorem. While the Cox axioms tell us how to manipulate probabilities when the information available is fixed, a new set of axioms is needed in order to determine how one should update their beliefs when new information becomes available. Such a set of axioms leads to the functional form of the relative entropy, which is a generalized updating scheme [15, 16]. We apply a similar logic here; we derive the mutual information as a tool for ranking correlations. Thus, it is not the individual values of (MI) which are of interest but rather the comparison of two or more values when one performs some kind of inferential transformation. We will consider several families of inferential transformations, but there are a few special cases which are of interest. These special cases are what allow us to constrain the functional form of (MI).

An important consequence of deriving mutual information as a tool for ranking is its immediate application to the notion of sufficiency. Sufficiency dates back to Fisher, and some would argue Laplace [17], both of whom were interested in finding statistics that contained all relevant information about a sample. Such statistics are called sufficient; however, this notion is only a binary label, so it does not quantify an amount of sufficiency. To make the idea more flexible, we propose a new definition of sufficiency which is simply the ratio of two mutual informations. For any given set of variables, one can find the global correlations present by computing the (MI). Then, for any function of one set of the variables, one can determine the sufficiency of the function by computing the (MI) after the variables have passed through the function and taking the ratio of the two (MI) values. Such a quantity gives a sense of how close the function is to being a sufficient statistic.

In section II we will outline the basic problem and state the general principle concerning transformations in statistical manifolds. We will then state the design criteria and discuss some of their immediate consequences. In section III we will impose the design criteria to derive the functional form of (MI). In section IV we’ll explore more of the consequences of the functional form of (MI), discuss sufficiency, its relation to the Neyman-Pearson lemma [18] and other statistical quantities. We conclude with a discussion.

II The Design Criteria

We wish to design a functional $I[X;Y]$ which captures a global property of some joint distribution $p(x,y)$, where $X$ and $Y$ are some generic sets of propositions (either discrete or continuous). In particular we are concerned with quantifying the correlations present in $p(x,y)$, which amounts to determining the properties of the conditional probabilities $p(x|y)$ and $p(y|x)$. While the idea of correlation can be difficult to quantify, much like the difficulties with quantifying a notion of information, we can, at least at this stage, say some very basic things about what we mean by correlation. In the most vague sense, correlation refers to conditional dependency; hence if the joint distribution factorizes, i.e. the sets $X$ and $Y$ are independent, $p(x,y) = p(x)\,p(y)$, then it is reasonable to claim that the sets $X$ and $Y$ are uncorrelated. Furthermore, if the set $Y$ is related to $X$ through some deterministic function $f$ (which need not be bijective), i.e. $y = f(x)$, then the joint distribution becomes $p(x,y) = p(x)\,\delta(y - f(x))$, and while $X$ may still contain a certain amount of uncertainty, as quantified by its probability distribution, the value of $y$ is completely determined by $x$. It would also be reasonable then to claim that this distribution represents one of maximal correlation. What is not so clear is how we assign a meaning to the word correlation when the distribution is in between these two edge cases.
This paper is in essence an exercise in finding which design criteria lead to the functional we think is the correct one for describing global correlations. While there are an unknown number of constraints we could impose on such a functional, we will see that the design criteria which lead to the desired one capture the spirit of what we would expect for any global measure of correlations. In order to motivate our main principle, we will enumerate a list of general types of transformations that occur in statistical manifolds. There are four major ones: coordinate transformations, entropic updating (Footnote 2: This of course includes Bayes rule as a special case [15, 16].), marginalization and products. Coordinate transformations, or as we will call them type I transformations, are the following

$$p(x)\,dx = p'(x')\,dx', \tag{II.1}$$

where $x' = f(x)$, which induces a diffeomorphism of the statistical manifold. While the densities $p(x)$ and $p'(x')$ are not necessarily equal, the probabilities defined in (II.1) must be (according to the rules of probability theory). Type II transformations are those induced by updating,

$$p(x) \;\to\; p(x|y) = \frac{p(x)\,p(y|x)}{p(y)}, \tag{II.2}$$

which is Bayes rule. These types of transformations belong to a much larger group induced by entropic updating,

$$S[p,q] = -\int dx\, p(x)\log\frac{p(x)}{q(x)}, \tag{II.3}$$

where $q(x)$ is the prior. Maximizing (II.3) with respect to constraints induces a translation in the statistical manifold. Translations in the statistical manifold are often thought of as reparametrizations, which in the measure-theoretic language is just the Radon-Nikodym theorem [19] (Footnote 3: The (RNT) is a general theorem concerning $\sigma$-finite measures that requires some more involved mathematics; however, here it is a rather trivial property of probability densities.),

$$p(x) = \frac{dP}{dQ}\, q(x), \tag{II.4}$$

where $P$ and $Q$ are the measures with densities $p(x)$ and $q(x)$, and $dP/dQ$ is the Radon-Nikodym derivative. The (RND) is defined if the zeroes of the distributions $p$ and $q$ map to each other; i.e. $P$ is absolutely continuous with respect to $Q$. When the statistical manifold is parameterized by the densities $p(x)$, the zeroes always lie on the boundary of the simplex (Footnote 4: In this representation the statistical manifolds have a trivial topology; they are all simply connected.). Both type I and type II transformations are maps from the statistical manifold to itself. Type III transformations are induced by marginalization,

$$p(x,y) \;\to\; p(x) = \int dy\, p(x,y), \tag{II.5}$$

which is effectively a quotienting of the statistical manifold; i.e. for any point $x$, we equivocate all values of $y$. Type IV transformations are created by products,

$$p(x) \;\to\; p(x,y) = p(x)\,p(y|x), \tag{II.6}$$

which are a kind of inverse transformation of type III. There are many different situations that can arise from this type, a most trivial one being an embedding,

$$p(x) \;\to\; p(x,y) = p(x)\,\delta(y - f(x)), \tag{II.7}$$

which can be useful in many applications. We will denote such a transformation as type IVa. Another trivial example of type IV is,

$$p(x) \;\to\; p(x,y) = p(x)\,p(y), \tag{II.8}$$

which we will call type IVb. This set of transformations is not necessarily exhaustive, but is sufficient for our discussion in this paper. From here we state the central principle,
The Principle Of Maximal Sufficiency. There are no transformations of type I, III, IVa or IVb that can create correlations.

Transformations of type IV can only ever increase correlations (with the exception of IVa and IVb), while type III can only ever decrease correlations. Transformations of type II can either increase or decrease correlations. Essentially we are claiming that only by updating our beliefs can our beliefs about the relationship between two sets of variables increase; all other types of transformations necessarily destroy correlations. We would like to derive a quantity, $I[X;Y]$, for the purpose of ranking joint distributions with respect to their correlations; i.e. a distribution which has greater correlations than another will have a greater value of $I[X;Y]$.
The first design criterion concerns the additivity of local correlations. It mirrors the design criteria for the relative entropy in many iterations [1, 14, 3, 2],

Design Criteria 1.

Given two sets of propositions, $X$ and $Y$, the global functional $I[X;Y]$ should be additive in local correlations.

There are two consequences of this statement. The first concerns the additivity of subsets of some joint space of propositions $X \times Y$. If we consider any two subsets of the joint space, say $\mathcal{D}_1$ and $\mathcal{D}_2$, which are mutually disjoint,

$$\mathcal{D}_1 \cap \mathcal{D}_2 = \emptyset, \tag{II.9}$$

then the mutual information should be the sum of the individual mutual informations over the two domains. This is a consistency requirement that allows one to break up the joint space any way one wishes and still obtain the same global value. We assume that local correlations, whatever they may happen to be, can be expressed as a function of the joint distribution and its marginals, $F\big(p(x,y), p(x), p(y), x, y\big)$. This is a reasonable assumption, since we want to consider the conditional dependence between $x$ and $y$; however, the correlations may also depend on how $x$ or $y$ behave when information about the other is lost, hence the possible dependence on the marginals. Since the functional must be additive in the local evaluations of $F$, it can be written as an integral (in the continuous case),

$$I[X;Y] = \int dx\,dy\; F\big(p(x,y), p(x), p(y), x, y\big). \tag{II.10}$$

Locality is often invoked as a design criterion for relative entropy, such as in [1] (Footnote 5: From Caticha’s design criteria, “Axiom I: Locality. Local information has local effects.”). In the case of designing relative entropy as a tool for updating probabilities when new information becomes available, locality is crucial. To see this, consider the following. If one obtains new information about the space which does not depend on a particular subset (or subdomain) of the data, then the posterior distribution one gets on that subdomain should be equal to the prior; one should not change one’s mind when no new information about a subset is given. Thus the conditional distribution on that subdomain is not updated.
The second consequence of DC1 concerns independence among subspaces of $X$ and $Y$. Essentially, we could either consider independent subsets, as in eq. (II.9), or we could consider independent subspaces which are of different dimension than the total space $X \times Y$. These results are usually implemented as two separate design criteria for relative entropy. Shore and Johnson’s approach [3] presents four axioms, of which III and IV are subsystem and subset independence. Subset independence (Footnote 6: The definition from Shore and Johnson’s paper: “It should not matter whether one treats an independent subset of system states in terms of a separate conditional density or in terms of the full system density.”) in their framework corresponds to eq. (II.9) and to the Locality axiom of Caticha [1]. It also appears as an axiom in the approach by Skilling [2] (Footnote 7: Skilling’s axiom on subset independence reads, “Let $I_1$ be information pertaining only to $f(x)$ for $x \in D_1$ and similarly let $I_2$ pertain only to $f(x)$ for $x \in D_2$. Then, if $D_1$ and $D_2$ are disjoint, …” The function $f$ refers to the reconstruction of an image, defined with respect to some Lebesgue measure.).
The axiom concerning subsystem independence appears in all three approaches [1, 2, 3] and the flavor of the argument is generally the same; if any two variables and are independent, then we should be able to consider their densities either separately, or together.

Design Criteria 2.

The mutual information should be an increasing functional of correlations.

  Whatever correlations may be, we demand that if a distribution has more correlations than another, that its value of mutual information be larger. Thus, any distributions which have the same correlations, should have the same value of mutual information. Saying that two probability distributions have the same correlations is a bit vague, since we could mean that they have the exact same set of local correlations, i.e. their joint distributions are related by a coordinate transformation, however this statement is much more general. Consider the situation in which the set contains two sets of propositions, , so that the joint distribution is,


If however , then the joint distribution reduces to


which is independent of , hence the mutual information,


should have the same value as simply the distribution over the remaining sets. This is the essence of the general character of DC2. We consider the more general case of DC2 in the next section.
DC2 implies some other conditions on , namely,

Corollary 1.

Coordinates carry no information

  This is usually stated as a design criteria for relative entropy [3, 1, 2], however here it’s implied by DC2. The reason is that for any bijective map from , if the global correlations were allowed to change, we would expect situations where,


however, since a coordinate transformation is bijective, we can always apply the inverse ,


and if we consider the joint distribution over all four sets of variables we find,


and so in either case the distribution reduces to dependence on the pairs or , and hence the joint distribution in (II.16) must have the same global correlations as and .

II.1 Redundancy and Noise

We can quantify some simple notions, such as redundancy and noise, which are the special cases IVa and IVb, using the design criteria. Given a joint space with a joint distribution , we can define the global correlations present in by the mutual information . If the space is a collection of several variables, then the joint distribution can be written


If the conditional probability is independent of then we say that is redundant. This is equivalent to the condition in (II.12), which says that the mutual information of the full set and the set without , (), are equivalent,


Hence, the correlations in are redundant. While the condition that leads to the same mutual information as can be satisfied by (II.18), it is not necessary that . In an extreme case, we could have that is independent of both and ,


In this case we say that the variable is noise, meaning that it adds dimensionality to the space without adding correlations. In the redundant case, the variable does not add dimension to the manifold . In general, each set of variables will contain some amount of redundancy and some amount of noise. We could always perform a coordinate transformation that takes and where,


where are redundant and noisy variables respectively and are the parts left over that contain the relevant correlations. Then the joint distribution becomes,


Thus we have that,


These types of transformations can be exploited by algorithms to reduce the dimension of the space to simplify inferences. This is precisely what machine learning algorithms are designed to do [20]. One particular effort to use mutual information directly in this way is the Information Sieve [21]. Another is the Information Bottleneck Method [22].
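The redundancy and noise statements can be checked on a small discrete example once the final logarithmic form of the (MI) is taken as given. The sketch below is illustrative only: the joint table, the noise distribution and the map `x1 % 2` are hypothetical choices, and redundancy is modelled by a variable that is a deterministic function of `x1`, which is one way (assumed here) of making the conditional on `y` independent of the extra variable.

```python
import numpy as np

def mutual_information(p_ab):
    """Discrete MI (in nats) between the row variable and the column variable."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of the row variable
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of the column variable
    mask = p_ab > 0                         # zero-probability cells contribute nothing
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

# Hypothetical joint table p(x1, y): 3 values of x1, 2 values of y.
p_x1y = np.array([[0.30, 0.05],
                  [0.10, 0.20],
                  [0.05, 0.30]])

# Noise: x2 is independent of both x1 and y, p(x1, x2, y) = p(x2) p(x1, y).
p_x2 = np.array([0.6, 0.4])
p_noise = np.einsum("b,ay->aby", p_x2, p_x1y).reshape(-1, 2)

# Redundancy (assumed model): x2 = x1 mod 2 is a deterministic function of x1.
p_redundant = np.zeros((3, 2, 2))
for x1 in range(3):
    p_redundant[x1, x1 % 2, :] = p_x1y[x1, :]
p_redundant = p_redundant.reshape(-1, 2)

i_base = mutual_information(p_x1y)
i_noise = mutual_information(p_noise)
i_redundant = mutual_information(p_redundant)
print(i_base, i_noise, i_redundant)
```

In both cases the pair `(x1, x2)` carries exactly the same (MI) with `y` as `x1` alone, so dropping `x2` loses no correlations.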

III Implementing the Design Criteria

The first design criterion DC1 constrains the functional to the form given by eq. (II.10). Since the space $X \times Y$ can be partitioned in any manner we wish, consistency requires the global correlations to be a sum of local contributions. By local we mean the probability density for a given proposition $x$ and $y$, where the joint distribution is $p(x,y)$. For this local distribution, the probability of $(x,y)$ is independent of other values $(x',y')$ which are not $(x,y)$, and hence the correlations in $p(x,y)$ should add to the total contribution.
From the integral in (II.10) we can apply the consequences of DC2 in a series of steps. First, the corollary (II.12) implies that the function $F$ should be independent of the coordinates $(x,y)$. Thus we have that eq. (II.10) reduces to,

$$I[X;Y] = \int dx\,dy\; F\big(p(x,y), p(x), p(y)\big). \tag{III.1}$$

While we assume at this point that $F$ must be coordinate independent, we can write this condition explicitly by introducing a density $m(x,y)$,


This is similar to the steps found in the relative entropy derivation [3, 1]. The density $m(x,y)$ ensures that the function depends only on ratios of densities, and hence the expression is explicitly coordinate invariant. DC2 as realized in eqs. (II.12) and (II.19) provides an even stronger restriction on $I[X;Y]$ which we can find by appealing to a special case. Since all distributions with the same correlations should have the same value of $I[X;Y]$, all independent joint distributions will also have the same value, which at this point we assume is just some minimum value,

$$p(x,y) = p(x)\,p(y) \;\Rightarrow\; I[X;Y] = I_{\min}. \tag{III.3}$$

Inserting this into (III.2) we find,


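But this expression must be independent of the underlying distribution $p(x)\,p(y)$, since all independent distributions regardless of the joint space $X \times Y$ must give the same value $I_{\min}$. Thus we conclude that the density $m(x,y)$ must be the product marginal,

$$m(x,y) = p(x)\,p(y). \tag{III.5}$$

And so the expression in (III.2) becomes,

$$I[X;Y] = \int dx\,dy\; p(x)\,p(y)\,\Phi\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right), \tag{III.6}$$

where the function $\Phi$ is constant whenever $p(x,y) = p(x)\,p(y)$. As a matter of convenience, we can recast the function $\Phi$ by factoring out its argument so that the functional appears as an expected value,

$$I[X;Y] = \int dx\,dy\; p(x,y)\,\phi\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right), \tag{III.7}$$

$$\Phi(z) = z\,\phi(z). \tag{III.8}$$
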
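In order to specify the function $\phi$ in (III.7), or equivalently $\Phi$, we appeal again to DC1 and consider a special case. Assume that the spaces $X$ and $Y$ are each the product of two spaces which are independent from each other, i.e. $X = X_1 \times X_2$ and $Y = Y_1 \times Y_2$, and that the pairs $(X_1, Y_1)$ and $(X_2, Y_2)$ are also independent, so that the joint distribution factors

$$p(x_1, x_2, y_1, y_2) = p(x_1, y_1)\,p(x_2, y_2), \tag{III.9}$$

which upon inserting into (III.7) gives,

$$I[X;Y] = \int dx_1\,dy_1\,dx_2\,dy_2\; p(x_1,y_1)\,p(x_2,y_2)\;\phi\!\left(\frac{p(x_1,y_1)}{p(x_1)\,p(y_1)}\,\frac{p(x_2,y_2)}{p(x_2)\,p(y_2)}\right). \tag{III.10}$$

However, according to DC2, since $(X_1, Y_1)$ and $(X_2, Y_2)$ are independent, they cannot be redundant, and hence must each contain their own local correlations. Thus, using DC1, we can consider their local contributions separately,

$$I[X;Y] = I[X_1;Y_1] + I[X_2;Y_2]. \tag{III.11}$$

This comes from the fact that DC1 requires the mutual information to be a sum of local contributions, and since each set of distributions is independent, they must be additive in their global correlations.
If we impose that eqs. (III.10) and (III.11) be equivalent, then $\phi$ must satisfy $\phi(ab) = \phi(a) + \phi(b)$, whose unique solution (up to a multiplicative constant) is the logarithm, which gives the final form of the mutual information,

$$I[X;Y] = \int dx\,dy\; p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}. \tag{III.12}$$

We then have the additivity of local correlations occurring in two places, both of which are imposed by DC1. Independence of subsets leads to the appearance of the integral, while independence of subspaces gives the logarithm.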
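The two properties used in the derivation can be verified on a discrete analogue of the final form. This is a minimal sketch with hypothetical probability tables (`p1`, `p2` are illustrative, not from the paper): independent distributions give the minimum value, which is zero for the logarithm, and the (MI) of two independent subsystems is the sum of their individual values.

```python
import numpy as np

def mutual_information(p_xy):
    """Discrete MI in nats: sum of p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over y, shape (n, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over x, shape (1, m)
    mask = p_xy > 0                         # terms with p(x,y) = 0 contribute nothing
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

# Independent case: p(x, y) = p(x) p(y).
i_independent = mutual_information(np.outer([0.3, 0.7], [0.5, 0.5]))

# Two hypothetical, mutually independent subsystems (X1, Y1) and (X2, Y2).
p1 = np.array([[0.4, 0.1],
               [0.1, 0.4]])
p2 = np.array([[0.3, 0.2, 0.0],
               [0.1, 0.1, 0.3]])

# Joint over ((x1, x2), (y1, y2)) with p(x1, x2, y1, y2) = p1(x1, y1) p2(x2, y2).
joint = np.einsum("ac,bd->abcd", p1, p2).reshape(
    p1.shape[0] * p2.shape[0], p1.shape[1] * p2.shape[1])

i_joint = mutual_information(joint)
i_sum = mutual_information(p1) + mutual_information(p2)
print(i_independent, i_joint, i_sum)
```

The agreement between `i_joint` and `i_sum` is the discrete counterpart of equating the product form with the sum of local contributions.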

III.1 Consequences of the Design Criteria

One can check that indeed (III.12) satisfies the conditions on redundancy. Given that , the mutual information reduces to,


Likewise if is noise, , then the delta function in (III.13) can be replaced with and the result is the same. While in the case of the mutual information breaks up into a sum over each set and , in general we have that,


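which is essentially the grouping property of relative entropy [6]. The notation $I[\,\cdot\,;\,\cdot\,|\,\cdot\,]$ is often called conditional mutual information [9]. In general we have the chain rule,

$$I[X_1, \dots, X_n; Y] = \sum_{i=1}^{n} I[X_i; Y \,|\, X_1, \dots, X_{i-1}]. \tag{III.15}$$
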
The Data Processing Inequality -

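Typically the data processing inequality is demonstrated as a consequence of the definition of (MI). The argument follows from defining the Markov chain,

$$X \to Y \to Z, \qquad p(x,y,z) = p(x)\,p(y|x)\,p(z|y). \tag{III.16}$$

Then, we can always consider the (MI) between $X$ and the pair $(Y, Z)$,

$$I[X;Y,Z] = I[X;Z] + I[X;Y|Z] = I[X;Y] + I[X;Z|Y]. \tag{III.17}$$

The conditional (MI) is however,

$$I[X;Z|Y] = \int dx\,dy\,dz\; p(x,y,z)\log\frac{p(x,z|y)}{p(x|y)\,p(z|y)}, \tag{III.18}$$

which is zero since $p(x,z|y) = p(x|y)\,p(z|y)$. Since (MI) is positive, we then have that for the Markov chain (III.16),

$$I[X;Y] \geq I[X;Z]. \tag{III.19}$$

Equality is achieved only when $I[X;Y|Z]$ is also zero, that is, when $Z$ is a sufficient statistic of $Y$ for $X$. We will discuss the idea of sufficient statistics in a later section.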
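A quick numerical check of the data processing inequality on a discrete Markov chain is sketched below. The prior and the two channels are hypothetical numbers chosen only for illustration; the second channel is compared against a relabelling of `y`, for which no information is lost.

```python
import numpy as np

def mutual_information(p_ab):
    """Discrete MI (in nats) for a joint probability table p_ab[i, j]."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

# Hypothetical Markov chain X -> Y -> Z (rows of each channel sum to one).
p_x = np.array([0.5, 0.3, 0.2])
p_y_given_x = np.array([[0.8, 0.1, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])
p_z_given_y = np.array([[0.9, 0.1],
                        [0.5, 0.5],
                        [0.1, 0.9]])

p_xy = p_x[:, None] * p_y_given_x           # p(x, y) = p(x) p(y|x)
p_xz = p_xy @ p_z_given_y                   # p(x, z) = sum_y p(x, y) p(z|y)

i_xy = mutual_information(p_xy)
i_xz = mutual_information(p_xz)

# A bijective relabelling of y loses nothing, so equality holds.
relabel = np.eye(3)[[2, 0, 1]]              # deterministic, invertible channel
i_xz_bijective = mutual_information(p_xy @ relabel)
print(i_xy, i_xz, i_xz_bijective)
```

The noisy, many-to-few channel strictly lowers the (MI) in this example, while the invertible relabelling leaves it unchanged.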

Upper and Lower Bounds -

The lower bound could have been imposed after eq. (III.6) and the logarithm would have been the solution, however there was no reason a priori to assume that the lower bound was zero. The upper bound can be found by using the case of complete correlation, ,


which is the relative entropy of with respect to . If is a coordinate transformation, i.e. is a bijection, then eq. (III.20) becomes,


Hence, the (MI) is unbounded from above in the continuous case. In the discrete case we find,

$$I[X;Y] = -\sum_{y} p(y)\log p(y) = H[Y], \tag{III.22}$$

where $H[Y]$ is the Shannon entropy. Since (III.22) does not depend on the functional form of $f$, the upper bound is simply the Shannon entropy of one of the two variables. This can be seen by expanding the discrete (MI) as a sum of Shannon entropies,

$$I[X;Y] = H[X] + H[Y] - H[X,Y]. \tag{III.23}$$

In the case of complete correlation ($y = f(x)$), the joint Shannon entropy becomes,

$$H[X,Y] = H[X], \tag{III.24}$$

and so the upper bound is,

$$I_{\max}[X;Y] = H[Y]. \tag{III.25}$$

If $f$ is a bijection, then the entropies $H[X] = H[Y]$, since $y$ is just a reparametrization of $x$ and hence the probabilities $p(x) = p(y)$.
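The discrete upper bound can be confirmed directly. The sketch below uses a hypothetical four-valued distribution and two hypothetical maps, one many-to-one and one bijective; in both cases the (MI) between `x` and `f(x)` equals the Shannon entropy of `f(x)`, and for the bijection this coincides with the entropy of `x`.

```python
import numpy as np

def mutual_information(p_ab):
    """Discrete MI (in nats) for a joint probability table p_ab[i, j]."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

def shannon_entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def joint_from_function(p_x, f_values, n_y):
    """Joint table of (x, y) when y = f(x) deterministically."""
    joint = np.zeros((len(p_x), n_y))
    joint[np.arange(len(p_x)), f_values] = p_x
    return joint

p_x = np.array([0.1, 0.2, 0.3, 0.4])        # hypothetical distribution over x

# Many-to-one map: f(0) = f(1) = 0, f(2) = 1, f(3) = 2.
joint_many = joint_from_function(p_x, [0, 0, 1, 2], 3)
i_many = mutual_information(joint_many)
h_y_many = shannon_entropy(joint_many.sum(axis=0))

# Bijective map: a permutation of the four labels.
joint_bij = joint_from_function(p_x, [2, 0, 1, 3], 4)
i_bij = mutual_information(joint_bij)
h_x = shannon_entropy(p_x)
print(i_many, h_y_many, i_bij, h_x)
```

The many-to-one map gives a smaller value than the bijection because merging labels lowers the entropy of `f(x)`.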

IV Sufficiency

There is a large literature on the topic of sufficiency [9, 23] which dates back to work originally done by Fisher [24]. Some have argued that the idea dates back to even Laplace [17], a hundred years before Fisher. What both were trying to do ultimately, was determine whether one could find statistics which contain all possible information about some parameter. Let be a joint distribution over some variables and some parameters we wish to infer . Consider then a function , and also the joint density,


If is a sufficient statistic for with respect to , then the above equation becomes,


and the conditional probability doesn’t depend on . Fisher’s factorization theorem states that a sufficient statistic for will give the following relation,


where and are functions that are not necessarily probabilities; i.e. they are not normalized with respect to their arguments, however since the left hand side is certainly normalized with respect to , then the right hand side must be as well. We can rewrite eq. (IV.2) in terms of the distributions,


Relating eq. (IV.2) and (IV.4) we find,


which upon rearranging gives


We can then identify which only depends on and which is the ratio of two probabilities and hence, not normalized with respect to .

Minimal Sufficient Statistics -

As in the previous section consider an inference problem with some set of continuous variables and a parameter we wish to infer quantified by the probability distribution . Imagine that we are able to generate a family of sufficient statistics, , such that


A sufficient statistic is called minimal, if it can be written as a function of all other sufficient statistics; i.e. . To see the effect of this, consider the joint distribution,


where . The idea that is minimal is essentially the statement that the image of necessarily has a cardinality equal to or smaller than all other sufficient statistics; . The space is the smallest representation of which contains all relevant information about .

IV.1 A New Definition of Sufficiency

While the notion of a sufficient statistic is useful, how can we quantify the sufficiency of a statistic which is not completely sufficient but only partially? The mutual information can provide an answer. We define the sufficiency of a statistic $f(X)$ as simply the ratio of mutual informations,

$$s[f] = \frac{I[f(X);\theta]}{I[X;\theta]}, \tag{IV.9}$$

which is always bounded by $0 \leq s[f] \leq 1$. While there are many choices for such a measure, this measure seems most appropriate given that the upper bound for $I$ is potentially infinite. Statistics for which $s[f] = 1$ are called sufficient and correspond to the definition given by Fisher. We can see this by appealing to the special case of a deterministic statistic $y = f(x)$. It is true that,


so that,


This recovers the criterion for $f$ to be a sufficient statistic. With this definition of sufficiency (IV.9) we have a way of evaluating maps $f$ which attempt to preserve correlations between $X$ and $\theta$. These procedures are ubiquitous in machine learning [20], manifold learning and other inference tasks.
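The ratio of mutual informations can be evaluated on a toy inference problem. The following is a sketch under stated assumptions: two coin flips `x` with a bias `theta` drawn from two hypothetical values with equal prior weight, and two candidate statistics (the number of heads, and the first flip alone). The helper names `pushforward` and `sufficiency` are illustrative, not from the paper.

```python
import itertools
import numpy as np

def mutual_information(p_ab):
    """Discrete MI (in nats) for a joint probability table p_ab[i, j]."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

def pushforward(p_x_theta, f_values):
    """Joint table of (f(x), theta): rows of p(x, theta) with equal f(x) are summed."""
    labels = sorted(set(f_values))
    out = np.zeros((len(labels), p_x_theta.shape[1]))
    for row, value in zip(p_x_theta, f_values):
        out[labels.index(value)] += row
    return out

def sufficiency(p_x_theta, f_values):
    """Ratio I[f(X); theta] / I[X; theta]."""
    return mutual_information(pushforward(p_x_theta, f_values)) / mutual_information(p_x_theta)

# Hypothetical setup: two flips of a coin whose bias is 0.2 or 0.8 with equal prior.
biases, prior = [0.2, 0.8], [0.5, 0.5]
outcomes = list(itertools.product([0, 1], repeat=2))
p_x_theta = np.array([[prior[j] * np.prod([b if flip else 1.0 - b for flip in x])
                       for j, b in enumerate(biases)]
                      for x in outcomes])

s_sum = sufficiency(p_x_theta, [sum(x) for x in outcomes])    # number of heads
s_first = sufficiency(p_x_theta, [x[0] for x in outcomes])    # first flip only
print(s_sum, s_first)
```

The head count retains all of the correlations with `theta`, so its ratio is one, while the first flip alone retains only part of them and lands strictly between zero and one.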

The Likelihood Ratio -

Here we will associate the invariance of (MI) to invariance of type I and type II errors. Consider a binary decision problem in which we have some set of discriminating variables $X$ thrown according to two distributions (signal and background) labeled by a parameter $\theta$. The inference problem can then be cast in terms of the joint distribution $p(x,\theta)$. According to the Neyman-Pearson lemma [18], the likelihood ratio,


gives a sufficient statistic for the significance level,


where is typically associated to the null hypothesis. This means that the likelihood ratio (IV.12) will allow us to determine if the data satisfies the significance level in (IV.13). Given Bayes’ theorem, the likelihood ratio is equivalent to,


which is the posterior ratio and is just as good a statistic, since is a constant for all . If we then construct a sufficient statistic for , such that,


then the posterior ratios, and hence the likelihood ratios, are equivalent,


and hence the significance levels are also invariant,


and therefore the type I and type II errors will also be invariant. Thus we can think of (MI) as a tool for finding the type I and type II errors for some unknown probability distribution by constructing some sufficient statistic using some technique (typically a ML technique), and then finding the type I and type II errors on the simpler distribution.
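The equality of likelihood ratios under a sufficient statistic can be illustrated with a small example. This sketch assumes a hypothetical signal/background problem in which `x` is a sequence of three coin flips and the statistic is the number of heads; the bias values and function names are illustrative only.

```python
import itertools
from math import comb

# Hypothetical signal and background hypotheses: two coin biases.
theta_signal, theta_background = 0.8, 0.2
n_flips = 3

def likelihood_of_sequence(x, theta):
    """p(x | theta) for a full sequence of flips x."""
    heads = sum(x)
    return theta**heads * (1.0 - theta)**(len(x) - heads)

def likelihood_of_count(heads, n, theta):
    """p(f(x) | theta) for the statistic f(x) = number of heads (binomial)."""
    return comb(n, heads) * theta**heads * (1.0 - theta)**(n - heads)

max_relative_gap = 0.0
for x in itertools.product([0, 1], repeat=n_flips):
    heads = sum(x)
    ratio_x = (likelihood_of_sequence(x, theta_signal)
               / likelihood_of_sequence(x, theta_background))
    ratio_f = (likelihood_of_count(heads, n_flips, theta_signal)
               / likelihood_of_count(heads, n_flips, theta_background))
    max_relative_gap = max(max_relative_gap, abs(ratio_x - ratio_f) / ratio_x)

print(max_relative_gap)
```

Because the two ratios agree for every sequence, any threshold placed on the ratio selects the same events whether it is computed from the full data or from the statistic, so the error rates are unchanged.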
Apart from its invariance, we can also show another consequence of (MI) under arbitrary transformations for binary decision problems. Imagine that we successfully construct a sufficient statistic $f(x)$ for $x$. Then, it is a fact that the likelihood ratios computed from $x$ and from $f(x)$ will be equivalent for all $x$. Consider that we adjust the probability of one value of the statistic by shifting the relative weight of signal and background for that particular value,


where is some small change, so that the particular value of


which is not equal to the value given from the sufficient statistic. Whether the value is larger or smaller than , in either case either the number of type I or type II errors will increase for the distribution with replaced for the sufficient value . Therefore, for any distribution given by the joint space , the (MI) determines the type I and type II error for any statistic on the data .

V Discussion

As we have seen, one can attach a meaning to (MI) through the specification of the design criteria which lead to its functional form. It is not surprising that concepts like subspace and subsystem independence are the required constraints for deriving it since it has similar properties to that of the relative entropy [1, 2, 3]. While we did not motivate (MI) from a variational principle, one can easily develop one given the properties that (MI) offers. There have been several of these throughout the literature including Rate-Distortion theory [9], infoMax [10], the information bottleneck method [11, 12] and the information sieve [21], to name a few. (MI) can also be used as an updating scheme when the prior is the product marginal distribution. Then, by introducing constraints which induce correlations, one can minimize the mutual information subject to those constraints. The easiest application is for Gaussians.

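Consider a multivariate Gaussian distribution over $N$ variables,

$$p(x) = \frac{1}{\sqrt{(2\pi)^N \det\Sigma}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right], \tag{V.1}$$

where $x$ is the vector for the $N$-variables, $\mu$ is the vector of means, and $\Sigma$ is the covariance matrix defined by,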