Given truth values for propositions $A$ and $B$, the truth value of the conjunction $A \wedge B$ is fully determined by the truth table for the logical connective ‘AND’. But if we are only able to assign non-extremal probabilities $a$ and $b$ to these propositions, then the joint probability $P(A \wedge B)$ is not fully determined. Further information, such as the conditional probability $P(A \mid B)$, is required. The essence of the problem is that the space of joint distributions over two Boolean values has three dimensions, of which the marginals constrain only two. Nevertheless, some guesses are better than others. Under Bayesian principles [jaynes03, Caticha:2008eso], the best guess is the one that minimizes the expected information loss under the distribution that maximizes the entropy over what is not known, given whatever is known. Here we apply this principle to obtain an optimal estimate for $P(A \wedge B)$ given $a$ and $b$. We apply the principle again, under the still more information-sparse condition of knowing neither $a$ nor $b$, to obtain the expected information loss for any given aggregation operator that estimates $P(A \wedge B)$ from these marginals.
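The dimension-counting argument above can be made concrete with a small numerical sketch (the numbers and the cell ordering $(p_{00}, p_{01}, p_{10}, p_{11})$ are conventions chosen here, not taken from the text): two joints share the same marginals yet assign different probabilities to the conjunction, and the marginals pin the conjunction probability down only to an interval.

```python
# Two joint distributions over (A, B) with identical marginals but different
# conjunction probabilities: the marginals fix only two of the three degrees
# of freedom.  Cell ordering (p00, p01, p10, p11) is a convention chosen here.

def marginals(joint):
    """Return (P(A), P(B)) for a joint over two Booleans."""
    p00, p01, p10, p11 = joint
    return (p10 + p11, p01 + p11)

joint_indep = (0.25, 0.25, 0.25, 0.25)   # independent: P(A and B) = 0.25
joint_corr = (0.40, 0.10, 0.10, 0.40)    # correlated, same marginals

assert marginals(joint_indep) == (0.5, 0.5)
ma, mb = marginals(joint_corr)
assert abs(ma - 0.5) < 1e-12 and abs(mb - 0.5) < 1e-12
assert joint_indep[3] != joint_corr[3]   # yet P(A and B) differs

# Given marginals (a, b), the feasible P(A and B) fills the interval
# [max(0, a + b - 1), min(a, b)] -- one leftover degree of freedom.
a, b = marginals(joint_indep)
lo, hi = max(0.0, a + b - 1.0), min(a, b)
assert lo <= joint_indep[3] <= hi
assert lo <= joint_corr[3] <= hi
```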
It is important to notice that we are not directly concerned with whether propositions $A$ and $B$
hold true; the incompletely known variables in our chosen problems are probability distributions over the four possible joint outcomes of these two Boolean variables. To describe our state of knowledge about these distributions, we must work at the meta-level in which events take values within the space of such distributions, and concern ourselves with distributions over these distribution spaces. Unlike a finite set, which supports a trivial concept of uniformity, this event space forms a continuum, so entropy can be defined only with respect to some fiducial distribution that defines uniformity. In spaces of distributions, the uniquely qualified candidate for this meta-distribution is Jeffreys’ non-informative prior. It is proportional to the square root of the determinant of the Fisher information metric, which in turn measures the distinguishability of infinitesimally nearby distributions in the space. A crash course in these information-geometric topics can be found in Appendix A. Given this apparatus, maximum entropy problems become well-formulated as minimum Kullback–Leibler (KL) divergence problems.
We begin by introducing coordinates for the space of joint categorical distributions over two Boolean variables:
The coordinates range over the 3-simplex defined by
We also introduce the function
which is not to be confused with one of the 3 coordinates, even when written as without making its argument explicit.
In this coordinate system, we obviously have
Abbreviating $P(A)$ as $a$ and $P(B)$ as $b$, our knowledge can be expressed by the constraints
Let us use these constraints to eliminate the coordinates and :
3 The optimal estimate
In Appendix A, we motivate and derive the Fisher information matrix and the measure that it defines over spaces of probability distributions. In the present case, the Fisher information matrix has the single element
The “uniform” measure over this space, Jeffreys’ non-informative prior, is the square root of the determinant of the Fisher information matrix. This is essentially the Beta$(\tfrac{1}{2}, \tfrac{1}{2})$ density, except that its support is limited to a sub-interval of the unit interval. The integral over this support region, which gives the normalization constant needed to convert this measure into a probability measure, is an incomplete Beta function, for which we have a special case that reduces to
which is plotted in Figure 1.
To see this, recall the standard result and observe that
We therefore conclude that the joint probability is distributed between its lower and upper limits according to the density
with the normalization constant given by (25).
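Assuming the Jeffreys density takes the Beta$(\tfrac12,\tfrac12)$ form $\propto 1/\sqrt{p(1-p)}$ restricted to the feasible interval, as the reduction to a special case of the incomplete Beta function suggests, the normalization constant can be checked numerically against the arcsine antiderivative. The following sketch uses hypothetical marginal values.

```python
import math

def jeffreys_norm(lo, hi):
    """Closed form for the integral of 1/sqrt(p(1-p)) over [lo, hi]:
    the antiderivative is 2*asin(sqrt(p)), so the incomplete-Beta
    normalizer reduces to a difference of arcsines."""
    return 2.0 * (math.asin(math.sqrt(hi)) - math.asin(math.sqrt(lo)))

def jeffreys_norm_numeric(lo, hi, n=200000):
    """Midpoint-rule check of the same integral."""
    h = (hi - lo) / n
    return h * sum(1.0 / math.sqrt(p * (1.0 - p))
                   for i in range(n)
                   for p in [lo + (i + 0.5) * h])

a, b = 0.6, 0.7                               # hypothetical marginals
lo, hi = max(0.0, a + b - 1.0), min(a, b)     # feasible interval for P(A and B)
assert abs(jeffreys_norm(lo, hi) - jeffreys_norm_numeric(lo, hi)) < 1e-4
# Over the whole unit interval the normalizer is Beta(1/2, 1/2) = pi:
assert abs(jeffreys_norm(0.0, 1.0) - math.pi) < 1e-12
```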
With this distribution in hand, we can derive an optimal estimate for the joint probability. Suppose we institute a policy of using some fixed function of the marginals as our estimate. Then the expected KL divergence between this estimate and the true distribution would be
where we choose to weight the log likelihood ratio by the true distribution, not the estimated distribution. The expected information loss is therefore the expectation value of (28) under distribution (27). Abbreviating the two integrals that arise, this is
We note that these two integrals are both non-negative, because the integrand of (34) is non-negative and in each case the integral is taken in a positive sense.
The two terms in (30) can similarly be written in terms of a single indefinite integral, but one that does not have a simple expression in terms of elementary functions. (It can be expressed in terms of a generalized hypergeometric function, but we do not pursue this here.) It was integrated numerically to obtain the plot in Figure 2. We note that, as is clear from (30), this term does not depend on the estimate, and in this sense is a constant contribution to the expected information loss.
The optimal estimate minimizes the expected loss (29). Setting its derivative with respect to the estimate to zero gives
Because the two integrals are non-negative, we see that the optimal estimate (37) lies between 0 and 1, as it should. To confirm that the extremum (37) is indeed a minimum, we note that the second derivative of (29) is proportional to
4 Aggregation operator cost
Expression (37), together with (35), (36), (21) and (22), gives the optimal estimate given the marginals $a$ and $b$. Given a different estimate, of which a considerable variety are in use [Detyniecki:01:Aggregation], one can apply (29) and compare the result to (39) to determine how much worse the sub-optimal estimate is for any particular marginals $a$ and $b$.
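As a sketch of such a comparison (not the paper’s closed forms: the density is assumed to take the restricted $1/\sqrt{p(1-p)}$ form, and the optimum is located by grid search rather than by formula (37)), one can minimize the expected KL loss numerically and compare it against the product t-norm estimate:

```python
import math

def joint(a, b, p):
    """Joint cells (p00, p01, p10, p11) with marginals a, b and P(A and B) = p."""
    return (1.0 - a - b + p, b - p, a - p, p)

def kl(pt, pe):
    """KL divergence, weighting the log likelihood ratio by the true cells."""
    return sum(t * math.log(t / e) for t, e in zip(pt, pe) if t > 0.0)

def expected_loss(a, b, p_hat, n=800):
    """Expected KL loss of a fixed estimate p_hat, averaging over p with the
    assumed Jeffreys weight w(p) ~ 1/sqrt(p(1-p)) on the feasible interval."""
    lo, hi = max(0.0, a + b - 1.0), min(a, b)
    h = (hi - lo) / n
    num = den = 0.0
    for i in range(n):
        p = lo + (i + 0.5) * h
        w = 1.0 / math.sqrt(p * (1.0 - p))
        num += w * kl(joint(a, b, p), joint(a, b, p_hat))
        den += w
    return num / den

a, b = 0.6, 0.7                               # hypothetical marginals
lo, hi = max(0.0, a + b - 1.0), min(a, b)
grid = [lo + (hi - lo) * (i + 0.5) / 120 for i in range(120)]
p_opt = min(grid, key=lambda q: expected_loss(a, b, q))
assert lo < p_opt < hi                        # interior optimum
# The numerically optimal estimate does at least as well as the product t-norm:
assert expected_loss(a, b, p_opt) <= expected_loss(a, b, a * b) + 1e-4
```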
In order to judge how well an aggregation formula works in general, without committing to any particular values of its arguments, we need to average the expected loss of the estimate over all marginal values $a$ and $b$ under a non-informative distribution. To obtain this distribution over $(a, b)$, we assume Jeffreys’ non-informative prior over the joint distributions, apply (11) and (12) to transform to coordinates that eliminate two of the joint coordinates in favor of $a$ and $b$, and integrate out the remaining coordinate.
As derived in Appendix A, Jeffreys’ prior over the joint distributions is the Dirichlet$(\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2})$ distribution:
However, this turns out to be an immaterial unit factor, so we have
The marginal over $a$ and $b$ is then found by integrating out the remaining joint coordinate:
This is an incomplete elliptic integral of the first kind. With the definition
and substituting , the antiderivative of the integrand in (43) is
as verified in Appendix B.2. However, if we attempt to obtain the definite integral (43) simply by plugging in the limits, we may find ourselves in violation of the conditions and . Inspecting (45), we see that
The non-negativity constraint is always satisfied, because $a$ and $b$ are probabilities lying between $0$ and $1$. But the upper bound is problematic near either end of the allowed range. That condition can be written
The remaining condition amounts to . Because and are marginal probabilities that include (i.e., ) within their mass, we have and , so the non-negativity constraint is satisfied. The remaining constraint is
Combining these conditions with (48) has the consequences
so because we conclude that we can proceed only if , in which case (48) implies that we must also restrict consideration to .
It remains to determine whether the second condition in (49), , is respected by the limits and . For , having already restricted to , we have
which is always satisfied. Moving on to we obtain
which, having restricted attention to , is always satisfied.
Leaving aside the question of how to handle the remaining cases, let us plug the limits into (45). At , the argument of vanishes, so as is clear from (44), so does . This leaves the lower limit (with the constraint), at which we see
Thus, we obtain
for and . Let us call this expression to remind us of these two conditions.
The remaining cases can be obtained from the symmetries of (43). The most obvious is that . Hence, for all and with , we can say
leaving the cases to be determined.
Less obviously, . Plugging into (43), we have
Let , so . At the limit we have . If , this gives , and if it gives , so this limit becomes , as in (43). At the limit , we have . If , this is , and if it is , so this limit becomes , also agreeing with (43). Thus
The expected information loss (29) was integrated over this density using the optimal estimate (37) and a few other widely used t-norm aggregation operators taken from [Detyniecki:01:Aggregation], each expressing the estimate as a function of $a$ and $b$. The resulting overall expected losses are shown in Table 1.
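A Monte Carlo version of this averaging can be sketched as follows (a sketch under conventions chosen here, not the paper’s quadrature: Dirichlet$(\tfrac12,\tfrac12,\tfrac12,\tfrac12)$ sampling via normalized Gamma variates, and two illustrative operators rather than the full set in Table 1):

```python
import math, random

random.seed(0)

def sample_dirichlet_half(k=4):
    """Draw from Dirichlet(1/2,...,1/2) -- Jeffreys' prior for a k-outcome
    categorical -- via normalized Gamma(1/2, 1) variates."""
    g = [random.gammavariate(0.5, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def kl(pt, pe):
    return sum(t * math.log(t / e) for t, e in zip(pt, pe) if t > 0.0)

def avg_loss(estimator, n=5000):
    """Monte Carlo average of the KL loss incurred by the aggregation operator
    p_hat = estimator(a, b), over joints drawn from Jeffreys' prior."""
    total = 0.0
    for _ in range(n):
        p00, p01, p10, p11 = sample_dirichlet_half()
        a, b = p10 + p11, p01 + p11          # marginals of the two Booleans
        q = estimator(a, b)
        est = (1.0 - a - b + q, b - q, a - q, q)
        total += kl((p00, p01, p10, p11), est)
    return total / n

product = lambda a, b: a * b                               # product t-norm
midpoint = lambda a, b: 0.5 * (max(0.0, a + b - 1.0) + min(a, b))

loss_prod, loss_mid = avg_loss(product), avg_loss(midpoint)
assert loss_prod > 0.0 and math.isfinite(loss_prod)
assert loss_mid > 0.0 and math.isfinite(loss_mid)
```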
5 Summary and future work
We have set out and demonstrated a principled method for deriving optimal aggregation formulas. The best-guess full joint distribution over two Boolean variables, given the marginal distributions, is given in closed form by (37), and its expected information loss, measured by Kullback–Leibler divergence, is given by (39). Of course, if one has further information, then the maximum entropy problem should be formulated differently, resulting in an improved guess.
We also set out and demonstrated a principled method for evaluating aggregation operators by their expected information loss, averaging over all their possible inputs. We applied the method to produce Table 1, listing the expected loss for the optimal aggregation operator and a few others. The method can be applied to extend the table to anyone’s favorite aggregation operator.
In probabilistic graphical model applications [bishopbook06], conditional independence assumptions are used to estimate joint distributions. These assumptions are often motivated more by practical limitations than by plausibility. At least in principle, it should be a better bet to take a quantitative risk management approach to dealing with such limitations, which is essentially what we have demonstrated in the simplest non-trivial case. One places the bet that has the lowest risk of being badly wrong, in the sense of expected information loss. It is therefore of considerable interest to attempt to generalize the method in at least two ways: (i) from joint distributions over Boolean variables to joint distributions over $n$-valued categorical variables for arbitrary $n$; and (ii) from joint distributions over two variables to distributions over greater numbers of variables. On the latter point, with the machinery given here it is possible to treat the case of several Boolean variables by repeated application of the 2-variable formula (37), but we see no good reason to believe that this would give the same result as minimization of expected KL loss within the space of multi-variable distributions. It would also be of interest to extend the approach to commonly-used families of distributions over continuous spaces.
This work was carried out entirely with the author’s own time and resources, but benefitted from useful conversations with Elizabeth Rohwer and his SRI colleagues John Byrnes, Andrew Silberfarb and others on the Deep Adaptive Semantic Logic (DASL) team, supported in part by DARPA contracts HR001118C0023 and HR001119C0108 and SRI internal funding. All views expressed herein are the author’s own and are not necessarily shared by SRI or the US Government.
Appendix A Fisher Information and Jeffreys’ prior
In this brief introduction to these topics in information geometry [amari93, Caticha_2015], we begin with the principle of invariance, which holds that any formula for comparing probability distributions should produce the same result under any one-to-one smooth invertible change of random variables. Otherwise the comparison would depend not purely on the distributions, but also on the coordinates used to express the random variables. This leads to the $\alpha$-divergences, among which is the Kullback–Leibler (KL) divergence. Departing momentarily from this thread, we then introduce the Fisher information matrix and its determinant, and explain the sense in which this determinant measures the number of distinguishable distributions in a given infinitesimal coordinate volume element. Normalized, this density of distributions defines Jeffreys’ prior, expressing that a “randomly chosen” distribution is more likely to be chosen from a (coordinate) region dense in distinguishable distributions than from a region containing few such. We then derive the Fisher information matrix from the delta divergence between infinitesimally nearby distributions. Finally, we derive these quantities for the categorical distributions of interest here.
a.1 Invariant divergences
The condition that a functional of a pair of distributions be invariant with respect to any one-to-one smooth change of variables is rather restrictive. It implies that it must be one of the $\alpha$-divergences,
or a function constructed from these divergences. It is easy to see that this form is invariant. Recall that probability densities transform as $p_Y(y) = p_X(x)\,|dx/dy|$ in order to preserve the probabilities $p_X(x)\,dx = p_Y(y)\,dy$. (We will often be cavalier about notationally distinguishing between probabilities and densities when this notational abuse introduces little risk of confusion.) It is much more complicated [amari93] to prove that all invariant divergences are based on these forms.
For discrete random variables, the invariance is with respect to uninformative subdivision of events. That is, if discrete outcome $i$ is replaced by two possible outcomes $i'$ and $i''$, with $p_{i'} + p_{i''} = p_i$, $q_{i'} + q_{i''} = q_i$, and $p_{i'}/p_i = q_{i'}/q_i$, then the divergence is unchanged. Carrying out a continuum transformation in a discretized approximation also leads to this statement.
The KL and reverse-KL divergences are obtained in the $\alpha \to \pm 1$ limits, due to the identity (the symmetric case, $\alpha = 0$, is the Hellinger distance)
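In Amari’s convention for the $\alpha$-divergences (an assumption; the text’s normalization may differ by a constant factor), the limiting behavior and the symmetric Hellinger case can be checked numerically:

```python
import math

def alpha_div(p, q, alpha):
    """Amari alpha-divergence (assumed convention):
    D_alpha(p, q) = 4/(1 - alpha^2) * (1 - sum_i p_i^{(1-alpha)/2} q_i^{(1+alpha)/2})."""
    s = sum(pi ** ((1.0 - alpha) / 2.0) * qi ** ((1.0 + alpha) / 2.0)
            for pi, qi in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

# alpha -> -1 recovers KL(p||q); alpha -> +1 recovers the reverse KL(q||p).
assert abs(alpha_div(p, q, -0.9999) - kl(p, q)) < 1e-3
assert abs(alpha_div(p, q, +0.9999) - kl(q, p)) < 1e-3
# The symmetric member alpha = 0 is twice the squared Hellinger distance.
hell2 = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
assert abs(alpha_div(p, q, 0.0) - 2.0 * hell2) < 1e-12
```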
a.2 The Fisher Metric
Infinitesimally, all the $\alpha$-divergences reduce to the Fisher metric (up to a constant scale factor).
For any family of probability distributions indexed by an $n$-dimensional parameter $\theta$, the distance between the distributions at $\theta$ and $\theta + d\theta$ under the Fisher metric is
where $g$ is the Fisher information matrix
It is straightforward to verify (59) and (60) by expanding (58) to second order in $d\theta$, and using the conditions which follow from the normalization constraint. This calculation is carried out in Section A.2.1.
The Fisher distance has a direct interpretation in terms of the amount of IID data required to distinguish the distribution at $\theta$ from the distribution at $\theta + d\theta$. To see this when the values of the random variable are discrete, note that the typical log likelihood ratio between the probability of $N$ samples from the displaced distribution, according to the displaced distribution and according to the original, is
If we required this log likelihood ratio to exceed some threshold to declare the two distributions distinct, this condition would place a lower bound on $N$ inversely proportional to the squared Fisher distance; so, up to a proportionality constant, the inverse squared Fisher distance is the amount of data required to make this distinction. The result carries over to any distribution over the continuum that is sufficiently regular to be approximated by quantizing the continuum into cells.
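For a concrete one-parameter case, the Bernoulli family, this data-requirement reading can be checked numerically (the numbers below are illustrative, not from the text):

```python
import math

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# For N IID samples from Bernoulli(p + eps), the typical log likelihood ratio
# against Bernoulli(p) is N * KL(p + eps || p), and to second order
# KL(p + eps || p) = ds^2 / 2 with ds^2 = eps^2 * g(p),
# where g(p) = 1/(p(1-p)) is the Bernoulli Fisher information.
p, eps = 0.3, 1e-4
g = 1.0 / (p * (1.0 - p))
ds2 = g * eps ** 2
assert abs(kl_bern(p + eps, p) - 0.5 * ds2) < 1e-10

# So the N needed to push the ratio past a threshold lam scales as 1/ds^2:
lam = 10.0
n_needed = lam / kl_bern(p + eps, p)
assert abs(n_needed - 2.0 * lam / ds2) / n_needed < 1e-3
```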
It is shown directly in Section A.2.2 that the Fisher information for a distribution over $N$ IID samples is $N$ times the Fisher information of a single sample.
It can also be shown that the only invariant volume element that can be defined over a family of distributions is one proportional to the square root of the determinant of the Fisher information matrix $g$; i.e., $\sqrt{\det g}\,d^n\theta$. This can be regarded as proportional to the number of distinct distributions in the coordinate prism with opposite corners at $\theta$ and $\theta + d\theta$. To see this, consider coordinates that diagonalize the Fisher information matrix in the neighborhood of $\theta$. Recall that we consider two distributions distinct if the Fisher distance between them exceeds some threshold. We can therefore densely pack the distributions by placing them on the corners of a rectangular lattice, with a spacing along each rectangular coordinate inversely proportional to the square root of the corresponding diagonal element of $g$. The volume allocated to each distribution is then inversely proportional to $\sqrt{\det g}$, and the number of distributions in a rectangular prism is proportional to $\sqrt{\det g}$ times its coordinate volume.
a.2.1 Derivation of Fisher metric from Delta Divergence
Consider (58) with and . Then
so abbreviating as and as and using gives
where we have used
and . Note that because , we have and
Note that this result is independent of $\alpha$. Dependence on $\alpha$ begins with the third-order terms. These can be expressed by the Eguchi relations [amari93] in terms of the affine connection coefficients of the $\alpha$-geometry.
a.2.2 Direct Derivation of Fisher information for IID data
Consider a data set consisting of $N$ IID data points. The likelihood of this data is the product of the single-sample likelihoods. The Fisher information is
Using , this is
Observe that factors in the product over for which is equal neither to nor are dependent on only through an overall factor of , which sums to 1 due to normalization. The terms with can be factored into the form
which vanishes due to the normalization condition. This leaves the diagonal terms, of which there are $N$. Hence
Hence, we observe that the Fisher information of an IID data set is simply the data set size $N$ times the Fisher information of a single sample:
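For the Bernoulli family, this proportionality can be verified independently of the derivation above by exact summation over the binomial count of successes:

```python
import math

def bern_fisher_exact(p):
    """Single-sample Fisher information of Bernoulli(p): 1/(p(1-p))."""
    return 1.0 / (p * (1.0 - p))

def binom_fisher(n, p):
    """Fisher information of n IID Bernoulli samples, computed as the
    expected squared score E[(d/dp log L)^2] by exact summation over
    the binomial count k of successes."""
    total = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        score = k / p - (n - k) / (1 - p)   # d/dp log p^k (1-p)^(n-k)
        total += prob * score ** 2
    return total

p = 0.37
for n in (1, 5, 20):
    assert abs(binom_fisher(n, p) - n * bern_fisher_exact(p)) < 1e-8
```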
a.3 Categorical geometry
The family of categorical distributions over $n$ possible outcomes has $n - 1$ independent coordinates $p_1, \ldots, p_{n-1}$, in terms of which the probability of outcome $k$ is
where $k$ takes values in $\{1, \ldots, n\}$ and we define $p_n = 1 - \sum_{i=1}^{n-1} p_i$. The parameter space is bounded by the constraints $p_i \ge 0$ and the normalization constraint $\sum_{i=1}^{n} p_i = 1$. In a subsequent subsection we will derive the formulas
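The standard result for this Fisher matrix in these coordinates, $g_{ij} = \delta_{ij}/p_i + 1/p_n$ (quoted here as an assumption; the text derives its own formulas in the subsection referred to), can be checked numerically:

```python
# Numerical check of the Fisher information matrix for the categorical family
# with free coordinates p_1..p_{n-1} and p_n = 1 - sum(p_i): the standard
# result is g_ij = delta_ij / p_i + 1 / p_n.

def fisher_categorical(p_free):
    """E[ d_i log P(k) * d_j log P(k) ], summed exactly over outcomes k."""
    m = len(p_free)                    # m = n - 1 free coordinates
    p_n = 1.0 - sum(p_free)
    probs = list(p_free) + [p_n]
    g = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            for k, pk in enumerate(probs):
                # d/dp_i log P(k): 1/p_k if k == i (k < n), -1/p_n if k == n
                di = (1.0 / pk if k == i else 0.0) - (1.0 / p_n if k == m else 0.0)
                dj = (1.0 / pk if k == j else 0.0) - (1.0 / p_n if k == m else 0.0)
                g[i][j] += pk * di * dj
    return g

p_free = [0.2, 0.3, 0.1]               # n = 4 outcomes, p_4 = 0.4
g = fisher_categorical(p_free)
for i in range(3):
    for j in range(3):
        expected = (1.0 / p_free[i] if i == j else 0.0) + 1.0 / 0.4
        assert abs(g[i][j] - expected) < 1e-12
```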