The Optimal 'AND'

The joint distribution P(X,Y) cannot be determined from its marginals P(X) and P(Y) alone; one also needs one of the conditionals P(X|Y) or P(Y|X). But is there a best guess, given only the marginals? Here we answer this question in the affirmative, obtaining in closed form the function of the marginals that has the lowest expected Kullback-Leibler (KL) divergence between the unknown "true" joint probability and the function value. The expectation is taken with respect to Jeffreys' non-informative prior over the possible joint probability values, given the marginals. This distribution can also be used to obtain the expected information loss of any other "aggregation operator", as such estimators are often called in fuzzy logic, for any given pair of marginal input values. This enables such operators, including ours, to be compared according to their expected loss under the minimal-knowledge conditions we assume. We go on to develop a method for evaluating the expected accuracy of any aggregation operator in the absence of knowledge of its inputs. This requires averaging the expected loss over all possible input pairs, weighted by an appropriate distribution. We obtain this distribution by marginalizing Jeffreys' prior over the possible joint distributions (over the 3 functionally independent coordinates of the space of joint distributions over two Boolean variables) onto a joint distribution over the pair of marginal distributions, a 2-dimensional space with one parameter for each marginal. We report the resulting input-averaged expected losses for a few commonly used operators, as well as the optimal operator. Finally, we discuss the potential to develop our methodology into a principled risk-management approach to replace the often rather arbitrary conditional-independence assumptions made for probabilistic graphical models.


1 Introduction

Given truth values for propositions X and Y, the truth value of the conjunction X∧Y is fully determined by the truth table for the logical connective 'AND'. But if we are only able to assign non-extremal probabilities P(X) and P(Y) to these propositions, then the joint probability P(X,Y) is not fully determined. Further information, such as the conditional probability P(X|Y), is required. The essence of the problem is that the space of joint distributions over two Boolean variables has three dimensions, of which the marginals are only able to constrain two. Nevertheless, some guesses are better than others. Under Bayesian principles [jaynes03, Caticha:2008eso], the best guess is the one that minimizes the expected information loss under the distribution that maximizes the entropy over what is not known, given whatever is known. Here we apply this principle to obtain an optimal estimate for P(X,Y) given P(X) and P(Y). We apply the principle again under the still more information-sparse condition of knowing neither P(X) nor P(Y) to obtain the expected information loss for any given aggregation operator that estimates P(X,Y) from these marginals.

It is important to notice that we are not directly concerned with whether propositions X and Y hold true; the incompletely known variables in our chosen problems are probability distributions over the four possible joint outcomes of these two Boolean variables. To describe our state of knowledge about these distributions, we must work at the meta-level in which events take values within the space of such distributions, and concern ourselves with distributions over these distribution spaces. Unlike a finite set, which supports a trivial concept of uniformity, this event space forms a continuum, so entropy can be defined only with respect to some fiducial distribution that defines uniformity. In spaces of distributions, the uniquely qualified candidate for this meta-distribution is Jeffreys' non-informative prior. It is proportional to the square root of the determinant of the Fisher information metric, which in turn measures the distinguishability of infinitesimally nearby distributions in the space. A crash course in these information-geometric topics can be found in Appendix A. Given this apparatus, maximum entropy problems become well-formulated as minimum Kullback-Leibler (KL) divergence problems.

2 Formulation

We begin by introducing coordinates for the space of joint categorical distributions over two Boolean variables:

 θ11 = P(X, Y)   (1)
 θ10 = P(X, ¬Y)   (2)
 θ01 = P(¬X, Y)   (3)

The coordinates range over the 3-simplex defined by

 θ11 ≥ 0   (4)
 θ10 ≥ 0   (5)
 θ01 ≥ 0   (6)
 θ11 + θ10 + θ01 ≤ 1   (7)

We also introduce the function

 θ00(θ)=1−θ11−θ10−θ01 (8)

which is not to be confused with one of the 3 coordinates, even when written as θ00 without making its argument explicit.

In this coordinate system, we obviously have

 P(X) = θ11 + θ10   (9)
 P(Y) = θ11 + θ01   (10)

Abbreviating P(X) as a and P(Y) as b, our knowledge can be expressed by the constraints

 θ11 + θ10 = a   (11)
 θ11 + θ01 = b   (12)

Let us use these constraints to eliminate the coordinates θ10 and θ01:

 θ10 = a − θ11   (13)
 θ01 = b − θ11   (14)

The range restrictions (4), (5), (6) and (7) then become

 θ11 ≥ 0  ⟹  θ11 ≥ 0   (16)
 a − θ11 ≥ 0  ⟹  θ11 ≤ a   (17)
 b − θ11 ≥ 0  ⟹  θ11 ≤ b   (18)
 θ11 + (a − θ11) + (b − θ11) ≤ 1  ⟹  θ11 ≥ a + b − 1   (19)

which can be summarized as

 θ↓ ≤ θ11 ≤ θ↑   (20)
 θ↓ = max(a + b − 1, 0)   (21)
 θ↑ = min(a, b)   (22)
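For concreteness, the bounds (20)–(22) — the classical Fréchet bounds on a joint probability given its marginals — take only a few lines of code (the function name here is illustrative, not part of the derivation):

```python
def joint_bounds(a, b):
    """Range (21)-(22) of theta_11 = P(X,Y) given a = P(X), b = P(Y)."""
    lo = max(a + b - 1.0, 0.0)  # theta_down
    hi = min(a, b)              # theta_up
    return lo, hi
```

For example, a = 0.7 and b = 0.6 force P(X,Y) into [0.3, 0.6], while a = 0.2 and b = 0.3 allow anything in [0, 0.2].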

3 The optimal estimate

In Appendix A, we motivate and derive the Fisher information matrix and the measure that it defines over spaces of probability distributions. In the present case, the Fisher information matrix has the single element

 g(11),(11)(θ11) = Σ_{x∈{T,F}} P(x|θ11) [∂θ11 ln P(x|θ11)][∂θ11 ln P(x|θ11)]
  = θ11 · θ11^(−2) + (1 − θ11) · (−(1 − θ11)^(−1))²
  = θ11^(−1) + (1 − θ11)^(−1) = 1/(θ11(1 − θ11))   (24)
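This single-element case is easy to verify by carrying out the sum in (24) explicitly (an illustrative sketch, not part of the derivation):

```python
def bernoulli_fisher(theta):
    """Carry out the sum (24) over the two outcomes {T, F} explicitly."""
    score_true = 1.0 / theta            # d/dtheta ln P(T|theta) = d/dtheta ln theta
    score_false = -1.0 / (1.0 - theta)  # d/dtheta ln P(F|theta) = d/dtheta ln (1-theta)
    return theta * score_true ** 2 + (1.0 - theta) * score_false ** 2
```

The result agrees with the closed form 1/(θ(1 − θ)) for any θ in (0, 1).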

The “uniform” measure over this space, Jeffreys’ non-informative prior, is the square root of the determinant of g, which is just (θ11(1 − θ11))^(−1/2). This is essentially the Beta(1/2, 1/2) density, except that its support is limited to a sub-interval of the unit interval. The integral over this support region, which gives the normalization constant needed to convert this measure into a probability measure, is an incomplete Beta function, for which we have a special case that reduces to

 Z = ∫_{θ↓}^{θ↑} dx / √(x(1 − x)) = 2 sin⁻¹(√x) |_{θ↓}^{θ↑} = 2 sin⁻¹(√θ↑) − 2 sin⁻¹(√θ↓)   (25)

which is plotted in Figure 1.

To see this, recall the standard derivative of sin⁻¹ and observe that

 d/dx [2 sin⁻¹(√x)] = (2/√(1 − x)) · (1/(2√x)) = 1/√(x(1 − x)).   (26)

We therefore conclude that θ11 is distributed between θ↓ and θ↑ according to the density

 P(θ11) = (1/Z) · 1/√(θ11(1 − θ11))   (27)

with Z given by (25).
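As a numerical sanity check on (25)–(27), the following sketch (our own illustrative code) integrates the truncated arcsine density with a simple midpoint rule and confirms that it is normalized:

```python
import math

def theta_bounds(a, b):
    """The support limits (21)-(22)."""
    return max(a + b - 1.0, 0.0), min(a, b)

def Z(a, b):
    """Normalization constant (25)."""
    lo, hi = theta_bounds(a, b)
    return 2.0 * math.asin(math.sqrt(hi)) - 2.0 * math.asin(math.sqrt(lo))

def density_mass(a, b, n=100_000):
    """Midpoint-rule integral of the density (27) over its support."""
    lo, hi = theta_bounds(a, b)
    h = (hi - lo) / n
    z = Z(a, b)
    return sum(h / (z * math.sqrt(x * (1.0 - x)))
               for x in (lo + (k + 0.5) * h for k in range(n)))
```

For marginals away from the boundary, e.g. a = 0.7 and b = 0.4, `density_mass` returns 1 to high accuracy.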

With this distribution in hand, we can derive an optimal estimate for θ11. Suppose we institute a policy of using some fixed function f(a, b) to estimate θ11. Then the KL divergence between the true θ11 and this estimate would be

 KL[θ11, f(a, b)] = θ11 ln(θ11/f(a, b)) + (1 − θ11) ln((1 − θ11)/(1 − f(a, b)))   (28)

where we choose to weight the log likelihood ratio by the true distribution θ11, not the estimated distribution f(a, b). The expected information loss is therefore the expectation value of (28) under distribution (27). Abbreviating f(a, b) as x, this is

 ⟨KL[θ11, f(a, b)]⟩_{θ11} = κ/Z − (A/Z) ln x − (B/Z) ln(1 − x)   (29)
 κ = ∫_{θ↓}^{θ↑} dξ [ξ ln ξ + (1 − ξ) ln(1 − ξ)] / √(ξ(1 − ξ))   (30)
 A = ∫_{θ↓}^{θ↑} dξ √(ξ/(1 − ξ))   (31)
 B = ∫_{θ↓}^{θ↑} dξ √((1 − ξ)/ξ)   (32)

Note that with the change of variable ξ → 1 − ξ, (32) becomes ∫_{1−θ↑}^{1−θ↓} dξ √(ξ/(1 − ξ)), so A and B involve the same indefinite integral, verified in Appendix B.1 (81) to be:

 ∫ dξ √(ξ/(1 − ξ)) = tan⁻¹(√(ξ/(1 − ξ))) − √(ξ(1 − ξ)).   (34)

This gives

 A = tan⁻¹(√(θ↑/(1 − θ↑))) − tan⁻¹(√(θ↓/(1 − θ↓))) − √(θ↑(1 − θ↑)) + √(θ↓(1 − θ↓))   (35)
 B = tan⁻¹(√((1 − θ↓)/θ↓)) − tan⁻¹(√((1 − θ↑)/θ↑)) − √((1 − θ↓)θ↓) + √((1 − θ↑)θ↑)   (36)

We note that A and B are both non-negative, because the integrand of (34) is non-negative and in each case the integral is taken in a positive sense.

The two terms in (30) can be similarly written in terms of a single indefinite integral, but one that does not have a simple expression in terms of elementary functions. (It can be expressed in terms of a generalized hypergeometric function, but we do not pursue this here.) It was integrated numerically to obtain the plot in Figure 2. We note that κ ≤ 0, as is clear from (30). This term does not depend on the estimate f(a, b), and in this sense is a constant contribution to the expected information loss.

The optimal estimate minimizes the expected loss (29). Setting its derivative with respect to x to zero gives

 A/x − B/(1 − x) = 0
 A − Ax = Bx
 x = θ*11(a, b) = A/(A + B)   (37)

We note that A + B contains only the tan⁻¹ terms of (35) and (36), the square-root terms cancelling.

Because A and B are non-negative, we see that the optimal estimate (37) lies between 0 and 1, as it should. To confirm that the extremum (37) is indeed a minimum, we note that the second derivative of (29) is proportional to

 A/x² + B/(1 − x)² ≥ 0.   (38)

Plugging x = θ*11(a, b) from (37) into (29) gives the expected loss for the optimal estimate:

 ⟨KL[θ11, θ*11(a, b)]⟩_{θ11} = κ/Z − (A/Z) ln(A/(A + B)) − (B/Z) ln(B/(A + B))
  = (κ − A ln A − B ln B + (A + B) ln(A + B))/Z   (39)

The optimal estimate (37) is plotted along with its expected loss (39) in Figure 3.
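The closed forms (35)–(39) are straightforward to implement. The sketch below (our own illustrative code) computes the optimal estimate (37), with κ in (30) obtained by numerical quadrature; for a = b = 1/2 the optimal estimate works out to 1/2 − 1/π ≈ 0.18, noticeably below the independence guess ab = 1/4, because Jeffreys' prior piles mass near the ends of the allowed interval:

```python
import math

def bounds(a, b):
    """Allowed range (21)-(22) of theta_11 given the marginals."""
    return max(a + b - 1.0, 0.0), min(a, b)

def F(x):
    """Antiderivative (34) of sqrt(x/(1-x)), with the endpoint limits filled in."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return math.pi / 2.0
    return math.atan(math.sqrt(x / (1.0 - x))) - math.sqrt(x * (1.0 - x))

def optimal_and(a, b):
    """Optimal estimate (37) of P(X,Y) from a = P(X), b = P(Y)."""
    lo, hi = bounds(a, b)
    A = F(hi) - F(lo)              # equation (35)
    B = F(1.0 - lo) - F(1.0 - hi)  # equation (36), via the substitution xi -> 1 - xi
    return A / (A + B)

def expected_loss(a, b, n=100_000):
    """Expected loss (39) of the optimal estimate; kappa of (30) by midpoint rule."""
    lo, hi = bounds(a, b)
    A = F(hi) - F(lo)
    B = F(1.0 - lo) - F(1.0 - hi)
    Z = A + B  # A + B equals the normalizer Z of (25)
    h = (hi - lo) / n
    kappa = sum(h * (x * math.log(x) + (1.0 - x) * math.log(1.0 - x))
                / math.sqrt(x * (1.0 - x))
                for x in (lo + (k + 0.5) * h for k in range(n)))
    return (kappa - A * math.log(A) - B * math.log(B) + Z * math.log(Z)) / Z
```

For example, `optimal_and(0.5, 0.5)` returns 0.5 − 1/π ≈ 0.1817, and the estimate always respects the bounds (20).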

4 Aggregation operator cost

Expression (37), together with (35), (36), (21) and (22), gives the optimal estimate given that P(X) = a and P(Y) = b. Given a different estimate f(a, b), of which a considerable variety are in use [Detyniecki:01:Aggregation], one can apply (29) and compare the result to (39) to determine how much worse the sub-optimal estimate is for any particular marginals a and b.

In order to judge how well an aggregation formula works in general, without committing to any particular values of its arguments, we need to average the expected loss of the estimate over all marginal values a and b under a non-informative distribution. To obtain this distribution over (a, b), we assume Jeffreys' non-informative prior over the joint distributions θ, apply (11) and (12) to transform to coordinates that eliminate θ10 and θ01 in favor of a and b, and integrate out θ11.

As derived in Appendix A, Jeffreys’ prior over the joint distributions is the Dirichlet distribution:

 P(θ) = π^(−2) (θ11 θ10 θ01 θ00)^(−1/2)   (40)

To change to coordinates ϕ = (θ11, a, b), we must not only substitute (13) and (14) into (40), but also ensure that P(ϕ)dϕ = P(θ)dθ by dividing by the absolute value of the Jacobian determinant

 ∂(θ11, θ10, θ01)/∂(θ11, a, b) = det [[1, 0, 0], [−1, 1, 0], [−1, 0, 1]] = 1   (41)

However, this turns out to be an immaterial unit factor, so we have

 P(ϕ) = π^(−2) (θ11(a − θ11)(b − θ11)(1 − θ11 − (a − θ11) − (b − θ11)))^(−1/2)
  = π^(−2) (θ11(a − θ11)(b − θ11)(1 + θ11 − a − b))^(−1/2)   (42)

The marginal over a and b is then found by integrating out θ11:

 P(a, b) = π^(−2) ∫_{θ↓}^{θ↑} dθ11 (θ11(a − θ11)(b − θ11)(1 + θ11 − a − b))^(−1/2).   (43)

This is an incomplete elliptic integral of the first kind. With the definition

 K(ϕ, m) = ∫₀^ϕ (1 − m sin²θ)^(−1/2) dθ,  0 ≤ m ≤ 1,  0 ≤ ϕ ≤ π/2   (44)

and writing x for θ11, the antiderivative of the integrand in (43), including its π^(−2) prefactor, is

 I(x; a, b) = (2/π²) · ((b − a)/|b − a|) · (1/√(b(1 − b))) · K(sin⁻¹(√((1 − b)(b − x)/((1 − a)(a − x)))), a(1 − a)/(b(1 − b)))   (45)

as verified in Appendix B.2. However, if we attempt to obtain the definite integral (43) simply by plugging in the limits, we may find ourselves in violation of the conditions 0 ≤ m ≤ 1 and 0 ≤ ϕ ≤ π/2. Inspecting (45), we see that

 m = a(1 − a)/(b(1 − b))   (46)
 sin²ϕ = (1 − b)(b − x)/((1 − a)(a − x)).   (47)

The non-negativity constraint on m is always satisfied, because a and b are probabilities lying between 0 and 1. But the upper bound m ≤ 1 is problematic when b is near either end of its allowed range. That condition can be written

 a(1 − a)/(b(1 − b)) < 1
 a − a² < b − b²
 0 < (b − a)(1 − a − b)
 (b > a and a + b < 1) or (b < a and a + b > 1)   (48)

The remaining condition amounts to 0 ≤ sin²ϕ ≤ 1. Because a and b are marginal probabilities that include θ11 = x within their mass, we have x ≤ a and x ≤ b, so the non-negativity constraint is satisfied. The remaining constraint is

 (1 − b)(b − x)/((1 − a)(a − x)) ≤ 1
 a − x − a² + ax ≥ b − x − b² + bx
 0 ≥ (b − a) − (b² − a²) + (b − a)x = (b − a)(1 − a − b + x)
 0 ≤ (b − a)(a + b − 1 − x)
 (b ≥ a and x ≤ a + b − 1) or (b ≤ a and x ≥ a + b − 1)   (49)

Combining these conditions with (48) has the consequences

 (b ≥ a and x ≤ 0) or (b ≤ a and x ≥ 0)   (50)

so because x ≥ 0 we conclude that we can proceed only if b ≤ a, in which case (48) implies that we must also restrict consideration to a + b > 1.

It remains to determine whether the second condition in (49), x ≥ a + b − 1, is respected by the limits x = θ↓ and x = θ↑. For θ↑, having already restricted to b ≤ a, we have

 min(a, b) ≥ a + b − 1
 b ≥ a + b − 1
 1 ≥ a   (51)

which is always satisfied. Moving on to θ↓ we obtain

 max(a+b−1,0) ≥a+b−1. (52)

which, having restricted attention to a + b > 1, is always satisfied.

Leaving aside the question of how to handle the remaining cases, let us plug the limits into (45). At the upper limit x = θ↑ = b, the argument of sin⁻¹ in (45) vanishes, so, as is clear from (44), so does K. This leaves the lower limit x = θ↓ = a + b − 1 (with the a + b > 1 constraint), at which we see

 (1 − b)(b − x)/((1 − a)(a − x)) = (1 − b)(b − (a + b − 1))/((1 − a)(a − (a + b − 1))) = (1 − b)(1 − a)/((1 − a)(1 − b)) = 1   (53)

Thus, we obtain

 P(a, b) = (2/π²) (1/√(b(1 − b))) K(π/2, a(1 − a)/(b(1 − b)))   (54)

for b ≤ a and a + b ≥ 1. Let us call this expression P<>(a, b) to remind us of these two conditions.

The remaining cases can be obtained from the symmetries of (43). The most obvious is that P(a, b) = P(b, a). Hence, for all a and b with a + b ≥ 1, we can say

 P(a, b) = { P<>(a, b)  if b ≤ a
             P<>(b, a)  if b > a   (55)

leaving the a + b < 1 cases to be determined.

Less obviously, P(a, b) = P(1 − a, 1 − b). Plugging into (43), we have

 P(1 − a, 1 − b) = π^(−2) ∫_{θ↓(1−a,1−b)}^{θ↑(1−a,1−b)} dx (x(1 − a − x)(1 − b − x)(1 + x − (1 − a) − (1 − b)))^(−1/2)   (56)
  = π^(−2) ∫_{θ↓(1−a,1−b)}^{θ↑(1−a,1−b)} dx (x(1 − a − x)(1 − b − x)(a + b + x − 1))^(−1/2)

Let x′ = x + a + b − 1, so x = x′ − a − b + 1. At the limit x = θ↑(1 − a, 1 − b) = min(1 − a, 1 − b) we have x′ = a + b − 1 + min(1 − a, 1 − b). If a ≥ b, this gives x′ = b, and if a ≤ b it gives x′ = a, so this limit becomes x′ = min(a, b) = θ↑(a, b), as in (43). At the limit x = θ↓(1 − a, 1 − b) = max(1 − a − b, 0), we have x′ = max(0, a + b − 1). If a + b ≥ 1, this is x′ = a + b − 1, and if a + b ≤ 1 it is x′ = 0, so this limit becomes x′ = θ↓(a, b), also agreeing with (43). Thus

 P(1 − a, 1 − b) = π^(−2) ∫ dx (x(1 − a − x)(1 − b − x)(a + b + x − 1))^(−1/2)
  = π^(−2) ∫ dx′ ((x′ − a − b + 1)(1 − a − (x′ − a − b + 1))(1 − b − (x′ − a − b + 1))(a + b + (x′ − a − b + 1) − 1))^(−1/2)
  = π^(−2) ∫_{θ↓}^{θ↑} dx′ ((1 + x′ − a − b)(b − x′)(a − x′)x′)^(−1/2)
  = P(a, b)   (57)

If a + b ≤ 1, then (1 − a) + (1 − b) ≥ 1, so we can use (57) to obtain P(a, b) from (55). This density is plotted in Figure 4, which clearly shows the singularities at a = b and at a + b = 1.
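The closed form (54) can be checked against a direct numerical integration of (43). In the sketch below (our own illustrative code, covering the b ≤ a, a + b ≥ 1 case), the substitution θ11 = θ↓ + (θ↑ − θ↓)sin²u absorbs the inverse-square-root endpoint singularities, and the complete elliptic integral is evaluated by quadrature rather than a library call:

```python
import math

def quad(f, lo, hi, n=20_000):
    """Midpoint-rule quadrature."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))

def ellipK(m):
    """Complete elliptic integral K(pi/2, m), in the parameter convention of (44)."""
    return quad(lambda t: 1.0 / math.sqrt(1.0 - m * math.sin(t) ** 2), 0.0, math.pi / 2.0)

def P_closed(a, b):
    """Closed form (54); valid for b <= a and a + b >= 1."""
    m = a * (1.0 - a) / (b * (1.0 - b))
    return (2.0 / math.pi ** 2) * ellipK(m) / math.sqrt(b * (1.0 - b))

def P_direct(a, b):
    """Direct numerical integration of (43) for b <= a, a + b >= 1.

    Substituting theta = lo + (hi - lo) sin^2(u), with lo = a + b - 1 and
    hi = b, the factors (b - theta) and (1 + theta - a - b) combine with
    d(theta) to give the constant 2, leaving a smooth integrand.
    """
    lo, hi = a + b - 1.0, b
    def f(u):
        th = lo + (hi - lo) * math.sin(u) ** 2
        return 2.0 / math.sqrt(th * (a - th))
    return quad(f, 0.0, math.pi / 2.0) / math.pi ** 2
```

The two routes agree to quadrature accuracy, e.g. for a = 0.7, b = 0.6.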

The expected information loss (29) was integrated over this density using the optimal estimate (37) and a few other widely used t-norm aggregation operators taken from [Detyniecki:01:Aggregation], in each case writing the estimate as a function of a and b. The resulting overall expected losses are shown in Table 1.

Details of the numerical integration are given in Appendix C. The numerical procedure applied to the density (54, 55) alone integrated to 0.9989, so these figures should be accurate to roughly 0.1%.
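An alternative to the elliptic-integral machinery is direct Monte Carlo: draw joint distributions θ from Jeffreys' prior — the Dirichlet(1/2, 1/2, 1/2, 1/2) density of (40) — read off the marginals, and average the loss (28) of any candidate operator. A sketch (the operator and sample size are our own choices, not the paper's procedure):

```python
import math
import random

random.seed(0)

def sample_jeffreys():
    """theta ~ Dirichlet(1/2, 1/2, 1/2, 1/2), i.e. Jeffreys' prior (40), via Gammas."""
    g = [random.gammavariate(0.5, 1.0) for _ in range(4)]
    s = sum(g)
    return [x / s for x in g]

def kl_bernoulli(p, q):
    """The loss (28) between the true theta_11 = p and the estimate q."""
    def term(u, v):
        return 0.0 if u == 0.0 else u * math.log(u / v)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def avg_loss(operator, n=100_000):
    """Monte Carlo estimate of the input-averaged expected loss of an operator."""
    total = 0.0
    for _ in range(n):
        t11, t10, t01, t00 = sample_jeffreys()
        a, b = t11 + t10, t11 + t01
        total += kl_bernoulli(t11, operator(a, b))
    return total / n

product_tnorm = lambda a, b: a * b  # the independence (product) t-norm
```

Sampling the full Dirichlet and then discarding everything but (θ11, a, b) averages (28) over exactly the same measure as weighting (29) by the density (54, 55), so this gives an independent cross-check on table entries.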

5 Summary and future work

We have set out and demonstrated a principled method for deriving optimal aggregation formulas. The best-guess full joint distribution over two Boolean variables, given the marginal distributions, is given in closed form by (37), and its expected information loss, measured by Kullback-Leibler divergence, is given by (39). Of course, if one has further information then the maximum entropy problem should be formulated differently, resulting in an improved guess.

We also set out and demonstrated a principled method for evaluating aggregation operators by their expected information loss, averaging over all their possible inputs. We applied the method to produce Table 1, listing the expected loss for the optimal aggregation operator and a few others. The method can be applied to extend the table to anyone’s favorite aggregation operator.

In probabilistic graphical model applications [bishopbook06], conditional independence assumptions are used to estimate joint distributions. These assumptions are often motivated more by practical limitations than by plausibility. At least in principle, it should be a better bet to take a quantitative risk management approach to dealing with such limitations, which is essentially what we have demonstrated in the simplest non-trivial case. One places the bet that has the lowest risk of being badly wrong, in the sense of expected information loss. Therefore it is of considerable interest to attempt to generalize the method in at least two ways: (i) from joint distributions over Boolean variables to joint distributions over n-valued categorical variables for arbitrary n; and (ii) from joint distributions over two variables to distributions over greater numbers of variables. On the latter point, with the machinery given here it is possible to treat the case of n Boolean variables by repeated application of the 2-variable formula (37), but we see no good reason to believe that this would give the same result as minimization of expected KL loss within the space of n-variable distributions. It would also be of interest to extend the approach to commonly-used families of distributions over continuous spaces.

6 Acknowledgments

This work was carried out entirely with the author’s own time and resources, but benefitted from useful conversations with Elizabeth Rohwer and his SRI colleagues John Byrnes, Andrew Silberfarb and others on the Deep Adaptive Semantic Logic (DASL) team, supported in part by DARPA contracts HR001118C0023 and HR001119C0108 and SRI internal funding. All views expressed herein are the author’s own and are not necessarily shared by SRI or the US Government.

Appendix A Fisher Information and Jeffreys’ prior

In this brief introduction to these topics in information geometry [amari93, Caticha_2015], we begin with the principle of invariance, which holds that any formula for comparing probability distributions should produce the same result under any one-to-one smooth invertible change of random variables. Otherwise the comparison would depend not purely on the distributions, but also on the coordinates used to express the random variables. This leads to the δ-divergences, among which is the Kullback-Leibler (KL) divergence. Departing momentarily from this thread, we then introduce the Fisher information matrix and its determinant, and explain the sense in which this determinant measures the number of distinguishable distributions in a given infinitesimal coordinate volume element. Normalized, this density of distributions defines Jeffreys' prior, expressing that a "randomly chosen" distribution is more likely to be chosen from a (coordinate) region dense in distinguishable distributions than a region containing few such. We then derive the Fisher information matrix from the δ-divergence between infinitesimally nearby distributions. Finally, we derive these quantities for the categorical distributions of interest here.

a.1 Invariant divergences

The condition that a functional D[P, Q] of a pair of distributions P and Q be invariant with respect to any one-to-one smooth change of variables is rather restrictive. It implies that D must be one of the δ-divergences

 Dδ[P, Q] = (1/(δ(1 − δ))) [1 − ∫ₓ P(X = x)^δ Q(X = x)^(1−δ)]   (58)

or a function constructed from these divergences. It is easy to see that (58) is invariant. Recall that probability densities¹ transform as P(y) = P(x)|dx/dy| in order to preserve the probabilities P(x)dx = P(y)dy; the δ-powers in (58) therefore pick up a total Jacobian factor |dx/dy|, which is exactly absorbed by the change of integration measure. It is much more complicated [amari93] to prove that all invariant divergences are based on these forms. (¹ We will often abbreviate P(X = x) as P(x) when this notational abuse introduces little risk of confusion. For that matter, we will often be cavalier about notationally distinguishing between probabilities and densities.)

For discrete random variables, the invariance is with respect to uninformative subdivision of events. That is, if discrete outcome x is replaced by two possible outcomes x′ and x′′, with P(x′) = λP(x), P(x′′) = (1 − λ)P(x), Q(x′) = λQ(x), and Q(x′′) = (1 − λ)Q(x), then Dδ is unchanged. Carrying out a continuum transformation in a discretized approximation also leads to this statement.
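This subdivision invariance is easy to check directly for the KL divergence, the δ → 1 member of the family: splitting an outcome in the same proportion λ for both P and Q leaves the divergence unchanged, because the λ factors cancel inside the logarithm. A quick illustrative check (our own, with arbitrary example distributions):

```python
import math

def kl(p, q):
    """D_1[P, Q] for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
lam = 0.7  # split the first outcome in proportion lam : 1-lam for BOTH P and Q
p_split = [lam * p[0], (1.0 - lam) * p[0]] + p[1:]
q_split = [lam * q[0], (1.0 - lam) * q[0]] + q[1:]
# kl(p_split, q_split) equals kl(p, q)
```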

The KL and reverse-KL divergences are obtained in the δ → 1 and δ → 0 limits, due to the identity² lim_{δ→1} Dδ[P, Q] = ∫ₓ P(x) ln(P(x)/Q(x)). (² The symmetric case δ = 1/2 gives the Hellinger distance.)

a.2 The Fisher Metric

Infinitesimally, all the δ-divergences reduce to the Fisher metric (up to a constant scale factor).

For any family of probability distributions P(x|θ) indexed by an n-dimensional parameter θ, the distance between distributions P(x|θ) and P(x|θ + dθ) under the Fisher metric is

 dD = (1/2) Σᵢⱼ dθᵢ gᵢⱼ dθⱼ   (59)

where g is the Fisher information matrix

 gij(θ)=∫xP(x|θ)[∂θilnP(x|θ)][∂θjlnP(x|θ)] (60)

It is straightforward to verify (59) and (60) by expanding (58) to second order in dθ, and using the conditions ∫ₓ ∂θᵢ P(x|θ) = 0 and ∫ₓ ∂θᵢ ∂θⱼ P(x|θ) = 0, which follow from the normalization constraint ∫ₓ P(x|θ) = 1. This calculation is carried out in section A.2.1.

The Fisher distance has a direct interpretation in terms of the amount of IID data required to distinguish P(x|θ) from P(x|θ + dθ). To see this when the values of x are discrete, note that the typical log likelihood ratio between the probability of T samples from P, according to Q and according to P, is

 −ln[∏ₓ Q(x)^(TP(x)) / ∏ₓ P(x)^(TP(x))] = T Σₓ P(x) ln(P(x)/Q(x)) = T D₁[P, Q] → T dD  as Q → P + dP.   (61)

If we required this log likelihood ratio to exceed some threshold λ in order to declare P and Q distinct, this condition would be T dD ≥ λ, or T ≥ λ/dD, so up to a proportionality constant, 1/dD is the amount of data required to make this distinction. The result carries over to any distribution over the continuum that is sufficiently regular to be approximated by quantizing the continuum into cells.

It is shown directly in Section A.2.2 that the Fisher information for a distribution over T IID samples is T times the Fisher information of a single sample.

It can also be shown that the only invariant volume element that can be defined over a family of distributions is one proportional to the square root of the determinant of the Fisher information matrix g; i.e. dV ∝ √(det g) dθ₁⋯dθₙ. This can be regarded as proportional to the number of distinct distributions in the coordinate prism with opposite corners at θ and θ + dθ. To see this, consider coordinates that diagonalize the Fisher information matrix in the neighborhood of θ, so that gᵢⱼ = γᵢ δᵢⱼ. Recall that we consider two distributions distinct if T dD ≥ λ, i.e. if (T/2) Σᵢ γᵢ dθᵢ² ≥ λ. We can therefore densely pack the distributions by placing them on the corners of a rectangular lattice with spacing proportional to 1/√γᵢ along rectangular coordinate i. The volume allocated to each distribution is then proportional to ∏ᵢ γᵢ^(−1/2) = 1/√(det g), and the number of distributions in a rectangular prism with side lengths dθᵢ is proportional to √(det g) ∏ᵢ dθᵢ.

a.2.1 Derivation of Fisher metric from Delta Divergence

Consider (58) with P(x) = P(x|θ) and Q(x) = P(x|θ + dθ). Then

 ln Q(x) = ln P(x|θ + dθ)
  ≈ ln P(x|θ) + Σᵢ ∂θᵢ ln P(x|θ) dθᵢ + (1/2) Σᵢⱼ ∂θᵢ ∂θⱼ ln P(x|θ) dθᵢ dθⱼ   (62)

so abbreviating P(x|θ) as P and ∂θᵢ as ∂ᵢ, and using e^ε ≈ 1 + ε + ε²/2, gives

 P(x)^δ Q(x)^(1−δ) ≈ P^δ e^((1−δ) ln P) e^((1−δ)[Σᵢ ∂ᵢ ln P dθᵢ + (1/2) Σᵢⱼ ∂ᵢ∂ⱼ ln P dθᵢ dθⱼ])
  ≈ P^δ P^(1−δ) [1 + (1 − δ) Σᵢ ∂ᵢ ln P dθᵢ + ((1 − δ)/2) Σᵢⱼ ∂ᵢ∂ⱼ ln P dθᵢ dθⱼ + ((1 − δ)²/2) Σᵢ ∂ᵢ ln P dθᵢ Σⱼ ∂ⱼ ln P dθⱼ]
  ≈ P + (1 − δ) P Σᵢ ∂ᵢ ln P dθᵢ + ((1 − δ)/2) Σᵢⱼ ∂ᵢ∂ⱼ P dθᵢ dθⱼ − (δ(1 − δ)/2) P Σᵢ ∂ᵢ ln P dθᵢ Σⱼ ∂ⱼ ln P dθⱼ   (63)

where we have used

 ∂ᵢ∂ⱼ ln P = ∂ᵢ(P^(−1) ∂ⱼ P) = P^(−1) ∂ᵢ∂ⱼ P − P^(−2) (∂ᵢ P)(∂ⱼ P)
  = P^(−1) ∂ᵢ∂ⱼ P − (∂ᵢ ln P)(∂ⱼ ln P)   (64)

and ∂ᵢ P = P ∂ᵢ ln P. Note that because ∫ₓ P = 1, we have ∫ₓ ∂ᵢ∂ⱼ P = 0 and

 ∫ₓ P ∂ᵢ ln P = ∫ₓ ∂ᵢ P = 0.   (65)

Inserting (63) into (58) and applying these identities then gives

 Dδ[P, Q] = (1/(δ(1 − δ))) [1 − ∫ₓ P(X = x)^δ Q(X = x)^(1−δ)]
  ≈ (1/2) Σᵢⱼ ∫ₓ P(x|θ) [∂θᵢ ln P(x|θ)][∂θⱼ ln P(x|θ)] dθᵢ dθⱼ = (1/2) Σᵢⱼ gᵢⱼ dθᵢ dθⱼ   (66)

in agreement with (60) and (59).

Note that this result is independent of δ. Dependence on δ begins with the 3rd order terms. These can be expressed by the Eguchi relations [amari93] in terms of the affine connection coefficients of the δ-geometry.

a.2.2 Direct Derivation of Fisher information for IID data

Consider a data set X = (x₁, …, x_T) consisting of T IID data points. The likelihood of this data is P(X|θ) = ∏ₜ P(xₜ|θ). The Fisher information is

 g(T)ij=∑XP(X|θ)∂θilnP(X|θ)∂θjlnP(X|θ) (67)

Using ln P(X|θ) = Σₜ ln P(xₜ|θ), this is

 g(T)ᵢⱼ = Σ_{x₁…x_T} [∏_{t″} P(x_{t″}|θ)] Σ_{t,t′} ∂θᵢ ln P(xₜ|θ) ∂θⱼ ln P(x_{t′}|θ)   (68)

Observe that factors in the product over t″ for which t″ equals neither t nor t′ depend on x_{t″} only through an overall factor of Σ_{x_{t″}} P(x_{t″}|θ), which sums to 1 due to normalization. The terms with t ≠ t′ can be factored into the form

 [∑xP(x|θ)∂θilnP(x|θ)][∑x′P(x′|θ)∂θjlnP(x′|θ)]=[∑x∂θiP(x|θ)][∑x′∂θjP(x′|θ)] (69)

which vanishes due to the normalization condition. This leaves the t = t′ terms, of which there are T. Hence

 g(T)ij=T∑xP(x|θ)∂θilnP(x|θ)∂θjlnP(x|θ) (70)

Hence, we observe that the Fisher information of an IID data set is simply the data set size times the Fisher information of a single sample:

 g(T)ij=Tgij (71)
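For a Bernoulli variable, (71) can be confirmed by brute-force enumeration of all 2^T sample sequences (an illustrative check of ours, not part of the derivation):

```python
from itertools import product

def fisher_iid_bernoulli(theta, T):
    """Fisher information (67) of T IID Bernoulli(theta) samples, by enumeration."""
    total = 0.0
    for seq in product([0, 1], repeat=T):
        k = sum(seq)                      # number of successes in the sequence
        prob = theta ** k * (1.0 - theta) ** (T - k)
        score = k / theta - (T - k) / (1.0 - theta)  # d/dtheta ln P(X|theta)
        total += prob * score ** 2
    return total
```

The result matches T/(θ(1 − θ)), i.e. T times the single-sample Fisher information (24).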

a.3 Categorical geometry

The family of categorical distributions over n possible outcomes has n − 1 independent coordinates θ = (θ₁, …, θ_{n−1}), in terms of which the probability of outcome x is

 P(x|θ) = θₓ   (72)

where x takes values in {1, …, n} and we define θₙ = 1 − Σ_{i<n} θᵢ. The parameter space is bounded by the constraints θᵢ ≥ 0 and the normalization constraint Σ_{i<n} θᵢ ≤ 1. In a subsequent subsection we will derive the formulas

 gᵢⱼ = δᵢⱼ/θᵢ + 1/θₙ   (73)