The Optimal 'AND'

05/24/2020 · Richard Rohwer

The joint distribution P(X,Y) cannot be determined from its marginals P(X) and P(Y) alone; one also needs one of the conditionals P(X|Y) or P(Y|X). But is there a best guess, given only the marginals? Here we answer this question in the affirmative, obtaining in closed form the function of the marginals that has the lowest expected Kullback-Leibler (KL) divergence between the unknown "true" joint probability and the function value. The expectation is taken with respect to Jeffreys' non-informative prior over the possible joint probability values, given the marginals. This distribution can also be used to obtain the expected information loss for any other "aggregation operator", as such estimators are often called in fuzzy logic, for any given pair of marginal input values. This enables such operators, including ours, to be compared according to their expected loss under the minimal-knowledge conditions we assume. We go on to develop a method for evaluating the expected accuracy of any aggregation operator in the absence of knowledge of its inputs. This requires averaging the expected loss over all possible input pairs, weighted by an appropriate distribution. We obtain this distribution by marginalizing Jeffreys' prior over the possible joint distributions (over the 3 functionally independent coordinates of the space of joint distributions over two Boolean variables) onto a joint distribution over the pair of marginal distributions, a 2-dimensional space with one parameter for each marginal. We report the resulting input-averaged expected losses for a few commonly used operators, as well as for the optimal operator. Finally, we discuss the potential to develop our methodology into a principled risk-management approach to replace the often rather arbitrary conditional-independence assumptions made for probabilistic graphical models.


1 Introduction

Given truth values for propositions X and Y, the truth value of the conjunction X ∧ Y is fully determined by the truth table for the logical connective ‘AND’. But if we are only able to assign non-extremal probabilities P(X) and P(Y) to these propositions, then the joint probability P(X ∧ Y) is not fully determined. Further information, such as the conditional probability P(X|Y), is required. The essence of the problem is that the space of joint distributions over two Boolean variables has three dimensions, of which the marginals are only able to constrain two. Nevertheless, some guesses are better than others. Under Bayesian principles [jaynes03, Caticha:2008eso], the best guess is the one that minimizes the expected information loss under the distribution that maximizes the entropy over what is not known, given whatever is known. Here we apply this principle to obtain an optimal estimate for P(X ∧ Y) given P(X) and P(Y). We apply the principle again, under the still more information-sparse condition of knowing neither P(X) nor P(Y), to obtain the expected information loss for any given aggregation operator that estimates P(X ∧ Y) from these marginals.
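As a minimal numerical illustration (not from the paper) of this indeterminacy, the sketch below exhibits two joint distributions with identical marginals P(X) and P(Y) but different values of P(X ∧ Y):

```python
import numpy as np

# Two joint distributions over (X, Y) in {0,1}^2 with identical marginals
# P(X=1) = 0.6 and P(Y=1) = 0.5 but different P(X=1, Y=1).
# Rows index X in {0,1}, columns index Y in {0,1}.
joint_indep = np.array([[0.20, 0.20],
                        [0.30, 0.30]])   # P(X=1, Y=1) = 0.30 (independence)
joint_other = np.array([[0.10, 0.30],
                        [0.40, 0.20]])   # P(X=1, Y=1) = 0.20

for J in (joint_indep, joint_other):
    print(J[1, :].sum(), J[:, 1].sum(), J[1, 1])  # P(X=1), P(Y=1), P(X=1,Y=1)
# Both lines report marginals (0.6, 0.5); the conjunction probability differs.
```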

It is important to notice that we are not directly concerned with whether propositions X and Y hold true; the incompletely known variables in our chosen problems are probability distributions over the four possible joint outcomes of these two Boolean variables. To describe our state of knowledge about these distributions, we must work at the meta-level in which events take values within the space of such distributions, and concern ourselves with distributions over these distribution spaces. Unlike a finite set, which supports a trivial concept of uniformity, this event space forms a continuum, so entropy can be defined only with respect to some fiducial distribution that defines uniformity. In spaces of distributions, the uniquely qualified candidate for this meta-distribution is Jeffreys’ non-informative prior. It is proportional to the square root of the determinant of the Fisher information metric, which in turn measures the distinguishability of infinitesimally nearby distributions in the space. A crash course in these information-geometric topics can be found in Appendix A. Given this apparatus, maximum entropy problems become well-formulated as minimum Kullback-Leibler (KL) divergence problems.

2 Formulation

We begin by introducing coordinates for the space of joint categorical distributions over two Boolean variables:

(1)
(2)
(3)

The coordinates range over the 3-simplex defined by

(4)
(5)
(6)
(7)

We also introduce the function

(8)

which is not to be confused with one of the 3 coordinates, even when written without its argument made explicit.

In this coordinate system, we obviously have

(9)
(10)

Abbreviating as and as , our knowledge can be expressed by the constraints

(11)
(12)

Let us use these constraints to eliminate the coordinates and :

(13)
(14)

The range restrictions (4), (5), (6) and (7) then become

(16)
(17)
(18)
(19)

which can be summarized as

(20)
(21)
(22)
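Assuming the retained coordinate is the conjunction probability P(X ∧ Y), the admissible interval summarized by (20)–(22) is the familiar Fréchet interval [max(0, P(X) + P(Y) − 1), min(P(X), P(Y))]. A minimal sketch with illustrative marginal values:

```python
def joint_range(px, py):
    """Admissible range for z = P(X=1, Y=1) given the marginals px = P(X=1)
    and py = P(Y=1): the Frechet bounds max(0, px+py-1) <= z <= min(px, py)."""
    lo = max(0.0, px + py - 1.0)
    hi = min(px, py)
    return lo, hi

print(joint_range(0.6, 0.5))   # (0.1, 0.5)
print(joint_range(0.2, 0.3))   # (0.0, 0.2)
```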

3 The optimal estimate

In Appendix A, we motivate and derive the Fisher information matrix and the measure that it defines over spaces of probability distributions. In the present case, the Fisher information matrix has the single element

(24)

The “uniform” measure over this space, Jeffreys’ non-informative prior, is the square root of the determinant of , which is just . This is essentially the Beta() density, except that its support is limited to a sub-interval of the unit interval. The integral over this support region, which gives the normalization constant needed to convert this measure into a probability measure, is an incomplete Beta function for which we have a special case that reduces to

(25)

which is plotted in Figure 1.

Figure 1: The normalization factor (25) as a function of the marginals. The discontinuities of (21) and (22) are prominently reflected.

To see this, recall the standard result and observe that

(26)

We therefore conclude that is distributed between and according to the density

(27)

with given by (25).

With this distribution in hand, we can derive an optimal estimate for . Suppose we institute a policy of using some fixed function to estimate . Then the expected KL divergence between this estimate and would be

(28)

where we choose to weight the log likelihood ratio by the true distribution , not the estimated distribution . The expected information loss is therefore the expectation value of (28) under distribution (27). Abbreviating as and as , this is

(29)
(30)
(31)
(32)
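A minimal sketch of a per-estimate loss of the kind described around (28), under the assumption that it is the binary (two-outcome) KL divergence between the true conjunction probability z and the estimate, weighted by the true distribution:

```python
import numpy as np

def binary_kl(z, z_hat):
    """KL divergence from Bernoulli(z) (true conjunction probability) to
    Bernoulli(z_hat) (the estimate), weighted by the true distribution.
    An interpretive sketch of (28), not a reproduction of the paper's formula."""
    z = np.clip(z, 1e-12, 1.0 - 1e-12)
    z_hat = np.clip(z_hat, 1e-12, 1.0 - 1e-12)
    return z * np.log(z / z_hat) + (1.0 - z) * np.log((1.0 - z) / (1.0 - z_hat))
```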

Note that with the change of variable , , (32) becomes , so and involve the same indefinite integral, verified in Appendix B.1 (81) to be:

(34)

This gives

(35)
(36)

We note that and are both non-negative because the integrand of (34) is non-negative and in each case the integral is taken in a positive sense.

The two terms in (30) can be similarly written in terms of a single indefinite integral, but one that does not have a simple expression in terms of elementary functions. (It can be expressed in terms of a generalized hypergeometric function, but we do not pursue this here.) It was integrated numerically to obtain the plot in Figure 2. We note that , as is clear from (30). This term does not depend on the estimate , and in this sense is a constant contribution to the expected information loss.

The optimal estimate minimizes the expected loss (29). Setting its derivative with respect to to zero gives

(37)

We note that contains only the terms of (35) and (36).

Because and are non-negative, we see that the optimal estimate (37) lies between 0 and 1, as it should. To confirm that the extremum (37) is indeed a minimum, we note that the second derivative of (29) is proportional to

(38)

Plugging from (37) into (29) gives the expected loss for the optimal estimate:

(39)

The optimal estimate (37) is plotted along with its expected loss (39) in Figure 3.

Figure 2: The aggregation-operator-independent term (30, 25) as a function of the input marginals. Contour plot on the right, 3D plot on the left.
Figure 3: The optimal aggregation operator (37). Contour plot on upper left, 3D plot on upper right. Corresponding information loss costs in lower two plots. These include the aggregation-operator-independent term (30) shown in Figure 2, which was obtained by numerical integration. Small inaccuracies in this integration resulted in small negative values for the expected KL cost, which are not mathematically possible. These were truncated to 0.
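As an illustration of the recipe leading to (37), the sketch below minimizes the expected loss numerically over the admissible interval. It assumes the binary KL loss sketched above and a prior over z proportional to a Beta(1/2, 1/2) kernel truncated to the Fréchet interval, one plausible reading of the density described around (24)–(27); the closed form (37) is of course to be preferred where it applies.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def binary_kl(z, z_hat):
    return z * np.log(z / z_hat) + (1 - z) * np.log((1 - z) / (1 - z_hat))

def optimal_and(px, py, eps=1e-9):
    """Numerically minimize the expected KL loss over the admissible interval
    for z = P(X=1, Y=1), under an assumed prior proportional to the
    Beta(1/2, 1/2) kernel truncated to the Frechet bounds.  An illustrative
    reconstruction of the optimization, not the paper's closed form (37)."""
    lo = max(0.0, px + py - 1.0)
    hi = min(px, py)
    prior = lambda z: 1.0 / np.sqrt(z * (1.0 - z))
    norm, _ = quad(prior, lo + eps, hi - eps)

    def expected_loss(z_hat):
        val, _ = quad(lambda z: binary_kl(z, z_hat) * prior(z) / norm,
                      lo + eps, hi - eps)
        return val

    res = minimize_scalar(expected_loss, bounds=(lo + eps, hi - eps),
                          method="bounded")
    return res.x, res.fun

z_star, loss = optimal_and(0.6, 0.5)
print(z_star, loss)   # optimal estimate of P(X and Y) and its expected loss
```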

4 Aggregation operator cost

Expression (37), together with (35), (36), (21) and (22) gives the optimal estimate given that and . Given a different estimate , of which a considerable variety are in use [Detyniecki:01:Aggregation], one can apply (29) and compare the result to (39) to determine how much worse the sub-optimal estimate is for any particular marginals and .
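For concreteness, standard forms of the t-norms compared later in Table 1 are sketched below; the Hamacher family has a free parameter γ, and γ = 0 (the Hamacher product) is shown as one common choice.

```python
def t_product(a, b):
    return a * b

def t_min(a, b):
    return min(a, b)

def t_lukasiewicz(a, b):
    return max(0.0, a + b - 1.0)

def t_hamacher(a, b, gamma=0.0):
    # Hamacher family of t-norms; gamma = 0 gives the Hamacher product.
    if a == 0.0 and b == 0.0:
        return 0.0
    return a * b / (gamma + (1.0 - gamma) * (a + b - a * b))

print(t_product(0.6, 0.5), t_hamacher(0.6, 0.5),
      t_min(0.6, 0.5), t_lukasiewicz(0.6, 0.5))
```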

In order to judge how well an aggregation formula works in general, without committing to any particular values of its arguments, we need to average the expected loss of the estimate over all marginal values and under a non-informative distribution. To obtain this distribution over , we assume Jeffreys’ non-informative prior over the joint distributions , apply (11) and (12) to transform to coordinates that eliminate and in favor of and , and integrate out .

As derived in Appendix A, Jeffreys’ prior over the joint distributions is the Dirichlet distribution:

(40)

To change to coordinates , we must not only substitute (13) and (14) into (40), but also ensure that by dividing by the absolute value of the Jacobian determinant

(41)

However, this turns out to be an immaterial unit factor, so we have

(42)

The marginal over and is then found by integrating out :

(43)

This is an incomplete elliptic integral of the first kind. With the definition

(44)

and substituting , the antiderivative of the integrand in (43) is

(45)

as verified in Appendix B.2. However, if we attempt to obtain the definite integral (43) simply by plugging in the limits, we may find ourselves in violation of the conditions and . Inspecting (45), we see that

(46)
(47)

The non-negativity constraint on is always satisfied, because and are probabilities lying between and . But the upper bound is problematic when is near either end of its allowed range. That condition can be written

(48)

The remaining condition amounts to . Because and are marginal probabilities that include (i.e., ) within their mass, we have and , so the non-negativity constraint is satisfied. The remaining constraint is

(49)

Combining these conditions with (48) has the consequences

(50)

so because we conclude that we can proceed only if , in which case (48) implies that we must also restrict consideration to .

It remains to determine whether the second condition in (49), , is respected by the limits and . For , having already restricted to , we have

(51)

which is always satisfied. Moving on to we obtain

(52)

which, having restricted attention to , is always satisfied.

Leaving aside the question of how to handle the remaining cases, let us plug the limits into (45). At , the argument of vanishes, so as is clear from (44), so does . This leaves the lower limit (with the constraint), at which we see

(53)

Thus, we obtain

(54)

for and . Let us call this expression to remind us of these two conditions.

The remaining cases can be obtained from the symmetries of (43). The most obvious is that . Hence, for all and with , we can say

(55)

leaving the cases to be determined.

Less obviously, . Plugging into (43), we have

(56)

Let , so . At the limit we have . If , this gives , and if it gives , so this limit becomes , as in (43). At the limit , we have . If , this is , and if it is , so this limit becomes , also agreeing with (43). Thus

(57)

If , then , so we can use (57) to obtain . This density is plotted in Figure 4, which clearly shows the singularities at and .

Figure 4: The non-informative prior distribution (57) over the marginals. Contour plot on the left, 3D plot on the right. The distribution is singular at every edge and along the diagonal lines.

Method          Formula      Expected loss
Optimal         Eqn. (37)    0.0203
Product                      0.0272
Hamacher ()                  0.0416
Min                          0.0691
Lukasiewicz                  0.5183

Table 1: Expected information loss for the optimal estimate and a few widely used t-norm aggregation operators.

The expected information loss (29) was integrated over this density using the optimal estimate (37) and a few other widely used t-norm aggregation operators taken from [Detyniecki:01:Aggregation], each written as a function of the two marginals. The resulting overall expected losses are shown in Table 1.

Details of the numerical integration are given in Appendix C. The numerical procedure applied to the density (54, 55) alone integrated to 0.9989, so these figures should be accurate to roughly 0.1%.
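A Monte Carlo alternative to this quadrature, sketched below, samples full joint distributions from the four-outcome Jeffreys prior of (40), taken here to be Dirichlet(1/2, 1/2, 1/2, 1/2), forms the marginals of each sample, and averages the loss of a given operator; by the tower property this implicitly performs the marginalization onto the marginals described above. The loss used is the binary KL sketch from Section 3, so the resulting numbers should track, but need not exactly reproduce, Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_kl(z, z_hat):
    z = np.clip(z, 1e-12, 1.0 - 1e-12)
    z_hat = np.clip(z_hat, 1e-12, 1.0 - 1e-12)
    return z * np.log(z / z_hat) + (1.0 - z) * np.log((1.0 - z) / (1.0 - z_hat))

def average_loss(operator, n_samples=200_000):
    """Monte Carlo estimate of the input-averaged loss of an aggregation
    operator.  Joints over the four outcomes are drawn from the assumed
    Jeffreys prior Dirichlet(1/2, 1/2, 1/2, 1/2); averaging over the implied
    marginals marginalizes that prior onto (P(X), P(Y)) as in Section 4."""
    p = rng.dirichlet([0.5, 0.5, 0.5, 0.5], size=n_samples)
    z_true = p[:, 0]                    # P(X=1, Y=1)
    a = p[:, 0] + p[:, 1]               # marginal P(X=1)
    b = p[:, 0] + p[:, 2]               # marginal P(Y=1)
    return binary_kl(z_true, operator(a, b)).mean()

print(average_loss(lambda a, b: a * b))                          # product
print(average_loss(np.minimum))                                  # min
print(average_loss(lambda a, b: np.maximum(0.0, a + b - 1.0)))   # Lukasiewicz
```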

5 Summary and future work

We have set out and demonstrated a principled method for deriving optimal aggregation formulas. The best-guess full joint distribution over two Boolean variables, given the marginal distributions, is given in closed form by (37), and its expected information loss, measured by Kullback-Leibler divergence, is given by (39). Of course, if one has further information then the maximum entropy problem should be formulated differently, resulting in an improved guess.

We also set out and demonstrated a principled method for evaluating aggregation operators by their expected information loss, averaging over all their possible inputs. We applied the method to produce Table 1, listing the expected loss for the optimal aggregation operator and a few others. The method can be applied to extend the table to anyone’s favorite aggregation operator.

In probabilistic graphical model applications [bishopbook06], conditional independence assumptions are used to estimate joint distributions. These assumptions are often motivated more by practical limitations than by plausibility. At least in principle, it should be a better bet to take a quantitative risk management approach to dealing with such limitations, which is essentially what we have demonstrated in the simplest non-trivial case. One places the bet that has the lowest risk of being badly wrong, in the sense of expected information loss. Therefore it is of considerable interest to attempt to generalize the method in at least two ways: (i) from joint distributions over Boolean variables to joint distributions over categorical variables with arbitrarily many values; and (ii) from joint distributions over two variables to distributions over greater numbers of variables. On the latter point, with the machinery given here it is possible to treat the case of more than two Boolean variables by repeated application of the 2-variable formula (37), but we see no good reason to believe that this would give the same result as minimization of expected KL loss within the space of multi-variable distributions. It would also be of interest to extend the approach to commonly-used families of distributions over continuous spaces.

6 Acknowledgments

This work was carried out entirely with the author’s own time and resources, but benefitted from useful conversations with Elizabeth Rohwer and his SRI colleagues John Byrnes, Andrew Silberfarb and others on the Deep Adaptive Semantic Logic (DASL) team, supported in part by DARPA contracts HR001118C0023 and HR001119C0108 and SRI internal funding. All views expressed herein are the author’s own and are not necessarily shared by SRI or the US Government.

Appendices

Appendix A Fisher Information and Jeffreys’ prior

In this brief introduction to these topics in information geometry [amari93, Caticha_2015], we begin with the principle of invariance, which holds that any formula for comparing probability distributions should produce the same result under any one-to-one smooth invertible change of random variables. Otherwise the comparison would depend not purely on the distributions, but also on the coordinates used to express the random variables. This leads to the delta-divergences, among which is the Kullback-Leibler (KL) divergence. Departing momentarily from this thread, we then introduce the Fisher information matrix and its determinant, and explain the sense in which this determinant measures the number of distinguishable distributions in a given infinitesimal coordinate volume element. Normalized, this density of distributions defines Jeffreys’ prior, expressing that a “randomly chosen” distribution is more likely to come from a (coordinate) region dense in distinguishable distributions than from a region containing few. We then derive the Fisher information matrix from the delta divergence between infinitesimally nearby distributions. Finally, we derive these quantities for the categorical distributions of interest here.

a.1 Invariant divergences

The condition that a functional of a pair of distributions be invariant with respect to any one-to-one smooth change of variables is rather restrictive. It implies that the functional must be one of the delta-divergences

(58)

or a function constructed from these divergences. It is easy to see that is invariant. Recall that probability densities111 We will often abbreviate as when this notational abuse introduces little risk of confusion. For that matter, we will often be cavalier about notationally distinguishing between probabilities and densities. transform as in order to preserve the probabilities . It is much more complicated [amari93] to prove that all invariant divergences are based on these forms.

For discrete random variables, the invariance is with respect to uninformative subdivision of events. That is, if a discrete outcome is replaced by two sub-outcomes whose probabilities sum to that of the original and are split in the same proportions under both distributions, then the divergence is unchanged. Carrying out a continuum transformation in a discretized approximation also leads to this statement.
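A quick numerical check of this invariance (an illustration, not from the paper): splitting an outcome into two sub-outcomes in the same proportions under both distributions leaves the KL divergence unchanged.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

# Uninformative subdivision: split the first outcome into two sub-outcomes
# with the same proportion lam under both distributions.
lam = 0.7
p_split = np.array([lam * p[0], (1 - lam) * p[0], p[1], p[2]])
q_split = np.array([lam * q[0], (1 - lam) * q[0], q[1], q[2]])

print(kl(p, q), kl(p_split, q_split))   # identical values
```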

The KL and reverse-KL divergences are obtained in the two limiting values of the divergence parameter, via a standard limiting identity. (The symmetric case is the Hellinger distance.)

a.2 The Fisher Metric

Infinitesimally, all the delta-divergences reduce to the Fisher metric (up to a constant scale factor).

For any family of probability distributions indexed by -dimensional parameter , the distance between distributions and under the Fisher metric is

(59)

where is the Fisher information matrix

(60)

It is straightforward to verify (59) and (60) by expanding (58) to second order in , and using the conditions which follow from the normalization constraint . This calculation is carried out in section A.2.1.

The Fisher distance has a direct interpretation in terms of the amount of IID data required to distinguish from . To see this when the values of are discrete, note that the typical log likelihood ratio between the probability of samples from , according to and according to is

(61)

If we required this log likelihood ratio to exceed some threshold to declare and distinct, this condition would be or , so up to a proportionality constant, is the amount of data required to make this distinction. The result carries over to any distribution over the continuum that is sufficiently regular to be approximated by quantizing the continuum into cells.
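A numerical illustration with a Bernoulli family (a standard example, not taken from the paper): the KL divergence between infinitesimally separated distributions is half the squared Fisher distance, and the amount of data needed scales as the threshold divided by that divergence.

```python
import numpy as np

def bernoulli_kl(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def bernoulli_fisher(p):
    # Fisher information of a single Bernoulli(p) observation.
    return 1.0 / (p * (1.0 - p))

theta, dtheta = 0.3, 1e-3
kl_small = bernoulli_kl(theta, theta + dtheta)
half_metric = 0.5 * bernoulli_fisher(theta) * dtheta ** 2
print(kl_small, half_metric)        # agree to leading order

# IID samples needed for the typical log likelihood ratio N * KL to exceed
# a chosen threshold (here 10 nats):
threshold = 10.0
print(threshold / kl_small)
```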

It is shown directly in Section A.2.2 that the Fisher information for a distribution over IID samples is times the Fisher information of a single sample.

It can also be shown that the only invariant volume element that can be defined over a family of distributions is one proportional to the square root of the determinant of the Fisher information matrix ; i.e. . This can be regarded as proportional to the number of distinct distributions in the coordinate prism with opposite corners at and . To see this, consider coordinates that diagonalize the Fisher information matrix in the neighborhood of , so that . Recall that we consider two distributions distinct if , i.e. if . We can therefore densely pack the distributions by placing them on the corners of a rectangular lattice separated by along rectangular coordinate . The volume allocated to each distribution is then and the number of distributions in a rectangular prism with side lengths is .

a.2.1 Derivation of Fisher metric from Delta Divergence

Consider (58) with and . Then

(62)

so abbreviating as and as and using gives

(63)

where we have used

(64)

and . Note that because , we have and

(65)

Inserting (63) into (58) and applying these identities then gives

(66)

in agreement with (60) and (59).

Note that this result is independent of . Dependence on begins with the 3rd order terms. These can be expressed by the Eguchi relations [amari93] in terms of the affine connection coefficients of the -geometry.

a.2.2 Direct Derivation of Fisher information for IID data

Consider a data set consisting of IID data points . The likelihood of this data is . The Fisher information is

(67)

Using , this is

(68)

Observe that factors in the product over for which is equal neither to nor are dependent on only through an overall factor of , which sums to 1 due to normalization. The terms with can be factored into the form

(69)

which vanishes due to the normalization condition. This leaves the terms, of which there are . Hence

(70)

Thus the Fisher information of an IID data set is simply the data set size times the Fisher information of a single sample:

(71)
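A Monte Carlo check of this additivity for Bernoulli samples (illustrative; it uses the standard score-variance form of the Fisher information rather than the derivation above):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 0.3, 10, 200_000

def score(x, theta):
    # d/dtheta log p(x | theta) for a single Bernoulli(theta) observation x.
    return x / theta - (1 - x) / (1 - theta)

x = rng.binomial(1, theta, size=(trials, n))
dataset_score = score(x, theta).sum(axis=1)    # score of n IID observations
fisher_dataset = np.mean(dataset_score ** 2)   # Monte Carlo Fisher information
fisher_single = 1.0 / (theta * (1 - theta))

print(fisher_dataset, n * fisher_single)       # approximately equal
```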

a.3 Categorical geometry

The family of categorical distributions over possible outcomes has independent coordinates , in terms of which the probability of outcome is

(72)

where takes values in and we define . The parameter space is bounded by the constraints and the normalization constraint . In a subsequent subsection we will derive the formulas

(73)
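For reference, a numerical check (not taken from the paper) that the square root of the determinant of the categorical Fisher information matrix is 1/sqrt(p_1 ... p_K), i.e. proportional to the Dirichlet(1/2, ..., 1/2) density invoked as Jeffreys' prior in (40):

```python
import numpy as np

def categorical_fisher(p_free):
    """Fisher information matrix of a K-outcome categorical distribution in
    the free coordinates (p_1, ..., p_{K-1}), with p_K = 1 - sum(p_free):
    I_ij = delta_ij / p_i + 1 / p_K."""
    p_free = np.asarray(p_free, float)
    p_K = 1.0 - p_free.sum()
    return np.diag(1.0 / p_free) + 1.0 / p_K

p_free = np.array([0.1, 0.2, 0.3])              # K = 4, so p_4 = 0.4
I = categorical_fisher(p_free)
p_all = np.append(p_free, 1.0 - p_free.sum())

# Jeffreys' prior is proportional to sqrt(det I) = 1 / sqrt(p_1 * ... * p_K),
# which is the Dirichlet(1/2, ..., 1/2) density up to normalization.
print(np.sqrt(np.linalg.det(I)), 1.0 / np.sqrt(np.prod(p_all)))
```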