 # Relations among conditional probabilities

We describe a Gröbner basis of relations among conditional probabilities in a discrete probability space, with any set of conditioned-upon events. These relations may be specialized to the partially observed random variable case, the purely conditional case, and other special cases. We also investigate the connection to generalized permutohedra and describe a conditional probability simplex.


## 1 Relations among conditional probabilities

In 1974, Julian Besag  discussed the “unobvious and highly restrictive consistency conditions” among conditional probabilities. In this paper we give an answer in the discrete case to the question

What conditions must a set of conditional probabilities satisfy in order to be compatible with some joint distribution?

Let be a finite set of singleton events, and let

be a probability distribution on them. Let

be a set of observable events which will be conditioned on, each a set of at least 2 singleton events. Then for events , in , we can assign conditional probabilities for the chance of given , denoted . Settling Besag’s question then becomes a matter of determining the relations that must hold among the quantities . For example, Besag gives the relation (see also ),

$$\frac{P(x)}{P(y)}=\prod_{i=1}^{n}\frac{P(x_i\mid x_1,\dots,x_{i-1},y_{i+1},\dots,y_n)}{P(y_i\mid x_1,\dots,x_{i-1},y_{i+1},\dots,y_n)}. \tag{1}$$
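As a sanity check, relation (1) holds for any strictly positive joint distribution; the following sketch (the joint distribution is an illustrative assumption, not from the paper) verifies it numerically for two binary random variables:

```python
# Numerical check of Besag's relation (1) for two binary random variables.
# The joint distribution below is an illustrative assumption.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
n = 2

def cond(i, state):
    """P(X_i = state[i] | the other coordinates of state)."""
    matches = [s for s in joint if all(s[j] == state[j] for j in range(n) if j != i)]
    return joint[state] / sum(joint[s] for s in matches)

x, y = (0, 1), (1, 0)
lhs = joint[x] / joint[y]
rhs = 1.0
for i in range(n):
    num_state = tuple(x[j] if j <= i else y[j] for j in range(n))  # x_1..x_i, y_{i+1}..y_n
    den_state = tuple(x[j] if j < i else y[j] for j in range(n))   # x_1..x_{i-1}, y_i..y_n
    rhs *= cond(i, num_state) / cond(i, den_state)
assert abs(lhs - rhs) < 1e-12   # P(x)/P(y) equals the product of conditional ratios
```

The product telescopes: the numerator of each factor cancels the denominator of the next, leaving $P(x)/P(y)$.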

Since there are in general infinitely many such relations, we would like to organize them into an ideal and provide a nice basis for that ideal. A quick review of the language of ideals, varieties, and Gröbner bases appears in Geiger et al. [11, p. 1471], with more detail in Cox et al. In Theorem 3.2, we generalize relations such as (1) and Bayes’ rule to give a universal Gröbner basis of this ideal, a type of basis with useful algorithmic properties.

The second result generalized in this paper is due to Matúš. It states that the space of conditional probability distributions conditioned on events of size two maps homeomorphically onto the permutohedron. In Theorem 4.3, we generalize this result to arbitrary sets of conditioned-upon events. The resulting image is a generalized permutohedron [20, 24]. This is a polytope which provides a canonical, conditional-probability analog of the probability simplex under the correspondence provided by toric geometry and the theory of exponential families.

Work on the subject of relations among conditional probabilities has primarily focused on the case where the events in correspond to observing the states of a subset of random variables. Arnold et al. develop the theory for both discrete and continuous random variables, particularly in the case of two random variables, and cast the compatibility of two families of conditional distributions as solutions to a system of linear equations. Slavkovic and Sullivant consider the case of compatible full conditionals and compute related unimodular ideals.

This paper is organized as follows. In Section 2, we introduce some necessary definitions. In Section 3, we give compatibility conditions in the general case of events in a discrete probability space, with any set of conditioned-upon events. These conditions come in the form of a universal Gröbner basis, which makes them particularly useful for computations: as a result, they may be specialized to the partially observed random variable case, the purely conditional case, and other special cases simply by changing . In [14, 17], we have seen that permutohedra and generalized permutohedra  play a central role in the geometry of conditional independence; the same is true of conditional probability. The geometric results of Matúš  map the space of conditional probability distributions (Definition 2.1) for all possible conditioned events onto the permutohedron . See Figure 1 for a diagram of the -dimensional permutohedron. In Section 4, we will discuss how to extend this result to general , in which case we obtain generalized permutohedra as the image.

This will be accomplished using a version of the moment map of toric geometry (Theorem 7.1). In Section 5, we discuss how to specialize our results to the case of partially observed random variables, including as an example how to recover the relation (1). Finally, in Section 6 we use this specialization to explain the relationship of Bayes’ rule to our constructions. In the Appendix we recall a few necessary facts about toric varieties.

## 2 Conditional probability distributions

Let be a collection of subsets , with , of . Let denote the event algebra, the polynomial ring with indeterminates for all and , i.e. one unknown for each elementary conditional probability. Then we denote by

$$\|E\|=\sum_{I\in E}|I|$$

the number of variables of . We write for when . The unknowns of are meant to represent conditional probabilities, as we now explain. The set indexes the disjoint events, and a point with represents a probability distribution on these events. When for all , the conditional probability of event given event containing it is

$$p_{i|I}=\frac{p_i}{\sum_{j\in I}p_j}. \tag{2}$$
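In code, (2) is just a normalization over the conditioning event; a minimal sketch (the distribution is an illustrative assumption):

```python
# p_{i|I} = p_i / sum_{j in I} p_j, valid when the event I has positive mass.
p = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}   # an illustrative distribution

def cond_prob(i, I):
    assert i in I, "the conditioned event must contain the outcome"
    return p[i] / sum(p[j] for j in I)

# Conditioning on I = {1,2,3} rescales the restricted distribution:
assert abs(cond_prob(2, {1, 2, 3}) - 0.2 / 0.6) < 1e-12
```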

To extend this notion to the case , and to be able to deal with multiple conditioning sets, we make the following standard definition, considered in this form by Matúš.

###### Definition 2.1.

A conditional probability distribution for is a point such that for all with ,

• for all , .

Observe that (ii) is a relative version of (2), as (2) follows from (ii) with , , and . If on the other hand , the whole probability simplex satisfies the definition. This freedom is known in probability theory as versions of conditional probability. In algebraic geometry, this corresponds to the notion of a blow-up, and the simplex to the exceptional divisor. Before we give a homogenized version of Definition 2.1, we consider the homogenized version of probability.

### 2.1 A projective view of probability

Consider a probability space with disjoint atomic events . The space of probability distributions on them is typically represented as a probability simplex, where each is a coordinate such that and . We will be describing families of probability distributions in terms of algebraic varieties, and we prefer to think of points as lying in complex projective space. This is equivalent to letting be the complex vector space spanned by the outcomes (singleton events) and considering points as representing mixtures over outcomes or probability distributions. There are two ways to match up the notion of the probability simplex with that of complex projective space. One way to do so, restriction, identifies the probability simplex with the real, positive part of the affine open of the with homogeneous coordinates as illustrated in Figure 2.

Alternatively we can use projection, equivalent in the special case that , via the moment map (Theorem 7.1). The identity matrix consisting of standard unit vectors defines the probability simplex . The toric variety is then the projective space and the moment map is:

$$\mu:\mathbb{P}^{m-1}\to\Delta_{m-1},\qquad \mu\bigl((y_1:\cdots:y_m)\bigr)=\frac{1}{\sum_i|y_i|}\sum_i|y_i|\,e_i.$$

The moment map is the identity map on the probability simplex, but allows us to define a point on the probability simplex for more general points in complex projective space. The fiber over any of these points is the torus , a product of unit circles, since . A similar point of view appears in quantum physics; here is the Hilbert space representing quantum state and the modified moment map defines the probability of observing a classical state (singleton event) .
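The map above can be sketched in a few lines of Python (the sample points are illustrative):

```python
# Moment map from projective space to the simplex: normalize the absolute
# values of the homogeneous coordinates. On the simplex itself it is the
# identity; off it, only the phases of the coordinates are forgotten.
def moment_map(y):
    absy = [abs(c) for c in y]
    total = sum(absy)
    return [a / total for a in absy]

p = [0.2, 0.3, 0.5]                      # already a probability distribution
assert all(abs(a - b) < 1e-12 for a, b in zip(moment_map(p), p))

# A point with complex coordinates still maps to a genuine distribution;
# the fiber over it is a torus recording the lost phases.
assert moment_map([3 + 4j, 0, 5]) == [0.5, 0.0, 0.5]
```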

One interpretation of this freedom is that it suggests there are circumstances where allowing probabilities to be negative, or even complex, in intermediate computations might be useful. This may seem odd, but it can be argued that negative probabilities are already implicitly employed. For example, characteristic function methods implicitly write a density as a linear combination of basis functions with ranges unrestricted to . Even if we are uncomfortable with such interpretations, the compactification and homogenization can simply be viewed as a convenient algebraic trick that makes it easy to determine the relations among conditional probabilities we are ultimately interested in. Moreover, for most purposes can be replaced with as the base field for our ring, and these relations are unchanged.

### 2.2 Homogeneous conditional probability

Analogously to the projective version of probability in Section 2.1, where we replaced the requirement that probabilities sum to one with viewing them as coordinates of a point in projective space, we now define a multihomogeneous version of Definition 2.1. Now, a conditional probability distribution is represented by a point in the product of projective spaces. This product has one for each event which is conditioned upon, and each factor space is equipped with homogeneous coordinates .

###### Definition 2.2.

A projective conditional probability distribution for is a point inside such that for all and ,

$$\Bigl(\sum_{j\in J}p_{j|J}\Bigr)p_{i|K}=p_{i|J}\Bigl(\sum_{j\in J}p_{j|K}\Bigr).$$

Definition 2.2 specifies the following ideal in the event algebra :

$$J_E=\Bigl\langle\Bigl(\sum_{j\in J}p_{j|J}\Bigr)p_{i|K}-p_{i|J}\Bigl(\sum_{j\in J}p_{j|K}\Bigr)\;:\;J,K\in E,\ i\in J\subset K\Bigr\rangle.$$

This ideal consists of all polynomial relations that a point in must satisfy to be a projective conditional probability distribution. In particular, any honest conditional probability distribution must satisfy these. If we denote by a basis of , this ideal is multihomogeneous with respect to the grading (see e.g.  for more on such gradings). In what follows, it will be convenient to abbreviate Thus would be equal to for honest distributions, by Definition 2.1, but here we regard it as a linear form in . Let denote the product of all of the variables in , and let denote the product . The saturation of an ideal is the ideal generated by all polynomials such that for some . Now we define the ideal , when , by the saturation

$$I_E:=(J_E:(\alpha_E\beta_E)^\infty).$$

When , let and set . The purpose of saturation is to make sure the desired behavior occurs when some coordinates are zero; for example, it is necessary to move between the conditional independence ideals  generated by expressions and by the cross product differences algebraically without assuming anything about the positivity of the probabilities in question.

In the next section, we describe a matrix such that arises as the toric ideal (Section 7). Our first main result will be a universal Gröbner basis for the toric ideal . Gröbner bases, particularly universal Gröbner bases, have many algorithmic properties that make them a very complete description of an ideal. Cox, Little, and O’Shea  give an accessible overview; see also [23, 12].

## 3 A universal Gröbner basis for relations among conditional probabilities

A Bayes binomial in is a binomial relation of the form

$$p_{i|K}\,p_{j|J}-p_{j|K}\,p_{i|J}$$

for , with . Let denote the ideal they generate. Bayes binomials get their name because they come from Bayes’ rule; more explanation is given in Section 6.
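Concretely, every Bayes binomial vanishes on conditional probabilities computed from a joint distribution via (2); a quick numerical check (the distribution and events are illustrative assumptions):

```python
# Bayes binomials vanish on honest conditional probabilities.
# The distribution and the events J, K below are illustrative.
p = {1: 0.2, 2: 0.3, 3: 0.5}

def cond(i, I):
    """p_{i|I} = p_i / sum_{j in I} p_j, as in (2)."""
    return p[i] / sum(p[j] for j in I)

J, K = {1, 2}, {1, 2, 3}
i, j = 1, 2                        # both outcomes lie in J and in K
bayes = cond(i, K) * cond(j, J) - cond(j, K) * cond(i, J)
assert abs(bayes) < 1e-12          # the Bayes binomial evaluates to zero
```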

###### Proposition 3.1.

The ideal generated by the Bayes binomials contains and is contained in the saturation of by the probabilities that would sum to one (where again ):

$$J_E\subseteq I_{\mathrm{Bayes}(E)}\subseteq(J_E:(\beta_E)^\infty)$$

and in particular, .

###### Proof.

The ideal is generated by the degree-2 polynomials for and . For each , we have and in , so is in and . For the first inclusion, if is a generator of , we may write it as an element of . ∎

Our universal Gröbner basis of will be given combinatorially by the cycles of a labeled bipartite graph , defined as follows:

• Vertices: one vertex for each and one vertex for each

• Edges: a directed edge for each and

• Edge Labels: the edge is labeled with the indeterminate .

For example, with , the labeled graph for is shown in Figure 3.

Each oriented cycle in the undirected version of defines a binomial as follows: each edge label appears in the positive monomial if its edge is directed with the cycle, and in the negative monomial if against. For example, in the graph in Figure 3, consider the cycle . The edges and are directed with the cycle and the edges and are directed against, so the corresponding binomial is . For a higher-degree example, with and , we get from the outer cycle, as shown in Figure 4.
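The cycle-to-binomial recipe is mechanical enough to sketch in code; the event set, the cycle, and its edge orientations below are illustrative assumptions:

```python
# Build the binomial attached to an oriented cycle in the labeled bipartite
# graph. Each edge (u_I, v_i) carries the label p_{i|I}; labels of edges
# traversed with the cycle go to the positive monomial, the rest to the
# negative one. Here E = {{1,2},{2,3},{1,2,3}} and the cycle is illustrative.
cycle = [("12", 1, +1), ("12", 2, -1), ("123", 2, +1), ("123", 1, -1)]

pos = [f"p_{i}|{I}" for I, i, s in cycle if s > 0]
neg = [f"p_{i}|{I}" for I, i, s in cycle if s < 0]
binomial = "*".join(pos) + " - " + "*".join(neg)
assert binomial == "p_1|12*p_2|123 - p_2|12*p_1|123"
```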

A cycle is induced if it has no chord.

###### Theorem 3.2.

The binomials defined by the cycles of give a universal Gröbner basis for . Moreover, is generated by the induced cycle binomials, though not necessarily as a Gröbner basis.

In order to prove Theorem 3.2, we first need to recall some facts about unimodular toric ideals, of which is an example. Unimodular matrices and unimodular toric ideals are defined and characterized as follows, following Sturmfels . A triangulation of is a collection of subsets of the columns of such that is the set of cones in a simplicial fan with support . A triangulation of is unimodular if the normalized volume  is equal to one for all maximal simplices in the triangulation. The matrix is a unimodular matrix if all triangulations of are unimodular. We define a unimodular toric ideal in the following definition-proposition.

###### Proposition 3.3.

 A toric ideal is called unimodular if any of the following equivalent conditions hold.

• Every reduced Gröbner basis of consists of squarefree binomials,

• is a unimodular matrix,

• all the initial ideals of are squarefree.

A special class of unimodular matrices are those coming from bipartite graphs [1, 22]. Let be a bipartite graph. In our case, has

$$U=\{u_I:I\in E\}\quad\text{and}\quad V=\Bigl\{v_i:i\in\bigcup_{I\in E}I\Bigr\}. \tag{3}$$

Let be the vertex-edge incidence matrix of : the rows of are labeled , the columns are labeled with the edges, and the entry is 1 if vertex is in edge and zero otherwise. For a cycle in the graph, the cycle binomial is defined (up to sign) as above. Let be the map defined by applying . We say is a circuit if is minimal with respect to inclusion in and the coordinates of are relatively prime. Equivalently, a circuit is an irreducible binomial of the toric ideal with minimal support. The Graver basis of the ideal consists of all circuits. For from a bipartite graph, the circuits of are precisely the cycle binomials of the graph [21, 22]. Additionally, a Graver basis is also a universal Gröbner basis in the case of unimodular toric varieties (Proposition 8.11 of ). We summarize these results in the following proposition.

###### Proposition 3.4.

The vertex-edge incidence matrix of a bipartite graph is unimodular, so is a unimodular toric ideal. The cycle binomials of are the circuits of , and therefore define the Graver basis of . In particular, they give a universal Gröbner basis for .
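Both claims of Proposition 3.4 can be checked by brute force on a small example; the following sketch (the event set is an illustrative assumption) verifies total unimodularity of an incidence matrix and that a cycle binomial's exponent vector lies in its kernel:

```python
from itertools import combinations

# Vertex-edge incidence matrix for E = {{1,2},{2,3},{1,2,3}} (illustrative):
# rows are the vertices u_I and v_i, columns the edges (u_I, v_i),
# one per indeterminate p_{i|I}.
edges = [("12", 1), ("12", 2), ("23", 2), ("23", 3),
         ("123", 1), ("123", 2), ("123", 3)]
rows = ["12", "23", "123", 1, 2, 3]
A = [[1 if r in e else 0 for e in edges] for r in rows]

def det(M):
    """Integer determinant by Laplace expansion (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

# Total unimodularity: every square submatrix has determinant -1, 0, or 1.
for k in range(1, min(len(rows), len(edges)) + 1):
    for rs in combinations(range(len(rows)), k):
        for cs in combinations(range(len(edges)), k):
            assert det([[A[r][c] for c in cs] for r in rs]) in (-1, 0, 1)

# A cycle binomial is a circuit: its exponent vector (+1 on edges with the
# cycle, -1 against) lies in the kernel of A.
# Here: p_1|12 p_2|123 - p_2|12 p_1|123.
v = [1, -1, 0, 0, -1, 1, 0]
assert all(sum(A[i][j] * v[j] for j in range(len(edges))) == 0
           for i in range(len(rows)))
```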

Now we are able to prove our theorem.

###### Proof of Theorem 3.2.

Let be the vertex-edge incidence matrix of . By Proposition 3.4, its cycle binomials (circuits) give a universal Gröbner basis of . In fact, the induced cycles are enough to generate this ideal . Suppose is a cycle and a chord, and split into two cycles and , both containing (but in opposite directions). Associate cycle binomials and , respectively. Then the -polynomial (§7) with the -containing terms leading is . However, this is no longer necessarily a Gröbner basis. For example, let as in Figure 5.

The outer cycle gives the cycle binomial . The cycle has a chord , and the binomial lies in the ideal generated by the two binomials

$$p_{1|12}\,p_{2|123}-p_{2|12}\,p_{1|123}\qquad\text{and}\qquad p_{2|23}\,p_{3|123}-p_{3|23}\,p_{2|123}$$

after splitting along the chord. These are both the induced cycles of the graph. However, for a term order (§7) prioritizing (e.g. lexicographic with ), the leading term of cannot lie in the initial ideal of the ideal generated by the chordal binomials.

Next we show that the graph ideal and conditional probability ideal coincide, . For the containment , first observe that .

This is because if with , we have the subgraph in Figure 6, which is a cycle with associated cycle binomial . Together with Proposition 3.1, we now have

$$J_E\subseteq I_{\mathrm{Bayes}(E)}\subseteq I_{A_G(E)}$$

so, since saturation is inclusion-preserving and is prime,

$$I_E=(J_E:(\alpha_E\beta_E)^\infty)\subseteq(I_{A_G(E)}:(\alpha_E\beta_E)^\infty)=I_{A_G(E)}.$$

Now we show the reverse inclusion . Again by Proposition 3.1, we have

$$I_{\mathrm{Bayes}(E)}\subseteq I_E.$$

Now assume that , so that . We claim that in fact , from which the result will follow. Let be an induced cycle of , and its cycle binomial. We must show that this cycle binomial can be obtained from the Bayes binomials, up to multiplication by . Let be the cycle

$$i_1\leftarrow J_1\rightarrow i_2\leftarrow J_2\rightarrow\cdots\rightarrow i_k\leftarrow J_k\rightarrow i_1.$$

With this notation we have , , , . Then

$$f_C=p_{i_2|J_1}p_{i_3|J_2}\cdots p_{i_k|J_{k-1}}p_{i_1|J_k}-p_{i_1|J_1}p_{i_2|J_2}\cdots p_{i_k|J_k}.$$

We show the first monomial of is equal to the second mod . Pair off as follows:

$$\begin{aligned}
&(\underline{p_{i_1}}\,p_{i_2}p_{i_3}\cdots p_{i_k})\;\underline{p_{i_2|J_1}}\,p_{i_3|J_2}p_{i_4|J_3}\cdots p_{i_k|J_{k-1}}p_{i_1|J_k} && \text{Step }1\\
={}&(p_{i_2}\underline{p_{i_2}}\,p_{i_3}\cdots p_{i_k})\;p_{i_1|J_1}\underline{p_{i_3|J_2}}\,p_{i_4|J_3}\cdots p_{i_k|J_{k-1}}p_{i_1|J_k} && \text{Step }2\\
={}&(p_{i_2}p_{i_3}\underline{p_{i_3}}\cdots p_{i_k})\;p_{i_1|J_1}p_{i_2|J_2}\underline{p_{i_4|J_3}}\cdots p_{i_k|J_{k-1}}p_{i_1|J_k} && \text{Step }3\\
&\qquad\vdots
\end{aligned}$$

where the equalities hold mod . Continuing in this fashion, at step we have

$$\begin{aligned}
={}&(p_{i_2}p_{i_3}\cdots p_{i_{k-1}}\underline{p_{i_{k-1}}}\,p_{i_k})\;p_{i_1|J_1}p_{i_2|J_2}\cdots p_{i_{k-2}|J_{k-2}}\underline{p_{i_k|J_{k-1}}}\,p_{i_1|J_k} && \text{Step }k-1\\
={}&(p_{i_2}p_{i_3}\cdots p_{i_{k-1}}p_{i_k}\underline{p_{i_k}})\;p_{i_1|J_1}p_{i_2|J_2}\cdots p_{i_{k-2}|J_{k-2}}p_{i_{k-1}|J_{k-1}}\underline{p_{i_1|J_k}} && \text{Step }k\\
={}&(p_{i_2}p_{i_3}\cdots p_{i_{k-1}}p_{i_k}p_{i_1})\;p_{i_1|J_1}p_{i_2|J_2}\cdots p_{i_{k-2}|J_{k-2}}p_{i_{k-1}|J_{k-1}}p_{i_k|J_k} && \text{Step }k+1
\end{aligned}$$

as desired. In terms of , this amounts to breaking up a long cycle into 4-cycles passing through , and erasing the overlaps among these cycles. Thus since the induced cycles generate , we have

$$I_{A_G(E)}\subseteq\Bigl(I_{\mathrm{Bayes}(E)}:\prod_{i=1}^{m}p_i\Bigr)\subseteq I_E$$

This proves the result in the special case . In the general case, suppose we have some not containing , enabling us to obtain relations among ‘pure’ conditional probabilities (i.e. excluding ). Let and apply the special case of the Theorem. Then by [23, Proposition 4.13(c)], since we have a universal Gröbner basis, we may intersect it with the smaller coordinate ring to obtain a universal Gröbner basis of the corresponding elimination ideal. This corresponds here to removing the set from and taking the cycle binomials as our new Gröbner basis. ∎

## 4 Conditional probability and the moment map

In this section we show how to recover and generalize some results of Matúš  using toric geometry. The main result we will expand upon maps the space of conditional probability distributions (Definition 2.1) for all possible conditioned events onto the permutohedron by first projecting down to events of size 2, .

For