Total positivity in structured binary distributions

We study binary distributions that are multivariate totally positive of order 2 (MTP2). Binary distributions can be represented as an exponential family and we show that MTP2 exponential families are convex. Moreover, MTP2 quadratic exponential families, which contain ferromagnetic Ising models and attractive Gaussian graphical models, are defined by intersecting the space of canonical parameters with a polyhedral cone whose faces correspond to conditional independence relations. Hence MTP2 serves as an implicit regularizer for quadratic exponential families and leads to sparsity in the estimated graphical model. We prove that the maximum likelihood estimator (MLE) in an MTP2 binary exponential family exists if and only if the sign patterns (1,-1) and (-1,1) are represented in the sample for every pair of vertices; in particular, this implies that the MLE may exist with n=d samples, in stark contrast to unrestricted binary exponential families where 2^d samples are required. Finally, we provide a globally convergent algorithm for computing the MLE for MTP2 Ising models similar to iterative proportional scaling and apply it to the analysis of data from two psychological disorders. Throughout, we compare our results on MTP2 Ising models with the Gaussian case and identify similarities and differences.

1. Introduction and motivation

This paper discusses exponential families and, in particular, binary graphical models with a special form of positive dependence. Total positivity is a strong form of positive dependence that has become an important concept in modern statistics; see, e.g., [14, 22]. This property (also called the MTP2 property) appeared in the study of stochastic orderings, asymptotic statistics, and statistical physics [19, 30]. Families of distributions with this property offer many computational advantages [9, 16, 32], and the property is a convenient shape constraint in nonparametric statistics [33]. It has also become a useful tool in modelling with latent variables; see [10] for an overview. In particular, in [4] the MTP2 property appeared explicitly in the description of the binary latent class model.

In a recent paper [18], the MTP2 property was studied in the context of graphical models and conditional independence in general. It was shown that the MTP2 property has strong consequences for the interpretation of graphical models: for example, any MTP2 distribution with positive density is faithful to its pairwise independence graph, and the property rules out paradoxes of Simpson type, since associations do not change sign under marginalization or conditioning on other variables.

Studying graphical models under the MTP2 property is particularly popular in the Gaussian setting, where the property was shown to simplify inference [5, 27]. In the Gaussian case, the MTP2 property is equivalent to the covariance matrix being an inverse M-matrix, which is a linear constraint on the concentration matrix. This led Slawski and Hein [34] to propose efficient learning procedures based on convex optimization; see also [11, 17, 25]. The present paper develops similar results for binary distributions, in particular for Ising models.

The remainder of this paper is structured as follows. In Section 2 we formally introduce MTP2 distributions and review various properties of such distributions. While most of this paper concentrates on binary distributions, in Section 3 we provide a general discussion of MTP2 exponential families, in particular showing that they are convex. Moreover, we show that MTP2 quadratic exponential families, which contain as special examples ferromagnetic Ising models and attractive Gaussian graphical models, are defined by intersecting the space of canonical parameters with a polyhedral cone whose faces correspond to conditional independence relations.

Sections 4 and 5 contain our main results. By representing binary distributions as exponential families, in Section 4.1 we show that under MTP2 the MLE exists if and only if the sign patterns (1,-1) and (-1,1) are represented in the sample for every pair of vertices. This implies that under MTP2 the MLE may exist with n = d samples, a stark contrast to the unrestricted case, where 2^d samples are required. Next, in Section 4.2 we study the KKT conditions for MTP2 binary exponential families, connect them to so-called imsets [35], and show how this connection can be used for computing the MLE in MTP2 binary exponential families. In Section 4.3 we study MTP2 binary exponential families that factorize according to a graph and show that the MLE exists if and only if the sign patterns (1,-1) and (-1,1) are represented in the sample for every edge of the graph. Finally, in Section 4.4 we apply these results to symmetric binary distributions satisfying MTP2.

Section 5 is devoted to MTP2 Ising models, a special class of MTP2 binary distributions that form a quadratic exponential family. In particular, we develop a globally convergent algorithm for computing the MLE in MTP2 Ising models analogous to iterative proportional scaling (IPS). In addition, we discuss the special case of MTP2 Ising models with no external field, a prominent example of a symmetric binary distribution. Such distributions have been studied in [28] as a close proxy to the Gaussian distribution, and we discuss their similarities and differences. Finally, in Section 6 we apply our algorithm to the analysis of two psychological disorders, showing that the resulting graphical model is highly interpretable and consistent with domain knowledge.

2. Preliminary results and definitions

Let $V$ be a finite set and let $X = (X_v, v \in V)$ be random variables with labels in $V$. We consider the product space $\mathcal{X} = \prod_{v \in V} \mathcal{X}_v$, where $\mathcal{X}_v \subseteq \mathbb{R}$ is the state space of $X_v$, inheriting the order from $\mathbb{R}$. In this paper, the state spaces are either discrete (finite sets) or open intervals on the real line. All distributions are assumed to have densities with respect to the product measure $\mu = \otimes_{v \in V} \mu_v$, referred to as the base measure, where $\mu_v$ is the counting measure if $\mathcal{X}_v$ is discrete, and $\mu_v$ is the Lebesgue measure giving length 1 to the unit interval if $\mathcal{X}_v$ is an open interval.

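A function $f$ on $\mathcal{X}$ is said to be multivariate totally positive of order 2 (MTP2) if

$$f(x)\, f(y) \;\le\; f(x \wedge y)\, f(x \vee y) \quad \text{for all } x, y \in \mathcal{X}, \tag{1}$$

where $x \wedge y$ and $x \vee y$ denote the elementwise minimum and maximum, i.e.,

$$x \wedge y = (\min(x_v, y_v),\, v \in V), \qquad x \vee y = (\max(x_v, y_v),\, v \in V).$$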

Similarly, a function on is submodular if

and modular if

These inequalities are non-trivial only if $x, y$ are not comparable, that is, neither $x \le y$ nor $y \le x$. For $|V| = 2$, a function that is MTP2 is simply called totally positive [22]. We say that $X$, or the distribution of $X$, is MTP2 if its density function is MTP2. In the following, we review various basic properties of MTP2 distributions; for more details see, for example, [18].
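To make the defining inequality concrete, the following is a minimal sketch of a brute-force MTP2 check for a probability mass function on a binary product space. The function name `is_mtp2`, the dictionary representation of the mass function, and the coding of states as tuples in {0,1}^d are illustrative assumptions, not notation from the paper.

```python
from itertools import product


def is_mtp2(p, d, tol=1e-12):
    """Brute-force check of p(x) p(y) <= p(x ∧ y) p(x ∨ y) on {0,1}^d.

    `p` maps state tuples to probabilities; missing states have mass 0.
    The {0,1} coding and the dict layout are assumptions of this sketch.
    """
    states = list(product((0, 1), repeat=d))
    for x in states:
        for y in states:
            meet = tuple(min(a, b) for a, b in zip(x, y))  # elementwise min
            join = tuple(max(a, b) for a, b in zip(x, y))  # elementwise max
            lhs = p.get(x, 0.0) * p.get(y, 0.0)
            rhs = p.get(meet, 0.0) * p.get(join, 0.0)
            if lhs > rhs + tol:
                return False
    return True


# Two perfectly positively dependent bits satisfy the inequality.
positive = {(0, 0): 0.5, (1, 1): 0.5}
# Two perfectly negatively dependent bits violate it at x=(0,1), y=(1,0).
negative = {(0, 1): 0.5, (1, 0): 0.5}
```

The check visits all 4^d ordered pairs, so it is only practical for small d; comparable pairs are included for simplicity even though they satisfy the inequality trivially.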

Proposition 2.1.

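The MTP2 property is closed under taking products, conditioning, and marginalization, that is,

  • if $f$ and $g$ are MTP2 on $\mathcal{X}$, then their product $fg$ is MTP2;

  • if $X$ has an MTP2 distribution, then for every $C \subseteq V$, the conditional distribution of $X_C \mid X_{V \setminus C} = x_{V \setminus C}$ is MTP2 for almost all $x_{V \setminus C}$;

  • if $X$ has an MTP2 distribution, then for any $A \subseteq V$, the marginal distribution of $X_A$ is MTP2.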

We also emphasize that the MTP2 property does not depend on the underlying product measure; more precisely, the following property holds.

Proposition 2.2.

Let be the density of with respect to the product measure and let be the density of with respect to another product measure that is absolutely continuous with respect to . Then if is , so is .

This result follows directly from the fact that every MTP2 function stays MTP2 after multiplication with a function of the form $\prod_{v \in V} h_v(x_v)$ for some nonnegative univariate functions $h_v$, $v \in V$.

If a function is strictly positive, we can write

(2)

where . In this case, is MTP2 if and only if is submodular. This observation allows employing various results on submodular functions; see, for example, [7]. A simple consequence is that for strictly positive distributions, MTP2 can be verified by checking that (1) holds for pairs that are not comparable and differ in exactly two coordinates; cf. [22, Proposition 2.1]. We call such pairs elementary and denote the set of all elementary pairs by . As an example, consider Gaussian distributions. Denoting the covariance matrix by $\Sigma$ and its inverse by $K = \Sigma^{-1}$, the MTP2 distributions are precisely those for which $K$ is an M-matrix, that is, $K_{ij} \le 0$ for all $i \neq j$; see, for example, [25].
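The Gaussian criterion at the end of this paragraph reduces to a sign check on the inverse covariance matrix. The sketch below assumes the precision matrix is already available as a nested list and is known to be symmetric positive definite; the function name `has_nonpositive_off_diagonal` is hypothetical, and positive definiteness is deliberately not verified here.

```python
def has_nonpositive_off_diagonal(K, tol=0.0):
    """Return True if every off-diagonal entry of the square matrix K is <= tol.

    Assumption of this sketch: K is a symmetric positive definite precision
    matrix given as a list of rows. Under that assumption, the sign condition
    is what makes K an M-matrix, i.e. the Gaussian distribution MTP2.
    """
    d = len(K)
    return all(K[i][j] <= tol for i in range(d) for j in range(d) if i != j)


# Precision matrix of a chain with attractive interactions only.
K_attractive = [[2.0, -1.0, 0.0],
                [-1.0, 2.0, -1.0],
                [0.0, -1.0, 2.0]]

# Same structure, but one positive off-diagonal entry breaks the condition.
K_mixed = [[2.0, 0.5, 0.0],
           [0.5, 2.0, -1.0],
           [0.0, -1.0, 2.0]]
```

In practice a small positive `tol` may be useful when the precision matrix comes from a numerical inversion.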

3. Totally positive exponential families

While the remainder of this work will concentrate on binary distributions, we start with a brief general discussion of MTP2 for exponential families. We first show that maximum likelihood estimation for exponential families under MTP2 leads to a convex optimization problem. We then discuss conditions for the existence of the MLE and finally specialize these results to quadratic exponential families, which include as prominent examples the Gaussian distribution and the Ising model.

3.1. Convexity of totally positive exponential families

Consider an exponential family with density satisfying

(3)

with sample space , sufficient statistics and base measure . Assume that the family is minimally represented, i.e. that almost surely implies , and that the family is regular so that the space of canonical parameters

is an open convex set. Throughout, we assume that there exists such that is a product distribution, or equivalently,

(4)

Since every distribution in an exponential family can act as the base distribution, we can then pick as the base measure. It then holds that

We say that such an exponential family has a product base. Note that, by Proposition 2.2, changing the base measure to another product measure does not affect the property of total positivity. Examples of exponential families with a product base include Gaussian graphical models and log-linear models.

For an exponential family of the form (3) and any two we define

The density is if and only if for all elementary pairs in . For exponential families with a product base it holds that

which is an affine function in . This directly implies the following result.

Theorem 3.1.

Let denote the subset of canonical parameters for which the density is . Then is a convex set that is relatively closed in .

We note that this result also holds for exponential families without a product base. However, in that case the set of MTP2 canonical parameters may be empty.

In [25] we considered the Gaussian setting and showed that is a convex cone. By essentially the same argument, this extends to discrete Gaussian distributions over , which were introduced in [1]. More generally, we obtain the following result.

Proposition 3.2.

The set is obtained by intersecting with a closed convex cone , whose dual cone is the closure of the cone generated by the set

Proof.

The set of inequalities , one for each elementary pair , defines a convex cone in . We have for all if and only if for all in the cone generated by the set ; denote it by . This shows that and so . The latter is equal to the closure of by the standard theory of convex cones; see, for example, [13, Section 2.6.1]. ∎

We end this subsection with the following remark.

Remark 3.3.

When is finite, i.e. for log-linear models, Proposition 3.2 implies that is polyhedral. Since is polyhedral also in the Gaussian setting, finiteness of is not a necessary condition. In fact, we will show in Proposition 3.5 that is polyhedral for any quadratic exponential family. When is polyhedral, then every face of intersected with corresponds to the distributions in an exponential subfamily.

3.2. The MLE and its existence

An important consequence of Theorem 3.1 is that any MTP2 exponential family is a convex exponential family and thus the maximum likelihood estimator (MLE), if it exists, is uniquely defined; see [8, Section 9.4].

Let denote a random sample of size and let be the average of the corresponding sufficient statistics. Let denote the interior of , the convex support of the sufficient statistics. Then by the general theory of exponential families [8], the MLE exists if and only if lies in , in which case it is uniquely defined by

The following theorem extends this result to a characterization of existence of the MLE for the subfamily of MTP2 distributions. By Proposition 3.2 there exists a closed convex cone such that the space of all MTP2 canonical parameters is given by . We define

as the Minkowski sum of with the dual of ; cf. Proposition 3.2.

Theorem 3.4.

Let be a minimally represented regular exponential family. Then the MLE based on exists in the submodel if and only if , in which case is uniquely defined by

  1. primal feasibility:  ,

  2. dual feasibility:   with ,

  3. complementary slackness:    .

Proof.

The maximum likelihood estimation problem can be formulated as the following optimization problem:

subject to

This is a convex optimization problem, since is convex on . The Lagrangian is

where . Let denote the conjugate dual of with domain . Then

and hence the dual optimization problem is given by

subject to

The MLE exists if and only if the primal and dual problems are feasible. The primal problem is feasible by the assumption . The dual problem is feasible if and only if . The characterization of the MLE then follows from the KKT conditions. ∎

Note that sparsity in is a natural consequence of complementary slackness. In the Gaussian setting, this observation led to a series of recent papers [25, 31, 34]. Another natural set-up for similar results is the case of quadratic exponential families discussed in the next section.

3.3. Quadratic exponential families

The density function of a quadratic exponential family is of the form

(5)

with and , where is the set of symmetric matrices in . Important examples of such exponential families in the discrete setting are Ising models, which we discuss in more detail in Section 5, and Gaussian graphical models in the continuous setting. Note that in the binary setting we require in order to obtain a minimally represented exponential family. We start by showing that is a polyhedral cone for any quadratic exponential family.

Proposition 3.5.

The subfamily of MTP2 distributions in a quadratic exponential family is obtained by intersecting with a polyhedral cone , namely the cone .

Proof.

By [18, Theorem 7.5], a quadratic exponential family is if and only if is for all . This is the case if and only if for every that differ in two coordinates with and , it holds that

or equivalently . This completes the proof. ∎

We denote the mean parameters by and . Then can be transformed to , where is the covariance matrix of . Note that

Each facet of corresponds to one of the ’s being zero; cf. Remark 3.3. Equivalently, each facet consists of members in the exponential family that satisfy the conditional independence relation . The dual cone of is given by

(6)

Let denote a random sample of size and let and be the corresponding mean parameters. Let denote the sample covariance matrix. By standard exponential family theory, the MLE in the quadratic exponential family (5) corresponds to the unique distribution in the family which matches the sample mean parameters, i.e., , or equivalently, . By adding the MTP2 constraint, the situation becomes more interesting. As a direct corollary to Theorem 3.4 we obtain the following result regarding the MLE in an MTP2 quadratic exponential family.

Corollary 3.6.

Let be a minimal regular quadratic exponential family. Let and be the sample mean and covariance matrix. Then the corresponding MLE with and , is uniquely defined by

  (i) for ,

  (ii) , , and for ,

  (iii) for all .

Proof.

The conditions of Theorem 3.4 translate precisely to (i), (ii), (iii), namely the primal feasibility condition is derived in Proposition 3.5, the dual feasibility condition follows from (6), and the complementary slackness condition follows from the fact that the inner product between dual cones is zero if and only if each summand is zero. ∎

The specialization of this result to Gaussian graphical models was discussed in [25]. Note that the MTP2 constraint induces sparsity in the MLE through the complementary slackness condition (iii). For example, if , then complementary slackness implies that simply because in an MTP2 distribution all covariances are nonnegative. The sparsity pattern of defines a face of the polyhedral cone . As in the Gaussian setting [25, Corollary 2.4], the MTP2 MLE is the MLE of the quadratic exponential family without the MTP2 constraint restricted to the face .

Corollary 3.7.

Let denote the MLE of a quadratic exponential family under MTP2. Let . Then equals the maximum likelihood estimate in the quadratic exponential family without the MTP2 constraint under the linear constraints for all .

Proof.

This follows since the unique MLE in this quadratic exponential family is given by the equations (ii) in Corollary 3.6 above. ∎

4. Totally positive binary distributions

For the remainder of this paper, we concentrate on binary distributions, i.e., distributions over the sample space . To simplify notation we often use the following bijection between and the set of all subsets of , namely an element maps to the subset of all for which . For example, in the case the point maps to the subset and to the empty set. Note that and are also isomorphic as lattices because the min-max operators $\wedge$, $\vee$ on correspond to the set operations $\cap$, $\cup$ in .

Building on the results from Section 3, in the following we provide conditions for existence of the MLE in MTP2 binary exponential families. In particular, we study the KKT conditions for this setting and develop conditions for existence of the MLE in the special case of MTP2 binary distributions that factorize according to a graph (such as Ising models) and symmetric binary distributions (such as Ising models with no external field). Ising models will be discussed in detail in Section 5.

4.1. Existence of the MLE

Let denote the set of all probability distributions over and the set of all totally positive binary distributions, i.e.,

We note that is compact and geometrically convex, i.e.,

where

and unless by the Cauchy–Schwarz inequality.

For a lattice we say that a subset of forms a sublattice of if for any two it holds that and . Note that for any its support is always a sublattice of , since

Consider a sample with likelihood function

and let be the smallest sublattice of containing the sample . We now show that the support of the MLE is given by .
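The smallest sublattice containing a sample can be computed by repeatedly closing the sample under elementwise minimum and maximum. The following is a minimal sketch under the assumption that observations are coded as tuples over two ordered values (for example 0 and 1); the function name `lattice_closure` is hypothetical.

```python
def lattice_closure(sample):
    """Smallest set containing `sample` that is closed under elementwise
    min (meet) and max (join). Observations are tuples of equal length."""
    closure = {tuple(x) for x in sample}
    changed = True
    while changed:
        changed = False
        current = list(closure)
        for x in current:
            for y in current:
                meet = tuple(min(a, b) for a, b in zip(x, y))
                join = tuple(max(a, b) for a, b in zip(x, y))
                for z in (meet, join):
                    if z not in closure:
                        closure.add(z)
                        changed = True
    return closure


# Three "singleton" observations on three binary variables.
singletons = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
# Two observations that already form a sublattice (a chain).
chain = [(0, 0, 0), (1, 1, 1)]
```

For the singleton sample the closure is the full cube of eight states, while the chain sample is already closed, so its closure has only two elements.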

Theorem 4.1.

The likelihood function attains its maximum over in a unique point . Furthermore, it holds that .

Proof.

Continuity of the likelihood function together with compactness of ensures that the maximum is attained. To prove uniqueness, suppose for contradiction that both maximize . Then

contradicting that were maximizers.

Finally, note that and hence . We show by contradiction. Suppose , then we can construct such that , which contradicts the fact that is the MLE; namely, let be projected onto and rescaled to be a probability mass function, i.e. . Then and , which concludes the proof. ∎

We now recall the representation of strictly positive binary distributions as an exponential family. Define for . To write the exponential representation of this family of distributions we consider the space of dimension equipped with the inner product

For , define a vector such that if and it is zero otherwise. The set of binary distributions forms a regular exponential family which is minimally represented with canonical parameters for . Denote by the vector of all for and observe that . Then

where . The space of canonical parameters is simply the dimensional real vector space where . The interior of the convex support of the sufficient statistics is given by the set

which we identify with the interior of the probability simplex, namely

The constraints on the space of canonical parameters defining binary distributions are

(7)

for all elementary pairs . We recall that a pair is elementary if there exist a subset $A \subseteq V$ and distinct $i, j \in V \setminus A$ such that one element of the pair corresponds to $A \cup \{i\}$ and the other to $A \cup \{j\}$. The number of such pairs is $\binom{d}{2} 2^{d-2}$. Another way to phrase (7) is that is a supermodular set-function that satisfies the normalizing condition ; cf. [7].
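The count of elementary pairs can be confirmed by direct enumeration. This sketch assumes states are coded as tuples in {0,1}^d and lists each unordered elementary pair once; the function name `elementary_pairs` is hypothetical.

```python
from itertools import combinations, product


def elementary_pairs(d):
    """List all elementary pairs (x, y) in {0,1}^d: x and y agree outside
    two coordinates i < j, where x has (1, 0) and y has (0, 1)."""
    pairs = []
    for i, j in combinations(range(d), 2):
        rest = [k for k in range(d) if k not in (i, j)]
        # Every assignment of the remaining d - 2 coordinates gives one pair.
        for values in product((0, 1), repeat=d - 2):
            x = [0] * d
            y = [0] * d
            for k, v in zip(rest, values):
                x[k] = y[k] = v
            x[i], x[j] = 1, 0
            y[i], y[j] = 0, 1
            pairs.append((tuple(x), tuple(y)))
    return pairs
```

For d = 3 the enumeration yields six pairs and for d = 4 it yields 24, in agreement with choosing two coordinates and fixing the remaining d - 2 freely.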

Each inequality in (7) can be formulated as a sign condition on a particular conditional correlation for all and . Therefore, each facet mentioned in Remark 3.3 corresponds to a context-specific conditional independence statement where two variables are independent conditioned on a particular value of the remaining variables. When $d = 3$ there are six such constraints and these play an important role in the boundary decomposition of the latent class model [3]. To see how they appear in the description of a general binary latent class model see [4].

The following result provides conditions for existence of the MLE for MTP2 binary distributions. Theorem 1 in [34] states that the MLE for MTP2 Gaussian distributions exists if and only if all sample correlations are strictly less than one. Proposition 4.2 is the analogous result for MTP2 binary distributions.

Proposition 4.2.

The MLE exists within if and only if . Furthermore, if and only if every pair-marginal for has both of and represented.

Proof.

The MLE exists in the binary exponential family if and only if the estimator in the extended family has full support. Thus, as a consequence of Theorem 4.1, the MLE exists if and only if .

For the second statement we first prove the backward direction using the identification between and subsets of . Suppose every pair-marginal for has both of and represented. This means that for every there is a set with and . But then

Since the set of all singletons for generates the full lattice , we obtain as desired.

We prove the forward direction by proving its contrapositive. Suppose there is a pair such that all sets have the property that

(8)

The set of subsets satisfying (8) forms a proper sublattice . Since we obtain that , which completes the proof. ∎

As an example, consider the case $d = 3$. Then the vectors , , generate all of and hence every sample supported on these three points will admit a unique MLE under the MTP2 constraint. This set is minimal in the sense that it cannot be reduced; none of its proper subsets generates . There are also minimal generating subsets of size four, e.g. , , , . For general $d$, a minimal generating set of is of order and there always exists a minimal generating set of size exactly $d$. Hence for MTP2 binary distributions $d$ samples can be sufficient for existence of the MLE. This is in sharp contrast with unrestricted binary exponential families, where the MLE exists only if all $2^d$ states are observed at least once.
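The pairwise criterion of Proposition 4.2 is easy to test on data. The sketch below assumes each observation is a tuple over two ordered values, so it works equally for a 0/1 or a -1/1 coding; the function name `mle_exists_under_mtp2` is hypothetical and the input is assumed to be non-empty.

```python
from itertools import combinations


def mle_exists_under_mtp2(sample):
    """Check that, for every pair of coordinates, the sample contains both
    discordant patterns (low, high) and (high, low)."""
    sample = [tuple(x) for x in sample]
    d = len(sample[0])
    for i, j in combinations(range(d), 2):
        patterns = {(x[i], x[j]) for x in sample}
        low_high = any(a < b for a, b in patterns)
        high_low = any(a > b for a, b in patterns)
        if not (low_high and high_low):
            return False
    return True


# d = 3 observations, one per variable, already satisfy the criterion.
three_points = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
# Perfectly concordant data never shows a discordant pattern.
concordant = [(0, 0, 0), (1, 1, 1), (1, 1, 1)]
```

The first sample illustrates that d observations can suffice, whereas the second fails for every pair no matter how many copies of the two states are observed.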

We end this subsection by characterizing when has full support. These results are critical for Section 4.3 in order to analyze when the MLE of a binary distribution over a graph has full support.

Proposition 4.3.

Let and let . Suppose for all pairs then .

Proof.

For every let such that . Let be the partition of such that for and on . For each define . By construction and . Moreover because is a lattice. Since , again because the support of is a lattice. ∎

Corollary 4.4.

If then has full support if and only if each pair-margin has full support.

4.2. KKT conditions

In this subsection, we translate the KKT conditions of Theorem 3.4 to the binary setting. To do this, we introduce some notation. Following Studený [35], we call the elements in imsets. An important example of an imset is defined earlier. The imset

is called a semi-elementary imset. If form an elementary pair then is called an elementary imset. If this pair is associated to sets and we write . With a slight abuse of notation we denote the class of all elementary imsets by .

Primal feasibility in Theorem 3.4 requires that satisfies (7) or, in other words,

(9)

The dual cone is the cone in generated by all elementary imsets. Dual feasibility in Theorem 3.4 says that for all and

(10)

Although every element in is a non-negative combination of elementary imsets, such a combination is typically not unique. For example,

In particular, the coefficients above are not uniquely defined. But independent of the choice of these coefficients, the complementary slackness condition is equivalent to

By (9), this holds if and only if

(11)

We conclude that for every that appears in a nonnegative linear combination of the form (10). In the following example, we show how this characterization of complementary slackness can be used to compute the MLE.

Example 4.5.

Let and consider the sample represented by the diagram to the left in the following figure, where we again made use of the bijection between and the set of all subsets of .

We claim that represented by the diagram on the right corresponds to the MLE. First we check that is indeed by checking that for all six elementary pairs . Up to the normalizing constant , these are

This proves primal feasibility in Theorem 3.4. Dual feasibility is verified by the following diagram.

In other words, . Complementary slackness follows by direct calculations. Note that the two nonzero generators in the decomposition of correspond precisely to the inequalities for that hold as equalities. These equalities correspond to the conditional independence statement . ∎

While the MLE in Example 4.5 could be computed by hand, calculations get intractable rather quickly. The following example is sufficiently complicated that it cannot easily be calculated by hand, but still simple enough so that numerical optimization using the algorithm developed in [10] yields the provably exact optimum.

Example 4.6.

Moussouris [29] provided a now classical example of a distribution that is globally Markov to its dependence graph but does not factorize; cf. [26, Example 3.10]. The sample in this example is uniformly supported on eight points

[Figure 1: A cycle (left) and a chain (right) with four vertices.]

This distribution is globally Markov with respect to the 4-cycle in Figure 1 (left), but the MLE for this graphical model as an exponential family does not exist. Note that this sample distribution is not MTP2, since for example the inequality

does not hold. On the other hand, since the conditions of Proposition 4.2 are satisfied, the MLE under MTP2 exists. It is represented by the following diagram, where the highlighted nodes correspond to the eight points supported by the sample.

Primal feasibility of is verified by the following inequalities, one for each of the 24 elementary pairs (labeled by sets and ). Up to the normalizing constant , these are:

Quite surprisingly the MLE is therefore still globally Markov to the 4-cycle even though these constraints were not explicitly enforced. Moreover, satisfies an additional conditional independence relation, namely , and so it is Markov to the smaller graph in Figure 1 (right).

There are many equivalent ways to write the vector . The most canonical is the one using all twelve elementary imsets allowed by the complementary slackness condition (11), that is, the ones corresponding to boldfaced rows above:

Each of the vectors above is a generator of and so . ∎

Remark 4.7.

To show that lies in it is enough to express it as a nonnegative combination of vectors