On the Difficulty of Selecting Ising Models with Approximate Recovery

02/11/2016 ∙ by Jonathan Scarlett, et al. ∙ EPFL

In this paper, we consider the problem of estimating the underlying graph associated with an Ising model given a number of independent and identically distributed samples. We adopt an approximate recovery criterion that allows for a number of missed edges or incorrectly-included edges, in contrast with the widely-studied exact recovery problem. Our main results provide information-theoretic lower bounds on the sample complexity for graph classes imposing constraints on the number of edges, maximal degree, and other properties. We identify a broad range of scenarios where, either up to constant factors or logarithmic factors, our lower bounds match the best known lower bounds for the exact recovery criterion, several of which are known to be tight or near-tight. Hence, in these cases, approximate recovery has a similar difficulty to exact recovery in the minimax sense. Our bounds are obtained via a modification of Fano's inequality for handling the approximate recovery criterion, along with suitably-designed ensembles of graphs that can broadly be classed into two categories: (i) Those containing graphs that contain several isolated edges or cliques and are thus difficult to distinguish from the empty graph; (ii) Those containing graphs for which certain groups of nodes are highly correlated, thus making it difficult to determine precisely which edges connect them. We support our theoretical results on these ensembles with numerical experiments.


I Introduction

Graphical models are a widely-used tool for providing compact representations of the conditional independence relations between random variables, and arise in areas such as image processing [1], statistical physics [2], computational biology [3], natural language processing [4], and social network analysis [5]. The problem of graphical model selection consists of recovering the graph structure given a number of independent samples from the underlying distribution.

While this fundamental problem is NP-hard in general [6], there exist a variety of methods guaranteeing exact recovery with high probability on restricted classes of graphs, such as bounded degree and bounded number of edges. Existing works have focused primarily on Ising models and Gaussian models, and our focus in this paper is on the former.

In particular, we focus on the problem of approximate recovery, in which one can tolerate some number of missed edges or incorrectly-included edges. The motivation for such a study is that the exact recovery criterion is very restrictive, and not something that one would typically expect to achieve in practice. In particular, if the number of samples required for exact recovery is very large, it is of significant interest to know the potential savings by allowing for approximate recovery. The answer is unclear a priori, since this can lead to vastly improved scaling laws in some inference and learning problems [7] and virtually no gain in others [8].

Our main focus is on algorithm-independent lower bounds for Ising models, revealing the number of measurements required for approximate recovery regardless of the computational complexity. We extend Fano’s inequality [9, 10] to the case of approximate recovery, and apply it to restricted sets of graphs that prove the difficulty of approximate recovery.

Our main results reveal a broad range of graph classes for which the approximate recovery lower bounds exhibit the same scalings as the best-known exact recovery lower bounds [9, 10], which are known to be tight or near-tight in many cases of interest. This indicates that, at least for the classes that we consider, the approximate recovery problem is not much easier than the exact recovery problem in the minimax sense.

I-A Problem Statement

The ferromagnetic Ising model [11] is specified by a graph with vertex set and edge set . Each vertex is associated with a binary random variable , and the corresponding joint distribution is

(1)

where

(2)

and is a normalizing constant called the partition function. Here is a parameter to the distribution, sometimes called the inverse temperature.
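As a concrete illustration of the model just defined, the following minimal sketch enumerates all configurations of a small ferromagnetic Ising model and normalizes by the partition function. The symbols are assumptions made for the example, since the extracted text does not show them: nodes are labelled 0..p-1, spins take values in {-1, +1}, each edge is counted once in the exponent, and `lam` stands for the common edge parameter (the inverse temperature). Brute-force enumeration is only feasible for small p.

```python
import itertools
import math


def ising_distribution(p, edges, lam):
    """Return {configuration: probability} for a small ferromagnetic Ising model.

    p     -- number of nodes, labelled 0..p-1 (labelling assumed for this sketch)
    edges -- iterable of node pairs (i, j), each edge listed once
    lam   -- common edge weight (inverse temperature), assumed positive
    """
    weights = {}
    for x in itertools.product((-1, 1), repeat=p):
        # Sum of x_i * x_j over the edges of the graph.
        interaction = sum(x[i] * x[j] for i, j in edges)
        weights[x] = math.exp(lam * interaction)
    z = sum(weights.values())  # partition function (normalizing constant)
    return {x: w / z for x, w in weights.items()}


# Three nodes, a single edge between nodes 0 and 1; node 2 is isolated.
dist = ising_distribution(3, [(0, 1)], lam=0.5)
agree = sum(prob for x, prob in dist.items() if x[0] == x[1])
print(f"P[x_0 = x_1] = {agree:.4f}")
```

For a single edge, the two endpoints agree with probability exp(lam) / (2 cosh(lam)), which exceeds 1/2 for any positive `lam`; the isolated node is uniform and independent of the rest.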

Let be a matrix of independent samples from this distribution, each row corresponding to one such sample of the variables. Given , an estimator or decoder constructs an estimate of the graph , or equivalently, an estimate of the edge set .

Recovery Criterion: Given some class of graphs, the widely-studied exact recovery criterion seeks to characterize

(3)

We instead consider the following approximate recovery criterion, for some maximum number of errors :

(4)

where , so that denotes the edit distance, i.e., the number of edge insertions and deletions required to transform one graph to another. In this definition, does not depend on , and hence, the number of allowed edge errors does not depend on the graph itself. We consider graph classes with a maximum number of edges equal to some value , and set for some constant not scaling with the problem size. Note that would trivially give .
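The edit distance described above is the size of the symmetric difference of the two edge sets. A minimal sketch follows; the function names and the parameter name `q_max` (the maximum number of allowed edge errors) are hypothetical labels chosen for the example.

```python
def normalize(edges):
    """Represent undirected edges as a set of sorted node pairs."""
    return {tuple(sorted(e)) for e in edges}


def edit_distance(edges_a, edges_b):
    """Number of edge insertions and deletions turning one graph into the other."""
    return len(normalize(edges_a) ^ normalize(edges_b))


def is_error(true_edges, estimated_edges, q_max):
    """Approximate-recovery error event: more than q_max edge errors."""
    return edit_distance(true_edges, estimated_edges) > q_max


true_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
estimate = [(1, 0), (1, 2), (0, 4)]  # misses two edges, adds one wrong edge

print(edit_distance(true_edges, estimate))
print(is_error(true_edges, estimate, q_max=3))
print(is_error(true_edges, estimate, q_max=0))
```

Setting `q_max = 0` makes the error event coincide with the exact recovery criterion, since any single missed or spurious edge then counts as an error.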

Graph Classes: We consider the following three nested classes of graphs :

  • (Edge bounded class ) This class contains all graphs with at most edges.

  • (Edge and degree bounded class ) This class contains the graphs in such that each node has degree (i.e., number of edges it is involved in) at most .

  • (Sparse separator class ) This class contains the graphs in satisfying the -separation condition [12]: For any two non-connected vertices in the graph, one can simultaneously block all paths of length or less by blocking at most nodes.

The restriction on the number of edges is motivated by the fact that real-world graphs are often sparse. The restriction on the degree is also relevant in applications, and is particularly commonly-assumed in the statistical physics literature. The sparse separation condition is somewhat more technical, but it is of interest since it is known to permit polynomial-time exact recovery in many cases [12, 13]. Moreover, it is known to hold with high probability for several interesting random graphs; see [12] for some examples.
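Membership in the first two classes can be checked directly from the edge list. The sketch below assumes hypothetical parameter names `k` (maximum number of edges) and `d` (maximum degree); the separation condition of the third class is more involved and is not checked here.

```python
from collections import Counter


def unique_edges(edges):
    """Undirected edges as a set of frozensets, dropping duplicates and self-loops."""
    return {frozenset(e) for e in edges if e[0] != e[1]}


def max_degree(edges):
    """Largest number of edges incident to any single node."""
    degree = Counter()
    for edge in unique_edges(edges):
        for node in edge:
            degree[node] += 1
    return max(degree.values(), default=0)


def in_edge_bounded_class(edges, k):
    """True if the graph has at most k edges."""
    return len(unique_edges(edges)) <= k


def in_edge_degree_bounded_class(edges, k, d):
    """True if the graph has at most k edges and every node has degree at most d."""
    return in_edge_bounded_class(edges, k) and max_degree(edges) <= d


star = [(0, 1), (0, 2), (0, 3)]  # node 0 has degree 3
print(in_edge_degree_bounded_class(star, k=3, d=3))
print(in_edge_degree_bounded_class(star, k=3, d=2))
```

The nesting of the classes is visible in the code: the degree-bounded check calls the edge-bounded check and then adds one further constraint.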

Generalized Edge Weights: A generalization of the above Ising model allows to take different non-zero values for each , some of which may be negative. Previous works considering model selection for this generalized model have sought minimax bounds with respect to the graph class and these parameters subject to for some and . The lower bounds derived in this paper immediately imply corresponding lower bounds for this generalized setting, provided that our parameter in (2) lies in the range .

Notation and Terminology: Throughout the paper, we let and denote probabilities and expectations with respect to (e.g., , ). We denote the floor function by , and the ceiling function by . We use the standard terminology that the degree of a node is the number of edges in containing , and that a clique is a subset of size at least two within which all pairs of nodes have an edge between them.

I-B Related Work

A variety of algorithms with varying levels of computational efficiency have been proposed for selecting Ising models with rigorous guarantees, including conditional independence tests for candidate neighborhoods [14], correlation tests in the presence of sparse separators [12, 15], greedy techniques [16, 17, 18, 19], convex optimization approaches [20], elementary estimators [21], and intractable information-theoretic techniques [9].

These works have made various assumptions on the underlying model, including incoherence assumptions [20, 21] and long-range correlation assumptions [12, 15]. A notable recent work avoiding these is [19], which provides recovery guarantees using an algorithm whose complexity is only quadratic in the number of nodes for a fixed maximum degree, thus resolving an open question posed in [22].

Early works providing algorithm-independent lower bounds used only graph-theoretic properties [14, 12, 23]; the resulting bounds are loose in general, since they do not capture the effects of the parameters of the joint distribution (e.g., ). Several refined bounds were given in [9] for graphs with a bounded degree or a bounded number of edges. Additional classes were considered in [10], including the bounded girth class and a class related to the separation criterion of [12] (and hence related to defined above). While our techniques build on those of [9, 10], we must consider significantly different ensembles, since those in [9, 10] contain graphs that differ only by one or two edges, thus making approximate recovery trivial.

To our knowledge, the only other work giving an approximate recovery bound for the Ising model is [24], where the degree-bounded class is considered. The effect of edge weights is not considered therein, and the bound is proved by counting graphs rather than constructing restricted ensembles. Consequently, only an necessary condition is shown, in contrast with our bounds containing a or term (cf., Table I). The necessary conditions for list decoding [25] bear some similarity to approximate recovery, but the problem and its analysis are in fact much more similar to exact recovery, allowing the ensembles from [9, 10] to be applied directly.

Beyond Ising models, several works have provided necessary and sufficient conditions for recovering Gaussian graphical models [26, 27, 13, 28, 29]. In this context, a necessary condition for approximate recovery was given in [13, Cor. 7], but the corresponding assumptions and techniques used were vastly different to ours: The random Erdős-Rényi model was considered instead of a deterministic class, and an additional walk-summability condition specific to the Gaussian model was imposed.

I-C Contributions

Our main results, and the corresponding existing results for exact recovery, are summarized in Table I, where we provide necessary scaling laws on the number of samples needed to obtain a vanishing probability of error . Note that some of the exact recovery conditions given in the final column were not explicitly given in [9, 10], but they can easily be inferred from the proofs therein; see Section II for further discussion. We also observe that our analysis requires handling more cases separately compared to [9, 10]; in those works, the final three rows corresponding to in Table I are all a single case giving scaling, and similarly for .

Graph Class Parameters Necessary for approximate recovery (this paper) Best known necessary for exact recovery [9, 10]

 
Bounded edge
Distortion
(Theorems 1 and 2)
Exponential in Exponential in
(between and )

 
Bounded edge and degree
Distortion
(Theorems 3 and 4)
Exponential in Exponential in
(between and )

Bounded edge and degree with sparse separators
Distortion (, )
(Theorem 5)
Exponential in Exponential in
Table I: Summary of main results on partial recovery, and comparisons to the best known necessary conditions for exact recovery. Each entry shows the necessary scaling law for the number of samples required to achieve a vanishing error probability.

Table I reveals the following facts:

  1. In all of the known cases where exact recovery is known to be difficult, i.e., exponential in a quantity that increases in the problem dimension, the same difficulty is observed for approximate recovery, at least for the values of shown. For and , this is true even when we allow for up to a quarter of the edges to be in error. Note that we did not seek to optimize this fraction in our analysis, and we expect similar difficulties to arise even when higher proportions of errors are allowed. In fact, by a simple variation of our analysis outlined in Remark 1 in Section IV-C, we can already increase this fraction from to .

  2. In many of the cases where the necessary conditions for exact recovery lack exponential terms, the corresponding necessary conditions for approximate recovery are identical or near-identical; in particular, see the second and third rows corresponding to , the second and third rows corresponding to , and the second row corresponding to with . While there are logarithmic terms missing in some cases (e.g., vs. ), these are typically insignificant in the regimes considered (e.g., ).

  3. In contrast, there are some cases where significant gaps remain between the best-known conditions for exact recovery and approximate recovery. The two most extreme cases are as follows: (i) If for some small , the necessary conditions for are and , respectively; (ii) If , then the necessary conditions for are and , respectively. It remains an open problem as to whether this behavior is fundamental, or due to a weakness in the analysis.

The starting point of our results is a modification of Fano’s inequality for the purpose of handling approximate recovery. To obtain the above results, we apply this bound to ensembles of graphs that can be broadly classed into two categories. The first considers graphs with a large number of isolated edges, or more generally, isolated cliques. We characterize how difficult each graph is to distinguish from the empty graph, and use this to derive the results given in item 2) above. On the other hand, the results on the exponential terms discussed in item 1) arise from considering ensembles in which several groups of nodes are always highly correlated due to the presence of a large number of edges among them, thus making it difficult to determine precisely which edges these are.

Both of these categories help in providing bounds that match those for exact recovery. For example, the behavior for in [9] is proved by considering graphs with a single isolated edge, and our analysis extends this to approximate recovery by considering graphs with isolated edges. Analogously, the exponential behavior (e.g., in ) in [9] is proved by considering cliques with one edge removed, and our analysis reveals that the same exponential behavior arises even if a constant fraction of the edges are removed.

We provide numerical results on our ensembles in Section VI supporting our theoretical findings. Specifically, we implement optimal or near-optimal decoding rules in a variety of cases, and find that while approximate recovery can be easier than exact recovery, the general behavior of the two is similar.

II Main Results

In this section, we present our main results, namely, algorithm-independent necessary conditions for the criterion in (4) with all . Our conditions are written in terms of asymptotic terms for clarity, but purely non-asymptotic variants can be inferred from the proofs. Throughout the section, we make use of the binary entropy function in nats, . Here and subsequently, all logarithms have base .
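The binary entropy function in nats can be sketched as follows. The argument name `theta` is an assumption for the example. The second part illustrates the standard fact that the logarithm of a binomial coefficient, normalized by the number of items, approaches the binary entropy of the chosen fraction; this is a textbook asymptotic, not an expression quoted from the theorems.

```python
import math


def binary_entropy_nats(theta):
    """Binary entropy -theta*log(theta) - (1-theta)*log(1-theta), natural logarithm."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    if theta in (0.0, 1.0):
        return 0.0  # limiting value, by the convention 0 * log(0) = 0
    return -theta * math.log(theta) - (1.0 - theta) * math.log(1.0 - theta)


# The maximum value log(2) is attained at theta = 1/2.
print(binary_entropy_nats(0.5), math.log(2))

# (1/N) * log C(N, theta*N) is close to the binary entropy for large N.
n_items, theta = 10_000, 0.25
normalized_log_binomial = math.log(math.comb(n_items, int(theta * n_items))) / n_items
print(normalized_log_binomial, binary_entropy_nats(theta))
```

The small remaining gap between the two printed values in the last line is of order log(N)/N and vanishes as N grows.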

All proofs are deferred to later sections; some preliminary results are presented in Section III, a number of ensembles are presented and analyzed in Section IV, and the resulting theorems are deduced in Section V.

II-A Bounded Number of Edges Class

We first consider the class of graphs with at most edges. It will prove convenient to treat two cases separately depending on how scales with .

Theorem 1.

(Class with ) For any number of edges such that and , and any distortion level for some , it is necessary that

(5)

in order to have for all .

We proceed by considering two cases as in [9]. In the case that at any rate faster than logarithmic in (or even logarithmic with a constant that is not too small), the sample complexity is dominated by the exponential term , and any recovery procedure requires a huge number of samples. Thus, in this case, even the approximate recovery problem is very difficult. On the other hand, if then the second condition in (5) gives a sample complexity of , since as .

These observations are the same as those made for exact recovery in [9], where the best known necessary conditions for were given. Thus, we have reached similar conclusions even allowing for nearly a quarter of the edges to be in error.

Theorem 2.

(Class with ) For any number of edges of the form for constants and , and any distortion level for some , it is necessary that

(6)

in order to have for all .

As above, the sample complexity is exponential in due to the first term in (6). On the other hand, we claim that when , the second term in (6) leads to the sample complexity . To see this, we choose as in the theorem statement and note that ; since as , this implies that . We thus have , which finally yields .

When and , we have , and hence, these observations are again the same as those made for exact recovery in [9], except that our growth rates do not include a term; this logarithmic factor is insignificant compared to the leading term . In contrast, the gap is more significant when ; in the extreme case, when for some small , we obtain a scaling of , as opposed to .

II-B Bounded Degree Class

Next, we consider the class of graphs such that every node has degree at most , and the total number of edges does not exceed .

Theorem 3.

(Class with ) For any maximal degree and number of edges such that and , and any distortion level for some , it is necessary that

(7)

in order to have for all .

The first term in (7) reveals that the sample complexity is exponential in . On the other hand, if then the second term gives a sample complexity of .

We cannot directly compare Theorem 3 to [9], since there was assumed to be unrestricted for the degree-bounded ensemble. However, the analysis therein is easily extended to , and doing so recovers the nearly identical observations to those above, as summarized in Table I. In this sense, Theorem 3 matches the best known necessary conditions for exact recovery even when nearly a quarter of the edges may be in error.

Theorem 4.

(Class with ) For any maximal degree and number of edges such that and for some , and any distortion level for some , it is necessary that

(8)

in order to have for all .

The sample complexity remains exponential in . By some standard asymptotic expansions similar to those following Theorem 2, we have whenever ; hence, the second condition in (8) becomes . Thus, if then we again get the desired behavior; this means that we can allow for up to . More generally, we instead get the possibly weaker scaling law , which is equivalent to when . In the extreme case, when (the highest growth rate possible given the degree constraint alone), this only recovers scaling.

II-C Sparse Separator Class

We now consider the class of graphs in that satisfy the -separation condition [12]. We focus on the case , since the main graph ensemble that we consider for this class is not suited to the case that .

Theorem 5.

(Class with ) Fix any parameters with and , and let be an integer in . For any distortion level for some and , it is necessary that

(9)

in order to have for all .

We proceed by considering only the case , though simplifications of Theorem 5 for are also possible. With , we have for some , and similarly for some [10, Sec. 5]. These identities reveal that the sample complexity is exponential in both and . On the other hand, if and then the second term in (9) gives .

Due to the choice , if we set then we are only in the regime of a constant fraction of errors if . This is true, for example, if so that the separator set size is a fixed fraction of the maximum degree, and so that the separation is with respect to paths of a bounded length.

More generally, to handle larger values of , one can choose a smaller value of , thus leading to a larger value of but with a less stringent condition on the number of measurements in (9). In the extreme case, , and then we are always in the regime of a constant proportion of errors; however, this yields a necessary condition not depending on or .

The graph family studied in [10, Thm. 2] is somewhat different from , in particular not putting any constraints on the maximal degree nor the number of edges. Nevertheless, by choosing the parameters in the proof therein to meet these constraints (specifically, in [10, Sec. 9.2], one can set to satisfy the degree constraint, and then choose to ensure there are at most edges in total), one again obtains similar conditions to those above, as summarized in Table I. In particular, for any choice of that grows as , the scaling laws for exact recovery and approximate recovery coincide.

III Auxiliary Results

In this section, we provide a number of auxiliary results that will be used to prove the theorems in Section II. We first present a general form of Fano’s inequality depending on both the Kullback-Leibler (KL) divergence and edit distance between graphs, and then provide a number of properties of Ising models that will be useful for characterizing the KL divergence and edit distance in specific scenarios.

III-A Fano’s Inequality for Approximate Recovery

As is common in studies of algorithm-independent lower bounds in learning problems, we make use of bounds based on Fano’s inequality [30, Sec. 2.10]. We first briefly outline the most relevant results for the exact recovery problem.

Recall the definitions of and in (3)–(4) with respect to a given graph class . It is known that for any subset , and any covering set such that any graph has an “-close” graph satisfying , we have [10]

(10)

In particular, if is a singleton, solving for gives the necessary condition

(11)

in order to have .

For approximate recovery, we consider ensembles (i.e., choices of ) for which the decoder’s outputs may lie in some set without loss of optimality; in most cases we will have , but in general, need not even be a subset of the graph class . We use the following generalization of (11).

Lemma 1.

Suppose that the decoder minimizing the average error probability with respect to a distortion level , averaged over a graph uniformly drawn from a set , always outputs a graph in some set . Moreover, suppose that there exists a graph such that for all , and that there are at most graphs in within an edit distance of any given graph . Then it is necessary that

(12)

in order to have .

Proof.

See Appendix A. ∎

III-B Properties of Ferromagnetic Ising Models

We will use a number of useful results on ferromagnetic Ising models, each of which is either self-evident or can be found in [9] or [10]. We start with some basic properties.

Lemma 2.

For any graphs and with edge sets and respectively, we have the following:

(i) For any pair , we have [9]

(13)

(ii) The divergence between the corresponding distributions satisfies [10, Eq. (4)]

(14)
(15)

(iii) If , then we have for any pair that [10, Eq. (13)]

(16)

(iv) Let be a partition of into disjoint non-empty subsets. If and are such that there are no edges between nodes in and when , then

(17)

where , with containing the edges in between nodes in (and analogously for ).
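The additivity of the KL divergence across groups of nodes with no edges between them, as in part (iv), can be checked numerically on small graphs. The sketch below is a brute-force illustration under assumed conventions (nodes labelled 0..p-1, spins in {-1, +1}, each edge counted once, a common edge weight `lam`); it also evaluates the divergence from a single-edge graph to the empty graph, for which a direct calculation under these conventions gives lam*tanh(lam) - log(cosh(lam)).

```python
import itertools
import math


def ising_probabilities(p, edges, lam):
    """Probabilities of all 2**p configurations, in a fixed enumeration order."""
    states = itertools.product((-1, 1), repeat=p)
    weights = [math.exp(lam * sum(x[i] * x[j] for i, j in edges)) for x in states]
    z = sum(weights)  # partition function
    return [w / z for w in weights]


def kl_divergence(p, edges_a, edges_b, lam):
    """KL divergence (in nats) from the model on edges_a to the model on edges_b."""
    prob_a = ising_probabilities(p, edges_a, lam)
    prob_b = ising_probabilities(p, edges_b, lam)
    return sum(a * math.log(a / b) for a, b in zip(prob_a, prob_b))


lam = 0.7
# Five nodes split into groups {0, 1, 2} and {3, 4}, with no edges across groups.
whole = kl_divergence(5, [(0, 1), (1, 2), (3, 4)], [(0, 1)], lam)
group_one = kl_divergence(3, [(0, 1), (1, 2)], [(0, 1)], lam)
group_two = kl_divergence(2, [(0, 1)], [], lam)  # single edge versus empty graph

print(whole, group_one + group_two)
print(group_two, lam * math.tanh(lam) - math.log(math.cosh(lam)))
```

Additivity holds because the joint distribution factorizes over the groups when no edges cross between them, and the KL divergence between product distributions is the sum of the per-factor divergences. Since log(cosh(lam)) is non-negative, the single-edge divergence is at most lam*tanh(lam).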

The remaining properties concern the probabilities, expectations and divergences associated with more specific graphs.

Lemma 3.

(i) If is obtained from by removing a single edge , then [10, Eq. (19)]

(18)

and [10, Lemma 4]

(19)

(ii) Let contain a clique on nodes and no other edges, and let be obtained from by removing a single edge . Then, defining , we have [9, Eq. (31)]

(20)

Moreover, we have [9, Lemma 1]

(21)

and

(22)

(iii) Suppose that for some edge , there exist at least node-disjoint paths of length between and in . Then [10, Lemma 3]

(23)

If the same is true in both and for all , then [10, Cor. 3]

(24)

(iv) More generally, if there exist at least node-disjoint paths of length between for , where the values of are all distinct, then

(25)

IV Graph Ensembles and Lower Bounds on their Sample Complexities

In this section, we provide necessary conditions for the approximate recovery of a number of ensembles, making use of the tools from the previous section. In particular, we seek choices of , and for substitution into Fano’s inequality in Lemma 1. In Section V, we use these to establish our main theorems.

IV-A Ensemble 1: Many Isolated Edges

This ensemble contains numerous isolated edges, such that if is small then it is difficult to determine precisely which ones are present. It is constructed as follows with some integer parameter :

Ensemble1() [Isolated edges ensemble]: Each graph in is obtained by forming exactly node-disjoint edges that may otherwise be arbitrary.

For this ensemble, we have the following properties:

  • The number of graphs is , since by the assumption .

  • The maximum degree of each graph is one.

  • For this ensemble, it suffices to trivially let contain all graphs.

  • The number of graphs within an edit distance of any single graph is upper bounded as . Here the term corresponds to choosing edges to remove, and the term upper bounds the number of ways to add new edges. We have also used the fact that is maximized at .

  • From (19), the KL divergence from a single-edge graph to the empty graph is upper bounded by . Using this fact along with (17), any graph in has a KL divergence to the empty graph of at most .
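To make the isolated edges ensemble concrete, the sketch below draws one graph with a prescribed number of node-disjoint edges and counts how many such graphs exist. The names `p` (number of nodes) and `alpha` (number of edges) are assumptions for the example, and the count is the exact textbook formula for matchings of a given size, which may differ in form from the simplified expression used in the bullet list above.

```python
import math
import random


def sample_isolated_edges(p, alpha, rng):
    """Draw a graph on nodes 0..p-1 consisting of alpha node-disjoint edges."""
    if 2 * alpha > p:
        raise ValueError("need at least 2 * alpha nodes")
    nodes = rng.sample(range(p), 2 * alpha)  # distinct endpoints, random order
    return {tuple(sorted(nodes[2 * t: 2 * t + 2])) for t in range(alpha)}


def count_isolated_edge_graphs(p, alpha):
    """Exact number of graphs on p labelled nodes with alpha node-disjoint edges."""
    return math.factorial(p) // (
        math.factorial(p - 2 * alpha) * math.factorial(alpha) * 2 ** alpha
    )


rng = random.Random(0)
graph = sample_isolated_edges(p=10, alpha=3, rng=rng)
print(sorted(graph))
print(count_isolated_edge_graphs(4, 2))  # perfect matchings on four nodes
```

Every graph in the ensemble has maximum degree one, which is why the ensemble fits inside the degree-bounded class for any degree bound of at least one.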

Combining these with (12) gives the necessary condition

(26)

in order to have .

Simplifying both and to , and writing as well as , we can simplify (26) to

(27)

provided that and . Letting for some , this becomes

(28)

IV-B Ensemble 2: Many Isolated Groups of Nodes

As an alternative to Ensemble 1, this ensemble allows for significantly more edges, in particular permitting . It is constructed as follows with integer parameters and :

Ensemble2(,) [Isolated cliques ensemble]: Form fixed groups of nodes, each containing nodes. Each graph in is formed by forming arbitrarily many edges within each group, but no edges between the groups.

For this ensemble, we have the following:

  • The number of nodes forming these groups is .

  • The total number of possible edges is , and hence the total number of graphs is .

  • The maximal degree of each graph is at most .

  • The decoder can output an element of without loss of optimality, since any inter-group edges declared to be present are guaranteed to be wrong. Thus, we may set .

  • The number of graphs within an edit distance of any single graph is , assuming .

  • In Lemma 4 below, we show that the KL divergence of the graph associated with one group to the corresponding empty graph is upper bounded by . Hence, the KL divergence of any to the empty graph is upper bounded by due to (17).
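The counting steps for this ensemble can be sketched as follows. The names `num_groups` and `group_size` are assumptions standing in for the two integer parameters, and groups are taken to be consecutive blocks of node labels purely for convenience.

```python
import itertools
import random


def group_edge_slots(num_groups, group_size):
    """All node pairs lying inside one of the fixed groups (no inter-group pairs)."""
    slots = []
    for g in range(num_groups):
        members = range(g * group_size, (g + 1) * group_size)
        slots.extend(itertools.combinations(members, 2))
    return slots


def sample_group_graph(num_groups, group_size, rng):
    """Draw a graph uniformly from the ensemble: each within-group pair is an edge w.p. 1/2."""
    return {e for e in group_edge_slots(num_groups, group_size) if rng.random() < 0.5}


slots = group_edge_slots(num_groups=3, group_size=4)
print(len(slots))       # 3 groups * C(4, 2) possible edges
print(2 ** len(slots))  # number of graphs in the ensemble

rng = random.Random(1)
graph = sample_group_graph(3, 4, rng)
print(sorted(graph))
```

Because every edge stays within a group, each node can have at most `group_size - 1` neighbours, and a decoder gains nothing by declaring an inter-group edge.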

Substituting these into (12), setting for some , and applying some simplifications, we obtain the following necessary condition for :

(29)

whenever . Note that the binary entropy function arises from the identity as .

It remains to prove the claim on the KL divergence, formalized as follows.

Lemma 4.

Let denote an arbitrary graph with edges connected to at most nodes, and let be the empty graph. Then, it holds that

(30)
Proof.

We prove the claim for the case that contains a single -clique; the general case then follows in a similar fashion using (16).

Let be obtained from by removing a single edge, say indexed by . Defining and , we have from (18) that

(31)

and from (20) that

(32)

Noting the symmetry of the summands with respect to and , we obtain the following when is odd (the case that is even is handled similarly, leading to the same conclusion):

(33)
(34)
(35)
(36)

Substituting (36) into (31), solving for , and converting from probability to expectation via (13), we