Robust Estimation of Tree Structured Ising Models

06/10/2020 ∙ by Ashish Katiyar, et al. ∙ The University of Texas at Austin

We consider the task of learning Ising models when the signs of different random variables are flipped independently with possibly unequal, unknown probabilities. In this paper, we focus on the problem of robust estimation of tree-structured Ising models. Without any additional assumption of side information, this is an open problem. We first prove that this problem is unidentifiable, however, this unidentifiability is limited to a small equivalence class of trees formed by leaf nodes exchanging positions with their neighbors. Next, we propose an algorithm to solve the above problem with logarithmic sample complexity in the number of nodes and polynomial run-time complexity. Lastly, we empirically demonstrate that, as expected, existing algorithms are not inherently robust in the proposed setting whereas our algorithm correctly recovers the underlying equivalence class.


1 Introduction

Undirected graphical models or Markov Random Fields (MRFs) are a powerful tool for describing high-dimensional distributions using an associated dependency graph, which encodes the conditional dependencies between random variables. In this paper, we focus on a special class of MRFs called the Ising model, first introduced in [15] to represent spin systems in quantum physics [7]. Recently, Ising models have also proven quite popular in biology [16], engineering [9, 31], computer vision [30], and also in the optimization and OR communities, including in finance [37] and social networks [26]. The special class of tree-structured Ising models is beneficial for applications in statistical physics over non-amenable graphs. A detailed description and further references can be found in [25].

In this paper, we explore the problem of learning the underlying graph of tree-structured Ising models with independent, unknown, unequal error probabilities. Such scenarios are relatively common in networks where certain malfunctioning sensors can report flipped bits. The presence of noise breaks down the conditional independence structure of the noiseless setting, and the noisy samples may no longer follow an Ising model distribution.

In 2011, [8] highlighted the importance of robustness in Ising models. Recent works [13, 14, 21] have tried to address this problem. However, they assume side information about the error probabilities, which is unavailable and difficult to estimate in most practical settings. In closely related work for tree-structured Ising models, [27, 28] address this problem by building on the seminal work of [10], popularly known as the Chow-Liu algorithm, which proposes the maximum-weight spanning tree of the pairwise mutual informations as an estimate of the tree structure. In [27], they consider the simplified case where each node has an equal probability of error, and [28] assumes that the error doesn't alter the order of the mutual informations. Both assumptions imply that, asymptotically, Chow-Liu converges to the correct tree. However, these assumptions don't arise naturally and are difficult to check from access to only noisy data. To the best of our knowledge, there does not exist an analysis of what happens beyond this limiting assumption of order preservation of mutual information.

In fact, section 5.1 of [6] provides an example of the unidentifiability of the problem for a graph on 3 nodes and says that the problem is ill-defined. We reconsider this problem, and show that for the special class of tree structured Ising models, although the problem is not identifiable, nevertheless the unidentifiability is limited to an equivalence class of trees. Thus, more appropriately, one can cast the problem of learning in the presence of unknown, unequal noise as the problem of learning this equivalence class.

Key Contributions
  1. We show that the problem of learning tree structured Ising models when the observations flip with independent, unknown, possibly unequal probability is unidentifiable (Theorem 2).

  2. As depicted in Figure 1, the unidentifiability is restricted to the equivalence class of trees obtained by permuting within the leaf nodes and their neighbors (Theorem 1).

  3. We provide an algorithm to recover this equivalence class from noisy samples. The sample complexity is logarithmic and the time complexity is polynomial in the number of nodes. (Section 4)

  4. We experimentally demonstrate that existing algorithms like Chow-Liu or Sparsitron for learning Ising models are not inherently robust against this noise model, whereas our algorithm correctly recovers the equivalence class. (Section 5)

2 Related Work

Efficient algorithms for structure learning of Ising models can be divided into three main categories based on their assumptions: i) special graph structures [1, 10, 11, 32, 5], ii) nature of interaction between variables such as correlation decay property (CDP) [3, 4, 6, 20, 29], iii) bounded degree/width [2, 12, 18, 24, 35, 33]. However, these algorithms assume access to uncorrupted samples.

In the last decade, there has been a lot of research on robust estimation of graphical models [17, 19, 22, 23, 34, 36], and [17] explicitly considered partial identifiability in the presence of noise for the Gaussian case. However, extending the above frameworks to the robust structure learning of Ising models remains a challenge. [13, 14, 21] have tried to solve the problem of robust estimation of general Ising models under the assumption of access to the probability of error for each node. Recently, [27, 28] proposed algorithms to estimate the underlying graph structure of tree-structured Ising models in the presence of noise under the strong assumption that the probability of error does not alter the order of the mutual informations for the tree. Both of these assumptions are restrictive and impractical. In this paper, we present the first algorithm that can robustly recover the underlying tree-structured Ising model (up to an equivalence class) in the presence of corruption via unknown, unequal, independent noise.

3 Identifiability Result

Problem Setup:

Let {X_1, ..., X_n} be a set of random variables with support on {-1, +1} and X = (X_1, ..., X_n) be the vector of these random variables. Suppose the conditional independence structure of the variables of X is given by a tree T*. This implies that the distribution of X can be represented by an Ising model. In our model, we have observations Y_i where the sign of each X_i flips independently with probability q_i. We denote the probability of error by the vector q = (q_1, ..., q_n) and the noisy random variables by Y = (Y_1, ..., Y_n). The error disrupts the tree-structured conditional independence, and the graphical model of Y is in general a complete graph when the error probabilities are non-zero. In fact, Y need not even be an Ising model. Given samples of Y, we want to find the tree structure T*. The ideas of the analysis for this problem are based on graph structure properties introduced in [17], where they were applied in the different context of Gaussian graphical models.

Equivalence Class for T*

Given a tree T*, let us define its equivalence class, denoted by T_eq(T*). Let the set of all nodes of T* that are connected to one or more leaf nodes be denoted by B. Construct sets {S_b : b in B} such that the set S_b contains all the leaf nodes connected to b in T* along with b itself. Sample one node from each newly constructed set and collect them in a subset A; there exist as many unique subsets as the product of the sizes of the sets S_b. For each such subset A, create a tree T_A by setting the sampled node of each S_b as the internal node and all the remaining nodes of S_b as leaf nodes attached to it, in the same position where S_b was attached in T*. Therefore, there is a one-to-one correspondence between a tree T_A and its corresponding subset A. We define the set of these trees as T_eq(T*).

Figure 1: The equivalence class T_eq(T*) for the tree above is obtained by permuting the nodes within the dotted regions. Noisy samples from this Ising model can be used to estimate the tree only up to the equivalence class T_eq(T*).

Figure 1 gives an example of T_eq(T*).

Model Assumptions
Assumption 1.

(Bounded Mean) The absolute value of the mean of every node is bounded: |E[X_i]| ≤ μ_max < 1 for all i.

Assumption 2.

(Bounded Correlation) The correlation of any two nodes X_i and X_j connected by an edge satisfies α ≤ |ρ_ij| ≤ β, where 0 < α ≤ β < 1.

Assumption 3.

(Bounded error probability) The error probability of every node satisfies 0 ≤ q_i ≤ q_max < 1/2.

These assumptions arise naturally. Assumption 1 ensures that no variable approaches a constant and hence gets disconnected from the tree. The lower bound in Assumption 2 also ensures that every node is connected. The upper bound in Assumption 2 ensures that no two nodes are duplicated. Assumption 3 ensures the noisy node doesn’t become independent of every other node due to the error.

Limited unidentifiability of the problem

In Theorem 1, we prove that it is possible to recover T_eq(T*) from the samples of Y. Further, in Theorem 2 we prove that, given the distribution of Y, there exists an Ising model for each tree in T_eq(T*) such that, for some noise vector, its noisy distribution is the same as that of Y.

Theorem 1.

Suppose X and X' are two sets of binary valued random variables satisfying Assumption 1 whose conditional independence structures are given by trees T and T' respectively, each satisfying Assumption 2. Assume that each node in both these distributions X and X' is allowed to be flipped independently with probabilities satisfying Assumption 3, and let Y and Y' represent the resulting noisy versions of X and X' respectively. If Y and Y' have the same distribution, then T_eq(T) = T_eq(T').

Proof.

The proof of this theorem relies on this key observation: the probability distribution of the noisy samples completely determines the categorization of any set of 4 nodes as a star or non-star shape. Any set of 4 nodes forms a non-star if there exists at least one edge which, when removed, splits the tree into two subtrees such that exactly 2 of the 4 nodes lie in one subtree and the other 2 nodes lie in the other subtree. The nodes in the same subtree form a pair. If the set is not a non-star, it is categorized as a star. Figure 1 contains examples of both configurations.

Once we prove this key observation, the rest of the proof follows easily and we refer the reader to parts (ii) and (iii) of the proof of Theorem 2 from [17]. We denote the correlation between two nodes X_i and X_j in the non-noisy setting by ρ_ij and in the noisy setting by ρ'_ij. Similarly, the covariances are denoted by Σ_ij and Σ'_ij. We utilize the correlation decay property of tree structured Ising models, which is stated in Lemma 1.

Lemma 1.

(Correlation Decay) Any 2 nodes X_i and X_j have the conditional independence relation specified by a tree structured Ising model such that the path between them is (X_i = X_{k_1}, X_{k_2}, ..., X_{k_m} = X_j) if and only if their correlation is given by:

ρ_ij = ρ_{k_1 k_2} ρ_{k_2 k_3} ... ρ_{k_{m-1} k_m}.   (1)

The proof of this lemma is provided in Appendix A. We also prove in Appendix B that E[Y_i] = (1 - 2q_i) E[X_i] and Σ'_ij = (1 - 2q_i)(1 - 2q_j) Σ_ij.
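As a quick numerical sanity check of Lemma 1 (not part of the original proof), the following sketch enumerates the joint distribution of a small chain-structured Ising model and compares ρ_13 with ρ_12 ρ_23; the weights and biases are hypothetical values chosen only for illustration.

```python
import itertools
import numpy as np

# Hypothetical 3-node chain X1 - X2 - X3: W[0, 1] and W[1, 2] are the only edges.
W = np.array([[0.0, 0.6, 0.0],
              [0.6, 0.0, 0.4],
              [0.0, 0.4, 0.0]])
theta = np.array([0.1, -0.2, 0.05])  # bias vector

# Enumerate all states in {-1, +1}^3 and compute the Ising distribution.
states = np.array(list(itertools.product([-1, 1], repeat=3)))
logits = 0.5 * np.einsum('si,ij,sj->s', states, W, states) + states @ theta
p = np.exp(logits)
p /= p.sum()

mean = p @ states
cov = (states * p[:, None]).T @ states - np.outer(mean, mean)
corr = cov / np.sqrt(np.outer(np.diag(cov), np.diag(cov)))

# Lemma 1: the correlation along a path factorizes over its edges.
print(corr[0, 2], corr[0, 1] * corr[1, 2])  # the two values agree up to floating point
```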

Categorizing a set of 4 nodes as star/non-star

We first look at a graphical model on 3 nodes X_1, X_2, X_3 whose conditional independence is given by a chain with X_1 ⊥ X_3 | X_2. By Lemma 1, we have ρ_13 = ρ_12 ρ_23.

Suppose the signs of X_1, X_2, X_3 flip independently with probabilities q_1, q_2, q_3 respectively. Substituting the correlations and the mean of X_2 in terms of their noisy counterparts gives us:

(1 - 2q_2)^2 = E[Y_2]^2 + Σ'_12 Σ'_23 / Σ'_13.   (2)

If we had prior knowledge about the underlying conditional independence relation, this quadratic equation in q_2, which depends only on quantities measurable from noisy data, could be solved to estimate the probability of error of X_2 (see the sketch below).
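This is a minimal sketch of that computation, assuming the form of Equation (2) given above; the covariances Sigma12, Sigma23, Sigma13 and the mean EY2 are quantities estimated from the noisy samples, and the function name is ours.

```python
import math

def estimate_error_prob(Sigma12, Sigma23, Sigma13, EY2):
    """Solve Equation (2) for q2 in the chain X1 - X2 - X3.

    Sigma12, Sigma23, Sigma13 are noisy covariances and EY2 is the noisy
    mean of node 2, all measurable from the noisy samples alone.
    """
    # (1 - 2*q2)^2 = E[Y2]^2 + Sigma'_12 * Sigma'_23 / Sigma'_13
    rhs = EY2 ** 2 + Sigma12 * Sigma23 / Sigma13
    if not 0.0 < rhs <= 1.0:
        raise ValueError("no valid error probability in [0, 1/2)")
    return 0.5 * (1.0 - math.sqrt(rhs))
```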

We prove in Appendix C that Equation (2) gives a valid solution for any configuration of 3 nodes in a tree structured Ising model. Therefore, in the absence of the knowledge that X_1 ⊥ X_3 | X_2, we can estimate a probability of error for each of the 3 nodes which enforces a graph structure in which the other 2 nodes are independent conditioned on that node. Thus, irrespective of the true underlying conditional independence relation, we can always find a probability of error for each node which makes any other pair of nodes conditionally independent. We use this concept to classify a tree on 4 nodes as star or non-star shaped.

We follow the notation where q_k^(i,j) denotes the estimated probability of error of X_k which enforces the conditional independence X_i ⊥ X_j | X_k.

Condition for star/non-star shape:

Any set of 4 nodes {X_i, X_j, X_k, X_l} is categorized as a non-star with (X_i, X_j) forming one pair and (X_k, X_l) forming another pair if and only if:

q_k^(i,l) = q_k^(j,l) < q_k^(i,j).

From Equation (2), this is equivalent to the condition that Σ'_ik Σ'_jl = Σ'_il Σ'_jk and |Σ'_ik Σ'_jl| < |Σ'_ij Σ'_kl|.

Any set of 4 nodes is categorized as a star if and only if:

q_k^(i,l) = q_k^(j,l) = q_k^(i,j) for every node of the set playing the role of X_k.

This is equivalent to the condition that Σ'_ik Σ'_jl = Σ'_il Σ'_jk = Σ'_ij Σ'_kl.
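The covariance-product form of these conditions translates directly into code. The sketch below is a population-level version (exact covariances, a small tolerance for floating point); the finite-sample version in Table 1 replaces the tolerance with the covariance estimation accuracy. The function name and tolerance are ours.

```python
import numpy as np

def categorize(cov, i, j, k, l, tol=1e-9):
    """Categorize the 4 nodes {i, j, k, l} as a star or a non-star.

    cov is a (noisy) covariance matrix. Each product below picks up the same
    per-node factors (1 - 2*q), so the test is unaffected by the noise.
    """
    products = {
        ((i, j), (k, l)): cov[i, j] * cov[k, l],
        ((i, k), (j, l)): cov[i, k] * cov[j, l],
        ((i, l), (j, k)): cov[i, l] * cov[j, k],
    }
    # Rank the three candidate pairings by the magnitude of their product.
    ranked = sorted(products.items(), key=lambda kv: abs(kv[1]), reverse=True)
    (best_pairing, best), (_, second), (_, third) = ranked
    if abs(best) - abs(third) < tol:
        return "star", None              # all three products coincide
    if abs(second - third) < tol:
        return "non-star", best_pairing  # crossing products coincide and are smaller
    return "undetermined", None
```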

In order to see how these conditions correspond to a star/non-star shape, let us consider a chain on 4 nodes X_1 - X_2 - X_3 - X_4 as shown in Figure 2. Let each X_i be flipped with probability q_i. With access only to the noisy samples, we estimate the probability of error for each node in order to find the underlying tree. The key idea is that when we estimate the probability of error for a given node, it should be consistent across different conditional independence relations. For instance, in the present case the error estimates q_2^(1,3) and q_2^(1,4) of X_2 satisfy q_2^(1,3) = q_2^(1,4). We show that q_2^(3,4) > q_2^(1,3) (Lemma 2(b)). We also prove that q_1^(2,3) = q_1^(2,4) < q_1^(3,4) (Lemma 2). These imply that X_1 ⊥ X_3 | X_2 and X_1 ⊥ X_4 | X_2. By symmetry, we have X_2 ⊥ X_4 | X_3 and X_1 ⊥ X_4 | X_3. These conditional independence statements imply that X_2 and X_3 form a chain with X_1 on one side of the chain and X_4 on the other side of the chain.

Next, we consider the case when 4 nodes form a star structured graphical model as in Figure 3. Under the same noisy observation setting, we prove that for every node X_i the three error estimates obtained from the three possible pairings of the remaining nodes coincide, that is, q_i^(j,k) = q_i^(j,l) = q_i^(k,l) (Lemma 3). Thus, we can conclude that the underlying graphical model is star structured.

Figure 2: A chain structure.
Figure 3: A star structure.
Lemma 2.

Let the graphical model on X_1, X_2, X_3 and X_4 form a chain as shown in Figure 2. Suppose the bits of each X_i are flipped with probability q_i. Then the following holds: (a) q_2^(1,3) = q_2^(1,4) and q_1^(2,3) = q_1^(2,4); (b) q_2^(3,4) > q_2^(1,3) and q_1^(3,4) > q_1^(2,3).

Lemma 3.

Let the graphical model on X_1, X_2, X_3 and X_4 form a star as shown in Figure 3. Suppose the bits of each X_i are flipped with probability q_i. Then, for every node X_i, the three error estimates obtained from the three pairings of the remaining nodes are all equal.

The proof of these lemmas and the details of extending these results to generic trees require basic algebraic manipulations and can be found in Appendix D. ∎

Theorem 2.

Let P_Y denote the probability distribution of Y when the error probabilities of all the neighbors of leaf nodes of T* are non-zero. For any tree T' in T_eq(T*), there exists a set of random variables X' with conditional independence given by T' and a corresponding error probability vector q' such that P_{Y'} = P_Y, where P_{Y'} denotes the noisy distribution of X'.

We prove this theorem by explicit construction of X' and q'. We utilize Lemma 1 to enforce the conditional independence relations of any tree T' in T_eq(T*). The proof is included in Appendix E.

Interestingly, these unidentifiability results for noisy tree structured Ising models match the ones for noisy tree structured Gaussian graphical models proposed in [17], in spite of the two being graphical models on different classes of random variables. We remark that the algorithm they propose to learn the equivalence class in [17] is in the infinite sample domain and does not provide sample complexity results. In the next section we develop an efficient algorithm to learn the equivalence class using a number of samples scaling logarithmically in the number of nodes.

4 Algorithm

In this section, we provide a detailed description of our algorithm (Algorithm 1) that can recover the set T_eq(T*) from noisy samples. We emphasize that the edges we learn in this algorithm are from some tree in the equivalence class and not necessarily from T* (as we have demonstrated, identifying T* itself is not possible). We illustrate our algorithm (as well as each subroutine) using a toy example (Figure 4).

Algorithm 1: FindTree (learn the equivalence class of the tree)
1: procedure FindTree
2:   for each node in the graph do
3:     find its equivalence cluster using FindEC
4:     if the cluster contains more than one node then break
5:   add the corresponding edges to the output
6:   if such a cluster was found then call Recurse on the resulting leaf/internal-node pair

Algorithm 2: Recurse (recursively find all the edges within a subtree)
1: procedure Recurse(leaf node, internal node, excluded nodes)
2:   split the proximal set of the internal node into smaller subtrees using SplitTree
3:   for each smaller subtree do
4:     find the equivalence cluster attached to the internal node using FindEC
5:     add the corresponding edges and update the excluded nodes
6:     call Recurse on the smaller subtree

We first introduce three notions which we use extensively in the algorithm: (i) equivalence clusters, (ii) proximal sets, and (iii) categorizing a set of 4 nodes as star/non-star using finite samples.

Equivalence Cluster:

As illustrated in Figure 4(a), an equivalence cluster is defined as a set containing an internal node and all the leaf nodes connected to it. We denote the equivalence cluster containing a node X_i by EC(X_i).

Defining proximal sets:

Due to correlation decay, the correlation between an arbitrary pair of nodes can go down exponentially in the number of nodes, which would lead to exponential sample complexity. To avoid this, we only consider nodes with correlations above a constant threshold while making the star/non-star categorization. We define two proximal sets for each node X_i, denoted A_i and B_i, where the first is given by

A_i = { j : |Σ'_ij| ≥ (1 - 2q_max)^2 α^4 (1 - μ_max^2) }.

The threshold is set to this value, thereby guaranteeing that the set contains at least all the nodes within a radius of 4 of X_i (on the true unknown graph). This is because the minimum variance of any node is 1 - μ_max^2, the minimum correlation of any 2 nodes at distance 4 in the absence of noise is α^4, and the noise can decrease the covariance by a factor of at most (1 - 2q_max)^2.

Figure 4: Illustration of the algorithm.

It is possible that j is in A_i even if X_i and X_j are not neighbors in the true graph. We define the second set B_i to guarantee that, in this case, at least the first node on the path from X_i to X_j is in B_i. Using the correlation decay property, the noisy covariance expression Σ'_ij = (1 - 2q_i)(1 - 2q_j)Σ_ij, and the bounds μ_max, α and q_max, this set is given by

B_i = { j : |Σ'_ij| ≥ c (1 - 2q_max)^2 α^4 (1 - μ_max^2) },

where c = (1 - 2q_max)^2 sqrt(1 - μ_max^2) accounts for one additional variance term and one additional pair of noise factors. Using the sample covariance values, we construct sets Â_i and B̂_i that include at least all the nodes in A_i and B_i with high probability. We denote the sample covariance between nodes X_i and X_j obtained from the noisy observations by Σ̂_ij. The sets Â_i and B̂_i are obtained by replacing Σ'_ij with Σ̂_ij in the definitions above and lowering the corresponding thresholds by the covariance estimation accuracy.
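A sketch of the proximal-set construction from a sample covariance matrix. The thresholds follow the reconstruction given above and should be read as illustrative constants derived from Assumptions 1-3 rather than the exact values used in the paper; eps stands for the covariance estimation accuracy and the function name is ours.

```python
import numpy as np

def proximal_sets(sample_cov, alpha, mu_max, q_max, eps):
    """Build the two proximal sets of every node from noisy sample covariances."""
    n = sample_cov.shape[0]
    # Smallest possible noisy covariance between nodes within distance 4.
    t1 = (1 - 2 * q_max) ** 2 * alpha ** 4 * (1 - mu_max ** 2)
    # Smaller threshold for the second, larger set (see the discussion above).
    t2 = t1 * (1 - 2 * q_max) ** 2 * np.sqrt(1 - mu_max ** 2)
    A_hat, B_hat = [], []
    for i in range(n):
        A_hat.append({j for j in range(n)
                      if j != i and abs(sample_cov[i, j]) >= t1 - eps})
        B_hat.append({j for j in range(n)
                      if j != i and abs(sample_cov[i, j]) >= t2 - eps})
    return A_hat, B_hat
```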

Classifying into star/non-star shape:

It is easy to see (refer to Equation (16) in Appendix D.4) that when any set of 4 nodes {i, j, k, l} forms a non-star such that (i, j) form a pair, we have

Σ'_ik Σ'_jl = Σ'_il Σ'_jk and |Σ'_ik Σ'_jl| ≤ β^2 |Σ'_ij Σ'_kl|,

whereas all three products coincide when the 4 nodes form a star. Therefore, using the finite sample estimates of the noisy covariances, the categorization of 4 nodes into star/non-star such that (i, j) form a pair is given in Table 1.


We first describe the key steps of the algorithm using the example from Figure 4.

Table 1: Star/non-star categorization from the sample covariances. A set of 4 nodes {i, j, k, l} is categorized as a non-star with (i, j) and (k, l) as the pairs if Σ̂_ik Σ̂_jl and Σ̂_il Σ̂_jk agree up to the estimation accuracy and both are smaller in absolute value than Σ̂_ij Σ̂_kl by more than the estimation accuracy; it is categorized as a star if all three products agree up to the estimation accuracy.

Initialization:

The algorithm starts by learning the edges of some leaf node (edge (14, 15) in Figure 4). It searches for a leaf node by looking for an equivalence cluster with more than one node. This is achieved using the FindEC subroutine which, given a node, can find all the nodes in its equivalence cluster.

Recursion:

After finding a pair of a leaf and an internal node, the algorithm calls Recurse to find all the remaining edges recursively. This is done in 2 steps: (i) split the nodes in the proximal set of the internal node into the different subtrees connected to the internal node using SplitTree (subtree 1 = nodes 0-6, subtree 2 = nodes 7-13); (ii) find the nodes in each subtree which have an edge with the internal node. This is done via the simple observation that when we consider the internal node along with one such subtree, the internal node acts like a leaf node for the tree restricted to just these nodes (node 14 is a leaf node for the tree on nodes 1, 2, 3, 4, 5, 6, 14). Hence, FindEC can be used on this subset of nodes. Now we again have a pair of a leaf node and an internal node on a subset of nodes, and we call Recurse on just this subset of nodes.

The runtime of the algorithm is polynomial in the number of nodes.

Description of subroutines

SplitTree

This function takes in a pair of a leaf node and an internal node from a subtree. To ensure that we only consider nodes in the subtree, it also takes in a set containing the nodes assigned to other subtrees in the previous recursive call. It splits the nodes of the common proximal sets of the pair that lie in the subtree into smaller subtrees, each of which has an edge with the internal node. The property used is that 2 nodes lie in the same smaller subtree if and only if, together with the leaf and internal nodes, they form a non-star in which the 2 nodes form a pair. To make sure that no node outside the subtree is considered, we do not assign any node of the proximal sets which pairs with a node from the excluded set when considered along with the leaf and internal nodes. The number of star/non-star tests required is quadratic in the size of the proximal sets. A sketch of this grouping rule in code follows.
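The sketch below uses the covariance-product test for "non-star with (i, j) as a pair"; the function and argument names are ours, and the tolerance stands in for the finite-sample accuracy.

```python
def split_tree(cov, leaf, internal, candidates, excluded, tol=1e-9):
    """Group candidate nodes into the smaller subtrees hanging off the internal node."""
    def paired(i, j):
        # Non-star in which (i, j) form a pair: the two "crossing" products
        # coincide and are smaller in magnitude than the pairing product.
        p_pair = cov[i, j] * cov[leaf, internal]
        p_cross1 = cov[i, leaf] * cov[j, internal]
        p_cross2 = cov[i, internal] * cov[j, leaf]
        return (abs(p_cross1 - p_cross2) < tol and
                abs(p_pair) - abs(p_cross1) > tol)

    groups = []
    for node in candidates:
        if node in excluded or node in (leaf, internal):
            continue
        if any(paired(node, e) for e in excluded):
            continue  # the node belongs to a previously explored subtree
        for group in groups:
            if paired(node, group[0]):
                group.append(node)
                break
        else:
            groups.append([node])
    return groups
```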

FindEC

This function takes in a node X_i and a set of nodes as inputs and returns all the nodes of EC(X_i) present in that set. Two nodes are in the same equivalence cluster if every categorization of a set of 4 nodes containing them as star/non-star either results in a star or in a non-star where these two nodes form a pair. Essentially, for all the nodes in the proximal set of X_i, we verify whether they satisfy this condition. We also add the corresponding edges by arbitrarily choosing any node of the equivalence cluster as the center node, as we are only interested in recovering T_eq(T*). The number of star/non-star tests required is polynomial in the size of the proximal set. A sketch of the membership test follows.
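Again the names and the tolerance below are ours; this is a sketch of the membership test, not the paper's exact pseudo-code.

```python
def find_ec(cov, x, candidates, tol=1e-9):
    """Return the candidates that the 4-node tests place in the same
    equivalence cluster as x: for every pair (k, l) of other candidates,
    {x, j, k, l} must be a star or a non-star pairing x with j."""
    def products(a, b, c, d):
        return (cov[a, b] * cov[c, d],   # pairing (a, b) | (c, d)
                cov[a, c] * cov[b, d],
                cov[a, d] * cov[b, c])

    cluster = [x]
    for j in candidates:
        if j == x:
            continue
        others = [k for k in candidates if k not in (x, j)]
        consistent = True
        for idx, k in enumerate(others):
            for l in others[idx + 1:]:
                pair_xj, cross1, cross2 = products(x, j, k, l)
                is_star = (abs(abs(pair_xj) - abs(cross1)) < tol and
                           abs(abs(pair_xj) - abs(cross2)) < tol)
                pairs_xj = (abs(cross1 - cross2) < tol and
                            abs(pair_xj) - abs(cross1) > tol)
                if not (is_star or pairs_xj):
                    consistent = False
                    break
            if not consistent:
                break
        if consistent:
            cluster.append(j)
    return cluster
```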

Recurse

This function takes the same arguments as SplitTree and calls SplitTree with those arguments to obtain the different smaller subtrees. It then uses FindEC to find the edges between the internal node and these smaller subtrees. When a smaller subtree is considered along with the internal node, the internal node acts as its leaf node, and the node it has an edge with acts as the new internal node. For each of the smaller subtrees, it adds the nodes of the remaining smaller subtrees to the excluded set, along with the equivalence cluster of the current subtree that has an edge with the internal node, as those edges have already been learnt. It then calls Recurse on each of the smaller subtrees.

Need for 2 proximal sets A_i and B_i:

Consider the implementation of FindEC for a node X_i. For each node X_j in A_i, we check whether it lies in EC(X_i): for pairs of nodes from the proximal set, we check whether the star/pairing condition for X_j to be in EC(X_i) is satisfied. If X_j is not in EC(X_i) but no node on the path from X_i to X_j is in A_i, we would incorrectly conclude that X_j is in EC(X_i). To avoid this corner case, instead of taking the pairs from A_i, we take them from the larger set B_i. By the definition of B_i, it contains a node from the path from X_i to X_j. This results in a set of 4 nodes forming a non-star in which X_i and X_j do not form a pair, and hence FindEC correctly concludes that X_j is not in EC(X_i).

The set B_i plays a crucial role in SplitTree too. Consider a recursive call to Recurse such that, in the previous recursion, the nodes of the first proximal set of that step were added to the excluded set. However, in the current step there might be nodes of the current proximal sets which belong to other subtrees but are not in the excluded set. To ensure that any such node is not considered while creating smaller subtrees, we consider the nodes of B_i and make sure they don't pair with a node of the excluded set when considered with the present recursive call's leaf and internal nodes.

Setting the radius as 4:

The radius is set as 4 to make sure that each node is considered with at least its nearest neighboring nodes when classified as star/non-star. This would ideally require a radius of 3 but to account for the unidentifiability between a leaf node and its neighbor, we consider a radius of 4.

The detailed description of the complete algorithm, including the pseudo-code of these subroutines, proofs of correctness and the time complexity analysis, is presented in Appendix F.

Next, we present the sample complexity result of our algorithm.

Theorem 3.

Consider an Ising model on n nodes whose graph structure is the tree T* and which satisfies Assumptions 1 and 2. We get noisy samples from this Ising model where the samples from each node are flipped independently with unknown, possibly unequal probabilities satisfying Assumption 3. Algorithm 1 correctly recovers T_eq(T*) with probability at least 1 - δ provided the number of samples exceeds a threshold that scales logarithmically in the number of nodes and in 1/δ, and polynomially in the constants of Assumptions 1-3; the precise bound is derived in Appendix G.

The proof is included in Appendix G.

Running the algorithm in the absence of knowledge of the constants in Assumptions 1-3:

When we only have access to noisy samples and the constants of Assumptions 1-3 are unknown, we can plug in conservative bounds for these constants and solve for the number of samples given an error budget. This is a conservative estimate, and in practice the algorithm works with fewer samples.

5 Experiments

Experimental Setup

The probability distribution of the Ising model is represented by

P(x) proportional to exp( Σ_{i<j} W_ij x_i x_j + Σ_i θ_i x_i ),

where W is the symmetric weight matrix with 0 on the diagonal and θ is the bias vector. The support of W defines the Ising model structure. Therefore, a chain structured Ising model with nodes labeled consecutively satisfies W_ij ≠ 0 if and only if |i - j| = 1. A star structured Ising model with node 1 as the internal node satisfies W_ij ≠ 0 (for i ≠ j) if and only if i = 1 or j = 1.
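For the graph sizes used here, samples can be drawn exactly by enumerating the state space and then applying the independent bit flips. The sketch below does exactly that, with hypothetical weights and noise levels, and is only meant to illustrate the data-generating process described above.

```python
import itertools
import numpy as np

def sample_noisy_ising(W, theta, q, num_samples, seed=0):
    """Draw exact samples from a small Ising model P(x) proportional to
    exp(sum_{i<j} W_ij x_i x_j + theta' x), then flip each coordinate
    independently with probability q[i]."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    logits = 0.5 * np.einsum('si,ij,sj->s', states, W, states) + states @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    clean = states[rng.choice(len(states), size=num_samples, p=p)]
    flips = rng.random(clean.shape) < np.asarray(q)  # independent per node
    return np.where(flips, -clean, clean)

# Example: a chain on 5 nodes with hypothetical weights and error probabilities.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 0.5
theta = np.zeros(n)
q = np.full(n, 0.15)
noisy_samples = sample_noisy_ising(W, theta, q, num_samples=1000)
```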

Figure 5: Comparing the performance of Chow-Liu and our algorithm over 50 runs for an increasing number of samples.

We sample each non-zero element of W uniformly from [W_min, W_max]. The probability of error for each node is sampled uniformly from [0, q_max].

The comparison with the Chow-Liu algorithm is presented in subsection 5.1. In subsection 5.2, we demonstrate the performance of our algorithm as compared to the Sparsitron algorithm. Subsections 5.3, 5.4 and 5.5 report the impact of the maximum error probability q_max, the maximum edge weight W_max and the minimum edge weight W_min respectively on the algorithm's performance. All of these experiments are done for chain structured Ising models.

We next demonstrate the performance of our algorithm on star-structured Ising models in subsection 5.6 as the number of nodes increases. To begin with, we stick to a zero bias vector.

Finally, we study the impact of a non-zero bias θ (and hence a non-zero mean) on the algorithm's performance for both star-structured and chain-structured Ising models in subsection 5.7.

Figure 6: Comparing the performance of Sparsitron and our algorithm over 50 runs for an increasing number of samples.

5.1 Our Algorithm vs Chow-Liu

Figure 5 depicts that as the number of samples increases, the fraction of correct recoveries using our algorithm approaches 1. Chow-Liu incorrectly converges to a tree not in the equivalence class due to the unequal noise in the nodes.

5.2 Our Algorithm vs Sparsitron

Figure 7: The performance of our algorithm on a chain of 15 nodes over 50 runs for an increasing number of samples for different values of q_max.

In Figure 6, we compare the performance of our algorithm with the Sparsitron algorithm for a chain-structured Ising model on 15 nodes. To evaluate the Sparsitron algorithm, we take the output weight matrix and find the maximum weight spanning tree. We call the algorithm a success if this tree is from the equivalence class T_eq(T*). We can see that the Sparsitron algorithm has a low success rate.
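For reference, the spanning-tree step of this evaluation can be carried out with a simple Kruskal-style maximum-weight spanning tree over the absolute values of the learned weights; this is our own sketch of the evaluation step, not the Sparsitron algorithm itself.

```python
import numpy as np

def max_weight_spanning_tree(W_est):
    """Return the edge set of the maximum-weight spanning tree of |W_est|."""
    n = W_est.shape[0]
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    edges = sorted(((abs(W_est[i, j]), i, j)
                    for i in range(n) for j in range(i + 1, n)), reverse=True)
    tree = set()
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # adding the edge does not create a cycle
            parent[ri] = rj
            tree.add(frozenset((i, j)))
    return tree
```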

5.3 Effect of the Maximum Error Probability

We vary q_max over a range of values and evaluate on a chain structured Ising model on 15 nodes. Figure 7 illustrates the effect of q_max on the performance of our algorithm. As expected, an increase in the probability of error makes it harder to recover the equivalence class.

5.4 Effect of the Maximum Weight

Next, we look at the effect of W_max. We vary W_max over a range of values and evaluate on a chain structured Ising model on 15 nodes. Intuitively, a high W_max makes it difficult to differentiate between a star and a non-star, so the algorithm is expected to perform better for lower W_max. This is indeed what happens, as shown in Figure 8.

Figure 8: The performance of our algorithm on a chain of 15 nodes over 50 runs for an increasing number of samples for different values of W_max.

Figure 9: The performance of our algorithm on a chain of 15 nodes over 50 runs for an increasing number of samples for different values of W_min.

5.5 Effect of the Minimum Weight

We also look at the effect of W_min. We vary W_min over a range of values and evaluate on a chain structured Ising model on 15 nodes. We can expect that smaller edge weights make it difficult to accurately estimate the correlation values, resulting in higher errors. This is illustrated in Figure 9.

5.6 Algorithm performance for Star-structured tree Ising Model

So far, all the experiments have used chain structured Ising models. In this subsection, we consider the other extreme tree structure: a star with one internal node and the remaining leaf nodes connected to it. We consider three different graph sizes: 11 nodes, 13 nodes and 15 nodes. The performance of the algorithm is illustrated in Figure 10.

Figure 10: The performance of our algorithm on a star of 11, 13 and 15 nodes over 50 runs for an increasing number of samples.

As we can see, compared to the chain structured Ising models, star-structured Ising models require a much smaller number of samples. This is due to the small radius, which results in high absolute values of the correlations. The performance is also only slightly impacted by the number of nodes, which can likewise be attributed to the small radius.

5.7 Effect of bias on the algorithm performance.

We observe a very interesting phenomenon when we study the impact of the bias term θ. For chain structured Ising models, the algorithm performs better for lower absolute values of the entries of θ. However, for star structured Ising models, the algorithm performs better for higher absolute values of the entries of θ. For both cases we consider a tree on 11 nodes; the mean bound required by the thresholds is estimated empirically. We set all the entries of θ to be equal.

For the chain structured Ising model, we consider 3 cases: (i) all the entries of θ equal to 0.0, (ii) all the entries of θ equal to 0.02, and (iii) all the entries of θ equal to 0.04. The result is provided in Figure 11.

Figure 11: The performance of our algorithm on a chain of 11 nodes over 50 runs for an increasing number of samples for varying bias values.

Figure 12: The performance of our algorithm on a star of 11 nodes over 50 runs for an increasing number of samples for varying bias values.

For the star structured Ising model, we consider 3 cases: (i) all the entries of θ equal to 0.0, (ii) all the entries of θ equal to 0.2, and (iii) all the entries of θ equal to 0.4. The result is provided in Figure 12.

We now give an intuitive justification of this observation. A higher value of the bias results in a smaller threshold for the proximal sets. For chain structured Ising models, due to correlation decay, we can expect the correlation values of some pairs of random variables to be close to the threshold; estimating them accurately requires more samples.

Star structured Ising models, on the other hand, have higher correlation values due to their smaller diameter of 2. A low threshold ensures that no node is mistakenly left out when constructing the proximal sets. Therefore, the performance is better for higher bias values.

Appendix A Proof of Lemma 1

We prove this by induction on the number of nodes in the path between any 2 nodes X_i and X_j.

Base Case (path of 3 nodes):

The path is (X_i, X_k, X_j), therefore we have X_i ⊥ X_j | X_k. For random variables with a support size of 2, this is true if and only if they are conditionally uncorrelated, that is,

E[X_i X_j | X_k] = E[X_i | X_k] E[X_j | X_k].   (3)

E[X_i | X_k] is linear in X_k since the support size of X_k is 2, and therefore we only need to fit the 2 points X_k = -1 and X_k = +1 to completely represent the conditional expectation. Therefore, the linear least squares error (LLSE) estimator of X_i given X_k is also the minimum mean squared error estimator E[X_i | X_k]. Utilizing the standard result for the LLSE, we have:

E[X_i | X_k] = E[X_i] + (Σ_ik / Σ_kk)(X_k - E[X_k]).   (4)

Similarly we have:

E[X_j | X_k] = E[X_j] + (Σ_jk / Σ_kk)(X_k - E[X_k]).   (5)

Substituting E[X_i | X_k] and E[X_j | X_k] from Equations (4) and (5) in Equation (3) and taking expectations, we get:

E[X_i X_j] = E[X_i] E[X_j] + Σ_ik Σ_jk / Σ_kk.

Therefore we get Σ_ij = Σ_ik Σ_jk / Σ_kk, which implies ρ_ij = ρ_ik ρ_kj.

Inductive Case:

Let the statement be true for any path involving m nodes. For a path (X_i = X_{k_1}, X_{k_2}, ..., X_{k_{m+1}} = X_j) we have X_{k_1} ⊥ X_{k_{m+1}} | X_{k_m}. Therefore, the same calculation as in the base case holds true by replacing X_k by X_{k_m} and X_j by X_{k_{m+1}}, giving ρ_{k_1 k_{m+1}} = ρ_{k_1 k_m} ρ_{k_m k_{m+1}}. By the inductive assumption, ρ_{k_1 k_m} = ρ_{k_1 k_2} ... ρ_{k_{m-1} k_m}, therefore ρ_ij = ρ_{k_1 k_2} ... ρ_{k_m k_{m+1}}.

Appendix B Proof of Covariance of noisy variables.

Lemma 4.

Consider 2 random variables X_1 and X_2 with support on {-1, +1} whose covariance is denoted by Σ_12. Now consider the noisy versions of these random variables, Y_1 and Y_2, obtained by flipping X_1 and X_2 independently with probabilities q_1 and q_2, and denote their covariance by Σ'_12. Then we have:

Σ'_12 = (1 - 2q_1)(1 - 2q_2) Σ_12.
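A quick Monte Carlo check of this relation (not part of the proof), with hypothetical correlation and noise levels:

```python
import numpy as np

rng = np.random.default_rng(0)
num = 200_000
x1 = rng.choice([-1, 1], size=num)
# Make X2 agree with X1 with probability 0.8 so that the pair is correlated.
x2 = np.where(rng.random(num) < 0.8, x1, -x1)
q1, q2 = 0.1, 0.25
y1 = np.where(rng.random(num) < q1, -x1, x1)
y2 = np.where(rng.random(num) < q2, -x2, x2)

# Lemma 4 predicts the noisy covariance shrinks by (1 - 2*q1)*(1 - 2*q2).
print(np.cov(y1, y2)[0, 1], (1 - 2 * q1) * (1 - 2 * q2) * np.cov(x1, x2)[0, 1])
```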