# A Junction Tree Framework for Undirected Graphical Model Selection

An undirected graphical model is a joint probability distribution defined on an undirected graph G*, where the vertices in the graph index a collection of random variables and the edges encode conditional independence relationships among random variables. The undirected graphical model selection (UGMS) problem is to estimate the graph G* given observations drawn from the undirected graphical model. This paper proposes a framework for decomposing the UGMS problem into multiple subproblems over clusters and subsets of the separators in a junction tree. The junction tree is constructed using a graph that contains a superset of the edges in G*. We highlight three main properties of using junction trees for UGMS. First, different regularization parameters or different UGMS algorithms can be used to learn different parts of the graph. This is possible since the subproblems we identify can be solved independently of each other. Second, under certain conditions, a junction tree based UGMS algorithm can produce consistent results with fewer observations than the usual requirements of existing algorithms. Third, both our theoretical and experimental results show that the junction tree framework does a significantly better job at finding the weakest edges in a graph than existing methods. This property is a consequence of both the first and second properties. Finally, we note that our framework is independent of the choice of the UGMS algorithm and can be used as a wrapper around standard UGMS algorithms for more accurate graph estimation.



## 1 Introduction

An undirected graphical model is a joint probability distribution of a random vector defined on an undirected graph . The graph consists of a set of vertices and a set of edges . The vertices index the random variables in and the edges characterize conditional independence relationships among the random variables in (Lauritzen, 1996). We study undirected graphical models (also known as Markov random fields) so that the graph is undirected, i.e., if an edge , then . The undirected graphical model selection (UGMS) problem is to estimate given observations drawn from . This problem is of interest in many areas including biological data analysis, financial analysis, and social network analysis; see Koller and Friedman (2009) for some more examples.

This paper studies the following problem: Given the observations drawn from and a graph that contains all the true edges , and possibly some extra edges, estimate the graph .

A natural question is how the graph can be selected in the first place. One way of doing so is to use screening algorithms, such as in Fan and Lv (2008) or in Vats (to appear), to eliminate edges that are clearly non-existent in . Another approach is to use partial prior information about to remove unnecessary edges. For example, this could be based on (i) prior knowledge about statistical properties of genes when analyzing gene expressions, (ii) prior knowledge about companies when analyzing stock returns, or (iii) demographic information when modeling social networks. Yet another approach is to use model selection algorithms that deliberately estimate more edges than desired. Assuming an initial graph has been computed, our main contribution in this paper is to show how a junction tree representation of can be used as a wrapper around UGMS algorithms for more accurate graph estimation.

### 1.1 Overview of the Junction Tree Framework

A junction tree is a tree-structured representation of an arbitrary graph (Robertson and Seymour, 1986). The vertices in a junction tree are clusters of vertices from the original graph. An edge in a junction tree connects two clusters. Junction trees are used in many applications to reduce the computational complexity of solving graph-related problems (Arnborg and Proskurowski, 1989). Figure 1(c) shows an example of a junction tree for the graph in Figure 1(b). Notice that each edge in the junction tree is labeled by the set of vertices common to both clusters connected by the edge. This set of vertices is referred to as a separator.

Let be a graph that contains all the edges in . We show that the UGMS problem can be decomposed into multiple subproblems over clusters and subsets of the separators in a junction tree representation of . In particular, using the junction tree, we construct a region graph, which is a directed graph over clusters of vertices. An example of a region graph for the junction tree in Figure 1(c) is shown in Figure 1(d). The first two rows in the region graph are the clusters and separators of the junction tree, respectively. The remaining rows contain subsets of the separators (see Algorithm 1 for details on exactly how the region graph is constructed). The multiple subproblems we identify correspond to estimating a subset of edges over each cluster in the region graph. For example, the subproblem over the cluster in Figure 1(d) estimates the edges and .

We solve the subproblems over the region graph in an iterative manner. First, all subproblems in the first row of the region graph are solved in parallel. Second, the region graph is updated taking into account the edges removed in the first step. We keep solving subproblems over rows in the region graph and update the region graph until all the edges in the graph have been estimated.

As illustrated above, our framework depends on a junction tree representation of the graph that contains a superset of the true edges. Given any graph, there may exist several junction tree representations. An optimal junction tree is a junction tree representation such that the maximum size of the cluster is as small as possible. Since we apply UGMS algorithms to the clusters of the junction tree, and the complexity of UGMS depends on the number of vertices in the graph, it is useful to apply our framework using optimal junction trees. Unfortunately, it is computationally intractable to find optimal junction trees (Arnborg et al., 1987). However, there exist several computationally efficient greedy heuristics that compute close-to-optimal junction trees (Kjaerulff, 1990; Berry et al., 2003). We use such heuristics to find junction trees when implementing our algorithms in practice.
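As a rough illustration of what such a greedy heuristic can look like, the sketch below uses min-degree vertex elimination to obtain clusters and then joins them with a maximum-weight spanning tree, where the weight of a candidate edge is the size of the shared separator. This is one common heuristic of this kind and is not necessarily the one used in the experiments of this paper. The function names `min_degree_cliques` and `junction_tree` are hypothetical.

```python
from itertools import combinations


def min_degree_cliques(vertices, edges):
    """Eliminate vertices in greedy min-degree order and return the maximal cliques."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    cliques = []
    while adj:
        # Pick the vertex with the fewest remaining neighbors (ties broken by label).
        v = min(adj, key=lambda x: (len(adj[x]), x))
        nbrs = adj.pop(v)
        cliques.append(frozenset(nbrs | {v}))
        # Add fill-in edges so that the neighbors of v form a clique.
        for a, b in combinations(nbrs, 2):
            adj[a].add(b)
            adj[b].add(a)
        for n in nbrs:
            adj[n].discard(v)

    # Keep only cliques that are not strictly contained in another clique.
    return [c for c in cliques if not any(c < d for d in cliques)]


def junction_tree(cliques):
    """Join cliques by a maximum-weight spanning tree (weight = separator size)."""
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    candidates = sorted(
        combinations(range(len(cliques)), 2),
        key=lambda e: len(cliques[e[0]] & cliques[e[1]]),
        reverse=True,
    )
    tree = []
    for i, j in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, cliques[i] & cliques[j]))
    return tree


# Hypothetical example: a cycle over four vertices.
cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
cliques = min_degree_cliques([1, 2, 3, 4], cycle)
tree = junction_tree(cliques)
```

On this four-cycle the heuristic produces two clusters of size three joined by a single separator of size two. On larger graphs the maximum cluster size depends on the elimination order, which is why different heuristics can give different junction trees.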

### 1.2 Advantages of Using Junction Trees

We highlight three main advantages of the junction tree framework for UGMS.

Choosing Regularization Parameters and UGMS Algorithms: UGMS algorithms typically depend on a regularization parameter that controls the number of estimated edges. This regularization parameter is usually chosen using model selection algorithms such as cross-validation or stability selection. Since each subproblem we identify in the region graph is solved independently, different regularization parameters can be used to learn different parts of the graph. This has advantages when the true graph has different characteristics in different parts of the graph. Further, since the subproblems are independent, different UGMS algorithms can be used to learn different parts of the graph. Our numerical simulations clearly show the advantages of this property.

Reduced Sample Complexity: One of the key results of our work is to show that in many cases, the junction tree framework is capable of consistently estimating a graph under weaker conditions than required by previously proposed methods. For example, we show that if consists of two main components that are separated by a relatively small number of vertices (see Figure 2 for a general example), then, under certain conditions, the number of observations needed for consistent estimation scales like , where is the number of vertices in the smaller of the two components. In contrast, existing methods are known to be consistent if the observations scale like , where is the total number of vertices. If the smaller component were, for example, exponentially smaller than the larger component, then the junction tree framework is consistent with about observations. For generic problems, without structure that can be exploited by the junction tree framework, we recover the standard conditions for consistency.

Learning Weak Edges: A direct consequence of choosing different regularization parameters and the reduced sample complexity is that certain weak edges, not estimated using standard algorithms, may be estimated when using the junction tree framework. We show this theoretically and using numerical simulations on both synthetic and real world data.

### 1.3 Related Work

Several algorithms have been proposed in the literature for learning undirected graphical models. Examples include Spirtes and Glymour (1991); Kalisch and Bühlmann (2007); Banerjee et al. (2008); Friedman et al. (2008); Meinshausen and Bühlmann (2006); Anandkumar et al. (2012a); Cai et al. (2011) for learning Gaussian graphical models, Liu et al. (2009); Xue and Zou (2012); Liu et al. (2012a); Lafferty et al. (2012); Liu et al. (2012b) for learning non-Gaussian graphical models, and Bresler et al. (2008); Bromberg et al. (2009); Ravikumar et al. (2010); Netrapalli et al. (2010); Anandkumar et al. (2012b); Jalali et al. (2011); Johnson et al. (2012); Yang et al. (2012) for learning discrete graphical models. Although all of the above algorithms can be modified to take into account prior knowledge about a graph that contains all the true edges (see Appendix B for some examples), our junction tree framework is fundamentally different from the standard modifications of these algorithms. The main difference is that the junction tree framework allows for using the global Markov property of undirected graphical models (see Definition 2.2) when learning graphs. This allows for improved graph estimation, as illustrated by both our theoretical and numerical results. We note that all of the above algorithms can be used in conjunction with the junction tree framework.

Junction trees have been used for performing exact probabilistic inference in graphical models (Lauritzen and Spiegelhalter, 1988). In particular, given a graphical model and its junction tree representation, the computational complexity of exact inference is exponential in the size of the largest cluster in the junction tree. This has motivated a line of research on learning thin junction trees, in which the maximum cluster size in the estimated junction tree is small so that inference is computationally tractable (Chow and Liu, 1968; Bach and Jordan, 2001; Karger and Srebro, 2001; Chechetka and Guestrin, 2007; Kumar and Bach, 2013). We also make note of algorithms for learning decomposable graphical models where the graph structure is assumed to be triangulated (Malvestuto, 1991; Giudici and Green, 1999). In general, the goal in the above algorithms is to learn a joint probability distribution that approximates a more complex probability distribution so that computations, such as inference, can be done in a tractable manner. On the other hand, this paper considers the problem of learning the structure of the graph that best represents the conditional dependencies among the random variables under consideration.

There are two notable algorithms in the literature that use junction trees for learning graphical models. The first is an algorithm presented in Xie and Geng (2008) that uses junction trees to find the direction of edges for learning directed graphical models. Unfortunately, this algorithm cannot be used for UGMS. The second is an algorithm presented in Ma et al. (2008) for learning chain graphs, which are graphs with both directed and undirected edges. The algorithm in Ma et al. (2008) uses a junction tree representation to learn an undirected graph before orienting some of the edges to learn a chain graph. Our proposed algorithm, and subsequent analysis, differs from the work in Ma et al. (2008) in the following ways:

1. Our algorithm identifies an ordering on the edges, which subsequently results in a lower sample complexity and the possibility of learning weak edges in a graph. The ordering on the edges is possible because of our novel region graph interpretation for learning graphical models. For example, when learning the graph in Figure 1(a) using Figure 1(b), the algorithm in Ma et al. (2008) learns the edge by applying a UGMS algorithm to the vertices . In contrast, our proposed algorithm first estimates all edges in the second layer of the region graph in Figure 1(d), re-estimates the region graph, and then only applies a UGMS algorithm to to determine if the edge belongs to the graph. In this way, our algorithm, in general, requires applying a UGMS algorithm to a smaller number of vertices when learning edges over separators in a junction tree representation.

2. Our algorithm for using junction trees for UGMS is independent of the choice of the UGMS algorithm, while the algorithm presented in Ma et al. (2008) uses conditional independence tests for UGMS.

3. Our algorithm, as discussed in item 1, has the additional advantage of learning certain weak edges that may not be estimated when using standard UGMS algorithms. We theoretically quantify this property of our algorithm, while no such theory was presented in Ma et al. (2008).

Recent work has shown that solutions to the graphical lasso (gLasso) (Friedman et al., 2008) problem for UGMS over Gaussian graphical models can be computed, under certain conditions, by decomposing the problem over connected components of the graph computed by thresholding the empirical covariance matrix (Witten et al., 2011; Mazumder and Hastie, 2012). The methods in Witten et al. (2011); Mazumder and Hastie (2012) are useful for computing solutions to gLasso for particular choices of the regularization parameter and not for accurately estimating graphs. Thus, when using gLasso for UGMS, we can use the methods in Witten et al. (2011); Mazumder and Hastie (2012) to solve gLasso when performing model selection for choosing suitable regularization parameters. Finally, we note that recent work in Loh and Wainwright (2012) uses properties of junction trees to learn discrete graphical models. The algorithm in Loh and Wainwright (2012) is designed for learning discrete graphical models and our methods can be used to improve its performance.

### 1.4 Paper Organization

The rest of the paper is organized as follows:

• Section 2 reviews graphical models and formulates the undirected graphical model selection (UGMS) problem.

• Section 3 shows how junction trees can be represented as region graphs and outlines an algorithm for constructing a region graph from a junction tree.

• Section 4 shows how the region graphs can be used to apply a UGMS algorithm to the clusters and separators of a junction tree.

• Section 5 presents our main framework for using junction trees for UGMS. In particular, we show how the methods in Sections 3-4 can be used iteratively to estimate a graph.

• Section 6 reviews the PC-Algorithm, which we use to study the theoretical properties of the junction tree framework.

• Section 7 presents theoretical results on the sample complexity of learning graphical models using the junction tree framework. We also highlight advantages of using the junction tree framework as summarized in Section 1.2.

• Section 8 presents numerical simulations to highlight the advantages of using junction trees for UGMS in practice.

• Section 9 summarizes the paper and outlines some future work.

## 2 Preliminaries

In this section, we review some necessary background on graphs and graphical models that we use in this paper. Section 2.1 reviews some graph theoretic concepts. Section 2.2 reviews undirected graphical models. Section 2.3 formally defines the undirected graphical model selection (UGMS) problem. Section 2.4 reviews junction trees, which we use as a tool for decomposing UGMS into multiple subproblems.

### 2.1 Graph Theoretic Concepts

A graph is a tuple , where is a set of vertices and are edges connecting vertices in . For any graph , we use the notation to denote its edges. We only consider undirected graphs where if , then for . Some graph theoretic notations that we use in this paper are summarized as follows:

• Neighbor : Set of nodes connected to .

• Path : A sequence of nodes such that for .

• Separator : A set of nodes such that all paths from to contain at least one node in . The separator is minimal if no proper subset of separates and .

• Induced Subgraph : A graph over the nodes such that contains the edges only involving the nodes in .

• Complete graph : A graph that contains all possible edges over the nodes .

For two graphs and , we define the following standard operations:

• Graph Union: .

• Graph Difference: .
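The graph notions listed above can be sketched with plain Python sets. This is a minimal illustration, assuming an undirected graph is stored as a set of unordered vertex pairs. The helper names (`edge`, `induced_subgraph`, `complete_graph`, `separates`) are hypothetical and not part of the paper.

```python
from collections import deque
from itertools import combinations


def edge(u, v):
    """Represent an undirected edge as an unordered pair."""
    return frozenset((u, v))


def induced_subgraph(edges, nodes):
    """Edges of the graph whose endpoints both lie in `nodes`."""
    nodes = set(nodes)
    return {e for e in edges if e <= nodes}


def complete_graph(nodes):
    """All possible edges over `nodes`."""
    return {edge(u, v) for u, v in combinations(nodes, 2)}


def separates(edges, sep, a, b):
    """True if every path from `a` to `b` contains at least one node in `sep`.

    The sets `a`, `b`, and `sep` are assumed to be pairwise disjoint.
    """
    sep, a, b = set(sep), set(a), set(b)
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Breadth-first search from `a` that is never allowed to enter `sep`.
    seen = set(a - sep)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen and w not in sep:
                seen.add(w)
                queue.append(w)
    return not (seen & (b - sep))


# Hypothetical example: two overlapping pieces of the path 1 - 2 - 3 - 4.
g1 = {edge(1, 2), edge(2, 3)}
g2 = {edge(2, 3), edge(3, 4)}
union = g1 | g2        # graph union: union of the edge sets
difference = g1 - g2   # graph difference: edges of g1 that are not in g2
```

With edge sets as the representation, graph union and graph difference reduce to the built-in set operators, which keeps the later region graph computations short.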

### 2.2 Undirected Graphical Models

[Undirected Graphical Model (Lauritzen, 1996)] An undirected graphical model is a probability distribution defined on a graph , where indexes the random vector and the edges encode the following Markov property: for a set of nodes , , and , if separates and , then . The Markov property outlined above is referred to as the global Markov property. Undirected graphical models are also referred to as Markov random fields or Markov networks in the literature. When the joint probability distribution is non-degenerate, i.e., , the Markov property in Definition 2.2 is equivalent to the pairwise and local Markov properties:

• Pairwise Markov property: For all , .

• Local Markov property: For all , .

In this paper, we always assume and say is Markov on to reflect the Markov properties. Examples of conditional independence relations conveyed by a probability distribution defined on the graph in Figure 3(d) are and .

### 2.3 Undirected Graphical Model Selection (UGMS)

[UGMS] The undirected graphical model selection (UGMS) problem is to estimate a graph such that the joint probability distribution is Markov on , but not Markov on any subgraph of . The last statement in Definition 2.3 is important, since, if is Markov on , then it is also Markov on any graph that contains . For example, all probability distributions are Markov on the complete graph. Thus, the UGMS problem is to find the minimal graph that captures the Markov properties associated with a joint probability distribution. In the literature, this is also known as finding the minimal I-map.

Let be an abstract UGMS algorithm that takes as inputs a set of i.i.d. observations drawn from and a regularization parameter . The output of is a graph , where controls the number of edges estimated in . Note the dependence of the regularization parameter on . We assume is consistent, which is formalized in the following assumption.

There exists a for which as , where .

We give examples of in Appendix B. Assumption 2.3 also takes into account the high-dimensional case where depends on in such a way that .

### 2.4 Junction Trees

Junction trees (Robertson and Seymour, 1986) are used extensively for efficiently solving various graph-related problems; see Arnborg and Proskurowski (1989) for some examples. Lauritzen and Spiegelhalter (1988) show how junction trees can be used for exact inference (computing marginal distributions given a joint distribution) over graphical models. We use junction trees as a tool for decomposing the UGMS problem into multiple subproblems.

[Junction tree] For an undirected graph , a junction tree is a tree-structured graph over clusters of nodes in such that

1. Each node in is associated with at least one cluster in .

2. For every edge , there exists a cluster such that .

3. satisfies the running intersection property: For all clusters , , and such that separates and in the tree defined by , .

The first property in Definition 2.4 says that all nodes must be mapped to at least one cluster of the junction tree. The second property states that each edge of the original graph must be contained within a cluster. The third property, known as the running intersection property, is the most important since it restricts the clusters and the trees that can be formed. For example, consider the graph in Figure 3(a). By simply clustering the nodes over edges, as done in Figure 3(b), we cannot get a valid junction tree (Wainwright, 2002). By making appropriate clusters of size three, we get a valid junction tree in Figure 3(c). In other words, the running intersection property says that for two clusters with a common node, all the clusters on the path between the two clusters must contain that common node.
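The three properties can be checked mechanically. The sketch below is a minimal illustration, assuming the clusters are given as a list and `tree_edges` already forms a tree over the cluster indices. The function name `is_junction_tree` is hypothetical, and the four-cycle example is chosen for illustration and may differ from the graph in Figure 3(a). The running intersection property is tested in its equivalent form: for every vertex, the clusters containing it must form a connected subtree.

```python
def is_junction_tree(vertices, edges, clusters, tree_edges):
    """Check the three junction tree properties for clusters joined by tree_edges."""
    clusters = [set(c) for c in clusters]

    # Property 1: every vertex belongs to at least one cluster.
    if any(not any(v in c for c in clusters) for v in vertices):
        return False

    # Property 2: every edge of the graph lies inside some cluster.
    if any(not any({u, v} <= c for c in clusters) for u, v in edges):
        return False

    # Property 3 (running intersection): for each vertex, the clusters that
    # contain it must form a connected subtree.
    nbrs = {i: set() for i in range(len(clusters))}
    for i, j in tree_edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    for v in vertices:
        holders = {i for i, c in enumerate(clusters) if v in c}
        start = next(iter(holders))
        reached, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in nbrs[i] & holders:
                if j not in reached:
                    reached.add(j)
                    stack.append(j)
        if reached != holders:
            return False
    return True


# Hypothetical example: a cycle over four vertices.
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (3, 4), (4, 1)]

# Clustering over edges violates the running intersection property:
# vertex 1 is in the first and last cluster but not in the ones between them.
by_edges = is_junction_tree(V, E, [{1, 2}, {2, 3}, {3, 4}, {4, 1}],
                            [(0, 1), (1, 2), (2, 3)])

# Two clusters of size three give a valid junction tree.
by_triples = is_junction_tree(V, E, [{1, 2, 3}, {1, 3, 4}], [(0, 1)])
```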

[(Robertson and Seymour, 1986)] Let be a junction tree of the graph . Let . For each , we have the following properties:

1. .

2. separates and .

The set of nodes on the edges are called the separators of the junction tree. Proposition 3 says that all clusters connected by an edge in the junction tree have at least one common node and the common nodes separate nodes in each cluster. For example, consider the junction tree in Figure 3(e) of the graph in Figure 3(d). We can infer that and are separated by and . Similarly, we can also infer that and are separated by , , and . It is clear that if a graphical model is defined on the graph, then the separators can be used to easily define conditional independence relationships. For example, using Figure 3(e), we can conclude that given and . As we will see in later sections, Proposition 3 allows the decomposition of UGMS into multiple subproblems over clusters and subsets of the separators in a junction tree.

## 3 Overview of Region Graphs

In this section, we show how junction trees can be represented as region graphs. As we will see in Section 5, region graphs allow us to easily decompose the UGMS problem into multiple subproblems. There are many different types of region graphs and we refer the readers to Yedidia et al. (2005) for a comprehensive discussion about region graphs and how they are useful for characterizing graphical models. The region graph we present in this section differs slightly from the standard definition of region graphs. This is mainly because our goal is to estimate edges, while the standard region graphs defined in the literature are used for computations over graphical models.

A region is a collection of nodes, which in this paper can be the clusters of the junction tree, separators of the junction tree, or subsets of the separators. A region graph is a directed graph where the vertices are regions and the edges represent directed edges from one region to another. We use the notation to emphasize that region graphs contain directed edges. A description of region graphs is given as follows:

• The set contains directed edges so that if , then there exists a directed edge from region to region .

• Whenever , then .

Algorithm 1 outlines an algorithm to construct region graphs given a junction tree representation of a graph . We associate a label with every region in and group regions with the same label to partition into groups . In Algorithm 1, we initialize and to be the clusters and separators of a junction tree , respectively, and then iteratively find by computing all possible intersections of regions with the same label. The edges in are only drawn from a region in to a region in . Figure 4(c) shows an example of a region graph computed using the junction tree in Figure 4(b).
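The construction described above can be sketched as follows. Algorithm 1 itself is not reproduced in this text, so the sketch rests on stated assumptions: each new row collects the pairwise intersections of distinct regions in the previous row, only intersections with at least two vertices are kept (a region needs two vertices to carry an edge), and a directed edge is drawn from a region to every proper subset of it in the next row. The five clusters in the example are the ones named in the examples of Section 4. The tree edges joining them and the function name `build_region_graph` are hypothetical.

```python
from itertools import combinations


def build_region_graph(clusters, tree_edges):
    """Return (rows, edges): rows[0] holds clusters, rows[1] separators, and so on."""
    clusters = [frozenset(c) for c in clusters]
    rows = [set(clusters)]
    # Separators are the intersections of clusters joined by a junction tree edge.
    rows.append({clusters[i] & clusters[j] for i, j in tree_edges})

    # Assumption: each further row collects pairwise intersections of the
    # previous row that still contain at least two vertices.
    while True:
        nxt = {r & s for r, s in combinations(rows[-1], 2) if len(r & s) >= 2}
        if not nxt:
            break
        rows.append(nxt)

    # Directed edges go from a region in one row to each proper subset of it
    # in the next row.
    edges = {(r, s)
             for up, down in zip(rows, rows[1:])
             for r in up for s in down if s < r}
    return rows, edges


# Clusters named in the examples of Section 4; the tree edges are an assumption.
clusters = [{2, 3, 5, 6}, {2, 3, 4, 6}, {3, 4, 6, 7}, {3, 5, 6, 8}, {5, 6, 8, 9}]
tree_edges = [(0, 1), (1, 2), (0, 3), (3, 4)]
rows, edges = build_region_graph(clusters, tree_edges)
```

Under these assumptions the second row contains the four separators and the third row contains the regions {3, 6} and {5, 6}. The loop terminates because the intersection of two distinct regions is strictly smaller than the larger of the two.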

###### Remark

Note that the construction of the region graph depends on the junction tree. Using methods in Vats and Moura (2012), we can always construct junction trees such that the region graph only has two sets of regions, namely the clusters of the junction tree and the separators of the junction tree. However, in this case, the size of the regions or clusters may be too large. This may not be desirable since the computational complexity of applying UGMS algorithms to region graphs, as shown in Section 5, depends on the size of the regions.

###### Remark (Region graph vs. Junction tree)

For every junction tree, Algorithm 1 outputs a unique region graph. The junction tree only characterizes the relationship between the clusters in a junction tree. A region graph extends the junction tree representation to characterize the relationships between the clusters as well as the separators. For example, in Figure 4(c), the region is in the third row and is a subset of two separators of the junction tree. Thus, the only difference between the region graph and the junction tree is the additional set of regions introduced in .

###### Remark

From the construction in Algorithm 1, may have two or more regions that are the same but have different labels. For example, in Figure 4(c), the region is in both and . We can avoid this situation by removing from and adding an edge from the region in to the region in . For notational simplicity and for the purpose of illustration, we allow for duplicate regions. This does not change the theory or the algorithms that we develop.

## 4 Applying UGMS to Region Graphs

Before presenting our framework for decomposing UGMS into multiple subproblems, we first show how UGMS algorithms can be applied to estimate a subset of edges in a region of a region graph. In particular, for a region graph , we want to identify a set of edges in the induced subgraph that can be estimated by applying a UGMS algorithm to either or a set of vertices that contains . With this goal in mind, define the children of a region as follows:

$$\text{Children:}\quad ch(R) = \{S : (R, S) \in \vec{E}\}. \tag{1}$$

We say connects to if . Thus, the children in (1) consist of all regions that connects to. For example, in Figure 4(c),

$$ch(\{2,3,4,6\}) = \{\{2,3,6\}, \{3,4,6\}\}.$$

If there exists a directed path from to , we say is an ancestor of . The set of all ancestors of is denoted by . For example, in Figure 4(c),

$$\begin{aligned}
an(\{5,6,8,9\}) &= \emptyset, \\
an(\{3,5,6\}) &= \{\{3,5,6,8\}, \{2,3,5,6\}\}, \text{ and} \\
an(\{3,6\}) &= \{\{3,5,6\}, \{2,3,6\}, \{3,4,6\}, \{2,3,5,6\}, \{2,3,4,6\}, \{3,4,6,7\}, \{3,5,6,8\}\}.
\end{aligned}$$

The notation takes the union of all regions in and so that

$$\overline{R} = \bigcup_{S \in \{an(R), R\}} S. \tag{2}$$

Thus, contains the union of all clusters in the junction tree that contain . An illustration of some of the notations defined on region graphs is shown in Figure 5. Using , define the subgraph as follows, where graph difference and graph union are the operations defined in Section 2.1:

$$H'_R = H[R] \setminus \Big\{ \bigcup_{S \in ch(R)} K_S \Big\}, \tag{3}$$

where is the induced subgraph that contains all edges in over the region and is the complete graph over . In words, is computed by removing all edges from that are contained in another separator. For example, in Figure 4(c), when , . The subgraph is important since it identifies the edges that can be estimated when applying a UGMS algorithm to the set of vertices .

Suppose . All edges in can be estimated by solving a UGMS problem over the vertices . See Appendix C.

Proposition 4 says that all edges in can be estimated by applying a UGMS algorithm to the set of vertices . The intuition behind the result is that only those edges in the region can be estimated whose Markov properties can be deduced using the vertices in . Moreover, the edges not estimated in share an edge with another region that does not contain all the vertices in . Algorithm 2 summarizes the steps involved in estimating the edges in using the UGMS algorithm defined in Section 2.3. Some examples of how to use Algorithm 2 to estimate some edges of the graph in Figure 4(a) using the region graph in Figure 4(c) are described as follows.

1. Let . This region only connects to . This means that all edges, except the edge in , can be estimated by applying to .

2. Let . The children of this region are , , and . This means that , i.e., no edge over can be estimated by applying to .

3. Let . This region only connects to . Thus, all edges except can be estimated. The regions and connect to , so needs to be applied to .
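The quantities used in this section, namely the children in (1), the ancestors, the union in (2), and the edge set of the subgraph in (3), can be computed directly from the directed edges of a region graph. The sketch below hard-codes the edges implied by the worked examples for Figure 4(c) given earlier in this section. It covers only the regions named in those examples and is therefore a partial reconstruction, not the full figure. The function names and the choice of a complete graph for `H` over {2, 3, 4, 6} are hypothetical.

```python
from itertools import combinations

fs = frozenset

# Directed region graph edges (parent, child) implied by the ch(.) and an(.)
# examples in this section; regions not named there are omitted.
EDGES = {
    (fs({2, 3, 5, 6}), fs({2, 3, 6})), (fs({2, 3, 5, 6}), fs({3, 5, 6})),
    (fs({2, 3, 4, 6}), fs({2, 3, 6})), (fs({2, 3, 4, 6}), fs({3, 4, 6})),
    (fs({3, 4, 6, 7}), fs({3, 4, 6})), (fs({3, 5, 6, 8}), fs({3, 5, 6})),
    (fs({2, 3, 6}), fs({3, 6})), (fs({3, 4, 6}), fs({3, 6})),
    (fs({3, 5, 6}), fs({3, 6})),
}


def children(region, edges=EDGES):
    """Equation (1): the regions that `region` connects to."""
    return {s for r, s in edges if r == region}


def ancestors(region, edges=EDGES):
    """All regions that have a directed path to `region`."""
    found, frontier = set(), {region}
    while frontier:
        frontier = {r for r, s in edges if s in frontier} - found
        found |= frontier
    return found


def closure(region, edges=EDGES):
    """Equation (2): union of `region` and all of its ancestors."""
    return fs().union(region, *ancestors(region, edges))


def estimable_edges(h_edges, region, edges=EDGES):
    """Equation (3): edges of H over `region` not covered by any child region."""
    induced = {e for e in h_edges if e <= region}
    covered = {fs(p) for s in children(region, edges) for p in combinations(s, 2)}
    return induced - covered


# Hypothetical H: the complete graph over the region {2, 3, 4, 6}.
R = fs({2, 3, 4, 6})
H = {fs(p) for p in combinations(R, 2)}
remaining = estimable_edges(H, R)
```

For the region {3, 6}, `closure` returns {2, 3, 4, 5, 6, 7, 8}, which is the union of the seven ancestors listed above together with the region itself. For the hypothetical `H`, the only edge over {2, 3, 4, 6} that is not contained in one of its two children is the edge between vertices 2 and 4.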

## 5 UGMS Using Junction Trees: A General Framework

In this section, we present the junction tree framework for UGMS using the results from Sections 3-4. Section 5.1 presents the junction tree framework. Section 5.2 discusses the computational complexity of the framework. Section 5.3 highlights the advantages of using junction trees for UGMS using some examples. We refer to Table 1 for a summary of all the notations that we use in this section.

### 5.1 Description of Framework

Recall that Algorithm 2 shows that to estimate a subset of edges in , where is a region in the region graph , the UGMS algorithm in Assumption 2.3 needs to be applied to the set defined in (2). Given this result, a straightforward approach to decomposing the UGMS problem is to apply Algorithm 2 to each region and combine all the estimated edges. This will work since for any such that , . This means that each application of Algorithm 2 estimates a different set of edges in the graph. However, for some edges, this may require applying a UGMS algorithm to a large set of nodes. For example, in Figure 4(c), when applying Algorithm 2 to , the UGMS algorithm needs to be applied to , which is almost the full set of vertices. To reduce the problem size of the subproblems, we apply Algorithms 1 and 2 in an iterative manner as outlined in Algorithm 3.

Figure 6 shows a high-level description of Algorithm 3. We first find a junction tree and then a region graph of the graph using Algorithm 1. We then find the row in the region graph over which edges can be estimated and apply Algorithm 2 to each region in that row. We note that when estimating edges over a region, we use model selection algorithms to choose an appropriate regularization parameter to select the number of edges to estimate. Next, all estimated edges are added to and all edges that are estimated are removed from . Thus, now represents all the edges that are left to be estimated and contains all the edges in . We repeat the above steps on a new region graph computed using and stop the algorithm when is an empty graph.

An example illustrating the junction tree framework is shown in Figure 7. The region graph in Figure 7(b) is constructed using the graph in Figure 7(a). The true graph we want to estimate is shown in Figure 1(a). The top and bottom in Figure 7(c) show the graphs and , respectively, after estimating all the edges in of Figure 7(b). The edges in are represented by double lines to distinguish them from the edges in . Figure 7(d) shows the region graph of . Figure 7(e) shows the updated and where only the edges and are left to be estimated. This is done by applying Algorithm 2 to the regions in of Figure 7(f). Notice that we did not include the region in the last region graph since we know all edges in this region have already been estimated. In general, if for any region , we can remove this region and thereby reduce the computational complexity of constructing region graphs.

### 5.2 Computational Complexity

In this section, we discuss the computational complexity of the junction tree framework. It is difficult to write down a closed-form expression since the computational complexity depends on the structure of the junction tree. Moreover, the amount of computation can easily be controlled by merging clusters in the junction tree. With this in mind, the main aim in this section is to show that the complexity of the framework is roughly the same as that of applying a standard UGMS algorithm. Consider the following observations.

1. Computing : Assuming no prior knowledge about is given, this graph needs to be computed from the observations. This can be done using standard screening algorithms, such as those in Fan and Lv (2008); Vats (to appear), or by applying a UGMS algorithm with a regularization parameter that selects a larger number of edges (than that computed by using a standard UGMS algorithm). Thus, the complexity of computing is roughly the same as that of applying a UGMS algorithm to all the vertices in the graph.

2. Applying UGMS to regions: Recall from Algorithm 2 that we apply a UGMS algorithm to observations over to estimate edges over the vertices , where is a region in a region graph representation of . Since , it is clear that the complexity of Algorithm 2 is less than that of applying a UGMS algorithm to estimate all edges in the graph.

3. Computing junction trees: For a given graph, there exist several junction tree representations. The computational complexity of applying UGMS algorithms to a junction tree depends on the size of the clusters, the size of the separators, and the degree of the junction tree. In theory, it is useful to select a junction tree so that the overall computational complexity of the framework is as small as possible. However, this is hard since there can be an exponential number of possible junction tree representations. Alternatively, we can select a junction tree so that the maximum size of the clusters is as small as possible. Such junction trees are often referred to as optimal junction trees in the literature. Although finding optimal junction trees is also hard (Arnborg et al., 1987), there exist several computationally tractable heuristics for finding close to optimal junction trees (Kjaerulff, 1990; Berry et al., 2003). The complexity of such algorithms ranges from to , depending on the degree of approximation. We note that this time complexity is less than that of standard UGMS algorithms.

It is clear that the complexity of all the intermediate steps in the framework is less than that of applying a standard UGMS algorithm. The overall complexity of the framework depends on the number of clusters in the junction tree and the size of the separators in the junction tree. The size of the separators in a junction tree can be controlled by merging clusters that share a large separator. This step can be done in linear time. Removing large separators also reduces the total number of clusters in a junction tree. In the worst case, if all the separators in are too large, the junction tree will only have one cluster that contains all the vertices. In this case, using the junction tree framework will be no different than using a standard UGMS algorithm.
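The cluster-merging step described above can be sketched as follows. This is a minimal illustration rather than the procedure used in the paper: the function name `merge_large_separators`, the representation of a junction tree as a list of vertex sets plus a list of index pairs, and the threshold `max_sep` are all assumptions made for this example. It contracts every tree edge whose separator exceeds the threshold, using a union-find structure so that the pass over the tree is close to linear time.

```python
def merge_large_separators(clusters, tree_edges, max_sep):
    """Merge adjacent clusters whose separator has more than max_sep vertices.

    clusters   : list of sets of vertices (hypothetical representation)
    tree_edges : list of (a, b) index pairs forming a junction tree
    Returns (list of merged clusters, set of remaining tree edges).
    """
    parent = list(range(len(clusters)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Contract every tree edge whose separator is too large. Contracting an
    # edge of a junction tree leaves the separators on the other edges unchanged,
    # so a single pass over the original separators suffices.
    for a, b in tree_edges:
        if len(clusters[a] & clusters[b]) > max_sep:
            parent[find(a)] = find(b)

    merged = {}
    for idx, cluster in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(cluster)

    remaining = {frozenset((find(a), find(b)))
                 for a, b in tree_edges if find(a) != find(b)}
    return list(merged.values()), remaining


# Hypothetical junction tree: {1,2,3} - {2,3,4} - {4,5}
clusters = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
edges = [(0, 1), (1, 2)]
new_clusters, new_edges = merge_large_separators(clusters, edges, max_sep=1)
```

With `max_sep=0` in this toy example every separator is considered too large, so the result is a single cluster containing all vertices, which matches the worst case described above.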

### 5.3 Advantages of using Junction Trees and Region Graphs

An alternative approach to estimating using is to modify some current UGMS algorithms (see Appendix B for some concrete examples). For example, neighborhood selection based algorithms first estimate the neighborhood of each vertex and then combine all the estimated neighborhoods to construct an estimate of (Meinshausen and Bühlmann, 2006; Bresler et al., 2008; Netrapalli et al., 2010; Ravikumar et al., 2010). Two ways in which these algorithms can be modified when given are described as follows:

1. A straightforward approach is to decompose the UGMS problem into different subproblems of estimating the neighborhood of each vertex. The graph can be used to restrict the estimated neighbors of each vertex to be subsets of the neighbors in . For example, in Figure 7(a), the neighborhood of is estimated from the set and the neighborhood of is estimated from the set . This approach can be compared to independently applying Algorithm 2 to each region in the region graph. For example, when using the region graph, the edge can be estimated by applying a UGMS algorithm to . In comparison, when not using region graphs, the edge is estimated by applying a UGMS algorithm to . In general, using region graphs results in smaller subproblems. A good example to illustrate this is the star graph in Figure 7(g). A junction tree representation of the star graph can be computed so that all clusters will have size two. Subsequently, the junction tree framework will only require applying a UGMS algorithm to a pair of nodes. On the other hand, neighborhood selection needs to be applied to all the nodes to estimate the neighbors of the central node which is connected to all other nodes.

2. An alternative approach is to estimate the neighbors of each vertex in an iterative manner. However, it is not clear what ordering should be chosen for the vertices. The region graph approach outlined in Section 5.1 leads to a natural choice for choosing which edges to estimate in the graph so as to reduce the problem size of subsequent subproblems. Moreover, iteratively applying neighborhood selection may still lead to large subproblems. For example, suppose the star graph in Figure 7(g) is in fact the true graph. In this case, using neighborhood selection always leads to applying UGMS to all the nodes in the graph.

From the above discussion, it is clear that using junction trees for UGMS leads to smaller subproblems and a natural choice of an ordering for estimating edges in the graph. We will see in Section 7 that the smaller subproblems lead to weaker conditions on the number of observations required for consistent graph estimation. Moreover, our numerical simulations in Section 8 empirically show the advantages of using junction trees over neighborhood selection based algorithms.
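The star graph comparison can be checked numerically with a short sketch. The code below is an illustration under stated assumptions, not part of the framework itself: it relies on the fact that a star graph is chordal, so the clusters of a junction tree can be taken to be its maximal cliques, and it enumerates those with a basic Bron-Kerbosch recursion. The helper names `star_graph` and `maximal_cliques` and the choice of six nodes are hypothetical.

```python
def star_graph(p):
    """Adjacency sets of a star on p nodes; node 0 is the central node."""
    adj = {0: set(range(1, p))}
    for v in range(1, p):
        adj[v] = {0}
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(R, P, X):
        if not P and not X:
            cliques.append(R)
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    return cliques


adj = star_graph(6)
clusters = maximal_cliques(adj)

# Largest subproblem when working cluster by cluster in the junction tree.
largest_cluster = max(len(c) for c in clusters)

# Neighborhood selection for the central node involves the node and all of
# its candidate neighbors.
center_problem_size = 1 + len(adj[0])
```

Here every cluster is a pair consisting of the central node and one leaf, whereas the neighborhood selection problem for the central node involves all six nodes.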

## 6 PC-Algorithm for UGMS

So far, we have presented the junction tree framework using an abstract undirected graphical model selection (UGMS) algorithm. This shows that our framework can be used in conjunction with any UGMS algorithm. In this section, we review the PC-Algorithm, since we use it to analyze the junction tree framework in Section 7. The PC-Algorithm was originally proposed in the literature for learning directed graphical models (Spirtes and Glymour, 1991). The first stage of the PC-Algorithm, which we refer to as , estimates an undirected graph using conditional independence tests. The second stage orients the edges in the undirected graph to estimate a directed graph. We use the first stage of the PC-Algorithm for UGMS. Algorithm 4 outlines . Variants of the PC-Algorithm for learning undirected graphical models have recently been analyzed in Anandkumar et al. (2012b, a). The main property used in is the global Markov property of undirected graphical models which states that if a set of vertices separates and , then . As seen in Line 5 of Algorithm 4, deletes an edge if it identifies a conditional independence relationship. Some properties of are summarized as follows:

1. Parameter : iteratively searches for separators for an edge by searching for separators of size . This is reflected in Line 2 of Algorithm 4. Theoretically, the algorithm can automatically stop after searching for all possible separators for each edge in the graph. However, this may not be computationally tractable, which is why needs to be specified.

2. Conditional Independence Test: Line 5 of Algorithm 4 uses a conditional independence test to determine if an edge is in the true graph. This makes extremely flexible since nonparametric independence tests may be used, see Hoeffding (1948); Rasch et al. (2012); Zhang et al. (2012) for some examples. In this paper, for simplicity, we only consider Gaussian graphical models. In this case, conditional independence can be tested using the conditional correlation coefficient defined as

Conditional correlation coefficient:

$$\rho_{ij|S} = \frac{\Sigma_{ij} - \Sigma_{i,S}\,\Sigma_{S,S}^{-1}\,\Sigma_{S,j}}{\sqrt{\Sigma_{i,i|S}\,\Sigma_{j,j|S}}}, \tag{4}$$

where , is the covariance matrix of and , and is the conditional covariance defined by

$$\Sigma_{A,B|S} = \Sigma_{A,B} - \Sigma_{A,S}\,\Sigma_{S,S}^{-1}\,\Sigma_{S,B}. \tag{5}$$

Whenever , then . This motivates the following test for independence:

Conditional Independence Test:

$$|\hat{\rho}_{ij|S}| < \lambda_n \implies X_i \perp\!\!\!\perp X_j \mid X_S, \tag{6}$$

where is computed using the empirical covariance matrix from the observations . The regularization parameter controls the number of edges estimated in .

3. The graphs and : Recall that contains all the edges in . The graph contains edges that need to be estimated since, as seen in Algorithm 2, we apply UGMS to only certain parts of the graph instead of the whole graph. As an example, to estimate edges in a region of a region graph representation of , we apply Algorithm 4 as follows:

$$\hat{G}_R = \mathsf{PC}(\eta, \mathfrak{X}^n, H, H'_R), \tag{7}$$

where is defined in (3). Notice that we do not use in (7). This is because Line 4 of Algorithm 4 automatically finds the set of vertices to apply the algorithm to. Alternatively, we can apply Algorithm 4 using as follows:

$$\hat{G}_R = \mathsf{PC}(\eta, \mathfrak{X}^n_{\overline{R}}, K_{\overline{R}}, H'_R), \tag{8}$$

where is the complete graph over .

4. The set : An important step in Algorithm 4 is specifying the set in Line 4 to restrict the search space for finding separators for an edge . This step significantly reduces the computational complexity of and differentiates from the first stage of the SGS-Algorithm (Spirtes et al., 1990), which specifies .
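The properties listed above can be made concrete with a small sketch of a PC-style edge deletion loop for the Gaussian case. This is a hedged illustration, not a transcription of Algorithm 4: the function names (`solve`, `cond_cov`, `partial_corr`, `pc_skeleton`) are hypothetical, the test in (6) is applied to a given covariance matrix rather than to an empirical one, and the candidate separator set for an edge is assumed to be the smaller of the two endpoint neighborhoods in the graph `H`. Graphs are represented as sets of `frozenset` edges, and only the standard library is used.

```python
from itertools import combinations
from math import sqrt


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[r]) + [b[r]] for r in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def cond_cov(cov, a, b, S):
    """Conditional covariance of entries a and b given S, as in (5)."""
    S = list(S)
    if not S:
        return cov[a][b]
    SS = [[cov[s][t] for t in S] for s in S]
    w = solve(SS, [cov[s][b] for s in S])  # Sigma_{S,S}^{-1} Sigma_{S,b}
    return cov[a][b] - sum(cov[a][s] * ws for s, ws in zip(S, w))


def partial_corr(cov, i, j, S):
    """Conditional correlation coefficient, as in (4)."""
    denom = sqrt(cond_cov(cov, i, i, S) * cond_cov(cov, j, j, S))
    return cond_cov(cov, i, j, S) / denom


def pc_skeleton(cov, eta, lam, H, L):
    """Delete edges of L that pass the test in (6) for some |S| <= eta."""
    H, G = set(H), set(L)

    def nbrs(v, other):
        # Neighbors of v in the current H, excluding the other endpoint.
        return {u for e in H if v in e for u in e if u != v and u != other}

    for k in range(eta + 1):                 # separator sizes 0, 1, ..., eta
        for e in sorted(G, key=sorted):      # snapshot; G changes in the loop
            i, j = sorted(e)
            Ni, Nj = nbrs(i, j), nbrs(j, i)
            S_ij = Ni if len(Ni) <= len(Nj) else Nj   # assumed search set
            if any(abs(partial_corr(cov, i, j, S)) < lam
                   for S in combinations(sorted(S_ij), k)):
                G.discard(e)
                H.discard(e)
    return G


# Hypothetical example: a three-node chain 0 - 1 - 2 with AR(1) covariance.
r = 0.5
cov = [[r ** abs(a - b) for b in range(3)] for a in range(3)]
K = {frozenset(pair) for pair in combinations(range(3), 2)}  # complete graph
estimate = pc_skeleton(cov, eta=1, lam=0.1, H=K, L=K)
```

In this toy chain the edge between nodes 0 and 2 survives the marginal test but is deleted once node 1 is used as a separator, which is the behavior the parameter and the restricted search set are meant to capture.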

## 7 Theoretical Analysis of Junction Tree based PC

We use the PC-Algorithm to analyze the junction tree based UGMS algorithm. Our main result, stated in Theorem 7.2, shows that when using the PC-Algorithm with the junction tree framework, we can potentially estimate the graph using fewer observations than are required by the standard PC-Algorithm. As we shall see in Theorem 7.2, the particular gain in performance depends on the structure of the graph.

Section 7.1 discusses the assumptions we place on the graphical model. Section 7.2 presents the main theoretical result highlighting the advantages of using junction trees. Throughout this section, we use standard asymptotic notation so that implies that there exists an and a constant such that for all , . For , replace by .

### 7.1 Assumptions

1. Gaussian graphical model: We assume , where is a multivariate normal distribution with mean zero and covariance . Further, is Markov on and not Markov on any subgraph of . It is well known that this assumption translates into the fact that if and only if (Speed and Kiiveri, 1986).

2. Faithfulness: If , then and are separated by . (If is the empty set, this means that there is no path between and .) This assumption is important for the algorithm to output the correct graph. Further, note that the Markov assumption is different since it goes the other way: if and are separated by , then . Thus, when both (A1) and (A2) hold, we have that .

3. Separator Size : For all , there exists a subset of nodes , where , such that is a separator for and in . This assumption allows us to use when using .

4. Conditional Correlation Coefficient and : Under (A3), we assume that satisfies

$$\sup\{|\rho_{ij|S}| : i, j \in V,\ S \subset V,\ |S| \le \eta\} \le M < 1, \tag{9}$$

where is a constant. Further, we assume that .

5. High-Dimensionality: We assume that the number of vertices in the graph scales with so that as . Furthermore, both and are assumed to be functions of and unless mentioned otherwise.

6. Structure of : Under (A3), we assume that there exists a set of vertices , , and such that separates and in and . Figure 8(a) shows the general structure of this assumption.

Assumptions (A1)-(A5) are standard conditions for proving high-dimensional consistency of the PC-Algorithm for Gaussian graphical models. The structural constraints on the graph in Assumption (A6) are required for showing the advantages of the junction tree framework. We note that although (A6) appears to be a strong assumption, there are several graph families that satisfy this assumption. For example, the graph in Figure 1(a) satisfies (A6) with , , and . In general, if there exists a separator in the graph of size less than , then (A6) is clearly satisfied. Further, we remark that we only assume the existence of the sets , , and and do not assume that these sets are known a priori. We refer to Remark 7.2 for more discussions about (A6) and some extensions of this assumption.

### 7.2 Theoretical Result and Analysis

Recall in Algorithm 4. Since we assume (A1), the conditional independence test in (6) can be used in Line 5 of Algorithm 4. To analyze the junction tree framework, consider the following steps to construct using when given i.i.d. observations :

1. Compute : Apply using a regularization parameter such that

$$H = \mathsf{PC}(|T|, \mathfrak{X}^n, K_V, K_V), \tag{10}$$

where is the complete graph over the nodes . In the above equation, we apply to remove all edges for which there exists a separator of size less than or equal to .

2. Estimate a subset of edges over and using regularization parameters and , respectively, such that

$$\hat{G}_{V_k} = \mathsf{PC}(\eta, \mathfrak{X}^n, H[V_k \cup T] \cup K_T, H'_{V_k \cup T}), \quad \text{for } k = 1, 2, \tag{11}$$

where as defined in (3).

3. Estimate edges over using a regularization parameter :

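$$\hat{G}_T = \mathsf{PC}\big(\eta, \mathfrak{X}^n, H[T \cup \mathrm{ne}_{\hat{G}_{V_1} \cup \hat{G}_{V_2}}(T)], H[T]\big). \tag{12}$$

4. Final estimate is $\hat{G} = \hat{G}_{V_1} \cup \hat{G}_{V_2} \cup \hat{G}_T$.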

Step 1 is the screening algorithm used to eliminate some edges from the complete graph. For the region graph in Figure 8(b), Step 2 corresponds to applying to the regions and . Step 3 corresponds to applying to the region and all neighbors of estimated so far. Step 4 merges all the estimated edges. Although the neighbors of are sufficient to estimate all the edges in , in general, depending on the graph, a smaller set of vertices is required to estimate edges in . The main result is stated using the following terms defined on the graphical model:

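\begin{align}
p_1 &= |V_1| + |T|, \quad p_2 = |V_2| + |T|, \quad p_T = |T \cup \mathrm{ne}_{G^*}(T)|, \quad \eta_T = |T|, \tag{13} \\
\rho_0 &= \inf\{|\rho_{ij|S}| : i, j \ \text{s.t.}\ |S| \le \eta_T \ \&\ |\rho_{ij|S}| > 0\}, \tag{14} \\
\rho_1 &= \inf\{|\rho_{ij|S}| : i \in V_1,\ j \in V_1 \cup T \ \text{s.t.}\ (i,j) \in E(G^*),\ S \subseteq V_1 \cup T,\ |S| \le \eta\}, \tag{15} \\
\rho_2 &= \inf\{|\rho_{ij|S}| : i \in V_2,\ j \in V_2 \cup T \ \text{s.t.}\ (i,j) \in E(G^*),\ S \subseteq V_2 \cup T,\ |S| \le \eta\}, \tag{16} \\
\rho_T &= \inf\{|\rho_{ij|S}| : i, j \in T \ \text{s.t.}\ (i,j) \in E,\ S \subseteq T \cup \mathrm{ne}_{G^*}(T),\ \eta_T < |S| \le \eta\}. \tag{17}
\end{align}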

The term is a measure of how hard it is to learn the graph in Step 1 so that and all edges that have a separator of size less than are deleted in . The terms and are measures of how hard it is to learn the edges in and (Step 2), respectively, given that . The term is a measure of how hard it is to learn the graph over the nodes given that we know the edges that connect to and to .
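As a small worked illustration of the size quantities in (13), the sketch below checks by breadth-first search that a candidate set separates two vertex sets and then evaluates the four quantities. The graph, the vertex sets, and the function names `separates` and `size_quantities` are hypothetical choices made for this example and are not taken from the paper's figures.

```python
from collections import deque


def separates(adj, T, A, B):
    """Return True if every path from A to B passes through T."""
    T = set(T)
    seen = set(A) - T
    queue = deque(seen)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in T and u not in seen:
                seen.add(u)
                queue.append(u)
    return not (seen & set(B))


def size_quantities(adj, V1, V2, T):
    """Evaluate p1, p2, pT and eta_T as written in (13)."""
    nbrs_T = set().union(*(adj[t] for t in T))
    return {
        "p1": len(V1) + len(T),
        "p2": len(V2) + len(T),
        "pT": len(set(T) | nbrs_T),
        "eta_T": len(T),
    }


# Hypothetical graph: a triangle {1, 2, 3} attached to a path 3 - 4 - 5 - 6.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
adj = {v: set() for v in range(1, 7)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

V1, T, V2 = {1, 2}, {3}, {4, 5, 6}
is_separator = separates(adj, T, V1, V2)
sizes = size_quantities(adj, V1, V2, T)
```

In this example the single vertex 3 separates the two sides, so the subproblems in Step 2 involve 3 and 4 vertices and the subproblem in Step 3 involves the 4 vertices in the separator and its neighborhood, rather than all 6 vertices.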

Under Assumptions (A1)-(A6), there exists a conditional independence test such that if

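$$n = \Omega\left(\max\left\{\rho_0^{-2}\,\eta_T \log(p),\ \rho_1^{-2}\,\eta \log(p_1),\ \rho_2^{-2}\,\eta \log(p_2),\ \rho_T^{-2}\,\eta \log(p_T)\right\}\right), \tag{18}$$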

then as . The proof is given in Appendix E. We now make several remarks regarding Theorem 7.2 and its consequences.
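Before turning to the remarks, it can help to see how the four terms inside the maximum in (18) compare for concrete numbers. The sketch below evaluates each term up to the constants hidden by the asymptotic notation. The function name `sample_size_terms` and all numerical values are hypothetical and chosen only for illustration, so the outputs indicate relative scaling rather than actual sample sizes.

```python
from math import log


def sample_size_terms(rho0, rho1, rho2, rhoT, eta, eta_T, p, p1, p2, pT):
    """Evaluate the four terms inside the max in (18), ignoring constants."""
    return {
        "screening (rho0)": rho0 ** -2 * eta_T * log(p),
        "V1 (rho1)": rho1 ** -2 * eta * log(p1),
        "V2 (rho2)": rho2 ** -2 * eta * log(p2),
        "T (rhoT)": rhoT ** -2 * eta * log(pT),
    }


# Hypothetical setting: the weakest edges sit in the small part of the graph.
terms = sample_size_terms(rho0=0.5, rho1=0.1, rho2=0.3, rhoT=0.4,
                          eta=3, eta_T=1, p=1000, p1=50, p2=950, pT=20)
dominant = max(terms, key=terms.get)
```

In this setting the term for the first vertex set dominates, and its logarithmic factor depends on the 50 vertices in that part of the graph rather than on all 1000 vertices.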

###### Remark (Comparison to Necessary Conditions)

Using results from Wang et al. (2010), it follows that a necessary condition for any algorithm to recover the graph that satisfies Assumptions (A1) and (A6) is that , where is the maximum degree of the graph and and are defined as follows:

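$$\theta_k = \min_{(i,j) \in G^*[V_k \cup T] \setminus G^*[T]} \frac{|\Sigma^{-1}_{ij}|}{\sqrt{|\Sigma^{-1}_{ii}\,\Sigma^{-1}_{jj}|}}, \quad k = 1, 2. \tag{19}$$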

If is a constant and and are chosen so that the corresponding expressions dominate all other expressions, then (18) reduces to . Furthermore, for certain classes of Gaussian graphical models, namely walk summable graphical models (Malioutov et al., 2006), the results in Anandkumar et al. (2012a) show that there exist conditions under which and . In this case, (18) is equivalent to