# Sequential Local Learning for Latent Graphical Models

Learning the parameters of latent graphical models (GMs) is inherently much harder than learning those of non-latent ones, since the latent variables make the corresponding log-likelihood non-concave. Nevertheless, expectation-maximization schemes are popular in practice, although they typically get stuck in local optima. In recent years, the method of moments has provided a refreshing angle for resolving the non-convexity issue, but it is applicable to a quite limited class of latent GMs. In this paper, we aim to enhance its power by enlarging this class of latent GMs. To this end, we introduce two novel concepts, coined marginalization and conditioning, which reduce the problem of learning a larger GM to that of a smaller one. More importantly, they lead to a sequential learning framework that repeatedly enlarges the learned portion of a given latent GM, and thus covers a significantly broader and more complicated class of loopy latent GMs, including convolutional and random regular models.

## 1 Introduction

Graphical models (GMs) are a succinct representation of a joint distribution on a graph, where each node corresponds to a random variable and the edges encode the conditional independence structure among the random variables. GMs have been successfully applied in various fields including information theory [12, 19], physics [24][18, 11]. Introducing latent variables into GMs has been a popular approach for enhancing their representation power in recent deep models, e.g., convolutional/restricted/deep Boltzmann machines [20, 27]. Furthermore, latent variables are unavoidable in certain scenarios where a part of each sample is missing, e.g., see [10].

However, learning the parameters of latent GMs is significantly harder than learning those of non-latent ones, since the latent variables make the corresponding negative log-likelihood non-convex. The main challenge is the difficulty of inferring the unobserved marginal probabilities associated with latent/hidden variables. Nevertheless, expectation-maximization (EM) schemes [9] have been popular in practice with empirical success, e.g., contrastive divergence learning for deep models [14]. They iteratively infer unobserved marginals given the current estimate of the parameters, and typically get stuck at local optima of the log-likelihood function [26].

To address this issue, spectral methods have provided a refreshing angle on learning probabilistic latent models [2]. These theoretical methods exploit the linear algebraic properties of a model to factorize observed (low-order) moments/marginals into unobserved ones. Furthermore, the factorization methods can be combined with convex log-likelihood optimization under certain structures of latent GMs, coined exclusive views [7]. Both factorization methods and exclusive views can be understood as ‘local algorithms’ handling certain partial structures of latent GMs. However, up to now, they are known to be applicable only to a quite limited class of latent GMs, and are not as broadly applicable as EM. This is the main motivation of this paper.

Contribution. Our main question is: “Can we learn latent GMs with more complicated structures, beyond naive applications of local algorithms such as known factorization methods or exclusive views?” To address this, we introduce two novel concepts, called marginalization and conditioning, which reduce the problem of learning a larger GM to that of a smaller one. Hence, if the smaller one can be processed by known local algorithms, then so can the larger one. Our marginalization concept suggests searching for a ‘marginalizable’ subset of variables of a GM whose marginal distribution is invariant with respect to the other variables under certain graphical transformations. This allows us to focus on learning the smaller transformed GM instead of the original larger one. On the other hand, our conditioning concept removes some dependencies among variables of a GM simply by conditioning on some subset of variables. Hence, it enables us to discover marginalizable structures that were not present before conditioning. At first glance, conditioning looks very powerful, since conditioning on more variables would reveal more of the desired marginalizable structures. However, as more variables are conditioned on, the algorithmic complexity grows exponentially. Therefore, we set an upper bound on the number of conditioned variables.

Marginalization and conditioning naturally motivate a sequential scheme that repeatedly recovers larger portions of unobserved marginals given previously recovered or observed ones, i.e., it recursively recovers unobserved marginals using any ‘black-box’ local algorithms. Developing new local algorithms, other than known factorization methods and exclusive views, is not the main focus of this paper. Nevertheless, we provide two new such algorithms, coined disjoint views and linear views, which play a similar role to exclusive views, i.e., they can also be combined with known factorization methods. Given these local algorithms, the proposed sequential learning scheme can learn a significantly broader and more complicated class of latent GMs than previously known ones, including convolutional restricted Boltzmann machines and GMs on random regular graphs, as described in Section 5. Consequently, our results imply that there exists a one-to-one correspondence between observed distributions and parameters for this class of latent GMs. Furthermore, for arbitrary latent GMs, the scheme can be used as a pre-processing stage to boost the performance of EM: first run it to recover as many unobserved marginals as possible, and then run EM using the additional information. We believe that our approach provides a new angle on the important problem of learning latent GMs.

Related works. Parameter estimation of latent GMs has a long history, dating back to [9]. While it can be applied to most latent GMs, the EM algorithm suffers not only from local optima but also from a risk of slow convergence. A natural alternative to the general method of EM is to constrain the structure of graphical models. In independent component analysis (ICA) and its extensions [17, 4], latent variables are assumed to be independent, inducing a simple product form of the latent distribution. Recently, spectral methods have been successfully applied to various classes of GMs including latent trees [21, 31], ICA [8, 25][15][28, 30, 16, 3, 34], latent Dirichlet allocation [1] and others [13, 6, 35, 29]. In particular, [2] proposed a tensor-type algorithm under certain graph structures.

Another important line of work using the method of moments for latent GMs concerns recovering joint or conditional probabilities only among observable variables (see [5] and its references). [23, 22] proposed spectral algorithms to recover the joint distribution among observable variables when the graph structure is a bottlenecked tree. [7] relaxed the tree constraint and proposed a technique that combines the method of moments with likelihood for certain structures. Our generic sequential learning framework allows the use of all these approaches as key components, in order to broaden their applicability. We note that we primarily focus on undirected pairwise binary GMs in this paper, but our results can be naturally extended to other GMs.

## 2 Preliminaries

### 2.1 Graphical Model and Parameter Learning

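Given an undirected graph $G=(V,E)$, we consider the following pairwise binary graphical model (GM), where the joint probability distribution on $x=[x_i:i\in V]\in\{0,1\}^V$ is defined as:

$$P(x)=P_{\beta,\gamma}(x)=\frac{1}{Z}\exp\left(\sum_{(i,j)\in E}\beta_{ij}x_ix_j+\sum_{i\in V}\gamma_ix_i\right), \tag{1}$$

for some parameters $\beta=[\beta_{ij}:(i,j)\in E]\in\mathbb{R}^{E}$ and $\gamma=[\gamma_i:i\in V]\in\mathbb{R}^{V}$. The normalization constant $Z$ is called the partition function.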
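For very small graphs, the distribution (1) can be evaluated by brute force. The following is a minimal sketch, assuming binary variables in {0, 1} (as in the sum over {0,1}^V used later in this section); the function name `joint_distribution` and the three-node example are illustrative and not part of the paper.

```python
import itertools
import math


def joint_distribution(nodes, beta, gamma):
    """Return {assignment tuple: probability} for the pairwise binary GM in (1).

    nodes: list of node labels; beta: {(i, j): edge weight}; gamma: {i: node bias}.
    Enumerates all 2^|V| assignments, so this is only practical for tiny graphs.
    """
    index = {v: n for n, v in enumerate(nodes)}
    weights = {}
    for x in itertools.product((0, 1), repeat=len(nodes)):
        energy = sum(b * x[index[i]] * x[index[j]] for (i, j), b in beta.items())
        energy += sum(g * x[index[i]] for i, g in gamma.items())
        weights[x] = math.exp(energy)
    Z = sum(weights.values())  # partition function
    return {x: w / Z for x, w in weights.items()}


# Hypothetical 3-node chain a - b - c.
P = joint_distribution(
    ["a", "b", "c"],
    beta={("a", "b"): 1.0, ("b", "c"): -0.5},
    gamma={"a": 0.2, "b": 0.0, "c": -0.3},
)
```

The exponential cost of the enumeration is exactly the difficulty of computing the partition function that the text refers to.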

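Given samples $x^{(1)},\dots,x^{(N)}\in\{0,1\}^V$ drawn from the distribution (1) with some true (fixed but unknown) parameter $\beta^*,\gamma^*$, the problem of interest is recovering it. The popular method for this parameter learning task is the following maximum likelihood estimation (MLE):

$$\underset{\beta,\gamma}{\text{maximize}}\quad\frac{1}{N}\sum_{n=1}^{N}\log P_{\beta,\gamma}\left(x^{(n)}\right), \tag{2}$$

where it is well known [32] that the log-likelihood is concave with respect to $\beta,\gamma$, and the gradient of the log-likelihood is

$$\frac{\partial}{\partial\gamma_i}\frac{1}{N}\sum_{n=1}^{N}\log P_{\beta,\gamma}\left(x^{(n)}\right)=\frac{1}{N}\sum_{n=1}^{N}x^{(n)}_i-\mathbb{E}_{\beta,\gamma}[x_i], \tag{3}$$

$$\frac{\partial}{\partial\beta_{ij}}\frac{1}{N}\sum_{n=1}^{N}\log P_{\beta,\gamma}\left(x^{(n)}\right)=\frac{1}{N}\sum_{n=1}^{N}x^{(n)}_ix^{(n)}_j-\mathbb{E}_{\beta,\gamma}[x_ix_j]. \tag{4}$$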

Here, the last term, expectation of corresponding sufficient statistics, comes from the partial derivative of the log-partition function. Furthermore, it is well known that there exists a one-to-one correspondence between parameter and sufficient statistics (see [32] for details).
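The moment-matching form of the gradients (3) and (4) can be illustrated on a two-node model. The sketch below is a toy example under stated assumptions: the true parameter values are hypothetical, exact moments of the true model stand in for the empirical averages, and a plain gradient ascent with step size 1.0 is used (any sufficiently small step would do for this concave problem).

```python
import itertools
import math


def moments(beta, g1, g2):
    """Exact (E[x1], E[x2], E[x1*x2]) for the two-node GM exp(beta*x1*x2 + g1*x1 + g2*x2)/Z."""
    w = {x: math.exp(beta * x[0] * x[1] + g1 * x[0] + g2 * x[1])
         for x in itertools.product((0, 1), repeat=2)}
    Z = sum(w.values())
    return (sum(v * x[0] for x, v in w.items()) / Z,
            sum(v * x[1] for x, v in w.items()) / Z,
            w[(1, 1)] / Z)


true_params = (1.0, 0.2, -0.3)   # hypothetical (beta_12, gamma_1, gamma_2)
target = moments(*true_params)   # stands in for the empirical averages in (3)-(4)

params = [0.0, 0.0, 0.0]         # current (beta_12, gamma_1, gamma_2)
for _ in range(5000):
    m = moments(*params)
    # Gradient = data moment minus model moment, ordered to match `params`.
    grad = (target[2] - m[2], target[0] - m[0], target[1] - m[1])
    params = [p + g for p, g in zip(params, grad)]
```

After the loop, `params` agrees with `true_params` up to numerical precision, which reflects the one-to-one correspondence between parameters and sufficient statistics in the fully observed case.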

One can further observe that if the number of samples is sufficiently large, i.e., $N\to\infty$, then (2) is equivalent to

$$\underset{\beta,\gamma}{\text{maximize}}\quad\sum_{x\in\{0,1\}^V}P_{\beta^*,\gamma^*}(x)\log P_{\beta,\gamma}(x),$$

where the true parameter $\beta^*,\gamma^*$ achieves the (unique) optimal solution. This directly implies that, once the empirical nodewise and pairwise marginals in (3) and (4) approach the true marginals, the gradient method can recover $\beta^*,\gamma^*$, modulo the difficulty of exactly computing the expectations of the sufficient statistics.

Now let us consider a more challenging task: parameter learning under latent variables. Given a subset $O\subset V$, we assume that for every sample $x^{(n)}$, the variables $x^{(n)}_O$ are observed/visible and the other variables $x^{(n)}_{V\setminus O}$ are hidden/latent. In this case, the MLE only involves observed variables:

$$\underset{\beta,\gamma}{\text{maximize}}\quad\frac{1}{N}\sum_{n=1}^{N}\log P_{\beta,\gamma}\left(x^{(n)}_O\right), \tag{5}$$

where $P_{\beta,\gamma}(x_O)=\sum_{x_{V\setminus O}}P_{\beta,\gamma}(x_O,x_{V\setminus O})$. As before, the true parameter achieves the optimal solution of (5) if the number of samples is large enough. However, the log-likelihood under latent variables is no longer concave, which makes the parameter learning task harder. One can apply an expectation-maximization (EM) scheme, but it typically gets stuck in local optima.
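One concrete difficulty behind the latent objective (5) is the label symmetry of hidden nodes: flipping a hidden variable (x_h to 1 - x_h) changes the parameters but not the distribution of the visible variables. The following sketch checks this on a star-shaped GM with one hidden node; the structure, the function name `visible_marginal` and all numeric values are assumptions made for illustration.

```python
import itertools
import math


def visible_marginal(beta_hv, gamma_h, gamma_v):
    """P(x_O) for a star GM: one hidden node h joined to visible nodes v_1..v_m.

    beta_hv[k] is the weight of edge (h, v_k); gamma_v[k] is the bias of v_k.
    """
    m = len(beta_hv)
    weights = {}
    for x_o in itertools.product((0, 1), repeat=m):
        total = 0.0
        for x_h in (0, 1):  # sum out the hidden variable
            energy = gamma_h * x_h + sum(
                (b * x_h + g) * x for b, g, x in zip(beta_hv, gamma_v, x_o))
            total += math.exp(energy)
        weights[x_o] = total
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}


beta, gh, gv = [1.0, -0.7, 0.4], 0.3, [0.1, 0.2, -0.5]  # hypothetical values
p = visible_marginal(beta, gh, gv)
# Relabel the hidden node (x_h -> 1 - x_h): different parameters, same P(x_O).
q = visible_marginal([-b for b in beta], -gh, [g + b for g, b in zip(gv, beta)])
```

Both parameter settings attain the same value of (5), so recovery of latent marginals can only be expected up to such relabeling.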

### 2.2 Tensor Decomposition

The fundamental issue in parameter learning of latent GMs is that it is hard to infer the pairwise marginals involving latent variables directly from samples. If one could infer them, one could also recover the parameter $\beta,\gamma$, as discussed in the previous section. Somewhat surprisingly, however, under certain conditions on the latent GM, pairwise marginals including latent variables can be recovered using low-order visible marginals. Before introducing such conditions, we first make the following assumption for any GM on a graph $G$ considered throughout this paper.

###### Assumption (Faithful)

For any two nodes $i,j\in V$, if $i,j$ are connected, then $x_i,x_j$ are dependent.

This faithfulness assumption implies that the GM has only the conditional independences given by the graph $G$. We also introduce the following notion [2].

###### Definition (Bottleneck)

A node $i\in V$ is a bottleneck if there exist $j,k,\ell\in V$, denoted as ‘views’, such that every path between two of $j,k,\ell$ contains $i$.
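The bottleneck condition is purely graph-theoretic and can be checked by a graph search: after deleting the candidate node, no two views may remain connected. The sketch below is one way to do this; the function name `is_bottleneck` and the example graphs are hypothetical.

```python
from collections import deque


def is_bottleneck(edges, node, views):
    """True if every path between two distinct views passes through `node`."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    views = list(views)
    if node in views:
        return False
    for n, start in enumerate(views):
        # Breadth-first search from `start` in the graph with `node` removed.
        seen, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for w in adj.get(u, ()):
                if w != node and w not in seen:
                    seen.add(w)
                    queue.append(w)
        # Reaching another view means some path avoids `node`.
        if any(other in seen for other in views[n + 1:]):
            return False
    return True


star = [("h", "a"), ("h", "b"), ("h", "c")]
with_shortcut = star + [("a", "b")]  # a and b are now joined without passing h
```

Here `is_bottleneck(star, "h", ["a", "b", "c"])` holds, while the extra edge in `with_shortcut` breaks the property.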

Figure 1(a) illustrates the bottleneck. By construction, the views are conditionally independent given the bottleneck. Armed with this notion, we now introduce the following theorem, which provides sufficient conditions for recovering unobserved/latent marginals [2].

###### Theorem

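Given a GM with a parameter $\beta,\gamma$, suppose $i$ is a bottleneck with views $j,k,\ell$. If $P_{\beta,\gamma}(x_j,x_k,x_\ell)$ is given, then there exists an algorithm which outputs $P_{\beta,\gamma}(x_i,x_j,x_k,x_\ell)$ up to relabeling of $x_i$, i.e., ignoring the symmetry of $x_i=0$ and $x_i=1$.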

The above theorem implies that, using the visible marginals $P_{\beta,\gamma}(x_j,x_k,x_\ell)$, one can recover unobserved marginals involving $x_i$. For a bottleneck with more than three views, the joint distribution of the bottleneck and its views is recoverable using the above theorem by choosing three views at a time.

Besides tensor decomposition, there are other conditions on latent GMs under which marginals including latent variables are recoverable. Before elaborating on these conditions, we introduce the following notion for a GM on a graph $G$ [7].

###### Definition (Exclusive View)

For a set of nodes $S\subset V$, we say it satisfies the exclusive view property if for each $i\in S$, there exists $j\in V\setminus S$, denoted as ‘exclusive view’, such that every path between $j$ and $S\setminus\{i\}$ contains $i$.

Figure 1(b) illustrates the exclusive view property. Now, we are ready to state the conditions for recovering unobserved marginals using this property [7].

###### Theorem

Given GM with a parameter , suppose a set of nodes satisfies the exclusive view property with a set of exclusive views . If and are given for all and an exclusive view of , then there exists an algorithm which outputs .

At first glance, the above theorem does not seem to be useful as it requires a set of marginals including every variable corresponding to . However, suppose a set of latent nodes satisfies the property while its set of exclusive views is visible, i.e., is observed. If for all , is a bottleneck with views containing its exclusive view , then one can resort to to obtain .

## 3 Marginalizing and Conditioning

In Section 2.2, we introduced sufficient conditions for recovering unobservable marginals. Specifically, the two theorems there state that for certain structures of latent GMs, it is possible to recover latent marginals simply from low-order visible marginals, and in turn the parameters of latent GMs via the convex MLE in (2).

Now, a natural question arises: “Can we even recover unobserved marginals for latent GMs with more complicated structures beyond naive applications of the bottlenecks or exclusive views?” To address this, in this section we enlarge the class of such latent GMs by proposing generic concepts, marginalization and conditioning.

### 3.1 Key Ideas

We start by defining two concepts, marginalization and conditioning, formally. The former is a combinatorial concept defined as follows.

###### Definition (Marginalization)

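Given a graph $G=(V,E)$, we say $S\subset V$ is marginalizable if for all $i\in V\setminus S$, there exists a (minimal) set $S_i\subset S$ with $|S_i|\le 2$ such that $i$ and $S\setminus S_i$ are disconnected in $G\setminus S_i$ (here, $G\setminus S_i$ is the subgraph of $G$ induced by $V\setminus S_i$). For a marginalizable set $S$ in $G$, the marginalization of $S$ is the graph on $S$ with edges

$$\{(i,j)\in E:i,j\in S\}\cup\{(j,k):S_i=\{j,k\}\text{ for }i\in V\setminus S\}.$$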

In Figure 2, for example, node is disconnected with when removing . Hence, the edge between and is additionally included in the marginalization of .
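The marginalization edge set can be computed with a graph search. The sketch below rests on two assumptions that are not stated explicitly in the text as extracted: each separating set S_i has at most two nodes (suggested by the pairs {j, k} in the edge formula), and the minimal S_i is the set of S-nodes adjacent to the connected component of i among the nodes outside S. The function name `marginalization` is hypothetical.

```python
from collections import deque


def marginalization(edges, S):
    """Edge set of the marginalization of S, or None if S is not marginalizable.

    Assumes every separating set S_i may contain at most two nodes.
    """
    S = set(S)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Edges of G with both endpoints in S are kept as they are.
    new_edges = {frozenset((u, v)) for u, v in edges if u in S and v in S}
    seen = set()
    for start in adj:
        if start in S or start in seen:
            continue
        # Explore the component of `start` outside S and collect its S-boundary.
        boundary, queue = set(), deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w in S:
                    boundary.add(w)
                elif w not in seen:
                    seen.add(w)
                    queue.append(w)
        if len(boundary) > 2:
            return None
        if len(boundary) == 2:
            new_edges.add(frozenset(boundary))  # new edge (j, k) with S_i = {j, k}
    return new_edges


chain = [("j", "i"), ("i", "k")]
star = [("i", "a"), ("i", "b"), ("i", "c")]
```

For `chain` with S = {j, k}, the removed node i contributes the new edge (j, k); for `star` with S = {a, b, c}, the node i touches three nodes of S, so S is not marginalizable under the stated assumption.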

With the definition of marginalization, the following key proposition reveals that recovering unobserved marginals of a latent GM can be actually reduced to that of much smaller latent GM.

###### Proposition

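Consider a GM on $G=(V,E)$ with a parameter $\beta,\gamma$. If $S\subset V$ is marginalizable in $G$, then there exists a (unique) $\beta',\gamma'$ such that the GM on the marginalization of $S$ with the parameter $\beta',\gamma'$ induces the same distribution on $x_S$, i.e.,

$$P_{\beta,\gamma}(x_S)=P_{\beta',\gamma'}(x_S). \tag{6}$$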

The proof of the above proposition is presented in Appendix A. The proposition provides a way of representing the marginal probability on $x_S$ of the GM on $G$ via the smaller GM on the marginalization of $S$. Suppose there exists an algorithm (e.g., via bottlenecks, although we do not restrict ourselves to this method) that can recover a joint distribution $P_{\beta^\dagger,\gamma^\dagger}(x_S)$, or equivalently its sufficient statistics, of the latent GM on the marginalization of $S$ using only the observed marginals there. Then, it must hold that

$$P_{\beta^\dagger,\gamma^\dagger}(x_S)=P_{\beta',\gamma'}(x_S), \tag{7}$$

or equivalently $\beta^\dagger=\beta'$ and $\gamma^\dagger=\gamma'$, where $\beta',\gamma'$ is the unique parameter satisfying (6). Using the proposition and marginalization, one can recover unobserved marginals of a large GM by considering smaller GMs corresponding to marginalizations of the large one. The role of marginalization will be further discussed and clarified in Section 4.
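The identity (6) can be checked numerically in the smallest case: a chain j - i - k in which the middle node i is summed out, so that the marginalization of S = {j, k} is the single edge (j, k). This is only an illustrative sketch with hypothetical parameter values, not the proof in Appendix A; the new parameters are read off from log-ratios of the summed-out weights.

```python
import itertools
import math

# Hypothetical chain j - i - k with latent middle node i.
b_ji, b_ik = 0.8, -1.1
g_j, g_i, g_k = 0.3, -0.2, 0.5


def unnorm_marginal(xj, xk):
    """Sum the unnormalized weight of (1) over x_i in {0, 1}."""
    return sum(
        math.exp(b_ji * xj * xi + b_ik * xi * xk + g_j * xj + g_i * xi + g_k * xk)
        for xi in (0, 1))


f = {x: unnorm_marginal(*x) for x in itertools.product((0, 1), repeat=2)}
Z = sum(f.values())

# Parameters of the two-node GM on S = {j, k}, read off from log-ratios of f.
g_j_new = math.log(f[1, 0] / f[0, 0])
g_k_new = math.log(f[0, 1] / f[0, 0])
b_jk_new = math.log(f[1, 1] * f[0, 0] / (f[1, 0] * f[0, 1]))

w = {x: math.exp(b_jk_new * x[0] * x[1] + g_j_new * x[0] + g_k_new * x[1]) for x in f}
Z_new = sum(w.values())
max_gap = max(abs(f[x] / Z - w[x] / Z_new) for x in f)
```

The value `max_gap` is zero up to floating-point error, and `b_jk_new` is the weight of the edge that marginalization adds between j and k.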

In addition to marginalizing, we introduce the second key ingredient, called conditioning, with which the class of recoverable latent GMs can be further expanded.

###### Proposition

For a graph , for and , is a subgraph of .

The proof of the above proposition is straightforward since (defined in Definition 3.1) for in contains that for in , i.e., the edge set of contains that of . Figure 3 illustrates how conditioning broadens the class of recoverable latent GMs, as suggested by the above proposition. Once the node is conditioned out, the marginalization (Figure 3(c)) takes a form that can be handled by the bottleneck-based algorithm of Section 2.2.

### 3.2 Labeling Issues

Despite its usefulness, there is a caveat in performing conditioning: consistent labeling of latent nodes. For example, consider the latent GM in Figure 3. Conditioned on , is a bottleneck with views , , (Figure 3(c)). If is given, one can recover the conditional distribution up to labeling of , from the bottleneck theorem of Section 2.2 and conditioning. Here, conditioning worsens the relabeling problem in the sense that we might choose different labels for for each conditioned value and . As a result, the recovered joint distribution computed as with mixed labeling of , would be different from the true joint. To handle this issue, we define the following concept for consistent labeling of latent variables.

###### Definition (Label-Consistency)

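Given a GM on $G=(V,E)$ with a parameter $\beta,\gamma$, we say $i\in V$ is label-consistent for $C\subset V\setminus\{i\}$ if there exists $j\in V\setminus(C\cup\{i\})$, called ‘reference’, such that

$$\log\frac{P_{\beta,\gamma}(x_j=1\mid x_i=1,x_C=s)}{P_{\beta,\gamma}(x_j=1\mid x_i=0,x_C=s)},$$

called ‘preference’, is consistently positive or negative for all $s\in\{0,1\}^C$. (Note that the preference cannot be zero due to the faithfulness assumption.)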

In Figure 3 for example, is label-consistent for with reference since the corresponding preference is the function only on , which is fixed as either or (note that the reference can be arbitrarily chosen due to the symmetry of structure). Using the label-consistency of , one can choose a consistent label of by choosing the label consistent to the preference of the reference node .

Even if $i$ is label-consistent under the GM with the true parameter, we need to specify the reference and the corresponding preference to obtain a correct labeling of $x_i$. We note, however, that attractive GMs (i.e., $\beta_{ij}>0$ for all $(i,j)\in E$) always satisfy label-consistency with any reference node, since for any $C$ and any $i,j$ that are connected in $G\setminus C$,

$$P_{\beta,\gamma}(x_j=1\mid x_i=1,x_C)>P_{\beta,\gamma}(x_j=1\mid x_i=0,x_C).$$

Furthermore, there are settings in which label-consistency can be enforced from the structure of the latent GM even without knowledge of its true parameter. For example, consider a latent GM on $G$ with a parameter $\beta,\gamma$. For a set $C$, a latent node $i$ and its neighbor $j$ such that the edge $(i,j)$ is the only path from $i$ to $j$ in $G\setminus C$, by the symmetry of the labels of latent nodes, one can assume that $\beta_{ij}>0$, i.e.,

$$P_{\beta,\gamma}(x_j=1\mid x_i=1,x_C)>P_{\beta,\gamma}(x_j=1\mid x_i=0,x_C),$$

to force the label-consistency of $i$ for $C$. In general, one can still choose the labels of latent variables to maximize the log-likelihood of the observed variables.

As in conditioning, marginalization also has a labeling issue. Consider a latent GM on . Suppose that every unobserved pairwise marginal can be recovered by two marginalizations of . If there is a common latent node , then the labeling for might be inconsistent. To address this issue, we make the following assumption on graph , node , and parameter of GM.

###### Assumption (Degeneracy)

.

Under the assumption, one can choose a label of to satisfy using the symmetry of labels of latent nodes.

## 4 Sequential Marginalizing and Conditioning

In the previous section, we introduced two concepts marginalization and conditioning to translate the marginal recovery problem of a large GM into that of smaller and tractable GMs. In this section, we present a sequential strategy, adaptively applying marginalization and conditioning, by which we substantially enlarge the class of tractable GMs with hidden/latent variables.

### 4.1 Example

We begin with a simple example describing our sequential learning framework. Consider a latent GM as illustrated in Figure 4(a) and a parameter . Given visible marginal , our goal is to recover all unobserved pairwise marginals including or in order to learn via the convex MLE (2). As both nodes and are not a bottleneck, one can consider the conditioning strategy described in the previous section, i.e., the conditional distribution in Figure 4(b). Now, node is a bottleneck with views . Hence, one can recover using where the label of is set to satisfy

$$P_{\beta,\gamma}(x_\ell=1\mid x_i=1,x_{\{j,k\}})>P_{\beta,\gamma}(x_\ell=1\mid x_i=0,x_{\{j,k\}}),$$

i.e., node is label consistent. Further, can be recovered using the known visible marginals and the following identity

$$P_{\beta,\gamma}(x_{\{i,j,k,\ell,m,n\}})=P_{\beta,\gamma}(x_{\{i,\ell,m,n\}}\mid x_{\{j,k\}})\,P_{\beta,\gamma}(x_{\{j,k\}}).$$
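The identity above is the chain rule: a conditional table recovered for each value of the conditioned variables is weighted by the known marginal of those variables. A minimal sketch, with a hypothetical helper `assemble_joint` and made-up numbers, is:

```python
def assemble_joint(conditional, marginal):
    """Combine P(x_R | x_C = s) and P(x_C = s) into P(x_R, x_C).

    conditional: {s: {r: P(x_R = r | x_C = s)}}, marginal: {s: P(x_C = s)}.
    Keys r and s are tuples of 0/1 values.
    """
    return {(r, s): p_r * marginal[s]
            for s, table in conditional.items()
            for r, p_r in table.items()}


# Hypothetical numbers: one recovered variable, one conditioned variable.
conditional = {(0,): {(0,): 0.9, (1,): 0.1},
               (1,): {(0,): 0.4, (1,): 0.6}}
marginal = {(0,): 0.25, (1,): 0.75}
joint = assemble_joint(conditional, marginal)
```

The result is only the true joint if the latent labels used in the conditional tables agree across all conditioned values `s`, which is what label-consistency is meant to guarantee.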

Since we recovered pairwise marginals between and , , , the remaining goal is to recover pairwise marginals including . Now consider a latent GM where is conditioned, as illustrated in Figure 4(c). This time, the node is a bottleneck with views , which can be handled by an additional application of (the details are the same as in the previous case on node ).

This example shows that the sequential application of conditioning extends the class of latent GMs for which unobserved pairwise marginals are recoverable. Here, the local algorithm is used as a black box; hence, one can substitute other algorithms as long as they have similar guarantees. One caveat is that conditioning on an arbitrary number of variables is very expensive, as the algorithmic (and sample) complexity of learning grows exponentially with respect to the number of conditioned variables. Therefore, it is reasonable to bound the number of conditioned variables.

### 4.2 Algorithm Design

Now, we are ready to state the main learning framework sequentially applying marginalization and conditioning, summarized in Algorithm 1. Suppose that there exists an algorithm, called , e.g., , for a class of pairs such that all satisfy the following:

• Given GM with a parameter on and marginals , outputs the entire distribution , up to labeling of variables on .

For example, consider a graph illustrated in Figure 0(a) with . Then, outputs the entire distribution .

In addition, suppose that there exists an algorithm, called , e.g., , for a class of pairs such that all satisfy the following:

• Given GM with a parameter on and marginals , outputs the distribution where .

Namely, simply merges the small marginal distributions for into the entire distribution on . For example, consider a graph illustrated in Figure 0(b) with

$$T_G=\{\{i,j,k,\ell\},\{i,i'\},\{j,j'\},\{k,k'\},\{\ell,\ell'\}\}$$

where have exclusive views , respectively. Then, outputs the distribution .

For a GM on with a parameter , suppose we know a family of label-consistency quadruples

$$L=\{(i,j,p,C):i\text{ is label-consistent for }C\text{ with reference }j\text{ and preference }p\}$$

and marginals for some . As we mentioned in the previous section, we also bound the number of conditioning variables by some . Under the setting, our goal is to recover more marginals beyond initially known ones .

The following conditions for with and are sufficient so that additional marginals can be recovered by conditioning variables on , marginalizing and applying :

• for some

• For all , there exists such that

• For all , there exist and such that ,

where . In the above, implies that if are given, then outputs up to labeling of . In addition, says that the required marginals and are known. Finally, is necessary that all nodes which we need to infer their labels are label-consistent.

Similarly, the following conditions for with and are sufficient so that can be recovered by conditioning variables on and applying where :

• For all , there exists such that ,

In the above, says that the required marginals for merging are given.

The above procedures imply that given initial marginals , one can recover additional marginals , where

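$$\begin{aligned}A_0&=\{R\cup C:C\subset V,\ |C|\le K,\ R\subset V\setminus C\text{ satisfy C1-C3}\},\\ B_0&=\{T\cup C:C\subset V,\ |C|\le K,\ (G\setminus C,T_{G\setminus C})\in M\text{ satisfy C4 where }T=\cup_{S\in T_G}S\},\end{aligned} \tag{8}$$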

from and , respectively. One can repeat the above procedure for recovering more marginals as

$$\sigma_{t+1}=\sigma_t\cup A_t\cup B_t.$$
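The update above is a fixed-point iteration over families of node sets whose marginals are known. The following sketch shows only this outer loop; the callables `recover_a` and `recover_b` are hypothetical stand-ins for the computation of A_t and B_t, and the toy rule used in the example is not the paper's conditions C1-C4.

```python
def sequential_closure(sigma0, recover_a, recover_b, max_iters=1000):
    """Iterate sigma_{t+1} = sigma_t | A_t | B_t until nothing new is added.

    sigma0: iterable of frozensets of nodes whose marginals are known.
    recover_a, recover_b: hypothetical callables mapping the current family
    to the families A_t and B_t of newly recoverable node sets.
    Returns the final family and the number of productive iterations.
    """
    sigma = set(sigma0)
    for t in range(max_iters):
        new = (set(recover_a(sigma)) | set(recover_b(sigma))) - sigma
        if not new:
            return sigma, t
        sigma |= new
    return sigma, max_iters


# Toy stand-in rule on a path 0-1-2-3: a known set can be extended by the
# next node to its right; the second rule recovers nothing.
def extend_right(sigma):
    return {s | {max(s) + 1} for s in sigma if max(s) < 3}


sigma, steps = sequential_closure({frozenset({0})}, extend_right, lambda s: set())
```

Because the family can only grow and the number of node subsets is finite, the loop stops once an iteration adds nothing new.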

Recall that we are primarily interested in recovering all pairwise marginals, i.e.,

$$\{P_{\beta,\gamma}(x_i,x_j):(i,j)\in E\}.$$

The following theorem implies that one can check the success of Algorithm 1 in time, where are typically chosen as small constants.

###### Theorem

Suppose we have a label-consistency family of GM on and marginals for some . If Algorithm 1 eventually recovers all pairwise marginals, then it does so in iterations, where and denote the maximum numbers of conditioning variables and nodes of graphs in , respectively.

The proof of the above theorem is presented in Appendix B. We note that, for computational efficiency, one can design a custom sequence for recovering marginals rather than recovering all marginals in at each step. In Section 5, we provide such examples, whose strategies have linear-time complexity at each iteration. We also remark that even when Algorithm 1 recovers some, but not all, unobserved pairwise marginals for a given latent GM, it is still useful, since one can run the EM algorithm using the additional information provided by Algorithm 1. We leave this suggestion for future exploration.

### 4.3 Recoverable Local Structures

For running the sequential learning framework in the previous section, one requires ‘black-box’ knowledge of a label-consistency family and a class of locally recoverable structures of latent GMs, i.e., and . The complete study on them is out of our scope, but we provide the following guidelines on their choices.

As mentioned in Section 3.2, the label-consistency family can be found easily for some classes of GMs, including attractive ones. One can also infer it heuristically for general GMs in practice. As mentioned in the previous section, one can choose the first class to be the one that corresponds to tensor decomposition. Beyond tensor decomposition, in practice, one might hope to add an option for small-sized latent GMs, since even a generic non-convex solver might compute a near-optimal solution of the MLE due to their small dimensionality.

For the choice of the merging class, we mentioned those corresponding to exclusive views in the previous section. In addition, we provide two more examples, called disjoint views and linear views, as described in Algorithms 2 and 3, respectively. In Algorithm 3, is defined as

Figure 5 illustrates disjoint views and linear views.

## 5 Examples

In this section, we provide concrete examples of loopy latent GM where the proposed sequential learning framework is applicable. In what follows, we assume that it uses classes corresponding to , , and .

Grid graph. We first consider a latent GM on a grid graph illustrated in Figure 6(a), where boundary nodes are visible and internal nodes are latent. The following lemma states that all pairwise marginals can be successfully recovered from the observed ones using the proposed sequential learning algorithm.

###### Lemma

Consider any latent GM with a parameter illustrated in Figure 5(a), , and . Then, updated under Algorithm 1 contains all pairwise marginals.

In the above, recall that is the set of visible nodes. The proof strategy is illustrated in Figure 6 and the formal proof is presented in Appendix C. We remark that to prove Lemma 5, and are not necessary to use.

Convolutional graph. Second, we consider a latent GM illustrated in Figure 7(a), which corresponds to a convolutional restricted Boltzmann machine (CRBM) [20], and prove the following lemma.

###### Lemma

Consider any latent GM with a parameter illustrated in Figure 6(a), , and . Then, updated under Algorithm 1 contains all pairwise marginals.

The proof strategy is illustrated in Figure 7 and the formal proof is presented again in Appendix D. We remark that to prove Lemma 5, and are not necessary to use. Furthermore, it is straightforward to generalize the proof of Lemma 5 for arbitrary CRBM.

###### Lemma

Consider any CRBM with visible nodes and a filter size , , and . Then, updated under Algorithm 1 contains all pairwise marginals. (The lemma holds for an arbitrary stride of the CRBM.)

Random regular graph. Finally, we state the following lemma for latent random regular GMs.

###### Lemma

Consider any latent GM with a parameter on a random -regular graph for some constant , and . There exists a constant such that if the number of latent variables is at most , updated under Algorithm 1 contains all pairwise marginals a.a.s.

The proof of the above lemma is presented in Appendix E; the result cannot be obtained without our sequential learning strategy. One can obtain an explicit formula of from our proof, but it is quite a loose bound since we did not make much effort to optimize it.

## 6 Conclusion

In this paper, we present a new learning strategy for latent graphical models. Unlike known algebraic (e.g., tensor decomposition) and optimization (e.g., exclusive views) approaches for this non-convex problem, ours has a combinatorial flavor and is more generic, using them as subroutines. We believe that our approach provides a new angle on this important learning task.

## References

• [1] Animashree Anandkumar, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Yi kai Liu. A spectral algorithm for latent dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012.
• [2] Animashree Anandkumar, Rong Ge, Daniel J Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
• [3] Animashree Anandkumar, Daniel J Hsu, and Sham M Kakade. A method of moments for mixture models and hidden markov models. In Conference on Learning Theory, 2012.
• [4] Francis R Bach and Michael I Jordan. Kernel independent component analysis. Journal of machine learning research, 3(Jul):1–48, 2002.
• [5] Borja Balle and Mehryar Mohri. Spectral learning of general weighted automata via constrained matrix completion. In Advances in Neural Information Processing Systems, pages 2159–2167, 2012.
• [6] Arun T. Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013.
• [7] Arun T. Chaganty and Percy Liang. Estimating latent-variable graphical models using moments and likelihoods. In International Conference on Machine Learning, pages 1872–1880, 2014.
• [8] Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent component analysis and applications. Academic press, 2010.
• [9] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
• [10] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI magazine, 17(3):37, 1996.
• [11] William T Freeman, Egon C Pasztor, and Owen T Carmichael. Learning low-level vision. International journal of computer vision, 40(1):25–47, 2000.
• [12] Robert Gallager. Low-density parity-check codes. IRE Transactions on information theory, 8(1):21–28, 1962.
• [13] Yoni Halpern and David Sontag. Unsupervised learning of noisy-or bayesian networks. In Uncertainty in Artificial Intelligence, pages 272–281, 2013.
• [14] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
• [15] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Innovations in Theoretical Computer Science, 2013.
• [16] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
• [17] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
• [18] Michael Irwin Jordan. Learning in graphical models, volume 89. Springer Science & Business Media, 1998.
• [19] Frank R. Kschischang and Brendan J. Frey. Iterative decoding of compound codes by probability propagation in graphical models. IEEE Journal on Selected Areas in Communications, 16(2):219–230, 1998.
• [20] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, pages 609–616, 2009.
• [21] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366–375. ACM, 2005.
• [22] Ankur P. Parikh, Le Song, Mariya Ishteva, Gabi Teodoru, and Eric P. Xing. A spectral algorithm for latent junction trees. In Uncertainty in Artificial Intelligence, pages 675–684, 2012.
• [23] Ankur P. Parikh, Le Song, and Eric P. Xing. A spectral algorithm for latent tree graphical models. In International Conference on Machine Learning, pages 1065–1072, 2011.
• [24] Giorgio Parisi and Ramamurti Shankar. Statistical field theory, 1988.
• [25] Anastasia Podosinnikova, Francis Bach, and Simon Lacoste-Julien. Rethinking lda: moment matching for discrete ica. In Advances in Neural Information Processing Systems, pages 514–522, 2015.
• [26] Richard A Redner and Homer F Walker. Mixture densities, maximum likelihood and the em algorithm. SIAM review, 26(2):195–239, 1984.
• [27] Ruslan Salakhutdinov and Geoffrey E Hinton. Deep boltzmann machines. In International Conference on Artificial Intelligence and Statistics, volume 1, page 3, 2009.
• [28] Sajid M Siddiqi, Byron Boots, and Geoffrey J Gordon. Reduced-rank hidden markov models. In International Conference on Artificial Intelligence and Statistics, pages 741–748, 2010.
• [29] Le Song, Animashree Anandkumar, Bo Dai, and Bo Xie. Nonparametric estimation of multi-view latent variable models. In International Conference on Machine Learning, pages 640–648, 2014.
• [30] Le Song, Byron Boots, Sajid M Siddiqi, Geoffrey J Gordon, and Alex J Smola. Hilbert space embeddings of hidden markov models. In International Conference on Machine Learning, pages 991–998, 2010.
• [31] Le Song, Eric P Xing, and Ankur P Parikh. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems, pages 2708–2716, 2011.
• [32] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
• [33] Nicholas C Wormald. Models of random regular graphs. London Mathematical Society Lecture Note Series, pages 239–298, 1999.
• [34] Chicheng Zhang, Jimin Song, Kamalika Chaudhuri, and Kevin Chen. Spectral learning of large structured hmms for comparative epigenomics. In Advances in Neural Information Processing Systems, pages 469–477, 2015.
• [35] James Y Zou, Daniel J Hsu, David C Parkes, and Ryan P Adams. Contrastive learning using spectral methods. In Advances in Neural Information Processing Systems, pages 2238–2246, 2013.

## Appendix A Proof of Proposition 3.1

We use mathematical induction on where is defined in Definition 3.1. Before starting the proof, we define the equivalence class . Now, we start the proof by considering

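$$
\begin{aligned}
\sum_{i\in V\setminus S} P_{\beta,\gamma}(x)
&= \sum_{i\in V\setminus S:\, i\notin[\ell]}\ \sum_{i\in V\setminus S:\, i\in[\ell]} \frac{1}{Z}\exp\left(\sum_{(i,j)\in E}\beta_{ij}x_i x_j+\sum_{i\in V}\gamma_i x_i\right) \\
&= \sum_{i\in V\setminus S:\, i\notin[\ell]} \frac{1}{Z}\exp\left(\sum_{(i,j)\in E:\, i,j\notin[\ell]}\beta_{ij}x_i x_j+\sum_{i\in V\setminus[\ell]}\gamma_i x_i\right) \\
&\qquad\times \sum_{i\in V\setminus S:\, i\in[\ell]} \exp\left(\sum_{(i,j)\in E:\, i\in[\ell]}\beta_{ij}x_i x_j+\sum_{i\in[\ell]}\gamma_i x_i\right) \\
&= \sum_{i\in V\setminus S:\, i\notin[\ell]} \frac{1}{Z}\exp\left(\sum_{(i,j)\in E:\, i,j\notin[\ell]}\beta_{ij}x_i x_j+\sum_{i\in V\setminus[\ell]}\gamma_i x_i\right) f_{[\ell]}(x_{S_\ell})
\end{aligned}
$$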

where $f_{[\ell]}$ is some positive function. Since