It is widely recognized that incorporating latent or hidden variables is a crucial aspect of modeling. Latent variables can provide a succinct representation of the observed data through dimensionality reduction; the possibly many observed variables are summarized by fewer hidden effects. Further, they are central to predicting causal relationships and interpreting the hidden effects as unobservable concepts. For instance, in sociology, human behavior is affected by abstract notions such as social attitudes, beliefs, goals and plans. As another example, medical knowledge is organized into causal hierarchies of invading organisms, physical disorders, pathological states and symptoms, and only the symptoms are observed.
In addition to incorporating latent variables, it is also important to model the complex dependencies among the variables. A popular class of models for incorporating such dependencies is the class of Bayesian networks, also known as belief networks. They incorporate a set of causal and conditional independence relationships through directed acyclic graphs (DAGs). They have widespread applicability in artificial intelligence [41, 19, 42, 25], in the social sciences [13, 40, 64, 18, 51, 50], and as structural equation models in economics [12, 33, 65, 18, 51, 60].
An important statistical task is to learn such latent Bayesian networks from observed data. This involves discovery of the hidden variables, structure estimation (of the DAG) and estimation of the model parameters. Typically, in the presence of hidden variables, the learning task suffers from identifiability issues since there may be many models which can explain the observed data. In order to overcome indeterminacy issues, one must restrict the set of possible models. We establish novel criteria for identifiability of latent DAG models using only low order observed moments (second/third moments). We introduce a graphical constraint on the DAG, which we refer to as the expansion property. Roughly speaking, the expansion property states that every subset of hidden nodes has “enough” outgoing edges in the DAG, so they have a noticeable influence on the observed nodes, and thus on the samples drawn from the joint distribution of the observed nodes. This notion implies new identifiability and learning results for DAG structures.
Another popular class of latent variable models is the class of probabilistic topic models. In topic models, the latent variables correspond to the topics in a document which generate the (observed) words. Perhaps the most widely employed topic model is latent Dirichlet allocation (LDA), which posits that the hidden topics are drawn from a Dirichlet distribution. Recent approaches have established that the LDA model can be learned efficiently using low-order (second and third) moments, using spectral techniques [4, 5].
The LDA model, however, cannot incorporate arbitrary correlations among the latent topics (LDA models incorporate only “weak” correlations among topics, since the Dirichlet distribution can be expressed as a set of independently distributed Gamma random variables, normalized by their sum: if $y_i \sim \Gamma(\alpha_i, 1)$ independently, we have $(y_1/\sum_i y_i, \ldots, y_k/\sum_i y_i) \sim \mathrm{Dir}(\alpha)$), and various correlated topic models have demonstrated superior empirical performance compared to LDA, e.g. [15, 45]. However, learning correlated topic models is challenging, and further constraints need to be imposed to establish identifiability and provable learning.
A typical (exchangeable) topic model is parameterized by the topic-word matrix, i.e., the conditional distributions of the words given the topics, and the latent topic distribution, which determines the mixture of topics in a document. In this paper, we allow for arbitrary (non-degenerate) latent topic distributions, but impose expansion constraints on the topic-word matrix. In other words, the word supports of different topics are not “too similar”, which is a reasonable assumption. Thus, we establish expansion as a unifying criterion for guaranteed learning of both latent Bayesian networks and topic models.
1.1 Summary of contributions
We establish identifiability for different classes of topic models and latent Bayesian networks, and more generally, for linear latent models, and also propose efficient algorithms for the learning task.
1.1.1 Learning Topic Models
Learning under expansion conditions.
We adopt a moment-based approach to learning topic models, and specifically, employ second-order observed moments, which can be efficiently estimated using a small number of samples. We establish identifiability of the topic models for arbitrary (non-degenerate) topic mixture distributions, under assumptions on the topic-word matrix. The support of the topic-word matrix is a bipartite graph which relates the topics to words. We impose a weak (additive) expansion constraint on this bipartite graph. Specifically, let $A \in \mathbb{R}^{n \times k}$ denote the topic-word matrix, and for any subset of topics $S \subseteq [k]$ (i.e., a subset of columns of $A$), let $N(S)$ denote the set of neighboring words, i.e., the set of words on which the topics in $S$ are supported. We require that
$$|N(S)| \ge |S| + d_{\max}, \quad \text{for all } S \subseteq [k],\ |S| \ge 2, \tag{1}$$
where $d_{\max}$ is the maximum degree of any topic. Intuitively, our expansion property states that every subset of topics generates a sufficient number of words. We establish that under the expansion condition in (1), for generic parameters (for the non-zero entries of $A$; the precise definition of parameter genericity is given in Condition 3), the columns of $A$ are the sparsest vectors in the column span, and are therefore identifiable.
Note that, for all subsets of topics $S$, the condition $|N(S)| \ge |S|$ is necessary for non-degeneracy of the topic-word matrix $A$, and therefore, for identifiability of the topic model from second order observed moments. This implies that our sufficient condition in (1) is close to the necessary condition for identifiability of sparse models, where the maximum degree $d_{\max}$ of any topic is small. Thus, we prove identifiability of topic models under nearly tight expansion conditions on the topic-word matrix. Since the columns of $A$ are the sparsest vectors in the column span under (1), this also implies recovery of $A$ through exhaustive search. In addition, we establish that the topic-word matrix can be learned efficiently through $\ell_1$ optimization, under some (stronger) conditions on the non-zero entries of the topic-word matrix, in addition to the expansion condition in (1). We call our algorithm TWMLearn as it learns the topic-word matrix.
Bayesian networks to model topic mixtures.
The above framework does not impose any parametric assumption on the distribution of the topic mixture $h$ (other than non-degeneracy), and employs second-order observed moments to learn the topic-word matrix $A$ and the second-order moments of $h$. If $h$ obeys a multivariate Gaussian distribution, then this completely characterizes the topic model. However, for general topic mixtures, this is not sufficient to characterize the distribution of $h$, and further assumptions need to be imposed. A natural framework for modeling topic dependencies is via Bayesian networks. Moreover, incorporating Bayesian networks for topic modeling also leads to efficient approximate inference through belief propagation and its variants, which have shown good empirical performance on sparse graphs.
We consider the case where the latent topics can be modeled by a linear Bayesian network, and establish that such networks can be learned efficiently using second and third order observed moments through a combination of optimization and spectral techniques. The proposed algorithm is called TMLearn as it learns (correlated) topic models.
1.1.2 Learning (Single-View) Latent Linear Bayesian Networks
The above techniques for learning topic models are also applicable for learning latent linear models, which include the linear Bayesian networks discussed in the introduction. This is because our method relies on the presence of a linear map from hidden to observed variables. In the case of topic models, the topic-word matrix represents the linear map, while for linear Bayesian networks, the (weighted) DAG from hidden to observed variables is the linear map. Linear latent models are prevalent in a number of applications such as blind deconvolution of sound and images. The popular independent component analysis (ICA) is a special case of our framework, where the sources (i.e., the hidden variables) are assumed to be independent. In contrast, we allow for general latent distributions, and impose expansion conditions on the linear map from hidden to observed variables.
One key difference between topic models and other linear models (including linear Bayesian networks) is that topic models are multi-view (i.e., have multiple words in the same document), while, for general linear models, multiple views may not be available. We require additional assumptions to provide recovery in the single-view setting. We prove recovery under certain rank conditions: we require that $n \ge 3k$, where $n$ is the dimension of the observed random vector and $k$ is the dimension of the latent vector, and the existence of a partition into three sets each with full column rank. Under these conditions, we propose simple matrix decomposition techniques to first “de-noise” the observed moments. These denoised moments are of the same form as the moments obtained from a topic model and thus, the techniques described for learning topic models can be applied to the denoised moments. Thus, we provide a general framework for guaranteed learning of linear latent models under expansion conditions.
Hierarchical topic models. An important application of these techniques is in learning hierarchical linear models, where the developed method can be applied recursively, and the estimated second order moment of each layer can be employed to further learn the deeper layers. See Fig. 1(a) for an illustration.
Examples of graphs which can be learned.
It is useful to consider some concrete examples which satisfy the expansion property in (1):
Full $d$-regular trees. These are tree structures in which every node other than the leaves has $d$ children. These are included in the ensemble of hierarchical models. We see that for $d \ge 2$, the model satisfies the expansion condition (1), but we require $d \ge 3$ to satisfy the rank condition. See Fig. 2(a) for an illustration of a full ternary tree with latent variables.
Caterpillar trees. These are tree structures in which all the leaves are within distance one of a central path. See Fig. 2(b) for an illustration. These structures have effective depth one. Let $d_{\max}$ and $d_{\min}$ respectively denote the maximum and the minimum number of leaves connected to a fixed node on the central path. It is immediate that if $d_{\min} \ge d_{\max}/2 + 1$, the structure has the expansion property in (1).
Random bipartite graphs.
Consider bipartite graphs with hidden nodes in one part and observed nodes in the other part. Each edge (between the two parts) is included in the graph with probability, independent from every other edge. It is easy to see that, for any set , the expected number of its neighbors is : . Also, the expected degree of the hidden nodes is . Now, by applying a Chernoff bound, one can show that these graphs have the expansion property with high probability, if , i.e., with probability converging to one as .
1.2 Our techniques
Our proof techniques rely on ideas and tools developed in dictionary learning, spectral techniques, and matrix decomposition. We briefly explain our techniques and their relationships to these areas.
Dictionary learning and optimization.
We cast the topic models as linear exchangeable multiview models in Section 2.2 and demonstrate that the second order (cross) moment between any two words $x_i, x_j$ ($i \ne j$) satisfies
$$\mathbb{E}[x_i x_j^\top] = A\, \mathbb{E}[h h^\top] A^\top, \tag{2}$$
where $A \in \mathbb{R}^{n \times k}$ is the topic-word matrix, $n$ is the vocabulary size, $k$ is the number of topics, and $h$ is the topic mixture. Thus, the problem of learning topic models using second order moments reduces to finding the matrix $A$, given $A\, \mathbb{E}[h h^\top] A^\top$.
Indeed, further conditions need to be imposed for identifiability of $A$ from $A\, \mathbb{E}[h h^\top] A^\top$. A natural non-degeneracy constraint is that the correlation matrix of the hidden topics $\mathbb{E}[h h^\top]$ be full rank, so that $\mathrm{Col}(A\, \mathbb{E}[h h^\top] A^\top) = \mathrm{Col}(A)$, where $\mathrm{Col}(\cdot)$ denotes the column span. Under the expansion condition in (1), for generic parameters, we establish that the columns of $A$ are the sparsest vectors in $\mathrm{Col}(A)$, and are thus identifiable. To prove this claim, we leverage ideas from the work of Spielman et al., where the problem of sparsely used dictionaries is considered under probabilistic assumptions. In addition, we develop novel techniques to establish a non-probabilistic counterpart of their result. A key ingredient in our proof is establishing that submatrices of the topic-word matrix, corresponding to any subset of columns and their neighboring rows, satisfy a certain null-space property under generic parameters and the expansion condition in (1).
The above identifiability result implies recovery of the topic-word matrix $A$ through exhaustive search for sparse vectors in $\mathrm{Col}(A)$. Instead, we propose an efficient method to recover the columns of $A$ through $\ell_1$ optimization. We prove that the $\ell_1$ method recovers the matrix $A$, under the expansion condition in (1), and some additional conditions on the non-zero entries of $A$.
Spectral techniques for learning latent Bayesian networks.
When the topic distribution is modeled via a linear Bayesian network, we exploit additional structure in the observed moments to learn the relationships among the topics, in addition to the topic-word matrix. Specifically, we assume that the topic variables obey the following linear equations:
$$h_j = \sum_{i \in \mathrm{PA}_j} \lambda_{ji} h_i + \eta_j, \quad \text{for } j \in [k], \tag{3}$$
where $\mathrm{PA}_j$ denotes the parents of node $j$ in the directed acyclic graph (DAG) corresponding to the Bayesian network. Here, we assume that the noise variables $\eta_j$ are non-Gaussian (e.g., they have non-zero third moment or excess kurtosis), and are independent. We employ the $\ell_1$ optimization framework discussed in the previous paragraph, and in addition, leverage spectral methods for learning using second and third order observed moments.
We first establish that the model in (3) reduces to independent component analysis (ICA), where the latent variables are independent components, and this problem can be solved via spectral approaches. Specifically, denote $\Lambda = [\lambda_{ij}]$, where $\Lambda$ encodes the dependencies between different hidden topics in (3). Solving for the hidden topics $h$, we have $h = (I - \Lambda)^{-1} \eta$, where $\eta$ denotes the independent noise variables in (3). Thus, the latent Bayesian network in (3) reduces to an ICA model, where the $\eta_j$ are the independent latent components, and the linear map from hidden to observed variables is given by $A (I - \Lambda)^{-1}$, where $A$ is the original topic-word matrix. We then apply spectral techniques, termed excess correlation analysis (ECA), to learn $A (I - \Lambda)^{-1}$ from the second and third order moments of the observed variables. ECA is based on two singular value decompositions: the first SVD whitens the data (using the second moment) and the second SVD uses the third moment to find directions which exhibit information that is not captured by the second moment. Finally, in order to recover $A$ from $A (I - \Lambda)^{-1}$, we exploit the expansion property in (1), and extract $A$ as described previously through $\ell_1$ optimization. The high-level idea is depicted in Fig. 3.
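The reduction from a linear Bayesian network to an ICA-type model, $h = (I - \Lambda)^{-1} \eta$, can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the three-topic DAG, the weights in `Lam` and the noise realization `eta` are hypothetical, and the topics are assumed to be listed in topological order so that $\Lambda$ is strictly lower triangular (hence nilpotent, giving $(I - \Lambda)^{-1} = I + \Lambda + \cdots + \Lambda^{k-1}$).

```python
def matmul(X, Y):
    """Plain-Python product of two matrices stored as nested lists."""
    return [[sum(X[i][l] * Y[l][j] for l in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(X, v):
    """Matrix-vector product."""
    return [sum(X[i][j] * v[j] for j in range(len(v))) for i in range(len(X))]

# Hypothetical weights: Lam[j][i] is the weight of parent i in the equation for h_j.
# Edges: h0 -> h1, h0 -> h2, h1 -> h2 (topological order 0, 1, 2).
Lam = [[0.0, 0.0, 0.0],
       [0.5, 0.0, 0.0],
       [0.25, 2.0, 0.0]]
eta = [1.0, -2.0, 3.0]  # one hypothetical realization of the independent noise
k = len(eta)

# Route 1: evaluate the structural equations node by node.
h = []
for j in range(k):
    h.append(eta[j] + sum(Lam[j][i] * h[i] for i in range(j)))

# Route 2: closed form h = (I - Lam)^{-1} eta via the finite Neumann series.
identity = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
inverse = [row[:] for row in identity]
power = [row[:] for row in identity]
for _ in range(k - 1):
    power = matmul(power, Lam)
    inverse = [[inverse[i][j] + power[i][j] for j in range(k)] for i in range(k)]
h_closed = matvec(inverse, eta)

print(h, h_closed)
```

Both routes give the same vector, which is the sense in which the independent noise variables play the role of ICA sources and the observed variables depend on them through $A (I - \Lambda)^{-1}$.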
Matrix decomposition into diagonal and low-rank parts for general linear models.
Our framework for learning topic models casts them as linear multiview models, where the words represent the multiple views of the hidden topic mixture $h$, and the conditional expectation of each word given the topic mixture is a linear map of $h$. We extend our results to learning general linear models, where such multiple views may not be available. Specifically, we consider
$$x = A h + \epsilon,$$
where the noise variables $\epsilon_i$ are uncorrelated and are independent of the hidden variables $h$. In this case, the second order moment satisfies
$$\mathbb{E}[x x^\top] = A\, \mathbb{E}[h h^\top] A^\top + \mathbb{E}[\epsilon \epsilon^\top],$$
and has another noise component $\mathbb{E}[\epsilon \epsilon^\top]$, when compared to the second-order (cross) moment for topic models in (2). Note that the rank of $A\, \mathbb{E}[h h^\top] A^\top$ is $k$ (under non-degeneracy conditions), where $k$ is the number of topics. Thus, when $k$ is sufficiently small compared to $n$, we can view $\mathbb{E}[x x^\top]$ as the sum of a low-rank matrix and a diagonal one.
We prove that under the rank condition that $n \ge 3k$ (and the existence of a partition of three sets of columns of $A^\top$ such that each set has full column rank), $\mathbb{E}[x x^\top]$ can be decomposed into its low-rank component $A\, \mathbb{E}[h h^\top] A^\top$ and its diagonal component $\mathbb{E}[\epsilon \epsilon^\top]$. (It should be noted that other matrix decomposition methods have been considered previously [22, 36, 56]. Using these techniques, we can relax the rank requirement in Condition 5, but only by imposing stronger incoherence conditions on the low-rank component.) Thus, we employ matrix decomposition techniques to “de-noise” the second order moment and recover $A\, \mathbb{E}[h h^\top] A^\top$ from $\mathbb{E}[x x^\top]$. From here on, we can apply the techniques described previously to recover $A$ through $\ell_1$ optimization. Thus, we develop novel techniques for learning general latent linear models under expansion conditions.
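To make the idea of separating a diagonal noise term from a low-rank term concrete, the sketch below treats only the simplest case of one hidden variable ($k = 1$) and three observed variables ($n = 3$). It uses a classical rank-one identity (off-diagonal entries are untouched by diagonal noise, and $L_{ii} = M_{ij} M_{il} / M_{jl}$ for a rank-one $L$), which is not the general decomposition procedure of this paper. All numbers are hypothetical, and the entries of `a` are assumed non-zero.

```python
# Hypothetical single-column linear map (k = 1), latent second moment, and noise variances.
a = [2.0, -1.0, 0.5]
second_moment_h = 1.5
noise_var = [0.3, 0.7, 0.1]
n = len(a)

# Observed second moment M = a E[h^2] a^T + diag(noise_var).
M = [[a[i] * a[j] * second_moment_h + (noise_var[i] if i == j else 0.0)
      for j in range(n)] for i in range(n)]

def low_rank_diagonal_entry(M, i):
    """Diagonal entry of the rank-one part, using only off-diagonal entries of M."""
    j, l = [t for t in range(3) if t != i]
    return M[i][j] * M[i][l] / M[j][l]

low_rank_diag = [low_rank_diagonal_entry(M, i) for i in range(n)]
recovered_noise = [M[i][i] - low_rank_diag[i] for i in range(n)]

print(low_rank_diag)
print(recovered_noise)
```

In this toy case the recovered noise variances agree with `noise_var` up to floating-point error, so subtracting them from the diagonal yields a "de-noised" moment of the same form as the topic-model moment in (2).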
Our presentation focuses on using exact (population) observed moments to emphasize the correctness of the methodology. However, “plug-in” moment estimates can be used with sampled data. To partially address the statistical efficiency of our method, note that higher-order empirical moments generally have higher variance than lower-order empirical moments, and therefore are more difficult to reliably estimate. Our techniques only involve low-order moments (up to third order). A precise analysis of sample complexity involves standard techniques for dealing with sums of i.i.d. random matrices and tensors, and is left for future study. See Section 6 for the performance of our proposed algorithms with a finite number of samples.
1.3 Related work
Probabilistic topic models have received widespread attention in recent years. However, until recently, most learning approaches did not have provable guarantees, and in practice Gibbs sampling or variational Bayes methods are used. Below, we provide an overview of learning approaches with theoretical guarantees.
Learning topic models through moment-based approaches.
A series of recent works aim to learn topic models using low order moments (second and third) under parametric assumptions on the topic distribution, e.g. the single-topic model (each document consists of a single topic), latent Dirichlet allocation (LDA), independent component analysis (ICA) (the different components of $h$, i.e., $h_1, \ldots, h_k$, are independent), and so on. A general framework based on tensor decomposition has been given for a wide range of latent variable models, including LDA and single topic models, Gaussian mixtures, hidden Markov models (HMM), and so on. These approaches do not impose any constraints on the topic-word matrix $A$ (other than non-degeneracy). In contrast, in this paper, we impose constraints on $A$, and allow for any general topic distribution. Furthermore, we specialize the results to parametric settings where the topic distribution is a Bayesian network, and for this sub-class, we use ideas from the method of moments (in particular, excess correlation analysis (ECA)) in conjunction with ideas from sparse dictionary learning.
Learning topic models through non-negative matrix factorization.
Another series of recent works by Arora et al. [10, 9] employs a similar philosophy to this paper: they allow for general topic distributions, while constraining the topic-word matrix $A$. They employ approaches based on non-negative matrix factorization (NMF), and exploit the fact that $A$ is non-negative (recall that $A$ corresponds to conditional distributions). The approach and the assumptions are quite different from this work. They establish guaranteed learning under the assumption that every topic has an anchor word, i.e. the word is uniquely generated from the topic, and does not occur under any other topic (with reasonable probability). Note that the presence of anchor words implies the expansion constraint $|N(S)| \ge |S|$ for all subsets $S$ of topics, where $N(S)$ is the set of neighboring words for topics in $S$. In contrast, our requirement for guaranteed learning is $|N(S)| \ge |S| + d_{\max}$, where $d_{\max}$ is the maximum degree of any topic. Thus our requirement is comparable to $|N(S)| \ge |S|$ when $d_{\max}$ is small, and our approach does not require the presence of anchor words. Additionally, our approach does not assume that the topic-word matrix $A$ is non-negative, which makes it applicable to more general linear models, e.g. when the variables are not discrete and the matrix $A$ corresponds to a general mixing matrix (note that for discrete variables, $A$ corresponds to a conditional distribution and is thus non-negative).
As discussed in Section 1.2, we use some of the ideas developed in the context of the sparsely used dictionary learning problem. The problem setup there is that one is given a matrix $Y$ and is asked to find a pair of matrices $A$ and $X$ so that $\|Y - AX\|$ is small and also $X$ is sparse. Here, $A$ is considered as the dictionary being used. Spielman et al. study this problem assuming that $A$ is a full rank square matrix and the observation is noiseless, i.e., $Y = AX$. In this scenario, the problem can be viewed as learning a matrix $X$ from its row space, knowing that $X$ enjoys some sparsity structure. Stating the problem this way clearly describes the relation to our work, as we also need to recover the topic-word matrix $A$ from its second-order moments $A\, \mathbb{E}[h h^\top] A^\top$, as explained in Section 1.2.
The results of Spielman et al. are obtained assuming that the entries of $X$ are drawn i.i.d. from a Bernoulli-Gaussian distribution. The idea is then to seek the rows of $X$ sequentially, by looking for the sparse vectors in $\mathrm{Row}(Y)$. Leveraging similar ideas, we obtain a non-probabilistic counterpart of the results, i.e., without assuming any parametric distribution on the topic-word matrix. These conditions turn out to be intuitive expansion conditions on the support of the topic-word matrix, assuming generic parameters. Our technical arguments to arrive at these results are different from the ones employed there, since we do not assume any parametric distribution, and the application to learning topic models is novel. Moreover, it can be shown that the probabilistic models considered in that work satisfy the expansion property (1) almost surely, and are thus special cases under our framework. Variants of the sparse dictionary learning problem have also been proposed [66, 32].
Linear structural equations.
In general, structural equation modeling (SEM) is defined by a collection of equations $x_i = f_i(x_{\mathrm{PA}_i}, \epsilon_i)$, where the $x_i$'s are the variables associated to the nodes. Recently, there has been some progress on the identifiability problem of SEMs in the fully observed linear models [57, 35, 53, 52]. More specifically, it has been shown that for linear functions and non-Gaussian noise, the underlying graph is identifiable. Moreover, if one restricts the functions to be additive in the noise term and excludes the linear Gaussian case (as well as a few other pathological function-noise combinations), the graph structure is identifiable [35, 53]. Peters et al. consider Gaussian SEMs with linear functions and normally distributed noise variables with the same variances, and show that the graph structure and the functions are identifiable. However, none of these works deal with latent variables, or address the issue of efficiently learning the models. In contrast, our work here can be viewed as a contribution to the problem of identifiability and learning of linear SEMs with latent variables.
Learning Bayesian networks and undirected graphical models.
The problem of identifiability and learning graphical models from distributions has been the object of intensive investigation in the past years and has been studied in different research communities. This problem has proved important in a vast number of applications, such as computational biology [29, 55], economics [12, 33, 65, 18], sociology [13, 40, 64, 18], and computer vision [42, 25]. The learning task has two main ingredients: structure learning and parameter estimation.
Structure estimation of probabilistic graphical models has been extensively studied in recent years. It is well known that maximum likelihood estimation in fully observed tree models is tractable. However, for general models, maximum likelihood structure learning is NP-hard even when there are no hidden variables. The main approaches for structure estimation are score-based methods, local tests and convex relaxation methods. Score-based methods find the graph structure by optimizing a score (e.g., the Bayesian Information Criterion) in a greedy manner. Local test approaches attempt to build the graph based on local statistical tests on the samples, both for directed and undirected graphical models [61, 1, 20, 7, 38, 34]. Convex relaxation approaches have also been considered for structure estimation (e.g., [46, 54]).
In the presence of latent variables, structure learning becomes more challenging. A popular class of latent variable models is latent trees, for which efficient algorithms have been developed [30, 27, 24, 3]. Recently, approaches have been proposed for learning (undirected) latent graphical models with long cycles in certain parameter regimes. Latent Gaussian graphical models have also been estimated using convex relaxation approaches. Other work studies linear latent DAG models and proposes methods to (1) find clusters of observed nodes that are separated by a single latent common cause; and (2) find features of the Markov equivalence class of causal models for the latent variables; that model allows for undirected edges between the observed nodes. The equivalence class of DAG models has also been characterized when there are latent variables; however, the focus there is on constructing an equivalence class of DAG models, given a member of the class. In contrast, we focus on developing efficient learning methods for latent Bayesian networks based on spectral techniques in conjunction with $\ell_1$ optimization.
2 Model and sufficient conditions for identifiability
We write $\|v\|_p$ for the standard $\ell_p$ norm of a vector $v$. Specifically, $\|v\|_0$ denotes the number of non-zero entries in $v$. Also, $\|M\|_p$ refers to the induced operator norm on a matrix $M$. For a matrix $M$ and sets of indices $I, J$, we let $M_I$ denote the submatrix containing just the rows in $I$ and $M_{I,J}$ denote the submatrix formed by the rows in $I$ and columns in $J$. For a vector $v$, $\mathrm{Supp}(v)$ represents the positions of non-zero entries of $v$. We use $e_i$ to refer to the $i$-th standard basis element, e.g., $e_1 = (1, 0, \ldots, 0)$. For a matrix $M$ we let $\mathrm{Row}(M)$ (similarly $\mathrm{Col}(M)$) denote the span of its rows (columns). For a set $S$, $|S|$ is its cardinality. We use the notation $[n]$ to denote the set $\{1, \ldots, n\}$. For a vector $v$, $\mathrm{diag}(v)$ is a diagonal matrix with the elements of $v$ on the diagonal. For a matrix $M$, $\mathrm{diag}(M)$ is a diagonal matrix with the same diagonal as $M$. Throughout, $\otimes$ denotes the tensor product.
2.1 Overview of topic models
Consider the bag-of-words model for documents in which the sequence of observed words $x_1, x_2, \ldots$ in the document is exchangeable, i.e., the joint probability distribution is invariant to permutation of the indices. The well-known De Finetti’s theorem implies that such exchangeable models can be viewed as mixture models in which there is a latent variable $h$ such that $x_1, x_2, \ldots$ are conditionally i.i.d. given $h$ and the conditional distributions are identical at all the nodes. See Fig. 4 for an illustration.
In the context of document modeling, the latent variable $h$ can be interpreted as a distribution over the topics occurring in a document. If the total number of topics is $k$, then the distribution of $h$ can be viewed as a distribution over the simplex $\Delta^{k-1}$. The word generation process is thus a hierarchical process: for each document, a realization of $h$ is drawn and it represents the proportion of topics in the document, and for each word, first a topic is drawn from the topic mixture, and then the word is drawn given the topic.
Let $A = [a_{ij}] \in \mathbb{R}^{n \times k}$ denote the topic-word matrix, where $a_{ij}$ denotes the conditional probability of word $i$ occurring given that topic $j$ was drawn. It is convenient to represent the words in the document by $n$-dimensional random vectors $x_i \in \mathbb{R}^n$. Specifically, we set
$$x_i = e_j \quad \text{if and only if the } i\text{-th word in the document is } j,$$
where $e_1, \ldots, e_n$ is the standard coordinate basis for $\mathbb{R}^n$.
The above encoding allows for a convenient representation of topic models as linear models:
$$\mathbb{E}[x_i \mid h] = A h, \tag{4}$$
and moreover the second order cross-moments (between two different words) have a simple form:
$$\mathbb{E}[x_i x_j^\top] = A\, \mathbb{E}[h h^\top] A^\top, \quad i \ne j. \tag{5}$$
Thus, the above representation allows us to view topic models as linear models. Moreover, it allows us to incorporate other linear models, i.e. when the $x_i$ are not basis vectors. For instance, the independent components model is a popular framework, and can be viewed as a set of linear structural equations with latent variables. See Section 5 for a detailed discussion.
Thus, the learning task using second-order (exact) moments in (5) reduces to recovering $A$ from $\mathbb{E}[x_1 x_2^\top]$, or equivalently from $A\, \mathbb{E}[h h^\top] A^\top$.
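The cross-moment identity $\mathbb{E}[x_1 x_2^\top] = A\, \mathbb{E}[h h^\top] A^\top$ can be verified by direct enumeration on a small example. The following sketch assumes a hypothetical topic-word matrix with $n = 4$ words and $k = 2$ topics, and a hypothetical topic-mixture distribution supported on three points of the simplex; none of these values come from the paper.

```python
def matmul(X, Y):
    """Plain-Python product of two matrices stored as nested lists."""
    return [[sum(X[i][l] * Y[l][j] for l in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

# Hypothetical topic-word matrix: column t is the word distribution of topic t.
A = [[0.5, 0.0],
     [0.5, 0.0],
     [0.0, 0.7],
     [0.0, 0.3]]

# Hypothetical topic-mixture distribution: (probability, h) pairs with h on the simplex.
mixture = [(0.5, [0.9, 0.1]), (0.3, [0.2, 0.8]), (0.2, [0.5, 0.5])]

n, k = len(A), len(A[0])

# Left-hand side: E[x1 x2^T], enumerating h and two conditionally i.i.d. one-hot words.
pair_moment = [[0.0] * n for _ in range(n)]
for prob, h in mixture:
    # P(word = i | h) = (A h)_i under the one-hot encoding.
    word_dist = [sum(A[i][t] * h[t] for t in range(k)) for i in range(n)]
    for i in range(n):
        for j in range(n):
            pair_moment[i][j] += prob * word_dist[i] * word_dist[j]

# Right-hand side: A E[h h^T] A^T.
second_moment_h = [[sum(p * h[a] * h[b] for p, h in mixture) for b in range(k)]
                   for a in range(k)]
low_rank = matmul(matmul(A, second_moment_h), transpose(A))

max_gap = max(abs(pair_moment[i][j] - low_rank[i][j])
              for i in range(n) for j in range(n))
print(max_gap)
```

The gap is at the level of floating-point round-off. The identity relies on the two words being distinct positions in the document, since $\mathbb{E}[x_1 x_1^\top]$ is diagonal under the one-hot encoding.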
2.2 Sufficient conditions for identifiability
We first start with some natural non-degeneracy conditions.
Condition 1 (Non-degeneracy).
The topic-word matrix $A \in \mathbb{R}^{n \times k}$ has full column rank and the hidden variables are linearly independent, i.e., with probability one, if $\sum_{i \in [k]} \alpha_i h_i = 0$, then $\alpha_i = 0$, for all $i \in [k]$.
We note that without such non-degeneracy assumptions, there is no hope of distinguishing different hidden nodes.
We now describe sufficient conditions under which the topic model becomes identifiable using second order observed moments. Given word observations $x_1, x_2, \ldots$, note that we can only hope to identify the columns of the topic-word matrix $A$ up to permutation because the model is unchanged if one permutes the hidden variable $h$ and the columns of $A$ correspondingly. Moreover, the scale of each column of $A$ is also not identifiable. To see this, observe that Eq. (5) is unaltered if we rescale all the coefficients in a column $a_j$ of $A$ and inversely rescale the variable $h_j$. Without further assumptions, we can only hope to recover a certain canonical form of $A$, defined as follows:
We say $A$ is in a canonical form if all of its columns have unit norm. In particular, the transformation $a_j \leftarrow a_j / \|a_j\|$ and the corresponding rescaling of $h$ place $A$ in canonical form, and the distribution over the observations $x_i$ is unchanged.
Furthermore, observe that the canonical $A$ is only specified up to the sign of each column, since a sign change of a column does not alter its norm.
Thus, under the above non-degeneracy and scaling conditions, the task of recovering $A$ from second-order (exact) moments in (5) reduces to recovering $A$ from $\mathrm{Col}(A)$. Recall that our criterion for identifiability is that the sparsest vectors in $\mathrm{Col}(A)$ correspond to the columns of $A$. We now provide sufficient conditions for this to occur, in terms of structural conditions on the support of $A$, and parameter conditions on the non-zero entries of $A$.
For structural conditions on the topic-word matrix , we proceed by defining the expansion property of a graph which plays a key role in establishing our identifiability results.
Condition 2 (Graph expansion).
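Let $H(V_{\mathrm{hid}}, V_{\mathrm{obs}}; A)$ denote the bipartite graph formed by the support of $A$: $H(i, j) = 1$ when $A_{ij} \ne 0$, and $0$ otherwise, and $|V_{\mathrm{hid}}| = k$, $|V_{\mathrm{obs}}| = n$. We assume that $H$ satisfies the following expansion property:
$$|N(S)| \ge |S| + d_{\max}, \quad \text{for all } S \subseteq V_{\mathrm{hid}},\ |S| \ge 2, \tag{6}$$
where $N(S)$ is the set of the neighbors of $S$ and $d_{\max}$ is the maximum degree of nodes in $V_{\mathrm{hid}}$.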
Note that the condition $|N(S)| \ge |S|$, for all subsets of hidden nodes $S \subseteq V_{\mathrm{hid}}$, is necessary for the matrix $A$ to be full column rank. We observe that the above sufficient condition in (6) has an additional degree term $d_{\max}$, and is thus close to the necessary condition when $d_{\max}$ is small. Moreover, the above condition in (6) is only a weak additive expansion, in contrast to multiplicative expansion, which is typically required for various properties to hold.
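For small graphs, the additive expansion property can be checked by brute force. The sketch below tests $|N(S)| \ge |S| + d_{\max}$ over every subset $S$ of hidden nodes with $|S| \ge 2$ (the restriction to $|S| \ge 2$ is an assumption made here, since a single maximum-degree node can never satisfy the inequality). The function names and the two example supports are hypothetical, and the enumeration is exponential in the number of topics, so this is only meant for toy instances.

```python
from itertools import combinations

def neighbors(support, topics):
    """Union of the word indices adjacent to any topic in `topics`."""
    words = set()
    for t in topics:
        words |= support[t]
    return words

def satisfies_expansion(support):
    """Check |N(S)| >= |S| + d_max for every topic subset S with |S| >= 2.

    `support` maps each topic index to the set of words it is supported on.
    """
    topics = sorted(support)
    d_max = max(len(words) for words in support.values())
    for size in range(2, len(topics) + 1):
        for subset in combinations(topics, size):
            if len(neighbors(support, subset)) < size + d_max:
                return False
    return True

# One layer of a full 3-regular tree: each topic has 3 children of its own.
tree_layer = {0: {0, 1, 2}, 1: {3, 4, 5}, 2: {6, 7, 8}}
# Two topics with identical word support: no expansion.
overlapping = {0: {0, 1, 2}, 1: {0, 1, 2}}

print(satisfies_expansion(tree_layer))
print(satisfies_expansion(overlapping))
```

The tree layer passes because disjoint supports make $|N(S)|$ grow by $d$ per topic, while the overlapping example fails because its two topics share all of their words.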
The last condition is a generic assumption on the entries of matrix . We first define the parameter genericity property for a matrix.
Condition 3 (Parameter genericity).
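We assume that the topic-word matrix $A$ has the following parameter genericity property: for any $v \in \mathbb{R}^k$ with $\|v\|_0 \ge 2$, the following holds true:
$$\|A v\|_0 > |N_A(\mathrm{Supp}(v))| - |\mathrm{Supp}(v)|, \tag{7}$$
where for a set $S \subseteq [k]$, $N_A(S) := \{ i \in [n] : A_{ij} \ne 0 \text{ for some } j \in S \}$.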
This is a mild generic condition. More specifically if the entries of any arbitrary fixed matrix are perturbed independently, then it satisfies the above generic property with probability one.
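Fix any matrix $M \in \mathbb{R}^{n \times k}$. Let $Z \in \mathbb{R}^{n \times k}$ be a random matrix such that the entries $\{Z_{ij} : M_{ij} \ne 0\}$ are independent random variables, and $Z_{ij} = 0$ whenever $M_{ij} = 0$. Assume each variable is drawn from a distribution with uncountable support. Then $Z$ satisfies the parameter genericity property of Condition 3 with probability one.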
3 Identifiability result and Algorithm
In this section, we state our identifiability results and algorithms for learning the topic models under expansion conditions.
Theorem 3.1 (Identifiability of the Topic-Word Matrix).
Theorem 3.1 is proved in Section A.1. As shown in the proof, columns of are in fact the sparsest vectors in the space . This result already implies identifiability of via an exhaustive search, which is an interesting result in its own right. The following theorem provides some conditions under which the columns of can be identified by solving a set of convex optimization problems. Before stating the theorem, we need to establish some notation.
For , we define and . Similarly, for , define and . Thus, for a node (either a topic or a word), is the set of its neighbors and represents the set of nodes with distance exactly two from . Therefore, if is a word node, is the set of its siblings and if is a topic node, is the set of topics with a common child. We further use superscript to denote the set complement.
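The neighborhood notation can be illustrated with a short sketch that reads both sets off the support of a coefficient matrix. It assumes rows index words and columns index topics; the helper names `neighbor_sets` and `distance_two` are hypothetical and not part of the paper.

```python
import numpy as np


def neighbor_sets(A):
    """Return (word -> adjacent topics, topic -> adjacent words) from the support of A."""
    support = np.asarray(A) != 0
    n_words, n_topics = support.shape
    word_nb = {i: set(np.flatnonzero(support[i]).tolist()) for i in range(n_words)}
    topic_nb = {j: set(np.flatnonzero(support[:, j]).tolist()) for j in range(n_topics)}
    return word_nb, topic_nb


def distance_two(node, own_nb, other_nb):
    """Nodes on the same side of the bipartite graph at distance exactly two from node."""
    reached = set()
    for middle in own_nb[node]:
        reached |= other_nb[middle]
    reached.discard(node)
    return reached


# Word 0 uses topic 0, word 1 uses both topics, words 2 and 3 use topic 1.
A = np.array([[1.0, 0.0],
              [0.5, 2.0],
              [0.0, 1.0],
              [0.0, 3.0]])
word_nb, topic_nb = neighbor_sets(A)

siblings_of_word_0 = distance_two(0, word_nb, topic_nb)   # words sharing a topic with word 0
co_parents_of_topic_0 = distance_two(0, topic_nb, word_nb)  # topics sharing a word with topic 0
print(siblings_of_word_0, co_parents_of_topic_0)
```

Here word 0 has a single sibling (word 1), and topic 0 shares a child with topic 1 through word 1.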
Theorem 3.2 (Recovery of the Topic-Word Matrix through ℓ1-minimization).
Suppose that in each row of , there is a gap between the maximum and the second maximum absolute values. For , let be a permutation such that , and , for some . Further suppose that . In words, each column contains at least one entry that has the maximum absolute value in its row. If the following conditions hold true for , then TWMLearn returns the columns of in canonical form.
for all non-zero vectors .
for all and all non-zero vectors .
Theorem 3.2 is proved in Section A.2. TWMLearn is essentially the ER-SpUD presented in  for exact recovery of sparsely-used dictionaries, but the technical result and application in Theorem 3.2 are novel.
TWMLearn involves solving optimization problems and as the number of words becomes large, this requires a fast method to solve ℓ1 minimization. Traditionally, the ℓ1 minimization can be formulated as a linear programming (LP) problem. In particular, each of the ℓ1 minimizations in TWMLearn can be written as an LP with inequality constraints and one equality constraint. However, the computational complexity of such a general-purpose formulation is often too high for large scale applications. Alternatively, one can use approximate methods which are significantly faster. There are several relevant algorithms with this theme, such as gradient projection [31, 39], iterative shrinkage-thresholding , and proximal gradient (Nesterov's method) [47, 48].
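The LP reformulation mentioned above can be sketched as follows. This is a generic illustration, not the exact program solved inside TWMLearn: it minimizes the ℓ1 norm of `B @ w` subject to a single linear equality `r @ w == 1`, where `B` and `r` are placeholder inputs. Auxiliary variables `t` bound the absolute values, giving two inequality constraints per row of `B` and one equality constraint. The sketch assumes SciPy is available and uses `scipy.optimize.linprog`.

```python
import numpy as np
from scipy.optimize import linprog


def l1_min_lp(B, r):
    """Solve min_w ||B w||_1 subject to r^T w = 1 as a linear program.

    Variables are (w, t) with -t <= B w <= t and t >= 0; the objective is sum(t).
    """
    B = np.asarray(B, dtype=float)
    r = np.asarray(r, dtype=float)
    m, d = B.shape
    cost = np.concatenate([np.zeros(d), np.ones(m)])
    # B w - t <= 0 and -B w - t <= 0
    A_ub = np.block([[B, -np.eye(m)], [-B, -np.eye(m)]])
    b_ub = np.zeros(2 * m)
    A_eq = np.concatenate([r, np.zeros(m)])[None, :]
    b_eq = np.array([1.0])
    bounds = [(None, None)] * d + [(0, None)] * m  # w is free, t is non-negative
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:d], res.fun


# Toy instance: minimize |w1| + |w2| subject to w1 + 0.5 * w2 = 1.
w_opt, value = l1_min_lp(np.eye(2), np.array([1.0, 0.5]))
print(w_opt, value)  # the minimizer is (1, 0) with objective value 1
```

The toy instance shows the sparsity-promoting behavior: among all points on the constraint line, the LP selects the one with a single non-zero coordinate.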
4 Bayesian networks for modeling topic distributions
According to Theorem 3.1, we can learn the topic-word matrix without any assumption on the dependence relationships among the hidden topics. (We only need the non-degeneracy assumption discussed in Condition 1 which requires the hidden variables to be linearly independent with probability one.)
Bayesian networks provide a natural framework for modeling topic dependencies, and we employ them here for modeling topic distributions. For these families, we prove identifiability and learning of the entire model, including the topic relationships and the topic-word matrix.
Bayesian networks, also known as belief networks, incorporate a set of causal and conditional independence relationships through directed acyclic graphs (DAGs) . They have widespread applicability in artificial intelligence [41, 19, 42, 25], in the social sciences [13, 40, 64, 18, 51, 50], and as structural equation models in economics [12, 33, 65, 18, 51, 60].
We define a DAG model as a pair , where is a joint probability distribution, parameterized by , on variables that is Markov with respect to a DAG with . More specifically, the joint probability factors as
where denotes the set of parents of node in .
We consider a subclass of DAG models for the topics in which the topics obey the linear relations
where represents the noise variable at topic . We further assume that the noise variables are independent.
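As an illustration of the linear relations among topics, the sketch below draws samples from a small linear DAG. It assumes the relations take the form h = Λh + η, with `Lam[j, i]` holding the coefficient of edge i → j (an indexing convention chosen here for illustration), so that h = (I − Λ)⁻¹η. The function name `sample_linear_dag` and the three-node chain are hypothetical.

```python
import numpy as np


def sample_linear_dag(Lam, noise, n_samples):
    """Draw samples of h from h = Lam h + eta with independent noise coordinates.

    noise is a callable taking a shape tuple and returning zero-mean draws.
    Returns (H, eta), each of shape (n_samples, k), one sample per row.
    """
    Lam = np.asarray(Lam, dtype=float)
    k = Lam.shape[0]
    eta = noise((n_samples, k))
    # Solve (I - Lam) h = eta for every sample at once.
    H = np.linalg.solve(np.eye(k) - Lam, eta.T).T
    return H, eta


rng = np.random.default_rng(0)

# Chain 0 -> 1 -> 2; Lam is strictly lower triangular, so the graph is acyclic.
Lam = np.zeros((3, 3))
Lam[1, 0] = 0.5
Lam[2, 1] = -0.8

# Centered exponential noise: zero mean, non-zero skewness.
H, eta = sample_linear_dag(Lam, lambda size: rng.exponential(1.0, size) - 1.0, 1000)
print(H.shape)
```

Because the DAG is acyclic, I − Λ is invertible (it is unit lower triangular under a topological ordering), so the linear solve is always well defined.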
Let be the matrix with at the entry if and zero everywhere else. Without loss of generality, we assume that the hidden (topic) variables , the observed (word) variables and the noise terms are all zero mean. We also denote the variances of and by and , respectively. Let and respectively denote the third moments of and , i.e., and . Define the skewness of as:
Finally, define the following moments of the observed variables:
It is convenient to consider the projection of to a matrix as follows:
where denotes the standard inner product.
Consider a DAG model which satisfies the model conditions described in Section 2.2 and in which the hidden variables are related through the linear equations (10). If the noise variables are independent and have non-zero skewness for , then the DAG model is identifiable from and , for an appropriate choice of . Furthermore, under the assumptions of Theorem 3.2, TMLearn returns matrices and up to a permutation of hidden nodes.
Notice that the only limitations on the noise variables are that they are independent (we only require pairwise and triple-wise independence) and have non-zero skewness. Some common examples of distributions with non-zero skewness are the exponential, chi-squared and Poisson distributions. Note that different topics may have different noise distributions.
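The skewness requirement can be checked empirically. The sketch below estimates skewness as the third central moment divided by the second central moment raised to the power 3/2; this standard normalization is an assumption here, since the paper's own definition is not reproduced in the text above. The Gaussian is included as a contrasting case with zero skewness.

```python
import numpy as np


def skewness(x):
    """Empirical skewness: third central moment over variance to the power 3/2."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5


rng = np.random.default_rng(0)
n = 500_000
draws = {
    "exponential": rng.exponential(1.0, n),
    "chi-squared (4 dof)": rng.chisquare(4, n),
    "Poisson (rate 4)": rng.poisson(4.0, n),
    "Gaussian": rng.standard_normal(n),
}
estimates = {name: skewness(x) for name, x in draws.items()}
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

The population values for these four cases are 2, √2, 1/2 and 0 respectively, so the first three satisfy the non-zero skewness requirement while the Gaussian does not.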
Remark 4.2 (Special Cases).
A special case of the above result is when the DAG is empty, i.e. , and the topics are independent. This is popularly known as independent component analysis (ICA), and similar spectral techniques have been proposed before for learning ICA . Similarly, the ECA approach proposed above is also applicable for learning latent Dirichlet allocation (LDA), using suitably adjusted second and third order moments . Note that for these special cases, we do not need to impose any constraints on the topic-word matrix (other than non-degeneracy), since we can directly learn and the topic distribution through ECA.
Another immediate application of the technique used in the proof of Theorem 4.1 is in learning fully-observed linear Bayesian networks.
Remark 4.3 (Learning fully-observed BN’s).
Consider an arbitrary fully-observed linear DAG:
and suppose that the noise variables have non-zero skewness. Then, applying the same argument as in the proof of Theorem 4.1, we can learn the matrix (and hence ) from the second and third order moments (We have here).
For the sake of simplicity, TMLearn is presented using the ECA method, which uses a single random direction and obtains the singular vectors of . A more robust alternative to this, as described in , is to use the following power iteration to obtain the singular vectors ; we use this variant in the simulations described in Section 6.
random orthonormal basis for . Repeat: For . Orthonormalize .
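The overall shape of such an iteration (start from a random orthonormal basis, apply an update to each vector, re-orthonormalize, repeat) can be sketched as below. The specific update used in the paper is not recoverable from the text above, so this sketch substitutes a fixed symmetric matrix `M` for it, which gives the classical orthogonal (subspace) iteration. The function name and the stopping rule (a fixed number of iterations) are assumptions.

```python
import numpy as np


def orthogonal_iteration(M, k, n_iter=200, seed=0):
    """Approximate the top-k eigenvectors of a symmetric matrix M.

    Assumption: a fixed matrix M stands in for the update applied in each step.
    """
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    # Random orthonormal starting basis.
    V, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(n_iter):
        # Apply the update to every basis vector, then orthonormalize via QR.
        V, _ = np.linalg.qr(M @ V)
    return V


M = np.diag([4.0, 2.0, 1.0])
V = orthogonal_iteration(M, k=2)
print(np.round(np.abs(V), 6))  # columns align with the first two coordinate axes, up to sign
```

Re-orthonormalizing after every step keeps the vectors from all collapsing onto the dominant direction, which is what makes the iteration more stable than relying on a single random direction.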
In principle, we can extend the above framework, combining spectral and ℓ1 approaches, for learning other models on . For instance, when the third order moments of are sufficient statistics (e.g. when is a graphical model with treewidth two), it suffices to learn the third order moments of , i.e. , where denotes the outer product of vectors. This can be accomplished as follows: first employ the ℓ1-based approach to learn the topic-word matrix , then consider the third order observed moments tensor . We have that
where denotes the multi-linear map of under . For details on multi-linear transformations of tensors, see.
4.1 Learning using second-order moments
In Theorem 4.1, we prove identifiability and learning of hidden DAGs from second and third order observed moments. A natural question is what can be done if only the second order moment is provided. The following remark states that if an oracle gives a topological ordering of the DAG structure, then the model can be learned from the second order moment alone, and there is no need for the third order moment.
A topological ordering of a DAG is a labeling of the nodes such that, for every directed edge , we have . It is a well-known result in graph theory that a directed graph is a DAG if and only if it admits a topological ordering. Now, consider a DAG model with a full column rank coefficient matrix between the observed and hidden nodes. Further, suppose that an oracle provides us with a topological ordering of the induced DAG on the hidden nodes, i.e., for any labeling of the hidden nodes the oracle returns a permutation of the labels which is faithful to a topological ordering of the DAG. Then, the DAG model (matrices and ) is identifiable from only the second order moment .
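The equivalence between acyclicity and the existence of a topological ordering can be illustrated with Kahn's algorithm, which either returns such an ordering or detects a cycle. This is a generic sketch of the graph-theoretic fact only; it does not implement the oracle or the learning procedure, and the function name is hypothetical.

```python
from collections import deque


def topological_order(n_nodes, edges):
    """Return a topological ordering of nodes 0..n_nodes-1, or None if there is a cycle.

    edges is an iterable of (parent, child) pairs.
    """
    children = [[] for _ in range(n_nodes)]
    in_degree = [0] * n_nodes
    for parent, child in edges:
        children[parent].append(child)
        in_degree[child] += 1
    # Start from the nodes that have no parents.
    queue = deque(v for v in range(n_nodes) if in_degree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            in_degree[c] -= 1
            if in_degree[c] == 0:
                queue.append(c)
    # Every node is emitted exactly when the graph has no directed cycle.
    return order if len(order) == n_nodes else None


print(topological_order(3, [(2, 0), (0, 1)]))  # [2, 0, 1]
print(topological_order(2, [(0, 1), (1, 0)]))  # None, since the graph has a cycle
```

In the returned ordering every parent appears before its children, which is the property the oracle in the remark is assumed to provide.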
5 Extension to general linear (single view) models
We have so far described a framework for identifiability and learning of topic models under expansion conditions. In fact, the developed framework holds for any linear multi-view model. Recall that if are the words in the document, and is the topic mixture variable, we have linearity and multiple (exchangeable and non-degenerate) views corresponding to different words in the document. In particular, the cross-moments between two different words and , given , is
We now extend the results to a general framework where, unlike topic models, only a single observed view is available, and further assumptions are needed to learn in this setting.
Consider an observed random vector and a hidden random vector . Let denote the bipartite graph with observed nodes and hidden nodes . Let be the noise variable associated with , for and denote the variance of by . Throughout we use the notation , and . The noise terms are assumed to be pairwise uncorrelated. The class of models considered are specified by the following assumptions.
Condition 4 (Linear model).
Without loss of generality, assume that , , are all zero mean. The observed and hidden variables obey the model
where are pairwise uncorrelated and are independent from . Furthermore, the matrix has full column rank and the hidden variables are linearly independent, i.e., with probability one, if , then , for all .
Notice that the structure of is defined by the non-zero coefficients in Eq. (14). Therefore, there is no edge among the observed nodes. We define by letting the entry be if and zero otherwise. We refer to matrix as the coefficient matrix.
The above setting is prevalent in a number of applications such as the blind deconvolution of sound and images . Independent component analysis (ICA) is a special case of the above setting, where the sources are assumed to be independent. In contrast, in our setting, we allow for an arbitrary distribution on , and assume expansion (and rank) conditions on the coefficient matrix .
Recall that in case of the topic models, corresponds to the topic-word matrix. Moreover, in the topic model setting, no assumption is made on the noise variables , since the presence of cross-moments (between different words) enables us to remove the dependence on . However, in the single view case the second order observed moment is given by
We now discuss a rank condition on the coefficient matrix , which allows us to remove the noise term from the second order moment .
Condition 5 (Rank condition).
There exists a fixed partition of such that , and has full column rank for all .
Since , for , we have as a consequence . Therefore, the condition essentially states that the number of hidden nodes should be at most one third of the number of observed ones. In most applications, we are looking for a small number of hidden effects that can represent the statistical dependence relationships among the observed nodes. Thus the rank condition is reasonable in these cases.
5.1 Matrix decomposition method for denoising
We now show that under the rank assumption in Condition 5, we can extract the noise terms from the observed moments through a matrix decomposition method.
Find a partition of , such that , and for all distinct . (Note that and by rank condition, there exists such a partition ). We now show that the matrix decomposition procedure returns and the diagonal matrix .
Let , with and a diagonal matrix. Suppose that for a fixed partition of , with , all the submatrices and have full column rank , for all . Then, returns and .
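One way to see why a three-way partition suffices is the following sketch. It is an illustration under stated assumptions, not a transcription of the paper's procedure: the observed second moment is taken to be a rank-k matrix plus a diagonal noise matrix, the off-diagonal blocks between distinct parts are therefore noise-free, and each diagonal block of the low-rank part is rebuilt from three off-diagonal blocks through a pseudo-inverse. The function name and the `rcond` threshold are hypothetical choices.

```python
import numpy as np


def denoise_second_moment(Sigma, parts, rcond=1e-10):
    """Split Sigma = low_rank + diagonal using a partition (Q1, Q2, Q3) of the rows.

    For parts (a, b, c), the block low_rank[a, a] is rebuilt as
    Sigma[a, b] @ pinv(Sigma[c, b]) @ Sigma[c, a], which uses only noise-free blocks.
    """
    Q1, Q2, Q3 = parts
    low_rank = Sigma.copy()
    for a, b, c in [(Q1, Q2, Q3), (Q2, Q3, Q1), (Q3, Q1, Q2)]:
        # Explicit rcond: Sigma[c, b] is rank deficient, so tiny singular values must be cut.
        middle = np.linalg.pinv(Sigma[np.ix_(c, b)], rcond=rcond)
        low_rank[np.ix_(a, a)] = Sigma[np.ix_(a, b)] @ middle @ Sigma[np.ix_(c, a)]
    noise = np.diag(np.diag(Sigma - low_rank))
    return low_rank, noise


rng = np.random.default_rng(0)
A = rng.standard_normal((9, 2))                 # coefficient matrix, rank 2
M = np.array([[2.0, 0.3], [0.3, 1.0]])          # second moment of the hidden variables
D = np.diag(rng.uniform(0.5, 1.5, size=9))      # diagonal noise covariance
Sigma = A @ M @ A.T + D

parts = ([0, 1, 2], [3, 4, 5], [6, 7, 8])
low_rank, noise = denoise_second_moment(Sigma, parts)
print(np.allclose(noise, D, atol=1e-6))
```

The identity behind the reconstruction relies on each row-submatrix of the coefficient matrix having full column rank and on the hidden second moment being invertible, which is where the rank and non-degeneracy conditions enter.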
5.1.1 Remark on finding the partition
The rank condition for matrix in Condition 5 ensures the existence of a partition of , such that, and has full column rank for all . However, we are not provided with such a partition. We now show that under an incoherence assumption about , a random partitioning of its rows into three groups has the desired property, with fixed positive probability.
Let be a thin singular value decomposition of , where has orthonormal columns, , and is orthogonal. Define the incoherence number of as:
Fix , and consider random submatrices of obtained by the following process: for each row of , independently choose one of the submatrices uniformly at random, and put the row in that submatrix. Fix . Then,
provided that .
Lemma 5.3 is proved in Appendix F. Using this lemma with , we obtain the following. For with full column rank and a random partitioning of its rows into three groups, all the submatrices , are full rank with probability at least , provided that
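The random partitioning step can be sketched as follows: assign each row to one of three groups uniformly at random and accept the partition once every row-submatrix has full column rank. Retrying until success is an addition made here for illustration (the lemma itself bounds the success probability of a single draw), and the function name and `max_tries` parameter are hypothetical.

```python
import numpy as np


def random_three_way_partition(A, rng, max_tries=100):
    """Randomly split the rows of A into three groups, each with full column rank.

    Returns a list of three index arrays, or None if no valid partition was found.
    """
    n, k = A.shape
    for _ in range(max_tries):
        # Each row independently picks one of the three groups uniformly at random.
        labels = rng.integers(0, 3, size=n)
        parts = [np.flatnonzero(labels == g) for g in range(3)]
        if all(len(p) >= k and np.linalg.matrix_rank(A[p]) == k for p in parts):
            return parts
    return None


rng = np.random.default_rng(0)
A = rng.standard_normal((30, 3))
parts = random_three_way_partition(A, rng)
print([len(p) for p in parts])
```

For a dense Gaussian matrix almost any partition with at least k rows per group works; the incoherence assumption in the lemma is what rules out matrices whose rank is concentrated in a few rows.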
Thus, we have a procedure for denoising (i.e. recovering the noise terms ) through random partitioning and matrix decomposition under an appropriate rank condition. The coefficient matrix can now be extracted from the denoised moments through the procedures listed in the previous sections, under the expansion condition (Condition 2) and the parameter genericity condition (Condition 3) for the coefficient matrix .
5.2 Application: learning hierarchical models
In the previous section, we developed a general framework for learning linear models with hidden variables.
We now apply the above results for learning hierarchical models, which consist of many layers of hidden variables. We first formally define hierarchical linear models.
A hierarchical linear model is a model with the following graph structure. The nodes of the graph can be partitioned into levels such that there is no edge between the nodes within one level and all the edges are between nodes in adjacent levels, for . Furthermore, the edges are directed from to . The nodes in level correspond to the observed nodes and other levels contain the hidden nodes.
The next theorem concerns identifiability of linear hierarchical models. More specifically, consider a hierarchical model and let be the induced graph with nodes and suppose that the induced model between levels and satisfies the model conditions described in Section 2.2 with coefficient matrix , for : satisfies the rank condition (Condition 5) and the parameter genericity property (Condition 3), and the (bipartite) graph has the expansion property (Condition 2).
Consider a hierarchical model with levels and suppose that the induced model between levels and satisfies the model conditions described in Section 2.2 with coefficient matrix , for . Then all columns of are identifiable for from the second order observed moment, i.e., . Therefore, the entire model is identifiable up to permuting the nodes within each level.
By the definition of a hierarchical model, the hidden nodes in level are independent. Now consider the case that the nodes in have arbitrary dependence relationships. By using the same argument as in the proof of Theorem 5.5, we can still learn all the coefficient matrices and the second order moment of the variables in layer .
6 Numerical experiments
In the previous sections, we proposed algorithms for learning topic models (multi-view), and general linear single view models. Our algorithms rely on low order (second and third order) moments of the observed variables. In presenting the results and the proofs we assumed that exact observed moments are available to emphasize the validity of the method. In general, these moments must be estimated from sampled data. This brings up the question of sample complexity, namely given a model , how many samples are required to estimate the model parameters with precision . We expect graceful sample complexity for the proposed algorithms as the low order moments can be reliably estimated from data. In this section, we consider two concrete examples of the single view linear models, and validate the performance of the proposed algorithms under a finite number of samples.
The first example is a hierarchical model where we require the coefficient matrices between adjacent layers to be full rank. The second example is an illustration of a model in which the relations among the hidden nodes are described by a (general) DAG, and we require the coefficient matrix to be full rank.
Example 1. We validate our method on the following configuration.
Graph structure: We consider a hierarchical model with three levels, , and . Levels and contain the hidden nodes with , and level contains the observed nodes with . Coefficient matrices and , respectively representing the linear relationships among the levels , and the levels , , are constructed according to a Bernoulli-Gaussian model. More specifically, , where is an i.i.d. Bernoulli matrix, and has i.i.d. standard normal entries. Further, indicates the entrywise product. In our experiment, we choose to make the model satisfy the expansion property. Also recall that Theorem 3.2 assumes a positive gap between the maximum and the second maximum absolute values in the row, for . For the sake of simplicity, we consider the same gap for all the rows. More specifically, in each row of we change the entry with the maximum absolute value to ensure gap while keeping the sign of this entry unchanged. As we will see, has an important effect on the sample complexity of the algorithm. A very small leads to poor sample complexity, and increasing improves the sample complexity of the algorithm. A similar model is used to generate .
Noise variables: For each noise variable, its variance is selected uniformly at random from the interval