## 1 Introduction

Studying communities forms an integral part of social network analysis. A community generally refers to a group of individuals with shared interests (e.g. music, sports), or relationships (e.g. friends, co-workers). Community formation in social networks has been studied by many sociologists, e.g. (Moreno, 1934; Lazarsfeld et al., 1954; McPherson et al., 2001; Currarini et al., 2009), starting with the seminal work of Moreno (1934). They posit various factors, such as homophily among the individuals, to be responsible for community formation. (The term homophily refers to the tendency of individuals belonging to the same community to connect more than individuals in different communities.)
Various probabilistic and non-probabilistic network models attempt to explain community formation. In addition, they also attempt to quantify interactions and the extent of overlap between different communities, relative sizes among the communities, and various other network properties. Studying such community models is also of interest in other domains, e.g. in biological networks.

While there exists a vast literature on community models, learning these models is typically challenging, and various heuristics such as Markov Chain Monte Carlo (MCMC) or variational expectation maximization (EM) are employed in practice. Such heuristics tend to scale poorly for large networks. On the other hand, community models with guaranteed learning methods tend to be restrictive. A popular class of probabilistic models, termed stochastic blockmodels, has been widely studied and enjoys strong theoretical learning guarantees, e.g. (White et al., 1976; Holland et al., 1983; Fienberg et al., 1985; Wang and Wong, 1987; Snijders and Nowicki, 1997; McSherry, 2001). However, these models posit that an individual belongs to a single community, which does not hold in most real settings (Palla et al., 2005).

In this paper, we consider a class of mixed membership community models, originally introduced by Airoldi et al. (2008), and recently employed by Xing et al. (2010) and Gopalan et al. (2012). The model has been shown to be effective in many real-world settings, but so far, no learning approach exists with provable guarantees. We provide a novel approach for learning these mixed membership models and prove that it succeeds under a set of sufficient conditions.

The mixed membership community model of Airoldi et al. (2008) has a number of attractive properties. It retains many of the convenient properties of the stochastic block model. For instance, conditional independence of the edges is assumed, given the community memberships of the nodes in the network. At the same time, it allows for communities to overlap, and for every individual to be fractionally involved in different communities. It includes the stochastic block model as a special case (corresponding to zero overlap among the different communities). This enables us to compare our learning guarantees with existing works for stochastic block models and also study how the extent of overlap among different communities affects the learning performance.

### 1.1 Summary of Results

We now summarize the main contributions of this paper. We propose a novel approach for learning mixed membership community models of Airoldi et al. (2008). Our approach is a method of moments estimator and incorporates tensor spectral decomposition. We provide guarantees for our approach under a set of sufficient conditions. Finally, we compare our results to existing ones for the special case of the stochastic block model, where nodes belong to a single community.

##### Learning Mixed Membership Models:

We present a tensor-based approach for learning the mixed membership stochastic block model (MMSB) proposed by Airoldi et al. (2008). In the MMSB model, the community membership vectors are drawn from the Dirichlet distribution, denoted by $\mathrm{Dir}(\alpha)$, where $\alpha$ is known as the Dirichlet concentration vector. Employing the Dirichlet distribution results in sparse community memberships in certain regimes of $\alpha$, which is realistic. The extent of overlap between different communities under the MMSB model is controlled (roughly) via a single scalar parameter, $\alpha_0 := \sum_i \alpha_i$. When $\alpha_0 \to 0$, the mixed membership model degenerates to a stochastic block model and we have non-overlapping communities.

We propose a unified tensor-based learning method for the MMSB model and establish recovery guarantees under a set of sufficient conditions. These conditions are in terms of the network size $n$, the number of communities $k$, the extent of community overlaps (through $\alpha_0$), and the average edge connectivity across various communities. Below, we present an overview of our guarantees for the special case of equal sized communities (each of size $n/k$) and homogeneous community connectivity: let $p$ be the probability for any intra-community edge to occur, and $q$ be the probability for any inter-community edge. Let $\Pi$ be the community membership matrix, where $\Pi^{i}$ denotes the $i$-th row, which is the vector of membership weights of the nodes for the $i$-th community. Let $P$ be the community connectivity matrix such that $P_{i,i} = p$ and $P_{i,j} = q$ for $i \neq j$.

###### Theorem 1.1 (Main Result).

For an MMSB model with network size $n$, number of communities $k$, connectivity parameters $p, q$ and community overlap parameter $\alpha_0$, when (here the notation $\tilde{\Omega}(\cdot)$, $\tilde{O}(\cdot)$ denotes $\Omega(\cdot)$, $O(\cdot)$ up to poly-log factors)

(1) |

our estimated community membership matrix and the edge connectivity matrix satisfy with high probability (w.h.p.)

(2) | ||||

(3) |

Further, our support estimates satisfy w.h.p.,

(4) |

where is the true community membership matrix and the threshold is chosen as .

The complete details are in Section 4. We first provide some intuitions behind the sufficient conditions in (1). We require the network size $n$ to be large enough compared to the number of communities $k$, and the separation $p - q$ to be large enough, so that the learning method can distinguish the different communities. This is natural since a zero separation ($p = q$) implies that the communities are indistinguishable. Moreover, we see that the scaling requirements become more stringent as $\alpha_0$ increases. This is intuitive since it is harder to learn communities with more overlap, and we quantify this scaling. For the Dirichlet distribution, it can be shown that the number of “significant” entries is roughly $O(\alpha_0)$ with high probability, and in many settings of practical interest, nodes may have significant memberships in only a few communities, and thus, $\alpha_0$ is a constant (or growing slowly) in many instances.

In addition, we quantify the error bounds for estimating various parameters of the mixed membership model in (2) and (3). These errors decay under the sufficient conditions in (1). Lastly, we establish zero-error guarantees for support recovery in (4): our learning method correctly identifies (w.h.p) all the significant memberships of a node and also identifies the set of communities where a node does not have a strong presence, and we quantify the threshold in Theorem 1.1. Further, we present the results for a general (non-homogeneous) MMSB model in Section 4.2.

##### Identifiability Result for the MMSB model:

A byproduct of our analysis is a set of novel identifiability results for the MMSB model based on low order graph moments. We establish that the MMSB model is identifiable, given access to third order moments in the form of counts of 3-star subgraphs, i.e. star subgraphs consisting of three leaves, for each triplet of leaves, when the community connectivity matrix is full rank. Our learning approach involves decomposition of this third order tensor. Previous identifiability results required access to high order moments and were limited to the stochastic block model setting; see Section 1.3 for details.

##### Implications on Learning Stochastic Block Models:

Our results have implications for learning stochastic block models, which are a special case of the MMSB model with $\alpha_0 \to 0$. In this case, the sufficient conditions in (1) reduce to

(5) |

The scaling requirements in (5) match the best known bounds (up to poly-log factors) for learning uniform stochastic block models and were previously achieved by Chen et al. (2012) via convex optimization involving semi-definite programming (SDP). (There are many methods which achieve the best known scaling for $n$ in (5), but have worse scaling for the separation $p - q$. This includes variants of the spectral clustering method, e.g. Chaudhuri et al. (2012).) In contrast, we propose an iterative non-convex approach involving tensor power iterations and linear algebraic techniques, and obtain similar guarantees. For a detailed comparison of learning guarantees under various methods for learning (homogeneous) stochastic block models, see Chen et al. (2012).

Thus, we establish learning guarantees explicitly in terms of the extent of overlap among the different communities for general MMSB models. Many real-world networks involve sparse community memberships and the total number of communities is typically much larger than the extent of membership of a single individual, e.g. hobbies/interests of a person, university/company networks that a person belongs to, the set of transcription factors regulating a gene, and so on. In this regime of practical interest, where $\alpha_0$ is a constant, the scaling requirements in (1) match those for the stochastic block model in (5) (up to poly-log factors) without any degradation in learning performance. Hence, learning community models with sparse community memberships is akin to learning stochastic block models, and we present a unified approach and analysis for learning these models.

To the best of our knowledge, this work is the first to establish polynomial time learning guarantees for probabilistic network models with overlapping communities and we provide a fast and an iterative learning approach through linear algebraic techniques and tensor power iterations. While the results of this paper are mostly limited to a theoretical analysis of the tensor method for learning overlapping communities, we note recent results which show that this method (with improvements and modifications) is very accurate in practice on real datasets from social networks, and is scalable to graphs with millions of nodes (Huang et al., 2013).

### 1.2 Overview of Techniques

We now describe the main techniques employed in our learning approach and in establishing the recovery guarantees.

##### Method of moments and subgraph counts:

We propose an efficient learning algorithm based on low order moments, viz., counts of small subgraphs. Specifically, we employ a third-order tensor which counts the number of 3-stars in the observed network. A 3-star is a star graph with three leaves (see figure 1) and we count the occurrences of such 3-stars across different partitions. We establish that (an adjusted) 3-star count tensor has a simple relationship with the model parameters, when the network is drawn from a mixed membership model. We propose a multi-linear transformation using edge-count matrices (also termed as the process of whitening), which reduces the problem of learning mixed membership models to the *canonical polyadic (CP) decomposition* of an orthogonal symmetric tensor, for which tractable decomposition exists, as described below. Note that the decomposition of a general tensor into its rank-one components is referred to as its CP decomposition (Kolda and Bader, 2009) and is in general NP-hard (Hillar and Lim, 2012). However, the decomposition is tractable in the special case of an orthogonal symmetric tensor considered here.
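The whitening idea can be illustrated on synthetic moments. The sketch below is a simplified symmetric version, not the edge-count whitening used in this paper: it assumes we are handed a second-order moment $M_2 = \sum_i w_i a_i a_i^\top$ and a third-order moment $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$ with linearly independent $a_i$. The names `whiten`, `M2`, `M3`, `A` and `w` are hypothetical and chosen only for illustration.

```python
import numpy as np

def whiten(M2, M3, k):
    """Return W with W^T M2 W = I_k and the whitened tensor M3(W, W, W)."""
    vals, vecs = np.linalg.eigh(M2)              # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]             # indices of the k largest eigenvalues
    W = vecs[:, top] / np.sqrt(vals[top])        # d x k whitening matrix
    # Multi-linear transformation: apply W along each of the three modes.
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
    return W, T

# Synthetic moments (assumed inputs, for illustration only).
rng = np.random.default_rng(0)
d, k = 8, 3
A = rng.normal(size=(d, k))                      # columns a_i
w = np.array([0.5, 0.3, 0.2])                    # positive weights w_i
M2 = (A * w) @ A.T
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

W, T = whiten(M2, M3, k)
V = (W.T @ A) * np.sqrt(w)                       # columns v_i = sqrt(w_i) W^T a_i
```

Under these assumptions the columns of `V` are orthonormal and `T` equals $\sum_i w_i^{-1/2}\, v_i \otimes v_i \otimes v_i$, i.e. an orthogonal symmetric tensor whose CP decomposition is tractable.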

##### Tensor spectral decomposition via power iterations:

Our tensor decomposition method is based on the popular power iterations (e.g. see Anandkumar et al. (2012a)). It is a simple iterative method to compute the stable eigen-pairs of a tensor. In this paper, we propose various modifications to the basic power method to strengthen the recovery guarantees under perturbations. For instance, we introduce adaptive deflation techniques (which involves subtracting out the eigen-pairs previously estimated). Moreover, we initialize the tensor power method with (whitened) neighborhood vectors from the observed network, as opposed to random initialization. In the regime, where the community overlaps are small, this leads to an improved performance. Additionally, we incorporate thresholding as a post-processing operation, which again, leads to improved guarantees for sparse community memberships, i.e., when the overlap among different communities is small. We theoretically establish that all these modifications lead to improvement in performance guarantees and we discuss comparisons with the basic power method in Section 4.4.
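As a rough illustration of the basic power method with deflation, the following sketch recovers the eigen-pairs of an exactly orthogonal symmetric tensor. It is not the modified procedure analyzed in this paper: it uses random initialization instead of whitened neighborhood vectors, plain deflation instead of adaptive deflation, and no thresholding. The function name and its parameters (`n_restarts`, `n_iters`) are hypothetical.

```python
import numpy as np

def tensor_power_method(T, n_components, n_restarts=10, n_iters=100, seed=0):
    """Estimate eigen-pairs of a symmetric orthogonal tensor by power iteration."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    dim = T.shape[0]
    eigvals, eigvecs = [], []
    for _ in range(n_components):
        best_val, best_vec = -np.inf, None
        for _ in range(n_restarts):
            v = rng.normal(size=dim)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum('ijk,j,k->i', T, v, v)     # v <- T(I, v, v)
                v /= np.linalg.norm(v)
            val = float(np.einsum('ijk,i,j,k->', T, v, v, v))
            if val > best_val:
                best_val, best_vec = val, v
        eigvals.append(best_val)
        eigvecs.append(best_vec)
        # Deflation: subtract the rank-one component just estimated.
        T = T - best_val * np.einsum('i,j,k->ijk', best_vec, best_vec, best_vec)
    return np.array(eigvals), np.array(eigvecs).T        # eigenvectors as columns

# Synthetic orthogonal tensor with known eigen-pairs (illustrative values).
rng = np.random.default_rng(1)
V_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
lam_true = np.array([3.0, 2.0, 1.0])
T = np.einsum('i,ai,bi,ci->abc', lam_true, V_true, V_true, V_true)
vals, vecs = tensor_power_method(T, n_components=3)
```

On exact (noise-free) input each restart converges quickly to one of the eigenvectors; the modifications described above matter when the tensor is only observed up to perturbation.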

##### Sample analysis:

We establish that our learning approach correctly recovers the model parameters and the community memberships of all nodes under exact moments. We then carry out a careful analysis of the empirical graph moments, computed using the network observations. We establish tensor concentration bounds and also control the perturbation of the various quantities used by our learning algorithm via matrix Bernstein’s inequality (Tropp, 2012, thm. 1.4) and other inequalities. We impose the scaling requirements in (1) for various concentration bounds to hold.

### 1.3 Related Work

There is extensive work on modeling communities and various algorithms and heuristics for discovering them. We mostly limit our focus to works with theoretical guarantees.

##### Method of moments:

The method of moments approach dates back to Pearson (1894) and has been applied for learning various community models. Here, the moments correspond to counts of various subgraphs in the network. They typically consist of aggregate quantities, e.g., number of star subgraphs, triangles etc. in the network. For instance, Bickel et al. (2011) analyze the moments of a stochastic block model and establish that the subgraph counts of certain structures, termed as “wheels” (a family of trees), are sufficient for identifiability under some natural non-degeneracy conditions. In contrast, we establish that moments up to third order (corresponding to edge and 3-star counts) are sufficient for identifiability of the stochastic block model, and also more generally, for the mixed membership Dirichlet model. We employ subgraph count tensors, corresponding to the number of subgraphs (such as stars) over a set of labeled vertices, while the work of Bickel et al. (2011) considers only aggregate (i.e. scalar) counts. Considering tensor moments allows us to use simple subgraphs (edges and 3-stars) corresponding to low order moments, rather than more complicated graphs (e.g. wheels considered by Bickel et al. (2011)) with a larger number of nodes, for learning the community model.

The method of moments is also relevant for the family of random graph models termed as exponential random graph models (Holland and Leinhardt, 1981; Frank and Strauss, 1986). Subgraph counts of fixed graphs such as stars and triangles serve as sufficient statistics for these models. However, parameter estimation given the subgraph counts is in general NP-hard, due to the normalization constant in the likelihood (the partition function), and the model suffers from degeneracy issues; see Rinaldo et al. (2009); Chatterjee and Diaconis (2011) for detailed discussion. In contrast, we establish in this paper that the mixed membership model is amenable to simple estimation methods through linear algebraic operations and tensor power iterations using subgraph counts of 3-stars.

##### Stochastic block models:

Many algorithms provide learning guarantees for stochastic block models. A popular method is based on spectral clustering (McSherry, 2001), where community memberships are inferred through projection onto the spectrum of the Laplacian matrix (or its variants). This method is fast and easy to implement (via singular value decomposition). There are many variants of this method, e.g. the work of Chaudhuri et al. (2012) employs the normalized Laplacian matrix to handle degree heterogeneities. In contrast, the work of Chen et al. (2012) uses convex optimization techniques via semi-definite programming for learning block models. For a detailed comparison of learning guarantees under various methods for learning stochastic block models, see Chen et al. (2012).

##### Non-probabilistic approaches:

The classical approach to community detection tries to directly exploit the properties of the graph to define communities, without assuming a probabilistic model. Girvan and Newman (2002) use betweenness to remove edges until only communities are left. However, Bickel and Chen (2009) show that these algorithms are (asymptotically) biased and that using modularity scores can lead to the discovery of an incorrect community structure, even for large graphs. Jalali et al. (2011) define community structure as the structure that satisfies the maximum number of edge constraints (whether two individuals like/dislike each other). However, these models assume that every individual belongs to a single community.

Recently, some non-probabilistic approaches have been introduced with overlapping community models by Arora et al. (2012) and Balcan et al. (2012). The analysis of Arora et al. (2012) is mostly limited to dense graphs (i.e. edges for a node graph), while our analysis provides learning guarantees for much sparser graphs (as seen by the scaling requirements in (1)). Moreover, the running time of the method of Arora et al. (2012) is quasipolynomial time (i.e. ) for the general case, and is based on a combinatorial learning approach. In contrast, our learning approach is based on simple linear algebraic techniques and the running time is a low-order polynomial (roughly it is for a node network with communities under a serial computation model and under a parallel computation model). The work of Balcan et al. (2012) assumes endogenously formed communities, by constraining the fraction of edges within a community compared to the outside. They provide a polynomial time algorithm for finding all such “self-determined” communities and the running time is , where is the fraction of edges within a self-determined community, and this bound is improved to linear time when . On the other hand, the running time of our algorithm is mostly independent of the parameters of the assumed model, (and is roughly ). Moreover, both these works are limited to homophilic models, where there are more edges within each community, than between any two different communities. However, our learning approach is not limited to this setting and also does not assume homogeneity in edge connectivity across different communities (but instead it makes probabilistic assumptions on community formation). In addition, we provide improved guarantees for homophilic models by considering additional post-processing steps in our algorithm. Recently, Abraham et al. (2012) provide an algorithm for approximating the parameters of an Euclidean log-linear model in polynomial time. 
However, their setting is considerably different from the one in this paper.

##### Inhomogeneous random graphs, graph limits and weak regularity lemma:

Inhomogeneous random graphs have been analyzed in a variety of settings (e.g., Bollobás et al. (2007); Lovász (2009)) and are generalizations of the stochastic block model. Here, the probability of an edge between any two nodes is characterized by a general function (rather than by a matrix as in the stochastic block model with blocks). Note that the mixed membership model considered in this work is a special instance of this general framework. These models arise as the limits of convergent (dense) graph sequences and for this reason, the functions are also termed as “graphons” or graph limits (Lovász, 2009). A deep result in this context is the regularity lemma and its variants. The weak regularity lemma proposed by Frieze and Kannan (1999), showed that any convergent dense graph can be approximated by a stochastic block model. Moreover, they propose an algorithm to learn such a block model based on the so-called distance. The distance between two nodes measures similarity with respect to their “two-hop” neighbors and the block model is obtained by thresholding the distances. However, the method is limited to learning block models and not overlapping communities.

##### Learning Latent Variable Models (Topic Models):

The community models considered in this paper are closely related to the probabilistic topic models (Blei, 2012), employed for text modeling and document categorization. Topic models posit the occurrence of words in a corpus of documents, through the presence of multiple latent topics in each document. Latent Dirichlet allocation (LDA) is perhaps the most popular topic model, where the topic mixtures are assumed to be drawn from the Dirichlet distribution. In each document, a topic mixture is drawn from the Dirichlet distribution, and the words are drawn in a conditionally independent manner, given the topic mixture. The mixed membership community model considered in this paper can be interpreted as a generalization of the LDA model, where a node in the community model can function both as a document and a word. For instance, in the directed community model, when the outgoing links of a node are considered, the node functions as a document, and its outgoing neighbors can be interpreted as the words occurring in that document. Similarly, when the incoming links of a node in the network are considered, the node can be interpreted as a word, and its incoming links, as documents containing that particular word. In particular, we establish that certain graph moments under the mixed membership model have a structure similar to that of the observed word moments under the LDA model. This allows us to leverage the recent developments from Anandkumar et al. (2012c, a, b) for learning topic models, based on the method of moments. These works establish guaranteed learning using second- and third-order observed moments through linear algebraic and tensor-based techniques. In particular, in this paper, we exploit the tensor power iteration method of Anandkumar et al. (2012b), and propose additional improvements to obtain stronger recovery guarantees. Moreover, the sample analysis is quite different (and more challenging) in the community setting, compared to topic models analyzed in Anandkumar et al. (2012c, a, b). We clearly spell out the similarities and differences between the community model and other latent variable models in Section 4.4.

##### Lower Bounds:

The work of Feldman et al. (2012) provides lower bounds on the complexity of statistical algorithms, and shows that for cliques of size , for any constant , at least queries are needed to find the cliques. There are works relating the hardness of finding hidden cliques and the use of higher order moment tensors for this purpose. Frieze and Kannan (2008) relate the problem of finding a hidden clique to finding the top eigenvector of the third order tensor, corresponding to the maximum spectral norm. Brubaker and Vempala (2009) extend the result to arbitrary -order tensors and the cliques have to be size to enable recovery from -order moment tensors in an $n$ node network. However, this problem (finding the top eigenvector of a tensor) is known to be NP-hard in general (Hillar and Lim, 2012). Thus, tensors are useful for finding smaller hidden cliques in a network (albeit by solving a computationally hard problem). In contrast, we consider tractable tensor decomposition through reduction to orthogonal tensors (under the scaling requirements of (1)), and our learning method is a fast, iterative approach based on tensor power iterations and linear algebraic operations. Mossel et al. (2012) provide lower bounds on the separation $p - q$ between the intra-community and inter-community edge connectivity, for identifiability of communities in stochastic block models in the sparse regime (when $p, q \sim n^{-1}$), when the number of communities is a constant $k = O(1)$. Our method achieves the lower bounds on separation of edge connectivity up to poly-log factors.

##### Likelihood-based Approaches to Learning MMSB:

Another class of approaches for learning MMSB models is based on optimizing the observed likelihood. Traditional approaches such as Gibbs sampling or expectation maximization (EM) can be too expensive to apply in practice for MMSB models. Variational approaches which optimize the so-called evidence lower bound (Hoffman et al., 2012; Gopalan et al., 2012), which is a lower bound on the marginal likelihood of the observed data (typically by applying a mean-field approximation), are efficient for practical implementation. Stochastic versions of the variational approach provide even further gains in efficiency and are state-of-the-art practical learning methods for MMSB models (Gopalan et al., 2012). However, these methods lack theoretical guarantees; since they optimize a bound on the likelihood, they are not guaranteed to recover the underlying communities consistently. A recent work (Celisse et al., 2012) establishes consistency of maximum likelihood and variational estimators for stochastic block models, which are special cases of the MMSB model. However, it is not known if the results extend to general MMSB models. Moreover, the framework of Celisse et al. (2012) assumes a fixed number of communities and growing network size, and provides only asymptotic consistency guarantees. Thus, it does not allow for high-dimensional settings, where the parameters of the learning problem also grow as the observed dimensionality grows. In contrast, in this paper, we allow for the number of communities to grow, and provide precise constraints on the scaling bounds for consistent estimation under finite samples. It is an open problem to obtain such bounds for maximum likelihood and variational estimators. On the practical side, a recent work by Huang et al. (2013) deploying the tensor approach proposed in this paper shows that the tensor approach is more than an order of magnitude faster in recovering the communities than the variational approach, is scalable to networks with millions of nodes, and also has better accuracy in recovering the communities.

## 2 Community Models and Graph Moments

### 2.1 Community Membership Models

In this section, we describe the mixed membership community model based on Dirichlet priors for the community draws by the individuals. We first introduce the special case of the popular stochastic block model, where each node belongs to a single community.

##### Notation:

We consider networks with $n$ nodes and let $[n] := \{1, 2, \ldots, n\}$. Let $G$ be the $\{0,1\}$ adjacency matrix for the random network and let $G_{A,B}$ be the submatrix of $G$ corresponding to rows $A \subseteq [n]$ and columns $B \subseteq [n]$. (Our analysis can easily be extended to weighted adjacency matrices with bounded entries.) We consider models with $k$ underlying (hidden) communities. For node $i$, let $\pi_i \in \mathbb{R}^k$ denote its *community membership vector*, i.e., the vector is supported on the communities to which the node belongs. In the special case of the popular stochastic block model described below, $\pi_i$ is a basis coordinate vector, while the more general mixed membership model relaxes this assumption and a node can be in multiple communities with fractional memberships.
Define $\Pi := [\pi_1 | \pi_2 | \cdots | \pi_n] \in \mathbb{R}^{k \times n}$ and let $\Pi_A := [\pi_i : i \in A]$ denote the set of column vectors restricted to $A \subseteq [n]$. For a matrix $M$, let $(M)_i$ and $(M)^i$ denote its $i$-th column and row respectively. For a matrix $M$ with singular value decomposition (SVD) $M = U D V^\top$, let $(M)_{k\text{-svd}} := U \tilde{D} V^\top$ denote the $k$-rank SVD of $M$, where $\tilde{D}$ is limited to the top-$k$ singular values of $M$. Let $M^\dagger$ denote the Moore-Penrose pseudo-inverse of $M$.
Let $\mathbb{I}(\cdot)$ be the indicator function. Let $\mathrm{Diag}(v)$ denote a diagonal matrix with diagonal entries given by a vector $v$. We use the term high probability to mean with probability $1 - n^{-c}$ for any constant $c > 0$.

##### Stochastic block model (special case):

In this model, each individual is independently assigned to a single community, chosen at random: each node $i$ chooses community $j$ independently with probability $\hat{\alpha}_j$, for $i \in [n]$, $j \in [k]$, and we assign $\pi_i = e_j$ in this case, where $e_j \in \{0,1\}^k$ is the $j$-th coordinate basis vector.
Given the community assignments $\Pi$, every directed edge in the network is independently drawn: if node $u$ is in community $i$ and node $v$ is in community $j$ (and $u \neq v$), then the probability of having the edge $(u,v)$ in the network is $P_{i,j}$. (We limit our discussion to directed networks in this paper, but note that the results also hold for undirected community models, where $P$ is a symmetric matrix, and an edge $(u,v)$ is formed with probability $\pi_u^\top P \pi_v = \pi_v^\top P \pi_u$.) Here, $P \in [0,1]^{k \times k}$ and we refer to it as the *community connectivity matrix*. This implies that given the community membership vectors $\pi_u$ and $\pi_v$, the probability of an edge from $u$ to $v$ is $\pi_u^\top P \pi_v$ (since when $\pi_u = e_i$ and $\pi_v = e_j$, we have $\pi_u^\top P \pi_v = P_{i,j}$). The stochastic block model has been extensively studied and can be learnt efficiently through various methods, e.g. spectral clustering (McSherry, 2001), convex optimization (Chen et al., 2012), and so on. Many of these methods rely on conditional independence assumptions of the edges in the block model for guaranteed learning.

##### Mixed membership model:

We now consider the extension of the stochastic block model which allows for an individual to belong to multiple communities and yet preserves some of the convenient independence assumptions of the block model. In this model, the community membership vector $\pi_u$ at node $u$ is a probability vector, i.e., $\sum_{i \in [k]} \pi_u(i) = 1$, for all $u \in [n]$. Given the community membership vectors, the generation of the edges is identical to the block model: given vectors $\pi_u$ and $\pi_v$, the probability of an edge from $u$ to $v$ is $\pi_u^\top P \pi_v$, and the edges are independently drawn. This formulation allows for the nodes to be in multiple communities, and at the same time, preserves the conditional independence of the edges, given the community memberships of the nodes.
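The edge-generation step described above can be sketched as follows. This is a minimal illustration, assuming the membership matrix is stored as a $k \times n$ array whose columns are probability vectors; the function name `sample_edges`, the exclusion of self-loops, and all numeric values are assumptions made for the example. The membership vectors are drawn from a Dirichlet distribution here only to have some valid probability vectors to work with.

```python
import numpy as np

def sample_edges(Pi, P, rng):
    """Draw a directed graph given memberships Pi (k x n) and connectivity P (k x k)."""
    edge_prob = Pi.T @ P @ Pi                    # (u, v) entry is pi_u^T P pi_v
    G = (rng.random(edge_prob.shape) < edge_prob).astype(int)   # independent Bernoulli edges
    np.fill_diagonal(G, 0)                       # assumption: self-loops are excluded
    return G, edge_prob

rng = np.random.default_rng(0)
k, n = 3, 200
P = np.full((k, k), 0.05) + 0.45 * np.eye(k)     # 0.5 within a community, 0.05 across
Pi = rng.dirichlet(np.full(k, 0.3), size=n).T    # illustrative memberships, k x n
G, edge_prob = sample_edges(Pi, P, rng)

# Block model special case: basis-vector memberships pick out single entries of P.
Pi_block = np.eye(k)[:, [0, 2]]                  # node 0 in community 0, node 1 in community 2
_, block_prob = sample_edges(Pi_block, P, rng)
```

Because each edge probability is a convex combination of the entries of `P`, it always lies between the smallest and largest connectivity values.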

##### Dirichlet prior for community membership:

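The only aspect left to be specified for the mixed membership model is the distribution from which the community membership vectors are drawn. We consider the popular setting of Airoldi et al. (2008), where the community vectors are i.i.d. draws from the Dirichlet distribution, denoted by $\mathrm{Dir}(\alpha)$, with parameter vector $\alpha \in \mathbb{R}^k_{>0}$. The probability density function of the Dirichlet distribution is given by

$$\mathbb{P}[\pi] = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} \pi_i^{\alpha_i - 1}, \qquad \pi \sim \mathrm{Dir}(\alpha), \quad \alpha_0 := \sum_{i=1}^{k} \alpha_i, \tag{6}$$

where $\Gamma(\cdot)$ is the Gamma function and the ratio of the Gamma functions serves as the normalization constant.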

The Dirichlet distribution is widely employed for specifying priors in Bayesian statistics, e.g. latent Dirichlet allocation (Blei et al., 2003). The Dirichlet distribution is the conjugate prior of the multinomial distribution, which makes it attractive for Bayesian inference.

Let $\hat{\alpha}$ denote the normalized parameter vector $\alpha / \alpha_0$, where $\alpha_0 := \sum_i \alpha_i$. In particular, note that $\hat{\alpha}$ is a probability vector: $\sum_i \hat{\alpha}_i = 1$. Intuitively, $\hat{\alpha}$ denotes the relative expected sizes of the communities (since $\mathbb{E}[n^{-1} \sum_{u \in [n]} \pi_u(i)] = \hat{\alpha}_i$). Let $\hat{\alpha}_{\max}$ be the largest entry in $\hat{\alpha}$, and $\hat{\alpha}_{\min}$ be the smallest entry. Our learning guarantees will depend on these parameters.

The stochastic block model is a limiting case of the mixed membership model when the Dirichlet parameter is $\alpha = \alpha_0 \cdot \hat{\alpha}$, where the probability vector $\hat{\alpha}$ is held fixed and $\alpha_0 \to 0$. In the other extreme when $\alpha_0 \to \infty$, the Dirichlet distribution becomes peaked around a single point; for instance, if $\alpha_i \equiv c$ and $c \to \infty$, the Dirichlet distribution is peaked at $k^{-1} \cdot \vec{1}$, where $\vec{1}$ is the all-ones vector. Thus, the parameter $\alpha_0$ serves as a measure of the average sparsity of the Dirichlet draws or equivalently, of how concentrated the Dirichlet measure is along the different coordinates. This, in effect, controls the extent of overlap among different communities.

##### Sparse regime of Dirichlet distribution:

When the Dirichlet parameter vector satisfies $\alpha_i < 1$ for all $i \in [k]$, the Dirichlet distribution generates “sparse” vectors with high probability; see Telgarsky (2012). (Roughly, the number of entries in $\pi_i$ exceeding a threshold $\tau$ is at most $O(\alpha_0 \log(1/\tau))$ with high probability, when $\alpha_0 < 1$.) In the extreme case of the block model, where $\alpha_0 \to 0$, it generates $1$-sparse vectors. Many real-world settings involve sparse community membership and the total number of communities is typically much larger than the extent of membership of a single individual, e.g. hobbies/interests of a person, university/company networks that a person belongs to, the set of transcription factors regulating a gene, and so on.
Our learning guarantees are limited to the sparse regime of the Dirichlet model. This assumption is not strictly needed: our results can be extended to general Dirichlet distributions, but with worse scaling requirements on the network size $n$ for guaranteed learning.

### 2.2 Graph Moments Under Mixed Membership Models

Our approach for learning a mixed membership community model relies on the form of the graph moments^{8}^{8}8We interchangeably use the term first order moments for edge counts and third order moments for 3-star counts. under the mixed membership model. We now describe the specific graph moments used by our learning algorithm (based on 3-star and edge counts) and provide explicit forms for the moments, assuming draws from a mixed membership model.

#### Notations

Recall that $G$ denotes the adjacency matrix and that $G_{X,A}$ denotes the submatrix corresponding to edges going from $X$ to $A$. Recall that $P \in [0,1]^{k \times k}$ denotes the community connectivity matrix. Define

$$F := \Pi^\top P^\top. \tag{7}$$

For a subset $A \subseteq [n]$ of individuals, let $F_A \in \mathbb{R}^{|A| \times k}$ denote the submatrix of $F$ corresponding to nodes in $A$, *i.e.*, $F_A := \Pi_A^\top P^\top$. We will subsequently show that $F_A$ is a linear map which takes any community vector as input and outputs the corresponding neighborhood vector in expectation.

Our learning algorithm uses moments up to the third order, represented as a tensor. A third-order tensor $T$ is a three-dimensional array whose $(p,q,r)$-th entry is denoted by $T_{p,q,r}$. The symbol $\otimes$ denotes the standard Kronecker product: if $u$, $v$, $w$ are three vectors, then

$$(u \otimes v \otimes w)_{p,q,r} := u_p \, v_q \, w_r. \tag{8}$$

A tensor of the form $u \otimes v \otimes w$ is referred to as a rank-one tensor. The decomposition of a general tensor into a sum of its rank-one components is referred to as the *canonical polyadic (CP) decomposition* (Kolda and Bader, 2009). We will subsequently see that the graph moments can be expressed as a tensor and that the CP decomposition of the graph-moment tensor yields the model parameters and the community vectors under the mixed membership community model.
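
As a concrete illustration of rank-one tensors and sums of rank-one terms, here is a minimal sketch assuming NumPy. The function names `rank_one` and `cp_tensor` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def rank_one(u, v, w):
    """Rank-one tensor whose (p, q, r) entry is u[p] * v[q] * w[r]."""
    return np.einsum("p,q,r->pqr", u, v, w)

def cp_tensor(weights, A, B, C):
    """Sum over i of weights[i] * A[:, i] (x) B[:, i] (x) C[:, i]."""
    return np.einsum("i,pi,qi,ri->pqr", weights, A, B, C)

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0, 5.0])
w = np.array([6.0, 7.0])
R = rank_one(u, v, w)                     # shape (2, 3, 2)

# A tensor built from two rank-one components (columns of A, B, C).
weights = np.array([0.7, 0.3])
A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]])
C = np.array([[2.0, 1.0], [1.0, 3.0]])
T = cp_tensor(weights, A, B, C)           # shape (2, 3, 2)
```

A CP decomposition goes in the opposite direction: given only `T`, it seeks the weights and the columns of `A`, `B`, and `C`.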

#### 2.2.1 Graph moments under Stochastic Block Model

We first analyze the graph moments in the special case of a stochastic block model (i.e., $\alpha_0 \to 0$ in the Dirichlet prior in (6)) and then extend the analysis to the general mixed membership model. We provide explicit expressions for the graph moments corresponding to edge counts and 3-star counts. We later establish in Section 3 that these moments are sufficient to learn the community memberships of the nodes and the model parameters of the block model.

##### 3-star counts:

The primary quantity of interest is a third-order tensor which counts the number of 3-stars. A 3-star is a star graph with three leaves; we refer to the internal node of the star as its “head”, and denote the structure by $x \to \{a, b, c\}$ (see figure 1).
We partition the network into four^{9}^{9}9For sample complexity analysis, we require dividing the graph into more than four partitions to deal with statistical dependency issues, and we outline it in Section 3. parts and consider 3-stars such that each node in the 3-star belongs to a different partition. This is necessary to obtain a simple form of the moments, based on the conditional independence assumptions of the block model; see Proposition 2.1. Specifically, consider^{10}^{10}10To establish our theoretical guarantees, we assume that the partitions are randomly chosen and are of size $\Theta(n)$. a partition $A, B, C, X$ of the network. We count the number of 3-stars from $X$ to $A, B, C$ and our quantity of interest is

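$$\mathrm{T}_{X \to \{A,B,C\}} := \frac{1}{|X|} \sum_{i \in X} \left[ G_{i,A}^\top \otimes G_{i,B}^\top \otimes G_{i,C}^\top \right], \tag{9}$$

where $\otimes$ is the Kronecker product, defined in (8), and $G_{i,A}$ is the row vector supported on the set of neighbors of $i$ belonging to set $A$. $\mathrm{T}_{X \to \{A,B,C\}} \in \mathbb{R}^{|A| \times |B| \times |C|}$ is a third order tensor, and an element of the tensor is given by

$$\mathrm{T}_{X \to \{A,B,C\}}(a,b,c) = \frac{1}{|X|} \sum_{i \in X} G(i,a)\, G(i,b)\, G(i,c), \tag{10}$$

which is the normalized count of the number of 3-stars with leaves $a, b, c$ such that its “head” is in set $X$.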
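
The normalized 3-star count can be computed directly from an adjacency matrix. The following is a minimal sketch assuming NumPy, a dense 0/1 adjacency matrix `G` whose entry `G[i, a]` indicates an edge from `i` to `a`, and index lists for the four parts; the function name `three_star_tensor` and the toy graph are illustrative assumptions.

```python
import numpy as np

def three_star_tensor(G, X, A, B, C):
    """Normalized 3-star counts with heads in X and leaves in A, B, C.

    Entry (a, b, c) equals (1/|X|) * sum over heads i in X of
    G[i, A[a]] * G[i, B[b]] * G[i, C[c]].
    """
    G_XA = G[np.ix_(X, A)].astype(float)
    G_XB = G[np.ix_(X, B)].astype(float)
    G_XC = G[np.ix_(X, C)].astype(float)
    return np.einsum("ia,ib,ic->abc", G_XA, G_XB, G_XC) / len(X)

# Toy graph on 5 nodes: heads X = {0, 1}, one leaf in each of A, B, C.
G = np.zeros((5, 5), dtype=int)
G[0, [2, 3, 4]] = 1      # node 0 is the head of a full 3-star
G[1, [2, 3]] = 1         # node 1 is missing the edge to node 4
T = three_star_tensor(G, X=[0, 1], A=[2], B=[3], C=[4])
print(T[0, 0, 0])        # fraction of heads forming a 3-star on (2, 3, 4)
```

For large networks one would avoid materializing the full tensor and instead work with its multilinear projections; the dense form is used here only for clarity.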

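We now relate the tensor $\mathrm{T}_{X \to \{A,B,C\}}$ to the parameters of the stochastic block model, viz., the community connectivity matrix $P$ and the community probability vector $\widehat{\alpha}$, where $\widehat{\alpha}_i$ is the probability of choosing community $i$.

###### Proposition 2.1 (Moments in Stochastic Block Model).

Given partitions $A, B, C, X$, and $F := \Pi^\top P^\top$, where $P$ is the community connectivity matrix and $\Pi$ is the matrix of community membership vectors, we have

$$\mathbb{E}[G_{X,A}^\top \mid \Pi_A, \Pi_X] = F_A \Pi_X, \tag{11}$$

$$\mathbb{E}[\mathrm{T}_{X \to \{A,B,C\}} \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i \in [k]} \widehat{\alpha}_i \, (F_A)_i \otimes (F_B)_i \otimes (F_C)_i, \tag{12}$$

where $\widehat{\alpha}_i$ is the probability for a node to select community $i$, and $(F_A)_i$ denotes the $i$-th column of $F_A$.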

##### Remark 1 (Linear model):

In Equation (11), we see that the edge generation occurs under a linear model, and more precisely, the matrix is a linear map which takes a community vector to a neighborhood vector in expectation.

##### Remark 2 (Identifiability under third order moments):

Note the form of the -star count tensor in (12). It provides a CP decomposition of since each term in the summation, viz., , is a rank one tensor. Thus, we can learn the matrices and the vector through CP decomposition of tensor . Once these parameters are learnt, learning the communities is straight-forward under exact moments: by exploiting (11), we find as

Similarly, we can consider another tensor consisting of -stars from to , and obtain matrices and through a CP decomposition, and so on. Once we obtain matrices and for the entire set of nodes in this manner, we can obtain the community connectivity matrix , since . Thus, in principle, we are able to learn all the model parameters ( and ) and the community membership matrix under the stochastic block model, given exact moments. This establishes identifiability of the model given moments up to third order and forms a high-level approach for learning the communities. When only samples are available, we establish that the empirical versions are close to the exact moments considered above, and we modify the basic learning approach to obtain robust guarantees. See Section 3 for details.

##### Remark 3 (Significance of conditional independence relationships):

The main property exploited in proving the tensor form in (12) is the conditional-independence assumption under the stochastic block model: the realization of the edges in each -star, say in , is conditionally independent given the community membership vector , when . This is because the community membership vectors are assumed to be drawn independently at the different nodes and the edges are drawn independently given the community vectors.
Considering -stars from to where form a partition ensures that this conditional independence is satisfied for all the -stars in tensor .

Proof: Recall that the probability of an edge from to given is

and and thus (11) holds. For the tensor form, first consider an element of the tensor, with ,

The equation follows from the conditional-independence assumption of the edges (assuming ). Now taking expectation over the nodes in , we have

where the last step follows from the fact that with probability and the result holds when . Recall that denotes the column of (since ). Collecting all the elements of the tensor, we obtain the desired result.
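
The moment forms for the block model can be checked by simulation. The sketch below, assuming NumPy, samples many head nodes with independent community labels, draws their edges to three small leaf sets, and compares the empirical 3-star tensor with the sum of rank-one terms built from the columns of $\Pi_S^\top P^\top$. The connectivity matrix, community probabilities, leaf labels, sample size, and helper names are all illustrative assumptions.

```python
import numpy as np

def F_matrix(P, labels, k):
    """Pi_S^T P^T for one-hot memberships; entry (a, j) equals P[j, labels[a]]."""
    Pi_S = np.eye(k)[:, labels]          # k x |S| matrix of one-hot columns
    return Pi_S.T @ P.T                  # |S| x k

def sample_edges(P, head_comm, labels, rng):
    """Independent Bernoulli edges from each head to each leaf, plus their probabilities."""
    probs = P[head_comm][:, labels]      # |X| x |S|, entry P[c_head, c_leaf]
    return (rng.random(probs.shape) < probs).astype(float), probs

rng = np.random.default_rng(0)
k = 3
P = np.array([[0.9, 0.2, 0.1],
              [0.2, 0.8, 0.3],
              [0.1, 0.3, 0.7]])          # community connectivity (assumed values)
w = np.array([0.5, 0.3, 0.2])            # community probabilities (assumed values)
labels_A = np.array([0, 1, 2])           # fixed leaf communities (assumed values)
labels_B = np.array([1, 1, 0])
labels_C = np.array([2, 0, 2])

n_heads = 200_000
head_comm = rng.choice(k, size=n_heads, p=w)

G_A, probs_A = sample_edges(P, head_comm, labels_A, rng)
G_B, _ = sample_edges(P, head_comm, labels_B, rng)
G_C, _ = sample_edges(P, head_comm, labels_C, rng)

# Empirical normalized 3-star counts over the sampled heads.
T_emp = np.einsum("ia,ib,ic->abc", G_A, G_B, G_C) / n_heads

# Sum of rank-one terms weighted by the community probabilities.
F_A, F_B, F_C = (F_matrix(P, lab, k) for lab in (labels_A, labels_B, labels_C))
T_cp = np.einsum("j,aj,bj,cj->abc", w, F_A, F_B, F_C)

max_gap = np.abs(T_emp - T_cp).max()
print(max_gap)
```

The gap shrinks as the number of heads grows, which is the informal content of the concentration results for empirical moments.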

#### 2.2.2 Graph Moments under Mixed Membership Dirichlet Model

We now analyze the graph moments for the general mixed membership Dirichlet model. Instead of the raw moments (i.e. edge and 3-star counts), we consider modified moments to obtain similar expressions as in the case of the stochastic block model.

Let denote a vector which gives the normalized count of edges from to :

(13) |

We now define a modified adjacency matrix^{11}^{11}11To compute the modified moments , and , we need to know the value of the scalar , which is the concentration parameter of the Dirichlet distribution and is a measure of the extent of overlap between the communities. We assume its knowledge here.
as

(14) |

In the special case of the stochastic block model , is the submatrix of the adjacency matrix . Similarly, we define modified third-order statistics,

(15) |

and it reduces to (a scaled version of) the -star count defined in (9) for the stochastic block model . The modified adjacency matrix and the -star count tensor can be viewed as a form of “centering” of the raw moments which simplifies the expressions for the moments. The following relationships hold between the modified graph moments , and the model parameters and of the mixed membership model.

###### Proposition 2.2 (Moments in Mixed Membership Model).

Given partitions and and , as in (14) and (15), normalized Dirichlet concentration vector , and , where is the community connectivity matrix and is the matrix of community memberships, we have

(16) | ||||

(17) |

where corresponds to column of and relates to the community membership matrix as

Moreover, we have that

(18) |

##### Remark 1:

The -star count tensor is carefully chosen so that the CP decomposition of the tensor directly yields the matrices and , as in the case of the stochastic block model. Similarly, the modified adjacency matrix is carefully chosen to eliminate second-order correlation in the Dirichlet distribution and we have that

is the identity matrix. These properties will be exploited by our learning algorithm in Section 3.

##### Remark 2:

Recall that quantifies the extent of overlap among the communities. The computation of the modified moment requires the knowledge of , which is assumed to be known. Since this is a scalar quantity, in practice, we can easily tune this parameter via cross validation.

Proof: The proof is on lines of Proposition 2.1 for stochastic block models but more involved due to the form of Dirichlet moments. Recall for a mixed membership model, and , therefore . Equation (16) follows directly. For Equation (18), we note the Dirichlet moment, , when and

Along the lines of the proof of Proposition 2.1 for the block model, the expectation in (17) involves a multi-linear map of the expectation of the tensor products among other terms. Collecting these terms, we have that

is a diagonal tensor, in the sense that its -th entry is , and its -th entry is 0 when
are not all equal. With this, we have (17).

Note the nearly identical forms of the graph moments for the stochastic block model in (11), (12) and for the general mixed membership model in (16), (17). In other words, the modified moments and have similar relationships to underlying parameters as the raw moments in the case of the stochastic block model. This enables us to use a unified learning approach for the two models, outlined in the next section.

## 3 Algorithm for Learning Mixed Membership Models

The simple form of the graph moments derived in the previous section is now utilized to recover the community vectors and model parameters of the mixed membership model. The method is based on the so-called tensor power method, used to obtain a tensor decomposition. We first outline the basic tensor decomposition method below and then demonstrate how the method can be adapted to learning using the graph moments at hand. We first analyze the simpler case when exact moments are available in Section 3.2 and then extend the method to handle empirical moments computed from the network observations in Section 3.3.

### 3.1 Overview of Tensor Decomposition Through Power Iterations

In this section, we review the basic method for tensor decomposition based on power iterations for a special class of tensors, viz., symmetric orthogonal tensors. Subsequently, in Sections 3.2 and 3.3, we modify this method to learn the mixed membership model from the graph moments described in the previous section. For details on the tensor power method, refer to Anandkumar et al. (2012a); Kolda and Mayo (2011).

Recall that a third-order tensor $T$ is a three-dimensional array and we use $T_{p,q,r}$ to denote the $(p,q,r)$-th entry of the tensor $T$. The standard symbol $\otimes$ is used to denote the Kronecker product, and $u \otimes v \otimes w$ is a rank one tensor. The decomposition of a tensor into its rank one components is called the CP decomposition.

##### Multi-linear maps:

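We can view a tensor $T \in \mathbb{R}^{d \times d \times d}$ as a multilinear map in the following sense: for a set of matrices $\{V_i \in \mathbb{R}^{d \times m_i} : i \in [3]\}$, the $(i_1, i_2, i_3)$-th entry in the three-way array representation of $T(V_1, V_2, V_3) \in \mathbb{R}^{m_1 \times m_2 \times m_3}$ is

$$[T(V_1, V_2, V_3)]_{i_1, i_2, i_3} := \sum_{j_1, j_2, j_3 \in [d]} T_{j_1, j_2, j_3} \, [V_1]_{j_1, i_1} \, [V_2]_{j_2, i_2} \, [V_3]_{j_3, i_3}.$$

The term multilinear map arises from the fact that the above map is linear in each of the coordinates, e.g. if we replace $V_1$ by $a V_1 + b W_1$ in the above equation, where $W_1$ is a matrix of appropriate dimensions, and $a, b$ are any scalars, the output is a linear combination of the outputs under $V_1$ and $W_1$ respectively. We will use the above notion of multi-linear transforms to describe various tensor operations. For instance, $T(I, I, v)$ yields a matrix, $T(I, v, v)$, a vector, and $T(v, v, v)$, a scalar.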
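
The multilinear map is a single contraction and can be written compactly. The following is a minimal sketch assuming NumPy; `multilinear` is a hypothetical helper name, and the rank-one test tensor is chosen only so that the outputs are easy to verify by hand.

```python
import numpy as np

def multilinear(T, V1, V2, V3):
    """Contract mode 1 of T with V1, mode 2 with V2, and mode 3 with V3."""
    return np.einsum("abc,ai,bj,ck->ijk", T, V1, V2, V3)

d = 3
u = np.array([1.0, 2.0, 2.0]) / 3.0          # unit vector
T = np.einsum("a,b,c->abc", u, u, u)         # rank-one symmetric tensor
v = np.array([1.0, 0.0, 0.0])
I = np.eye(d)
v_col = v.reshape(d, 1)                      # a vector is a d x 1 matrix

as_matrix = multilinear(T, I, I, v_col)[:, :, 0]         # d x d matrix
as_vector = multilinear(T, I, v_col, v_col)[:, 0, 0]     # length-d vector
as_scalar = multilinear(T, v_col, v_col, v_col)[0, 0, 0] # scalar
```

For this rank-one tensor, contracting two modes with `v` returns `u` scaled by the squared inner product of `u` and `v`, which is the mechanism behind the power iteration discussed later.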

##### Symmetric tensors and orthogonal decomposition:

A special class of tensors are the symmetric tensors $T \in \mathbb{R}^{d \times d \times d}$, which are invariant to permutation of the array indices. Symmetric tensors have a CP decomposition of the form

$$T = \sum_{i \in [r]} \lambda_i \, v_i \otimes v_i \otimes v_i = \sum_{i \in [r]} \lambda_i \, v_i^{\otimes 3}, \tag{19}$$

where $r$ denotes the tensor CP rank and we use the notation $v^{\otimes 3} := v \otimes v \otimes v$. It is convenient to first analyze methods for decomposition of symmetric tensors and we then extend them to the general case of asymmetric tensors.

Further, a sub-class of symmetric tensors are those which possess a decomposition into orthogonal components, i.e. the vectors $v_i \in \mathbb{R}^d$ are orthogonal to one another in the above decomposition in (19) (without loss of generality, we assume that the vectors $\{v_i\}$ are orthonormal in this case). An orthogonal decomposition implies that the tensor rank $r \leq d$, and there are tractable methods for recovering the rank-one components in this setting. We limit ourselves to this setting in this paper.

##### Tensor eigen analysis:

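For symmetric tensors possessing an orthogonal decomposition of the form in (19), each pair $(\lambda_i, v_i)$, for $i \in [r]$, can be interpreted as an eigen-pair for the tensor $T$, since

$$T(I, v_i, v_i) = \sum_{j \in [r]} \lambda_j \langle v_i, v_j \rangle^2 \, v_j = \lambda_i v_i, \quad \forall\, i \in [r],$$

due to the fact that $\langle v_i, v_j \rangle = \delta_{i,j}$. Thus, the vectors $\{v_i\}_{i \in [r]}$ can be interpreted as fixed points of the map

$$v \mapsto \frac{T(I, v, v)}{\|T(I, v, v)\|}, \tag{20}$$

where $\|\cdot\|$ denotes the spectral norm (and $\|T(I, v, v)\|$ is a vector norm), and is used to normalize the vector $v$ in (20).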

##### Basic tensor power iteration method:

A straightforward approach to computing the orthogonal decomposition of a symmetric tensor is to iterate according to the fixed-point map in (20) with an arbitrary initialization vector. This is referred to as the tensor power iteration method. Additionally, it is known that the vectors $\{v_i\}$ are the only stable fixed points of the map in (20). In other words, the set of initialization vectors which converge to vectors other than $\{v_i\}$ is of measure zero. This ensures that we obtain the correct set of vectors through power iterations and that no spurious answers are obtained. See (Anandkumar et al., 2012b, Thm. 4.1) for details. Moreover, after an approximate fixed point is obtained (after many power iterations), the estimated eigen-pair can be subtracted out (i.e., deflated) and subsequent vectors can be similarly obtained through power iterations. Thus, we can obtain all the stable eigen-pairs, which are the components of the orthogonal tensor decomposition. The method needs to be suitably modified when the tensor is perturbed (e.g. as in the case when empirical moments are used) and we discuss it in Section 3.3.
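
The basic iteration with deflation can be sketched in a few lines. This is a minimal illustration for an exactly orthogonal symmetric tensor with positive eigenvalues, assuming NumPy; the function names, the fixed iteration count, and the random initialization scheme are assumptions made for the example, and it omits the robustness modifications needed for perturbed tensors.

```python
import numpy as np

def power_iteration(T, n_iter=100, seed=0):
    """Iterate v <- T(I, v, v) / ||T(I, v, v)|| from a random unit vector."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum("abc,b,c->a", T, v, v)
        v /= np.linalg.norm(v)
    lam = float(np.einsum("abc,a,b,c->", T, v, v, v))   # eigenvalue T(v, v, v)
    return lam, v

def orthogonal_decomposition(T, rank, n_iter=100, seed=0):
    """Recover eigen-pairs one at a time, deflating after each."""
    residual = T.copy()
    pairs = []
    for j in range(rank):
        lam, v = power_iteration(residual, n_iter, seed + j)
        pairs.append((lam, v))
        residual = residual - lam * np.einsum("a,b,c->abc", v, v, v)
    return pairs

# Build an orthogonal symmetric tensor with known components.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # orthonormal columns
lams = np.array([3.0, 2.0, 1.5, 1.0])
T = np.einsum("i,ai,bi,ci->abc", lams, V, V, V)

pairs = orthogonal_decomposition(T, rank=4)
recovered = sorted(lam for lam, _ in pairs)
```

Which component a given run converges to depends on the initialization, so the eigen-pairs are recovered in no particular order.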

### 3.2 Learning Mixed Membership Models Under Exact Moments

We first describe the learning approach when exact moments are available. In Section 3.3, we suitably modify the approach to handle perturbations, which are introduced when only empirical moments are available.

We now employ the tensor power method described above to obtain a CP decomposition of the graph moment tensor in (15). We first describe a “symmetrization” procedure to convert the graph moment tensor to a symmetric orthogonal tensor through a multi-linear transformation of the tensor. We then employ the power method to obtain a symmetric orthogonal decomposition. Finally, the original CP decomposition is obtained by reversing the multi-linear transform of the symmetrization procedure. This yields a guaranteed method for obtaining the decomposition of the graph moment tensor under exact moments. We note that this symmetrization approach has been earlier employed in other contexts, e.g. for learning hidden Markov models (Anandkumar et al., 2012b, Sec. 3.3).

##### Reduction of the graph-moment tensor to symmetric orthogonal form (Whitening):

Recall from Proposition 2.2 that the modified 3-star count tensor has a CP decomposition as

We now describe a symmetrization procedure to convert to a symmetric orthogonal tensor through a multi-linear transformation using the modified adjacency matrix , defined in (14). Consider the singular value decomposition (SVD) of the modified adjacency matrix under exact moments:

Define and similarly define and using the corresponding matrices and respectively. Now define

(21) |

and similarly define . We establish that a multilinear transformation (as defined in (3.1)) of the graph-moment tensor using matrices and results in a symmetric orthogonal form.

###### Lemma 3.1 (Orthogonal Symmetric Tensor).

Assume that the matrices and have rank , where is the number of communities. We have an orthogonal symmetric tensor form for the modified -star count tensor in (15) under a multilinear transformation using matrices and :

(22) |

where and

is an orthogonal matrix, given by

(23) |

##### Remark 1:

Note that the matrix orthogonalizes under exact moments, and is referred to as a whitening matrix. Similarly, the matrices and consist of whitening matrices and , and in addition, the matrices and serve to symmetrize the tensor. We can interpret as the stable eigen-pairs of the transformed tensor (henceforth, referred to as the whitened and symmetrized tensor).
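
The idea behind whitening is easiest to see in a simplified symmetric setting. The sketch below, assuming NumPy, is not the asymmetric procedure of this section: it assumes a single factor matrix `A` with linearly independent columns, weights `w`, a second moment `M2` equal to the weighted sum of outer products of the columns, and a third moment `T` equal to the weighted sum of their third tensor powers. All names and values are illustrative assumptions.

```python
import numpy as np

def whiten(M2, k):
    """Return W (d x k) with W^T M2 W = I_k, from the top-k eigen-decomposition of M2."""
    eigvals, eigvecs = np.linalg.eigh(M2)
    top = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, top] / np.sqrt(eigvals[top])

def whitened_tensor(T, W):
    """Multilinear transform T(W, W, W)."""
    return np.einsum("abc,ai,bj,ck->ijk", T, W, W, W)

rng = np.random.default_rng(0)
d, k = 6, 3
A = rng.random((d, k))                      # non-orthogonal components (assumed)
w = np.array([0.5, 0.3, 0.2])               # positive weights (assumed)

M2 = (A * w) @ A.T                          # sum_i w_i a_i a_i^T
T = np.einsum("i,ai,bi,ci->abc", w, A, A, A)

W = whiten(M2, k)
T_w = whitened_tensor(T, W)                 # k x k x k

# Whitened components: phi_i = sqrt(w_i) * W^T a_i, which are orthonormal.
Phi = (W.T @ A) * np.sqrt(w)
T_expected = np.einsum("i,ai,bi,ci->abc", w ** -0.5, Phi, Phi, Phi)
```

In this simplified setting the whitened tensor is orthogonal and symmetric with eigenvalues equal to the inverse square roots of the weights, so the power iteration applies to it directly; the original components are then recovered by undoing the transform.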

##### Remark 2:

The full rank assumption on matrix implies that , and similarly . Moreover, we require the community connectivity matrix to be of full rank^{12}^{12}12In the work of McSherry (2001), where spectral clustering for stochastic block models is analyzed, rank deficient is allowed as long as the neighborhood vectors generated by any pair of communities are sufficiently different. On the other hand, our method requires to be full rank. We argue that this is a mild restriction since we allow for mixed memberships while McSherry (2001) limit to the stochastic block model. (which is a natural non-degeneracy condition). In this case, we can reduce the graph-moment tensor to a -rank orthogonal symmetric tensor, which has a unique decomposition.
This implies that the mixed membership model is identifiable using -star and edge count moments, when the network size , matrix is full rank and the community membership matrices each have rank . On the other hand, when only empirical moments are available, roughly, we require the network size (where is related to the extent of overlap between the communities) to provide guaranteed learning of the community membership and model parameters. See Section 4 for a detailed sample analysis.

Proof: Recall that the modified adjacency matrix satisfies

From the definition of above, we see that it has rank when has rank . Using Sylvester’s rank inequality, we have that the rank of is at least