1 Introduction
Uncovering latent network structure is an important research area in network modeling and has a long history [33, 7]. For a network with $p$ nodes, traditional approaches usually assume a single adjacency matrix, either binary or real-valued, that quantifies the connection intensity between nodes, and aim to learn the community structure from it. For example, in the Stochastic Block Model (SBM) [17] we assume that nodes within a group are connected with one probability, while nodes across groups are connected with a second, smaller probability. In information diffusion we observe the propagation of information among nodes and aim to recover the underlying connections between nodes [28, 14, 13]. In time-varying networks we allow the connections and parameters to change over time [23, 2]. In this paper, we consider the case where we have a sequence of information/event/collaboration instances with different topics, and we observe a noisy adjacency matrix for each of them. The connection between nodes varies under each topic distribution, and this cannot be captured by a single adjacency matrix. For example, each researcher has her own research interests and would collaborate with others only on the areas they are both interested in. Specifically, suppose researcher 1 is interested in computational biology and information theory; researcher 2 is interested in computational biology and nonparametric statistics; researcher 3 is interested in information theory only. Then if researcher 1 wants to work on computational biology, she would collaborate with researcher 2, while if the topic is information theory, she would collaborate with researcher 3. As another example, suppose student 1 is interested in music and sports while student 2 is interested in music and chess. If the topic of a university event is music, then there will be an edge between these two students; however, if the topic of the event is sports or chess, then there would not be an edge between them.
Intuitively, for a specific information/event/collaboration, there will be an edge between two nodes if and only if they are both interested in the topic of this information/event/collaboration. In this paper we model this intuition by giving each node two node-topic vectors: one influence vector (how authoritative the node is on each topic) and one receptivity vector (how susceptible the node is on each topic).
In addition, each information/event/collaboration is associated with a distribution on topics. The influence and receptivity vectors are fixed, but different topic distributions result in different adjacency matrices among nodes. In this paper we consider both the case where the topic distribution is known and the case where it is unknown, and provide algorithms to estimate the node-topic structure with theoretical guarantees on the estimation error. In particular, we show that our algorithm converges to the true values up to statistical error. Our node-topic structure is easier to interpret than a large adjacency matrix among nodes, and the result can be used for targeted advertising or recommendation systems.
Notation
In this paper we use $p$ to denote the number of nodes in the network; we assume there are $K$ topics in total, and we observe $n$ adjacency matrices under different topic distributions. We use subscript $i \in \{1, \dots, n\}$ to index samples/observations; subscripts $j, \ell \in \{1, \dots, p\}$ to index nodes; and subscript $k \in \{1, \dots, K\}$ to index topics. For any matrix $A$, we use $\|A\|_0$ to denote the number of nonzero elements of $A$. Also, for any $p$, $I_p$ is the identity matrix with dimension $p$.
2 Model
Our model to capture the node-topic structure in networks is built on the intuition that, for a specific information/event/collaboration, there would be an edge between two nodes if they are interested in similar topics that are also shared with the information/event/collaboration. Furthermore, the connection is directed: an edge from node 1 to node 2 is more likely to exist if node 1 is influential/authoritative in the topic and node 2 is receptive/susceptible to the topic. For example, an eminent professor would have a large influence value (but maybe a small receptivity value) on his/her research area, while a high-producing young researcher would have a large receptivity value (but maybe a small influence value) on his/her research area. Note that the notion of “topic” can be very general. For example, it can be different immune systems: different people have different kinds of immune systems, and a disease is more likely to propagate between people with similar and specific immune systems.
Our node-topic structure is parametrized by two matrices $B_1, B_2 \in \mathbb{R}^{p \times K}$. The matrix $B_1$ measures how much a node can infect others (the influence matrix) and the matrix $B_2$ measures how much a node can be infected by others (the receptivity matrix). We use $b^1_{jk}$ and $b^2_{jk}$ to denote the elements of $B_1$ and $B_2$, respectively. Specifically, $b^1_{jk}$ measures how influential node $j$ is on topic $k$, and $b^2_{jk}$ measures how receptive node $j$ is on topic $k$. We use $b^1_k$ and $b^2_k$ to denote the $k$-th columns of $B_1$ and $B_2$, respectively.
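Each observation $i$ is associated with a topic distribution $m_i = (m_{i1}, \dots, m_{iK})$ on the $K$ topics satisfying $m_{ik} \ge 0$ and $\sum_{k=1}^K m_{ik} = 1$. The choice of $K$ can be heuristic and prespecified, or alternatively can be decided by methods such as those in [18], which learn the distribution over the number of topics. For each observation $i$, the true adjacency matrix $X_i^* = (x^*_{i,j\ell})$ is given by
$$x^*_{i,j\ell} = \sum_{k=1}^K b^1_{jk}\, m_{ik}\, b^2_{\ell k}, \tag{1}$$
or in matrix form,
$$X_i^* = B_1 M_i B_2^\top, \tag{2}$$
where $M_i$ is the diagonal matrix
$$M_i = \mathrm{diag}(m_{i1}, \dots, m_{iK}). \tag{3}$$
The interpretation of the model is straightforward from (1). For an observation $i$ on topic $k$, there will be an edge $j \to \ell$ if and only if node $j$ tends to infect others on topic $k$ (large $b^1_{jk}$) and node $\ell$ tends to be infected by others on topic $k$ (large $b^2_{\ell k}$). This intuition applies to each topic and the final value is the summation over all the topics.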
If we do not consider self connections, we can zero out the diagonal elements and get
$$X_i^* = B_1 M_i B_2^\top - \mathrm{diag}(B_1 M_i B_2^\top). \tag{4}$$
For notational simplicity, we still stick to (2) for the definition of $X_i^*$ in the subsequent sections. The data consists of $n$ observations $\{X_i\}_{i=1}^n$ satisfying
$$X_i = X_i^* + E_i, \tag{5}$$
where the noise terms $E_i$ are mean 0 and independent across $i$. They are not necessarily identically distributed and can follow an unstructured distribution. The observations $X_i$ can be either real-valued or binary. For binary observations we are interested in the existence of a connection only, while for real-valued observations we are also interested in how strong the connection is, i.e., larger values indicate stronger connections.
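As a concrete illustration of (1), (2) and (5), the following minimal sketch builds the true adjacency matrix for a toy network and adds noise. It uses plain Python without dependencies. The function names `true_adjacency` and `observe`, the Gaussian noise, and all numerical values are assumptions made for illustration only; the model itself does not require Gaussian noise.

```python
import random

def true_adjacency(B1, B2, m):
    """Entry (j, l) is sum_k B1[j][k] * m[k] * B2[l][k], i.e. B1 diag(m) B2^T."""
    p, K = len(B1), len(m)
    return [[sum(B1[j][k] * m[k] * B2[l][k] for k in range(K))
             for l in range(p)] for j in range(p)]

def observe(X_star, noise_sd, rng):
    """Add independent mean-zero Gaussian noise (one possible noise choice)."""
    return [[x + rng.gauss(0.0, noise_sd) for x in row] for row in X_star]

# Toy example: 3 nodes, 2 topics (all numbers are made up for illustration).
B1 = [[1.0, 0.0], [0.0, 0.5], [0.0, 0.0]]   # influence: rows = nodes, cols = topics
B2 = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]   # receptivity: rows = nodes, cols = topics
m = [1.0, 0.0]                               # this observation is entirely on topic 0

X_star = true_adjacency(B1, B2, m)
X = observe(X_star, noise_sd=0.1, rng=random.Random(0))
```

With all mass on topic 0, the only nonzero entry of `X_star` is the edge from node 0 (influential on topic 0) to node 1 (receptive on topic 0); switching the topic distribution to `[0.0, 1.0]` moves the edge to the pair (1, 2).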
Related Works
There is a vast literature on uncovering latent network structures. The most common and basic model is the Stochastic Block Model (SBM) [17], where connections are assumed to be dense within groups and sparse across groups. Exact recovery for the SBM can be formulated via maximum likelihood, but the resulting problem is NP-hard. Many practical algorithms have been proposed for the SBM, such as the modularity method, the EM algorithm, spectral clustering, etc. [6, 21, 29, 26, 30, 32]. Many variants and extensions of the SBM have also been developed to better fit real-world network structures, including the degree-corrected block model (DCBM) [22], mixed membership stochastic block models (MMSB) [4], the degree-corrected mixed membership (DCMM) model [20], etc. Other models include information diffusion [28, 14, 13, 38], time-varying networks [23, 2], conjunctive Boolean networks [11, 19, 10], graphical models [1, 5, 36], buyer-seller networks [24, 34, 31], etc. The works [16] and [15] assume a “logistic” model based on covariates to determine whether an edge exists or not. However, most of the existing work focuses on a single adjacency matrix and ignores the node-topic structure. In [35] the authors propose a node-topic model for the information diffusion problem, but it requires the topic distribution to be known and lacks theoretical guarantees.
In [3] the authors study multiple adjacency matrices, but the approach still falls into the SBM framework. The number of blocks needs to be predefined (the performance is sensitive to this value) and the output is the block information. In contrast, our model outputs the (numeric) influence-receptivity information for each node, and these nodes do not need to form blocks. Also, their work does not utilize the topic information.
In terms of topic-based network inference, a closely related work is [9], where the authors use $K$ adjacency matrices to describe the network structure. However, it ignores the node-topic structure and can only deal with the case where the topic distributions are known, while our method is able to learn the topic distribution and the network structure simultaneously. In Sections 6 and 7 we show that our method outperforms this model on both synthetic and real datasets.
Another closely related work is [8], where the authors propose a graph embedding model that also gives each node two $K$-dimensional “embedding” vectors. However, our model differs in the following respects: 1. The topic information in our model is easier to interpret than the “embedding” vectors. The whole framework of our model is more interpretable: we know all the topic information and the topics of interest for each node. 2. We provide a generative model and a thorough theoretical result (error analysis). 3. The graph embedding model focuses on only one observation, while our model focuses on $n$ observations, each having a different topic distribution. In our model, the influence and receptivity vectors interact with the topic information, which the graph embedding model cannot handle.
If we add up all the adjacency matrices into a single matrix, then the model is similar to the mixed membership stochastic block model (MMSB [4]), in which the middle matrix of the factorization can be non-diagonal. Compared to MMSB, our model allows for asymmetry by considering “influence” and “receptivity”; our model allows information with different topics to have different adjacency matrices; also, our model can be used to predict a future adjacency matrix given the topics. Finally, when we have $n$ adjacency matrices, it is usually better to analyze them individually instead of adding them up, which may lead to information loss.
3 Optimization
Denote $\Theta_k = b^1_k {b^2_k}^\top$; with some abuse of notation we can rewrite the loss function (6) as
$$f(\Theta) = f(\Theta_1, \dots, \Theta_K) = \frac{1}{2n} \sum_{i=1}^n \Big\| X_i - \sum_{k=1}^K m_{ik} \Theta_k \Big\|_F^2. \tag{8}$$
From (8) we can see that solving for $B_1, B_2$ is equivalent to solving a rank-1 matrix factorization problem on each $\Theta_k$. This model is therefore not identifiable in $B_1$ and $B_2$: if we multiply column $k$ of $B_1$ by some scalar $\gamma$ and multiply column $k$ of $B_2$ by $1/\gamma$, the matrix $B_1 M_i B_2^\top$ remains unchanged for any $i$, since $\Theta_k$ does not change. Hence the loss function also remains unchanged. Therefore we need an additional regularization term to ensure a unique solution. To address this issue, we propose the following two alternative regularization terms.

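The first regularization term is an $\ell_1$ penalty on $B_1$ and $B_2$. We define the following norm
$$\lambda \|B_1 + B_2\|_{1,1} = \lambda \sum_{j,k} \big( b^1_{jk} + b^2_{jk} \big), \tag{9}$$
where $\lambda$ is a tuning parameter. To see why this penalty ensures a unique solution, we focus on column $k$ only. After rescaling $b^1_k$ by $\gamma > 0$ and $b^2_k$ by $1/\gamma$, the term we want to minimize is
$$\gamma \|b^1_k\|_1 + \frac{1}{\gamma} \|b^2_k\|_1. \tag{10}$$
In order to minimize (10) we should select $\gamma$ such that the two terms in (10) are equal. In other words, the column sums of $B_1$ and $B_2$ are equal.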

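The second regularization term is borrowed from the matrix factorization literature and is defined as
$$g(B_1, B_2) = \frac{\lambda}{4} \sum_{k=1}^K \big( \|b^1_k\|_2^2 - \|b^2_k\|_2^2 \big)^2. \tag{11}$$
This regularization term forces the 2-norm of each column of $B_1$ and $B_2$ to be the same.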
Both regularization terms force the columns of $B_1$ and $B_2$ to be balanced. Intuitively, this means that, for each topic $k$, the total magnitudes of “influence” and “receptivity” are the same. This acts like a conservation law: the total amount of output should be equal to the total amount of input. At the minimizer, the second regularization term is 0, and therefore we can pick any $\lambda > 0$.
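The balancing argument behind (10) can be checked numerically. The sketch below is a plain-Python illustration under the assumption that the two columns are nonnegative, so that their $\ell_1$ norms are simply their sums; the helper names `rescaled_l1` and `balancing_gamma` and the numbers are hypothetical.

```python
import math

def rescaled_l1(b1, b2, gamma):
    """l1 norm of (gamma * b1, b2 / gamma) for nonnegative columns b1, b2."""
    return gamma * sum(b1) + sum(b2) / gamma

def balancing_gamma(b1, b2):
    """Minimizer of rescaled_l1 over gamma > 0 (follows from the AM-GM inequality)."""
    return math.sqrt(sum(b2) / sum(b1))

b1 = [1.0, 3.0]    # influence column for one topic (made-up values)
b2 = [4.0, 12.0]   # receptivity column for the same topic

gamma = balancing_gamma(b1, b2)
new_b1 = [gamma * x for x in b1]
new_b2 = [x / gamma for x in b2]

# The rank-1 matrix b1 b2^T is unchanged by the rescaling,
# but the column sums of the rescaled pair are now equal.
same_product = all(
    math.isclose(x * y, u * v)
    for x, u in zip(b1, new_b1)
    for y, v in zip(b2, new_b2)
)
```

Any other choice of `gamma` leaves the product of the two columns untouched but gives a strictly larger penalty, which is why the penalty singles out one representative of each rescaling family.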
The first regularization term introduces bias, but it encourages sparse solutions; the second regularization term does not introduce bias, but we need an additional hard thresholding step to get sparsity. Experimentally, both regularizations work; theoretically, the loss function (6) is nonconvex in $B_1$ and $B_2$, hence proving theoretical results is much harder for the first regularization term. Therefore our theoretical results focus on the second alternative proposed above. The final optimization problem is given by
$$\min_{B_1 \ge 0,\; B_2 \ge 0} \; f(B_1, B_2) + g(B_1, B_2), \tag{12}$$
where $f$ is the loss function (6) and $g$ is the regularization term (11).
Initialization.
We initialize by solving the convex relaxation of problem (8) without the rank-1 constraint on $\Theta_k$, and apply a rank-1 SVD to each estimated $\hat\Theta_k$, i.e., we keep only the largest singular value: $\hat\Theta_k \approx \sigma_k u_k v_k^\top$. The initialization is given by $b^{1,(0)}_k = \sigma_k^{1/2} u_k$ and $b^{2,(0)}_k = \sigma_k^{1/2} v_k$ for each $k$. Because the relaxation is convex, we can find the global minimum of problem (8) using gradient descent.
Algorithm.
After the initialization, we alternately apply the proximal gradient method [27] to $B_1$ and $B_2$ until convergence. In practice, each node would be interested in only a few topics and hence we would expect $B_1$ and $B_2$ to be sparse. To encourage sparsity we need an additional hard thresholding step on $B_1$ and $B_2$. The overall procedure is given in Algorithm 1. The hard thresholding operation with sparsity level $s$ keeps the $s$ largest elements of its argument and zeros out the others; the nonnegative projection keeps all positive values and zeros out the others.
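The two operations used in the update can be sketched as follows. This is a minimal plain-Python illustration, not the authors' implementation: the gradient is assumed to be supplied by the caller, the function names are hypothetical, and the order in which the nonnegative projection and the hard thresholding are applied is an assumption.

```python
def positive_part(B):
    """Keep positive entries, set the rest to zero."""
    return [[max(x, 0.0) for x in row] for row in B]

def hard_threshold(B, s):
    """Keep the s entries of largest magnitude, zero out the others."""
    ranked = sorted(
        ((abs(x), j, k) for j, row in enumerate(B) for k, x in enumerate(row)),
        reverse=True,
    )
    keep = {(j, k) for _, j, k in ranked[:s]}
    return [[x if (j, k) in keep else 0.0 for k, x in enumerate(row)]
            for j, row in enumerate(B)]

def thresholded_step(B, grad, step_size, s):
    """Gradient step, then nonnegative projection, then hard thresholding."""
    moved = [[x - step_size * g for x, g in zip(row_b, row_g)]
             for row_b, row_g in zip(B, grad)]
    return hard_threshold(positive_part(moved), s)

# Toy iterate and gradient (made-up numbers).
B = [[0.9, 0.1], [0.2, 0.5]]
grad = [[0.0, 0.3], [0.1, 0.0]]
B_next = thresholded_step(B, grad, step_size=1.0, s=2)
```

In this toy step the entry that becomes negative is clipped to zero first, and the small surviving entry is then removed by the thresholding, leaving the two dominant node-topic weights.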
4 Theoretical result
In this section we derive the theoretical results for our algorithm. We denote by $B_1^*$ and $B_2^*$ the true values and by $\Theta_k^* = b^{1*}_k {b^{2*}_k}^\top$ the corresponding true rank-1 matrices. In this section we assume the topic distribution $M_i$ is known. The case where $M_i$ is unknown is considered in Section 5. All the detailed proofs are relegated to the Appendix. We start by stating the following two mild assumptions on the parameters of the problem.
Topic Condition (TC).
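Denote the Hessian matrix on $\Theta$ as
$$H_M = \frac{1}{n} \sum_{i=1}^n m_i m_i^\top \in \mathbb{R}^{K \times K}. \tag{13}$$
We require $H_M \succeq \mu_M I_K$ for some constant $\mu_M > 0$.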
Intuitively, this condition requires that the correlation among topic distributions in the $n$ observations cannot be too large. This makes sense because if several topics are highly correlated with each other among the $n$ observations, then clearly we cannot distinguish them. If we vectorize each $\Theta_k$, the Hessian matrix of $f(\Theta)$ with respect to $\Theta$ is a $p^2 K$ by $p^2 K$ matrix, and it can be shown that this Hessian matrix is given by $H_M \otimes I_{p^2}$, where $\otimes$ is the Kronecker product. With this condition, the objective function (8) is strongly convex in $\Theta$.
An immediate corollary of this condition is that the diagonal elements of $H_M$ must be at least $\mu_M$, i.e., for each topic $k$, we have $\frac{1}{n} \sum_{i=1}^n m_{ik}^2 \ge \mu_M$. This means that at least a constant proportion of the observed data should focus on this topic. The necessity of this condition is also intuitive: if we only get a tiny amount of data on some topic, then we cannot expect to recover the structure for that topic accurately.
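To make the Topic Condition concrete, the sketch below forms the averaged outer product of the topic distributions and reports its smallest eigenvalue for two toy datasets with two topics. It assumes that the Hessian on the rank-1 matrices is this averaged outer product, as implied by a squared loss averaged over observations; the function names are hypothetical and the closed-form eigenvalue is valid only for symmetric 2-by-2 matrices.

```python
import math

def topic_hessian(topic_dists):
    """(1/n) * sum_i m_i m_i^T for a list of topic distributions m_i."""
    n, K = len(topic_dists), len(topic_dists[0])
    return [[sum(m[a] * m[b] for m in topic_dists) / n for b in range(K)]
            for a in range(K)]

def min_eig_2x2(H):
    """Smallest eigenvalue of a symmetric 2x2 matrix (closed form)."""
    a, b, c = H[0][0], H[0][1], H[1][1]
    return (a + c) / 2 - math.sqrt(((a - c) / 2) ** 2 + b ** 2)

separated = [[1.0, 0.0], [0.0, 1.0]]   # each observation is on a single topic
identical = [[0.5, 0.5], [0.5, 0.5]]   # every observation mixes the topics equally

mu_separated = min_eig_2x2(topic_hessian(separated))
mu_identical = min_eig_2x2(topic_hessian(identical))
```

When every observation carries the same mixture, the smallest eigenvalue collapses to zero: the data then only reveal the sum of the two topic matrices, so the individual topics cannot be separated and the condition fails.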
Sparsity Condition (SC).
Both $B_1^*$ and $B_2^*$ are sparse: $\|B_1^*\|_0 = \|B_2^*\|_0 = s^*$. (We use a single $s^*$ for notational simplicity, but this is not required.)
Subspace distance.
For matrix factorization problems, it is common to measure the subspace distance because the factorization is not unique. Here, since we know that the $\Theta_k$ are exactly rank-1 and we have nonnegativity constraints on $B_1, B_2$, we do not suffer from the rotation issue (the only way to rotate a scalar is $\pm 1$, but with the nonnegativity constraint, $-1$ is impossible). Therefore the subspace distance between $B = [B_1, B_2]$ and $B^* = [B_1^*, B_2^*]$ is just defined as
$$d^2(B, B^*) = \|B_1 - B_1^*\|_F^2 + \|B_2 - B_2^*\|_F^2. \tag{14}$$
Statistical error.
Denote
(15) 
The statistical error on is defined as
(16) 
where $E_i$ is the error matrix in (5) and $\langle A, B \rangle = \mathrm{tr}(A^\top B)$ is the matrix inner product. Intuitively, this statistical error measures how much accuracy we can expect from the estimator. If we are within a distance of the order of the statistical error from the true value, then we are already optimal.
The statistical error depends on the sparsity level $s$. In practice, $s$ is a hyperparameter and one can choose it as a relatively large value to avoid missing true nonzero values. If $s$ is too large, then we include too many false positive edges. This usually does not affect performance too much, since these false positive edges tend to have small values. However, we lose some sparsity and hence interpretability. If we further assume that each node is interested in at least one but not most of the topics, then $s$ can be chosen up to a small constant factor, and in this way the effect of choosing $s$ is minimal.
In this way we transform the original problem to a standard matrix factorization problem with $K$ rank-1 matrices $\Theta_k$. A function $f$ is termed $\mu$-strongly convex and $L$-smooth if there exist constants $\mu$ and $L$ such that
$$\frac{\mu}{2} \|Y - X\|_F^2 \le f(Y) - f(X) - \langle \nabla f(X), Y - X \rangle \le \frac{L}{2} \|Y - X\|_F^2. \tag{17}$$
The objective function (8) is strongly convex and smooth in $\Theta$. Since the loss function (8) is quadratic in each $\Theta_k$, it is easy to see that the conditions are equivalent to $\mu I_K \preceq H_M \preceq L I_K$. The lower bound is satisfied according to assumption (TC) with $\mu = \mu_M$, and the upper bound is trivially satisfied with $L = 1$, because each $m_i$ lies in the simplex and hence $\|m_i\|_2 \le 1$. The following lemma quantifies the accuracy of the initialization.
Suppose are the global minimum of the convex relaxation (8), then we have
(18) 
The bound we obtain from Lemma 4 scales with and therefore can be small as long as we have enough samples. We are then ready for our main theorem. The following Theorem 4 shows that the iterates of Algorithm 1 converge linearly up to statistical error.
Suppose conditions (SC) and (TC) hold. We set the sparsity level . If the step size satisfies
(19) 
then for large enough , after iterations, we have
(20) 
for some constant and constant .
Although we focus on the simplest loss function (6), our analysis works for any general loss function $f$, as long as the initialization is good and the (restricted) strong convexity and smoothness conditions are satisfied. See [37] for more details.
For the time complexity of Algorithm 1, calculating the gradient for one sample takes $O(p^2 K)$ time, and hence averaging over all samples takes $O(n p^2 K)$ time. The initialization step involves an SVD, but we do not need to obtain the full decomposition, since for each $\hat\Theta_k$ we only need the singular vectors corresponding to the largest singular value. Finally, the number of iterations is chosen such that the optimization error in (20) has the same order as the statistical error; by linear convergence, this number is logarithmic in the inverse of the statistical error.
5 Learning network and topic distributions jointly
So far we have assumed that the topic distributions for each sample are given and fixed. However, sometimes we do not have such information. In this case we need to learn the topic distributions and the network structure simultaneously.
We denote by $m_i^*$ the true topic distribution of observation $i$ and by $M^*$ the stack of all the topic distributions. The algorithm for joint learning is simply alternating minimization on $B = (B_1, B_2)$ and $M$. For fixed $M$, the optimization on $B$ is the same as before, and can be solved using Algorithm 1. For fixed $B$, it is straightforward to see that the optimization on $M$ is separable for each $i$. For each $i$, we solve the following optimization problem to estimate $M_i$:
$$\min_{m_i} \; \big\| X_i - B_1 M_i B_2^\top \big\|_F^2 \quad \text{subject to} \quad m_i \ge 0, \; \mathbf{1}^\top m_i = 1. \tag{21}$$
This problem is convex in $M_i$ and can be easily solved using projected gradient descent. Namely, in each iteration we take a gradient step on $M_i$ and then project onto the simplex. The overall procedure is summarized in Algorithm 2. With some abuse of notation we write
$$f(\Theta, M) = \frac{1}{2n} \sum_{i=1}^n \Big\| X_i - \sum_{k=1}^K m_{ik} \Theta_k \Big\|_F^2. \tag{22}$$
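The per-observation update in (21) can be sketched with projected gradient descent. The code below is a plain-Python illustration under several assumptions: matrices are flattened into lists, the rank-1 topic matrices are passed in directly as `thetas`, the objective carries a factor of one half (which does not change the minimizer), the simplex projection uses the standard sort-based method, and the step size and iteration count are arbitrary. The names `project_simplex` and `estimate_topics` are hypothetical.

```python
def project_simplex(v):
    """Euclidean projection of v onto {m : m >= 0, sum(m) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    running_sum = 0.0
    theta = 0.0
    for r, u_r in enumerate(u, start=1):
        running_sum += u_r
        candidate = (running_sum - 1.0) / r
        if u_r - candidate > 0:
            theta = candidate          # last r that passes the test gives the shift
    return [max(x - theta, 0.0) for x in v]

def estimate_topics(x, thetas, steps=100, step_size=0.5):
    """Projected gradient descent for 0.5 * ||x - sum_k m[k] * thetas[k]||^2 over the simplex."""
    K = len(thetas)
    m = [1.0 / K] * K                  # start from the uniform distribution
    for _ in range(steps):
        fitted = [sum(m[k] * thetas[k][e] for k in range(K)) for e in range(len(x))]
        resid = [x_e - f_e for x_e, f_e in zip(x, fitted)]
        grad = [-sum(t * r for t, r in zip(thetas[k], resid)) for k in range(K)]
        m = project_simplex([m_k - step_size * g_k for m_k, g_k in zip(m, grad)])
    return m

# Two flattened 2x2 topic matrices and a noiseless observation mixing them 30/70.
thetas = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
x = [0.3, 0.0, 0.0, 0.7]
m_hat = estimate_topics(x, thetas)
```

Because the toy topic matrices are orthonormal, each iteration halves the distance to the true mixture, so `m_hat` recovers the weights 0.3 and 0.7 up to floating-point error.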
Besides the scaling issue mentioned in Section 3, the problem is now identifiable only up to a permutation of the topics. However, we can always permute $M$ to match the permutation obtained in $B$. From now on we assume that these two permutations match and ignore the permutation issue. The statistical error on $M$ is defined as
(23) 
The problem is much harder with unknown topic distribution. Similar to condition (TC), we need the following assumption on the Hessian matrix on .
Diffusion Condition (DC).
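Denote the Hessian matrix on $M$ as
$$H_\Theta = \big[ \langle \Theta^*_k, \Theta^*_{k'} \rangle \big]_{k, k' = 1}^K \in \mathbb{R}^{K \times K}, \tag{24}$$
where $\langle A, B \rangle = \mathrm{tr}(A^\top B)$ is the inner product of matrices. We require that $H_\Theta \succeq \mu_\Theta I_K$ for some constant $\mu_\Theta > 0$.
With this condition, the objective function (8) is strongly convex in $M$. The intuition is similar to that of condition (TC). We require that the $\Theta^*_k$ can be distinguished from each other.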
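The role of this condition can be illustrated with a small sketch that computes the matrix of pairwise inner products between topic matrices. It assumes, as the squared loss suggests, that the Hessian with respect to one topic distribution is this Gram matrix; the function names and the toy vectors are hypothetical, and the determinant test is a shortcut that applies to the 2-by-2 case only.

```python
def inner(A, B):
    """Matrix inner product tr(A^T B), i.e. the sum of entrywise products."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def outer(u, v):
    """Rank-1 matrix u v^T."""
    return [[u_j * v_l for v_l in v] for u_j in u]

def gram(thetas):
    """K x K matrix whose (k, k') entry is the inner product of thetas[k] and thetas[k']."""
    return [[inner(S, T) for T in thetas] for S in thetas]

def det_2x2(H):
    return H[0][0] * H[1][1] - H[0][1] * H[1][0]

# Two topics driven by different influential and receptive nodes.
distinct = [outer([1.0, 0.0], [0.0, 1.0]), outer([0.0, 1.0], [1.0, 0.0])]
# Two topics with the same rank-1 matrix: they cannot be told apart.
duplicate = [outer([1.0, 0.0], [0.0, 1.0]), outer([1.0, 0.0], [0.0, 1.0])]

H_distinct = gram(distinct)
H_duplicate = gram(duplicate)
```

A Gram matrix is always positive semidefinite, so in the 2-by-2 case a zero determinant means a zero eigenvalue: with duplicated topic matrices the weights of the two topics can be traded off freely without changing the fit.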
Initialization.
Define , , as the sample mean of , , , respectively. It is clear that . We then do rank svd on and obtain . We denote and we initialize with
(25) 
To see why this initialization works, we first build intuition for the easiest case, where for each , for each , and the columns of and are orthogonal. In this case it is easy to see that
. Note that this expression in a singular value decomposition of
since we have and the columns and columns are orthogonal. Now that is exactly rank , the best rank approximation would be itself, i.e., . By the uniqueness of singular value decomposition, as long as the singular values are distinct, we have (up to permutation) and therefore . This is exactly what we want to estimate.In order to show this is a reasonable initialization, we impose the following condition.
Orthogonality Condition (OC).
Let and
be the QR decomposition of
and , respectively. Denote as a diagonal matrix with diagonal elements . Denote where captures the diagonal elements and captures the offdiagonal elements. We require that for some constant . Moreover, we require that for some .This condition requires that and
are not too far away from orthogonal matrix, so that when doing the QR rotation, the off diagonal values of
and are not too large. The condition is trivially satisfied with . However, in general is usually a constant that does not scale with , meaning that the topic distribution among the observations is more like evenly distributed than dominated by a few topics.It is useful to point out that the condition (OC) is for this specific initialization method only. Since we are doing singular value decomposition, we end up with orthogonal vectors so we require that and are not too far away from orthogonal; since we do not know the value and use to approximate, we require that topics are not far away from evenly distributed so that this approximation is reasonable. In practice we can also use other initialization methods, for example we can do alternating gradient descent on and based on the objective function (22). This method also works reasonably well in practice.
The following lemma shows that is indeed a good initialization for .
Suppose the condition (OC) is satisfied, then the initialization satisfies
(26) 
for some constant where .
The initialization is no longer consistent. Nevertheless it is not required. With this initialization, we then follow Algorithm 2 and estimate and alternatively. Note that when estimating and , we run Algorithm 1 for large enough so that the first term in (20) is small compared to the second term. These iterations for Algorithm 1 are one iteration for Algorithm 2 and we use and to denote the iterates we obtained from Algorithm 2. Denote . We obtain the following theorem on estimation error for jointly learning.
6 Simulation
In this section we evaluate our model and algorithms on synthetic datasets. We first consider the setting where the topics are known and we consider nodes with topics. The true matrices and are generated row by row where we randomly select 13 topics for each row and set a random value generated from . All the other values are set to be 0. This gives sparsity level in expectation, and we set in the algorithm as the hard thresholding parameter. For each observation, we randomly select 13 topics and assign each selected topic a random value , and 0 otherwise. We then normalize this vector to get the topic distribution . The true value is generated according to (2). Note that is also a sparse matrix. We consider two types of observation: real valued observation and binary valued observation. For real valued observation, we generate (equivalently, set ) in the following way: first we randomly select 10% of the nonzero values in and set to 0 (miss some edges); second for each of the remaining nonzero values, we generate an independent random number and multiply with the original value (observe edges with noise); finally we randomly select 10% of the zero values in and set them as (false positive edges). For binary observations, we treat the true values in as probability of observing an edge, and generate as . For those true values greater than 1 we just set to be 1. Finally we again pick 10% false positive edges.
We vary the number of observations $n$ and compare our model with the following two state-of-the-art methods. The first method is inspired by [13]; it ignores the topic information and uses one matrix, estimated from all observations pooled together, to capture the entire dataset (termed “One matrix”). The second method is inspired by [9]; it considers the topic information and assigns each topic a matrix (termed “$K$ matrices”). However, it still ignores the node-topic structure. For this model, we ignore the rank constraint and return the matrices given by the initialization procedure. Note that the “One matrix” method has $p^2$ parameters and the “$K$ matrices” method has $p^2 K$ parameters, but our method has only $2pK$ parameters. Since we usually have $K \ll p$, we are able to use far fewer parameters to capture the network structure, and do not suffer as much from overfitting. For a fair comparison, we also apply hard thresholding to each of these matrices with a corresponding sparsity parameter. The comparison is done by evaluating the objective function on an independent test dataset (prediction error), which compares the observed value $X_i$ with the predicted value $\hat X_i$ for each test observation. The predicted values take different forms for each method. For “One matrix” it is just the estimated matrix; for “$K$ matrices” it is the weighted sum of the estimated matrices for each topic; for our model, the prediction is obtained by plugging the estimated $B_1$ and $B_2$ into (2). Figure 1 and Figure 2 show the comparison results for real-valued and binary observations, respectively. Each result is based on 20 replicates. We can see that our method has the best prediction error, since we are able to utilize the topic information and the structure among nodes and topics; the “One matrix” method completely ignores the topic information and ends up with a bad prediction error; the “$K$ matrices” method ignores the structure among nodes and topics and suffers from overfitting.
As the sample size grows, the “$K$ matrices” method will behave closer to our model in terms of prediction error, since our model is a special case of the $K$ matrices model. However, it still cannot identify the structure among nodes and topics and is hard to interpret.
We then consider the setting where the topics are unknown. We initialize and estimate $B$ and $M$ according to the procedure described in Section 5; for the “One matrix” method, the estimator is unchanged; for the “$K$ matrices” method, we estimate $\Theta$ and $M$ by the alternating gradient method on the objective function (22). All the other setups are the same as in the previous case. Figure 3 and Figure 4 show the comparison results for real-valued and binary observations, respectively. Again we see that our model performs best. These results demonstrate the superior performance of our model and algorithm compared with existing state-of-the-art methods.
Finally we check the running time of our method experimentally. Here we fix $K$ and vary $n$ and $p$. The empirical running time is given in Table 1, where we see a linear dependency on $n$ and a quadratic dependency on $p$, in line with the claim in Remark 4.
1.7  2.2  3.1  
4.1  5.2  7.5  
13.0  16.0  22.0 
7 Application to ArXiv data
In this section we evaluate our model on a real dataset. The dataset we use is the ArXiv collaboration and citation network dataset on high energy physics theory [25, 12]. This dataset covers papers uploaded to the ArXiv high energy physics theory category in the period from 1993 to 2003, and the citation network for each paper. For our experiment we treat each author as a node and each publication as an observation. For each publication $i$, we set the observation matrix $X_i$ in the following way: the component $x_{i,j\ell} = 1$ if this paper is written by author $j$ and cited by author $\ell$, and $x_{i,j\ell} = 0$ otherwise. Since each paper has only a few authors, we consider a variant of our original model as
$$X_i^* = \big( B_1 M_i B_2^\top \big) \circ A_i, \tag{28}$$
where the operator $\circ$ is the componentwise product and $A_i \in \mathbb{R}^{p \times p}$ is an indicator matrix with $a_{i,j\ell} = 1$ if $j$ is an author of this paper, and $0$ otherwise. This means that for each paper, we only consider the influence behavior of its authors.
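The masked variant in (28) can be sketched as follows. This plain-Python illustration assumes that the indicator matrix simply keeps the rows of a paper's authors and zeros out every other row; the function name `masked_adjacency` and all numbers are hypothetical.

```python
def masked_adjacency(B1, B2, m, authors):
    """B1 diag(m) B2^T with every row that does not belong to an author set to zero."""
    p, K = len(B1), len(m)
    return [[sum(B1[j][k] * m[k] * B2[l][k] for k in range(K)) if j in authors else 0.0
             for l in range(p)] for j in range(p)]

# Toy example: 3 researchers, 2 topics (made-up numbers).
B1 = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # influence
B2 = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]   # receptivity
m = [0.5, 0.5]                               # topic distribution of the paper

X_paper = masked_adjacency(B1, B2, m, authors={0})
```

Only the row of author 0 is populated, so the paper contributes information about how that author influences potential citers and says nothing about the influence of non-authors.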
For our experiment we consider the top 200 authors with about the top 10000 papers in terms of number of citations, and split the papers into a training set of 8000 and a test set of 2000. We first run topic modeling on the abstracts of all the papers and extract $K = 6$ topics as well as the topic distribution of each paper. We then treat this topic information as known, apply Algorithm 1 to the training set, and learn the two node-topic matrices. These two matrices are given in Table 2 and Table 3. The keywords of the 6 topics are shown at the head of the two tables, and the first column of each table is the name of the author.
We then compare the node-topic structure to the research interests and publications listed by the authors themselves on their websites. The comparison shows that our model is able to capture the research topics accurately. For example, Christopher Pope reports quantum gravity and string theory; Arkady Tseytlin reports quantum field theory; Emilio Elizalde reports quantum physics; Cumrun Vafa reports string theory; Ashoke Sen reports string theory and black holes as their research areas on their webpages. These are all successfully captured by our method.
Finally we compare the result with the “One matrix” and “$K$ matrices” methods on the test set. The comparison is given in Table 4 for training error, testing error, number of total parameters, and number of nonzero parameters. Since our model has far fewer parameters, it has the largest training error. However, our model has the best test error, while the other two methods do not generalize to the test set and suffer from overfitting. These results demonstrate that the topic information and node-topic structure do exist, and that our model is able to capture them.








Christopher Pope  0.359  0.468  0.318  
Arkady Tseytlin  0.223  0.565  0.25  
Emilio Elizalde  0.109  
Cumrun Vafa  0.85  0.623  0.679  0.513  
Edward Witten  0.204  0.795  0.678  1.87  
Ashok Das  0.155  0.115  1.07  
Sergei Odintsov  
Sergio Ferrara  0.297  0.889  0.345  0.457  0.453  0.249  
Renata Kallosh  0.44  0.512  0.326  0.382  
Mirjam Cvetic  0.339  0.173  0.338  
Burt A. Ovrut  0.265  0.191  0.127  0.328  0.133  
Ergin Sezgin  0.35  0.286  
Ian I. Kogan  0.193  
Gregory Moore  0.323  0.91  0.325  0.536  
I. Antoniadis  0.443  0.485  0.545  0.898  0.342  
Mirjam Cvetic  0.152  0.691  0.228  0.187  
Andrew Strominger  0.207  0.374  0.467  1.15  
Barton Zwiebach  0.16  0.222  0.383  0.236  
P.K. Townsend  0.629  0.349  0.1  
Robert C. Myers  0.439  0.28  
E. Bergshoeff  0.357  0.371  
Amihay Hanany  0.193  0.327  1.09  
Ashoke Sen  0.319  0.523  0.571 








Christopher Pope  0.477  0.794  0.59  
Arkady Tseytlin  0.704  1.16  0.312  0.487  0.119  
Emilio Elizalde  
Cumrun Vafa  0.309  0.428  0.844  0.203  0.693  
Edward Witten  0.352  0.554  0.585  0.213  0.567  
Ashok Das  0.494  0.339  0.172  
Sergei Odintsov  0.472  
Sergio Ferrara  0.423  0.59  0.664  0.776  
Renata Kallosh  0.123  0.625  0.638  0.484  0.347  
Mirjam Cvetic  0.47  0.731  0.309  
Burt A. Ovrut  0.314  0.217  0.72  0.409  0.137  
Ergin Sezgin  0.108  0.161  0.358  
Ian I. Kogan  0.357  0.382  0.546  
Gregory Moore  0.375  0.178  0.721  0.69  0.455  0.517  
I. Antoniadis  0.461  0.699  0.532  0.189  
Mirjam Cvetic  0.409  1.11  0.173  0.361  
Andrew Strominger  0.718  0.248  0.196  0.133  
Barton Zwiebach  0.308  0.204  0.356  
P.K. Townsend  0.337  0.225  0.245  0.522  
Robert C. Myers  0.364  0.956  0.545  0.139  
E. Bergshoeff  0.487  0.459  0.174  0.619  
Amihay Hanany  0.282  0.237  0.575  0.732  
Ashoke Sen  0.214  0.18  0.37 
8 Conclusion
In this paper we propose an influence-receptivity model and show how this structure can be estimated with theoretical guarantees. Experiments show superior performance of our model on synthetic and real data, compared with existing methods. This influence-receptivity model also provides much better interpretability.
There are several future directions we would like to pursue. Currently the topic information is either learned from topic modeling and fixed, or is (jointly) learned by our model, in which case we ignore the text information. It would be of interest to combine the influence-receptivity structure and topic modeling to provide more accurate results. Another extension would be to allow a dynamic influence-receptivity structure over time.
Acknowledgments
This work is partially supported by an IBM Corporation Faculty Research Fund at the University of Chicago Booth School of Business. This work was completed in part with resources provided by the University of Chicago Research Computing Center.
References
 [1] Graphical Models, volume 17 of Oxford Statistical Science Series. The Clarendon Press Oxford University Press, New York, 1996. Oxford Science Publications.
 [2] Amr Ahmed and Eric P Xing. Recovering time-varying networks of dependencies in social and biological studies. Proceedings of the National Academy of Sciences, 106(29):11878–11883, 2009.
 [3] Edo M Airoldi, Thiago B Costa, and Stanley H Chan. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems, pages 692–700, 2013.

 [4] Edoardo M Airoldi, David M Blei, Stephen E Fienberg, and Eric P Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9(Sep):1981–2014, 2008.
 [5] Rina Foygel Barber and Mladen Kolar. Rocket: Robust confidence intervals via Kendall’s tau for transelliptical graphical models. ArXiv e-prints, arXiv:1502.07641, February 2015.
 [6] Peter J Bickel and Aiyou Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.
 [7] Ronald S Burt. The network structure of social capital. Research in organizational behavior, 22:345–423, 2000.
 [8] Siheng Chen, Sufeng Niu, Leman Akoglu, Jelena Kovačević, and Christos Faloutsos. Fast, warped graph embedding: Unifying framework and one-click algorithm. arXiv preprint arXiv:1702.05764, 2017.
 [9] Nan Du, Le Song, Hyenkyun Woo, and Hongyuan Zha. Uncover topic-sensitive information diffusion networks. In Artificial Intelligence and Statistics, pages 229–237, 2013.
 [10] Zuguang Gao, Xudong Chen, and Tamer Başar. Controllability of conjunctive Boolean networks with application to gene regulation. IEEE Transactions on Control of Network Systems, 5(2):770–781, 2018.
 [11] Zuguang Gao, Xudong Chen, and Tamer Başar. Stability structures of conjunctive Boolean networks. Automatica, 89:8–20, 2018.
 [12] Johannes Gehrke, Paul Ginsparg, and Jon Kleinberg. Overview of the 2003 KDD Cup. ACM SIGKDD Explorations Newsletter, 5(2):149–151, 2003.
 [13] Manuel Gomez Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1019–1028. ACM, 2010.
 [14] Manuel Gomez-Rodriguez, Le Song, Hadi Daneshmand, and Bernhard Schölkopf. Estimating diffusion networks: Recovery conditions, sample complexity & soft-thresholding algorithm. Journal of Machine Learning Research, 2015.
 [15] Peter D Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and mathematical organization theory, 15(4):261, 2009.
 [16] Peter D Hoff, Adrian E Raftery, and Mark S Handcock. Latent space approaches to social network analysis. Journal of the american Statistical association, 97(460):1090–1098, 2002.
 [17] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109–137, 1983.

 [18] Wei-Shou Hsu and Pascal Poupart. Online Bayesian moment matching for topic modeling with unknown number of topics. In NIPS, 2016.
 [19] Abdul Salam Jarrah, Reinhard Laubenbacher, and Alan Veliz-Cuba. The dynamics of conjunctive and disjunctive Boolean network models. Bulletin of Mathematical Biology, 72(6):1425–1447, 2010.
 [20] Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Estimating network memberships by simplex vertex hunting. arXiv preprint arXiv:1708.07852, 2017.
 [21] Brian Karrer and Mark EJ Newman. Message passing approach for general epidemic models. Physical Review E, 82(1):016101, 2010.
 [22] Brian Karrer and Mark EJ Newman. Stochastic blockmodels and community structure in networks. Physical review E, 83(1):016107, 2011.
 [23] Mladen Kolar, Le Song, Amr Ahmed, and Eric P Xing. Estimating time-varying networks. The Annals of Applied Statistics, pages 94–123, 2010.
 [24] Rachel E Kranton and Deborah F Minehart. A theory of buyer-seller networks. American economic review, 91(3):485–508, 2001.
 [25] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187. ACM, 2005.
 [26] Krzysztof Nowicki and Tom A B Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American statistical association, 96(455):1077–1087, 2001.
 [27] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.
 [28] Manuel Gomez Rodriguez, David Balduzzi, and Bernhard Schölkopf. Uncovering the temporal dynamics of diffusion networks. arXiv preprint arXiv:1105.0697, 2011.
 [29] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, pages 1878–1915, 2011.
 [30] Tom AB Snijders and Krzysztof Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of classification, 14(1):75–100, 1997.
 [31] Achim Walter, Thomas Ritter, and Hans Georg Gemünden. Value creation in buyer–seller relationships: Theoretical considerations and empirical results from a supplier’s perspective. Industrial marketing management, 30(4):365–377, 2001.
 [32] Yuchung J Wang and George Y Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8–19, 1987.
 [33] Stanley Wasserman and Katherine Faust. Social network analysis: Methods and applications, volume 8. Cambridge university press, 1994.
 [34] David T Wilson. An integrated model of buyer-seller relationships. Journal of the academy of marketing science, 23(4):335–345, 1995.
 [35] Ming Yu, Varun Gupta, and Mladen Kolar. An influence-receptivity model for topic based information cascades. 2017 IEEE International Conference on Data Mining (ICDM), pages 1141–1146, 2017.
 [36] Ming Yu, Mladen Kolar, and Varun Gupta. Statistical inference for pairwise graphical models using score matching. In Advances in Neural Information Processing Systems, pages 2829–2837, 2016.
 [37] Ming Yu, Zhaoran Wang, Varun Gupta, and Mladen Kolar. Recovery of simultaneous low rank and two-way sparse coefficient matrices, a nonconvex approach. arXiv preprint arXiv:1802.06967, 2018.
 [38] Ke Zhou, Hongyuan Zha, and Le Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Artificial Intelligence and Statistics, pages 641–649, 2013.
Appendix A Technical proofs
A.1 Proof of Lemma 4.
A.2 Proof of Theorem 4.
Proof.
We apply the nonconvex optimization result in [37]. Since the initialization condition and (RSC/RSS) are satisfied for our problem according to Lemma 4, we apply Lemma 3 in [37] and obtain
(32) 
where and . Define the contraction value
(33) 
we can iteratively apply (32) for each and obtain
(34) 
which shows linear convergence up to statistical error. ∎
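The final step of this argument, unrolling a one-step contraction into linear convergence down to a statistical-error floor, can be illustrated numerically. The sketch below assumes a generic recursion of the form d_{t+1} <= beta * d_t + err with 0 <= beta < 1. The names `d0`, `beta`, and `err` and their values are hypothetical placeholders, not the constants appearing in (32)–(34) or in [37].

```python
def unroll_contraction(d0, beta, err, num_iters):
    """Iterate the worst case d_{t+1} = beta * d_t + err and return every d_t.

    d0, beta and err are illustrative placeholders (assumptions), standing in
    for an initial distance, a contraction value and a statistical-error term.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1) for a contraction")
    dists = [d0]
    for _ in range(num_iters):
        dists.append(beta * dists[-1] + err)
    return dists


def geometric_bound(d0, beta, err, t):
    """Closed-form bound beta^t * d0 + err / (1 - beta) after t steps."""
    return beta ** t * d0 + err / (1.0 - beta)


# Hypothetical values chosen only to make the behaviour visible.
dists = unroll_contraction(d0=10.0, beta=0.5, err=0.01, num_iters=30)
floor = 0.01 / (1.0 - 0.5)  # level at which the iterates plateau
```

With these illustrative values the iterates decrease geometrically and then level off near `err / (1 - beta)`, which is the sense in which the distance contracts only up to the statistical error.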
A.3 Proof of Lemma 5.
Proof.
Since is the best rank approximation for and is also rank , we have and hence
(35) 
By definition we have
(36) 
Plugging this back into (35), we obtain
(37) 
and hence
(38) 
Since is the mean value of i.i.d. errors , we have that and therefore can be made arbitrarily small with large enough . Moreover, the left-hand side of (38) is the difference of two singular value decompositions. According to matrix perturbation theory, for each we have (up to permutation)