## I Introduction

Information often spreads in cascades by following implicit links between nodes in real-world networks whose topologies may be unknown. For example the spread of viral news among blogs, or Internet memes over microblogging tools like *Twitter*, are facilitated by the inherent connectivity of the world-wide web. A celebrity may post an interesting *tweet*, which is then read and *retweeted* by her followers. An information cascade may then emerge if the followers of her followers share the tweet, and so on. The dynamics of propagation of such information over implicit networks are remarkably similar to those that govern the rapid spread of infectious diseases, leading to the so-termed *contagions* [21, 7, 1]

. Similar cascading processes have also been observed in the context of adoption of emerging fashion trends within distinct age groups, the successive firing of thousands of neurons in the brain in response to stimuli, and plummeting stock prices in global financial markets in response to a natural disaster.

Cascades are generally observed by simply recording the time when a specific website first mentioned a cascading news item, or when an infected person first showed symptoms of a disease. On the other hand, the underlying network topologies may be unknown and dynamic, with the link structure varying over time. Unveiling such dynamic network topologies is crucial for several reasons. Viral web advertising can be more effective if a small set of influential early adopters are identified through the link structure, while knowledge of the structure of hidden needle-sharing networks among communities of injecting drug users can aid formulation of policies for curbing contagious diseases. Other examples include assessment of the reliability of heavily interconnected systems like power grids, or risk exposure among investment banks in a highly inter-dependent global economy. In general, knowledge of topologies that facilitate diffusion of network processes leads to useful insights about the behavior of complex systems.

Network contagions arise due to causal interactions between nodes e.g., blogs, or disease-susceptible individuals. Structural equation models (SEMs) effectively capture such causal relationships, that are seldom revealed by symmetric correlations; see e.g., [10, 18]. Widely applied in psychometrics [15], and sociometrics [8], SEMs have also been adopted for gene network inference [5, 12], and brain connectivity studies [13]. Recently, dynamic SEMs have been advocated for tracking slowly-varying sparse social network topologies from cascade data [1].

Network topologies may sometimes “jump” between a finite number of discrete states, as manifested by sudden changes in cascade behavior. For example, an e-mail network may switch topologies from predominantly work-based connections during the week, to friend-based connections over the weekend. Connection dynamics between bloggers may switch suddenly at the peak of sports events (e.g., “Superbowl”), or presidential elections. In such settings, contemporary approaches assuming that network dynamics arise as a result of slow topology variations may yield unpredictable results. The present paper capitalizes on this prior knowledge, and puts forth a novel *switched* dynamic SEM to account for propagation of information cascades in such scenarios. The novel approach builds upon the dynamic SEM advocated in [1], where it is tacitly assumed that node infection times depend on both the switching topologies and exogenous influences e.g., external information sources, or prior predisposition to certain cascades.

The present paper draws connections to identification of hybrid systems, whose behavior is driven by interaction between continuous and discrete dynamics; see e.g., [16] and references therein. Switched linear models have emerged as a useful framework to capture piecewise linear input-output relations in control systems [2, 22, 23]. The merits of these well-grounded approaches are broadened here to temporal network inference from state-driven cascade dynamics. Although the evolution of the unknown network state sequence may be controlled by structured hidden dynamics (e.g., *hidden Markov models*), this work advocates a more general framework in which such prior knowledge is not assumed.

Network inference from temporal traces of infection events has recently emerged as an active research area. A number of approaches put forth probabilistic models for information diffusion, and leverage maximum likelihood estimation (MLE) to infer both static and dynamic edge weights as pairwise transmission weights between nodes [20, 19, 14]. Sparse SEMs are leveraged to capture exogenous inputs in inference of dynamic social networks in [1], and static causal links in gene regulatory networks [5]. Network Granger causality with group sparsity is advocated for inference of causal networks with inherent grouping structure in [3], while causal influences are inferred by modeling historical network events as multidimensional Hawkes processes in [25].

Within the context of prior works on dynamic network inference, the contributions of the present paper are three-fold. First, a novel switched dynamic SEM that captures sudden topology changes within a finite state-space is put forth (Section II). Second, identifiability results for the proposed switched model are established under reasonable assumptions, building upon prior results for static causal networks in [4] (Section III). Finally, an efficient sparsity-promoting proximal-gradient (PG) algorithm is developed (Sections IV and V) to jointly track the evolving state sequence, and the unknown network topologies. Numerical tests on both synthetic and real cascades in Section VI corroborate the efficacy of the novel approach. Interestingly, experiments on real-world web cascade data exemplify that media influence is dominated by major news outlets (e.g., *cnn.com* and *bbc.com*), as well as well-known web-based news aggregators (e.g., *news.yahoo.com* and *news.google.com*).

Notation

. Bold uppercase (lowercase) letters will denote matrices (column vectors), while operators

, , andwill stand for matrix transposition, maximum eigenvalue, and diagonal matrix, respectively. The identity matrix will be represented by

, while will denote the matrix of all zeros, and their dimensions will be clear in context. Finally, the and Frobenius norms will be denoted by , and , respectively.## Ii Model and Problem Statement

Consider a dynamic network with nodes observed over time intervals , captured by a graph whose topology may arbitrarily switch between discrete states. Suppose the network topology during interval is described by an unknown, weighted adjacency matrix , with . Entry of (henceforth denoted by ) is nonzero only if a directed edge connects nodes and (pointing from to ) during interval . In general , i.e., matrix is generally non-symmetric, which is suitable to model directed networks. The model tacitly assumes that the network topology remains fixed during any given time interval , but can change across time intervals.

Suppose contagions propagating over the network are observed, and the time of first infection per node by contagion is denoted by . In online media, can be obtained by recording the timestamp when blog first mentioned news item . If a node remains uninfected during , is set to an arbitrarily large number. Assume that the susceptibility of node to external (non-topological) infection by contagion is known and time invariant. In the web context, can be set to the search engine rank of website with respect to (w.r.t.) keywords associated with .

The infection time of node during interval is modeled according to the following *switched* dynamic structural equation model (SEM)

(1) |

where captures the state-dependent level of influence of external sources, and accounts for measurement errors and unmodeled dynamics. It follows from (1) that if , then is affected by the values of (see Figure 2). With diagonal , collecting observations for the entire network and all contagions yields the dynamic matrix SEM

(2) |

where , , and are all matrices. A single network topology is adopted for all contagions, which is suitable e.g., when information cascades are formed around a common meme or trending (news) topic over the Internet. Matrix has an all-zero diagonal to ensure that the underlying network is free of self loops, i.e., .

Problem statement. Given the sequence and adhering to (2), the goal is to identify the unknown state matrices , and the switching sequence .

## Iii Model Identifiability

This section explores the conditions under which one can uniquely recover the unknown state matrices . First, (2) can be written as

(3) |

where the binary indicator variable if , otherwise . Note that (3) is generally underdetermined when is unknown, and it may not be possible to uniquely identify . Even in the worst-case scenario where , complete identification of all states is impossible if there exists and () such that . In order to establish model identifiability, it will be assumed that the following hold:

as1. The dynamic SEM in (2) is noise-free (); that is,

(4) |

as2. All cascades are generated by some pair , where , and is known. This is a *realizability assumption*, which guarantees the existence of a submodel responsible for the observed cascades.

as3. No two states can be active during a given time interval, i.e., implies that .

Under (as1)-(as3), the following proposition holds.

Proposition 1: Suppose data matrices and adhere to (4) with , , and . If and has full row rank, then and are uniquely expressible in terms of and as , and .

If

denotes the activation probability of state

, with , and (as1)-(as3) hold, then can be uniquely identified with probability one as .Proposition III establishes a two-step identifiability result for the dynamic SEM model in (4). First, it establishes that if the exogenous data matrix is sufficiently rich (i.e., has full row rank), then the per-interval network topology captured by can be uniquely identified per . Second, if the state activation probabilities are all strictly positive, then the switching network topologies will all be recovered as . In fact, if cascade data are acquired in an infinitely streaming fashion (), then one is guaranteed to uniquely recover all states.

Proof of Proposition III: Equation (4) can be written as , which implies that . With and both having full row rank (recall that ), it follows that has full row rank, and is invertible. Consequently, the linear system of equations is solved by .

On the other hand, note that admits the solution

(5) |

which is unique since is full row rank. Since and is a diagonal matrix, the diagonal entries of coincide with those of ; hence, , which leads to

(6) |

upon substitution of (5). Furthermore, note that , and thus

(7) |

which concludes the first part of the proof.

Since and , the results from the unique solutions (6) and (7) coincide with a specific pair in the set per . Intuitively, complete recovery of all state matrices is tantamount to identification of unique pairs as more data are sequentially acquired. In order to guarantee identifiability of all states in the long run, it suffices to prove that the probability of activation of any state at least once tends to as . Since

are Bernoulli random variables with

per , the number of times that state is activated overtime intervals follows a binomial distribution. Letting

denote the total number of activations of state over , then(8) | |||||

Since only if , it follows that all states can be uniquely identified with probability as when , which completes the second part of the proof.

### Iii-a Topology tracking by clustering when

In general, even when is full row rank, in order to compensate for measurement errors and unmodeled dynamics. Under noisy conditions, it will turn out that can be interpreted as cluster centroids (cf. (3

)), and one can leverage traditional clustering approaches (e.g., k-means), coupled with the closed-form solutions in (

6) and (7) to identify the unknown state matrices. Indeed, with , and , it follows readily that(9) |

where , and .

Clearly, in (9) can be viewed as cluster centers, and as unknown binary cluster-assignment variables. Consequently, the sequence can be directly computed per via (6) and (7), where denotes the number of training samples. Identification of and can then be accomplished by batch clustering into clusters. The subsequent operational phase () then boils down to computing , followed by finding the centroid to which it is closest in Euclidean distance, that is

(10) |

and if , otherwise . Algorithm 1 summarizes this cluster-based state identification scheme, with a clustering phase () and an operational phase (). The sub-procedure calls an off-the-shelf clustering algorithm with the training set and the number of clusters as inputs.

###### Remark 1

Algorithm 1 identifies the unknown state-dependent matrices by resorting to “hard clustering,” which entails deterministic assignment of to one of the centroids. In principle, the algorithm can be readily modified to adopt “soft clustering” approaches, with probabilistic assignments to the cluster centroids.

## Iv Exploiting edge sparsity

The fundamental premise established by Proposition III is that are uniquely identifiable provided is sufficiently rich. However, requiring that has full row rank is a restrictive condition, tantamount to requiring that at least as many cascades are observed as the number of nodes (). This is especially prohibitive in large-scale networks such as the world-wide web, with billions of nodes. It is therefore of interest to measure as few cascades as possible while ensuring that are uniquely recovered. To this end, one is motivated to leverage prior knowledge about the unknowns in order to markedly reduce the amount of data required to guarantee model identifiability. For example, each node in the network is connected only to a small number of nodes out of the possible connections. As a result, most practical networks exhibit edge sparsity, a property that can be exploited to ensure that the rank condition on can be relaxed, as shown in the sequel. In addition to (as1)-(as3), consider the following.

as4. Each row of matrix has at most nonzero entries; i.e., .

as5. The nonzero entries of for all are drawn from a continuous distribution.

as6. The Kruskal rank of satisifies , where is defined as the maximum number such that any combination of columns of constitute a full column rank submatrix.

Proposition 2: Suppose data matrices and adhere to (4) with , , and . If (as1)-(as6) hold, and , then can be uniquely identified with probability one as .

The proof of Proposition IV is rather involved, and is deferred to Appendix A. Unlike the matrix rank which only requires existence of a subset of linearly independent columns, the Kruskal rank requires that every possible combination of a given number of columns be linearly independent. Moreover, computing is markedly more challenging, as it entails a combinatorial search over the rows of . Admittedly, requiring is more restrictive than . Nevertheless, in settings where , (as6) may be satisfied even if . In such cases, Proposition IV asserts that one can uniquely identify even when .

### Iv-a Sparsity-promoting estimator

The remainder of the present paper leverages inherent edge sparsity to develop an efficient algorithm to track switching network topologies from noisy cascade traces. Assuming is known a priori, one is motivated to solve the following regularized LS batch estimator

s. to | (11) | ||||

where the constraint enforces a realizability condition, ensuring that only one state can account for the system behavior at any time. With , the regularization term promotes edge sparsity that is inherent to most real networks. The sparsity level of is controlled by . Absence of a self-loop at node is enforced by the constraint , while having , ensures that is diagonal as in (2).

Note that (P0) is an NP-hard mixed integer program that is unsuitable for large-scale and potentially real-time operation. Moreover, entries of per-state matrices may evolve slowly over reasonably long observation periods, motivating algorithms that not only track the switching sequence, but also the slow drifts occurring in per-state network topologies.

Suppose data are sequentially acquired, rendering batch estimators impractical. If it is assumed that instantaneous and past estimates of are known during interval , (P0) decouples over , and is tantamount to solving the following subproblem per and

s. to | (12) |

Note that the LS term in the penalized cost only aggregates residuals when . In principle, during interval , (P1) updates the estimates of and only if has been generated by submodel . Critical to efficiently tracking the hidden network topologies is the need for a reliable approach to estimate . Before developing an efficient tracking algorithm that will be suitable to solve (P1), the rest of this section puts forth criteria for estimating .

Estimation of . Given the most recent estimates prior to acquisition of , one can estimate by minimizing the a priori error over i.e.,

(13) |

followed by solving (P1) with . If network dynamics arise from switching between static or slowly-varying per-state topologies, (13) yields a reliable estimate of the most likely state sequence over time.

Alternatively, one may resort to a criterion that depends on minimizing the a posteriori error. This entails first solving (P1) for all states upon acquisition of , and then selecting so that

(14) |

where (resp. ) denotes the estimate of (resp. ) given that . In the context of big data and online streaming, (14) is prohibitive, and (13) is more desirable for its markedly lower computational overhead. The tracking algorithm developed next adopts (13) for state sequence estimation.

## V Topology Tracking Algorithm

In order to solve (P1), this section resorts to proximal gradient (PG) approaches, which have been popularized for -norm regularized problems, through the class of iterative shrinkage-thresholding algorithms (ISTA); see e.g., [6] and [17]. Unlike off-the-shelf interior point methods, ISTA is computationally simple, with iterations entailing matrix-vector multiplications, followed by a soft-thresholding operation [9, p. 93]. Motivated by its well-documented merits, an ISTA algorithm is developed in the sequel for recursively solving (P1) per . Memory storage and computational costs incurred by the algorithm per acquired sample do not grow with .

Solving (P1) for a single time interval . Introducing the optimization variable , it follows that the gradient of is Lipschitz continuous, meaning there exists a constant so that , in the domain of . Instead of directly optimizing the cost in (P1), PG algorithms minimize a sequence of overestimators evaluated at the current iterate, or a linear combination of the two previous iterates.

Letting denote the iteration index, and , PG algorithms iterate

(15) |

where corresponds to a gradient-descent step taken from , with step-size equal to . The optimization problem (15) is denoted by , and is known as the proximal operator of the function evaluated at . With for convenience, the PG iterations can be compactly rewritten as .

The success of PG algorithms hinges upon efficient evaluation of the proximal operator (cf. (15)). Focusing on (P1), note that (15) decomposes into

(16) | ||||

(17) |

subject to the constraints in (P1) which so far have been left implicit, and . Letting with -th entry given by denote the soft-thresholding operator, it follows that , see e.g., [6, 9]. Since there is no regularization on , (17) boils-down to a simple gradient-descent step.

Specification of and only requires expressions for the gradient of with respect to and . Note that by incorporating the constraints and , one can express as

(18) |

where and denote the -th row of and , respectively; while denotes the vector obtained by removing entry from the -th row of , and likewise is the matrix obtained by removing row from . It is apparent from (18) that is separable across the trimmed row vectors , and the diagonal entries , . The sought gradients are

(19) | ||||

(20) |

where denotes the -th row of , , and . Similarly, and is obtained by removing the -th row and -th column from . From (16)-(17) and (19)-(20), the parallel ISTA iterations

(21) | ||||

(22) | ||||

(23) | ||||

(24) |

are provably convergent to the globally optimal solution of (P1), as per general convergence results for PG methods [6, 17].

) incur a low computational overhead, involving at most matrix-vector multiplication complexity for the gradient evaluations. A final step entails zero-padding the updated

by setting(25) |

The desired SEM parameter estimates are subsequently obtained as and , for large enough so that convergence has been attained.

Solving (P1) over the entire time horizon. Tracking the switching sequence and the state parameters entails sequentially alternating between two operations per datum arrival. First, is estimated (via solving (13) or (14)), and the corresponding values are accordingly obtained. The iteration steps (21)-(24) are then run until convergence is attained. Note that these iterations only depend on past data through recursively updated moving averages, that is,

(26) |

Similar recursive expressions can be readily derived for , and . The complexity in evaluating the Gram matrix dominates the per-iteration computational cost of the algorithm. The recursive updates in (26) are conducted only for a single state per interval , with the remainder set to . Similarly, iterations (21)-(24) are run only for , while are not updated during interval . Furthermore, the need to recompute per can be circumvented by selecting an appropriate step-size for (23)-(24) by line search [17].

Comments

There are no comments yet.