I Introduction
Topology learning and parameter estimation of dynamical networks have become popular research topics recently because such studies can reveal the underlying mechanisms of many real-world complex systems. For example, a stock market, which consists of a large number of stocks interacting with each other and evolving over time, can be characterized as a dynamical network. Here, a node stands for the price of a stock and an edge or link represents an interaction between stocks. Let $\mathbf{x}_t$ be a $p$-dimensional random vector with each component being a time series associated with a node. We are interested in inferring the topology and dynamics of a linear dynamical network $\mathbf{x}_t = A\mathbf{x}_{t-1} + \mathbf{e}_t$, with $\mathbf{e}_t$ as the system disturbance. Such a model has been proposed and studied in many areas such as econometrics, finance and bioinformatics [1, 2, 3, 4]. The transition matrix $A$ determines how the current state of the network evolves from the previous state. It can be translated to a directed Granger transition graph (GTG) that shows the Granger causal connections between the nodes [5, 6, 7]. The modern challenge is that the number of unknowns in $A$ is usually much larger than the number of available observations $n$, and consequently most conventional methods fail in estimation or identification. From a statistical perspective, shrinkage estimation [8] must be applied, and sparsity-promoting regularizations are preferred because they can produce interpretable networks [9, 10]. Indeed, in many applications, there exist only a few significant nodes that directly influence a given node. Sparse graph learning also complies with the principle of Occam's razor from a philosophical perspective. Nevertheless, existing methods usually assume that the components of $\mathbf{e}_t$ are i.i.d., i.e., that the covariance matrix $\Sigma$ of $\mathbf{e}_t$ is proportional to the identity matrix. This totally ignores the
second-order statistical structure of the network. Most real-world networks violate this assumption because, even conditioning on past observations, node correlations widely exist. Assuming, ideally, that the true $A$ is known, the dependence structure of a network can be captured by sparse Gaussian graph learning, which has attracted a lot of research attention lately (cf. [11, 12, 13, 14] among many others). Under Gaussianity, the $(i,j)$th entry of the concentration matrix $\Omega = \Sigma^{-1}$ gives the conditional dependence between node $i$ and node $j$ given all the other nodes. This can be translated to an undirected conditional dependence graph (CDG), in which case, again, sparsity on $\Omega$ is desirable. Unfortunately, Gaussian graph learning is not directly applicable to our dynamical model because, as discussed above, the task of estimating $A$ is no less challenging than the task of estimating $\Omega$. Note that substituting the sample mean for the true mean is inappropriate when $A$ is a large matrix, which is a well-known instance of Stein's phenomenon [8].
To obtain a comprehensive picture of the dynamical network, it is necessary to estimate both $A$ and $\Omega$ based on their joint likelihood. There are few studies in the literature that consider the joint sparse estimation of the two matrices [15, 16]. In our experience, the existing methods are slow and cannot handle big network data. For example, the MRCE algorithm [15] is already infeasible for moderately large $p$ on an ordinary PC. Note that the number of unknown variables, of order $p^2$, can be very large, thereby making it difficult to reliably identify the sparse network topology and accurately estimate the system parameters.
As a real example, we use the Energy category of the S&P 500 stock data to illustrate our motivation. Figure 1 shows the graphs obtained by sole GTG learning (sGTG for short), which ignores the second-order node correlations, and by sole CDG learning (sCDG for short), which ignores the first-order Granger causalities. Common isolated nodes have been removed. Some edges exist in both graphs, which suggests that the joint regularization of $A$ and $\Omega$ might be helpful in detecting the joint graphical structure. In fact, statistically speaking, even when similarities between the two graphs are unclear or even nonexistent, joint regularization can improve the overall estimation accuracy in high dimensions; see, e.g., [8, 17]. Another interesting observation from Figure 1 is that the network can be decomposed into smaller subnetworks including isolated indices. Similar decomposability has also been noticed in brain connectivity networks [18] and U.S. macroeconomics [19]. If such a network decomposition could be detected at an early stage, complex learning algorithms, such as MRCE and Gaussian graph learning, would apply in a much more efficient way (in a possibly parallel manner). Of course, a decomposition based on sGTG or sCDG alone may not be trustworthy. When $p$ is large and both GTG and CDG are unknown, graph screening/decomposition based on $A$ and $\Omega$ jointly is much more reasonable.
This work proposes jointly regularizing the directed transition graph and the undirected dependence graph for topology identification and dynamics estimation. We will introduce the notion of the joint association graph (JAG) and propose JAG screening and decomposition to facilitate large-scale network learning. JAG screening identifies and removes unnecessary links. The JAG structure can also be used for network decomposition, so that the GTG or CDG can be estimated for each subnetwork separately. With the search space and problem size substantially reduced, computational and statistical performance can be enhanced. Similar ideas have proved to be very successful in Gaussian graph learning, such as the block diagonal screening rule (BDSR) [20, 21]. Our approach is based on the JAG instead of the CDG alone. Moreover, we will develop a robust JAG decomposition that does not incur excessive estimation bias as BDSR does [22]: our approach does not mask authentic edges to guarantee decomposability. To the best of our knowledge, no work on joint graph screening or decomposition is available in the literature.
The remainder of this paper is organized as follows. Section II describes the joint graphical model and proposes a learning framework called joint graphical screening and estimation (JGSE). Section III develops a graph iterative screening via thresholding (GIST) algorithm to be used for JAG screening and robust decomposition. Section IV gives a fine learning of graphs (FLOG) algorithm that estimates $A$ and $\Omega$ after screening. In Section V, synthetic-data experiments are conducted to show the performance of JGSE. In Section VI, we apply JGSE to real S&P 500 and NASDAQ-100 stock data for network learning.
II The joint graphical model
Suppose there exist $p$ nodes in a dynamical network and let $\mathbf{x}_t$ be a $p$-dimensional random vector with each component associated with a node. To describe the node behaviors at each time point, we define a linear dynamical network model
$\mathbf{x}_t = A\,\mathbf{x}_{t-1} + \mathbf{e}_t, \quad \mathbf{e}_t \sim \mathcal{N}(\mathbf{0}, \Sigma)$ (1)
The current state of the system is determined by two components: the first component, $A\mathbf{x}_{t-1}$, is a linear transformation of the previous state; the second component, $\mathbf{e}_t$, follows a multivariate Gaussian distribution and characterizes node correlations conditioned on $\mathbf{x}_{t-1}$. The transition matrix $A$ can be translated to a directed Granger transition graph (GTG): $A_{ij} \neq 0$ indicates that node $j$ Granger-causes node $i$ [5]. The concentration matrix $\Omega = \Sigma^{-1}$ can be translated to an undirected conditional dependence graph (CDG): $\Omega_{ij} = 0$ indicates that node $i$ and node $j$ are conditionally independent given the other nodes and $\mathbf{x}_{t-1}$ [11, 12]. Given $n$ snapshots of the system, $\mathbf{x}_1, \ldots, \mathbf{x}_n$, we would like to recover the first-order statistic $A$ and the second-order statistic $\Omega$, as well as find out their sparsity patterns (or topological structures). We are particularly interested in dynamical networks with both GTG and CDG being sparse or approximately sparse, for the following reasons. First, many real-world dynamical networks are indeed sparse. For example, in regulatory networks, a gene is only regulated by several other genes [3]. Second, when the number of observations is small compared with the number of unknown variables, the sparsity assumption reduces the number of model parameters so that estimating the system becomes possible. Third, from a philosophical point of view, a sparse model is consistent with the principle of Occam's razor and is easier to interpret in practice. As pointed out by a referee, sCDG, which estimates the marginal concentration matrix, typically yields a less sparse graph than $\Omega$, because the transition matrix $A$, together with the autoregressive mechanism, brings in further node dependence (see, e.g., [23] for more details).
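To make the data-generating mechanism concrete, the following minimal sketch simulates model (1) with a sparse transition matrix and correlated disturbances. The matrix values and dimensions here are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_network(A, Sigma, n, rng):
    """Simulate n snapshots of the VAR(1) model x_t = A x_{t-1} + e_t,
    with disturbances e_t ~ N(0, Sigma)."""
    p = A.shape[0]
    X = np.zeros((n, p))
    x = np.zeros(p)
    for t in range(n):
        e = rng.multivariate_normal(np.zeros(p), Sigma)
        x = A @ x + e
        X[t] = x
    return X

# sparse transition matrix (acyclic here, so the system is stable)
p = 5
A = np.zeros((p, p))
A[0, 1] = 0.4; A[2, 0] = -0.3; A[3, 4] = 0.5
Sigma = 0.5 * np.eye(p) + 0.1 * np.ones((p, p))  # correlated disturbances
X = simulate_network(A, Sigma, n=200, rng=rng)
print(X.shape)  # (200, 5)
```

Each row of `X` is one snapshot $\mathbf{x}_t$; the rows form the time series used for estimation.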
II-A Joint regularization in network learning
Using the Markov property and chain rule, we can write the joint likelihood of $A$ and $\Omega$ (conditioned on $\mathbf{x}_1$) as $\prod_{t=2}^{n} p(\mathbf{x}_t \mid \mathbf{x}_{t-1}; A, \Omega)$, so the joint ML estimation minimizes the associated negative log-likelihood. Let $X = [\mathbf{x}_2, \ldots, \mathbf{x}_n]^{\top}$ and $\tilde{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_{n-1}]^{\top}$. We write the ML problem in matrix form

$\min_{B,\, \Omega \succ 0} \; \frac{1}{n} \operatorname{tr}\{(X - \tilde{X}B)\,\Omega\,(X - \tilde{X}B)^{\top}\} - \log|\Omega|$ (2)

Here $\Omega \succ 0$ means that $\Omega$ is positive definite (which implies that $\Omega$ is symmetric). From now on, we use $B = A^{\top}$, in place of $A$, to represent the GTG.
To enforce sparsity, a straightforward idea is to add penalties $P_1(B)$ and $P_2(\Omega)$ to the loss in (2); $P_1$ and $P_2$ can be of the $\ell_1$ type [24, 25, 15]. In this paper, we propose jointly regularizing $B$ and $\Omega$ via a penalty $P(B, \Omega)$, where $P$ is constructed based on $B$ and $\Omega$ together. This leads to the following optimization problem

$\min_{B,\, \Omega \succ 0} \; \frac{1}{n} \operatorname{tr}\{(X - \tilde{X}B)\,\Omega\,(X - \tilde{X}B)^{\top}\} - \log|\Omega| + P(B, \Omega)$ (3)
The design of $P$ is to capture the joint structure of the GTG and CDG. Of course, joint regularization can reinforce the detection of common edges if they exist. But why should one care about the joint graphical structure, computationally and statistically? Some motivations are given below.
1) First, due to the sparsity assumptions on $B$ and $\Omega$, the union of the two graphs is still sparse. That is, many node pairs have no direct influence on each other. Hence one can perform graph screening at an early stage for dimension reduction, to facilitate fine GTG and CDG learning afterwards. A good screening process should take both graphs into account in removing unnecessary hypothetical edges.
2) Many very large dynamical networks demonstrate smaller-scale subnetwork structures or clusters. For example, a human brain connectivity network revealed by EEG data can be divided into several functional regions [18]. Also, in the U.S. macroeconomic network, economic indices can be divided into different categories [19]. It is desirable to decompose a large-scale network into small subnetworks, if possible, for both computational and statistical reasons [20, 21]. In the dynamical network setting, such a decomposition must be based on both the GTG and CDG.
II-B Joint association graph
Model (1) shows that the network evolves through both first-order and second-order statistical relations between the nodes. To capture the joint structure, we introduce the notion of the joint association graph (JAG), an undirected graph in which any two nodes are connected if they are connected in either the GTG or the CDG. Define the "association strength" between node $i$ and node $j$ as

$s_{ij} = \big\|(B_{ij},\, B_{ji},\, \sqrt{\varrho}\,\Omega_{ij})\big\|_2$ (4)

where $\varrho$ is a weight parameter (say, $\varrho = 1$); the matrix $S = [s_{ij}]$ represents the JAG.
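As an illustration, the following snippet builds a JAG matrix from estimates of $B$ and $\Omega$. The combination rule coded here, $s_{ij} = (B_{ij}^2 + B_{ji}^2 + \varrho\,\Omega_{ij}^2)^{1/2}$, is one plausible reading of (4); treat it as a sketch rather than the paper's exact formula:

```python
import numpy as np

def jag_matrix(B, Omega, rho=1.0):
    """Joint association graph: combine the directed GTG weights (B) and the
    undirected CDG weights (Omega) into one symmetric association-strength
    matrix, assuming s_ij = sqrt(B_ij^2 + B_ji^2 + rho * Omega_ij^2)."""
    S = np.sqrt(B**2 + B.T**2 + rho * Omega**2)
    np.fill_diagonal(S, 0.0)  # self-links are excluded from the JAG
    return S

B = np.array([[0.0, 0.3], [0.0, 0.0]])
Omega = np.array([[1.0, -0.4], [-0.4, 1.0]])
S = jag_matrix(B, Omega)
print(S)  # symmetric; S[0,1] = sqrt(0.3**2 + 0.4**2) = 0.5
```

Any positive entry $s_{ij}$ means that nodes $i$ and $j$ are linked in at least one of the two graphs.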
To give an illustration of the JAG, we show a toy example in Figure 2, where the JAG in Figure 2(c) is obtained from (4). The GTG and CDG share many common edges. Furthermore, they both exhibit subnetwork structures. In fact, in both graphs, nodes 1-4 form a cluster. On the other hand, the two graphs differ from each other in some significant ways. For example, node 9 and node 10 are disconnected in the GTG, but not so in the CDG. The JAG, by integrating the connections in the GTG and CDG, provides a comprehensive picture of the network topology.
In reality, both the GTG and CDG are unknown and must be estimated. If one had the JAG learned beforehand, its structure could be used to perform graph screening and help improve the estimation of $B$ and $\Omega$. For example, in Figure 2(c), node 4 and node 5 are disconnected, so setting the corresponding entries of $B$ and $\Omega$ to zero beforehand facilitates network estimation and identification. In particular, if the JAG, after permutation, exhibits a block-diagonal structure, then both $B$ and $\Omega$ must have the same block-diagonal structure. It is not difficult to show that such a network can be decomposed into independent subnetworks with its dynamical properties completely intact. For example, the network shown in Figure 2 can be decomposed into two mutually disconnected subnetworks according to its JAG. We can estimate and infer the GTG and CDG for each subnetwork separately. Explicitly estimating the JAG based on (4) also facilitates computation and algorithm design, as will be shown in Section III-B.
II-C The JGSE learning
Directly tackling the jointly regularized problem (3) is extremely challenging. (Even without the joint penalty $P$, the existing algorithms are inefficient or even infeasible for moderate-scale problems.) The key idea of the paper is to detect and utilize the joint structure of $B$ and $\Omega$ for a computational and statistical performance boost. We propose a Joint Graphical Screening and Estimation (JGSE) learning framework which consists of two stages: 1) JAG screening and decomposition; 2) fine estimation.
In Stage 1, we identify the structure of JAG by solving the joint regularization problem
$\min_{B,\, \Omega \succ 0} \; \frac{1}{n} \operatorname{tr}\{(X - \tilde{X}B)\,\Omega\,(X - \tilde{X}B)^{\top}\} - \log|\Omega| + P(B, \Omega)$ (5)
where the penalty (or constraint) takes the form
$P(B, \Omega) = \sum_{1 \le i < j \le p} P\big(\|(B_{ij},\, B_{ji},\, \sqrt{\varrho}\,\Omega_{ij})\|_2;\, \lambda\big)$ (6)
That is, we place $B_{ij}$, $B_{ji}$, and $\Omega_{ij}$ into the same group, and use a sparsity-inducing penalty at the group level. The group sparsity pursuit ensures that as long as any type of connection between node $i$ and node $j$ exists, the group will be kept, and so will the corresponding edge in the JAG. The grouping of variables can be arbitrary; our algorithms apply provided that the groups do not overlap. For example, if we know a priori that several nodes form a cluster, we can put the corresponding elements of $B$ and $\Omega$ into a group. The form of (6) serves the general case where no particular prior information is given.
Stage 2 estimates $B$ and $\Omega$ finely, given the pattern of the JAG:

$\min_{B,\, \Omega \succ 0} \; \frac{1}{n} \operatorname{tr}\{(X - \tilde{X}B)\,\Omega\,(X - \tilde{X}B)^{\top}\} - \log|\Omega| \quad \text{s.t.}\; B_{ij} = 0 \;\forall (i,j) \notin \mathcal{E}_B,\; \Omega_{ij} = 0 \;\forall (i,j) \notin \mathcal{E}_{\Omega}$ (7)

where $\mathcal{E}_B$ denotes the set of nonzero edges in $B$ learned from Stage 1, and $\mathcal{E}_{\Omega}$ is similarly defined. The constraints maintain the sparsity structure learned from Stage 1. In this fine estimation stage, the number of variables to be estimated has been substantially reduced, and packages for sparse matrix operations can be used. When JAG decomposition is possible, popular graph learning algorithms can be directly applied to the subnetworks, and parallelism can be employed for high performance computing.
Both sGTG and sCDG are special cases of (3), and can be learned by screening + fine estimation as well. Ignoring the second-order network structure and assuming $\mathbf{e}_t$ has i.i.d. components, i.e., $\Omega \propto I$, the joint graphical model degenerates to the sGTG model, where a sparse transition matrix can be obtained by solving an $\ell_1$-penalized least-squares problem in $B$. Assuming the data have been centered and $B = 0$, the joint graphical model degenerates to the sCDG model, where a sparse $\Omega$ can be obtained by Gaussian graph learning with the sample covariance matrix $X^{\top}X/n$. Another instance is the multivariate regression with covariance estimation (MRCE) [15], which adds separate $\ell_1$ penalties on $B$ and $\Omega$ to the loss in (2). MRCE estimates both $B$ and $\Omega$ but imposes no joint regularization. In our experience MRCE is only feasible for small-scale network learning, which is a main motivation of our JAG screening. In the following two sections, we present computational algorithms for the two-stage JGSE learning framework.
III JAG screening and decomposition
The objective function in (5) is nonconvex and nonsmooth, and there are a large number of unknown variables. One possible way to minimize (5) is to use coordinate descent; the resulting algorithm design is, however, quite cumbersome: one must consider different cases depending on whether the variables appearing in (6) are zero or not. Our experiments show that such an algorithm is only feasible for small $p$. More efficient algorithms are in great demand. We propose a novel GIST algorithm based on the group estimator with an asynchronous Armijo-type line search.
III-A Group estimator
To solve (5), we start from thresholding rules rather than penalty functions, considering that different penalty forms may result in the same estimator (and the same thresholding operator) [27].
A thresholding rule $\Theta(\cdot; \lambda)$ is required to be an odd, nondecreasing, unbounded shrinkage function [28]. Examples include the soft-thresholding operator $\Theta_S(t; \lambda) = \operatorname{sgn}(t)(|t| - \lambda)_+$ and the hard-thresholding operator $\Theta_H(t; \lambda) = t \cdot 1_{|t| > \lambda}$. (Throughout the paper, the sign function is defined as $\operatorname{sgn}(t) = 1$ if $t > 0$, $-1$ if $t < 0$, and $0$ if $t = 0$.) When the argument is a vector, the thresholding rule is defined componentwise. The multivariate version of $\Theta$, denoted by $\vec{\Theta}$, is defined by

$\vec{\Theta}(\mathbf{a}; \lambda) = \mathbf{a}\,\Theta(\|\mathbf{a}\|_2; \lambda)/\|\mathbf{a}\|_2$ if $\mathbf{a} \neq \mathbf{0}$, and $\vec{\Theta}(\mathbf{0}; \lambda) = \mathbf{0}$ (8)
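The scalar rules and the multivariate extension (8) can be sketched as follows; the function names are ours:

```python
import numpy as np

def soft(t, lam):
    """Soft-thresholding: sgn(t) * (|t| - lam)_+ (componentwise)."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def hard(t, lam):
    """Hard-thresholding: t * 1_{|t| > lam} (componentwise)."""
    return np.where(np.abs(t) > lam, t, 0.0)

def multivariate(theta, a, lam):
    """Multivariate thresholding in the sense of (8): shrink the l2-norm
    of the vector a by the scalar rule theta, keeping its direction."""
    nrm = np.linalg.norm(a)
    if nrm == 0:
        return np.zeros_like(a)
    return a * (theta(nrm, lam) / nrm)

a = np.array([3.0, 4.0])           # ||a|| = 5
print(multivariate(soft, a, 1.0))  # scales a by (5-1)/5 -> [2.4, 3.2]
print(multivariate(hard, a, 6.0))  # ||a|| <= 6 -> zero vector
```

Note that the multivariate rule either shrinks or kills the whole group at once, which is exactly what group sparsity requires.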
Now we formulate a general framework for solving a group penalized problem

$\min_{\boldsymbol{\xi}} \; l(\boldsymbol{\xi}) + \sum_{g} P_g(\|\boldsymbol{\xi}_g\|_2; \lambda_g)$ (9)

where $l$ is the negative log-likelihood function, and the $P_g$ are penalty functions, possibly nonconvex (and discrete), with corresponding thresholding rules $\Theta_g$ as described in (10). [28] shows that given any thresholding operators $\Theta_g$, the iterative multivariate thresholding procedure $\boldsymbol{\xi}^{(k+1)} = \vec{\Theta}(\boldsymbol{\xi}^{(k)} - \alpha \nabla l(\boldsymbol{\xi}^{(k)}); \alpha\boldsymbol{\lambda})$ is guaranteed to converge under a universal choice of the step size $\alpha$; moreover, the convergent solution (referred to as a group estimator) is a local minimum point of (9), provided that $P_g$ and $\Theta_g$ are coupled through the following equation:

$P_g(t; \lambda) = \int_0^{|t|} \big(\sup\{s : \Theta_g(s; \lambda) \le u\} - u\big)\,\mathrm{d}u + q(t; \lambda)$ (10)

for some nonnegative $q(\cdot; \lambda)$ satisfying $q(\Theta_g(t; \lambda); \lambda) = 0$ for all $t$. We emphasize that the conclusion holds for any thresholding rules, and most practically used penalties (convex or nonconvex) are covered by (10). Two important examples that will play a central role in this work are as follows. When all $\Theta_g$ take the form of group soft-thresholding, the corresponding penalty in (9) becomes the group $\ell_1$ penalty $\lambda \sum_g \|\boldsymbol{\xi}_g\|_2$. When all $\Theta_g$ are chosen to be group hard-thresholding, (10) yields infinitely many penalties even for a fixed $\lambda$, one of which is the exact group $\ell_0$ penalty $\frac{\lambda^2}{2} \sum_g 1_{\boldsymbol{\xi}_g \neq \mathbf{0}}$, obtained by a particular choice of $q$.
We now use the group estimator to deal with problem (5). Divide the variables in $B$ and $\Omega$ into groups, where the variables at entry $(i,j)$ and entry $(j,i)$ ($i \neq j$) belong to the same group, as in (6). Let $l(B, \Omega)$ denote the loss in (5). It is not difficult to compute the gradients of $l$ with respect to $B$ and $\Omega$ (details omitted):

$\nabla_B l = -\frac{2}{n}\,\tilde{X}^{\top}(X - \tilde{X}B)\,\Omega, \qquad \nabla_{\Omega} l = \frac{1}{n}(X - \tilde{X}B)^{\top}(X - \tilde{X}B) - \Omega^{-1}$ (11)

Thus the gradient of $l$ with respect to $(B, \Omega)$ is

$\nabla l(B, \Omega) = \big(\nabla_B l,\; \nabla_{\Omega} l\big)$ (12)
Given $B$, let $\mathbf{b}_g$ consist of all elements in $B$ that belong to the $g$th group, and similarly let $\boldsymbol{\omega}_g$ consist of the elements of $\Omega$ in the $g$th group. We extend the multivariate thresholding to such matrix pairs: given any thresholding rule $\Theta$, define its multivariate thresholding $\vec{\Theta}(B, \Omega; \lambda)$ as a new matrix pair whose $g$th group is $\vec{\Theta}((\mathbf{b}_g, \boldsymbol{\omega}_g); \lambda)$, with $\vec{\Theta}$ given by (8). Then the iterative algorithm to get a group estimator of (5) becomes

$(B^{(k+1)}, \Omega^{(k+1)}) = \vec{\Theta}\big((B^{(k)}, \Omega^{(k)}) - \alpha\,\nabla l(B^{(k)}, \Omega^{(k)});\; \alpha\lambda\big)$ (13)

with $P$ and $\Theta$ coupled through (10).
III-B JAG screening
Equation (13) can deliver a local minimum of problem (5) for any penalty function constructed from a thresholding rule via (10). This covers the $\ell_1$ penalty, SCAD [29], the $\ell_0$ penalty, and many other penalties [28]. The problem now boils down to choosing a proper penalty form for JAG screening. Another issue that cannot be ignored is parameter tuning, which is a nontrivial task, especially for nonconvex penalties.
Among all sparsity-promoting penalties, the group $\ell_0$ penalty is no doubt ideal for enforcing sparsity. However, its parameter tuning is not easy, and most tuning approaches, e.g., cross-validation, become prohibitive in large network applications. Rather than using the group $\ell_0$ penalty, we propose using a group $\ell_0$ constraint for JAG screening
$\sum_{1 \le i < j \le p} 1_{\|(B_{ij},\, B_{ji},\, \sqrt{\varrho}\,\Omega_{ij})\|_2 \neq 0} \le m$ (14)
This particular form enables one to directly control the cardinality of the network. (In this paper, the cardinality of a network refers to its number of nonzero links. Note that the constraint excludes the diagonal entries of $B$ and $\Omega$.) The upper bound $m$ can be loose for the JAG screening step. This group $\ell_0$ constrained problem can be solved using the technique in Section III-A, resulting in a quantile version of (13):
$(B^{(k+1)}, \Omega^{(k+1)}) = \vec{\Theta}^{\#}\big((B^{(k)}, \Omega^{(k)}) - \alpha\,\nabla l(B^{(k)}, \Omega^{(k)});\; m\big)$ (15)
Here, the multivariate quantile thresholding operator $\vec{\Theta}^{\#}(\cdot; m)$ [30] is defined to keep the $g$th group of its argument if the group's $\ell_2$ norm is among the $m$ largest among all group norms, and to zero it out otherwise. This iterative quantile screening was proposed in [28] and has found successful applications in group selection, rank reduction, and network screening [30, 31, 32, 6, 7].
An equivalent way to perform the multivariate quantile thresholding is based on the JAG. First, compute the JAG matrix $S$ by (4) explicitly, and set all its diagonal entries to zero. Then perform elementwise hard-thresholding on $S$ with the threshold set as the $m$th largest element of $S$. Finally, zero out 'small' entries in $B$ and $\Omega$: for any node pair $(i,j)$ whose JAG entry has been thresholded to zero, set $B_{ij} = B_{ji} = \Omega_{ij} = \Omega_{ji} = 0$. See Algorithm 1 for more details. From [30], we can similarly show that the iterative quantile thresholding converges and leads to a local minimum of the following constrained problem:
$\min_{B,\, \Omega \succ 0} \; l(B, \Omega) \quad \text{s.t.}\; \|S\|_0 \le m$ (16)
where $\|S\|_0$ denotes the number of nonzero off-diagonal entries in the JAG matrix $S$, and $m$, called the quantile parameter, puts an upper bound on the sparsity level of the network. It can be customized by the user based on a belief of how sparse the network could be; prior knowledge or specific application needs can be incorporated. Thus this upper bound is usually not difficult to specify in sparse network learning.
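The JAG-based quantile screening pass described above can be sketched as follows (assuming the $s_{ij} = (B_{ij}^2 + B_{ji}^2 + \varrho\,\Omega_{ij}^2)^{1/2}$ reading of (4); this is a single pass for illustration, whereas GIST interleaves such passes with gradient steps):

```python
import numpy as np

def jag_quantile_screen(B, Omega, m, rho=1.0):
    """Keep only the m strongest node pairs according to the JAG, and
    zero out B/Omega elsewhere (diagonal entries are never touched)."""
    p = B.shape[0]
    S = np.sqrt(B**2 + B.T**2 + rho * Omega**2)
    np.fill_diagonal(S, 0.0)
    iu = np.triu_indices(p, k=1)
    vals = np.sort(S[iu])[::-1]
    thr = vals[m - 1] if m <= len(vals) else 0.0  # m-th largest strength
    keep = S >= thr
    np.fill_diagonal(keep, True)                  # exclude diagonal from screening
    return np.where(keep, B, 0.0), np.where(keep, Omega, 0.0), keep

B = np.array([[0.5, 0.3, 0.0],
              [0.0, 0.4, 0.01],
              [0.9, 0.0, 0.7]])
Omega = np.eye(3)
Bs, Os, keep = jag_quantile_screen(B, Omega, m=1)
print(Bs)  # only the strongest pair (nodes 1 and 3) survives off the diagonal
```

Ties at the threshold are kept, so the constraint is enforced only up to ties, which is harmless for screening purposes.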
In the generalized linear model setting, the proposed iterative multivariate thresholding procedure is guaranteed to converge with a simple analytical expression for the step size [28, 30]. However, in our dynamical network, which has both $B$ and $\Omega$ unknown, there seems to be no simple formula for the step size in (16). The positive-definite constraint cone increases the difficulty of deriving a universal step size. We propose a simple but effective asynchronous Armijo-type (AA) line search approach in the next subsection, which guarantees a convergent solution with $\Omega \succ 0$ automatically satisfied.
III-C The AA line search and the GIST algorithm
The basic idea of the Armijo-type line search, when restricted to problem (16), is to select a step size $\alpha$ along the descent direction that satisfies the Armijo rule [33]:

$l(B^{(k+1)}, \Omega^{(k+1)}) \le l(B^{(k)}, \Omega^{(k)}) - \frac{\sigma}{2\alpha}\big(\|B^{(k+1)} - B^{(k)}\|_F^2 + \|\Omega^{(k+1)} - \Omega^{(k)}\|_F^2\big)$ (17)

If the condition is satisfied, we accept $\alpha$ and carry on to the next iteration; otherwise, we decrease $\alpha$ and try the new update in (15) until either the condition is satisfied or $\alpha$ becomes smaller than a threshold $\alpha_{\min}$. At each iteration we initialize $\alpha$ at a fixed value $\alpha_0$ and decrease it geometrically if condition (17) is not satisfied.
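A generic backtracking routine in the spirit of (17) can be sketched as follows; the constants and the toy objective are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def armijo_threshold_step(f, x, update, alpha0=1.0, shrink=0.5,
                          sigma=1e-4, alpha_min=1e-8):
    """Armijo-type backtracking for a thresholded gradient step.
    `update(x, alpha)` returns the candidate thresholded point; we accept
    the first step size for which a sufficient-decrease condition holds."""
    alpha = alpha0
    while alpha >= alpha_min:
        x_new = update(x, alpha)
        if f(x_new) <= f(x) - (sigma / (2 * alpha)) * np.sum((x_new - x) ** 2):
            return x_new, alpha      # sufficient decrease: accept
        alpha *= shrink              # otherwise shrink the step and retry
    return x, alpha                  # fall back to the old iterate

# toy objective f(x) = 0.5 ||x - c||^2 with a soft-thresholded gradient step
c = np.array([2.0, -0.1])
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad = lambda x: x - c
soft = lambda t, lam: np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)
update = lambda x, a: soft(x - a * grad(x), a * 0.5)
x, alpha = armijo_threshold_step(f, np.zeros(2), update)
print(x, alpha)  # the small component is thresholded to exactly 0
```

Rejecting candidates with an infinite objective value is what lets the same loop handle the positive-definiteness requirement on $\Omega$ discussed below.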
Empirical studies show that $B$ and $\Omega$ usually have different orders of magnitude, and so using the same step size for updating $B$ and $\Omega$ may be suboptimal. (In fact, with only one step size parameter, it is often difficult to find an $\alpha$ satisfying (17), and the algorithm converges slowly.) Therefore, we propose using different step sizes $\alpha_B$ and $\alpha_\Omega$ for $B$ and $\Omega$. This can be implemented by updating the $B$ component and the $\Omega$ component asynchronously in computing (15). To be more specific, we modify the gradient update based on (12) as

$(B, \Omega) - \big(\alpha_B\,\nabla_B l,\; \alpha_\Omega\,\nabla_\Omega l\big)$ (18)
The AA line search can implicitly guarantee the positive definiteness of $\Omega$. If we set $l(B, \Omega) = +\infty$ for any $\Omega$ not positive definite, then such $\Omega$'s will naturally be rejected by (17). The same treatment has been used in Gaussian graph learning; see, e.g., [34].
The final form of our graph iterative screening via thresholding (GIST) algorithm is given in Algorithm 1, under the assumption that the data matrix $X$ has been centered and normalized columnwise, $\tilde{X}$ has been centered, and $B$ and $\Omega$ have been suitably initialized.
The GIST algorithm is very simple to implement and runs efficiently. If the purpose is to get the convergent sparsity pattern instead of a precise estimate, one can terminate the algorithm as soon as the sign pattern of the iterates stabilizes, usually within 50 steps. Even for a network with 500 nodes, GIST takes less than 1 second.
III-D Robust JAG decomposition via spectral clustering
Nowadays, a great challenge in modern network analysis comes from big data, which makes many methods computationally infeasible. Fortunately, very large networks often demonstrate subnetwork structures, and thus one can decompose the network at an early stage and then apply complex learning algorithms to each subnetwork individually. Similar ideas have appeared in Gaussian graph learning [20, 21], where a simple one-step thresholding is applied to the sample covariance matrix to predetermine whether the associated concentration matrix estimate is decomposable, referred to as the block diagonal screening rule (BDSR). Yet this rule ignores the first-order statistical structure of our dynamical model (1), and the resulting CDG may not reliably capture the network topology. See Section VI-A for some experiments.
We propose decomposing the whole network based on the GIST estimate. For example, after obtaining the GIST estimate and the associated JAG matrix $S$, we can apply the Dulmage-Mendelsohn decomposition [35] to detect whether an exact block diagonal form of $S$ exists. However, noise contamination makes perfect decomposition seldom possible. Therefore, we treat $S$ as a similarity matrix, where the "association strength" $s_{ij}$ indicates how close node $i$ and node $j$ are, so that pursuing an approximate block diagonal form becomes a node clustering problem. Specifically, we apply spectral clustering to $S$ to obtain a robust JAG decomposition. Refer to [36] for a comprehensive introduction, and [37] for its ability to suppress noise. There are many effective ways to determine the number of clusters [38, 39, 40].
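For the two-subnetwork case, the spectral step can be sketched with the Fiedler vector of the graph Laplacian built from the JAG; a practical implementation would run full spectral clustering with an estimated number of clusters:

```python
import numpy as np

def spectral_bipartition(S):
    """Split a network into two clusters from its JAG/similarity matrix S
    by the sign of the Fiedler vector (the eigenvector of the unnormalized
    graph Laplacian with the second-smallest eigenvalue)."""
    d = S.sum(axis=1)
    L = np.diag(d) - S               # graph Laplacian
    w, V = np.linalg.eigh(L)         # eigenvalues in ascending order
    fiedler = V[:, 1]                # second-smallest eigenvector
    return (fiedler > 0).astype(int) # signs give the 2-way partition

# two noisy blocks: nodes 0-2 vs nodes 3-5
S = np.zeros((6, 6))
S[:3, :3] = 0.8
S[3:, 3:] = 0.8
S += 0.01                            # weak spurious links across blocks
np.fill_diagonal(S, 0.0)
labels = spectral_bipartition(S)
print(labels)  # nodes 0-2 share one label, nodes 3-5 the other
```

Because the partition depends on the dominant block structure rather than on every individual entry, a few weak cross-block links (noise) do not change the result, which is the robustness property the text appeals to.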
Unlike [20] and [21], our JAG decomposition does not rely on setting $m$ low in (16) to yield subnetworks. An oversparse estimate may be problematic in estimation or structure identification. The philosophy is different from that of the BDSR. In fact, BDSR is purely computational: it predetermines, for each regularization parameter, whether the associated graph estimate is perfectly decomposable. To ensure decomposability on noisy data, one tends to specify overly high sparsity levels to obtain subnetworks; see, e.g., Section 4 in [21] and our data example in Section VI-A. This may remove genuine connections. Therefore, the resulting decomposition could be misleading, and excessive bias may be incurred in estimation (cf. [22]). Our JAG decomposition can deal with noise effectively and is much more robust in this sense.
If the network is decomposable (or approximately so), system (1) can be rewritten as $K$ independent subsystems of the same form, one for each subnetwork, where the $k$th subsystem involves only the nodes that belong to the $k$th subnetwork. We can thereby conduct fine estimation of $B$ and $\Omega$ for each subnetwork (in a possibly parallel fashion). There are two ways to use the GIST screening outcome. If each subnetwork is of relatively small size, so that fine learning algorithms can be applied smoothly, one can drop the constraints in (7); in this case, the screening result is only used to reveal the block decomposition structure. Alternatively, one can enforce all within-block sparsity constraints (determined by the GIST pattern) in subnetwork learning. The latter is usually faster, but when the value of $m$ is set too low, this option should be used with caution.
IV Fine learning
In this stage of JGSE we perform fine estimation of the graph matrices. Recall the optimization problem (7), which takes advantage of the JAG screening pattern given by Stage 1. Sometimes the screening constraints may be dropped. In either case, we can state the optimization problem as an instance of

$\min_{B,\, \Omega \succ 0} \; \frac{1}{n} \operatorname{tr}\{(X - \tilde{X}B)\,\Omega\,(X - \tilde{X}B)^{\top}\} - \log|\Omega| + \|\Lambda_B \circ B\|_1 + \|\Lambda_\Omega \circ \Omega\|_1$ (19)

where $\Lambda_B$ and $\Lambda_\Omega$ are regularization parameter matrices and $\circ$ denotes the elementwise product. Indeed, to enforce the screening constraints, we can set $(\Lambda_B)_{ij} = +\infty$ for $(i,j) \notin \mathcal{E}_B$ and $(\Lambda_B)_{ij} = 0$ otherwise, and define $\Lambda_\Omega$ similarly.
To solve for $B$ with $\Omega$ held fixed, it suffices to study

$\min_{B} \; \frac{1}{n} \operatorname{tr}\{(X - \tilde{X}B)\,\Omega\,(X - \tilde{X}B)^{\top}\} + \|\Lambda_B \circ B\|_1$ (20)

With $B$ fixed, the problem of interest reduces to

$\min_{\Omega \succ 0} \; \operatorname{tr}(\hat{\Sigma}\Omega) - \log|\Omega| + \|\Lambda_\Omega \circ \Omega\|_1, \qquad \hat{\Sigma} = \frac{1}{n}(X - \tilde{X}B)^{\top}(X - \tilde{X}B)$ (21)
Fortunately, the optimization still falls into the framework described in Section III-A. We introduce the fine learning of graphs (FLOG) algorithm as follows. For simplicity, suppose both penalties are of $\ell_1$ type. Algorithm 1 can be adapted to the optimization (20), with $\Omega$ fixed at its current estimate and with the iterates initialized from the screening stage.
Experimentation shows that the line search performance is not sensitive to the values of the backtracking constants; we simply set them following [41]. As for the optimization (21), this is just the Gaussian graph learning problem with the sample covariance matrix given by $(X - \tilde{X}B)^{\top}(X - \tilde{X}B)/n$. The popular graphical lasso [12] can be used.
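The $B$-update (20) with $\Omega$ fixed is a penalized least-squares problem. A simple proximal-gradient (ISTA) sketch of this step, not the paper's exact FLOG implementation, is:

```python
import numpy as np

def flog_B_step(X, Xlag, Omega, lam, mask, n_iter=500, step=None):
    """Proximal-gradient sketch for (20):
    min_B (1/n) tr{(X - Xlag B) Omega (X - Xlag B)^T} + lam ||B||_1,
    with entries outside `mask` (the screened support) pinned to zero."""
    n, p = X.shape
    if step is None:
        # 1/L with L an upper bound on the gradient's Lipschitz constant
        L = (2.0 / n) * np.linalg.norm(Xlag, 2) ** 2 * np.linalg.norm(Omega, 2)
        step = 1.0 / L
    B = np.zeros((p, p))
    for _ in range(n_iter):
        R = X - Xlag @ B
        grad = -(2.0 / n) * Xlag.T @ R @ Omega        # gradient of the loss
        B = B - step * grad
        B = np.sign(B) * np.maximum(np.abs(B) - step * lam, 0.0)  # soft-threshold
        B *= mask                                     # screening constraints
    return B

rng = np.random.default_rng(1)
p, n = 4, 200
B_true = np.zeros((p, p)); B_true[0, 1] = 0.6
Xlag = rng.standard_normal((n, p))
X = Xlag @ B_true + 0.1 * rng.standard_normal((n, p))
B_hat = flog_B_step(X, Xlag, np.eye(p), lam=0.05, mask=np.ones((p, p)))
print(np.round(B_hat, 2))  # entry (0,1) close to 0.6, rest near 0
```

Setting entries of `mask` to zero reproduces the infinite-penalty convention for $\Lambda_B$ above; the $\Omega$-step (21) would then be handed to any graphical lasso solver.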
Some related works. The MRCE algorithm solves a fine learning problem similar to (19), but without screening constraints. Lee and Liu [16] generalized MRCE to handle weighted $\ell_1$ penalties. Both algorithms use cyclical coordinate descent in the $B$-optimization step, which has a high worst-case cost [15]. In contrast, the cost of the proposed update in FLOG is dominated by the matrix multiplications for computing the gradient in (20). Experiments show that FLOG is more efficient than MRCE under the same settings of error tolerance and maximum iteration number.
With FLOG introduced, the two-stage JGSE learning framework is complete. We point out that although FLOG is more efficient and scalable than MRCE, the main contribution of JGSE lies in Stage 1, which reduces the problem size and search space for fine estimation.
V Experiments on synthetic data
In this section, we show the performance of GIST and FLOG in the JGSE network learning using synthetic data.
V-A Identification and estimation accuracy
We compare the proposed JGSE with some relevant methods in the literature:

sGTG estimates the sparse $B$ only, assuming $\Omega \propto I$. It is implemented using the coordinate descent algorithm [42].

sCDG estimates the sparse concentration matrix $\Omega$ only, after centering the data. It is implemented using the graphical lasso [12].

MRCE [15] jointly estimates $B$ and $\Omega$ subject to separate $\ell_1$ penalties; its implementation is given by the R package "MRCE".
In all the methods, the $\ell_1$ penalty function is used for $B$/$\Omega$. Experiments are performed for the following networks with different sizes and topologies.

Example 1: The network consists of two equally sized subnetworks.

Example 2: The network consists of three subnetworks of sizes 40, 20, 20.

Example 3: The network consists of four equally sized subnetworks.

Example 4: $B$ has no subnetwork structure.

Example 5: $\Omega$ is nondiagonal and shows no subnetwork structure.
The identification accuracy is measured by the true positive rate (TPR) and false positive rate (FPR). In Examples 1-4, the estimation accuracy is measured by the model error (ME) [25]. In Example 5, only $\Omega$ is estimated, and its estimation error is reported.
In each of the settings, the number of unknown variables is much larger than the number of observations. The diagonal blocks of $B$ and $\Omega$ are sparse random matrices generated independently, following the scheme in [25]. The data observations are then generated from the multivariate time series model (1). We repeat the synthetic data experiment in each setting 50 times and summarize the performance of each algorithm as follows. Mean TPR and FPR are reported. The distribution of ME appears non-Gaussian and multimodal; for robustness and stability, the 25% trimmed mean of the model errors from multiple runs is reported. The algorithms include sGTG, sCDG, MRCE, FLOG and JGSE. FLOG is included for comparison with MRCE, and denotes FLOG applied to the whole network, i.e., running the second-stage algorithm of JGSE without the first-stage JAG screening. (We point out, however, that this is not the recommended way of network estimation in this paper; our proposed JGSE applies FLOG after GIST screening.) The JAG weight parameter $\varrho$ is taken to be 1 throughout all experiments. In Examples 1-3, spectral clustering is called after running sGTG, sCDG, MRCE, and FLOG, because of the existence of subnetworks. All regularization parameters are chosen by minimizing the model validation error, evaluated on 1,000 validation samples independently generated in addition to the training data. We set the value of $m$ large enough for screening. (Tuning the quantile parameter showed no observable difference; its robustness is also seen in Section V-B.) Table I shows the results.
        Example 1           Example 2           Example 3            Example 4           Example 5
        (TPR, FPR), ME      (TPR, FPR), ME      (TPR, FPR), ME       (TPR, FPR), ME      (TPR, FPR), ME
sGTG    (24%, 4%), 1947.8   (16%, 2%), 6404.6   (12%, 1%), 21360.3   (89%, 20%), 53.5    (29%, 6%), 182.1
sCDG    (47%, 28%), N/A     (32%, 17%), N/A     (26%, 10%), N/A      (70%, 46%), N/A     (93%, 38%), 2.8
MRCE    (63%, 13%), 106.6   (61%, 8%), 187.9    Infeasible           (88%, 29%), 98.6    (76%, 20%), 7.3
FLOG    (83%, 25%), 100.9   (91%, 21%), 166.8   (88%, 13%), 635.1    (87%, 21%), 68.8    (88%, 45%), 5.6
JGSE    (91%, 28%), 74.6    (95%, 23%), 140.8   (95%, 14%), 549.4    (85%, 10%), 53.6    (87%, 44%), 5.6
In Examples 1-3, both sGTG and sCDG suffer from oversimplified model assumptions and fail to identify network connections accurately. Indeed, we frequently observe that the conditional dependence graph from sCDG is not sufficiently sparse. It seems that sCDG tries to rephrase first-order dynamics as node correlations and consequently produces a dense second-order topology. sGTG shows the lowest TPR values and misses many true connections, which is a sign of over-shrinkage. Not surprisingly, in the two degenerate cases, sGTG behaves well in Example 4 and, similarly, sCDG does a good job in Example 5, the two settings closest to their respective model assumptions.
MRCE estimates both the first-order and second-order statistics and achieves much lower error rates than sGTG (except in Example 4). However, MRCE is computationally expensive and may be infeasible for large-scale problems: in Example 2, a single run took MRCE around 40 minutes, and in Example 3 it became computationally intractable. FLOG showed no such computational limitations. The two algorithm designs resulted in different estimates (recall that the objective criterion is nonconvex), and MRCE is less accurate in general.
The complete JGSE learning is even more efficient, owing to the first-stage GIST for robust JAG screening and decomposition. More importantly, JGSE shows remarkable improvements in estimation in almost all cases; the only exception is Example 5, where JGSE performs comparably to FLOG. These positive results validate the power of GIST in removing many unnecessary edges and reducing the search space for topology identification. In all, our two-stage JGSE (GIST+FLOG) beats the existing joint graph learning method MRCE.
V-B GIST in Decomposition
In this subsection, we examine the performance of GIST in network decomposition, using the Rand index (RI) [43] for evaluation. The RI is obtained by comparing the membership of each pair of nodes assigned by an algorithm with the true membership. If a pair from the same true cluster is assigned to a single cluster, it counts as a true positive (TP); if a pair from different true clusters is assigned to different clusters, it counts as a true negative (TN); false positives (FP) and false negatives (FN) are defined similarly. Then RI = (TP + TN) / (TP + TN + FP + FN).
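A minimal implementation of this pair-counting definition (an illustrative Python sketch, not part of the proposed algorithms):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: fraction of node pairs on which two clusterings agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)  # counts both TP (same/same) and TN (diff/diff)
        total += 1
    return agree / total
```

An RI of 1 means the two partitions agree on every pair of nodes.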
We fix a small sample size and vary the network dimension in this experiment. The time series data are again generated according to the multivariate autoregression (1). All networks consist of two equally sized subnetworks; each diagonal block of the transition matrix is generated as a random sparse matrix, and each diagonal block of the concentration matrix has diagonal elements 1 and constant off-diagonal elements. Each experiment is repeated 50 times.
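The data generation can be sketched as follows, assuming model (1) takes the VAR(1) form x_t = A x_{t-1} + e_t with Gaussian noise; the stability rescaling of the transition block is our assumption for illustration and not necessarily the exact scheme of [25]:

```python
import numpy as np

def sparse_block(p, density=0.1, scale=0.3, rng=None):
    """A sparse random transition block, rescaled so its spectral radius
    is `scale` (< 1) to keep the autoregression stable."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((p, p)) * (rng.random((p, p)) < density)
    r = np.max(np.abs(np.linalg.eigvals(A)))
    return A * (scale / r) if r > 0 else A

def simulate_var1(A, Sigma, n, rng=None):
    """Generate n observations from x_t = A x_{t-1} + e_t, e_t ~ N(0, Sigma)."""
    rng = np.random.default_rng(rng)
    p = A.shape[0]
    L = np.linalg.cholesky(Sigma)  # Sigma = L L^T
    X = np.zeros((n, p))
    x = np.zeros(p)
    for t in range(n):
        x = A @ x + L @ rng.standard_normal(p)
        X[t] = x
    return X
```

A block-diagonal transition matrix built from such blocks yields the two-subnetwork structure used in this experiment.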
Given the network data, we apply GIST to obtain a JAG estimate and perform the robust decomposition (cf. Section III-D). The decompositions based on the sole GTG and the sole CDG (after centering the data) are obtained as well; all decompositions are via spectral clustering. Although GIST considers a more complex model, it runs efficiently because of its screening nature. The mean RI results are shown in Figure 3.
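Spectral clustering of a graph's nodes can be sketched as below; this is a generic normalized-Laplacian version with a deterministic farthest-point initialization for the k-means step, not the exact routine used in our experiments:

```python
import numpy as np

def spectral_clusters(W, k, iters=100):
    """Cluster the nodes of a symmetric weighted graph W into k groups via the
    k smallest eigenvectors of the normalized Laplacian, then Lloyd k-means."""
    d = W.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_is[:, None] * W * d_is[None, :]
    _, vecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    U = vecs[:, :k].copy()
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # farthest-point initialization (deterministic), then Lloyd iterations
    idx = [0]
    for _ in range(k - 1):
        dist = np.min(((U[:, None] - U[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(dist)))
    C = U[idx]
    for _ in range(iters):
        lab = np.argmin(((U[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([U[lab == j].mean(axis=0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return lab
```

On a graph with k well-separated components, the embedded rows collapse to k distinct points, so the clustering recovers the components exactly.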
In all settings, GIST achieves a more reliable decomposition and outperforms sGTG and sCDG by a large margin. This shows that the network decomposition based on the joint graph is trustworthy. Moreover, its performance is rather insensitive to the choice of the quantile parameter, as long as the resulting threshold bounds the true network cardinality. This offers great ease in practice.
GIST is also very fast: for any network in the experiments, it takes only a few seconds to obtain the graphical screening pattern and subnetwork structure. A more comprehensive investigation of the computational cost is given in the next subsection.
V-C Computation time reduction via graph screening
Now we study how much computational cost can be saved by applying GIST before the fine learning. All simulated networks consist of multiple equally sized subnetworks; the diagonal blocks of the transition and concentration matrices are generated in the same manner as in Section V-B. We record the total computation time of the JGSE learning ("GIST+FLOG") and the computation time of applying FLOG directly to the whole network, without graph screening or decomposition. (MRCE is not included in the comparison because it is extremely slow on large data.) Solution paths are computed over a grid of regularization parameter values that covers various sparsity patterns. The quantile parameter is set to 0.3 in GIST. Table II reports the ratio of the former time to the latter for different combinations of total network size and subnetwork size; the sample size is the same in all experiments.
(total nodes, subnetwork size)  (500,250)  (500,100)  (500,50)  (1000,500)  (1000,200)  (1000,100)
Time ratio                      0.427      0.161      0.133     0.404       0.148       0.112
Table II shows that more than half of the running time can be saved when the network is decomposable. The larger the ratio of network size to subnetwork size, the more computational cost is saved. We conducted the experiment on a PC; with parallel computing resources, the computational efficiency can be boosted further. The network decomposition technique makes an otherwise computationally expensive, or even infeasible, problem much easier to solve.
VI Applications
In this section, we analyze real data from the S&P 500 and NASDAQ-100 stock markets using JGSE.
VI-A S&P 500
This dataset records the closing prices of the S&P 500 stocks from Jan. 1, 2003 to Jan. 1, 2008. It consists of 1258 samples for 452 stocks. The data have been preprocessed by log and differencing transformations [22].
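The log-and-differencing preprocessing amounts to taking log returns column-wise, e.g.:

```python
import numpy as np

def log_returns(prices):
    """Log-difference transform r_t = log(p_t) - log(p_{t-1}),
    applied column-wise to a (time x stocks) price matrix."""
    P = np.asarray(prices, dtype=float)
    return np.diff(np.log(P), axis=0)
```

The transform removes the multiplicative trend in price levels and leaves an approximately stationary series for the autoregressive modeling.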
We first applied GIST and the robust JAG decomposition. Figure 4(a) shows the resulting clusters, where the nodes are placed by the Fruchterman–Reingold algorithm [44]. Although no ground truth is available, interestingly, the obtained 10 subnetworks are highly consistent with the 10 categories given in the data documentation: the corresponding RI is nearly 0.9 (cf. Figure 4(b)).
We then varied the quantile parameter and systematically studied the clustering results based on GIST. The RIs with respect to the 10 stock categories are shown in Figure 4(b). For comparison, the sGTG and sCDG clusterings are also included. Our JAG decomposition is quite robust to the choice of the quantile parameter in GIST. It seems that the 10-category structure in the documentation is indeed reflected in the real stock data.
We also applied the popular BDSR [21, 20] (designed under the sole CDG learning setup) to decompose the S&P 500 into 10 subnetworks. Figure 5(a) shows that the network is now decomposed into one giant cluster and nine isolated nodes, which is more difficult to interpret than the GIST result; such a decomposition also provides little help in reducing the computational cost. Furthermore, Figure 5(b) shows the best-tuned sCDG estimate (using the R package huge [22] with default parameters). To achieve a 10-subnetwork decomposition, we found that the threshold must be made quite large. This is the dilemma discussed in Section III-D: BDSR resorts to an overly large threshold to yield a graph decomposition, but such a high thresholding level may mask many truly existing edges and result in an inaccurate estimate; correspondingly, its decomposition structure is unreliable. Of course, the poor performance of BDSR also has much to do with the fact that the transition matrix (GTG) estimation is ignored in the sCDG learning.
We next investigate the forecasting capability of JGSE by means of the conventional rolling MSE scheme (see, e.g., [45]). Standing at a given time point, we apply the estimation algorithm to the most recent observations within a rolling window of fixed size, and use the estimated transition matrix to forecast the future values recursively over the given horizon. The forecasting procedure is repeated until the rolling window slides to the end of the time series, and the rolling MSE is the average squared forecast error over all windows. We fixed the window size and horizon and compared sGTG, MRCE, and JGSE in each category. Because of the limited sample size, the large-data validation used in the synthetic experiments is not applicable. Following [25, 46, 47], we chose the tuning parameters by BIC, where the number of degrees of freedom depends on whether both the transition and concentration matrices are estimated or only the transition matrix is. Table III reports the rolling MSEs (multiplied by 1e+4 for readability) for the first five categories. (The conclusions for the last five categories are similar, but the first five have relatively larger dimensions.) Even compared with the widely acknowledged MRCE, JGSE offers better or comparable forecasting performance.

Model & Method  Category 1  Category 2  Category 3  Category 4  Category 5
sGTG            236.6       883.5       237.4       1859.1      456.5
MRCE            28.4        742.6       5.0         250.1       7.9
JGSE            1.4         3.7         2.0         5.4         7.9
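The rolling MSE scheme described above can be sketched as follows for a VAR(1) model; the `fit` callback stands in for any of the compared estimators (an illustrative sketch under our assumptions, not the exact experimental code):

```python
import numpy as np

def rolling_mse(X, fit, m, h=1):
    """Rolling h-step forecast MSE for a VAR(1) model.
    `fit` maps an (m x p) window of observations to an estimated transition
    matrix A_hat; the h-step forecast of the last observation is A_hat^h x."""
    errs = []
    for t in range(m, len(X) - h + 1):
        A_hat = fit(X[t - m:t])                       # estimate on the window
        x_hat = np.linalg.matrix_power(A_hat, h) @ X[t - 1]
        errs.append(np.mean((X[t + h - 1] - x_hat) ** 2))
    return float(np.mean(errs))
```

With an oracle estimator that returns the true transition matrix, the rolling MSE reduces to the forecast-noise level, which makes the function easy to sanity-check.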
VI-B NASDAQ-100
The NASDAQ-100 consists of 100 of the largest nonfinancial companies listed on the NASDAQ stock market. We collected the closing prices of these stocks for each trading day from Jan. 1, 2011 to Dec. 31, 2011, giving 252 samples in total (the data were downloaded from finance.yahoo.com). Differencing is applied to remove trends. There were several significant changes to the index during 2011; for example, NASDAQ rebalanced the index weights on May 2, 2011 before the market opened (see http://ir.nasdaqomx.com/releasedetail.cfm?releaseid=561718; more event details can be found at http://en.wikipedia.org/wiki/NASDAQ100#Changes_in_2011). In consideration of such major changes, we focus on the following segments: Segment 1 consists of 62 samples from Jan. 1 to Apr. 4; Segment 2, 23 samples from Apr. 4 to May 2; Segment 3, 32 samples from May 31 to Jul. 14; and Segment 4, 98 samples from Jul. 15 to Dec. 2.
We present the analysis of Segment 4 as an example. To get a conservative idea of the network cardinality, we applied sGTG and sCDG to the data separately, obtaining sparse graphs in both cases, and set the quantile parameter accordingly in running the GIST algorithm. After removing the isolated indices, we applied the FLOG algorithm to obtain the GTG and CDG estimates. The whole procedure took only a few minutes. We are particularly interested in the hub nodes of the JAG. Figure 6 shows all connections to and from the hub nodes. Nicely, the three hubs, PCLN (Priceline.com Inc.), GOOG (Google Inc.), and ISRG (Intuitive Surgical Inc.), come from the three largest sectors of the NASDAQ-100, namely Consumer Services, Technology, and Health Care, respectively. PCLN is a commercial website that helps customers obtain discounts for travel-related purchases, and it is not surprising that PCLN is related to some companies providing similar services, such as EXPE (Expedia Inc.), and some hospitality companies such as WYNN (Wynn Resorts, Limited). Similarly, GOOG, as a world-famous technology company, is related to many technology-based companies, such as AAPL (Apple Inc.), TXN (Texas Instruments Inc.), LLTC (Linear Technology Corporation), and so on. ISRG, a corporation that manufactures robotic surgical systems, is connected with INTC (Intel Corporation), MRVL (Marvell Technology Group Ltd.), and KLAC (KLA-Tencor Corporation), which all produce semiconductor chips and nanoelectronic products used in robotics.
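Hub nodes of a joint graph can be located by node degree, e.g. as in this illustrative sketch (the function name and tolerance are ours):

```python
import numpy as np

def hub_nodes(J, top=3, tol=1e-8):
    """Indices of the `top` highest-degree nodes in a joint adjacency matrix J
    (an edge is any nonzero entry; direction is ignored)."""
    adj = (np.abs(J) > tol) | (np.abs(J.T) > tol)   # symmetrize the support
    np.fill_diagonal(adj, False)                     # ignore self-loops
    deg = adj.sum(axis=1)
    return list(np.argsort(deg)[::-1][:top])
```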
The obtained GTG and CDG share some common connections. For example, PCLN not only has a negative causal influence on EXPE but is also negatively correlated with it conditioned on the other nodes. On the other hand, the two graphs differ in some ways. For example, although PCLN strongly Granger-causes SIRI (Sirius XM Holdings, Inc.), the two are conditionally independent; and the interaction between LLTC and GOOG is purely of second order, due to their conditional dependence without any direct Granger causality. Fortunately, the JAG encompasses all significant links in either the GTG or the CDG, and thus provides comprehensive network screening.
We have performed a similar analysis for the other segments and examined the changes in the network topology. Due to page limitations, the details are not reported here.
Next, we use the rolling scheme to investigate the forecasting performance of JGSE. For comparison, sGTG was also included; MRCE is computationally intractable here, so we applied our FLOG instead. BIC was used for regularization parameter tuning. The rolling MSEs of the three methods are shown in Table IV. We see that the joint estimation by FLOG outperforms the popular transition estimation (sGTG) in three of the four segments. This suggests the existence of wide-ranging conditional dependence between the stocks, and that it is beneficial to take such correlations into account in statistical modeling. JGSE further improves the forecasting performance by joint regularization, which is not surprising in light of Stein's classical work (e.g., [8]); it also owes much to the success of GIST in reducing the search space for the fine graph learning. These results echo the findings in the synthetic data experiments in Section V-A.
Model & Method  Segment 1  Segment 2  Segment 3  Segment 4
sGTG            24.3       30.8       20.6       32.2
FLOG            21.9       31.4       20.1       18.4
JGSE            18.6       28.7       18.6       15.1
VII Conclusion
We studied large-scale dynamical networks with sparse first-order and second-order statistical structures, where the first-order connections are captured by a directed Granger transition graph and the second-order correlations by an undirected conditional dependence graph. To jointly regularize the two graphs in topology identification and dynamics estimation, we proposed the two-stage JGSE framework. The GIST algorithm was developed for JAG screening and decomposition. As demonstrated by extensive synthetic-data experiments and real-world applications, our proposed algorithms beat the commonly used BDSR and MRCE in graph decomposition and estimation.
References
Sims [1980] C. A. Sims, "Macroeconomics and reality," Econometrica, vol. 48, no. 1, pp. 1–48, Jan. 1980.
Gourieroux and Jasiak [2002] C. Gourieroux and J. Jasiak, "Financial econometrics: problems, models and methods," University Presses of California, Columbia and Princeton: New Jersey, 2002.
Fujita et al. [2007] A. Fujita, J. Sato, H. Garay-Malpartida, R. Yamaguchi, S. Miyano, M. Sogayar, and C. Ferreira, "Modeling gene expression regulatory networks with the sparse vector autoregressive model," BMC Systems Biology, vol. 1, no. 1, 2007.
Bullmore and Sporns [2009] E. Bullmore and O. Sporns, "Complex brain networks: graph theoretical analysis of structural and functional systems," Nature Reviews Neuroscience, vol. 10, pp. 186–198, Mar. 2009.
Granger [1969] C. W. J. Granger, "Investigating causal relations by econometric models and cross-spectral methods," Econometrica, vol. 37, no. 3, pp. 424–438, 1969.
He et al. [2013] Y. He, Y. She, and D. Wu, "Stationary sparse causality network learning," J. Mach. Learn. Res., vol. 14, pp. 3073–3104, 2013.
She et al. [2014] Y. She, Y. He, and D. Wu, "Learning topology and dynamics of large recurrent neural networks," IEEE Transactions on Signal Processing, 2014, to appear.
Stein [1956] C. Stein, "Inadmissibility of the usual estimator for the mean of a multivariate distribution," Proc. Third Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 197–206, 1956.
Bolstad et al. [2011] A. Bolstad, B. D. Van Veen, and R. Nowak, "Causal network inference via group sparse regularization," IEEE Transactions on Signal Processing, vol. 59, no. 6, pp. 2628–2641, 2011.
Valdes-Sosa [2005] P. A. Valdes-Sosa, "Estimating brain functional connectivity with sparse multivariate autoregression," Phil. Trans. R. Soc. B, pp. 969–981, 2005.
Banerjee et al. [2008] O. Banerjee, L. El Ghaoui, and A. d'Aspremont, "Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data," The Journal of Machine Learning Research, vol. 9, pp. 485–516, 2008.
Friedman et al. [2008] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, Jul. 2008.
Bickel and Levina [2008] P. J. Bickel and E. Levina, "Regularized estimation of large covariance matrices," Ann. Statist., vol. 36, no. 1, pp. 199–227, 2008.
Meinshausen and Bühlmann [2006] N. Meinshausen and P. Bühlmann, "High-dimensional graphs and variable selection with the lasso," The Annals of Statistics, vol. 34, no. 3, pp. 1436–1462, 2006.
Rothman et al. [2010] A. J. Rothman, E. Levina, and J. Zhu, "Sparse multivariate regression with covariance estimation," Journal of Computational and Graphical Statistics, vol. 19, no. 4, 2010.
Lee and Liu [2012] W. Lee and Y. Liu, "Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood," Journal of Multivariate Analysis, 2012.
James and Stein [1961] W. James and C. Stein, "Estimation with quadratic loss," Proc. Fourth Berkeley Symp. Math. Statist. Prob., vol. 1, pp. 361–379, 1961.
Fink et al. [2009] A. Fink, R. H. Grabner, M. Benedek, G. Reishofer, V. Hauswirth, M. Fally, C. Neuper, F. Ebner, and A. C. Neubauer, "The creative brain: Investigation of brain activity during creative problem solving by means of EEG and fMRI," Human Brain Mapping, vol. 30, no. 3, pp. 734–748, 2009.
Stock and Watson [2012] J. H. Stock and M. W. Watson, "Generalized shrinkage methods for forecasting using many predictors," Journal of Business & Economic Statistics, vol. 30, no. 4, pp. 481–493, 2012.
Witten et al. [2011] D. M. Witten, J. H. Friedman, and N. Simon, "New insights and faster computations for the graphical lasso," Journal of Computational and Graphical Statistics, vol. 20, no. 4, pp. 892–900, 2011.
Mazumder and Hastie [2012] R. Mazumder and T. Hastie, "Exact covariance thresholding into connected components for large-scale graphical lasso," J. Mach. Learn. Res., vol. 13, pp. 781–794, Mar. 2012.
Zhao et al. [2012] T. Zhao, H. Liu, K. Roeder, J. Lafferty, and L. Wasserman, "The huge package for high-dimensional undirected graph estimation in R," J. Mach. Learn. Res., vol. 13, no. 12, pp. 1059–1062, Jun. 2012.
Lütkepohl [2007] H. Lütkepohl, New Introduction to Multiple Time Series Analysis, 1st ed. Springer, Oct. 2007.
Tibshirani [1996] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
Yuan and Lin [2007] M. Yuan and Y. Lin, "Model selection and estimation in the Gaussian graphical model," Biometrika, vol. 94, no. 1, pp. 19–35, 2007.
Efron and Morris [1973] B. Efron and C. Morris, "Stein's estimation rule and its competitors: an empirical Bayes approach," Journal of the American Statistical Association, vol. 68, no. 341, pp. 117–130, 1973.
She [2009] Y. She, "Thresholding-based iterative selection procedures for model selection and shrinkage," Electron. J. Statist., vol. 3, pp. 384–415, 2009.
She [2012] ——, "An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors," Computational Statistics and Data Analysis, vol. 9, pp. 2976–2990, 2012.
Fan and Li [2001] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, pp. 1348–1360, Dec. 2001.
She et al. [2013] Y. She, H. Li, J. Wang, and D. Wu, "Grouped iterative spectrum thresholding for super-resolution sparse spectrum selection," IEEE Transactions on Signal Processing, vol. 61, pp. 6371–6386, 2013.
She [2013] Y. She, "Reduced rank vector generalized linear models for feature extraction," Statistics and Its Interface, vol. 6, pp. 197–209, 2013.
She [2014] ——, "Selectable factor extraction in high dimensions," arXiv:1403.6212.
Armijo [1966] L. Armijo, "Minimization of functions having Lipschitz continuous first partial derivatives," Pacific Journal of Mathematics, vol. 16, no. 1, pp. 1–3, 1966.
Duchi et al. [2008] J. Duchi, S. Gould, and D. Koller, "Projected subgradient methods for learning sparse Gaussians," in Proceedings of the Twenty-fourth Conference on Uncertainty in AI (UAI), 2008.
Dulmage and Mendelsohn [1958] A. L. Dulmage and N. S. Mendelsohn, "Coverings of bipartite graphs," Canadian Journal of Mathematics, vol. 10, no. 4, pp. 516–534, 1958.
von Luxburg [2007] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
von Luxburg et al. [2008] U. von Luxburg, M. Belkin, and O. Bousquet, "Consistency of spectral clustering," The Annals of Statistics, pp. 555–586, 2008.
Fraley and Raftery [1998] C. Fraley and A. E. Raftery, "How many clusters? Which clustering method? Answers via model-based cluster analysis," The Computer Journal, vol. 41, no. 8, pp. 578–588, 1998.
Tibshirani et al. [2001] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 63, no. 2, pp. 411–423, 2001.
Sugar and James [2003] C. A. Sugar and G. M. James, "Finding the number of clusters in a dataset: An information-theoretic approach," Journal of the American Statistical Association, vol. 98, no. 463, pp. 750–763, 2003.
Nocedal and Wright [2006] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. New York: Springer, 2006.
Friedman et al. [2007] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani, "Pathwise coordinate optimization," The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 2007.
Rand [1971] W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
Fruchterman and Reingold [1991] T. M. Fruchterman and E. M. Reingold, "Graph drawing by force-directed placement," Software: Practice and Experience, vol. 21, no. 11, pp. 1129–1164, 1991.
Makridakis et al. [1998] S. Makridakis, S. Wheelwright, and R. Hyndman, Forecasting: Methods and Applications. Wiley, 1998.
Yin and Li [2011] J. Yin and H. Li, "A sparse conditional Gaussian graphical model for analysis of genetical genomics data," The Annals of Applied Statistics, vol. 5, no. 4, p. 2630, 2011.
Guo et al. [2011] J. Guo, E. Levina, G. Michailidis, and J. Zhu, "Joint estimation of multiple graphical models," Biometrika, vol. 98, no. 1, pp. 1–15, 2011.