High-dimensional correlated time series are found in many modern socio-scientific domains such as neurology, cyber-security, genetics and economics. A multivariate approach, where the system is modeled jointly, can potentially reveal important inter-dependencies between variables. However, naive approaches which permit arbitrary (graphical) dependency structures and dynamics are infeasible because the number of possible graphs becomes exponentially large as the number of variables increases.
For data streams with continuous-valued variables, a penalized maximum likelihood approach offers a flexible means to estimate the underlying dependency structure and continues to attract much attention. In this setting a common assumption is that the data is drawn from a multivariate distribution where the conditional dependency structure is in some sense sparse— the dependency graph is expected to constitute a small proportion of the total number of possible edges. Typically, a Gaussian likelihood is accompanied by a sparsity inducing prior. For example, Banerjee et al. (2008) and Friedman et al. (2008) penalize the likelihood with an norm applied to the precision matrix (non-convex penalties have also been investigated, see Chun et al. (2014)). Further extensions have been considered, for example Lafferty et al. (2012) who graphical model estimation in the non-parametric case, and more recently Lee and Hastie (2015) study models with mixed types of variable.
It is of particular interest to understand how dependency graphs evolve over time, and how prior knowledge relating to such dynamics can be exploited to constrain the graph estimation. Specifically, we consider how a piecewise constant graphical model can be estimated such that the dependency graphs are constant in locally stationary regions segmented by a set of changepoints. This is a challenging problem. If the changepoints were known in advance then local graph estimation could be performed. However, the changepoints cannot be found without first estimating the graphs. Previous approaches (Angelosante and Giannakis, 2011) have resorted to using dynamic programming alongside the graph learning approaches. Unfortunately, these are restricted to quadratic computational complexity as a function of the time-series length. An alternative approach as followed by Ahmed and Xing (2009); Kolar and Xing (2012) and others is to formulate a convex optimization problem with suitable constraints that encourage desired dynamical properties. We refer to such approaches as fused graphical models.
Our contribution investigates and extends a class of models to enable the estimation of changepoints that are grouped over multiple edges in a graphical model. Such grouped changepoints indicate changes in dependency across a system that influence many variables at once. In many situations there may be a priori knowledge of potential groups (for example, grouping over different gene function in genetic data or asset classes in finance). Grouped changes often indicate some sort of regime or phase change in the system dynamics which may be of interest to the practitioner. To this end we propose the group-fused graphical lasso (GFGL) method for joint changepoint and graph estimation. We contrast the proposed grouped estimation of changepoints in graphical models with previous approaches which enforce changepoints at the level of individual edges only and which therefore fail to capture such grouping behavior.
In Section 2, we describe current dynamical graphical model estimation; we introduce our main contribution in the form of the GFGL estimator; and contrast this with previously proposed fused graphical model approaches. Following this, in Section 3
we present an efficient alternating-directed method of moments (ADMM) algorithm for estimation with GFGL. The proposed methodology is demonstrated on simulated examples in Section4, before we consider an application looking at temporal-evolution of gene dependencies in Section 5. We conclude with a discussion of the work. Some technical details are summarized in the Appendix which can be found in the on-line supplementary material.
2 Dynamic Gaussian Graphical Models
Given a -variate time-series for , if the precision matrices are well defined then the dependency structure of the series can be captured by a dynamic Gaussian Graphical Model (GGM). This comprises a collection of graphs , where the vertices represent each component of and the edges represent conditional dependency relations between variables over time. More precisely, the edge is present in the graph if the and th variables are conditionally dependent given all other variables. It follows in the Gaussian case that conditional independence is satisfied if and only if the corresponding entries of the precision matrix are zero, i.e. (Lauritzen, 1996). Therefore, since it encodes the conditional dependency graph, estimation of the precision matrix is of great interest.
Traditionally, estimation of GGM’s is performed under the assumption of stationarity, i.e. we have identically distributed draws from a Gaussian model. Letting be a set of observations and assuming , one can construct an estimator for by maximizing the log-likelihood: , where . In the case where the number of observations is greater than the number of variables () we can test for edge significance to find a GGM (Drton and Perlman, 2004). However, in the non-identical case, because only one data-point may be observed at each node per time-step the traditional empirical covariance estimator is rank deficient for . To this end, estimation of the precision matrix requires additional modeling assumptions. A strategy explored in several recent works (Ahmed and Xing, 2009; Danaher et al., 2013; Gibberd and Nelson, 2014; Monti et al., 2014) is to introduce priors in the form of regularized M-estimators, viz.
with a convex cost function:
is proportional to the log-likelihood and follows from the normal distribution. The penalty terms, correspond to prior shrinkage/smoothness assumptions. Typically, the smoothness term will be a function of the difference between estimates , whereas the shrinkage term will act at specific time points, i.e. directly on .
One popular approach that is relevant to GGM is to use these regularizers to place assumptions on the number of dependencies in a graph. For example, in the i.i.d. case, we need not consider conditional dependencies between all variables, but only a small subset of those which appear most dependent. This assumption of sparsity can be viewed as placing a prior on the parameters to induce zeros in the off-diagonal entries of the precision matrix. Akin to the Laplace prior associated with the least absolute shrinkage and selection operator (lasso)1996), one can construct the graphical lasso estimator for the precision matrix as: , where . This estimator, as examined by Banerjee et al. (2007); Friedman et al. (2008) allows one to stabilize estimation of the precision matrix in the high-dimensional regime (where ) and estimate a sparse graph (sparsity is controlled by ). A full Bayesian treatment for the graphical lasso is given by Wang (2012) who investigate how representative the mode is of the full posterior.
Several approaches which incorporate dynamics in such graphical estimators have been suggested. Zhou et al. (2010); Kolar and Xing (2011) utilize a local estimate of the covariance in the term by replacing in the graphical lasso with a time-sensitive weighted estimator
where are weights derived from a symmetric non-negative smoothing kernel function with width related to . The resulting graphs are now representative of some temporally localized data. By making some smoothness assumptions on the underlying covariance matrix such a kernel estimator can be shown to be risk consistent (Zhou et al., 2010). Kolar and Xing (2011) go further, and demonstrate that placing assumptions on the Fisher information matrix allows one to prove consistent estimation of graph structure in such dynamic GGM. In the next section we discuss how one may adapt these assumptions to estimate piecewise constant GGM.
2.1 The group fused graphical lasso
We propose the Group Fused Graphical Lasso (GFGL) estimator for estimating piecewise constant GGM. The model assumes data is generated at time-point according to , where the distribution is strictly stationary i.e. between changepoints (note we set ). We propose to estimate the covariance and precision matrix at each time by minimizing a cost, as in Eq. (1). The GFGL cost is given as:
where is the matrix norm with the diagonal entries removed.
In the remainder of this paper we describe how one can efficiently solve the GFGL problem and demonstrate two key properties, namely:
Estimated precision matrices encode a sparse dependency structure whereby many of the off axis entries are exactly zero, i.e. .
Precision matrices maintain a piecewise constant structure where changepoints tend to be grouped across the precision matrix, such that for many edges indexed by and the estimated changepoints for the two edges are the same, viz. where represents the set of estimated changepoints on the th edge).
2.2 Relationship to previous proposals
|Dynamic Graphical Lasso||Zhou et al. (2010)||via kernel (see Eq. 3)|
|Temporally smoothed logistic regression (TESLA)||Ahmed and Xing (2009)|
|Joint Graphical Lasso (JGL)*||Danaher et al. (2013)|
|Fused Multiple Graphical Lasso (FMGL)*||Yang et al. (2015)|
|SINGLE||Monti et al. (2014)|
|VCVS Model||Kolar and Xing (2012)||For each node|
Unlike most previous proposals (see Table 1) GFGL penalizes changes across groups of edges in the graph. One notable exception to this can be found in the Varying-Coefficient Varying-Structure (VCVS) model of Kolar and Xing (2012) who propose to select changepoints with an type norm over the differences. The motivation in that work is similar to ours. However, the authors formulate the graph-selection problem differently, utilizing a node-wise regularized regression estimator, rather than the multivariate Gaussian likelihood we use. Whilst node-wise estimation can recover the conditional dependency graph, it does not in general result in a valid (positive definite) precision matrix. This is in contrast to our approach here, where the positive-definite precision matrices can be used to define a probabilistic model via the GGM.
In particular we consider comparison to fused methods such as FMGL (Yang et al., 2015), TESLA (Ahmed and Xing, 2009), SINGLE (Monti et al., 2014) and JGL (Danaher et al., 2013). These methods are similar to each other in that they permit finding a smoothed graphical model through a fused term. Throughout the paper we will refer to models of this type as the Independent Fused Graphical Lasso (IFGL) with the same cost function as GFGL (see Eq. 4), but with the group-smoothing term replaced with an penalized difference, such that . Rather than focusing on the smoothly evolving graph through the kernel covariance estimator , we instead study the difference between the smoothing regularizer for IFGL and GFGL. Throughout the rest of this paper we adopt a purely piecewise constant graph model, in this setting, the empirical covariance is simply estimated with the data at time according to; . One can think of this as using a Dirac-delta kernel for the covariance estimate. For example in Eq. (3) we can set .
3 Algorithms for the group fused graphical lasso problem
Since the penalty function of IFGL approaches solely comprises terms it is linearly separable. As such this permits block-coordinate descent approaches utilized, for example, by Friedman et al. (2008); Yang et al. (2015) whereby the precision matrix rows and columns are sequentially updated. Unfortunately, the GFGL objective (Eq. 4) does not have the same linear separability structure. This is due to the norm acting across the whole (or at least multiple rows/columns) of the precision matrix. This lack of linear separability across the precision matrices precludes a block-coordinate descent strategy (Tseng and Yun, 2009). Instead, we make use of the separability of the group norm (with respect to time) and propose an Alternating Directed Method of Moments (ADMM) algorithm. A key innovation of our contribution is to incorporate an iterative proximal projection step to solve the Group Fused Lasso sub-problem. Additionally, we demonstrate how the same framework can be utilized to solve the previously proposed IFGL problem.
3.1 An alternating directions method of multipliers approach
where and the auxiliary variables are also constrained to be positive-semi-definite. The augmented Lagrangian for GFGL is given as:
where is a set of dual matrices . The difference between ADMM and the more traditional augmented Lagrangian method (ALM) (Glowinski and Le Tallec, 1989) is that we do not need to solve for and jointly. Instead, we can take advantage of the separability structure highlighted in Eq. (5) to solve , separately. By combining the inner product terms and the augmentation term we find:
where is a rescaled dual variable. We write the solution at the th iteration as and proceed by updating our estimates according to the three steps below;
Likelihood Update ():
Dual Update ():
3.2 Likelihood update (Step 1)
We can solve the update for through an eigen-decomposition of terms in the covariance, auxiliary and dual variables (Yuan, 2011; Monti et al., 2014). If we differentiate the objective in Eq. (6) and set the result equal to zero we find:
Noting that and. For each eigenvalue and we can construct the quadratic equation . The right hand side of Eq. (9) contains evidence from the data-set via , but also takes into account the effect our priors encoded in , from the non-smooth portion of Eq. (5). Upon solving for given we find:
The full precision matrix can now be found through the eigen-decomposition:
where contains the eigenvectors of as columns and is a diagonal matrix populated by the eigenvalues , ie . We note that, by choosing the positive solution for the quadratic, we ensure that is positive-definite and thus produces a valid estimator for the precision matrix. Since Eq. (6) refers to an estimation at each time-point separately, we can solve for each independently for to yield the set . Indeed this update can be computed in parallel, as appropriate.
3.3 Group fused lasso signal approximator (Step 2)
The main difference between this work and previous approaches is in the use of a grouped constraint. This becomes a significant challenge when updating in Eq. (7). Unlike the calculation of , we cannot separate the optimization over each time-step. Instead, we must solve for the whole set of matrices jointly. In addition, due to the grouped term in GFGL, we cannot separate the optimization across individual edges. In contrast to independent penalization strategies (Monti et al., 2014; Danaher et al., 2013) it is not possible to solve GFGL for independently of , where . Such an inconvenience is to be expected as the constraints which extract changepoints in GFGL can act across all elements in .
For notational convenience we re-write step two in vector form. Since eachis symmetric about the diagonal we can reduce the number of elements by simply taking the elements above the diagonal . We then construct a matrix form such that , whereby row of the matrix correspond to values at time-step . We perform similar transformations for and , and set and 111Note that, since we have essentially split the data in half (due to symmetry), we may wish to adjust the lambdas to be consistent with the original problem specification in Eq. (4).. Re-writing the objective in Eq. (7) with these transformations yields the cost function
where is a backwards differencing matrix of the form for and zero otherwise, the the group norm is defined as . If one constructs a target matrix then
looks like a signal approximation problem, we will refer to this problem as the Group-Fused Lasso Signal Approximator (GFLSA). This looks similar to the previously studied Fused Lasso Signal Approximator (FLSA) (Liu et al., 2010) but crucially incorporates a group norm rather than the norm of FLSA.
We note Eq. (11) can also be thought of as a proximity operator, such that . If and were indicator functions of two closed convex sets and respectively, then would find the best approximation to restricted to the set . Unlike FLSA which penalizes the columns of independently, we find and cannot apply the two-stage smooth-then-sparsify theorem of Friedman et al. (2007); Liu et al. (2010). Instead, we follow the work of Alaíz et al. (2013) and adopt an iterative projection approach which utilizes Dykstra’s method (Combettes and Pesquet, 2011) to find a feasible solution for both the group fused and lasso constraints.
For any unconstrained optimal point there exists a set of parameters which will act to move the optimal point of the regularized case such that where:
For a given likelihood term, we can obtain an sparse and smooth solution by solving a penalized problem instead of the explicitly constrained version above. Such a penalized form is found in Eq. (10) and, while and are not explicitly indicator functions (i.e. they do not take values outside some feasible region), there does exist a mapping between the values of the parameters and the corresponding sparsity and smoothness constraints. To give some intuition, for a given constraint level and function , the size of the feasible set given by , reduces as increases. Thus sparsity is a monotonically non-decreasing function of The same argument can be constructed for smoothing and the constraint set . The proximal Dykstra method provides a way to calculate a point that is, in the sense of the distance, close or proximal to the unconstrained solution for . By iterating between the feasibility of a solution in and (see Algorithm 1), a solution can be found which is both suitably smooth and sparse.
Given that iterative projection can be used to find a feasible point, the challenge is now to compute the separate proximity operators for and . The proximal operator for the term is given by the soft-thresholding operator (Tibshirani, 1996):
where the and functions act in an element-wise manner and denotes element-wise multiplication.
Computing the group-fused term is more involved and there is no obvious closed-form solution, instead we tackle this through a block-coordinate descent approach similar to that considered by Bleakley and Vert (2011) and Yuan and Lin (2006). Our target here is to find the proximal operator for the group smoothing aspect of the regularizer, which we write as:
Re-writing the above with and constructing as a sum of differences via , (where ) then one can interpret the proximal operator as a group lasso problem (Bleakley and Vert, 2011) . Writing the re-parameterized problem in matrix form one can show that solving for the jump parameters allows us to reconstruct an estimate for . This is formally equivalent to a group lasso (Yuan and Lin, 2006) class of problem:
where a bar denotes a column centered matrix and is a matrix with entries for and otherwise. The problem above can be solved through a block-coordinate descent strategy, sequentially updating the solution for each block for (see Appendix). We can then construct a solution for by summing the differences and noting that the optimal value for is given by . Correspondingly, the proximal operator for is found via
3.4 Dual update and convergence (step 3)
The final step in the ADMM-based method is to update the dual variable via Eq. (8). Convergence properties of general ADMM algorithms are analyzed in Glowinski and Le Tallec (1989). The sequence of solutions can be shown to converge (Eckstein and Bertsekas, 1992) to the solution of the problem: , under conditions that is invertible and the intersection between relative interiors of domains is non-empty: . In the GFGL and IFGL problems considered here one simply sets , in order to restrict . Clearly in this case is invertible and ; thus the relative interiors intersect.
Whilst ADMM is guaranteed to converge to an optimal solution, in practice it converges relatively fast to a useful solution, but very slowly if high accuracy is required. Following the approach of Boyd et al. (2011) we consider tracking two convergence criteria: one tracking primal feasibility: , relating to the optimality requirement , and the other looking at dual feasibility: , which tracks the requirement that , where denotes optimal value. The rate at which the algorithm converges is somewhat tunable through the parameter. However it is not clear how to find an optimal for a given problem. In practice we find that a value of order provides reasonably fast convergence which with tolerances order; and .
At this point it is worth noting that there are a variety of ways one can break down Eq. (4) as an ADMM problem. In this paper we proceed by simply adding one set of auxiliary variables as in Eq. (5); however, one could also adopt a linearized ADMM scheme (Parikh and Boyd, 2013) to deal with the differencing total-variation term. A linearized scheme would result in a different set of problems for the proximity updates. The motivation for splitting the problem up as we have, constraining , is that in the IFGL case we can solve the constraint update (c.f. Eq. 7) using the efficient fused lasso signal approximator algorithm of Liu et al. (2010). Given that we are interested in how the solution of the GFGL and IFGL estimators compare it is prudent to ensure that the formulation of the algorithm is similar for both objectives. For example, given our formulation, we know at each step will be exactly sparse and the augmented weighting is comparable for both IFGL and GFGL.
3.5 A solver for the Independent Fused Graphical Lasso
The main comparison in this paper is between the GFGL and the IFGL classes of estimators that, respectively, fuse edges on an group and individual level. It is worth noting that the ADMM (Algorithm 2) described for GFGL can easily be adapted for such IFGL problems by modifying the second step that corresponds to the non-smooth constraint projection. In place of Eq. (10), we construct a fused lasso problem:
where and we replace the norm of GFGL with a simple penalty of IFGL. Since the norm is linearly separable, i.e. , the objective can now be viewed as a series of separate FLSA problems. This can be solved efficiently with gradient descent. In the IFGL case there is no need to apply the iterative Dykstra projection as one can show the proximity operator can be calculated as (Liu et al., 2010).
4 Synthetic Experiments
IFGL and GFGL are here applied to simulated, piecewise stationary, multivariate time-series data. This provides a numerical comparison of their relative abilities to (i) recover the graphical structure and (ii) detect changepoints.
4.1 Data simulation
To validate the graphical recovery performance of the estimators, data is simulated according to a known ground truth set of precision matrices . The simulation is carried out such that, for a given number of ground truth changepoints , there are corresponding graph structures. For each segment , graphical structure is simulated uniformly at random from the set of graphs with vertex size and edges, i.e. . A draw of can then be used to construct a valid GGM by equating the sparsity pattern of the adjacency matrix and precision matrix, i.e. .
Precision matrices are formed by taking a weighted identity matrixand inserting off-diagonal elements according to edges that are uniformly weighted in the range
. The absolute value of these elements is then added to the appropriate diagonal entries to ensure positive semi-definiteness. To focus on the study of correlation structure between variables, the variance of the distributions are normalized such thatfor .
4.2 Hyper-parameter selection
With most statistical estimation problems there are a set of associated tuning parameters (common examples include; kernel width/shape, window sizes, etc.) which must be specified. In the GFGL and IFGL model, one can consider the regularizer terms and in Eq. (4) as effecting prior knowledge on the model parameterization. Given this viewpoint, selection of tuning parameters corresponds to specification of hyper-parameters for graph sparsity and smoothing.
The recovery performance will depend on the strength of priors employed. As such and must be tuned, or otherwise estimated, such that they are appropriate for a given data-set or task. In comparison to models which utilize only one regularizing term (for example, the graphical lasso of Banerjee et al. (2007)) the potential interplay between and sometimes conflates the interpretation of the different regularizers. For example, whilst predominantly effects the sparsity of the extracted graphs, can also have an implicit effect through smoothing (see Appendix for more details).
In the synthetic data-setting, the availability of ground-truth or labeled data affords the opportunity to learn the hyper-parameters via a supervised scheme. In order to avoid repeated use of data, the simulations are split into test and training groups which share the same ground-truth structure , but are independently sampled. The IFGL and GFGL problems are then solved for each pair of parameters over a search grid. Optimal hyper-parameters can then be selected according to a relevant measure of performance. Typically (Zhou et al., 2010)
one considers either predictive risk (approximation of the true distribution), or model recovery (estimation of the correct sparsity pattern). In addition to tuning parameters via cross-validation, we also compared this to estimation via heuristics such as BIC. However, when applied to the IFGL/GFGL methods in this work such heuristics resulted in poor graph recovery performance (see Appendix for more details).
4.3 Model recovery performance
Considering the model recovery setting, the problem of selecting edges can be treated as a binary classification problem. One popular measure of performance for such problems is the -score
considers the number of correctly classified edges, whilstand relate to the number of false positives and false negatives (Type 1 and Type 2 errors) respectively (a score of represents perfect recovery). Since dynamic network recovery is of interest, the average -score is taken over each time-series to measure the effectiveness of edge selection.
time-series. Error-bars are omitted for presentation, with an estimated standard deviation in the-scores of and . (b) Demonstration of GFGL and IFGL graph recovery as a function of the number of estimated changepoints .
For each training time-series an optimal set of parameters are chosen which maximise the -score, namely