Time-dependent spatially varying graphical models, with application to brain fMRI data analysis

11/10/2017
by Kristjan Greenewald, et al.

In this work, we present an additive model for space-time data that splits the data into a temporally correlated component and a spatially correlated component. We model the spatially correlated portion using a time-varying Gaussian graphical model. Under assumptions on the smoothness of changes in covariance matrices, we derive strong single sample convergence results, confirming our ability to estimate meaningful graphical structures as they evolve over time. We apply our methodology to the discovery of time-varying spatial structures in human brain fMRI signals.

1 Introduction

Learning structured models of high-dimensional datasets from relatively few training samples is an important task in statistics and machine learning. Spatiotemporal data, in the form of $p$ variables evolving over $T$ time points, often fits this regime due to the high ($pT$) dimension and the potential difficulty of obtaining independent samples. In this work, we develop a nonparametric framework for estimating time-varying spatiotemporal graphical structure using an $\ell_1$ regularization method. The covariance of a spatiotemporal data array $X = [x_1, \dots, x_T]$ is a $pT \times pT$ matrix

$\Sigma = \mathrm{Cov}[\mathrm{vec}(X)],$   (1)

where $x_t \in \mathbb{R}^p$, $t = 1, \dots, T$, denotes the variables or features of interest at the $t$-th time point. Even for moderately large $p$ and $T$, the number of degrees of freedom ($pT(pT+1)/2$) in the covariance matrix can greatly exceed the number of training samples available for estimation. One way to handle this problem is to introduce structure and/or sparsity, thus reducing the number of parameters to be estimated. Spatiotemporal data is often highly structured, hence the design of estimators that model and exploit appropriate covariance structure can provide significant gains.

We aim to develop a nonparametric framework for estimating time-varying graphical structure for matrix-variate distributions. Associated with each $B(t)$ is its undirected graph $G(t)$. Under the assumption that the law of $x_t$ changes smoothly, Zhou et al. (2010) introduced a nonparametric method to estimate the graph sequence $\{G(t/T)\}_{t=1}^{T}$ assuming that the $x_t \sim N_p(0, B(t/T))$ are independent, where $B(t)$ is a smooth function over $t \in [0,1]$ and we have mapped the time indices $1, \dots, T$ onto points $t/T$ on the interval $[0,1]$. In this work, we are interested in the general time series model where the $x_t$ are dependent and the graphs change over time.

One way to introduce dependency into the $x_t$ is to study the following covariance structure. Let $B(t)$, $t \in [0,1]$, be $p \times p$ symmetric positive definite covariance matrices, and let $A = (a_{st})$ be a $T \times T$ symmetric positive definite covariance matrix, with $\mathrm{diag}(A)$ denoting the diagonal matrix with the elements $a_{11}, \dots, a_{TT}$ along its diagonal. Consider the random matrix $X = [x_1, \dots, x_T] \in \mathbb{R}^{p \times T}$ with row vectors $y^1, \dots, y^p$ corresponding to measurements at the $i$-th spatial location, $i = 1, \dots, p$, and columns $x_t$ corresponding to the measurements at times $t = 1, \dots, T$:

$\mathrm{vec}(X) \sim N_{pT}(0, \Sigma),$   (2)
$\mathrm{Cov}(x_t) = a_{tt} I_p + B(t/T),$   (3)

that is, the covariance of the column vectors corresponding to each time point changes smoothly with time (if $B(t)$ is a smooth function of $t$). This provides ultimate flexibility in parameterizing spatial correlations, for example, across different geographical scales through variograms (Cressie, 2015), each of which is allowed to change over seasons. Observe that while we have used the normal distribution here for simplicity, all our results hold for general subgaussian distributions.

The model (3) also allows modeling dynamic gene regulatory and brain connectivity networks with topological (e.g., Erdős-Rényi random graph, small-world graph, or modular graph) constraints via degree specifications, as well as spatial constraints in the set of graphs $\{G(t)\}$. When $A$ is diagonal, we return to the case of Zhou et al. (2010) where there is no temporal correlation, i.e., the $x_t$ are assumed to be independent.

We propose methodologies to study the model as constructed in (2) and (3). Building upon and extending techniques of Zhou et al. (2010), Rudelson & Zhou (2017), and Zhou (2014), we aim to design estimators for the graph sequence $G_A, G_B(t_1), G_B(t_2), \dots$, where the temporal graph $G_A$ and the spatial graphs $G_B(t)$ are determined by the zeros of $A^{-1}$ and $B^{-1}(t)$, respectively. Intuitively, the temporal correlation and the spatial correlation are modeled as two additive processes. The covariance of $\mathrm{vec}(X)$ is now

$\Sigma = A \otimes I_p + \sum_{t=1}^{T} (e_t e_t^\top) \otimes B(t/T),$   (4)

where $e_1, \dots, e_T$ are the $T$-dimensional standard basis vectors.
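To make the Kronecker-sum structure in (4) concrete, the following NumPy sketch assembles $\Sigma$ from a temporal covariance $A$ and a list of spatial covariances $B(t/T)$. The function name and inputs are our own illustration, not code from the paper.

```python
import numpy as np

def kron_sum_cov(A, B_list):
    """Covariance (4): Sigma = A kron I_p + sum_t (e_t e_t^T) kron B(t/T).

    A      : (T, T) temporal covariance.
    B_list : length-T list of (p, p) spatial covariances B(t/T).
    Returns the (pT, pT) covariance of vec(X) (columns of X stacked).
    """
    T = A.shape[0]
    p = B_list[0].shape[0]
    Sigma = np.kron(A, np.eye(p))       # temporal part: A kron I_p
    for t in range(T):                  # spatial part: block-diagonal in time
        Sigma[t * p:(t + 1) * p, t * p:(t + 1) * p] += B_list[t]
    return Sigma
```

Note that the second term of (4) is simply block-diagonal in time, which is why it can be added block by block.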

In the context of this model, we aim to develop a nonparametric method for estimating time-varying graphical structure for matrix-variate normal distributions using an $\ell_1$ regularization method. We will show that, as long as the covariances change smoothly over time, we can estimate the spatial and temporal covariance matrices well in terms of predictive risk even when $p$ and $T$ are both large. We will investigate the following theoretical properties: (a) consistency and rates of convergence in the operator and Frobenius norms of the covariance matrices and their inverses, (b) large deviation results for covariance matrices of simultaneously correlated and non-identically distributed observations, and (c) conditions that guarantee smoothness of the covariances.

Besides the model (4), another well-studied option for modeling spatio-temporal covariances is to introduce structure via the Kronecker product of smaller symmetric positive definite matrices, i.e., $\Sigma = A \otimes B$. The Kronecker product model, however, is restrictive when applied to general spatio-temporal covariances, as it assumes the covariance is separable (disallowing such simple scenarios as the presence of additive noise) and does not allow for time-varying spatial structure. When used to estimate covariances that do not follow a Kronecker product structure, many estimators will respond to the model mismatch by giving ill-conditioned estimates (Greenewald & Hero, 2015).

Human neuroscience data is a notable application where time-varying structure emerges. In neuroscience, one must take into account temporal correlations as well as spatial correlations, which reflect the connectivity formed by the neural pathways. It is conceivable that the brain connectivity graph will change over a sufficiently long period of measurement. For example, as a child learns to associate symbols in the environment, certain pathways within the brain are reinforced. When they begin to associate images with words, the correlation between a particular sound, like "Mommy," and the sight of a face becomes stronger and forms a well-worn pathway. On the other hand, long-term non-use of connections between sensory and motor neurons can result in a loss of the pathway.

1.1 Datasets and Related Work

Estimating graphical models (connectomes) from fMRI data using sparse inverse covariance techniques has enjoyed wide application (Huang et al., 2010; Varoquaux et al., 2010; Narayan et al., 2015; Kim et al., 2015). However, research has only recently begun to explore observed phenomena such as temporal correlations and additive temporally correlated noise (Chen et al., 2015; Arbabshirani et al., 2014; Kim et al., 2015; Qiu et al., 2016), and time-varying dynamics and graphical models (connectomes) (Calhoun et al., 2014; Liu & Duyn, 2013; Chang & Glover, 2010; Chen et al., 2015).

We consider the ADHD-200 fMRI dataset (Biswal et al., 2010), and study resting state fMRIs for a variety of healthy patients in the dataset at different stages of development. Using our methods, we are able to directly estimate age-varying graphical models across brain regions, chronicling the development of brain structure throughout childhood.

Several models have emerged to generalize the Kronecker product model to allow it to model more realistic covariances while still maintaining many of the gains associated with Kronecker structure. Kronecker PCA, discussed in Tsiligkaridis & Hero (2013), approximates the covariance matrix using a sum of Kronecker products. An algorithm (Permuted Rank-penalized Least Squares (PRLS)) for fitting the KronPCA model to a measured sample covariance matrix was introduced in (Tsiligkaridis & Hero, 2013) and was shown to have strong high dimensional MSE performance guarantees. From a modeling perspective, the strengths of Kronecker PCA lie in its ability to handle “near separable” covariances and a variety of time-varying effects. While the Kronecker PCA model is very general, so far incorporation of sparsity in the inverse covariance has not been forthcoming. This motivates our introduction of the sparse model (4), which we demonstrate experimentally in Section 10 of the supplement to enjoy better statistical convergence.

Carvalho et al. (2007) proposed a Bayesian additive time-varying graphical model, where the spatially-correlated noise term is a parameter of the driving noise covariance in a temporal dynamical model. Unlike our method, they did not estimate the temporal correlation, instead requiring the dynamical model to be pre-set. Our proposed method has wholly independent spatial and temporal models, directly estimating an inverse covariance graphical model for the temporal relationships of the data. This allows for a much richer temporal model and increases its applicability.

In the context of fMRI, the work of Qiu et al. (2016) used a similar kernel-weighted estimator for the spatial covariance; however, they modeled the temporal covariance with a simple AR(1) model which they did not estimate, and their estimator did not attempt to remove it. Similarly, Monti et al. (2014) used a smoothed kernel estimator for the spatial covariance with an additional penalty to further promote smoothness, but did not model the temporal correlations. Our additive model allows direct estimation of the temporal behavior, revealing a richer structure than a simple AR(1), allowing for effective denoising of the data, and hence better estimation of the spatial graph structures.

2 The model and method

Let the elements of $A$ and $B(t)$ be denoted by $a_{st}$ and $b_{ij}(t)$, respectively. Similar to the setting in Zhou et al. (2010), we assume that $b_{ij}(t)$ is a smooth function of time $t$ for all $i, j$, and assume that $B^{-1}(t)$ is sparse. Furthermore, we suppose that $p \le T$, corresponding to there being more time points than spatial variables. For a random variable $x$, the subgaussian norm of $x$, $\|x\|_{\psi_2}$, is defined via $\|x\|_{\psi_2} = \sup_{q \ge 1} q^{-1/2} (\mathbb{E}|x|^q)^{1/q}$. Note that if $\|x\|_{\psi_2} \le K$, we also have $(\mathbb{E}|x|^q)^{1/q} \le K \sqrt{q}$ for all $q \ge 1$. Define a $p \times T$ random matrix $Z$ with independent, zero-mean entries $z_{ij}$ satisfying $\mathbb{E} z_{ij}^2 = 1$ and having subgaussian norm $\|z_{ij}\|_{\psi_2} \le K$. Matrices $Z_1, Z_2$ denote independent copies of $Z$. We now write an additive generative model for subgaussian data having the covariance given in (4). Let

$X = Z_1 A^{1/2} + \big[ B^{1/2}(1/T)\, Z_2\, e_1, \;\dots,\; B^{1/2}(T/T)\, Z_2\, e_T \big],$   (5)

where $e_1, \dots, e_T$ are the $T$-dimensional standard basis vectors (so that $Z_2 e_t$ is the $t$-th column of $Z_2$). Then the covariance

$\mathrm{Cov}[\mathrm{vec}(X)] = A \otimes I_p + \sum_{t=1}^{T} (e_t e_t^\top) \otimes B(t/T).$

Thus (5) is a generative model for data following the covariance model (4).
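As a sanity check on (5), the sketch below draws Gaussian samples from the generative model and compares the Monte Carlo covariance of $\mathrm{vec}(X)$ with (4). All names and parameter choices are ours; we use Gaussian $Z$ for simplicity, although the model only requires subgaussian entries.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
p, T, n_rep = 5, 8, 20000

# Hypothetical parameters: AR(1) temporal covariance A and smoothly
# varying identity-plus-rank-one spatial covariances B(t/T).
A = 0.5 ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
B_list = [np.eye(p) + 0.3 * (t / T) * np.ones((p, p)) for t in range(1, T + 1)]

A_half = np.real(sqrtm(A))
B_half = [np.real(sqrtm(B)) for B in B_list]

# Monte Carlo estimate of Cov[vec(X)] under the generative model (5).
S = np.zeros((p * T, p * T))
for _ in range(n_rep):
    Z1, Z2 = rng.standard_normal((p, T)), rng.standard_normal((p, T))
    X = Z1 @ A_half + np.column_stack([B_half[t] @ Z2[:, t] for t in range(T)])
    v = X.T.reshape(-1)          # vec(X): columns of X stacked
    S += np.outer(v, v) / n_rep

# Target covariance (4): A kron I_p plus block-diagonal B(t/T).
Sigma = np.kron(A, np.eye(p))
for t in range(T):
    Sigma[t * p:(t + 1) * p, t * p:(t + 1) * p] += B_list[t]

print(np.max(np.abs(S - Sigma)))  # small, up to Monte Carlo error
```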

2.1 Estimators

As in Rudelson & Zhou (2017), we can exploit the large-$p$ convergence of $X^\top X / p$ to project out the $A$ part and create an estimator for the spatial covariances. As $B(t)$ is time-varying, we use a weighted average across time to create local estimators of the spatial covariance matrix $B(t)$.

It is often assumed that knowledge of the trace of one of the factors is available a priori. For example, the spatial signal variance may be known and time-invariant, corresponding to $\mathrm{tr}(B)$ being known. Alternatively, the temporal component variance may be constant and known, corresponding to $\mathrm{tr}(A)$ being known. In our analysis below, we suppose that $\mathrm{tr}(A)$ is known or otherwise estimated (similar results hold when $\mathrm{tr}(B)$ is known). For simplicity in stating the trace estimators, in what follows we suppose that $\mathrm{tr}(B(t))$ is constant, and without loss of generality that the data has been normalized such that the diagonal elements $a_{tt}$ of $A$ are constant over $t$.

As $B(t)$ is smoothly varying over time, the estimate at a time of interest $t_0$ should depend strongly on the time samples close to $t_0$, and less on the samples further from $t_0$. For any $t_0 \in [0,1]$, we thus construct a weighted estimator using a weight vector $w(t_0) = (w_1(t_0), \dots, w_T(t_0))$ such that $\sum_{t=1}^{T} w_t(t_0) = 1$. Our weighted, unstructured sample-based estimator for $B(t_0)$ is then given by

$\hat{S}_B(t_0) = \sum_{t=1}^{T} w_t(t_0)\, x_t x_t^\top - \frac{\mathrm{tr}(A)}{T} I_p,$   (6)

and we consider the class of weight vectors arising from a symmetric nonnegative kernel function $K(\cdot)$ with compact support and bandwidth determined by a parameter $h$, i.e., $w_t(t_0) \propto K\big((t/T - t_0)/h\big)$. A list of minor regularity assumptions on $K$ is given in the supplement. For kernels such as the Gaussian kernel, this will result in samples close to $t_0$ being highly weighted, with the "weight decay" away from $t_0$ scaling with the bandwidth $h$. A wide bandwidth will be appropriate for slowly varying $B(t)$, and a narrow bandwidth for quickly varying $B(t)$.
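A minimal sketch of the kernel weights and the estimator (6), under our reconstruction of the notation above; the Gaussian kernel is one admissible choice, and the function names are ours.

```python
import numpy as np

def kernel_weights(T, t0, h):
    """Normalized Gaussian-kernel weights w_t(t0) on the grid t/T, t = 1..T."""
    u = (np.arange(1, T + 1) / T - t0) / h
    w = np.exp(-0.5 * u ** 2)
    return w / w.sum()

def S_B_hat(X, t0, h, tr_A):
    """Weighted spatial covariance estimator (6) at time t0.

    X    : (p, T) data matrix with columns x_t.
    tr_A : assumed-known trace of the temporal factor A.
    """
    p, T = X.shape
    w = kernel_weights(T, t0, h)
    S = (X * w) @ X.T            # sum_t w_t x_t x_t^T
    return S - (tr_A / T) * np.eye(p)
```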

To enforce sparsity in the estimator for $B^{-1}(t_0)$, we substitute $\hat{S}_B(t_0)$ into the widely used GLasso objective function, resulting in a penalized estimator for $\Omega_B(t_0) = B^{-1}(t_0)$ with regularization parameter $\lambda_B$:

$\hat{\Omega}_B(t_0) = \arg\min_{\Omega \succ 0} \; \mathrm{tr}\big(\hat{S}_B(t_0)\,\Omega\big) - \log|\Omega| + \lambda_B |\Omega|_{1,\mathrm{off}}.$   (7)

For a matrix $M = (m_{ij})$, we let $|M|_{1,\mathrm{off}} = \sum_{i \ne j} |m_{ij}|$. Increasing the parameter $\lambda_B$ gives an increasingly sparse $\hat{\Omega}_B(t_0)$. Having formed an estimator for $B(t_0)$, we can now form a similar estimator for $A$. Under the constant-trace assumption, we construct an estimator for $\mathrm{tr}(B)$:

$\widehat{\mathrm{tr}}(B) = \frac{1}{T} \sum_{t=1}^{T} \|x_t\|_2^2 - \frac{p}{T}\,\mathrm{tr}(A).$   (8)

For a time-varying trace $\mathrm{tr}(B(t))$, use the time-averaged kernel estimator

$\widehat{\mathrm{tr}}(B(t_0)) = \sum_{t=1}^{T} w_t(t_0) \Big( \|x_t\|_2^2 - \frac{p}{T}\,\mathrm{tr}(A) \Big).$   (9)

In the future we will derive rates for the time-varying case by choosing an optimal bandwidth. These estimators allow us to construct a sample covariance matrix for $A$:

$\hat{S}_A = \frac{1}{p} X^\top X - \frac{\widehat{\mathrm{tr}}(B)}{p} I_T.$   (10)
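Under our reconstruction of (8) and (10), the temporal estimator is a one-liner; the helper below is our own illustration.

```python
import numpy as np

def S_A_hat(X, tr_A):
    """Temporal sample covariance (10) with the spatial trace projected out.

    Uses the constant-trace estimator (8):
        tr_hat(B) = (1/T) sum_t ||x_t||^2 - (p/T) tr(A),
    then  S_A = X^T X / p - (tr_hat(B) / p) I_T.
    """
    p, T = X.shape
    tr_B_hat = (X ** 2).sum() / T - (p / T) * tr_A
    return X.T @ X / p - (tr_B_hat / p) * np.eye(T)
```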

We (similarly to the $B$ part) apply the GLasso approach to $\hat{S}_A$. Note that with $p < T$, $\hat{S}_A$ has negative eigenvalues, since $X^\top X$ has rank at most $p < T$. We obtain a positive semidefinite matrix as

$\tilde{S}_A = \arg\min_{S \succeq 0} \; |S - \hat{S}_A|_\infty,$   (11)

where $|M|_\infty = \max_{i,j} |m_{ij}|$. We use the alternating direction method of multipliers (ADMM) to solve (11), as in Boyd et al. (2011), and prove that this retains a tight elementwise error bound. Note that while we chose this method of obtaining a positive semidefinite $\tilde{S}_A$ for its simplicity, other projections are possible; the exact method is not critical to our overall Kronecker sum approach. In fact, if the GLasso is not used, the projection (11) is unnecessary, as the elementwise bounds also hold for $\hat{S}_A$.
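The paper states that (11) is solved with ADMM as in Boyd et al. (2011) but does not spell out the splitting. Below is one plausible splitting, as our own sketch: alternate a PSD projection with the proximal operator of the elementwise max norm, the latter computed via Moreau's identity and projection onto the $\ell_1$ ball. The function names and penalty parameter rho are illustrative assumptions.

```python
import numpy as np

def proj_l1_ball(v, z=1.0):
    """Euclidean projection of a vector v onto the l1 ball of radius z."""
    if np.abs(v).sum() <= z:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(u) + 1) > css - z)[0][-1]
    theta = (css[k] - z) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def proj_psd(M):
    """Projection onto the PSD cone (in Frobenius norm)."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.maximum(w, 0)) @ V.T

def psd_project_linf(S_hat, rho=1.0, n_iter=500):
    """ADMM sketch for (11): min_{S >= 0} max_ij |S_ij - S_hat_ij|.

    Splitting: S = S_hat + R, with f(S) = indicator(S PSD) and
    g(R) = |R|_inf (elementwise max norm).
    """
    T = S_hat.shape[0]
    R = np.zeros((T, T))
    U = np.zeros((T, T))
    for _ in range(n_iter):
        S = proj_psd(S_hat + R - U)           # S-update: PSD projection
        V = (S - S_hat + U).ravel()
        # prox of (1/rho)|.|_inf via Moreau: v - (1/rho) P_l1(rho v)
        R = (V - proj_l1_ball(rho * V) / rho).reshape(T, T)
        U = U + S - S_hat - R                 # scaled dual update
    return proj_psd(S_hat + R)                # return a feasible (PSD) point
```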

We provide a regularized estimator for the inverse of $A$ via its correlation matrix, using the positive semidefinite $\tilde{S}_A$ as the input to the GLasso problem

$\hat{\Theta}_A = \arg\min_{\Theta \succ 0} \; \mathrm{tr}\big(\hat{\Gamma}_A \Theta\big) - \log|\Theta| + \lambda_A |\Theta|_{1,\mathrm{off}},$   (12)

where $\hat{\Gamma}_A = \hat{W}^{-1} \tilde{S}_A \hat{W}^{-1}$ with $\hat{W}^2 = \mathrm{diag}(\tilde{S}_A)$, $\lambda_A$ is a regularization parameter, and $|\cdot|_{1,\mathrm{off}}$ is the L1 norm on the off-diagonal.

We then form the estimate for $A^{-1}$ as $\hat{\Omega}_A = \hat{W}^{-1} \hat{\Theta}_A \hat{W}^{-1}$. Observe that our method has three tuning parameters, or two if $\mathrm{tr}(A)$ is known or can be estimated. If $\mathrm{tr}(A)$ is not known, we present several methods to choose it in Section 7.1 in the supplement. Once $\mathrm{tr}(A)$ is chosen, the estimators (7) and (12) for $B^{-1}(t)$ and $A^{-1}$ respectively do not depend on each other, allowing $\lambda_B$ and $\lambda_A$ to be tuned independently.
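In practice, (7) and (12) are standard graphical-lasso problems once the inputs $\hat{S}_B(t_0)$ and $\hat{\Gamma}_A$ are formed. scikit-learn's graphical_lasso penalizes only the off-diagonal precision entries, matching $|\cdot|_{1,\mathrm{off}}$; the wrapper below is our own illustration of the pipeline, not the authors' code, and the small ridge eps is an assumption added for numerical stability.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def glasso_precision(S, lam, eps=1e-4):
    """Solve tr(S W) - log|W| + lam * |W|_{1,off} over W > 0.

    graphical_lasso requires a (near) PSD input, which is why the
    indefinite S_A_hat must first pass through the projection (11).
    """
    S = S + eps * np.eye(S.shape[0])
    _, precision = graphical_lasso(S, alpha=lam)
    return precision

# Pipeline sketch (S_B_hat, psd_project_linf as defined earlier):
# Omega_B = glasso_precision(S_B_hat(X, t0=0.5, h=0.1, tr_A=tr_A), lam_B)
# S_A_tilde = psd_project_linf(S_A_hat(X, tr_A))
# W = np.sqrt(np.diag(S_A_tilde))
# Theta_A = glasso_precision(S_A_tilde / np.outer(W, W), lam_A)
# Omega_A = Theta_A / np.outer(W, W)   # W^{-1} Theta W^{-1}
```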

3 Statistical convergence

We first bound the estimation error for the time-varying $\hat{S}_B(t_0)$. Since $\hat{S}_B(t_0)$ is based on a kernel-smoothed sample covariance, it is a biased estimator, with the bias depending on the kernel width $h$ and the smoothness of $B(t)$. In Section 12.1 of the supplement, we derive the bias and variance of $\hat{S}_B(t_0)$, using arguments from kernel smoothing and subgaussian concentration, respectively.

In the following results, we assume that the eigenvalues of the matrices $A$ and $B(t)$ are bounded:

Assumption 1: There exist positive constants $\underline{k}, \overline{k}$ such that $\underline{k} \le \lambda_{\min}(A) \le \lambda_{\max}(A) \le \overline{k}$ and $\underline{k} \le \lambda_{\min}(B(t)) \le \lambda_{\max}(B(t)) \le \overline{k}$ for all $t \in [0,1]$.

Assumption 2: $B(t)$ has entries with bounded second derivatives on $[0,1]$.

Putting the bounds on the bias and variance together and optimizing over the rate of the bandwidth $h$, we obtain the following theorem, which we prove in the supplementary material.

Theorem 1.

Suppose that Assumptions 1 and 2 hold and that the bandwidth is chosen as $h \asymp (\log(p \vee T)/T)^{1/5}$. Then with probability at least $1 - c/(p \vee T)^2$, $\hat{S}_B(t_0)$ is positive definite and, for some constants $C, c > 0$,

$\max_{i,j} \big| \hat{S}_B(t_0) - B(t_0) \big|_{ij} \le C \Big( \frac{\log (p \vee T)}{T} \Big)^{2/5}.$

This result confirms that the temporal samples selected by the kernel act as replicates for estimating $B(t_0)$. We can now substitute this elementwise bound on $\hat{S}_B(t_0)$ into the GLasso proof, obtaining the following theorem, which demonstrates that $\hat{\Omega}_B(t_0)$ successfully exploits sparsity in $B^{-1}(t_0)$.

Theorem 2.

Suppose the conditions of Theorem 1 hold and that $B^{-1}(t)$ has at most $s_B$ nonzero off-diagonal elements for all $t$. If $\lambda_B$ is chosen of the order of the elementwise bound in Theorem 1, then the GLasso estimator (7) satisfies

$\big\| \hat{\Omega}_B(t_0) - B^{-1}(t_0) \big\|_F = O_P\bigg( \sqrt{s_B + p}\; \Big( \frac{\log (p \vee T)}{T} \Big)^{2/5} \bigg).$

Observe that this single-sample bound converges whenever the $A$-part dimensionality $T$ grows sufficiently fast relative to $p$ and $s_B$. The proof follows from the concentration bound of Theorem 1 using the arguments in Zhou et al. (2010), Zhou et al. (2011), and Rothman et al. (2008). Note that the bound goes to zero as $T$ increases, in accordance with the standard bias/variance tradeoff.

We now turn to the estimator for the $A$ part. As it does not involve kernel smoothing, we simply need to bound the variance. We have the following bound on the error of $\hat{S}_A$:

Theorem 3.

Suppose Assumption 1 holds. Then

$\max_{s,t} \big| \hat{S}_A - A \big|_{st} \le C \sqrt{\frac{\log (p \vee T)}{p}}$

with probability at least $1 - c/(p \vee T)^2$ for some constants $C, c > 0$.

Recall that we have assumed that $p \le T$, so this probability converges to 1 with increasing $p$ or $T$. While $\hat{S}_A$ is not positive semidefinite, the triangle inequality implies a bound on the positive semidefinite projection (11): since $A$ itself is feasible for (11),

$\big| \tilde{S}_A - A \big|_\infty \le \big| \tilde{S}_A - \hat{S}_A \big|_\infty + \big| \hat{S}_A - A \big|_\infty \le 2 \big| \hat{S}_A - A \big|_\infty.$   (13)

Thus, similarly to the earlier result for $\hat{\Omega}_B$, the estimator (12) formed by substituting the positive semidefinite $\tilde{S}_A$ into the GLasso objective enjoys the following error bound (Zhou et al., 2011).

Theorem 4.

Suppose the conditions of Theorem 3 hold and that $A^{-1}$ has at most $s_A$ nonzero off-diagonal elements. If $\lambda_A$ is chosen of the order of the bound in (13), then the GLasso estimator (12) satisfies

$\frac{1}{\sqrt{T}} \big\| \hat{\Omega}_A - A^{-1} \big\|_F = O_P\bigg( \sqrt{\Big(1 + \frac{s_A}{T}\Big) \frac{\log (p \vee T)}{p}} \bigg).$

Observe that this single-sample bound converges whenever the $B$-part dimensionality $p$ grows, provided the sparsity satisfies $s_A = O(T)$. For relaxation of this stringent sparsity assumption, one can use other assumptions; see for example Theorem 3.3 in Zhou (2014).

4 Simulation study

We generated a time-varying sequence of spatial covariances according to the method of Zhou et al. (2010), which follows a type of Erdős-Rényi random graph model. Initially we set $\Omega(0) = B^{-1}(0) = \delta I_p$ for a constant $\delta > 0$. Then, we randomly select edges and update $\Omega$ as follows: for each new edge $(i, j)$, a weight $w > 0$ is chosen uniformly at random from a fixed interval; we subtract $w$ from $\omega_{ij}$ and $\omega_{ji}$, and increase $\omega_{ii}$ and $\omega_{jj}$ by $w$. This keeps $B^{-1}(t)$ positive definite. When we later delete an existing edge from the graph, we reverse the above procedure.

We change the graph structure at a fixed set of change points $t_1 < t_2 < \dots$ in $[0,1]$ as follows. At each $t_k$, five existing edges are deleted, and five new edges are added. For each of the five new edges, a target weight is chosen. Linear interpolation of the edge weights between the change points is used to smoothly add the new edges and gradually delete the ones marked for deletion. Thus, almost always, there are 105 edges in the graph, and 10 edges have weights that are varying smoothly (Figure 1).
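A minimal sketch of this construction (our own code, with an assumed initial diagonal $\delta$ and weight range) applies the diagonally dominant update described above:

```python
import numpy as np

def add_edge(Omega, i, j, w):
    """Add edge (i, j) with weight w, preserving positive definiteness
    by diagonal compensation; pass -w to reverse (delete) the edge."""
    Omega[i, j] -= w
    Omega[j, i] -= w
    Omega[i, i] += w
    Omega[j, j] += w

rng = np.random.default_rng(1)
p, n_edges = 20, 30
delta, w_lo, w_hi = 1.0, 0.1, 0.3            # assumed values for illustration

Omega = delta * np.eye(p)                    # Omega = B^{-1}(0)
for _ in range(n_edges):
    i, j = rng.choice(p, size=2, replace=False)
    add_edge(Omega, i, j, rng.uniform(w_lo, w_hi))

assert np.linalg.eigvalsh(Omega).min() > 0   # remains positive definite
B = np.linalg.inv(Omega)                     # spatial covariance
```

Each update adds the PSD matrix $w (e_i - e_j)(e_i - e_j)^\top$ to $\Omega$, which is why positive definiteness is preserved.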

Figure 1: Example sequence of Erdős-Rényi graphs. At each time point, the 100 edges connecting nodes are shown. Changes are indicated by red and green edges: red edges indicate edges that will be deleted in the next increment, and green indicates new edges.

In the first set of experiments, we consider $B(t)$ generated from the ER time-varying graph procedure and an AR(1) temporal covariance $A$. The magnitudes of the two factors are balanced. We fix $p$ and vary $T$ from 200 to 2400. For each $(p, T)$ pair, we vary the regularization parameter $\lambda_B$, estimating $B(t)$ on a grid of time points for each setting. We evaluate performance using the mean relative Frobenius estimation error, the mean relative L2 estimation error, and the Matthews correlation coefficient (MCC).

The MCC quantifies edge-support estimation performance and is defined as follows. Let the number of true positive edge detections be TP, true negatives TN, false positives FP, and false negatives FN. The Matthews correlation coefficient is defined as

$\mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}.$

Increasing values of MCC imply better edge estimation performance, with $\mathrm{MCC} = 0$ implying complete failure and $\mathrm{MCC} = 1$ implying perfect edge set estimation.
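The following helper (ours, not the authors') computes MCC from the off-diagonal supports of an estimated and a true precision matrix:

```python
import numpy as np

def mcc(Omega_hat, Omega_true, tol=1e-8):
    """Matthews correlation coefficient of off-diagonal support recovery."""
    off = ~np.eye(Omega_true.shape[0], dtype=bool)
    est = np.abs(Omega_hat[off]) > tol
    tru = np.abs(Omega_true[off]) > tol
    tp = np.sum(est & tru)
    tn = np.sum(~est & ~tru)
    fp = np.sum(est & ~tru)
    fn = np.sum(~est & tru)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0
```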

Results are shown in Figure 2 for three configurations, varying the number of edges in $B^{-1}$ and the AR parameter of $A$. As predicted by the theory, increasing $T$ improves performance and increasing $p$ decreases performance. Increasing the number of edges in $B^{-1}$ changes the optimal $\lambda_B$, as expected. Figure 3 shows performance results for the penalized estimator of $A^{-1}$ using MCC, Frobenius error, and L2 error, where $A$ follows an AR(1) model and $B$ follows a random ER model. Note that the MCC, Frobenius, and spectral norm errors all improve with larger $p$. In the supplement (Section 11), we repeat these experiments using alternate random graph topologies, with similar results.

Figure 2: MCC, Frobenius, and L2 norm error curves for a random ER graph $B$. Top: $A$ is an AR covariance and $B^{-1}$ has 50 edges. Middle: $A$ is an AR(1) covariance and $B^{-1}$ has 100 edges. Bottom: $A$ is an AR covariance (second parameter setting) and $B^{-1}$ has 100 edges.
Figure 3: MCC, Frobenius, and L2 norm error curves for an AR(1) $A$ when $B$ is a random ER graph. From top to bottom: two $(p, T)$ settings.

5 fMRI Application

The ADHD-200 resting-state fMRI dataset (Biswal et al., 2010) was collected from 973 subjects, 197 of whom were diagnosed with ADHD types 1, 2, or 3. The fMRI images have varying numbers of voxels, which we divide into 90 regions of interest for graphical model analysis (Wehbe et al., 2014); between 76 and 276 images exist for each subject. Provided covariates for the subjects include age, gender, handedness, and IQ. Previous works such as Qiu et al. (2016) used this dataset to establish that brain network density increases with age, corresponding to brain development as subjects mature. We revisit this problem using our additive approach. Our additive model allows direct estimation of the temporal behavior, revealing a richer structure than a simple AR(1), allowing for effective denoising of the data, and better estimation of the spatial graph structure.

We estimate the temporal covariances for each subject using the voxels contained in the regions of interest, with example results shown in Figure 5 in the supplement. We choose $\widehat{\mathrm{tr}}(B)/p$ as the lower limit of the eigenvalues of $X^\top X / p$, as in the high-sample regime this limit is an upper bound on $\mathrm{tr}(B)/p$.

We then estimate the brain connectivity network at a range of ages from 8 to 18, using both our proposed method and the method of Monti et al. (2014), the latter serving as an optimally penalized version of the estimator in Qiu et al. (2016). We use a Gaussian kernel over subject age with bandwidth $h$, and estimate the graphs using a variety of values of $\lambda$ and $h$. Subjects with fewer than 120 time samples were eliminated, and those with more were truncated to 120 to reduce bias towards longer scans. The number of edges in the estimated graphs is shown in Figure 4. Note the consistent increase in network density with age, becoming smoother with increasing $h$.

(a) Non-additive method of Monti et al. (2014) (optimally penalized version of Qiu et al. (2016)).
(b) Our proposed additive method, allowing for denoising of the time-correlated data.
Figure 4: Number of edges in the estimated graphical models across 90 brain regions as a function of age. Shown are results using three different values of the regularization parameter $\lambda$; from left to right, the kernel bandwidth parameter used is $h$ = 1.5, 2, and 3. Note the consistently increasing edge density in our estimate, corresponding to predictions of increased brain connectivity as the brain develops, leveling off in the late teenage years. Compare this to the method of Monti et al. (2014), which successfully detects the trend in the years 11-14, but fails for other ages.

6 Conclusion

In this work, we presented estimators for time-varying graphical models in the presence of time-correlated signals and noise. We revealed a bias-variance tradeoff scaling with the underlying rate of change, and proved strong single sample convergence results in high dimensions. We applied our methodology to an fMRI dataset, discovering meaningful temporal changes in functional connectivity, consistent with scientifically expected childhood growth and development.

Acknowledgement

This work was supported in part by NSF under Grant DMS-1316731, Elizabeth Caroline Crosby Research Award from the Advance Program at the University of Michigan, and by AFOSR grant FA9550-13-1-0043.

References

  • Arbabshirani et al. (2014) Arbabshirani, M., Damaraju, E., Phlypo, R., Plis, S., Allen, E., Ma, S., Mathalon, D., Preda, A., Vaidya, J., and Adali, T. Impact of autocorrelation on functional connectivity. Neuroimage, 102:294–308, 2014.
  • Biswal et al. (2010) Biswal, B., Mennes, M., Zuo, X., Gohel, S., Kelly, C., Smith, S., Beckmann, C., Adelstein, J., Buckner, R., and Colcombe, S. Toward discovery science of human brain function. Proceedings of the National Academy of Sciences, 107(10):4734–4739, 2010.
  • Boyd et al. (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • Calhoun et al. (2014) Calhoun, V., Miller, R., Pearlson, G., and Adalı, T. The chronnectome: time-varying connectivity networks as the next frontier in fMRI data discovery. Neuron, 84(2):262–274, 2014.
  • Carvalho et al. (2007) Carvalho, C., West, M., et al. Dynamic matrix-variate graphical models. Bayesian analysis, 2(1):69–97, 2007.
  • Chang & Glover (2010) Chang, C. and Glover, G. Time–frequency dynamics of resting-state brain connectivity measured with fMRI. Neuroimage, 50(1):81–98, 2010.
  • Chen et al. (2015) Chen, S., Liu, K., Yang, Y., Xu, Y., Lee, S., Lindquist, M., Caffo, B., and Vogelstein, J. An m-estimator for reduced-rank high-dimensional linear dynamical system identification. arXiv:1509.03927, 2015.
  • Cressie (2015) Cressie, N. Statistics for spatial data. John Wiley & Sons, 2015.
  • Greenewald & Hero (2015) Greenewald, K. and Hero, A. Robust Kronecker product PCA for spatio-temporal covariance estimation. IEEE Transactions on Signal Processing, 63(23):6368–6378, Dec 2015.
  • Huang et al. (2010) Huang, S., Li, J., Sun, L., Ye, J., Fleisher, A., Wu, T., Chen, K., and Reiman, E. Learning brain connectivity of Alzheimer's disease by sparse inverse covariance estimation. NeuroImage, 50(3):935–949, 2010.
  • Kim et al. (2015) Kim, J., Pan, W., Alzheimer's Disease Neuroimaging Initiative, et al. Highly adaptive tests for group differences in brain functional connectivity. NeuroImage: Clinical, 9:625–639, 2015.
  • Liu & Duyn (2013) Liu, X. and Duyn, J. Time-varying functional network information extracted from brief instances of spontaneous brain activity. Proceedings of the National Academy of Sciences, 110(11):4392–4397, 2013.
  • Monti et al. (2014) Monti, R., Hellyer, P., Sharp, D., Leech, R., Anagnostopoulos, C., and Montana, G. Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage, 103:427–443, 2014.
  • Narayan et al. (2015) Narayan, M., Allen, G., and Tomson, S. Two sample inference for populations of graphical models with applications to functional connectivity. arXiv preprint arXiv:1502.03853, 2015.
  • Qiu et al. (2016) Qiu, H., Han, F., Liu, H., and Caffo, B. Joint estimation of multiple graphical models from high dimensional time series. Journal of the Royal Statistical Society: Series B, 78(2):487–504, 2016.
  • Rothman et al. (2008) Rothman, A., Bickel, P., Levina, E., Zhu, J., et al. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
  • Rudelson & Zhou (2017) Rudelson, M. and Zhou, S. Errors-in-variables models with dependent measurements. The Electronic Journal of Statistics, 11(1):1699–1797, 2017.
  • Tsiligkaridis & Hero (2013) Tsiligkaridis, T. and Hero, A. Covariance estimation in high dimensions via Kronecker product expansions. IEEE Transactions on Signal Processing, 61(21):5347–5360, 2013.
  • Varoquaux et al. (2010) Varoquaux, G., Gramfort, A., Poline, J-B., and Thirion, B. Brain covariance selection: better individual functional connectivity models using population prior. Advances in Neural Information Processing Systems, 23:2334–2342, 2010.
  • Wehbe et al. (2014) Wehbe, L., Murphy, B., Talukdar, P., Fyshe, A., Ramdas, A., and Mitchell, T. Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PLOS ONE, 9(11):e112575, 2014.
  • Zhou (2014) Zhou, S. Gemini: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532–562, 2014.
  • Zhou et al. (2010) Zhou, S., Lafferty, J., and Wasserman, L. Time varying undirected graphs. Machine Learning, 80(2-3):295–319, 2010.
  • Zhou et al. (2011) Zhou, S., Rütimann, P., Xu, M., and Bühlmann, P. High-dimensional covariance estimation based on gaussian graphical models. The Journal of Machine Learning Research, 12:2975–3026, 2011.

7 Estimating covariance for fMRI data

We estimate the temporal covariance for each subject using the voxels contained in the regions of interest. The results for several example subjects are shown in Figure 5, along with the corresponding sample covariance and its eigenvalues. We choose $\widehat{\mathrm{tr}}(B)/p$ as the lower "asymptote" of the eigenvalues of $X^\top X / p$, as in the high-sample regime this asymptote is an upper bound on $\mathrm{tr}(B)/p$.

The sample covariance matrix is significantly diagonally dominant, supporting our subtraction of a scaled identity matrix. Note the evident sparsity in the inverse. Many of the sparsity patterns indicate local AR-type behavior as assumed in Qiu et al. (2016), but this pattern is not stationary, and in fact tends to group in blocks. Hence, the assumptions in Qiu et al. (2016) do not fully capture the richness of the data.

Figure 5: Estimated temporal covariance for three example subjects. Left: temporal sample covariance. Center: eigenvalues of the sample covariance, showing the estimate of the trace factor. Right: estimated graphical model. Note the sparsity in the inverse, and that the eigenvalue spectra are consistent with the additive model.

7.1 Choosing $\mathrm{tr}(A)$

Due to the nonidentifiability of $A$ and $B$ when a single copy of the data is observed, we assume that the parameter $\mathrm{tr}(A)$ is either known (Rudelson & Zhou, 2017) or chosen as a tuning parameter. If $A$ corresponds to a "signal" component in a physical system, then the mean signal strength may be known by design, or it may be estimated directly by running a calibration experiment in which the "noise" component, involving the covariance parameter $B$, is held at a known level for a certain period of time.

A first relaxation is to assume that only one entry of $A$ is known, for example $a_{11}$. This is feasible when we can observe the corresponding component for a short period of time, so as to obtain knowledge of a single element of $A$. It is our conjecture that there is an interesting tradeoff between the number of covariates we are allowed to observe without measurement error and the rate of convergence we can obtain for estimating $A$ and $B$. Moreover, even when no covariate is observed directly, we can rely on recent progress in high-dimensional regression and signal reconstruction to help establish theoretical limits on recovering $A$ and $B$ when replicates are available. For example, if a second sample of either additive component (cf. Equation (5)) is available, the corresponding $A$ or $B(t)$ can be estimated directly without needing to specify the traces. We leave these as future directions.

We now use two experimental settings to illustrate how robust the estimation procedure for the spatial covariance $B$ is with respect to misspecification of $\mathrm{tr}(A)$. We consider the case in which $A$ is generated using the AR(1) or Star-Block model, while $B$ follows a random ER graph. We estimate the inverse of $B$ using misspecified values of $\mathrm{tr}(A)$, and use MCC to measure the performance of edge selection.

As shown in Figures 6-7, we observe that when the topology is sophisticated (e.g., Star-Block) and the misspecification error of $\mathrm{tr}(A)$ does not approach 0, joint tuning of $\lambda$ and $h$ cannot resolve the edge selection inconsistency, as expected, although it does appear to give certain improvements.

When the topology is relatively simple, e.g., the chain graph corresponding to the AR(1) model, the edge selection performance is robust to the misspecification of $\mathrm{tr}(A)$; in this case, joint tuning of $\lambda$ and $h$ is not even necessary, as illustrated in Figures 6-7.


Figure 6: MCC of the estimators with respect to the true $B$ as a function of the assumed $\mathrm{tr}(A)$, for each of several tuning parameter values. From left to right: $A$ follows the AR(1) and Star-Block models.

Figure 7: MCC of the estimators with respect to the true $B$ as a function of the assumed $\mathrm{tr}(A)$, for each of several tuning parameter values in a second experimental setting. From left to right: $A$ follows the AR(1) and Star-Block models.

8 Additional analysis of ADHD-200 data

In addition to the plots showing the relationship between brain connectivity and age in the main text, in this section we reproduce those plots using only healthy subjects and, separately, only ADHD subjects. The results are shown in Figure 8.

Observe that the leftmost plots (those with the narrowest kernel) are quite rough. This is caused by the reduction in the number of available subjects compared to the plots in the main text that used all subjects, increasing the effect of noise. Shown in Figure 9 are the age histograms for all subjects, healthy subjects only, and ADHD subjects. Note that the nonsmooth age ranges in the plots of Figure 8 correspond to regions with fewer available subjects, as expected.

While the sample size is such that we cannot make definitive conclusions on the exact form of the differences between healthy and ADHD brain development, the fact that we observe significant differences is not surprising given the nature of ADHD and its effects on childhood development. In particular, note the lower rate of development among teenage ADHD subjects as opposed to healthy teenage subjects. We hypothesize that this corresponds to the common observation that ADHD patients face additional developmental challenges in the teenage years.

(a) Our proposed additive method, trained on healthy subjects only.
(b) Our proposed additive method, trained on ADHD subjects only.
Figure 8: Number of edges in the estimated graphical models across 90 brain regions as a function of age. Shown are results using three different values of the regularization parameter $\lambda$; from left to right, the kernel bandwidth parameter used is $h$ = 1.5, 2, and 3 for both methods. Note the consistently increasing edge density in our estimate, corresponding to predictions of increased brain connectivity as the brain develops, and the difference in teenage development rates between healthy and ADHD patients.
(a) Histogram of the ages of all subjects in the ADHD-200 dataset.
(b) Histogram of the ages of healthy subjects in the ADHD-200 dataset.
(c) Histogram of the ages of subjects diagnosed with ADHD in the ADHD-200 dataset.
Figure 9: Histograms showing the age distributions of all patients, healthy patients, and ADHD-diagnosed patients in the ADHD-200 dataset. A set of subjects in the dataset have no diagnosis; these are left out of both the healthy and ADHD groups.

9 Technical assumptions

9.1 Assumptions

In this section, we repeat the assumptions that we required in the main text.

Assumption A1

There exist positive constants $\underline{k}, \overline{k}$ such that $\underline{k} \le \lambda_{\min}(A) \le \lambda_{\max}(A) \le \overline{k}$.

Assumption A2

The diagonal elements $a_{tt}$ of $A$ are constant across all $t$, and the trace $\mathrm{tr}(B(t))$ is constant over time.

Assumption A3

$A^{-1}$ has at most $s_A$ nonzero off-diagonal elements.

Assumption B1

$B(t)$ is a symmetric and positive definite matrix for all $t \in [0,1]$, and its entries have bounded second derivatives on $[0,1]$.

Assumption B2

There exist positive constants $\underline{k}, \overline{k}$ such that $\underline{k} \le \lambda_{\min}(B(t)) \le \lambda_{\max}(B(t)) \le \overline{k}$ for all $t \in [0,1]$.

Assumption B3

$B^{-1}(t)$ has at most $s_B$ nonzero off-diagonal elements.

Note that the two conditions in Assumption A2 can be relaxed; we include them only to make the statement of the estimators easier. In the non-normalized cases, the appropriate modifications are easy to derive.

We make the following assumptions on the smooth kernel function $K(\cdot)$ used to create the weight vectors $w(t_0)$:

Assumption K1

$K(\cdot)$ is non-negative, symmetric, twice differentiable, and has compact support $[-1, 1]$.

Assumption K2

$\int_{-1}^{1} K(u)\, du = 1$.

Assumption K3

$\int_{-1}^{1} u\, K(u)\, du = 0$.

Assumption K4

$\int_{-1}^{1} u^2 K(u)\, du < \infty$.

Assumption K5

$\sup_{u} K(u) \le M < \infty$.

These are satisfied for most common smooth kernel functions, including the Gaussian kernel (Zhou et al., 2010).
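Under our reconstruction of K2-K5 above (normalization, zero first moment, finite second moment, boundedness), the conditions are easy to verify numerically for a candidate kernel. The quick check below (ours) uses the triweight kernel, which is smooth up to its second derivative at the support boundary, as an example.

```python
import numpy as np

def triweight(u):
    """Triweight kernel, smooth and supported on [-1, 1] (K1)."""
    return np.where(np.abs(u) <= 1, (35.0 / 32.0) * (1 - u ** 2) ** 3, 0.0)

u = np.linspace(-1, 1, 200001)
du = u[1] - u[0]
K = triweight(u)

print((K * du).sum())           # K2: integrates to 1
print((u * K * du).sum())       # K3: zero first moment (by symmetry)
print((u ** 2 * K * du).sum())  # K4: finite second moment (= 1/9)
print(K.max())                  # K5: bounded (= 35/32)
```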

10 Comparison of our method and Kronecker PCA

In this section, we compare our time-varying Kronecker sum model (4) to the sum-of-Kronecker-products (KronPCA) model of Tsiligkaridis & Hero (2013) and Greenewald & Hero (2015),

$\Sigma = \sum_{i=1}^{r} A_i \otimes B_i.$

While both our model and KronPCA are sums of Kronecker products, KronPCA is a more general model, but our method exploits sparsity while KronPCA cannot. Consider data generated from a time-varying Kronecker sum model (4), where $A^{-1}$ is a random ER graph and $B^{-1}(t)$ is a time-varying random ER graph as in Section 4 in the main text. The sizes of $p$ and $T$ were chosen to be relatively small, since the computational complexity of KronPCA is significantly higher than that of our method. Figure 10 shows Frobenius norm results for our L1-penalized method, KronPCA, and, for comparison, the baseline sample covariance. Note that due to the sparsity of the true model, our method performs significantly better than KronPCA, especially when the number of available replicates is small. Since in a time-varying setting the number of replicates is small or even 1, this is a significant advantage. Additionally, our method provides interpretable graph estimates, while the factors of KronPCA are nonsparse and not interpretable (Tsiligkaridis & Hero, 2013; Greenewald & Hero, 2015).
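For reference, the core of KronPCA is the Van Loan-Pitsianis rearrangement, under which a sum of $r$ Kronecker products becomes a rank-$r$ matrix. The sketch below (our own, unregularized; PRLS adds a rank penalty to this least-squares step) recovers Kronecker factors from an SVD of the rearranged covariance.

```python
import numpy as np

def rearrange(Sigma, T, p):
    """Van Loan-Pitsianis rearrangement: (pT, pT) -> (T^2, p^2).

    Row (s, t) of the output is vec of the (s, t)-th p x p block, so
    sum_i A_i kron B_i maps to sum_i vec(A_i) vec(B_i)^T (rank r).
    """
    R = np.empty((T * T, p * p))
    for s in range(T):
        for t in range(T):
            block = Sigma[s * p:(s + 1) * p, t * p:(t + 1) * p]
            R[s * T + t] = block.ravel()
    return R

def kron_pca(Sigma, T, p, r):
    """Best rank-r sum-of-Kronecker-products approximation of Sigma."""
    U, s, Vt = np.linalg.svd(rearrange(Sigma, T, p), full_matrices=False)
    Sigma_hat = np.zeros_like(Sigma)
    for i in range(r):
        A_i = (np.sqrt(s[i]) * U[:, i]).reshape(T, T)
        B_i = (np.sqrt(s[i]) * Vt[i]).reshape(p, p)
        Sigma_hat += np.kron(A_i, B_i)
    return Sigma_hat
```

Because the rearrangement is an isometry, truncating the SVD at rank $r$ gives the Frobenius-optimal rank-$r$ Kronecker approximation; note that the recovered factors are dense, illustrating the non-interpretability discussed above.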

Figure 10: Comparison of our method with KronPCA and the sample covariance. Shown is a logarithmic plot of the Frobenius norm error as a function of available replicates, with data generated using , random ER graphs.

11 Additional experiments using alternate graph topologies

11.1 Star-Block and MA

In Figure 11, we repeat the experiments of Figure 2 in the main text, showing results for $A$ changed to a star-block covariance (edge weights defined as for ER) and to a moving average (MA) covariance (bandwidth 15). The results confirm the trends found for the AR case.

Figure 11: MCC, Frobenius, and L2 norm error curves for a random ER graph $B$. From top to bottom: $A$ is a star-block covariance, and an MA covariance.

Similarly, in Figure 12, we show the results for the estimator of $A$ when $A$ is a star-block graph and $B$ is a random ER graph.

Figure 12: MCC, Frobenius, and L2 norm error curves for a Star-Block $A$ when $B$ is a random ER graph. From top to bottom: three experimental settings.

11.2 Random grid graph

In this section, we use a random grid graph, which is produced in the same way as the random ER graph, except edges are only allowed between adjacent nodes in a square two-dimensional grid (Figure 13). We replace the random ER model for $B$ used in the main text with the random grid graph model, and repeat the main text experiments using this alternate topology.

Random grid graph results for the experiments shown in Figure 2 of the main text are given in Figure 14, showing similar behavior, as expected. Similarly, random grid graph results for the experiments shown in Figure 11 are given in Figure 15.

For the $A$ part, Figure 16 repeats the experiments of Figure 3 in the main text, and Figure 17 repeats the experiments of Figure 12, both using the random grid graph for $B$.

Figure 13: Example sequence of random grid graphs used in the experiments. At each time point, the 50 edges connecting nodes are shown. Changes are indicated by red and green edges: red edges indicate edges that will be deleted in the next increment and green indicates new edges.
Figure 14: MCC, Frobenius, and L2 norm error curves for a random grid graph $B$. From top to bottom: $A$ is an AR covariance, for two parameter settings.
Figure 15: MCC, Frobenius, and L2 norm error curves for a random grid graph $B$. From top to bottom: $A$ is a star-block covariance, and an MA covariance.
Figure 16: MCC, Frobenius, and L2 norm error curves for an AR(1) $A$ when $B$ is a random grid graph. From top to bottom: two experimental settings.
Figure 17: MCC, Frobenius, and L2 norm error curves for a Star-Block $A$ when $B$ is a random grid graph. From top to bottom: two experimental settings.

12 Estimation error bound for the $B$ part: Proof of Theorem 1

12.1 Preliminary results

Define $\bar{S}_B(t_0)$ to be the expected value of the kernel-smoothed estimator at time $t_0$:

$\bar{S}_B(t_0) = \mathbb{E}\big[\hat{S}_B(t_0)\big] = \sum_{t=1}^{T} w_t(t_0)\, B(t/T).$   (14)

Using this notation, the estimator bias can be bounded as follows.

Lemma 5 (Bias).

Suppose there exists a constant $C_2$ such that $\max_{i,j} \sup_{t} |b_{ij}''(t)| \le C_2$. Then for a kernel satisfying assumptions (K1-K5) we have

$\max_{i,j} \big| \bar{S}_B(t_0) - B(t_0) \big|_{ij} \le C \Big( h^2 + \frac{1}{Th} \Big)$

for a constant $C$ depending only on $C_2$ and the kernel. This lemma is proved in Section 15.

The variance of the estimator can be bounded as follows.

Lemma 6 (Variance).

Suppose $\sum_{t=1}^{T} w_t(t_0) = 1$. Define the event

$\mathcal{A}_0 = \Big\{ \max_{i,j} \big| \hat{S}_B(t_0) - \bar{S}_B(t_0) \big|_{ij} \le C K^2 \sqrt{\log (p \vee T)}\; \| w(t_0) \|_2 \Big\}$   (15)

for some constant $C > 0$. Then $P(\mathcal{A}_0) \ge 1 - c/(p \vee T)^2$.

The proof of this result is based on an application of the Hanson-Wright inequality and is given in Section 12.4. We emphasize that this bound holds for both the diagonal and off-diagonal elements simultaneously. Using a similar approach, in Section 12.6 we can also show that the estimator is positive definite with high probability:

Lemma 7 (Positive definiteness).

Suppose the conditions of Lemma 6 hold. Define the event

$\mathcal{A}_1 = \big\{ \lambda_{\min}\big( \hat{S}_B(t_0) \big) \ge \underline{k} - \epsilon \big\}$   (16)

for some fixed $\epsilon \in (0, \underline{k})$. Then $P(\mathcal{A}_1) \ge 1 - c/(p \vee T)^2$.

12.2 Proof of Theorem 1

In this section, we derive the elementwise bound on the estimator $\hat{S}_B(t_0)$ of the spatial covariance at time $t_0$, and show that it is positive definite with high probability. To obtain the elementwise bound, we will first bound the bias and variance of $\hat{S}_B(t_0)$ and then combine the bounds.

12.3 Estimator bias bound

Recall that $\bar{S}_B(t_0) = \sum_t w_t(t_0) B(t/T)$. By Lemma 5 (proof in Section 15), we have

$\max_{i,j} \big| \bar{S}_B(t_0) - B(t_0) \big|_{ij} \le C \Big( h^2 + \frac{1}{Th} \Big).$

12.4 Estimator variance bound (Lemma 6)

Lemma 6 (Variance).

Suppose $\sum_{t=1}^{T} w_t(t_0) = 1$. Define the event

$\mathcal{A}_0 = \Big\{ \max_{i,j} \big| \hat{S}_B(t_0) - \bar{S}_B(t_0) \big|_{ij} \le C K^2 \sqrt{\log (p \vee T)}\; \| w(t_0) \|_2 \Big\}$   (17)

for some constant $C > 0$. Then

$P(\mathcal{A}_0) \ge 1 - c/(p \vee T)^2.$   (18)

Proof.

Recall that

$\hat{S}_B(t_0) = \sum_{t=1}^{T} w_t(t_0)\, x_t x_t^\top - \frac{\mathrm{tr}(A)}{T} I_p,$   (19)

and that, by the generative model (5), each $x_t$ is a linear function of the subgaussian matrices $Z_1$ and $Z_2$. Then each entry of $\hat{S}_B(t_0) - \bar{S}_B(t_0)$ is a centered quadratic form in the entries of $(Z_1, Z_2)$. Observe that for fixed $(i, j)$ there exists a matrix $M_{ij}$, with $\|M_{ij}\|_F$ controlled by $\|w(t_0)\|_2$ and the eigenvalue bounds of Assumption 1, such that

$\big( \hat{S}_B(t_0) - \bar{S}_B(t_0) \big)_{ij} = z^\top M_{ij}\, z - \mathbb{E}\big[ z^\top M_{ij}\, z \big], \qquad z = \mathrm{vec}([Z_1, Z_2]).$

By the triangle inequality, $\|M_{ij}\|_2 \le \|M_{ij}\|_F$. We can thus apply the Hanson-Wright inequality (stated in Section 14), giving for fixed $(i, j)$ a subexponential tail bound on this quadratic form. Using the union bound over the index pairs $(i, j)$ (cardinality at most $p^2$), the concentration bound implies that the deviations are uniformly bounded by $\epsilon\, \|w(t_0)\|_2$ with probability controlled by $\epsilon$. Setting $\epsilon = C K^2 \sqrt{\log (p \vee T)}$, for $C$ large enough we obtain

$\max_{i,j} \big| \hat{S}_B(t_0) - \bar{S}_B(t_0) \big|_{ij} \le C K^2 \sqrt{\log (p \vee T)}\; \| w(t_0) \|_2$   (20)

with probability at least $1 - c/(p \vee T)^2$.

12.5 Total error

Putting the bias and variance together, we can bound the total error of the estimator. By the triangle inequality,

$\big| \hat{S}_B(t_0) - B(t_0) \big|_{ij} \le \big| \hat{S}_B(t_0) - \bar{S}_B(t_0) \big|_{ij} + \big| \bar{S}_B(t_0) - B(t_0) \big|_{ij}.$

Hence, with high probability,

$\max_{i,j} \big| \hat{S}_B(t_0) - B(t_0) \big|_{ij} \le C \bigg( h^2 + \frac{1}{Th} + K^2 \sqrt{\frac{\log (p \vee T)}{Th}} \bigg),$

where we used $\|w(t_0)\|_2 \asymp (Th)^{-1/2}$ for kernel weights with bandwidth $h$. Optimizing over the order of $h$, we set $h \asymp (\log(p \vee T)/T)^{1/5}$, giving

$\max_{i,j} \big| \hat{S}_B(t_0) - B(t_0) \big|_{ij} \le C \Big( \frac{\log (p \vee T)}{T} \Big)^{2/5}.$