Compressive Nonparametric Graphical Model Selection For Time Series

11/13/2013 ∙ by Alexander Jung, et al. ∙ ETH Zurich ∙ TU Wien

We propose a method for inferring the conditional independence graph (CIG) of a high-dimensional discrete-time Gaussian vector random process from finite-length observations. Our approach does not rely on a parametric model (such as, e.g., an autoregressive model) for the vector random process; rather, it only assumes certain spectral smoothness properties. The proposed inference scheme is compressive in that it works for sample sizes that are (much) smaller than the number of scalar process components. We provide analytical conditions for our method to correctly identify the CIG with high probability.


1 Introduction

Consider a p-dimensional, zero-mean, stationary, Gaussian random process x[n], n ∈ ℤ. We are interested in learning the conditional independence graph (CIG) [1, 2, 3, 4] of x[n] from the finite-length observation of a single process realization. We consider the high-dimensional regime, where the number N of observed process samples is (much) smaller than the process dimension p [5, 6, 7, 8, 9, 10, 11]. In this case, consistent estimation of the CIG is possible only if structural assumptions on the vector process are made. Specifically, we will consider CIGs that are sparse in the sense of containing relatively few edges. This problem is relevant, e.g., in the analysis of the time evolution of air pollutant concentrations [1, 2] and in medical diagnostic data analysis [8].

Existing approaches to this compressive graphical model selection problem are based on parametric process models [8, 9, 12, 13], specifically on vector autoregressive (VAR) models. In this paper, we develop and analyze a nonparametric approach, which only requires the vector process to be spectrally smooth. The smoothness notion we use is quantified in terms of moments of the matrix-valued auto-covariance function (ACF) of the process. Compared to [10, 8, 9, 12, 13], our approach applies to a considerably more general class of processes, including VAR processes as a special case.

Contributions:

Our main conceptual contribution resides in recognizing that the problem of inferring the sparse CIG of a Gaussian vector process is a special case of a block-sparse signal recovery problem [14, 15, 16], i.e., a multitask learning problem [17, 18]. While for the special case of a VAR process with sparse CIG a block-sparse structure was already identified in [8], we show that a (different) block-sparse structure exists for general stationary time series. This stems from the fact that the CIG of a general stationary time series is encoded in the continuous ensemble of values of the spectral density matrix S(θ), θ ∈ [0, 1). Based on this insight, we develop a multitask LASSO [17, 19] formulation of the sparse CIG estimation problem. Our main analytical contribution is Theorem 4.1, which provides conditions for our scheme to correctly identify the CIG with high probability.

Outline:

The remainder of this paper is organized as follows. In Section 2, we formulate the problem. Section 3 describes our CIG inference method, and Section 4 presents corresponding performance guarantees. Finally, Section 5 reports numerical results.

2 Problem Formulation

Consider a p-dimensional, zero-mean, stationary, real, Gaussian random process x[n], n ∈ ℤ, with (matrix-valued) ACF R[m] := E{x[n + m] xᵀ[n]}. The ACF is assumed summable, i.e., Σ_{m∈ℤ} ‖R[m]‖ < ∞ for some matrix norm ‖·‖. The spectral density matrix (SDM) of the process is defined as S(θ) := Σ_{m∈ℤ} R[m] e^{−j2πθm}, θ ∈ [0, 1), and we assume that

(1)  0 < L ≤ λ_min(S(θ)) ≤ λ_max(S(θ)) ≤ U < ∞

for all θ ∈ [0, 1), where λ_min(S(θ)) and λ_max(S(θ)) denote the smallest and largest eigenvalue of S(θ), respectively. In particular, (1) implies that the matrix S(θ) is nonsingular for all θ ∈ [0, 1). We will furthermore require the vector process x[n] to be such that S(θ) satisfies certain smoothness properties, which are expressed in terms of moments of the ACF defined as

(2)  μ^(w) := Σ_{m∈ℤ} w[m] ‖R[m]‖.

Here, w[m] ≥ 0 is a nonnegative weight function that typically increases with |m|.

The CIG of the process x[n] is the graph G = (V, E) with node set V = {1, …, p} representing the scalar component processes x_1[n], …, x_p[n] and edge set E ⊆ V × V, where (i, j) ∉ E if and only if the component processes x_i[n] and x_j[n] are conditionally independent given all remaining component processes [1]. The neighborhood of node i is defined as 𝒩(i) := {j ∈ V ∖ {i} : (i, j) ∈ E}. We restrict ourselves to processes with sparse CIG in the sense that

(3)  max_{i∈V} |𝒩(i)| ≤ d.

The graphical model selection problem we consider can now be stated as the problem of inferring the CIG G, or more precisely its edge set E, from the observation x[1], …, x[N], where N is the sample size. Since x[n] is Gaussian with S(θ) nonsingular for all θ, it follows from [1, 2, 20] that (i, j) ∉ E if and only if [S⁻¹(θ)]_{i,j} = 0 for all θ ∈ [0, 1). The edge set E therefore corresponds to the locations of the nonzero entries of S⁻¹(θ), and our graphical model selection problem amounts to determining these locations. We are interested in estimating the CIG from the regularly spaced SDM samples S(θ_f), with θ_f := f/F, f ∈ [F] := {0, …, F − 1}, and F large enough for the following to hold:

(4)  [S⁻¹(θ_f)]_{i,j} ≠ 0 for some f ∈ [F]  if and only if  (i, j) ∈ E.

The implication from left to right in (4) follows trivially from what was said above. The implication from right to left is satisfied, e.g., for processes with all entries of S⁻¹(θ) being rational functions in e^{−j2πθ}, provided that F is larger than the maximum degree of the numerator polynomials of the entries of S⁻¹(θ). Another sufficient condition for the implication from right to left to hold is the following.

Lemma 2.1.

Consider a p-dimensional, zero-mean, stationary, Gaussian process with CIG G = (V, E) and SDM S(θ) satisfying (1). Then, if F is chosen large enough that, for every edge (i, j) ∈ E, the associated ACF moment of the form (2) and the global partial coherence satisfy a suitable bound involving the constant L in (1), the CIG is characterized by (4).

The restriction to the finite set of frequencies {θ_f}_{f∈[F]} is made for expositional convenience. The general theory developed in this paper goes through for F → ∞, with our inference procedure becoming a multitask learning problem with a continuum, instead of a finite number F, of tasks.

3 Graphical Model Selection

Our method for inferring the CIG is inspired by the approach employed in [10, 7]. We first estimate the SDM S(θ_f), f ∈ [F], by means of a multivariate spectral estimator. Then we use this estimate to perform neighborhood regression, which yields an estimate of the support (i.e., the locations of the nonzero entries) of S⁻¹(θ_f) and, via (4), the CIG. Neighborhood regression is performed by solving a multitask learning problem using the multitask LASSO (mLASSO) [17].

With regards to the first step, it is natural to estimate S(θ) using the multivariate Blackman-Tukey estimator [21]:

(5)  Ŝ(θ) := Σ_{m=−(N−1)}^{N−1} w[m] R̂[m] e^{−j2πθm}.

Here, R̂[m] := (1/N) Σ_{n=1}^{N−m} x[n + m] xᵀ[n] for 0 ≤ m ≤ N − 1 and, by symmetry of the ACF, R̂[m] := R̂ᵀ[−m] for −(N − 1) ≤ m < 0. Furthermore, the window function w[m] is chosen such that Ŝ(θ) is positive semidefinite. All window functions with nonnegative discrete-time Fourier transform are admissible [21, Sec. 2.5.2]. In the high-dimensional regime, where the number of observations N is smaller than the number of nodes p, the matrices Ŝ(θ_f) in (5) will be rank-deficient (to see this, note that each column of Ŝ(θ_f) is a linear combination of the N sample vectors x[1], …, x[N]). Simply inverting Ŝ(θ_f), for f ∈ [F], and inferring the edge set via (4) is therefore not possible.
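The rank-deficiency argument above can be checked numerically. The snippet below is our own illustration, not part of the paper: it forms the Blackman-Tukey estimate (5) at a single frequency from N = 16 synthetic white-noise samples of a p = 32 dimensional process; the Hann-type lag window and all variable names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 16, 32                      # fewer samples than process components
x = rng.standard_normal((N, p))    # stand-in for the observations x[1], ..., x[N]

w = np.hanning(2 * N)[N:]          # a tapering lag window w[0], ..., w[N-1]
theta = 0.25                       # a single frequency theta_f
S_hat = np.zeros((p, p), dtype=complex)
for m in range(N):
    R_m = x[m:].T @ x[:N - m] / N            # biased ACF estimate R_hat[m]
    ph = np.exp(-2j * np.pi * theta * m)
    S_hat += w[m] * R_m * ph
    if m > 0:
        S_hat += w[m] * R_m.T * np.conj(ph)  # R_hat[-m] = R_hat[m]^T

# every column of S_hat lies in span{x[1], ..., x[N]}, so rank(S_hat) <= N < p
print(np.linalg.matrix_rank(S_hat) <= N)     # prints True
```

Since the estimate is singular for N < p, it cannot simply be inverted, which is exactly what motivates the multitask-learning reformulation that follows.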

To cope with this issue, we reduce the problem of finding the support of the matrices S⁻¹(θ_f) to p multitask learning problems (one for each node). This can be done as follows. First note that, because of (4), the union of the supports of the ith rows of the matrices S⁻¹(θ_f), f ∈ [F], determines the neighborhood 𝒩(i). These supports, as shown next, can then be obtained by solving multitask learning problems. For simplicity of exposition and without loss of generality, we assume i = 1 in the following. Given S(θ_f), f ∈ [F], we define y_f ∈ ℂ^p and A_f ∈ ℂ^{p×(p−1)} via

(6)  S^{1/2}(θ_f) = [y_f  A_f],

where S^{1/2}(θ_f) is the positive definite square root of S(θ_f). We next decompose y_f into its orthogonal projection onto range(A_f) and the orthogonal complement thereof according to

(7)  y_f = A_f β_f + ε_f,

with β_f := A_f⁺ y_f and ε_f := (I − A_f A_f⁺) y_f, where A_f⁺ is the pseudo-inverse of A_f. The significance of this construction is expressed by the following proposition.

Proposition 3.1.

The neighborhood 𝒩(1) of node 1 is determined by the joint support of the vectors β_f, f ∈ [F], according to

(8)  𝒩(1) = supp( Σ_{f∈[F]} |β_f| ),

where the addition in (8) is elementwise, |β_f| denotes the entrywise absolute value, and the entries of β_f are indexed by j = 2, …, p.

Proof.

We first note that (4) implies 𝒩(1) = supp( Σ_{f∈[F]} |b_f| ), where b_f is the vector containing the entries [S⁻¹(θ_f)]_{1,j}, j = 2, …, p. Next, we show that supp(β_f) = supp(b_f), which will finalize the proof. By the construction of y_f and A_f in (6), we have

(9)  S(θ_f) = (S^{1/2}(θ_f))ᴴ S^{1/2}(θ_f) = [ y_fᴴ y_f   y_fᴴ A_f ; A_fᴴ y_f   A_fᴴ A_f ].

Applying a well-known formula for the inverse of a block matrix [22, Fact 2.17.3] to (9) yields b_f = −c β̄_f with c := [S⁻¹(θ_f)]_{1,1}. It also follows from [22, Fact 2.17.3] that c is the reciprocal of the Schur complement of the block A_fᴴ A_f in (9), and hence c > 0 by (1), which allows us to conclude that supp(β_f) = supp(b_f). ∎

The essence of Proposition 3.1 is that it reduces the problem of determining the neighborhood 𝒩(1) to that of finding the joint support of the vectors β_f, f ∈ [F]. Recovering the β_f based on the observations (7) is now recognized as a multitask learning or generalized multiple measurement vector problem [17, 18, 23], which in turn is a special case (with additional structure) of a block-sparse signal recovery problem [14, 15, 16]. Specifically, a multitask learning problem can be cast as a block-sparse signal recovery problem by stacking the individual linear models in (7) into a single linear model; the resulting system matrix is block-diagonal. The approach described in [8] for VAR processes, albeit leading to a block-sparse recovery problem, does not result in a block-diagonal system matrix.
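Proposition 3.1 admits a small numerical sanity check, shown below. This is our own illustration, not from the paper: in place of S⁻¹(θ_f) at one frequency it uses a toy precision matrix K with known sparsity, so K and all variable names are hypothetical.

```python
import numpy as np

p = 5
# toy positive definite "SDM" S whose inverse K is sparse by construction:
# node 1 is conditionally linked to node 2 only
K = np.eye(p)
K[0, 1] = K[1, 0] = 0.4
S = np.linalg.inv(K)

# construction (6): columns of the positive definite square root of S
eigvals, V = np.linalg.eigh(S)
S_half = V @ np.diag(np.sqrt(eigvals)) @ V.T
y, A = S_half[:, 0], S_half[:, 1:]

beta = np.linalg.pinv(A) @ y     # beta = A^+ y, as in (7)
b = K[0, 1:]                     # entries [S^{-1}]_{1,j}, j = 2, ..., p

# Proposition 3.1: the supports of beta and b coincide
print(np.nonzero(np.abs(beta) > 1e-9)[0], np.nonzero(np.abs(b) > 1e-9)[0])
```

Both support sets contain only the position corresponding to node 2, matching the single edge planted in K.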

An efficient method for solving the multitask learning problem at hand is the mLASSO, which can be formulated as follows (e.g., [17]):

(10)  β̂ = argmin_{β = (β_f)_{f∈[F]}} Σ_{f∈[F]} ‖y_f − A_f β_f‖₂² + λ ‖β‖_{2,1},

where λ > 0 is the LASSO parameter and ‖β‖_{2,1} := Σ_{j=2}^{p} ‖β^(j)‖₂ with β^(j) given by β^(j) := ([β_f]_j)_{f∈[F]}. To compute the estimate β̂, one does not need to obtain y_f and A_f by taking the square root of S(θ_f) as in (6). To see this, we note that (10) is equivalent to

(11)  β̂ = argmin_{β} Σ_{f∈[F]} ( β_fᴴ A_fᴴ A_f β_f − 2 Re{β_fᴴ A_fᴴ y_f} ) + λ ‖β‖_{2,1},

and, by (9), A_fᴴ A_f and A_fᴴ y_f are submatrices of S(θ_f). Therefore, working with (11) instead of (10) has the advantage that the square root does not need to be computed in order to determine β̂.

In summary, we have shown that the neighborhood 𝒩(1) can be found via the support of the mLASSO estimate (11). Recognizing that this estimate depends on S(θ_f), which is unknown, motivates the following inference algorithm (for general i ∈ V), which simply uses Ŝ(θ_f) instead of S(θ_f) in (11).

Algorithm 1.

Given the observation x[1], …, x[N], the parameter F, the threshold parameter η, and the mLASSO parameter λ (the choice of η and λ will be discussed in Section 4), perform the following steps:

Step 1: For each f ∈ [F], compute the SDM estimate Ŝ(θ_f) according to (5).

Step 2: Compute the mLASSO estimate β̂(i) = (β̂_f(i))_{f∈[F]} for each node i ∈ V as

(12)  β̂(i) = argmin_{β} Σ_{f∈[F]} ( β_fᴴ Ŝ_{−i,−i}(θ_f) β_f − 2 Re{β_fᴴ ŝ_{−i}(θ_f)} ) + λ ‖β‖_{2,1},

where Ŝ_{−i,−i}(θ_f) is the submatrix of Ŝ(θ_f) obtained by deleting its ith column and ith row, and ŝ_{−i}(θ_f) is obtained by deleting the ith entry in the ith column of Ŝ(θ_f).

Step 3: Estimate the neighborhood of node i as the index set

(13)  𝒩̂(i) := { j ∈ V ∖ {i} : ‖β̂(i)^(j)‖₂ > η },

with β̂(i)^(j) given by β̂(i)^(j) := ([β̂_f(i)]_j)_{f∈[F]}.
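The three steps above can be sketched in code. The following is a minimal illustration under our own assumptions, not the authors' implementation: it estimates the SDM via the Blackman-Tukey formula (5) and solves the mLASSO subproblem (12) with a simple proximal-gradient loop; the solver, its step size, and the function names are ours.

```python
import numpy as np

def sdm_estimate(x, F, w):
    """Blackman-Tukey SDM estimate (5) at the frequencies theta_f = f/F.
    x: (N, p) array of observations; w: length-N lag window w[0..N-1]."""
    N, p = x.shape
    R = [x[m:].T @ x[:N - m] / N for m in range(N)]   # biased ACF estimates
    S = np.zeros((F, p, p), dtype=complex)
    for f in range(F):
        S[f] = w[0] * R[0]
        for m in range(1, N):
            ph = np.exp(-2j * np.pi * (f / F) * m)
            S[f] += w[m] * (R[m] * ph + R[m].T * np.conj(ph))
    return S

def mlasso_neighborhood(S, i, lam, eta, n_iter=500):
    """Steps 2 and 3: solve (12) by proximal gradient, then threshold as in (13)."""
    F, p, _ = S.shape
    idx = [j for j in range(p) if j != i]
    Q = S[:, idx][:, :, idx]            # S_{-i,-i}(theta_f)
    q = S[:, idx, i]                    # s_{-i}(theta_f)
    L = max(np.linalg.eigvalsh(Q[f]).max() for f in range(F))  # Lipschitz bound
    B = np.zeros((F, p - 1), dtype=complex)
    for _ in range(n_iter):
        B = B - (np.einsum('fjk,fk->fj', Q, B) - q) / L        # gradient step
        g = np.linalg.norm(B, axis=0)                          # group norms over f
        B *= np.maximum(1 - lam / (2 * L * np.maximum(g, 1e-12)), 0)  # group shrinkage
    g = np.linalg.norm(B, axis=0)
    return {idx[j] for j in range(p - 1) if g[j] > eta}        # estimate of N(i)
```

Calling `mlasso_neighborhood(sdm_estimate(x, F, w), i, lam, eta)` for every node i yields the neighborhood estimates; a production implementation would substitute a dedicated group-LASSO solver for the hand-rolled loop.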

Our algorithm can be regarded as a generalization of the algorithm proposed in [10] for i.i.d. random processes to general stationary random processes. The new element here is that, since we consider general Gaussian vector processes, we have F measurements available to determine the CIG. This is exploited through the use of the mLASSO instead of the plain LASSO employed in [10].

4 Performance Guarantees

We now present conditions for our CIG selection scheme to correctly identify, with high probability, the neighborhoods 𝒩(i), and in turn the edge set E, of the underlying CIG. Our analysis yields allowed growth rates for the problem dimensions, i.e., the number of scalar process components p and the maximum node degree d, as functions of the sample size N. Moreover, we provide concrete choices for the threshold parameter η in (13) and the mLASSO parameter λ in (12).

A necessary and sufficient condition for the mLASSO to correctly identify the joint support of the underlying parameter vector is the incoherence condition [24, Eqs. (4)-(5)]. This condition is a worst-case (in our case, over frequency θ_f) condition [23] in that it needs the system matrices A_f in (7) (again, we consider node i = 1) to be "well-conditioned" for all f ∈ [F]. In other words, the incoherence condition does not predict any performance improvement owing to the availability of F measurements (7) instead of just one. We will therefore base our performance analysis on the multitask compatibility constant [17], which is defined, for a given index set 𝒮 ⊆ {2, …, p} of size s := |𝒮|, as

(14)  κ²(𝒮) := min { s Σ_{f∈[F]} ‖A_f β_f‖₂² / ‖β_𝒮‖²_{2,1} : ‖β_{𝒮ᶜ}‖_{2,1} ≤ 3 ‖β_𝒮‖_{2,1}, β_𝒮 ≠ 0 },

with 𝒮ᶜ := {2, …, p} ∖ 𝒮. Here, ‖β_𝒮‖_{2,1} := Σ_{j∈𝒮} ‖β^(j)‖₂, where β_𝒮 is the restriction of the vector β to the groups indexed by 𝒮. Invoking the concept of the multitask compatibility constant will be seen below to yield an average (across frequency θ_f) requirement on the SDM for Algorithm 1 to correctly identify the CIG.

We start by defining the class 𝒯 of p-dimensional, zero-mean, stationary, Gaussian processes with CIG of maximum node degree d (cf. (3)) and SDM satisfying (1) and (4). The remaining parameters characterizing this class are defined as follows:

  • Minimum partial coherence ρ_min: This parameter quantifies the minimum partial correlation between the spectral components of the process. In particular, we require that, for every edge (i, j) ∈ E, the magnitude of [S⁻¹(θ_f)]_{i,j} exceeds ρ_min for at least one f ∈ [F].

  • ACF moment μ: We quantify the spectral smoothness of the processes in 𝒯 using the ACF moment (2) with a weight function determined by the window function w[m] in (5).

  • Minimum multitask compatibility constant κ_min (cf. (14)): For every process in 𝒯, we require the multitask compatibility constant associated with the matrices A_f in (7) to satisfy κ²(𝒮) ≥ κ_min² for every node i ∈ V and every index set 𝒮 ⊆ 𝒩(i).

Combining techniques from large deviation theory [25], used to bound the SDM estimation error, with a deterministic performance analysis of the mLASSO [17], one can derive the following result.

Theorem 4.1.

Consider a process belonging to the class 𝒯. Let Ê be the estimate of E obtained from (13), based on sample size N and with λ (in (12)) and η (in (13)) chosen in terms of the class parameters. Then, if, for some δ > 0, the sample size N is sufficiently large and the ACF moment μ is sufficiently small (with thresholds determined by p, d, ρ_min, κ_min, and δ), the probability of Algorithm 1 delivering the correct edge set is at least 1 − δ, i.e., P{Ê = E} ≥ 1 − δ.

Theorem 4.1 shows that success is guaranteed with high probability if the sample size N scales logarithmically in the process dimension p and polynomially in the maximum node degree d, and if the process is sufficiently smooth, i.e., the ACF moment μ is sufficiently small.

Let us particularize Theorem 4.1 to the special case of a VAR(1) process as considered in [9, 13, 12], i.e., x[n + 1] = A x[n] + e[n] with i.i.d. noise e[n] ~ 𝒩(0, Σ). As in [9], we take A to be the adjacency matrix of a dependency graph G', which is related to, but in general different from, the CIG G. We assume that G' is a simple graph of maximum node degree d' and that the nonzero entries of the adjacency matrix are all equal to a single positive number a. The VAR(1) process we consider satisfies (1) and (3) by its definition and belongs to the class 𝒯 with parameters determined by a, d', and Σ. Moreover, condition (4) is satisfied as soon as F ≥ 3, since the entries of S⁻¹(θ) are rational functions in e^{−j2πθ} with numerator degree two. The sample-size threshold implied by Theorem 4.1 grows polynomially in the node degree, with a constant that is independent of p, d', a, and Σ; the corresponding threshold for the method in [9] exhibits a different growth rate in the node degree. The difference between the growth rates of these thresholds may be explained by the fact that the method in [9] is tailored to VAR processes, whereas our approach applies to general (spectrally smooth) stationary processes.
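The VAR(1) particularization can be checked numerically. The snippet below is our own illustration, not from the paper: for x[n + 1] = A x[n] + e[n], the SDM is S(θ) = H(θ) Σ Hᴴ(θ) with H(θ) = (I − A e^{−j2πθ})⁻¹, so S⁻¹(θ) = (I − A e^{−j2πθ})ᴴ Σ⁻¹ (I − A e^{−j2πθ}); after multiplication by e^{j2πθ}, each entry is a polynomial of degree two in e^{−j2πθ}, so F = 3 frequency samples suffice for (4). The matrices A and Σ below are hypothetical.

```python
import numpy as np

p = 4
A = np.zeros((p, p))
A[0, 1] = A[1, 0] = 0.3      # hypothetical dependency graph: one edge, nodes 1 and 2
Sigma_inv = np.eye(p)        # hypothetical noise precision

def K(theta):
    """Inverse SDM of the VAR(1) process x[n+1] = A x[n] + e[n]."""
    M = np.eye(p) - A * np.exp(-2j * np.pi * theta)
    return M.conj().T @ Sigma_inv @ M

# sample at F = 3 frequencies theta_f = f/F, as required for (4)
F = 3
samples = np.array([K(f / F) for f in range(F)])
edges = {(i, j) for i in range(p) for j in range(i + 1, p)
         if np.any(np.abs(samples[:, i, j]) > 1e-12)}
print(edges)
```

For this symmetric A with unit noise precision, the support recovered from the three samples reduces to the single planted edge, consistent with (4).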

5 Numerical Results

Figure 1: ROC curves for the compressive selection scheme.

We generated a Gaussian process x[n] by applying a finite impulse response (FIR) filter to a zero-mean, stationary, white, Gaussian noise process e[n]. (Matlab code to reproduce the results in this section is available at http://www.nt.tuwien.ac.at/about-us/staff/alexander-jung/.) The covariance matrix of e[n] was chosen such that the resulting CIG satisfies (3) with small maximum node degree d. The filter coefficients are such that the magnitude of the associated transfer function is uniformly bounded from above and below by positive constants, thereby ensuring that conditions (1) and (4) (for arbitrary F) are satisfied. We then computed the estimates 𝒩̂(i) using Algorithm 1 with a suitable window function w[m] and parameter F. The threshold η and the mLASSO parameter λ were set as described in Section 4, and the sample size N was varied over a range of values.

In Fig. 1, we show receiver operating characteristic (ROC) curves, plotting the average fraction of correct decisions against the average fraction of false alarms for varying mLASSO parameter λ. Here, the neighborhood estimates are obtained from Algorithm 1 in each simulation run, and the averages were computed over independent simulation runs.

References

  • [1] R. Dahlhaus, “Graphical interaction models for multivariate time series,” Metrika, vol. 51, pp. 151–172, 2000.
  • [2] R. Dahlhaus and M. Eichler, “Causality and graphical models for time series,” in Highly Structured Stochastic Systems, P. Green, N. Hjort, and S. Richardson, Eds., pp. 115–137. Oxford Univ. Press, Oxford, UK, 2003.
  • [3] F. R. Bach and M. I. Jordan, “Learning graphical models for stationary time series,” IEEE Trans. Signal Processing, vol. 52, no. 8, pp. 2189–2199, Aug. 2004.
  • [4] M. Eichler, Graphical Models in Time Series Analysis, Ph.D. thesis, Universität Heidelberg, Germany, 1999.
  • [5] N. E. Karoui, “Operator norm consistent estimation of large dimensional sparse covariance matrices,” Ann. Statist., vol. 36, no. 6, pp. 2717–2756, 2008.
  • [6] N. P. Santhanam and M. J. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions,” IEEE Trans. Inf. Theory, vol. 58, no. 7, pp. 4117–4134, Jul. 2012.
  • [7] P. Ravikumar, M. J. Wainwright, and J. Lafferty, “High-dimensional Ising model selection using ℓ1-regularized logistic regression,” Ann. Stat., vol. 38, no. 3, pp. 1287–1319, 2010.
  • [8] A. Bolstad, B. D. Van Veen, and R. Nowak, “Causal network inference via group sparse regularization,” IEEE Trans. Signal Processing, vol. 59, no. 6, pp. 2628–2641, Jun. 2011.
  • [9] J. Bento, M. Ibrahimi, and A. Montanari, “Learning networks of stochastic differential equations,” in Proc. Advances in Neural Information Processing Systems, pp. 172–180, 2010.
  • [10] N. Meinshausen and P. Bühlmann, “High dimensional graphs and variable selection with the Lasso,” Ann. Stat., vol. 34, no. 3, pp. 1436–1462, 2006.
  • [11] J. H. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical Lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, Jul. 2008.
  • [12] J. Songsiri, J. Dahl, and L. Vandenberghe, “Graphical models of autoregressive processes,” in Convex Optimization in Signal Processing and Communications, Y. C. Eldar and D. Palomar, Eds., pp. 89–116. Cambridge Univ. Press, Cambridge, UK, 2010.
  • [13] J. Songsiri and L. Vandenberghe, “Topology selection in graphical models of autoregressive processes,” J. Mach. Learn. Res., vol. 11, pp. 2671–2705, 2010.
  • [14] Y. C. Eldar, P. Kuppinger, and H. Bölcskei, “Block-sparse signals: Uncertainty relations and efficient recovery,” IEEE Trans. Signal Processing, vol. 58, no. 6, pp. 3042–3054, Jun. 2010.
  • [15] M. Mishali and Y. C. Eldar, “Reduce and boost: Recovering arbitrary sets of jointly sparse vectors,” IEEE Trans. Signal Processing, vol. 56, no. 10, pp. 4692–4702, Oct. 2008.
  • [16] Y. C. Eldar and H. Rauhut, “Average case analysis of multichannel sparse recovery using convex relaxation,” IEEE Trans. Inf. Theory, vol. 56, no. 1, pp. 505–519, Jan. 2010.
  • [17] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data, Springer, New York, 2011.
  • [18] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer, “Taking advantage of sparsity in multi-task learning,” in Proc. 22nd Annual Conference on Learning Theory, pp. 73–82, 2009.
  • [19] S. Lee, J. Zhu, and E. P. Xing, “Adaptive multi-task Lasso: With application to eQTL detection,” in Proc. Advances in Neural Information Processing Systems, pp. 1306–1314, 2010.
  • [20] R. Brillinger, “Remarks concerning graphical models for time series and point processes,” Revista de Econometria, vol. 16, pp. 1–23, 1996.
  • [21] P. Stoica and R. Moses, Introduction to Spectral Analysis, Prentice Hall, Englewood Cliffs, NJ, 1997.
  • [22] D. S. Bernstein, Matrix Mathematics: Theory, Facts, and Formulas, Princeton Univ. Press, Princeton, NJ, 2nd edition, 2009.
  • [23] R. Heckel and H. Bölcskei, “Joint sparsity with different measurement matrices,” in Proc. 50th Allerton Conf. Commun., Control, and Comput., pp. 698–702, 2012.
  • [24] F. R. Bach, “Consistency of the group Lasso and multiple kernel learning,” J. Mach. Learn. Res., vol. 9, pp. 1179–1225, 2008.
  • [25] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing, Springer, New York, 2012.