Estimation of the graphical structure of Gaussian Markov random fields (MRFs) has been a topic of active research in machine learning, data analysis and statistics. The reason is that they provide an efficient means for representing complex statistical relations among many variables in the form of a simple undirected graph, disclosing new insights about the interactions of genes, users, news articles and operational parts of a human driver, to name a few.
One mainstream of the research is to estimate the structure by maximum likelihood estimation (MLE), penalizing the $\ell_1$-norm of the learning parameters. In this framework, structure learning of a Gaussian MRF is equivalent to finding a sparse inverse covariance matrix of a multivariate Gaussian distribution. To formally describe the connection, let us consider $n$ samples of $p$ jointly Gaussian random variables following $\mathcal{N}(0, \Sigma)$, where the mean is zero without loss of generality and $\Sigma$ is the covariance matrix. The estimation is essentially the MLE of the inverse covariance matrix $\Theta = \Sigma^{-1}$ under an $\ell_1$-norm penalty, which can be stated as a convex optimization problem Yuan and Lin (2007):
Here, $S$ is the sample covariance matrix, and $\lambda > 0$ is a tuning parameter that determines the element-wise sparsity of the estimate $\hat\Theta$.
The $\ell_1$-regularized MLE approach (1) has been addressed quite extensively in the literature. The convex optimization formulation was first discussed in References d’Aspremont et al. (2008); Banerjee et al. (2008). A block-coordinate descent type algorithm was developed in Reference Friedman et al. (2008), revealing that the sub-problems of (1) can be solved in the form of LASSO regression problems Tibshirani (1996). More efficient solvers have been developed to deal with high-dimensional cases Oztoprak et al. (2012); Rolfs et al. (2012); Hsieh et al. (2011, 2012, 2013); Mazumder and Hastie (2012); Treister and Turek (2014); Zhang et al. (2018). On the theoretical side, we are interested in two aspects: the estimation quality of $\hat\Theta$ and the variable selection quality, that is, how well $\hat\Theta$ recovers the nonzero pattern of $\Theta$. These aspects are still under investigation in theory and experiments Meinshausen and Bühlmann (2006, 2010); Yuan and Lin (2007); Banerjee et al. (2008); Rothman et al. (2008); Lam and Fan (2009); Raskutti et al. (2009); Yuan (2010); Fattahi et al. (2018), as new analyses become available for the closely related $\ell_1$-penalized LASSO regression in vector spaces.
Among these, our method inherits the spirit of References Meinshausen and Bühlmann (2006); Yuan (2010) in particular, where the authors considered the problem (1) in terms of a collection of local regression problems defined for each random variable.
LASSO and SLOPE
Under a linear data model $y = X\beta + \epsilon$ with a data matrix $X \in \mathbb{R}^{n \times p}$ and noise $\epsilon$, the $\ell_1$-penalized estimation of $\beta$ is known as the LASSO regression Tibshirani (1996), whose estimate $\hat\beta$ is given by solving the following convex optimization problem:
where $\lambda > 0$ is a tuning parameter which determines the sparsity of the estimate $\hat\beta$. Important statistical properties of $\hat\beta$ are (i) the distance between the estimate $\hat\beta$ and the population parameter vector $\beta$ and (ii) the detection of the non-zero locations of $\beta$ by $\hat\beta$. We can rephrase the former as the estimation error and the latter as the model selection error. These two types of error depend on how the value of the tuning parameter $\lambda$ is chosen. Regarding model selection, it is known that the LASSO regression can control the family-wise error rate (FWER) at level $\alpha$ by choosing $\lambda = \sigma\,\Phi^{-1}(1 - \alpha/(2p))$ Bogdan et al. (2015). The FWER is essentially the probability of including in $\hat\beta$ at least one entry as non-zero which is zero in the true $\beta$. In high-dimensional cases where $p \gg n$, controlling the FWER is quite restrictive, since false positive detection of nonzero entries is unavoidable. As a result, FWER control can lead to weak detection power for nonzero entries.
The SLOPE is an alternative procedure for the estimation of $\beta$, using sorted $\ell_1$-norm (SL1) penalization instead. The SLOPE solves a modified convex optimization problem,
where $J_\lambda(\beta)$ is the sorted $\ell_1$-norm defined by
$$J_\lambda(\beta) = \sum_{i=1}^{p} \lambda_i \, |\beta|_{(i)},$$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ and $|\beta|_{(i)}$ is the $i$th largest component of $\beta$ in magnitude. In Reference Bogdan et al. (2015), it has been shown that, for linear regression, the SLOPE procedure can control the false discovery rate (FDR) of model selection at level $q$ by choosing $\lambda_i = \Phi^{-1}(1 - iq/(2p))$. The FDR is the expected ratio of the number of false discoveries (i.e., falsely selected nonzero entries) to the total number of discoveries. Since controlling the FDR is less restrictive for model selection than controlling the FWER, FDR control can lead to a significant increase in detection power, while it may slightly increase the total number of false discoveries Bogdan et al. (2015).
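To make the definition above concrete, the following is a minimal sketch of the sorted $\ell_1$-norm in Python (the function name is ours, not from any package):

```python
def sorted_l1_norm(beta, lam):
    """J_lambda(beta) = sum_i lam[i] * |beta|_(i), where |beta|_(1) >= |beta|_(2) >= ...
    and lam is a non-negative, non-increasing weight sequence."""
    mags = sorted((abs(b) for b in beta), reverse=True)  # |beta| in decreasing order
    return sum(l * m for l, m in zip(lam, mags))

# |beta| sorted descending: [2.0, 1.0, 0.5]; J = 3*2.0 + 2*1.0 + 1*0.5 = 8.5
print(sorted_l1_norm([0.5, -2.0, 1.0], [3.0, 2.0, 1.0]))  # 8.5
# With constant weights, J reduces to the ordinary scaled l1 norm: 1 * ||beta||_1
print(sorted_l1_norm([0.5, -2.0, 1.0], [1.0, 1.0, 1.0]))  # 3.5
```

Note how the largest weight is paired with the largest magnitude, which is what penalizes large coefficients more heavily than the plain $\ell_1$-norm does.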
This paper is motivated by the SLOPE method Bogdan et al. (2015) for its use of the SL1 regularization, which brings many benefits not available with the popular $\ell_1$-based regularization: the capability of false discovery rate (FDR) control Bogdan et al. (2015); Brzyski et al. (2015), adaptivity to unknown signal sparsity Su and Candès (2016) and clustering of coefficients Bondell and Reich (2008); Figueiredo and Nowak (2016). Also, efficient optimization methods Bogdan et al. (2015); Lee et al. (2016); Zhang et al. (2018) and further theoretical analyses Su and Candès (2016); Chen and Banerjee (2016); Bellec et al. (2017); Derumigny (2018) are under active research.
In this paper, we propose a new procedure to find a sparse inverse covariance matrix estimate, which we call nsSLOPE (neighborhood selection Sorted L-One Penalized Estimation). Our nsSLOPE procedure uses the sorted $\ell_1$-norm for penalized model selection, whereas the existing gLASSO (1) method uses the $\ell_1$-norm for that purpose. We investigate our method both in theory and in experiments, showing (i) how the estimation error can be bounded and (ii) how model selection (more specifically, neighborhood selection Meinshausen and Bühlmann (2006)) can be done with FDR control on the edge structure of the Gaussian Markov random field. We also provide a simple yet efficient estimation algorithm which is well suited to parallel computation.
2 nsSLOPE (Neighborhood Selection Sorted L-One Penalized Estimation)
Our method is based on the idea that the estimation of the inverse covariance matrix of a multivariate normal distribution can be decomposed into multiple regressions on conditional distributions Meinshausen and Bühlmann (2006); Yuan (2010).
For a formal description of our method, let us consider a $p$-dimensional random vector $x \sim \mathcal{N}(0, \Sigma)$, denoting its $i$th component as $x_i$ and the sub-vector without the $i$th component as $x_{-i}$. For the inverse covariance matrix $\Theta = \Sigma^{-1}$, we use $\Theta_i$ and $\Theta_{-i}$ to denote the $i$th column of the matrix and the rest of $\Theta$ without the $i$th column, respectively.
From the Bayes rule, we can decompose the full log-likelihood into the following parts:
$$\log p(x) = \log p(x_i \mid x_{-i}) + \log p(x_{-i}).$$
This decomposition allows a block-wise optimization of the full log-likelihood, which iteratively optimizes the parameters of $\log p(x_i \mid x_{-i})$ while the parameters in $\log p(x_{-i})$ are fixed at their current values.
In the block-wise optimization approach mentioned above, we need to deal with the conditional distribution $p(x_i \mid x_{-i})$ in each iteration. When $x \sim \mathcal{N}(0, \Sigma)$, the conditional distribution also follows a Gaussian distribution Anderson (2003), in particular:
$$x_i \mid x_{-i} \sim \mathcal{N}\!\left(-x_{-i}^{\mathsf T}\,\Theta_{-i,i}/\Theta_{ii},\; 1/\Theta_{ii}\right).$$
Here $\Theta_{-i,i}$ denotes the $i$th column of $\Theta$ without its $i$th element, $\Theta_{-i,-i}$ denotes the sub-matrix of $\Theta$ without the $i$th column and the $i$th row, and $\Theta_{ii}$ is the $i$th diagonal component of $\Theta$. Once we define
$$\beta_i := -\frac{\Theta_{-i,i}}{\Theta_{ii}}, \qquad \sigma_i^2 := \frac{1}{\Theta_{ii}},$$
the conditional distribution now looks like:
$$x_i \mid x_{-i} \sim \mathcal{N}\!\left(x_{-i}^{\mathsf T}\beta_i,\; \sigma_i^2\right).$$
This indicates that the minimization of the conditional negative log-likelihood can be understood as a local regression for the random variable $x_i$, under the data model:
$$x_i = x_{-i}^{\mathsf T}\beta_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma_i^2). \qquad (2)$$
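The correspondence between the local regression and the precision matrix can be checked numerically. A sketch with NumPy, under the Gaussian model discussed above (the tridiagonal matrix below is an arbitrary positive definite example): regressing $x_i$ on $x_{-i}$ at the population level yields coefficients $\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}$, which equal $-\Theta_{-i,i}/\Theta_{ii}$, with residual variance $1/\Theta_{ii}$:

```python
import numpy as np

# A small positive definite precision matrix Theta (tridiagonal example)
Theta = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.6],
                  [0.0, 0.6, 2.0]])
Sigma = np.linalg.inv(Theta)

i = 1
mask = np.arange(3) != i
# Population regression coefficients of x_i on x_{-i}
beta_i = np.linalg.solve(Sigma[np.ix_(mask, mask)], Sigma[mask, i])
# Identity from the precision matrix: beta_i = -Theta_{-i,i} / Theta_{ii}
assert np.allclose(beta_i, -Theta[mask, i] / Theta[i, i])
# Residual (conditional) variance equals sigma_i^2 = 1 / Theta_{ii}
resid_var = Sigma[i, i] - Sigma[i, mask] @ beta_i
assert np.isclose(resid_var, 1.0 / Theta[i, i])
print(beta_i, resid_var)
```

Both identities are exact consequences of block matrix inversion, which is why the assertions hold up to floating-point error for any positive definite choice of the matrix.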
To obtain the solution of the local regression problem (2), we consider a convex optimization based on the sorted $\ell_1$-norm Bogdan et al. (2015); Figueiredo and Nowak (2016). In particular, for each local problem index $i = 1, \dots, p$, we solve
Here, the matrix $X \in \mathbb{R}^{n \times p}$ consists of $n$ i.i.d. $p$-dimensional samples in its rows, $X_i$ is the $i$th column of $X$ and $X_{-i}$ is the sub-matrix of $X$ without the $i$th column. Note that the sub-problem (3) requires an estimate of $\sigma_i^2$, namely $\hat\sigma_i^2$, which will be computed depending on information from the other sub-problem indices (namely, $j \neq i$). Because of this, the problems (3) for different indices are not independent. This contrasts our method with the neighborhood selection algorithms Meinshausen and Bühlmann (2006); Yuan (2010) based on $\ell_1$-regularization.
2.2 Connection to Inverse Covariance Matrix Estimation
With $\hat\beta_i$ obtained from (3) as an estimate of $\beta_i$ for each $i$, now the question is how to use this information for the estimation of $\Theta$ without an explicit inversion of the matrix. For this purpose, we first consider the block-wise formulation of $\Sigma = \Theta^{-1}$:
putting the $i$th row and the $i$th column to the last positions whenever necessary. Also, from the block-wise inversion of $\Sigma = \Theta^{-1}$, we have (block matrices that are not needed are replaced with $*$):
From these and the definitions of $\beta_i$ and $\sigma_i^2$, we can establish the following relations:
$$\Theta_{ii} = \frac{1}{\sigma_i^2} = \left(\Sigma_{ii} - \Sigma_{i,-i}\,\Sigma_{-i,-i}^{-1}\,\Sigma_{-i,i}\right)^{-1}, \qquad \Theta_{-i,i} = -\Theta_{ii}\,\beta_i. \qquad (4)$$
Note that $\Theta_{ii} > 0$ for a positive definite matrix $\Theta$ and that (4) implies the sparsity patterns of $\Theta_{-i,i}$ and $\beta_i$ must be the same. Also, the updates (4) for $i = 1, \dots, p$ are not independent, since the computation of $\hat\sigma_i^2$ depends on the other blocks of $\hat\Theta$. However, if we estimate $\sigma_i^2$ based on the sample covariance matrix $S$ instead, (i) our updates (4) no longer need to explicitly compute or store the $\hat\Sigma$ matrix, unlike the gLASSO, and (ii) our sub-problems become mutually independent and thus solvable in parallel.
Our algorithm, called nsSLOPE, is summarized in Algorithm 1; it is essentially a block-coordinate descent algorithm Beck and Tetruashvili (2013); Treister and Turek (2014). Our algorithm may look similar to that of Yuan (2010), but there are several important differences: (i) each sub-problem in our algorithm solves a SLOPE formulation (SL1 regularization), while Yuan’s sub-problem is either the LASSO or the Dantzig selector ($\ell_1$ regularization); (ii) our sub-problem makes use of the estimate $\hat\sigma_i^2$ in addition to $\hat\beta_i$.
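As a rough illustration of the structure of the per-column procedure (not the actual implementation), the sketch below runs each sub-problem by proximal gradient descent with a constant regularization weight, in which case the SL1 proximal operator reduces to ordinary soft-thresholding; the general case needs the sorted-$\ell_1$ prox of Bogdan et al. (2015). The function names, the objective scaling, and the use of a Schur complement of the sample covariance for $\hat\sigma_i^2$ are our assumptions, made to keep the sketch self-contained:

```python
import numpy as np

def soft(v, t):
    """Proximal operator of t*||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def solve_subproblem(Xmi, xi, lam, iters=500):
    """ISTA for (1/2n)||xi - Xmi b||^2 + lam*||b||_1 (constant-weight special case)."""
    n, d = Xmi.shape
    L = np.linalg.norm(Xmi, 2) ** 2 / n          # Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(iters):
        grad = Xmi.T @ (Xmi @ b - xi) / n
        b = soft(b - grad / L, lam / L)
    return b

def ns_sketch(X, lam):
    """Assemble Theta-hat column by column via local regressions (hypothetical sketch)."""
    n, p = X.shape
    X = X - X.mean(0)
    S = X.T @ X / n                              # sample covariance
    Theta = np.zeros((p, p))
    for i in range(p):                           # sub-problems are independent: parallelizable
        mask = np.arange(p) != i
        b = solve_subproblem(X[:, mask], X[:, i], lam)
        # sigma_i^2 from the Schur complement of S (an assumption in this sketch)
        s2 = S[i, i] - S[i, mask] @ np.linalg.solve(S[np.ix_(mask, mask)], S[mask, i])
        Theta[i, i] = 1.0 / s2
        Theta[mask, i] = -b / s2                 # Theta_{-i,i} = -Theta_{ii} * beta_i
    return (Theta + Theta.T) / 2                 # symmetrize the two estimates per entry
```

On data simulated from a sparse precision matrix, the recovered diagonal reproduces the scale $1/\sigma_i^2$ and the off-diagonals reproduce the sign pattern up to shrinkage.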
3.1 Sub-Problem Solver
To solve our sub-problems (3), we use the SLOPE R package, which implements the proximal gradient descent algorithm of Reference Beck and Teboulle (2009) with acceleration based on Nesterov’s original idea Nesterov (1983). The algorithm requires computing the proximal operator involving the SL1 norm $J_\lambda$, namely
$$\operatorname{prox}_{J_\lambda}(v) = \arg\min_{x} \; \tfrac{1}{2}\|x - v\|_2^2 + J_\lambda(x).$$
This can be computed in $O(p \log p)$ time using an algorithm from Reference Bogdan et al. (2015). Optimality of a sub-problem is declared by its primal-dual gap, for which we use a tight threshold value.
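The prox computation can be sketched as follows: sort $|v|$ in decreasing order, subtract the $\lambda$ sequence, solve a non-increasing isotonic regression by a stack-based pool-adjacent-violators pass, clip at zero, and undo the ordering and signs. This follows the structure of the fast prox in Bogdan et al. (2015); the code itself is our own sketch, not the package implementation:

```python
def prox_sorted_l1(v, lam):
    """argmin_x 0.5*||x - v||^2 + sum_i lam[i]*|x|_(i),
    for a non-negative, non-increasing sequence lam."""
    p = len(v)
    order = sorted(range(p), key=lambda i: -abs(v[i]))   # indices by decreasing |v|
    w = [abs(v[order[k]]) - lam[k] for k in range(p)]    # shifted sorted magnitudes
    # Non-increasing isotonic regression of w via a stack of blocks [sum, count]:
    blocks = []
    for x in w:
        blocks.append([x, 1])
        # merge while block averages violate the non-increasing constraint
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] <= blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    x_sorted = []
    for s, c in blocks:
        x_sorted.extend([max(s / c, 0.0)] * c)           # clip negative averages to zero
    out = [0.0] * p
    for k, idx in enumerate(order):
        out[idx] = (1.0 if v[idx] >= 0 else -1.0) * x_sorted[k]
    return out

# With a constant lam, the prox reduces to elementwise soft-thresholding:
print(prox_sorted_l1([3.0, -1.0, 0.5], [1.0, 1.0, 1.0]))  # [2.0, -0.0, 0.0]
# Nearby magnitudes can be merged into a cluster, here both tie at (0.7 + 0.9) / 2:
print(prox_sorted_l1([1.0, 1.2], [0.5, 0.1]))
```

The sort dominates the cost, giving the $O(p \log p)$ complexity mentioned above; the merging pass itself is linear.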
3.2 Choice of $\lambda$
For the sequence of $\lambda$ values in the sub-problems, we use the so-called Benjamini-Hochberg (BH) sequence Bogdan et al. (2015):
$$\lambda_i^{\mathrm{BH}} = \Phi^{-1}\!\left(1 - \frac{iq}{2p}\right), \quad i = 1, \dots, p. \qquad (5)$$
Here $q \in (0, 1)$ is the target level of the FDR control we discuss later and $\Phi^{-1}(\alpha)$ is the $\alpha$th quantile of the standard normal distribution. In fact, when the design matrix in the SLOPE sub-problem (3) is not orthogonal, it is beneficial to use an adjusted version of this sequence, generated recursively from the BH sequence; if the generated sequence is non-increasing at first but begins to increase after some index $i^\star$, we set all the remaining values equal to $\lambda_{i^\star}$, so that the resulting sequence is non-increasing. For more discussion about the adjustment, we refer to Section 3.2.2 of Reference Bogdan et al. (2015).
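The BH sequence can be generated with the Python standard library alone (`statistics.NormalDist` is available from Python 3.8); the truncation-after-increase step described above is sketched as a separate helper. Both function names are ours:

```python
from statistics import NormalDist

def bh_sequence(p, q):
    """lambda_i = Phi^{-1}(1 - i*q/(2p)) for i = 1..p; non-increasing by construction."""
    inv = NormalDist().inv_cdf
    return [inv(1.0 - i * q / (2.0 * p)) for i in range(1, p + 1)]

def truncate_after_increase(lams):
    """If a sequence starts to increase at some index, hold it constant from there on."""
    out = list(lams)
    for i in range(1, len(out)):
        if out[i] > out[i - 1]:
            return out[:i] + [out[i - 1]] * (len(out) - i)
    return out

lam = bh_sequence(p=10, q=0.1)
print(lam[0])  # Phi^{-1}(1 - 0.005), roughly 2.576
```

Note that the truncation helper is only needed for adjusted sequences; the raw BH sequence is already non-increasing.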
3.3 Estimation of $\sigma_i^2$
To solve the $i$th sub-problem in our algorithm, we need to estimate the value of $\sigma_i^2$. This can be done using (4), that is, $\hat\sigma_i^2 = \hat\Sigma_{ii} - \hat\Sigma_{i,-i}\,\hat\Sigma_{-i,-i}^{-1}\,\hat\Sigma_{-i,i}$. However, this implies that (i) we need to keep an estimate $\hat\Sigma$ additionally and (ii) the computation of the $i$th sub-problem will be dependent on all the other indices, as it needs to access $\hat\Sigma_{-i,-i}$, requiring the algorithm to run sequentially.
To avoid these overheads, we compute the estimate $\hat\sigma_i^2$ using the sample covariance matrix $S$ instead (we assume that the columns of $X$ are centered).
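A numerical sketch of one plausible instantiation (the Schur-complement form of the estimate is our reading of the text, not a verbatim formula from it), comparing the sample-covariance-based estimate against the population value $1/\Theta_{ii}$:

```python
import numpy as np

rng = np.random.default_rng(0)
Theta = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.6],
                  [0.0, 0.6, 2.0]])
Sigma = np.linalg.inv(Theta)
n = 20000
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
X = X - X.mean(axis=0)           # center the columns
S = X.T @ X / n                  # sample covariance

i, mask = 1, np.arange(3) != 1
# Schur-complement estimate of sigma_i^2 from S alone (no access to Theta-hat)
sigma2_hat = S[i, i] - S[i, mask] @ np.linalg.solve(S[np.ix_(mask, mask)], S[mask, i])
print(sigma2_hat, 1.0 / Theta[i, i])  # close for large n
```

Only blocks of $S$ are touched, so each index $i$ can be processed without knowing any other sub-problem's current solution.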
This allows us to compute the inner loop of Algorithm 1 in parallel.
3.4 Stopping Criterion of the External Loop
To terminate the outer loop in Algorithm 1, we check whether the diagonal entries of $\hat\Theta$ have converged; that is, the algorithm is stopped when the norm of the difference between two consecutive iterates of the diagonal falls below a threshold value. The value is slightly loose, but we have found no practical difference from making it tighter. Note that it suffices to check the optimality of the diagonal entries of $\hat\Theta$, since the optimality of the $\hat\beta_i$'s is enforced by the sub-problem solver and (4).
3.5 Uniqueness of Sub-Problem Solutions
When $n < p$, our sub-problems may have multiple solutions, which may prevent the global convergence of our algorithm. We could adopt the technique in Reference Razaviyayn et al. (2013) and inject a strongly convex proximity term into each sub-problem objective, so that each sub-problem has a unique solution. In our experience, however, we encountered no convergence issues with the stopping threshold values we used for the outer loop.
In this section, we provide two theoretical results on our nsSLOPE procedure: (i) an estimation error bound on the distance between our estimate $\hat\Theta$ and the true model parameter $\Theta$ and (ii) a group-wise FDR control result on discovering the true edges in the Gaussian MRF corresponding to $\Theta$.
We first discuss the estimation error bound, for which we divide our analysis into two parts regarding (i) the off-diagonal entries and (ii) the diagonal entries of $\hat\Theta$.
4.1 Estimation Error Analysis
4.1.1 Off-Diagonal Entries
From (4), we see that $\Theta_{-i,i} = -\Theta_{ii}\,\beta_i$; in other words, when $\Theta_{ii}$ is fixed, the off-diagonal entries are determined solely by $\beta_i$, and therefore we can focus on the estimation error of $\hat\beta_i$.
To discuss the estimation error of $\hat\beta_i$, it is convenient to consider a constrained reformulation of the sub-problem (3):
Hereafter, for the sake of simplicity, we write $y := X_i$ and $X := X_{-i}$, also dropping the sub-problem index $i$ in $\beta_i$ and $\sigma_i^2$. In this view, the data model being considered in each sub-problem is as follows,
$$y = X\beta^* + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_n).$$
For the analysis, we make the following assumptions:
The true signal $\beta^*$ satisfies $J_\lambda(\beta^*) \le R$ for some $R > 0$ (this condition is satisfied, for example, if $\beta^*$ is $k$-sparse, that is, it has at most $k$ nonzero elements, with a suitable choice of $R$).
The noise $\epsilon$ satisfies a condition which allows us to say that the true signal $\beta^*$ is feasible with respect to the constraint in (6).
We provide the following result, which shows that $\hat\beta$ approaches the true $\beta^*$ with high probability:
Suppose that $\hat\beta$ is an estimate of $\beta^*$ obtained by solving the sub-problem (6). Consider the factorization $X = G\,\Sigma_{-i,-i}^{1/2}$, where $G \in \mathbb{R}^{n \times (p-1)}$ is a Gaussian random matrix whose entries are sampled i.i.d. from $\mathcal{N}(0, 1)$ and $\Sigma_{-i,-i}^{1/2}$ is a deterministic matrix such that $(\Sigma_{-i,-i}^{1/2})^{\mathsf T}\,\Sigma_{-i,-i}^{1/2} = \Sigma_{-i,-i}$. Such a decomposition is possible since the rows of $X$ are independent samples from $\mathcal{N}(0, \Sigma_{-i,-i})$. Then we have,
with probability at least $1 - \delta$.
We need to discuss a few results before proving Theorem 4.1.1.
Let $K$ be a bounded subset of $\mathbb{R}^{p-1}$. For an $\epsilon > 0$, consider the set
holds with probability at least $1 - \delta$, where $g$ is a standard Gaussian random vector.
The result follows from an extended general inequality in expectation Figueiredo and Nowak (2014). ∎
The next result shows that the first term of the upper bound in Theorem 4.1.1 can be bounded without using expectation. The quantity $\omega(K) := \mathbb{E}\,\sup_{u \in K} \langle g, u \rangle$ is called the width of $K$ and is bounded as follows,
This result is a part of the proof for Theorem 3.1 in Reference Figueiredo and Nowak (2014). ∎
where the true signal $\beta^*$ belongs to a bounded subset $K$. The following corollaries are straightforward extensions of Theorems 3.3 and 3.4 of Reference Figueiredo and Nowak (2014), given our Theorem 4.1.1 (so we skip the proofs).
Choose $\hat\beta$ to be any vector satisfying
with probability at least $1 - \delta$.
Now we show the error bound for the estimate we obtain by solving the optimization problem (6). For this purpose, we make use of the Minkowski functional of the set $K$,
$$\|x\|_K := \inf\{t > 0 : x \in t\,K\}.$$
If $K$ is a compact and origin-symmetric convex set with non-empty interior, then $\|\cdot\|_K$ defines a norm in $\mathbb{R}^{p-1}$. Note that $\|x\|_K \le 1$ if and only if $x \in K$.
with probability at least $1 - \delta$.
Proof of Theorem 4.1.1.
Since we assumed that $J_\lambda(\beta^*) \le R$, we construct the subset $K$ so that all vectors $\beta$ with $J_\lambda(\beta) \le R$ will be contained in $K$. That is,
$$K := \{\beta \in \mathbb{R}^{p-1} : J_\lambda(\beta) \le R\}.$$
This is a ball defined in the SL1-norm $J_\lambda$: in this case, the Minkowski functional $\|\cdot\|_K$ is proportional to $J_\lambda$, and thus the same solution minimizes both penalties.
Recall the definitions above and our choice of $K$. Combining the preceding corollaries with this choice yields the claimed bound, completing the proof. ∎
4.1.2 Diagonal Entries
We estimate $\sigma_i^2$ based on the residual sum of squares (RSS), as suggested by Yuan (2010),
$$\hat\sigma_i^2 = \frac{1}{n}\,\bigl\|X_i - X_{-i}\hat\beta_i\bigr\|_2^2.$$
Unlike Reference Yuan (2010), we directly analyze the estimation error of $\hat\sigma_i^2$ based on a chi-square tail bound. For all $\epsilon > 0$ small enough, we have
where, for such $\epsilon$,
Using the same notation as in the previous section, that is, $y = X_i$ and $X = X_{-i}$, consider the estimate in discussion,
where the last equality follows from the data model $y = X\beta^* + \epsilon$. Therefore,
where $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$. The last term is the sum of squares of $n$ independent Gaussian random variables and therefore, after scaling, it follows the chi-square distribution, that is, $\|\epsilon\|_2^2/\sigma^2 \sim \chi_n^2$. Applying the tail bound Johnstone (2001) for a chi-square random variable with $n$ degrees of freedom:
we get, for all small enough $\epsilon > 0$,
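The concentration that drives this bound can be illustrated numerically. The quick Monte Carlo sketch below (sample sizes and the seed are arbitrary choices of ours) shows that $\chi_n^2/n$ concentrates around 1, which is why the RSS-based estimate of $\sigma_i^2$ is accurate for large $n$:

```python
import random

random.seed(0)
n, reps = 400, 2000
# ||eps||^2 / (sigma^2 * n) is distributed as chi2_n / n and concentrates around 1
ratios = []
for _ in range(reps):
    s = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
    ratios.append(s / n)
mean_ratio = sum(ratios) / reps
frac_far = sum(abs(r - 1.0) > 0.2 for r in ratios) / reps
print(mean_ratio, frac_far)  # mean near 1; large deviations are rare
```

The standard deviation of $\chi_n^2/n$ is $\sqrt{2/n}$, so deviations of fixed size $\epsilon$ become exponentially rare as $n$ grows, matching the tail bound above in spirit.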
4.1.3 Discussion on Asymptotic Behaviors
Our two main results above, Theorems 4.1.1 and 4.1.2, indicate how well our estimates of the off-diagonal entries and the diagonal entries behave. Based on these results, we can discuss the estimation error of the full matrix $\hat\Theta$ compared to the true precision matrix $\Theta$.
From Theorem 4.1.1, we can deduce that, under the conditions above,
where the relevant quantity is the smallest eigenvalue of the symmetric positive definite matrix $\Sigma_{-i,-i}$. That is, using the interlacing property of eigenvalues, we have
where $\rho(\cdot)$ denotes the spectral radius. Therefore, for large enough $n$, the distance between $\hat\beta_i$ and $\beta_i$ is small. Here, we can choose $\delta \to 0$ as $n$ increases, so that the bound in Theorem 4.1.1 will hold with probability approaching one. That is, in rough asymptotics,
Under the conditions above, Theorem 4.1.2 indicates that:
using our assumption above. We can further introduce assumptions on $\Sigma$ and $\Theta$ as in Reference Yuan (2010), so that we can quantify the upper bound, but here we will simply say that the bound is a constant $c$ depending on the properties of the full matrices $\Sigma$ and $\Theta$. If this is the case, then from Theorem 4.1.2, for an $\epsilon$ such that the tail bound applies, we see that
with the probability approaching one.
4.2 Neighborhood FDR Control under Group Assumptions
Here we consider $\hat\beta$ obtained by solving the unconstrained form (3), with the same data model we discussed above for the sub-problems:
with $y = X\beta^* + \epsilon$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$. Here we focus on a particular but interesting case where the columns of $X$ form orthogonal groups, that is, under a suitable ordering of the columns, $X^{\mathsf T} X$ forms a block-diagonal matrix. We also assume that the columns of $X$ belonging to the same group are highly correlated, in the sense that, for any columns $X_j$ and $X_k$ of $X$ corresponding to the same group, their correlation is high enough to satisfy that
This implies, by Theorem 2.1 of Reference Figueiredo and Nowak (2016), that the coefficients of $\hat\beta$ within the same group share a common magnitude, which simplifies our analysis. Note that if $X_j$ and $X_k$ belong to different blocks, then our assumption above implies $X_j^{\mathsf T} X_k = 0$. Finally, we further assume that the columns are normalized, so that $\|X_j\|_2 = 1$ for each $j$.
Consider a collection of non-overlapping index subsets $g_1, \dots, g_G$ as the set of groups. Under the block-diagonal covariance matrix assumption above, we see that
where $\bar\beta_g$ denotes the representative of the identical coefficients for all indices in group $g$. This tells us that we can replace $X\beta$ by $\bar X \bar\beta$, if we define $\bar X$ containing one representative column per group (so that $\bar X \in \mathbb{R}^{n \times G}$) and consider the vector of group-representative coefficients $\bar\beta \in \mathbb{R}^{G}$.
The regularizer $J_\lambda$ can be rewritten similarly,
where $\bar\lambda$ is a new weight sequence, denoting by $g(i)$ the group which has the $i$th largest coefficient in magnitude and by $|g(i)|$ the size of that group.
where we can easily check that the new sequence $\bar\lambda$ satisfies $\bar\lambda_1 \ge \bar\lambda_2 \ge \cdots \ge 0$. This is exactly the form of the SLOPE problem with an orthogonal design matrix in Reference Bogdan et al. (2015), except that the new sequence $\bar\lambda$ is not exactly the Benjamini-Hochberg sequence (5).
In the following, $V$ is the number of individual false rejections, $R$ is the total number of individual rejections and $R_G$ is the total number of group rejections. The following lemmas are slightly modified versions of Lemmas B.1 and B.2 from Reference Bogdan et al. (2015), respectively, adapted to fit our group-wise setup.
Let $H$ be a null group hypothesis. Then
Consider applying the procedure (10) to new data with one group removed, and let $\tilde R_G$ be the number of group rejections made. Then, with this notation,
Using these, we can show our FDR control result. Consider the procedure (10) used in a sub-problem of nsSLOPE as a multiple testing of group-wise hypotheses, where we reject a null group hypothesis when the estimated group representative is nonzero, rejecting all individual hypotheses in the group, that is, all coefficients with indices in the group are rejected. Using the $\lambda$ sequence defined as in (5), the procedure controls the FDR at the level $q$: