1 Introduction
Estimation of the graphical structure of Gaussian Markov random fields (MRFs) has been the topic of active research in machine learning, data analysis and statistics. The reason is that they provide efficient means for representing complex statistical relations of many variables in the form of a simple undirected graph, revealing new insights about the interactions of genes, users, news articles or operational parts of a human driver, to name a few.
One main line of research is to estimate the structure by maximum likelihood estimation (MLE), penalizing the norm of the learning parameters. In this framework, structure learning of a Gaussian MRF is equivalent to finding a sparse inverse covariance matrix of a multivariate Gaussian distribution. To formally describe the connection, let us consider $n$ samples of jointly Gaussian random variables following $\mathcal{N}(0, \Sigma)$, where the mean is zero without loss of generality and $\Sigma$ is the covariance matrix. The estimation is essentially the MLE of the inverse covariance matrix $\Theta = \Sigma^{-1}$ under an $\ell_1$-norm penalty, which can be stated as a convex optimization problem Yuan and Lin (2007):

(1) $\hat\Theta = \arg\min_{\Theta \succ 0} \; \mathrm{tr}(S\Theta) - \log\det\Theta + \lambda \|\Theta\|_1$

Here, $S$ is the sample covariance matrix and $\lambda > 0$ is a tuning parameter that determines the elementwise sparsity of $\hat\Theta$.
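As a concrete reference point, the $\ell_1$-penalized formulation (1) can be solved with off-the-shelf software; the sketch below (our illustration, not part of the proposed method) uses scikit-learn's GraphicalLasso on synthetic data, with `alpha` playing the role of the tuning parameter $\lambda$.

```python
# Sketch: solving the l1-penalized MLE (1) with scikit-learn's GraphicalLasso.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Ground-truth sparse (tridiagonal) precision matrix for a toy example.
p = 5
theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
cov = np.linalg.inv(theta)
X = rng.multivariate_normal(np.zeros(p), cov, size=2000)

# alpha corresponds to lambda in (1); larger alpha gives a sparser estimate.
model = GraphicalLasso(alpha=0.05).fit(X)
theta_hat = model.precision_  # estimate of the inverse covariance matrix
print(np.round(theta_hat, 2))
```

The off-diagonal entries of `theta_hat` that are driven to zero encode the absent edges of the estimated Gaussian MRF.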
The regularized MLE approach (1) has been addressed quite extensively in the literature. The convex optimization formulation was first discussed in References d’Aspremont et al. (2008); Banerjee et al. (2008). A block-coordinate descent type algorithm was developed in Reference Friedman et al. (2008), revealing that the subproblems of (1) can be solved in the form of LASSO regression problems Tibshirani (1996). More efficient solvers have been developed to deal with high-dimensional cases Oztoprak et al. (2012); Rolfs et al. (2012); Hsieh et al. (2011, 2012, 2013); Mazumder and Hastie (2012); Treister and Turek (2014); Zhang et al. (2018). On the theoretical side, we are interested in two aspects: one is the estimation quality of $\hat\Theta$ and the other is the variable selection quality of $\hat\Theta$. These aspects are still under investigation in theory and experiments Meinshausen and Bühlmann (2006, 2010); Yuan and Lin (2007); Banerjee et al. (2008); Rothman et al. (2008); Lam and Fan (2009); Raskutti et al. (2009); Yuan (2010); Fattahi et al. (2018), as new analyses become available for the closely related $\ell_1$-penalized LASSO regression in vector spaces.
Among these, our method inherits the spirit of References Meinshausen and Bühlmann (2006); Yuan (2010) in particular, where the authors considered the problem (1) in terms of a collection of local regression problems, one defined for each random variable.
LASSO and SLOPE
Under a linear data model $y = X\beta + \epsilon$ with a data matrix $X \in \mathbb{R}^{n \times p}$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$, the $\ell_1$-penalized estimation of $\beta$ is known as the LASSO regression Tibshirani (1996), whose estimate is given by solving the following convex optimization problem:

$\hat\beta = \arg\min_{\beta} \; \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1,$

where $\lambda > 0$ is a tuning parameter which determines the sparsity of the estimate $\hat\beta$. Important statistical properties of $\hat\beta$ are (i) the distance between the estimate $\hat\beta$ and the population parameter vector $\beta$ and (ii) the detection of the nonzero locations of $\beta$ by $\hat\beta$. We can rephrase the former as the estimation error and the latter as the model selection error. Both types of error depend on how we choose the value of the tuning parameter $\lambda$. Regarding model selection, it is known that the LASSO regression can control the familywise error rate (FWER) at level $\alpha \in [0,1]$ by choosing $\lambda = \sigma\,\Phi^{-1}(1 - \alpha/(2p))$ Bogdan et al. (2015).
The FWER is essentially the probability that $\hat\beta$ includes as nonzero at least one entry which is zero in $\beta$. In high-dimensional cases where $p \gg n$, controlling the FWER is quite restrictive, since false positive detections of nonzero entries are unavoidable. As a result, FWER control can lead to weak detection power for nonzero entries.

The SLOPE is an alternative procedure for the estimation of $\beta$, using the sorted $\ell_1$ norm penalization instead. The SLOPE solves a modified convex optimization problem,

$\hat\beta = \arg\min_{\beta} \; \tfrac{1}{2}\|y - X\beta\|_2^2 + J_\lambda(\beta),$

where $J_\lambda(\cdot)$ is the sorted $\ell_1$ norm defined by

$J_\lambda(b) := \sum_{i=1}^{p} \lambda_i |b|_{(i)},$

where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$ and $|b|_{(i)}$ is the $i$th largest component of $b$ in magnitude. In Reference Bogdan et al. (2015), it has been shown that, for linear regression, the SLOPE procedure can control the false discovery rate (FDR) of model selection at level $q \in [0,1]$ by choosing $\lambda_i = \sigma\,\Phi^{-1}(1 - iq/(2p))$. The FDR is the expected ratio of false discoveries (i.e., the number of falsely detected nonzero entries) to total discoveries. Since controlling the FDR is less restrictive for model selection than FWER control, FDR control can lead to a significant increase in detection power, while it may slightly increase the total number of false discoveries Bogdan et al. (2015).
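For concreteness, the sorted $\ell_1$ norm is straightforward to evaluate; the snippet below is a minimal illustration of the definition (ours, not the SLOPE package's code).

```python
# Evaluate the sorted l1 norm J_lambda(b) = sum_i lambda_i * |b|_(i),
# where lambda is non-increasing and |b|_(i) is the i-th largest |entry|.
import numpy as np

def sorted_l1_norm(b, lam):
    """J_lambda(b); `lam` must be non-negative and non-increasing."""
    b_sorted = np.sort(np.abs(b))[::-1]  # |b|_(1) >= |b|_(2) >= ...
    return float(np.dot(lam, b_sorted))

b = np.array([0.5, -2.0, 1.0])
lam = np.array([3.0, 2.0, 1.0])
# the largest |b| entry gets the largest weight: 3*2.0 + 2*1.0 + 1*0.5
print(sorted_l1_norm(b, lam))  # -> 8.5
```

With all weights equal, $J_\lambda$ reduces to a scaled $\ell_1$ norm, so the LASSO is a special case of this penalty.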
This paper is motivated by the SLOPE method Bogdan et al. (2015) for its use of the sorted $\ell_1$ (SL1) regularization, which brings many benefits not available with the popular $\ell_1$-based regularization: the capability of false discovery rate (FDR) control Bogdan et al. (2015); Brzyski et al. (2015), adaptivity to unknown signal sparsity Su and Candès (2016) and clustering of coefficients Bondell and Reich (2008); Figueiredo and Nowak (2016). Also, efficient optimization methods Bogdan et al. (2015); Lee et al. (2016); Zhang et al. (2018) and further theoretical analyses Su and Candès (2016); Chen and Banerjee (2016); Bellec et al. (2017); Derumigny (2018) are under active research.
In this paper, we propose a new procedure to find a sparse inverse covariance matrix estimate, which we call nsSLOPE (neighborhood selection Sorted L-One Penalized Estimation). Our nsSLOPE procedure uses the sorted $\ell_1$ norm for penalized model selection, whereas the existing gLASSO (1) method uses the $\ell_1$ norm for the purpose. We investigate our method both in theory and in experiments, showing (i) how the estimation error can be bounded and (ii) how the model selection (more specifically, the neighborhood selection Meinshausen and Bühlmann (2006)) can be done with FDR control over the edge structure of the Gaussian Markov random field. We also provide a simple but efficient estimation algorithm which is well suited for parallel computation.
2 nsSLOPE (Neighborhood Selection Sorted L-One Penalized Estimation)
Our method is based on the idea that the estimation of the inverse covariance matrix $\Theta = \Sigma^{-1}$ of a multivariate normal distribution $\mathcal{N}(0, \Sigma)$ can be decomposed into multiple regressions on conditional distributions Meinshausen and Bühlmann (2006); Yuan (2010).

For a formal description of our method, let us consider a $p$-dimensional random vector $x \sim \mathcal{N}(0, \Sigma)$, denoting its $j$th component as $x_j$ and the subvector without the $j$th component as $x_{-j}$. For the inverse covariance matrix $\Theta$, we use $\Theta_j$ and $\Theta_{-j}$ to denote the $j$th column of the matrix and the rest of $\Theta$ without the $j$th column, respectively.

From the Bayes rule, we can decompose the full log-likelihood into the following parts:

$\log p(x; \Theta) = \log p(x_j \mid x_{-j}) + \log p(x_{-j}).$

This decomposition allows a blockwise optimization of the full log-likelihood, which iteratively optimizes the parameters of $p(x_j \mid x_{-j})$ while the parameters of $p(x_{-j})$ are fixed at their current values.
2.1 Sub-Problems
In the blockwise optimization approach we mentioned above, we need to deal with the conditional distribution $p(x_j \mid x_{-j})$ in each iteration. When $x \sim \mathcal{N}(0, \Sigma)$, the conditional distribution also follows a Gaussian distribution Anderson (2003), in particular:

$x_j \mid x_{-j} \sim \mathcal{N}\big(\sigma_{-j,j}^T \Sigma_{-j,-j}^{-1} x_{-j},\; \sigma_{jj} - \sigma_{-j,j}^T \Sigma_{-j,-j}^{-1} \sigma_{-j,j}\big).$

Here $\sigma_{-j,j}$ denotes the $j$th column of $\Sigma$ without the $j$th row element, $\Sigma_{-j,-j}$ denotes the submatrix of $\Sigma$ without the $j$th column and the $j$th row and $\sigma_{jj}$ is the $j$th diagonal component of $\Sigma$. Once we define $\beta^{(j)}$ as follows,

$\beta^{(j)} := \Sigma_{-j,-j}^{-1} \sigma_{-j,j},$

then the conditional distribution now looks like:

$x_j \mid x_{-j} \sim \mathcal{N}\big(x_{-j}^T \beta^{(j)},\; \sigma_{jj} - \sigma_{-j,j}^T \beta^{(j)}\big).$

This indicates that the minimization of the conditional log-likelihood can be understood as a local regression for the random variable $x_j$ onto $x_{-j}$, under the data model:

(2) $x_j = x_{-j}^T \beta^{(j)} + \nu_j, \quad \nu_j \sim \mathcal{N}\big(0,\; \sigma_{jj} - \sigma_{-j,j}^T \beta^{(j)}\big).$
To obtain the solution of the local regression problem (2), we consider a convex optimization based on the sorted $\ell_1$ norm $J_\lambda$ Bogdan et al. (2015); Figueiredo and Nowak (2016). In particular, for each local problem index $j = 1, \dots, p$, we solve

(3) $\hat\beta^{(j)} = \arg\min_{\beta \in \mathbb{R}^{p-1}} \; \tfrac{1}{2}\|X_j - X_{-j}\beta\|_2^2 + \hat\sigma_j\, J_\lambda(\beta), \qquad \hat\sigma_j := (\hat\theta_{jj})^{-1/2}.$

Here, the matrix $X \in \mathbb{R}^{n \times p}$ consists of $n$ i.i.d. $p$-dimensional samples in its rows, $X_j$ is the $j$th column of $X$ and $X_{-j}$ is the submatrix of $X$ without the $j$th column. Note that the subproblem (3) requires an estimate of $\theta_{jj}$, namely $\hat\theta_{jj}$, which will be computed depending on the information from the other subproblem indices (namely, $i \neq j$). Because of this, the problems (3) for $j = 1, \dots, p$ are not independent. This contrasts our method with the neighborhood selection algorithms Meinshausen and Bühlmann (2006); Yuan (2010) based on $\ell_1$ regularization.
2.2 Connection to Inverse Covariance Matrix Estimation
With $\hat\beta^{(j)}$ obtained from (3) as an estimate of $\beta^{(j)}$ for $j = 1, \dots, p$, now the question is how to use this information for the estimation of $\Theta = \Sigma^{-1}$ without an explicit inversion of the matrix. For the purpose, we first consider the blockwise formulation of $\Sigma$ and $\Theta$:

$\Sigma = \begin{pmatrix} \Sigma_{-j,-j} & \sigma_{-j,j} \\ \sigma_{-j,j}^T & \sigma_{jj} \end{pmatrix}, \qquad \Theta = \begin{pmatrix} \Theta_{-j,-j} & \theta_{-j,j} \\ \theta_{-j,j}^T & \theta_{jj} \end{pmatrix},$

putting the $j$th row and the $j$th column in the last positions whenever necessary. There we can see that $\Sigma\Theta = I$. Also, from the blockwise inversion of $\Sigma$, we have (unnecessary block matrices are replaced with $*$):

$\Theta = \begin{pmatrix} * & -\Sigma_{-j,-j}^{-1}\sigma_{-j,j}\,\theta_{jj} \\ * & \big(\sigma_{jj} - \sigma_{-j,j}^T \Sigma_{-j,-j}^{-1} \sigma_{-j,j}\big)^{-1} \end{pmatrix}.$

From these and the definition of $\beta^{(j)}$, we can establish the following relations:

(4) $\theta_{jj} = \big(\sigma_{jj} - \sigma_{-j,j}^T \beta^{(j)}\big)^{-1}, \qquad \theta_{-j,j} = -\theta_{jj}\, \beta^{(j)}.$

Note that $\theta_{jj} > 0$ for a positive definite $\Theta$ and that (4) implies that the sparsity patterns of $\theta_{-j,j}$ and $\beta^{(j)}$ must be the same. Also, the updates (4) for $j = 1, \dots, p$ are not independent, since the computation of $\beta^{(j)}$ depends on $\Sigma_{-j,-j}$. However, if we estimate $\beta^{(j)}$ based on the sample covariance matrix $S$ instead of $\hat\Sigma$, (i) our updates (4) no longer need to explicitly compute or store the $\hat\Sigma$ matrix, unlike the gLASSO and (ii) our subproblems become mutually independent and thus solvable in parallel.
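The relations in (4) can be sketched in code as follows. This is our illustration (the function and variable names are hypothetical): given per-column regression coefficients and conditional variance estimates, it assembles the precision matrix without inverting any matrix, and the toy example checks the result against a direct inverse of a 2×2 covariance.

```python
# Assemble a precision-matrix estimate from the relations (4):
#   theta_jj     = (sigma_jj - sigma_{-j,j}^T beta^(j))^{-1}
#   theta_{-j,j} = -theta_jj * beta^(j)
import numpy as np

def assemble_precision(betas, sigma2):
    """betas: list of length-(p-1) arrays; sigma2[j] estimates
    sigma_jj - sigma_{-j,j}^T beta^(j) (the conditional variance)."""
    p = len(sigma2)
    theta = np.zeros((p, p))
    for j in range(p):
        theta_jj = 1.0 / sigma2[j]
        theta[j, j] = theta_jj
        idx = [k for k in range(p) if k != j]
        theta[idx, j] = -theta_jj * betas[j]
    return theta

# Exact check on a 2x2 example: Sigma = [[2, 1], [1, 2]].
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
betas = [np.array([Sigma[1, 0] / Sigma[1, 1]]),   # beta^(0) = 1/2
         np.array([Sigma[0, 1] / Sigma[0, 0]])]   # beta^(1) = 1/2
sigma2 = np.array([Sigma[0, 0] - Sigma[0, 1] * betas[0][0],   # 1.5
                   Sigma[1, 1] - Sigma[1, 0] * betas[1][0]])  # 1.5
print(assemble_precision(betas, sigma2))  # approx [[2/3, -1/3], [-1/3, 2/3]]
```

With exact inputs the assembled matrix matches the true inverse; with estimated inputs it inherits the estimation error of the subproblems.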
3 Algorithm
Our algorithm, called nsSLOPE, is summarized in Algorithm 1; it is essentially a block-coordinate descent algorithm Beck and Tetruashvili (2013); Treister and Turek (2014). Our algorithm may look similar to that of Yuan (2010), but there are several important differences: (i) each subproblem in our algorithm solves a SLOPE formulation (SL1 regularization), while Yuan’s subproblem is either a LASSO or a Dantzig selector ($\ell_1$ regularization); (ii) our subproblem makes use of the estimate $\hat\theta_{jj}$ in addition to $\hat\beta^{(j)}$.
3.1 Sub-Problem Solver
To solve our subproblems (3), we use the SLOPE R package, which implements the proximal gradient descent algorithm of Reference Beck and Teboulle (2009) with acceleration based on Nesterov’s original idea Nesterov (1983). The algorithm requires computing the proximal operator involving $J_\lambda$, namely

$\mathrm{prox}_{J_\lambda}(z) := \arg\min_{x} \; \tfrac{1}{2}\|x - z\|_2^2 + J_\lambda(x).$

This can be computed in $O(p \log p)$ time using an algorithm from Reference Bogdan et al. (2015). The optimality of each subproblem is declared by the primal-dual gap, and we use a tight threshold value.
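A minimal sketch of this proximal operator is given below, assuming the standard reduction to a non-increasing isotonic (pool-adjacent-violators) projection after sorting by magnitude; this is our illustration, not the SLOPE package's implementation.

```python
# prox of the sorted l1 norm: argmin_x 0.5*||x - z||^2 + sum_i lam_i |x|_(i),
# via: sort |z| descending, subtract lam, project onto non-increasing
# sequences (PAVA), clip at zero, then restore signs and the original order.
import numpy as np

def _pava_nonincreasing(d):
    """Project d onto the cone of non-increasing sequences."""
    blocks = []  # each block: [mean, count]
    for v in d:
        blocks.append([float(v), 1])
        # merge blocks while the non-increasing constraint is violated
        while len(blocks) > 1 and blocks[-2][0] <= blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks[-1]
            blocks[-1] = [(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]
    return np.concatenate([np.full(c, m) for m, c in blocks])

def prox_sorted_l1(z, lam):
    """lam must be non-negative and non-increasing."""
    sign = np.sign(z)
    a = np.abs(z)
    order = np.argsort(-a, kind="stable")  # indices sorting |z| descending
    x = np.maximum(_pava_nonincreasing(a[order] - lam), 0.0)
    out = np.empty_like(z)
    out[order] = x                         # undo the sort
    return sign * out
```

When all $\lambda_i$ are equal, this reduces to elementwise soft-thresholding; unequal weights can additionally tie (cluster) coefficients together.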
3.2 Choice of $\lambda$
For the sequence of $\lambda$ values in the subproblems, we use the so-called Benjamini–Hochberg (BH) sequence Bogdan et al. (2015):

(5) $\lambda_i^{\mathrm{BH}} = \Phi^{-1}\big(1 - iq/(2p)\big), \quad i = 1, \dots, p.$

Here $q \in [0,1]$ is the target level of FDR control we discuss later and $\Phi^{-1}(\alpha)$ is the $\alpha$th quantile of the standard normal distribution. In fact, when the design matrix $X_{-j}$ in the SLOPE subproblem (3) is not orthogonal, it is beneficial to use an adjusted version of this sequence, generated recursively from the BH sequence. If the adjusted sequence is nonincreasing but begins to increase after some index, we set all the remaining values equal to the value at that index, so that the resulting sequence is nonincreasing. For more discussion about the adjustment, we refer to Section 3.2.2 of Reference Bogdan et al. (2015).
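The unadjusted BH sequence (5) is simple to generate; the snippet below is our illustration using SciPy's inverse normal CDF and does not include the adjustment discussed above.

```python
# Benjamini-Hochberg lambda sequence of (5): lambda_i = Phi^{-1}(1 - i*q/(2p)).
import numpy as np
from scipy.stats import norm

def bh_sequence(p, q):
    """Non-increasing BH sequence for target FDR level q and dimension p."""
    i = np.arange(1, p + 1)
    return norm.ppf(1.0 - i * q / (2.0 * p))

lam = bh_sequence(p=10, q=0.1)
assert np.all(np.diff(lam) <= 0)  # non-increasing by construction
print(np.round(lam, 3))
```

Larger coefficients (in sorted order) are matched with larger weights, which is what yields the FDR-controlling behavior of SLOPE under orthogonal designs.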
3.3 Estimation of $\theta_{jj}$
To solve the $j$th subproblem in our algorithm, we need to estimate the value of $\theta_{jj}$. This can be done using (4), that is, $\hat\theta_{jj} = (\hat\sigma_{jj} - \hat\sigma_{-j,j}^T \hat\beta^{(j)})^{-1}$. However, this implies that (i) we need to additionally keep an estimate $\hat\Sigma$ of the covariance matrix and (ii) the computation of the $j$th subproblem will depend on all the other indices, as it needs to access $\hat\beta^{(i)}$ for $i \neq j$, requiring the algorithm to run sequentially.

To avoid these overheads, we compute the estimate using the sample covariance matrix $S$ instead (we assume that the columns of $X$ are centered):

$\hat\theta_{jj}^{-1} = \tfrac{1}{n}\big\|X_j - X_{-j}\hat\beta^{(j)}\big\|_2^2 = s_{jj} - 2\, s_{-j,j}^T \hat\beta^{(j)} + (\hat\beta^{(j)})^T S_{-j,-j}\, \hat\beta^{(j)}.$

This allows us to compute the inner loop of Algorithm 1 in parallel.
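As an illustration (ours; the function name is hypothetical), the per-column estimate of $\theta_{jj}$ from the residual sum of squares of the $j$th local regression can be computed as:

```python
# Estimate theta_jj via 1/theta_jj ~ RSS/n for the j-th local regression.
import numpy as np

def estimate_theta_jj(X, j, beta_j):
    """X: (n, p) centered data matrix; beta_j: coefficients for columns != j."""
    n, p = X.shape
    idx = [k for k in range(p) if k != j]
    resid = X[:, j] - X[:, idx] @ beta_j
    return n / float(resid @ resid)  # theta_jj = (RSS/n)^{-1}

# Tiny centered example: with beta_j = 0, RSS is just ||X_j||^2 = 10, n = 4.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, -1.0]])
print(estimate_theta_jj(X, 0, np.array([0.0])))  # -> 0.4
```

Since each call touches only column $j$ and its residual, the $p$ estimates can be computed independently, which is what enables the parallel inner loop.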
3.4 Stopping Criterion of the Outer Loop

To terminate the outer loop in Algorithm 1, we check if the diagonal entries of $\hat\Theta$ have converged; that is, the algorithm is stopped when the norm difference between two consecutive iterates is below a threshold value. The value is slightly loose, but we have found no practical difference by making it tighter. Note that it suffices to check the optimality of the diagonal entries of $\hat\Theta$, since the optimality of the $\hat\beta^{(j)}$'s is enforced by the subproblem solver and the relations (4).
3.5 Uniqueness of SubProblem Solutions
When , our subproblems may have multiple solutions, which may prevent the global convergence of our algorithm. We may adopt the technique in Reference Razaviyayn et al. (2013) to inject a strongly convex proximity term into each subproblem objective, so that each subproblem will have a unique solution. In our experience, however, we encountered no convergence issues using stopping threshold values in the range of for the outer loop.
4 Analysis
In this section, we provide two theoretical results on our nsSLOPE procedure: (i) an estimation error bound regarding the distance between our estimate and the true model parameter and (ii) group-wise FDR control on discovering the true edges of the Gaussian MRF corresponding to $\Theta$.

We first discuss the estimation error bound, for which we divide our analysis into two parts regarding (i) the off-diagonal entries and (ii) the diagonal entries of $\hat\Theta$.
4.1 Estimation Error Analysis
4.1.1 Off-Diagonal Entries
From (4), we see that $\theta_{-j,j} = -\theta_{jj}\beta^{(j)}$; in other words, when $\theta_{jj}$ is fixed, the off-diagonal entries are determined solely by $\beta^{(j)}$, and therefore we can focus on the estimation error of $\hat\beta^{(j)}$.
To discuss the estimation error of $\hat\beta^{(j)}$, it is convenient to consider a constrained reformulation of the subproblem (3):

(6) $\hat\beta = \arg\min_{\beta} \; J_\lambda(\beta) \quad \text{subject to} \quad \|y - X\beta\|_2 \le \delta.$

Hereafter, for the sake of simplicity, we use the notations $y := X_j$ and $X := X_{-j}$, also dropping the subproblem index $j$ in $\beta^{(j)}$ and $\hat\beta^{(j)}$. In this view, the data model being considered in each subproblem is as follows,

$y = X\beta^* + \epsilon.$
For the analysis, we make the following assumptions:

The true signal $\beta^*$ satisfies $J_\lambda(\beta^*) \le R$ for some $R > 0$ (this condition is satisfied, for example, if $\beta^*$ is sparse, that is, it has at most $k$ nonzero elements).

The noise $\epsilon$ satisfies the condition $\|\epsilon\|_2 \le \delta$. This will allow us to say that the true signal is feasible with respect to the constraint in (6).
We provide the following result, which shows that $\hat\beta$ approaches the true $\beta^*$ with high probability.

Suppose that $\hat\beta$ is an estimate of $\beta^*$ obtained by solving the subproblem (6). Consider the factorization $X = G\Sigma^{1/2}$, where $G \in \mathbb{R}^{n \times (p-1)}$ is a Gaussian random matrix whose entries are sampled i.i.d. from $\mathcal{N}(0,1)$ and $\Sigma^{1/2}$ is a deterministic matrix such that $(\Sigma^{1/2})^T\Sigma^{1/2} = \Sigma_{-j,-j}$. Such a decomposition is possible since the rows of $X$ are independent samples from $\mathcal{N}(0, \Sigma_{-j,-j})$. Then, with high probability, $\|\hat\beta - \beta^*\|_2$ is bounded above by a quantity whose leading term involves the Gaussian width of a set containing the signals of interest.
We need to discuss a few results before proving Theorem 4.1.1.
Let be a bounded subset of . For an , consider the set
Then
holds with probability at least , where is a standard Gaussian random vector and .
Proof.
The result follows from an extended general inequality in expectation Figueiredo and Nowak (2014). ∎
The next result shows that the first term of the upper bound in Theorem 4.1.1 can be bounded without using an expectation. The quantity is called the Gaussian width of $K$ and is bounded as follows,
Proof.
This result is a part of the proof for Theorem 3.1 in Reference Figueiredo and Nowak (2014). ∎
Using Theorem 4.1.1 and Lemma 4.1.1, we can derive a high-probability error bound on the estimation from noisy observations, where the true signal belongs to a bounded subset $K$. The following corollaries are straightforward extensions of Theorems 3.3 and 3.4 of Reference Figueiredo and Nowak (2014), given our Theorem 4.1.1 (so we skip the proofs).
Choose to be any vector satisfying that
Then,
with probability at least .
Now we show the error bound for the estimates we obtain by solving the optimization problem (6). For this purpose, we make use of the Minkowski functional of the set $K$,

$\|x\|_K := \inf\{t > 0 : x \in tK\}.$

If $K$ is a compact and origin-symmetric convex set with nonempty interior, then $\|\cdot\|_K$ defines a norm in $\mathbb{R}^{p-1}$. Note that $x \in K$ if and only if $\|x\|_K \le 1$.
Then
with probability at least .
Finally, we show that solving the constrained form of the subproblems (6) also satisfies essentially the same error bound as in Corollary 4.1.1.

Proof of Theorem 4.1.1.

Since we assumed that $J_\lambda(\beta^*) \le R$, we construct the subset $K$ so that all vectors $\beta$ with $J_\lambda(\beta) \le R$ will be contained in $K$. That is,

$K := \{\beta : J_\lambda(\beta) \le R\}.$

This is a sphere defined in the SL1 norm $J_\lambda$: in this case, the Minkowski functional $\|\cdot\|_K$ is proportional to $J_\lambda$ and thus the same solution minimizes both.
Recall that and we choose . Since , for we have . This implies that .
∎
4.1.2 Diagonal Entries
We estimate $\theta_{jj}^{-1}$ based on the residual sum of squares (RSS), as suggested by Yuan (2010),

$\hat\theta_{jj}^{-1} = \tfrac{1}{n}\big\|X_j - X_{-j}\hat\beta^{(j)}\big\|_2^2.$

Unlike Reference Yuan (2010), we directly analyze the estimation error of $\hat\theta_{jj}^{-1}$ based on a chi-square tail bound. For all $\varepsilon$ small enough, we have
where for ,
Proof.
Using the same notation as in the previous section, that is, $y := X_j$ and $X := X_{-j}$, consider the estimate in discussion,

where the last equality follows from the data model above. Therefore,
where the last term is the sum of squares of independent Gaussian random variables and therefore follows a (scaled) chi-square distribution. Applying the tail bound Johnstone (2001) for a chi-square random variable with $n$ degrees of freedom, we get, for all small enough $\varepsilon$,
∎
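As a quick numerical illustration (ours, not from the paper) of the chi-square concentration used in the proof: the mean square of $n$ independent Gaussian noise terms, which is the RSS/$n$ form of the estimator, concentrates around the true variance as $n$ grows.

```python
# Monte Carlo check: (1/n) * sum of eps_i^2 with eps_i ~ N(0, sigma2)
# follows (sigma2/n) * chi^2_n and concentrates around sigma2 as n grows.
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 2.0
devs = {}
for n in (100, 10000):
    eps = rng.normal(0.0, np.sqrt(sigma2), size=(500, n))
    est = (eps ** 2).mean(axis=1)                  # RSS/n, 500 replications
    devs[n] = float(np.abs(est - sigma2).mean())   # mean absolute deviation
    print(n, devs[n])
```

The deviation shrinks roughly at the $1/\sqrt{n}$ rate predicted by the tail bound.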
4.1.3 Discussion on Asymptotic Behaviors
Our two main results above, Theorems 4.1.1 and 4.1.2, indicate how well our estimates of the off-diagonal entries and the diagonal entries behave. Based on these results, we can discuss the estimation error of the full matrix $\hat\Theta$ compared to the true precision matrix $\Theta$.

From Theorem 4.1.1, we can deduce a high-probability bound on $\|\hat\beta^{(j)} - \beta^{(j)}\|_2$ in which the smallest eigenvalue of the symmetric positive definite $\Sigma_{-j,-j}$ appears; using the interlacing property of eigenvalues, this eigenvalue can be controlled by a quantity involving the spectral radius of $\Sigma$. Here, we can choose the parameters in such a way that, as $n$ increases, the bound in Theorem 4.1.1 holds with probability approaching one. That is, in rough asymptotics,
(7) 
Under the conditions above, Theorem 4.1.2 indicates that:

using our assumption on the noise. We can further introduce assumptions on $\Sigma$ and $\Theta$ as in Reference Yuan (2010), so that we can quantify the upper bound, but here we will simply say that it is bounded by a constant depending on the properties of the full matrices $\Sigma$ and $\Theta$. If this is the case, then from Theorem 4.1.2, for an $n$ large enough, we see that
(8) 
with probability approaching one.
4.2 Neighborhood FDR Control under Group Assumptions
Here we consider $\hat\beta^{(j)}$ obtained by solving the unconstrained form (3) with the same data model we discussed above for the subproblems:

(9) $y = X\beta^* + \epsilon,$

with $y := X_j$, $X := X_{-j}$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$. Here we focus on a particular but interesting case where the columns of $X$ form orthogonal groups, that is, $X^TX$ forms a block-diagonal matrix. We also assume that the columns of $X$ belonging to the same group are highly correlated, in the sense that for any columns $X_a$ and $X_b$ of $X$ corresponding to the same group, their correlation is high enough to satisfy that

This implies, by Theorem 2.1 of Reference Figueiredo and Nowak (2016), that the corresponding coefficients of the solution are clustered to a common value, which simplifies our analysis. Note that if $X_a$ and $X_b$ belong to different blocks, then our assumption above implies $X_a^T X_b = 0$. Finally, we further assume that the true coefficients are shared within each group.
Consider a collection of nonoverlapping index subsets as the set of groups. Under the block-diagonal covariance matrix assumption above, we see that the model can be written in terms of one representative coefficient per group,

where the representative denotes the common value of the coefficients within a group. This tells us that we can replace the design matrix by a reduced matrix containing one column per group and consider the vector of group-representative coefficients.

The regularizer can be rewritten similarly,

where the new weight attached to the $i$th largest group-representative coefficient in magnitude aggregates the original $\lambda$ values according to the size of the group which has the $i$th largest coefficient in magnitude.
Using this reduction, we can recast the regression model (9) in terms of the group representatives and consider a much simpler form of the problem (3),

(10)

where we can easily check that the new weight sequence is nonincreasing. This is exactly the form of the SLOPE problem with an orthogonal design matrix in Reference Bogdan et al. (2015), except that the new sequence is not exactly the Benjamini–Hochberg sequence (5).

We can consider the problem (3) (and, respectively, (10)) in the context of multiple hypothesis testing, where each null hypothesis states that an individual coefficient (respectively, a group-representative coefficient) is zero, and we reject a null hypothesis if and only if the corresponding estimated coefficient is nonzero. In this setting, Lemmas B.1 and B.2 in Reference Bogdan et al. (2015) still hold for our problem (10), since they are independent of the choice of the particular $\lambda$ sequence.

In the following, we count the number of individual false rejections, the number of total individual rejections and the number of total group rejections. The following lemmas are slightly modified versions of Lemmas B.1 and B.2 from Reference Bogdan et al. (2015), respectively, to fit our group-wise setup.
Let be a null hypothesis and let . Then
Consider applying the procedure (10) for new data with and let be the number of group rejections made. Then with ,
Using these, we can show our FDR control result. Consider the procedure (10), as used in a subproblem of nsSLOPE, as a multiple testing of group-wise hypotheses, where we reject a null group hypothesis when the corresponding group-representative estimate is nonzero, thereby rejecting all individual hypotheses in the group. Using $\lambda$ defined as in (5), the procedure controls the FDR at the level $q$: