Many applications in statistics and machine learning require the estimation of multiple, possibly related, precision matrices. For example, to perform classification using quadratic discriminant analysis (QDA), a practitioner must estimate two or more precision matrices, e.g., see Chapter 4 ofFriedman et al. (2001). Similarly, it is often of scientific interest to estimate multiple graphical models when the same variables are measured on subjects from multiple classes, e.g., Guo et al. (2011).
In this work, the data are assumed to be a realization of independent copies of the random pair such that has support and
where and are unknown, and denotes the set of symmetric, positive definite matrices. Let be the sample size for the th class, let be the observed sample mean for the th class, and let
be the sample covariance matrix for the th class. Define
. After profiling over the means and class probabilities, negative two times the log-likelihood is
A natural estimator of , when it exists, is the maximum likelihood estimator . In settings where the maximum likelihood estimator does not exist, e.g., when , a practitioner could instead estimate the ’s separately using penalized normal maximum likelihood (Pourahmadi, 2011; Fan et al., 2016). Sparsity inducing penalties are especially popular in penalized normal maximum likelihood (Yuan and Lin, 2007; d’Aspermont et al., 2008; Friedman et al., 2007; Rothman et al., 2008; Witten et al., 2011) because a zero in the th entry of implies the conditional independence of the th and th variables in the th class. However when the precision matrices are similar across classes, e.g., when the ’s share sparsity patterns, jointly estimating the ’s can be more efficient than methods that estimate each precision matrix separately.
Many methods exist for estimating multiple precision matrices under the assumption of shared sparsity patterns across classes. For example, Guo et al. (2011) proposed a hierarchical penalty which encourages zeroes in the same entries of estimates of the ’s. Similarly, Danaher et al. (2014) proposed the fused graphical lasso estimator (FGL)
where . The first FGL penalty, controlled by positive tuning parameter , promotes elementwise sparsity separately within classes. The second penalty, controlled by , promotes elementwise equality jointly across classes. For sufficiently large values of the tuning parameter , the FGL estimates of the ’s will have exactly equivalent sparsity patterns. Price et al. (2015) proposed a computationally efficient alternative to FGL, called ridge fusion (RF), which used squared Frobenius norm penalties in place of the norm penalties in FGL. Price et al. (2015) also investigated FGL and RF as methods for fitting the QDA model. These approaches for fitting the QDA model are related to Friedman (1989), who proposed the regularized discriminant analysis (RDA) estimator of the precision matrices. The RDA approach estimates multiple precision matrices for QDA using a linear combination of the sample covariance matrices for each class and the pooled sample covariance matrix across all the classes. Bilgrau et al. (2015) generalized the work of Price et al. (2015) to estimate multiple precision matrices sharing a common target matrix.
Joint estimation procedures such as those proposed by Danaher et al. (2014) and Price et al. (2015) can perform well when all precision matrices are similar. However, there are settings when FGL and RF may perform unnecessary or inappropriate shrinkage. Notice, the second part of the ridge fusion penalty proposed by Price et al. (2015) can be rewritten as
and is the squared Frobenius norm. This formulation suggests that FGL and RF can be viewed as shrinking all precision matrices towards a common precision matrix. Regularization of this type may be problematic if there are substantial differences in the population precision matrices across classes. For instance, consider the case that there are two groupings (i.e., clusters) of the classes denoted and , where is empty and . Suppose the entry of , for , but for for all . This type of scenario may occur, for example, when the variables are the expression of genes belonging to some pathway and the classes represent certain disease subtypes. Two subtypes may have similar gene-gene dependence, which are distinct from the another subtype (e.g., controls). In these settings, FGL may perform poorly since sparsity patterns are only shared within a subset of classes. If such clusters were known a priori, it may be preferable to apply FGL or RF to the groups separately, but when groupings are unknown, they must be estimated from the data.
Methods such as those proposed by Zhu et al. (2014) and Saegusa and Shojaie (2016) address this issue of FGL and RF. The structural pursuit method proposed by Zhu et al. (2014) allows for heterogenity in precision matrices using the truncated lasso penalty of Shen et al. (2012) to promote elementwise equality and shared sparsity patterns across pre-defined groups of precision matrices. The method proposed by Saegusa and Shojaie (2016)
, known as LASICH, allows for heterogeneity of precision matrices through the use of a graph Laplacian penalty which incorporates prior information about how different classes’ sparsity patterns are related. Since such prior information is often not available in practice, the authors propose the HC-LASICH method: a two-step procedure which first uses hierarchical clustering to estimate relationship between precision matrices, then uses this estimate to apply the LASICH procedure. In somewhat related work,Ma and Michailidis (2016) proposed the joint structural estimation method to use prior information on shared sparsity patterns in a two step procedure that first estimates the shared sparsity pattern and then estimates the precision matrices based on the shared sparsity constraints. More recently, Jalali et al. (2019) extended the work of Ma and Michailidis (2016) to the case where prior information on edge relationships not need be known. This is done using a Bayesian approach that incorporates a multivariate Gaussian mixture distribution on all possible sparsity patterns.
In this article, we propose a penalized likelihood framework for simultaneously estimating the precision matrices and how the precision matrices relate to one another. Like FGL and RF, our method can exploit the similarity of precision matrices belonging to a group, but avoids the unnecessary shrinkage of FGL or RF when groups differ. Unlike LASICH, the proposed method does not require any prior information about the relationships between the classes, nor does it require clustering to take place before estimation of the precision matrices. We study the use of our estimator for application in quadratic discriminant analysis and Gaussian graphical modeling in settings where there are groupings of classes which share common dependence structures. Computing our estimator is nontrivial since the penalized objective function we minimize is discontinuous. To overcome this challenge, we propose an iterative algorithm, in which we alternate between updating groupings and updating precision matrix estimates. As part of our algorithm for the sparse estimator we propose (see Section 2), we must solve an elastic net penalized precision matrix estimation problem. To do so, we propose a graphical elastic net iterative shrinkage thresholding algorithm (GEN-ISTA). We prove this GEN-ISTA has a linear convergence rate and characterize the set to which the solution belongs.
2 Joint estimation with cluster fusion penalties
Define to be an unknown element partition of the set . For convenience, we will refer to as the th cluster. Let , , and the positive integer be user defined tuning parameters.
For any set define as the cardinality of . The first estimator we will investigate is the cluster ridge fusion estimator (CRF), which is defined as
We refer to the penalty associated with as the cluster fusion penalty, which promotes similarities in precision matrices that are in the same cluster. The ridge fusion method (RF) proposed by Price et al. (2015) can be viewed as a special case of (2) when .
We also propose a sparsity inducing version of the estimator, the precision cluster elastic net (PCEN), which is defined as
When , is equivalent to estimating the precision matrices separately with penalized normal maximum likelihood using the same tuning parameter for each matrix. In our proposed estimators, the cluster fusion penalty is used to promote similarity in precision matrices that are in the same cluster, while estimating precision matrices in different clusters separately from one another. When estimating Gaussian graphical models, PCEN promotes elementwise similarity between precision matrices in the same cluster, in turn promoting similar sparsity patterns within the same cluster. This differs from other methods, e.g., FGL proposed by Danaher et al. (2014), which penalize the absolute value of entrywise differences across all precision matrices.
Unlike the FGL fusion penalty, the squared Frobenius norm fusion penalty will not lead to exact entrywise equality between estimated precision matrices – even those belonging to the same cluster. However, the Frobenius norm penalty facilitates fast computation and more importantly, an efficient search for clusters using existing algorithms for -means clustering.
Cluster fusion regularization was first proposed in the context in univariate response linear regression byWitten et al. (2014) to detect and promote similarity in effect sizes. More recently Price and Sherwood (2018) used cluster fusion regularization in multivariate response linear regression to detect and promote similarity in fitted values.
The optimization in (4) can be identified as separate ridge fusion estimation problems (Price et al., 2015). The optimization in (5) is also separable over the clusters, and in Section 3.3 we propose a block coordinate descent algorithm to solve (5).
In the Supplementary Material, we describe a validation likelihood based approach to select the tuning parameters for use in CRF and PCEN; and further discuss a more computationally efficient information criterion which can be used for tuning parameter selection for PCEN.
The objective functions in (2) and (3) are not continuous and non-convex with respect to because changing cluster membership results in discrete changes in the objective function. However, in (4) and (5), are fixed so that the objective functions are strictly convex and convex, respectively, with respect to . To compute both CRF and PCEN, we propose an algorithm that iterates between solving for with fixed, and then solving for with fixed. This procedure is similar to those of Witten et al. (2014) and Price and Sherwood (2018). We describe both algorithms in the following subsections.
3.2 Cluster ridge fusion algorithm
Assume , , and are fixed where denotes the set of positive integers. We propose the following iterative algorithm to solve (2):
Initialize as a set of diagonal matrices where the th diagonal element of is .
For the th step where repeat the steps below until the estimates for are equivalent to .
Holding fixed, obtain the th iterate of with
Holding fixed at the th iterate, obtain the th iterate of the precision matrices with
This is identical to the optimization in (4) and can be solved with parallel RF estimation problems, with the th objective function taking the form
3.3 Precision cluster elastic net algorithm
For the PCEN estimator, we propose to use the same iterative procedure as in Section 3.2. The algorithm iterates between a -means clustering algorithm to obtain the estimated clusters and a blockwise coordinate descent algorithm which uses the graphical elastic net iterative thresholding algorithm (GEN-ISTA) to obtain new iterates of the precision matrices at each iteration.
Again, let , and be fixed. Formally, the iterative algorithm is as follows:
Initialize as a set of diagonal matrices where the th diagonal element of is .
For the th step where repeat the steps below until the estimates for are equivalent to .
Holding fixed, obtain the th iterate for with
Holding fixed at the th iterate, obtain the th iterate of the precision matrix estimates with
Just as in the CRF Algorithm, to protect against selecting a local optima in the -means clustering update in 2(a), our implementation uses 100 random starts, and selects the clustering which gives the lowest objective function value (Hartigan and Wong, 1979; Krishna and Murty, 1999).
Since the ’s are fixed, we propose to solve (10) using blockwise coordinate descent where each is treated as a block. That is, for each , we update one for with all other for held fixed. The objective function for the blockwise update, treating all other , , as fixed is
so that the argument minimizing (11) can be expressed
which can be recognized as the elastic net penalized normal likelihood precision matrix estimator. To compute (12), we propose the GEN-ISTA, an elastic net variation of the algorithm proposed by Rolfs et al. (2012), called G-ISTA, which was used to solve problems like (12) with . Iterative shrinkage thresholding algorithms (ISTA), are a special case of the proximal gradient method, which are commonly used to solve penalized likelihood optimization problems. We refer the reader to Beck and Teboulle (2009) and Polson et al. (2015) for more on iterative shrinkage thresholding algorithms and proximal algorithms, respectively.
Our approach uses a first order Taylor expansion to derive a majorizing function of the objective in (12). Let
and let be the previous iterate of . Because is Lipschitz over compact sets of , we have that
for sufficiently small step size , with equality when . Thus, we can majorize with the right hand side of (13): using this inequality and the fact that , we have
Letting denote the right hand side of (14), it follows that for sufficiently small ,
so that at , the right hand side of (15) is a majorizer of (12). Thus, to solve (12), we use an iterative procedure: given the previous iterate , we construct , then we minimize to obtain the new iterate. This choice of majorizer is convenient since the new optimization problem simplifies to the proximal operator for the norm since
where and denotes the set of symmetric matrices. In the following subsection, we will show that there always exists a step size such that the solution to the proximal operator above is positive definite, and hence, the iterates remain feasible for (12).
To summarize, we propose the GEN-ISTA, which updates from iterate to iterate with
where for a matrix and , is the elementwise soft thresholding operator such that . To select for use in (16), we use a backtracking line search. For the step to be accepted we check that the the condition in (14) is met and that . If both conditions are not met, a smaller step size must be used. In Section 3.4 we show that for a pre-specified , which is a function of and , that this update will always be contained in .
Formally we propose the GEN-ISTA algorithm with backtracking to solve (12):
Initialize, , , , , and .
If , then update and return to Step 2 (b). Else, continue to Step 2 (e)
then update and return to Step 2 (b). Else, continue to Step 2 (f).
Update , and return to Step 2 (a).
As previously mentioned, the G-ISTA algorithm proposed by Rolfs et al. (2012) is a special case of the GEN-ISTA algorithm, when , but there are substantial differences. In particular, Rolfs et al. (2012) only consider the case where is a symmetric, non-negative definite matrix, but in our application there is no guarantee that is non-negative definite. In Section 3.4 we demonstrate the role of in the rate of convergence and the choice of appropriate step size, . The elastic net penalized normal likelihood precision matrix estimation problem was also studied by Atchadé et al. (2019)
, who proposed a stochastic gradient descent algorithm for solving (12) with very large and being a sample covariance matrix.
3.4 Convergence Analysis of GEN-ISTA Algorithm
In this section we will discuss the convergence of the GEN-ISTA subroutine proposed in the previous section. Our approach to convergence analysis is based on that of Rolfs et al. (2012), but in our application, we must address that the input matrix may be indefinite. We show that despite the generality of the input matrix, our proximal gradient descent scheme is guaranteed to converge at a linear rate and that the maximum step size is a function of and , both of which are known in our blockwise coordinate descent scheme. Specifically, we show that there exists a worst case contraction constant, , such that
In our case is a function of and . We will show that as increases, approaches 0. Throughout this section, for a matrix , let
denote the ordered eigenvalues of.
Let be the solution to (12). We first will show that is contained in a compact subset of .
Let , and to be the solution to (12) then , where
Our bounds are distinct from those in Rolfs et al. (2012) as theirs do not allow for which is indefinite. Notably, the we obtain is the same as that in Atchadé et al. (2019), although the we obtain is distinct, again owing to the indefiniteness of . Next, we establish that the Lipschitz continuity of the gradient of (12), which we used to construct the majorizing function (13).
Assume such that then
Hence, is Lipschitz on any compact subset of with constant .
The proof of Lemma 2 can be found in the Appendix. The combination of Lemma 1 and Lemma 2 give us necessary and sufficient conditions to apply Theorem 3.1 of Beck and Teboulle (2009) to (12) to obtain a sublinear convergence rate between iterates of the objective function.
Next, we present a lemma that ensures that there always exists a step size parameter such that the iterates of the algorithm are contained in a compact subset of .
Let and define and to be defined as presented in Lemma 1. Assume then the iterates of the proposed algorithm satisfy for all where
Finally, we present a result on the linear convergence rate for our algorithm given the iterates are contained on on a compact subset of .
Let and be defined the same as in Lemma 1. Then for constants and the iterates of our algorithm converge linearly with a rate of
The proof for Theorem 1 can be found in the Appendix. Theorem 1 establishes the linear convergence of our proposed ISTA algorithm. Furthermore, these results show how influences the convergence of the algorithm, and the optimal solution bounds. In particular, for a fixed , as gets larger, the rate approaches 0. From a computational perspective, these results suggest that we could fix the step size parameter and avoid the backtracking line search when is large because and can be calculated directly at each iteration.
4 Gaussian graphical modeling simulation studies
In our first set of simulations, we focus on both estimation accuracy and sparsity detection in Gaussian graphical modeling using PCEN. We generated data from classes, where the -th class is generated from and . By construction, the sparsity patterns of and will be nearly equivalent; as will the sparisty patterns of and . However, the sparsity patterns of and will be distinct from the sparsity patterns of and .
We compare two version of PCEN, PCEN-2 and PCEN-3 (i.e., (3) with and , respectively) to the fused graphical lasso (FGL, Danaher et al. (2014)), graphical lasso with the same tuning parameter for all classes (Glasso), and two versions of the method proposed by Saegusa and Shojaie (2016) which we call LASICH-OR and LASICH-PR (denoting “oracle” and “practical”, respectively). The method proposed by Saegusa and Shojaie (2016) requires the network information between the classes to be known before fitting the precision matrices (i.e., “oracle” information), though it may be estimated using hierarchical clustering. In this simulation, a network where the edges are is used. The difference between LASICH-OR and LASICH-PR is that LASICH-OR applies weights of to the edges in the set while LASICH-PR weights all edges equally. In this way, LASICH-OR effectively uses the fact that sparsity patterns are shared between classes and , which would be unknown in practice, while still allowing for overlap between the two sparsity patterns. Thus, this can be considered a “best-case” version of the HC-LASICH method. Tuning parameters for each of the methods are investigated based a subset of unless otherwise specified.
To evaluate performance of each estimator, we use the sum of true positive (STP) across all classes, which we define as where is an estimate of . In addition, we also report the sum of the Frobenius norm squared error which is defined as
In each replication, the training data consists of independent draws from each of the class distributions, i.e., total realizations. We investigate three different settings each based on Erdos-Renyi graphs. Throughout the settings we consider, we define to be a matrix where is an adjacency matrix associated with an Erdos-Renyi graph. To generate the elements of , we randomly assign each of the non-zero elements of a value from the set . Each off diagonal element is normalized by
times the row sum of the matrix, and each diagonal element is set to 1. The matrix is then scaled such that the associated variance of each of thevariables is 1. Further, we define to be a matrix that is generated using the adjacency matrix , such that nonzero elements are equal to the corresponding value in plus a randomly selected value from the set . The off-diagonal elements are normalized by times the row sum of the matrix, the diagonal elements are set to 1. Finally, the entire matrix is normalized such that the variance of each variable is 1. Similar data generating mechanisms have been used in Danaher et al. (2014) and Saegusa and Shojaie (2016).
4.2 Two clusters, block Erdos-Reyni graphs
We first compare PCEN-2 and PCEN-3 to competing methods under block Erdos-Reyni graphs. Each described in Section 4.1 is replicated 50 times with . In this setting, we generate to be block diagonal with each block of size . The first block is generated using , and the second is generated using where and are adjacency matrices associated with independent Erdos-Reyni graphs with edges. Using , we generate such that it is block diagonal with block size . We define the upper block of as , and the lower block to be where is the adjacency matrix with four edges removed. Similarly is the adjacency matrix with 4 edges removed. Hence, and have nearly equivalent sparsity patterns minus eight nonzero entries in which are zero in .
To generate we randomly select variables and define this set of variables as and define . The submatrix of corresponding to the indices in are generated such that and submatrix of corresponding to the indices in is generated such that , where and are independent Erdos-Renyi graphs with edges. The submatrices of corresponding to the indices in and are generated using and , respectively. The adjacency matrices and are the same as and , respectively, with 4 randomly selected edges removed in each.
The results in panels (a) and (b) of Figure 1 are average log sum of squared Frobenius norm error and the average true positive rate as the number of non-zero elements in the estimated precision matrices varies with . The results for the case of and can be found in the Supplementary Material. The results in panels (a) and (b) of Figure 1 suggest that PCEN-2 can outperform as well or better than competitors in terms of Frobenius norm error and graph recovery. Notably, some versions of PCEN-2 outperform LASICH-OR both in terms of log sum of squared Frobenius norm and TPR, even though LASICH-OR knows the relations between precision matrices a priori.
In the Supplementary Material, we present additional simulation results examining the effect of sample size and on the performance of PCEN-2. Briefly, as one would expect, as the sample size increases, the performance of PCEN-2 improves. In general, as increases, the performance also improves.
4.3 Two clusters, block diagonal Erdos-Renyi graphs
In contrast to the data generating models in Section 4.2, in these simulations we consider settings where all four precision matrices have a high degree of shared sparsity with high probability. We generate such that it is block diagonal with each block size of . The first block is generated using , and the second block is generated from where and are adjacency matrices associated with independent Erdos-Reyni graphs, with edges. Using we generate such that it is block diagonal with block size . We define the upper block of as , and the lower block to be where is the adjacency matrix with four edges removed. Similarly is the adjacency matrix with edges removed. Next, is generated in a similar way to and is generated from in the same fashion is generated from . By generating precision matrices in this way, entries not in the upper or lower block submatrix are zero in all four precision matrices.
The results in panels (c) and (d) Figure 1 are average log sum of squared Frobenius norm error and the average true positive rate as the number of non-zero elements in the precision matrices varying with and . The results for the case of and can be found in the Supplementary Material. These results show a similar pattern to the results from the simulation studies in Section 4.2. For large values of , PCEN-2 is competitive in Frobenius norm error and graph recovery with the all other methods, most notably LASICH-OR. As mentioned, LASICH-OR has oracle knowledge of the true relationships between precision matrices, while PCEN is estimating the relationships as well as the precision matrices.
4.4 Two clusters, block diagonal structures
In the final setting, we again assume a data generating model where the four precision matrices are divided into two groups. We generate such that it is block diagonal with each block size of . The first block is generated using
, and the second block is the identity matrix, whereis an adjacency matrix from an Erdos Renyi, with connections. Using we generate such that it is block diagonal with block size . We define the upper block of as , and the lower block to be the identity where is the adjacency matrix with four edges removed. Next, is generated in a similar way to and is generated from in the same fashion is generated from .
The results in panels (e) and (f) Figure 1 are average log sum of squared Frobenius norm error and the average true positive rate as the number of non-zero elements in the precision matrices varying with and . The results for the case of and can be found in the Supplementary Material. Results exhibit a similar pattern to the results displayed in Sections 4.2 and 4.3. For certain values of , PCEN-2 is competitive in estimation and graph recovery with the other methods, specifically LASICH-OR. As increases, we see the estimation and graph recovery of PCEN decreases relative to LASICH-OR, but is still competitive with other competitors. Again, this can be attributed to LASICH-OR having oracle information and its use of the group penalty which exploits similar sparsity patterns across all precision matrices.
5 Quadratic discriminant analysis simulations studies
In this section, we study CRF as a method for fitting the QDA model. We generate data from classes, where predictors for the -th class are generated from with . The training data consists of 25 independent realizations from each class. Tuning parameters are selected using 5-fold cross-validation maximizing the validation likelihood (see Supplementary Material). We measure classification accuracy to compare methods. To quantify classification accuracy, we generate an independent testing set consisting of 500 observations from each of the classes.
In addition to CRF, RF, and RDA (Friedman, 1989), we include two methods which have oracle knowledge of the population parameters: Oracle, which uses and in the classification rule; and TC (for “true covariance”), which uses and the sample means in the classification rule. These oracle methods provide a benchmark for classification accuracy in these data. We omit the sparse methods discussed in Section 4 as we study a class of dense precision matrices in this particular simulation study.
We consider a situation where each of the two clusters has a distinct structure and precision matrices in both clusters are dense. For 100 independent replications, we generate where each row is an independent realization of and let be the right singular vectors of . We then let and where and are diagonal matrices with the th element equal to and respectively. Define the th element of and where is the indicator function. We consider