The task of estimating highly modular structures from relationships found in data frequently arises in computational biology and finance. Due to the large volume and high dimensionality of data in these disciplines, inference procedures can be considerably sped up by leveraging the structured relations between inputs. For example, when considering gene expression data, it is often desirable to induce sparsity so as to reduce cross-pathway connections between genes.
Gaussian graphical models (GGMs) are well suited to modeling a sparse graph through the non-zero pattern of the inverse covariance matrix. From a probabilistic point of view, the precision (inverse covariance) matrix directly encodes the conditional independence relations between the variables. The best-known algorithm for learning GGMs is the graphical lasso, which imposes graph sparsity by maximizing an ℓ1-penalized Gaussian log-likelihood with block coordinate descent. Current state-of-the-art methods [13, 26, 7, 9] also rely on penalized structural learning. However, estimating the precision matrix column by column has received much attention, since such methods can be numerically simpler and more amenable to theoretical analysis [3, 19].
Related work can be organized into three classes: 1) Methods which learn a graph given groups a priori: This class includes an approach which applies a group penalty to encourage structured sparsity but requires a set of pre-defined hyperparameters to control the topology of the network. In [25, 17, 29, 4], prior knowledge is used to assign edge weights in order to predict the underlying group structure. 2) Methods which find similar groups and then learn the structure of each group: This class includes a method
which takes a two-step approach to the problem: first applying hierarchical clustering to identify groups and then using the graphical lasso within each group. Devijver et al. (2018) detect groups in the covariance matrix using thresholding and then apply the graphical lasso to each group. 3) Methods which learn both group and graph structures simultaneously: Defazio et al. (2012) propose using a non-decreasing, concave function in the penalty to enforce submodularity of the objective. Hosseini et al. (2016) propose the GRAB estimator, which solves a joint optimization problem that alternates between estimating overlapping groups encoded in a Laplacian prior and learning the precision matrix. Kumar et al. (2019) propose a framework for structured graph learning that imposes Laplacian spectral constraints. Tarzanagh et al. (2018) apply a structured norm to precision matrix estimation to identify overlapping groups.
This paper presents two novel estimators for identifying the precision matrices associated with Gaussian graphical models and makes four major contributions. First, both estimators require only a small amount of a priori information and no information about group structure: they need just two hyperparameters (one for sparsity and one for grouping) when using the OSCAR-like weight generation procedure. This differs from methods requiring many more hyperparameters and from GRAB, which requires a more constrained objective accompanied by an additional hyperparameter for each constraint. Second, both estimators can learn overlapping groups and network structure in a single-step proximal descent procedure. This is an advantage over the cluster graphical lasso
, which solves the task with a two-step procedure alternating between clustering and gradient steps. Likewise, the GRAB algorithm alternates between learning the overlapping group prior matrix and learning the inverse covariance. Our focus is on estimating the graphical model rather than identifying groups, but we can apply a Gaussian Mixture Model (GMM) to identify overlapping groups from our precision matrix estimate. Third, we establish the uniqueness of the GOWL estimator by deriving its dual formulation. Fourth, the ccGOWL framework provides new theoretical guarantees for grouping related entries in the precision matrix and offers a more computationally efficient algorithm. Compared with the penalized likelihood methods mentioned above, ccGOWL is clearly more computationally attractive in the high-dimensional setting, as its estimate can be obtained one column at a time by solving simple linear regressions that are easily parallelized or distributed.
Notation and Preliminaries.
Throughout the paper, we denote vectors and matrices by lowercase and uppercase boldface letters, respectively. For a vector v, v_{-i} denotes the vector with its i-th component removed. For a symmetric matrix Θ, θ_i· represents the i-th row of Θ and θ_·j denotes the j-th column of Θ; θ_{i,-j} denotes the i-th row with its j-th entry removed, θ_{-i,j} denotes the j-th column with its i-th entry removed, and Θ_{-i,-j} denotes the matrix obtained by removing the i-th row and j-th column. Moreover, we denote the strict column-wise vectorization of the lower triangular component of Θ by:
For a vector x, we use the classical definition of the ℓp norm, ‖x‖_p = (∑_i |x_i|^p)^{1/p}, for p ≥ 1. The largest component of x in magnitude is denoted x_[1]. The vector obtained by sorting the components of x in non-increasing order is denoted x↓. For matrices and vectors, |·| denotes the element-wise absolute value.
2.1 Gaussian Graphical Models (GGM)
A GGM aims to determine the conditional independence relations of a set of jointly Gaussian random variables. Suppose X is a collection of n i.i.d. p-dimensional random samples. Assume that the columns of the design matrix
have been standardized to have zero mean and unit variance. Let S denote the (biased, maximum likelihood) sample covariance matrix, defined as S = (1/n) XᵀX, and let Θ denote the precision matrix, defined as Θ = Σ⁻¹. It is well understood that the sparsity pattern of Θ encodes the pairwise partial correlations between variables. More specifically, the (i, j) entry of Θ is zero if and only if variables i and j are conditionally independent given the remaining components [15, 30]. The GLASSO algorithm has been proposed to estimate the conditional independence graph embedded in Θ through ℓ1-regularized Gaussian maximum likelihood estimation:
where λ ≥ 0 is a tuning parameter that controls the level of sparsity in Θ. Under the same assumptions, it has been shown that Θ can be estimated through column-by-column linear regressions. For each column j = 1, …, p, the conditional distribution of the j-th variable given the remaining variables satisfies:
where α_j = Σ_{-j,-j}⁻¹ σ_{-j,j} and σ_j² = σ_{jj} − σ_{-j,j}ᵀ Σ_{-j,-j}⁻¹ σ_{-j,j}. Thus, we can write the following regression model:
where ε_j is zero-mean Gaussian noise with variance σ_j², independent of X_{-j}. By the block matrix inversion formula, θ_{jj} = 1/σ_j² and θ_{-j,j} = −α_j/σ_j².
Thus, we can estimate the precision matrix column by column by regressing x_j on X_{-j}, and a LASSO procedure can be adopted by solving:
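To make the column-by-column recipe concrete, the following is a minimal sketch using an off-the-shelf lasso solver from scikit-learn. The function name, the penalty level, and the final symmetrization step are our illustrative choices, not the paper's algorithm (ccGOWL replaces the ℓ1 penalty below with an OWL penalty):

```python
import numpy as np
from sklearn.linear_model import Lasso

def columnwise_precision(X, alpha=0.1):
    """Neighbourhood-regression sketch of column-by-column precision
    estimation: regress each standardized column x_j on the remaining
    columns X_{-j} with an l1 penalty, then map the coefficients back
    to a column of Theta via theta_jj = 1 / residual variance and
    theta_{-j,j} = -beta_j * theta_jj."""
    n, p = X.shape
    Theta = np.zeros((p, p))
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        reg = Lasso(alpha=alpha, fit_intercept=False).fit(X[:, idx], X[:, j])
        resid = X[:, j] - X[:, idx] @ reg.coef_
        theta_jj = n / np.sum(resid ** 2)   # inverse of residual variance
        Theta[j, j] = theta_jj
        Theta[idx, j] = -reg.coef_ * theta_jj
    return (Theta + Theta.T) / 2            # symmetrize the raw estimate
```

Each of the p regressions is independent of the others, which is precisely why this style of estimator parallelizes so easily.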
2.2 OWL in the Linear Model Setting
In the linear regression setting, much interest has been given to regularizers that can identify groups of highly correlated predictors; this is often referred to as structured or group sparsity. Regularization penalties targeting such structure include the elastic net, the fused lasso, and the recently proposed ordered weighted ℓ1 (OWL) norm [1, 10]. Suppose that the response vector is generated from a linear model of the form y = Xβ + ε, where X is the n × p design matrix, β is the vector of regression coefficients, y is the response vector, and ε is a vector of zero-centered homoscedastic Gaussian errors. The OWL regularizer is then defined to be:
where β_[i] is the i-th largest component of β in magnitude and the weights satisfy w₁ ≥ w₂ ≥ … ≥ w_p ≥ 0. OSCAR is a particular case of the OWL that combines an ℓ1 term with pairwise ℓ∞ terms, and in its original formulation is defined as
The OSCAR weights can be recovered within OWL by setting w_i = λ₁ + λ₂(p − i), where λ₁ controls the level of sparsity and λ₂ controls the degree of grouping. The OWL regularizer is also referred to as sorted ℓ1 penalized estimation (SLOPE).
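The definitions above translate directly into code; this small sketch (function names are ours) computes the OSCAR weight sequence and evaluates the OWL penalty:

```python
import numpy as np

def oscar_weights(p, lam1, lam2):
    """OSCAR-style OWL weights w_i = lam1 + lam2 * (p - i) for
    i = 1..p, a non-increasing sequence: lam1 sets the overall
    sparsity level, lam2 the degree of grouping."""
    return lam1 + lam2 * np.arange(p - 1, -1, -1)

def owl_penalty(beta, w):
    """OWL regularizer: dot product of the non-increasing weights
    with the entries of |beta| sorted in non-increasing order."""
    return float(np.dot(w, np.sort(np.abs(beta))[::-1]))
```

For example, with p = 3, λ₁ = 1, λ₂ = 0.5 the weights are (2.0, 1.5, 1.0), and the penalty of β = (3, −1, 2) is 2·3 + 1.5·2 + 1·1 = 10.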
2.3 Proximal Methods
Proximal algorithms form a class of robust methods for solving composite optimization problems of the form
where H is a Hilbert space with associated inner product ⟨·,·⟩ and norm ‖·‖, f is a closed, continuously differentiable, convex function with an L-Lipschitz continuous gradient, and g is a convex, not necessarily differentiable function. The proximal mapping of a convex function g, denoted prox_g, is given by
In general, this family of algorithms uses the proximal mapping to handle the non-smooth component of (5) and performs a quadratic approximation of the smooth component at every iteration:
where η_t denotes the step size at iteration t. A proximal gradient method for precision matrix estimation has been proposed that possesses attractive theoretical properties, including a linear convergence rate under a suitable step size. The resulting algorithm, Graphical ISTA (G-ISTA), jointly optimizes two continuous convex functions mapping the domain of covariance matrices (the positive definite cone) onto the reals. G-ISTA conducts a backtracking line search to determine a step size, or uses a constant step size for problems with small dimension. The algorithm uses the duality gap as a stopping criterion and, thanks to its linear rate of convergence, reaches a tolerance of ε in O(log(1/ε)) iterations.
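The proximal gradient update described above can be sketched generically in a few lines; soft-thresholding (the ℓ1 prox) is shown only as the simplest concrete example, and the function names are our own:

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t * ||.||_1: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient(grad_f, prox_g, x0, step, n_iter=200):
    """Generic proximal gradient (ISTA) iteration:
    x_{k+1} = prox_{step * g}(x_k - step * grad_f(x_k))."""
    x = x0
    for _ in range(n_iter):
        x = prox_g(x - step * grad_f(x), step)
    return x
```

As a sanity check, minimizing ½(x − 3)² + |x| with this iteration converges to x = 2: the unit ℓ1 penalty shifts the unconstrained minimizer 3 toward zero by exactly one.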
3.1 Overview of OWL Estimators
We define the Graphical Ordered Weighted ℓ1 (GOWL) estimator to be the solution to the following constrained optimization problem:
Here, θ_[i] denotes the i-th largest off-diagonal component of Θ in magnitude, and the weights satisfy w₁ ≥ w₂ ≥ … ≥ 0. The proximal mapping of the OWL penalty can be computed efficiently using the pool adjacent violators (PAV) algorithm for isotonic regression.
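The PAV-based prox can be sketched with scikit-learn's isotonic regression, following the standard sort-shift-project-unsort recipe for the OWL prox (the helper name is ours):

```python
import numpy as np
from sklearn.isotonic import isotonic_regression

def prox_owl(v, w):
    """Proximal mapping of the OWL norm with non-increasing weights w:
    sort |v| in decreasing order, subtract the weights, project onto
    the non-increasing cone with PAV (isotonic regression), clip at
    zero, then undo the sort and restore the signs."""
    order = np.argsort(np.abs(v))[::-1]          # positions of |v|, largest first
    u = np.abs(v)[order]
    z = np.clip(isotonic_regression(u - w, increasing=False), 0.0, None)
    out = np.empty_like(v, dtype=float)
    out[order] = z
    return np.sign(v) * out
```

With all weights equal the mapping reduces to ordinary soft-thresholding; with strictly decreasing weights, entries of similar magnitude are pulled to a common value, which is precisely the grouping effect the estimators exploit.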
In a similar way, we define the column-by-column Graphical Ordered Weighted ℓ1 (ccGOWL) estimator to be the solution to the following unconstrained optimization problem:
where the weights and hyperparameters are defined in the same way as above.
3.2 Uniqueness of GOWL
In this section, we derive a dual formulation of GOWL. The dual plays an important role in implementing efficient algorithms and allows us to establish the uniqueness of the GOWL solution.
The duality gap at a point in the feasible set is
where tr(·) denotes the trace operator. See the Supplementary Material for the derivation of the duality gap for the GOWL estimator. In practice, the gap is estimated as the difference between the primal and dual objectives and serves as a stopping criterion for the iterative procedure. In contrast to some other sparse graph estimation techniques, our proposed objective function has a unique solution.
If the weights are strictly positive for all indices, then problem (7) has a unique optimal point.
The proof is provided in the Supplementary Material and involves selecting a point within the feasible set such that Slater’s condition (see Supplementary Material) is satisfied.
Our work borrows from the G-ISTA method discussed in Section 2.3 for solving the non-smooth problem defined in (7); the details are outlined in Algorithm 1. Similarly, Algorithm 2 is based on the proximal method for solving OWL-regularized linear regression and inherits that method's convergence rate. Algorithm 2 applies the proximal method to each of the p columns and combines the column estimates to arrive at the final precision matrix estimate.
3.4 Sufficient Grouping Conditions for ccGOWL
We can establish sufficient grouping conditions when estimating each column of Θ by drawing on previous work on the OSCAR and OWL regularizers. Here, “grouping” refers to components of a column estimate being exactly equal.
Let two entries in a column estimate of Θ, generated with grouping hyperparameter λ₂, be distinct from the other entries of that column. Then there exists a threshold, depending on λ₂, such that the two estimated entries are exactly equal whenever the sample correlation between the corresponding columns of X is sufficiently close to one.
Thus, depending on the sample correlation between covariates in X, the sufficient grouping property can be quantified according to Theorem 2.
4 Experimental Results
To assess the performance of our algorithms, we conducted a series of experiments. We first compared the estimates found by GOWL and ccGOWL to those from GRAB and GLASSO on a synthetic dataset with controlled sparsity and grouping. We then compared the average weighted F1 scores obtained by all four algorithms on random positive definite grouped matrices of varying size and group density. All algorithms were implemented in Python and executed on a machine with an Intel i7-8700K 3.20 GHz CPU and 32 GB of RAM.
4.1 Synthetic Data
Synthetic data was generated by first creating a set of groups for a p-dimensional matrix, with each group size chosen uniformly at random between a minimum and a maximum fraction of p. Group values were determined by uniformly sampling each group's mean, and all entries of a given group were set to its mean; the groups were then collected into a grouped matrix. To add noise to the true group values, we randomly generated a positive semi-definite matrix with entries set to zero with a fixed probability and the remaining values sampled uniformly, and added it to the grouped matrix. Each grouped matrix was generated 5 times, and random noise matrices were added to each of the 5 grouped matrices 20 times. A dataset was then generated by drawing n i.i.d. samples from the corresponding Gaussian distribution, from which the empirical covariance matrix was estimated after standardization of covariates. The hyperparameters λ₁ and λ₂ of the OSCAR weight generation defined in Section 3.3 were selected using a standard 2-fold cross-validation procedure, and the regularization parameter for the GRAB estimator was likewise selected using 2-fold cross-validation (Supplementary Material Tables 1, 2, 3).
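A generator in the spirit of the synthetic procedure can be sketched as follows; the block sizes, value range, and diagonal shift here are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def grouped_matrix(p, group_sizes, seed=0):
    """Sketch of a synthetic grouped-matrix generator: place square
    blocks sharing a single value on a p x p symmetric matrix, then
    make the matrix positive definite via diagonal dominance."""
    rng = np.random.default_rng(seed)
    M = np.zeros((p, p))
    start = 0
    for size in group_sizes:
        # every entry of a group is set to the group's sampled value
        M[start:start + size, start:start + size] = rng.uniform(0.2, 0.8)
        start += size
    M = (M + M.T) / 2
    # strict diagonal dominance guarantees positive definiteness
    M += np.eye(p) * (np.abs(M).sum(axis=1).max() + 0.1)
    return M
```

A noise matrix (sparse, positive semi-definite) would then be added on top, exactly as described above.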
Figure 2 shows an example of a synthetic precision matrix with 2 groups as the ground truth. This ground truth was used to sample a dataset, from which we estimated the empirical covariance matrix and provided it as input to the algorithms. The GOWL and ccGOWL precision matrix estimates almost fully recover the 6 red entries from group 1 and the 2 blue entries from group 2. GLASSO was not included owing to its poor performance in recovering the true group values. We assessed performance using the weighted F1 score: the F1 score is calculated for each label, and the weighted average is taken with weights proportional to the number of true instances of each group. For each value of p, we generated 5 randomly grouped matrices using the procedure outlined in the previous section and fit GLASSO, GRAB, GOWL, and ccGOWL to each of them. The estimates were then clustered using a Gaussian Mixture Model (GMM) with the same number of groups as originally set. The goodness of fit of the clusters was compared to the original group labeling using the weighted F1 metric, and we report the permutation of labels giving the highest score. Figure 2 shows the distribution of the F1 scores for each algorithm and each class of matrices. Overall, ccGOWL outperforms GOWL and GLASSO in both the mean and the variance of the F1 scores, and achieves similar or greater performance compared to GRAB. Mean squared error and mean absolute error were also measured for each estimate and are provided in the Supplementary Material (Figures 1 and 2).
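The GMM-plus-weighted-F1 evaluation described above can be sketched as follows. The helper name and the brute-force search over label permutations are our choices (feasible only for a small number of groups):

```python
import numpy as np
from itertools import permutations
from sklearn.mixture import GaussianMixture
from sklearn.metrics import f1_score

def grouped_f1(entries, true_labels, n_groups, seed=0):
    """Cluster scalar precision-matrix entries with a GMM and score
    the clustering against the true group labels with the weighted
    F1, maximized over permutations of the (arbitrary) cluster
    labels."""
    gmm = GaussianMixture(n_components=n_groups, random_state=seed)
    pred = gmm.fit_predict(entries.reshape(-1, 1))
    best = 0.0
    for perm in permutations(range(n_groups)):
        relabeled = np.array([perm[c] for c in pred])
        best = max(best, f1_score(true_labels, relabeled, average="weighted"))
    return best
```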
4.2 Timing Comparisons
The GLASSO, GRAB, and ccGOWL algorithms were run on synthetic datasets generated as in Section 4.1 with varying n and p and different levels of regularization. The GRAB algorithm was run with a fixed duality gap tolerance and ccGOWL with a fixed precision. The GRAB implementation is in Python and uses the R package QUIC, implemented in C++, to find the optimum of the log-likelihood function; ccGOWL is implemented entirely in Python and requires solving p linear regressions, with most of the computational cost attributed to evaluating the proximal mapping. The GLASSO algorithm used the Python package sklearn. Running times are presented in Table 1 and show a significant advantage of ccGOWL over GRAB. The difference for large p is due to GRAB requiring matrix inversions within QUIC and applying the k-means clustering algorithm to the rows of its block matrix.
4.3 Cancer Gene Expression Data
We consider a dataset of gene expression levels measured with DNA microarrays in 38 patients diagnosed with either acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL). This dataset was initially investigated by Golub et al. (1999). To obtain a more interpretable model, we selected the 50 genes most highly correlated with the ALL-AML distinction, as identified in prior work. Expression levels were standardized to have zero mean and unit variance. Of the 50 genes selected, the first 25 are known to be highly expressed in ALL and the remaining 25 in AML. Notably, no single gene is consistently expressed across both AML and ALL patients, which illustrates the need for an estimation method that considers multiple genes jointly when diagnosing patients. Genes that are highly expressed in AML should appear in the same group, and likewise for ALL. In addition, we use the Reactome pathway knowledgebase to identify each gene’s main biological pathway.
Figure 4 illustrates that ccGOWL groups genes according to their disease. In fact, ccGOWL correctly places all 25 genes associated with AML in one group and 24 of the 25 genes associated with ALL in the other (Supplementary Material Table 4). We compare the network identified by ccGOWL with the commonly used GLASSO baseline to illustrate the importance of an estimator that exploits grouping (Supplementary Material Figure 3). Examining the connections within each group can also yield useful insights. The AML group (top) contains highly connected genes beginning with “M” (blue nodes), which are associated with the neutrophil granulation process in the cell. AML is a disorder in the production of neutrophils, the normal white blood cells whose granules fight infections; AML leads to the production of immature neutrophils (referred to as blasts), leaving patients prone to serious infections.
4.4 Equities Data
We consider the stock price dataset available in the huge package on CRAN. The dataset consists of the closing prices of stocks in the S&P 500 between January 1, 2003 and January 1, 2008. The stocks can be categorized into 10 Global Industry Classification Standard (GICS) sectors. Stocks that were not consistently in the S&P 500 index or that were missing too many closing prices were removed from the dataset.
The design matrix contains the log-ratio of the price at time t to the price at time t − 1 for each of the 452 stocks over 1257 trading days. More formally, the (t, j)-th entry of X is defined as x_{t,j} = log(P_{t,j} / P_{t−1,j}), where P_{t,j} is the closing price of the j-th stock on the t-th day. The matrix was then standardized so that each stock has zero mean and unit variance. The GICS sector of each stock is known; however, this information was not used when estimating the precision matrix from the return matrix X. Figure 4 illustrates the ability of ccGOWL to group stocks belonging to the same sector, producing a more interpretable network than the one constructed by GLASSO (Supplementary Material Figure 4). Information Technology and Utilities stocks largely exhibit conditional independence from stocks in other GICS sectors. By contrast, there appears to be conditional dependence between stocks in the Materials and Industrials sectors, likely reflecting the many customer-supplier relationships between companies in these sectors; both sectors are sensitive to economic cycles, and their equities offer exposure to global infrastructure replacement.
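The return construction above is a one-liner in practice; a minimal sketch (function name is ours):

```python
import numpy as np

def log_returns(prices):
    """Turn a (days x stocks) closing-price matrix into standardized
    log-returns: x_{t,j} = log(P_{t,j} / P_{t-1,j}), then scale each
    column (stock) to zero mean and unit variance."""
    X = np.log(prices[1:] / prices[:-1])
    return (X - X.mean(axis=0)) / X.std(axis=0)
```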
We proposed GOWL and ccGOWL, two novel precision matrix estimation procedures based on the ordered weighted ℓ1 norm. We proved the uniqueness of the GOWL estimator and identified sufficient grouping conditions for the ccGOWL estimator. Based on empirical results on synthetic and real datasets, the ccGOWL estimator accurately identifies structure in the precision matrix far more efficiently than state-of-the-art estimators of comparable accuracy. Although penalized likelihood methods for learning structure in precision matrices are widely used and perform well, column-by-column estimators deserve consideration, as they accomplish similar feats at reduced computational cost.
-  M. Bogdan, E. Van Den Berg, C. Sabatti, W. Su, and E. J. Candès. SLOPE—adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3):1103, 2015.
-  H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123, 2008.
-  T. Cai, W. Liu, and X. Luo. A constrained minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
-  T. T. Cai, H. Li, W. Liu, and J. Xie. Joint estimation of multiple high-dimensional precision matrices. Statistica Sinica, 26(2):445, 2016.
-  D. Croft, A. F. Mundo, R. Haw, M. Milacic, J. Weiser, G. Wu, M. Caudy, P. Garapati, M. Gillespie, M. R. Kamdar, et al. The reactome pathway knowledgebase. Nucleic Acids Research, 42(D1):D472–D477, 2013.
-  F. R. Davey, W. N. Erber, K. C. Gatter, and D. Y. Mason. Abnormal neutrophils in acute myeloid leukemia and myelodysplastic syndrome. Human Pathology, 19(4):454–459, 1988.
-  A. Defazio and T. S. Caetano. A convex formulation for learning scale-free networks via submodular relaxation. In Advances in Neural Information Processing Systems, pages 1250–1258, 2012.
-  E. Devijver and M. Gallopin. Block-diagonal covariance selection for high-dimensional Gaussian graphical models. Journal of the American Statistical Association, 113(521):306–314, 2018.
-  J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proc. Int. Conf. Artificial Intelligence and Statistics, pages 153–160, 2008.
-  M. Figueiredo and R. Nowak. Ordered weighted regularized regression with strongly correlated covariates: Theoretical aspects. In Proc. Int. Conf. Artificial Intelligence and Statistics, pages 930–938, 2016.
-  J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
-  T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
-  M. J. Hosseini and S.-I. Lee. Learning sparse Gaussian graphical models with overlapping blocks. In Advances in Neural Information Processing Systems, pages 3808–3816, 2016.
-  C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. K. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2330–2338. 2011.
-  D. Koller, N. Friedman, and F. Bach. Probabilistic graphical models: principles and techniques. MIT press, 2009.
-  S. Kumar, J. Ying, J. V. d. M. Cardoso, and D. Palomar. A unified framework for structured graph learning via spectral constraints. arXiv preprint arXiv:1904.09792, 2019.
-  W. Lee and Y. Liu. Joint estimation of multiple precision matrices with common structures. The Journal of Machine Learning Research, 16(1):1035–1062, 2015.
-  H. Liu, F. Han, M. Yuan, J. Lafferty, L. Wasserman, et al. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012.
-  N. Meinshausen, P. Bühlmann, et al. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
-  MSCI. The Global Industry Classification Standard (GICS), 2019.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
-  W. E. Peterman, B. H. Ousterhout, T. L. Anderson, D. L. Drake, R. D. Semlitsch, and L. S. Eggert. Assessing modularity in genetic networks to manage spatially structured metapopulations. Ecosphere, 7(2), 2016.
-  B. Rolfs, B. Rajaratnam, D. Guillot, I. Wong, and A. Maleki. Iterative thresholding algorithm for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 1574–1582, 2012.
-  E. A. Stone and J. F. Ayroles. Modulated modularity clustering as an exploratory tool for functional genomic inference. PLoS Genetics, 5(5):e1000479, 2009.
-  S. Sun, H. Wang, and J. Xu. Inferring block structure of graphical models in exponential families. In Proc. Int. Conf. Artificial Intelligence and Statistics, pages 939–947, 2015.
-  K. M. Tan, D. Witten, and A. Shojaie. The cluster graphical lasso for improved estimation of Gaussian graphical models. Computational Statistics & Data Analysis, 85:23–36, 2015.
-  D. A. Tarzanagh and G. Michailidis. Estimation of graphical models through structured norm minimization. Journal of Machine Learning Research, 18(1), 2018.
-  R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67(1):91–108, 2005.
-  J. Wang. Joint estimation of sparse multivariate regression and conditional graphical models. Statistica Sinica, pages 831–851, 2015.
-  M. Wytock and Z. Kolter. Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. In Proc. Int. Conf. of Machine Learning Research, pages 1265–1273, 2013.
-  X. Zeng and M. A. Figueiredo. The ordered weighted ℓ1 norm: Atomic formulation, projections, and algorithms. arXiv preprint arXiv:1409.4271, 2014.
-  T. Zhao, H. Liu, K. Roeder, J. Lafferty, and L. Wasserman. The huge package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research, 13(Apr):1059–1062, 2012.
-  H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301–320, 2005.