# Targeted Fused Ridge Estimation of Inverse Covariance Matrices from Multiple High-Dimensional Data Classes

We consider the problem of jointly estimating multiple precision matrices from (aggregated) high-dimensional data consisting of distinct classes. An ℓ_2-penalized maximum-likelihood approach is employed. The suggested approach is flexible and generic, incorporating several other ℓ_2-penalized estimators as special cases. In addition, the approach allows for the specification of target matrices through which prior knowledge may be incorporated and which can stabilize the estimation procedure in high-dimensional settings. The result is a targeted fused ridge estimator that is of use when the precision matrices of the constituent classes are believed to chiefly share the same structure while potentially differing in a number of locations of interest. It has many applications in (multi)factorial study designs. We focus on the graphical interpretation of precision matrices with the proposed estimator then serving as a basis for integrative or meta-analytic Gaussian graphical modeling. Situations are considered in which the classes are defined by data sets and/or (subtypes of) diseases. The performance of the proposed estimator in the graphical modeling setting is assessed through extensive simulation experiments. Its practical usability is illustrated by the differential network modeling of 11 large-scale diffuse large B-cell lymphoma gene expression data sets. The estimator and its related procedures are incorporated into the R-package rags2ridges.


## 1. Introduction

High-dimensional data are ubiquitous in modern statistics. Consequently, the fundamental problem of estimating the covariance matrix or its inverse (the precision matrix) has received renewed attention. Suppose we have $n$ i.i.d. observations of a $p$-dimensional variate distributed as $\mathcal{N}_p(\mathbf{0}, \boldsymbol{\Sigma})$. The Gaussian log-likelihood parameterized in terms of the precision matrix $\boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}$ is then given by:

$$L(\boldsymbol{\Omega}; \mathbf{S}) \propto \ln|\boldsymbol{\Omega}| - \operatorname{tr}(\mathbf{S}\boldsymbol{\Omega}), \tag{1}$$

where $\mathbf{S}$ is the sample covariance matrix. When $n \geq p$ the maximum of (1) is attained at the maximum likelihood estimate (MLE) $\hat{\boldsymbol{\Omega}} = \mathbf{S}^{-1}$. However, in the high-dimensional case, i.e., when $p > n$, the sample covariance matrix is singular and its inverse ceases to exist. Furthermore, when $p$ approaches $n$, the sample covariance matrix may be ill-conditioned and its inversion becomes numerically unstable. Hence, these situations necessitate the use of regularization techniques.
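As a quick numerical illustration of the high-dimensional problem (a sketch in Python/NumPy rather than the paper's R implementation; the sizes are arbitrary), the sample covariance matrix is rank-deficient whenever $p > n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 25                       # high-dimensional: p > n
Y = rng.standard_normal((n, p))
Y -= Y.mean(axis=0)                 # column-center the data
S = Y.T @ Y / n                     # sample covariance matrix (MLE scaling)

# rank(S) <= n - 1 < p, so S is singular and the MLE of the
# precision matrix, the inverse of S, does not exist.
print(np.linalg.matrix_rank(S), "<", p)
```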

Here, we study the simultaneous estimation of numerous precision matrices when multiple classes of high-dimensional data are present. Suppose $\mathbf{y}_{ig}$ is a realization of a $p$-dimensional Gaussian random vector for $i = 1, \ldots, n_g$ independent observations nested within $G$ classes, each with class-dependent covariance $\boldsymbol{\Sigma}_g$, i.e., $\mathbf{y}_{ig} \sim \mathcal{N}_p(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ for each designated class $g = 1, \ldots, G$. Hence, for each class a data set consisting of the $n_g \times p$ matrix $\mathbf{Y}_g$ is observed. Without loss of generality $\boldsymbol{\mu}_g = \mathbf{0}$ can be assumed, as each data set can be centered around its column means. The class-specific sample covariance matrix, given by

$$\mathbf{S}_g = \frac{1}{n_g}\sum_{i=1}^{n_g} \mathbf{y}_{ig}\mathbf{y}_{ig}^\top = \frac{1}{n_g}\mathbf{Y}_g^\top\mathbf{Y}_g,$$

then constitutes the well-known MLE of $\boldsymbol{\Sigma}_g$ as discussed above. The closely related pooled sample covariance matrix

$$\mathbf{S}_\bullet = \frac{1}{n_\bullet}\sum_{g=1}^{G}\sum_{i=1}^{n_g} \mathbf{y}_{ig}\mathbf{y}_{ig}^\top = \frac{1}{n_\bullet}\sum_{g=1}^{G} n_g \mathbf{S}_g, \tag{2}$$

where $n_\bullet = \sum_{g=1}^{G} n_g$, is an oft-used estimate of the common covariance matrix across classes. In the high-dimensional case (implying $p > n_g$) the $\mathbf{S}_g$ and $\mathbf{S}_\bullet$ are singular and their inverses do not exist. Our primary interest thus lies in estimating the precision matrices $\boldsymbol{\Omega}_g \equiv \boldsymbol{\Sigma}_g^{-1}$, as well as their commonalities and differences, when $p > n_g$. We will develop a general ℓ_2-penalized ML framework to this end, which we designate targeted fused ridge estimation.
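The relation between the class-wise and pooled estimates can be sketched as follows (Python/NumPy, with illustrative class sizes; not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(7)
p, sizes = 4, [30, 50]                     # G = 2 classes; sizes are arbitrary
Ys = [rng.standard_normal((n, p)) for n in sizes]
Ys = [Y - Y.mean(axis=0) for Y in Ys]      # center each class around its column means

# Class-specific MLEs S_g and the pooled estimate of (2):
S_g = [Y.T @ Y / n for Y, n in zip(Ys, sizes)]
n_tot = sum(sizes)
S_pool = sum(n * S for n, S in zip(sizes, S_g)) / n_tot

# Equivalently, S_pool is the covariance of all observations taken together.
S_direct = sum(Y.T @ Y for Y in Ys) / n_tot
print(np.allclose(S_pool, S_direct))       # True
```

The pooled matrix is thus the sample-size-weighted mean of the class-wise matrices.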

The estimation of multiple precision matrices from high-dimensional data classes is of interest in many applications. The field of oncogenomics, for example, often deals with high-dimensional data from high-throughput experiments. Class membership may have different connotations in such settings. It may refer to certain sub-classes within a single data set, such as cancer subtypes (cancer is a very heterogeneous disease, even when present in a single organ). It may also designate different data sets or studies. Likewise, the class indicator may refer to a conjunction of both subclass and study membership, forming a two-way design of factors of interest (e.g., breast cancer subtypes present in a batch of study-specific data sets), as is often the case in oncogenomics. Our approach is thus motivated by the meta-analytic setting, where we aim for an integrative analysis in terms of simultaneously considering multiple data (sub-)classes, data sets, or both. The aim is to borrow statistical power across classes by effectively increasing the sample size, in order to improve the sensitivity and specificity of discoveries.

### 1.1. Relation to literature and overview

There have been many proposals for estimating a single precision matrix in high-dimensional data settings. A popular approach is to amend (1) with an ℓ_1-penalty [48, 2, 19, 49]. The solution to this penalized problem is generally referred to as the graphical lasso and it is popular as it performs automatic model selection, i.e., the resulting estimate is sparse. It is heavily used in Gaussian graphical modeling (GGM), as the support of a Gaussian precision matrix represents a Markov random field [24].

The ℓ_1-approach has been extended to deal with more than a single sample group. Guo et al. [21] have proposed a parametrization of class-specific precision matrices that expresses the individual elements as a product of shared and class-specific factors. They include ℓ_1-penalties on both the shared and class-specific factors in order to jointly estimate the sparse precision matrices (representing graphical models). The penalty on the shared factors promotes a shared sparsity structure, while the penalty on the class-specific factors promotes class-specific deviations from the shared sparsity structure. Danaher et al. [13] have generalized these efforts by proposing the joint graphical lasso, which allows for various penalty structures. They study two particular choices: the group graphical lasso, which encourages a shared sparsity structure across the class-specific precision matrices, and the fused graphical lasso, which promotes a shared sparsity structure as well as shared precision element-values. A Bayesian approach to inferring multiple sparse precision matrices can be found in Peterson et al. [32].

While simultaneous estimation and model selection can be deemed elegant, automatic sparsity is not always an asset. It may be that one is intrinsically interested in more accurate representations of class-specific precision matrices for usage in, say, covariance-regularized regression [46] or discriminant analysis [33]. In such a situation one is not after sparse representations and one may prefer a regularization method that shrinks the estimated elements of the precision matrices proportionally. In addition—when indeed considering network representations of data—the true class-specific graphical models need not be (extremely) sparse in terms of containing many zero elements. The ℓ_1-penalty is unable to retrieve the sparsity pattern when the number of truly non-null elements exceeds the available sample size [42]. In such a situation one may wish to couple a non-sparsity-inducing penalty with a post-hoc selection step allowing for probabilistic control over element selection. We therefore consider ℓ_2- or ridge-type penalization.

In Section 2 the targeted fused ridge estimation framework will be presented. The proposed fused ℓ_2-penalty allows for the simultaneous estimation of multiple precision matrices from high-dimensional data classes that chiefly share the same structure but that may differ in locations of interest. The approach is targeted in the sense that it allows for the specification of target matrices that may encode prior information. The framework is flexible and general, containing the recent work of Price et al. [33] and van Wieringen and Peeters [42] as special cases. It may be viewed as an ℓ_2-alternative to the work of Danaher et al. [13]. The method is contingent upon the selection of penalty values and target matrices, topics that are treated in Section 3. Section 4 then focuses on the graphical interpretation of precision matrices. It shows how the fused ridge precision estimates may be coupled with post-hoc support determination in order to arrive at multiple graphical models. We will refer to this coupling as the fused graphical ridge. This then serves as a basis for integrative or meta-analytic network modeling. Section 5 assesses the performance of the proposed estimator through extensive simulation experiments. Section 6 illustrates the techniques by applying them in a large-scale integrative study of gene expression data of diffuse large B-cell lymphoma. The focus is then on finding common motifs and motif differences in network representations of (deregulated) molecular pathways. Section 7 concludes with a discussion.

### 1.2. Notation

Some additional notation must be introduced. Throughout the text and supplementary material, we use the following notation for certain matrix properties and sets: We use $\mathbf{A} \succ 0$ and $\mathbf{A} \succeq 0$ to denote symmetric positive definite (p.d.) and positive semi-definite (p.s.d.) matrices $\mathbf{A}$, respectively. By $\mathbb{R}$, $\mathbb{R}_{\geq 0}$, and $\mathbb{R}_{>0}$ we denote the real numbers, the non-negative real numbers, and the strictly positive real numbers, respectively. In notational analogue, $\mathcal{S}^p$, $\mathcal{S}^p_{+}$, and $\mathcal{S}^p_{++}$ are used to denote the space of $p \times p$ real symmetric matrices, the real symmetric p.s.d. matrices, and the real symmetric p.d. matrices, respectively. That is, e.g., $\mathcal{S}^p_{++} \subset \mathcal{S}^p_{+} \subset \mathcal{S}^p$. Negative subscripts similarly denote negative reals and negative definiteness. By $\mathbf{A} \geq c$ and similar we denote element-wise relations, i.e., $(\mathbf{A})_{ij} \geq c$ for all $i, j$. Matrix subscripts will usually denote class membership, e.g., $\mathbf{Y}_g$ denotes (the realization of) matrix $\mathbf{Y}$ in class $g$. For notational brevity we will often use the shorthand $\{\mathbf{A}_g\}$ to denote the set $\{\mathbf{A}_g\}_{g=1}^{G}$.

The following notation is used throughout for operations: We write $\operatorname{diag}(\mathbf{A})$ for the column vector composed of the diagonal of $\mathbf{A}$ and $\operatorname{vec}(\mathbf{A})$ for the vectorization operator which stacks the columns of $\mathbf{A}$ on top of each other. Moreover, $\circ$ will denote the Hadamard product while $\otimes$ refers to the Kronecker product.

We will also repeatedly make use of several special matrices and functions. We let $\mathbf{I}_p$ denote the ($p \times p$)-dimensional identity matrix. Similarly, $\mathbf{J}_p$ will denote the ($p \times p$)-dimensional all-ones matrix. In addition, $\mathbf{0}$ will denote the null matrix, the dimensions of which should be clear from the context. Lastly, $\|\cdot\|_F^2$ and $\mathbb{1}[\cdot]$ will stand for the squared Frobenius norm and the indicator function, respectively.

## 2. Targeted fused ridge estimation

### 2.1. A general penalized log-likelihood problem

Suppose $G$ classes of $p$-dimensional data exist and that the samples within each class are i.i.d. normally distributed. The log-likelihood for the data takes the following form under the additional assumption that all $n_\bullet$ observations are independent:

$$L(\{\boldsymbol{\Omega}_g\}; \{\mathbf{S}_g\}) \propto \sum_{g} n_g\left\{\ln|\boldsymbol{\Omega}_g| - \operatorname{tr}(\mathbf{S}_g\boldsymbol{\Omega}_g)\right\}. \tag{3}$$

We desire to obtain estimates of the precision matrices $\boldsymbol{\Omega}_g$ for each class. Though not a requirement, we primarily consider situations in which $p > n_g$ for all $g$, necessitating regularization. To this end, amend (3) with the fused ridge penalty given by

$$\sum_{g} \frac{\lambda_{gg}}{2}\left\|\boldsymbol{\Omega}_g - \mathbf{T}_g\right\|_F^2 + \sum_{g_1, g_2} \frac{\lambda_{g_1 g_2}}{4}\left\|(\boldsymbol{\Omega}_{g_1} - \mathbf{T}_{g_1}) - (\boldsymbol{\Omega}_{g_2} - \mathbf{T}_{g_2})\right\|_F^2, \tag{4}$$

where the $\mathbf{T}_g$ indicate known class-specific target matrices (see also Section 3.3), the $\lambda_{gg}$ denote class-specific ridge penalty parameters, and the $\lambda_{g_1 g_2}$ are pair-specific fusion penalty parameters subject to the symmetry requirement $\lambda_{g_1 g_2} = \lambda_{g_2 g_1}$. All penalties can then be conveniently summarized in a non-negative symmetric matrix $\boldsymbol{\Lambda} = [\lambda_{g_1 g_2}]$, which we call the penalty matrix. The diagonal of $\boldsymbol{\Lambda}$ corresponds to the class-specific ridge penalties, whereas the off-diagonal entries are the pair-specific fusion penalties. The rationale and use of the penalty matrix is motivated further in Section 3.1. Combining (3) and (4) yields a general targeted fused ridge estimation problem:

$$\operatorname*{arg\,max}_{\{\boldsymbol{\Omega}_g\} \in \mathcal{S}^p_{++}} \left\{ L(\{\boldsymbol{\Omega}_g\}; \{\mathbf{S}_g\}) - \sum_{g} \frac{\lambda_{gg}}{2}\left\|\boldsymbol{\Omega}_g - \mathbf{T}_g\right\|_F^2 - \sum_{g_1, g_2} \frac{\lambda_{g_1 g_2}}{4}\left\|(\boldsymbol{\Omega}_{g_1} - \mathbf{T}_{g_1}) - (\boldsymbol{\Omega}_{g_2} - \mathbf{T}_{g_2})\right\|_F^2 \right\}. \tag{5}$$

The problem of (5) is strictly concave. Furthermore, it is worth noting that non-zero fusion penalties, $\lambda_{g_1 g_2} > 0$ for all $g_1 \neq g_2$, alone will not guarantee uniqueness when $p > n_g$: In high dimensions, all ridge penalties $\lambda_{gg}$ should be strictly positive to ensure identifiability. These and other properties of the estimation problem are reviewed in Section 2.2.

The problem stated in (5) is very general. We shall sometimes consider a single common ridge penalty $\lambda_{gg} = \lambda$ for all $g$, as well as a common fusion penalty $\lambda_{g_1 g_2} = \lambda_f$ for all class pairs (cf., however, Section 3.1). This simplification leads to the first special case:

$$\operatorname*{arg\,max}_{\{\boldsymbol{\Omega}_g\} \in \mathcal{S}^p_{++}} \left\{ L(\{\boldsymbol{\Omega}_g\}; \{\mathbf{S}_g\}) - \frac{\lambda}{2}\sum_{g}\left\|\boldsymbol{\Omega}_g - \mathbf{T}_g\right\|_F^2 - \frac{\lambda_f}{4}\sum_{g_1, g_2}\left\|(\boldsymbol{\Omega}_{g_1} - \mathbf{T}_{g_1}) - (\boldsymbol{\Omega}_{g_2} - \mathbf{T}_{g_2})\right\|_F^2 \right\}.$$

Here and analogous to (5), $\lambda$ controls the rate of shrinkage of each precision estimate towards the corresponding target [42], while $\lambda_f$ determines the retainment of entry-wise similarities between $\boldsymbol{\Omega}_{g_1} - \mathbf{T}_{g_1}$ and $\boldsymbol{\Omega}_{g_2} - \mathbf{T}_{g_2}$ for all class pairs $g_1, g_2$.

When $\mathbf{T}_g = \mathbf{T}$ for all $g$, the problem further simplifies to

$$\operatorname*{arg\,max}_{\{\boldsymbol{\Omega}_g\} \in \mathcal{S}^p_{++}} \left\{ L(\{\boldsymbol{\Omega}_g\}; \{\mathbf{S}_g\}) - \frac{\lambda}{2}\sum_{g}\left\|\boldsymbol{\Omega}_g - \mathbf{T}\right\|_F^2 - \frac{\lambda_f}{4}\sum_{g_1, g_2}\left\|\boldsymbol{\Omega}_{g_1} - \boldsymbol{\Omega}_{g_2}\right\|_F^2 \right\}, \tag{6}$$

where the targets are seen to disappear from the fusion term. Lastly, when $\mathbf{T} = \mathbf{0}$ the problem (6) reduces to its simplest form, recently considered by Price et al. [33]. To support an intuitive feel for the fused ridge estimation problem, Appendix A studies its geometric interpretation in this latter context.

### 2.2. Estimator and properties

There is no explicit solution to (5) except for certain special cases and thus an iterative optimization procedure is needed for its general solution. As described in Section 2.3, we employ a coordinate ascent procedure which relies on the concavity of the penalized likelihood (see Lemma 3 in Appendix B.1) and repeated use of the following result, whose proof (as indeed all proofs) has been deferred to Appendix B.2:

###### Proposition 1.

Let $g_0 \in \{1, \ldots, G\}$ and let $\boldsymbol{\Lambda}$ be a fixed penalty matrix such that $\boldsymbol{\Lambda} \geq 0$ and $\operatorname{diag}(\boldsymbol{\Lambda}) > 0$. Furthermore, assume that $\boldsymbol{\Omega}_g$ is p.d. and fixed for all $g \neq g_0$. The maximizing argument for class $g_0$ of the optimization problem (5) is then given by

$$\hat{\boldsymbol{\Omega}}_{g_0} = \left\{\left[\bar{\lambda}_{g_0}\mathbf{I}_p + \tfrac{1}{4}\left(\bar{\mathbf{S}}_{g_0} - \bar{\lambda}_{g_0}\bar{\mathbf{T}}_{g_0}\right)^2\right]^{1/2} + \tfrac{1}{2}\left(\bar{\mathbf{S}}_{g_0} - \bar{\lambda}_{g_0}\bar{\mathbf{T}}_{g_0}\right)\right\}^{-1}, \tag{7}$$

where

$$\bar{\mathbf{S}}_{g_0} = \mathbf{S}_{g_0} - \sum_{g \neq g_0} \frac{\lambda_{g g_0}}{n_{g_0}}(\boldsymbol{\Omega}_g - \mathbf{T}_g), \qquad \bar{\mathbf{T}}_{g_0} = \mathbf{T}_{g_0}, \qquad \text{and} \qquad \bar{\lambda}_{g_0} = \frac{\lambda_{g_0\bullet}}{n_{g_0}}, \tag{8}$$

with $\lambda_{g_0\bullet} = \sum_{g} \lambda_{g g_0}$ denoting the sum of the $g_0$th column (or row) of $\boldsymbol{\Lambda}$.

###### Remark 1.

Defining $\bar{\mathbf{T}}_{g_0}$ in Proposition 1 may be deemed redundant. However, it allows us to state equivalent alternatives to (8) without confusing notation. See Section 2.3 as well as Appendix B.2 and Section 1 of the Supplementary Material.

###### Remark 2.

The target matrices $\mathbf{T}_g$ from Proposition 1 may be chosen nonnegative definite. However, choosing targets that are merely p.s.d. may lead to ill-conditioned estimates in the limit. From a shrinkage perspective we thus prefer to choose $\mathbf{T}_g \succ 0$. See Section 3.3.

Proposition 1 provides a function for updating the estimate of the $g_0$th class while fixing the remaining parameters. As a special case, consider the following: If all off-diagonal elements of $\boldsymbol{\Lambda}$ are zero, no 'class fusion' of the estimates takes place and the maximization problem decouples into individual, disjoint ridge estimations; see Corollary 1 in Appendix B.2. The next result summarizes some properties of (7):

###### Proposition 2.

Consider the estimator of Proposition 1 and its accompanying assumptions. Let $\hat{\boldsymbol{\Omega}}_g$ be the precision matrix estimate of the $g$th class. For this estimator, the following properties hold:

1. $\hat{\boldsymbol{\Omega}}_g \succ 0$ for all $\bar{\lambda}_g > 0$;

2. $\hat{\boldsymbol{\Omega}}_g \to \mathbf{S}_g^{-1}$ as $\lambda_{gg} \to 0^{+}$ if $\lambda_{g g'} = 0$ for all $g' \neq g$ and $n_g \geq p$;

3. $\hat{\boldsymbol{\Omega}}_g \to \mathbf{T}_g$ as $\lambda_{gg} \to \infty$ if $\lambda_{g g'} < \infty$ for all $g' \neq g$;

4. $\|\hat{\boldsymbol{\Omega}}_{g_1} - \hat{\boldsymbol{\Omega}}_{g_2}\|_F \to 0$ as $\lambda_{g_1 g_2} \to \infty$ if all remaining penalty parameters are fixed and finite.

The first item of Proposition 2 implies that strictly positive $\bar{\lambda}_g$ are sufficient to guarantee p.d. estimates from the fused ridge estimator. The second item then implies that, if 'class fusion' is absent, one obtains as the right-hand limit for group $g$ the standard MLE $\mathbf{S}_g^{-1}$, whose existence is only guaranteed when $n_g \geq p$. The third item shows that the fused ridge precision estimator for class $g$ is shrunken exactly to its target matrix when the ridge penalty tends to infinity while the fusion penalties do not. The last item shows that the precision estimators of any two classes tend to a common estimate when the fusion penalty between them tends to infinity while all remaining penalty parameters remain finite.
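The second and third limits can be checked numerically in the non-fused single-class case (a Python/NumPy sketch of the targeted ridge estimator, written in the inversion-free form of Proposition 3; the sizes and penalty values are arbitrary):

```python
import numpy as np

def ridge_prec(S, T, lam):
    # Targeted ridge estimator for a single class (no fusion).
    p = S.shape[0]
    B = S - lam * T
    w, V = np.linalg.eigh(lam * np.eye(p) + 0.25 * B @ B)
    return ((V * np.sqrt(w)) @ V.T - 0.5 * B) / lam

rng = np.random.default_rng(2)
n, p = 200, 5                       # low-dimensional, so S is invertible
Y = rng.standard_normal((n, p))
Y -= Y.mean(axis=0)
S = Y.T @ Y / n
T = np.eye(p)                       # p.d. target

# lambda -> 0+ : the estimate approaches the MLE (item 2).
print(np.allclose(ridge_prec(S, T, 1e-9), np.linalg.inv(S), atol=1e-4))
# lambda -> infinity : the estimate approaches the target (item 3).
print(np.allclose(ridge_prec(S, T, 1e6), T, atol=1e-4))
```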

The attractiveness of the general estimator hinges upon the efficiency by which it can be obtained. We state a result useful in this respect before turning to our computational approach in Section 2.3:

###### Proposition 3.

Let $\hat{\boldsymbol{\Omega}}_g$ be the precision matrix estimate (7) for the $g$th class and define $\hat{\boldsymbol{\Sigma}}_g \equiv \hat{\boldsymbol{\Omega}}_g^{-1}$. The estimate can then be obtained without inversion through:

$$\hat{\boldsymbol{\Omega}}_g = \frac{1}{\bar{\lambda}_g}\left[\hat{\boldsymbol{\Sigma}}_g - \left(\bar{\mathbf{S}}_g - \bar{\lambda}_g\bar{\mathbf{T}}_g\right)\right] = \frac{1}{\bar{\lambda}_g}\left\{\left[\bar{\lambda}_g\mathbf{I}_p + \tfrac{1}{4}\left(\bar{\mathbf{S}}_g - \bar{\lambda}_g\bar{\mathbf{T}}_g\right)^2\right]^{1/2} - \tfrac{1}{2}\left(\bar{\mathbf{S}}_g - \bar{\lambda}_g\bar{\mathbf{T}}_g\right)\right\}.$$
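The identity can be verified numerically (Python/NumPy sketch; the matrix square root is computed by eigendecomposition, and the data sizes and penalty value are arbitrary):

```python
import numpy as np

def sym_sqrt(M):
    # Matrix square root of a symmetric positive definite matrix.
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

rng = np.random.default_rng(3)
n, p, lam = 4, 6, 1.5               # high-dimensional class (p > n); lam arbitrary
Y = rng.standard_normal((n, p))
S = Y.T @ Y / n                     # singular sample covariance
T = np.eye(p)                       # p.d. target
B = S - lam * T

A = sym_sqrt(lam * np.eye(p) + 0.25 * B @ B)
Sigma_hat = A + 0.5 * B             # ridge covariance estimate
Omega_hat = (A - 0.5 * B) / lam     # precision estimate, no matrix inversion used

print(np.allclose(Omega_hat @ Sigma_hat, np.eye(p)))   # True
```

The check works because $A$ commutes with $B$, so $(A - \tfrac{1}{2}B)(A + \tfrac{1}{2}B) = A^2 - \tfrac{1}{4}B^2 = \bar{\lambda}\mathbf{I}_p$.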

### 2.3. Algorithm

Equation (7) allows for updating the precision estimate of class $g_0$ by plugging in the remaining $\hat{\boldsymbol{\Omega}}_g$, $g \neq g_0$, and assuming them fixed. Hence, from initial estimates, all precision estimates may be iteratively updated until some convergence criterion is reached. We propose a block coordinate ascent procedure to solve (5) by repeated use of the results in Proposition 1. This procedure is outlined in Algorithm 1. By the strict concavity of the problem in (5), the procedure guarantees that, contingent upon convergence, the unique maximizer is attained when considering all $\{\hat{\boldsymbol{\Omega}}_g\}$ jointly. Moreover, we can state the following result:

###### Proposition 4.

The block coordinate ascent procedure given in Algorithm 1 will always stay within the realm of positive definite matrices $\mathcal{S}^p_{++}$.

The procedure is implemented in the rags2ridges package within the R statistical language [34]. This implementation focuses on stability and efficiency. With regard to the former: Equivalent (in terms of the obtained estimator) alternatives to (8) can be derived that are numerically more stable for extreme values of $\boldsymbol{\Lambda}$. The most apparent such alternative is:

$$\bar{\mathbf{S}}_{g_0} = \mathbf{S}_{g_0}, \qquad \bar{\mathbf{T}}_{g_0} = \mathbf{T}_{g_0} + \sum_{g \neq g_0} \frac{\lambda_{g g_0}}{\lambda_{g_0\bullet}}(\boldsymbol{\Omega}_g - \mathbf{T}_g), \qquad \text{and} \qquad \bar{\lambda}_{g_0} = \frac{\lambda_{g_0\bullet}}{n_{g_0}}. \tag{9}$$

It ‘updates’ the target instead of the covariance and has the intuitive interpretation that the target matrix for a given class in the fused case is a combination of the actual class target matrix and the ‘target corrected’ estimates of remaining classes. The implementation makes use of this alternative where appropriate. See Section 1 of the Supplementary Material for details on alternative updating schemes.

Efficiency is secured through various routes. First, in certain special cases closed-form solutions to (5) exist. When appropriate, these explicit solutions are used. Moreover, these solutions may provide warm starts for the general problem. See Section 2 of the Supplementary Material for details on estimation in these special cases. Second, the result from Proposition 3 is used, meaning that the relatively expensive operation of matrix inversion is avoided. Third, additional computational speed was achieved by implementing core operations in C++ via the R-packages Rcpp and RcppArmadillo [15, 16, 18, 38]. These efforts make analyses with large $p$ feasible. Throughout, we will initialize the algorithm with positive definite initial estimates for all $g$.
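A minimal sketch of the block coordinate ascent in Python/NumPy (not the rags2ridges implementation; update (8) is applied directly, the targets serve as illustrative p.d. starting values, and a simple max-change criterion stands in for the convergence check):

```python
import numpy as np

def ridge_update(S_bar, T_bar, lam_bar):
    # Single-class update of Proposition 1, in the inversion-free form of Proposition 3.
    p = S_bar.shape[0]
    B = S_bar - lam_bar * T_bar
    w, V = np.linalg.eigh(lam_bar * np.eye(p) + 0.25 * B @ B)
    return ((V * np.sqrt(w)) @ V.T - 0.5 * B) / lam_bar

def fused_ridge(S, n, T, Lam, tol=1e-7, max_iter=1000):
    # S, T: lists of G class covariance/target matrices; n: class sizes;
    # Lam: (G x G) penalty matrix (ridge on the diagonal, fusion off it).
    G = len(S)
    Omega = [Tg.copy() for Tg in T]          # p.d. starting values
    for _ in range(max_iter):
        prev = [Og.copy() for Og in Omega]
        for g in range(G):
            S_bar = S[g] - sum(Lam[h, g] * (Omega[h] - T[h])
                               for h in range(G) if h != g) / n[g]
            lam_bar = Lam[:, g].sum() / n[g]
            Omega[g] = ridge_update(S_bar, T[g], lam_bar)
        if max(np.abs(Og - Pg).max() for Og, Pg in zip(Omega, prev)) < tol:
            break
    return Omega

rng = np.random.default_rng(4)
p, sizes = 5, [8, 12]                        # p > n_g in both classes
Ys = [rng.standard_normal((m, p)) for m in sizes]
S = [Y.T @ Y / m for Y, m in zip(Ys, sizes)]
T = [np.eye(p), np.eye(p)]
Lam = np.array([[1.0, 0.5], [0.5, 1.0]])     # arbitrary penalty matrix

Omega = fused_ridge(S, sizes, T, Lam)
```

Each update stays within $\mathcal{S}^p_{++}$, so the returned estimates are symmetric and positive definite.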

## 3. Penalty and target selection

### 3.1. The penalty graph and analysis of factorial designs

Equality of all class-specific ridge penalties $\lambda_{gg}$ is deemed restrictive, as is equality of all pair-specific fusion penalties $\lambda_{g_1 g_2}$. In many settings, such as the analysis of factorial designs, finer control over the individual values of $\lambda_{gg}$ and $\lambda_{g_1 g_2}$ befits the analysis. This will be motivated by several examples of increasing complexity. In order to do so, some additional notation is developed: The penalties of $\boldsymbol{\Lambda}$ can be summarized by a node- and edge-weighted graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where the vertex set $\mathcal{V}$ corresponds to the $G$ possible classes and the edge set $\mathcal{E}$ corresponds to the similarities to be retained. The weight of node $g$ is given by $\lambda_{gg}$ and the weight of edge $g_1 g_2$ is given by $\lambda_{g_1 g_2}$. We refer to $\mathcal{G}$ as the penalty graph associated with the penalty matrix $\boldsymbol{\Lambda}$. The penalty graph is simple and undirected as the penalty matrix is symmetric.

###### Example 1.

Consider $G = 2$ classes or subtypes (ST) of diffuse large B-cell lymphoma (DLBCL) patients with tumors resembling either so-called activated B-cells (ABC) or germinal centre B-cells (GCB). Patients with the latter subtype have superior overall survival [1]. As one phenotype is more common than the other, one might imagine a scenario where the two class sample sizes are sufficiently different that one group dominates. Numeric procedures to obtain a common ridge penalty (see, e.g., Section 3.2) would then be dominated by the smaller group. Hence, choosing non-equal class ridge penalties for each group will allow for a better analysis. In such a case, the following penalty graph and matrix would be suitable:

$$\mathrm{ABC}\,(\lambda_{11}) \;\overset{\lambda_f}{\text{------}}\; \mathrm{GCB}\,(\lambda_{22}), \qquad \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_{11} & \lambda_f \\ \lambda_f & \lambda_{22} \end{bmatrix}. \tag{10}$$

###### Example 2.

Consider data from a one-way factorial design where the factor is ordinal with classes A, B, and C. For simplicity, we choose the same ridge penalty for each class. Say we have prior information that A is closer to B and B is closer to C than A is to C. The fusion penalty on the pairs containing the intermediate level B might then be allowed to be stronger. The following penalty graph and matrix are thus sensible:

$$\boldsymbol{\Lambda} = \begin{bmatrix} \lambda & \lambda_{f_1} & \lambda_{f_2} \\ \lambda_{f_1} & \lambda & \lambda_{f_1} \\ \lambda_{f_2} & \lambda_{f_1} & \lambda \end{bmatrix}, \qquad \lambda_{f_1} \geq \lambda_{f_2}, \tag{11}$$

with classes ordered A, B, C, so that the pairs containing the intermediate level B carry the stronger fusion penalty.

Depending on the application, one might even omit the direct shrinkage between A and C by fixing the A–C fusion penalty to zero. A similar penalty scheme might also be relevant if one class of the factor is an unknown mix of the remaining classes and one wishes to borrow statistical power from such a class.

###### Example 3.

In two-way or multi-way factorial designs one might wish to retain similarities in the 'direction' of each factor along with a factor-specific penalty. Consider, say, 3 oncogenic data sets (DS$_1$, DS$_2$, DS$_3$) regarding ABC and GCB DLBCL cancer patients. This yields a total of $G = 3 \times 2 = 6$ classes of data. One choice of penalization of this $3 \times 2$ design is represented by the penalty graph and matrix below:

$$\boldsymbol{\Lambda} = \begin{bmatrix}
\lambda & \lambda_s & \lambda_d & 0 & \lambda_d & 0 \\
\lambda_s & \lambda & 0 & \lambda_d & 0 & \lambda_d \\
\lambda_d & 0 & \lambda & \lambda_s & \lambda_d & 0 \\
0 & \lambda_d & \lambda_s & \lambda & 0 & \lambda_d \\
\lambda_d & 0 & \lambda_d & 0 & \lambda & \lambda_s \\
0 & \lambda_d & 0 & \lambda_d & \lambda_s & \lambda
\end{bmatrix}, \tag{12}$$

where classes are ordered (DS$_1$, ABC), (DS$_1$, GCB), $\ldots$, (DS$_3$, GCB); here $\lambda_d$ fuses pairs sharing a subtype across data sets and $\lambda_s$ fuses the two subtypes within a data set.

This example would favor similarities (with the same force) only between pairs sharing a common level in each factor. This finer control allows users, or the employed algorithm, to penalize differences between data sets more (or less) strongly than differences between the ABC and GCB sub-classes. This corresponds to not applying direct shrinkage of interaction effects, which is of interest in some situations.

While the penalty graph primarily serves as an intuitive overview, it does provide some aid in the construction of the penalty matrix for multifactorial designs. For example, the construction of the penalty matrix (12) in Example 3 corresponds to a Cartesian graph product of two complete graphs similar to those given in (10) and (11). We stress that $\boldsymbol{\Lambda}$ and the targets $\{\mathbf{T}_g\}$ should be chosen carefully in conjunction with one another. Ideally, only strictly necessary penalization parameters (from the perspective of the desired analysis) should be introduced. Each additional penalty introduced will increase the difficulty of finding the optimal penalty values by increasing the dimension of the search space.
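The Cartesian-product construction for Example 3 can be sketched in Python/NumPy (the class ordering and the labels $\lambda$, $\lambda_d$ for the data-set factor, and $\lambda_s$ for the subtype factor are illustrative assumptions of this sketch):

```python
import numpy as np

lam, lam_d, lam_s = 1.0, 0.5, 0.25      # arbitrary ridge and fusion values
A_ds = np.ones((3, 3)) - np.eye(3)      # adjacency of the complete graph on 3 data sets
A_st = np.ones((2, 2)) - np.eye(2)      # adjacency of the complete graph on 2 subtypes

# Cartesian graph product: fuse only class pairs that share one factor level
# and differ in the other. Classes ordered (DS1,ABC), (DS1,GCB), ..., (DS3,GCB).
Lam = (lam_d * np.kron(A_ds, np.eye(2))
       + lam_s * np.kron(np.eye(3), A_st)
       + lam * np.eye(6))

# (DS1,ABC)-(DS2,ABC): same subtype, different data set -> lam_d
# (DS1,ABC)-(DS1,GCB): same data set, different subtype -> lam_s
# (DS1,ABC)-(DS2,GCB): differ in both factors -> no direct fusion
print(Lam[0, 2], Lam[0, 1], Lam[0, 3])
```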

### 3.2. Selection of penalty parameters

As the ℓ_2-penalty does not automatically induce sparsity in the estimate, it is natural to seek loss efficiency. We then use cross-validation (CV) for penalty parameter selection due to its relation to the minimization of the Kullback-Leibler divergence and the predictive accuracy stemming from its data-driven nature. Randomly divide the data of each class into $K$ disjoint subsets of approximately the same size. Previously, we have defined $\hat{\boldsymbol{\Omega}}_g$ to be the precision matrix estimate of the $g$th class. Let $\hat{\boldsymbol{\Omega}}_g^{\neg k}$ be the analogous estimate (with similar notational dependencies) for class $g$ based on all samples not in subset $k$. Also, let $\mathbf{S}_g^{k}$ denote the sample covariance matrix for class $g$ based on the data in subset $k$ and let $n_g^{k}$ denote the size of subset $k$ in class $g$. The $K$-fold CV score for our fused regularized precision estimate based on the fixed penalty matrix $\boldsymbol{\Lambda}$ can then be given as:

$$\mathrm{KCV}(\boldsymbol{\Lambda}) = \frac{1}{KG}\sum_{g=1}^{G}\sum_{k=1}^{K} n_g^{k}\left[-\ln\left|\hat{\boldsymbol{\Omega}}_g^{\neg k}\right| + \operatorname{tr}\left(\hat{\boldsymbol{\Omega}}_g^{\neg k}\mathbf{S}_g^{k}\right)\right] = -\frac{1}{KG}\sum_{g=1}^{G}\sum_{k=1}^{K} L_{kg}\left(\hat{\boldsymbol{\Omega}}_g^{\neg k}; \mathbf{S}_g^{k}\right).$$

One would then choose $\boldsymbol{\Lambda}^{*}$ such that

$$\boldsymbol{\Lambda}^{*} = \operatorname*{arg\,min}_{\boldsymbol{\Lambda}} \mathrm{KCV}(\boldsymbol{\Lambda}), \qquad \text{subject to: } \boldsymbol{\Lambda} \geq 0 \,\wedge\, \operatorname{diag}(\boldsymbol{\Lambda}) > 0. \tag{13}$$

The least biased predictive accuracy can be obtained by choosing $K$ as large as possible, i.e., such that each subset contains a single observation. This would give the fused version of leave-one-out CV (LOOCV). Unfortunately, LOOCV is computationally demanding for large $n_\bullet$ and/or large $p$. We propose to select the penalties by the computationally expensive LOOCV only if adequate computational power is available. In cases where it is not, we propose two alternatives.

Our first alternative is a special version of the LOOCV scheme that significantly reduces the computational cost. The special LOOCV (SLOOCV) is computed much like the LOOCV. However, only the estimate in the class of the omitted datum is updated. More specifically, the problem is given by:

$$\boldsymbol{\Lambda}^{\diamond} = \operatorname*{arg\,min}_{\boldsymbol{\Lambda}} \mathrm{SLOOCV}(\boldsymbol{\Lambda}), \qquad \text{subject to: } \boldsymbol{\Lambda} \geq 0 \,\wedge\, \operatorname{diag}(\boldsymbol{\Lambda}) > 0, \tag{14}$$

with

$$\mathrm{SLOOCV}(\boldsymbol{\Lambda}) = -\frac{1}{n_\bullet}\sum_{g=1}^{G}\sum_{i=1}^{n_g} L_{ig}\left(\tilde{\boldsymbol{\Omega}}_g^{\neg i}; \mathbf{S}_{ig}\right).$$

The estimate $\tilde{\boldsymbol{\Omega}}_g^{\neg i}$ in (14) is obtained by updating only the estimate for class $g$ using Proposition 1. For all other classes $g' \neq g$, the current estimates are retained. The motivation for the SLOOCV is that a single observation in a given class does not exert heavy direct influence on the estimates in the other classes. This way the number of fused ridge estimations for each given $\boldsymbol{\Lambda}$ and each given leave-one-out sample is reduced from $G$ to a single estimation. Our second and fastest alternative is an approximation of the fused LOOCV score. This approximation can be used as an alternative to (S)LOOCV when the class sample sizes are relatively large (precisely the scenario where LOOCV is infeasible). See Section 3 of the Supplementary Material for detailed information on this approximation.
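The CV logic can be sketched for the non-fused single-class special case (Python/NumPy; the grid and data sizes are arbitrary, and the single-class targeted ridge estimator stands in for the full fused estimator):

```python
import numpy as np

def ridge_prec(S, T, lam):
    # Targeted ridge estimator for a single class (special case of Proposition 1).
    p = S.shape[0]
    B = S - lam * T
    w, V = np.linalg.eigh(lam * np.eye(p) + 0.25 * B @ B)
    return ((V * np.sqrt(w)) @ V.T - 0.5 * B) / lam

def loocv_score(Y, T, lam):
    # Negative predictive log-likelihood, averaged over left-out observations.
    n, p = Y.shape
    score = 0.0
    for i in range(n):
        Y_train = np.delete(Y, i, axis=0)
        Y_train = Y_train - Y_train.mean(axis=0)
        S_train = Y_train.T @ Y_train / (n - 1)
        Om = ridge_prec(S_train, T, lam)
        S_test = np.outer(Y[i], Y[i])        # 'covariance' of the left-out datum
        score += -np.linalg.slogdet(Om)[1] + np.trace(S_test @ Om)
    return score / n

rng = np.random.default_rng(6)
n, p = 20, 30                                # high-dimensional class
Y = rng.standard_normal((n, p))
Y -= Y.mean(axis=0)
T = np.eye(p)

grid = [0.01, 0.1, 1.0, 10.0]
scores = [loocv_score(Y, T, lam) for lam in grid]
lam_star = grid[int(np.argmin(scores))]      # selected penalty value
```

In the fused case one would minimize over the entries of the penalty matrix instead of a single scalar, with the search-space dimension growing accordingly.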

### 3.3. Choice of target matrices

The target matrices $\mathbf{T}_g$ can be used to encode prior information and their choice is highly dependent on the application at hand. As they influence the efficacy as well as the amount of bias of the estimate, it is of some importance to make a well-informed choice. Here, we describe several options of increasing levels of informativeness.

In the non-fused setting, the consideration of a scalar target matrix $\mathbf{T} = \alpha\mathbf{I}_p$ for some $\alpha \geq 0$ leads to a computational benefit stemming from the property of rotation equivariance [42]: Under such targets the ridge estimator only operates on the eigenvalues of the sample covariance matrix. This benefit transfers to the fused setting for the estimator described in Proposition 1. Hence, one may consider $\mathbf{T}_g = \alpha_g\mathbf{I}_p$ with $\alpha_g \geq 0$ for each class $g$. The limited fused ridge problem in Price et al. [33] corresponds to choosing $\alpha_g = 0$ for all $g$, such that a common target $\mathbf{T} = \mathbf{0}$ is employed. This can be considered the least informative target possible. We generally argue against the use of the non-p.d. target $\mathbf{0}$, as it implies shrinking the class precision matrices towards the null matrix and thus towards infinite variance.

Choosing $\alpha_g$ to be strictly positive implies a (slightly) more informative choice. The rotation equivariance property dictates that it is sensible to choose $\alpha_g$ based on empirical information regarding the eigenvalues of $\mathbf{S}_g$. One such choice could be the average of the reciprocals of the non-zero eigenvalues of $\mathbf{S}_g$. A straightforward alternative would be to choose the reciprocal of the average eigenvalue of $\mathbf{S}_g$. In the special case of (6), where a common target is employed, the analogous choice would be based on the eigenvalues of the pooled covariance matrix $\mathbf{S}_\bullet$.
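A sketch of constructing such a scalar target in Python/NumPy (the tolerance used to decide which eigenvalues count as non-zero is an arbitrary choice of this sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 15, 40                              # high-dimensional class
Y = rng.standard_normal((n, p))
Y -= Y.mean(axis=0)
S = Y.T @ Y / n

w = np.linalg.eigvalsh(S)
nonzero = w[w > 1e-10 * w.max()]           # numerically non-zero eigenvalues of S
alpha = np.mean(1.0 / nonzero)             # average reciprocal non-zero eigenvalue
T = alpha * np.eye(p)                      # strictly p.d. scalar target
```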

More informative targets move beyond the scalar matrix. An example is the consideration of factor-specific targets for factorial designs. Recalling Example 3, one might deem the data set factor to be a ‘nuisance factor’. Hence, one might choose different targets for the levels of the remaining factor, based on training data or on pooled estimates of the corresponding samples. In general, the usage of pilot training data or (pathway) database information (or both) allows for the construction of target matrices with higher specificity. We illustrate how to construct targets from database information in the DLBCL application of Section 6.

## 4. Fused graphical modeling

### 4.1. To fuse or not to fuse

As a preliminary step to downstream modeling one might consider testing the hypothesis of no class heterogeneity—and therefore the necessity of fusing—amongst the class-specific precision matrices. Effectively, one then wishes to test the null hypothesis $H_0 \colon \mathbf{\Omega}_1 = \cdots = \mathbf{\Omega}_G$. Under $H_0$ an explicit estimator is available in which the fused penalty parameters play no role, cf. Section 2.2 of the Supplementary Material. Here we suggest a score test [6] for the evaluation of $H_0$, in conjunction with a way to generate its null distribution in order to assess the extremity of the observed statistic.

A score test is convenient as it only requires estimation under the null hypothesis, allowing us to exploit the availability of an explicit estimator. The score statistic equals:

$$U = -\sum_{g=1}^{G} \left(\frac{\partial \mathcal{L}(\{\mathbf{\Omega}_g\};\{\mathbf{S}_g\})}{\partial \mathbf{\Omega}_g}\right)^{\!\top} \left(\frac{\partial^2 \mathcal{L}(\{\mathbf{\Omega}_g\};\{\mathbf{S}_g\})}{\partial \mathbf{\Omega}_g \,\partial \mathbf{\Omega}_g^{\top}}\right)^{\!-1} \frac{\partial \mathcal{L}(\{\mathbf{\Omega}_g\};\{\mathbf{S}_g\})}{\partial \mathbf{\Omega}_g} \;\Bigg|_{\mathbf{\Omega}_g = \widehat{\mathbf{\Omega}}_{H_0}},$$

where $\widehat{\mathbf{\Omega}}_{H_0}$ denotes the precision estimate under $H_0$ given in (S4), which is common to all classes $g$. The gradient can be considered in vectorized form and is readily available from (25). The Hessian of the log-likelihood equals $-\tfrac{n_g}{2}\bigl(\mathbf{\Omega}_g^{-1} \otimes \mathbf{\Omega}_g^{-1}\bigr)$. For practical purposes of evaluating the score statistic, we employ the identity $(\mathbf{A} \otimes \mathbf{B})\operatorname{vec}(\mathbf{C}) = \operatorname{vec}(\mathbf{B}\mathbf{C}\mathbf{A}^{\top})$, which avoids the manipulation of $p^2 \times p^2$-dimensional matrices. Hence, the test statistic is computed by

$$\widehat{U} = \sum_{g=1}^{G} \operatorname{vec}\bigl(\widehat{\mathbf{X}}_g\bigr)^{\top} \operatorname{vec}\bigl(\widehat{\mathbf{\Omega}}_{H_0} \widehat{\mathbf{X}}_g \widehat{\mathbf{\Omega}}_{H_0}\bigr) = \sum_{g=1}^{G} \operatorname{tr}\bigl[\widehat{\mathbf{X}}_g \bigl(\widehat{\mathbf{\Omega}}_{H_0} \widehat{\mathbf{X}}_g \widehat{\mathbf{\Omega}}_{H_0}\bigr)\bigr],$$

where $\widehat{\mathbf{X}}_g = \sqrt{n_g/2}\,\bigl(\mathbf{S}_g - \widehat{\mathbf{\Omega}}_{H_0}^{-1}\bigr)$ denotes the scaled gradient term for class $g$ evaluated under $H_0$.
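The computational gain from the vec/Kronecker identity can be checked numerically. In this hedged numpy sketch, `Omega` stands in for the estimate under the null hypothesis and `X` for a symmetric gradient term; both are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
p = 4
A = rng.standard_normal((p, p))
Omega = A @ A.T + p * np.eye(p)          # p.d. stand-in for the H0 estimate
X = rng.standard_normal((p, p))
X = (X + X.T) / 2                        # symmetric stand-in gradient term

# Naive route: build the p^2 x p^2 Kronecker product explicitly
vecX = X.reshape(-1, order="F")          # column-major vec(X)
U_naive = vecX @ (np.kron(Omega, Omega) @ vecX)

# Efficient route via (A ⊗ B) vec(C) = vec(B C A^T) and the trace form
U_fast = np.trace(X @ Omega @ X @ Omega)

assert np.isclose(U_naive, U_fast)
```

Only $p \times p$ matrices are ever formed on the efficient route, which is what makes the statistic practical for large $p$.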

The null distribution of $\widehat{U}$ can be generated by permutation of the class labels: one permutes the class labels, re-estimates $\mathbf{\Omega}$ under $H_0$, and re-calculates the test statistic. The observed test statistic is obtained from the non-permuted class labels and the regular fused estimator. The $p$-value is readily obtained by comparing the observed test statistic to the null distribution obtained from the test statistic under permuted class labels. We note that the test is conditional on the choice of penalty.

### 4.2. Graphical modeling

A contemporary use for precision matrices is found in the reconstruction and analysis of networks through graphical modeling. Graphical models merge probability distributions of random vectors with graphs that express the conditional (in)dependencies between the constituent random variables. In the fusion setting one might think of the class precisions as sharing a (partly) common origin (conditional independence graph) to which fusion appeals. We focus on class-specific graphs $\mathcal{G}_g = (\mathcal{V}, \mathcal{E}_g)$ with a finite set $\mathcal{V}$ of vertices (or nodes) and a set $\mathcal{E}_g$ of edges. The vertices correspond to a collection of random variables $\{Y_1, \ldots, Y_p\}$, and we consider the same vertex set of cardinality $p$ for all classes $g$. That is, we consider the same variables in all classes. The edge set $\mathcal{E}_g$ is a collection of pairs of distinct vertices that are connected by an undirected edge, and this collection may differ between classes. When we assume multivariate normality for all classes, we are considering multiple Gaussian graphical models.

Conditional independence between a pair of variables in the Gaussian graphical model corresponds to zero entries in the (class-specific) precision matrix. Let $\widehat{\mathbf{\Omega}}_g$ denote a generic estimate of the precision matrix in class $g$. Then the following relations hold for all pairs $\{Y_j, Y_{j'}\}$ with $j \neq j'$:

$$\bigl(\widehat{\mathbf{\Omega}}_g\bigr)_{jj'} = \widehat{\omega}^{(g)}_{jj'} = 0 \iff Y_j \perp\!\!\!\perp Y_{j'} \mid \mathcal{V} \setminus \{Y_j, Y_{j'}\} \text{ in class } g \iff (Y_j, Y_{j'}) \notin \mathcal{E}_g.$$

Hence, determining the conditional (in)dependence structure of the variables for class $g$—or, equivalently, the edge set $\mathcal{E}_g$ of $\mathcal{G}_g$—amounts to determining the support of $\widehat{\mathbf{\Omega}}_g$.
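The support-to-edge-set correspondence can be made concrete with a small numpy sketch; `edge_set` is an illustrative helper that reads edges off the non-zero off-diagonal entries of a (here exactly sparse) precision matrix:

```python
import numpy as np

def edge_set(Omega_hat, tol=1e-10):
    # Edges of the conditional independence graph: pairs (j, j'), j < j',
    # with a non-zero off-diagonal precision entry
    p = Omega_hat.shape[0]
    return {(j, k) for j in range(p) for k in range(j + 1, p)
            if abs(Omega_hat[j, k]) > tol}

Omega = np.array([[2., 1., 0.],
                  [1., 2., 1.],
                  [0., 1., 2.]])
edges = edge_set(Omega)   # edges (0, 1) and (1, 2); (0, 2) is absent
```

With a ridge estimate the entries are not exactly zero, which is why the next subsection discusses a principled sparsification step.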

### 4.3. Edge selection

We stress that support determination may be skipped entirely, as the estimated precision matrices can be interpreted as complete (weighted) graphs. For sparser graphical representations we resort to support determination by the local false discovery rate (lFDR) procedure [17] proposed by Schäfer and Strimmer [39]. This procedure assumes that the nonredundant off-diagonal entries of the partial correlation matrix

$$\bigl(\widehat{\mathbf{P}}_g\bigr)_{jj'} = -\,\widehat{\omega}^{(g)}_{jj'}\bigl(\widehat{\omega}^{(g)}_{jj}\,\widehat{\omega}^{(g)}_{j'j'}\bigr)^{-\frac{1}{2}}$$

follow a mixture distribution representing null and present edges. The null distribution is known to be a scaled beta-distribution, which allows for estimating the lFDR:

$$\widehat{\operatorname{lFDR}}^{(g)}_{jj'} = P\Bigl((Y_j, Y_{j'}) \notin \mathcal{E}_g \,\Big|\, \bigl(\widehat{\mathbf{P}}_g\bigr)_{jj'}\Bigr),$$

which gives the empirical posterior probability that the edge between $Y_j$ and $Y_{j'}$ is null in class $g$, conditional on the observed corresponding partial correlation. The analogous probability that an edge is present is obtained by considering $1 - \widehat{\operatorname{lFDR}}^{(g)}_{jj'}$. See [17, 39, 42] for further details on the lFDR procedure. Our strategy will be to select, for each class, only those edges for which $1 - \widehat{\operatorname{lFDR}}^{(g)}_{jj'}$ surpasses a certain threshold (see Section 6). This two-step procedure of regularization followed by subsequent support determination has the advantage that it enables probabilistic statements about the inclusion (or exclusion) of edges.

### 4.4. Common and differential (sub-)networks

After estimation and sparsification of the class precision matrices, the identification of commonalities and differences between the graphical estimates is of natural interest. Here we consider some (summary) measures to aid such identification. Assume in the following that multiple graphical models have been identified by the sparsified estimates and that the corresponding graphs are denoted by $\mathcal{G}_1, \ldots, \mathcal{G}_G$.

An obvious method of comparison is by pairwise graph differences or intersections. We utilize the differential network between classes $g_1$ and $g_2$ to provide an overview of edges present in one class but not the other. The common network is composed of the edges present in both graphs. We also define the edge-weighted total network of the graphs $\mathcal{G}_1, \ldots, \mathcal{G}_G$ as the graph formed by their union, where the weight of an edge is given by the number of classes in which that edge is present. More simply, it is determined by summing the adjacency matrices of $\mathcal{G}_1$ to $\mathcal{G}_G$. Analogously, the signed edge-weighted total network takes into account the stability of the sign of an edge over the classes by summing signed adjacency matrices. Naturally, the classes can also be compared by one or more summary statistics at node-, edge-, and network-level per class [cf. 29].
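These pairwise comparisons reduce to elementwise operations on 0/1 adjacency matrices, as in this toy numpy sketch (the two adjacency matrices are synthetic):

```python
import numpy as np

# 0/1 adjacency matrices of two sparsified class graphs (toy example)
A1 = np.array([[0, 1, 1],
               [1, 0, 0],
               [1, 0, 0]])
A2 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]])

common = A1 * A2              # common network: edges present in both classes
diff_1_vs_2 = A1 * (1 - A2)   # differential network: in class 1, not class 2
total = A1 + A2               # edge-weighted total network (weight = #classes)
```

The signed variant is obtained the same way by summing adjacency matrices whose entries carry the sign of the corresponding partial correlations.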

We also propose the idea of ‘network rewiring’. Suppose an investigator is interested in the specific interaction between genes $A$ and $B$ for classes $g_1$ and $g_2$. The desire is to characterize the dependency between genes $A$ and $B$ and determine the differences between the two classes. To do so, we suggest using the decomposition of the covariance of $A$ and $B$ into the individual contributions of all paths between $A$ and $B$. A path $z$ between $A$ and $B$ in a graph for class $g$ is, following Lauritzen [24], defined to be a sequence of distinct vertices $\langle v_1, \ldots, v_{t_z} \rangle$ with $v_1 = A$ and $v_{t_z} = B$ such that $(v_i, v_{i+1}) \in \mathcal{E}_g$ for all $i$. The possibility of the mentioned decomposition was shown by Jones and West [22] and, in terms of the sparsified precision estimate $\widehat{\mathbf{\Omega}}^0_g$, can be stated as:

$$\text{(15)}\qquad \operatorname{Cov}(A, B) = \sum_{z \in \mathcal{Z}_{AB}} (-1)^{t_z + 1}\, \widehat{\omega}_{v_1 v_2}\, \widehat{\omega}_{v_2 v_3} \cdots \widehat{\omega}_{v_{t_z - 1} v_{t_z}}\, \frac{\bigl|(\widehat{\mathbf{\Omega}}^0_g)_{\neg z}\bigr|}{\bigl|\widehat{\mathbf{\Omega}}^0_g\bigr|},$$

where $\mathcal{Z}_{AB}$ is the set of all paths between $A$ and $B$ and $(\widehat{\mathbf{\Omega}}^0_g)_{\neg z}$ denotes the matrix with the rows and columns corresponding to the vertices of the path $z$ removed. Each term of the covariance decomposition in (15) can be interpreted as the flow of information through a given path between $A$ and $B$. Imagine performing this decomposition for $A$ and $B$ in both classes $g_1$ and $g_2$. For each path, we can then identify whether it runs through the common network, or utilizes the differential networks unique to the classes. The paths that pass through the differential networks can be thought of as a ‘rewiring’ between the groups (in particular when compared to the common network). In summary, the covariance between a node pair can be separated into a component that is common and a component that is differential (or rewired).

###### Example 4.

Suppose we have two graphs, $\mathcal{G}_1$ and $\mathcal{G}_2$, for classes $g_1$ and $g_2$, and consider the covariance between nodes $A$ and $B$. In $\mathcal{G}_1$ the covariance is decomposed into contributions by three paths, while in $\mathcal{G}_2$ the contributions come from two paths, only one of which is shared with $\mathcal{G}_1$. Depending on the size of the contributions we might conclude that network 1 has some ‘rewired pathways’ compared to the other. This method gives a concise overview of the estimated interactions between two given genes, which genes mediate or moderate these interactions, as well as how the interaction patterns differ across the classes. In turn this might suggest candidate genes for perturbation or knock-down experiments.
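The decomposition in (15) can be verified numerically on a small example. This hedged numpy sketch (with $t_z$ taken as the number of vertices on a path, matching the sign convention of the stated formula; `simple_paths` and `cov_decomposition` are illustrative names) checks that the path terms sum to the corresponding covariance entry:

```python
import numpy as np

def simple_paths(adj, a, b):
    """All simple paths (vertex sequences) from a to b in the graph."""
    p = len(adj)
    out = []
    def dfs(path):
        v = path[-1]
        if v == b:
            out.append(list(path))
            return
        for w in range(p):
            if adj[v][w] and w not in path:
                path.append(w)
                dfs(path)
                path.pop()
    dfs([a])
    return out

def cov_decomposition(Omega, a, b):
    """Per-path terms of the Jones & West decomposition: a path z with
    t_z vertices contributes (-1)^(t_z + 1) times the product of the
    precision entries along z times det(Omega with z removed) / det(Omega)."""
    p = Omega.shape[0]
    adj = (np.abs(Omega) > 1e-12) & ~np.eye(p, dtype=bool)
    det = np.linalg.det(Omega)
    terms = []
    for z in simple_paths(adj.tolist(), a, b):
        w = np.prod([Omega[z[i], z[i + 1]] for i in range(len(z) - 1)])
        keep = [v for v in range(p) if v not in z]
        subdet = np.linalg.det(Omega[np.ix_(keep, keep)]) if keep else 1.0
        terms.append((-1) ** (len(z) + 1) * w * subdet / det)
    return terms

Omega = np.array([[2., 1., 0.],
                  [1., 2., 1.],
                  [0., 1., 2.]])
Sigma = np.linalg.inv(Omega)
terms = cov_decomposition(Omega, 0, 2)   # paths from node 0 to node 2
```

Here only the single path $0\!-\!1\!-\!2$ connects the end nodes, and its term reproduces $\operatorname{Cov}(Y_0, Y_2)$ exactly.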

## 5. Simulation study

In this section we explore the behavior of the fused estimator and measure its performance in four different scenarios. Performance is measured primarily by the squared Frobenius loss,

$$L^{(g)}_F\bigl(\widehat{\mathbf{\Omega}}_g(\Lambda), \mathbf{\Omega}_g\bigr) = \bigl\|\widehat{\mathbf{\Omega}}_g(\Lambda) - \mathbf{\Omega}_g\bigr\|_F^2,$$

between the class precision estimate and the true population class precision matrix. However, the performance is also assessed in terms of the quadratic loss,

$$L^{(g)}_Q\bigl(\widehat{\mathbf{\Omega}}_g(\Lambda), \mathbf{\Omega}_g\bigr) = \bigl\|\widehat{\mathbf{\Omega}}_g(\Lambda)\mathbf{\Omega}_g^{-1} - \mathbf{I}_p\bigr\|_F^2.$$

The risk, defined as the expected loss associated with an estimator, say,

$$\mathcal{R}_F\bigl\{\widehat{\mathbf{\Omega}}_g(\Lambda)\bigr\} = \operatorname{E}\bigl[L^{(g)}_F\bigl(\widehat{\mathbf{\Omega}}_g(\Lambda), \mathbf{\Omega}_g\bigr)\bigr],$$

is robustly approximated by the median loss over repeated simulations and corresponding estimations.
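The two losses are straightforward to compute; a minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def frobenius_loss(Om_hat, Om):
    # squared Frobenius loss: || Om_hat - Om ||_F^2
    return np.linalg.norm(Om_hat - Om, "fro") ** 2

def quadratic_loss(Om_hat, Om):
    # quadratic loss: || Om_hat @ Om^{-1} - I_p ||_F^2
    p = Om.shape[0]
    return np.linalg.norm(Om_hat @ np.linalg.inv(Om) - np.eye(p), "fro") ** 2

Om = np.diag([1.0, 2.0])
fl = frobenius_loss(Om, Om)        # a perfect estimate has zero loss
ql = quadratic_loss(2 * Om, Om)    # ||2 I - I||_F^2 = p = 2
```

In a simulation loop one would collect these losses over replicates and report the median as the risk approximation.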

We designed four simulation scenarios to explore the properties and performance of the fused ridge estimator and alternatives. Scenario (1) evaluates the fused ridge estimator under two choices of the penalty matrix, the non-fused ridge estimator applied individually to the classes, and the non-fused ridge estimator using the pooled covariance matrix, in two sub-scenarios (1a) and (1b) that differ in the number of non-zero off-diagonal bands. Scenario (2) evaluates the fused ridge estimator under three different choices of target matrices, (2a)–(2c). Scenario (3) evaluates the fused ridge estimator for varying network topologies and degrees of class homogeneity. Specifically, for (3a) scale-free topology and (3b) small-world topology, each with (3i) low class homogeneity and (3ii) high class homogeneity. Scenario (4) investigates the fused estimator under non-equal class sample sizes. Except for scenario 4, we make no distinction between the loss in different classes. Except for scenario 1, we use penalty matrices with a common ridge penalty and a common fusion penalty.

### 5.1. Scenario 1: Fusion versus no fusion

Scenario 1 explores the loss-efficiency of the fused estimate versus non-fused estimates as a function of the class sample size for fixed $p$, and hence for different $p$-to-sample-size ratios. Banded population precision matrices are simulated for the classes. Specifically, we set

$$\text{(16)}\qquad (\mathbf{\Omega}_g)_{jj'} = \frac{k+1}{|j - j'| + 1}\, \mathbb{1}\bigl[\,|j - j'| \leq k\,\bigr]$$

with $k$ non-zero off-diagonal bands. The sub-scenario (1a) uses bands whereas (1b)