1. Introduction
High-dimensional data are ubiquitous in modern statistics. Consequently, the fundamental problem of estimating the covariance matrix or its inverse (the precision matrix) has received renewed attention. Suppose we have $n$ i.i.d. observations of a $p$-dimensional variate distributed as $\mathcal{N}_p(\mathbf{0}, \boldsymbol{\Sigma})$. The Gaussian log-likelihood parameterized in terms of the precision matrix $\boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}$ is then given by

(1) $\mathcal{L}(\boldsymbol{\Omega}; \mathbf{S}) \propto \ln|\boldsymbol{\Omega}| - \operatorname{tr}(\mathbf{S}\boldsymbol{\Omega}),$

where $\mathbf{S}$ is the sample covariance matrix. When $n > p$, the maximum of (1) is attained at the maximum likelihood estimate (MLE) $\hat{\boldsymbol{\Omega}}^{\mathrm{ML}} = \mathbf{S}^{-1}$. However, in the high-dimensional case, i.e., when $p > n$, the sample covariance matrix is singular and its inverse ceases to exist. Furthermore, when $p$ approaches $n$ from below, the sample covariance matrix may be ill-conditioned and its inversion becomes numerically unstable. Hence, these situations necessitate the use of regularization techniques.
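The necessity of regularization can be made concrete in a few lines of NumPy. This is an illustrative sketch; the dimensions and the ridge value are invented, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50  # fewer observations than variables

# n i.i.d. p-variate standard normal observations; S = Y'Y / n is the MLE.
Y = rng.standard_normal((n, p))
S = Y.T @ Y / n

# With p > n the sample covariance matrix has rank at most n, hence is singular.
rank = np.linalg.matrix_rank(S)
print(rank < p)  # True: S^{-1} does not exist

# A ridge-type remedy: shrinking towards a p.d. target restores invertibility.
lam = 0.5
S_reg = S + lam * np.eye(p)
print(np.all(np.linalg.eigvalsh(S_reg) > 0))  # True: S + lam*I is p.d.
```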
Here, we study the simultaneous estimation of numerous precision matrices when multiple classes of high-dimensional data are present. Suppose $\mathbf{y}_{ig}$ is a realization of a $p$-dimensional Gaussian random vector for $i = 1, \ldots, n_g$ independent observations nested within classes $g = 1, \ldots, G$, each with class-dependent covariance $\boldsymbol{\Sigma}_g$, i.e., $\mathbf{y}_{ig} \sim \mathcal{N}_p(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ for each designated class $g$. Hence, for each class a data set consisting of the $n_g \times p$ matrix $\mathbf{Y}_g$ is observed. Without loss of generality $\boldsymbol{\mu}_g = \mathbf{0}$ can be assumed, as each data set can be centered around its column means. The class-specific sample covariance matrix, given by $\mathbf{S}_g = n_g^{-1}\mathbf{Y}_g^{\top}\mathbf{Y}_g$, then constitutes the well-known MLE of $\boldsymbol{\Sigma}_g$ as discussed above. The closely related pooled sample covariance matrix

(2) $\mathbf{S}_{\bullet} = \frac{1}{n_{\bullet}} \sum_{g=1}^{G} n_g \mathbf{S}_g,$

where $n_{\bullet} = \sum_{g=1}^{G} n_g$, is an oft-used estimate of the common covariance matrix across classes. In the high-dimensional case (implying $p > n_g$ or even $p > n_{\bullet}$) the $\mathbf{S}_g$ and $\mathbf{S}_{\bullet}$ are singular and their inverses do not exist. Our primary interest thus lies in estimating the precision matrices $\boldsymbol{\Omega}_g = \boldsymbol{\Sigma}_g^{-1}$, as well as their commonalities and differences, when $p > n_g$. We will develop a general penalized ML framework to this end, which we designate targeted fused ridge estimation.
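A minimal sketch of the class-specific and pooled sample covariance matrices just defined; the number of classes, sample sizes, and dimension are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
p = 6
Ys = [rng.standard_normal((n_g, p)) for n_g in (10, 15, 25)]  # G = 3 classes

def sample_cov(Y):
    """Class-specific ML sample covariance after column-centering."""
    Yc = Y - Y.mean(axis=0)
    return Yc.T @ Yc / Y.shape[0]

S_list = [sample_cov(Y) for Y in Ys]
n_list = [Y.shape[0] for Y in Ys]

# Pooled sample covariance: sample-size weighted average of the S_g.
n_tot = sum(n_list)
S_pool = sum(n_g * S_g for n_g, S_g in zip(n_list, S_list)) / n_tot
```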
The estimation of multiple precision matrices from high-dimensional data classes is of interest in many applications. The field of oncogenomics, for example, often deals with high-dimensional data from high-throughput experiments. Class membership may have different connotations in such settings. It may refer to certain subclasses within a single data set, such as cancer subtypes (cancer is a very heterogeneous disease, even when present in a single organ). It may also designate different data sets or studies. Likewise, the class indicator may refer to a conjunction of both subclass and study membership to form a two-way design of factors of interest (e.g., breast cancer subtypes present in a batch of study-specific data sets), as is often the case in oncogenomics. Our approach is thus motivated by the meta-analytic setting, where we aim for an integrative analysis in terms of simultaneously considering multiple data (sub)classes, data sets, or both. The desire is to borrow statistical power across classes by effectively increasing the sample size in order to improve the sensitivity and specificity of discoveries.
1.1. Relation to literature and overview
There have been many proposals for estimating a single precision matrix in high-dimensional data settings. A popular approach is to amend (1) with an $\ell_1$ penalty [48, 2, 19, 49]. The solution to this penalized problem is generally referred to as the graphical lasso, and it is popular as it performs automatic model selection, i.e., the resulting estimate is sparse. It is heavily used in Gaussian graphical modeling (GGM) as the support of a Gaussian precision matrix represents a Markov random field [24].
The $\ell_1$ approach has been extended to deal with more than a single sample group. Guo et al. [21] have proposed a parametrization of class-specific precision matrices that expresses the individual elements as a product of shared and class-specific factors. They include penalties on both the shared and class-specific factors in order to jointly estimate the sparse precision matrices (representing graphical models). The penalty on the shared factors promotes a shared sparsity structure, while the penalty on the class-specific factors promotes class-specific deviations from the shared sparsity structure. Danaher et al. [13] have generalized these efforts by proposing the joint graphical lasso, which allows for various penalty structures. They study two particular choices: the group graphical lasso that encourages a shared sparsity structure across the class-specific precision matrices, and the fused graphical lasso that promotes a shared sparsity structure as well as shared precision element values. A Bayesian approach to inferring multiple sparse precision matrices can be found in Peterson et al. [32].
While simultaneous estimation and model selection can be deemed elegant, automatic sparsity is not always an asset. It may be that one is intrinsically interested in more accurate representations of class-specific precision matrices for usage in, say, covariance-regularized regression [46] or discriminant analysis [33]. In such a situation one is not after sparse representations, and one may prefer a regularization method that shrinks the estimated elements of the precision matrices proportionally. In addition, when indeed considering network representations of data, the true class-specific graphical models need not be (extremely) sparse in terms of containing many zero elements. The $\ell_1$ penalty is unable to retrieve the sparsity pattern when the number of truly non-null elements exceeds the available sample size [42]. In such a situation one may wish to couple a non-sparsity-inducing penalty with a post-hoc selection step allowing for probabilistic control over element selection. We therefore consider $\ell_2$- or ridge-type penalization.
In Section 2 the targeted fused ridge estimation framework is presented. The proposed fused penalty allows for the simultaneous estimation of multiple precision matrices from high-dimensional data classes that chiefly share the same structure but that may differ in locations of interest. The approach is targeted in the sense that it allows for the specification of target matrices that may encode prior information. The framework is flexible and general, containing the recent work of Price et al. [33] and van Wieringen and Peeters [42] as special cases. It may be viewed as an $\ell_2$ alternative to the work of Danaher et al. [13]. The method is contingent upon the selection of penalty values and target matrices, topics that are treated in Section 3. Section 4 then focuses on the graphical interpretation of precision matrices. It shows how the fused ridge precision estimates may be coupled with post-hoc support determination in order to arrive at multiple graphical models. We will refer to this coupling as the fused graphical ridge. This then serves as a basis for integrative or meta-analytic network modeling. Section 5 assesses the performance of the proposed estimator through extensive simulation experiments. Section 6 illustrates the techniques by applying them in a large-scale integrative study of gene expression data of diffuse large B-cell lymphoma. The focus is then on finding common motifs and motif differences in network representations of (deregulated) molecular pathways. Section 7 concludes with a discussion.
1.2. Notation
Some additional notation must be introduced. Throughout the text and supplementary material, we use the following notation for certain matrix properties and sets: We use $\mathbf{A} \succ 0$ and $\mathbf{A} \succeq 0$ to denote symmetric positive definite (p.d.) and positive semi-definite (p.s.d.) matrices $\mathbf{A}$, respectively. By $\mathbb{R}$, $\mathbb{R}_{\geq 0}$, and $\mathbb{R}_{>0}$ we denote the real numbers, the non-negative real numbers, and the strictly positive real numbers, respectively. In notational analogue, $\mathcal{S}^p$, $\mathcal{S}^p_{+}$, and $\mathcal{S}^p_{++}$ are used to denote the space of real symmetric $p \times p$ matrices, the real symmetric p.s.d. matrices, and the real symmetric p.d. matrices, respectively. That is, e.g., $\mathcal{S}^p_{++} = \{\mathbf{A} \in \mathcal{S}^p : \mathbf{A} \succ 0\}$. Negative subscripts similarly denote negative reals and negative definiteness. By $\mathbf{A} \geq \mathbf{B}$ and similar we denote element-wise relations, i.e., $(\mathbf{A})_{jj'} \geq (\mathbf{B})_{jj'}$ for all $j, j'$. Matrix subscripts will usually denote class membership, e.g., $\mathbf{Y}_g$ denotes (the realization of) matrix $\mathbf{Y}$ in class $g$. For notational brevity we will often use the shorthand $\{\boldsymbol{\Omega}_g\}$ to denote the set $\{\boldsymbol{\Omega}_g\}_{g=1}^{G}$.
The following notation is used throughout for operations: We write $\operatorname{diag}(\mathbf{A})$ for the column vector composed of the diagonal of $\mathbf{A}$ and $\operatorname{vec}(\mathbf{A})$ for the vectorization operator, which stacks the columns of $\mathbf{A}$ on top of each other. Moreover, $\circ$ will denote the Hadamard product, while $\otimes$ refers to the Kronecker product.
We will also repeatedly make use of several special matrices and functions. We let $\mathbf{I}_p$ denote the $(p \times p)$-dimensional identity matrix. Similarly, $\mathbf{J}_p$ will denote the $(p \times p)$-dimensional all-ones matrix. In addition, $\mathbf{0}$ will denote the null matrix, the dimensions of which should be clear from the context. Lastly, $\|\cdot\|_F^2$ and $\mathbb{1}[\cdot]$ will stand for the squared Frobenius norm and the indicator function, respectively.

2. Targeted fused ridge estimation
2.1. A general penalized loglikelihood problem
Suppose $G$ classes of $p$-dimensional data exist and that the samples within each class are i.i.d. normally distributed. The log-likelihood for the data takes the following form under the additional assumption that all $n_{\bullet} = \sum_{g=1}^{G} n_g$ observations are independent:

(3) $\mathcal{L}(\{\boldsymbol{\Omega}_g\}; \{\mathbf{S}_g\}) \propto \sum_{g=1}^{G} n_g \left\{ \ln|\boldsymbol{\Omega}_g| - \operatorname{tr}(\mathbf{S}_g \boldsymbol{\Omega}_g) \right\}.$
We desire to obtain estimates of the precision matrices $\boldsymbol{\Omega}_g$ for each class. Though not a requirement, we primarily consider situations in which $p > n_g$ for all $g$, necessitating regularization. To this end, amend (3) with the fused ridge penalty given by

(4) $f^{\mathrm{FR}}(\{\boldsymbol{\Omega}_g\}; \{\mathbf{T}_g\}) = \sum_{g} \frac{\lambda_{gg}}{2} \|\boldsymbol{\Omega}_g - \mathbf{T}_g\|_F^2 + \sum_{g_1 \neq g_2} \frac{\lambda_{g_1 g_2}}{4} \left\| (\boldsymbol{\Omega}_{g_1} - \mathbf{T}_{g_1}) - (\boldsymbol{\Omega}_{g_2} - \mathbf{T}_{g_2}) \right\|_F^2,$

where the $\mathbf{T}_g$ indicate known class-specific target matrices (see also Section 3.3), the $\lambda_{gg}$ denote class-specific ridge penalty parameters, and the $\lambda_{g_1 g_2}$ are pair-specific fusion penalty parameters subject to the requirement that $\lambda_{g_1 g_2} = \lambda_{g_2 g_1} \geq 0$. All penalties can then be conveniently summarized in a non-negative symmetric matrix $\boldsymbol{\Lambda} = [\lambda_{g_1 g_2}]$, which we call the penalty matrix. The diagonal of $\boldsymbol{\Lambda}$ corresponds to the class-specific ridge penalties, whereas off-diagonal entries are the pair-specific fusion penalties. The rationale and use of the penalty matrix are motivated further in Section 3.1. Combining (3) and (4) yields a general targeted fused ridge estimation problem:
(5) $\max_{\{\boldsymbol{\Omega}_g\}} \; \sum_{g=1}^{G} n_g \left\{ \ln|\boldsymbol{\Omega}_g| - \operatorname{tr}(\mathbf{S}_g \boldsymbol{\Omega}_g) \right\} - f^{\mathrm{FR}}(\{\boldsymbol{\Omega}_g\}; \{\mathbf{T}_g\}).$
The problem of (5) is strictly concave. Furthermore, it is worth noting that non-zero fusion penalties alone, i.e., $\lambda_{g_1 g_2} > 0$ for all $g_1 \neq g_2$, will not guarantee uniqueness when $p > n_g$: in high dimensions, all ridge penalties $\lambda_{gg}$ should be strictly positive to ensure identifiability. These and other properties of the estimation problem are reviewed in Section 2.2.

The problem stated in (5) is very general. We shall sometimes consider a single common ridge penalty $\lambda_{gg} = \lambda$ for all $g$, as well as a common fusion penalty $\lambda_{g_1 g_2} = \lambda_f$ for all class pairs (cf., however, Section 3.1). This simplification leads to the first special case. Here, and analogous to (5), $\lambda$ controls the rate of shrinkage of each precision $\boldsymbol{\Omega}_g$ towards the corresponding target $\mathbf{T}_g$ [42], while $\lambda_f$ determines the retainment of entry-wise similarities between $\boldsymbol{\Omega}_{g_1}$ and $\boldsymbol{\Omega}_{g_2}$ for all class pairs $g_1 \neq g_2$.
When $\mathbf{T}_g = \mathbf{T}$ for all $g$, the problem further simplifies to

(6) $\max_{\{\boldsymbol{\Omega}_g\}} \; \sum_{g=1}^{G} n_g \left\{ \ln|\boldsymbol{\Omega}_g| - \operatorname{tr}(\mathbf{S}_g \boldsymbol{\Omega}_g) \right\} - \sum_{g} \frac{\lambda}{2} \|\boldsymbol{\Omega}_g - \mathbf{T}\|_F^2 - \sum_{g_1 \neq g_2} \frac{\lambda_f}{4} \|\boldsymbol{\Omega}_{g_1} - \boldsymbol{\Omega}_{g_2}\|_F^2,$

where the targets are seen to disappear from the fusion term. Lastly, when $\mathbf{T} = \mathbf{0}$ the problem (6) reduces to its simplest form, recently considered by Price et al. [33]. To support an intuitive feel for the fused ridge estimation problem, Appendix A studies its geometric interpretation in this latter context.
2.2. Estimator and properties
There is no explicit solution to (5) except in certain special cases; thus an iterative optimization procedure is needed for its general solution. As described in Section 2.3, we employ a coordinate ascent procedure which relies on the concavity of the penalized likelihood (see Lemma 3 in Appendix B.1) and repeated use of the following result, whose proof (as indeed all proofs) has been deferred to Appendix B.2:
Proposition 1.
Let $\mathbf{S}_g \in \mathcal{S}^p_{+}$ and let $\boldsymbol{\Lambda}$ be a fixed penalty matrix such that $\boldsymbol{\Lambda} \geq \mathbf{0}$ and $\lambda_{gg} > 0$. Furthermore, assume that $\boldsymbol{\Omega}_h$ is p.d. and fixed for all $h \neq g$. The maximizing argument for class $g$ of the optimization problem (5) is then given by

(7) $\hat{\boldsymbol{\Omega}}_g = \left\{ \left[ \bar{\lambda}_g \mathbf{I}_p + \tfrac{1}{4} \left( \bar{\mathbf{S}}_g - \bar{\lambda}_g \mathbf{T}_g \right)^2 \right]^{1/2} + \tfrac{1}{2} \left( \bar{\mathbf{S}}_g - \bar{\lambda}_g \mathbf{T}_g \right) \right\}^{-1},$

where

(8) $\bar{\mathbf{S}}_g = \mathbf{S}_g - \frac{1}{n_g} \sum_{h \neq g} \lambda_{gh} (\boldsymbol{\Omega}_h - \mathbf{T}_h), \qquad \bar{\lambda}_g = \frac{1}{n_g} \sum_{h} \lambda_{gh},$

with $\sum_h \lambda_{gh}$ denoting the sum of the $g$th column (or row) of $\boldsymbol{\Lambda}$.
Remark 1.
Remark 2.
Proposition 1 provides a function for updating the estimate of the $g$th class while fixing the remaining parameters. As a special case, consider the following: if all off-diagonal elements of $\boldsymbol{\Lambda}$ are zero, no 'class fusion' of the estimates takes place and the maximization problem decouples into individual, disjoint ridge estimations; see Corollary 1 in Appendix B.2. The next result summarizes some properties of (7):
Proposition 2.
Consider the estimator of Proposition 1 and its accompanying assumptions. Let $\hat{\boldsymbol{\Omega}}_g$ be the precision matrix estimate of the $g$th class. For this estimator, the following properties hold:

1. $\hat{\boldsymbol{\Omega}}_g \in \mathcal{S}^p_{++}$ for all $\boldsymbol{\Lambda}$ with $\lambda_{gg} > 0$;

2. $\hat{\boldsymbol{\Omega}}_g \to \mathbf{S}_g^{-1}$ as $\lambda_{gg} \to 0^{+}$ if $\lambda_{gh} = 0$ for all $h \neq g$ and $n_g > p$;

3. $\hat{\boldsymbol{\Omega}}_g \to \mathbf{T}_g$ as $\lambda_{gg} \to \infty$ if $\lambda_{g_1 g_2} < \infty$ for all $g_1 \neq g_2$;

4. $\|\hat{\boldsymbol{\Omega}}_{g_1} - \hat{\boldsymbol{\Omega}}_{g_2}\| \to 0$ as $\lambda_{g_1 g_2} \to \infty$ if all remaining penalty parameters are finite.
The first item of Proposition 2 implies that strictly positive ridge penalties $\lambda_{gg}$ are sufficient to guarantee p.d. estimates from the fused ridge estimator. The second item then implies that if 'class fusion' is absent, one obtains as the right-hand limit for class $g$ the standard MLE $\mathbf{S}_g^{-1}$, whose existence is only guaranteed when $n_g > p$. The third item shows that the fused ridge precision estimator for class $g$ is shrunken exactly to its target matrix when the ridge penalty tends to infinity while the fusion penalties do not. The last item shows that the precision estimators of any two classes tend to a common estimate when the fusion penalty between them tends to infinity while all remaining penalty parameters remain finite.
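To make the decoupled (non-fusion) case concrete, here is a NumPy sketch of the single-class targeted ridge estimator of van Wieringen and Peeters [42], written as the solution of the first-order condition $\Omega^{-1} - S - \lambda(\Omega - T) = 0$. The matrix-square-root route and the illustrative sizes are our own choices:

```python
import numpy as np

def symm_sqrt(M):
    """Symmetric p.s.d. square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def ridge_precision(S, lam, T):
    """Targeted ridge precision estimate: the stationary point of
    ln|O| - tr(S O) - (lam/2) ||O - T||_F^2 over p.d. matrices O."""
    p = S.shape[0]
    E = S - lam * T
    return np.linalg.inv(0.5 * E + symm_sqrt(lam * np.eye(p) + 0.25 * E @ E))

# Check the stationarity condition O^{-1} - S - lam*(O - T) = 0.
rng = np.random.default_rng(3)
Y = rng.standard_normal((10, 25))        # p = 25 > n = 10: S is singular
S = Y.T @ Y / 10
T = np.eye(25)
O = ridge_precision(S, lam=1.0, T=T)
grad = np.linalg.inv(O) - S - 1.0 * (O - T)
print(np.max(np.abs(grad)) < 1e-8)        # True: first-order condition holds
print(np.all(np.linalg.eigvalsh(O) > 0))  # True: estimate is p.d. despite p > n
```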
The attractiveness of the general estimator hinges upon the efficiency with which it can be obtained. We state a result useful in this respect before turning to our computational approach in Section 2.3:
Proposition 3.
Let $\hat{\boldsymbol{\Omega}}_g$ be the precision matrix estimate (7) for the $g$th class. The estimate can then be obtained without explicit matrix inversion.
2.3. Algorithm
Equation (7) allows for updating the precision estimate of class $g$ by plugging in the remaining estimates $\boldsymbol{\Omega}_h$, $h \neq g$, and assuming them fixed. Hence, from initial estimates, all precision estimates may be iteratively updated until some convergence criterion is met. We propose a block coordinate ascent procedure to solve (5) by repeated use of the results in Proposition 1. This procedure is outlined in Algorithm 1. By the strict concavity of the problem in (5), the procedure guarantees that, contingent upon convergence, the unique maximizer is attained when considering all classes jointly. Moreover, we can state the following result:
Proposition 4.
The block coordinate ascent procedure given in Algorithm 1 will always stay within the realm of positive definite matrices $\mathcal{S}^p_{++}$.
The procedure is implemented in the rags2ridges package for the R statistical language [34]. This implementation focuses on stability and efficiency. With regard to the former: equivalent (in terms of the obtained estimator) alternatives to (8) can be derived that are numerically more stable for extreme values of the penalty parameters. The most apparent such alternative is:
(9) $\bar{\mathbf{T}}_g = \mathbf{T}_g + \frac{1}{\sum_h \lambda_{gh}} \sum_{h \neq g} \lambda_{gh} (\boldsymbol{\Omega}_h - \mathbf{T}_h),$ to be used in (7) together with the unadjusted $\mathbf{S}_g$.
It ‘updates’ the target instead of the covariance and has the intuitive interpretation that the target matrix for a given class in the fused case is a combination of the actual class target matrix and the ‘target corrected’ estimates of remaining classes. The implementation makes use of this alternative where appropriate. See Section 1 of the Supplementary Material for details on alternative updating schemes.
Efficiency is secured in various ways. First, in certain special cases closed-form solutions to (5) exist. When appropriate, these explicit solutions are used. Moreover, these solutions may provide warm starts for the general problem. See Section 2 of the Supplementary Material for details on estimation in these special cases. Second, the result from Proposition 3 is used, meaning that the relatively expensive operation of matrix inversion is avoided. Third, additional computational speed was achieved by implementing core operations in C++ via the R packages Rcpp and RcppArmadillo [15, 16, 18, 38]. These efforts make analyses with large $p$ feasible. Throughout, we will initialize the algorithm with the same initial estimate for each class.
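Algorithm 1 itself is not reproduced here, but the block coordinate ascent idea can be sketched as follows. This is an illustrative reimplementation, not the rags2ridges code: the exact penalty scaling (ridge terms weighted by $\lambda_{gg}/2$, fusion terms by $\lambda_{g_1 g_2}/4$) is an assumption, and the per-class update applies the single-class closed form to fusion-adjusted inputs in the spirit of the barred quantities of Proposition 1:

```python
import numpy as np

def symm_sqrt(M):
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def ridge_update(S_bar, lam_bar, T):
    """Closed-form maximizer of ln|O| - tr(S_bar O) - (lam_bar/2)||O - T||_F^2."""
    p = S_bar.shape[0]
    E = S_bar - lam_bar * T
    return np.linalg.inv(0.5 * E + symm_sqrt(lam_bar * np.eye(p) + 0.25 * E @ E))

def fused_objective(Os, Ss, ns, Ts, Lam):
    """Penalized log-likelihood: sum_g n_g[ln|O_g| - tr(S_g O_g)] minus the
    fused ridge penalty (our scaling: lam_gg/2 ridge, lam_gh/4 fusion terms)."""
    G = len(Os)
    val = sum(n * (np.linalg.slogdet(O)[1] - np.trace(S @ O))
              for n, S, O in zip(ns, Ss, Os))
    for g in range(G):
        val -= 0.5 * Lam[g, g] * np.sum((Os[g] - Ts[g]) ** 2)
        for h in range(G):
            if h != g:
                D = (Os[g] - Ts[g]) - (Os[h] - Ts[h])
                val -= 0.25 * Lam[g, h] * np.sum(D ** 2)
    return val

def fused_ridge(Ss, ns, Ts, Lam, n_iter=50):
    """Block coordinate ascent: cycle over classes, each step solving the
    per-class problem exactly with fusion-adjusted S_bar and lam_bar."""
    G = len(Ss)
    Os = [np.eye(Ss[0].shape[0]) for _ in range(G)]
    for _ in range(n_iter):
        for g in range(G):
            fuse = sum(Lam[g, h] * (Os[h] - Ts[h]) for h in range(G) if h != g)
            S_bar = Ss[g] - fuse / ns[g]
            lam_bar = Lam[g].sum() / ns[g]   # g-th row sum of the penalty matrix
            Os[g] = ridge_update(S_bar, lam_bar, Ts[g])
    return Os

# Demonstration on G = 2 classes with p > n_g.
rng = np.random.default_rng(0)
p = 20
Ys = [rng.standard_normal((10, p)), rng.standard_normal((12, p))]
Ss = [Y.T @ Y / Y.shape[0] for Y in Ys]
ns = [10, 12]
Ts = [np.eye(p), np.eye(p)]
Lam = np.array([[1.0, 0.5], [0.5, 1.2]])
Os = fused_ridge(Ss, ns, Ts, Lam)
```

Since each block update is an exact maximizer of the objective in its own block, the penalized log-likelihood is non-decreasing over sweeps, consistent with the ascent property claimed above.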
3. Penalty and target selection
3.1. The penalty graph and analysis of factorial designs
Equality of all class-specific ridge penalties $\lambda_{gg}$ is deemed restrictive, as is equality of all pair-specific fusion penalties $\lambda_{g_1 g_2}$. In many settings, such as the analysis of factorial designs, finer control over the individual values of $\lambda_{gg}$ and $\lambda_{g_1 g_2}$ befits the analysis. This will be motivated by several examples of increasing complexity. In order to do so, some additional notation is developed: the penalties of $\boldsymbol{\Lambda}$ can be summarized by a node- and edge-weighted graph in which the vertex set corresponds to the possible classes and the edge set corresponds to the similarities to be retained. The weight of node $g$ is given by $\lambda_{gg}$ and the weight of the edge between $g_1$ and $g_2$ is then given by $\lambda_{g_1 g_2}$. We refer to this graph as the penalty graph associated with the penalty matrix $\boldsymbol{\Lambda}$. The penalty graph is simple and undirected as the penalty matrix is symmetric.
Example 1.
Consider $G = 2$ classes or subtypes (ST) of diffuse large B-cell lymphoma (DLBCL) patients with tumors resembling either so-called activated B-cells (ABC) or germinal centre B-cells (GCB). Patients with the latter subtype have superior overall survival [1]. As one phenotype is more common than the other, one might imagine a scenario where the two class sample sizes are sufficiently different. Numerical procedures to obtain a common ridge penalty (see, e.g., Section 3.2) would then be dominated by the smaller group. Hence, choosing non-equal ridge penalties for each group will allow for a better analysis. In such a case, the following penalty graph and matrix would be suitable:
(10) $\boldsymbol{\Lambda} = \begin{pmatrix} \lambda_{1} & \lambda_{f} \\ \lambda_{f} & \lambda_{2} \end{pmatrix},$ with class-specific ridge penalties $\lambda_1 \neq \lambda_2$ and a single fusion penalty $\lambda_f$.
Example 2.
Consider data from a one-way factorial design where the factor is ordinal with $G = 3$ classes A, B, and C. For simplicity, we choose the same ridge penalty $\lambda$ for each class. Say we have prior information that A is closer to B and B is closer to C than A is to C. The fusion penalties on the pairs containing the intermediate level B might then be allowed to be stronger. The following penalty graph and matrix are thus sensible:
(11) $\boldsymbol{\Lambda} = \begin{pmatrix} \lambda & \lambda_{f_1} & \lambda_{f_2} \\ \lambda_{f_1} & \lambda & \lambda_{f_1} \\ \lambda_{f_2} & \lambda_{f_1} & \lambda \end{pmatrix},$ with $\lambda_{f_1} \geq \lambda_{f_2}$.
Depending on the application, one might even omit the direct shrinkage between A and C by fixing the corresponding fusion penalty to zero. A similar penalty scheme might also be relevant if one class of the factor is an unknown mix of the remaining classes and one wishes to borrow statistical power from such a class.
Example 3.
In two-way or higher-way factorial designs one might wish to retain similarities in the 'direction' of each factor along with a factor-specific penalty. Consider, say, 3 oncogenic data sets (DS$_1$, DS$_2$, DS$_3$) regarding ABC and GCB DLBCL cancer patients. This yields a total of $G = 2 \times 3 = 6$ classes of data. One choice of penalization of this $2 \times 3$ design is represented by the penalty graph and matrix below:
(12) $\boldsymbol{\Lambda} = \begin{pmatrix} \lambda & \lambda_{d} & \lambda_{d} & \lambda_{s} & 0 & 0 \\ \lambda_{d} & \lambda & \lambda_{d} & 0 & \lambda_{s} & 0 \\ \lambda_{d} & \lambda_{d} & \lambda & 0 & 0 & \lambda_{s} \\ \lambda_{s} & 0 & 0 & \lambda & \lambda_{d} & \lambda_{d} \\ 0 & \lambda_{s} & 0 & \lambda_{d} & \lambda & \lambda_{d} \\ 0 & 0 & \lambda_{s} & \lambda_{d} & \lambda_{d} & \lambda \end{pmatrix},$ with classes ordered as (ABC, DS$_1$), (ABC, DS$_2$), (ABC, DS$_3$), (GCB, DS$_1$), (GCB, DS$_2$), (GCB, DS$_3$), fusion penalty $\lambda_d$ between data sets within a subtype, and fusion penalty $\lambda_s$ between subtypes within a data set.
This example would favor similarities (with the same force) only between pairs sharing a common level in each factor. This finer control allows users, or the employed algorithm, to penalize differences between data sets more (or less) strongly than differences between the ABC and GCB subclasses. This corresponds to not applying direct shrinkage to interaction effects, which is of interest in some situations.
While the penalty graph primarily serves as an intuitive overview, it does provide some aid in the construction of the penalty matrix for multi-factorial designs. For example, the construction of the penalty matrix (12) in Example 3 corresponds to a Cartesian graph product of two complete graphs similar to those given in (10) and (11). We note that the ridge and fusion penalties should be chosen carefully in conjunction with the choice of target matrices. Ideally, only strictly necessary penalty parameters (from the perspective of the desired analysis) should be introduced. Each additional penalty introduced will increase the difficulty of finding the optimal penalty values by increasing the dimension of the search space.
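The Cartesian-product construction can be sketched with Kronecker products, here for the hypothetical $2 \times 3$ design of Example 3 (the penalty values and the class ordering are invented for illustration):

```python
import numpy as np

# Classes ordered as (subtype, data set) pairs:
# (ABC,DS1), (ABC,DS2), (ABC,DS3), (GCB,DS1), (GCB,DS2), (GCB,DS3).
lam_ridge = 1.0   # common class ridge penalty (illustrative value)
lam_ds = 0.5      # fusion across data sets, same subtype
lam_st = 0.3      # fusion across subtypes, same data set

A_st = np.ones((2, 2)) - np.eye(2)   # complete graph on the 2 subtypes
A_ds = np.ones((3, 3)) - np.eye(3)   # complete graph on the 3 data sets

# Cartesian graph product: an edge iff the pair differs in exactly one factor.
fusion = lam_st * np.kron(A_st, np.eye(3)) + lam_ds * np.kron(np.eye(2), A_ds)
Lam = fusion + lam_ridge * np.eye(6)
```

Pairs that differ in both factors (e.g., (ABC, DS$_1$) versus (GCB, DS$_2$)) receive a zero fusion penalty, matching the "no direct shrinkage of interaction effects" reading of Example 3.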
3.2. Selection of penalty parameters
As the $\ell_2$ penalty does not automatically induce sparsity in the estimate, it is natural to seek loss efficiency. We then use cross-validation (CV) for penalty parameter selection due to its relation to the minimization of the Kullback-Leibler divergence and the predictive accuracy stemming from its data-driven nature. Randomly divide the data of each class $g$ into $K$ disjoint subsets of approximately the same size. Previously, we have defined $\hat{\boldsymbol{\Omega}}_g$ to be the precision matrix estimate of the $g$th class. Let $\hat{\boldsymbol{\Omega}}_g^{\neg k}$ be the analogous estimate (with similar notational dependencies) for class $g$ based on all samples not in subset $k$. Also, let $\mathbf{S}_g^{k}$ denote the sample covariance matrix for class $g$ based on the data in subset $k$, and let $n_g^{k}$ denote the size of subset $k$ in class $g$. The $K$-fold CV score for our fused regularized precision estimate based on the fixed penalty matrix $\boldsymbol{\Lambda}$ can then be given as

$\mathrm{CV}_K(\boldsymbol{\Lambda}) = \sum_{g=1}^{G} \sum_{k=1}^{K} n_g^{k} \left[ \operatorname{tr}\left( \mathbf{S}_g^{k} \hat{\boldsymbol{\Omega}}_g^{\neg k} \right) - \ln\left| \hat{\boldsymbol{\Omega}}_g^{\neg k} \right| \right].$

One would then choose $\boldsymbol{\Lambda}$ such that

(13) $\hat{\boldsymbol{\Lambda}} = \arg\min_{\boldsymbol{\Lambda}} \mathrm{CV}_K(\boldsymbol{\Lambda}).$

The least biased predictive accuracy can be obtained by choosing $K$ such that each left-out subset contains exactly one observation. This would give the fused version of leave-one-out CV (LOOCV). Unfortunately, LOOCV is computationally demanding for large $n_{\bullet}$ and/or large $p$. We propose to select the penalties by the computationally expensive LOOCV only if adequate computational power is available. In cases where it is not, we propose two alternatives.
Our first alternative is a special version of the LOOCV scheme that significantly reduces the computational cost. The special LOOCV (SLOOCV) score is computed much like the LOOCV score. However, only the estimate in the class of the omitted datum is updated. More specifically, the problem is given by:
(14) 
where the estimate in (14) is obtained by updating only the estimate for the class of the left-out observation using Proposition 1; all other class estimates are kept fixed. The motivation for the SLOOCV is that a single observation in a given class does not exert heavy direct influence on the estimates in the other classes. This way, the number of fused ridge estimations for each given $\boldsymbol{\Lambda}$ and each given leave-one-out sample is substantially reduced. Our second and fastest alternative is an approximation of the fused LOOCV score. This approximation can be used as an alternative to (S)LOOCV when the class sample sizes are relatively large (precisely the scenario where LOOCV is infeasible). See Section 3 of the Supplementary Material for detailed information on this approximation.
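For intuition, the $K$-fold CV score and grid search can be sketched for a single class and a scalar penalty; in the fused case the inner estimator is replaced by the full fused fit over all classes. Function names, the grid, and the sizes are illustrative:

```python
import numpy as np

def symm_sqrt(M):
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def ridge_precision(S, lam, T):
    E = S - lam * T
    return np.linalg.inv(0.5 * E + symm_sqrt(lam * np.eye(S.shape[0]) + 0.25 * E @ E))

def kfold_cv_score(Y, lam, T, K=5, seed=0):
    """K-fold CV negative predictive log-likelihood (up to constants):
    sum_k n_k [ tr(S_k O_{-k}) - ln|O_{-k}| ]; smaller is better."""
    n = Y.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    score = 0.0
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        S_train = Y[train].T @ Y[train] / len(train)
        S_test = Y[test].T @ Y[test] / len(test)
        O = ridge_precision(S_train, lam, T)
        score += len(test) * (np.trace(S_test @ O) - np.linalg.slogdet(O)[1])
    return score

# Grid search over the (here one-dimensional) penalty.
rng = np.random.default_rng(4)
Y = rng.standard_normal((30, 40))        # p = 40 > n = 30
T = np.eye(40)
grid = [0.01, 0.1, 1.0, 10.0]
scores = [kfold_cv_score(Y, lam, T) for lam in grid]
lam_opt = grid[int(np.argmin(scores))]
```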
3.3. Choice of target matrices
The target matrices $\mathbf{T}_g$ can be used to encode prior information, and their choice is highly dependent on the application at hand. As they influence the efficacy as well as the bias of the estimate, it is of some importance to make a well-informed choice. Here, we describe several options of increasing levels of informativeness.
In the non-fused setting, the consideration of a scalar target matrix $\mathbf{T} = \alpha \mathbf{I}_p$ for some $\alpha \geq 0$ leads to a computational benefit stemming from the property of rotation equivariance [42]: under such targets the ridge estimator only operates on the eigenvalues of the sample covariance matrix. This benefit transfers to the fused setting for the estimator described in Proposition 1. Hence, one may consider $\mathbf{T}_g = \alpha_g \mathbf{I}_p$ with $\alpha_g \geq 0$ for each $g$. The limited fused ridge problem in Price et al. [33] corresponds to choosing $\alpha_g = 0$ for all $g$, such that a common target $\mathbf{T} = \mathbf{0}$ is employed. This can be considered the least informative target possible. We generally argue against the use of this non-p.d. target, as it implies shrinking the class precision matrices towards the null matrix and thus towards infinite variance.
Choosing $\alpha_g$ to be strictly positive implies a (slightly) more informative choice. The rotation equivariance property dictates that it is sensible to choose $\alpha_g$ based on empirical information regarding the eigenvalues of $\mathbf{S}_g$. One such choice could be the average of the reciprocals of the non-zero eigenvalues of $\mathbf{S}_g$. In the special case of (6), where all classes share a common target $\mathbf{T} = \alpha \mathbf{I}_p$, the analogous choice would be based on the eigenvalues of the pooled covariance matrix $\mathbf{S}_{\bullet}$.
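The eigenvalue-based choice of the scalar $\alpha$ can be sketched directly; the numerical threshold separating zero from non-zero eigenvalues is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(5)
Y = rng.standard_normal((20, 30))
S = Y.T @ Y / 20                 # singular: p = 30 > n = 20

# Scalar target T = alpha * I with alpha the average reciprocal of the
# non-zero eigenvalues of S (one of the data-driven choices discussed above).
eigvals = np.linalg.eigvalsh(S)
nonzero = eigvals[eigvals > 1e-10]
alpha = np.mean(1.0 / nonzero)
T = alpha * np.eye(S.shape[0])

print(np.all(np.linalg.eigvalsh(T) > 0))  # True: the target is p.d.
```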
More informative targets move beyond the scalar matrix. An example would be the consideration of factor-specific targets for factorial designs. Recalling Example 3, one might deem the data-set factor to be a 'nuisance factor'. Hence, one might choose different subtype-specific targets based on training data or on the pooled estimates of the ABC and GCB samples, respectively. In general, the usage of pilot training data or (pathway) database information (or both) allows for the construction of target matrices with higher specificity. We illustrate how to construct targets from database information in the DLBCL application of Section 6.
4. Fused graphical modeling
4.1. To fuse or not to fuse
As a preliminary step to downstream modeling one might consider testing the hypothesis of no class heterogeneity (and therefore no necessity of fusing) amongst the class-specific precision matrices. Effectively, one then wishes to test the null hypothesis $H_0: \boldsymbol{\Omega}_1 = \cdots = \boldsymbol{\Omega}_G$. Under $H_0$ an explicit estimator is available in which the fused penalty parameters play no role, cf. Section 2.2 of the Supplementary Material. Here we suggest a score test [6] for the evaluation of $H_0$ in conjunction with a way to generate its null distribution in order to assess its observational extremity. A score test is convenient as it only requires estimation under the null hypothesis, allowing us to exploit the availability of an explicit estimator. The score statistic equals:
where the precision estimate under $H_0$ is given in (S4) and holds for all classes $g$. The gradient can be considered in vectorized form and is readily available from (25), and the Hessian of the log-likelihood is obtained accordingly. For practical purposes of evaluating the score statistic, we employ an identity which avoids the manipulation of $(p^2 \times p^2)$-dimensional matrices, from which the test statistic is computed.
The null distribution of the test statistic can be generated by permutation of the class labels: one permutes the class labels, followed by re-estimation of the precision matrices under $H_0$ and recalculation of the test statistic. The observed test statistic is obtained from the non-permuted class labels and the regular fused estimator. The $p$-value is readily obtained by comparing the observed test statistic to the null distribution obtained from the test statistics under permuted class labels. We note that the test is conditional on the choice of the penalty parameters.
4.2. Graphical modeling
A contemporary use of precision matrices is found in the reconstruction and analysis of networks through graphical modeling. Graphical models merge probability distributions of random vectors with graphs that express the conditional (in)dependencies between the constituent random variables. In the fusion setting one might think of the class precisions as sharing a (partly) common origin (conditional independence graph) to which fusion appeals. We focus on class-specific graphs $\mathcal{G}_g = (\mathcal{V}, \mathcal{E}_g)$ with a finite set $\mathcal{V}$ of vertices (or nodes) and set of edges $\mathcal{E}_g$. The vertices correspond to a collection of random variables, and we consider the same vertex set of cardinality $p$ for all classes $g$. That is, we consider the same variables in all classes. The edge set $\mathcal{E}_g$ is a collection of pairs of distinct vertices that are connected by an undirected edge, and this collection may differ between classes. In case we assume multivariate normality for all classes, we are considering multiple Gaussian graphical models. Conditional independence between a pair of variables in the Gaussian graphical model corresponds to zero entries in the (class-specific) precision matrix. Let $\hat{\boldsymbol{\Omega}}_g$ denote a generic estimate of the precision matrix in class $g$. Then the following relations hold for all pairs $\{j, j'\}$ with $j \neq j'$:

$(\hat{\boldsymbol{\Omega}}_g)_{jj'} = 0 \iff \{j, j'\} \notin \mathcal{E}_g \iff Y_j \perp\!\!\perp Y_{j'} \mid \{Y_{j''} : j'' \neq j, j'\} \text{ in class } g.$

Hence, determining the (in)dependence structure of the variables for class $g$, or equivalently the edge set $\mathcal{E}_g$, amounts to determining the support of $\hat{\boldsymbol{\Omega}}_g$.
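Support determination operates on the partial correlations, obtained by scaling the precision matrix as $\rho_{jj'} = -\omega_{jj'}/\sqrt{\omega_{jj}\,\omega_{j'j'}}$. A small sketch, checked against the classical three-variable recursion formula (the correlation matrix is invented):

```python
import numpy as np

def partial_correlations(Omega):
    """Partial correlation matrix from a precision matrix:
    rho_{jj'} = -omega_{jj'} / sqrt(omega_{jj} * omega_{j'j'}), unit diagonal."""
    d = 1.0 / np.sqrt(np.diag(Omega))
    P = -(Omega * np.outer(d, d))
    np.fill_diagonal(P, 1.0)
    return P

# Sanity check against the classical three-variable formula
# rho_{12|3} = (r12 - r13*r23) / sqrt((1 - r13^2) * (1 - r23^2)).
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
P = partial_correlations(np.linalg.inv(R))
r12, r13, r23 = 0.5, 0.3, 0.4
expected = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))
print(np.isclose(P[0, 1], expected))  # True
```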
4.3. Edge selection
We stress that support determination may be skipped entirely, as the estimated precision matrices can be interpreted as complete (weighted) graphs. For sparser graphical representations we resort to support determination by the local false discovery rate (lFDR) procedure [17] proposed by Schäfer and Strimmer [39]. This procedure assumes that the non-redundant off-diagonal entries of the partial correlation matrix follow a mixture distribution representing null and present edges. The null distribution is known to be a scaled beta distribution, which allows for estimating the lFDR: the empirical posterior probability that the edge between $j$ and $j'$ is null in class $g$ conditional on the observed corresponding partial correlation. The analogous probability that an edge is present can be obtained by considering one minus the lFDR. See [17, 39, 42] for further details on the lFDR procedure. Our strategy will be to select, for each class, only those edges for which the posterior probability of edge presence surpasses a certain threshold (see Section 6). This two-step procedure of regularization followed by subsequent support determination has the advantage that it enables probabilistic statements about the inclusion (or exclusion) of edges.

4.4. Common and differential (sub)networks
After estimation and sparsification of the class precision matrices, the identification of commonalities and differences between the graphical estimates is of natural interest. Here we consider some (summary) measures to aid such identification. Assume in the following that multiple graphical models have been identified by the sparsified estimates and that the corresponding graphs are denoted by $\mathcal{G}_1, \ldots, \mathcal{G}_G$.
An obvious method of comparison is by pairwise graph differences or intersections. We utilize the differential network between classes $g_1$ and $g_2$ to provide an overview of edges present in one class but not the other. The common network is composed of the edges present in both graphs. We also define the edge-weighted total network of the graphs as the graph formed by the union of $\mathcal{G}_1, \ldots, \mathcal{G}_G$, where the weight of an edge is given by the number of classes in which it appears. More simply, it is determined by summing the adjacency matrices of $\mathcal{G}_1$ to $\mathcal{G}_G$. Analogously, the signed edge-weighted total network takes into account the stability of the sign of an edge over the classes by summing signed adjacency matrices. Naturally, the classes can also be compared by one or more summary statistics at node, edge, and network level per class [cf. 29].
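The pairwise comparisons above reduce to set operations on edge sets; a sketch with invented adjacency matrices standing in for sparsified supports:

```python
import numpy as np

def edge_set(A):
    """Edges of an undirected graph given by a binary adjacency matrix."""
    j, jp = np.nonzero(np.triu(A, k=1))
    return set(zip(j.tolist(), jp.tolist()))

# Two illustrative sparsified supports (adjacency matrices) for classes 1 and 2.
A1 = np.array([[0, 1, 1, 0],
               [1, 0, 1, 0],
               [1, 1, 0, 1],
               [0, 0, 1, 0]])
A2 = np.array([[0, 1, 0, 0],
               [1, 0, 1, 1],
               [0, 1, 0, 1],
               [0, 1, 1, 0]])

E1, E2 = edge_set(A1), edge_set(A2)
common = E1 & E2                  # edges present in both classes
diff_12 = E1 - E2                 # differential network: in class 1 only
total_weights = A1 + A2           # edge-weighted total network

print(sorted(common))             # [(0, 1), (1, 2), (2, 3)]
```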
We also propose the idea of 'network rewiring'. Suppose an investigator is interested in the specific interaction between genes $j$ and $j'$ for classes $g_1$ and $g_2$. The desire is to characterize the dependency between genes $j$ and $j'$ and determine the differences between the two classes. To do so, we suggest using the decomposition of the covariance of $Y_j$ and $Y_{j'}$ into the individual contributions of all paths between $j$ and $j'$. A path between $j$ and $j'$ of length $m$ in a graph $\mathcal{G}_g$ for class $g$ is, following Lauritzen [24], defined to be a sequence of distinct vertices $(j = v_0, v_1, \ldots, v_m = j')$ such that $v_{u-1} v_u \in \mathcal{E}_g$ for all $u = 1, \ldots, m$. The possibility of the mentioned decomposition was shown by Jones and West [22] and, in terms of $\boldsymbol{\Omega}_g$, can be stated as:
(15) $\operatorname{cov}(Y_j, Y_{j'}) = \sum_{P \in \mathcal{P}_{jj'}} (-1)^{m} \left[ \prod_{u=1}^{m} (\boldsymbol{\Omega}_g)_{v_{u-1} v_u} \right] \frac{|\boldsymbol{\Omega}_{g \setminus P}|}{|\boldsymbol{\Omega}_g|},$

where $\mathcal{P}_{jj'}$ is the set of all paths between $j$ and $j'$, and $\boldsymbol{\Omega}_{g \setminus P}$ denotes the matrix $\boldsymbol{\Omega}_g$ with the rows and columns corresponding to the vertices of the path $P$ removed (the determinant of the empty matrix is taken to be one). Each term of the covariance decomposition in (15) can be interpreted as the flow of information through a given path between $j$ and $j'$ in $\mathcal{G}_g$. Imagine performing this decomposition for $j$ and $j'$ in both $\mathcal{G}_{g_1}$ and $\mathcal{G}_{g_2}$. For each path, we can then identify whether it runs through the common network or utilizes the differential networks unique to the classes. The paths that pass through the differential networks can be thought of as a 'rewiring' between the groups (in particular, compared to the common network). In summary, the covariance between a node pair can be separated into a component that is common and a component that is differential (or rewired).
Example 4.
Suppose we have the following two graphs for classes $g_1$ and $g_2$, and consider the covariance between nodes $j$ and $j'$. In each graph the covariance is decomposed into contributions from the paths between $j$ and $j'$, with some paths shared between the two graphs and others not. Depending on the size of the contributions we might conclude that one network has some 'rewired pathways' compared to the other. This method gives a concise overview of the estimated interactions between two given genes, which genes mediate or moderate these interactions, as well as how the interaction patterns differ across the classes. In turn this might suggest candidate genes for perturbation or knock-down experiments.
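The covariance decomposition can be verified numerically on a small invented example. The sign convention used below, $(-1)^m$ with $m$ the number of edges in the path and with all path vertices removed from the determinant in the numerator, is the one that reproduces the covariance exactly in our checks; conventions in the literature differ by how path length is counted:

```python
import numpy as np

def path_contribution(Omega, path):
    """Contribution of one path (v0, ..., vm) to cov(Y_v0, Y_vm):
    (-1)^m * product of omega over the path's edges * det(Omega with the
    path's vertices removed) / det(Omega); det of an empty matrix is 1."""
    m = len(path) - 1
    prod = np.prod([Omega[path[u], path[u + 1]] for u in range(m)])
    keep = [v for v in range(Omega.shape[0]) if v not in path]
    det_rest = np.linalg.det(Omega[np.ix_(keep, keep)]) if keep else 1.0
    return (-1) ** m * prod * det_rest / np.linalg.det(Omega)

def all_paths(Omega, j, jp, tol=1e-12):
    """All paths from j to jp along non-zero entries (distinct vertices)."""
    p = Omega.shape[0]
    paths = []
    def walk(path):
        v = path[-1]
        if v == jp:
            paths.append(tuple(path))
            return
        for w in range(p):
            if w not in path and abs(Omega[v, w]) > tol:
                walk(path + [w])
    walk([j])
    return paths

# A small p.d. precision matrix whose conditional independence graph is a 4-cycle.
Omega = np.array([[ 2.0, -0.5,  0.0, -0.3],
                  [-0.5,  2.0, -0.4,  0.0],
                  [ 0.0, -0.4,  2.0, -0.6],
                  [-0.3,  0.0, -0.6,  2.0]])
Sigma = np.linalg.inv(Omega)

# The path contributions sum exactly to the covariance (Jones and West, 2005).
total = sum(path_contribution(Omega, P) for P in all_paths(Omega, 0, 2))
print(np.isclose(total, Sigma[0, 2]))  # True
```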
5. Simulation study
In this section we explore and measure the performance of the fused estimator and its behavior in four different scenarios. Performance is measured primarily by the squared Frobenius loss $\|\hat{\boldsymbol{\Omega}}_g - \boldsymbol{\Omega}_g\|_F^2$ between the class precision estimate $\hat{\boldsymbol{\Omega}}_g$ and the true population class precision matrix $\boldsymbol{\Omega}_g$. However, performance is also assessed in terms of the quadratic loss $\|\hat{\boldsymbol{\Omega}}_g \boldsymbol{\Sigma}_g - \mathbf{I}_p\|_F^2$. The risk, defined as the expected loss associated with an estimator, is robustly approximated by the median loss over a repeated number of simulations and corresponding estimations.
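The losses and the median-risk approximation can be sketched as follows; the simple ridge-type estimator, the dimensions, and the repetition count are illustrative stand-ins, not the simulation design of this section:

```python
import numpy as np

def frobenius_loss(O_hat, O_true):
    """Squared Frobenius loss ||O_hat - O_true||_F^2."""
    return np.sum((O_hat - O_true) ** 2)

def quadratic_loss(O_hat, O_true):
    """Quadratic loss ||O_hat O_true^{-1} - I||_F^2 (one common convention)."""
    p = O_true.shape[0]
    return np.sum((O_hat @ np.linalg.inv(O_true) - np.eye(p)) ** 2)

# Risk approximated robustly by the median loss over repeated simulations.
rng = np.random.default_rng(6)
p, n, lam = 10, 50, 0.1
O_true = np.eye(p)
losses = []
for _ in range(25):
    Y = rng.multivariate_normal(np.zeros(p), np.linalg.inv(O_true), size=n)
    S = Y.T @ Y / n
    O_hat = np.linalg.inv(S + lam * np.eye(p))   # simple ridge-type estimate
    losses.append(frobenius_loss(O_hat, O_true))
median_risk = np.median(losses)
```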
We designed four simulation scenarios to explore the properties and performance of the fused ridge estimator and alternatives. Scenario (1) evaluates the fused ridge estimator under two choices of the penalty matrix, the non-fused ridge estimator applied individually to the classes, and the non-fused ridge estimator using the pooled covariance matrix, with (1a) identical and (1b) different population precision matrices. Scenario (2) evaluates the fused ridge estimator under three different choices of targets (2a, 2b, and 2c). Scenario (3) evaluates the fused ridge estimator for varying network topologies and degrees of class homogeneity: specifically, for (3a) scale-free topology and (3b) small-world topology, each with (3i) low class homogeneity and (3ii) high class homogeneity. Scenario (4) investigates the fused estimator under non-equal class sample sizes. Except for Scenario 4, we make no distinction between the loss in different classes. Except for Scenario 1, we use penalty matrices with a common ridge penalty and a common fusion penalty.
5.1. Scenario 1: Fusion versus no fusion
Scenario 1 explores the loss efficiency of the fused estimator versus non-fused estimators as a function of the class sample size for fixed $p$, and hence for different $p/n_g$ ratios. Banded population precision matrices are simulated for the classes. We set

(16)

with non-zero off-diagonal bands. Subscenario (1a) uses the same number of bands in both classes, whereas (1b) uses different numbers of bands across the classes. Hence, identical and very different population precision matrices are considered, respectively.
For each class sample size, the loss over repeated runs was computed. In each run, the optimal unrestricted penalty matrix