# The Spectral Condition Number Plot for Regularization Parameter Determination

Many modern statistical applications ask for the estimation of a covariance (or precision) matrix in settings where the number of variables is larger than the number of observations. There exists a broad class of ridge-type estimators that employs regularization to cope with the subsequent singularity of the sample covariance matrix. These estimators depend on a penalty parameter and choosing its value can be hard, in terms of being computationally unfeasible or tenable only for a restricted set of ridge-type estimators. Here we introduce a simple graphical tool, the spectral condition number plot, for informed heuristic penalty parameter selection. The proposed tool is computationally friendly and can be employed for the full class of ridge-type covariance (precision) estimators.

## Authors

• 9 publications
• 11 publications
• 6 publications
• ### Ridge-type Linear Shrinkage Estimation of the Matrix Mean of High-dimensional Normal Distribution

The estimation of the mean matrix of the multivariate normal distributio...
10/26/2019 ∙ by Ryota Yuasa, et al. ∙ 0

• ### Improving the condition number of estimated covariance matrices

High dimensional error covariance matrices are used to weight the contri...
10/25/2018 ∙ by Jemima M. Tabeart, et al. ∙ 0

• ### Breakdown points of penalized and hybrid M-estimators of covariance

We introduce a class of hybrid M-estimators of multivariate scatter whic...
02/28/2020 ∙ by David E. Tyler, et al. ∙ 0

• ### Graph-Guided Banding of the Covariance Matrix

Regularization has become a primary tool for developing reliable estimat...
06/01/2016 ∙ by Jacob Bien, et al. ∙ 0

• ### Penalty, Shrinkage, and Preliminary Test Estimators under Full Model Hypothesis

This paper considers a multiple regression model and compares, under ful...
03/24/2015 ∙ by Enayetur Raheem, et al. ∙ 0

• ### Distributionally Robust Formulation and Model Selection for the Graphical Lasso

Building on a recent framework for distributionally robust optimization ...
05/22/2019 ∙ by Pedro Cisneros-Velarde, et al. ∙ 0

• ### A Deterministic Streaming Sketch for Ridge Regression

We provide a deterministic space-efficient algorithm for estimating ridg...
02/05/2020 ∙ by Benwei Shi, et al. ∙ 5

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

The covariance matrix of a

-dimensional random vector

is of central importance in many statistical analysis procedures. Estimation of or its inverse (generally known as the precision matrix) are central to, for example, multivariate regression, time-series analysis, canonical correlation analysis, discriminant analysis, and Gaussian graphical modeling. Let be a realization of . It is well-known [50] that the sample covariance matrix is a poor estimate of when approaches the sample size or when . When approaches , will tend to become ill-conditioned. When , is singular leaving undefined. However, many contemporary applications in fields such as molecular biology, neuroimaging, and finance, encounter situations of interest where .

The estimation of can be improved by shrinking the eigenvalues of towards a central value [e.g., 50] or by convexly combining with some well-conditioned target matrix [e.g., 45]. These solutions define a class of estimators that can be caught under the umbrella term ‘ridge-type covariance estimation’. Such estimators depend on a penalty parameter determining the rate of shrinkage and choosing its value is of prime importance. Available procedures for choosing the penalty have some (situation dependent) disadvantages in the sense that (a) they can be computationally expensive, (b) they can be restricted to special cases within the class of ridge-type estimators, or (c) they are not guaranteed to result in a meaningful penalty-value. There is thus some demand for a generic and computationally friendly procedure on the basis of which one can (i) heuristically select an acceptable (minimal) value for the penalty parameter and (ii) assess if more formal procedures have indeed proposed an acceptable penalty-value. Here such a tool is provided on the basis of the matrix condition number.

The remainder is organized as follows. Section 2 reviews the class of ridge-type estimators of the covariance matrix. In addition, penalty parameter selection is reviewed and an exposé of the matrix condition number is given. The spectral condition number is central to the introduction of the spectral condition number plot in Section 3. This graphical display is posited as an exploratory tool, that may function as a fast and simple heuristic in determining the (minimum) value of the penalty parameter when employing ridge estimators of the covariance or precision matrix. Section 4 illustrates usage of the spectral condition number plot with data from the field of oncogenomics. Section 5 discourses on the software that implements the proposed graphical display. Sections 6 and 7 conclude with discussions.

## 2. Eigenstructure regularization and the condition number

### 2.1. Ridge-type shrinkage estimation

Regularization of the covariance matrix goes back to Stein [50, 51], who proposed shrinking the sample eigenvalues towards some central value. This work spurred a large body of literature [see, e.g., 23, 24, 64, 63]. Of late, two encompassing forms of what is referred to as ‘ridge-type covariance estimation’ have emerged.

The first form considers convex combinations of and some positive definite (p.d.) target matrix [12, 34, 35, 36, 45, 18]:

 (1) ^ΣI(λI)=(1−λI)^Σ+λIT,

with penalty parameter

. Such an estimator can be motivated from the Steinian bias-variance tradeoff as it seeks to balance the high-variance, low-bias matrix

with the lower-variance, higher-bias matrix . The second form is of more recent vintage and considers the ad-hoc estimator [61, 65, 22]:

 (2) ^ΣII(λII)=^Σ+λIIIp,

with

. This second form is motivated, much like how ridge regression was introduced by

Hoerl and Kennard [26], as an ad-hoc modification of in order to deal with singularity in the high-dimensional setting.

van Wieringen and Peeters [58] show that both (1) and (2) can be justified as penalized maximum likelihood estimators [cf. 61]. However, neither (1) nor (2) utilizes a proper -penalty in that perspective. Starting from the proper ridge-type -penalty , van Wieringen and Peeters [58] derive an alternative estimator:

 (3) ^Σa(λa)=[λaIp+14(^Σ−λaT)2]1/2+12(^Σ−λaT),

with . van Wieringen and Peeters [58] show that, when considering a p.d. , the estimator (3) is an alternative to (1) with superior behavior in terms of risk. When considering the non-p.d. choice , they show that (3) acts as an alternative to (2), again with superior behavior.

Clearly, one may obtain ridge estimators of the precision matrix by considering the inverses of (1), (2), and (3). For comparisons of these estimators see [39, 9, 58]. For expositions of other penalized covariance and precision estimators we confine by referring to [44].

### 2.2. Penalty parameter selection

The choice of penalty-value is crucial to the aforementioned estimators. Let denote a generic penalty. When choosing too small, an ill-conditioned estimate may ensue when (see Section 2.3). When choosing too large, relevant data signal may be lost. Many options for choosing are available. The ridge estimators, in contrast to -regularized estimators of the covariance or precision matrix [e.g., 19, 3], do not generally produce sparse estimates. This implies that model-selection-consistent methods (such as usage of the BIC), are not appropriate. Rather, for -type estimators, it is more appropriate to seek loss efficiency.

A generic strategy for determining an optimal value for that can be used for any ridge-type estimator of Section 2.1 is

-fold cross-validation (of the likelihood function). Asymptotically, such an approach can be explained in terms of minimizing Kullback-Leibler divergence. Unfortunately, this strategy is computationally prohibitive for large

and/or large . Lian [38] and Vujačić et al. [60] propose approximate solutions to the leave-one-out cross-validation score. While these approximations imply gains in computational efficiency, they are not guaranteed to propose a reasonable optimal value for .

Ledoit and Wolf [36] propose a strategy to determine analytically an optimal value for under a modified Frobenius loss for the estimator (1) under certain choices of [cf. 18]. This optimal value requires information on the unknown population matrix . The optimal penalty-value thus needs to be approximated with some estimate of . Ledoit and Wolf [36] utilize an -consistent estimator while Schäfer and Strimmer [45]

use the unbiased estimate

. In practice, this may result in overshrinkage [9] or even negative penalty-values [45].

Given the concerns stated above, there is some demand for a generic and computationally friendly tool for usage in the following situations of interest:

1. When one wants to speedily determine a (minimal) value for for which is well-conditioned;

2. When one wants to determine speedily whether an optimal proposed by some other (formal) procedure indeed leads to a well-conditioned estimate ;

3. When one wants to determine speedily a reasonable minimal value for for usage in a search-grid (for an optimal such value) by other, optimization-based, procedures.

In Section 3 we propose such a tool based on the spectral condition number.

### 2.3. Spectral condition number

The estimators from Section 2.1 are p.d. when their penalty-values are strictly positive. However, they are not necessarily well-conditioned for any strictly positive penalty-value when , especially when the penalty takes a value in the lower range. We seek to quantify the condition of the estimators w.r.t. a given penalty-value. To this end we utilize a condition number [59, 57].

Consider a nonsingular matrix as well as the matrix norm . The condition number w.r.t. matrix inversion can be defined as [25]

 (4)

indicating the sensitivity of inversion of w.r.t. small perturbations . When the norm in (4) is induced by a vector norm the condition number is characterized as [25]:

 (5) cond(A)=C(A):=∥A∥∥A−1∥.

For singular matrices would equal

. Hence, a high condition number is indicative of near-singularity and quantifies an ill-conditioned matrix. Indeed, the inverse of the condition number gives the relative distance of

to the set of singular matrices [11]:

 dist(A,S)=infS∈S∥A−S∥∥A∥=1C(A).

Another interpretation can be found in error propagation. A high condition number implies severe loss of accuracy or large propagation of error when performing matrix inversion under finite precision arithmetic. One can expect to loose at least digits of accuracy in computing the inverse of [e.g., Chapter 8 and Section 6.4 of, respectively, 7, 21]. In terms of error propagation, is also a reasonable sensitivity measure for linear systems [25] .

We can specify (5) with regard to a particular norm. We have special interest in the -condition number , usually called the spectral condition number for its relation to the spectral decomposition. When is a symmetric p.d. matrix, it is well-known that [59, 21]

 C2(A)=∥A∥2∥A−1∥2=d(A)1d(A)p,

where are the eigenvalues of . We can connect the machinery of ridge-type regularization to this spectral condition number.

Let be the spectral decomposition of with denoting a diagonal matrix with the eigenvalues of on the diagonal and where

denotes the matrix that contains the corresponding eigenvectors as columns. Note that the orthogonality of

implies . This decomposition can then be used to show that, like the Stein estimator [50], estimate (2) is rotation equivariant: . That is, the estimator leaves the eigenvectors of intact and thus solely performs shrinkage on the eigenvalues. When choosing with , the estimator (3) also is of the rotation equivariant form, as we may then write:

 (6) ^Σa(λa)=V{[λaIp+14[D(^Σ)−λaφIp]2]1/2+12[D(^Σ)−λaφIp]}VT.

An analogous decomposition can be given for the estimator (1) when choosing , with . The effect of the shrinkage estimators can then, using the rotation equivariance property, also be explained in terms of restricting the spectral condition number. For example, using (6), we have:

 1≤√λa+[d(^Σ)1−λaφ]2/4+[d(^Σ)1−λaφ]/2√λa+[d(^Σ)p−λaφ]2/4+[d(^Σ)p−λaφ]/2

Similar statements can be made regarding all rotation equivariant versions of the estimators discussed in Section 2.1. Similar statements can also be made when considering a target for estimators (1) and (3) that is not of the form whenever this target is well-conditioned (i.e., has a lower condition number than ).

###### Example 1.

For clarification consider the following toy example. Assume we have a sample covariance matrix whose largest eigenvalue . Additionally assume that is estimated in a situation where so that and, hence,

(under the IEEE computing Standard for Floating-Point Arithmetic;

[27]). Say we are interested in regularization using the estimator (3) using a scalar target matrix with . Even using a very small penalty of it is then quickly verified that . Under a large penalty such as we find that . Indeed, van Wieringen and Peeters [58] have shown that, in this rotation equivariant setting, as for all . Hence, as the penalty value grows very large. Section 1 of the Supplementary Material visualizes this behavior (in the given setting) for various scenarios of interest: , , , , and .

Let denote a generic ridge-type estimator of the covariance matrix under generic penalty . We thus quantify the conditioning of for given (and possibly a given ) w.r.t. perturbations with

 (7) C2[^Σ(λ)]=∥^Σ(λ)∥2∥^Ω(λ)∥2=d[^Σ(λ)]1d[^Σ(λ)]p.

A useful property is that , i.e., knowing the condition of the covariance matrix implies knowing the condition of the precision matrix (so essential in contemporary topics such as graphical modeling). The condition number (7) can be used to construct a simple and computationally friendly visual tool for penalty parameter selection.

## 3. The spectral condition number plot

### 3.1. The basic plot

As one may appreciate from the exposition in Section 2.3, when moves away from near-singularity, small increments in the penalty can cause dramatic changes in . One can expect that at some point along the domain of , its value will be large enough for to stabilize (in some relative sense). We will cast these expectations regarding the behavior in a (loose) definition:

###### Heuristic Definition.

Let denote a generic ridge-type estimator of the covariance matrix under generic fixed penalty . In addition, let indicate a real perturbation in as opposed to the theoretical perturbation . We will term the estimate well-conditioned when small increments in translate to (relatively) small changes in vis-à-vis .

From experience, when considering ridge-type estimation of or its inverse in situations, the point of relative stabilization can be characterized by a leveling-off of the acceleration along the curve when plotting the condition number against the (chosen) domain of . Consider Figure 1, which is the first example of what we call the spectral condition number plot.

Figure 1 indeed plots (7) against the natural logarithm of . As should be clear from Section 2.3, the spectral condition number displays (mostly) decreasing concave upward behavior in the feasible domain of the penalty parameter with a vertical asymptote at and a horizontal asymptote at (which amounts to in case of a strictly positive scalar target). The logarithm is used on the -axis as, especially for estimators (2) and (3), it is more natural to consider orders of magnitude for . In addition, usage of the logarithm ‘decompresses’ the lower domain of , which enhances the visualization of the point of relative stabilization, as it is in the lower domain of the penalty parameter where ill-conditioning usually ensues when . Figure 1 uses simulated data (see Section 4 of the Supplementary Material) with and . The estimator used is the ad-hoc ridge-type estimator given by (2). One can observe relative stabilization of the spectral condition number – in the sense of the Heuristic Definition – at approximately . This value can be taken as a reasonable (minimal) value for the penalty parameter. The spectral condition number plot can be a simple visual tool of interest in the situations sketched in Section 2.2, as will be illustrated in Section 4.

### 3.2. Interpretational aids

The basic spectral condition number plot can be amended with interpretational aids. The software (see Section 5) can add two such aids to form a panel of plots. These aids support the heuristic choice for a penalty-value and provide additional information on the basic plot.

The first aid is the visualization of against the domain of . As stated in Section 2.3, provides an estimate of the digits of accuracy one can expect to loose (on top of the digit loss due to inherent numerical imprecision) in operations based on . Note that this estimate is dependent on the norm. This aid can support choosing a (miminal) penalty-value on the basis of the error propagation (in terms of approximate loss in digits of accuracy) one finds acceptable. Figure 2 gives an example.

Let denote the curvature (of the basic plot) that maps the natural logarithm of the penalty-value to the condition number of the regularized precision matrix. We seek to approximate the second-order derivative of this curvature (the acceleration) at given penalty-values in the domain . The software (see Section 5) requires the specification of and , as well as the number of steps one wants to take along the domain . Say we take steps, such that . The implementation takes steps that are log-equidistant, hence for all . The central finite difference approximation to the second-order derivative [see e.g., 37] of at then takes the following form:

 C′′ln(λs)↦C2[^Σ(λs)]≈C2[^Σ(λs+1)]−2C2[^Σ(λs)]+C2[^Σ(λs−1)]τ2,

which is available for . The second visual aid thus plots against the feasible domain of (see Figure 2 for an example). The behavior of the condition number along the domain of the penalty-value is to always decrease. This decreasing behavior is not consistently concave upward for (1) (under general non-scalar targets) and (3), as there may be parts of the domain where the behavior is concave downward. This aid may then more clearly indicate inflection points in the regularization path of the condition number. In addition, this aid may put perspective on the point of relative stabilization when the -axis of the basic plot represents a very wide range.

Software implementing the spectral condition number plot is discussed in Section 5. The following section illustrates, using oncogenomics data, the various uses of the spectral condition number plot with regard to covariance or precision matrix regularization. Section 2 of the Supplementary Material contains a second data example to further illustrate usage of the condition number plot. Section 4 of the Supplementary Material contains all R code with which these illustrations can be reproduced (including querying the data).

## 4. Illustration

### 4.1. Context and data

Various histological variants of kidney cancer are designated with the amalgamation ‘renal cell carcinoma’ (RCC). Chromophobe RCC (ChRCC) is a rather rare and predominantly sporadic histological variant, accounting for 4-6% of RCC cases [49]. ChRCC originates in the distal convoluted tubule [47], a portion of the nephron (the basic structural unit of the kidney) that serves to maintain electrolyte balance [52]. Often, histological variants of cancer have a distinct pathogenesis contingent upon the deregulation of certain molecular pathways. A pathway can be thought of as a collection of molecular features that work interdependently to regulate some biochemical function. Recent evidence suggests that (reactivation of) the Hedgehog (Hh) signaling pathway may support cancer development and progression in clear cell RCC (CCRCC) [13, 8], the most common subtype of RCC. The Hh-signaling pathway is crucial in the sense that it “orchestrates tissue patterning” in embryonic development, making it “critical to normal kidney development, as it regulates the proliferation and differentiation of mesenchymal cells in the metanephric kidney” [8]. Later in life Hh-signaling is largely silenced and constitutive reactivation may elicit and support tumor growth and vascularization [13, 8]. Our goal here is to explore if Hh-signaling might also be reactivated in ChRCC. The exploration will make use of network modeling (see Section 4.3) in which the network is taken as a representation of a biochemical pathway. This exercise hinges upon a well-conditioned precision matrix.

We attained data on RCC from the The Cancer Genome Atlas Research Network [55] as queried through the Cancer Genomics Data Server [6, 20] using the cgdsr R-package [28]. All ChRCC samples were retrieved for which messenger ribonucleic acid (mRNA) data is available, giving a total of

samples. The data stem from the IlluminaHiSeq_RNASeqV2 RNA sequencing platform and consist of normalized relative gene expressions. That is, individual gene expressions are given as mRNA z-scores relative to a reference population that consists of all tumors that are diploid for the gene in question. All features were retained that map to the Hh-signaling pathway according to the Kyoto Encyclopedia of Genes and Genomes (KEGG)

[31], giving a total of gene features. Regularization of the desired precision matrix is needed as . Even though features from genomic data are often measured on or mapped to (approximately) the same scale, regularization on the standardized scale is often appropriate as the variability of the features may differ substantially when ; a point also made by [61]. Note that we may use the correlation matrix instead of in equations (1) to (3) without loss of generality.

### 4.2. Penalty parameter selection

The precision estimator of choice is the inverse of (3). The target matrix is chosen as , with set to the reciprocal of the average eigenvalue of : . First, the approximate leave-one-out cross-validation (aLOOCV) procedure [38, 60] is used (on the negative log-likelihood) in finding an optimal value for under the given target and data settings. This procedure searches for the optimal value in the domain . A relatively fine-grained grid of log-equidistant steps along this domain points to as being the optimal value for the penalty (in the chosen domain). This value seems low given the ratio of the data. This calls for usage-type ii. of the condition number plot (Section 2.2), where one uses it to determine if an optimal penalty as proposed by some procedure indeed leads to a well-conditioned estimate. The condition number is plotted over the same penalty-domain considered by the aLOOCV procedure. The left-hand panel of Figure 3 depicts this condition number plot. The green vertical line represents the penalty-value that was chosen as optimal by the aLOOCV procedure. Clearly, the precision estimate at is not well-conditioned in the sense of the Heuristic Definition. This exemplifies that the (essentially large-sample) approximation to the LOOCV score may not be suitable for non-sparse situations and/or for situations in which the ratio grows more extreme (the negative log-likelihood term then tends to completely dominate the bias term). At this point one could use the condition number plot in accordance with usage-type i., in which one seeks a reasonable minimal penalty-value. This reasonable minimal value (in accordance with the Heuristic Definition) can be found at approximately , at which .

One could worry that the precision estimate retains too much noise under the heuristic minimal penalty-value. To this end, a proper LOOCV procedure is implemented that makes use of the root-finding Brent algorithm [4]. The expectation is that the proper data-driven LOOCV procedure will find an optimal penalty-value in the domain of for which the estimate is well-conditioned. The penalty-space of search is thus constrained to the region of well-conditionedness for additional speed, exemplifying usage-type iii. of the condition number plot. Hence, the LOOCV procedure is told to search for the optimal value in the domain . The optimal penalty-value is indeed found to the right of the heuristic minimal value at . At this value, indicated by the red vertical line in Figure 3, . The precision estimate at this penalty-value is used in further analysis.

### 4.3. Further analysis

Biochemical networks are often reconstructed from data through graphical models. Graphical modeling refers to a class of probabilistic models that uses graphs to express conditional (in)dependence relations (i.e., Markov properties) between random variables. Let

denote a finite set of vertices that correspond to a collection of random variables with probability distribution

, i.e., . Let denote a set of edges, where edges are understood to consist of pairs of distinct vertices such that is connected to , i.e., . We then consider graphs under the basic assumption . In this Gaussian case, conditional independence between a pair of variables corresponds to zero entries in the precision matrix. Indeed, the following relations can be shown to hold for all pairs with [see, e.g., 62]:

 (^Ω)jj′=0⟺Yj⊥⊥Yj′|V∖{Yj,Yj′}⟺Yj\centernot−Yj′,

where indicates the absence of an edge. Hence, the graphical model can be selected by determining the support of the precision matrix. For support determination we resort to a local false discovery rate procedure proposed by [45]

, retaining only those edges whose empirical posterior probability of being present equals or exceeds

. The right-hand panel of Figure 3 represents the retrieved Markov network on the basis of . The vertex-labels are curated gene names of the Hh-signaling pathway genes. The graph seems to retrieve salient features of the Hh-signaling pathway. The Hh-signaling pathway involves a cascade from the members of the Hh-family (IHH, SHH, and DHH) via the SMO gene to ZIC2 and members of the Wnt-signaling pathway. The largest connected component is indicative of this cascade, giving tentative evidence of reactivation of the Hh-signaling pathway in (at least) rudimentary form in ChRCC.

## 5. Software

The R-package rags2ridges [43] implements (along with functionalities for graphical modeling) the ridge estimators of Section 2.1 and the condition number plot through the function CNplot. This function outputs visualizations such as Figure 1 and Figure 2. The condition number is determined by exact calculation using the spectral decomposition. For most practical purposes this exact calculation is fast enough, especially in rotation equivariant situations for which only a single spectral decomposition is required to obtain the complete solution path of the condition number. Additional computational speed is achieved by the implementation of core operations in C++ via Rcpp and RcppArmadillo [15, 14]. For example, producing the basic condition number plot for the estimator (3) on the basis of data with dimensions and , using a scalar target matrix and steps along the penalty-domain, will take approximately 1.2 seconds (see Section 3 of the Supplementary Material for a benchmark exercise). The additional computational cost of the interpretational aids is linear: producing the panel of plots (including interpretational aids) for the situation just given takes approximately 1.3 seconds. When is very small and the exact calculation of the condition number may suffer from rounding problems (much like the imaginary linear system ), but remains adequate in its indication of ill-conditioning.

When exact computation is deemed too costly in terms of floating-point operations, or when one wants more speed in determining the condition number, the CNplot function offers the option to cheaply approximate the -condition number, which amounts to

 C1[^Σ(λ)]=∥^Σ(λ)∥1∥^Ω(λ)∥1={maxj′∑j|[^Σ(λ)]jj′|}{maxj′∑j|[^Ω(λ)]jj′|}.

The -condition number is computationally less complex than the calculation of in non-rotation equivariant settings. The machinery of ridge-type regularization is, however, less directly connected to this -condition number (in comparison to the -condition number). The approximation of uses LAPACK routines [2] and avoids overflow. This approximation is accessed through the rcond function from R [54]. The package rags2ridges is freely available from the Comprehensive R Archive Network (http://cran.r-project.org/) [54].

## 6. Discussion

The condition number plot is a heuristic tool and heuristics should be handled with care. Below, some cases are presented that serve as notes of caution. They exemplify that the proposed heuristic accompanying the condition number plot should not be applied (as any statistical technique) without proper inspection of the data.

### 6.1. Artificial ill-conditioning

A first concern is that, when the variables are measured on different scales, artificial ill-conditioning may ensue [see, e.g., 21]. In case one worries if the condition number is an adequate indication of error propagation when using variables on their original scale one can ensure that the columns (or rows) of the input matrix are on the same scale. This is easily achieved by scaling the input covariance matrix to be the correlation matrix. Another issue is that it is not guaranteed that the condition number plot will give an unequivocal point of relative stabilization for every data problem (which hinges in part on the chosen domain of the penalty parameter). Such situations can be dealt with by extending the domain of the penalty parameter or by determining the value of that corresponds to the loss of digits (in the imaginary linear system ) one finds acceptable.

### 6.2. Naturally high condition numbers

Some covariance matrices may have high condition numbers as their ‘natural state’. Consider the following covariance matrix: , with and where denotes the -dimensional all-ones matrix. The variates are thus equicorrelated with unit variance. The eigenvalues of this covariance matrix are and (with multiplicity ) . Consequently, its condition number equals . The condition number of thus becomes high when the number of variates grows large and/or the (marginal) correlation approaches one (or ). The large ratio between the largest and smallest eigenvalues of in such situations mimics a high-dimensional setting in which any non-zero eigenvalue of the sample covariance estimate is infinitely larger than the smallest (zero) eigenvalues. However, irrespective of the number of samples, any sample covariance estimate of an with large and close to unity (or ) exhibits such a large ratio. Would one estimate the in penalized fashion (even for reasonable sample sizes) and choose the penalty parameter from the condition number plot as recommended, then one would select a penalty parameter that yields a ‘well-conditioned’ estimate. Effectively, this amounts to limiting the difference between the penalized eigenvalues, which need not give a condition number representative of . Thus, the recommendation to select the penalty parameter from the well-conditioned domain of the condition number plot may in some (perhaps exotic) cases lead to a choice that crushes too much relevant signal (shrinking the largest eigenvalue too much). For high-dimensional settings this may be unavoidable, but for larger sample sizes this is undesirable.

### 6.3. Contamination

Real data is often contaminated with outliers. To illustrate the potential effect of outliers on the usage of the condition number plot, consider data

drawn from a contaminated distribution, typically modeled by a mixture distribution: for , some positive constant , and mixing proportion . Then, the expectation of the sample covariance matrix . Its eigenvalues are: , for . In high-dimensional settings with few samples the presence of any outlier corresponds to mixing proportions clearly deviating from zero. In combination with any substantial the contribution of the outlier(s) to the eigenvalues may be such that the contaminated sample covariance matrix is represented as better conditioned (vis-à-vis its uncontaminated counterpart). It is the outlier(s) that will determine the domain of well-conditionedness in such a situation. Then, when choosing the penalty parameter in accordance with the Heuristic Definition, undershrinkage may ensue. One may opt, in situations in which the results are influenced by outliers, to use a robust estimator for in producing the condition number plot.

## 7. Conclusion

We have proposed a simple visual display that may be of aid in determining the value of the penalty parameter in ridge-type estimation of the covariance or precision matrix when the number of variables is large relative to the sample size. The visualization we propose plots the spectral condition number against the domain of the penalty parameter. As the value of the penalty parameter increases, the covariance (or precision) matrix will move away from (near) singularity. In some lower-end of the domain this will mean that small increments in the value of the penalty parameter will lead to large decreases of the condition number. At some point, the condition number can be expected to stabilize, in the sense that small increments in the value of the penalty have (relatively) little effect on the condition number. The point of relative stabilization may be deemed to indicate a reasonable (minimal) value for the penalty parameter.

Usage of the condition number plot was exemplified in situations concerned with the direct estimation of covariance or precision matrices. The plot may also be of interest in situations in which (scaled versions of) these matrices are conducive to further computational procedures. For example, it may support the ridge approach to the regression problem . We would then assess the conditioning of for use in the ridge-solution to the regression coefficients: .

We explicitly state that we view the proposed condition number plot as an heuristic tool. We emphasize ‘tool’, as it gives easy and fast access to penalty-value determination without proposing an optimal (in some sense) value. Also, in the tradition of exploratory data analysis [56], usage of the condition number plot requires good judgment. As any heuristic method, it is not without flaws.

Notwithstanding these concerns, the condition number plot gives access to a fast and generic (i.e., non-target and non-ridge-type specific) procedure for regularization parameter determination that is of use when analytic solutions are not available and when other procedures fail. In addition, the condition number plot may aid more formal procedures, in terms of assessing if a well-conditioned estimate is indeed obtained, and in terms of proposing a reasonable minimal value for the regularization parameter for usage in a search grid.

## Acknowledgements

This research was supported by grant FP7-269553 (EpiRadBio) through the European Community’s Seventh Framework Programme (FP7, 2007-2013).

## References

• [1] American Cancer Society.
• Anderson et al. [1999] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 3 edition, 1999.
• Bien and Tibshirani [2011] J. Bien and R. Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98:807–820, 2011.
• Brent [1971] R. P. Brent. An algorithm with guaranteed convergence for finding a zero of a function. The Computer Journal, 14:422–425, 1971.
• Carvalho et al. [2008] C. M. Carvalho, J. Chang, J. E. Lucas, J. R. Nevins, Q. Wang, and M. West. High-dimensional sparse factor modeling: Applications in gene expression genomics. Journal of the American Statistical Association, 103:1438–1456, 2008.
• Cerami et al. [2012] E. Cerami, J. Gao, U. Dogrusoz, B. E. Gross, S. O. Sumer, B. A. Aksoy, A. Jacobsen, C. J. Byrne, M. L. Heuer, E. Larsson, Y. Antipin, B. Reva, A. P. Goldberg, C. Sander, and N. Schultz. The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2:401–404, 2012.
• Cheney and Kincaid [2008] W. Cheney and D. Kincaid. Numerical Computing and Mathematics. Belmont, CA, USA: Thomson Brooks/Cole, 6th edition, 2008.
• D’Amato et al. [2014] C. D’Amato, R. Rosa, R. Marciano, V. D’Amato, L. Formisano, L. Nappi, L. Raimondo, C. Di Mauro, A. Servetto, F. Fulciniti, A. Cipolletta, C. Bianco, F. Ciardiello, B. M. Veneziani, S. De Placido, and R. Bianco. Inhibition of Hedgehog signalling by NVP-LDE225 (Erismodegib) interferes with growth and invasion of human renal cell carcinoma cells. British Journal of Cancer, 111:1168–1179, 2014.
• Daniels and Kass [2001] M. J. Daniels and R. E. Kass. Shrinkage estimators for covariance matrices. Biometrics, 57:1173–1184, 2001.
• de Kroon et al. [2013] A. I. P. M. de Kroon, P. J. Rijken, and C. H. de Smet. Checks and balances in membrane phospholipid class and acyl chain homeostasis, the yeast perspective. Progress in Lipid Research, 52:374–394, 2013.
• Demmel [1987] J. W. Demmel. On condition numbers and the distance to the nearest ill-posed problem. Numerische Mathematik, 51:251–289, 1987.
• Devlin et al. [1975] S. J. Devlin, R. Gnanadesikan, and J. R. Kettenring.

Robust estimation and outlier detection with correlation coefficients.

Biometrika, 62:531–545, 1975.
• Dormoy et al. [2009] V. Dormoy, S. Danilin, V. Lindner, L. Thomas, S. Rothhut, C. Coquard, J. J. Helwig, D. Jacqmin, H. Lang, and T. Massfelder. The sonic hedgehog signaling pathway is reactivated in human renal cell carcinoma and plays orchestral role in tumor growth. Molecular Cancer, 8:123, 2009.
• Eddelbuettel [2013] D. Eddelbuettel. Seamless R and C++ Integration with Rcpp. Springer-Verlag, New York, 2013.
• Eddelbuettel and François [2011] D. Eddelbuettel and R. François. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 2011.
• [16] QIAGEN Ingenuity Target Explorer.
• Ferrara et al. [2003] N. Ferrara, H. P. Gerber, and J. LeCouter. The biology of VEGF and its receptors. Nature Medicine, 9:669–676, 2003.
• Fisher and Sun [2011] T. J. Fisher and X. Sun. Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Computational Statistics and Data Analysis, 55:1909–1918, 2011.
• Friedman et al. [2008] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, 2008.
• Gao et al. [2013] J. Gao, B. A. Aksoy, U. Dogrusoz, G. Dresdner, B. Gross, S. O. Sumer, Y. Sun, A. Jacobsen, R. Sinha, E. Larsson, E. Cerami, C. Sander, and N. Schultz. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6:pl1, 2013.
• Gentle [2007] J. E. Gentle. Matrix Algebra: Theory, Computations, and Applications in Statistics. New York: Springer Verlag, 2007.
• Ha and Sun [2014] M. J. Ha and W. Sun. Partial correlation matrix estimation using ridge penalty followed by thresholding and re-estimation. Biometrics, 70:765–773, 2014.
• Haff [1980] L. R. Haff. Empirical Bayes estimation of the multivariate normal covariance matrix. Annals of Statistics, 8:586–597, 1980.
• Haff [1991] L. R. Haff. The variational form of certain Bayes estimators. Annals of Statistics, 19:1163–1190, 1991.
• Higham [1995] D. J. Higham. Condition numbers and their condition numbers. Linear Algebra and its Applications, 214:193–213, 1995.
• Hoerl and Kennard [1970] A. E. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
• IEEE Computer Society [2008] IEEE Computer Society. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–70, Aug 2008.
• Jacobsen [2015] A. Jacobsen. cgdsr: R-Based API for Accessing the MSKCC Cancer Genomics Data Server (CGDS), 2015. R package version 1.2.5.
• Kaiser [1958] H. F. Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23:187–200, 1958.
• Kaiser [1970] H. F. Kaiser. A second generation little jiffy. Psychometrika, 35:401–415, 1970.
• Kanehisa and Goto [2000] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1):27–30, 2000.
• [32] KEGG.
• Kutmon et al. [2016] M. Kutmon, A. Riutta, N. Nunes, K. Hanspers, E. L. Willighagen, A. Bohler, J. Mélius, A. Waagmeester, S. R. Sinha, R. Miller, S. L. Coort, E. Cirillo, B. Smeets, C. T. Evelo, and A. R. Pico. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Research, 44(D1):D488–D494, 2016.
• Ledoit and Wolf [2003] O. Ledoit and M. Wolf. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10:603–621, 2003.
• Ledoit and Wolf [2004a] O. Ledoit and M. Wolf. Honey, I shrunk the sample covariance matrix. Journal of Portfolio Management, 30:110–119, 2004a.
• Ledoit and Wolf [2004b] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices.

Journal of Multivariate Analysis

, 88:365–411, 2004b.
• LeVeque [2007] R. J. LeVeque.

Finite Difference Methods for Ordinary and Partial Differential Equations: Steady State and Time Dependent Problems

.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 2007.
• Lian [2011] H. Lian. Shrinkage tuning parameter selection in precision matrices estimation. Journal of Statistical Planning and Inference, 141:2839–2848, 2011.
• Lin and Perlman [1985] S. Lin and M. Perlman. A Monte Carlo comparison of four estimators of a covariance matrix. In P. R. Krishnaiah, editor, Multivariate Analysis, pages 411–429. Amsterdam: North Holland, 6 edition, 1985.
• Manukyan et al. [2014] A. Manukyan, E. Çene, A. Sedef, and I. Demir. Dandelion plot: a method for the visualization of R-mode exploratory factor analyses. Computational Statistics, 29:1769–1791, 2014.
• Mersmann [2014] Olaf Mersmann. microbenchmark: Accurate Timing Functions, 2014. R package version 1.4-2.
• Nunez et al. [2008] L. R. Nunez, S. A. Jesch, M. L. Gaspar, C. Almaguer, M. Villa-Garcia, M. Ruiz-Noriega, J. Patton-Vogt, and S. A. Henry. Cell Wall Integrity MAPK Pathway is essential for lipid homeostasis. Journal of Biological Chemistry, 283:34204 –34217, 2008.
• Peeters et al. [2016] C. F. W. Peeters, A. E. Bilgrau, and W. N. van Wieringen.

rags2ridges: Ridge Estimation of Precision Matrices from High-Dimensional Data

, 2016.
R package version 2.1.
• Pourahmadi [2013] M. Pourahmadi. High-dimensional covariance estimation. Hoboken NJ: John Wiley & Sons, 2013.
• Schäfer and Strimmer [2005] J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:art. 32, 2005.
• Schwarz [1978] G. E. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
• Shuch et al. [2015] B. Shuch, A. Amin, A. J. Armstrong, J. N. Eble, V. Ficarra, A. Lopez-Beltran, G. Martignoni, B. I. Rini, and A. Kutikov. Understanding pathologic variants of renal cell carcinoma: Distilling therapeutic opportunities from biologic complexity. European Urology, 67:85–97, 2015.
• Siegel et al. [2016] R. L. Siegel, K. D. Miller, and A. Jemal. Cancer Statistics, 2016. CA: A Cancer Journal for Clinicians, 66:7–30, 2016.
• Stec et al. [2009] R. Stec, B. Grala, M. Ma̧czewski, L. Bodnar, and C. Szczylik. Chromophobe renal cell cancer–review of the literature and potential methods of treating metastatic disease. Journal of Experimental & Clinical Cancer Research, 28:134, 2009.
• Stein [1975] C. Stein. Estimation of a covariance matrix. Rietz Lecture. 39th Annual Meeting IMS. Atlanta, Georgia, 1975.
• Stein [1986] C. Stein. Lectures on the theory of estimation of many parameters. Journal of Mathematical Sciences, 34:1373–1403, 1986.
• Subramanya and Ellison [2014] A. R. Subramanya and D. H. Ellison. Distal convoluted tubule. Clinical Journal of the American Society of Nephrology, 9:2147–2163, 2014.
• Taylor et al. [2010] B. S. Taylor, N. Schultz, H. Hieronymus, A. Gopalan, Y. Xiao, B. S. Carver, V. K. Arora, P. Kaushik, E. Cerami, B. Reva, Y. Antipin, N. Mitsiades, T. Landers, I. Dolgalev, J. E. Major, M. Wilson, N. D. Socci, A. E. Lash, A. Heguy, J. A. Eastham, H. I. Scher, V. E. Reuter, P. T. Scardino, C. Sander, C. L. Sawyers, and W. L. Gerald. Integrative genomic profiling of human prostate cancer. Cancer Cell, 18:11–22, 2010.
• R Development Core Team [2011] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0.
• The Cancer Genome Atlas Research Network [2013] The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature, 499:43–49, 2013.
• Tukey [1977] J. W. Tukey. Exploratory data analysis. Addison-Wesley, 1977.
• Turing [1948] A. M. Turing. Rounding-off errors in matrix processes. Quarterly Journal of Mechanics and Applied Mathematics, 1:287–308, 1948.
• van Wieringen and Peeters [2016] W. N. van Wieringen and C. F. W. Peeters. Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103:284–303, 2016.
• Von Neumann and Goldstine [1947] J. Von Neumann and H. H. Goldstine. Numerical inverting of matrices of high order. Bulletin of the American Mathematical Society, 53:1021–1099, 1947.
• Vujačić et al. [2015] I. Vujačić, A. Abbruzzo, and E. C. Wit. A computationally fast alternative to cross-validation in penalized Gaussian graphical models. Journal of Statistical Computation and Simulation, 85:3628–3640, 2015.
• Warton [2008] D.I. Warton. Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association, 103:340–349, 2008.
• Whittaker [1990] J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley & Sons Ltd., Chichester, 1990.
• Won et al. [2013] J. H. Won, J. Lim, S. J. Kim, and B. Rajaratnam. Condition-number-regularized covariance estimation. Journal of the Royal Statistical Society, Series B, 75:427–450, 2013.
• Yang and Berger [1994] R. Yang and J. O. Berger. Estimation of a covariance matrix using the reference prior. Annals of Statistics, 22:1195–1211, 1994.
• Yuan and Chan [2008] K. H. Yuan and W. Chan. Structural equation modeling with near singular covariance matrices. Computational Statistics and Data Analysis, 52:4842–4858, 2008.

## 1. Behavior largest and smallest eigenvalues

This section contain a visual impression (Figure S1) of the behavior of the largest and smallest eigenvalues along the domain of the penalty parameter. The setting is given in Example 1 of the main text.

## 2. A second Illustration

### 2.1. Context and data

Prostate cancer refers to an adenocarcinoma in the prostate gland. It is the most common solid tumor diagnosed in western men [48]. Its prognosis is largely determined by metastasis with low survival rates for metastatic forms in comparison to organ-confined forms [1].

Vascular endothelial growth factor (VEGF) is a signal protein that supports vascularization [17]. Vascular endothelial cells are cells that line the interior of blood vessels. The formation of new blood vessels is pivotal to the growth and metastasis of solid tumors. VEGF is actually at the helm of a cascade of signaling pathways that form the VEGF-signaling pathway. In specific, the activation of VEGF leads to the activation of the mitogen-activated protein kinase (MAPK) and phosphatidylinositol 3’-kinase (PI3K)-AKT signaling pathways [32]. The VEGF-signaling pathway thus largely consists of two subpathways. These subpathways mediate “the proliferation and migration of endothelial cells and” promote “their survival and vascular permeability” [32]. Hence, VEGF overexpression supports tumor growth and metastasis.

VEGF-signaling is likely active in metastatic prostate cancer. This contention is supported by recent evidence that changes in the PI3K-pathway are present in metastatic tumor samples [53]. Cancer may be viewed, from a pathway perspective, as consisting of a loss of normal biochemical connections and a gain of abnormal biochemical connections. Severe deregulation would then imply the loss of normal subpathways and/or the gain of irregular subpathways. Our goal here is to explore if the VEGF-signaling pathway in metastatic prostate cancer can still be characterized as consisting of the MAPK and PI3K-AKT subpathways. The exploration will make use of a pathway-based factor analysis in which the retained latent factors are taken to represent biochemical subpwathways [cf. 5]. This exercise hinges upon a well-conditioned covariance matrix.

We attained data on prostate cancer from the Memorial Sloan-Kettering Cancer Center [53] as queried through the Cancer Genomics Data Server [6, 20] using the cgdsr R-package [28]. All metastatic samples were retrieved for which messenger ribonucleic acid (mRNA) data is available, giving a total of samples. The data stem from the Affymetrix Human Exon 1.0 ST array platform and consist of whole-transcript mRNA expression values. All Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) curated features were retained that map to the VEGF-signaling pathway according to the Kyoto Encyclopedia of Genes and Genomes (KEGG) [31], giving a total of gene features. Regularization of the desired covariance matrix is needed as . Regularization is performed (as in the illustration in the main text) on the standardized scale.

### 2.2. Model

We work with standardized data. Hence, let denote a centered and scaled -dimensional observation vector available for persons. The factor model states that

 zi=Γξi+ϵi,

where denotes an -dimensional vector of latent variables typically called ‘factors’, and where is a matrix whose entries denote the loading of the th variable on the th factor, , . Finally, the denote error measurements. Hence, the model can be conceived of as a multivariate regression with latent predictors. An important assumption in this model is that : the dimension of the latent vector is smaller than the dimension of the observation vector. The following additional assumptions are made: (i) The observation vectors are independent; (ii) ; (iii) ; (iv) , with a -dimensional diagonal matrix with strictly positive diagonal entries ; and (v) and are independent for all and . The preceding assumptions establish a distributional assumption on the covariance structure of the observed data-vector: (vi) .

Hence, to characterize the model, we need to estimate and on the basis of the sample correlation matrix. As for the data at hand, the sample correlation matrix is singular and standard maximum likelihood estimation (MLE) is not available. Recent efforts deal with such situations by (Bayesian) sparsity modeling of the factor model [e.g., 5], i.e., by imposing sparsity constraints in the loadings matrix. Our approach differs. We first ensure that we have a well-conditioned correlation matrix by using a regularized (essentially a Bayesian) estimator. Afterwards, standard MLE techniques are employed to estimate and on the basis of this regularized correlation matrix. Hence, we perform MLE on the basis of where is (in some sense) deemed optimal.

### 2.3. Penalty parameter selection

The estimator of choice is again (3) of the main text where we replace with the sample correlation matrix . The target matrix is chosen as , such that a regularized correlation matrix ensues. Again, the aLOOCV procedure was tried first in finding an optimal value for under the given target and data settings. We searched for the optimal value in the domain with log-equidistant steps along this domain. Again, the procedure pointed to as being the optimal value for the penalty (in the chosen domain), which seems low given the ratio of the data. The condition number plot (covering the same penalty-domain considered by the aLOOCV procedure) indeed indicates that the precision estimate at is not well-conditioned in the sense of the Heuristic Definition, exhibiting a condition number of approximately (Figure S2). A reasonable minimal penalty-value (in accordance with the Heuristic Definition) can be found at approximately , at which . This reasonable minimal value is subsequently used to constrain the search-domain to the region of well-conditionedness. A root-finding (by the Brent algorithm [4]) LOOCV procedure is then told to search for the optimal value in the domain . The optimal penalty-value is found at . At this value, indicated by the red vertical line in Figure S2, .

The Kaiser-Meyer-Olkin index [30] of .98 indicates that a reasonable proportion of variance among the variables might be common variance and, hence, that is suitable for a factor analysis. The regularized estimate is used in further analysis.

### 2.4. Further analysis

The dimension of the factor solution is unknown. Hence, the optimal factor dimension needs to be determined in conjunction with the estimation of the model. Now, let denote the MLE solution under factors. Then, we determine the optimal dimension (contingent upon the MLE solutions) using the Bayesian Information Criterion (BIC; [46]), which, for the problem at hand, amounts to:

 n{pln(2π)+ln|^Σm|+tr(^Σ−1m^Σa(.422))}+ln(n)η,

where indicates the free parameters in the model. Figure S3 gives the BIC scores for each dimension allowed by the existence condition . The solution with the lowest BIC score is deemed optimal, indicating that is the preferred solution.

The specified FA model is inherently underidentified. Assume

is an arbitrary orthogonal matrix. Considering the implied covariance structure of the observed data we may write:

 ΓΓT+Ψ=(ΓH)(ΓH)T+Ψ.

This equality implies that, given , there is an infinite number of alternative loading matrices that generate the same covariance structure as . Thus, in any solution, can be made to satisfy additional conditions, and, hence, the structure of the existence condition given above. Naturally, any estimation method then requires a minimum of restrictions on to attain uniqueness (up to possibly polarity reversals in the columns of ). In MLE this is achieved by requiring that be diagonal along with an order-condition on its diagonal elements. As this is a convenience solution that has no direct interpretational meaning, usually a post-hoc rotation is applied, whence estimation is settled, to enhance interpretation. Here, the Varimax [29] rotation to an orthogonal simple structure was ultimately used as oblique rotation (that allows for factor correlations) indicated a near-zero correlation between the two retained latent factors (note that any orthogonal representation has equivalent oblique representations).

The final factor solution is represented in Figure S4. This Dandelion plot [40] visualizes the magnitudes of the elements of and and their relation to the two retained factors. The latent factors are taken to represent biochemical subpwathways. Table S1 characterizes the factors according to their constituting genes that achieve the highest absolute loading. These genes are furthermore associated with VEGF-signaling subpathways as indicated by the WikiPathways database [33]. Factor 1 consists mostly of MAPK-related genes. In addition, genes related to the glycerol phospholipid biosynthesis pathway load on factor 1. The MAPK pathway has been associated with lipid homeostase in yeast [42] and “phospholipid composition and synthesis are similar in yeast and mammalian cells” [10]. Genes with pathway association indicated by ‘-’ in Table S1 (PIK3R5 and SPHK2) could not be associated to VEGF-subpathways according to the WikiPathways database. PIK3R5 is, however, related to the MAPK-signaling pathway according to the Ingenuity Target Explorer [16]. Factor 2 consists mostly of genes related to the PI3K/AKT-signaling and prostate cancer pathways. Moreover, this factor also includes general VEGF-signaling pathway genes that are conducive in activating the MAPK and PI3K/AKT cascade. These results suggest that the compositional subpathway structure of the VEGF-signaling pathway in metastatic prostate cancer can still be characterized as consisting of the MAPK and PI3K-AKT cascades. Hence, VEGF-pathway degerulation in metastatic prostate cancer is more likely characterized by intricate changes in its known subpathways than by the loss of these (normal) subpathways or the gain of (abnormal) subpathways.

## 3. Benchmark

A benchmarking exercise was conducted to get an indication of the execution time of the CNplot function. The sample size is not a factor in the execution time as the function operates on the covariance (or precision) matrix. Hence, time complexity hinges on the dimension and the number of steps taken along the (specified) domain of the penalty-parameter. Execution time was thus evaluated for various combinations of and .

Timing evaluations were done for all ridge estimators mentioned in Section 2.1 of the main text and all combinations of the elements in and . The basic condition number plot was produced 50 times (on simulated covariance matrices) for each combination of estimator, , and . In addition, both a rotation equivariant situation (using a scalar target) and a rotation non-equivariant situation (using a non-scalar target) were considered for estimators (1) and (3) of the main text (note that estimator (2) is always rotation equivariant). All execution times of the CNplot call in R were evaluated with the microbenchmark package [41]. All timings were carried out on a Intel Core i7-4702MQ 2.2 GHz processor. The results, stated as median runtimes in seconds, are given in Tables S2 to S4. The code used in this benchmarking exercise can be found in Listing 6 in Section 4 of this Supplement.

Tables S2 to S4 indicate that, in general, the condition number plot can be produced quite swiftly. As stated in the main text: in rotation equivariant situations only a single spectral decomposition is required to obtain the complete solution path of the condition number. Hence, the time complexity of the CNplot call in these situations is approximately (worst-case time complexity of a spectral decomposition) as, after the required spectral decomposition, the solution path can be obtained in linear time only. Increasing the number of steps along the penalty-domain thus comes at little additional computational cost. For the rotation non-equivariant setting the time complexity is approximately as a spectral decomposition is required for all . The runtime of the condition number plot function under Estimator (3) of the main text is then somewhat longer than the corresponding runtime under Estimator (1). This is because the spectral decomposition in the former situation is also used for computing the (relatively expensive) matrix square root. As acts as a scaling factor in rotation non-equivariant settings the computation time of the plot can be reduced by coarsening the search-grid along the penalty-domain.

To compare the runtimes for the CNplot call we also conducted timing evaluations for the approximate leave-one-out cross-validation (aLOOCV) and root-finding LOOCV procedures as used in the main text (implemented in rags2ridges). Time complexity for the root-finding LOOCV procedure hinges on the dimension and the sample size . In contrast, time complexity for for the aLOOCV procedure is dependent on , , and . Timings were done for all combinations of the elements in , and (when relevant) . The basic condition number plot was produced 50 times (on simulated data of dimension ) for each combination of , and (possibly) . For the root-finding LOOCV procedure both a rotation equivariant situation and a rotation non-equivariant situation were considered. Ridge estimator (3) of the main text was used for all timing evaluations. The results, stated as median runtimes in seconds, are given in Tables S5 and S6. The code used in this exercise can also be found in Listing 6 in Section 4 of this Supplement.

Tables S5 and S6 indicate that, even under these relatively light settings, the runtimes for the aLOOCV and root-finding LOOCV procedures far exceed the runtime of (corresponding) calls to CNplot. The runtime for the root-finding LOOCV procedure is (given ) exacerbated for larger sample sizes. The runtime for the aLOOCV procedure is (given ) exacerbated for larger samples sizes, finer search-grids, and (not shown) situations of rotation non-equivariance.

## 4. R Codes

The R libraries needed to run the code snippets below can be found in Listing 1:

The code in Listing 2 will produce Figure 1 of the main text:

The code in Listing 3 will produce (the components of) Figure 2 of the main text:

The code in Listing 4 was used for Section 4 of the main text:

The code in Listing 5 was used for the second illustration contained in Section 2 of this supplement: