A Tracy-Widom Empirical Estimator For Valid P-values With High-Dimensional Datasets

11/18/2018
by Maxime Turgeon, et al.

Recent technological advances in many domains, including both genomics and brain imaging, have led to an abundance of high-dimensional and correlated data being routinely collected. Classical multivariate approaches like Multivariate Analysis of Variance (MANOVA) and Canonical Correlation Analysis (CCA) can be used to study relationships between such multivariate datasets. Yet, special care is required with high-dimensional data, as the test statistics may be ill-defined and classical inference procedures break down. In this work, we explain how valid p-values can be derived for these multivariate methods even with high-dimensional datasets. Our main contribution is an empirical estimator for the largest root distribution of a singular double Wishart problem; this general framework underlies many common multivariate analysis approaches. Using a small number of permutations of the data, we estimate the location and scale parameters of a parametric Tracy-Widom family that provides a good approximation of this distribution. Through simulations, we show that this estimated distribution also leads to valid p-values that can be used for high-dimensional inference. We then apply our approach to a pathway-based analysis of the association between DNA methylation and disease type in patients with systemic auto-immune rheumatic diseases.


1 Introduction

Consider the following scenario: your research team received a small grant to study the relationship between anatomical brain features and a set of clinical phenotypes. For the past several months, you have been painstakingly enrolling patients in the study and collecting neuroimaging data on them. The subject-matter expert on your team has already grouped the millions of voxels into regions of interest. A given region can now be represented by a dataset Y of dimension n x p, for n patients, and the clinical phenotypes by a dataset X of dimension n x q. Prior to the study, you had identified Canonical Correlation Analysis (CCA) as a potential analytical tool; it would allow the extraction of maximally correlated components from both datasets, and the overall relationship between X and Y could be summarised with a series of canonical correlations. In the classical low-dimensional setting (p, q < n), you could also test for significance of these canonical correlations using Rao's statistic (Mardia et al., 1979). However, despite your colleague's excellent work, most brain regions still contain more features or measurements than your overall sample size (p > n); in other words, you are no longer in the classical inference setting, but in a high-dimensional setting. Building on your knowledge of linear algebra and matrix decompositions, you know that you can still extract the canonical components and canonical correlations using a truncated eigenvalue decomposition (EVD). But one last obstacle remains: the null distribution of the first canonical correlation is unknown in this high-dimensional setting. Although you could rely on a permutation strategy to obtain a p-value, such a procedure is computationally expensive if you want to reach a high level of precision for the estimated p-value, or if you want to analyze multiple regions of the brain.

In this article, we provide a fast computational approach for estimating the null distribution of the first canonical correlation in such a high-dimensional setting.

However, the contribution of this paper extends well beyond the CCA example above. More generally, a cursory look at the table of contents of recent volumes of both neuroimaging and genomics journals reveals a strong bias towards multivariate analysis methods such as Principal Component Analysis (PCA), Multivariate Analysis of Variance (MANOVA), Canonical Correlation Analysis (CCA), Principal Components of Explained Variance (PCEV), and Linear Discriminant Analysis (LDA); for a small yet broad sample, see Park et al. (2017); Zhao et al. (2017); Hao et al. (2017); Pesonen et al. (2017); Gossmann et al. (2018); Fraiman and Fraiman (2018); Happ and Greven (2018); Yang et al. (2018); Turgeon et al. (2018). This is hardly surprising, given the evolving nature of technological capabilities and the complex underlying biological processes that are now measurable. As described in our scenario above, using truncated matrix decompositions, we can often perform dimension reduction even in high dimensions. But beyond dimension reduction, many classical multivariate approaches also aim at summarizing the relationship between two datasets X and Y; in the list above, CCA, MANOVA and PCEV all have this common goal. Furthermore, this subset of methods also provides a unified way of performing null hypothesis significance testing: they all rely on the largest root of a double Wishart problem. Specifically, the strength of the association between X and Y is measured in terms of the magnitude of the largest solution θ to the following determinantal equation:

det(A − θ(A + B)) = 0,   (1)

where A and B are two independent random matrices, both following a Wishart distribution with the same scale matrix. The definition of A and B is formulated under the null hypothesis and is method-specific. Several specific examples will be given later in this paper. From the distribution of this largest root, we can then compute a p-value for the null hypothesis of interest.
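As a numerical illustration (not part of the original development), when A + B is positive definite the largest root of Equation 1 is simply the top generalized eigenvalue of the pair (A, A + B), and it always lies between 0 and 1. The Wishart parameters below are arbitrary choices:

```python
import numpy as np
from scipy.linalg import eigh

def largest_root(A, B):
    """Largest root theta of det(A - theta * (A + B)) = 0.

    When A + B is positive definite, theta is the top generalized
    eigenvalue of the pair (A, A + B), and lies in (0, 1) when both
    matrices are full rank.
    """
    return eigh(A, A + B, eigvals_only=True)[-1]

# Toy example: two independent Wishart matrices with the same (identity) scale.
rng = np.random.default_rng(42)
p, m, n = 5, 30, 40
X = rng.standard_normal((m, p))
Y = rng.standard_normal((n, p))
A, B = X.T @ X, Y.T @ Y
theta = largest_root(A, B)
```

By construction, the matrix A − θ(A + B) is singular at the computed root.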

More formally, in high-dimensional settings, when the sample size is smaller than the number of measurements, both matrices A and B can have singular distributions. This singularity leads to both computational challenges for estimation and theoretical challenges for inference. On the one hand, common estimators in the non-singular case can be ill-conditioned (or even undefined) for singular problems; on the other hand, classical asymptotic convergence results rely on large sample sizes and therefore may not directly apply to high-dimensional settings.

In this work, we are interested in multivariate analyses involving two datasets, X and Y, such that the dimension of one or both matrices may be much larger than the sample size n. We posit that proper high-dimensional inference for several multivariate statistical methods, such as CCA, MANOVA and PCEV, can be attained by studying the singular double Wishart problem described above. Our main contribution is an empirical estimator of the distribution of the largest root that is applicable to the analysis of high-dimensional data. This estimator provides valid p-values by fitting a location-scale family of distributions to a small number of permutations of the original data. By a theorem of Johnstone (2008), this family of distributions is known to provide an excellent approximation in the non-singular case, and we provide empirical evidence that this good performance extends to the singular case.

The rest of the article is structured as follows: in Section 2, we provide some detailed examples of double Wishart problems; these examples are later used to illustrate our approach. In Section 3, we describe our approach to approximating the distribution function of the largest root using an empirical estimator. Then, in Section 4, we investigate the numerical accuracy of the approximation and show how it leads to valid p-values. In Section 5, we further illustrate our approach through an analysis of the association between DNA methylation and disease type in patients with systemic auto-immune rheumatic diseases. Finally, in Section 6, we explain how our empirical estimator can be extended to accommodate linear shrinkage covariance estimators within the double Wishart setting. Our approach has already been implemented in two R packages: pcev and covequal; both packages are available on CRAN.

Notation

In what follows, Y and X will denote n x p and n x q matrices, respectively. We also write A ~ W_p(m, Σ) when A is Wishart-distributed with m degrees of freedom and scale matrix Σ. Recall that this is equivalent to having A = X^T X for an m x p matrix X where each row is independently drawn from a multivariate normal N_p(0, Σ).

2 Examples of double Wishart problems

As stated above, the developments in our paper build on theory associated with double Wishart problems. Therefore, we start here by giving four examples of well-known multivariate tests that are each double Wishart problems, in order to emphasize the range of applications of the theoretical results to follow. In each case, if the two matrices A and B are singular or ill-conditioned, then the theoretical results developed for non-singular double Wishart problems no longer apply. The four examples are one-way MANOVA, a test of equality of covariance matrices, CCA, and Principal Component of Explained Variance (PCEV). The first example is given for its simplicity and historical importance; the other three examples are used later to illustrate our approach.

2.1 MANOVA

Suppose that we have a set of independent observations y_{ij} ~ N_p(μ_i, Σ), where y_{ij} denotes the j-th observation of the i-th group, with i = 1, ..., k and j = 1, ..., n_i. In MANOVA, we are interested in the null hypothesis of equality of means H_0: μ_1 = ... = μ_k. First, for each group, we can form the sample mean ȳ_i and covariance matrix S_i. We then compute two basic quantities: 1) the within-group sum of squares W = n_1 S_1 + ... + n_k S_k; and 2) the between-group sum of squares B = n_1 (ȳ_1 − ȳ)(ȳ_1 − ȳ)^T + ... + n_k (ȳ_k − ȳ)(ȳ_k − ȳ)^T, where ȳ is the overall mean. Under H_0, the matrices W and B are independent and Wishart-distributed, on N − k and k − 1 degrees of freedom respectively, where N is the total sample size. The union-intersection test of H_0 uses the largest root of W^{-1}B as a test statistic or, equivalently, that of (W + B)^{-1}B (Mardia et al., 1979, Chapter 12). In the notation of Equation 1, we can therefore express the union-intersection test as a double Wishart problem by placing the between-group matrix B in the first slot and the within-group matrix W in the second. This test statistic is also known as Roy's largest root, and it is one of the standard tests in classical MANOVA.
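The quantities above can be sketched numerically; the following is an illustrative implementation of Roy's largest root for one-way MANOVA (the group sizes and mean shift are arbitrary choices, not values from the paper):

```python
import numpy as np
from scipy.linalg import eigh

def manova_largest_root(groups):
    """Roy's largest root for one-way MANOVA.

    `groups` is a list of (n_i x p) arrays. Forms the within-group sum
    of squares W and the between-group sum of squares Bb, then returns
    the largest root of det(Bb - theta * (Bb + W)) = 0, i.e. the top
    eigenvalue of (Bb + W)^{-1} Bb.
    """
    ybar = np.vstack(groups).mean(axis=0)   # overall mean
    p = groups[0].shape[1]
    W = np.zeros((p, p))
    Bb = np.zeros((p, p))
    for g in groups:
        gbar = g.mean(axis=0)
        resid = g - gbar
        W += resid.T @ resid                # within-group sum of squares
        d = (gbar - ybar)[:, None]
        Bb += g.shape[0] * (d @ d.T)        # between-group sum of squares
    return eigh(Bb, Bb + W, eigvals_only=True)[-1]

rng = np.random.default_rng(0)
# three groups of 25 observations on p = 3 variables; the last group is shifted
groups = [rng.standard_normal((25, 3)) + mu for mu in (0.0, 0.0, 0.5)]
theta = manova_largest_root(groups)
```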

2.2 Test of covariance equality

Suppose that independent samples from two multivariate normal distributions N_p(μ_1, Σ_1) and N_p(μ_2, Σ_2) lead to covariance estimates S_1 and S_2 such that n_1 S_1 and n_2 S_2 are independent and Wishart-distributed on n_1 and n_2 degrees of freedom, respectively. For example, we could take S_i to be the maximum likelihood estimate of the covariance matrix based on a sample of size n_i + 1. We are interested in testing the null hypothesis H_0: Σ_1 = Σ_2; this can be done using the largest root of the determinantal equation 1 as a test statistic, for which we set A = n_1 S_1 and B = n_2 S_2 (Anderson, 2003, Chapter 10).
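As an illustrative sketch (sample sizes and dimensions are arbitrary), the centred sums of squares can play the roles of A and B directly; note that the labelling of the two samples is arbitrary:

```python
import numpy as np
from scipy.linalg import eigh

def cov_equality_largest_root(X1, X2):
    """Largest-root statistic for H0: Sigma_1 = Sigma_2.

    The centred sums of squares (proportional to the scaled covariance
    estimates) play the roles of A and B in the determinantal equation.
    """
    A = (X1 - X1.mean(axis=0)).T @ (X1 - X1.mean(axis=0))
    B = (X2 - X2.mean(axis=0)).T @ (X2 - X2.mean(axis=0))
    return eigh(A, A + B, eigvals_only=True)[-1]

rng = np.random.default_rng(7)
X1 = rng.standard_normal((60, 5))
X2 = rng.standard_normal((60, 5))
theta = cov_equality_largest_root(X1, X2)   # the null hypothesis holds here
```

Inflating the covariance of the first sample pushes the statistic towards 1, which is the direction in which the largest-root test rejects.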

2.3 CCA

Suppose we have two datasets X and Y of dimension n x q and n x p, respectively. Recall that CCA seeks to find linear combinations of X and Y that are maximally correlated with each other. Assume each row of the concatenated matrix (X, Y) has joint distribution N_{q+p}(0, Σ), where Σ has the form

Σ = ( Σ_XX  Σ_XY )
    ( Σ_YX  Σ_YY ).

For simplicity, we first assume that p + q < n (which implies that the sample covariance matrices below are non-singular with probability one). Under our normality assumption, a hypothesis test for independence between X and Y is equivalent to a hypothesis test for H_0: Σ_XY = 0.

Write S_XX = X^T X, S_YY = Y^T Y, and S_XY = X^T Y = S_YX^T. Using these quantities, we can define

A = S_YX S_XX^{-1} S_XY,   B = S_YY − S_YX S_XX^{-1} S_XY.

Under the null hypothesis, A ~ W_p(q, Σ_YY) and B ~ W_p(n − q, Σ_YY) are independent. We can test the null hypothesis H_0: Σ_XY = 0 using as a test statistic the largest root of the double Wishart problem corresponding to the pair of matrices above (Mardia et al., 1979, Chapter 10). We note that this largest root corresponds to the square of the first canonical correlation between X and Y.

The computation above can be generalized to the high-dimensional case and to variables with nonzero mean; for details, see Johnstone (2009).
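The identity between the largest root and the squared first canonical correlation can be checked numerically. In the sketch below (illustrative dimensions, zero-mean setting), A = S_YX S_XX^{-1} S_XY and B = S_YY − A are formed from cross-products, and the largest root is compared with the first singular value of the whitened cross-covariance:

```python
import numpy as np
from scipy.linalg import eigh, svd

rng = np.random.default_rng(1)
n, p, q = 200, 6, 4
X = rng.standard_normal((n, q))
# make Y correlated with X so that the first canonical correlation is nonzero
Y = X @ rng.standard_normal((q, p)) * 0.5 + rng.standard_normal((n, p))

# cross-product matrices (zero-mean setting, as in the text)
Sxx, Syy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y
A = Sxy.T @ np.linalg.solve(Sxx, Sxy)   # S_YX S_XX^{-1} S_XY
B = Syy - A
theta = eigh(A, A + B, eigvals_only=True)[-1]

# the same quantity via the singular values of the whitened cross-covariance
Lx = np.linalg.cholesky(Sxx)
Ly = np.linalg.cholesky(Syy)
rho1 = svd(np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T,
           compute_uv=False)[0]         # first canonical correlation
```

The two routes agree: theta equals rho1 squared up to numerical error.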

2.4 PCEV

Similar to CCA, PCEV can also be used to simultaneously perform dimension reduction and test for association between two multivariate samples X and Y of dimension n x q and n x p, respectively. However, whereas CCA seeks to maximise the correlation between linear combinations of X and Y, PCEV seeks the linear combination of Y whose proportion of variance explained by X is maximised. Specifically, we assume that the relationship between Y and X can be represented via a linear model:

Y = XB + E,

where B is a q x p matrix of regression coefficients for the covariates of interest, and E is a matrix of residual errors. This model assumption allows us to decompose the total variance of Y as the sum of the variance explained by the model and the residual variance:

Var(Y) = V_M + V_R,

where V_M = B^T Var(X) B and V_R = Var(E). PCEV seeks a linear combination of outcomes, w^T Y, that maximises the proportion h(w) of variance being explained by the covariates X:

h(w) = (w^T V_M w) / (w^T (V_M + V_R) w).   (2)

Then testing the null hypothesis H_0: B = 0 is performed using as a test statistic the largest root of the determinantal equation 1, for which we set A = V_M and B = V_R (with these matrices replaced by their sample estimates). Turgeon et al. (2018) provide further details on this dimension-reduction technique.
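A sketch of the PCEV largest root under a linear model of this form; the estimates of V_M and V_R below are the fitted and residual sums of squares, and the binary covariate, dimensions, and effect size are illustrative choices rather than values from the paper:

```python
import numpy as np
from scipy.linalg import eigh, lstsq

def pcev_largest_root(Y, X):
    """Largest root for PCEV: split the total sum of squares of Y into
    a model part V_M and a residual part V_R under the linear model
    Y = X B + E, then solve det(V_M - h * (V_M + V_R)) = 0."""
    X1 = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept
    Bhat, *_ = lstsq(X1, Y)
    fitted = X1 @ Bhat
    resid = Y - fitted
    centred = fitted - Y.mean(axis=0)
    V_M = centred.T @ centred                    # model sum of squares
    V_R = resid.T @ resid                        # residual sum of squares
    return eigh(V_M, V_M + V_R, eigvals_only=True)[-1]

rng = np.random.default_rng(8)
n, p = 100, 8
X = rng.integers(0, 2, size=n).astype(float)    # binary covariate
Y = rng.standard_normal((n, p))
Y[:, 0] += 0.8 * X                              # associate X with one outcome
h = pcev_largest_root(Y, X)
```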

For further examples of double Wishart problems, we refer the reader to Johnstone (2008, 2009).

3 Empirical Estimator of the Distribution of the Largest Root

As we can see from the examples above, many null hypothesis significance tests in multivariate analysis follow the same two steps: 1) compute the largest root of equation 1; and 2) from its distribution under the null hypothesis, compute a p-value. For the first step, we may have to use a truncated version of the singular value decomposition (SVD) (or EVD) when both A and B are singular; this singularity is common in high-dimensional datasets. In essence, these matrix decompositions are restricted to the subspace spanned by the eigenvectors corresponding to the non-zero eigenvalues; the mathematical details are reviewed in Section 1 of the Supplementary Materials. In this section, we focus on the second step. We show how the distribution of the largest root can be accurately approximated in the singular setting.

3.1 Known results in the non-singular setting

In the non-singular setting, the distribution of the largest root of the determinantal equation 1 is well studied (Mardia et al., 1979; Muirhead, 2009). To provide a theoretical underpinning for the remainder of the section, we highlight two approaches to computing this distribution: an exact approach and an asymptotic result.

First, Chiani (2016) described an explicit algorithm for computing the cumulative distribution function (CDF) of the largest root. Building on his earlier work (Chiani, 2014), he proposed a new expression that relates the CDF to the Pfaffian of a skew-symmetric matrix. He also provided a set of recursive equations that offer a fast and efficient way to compute this matrix. However, this matrix quickly becomes very large as the parameters of the Wishart distributions increase, leading to both computational instability and numerical overflow problems. Moreover, this matrix can only be computed in the non-singular setting.

The second approach we wish to highlight uses results from random matrix theory to derive an approximation to the marginal distribution. Specifically, Johnstone (2008) showed that, after a suitable transformation of the largest root, its distribution can be approximated by the Tracy-Widom distribution of order 1. His result, using the notation of this manuscript, is given below:

Theorem 1

[Johnstone (2008)] Assume A ~ W_p(m, Σ) and B ~ W_p(n, Σ) are independent, with Σ positive-definite. Let θ be the largest root of Equation 1, and write logit(θ) = log(θ / (1 − θ)). As m, n, p → ∞, we have

(logit(θ) − μ) / σ → TW1 in distribution,

where TW1 is the Tracy-Widom distribution of order 1 (cf. Tracy and Widom (1996)), and the centering and scaling parameters μ and σ are defined as follows:

μ = 2 log tan((φ + γ) / 2),   σ^3 = 16 / [(m + n − 1)^2 sin^2(φ + γ) sin φ sin γ],

where the angles γ, φ ∈ [0, π/2] satisfy

sin^2(γ/2) = (min(p, m) − 1/2) / (m + n − 1),   sin^2(φ/2) = (max(p, m) − 1/2) / (m + n − 1).

3.2 Tracy-Widom Empirical Estimator

Unfortunately, in the singular setting, the marginal distribution of the largest root is not as well understood. Crucially, the results of both Johnstone (2008) and Chiani (2016) depend on the non-singularity of the matrices involved, and therefore they cannot readily be applied to the singular setting.

As highlighted in the introduction, a common approach to computing p-values when the null distribution is unknown is to use a permutation procedure. But high precision requires a large number of permutations, and therefore is computationally burdensome. In this section, we show that we can drastically reduce the number of permutations by using Johnstone’s theorem above as inspiration.

We suggest an empirical estimator for approximating the CDF of the largest root. Specifically, we propose to estimate the distribution using a location-scale family of the Tracy-Widom distribution of order 1, indexed by two parameters: a location μ and a scale σ. Algorithm 1 below describes our approach.

1:  For a double Wishart problem associated with matrices A and B:
2:  for k = 1 to K do
3:     Perform a permutation procedure on X and Y.
4:     Compute the largest root θ_k of the corresponding double Wishart problem.
5:  end for
6:  Transform the roots θ_1, ..., θ_K to the logit scale.
7:  Using a fitting procedure, find the estimates of the location and scale parameters μ and σ, respectively.
8:  Approximate the CDF of the largest root using the location-scale transformed Tracy-Widom CDF on the logit scale.
Algorithm 1 Tracy-Widom empirical estimator of the largest root distribution.

A few comments are required:

  • As we will show in Section 4, the number of permutations K can be chosen to be relatively small while retaining good performance.

  • The appropriate permutation procedure on Line 3 depends on the particular double Wishart problem being studied. For a test of association between X and Y, we typically permute the rows of X and keep those of Y fixed. The test of equality of covariance matrices requires a different strategy, and we give all the relevant details in Section 4.2.1 below.

  • Line 7 of Algorithm 1 refers to a fitting procedure. In Section 4.1, we investigate four different approaches: the method of moments; Maximum Likelihood Estimation; and Maximum Goodness-of-Fit estimation (Luceño, 2006) using either the Anderson-Darling statistic or a modified version that gives more weight to the right tail. Details about these latter approaches, including computational formulas for the statistics, are given in Section 2 of the Supplementary Materials.

Therefore, we propose this location-scale transformation of the Tracy-Widom distribution as a suitable parametric family for estimating the distribution of the largest root in singular double Wishart problems.
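The fitting step of Algorithm 1 can be sketched with the method of moments, matching the mean and variance of the logit-transformed permutation roots to the known mean (about −1.2065) and variance (about 1.6078) of the Tracy-Widom distribution of order 1. A p-value would then be obtained by evaluating the Tracy-Widom CDF (available from published tables or third-party packages, and omitted here) at the standardized logit of the observed root. In the synthetic check below, normal shocks with matching moments stand in for actual permutation roots:

```python
import numpy as np

# Moments of the Tracy-Widom distribution of order 1 (known constants).
TW1_MEAN = -1.2065335746
TW1_VAR = 1.6077810346

def fit_tw_location_scale(roots):
    """Method-of-moments fit for Lines 6-7 of Algorithm 1.

    Transforms the permutation largest roots to the logit scale and
    matches their mean and variance to a location-scale Tracy-Widom
    family, logit(theta) ~ loc + scale * TW1. Returns (loc, scale).
    """
    r = np.asarray(roots)
    z = np.log(r / (1.0 - r))                    # logit transform
    scale = np.std(z, ddof=1) / np.sqrt(TW1_VAR)
    loc = np.mean(z) - scale * TW1_MEAN
    return loc, scale

# Synthetic check: logits drawn from a known location-scale law
# (normal shocks with TW1 moments stand in for real permutation roots).
rng = np.random.default_rng(3)
z = 2.0 + 0.5 * (TW1_MEAN + np.sqrt(TW1_VAR) * rng.standard_normal(5000))
roots = 1.0 / (1.0 + np.exp(-z))                 # back to the (0, 1) root scale
loc, scale = fit_tw_location_scale(roots)
```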

4 Simulation results

To investigate the performance of our empirical estimator, we performed two simulation studies: 1) we compared the distribution obtained from our empirical estimator to an empirically generated cumulative distribution function (CDF), as described below, for different numbers of permutations and using four different fitting strategies; 2) we compared p-values derived from the estimated distribution to those obtained through a permutation procedure. The first simulation study is aimed at assessing the performance of our empirical estimator over the whole range of the distribution; the second simulation study specifically looks at the upper tail of the distribution and therefore at the validity of the resulting p-values.

4.1 Comparison to the true distribution

Since the distribution function of the largest root is not available analytically, we cannot use an analytical function as the benchmark for the "true" CDF. Therefore, an estimate of the true distribution was derived computationally. Specifically, we started by generating 1000 pairs of singular Wishart variates as follows: each singular Wishart variate A was generated by first generating a sample of multivariate normal variates X, and then defining A = X^T X; the Wishart variate B was generated similarly. For each pair of Wishart variates, we then computed the corresponding largest root using truncated SVD (cf. §1 of the Supplementary Materials). From these 1000 largest roots, we calculated the empirical CDF for the marginal distribution: this estimate was considered our benchmark for assessing the performance of our approach.
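This benchmark construction can be sketched as follows. The degrees of freedom below (3 and 96) follow the one-way MANOVA reading of this section (four populations and a total sample size of 100); the dimension p = 50 and the number of replicates are illustrative assumptions, and with these particular values A + B is non-singular, so the plain generalized eigenproblem suffices:

```python
import numpy as np
from scipy.linalg import eigh

def wishart(df, p, rng):
    """Draw A = Z^T Z, i.e. W_p(df, I); the draw is singular when df < p."""
    Z = rng.standard_normal((df, p))
    return Z.T @ Z

rng = np.random.default_rng(4)
p, m, n = 50, 3, 96     # A has 3 degrees of freedom < p, so A is singular
roots = []
for _ in range(200):    # the paper uses 1000 replicates
    A, B = wishart(m, p, rng), wishart(n, p, rng)
    roots.append(eigh(A, A + B, eigvals_only=True)[-1])
roots = np.array(roots)

def ecdf(t):
    """Empirical CDF of the largest roots: the benchmark 'true' CDF."""
    return np.mean(roots <= t)
```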

For our simulations, we fixed the degrees of freedom of the two Wishart variates. In the context of a one-way MANOVA, the chosen values would correspond to four distinct populations and a total sample size of 100; in the context of CCA, they would correspond to one set of variates of dimension 4 (the dimension of the other being p), and a total sample size of 99. We also fixed the scale matrix Σ, but we varied the dimension parameter p.

To compute the Tracy-Widom empirical estimator, we sampled K of these largest roots, for both a smaller and a larger value of K, and fitted the location-scale family as described in Algorithm 1. Finally, we looked at four different fitting strategies: the method of moments (MM); Maximum Likelihood Estimation (MLE); and Maximum Goodness-of-Fit estimation (Luceño, 2006) using either the Anderson-Darling statistic (AD) or a modified version that gives more weight to the right tail (ADR).

The simulation results appear in Figure 1. As we can see, with the larger number of samples K, all four methods estimate the distribution of the largest root reasonably well; on the other hand, for the smaller value of K, the method of moments clearly outperforms the other fitting strategies. For this reason, unless otherwise stated, we use the method of moments in the remainder of this article.

Figure 1: Approximation to the CDF: Tracy-Widom empirical estimator, using four different fitting strategies, compared to the “true CDF” (derived through computational means). AD: Anderson-Darling; ADR: right-tail-weighted Anderson-Darling; MLE: Maximum Likelihood Estimation; MM: Method of Moments.

4.2 Comparison of p-values

Next, we used our empirical estimator to compute p-values for three double Wishart problems: a test of equality of covariances, CCA, and PCEV. In all settings, we performed 100 simulations, and we fixed the total sample size at 100. We also used a small number of permutations to fit the Tracy-Widom empirical estimator using the method of moments. Finally, we compared our approach to a permutation procedure with 500 permutations. As a reference, the parameters for all simulation scenarios are summarised in Table 1.

Method                   Dimension of X     Dimension of Y   Association structure   Association parameter
Equality of covariance                                       Autoregressive
CCA                      Fixed at 20                         Exchangeable
PCEV                     Fixed at 1                          Linear model
Table 1: Values of the parameters for all simulation scenarios

4.2.1 Test of equality of covariances

For the test of equality of covariances, we simulated two datasets X and Y, both with 50 observations. We selected an autoregressive structure for the covariance matrices. We varied two parameters: 1) the common dimension of X and Y; and 2) the autocorrelation parameter of the second covariance matrix. When the two autocorrelation parameters are equal, we are under the null hypothesis of equal covariances; otherwise, we are under an alternative hypothesis.

The permutation procedure for testing the equality of covariance matrices started by centring both X and Y. Then, the observations were permuted between X and Y: that is, a valid permutation would sample 50 observations at random from the pooled set to create a permuted X, and the remaining 50 observations were used to create a permuted Y. The results are summarised using QQ-plots (see Figure 2). The computations were performed using the R package covequal.
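The permutation scheme just described can be sketched as follows (dataset sizes are illustrative):

```python
import numpy as np

def permute_cov_datasets(X, Y, rng):
    """One permutation for the test of covariance equality: centre each
    dataset, pool all rows, and randomly split the pool back into two
    groups of the original sizes."""
    pooled = np.vstack([X - X.mean(axis=0), Y - Y.mean(axis=0)])
    idx = rng.permutation(pooled.shape[0])
    n1 = X.shape[0]
    return pooled[idx[:n1]], pooled[idx[n1:]]

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 10))
Y = rng.standard_normal((50, 10))
Xp, Yp = permute_cov_datasets(X, Y, rng)
```

Each permuted pair keeps the original group sizes and is a relabelling of the pooled, centred rows.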

The simulation results appear in Figure 2. As we can see from these QQ-plots, there is excellent agreement between the p-values obtained from a permutation procedure and those obtained from the Tracy-Widom empirical estimator.

Figure 2: Equality of covariance: QQ-plots comparing the p-values obtained from a permutation procedure to those obtained from the Tracy-Widom empirical estimator.

4.2.2 CCA

For CCA, we again simulated two datasets X and Y, both with 100 observations, and with the dimension of X fixed. We selected an exchangeable structure with parameter ρ for the covariance matrix Σ. We again varied two parameters: 1) the dimension of Y; and 2) the correlation parameter ρ. As above, the value ρ = 0 corresponds to the null hypothesis of no association between X and Y, and nonzero values of ρ correspond to alternative hypotheses. Finally, the permutation procedure consisted in permuting the rows of X and keeping those of Y fixed.

The simulation results appear in Figure 3. As above, there is excellent agreement between the p-values obtained from a permutation procedure and those obtained from the Tracy-Widom empirical estimator.

Figure 3: CCA: QQ-plots comparing the p-values obtained from a permutation procedure to those obtained from the Tracy-Widom empirical estimator.

4.2.3 PCEV

For PCEV, we looked at the following high-dimensional simulation scenario: we fixed the number of observations at 100 and used a balanced binary covariate X. We then varied the number of response variables, and we fixed the covariance structure of the error term E. We induced an association between X and the first 50 response variables in Y. This association was controlled by a single parameter, which is related to the univariate regression coefficients through a simple scaling relationship.

As above, we summarised the results using QQ-plots (Figure 4). The computations were performed using the R package pcev (Turgeon et al., 2018); the methodology presented here is part of the package starting with version 2.1.

The simulation results appear in Figure 4. Once again, there is excellent agreement between the p-values obtained from a permutation procedure and those obtained from the Tracy-Widom empirical estimator.

Figure 4: PCEV: QQ-plots comparing the p-values obtained from a permutation procedure to those obtained from the Tracy-Widom empirical estimator.

As we can see, our Tracy-Widom empirical estimator yields valid p-values in a variety of high-dimensional scenarios that include both null and alternative hypotheses. For further simulation results, see Section 3 of the Supplementary Materials.

5 Data analysis

To showcase our ideas in the context of a real analysis of a high-dimensional dataset, we decided to look at the association between DNA methylation and disease type in patients with four systemic auto-immune rheumatic diseases: Scleroderma, Rheumatoid arthritis, Systemic lupus erythematosus, and Myositis. DNA methylation is an epigenetic mark, meaning that it is a chemical modification of the DNA that does not alter the nucleotide sequence (Baylin, 2005). It is known to be associated with changes in RNA transcription, and it is therefore correlated with gene expression.

The DNA methylation data used for this analysis were measured prior to treatment on T-cell samples from 28 patients using the Illumina 450k platform (Hudson et al., 2017). Baseline characteristics of the patients appear in Table 2.

                    Scleroderma    Other diseases
                    (n = 14)       (n = 14)
Age, mean (sd)      48 (16)        52 (14)
Female (%)          50%            71%
Table 2: Baseline characteristics

We opted to test for differential methylation between scleroderma and the other three diseases at the pathway level: from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2017), we extracted the list of genes included in their manually curated list of molecular pathways; these pathways correspond to networks of genes interacting and reacting together as part of a given biological process. For each of the 320 pathways, we then identified all CpG dinucleotides contained in at least one gene of this pathway. All CpG dinucleotides mapped to a given pathway were analysed jointly. The extraction of gene lists was performed using the R package KEGGREST (Tenenbaum, 2017).

For each pathway, we thus have two datasets: a 28 x p matrix Y that contains the methylation values at all p CpG dinucleotides in the pathway (where p ranges from 39 to 21,640 over the 320 pathways) and a 28 x 1 matrix X indicating whether an individual has scleroderma. Recall that the sample size is only 28, and therefore all pathways lead to a high-dimensional dataset. The analysis performed had two steps: first, we did a test of equality of the covariance matrices between the two disease groups; then we used PCEV to test for differential methylation between these two groups. The PCEV analysis included both age and sex as possible confounders (El-Maarri et al., 2007; Horvath, 2013). For both steps, the Tracy-Widom empirical estimator was computed using 50 permutations of the data.

Since we repeated the same analysis independently for all 320 pathways, we need to account for multiple comparisons. However, since a given gene may appear in multiple pathways, the 320 hypothesis tests are not independent; therefore, a naive Bonferroni correction would be too conservative. To estimate the effective number of independent tests, we looked at the average number of pathways in which a given CpG dinucleotide appears. Overall, 134,941 CpG dinucleotides were successfully matched to at least one of the 320 KEGG pathways. On average, each dinucleotide appeared in 4.5 pathways; dividing 320 by 4.5, this leads to effectively about 70 independent hypothesis tests. For a nominal family-wise error rate of 0.05, an appropriate Bonferroni-corrected significance threshold is therefore given by 0.05/70 ≈ 7 x 10^{-4}.

We compared the p-value obtained from our procedure above to that obtained from a permutation procedure; the latter is a common approach when the null distribution of a test statistic is unknown but has the disadvantage of being computationally expensive. Given our significance threshold, we determined that we needed to perform at least 10,000 permutations in order to be able to assess significance.

In Table 3, we present the five most significant pathways, with the top pathway achieving overall significance for the test of differential methylation. None of the covariance equality tests yielded a significant p-value; to improve clarity, we omitted these results from the table.

KEGG code       Description                     # CpGs   Tracy-Widom p-value   Permutation p-value
path:hsa00120   Glutamatergic synapse           225
path:hsa03040   Ras signaling pathway           2119
path:hsa03450   Circadian rhythm                267
path:hsa00563   Histidine metabolism            394
path:hsa04380   Pathogenic E. coli infection    2312
Table 3: Data analysis: five most significant pathways based on our empirical estimator. The Tracy-Widom p-values are compared to p-values obtained from a permutation procedure.

For the most significant pathway (i.e. Glutamatergic synapse), we computed the Variable Importance Factors (VIF) and compared them to the p-values obtained from a univariate approach in which each CpG dinucleotide is tested individually against the disease outcome (Turgeon et al., 2018). The VIF is defined as the correlation between the individual response variables and the principal component maximising the proportion of explained variance. The comparison is presented in Figure 5. As previously shown in the literature (Turgeon et al., 2018), there is some degree of agreement between these two measures of association. Moreover, we can see evidence that the overall association between this pathway and our disease indicator is driven by a few CpG dinucleotides.

Figure 5: path:hsa00120—Glutamatergic synapse: Comparison of Variable Importance Factor and 225 univariate p-values for the most significant pathway.

6 Extension to linear shrinkage covariance estimators

In Section 4, we presented graphical evidence that our proposed Tracy-Widom empirical estimator provides a good approximation of the distribution of the largest root of the determinantal equation 1. As discussed in Section 2, the estimates of the matrices appearing in this equation often involve high-dimensional covariance matrices. However, a common problem with such high-dimensional covariance matrices is their instability (Pourahmadi, 2013, Chapter 1). As a result, the power of statistical tests derived from the double Wishart problem decreases as the dimensions of X and Y increase. Moreover, it would seem that the Tracy-Widom empirical estimator relies on the assumption that A and B are Wishart-distributed, and it is not clear a priori that this estimator can be applied with other, more efficient estimators of the underlying high-dimensional covariance matrices.

One strategy for improving the stability of a covariance estimator is to use a shrinkage estimator; one such linear shrinkage estimator for the covariance matrix was introduced by Ledoit and Wolf (2004). In this section, we show that when the matrix B in Equation 1 is replaced by a linearly shrunk version, our Tracy-Widom empirical estimator still provides a good approximation of the distribution of the largest root.

Let S be the sample covariance matrix, Σ the population covariance matrix, and I the identity matrix. Ledoit and Wolf (2004) look for an optimal linear combination Σ* = ρ₁I + ρ₂S to estimate the population covariance matrix; the optimality criterion is described in the following result:

Theorem 2

[Ledoit and Wolf (2004)] Let ‖·‖² be the squared (scaled) Frobenius norm, ‖A‖² = tr(AAᵀ)/p, and let ⟨·,·⟩ be its corresponding inner product. Consider the optimization problem:

minimize E‖Σ* − Σ‖² over Σ* = ρ₁I + ρ₂S,

where the coefficients ρ₁, ρ₂ are nonrandom. Its solution verifies:

Σ* = (β²/δ²)μI + (α²/δ²)S,

where

μ = ⟨Σ, I⟩, α² = ‖Σ − μI‖², β² = E‖S − Σ‖², δ² = E‖S − μI‖² = α² + β².

Furthermore, Ledoit and Wolf (2004) provide consistent estimators for all four quantities μ, α², β², δ² under mild conditions that hold true for normal random variables, and therefore hold true in our setting. The resulting estimator for Σ* is denoted S*, and we thus get a linearly shrunk matrix B*.
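As an illustration, the plug-in estimates above can be computed directly from the data matrix. The following sketch follows the sample formulas of Ledoit and Wolf (2004) under their scaled Frobenius norm; it is our own illustrative implementation, not the authors' code.

```python
import numpy as np

def ledoit_wolf_shrinkage(X):
    """Linear shrinkage covariance estimator of Ledoit and Wolf (2004).

    X is n x p (rows = observations). Returns the shrunk estimate
    (b2/d2) * m * I + (a2/d2) * S, using the plug-in estimates
    m, a2, b2, d2 from their paper (norm scaled by 1/p)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    m = np.trace(S) / p                           # <S, I> under tr(.)/p
    d2 = np.sum((S - m * np.eye(p)) ** 2) / p     # ||S - m I||^2
    # b2: average squared distance between per-observation outer
    # products and S, truncated at d2
    b2_bar = sum(np.sum((np.outer(x, x) - S) ** 2) / p for x in Xc) / n**2
    b2 = min(b2_bar, d2)
    a2 = d2 - b2
    return (b2 / d2) * m * np.eye(p) + (a2 / d2) * S

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
Sigma_star = ledoit_wolf_shrinkage(X)
```

The resulting matrix is always symmetric positive definite, which is what makes it a stable replacement for B in the determinantal equation.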

To assess the performance of our Tracy-Widom empirical estimator in this extended setting, we repeated the simulations from Section 4.1, substituting the matrix B with its linearly shrunk equivalent B*. The results appear in Figure 6, and they are very similar to the earlier results: we obtain good agreement even with a small number of permutations, and the method of moments provides a better approximation than the other fitting procedures.

Figure 6: Approximation to the CDF under a linearly shrunk B, using four different values for the number of permutations, compared to the true CDF.

7 Discussion

In this work, we investigated the singular double Wishart problem, which arises in the multivariate analysis of high-dimensional datasets. We presented an empirical estimator of the distribution of the largest root that is simple, efficient, and valid for high-dimensional data. Through simulation studies, we showed that our approach leads to a good approximation of the true distribution, and that it leads to valid p-values. Finally, using a pathway-based approach, we analysed the relationship between DNA methylation and disease type in patients with systemic auto-immune rheumatic diseases. Our analysis used the empirical estimator in two settings: a test of the equality of covariance matrices, and the dimension-reduction method known as PCEV.

The empirical estimator we presented fills a gap in high-dimensional multivariate analysis. Many common methods, such as MANOVA and CCA, fit into the framework of double Wishart problems. However, classical hypothesis testing breaks down in high dimensions, and therefore analysts often rely on computationally intensive resampling techniques to perform valid inference. For example, genomic studies often require significance thresholds of the order of 10^-6 and lower in order to correct for multiple testing. In this context, a permutation procedure would require at least one million resamples. By relying on results from random matrix theory, we can drastically cut down the required number of permutations: since we only need to estimate two parameters of a location-scale family, a good approximation is achieved with fewer than 100 permutations. Critically, this number is independent of the number of tests performed, and therefore the computation time is reduced by several orders of magnitude.
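To make the two-parameter fit concrete, here is a sketch of the method-of-moments step: a small set of permutation statistics pins down the location and scale of a Tracy-Widom family, and the TW₁ law is evaluated through the shifted-gamma approximation of Chiani (2014). The helper name, the stand-in permutation statistics, and the assumption that the statistics are already on the scale where the Tracy-Widom limit applies are ours, not the paper's.

```python
import numpy as np
from scipy.stats import gamma

# Shifted-gamma approximation to the Tracy-Widom TW1 law (Chiani, 2014):
# TW1 is approximately Gamma(K, THETA) - ALPHA with the constants below.
K, THETA, ALPHA = 46.44604884, 0.18605402228, 9.84801210
TW1_MEAN = K * THETA - ALPHA           # about -1.2065
TW1_SD = THETA * np.sqrt(K)            # about 1.268

def tw_pvalue(observed, perm_stats):
    """Method-of-moments fit of a location-scale TW1 family to a small
    number of permutation statistics, then a p-value for the observed
    statistic. Hypothetical helper; the paper's fitting procedures
    (and its alternatives to the method of moments) may differ."""
    sigma = np.std(perm_stats, ddof=1) / TW1_SD
    mu = np.mean(perm_stats) - sigma * TW1_MEAN
    z = (observed - mu) / sigma        # map back to the reference TW1 scale
    return gamma.sf(z + ALPHA, a=K, scale=THETA)

rng = np.random.default_rng(2)
perms = rng.normal(loc=5.0, scale=2.0, size=100)  # stand-in permutation stats
p = tw_pvalue(12.0, perms)
```

Because only mu and sigma are estimated, roughly 100 permutations suffice regardless of how small the significance threshold is, which is the source of the computational savings described above.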

We motivated our empirical estimator of the CDF using an approximation theorem of Johnstone (2008). Our approach is further motivated by several results from random matrix theory that suggest a central-limit-type theorem for random matrices, with the Tracy-Widom distribution replacing the normal distribution (Deift, 2007). More evidence in support of our empirical estimator comes from the study of the largest root of Wishart variates. On the one hand, Srivastava (2007) and Srivastava and Fujikoshi (2006) showed that the asymptotic distribution of the largest root is approximately equal to the distribution of the largest eigenvalue of a scaled Wishart matrix. On the other hand, several results analogous to Johnstone's theorem were also derived for Wishart distributions. Indeed, it has been shown that if λ is the largest root of a Wishart variate, then there exist centering and scaling constants μ and σ, both functions of the parameters of the Wishart distribution, such that (λ − μ)/σ converges in distribution to the Tracy-Widom distribution (Johansson, 2000; Johnstone, 2001; El Karoui, 2003; Tracy and Widom, 2009).

Finally, the results from Section 6 show that the domain of applicability of our empirical estimator extends beyond that of a strict double Wishart problem. Specifically, we showed that the matrix B in Equation 1 can be replaced by a linearly shrunk estimator based on earlier work by Ledoit and Wolf (2004). This approach can easily be applied to multivariate analysis methods in which B is explicitly the covariance matrix of a multivariate random variable; examples include the test of equality of covariance matrices and PCEV. However, more care is required in order to use these results with CCA.

References

  • Anderson [2003] Theodore W Anderson. An introduction to multivariate statistical analysis. Wiley, 2003.
  • Baylin [2005] Stephen B Baylin. DNA methylation and gene silencing in cancer. Nature Clinical Practice Oncology, 2:S4–S11, 2005.
  • Chiani [2014] Marco Chiani. Distribution of the largest eigenvalue for real Wishart and Gaussian random matrices and a simple approximation for the Tracy-Widom distribution. Journal of Multivariate Analysis, 129:69–81, 2014.
  • Chiani [2016] Marco Chiani. Distribution of the largest root of a matrix for Roy’s test in multivariate analysis of variance. Journal of Multivariate Analysis, 143:467–471, 2016.
  • Deift [2007] Percy Deift. Universality for mathematical and physical systems. In Marta Sanz-Solé, Javier Soria, Juan Luis Varona, and Joan Verdera, editors, Proceedings of the International Congress of Mathematicians Madrid, August 22–30, 2006, pages 125–152, 2007.
  • El Karoui [2003] Noureddine El Karoui. On the largest eigenvalue of Wishart matrices with identity covariance when n, p and p/n tend to infinity. arXiv preprint math/0309355, 2003.
  • El-Maarri et al. [2007] Osman El-Maarri, Tim Becker, Judith Junen, Syed Saadi Manzoor, Amalia Diaz-Lacava, Rainer Schwaab, Thomas Wienker, and Johannes Oldenburg. Gender specific differences in levels of DNA methylation at selected loci from human total blood: a tendency toward higher methylation levels in males. Human Genetics, 122(5):505–514, 2007.
  • Fraiman and Fraiman [2018] Daniel Fraiman and Ricardo Fraiman. An ANOVA approach for statistical comparisons of brain networks. Scientific reports, 8(1):4746, 2018.
  • Gossmann et al. [2018] Alexej Gossmann, Pascal Zille, Vince Calhoun, and Yu-Ping Wang. FDR-corrected sparse canonical correlation analysis with applications to imaging genomics. IEEE Transactions on Medical Imaging, 2018.
  • Hao et al. [2017] Xiaoke Hao, Chanxiu Li, Jingwen Yan, Xiaohui Yao, Shannon L Risacher, Andrew J Saykin, Li Shen, Daoqiang Zhang, and Alzheimer’s Disease Neuroimaging Initiative. Identification of associations between genotypes and longitudinal phenotypes via temporally-constrained group sparse canonical correlation analysis. Bioinformatics, 33(14):i341–i349, 2017.
  • Happ and Greven [2018] Clara Happ and Sonja Greven. Multivariate functional principal component analysis for data observed on different (dimensional) domains. Journal of the American Statistical Association, pages 1–11, 2018.
  • Horvath [2013] Steve Horvath. DNA methylation age of human tissues and cell types. Genome Biology, 14(10):3156, 2013.
  • Hudson et al. [2017] Marie Hudson, Sasha Bernatsky, Ines Colmegna, Maximilien Lora, Tomi Pastinen, Kathleen Klein Oros, and Celia MT Greenwood. Novel insights into systemic autoimmune rheumatic diseases using shared molecular signatures and an integrative analysis. Epigenetics, pages 1–8, 2017.
  • Johansson [2000] Kurt Johansson. Shape fluctuations and random matrices. Communications in Mathematical Physics, 209(2):437–476, 2000.
  • Johnstone [2001] Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327, 2001.
  • Johnstone [2008] Iain M Johnstone. Multivariate analysis and Jacobi ensembles: Largest eigenvalue, Tracy-Widom limits and rates of convergence. Annals of Statistics, 36(6):2638, 2008.
  • Johnstone [2009] Iain M Johnstone. Approximate null distribution of the largest root in multivariate analysis. The Annals of Applied Statistics, 3(4):1616, 2009.
  • Kanehisa et al. [2017] Minoru Kanehisa, Miho Furumichi, Mao Tanabe, Yoko Sato, and Kanae Morishima. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic acids research, 45(D1):D353–D361, 2017.
  • Ledoit and Wolf [2004] Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.
  • Luceño [2006] Alberto Luceño. Fitting the generalized Pareto distribution to data using maximum goodness-of-fit estimators. Computational Statistics & Data Analysis, 51(2):904–917, 2006.
  • Mardia et al. [1979] KV Mardia, JT Kent, and JM Bibby. Multivariate analysis. Academic Press, 1979.
  • Muirhead [2009] Robb J Muirhead. Aspects of multivariate statistical theory, volume 197. John Wiley & Sons, 2009.
  • Park et al. [2017] Yeonhee Park, Zhihua Su, and Hongtu Zhu. Groupwise envelope models for imaging genetic analysis. Biometrics, 73(4):1243–1253, 2017.
  • Pesonen et al. [2017] Maiju Pesonen, Jaakko Nevalainen, Steven Potter, Somnath Datta, and Susmita Datta. A combined PLS and negative binomial regression model for inferring association networks from next-generation sequencing count data. IEEE/ACM transactions on computational biology and bioinformatics, 2017.
  • Pourahmadi [2013] Mohsen Pourahmadi. High-dimensional covariance estimation: with high-dimensional data, volume 882. John Wiley & Sons, 2013.
  • Srivastava [2007] Muni S Srivastava. Multivariate theory for analyzing high dimensional data. Journal of the Japan Statistical Society, 37(1):53–86, 2007.
  • Srivastava and Fujikoshi [2006] Muni S Srivastava and Yasunori Fujikoshi. Multivariate analysis of variance with fewer observations than the dimension. Journal of Multivariate Analysis, 97(9):1927–1940, 2006.
  • Tenenbaum [2017] Dan Tenenbaum. KEGGREST: Client-side REST access to KEGG, 2017. R package version 1.14.1.
  • Tracy and Widom [1996] Craig A Tracy and Harold Widom. On orthogonal and symplectic matrix ensembles. Communications in Mathematical Physics, 177(3):727–754, 1996.
  • Tracy and Widom [2009] Craig A Tracy and Harold Widom. The distributions of random matrix theory and their applications. In New trends in mathematical physics, pages 753–765. Springer, 2009.
  • Turgeon et al. [2018] Maxime Turgeon, Karim Oualkacha, Antonio Ciampi, Hanane Miftah, Golsa Dehghan, Brent W Zanke, Andréa L Benedet, Pedro Rosa-Neto, Celia MT Greenwood, Aurélie Labbe, and for the Alzheimer’s Disease Neuroimaging Initiative. Principal component of explained variance: An efficient and optimal data dimension reduction framework for association studies. Statistical Methods in Medical Research, 27(5):1331–1350, 2018.
  • Yang et al. [2018] Biao Yang, Jinmeng Cao, Tiantong Zhou, Li Dong, Ling Zou, and Jianbo Xiang. Exploration of neural activity under cognitive reappraisal using simultaneous EEG-fMRI data and kernel canonical correlation analysis. Computational and Mathematical Methods in Medicine, 2018, 2018.
  • Zhao et al. [2017] Feng Zhao, Lishan Qiao, Feng Shi, Pew-Thian Yap, and Dinggang Shen. Feature fusion via hierarchical supervised local CCA for diagnosis of autism spectrum disorder. Brain imaging and behavior, 11(4):1050–1060, 2017.