The Shrinkage Variance Hotelling T^2 Test for Genomic Profiling Studies

12/07/2017 ∙ by Grant Izmirlian, et al. ∙ National Institutes of Health 0

Designed gene expression micro-array experiments, consisting of several treatment levels with a number of replicates per level, are analyzed by applying simple tests for group differences at the per gene level. The gene level statistics are sorted and a criterion for selecting important genes which takes into account multiplicity is applied. A caveat arises in that true signals (genes truly over or under expressed) are "competing" with fairly large type I error signals. False positives near the top of a sorted list can occur when genes having very small fold-change are compensated by small enough variance to yield a large test statistic. One of the first attempts around this caveat as the development of "significance analysis of micro-arrays (SAM)", which used a modified t-type statistic thresholded against its permutation distribution. The key innovation of the modified t-statistic was the addition of a constant to the per gene standard errors in order to stabilize the coefficient of variation of the resulting test statistic. Since then, several authors have proposed the use of shrinkage variance estimators in conjunction with t-type, and more generally, ANOVA type tests at the gene level. Our new approach proposes the use of a shrinkage variance Hotelling T-squared statistic in which the per gene sample covariance matrix is replaced by a shrinkage estimate borrowing strength from across all genes. It is demonstrated that the new statistic retains the F-distribution under the null, with added degrees of freedom in the denominator. Advantages of this class of tests are (i) flexibility in that a whole family of hypothesis tests is possible (ii) the gains of the above-mentioned earlier innovations are enjoyed more fully. This paper summarizes our results and presents a simulation study benchmarking the new statistic against another recently proposed statistic.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Gene expression microarrays provide a fast and systematic way to identify genes differentially expressed between two or more experimental groups of samples in a hypothesis driven study. These samples and experimental groups could be, for example, human prostate cancer cell line RNA samples treated with two or three different agents, or treated with the same agent at differing concentrations. Right now cDNA chips contain on the order of ten thousand genes, while oligonucleotide arrays contain upwards of twelve thousand genes. In the not too distant future entire genome chips will become available. The consequence is a tremendous savings in time and resources as the per gene expense in time and resources for the preliminary screening of genes has gone down considerably. Nonetheless, the considerable cost per array results in experiments that are typically based upon few replicates. For example, an experiment consisting of two experimental conditions might have just three replicates per set of conditions.

While the shift in platfroms from cDNA arrays to oligonucleotide arrays has resulted in the reduction in various sources of within gene and extra gene variability, the reality is that there is still a great deal of endemic noise in these sorts of investigations. Given the small number of replicates, power is a primary concern. Albeit, the goal of statistical analysis in this setting is to arrive at a relatively short list of candidate genes that warrant further investigation via a more sensitive and specific technique such as PCR. The investigator typically has aloted specific resources for the further investigation of a given number of genes and will request a “short list” of the requesite length. Therefore, the role of efficiency and power may not be completely appreciated. Clearly, however, the goal is to present the best possible list, so that the role of efficiency and power can now be understood.

A caveat arises in that true signals (genes truly over or under expressed) are “competing” with fairly large type I error signals. False positives near the top of a sorted list can occur when genes having very small fold-change are compensated by small enough variance to yield a large test statistic. One of the first attempts around this caveat was the development of “significance analysis of micro-arrays” or (SAM), [11], which used a modified t-type statistic thresholded against its permutation distribution. The key innovation of the modified t-statistic was the addition of a constant to the per gene standard errors in order to stabilize the coefficient of variation of the resulting test statistic. Since then, ([12], [6]) several authors have proposed the use of shrinkage variance estimator in conjunction with t-type and more generally, ANOVA type tests at the gene level. One advantage of this latter approach is that it doesn’t require the computation of ad-hoc fudge constants. In the situation under study, e.g. a hypothesis driven experiment consisting of a small number of experimental groups, a natural model is the per gene linear model on the appropriate scale, leading to a per gene ANOVA type test of the null. Recent work, ([12], [2], [6]

), presented a model in which the per gene residual variance parameters were considered to be draws from an inverse gamma distribution, resulting in a “shrinkage variance test” that could potentially have gains in efficiency depending on the heterogeneity of the extra gene variability. The idea of using a shrinkage estimate of within group variance has also been persued by others such as

[1], [7], and [8]. One assumption of that model which is often violated in applications is that the extra gene variability is consistent across experimental conditions. In order to circumvent this restrictive assumption, we extend that work to the multivariate setting arriving now at a whole class of hypothesis tests based upon a shrinkage variance Hotelling . If there is any appreciable between-group correlation, this approach constitutes a more efficient use of the scarce data available per gene data. Furthermore, as we shall point out in this work, the incorporation of a shrinkage variance/covariance estimator into the usual Hotelling statistic accomplishes the goals of the earlier innovations to an even greater degree.

2 Background and Motivation: Designed Gene Expression Micro-Array experiments

The impetus for this work were two microarray studies with which the authors have been involved. The first of these was a spotted cDNA array experiment studying the effects of the isoflavone/phytoestrogen genestein on gene expression in the LnCAP cell line. Several batches of colonies were treated with either M, M, M, genestein or control media and allowed to grow for 24 hours. Messenger RNA (mRNA) isolated from each of the treated groups was hybridized onto the green channel of a corresponding micro-array, while mRNA isolated from the control treated colony was hybridized onto the red channel of each micro-array. This experiment was conducted independently and in identical fashion on three separate dates. Systematic variability occuring from array to array and within array were adjusted out in the manner suggested by [3]. Within each experimental replicate and for each gene, the log base two of the ratio of normalized green to red channel expression values were calculated and used in subsequent analysis. The research questions being investigated were (i) whether there was differential expression between the green and red channels under treatment with genestein at any of the three concentrations, and if so (ii) was there a trend in this effect.

The second study was an oligonucleotide micro-array experiment studying the effects of two hormones, dehydroepiandrosterone (DHEA) and dihyrotestosterone (DHT), on gene expression in the LnCAP line. Again, several batches of colonies were treated with either DHEA, dhT, or control media and allowed to grow for 24 hours. mRNA isolated from each of the two treated colonies as well as from the control treated colony was hybridized onto one of three corresponding single channel oligonucleotide arrays. The raw image files, in CEL format, were imported into the R statistical computing platform [10]. For each gene, the probe set was summarized into a model based gene expression index [5], using the Bioconductor suite of add-on libraries for R [4]. Within each experimental replicate and for each gene, the log base two of the expression ratios of treatment to control were calculated and used in subsequent analysis. The research questions here were (i) whether there was differential expression between treatment and control under treatment with either hormone, any of the three conditions, and if so (ii) was there differential expression between the two treatments.

3 The Shared Hotelling statistic

As indicated in the introductory remarks above, the new methodologic tool introduced here is a Hotelling statistic for a variety test of the null which incorporates a shrinkage estimate of the per gene residual variance. Suppose that each preprocessed microarray yields expression levels on each of genes. In the type of studies dealt with here we have a total of such microarrays arising from identical replicates of an experiment having experimental conditions or “treatments”. Here as is usually the case, the measurements being analyzed will be the log base two of a treatment to control ratio. For each of the genes, we consider these measurements as an i.i.d. sequence of

-demensional random variables,

, where we allow the possibility that there may be a different number of measurements for different genes due to reading errors. We assume such missingness is completely at random. Let and be the -dimensional sample mean and unbiased sample covariance matrix corresponding to the sample . Denote by and the CDFs corresponding to central and non-central distributions, respectively, of degrees and , the latter having non-centrality parameter

. The following theorem shows that, under an assumed conjugate prior, we can replace the estimated covariance matrix in the usual Hotelling

test with a shrinkage estimate and still retain the property that the resulting test has an

distribution under the null hypothesis.

Theorem 1: Suppose that and for a given gene, , that

  • conditional upon , is i.i.d. ,

  • { is i.i.d. and independant of the above.

  • Let .

(1)

The model in items 1 and 2 above is called the multivariate normal/inverse Wishart model in the following. The above statistic has the potential for fair sized gains in efficiency. The most ideal situation occurs when the average (over genes) of the within gene variability is reasonably small but there is reasonable spread across genes in the magnitude of this variation. In such a case, the parameter would not add so much magnitude to the denomenator, while the shape parameter, would gives us extra degrees of freedom as if we had more replicates per experimental condition. In reality there is trade off between these two phenomena, and one checks for gain in efficiency by comparing with the standard Hotelling .

Next, we note that, as is the case in the usual Hotelling

statistic, a whole family of statistics arises by applying a linear transformation. We state this as a corollary to the above theorem.

Corollary 1: Assume conditions (1) and (2) above except without any restriction on and relative to oneanother. Consider the matrix , which is chosen to be of dimension of rank . Then we can replace and by and in the theorem above and the conclusions still follow.

The above theorem and its corollary are used to test a variety of null hypotheses, where . There are three natural choices for . Call these the “zero means” contrast, , the “equal means” contrast, ,and the “no trend” contrast, . Specifically, these are given by: , which requires that , , which requires that , and , which requires that and . The application of these results to testing hypotheses in the analysis of both cDNA and oligonucleotide arrays will be clearly laid out in the section which follows.

Notice in the definition of the statistics given above in 1, the parameter matrix, , and the shape parameter, arising in the prior distribution of are assumed to be “known”. The next result is used to estimate and via maximum likelihood using the data the which under our model are i.i.d. draws from the density given below in 2.

Theorem 2: Under the conditions of theorem 1, has density function equal to

(2)

4 Other statistics under study

Here we will be using the notation of theorem 1 and its corollary above. In addition to the quantities presented there, we write for the

dimensional column vector containing the observations

stacked by replicate within component, and for the total within group sum of squares. Notice that within a particular set of distributional assumptions on and a particular framework for constructing test statistics, a variety of hypothesis tests are made possible through the application the appropriate linear transformation to the data. That being said, we restrict attention in this portion of exposition to tests of zero group means. The simplist statistic is the standard F-statistic, which assumes that the sequence of random vectors is stochastically independent with identical distribution given by independent normals having component means and common variance . Under the null hypothesis and under these distributional assumptions,

(3)

has an distribution with and degrees of freedom. This standard -statistic was modified using a univariate empirical bayes estimate of the per-gene common variance in [12]. Similar results and extensions were presented in [7], [2], and [6]. The distributional assumptions required by that method are, conditional upon , identical to the above. The difference is that has an inverse gamma distribution with shape parameter and rate parameter which results in exchangeable dependence among replicates of the experiment. Under the null hypothesis and under these assumptions,

(4)

has an distribution with and degrees of freedom. This is deduced via an argument almost identical to that in [12]. This statistic is the univariate analogue of the statistic presented here.

If we assume instead, that the sequence of random vectors is stochastically independent with identical distribution multivariate normal with mean vector components and variance covariance matrix then under the null hypothesis and under these distributional assumptions,

(5)

has an distribution with and degrees of freedom (see, for example, [9]).

5 Software: R package SharedHT2

An R package for conducting analyses using the methods of this paper has been created and is available for download at the CRAN website. One of the most desirable facets of this package is that it is entirely coded in C with minimal processing done in R. The main function “EB.Anova” fits the multivariate normal/inverse Wishart model to micro-array data and calculates the per gene statistics shown in formula 1 in theorem 1 when the argument “Var.Struct” is set to “general”. In addition, the same function can be used to fit the normal/inverse gamma model of [12] and calculate the statistics shown in formula 4 in the preceding section by setting the argument “Var.Struct” to “simple”. In both cases, the models are fit using maximum likelihood estimation. There is flexibility in the choice of hypothesis test via setting the argument “H0” to one of the following choices. Under the “general” variance structure option,

  • if , the H0=“zero.means” null may be tested.

  • if , the H0=“equal.means” null may be tested.

  • if , H0=“no.trend” null may be tested, but of course this only makes sense if .

  • The user may also set H0 to an custom contrast matrix of dimension and of rank .

Under the “simple” variance structure option, any of the above null hypotheses may be tested as long as . By default, “H0” is set to “equal.means”. The package uses S3 classes and has several other nice features. For example if the data comes from an affy experiment and the rows are named after the affy gene identifiers, then a genelist sorted on p-value can be browsed in the html viewer with links to the Weizmann Institute’s “GeneCards” database. Additionally, the simulation study presented in the following section may be repeated using included functions. It is worth mention here that these simulations were only made possible by migrating the entire procedure including the loop over simulation replicates, into C. The interested reader is encouraged to browse the documentation.

6 Comparison with other approaches–simulation study

We conducted a simulation study in order to compare the operating characteristics of the proposed shared variance Hotelling statistic () in expression 1 with those of the three other statistics, 3, 4, 5 that were described in a preceding section. In all cases the test was relative to the null hypothesis of group means identically zero, with two groups.

The first simulation study was conducted by generating data from the multivariate normal/inverse Wishart model with groups and replicated observations for each of G=12625 genes, using values for and that were obtained in the analysis of the oligonucleotide array data (see below for further details). One hundred of the genes were designated as “true positives” by giving them non-zero group specific means that were chosen in the following way. First, a value of was chosen so that

i.e., so that the statistic would have power at a type I error of 0.26% to reject the null hypothesis of zero group means. This value of

was then multiplied by the average per group standard deviation calculated under the multivariate normal/inverse Wishart model, i.e.

to arrive at the two group specific means applied identically to each of the ten designated genes.

In order to study the robustness of the test statistic to lack of model assumptions, a second simulation study was conducted using a Normal-2 component mixed inverse Wishart distribution. Specifically, the data are i.i.d. multivariate normal but the prior distribution on the random variance/covariance matrix is a mixture of two inverse Wisharts, having shape parameters and and common rate matrix . The mixing proportion, and shape parameters and were chose so that the expected value of , the per gene empirical covariance matrix, would remain identical its value under the multivariate normal/inverse Wishart model used previously, . The values used were , and . Once again, one hundred genes were designated as “true positives” by assigning means as above.

The simulation results were summarized in two ways. The first method, shown in tables 6 and 6, used the Benjamini-Hochberg FDR stepdown procedure to set the significance criterion. In each simulation replicate, the four listed statistics and corresponding p-values were calculated for each of the 12625 genes. Next, for each statistic, the list was sorted on corresponding p-value and the row containing the largest p-value not exceeding and all rwos above it were marked significant. The true positive rate was derived as the number of genes called significant as a proportion of those truely differentially expressed, i.e. 100. The false positive rate was derived as the number of genes called significant not among those 100. These were averaged over simulation replicates yielding emprical true positive rates () and empirical false positive rate (). In table 6 is shown results for the data simulated from the normal/Inverse Wishart model. The leftmost column is the nominal false discovery rate, , used in setting the significance criterion. The next eight columns are the empirical true positive and false positive rates for each of the four benchmarked statistics.

Results corresponding to data simulated from the multivariate normal/inverse Wishart model are shown in table 6. In the case of the proposed statistic, , the coincides within simulation error with the FDR. That is because the p-values are derived via the F-distribution listed in theorem 1, which assumes the data arise from a multivariate normal/inverse Wishart distribution. Notice as well that the is quite high in the 90’s at the low FDR of 0.05. The other three statistics benchmarked a clearly inferior. First, , thestandard hotelling , is nearly uninformative, displaying an of 100% at all values of with correspondingly high ranging upwards from 85%. The shrinkage variance F-statistic, , is overly concervative, with equal to zero within simulation error and ranging from 20% to 50%. Finally, the ordinary F-statistic, , is overly conservative at the lower FDR’s of 5% and 10%, but then uninformative at the higher FDR’s of 15%, 20% and 25%.

The results corresponding to data simulated from the normal/mixed inverse Wishart model are shown in 6. The only notable difference relateive to remarks made above is that control over the FDR is now lost, as the no longer agrees with the . Still, if the simulation model can be considered an extreme departure from the model assumptions then use of the FDR=5% which gives and should be acceptible.

On the other hand one may wish to dispesne with any attempts at controling the false discovery rate at all, and instead, rely on the statistic’s ability to provide a more informative ordering. In this case, we simply decide how many genes we wish to call significant and draw the line there. For the second method of summarizing the simulation results the and were derived this time using, consecutively, each of the values of the statistic as the significance criterion. The results for data obeying model assumptions are shown in figures 1, and for data not obeying model assumptions in figure 2. It is clear that our proposed statistic, , outperforms the others when the data obeys the model assumptions presented in theorem 1. Although this advantage is attenuated when the data does not obey model assumptions, there is still a modest advantage. For this reason we recommend its use over the one dimensional test, and the related SAM of [11].

Table 1

ShHT2 HT2 ShUT2 UT2
FDR TPF FPF TPF FPF TPF FPF TPF FPF
0.05 0.929 0.045 1.00 0.857 0.204 0.00 0.014 0.00
0.10 0.964 0.093 1.00 0.927 0.341 0.00 0.079 0.00
0.15 0.976 0.141 1.00 0.951 0.424 0.00 1.00 0.992
0.20 0.983 0.192 1.00 0.963 0.480 0.00 1.00 0.992
0.25 0.987 0.242 1.00 0.970 0.526 0.00 1.00 0.992

Table 2

ShHT2 HT2 ShUT2 UT2
FDR TPF FPF TPF FPF TPF FPF TPF FPF
0.05 0.944 0.123 1.00 0.863 0.424 0.00 0.343 0.000
0.10 0.968 0.223 1.00 0.929 0.556 0.00 0.593 0.000
0.15 0.976 0.307 1.00 0.951 0.626 0.00 1.000 0.992
0.20 0.981 0.378 1.00 0.963 0.671 0.00 1.000 0.992
0.25 0.985 0.442 1.00 0.970 0.706 0.00 1.000 0.992

7 Application: Two Case Studies

As mentioned in the introductory section, these techniques were used to analyze two datasets, the first from a spotted cDNA array experiment and the second from an oligonucleotide array experiment. In the first experiment, several colonies of LnCAP cells were allowed to grow for 24 hours in the presence of either control medium or M, M, or M of genestein. Messenger RNA (mRNA) isolated from each of the treated groups was hybridized onto the green channel of a corresponding micro-array, while mRNA isolated from the control treated colony was hybridized onto the red channel of each micro-array. This experiment was conducted independently and in identical fashion on three separate dates. Systematic variability occuring from array to array and within array were adjusted out in the manner suggested by [3]. Within each experimental replicate and for each gene, the log base two of the ratio of normalized green to red channel expression values were calculated and used in subsequent analysis. Since the group dimension was and the sample size was then a test of the zero means null using the statistic was not possible. However, we tested the equal means null using both the statistic and the statistic.

The second study was an oligonucleotide micro-array experiment studying the effects of two hormones, dehydroepiandrosterone (DHEA) and dihyrotestosterone (DHT), on gene expression in the LnCAP line. Again, several batches of colonies were treated with either DHEA, dhT, or control media and allowed to grow for 24 hours. mRNA isolated from each of the two treated colonies as well as from the control treated colony was hybridized onto one of three corresponding single channel oligonucleotide arrays. The raw image files, in CEL format, were imported into the R statistical computing platform [10]. For each gene, the probe set was summarized into a model based gene expression index [5], using the Bioconductor suite of add-on libraries for R [4]. Within each experimental replicate and for each gene, the log base two of the expression ratios of treatment to control were calculated and used in subsequent analysis. The research questions here were (i) whether there was differential expression between treatment and control under treatment with either hormone, any of the three conditions, and if so (ii) was there differential expression between the two treatments.

Table 7

Gene dhea dht stat p-val FDR=0.10
1 34319_at 1.690 4.400 273.0 1.902e-07 7.129e-06
2 36658_at 2.440 2.790 90.9 8.665e-06 1.426e-05
3 33998_at 0.519 0.275 85.1 1.084e-05 2.139e-05
4 38827_at 1.310 1.840 64.1 2.836e-05 2.851e-05

References

8 Appendix Proofs of theorems

Proof of theorem 1: In the following, for any symmetric matrix with spectral decomposition , let be the symmetric square root of . is the symmetric square root of . Square root matrices without the subscript are considered Cholesky square roots, but will not appear in this manuscript. First, rewrite as follows:

where equality in distribution follows from the fact that because and are both positive definite and symetric, it follows from theorem A9.9 of [9] that they are an orthogonal similarity transformation of eachother and since the latter has a Wishart distribution (see below), equality in distributions follows from the invariance of the Wishart distribution to orthogonal similarity transformations.

Next we make the following observations:

  • has the distribution and is therefore, independent of . This is because the conditional distribution of given is .

  • has the distribution, because has the .

Therefore, the sum,

has the distribution. Next, put and rewrite as

Notice that since has been rescaled, it is independant of . Next, because is the sample mean, it is independant of the sample covariance matrix, . Thus and are independant. Next, it follows from theorem 3.2.12 of [9], the denominator is distributed and independent of the . Because the numerator is , it follows that has the distribution.

Proof of theorem 2: As stated above, the conditional distribution of given is which has density:

while has the distribution, which has density:

Taking the product of the two above densities and reorganizing factors yeilds:

Thus, the posterior distribution of given is , and so the distribution of is the one given in expression 2.

References

  • [1] P. Baldi and A. D. Long,

    A bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes

    , Bioinformatics 17 (2001), 509–519.
  • [2] X.-G. Cui, J.-T. G. Hwang, J. Qui, N. J. Blades, and Churchill G. A., Improved statistical tests for differential gene expression by shrinking variance components estimates, Biostatistics (2005).
  • [3] S. Dudoit, Y.-H. Yang, M. J. Callow, and T. P. Speed, Statistical methods for identifying differentially expressed genes in replicated cdna microarray experiments, Tech. report, UCB statistics, 2000.
  • [4] R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y.-C. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y.-H. Yang, and J.-H. Zhang, Bioconductor: Open software development for computational biology and bioinformatics, Genome Biology (2004).
  • [5] C. Li and W. H. Wong,

    Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection

    , PNAS (2001).
  • [6] I. Lönnstedt, R. Rimini, and P. Nilsson, Empirical bayes microarray ANOVA and grouping cell lines by equal expression levels, Statistical Applications in Genetics and Molecular Biology 4 (2005), Article 7.
  • [7] I. Lönnstedt and Speed T., Replicated microarray data, Statistica Sinica 12 (2002), 31–46.
  • [8] R. Menezes, Heirarchical modeling to handle heteroscedacisity in microarray data, Presentation at 3-day Workshop on Statistical Analysis of Gene Expression Data, Richardson,S. and Brown,P. Organizers, 2003.
  • [9] R. J. Muirhead, Aspects of multivariate statistical theory, John Wiley & Sons, Hoboken, NJ, 1982.
  • [10] R Development Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2004, ISBN 3-900051-00-3.
  • [11] V. G. Tusher, R. Tibshirani, and G Chu, Significance analysis of microarrays applied to the ionizing radiation response, PNAS (2001).
  • [12] G. W. Wright and R. M. Simon, A random variance model for differential gene detection in small sample microarray experiments, Bionformatics 19 (2003), 2448–2455.