Visual High Dimensional Hypothesis Testing

01/02/2021
by   Xi Yang, et al.

In exploratory data analysis of known classes of high dimensional data, a central question is: how distinct are the classes? The Direction Projection Permutation (DiProPerm) hypothesis test provides an answer that is directly connected to a visual analysis of the data. In this paper, we propose an improved DiProPerm test that addresses 3 major challenges of the original version. First, we implement only balanced permutations to increase the test power for data with strong signals. Second, our mathematical analysis leads to an adjustment to correct the null behavior of both balanced permutations and the conventional all permutations. Third, new confidence intervals (reflecting permutation variation) for test significance are also proposed for comparison of results across different contexts. This improvement of DiProPerm inference is illustrated in the context of comparing cancer types in examples from The Cancer Genome Atlas.




1 Introduction

1.1 Motivation

During the exploration of cancer data, an interesting question is: how different are the various cancer types? In Figure 1, the data are a subset of The Cancer Genome Atlas (TCGA) Pan-Cancer data (Liu et al. (2018)) with 12478 genes and 1523 cases from 5 types of cancer. The tissues came from 5 different organs, and there is a key interest in how similar and dissimilar they are. Figure 1 is a principal component analysis (PCA) (Jolliffe (1986)) scatter plot, which displays the two-dimensional distribution of the PC1 and PC2 scores. Colors and symbols are used to contrast cancer types, i.e. classes. In particular, the 173 Acute Myeloid Leukemia (LAML) cases are represented by magenta triangles; the 138 Bladder Urothelial Carcinoma (BLCA) cases by blue asterisks; the 950 Breast Cancer (BRCA) cases by cyan plus signs; the 190 Colon Adenocarcinoma (COAD) cases by yellow stars; and the 72 Rectum Adenocarcinoma (READ) cases by red diamonds.

While PCA views such as Figure 1 often show interesting structure in data, the first few principal components only reflect the directions of maximal variation. Other aspects, such as subtype differences, may or may not appear clearly. Using the same symbols and colors as Figure 1, the left panels of Figure 2 show how a projection direction chosen to distinguish classes provides good visual separation. The mean difference (MD) direction, which points from one class mean to the other, is a natural choice for separating classes. Each row of Figure 2 corresponds to a test pair indicated by the titles on the left panels. The left panels show the distributions of the data projected on the MD directions, with colors and symbols representing classes. The colored symbols in the middle are jitter plots (Tukey (1976)), allowing the visual separation of the symbols. The colored curves are kernel density estimates (KDE, i.e. a smooth histogram, Wand and Jones (1994)) of the corresponding classes, which give good visual impressions of the varying separation of the groups. The strength of the difference between the groups is assessed using the difference of the sample means, shown in green.


Figure 1: PCA scatter plot of PC1 vs. PC2 scores from TCGA Pan-Cancer Atlas gene expression with types: LAML: magenta triangles; BLCA: blue asterisks (*); BRCA: cyan plus signs (+); COAD: yellow stars; READ: red diamonds. This suggests LAML is very different from the rest. BLCA, BRCA, COAD, and READ seem to be relatively close to each other.

While such graphics are suggestive, they can be deceptive. Hence it is important to complement such visualizations with the statistical inference provided by a formal hypothesis test. The right panels of Figure 2 show how the DiProPerm hypothesis test is implemented. That test, proposed by Wei et al. (2016), directly tests what we visually see in projections of high dimensional data. In the following, we call that test all permutations. An important point of this paper is an improvement called balanced permutations, shown in red in the right panels of Figure 2.

The difference between populations is measured using the difference of the projected means, shown in green. Significance is assessed using a permutation null distribution, based on simulated realizations shown as black circles and summarized by a black kernel density estimate. For assessing the statistical significance of such distances or differences between distributions of projections, the empirical P-value (the proportion of the black circles that are larger than the green mean difference) is a natural choice. When comparing two such tests, very often both empirical P-values are zero. In such cases, we find that better comparisons come from the Z-score Z = (t − μ̂) / σ̂, where t is the observed statistic (the vertical green line in the top right panel and the green text), and μ̂ and σ̂ are the mean and sample standard deviation of the null realizations (black circles). Larger Z-scores indicate stronger significance and rejection of the null hypothesis.
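For concreteness, here is a minimal Python sketch of the DiProPerm mean difference statistic and its all-permutations Z-score. The authors' released implementation is in Matlab; the function and variable names below are mine, and this is an illustration rather than their exact code.

```python
import numpy as np

def md_statistic(X, Y):
    """Mean difference (MD) statistic: the distance between the two class means.

    Projecting the data onto the MD direction and differencing the projected class
    means gives exactly this distance. X, Y are (cases x variables) arrays for
    classes +1 and -1."""
    return np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

def diproperm_all_permutations(X, Y, n_perm=1000, seed=None):
    """Observed MD statistic, empirical P-value and Z-score from all permutations."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    n_plus = X.shape[0]
    t_obs = md_statistic(X, Y)
    t_null = np.empty(n_perm)
    for j in range(n_perm):
        idx = rng.permutation(pooled.shape[0])    # random relabeling of all cases
        t_null[j] = md_statistic(pooled[idx[:n_plus]], pooled[idx[n_plus:]])
    p_value = np.mean(t_null >= t_obs)            # proportion of null stats at least as large
    z_score = (t_obs - t_null.mean()) / t_null.std(ddof=1)
    return t_obs, p_value, z_score
```

Only the permutation-drawing step changes for the balanced variant proposed in Section 2.4.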

In the top panels (testing COAD vs. READ) of Figure 2, the mean difference statistic, 23, is shown both as green numbers and as a green vertical line on the right. In the top left panel of Figure 2, the red and yellow symbols overlap. This is consistent with the mean difference statistic being comparatively small. In the top right panel of Figure 2, the densities of the red (balanced permutations) and the black (all permutations) are very similar to each other. The Z-score is substantially smaller than in the panels below, in particular around 4 for both red and black. This indicates that COAD and READ are somewhat close to each other, as suggested in Figure 1, and all permutations and balanced permutations give the same test result in this case.


Figure 2: DiProPerm diagnostic graphs. Top left: distribution of data projected on the MD direction (class +1: COAD; class -1: READ). The colors and symbols are the same as in Figure 1. The green text shows the corresponding mean difference test statistic. The top right shows the graphical DiProPerm test results for COAD vs. READ. Each row corresponds to one test pair. The middle row is a similar graph for testing BRCA vs. BLCA and the bottom row is for BLCA vs. LAML. The black densities estimate the null distributions from all permuted differences shown as black circles (the original DiProPerm using all permutations). The red densities estimate the null distributions from balanced permuted differences shown as red + signs (the proposed DiProPerm using balanced permutations). This figure shows a major improvement from using balanced permutations in the latter two cases.

In the middle left panel of Figure 2, the cyan and blue are more separated and the mean difference statistic, 83, is larger than that in the top panels. It is shown only as green numbers because the corresponding vertical green line would lie far beyond the range of the null distributions. The corresponding Z-score is much larger, 55 for the original version (black) and 65 for the proposed version (red), showing much stronger statistical significance. This indicates that BRCA and BLCA are more distinct than COAD and READ. Note that the range of the null distributions (both red and black) in the middle right panel is smaller than that in the top right panel and bottom right panel. This is due to the smaller variances of the null distributions, which are closely related to the sample sizes. In Section 2.3, we will show that larger sample sizes lead to smaller variances for the permuted null distributions. Here, we have 950 BRCA cases and 138 BLCA cases, so this test pair has the largest sample size among the 3 pairs shown in Figure 2.

In the bottom panels of Figure 2, the mean difference statistic is even larger, 201, and again far from the permuted null distributions. The blue and magenta are more separated compared to the above pairs. However, the Z-score of 41 (black) may be surprisingly small. This is due to the high variation in the black permuted differences. This larger variance is carefully investigated in Section 2.1 and seen to be an artifact of the permutation method. That problem can be solved using the balanced permutations (red) as introduced in Section 2.4. In the top panels, there is not much difference between the balanced permutations (red) and all permutations (black). The strong impact of balanced permutations is demonstrated in the middle and bottom rows. Using all permutations (black), it appears that the difference (Z-score=55) in BRCA vs. BLCA is stronger than the difference (Z-score=41) between BLCA and LAML. However, we have the opposite conclusion from using the balanced permutations (red), where it seems that the BLCA vs. LAML difference (Z-score=127) is stronger than BRCA vs. BLCA (Z-score=64).

Deeper comparisons of all and balanced permutations appear in Sections 2, 3 and 5.

The above figures motivate the 3 major contributions of this paper:

  1. The original version can have a skewed and multimodal null distribution, which leads to misleading results when testing data with a strong signal. In Section 2.4, we propose using only balanced permutations. This increases the power of the test in high signal situations, and the null follows a more reasonable unimodal distribution.

  2. The above recommendation of balanced permutations is in direct contradiction to the recommendation of Southworth et al. (2009). They appeal to group theory and suggest that all permutations are generally superior to balanced permutations since balanced permutations tend to be anti-conservative. In Section 3, we show this effect is strongest for the very small sample sizes of the era when that paper was written. Our theoretical analysis provides an adjustment to the inference for both all permutations and balanced permutations. This enables us to exploit the improved power available from balanced permutations.

  3. When computing this DiProPerm Z-score, we typically sample a relatively small number of permutations. In Section 4.1, we propose confidence intervals that account for the Monte Carlo uncertainty.

The practical usefulness of this is demonstrated by studying a real data example in Section 5. Matlab code is available at GitHub: https://github.com/mouseteeth/DiProPerm-test

2 Large Signals Lead to the Loss of Power

In this section, we first investigate the surprising behavior that stronger signals can lead to smaller Z-scores using a simulated example in Sections 2.1, 2.2, and 2.3. Then we propose a modification of DiProPerm that addresses this issue in Section 2.4.

2.1 Simulation study

In this section, we study the behavior of the DiProPerm Z-score using a simple Gaussian simulation. The simulated data are drawn i.i.d. from shifted and scaled standard normal distributions, where m and n denote the numbers of cases in classes +1 and -1 respectively, d is the number of variables (rows) in each case, and δ is the distance between the class centers, i.e. the signal strength.

  1. Class +1 (X): m cases drawn i.i.d. from a spherical Gaussian distribution.

  2. Class -1 (Y): n cases drawn i.i.d. from a spherical Gaussian distribution whose center is a distance δ from that of Class +1, along a unit vector u.

The unit vector u can be taken to be, for example, a coordinate direction; its actual direction is irrelevant because the DiProPerm test is rotation invariant.
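For concreteness, a short sketch of this simulated data follows. Since the exact centering is not spelled out above, placing the two class centers at ±(δ/2)·u, so that they are a distance δ apart, is an assumption of this sketch, as is the use of unit variance.

```python
import numpy as np

def simulate_two_class(n_plus, n_minus, d, delta, seed=None):
    """Two Gaussian classes with identity covariance and centers a distance delta apart.

    Centering at +/- (delta / 2) * u is an assumption of this sketch; only the distance
    between the centers matters for the rotation invariant DiProPerm test."""
    rng = np.random.default_rng(seed)
    u = np.zeros(d)
    u[0] = 1.0                                               # any unit vector works
    X = rng.standard_normal((n_plus, d)) + (delta / 2) * u   # Class +1
    Y = rng.standard_normal((n_minus, d)) - (delta / 2) * u  # Class -1
    return X, Y
```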

The corresponding DiProPerm Z-score, based on all permutations, is defined as in Section 1.1. Figure 3 shows realizations of the Z-score (colored circles) for different choices of d and δ, with colors representing d. The vertical axis shows a random Z-score from DiProPerm; each circle is a single realization at one point of an equally spaced grid of 200 values of δ between 0 and 20. For each d, the Z-score first goes up (as expected from increasing signal strength) and then goes down. This surprising decrease in power is carefully investigated in Sections 2.2 and 2.3. The general tendency is reflected by the dashed curves, whose formulas are given there.


Figure 3: Realizations of the Z-score (circles) for different choices of d (shown with colors) and signal strength δ (x-axis). The y-axis shows the Z-scores from DiProPerm's results. The dashed curves are introduced in Section 2.3 as approximate local centers of the circles. Three representative cases are highlighted as black stars (discussed below). For each d, the approximate local center of the Z-scores first goes up and then, perhaps surprisingly, goes down, demonstrating a serious weakness of using all permutations.

This behavior is strongest for the red circles (d = 100) shown in Figure 3, which will be explained by studying the DiProPerm permutation distributions. In Figure 4, the 3 columns correspond to the three cases shown using stars in Figure 3. The top three panels are similar to the right panels in Figure 2. These 3 panels show the simulated realizations of the permuted mean differences, one colored dot per permutation (usually 100 or 1000 permutations are used), together with their kernel density estimates, for d = 100 and the three values of δ highlighted as black stars in Figure 3. Recall that these 3 values of δ represent the increasing, peak, and decreasing regions of the Z-scores for d = 100. From left to right, the kernel density estimates become more skewed and multi-modal. The right panel is severely multi-modal. The modes differ because of the different means of the permuted subgroups. Different means are caused by different proportions of the subgroups in the permuted sample. These proportions are quantified by the coefficient of unbalance as defined in (3) in Section 2.3. The colors of the dots in these 3 top panels represent the absolute value of the coefficient of unbalance of each permutation, using the color bar shown in Figure 5. As the signal δ gets stronger, there is much more separation of colors based on the unbalance. In the top left panel, which has the smallest signal, the colored dots are mostly mixed together. In the top middle panel, as the signal strength increases, the colored dots separate more. In the top right panel, the dots around each peak have a different color, showing a strong mixture distribution pattern and separation according to the value of the coefficient of unbalance. In particular, in the top right panel, the permuted differences with smaller unbalance tend to appear on the left side, while those with larger unbalance appear on the right side.

The bottom two rows show the projections on permuted directions for selected permutations. In each case, the symbols represent the original labels and the colors show the permuted class labels, whose mean difference determines the direction. Each panel of the middle row shows the permutation with the maximal permuted mean difference for the corresponding column. These permutations with the maximal statistic correspond to the far-right colored circles in each top panel. Going from left to right, the permuted mean difference direction first separates the red/blue permuted class colors and then tends to separate the symbols (the original class labels). This direction essentially becomes the original mean difference direction of the non-permuted data. This effect is usefully quantified by the angle between the observed mean difference and each permutation direction, shown in each panel. A large angle suggests a large discrepancy between the original mean difference and the corresponding permutation direction. The middle left panel separates the colors well and mixes up the symbols, as intuitively expected from a permutation test, with a relatively large angle. In the middle panel, there is still some color separation but also a strong separation of the symbols, with a smaller angle. In the middle right panel, the angle is very small, showing that this direction is very close to the mean difference direction of the original data. In this situation of large signal and large unbalance, we also observe a large permuted mean difference. Indeed, for large signals, the value of the permuted statistic strongly feels the unbalance of the corresponding permutation; in particular, the stronger the unbalance, the larger the permuted statistic.

The bottom three panels of Figure 4 show the projections for the permutations with the smallest permuted statistic, represented as the far-left colored stars in the top panels. The large signal does not so strongly dominate these nearly balanced permutations, where the angles are relatively large. These relationships between colors and symbols are related to the geometric representation ideas of Hall et al. (2005). The discussion above indicates that restricting DiProPerm to use only balanced permutations, i.e. the red dots, could potentially be a dramatic improvement. Such a methodology is proposed in Section 2.4.
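The angles reported in these panels are the angles between the observed MD direction and each permuted MD direction. A small illustrative helper for computing them follows; the function names are mine.

```python
import numpy as np

def md_direction(X, Y):
    """Unit vector pointing from the class -1 mean to the class +1 mean."""
    v = X.mean(axis=0) - Y.mean(axis=0)
    return v / np.linalg.norm(v)

def angle_between(v, w):
    """Angle in degrees between two unit vectors, as reported in the panels."""
    cos = np.clip(np.dot(v, w), -1.0, 1.0)   # clip to guard against round-off
    return np.degrees(np.arccos(cos))
```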


Figure 4: DiProPerm permutation results for d=100. The left, middle, and right columns correspond to three increasing signal strengths (the realizations that generate the three cases shown as black stars in Figure 3). Top panels show the permuted statistics and their kernel density estimates, with the Z-scores indicated in green; middle and bottom panels show chosen permutations with colors representing the permuted labels and symbols representing the original labels; middle panels are the permutations with the largest permuted statistic (highlighted as the colored circles in the top panels); bottom panels are the permutations with the smallest permuted statistic (highlighted as the colored stars (*) in the top panels). These explain the increasing skewness in the top panels as the signal grows.

Figure 5: Colorbar used in the jitter plots in the top panels in Figures 4 and 6. Numbers represent the absolute value of the coefficient of unbalance defined in Section 2.3.

2.2 Observed Statistics

As described in Section 1.1, DiProPerm is based on the observed mean difference statistic t = ||X̄ − Ȳ||, shown using green in Figure 2. Under the model in Section 2.1, X̄ − Ȳ is Gaussian with mean δu and covariance proportional to the identity, and so t, rescaled by its per-coordinate standard deviation, follows the non-central Chi distribution (as stated in Johnson et al. (1972)), because it is the square root of a non-central Chi square distribution with d degrees of freedom and non-centrality parameter given by δ divided by that same standard deviation. The mean of a non-central Chi random variable can be written in terms of the generalized Laguerre polynomial as defined in Koekoek and Meijer (1993). This gives an explicit expression for the expectation of t:

(1)

2.3 Permutation distribution

Let t_1, ..., t_N (N is the number of permutations) be the realizations of the permutation distribution, i.e. the black circles in the right panels of Figure 2. For each permutation there is a random number K of observations in each class that switch labels. Note that K is a Hypergeometric random variable whose probability mass function is

P(K = k) = C(m, k) C(n, k) / C(m + n, n),   k = 0, 1, ..., min(m, n),

where C(a, b) denotes the binomial coefficient. The conditional distribution of a permuted statistic given K = k is normal, with a mean that depends on the degree of unbalance of the permutation (detailed derivation shown in the appendices):

(2)

Thus, k/n is the proportion of the original Class -1 cases that are relabeled as the new Class +1 and (n − k)/n is the proportion of the Class -1 cases that remain in the new Class -1. The difference between these 2 proportions, denoted c_k, i.e.

c_k = k/n − (n − k)/n,   (3)

reflects the unbalance of the permutation; hence it is called the coefficient of unbalance. When c_k = 0, the mean of the null permutation distribution is canceled and we call this kind of permutation a balanced permutation, e.g. the red dots in the top panels of Figure 4.
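For concreteness, a small sketch that computes the coefficient of unbalance of a given permutation directly from the verbal definition above; the label encoding and sign convention are my own choices, not necessarily the authors' exact formula.

```python
import numpy as np

def coefficient_of_unbalance(original_labels, permuted_labels):
    """Coefficient of unbalance of one permutation, following the definition in (3).

    k is the number of original Class -1 cases relabeled as +1; the coefficient is the
    difference between k / n and (n - k) / n, where n is the size of Class -1."""
    original_labels = np.asarray(original_labels)
    permuted_labels = np.asarray(permuted_labels)
    n = np.sum(original_labels == -1)
    k = np.sum((original_labels == -1) & (permuted_labels == +1))
    return k / n - (n - k) / n    # zero exactly for balanced permutations
```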

Define

(4)

and

then the unconditional distribution of the permuted statistic is a normal mixture, similar to Theorem 1 in Wei et al. (2016):

(5)

Then the permutation null distribution is

(6)

Figure 6 gives a visual impression of Equation (6), where the colored dots are the permuted statistics, the black curves are kernel density estimates of those statistics, and the colored dashed curves are the densities of the mixture components in Equation (6), each scaled by its mixture weight. For the smallest signal, the components of the mixture model overlap with each other and the permutation distribution is relatively symmetric. For the middle signal strength, the densities of the mixture model components begin to separate and the permutation distribution starts to skew. For the largest signal, the permutation distribution shows a strong mixture pattern and skewness, and the densities of the mixture model components are separated from each other. This shows that these mixture distributions explain the behavior of the realizations that generate the black stars shown in Figure 3 and are studied in detail in Figure 4.


Figure 6: Distribution of the permuted statistics for d=100. The left, middle, and right panels correspond to the three increasing signal strengths highlighted as black stars in Figure 3. The black curves are the kernel density estimates of the colored dots (the permuted statistics). The colored dashed curves represent the theoretical density of each component of the mixture model in Equation (6). The colors represent the absolute value of the coefficient of unbalance, according to the color bar in Figure 5. As the signal increases, the subdensities become separated and the permutation distribution becomes skewed, as shown in Figure 4.

Then we have

(7)
(8)

From the above equations, larger sample sizes lead to smaller variances of the permutation distribution. This is consistent with the middle panels of Figure 2: the BRCA vs. BLCA pair has a very large sample size and the variances of the red/black circles are relatively small.

The expectation of the Z-score is hard to calculate exactly, so we consider a surrogate built from (1), (7), and (8), replacing the observed statistic and the permutation mean and standard deviation by their expectations.

As shown in Figure 3, this surrogate (for fixed d) provides a reasonable approximation of the center of the point cloud of simulated Z-scores, which indicates a good estimate in this case. The surrogate values, shown as dashed curves in Figure 3, display a very clear pattern for each d: the Z-score first increases and then decreases, apparently to some constant. The value of this constant is calculated in the appendices:

(9)

2.4 Improvement of DiProPerm

Our first improvement is to include only the balanced permutations, i.e. those with coefficient of unbalance equal to zero. This can be accomplished by solving the equation c_k = 0 for the number of switched labels k. Under the alternative hypothesis, using only balanced permutations makes the mean of the permutation distribution equal to zero, as shown in Equation (2), which eliminates the influence of the signal δ. This not only increases the power of the test, but also provides a better behaved permutation distribution.

Figure 7 shows a simple example illustrating balanced and unbalanced permutations. There are 8 cases in each class, and each row shows one labeling. The first row is colored using the real class labels, followed by 7 permutations where symbols represent the true class labels and colors represent the permuted class labels, as in the bottom 6 panels of Figure 4. The colors of the text on the right (the label balanced/unbalanced) are in the spirit of the color bar in Figure 5 and the colored dots in the top panels of Figure 4. The top 3 permutations are all balanced permutations; in these cases solving c_k = 0 gives 4 switched labels in each class. The bottom 4 permutations are all unbalanced. The original DiProPerm draws from all permutations, but the proposed improved DiProPerm only draws from balanced permutations, as shown in Figure 7. In the top panels of Figure 4, the balanced permutations (red dots) lie near the left side of each panel. Since the empirical P-values are calculated as the proportion of the colored dots that are larger than the observed statistic, using the balanced permutations (red dots) can result in a much smaller P-value and thus increase the power of the test.
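A sketch of drawing one balanced permutation in the equal-class-size setting of Figure 7 follows: exactly half of each class switches labels, so each permuted class contains the two original classes in equal proportion. Extending this to unequal class sizes via the coefficient of unbalance is not shown here, and the names are mine.

```python
import numpy as np

def balanced_permutation(n, seed=None):
    """Draw one balanced permutation for two classes of equal (even) size n.

    Returns a +1/-1 label vector of length 2n, where cases 0..n-1 are original Class +1
    and cases n..2n-1 are original Class -1; exactly n/2 labels switch in each class."""
    if n % 2:
        raise ValueError("this simple sketch assumes an even class size")
    rng = np.random.default_rng(seed)
    labels = -np.ones(2 * n, dtype=int)
    keep_plus = rng.choice(n, size=n // 2, replace=False)         # half of Class +1 keeps label +1
    switch_minus = n + rng.choice(n, size=n // 2, replace=False)  # half of Class -1 switches to +1
    labels[keep_plus] = 1
    labels[switch_minus] = 1
    return labels
```

Repeating this draw and recomputing the MD statistic for each balanced relabeling gives the red dots in the top panels of Figure 4.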


Figure 7: This figure shows 7 permutations from a simple example with 8 cases in each class. The top row shows the true class labels and the remaining rows show 7 different permutations, represented as red and blue colored label flips. The right column distinguishes between balanced and unbalanced permutations by coloring the text in the spirit of Figure 5. The proposed DiProPerm only draws from balanced permutations, which can lead to a much more powerful test. The original DiProPerm draws from all permutations.

Figure 8 compares Z-scores computed using balanced vs. all permutations by adding the former to a part of Figure 3. The circles are the all-permutations DiProPerm Z-scores from Figure 3. The dashed curves are the approximate centers of the corresponding circles given by Equation (9). The plusses are the balanced DiProPerm Z-scores. For balanced permutations, the solid curves are the approximate centers of the plusses, given below by a similar calculation:

(10)
(11)

then we have:

For small signal strengths in Figure 8, the balanced and unbalanced Z-scores overlap. When the signal reaches a certain level, the balanced Z-scores continue increasing as expected (from the increased signal strength), while the unbalanced Z-scores reach a peak and then decrease. This indicates that the balanced Z-score is much more powerful than the unbalanced Z-score in the case of strong signals.


Figure 8: Realizations of the Z-score for different choices of d and signal strength, for both the original DiProPerm (circles) and the improved DiProPerm (plusses). Green: d=1; blue: d=10; red: d=100. Plusses: Z-scores from balanced permutations; circles: Z-scores from all permutations. Solid curves: estimated centers of the plusses; dashed curves: estimated centers of the circles. The balanced Z-scores reveal a much more powerful and stable version of the DiProPerm test than the unbalanced Z-scores at larger signal strengths.

3 Balanced vs. All Permutation Controversy

Southworth et al. (2009) show that balanced permutations tend to give anti-conservative test results, i.e. the reported P-values are too small, because the balanced permutation scheme does not have a group structure. In particular, under the null hypothesis, the permutation distribution, e.g. the distribution of the black circles in Figure 2, does not have enough large values.

Under the alternative, Figure 3 reveals the strange behavior that the power of the tests based on all permutations decreases as the signal strength increases. Figure 8 adds the corresponding balanced permutations to Figure 3, showing that balanced permutations address this problem, giving power that increases with the signal strength. When the signal is weak, i.e. close to the null hypothesis, Figure 8 and the top right panel of Figure 2 show that the balanced and unbalanced permutations are very similar (especially the Z-scores). Thus, in the context we are studying, balanced permutations are superior to all permutations in large-signal cases and show little or no difference from all permutations in small-signal cases.

When Southworth et al. (2009) was written, most high dimensional data sets had a relatively small number of cases, typically in the 10s. However, current genomic data sets typically have many more patients, i.e. cases. With samples of that size, the number of balanced permutations is enormous, so the troublesome events described in Southworth et al. (2009) have a very small chance of occurring. Another contrast of contexts is that we focus on the Z-score, not on the P-value, which was the focus of Southworth et al. (2009)'s investigation.

The correlation between permutations is relevant to this anti-conservatism, which we investigate in Section 3.1. In particular, we find that the correlation between balanced permutations is larger than that between all permutations, especially for small sample sizes. We go on to give an upper bound for this correlation and a corresponding adjustment for balanced permutations.

3.1 Correlation Adjustment

We use permutations to simulate the null distributions shown as the black and red curves in the right panels of Figure 2. Recall that the Z-score is defined as Z = (t − μ̂) / σ̂. The sample mean μ̂ is an unbiased estimate of the mean of the null distribution, but the sample standard deviation σ̂ is not similarly unbiased. Because of their common dependence on the observed data, the black/red circles are positively correlated. This leads to an anti-conservative Z-score, and concerns of this type were first reported in Southworth et al. (2009). Here, we explicitly calculate the correlation, show that it is usually very small, and use it to propose an adjustment to the current Z-score.

Under the setting of Section 2.1, let us first consider the correlation between the observed mean difference and that of a single permutation with a fixed number of switched labels. As defined in Section 2.3, K is the number of observations in each class that switch labels in a permutation. Here we only give the derivation for all permutations (a similar derivation for balanced permutations can be seen in the appendices). Consider the cases in the original class +1 that are relabeled as -1 and the cases in the original class -1 that are switched to +1 in the permutation, together with the difference of the class means of the non-permuted data and that of the permutation. Thus

Using a similar derivation to that in Section 2.2, both the observed and the permuted mean differences follow non-central Chi distributions. The correlation between them is (see the appendices for the detailed derivation):

As the sample sizes grow, this correlation shrinks, and by the delta method it converges to zero in the limit.

In order to get the correlation between the mean differences from two random permutations, we add up the weighted correlations (the additive property applies due to this special setting, see the appendices for more details). For all permutations:

Using a similar derivation, the correlation between two random balanced permutations is:

In the left panel of Figure 9, we compare the theoretical correlations for balanced permutations (blue) and for all permutations (red) as functions of the sample size. The blue curve is always larger than the red, showing that balanced permutations have larger correlations than all permutations. These correlations are all very small and decrease rapidly with the sample size. For moderately large samples, the difference between them is negligible, and the correlations decrease to a limit of zero as the sample size goes to infinity.

The right panel of Figure 9 is a zoomed-in view investigating the differences between the correlations. The curves all show differences from the balanced-permutation correlation in the reference case, and all are smaller than or equal to zero, indicating that this reference correlation is the upper bound.

The dotted magenta curve is the smallest, which indicates that the corresponding all-permutations correlation is the smallest. For each color, magenta and green, the solid curves are larger than the dotted ones and the dashed curves are larger than the solid ones. This suggests that the correlations increase as d increases for both balanced and all permutations. As the sample size becomes large, all curves are close to each other, indicating the correlations all become similar and very small.


Figure 9: Correlation as a function of sample size, with different line types and colors indicating different choices of d and balanced versus all permutations. In the left panel, the red curve shows the correlations for all permutations and the blue curve those for balanced permutations. These curves decrease rapidly and are very close to each other. The right panel studies the differences between correlations in both cases. The green curves show how the balanced-permutation correlations for the different d differ from the reference correlation; the magenta curves similarly study all permutations. This shows that the reference correlation is an upper bound. Correlations rapidly decrease as a function of sample size and increase as a function of d. For moderately large samples, the correlations are already very close to the limiting value, which is also the upper bound.

In the general case we have similar results: the two correlations are close to each other, but the all-permutations correlation is always less than that for balanced permutations:

The expected sample variance of the permuted statistics, their common correlation, and the true variance of the permutation distribution are related by a known identity. Thus, instead of the original DiProPerm Z-score Z = (t − μ̂) / σ̂, we propose an adjusted Z-score in which σ̂ is corrected for this correlation. From here on, we will use the adjusted Z-score in this paper. This adjustment makes very little difference unless the sample size is very small.
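A minimal sketch of one such adjustment follows, assuming the relation in question is the standard equicorrelation identity E[sample variance] = (1 − ρ) · variance, so that the sample standard deviation is inflated by 1/sqrt(1 − ρ) before standardizing; the authors' exact formula may differ.

```python
import numpy as np

def adjusted_z_score(t_obs, t_null, rho):
    """Adjusted Z-score: inflate the sample standard deviation to undo the downward bias
    induced by a common correlation rho among the permuted statistics.

    The identity E[sample variance] = (1 - rho) * variance used here is the standard
    equicorrelation fact and is an assumption of this sketch."""
    t_null = np.asarray(t_null)
    mu_hat = t_null.mean()
    sigma_hat = t_null.std(ddof=1) / np.sqrt(1.0 - rho)
    return (t_obs - mu_hat) / sigma_hat
```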

4 Quantification of Permutation Sample Variation

For assessing the statistical significance of differences between classes, the empirical P-value based on random sampling from the set of relevant permutations is a natural choice. When the empirical P-values from many tests are all 0, the Z-score is more informative for comparing such differences, e.g. for measuring the strength of the evidence. However, both the P-values and the Z-scores inherit variation from the permutation procedure. In some cases, this variation can obscure important differences between classes, which motivates careful quantification of this uncertainty.

4.1 Confidence Intervals

One conventional method for confidence interval calculation is based on simple simulation to approximate the underlying distribution. This solution requires multiple DiProPerm runs, which can be computationally expensive. This motivates the development of an efficient way to evaluate the variation of the DiProPerm Z-score.

In Section 1.1, the Z-score was calculated as Z = (t − μ̂) / σ̂. Consequently, the Z-score is itself a random variable that depends on the permutation null distribution from which the N (e.g. 100 or 1000) black circles were drawn. Thus, a confidence interval for the Z-score can be estimated by upper and lower quantiles using bootstrap resampling. A general algorithm, based on B repetitions (e.g. 100 or 1000), is as follows, with a code sketch given after the list:

  1. Draw a B × N matrix where each row is a random sample (with replacement) from the N black circles. Calculate the sample variance and sample mean of each row. This results in B sample variances and means, which are used to compute B resampled Z-scores.

  2. Find the upper and lower quantiles of the Z-score based on the resampled Z-scores from Step 1.
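Below is a minimal sketch of this bootstrap procedure, with B resamples and a 95% level as illustrative defaults; the function name is mine.

```python
import numpy as np

def z_score_bootstrap_ci(t_obs, t_null, B=1000, level=0.95, seed=None):
    """Bootstrap confidence interval for the DiProPerm Z-score (Steps 1 and 2 above)."""
    rng = np.random.default_rng(seed)
    t_null = np.asarray(t_null)
    N = len(t_null)
    # Step 1: resample the N permuted statistics with replacement, B times,
    # and compute a Z-score from each row of resampled values.
    rows = rng.choice(t_null, size=(B, N), replace=True)
    z_boot = (t_obs - rows.mean(axis=1)) / rows.std(axis=1, ddof=1)
    # Step 2: report the lower and upper quantiles of the resampled Z-scores.
    alpha = 1.0 - level
    return np.quantile(z_boot, alpha / 2), np.quantile(z_boot, 1.0 - alpha / 2)
```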

A similar method can be developed to estimate confidence intervals for the P-value. However, as stated above, P-values are often too small to show up in the simulation quantiles. Thus, the Z-score is more informative for comparison of the significance across such settings.

Alternatively, as discussed in Section 2.3, when the original data are close to normal, the distribution of the black circles is close to the mixture distribution derived there. Thus, we can also estimate that distribution using method of moments estimation based on the Welch–Satterthwaite approximation (Satterthwaite (1946)). However, in practice the normality assumption on the original data is often violated, while the bootstrap resampling method works well without any distributional assumption and is efficient. Thus, we recommend the bootstrap method and use only the bootstrap method in the remainder of this paper.

5 TCGA Pan-Can Data

In this section, we study our improved DiProPerm test on the 5 types of cancer in the TCGA Pan-Can data (Liu et al. (2018)) studied in Figure 1.

For all 10 pairwise TCGA hypothesis tests, Figure 10 gives a comparison of the strength of evidence against the null hypothesis that there is no difference within each pair. The random permutation variability of each is reflected by a 95% confidence interval as developed in Section 4.1. Conventional single-sample confidence intervals are shown as thick black lines; proper simultaneous inference comes from considering the thin black lines, which are Bonferroni adjusted for the fact that there are 10 intervals. The Z-scores are not at the centers of the confidence intervals because the distributions of the permutation statistics, e.g. the black/red circles in Figure 2, are skewed and not normally distributed.

In Figure 10, LAML vs. BRCA has the largest Z-score and all pairs involving LAML tend to have large Z-scores. This is consistent with Figure 1 which shows that LAML (magenta) is the most distinct cancer type.

Other pairs involving BRCA, such as BRCA vs. COAD, also have relatively large Z-scores. This is consistent with the fact that BRCA has the largest sample size, and larger sample sizes lead to smaller variances of the permutation null distribution and thus stronger statistical significance.

The Z-scores for LAML vs. READ, LAML vs. BLCA, and BRCA vs. COAD are similar to each other. The overlap of their confidence intervals shows no statistically significant difference among them, for either the individual or the Bonferroni-adjusted intervals. Other overlapping confidence intervals similarly indicate comparable test significance.

The balanced Z-score (shown in Figure 10) of LAML vs. BRCA is 193, while its Z-score from all permutations is 70. Similarly, for LAML vs. COAD and BRCA vs. COAD, the balanced Z-scores are significantly larger than the all-permutations Z-scores. This is consistent with the idea that when the signal is strong, all permutations cause a loss of power, so Z-scores from all permutations are significantly smaller than balanced Z-scores.

When the signal is weak, the balanced and all-permutations Z-scores are the same (both 5), as for COAD vs. READ in Figure 10. The Z-scores of COAD vs. READ are the smallest among all test pairs, and thus COAD vs. READ has the weakest statistical significance. This is consistent with Figure 1, where COAD (yellow) and READ (red) overlap in the PC1 and PC2 directions.


Figure 10: Balanced permutation DiProPerm 95% confidence intervals for all 10 pairwise tests of the TCGA data. The thicker lines represent the separate confidence interval for each test and the thinner lines are the multiple-test confidence intervals using the Bonferroni correction. The red dots indicate the Z-scores from balanced permutations; a second set of markers indicates the Z-scores from all permutations, and the mean difference statistics, also shown in green in Figure 2, are marked as well. The LAML tumors are very different from the other cancer types, which is consistent with the visual impression from Figure 1.

References

  • P. Hall, J. S. Marron, and A. Neeman (2005) Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (3), pp. 427–444. Cited by: §2.1.
  • N. L. Johnson, S. Kotz, and N. Balakrishnan (1972) Continuous multivariate distributions. Vol. 7, Wiley New York. Cited by: §2.2.
  • I. T. Jolliffe (1986) Principal components in regression analysis. In Principal component analysis, pp. 129–155. Cited by: §1.1.
  • R. Koekoek and H. Meijer (1993) A generalization of Laguerre polynomials. SIAM Journal on Mathematical Analysis 24 (3), pp. 768–782. Cited by: §2.2.
  • F. Leone, L. Nelson, and R. Nottingham (1961) The folded normal distribution. Technometrics 3 (4), pp. 543–550. Cited by: §B.2.
  • J. Liu, T. Lichtenberg, K. A. Hoadley, L. M. Poisson, A. J. Lazar, A. D. Cherniack, A. J. Kovatich, C. C. Benz, D. A. Levine, A. V. Lee, et al. (2018) An integrated TCGA Pan-Cancer clinical data resource to drive high-quality survival outcome analytics. Cell 173 (2), pp. 400–416. Cited by: §1.1, §5.
  • F. E. Satterthwaite (1946) An approximate distribution of estimates of variance components. Biometrics bulletin 2 (6), pp. 110–114. Cited by: §4.1.
  • L. K. Southworth, S. K. Kim, and A. B. Owen (2009) Properties of balanced permutations. Journal of Computational Biology 16 (4), pp. 625–638. Cited by: item 2, §3.1, §3, §3.
  • J. W. Tukey (1976) Exploratory data analysis. Addison-Wesley, Massachusetts. Cited by: §1.1.
  • M. P. Wand and M. C. Jones (1994) Kernel smoothing. Chapman and Hall/CRC. Cited by: §1.1.
  • S. Wei, C. Lee, L. Wichers, and J. Marron (2016) Direction-projection-permutation for high-dimensional hypothesis tests. Journal of Computational and Graphical Statistics 25 (2), pp. 549–569. Cited by: §1.1, §2.3.

Appendix A Conditional Distribution of the Permuted Statistic

Under the assumptions in Section 2.1, if we pick k cases from each class to switch labels, the permuted class means, and hence the conditional distribution of the permuted mean difference in (2), follow by direct calculation.

In this scheme, there are C(m, k) C(n, k) permutations out of C(m + n, n) overall random permutations. Thus, the probability of picking k cases from each class to switch labels is C(m, k) C(n, k) / C(m + n, n), which is the Hypergeometric probability mass function given in Section 2.3.

Appendix B Limit Distribution

In this section, we calculate the all permutations (dashed) and balanced permutations (solid) curves in Figure 8 as functions of the signal strength δ, i.e. we use the surrogate of Section 2.3 to approximate the expected Z-score.

B.1 Balanced Permutations

As shown in Section 2.2, the expectation of the observed statistic is available in closed form, so we need to calculate the corresponding quantities of the permutation distribution for both balanced and all permutations.

For balanced permutations, fix the number of switched labels at its balanced value. Similarly to Section 2.3, the permutation distribution can then be written in closed form. Thus,

so

which is monotone increasing as a function of the signal strength δ, since Laguerre polynomials are decreasing at negative arguments.

B.2 All Permutations

In the unbalanced case, the limit of the permuted MD distribution is the same as the observed MD distribution, which is consistent with what we observed in Figure 3. This implies that the limit distributions of the permuted statistics are the same in the unbalanced cases, and thus the limit distributions of the Z-score are the same as well.

For simplicity we investigate the case d = 1. Then the permuted MD statistic follows a folded normal distribution (Leone et al. (1961)), with location and scale parameters determined by the unbalance and the sample sizes. Then

Thus

Since

for any , the limiting value is:

In our simulation, the resulting limiting value is very close to the far right end of each (dashed/dotted) curve in Figure 3. This indicates the high quality of this estimate of the limiting Z-score.

Appendix C Permutation Correlation

C.1 All Permutations

Under the setting of Section 2.1, let us first consider the correlation for a fixed number of switched labels. As defined in Section 2.3, K is the number of observations in each class that switch labels in a permutation. Consider the cases in the original class +1 that are labeled -1 and the cases in the original class -1 that are labeled +1 in one permutation, together with the difference of the class means of the non-permuted data and that of the permutation. Since

we need and . From , it is straightforward that

In order to calculate , let

and be entries of . We have