A Nonparametric Normality Test for High-dimensional Data

04/10/2019 · Hao Chen, et al. · University of California, Davis · Fudan University

Many statistical methodologies for high-dimensional data assume population normality. Although a few multivariate normality tests have been proposed, they either suffer from low power or exhibit serious size distortion when the dimension is high. In this work, we propose a novel nonparametric test that extends graph-based two-sample tests by utilizing nearest neighbor information. Theoretical results guarantee the type I error control of the proposed test when the dimension grows with the number of observations. Simulation studies verify the empirical size performance of the proposed test when the dimension is larger than the sample size and, at the same time, exhibit the superior power of the new test compared with alternative methods. We also illustrate our approach on a widely used lung cancer data set from the high-dimensional classification literature, where deviation from the normality assumption may lead to completely invalid conclusions.


1 Introduction

The population normality assumption is widely adopted in many classical statistical analyses (e.g., linear and quadratic discriminant analysis in classification, normal-error linear regression models, and the Hotelling $T^2$-test), as well as in many recently developed methodologies, such as network inference through Gaussian graphical models (Ma, Gong and Bohnert, 2007; Yuan and Lin, 2007; Friedman, Hastie and Tibshirani, 2008; Rothman et al., 2008; Fan, Feng and Wu, 2009; Yuan, 2010; Liu, 2013; Xia, Cai and Cai, 2015), high-dimensional linear discriminant analysis (Bickel et al., 2004; Fan and Fan, 2008; Cai and Liu, 2011; Mai, Zou and Yuan, 2012), post-selection inference for regression models (Berk et al., 2013; Lee et al., 2016; Taylor and Tibshirani, 2018), and change-point analysis for high-dimensional data (Xie and Siegmund, 2013; Chan and Walther, 2015; Wang and Samworth, 2018; Liu, Zhang and Mei, 2019). When the data are univariate, there are many classical tools to check the normality assumption, such as the normal quantile-quantile plot and the Shapiro-Wilk test (Shapiro and Wilk, 1965). However, many modern applications involve multivariate or even high-dimensional data, which constantly calls for multivariate normality testing methods with good theoretical performance.

In this article, we aim to address the following testing problem in the high-dimensional setting with proper control of the type I error. Given a set of independent observations $X_1, \ldots, X_n \sim F$, where $F$ is a distribution in $\mathbb{R}^d$, one wishes to test

$H_0$: $F$ is a multivariate Gaussian distribution,

versus the alternative hypothesis

$H_1$: $F$ is not a multivariate Gaussian distribution.

In the literature, a good number of methods have been proposed to test the normality of multivariate data. For example, Mardia (1970) considered two statistics that measure multivariate skewness and kurtosis separately, and constructed a normality test from each of them; a Bonferroni correction can be applied to unify the two tests. More recently, Doornik and Hansen (2008) proposed a way to combine the two test statistics effectively. In another line of work, Royston (1983) generalized the Shapiro-Wilk test to the multivariate setting by applying it to each coordinate and then combining the test statistics across coordinates, while Fattorini (1986) sought the projection direction along which the data are most non-normal and then applied the Shapiro-Wilk test to the projected data. Later, Zhou and Shao (2014) combined these two approaches by considering statistics from both random projections and the original coordinates. In addition, a series of works has studied normality tests based on the characteristic function of the standardized data (Baringhaus and Henze, 1988; Henze and Zirkler, 1990; Henze and Wagner, 1997). Besides these methods, another work extends the Friedman-Rafsky test (Friedman and Rafsky, 1979), a nonparametric two-sample test, to a multivariate normality test (Smith and Jain, 1988). These methods provide useful tools for testing the multivariate normality assumption for conventional low-dimensional data.

We illustrate in Table 1 the empirical sizes of some representative existing tests: “Skewness” (the test based on the measure of multivariate skewness in Mardia (1970)), “Kurtosis” (the test based on the measure of multivariate kurtosis in Mardia (1970)), “Bonferroni” (the method combining the skewness- and kurtosis-based tests through the Bonferroni correction), “Ep” (an effective way of combining the multivariate skewness and kurtosis in Doornik and Hansen (2008)), “Royston” (the generalized Shapiro-Wilk test in Royston (1983)), “HZ” (the test based on the characteristic function proposed in Henze and Zirkler (1990)), and the extended Friedman-Rafsky test in Smith and Jain (1988). In particular, the extended Friedman-Rafsky test requires an estimate of the covariance matrix of the distribution, but their paper does not discuss how to obtain such an estimate. In the table, “eFR_0” denotes the version that uses the sample covariance matrix as the estimate, and “eFR”, an improved version, uses the adaptive thresholding approach in Cai and Liu (2011) to estimate the covariance matrix. We observe from the table that, except for eFR, all the tests are either not applicable when the dimension is larger than the sample size, i.e., $d > n$, or cannot control the type I error well when the dimension is high.

d            5      10     20     50     80     90     100    200
Skewness     0.035  0.039  0.014  0      0      0      0.114  0.384
Kurtosis     0.041  0.071  0.254  0.999  1      1      0.950  0.998
Bonferroni   0.029  0.040  0.158  0.994  0.943  1      1      0.997
Ep           0.053  0.059  0.046  0.044  0.047  0.040  0.141  --
Royston      0.073  0.092  0.080  0.137  0.129  0.164  0.168  0.245
HZ           0.048  0.051  0.051  0      1      1      1      1
eFR_0        0.056  0.041  0.048  0.081  0.153  0.145  0.161  0.088
eFR          0.045  0.046  0.048  0.041  0.038  0.038  0.044  0.042
Table 1: Empirical sizes (estimated from 1,000 trials) of the tests at the 0.05 significance level. Data are generated from the standard multivariate Gaussian distribution. Entries larger than 0.1 indicate that the size is not controlled well. No result is reported for Ep at d = 200, as the method is only applicable when the dimension is smaller than the sample size.

The extended Friedman-Rafsky test is based on the edge-count two-sample test proposed in Friedman and Rafsky (1979). Due to the curse of dimensionality, it was shown in a recent work, Chen and Friedman (2017), that the edge-count two-sample test suffers from low or even trivial power under some commonly encountered high-dimensional alternatives at typical sample sizes (ranging from hundreds to millions). The same problem also affects the extended Friedman-Rafsky test for testing normality in the high-dimensional setting; we refer to the power comparison results in Section 3 for details.

In this paper, we take into consideration the findings in Chen and Friedman (2017) and propose a novel nonparametric multivariate normality testing procedure based on nearest neighbor information. Through extensive simulation studies, we observe that the new test controls the type I error well, even when the dimension of the data is larger than the number of observations. It also exhibits much higher power than the extended Friedman-Rafsky test (eFR) in the high-dimensional setting. Moreover, we provide a theoretical guarantee that the proposed test controls the type I error when the dimension grows with the sample size. To the best of our knowledge, systematic hypothesis testing solutions with theoretical guarantees are scarce for this type of problem in the high-dimensional setting, and our proposal offers a timely response. We also apply our test to a widely used lung cancer data set from the linear discriminant analysis literature, where normality is a key assumption. The testing result provides a useful prerequisite for the analysis of such classification problems, where both the calculation of the linear discriminant rule and the subsequent analysis of the misclassification rate rest on the normality assumption.

The rest of the paper is organized as follows. In Section 2, we propose a new nonparametric procedure to test the normality of the high-dimensional data and introduce the main theorem of the new approach. The performance of the proposed method is examined through simulation studies in Section 3 and the method is applied to the lung cancer data set in Section 4. Section 5 discusses a related statistic and possible extensions of the current proposal. The main theorem is proved in Section 6 with technical lemmas collected and proved in Section 7.

2 Method and Theory

We propose in this section a novel nonparametric algorithm to test the normality of high-dimensional data. We start with the intuition behind the proposed method, and then present the main theorem on the type I error control of the new approach, which rests on the asymptotic equivalence of two nearest-neighbor events under the null hypothesis.

2.1 Intuition

A key fact about the Gaussian distribution is that it is completely determined by its mean and covariance. Suppose first that the mean $\mu$ and covariance matrix $\Sigma$ of the distribution $F$ are known; then testing whether $F$ is a multivariate Gaussian distribution is the same as testing whether $F = F_0$, where $F_0 = N(\mu, \Sigma)$. For this purpose, one may consider goodness-of-fit tests, such as the approach proposed in Liu, Lee and Jordan (2016) for high-dimensional data. One could also generate a new set of observations from $F_0$ and apply two-sample tests, such as the graph-based two-sample tests (Friedman and Rafsky, 1979; Chen and Friedman, 2017; Chen, Chen and Su, 2018), to examine $F = F_0$ for arbitrary dimensions.

However, in practice, the parameters $\mu$ and $\Sigma$ are in general unknown. As a compromise, we use the mean $\hat\mu$ and covariance matrix $\hat\Sigma$ estimated from the observations $X_1, \ldots, X_n$ as substitutes. We could again generate a new set of observations from $N(\hat\mu, \hat\Sigma)$, but unfortunately the original testing problem is no longer equivalent to testing whether $F = N(\hat\mu, \hat\Sigma)$.

To address this issue, we use the same pair $(\hat\mu, \hat\Sigma)$ to generate a set of independent observations $Y_1, \ldots, Y_n \sim N(\hat\mu, \hat\Sigma)$. We then estimate the mean and covariance matrix of these new observations and denote them by $\hat\mu_Y$ and $\hat\Sigma_Y$, respectively. Based on them, we further generate another set of independent observations $Z_1, \ldots, Z_n \sim N(\hat\mu_Y, \hat\Sigma_Y)$. Intuitively, if the null hypothesis is true, i.e., the original distribution $F$ is multivariate Gaussian, then the relationship between the X sample and the Y sample should be similar to that between the Y sample and the Z sample. We therefore test whether these two relationships are similar enough to decide whether $F$ is close enough to a Gaussian distribution.

In Smith and Jain (1988), the Friedman-Rafsky two-sample test was used for this purpose. Here, we use nearest neighbor information instead. To be specific, we pool the X and Y samples together and, for each observation, find its nearest neighbor. Similarly, we pool the Y and Z samples together and again find the nearest neighbor of each observation. Let $A$ be the event that an observation in the X sample finds its nearest neighbor within the X sample, and let $B$ be the event that an observation in the Y sample finds its nearest neighbor within the Y sample, in the respective pooled samples. We will show below in Theorem 1 that the events $A$ and $B$ are asymptotically equivalent under suitable conditions. As a result, we can approximate the null distribution of the test statistic based on $A$ through the distribution of the statistic associated with $B$.
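For concreteness, the proportion of observations whose nearest neighbor stays within their own sample can be computed as follows. This is a minimal Python sketch of our own (assuming Euclidean distance and excluding each point from its own neighbor search), not code from the paper.

```python
# Minimal sketch of the nearest-neighbor statistic: the fraction of
# observations in the first sample whose nearest neighbor, within the
# pooled sample and excluding the point itself, is also in the first
# sample. Euclidean distance is an assumption on our part.
import numpy as np
from scipy.spatial import cKDTree

def nn_own_sample_fraction(x: np.ndarray, y: np.ndarray) -> float:
    pooled = np.vstack([x, y])
    tree = cKDTree(pooled)
    # k=2: the first neighbor returned for each query point is the
    # point itself (distance 0), so the second one is the true neighbor.
    _, idx = tree.query(pooled[: len(x)], k=2)
    return float(np.mean(idx[:, 1] < len(x)))
```

Under $H_0$, the value computed on the pooled X and Y samples should behave like the value computed on the pooled Y and Z samples; Theorem 1 below makes this precise.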

2.2 The main theorem

Before presenting the main result, we first introduce some notation. Denote by $\lambda_{\min}(\Sigma)$ and $\lambda_{\max}(\Sigma)$ the smallest and largest eigenvalues of $\Sigma$. For two sequences of real numbers $\{a_n\}$ and $\{b_n\}$, write $a_n \asymp b_n$ if there exist constants $0 < c \le C$ such that $c \le a_n/b_n \le C$ for all sufficiently large $n$. We also remark that, when $d = 1$ or $d = 2$, the univariate and conventional multivariate methods surveyed in the introduction can readily be applied to test the normality assumption, so we focus on the cases where the dimension is larger than 2.

We next introduce two assumptions.

  1. The eigenvalues of $\Sigma$ satisfy $c_0 \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le C_0$ for some constants $c_0, C_0 > 0$.

  2. There exist an estimator $\hat\mu$ of $\mu$ and an estimator $\hat\Sigma$ of $\Sigma$ that are consistent at suitable rates, expressed through two sequences tending to zero as $n$ and $d$ grow; the precise rate conditions are the ones used in the proof in Section 6.

Under the above two conditions, Theorem 1 establishes the asymptotic equivalence between the events $A$ and $B$ under the null hypothesis, which in turn guarantees the type I error control of Algorithm 1 proposed below.

Theorem 1.

Assume (A1) and (A2). Then, under $H_0$, as $n \to \infty$, the events $A$ and $B$ are asymptotically equivalent.

The proof of the theorem is provided in Section 6.
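To make the notion of asymptotic equivalence concrete, one plausible formalization, which is our reading and not necessarily the paper's exact display, is that the two events occur with asymptotically equal probability under the null:

```latex
% Our reading of the theorem's conclusion (hedged; the exact display
% may differ): the nearest-neighbor events occur with asymptotically
% equal probability under the null hypothesis.
\mathbb{P}_{H_0}(A) - \mathbb{P}_{H_0}(B) \;\longrightarrow\; 0,
\qquad \text{as } n \to \infty .
```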

Remark 1.

Assumption (A1) is mild and is widely used in the high-dimensional literature (Bickel et al., 2008; Rothman et al., 2008; Yuan, 2010; Cai, Liu and Xia, 2014). In Assumption (A2), the condition on $\hat\mu$ is easily satisfied, for example by the sample mean. The condition on $\hat\Sigma$ can be satisfied by many estimators under some regularity conditions: for example, if we apply the adaptive thresholding estimator in Cai and Liu (2011) and assume that $\Sigma$ is sparse in the sense that each of its rows has a bounded number of nonzero entries, then the required rate condition holds under a suitable growth condition on $d$ relative to $n$. When $d$ grows faster than this, simulation results show that the conclusion continues to hold well. There is potential to relax the condition on $d$ in the theorem: in the current proof we made substantial relaxations from Equation (1) to (2) and from Equation (3) to (4) (see Section 6), and more careful examination could lead to tighter conditions. This requires non-trivial effort and we leave it for future work.
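For readers unfamiliar with the estimator mentioned above, the following is a minimal sketch of entry-adaptive thresholding in the spirit of Cai and Liu (2011). The tuning constant delta = 2 and the hard-threshold rule are our assumptions; implementations often choose delta by cross-validation and may enforce positive definiteness afterwards.

```python
# Sketch of entry-adaptive hard thresholding for covariance estimation,
# in the spirit of Cai and Liu (2011). delta=2 and the hard-threshold
# rule are assumptions on our part, not the paper's exact choices.
import numpy as np

def adaptive_threshold_cov(x: np.ndarray, delta: float = 2.0) -> np.ndarray:
    n, p = x.shape
    xc = x - x.mean(axis=0)
    s = xc.T @ xc / n                           # sample covariance (1/n scaling)
    # theta[i, j]: estimated variance of the (i, j) sample-covariance entry
    theta = np.einsum("ki,kj->ij", xc**2, xc**2) / n - s**2
    lam = delta * np.sqrt(np.maximum(theta, 0) * np.log(p) / n)
    sigma = np.where(np.abs(s) >= lam, s, 0.0)  # entry-wise hard threshold
    np.fill_diagonal(sigma, np.diag(s))         # keep the diagonal untouched
    return sigma
```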

Remark 2.

Theory based on nearest neighbor information in the high-dimensional setting has so far received little attention in the literature. We provide in this paper a novel proof of the asymptotic equivalence of two nearest-neighbor events, and the proposed method is among the first to utilize such nonparametric information while at the same time guaranteeing theoretical type I error control.

2.3 Algorithm

Based on Theorem 1, we can adopt the following algorithm to test the multivariate normality of the data. Specifically, because of the asymptotic equivalence between the events $A$ and $B$, we repeatedly generate data from the multivariate normal distribution with the estimated mean and covariance matrix, and use the empirical distribution of the test statistics based on $B$ to approximate the null distribution of the test statistic based on $A$.

Denote by $T$ the percentage of the $X_i$'s that find their nearest neighbors within the X sample (in the pooled X and Y samples), and let $T^{(b)}$ be defined similarly for the $Y^{(b)}_i$'s (in the pooled $Y^{(b)}$ and $Z^{(b)}$ samples). Let $\bar T$ be the average of the $T^{(b)}$'s from Step 3 of the algorithm. We then propose the following nonparametric normality test based on nearest neighbor information.

1: Generate $Y_1, \ldots, Y_n \sim N(\hat\mu, \hat\Sigma)$; calculate $T$.
2: Generate $Y^{(b)}_1, \ldots, Y^{(b)}_n \sim N(\hat\mu, \hat\Sigma)$; estimate their mean $\hat\mu^{(b)}$ and covariance matrix $\hat\Sigma^{(b)}$; generate $Z^{(b)}_1, \ldots, Z^{(b)}_n \sim N(\hat\mu^{(b)}, \hat\Sigma^{(b)})$; calculate $T^{(b)}$.
3: Repeat Step 2 for $b = 1, \ldots, B$ to obtain an estimate of the empirical distribution of $T$ under $H_0$.
4: Compute the two-sided $p$-value, i.e., the percentage of the $|T^{(b)} - \bar T|$ (out of $B$) that are larger than or equal to $|T - \bar T|$, where $|\cdot|$ denotes the absolute value.
Algorithm 1

The type I error control of Algorithm 1 is asymptotically guaranteed by the equivalence result in Theorem 1. In implementation, we use the sample mean to obtain $\hat\mu$ and $\hat\mu^{(b)}$, and the adaptive thresholding method in Cai and Liu (2011) to compute $\hat\Sigma$ and $\hat\Sigma^{(b)}$. As for the selection of $B$, the empirical distribution is estimated more precisely when $B$ is larger; our choice of $B$ in the implementation provides good error control, as shown in Section 3.
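Putting the pieces together, a minimal end-to-end sketch of Algorithm 1 in Python is given below. All names and defaults are ours; for brevity the covariance is estimated by the sample covariance, and any estimator satisfying (A2), such as the adaptive thresholding sketch in Remark 1, can be substituted.

```python
# End-to-end sketch of Algorithm 1 (our own illustration). The sample
# mean/covariance stand in for the estimators; check_valid="ignore"
# allows sampling from a rank-deficient covariance when d > n.
import numpy as np
from scipy.spatial import cKDTree

def nn_own_sample_fraction(x, y):
    pooled = np.vstack([x, y])
    _, idx = cKDTree(pooled).query(pooled[: len(x)], k=2)
    return float(np.mean(idx[:, 1] < len(x)))

def nn_normality_test(x: np.ndarray, n_rep: int = 1000, seed: int = 0) -> float:
    """Two-sided p-value of the nearest-neighbor normality test."""
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    mu, sigma = x.mean(axis=0), np.cov(x, rowvar=False)
    draw = lambda m, s: rng.multivariate_normal(m, s, size=n, check_valid="ignore")
    y = draw(mu, sigma)                              # Step 1
    t_obs = nn_own_sample_fraction(x, y)
    t_null = np.empty(n_rep)
    for b in range(n_rep):                           # Steps 2-3
        yb = draw(mu, sigma)
        zb = draw(yb.mean(axis=0), np.cov(yb, rowvar=False))
        t_null[b] = nn_own_sample_fraction(yb, zb)
    t_bar = t_null.mean()                            # Step 4
    return float(np.mean(np.abs(t_null - t_bar) >= np.abs(t_obs - t_bar)))
```

Rejecting when the returned $p$-value falls below 0.05 gives a nominal level-0.05 test, up to the asymptotic approximation in Theorem 1.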

3 Simulation Studies

We analyze in this section the numerical performance of the newly developed algorithm. As discussed in the introduction, the existing methods “Skewness”, “Kurtosis”, “Bonferroni”, “Ep” and “Royston” all suffer from serious size distortion when the dimension is relatively large. We therefore restrict the size and power comparisons in this section to our approach and the method “eFR”, in which the covariance matrices are estimated by the adaptive thresholding estimator in Cai and Liu (2011).

The following covariance matrix models are used to generate the data.

  • Model 1: .

  • Model 2: where for .

  • Model 3: where , for and . with .

Two sample sizes are considered, the larger being 150, while the dimension varies over the values 20, 100 and 300. For each model, data are generated from a multivariate distribution with mean zero and covariance matrix $\Sigma$. Under the null hypothesis the distribution is multivariate normal, while under the alternative hypothesis it is one of the following distributions.

  • Multivariate $t$-distribution with a fixed number of degrees of freedom.

  • Mixture Gaussian distribution.

We set the size of the tests to 0.05 under all settings and keep the same choice of $B$ in the algorithm. We run 1,000 replications to summarize the empirical size and power. The results are reported in Tables 2 and 3.
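For reference, the two alternative distributions can be generated for a given covariance matrix as sketched below; the degrees of freedom and the mixture weight/shift are placeholder values of our own choosing, not the paper's.

```python
# Sketch of the two alternative distributions used in the power study.
# df=5, weight=0.5 and shift=1.0 are placeholders; the paper's exact
# parameter values are not reproduced here.
import numpy as np

rng = np.random.default_rng(0)

def multivariate_t(n, sigma, df=5):
    """n draws from a multivariate t with scale matrix sigma."""
    d = sigma.shape[0]
    z = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    w = rng.chisquare(df, size=n) / df       # chi^2_df / df mixing variable
    return z / np.sqrt(w)[:, None]           # covariance is df/(df-2) * sigma

def gaussian_mixture(n, sigma, weight=0.5, shift=1.0):
    """n draws from a two-component Gaussian mixture with common sigma."""
    d = sigma.shape[0]
    signs = np.where(rng.random(n) < weight, -shift, shift)[:, None]
    return signs + rng.multivariate_normal(np.zeros(d), sigma, size=n)
```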



d              20    100   300    20    100   300

Model 1  NEW   5.1   4.1   4.4    4.6   5.2   4.1
         eFR   4.5   4.5   4.9    3.8   4.6   4.4

Model 2  NEW   5.1   4.7   5.2    4.3   5.4   6.9
         eFR   3.7   4.6   5.5    6.1   4.4   6.5

Model 3  NEW   4.6   4.1   7.2    4.1   3.9   5.1
         eFR   4.6   4.7   20.0   4.4   3.3   11.2
Table 2: Empirical sizes (in percents) of the proposed algorithm (NEW) and the extended Friedman-Rafsky test (eFR). The left and right column blocks correspond to the two sample sizes.

From Table 2, we observe that the new test can control the size reasonably well under all settings, while the extended Friedman-Rafsky test has some serious size distortion for Model 3 when the dimension is larger than the sample size.

Multivariate $t$-distribution

d              20    100   300    20    100   300

Model 1  NEW   53.3  87.4  91.7   72.0  99.0  99.0
         eFR   4.5   3.8   5.2    4.8   2.5   3.6

Model 2  NEW   23.9  73.6  83.3   41.0  86.1  93.1
         eFR   10.9  4.7   2.8    10.1  5.8   7.2

Model 3  NEW   54.3  86.4  95.2   77.1  97.0  98.8
         eFR   5.8   3.7   17.0   6.8   5.0   9.5

Mixture Gaussian distribution

d              20    100   300    20    100   300

Model 1  NEW   49.7  78.2  81.6   67.7  94.5  93.2
         eFR   6.1   3.6   5.1    7.1   3.5   3.3

Model 2  NEW   16.6  61.6  73.6   26.5  67.5  84.7
         eFR   7.8   4.2   5.9    8.8   7.1   7.7

Model 3  NEW   44.3  72.6  88.5   63.1  90.3  94.6
         eFR   6.9   3.3   18.9   7.7   3.9   11.1
Table 3: Empirical power (in percents) of the proposed algorithm (NEW) and the extended Friedman-Rafsky test (eFR) under the multivariate $t$-distribution and the mixture Gaussian distribution. The left and right column blocks correspond to the two sample sizes.

For the power comparison, we first study the challenging heavy-tailed scenario, the multivariate $t$-distribution. It can be seen from Table 3 that the new test captures the signal very well, while the extended Friedman-Rafsky test suffers from low power. We also study the scenario where the distribution is a mixture of two multivariate Gaussian distributions and observe a similar phenomenon: the new test has much higher power than the extended Friedman-Rafsky test under all settings.

In summary, across all scenarios studied above, our newly proposed algorithm provides superior performance in both empirical size and empirical power compared with the existing methods.

4 Application

Classification is an important statistical problem that has been extensively studied, both in the traditional low-dimensional setting and in the more recent high-dimensional setting. In particular, Fisher’s linear discriminant analysis has been shown to perform well and to enjoy certain optimality as the sample size tends to infinity while the dimension is fixed (Anderson, 2003), and it has also been widely studied in the high-dimensional setting, where the sample covariance matrix is no longer invertible; see, e.g., Bickel et al. (2004), Fan and Fan (2008), Cai and Liu (2011) and Mai, Zou and Yuan (2012). In all of these studies, normality of the data is a key assumption, needed both to derive the linear discriminant rule and to analyze the resulting misclassification rate.

We study in this section a lung cancer data set, which was analyzed by Gordon et al. (2002) and is available in R via data(lung). The data set contains 181 tissue samples, including 31 malignant pleural mesothelioma (MPM) and 150 adenocarcinoma (ADCA) samples, and each sample is described by 12,533 genes. This data set has been analyzed by Fan and Fan (2008) with their methods FAIR and NSC, and by Cai and Liu (2011) with their LPD rule, for distinguishing MPM from ADCA, a task that is important and challenging from both clinical and pathological perspectives. However, none of these works checked the normality of the data before applying their proposed methods, even though normality is a fundamental assumption in the formulation of linear discriminant analysis. If normality fails to hold, the misclassification rates can be affected and the reported results may no longer be valid.

In this section, we use our newly developed method to check the normality of the 150 ADCA samples in this lung cancer data set. Note that the multivariate normality assumption for the 12,533 genes of the ADCA samples should be rejected if any subset of this large number of genes deviates from normality. Thus, we randomly select a group of 200 genes and apply our new method to test the multivariate normality assumption. Applying Algorithm 1, we obtain a $p$-value equal to 0, which gives strong evidence that the samples in this data set deviate severely from the multivariate normal distribution. We further repeat this procedure 100 times; each time, we randomly select a group of 200 genes and apply Algorithm 1 to the selected genes. The $p$-values are all 0 across these 100 repetitions. It is therefore not reasonable to assume normality and directly apply the recently developed high-dimensional linear discriminant procedures to classify MPM and ADCA, as done in Fan and Fan (2008) and Cai and Liu (2011). Our procedure thus serves as an important initial step for checking the normality assumption before applying any statistical analysis method that relies on it. A sketch of this repeated subset check appears below.
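The repeated check can be sketched as follows; `adca` (a 150 × 12,533 expression matrix of the ADCA samples) and `nn_normality_test` (the sketch from Section 2.3) are our own placeholder names, not code from the paper.

```python
# Sketch of the repeated gene-subset normality check. `adca` is assumed
# to be a (150, 12533) expression matrix, and nn_normality_test is the
# sketch from Section 2.3; all names are ours.
import numpy as np

rng = np.random.default_rng(0)
pvals = []
for _ in range(100):
    genes = rng.choice(adca.shape[1], size=200, replace=False)  # 200 random genes
    pvals.append(nn_normality_test(adca[:, genes]))
print("fraction of subsets with p-value 0:", np.mean(np.array(pvals) == 0.0))
```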

5 Discussion

We proposed in this paper a nonparametric normality test based on nearest neighbor information. Theoretical results ensure the type I error control of the method, and at the same time it was shown to deliver significant power improvements over alternative approaches. We discuss in this section a related test statistic and some possible extensions of the current method.

5.1 Test statistic based on the event $A'$

Our proposed test statistic involves the event $A$, i.e., the event that an observation in the X sample finds its nearest neighbor within the X sample. A straightforward alternative would be a test statistic involving the event $A'$, i.e., the event that an observation in the Y sample finds its nearest neighbor within the Y sample, and a natural question is whether the $A'$-based statistic could be incorporated to further enhance the power. Unfortunately, the $A'$ version is not as robust as the $A$ version and does not control the type I error well. Table 4 lists the empirical sizes of the $A'$ version of the test under the same settings as in Table 2. We observe that this statistic has serious size distortions for Model 3 when the dimension is high. This also explains the poor type I error control of eFR under Model 3, because eFR partially uses the $A'$ information.


d          20    100   300    20    100   300
Model 1    4.6   3.5   5.1    4.3   3.3   4.5
Model 2    5.3   5.7   8.6    5.5   4.0   9.2
Model 3    4.4   5.1   32.2   4.0   4.5   16.2
Table 4: Empirical sizes (in percents) of the $A'$ version of the test. The left and right column blocks correspond to the two sample sizes considered in Table 2.

5.2 Extension to other distributions in the exponential family

The idea behind this normality test can be extended to other distributions in the exponential family. As long as one has reasonably good estimators of the parameters of the distribution, a procedure similar to that described in Section 2 can be applied. In particular, one can replace the multivariate normal distribution in Algorithm 1 by the distribution of interest, and replace the mean and covariance estimators by estimators of the corresponding parameters, as sketched below. The conditions for the asymptotic equivalence between the events $A$ and $B$ would require more careful investigation, and we leave this to future research.
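Concretely, all that Algorithm 1 needs from the parametric family is a fitting map and a sampling map, which can be passed in as functions. The interface below is our own illustration, not the paper's.

```python
# Generic sketch: Algorithm 1 only needs `fit` (data -> params) and
# `sample` (params, n -> data). Function names and the interface are
# our own illustration of the extension.
import numpy as np
from scipy.spatial import cKDTree

def nn_own_sample_fraction(x, y):
    pooled = np.vstack([x, y])
    _, idx = cKDTree(pooled).query(pooled[: len(x)], k=2)
    return float(np.mean(idx[:, 1] < len(x)))

def parametric_nn_test(x, fit, sample, n_rep=1000):
    n = len(x)
    t_obs = nn_own_sample_fraction(x, sample(fit(x), n))
    t_null = np.empty(n_rep)
    for b in range(n_rep):
        yb = sample(fit(x), n)    # plays the role of the Y sample
        zb = sample(fit(yb), n)   # plays the role of the Z sample
        t_null[b] = nn_own_sample_fraction(yb, zb)
    t_bar = t_null.mean()
    return float(np.mean(np.abs(t_null - t_bar) >= np.abs(t_obs - t_bar)))
```

For the Gaussian family, taking fit to return the sample mean and covariance and sample to be the multivariate normal sampler recovers Algorithm 1.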

6 Proof of Theorem 1

Let and be respectively the eigen-decomposition of and . Define and . Then under the conditions of Theorem 1, by Lemma 1, we have .

Let be the density of , and be the density of . Then we have,

By the construction of and , we have

Hence,

By a change of measure, we have

It is not hard to see that if we shift all of the observations by a fixed value, the probability of the event is unchanged. Hence,

Let . Then,

(1)
(2)

Let and be the estimated mean and variance based on , and and be the estimated mean and variance based on . Let and be the density function of and , respectively. Let , , be the observation in that is closest to , be the observation in that is closest to , , and . Then,

By a change of measure, we have that

Let , then

(3)
(4)

Let . Define . By Lemma 2 (see Section 7), we have that . Also, given that , it is easy to have estimates such that . Then we have

(5)
(6)
(7)

Denote by . Recall that . Note that, the covariance matrix of is and the covariance matrix of is . Then using the same estimation method of the covariance matrix as estimating by , we can estimate by an estimator and estimate by , such that

Note that , we have that . Then by the proofs of Lemma 1 and the conditions of Theorem 1, we have that .

Thus we have

By similar arguments, we have

(8)
(9)
(10)
(11)

Thus we have that .

Let

and , .

Suppose . Notice that . When , based on Lemma 3, the probability that for some constant is of order .

We thus focus on . By definitions of and , and the facts that , and , we have that . Let be the probability that falls in the -ball of , and be the probability that falls in the -ball of .

Let . We consider two scenarios: (1) , and (2) .

  1. :

    1. When , we have . Since and satisfy the condition for Lemma 4, we have

      where .

      Notice that and also satisfy the condition for Lemma 4, so

      where .

    2. When and , we have , by Lemma 4, is

      Here, with a positive constant. We have is . So . Similarly, .

    3. When is of order or higher, with , then , and is also of order . Similarly, .

    Under (a), (b) and (c), we always have . Then,

  2. :

    First we consider . By the proof of Lemma 4 and the facts that , , , and