Testing the homogeneity of variances arises in many scientific applications. It is increasingly used to assess uniformity in quality control, biology, agricultural production systems, and even the development of educational methods (see Boos and Brownie, 2004). It is also a prelude to procedures for testing the equality of population means, such as the analysis of variance (ANOVA) (see Scheffe, 1959), dose-response modeling, and discriminant analysis. The literature on testing equality of variances is vast, and we refer readers to the comprehensive review of Conover et al. (1981).
More recently, procedures for testing equality of variances that are robust to non-normality have been categorized into three major approaches: (1) kurtosis adjustment of normal-theory tests (Box and Anderson, 1955; Shoemaker, 2003), (2) analysis of variance (ANOVA) on scale variables such as the absolute deviations from the median or mean (Levene, 1960; Brown and Forsythe, 1974), and (3) resampling methods to obtain p-values for a given test statistic (Box and Anderson, 1955; Boos and Brownie, 1989). These methods are summarized in Boos and Brownie (2004).
Our goal is to propose a variance-based procedure to test the homogeneity of variances for a wide variety of distributions. A further objective is to assess whether resampling methods improve Type I and Type II error rates. An important attribute of our proposed method is its ability to better control the Type I and Type II error rates for small sample sizes (both equal and unequal). Our test uses a box-type acceptance region rather than a p-value, which distinguishes it from other resampling methods. It is based solely on a variance-based statistic, without applying any transformation to the observed data such as smoothing, fractional trimming, or replacing original observations by scale or residual estimates. The variance-based procedure is also shown to be more sensitive to deviations from the null conditions.
Just like Boos and Brownie (1989), we prefer variance-based procedures as they are more appealing to practitioners, easier to interpret, and variances are of interest in many areas. We also hope that with constantly improving state-of-the-art computing machinery, this research will encourage the use of resampling-based tests for equality of variances by practitioners, and the integration of these procedures into major statistical software packages. The descriptions of the bootstrap and non-bootstrap tests for equality of variances are given in Section 2. Section 3 shows the small-to-moderate sample size performance of the tests. We close the article with a summary and an outline of possible future extensions.
2 Description Of Tests To Be Compared
Given $k$ samples from populations with equal kurtosis, the $i$th sample having size $n_i$, $i = 1, \ldots, k$, and $N = \sum_{i=1}^{k} n_i$, consider a test of the hypothesis
$$H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$$
against the alternative hypothesis that at least two of the variances are unequal. Let $S_i^2$ denote the sample variance based on the $n_i$ observations from the $i$th sample. We now describe the tests that will be compared.
2.1 Levene’s Test
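Levene (1960) first proposed ANOVA on the scale variables $|X_{ij} - \bar{X}_i|$, where $X_{ij}$ is the $j$th observation and $\bar{X}_i$ is the mean of the $i$th sample, but Miller (1968) showed that the procedure is asymptotically correct for asymmetric populations if the median is used instead of the mean. Brown and Forsythe (1974) formally studied Levene’s method with the median, rather than the mean, used to center the variables. Boos and Brownie (1989) and Lim and Loh (1996) provide more details on the features of Levene’s test.
We consider Levene’s test because it is widely used in practice, even though it is neither a variance-based nor a resampling-based procedure. It is also recommended by Conover et al. (1981). Levene’s procedure is a test for equality of means applied to the scale quantities $Z_{ij} = |X_{ij} - M_i|$, where $M_i$ is the median of the $i$th sample. The test statistic is the one-way ANOVA $F$ statistic computed on the $Z_{ij}$,
$$F = \frac{\sum_{i=1}^{k} n_i (\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot})^2 / (k-1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_{i\cdot})^2 / (N-k)},$$
where $\bar{Z}_{i\cdot}$ is the mean of the $Z_{ij}$ in the $i$th sample, $\bar{Z}_{\cdot\cdot}$ is their overall mean, $k$ is the number of samples, and $N$ is the total sample size. We reject the null hypothesis if $F$ exceeds the $(1-\alpha)$th quantile of the $F$-distribution with $k-1$ and $N-k$ degrees of freedom.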
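As an illustration of the median-centered procedure described above, the following is a minimal sketch that computes the ANOVA F statistic on absolute deviations from the group medians. The function name `brown_forsythe_statistic` and the toy data are assumptions made for this example; the comparison with the F quantile is left out because the Python standard library has no F-distribution.

```python
from statistics import mean, median


def brown_forsythe_statistic(samples):
    """ANOVA F statistic computed on absolute deviations from each group median."""
    # Scale quantities: |observation - group median| for every group.
    z = [[abs(x - median(group)) for x in group] for group in samples]
    k = len(z)
    n_total = sum(len(group) for group in z)
    group_means = [mean(group) for group in z]
    grand_mean = sum(sum(group) for group in z) / n_total
    # Between-group mean square of the scale quantities.
    between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(z, group_means)
    ) / (k - 1)
    # Within-group mean square of the scale quantities.
    within = sum(
        (value - m) ** 2 for group, m in zip(z, group_means) for value in group
    ) / (n_total - k)
    return between / within


# Toy data (hypothetical): the second group has twice the spread of the first.
print(brown_forsythe_statistic([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]]))  # about 2.06
```

The value would then be compared with the appropriate F quantile, here with 1 and 8 degrees of freedom.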
2.2 Shoemaker’s Test
We also consider Shoemaker’s test, which not only provides good insight into our procedure but was also the test recommended by Shoemaker (2003) after comparing its performance with some kurtosis-adjusted normal-theory tests. The test statistic is
where , ,
is the harmonic mean,
is the estimator of the fourth moment about the population mean, and . He recommended the estimator of an asymptotically equivalent formula which is to improve simulation accuracy. The null hypothesis is rejected when the test statistic exceeds the $(1-\alpha)$th percentile of the chi-square distribution with $k-1$ degrees of freedom, where $k$ is the number of samples.
2.3 Lim and Loh’s Test
Lim and Loh (1996) compared several bootstrap and non-bootstrap tests for heterogeneity of variances. A bootstrap version of Levene’s test was recommended because of its superiority in terms of power and Type I error robustness. The procedure uses the technique of Boos and Brownie (1989) and is given below.
1.) Compute the test statistic from the given data .
2.) Initialize .
3.) Compute the residuals where is the median of group .
4.) Draw data points ’s from the pooled residuals .
5.) If the sample size of the th group is less than 10 then smooth the bootstrap observations by setting , where and. Otherwise, set .
6.) Compute the test statistic value based on the bootstrapped samples . If then .
7.) Repeat steps 4, 5, 6, times.
8.) The bootstrap p-value is .
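The steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the smoothing in step 5 is omitted, the comparison in step 6 is assumed to be "greater than or equal to", and the names `bf_statistic`, `bootstrap_levene_pvalue`, `n_boot`, and `seed` are hypothetical.

```python
import random
from statistics import mean, median


def bf_statistic(samples):
    """ANOVA F statistic on absolute deviations from each group median."""
    z = [[abs(x - median(g)) for x in g] for g in samples]
    k, n_total = len(z), sum(len(g) for g in z)
    means = [mean(g) for g in z]
    grand = sum(sum(g) for g in z) / n_total
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(z, means)) / (k - 1)
    within = sum((v - m) ** 2 for g, m in zip(z, means) for v in g) / (n_total - k)
    if within == 0:
        # Degenerate resample with no within-group spread.
        return float("inf") if between > 0 else 0.0
    return between / within


def bootstrap_levene_pvalue(samples, n_boot=999, seed=0):
    """Bootstrap p-value from resampling the pooled median-centered residuals."""
    rng = random.Random(seed)
    observed = bf_statistic(samples)
    # Residuals from the group medians, pooled across all groups.
    pooled = [x - median(g) for g in samples for x in g]
    exceed = 0
    for _ in range(n_boot):
        # Each bootstrap group keeps its original size; no smoothing is applied.
        boot = [[rng.choice(pooled) for _ in g] for g in samples]
        if bf_statistic(boot) >= observed:  # assumed form of the comparison
            exceed += 1
    return exceed / n_boot
```

Pooling the residuals imposes the null hypothesis of a common scale on the bootstrap samples, which is why the bootstrap statistics can serve as a reference distribution for the observed one.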
Note that the bootstrap version of Bartlett’s test is another alternative, especially when the populations do not have large kurtosis. However, we exclude it from the comparison because it is not the procedure recommended by the previous studies of Lim and Loh (1996) and Boos and Brownie (2004). More importantly, it is not “robust” (unlike the proposed test) for highly leptokurtic distributions, which can be problematic in practice for small and/or unequal sample sizes. On the other hand, Boos and Brownie (1989, 2004) recommended the bootstrap version of Bartlett’s test for comparing a larger number of populations. A comparison of our proposed method with the bootstrap version of Bartlett’s test, especially for a large number of groups, would be an interesting extension of our study.
2.4 Alam and Cahoy’s Test
We now give a brief background on how we derive our variance-based test statistic. Alam and Cahoy (1999) proposed the following test for equality of variances for normal populations: Let be normal populations, and let be the mean and variance of . Let where Under , it follows that
is jointly distributed according to the Dirichlet distribution (see Balakrishnan and Nevzorov, 2003), and is given by
where and . Let
The null hypothesis can then be expressed as , versus for at least one value of , under the alternative hypothesis . We construct a box-type confidence region for as follows: Let and be given by
where Let and It then follows that . Moreover, let and be the mean and variance of . A -level confidence region for is given by
where is chosen such that
The test of the hypothesis (of level ) is derived from (1) with the acceptance region given by
The value of is calculated numerically from (2) using the distribution of which is
where and . This test statistic is highly sensitive to individual deviations in the values of the ’s from the origin under the alternative hypothesis . Alam and Cahoy (1999) give the moments of , its asymptotic properties, and the critical value for normal populations.
We now construct the bootstrap version of the above test. For any distribution and with a slight modification, a generalized box-type confidence region is now given by
where , and . Using the variance stabilizing transformation of the sample variance, the mean and variance of can be approximated by , and for any distribution. But just like Shoemaker (2003), we use the asymptotically equivalent formula except that we don’t use the harmonic mean for the sample size . When the null hypothesis is true, the box-type acceptance region of our test for any distribution can be approximated by
where is the test statistic, is the critical value that needs to be found such that has the coverage and Consequently, the bootstrap version of the box-type acceptance region is then given by
where , and . The test thus reduces to finding the critical value . Note that the availability of the standard error estimate, without the need for a second layer of bootstrapping, makes the calculations faster. We emphasize that a viable alternative is to use the pivotal quantity. However, it often gives more conservative estimated sizes and lower power than our procedure, although it is still better than the other tests in controlling both the Type I and Type II errors. Below is the algorithm for a given :
1.) Calculate the test statistic value from the observed data .
2.) Draw data points ’s with replacement from each sample .
3.) Compute the bootstrap test statistic .
4.) Repeat steps 2 and 3, times.
5.) Center ’s by subtracting the th bootstrap mean, i.e., let .
6.) Sort all the centered ’s ( of them) in descending order as .
7.) For , if then stop, and the critical value is given by .
The test rejects if for at least one . Notice also that we choose the box-type confidence or acceptance region centered at the origin, where the boundaries are parallel to the axes and have equal lengths. We are still studying how to efficiently calculate these critical values for rectangular prisms having unequal lengths or for likelihood-based regions as in Hall (1987).
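To make the critical-value search concrete, here is a minimal sketch of a bootstrap test with a box-type acceptance region centered at the origin with equal side lengths. It is not the authors' statistic: `log_variance_contrasts` (log sample variance of each group minus the average log variance, without standardization) is a hypothetical stand-in, and the stopping rule is assumed to select the smallest half-width whose box covers at least a fraction $1-\alpha$ of the centered bootstrap vectors. All function and parameter names are illustrative.

```python
import math
import random
from statistics import mean, variance


def log_variance_contrasts(samples):
    """Hypothetical statistic: log sample variances minus their average."""
    logs = [math.log(variance(g)) for g in samples]
    centre = mean(logs)
    return [v - centre for v in logs]


def resample(group, rng):
    """Draw with replacement; redraw degenerate resamples so the log is defined."""
    while True:
        draw = [rng.choice(group) for _ in group]
        if max(draw) > min(draw):
            return draw


def box_critical_value(centered_vectors, alpha=0.05):
    """Smallest c such that at least (1 - alpha) of the vectors lie in [-c, c]^k."""
    radii = sorted(max(abs(t) for t in vec) for vec in centered_vectors)
    # Small tolerance guards against floating-point noise in (1 - alpha) * B.
    index = math.ceil((1 - alpha) * len(radii) - 1e-9) - 1
    return radii[max(index, 0)]


def bootstrap_box_test(samples, n_boot=999, alpha=0.05, seed=0):
    """Return (reject, critical_value) for the box-type acceptance region."""
    rng = random.Random(seed)
    observed = log_variance_contrasts(samples)
    boot = [
        log_variance_contrasts([resample(g, rng) for g in samples])
        for _ in range(n_boot)
    ]
    k = len(samples)
    # Center each coordinate by its bootstrap mean.
    col_means = [mean(vec[i] for vec in boot) for i in range(k)]
    centered = [[vec[i] - col_means[i] for i in range(k)] for vec in boot]
    c = box_critical_value(centered, alpha)
    # Reject if any coordinate of the observed statistic falls outside the box.
    return any(abs(t) > c for t in observed), c
```

Because the box has equal side lengths, its coverage depends only on the largest absolute coordinate of each centered bootstrap vector, so the search reduces to an order statistic of those maxima.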
3 Empirical Results
In our simulation study, we compared the Type I error robustness of the tests using 36 sample size-distribution combinations. The power was examined using 5 and 6 variance configurations for the equal- and unequal-sample cases, respectively. In addition, we considered 6 small-to-moderate sample size configurations. Six distributions with kurtosis ranging from 1.8 to 9 were selected: (i) uniform (kurtosis 1.8), (ii) Gaussian (kurtosis 3), (iii) extreme value (kurtosis 5.4), (iv) Laplace (kurtosis 6), (v) Student’s $t$ with 5 degrees of freedom (kurtosis 9), and (vi) exponential (kurtosis 9). This array of distributions was considered by Boos and Brownie (1989) and Lim and Loh (1996) to be representative of the data encountered in practice. The extreme value has the probability density function (see Coles, 2001). All variances under the null were chosen to be one. The estimated power and significance levels of the tests were compared for two-sample, three-sample, and four-sample cases. The simulations used the “Mersenne-Twister” random number generator, which is a twisted generalized feedback shift register (GFSR) with period $2^{19937} - 1$ and is equidistributed in 623 consecutive dimensions over the whole period (see Matsumoto and Nishimura, 1998).
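As a rough illustration of the null setup, the sketch below draws unit-variance observations from the six families using Python's `random` module, which is itself based on the Mersenne Twister. The function name `unit_variance_draw`, the family labels, and the scalings are assumptions for this example; in particular, the extreme value family is taken here to be the Gumbel distribution for maxima, which may differ from the parameterization used in the study.

```python
import math
import random


def unit_variance_draw(name, rng):
    """Draw one observation with variance one from the named family."""
    if name == "uniform":
        return rng.uniform(-math.sqrt(3), math.sqrt(3))
    if name == "gaussian":
        return rng.gauss(0.0, 1.0)
    if name == "extreme_value":
        # Gumbel (assumed form): -log(E) with E ~ Exp(1); scale sqrt(6)/pi.
        e = rng.expovariate(1.0)
        while e == 0.0:
            e = rng.expovariate(1.0)
        return -math.log(e) * math.sqrt(6) / math.pi
    if name == "laplace":
        # Difference of two Exp(1) variables has variance 2; rescale to 1.
        return (rng.expovariate(1.0) - rng.expovariate(1.0)) / math.sqrt(2)
    if name == "t5":
        # Normal over sqrt(chi-square_5 / 5) has variance 5/3; rescale to 1.
        chi2 = rng.gammavariate(2.5, 2.0)
        return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / 5) * math.sqrt(3 / 5)
    if name == "exponential":
        return rng.expovariate(1.0)
    raise ValueError(f"unknown family: {name}")


rng = random.Random(2024)
sample = [unit_variance_draw("laplace", rng) for _ in range(10)]
```

Multiplying a draw by the square root of a target variance would give the unequal-variance configurations used for the power comparisons.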
Following Boos and Brownie (1989), we performed 1000 Monte Carlo simulations using bootstrap samples for each run. We adopted Conover et al.’s (1981) criterion to assess Type I error robustness: a test is “robust” if the maximum estimated significance level over all the sample size-distribution null combinations (equal variances) is less than twice the nominal level. We used the nominal level $\alpha = 0.05$ and marked estimated levels exceeding 0.10 with an asterisk.
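The robustness criterion is simple enough to state in code. The sketch below assumes a nominal level of 0.05, consistent with the 0.10 threshold above, and applies the rule to the maximum estimated sizes quoted for the two-sample case in Section 3.1; the function name `is_robust` and the dictionary labels are illustrative.

```python
def is_robust(estimated_sizes, nominal_level=0.05):
    """True if the largest estimated size is below twice the nominal level."""
    return max(estimated_sizes) < 2 * nominal_level


# Maximum estimated sizes reported for the two-sample case (Section 3.1).
max_sizes = {
    "Shoemaker": 0.13,
    "proposed": 0.08,
    "bootstrap Levene": 0.06,
    "Levene": 0.04,
}
for name, size in max_sizes.items():
    print(name, "robust" if is_robust([size]) else "not robust")
```

Only the maximum over all null settings matters under this criterion, so a single inflated size estimate is enough to disqualify a test.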
3.1 Two-Sample Case
In this case, the null conditions included the 6 sample size combinations , , and . We also considered the variance ratios , , and . Table 1 shows the estimated sizes of the tests. It clearly indicates that all the tests except Shoemaker’s (2003) test are robust according to Conover et al.’s (1981) criterion. Shoemaker’s (2003) test has a large maximum estimated size of 0.13, which corresponds to the sample size combination under the exponential distribution. In addition, the test seemed to be sensitive to the sample size configurations, as shown by the inflated Type I error rates for unequal sample sizes. Meanwhile, our test has a maximum estimated size of 0.08, while Levene’s (1960) and Lim and Loh’s (1996) have 0.04 and 0.06, respectively. This confirms the previous observations of Conover et al. (1981), Boos and Brownie (1989), and Lim and Loh (1996) about the extreme conservativeness of Levene’s test . These results also imply that our test controls the Type I error better than Levene’s test and is less conservative than the bootstrap Levene’s test . This observation is even more noticeable for unequal sample sizes.
Tables 2 and 3 show the simulated power of the tests. From here on, we exclude Shoemaker’s (2003) test as it was not “robust” under Conover et al.’s (1981) criterion over the 36 prescribed null settings. The variance ratios under the alternative hypothesis were chosen to keep the power from reaching unity across all the distributions for moderate sample sizes. For equal sample sizes, the alternative hypothesis has the variance configuration . It is apparent that our test has the highest power averaged over all the distributions. With an average power of for sample size configuration , it has more than three times the power of Levene’s (1960) test , which is , but is only slightly greater than that of Lim and Loh’s (1996) . Furthermore, the superiority of our test becomes more noticeable when the sample size reaches and 10. However, the power difference becomes negligible when the sample size exceeds . Moreover, the power of both Levene’s test and its bootstrap version tends to approach unity faster as the sample size increases under the exponential distribution.
As noticed by Loh (1987) and Lim and Loh (1996), the power of Lim and Loh’s (1996) test and Levene’s (1960) is low when the large are associated with large , and is high when large is associated with small . This led us to average the power of the tests over the variance configurations and for unequal sample sizes; the averages are shown in Table 3. From the table, it is clear that our test still dominates the other procedures across all the distributions. More specifically, the averaged power of our test can be at least higher than that of Levene’s test , but it is only slightly more powerful than the bootstrap Levene’s test .
Overall, our procedure stood out as the most powerful and the least conservative test among the “robust” procedures for the two-sample case. Our results also confirmed that bootstrapping Levene’s test corrected the conservativeness of its Type I error rate and improved its power.
3.2 Three-Sample Case
Table 4 gives the estimated levels of the three tests for the 6 sample size combinations , , and . The table suggests that all three tests are robust according to Conover et al.’s (1981) criterion. The test has a maximum size of 0.075, while Levene’s (1960) and Lim and Loh’s (1996) have maxima of 0.05 and 0.06, respectively. The table also indicates that our procedure seems to be more conservative than the bootstrap Levene’s test for distributions with smaller kurtosis (e.g., uniform distribution) and with relatively small sample sizes (e.g., ). With unequal sample sizes, our procedure is the least conservative procedure except for the sample size combination under the uniform distribution.
Table 5 displays the simulated power of the tests for equal sample sizes. The alternative hypothesis has the variance configurations and the relatively small ratio . From the same table, it is easily seen that our test still has the highest average power over this array of distributions, especially with relatively small sample sizes. With an average power of for sample size configuration and variance ratio , it is more powerful than Levene’s (1960) test , which is , and is higher than the recorded for Lim and Loh’s (1996) . When the sample size is between 7 and 15 (inclusive) and with variance configuration , our test is at least more powerful than Levene’s test and its bootstrap version . Similarly, our procedure is more powerful under the variance ratio and sample size . Again, the difference in average power (over all the 6 distributions) becomes negligible when the sample sizes exceed for the variance configuration . This strongly suggests that the test is more sensitive to relatively small departures from homogeneity of the variances than the Levene and the bootstrap Levene tests.
Table 6 demonstrates the performance of the tests when sample sizes are unequal and when the alternative has the small variance ratios , and . As in the two-sample case, we averaged the power over the two variance ratios. It is clear that the procedure is the most powerful, as indicated by the average of the averaged (over the two small variance ratios) estimated power. Averaging over all three unequal sample size configurations, the test has average power. This illustrates that the test is more powerful than the Levene test’s , and is more powerful than the bootstrap version’s . Overall, our procedure still has the least conservative test size estimates and is more powerful in detecting slight departures from the null settings.
3.3 Four-Sample Case
The estimated levels of the three tests are shown in Table 7 for the 6 sample size configurations , and . The table suggests that all three tests are still robust according to Conover et al.’s (1981) criterion for four populations. The test has a maximum size of 0.07, while the bootstrap Levene’s test and Levene’s test have maximum sizes of 0.07 and 0.04, respectively. It also shows that our procedure seems to be more conservative than the bootstrap Levene’s test under the uniform distribution across all the 5 sample size configurations (except ) or when the sample size is as small as 7. For the most part, our test still has the least conservative Type I error estimates among the three procedures for the four-sample case.
Tables 8 and 9 give the estimated power of the tests when the sample sizes are equal and unequal, respectively. The power of the tests is computed using four variance and six sample size combinations. When the sample sizes are equal, we considered the variance ratios , and for the alternative. Table 8 shows that the test is at least more powerful than the other tests when under the two variance ratios. When the alternative assumes the variance ratio and sample size , our test appeared to be more powerful for populations with high kurtosis (e.g., exponential). In addition, a direct comparison of our results with those of Lim and Loh (1996) corresponding to the variance configuration and sample sizes indicates that the proposed test is more powerful than the bootstrap Bartlett’s test (except under the exponential distribution). These results further support the superiority of our test when the sample sizes are equal.