1 Introduction
All too often, researchers will conclude that the effect of an explanatory variable, $X$, on an outcome variable, $Y$, is absent when a null-hypothesis significance test (NHST) yields a nonsignificant $p$-value (e.g., when the $p$-value $> 0.05$). Unfortunately, such an argument is logically flawed. As the saying goes, “absence of evidence is not evidence of absence” [19, 3]. Indeed, a nonsignificant result can simply be due to insufficient power, and while a null-hypothesis significance test can provide evidence to reject the null hypothesis, it cannot provide evidence in favour of the null [37]. To properly conclude that an association between $X$ and $Y$ is absent (i.e., to confirm the lack of an association), the equivalence test, the recommended frequentist tool, is well-suited [43]. Equivalence testing is commonly known as noninferiority testing for one-sided hypotheses and is often used in the analysis of clinical trials [38].
Let $\theta$ be the parameter of interest representing the true association between $X$ and $Y$ in the population of interest. The equivalence/noninferiority test reverses the question that is asked in a NHST. Instead of asking whether we can reject the null hypothesis, e.g., $H_0: \theta = 0$, an equivalence test examines whether the magnitude of $\theta$ is at all meaningful: can we reject an association between $X$ and $Y$ as large or larger than our smallest effect size of interest, $\Delta$? The null hypothesis for an equivalence test is therefore defined as $H_0: |\theta| \geq \Delta$. For the one-sided noninferiority test, the null hypothesis is $H_0: \theta \geq \Delta$. Note that researchers must decide which effect size is considered meaningful or relevant [27], and define $\Delta$ accordingly, prior to observing any data; see Campbell and Gustafson (2018) [8] for details.
In a standard multivariable linear regression model, or a standard ANOVA analysis, the variability of the outcome variable, $Y$, is attributed to multiple different explanatory variables, $X_1, \dots, X_K$. Researchers will typically report the linear regression model’s $R^2$ statistic, or the $\eta^2$ statistic in the ANOVA context, to estimate the proportion of variance in the observed data that is explained by the model. To determine whether or not the $R^2$ statistic (or the $\eta^2$ statistic) is significantly larger than zero, one typically calculates an $F$ statistic and tests whether the “null model” (i.e., the intercept-only model) can be rejected in favour of the “full model” (i.e., the model with all explanatory variables included). However, in this multivariable setting, while rejecting the “null model” is rather simple, concluding in favour of the “null model” is less obvious.
If the explanatory variables are not statistically significant, can we simply disregard the full model? We certainly shouldn’t pick and choose which variables to include in the model based on their significance (it is well known that, due to model selection bias, most stepwise variable selection schemes are to be avoided; see Hurvich and Tsai (1990) [21]). How can we formally test whether the proportion of variance attributable to the full set of explanatory variables is too small to be considered meaningful? In this article, we introduce a noninferiority test to reject effect sizes that are as large or larger than the smallest effect size of interest, as estimated by either the $R^2$ statistic or the $\eta^2$ statistic.
In Section 2, we introduce a noninferiority test for the coefficient of determination parameter in a linear regression context. We show how to define hypotheses and calculate a valid $p$-value for this test based on the $R^2$ statistic. We then briefly consider how this frequentist test compares to a Bayesian testing scheme based on Bayes Factors, and conduct a small simulation study to better understand the test’s operating characteristics. In Section 3, we illustrate the use of this test with data from a recent study about the absence of the Hawthorne effect. In Section 4, we present the analogous noninferiority test for the $\eta^2$ parameter in an ANOVA. We also provide a modified version of this test that allows for the possibility that the variance across groups is unequal.
2 A noninferiority test for the coefficient of determination parameter
The coefficient of determination, commonly known as $R^2$, is a sample statistic used in almost all fields of research. Yet its corresponding population parameter, which we will denote as $P^2$, as in Cramer (1987) [12], is rarely discussed. When considered, it is sometimes known as the “parent multiple correlation coefficient” [6] or the “population proportion of variance accounted for” [24]. See Cramer (1987) [12] for a technical discussion.
While confidence intervals for $P^2$ have been studied by many researchers (e.g., [33], [32], [9], [15]), there has been no consideration (as far as we know) of a noninferiority test for $P^2$. In this section we will derive such a test and investigate how it compares to a popular Bayesian alternative [40]. Before we continue, let us define some notation. All technical details are presented in the Appendix. Let:
$N$ be the number of observations in the observed data;

$K$ be the number of explanatory variables in the linear regression model;

$x_{ki}$ be the observed value of fixed covariate $X_k$ for the $i$th subject, for $k$ in $1,\dots,K$; and

$X$ be the $N$ by $K+1$ covariate matrix (with a column of 1s for the intercept; we use the notation $X_{i,\cdot}$ to refer to all $K+1$ values corresponding to the $i$th subject).
We operate under the standard linear regression assumption that observations in the data are independent and normally distributed with:
(1) $Y_i \sim \mathrm{Normal}(X_{i,\cdot}^{T}\beta,\ \sigma^2), \quad i = 1,\dots,N,$
where $\beta$ is a parameter vector of regression coefficients, and $\sigma^2$ is the population variance. The parameter $P^2$ represents the proportion of total variance in the population that can be accounted for by knowing the covariates, i.e., by knowing $X$. As such, $P^2$ is entirely dependent on the particular design matrix $X$, and we have that:
(2) $P^2 = \frac{\sigma_{XY}^{T}\,\Sigma_{X}^{-1}\,\sigma_{XY}}{\sigma_Y^2},$
where $\sigma_Y^2$ is the unconditional variance of $Y$; $\sigma_{XY}$ is the vector of population covariances between the $K$ different $X$ variables and $Y$; and $\Sigma_X$ is the population covariance matrix of the $K$ different $X$ variables. The $R^2$ statistic estimates the parameter $P^2$ from the observed data. See Kelley (2007) [24] for a complete derivation of equation (2).
A standard NHST asks whether we can reject the null hypothesis that $P^2$ is equal to zero ($H_0: P^2 = 0$). The $p$-value for this NHST is calculated as:
(3) $p\text{-value} = 1 - p_f(F;\ K,\ N-K-1,\ 0),$
where $p_f(\cdot\,;\ df_1, df_2, ncp)$ is the cdf of the noncentral $F$ distribution with $df_1$ and $df_2$ degrees of freedom, and noncentrality parameter $ncp$ (note that $ncp = 0$ corresponds to the central $F$ distribution); and where:
(4) $F = \frac{R^2/K}{(1-R^2)/(N-K-1)}.$
One can calculate the above $p$-value in R with the following code: pval = pf(Fstat, df1=K, df2=N-K-1, lower.tail=FALSE).
A noninferiority test for $P^2$ asks a different question: can we reject the hypothesis that the total proportion of variance in $Y$ attributable to $X$ is greater than or equal to $\Delta$? Formally, the hypotheses for the noninferiority test are:
$H_0: P^2 \geq \Delta$,
$H_1: P^2 < \Delta$.
The $p$-value for this noninferiority test is obtained by inverting the one-sided CI for $P^2$ (see Appendix for details), and can be calculated as:
(5) $p\text{-value} = p_f\left(F;\ K,\ N-K-1,\ \frac{N\Delta}{1-\Delta}\right).$
Note that one can calculate the above $p$-value in R with the following code: pval = pf(Fstat, df1=K, df2=N-K-1, ncp=(N*Delta)/(1-Delta), lower.tail=TRUE).
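As a minimal sketch of how equations (3) to (5) fit together, the two $p$-values can be wrapped in a single R function that takes a reported $R^2$ rather than a fitted model. The function name r2_pvalues and the input values below are hypothetical and chosen only for illustration; this is not the authors' code.

```r
# Sketch: NHST and noninferiority p-values for P^2 from a reported R^2.
# The function name and the example inputs are illustrative assumptions.
r2_pvalues <- function(R2, N, K, Delta) {
  stopifnot(R2 >= 0, R2 < 1, Delta > 0, Delta < 1, N > K + 1)
  # Equation (4): F statistic implied by R^2
  Fstat <- (R2 / K) / ((1 - R2) / (N - K - 1))
  # Equation (3): upper tail of the central F distribution
  p_nhst <- pf(Fstat, df1 = K, df2 = N - K - 1, lower.tail = FALSE)
  # Equation (5): lower tail of the noncentral F distribution at the bound Delta
  p_noninf <- pf(Fstat, df1 = K, df2 = N - K - 1,
                 ncp = (N * Delta) / (1 - Delta), lower.tail = TRUE)
  c(F = Fstat, p_nhst = p_nhst, p_noninf = p_noninf)
}

# Hypothetical inputs: R^2 = 0.01 from N = 500 observations and K = 3 covariates
res <- r2_pvalues(R2 = 0.01, N = 500, K = 3, Delta = 0.10)
print(res)
```

Because a larger $\Delta$ implies a larger noncentrality parameter, the noninferiority $p$-value returned by this sketch decreases as $\Delta$ increases, for a fixed $R^2$, $N$, and $K$.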
It is important to remember that the above tests make two important assumptions about the data:

The data are independent and normally distributed as described in equation (1).

The values for $X$ in the observed data are fixed, and their distribution in the sample is equal to (or representative of) their distribution in the population of interest. The sampling distribution of $R^2$ can be quite different when regressor variables are random; see Gatsonis and Sampson (1989) [17].
In practice, one might first conduct a NHST (i.e., calculate a $p$-value, $p_1$, using equation (3)) and only proceed to conduct the noninferiority test (i.e., calculate a $p$-value, $p_2$, using equation (5)) if the NHST fails to reject the null. If the first $p$-value, $p_1$, is less than the Type 1 error threshold, $\alpha$ (e.g., if $p_1 < 0.05$), one may conclude with a “positive” finding: $P^2$ is significantly greater than 0. On the other hand, if the first $p$-value, $p_1$, is greater than $\alpha$ and the second $p$-value, $p_2$, is smaller than $\alpha$ (e.g., if $p_1 \geq 0.05$ and $p_2 < 0.05$), one may conclude with a “negative” finding: there is evidence of statistically significant noninferiority, i.e., $P^2$ is at most negligible. If both $p$-values are large, the result is inconclusive: there are insufficient data to support either finding. This two-stage sequential testing scheme is formally known as conditional equivalence testing (CET); see Campbell and Gustafson (2018) [7] for more details.
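The two-stage decision rule described above can be sketched as a small R function. The name cet_r2, the default alpha = 0.05, and the three sets of inputs are assumptions made for illustration only; the labels follow the “positive”, “negative”, and inconclusive findings described in the text.

```r
# Sketch of conditional equivalence testing (CET) for P^2.
# cet_r2 and all example inputs are hypothetical, not taken from the paper.
cet_r2 <- function(R2, N, K, Delta, alpha = 0.05) {
  Fstat <- (R2 / K) / ((1 - R2) / (N - K - 1))            # equation (4)
  p1 <- pf(Fstat, df1 = K, df2 = N - K - 1,
           lower.tail = FALSE)                             # equation (3)
  p2 <- pf(Fstat, df1 = K, df2 = N - K - 1,
           ncp = (N * Delta) / (1 - Delta),
           lower.tail = TRUE)                              # equation (5)
  conclusion <- if (p1 < alpha) {
    "positive"       # NHST rejects H0: P^2 = 0
  } else if (p2 < alpha) {
    "negative"       # noninferiority test rejects H0: P^2 >= Delta
  } else {
    "inconclusive"   # neither test rejects
  }
  list(p1 = p1, p2 = p2, conclusion = conclusion)
}

# Three hypothetical scenarios
large_effect <- cet_r2(R2 = 0.20,  N = 100,  K = 2, Delta = 0.10)
tiny_effect  <- cet_r2(R2 = 0.001, N = 2000, K = 2, Delta = 0.05)
small_sample <- cet_r2(R2 = 0.05,  N = 40,   K = 2, Delta = 0.10)
print(c(large_effect$conclusion, tiny_effect$conclusion, small_sample$conclusion))
```

The noninferiority test is only consulted when the NHST fails to reject, which mirrors the sequential order described above.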
2.1 Comparison to a Bayesian alternative
For linear regression models, based on the work of Liang et al. (2012) [29], Rouder and Morey (2012) [40] propose using Bayes Factors (BFs) to determine whether the data, as summarized by the $R^2$ statistic, support the null or the alternative model. This is a common approach used in psychology studies (e.g., see most recently Hattenschwiler (2019) [20]). Here we refer to the null model (“Model 0”) and alternative (full) model (“Model 1”) as:
(6) Model 0: $Y_i \sim \mathrm{Normal}(\beta_0,\ \sigma^2), \quad i = 1,\dots,N,$
(7) Model 1: $Y_i \sim \mathrm{Normal}(X_{i,\cdot}^{T}\beta,\ \sigma^2), \quad i = 1,\dots,N,$
where $\beta_0$ is the overall mean of $Y$ (i.e., the intercept).
The BF is defined as the probability of the data under the alternative model relative to the probability of the data under the null. Formally, we define the Bayes Factor, $BF_{10}$, as the ratio:
(8) $BF_{10} = \frac{\Pr(\mathrm{Data} \mid \mathrm{Model\ 1})}{\Pr(\mathrm{Data} \mid \mathrm{Model\ 0})},$
with the “10” subscript indicating that the full model (i.e., “Model 1”) is being compared to the null model (i.e., “Model 0”). The BF can be easily interpreted. For example, a $BF_{10}$ equal to 0.10 indicates that the observed data are ten times more likely under the null model than under the full model.
Bayesian methods require one to define appropriate prior distributions for all model parameters. Rouder and Morey (2012) [40] suggest using “objective priors” for linear regression models and explain in detail how one may implement this approach. We will not discuss the issue of prior specification in detail, and instead point interested readers to Consonni and Veronese (2008) [11], who provide an in-depth overview of how to specify prior distributions for linear models.
Using the BayesFactor package in R [31] with the function linearReg.R2stat(), one can easily obtain a BF corresponding to given values for $R^2$, $K$, and $N$. Since we can also calculate frequentist $p$-values corresponding to given values for $R^2$, $K$, and $N$ (see equations (3) and (5)), a comparison between the frequentist and Bayesian approaches is relatively straightforward.
For three different values of $K$ ($K = 1, 5, 12$) and a broad range of values of $N$ (76 values from 30 to 1,000), we calculated the $R^2$ values corresponding to a $BF_{10}$ of 1/3 (moderate evidence in favour of the null model [23]) and of 3 (moderate evidence in favour of the full model). We then proceeded to calculate the corresponding frequentist $p$-values for NHST and noninferiority testing for the ($R^2$, $K$, $N$) combinations. Note that all priors required for calculating the BF were set by simply selecting the default settings of the linearReg.R2stat() function (with rscale = “medium”; see [31]).
The results are plotted in Figure 1. The left-hand column plots the conclusions reached by frequentist testing (i.e., the CET sequential testing scheme). For all calculations, $\alpha$ and $\Delta$ were held fixed. The right-hand column plots the conclusions reached based on the Bayes Factor with a threshold of 3.
Each conclusion corresponds to a different colour in the plot: green indicates a positive finding (evidence in favour of the full model); red indicates a negative finding (evidence in favour of the null model); and yellow indicates an inconclusive finding (insufficient evidence to support either model). Note that we have also included an additional colour, light green. For the frequentist testing scheme, light green indicates a scenario where both the NHST $p$-value and the noninferiority test $p$-value are less than $\alpha$. The tests reveal that the observed effect size is both statistically significant (i.e., we reject $H_0: P^2 = 0$) and statistically smaller than the effect size of interest (i.e., we also reject $H_0: P^2 \geq \Delta$). In these situations, one could conclude that, while $P^2$ is significantly greater than zero, it is likely to be practically insignificant (i.e., a real effect of a negligible magnitude).
Three observations merit comment:
(1) For testing with Bayes Factors, there will always exist a combination of values of and that corresponds to an inconclusive result. This is not the case for frequentist testing: the probability of obtaining an inconclusive finding will decrease with increasing , and at a certain point, will be zero. For example, with and any , it is impossible to obtain an inconclusive finding regardless of the observed .
(2) For covariate, with , it is practically impossible to obtain a negative conclusion with the Bayesian approach, and only possible with the frequentist approach (for the equivalence bound of ), if the is very very small ().
2.2 Simulation study
We conducted a simple simulation study in order to better understand the operating characteristics of the noninferiority test and to confirm that the test has correct Type 1 error rates. We simulated data for each of the eighteen scenarios, one for each combination of the following parameters:

one of three sample sizes: , , or, ;

one of two designs with , or binary covariates, (with an orthogonal, balanced design), and with or ; and

one of three variances: ,, or .
Depending on the particular values of the design and the variance, the true coefficient of determination for these data is either $P^2 = 0.032$, $0.062$, or $0.076$. Parameters for the simulation study were chosen so that we would consider a wide range of values for the sample size and so as to obtain three unique values for $P^2$ between 0 and 0.10.
For each configuration, we simulated 10,000 unique datasets and calculated a noninferiority $p$-value with each of 19 different values of $\Delta$ (ranging from 0.01 to 0.10). We then calculated the proportion of these $p$-values less than $\alpha = 0.05$. Figure 2 plots the results with a restricted y-axis to better show the Type 1 error rates. In the Appendix, Figure 3 plots the results against the unrestricted y-axis.
We see that when the equivalence bound equals the true effect size (i.e., 0.032, 0.062, or 0.076), the Type 1 error rate is 0.05, as it should be, for all $N$. This situation represents the boundary of the null hypothesis, i.e., $P^2 = \Delta$. As the equivalence bound increases beyond the true effect size (i.e., $\Delta > P^2$), the alternative hypothesis is then true and it becomes possible to correctly conclude equivalence. The power of the test increases with $N$ and $\Delta$, as one would expect.
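The boundary behaviour described above can be checked with a much smaller simulation than the one reported. The sketch below is an illustration under stated assumptions, not a reproduction of the study: it uses a single balanced binary covariate ($K = 1$), and the values of N, Delta, sigma2, and n_sims are hypothetical. The slope is chosen so that the true $P^2$ equals $\Delta$ exactly, which places the simulation on the boundary of the null hypothesis.

```r
# Sketch: Type 1 error of the noninferiority test at the boundary P^2 = Delta.
# All settings are illustrative assumptions, not the paper's configuration.
set.seed(1)
N      <- 200     # sample size
Delta  <- 0.05    # equivalence bound, also the true P^2 here
sigma2 <- 1       # residual variance
n_sims <- 2000    # number of simulated datasets

# Fixed, balanced binary covariate; its variance in the sample is 0.25
x <- rep(c(0, 1), each = N / 2)

# Choose the slope so that beta1^2 * 0.25 / sigma2 = Delta / (1 - Delta),
# i.e. the proportion of variance explained is exactly Delta
beta1 <- sqrt(Delta / (1 - Delta) * sigma2 / 0.25)

pvals <- replicate(n_sims, {
  y <- beta1 * x + rnorm(N, sd = sqrt(sigma2))
  Fstat <- summary(lm(y ~ x))$fstatistic[1]
  # Equation (5) with K = 1
  pf(Fstat, df1 = 1, df2 = N - 2,
     ncp = (N * Delta) / (1 - Delta), lower.tail = TRUE)
})

# Proportion of datasets in which noninferiority is (wrongly) concluded
type1 <- mean(pvals < 0.05)
print(type1)
```

With the covariate held fixed across replications, the rejection rate should be close to the nominal level of 0.05, up to Monte Carlo error.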
3 Application: Evidence for the absence of a Hawthorne effect
McCambridge et al. (2019) [30] tested the hypothesis that participants who know that the behavioral focus of a study is alcohol related will modify their consumption of alcohol while under study. The phenomenon of subjects modifying their behaviour simply because they are being observed is commonly known as the Hawthorne effect [42].
The researchers conducted a three-arm individually randomized trial online among students in four New Zealand universities. The three groups were: group A (control), who were told they were completing a lifestyle survey; group B, who were told the focus of the survey was alcohol consumption; and group C, who additionally answered 20 questions on their alcohol use and its consequences before answering the same lifestyle questions as groups A and B. The prespecified primary outcome was a subject’s self-reported volume of alcohol consumption in the previous 4 weeks (units = number of standard drinks). This measure was recorded at baseline and at follow-up one month later.
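Group | Statistic | Baseline | Follow-up | Difference
A | n | 1795 | 1483 | 1483
A | mean | 24.60 | 18.39 | 5.13
A | sd | 31.80 | 23.32 | 24.56
B | n | 1852 | 1532 | 1532
B | mean | 23.83 | 17.48 | 5.64
B | sd | 31.79 | 23.81 | 21.77
C | n | 1825 | 1565 | 1565
C | mean | 23.03 | 17.45 | 4.79
C | sd | 30.65 | 23.21 | 25.17
Total | n | 5472 | 4582 | 4580
Total | mean | 23.82 | 17.77 | 5.19
Total | sd | 31.42 | 23.44 | 23.88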
The data were analyzed by McCambridge et al. (2019) [30] using a linear regression model with repeated measures fit by generalized estimating equations (GEE) and an “independence” correlation structure. For a NHST of the overall experimental group effect, the researchers obtained a $p$-value of 0.66. Based on this result, McCambridge et al. (2019) conclude that “the groups were not found to change differently over time” [30].
We note that this linear regression model fit by GEE is just one of many potential models one could use to analyze these data; see Yang and Tsiatis (2001) [44]. Three (among many) other reasonable alternative approaches include (1) a linear model using only the follow-up responses (without adjustment for the baseline measurement); (2) a linear model using the follow-up responses as outcome with a covariate adjustment for the baseline measurement; and (3) a linear model using the difference between follow-up and baseline responses as outcome. These three approaches yield $p$-values of 0.45, 0.56, and 0.61, respectively. None of these $p$-values suggest rejecting the null. Instead, each model leads one to conclude that there is insufficient evidence to reject the null. In order to show evidence “in favour of the null,” we turn to our proposed noninferiority test.
We fit the data ($N = 4580$) with a linear regression model using the difference between follow-up and baseline responses as the outcome, and the group membership as a categorical covariate, $X$. We then consider the noninferiority test for the coefficient of determination parameter (see Section 2), with $\Delta = 0.01$. This test asks the following question: does the overall experimental group effect account for less than 1% of the variability in the outcome?
The choice of $\Delta = 0.01$ represents our belief that any Hawthorne effect explaining less than 1% of the variability in the data would be considered negligible. For reference, Cohen (1988) describes a small effect of this kind as “a modest enough amount, just barely escaping triviality” [10]; and more recently, Fritz et al. (2012) consider associations explaining “1% of the variability” as “trivial” [16]. It is up to researchers to provide a justification of the equivalence bound before they collect the data.
We obtain a and can calculate the statistic with equation (4):
(9)  
(10)  
(11)  
(12) 
To obtain a value for the noninferiority test, we use equation (5):
(13)  
(14)  
(15) 
This result, value , suggests that we can confidently reject the null hypothesis that . We therefore conclude that the data are most compatible with no important effect. For comparison, the Bayesian testing scheme we considered in Section 2.1 obtains a Bayes Factor of . The Rcode for these calculations is presented in the Appendix.
4 A noninferiority test for the ANOVA $\eta^2$ parameter
Despite being entirely equivalent to linear regression [18], the fixed effects (or “between subjects”) analysis of variance (ANOVA) continues to be the most common statistical procedure to test the equality of multiple independent population means in many fields [36]. The noninferiority test considered earlier in the linear regression context will now be described in an ANOVA context for evaluating the equivalence of multiple independent groups. Note that all tests developed and discussed in this paper are only for between-subject ANOVA designs and cannot be applied to within-subject designs.
Equivalence/noninferiority tests for comparing group means in an ANOVA have been proposed before. For example, Rusticus and Lovato (2011) [41] list several examples of studies that used ANOVA to compare multiple groups in which nonsignificant findings are incorrectly used to conclude that groups are comparable. The authors emphasize the problem (“a statistically nonsignificant finding only indicates that there is not enough evidence to support that two (or more) groups are statistically different” [41]) and offer an equivalence testing solution based on CIs. Unfortunately, a confidence interval approach to equivalence testing does not allow for the calculation of $p$-values. Instead, conclusions of equivalence are based only on CIs, which the authors warn may be “too wide” [41].
In another proposal, Wellek (2003) [43] considered simultaneous equivalence testing for several parameters to test group means. However, this strategy may not necessarily be more efficient than the rather inefficient strategy of multiple pairwise comparisons; see the conclusions of Pallmann et al. (2017) [35].
Koh and Cribbie (2013) [25] (see also Cribbie et al. (2009) [13]) consider two different omnibus tests. These are presented as noninferiority tests for , a parameter closely related to the population signaltonoise parameter, ; (note that , where is the total sample size). Unfortunately, the use of these tests is limited by the fact that the population parameters and are not commonly used in analyses since their units of measurement are rather arbitrary.
In this section, we consider a noninferiority test for the population effect-size parameter, $\eta^2$, a standardized effect size that is commonly used in the social sciences [24]. The parameter $\eta^2$ represents the proportion of total variance in the population that can be accounted for by knowing the group level. Reporting commonly used standardized effect sizes is recommended in order to facilitate future meta-analysis and the interpretation of results [26]. Note that $\eta^2$ is analogous to the $P^2$ parameter considered earlier in the linear regression context in Section 2. Also note that the noninferiority test we propose is entirely equivalent to the test proposed by Koh and Cribbie (2013) [25]. It is simply a reformulation of that test in terms of the $\eta^2$ parameter.
Before going forward, let us define some basic notation. All technical details are presented in the Appendix. Let $Y$ represent the continuous (normally distributed) outcome variable, and $X$ represent a fixed categorical variable (i.e., group membership). Let $N$ be the total number of observations in the observed data, $J$ be the number of groups (i.e., factor levels in $X$), and $n_j$ be the number of observations in the $j$th group, for $j$ in $1,\dots,J$. We will consider two separate cases, one in which the variance within each group is equal, and one in which the variance is heterogeneous.
Typically, one will conduct a standard $F$-test to determine whether one can reject the null hypothesis that $\eta^2$ is equal to zero ($H_0: \eta^2 = 0$). The $p$-value is calculated as:
(16) $p\text{-value} = 1 - p_f(F;\ J-1,\ N-J,\ 0),$
where, as in Section 2, $p_f(\cdot\,;\ df_1, df_2, ncp)$ is the cdf of the noncentral $F$-distribution with $df_1$ and $df_2$ degrees of freedom, and noncentrality parameter, $ncp$; and where:
(17) $F = \frac{\sum_{j=1}^{J} n_j(\bar{y}_j - \bar{y})^2/(J-1)}{\sum_{j=1}^{J}\sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2/(N-J)},$
with $y_{ij}$ denoting the $i$th observation in the $j$th group, $\bar{y}_j$ the sample mean of the $j$th group, and $\bar{y}$ the overall sample mean.
One can calculate the above $p$-value using R with the following code: pval = pf(Fstat, df1=J-1, df2=N-J, lower.tail=FALSE).
A noninferiority test for $\eta^2$ asks a different question: can we reject the hypothesis that the total proportion of variance in $Y$ attributable to group membership is greater than or equal to $\Delta$? Formally, the hypotheses for the noninferiority test are written as:
$H_0: \eta^2 \geq \Delta$,
$H_1: \eta^2 < \Delta$.
If we reject $H_0$, we reject the hypothesis that there are meaningful differences between the group means ($\mu_j$, $j = 1,\dots,J$), in favour of the hypothesis that the group means are practically equivalent. The $p$-value for this test is obtained by inverting the one-sided CI for $\eta^2$ (see Appendix for details) and can be calculated as:
(18) $p\text{-value} = p_f\left(F;\ J-1,\ N-J,\ \frac{N\Delta}{1-\Delta}\right).$
Note that one can calculate the above $p$-value using R with the following code: pval = pf(Fstat, df1=J-1, df2=N-J, ncp=N*Delta/(1-Delta), lower.tail=TRUE).
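To make the ANOVA calculation concrete, the following sketch applies equations (16) and (18) to simulated data in which all group means are equal. The data, the seed, the group sizes, and the choice Delta = 0.05 are all assumptions made for illustration; only base R functions are used.

```r
# Sketch: NHST and noninferiority p-values for eta^2 in a one-way ANOVA.
# Simulated data and Delta are illustrative assumptions.
set.seed(42)
J <- 3                                   # number of groups
n_per_group <- 100                       # observations per group
x <- factor(rep(LETTERS[1:J], each = n_per_group))
y <- rnorm(J * n_per_group)              # equal population means in every group
N <- length(y)
Delta <- 0.05                            # hypothetical equivalence bound

# Standard one-way ANOVA table
tab <- anova(lm(y ~ x))
Fstat <- tab[["F value"]][1]
eta2_hat <- tab[["Sum Sq"]][1] / sum(tab[["Sum Sq"]])   # sample eta^2

# Equation (16): NHST of H0: eta^2 = 0
p_nhst <- pf(Fstat, df1 = J - 1, df2 = N - J, lower.tail = FALSE)

# Equation (18): noninferiority test of H0: eta^2 >= Delta
p_noninf <- pf(Fstat, df1 = J - 1, df2 = N - J,
               ncp = N * Delta / (1 - Delta), lower.tail = TRUE)

print(c(eta2_hat = eta2_hat, F = Fstat, p_nhst = p_nhst, p_noninf = p_noninf))
```

The $F$ statistic from the ANOVA table can equivalently be recovered from the sample $\eta^2$ as $(\hat{\eta}^2/(J-1))/((1-\hat{\eta}^2)/(N-J))$, which is the same relationship as equation (4) with $K = J - 1$.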
The noninferiority test for $\eta^2$ makes the following three important assumptions about the data:

The outcome data are independent and normally distributed.

The proportions of observations in each group (i.e., $n_j/N$, for $j = 1,\dots,J$) in the observed data are equal to the corresponding proportions in the total population of interest.

The variance within each group is equal (homogeneous variance).
4.1 A noninferiority test for ANOVA with heterogeneous variance
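With regard to the third assumption above, we can modify the above noninferiority test in order to allow for the possibility that the variance is unequal across groups (heterogeneous variance). Recall that a Welch $F$-test statistic is calculated as (see Appendix for details; see also [14]):
(19) $F' = \frac{\sum_{j=1}^{J} w_j(\bar{y}_j - \bar{y}')^2/(J-1)}{1 + \frac{2(J-2)}{J^2-1}\sum_{j=1}^{J}\frac{1}{n_j-1}\left(1 - \frac{w_j}{W}\right)^2},$
where $w_j = n_j/s_j^2$, with $s_j^2 = \sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2/(n_j-1)$, for $j = 1,\dots,J$; and where $W = \sum_{j=1}^{J} w_j$, and $\bar{y}' = \sum_{j=1}^{J} w_j\bar{y}_j/W$, with $\bar{y}_j = \sum_{i=1}^{n_j} y_{ij}/n_j$, for $j = 1,\dots,J$.
Then, the $p$-value for a noninferiority test ($H_0: \eta^2 \geq \Delta$) in the case of heterogeneous variance is:
(20) $p\text{-value} = p_f\left(F';\ J-1,\ df',\ \frac{N\Delta}{1-\Delta}\right),$
where:
(21) $df' = \frac{J^2-1}{3\sum_{j=1}^{J}\frac{1}{n_j-1}\left(1 - \frac{w_j}{W}\right)^2}.$
The above $p$-value can be calculated using R with the following code:
aov1 <- oneway.test(y ~ x, var.equal = FALSE)
Fprime <- aov1$statistic
dfprime <- aov1$parameter[2]
pval = pf(Fprime, J-1, df2 = dfprime, ncp = (Delta*N)/(1-Delta), lower.tail=TRUE)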
For the heterogeneous case, the population effect size parameter, $\eta^2$, is defined slightly differently than for the homogeneous case (see Appendix for details). Based on the simulation studies of Koh and Cribbie (2013) [25], we recommend the noninferiority test based on Welch’s $F'$ statistic (i.e., the test with $p$-value calculated from equation (20)) as almost always preferable, with regard to statistical power and Type 1 error rate, to the test that requires an assumption of homogeneous variance (i.e., the test with $p$-value calculated from equation (18)).
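As a check on the Welch-based calculation, the sketch below computes the Welch statistic and its denominator degrees of freedom by hand, using the standard Welch formulas, and compares them with the output of oneway.test() before evaluating the noninferiority $p$-value. The simulated data, group sizes, standard deviations, and Delta = 0.05 are hypothetical values chosen for illustration.

```r
# Sketch: Welch-based noninferiority test with unequal group variances.
# Simulated data and Delta are illustrative assumptions.
set.seed(7)
n_j  <- c(30, 50, 80)                    # unequal group sizes
sd_j <- c(1, 2, 3)                       # unequal group standard deviations
x <- factor(rep(c("A", "B", "C"), times = n_j))
y <- rnorm(sum(n_j), mean = 0, sd = rep(sd_j, times = n_j))
J <- nlevels(x)
N <- length(y)
Delta <- 0.05

# Manual Welch statistic from group summaries
ybar_j <- tapply(y, x, mean)
s2_j   <- tapply(y, x, var)
w_j    <- n_j / s2_j                     # precision weights
W      <- sum(w_j)
ybar_w <- sum(w_j * ybar_j) / W          # weighted grand mean
lam    <- sum((1 - w_j / W)^2 / (n_j - 1))
Fprime_manual  <- (sum(w_j * (ybar_j - ybar_w)^2) / (J - 1)) /
                  (1 + 2 * (J - 2) / (J^2 - 1) * lam)
dfprime_manual <- (J^2 - 1) / (3 * lam)

# Same quantities from base R
aov1    <- oneway.test(y ~ x, var.equal = FALSE)
Fprime  <- unname(aov1$statistic)
dfprime <- unname(aov1$parameter[2])

# Noninferiority p-value, as in equation (20)
pval <- pf(Fprime, df1 = J - 1, df2 = dfprime,
           ncp = (Delta * N) / (1 - Delta), lower.tail = TRUE)
print(c(Fprime = Fprime, dfprime = dfprime, pval = pval))
```

Agreement between the manual and built-in values confirms that aov1$statistic and aov1$parameter[2] are the Welch statistic and its approximate denominator degrees of freedom.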
5 Conclusion
In this paper we presented a statistical method for noninferiority testing of standardized omnibus effects commonly used in linear regression and ANOVA. We also considered how frequentist noninferiority testing, and equivalence testing more generally, offer an attractive alternative to Bayesian methods for “testing the null.” We recommend that all researchers specify an appropriate noninferiority margin and plan to use the proposed noninferiority tests in the event that a standard NHST fails to reject the null. In cases where the sample size is very large, the noninferiority test can also be useful for detecting effects that are statistically significant but not meaningful.
Note that our current noninferiority test for $P^2$ in a standard multivariable linear regression is limited to comparing the “full model” to the “null model.” As such, the test is not suitable for comparing two nested models. For example, we cannot use the test to compare a “smaller model” with only the baseline measure as a covariate, with a “larger model” that includes both baseline measure and group membership as covariates.
Equivalence testing for comparing two nested models will be addressed in future work in which we will consider a noninferiority test for the increase in $P^2$ between a smaller model and a larger model. Related work includes that of Algina et al. (2007) [2] and Algina et al. (2008) [1]. We also wish to further investigate noninferiority testing for ANOVA with within-subject designs, following the work of Rose et al. (2018) [39].
The equivalence test we propose requires researchers to specify equivalence bounds in standardized effect sizes. Standardized effect sizes have strengths and weaknesses, and some researchers have argued in favor of the use of unstandardized effect sizes [5]. Although we proposed equivalence tests in terms of standardized effect sizes, we largely agree with these criticisms. Nevertheless, researchers might find it more intuitive to specify equivalence bounds in standardized effect sizes, at least in certain research lines.
There is a great risk of bias in the scientific literature if researchers only rely on statistical tools that can reject null hypotheses, but do not have access to statistical tools that allow them to reject the presence of meaningful effects. Amrhein et al. (2019) express great concern with the practice of statistically nonsignificant results being “interpreted as indicating ‘no difference’ or ‘no effect’” [4]. Equivalence tests provide one approach to improve current research practices by allowing researchers to falsify their predictions concerning the presence of an effect. Thinking about what would falsify your prediction is a crucial step when designing a study, and specifying a smallest effect size of interest and performing an equivalence test provides one way to answer that question.
Available Code 
All the code used in this paper and relevant materials are made available in an OSF repository (https://osf.io/3q2vh/), DOI 10.17605/OSF.IO/3Q2VH. Please do not hesitate to contact the authors if you have any questions or comments.
Acknowledgements
Thank you to Prof. Paul Gustafson for the helpful advice with preliminary drafts. Thank you to Prof. John Petkau for the generous help with editing.
6 Appendix
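6.1 Linear Regression: further details and R code.
The $R^2$ statistic estimates the parameter $P^2$ from the observed data:
(22) $R^2 = 1 - \frac{SS_{res}}{SS_{tot}},$
where $SS_{res} = \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$, and $SS_{tot} = \sum_{i=1}^{N}(y_i - \bar{y})^2$; with $\hat{y}_i = X_{i,\cdot}^{T}\hat{\beta}$, and $\bar{y} = \sum_{i=1}^{N} y_i/N$.
The R code for analysis of the McCambridge et al. (2019) [30] data is:
library(BayesFactor)  # provides linearReg.R2stat()
Xmatrix <- model.matrix(totaldrinking.diff ~ group, data = side_data)
lmmodel <- lm(totaldrinking.diff ~ group, data = side_data)
R2 <- summary(lmmodel)$r.squared
Fstat <- summary(lmmodel)$fstatistic[1]
K <- dim(Xmatrix)[2] - 1
N <- dim(Xmatrix)[1]
Delta <- 0.01
pf(Fstat, df1 = K, df2 = N-K-1, ncp = (N*Delta)/(1-Delta), lower.tail = TRUE)
linearReg.R2stat(N = N, p = K, R2 = R2, simple = TRUE)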
The code below replicates the results published in McCambridge et al. (2019), Table 2. Note that there appears to be a typo in the published table whereby the values 0.89 and 0.86 are switched.
library(geepack)  # provides geeglm()
Hdata$group <- relevel(Hdata$group, "A")
mod0 <- geeglm(totaldrinking ~ group + t, id = participant_ID, corstr = "independence", data = Hdata, x = TRUE)
mod1 <- geeglm(totaldrinking ~ group*t + group + t, id = participant_ID, corstr = "independence", data = Hdata)
(anova(mod1, mod0))
summary(mod1)$coefficients
Hdata$group <- relevel(Hdata$group, "C")
mod1a <- geeglm(totaldrinking ~ group*t + group + t, id = participant_ID, corstr = "independence", data = Hdata)
summary(mod1a)
6.2 ANOVA with homogeneous variance: further details.
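The true population group mean for group $j$ is denoted $\mu_j$, for $j$ in $1,\dots,J$; and we denote the group effects as $\tau_j = \mu_j - \mu$, where $\mu$ is the overall weighted population mean, $\mu = \sum_{j=1}^{J} n_j\mu_j/N$. These parameters are estimated from the observed data by the corresponding sample group means: $\bar{y}_j = \sum_{i=1}^{n_j} y_{ij}/n_j$, for $j$ in $1,\dots,J$; and the overall sample mean: $\bar{y} = \sum_{j=1}^{J}\sum_{i=1}^{n_j} y_{ij}/N$.
We operate under the assumption that the data are normally distributed such that:
(23) $Y_{ij} \sim \mathrm{Normal}(\mu_j,\ \sigma_w^2), \quad i = 1,\dots,n_j, \quad j = 1,\dots,J,$
where $\sigma_w^2$ denotes the variance within groups. We also define the variance between groups as $\sigma_b^2 = \sum_{j=1}^{J} n_j\tau_j^2/N$. Finally, the total population variance is defined as $\sigma_t^2 = \sigma_b^2 + \sigma_w^2$. The corresponding sums of squares are estimated from the data: $SS_b = \sum_{j=1}^{J} n_j(\bar{y}_j - \bar{y})^2$; $SS_w = \sum_{j=1}^{J}\sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2$; and $SS_t = SS_b + SS_w$.
Recall that the ANOVA $F$-test statistic is calculated as:
(24) $F = \frac{MS_b}{MS_w},$
where $MS_b = SS_b/(J-1)$, and $MS_w = SS_w/(N-J)$. The $F$ statistic follows an $F$ distribution with $J-1$ degrees of freedom for the numerator, and $N-J$ degrees of freedom for the denominator.
The population effect size, $\eta^2$, is a parameter that represents the proportion of variance in the outcome variable, $Y$, that is explained by the group membership, $X$ (i.e., knowing the level of the factor $X$), and is defined as:
(25) $\eta^2 = \frac{\sigma_b^2}{\sigma_t^2} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}.$
We can estimate the population parameter $\eta^2$ from the observed data using the sample statistic, $\hat{\eta}^2$, as follows: $\hat{\eta}^2 = SS_b/SS_t$. It is well known that $\hat{\eta}^2$ is a biased estimate for $\eta^2$. However, alternative estimates (including $\hat{\epsilon}^2$ and $\hat{\omega}^2$) are also biased; see Okada (2013) [34] for more details (note that there is a typo in eq. 5 of [34]).
The population effect size parameter is closely related to the signal-to-noise ratio parameter, $f^2$, and to the noncentrality parameter, $\Lambda$. Consider the following equality:
(26) $f^2 = \frac{\eta^2}{1-\eta^2} = \frac{\sigma_b^2}{\sigma_w^2} = \frac{\Lambda}{N}.$
The noncentrality parameter, $\Lambda$, can be estimated from the data, and we can easily calculate a one-sided confidence interval (CI), $[0, \Lambda_U]$, by “pivoting” the cumulative distribution function (cdf); see [24] Section 2.2 and references therein. This requires solving (numerically) the following equation for $\Lambda_U$:
(27) $p_f(F;\ J-1,\ N-J,\ \Lambda_U) = \alpha,$
where $p_f(\cdot\,;\ df_1, df_2, ncp)$ is the cdf of the noncentral $F$-distribution with $df_1$ and $df_2$ degrees of freedom, and noncentrality parameter, $ncp$. The values for $F$, $J$, and $N$ are calculated from the data as defined above. The solution, $\Lambda_U$, will be the upper confidence bound of $\Lambda$, such that: $\Pr(\Lambda \leq \Lambda_U) = 1 - \alpha$.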
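The numerical solution described above can be sketched with uniroot() in base R. The function name upper_bound_eta2, the bracketing strategy, and the example inputs are assumptions made for illustration; the conversion from the noncentrality bound to the $\eta^2$ scale assumes the same relationship used by the noninferiority $p$-value in the main text, namely a noncentrality parameter of $N\eta^2/(1-\eta^2)$.

```r
# Sketch: one-sided upper confidence bound for the noncentrality parameter,
# found by pivoting the noncentral F cdf, then mapped to the eta^2 scale.
# Function name and example inputs are illustrative assumptions.
upper_bound_eta2 <- function(Fstat, J, N, alpha = 0.05) {
  # cdf at the observed F, minus alpha, as a function of the ncp
  g <- function(ncp) pf(Fstat, df1 = J - 1, df2 = N - J, ncp = ncp) - alpha
  # If the cdf is already below alpha at ncp = 0, the bound collapses to zero
  if (g(0) <= 0) return(c(ncp_upper = 0, eta2_upper = 0))
  # The cdf decreases in ncp; widen the bracket until the sign changes
  hi <- 10
  while (g(hi) > 0) hi <- hi * 2
  ncp_U <- uniroot(g, lower = 0, upper = hi, tol = 1e-10)$root
  # Invert ncp = N * eta^2 / (1 - eta^2)
  c(ncp_upper = ncp_U, eta2_upper = ncp_U / (ncp_U + N))
}

# Hypothetical inputs: F = 1.2 from J = 3 groups and N = 300 observations
bound <- upper_bound_eta2(Fstat = 1.2, J = 3, N = 300)
print(bound)

# Duality check: testing at Delta equal to the bound gives a p-value of alpha
D <- bound[["eta2_upper"]]
p_at_bound <- pf(1.2, df1 = 2, df2 = 297, ncp = 300 * D / (1 - D))
print(p_at_bound)
```

This illustrates the sense in which the noninferiority $p$-value is obtained by inverting the one-sided CI: the null hypothesis is rejected at level $\alpha$ exactly when $\Delta$ lies above the upper confidence bound.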
6.3 ANOVA with heterogeneous variance: further details.
As above, the true population group mean for group is denoted , for in 1,…,. We now define:
(28) 
and define , and , and finally .
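Recall that a Welch $F$-test statistic is calculated as:
(29) $F' = \frac{\sum_{j=1}^{J} w_j(\bar{y}_j - \bar{y}')^2/(J-1)}{1 + \frac{2(J-2)}{J^2-1}\sum_{j=1}^{J}\frac{1}{n_j-1}\left(1 - \frac{w_j}{W}\right)^2},$
where $w_j = n_j/s_j^2$, with $s_j^2 = \sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2/(n_j-1)$, for $j = 1,\dots,J$; and where $W = \sum_{j=1}^{J} w_j$, and $\bar{y}' = \sum_{j=1}^{J} w_j\bar{y}_j/W$, with $\bar{y}_j = \sum_{i=1}^{n_j} y_{ij}/n_j$, for $j = 1,\dots,J$.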
Levy (1978) [28] proposed an approximate nonnull distribution for the statistic such that follows a noncentral distribution with and degrees of freedom, and noncentrality parameter, ; see also [22]. The degrees of freedom for this case are defined as: , and:
(30) 
We will therefore define our population effect size parameter for the heterogeneous case as:
(31) 
Note that in the case of homogeneous variance (i.e., when in ), we have and . The value for the noninferiority test () in the case of heterogeneous variance is:
(32) 
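For readers who wish to reproduce the quantities in this section, the following Python sketch computes the Welch statistic and its approximate degrees of freedom from raw group data. It assumes the standard form of Welch's heteroscedastic one-way test (weights equal to the group size divided by the unbiased sample variance); the function name `welch_f`, the input format and the toy data are illustrative assumptions.

```python
def welch_f(groups):
    """Return (F_welch, df1, df2) for Welch's heteroscedastic one-way ANOVA.

    `groups` is a list of lists of observations (hypothetical input format).
    """
    J = len(groups)
    n = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]

    # Unbiased sample variances and the weights n_j / s_j^2.
    variances = [sum((y - m) ** 2 for y in g) / (len(g) - 1)
                 for g, m in zip(groups, means)]
    w = [nj / s2 for nj, s2 in zip(n, variances)]
    w_sum = sum(w)
    weighted_mean = sum(wj * m for wj, m in zip(w, means)) / w_sum

    numerator = sum(wj * (m - weighted_mean) ** 2
                    for wj, m in zip(w, means)) / (J - 1)

    # Term shared by the denominator and the second degrees of freedom.
    spread = sum((1 - wj / w_sum) ** 2 / (nj - 1) for wj, nj in zip(w, n))
    denominator = 1 + 2 * (J - 2) / (J ** 2 - 1) * spread

    df1 = J - 1
    df2 = (J ** 2 - 1) / (3 * spread)
    return numerator / denominator, df1, df2


# Toy data with equal sample variances, so all weights coincide.
# Here the numerator is 13 and the denominator is 7/6, so F' = 78/7.
F_w, df1_w, df2_w = welch_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(F_w, df1_w, df2_w)
```

Even with identical sample variances, the Welch statistic (78/7) is smaller than the classical $F$ statistic (13) for the same data, and the second degrees of freedom drop from 6 to 4.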
References
 [1] J. Algina, H.J. Keselman, and R.J. Penfield, Note on a confidence interval for the squared semipartial correlation coefficient, Educational and Psychological Measurement 68 (2008), pp. 734–741.
 [2] J. Algina, H. Keselman, and R.D. Penfield, Confidence intervals for an effect size measure in multiple linear regression, Educational and psychological measurement 67 (2007), pp. 207–218.
 [3] D.G. Altman and J.M. Bland, Statistics notes: Absence of evidence is not evidence of absence, The BMJ 311 (1995), p. 485.
 [4] V. Amrhein, S. Greenland, and B. McShane, Scientists rise up against statistical significance (2019).
 [5] T. Baguley, Standardized or simple effect size: What should be reported?, British journal of psychology 100 (2009), pp. 603–617.
 [6] A. Barten, Note on unbiased estimation of the squared multiple correlation coefficient, Statistica Neerlandica 16 (1962), pp. 151–164.
 [7] H. Campbell and P. Gustafson, Conditional equivalence testing: An alternative remedy for publication bias, PloS one 13 (2018), p. e0195145.
 [8] H. Campbell and P. Gustafson, What to make of noninferiority and equivalence testing with a postspecified margin?, arXiv preprint arXiv:1807.03413 (2018).
 [9] N. Christou, The true R2 and the truth about R2 (2005).
 [10] J. Cohen, Statistical power analysis for the behavioral sciences, Routledge, 1988.
 [11] G. Consonni, P. Veronese, et al., Compatibility of prior specifications across linear models, Statistical Science 23 (2008), pp. 332–353.
 [12] J.S. Cramer, Mean and variance of R2 in small and moderate samples, Journal of Econometrics 35 (1987), pp. 253–266.
 [13] R.A. Cribbie, C.A. Arpin-Cribbie, and J.A. Gruman, Tests of equivalence for one-way independent groups designs, The Journal of Experimental Education 78 (2009), pp. 1–13.
 [14] M. Delacre, D. Lakens, Y. Mora, and C. Leys, Taking parametric assumptions seriously: Arguments for the use of Welch’s F-test instead of the classical F-test in one-way ANOVA (2018).
 [15] P. Dudgeon, Some improvements in confidence intervals for standardized regression coefficients, Psychometrika 82 (2017), pp. 928–951.
 [16] C.O. Fritz, P.E. Morris, and J.J. Richler, Effect size estimates: current use, calculations, and interpretation., Journal of experimental psychology: General 141 (2012), p. 2.
 [17] C. Gatsonis and A.R. Sampson, Multiple correlation: exact power and sample size calculations., Psychological Bulletin 106 (1989), p. 516.
 [18] A. Gelman, et al., Analysis of variance – why it is more important than ever, The Annals of Statistics 33 (2005), pp. 1–53.
 [19] J. Hartung, J.E. Cottrell, and J.P. Giffin, Absence of evidence is not evidence of absence, Anesthesiology: The Journal of the American Society of Anesthesiologists 58 (1983), pp. 298–299.
 [20] N. Hättenschwiler, S. Merks, Y. Sterchi, and A. Schwaninger, Traditional visual search versus X-ray image inspection in students and professionals: Are the same visual-cognitive abilities needed?, Frontiers in Psychology 10 (2019), p. 525.
 [21] C.M. Hurvich and C. Tsai, The impact of model selection on inference in linear regression, The American Statistician 44 (1990), pp. 214–217.
 [22] S.L. Jan and G. Shieh, Sample size determinations for Welch’s test in one-way heteroscedastic ANOVA, British Journal of Mathematical and Statistical Psychology 67 (2014), pp. 72–93.
 [23] H. Jeffreys, The theory of probability, OUP Oxford, 1961.
 [24] K. Kelley, et al., Confidence intervals for standardized effect sizes: Theory, application, and implementation, Journal of Statistical Software 20 (2007), pp. 1–24.
 [25] A. Koh and R. Cribbie, Robust tests of equivalence for k independent groups, British Journal of Mathematical and Statistical Psychology 66 (2013), pp. 426–434.
 [26] D. Lakens, Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs, Frontiers in Psychology 4 (2013), p. 863.
 [27] D. Lakens, A.M. Scheel, and P.M. Isager, Equivalence testing for psychological research: A tutorial, Advances in Methods and Practices in Psychological Science 1 (2018), pp. 259–269.
 [28] K.J. Levy, Some empirical power results associated with Welch’s robust analysis of variance technique, Journal of Statistical Computation and Simulation 8 (1978), pp. 43–48.
 [29] F. Liang, R. Paulo, G. Molina, M.A. Clyde, and J.O. Berger, Mixtures of g priors for Bayesian variable selection, Journal of the American Statistical Association 103 (2008), pp. 410–423.
 [30] J. McCambridge, A. Wilson, J. Attia, N. Weaver, and K. Kypri, Randomized trial seeking to induce the Hawthorne effect found no evidence for any effect on self-reported alcohol consumption online, Journal of clinical epidemiology 108 (2019), pp. 102–109.
 [31] R.D. Morey, J.N. Rouder, T. Jamil, and M.R.D. Morey, Package ‘BayesFactor’, URL http://cran.r-project.org/web/packages/BayesFactor/BayesFactor.pdf (accessed 10.06.15) (2015).
 [32] K. Ohtani, Bootstrapping R2 and adjusted R2 in regression analysis, Economic Modelling 17 (2000), pp. 473–483.
 [33] K. Ohtani and H. Tanizaki, Exact distributions of R2 and adjusted R2 in a linear regression model with multivariate t error terms, Journal of the Japan Statistical Society 34 (2004), pp. 101–109.
 [34] K. Okada, Is omega squared less biased? A comparison of three major effect size indices in one-way ANOVA, Behaviormetrika 40 (2013), pp. 129–147.
 [35] P. Pallmann and T. Jaki, Simultaneous confidence regions for multivariate bioequivalence, Statistics in Medicine 36 (2017), pp. 4585–4603.
 [36] L. Plonsky and F.L. Oswald, Multiple regression as a flexible alternative to ANOVA in L2 research, Studies in Second Language Acquisition 39 (2017), pp. 579–592.
 [37] E. Quertemont, How to statistically show the absence of an effect, Psychologica Belgica 51 (2011), pp. 109–127.
 [38] S. Rehal, T.P. Morris, K. Fielding, J.R. Carpenter, and P.P. Phillips, Non-inferiority trials: are they inferior? A systematic review of reporting in major medical journals, BMJ open 6 (2016), p. e012594.
 [39] E.M. Rose, T. Mathew, D.A. Coss, B. Lohr, and K.E. Omland, A new statistical method to test equivalence: an application in male and female eastern bluebird song, Animal Behaviour 145 (2018), pp. 77–85.
 [40] J.N. Rouder and R.D. Morey, Default Bayes factors for model selection in regression, Multivariate Behavioral Research 47 (2012), pp. 877–903.
 [41] S.A. Rusticus and C.Y. Lovato, Applying tests of equivalence for multiple group comparisons: Demonstration of the confidence interval approach., Practical Assessment, Research & Evaluation 16 (2011).
 [42] G. Wickström and T. Bendix, The “Hawthorne effect”: what did the original Hawthorne studies actually show?, Scand J Work Environ Health 26 (2000), pp. 363–367.
 [43] S. Wellek, Testing statistical hypotheses of equivalence and noninferiority, Chapman and Hall/CRC, 2010.
 [44] L. Yang and A.A. Tsiatis, Efficiency study of estimators for a treatment effect in a pretest–posttest trial, The American Statistician 55 (2001), pp. 314–321.