# Can we disregard the whole model? Omnibus non-inferiority testing for R^2 in multivariable linear regression and η^2 in ANOVA

Determining a lack of association between an outcome variable and a number of different explanatory variables is frequently necessary in order to disregard a proposed model (i.e., to confirm the lack of an association between an outcome and predictors). Despite this, the literature rarely offers information about, or technical recommendations concerning, the appropriate statistical methodology for this task. This paper introduces non-inferiority tests for ANOVA and linear regression analyses that correspond to the standard, widely used F-tests for η̂^2 and R^2, respectively. A simulation study is conducted to examine the type I error rates and statistical power of the tests, and a comparison is made with an alternative Bayesian testing approach. The results indicate that the proposed non-inferiority test is a potentially useful tool for 'testing the null.'

## 1 Introduction

All too often, researchers will conclude that the effect of an explanatory variable, $X$, on an outcome variable, $Y$, is absent when a null-hypothesis significance test (NHST) yields a non-significant $p$-value (e.g., when the $p$-value $> 0.05$). Unfortunately, such an argument is logically flawed. As the saying goes, “absence of evidence is not evidence of absence” [19, 3]. Indeed, a non-significant result can simply be due to insufficient power, and while a null-hypothesis significance test can provide evidence to reject the null hypothesis, it cannot provide evidence in favour of the null [37]. To properly conclude that an association between $X$ and $Y$ is absent (i.e., to confirm the lack of an association), the recommended frequentist tool, the equivalence test, is well-suited [43]. Equivalence testing is commonly known as non-inferiority testing for one-sided hypotheses and is often used in the analysis of clinical trials [38].

Let $\theta$ be the parameter of interest representing the true association between $X$ and $Y$ in the population of interest. The equivalence/non-inferiority test reverses the question that is asked in a NHST. Instead of asking whether we can reject the null hypothesis, e.g., $H_0: \theta = 0$, an equivalence test examines whether the magnitude of $\theta$ is at all meaningful: Can we reject an association between $X$ and $Y$ as large or larger than our smallest effect size of interest, $\Delta$? The null hypothesis for an equivalence test is therefore defined as $H_0: |\theta| \geq \Delta$. Or, for the one-sided non-inferiority test, the null hypothesis is $H_0: \theta \geq \Delta$. Note that researchers must decide which effect size is considered meaningful or relevant [27], and define $\Delta$ accordingly, prior to observing any data; see Campbell and Gustafson (2018) [8] for details.

In a standard multivariable linear regression model, or a standard ANOVA analysis, the variability of the outcome variable, $Y$, is attributed to multiple different explanatory variables, $X_1, \ldots, X_K$. Researchers will typically report the linear regression model’s $R^2$ statistic, or the $\hat{\eta}^2$ in the ANOVA context, to estimate the proportion of variance in the observed data that is explained by the model. To determine whether or not the $R^2$ statistic (or the $\hat{\eta}^2$ statistic) is significantly larger than zero, one typically calculates an $F$-statistic and tests whether the “null model” (i.e., the intercept only model) can be rejected in favour of the “full model” (i.e., the model with all explanatory variables included). However, in this multivariable setting, while rejecting the “null model” is rather simple, concluding in favour of the “null model” is less obvious.

If the explanatory variables are not statistically significant, can we simply disregard the full model? We certainly shouldn’t pick and choose which variables to include in the model based on their significance (it is well known that due to model selection bias, most step-wise variable selection schemes are to be avoided; see Hurvich and Tsai (1990) [21]). How can we formally test whether the proportion of variance attributable to the full set of explanatory variables is too small to be considered meaningful? In this article, we introduce a non-inferiority test to reject effect sizes that are as large or larger than the smallest effect size of interest as estimated by either the $R^2$ statistic or the $\hat{\eta}^2$ statistic.

In Section 2, we introduce a non-inferiority test for the coefficient of determination parameter in a linear regression context. We show how to define hypotheses and calculate a valid $p$-value for this test based on the $R^2$ statistic. We then briefly consider how this frequentist test compares to a Bayesian testing scheme based on Bayes Factors, and conduct a small simulation study to better understand the test’s operating characteristics. In Section 3, we illustrate the use of this test with data from a recent study about the absence of the Hawthorne effect. In Section 4, we present the analogous non-inferiority test for the $\eta^2$ parameter in an ANOVA. We also provide a modified version of this test that allows for the possibility that the variance across groups is unequal.

## 2 A non-inferiority test for the coefficient of determination parameter

The coefficient of determination, commonly known as $R^2$, is a sample statistic used in almost all fields of research. Yet, its corresponding population parameter, which we will denote as $P^2$, as in Cramer (1987) [12], is rarely discussed. When considered, it is sometimes known as the “parent multiple correlation coefficient” [6] or the “population proportion of variance accounted for” [24]. See Cramer (1987) [12] for a technical discussion.

While confidence intervals for $P^2$ have been studied by many researchers (e.g., [33], [32], [9], [15]), there has been no consideration (as far as we know) of a non-inferiority test for $P^2$. In this section we will derive such a test and investigate how it compares to a popular Bayesian alternative [40]. Before we continue, let us define some notation. All technical details are presented in the Appendix. Let:

• $N$, be the number of observations in the observed data;

• $K$, be the number of explanatory variables in the linear regression model;

• $y_i$, be the observed value of random variable $Y$ for the $i$th subject;

• $x_{ki}$, be the observed value of fixed covariate $X_k$, for the $i$th subject, for $k$ in $1, \ldots, K$; and

• $X$, be the $N$ by $K+1$ covariate matrix (with a column of 1s for the intercept; we use the notation $X_{i,\cdot}$ to refer to all $K+1$ values corresponding to the $i$th subject).

We operate under the standard linear regression assumption that observations in the data are independent and normally distributed with:

$$Y_i \sim \text{Normal}(X_{i,\cdot}^{T}\beta, \sigma^2), \quad \forall\, i = 1, \ldots, N; \tag{1}$$

where $\beta$ is a parameter vector of regression coefficients, and $\sigma^2$ is the population variance. The parameter $P^2$ represents the proportion of total variance in the population that can be accounted for by knowing the covariates, i.e., by knowing $X$. As such, $P^2$ is entirely dependent on the particular design matrix $X$, and we have that:

$$P^2 = \frac{\sigma_{XY}^{T}\,\Sigma_X^{-1}\,\sigma_{XY}}{\sigma_Y^2}, \tag{2}$$

where $\sigma_Y^2$ is the unconditional variance of $Y$ (note that $\sigma^2 = \sigma_Y^2(1 - P^2)$); $\sigma_{XY}$ is the vector of population covariances between the $K$ different $X$ variables and $Y$; and $\Sigma_X$ is the $K$ by $K$ population covariance matrix of the $K$ different $X$ variables. The $R^2$ statistic estimates the parameter $P^2$ from the observed data. See Kelley (2007) [24] for a complete derivation of equation (2).
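As a minimal numerical sketch of equation (2), the population $P^2$ can be computed directly once the slopes, the covariate covariance matrix, and the residual variance are specified. The helper name `pop_P2` and all numeric values below are illustrative assumptions, not taken from the paper; the sketch relies on the fact that, under model (1), the covariance vector is $\sigma_{XY} = \Sigma_X \beta$ (slopes only, no intercept).

```r
# Hypothetical helper (not from the paper): population P^2 implied by the
# slopes `beta`, the covariate covariance matrix `Sigma_X`, and the
# residual variance `sigma2`.
pop_P2 <- function(beta, Sigma_X, sigma2) {
  sigma_XY <- Sigma_X %*% beta                     # covariances between each X_k and Y
  explained <- drop(t(sigma_XY) %*% solve(Sigma_X) %*% sigma_XY)
  sigma2_Y <- explained + sigma2                   # unconditional variance of Y
  explained / sigma2_Y                             # equation (2)
}

# Two balanced, orthogonal binary covariates each have variance 0.25
# and zero covariance (illustrative values).
Sigma_X <- diag(0.25, 2)
P2 <- pop_P2(beta = c(0.2, 0.2), Sigma_X = Sigma_X, sigma2 = 0.38)
print(P2)
```

Here the explained variance is $0.25 \times (0.2^2 + 0.2^2) = 0.02$ and the total variance is $0.02 + 0.38 = 0.40$, so the ratio in equation (2) is $0.02/0.40$.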

A standard NHST asks whether we can reject the null hypothesis that $P^2$ is equal to zero ($H_0: P^2 = 0$). The $p$-value for this NHST is calculated as:

$$p\text{-value} = 1 - p_f(F;\, K,\, N-K-1,\, 0), \tag{3}$$

where $p_f(\cdot\,;\, df_1, df_2, ncp)$ is the cdf of the non-central $F$-distribution with $df_1$ and $df_2$ degrees of freedom, and non-centrality parameter $ncp$ (note that $ncp = 0$ corresponds to the central $F$-distribution); and where:

$$F = \frac{R^2/K}{(1-R^2)/(N-K-1)}. \tag{4}$$

One can calculate the above $p$-value in R with the following code: `pval = pf(Fstat,df1=K,df2=N-K-1,lower.tail=FALSE)`.

A non-inferiority test for $P^2$ asks a different question: can we reject the hypothesis that the total proportion of variance in $Y$ attributable to $X$ is greater than or equal to $\Delta$? Formally, the hypotheses for the non-inferiority test are:

$H_0: 1 > P^2 \geq \Delta$,
$H_1: 0 \leq P^2 < \Delta$.

The $p$-value for this non-inferiority test is obtained by inverting the one-sided CI for $P^2$ (see Appendix for details), and can be calculated as:

$$p\text{-value} = p_f\left(F;\, K,\, N-K-1,\, \frac{N\Delta}{1-\Delta}\right). \tag{5}$$
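The form of the non-centrality parameter in equation (5) can be motivated with a short sketch. This is an informal outline under the fixed-regressor assumptions of equation (1), not a substitute for the formal argument in the Appendix; the symbols $\lambda$ and $f_{\text{obs}}$ are introduced here for illustration only.

$$
\begin{aligned}
F &\sim F_{K,\;N-K-1}(\lambda), \qquad \lambda = \frac{N P^2}{1 - P^2}, \\
\sup_{P^2 \geq \Delta} \Pr\left(F \leq f_{\text{obs}}\right)
  &= \Pr\left(F \leq f_{\text{obs}} \,\middle|\, P^2 = \Delta\right)
   = p_f\left(f_{\text{obs}};\, K,\, N-K-1,\, \frac{N\Delta}{1-\Delta}\right).
\end{aligned}
$$

The supremum is attained at the boundary $P^2 = \Delta$ because $\lambda$ increases with $P^2$ and the cdf of the non-central $F$-distribution, evaluated at a fixed point, decreases as $\lambda$ increases. Small observed values of $F$ are therefore the ones that count as evidence against the null hypothesis of a meaningful effect.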

Note that one can calculate the above $p$-value in R with the following code: `pval = pf(Fstat,df1=K,df2=N-K-1,ncp=(N*Delta)/(1-Delta),lower.tail=TRUE)`.
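The two calculations above can be wrapped together so that both $p$-values are obtained from the summary quantities alone. The following is a minimal sketch: the function name `r2_tests` and the input values are hypothetical, and only `pf()` comes from base R.

```r
# Hypothetical wrapper (not from the paper): NHST and non-inferiority
# p-values for R^2, computed from summary statistics only.
r2_tests <- function(R2, N, K, Delta) {
  Fstat <- (R2 / K) / ((1 - R2) / (N - K - 1))                  # equation (4)
  p_nhst <- pf(Fstat, df1 = K, df2 = N - K - 1,
               lower.tail = FALSE)                              # equation (3)
  p_noninf <- pf(Fstat, df1 = K, df2 = N - K - 1,
                 ncp = (N * Delta) / (1 - Delta),
                 lower.tail = TRUE)                             # equation (5)
  c(F = Fstat, p_nhst = p_nhst, p_noninf = p_noninf)
}

# Illustrative inputs: a small observed R^2 in a moderately large sample.
res <- r2_tests(R2 = 0.01, N = 500, K = 3, Delta = 0.10)
print(res)
```

Because the non-centrality parameter grows with `Delta`, a more lenient margin always yields a smaller non-inferiority $p$-value for the same data.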

It is important to remember that the above tests make two important assumptions about the data:

• The data are independent and normally distributed as described in equation (1).

• The values for $X$ in the observed data are fixed and their distribution in the sample is equal to (or representative of) their distribution in the population of interest. The sampling distribution of $R^2$ can be quite different when regressor variables are random; see Gatsonis and Sampson (1989) [17].

In practice, one might first conduct a NHST (i.e., calculate a $p$-value, $p_1$, using equation (3)) and only proceed to conduct the non-inferiority test (i.e., calculate a $p$-value, $p_2$, using equation (5)) if the NHST fails to reject the null. If the first $p$-value, $p_1$, is less than the Type 1 error $\alpha$-threshold (e.g., if $p_1 < 0.05$), one may conclude with a “positive” finding: $P^2$ is significantly greater than 0. On the other hand, if the first $p$-value, $p_1$, is greater than $\alpha$ and the second $p$-value, $p_2$, is smaller than $\alpha$ (e.g., if $p_1 \geq 0.05$ and $p_2 < 0.05$), one may conclude with a “negative” finding: there is evidence of a statistically significant non-inferiority, i.e., $P^2$ is at most negligible. If both $p$-values are large, the result is inconclusive: there is insufficient data to support either finding. This two-stage sequential testing scheme is formally known as conditional equivalence testing (CET); see Campbell and Gustafson (2018) [7] for more details.
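The sequential decision rule described above can be written down in a few lines. This is a sketch only: the function name `cet_decision` and the labels it returns are hypothetical choices made for illustration.

```r
# Hypothetical helper (not from the paper): conditional equivalence
# testing (CET) decision from the NHST p-value (p1) and the
# non-inferiority p-value (p2).
cet_decision <- function(p1, p2, alpha = 0.05) {
  if (p1 < alpha) {
    "positive"        # P^2 is significantly greater than 0
  } else if (p2 < alpha) {
    "negative"        # P^2 is at most negligible
  } else {
    "inconclusive"    # insufficient data to support either finding
  }
}

decisions <- c(cet_decision(0.01, 0.50),
               cet_decision(0.40, 0.001),
               cet_decision(0.40, 0.30))
print(decisions)
```

Note that the non-inferiority $p$-value is only consulted when the NHST fails to reject, which mirrors the two-stage order of the scheme.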

### 2.1 Comparison to a Bayesian alternative

For linear regression models, based on the work of Liang et al. (2012) [29], Rouder and Morey (2012) [40] propose using Bayes Factors (BFs) to determine whether the data, as summarized by the $R^2$ statistic, support the null or the alternative model. This is a common approach used in psychology studies (e.g., see most recently Hattenschwiler (2019) [20]). Here we refer to the null model (“Model 0”) and alternative (full) model (“Model 1”) as:

$$\text{Model 0}: \; Y_i \sim \text{Normal}(\beta_0, \sigma^2), \quad \forall\, i = 1, \ldots, N; \tag{6}$$

$$\text{Model 1}: \; Y_i \sim \text{Normal}(X_{i,\cdot}^{T}\beta, \sigma^2), \quad \forall\, i = 1, \ldots, N; \tag{7}$$

where $\beta_0$ is the overall mean of $Y$ (i.e., the intercept).

The BF is defined as the probability of the data under the alternative model relative to the probability of the data under the null. Formally, we define the Bayes Factor, $BF_{10}$, as the ratio:

$$BF_{10} = \frac{\Pr(\text{Data} \mid \text{Model 1})}{\Pr(\text{Data} \mid \text{Model 0})}, \tag{8}$$

with the “10” subscript indicating that the full model (i.e., “Model 1”) is being compared to the null model (i.e., “Model 0”). The BF can be easily interpreted. For example, a $BF_{10}$ equal to 0.10 indicates that the observed data are ten times more likely under the null model than under the full model.

Bayesian methods require one to define appropriate prior distributions for all model parameters. Rouder and Morey (2012) [40] suggest using “objective priors” for linear regression models and explain in detail how one may implement this approach. We will not discuss the issue of prior specification in detail, and instead point interested readers to Consonni and Veronese (2008) [11] who provide an in-depth overview of how to specify prior distributions for linear models.

Using the BayesFactor package in R [31] with the function `linearReg.R2stat()`, one can easily obtain a BF corresponding to given values for $R^2$, $N$, and $K$. Since we can also calculate frequentist $p$-values corresponding to given values for $R^2$, $N$, and $K$ (see equations (3) and (5)), a comparison between the frequentist and Bayesian approaches is relatively straightforward.
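A side-by-side calculation for a single ($R^2$, $K$, $N$) combination might look as follows. This is a sketch under stated assumptions: the input values are illustrative, the BayesFactor package is assumed to be installed (the call is skipped otherwise), and the `linearReg.R2stat()` arguments follow the usage shown in the Appendix code together with the `rscale = "medium"` setting mentioned in this section.

```r
# Illustrative inputs (not from the paper).
N <- 200; K <- 5; R2 <- 0.02; Delta <- 0.10

# Frequentist p-values, equations (3) and (5).
Fstat <- (R2 / K) / ((1 - R2) / (N - K - 1))
p1 <- pf(Fstat, df1 = K, df2 = N - K - 1, lower.tail = FALSE)
p2 <- pf(Fstat, df1 = K, df2 = N - K - 1,
         ncp = (N * Delta) / (1 - Delta), lower.tail = TRUE)

# Bayes Factor BF10 (full model vs. intercept-only model), if available.
if (requireNamespace("BayesFactor", quietly = TRUE)) {
  bf10 <- BayesFactor::linearReg.R2stat(N = N, p = K, R2 = R2,
                                        rscale = "medium", simple = TRUE)
} else {
  bf10 <- NA_real_  # package not installed; skip the Bayesian comparison
}

print(c(p_nhst = p1, p_noninf = p2, BF10 = unname(bf10)))
```

Repeating this over a grid of $N$ values, and solving for the $R^2$ at which $BF_{10}$ crosses a chosen threshold, is one way such a comparison could be organised.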

For three different values of $K$ (=1, 5, 12) and a broad range of values of $N$ (76 values from 30 to 1,000), we calculated the $R^2$ values corresponding to a $BF_{10}$ of 1/3 (moderate evidence in favour of the null model, [23]) and of 3 (moderate evidence in favour of the full model). We then proceeded to calculate the corresponding frequentist $p$-values for NHST and non-inferiority testing for the ($R^2$, $K$, $N$) combinations. Note that all priors required for calculating the BF were set by simply selecting the default settings of the `linearReg.R2stat()` function (with `rscale = "medium"`; see [31]).

The results are plotted in Figure 1. The left-hand column plots the conclusions reached by frequentist testing (i.e., the CET sequential testing scheme). For all calculations, the equivalence bound, $\Delta$, and the significance threshold, $\alpha$, were held fixed. The right-hand column plots the conclusions reached based on the Bayes Factor with a threshold of 3.

Each conclusion corresponds to a different colour in the plot: green indicates a positive finding (evidence in favour of the full model); red indicates a negative finding (evidence in favour of the null model); and yellow indicates an inconclusive finding (insufficient evidence to support either model). Note that we have also included a third colour, light-green. For the frequentist testing scheme, light-green indicates a scenario where both the NHST $p$-value and the non-inferiority test $p$-value are less than $\alpha$. The tests reveal that the observed effect size is both statistically significant (i.e., we reject $H_0: P^2 = 0$) and statistically smaller than the effect size of interest (i.e., we also reject $H_0: P^2 \geq \Delta$). In these situations, one could conclude that, while $P^2$ is significantly greater than zero, it is likely to be practically insignificant (i.e., a real effect of a negligible magnitude).

Three observations merit comment:

(1) For testing with Bayes Factors, there will always exist a combination of values of $R^2$ and $N$ that corresponds to an inconclusive result. This is not the case for frequentist testing: the probability of obtaining an inconclusive finding will decrease with increasing $N$, and at a certain point, will be zero. For a given $K$, once $N$ is sufficiently large, it is impossible to obtain an inconclusive finding regardless of the observed $R^2$.

(2) For $K = 1$ covariate, with small $N$, it is practically impossible to obtain a negative conclusion with the Bayesian approach, and it is only possible with the frequentist approach (for the chosen equivalence bound, $\Delta$) if the $R^2$ is very small.

(3) With multiple covariates and small $N$, the frequentist testing scheme obtains a negative conclusion in some situations when $R^2 > \Delta$. This may seem rather odd but can be explained by the fact that $R^2$ is “seriously biased upward in small samples” [12].

### 2.2 Simulation study

We conducted a simple simulation study in order to better understand the operating characteristics of the non-inferiority test and to confirm that the test has correct Type 1 error rates. We simulated data for each of eighteen scenarios, one for each combination of the following parameters:

• one of three sample sizes, $N$;

• one of two designs, each with a different number, $K$, of binary covariates (with an orthogonal, balanced design) and its own set of regression coefficients, $\beta$; and

• one of three variances, $\sigma^2$.

Depending on the particular design and variance, the true coefficient of determination for these data is either $P^2 = 0.032$, $0.062$, or $0.076$. Parameters for the simulation study were chosen so that we would consider a wide range of values for the sample size and so as to obtain three unique values for $P^2$ approximately evenly spaced between 0 and 0.10.

For each configuration, we simulated 10,000 unique datasets and calculated a non-inferiority $p$-value with each of 19 different values of $\Delta$ (ranging from 0.01 to 0.10). We then calculated the proportion of these $p$-values less than $\alpha = 0.05$. Figure 2 plots the results with a restricted y-axis to better show the Type 1 error rates. In the Appendix, Figure 3 plots the results against the unrestricted y-axis.

We see that when the equivalence bound equals the true effect size (i.e., 0.032, 0.062, or 0.076), the Type 1 error rate is exactly 0.05, as it should be, for all $N$. This situation represents the boundary of the null hypothesis, i.e., $P^2 = \Delta$. As the equivalence bound increases beyond the true effect size (i.e., $\Delta > P^2$), the alternative hypothesis is then true and it becomes possible to correctly conclude equivalence. The power of the test increases with $\Delta$ and $N$, as one would expect.
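The boundary behaviour described above can be checked with a small simulation. The sketch below is not the paper's simulation code: the sample size, coefficients, margin, and number of replications are illustrative assumptions, chosen so that the fixed, balanced, orthogonal design gives $P^2 = \Delta$ exactly.

```r
set.seed(1)

# Illustrative settings (not the paper's).
N <- 100; K <- 2; Delta <- 0.05; alpha <- 0.05; n_sim <- 2000

# Fixed, balanced, orthogonal binary covariates.
x1 <- rep(c(0, 1), each = N / 2)
x2 <- rep(c(0, 1), times = N / 2)
beta <- c(0.2, 0.2)

# Each covariate has variance 0.25, so the explained variance is
# 0.25 * sum(beta^2); choose sigma2 so that P^2 equals Delta.
explained <- 0.25 * sum(beta^2)
sigma2 <- explained * (1 - Delta) / Delta

pvals <- replicate(n_sim, {
  y <- beta[1] * x1 + beta[2] * x2 + rnorm(N, sd = sqrt(sigma2))
  Fstat <- unname(summary(lm(y ~ x1 + x2))$fstatistic[1])
  pf(Fstat, df1 = K, df2 = N - K - 1,
     ncp = (N * Delta) / (1 - Delta), lower.tail = TRUE)   # equation (5)
})

# Proportion of rejections at the boundary of the null hypothesis.
type1 <- mean(pvals < alpha)
print(type1)
```

With the true $P^2$ sitting on the boundary of the null hypothesis, the rejection rate should be close to $\alpha$, up to Monte Carlo error.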

## 3 Application: Evidence for the absence of a Hawthorne effect

McCambridge et al. (2019) [30] tested the hypothesis that participants who know that the behavioral focus of a study is alcohol related will modify their consumption of alcohol while under study. The phenomenon of subjects modifying their behaviour simply because they are being observed is commonly known as the Hawthorne effect [42].

The researchers conducted a three-arm individually randomized trial online among students in four New Zealand universities. The three groups were: group A (control), who were told they were completing a lifestyle survey; group B, who were told the focus of the survey was alcohol consumption; and group C, who additionally answered 20 questions on their alcohol use and its consequences before answering the same lifestyle questions as Groups A and B. The prespecified primary outcome was a subject’s self-reported volume of alcohol consumption in the previous 4 weeks (units = number of standard drinks). This measure was recorded at baseline and after one month at follow-up.

The data were analyzed by McCambridge et al. (2019) [30] using a linear regression model with repeated measures fit by generalized estimating equations (GEE) and an “independence” correlation structure. For a NHST of the overall experimental group effect, the researchers obtained a $p$-value of 0.66. Based on this result, McCambridge et al. (2019) conclude that “the groups were not found to change differently over time” [30].

We note that this linear regression model fit by GEE is just one of many potential models one could use to analyze these data; see Yang and Tsiatis (2001) [44]. Three (among many) other reasonable alternative approaches include (1) a linear model using only the follow-up responses (without adjustment for the baseline measurement); (2) a linear model using the follow-up responses as outcome with a covariate adjustment for the baseline measurement; and (3) a linear model using the difference between follow-up and baseline responses as outcome. These three approaches yield $p$-values of 0.45, 0.56, and 0.61, respectively. None of these $p$-values suggest rejecting the null. Instead each model leads one to conclude that there is insufficient evidence to reject the null. In order to show evidence “in favour of the null,” we turn to our proposed non-inferiority test.

We fit the data ($N = 4580$) with a linear regression model using the difference between follow-up and baseline responses as the outcome, and the group membership as a categorical covariate ($K = 2$). We then consider the non-inferiority test for the coefficient of determination parameter (see Section 2), with $\Delta = 0.01$. This test asks the following question: does the overall experimental group effect account for less than 1% of the variability in the outcome?

The choice of $\Delta = 0.01$ represents our belief that any Hawthorne effect explaining less than 1% of the variability in the data would be considered negligible. For reference, Cohen (1988) describes a small $R^2$ as “a modest enough amount, just barely escaping triviality” [10]; and more recently, Fritz et al. (2012) consider associations explaining “1% of the variability” as “trivial” [16]. It is up to researchers to provide a justification of the equivalence bound before they collect the data.

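We obtain $R^2 = 0.000216$ and can calculate the $F$-statistic with equation (4):

$$
\begin{aligned}
F &= \frac{R^2/K}{(1-R^2)/(N-K-1)} && (9) \\
  &= \frac{0.000216/2}{(1-0.000216)/(4580-2-1)} && (10) \\
  &= \frac{0.000108}{0.000218} && (11) \\
  &= 0.49 && (12)
\end{aligned}
$$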

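To obtain a $p$-value for the non-inferiority test, we use equation (5):

$$
\begin{aligned}
p\text{-value} &= p_f\left(F;\, K,\, N-K-1,\, \frac{N\Delta}{1-\Delta}\right) && (13) \\
  &= p_f\left(0.49;\, 2,\, 4580-2-1,\, \frac{4580 \cdot 0.01}{1-0.01}\right) && (14) \\
  &= 1.13 \times 10^{-9} && (15)
\end{aligned}
$$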

This result, $p$-value $= 1.13 \times 10^{-9}$, suggests that we can confidently reject the null hypothesis that $P^2 \geq 0.01$. We therefore conclude that the data are most compatible with no important effect. For comparison, the Bayesian testing scheme we considered in Section 2.1 also yields a Bayes Factor, $BF_{10}$, for these data. The R-code for these calculations is presented in the Appendix.
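The arithmetic in equations (9) to (15) can be retraced from the reported summary values alone, without access to the trial data. This sketch uses the rounded $R^2 = 0.000216$ from equation (10), so the result may differ slightly in the last digits from a calculation based on the raw data.

```r
# Summary values as reported in equations (10) and (14).
R2 <- 0.000216
N <- 4580
K <- 2
Delta <- 0.01

# F-statistic, equation (4).
Fstat <- (R2 / K) / ((1 - R2) / (N - K - 1))

# Non-inferiority p-value, equation (5).
pval <- pf(Fstat, df1 = K, df2 = N - K - 1,
           ncp = (N * Delta) / (1 - Delta), lower.tail = TRUE)

print(c(F = Fstat, p_noninf = pval))
```

The very small $p$-value reflects the large sample: with $N = 4580$ the non-centrality parameter under the boundary $P^2 = 0.01$ is roughly 46, so an observed $F$ below 1 would be highly unusual if the group effect explained 1% or more of the variance.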

## 4 A non-inferiority test for the ANOVA η^2 parameter

Despite being entirely equivalent to linear regression [18], the fixed effects (or “between subjects”) analysis of variance (ANOVA) continues to be the most common statistical procedure to test the equality of multiple independent population means in many fields [36]. The non-inferiority test considered earlier in the linear regression context will now be described in an ANOVA context for evaluating the equivalence of multiple independent groups. Note that all tests developed and discussed in this paper are only for between-subject ANOVA designs and cannot be applied to within-subject designs.

Equivalence/non-inferiority tests for comparing group means in an ANOVA have been proposed before. For example, Rusticus and Lovato (2011) [41] list several examples of studies that used ANOVA to compare multiple groups in which non-significant findings are incorrectly used to conclude that groups are comparable. The authors emphasize the problem (“a statistically non-significant finding only indicates that there is not enough evidence to support that two (or more) groups are statistically different” [41]) and offer an equivalence testing solution based on CIs. Unfortunately, a confidence interval approach to equivalence testing does not allow for the calculation of $p$-values. Instead, conclusions of equivalence are based only on CIs which the authors warn may be “too wide” [41].

In another proposal, Wellek (2003) [43] considered simultaneous equivalence testing for several parameters to test group means. However, this strategy may not necessarily be more efficient than the rather inefficient strategy of multiple pairwise comparisons; see the conclusions of Pallmann et al. (2017) [35].

Koh and Cribbie (2013) [25] (see also Cribbie et al. (2009) [13]) consider two different omnibus tests. These are presented as non-inferiority tests for $\psi^2$, a parameter closely related to the population signal-to-noise parameter, $f^2$ (the two are related through the total sample size, $N$). Unfortunately, the use of these tests is limited by the fact that the population parameters $\psi^2$ and $f^2$ are not commonly used in analyses since their units of measurement are rather arbitrary.

In this section, we consider a non-inferiority test for the population effect-size parameter, $\eta^2$, a standardized effect size that is commonly used in the social sciences [24]. The parameter $\eta^2$ represents the proportion of total variance in the population that can be accounted for by knowing the group level. The use of commonly used standardized effect sizes is recommended in order to facilitate future meta-analysis and the interpretation of results [26]. Note that $\eta^2$ is analogous to the $P^2$ parameter considered earlier in the linear regression context in Section 2. Also note that the non-inferiority test we propose is entirely equivalent to the test for $\psi^2$ proposed by Koh and Cribbie (2013) [25]. It is simply a re-formulation of the test in terms of the $\eta^2$ parameter.

Before going forward, let us define some basic notation. All technical details are presented in the Appendix. Let $Y$ represent the continuous (normally distributed) outcome variable, and $X$ represent a fixed categorical variable (i.e., group membership). Let $N$ be the total number of observations in the observed data, $J$ be the number of groups (i.e., factor levels in $X$), and $n_j$ be the number of observations in the $j$th group, for $j$ in 1,…,$J$. We will consider two separate cases, one in which the variance within each group is equal, and one in which variance is heterogeneous.

Typically, one will conduct a standard $F$-test to determine whether one can reject the null hypothesis that $\eta^2$ is equal to zero ($H_0: \eta^2 = 0$). The $p$-value is calculated as:

$$p\text{-value} = 1 - p_f(F;\, J-1,\, N-J,\, 0), \tag{16}$$

where, as in Section 2, $p_f(\cdot\,;\, df_1, df_2, ncp)$ is the cdf of the non-central $F$-distribution with $df_1$ and $df_2$ degrees of freedom, and non-centrality parameter, $ncp$; and where:

$$F = \frac{\sum_{j=1}^{J} n_j(\bar{y}_j - \bar{y})^2/(J-1)}{\sum_{j=1}^{J}\sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2/(N-J)}. \tag{17}$$

One can calculate the above $p$-value using R with the following code: `pval = pf(Fstat,df1=J-1,df2=N-J,lower.tail=FALSE)`.

A non-inferiority test for $\eta^2$ asks a different question: can we reject the hypothesis that the total amount of variance in $Y$ attributable to group membership is greater than or equal to $\Delta$? Formally, the hypotheses for the non-inferiority test are written as:

$H_0: 1 > \eta^2 \geq \Delta$,
$H_1: 0 \leq \eta^2 < \Delta$.

If we reject $H_0$, we reject the hypothesis that there are meaningful differences between the group means ($\mu_j$, for $j = 1, \ldots, J$), in favour of the hypothesis that the group means are considered practically equivalent. The $p$-value for this test is obtained by inverting the one-sided CI for $\eta^2$ (see Appendix for details) and can be calculated as:

$$p\text{-value} = p_f\left(F;\, J-1,\, N-J,\, \frac{N\Delta}{1-\Delta}\right). \tag{18}$$

Note that one can calculate the above $p$-value using R with the following code: `pval = pf(Fstat,df1=J-1,df2=N-J,ncp=N*Delta/(1-Delta),lower.tail=TRUE)`.
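Starting from raw data rather than a precomputed `Fstat`, the two ANOVA $p$-values can be obtained as sketched below. The data are simulated purely for illustration (three groups with equal population means), and the margin `Delta = 0.10` is an arbitrary choice; `oneway.test()` with `var.equal = TRUE` is used here to obtain the classical $F$-statistic of equation (17).

```r
set.seed(42)

# Simulated example data (illustrative only): J = 3 groups of 50,
# all with the same population mean and variance.
x <- factor(rep(c("A", "B", "C"), each = 50))
y <- rnorm(length(x), mean = 0, sd = 1)

N <- length(y)        # total number of observations
J <- nlevels(x)       # number of groups
Delta <- 0.10         # non-inferiority margin (arbitrary choice)

# Classical one-way ANOVA F-statistic, equation (17).
Fstat <- unname(oneway.test(y ~ x, var.equal = TRUE)$statistic)

# NHST p-value, equation (16), and non-inferiority p-value, equation (18).
p_nhst <- pf(Fstat, df1 = J - 1, df2 = N - J, lower.tail = FALSE)
p_noninf <- pf(Fstat, df1 = J - 1, df2 = N - J,
               ncp = N * Delta / (1 - Delta), lower.tail = TRUE)

print(c(F = Fstat, p_nhst = p_nhst, p_noninf = p_noninf))
```

The same `Fstat` can also be read from `anova(lm(y ~ x))`, which makes the equivalence with the linear regression formulation explicit.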

The non-inferiority test for $\eta^2$ makes the following three important assumptions about the data:

• The outcome data are independent and normally distributed.

• The proportions of observations in each group (i.e., $n_j/N$, for $j = 1, \ldots, J$) in the observed data are equal to the proportions in the total population of interest.

• The variance within each group is equal (homogeneous variance).

### 4.1 A non-inferiority test for ANOVA with heterogeneous variance

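With regards to the third assumption above, we can modify the above non-inferiority test in order to allow for the possibility that the variance is unequal across groups (heterogeneous variance). Recall that a Welch $F$-test statistic is calculated as (see Appendix for details; see also [14]):

$$F' = \frac{\sum_{j=1}^{J} \hat{w}_j(\bar{y}_j - \bar{y}')^2/(J-1)}{1 + \frac{2(J-2)}{J^2-1}\sum_{j=1}^{J}(n_j-1)^{-1}\left(1 - \hat{w}_j/\hat{W}\right)^2}, \tag{19}$$

where $\hat{w}_j = n_j/s_j^2$, with $s_j^2 = \sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2/(n_j - 1)$, for $j = 1, \ldots, J$; and where $\hat{W} = \sum_{j=1}^{J}\hat{w}_j$, and $\bar{y}' = \sum_{j=1}^{J}\hat{w}_j\bar{y}_j/\hat{W}$, with $\bar{y}_j$ the sample mean of group $j$, for $j = 1, \ldots, J$.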

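Then, the $p$-value for a non-inferiority test ($H_0: 1 > \eta^2 \geq \Delta$) in the case of heterogeneous variance is:

$$p\text{-value} = p_f\left(F';\, J-1,\, df',\, \frac{N\Delta}{1-\Delta}\right), \tag{20}$$

where:

$$df' = \frac{J^2-1}{3\sum_{j=1}^{J}(n_j-1)^{-1}\left(1 - \hat{w}_j/\hat{W}\right)^2}. \tag{21}$$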

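The above $p$-value can be calculated using R with the following code:

```r
# Welch's one-way test (unequal variances); y is the outcome, x the grouping factor.
# J (number of groups), N (total sample size) and Delta (margin) must be defined beforehand.
aov1 <- oneway.test(y ~ x, var.equal = FALSE)
Fprime <- aov1$statistic        # Welch F' statistic, equation (19)
dfprime <- aov1$parameter[2]    # denominator degrees of freedom, equation (21)
pval <- pf(Fprime, df1 = J - 1, df2 = dfprime, ncp = (Delta * N) / (1 - Delta), lower.tail = TRUE)
```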



For the heterogeneous case, the population effect size parameter, $\eta^2$, is defined slightly differently than for the homogeneous case (see Appendix for details). Based on the simulation studies of Koh and Cribbie (2013) [25], we recommend the non-inferiority test based on Welch’s $F'$ statistic (i.e., the test with $p$-value calculated from equation (20)) as almost always preferable (with regards to statistical power and Type 1 error rate) to the test which requires an assumption of homogeneous variance (i.e., the test with $p$-value calculated from equation (18)).
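To make the heterogeneous-variance test concrete, the sketch below computes $F'$ and $df'$ by hand from equations (19) and (21) and compares them with the values returned by `oneway.test()`. The data are simulated for illustration only (three groups with unequal sizes and unequal variances), and `Delta = 0.10` is an arbitrary margin.

```r
set.seed(7)

# Simulated example data (illustrative only): unequal group sizes and variances.
n_j <- c(30, 40, 50)
x <- factor(rep(c("A", "B", "C"), times = n_j))
y <- rnorm(sum(n_j), mean = 0, sd = rep(c(1, 2, 3), times = n_j))

N <- length(y); J <- nlevels(x); Delta <- 0.10

# Welch statistic and denominator degrees of freedom from base R.
welch <- oneway.test(y ~ x, var.equal = FALSE)
Fprime <- unname(welch$statistic)
dfprime <- unname(welch$parameter[2])

# The same quantities computed by hand.
ybar <- tapply(y, x, mean)              # group means
s2 <- tapply(y, x, var)                 # group variances
n <- tapply(y, x, length)               # group sizes
w <- n / s2                             # weights w_j
W <- sum(w)
ybar_w <- sum(w * ybar) / W             # variance-weighted grand mean
tmp <- sum((1 - w / W)^2 / (n - 1))
F_manual <- (sum(w * (ybar - ybar_w)^2) / (J - 1)) /
  (1 + 2 * (J - 2) / (J^2 - 1) * tmp)   # equation (19)
df_manual <- (J^2 - 1) / (3 * tmp)      # equation (21)

# Non-inferiority p-value, equation (20).
pval <- pf(Fprime, df1 = J - 1, df2 = dfprime,
           ncp = (Delta * N) / (1 - Delta), lower.tail = TRUE)
print(c(F_prime = Fprime, df_prime = dfprime, p_noninf = pval))
```

Checking the hand calculation against `oneway.test()` is a quick way to confirm that the weights and the squared term in the denominator have been set up as intended.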

## 5 Conclusion

In this paper we presented a statistical method for non-inferiority testing of standardized omnibus effects commonly used in linear regression and ANOVA. We also considered how frequentist non-inferiority testing, and equivalence testing more generally, offer an attractive alternative to Bayesian methods for “testing the null.” We recommend that all researchers specify an appropriate non-inferiority margin and plan to use the proposed non-inferiority tests in the event that a standard NHST fails to reject the null. Alternatively, when the sample size is very large, the non-inferiority test can be useful for detecting effects that are significant but not meaningful.

Note that our current non-inferiority test for $P^2$ in a standard multivariable linear regression is limited to comparing the “full model” to the “null model.” As such, the test is not suitable for comparing two nested models. For example, we cannot use the test to compare a “smaller model” with only the baseline measure as a covariate, with a “larger model” that includes both baseline measure and group membership as covariates.

Equivalence testing for comparing two nested models will be addressed in future work in which we will consider a non-inferiority test for the increase in $P^2$ between a smaller model and a larger model. Related work includes that of Algina et al. (2007) [2] and Algina et al. (2008) [1]. We also wish to further investigate non-inferiority testing for ANOVA with within-subject designs, following the work of Rose et al. (2018) [39].

The equivalence test we propose requires researchers to specify equivalence bounds in standardized effect sizes. Standardized effect sizes have strengths and weaknesses, and some researchers have argued in favor of the use of unstandardized effect sizes [5]. Although we proposed equivalence tests in terms of standardized effect sizes, we largely agree with their limitations. Nevertheless, researchers might find it more intuitive to specify equivalence bounds in standardized effect sizes, at least in certain research lines.

There is a great risk of bias in the scientific literature if researchers only rely on statistical tools that can reject null hypotheses, but do not have access to statistical tools that allow them to reject the presence of meaningful effects. Amrhein et al. (2019) express great concern with the practice of statistically non-significant results being “interpreted as indicating ‘no difference’ or ‘no effect’” [4]. Equivalence tests provide one approach to improve current research practices by allowing researchers to falsify their predictions concerning the presence of an effect. Thinking about what would falsify your prediction is a crucial step when designing a study, and specifying a smallest effect size of interest and performing an equivalence test provides one way to answer that question.

#### Available Code

All the code used in this paper and relevant materials are made available in an OSF repository (https://osf.io/3q2vh/), DOI 10.17605/OSF.IO/3Q2VH. Please do not hesitate to contact the authors if you have any questions or comments.

#### Acknowledgements

Thank you to Prof. Paul Gustafson for the helpful advice with preliminary drafts. Thank you to Prof. John Petkau for the generous help with editing.

## 6 Appendix

### 6.1 Linear Regression: further details and R-code.

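The $R^2$ statistic estimates the parameter $P^2$ from the observed data:

$$R^2 = 1 - \frac{SS_{RES}}{SS_{TOT}}, \tag{22}$$

where $SS_{RES} = \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$, and $SS_{TOT} = \sum_{i=1}^{N}(y_i - \bar{y})^2$; with $\hat{y}_i = X_{i,\cdot}^{T}\hat{\beta}$, and $\bar{y} = \sum_{i=1}^{N} y_i/N$.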
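As a small check of equation (22), $R^2$ can be computed from the two sums of squares and compared with the value reported by `summary()`. The data below are simulated for illustration only and the variable names are arbitrary.

```r
set.seed(3)

# Simulated example data (illustrative only).
N <- 60; K <- 2
x1 <- rnorm(N); x2 <- rnorm(N)
y <- 1 + 0.5 * x1 + rnorm(N)

fit <- lm(y ~ x1 + x2)

# Equation (22): R^2 from the residual and total sums of squares.
SS_res <- sum((fitted(fit) - y)^2)
SS_tot <- sum((y - mean(y))^2)
R2 <- 1 - SS_res / SS_tot

# Equation (4): the F-statistic implied by R^2.
Fstat <- (R2 / K) / ((1 - R2) / (N - K - 1))

print(c(R2 = R2, F = Fstat))
```

Both quantities agree with `summary(fit)$r.squared` and `summary(fit)$fstatistic[1]`, which is why the tests in Section 2 can be carried out from $R^2$, $N$, and $K$ alone.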

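The R-code for analysis of the McCambridge et al. (2019) [30] data is:

```r
library(BayesFactor)  # provides linearReg.R2stat()

Xmatrix <- model.matrix(totaldrinking.diff ~ group, data = side_data)
lmmodel <- lm(totaldrinking.diff ~ group, data = side_data)

R2 <- summary(lmmodel)$r.squared
Fstat <- summary(lmmodel)$fstatistic[1]
K <- dim(Xmatrix)[2] - 1   # number of covariates (excluding the intercept)
N <- dim(Xmatrix)[1]       # number of observations
Delta <- 0.01

# Non-inferiority p-value, equation (5)
pf(Fstat, df1 = K, df2 = N - K - 1, ncp = (N * Delta) / (1 - Delta), lower.tail = TRUE)

# Bayes Factor (BF10) for comparison
linearReg.R2stat(N = N, p = K, R2 = R2, simple = TRUE)
```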



The code below replicates the results published in McCambridge et al. (2019), Table 2. Note that there appears to be a typo in the published table whereby the -values 0.89 and 0.86 are switched.


library(geepack)  # provides geeglm()

Hdata$group <- relevel(Hdata$group, "A")

mod0 <- geeglm(totaldrinking ~ group + t,
               id = participant_ID, corstr = "independence", data = Hdata, x = TRUE)
mod1 <- geeglm(totaldrinking ~ group * t + group + t,
               id = participant_ID, corstr = "independence", data = Hdata)
anova(mod1, mod0)
summary(mod1)$coefficients

Hdata$group <- relevel(Hdata$group, "C")
mod1a <- geeglm(totaldrinking ~ group * t + group + t,
                id = participant_ID, corstr = "independence", data = Hdata)
summary(mod1a)



### 6.2 ANOVA with homogeneous variance: further details.

The true population group mean for group $j$ is denoted $\mu_j$, for $j$ in 1,…,$J$; and we denote the group effects as $\mu_j - \mu$, where $\mu$ is the overall weighted population mean. These parameters are estimated from the observed data by the corresponding sample group means: $\bar{y}_j = \sum_{i=1}^{n_j} y_{ij} / n_j$, for $j$ in 1,…,$J$; and the overall sample mean: $\bar{y} = \sum_{j=1}^{J}\sum_{i=1}^{n_j} y_{ij} / N$.

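We operate under the assumption that the data are normally distributed such that:

$$Y_{i,j} \sim \mathrm{Normal}(\mu_j, \sigma^2_w), \quad \forall j = 1,\ldots,J, \quad \forall i = 1,\ldots,n_j, \tag{23}$$

where $\sigma^2_w$ denotes the variance within groups. We also define the variance between groups as $\sigma^2_b = \sum_{j=1}^{J} n_j(\mu_j - \mu)^2 / N$. Finally, the total population variance is defined as $\sigma^2_t = \sigma^2_b + \sigma^2_w$. The corresponding sums of squares are calculated from the data: $SS_b = \sum_{j=1}^{J} n_j(\bar{y}_j - \bar{y})^2$; $SS_w = \sum_{j=1}^{J}\sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2$; and $SS_t = SS_b + SS_w$.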

Recall that the ANOVA F-test statistic is calculated as:

$$F = \frac{SS_b/df_b}{SS_w/df_w} = \frac{\sum_{j=1}^{J} n_j(\bar{y}_j - \bar{y})^2/(J-1)}{\sum_{j=1}^{J}\sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2/(N-J)}, \tag{24}$$

where $df_b = J - 1$, and $df_w = N - J$. Under the null hypothesis of equal group means, the statistic follows an F distribution with $df_b$ degrees of freedom for the numerator, and $df_w$ degrees of freedom for the denominator.
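The ANOVA F statistic can be reproduced directly from the between-group and within-group sums of squares. The sketch below does so for simulated, unbalanced three-group data (hypothetical values chosen only for illustration) and compares the result with R's built-in ANOVA table.

```r
set.seed(2)
# Hypothetical unbalanced three-group data
n_j   <- c(20, 25, 30)
group <- factor(rep(c("A", "B", "C"), times = n_j))
y     <- rnorm(sum(n_j), mean = rep(c(0, 0.2, 0.4), times = n_j), sd = 1)

J <- nlevels(group)
N <- length(y)
ybar_j <- as.numeric(tapply(y, group, mean))  # sample group means
ybar   <- mean(y)                             # overall sample mean

SS_b <- sum(n_j * (ybar_j - ybar)^2)   # between-group sum of squares
SS_w <- sum((y - ave(y, group))^2)     # within-group sum of squares
df_b <- J - 1
df_w <- N - J
F_manual <- (SS_b / df_b) / (SS_w / df_w)

# Compare with the built-in one-way ANOVA table
F_builtin <- anova(lm(y ~ group))[["F value"]][1]
all.equal(F_manual, F_builtin)
```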

The population effect size, $\eta^2$, is a parameter that represents the proportion of variance in the outcome variable, $Y$, that is explained by group membership (i.e., by knowing the level of the factor), and is defined as:

$$\eta^2 = \frac{\sigma^2_b}{\sigma^2_t} = \frac{\sigma^2_b}{\sigma^2_b + \sigma^2_w} = 1 - \frac{\sigma^2_w}{\sigma^2_t} \tag{25}$$

We can estimate the population parameter $\eta^2$ from the observed data using the sample statistic, $\hat{\eta}^2$, as follows: $\hat{\eta}^2 = SS_b / SS_t$. It is well known that $\hat{\eta}^2$ is a biased estimate for $\eta^2$. However, alternative estimates (including $\hat{\omega}^2$ and $\hat{\epsilon}^2$) are also biased; see Okada (2013) [34] for more details (note that there is a typo in eq. 5 of [34]).
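The sample estimate of $\eta^2$, taken as the ratio of the between-group sum of squares to the total sum of squares, can also be recovered from the F statistic and its degrees of freedom. The sketch below illustrates both routes on simulated data (hypothetical values, for illustration only).

```r
set.seed(3)
# Hypothetical balanced four-group data
n_j   <- c(15, 15, 15, 15)
group <- factor(rep(seq_along(n_j), times = n_j))
y     <- rnorm(sum(n_j), mean = rep(c(0, 0, 0.5, 0.5), times = n_j))

tab   <- anova(lm(y ~ group))
SS_b  <- tab[["Sum Sq"]][1]
SS_w  <- tab[["Sum Sq"]][2]
df_b  <- tab[["Df"]][1]
df_w  <- tab[["Df"]][2]
Fstat <- tab[["F value"]][1]

# Estimate from the sums of squares
eta2_hat <- SS_b / (SS_b + SS_w)

# Equivalent expression in terms of F and its degrees of freedom
eta2_from_F <- (Fstat * df_b) / (Fstat * df_b + df_w)
all.equal(eta2_hat, eta2_from_F)

# In a one-way layout this coincides with R^2 from the regression on group
all.equal(eta2_hat, summary(lm(y ~ group))$r.squared)
```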

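The population effect size parameter $\eta^2$ is closely related to the signal-to-noise ratio parameter, $\mathrm{s2n} = \sigma^2_b / \sigma^2_w$, and to the non-centrality parameter, $\Lambda = N \cdot \mathrm{s2n}$. Consider the following equality:

$$\eta^2 = \frac{\mathrm{s2n}}{1 + \mathrm{s2n}} = \frac{\Lambda}{\Lambda + N}. \tag{26}$$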

The non-centrality parameter, $\Lambda$, can be estimated from the data, and we can easily calculate a one-sided confidence interval (CI), $[0, \Lambda_U]$, by “pivoting” the cumulative distribution function (cdf); see [24] Section 2.2 and references therein. This requires solving (numerically) the following equation for $\Lambda_U$:

$$pf(F;\ df_1 = df_b,\ df_2 = df_w,\ ncp = \Lambda_U) = \alpha, \tag{27}$$

where $pf(\cdot)$ is the cdf of the non-central F-distribution with $df_b$ and $df_w$ degrees of freedom, and non-centrality parameter, $\Lambda_U$. The values for $F$, $df_b$, and $df_w$ are calculated from the data as defined above. The solution, $\Lambda_U$, will be the upper confidence bound of $\Lambda$, such that: $\Pr(\Lambda \leq \Lambda_U) = 1 - \alpha$.

As detailed in Kelley (2007) [24] (note that there is a typo in eq. 55 of [24]), one can convert the bounds of the CI for $\Lambda$ into bounds for a CI for $\eta^2$. The upper limit of a one-sided CI for $\eta^2$ is: $\eta^2_U = \Lambda_U / (\Lambda_U + N)$. As such, we have that $\Pr(\eta^2 \leq \eta^2_U) = 1 - \alpha$.
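The pivoting step can be sketched numerically with `uniroot()`. The function below is a hypothetical helper (its name, interface, and the bracketing strategy are assumptions of this sketch, not code from the paper): it solves the pivoting equation for the upper bound of the non-centrality parameter and converts it to an upper bound for $\eta^2$. The inputs in the usage lines are made-up values.

```r
# Hypothetical helper: upper bound of a one-sided (1 - alpha) CI for eta^2
eta2_upper <- function(Fstat, df_b, df_w, N, alpha = 0.05) {
  # The non-central F cdf at Fstat decreases as the ncp increases
  g <- function(ncp) pf(Fstat, df1 = df_b, df2 = df_w, ncp = ncp) - alpha
  # If even ncp = 0 puts Fstat below the alpha quantile, the bound is zero
  if (g(0) <= 0) return(c(Lambda_U = 0, eta2_U = 0))
  # Expand the search interval until it brackets the root
  upper <- 10
  while (g(upper) > 0) upper <- upper * 2
  Lambda_U <- uniroot(g, lower = 0, upper = upper, tol = 1e-10)$root
  c(Lambda_U = Lambda_U, eta2_U = Lambda_U / (Lambda_U + N))
}

# Illustrative inputs only (not values from the paper)
res <- eta2_upper(Fstat = 1.2, df_b = 2, df_w = 97, N = 100)
res

# Non-inferiority p-value evaluated at the margin Delta = eta2_U
Delta <- unname(res["eta2_U"])
pf(1.2, df1 = 2, df2 = 97, ncp = 100 * Delta / (1 - Delta))
```

By construction, the last line returns (approximately) $\alpha$: setting the margin equal to the upper confidence bound gives a non-centrality parameter of exactly $\Lambda_U$, which is the value at which the cdf equals $\alpha$.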

### 6.3 ANOVA with heterogeneous variance: further details.

As above, the true population group mean for group $j$ is denoted $\mu_j$, for $j$ in 1,…,$J$. We now define:

$$Y_{i,j} \sim \mathrm{Normal}(\mu_j, \sigma^2_{w,j}), \quad \forall j = 1,\ldots,J, \quad \forall i = 1,\ldots,n_j, \tag{28}$$

and define , and , and finally .

Recall that a Welch F-test statistic is calculated as:

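$$F' = \frac{\sum_{j=1}^{J} \hat{w}_j(\bar{y}_j - \bar{y}')^2/(J-1)}{1 + \frac{2(J-2)}{J^2-1}\sum_{j=1}^{J}\left((n_j-1)^{-1}\right)\left(1 - \hat{w}_j/\hat{W}\right)^2}, \tag{29}$$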

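where $\hat{w}_j = n_j / s^2_j$, with $s^2_j = \sum_{i=1}^{n_j}(y_{ij} - \bar{y}_j)^2 / (n_j - 1)$, for $j$ in 1,…,$J$; and where $\hat{W} = \sum_{j=1}^{J} \hat{w}_j$, and $\bar{y}' = \sum_{j=1}^{J} \hat{w}_j \bar{y}_j / \hat{W}$.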

Levy (1978) [28] proposed an approximate non-null distribution for the $F'$ statistic such that $F'$ follows a non-central $F$-distribution with $J - 1$ and $df'$ degrees of freedom, and non-centrality parameter, $\Lambda'$; see also [22]. The degrees of freedom for this case are defined as: $J - 1$ for the numerator, and, for the denominator:

$$df' = \frac{J^2 - 1}{3\sum_{j=1}^{J}\left((n_j-1)^{-1}\right)\left(1 - \hat{w}_j/\hat{W}\right)^2} \tag{30}$$
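The Welch statistic and its denominator degrees of freedom can be computed by hand and checked against `oneway.test()` from base R, which implements Welch's test when `var.equal = FALSE`. The sketch below uses the standard Welch weights (group size divided by sample variance) and simulated heteroscedastic data; all values and variable names are hypothetical.

```r
set.seed(4)
# Hypothetical three-group data with unequal variances
n_j   <- c(12, 20, 30)
sd_j  <- c(1, 2, 3)
group <- factor(rep(c("A", "B", "C"), times = n_j))
y     <- rnorm(sum(n_j), mean = rep(c(0, 0.5, 1), times = n_j),
               sd = rep(sd_j, times = n_j))

J      <- nlevels(group)
ybar_j <- as.numeric(tapply(y, group, mean))  # sample group means
s2_j   <- as.numeric(tapply(y, group, var))   # sample group variances
w_j    <- n_j / s2_j                          # Welch weights
W      <- sum(w_j)
ybar_w <- sum(w_j * ybar_j) / W               # variance-weighted grand mean
h      <- sum((1 - w_j / W)^2 / (n_j - 1))

F_welch  <- (sum(w_j * (ybar_j - ybar_w)^2) / (J - 1)) /
            (1 + 2 * (J - 2) / (J^2 - 1) * h)
df_welch <- (J^2 - 1) / (3 * h)

# Compare with the built-in Welch test
ow <- oneway.test(y ~ group, var.equal = FALSE)
all.equal(unname(ow$statistic), F_welch)
all.equal(unname(ow$parameter[2]), df_welch)
```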

We will therefore define our population effect size parameter for the heterogeneous case as:

$$\eta^{2\prime} = \frac{\Lambda'}{\Lambda' + N}. \tag{31}$$

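Note that in the case of homogeneous variance (i.e., when $\sigma^2_{w,j} = \sigma^2_w$ for all $j$ in 1,…,$J$), we have $\Lambda' = \Lambda$ and $\eta^{2\prime} = \eta^2$. The $p$-value for the non-inferiority test ($H_0: \eta^{2\prime} \geq \Delta$) in the case of heterogeneous variance is: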

$$p\text{-value} = pf\left(F';\ J-1,\ df',\ ncp = \frac{N\Delta}{1-\Delta}\right). \tag{32}$$
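A minimal sketch of this p-value calculation is given below. The function name `noninf_welch` and its interface are hypothetical (they are not taken from the paper's repository); the Welch statistic and degrees of freedom are taken from base R's `oneway.test()`, and the data and margin are made up for illustration.

```r
# Hypothetical helper: non-inferiority p-value under heterogeneous variances
noninf_welch <- function(y, group, Delta) {
  stopifnot(Delta > 0, Delta < 1)
  group <- factor(group)
  N  <- length(y)
  ow <- oneway.test(y ~ group, var.equal = FALSE)  # Welch F-test
  Fw  <- unname(ow$statistic)
  df1 <- unname(ow$parameter[1])   # numerator df: J - 1
  df2 <- unname(ow$parameter[2])   # denominator df: Welch approximation
  # Lower tail of the non-central F distribution evaluated at the margin
  pval <- pf(Fw, df1 = df1, df2 = df2,
             ncp = N * Delta / (1 - Delta), lower.tail = TRUE)
  c(F = Fw, df1 = df1, df2 = df2, p.value = pval)
}

set.seed(5)
# Simulated data: equal group means, unequal variances (illustration only)
n_j   <- c(80, 100, 120)
group <- rep(c("A", "B", "C"), times = n_j)
y     <- rnorm(sum(n_j), mean = 0, sd = rep(c(1, 1.5, 2), times = n_j))

out <- noninf_welch(y, group, Delta = 0.10)
out
```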

## References

• [1] J. Algina, H.J. Keselman, and R.J. Penfield, Note on a confidence interval for the squared semipartial correlation coefficient, Educational and Psychological Measurement 68 (2008), pp. 734–741.
• [2] J. Algina, H. Keselman, and R.D. Penfield, Confidence intervals for an effect size measure in multiple linear regression, Educational and psychological measurement 67 (2007), pp. 207–218.
• [3] D.G. Altman and J.M. Bland, Statistics notes: Absence of evidence is not evidence of absence, The BMJ 311 (1995), p. 485.
• [4] V. Amrhein, S. Greenland, and B. McShane, Scientists rise up against statistical significance (2019).
• [5] T. Baguley, Standardized or simple effect size: What should be reported?, British journal of psychology 100 (2009), pp. 603–617.
• [6] A. Barten, Note on unbiased estimation of the squared multiple correlation coefficient, Statistica Neerlandica 16 (1962), pp. 151–164.
• [7] H. Campbell and P. Gustafson, Conditional equivalence testing: An alternative remedy for publication bias, PloS one 13 (2018), p. e0195145.
• [8] H. Campbell and P. Gustafson, What to make of non-inferiority and equivalence testing with a post-specified margin?, arXiv preprint arXiv:1807.03413 (2018).
• [9] N. Christou, The true R2 and the truth about R2 (2005).
• [10] J. Cohen, Statistical power analysis for the behavioral sciences, Routledge, 1988.
• [11] G. Consonni, P. Veronese, et al., Compatibility of prior specifications across linear models, Statistical Science 23 (2008), pp. 332–353.
• [12] J.S. Cramer, Mean and variance of R2 in small and moderate samples, Journal of Econometrics 35 (1987), pp. 253–266.
• [13] R.A. Cribbie, C.A. Arpin-Cribbie, and J.A. Gruman, Tests of equivalence for one-way independent groups designs, The Journal of Experimental Education 78 (2009), pp. 1–13.
• [14] M. Delacre, D. Lakens, Y. Mora, and C. Leys, Taking parametric assumptions seriously: Arguments for the use of Welch’s F-test instead of the classical F-test in one-way ANOVA (2018).
• [15] P. Dudgeon, Some improvements in confidence intervals for standardized regression coefficients, Psychometrika 82 (2017), pp. 928–951.
• [16] C.O. Fritz, P.E. Morris, and J.J. Richler, Effect size estimates: current use, calculations, and interpretation., Journal of experimental psychology: General 141 (2012), p. 2.
• [17] C. Gatsonis and A.R. Sampson, Multiple correlation: exact power and sample size calculations., Psychological Bulletin 106 (1989), p. 516.
• [18] A. Gelman, et al., Analysis of variance – why it is more important than ever, The Annals of Statistics 33 (2005), pp. 1–53.
• [19] J. Hartung, J.E. Cottrell, and J.P. Giffin, Absence of evidence is not evidence of absence, Anesthesiology: The Journal of the American Society of Anesthesiologists 58 (1983), pp. 298–299.
• [20] N. Hättenschwiler, S. Merks, Y. Sterchi, and A. Schwaninger, Traditional visual search versus x-ray image inspection in students and professionals: Are the same visual-cognitive abilities needed?, Frontiers in Psychology 10 (2019), p. 525.
• [21] C.M. Hurvich and C. Tsai, The impact of model selection on inference in linear regression, The American Statistician 44 (1990), pp. 214–217.
• [22] S.L. Jan and G. Shieh, Sample size determinations for Welch’s test in one-way heteroscedastic ANOVA, British Journal of Mathematical and Statistical Psychology 67 (2014), pp. 72–93.
• [23] H. Jeffreys, The theory of probability, OUP Oxford, 1961.
• [24] K. Kelley, et al., Confidence intervals for standardized effect sizes: Theory, application, and implementation, Journal of Statistical Software 20 (2007), pp. 1–24.
• [25] A. Koh and R. Cribbie, Robust tests of equivalence for k independent groups, British Journal of Mathematical and Statistical Psychology 66 (2013), pp. 426–434.
• [26] D. Lakens, Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs, Frontiers in Psychology 4 (2013), p. 863.
• [27] D. Lakens, A.M. Scheel, and P.M. Isager, Equivalence testing for psychological research: A tutorial, Advances in Methods and Practices in Psychological Science 1 (2018), pp. 259–269.
• [28] K.J. Levy, Some empirical power results associated with welch’s robust analysis of variance technique, Journal of Statistical Computation and Simulation 8 (1978), pp. 43–48.
• [29] F. Liang, R. Paulo, G. Molina, M.A. Clyde, and J.O. Berger, Mixtures of g priors for Bayesian variable selection, Journal of the American Statistical Association 103 (2008), pp. 410–423.
• [30] J. McCambridge, A. Wilson, J. Attia, N. Weaver, and K. Kypri, Randomized trial seeking to induce the Hawthorne effect found no evidence for any effect on self-reported alcohol consumption online, Journal of clinical epidemiology 108 (2019), pp. 102–109.
• [31] R.D. Morey, J.N. Rouder, T. Jamil, and M.R.D. Morey, Package ‘BayesFactor’, URL http://cran.r-project.org/web/packages/BayesFactor/BayesFactor.pdf (accessed 10.06.15) (2015).
• [32] K. Ohtani, Bootstrapping R2 and adjusted R2 in regression analysis, Economic Modelling 17 (2000), pp. 473–483.
• [33] K. Ohtani and H. Tanizaki, Exact distributions of R2 and adjusted R2 in a linear regression model with multivariate t error terms, Journal of the Japan Statistical Society 34 (2004), pp. 101–109.
• [34] K. Okada, Is omega squared less biased? a comparison of three major effect size indices in one-way anova, Behaviormetrika 40 (2013), pp. 129–147.
• [35] P. Pallmann and T. Jaki, Simultaneous confidence regions for multivariate bioequivalence, Statistics in Medicine 36 (2017), pp. 4585–4603.
• [36] L. Plonsky and F.L. Oswald, Multiple regression as a flexible alternative to anova in l2 research, Studies in Second Language Acquisition 39 (2017), pp. 579–592.
• [37] E. Quertemont, How to statistically show the absence of an effect, Psychologica Belgica 51 (2011), pp. 109–127.
• [38] S. Rehal, T.P. Morris, K. Fielding, J.R. Carpenter, and P.P. Phillips, Non-inferiority trials: are they inferior? a systematic review of reporting in major medical journals, BMJ open 6 (2016), p. e012594.
• [39] E.M. Rose, T. Mathew, D.A. Coss, B. Lohr, and K.E. Omland, A new statistical method to test equivalence: an application in male and female eastern bluebird song, Animal Behaviour 145 (2018), pp. 77–85.
• [40] J.N. Rouder and R.D. Morey, Default Bayes factors for model selection in regression, Multivariate Behavioral Research 47 (2012), pp. 877–903.
• [41] S.A. Rusticus and C.Y. Lovato, Applying tests of equivalence for multiple group comparisons: Demonstration of the confidence interval approach., Practical Assessment, Research & Evaluation 16 (2011).
• [42] J. Stand, The “Hawthorne effect” -what did the original Hawthorne studies actually show, Scand J Work Environ Health 26 (2000), pp. 363–367.
• [43] S. Wellek, Testing statistical hypotheses of equivalence and noninferiority, Chapman and Hall/CRC, 2010.
• [44] L. Yang and A.A. Tsiatis, Efficiency study of estimators for a treatment effect in a pretest–posttest trial, The American Statistician 55 (2001), pp. 314–321.