Loftus and Masson (1994) noted that in a within-subject design, the standard confidence interval for the mean of the response at a particular level of the independent variable may not be practical for interpretation. This is because, at different levels of the independent variable, such confidence intervals can show substantial overlap, so that a genuine trend of within-subject effects is hidden, especially when the between-subject variability is relatively high. Consequently, even if the corresponding repeated-measures (within-subject) analysis of variance (ANOVA) yields a highly significant test statistic, it may not be possible to make any strong inference about the ordering of the condition means based on standard confidence intervals. Conflicting evidence drawn from the standard confidence interval (CI) and the repeated-measures ANOVA arises because the between-subject variance, which is irrelevant in the repeated-measures ANOVA, partially determines the length of a standard CI.
Given the irrelevance of the between-subject variance in a repeated-measures ANOVA, Loftus and Masson (1994) proposed the within-subject CI as an interval estimate for use in within-subject designs, where the length of the interval does not depend on the amount of between-subject variability. The within-subject CI is based on a transformation of the data that removes this source of variability. The resulting interval is smaller in length (assuming cross-measurement correlation is positive) than a standard CI; however, a significant downside is that at nominal level $100(1-\alpha)\%$, the coverage probability of the within-subject CI can be far less than $100(1-\alpha)\%$. As a result, the within-subject CI is not a valid CI in the usual sense. Nevertheless, it can be used as a graphical tool that may uncover the true pattern of within-subject effects (that could otherwise be hidden within a set of standard CIs), while at the same time depicting the variability that is of scientific interest in within-subject designs.
A more significant challenge to the interpretation of CIs, which can be extended to the special case of within-subject intervals, is that even experienced researchers can hold erroneous views of what CIs actually mean. This has been discussed by, among others, Hoekstra, Morey, Rouder, and Wagenmakers (2014). These misunderstandings include the idea that a CI has a specified probability of containing the true value of the parameter. Interpretations such as this are perpetuated by textbooks and other sources (e.g., Cumming, 2014; Masson & Loftus, 2003) that do not accurately reflect the concept of CIs as originally proposed by Neyman (1937). Under Neyman’s definition, a CI is an interval generated by a procedure that, with repeated sampling, has a specified probability of containing the true value of the parameter. It is important to realize that this is a pre-data specification of confidence. A CI realized from a particular sample of data, however, cannot itself be linked back to the degree of confidence that was specified before the sample was collected (Morey, Hoekstra, Rouder, Lee, & Wagenmakers, 2016a; Morey, Hoekstra, Rouder, & Wagenmakers, 2016b).
The common misunderstanding of how to interpret CIs is associated with a tendency to treat them as credible intervals—ranges of values that are a posteriori likely to include the parameter of interest. Given the current advocacy of the use of CIs in psychological science (e.g., Cumming, 2013, 2014) and these concerns about widespread misinterpretation of CIs, there may be great value in adopting Bayesian credible intervals as the standard means of expressing estimates of parameter values and the relative precision of those estimates (Kruschke, 2013; Morey et al., 2016a).
We note, however, that there is an important area where CIs have been widely applied, but no method of computing a meaningful Bayesian credible interval has yet been developed. Namely, for repeated-measures or within-subject designs, Loftus and Masson (1994) introduced a widely accepted method (now cited over 2200 times according to Google Scholar) for computing within-subject CIs that reflect the relative magnitudes of sample means. In order to avoid the aforementioned problems associated with interpreting CIs, however, we describe in this paper our development of a Bayesian analogue of this within-subject CI. It is in fact surprising that in the 24 years since the publication of the Loftus and Masson article, no method for Bayesian within-subject interval estimation has been developed, as far as we are aware. Nevertheless, such a development is certainly well-motivated, if only because the move toward a posterior probability interpretation of the within-subject interval will be intuitively more appealing to most scientists.
We therefore present a Bayesian highest density interval (HDI), where the length of this HDI does not depend on the amount of between-subject variability in the data. In addition, we go one step further and properly characterize the posterior probability of such an interval. This latter aspect of our work is a particularly useful new development, in that it allows users of the within-subject interval to understand with greater clarity the sense in which such an interval is a valid interval estimate. As with its classical counterpart, the within-subject HDI is not a standard HDI in the usual sense; nevertheless, our construction will give the user a concrete and more direct understanding of its associated posterior probability at nominal level $100(1-\alpha)\%$.
To develop the Bayesian within-subject HDI, we adopt a rather non-standard approach and base inference on a modified posterior distribution that conditions not only on the data, but also on the subject-specific random effects in the corresponding mixed effects model. We note that using a modified posterior distribution (as opposed to the standard posterior distribution for Bayesian inference) may be viewed as somewhat controversial, and as a departure from the standard Bayesian paradigm. We call our new paradigm conditional Bayesian inference.
Another example where a modified posterior distribution is used for statistical inference is mean-field variational Bayes inference (e.g., Ormerod & Wand, 2010; Ostwald et al., 2014; Nathoo et al., 2014)—a method in which the modified posterior distribution is assumed to take a factorized form, and the modification is made to facilitate faster computation. In our case, the modification to the posterior distribution is used to eliminate the contribution of between-subject variability, and a key point here is that this variability is not of scientific interest in within-subject designs. Removing this component of variability was the motivation for the original within-subject CI (Loftus & Masson, 1994), and this approach is now an industry standard for use in within-subject designs in psychology.
As our approach to constructing the within-subject HDI is conditional on random effects that are not known, it is necessary to estimate the random effects. We estimate these effects using maximum likelihood and then condition on the estimated random effects. Thus, our use of plug-in estimates gives our approach an empirical Bayes flavor, although it is not our objective to estimate the prior from the data as it is with empirical Bayes. The idea of using conditional inference and empirical Bayes to improve efficiency in the presence of many nuisance parameters is considered by Liang and Tsou (1992), and conditional inference in the presence of nuisance parameters is discussed by Cox and Reid (1987).
Using the idea of a modified posterior distribution, we are able to derive the original within-subject CI as a within-subject Bayesian HDI corresponding to a particular improper prior. To be clear, we are not advocating treating CIs as credible intervals in general, nor are we advocating treating frequentist and Bayesian intervals as interchangeable. The problems associated with this have been discussed in Hoekstra et al. (2014) and elsewhere. The numerical equivalence between the within-subject CIs and the within-subject HDIs presented in this paper is useful in that it allows users of these specific methods to apply a posterior probability interpretation—an interpretation which is undoubtedly more appealing to most practitioners.
In addition, the Bayesian formulation clarifies for the user what the underlying prior distribution actually is and, as we shall demonstrate, that the standard within-subject CI, when formulated as a Bayesian within-subject HDI, is based on a prior distribution that may be questionable. We therefore develop a new Bayesian within-subject HDI based on a standard noninformative prior distribution, and we show that this new interval always has shorter length than the original within-subject CI of Loftus and Masson (1994).
The noninformative prior that we use to develop the proposed within-subject HDI is a standard noninformative prior commonly used for normal models, and was also considered in Rouder, Morey, Speckman, and Province (2012). It is well known that the use of noninformative priors can cause problems for Bayesian hypothesis testing with the Bayes factor, when these priors are assigned to parameters that are not common to the models being compared (e.g., Wetzels, Grasman, & Wagenmakers, 2012). In our context, however, the use of a noninformative prior causes no theoretical difficulty. Although the prior is improper, we show that the corresponding posterior distribution is always proper and can thus be used to construct an interval estimate with the specified posterior probability.
One can argue against our proposed approach of conditioning on estimated random effects on the grounds that it provides ‘false certainty’ through its double use of the data: first to estimate the between-subject variability, and then again to construct the interval estimate given the estimated between-subject variability. Nevertheless, this same criticism can also be levelled against the original non-Bayesian within-subject CI, despite this approach’s extensive use and acceptance in psychology for the analysis of repeated-measures designs. As mentioned earlier, the practical justification for the within-subject CI, and perhaps the reason for its continued wide use, is that the within-subject interval removes the component of variability that is not of scientific interest in this experimental design. For the reasons discussed above, this interval addresses the limitations of the standard CI (i.e., between-subject variance masking the true pattern of within-subject effects), while providing the researcher with a depiction of the variability that is of relevance in within-subject designs.
With regard to the conditional Bayesian approach we have adopted and the alternative of a fully Bayesian approach, we acknowledge that our use of plug-in estimates of the random effects implies that the uncertainty in estimating the subject-specific random effects is not propagated to the width of the HDI. Estimating the random effects is necessary because the intervals are based on a conditional posterior distribution, where we are conditioning on unknown quantities, the subject-specific random effects, in order to remove the uninteresting component of variability. We show that this approach, under a certain improper prior, leads to the original interval proposed by Loftus and Masson (1994), which demonstrates that it is a reasonable approach to removing between-subject variance. A fully Bayesian approach precludes conditioning on random effects, and thus would not yield within-subject Bayesian inference. It is the conditioning on estimated random effects that removes the uncertainty that is not of interest, and this conditioning precludes fully Bayesian inference.
The resulting Bayesian HDI has two main advantages over the classical within-subject CI. First, it is worth reiterating that the move towards a posterior probability interpretation of the within-subject interval will be intuitively more appealing to most scientists. Second, our use of the modified posterior distribution allows us to attach a direct modified posterior probability to the within-subject HDI, which grants the user a precise understanding of what the associated posterior probability actually is when the within-subject HDI is constructed at nominal level $100(1-\alpha)\%$.
We will develop the Bayesian within-subject interval for two cases, each corresponding to an underlying mixed effects model. In the first case, we will assume that the error variance of the response, $\sigma_\varepsilon^2$, is constant across different levels of the experimental factor. In the second case, we relax this assumption and develop a within-subject HDI that can be applied to heteroscedastic data.
The remainder of the paper proceeds as follows. In Section 2, we formulate the new Bayesian within-subject HDI and discuss its connection to the original within-subject CI. In Section 3, we present a Bayesian within-subject HDI that can be applied to heteroscedastic data. Section 4 presents two practical examples, with a tutorial aspect designed to help researchers who are new to Bayesian methods become familiar with applying these methods to their data. In these examples, we make comparisons between the classical within-subject CI and the new Bayesian within-subject HDI, and we also compare these intervals to the classical between-subject interval estimate. Section 5 concludes the paper with a discussion.
2 Formulation of the Bayesian HDI for Within-Subject Designs
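Consider a single-factor repeated-measures design with the corresponding mixed effects model
$$Y_{ij} = \mu_i + b_j + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma_\varepsilon^2), \qquad i = 1, \dots, C, \quad j = 1, \dots, N, \tag{1}$$
where $Y_{ij}$ represents the response obtained from the $j$th subject under the $i$th level of the experimental manipulation; $\mu_i$ is the mean of the response at the $i$th level; $N$ is the number of subjects; $C$ is the number of levels; and $b_j$ is a mean-zero random effect for the $j$th subject.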
Loftus and Masson (1994) proposed the within-subject confidence interval (CI) for the condition mean $\mu_i$ based on the idea of first transforming the data to remove the between-subject variability. The resulting interval has smaller length than a standard confidence interval for $\mu_i$, and its construction is motivated by the following notion: since between-subject variance typically plays no role in the statistical analyses of within-subject designs, it can legitimately be ignored. Hence, an appropriate confidence interval can be based on the standard within-subject error term.
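The within-subject CI takes the form
$$\bar{Y}_{i\cdot} \pm t_{(N-1)(C-1),\,1-\alpha/2} \sqrt{\frac{SS_{S \times C}}{N(N-1)(C-1)}}, \tag{2}$$
where $\bar{Y}_{i\cdot} = \frac{1}{N}\sum_{j=1}^{N} Y_{ij}$ is the estimated condition mean; $t_{\nu,\,1-\alpha/2}$ denotes the $1-\alpha/2$ quantile of the $t$ distribution with $\nu$ degrees of freedom;
$$SS_{S \times C} = \sum_{i=1}^{C} \sum_{j=1}^{N} \left(Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot}\right)^2$$
is the interaction sum-of-squares, with $\bar{Y}_{\cdot j} = \frac{1}{C}\sum_{i=1}^{C} Y_{ij}$ being the mean of the data obtained from the $j$th subject (note then that $\bar{Y}_{\cdot j}$ and $\bar{Y}_{i\cdot}$ denote the subject and condition means respectively); and $\bar{Y}_{\cdot\cdot}$ is the overall mean.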
The Loftus and Masson (1994) within-subject confidence interval is identical to that proposed in Cousineau (2005) and Morey (2008) when the latter use $(N-1)(C-1)$ degrees of freedom, except that the former interval is based on a pooled standard deviation and the latter uses un-pooled estimates. Loftus and Masson (1994) note that the interval estimate (2) is not a bona fide confidence interval, since it is based only on interaction variance and is not a function of the between-subject variability. Although removing the latter component of variability is in fact the motivation for the within-subject interval, an important consequence is that the within-subject interval will not have a coverage probability equal to the nominal coverage probability of $1-\alpha$. Further, it is not clear what the (frequentist) coverage probability of such an interval actually is. The authors justify its validity in a typical within-subject design by noting that the within-subject CI at level $100(1-\alpha)\%$ has the property that its length is related by a factor of $\sqrt{2}$ to the standard confidence interval around the difference between two condition means (assuming that the variance is constant across conditions). The latter point provides the user with some (albeit indirect) sense in which (2) is a $100(1-\alpha)\%$ interval estimate, and indeed, the methodology is now in common use for the analysis of repeated-measures designs in psychology.
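The computation behind the within-subject CI can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`interaction_ss`, `lm_half_width`), the toy scores, and the choice to pass the $t$ critical value in as an argument are all assumptions made for the example. The half-width follows the standard Loftus and Masson (1994) construction, namely a $t$ quantile on $(N-1)(C-1)$ degrees of freedom multiplied by the square root of the interaction mean square divided by the number of subjects.

```python
import math

def interaction_ss(data):
    """Subject-by-condition interaction sum of squares.

    data[j][i] is the score of subject j under condition i
    (a hypothetical layout chosen for this sketch).
    """
    n, c = len(data), len(data[0])
    cond_means = [sum(row[i] for row in data) / n for i in range(c)]
    subj_means = [sum(row) / c for row in data]
    grand = sum(subj_means) / n
    return sum(
        (data[j][i] - cond_means[i] - subj_means[j] + grand) ** 2
        for j in range(n)
        for i in range(c)
    )

def lm_half_width(data, t_crit):
    """Half-width of the Loftus-Masson within-subject CI.

    t_crit must be the t quantile with (N-1)(C-1) degrees of freedom,
    taken from a table or a statistics library.
    """
    n, c = len(data), len(data[0])
    ms_interaction = interaction_ss(data) / ((n - 1) * (c - 1))
    return t_crit * math.sqrt(ms_interaction / n)

# Hypothetical scores: 4 subjects (rows) by 3 conditions (columns).
toy = [[10, 12, 14], [21, 23, 22], [5, 6, 10], [14, 17, 20]]
# 2.446912 is the 0.975 quantile of t with (4-1)(3-1) = 6 degrees of freedom.
print(interaction_ss(toy), lm_half_width(toy, 2.446912))
```

Because the same half-width is attached to every condition mean, the interval reflects the pooled interaction variance only; subject-to-subject differences in overall level do not enter the calculation.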
As an alternative derivation of a within-subject credible interval, we adopt a Bayesian approach that leads to a within-subject interval that has a more direct interpretation in terms of modified posterior probability. As with the original within-subject CI, our goal is to develop an interval estimate that has, in some sense, removed the between-subject variance such that a more efficient interval is obtained. Rather than transforming the data, we do this by using a modified posterior distribution that conditions both on the data and on point estimates of the subject-specific random effects. This modified posterior is constructed to improve the efficiency of the interval estimate for the parameter of interest $\mu_i$, in the presence of what are essentially many nuisance parameters $b_1, \dots, b_N$. The latter are necessary to characterize the differences among the subjects, but are typically of no scientific interest in within-subject designs. In our context, the nuisance parameters are eliminated through conditioning, but since the $b_j$ are unknown, they are replaced by estimates $\hat{b}_j$. We use the maximum likelihood estimate $\hat{\mathbf{b}}$, which is obtained by maximizing $f(\mathbf{y} \mid \mathbf{b}, \boldsymbol{\mu}, \sigma_\varepsilon^2)$, where $\mathbf{y}$ denotes the data, $\mathbf{b} = (b_1, \dots, b_N)$, and $f$ is the conditional density of $\mathbf{y}$ given $\mathbf{b}$, $\boldsymbol{\mu}$, and $\sigma_\varepsilon^2$. The usual specification of the normal or other parametric distribution for the random effects is not needed as we do not integrate over the random effects. The modified posterior distribution is a conditional distribution where we condition on both the data and the random effects. Thus, no distribution for the random effects is assumed or required when computing the interval, and in this sense our approach can be considered semiparametric.
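The plug-in step can be illustrated with a short Python sketch. It assumes a balanced design and a sum-to-zero constraint on the subject effects, in which case the maximum likelihood estimate of each subject effect under an additive normal-error model is the subject mean minus the grand mean. The constraint, the function names, and the toy scores are assumptions of this sketch, not details confirmed by the text.

```python
def estimate_subject_effects(data):
    """Plug-in estimates of the subject effects b_j.

    Assumes a balanced design and the constraint sum(b_j) = 0, so the
    estimate is the subject mean minus the grand mean.
    data[j][i] is the score of subject j under condition i.
    """
    n, c = len(data), len(data[0])
    subj_means = [sum(row) / c for row in data]
    grand = sum(subj_means) / n
    return [m - grand for m in subj_means]

def remove_subject_effects(data):
    """Subtract each subject's estimated effect from that subject's scores."""
    b_hat = estimate_subject_effects(data)
    return [[y - b for y in row] for row, b in zip(data, b_hat)]

# Hypothetical scores: 4 subjects (rows) by 3 conditions (columns).
toy = [[10, 12, 14], [21, 23, 22], [5, 6, 10], [14, 17, 20]]
b_hat = estimate_subject_effects(toy)
adjusted = remove_subject_effects(toy)
print(b_hat)
print(adjusted)
```

Conditioning on these estimates leaves the condition means unchanged, because the estimated effects sum to zero, while removing subject-to-subject differences in overall level from the scores.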
One may understand this idea further by considering the simple inequality taught in introductory probability, with its presentation modified for our context as
$$E\left[\mathrm{Var}(\mu_i \mid \mathbf{y}, \mathbf{b})\right] \le \mathrm{Var}(\mu_i \mid \mathbf{y}),$$
which indicates that, in expectation, the conditional posterior variance of the parameter given the subject-specific random effects is less than or equal to the unconditional posterior variance of this parameter. We thus use this additional conditioning on subject-specific random effects as a mechanism for constraining the variability so that the component of posterior variability that is not of scientific interest is removed. Importantly, we do this while at the same time obtaining an interval whose posterior probability, albeit conditional on the estimated random effects, can be quantified exactly as $1-\alpha$. This is an improvement over the original notion of the within-subject CI, which at level $100(1-\alpha)\%$ does not have a (frequentist) coverage probability of $1-\alpha$; furthermore, its actual coverage probability is neither specified nor clear.
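The inequality is a direct consequence of the law of total variance. Written in notation chosen here for illustration ($\theta$ for the parameter of interest, $\mathbf{y}$ for the data, $\mathbf{b}$ for the random effects), the decomposition is:

```latex
\begin{align*}
\operatorname{Var}(\theta \mid \mathbf{y})
  &= E\{\operatorname{Var}(\theta \mid \mathbf{y}, \mathbf{b}) \mid \mathbf{y}\}
   + \operatorname{Var}\{E(\theta \mid \mathbf{y}, \mathbf{b}) \mid \mathbf{y}\} \\
  &\ge E\{\operatorname{Var}(\theta \mid \mathbf{y}, \mathbf{b}) \mid \mathbf{y}\}.
\end{align*}
```

The second term on the right of the first line is itself a variance and so cannot be negative. It measures how much the conditional posterior mean of $\theta$ moves as $\mathbf{b}$ varies, which is the component of variability that the additional conditioning removes.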
Thus, although a standard Bayesian HDI is based on the posterior density $p(\mu_i \mid \mathbf{y})$, our proposal for a Bayesian within-subject HDI is based on the modified posterior density $p(\mu_i \mid \mathbf{y}, \mathbf{b} = \hat{\mathbf{b}})$, where $\hat{\mathbf{b}}$ is a point estimate of $\mathbf{b}$, so that our approach treats estimated random effects as known and fixed values that are estimated with maximum likelihood.
Definition 1. For the mixed effects model (1), the Bayesian within-subject HDI for $\mu_i$ is the set $\mathcal{C}_i$ with
$$\mathcal{C}_i = \left\{\mu_i : p(\mu_i \mid \mathbf{y}, \mathbf{b} = \hat{\mathbf{b}}) \ge k_\alpha\right\},$$
with $k_\alpha$ chosen as the largest number so that
$$P(\mu_i \in \mathcal{C}_i \mid \mathbf{y}, \mathbf{b} = \hat{\mathbf{b}}) = 1 - \alpha. \tag{3}$$
Definition 1 gives us a precise notion, in terms of modified posterior probability, of how to construct and subsequently interpret the within-subject HDI. It is worth mentioning that although the within-subject HDI is defined in terms of its modified posterior probability, nothing prevents the practitioner from also calculating its associated unconditional posterior probability $P(\mu_i \in \mathcal{C}_i \mid \mathbf{y})$, and such a calculation is straightforward given the posterior distribution or a Monte Carlo representation of this distribution.
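Theorem 1. Assuming the noninformative prior $\pi(\boldsymbol{\mu}, \sigma_\varepsilon^2) \propto \sigma_\varepsilon^{-2}$ and the mixed effects model (1), the posterior distribution of $\mu_i$ conditioning also on the point estimates $\hat{\mathbf{b}}$ takes the form
$$\mu_i \mid \mathbf{y}, \mathbf{b} = \hat{\mathbf{b}} \;\sim\; t_{C(N-1)}\!\left(\bar{Y}_{i\cdot},\ \frac{SS_{S \times C}}{N\,C\,(N-1)}\right),$$
that is, a shifted and scaled $t$ distribution with $C(N-1)$ degrees of freedom, location $\bar{Y}_{i\cdot}$ and squared scale $SS_{S \times C}/\{NC(N-1)\}$, where $\bar{Y}_{i\cdot}$ and $SS_{S \times C}$ are the condition mean and the interaction sum-of-squares appearing in (2).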
The proof is given in the Appendix.
The Jeffreys prior adopted here,
$$\pi(\boldsymbol{\mu}, \sigma_\varepsilon^2) \propto \frac{1}{\sigma_\varepsilon^2},$$
is a noninformative prior and has the advantage of being invariant to one-to-one transformations.
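Corollary 1. Under the conditions of Theorem 1, the within-subject Bayesian HDI satisfying equation (3) takes the form
$$\bar{Y}_{i\cdot} \pm t_{C(N-1),\,1-\alpha/2} \sqrt{\frac{SS_{S \times C}}{N\,C\,(N-1)}}. \tag{4}$$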
The proof is given in the Appendix.
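The interval under the noninformative prior can be computed with the Python standard library alone. The sketch below assumes that the interval is the condition mean plus or minus a $t$ quantile on $C(N-1)$ degrees of freedom times $\sqrt{SS_{S \times C}/\{NC(N-1)\}}$, which is what a $t$-distributed modified posterior with that scale implies. The helper `t_quantile` is a simple numerical routine written for this illustration (bisection on a Simpson-rule CDF), and all names and the toy scores are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_quantile(p, df, steps=2000):
    """Quantile for 0.5 < p < 1; assumes the answer lies below 100."""
    def cdf(x):
        # Simpson's rule on [0, x], added to the mass 0.5 below zero.
        h = x / steps
        total = t_pdf(0.0, df) + t_pdf(x, df)
        for k in range(1, steps):
            total += (4 if k % 2 else 2) * t_pdf(k * h, df)
        return 0.5 + total * h / 3

    lo, hi = 0.0, 100.0
    for _ in range(60):  # bisection
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def within_subject_hdi(data, level=0.95):
    """Within-subject HDI for each condition mean (sketch).

    data[j][i] is the score of subject j under condition i.
    """
    n, c = len(data), len(data[0])
    cond_means = [sum(row[i] for row in data) / n for i in range(c)]
    subj_means = [sum(row) / c for row in data]
    grand = sum(subj_means) / n
    ss = sum(
        (data[j][i] - cond_means[i] - subj_means[j] + grand) ** 2
        for j in range(n)
        for i in range(c)
    )
    df = c * (n - 1)
    half = t_quantile(1 - (1 - level) / 2, df) * math.sqrt(ss / (n * df))
    return [(m - half, m + half) for m in cond_means]

# Hypothetical scores: 4 subjects (rows) by 3 conditions (columns).
toy = [[10, 12, 14], [21, 23, 22], [5, 6, 10], [14, 17, 20]]
intervals = within_subject_hdi(toy)
print(intervals)
```

In practice the quantile would normally come from a statistics library; the hand-rolled routine is used here only to keep the example free of third-party dependencies.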
We notice that this Bayesian within-subject HDI has a simple form that is very similar to the original within-subject CI (2). Furthermore, the original within-subject CI can also be interpreted as a Bayesian within-subject HDI satisfying equation (3) under a specific improper prior. In addition, the proposal (4) is equivalent to that considered in Cousineau (2005) and Morey (2008) when both use $C(N-1)$ degrees of freedom, with the only difference being that the latter intervals do not use a pooled estimate of the standard deviation. A referee has noted that there can be a small advantage to using separate estimates in some settings, as this corresponds to a more flexible model. In this case the intervals considered in Cousineau (2005) and Morey (2008) may have shorter length. In addition, Morey (2008) proved for the hierarchical model assumed here that using separate estimates of the standard deviation has advantages with respect to bias, which is then reflected in the corresponding intervals. Of course, this will come at the cost of increased variance as the number of parameters to be estimated increases.
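Theorem 2. Assuming the improper prior $\pi(\boldsymbol{\mu}, \sigma_\varepsilon^2) \propto (\sigma_\varepsilon^2)^{(N-3)/2}$ and the mixed effects model (1), the within-subject Bayesian HDI satisfying equation (3) coincides with the within-subject CI (2). The proof is given in the Appendix.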
Thus, from Theorem 2, we are able to interpret the original within-subject CI for $\mu_i$ as a Bayesian within-subject interval with modified posterior probability $1-\alpha$, as specified in equation (3). This is, in and of itself, an interesting new interpretation of the original within-subject CI that provides substantial new clarity on how to interpret its associated posterior probability at nominal level $100(1-\alpha)\%$. Nevertheless, our proposed new within-subject interval (4) based on the noninformative prior seems to be a better within-subject interval for at least two reasons:
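First, the new interval is always shorter than the original within-subject interval, even though both have the same modified posterior probability of $1-\alpha$. More specifically, letting $L_{\text{new}}$ denote the length of the new interval, $L_{\text{LM}}$ denote the length of the original interval, and $N, C \ge 2$, we have that
$$\frac{L_{\text{new}}}{L_{\text{LM}}} = \frac{t_{C(N-1),\,1-\alpha/2}}{t_{(N-1)(C-1),\,1-\alpha/2}} \sqrt{\frac{C-1}{C}} < 1. \tag{5}$$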
The comparison of the lengths of these two intervals is appropriate and meaningful since (1) both are within-subject HDIs, so that we are comparing two Bayesian within-subject intervals under different priors; and (2) both are centred at the same point, namely $\bar{Y}_{i\cdot}$. It is therefore relevant to point out that the proposed within-subject HDI has smaller length than the original within-subject CI.
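As a worked instance of this comparison, take a design with $N = 10$ subjects and $C = 3$ conditions (the dimensions of the first data example) and $\alpha = 0.05$. The calculation below assumes that the new interval uses a $t$ quantile on $C(N-1) = 27$ degrees of freedom with variance estimate $SS_{S \times C}/\{C(N-1)\}$, and that the original interval uses $(N-1)(C-1) = 18$ degrees of freedom with variance estimate $SS_{S \times C}/\{(N-1)(C-1)\}$:

```latex
\[
\frac{L_{\text{new}}}{L_{\text{LM}}}
  = \frac{t_{27,\,0.975}}{t_{18,\,0.975}} \sqrt{\frac{C-1}{C}}
  \approx \frac{2.0518}{2.1009} \times \sqrt{\frac{2}{3}}
  \approx 0.9766 \times 0.8165
  \approx 0.797 .
\]
```

Under these assumptions the new interval is roughly 20% shorter. Nearly all of the reduction comes from the factor $\sqrt{(C-1)/C}$; the change in degrees of freedom contributes only about 2%.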
Second, the new interval is based on a standard noninformative prior distribution, $\pi(\boldsymbol{\mu}, \sigma_\varepsilon^2) \propto \sigma_\varepsilon^{-2}$, that is far more reasonable than the prior distribution corresponding to the original interval, $\pi(\boldsymbol{\mu}, \sigma_\varepsilon^2) \propto (\sigma_\varepsilon^2)^{(N-3)/2}$. Although both priors are improper, and both lead to a proper posterior distribution, the former prior is a decreasing function of $\sigma_\varepsilon^2$, whereas the latter is an increasing and unbounded function of $\sigma_\varepsilon^2$ (assuming $N > 3$). Giving larger prior weight to larger values of $\sigma_\varepsilon^2$ without bound seems an unreasonable prior; thus, practitioners should be aware that the widely used within-subject CI corresponds to an apparently unreasonable prior.
We therefore claim that the new interval (4) is a better within-subject interval. This claim also corresponds to a claim that the Jeffreys prior leads to a shorter interval than the prior associated with Loftus and Masson (1994), while both intervals are centred at the same point estimate. We note that this does not necessarily imply that the new interval is optimal in any sense. In addition, it is certainly possible to use a prior that will result in an even smaller interval than Jeffreys prior used here (consider for an extreme example, a point mass prior). However, this does not imply that such an interval is necessarily better since its centre might be a biased estimator of location and the bias can be the result of a strong prior. In the context of our comparison of (4) with the within-subject CI of Loftus and Masson (1994), this point about bias does not apply since these intervals are centred at the same point and both are symmetric.
3 Dealing with Heteroscedastic Data
A key assumption of the mixed effects model (1) is that the error variance is constant across the different levels of the experimental manipulation, an assumption that can be violated in psychological experiments. One possible solution is to apply a transformation to the response that makes the variance of the transformed response stable. Another solution is to simply expand the model so that it allows for this behavior. We adopt the second solution here and derive a Bayesian within-subject HDI based on the more general one-way mixed effects model
$$Y_{ij} = \mu_i + b_j + \varepsilon_{ij}, \qquad \varepsilon_{ij} \overset{\text{ind}}{\sim} N(0, \sigma_i^2), \qquad i = 1, \dots, C, \quad j = 1, \dots, N, \tag{6}$$
where now $\mathrm{Var}[Y_{ij} \mid b_j] = \sigma_i^2$ depends on the level of the experimental condition. As before, the within-subject Bayesian HDI is defined based on the conditional posterior density function $p(\mu_i \mid \mathbf{y}, \mathbf{b} = \hat{\mathbf{b}})$, where $\hat{\mathbf{b}}$ is a point estimate of $\mathbf{b}$.
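Theorem 3. Assuming the prior
$$\pi(\boldsymbol{\mu}, \sigma_1^2, \dots, \sigma_C^2) \propto \prod_{i=1}^{C} \frac{1}{\sigma_i^2}$$
and the heteroscedastic mixed effects model (6), the within-subject Bayesian HDI for $\mu_i$ satisfying $P(\mu_i \in \mathcal{C}_i \mid \mathbf{y}, \mathbf{b} = \hat{\mathbf{b}}) = 1 - \alpha$ takes the form
$$\bar{Y}_{i\cdot} \pm t_{N-1,\,1-\alpha/2} \sqrt{\frac{\sum_{j=1}^{N} (Z_{ij} - \bar{Z}_{i\cdot})^2}{N(N-1)}}, \qquad Z_{ij} = Y_{ij} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot}, \tag{7}$$
where $\bar{Y}_{\cdot j}$ is the mean for the $j$th subject, $\bar{Y}_{\cdot\cdot}$ is the overall mean, and $\bar{Z}_{i\cdot} = \bar{Y}_{i\cdot}$ is the mean of the standardized scores in condition $i$.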
The proof is provided in the Appendix.
We note that this interval is precisely the standardization method discussed in Franz and Loftus (2012) and elsewhere, but we have provided here a Bayesian justification for this method along with a precise interpretation (in terms of modified posterior probability) of the resulting interval estimate. The interval (7) also shows rigorously that the intervals proposed in Cousineau (2005) and Morey (2008) with $N-1$ degrees of freedom are an adequate solution when homogeneity of variances is violated in the data. The method is extremely simple to apply, and its implementation merely requires a standardization of the data, $Z_{ij} = Y_{ij} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot}$, after which the usual interval estimate used in between-subjects designs is constructed. We note again, however, that our contribution is the derivation of this method as a Bayesian within-subject HDI, which allows for a completely novel interpretation and justification based on its modified posterior probability (3).
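The standardization recipe is short enough to sketch directly. The Python below assumes the usual normalization (each score minus its subject mean plus the grand mean) followed by an ordinary one-sample $t$ interval on $N-1$ degrees of freedom within each condition. The function names and toy scores are hypothetical, and the $t$ critical value is supplied by the caller.

```python
import math

def normalized_scores(data):
    """Remove subject differences: score - subject mean + grand mean.

    data[j][i] is the score of subject j under condition i.
    """
    n, c = len(data), len(data[0])
    subj_means = [sum(row) / c for row in data]
    grand = sum(subj_means) / n
    return [[y - m + grand for y in row] for row, m in zip(data, subj_means)]

def per_condition_half_widths(data, t_crit):
    """Half-width of the interval for each condition (unpooled variances).

    t_crit must be the t quantile with N-1 degrees of freedom.
    """
    z = normalized_scores(data)
    n, c = len(z), len(z[0])
    widths = []
    for i in range(c):
        column = [row[i] for row in z]
        mean_i = sum(column) / n
        var_i = sum((v - mean_i) ** 2 for v in column) / (n - 1)
        widths.append(t_crit * math.sqrt(var_i / n))
    return widths

# Hypothetical scores: 4 subjects (rows) by 3 conditions (columns).
toy = [[10, 12, 14], [21, 23, 22], [5, 6, 10], [14, 17, 20]]
# 3.182446 is the 0.975 quantile of t with 4 - 1 = 3 degrees of freedom.
widths = per_condition_half_widths(toy, 3.182446)
print(widths)
```

Unlike the pooled intervals, each condition receives its own half-width, so a condition with more variable normalized scores (the third one in this toy example) gets a visibly wider interval.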
Standardization methods have also been discussed by Cousineau (2005), who proposed a simple alternative to the Loftus and Masson CIs that does not assume sphericity. This approach also removes individual differences in the data through a transformation. This same procedure was also described by Loftus and Masson (1994) to illustrate the process of removing individual differences from data rather than for computing the CI. Standardization methods are also discussed in Morey (2008) and Baguley (2012). Morey (2008) pointed out that Cousineau’s (2005) approach produces intervals that are consistently too narrow because the standardization procedure induces a positive covariance between standardized scores within a condition, introducing bias into the estimates of the sample variances. Morey (2008) suggests a simple correction to the Cousineau (2005) approach, in which the half-width of the CI is rescaled by a factor of $\sqrt{C/(C-1)}$, where $C$ is the number of conditions. The presence of this correction factor is now commonly considered in the Cousineau (2005) and Morey (2008) method and is unambiguously present in all subsequent publications (O’Brien & Cousineau, 2014; Baguley, 2012; Cousineau & O’Brien, 2014).
Related to this, Franz and Loftus (2012) discussed two problems of the standardization method, and it is therefore important that we address these in light of our Bayesian formulation. First, Franz and Loftus stated that the associated intervals are too small, as
$$\mathrm{SEM}_i = \sqrt{\frac{\sum_{j=1}^{N} (Z_{ij} - \bar{Z}_{i\cdot})^2}{N(N-1)}}$$
(where SEM is an abbreviation for standard error of the mean, and $Z_{ij}$ denotes the standardized score of subject $j$ in condition $i$) underestimates the associated SEM produced by the Loftus and Masson (1994) method, $\sqrt{SS_{S \times C}/\{N(N-1)(C-1)\}}$, by a factor of $\sqrt{(C-1)/C}$. The rescaling proposed by Morey (2008) can be applied here, though we do not pursue this modification as it would alter the modified posterior probability of the resulting within-subject Bayesian HDI to a value above the nominal level. From Theorem 3, the modified posterior probability of our proposed interval is guaranteed to be $1-\alpha$, and the length of the unadjusted interval will be smaller than that of the adjusted interval. It is also instructive to point out that the same term also appears in equation (5).
The second problem of the standardization method discussed by Franz and Loftus (2012) is that the method can hide serious violations of the circularity assumption, that is, the assumption that the variance of the repeated measurements is constant and the covariance between any pair of measurements is also constant. It should therefore not be used as a tool to detect departures from circularity. We agree with this point, and suggest that the approach recommended by those authors (i.e., showing all pairwise differences between factor levels and computing the corresponding $\mathrm{SEM}$ for each pair) can be employed as a simple diagnostic to check for the violation of circularity. Alternatively, various statistical packages (e.g., ezANOVA in the R package ez) can be used directly to test the circularity assumption.
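The pairwise diagnostic can be sketched as follows. Under circularity the variance of the difference scores is the same for every pair of conditions, so large disparities among the pairwise standard errors suggest that the assumption is violated. The function name and toy scores are hypothetical, and the sketch reports standard errors only; it is not a formal test.

```python
import math
from itertools import combinations
from statistics import variance

def pairwise_difference_sems(data):
    """Standard error of the mean difference for every pair of conditions.

    data[j][i] is the score of subject j under condition i. Keys of the
    returned dict are (i, k) index pairs with i < k.
    """
    n, c = len(data), len(data[0])
    sems = {}
    for i, k in combinations(range(c), 2):
        diffs = [row[i] - row[k] for row in data]
        sems[(i, k)] = math.sqrt(variance(diffs) / n)
    return sems

# Hypothetical scores: 4 subjects (rows) by 3 conditions (columns).
toy = [[10, 12, 14], [21, 23, 22], [5, 6, 10], [14, 17, 20]]
sems = pairwise_difference_sems(toy)
for pair, sem in sems.items():
    print(pair, round(sem, 3))
```

In this toy example the standard error for the first pair is much smaller than for the two pairs involving the third condition, which is the kind of pattern that would prompt a closer look at the circularity assumption.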
To determine which of our proposed within-subject HDIs, (4) or (7), to use for a given dataset, we recommend either simply inspecting the variability of the data at each level of the independent variable to determine whether homogeneity of variance is a reasonable assumption, or, more formally, comparing the underlying models (1) and (6) using Bayesian model selection procedures (e.g., Kass & Raftery, 1995; Wagenmakers, 2007; Masson, 2011; Rouder et al., 2012; Nathoo & Masson, 2016). For example, the Bayes factor can be used to compare models (6) and (1), and its computation can be implemented using the BayesFactor package in R.
4 Data Examples
In this section we illustrate applications of our proposed computation of credible intervals for condition means in a repeated-measures design. For the first example, we consider the hypothetical data used by Loftus and Masson (1994) to demonstrate the application of the within-subject confidence interval they developed. The data consist of scores from 10 subjects, each tested under three conditions representing three different presentation durations (see Table 2 in Loftus & Masson). In Table 1, we present the raw data and the means for each of the three conditions in their example. On the right side of the table, we present two versions of the 95% confidence intervals, one being the standard confidence interval based on between-subject variability (assuming equal variance) and the other representing the within-subject CI defined by Loftus and Masson (2). Note that the within-subject CI is narrower than the between-subject CI because the within-subject version is computed with between-subject variability removed. We also present the 95% within-subject highest density interval computed using (4). This HDI reflects the credible values of the condition means, conditioned on the variability between subjects.