Replication of Goodman et al. (2019)
In a recent simulation study, Goodman et al. (2019) compare several methods with regard to their performance of type I and type II error rates in case of a thick null hypothesis that includes all values that are practically equivalent to the point null hypothesis. They propose a hybrid decision criterion only declaring a result "significant" if both a small p-value and a sufficiently large effect size are obtained. We successfully replicate the results using our own software code in R and discuss an additional decision method that maintains a pre-defined false positive rate. We confirm that the hybrid decision criterion has comparably low error rates in settings one can check for but point out that the false discovery rate cannot be easily controlled by the researcher. Our analyses are readily accessible and customizable on https://github.com/drehero/goodman-replication.
The discussion on the usefulness of null hypothesis significance testing and the correct use and interpretation of p-values has been hot in recent years. The American Statistical Association Statement on p-Values and Statistical Significance (Wasserstein and Lazar, 2016) was arguably the most prominent work on pitfalls in using p-values for statistical inference. It was followed by a symposium on statistical inference organized by the American Statistical Association in 2017. The symposium led to a special issue of The American Statistician titled "Moving to a World Beyond 'p < 0.05'" presenting many suggestions on how to separate signal from noise in data. In one of the included papers, Goodman et al. (2019), henceforth GSK, compare several methods with regard to their performance of type I and type II error rates in case of a thick null hypothesis, i.e. the null hypothesis includes all values that are practically equivalent to the point null hypothesis. They propose a hybrid decision criterion, the minimum effect size plus p-value (MESP), that only leads to a rejection if a p-value lower than 5% is obtained for the tested point null hypothesis and the observed effect size is large enough to be considered practically relevant. In a simulation study they compare different decision criteria and show overall good performance of the MESP method. More specifically, MESP has generally rather low type I error rates and has quite high type II error rates only in settings that the researcher can usually identify, namely low nominal power settings. In this note, we replicate the GSK results using independently written software code in R and discuss a further decision criterion and how it is related to the ones presented in GSK. Furthermore, we provide an openly available code repository that allows the scientific community to easily replicate, amend and extend the analyses conducted in GSK and in this note.
Footnote 2: GSK provide software code in Excel that requires the proprietary add-on Resampling Stats for Excel to run (https://www.tandfonline.com/doi/suppl/10.1080/00031305.2018.1564697) and kindly sent us additional explanations regarding the spreadsheet. We deem our open source implementation more convenient and accessible.
We find that the results of GSK are replicable in a narrow sense and extend the analyses by a Bayesian-motivated t-test that maintains a pre-defined false positive rate. We also compute false discovery and false omission rates for different nominal power categories and show that the MESP approach has the undesirable property that the false discovery rate does not necessarily decrease for rising nominal power.
In the recent discussion about the replication crisis, some authors suggested that the traditional focus on point hypotheses is part of the problem (e.g., Greenland, 2019; McShane et al., 2019): When testing the point null hypothesis $H_0: \mu = \mu_0$, even tiny, practically irrelevant effects can be "statistically significant" if the number of samples is big enough. Furthermore, problems in the experimental setup can easily lead to a small bias that is erroneously interpreted as a statistically significant effect by the test. It was suggested to refrain from testing point hypotheses and instead to test whether an effect is big enough to be practically significant (e.g., Betensky, 2019; Blume et al., 2019).
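In the following, the effect size deemed practically relevant by researchers in a certain scenario is referred to as MPSD (Minimum Practically Significant Distance). Accordingly, the thick null hypothesis is $H_0: |\mu - \mu_0| \leq \mathrm{MPSD}$ or, equivalently, $H_0: \mu \in [\mu_0 - \mathrm{MPSD}, \mu_0 + \mathrm{MPSD}]$, where $[\mu_0 - \mathrm{MPSD}, \mu_0 + \mathrm{MPSD}]$ is the thick null interval.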
We simulate 100,000 cases. (Footnote 3: In order to obtain robust results over different simulation runs also for subgroup analyses (e.g. Table 2), we increase the number of simulated cases compared to GSK, who simulate 10,000 cases.) Each case can be thought of as a fictional study testing whether some parameter $\mu$ is different from $\mu_0$. Formally, a case is characterized by a tuple $(\mu, \sigma, n, \mathrm{MPSD})$ describing the conditions of the study:
$\mu$, sampled uniformly from a fixed range, is the population mean of our data.
$\sigma$, sampled uniformly from a fixed range, is the standard deviation of the population.
$n$, sampled uniformly from a fixed range, is the sample size.
$\mathrm{MPSD}$, sampled uniformly from a fixed range, is the minimal effect size deemed practically relevant by researchers in our fictional scenario.
For every case, we store whether $|\mu - \mu_0| > \mathrm{MPSD}$ (i.e. whether $H_0$ is false). We sample the data and store the results of the decision methods described below based on that sample.
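The case generation described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R code. The numeric ranges, the value `MU_0 = 100`, and the normal distribution of the data are assumptions of this sketch: they are not stated in the text above, although the assumed ranges for the mean and the MPSD are consistent with the 45.2% share of true thick nulls reported in Table 1. All function names are hypothetical.

```python
import random

MU_0 = 100  # assumed point null value (not stated in the text above)


def draw_conditions(rng):
    """Draw the study conditions (mu, sigma, n, MPSD) for one fictional case."""
    mu = rng.randint(75, 125)   # assumed integer range for the population mean
    sigma = rng.randint(4, 60)  # assumed range for the population SD
    n = rng.randint(5, 100)     # assumed range for the sample size
    mpsd = rng.randint(2, 20)   # assumed integer range for the MPSD
    return mu, sigma, n, mpsd


def thick_null_is_false(mu, mpsd, mu_0=MU_0):
    """The thick null is false if the true mean lies outside the thick null interval."""
    return abs(mu - mu_0) > mpsd


def draw_sample(rng, mu, sigma, n):
    """Draw n observations; normally distributed data is an assumption of this sketch."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


rng = random.Random(1)
cases = [draw_conditions(rng) for _ in range(20_000)]
share_true = sum(
    not thick_null_is_false(mu, mpsd) for mu, _, _, mpsd in cases
) / len(cases)
```

Under the assumed ranges, the expected share of cases with a true thick null is 23/51, i.e. roughly 45%, so `share_true` should land close to that value.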
GSK compare five different decision methods:
The first method is a t-test for the point null hypothesis that $\mu = \mu_0$ with $\alpha = 0.05$. Note that $\alpha$ is usually the desired false positive rate (the type I error): In case the point null hypothesis is true it should not be falsely rejected by the test more than 5% of the time. However, since the test is used here to decide whether the thick null hypothesis, not the point null hypothesis, should be rejected, the false positive rate will exceed 5% for $\mathrm{MPSD} > 0$.
The second method is a t-test as described above with $\alpha = 0.005$ instead of $\alpha = 0.05$. Lowering $\alpha$ from 0.05 to 0.005 was prominently recommended to lower the rate of false positive results and irreproducible research findings (Benjamin et al., 2018). However, as discussed above, $\alpha$ is used to control the false positive rate with respect to the point null hypothesis, not the thick null hypothesis.
The third method rejects the null hypothesis if the observed effect size is at least as big as the minimum practically significant distance, i.e. $|\bar{x} - \mu_0| \geq \mathrm{MPSD}$.
GSK propose a new decision method called "Decision by Minimum Effect Size Plus p-value" that is a conjunction of the conventional and the distance-only method, i.e. the null hypothesis is rejected only if both methods would reject it, that is, if both the p-value of the t-test is smaller than 0.05 and the observed effect size is practically significant.
The final method compared by GSK rejects the null only if the thick null interval $[\mu_0 - \mathrm{MPSD}, \mu_0 + \mathrm{MPSD}]$ and the 95% confidence interval centered around the observed sample mean $\bar{x}$ do not intersect. (Footnote 4: The interval-based method is equivalent to a two-sided decision criterion presented in Betensky (2019) for rejecting a thick null hypothesis. However, Betensky (2019) allows the interval-based method also to be used to confirm thick null hypotheses. In this regard, the method applied by Betensky (2019) can be considered as a composite approach that combines the interval-based method presented in GSK and equivalence testing using two one-sided tests (Schuirmann, 1987). The latter is designed to confirm the hypothesis that an effect is not larger than some practically meaningful equivalence bounds, i.e. the boundaries of the thick null. The Bayesian counterpart to Betensky (2019) is the HDI-ROPE procedure (Kruschke, 2011) that also allows for rejection and acceptance of a thick null hypothesis. Both Betensky (2019) and Kruschke (2011) deem results inconclusive if the estimated interval (confidence or credible interval, respectively) is neither fully inside nor fully outside of the thick null hypothesis.)
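The five criteria can be written side by side as a small function. The following Python sketch is only an approximation: it swaps the t-distribution for a standard normal (a z-approximation), because the Python standard library has no t-distribution; the authors' R code presumably uses a proper t-test. The function name `decide` and the dictionary keys are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def decide(sample, mu_0, mpsd):
    """Return the reject/no-reject decision of each criterion for one sample.

    Uses a normal approximation in place of the t-distribution (assumption
    of this sketch), so results differ slightly for small samples.
    """
    n = len(sample)
    x_bar = mean(sample)
    se = stdev(sample) / sqrt(n)  # estimated standard error of the mean
    std_normal = NormalDist()

    # Two-sided p-value for the point null mu = mu_0.
    p_value = 2 * (1 - std_normal.cdf(abs(x_bar - mu_0) / se))
    # Half-width of the 95% confidence interval around x_bar.
    half_width = std_normal.inv_cdf(0.975) * se
    # Observed effect at least as large as the MPSD.
    far_enough = abs(x_bar - mu_0) >= mpsd

    return {
        "conventional": p_value < 0.05,
        "small_alpha": p_value < 0.005,
        "distance_only": far_enough,
        "mesp": p_value < 0.05 and far_enough,
        # The CI and the thick null interval do not intersect.
        "interval_based": abs(x_bar - mu_0) - half_width > mpsd,
    }


decisions = decide([110, 112, 108, 111, 109], mu_0=100, mpsd=9)
```

In this example the sample mean is 10 units away from `mu_0`, so the distance-only and MESP criteria reject, while the confidence interval still overlaps the thick null interval and the interval-based criterion does not reject.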
In this section we present a sixth decision method that we call the thick t-test. The idea is to base the decision to reject the thick null hypothesis $H_0$ on the probability to observe an effect that is more extreme with respect to $\mu_0$ than the one actually observed if $H_0$ was true. (Footnote 5: We assume that $\mu$ can only be an integer, which is true in our simulation setting. This simplification by GSK is usually false in the real world and we would have to use the continuous uniform distribution and integrate over the thick null interval instead. Please refer to our code for details on how to compute this probability in either case.) This probability is known as the prior predictive p-value popularized by Box (1980) (see also Bayarri and Berger, 2000, for a discussion on this and related concepts for composite null hypotheses). For $\mathrm{MPSD} = 0$, i.e. $H_0: \mu = \mu_0$, it is simply the p-value of a two-sided point hypothesis test. In that sense the probability as defined above is a generalization of familiar p-values to thick null hypotheses, and the same guarantee about the false positive rate from point hypothesis tests holds for the thick t-test: If we decide to reject $H_0$ whenever this probability is below 0.05, the probability to report an effect even though the real effect is practically irrelevant is 5%. This is because, just as the familiar p-value in a two-sided point hypothesis test, it is uniformly distributed on $[0, 1]$ under $H_0$. To be more precise, due to the discrete distribution of $\mu$ in the simulation, it is only asymptotically uniformly distributed, and hence the guarantee holds only approximately.
However, we need to make an assumption about the distribution of $\mu$ under $H_0$, and the guarantee holds only if this assumption is correct. Since we are doing a simulation, we know that $\mu$ is uniformly distributed under $H_0$. If it is true that the simulation generates a realistic example set, a uniform distribution might also be a reasonable choice in the real world. On the other hand, there might be settings where other choices for the distribution of $\mu$ under $H_0$ are more appropriate. A detailed discussion on how realistic and generalizable the simulation's setting and results are can be found in GSK (Appendix A.1).
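To make the idea concrete, here is one possible reading of the thick p-value as a Python sketch. It is an assumption-laden illustration, not the authors' implementation: it assumes a discrete uniform distribution of the true mean over the integers in the thick null interval, treats the standard error as known, and uses a normal sampling distribution for the sample mean instead of a t-distribution. The function names are hypothetical.

```python
from statistics import NormalDist


def exceedance_prob(x_bar, se, mu, mu_0):
    """P(|X_bar - mu_0| >= |x_bar - mu_0|) if the true mean were mu.

    Assumes X_bar ~ Normal(mu, se) with known standard error (sketch only).
    """
    d = abs(x_bar - mu_0)
    sampling = NormalDist(mu, se)
    return (1 - sampling.cdf(mu_0 + d)) + sampling.cdf(mu_0 - d)


def thick_p_value(x_bar, se, mu_0, mpsd):
    """Average the exceedance probability over the integers in the thick null.

    The discrete uniform prior over [mu_0 - mpsd, mu_0 + mpsd] is the
    assumption about the distribution of mu under the thick null.
    """
    support = range(mu_0 - mpsd, mu_0 + mpsd + 1)
    return sum(exceedance_prob(x_bar, se, mu, mu_0) for mu in support) / len(support)


p_point = thick_p_value(103.0, 1.5, mu_0=100, mpsd=0)
p_thick = thick_p_value(103.0, 1.5, mu_0=100, mpsd=2)
```

With `mpsd=0` the support collapses to `mu_0` and the function returns the ordinary two-sided p-value of a z-test. Widening the thick null moves prior mass closer to the observed mean, so `p_thick` is larger than `p_point` for the same data.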
We believe that this method is useful for a number of reasons: Firstly, only two of the five original methods tested in GSK (MESP and the interval-based method) take both the MPSD and the standard error of the estimated mean into account. Therefore, it benefits the comparison to add a decision method that takes all available information into account.
Secondly, it presents a framework to better understand most of the other methods: The conventional and the small-alpha method can be thought of as thick t-tests that assume $\mu = \mu_0$ under $H_0$, which is the least conservative choice for the distribution of $\mu$ in the sense that it minimizes the probability of observing a practically relevant effect ($|\bar{x} - \mu_0| \geq \mathrm{MPSD}$) despite $H_0$ being true among all distributions that are symmetric around $\mu_0$. Hence, we expect the false positive rate of these methods to be higher than $\alpha$ (5% and 0.5% respectively). The interval-based method on the other hand is approximately equivalent to the thick t-test that assumes that $\mu$ under $H_0$ equals the value in the thick null interval that is closest to the observed mean $\bar{x}$. (Footnote 6: In the case we observe (and analogously for ) the method would be equivalent if , which is approximately true for reasonably big and reasonably small .) This is the most conservative distribution of $\mu$ in the sense that it maximizes the conditional probability to observe an effect more extreme than $\bar{x}$ when $H_0$ is true. Hence, we expect the false positive rate to be below 5% and the false negative rate to be high for this method. (Footnote 7: The thick t-test with this distribution is equivalent to taking the supremum of the p-value over the thick null interval. Taking the supremum instead of integrating out $\mu$ is the traditional way of dealing with composite null hypotheses (see for example Bickel and Doksum, 2015, p. 217), e.g. when conducting a one-sided t-test.) The distance-only method can be thought of as a thick t-test that assumes the same distribution as the interval-based method but with $\alpha = 0.5$. Hence we expect the false positive rate to be higher than that of the interval-based method but well below 50% since the assumed distribution is overly conservative.
Thirdly, the thick t-test could be used as a decision method in its own right. The parameter $\alpha$ allows one to control the risk of incorrectly rejecting a true null hypothesis directly and more accurately than the overly conservative interval-based method. Of course, this requires that the assumption about the distribution of the population mean when there is no practically relevant effect is at least approximately true. We don't believe that the need to make this assumption is necessarily a disadvantage compared with the other methods, since one would have to make similar assumptions about the distribution of $\mu$ under $H_0$ if one wanted to derive any guarantees about their false positive rates with respect to the thick null hypothesis.
Table 1 was created based on Table 2 of GSK. The different methods are compared in terms of their overall inference success rate, broken down by nominal power. The nominal power for a case in the simulation relies on the power calculation for a one-sample z-test for the mean using $\alpha = 0.05$, the case's true population standard deviation $\sigma$, sample size $n$ and minimum detectable difference $\mathrm{MPSD}$. Table 1 also distinguishes between the true mean falling within and beyond the thick null. In addition to the methods from GSK, the thick t-test introduced in the last section is also considered conditional on the nominal power. The table is visualized in Figure 1.
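The nominal power calculation can be illustrated with the textbook power formula for a one-sample z-test. The sketch below assumes a two-sided test at level 0.05 and uses the category thresholds 0.30 and 0.80 that appear in the tables; whether the authors use strict or non-strict inequalities at the boundaries is an assumption here, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def nominal_power(mpsd, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test to detect a true difference of mpsd."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = mpsd * sqrt(n) / sigma  # standardized true difference
    # Probability that the test statistic falls in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


def power_category(power):
    """Map a nominal power value to the three categories used in the tables."""
    if power > 0.80:
        return "high"
    if power < 0.30:
        return "low"
    return "medium"


example_power = nominal_power(mpsd=10, sigma=30, n=50)
example_category = power_category(example_power)
```

Because the calculation uses the true standard deviation and the MPSD rather than anything estimated from the sample, a researcher with a reasonable guess of the standard deviation could compute it before running a study.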
Does the true location fall within the bounds of the thick null? | Nominal power | Cases | Conventional | Small-alpha | MESP | Distance-only | Interval-based | Thick t-test |
---|---|---|---|---|---|---|---|---|
Yes (45,193 cases) | > 0.80 | 23,869 | 37.6% | 53.3% | 90.7% | 90.7% | 99.7% | 95.2% |
Yes (45,193 cases) | 0.30 to 0.80 | 12,920 | 76.3% | 92.9% | 82.6% | 78.8% | 99.5% | 95.1% |
Yes (45,193 cases) | < 0.30 | 8,404 | 90.7% | 98.4% | 90.7% | 55.7% | 98.5% | 94.7% |
No (54,807 cases) | > 0.80 | 17,674 | 99.4% | 96.7% | 92.4% | 92.4% | 62.6% | 85.8% |
No (54,807 cases) | 0.30 to 0.80 | 14,507 | 85.5% | 66.2% | 83.0% | 86.8% | 42.9% | 65.3% |
No (54,807 cases) | < 0.30 | 22,626 | 56.3% | 35.3% | 56.3% | 88.0% | 38.7% | 50.6% |
The top half of Table 1 and the left half of Figure 1 show the success rates of the methods when the true parameter lies within the thick null interval. The interval-based method performs best in this case, as already seen in GSK. The new thick t-test falsely rejects approximately 5% of the time across all nominal power categories. The success rates of the other methods except for the interval-based method are lower in at least one case and depend on the nominal power. The MESP method exhibits overall moderate error rates.
The bottom half of Table 1 and the right half of Figure 1 show the success rates of the methods when the real parameter lies outside the null interval. The previously best method performs worst in this case. Overall, the distance-only method has high inference success rates and is topped only at high nominal power by the conventional method and the small-alpha method. However, these two methods have a much worse success rate for low nominal power. The newly considered thick t-test performs moderately. It is neither one of the best methods at high nominal power, nor one of the worst methods at low nominal power. MESP performs reasonably well for higher nominal power and exhibits low inference success only in the low nominal power setting.
Overall, Table 1 and Figure 1 successfully replicate the corresponding illustrations in GSK. The thick t-test performs very well if the true location falls within the bounds of the thick null. MESP has generally acceptably low error rates except for low nominal power settings when the thick null is not true. As a composite measure of the conventional and the distance-only method it has inference success rates similar to the former when the nominal power is low and similar to the latter when the nominal power is high. The reason for this behavior is intuitive: The p-value computed for the point null hypothesis plays an important role for the MESP decision only if effect sizes need to be really large to lead to a rejection of the point null hypothesis. Conversely, in high nominal power settings p-values are generally low and MESP's decision depends only on the distance criterion. An interesting property of MESP is that, due to its construction as a composite measure, it has its lowest inference success for medium nominal power if the thick null holds.
Does the true location fall within the bounds of the thick null? | Decile | Cases | Conventional | Small-alpha | MESP | Distance-only | Interval-based | Thick t-test |
---|---|---|---|---|---|---|---|---|
Yes (45,193 cases) | 1 | 1,356 | 93.5% | 98.9% | 93.5% | 38.3% | 97.4% | 94.7% |
Yes (45,193 cases) | 2 | 2,456 | 90.7% | 98.3% | 90.7% | 54.8% | 98.8% | 94.6% |
Yes (45,193 cases) | 3 | 3,308 | 85.4% | 96.6% | 85.9% | 66.3% | 98.9% | 94.5% |
Yes (45,193 cases) | 4 | 4,296 | 80.1% | 94.9% | 83.7% | 74.2% | 99.3% | 95.3% |
Yes (45,193 cases) | 5 | 5,147 | 75.2% | 90.9% | 85.1% | 78.9% | 99.4% | 94.5% |
Yes (45,193 cases) | 6 | 5,632 | 68.0% | 85.8% | 85.0% | 80.8% | 99.5% | 95.3% |
Yes (45,193 cases) | 7 | 5,610 | 58.4% | 78.1% | 86.4% | 84.0% | 99.5% | 95.1% |
Yes (45,193 cases) | 8 | 5,615 | 48.5% | 65.8% | 89.2% | 88.1% | 99.7% | 95.2% |
Yes (45,193 cases) | 9 | 5,726 | 34.6% | 49.2% | 91.7% | 91.5% | 99.6% | 95.3% |
Yes (45,193 cases) | 10 | 6,047 | 16.9% | 25.8% | 95.1% | 95.1% | 99.9% | 95.2% |
No (54,807 cases) | 1 | 8,644 | 54.5% | 34.7% | 54.5% | 91.8% | 43.9% | 51.9% |
No (54,807 cases) | 2 | 7,544 | 63.5% | 43.8% | 63.5% | 89.8% | 43.7% | 57.3% |
No (54,807 cases) | 3 | 6,692 | 70.9% | 50.2% | 70.8% | 87.4% | 41.5% | 58.5% |
No (54,807 cases) | 4 | 5,704 | 75.9% | 57.0% | 74.6% | 85.4% | 39.1% | 59.9% |
No (54,807 cases) | 5 | 4,853 | 81.3% | 64.2% | 77.2% | 85.1% | 38.3% | 61.4% |
No (54,807 cases) | 6 | 4,368 | 86.4% | 71.7% | 80.1% | 84.7% | 40.0% | 64.0% |
No (54,807 cases) | 7 | 4,390 | 91.7% | 80.8% | 83.8% | 86.9% | 46.0% | 70.8% |
No (54,807 cases) | 8 | 4,385 | 96.0% | 90.6% | 88.8% | 90.0% | 55.3% | 80.0% |
No (54,807 cases) | 9 | 4,274 | 98.4% | 95.6% | 92.5% | 93.1% | 64.5% | 87.7% |
No (54,807 cases) | 10 | 3,953 | 99.9% | 99.2% | 97.2% | 97.2% | 79.5% | 95.9% |
Following Table 3 of GSK, Table 2 shows inference success rates depending on deciles of the relative MPSD. (Footnote 8: We add a tiny value to the relative MPSD to distribute ties evenly to the deciles.) The table is divided into two parts. The upper one shows success rates when the true location falls within the bounds of the thick null while the lower one shows success rates when the true location does not fall within the bounds of the thick null. Figure 2 visualizes Table 2.
Looking at the upper half of the table, one can observe that the interval-based method and the thick t-test both have low error rates, with the interval-based method showing almost no inference failure for high relative MPSDs. The MESP method has its best success rates for low and high deciles and performs moderately for medium deciles. For the distance-only method the success rate increases considerably as the relative MPSD grows. The conventional and the small-alpha method behave in the opposite way: They perform best for low relative MPSD values and deteriorate as it grows.
If the thick null is false, the conventional and small-alpha method, MESP as well as the thick t-test perform better the higher the relative MPSD becomes. The interval-based method has considerably worse success rates than the other methods in all but the lowest deciles. The distance-only method has decent success rates, especially for low and high deciles. The MESP method performs nearly as well as the conventional t-test, with slightly worse performance in medium and higher deciles. It exhibits inference success rates below 70% only in settings where the relative MPSD is low. In practice, such a setting could be recognized by the researcher before conducting a study if a reliable guess on the population's standard deviation is available.
Overall, these numbers confirm the results in GSK regarding the success rates for different ranges of the relative MPSD. As already shown in Table 1 and Figure 1, the added thick t-test falsely rejects the thick null less often compared to other methods tailored to thick null hypotheses such as the MESP and the distance-only method, but has lower inference success rates if the thick null does not hold.
Measure | Conventional | Small-alpha | MESP | Distance-only | Interval-based | Thick t-test |
---|---|---|---|---|---|---|
Inference success rate | 69.2% | 67.7% | 81.1% | 85.3% | 71.0% | 79.0% |
False positive rate | 41.4% | 27.0% | 11.6% | 19.2% | 0.6% | 4.9% |
False negative rate | 22.1% | 36.7% | 25.0% | 10.9% | 52.5% | 34.2% |
False discovery rate | 30.5% | 26.0% | 11.3% | 15.1% | 1.0% | 5.8% |
False omission rate | 31.4% | 37.9% | 25.5% | 14.1% | 39.0% | 30.4% |
Table 3 and Table 4 are inspired by Figure A7 of GSK. Table 3 shows error rates as well as overall inference success rates, i.e. the share of correct inference decisions over all simulations, with respect to the thick null. With an inference success rate of 69.2% for the conventional and 67.7% for the small-alpha variant, the point t-tests together with the interval-based method (71.0%) have the lowest inference success rates. With 85.3% correct inference indications the distance-only method has the highest overall inference success rate. Of course, in another simulation setting with the thick null being true more often than the 45.2% in our setting (Table 1), the picture might look different. Table 3 also shows that the conventional t-test has the highest false positive rate (41.4%) with respect to the thick null. Lowering the alpha of the t-test to 0.005 reduces the false positive rate to 27.0% at the expense of a higher false negative rate.
Combining the inference indications of the conventional t-test and the distance-only method results in an inference success rate of 81.1% for the MESP method with a false positive rate of 11.6% and a false negative rate of 25.0%. With a false positive rate of 0.6% the interval-based method has the fewest false positives but with 52.5% also the highest false negative rate of all six methods. Due to its construction the thick t-test yields a false positive rate of 4.9% and has a false negative rate that is 18.3 percentage points lower compared to the interval-based method.
The false discovery rate is important from a practical perspective as it is defined as the share of false positive findings among all positive findings (rejections of the null hypothesis). In the context of a thick null hypothesis, this is the probability that there is actually no practically relevant effect in the population if the decision criterion indicates that there is one. In our simulation, the false discovery rate is lowest for the interval-based method, while the false omission rate as its counterpart is lowest for the distance-only method. While the false discovery and false omission rates computed here depend on the share of true thick null hypotheses and can be derived from Table 1, in practice one would have to decide on a prior probability of the thick null being true for a single study.
Table 4 compares the individual methods regarding the false discovery rate and the false omission rate in more detail. In this table, a distinction is made between high, medium and low nominal power. As the rates are crucially dependent on the share of thick null hypotheses being true, the rates are normalized by assuming that the thick null hypothesis is true in half of the cases. This makes comparisons between different nominal power categories possible, so that it can be investigated whether methods improve with increasing nominal power. All rates should be interpreted only relative to each other since, in general, the false discovery rate decreases and the false omission rate increases with more true locations falling beyond the thick null. Thus, a different simulation setting, e.g. with more true locations falling beyond the thick null, would lead to different results.
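The relationship between the error rates and the share of true thick nulls can be made explicit with a short calculation. The Python sketch below applies the standard definitions of the false discovery and false omission rate; the function names are hypothetical, and the inputs are taken from the conventional-method entries of Table 3 and Table 1.

```python
def false_discovery_rate(fpr, fnr, prior_null):
    """Share of false positives among all rejections of the thick null."""
    false_pos = prior_null * fpr
    true_pos = (1 - prior_null) * (1 - fnr)
    return false_pos / (false_pos + true_pos)


def false_omission_rate(fpr, fnr, prior_null):
    """Share of false negatives among all non-rejections of the thick null."""
    false_neg = (1 - prior_null) * fnr
    true_neg = prior_null * (1 - fpr)
    return false_neg / (false_neg + true_neg)


# Conventional method, overall rates from Table 3, with the simulation's
# share of true thick nulls (45,193 of 100,000 cases).
fdr_overall = false_discovery_rate(0.414, 0.221, 0.45193)  # about 30.5%
for_overall = false_omission_rate(0.414, 0.221, 0.45193)   # about 31.4%

# Conventional method, highest nominal power category of Table 1
# (success rates 37.6% and 99.4%), normalized to a 50% share of true nulls.
fdr_high = false_discovery_rate(1 - 0.376, 1 - 0.994, 0.5)  # about 38.6%
for_high = false_omission_rate(1 - 0.376, 1 - 0.994, 0.5)   # about 1.6%
```

Setting `prior_null` to 0.5 is what the normalization amounts to: both rates then depend only on the false positive and false negative rates within a nominal power category.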
The interval-based method has false discovery rates which are well below 5% for all nominal power categories. The thick t-test also exhibits overall low false discovery rates. The first three methods shown in Table 4 do not have the generally desirable property of a decreasing false discovery rate if the nominal power increases. While the MESP has overall medium false discovery rates, the false discovery rate rises when moving from low to medium nominal power. (Footnote 9: Dividing the nominal power categories according to deciles shows that the false discovery rate starts decreasing for a nominal power larger than 70% (results not shown here).) The reason is the aforementioned construction of the MESP as a composite measure of the conventional and the distance-only method, whose false discovery rates move in opposite directions when the nominal power increases.
Regarding the false omission rate, a consistent structure can be recognized across all methods: The false omission rate increases with decreasing nominal power. For high nominal power, the value is lowest for the conventional method (1.6%) and highest for the interval-based method (27.3%). For low nominal power, the distance-only method shows the lowest value of 17.7%.
When considering the false discovery rate and the false omission rate, there is no clear best method even when disregarding the low nominal power category. For medium and high nominal power, many methods have false discovery rates that arguably exceed desirable values. The interval-based method works best in this regard at the expense of a considerable false omission rate even in the highest nominal power category. Hence, the individual context of a study plays a crucial role in the choice of the appropriate decision criterion.
Rate | Nominal power | Cases | Conventional | Small-alpha | MESP | Distance-only | Interval-based | Thick t-test |
---|---|---|---|---|---|---|---|---|
(Normalized) false discovery rate | > 0.80 | 41,543 | 38.6% | 32.6% | 9.1% | 9.1% | 0.5% | 5.3% |
(Normalized) false discovery rate | 0.30 to 0.80 | 27,427 | 21.7% | 9.7% | 17.3% | 19.7% | 1.2% | 7% |
(Normalized) false discovery rate | < 0.30 | 31,030 | 14.2% | 4.2% | 14.2% | 33.5% | 3.7% | 9.5% |
(Normalized) false omission rate | > 0.80 | 41,543 | 1.6% | 5.8% | 7.7% | 7.7% | 27.3% | 13% |
(Normalized) false omission rate | 0.30 to 0.80 | 27,427 | 16% | 26.7% | 17% | 14.4% | 36.5% | 26.7% |
(Normalized) false omission rate | < 0.30 | 31,030 | 32.5% | 39.6% | 32.5% | 17.7% | 38.3% | 34.3% |
In this note, we successfully replicated the results of GSK using our own software code written in R that allows the easy implementation of further decision criteria. We added such a decision criterion, which makes it possible to maintain a pre-defined type I error rate with respect to a thick null hypothesis.
We confirmed that the MESP method as proposed by GSK has comparably low type I and type II error rates with respect to a thick null hypothesis in settings the researcher can check for. Only in low nominal power settings does the MESP method have difficulty detecting a practically relevant effect and, accordingly, a quite high false omission rate. The false discovery rate is moderate in comparison with other decision criteria but has the undesirable property that it does not decrease monotonically with rising nominal power (assuming a fixed prior probability for the thick null being true). If the researcher's aim is to keep the false discovery rate low, the MESP should only be used in high nominal power settings.
The MESP method could also be applied for point null hypothesis testing: As it augments the conventional t-test by taking the minimum practically significant effect size into account, MESP is a simple and effective approach to add an additional layer of protection against false positives without strongly increasing the false negative rate. However, if researchers are serious about testing whether the thick null hypothesis is true, MESP lacks a parameter to adapt the method to contexts with different costs of false negatives and positives with respect to the thick null hypothesis. In contrast, the interval-based method and the thick t-test (assuming the chosen distribution of $\mu$ is approximately true) provide such a parameter, with $\alpha$ being the upper bound of the false positive rate or approximately equal to it, respectively.
The performance of the decision criteria generally depends on whether the thick null hypothesis is true or not. In practice, one can make an informed guess on this probability, i.e. specify a prior probability, or select the decision criterion according to the estimated nominal power and the type of error one wants to avoid. Likewise, for hypothesis testing it is advisable to specify and publish the (thick) null hypothesis, the planned analysis methods, and the decision criterion before conducting the study to reduce options for p-hacking and increase transparency in science. Irrespective of the decision criterion used, every empirical study is subject to limitations and assumptions and thus uncertainty beyond sampling noise. While an extensive debate on statistical inference including dos and don'ts as well as merits and pitfalls can be found in Wasserstein et al. (2019) and the corresponding issue of The American Statistician, we conclude by citing a piece of generic advice to the research community using statistics: "Accept uncertainty. Be thoughtful, open, and modest" (Wasserstein et al., 2019).