Semantic and Cognitive Tools to Aid Statistical Inference: Replace Confidence and Significance by Compatibility and Surprise

September 18, 2019 · Zad R. Chow and Sander Greenland

Researchers often misinterpret and misrepresent statistical outputs. This abuse has led to a large literature on modification or replacement of testing thresholds and P-values with confidence intervals, Bayes factors, and other devices. Because the core problems appear cognitive rather than statistical, we review some simple proposals to aid researchers in interpreting statistical outputs. These proposals emphasize logical and information concepts over probability, and thus may be more robust to common misinterpretations than are traditional descriptions. The latter treat statistics as referring to targeted hypotheses conditional on background assumptions. In contrast, we advise reinterpretation of P-values and interval estimates in unconditional terms, in which they describe compatibility of data with the entire set of analysis assumptions. We use the Shannon transform of the P-value p, also known as the surprisal or S-value s=-log(p), to provide a measure of the information supplied by the testing procedure against these assumptions, and to help calibrate intuitions against simple physical experiments like coin tossing. We also advise tabulating or graphing test statistics for alternative hypotheses, and interval estimates for different percentile levels, to thwart fallacies arising from arbitrary dichotomies. We believe these simple reforms are well worth the minor effort they require.


1 Background

Statistical inference is fraught with psychological as well as technical difficulties, yet far less attention has been given to cognitive problems than to technical minutiae and computational devices [1, 2]. If the issues that plague science could be resolved by mechanical algorithms, statisticians and computer scientists would have disposed of them long ago. But the core problems are those of human psychology and social environment, one in which researchers apply traditional frameworks based on fallacious rationales [1, 3]. These problems have no mathematical or philosophical solution, and instead require attention to the unglamorous task of developing tools, interpretations and terminology more resistant to misstatement and abuse than what tradition has handed down.

We believe that neglect of these problems is a major contributor to the current crisis of statistics in science [4, 5, 6, 7, 8, 9]. Several informal descriptions of statistical formulas may be reasonable when strictly adhered to, but nevertheless lead to severe misinterpretations in practice. Users tend to take these extra leaps and shortcuts, hence we need to anticipate implications of terminology and interpretations to improve training. In doing so, we find it remarkable that the P-value is once again at the center of the controversy, despite the fact that some journals strongly discouraged reporting P-values decades ago [10], and complaints about misinterpretation of statistical significance date back a century [11, 12, 13]. Equally remarkable is the diversity of proposed solutions, ranging from modifications of conventional fixed-cutoff testing [14, 15, 16, 17] to complete abandonment of traditional tests in favor of interval estimates [18, 19, 20] or Bayesian tests [21, 22]; no consensus appears in sight.

While few doubt that some sort of reform is needed, the following crucial points are often overlooked:

  1. There is no universally valid way to analyze data and thus no single solution to the problems at issue.

  2. Careful integration of contextual information and technical considerations will always be essential.

  3. Most researchers are under pressure to produce definitive conclusions, and so will resort to familiar automated approaches and questionable defaults [23], with or without P-values or “statistical significance.”

  4. Most researchers lack the time or skills for re-education, so what is most needed are methods that are simple to acquire quickly based on what is commonly taught, yet are also less vulnerable to common misinterpretation than are traditional approaches (or at least have not yet become as widely misunderstood as those approaches).

Thus, rather than propose abandoning old methods in favor of entirely new methods, we will review some simple cognitive devices, terminologic reforms, and conceptual shifts that encourage more realistic, accurate interpretations of conventional statistical summaries. Specifically, we will advise that:

  • We should replace decisive-sounding, overconfident terms like “significance,” “nonsignificance” and “confidence interval,” as well as proposed replacements like “uncertainty interval,” by more modest descriptors such as “low compatibility,” “high compatibility” and “compatibility interval” [24, 25, 26].

  • We should teach alternate ways to view P-values and intervals via information measures such as S-values (surprisals), which are the negative logarithms of the P-values; these measures facilitate translation of statistical test results into results from simple physical experiments [25].

  • For quantities targeted for study, we should replace single P-values, S-values, and interval estimates by tables or graphs of P-values or S-values showing results for relevant alternative hypotheses as well as for null hypotheses.

  • We should from the start teach that the usual interpretations of statistical outputs are often misleading even when they are technically accurate. This is because they condition on background assumptions (i.e., they treat them as given), and thus they ignore what may be serious uncertainty about those assumptions. This deficiency can be most directly and nontechnically addressed by treating those assumptions unconditionally, shifting their logical status from what is assumed to part of what is tested.

We have found that the last recommendation (to decondition inferences [25]) is the most difficult for most readers to comprehend, and is even resisted and misrepresented by some with extensive credentials in statistics. Thus, to keep the present paper of manageable length we have written a companion piece, Greenland & Chow, 2019 [27], which explains in depth the rationale for de-emphasizing traditional conditional interpretations in favor of unconditional interpretations.

2 An Example

We will display problems and recommendations with published results from a record-based cohort study of serotonergic antidepressant prescriptions during pregnancy and subsequent autism spectrum disorder (ASD) of the child (Brown et al. [28]). Out of 2,837 pregnancies that had filled prescriptions, approximately 2% of the children were diagnosed with ASD. The paper first reported an adjusted ratio of ASD rates (hazard ratio or HR) of 1.59 when comparing mothers with and without the prescriptions, and 95% confidence limits (CI) of 1.17 and 2.17. This estimate was derived from a proportional-hazards model which included maternal age, parity, calendar year of delivery, neighborhood income quintile, resource use, psychotic disorder, mood disorder, anxiety disorder, alcohol or substance use disorder, use of other serotonergic medications, psychiatric hospitalization during pregnancy, and psychiatric emergency department visit during pregnancy.

The paper then presented an analysis with adjustment based on a high-dimensional propensity score (HDPS), in which the estimated hazard ratio became 1.61 with a 95% CI spanning 0.997 to 2.59. Despite the estimated 61% increase in the hazard rate in the exposed children and an interval estimate including ratios as large as 2.59 and no lower than 0.997, the authors still declared that there was no association between in utero serotonergic antidepressant exposure and ASD because it was not “statistically significant.” This was a misinterpretation of their own results insofar as an association was not only present [29, 30], but was also quite close to the 70% increase they reported from other studies [31]. Yet the media simply repeated Brown et al.’s misstatement that there was no association after adjustment [32].

Such misreporting remains common, despite increasing awareness that such dichotomous thinking is detrimental to sound science and ongoing efforts to retire statistical significance [22, 24, 30, 33, 34, 35]. To aid these efforts, we will explain the importance of showing results for a range of hypotheses, which helps readers see why conclusions such as in Brown et al. [28, 32] represent dramatic misinterpretations of statistics – even though the reported numeric summaries are correct. We intend our discussion to make clear why it would be correct to instead have reported that “After HDPS adjustment for confounding, a 61% hazard elevation remained; however, under the same model, every hypothesis from no elevation up to a 160% hazard increase had p > 0.05; thus, while quite imprecise, these results are most consistent with previous observations of a positive association between antidepressant exposure and subsequent ASD (although the association may be partially or wholly due to uncontrolled biases).”

3 Making Sense of Tests

3.1 The P-value as a compatibility measure

The infamous observed P-value p (originally called the observed or attained “level of significance” or value of P [36, 37]) is a measure of compatibility between the observed data and a targeted test hypothesis H, given a set of background assumptions (the background model) used along with the hypothesis to compute the P-value from the data. By far the most common example of a test hypothesis H is a traditional null hypothesis, such as “there is no association” or (more ambitiously) “there is no treatment effect.” In some books this null hypothesis is the only test hypothesis ever mentioned. Nonetheless, the test hypothesis H could just as well be “the treatment doubles risk” or “the treatment halves risk” or any other hypothesis of practical interest [38]; we will argue that such alternatives to the null should also be tested whenever the traditional null hypothesis is tested.

With this general background about the test hypothesis, the other key ingredient in traditional statistical testing is a test statistic, such as a z-score or χ² statistic, which measures the discrepancy between the observed data and what would have been expected under the test hypothesis, given the background assumptions. We can now define an observed P-value p as the probability of the test statistic being at least as extreme as observed if the hypothesis H targeted for testing and every assumption used to compute the P-value (the test hypothesis H and the background statistical model) were correct [38]. This accurate description does not accord well with human psychology, however: It is often said that researchers want a probability for the targeted test hypothesis (a posterior probability of H), not a probability of observations. This imperative is indicated by the many “intuitive” – and incorrect – verbal definitions and descriptions of the P-value that amount to calling it the probability of the test hypothesis, which is usually quite misleading [38]. Such errors are often called inversion fallacies because they invert the role of the observations and the hypothesis in defining the P-value (which is a probability for the observed test statistic, not the test hypothesis).

A standard criterion for judging whether a P-value is useful for testing is that all possible values for it from zero to one are equally likely (uniform in probability) if the test hypothesis and background assumptions are correct. In fact, some authors regard a P-value as useless unless it comes close to meeting this uniformity criterion, and consider a test to be valid only if its P-value meets the criterion [39, 40, 41]. With this validity criterion met, we can correctly describe the P-value as the percentile at which the observed test statistic falls in the distribution for the test statistic under the test hypothesis and the background assumptions [42]. The purpose of this description is to connect the P-value to a familiar object, the percentile at which someone’s score fell on a standard test (e.g., a college or graduate admissions examination), as opposed to the abstract (and too easily inverted) probability definition.
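To make the uniformity and percentile descriptions concrete, the following minimal Python sketch (our own illustrative code, with arbitrary simulation settings) generates data that exactly satisfy a simple test model and checks that the resulting two-sided P-values are close to uniform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 10_000, 30

# Data generated exactly under the test model: independent Normal(0, 1) samples,
# with test hypothesis H: mean = 0 and the background model correct.
zs = rng.normal(0.0, 1.0, size=(n_sims, n)).mean(axis=1) * np.sqrt(n)
ps = 2 * stats.norm.sf(np.abs(zs))   # two-sided P-values from a z-test

# With H and the background model correct, the P-values are near-uniform on (0, 1):
print(np.quantile(ps, [0.25, 0.50, 0.75]).round(2))   # roughly 0.25, 0.50, 0.75

# Percentile reading: |z| = 1.96 sits near the 95th percentile of |Z| under the
# test model, so its two-sided P-value is about 0.05.
print(round(2 * stats.norm.sf(1.96), 3))
```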

3.2 The S-value

Even when P-values are correctly defined and valid, their scaling can be deceptive due to their compression into the interval from 0 to 1, with vastly different meanings for absolute differences in P-values near 1 and the same differences for P-values near 0 [25], as we will describe below. One way to reduce test misinterpretations and provide more intuitive numerical results is to take the negative logarithm of the P-value, s = −log(p), known as the Shannon information, surprisal, logworth, or S-value from the test [25, 43, 38]. The S-value is designed to avert incorrect intuitive interpretations of statistics that dominate research reports by providing a measure of information supplied by the test against the test hypothesis H [25].
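In code, the transformation from P-value to S-value is a single line; a minimal sketch (the function name is ours), evaluated at P-values discussed below:

```python
import math

def s_value(p: float) -> float:
    """Shannon information (surprisal) against the test hypothesis, in bits."""
    return -math.log2(p)

# P-values used in the scale comparisons below:
for p in (0.9999, 0.90, 0.10, 0.05, 0.0001):
    print(f"p = {p:<7} ->  s = {s_value(p):.2f} bits")
```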

When p is obtained from standard statistical tests, this information can only be against the test hypothesis because there is no way to distinguish among the infinitude of models that would have the same or larger P-value and hence the same or smaller S-value. This limitation reflects the fact that there is no way the data can support a test hypothesis without adding what may be implausible assumptions to drastically narrow down possible models (e.g., no bias or error in any background assumption). Furthermore, the S-value measures only the information supplied by a particular test when applied to the data. A different test based on different background assumptions will usually produce a different P-value and thus a different S-value; thus it would be a mistake to simply call the S-value “the information against the hypothesis supplied by the data.”

The S-value provides an absolute scale on which to view the information provided by a statistical test, as measured by calibrating P against a physical mechanism that produces data. For a mechanism that produces binary outcomes (e.g., yes/no; positive/negative; up/down; right/left; heads/tails), we express the S-value on the base-2 log scale, in which case the units of measurement are called bits (short for binary digits) or shannons. As an example, suppose we toss a coin and label 1 = heads, 0 = tails. We can then think of one bit as the amount of information provided by seeing heads on one toss against the hypothesis that the toss is fair vs. loaded (biased) for heads. Now suppose we do k independent tosses of this coin and they all come up heads. Then the total amount of information against fairness provided by this outcome (seeing k heads in a row) is k bits. Thus 4 heads in a row supplies 4 bits of information against fairness in the direction of loading for heads.

Other units for measuring information arise from different choices for the base of the logarithms. For example, using natural (base-e) logs, the S-value units are called nats, while using base-10 logs, the units are called hartleys, bans or dits (decimal digits). The ratio of one dit of information to one bit of information is 3.32, which is similar to the ratio of meters to feet, 3.28. Just as the choice of meters vs. feet does not affect the concepts and methods surrounding length measurement, so the choice of dits vs. bits does not affect any of the concepts or methods of information measurement. Bits are most commonly used however because the fundamental physical components in electronic information storage are binary and thus their information capacity is one bit. They also correspond to the information supplied by exactly one binary measurement on a unit, e.g., biologic sex (male/female).
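The choice of unit is likewise just a change of logarithm base; a small sketch:

```python
import math

p = 0.05
bits = -math.log2(p)     # shannons (base-2)
nats = -math.log(p)      # natural-log units
dits = -math.log10(p)    # hartleys / bans / decimal digits

print(round(bits, 2), round(nats, 2), round(dits, 2))   # 4.32, 3.0, 1.3
print(round(bits / dits, 2))                            # log2(10) = 3.32 bits per dit
```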

Figure 1: Comparison of P-value and S-value scales. Top labels: Data compatibility with test model as measured by P-values. Bottom labels: Information against test model as measured by the corresponding S-values.

3.3 Using the S-value

With the S-value in hand, a cognitive difficulty of the P-value scale for evidence can be seen by first noting that the difference in the evidence provided by P-values of 0.9999 and 0.90 is trivial: Both represent almost no information against the test hypothesis, in that the corresponding S-values are −log₂(0.9999) = 0.00014 bits and −log₂(0.90) = 0.15 bits. Both are far less than 1 bit of information against the hypothesis – they are just a fraction of a coin toss apart. In contrast, the information against the test hypothesis in P-values of 0.10 and 0.0001 is profoundly different, in that the corresponding S-values are −log₂(0.10) = 3.32 bits and −log₂(0.0001) = 13.29 bits; thus p = 0.0001 provides 10 bits more information against the test hypothesis than does p = 0.10, corresponding to the information provided by 10 additional heads in a row. The contrast is illustrated in Figure 1, along with other examples of the scaling difference between P-values and S-values.


As an example of this perspective on reported results, using the point and interval estimate from the HDPS analysis reported by Brown et al. [28], we calculated that the P-value for the “null” test hypothesis H that the hazard ratio is 1 (no association) is 0.0505. Using the S-value to measure the information supplied by the HDPS analysis against this hypothesis, we get s = −log₂(0.0505) = 4.31 bits; this is hardly more than 4 coin tosses worth of information against no association. For comparison, taking instead the test hypothesis H to be that the hazard ratio is 2 (doubling of the hazard among the treated), we calculated a P-value of 0.373. The information supplied by the HDPS analysis against this test hypothesis is then measured by the S-value as s = −log₂(0.373) = 1.42 bits, hardly more than a coin-toss worth of information against doubling of the hazard among the treated. In these terms, then, the HDPS results supply roughly 3 bits more information against the test hypothesis of no association than against doubling of the hazard, so that doubling is more compatible with the analysis results than is no association.

S-values can also help us understand objections to comparing P-values to sharp dichotomies. Consider that a P-value of 0.06 yields about 4 bits of information against the test hypothesis H, while a P-value of 0.03 yields about 5 bits of information against H. Thus, p = 0.06 is about as surprising as getting all heads on four fair coin tosses, while p = 0.03 is one toss (one bit) more surprising. Even if one is committed to making a decision based on a sharp cutoff, S-values illustrate what range around that cutoff corresponds to a trivial information difference (e.g., any P-value between 0.025 and 0.10 is less than a coin-toss difference in evidence from p = 0.05).
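These numbers can be reproduced from the published point estimate and 95% limits using the normal approximation described in the Appendix; the following is a sketch in Python (our own code, not the authors’):

```python
import numpy as np
from scipy import stats

# Reported HDPS results from Brown et al.: HR = 1.61, 95% CI 0.997 to 2.59.
est = np.log(1.61)
se = (np.log(2.59) - np.log(0.997)) / (2 * 1.96)   # approximate SE of ln(HR)

def p_and_s(hr_hypothesis: float):
    """Two-sided P-value and S-value (bits) for the test hypothesis H: HR = hr_hypothesis."""
    z = (est - np.log(hr_hypothesis)) / se
    p = 2 * stats.norm.sf(abs(z))
    return p, -np.log2(p)

print(p_and_s(1.0))   # ~ (0.051, 4.3 bits): little information against no association
print(p_and_s(2.0))   # ~ (0.37, 1.4 bits): even less against a doubled hazard
```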

S-values can also help researchers understand more subtle problems with traditional testing. Consider for example the import of the magical 0.05 threshold (α-level) that is used to declare associations present or absent. It has often been claimed that this threshold is too high to be regarded as representing much evidence against H [21], but the arguments for that claim are usually couched in Bayesian terms of which many are skeptical. We can however see those objections to 0.05 straightforwardly by noting that the threshold translates into requiring an S-value of only −log₂(0.05) = 4.32 bits of information against the null; that means p = 0.05 is barely more surprising than getting all heads on 4 fair coin tosses. While 4 heads in a row may seem surprising to some intuitions, it corresponds to doing only 4 tosses to study the coin, and so may call those intuitions into question given how little information is contained in a P-value of 0.05, especially in relation to the trillion or so bits of information available on laptop hard drives. As always, further crucial information will be given by P-values and S-values tabulated for several alternative hypotheses, interval estimates over varying percentiles, and graphs of data and information summaries such as those illustrated below.

4 Advantages of S-values

Unlike probabilities, S-values are unbounded above and are additive over independent information sources, thus providing a scale for comparing test results across hypotheses that is aligned with information rather than probability measurement [25]. The S-values for testing the same hypothesis from independent studies can thus be summed to provide a measure of the total information against the hypothesis [25, 36]. Another advantage of S-values is that they help thwart inversion fallacies, in which a P-value is misinterpreted as a probability of a hypothesis being correct (or equivalently, as the probability that a statement about the hypothesis is in error). Such probabilities, when computed using the data, are called posterior probabilities (because they come after the data). It is difficult to confuse an S-value with a posterior probability because the S-value is unbounded above, and in fact will be above 1 whenever the P-value is below 0.50.
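As a small illustration of this additivity (with made-up P-values purely for example):

```python
import math

# Hypothetical P-values from independent studies testing the same hypothesis H.
p_values = [0.10, 0.20, 0.04]
s_values = [-math.log2(p) for p in p_values]   # 3.32, 2.32, 4.64 bits

# Total information against H from the three studies combined, in bits.
# Note: this sum is a total-information measure, not the S-value of a combined test.
print(round(sum(s_values), 1))   # ~10.3 bits
```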

Probabilities of data given hypotheses and probabilities of hypotheses given data are identically scaled, and users inevitably conflate P-values with posterior probabilities. This confusion dominates observed misinterpretations [38] and is invited with open arms by “significance” and “confidence” terminology. Such fallacies could be avoided by giving actual posterior probabilities along with P-values. Bayesian methods provide such probabilities, but require prior distributions as input; in turn, those priors require justifications that will satisfy most readers. While the task of creating such distributions can be instructive, this extra input burden has greatly deterred adoption of Bayesian methods; in contrast, S-values provide a direct quantification of information without this input.

 

 

Table 1: P-values, S-values, Maximum-Likelihood Ratios, and Likelihood-Ratio Statistics for Various Test Hypotheses About the Hazard Ratio (HR), Computed from Brown et al. [28] HDPS results. (Computed from the normal approximations given in the Appendix.)

Test Hypothesis (H)                P-value          S-value   Maximum-           Likelihood-
                                   (compatibility)  (bits)    Likelihood Ratio   Ratio Statistic
Halving of hazard (HR = 0.5)       < 0.001          19.3                         23.1
No association (null) (HR = 1)     0.05              4.31     6.77                3.82
Point estimate (HR = 1.61)         1                 0.00     1.00                0.00
Doubling of hazard (HR = 2)        0.37              1.42     1.49                0.79
Tripling of hazard (HR = 3)        0.01              6.56     26.2                6.53
Quintupling of hazard (HR = 5)     < 0.001          18.2                         21.7
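The table entries can be reproduced, up to rounding, from the normal approximation described in the Appendix; the sketch below (our own code) also computes the very small P-values and large maximum-likelihood ratios for the halving and quintupling hypotheses, whose exact values are not shown in the table:

```python
import numpy as np
from scipy import stats

est = np.log(1.61)
se = (np.log(2.59) - np.log(0.997)) / (2 * 1.96)

for label, hr in [("halving", 0.5), ("null", 1.0), ("point estimate", 1.61),
                  ("doubling", 2.0), ("tripling", 3.0), ("quintupling", 5.0)]:
    z = (est - np.log(hr)) / se
    p = 2 * stats.norm.sf(abs(z))   # P-value (compatibility)
    s = -np.log2(p)                 # S-value in bits
    lr = z ** 2                     # likelihood-ratio (deviance) statistic, 1 df
    mlr = np.exp(lr / 2)            # maximum-likelihood ratio
    print(f"HR = {hr:<4} ({label}): p = {p:.2g}, s = {s:.2f}, MLR = {mlr:.3g}, LR = {lr:.2f}")
```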

In summary, the S-value provides a gauge of the information supplied by a statistical test, with a simple counting interpretation in familiar terms of coin tosses. It thus complements the probability interpretation of a P-value by supplying a mechanism that is easy to do both thought and physical experiments with. Given amply documented human tendencies to underestimate the frequency of seemingly “unusual” events [47], these experiments may improve intuitions about what are reasonable frequencies to expect and thus what evidence strength a given P-value actually represents.

5 Tests of Different Values for a Parameter vs. Tests of Different Parameters

Even if all background assumptions hold, no single number (whether a P-value, S-value, or point estimate) can by itself provide an adequate sense of uncertainty about a targeted parameter, such as a mean difference, a hazard ratio (HR), or some other contrast across treatment groups. We have thus formulated our description to allow the test hypothesis H to refer to different values for the same parameter. For example, H could be “HR = 1”, the traditional null hypothesis of no change in hazard across compared groups; but H could just as well be “HR = 2”, a doubling of the hazard, or “HR = 0.5”, a halving of the hazard. In all these variations the set of auxiliary assumptions (background model) used to compute the statistics stays unchanged; only H is changing. Unconditionally, the S-values for the different H are measuring information against different test models, although the only difference between the models is the value stated in the targeted hypothesis H; all other assumptions are the same.

A similar comment applies when, in a model, we test different coefficients: The background assumptions are unchanged, only the targeted test hypothesis H is changing, although now the change is to another parameter (rather than another value for the same parameter). For example, in a model for effects of cancer treatments we might compute the P-value and S-value from a test of H = “the coefficient of radiotherapy is zero” and another P-value and S-value from a test of H = “the coefficient of chemotherapy is zero.” Conditionally, these two S-values are giving information against different target hypotheses H using the same background model; for example, using a proportional-hazards model, that background includes the assumption that the effects of different treatments on the hazard multiply together to produce the total effect of all treatments combined. Unconditionally, these S-values are measuring information against different test models, one a model with no effect of radiotherapy but allowing an effect of chemotherapy, the other allowing an effect of radiotherapy but no effect of chemotherapy, with all other assumptions the same in both models.

Testing different parameters with the same data raises issues of multiple comparisons (also known as simultaneous inference). These issues are very complex and controversial, with opinions about multiple-comparison adjustment ranging from complete dismissal of adjustments to demands for mindless, routine use. These issues extend far beyond the present scope; see Greenland & Hofman, 2019 [48] and Greenland, 2020 [49] for a recent commentary and a review, respectively. We can only note here that the devices we recommend can also be applied to adjusted comparisons, where the S-value computed from an adjusted P-value becomes the information against a hypothesis penalized (reduced) to account for multiplicity, as illustrated in the sketch below.
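For a concrete (and deliberately simple) illustration of this penalization, suppose a Bonferroni adjustment is used; the choice of adjustment here is ours, for example only. The penalized S-value is then just the S-value of the adjusted P-value:

```python
import math

def penalized_s_value(p: float, n_comparisons: int) -> float:
    """S-value (bits) computed from a Bonferroni-adjusted P-value."""
    p_adj = min(1.0, p * n_comparisons)
    return -math.log2(p_adj)

print(round(penalized_s_value(0.01, 1), 1))    # ~6.6 bits, unadjusted
print(round(penalized_s_value(0.01, 10), 1))   # ~3.3 bits after penalizing for 10 comparisons
```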

6 Replace Unrealistic Confidence Claims with Compatibility Measures

Confidence intervals (commonly abbreviated as CI) have been widely promoted as a solution to the problems of statistical misinterpretation [18, 20]. While we support their presentation, such intervals have difficulties of their own. The major problem with “confidence” is that it encourages the common confusion of the CI percentile level (typically 95%) with the probability that the true value is in the interval (mistaking the CI for a Bayesian posterior interval) [38], as in statements such as “we are 95% confident that the true value is within the interval.”

The fact that “confidence” refers to the procedure, not the reported interval, seems to be lost on most researchers; remarking on this subtlety, when Jerzy Neyman discussed his confidence concept in 1934, Arthur Bowley said, “I am not at all sure that the ’confidence’ is not a confidence trick” [50]. And indeed, forty years later, Cox and Hinkley [51] warned, “interval estimates cannot be taken as probability statements about parameters, and foremost is the interpretation ‘such and such parameter values are consistent with the data.’ ” Unfortunately, the word “consistency” is used for several other concepts in statistics, while in logic it refers to an absolute condition (of noncontradiction); thus, its use in place of “confidence” would risk further confusion.

To address this problem, we exploit the fact that a 95% CI summarizes the results of varying the test hypothesis H over a range of parameter values, displaying all values for which P > 0.05 [52] and hence S < 4.32 bits [25, 53]. Thus, conditional on the background assumptions, the CI contains a range of parameter values that are more compatible with the data than are values outside the interval [25, 38]. Unconditionally (and thus even if the background assumptions are uncertain), the interval shows the values of the parameter which, when combined with the background assumptions, produce a test model that is “highly compatible” with the data in the sense of having less than 4.32 bits of information against it. We thus refer to CIs as compatibility intervals rather than confidence intervals [25, 53, 26]; their abbreviation remains “CI.”

Another problem is that a CI is often used as nothing more than a null-hypothesis significance test (NHST), by declaring that the null parameter value (e.g., HR = 1) is supported if it is inside the interval, or refuted if it is outside the interval. Such use defeats the use of interval estimates for indicating uncertainty about the parameter and perpetuates the fallacy that information changes abruptly across decision boundaries [34, 38, 53, 54].

In particular, the usual 95% default forces the user’s focus onto parameter values that yield p > 0.05, without regard to the trivial difference between (say) p = 0.06 and p = 0.04 (a difference not even worth a coin toss). To address this problem, we first note that a 95% interval estimate is only one of many arbitrary dichotomizations of parameter values (into those inside versus outside an interval). A more accurate picture of uncertainty is then obtained by examining intervals at other percentiles, e.g., proportionally-spaced compatibility levels such as p > 0.25, 0.05, 0.01, which correspond to 75%, 95%, 99% CIs and equally-spaced S-values of s < 2, 4.32, 6.64 bits. When a detailed picture is desired, a table or graph of all the P-values and S-values across a broad range of parameter values seems the clearest way to see how compatibility varies smoothly across the values.
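Under the same normal approximation used for the example above (a sketch with our own code), intervals at several compatibility levels are obtained simply by varying the critical value:

```python
import numpy as np
from scipy import stats

est = np.log(1.61)
se = (np.log(2.59) - np.log(0.997)) / (2 * 1.96)

# 75%, 95%, 99% compatibility intervals: parameter values with p > 0.25, 0.05, 0.01,
# i.e., with less than 2, 4.32, 6.64 bits of information against the test model.
for level in (0.75, 0.95, 0.99):
    zcrit = stats.norm.ppf(0.5 + level / 2)
    lo, hi = np.exp(est - zcrit * se), np.exp(est + zcrit * se)
    print(f"{level:.0%} CI for HR: {lo:.2f} to {hi:.2f}")
# The 95% interval approximately reproduces the reported limits (0.997, 2.59).
```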

Figure 2: S-values (surprisals) for a range of hazard ratios. An information graph in which S-values are plotted across alternative hazard ratios. Computed from results in Brown et al. [28]. HR = 1 represents no association.

Figure 2 illustrates how misleading it is to frame discussion in terms of whether P is above or below 0.05, or whether the null value is included in the 95% CI: Under the HDPS analysis, every hazard ratio from 1 to 2.58 has higher compatibility with the Brown et al. data [28], and less information against it, than does the null value of 1. Thus, as the graphs make clear, the analysis provides absolutely no basis for claiming the study found “no association.” Instead, the analysis exhibits an association similar to that seen in earlier studies and should have been reported as such, even though it leaves open the question of what caused the association (e.g., a drug effect, a bias, a positive random error, or some combination).
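A graph like Figure 2 can be approximated directly from the published estimate and limits; the following sketch (our own code, assuming the Appendix’s normal approximation and using matplotlib only for display) traces the S-value function across hazard ratios:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

est = np.log(1.61)
se = (np.log(2.59) - np.log(0.997)) / (2 * 1.96)

hr = np.linspace(0.8, 4.0, 400)           # grid of test hypotheses H: HR = h
z = (est - np.log(hr)) / se
p = 2 * stats.norm.sf(np.abs(z))          # compatibility (P-value) function
s = -np.log2(p)                           # information (S-value) function

plt.plot(hr, s)
plt.axvline(1.0, linestyle="--")          # null value, for reference
plt.xlabel("Hazard ratio (HR)")
plt.ylabel("S-value (bits) against H: HR = h")
plt.show()
```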

7 Moving Forward

Most efforts to reform statistical reporting have promoted interval estimates [20, 65] or Bayesian methods [21] over P-values. There is nonetheless scant empirical evidence that these or any proposals (including ours) have improved or will improve reporting without accompanying editorial and reviewer efforts to enforce proper interpretations. Instead, the above example and many others [25, 66, 67] illustrate how, without proper editorial monitoring, interval estimates are often of no help and can even be harmful when journals require dichotomous interpretation of results, for example as does JAMA [68].

We nonetheless argue that such examples show it is imperative to implement simple reforms to traditional terms and interpretations, for those traditions have encouraged numerous overinterpretations and misinterpretations, especially when there is doubt about conventional assumptions. Overconfident terms like “significance,” “confidence,” and “severity” [17] and decisive interpretations [14] can be easily replaced with more cautiously graded unconditional compatibility descriptions; narrowly compressed probabilities like P-values can be supplemented with additive-information concepts like S-values; and requests can be made for tables or graphs of P-values and S-values for multiple alternative hypotheses, rather than forcing focus onto a null hypothesis [25, 34, 56, 58]. These reforms need to be given a serious chance via editorial encouragement in both review and instructions to authors.

Acknowledgements

We are most grateful for the generous comments and criticisms on our initial drafts offered by Andrew Althouse, Valentin Amrhein, Darren Dahly, Frank Harrell, John Ioannidis, Daniël Lakens, Nicole Lazar, Gregory Lopez, Oliver Maclaren, Blake McShane, Tim Morris, Keith O’Rourke, Kristin Sainani, Allen Schirm, Philip Stark, Andrew Vigotsky, Jack Wilkinson, and Corey Yanofsky. We also thank Karen Pendergrass for her help in producing the figures in this paper. Our acknowledgment does not imply endorsement of all our views by all of these colleagues, and we remain solely responsible for the views expressed herein.

Authors’ Contributions

Both authors wrote the first draft and revised the manuscript, read and approved the submitted manuscript, and have agreed to be personally accountable for their own contributions related to the accuracy and integrity of any part of the work.

Abbreviations

ASD: Autism spectrum disorder; CI: Compatibility/confidence interval; HDPS: High-dimensional propensity score; HR: Hazard ratio; LI: Likelihood interval; LR: Likelihood ratio; MLR: Maximum-likelihood ratio; NHST: Null-hypothesis significance test; S-value: Surprisal (Shannon-information) value

Data and Materials

The datasets generated and analyzed in the current paper are available in the Open Science Framework | DOI: 10.17605/OSF.IO/6W8G9 and on figshare | DOI: 10.6084/m9.figshare.9202214.v3.

Funding

This work was produced with no funding.

Competing Interests

The authors declare that they have no competing interests.

Chow ZR, Greenland S. Semantic and Cognitive Tools to Aid Statistical Inference: Replace Confidence and Significance by Compatibility and Surprise. arXiv: [stat]. 2019 Sep;.

References

  • [1] Greenland S. Invited Commentary: The Need for Cognitive Science in Methodology. Am J Epidemiol. 2017 Sep;186(6):639–645.
  • [2] Gigerenzer G. Mindless Statistics. J Socio-Econ. 2004 Nov;33(5):587–606.
  • [3] Stark PB, Saltelli A. Cargo-Cult Statistics and Scientific Crisis. Significance. 2018 Aug;15(4):40–43.
  • [4] Simmons JP, Nelson LD, Simonsohn U. False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychol Sci. 2011 Nov;22(11):1359–1366.
  • [5] Open Science Collaboration. Estimating the Reproducibility of Psychological Science. Science. 2015 Aug;349(6251):aac4716.
  • [6] Freedman LP, Cockburn IM, Simcoe TS. The Economics of Reproducibility in Preclinical Research. PLoS Biol. 2015 Jun;13(6):e1002165.
  • [7] Camerer CF, Dreber A, Forsell E, Ho TH, Huber J, Johannesson M, et al. Evaluating Replicability of Laboratory Experiments in Economics. Science. 2016 Mar;351(6280):1433–1436.
  • [8] Lash TL, Collin LJ, Van Dyke ME. The Replication Crisis in Epidemiology: Snowball, Snow Job, or Winter Solstice? Curr Epidemiol Rep. 2018 Jun;5(2):175–183.
  • [9] Cassidy SA, Dimova R, Giguère B, Spence JR, Stanley DJ. Failing Grade: 89% of Introduction-to-Psychology Textbooks That Define or Explain Statistical Significance Do so Incorrectly. Adv Methods Pract Psychol Sci. 2019 Jun;.
  • [10] Lang JM, Rothman KJ, Cann CI. That Confounded P-Value. Epidemiology. 1998 Jan;9(1):7–8.
  • [11] Pearson K. V. Note on the Significant or Non-Significant Character of a Sub-Sample Drawn from a Sample. Biometrika. 1906 Oct;5(1-2):181–183.
  • [12] Boring EG. Mathematical vs. Scientific Significance. Psychol Bull. 1919;16(10):335–338.
  • [13] Tyler RW. What Is Statistical Significance? Educ Res Bull. 1931;10(5):115–142.
  • [14] Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E, Berk R, et al. Redefine Statistical Significance. Nat Hum Behav. 2017 Sep;2(1):6–10.
  • [15] Lakens D, Adolfi FG, Albers CJ, Anvari F, Apps MAJ, Argamon SE, et al. Justify Your Alpha. Nat Hum Behav. 2018 Feb;2(3):168–171.
  • [16] Lakens D, Scheel AM, Isager PM. Equivalence Testing for Psychological Research: A Tutorial. Adv Methods Pract Psychol Sci. 2018 Jun;1(2):259–269.
  • [17] Mayo DG. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press; 2018.
  • [18] Rothman KJ. A Show of Confidence. N Engl J Med. 1978 Dec;299(24):1362–1363.
  • [19] Bland JM, Altman DG. Measuring Agreement in Method Comparison Studies. Stat Methods Med Res. 1999 Jun;8(2):135–160.
  • [20] Cumming G. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge; 2012.
  • [21] Sellke T, Bayarri MJ, Berger JO. Calibration of Values for Testing Precise Null Hypotheses. Am Stat. 2001 Feb;55(1):62–71.
  • [22] Goodman SN. Introduction to Bayesian Methods I: Measuring the Strength of Evidence. Clin Trials. 2005;2(4).
  • [23] Wang MQ, Yan AF, Katz RV. Researcher Requests for Inappropriate Analysis and Reporting: A U.S. Survey of Consulting Biostatisticians. Ann Intern Med. 2018 Oct;169(8):554.
  • [24] Amrhein V, Greenland S, McShane B. Scientists Rise up against Statistical Significance. Nature. 2019 Mar;567(7748):305.
  • [25] Greenland S. Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution with S-Values. Am Stat. 2019 Mar;73(sup1):106–114.
  • [26] Greenland S. Are Confidence Intervals Better Termed “Uncertainty Intervals”? No: Call Them Compatibility Intervals. BMJ. 2019 Sep;366:l5381.
  • [27] Greenland S, Chow ZR. To Aid Statistical Inference, Emphasize Unconditional Descriptions of Statistics. arXiv: [stat]. 2019 Sep;.
  • [28] Brown HK, Ray JG, Wilton AS, Lunsky Y, Gomes T, Vigod SN. Association between Serotonergic Antidepressant Use during Pregnancy and Autism Spectrum Disorder in Children. JAMA. 2017 Apr;317(15):1544–1552.
  • [29] McShane BB, Gal D. Statistical Significance and the Dichotomization of Evidence. J Am Stat Assoc. 2017 Jul;112(519):885–895.
  • [30] McShane BB, Gal D, Gelman A, Robert C, Tackett JL. Abandon Statistical Significance. Am Stat. 2019 Mar;73(sup1):235–245.
  • [31] Brown HK, Hussain-Shamsy N, Lunsky Y, Dennis CLE, Vigod SN. The Association between Antenatal Exposure to Selective Serotonin Reuptake Inhibitors and Autism: A Systematic Review and Meta-Analysis. J Clin Psychiatry. 2017 Jan;78(1):e48–e58.
  • [32] Yasgur B. Antidepressants in Pregnancy: No Link to Autism, ADHD [Medical Website]; 2017. http://www.medscape.com/viewarticle/878948.
  • [33] Wasserstein RL, Schirm AL, Lazar NA. Moving to a World beyond “p < 0.05”. Am Stat. 2019 Mar;73(sup1):1–19.
  • [34] Poole C. Beyond the Confidence Interval. Am J Public Health. 1987 Feb;77(2):195–199.
  • [35] Altman DG, Bland JM. Absence of Evidence Is Not Evidence of Absence. BMJ. 1995 Aug;311(7003):485.
  • [36] Fisher RA. Statistical Methods for Research Workers. Oliver and Boyd: Edinburgh; 1925.
  • [37] Pearson K. X. On the Criterion That a given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling. Lond Edinb Dublin Philos Mag J Sci. 1900;50(302):157–175.
  • [38] Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations. Eur J Epidemiol. 2016 Apr;31(4):337–350.
  • [39] Bayarri MJ, Berger JO. P Values for Composite Null Models. J Am Stat Assoc. 2000;95(452):1127–1142.
  • [40] Robins J, Wasserman L. Conditioning, Likelihood, and Coherence: A Review of Some Foundational Concepts. J Am Stat Assoc. 2000 Dec;95(452):1340–1346.
  • [41] Kuffner TA, Walker SG. Why Are P-Values Controversial? Am Stat. 2019 Jan;73(1):1–3.
  • [42] Perezgonzalez JD. P-Values as Percentiles. Commentary on: “Null Hypothesis Significance Tests. A Mix–up of Two Different Theories: The Basis for Widespread Confusion and Numerous Misinterpretations”. Front Psychol. 2015;6.
  • [43] Good IJ. The Surprise Index for the Multivariate Normal Distribution. Ann Math Stat. 1956;27(4):1130–1135.
  • [44] Cox DR, Donnelly CA. Principles of Applied Statistics. Cambridge University Press; 2011.
  • [45] Cummings P. Analysis of Incidence Rates. CRC Press; 2019.
  • [46] Burnham KP, Anderson DR. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd ed. New York: Springer-Verlag; 2002.
  • [47] Hand DJ. The Improbability Principle: Why Coincidences, Miracles, and Rare Events Happen Every Day. Macmillan; 2014.
  • [48] Greenland S, Hofman A. Multiple Comparisons Controversies Are about Context and Costs, Not Frequentism versus Bayesianism. Eur J Epidemiol. 2019 Sep;.
  • [49] Greenland S. The Causal Foundations of Applied Probability and Statistics. In: Geffner H, Dechter R, Halpern J, editors. Probabilistic and Causal Inference: The Work of Judea Pearl. in press; 2020. .
  • [50] Bowley AL. Discussion on Dr. Neyman’s Paper. J R Stat Soc. 1934;(97):607–610.
  • [51] Cox DR, Hinkley DV. Chapter 7, Interval Estimation. In: Theoretical Statistics. Chapman and Hall/CRC; 1974. p. 207–249.
  • [52] Cox DR. Principles of Statistical Inference. Cambridge University Press; 2006.
  • [53] Amrhein V, Trafimow D, Greenland S. Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis If We Don’t Expect Replication. Am Stat. 2019 Mar;73(sup1):262–270.
  • [54] Poole C. Confidence Intervals Exclude Nothing. Am J Public Health. 1987 Apr;77(4):492–493.
  • [55] Birnbaum A. A Unified Theory of Estimation, I. Ann Math Stat. 1961;32(1):112–135.
  • [56] Sullivan KM, Foster DA. Use of the Confidence Interval Function. Epidemiology. 1990 Jan;1(1):39–42.
  • [57] Rothman KJ, Greenland S, Lash TL. Precision and Statistics in Epidemiologic Studies. In: Rothman KJ, Greenland S, Lash TL, editors. Modern Epidemiology. 3rd ed. Lippincott Williams & Wilkins; 2008. p. 148–167.
  • [58] Folks L. Ideas of Statistics. Wiley; 1981.
  • [59] Chow ZR, Vigotsky AD. Concurve: Computes and Plots Consonance (Confidence) Intervals, P-Values, and S-Values to Form Consonance and Surprisal Functions; 2019. CRAN.
  • [60] Black J, Rothman K, Thelwall S. Episheet: Rothman’s Episheet; 2019. CRAN.
  • [61] Infanger D, Schmidt-Trucksäss A. P Value Functions: An Underused Method to Present Research Results and to Promote Quantitative Reasoning. Stat Med. 2019;.
  • [62] Fraser DAS. The P-Value Function and Statistical Inference. Am Stat. 2019 Mar;73(sup1):135–147.
  • [63] Whitehead J. The Case for Frequentism in Clinical Trials. Stat Med. 1993;12(15-16):1405–1413.
  • [64] Rubenstein S. A New Low in Drug Research: 21 Fabricated Studies; 2009.
  • [65] Harrington D, D’Agostino RB, Gatsonis C, Hogan JW, Hunter DJ, Normand SLT, et al. New Guidelines for Statistical Reporting in the Journal. N Engl J Med. 2019 Jul;381(3):285–286.
  • [66] Schmidt M, Rothman KJ. Mistaken Inference Caused by Reliance on and Misinterpretation of a Significance Test. Int J Cardiol. 2014 Dec;177(3):1089–1090.
  • [67] Greenland S. A Serious Misinterpretation of a Consistent Inverse Association of Statin Use with Glioma across 3 Case-Control Studies. Eur J Epidemiol. 2017 Jan;32(1):87–88.
  • [68] Bauchner H, Golub RM, Fontanarosa PB. Reporting and Interpretation of Randomized Clinical Trials. JAMA. 2019 Aug;322(8):732–735.
  • [69] Cousins RD. The Jeffreys–Lindley Paradox and Discovery Criteria in High Energy Physics. Synthese. 2017 Feb;194(2):395–432.

8 Appendix

  The coin-toss interpretation we have used assumes that the only alternative to fairness is in the direction of loading for heads. The S-value it produces thus corresponds to a P-value for the 1-sided hypothesis that the chance of heads is at most ½; nonetheless, this interpretation applies even if the observed P-value p was 2-sided. This translation from a 2-sided P-value to a 1-sided S-value parallels the transformation of P-values into 1-sided sigmas in physics, in which for example a P-value of 0.05 from a two-sided test would become a σ of 1.645, the upper 5% cutoff for a standard-normal deviate [69].
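This translation is a single quantile calculation; for example (a minimal sketch):

```python
from scipy import stats

p_two_sided = 0.05
sigma_one_sided = stats.norm.isf(p_two_sided)   # upper 5% cutoff of a standard normal
print(round(sigma_one_sided, 3))                # ~1.645
```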

There are many other measures of information and evidence against a test hypothesis H or test model. One example is the maximum-likelihood ratio (MLR), which is the value of the likelihood function at its maximum under the background model, divided by its (restricted) maximum when the test hypothesis H is added to that model [44, p. 151; 45, p. 156]. Like the S-value, the MLR defined this way is always at least 1; it is however sometimes confused with the posterior odds against the tested hypothesis H given the background model, which it equals only under very special (and usually unrealistic) conditions.

The MLR does however show the most extreme change in posterior odds against H that the data could produce under the background model. The corresponding information measure paralleling the S-value is the deviance difference or likelihood-ratio (LR) statistic for H given the background model, 2·ln(MLR), which is itself a test statistic for H that provides a conditional P-value and S-value. The change in the Akaike Information Criterion (without small-sample adjustment) from adding H to the background model is this difference minus 2.

Table 1 shows relations under the standard 1 degree-of-freedom (df) approximation for the LR statistic when H is a hypothesis that a parameter equals a specific value, e.g., for the hypothesis that a hazard ratio HR equals the number h, H: HR = h. For normal (Gaussian) data these relations are exact and the LR statistic reduces to the squared z-score for the hypothesis [45, p. 156]. The S-value and LR statistic track each other rather closely, although the latter increases more rapidly. Their relation reflects that, under the test model and the standard approximations, the P-value is uniform and hence the S-value (in natural-log units) is unit-exponential, which is half of a chi-squared variate with 2 df [36] and hence has a heavier right tail than the 1-df LR statistic; specifically, the ratio of the 2-df to the 1-df chi-squared density at a value x is proportional to √x.

For Table 1, Figure 2, and the supplementary figures, statistics were computed from the approximate normal distribution used for the CIs in Brown et al. [28], in which the log-hazard ratio ln(HR) is estimated to have mean ln(1.61) = 0.476 and standard deviation [ln(2.59) − ln(0.997)]/(2 × 1.96) = 0.244. The P-value for H: HR = h is then derived from the normal score z = [0.476 − ln(h)]/0.244, and the LR statistic and MLR are approximated by z² and exp(z²/2).

For contrast to the P-value (compatibility) function, the first supplementary figure below shows the relative likelihood function, 1/MLR, produced from the Brown et al. HDPS results, taking the maximum as the reference point so that the graph extends from 0 to 1. It may be noticed that this function appears proportional to a posterior probability density for ln(HR), but this proportionality holds only under very special conditions. For contrast to the S-value graph in Figure 2, the second supplementary figure shows the corresponding deviance (LR-statistic) function, z².


Supplementary Figure 1: Relative likelihoods for a range of hazard ratios. A relative likelihood function that corresponds to the P-value (compatibility) function. Also plotted is the likelihood interval (LI), which corresponds to the 95% compatibility interval. Computed from results in Brown et al. [28]. MLR = Maximum-Likelihood Ratio. HR = 1 represents no association.


Supplementary Figure 2: Deviance statistics for a range of hazard ratios. A deviance function, which corresponds to Figure 2, the S-value function. Also plotted is the likelihood interval (LI), which corresponds to the 95% compatibility interval. Computed from results in Brown et al. [28]. MLR = Maximum-Likelihood Ratio. HR = 1 represents no association.