Why multiple hypothesis test corrections provide poor control of false positives in the real world

08/10/2021 ∙ by Stanley E. Lazic, et al. ∙ University of Cambridge

Abstract

Most scientific disciplines use significance testing to draw conclusions from experimental or observational data. This classical approach provides theoretical guarantees for controlling the number of false positives across a set of hypothesis tests, making it an appealing framework for scientists who wish to limit the number of false effects or associations that they claim exist. Unfortunately, these theoretical guarantees apply to few experiments and the actual false positive rate (FPR) is much higher than the theoretical rate. In real experiments, hypotheses are often tested after finding unexpected relationships or patterns, the data are analysed in several ways, analyses may be run repeatedly as data accumulate from new experimental runs, and publicly available data are analysed by many groups. In addition, the freedom scientists have to choose the error rate to control, the collection of tests to include in the adjustment, and the method of correction provides too much flexibility for strong error control. Even worse, methods known to provide poor control of the FPR such as Newman-Keuls and Fisher’s Least Significant Difference are popular with researchers. As a result, adjusted p-values are too small, the incorrect conclusion is often reached, and reported results are less reproducible. Here, I show why the FPR is rarely controlled in any meaningful way and argue that a single well-defined FPR does not even exist.

Introduction

In many disciplines scientists are taught to analyse data by framing their research question into a significance testing framework, often called null hypothesis significance testing (NHST). This involves reshaping the research question into two mutually exclusive hypotheses. The first is the null hypothesis (H0), which states that there is no effect, correlation, or association, or more generally, that "there's nothing going on". The second is the alternative hypothesis (H1), which states that an effect, correlation, or association exists, or more simply, that H0 is false. The data are then tested for compatibility with the null hypothesis by reducing them to a single number, called the test statistic, which is then used to calculate a p-value. A small p-value indicates that the observed or a more extreme test statistic is unlikely if the null hypothesis is true. Or, more informally, a small p-value indicates that the size of the observed effect, correlation, or association is larger than expected if in fact there is nothing going on. The implication is that the alternative hypothesis is true, as there are only two options (H0 and H1), and if we discredit one of them, the other must be more likely. This implication does not follow directly, since a p-value is the probability of data, not of a hypothesis; nevertheless, researchers often interpret a p-value as the probability of a hypothesis. The practical and philosophical limitations of NHST are not considered here, as they have been discussed elsewhere [11, 20, 49].

To control false positives when using NHST, a predetermined threshold is set for the false positive error rate (denoted α and often set to 0.05), and if the p-value is less than α, H0 is rejected. Given α = 0.05, there is by definition a 5% chance of reaching an incorrect conclusion, assuming H0 is true. And if many such tests are conducted, the probability is higher than 5% that at least one test in the collection is a false positive. This increased error rate has motivated the development of "post-hoc tests" or multiple comparisons procedures (MCP) that continues to this day. I argue that the correction-for-multiple-testing framework often fails to control the FPR in real studies and contributes to the lack of reproducibility in research.

Many reasons have been put forward for irreproducibility, including publication bias, selective reporting, inappropriate experimental designs and analyses, and perverse incentives [31, 26, 2, 32]. I argue that a more basic reason has been overlooked: the supposed control of the FPR is far from the nominal level in many experiments, even when analyses are conducted appropriately. Below I describe why the FPR is rarely controlled in any meaningful way in real experiments. Some reasons are well known and have good solutions, but others are unrecognised. The NHST approach also assumes that there is a single, unique FPR to control, but I argue this is a myth. The paper concludes with suggestions for improving inferences when testing multiple hypotheses.

Reasons for poor control of false positives

Nine reasons are given below why the FPR may differ from the nominal 0.05 level, and most experiments will suffer from one or more of these issues. Only registered clinical trials and preregistered confirmatory studies likely minimise the discrepancy between the actual and nominal FPR [41]. The case where p-values are not adjusted for multiple tests is excluded.

Freedom to choose the family of comparisons

When using multiple testing procedures, researchers must define a family or collection of hypothesis tests to be corrected together, and exclude other hypothesis tests that may be conducted. Hence, correction is done within families but not between families. Researchers are free to arbitrarily choose the family of tests, and even though some conventions are followed, they are a cultural practice and not a principled choice. For example, suppose an experiment is arranged as a 2-way ANOVA with two genotypes (wildtype and knock-out) and three doses of a compound (none, low, high). A standard ANOVA analysis of this data produces three p-values, one for each main effect and one for the interaction. These three p-values are rarely corrected for multiple testing [12], but they are a family of related tests as they use the same data, are part of the same statistical model, and use the same error term for calculating the p-values. In 1959 Ryan argued that these p-values should be corrected, but it has not become standard practice [45]. I do not take a position on whether these corrections are appropriate, only acknowledge that they are not used, and that such corrections would be sensible. However, when comparing levels within a factor (e.g. none versus low, and none versus high within the wildtype group) researchers often correct for multiple testing.
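
As a concrete illustration of where these conventions show up in practice, the following minimal R sketch uses simulated data under an assumed 2 x 3 design with six observations per cell; it is a sketch of the standard workflow, not the analysis of any real experiment.

    set.seed(1)
    # Hypothetical 2 x 3 design: genotype (WT, KO) by dose (none, low, high), n = 6 per cell
    d <- expand.grid(genotype = c("WT", "KO"), dose = c("none", "low", "high"), rep = 1:6)
    d$y <- rnorm(nrow(d))                      # simulated outcome with no true effects
    fit <- aov(y ~ genotype * dose, data = d)
    summary(fit)                               # three p-values (genotype, dose, interaction); rarely corrected
    TukeyHSD(fit, which = "dose")              # pairwise dose comparisons (pooled over genotype); routinely corrected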

Oddly, the family of tests changes if the above experiment is a functional genomics study, despite having an identical design. The family is usually a single comparison across all genes; hence, all 20,000 genes for the none versus low dose comparison in the wildtype group would be considered a family of tests. The p-value adjustment ignores the other treatment groups and the other genotype. It may seem more rigorous to adjust for 20,000 tests for a single comparison instead of adjusting for all possible comparisons per gene, but wouldn't adjusting for 20,000 genes times the number of group comparisons be sensible if the aim is to limit the number of false positives in the whole experiment? Again, I do not take a position on whether this is the best approach, only draw attention to arbitrary cultural practices.

Other reasonable families could be considered, such as all tests that relate to a research question. For example, does lesioning brain region X in mice affect their behaviour? Multiple cognitive, affective, and motor assays, and multiple readouts on each assay, could be controlled as one family. That is, the FPR could be controlled across all behaviours and all readouts. Or, if the research question is specifically about motor behaviours, then all the motor assays and the multiple readouts on each motor assay could be considered a family, and the FPR for this family can be controlled. Grouping all the tests that relate to the research question into a family is rarely done but is a sensible error rate to control. Another reasonable family is all of the tests reported in a figure. Often the graphs in a multi-panel figure are related and could be considered a family, and a researcher might want to control this overall error rate. Or, being even more stringent, all the p-values reported in a paper could be adjusted such that the probability of at least one false positive amongst all the reported hypothesis tests is 0.05. But what about tests reported in the supplementary material, or tests performed but not published, or published as part of another paper [44, 42]?

I am not arguing that these other families should be used, only that there are many sensible options, and current practices are arbitrary and likely driven by software defaults, which then become cultural practice. Feise argues that families of tests are ambiguously, arbitrarily, and inconsistently defined [16]. Hence, researchers can choose to group their hypothesis tests into many small families instead of a few large ones so that less stringent corrections are applied. This provides weaker control of the FPR and appears to be standard practice. Researchers may also avoid reporting non-significant comparisons so that a less stringent correction is needed for the remaining hypotheses. Such questionable research practices contribute to the reproducibility crisis and are difficult for peer reviewers and readers to detect.

Freedom to choose the error rate to control

Once researchers define the family of tests to be corrected, they are free to choose from four main error rates to control. The first is the per-comparison error rate (PCER), which is the error rate for a single comparison and is equal to the significance threshold defined by the researcher, usually α = 0.05. The PCER ignores the number of hypotheses tested and hence does not correct for multiple testing. This is the appropriate error rate for a single planned comparison.

The second and most common is the familywise error rate (FWER), which is the probability of having at least one false positive in the collection of tests. (The term experimentwise error rate is sometimes used, but an experiment is only one way of defining a family, so the more general familywise term is used here.) This is the classic error rate and is calculated as FWER = 1 − (1 − α)^m, where α is the error rate for a single test and m is the number of tests conducted. Thus, if α = 0.05 and m = 15, we expect FWER = 1 − (1 − 0.05)^15 = 0.54, or a 54% chance of at least one false positive amongst the 15 tests. This example highlights how one of the oldest and simplest MCPs works. The Bonferroni correction controls the FPR using a smaller cutoff, defined as the α for a single test divided by the number of tests (0.05/15 ≈ 0.0033). The FWER is then 1 − (1 − 0.0033)^15 = 0.049, or approximately the target of 0.05 when rounded to two decimal places. Alternatively, instead of using a new cutoff, the p-values can be adjusted upwards; for the Bonferroni correction the raw p-value is multiplied by the number of tests, so if the raw p-value is 0.0021, the adjusted value is 0.0021 × 15 = 0.0315. Many MCPs work in a similar way: a new cutoff is calculated from the number of tests (or the p-values are adjusted), but the details vary for each method.
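
The arithmetic above can be reproduced in a few lines of base R; the p-value of 0.0021 is the illustrative value used in the text, not a result from any real experiment.

    alpha <- 0.05; m <- 15
    1 - (1 - alpha)^m                                # ~0.54: chance of at least one false positive
    alpha / m                                        # ~0.0033: Bonferroni cutoff
    1 - (1 - alpha / m)^m                            # ~0.049: FWER when using the Bonferroni cutoff
    p.adjust(0.0021, method = "bonferroni", n = 15)  # 0.0315: Bonferroni-adjusted p-value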

The third and more recent error rate is the false discovery rate (FDR) [3]. The FDR is the proportion of tests that are false positives amongst all the tests declared statistically significant. If 1000 tests are conducted and 200 are deemed significant at an FDR = 0.05 level, then the interpretation is that 200 x 0.05 = 10 will be false positives and the other 190 will be true positives or true discoveries. The FDR has a much weaker control over the FPR and is popular for high throughput and high dimensional “omics” experiments, which tend to have few samples but many tests.
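
A small simulation, under assumed effect sizes and sample sizes, shows the Benjamini-Hochberg procedure in action using base R's p.adjust(); the realised proportion of false discoveries hovers around the nominal 0.05 on average.

    set.seed(1)
    # 900 true nulls and 100 true effects (an assumed shift of 1.5 SD), n = 10 per group
    p_null <- replicate(900, t.test(rnorm(10), rnorm(10))$p.value)
    p_alt  <- replicate(100, t.test(rnorm(10), rnorm(10, mean = 1.5))$p.value)
    p_bh   <- p.adjust(c(p_null, p_alt), method = "BH")
    sum(p_bh < 0.05)                                   # number of "discoveries"
    sum(p_bh[1:900] < 0.05) / max(1, sum(p_bh < 0.05)) # realised false discovery proportion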

If controlling the FWER results in no or few significant p-values, researchers are free to move the goalposts and control the FDR instead, or they can choose the FDR from the start, thereby guaranteeing weaker error control. A cynical view is that the FDR was developed to make the number 0.05 meaningful with thousands of tests (its popularity is confirmed by the high number of citations [40]). The FDR is now occasionally used in experiments with few hypothesis tests, instead of the more stringent FWER.

The fourth and least common error rate is the per-family error rate (PFER) [45, 28], which is the expected rate at which false positives occur in the family of tests. It is defined as the expected number of errors out of the m tests, so if the error rate for a single test is α = 0.05 and 2000 tests are conducted, the expected number of errors is 2000 × 0.05 = 100. The PFER is rarely chosen because it is the strictest of the four, meaning there will be fewer significant p-values. However, the popular Bonferroni correction controls the PFER as well as the FWER, so it is often used in practice. The Bonferroni correction is often criticised for being overly conservative, but it is only conservative for the FWER; it is well calibrated for the PFER [28, 17]. The PFER is recommended over the FWER when every false positive is a problem and needs to be minimised, whereas the FWER is more appropriate when, given one false positive, further false positives are less relevant [45, 17].

I do not take a position on which error rate is preferable and under which circumstances, only draw attention to the options that researchers have, and that they often select the least stringent.

Freedom to choose the correction method

Once the family of tests and the error rate are decided, researchers are spoilt for choice when it comes to the method of correction: for the FWER, options include Bonferroni, Holm, Scheffe, Sidak, Tukey, Student-Newman-Keuls, Dunn, Duncan, Dunnett, the more exotic Ryan-Einot-Gabriel-Welsch F, and so on; for the FDR, options include Benjamini-Hochberg, Benjamini-Yekutieli, Q-value, local FDR, marginal FDR, conditional FDR, empirical FDR, Black Box FDR, and so on. Dozens of methods exist to correct for multiple comparisons and researchers are free to choose whatever they please, sometimes making their selection after looking at the results and picking the method that provides significant p-values for the key comparisons. Modern software makes it easy to click all the options, examine the output, and report the procedure that gives the desired results.
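
For instance, base R's p.adjust() alone offers several of these corrections, and the same set of p-values (the values below are assumed for illustration) can move in and out of significance depending on which method is chosen.

    p <- c(0.004, 0.011, 0.032, 0.048, 0.21)   # assumed raw p-values
    sapply(c("bonferroni", "holm", "hochberg", "BH", "BY"),
           function(m) p.adjust(p, method = m))   # one column of adjusted p-values per method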

More knowledgeable researchers can avail themselves of further options. Many older MCPs adjust all the hypotheses in a family together, like the Bonferroni example above. But fixed-sequence, fallback, and gatekeeping procedures allow researchers to order hypotheses, where hypotheses earlier in the order receive a less stringent correction, or none at all [1, 55, 56, 15, 25, 34]. For example, an experiment with a low dose, high dose, and control group would first test the control versus high dose comparison, and if this is significant, only then is the control versus low dose group tested, but with no penalty for having conducted the first test. Thus by declaring that a “fixed-sequence procedure” was used, researchers can bypass any actual correction for multiple testing. These methods are rarely used outside of clinical trials, but only because they are unfamiliar to most researchers and unavailable in popular statistical software. A consequence of continually developing new ways to get smaller p-values is that the methods for adjusting p-values have become more complex than the statistical models themselves [8].
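
A minimal sketch of the fixed-sequence logic described above, with hypothetical p-values: each hypothesis is tested at the full α in a pre-specified order, and testing stops at the first non-significant result.

    # Hypothetical ordered p-values: high vs control is tested first, then low vs control
    p_ordered <- c(high_vs_control = 0.010, low_vs_control = 0.045)
    alpha  <- 0.05
    reject <- logical(length(p_ordered))
    for (i in seq_along(p_ordered)) {
      if (p_ordered[i] >= alpha) break   # stop: this and all later hypotheses are not rejected
      reject[i] <- TRUE                  # rejected at the unadjusted alpha
    }
    reject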

Procedures that provide poor control of the FPR are often used

Given the variety of multiple correction procedures available it is unsurprising that some control error rates less well than others, and that researchers will occasionally use them. At the extreme end, Fisher’s Least Significant Difference (LSD) provides no correction for multiple testing. The LSD is a rule-of-thumb to “only proceed to post-hoc comparisons if the overall ANOVA F-test is significant”. Few use this if-then rule, and even if they do, control of the FPR is only adequate when there are three groups or comparisons [38]. With more than three comparisons, the method fails to adequately control the FPR. Furthermore, the rule-of-thumb is inappropriate for experiments with a positive control condition because the overall F-test will always be significant (if not, there was a problem with the experiment). Despite these problems, researchers frequently report using Fisher’s LSD. The methods and results sections of journals from the Public Library of Science, BioMed Central, and Nature Publishing Group were searched on 22 June 2021 and approximately 3700 papers reported using this method.

Other procedures such as the Newman-Keuls or Student-Newman-Keuls (SNK) provide only weak control of the FPR. In other words, it's better than nothing, but the true FPR is higher than the advertised 5% level, and error control gets worse as the number of comparisons increases. SNK is commonly used despite warnings from statistical software manufacturers: "We offer the Newman-Keuls test for historical reasons… but we suggest you avoid it because it does not maintain the family-wise error rate at the specified level" (https://www.graphpad.com/guides/prism/7/statistics/stat_options_tab_three-way_anova.htm). Researchers reported using this method in over 10,000 papers from the above publications.

If effect sizes are large and comparisons few, many papers using Fisher’s LSD or SNK would have come to the same conclusion, had a better method been used. However, as a disproportionate number of published p-values are suspiciously close to 0.05 [43, 36, 33, 14, 30, 53], more stringent control of the FPR will no doubt change the conclusion for many. If these were influential studies, they could have misled a field for years.

Exploratory analyses are uncontrolled

Statisticians recommend plotting and graphically exploring the data before a formal analysis [9, 10, 29, 32]. The aim is to better understand the data and to perform quality control checks for outliers, clusters, unusual or impossible values, and so on. Unanticipated patterns or relationships are often found, which stimulate new hypotheses to test. Unfortunately, the same data cannot be used to both derive a hypothesis and then test it, at least if control of the FPR is important. This process has been called "HARKing" – Hypothesising After the Results are Known [27].

All MCPs take the number of tests into account, but this number is unknowable once the data have been examined for patterns, because it includes all the tests that might have been done had other patterns looked sufficiently interesting. Even if only one test was actually performed, it would need to be corrected for these potential tests, whose number is impossible to determine. Hence, there is no direct method to adjust p-values to account for data-driven hypotheses when testing discovered relationships.

Alternative analyses are uncontrolled

Researchers often try multiple models or analyses before settling on the final reported analysis. This includes adding or removing covariates such as age or sex, transforming the outcome to achieve normality or to stabilise variances, trying different likelihood and link functions for generalised linear models, and so on. Often the best model cannot be determined when planning the experiment, and it makes sense to try alternative models until a suitable fit is achieved. This can also lead a researcher to select the model that gives results supporting their theory. Simmons and colleagues showed how easy it is to get significant results with such alternative analyses [48]. Once again there is no clearly defined family of tests and therefore no way to adjust for multiple alternative analyses.

Multiple analyses as data accumulate are uncontrolled

In many settings the data accumulate over time and are analysed before the experiment is complete. Examples include clinical samples, patients, or other human subjects that are recruited into an experiment; cell culture experiments where the whole procedure or protocol is repeated on several occasions; or animal experiments that are run on each litter of animals as they are born. Multiple "looks" at the data increase the FPR because there are several chances for results to be significant, especially when an experiment is terminated early because the desired result is found. Although methods exist to control the FPR in these situations [6, 46], they are uncommon outside of clinical trials.
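
The inflation from multiple "looks" is easy to demonstrate by simulation. The sketch below assumes data accumulate in batches of five per group with no true effect, and the experiment stops as soon as p < 0.05; the long-run rate of false positive claims is well above 5%.

    set.seed(1)
    one_experiment <- function(n_max = 30, batch = 5) {
      x <- rnorm(n_max); y <- rnorm(n_max)                       # no true effect
      for (n in seq(batch, n_max, by = batch)) {
        if (t.test(x[1:n], y[1:n])$p.value < 0.05) return(TRUE)  # stop early, declare an effect
      }
      FALSE                                                      # never reached significance
    }
    mean(replicate(5000, one_experiment()))   # false positive rate well above the nominal 0.05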

Analyses of the same data by multiple research groups are uncontrolled

An underappreciated situation that increases the FPR is when multiple individuals or research groups analyse the same data. This is becoming more common as data are deposited in repositories and consortia pool data that their members analyse separately. Even if each research group is scrupulous in controlling the FPR, the same data are analysed in so many ways by so many people that every false positive result will likely be found and reported. For example, there are over 2000 primary publications in PubMed using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), and nearly 15,000 primary publications using the National Health and Nutrition Examination Survey (NHANES) data. Neither example involves a single fixed dataset; both comprise multiple datasets that have accumulated over time, and thus suffer from both "multiple looks" and multiple analysts, making it impossible to define a family of tests. Grootswagers and Robinson report that a similar problem exists in computational cognitive neuroscience [24].

Suggestions from people in authority are uncontrolled

Finally, it is not unusual for peer reviewers, editors, collaborators, clients, PhD/postdoc supervisors, or other senior individuals – often with the best of intentions – to "suggest" alternative analyses, tests, transformations, removal of outliers, and so on. Even if a researcher maintained strict control of the FPR, these additional requests, which often come after the analysis is complete, can further increase the FPR. Such suggestions are hard for junior researchers to ignore and are a source of angst for professional data analysts.

Solutions

Many options are available to reduce the distance between the actual and nominal FPR, but rarely will they be made equal. An easy solution is to avoid MCPs that inadequately control the FPR, especially Fisher’s LSD and Newman-Keuls methods. Researchers can switch to other methods and journals can discourage their use. Tukey’s or Holm’s method can be substituted in all cases, or more recent “simultaneous tests” that take correlations between parameters into account [7]. If corrections for multiple tests are not used, this should be clearly stated.

Inflated FPRs are often the result of too much freedom for researchers, such as the choice of family, error rate, and correction method. Defining the analytical decisions before the experiment is conducted greatly constrains these options. Preregistering the analysis plan is a formal way to specify the analytical options and prevents data-driven decisions that can increase the FPR. However, the most liberal options can still be chosen beforehand, which limits the usefulness of this approach. In addition, vague wording such as "…main analysis will be followed up with post-hoc tests as appropriate" is pointless and no better than a standard non-preregistered study. Adding constraints is best suited for confirmatory experiments, where there is a clear hypothesis, a primary outcome variable, and prior information to help define a statistical model. However, since confirmatory experiments tend to have few hypotheses, the risk of false positives is lower, and therefore MCPs are likely to work best where they are needed least.

Inflated FPRs also result from uncontrolled analyses: exploratory, alternative, as data accumulate, by multiple groups, and from suggestions from others. In these cases it is difficult or impossible to define the family of tests, and therefore impossible to use standard MCPs to control the FPR. Three options for dealing with these cases are described below: data splitting, blinded analyses, and shrinkage.

Data splitting

One way to decrease the risk of false positives is to take a portion of the data – usually 20 to 50% – and set it aside. The exploratory analysis is done on the remaining 50 to 80%, and any interesting results that are found are then confirmed on the held-out set. Unfortunately, given the small sample size of many experiments, this approach is unfeasible. Furthermore, the confirmatory hypothesis testing part is done on a smaller dataset (unless a 50/50% split was used), and so power will be lower. The best way to control the FPR in experiments where hypotheses were derived after looking at the data is to perform a second experiment with a clear hypothesis based on the results of the first experiment. This is infrequently done because the initial finding is publishable (given current practices) but a follow-up experiment requires twice the work and cost, and if it fails to confirm the initial result, then nothing is publishable (unless the second experiment is discarded and only the first reported). This prospect is too risky for many researchers. Given that data splitting is usually unfeasible and follow-up studies are risky, these cannot be general solutions to control the FPR.
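
For completeness, a minimal sketch of a 50/50 split on a toy dataset (the data frame and grouping below are assumed for illustration): the exploration half is used to generate hypotheses and the held-out half to test the few that survive.

    set.seed(1)
    dat <- data.frame(y = rnorm(40), group = rep(c("a", "b"), each = 20))  # toy data
    idx <- sample(nrow(dat), size = nrow(dat) / 2)
    explore <- dat[idx, ]               # exploratory plots and tests; hypotheses are generated here
    confirm <- dat[-idx, ]              # each surviving hypothesis is tested once here
    t.test(y ~ group, data = confirm)   # e.g. a single pre-specified confirmatory test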

Blinded analyses

When many alternative analyses are done, the FPR can be controlled by ignoring the p-values for the scientific questions of interest. For example, when fitting several candidate models, their adequacy can be judged from graphical checks of the residuals, or two models can be compared to determine whether batch effects are present. Only after a suitable model is found are the results for the main hypotheses examined. Hence, modelling decisions are independent of the p-values and effect sizes for the scientific hypotheses of interest. A better but more involved option is to perform a blind data analysis [35]. Data can be blinded in many ways, such as obscuring column names: instead of "age" and "weight", use "Column_A" and "Column_B". If it is possible to identify the columns from their values, the variables can be scaled or standardised. Group labels can be permuted or obscured, or a small amount of noise can be added to continuous variables. Standard blinding methods do not exist, and a trade-off must be made between blinding enough to reduce bias and retaining enough information to assess the data and models.
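
A minimal sketch of one way to blind a dataset before model checking, using a toy data frame; the specific steps (renaming, standardising, permuting group labels) follow the options listed above and are only one of many possible blinding schemes.

    set.seed(1)
    dat <- data.frame(age    = rnorm(30, 50, 10),
                      weight = rnorm(30, 70, 15),
                      group  = rep(c("control", "treated"), each = 15))
    blind <- dat
    names(blind) <- c("Column_A", "Column_B", "Column_C")   # obscure column names
    blind[, 1:2] <- scale(blind[, 1:2])                     # hide identifiable scales
    blind$Column_C <- sample(blind$Column_C)                # permute group labels
    head(blind)                                             # model checking proceeds on the blinded copy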

Shrink parameter estimates

An ongoing debate surrounds whether it is better to estimate parameters (i.e. effect sizes) or to test hypotheses [13, 39]. They are complementary procedures, and many studies report both p-values and effect sizes with confidence intervals. Here, I am not arguing that parameter estimation should replace hypothesis testing, but that parameter estimation provides a solution to multiple testing. The problem is not that p-values are too small, but that effect sizes are too big (sometimes coupled with underestimated variances, which also leads to small p-values). The solution therefore is to shrink parameter estimates closer to zero or the null value [53]. This approach recognises that p-values are not the problem; the effect sizes are. Adjusting p-values is like treating the symptoms of a disease instead of the cause: it is a problem of overfitting, not of multiple hypothesis testing. Furthermore, adjusted p-values do not fix the visual impression of exaggerated effects when the data are plotted.

The logic of shrinking parameter estimates is as follows. Assume we are estimating the difference between the means of two groups; call this δ. Even if δ = 0, our experiment will rarely return an estimate of exactly 0: sometimes it will be greater than 0 and sometimes less than 0. Occasionally the estimate will be far from 0 due to sampling variation and we will incorrectly conclude that δ ≠ 0. Knowing this, we can take the observed estimate of δ and pull or shrink it closer to 0, which will prevent or reduce large estimates when no effect exists, thereby reducing the number of false positive conclusions. Shrinking will reduce statistical power because true effects will also be shrunk towards 0, but ideally the reduction in false positives will compensate for the loss of power.
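
As a toy illustration of this logic, the normal-normal conjugate formula below pulls an observed difference towards zero; the prior Normal(0, τ²), the observed difference, and its standard error are all assumed values, not results from any experiment.

    delta_hat <- 1.2                      # observed mean difference (assumed)
    se        <- 0.8                      # its standard error (assumed)
    tau       <- 0.5                      # prior SD for the true difference (assumed)
    delta_hat * tau^2 / (tau^2 + se^2)    # posterior mean ~0.34: shrunk towards zero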

Figure 1 illustrates shrinkage with simulated data. Data were simulated for two groups with ten samples each, and for 100 genes. A separate t-test was used to compare the mean difference between groups for all the genes. One data set was simulated with no effects, and a second with five genes differing between groups (the simulation script is provided in the supplementary materials). Figure 1A shows the estimates (mean difference) and standard errors for the 100 genes from the data with no true effects. Figure 1B shows the shrunken estimates, which have all been pulled to zero with very little uncertainty, indicating that the model is confident that no effects are present. Figure 1C shows estimates for the data set with five genes having different expression levels. The data were simulated such that all five genes have negative estimates. Figure 1D shows the shrunken estimates, and while most genes have been shrunken to zero, four remain distinct (although these are also shrunken towards zero, but to a lesser extent). Note how shrinking corrects the inflated effect sizes.

Figure 1: Shrinkage estimates. Estimates (mean difference) and standard errors for the data with no true effects (A), and the shrunken estimates (B), which have all been shrunk to zero. Estimates and standard errors for the data with five true negative effects (C), and the shrunken estimates (D), four of which remain strongly negative.

Parameter estimates can be shrunk in three general ways. The first is to take a Bayesian approach and place informative priors on the parameters [21, 22]. For example, when comparing several groups to a control, the parameters representing the differences between group means – call them δ1, …, δK for K comparisons – could have independent Normal(0, τ²) priors (see McElreath for an introduction to Bayesian methods [37]). The observed mean differences would be shrunk towards zero, with the degree of shrinkage controlled by τ (a small τ provides more shrinkage). This is a very general approach, but the control of the FPR depends critically on τ. Often there is no theory to help select a suitable value, but it can be estimated using simulations. Data are simulated from hypothetical experiments where no effects are present, a value for τ is chosen, and the data are analysed as they will be in the real experiment. This is repeated for many simulated datasets and the number of false positives is counted. If this number is greater than a desired threshold (e.g. 5%), then τ needs to be smaller. The procedure is repeated until a suitable value of τ is found, and the sample size can also be calculated to ensure that statistical power is adequate. This approach is best suited when there are few tests.
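
A sketch of this simulation-based calibration, under assumptions of my own: a one-sample setting analysed with the normal-normal posterior shown earlier, counting a false positive whenever the shrunken estimate's 95% interval excludes zero. A real analysis would use whatever model will be fitted to the actual data.

    set.seed(1)
    fpr_for_tau <- function(tau, n = 10, n_sim = 2000) {
      mean(replicate(n_sim, {
        y   <- rnorm(n)                                   # simulated data with no true effect
        est <- mean(y); se <- sd(y) / sqrt(n)
        post_mean <- est * tau^2 / (tau^2 + se^2)         # normal-normal posterior mean
        post_sd   <- sqrt(tau^2 * se^2 / (tau^2 + se^2))  # and posterior SD
        abs(post_mean) > 1.96 * post_sd                   # interval excludes zero: a false positive
      }))
    }
    sapply(c(0.1, 0.25, 0.5, 1), fpr_for_tau)   # choose the largest tau that keeps this near the target rate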

The second approach uses Bayesian or empirical Bayesian multilevel/hierarchical models [19]. Here, the δ parameters are assumed to come from a common distribution with a mean of zero and a standard deviation of τ. Unlike the first approach, τ is estimated from the data, and hence the amount of shrinkage is data-dependent: with little variation between the group means, τ will be small and the shrinkage will be greater. A drawback of this approach is that with very few groups or comparisons, τ will be poorly estimated, and hence it is better employed for a moderate number of tests.

The third approach is to calculate summary statistics for the comparisons, such as effect sizes and their standard errors, using traditional methods, and then in a second step use empirical Bayesian methods to shrink the effect sizes [51, 23, 50]. This approach was taken in Figure 1 and is especially useful for "omics" or high-dimensional biology experiments that have many hypothesis tests [50]. Another approach, recently developed by van Zwet and Gelman, relies on the signal-to-noise ratio from similar published studies to estimate the prior distribution, which is then used to shrink the estimated parameter values [52].
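
The actual script behind Figure 1 is in the supplementary materials; the sketch below reproduces the general idea of the two-step approach under assumed settings (five true effects of size −2, ten samples per group), using the ashr package [50] for the empirical Bayesian shrinkage step.

    # install.packages("ashr")   # assumed available; implements the shrinkage of [50]
    library(ashr)
    set.seed(1)
    n_genes <- 100; n <- 10
    effects <- c(rep(-2, 5), rep(0, n_genes - 5))   # five true negative effects (size assumed)
    est <- se <- numeric(n_genes)
    for (g in 1:n_genes) {
      x <- rnorm(n); y <- rnorm(n, mean = effects[g])
      est[g] <- mean(y) - mean(x)                   # step 1: per-gene effect size...
      se[g]  <- sqrt(var(x) / n + var(y) / n)       # ...and its standard error
    }
    shrunk <- get_pm(ash(est, se))                  # step 2: empirical Bayes shrinkage (posterior means)
    round(cbind(raw = est, shrunk = shrunk)[1:10, ], 2)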

The main drawback to controlling false positives by shrinking parameters is that the FPR is not known exactly before experiments are conducted, although it can be estimated with simulation studies. The traditional approach does provide theoretical guarantees (e.g. that the FWER for m tests will be no greater than α when α/m is used as the significance threshold), but as argued above, these guarantees rarely apply to real experiments. Hence, having an approximate error rate that is actually achieved is better than an exact theoretical error rate that does not apply. A second drawback is that many of these methods require more work than simply ticking a box, as they are not implemented in popular software packages, although simple solutions are available in R for omics experiments [50] and correlation matrices [47].

These drawbacks are balanced by several advantages: results are less dependent on the number of tests, adjustments can be made when the number of tests is undefined, and the options are more limited, thereby reducing researchers’ ability to select methods that give desirable results. Researchers must choose a prior for the Bayesian methods, but these can also be specified in advance. Shrinkage can also be combined with other approaches such as data splitting, blinded analyses, and specifying analyses beforehand. Shrinkage is also a more direct solution to the problem of overfitting, since exaggerated effect estimates are brought closer to the true estimates.

Discussion

It is commonly believed that the FPR exists as a single well-defined entity, but based on the arguments above, this view is untenable. First, many error rates exist – per-comparison, familywise, false discovery, and per-family – so by definition there cannot be a generic "false positive rate" that a researcher can control. But even if a researcher (arbitrarily) selects a specific error rate, multiple comparison procedures critically depend on the number of tests included in the family. Often the family cannot be specified for exploratory studies; for example, when the data accumulate over time, several people analyse the same data, or multiple analyses are tried. Berry described these as silent or hidden multiplicities [5, 4]. Hence, adjusted p-values are impossible to calculate and therefore meaningless – just as you cannot solve for y when x is unknown in y = x + 2. Perhaps it is better to acknowledge that y is undefined instead of making up a value for x to calculate an equally fictitious value for y. When the family can be unambiguously defined, it is usually arbitrary, and other sensible families could equally have been chosen, which would lead to different results. Once a researcher finally performs the calculations, the bewildering number of procedures available means that many different results can be obtained for the same supposed error rate.

Studies assessing the performance of MCPs always take the family as unambiguously given, but since the family is an ill-defined entity, the performance of MCPs in the real world is questionable. While there is no direct proof, I speculate that one reason for the "reproducibility crisis" is that the classic MCP approach itself often fails to achieve its purpose. It is an ill-posed problem because researchers are trying to control something that cannot be defined or measured, and that may not even exist. This means that the usual fixes, such as improving researcher incentives and training, will have minimal impact on this aspect of reproducibility, although limiting researcher degrees of freedom by defining the approach upfront and avoiding methods with poor performance can help.

The classic MCP approach assumes that effects can be sensibly divided into those that exist and those that do not, and it is not clear that the world is constructed this way [18]. MCPs, much like the p-values they operate on, serve a social function: they allow researchers to make scientific claims. For this reason, p-values are unlikely to go away, despite calls to reform, ban, or abandon their use (see Volume 73 of The American Statistician [54]). The established practice in most fields is that a researcher is justified in claiming that an effect or association exists when a p-value or adjusted p-value is less than 0.05. When many such claims are made, the probability of a false claim increases, hence the need for MCPs. The number of false claims should be kept under control, but is adjusting p-values the best way to do so? I argue that p-values are not the best quantity to adjust; the effect sizes are. In most fields, shrinkage methods are rarely used to control the FPR, and so little is known about their performance in the real world, but this is worthy of further research. For example, how would these methods perform when subsets of a large dataset are analysed separately, compared with one large analysis (mimicking the situation where multiple people analyse parts of the data)? Classical MCPs are expected to have weaker control for the subsets, since each family of tests is smaller, and hence the FPR will likely be inflated.

Given that estimated effect sizes are too large, focusing on shrinking parameter estimates is fundamentally the correct approach to controlling the number of false scientific claims. P-values are calculated from the effect sizes, so the argument is to move the adjustment upstream: if the effect sizes are fixed, so are the p-values, much as treating the disease often improves the symptoms. However, shrinkage methods are harder to apply because they require more knowledge than ticking a box. In addition, the best approach will depend on the experimental design. Some experiments are simple with few comparisons, such as a clinical trial with a primary outcome assessing a low and a high dose of a compound versus a placebo. Other experiments have a simple design (e.g. healthy versus diseased) but thousands of outcomes, such as gene or protein expression values. Others have a single outcome but many comparisons, such as a screen testing thousands of drugs. Finally, some experiments run several assays on the same samples or subjects, each with several outcomes and factors (e.g. a typical neuroscience experiment with several behavioural assays and several outcomes per assay). The best way to shrink estimates will likely differ between designs, and we have little data comparing the options. Shrinking estimates does give up the theoretical guarantees of the traditional MCP approach, but since these guarantees do not apply to most real-world settings, little is lost.

This paper also serves as a call to scientists to try shrinkage methods in their research, perhaps alongside standard methods, and to report both for comparison. Over time, this will enable scientific communities to assess the relative merits of this approach on the types of data they routinely generate. It is also a call for quantitative researchers to assess these methods using simulations, to understand when they work well, when they fail, and whether sensible default options can be recommended. This would have more value than developing yet another multiple comparison procedure. Ultimately, we should not assume that effects either exist or do not exist and then try to control the number of false positives; we should instead try to minimise the discrepancy between the true and estimated effects, and wrong conclusions will automatically be controlled.

References

  • [1] Bauer P, Röhmel J, Maurer W, Hothorn L (1998). Testing strategies in multi-dose experiments including active control. Statistics in Medicine 17(18): 2133–2146.
  • [2] Begley CG, Ioannidis JPA (2015). Reproducibility in Science: Improving the Standard for Basic and Preclinical Research. Circ Res 116(1): 116–126.
  • [3] Benjamini Y, Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B 57: 289–300.
  • [4] Berry D (2012). Multiplicities in cancer research: ubiquitous and necessary evils. J Natl Cancer Inst 104(15): 1124–1132.
  • [5] Berry DA (2007). The difficult and ubiquitous problems of multiplicities. Pharm Stat 6(3): 155–160.
  • [6] Berry SM, Carlin BP, Lee JJ, Muller P (2010). Bayesian Adaptive Methods for Clinical Trials. Boca Raton, FL: CRC Press.
  • [7] Bretz F, Hothorn T, Westfall P (2011). Multiple Comparisons Using R. Boca Raton, FL: CRC Press.
  • [8] Bretz F, Posch M, Glimm E, Klinglmueller F, Maurer W, Rohmeyer K (2011). Graphical approaches for multiple comparison procedures using weighted Bonferroni, Simes, or parametric tests. Biometrical Journal 53(6): 894–913.
  • [9] Cleveland WS (1993). Visualizing Data. New Jersey: Hobart Press.
  • [10] Cleveland WS (1994). The Elements of Graphing Data. New Jersey: Hobart Press, revised edn.
  • [11] Cohen J (1994). The earth is round (p < .05). American Psychologist 49(12): 997–1003.
  • [12] Cramer AOJ, van Ravenzwaaij D, Matzke D, Steingroever H, Wetzels R, Grasman RPPP, Waldorp LJ, Wagenmakers EJ (2015). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review 23(2): 640–647.
  • [13] Cumming G (2013). The new statistics. Psychological Science 25(1): 7–29.
  • [14] de Winter JC, Dodou D (2015). A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ 3: e733.
  • [15] Dmitrienko A, Offen WW, Westfall PH (2003). Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in Medicine 22(15): 2387–2400.
  • [16] Feise RJ (2002). Do multiple outcome measures require p-value adjustment? BMC Med Res Methodol 2: 8.
  • [17] Frane AV (2015). Are per-family Type I error rates relevant in social and behavioral science? Journal of Modern Applied Statistical Methods 14(1): 12–23.
  • [18] Gelman A (2015). The connection between varying treatment effects and the crisis of unreplicable research: a Bayesian perspective. Journal of Management 41(2): 632–643.
  • [19] Gelman A, Hill J, Yajima M (2012). Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness 5: 189–211.
  • [20] Goodman S (2008). A dirty dozen: twelve p-value misconceptions. Semin Hematol 45(3): 135–140.
  • [21] Greenland S (1992). A semi-Bayes approach to the analysis of correlated multiple associations, with an application to an occupational cancer-mortality study. Statistics in Medicine 11(2): 219–230.
  • [22] Greenland S, Mansournia MA (2015). Penalization, bias reduction, and default priors in logistic and related categorical and survival regressions. Statistics in Medicine 34(23): 3133–3143.
  • [23] Greenland S, Robins JM (1991). Empirical-Bayes Adjustments for Multiple Comparisons Are Sometimes Useful. Epidemiology 2(4): 244–251.
  • [24] Grootswagers T, Robinson AK (2021). Overfitting the literature to one set of stimuli and data. Frontiers in Human Neuroscience 15.
  • [25] Huque MF, Alosh M (2008). A flexible fixed-sequence testing method for hierarchically ordered correlated multiple endpoints in clinical trials. Journal of Statistical Planning and Inference 138(2): 321–335.
  • [26] Ioannidis JPA (2014). How to make more published research true. PLoS Med 11(10): e1001747.
  • [27] Kerr NL (1998). HARKing: hypothesizing after the results are known. Pers Soc Psychol Rev 2(3): 196–217.
  • [28] Klockars AJ, Hancock GR (1994). Per-experiment error rates: The hidden costs of several multiple comparison procedures. Educational and Psychological Measurement 54(2): 292–298.
  • [29] Krause A, O’Connell M (Eds.) (2012). A Picture is Worth a Thousand Tables: Graphics in Life Sciences. New York, NY: Springer.
  • [30] Krawczyk M (2015). The Search for Significance: A Few Peculiarities in the Distribution of P Values in Experimental Psychology Literature. PLoS One 10(6): e0127872.
  • [31] Landis SC, Amara SG, Asadullah K, Austin CP, Blumenstein R, Bradley EW, Crystal RG, Darnell RB, Ferrante RJ, Fillit H, Finkelstein R, Fisher M, Gendelman HE, Golub RM, Goudreau JL, Gross RA, Gubitz AK, Hesterlee SE, Howells DW, Huguenard J, Kelner K, Koroshetz W, Krainc D, Lazic SE, Levine MS, Macleod MR, McCall JM, Moxley RT, Narasimhan K, Noble LJ, Perrin S, Porter JD, Steward O, Unger E, Utz U, Silberberg SD (2012). A call for transparent reporting to optimize the predictive value of preclinical research. Nature 490(7419): 187–191.
  • [32] Lazic SE (2016). Experimental Design for Laboratory Biologists: Maximising Information and Improving Reproducibility. Cambridge, UK: Cambridge University Press.
  • [33] Leggett NC, Thomas NA, Loetscher T, Nicholls MER (2013). The life of p: ”just significant” results are on the rise. Q J Exp Psychol (Hove) 66(12): 2303–2309.
  • [34] Li JD, Mehrotra DV (2008). An efficient method for accommodating potentially underpowered primary endpoints. Statistics in Medicine 27(26): 5377–5391.
  • [35] MacCoun R, Perlmutter S (2015). Blind analysis: Hide results to seek the truth. Nature 526(7572): 187–189.
  • [36] Masicampo EJ, Lalande DR (2012). A peculiar prevalence of p values just below .05. Q J Exp Psychol (Hove) 65(11): 2271–2279.
  • [37] McElreath R (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton, FL: CRC Press.
  • [38] Meier U (2006). A note on the power of Fisher’s least significant difference procedure. Pharm Stat 5(4): 253–263.
  • [39] Morey RD, Rouder JN, Verhagen J, Wagenmakers EJ (2014). Why hypothesis tests are essential for psychological science. Psychological Science 25(6): 1289–1290.
  • [40] Noorden RV, Maher B, Nuzzo R (2014). The top 100 papers. Nature 514(7524): 550–553.
  • [41] Nosek BA, Ebersole CR, DeHaven AC, Mellor DT (2018). The preregistration revolution. Proc. Natl. Acad. Sci. U.S.A. 115(11): 2600–2606.
  • [42] Perneger TV (1998). What’s wrong with Bonferroni adjustments. BMJ 316(7139): 1236–1238.
  • [43] Ridley J, Kolm N, Freckleton RP, Gage MJG (2007). An unexpected influence of widely used significance thresholds on the distribution of reported P-values. J Evol Biol 20(3): 1082–1089.
  • [44] Rothman KJ (1990). No adjustments are needed for multiple comparisons. Epidemiology 1(1): 43–46.
  • [45] Ryan TA (1959). Multiple comparison in psychological research. Psychological Bulletin 56(1): 26–47.
  • [46] Schonbrodt FD, Wagenmakers EJ, Zehetleitner M, Perugini M (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychol Methods 22(2): 322–339.
  • [47] Schäfer J, Strimmer K (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology 4(1).
  • [48] Simmons JP, Nelson LD, Simonsohn U (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci 22(11): 1359–1366.
  • [49] Stang A, Poole C, Kuss O (2010). The ongoing tyranny of statistical significance testing in biomedical research. Eur J Epidemiol 25(4): 225–230.
  • [50] Stephens M (2017). False discovery rates: a new deal. Biostatistics 18(2): 275–294.
  • [51] Thomas DC, Siemiatycki J, Dewar R, Robins J, Goldberg M, Armstrong BG (1985). The problem of multiple inference in studies designed to generate hypotheses. American Journal of Epidemiology 122(6): 1080–1095.
  • [52] van Zwet E, Gelman A (2021). A proposal for informative default priors scaled by the standard error of estimates. The American Statistician 1–9.
  • [53] van Zwet EW, Cator EA (2021). The significance filter, the winners curse and the need to shrink. Statistica Neerlandica 1–16.
  • [54] Wasserstein RL, Schirm AL, Lazar NA (2019). Moving to a world beyond "p < 0.05". The American Statistician 73(Supp 1): 1–19.
  • [55] Westfall PH, Krishen A (2001). Optimally weighted, fixed sequence and gatekeeper multiple testing procedures. Journal of Statistical Planning and Inference 99(1): 25–40.
  • [56] Wiens BL (2003). A fixed sequence Bonferroni procedure for testing multiple endpoints. Pharmaceutical Statistics 2(3): 211–215.