Reply to Chen et al.: Parametric methods for cluster inference perform worse for two-sided t-tests

10/05/2018 ∙ by Anders Eklund, et al. ∙ Linköping University

One-sided t-tests are commonly used in the neuroimaging field, but two-sided tests should be the default unless a researcher has a strong reason for using a one-sided test. Here we extend our previous work on cluster false positive rates, which used one-sided tests, to two-sided tests. Briefly, we found that parametric methods perform worse for two-sided t-tests, and that non-parametric methods perform equally well for one-sided and two-sided tests.




1 Introduction

Chen et al. (2018) discuss an important topic that is often neglected in the neuroimaging field: the use of one-sided or two-sided tests, and the lack of multiple comparison correction for two one-sided tests. As mentioned in their paper, in our work on massive empirical evaluation of task fMRI inference methods with resting state fMRI (Eklund et al., 2016) we used one-sided tests (familywise error rate α = 0.05). We made this choice for two reasons. The first reason was simply that for analyses of randomly created groups of healthy controls, it should make no difference if one uses a one-sided or a two-sided test. The second reason was more practical. FSL and SPM both run one-sided tests by default, and we wished to reflect the typical (if ill-advised) practices of the community. Furthermore, to perform a two-sided permutation test (Winkler et al., 2014), it would be necessary to run two permutation tests per group analysis (which would double the processing time), since normally only the maximum test value over the brain (or the largest cluster) is saved for every permutation (to form the maximum null distribution).
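The bookkeeping behind that last point can be illustrated with a minimal sketch. It is not the implementation used in our analyses: it uses the voxel-wise maximum t statistic rather than cluster extent, plain Python lists rather than images, and the function names (t_stat, max_t, one_sided_fwe_p, two_sided_fwe_p) and the toy data are all hypothetical. Because only the maximum is stored per permutation, the opposite contrast needs its own pass, here obtained by negating the data.

```python
import math
import random


def t_stat(a, b):
    """Pooled-variance two-sample t statistic for two lists of numbers."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    pooled_var = ss / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def max_t(data, order, n_a):
    """Largest t value over voxels for the grouping given by `order`."""
    grp_a = [data[i] for i in order[:n_a]]
    grp_b = [data[i] for i in order[n_a:]]
    n_vox = len(data[0])
    return max(
        t_stat([s[v] for s in grp_a], [s[v] for s in grp_b])
        for v in range(n_vox)
    )


def one_sided_fwe_p(data, n_a, n_perm, rng):
    """FWE-corrected p-value for the contrast A > B.

    Only the maximum statistic is stored per permutation, so the resulting
    null distribution says nothing about the opposite contrast.
    """
    order = list(range(len(data)))
    observed = max_t(data, order, n_a)
    null_max = [observed]  # the unpermuted labelling counts as one permutation
    for _ in range(n_perm - 1):
        rng.shuffle(order)
        null_max.append(max_t(data, order, n_a))
    return sum(m >= observed for m in null_max) / n_perm


def two_sided_fwe_p(data, n_a, n_perm, rng):
    """Two one-sided permutation tests: A > B, then B > A on negated data."""
    p_pos = one_sided_fwe_p(data, n_a, n_perm, rng)
    negated = [[-x for x in subject] for subject in data]
    p_neg = one_sided_fwe_p(negated, n_a, n_perm, rng)
    return p_pos, p_neg


# Hypothetical toy data: 16 subjects (8 per group), 20 "voxels" of pure noise.
rng = random.Random(0)
toy = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(16)]
p_pos, p_neg = two_sided_fwe_p(toy, n_a=8, n_perm=200, rng=rng)
significant = min(p_pos, p_neg) <= 0.025  # each one-sided test at alpha = 0.025
```

In this sketch the second call to one_sided_fwe_p repeats all permutations, which is where the doubled processing time comes from.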

2 Methods

To investigate whether performing a two-sided test (as implemented by two one-sided tests at α = 0.025) leads to different false positive rates compared to a single one-sided test (at α = 0.05), we performed new group analyses for a subset of all the parameter settings used in our previous work (Eklund et al., 2016, 2018). Specifically, we only performed two-sample t-tests for the Beijing data (Biswal et al., 2010), using 40 subjects (i.e. 20 subjects per group) and a cluster defining threshold of p = 0.001. All group analyses were performed for 4 mm, 6 mm, 8 mm and 10 mm FWHM of smoothing. See our recent work (Eklund et al., 2018) for a description of the six designs (B1, B2, E1, E2, E3, E4) applied to every subject in the first level analysis.
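The reason two one-sided tests at 0.025 correspond to a nominal 5% two-sided rate can be written as a union bound. The event symbols E₊ and E₋ are introduced here only for illustration: E₊ denotes at least one false positive cluster for the positive contrast, and E₋ the same for the negative contrast.

```latex
\begin{align}
\mathrm{FWE}_{\text{two-sided}}
  &= P(E_{+} \cup E_{-}) \\
  &\le P(E_{+}) + P(E_{-}) \\
  &\le 0.025 + 0.025 = 0.05 .
\end{align}
```

The last inequality holds only if each one-sided procedure actually controls its error rate at 0.025, which is the property examined empirically here.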

For FSL, group analyses were only performed using FSL OLS, and not using FLAME1 (which is the default option); FLAME1 leads to conservative results if resting state fMRI data is used, while null task fMRI analyses (control-control) with FLAME1 give FWE rates comparable to FSL OLS (Eklund et al., 2016). For AFNI, we used the new ACF (autocorrelation function) option in 3dClustSim (Cox et al., 2017), which uses a long-tail spatial ACF instead of a Gaussian one. It should be noted that AFNI provides another function for cluster thresholding, ETAC (equitable thresholding and clustering) (Cox, 2018), which may perform better than the long-tail ACF function used here, but we used the ACF approach to be able to compare the two-sided results to our recent work (Eklund et al., 2018). Contrary to Chen et al. (2018), we did not change the cluster defining threshold to p = 0.0005 when performing two one-sided tests (for SPM, FSL or AFNI), as this represents yet another change in the inference configuration that we would rather leave fixed to facilitate the comparison of these results to previous one-sided findings.

3 Results

Figure 1 shows estimated familywise error rates for one-sided and two-sided tests, where both should exhibit a nominal 5% familywise false positive rate. The non-parametric permutation test produces similar results in both cases, while the parametric methods perform worse for two-sided tests.

Figure 1: A comparison of empirical familywise error rates for one-sided (left) and two-sided (right) tests, for a cluster defining threshold of p = 0.001. Designs B1 and B2 represent two block based activity paradigms, while E1, E2, E3 and E4 represent event related paradigms. Design E4 is randomized over subjects, while all other designs are the same for all subjects. The parametric methods perform worse for two one-sided tests at α = 0.025, compared to a single one-sided test at α = 0.05, while the permutation test produces nominal results in both cases.

4 Discussion

We have extended our original work on cluster false positive rates (Eklund et al., 2016, 2018) to two-sided tests, showing that parametric methods perform worse for two-sided tests. RFT p-values depend on a number of approximations:

  1. Joint normality over the image,

  2. Sufficient smoothness for lattice images to behave like continuous processes,

  3. Homogeneous smoothness (stationarity), so that the null distribution of cluster size does not vary over space,

  4. Spatial dependence mostly local, i.e. the spatial autocorrelation function is proportional to a Gaussian density, and

  5. Sufficiently high cluster-forming threshold so that the approximate distribution for cluster size is accurate.

On this last assumption, the control of FWE depends on the accuracy of the cluster size distribution in its tail. For example, it is of little consequence if the true cluster size FWE p-value is 0.6 and RFT estimates it as 0.5; in contrast, two-sided inference demands accuracy in the RFT approximation down to FWE 0.025, and then any inaccuracies are doubled as both positive and negative excursions are considered. In our findings, it appears that modest inaccuracies in the null cluster size distribution corresponding to FWE 0.05 (see Figure 1 (a), and the general tendency to overestimate FWE) grow into larger inaccuracies when the more stringent FWE level 0.025 is used (the inference used twice for each result contributing to Figure 1 (b)).
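The mechanism can be illustrated with a toy analogy that has nothing to do with RFT itself. Assume, purely for illustration, that a single test statistic truly follows Student's t with 3 degrees of freedom but is thresholded as if it were standard normal. The helper names t3_upper_tail and inflation are hypothetical. The sketch compares the ratio of actual to nominal tail probability at 0.05 and at 0.025, and then doubles the latter for a two-sided test.

```python
import math
from statistics import NormalDist


def t3_upper_tail(x):
    """Exact P(T > x) for Student's t with 3 degrees of freedom."""
    u = x / math.sqrt(3.0)
    cdf = 0.5 + (u / (1.0 + u * u) + math.atan(u)) / math.pi
    return 1.0 - cdf


def inflation(nominal_alpha):
    """Actual / nominal tail probability when a Gaussian threshold is used."""
    threshold = NormalDist().inv_cdf(1.0 - nominal_alpha)
    return t3_upper_tail(threshold) / nominal_alpha


# One-sided test at a nominal 0.05 versus two one-sided tests at 0.025 each.
one_sided_actual = 0.05 * inflation(0.05)
# For a single scalar statistic the two tails are disjoint, so they simply add.
two_sided_actual = 2 * 0.025 * inflation(0.025)

print(f"inflation at 0.05:  {inflation(0.05):.2f}")
print(f"inflation at 0.025: {inflation(0.025):.2f}")
print(f"one-sided actual rate: {one_sided_actual:.3f}")
print(f"two-sided actual rate: {two_sided_actual:.3f}")
```

In this toy setting the approximation error is larger further out in the tail, and the two-sided test pays for it twice. Whether and how strongly this carries over to cluster size distributions is an empirical question, which is what Figure 1 addresses.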

In contrast, the non-parametric permutation test for a two-sample t-test is only based on the assumption of exchangeability between subjects, and therefore performs equally well for two one-sided tests at α = 0.025.


The authors have no conflict of interest to declare. This study was supported by Swedish Research Council grants 2013-5229 and 2017-04889. Funding was also provided by the Center for Industrial Information Technology (CENIIT) at Linköping University, and the Knut and Alice Wallenberg Foundation project "Seeing organ function". Thomas E. Nichols was supported by the Wellcome Trust (100309/Z/12/Z) and the NIH (R01 EB015611). The Nvidia Corporation, which donated the Nvidia Quadro P6000 graphics card used to run all permutation tests, is also acknowledged. This study would not have been possible without the recent data-sharing initiatives in the neuroimaging field. We therefore thank the Neuroimaging Informatics Tools and Resources Clearinghouse and all of the researchers who have contributed resting-state data to the 1,000 Functional Connectomes Project.


  • Biswal et al. (2010) Biswal, B., Mennes, M., …, X. Z., & Milham, M. (2010). Toward discovery science of human brain function. PNAS, 107, 4734–4739.
  • Chen et al. (2018) Chen, G., Cox, R. W., Glen, D. R., Rajendra, J. K., Reynolds, R. C., & Taylor, P. A. (2018). A tail of two sides: Artificially doubled false positive rates in neuroimaging due to the sidedness choice with t-tests. Human Brain Mapping.
  • Cox et al. (2017) Cox, R., Chen, G., Glen, D., Reynolds, R., & Taylor, P. (2017). FMRI Clustering in AFNI: False-Positive Rates Redux. Brain Connectivity, 7, 152–171.
  • Cox (2018) Cox, R. W. (2018). Equitable thresholding and clustering. bioRxiv, 10.1101/295931.
  • Eklund et al. (2018) Eklund, A., Knutsson, H., & Nichols, T. (2018). Cluster failure revisited: impact of first level design and physiological noise on cluster false positive rates. Human Brain Mapping.
  • Eklund et al. (2016) Eklund, A., Nichols, T., & Knutsson, H. (2016). Cluster failure: why fMRI inferences for spatial extent have inflated false positive rates. PNAS, 113, 7900–7905.
  • Winkler et al. (2014) Winkler, A., Ridgway, G., Webster, M., Smith, S., & Nichols, T. (2014). Permutation inference for the general linear model. NeuroImage, 92, 381–397.