1 Introduction
Investigating neuroimagephenotype associations is important for understanding how stimuli, environment, and diseases affect the brain. Various neuroimaging modalities are used to study brainbehavior associations. For example, imaging scientists can study how brain activation measured with functional magnetic resonance imaging (fMRI) is related to performance on an executive functioning task (Satterthwaite et al., 2013)
. Spatial extent inference (SEI) is a widely used tool across neuroimaging modalities to study imagephenotype associations. SEI views the statistical image as a realization of a stochastic process and thresholds the image at a given value in order to compute the distribution of the spatial extent of contiguous clusters of activation. A pvalue is determined for each cluster based on its spatial extent by comparing the cluster extent to the distribution of the maximum cluster size under a predetermined null hypothesis
(Friston et al., 1994; Worsley et al., 1999).A number of recent studies have shown that the standard tools used for SEI that rely on gaussian random field (GRF) approximations can have highly inflated false positive rates (Eklund et al., 2016; Silver et al., 2011; Greve and Fischl, 2018). These studies have provoked significant discussion about the conditions necessary for the GRFbased tools to produce nominal familywise error rates (FWERs; Kessler et al., 2017; Mueller et al., 2017; Eklund et al., 2018; Slotnick, 2017). A good deal of this discussion has led to ad hoc preprocessing and thresholding suggestions for particular data types that are based on simulation results (Eklund et al., 2016; Greve and Fischl, 2018; Flandin and Friston, 2016; Slotnick, 2017; Mueller et al., 2017). For example, simulation results suggest using particular values preprocessing parameters such that the GRF based methods yield nominal FWERs (Flandin and Friston, 2016). However, GRFbased methods cannot control the FWER generally due to unrealistic assumptions about the covariance structure of the neuroimaging data. As a result ad hoc preprocessing decisions are unlikely to yield nominal FWERs across a wide range of data sets.
To date, only permutation testing procedures have been shown to robustly control the spatial extent false positive rate in most neuroimaging datasets (Eklund et al., 2016; Winkler et al., 2014). However, the permutation procedure may not control the false positive rate at the nominal level in some cases because it relies on the assumption of exchangeability. Exchangeability includes the assumption of homoskedasticity; namely, that the covariance structures of all the subjects are unaffected by the covariates. This assumption is often violated in neuroimaging datasets because many types of imaging datasets require multilevel models, in which parameters are estimated separately for each subject and subsequently analyzed at the group level. Exchangeability may also be violated because the covariance structure of imaging data is known to be affected by inscanner motion, age, and diagnostic status (see e.g. Satterthwaite et al., 2012; Power et al., 2010; Woodward and Heckers, 2016).
In the statistics literature several procedures have been developed for hypothesis testing in spatial datasets. Sun et al. (2015) developed false discovery rate (FDR) controlling procedures in a Bayesian context in order to perform inference pointwise or on predefined clusters. Procedures to address largescale hypothesis testing for correlated tests could also be utilized to perform inference on images (e.g. Yekutieli and Benjamini, 1999; Benjamini and Yekutieli, 2001; Romano et al., 2008). These procedures, which perform inference pointwise have the desirable property that they identify locations of the image or map where the mean is likely to be nonzero. However, due to the highdimensionality, small sample sizes, and complex covariance functions of neuroimaging data, many studies lack power to perform inference directly on the mean function. As a solution, other proposed methods for controlling the FDR in spatial data improve power by using predefined clusters and performing inference on the clusters (Benjamini and Heller, 2007; Pacifico et al., 2004)
. SEI is similar to these cluster based approaches because a significant cluster only guarantees (with a prespecified probability) that an arbitrary nonzero portion of the cluster has nonzero mean. However, unlike the cluster based FDR methods SEI does not require that the user defines the clusters
a priori.Here we develop two methods to perform SEI that use a (semi) parametric bootstrap joint (PBJ; sPBJ) testing procedure to compute test statistics and estimate the joint distribution of test statistics (Section
3). The PBJ procedure accomodates violations of exchangeability by using subject weights to deweight noisy observations at each voxel. If the selected weights are inversely proportional to the subjectlevel variances then the covariance function is correctly specified and the parameter estimators have minimum variance. In order to address the case that subject weights are not correctly specified we introduce the sPBJ procedure that uses a robust “sandwich” estimate of the covariance function, which is an asymptotically consistent estimator of the covariance function of the stochastic process (estimator in Long and Ervin, 2000; MacKinnon and White, 1985). Because the covariance function of the stochastic process is complex in imaging data, we leverage a bootstrap procedure to estimate the distribution of the maximum cluster size (Vandekar et al., 2017). These methods are asymptotically robust across any range of preprocessing parameters and cluster forming thresholds (CFTs) because they estimate the covariance structure from the observed data.We utilize a subset of 1,000 subjects from the Philadelphia Neurodevelopmental Cohort (PNC; Satterthwaite et al., 2014, 2016) and a novel approach to generate realistic simulated data in which the assumption of exchangeability is violated in order to show that the PBJ and sPBJ procedures are robust to violations of this assumption (Section 5). We demonstrate the procedures by studying the association between task performance and executive functioning measured with fMRI using the back task (Sections 4 and 6; Ragland et al., 2002; Satterthwaite et al., 2013). While the procedure is generally applicable for inference in spatial data sets, we developed an R package to perform the PBJ and sPBJ SEI procedures on NIfTI images that is available for download at https://github.com/simonvandekar/pbj and from Neuroconductor (https://www.neuroconductor.org/; Muschelli et al., 2018).
2 Spatial extent inference
Let the statistical map under the null, , be a stochastic process on , where denotes the normed space of bounded functions from a bounded subset to equipped with the uniform norm. Let be the observed statistical map that is sampled from an arbitrary stochastic process on . We assume that the sample paths of and are continuous almost everywhere. Let be a function that thresholds the map with a given cluster forming threshold (CFT), , and computes the volume of all suprathreshold contiguous clusters in the image. is in because there is an arbitrary number of contiguous clusters possible. To give the clusters a natural ordering, we assume that, for all sample paths , for and that there exists finite such that for all . That is, the clusters are indexed in decreasing order and there are a finite number of nonzero clusters.
Procedure 2.1 (Spatial extent inference).
For a fixed we set
(1) 
and call cluster “significant” if for some predetermined rejection threshold .
The pvalue (1) computed by Procedure 2.1 represents the probability of observing a cluster as or more extreme than , under the global null that is a sample from (Friston et al., 1994). Typically, the threshold is chosen such that suprathreshold regions have uncorrected pvalues less than some common CFT threshold, e.g. , which corresponds to a voxelwise under Gaussianity assumptions. The bootstrap samples generated by Procedures 3.1 and 3.2, below, can be used to perform the SEI described in Procedure 2.1 by approximating the distribution of across bootstrap samples.
3 Statistical theory for the bootstrap procedures
3.1 Overview
We propose two approaches to model the covariance function of the statistical map. The PBJ procedure assumes the within subject variance estimates are correct. The sPBJ procedure utilizes a working covariance function. An advantage of the sPBJ is that even if the working covariance function is incorrect the procedure still yields consistent standard errors.
Let , for , be stochastic processes on . represent the imaging data and is the bounded space of interest (in neuroimaging, the brain). We assume that
(2) 
where
is a row vector of nuisance covariates including the intercept,
is a row vector of variables of interest, , , the parameter image column vectors , , , error with and covariance function for all. Here, all capital letters denote random variables or stochastic processes. Throughout, we assume that the conditional mean
of model (2) is correctly specified. This model is a form of functiononscalar regression (Ramsay and Silverman, 2005; Reiss et al., 2010; Morris, 2015) and encompasses many models of interest in image analysis. Let , , , . Lastly, to describe sampling procedures we define to be the number of locations in the image.3.2 PBJ: parametric covariance functions
Without weights, the basic PBJ model assumes
(3) 
for all
; i.e. every subject has the same image covariance function. This can be generalized to heteroskedastic variances by using weighted regression (with subject specific weights) where the weights are, ideally, proportional to the inverse of the standard deviation,
. Inference using the weighted procedure is conditional on the weights, which are usually estimates (or approximations) of the standard deviation of . This approach is akin to similar multilevel approaches used in neuroimaging software (Woolrich et al., 2009; Penny and Holmes, 2007). If the weights are incorrectly specified, then the standard error estimates are asymptotically biased and the FWER may not be controlled at the nominal level. For simplicity, we describe the unweighted version of the PBJ procedure here. In Section 3.3 we describe the robust covariance function estimator used by sPBJ that allows subject specific weights.Under assumption (3) we can compute the statistic image
(4) 
where is the Fstatistic image for the test of the hypothesis
(5) 
denotes the inverse cumulative distribution function (CDF) of an Fdistribution on
and degrees of freedom, and is the CDF of a statistic (Vandekar et al., 2017). The transformation in (4) makes asymptotically a diagonal Wishart process (Vandekar et al., 2017),(6) 
under the null (5), where . The transformation on the right hand side of (4) improves the finite sample performance of the following procedure, which samples observations from a diagonal Wishart process, conditional on an estimate of .
Procedure 3.1.

Regress the observed image vector onto as in model (2) and obtain the residual image

Let be after the rows are standardized to be norm 1.

For sample standard normal random variables, and compute the statistic image
The maximum cluster size of the bootstrap images are then used to compute the cluster values (1).
3.3 sPBJ: sandwich covariance functions
Here, we utilize an estimating equations approach to obtain parameter estimates and robust standard errors. Standard error estimates using estimating equations are robust to misspecification of the covariance function which can occur when the subject variance function differs across . We use the following estimating equation
(7) 
where is a weight for subject that is an estimator for for . does not need to be an estimator of for the sPBJ procedure to be asymptotically valid, however the best choices are estimators that are nearly unbiased for and have small variance. Because our outcome data are stochastic processes indexed by , (7) is also a stochastic process.
We define as the solution to (7), which is equivalent to the least squares estimate for at each location in the image using the weight function . is unbiased in finite samples, if the mean function is correctly specified, and has covariance function equal to
(8) 
where
(9) 
and we assume both limits (9) exist. We define (9) as limits because the underlying observations are not identically distributed due to having different subjectlevel variances. The covariance (8) is derived by performing a Taylor expansion of the estimating equation (7) centered at the true value of the parameter and then computing their asymptotic covariances (see e.g. Boos and Stefanski, 2013, p. 300). The following theorem gives a useful way to compute an estimate of the sandwich covariance function for the parameter of interest. We use a modification (by Long and Ervin (2000)) of the covariance estimate of MacKinnon and White (1985), which has shown to be best at achieving nominal type 1 error rates in small samples over several heteroskedasticityconsistent estimators (Long and Ervin, 2000).
Theorem 3.1.
The covariance function for the parameter of interest is
where and are as defined in Supplementary Section S1.1.
The estimator
(10) 
is consistent for , where all objects are as defined in Supplementary Section S1.1.
The Wald statistic variable, , for the test of the hypothesis image
(11) 
where is fixed, is asymptotically chisquared (under the null),
(12) 
by Theorem 8.3 of Boos and Stefanski (2013, p. 345).
Theorem 3.2.
The proof of Theorem 3.2 is also given in the Appendix. When the joint distribution of is challenging to sample from, so that case is beyond the scope of this work.
Procedure 3.2.
Let all objects be as defined in Section S1.1. In order to generate samples , for from a :

Perform the weighted regression of onto using weights to obtain the weighted residuals .

Divide the residual vector elementwise by the diagonal of to obtain .

Construct the intermediate matrix
and let , be after the rows are standardized to be norm 1.

For sample independent standard normal random variables, , and compute the asymptotically normal test statistic vector
Recall that Procedures 3.1 and 3.2
are used to estimate the probability distribution necessary to use Procedure
2.1.4 Fractal back fMRI data description
We use the back task fMRI data from the PNC to generate realistic data for the simulation analyses. The fractal back task is a working memory task where complex geometric fractals were presented to the subject for 500 ms, followed by a fixed interstimulus interval of 2500 ms (Ragland et al., 2002). For each subject, approximately 11.5 minutes of back task fMRI were acquired as BOLDweighted timeseries (231 volumes; TR = 3000ms; TE = 32ms; FoV = 192x192mm; resolution 3mm isotropic); detailed scanning parameters are discussed in Satterthwaite et al. (2014, 2013). Images were presented under three conditions: In the 0back condition, participants responded with a button press to a specified target. For the 1back condition, participants responded if the current fractal was identical to the previous one; in the 2back condition, participants responded if the current fractal was identical to the item presented two trials previously (Satterthwaite et al., 2014). We include in our analyses a subset of 1,100 subjects who satisfied image quality and clinical exclusionary criteria from the total of 1,601 scanned as part of the PNC.
We apply a series of standard image preprocessing steps: distortioncorrection using FSL’s FUGUE, timeseries preprocessing, rigid registration, brain extraction, temporal filtering, and spatial smoothing (Smith and Brady, 1997; Jenkinson et al., 2012). subjectlevel models are fit using a linear model in FSL’s FILM software including convolved task boxcar functions for each level of the task and 24 motion covariates. The 20 back contrast of these estimates is used as the subjectlevel outcome for all analyses presented in this paper. The subjectlevel 20 back contrast’s variance is also utilized as a voxel level weight in Section 6.
The result of the subjectlevel preprocessing is a parameter estimate image for each subject, , and a variance estimate for each subject’s parameter estimate image. The parameter estimate images are then submitted to a second level analysis where the parameter estimates are regressed onto covariates (e.g. performance, age, and motion) in order to understand the association between task activation and performance at each voxel across subjects.
5 Simulation Analysis
5.1 Methods
In order to generate nonexchangeable data with heteroskedastic covariance functions we first residualize the back neuroimaging data set to a set of independent variables that have known associations with the imaging variable, including age, sex, inscannermotion (mot), and task performance (; Macmillan, 2002), using the unpenalized spline model
(14) 
where is an indicator for males, and , , and are fit using thin plate splines with 10 knots (Wood and Augustin, 2002; Wood, 2015). Fixed degree spline bases were used to model continuous variables to ensure that the mean of the residuals was completely unassociated with the covariates.
As a result of fitting this model, the residualized images have little to no mean association with the independent variables. However, the residualized images do not remove the effect the independent variables can have on the covariance structure of the imaging data. By drawing subsamples of subjects from the residualized images together with the independent variables we generate data that constitutes the global null hypothesis because the covariates are unassociated with the mean, however these covariates can still have an effect on the covariance function of the data. To demonstrate how the covariates affect the covariance function of the residualized data we create covariance matrices using nodes defined in the Power atlas for the 100 lowest and 100 highest valued subjects for motion and (Figure 1; Power et al., 2011). Fox example, high motion is associated with stronger interhemispheric and anteriorposterior correlations than low motion (Figure 1).
In each of 1,000 simulations we draw a bootstrap sample of size and fit the model
(15) 
where are the residuals from model (14). We then perform the test of for using SEI of a statistical map with two CFTs (; ).
We compare the PBJ and sPBJ (using 500 bootstraps) procedures to a standard GRFbased method (easythresh; Jenkinson et al., 2012) and to the permutation procedure (using 500 permutations; Winkler et al., 2014). For the simulation study we use the PBJ and sPBJ with uniform weights (PBJ(1); sPBJ(1)) and with the inverse of motion as the weight (PBJ(mot); sPBJ(mot)), which downweights subjects with high motion. The permutation procedure is fit using a single group variance estimate (Perm) as well as a two group variance (Perm Grp), which attempts to weaken the assumption of exchangeability by allowing for group differences in variance at each location (Winkler et al., 2014)
. The groups were determined by the upper and lower quantiles of motion. The SEI FWER is estimated by computing the proportion of simulations where a cluster is determined significant at two rejection thresholds
and .We use a novel procedure to assess power: for each of 1,000 simulations we randomly select 4 voxels within the brain and create an image of spheres, , with radii of 8, 10, 11, and 12 voxels centered at each of the 4 voxels. We set when is inside a sphere (and inside the brain) and 0 otherwise. represents the locations in the image where there is a true effect. Because the voxels are randomly selected the spheres could overlap or be cropped depending on the proximity to the edge of the brain, thus the number and size of clusters is random across simulations. For locations where , the variable is first orthogonalized to the other covariates and then we set
where is as defined in (15) and is the variance of the orthogonalized variable. This approach ensures that the voxel level effect size of is approximately equal to 0.4. Power curves as a function of cluster size are estimated for each sample size using shape constrained additive models (SCAMs) using the scam package in R (Pya and Wood, 2015). SCAMs were used to ensure the estimated power curves were monotonically increasing. Power was only assessed for procedures that had nearly nominal type 1 error rate.
5.2 Results
Simulations under the global null (5) were used to assess FWER for each of the SEI procedures. For the test of the motion variable all methods have inflated FWER for smaller sample sizes (), except the PBJ and sPBJ procedures with motion dewighting (Figure 2). Because the sPBJ procedure without weights is still asymptotically valid the FWER falls to the nominal level with increasing sample size. The permutation procedures and the PBJ procedure without uniform weights actually increase FWER with increasing sample size. The GRFbased method always has a FWER above 0.9. This pattern holds across both CFTs (0.01, 0.005) and both nominal FWER (0.01, 0.05).
Our novel null simulation approach allows us to assess how the heteroskedasticity of real variables of interest affect the FWER for each procedure. For the test of most procedures controlled the FWER at the nominal level (Figure 2). The FWER for the test of using the PBJ procedure with motion deweighting increases above 0.1 with both CFTs when the desired level is 0.05. The GRFbased procedure maintains a FWER above 0.4.
Simulations where known signal is added to the image are used to assess the power of each method. The PBJ and sPBJ with motion deweighting have comparable power, whereas all methods without motion deweighting have similar, and lower, power (Figure 3
). Oddly, the permutation procedure with group variances has a detrimental effect on the power of the procedure. The results demonstrate that the capability of the PBJ and sPBJ procedures to downweight observations with high motion improves the power to detect smaller clusters of activation in small sample sizes (Figure
3).6 Nback data analysis
We next investigate the effect of task performance on 20 back activation. We model the 20 back subjectlevel parameter estimates using model (15). We use the inverse of the subjects’ estimated variance images as weights in the estimating equation (7) for the PBJ and sPBJ procedures (using 5,000 bootstraps). We compare the results using the (s)PBJ procedures to the permutation procedure without weights (using 5,000 permutations) with a CFT of 0.005.
The PBJ and sPBJ procedures have a larger rejection volume and a more extreme average statistical value in the rejected regions (Table 1). The permutation and sPBJ procedures identify one large contiguous cluster and one small cluster of deactivation. The PBJ fails to detect the deactivated cluster, which is marginally significant. Statistical maps for the PBJ and sPBJ procedure are similar as they both deweight high variance subjects.
Procedure  Rej Vol (voxels)  Mean 

PBJ  31,419  3.80 
sPBJ  32,720  3.79 
Perm  24,893  3.61 
Task performance is positively associated with activation in the thalamus, dorsolateral prefrontal cortex (DLPFC), cingulate, and precuneus (Figure 4). Performance was also associated with diffuse white matter activation that may connect functional gray matter regions or be related to reaction time differences correlated with task performance (Yarkoni et al., 2009; Mazerolle et al., 2010; Ding et al., 2018). Task performance is negatively associated with activation in the ventromedial prefrontal cortex.
7 Discussion
The sPBJ procedure is a robust multileveling approach because it produces consistent covariance function estimators even when the covariance model is misspecified. When using the sPBJ, investigators can use arbitrary estimates of noise as weights to improve signal and still be confident that the FWER will be near nominal level in large samples. The permutation, GRF, and PBJ procedures that do not have consistent standard error estimates can have FWERs that grow as the sample size increases. Given that the sPBJ has closer to nominal FWER than the PBJ and permutation procedures, while maintaining equal or superior power, it is an appealing option for SEI in small and large samples.
Our findings support those presented in other papers (Eklund et al., 2016; Silver et al., 2011; Greve and Fischl, 2018): GRFbased methods do not have robust FWER control across a range of parameter settings, regardless of whether heteroskedasticity is present. The only variable that produces enough heteroskedasticity to affect the permutation procedure is the motion covariate; the permutation procedure performs well with , which is a typical covariate of interest. Thus, a model that assumes subject exchangeability is appropriate for the covariate we consider here, albeit with a slight loss in power. It is possible that covariates that are highly correlated with motion (e.g. ADHD diagnosis) may have FWERs that are far from the nominal level using the permutation procedure.
The bootstrap approach to approximating the probability (1
) is quite general and need not be used with the two models presented here. So long as the joint distribution of the test statistics is asymptotically normal or chisquared and a square root of the covariance matrix can be computed, then the PBJ procedure can be used to sample from the joint distribution of the test statistics. This suggests the method can be used to study patterns generated by machine learning methods
(see e.g. Gaonkar and Davatzikos, 2013).Here, we perform analyses and simulations of task fMRI data, which is one of the most common multilevel neuroimaging models. However, the PBJ procedure is appropriate for other types of imaging modalities as subjectlevel weighting can be used for models that are not typically considered multilevel models. For example, estimates of voxel level variance can be used to deweight noisy cerebral blood flow images computed from arterial spin labeled MRI or measurements of T1 image quality can be used to deweight images affected with motion artifact (Rosen et al., 2018). The sPBJ procedure can also be used for metaanalyses of publicly available statistical maps (e.g. Gorgolewski et al., 2015; Maumet et al., 2016). Typically, variance maps are needed to accurately fit such multilevel models, but variance maps are seldom made publicly available. An advantage of the sPBJ procedure for metaanalysis is that it does not require variance maps for valid statistical inference, so weights could be determined from other reported variables, such as the sample size.
There are two apparent limitations of the approach: First, the sPBJ procedure relies on consistency, which only provides guarantees for FWER control as the sample size gets large. Our simulations demonstrate that the procedure has nominal FWER control in relatively small samples (e.g.
), however this convergence is likely slower in very heavily skewed datasets. Second, the distribution of the statistical map (
13) is unknown when , so the procedure is not able to perform tests of multiple parameters at each location. Future work is necessary to derive the distribution of the statistical map in the case when and develop efficient methods to sample from the joint distribution of the statistical image. Despite these limitations, the sPBJ procedure is a robust multilevel method for SEI that can be used to further our understanding of brainbehavior associations.Acknowledgements
Funding
RC2 grants from the National Institute of Mental Health MH089983 and MH089924. National Institutes of Health grants: (5P30CA06848522 to S.N.V.); (R01MH107235 to R.C.G., R.E.G., R.T.S.), (R01MH107703 to T.D.S., R.T.S.), (R01MH112847 to TDS, KR, and RTS) and (R01NS085211 to R.T.S.).
References
 Ayachit (2015) Ayachit, U. (2015). The ParaView Guide: A Parallel Visualization Application. Kitware, Inc., USA.
 Benjamini and Heller (2007) Benjamini, Y. and Heller, R. (2007). False Discovery Rates for Spatial Signals. Journal of the American Statistical Association 102, 1272–1281.
 Benjamini and Yekutieli (2001) Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of statistics pages 1165–1188.
 Boos and Stefanski (2013) Boos, D. D. and Stefanski, L. A. (2013). Essential Statistical Inference: Theory and Methods. Springer Texts in Statistics. SpringerVerlag, New York.
 Ding et al. (2018) Ding, Z., Huang, Y., Bailey, S. K., Gao, Y., Cutting, L. E., Rogers, B. P., Newton, A. T., and Gore, J. C. (2018). Detection of synchronous brain activity in white matter tracts at rest and under functional loading. Proceedings of the National Academy of Sciences 115, 595–600.
 Eklund et al. (2018) Eklund, A., Knutsson, H., and Nichols, T. E. (2018). Cluster Failure Revisited: Impact of First Level Design and Data Quality on Cluster False Positive Rates. arXiv:1804.03185 [stat] arXiv: 1804.03185.
 Eklund et al. (2016) Eklund, A., Nichols, T. E., and Knutsson, H. (2016). Cluster failure: why fMRI inferences for spatial extent have inflated falsepositive rates. Proceedings of the National Academy of Sciences page 201602413.
 Flandin and Friston (2016) Flandin, G. and Friston, K. J. (2016). Analysis of familywise error rates in statistical parametric mapping using random field theory. Human brain mapping .
 Friston et al. (1994) Friston, K. J., Worsley, K. J., Frackowiak, R. S., Mazziotta, J. C., and Evans, A. C. (1994). Assessing the significance of focal activations using their spatial extent. Human brain mapping 1, 210–220.

Gaonkar and
Davatzikos (2013)
Gaonkar, B. and Davatzikos, C. (2013).
Analytic estimation of statistical significance maps for support vector machine based multivariate image analysis and classification.
NeuroImage 78, 270–283.  Gorgolewski et al. (2015) Gorgolewski, K. J., Varoquaux, G., Rivera, G., Schwarz, Y., Ghosh, S. S., Maumet, C., Sochat, V. V., Nichols, T. E., Poldrack, R. A., Poline, J.B., Yarkoni, T., and Margulies, D. S. (2015). NeuroVault.org: a webbased repository for collecting and sharing unthresholded statistical maps of the human brain. Frontiers in Neuroinformatics 9,.
 Greve and Fischl (2018) Greve, D. N. and Fischl, B. (2018). False positive rates in surfacebased anatomical analysis. NeuroImage 171, 6–14.
 Jenkinson et al. (2012) Jenkinson, M., Beckmann, C. F., Behrens, T. E., Woolrich, M. W., and Smith, S. M. (2012). Fsl. Neuroimage 62, 782–790.
 Kessler et al. (2017) Kessler, D., Angstadt, M., and Sripada, C. S. (2017). Reevaluating “cluster failure” in fMRI using nonparametric control of the false discovery rate. Proceedings of the National Academy of Sciences 114, E3372–E3373.

Long and
Ervin (2000)
Long, J. S. and Ervin, L. H. (2000).
Using heteroscedasticity consistent standard errors in the linear regression model.
The American Statistician 54, 217–224.  MacKinnon and White (1985) MacKinnon, J. G. and White, H. (1985). Some heteroskedasticityconsistent covariance matrix estimators with improved finite sample properties. Journal of econometrics 29, 305–325.
 Macmillan (2002) Macmillan, N. A. (2002). Signal detection theory. Stevens’ handbook of experimental psychology 4, 43–90.
 Madan (2015) Madan, C. R. (2015). Creating 3d visualizations of MRI data: A brief guide. F1000Research 4,.
 Maumet et al. (2016) Maumet, C., Auer, T., Bowring, A., Chen, G., Das, S., Flandin, G., Ghosh, S., Glatard, T., Gorgolewski, K. J., Helmer, K. G., Jenkinson, M., Keator, D. B., Nichols, B. N., Poline, J.B., Reynolds, R., Sochat, V., Turner, J., and Nichols, T. E. (2016). Sharing brain mapping statistical results with the neuroimaging data model. Scientific Data 3,.
 Mazerolle et al. (2010) Mazerolle, E. L., Beyea, S. D., Gawryluk, J. R., Brewer, K. D., Bowen, C. V., and D’arcy, R. C. (2010). Confirming white matter fMRI activation in the corpus callosum: colocalization with DTI tractography. Neuroimage 50, 616–621.
 Morris (2015) Morris, J. S. (2015). Functional regression. Annual Review of Statistics and Its Application 2, 321–359.
 Mueller et al. (2017) Mueller, K., Lepsien, J., Möller, H. E., and Lohmann, G. (2017). Commentary: Cluster failure: Why fMRI inferences for spatial extent have inflated falsepositive rates. Frontiers in Human Neuroscience 11,.
 Muschelli et al. (2018) Muschelli, J., Gherman, A., Fortin, J.P., Avants, B., Whitcher, B., Clayden, J. D., Caffo, B. S., and Crainiceanu, C. M. (2018). Neuroconductor: an R platform for medical imaging analysis. Biostatistics .
 Pacifico et al. (2004) Pacifico, M. P., Genovese, C., Verdinelli, I., and Wasserman, L. (2004). False Discovery Control for Random Fields. Journal of the American Statistical Association 99, 1002–1014.
 Penny and Holmes (2007) Penny, W. and Holmes, A. P. (2007). Random effects analysis. Statistical parametric mapping: The analysis of functional brain images pages 156–165.
 Power et al. (2011) Power, J. D., Cohen, A. L., Nelson, S. M., Wig, G. S., Barnes, K. A., Church, J. A., Vogel, A. C., Laumann, T. O., Miezin, F. M., Schlaggar, B. L., and Petersen, S. E. (2011). Functional network organization of the human brain. Neuron 72, 665–678.
 Power et al. (2010) Power, J. D., Fair, D. A., Schlaggar, B. L., and Petersen, S. E. (2010). The Development of Human Functional Brain Networks. Neuron 67, 735–748.
 Pya and Wood (2015) Pya, N. and Wood, S. N. (2015). Shape constrained additive models. Statistics and Computing 25, 543–559.
 Ragland et al. (2002) Ragland, J. D., Turetsky, B. I., Gur, R. C., GunningDixon, F., Turner, T., Schroeder, L., Chan, R., and Gur, R. E. (2002). Working Memory for Complex Figures: An fMRI Comparison of Letter and Fractal nBack Tasks. Neuropsychology 16, 370–379.
 Ramsay and Silverman (2005) Ramsay, J. and Silverman, B. W. (2005). Functional Data Analysis. Springer Series in Statistics. SpringerVerlag, New York, 2 edition.
 Reiss et al. (2010) Reiss, P. T., Huang, L., and Mennes, M. (2010). Fast functiononscalar regression with penalized basis expansions. The international journal of biostatistics 6,.
 Romano et al. (2008) Romano, J. P., Shaikh, A. M., and Wolf, M. (2008). Control of the false discovery rate under dependence using the bootstrap and subsampling. Test 17, 417.
 Rosen et al. (2018) Rosen, A. F. G., Roalf, D. R., Ruparel, K., Blake, J., Seelaus, K., Villa, L. P., Ciric, R., Cook, P. A., Davatzikos, C., Elliott, M. A., Garcia de La Garza, A., Gennatas, E. D., Quarmley, M., Schmitt, J. E., Shinohara, R. T., Tisdall, M. D., Craddock, R. C., Gur, R. E., Gur, R. C., and Satterthwaite, T. D. (2018). Quantitative assessment of structural image quality. NeuroImage 169, 407–418.
 Satterthwaite et al. (2016) Satterthwaite, T. D., Connolly, J. J., Ruparel, K., Calkins, M. E., Jackson, C., Elliott, M. A., Roalf, D. R., Hopson, R., Prabhakaran, K., Behr, M., Qiu, H., Mentch, F. D., Chiavacci, R., Sleiman, P. M. A., Gur, R. C., Hakonarson, H., and Gur, R. E. (2016). The Philadelphia Neurodevelopmental Cohort: A publicly available resource for the study of normal and abnormal brain development in youth. NeuroImage 124, 1115–1119.
 Satterthwaite et al. (2014) Satterthwaite, T. D., Elliott, M. A., Ruparel, K., Loughead, J., Prabhakaran, K., Calkins, M. E., Hopson, R., Jackson, C., Keefe, J., Riley, M., Mentch, F. D., Sleiman, P., Verma, R., Davatzikos, C., Hakonarson, H., Gur, R. C., and Gur, R. E. (2014). Neuroimaging of the Philadelphia neurodevelopmental cohort. NeuroImage 86, 544–553.
 Satterthwaite et al. (2013) Satterthwaite, T. D., Wolf, D. H., Erus, G., Ruparel, K., Elliott, M. A., Gennatas, E. D., Hopson, R., Jackson, C., Prabhakaran, K., Bilker, W. B., Calkins, M. E., Loughead, J., Smith, A., Roalf, D. R., Hakonarson, H., Verma, R., Davatzikos, C., Gur, R. C., and Gur, R. E. (2013). Functional Maturation of the Executive System during Adolescence. The Journal of Neuroscience 33, 16249–16261.
 Satterthwaite et al. (2012) Satterthwaite, T. D., Wolf, D. H., Loughead, J., Ruparel, K., Elliott, M. A., Hakonarson, H., Gur, R. C., and Gur, R. E. (2012). Impact of inscanner head motion on multiple measures of functional connectivity: relevance for studies of neurodevelopment in youth. Neuroimage 60, 623–632.
 Silver et al. (2011) Silver, M., Montana, G., Nichols, T. E., and Initiative, A. D. N. (2011). False positives in neuroimaging genetics using voxelbased morphometry data. Neuroimage 54, 992–1000.
 Slotnick (2017) Slotnick, S. D. (2017). Cluster success: fMRI inferences for spatial extent have acceptable falsepositive rates. Cognitive Neuroscience 8, 150–155.

Smith and
Brady (1997)
Smith, S. M. and Brady, J. M. (1997).
SUSAN—a new approach to low level image processing.
International journal of computer vision
23, 45–78.  Sun et al. (2015) Sun, W., Reich, B. J., Tony Cai, T., Guindani, M., and Schwartzman, A. (2015). False discovery control in largescale spatial multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77, 59–83.
 Van der Vaart (2000) Van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge university press.
 Vandekar et al. (2017) Vandekar, S. N., Satterthwaite, T. D., Rosen, A., Ciric, R., Roalf, D. R., Ruparel, K., Gur, R. C., Gur, R. E., and Shinohara, R. T. (2017). Faster familywise error control for neuroimaging with a parametric bootstrap. Biostatistics .
 Winkler et al. (2014) Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., and Nichols, T. E. (2014). Permutation inference for the general linear model. Neuroimage 92, 381–397.
 Wood (2015) Wood, S. (2015). Package ‘mgcv’. R package version 1, 29.
 Wood and Augustin (2002) Wood, S. N. and Augustin, N. H. (2002). GAMs with integrated model selection using penalized regression splines and applications to environmental modelling. Ecological modelling 157, 157–177.
 Woodward and Heckers (2016) Woodward, N. D. and Heckers, S. (2016). Mapping Thalamocortical Functional Connectivity in Chronic and Early Stages of Psychotic Disorders. Biological Psychiatry 79, 1016–1025.
 Woolrich et al. (2009) Woolrich, M. W., Jbabdi, S., Patenaude, B., Chappell, M., Makni, S., Behrens, T., Beckmann, C., Jenkinson, M., and Smith, S. M. (2009). Bayesian analysis of neuroimaging data in FSL. Neuroimage 45, S173–S186.
 Worsley et al. (1999) Worsley, K. J., Andermann, M., Koulis, T., MacDonald, D., and Evans, A. C. (1999). Detecting changes in nonisotropic images. Human brain mapping 8, 98–101.
 Xia et al. (2013) Xia, M., Wang, J., and He, Y. (2013). BrainNet Viewer: A Network Visualization Tool for Human Brain Connectomics. PLOS One 8,.
 Yarkoni et al. (2009) Yarkoni, T., Barch, D. M., Gray, J. R., Conturo, T. E., and Braver, T. S. (2009). BOLD Correlates of TrialbyTrial Reaction Time Variability in Gray and White Matter: A MultiStudy fMRI Analysis. PLOS ONE 4, e4257.
 Yekutieli and Benjamini (1999) Yekutieli, D. and Benjamini, Y. (1999). Resamplingbased false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference 82, 171–196.
 Yushkevich et al. (2016) Yushkevich, P. A., Gao, Y., and Gerig, G. (2016). ITKSNAP: An interactive tool for semiautomatic segmentation of multimodality biomedical images. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 3342–3345.
S1 Supplementary Materials
s1.1 Notation
The following objects are defined for Theorem 3.1.
s1.2 Assumptions for Theorems 3.1 and 3.2
(S3)  
We use Theorem 18.14 of Van der Vaart (2000) to prove weak convergence of which requires an additional assumption: for all there exists a partition of into finitely many sets , such that
(S4) 
s1.3 Proofs of Theorems 3.1 and 3.2
Proof.
Proof of Theorem 3.1
Finding the covariance of and for involves extracting the bottom right block matrix of
(S5) 
Because is unwieldy the derivation is messy, but not intellectually stimulating and so is not shown here.
The estimator (10) can be rewritten
It follows that if
(S6)  
and
(S7)  
then is a consistent estimator of . Equations (S6
) hold by the weak law of large numbers given assumptions (
S3). Note, because is a function of we must use a weak law that allows the averages in (S7) to depend on parameter estimates such as Theorem 7.3 of (Boos and Stefanski, 2013, p. 329). The assumptions (S3) are satisfactory to use this theorem as the functionin a neighborhood of the true parameter value and . ∎
Proof of Theorem 3.2.
Let be fixed for . Note that for . The convergence of follows because our estimating equation (7) satisfies a set of regularity conditions (see e.g. Boos and Stefanski, 2013). The additional assumptions (9) and (S3) are required to so that and do not need to be identically distributed. These assumptions are satisfactory for the convergence of the covariance
Thus, the continuous mapping theorem implies
where . follows a diagonal Wishart distribution by definition. The weak convergence of the stochastic process follows by (S4) and Theorem 18.14 of (Van der Vaart, 2000).
∎