Statistical Agnostic Mapping: a Framework in Neuroimaging based on Concentration Inequalities

12/27/2019 ∙ by J M Gorriz, et al. ∙ University of Cambridge ∙ University of Granada

In the 1970s a novel branch of statistics emerged that focused on selecting, in the pattern recognition problem, a function that fulfils a definite relationship between the quality of the approximation and its complexity. These data-driven approaches are mainly devoted to problems of estimating dependencies with limited sample sizes and comprise all the empirical out-of-sample generalization approaches, e.g. cross-validation (CV) approaches. Although the latter are not designed for testing competing hypotheses or comparing different models in neuroimaging, there are a number of theoretical developments within this theory which could be employed to derive a Statistical Agnostic (non-parametric) Mapping (SAM) at voxel or multi-voxel level. Moreover, SAMs could i) relieve the problem of instability in limited sample sizes when estimating the actual risk via the CV approaches, e.g. large error bars, and ii) provide an alternative to Family-wise-error (FWE) corrected p-value maps in inferential statistics for hypothesis testing. In this sense, we propose a novel framework in neuroimaging based on concentration inequalities, which results in (i) a rigorous development for model validation with a small sample/dimension ratio, and (ii) a less conservative procedure than FWE p-value correction, to determine the brain significance maps from the inferences made using small upper bounds of the actual risk.




1 Introduction

In the last decades neuroscience has transitioned from qualitative case reports to quantitative, longitudinal and multivariate population studies in the quest for defining the abnormal patterns of disease pathogenesis. Neuroscience has recently provided valuable insight by means of classical statistics, e.g. statistical inference based on null-hypothesis ($H_0$) testing or regression-type analyses. Thus, the brain mapping community has predominantly used null-hypothesis testing for exploratory analyses in whole-brain searches [1]. In this context, classical inference emphasises in-sample image-based statistical estimates from previously assumed data models to determine the existence of relevant effects (large or subtle) in binary group comparisons. The critical p-value (significant vs. not significant) is often complemented by effect-size measures of the magnitude of the phenomenon [2].

On the other hand, out-of-sample generalization approaches in machine learning (ML), such as cross-validation (CV), try to estimate on unseen new data the actual error of the classifier in the (binary) classification problem. Although the methods and goals of predictive CV inference are distinct from the classical extrapolation procedure [9], they are actually exploited within statistical frameworks aimed at providing statistical significance [27]. Examples are bootstrapping, binomial or permutation (“resampling”) tests, etc., which have proved to be competitive outside the comfort zone of classical statistics, filling otherwise unmet inferential needs.

In the pattern classification problem we usually assume the existence of classes that are differentiated by the classifiers in terms of accuracy (Acc) on a presumably independent dataset. Empirical confidence intervals or plausible Acc values derived from CV are consequently used to evaluate the system performance and to draw conclusions (improperly, in a statistical sense). Moreover, with limited sample sizes the most popular K-fold CV method [4] has been shown to work sub-optimally under unstable conditions [6, 5, 7], so the predictive power of the fitted classifiers can be questionable.

Beyond the latter techniques, CV in ML is well framed within a data-driven statistical learning theory (SLT), which is mainly devoted to problems of estimating dependencies with limited amounts of data [3]. Although CV-ML approaches were not originally designed to test hypotheses in brain mapping [1], they are theoretically grounded to provide maps of confidence intervals (protected inference). As shown in the present study, this can be achieved by assessing the upper bounds of the actual error in a binary classification problem, and by using simple significance tests for a population proportion. Thus, assessing with high probability the quality of the fitting function (and its generalization ability) in terms of in/out-sample predictions can be conceptualised, under a hypothesis testing scenario, as the inverse problem of “carefully rejecting $H_0$”, that is, the problem of rejecting $H_1$ and thus accepting $H_0$ (there is no effect or it is not significant).

The paper is organized as follows. In section 2 we derive analytical upper bounds using the agnostic or model-free formulation of the learning problem. In connection with the drawbacks pointed out in [8] regarding the CV-based inference challenge that “they are not functions of the complete data” set, it is worth mentioning that the previous model considers all the available data. Then, the learning algorithm is fitted in the best possible way to the empirical data, as shown in section 3, to obtain the empirical error. Together with a deviation quantity that is analytically derived beforehand, this empirical error provides an upper bound on the real (actual) error of the model. Sample size and the empirical settings regarding the complexity of the selected classifiers are key to the proposed neuroimaging methodology, as they condition the degrees of freedom and the number of separating functions used to define the aforementioned deviation quantity. In a nutshell, low-dimensional scenarios and linear classifiers are required, since they readily yield a strong link between both errors. Under these conditions, we can determine which regions across volumes are within these confidence intervals with a prescribed probability, which corresponds to statistical significance in a group comparison (see section 5). To this end, we need to estimate a probability threshold for the obtained accuracy values of each region in order to reject the null hypothesis on the underlying population proportion; a regionally specific activation can then be stated under the Statistical Agnostic Mapping (SAM) framework.

2 Methods: bounding the actual error with probability 1-

2.1 Background on Agnostic learning

Let us assume the agnostic model in the problem of binary pattern classification as proposed in [10]. Given an independent and identically distributed sample of -dimensional predictor and class pairs, , where each of them is drawn from the unknown , the goal is to construct a good approximation to an unknown target function , using a class of functions : , and evaluating their goodness by a predefined expected loss:


where the loss function

and is a random element of the hypothesis space .

To simplify notation, let us consider the function composition, i.e. , to define the class of functions : with expected loss (probability of error) . Thus, the empirical error can be determined by:


A learning algorithm particularly selects given the sample , i.e. via the empirical risk minimization (ERM) [3], and hopefully provides:

  • a real error (on the ideal infinite population) close to the one obtained on the sample, that is,

  • and close to the minimum risk ,

2.2 Upper Bound based on concentration inequalities

Unfortunately, the aforementioned statement of is not generally true. More precisely:


with an arbitrarily . Under the worst case scenario the uniform deviation can be defined as , for any . Using the ERM algorithm we readily get the following concentration inequalities:


Bounding can be (not readily) achieved by using several theorems and lemmas of the SLT [17, 21, 20, 18, 19] to finally get (see Appendix; a similar bound can be achieved for the second row in equation 4):


with probability , where is the cardinality of or the number of separating functions given the sample realization (a trivial bound for this quantity can also be found).
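As a rough numerical illustration of how a deviation term of this kind behaves, the sketch below uses the textbook result obtained from Hoeffding's inequality and a union bound over a finite set of N separating functions: with probability at least 1 - delta, the actual error exceeds the empirical error by at most sqrt((ln N + ln(1/delta)) / (2n)). This is a standard SLT result and is not claimed to be the exact inequality derived in the Appendix; the names `hoeffding_deviation`, `n`, `n_functions` and `delta` are illustrative assumptions.

```python
import math


def hoeffding_deviation(n, n_functions, delta):
    """One-sided deviation term for a finite class of functions.

    Assumed textbook form (Hoeffding + union bound), not the paper's equation:
        actual_error <= empirical_error + sqrt((ln N + ln(1/delta)) / (2 n))
    with probability at least 1 - delta.
    """
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    if n_functions < 1:
        raise ValueError("n_functions must be at least 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt((math.log(n_functions) + math.log(1.0 / delta)) / (2.0 * n))


if __name__ == "__main__":
    # The term shrinks with the sample size and grows with the number of functions.
    for n in (50, 200, 800):
        print(n, round(hoeffding_deviation(n, n_functions=100, delta=0.05), 4))
```

The term decays as the square root of the sample size and grows only logarithmically with the number of separating functions, which is why restricting the classifier complexity is an effective way to keep the bound small.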

3 Fitting the selected function to current data

3.1 Feature extraction and selection

In order to minimize the left part of equation 5 we could minimize one (or both) of the summands on the right. However, they depend on each other in terms of the classifier complexity [3]. One solution could be, as explained in the next section, to prevent the number of separating functions from increasing, given the sample, by selecting a low classifier order [11], i.e. linear decision functions. However, this comes at the cost of a possibly non-negligible empirical error.

In an attempt to reduce the ratio between dimensionality and sample size (curse of dimensionality), the machine learning community usually employs feature extraction and selection (FES) methods to enhance the classification performance while preserving the system complexity. This can be achieved by removing irrelevant features from the sample, which can also facilitate interpretation (FS), and by identifying multivariate sets of meaningful features (FE) that best discriminate the classes [14]. The final aim is to provide an almost linearly separable classification problem in the feature space.

Several methods have been employed in neuroimaging with the aim of reducing the dimensionality of the problem (in relation to the sample size), based on statistical tests for FS [16], matrix decompositions [12] or even deep learning architectures for FES [15]. To validate the methodology proposed in this paper, we perform FE using a popular method in neuroscience, the Partial Least Squares (PLS) algorithm [12]. PLS methods have demonstrated their utility in describing the relation between brain activity and experimental design or behaviour measures within a multivariate framework (see [12, 13, 6] and the appendix for mathematical details and the interpretation of the PLS-maps as a classical t-test).

3.2 Linear Decision functions: a small upper bound

Regularized linear decision functions have recently been applied to neuroimaging for detecting activation patterns, and compared with parametric hypothesis testing, such as univariate t-tests [24, 22, 23]. In general, these studies have limited their analyses to providing in-sample estimates based on resampling, failing to demonstrate their out-of-sample performance in terms of confidence intervals.

As stated before, the minimization of the left part of inequality 5 can be achieved by decreasing the number of separating functions given the sample. This quantity is indeed decreased by selecting a classifier based on linear decision functions in a binary classification problem, following the results in the extant literature [11, 3]. After transforming and selecting the feature set by FES methods, the concentration inequalities 5 obtained with linear classifiers result in a strong association with a given confidence level whenever the extracted features are significant across regions of interest (ROIs) and group comparisons.

Beyond the existing caveats and solutions when using regularization methods in neuroimaging for FS, we adopt the linear support vector machine (SVM) classification algorithm which allows us to tentatively evaluate the worst case of

, that is, and to set the following upper bound [3]:


with probability and is the VC dimension, e.g. for linear classifiers. In the same manner, several upper bounds could be tested based on several innovative concepts and paradigms, such as the ones based on data distributions, the set’s shape, Rademacher averages, pseudo-dimension, fat-shattering dimension, etc. [30, 31]. Owing to its simplicity, we preferred to use the upper bound recently proposed in [5]. The latter is strongly grounded on the geometrical assumption that the samples are in general position and on the function-counting theorem of homogeneously linearly separable dichotomies [11]:


With the help of expressions such as inequalities 6 and 7, we can even evaluate the deviation of the empirical error from the actual error at voxel level, although it is preferable, for the aforementioned reasons, to do it region-wise using a fitted linear SVM classifier in the multivariate feature space (see figure 1). In this sense, the motivation for a multivariate framework in assessing the areas of relevance is analogous to other proposed techniques for addressing the multiple comparison problem in functional imaging, e.g. Random Field Theory for neuroimaging analysis [32] or the classical p-value corrections for multiple comparison after null-hypothesis testing. In general, only those voxels (or ROIs) showing a tight association, i.e. high performance in terms of accuracy, should be considered as relevant maps or patterns in that particular condition with probability .
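To make the two ingredients discussed above more tangible, the following sketch computes (a) Cover's function-counting number, i.e. the number of homogeneously linearly separable dichotomies of n points in general position in d dimensions, C(n, d) = 2 * sum_{k=0}^{d-1} binom(n-1, k), and (b) one common textbook form of a VC-type deviation term, 2 * sqrt(2 * (h * ln(2en/h) + ln(2/delta)) / n). Both expressions are standard results quoted here as assumptions; they are not claimed to reproduce inequalities 6 and 7 term by term, and the function and parameter names are hypothetical.

```python
import math


def cover_count(n, d):
    """Cover's function-counting theorem.

    Number of homogeneously linearly separable dichotomies of n points in
    general position in R^d: C(n, d) = 2 * sum_{k=0}^{d-1} binom(n-1, k).
    """
    if n < 1 or d < 1:
        raise ValueError("n and d must be positive integers")
    # math.comb returns 0 when k > n - 1, so C(n, d) = 2**n whenever d >= n.
    return 2 * sum(math.comb(n - 1, k) for k in range(d))


def vc_deviation(n, h, delta):
    """Assumed textbook VC-type deviation term (requires n >= h >= 1)."""
    if h < 1 or n < h:
        raise ValueError("need n >= h >= 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    inner = h * math.log(2.0 * math.e * n / h) + math.log(2.0 / delta)
    return 2.0 * math.sqrt(2.0 * inner / n)


if __name__ == "__main__":
    n, delta = 400, 0.05  # hypothetical sample size and confidence parameter
    for d in (1, 2, 5):
        # Fraction of all 2**n labelings that a linear classifier can realise,
        # and the VC-type term with h = d (homogeneous linear classifiers).
        fraction = cover_count(n, d) / 2 ** n
        print(d, fraction, round(vc_deviation(n, d, delta), 3))
```

For n much larger than d only a vanishing fraction of the possible labelings is linearly separable, which is the geometric reason why low-dimensional linear classifiers admit small deviation terms.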

Figure 1: The upper bound of inequality 7 connecting actual and empirical errors with level of confidence. Note how increasing the feature dimension results in a larger bound (blue solid line). However, working in low dimensional scenarios, i.e. , and using medium-sized datasets (), the confidence interval is less than .

4 Statistical Agnostic Mapping

The significant areas derived from SAM correspond, by construction, to those regions whose empirical error is associated, under the worst-case scenario, with an actual accuracy greater than the random-guess accuracy. Confidence intervals derived from the concentration inequalities allow us to bound the worst case at the “upper” border of the confidence interval, providing a protective inference. Thus, within this confidence interval, a significance test can be used to make an inference about whether the accuracy value for a specific region differs from the null hypothesis of the random proportion (see Appendix). Therefore, the statistical significance of any region is assessed, in combination with confidence intervals, by evaluating the p-value of any ROI at a given significance level. A total of 116 standardized regions [28] were analysed within a protective interval, avoiding the limitations of significance tests in distinguishing statistical from practical importance (see Appendix).
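A significance test for a population proportion can be sketched as follows. The sketch assumes the usual normal-approximation z-test, z = (p_hat - p0) / sqrt(p0 (1 - p0) / n), with a one-sided alternative and a chance level p0 = 0.5 for a balanced binary problem; the exact test used in the Appendix may differ, and all names are hypothetical.

```python
import math


def proportion_test_p_value(accuracy, n, p0=0.5):
    """One-sided p-value for H0: proportion <= p0 (normal approximation).

    accuracy : observed proportion of correct decisions, here meant to be the
               worst-case accuracy (empirical accuracy minus the deviation).
    n        : number of samples the proportion was computed on.
    p0       : proportion under the null hypothesis (assumed chance level).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    z = (accuracy - p0) / math.sqrt(p0 * (1.0 - p0) / n)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical worst-case accuracy of 0.60 on 100 samples.
    print(proportion_test_p_value(0.60, 100))
```

Applying the test to the worst-case accuracy rather than to the empirical accuracy is what makes the inference protective: a region is declared significant only if it would still beat chance after the deviation has been subtracted.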

In the following sections we will show how the combination of the aforementioned protective intervals and significance tests may be used to derive a SAM in different group comparisons, such as Alzheimer’s disease (AD) vs healthy controls (HC), Parkinson’s disease (PD) vs HC, and on a well-known example of a single-subject activation map in fMRI, and how they relate to the classical approach based on null-hypothesis testing, i.e. a two-sample t-test with corrected p-value. Unlike previous approaches, the proposed model-free method is less specific but more robust against sample size, artifacts and nuisance effects. See the complete diagram of the proposed method in figure 2.

Figure 2: Complete diagram of the proposed methodology including typical preprocessing steps in SPM for different modalities (left column of blocks), classification fitting and FES for actual risk estimation (middle column) and inference to derive the SAM (right column).

5 Experiments

The aim of this section is to present a novel methodology in neuroimaging based on analytical concentration inequalities, and to compare it experimentally with the accepted framework used by the neuroscience community based on the SPM analysis [25]. Thus, we will assess several experiments collected from well-known databases that include imaging data from patients with a variety of conditions/pathologies. Nevertheless, we will avoid related theoretical discussions about the comparison of both branches of statistics, referring the readers to the introduction section of this paper and the vast extant literature addressing these issues [1, 8, 9, 27].

All the datasets were preprocessed using standardised neuroimaging methods and protocols implemented in the SPM software (registration in MNI space, spatial normalization and segmentation to differentiate brain tissues, e.g. grey matter (GM)) [25]. For further comparison with the SAM proposed in this paper, significance maps were obtained with SPM using a standard two-sample t-test with an FWE-corrected p-value (and a null extent threshold, in voxels). We first conducted a first-level analysis to derive the GLM for the dataset under assessment (a design matrix for group comparisons) and then, in the second-level analysis, the contrast images were fed into a GLM for implementing the statistical test.

Dataset   Status  Number  Age            Gender (M/F)  MMSE
ADNI      NC      229     75.97 ± 5.0    119/110       29.00 ± 1.0
          MCI     401     74.85 ± 7.4    258/143       27.01 ± 1.8
          AD      188     75.36 ± 7.5    99/89         23.28 ± 2.0
PPMI      NC      194     53.02 ± 2.27   129/65
          PD      168     53.14 ± 2.37   103/65
VV        NC      108     69.05 ± 14.53  54/54
          PS      100     68.62 ± 13.41  53/47
Auditory  Res     41      -              1
          List    43      -              1

Table 1: Demographic details of the datasets (ADNI, PPMI, VV and fMRI). Age and MMSE are given as group mean ± standard deviation.

5.1 A structural MRI (sMRI) study: the ADNI database

Data used in preparation of this paper were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database ( The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a 60 million dollars, 5-year public-private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to characterise the progression of mild cognitive impairment (MCI) and early AD. The construction of sensitive and specific biomarkers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.

The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California San Francisco. ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 subjects but ADNI has been followed by ADNI-GO and ADNI-2. To date these three protocols have recruited over 1500 adults, ages 55–90, including cognitively normal older individuals, people with early or late MCI, and people with early AD. The follow up duration of each group is specified in the protocols for ADNI-1, ADNI-2 and ADNI-GO. Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in ADNI-2. For up-to-date information, see

The ADNI database contains 1.5 T and 3.0 T T1-weighted (T1w) MRI scans for AD, MCI, and cognitively normal controls (NC), acquired at multiple time points. Here we only included 1.5 T sMRI corresponding to the three different groups of subjects: NC, AD and MCI. The original database contained more than 1000 T1-weighted MRI images, comprising 229 NC, 401 MCI (312 stable MCI and 86 progressive MCI) and 188 AD. For the proposed study, however, only the first medical examination of each subject is considered, resulting in 818 GM images. Following the recommendation of the National Institute on Aging and the Alzheimer’s Association (NIA-AA) for the use of imaging biomarkers [33], we considered the group comparison for establishing a clear framework for comparing statistical paradigms (SPM and SAM). Thus, the MCI class is strictly based on clinical criteria, without including any other biomarker [34]. Demographic data of the subjects in the database are summarized in Table 1.

5.1.1 Classification Results

First, the proposed methodology tries to fit, in an optimal way, a linear SVM classifier in the feature space obtained after an FES approach (PLS). With the aim of applying a regression-type analysis to the dataset, we parcellated the brain volume into 116 standardized regions [28] and then obtained an optimistic estimation of the actual error, shown as a solid blue line in figure 3. This estimation is corrected by the use of upper bounds, yielding a novel set of accuracy values (proportions) and a confidence interval that depend on the selected theoretical method, e.g. Vapnik’s bound. The lower accuracies in this plot correspond to the worst cases as considered by the selected concentration inequalities.
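The region-wise correction described above can be sketched as a simple loop: each ROI's empirical accuracy is lowered by the deviation term to obtain a worst-case accuracy, and a one-sided test for a proportion is then applied to that worst-case value. The ROI labels, accuracies, sample size and deviation below are made-up toy values for illustration only, and the normal-approximation test and the chance level of 0.5 are assumptions rather than the paper's exact procedure.

```python
import math


def one_sided_p_value(accuracy, n, p0=0.5):
    """Upper-tail p-value of a z-test for a proportion (normal approximation)."""
    z = (accuracy - p0) / math.sqrt(p0 * (1.0 - p0) / n)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def select_rois(empirical_accuracy, n, deviation, alpha=0.05):
    """Return the ROIs whose worst-case accuracy is significantly above chance.

    empirical_accuracy : dict mapping ROI label -> resubstitution accuracy
    n                  : number of subjects used to fit the classifier
    deviation          : deviation term from a concentration inequality
    alpha              : significance level of the proportion test
    """
    selected = []
    for roi, acc in empirical_accuracy.items():
        worst_case = max(acc - deviation, 0.0)  # lower end of the interval
        if one_sided_p_value(worst_case, n) < alpha:
            selected.append(roi)
    return selected


if __name__ == "__main__":
    # Toy values, not results from the paper.
    toy_accuracy = {"ROI_1": 0.85, "ROI_2": 0.55, "ROI_3": 0.70}
    print(select_rois(toy_accuracy, n=400, deviation=0.10))
```

In this form the deviation acts as a fixed penalty per region, so a larger sample or a simpler classifier (a smaller deviation) directly increases the number of regions that can be declared significant.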

Figure 3: Accuracy values and upper bounds in standardized ROIs (only significant regions from to are shown) for three methods based on concentration inequalities in 4 and 5. We highlight several regions, relevant in the biological definition of AD, i.e. Hippocampus, Temporal, Amygdala and Parahippocampal regions, corresponding to peaks of these curves. Moreover, observe how the VC approach is more pessimistic than the one based on [5]. The confidence interval is drawn in the space between the solid blue line and the colored lines.

It is worth mentioning that the results shown in figure 3 are obtained with the first PLS component extracted by this regression analysis. This PLS score for each subject can be conceptualised as the representation of the subject in a multi-dimensional reference system, as described in [6] (see supplementary material).

5.1.2 Statistical Agnostic Maps

In the aforementioned figures we heuristically identified all those regions relevant for the characterisation of AD based on absolute values. Therefore, a definition of relevance in terms of hypothesis testing within confidence intervals is required. Following the method presented in the appendix, we provide an automatic (and statistically elegant) method for selecting ROIs in which a regionally specific activation is identified. As depicted in figure 4, the main result is the SAM obtained with the same p-value as the one of the confidence intervals using concentration inequalities.

Figure 4: Accuracy values in the worst case using the method in [5] and the set of probabilities (log(p-values)) within the confidence interval. The ROIs (p) are detected out of standardized regions using a significance test for a proportion (see appendix). Note that we show the probability of observation (in the right “y axis”) of the set of accuracy values under , i.e. random distribution.

Finally, a direct comparison with the SPM approach is shown in figures 5 and 6, in terms of the sample-size analysis and the relevant regions determined by both methods. Key to this comparison is the different mode of operation: SAM includes the spatial structure of the data at the first FES stage, whilst SPM does so at the final stage, by means of RFT. For this reason, SPM is more specific (voxel-wise) but more widespread compared with SAM. The number of identified ROIs forming the SPM increases as the sample size increases, unlike the proposed approach, which provides the same volumetric differences for . It is worth mentioning that, from the perspective of SLT, due to the small sample/dimension ratio in all the experiments proposed in this paper (and in the extant literature), we are dealing with the “small sample size problem”. In terms of classical statistics (SPM) this results in a challenging scenario that constrains the generalisation of the results from small datasets to new unseen samples.

Figure 5: Statistical comparison of brain volumes using SAM (left) and SPM (right) in the ADNI database. Green area corresponds to the whole dataset while the rest of colors (red, blue, yellow) are linked to data subsets, which are plotted in increasing (opacity of representations is preserved for clarity reasons). The ROIs selected for increase , satisfy except for where an additional region “Frontal Mid L” is selected. It is worth mentioning that all the ROIs extracted in different sample-size configurations were included in the confidence interval and with probability close to the significance level ().

Figure 6 shows that the main regions identified by SPM are included in the ROIs identified by the SAM-based approach. In addition, the number of “activated” voxels in SPM is associated with sample size, and these voxels are widespread across several anatomical regions. The number of voxels in ROIs obtained by SAM is almost independent of the sample size, except for the extreme case , and given the magnitude of the effect being sought in the HCvsAD comparison.
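An overlap analysis between two significance maps can be sketched with set operations on voxel indices. The two measures used here, the Dice coefficient and the fraction of one map contained in the other, are assumed for illustration; the text does not specify which overlap measure is used in the figures, and the function names and toy voxel sets are hypothetical.

```python
def dice_coefficient(map_a, map_b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two sets of voxel indices."""
    a, b = set(map_a), set(map_b)
    if not a and not b:
        return 1.0  # two empty maps are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))


def containment(inner, outer):
    """Fraction of the voxels of `inner` that also belong to `outer`."""
    inner, outer = set(inner), set(outer)
    if not inner:
        return 0.0
    return len(inner & outer) / len(inner)


if __name__ == "__main__":
    # Toy voxel indices, not data from the paper.
    spm_voxels = {10, 11, 12, 40}
    sam_voxels = {10, 11, 12, 13, 14, 15}
    print(dice_coefficient(spm_voxels, sam_voxels))
    print(containment(spm_voxels, sam_voxels))
```

The containment measure is asymmetric, which suits the question asked in the text, namely whether the voxels found by one method fall inside the regions found by the other.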

Figure 6: (a) SPM (red) over SAM (green) using the complete ADNI dataset (). (b) overlap analysis vs sample size. Observe how the SPM activation map linearly increases with and is located on more than standardized regions with the whole dataset (although part of these isolated activation voxels could be removed from the map using the extent threshold.)

5.2 A SPECT study: the PPMI database

Data used in the preparation of this article were obtained from the Parkinson’s Progression Markers Initiative (PPMI) database ( For up-to-date information on the study, visit PPMI is a public-private partnership funded by the Michael J. Fox Foundation for Parkinson’s Research and funding partners, including all partners listed on

Informed consent to clinical testing and neuroimaging was obtained prior to participation in the PPMI cohort, as approved by the institutional review boards (IRB) of all participating institutions. The PPMI obtained written informed consent from all study participants before they enrolled in the Initiative. None of the participants were taking any PD medication when they enrolled in the PPMI.

The inclusion criteria adopted in the PPMI cohort study are available in This diagnostic procedure also includes a confirmation step based on imaging biomarkers. A selection of DaTSCAN images from this database was used in the preparation of the article. Specifically, the baseline acquisition from subjects suffering from PD and from NC subjects was used. In addition, a similar SPECT image database from the “Virgen de la Victoria” (VV) Hospital (Malaga, Spain) was used to validate and generalise our findings to a dataset that contains a more complex pattern in the Parkinsonian Syndrome (PS) class, derived from clinical diagnostic criteria [36] (see table 1).

5.2.1 Effect size in classification

Following the methodology presented in the previous section, we will show i) the robustness of the proposed methodology in limited sample sizes regarding effect size and ii) a quantitative interpretation of effect size appealing to image classification in diagnostics. As noted in [35], studies with low statistical power require large effects to be observed by hypothesis testing with a pre-specified p-value threshold (typically 0.05). In DaTSCAN imaging of PD the true effect size is known to be considerably large in specific regions, e.g. the striatum. On the contrary, large effects observed in studies with reduced sizes do not assure that the true effect is large, or even that it exists at all. These studies are usually related to poorly mechanistically grounded hypotheses [8] or a bad specification of clinical analysis plans to form the set of observations, i.e. the dataset [35].

These issues can be observed in figure 7, where accuracy values are shown for increasing sample size. Effect sizes are large when they can discriminate between subjects that do and do not show an effect [8]. Large (but trivial under our hypothesis PDvsHC) effects observed for small samples diminish as the sample size increases in the VV dataset, unlike in the PPMI dataset. In the latter dataset, the proposed methodology provides almost the same accuracy values, which are, in general, shifted up with respect to the former database, for a wide range of sample sizes of randomly selected subjects. In any case, our method reports effect sizes (in terms of accuracy values) and confidence intervals alongside exact p-values, thus improving the strength of inference.

Figure 7: Effect sizes and significant ROI selection using the proposed methodology. Observe in top figure (VV) the black arrows highlighting the observed trivial effects, outside the specific region, in studies with low statistical power, i.e. . In addition, we also remark the reduction of “true effect” in the top figure compared to the same effect depicted in the bottom figure (PPMI), due to the presence of a more complex PD-plus pattern in the VV dataset.

5.2.2 Statistical Agnostic Maps

Compared with the subtle effects in the ADNI dataset, the magnitude of the effect in this study is relatively large. Thus, maps of significance derived from both approaches should be similar to each other in the specific regions. However, this image modality poses important challenges, such as low resolution, which exacerbates partial volume effects (PVE) [37], and the lack of structural information in the images needed to perform an accurate spatial normalization and co-registration [36]. These issues could reveal the limitations of voxel-wise approaches using sharp null-hypothesis tests, which may find small effects that are practically unimportant. All these issues are visible in figure 8, which shows that the SAMs are stable across several sample sizes and are included in the regions detected by the SPM approach. Moreover, we see how the number of voxels in the classical approach rises dramatically with increasing sample size, due to the fact that large studies are more likely to find a significant difference for a persistent trivial effect that is not really meaningfully different from the null [38].

Figure 8: (a) SPM (up) and SAM (bottom) using the PPMI dataset for . (b) overlap analysis vs sample size. Observe how the SPM activation map linearly increases with and is located on more than standardized regions with the whole dataset. Typical effects, such as PVE, in this kind of low-resolution image modality results in rejecting the null-hypothesis although FWE corrected p-values were considered in the inference test. On the contrary, SLT is less specific but more stable in the rejection of the null-hypothesis. In addition, the ROIs obtained are overlapped more than , using a wide range of small sample sizes.

5.3 An fMRI study: the SPM database

Data used in the preparation of this article were obtained from the SPM database related to epoch auditory fMRI activation data. This database is one of several databases included in the SPM site for personal education and evaluation purposes, and shows the ability of the SPM methodology for detecting auditory stimulation maps. Specifically, the experiment associated with the data was conducted by the FIL methods group and was designed for exploring equipment and techniques related to fMRI.

The database consists of BOLD/EPI images obtained from a single subject. They were acquired on a modified 2T Siemens MAGNETOM Vision system. The number of acquisitions was and each one consisted of contiguous slices ( mm voxels). Acquisition took s, with the scan to scan repeat time (TR) set arbitrarily to s. The acquisitions were made in blocks of , giving s blocks. The condition associated with each block alternated between rest and auditory stimulation, starting with rest. Auditory stimulation was bi-syllabic words presented binaurally, at a rate of per minute. As the SPM site recommends the first few scans are discarded to avoid T1 effects in the initial scans of an fMRI time series. Then, acquisitions were finally used after discarding the first complete cycle ( scans). The images were preprocessed (realigned, coregistered using a sMRI, normalized and smoothed) for collecting two different conditions, rest and listening. Then, a GLM specification followed by model estimation and a t-test-based inference (FWE p-value = 0.05) resulted in the activation maps for this auditory-evoked potential experiment.

5.3.1 Detecting auditory stimulation maps

In the previous sections we have seen the potential of the proposed approach for ROI detection in several binary classification paradigms, i.e. diagnosis, given the usefulness of machine learning. Here, the images collected from the aforementioned experiment are used to identify areas performing a specific information processing function, such as the primary auditory cortex.

The areas identified by the proposed approach mainly correspond to the temporal lobe, as shown in figure 9. A mosaic and a 3D representation of the activated cortical areas are shown in figure 9 (b), together with the activation pattern found by the SPM methodology. The comparison of both approaches is displayed in figure 9 (a). The upper panel shows the significance test for a proportion () that was applied to this auditory fMRI experiment. The SAM is mainly located on the regions where SPM found activated voxels. The middle panel shows the number of voxels in the ROIs (for different sample sizes) and its ratio w.r.t. the total number of voxels in each region. Finally, the bottom panel compares both approaches using the overlap-analysis measures described in the previous sections. To sum up, we found that: i) both approaches identify the same ROIs, ii) SPM required a sufficiently large sample size to provide significant ROIs, i.e. for no significant areas were found, and iii) both approaches converge to the same number of activated voxels with increasing sample size.
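The overlap measures themselves are defined in earlier sections that are not reproduced here. As a purely illustrative sketch, the snippet below assumes two common choices, the Dice coefficient and the fraction of one map contained in the other, and applies them to toy boolean masks. The function name `overlap_measures` and the toy masks are hypothetical and are not taken from the paper.

```python
import numpy as np


def overlap_measures(map_a, map_b):
    """Return (dice, containment) for two boolean activation masks.

    dice        = 2 |A and B| / (|A| + |B|)
    containment = |A and B| / |A|  (fraction of map A lying inside map B)
    Both definitions are assumptions made for illustration.
    """
    a = np.asarray(map_a, dtype=bool)
    b = np.asarray(map_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("masks must share the same shape")
    inter = np.logical_and(a, b).sum()
    total = a.sum() + b.sum()
    dice = 2.0 * inter / total if total > 0 else 0.0
    containment = inter / a.sum() if a.sum() > 0 else 0.0
    return float(dice), float(containment)


# Toy example: a small "SPM-like" cluster nested inside a larger "SAM-like" ROI.
spm_mask = np.zeros((4, 4, 4), dtype=bool)
spm_mask[1:3, 1:3, 1:3] = True   # 8 voxels
sam_mask = np.zeros((4, 4, 4), dtype=bool)
sam_mask[1:3, 1:3, :] = True     # 16 voxels
dice, containment = overlap_measures(spm_mask, sam_mask)
```

In this toy case the small cluster is fully contained in the larger ROI, which mirrors the qualitative observation that the SAM covers the regions where SPM finds activated voxels.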

Figure 9: (a) Significance tests for a population proportion in the fMRI experiment (top), number of activated voxels in ROIs using SPM (middle) and overlap analysis between SPM and SAM (bottom). (b) Activation maps in the auditory experiment for the whole dataset using the SAM (top) and the SPM (bottom).

6 Discussion

As shown in the previous section, the SAM is in general a very robust method, in terms of sample size, for finding relevant standardized areas, and a stable framework that contains the regions defined as relevant by the SPM when the sample size is sufficiently large. The experiments carried out in different experimental frameworks and datasets have demonstrated the ability of this multivariate approach to provide a novel model-free method for the assessment of significant changes across brain volumes.

The behaviour of the analysed methods depends on the size of the effect we are interested in. In the search for subtle effects, such as the ones found in AD or autistic patterns, and given that hypothesis tests cannot separate important but subtle effects from actually trivial ones [9], our SAM focuses on standardized ROIs to avoid the presence of false positives in the resulting maps. In this sense, SPM is more specific and can detect, within the regions found by SAM, which substructures are responsible for the discrimination between classes.

On the other hand, when large effects are expected, SAM is a suitable method for their detection since, with only a few samples, it provides results similar to the ones obtained with the complete databases, e.g. in the fMRI and DaTSCAN experiments. This is in line with the main idea derived from [8] that an effect found in a small dataset is more than likely to extrapolate to large samples. On the contrary, only in small datasets with small, but meaningful, effects that are missed, missing data, sampling bias, etc. did we find an absence of replication, i.e. across data collecting sites [9]. All these statistical features in the analysis of neuroimaging data are experimentally described in the datasets analysed throughout this paper.

Finally, we have seen the usefulness of the confidence intervals derived in SLT from concentration inequalities to achieve a confidence framework beyond sharp null-hypothesis testing. Key to this methodology in the field of SLT is that it is based on in-sample estimates (a similar procedure to exploratory analysis using hypothesis testing), unlike the out-of-sample estimates in CV procedures, which usually subdivide the (small) datasets to estimate the actual error. In this way, an analytical bound depending on sample size () and number of predictors () defines a “worst-case” operation point. Nevertheless, the experiments also applied a systematic hypothesis test to select the significant empirical errors that make up the highlighted regions in the SAM. Only in this case is a model assumed for the set of accuracies, but it has been shown to be in accordance with the nature of the one-dimensional data and sufficiently accurate for our purposes.

7 Conclusion

In this paper we present a data-driven approach, mainly devoted to classification problems with limited sample sizes, to derive statistical model-free (agnostic) mappings. Although the latter is not designed for testing competing hypotheses or comparing different models in neuroimaging, we derive the SAM assuming the existence of classes (), at voxel or multi-voxel level. The analysis of the “worst case” considers the upper bounds of the actual risk, under suitable theoretical conditions (see methods and appendix), and a selection of regions with a highly-corrected empirical risk, according to a test for significance on a population proportion. In conclusion, the SAM relieved the problem of instability in limited sample sizes when determining maps of relevance in several neurological conditions and tasks, such as AD, PD or auditory stimulation, and proved to be a very competitive method that complements the SPM framework, which is widely accepted by the neuroimaging community. Moreover, the latter usually employs several strategies for reducing the false positive rate in multiple comparisons, such as the FWE-corrected p-value maps in inferential null-hypothesis testing, and RFT to deal with the spatial structure of the maps. However, this approach is found to be very conservative in our experiments and in the extant literature. In this sense, the novel framework based on SLT provides activation maps similar to the ones obtained by the SPM, but defined on ROIs, under a rigorous development in scenarios with a small sample/dimension ratio and large, small and trivial effect sizes, as shown in the experimental part.


This work was partly supported by the MINECO/ FEDER under the RTI2018-098913-B100 project. We would like to thank the reviewers for their thoughtful comments and efforts towards improving our manuscript. J.M. Gorriz would like to thank Dr. Maxim Raginsky for his elegant abstract notation which is borrowed in this paper.

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health. The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Supplementary Material

A similar analysis was carried out with an increasing number of components, i.e. ; however, the upper bounds increase accordingly, as shown in figure 10. This highlights the benefits of working in a low-dimensional scenario, , although the use of new features in the analysis allows us to detect other regions, such as the “Temporal Pole Sup” and “Temporal Mid” regions.

Figure 10: The same analysis as the one proposed in figure 3 but with . Note that the ROIs showing highest accuracy values (bottom on the right) are similar to the ones selected in the latter experiment (bottom on the left) but with an increase in the confidence interval of the approximation. Examples are Hippocampal, Hippocampus, Amygdala and Temporal regions (see section 5.1).


Upper Bounding the worst case: a summary

As shown in equation 4, the consistency of the ERM algorithm mainly depends on the evaluation of the two-sided uniform deviation of the error probabilities in the worst case. An upper bound with probability at least for this quantity can be obtained by invoking a result in [19], since has the bounded differences property by :


This is known as the generalized Hoeffding inequality. Then (see footnote 5),


with probability .

Footnote 5: Given a random variable, if then .
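Since the displayed formulas of this step are missing from the text, the following is a sketch of the standard bounded-differences (McDiarmid) argument that the paragraph refers to, not a reconstruction of the authors' exact expressions. The notation is assumed for illustration: $g(Z_1,\dots,Z_n)$ stands for the uniform deviation between actual and empirical error over the function class, the loss takes values in $[0,1]$ (so changing one sample moves $g$ by at most $c_i = 1/n$), and $\delta$ is the significance level. The constants in the paper's own equations may differ.

```latex
% Assumed notation (not taken from the source): g = uniform deviation, loss in [0,1], level \delta.
\begin{align}
&\bigl|g(z_1,\dots,z_i,\dots,z_n) - g(z_1,\dots,z_i',\dots,z_n)\bigr| \le c_i \quad \forall i
\;\Longrightarrow\;
\mathbb{P}\bigl(g - \mathbb{E}[g] \ge t\bigr) \le \exp\!\left(-\frac{2t^2}{\sum_{i=1}^{n} c_i^2}\right), \\
&c_i = \frac{1}{n}
\;\Longrightarrow\;
\mathbb{P}\bigl(g - \mathbb{E}[g] \ge t\bigr) \le e^{-2nt^2}, \\
&e^{-2nt^2} = \delta
\;\Longrightarrow\;
g \le \mathbb{E}[g] + \sqrt{\frac{\ln(1/\delta)}{2n}}
\quad \text{with probability at least } 1-\delta .
\end{align}
```

The last line is the step the footnote describes: a tail bound on a random variable is inverted to obtain a bound that holds with high probability.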

Moreover, the expected value of the deviation can be absolutely bounded by the so-called Rademacher average [29] as follows. First, the uniform deviation is bounded by its expected value w.r.t the set of random error functions , using the “symmetrization” trick proposed in [3] and the convexity property of the norm function:


where is randomly selected from and . Taking expectations on both sides we get:


By using the triangle inequality and the definition of empirical error we finally obtain:


where the right-hand side of the inequality is distributed as the Rademacher average , where are independent random variables taking values in with equal probability. Finally, using Massart’s finite class lemma [18] we can bound the left part of the latter inequality as:


Consequently, introducing equations 9, 11, 12 and 13 in equation 4 we finally prove equation 5.
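As a numerical illustration of the last step, the sketch below compares a Monte Carlo estimate of the empirical Rademacher average of a finite set of vectors with the bound given by Massart's finite class lemma in its usual textbook form, $\frac{1}{n}\,\mathbb{E}_\sigma \max_{a \in A} \sum_i \sigma_i a_i \le r\sqrt{2\ln|A|}/n$ with $r = \max_{a}\|a\|_2$. The finite set used here (random 0/1 error vectors of hypothetical classifiers) and all function names are assumptions made for illustration. The version without absolute values is used, and the constants may differ from those in the paper's equations.

```python
import numpy as np


def massart_bound(vectors):
    """Massart's finite class bound r * sqrt(2 ln |A|) / n for the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    n_vectors, n = vectors.shape
    radius = np.linalg.norm(vectors, axis=1).max()
    return float(radius * np.sqrt(2.0 * np.log(n_vectors)) / n)


def rademacher_average(vectors, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_a (1/n) sum_i sigma_i a_i ]."""
    rng = np.random.default_rng(seed)
    vectors = np.asarray(vectors, dtype=float)
    n = vectors.shape[1]
    # Each row of `sigma` is one draw of n independent +/-1 signs.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    correlations = sigma @ vectors.T / n      # shape (n_draws, n_vectors)
    return float(correlations.max(axis=1).mean())


# Hypothetical finite class: 0/1 error indicators of 50 classifiers on 40 samples.
rng = np.random.default_rng(1)
errors = rng.integers(0, 2, size=(50, 40))
estimate = rademacher_average(errors)
bound = massart_bound(errors)
```

The estimate should fall below the bound, which grows only with the logarithm of the number of distinct error vectors. This is what makes the worst-case analysis tractable for classes of finite combinatorial complexity.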

The partial least squares algorithm

The PLS algorithm extracts the relevant patterns within ROIs across brains by a regression between the multivariate data matrix and the label vector . In short, we maximize:


where the score vectors are iteratively extracted and used to deflate the input matrix by subtracting their rank-one approximations based on [13]. The deflation process is accomplished by the computation of the vector of loadings as a coefficient of regressing on :


As shown in [5], the size of the input data is crucial to the assessment of the relationship between volume data and group membership within the evaluated ROIs, where some statistical properties of the involved processes, such as stationarity or ergodicity in the correlation, must be assumed. The PLS maps derived can be seen as a multivariate two-sample test weighted by the scores of each sample with unknown distribution, except for a normalization term that depends on the pooled standard deviation [5]; thus their statistical significance can be assessed in a manner similar to a t-test [12].
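The extraction and deflation steps described above can be sketched with the classical NIPALS iteration for a single response (PLS1). This is a minimal illustration under stated assumptions, not the authors' implementation: the names `pls1_scores`, `X`, `y`, `w`, `t` and `p` are hypothetical, the data are synthetic, and the weight vector is taken as the normalised covariance direction between the centred data matrix and the centred labels.

```python
import numpy as np


def pls1_scores(X, y, n_components=2):
    """Extract PLS1 score and loading vectors with rank-one deflation (NIPALS)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X - X.mean(axis=0)           # centre features
    y = y - y.mean()                 # centre labels
    scores, loadings = [], []
    for _ in range(n_components):
        w = X.T @ y                  # direction of maximal covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:             # nothing left to explain
            break
        w = w / norm
        t = X @ w                    # score vector
        p = X.T @ t / (t @ t)        # loadings: regression of X on t
        X = X - np.outer(t, p)       # deflate X by its rank-one approximation
        y = y - t * (y @ t) / (t @ t)
        scores.append(t)
        loadings.append(p)
    return np.array(scores).T, np.array(loadings).T


# Synthetic two-class example: 30 samples, 10 features, labels driven by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = (X[:, 0] + 0.1 * rng.normal(size=30) > 0).astype(float)
T, P = pls1_scores(X, y, n_components=2)
```

Because each deflation removes the part of the data matrix explained by the current score vector, successive score vectors are mutually orthogonal, which can be checked on `T`.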

Significance test for a proportion

Let denote the sampling distribution of empirical errors , for , then the null hypothesis test about the population proportion within the confidence interval has the form:


denotes a particular proportion value between 0 and 1, e.g. 0.5. The test statistic for a population proportion is , where . For large samples, i.e. for at least , if is true, the sampling distribution of the test statistic is the standard normal distribution.
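A minimal sketch of such a test is given below, assuming the usual large-sample form of the statistic, $z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n}$, and a one-sided alternative $p > p_0$. The one-sided choice, the function name `proportion_z_test` and the example values are assumptions made for illustration and are not taken from the paper.

```python
import math


def proportion_z_test(p_hat, p0, n):
    """Large-sample z-test for a population proportion (one-sided, H1: p > p0).

    Returns the z statistic and the upper-tail p-value under the
    standard normal approximation.
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability of the standard normal: 1 - Phi(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Example: an observed proportion of 0.7 over 100 samples against p0 = 0.5.
z, p_value = proportion_z_test(0.7, 0.5, 100)
```

With these example values the statistic is about 4 standard errors above the null proportion, so the null hypothesis would be rejected at any conventional significance level.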


  • [1] Friston, K. Sample size and the fallacies of classical inference. NeuroImage 81 (2013) 503–504
  • [2] Bzdok, D. Classical Statistics and Statistical Learning in Imaging Neuroscience. Front. Neurosci., 06 October 2017
  • [3] Vapnik, V. Estimation of Dependences Based on Empirical Data. Springer-Verlag. 1982. ISBN 0-387-90733-5
  • [4] Kohavi, R. A study of CV and bootstrap for accuracy estimation and model selection. Proc. of the 14th international joint conference on AI - Vol. 2 pp 1137-1143 (1995)
  • [5] Górriz, J.M. et al. On the computation of distribution-free performance bounds: Application to small sample sizes in neuroimaging. Pattern Recognition 93, 1-13
  • [6] Górriz, et al. A Machine Learning Approach to Reveal the NeuroPhenotypes of Autisms. International journal of neural systems, 1850058
  • [7] Varoquaux, G. Cross-validation failure: Small sample sizes lead to large error bars. NeuroImage 180 (2018) 68–77.
  • [8] Friston, K. Ten ironic rules for non-statistical reviewers. NeuroImage 61 (2012) 1300–1310
  • [9] Lindquist, M.A. et al. Ironing out the statistical wrinkles in "ten ironic rules". Neuroimage. 2013 Nov 1;81:499-502.
  • [10] Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation Volume 100, Issue 1, September 1992, Pages 78-150
  • [11] Cover, T.M. Geometrical and Statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers. EC-14: 326–334 (1965)
  • [12] McIntosh, A.R. et al. Spatial pattern analysis of functional brain images using partial least squares, Neuroimage 3(3 Pt 1) (1996) 143-157
  • [13] Rosipal, R. et al. Overview and Recent Advances in Partial Least Squares (Springer Berlin, Heidelberg, 2006), pp. 34-51
  • [14] Rondina, J.M. SCoRS - a Method Based on Stability for Feature Selection and Mapping in Neuroimaging. IEEE Trans Med Imaging. 2014 Jan; 33(1): 85–98.
  • [15] Martinez-Murcia F.J. Studying the Manifold Structure of Alzheimer’s Disease: A Deep Learning Approach Using Convolutional Autoencoders. IEEE J Biomed Health Inform. 2019 Jun 17.

  • [16] De Martino, F. et al. Combining multivariate voxel selection and support vector machines for mapping and classification of fMRI spatial patterns NeuroImage, 43 (1) (2008), pp. 44-58
  • [17] Vapnik V. et al. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16:264–280, 1971.

  • [18] Massart, P. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, 2000.
  • [19] McDiarmid, C. On the method of bounded differences, Surveys in Combinatorics 141 (1989), 148–188
  • [20] Sauer, N. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147, 1972.
  • [21] Shelah, S. A combinatorial problem: stability and order for models and theories in infinity languages. Pacific Journal of Mathematics, 41:247–261, 1972.
  • [22] Gómez-Verdejo, V. Sign-Consistency Based Variable Importance for Machine Learning in Brain Imaging Neuroinformatics October 2019, Volume 17, Issue 4, pp 593–609
  • [23] Khundrakpam, B.S. et al. (2015). Prediction of brain maturity based on cortical thickness at different spatial resolutions. NeuroImage, 111, 350–359.
  • [24] Mourão-Miranda, J. et al. Classifying brain states and determining the discriminating activation patterns: Support vector machine on functional MRI data. NeuroImage, 28, 980–995. (2005).
  • [25] Friston K. et al. Statistical Parametric Maps in functional imaging: A general linear approach Hum. Brain Mapp. 2:189-210 (1995)
  • [26] Friston, K.J., Harrison, L., Penny, W., 2003. Dynamic causal modelling. Neuroimage 19, 1273–1302.
  • [27] Reiss, P.T. Cross-validation and hypothesis testing in neuroimaging: an irenic comment on the exchange between Friston and Lindquist et al. Neuroimage. 2015 August 1; 116: 248–254
  • [28] Tzourio-Mazoyer, N. et al. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single subject brain. Neuroimage 2002; 15: 273-289.
  • [29] Shalev-Shwartz, S. et al. Understanding Machine Learning – from Theory to Algorithms. Cambridge University Press. ISBN 9781107057135. 2014
  • [30] Antós, A. et al. Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research 3 (2002) 73–98
  • [31] Vidyasagar, M. Learning and Generalisation With Applications to Neural Networks. Springer. ISBN 978-1-84996-867-6 (2003)

  • [32] Frackowiak et al. Human Brain Function (Second Edition). Chap. 44. Introduction to Random Field Theory. ISBN 978-0-12-264841-0 Academic Press. 867-879, 2004.
  • [33] Jack, Jr. C.C. NIA-AA Research Framework: Toward a biological definition of Alzheimer’s disease. Alzheimers Dement. 2018 Apr; 14(4): 535–562.
  • [34] McKhann, G.M. et al. The diagnosis of dementia due to Alzheimer’s disease: recommendations from the National Institute on Aging and the Alzheimer’s Association Workgroup. Alzheimers Dement. 2011;7:263–9.
  • [35] Button, K.S. et al. Confidence and precision increase with high statistical power Nature Reviews Neuroscience volume 14, page 585(2013).
  • [36] Illan, I.A. et al. Automatic assistance to Parkinson’s disease diagnosis in DaTSCAN SPECT imaging. Medical Physics. 2012
  • [37] Zaidi, H. et al. Quantitative Analysis in Nuclear Medicine Imaging. Springer Science+Business Media, Inc. ISBN-10: 0-387-23854-9
  • [38] Ioannidis, J.P.A. Why most published research findings are false. PLoS Med. 2 (8) (e124), 696–701. 2005.