1 Introduction
In recent decades, neuroscience has transitioned from qualitative case reports to quantitative, longitudinal and multivariate population studies in the quest to define the abnormal patterns of disease pathogenesis. Much of the insight gained so far has come from classical statistics, e.g. statistical inference based on null-hypothesis ($H_0$) testing or regression-type analyses. Thus, the brain mapping community has predominantly used null-hypothesis testing for exploratory analyses in whole-brain searches [1]. In this context, classical inference emphasises in-sample, image-based statistical estimates from previously assumed data models to determine the existence of relevant effects (large or subtle) in binary group comparisons. The critical p-value (significant vs. not significant) is often complemented by effect-size measures of the magnitude of the phenomenon [2].
On the other hand, out-of-sample generalization approaches in machine learning (ML), such as cross-validation (CV), try to estimate on unseen data the actual error of the classifier in the (binary) classification problem. Although the methods and goals of predictive CV inference are distinct from the classical extrapolation procedure [9], they are actually exploited within statistical frameworks aimed at providing statistical significance [27]. Examples are bootstrapping, binomial or permutation (“resampling”) tests, which have proven competitive outside the comfort zone of classical statistics, filling otherwise-unmet inferential needs.
In the pattern classification problem we usually assume the existence of classes ($H_1$) that are differentiated by the classifiers in terms of accuracy (Acc) on a presumably independent dataset. Empirical confidence intervals or plausible Acc values derived from CV are consequently used to evaluate the system performance and to conclude (improperly in a statistical sense) . Moreover, with limited sample sizes the most popular K-fold CV method [4] has been shown to perform suboptimally under unstable conditions [6, 5, 7], so the predictive power of the fitted classifiers can be questionable.
Beyond the latter techniques, CV in ML is well framed within a data-driven statistical learning theory (SLT), which is mainly devoted to problems of estimating dependencies with limited amounts of data [3]. Although CV-ML approaches were not originally designed to test hypotheses in brain mapping [1], they are theoretically grounded to provide maps of confidence intervals (protected inference). As shown in the present study, this can be achieved by assessing the upper bounds of the actual error in a binary classification problem, and by using simple significance tests for a population proportion. Thus, assessing with high probability the quality of the fitting function (and its generalization ability) in terms of in/out-of-sample predictions can be conceptualised, under a hypothesis testing scenario, as the inverse problem of “carefully rejecting ”, that is, the problem of rejecting $H_1$ and thus accepting $H_0$ (there is no effect or it is not significant).
The paper is organized as follows. In section 2 we derive analytical upper bounds using the agnostic or model-free formulation of the learning problem. In connection with the drawbacks pointed out in [8] regarding the CV-based inference challenge that “they are not functions of the complete data” set, it is worth mentioning that the previous model considers all the available data. Then, the learning algorithm is fitted in the best possible way to the empirical data, as shown in section 3, to obtain the empirical error. This empirical error, together with a deviation quantity that is analytically derived beforehand, yields an upper bound of the real (actual) error of the model. Sample size and the empirical settings regarding the complexity of the selected classifiers are key to the proposed neuroimaging methodology, as they condition the degrees of freedom and the number of separating functions used to define the aforementioned deviation quantity. In a nutshell, low-dimensional scenarios and linear classifiers are required, since they readily yield a strong link between both errors. Under these conditions, we can determine which regions across volumes are within these confidence intervals with a probability of , which corresponds to statistical significance in a group comparison (see section 5). To this end, we need to estimate a probability threshold for the obtained accuracy values of each region to reject the null hypothesis, i.e. , of the underlying population proportion, so that a regionally specific activation can be stated under the Statistical Agnostic Mapping (SAM) framework.
2 Methods: bounding the actual error with probability 1
2.1 Background on Agnostic learning
Let us assume the agnostic model in the problem of binary pattern classification, as proposed in [10]. Given an independent and identically distributed sample of dimensional predictors and classes pairs, , where each of them is drawn from the unknown , the goal is to construct a good approximation to an unknown target function , using a class of functions : , and evaluating their goodness by a predefined expected loss:
(1) 
where the loss function
and is a random element of the hypothesis space .
To simplify notation, let us consider the function composition, i.e. , to define the class of functions : with expected loss (probability of error) . Thus, the empirical error can be determined by:
(2) 
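As a reference point, the textbook form of these two quantities can be sketched as follows. The symbols used here ($g$ for a classifier in the class, $\ell$ for the 0-1 loss, $(X_i, Y_i)$ for the sample pairs, $P$ for the unknown distribution, $L$ and $\hat{L}_n$ for the expected and empirical errors) are generic assumptions for this illustration and may not match the notation of equations 1 and 2.

```latex
% Expected loss (probability of error under the 0-1 loss), cf. equation (1)
L(g) = \mathbb{E}_{(X,Y)\sim P}\bigl[\ell(g(X), Y)\bigr],
\qquad \ell(g(X), Y) = \mathbb{1}\{g(X) \neq Y\}

% Empirical error on a sample of size n, cf. equation (2)
\hat{L}_n(g) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(g(X_i), Y_i\bigr)
```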
A learning algorithm particularly selects given the sample , i.e. via the empirical risk minimization (ERM) [3], and hopefully provides:

a real error (on the ideal infinite population) close to the one obtained on the sample, that is,

and close to the minimum risk ,
2.2 Upper Bound based on concentration inequalities
Unfortunately, the aforementioned statement of is not generally true. More precisely:
(3) 
with an arbitrarily . Under the worst case scenario the uniform deviation can be defined as , for any . Using the ERM algorithm we readily get the following concentration inequalities:
(4) 
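For orientation, the standard argument behind inequalities of this kind can be sketched as follows. The notation is assumed for this illustration ($L$ and $\hat{L}_n$ for the actual and empirical errors, $\hat{g}_n$ for the ERM solution, $g^{*}$ for a minimiser of $L$ over the class $\mathcal{G}$, $\Delta_n$ for the uniform deviation) and is not necessarily that of equation 4.

```latex
% Uniform deviation over the class
\Delta_n = \sup_{g \in \mathcal{G}} \bigl| \hat{L}_n(g) - L(g) \bigr|

% Actual vs. empirical error of the ERM solution
\bigl| L(\hat{g}_n) - \hat{L}_n(\hat{g}_n) \bigr| \le \Delta_n

% Excess risk: add and subtract empirical errors, and use
% \hat{L}_n(\hat{g}_n) \le \hat{L}_n(g^{*}), which holds by definition of ERM
L(\hat{g}_n) - L(g^{*})
  = \bigl[L(\hat{g}_n) - \hat{L}_n(\hat{g}_n)\bigr]
  + \bigl[\hat{L}_n(\hat{g}_n) - \hat{L}_n(g^{*})\bigr]
  + \bigl[\hat{L}_n(g^{*}) - L(g^{*})\bigr]
  \le 2\,\Delta_n
```

Both quantities are therefore controlled once the uniform deviation is bounded with high probability.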
Bounding can be (not readily) achieved by using several theorems and lemmas of the SLT [17, 21, 20, 18, 19] to finally get (see Appendix; a similar bound can be achieved for the second row in equation 4):
(5) 
with probability , where is the cardinality of or the number of separating functions given the sample realization (a trivial bound for this quantity can be found: ).
3 Fitting the selected function to current data
3.1 Feature extraction and selection
In order to minimize the left part of equation 5 we could minimize one (or both) of the summands on the right. However, they depend on each other in terms of the classifier complexity [3]. One solution could be, as explained in the next section, to prevent the increase of given the sample , by selecting a low classifier order [11], i.e. a linear decision function. However, this comes at the cost of a possibly non-negligible empirical error.
In an attempt to reduce the ratio (curse of dimensionality), the machine learning community usually employs feature extraction and selection (FES) methods to enhance the classification performance while preserving the system complexity. This can be achieved by removing irrelevant features from the sample, which can also facilitate interpretation (FS), and by identifying multivariate sets of meaningful features (FE) that best discriminate the classes [14]. The final aim is to provide an almost linearly separable classification problem in the feature space.
Several methods have been employed in neuroimaging to reduce the dimensionality () of the problem (in relation to ), based on statistical tests for FS [16], matrix decompositions [12] or even deep learning architectures for FES [15]. To validate the methodology proposed in this paper, we perform FE using a popular method in neuroscience, the Partial Least Squares (PLS) algorithm [12]. PLS methods have demonstrated their utility in describing the relation between brain activity and experimental design or behaviour measures within a multivariate framework (see [12, 13, 6] and the appendix for mathematical details and the interpretation of the PLS maps as a classical t-test).
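The FE step can be illustrated with a minimal sketch, assuming a single response (PLS1) and only the first component: the weight vector is the normalised covariance between the centred predictors and the centred labels, and each subject's score is the projection of its centred features onto that vector. The function name `first_pls_scores` and the toy data are hypothetical; the full algorithm used in the experiments is the one referenced in [12].

```python
def first_pls_scores(X, y):
    """Return (weights, scores) of the first PLS1 component.

    X: list of n rows, each a list of d features (e.g. voxels of one region).
    y: list of n numeric class labels (e.g. 0 for controls, 1 for patients).
    """
    n, d = len(X), len(X[0])
    # Centre the predictors column-wise and centre the response
    col_means = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    # First weight vector: proportional to the covariance X^T y
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(d)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0.0:
        raise ValueError("predictors have zero covariance with the labels")
    w = [v / norm for v in w]
    # Scores: projection of each centred sample onto the weight vector
    scores = [sum(Xc[i][j] * w[j] for j in range(d)) for i in range(n)]
    return w, scores


# Toy example (hypothetical data): the first feature separates the classes
X = [[1.0, 0.0], [2.0, 0.1], [4.0, 0.0], [5.0, 0.1]]
y = [0, 0, 1, 1]
weights, scores = first_pls_scores(X, y)
```

With a single score per subject the classification problem becomes one-dimensional, which is the low-dimensional regime in which a linear decision function is expected to give a tight bound.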
3.2 Linear Decision functions: a small upper bound
Regularized linear decision functions have recently been applied to neuroimaging for detecting activation patterns, and compared with parametric hypothesis testing, such as univariate t-tests [24, 22, 23]. In general, these studies have limited their analyses to in-sample estimates based on resampling, failing to demonstrate their out-of-sample performance in terms of confidence intervals.
As stated before, the minimization of the left part of inequality 5 can be achieved by decreasing the number of separating functions given the sample (). This quantity is indeed decreased by selecting a classifier based on a linear decision function in a binary classification problem, following the results in the extant literature [11, 3]. After transforming and selecting the feature set by FES methods, the concentration inequalities 5 obtained with linear classifiers result in a strong association with a given confidence level whenever the extracted features are significant across regions of interest (ROIs) and group comparisons.
Beyond the existing caveats and solutions when using regularization methods in neuroimaging for FS, we adopt the linear support vector machine (SVM) classification algorithm, which allows us to tentatively evaluate the worst case of , that is, and to set the following upper bound [3]:
(6) 
with probability , where is the VC dimension, e.g. for linear classifiers. In the same manner, several upper bounds could be tested based on other concepts and paradigms, such as those based on data distributions, the set’s shape, Rademacher averages, pseudo-dimension, fat-shattering dimension, etc. [30, 31]. We preferred to use, due to its simplicity, the upper bound recently proposed in [5]. The latter is strongly grounded on the geometrical assumption of samples distributed in general position and the function-counting theorem of homogeneously linearly separable dichotomies [11]:
(7) 
With the help of expressions such as inequalities 6 and 7, we can even evaluate the deviation of the empirical error from the actual error at voxel level, although it is preferable, for the aforementioned reasons, to do it region-wise using a fitted linear SVM classifier in the multivariate feature space (see figure 1). In this sense, the motivation for a multivariate framework in assessing the areas of relevance is analogous to other techniques proposed for addressing the multiple comparison problem in functional imaging, e.g. Random Field Theory for neuroimaging analysis [32] or the classical p-value corrections for multiple comparisons after null-hypothesis testing. In general, only those voxels (or ROIs) showing a tight association, i.e. high performance in terms of accuracy, should be considered as relevant maps or patterns in that particular condition with probability .
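To give a feel for the magnitudes involved, the following Python sketch evaluates two textbook quantities related to this section: the classical VC confidence term, sqrt((h (ln(2n/h) + 1) - ln(eta/4)) / n), and Cover's count of homogeneously linearly separable dichotomies of n points in general position in d dimensions. These are assumed, generic forms and may differ in constants or structure from inequalities 6 and 7; the function names, the choice h = d + 1 for linear classifiers with a bias term and the numerical values are hypothetical.

```python
import math


def vc_deviation(n, h, eta):
    """Textbook VC confidence term for sample size n, VC dimension h
    and confidence level 1 - eta (assumed form, see lead-in)."""
    if n < 1 or h < 1 or not 0.0 < eta < 1.0:
        raise ValueError("need n >= 1, h >= 1 and 0 < eta < 1")
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


def cover_dichotomies(n, d):
    """Cover's function-counting theorem: number of dichotomies of n points
    in general position in R^d separable by a hyperplane through the origin."""
    return 2 * sum(math.comb(n - 1, k) for k in range(d))


def worst_case_accuracy(empirical_accuracy, n, h, eta):
    """Lower end of the interval: empirical accuracy minus the deviation."""
    return max(0.0, empirical_accuracy - vc_deviation(n, h, eta))


# Hypothetical region: one PLS score (d = 1, so h = d + 1 = 2), 400 subjects
deviation = vc_deviation(n=400, h=2, eta=0.05)
protected = worst_case_accuracy(0.90, n=400, h=2, eta=0.05)
```

The deviation shrinks as n grows and grows with h, which is why low-dimensional features and linear decision functions are favoured above.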
4 Statistical Agnostic Mapping
The significant areas derived from SAM correspond by construction to those regions whose empirical error, under the worst-case scenario, is associated with an actual accuracy greater than the random-guess accuracy . Confidence intervals derived from the concentration inequalities allow us to bound the worst case at the “upper” border of the confidence interval, providing a protective inference. Thus, within this confidence interval, a significance test can be used to make an inference about whether the accuracy value for a specific region differs from the null hypothesis of the random proportion (see Appendix). Therefore, the statistical significance of any region is assessed, in combination with confidence intervals, by evaluating the p-value of any ROI at a given significance level, i.e. . A total of 116 standardized regions [28] were analysed within a protective interval, avoiding the limitations of significance tests in distinguishing statistical from practical importance (see Appendix).
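As an illustration of the test described above, the following Python sketch applies a one-sided z-test for a population proportion to worst-case accuracies. The function names, the normal approximation, the chance level of 0.5 and the significance level of 0.05 are assumptions made for this example; the exact procedure is the one detailed in the Appendix.

```python
import math


def proportion_test(accuracy, n, p0=0.5):
    """One-sided z-test of H0: p = p0 against H1: p > p0.

    accuracy: observed (or worst-case) proportion of correct decisions.
    n: number of samples on which the proportion was computed.
    Returns the z statistic and the one-sided p-value.
    """
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (accuracy - p0) / standard_error
    # Upper-tail probability of a standard normal variable
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


def significant_regions(worst_case_acc, n, alpha=0.05, p0=0.5):
    """Keep the regions whose worst-case accuracy rejects H0 at level alpha."""
    return [roi for roi, acc in worst_case_acc.items()
            if proportion_test(acc, n, p0)[1] < alpha]


# Hypothetical worst-case accuracies for two regions, n = 100 subjects
z, p = proportion_test(0.60, 100)
selected = significant_regions({"roi_a": 0.60, "roi_b": 0.52}, n=100)
```

Applying the test to the lower end of the interval rather than to the empirical accuracy is what makes the inference protective: a region is retained only if it still beats chance in the worst case.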
In the following sections we show how the combination of the aforementioned protective intervals and significance tests may be used to derive a SAM in different group comparisons, such as Alzheimer’s disease (AD) vs healthy controls (HC) and Parkinson’s disease (PD) vs HC, and on a well-known example of a single-subject activation map in fMRI, and how they relate to the classical approach based on null-hypothesis testing, i.e. a two-sample t-test with corrected p-value. Unlike previous approaches, the proposed model-free method is less specific but more robust against sample size, artifacts and nuisance effects. See the complete diagram of the proposed method in figure 2.
5 Experiments
The aim of this section is to present a novel methodology in neuroimaging based on analytical concentration inequalities, and to compare it experimentally with the accepted framework used by the neuroscience community, based on the SPM analysis [25]. Thus, we will assess several experiments collected from well-known databases that include imaging data from patients with a variety of conditions/pathologies. Nevertheless, we will avoid somewhat related theoretical discussions about the comparison of both branches of statistics, referring the reader to the introduction of this paper and the vast extant literature addressing these issues [1, 8, 9, 27].
All the datasets were preprocessed using standardised neuroimaging methods and protocols implemented in the SPM software (registration in MNI space, spatial normalization and segmentation to differentiate brain tissues, e.g. grey matter (GM)) [25]. For further comparison with the SAM proposed in this paper, significance maps were obtained with SPM using a standard two-sample t-test with FWE p-value (and null extent threshold voxels). We first conducted a first-level analysis to derive the GLM for the dataset under assessment (a design matrix for group comparisons) and then, in the second-level analysis, the contrast images were fed into a GLM to implement the statistical test.
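Table 1: Demographic details of the datasets (ADNI, PPMI, VV and fMRI), with group means and their standard deviations.

| Modality | Dataset | Status | Number | Age | Gender (M/F) | MMSE |
|---|---|---|---|---|---|---|
| MRI | ADNI | NC | 229 | 75.97 ± 5.0 | 119/110 | 29.00 ± 1.0 |
| MRI | ADNI | MCI | 401 | 74.85 ± 7.4 | 258/143 | 27.01 ± 1.8 |
| MRI | ADNI | AD | 188 | 75.36 ± 7.5 | 99/89 | 23.28 ± 2.0 |
| SPECT | PPMI | NC | 194 | 53.02 ± 2.27 | 129/65 | – |
| SPECT | PPMI | PD | 168 | 53.14 ± 2.37 | 103/65 | – |
| SPECT | VV | NC | 108 | 69.05 ± 14.53 | 54/54 | – |
| SPECT | VV | PS | 100 | 68.62 ± 13.41 | 53/47 | – |
| fMRI | Auditory | Res | 41 | | 1 | – |
| fMRI | Auditory | List | 43 | | 1 | – |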
5.1 A structural MRI (sMRI) study: the ADNI database
Data used in preparation of this paper were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a $60 million, 5-year public-private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to characterise the progression of mild cognitive impairment (MCI) and early AD. The construction of sensitive and specific biomarkers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.
The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California San Francisco. ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 subjects but ADNI has been followed by ADNI-GO and ADNI-2. To date these three protocols have recruited over 1500 adults, ages 55–90, including cognitively normal older individuals, people with early or late MCI, and people with early AD. The follow-up duration of each group is specified in the protocols for ADNI-1, ADNI-2 and ADNI-GO. Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in ADNI-2. For up-to-date information, see www.adni-info.org.
The ADNI database contains 1.5 T and 3.0 T T1-weighted MRI scans for AD, MCI, and cognitively normal controls (NC), which are acquired at multiple time points. Here we only included 1.5 T sMRI corresponding to the three different groups of subjects: NC, AD and MCI. The original database contained more than 1000 T1-weighted MRI images, comprising 229 NC, 401 MCI (312 stable MCI and 86 progressive MCI) and 188 AD. For the proposed study, however, only the first medical examination of each subject is considered, resulting in 818 GM images. Following the recommendation of the National Institute on Aging and the Alzheimer’s Association (NIA-AA) for the use of imaging biomarkers [33], we considered the group comparison for establishing a clear framework for comparing statistical paradigms (SPM and SAM). Thus, the MCI class is strictly based on clinical criteria, without including any other biomarker [34]. Demographic data of the subjects in the database are summarized in Table 1.
5.1.1 Classification Results
First, the proposed methodology tries to fit, in an optimal way, a linear SVM classifier in the feature space obtained after a FES approach (PLS). With the aim of applying a regression-type analysis to the dataset, we parcellated the brain volume into 116 standardized regions [28] and then obtained an optimistic estimation of the actual error, shown as a solid blue line in figure 3. This estimation is corrected by the use of upper bounds, yielding a novel set of accuracy values (proportions) and a confidence interval that depend on the selected theoretical method, e.g. Vapnik’s bound. The lower accuracies in this plot correspond to the worst cases as considered by the selected concentration inequalities.
It is worth mentioning that the results shown in figure 3 are obtained with the first PLS component extracted by this regression analysis ( ). This PLS score for each subject can be conceptualised as the representation of the subject in a multidimensional reference system, as described in [6] (see supplementary material).
5.1.2 Statistical Agnostic Maps
In the aforementioned figures we heuristically identified all those regions relevant for the characterisation of AD based on absolute values. Therefore, a definition of relevance in terms of hypothesis testing within confidence intervals is required. Following the method presented in the appendix, we provide an automatic (and statistically elegant) method for selecting ROIs in which a regionally specific activation is identified. As depicted in figure 4, the main result is the SAM obtained with the same p-value as that of the confidence intervals using concentration inequalities.
Finally, a direct comparison with the SPM approach is shown in figures 5 and 6, in terms of the sample-size analysis and the relevant regions determined by both methods. Key to this comparison is the different order of operations: SAM includes the spatial structure of the data at the first FES stage, whilst SPM does so at the final stage, by means of RFT. For this reason, SPM is more specific (voxelwise) but more widespread compared with SAM. The number of identified ROIs making up the SPM map increases as the sample size increases, unlike the proposed approach, which provides the same volumetric differences for . It is worth mentioning that, from the perspective of SLT, due to the small ratio in all the experiments proposed in this paper (and in the extant literature), we are dealing with the “small sample size problem”. In terms of classical statistics (SPM), this results in a challenging scenario that constrains the generalisation of the results from small datasets to new unseen samples.
Figure 6 shows that the main regions identified by SPM are included in the ROIs identified by the SAM-based approach. In addition, the number of “activated” voxels in SPM is associated with sample size, and these voxels are widespread across several anatomical regions. The number of voxels in ROIs obtained by SAM is almost independent of the sample size, except for the extreme case , and given the magnitude of the effect being sought in the HC-vs-AD comparison.

5.2 A SPECT study: the PPMI database
Data used in the preparation of this article were obtained from the Parkinson’s Progression Markers Initiative (PPMI) database (www.ppmi-info.org/data). For up-to-date information on the study, visit www.ppmi-info.org. PPMI is a public-private partnership funded by the Michael J. Fox Foundation for Parkinson’s Research and funding partners, including all partners listed on www.ppmi-info.org/fundingpartners.
Informed consent to clinical testing and neuroimaging was obtained prior to participation in the PPMI cohort, as approved by the institutional review boards (IRB) of all participating institutions. The PPMI obtained written informed consent from all study participants before they enrolled in the Initiative. None of the participants were taking any PD medication when they enrolled in the PPMI.
The inclusion criteria adopted in the PPMI cohort study are available at http://www.ppmi-info.org/wp-content/uploads/2014/06/PPMIAmendment8Protocol.pdf. This diagnostic procedure also includes a confirmation step based on imaging biomarkers. A selection of DaTSCAN images from this database was used in the preparation of the article. Specifically, the baseline acquisitions from subjects suffering from PD and from NC were used. In addition, a similar SPECT image database from the “Virgen de la Victoria” (VV) Hospital (Malaga, Spain) was used to validate and generalise our findings to a dataset that contains a more complex pattern in the Parkinsonian Syndrome (PS) class, derived from clinical diagnostic criteria [36] (see table 1).
5.2.1 Effect size in classification
Following the methodology presented in the previous section, we will show i) the robustness of the proposed methodology in limited sample sizes regarding effect size and ii) a quantitative interpretation of effect size appealing to image classification in diagnostics. As already commented in [35], studies with low statistical power require large effects to be observed by hypothesis testing with a prespecified p-value threshold (typically 0.05). In DaTSCAN imaging of PD the true effect size is known to be considerably large in specific regions, e.g. the striatum. On the contrary, large effects observed in studies with reduced sample sizes do not ensure that the true effect is large, or even that it exists at all. These studies are usually related to poorly mechanistically grounded hypotheses [8] or a bad specification of the clinical analysis plans used to assemble the set of observations, i.e. the dataset [35].
These issues can be observed in figure 7, where accuracy values are shown for increasing . Effect sizes are large when they can discriminate between subjects that do and do not show an effect [8]. Large (but trivial under our hypothesis PD-vs-HC) effects observed for samples reduce as the sample size increases in the VV dataset, unlike in the PPMI dataset. In the latter dataset, the proposed methodology provides almost the same accuracy values, which are, in general, shifted up with respect to the former database, for a wide range of sample sizes of randomly selected subjects. In any case, our method reports effect sizes (in terms of accuracy values) and confidence intervals alongside exact p-values, thus improving the strength of inference.
5.2.2 Statistical Agnostic Maps
Compared with the subtle effects in the ADNI dataset, the magnitude of the effect in this study is relatively large. Thus, the maps of significance derived from both approaches should be similar to each other in the specific regions. However, this image modality has important associated challenges, such as a low resolution that accentuates partial volume effects (PVE) [37] and a lack of structural information in the images with which to perform an accurate spatial normalization and co-registration [36]. These issues could reveal the limitations of voxelwise approaches using sharp null-hypothesis tests, which may find small effects that are practically unimportant. All these questions are illustrated in figure 8, where we show how the SAMs are stable across several sample sizes and included in the regions detected by the SPM approach. Moreover, we see how the number of voxels in the classical approach rises dramatically with increasing , due to the fact that large studies are more likely to find a significant difference for a persistent trivial effect that is not really meaningfully different from the null [38].


5.3 An fMRI study: the SPM database
Data used in the preparation of this article were obtained from the SPM database related to an epoch auditory fMRI activation dataset (www.fil.ion.ucl.ac.uk/spm/data/auditory/). This database is one of several included in the SPM site (www.fil.ion.ucl.ac.uk/spm/) for personal education and evaluation purposes, and shows the ability of the SPM methodology to detect auditory stimulation maps. Specifically, the experiment associated with the data was conducted by the FIL methods group and was designed for exploring equipment and techniques related to fMRI.
The database consists of BOLD/EPI images obtained from a single subject. They were acquired on a modified 2T Siemens MAGNETOM Vision system. The number of acquisitions was and each one consisted of contiguous slices ( mm voxels). Acquisition took s, with the scan-to-scan repeat time (TR) set arbitrarily to s. The acquisitions were made in blocks of , giving s blocks. The condition associated with each block alternated between rest and auditory stimulation, starting with rest. Auditory stimulation was bi-syllabic words presented binaurally, at a rate of per minute. As the SPM site recommends, the first few scans are discarded to avoid T1 effects in the initial scans of an fMRI time series. Then, acquisitions were finally used after discarding the first complete cycle ( scans). The images were preprocessed (realigned, co-registered using an sMRI, normalized and smoothed) for collecting two different conditions, rest and listening. Then, a GLM specification followed by model estimation and a t-test-based inference (FWE p-value = 0.05) resulted in the activation maps for this auditory-evoked potential experiment.
5.3.1 Detecting auditory stimulation maps
In the previous sections we have seen the potential of the proposed approach for ROI detection in several binary classification paradigms, i.e. diagnosis, given the usefulness of machine learning. Images collected from the aforementioned experiment are used to identify areas performing a specific information processing function, such as the primary auditory cortex.
The areas identified by the proposed approach are mainly those corresponding to the temporal lobe, as shown in figure 9. A mosaic and the 3D representation of the activated cortical areas are shown in figure 9 (b), together with the activation pattern found by the SPM methodology. The comparison of both approaches is displayed in figure 9 (a). In the upper panel we see the significance test for a proportion ( ) that was applied to this auditory fMRI experiment. The SAM is mainly located in regions where we found the activated voxels in SPM. In the middle panel we represent the number of voxels in ROIs (for different sample sizes) and the ratio with respect to the total number of voxels in that region. Finally, in the bottom panel we compare both approaches using overlap-analysis measures, as described in the previous sections. To sum up, we found: i) the same ROIs in both approaches, ii) SPM required a sufficiently large sample size to provide significant ROIs, i.e. for no significant areas were found, and iii) both approaches converge with increasing sample size to the same number of activated voxels.
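An overlap comparison of this kind can be sketched in Python as follows. The Dice coefficient and the containment ratio are assumed here as representative overlap measures, since the specific measures are not named in this section; the function names and the voxel indices are hypothetical.

```python
def dice_overlap(map_a, map_b):
    """Dice coefficient between two maps given as collections of voxel indices."""
    a, b = set(map_a), set(map_b)
    if not a and not b:
        return 1.0  # two empty maps are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))


def containment(inner, outer):
    """Fraction of the voxels of `inner` that also belong to `outer`."""
    a, b = set(inner), set(outer)
    if not a:
        return 0.0  # undefined for an empty map; reported as 0.0 here
    return len(a & b) / len(a)


# Hypothetical maps: voxelwise SPM activations inside a larger SAM region
spm_voxels = {2, 3}
sam_voxels = {1, 2, 3, 4}
dice = dice_overlap(spm_voxels, sam_voxels)
inside = containment(spm_voxels, sam_voxels)
```

A containment ratio close to one together with a moderate Dice value is the pattern one would expect when a voxelwise map lies inside a coarser region-wise map.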


6 Discussion
As shown in the previous section, SAM is in general a very robust method, in terms of sample size, for finding relevant standardized areas, and a stable framework that contains those regions defined as relevant by SPM when the sample size is sufficiently large. The experiments carried out in different experimental frameworks and datasets have demonstrated the ability of this multivariate approach to establish a novel model-free method for the assessment of significant changes across brain volumes.
The behaviour of the analysed methods depends on the size of the effect we are interested in. In the search for subtle effects, such as those found in AD or autistic patterns, and given that hypothesis tests cannot separate important but subtle effects from actually trivial ones [9], our SAM focuses on standardized ROIs to avoid the presence of false positives in the resulting maps. In this sense, SPM is more specific and can detect, within the regions found by SAM, which substructures are responsible for the discrimination between classes.
On the other hand, when large effects are bound to be found, SAM is a suitable method for their detection since, with few samples, it provides results similar to those obtained with complete databases, i.e. the fMRI and DaTSCAN experiments. This is in line with the main idea derived from [8] that when an effect is found in small datasets it is more than likely to be extrapolated to large samples. On the contrary, only in small datasets with small (but meaningful) effects that are missed, missing data, sampling bias, etc. did we find an absence of replication, i.e. across data-collecting sites [9]. All these statistical features in the analysis of neuroimaging data are experimentally described in the datasets analysed throughout this paper.
Finally, we have seen the usefulness of the confidence intervals derived from SLT, based on concentration inequalities, to achieve a confidence framework beyond sharp null-hypothesis testing. Key to this methodology in the field of SLT is that it is based on in-sample estimates (a similar procedure to exploratory analysis using hypothesis testing), unlike the out-of-sample estimates in CV procedures, which usually subdivide the (small) datasets for an estimation of the actual error. In this way, an analytical bound depending on sample size ( ) and number of predictors ( ) defines a “worst-case” operating point. Nevertheless, the experiments showed the application of a systematic hypothesis test for the selection of significant empirical errors, which make up the highlighted regions in the SAM. Only in this case is a model assumed for the set of accuracies, but it has been shown to be in accordance with the nature of the one-dimensional data and sufficiently accurate for our purposes.
7 Conclusion
In this paper we present a data-driven approach, mainly devoted to classification problems with limited sample sizes, to derive statistical model-free (agnostic) mappings. Although the latter is not designed for testing competing hypotheses or comparing different models in neuroimaging, we derive the SAM assuming the existence of classes ($H_1$), at voxel or multi-voxel level. The analysis of the “worst case” considers the upper bounds of the actual risk, under suitable theoretical conditions (see methods and appendix), and a selection of regions with a highly corrected empirical risk, according to a test for significance on a population proportion. In conclusion, SAM relieved the problem of instability in limited sample sizes when determining maps of relevance in several neurological conditions, such as AD, PD or auditory tasks, and resulted in a very competitive and complementary method to the SPM framework, which is widely accepted by the neuroimaging community. Moreover, the latter usually employs several strategies for reducing the false positive rates in multiple comparisons, such as the (FWE) corrected p-value maps in null-hypothesis testing, and RFT to deal with the spatial structure of the maps. However, this approach is found to be very conservative in our experiments and in the extant literature. In this sense, the novel framework based on SLT provides activation maps similar to those obtained by SPM, but defined on ROIs, under a rigorous development in scenarios with a small sample/dimension ratio and large, small and trivial effect sizes, as shown in the experimental part.
Acknowledgments.
This work was partly supported by the MINECO/FEDER under the RTI2018098913B100 project. We would like to thank the reviewers for their thoughtful comments and efforts towards improving our manuscript. J.M. Gorriz would like to thank Dr. Maxim Raginsky for his elegant abstract notation, which is borrowed in this paper.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Supplementary Material
A similar analysis was carried out with an increasing number of components; however, the upper bounds increase accordingly, as shown in figure 10. This highlights the benefits of working in a low-dimensional scenario, although the use of new features in the analysis allows us to detect other regions, such as the “Temporal Pole Sup” and “Temporal Mid” regions.
Appendix
Upper Bounding the worst case: a summary
As shown in equation 4, the consistency of the ERM algorithm depends mainly on the evaluation of the two-sided uniform deviation of the error probabilities in the worst case. An upper bound with probability at least for this quantity can be obtained by invoking a result in [19], since has the bounded differences property by :
(8) 
This is known as the generalized Hoeffding inequality. Then (footnote 5: given a random variable, if then ),
(9) 
with probability .
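For orientation, the step from the tail bound to the high-probability statement can be sketched in its standard textbook form. The symbols below are assumptions of this sketch, since the original notation is not legible here: $\Delta_n$ stands for the two-sided uniform deviation, $n$ for the sample size, $t > 0$ for the deviation level, $\delta \in (0,1)$ for the confidence parameter, and the bounded differences constants are taken to be $1/n$.

```latex
% Bounded differences (McDiarmid) inequality with constants c_i = 1/n,
% so that \sum_i c_i^2 = 1/n:
\Pr\left\{ \Delta_n - \mathbb{E}\,\Delta_n \ge t \right\}
  \le \exp\!\left( -\frac{2t^2}{\sum_{i=1}^{n} c_i^2} \right)
  = e^{-2nt^2}.
% Setting e^{-2nt^2} = \delta and solving for t gives
t = \sqrt{\frac{\ln(1/\delta)}{2n}},
% hence, with probability at least 1 - \delta,
\Delta_n \le \mathbb{E}\,\Delta_n + \sqrt{\frac{\ln(1/\delta)}{2n}}.
```

The constants in the paper's own equations 8 and 9 may be written differently; only the inversion of the tail bound is illustrated.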
Moreover, the expected value of the deviation can be bounded by the so-called Rademacher average [29] as follows. First, the uniform deviation is bounded by its expected value with respect to the set of random error functions , using the “symmetrization” trick proposed in [3] and the convexity of the norm function:
(10) 
where is randomly selected from and . Taking expectations on both sides we get:
(11) 
By using the triangle inequality and the definition of empirical error we finally obtain:
(12) 
where the right-hand side of the inequality has the same distribution as the Rademacher average , where are independent random variables taking values in {−1, +1} with equal probability. Finally, using Massart’s finite class lemma [18] we can bound the left-hand side of the latter inequality as:
(13) 
Consequently, substituting equations 9, 11, 12 and 13 into equation 4, we finally prove equation 5.
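The chain of inequalities summarized above can be evaluated numerically. The following is a minimal sketch, not the paper's equation 5: it assumes a finite effective class of `class_size` binary error functions on the `n` samples, uses the textbook constants for symmetrization, Massart's finite class lemma (applied to the class and its negation to cover the two-sided deviation) and the bounded differences inequality, and all function and parameter names are hypothetical.

```python
import math


def worst_case_deviation_bound(n, class_size, delta=0.05):
    """Bound on sup_f |empirical error - actual error|, valid w.p. >= 1 - delta.

    Assumption of this sketch: the error functions take values in {0, 1} and
    induce at most `class_size` distinct labelings of the n samples.
    """
    if n <= 0 or class_size < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0, class_size >= 1 and 0 < delta < 1")
    # Symmetrization followed by Massart's finite class lemma:
    # E[sup] <= 2 * sqrt(2 * ln(2 * N) / n).
    expectation_term = 2.0 * math.sqrt(2.0 * math.log(2.0 * class_size) / n)
    # Bounded differences inequality with constants 1/n:
    # deviation of the supremum from its expectation.
    concentration_term = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return expectation_term + concentration_term


if __name__ == "__main__":
    for n in (50, 200, 1000):
        print(n, round(worst_case_deviation_bound(n, class_size=10), 4))
```

The bound shrinks as the sample size grows and widens as the effective class grows, which is the qualitative behaviour of a worst-case operating point that depends on sample size and number of predictors.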
The partial least squares algorithm
The PLS algorithm extracts the relevant patterns within ROIs across brains by a regression between the multivariate data matrix and the label vector . In short, we maximize:
(14) 
where the score vectors are iteratively extracted and used to deflate the input matrix by subtracting their rank-one approximations based on [13]. The deflation process is accomplished by the computation of the vector of loadings as a coefficient of regressing on :
(15) 
As shown in [5], the size of the input data is crucial to the assessment of the relationship between volume data and group membership within the evaluated ROIs, where some statistical properties of the involved processes, such as stationarity or ergodicity in the correlation, must be assumed. The derived PLS maps can be seen as a multivariate two-sample test weighted by the scores of each sample with unknown distribution, except for a normalization term that depends on the pooled standard deviation [5]; thus their statistical significance can be assessed in a manner similar to a t-test [12].
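The extraction and deflation steps described above can be sketched with a generic single-response PLS (PLS1) iteration. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the maximized criterion is the covariance between the scores and the centered label vector, with unit-norm weights, and the names `pls1_scores`, `X` and `y` are hypothetical.

```python
import numpy as np


def pls1_scores(X, y, n_components=1):
    """Extract PLS1 score and loading vectors with rank-one deflation of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X - X.mean(axis=0)  # center the features
    y = y - y.mean()        # center the labels
    scores, loadings = [], []
    for _ in range(n_components):
        w = X.T @ y                  # direction of maximal covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:             # nothing left to explain
            break
        w = w / norm
        t = X @ w                    # score vector
        p = X.T @ t / (t @ t)        # loadings: regression of X on t
        X = X - np.outer(t, p)       # deflate by the rank-one approximation
        scores.append(t)
        loadings.append(p)
    if not scores:
        raise ValueError("no component could be extracted")
    return np.column_stack(scores), np.column_stack(loadings)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 5))
    y = (X[:, 0] > 0).astype(float)
    T, P = pls1_scores(X, y, n_components=2)
    print(T.shape, P.shape)
```

Because each score vector is removed from the data matrix before the next one is extracted, successive scores are mutually orthogonal.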
Significance test for a proportion
Let denote the sampling distribution of empirical errors , for ; then the null hypothesis test about the population proportion within the confidence interval has the form: , where denotes a particular proportion value between 0 and 1, e.g. 0.5. The test statistic for a population proportion is , where . For large samples, i.e. for at least , if is true, the sampling distribution of the test statistic is the standard normal distribution.
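A large-sample test of this kind can be sketched as follows. The notation is assumed rather than taken from the paper: `p_hat` is the observed proportion, `pi0` the hypothesized proportion, `n` the number of observations, and the alternative is taken to be one-sided (proportion greater than `pi0`); the function name is hypothetical.

```python
import math


def proportion_z_test(p_hat, n, pi0=0.5):
    """One-sample z-test for a proportion, H0: pi = pi0 vs Ha: pi > pi0.

    Returns the z statistic and the one-sided p-value under the
    standard normal approximation (reasonable only for large n).
    """
    if n <= 0 or not 0.0 < pi0 < 1.0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0, 0 < pi0 < 1 and 0 <= p_hat <= 1")
    se0 = math.sqrt(pi0 * (1.0 - pi0) / n)  # standard error under H0
    z = (p_hat - pi0) / se0
    # Upper-tail probability of the standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    z, p = proportion_z_test(p_hat=0.6, n=100, pi0=0.5)
    print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

A two-sided version would double the smaller tail probability; which alternative is appropriate depends on whether the tested quantity is an accuracy or an error rate.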
References
 [1] Friston, K. Sample size and the fallacies of classical inference. NeuroImage 81 (2013) 503–504
 [2] Bzdok, D. Classical Statistics and Statistical Learning in Imaging Neuroscience. Front. Neurosci., 06 October 2017  https://doi.org/10.3389/fnins.2017.00543
 [3] Vapnik, V. Estimation of Dependences Based on Empirical Data. Springer-Verlag. 1982. ISBN 0387907335
 [4] Kohavi, R. A study of CV and bootstrap for accuracy estimation and model selection. Proc. of the 14th International Joint Conference on AI, Vol. 2, pp. 1137–1143 (1995)
 [5] Górriz, J.M. et al. On the computation of distribution-free performance bounds: Application to small sample sizes in neuroimaging. Pattern Recognition 93, 1–13
 [6] Górriz, et al. A Machine Learning Approach to Reveal the NeuroPhenotypes of Autisms. International journal of neural systems, 1850058
 [7] Varoquaux, G. Cross-validation failure: Small sample sizes lead to large error bars. NeuroImage 180 (2018) 68–77.
 [8] Friston, K. Ten ironic rules for nonstatistical reviewers. NeuroImage 61 (2012) 1300–1310
 [9] Lindquist, M.A. et al. Ironing out the statistical wrinkles in "ten ironic rules". Neuroimage. 2013 Nov 1;81:499–502.
 [10] Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation Volume 100, Issue 1, September 1992, Pages 78–150
 [11] Cover, T.M. Geometrical and Statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers. EC14: 326–334 (1965)
 [12] McIntosh, A.R. et al. Spatial pattern analysis of functional brain images using partial least squares, Neuroimage 3(3 Pt 1) (1996) 143–157
 [13] Rosipal, R. et al. Overview and Recent Advances in Partial Least Squares (Springer Berlin, Heidelberg, 2006), pp. 34–51
 [14] Rondina, J.M. SCoRS - a Method Based on Stability for Feature Selection and Mapping in Neuroimaging. IEEE Trans Med Imaging. 2014 Jan; 33(1): 85–98.
 [15] Martinez-Murcia F.J. Studying the Manifold Structure of Alzheimer’s Disease: A Deep Learning Approach Using Convolutional Autoencoders. IEEE J Biomed Health Inform. 2019 Jun 17.
 [16] De Martino, F. et al. Combining multivariate voxel selection and support vector machines for mapping and classification of fMRI spatial patterns. NeuroImage, 43 (1) (2008), pp. 44–58
 [17] Vapnik V. et al. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16:264–280, 1971.
 [18] Massart, P. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, 2000.
 [19] McDiarmid, C. On the method of bounded differences, Surveys in Combinatorics 141 (1989), 148–188
 [20] Sauer, N. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147, 1972.
 [21] Shelah, S. A combinatorial problem: stability and order for models and theories in infinity languages. Pacific Journal of Mathematics, 41:247–261, 1972.
 [22] Gómez-Verdejo, V. Sign-Consistency Based Variable Importance for Machine Learning in Brain Imaging. Neuroinformatics, October 2019, Volume 17, Issue 4, pp 593–609
 [23] Khundrakpam, B.S. et al. (2015). Prediction of brain maturity based on cortical thickness at different spatial resolutions. NeuroImage, 111, 350–359.
 [24] Mourão-Miranda, J. et al. Classifying brain states and determining the discriminating activation patterns: Support vector machine on functional MRI data. NeuroImage, 28, 980–995. (2005).
 [25] Friston K. et al. Statistical Parametric Maps in functional imaging: A general linear approach. Hum. Brain Mapp. 2:189–210 (1995)
 [26] Friston, K.J., Harrison, L., Penny, W., 2003. Dynamic causal modelling. Neuroimage 19, 1273–1302.
 [27] Reiss, P.T. Cross-validation and hypothesis testing in neuroimaging: an irenic comment on the exchange between Friston and Lindquist et al. Neuroimage. 2015 August 1; 116: 248–254
 [28] Tzourio-Mazoyer, N. et al. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single subject brain. Neuroimage 2002; 15: 273–289.
 [29] Shalev-Shwartz, S. et al. Understanding Machine Learning – from Theory to Algorithms. Cambridge University Press. ISBN 9781107057135. 2014
 [30] Antós, A. et al. Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research 3 (2002) 73–98
 [31] Vidyasagar, M. Learning and Generalisation With Applications to Neural Networks. Springer. ISBN 9781849968676 (2003)
 [32] Frackowiak et al. Human Brain Function (Second Edition). Chap. 44. Introduction to Random Field Theory. ISBN 9780122648410 Academic Press. 867–879, 2004.
 [33] Jack, Jr. C.C. NIA-AA Research Framework: Toward a biological definition of Alzheimer’s disease. Alzheimers Dement. 2018 Apr; 14(4): 535–562.
 [34] McKhann, G.M. et al. The diagnosis of dementia due to Alzheimer’s disease: recommendations from the National Institute on Aging and the Alzheimer’s Association Workgroup. Alzheimers Dement. 2011;7:263–9.
 [35] Button, K.S. et al. Confidence and precision increase with high statistical power Nature Reviews Neuroscience volume 14, page 585(2013).
 [36] Illan, I.A. et al. Automatic assistance to Parkinson’s disease diagnosis in DaTSCAN SPECT imaging. Medical Physics. 2012
 [37] Zaidi, H. et al. Quantitative Analysis in Nuclear Medicine Imaging. Springer Science+Business Media, Inc. ISBN-10: 0387238549
 [38] Ioannidis, J.P.A.. Why most published research findings are false. PLoS Med. 2 (8) (e124), 696–701. 2005.