1 Introduction
Medical images are increasingly used in predictive settings, in which one wants to classify patients into disease categories or predict some outcomes of interest. Besides predictive accuracy, a fundamental question is that of
opening the black box, understanding the combinations of observations that explains the outcome. A particular relevant question is that of the importance of image features in the prediction of an outcome of interest, conditioned on other features. Such conditional analysis is a fundamental step to allow causal inference on the implications of the signals from image regions in this outcome; see e.g. [12] for the case of brain imaging. However, the typical setting in medical imaging is that of highdimensional smallsample problems, in which the number of samples is much smaller than the number of covariates . This is further aggravated by the steady improvements in data resolution. In such cases, classical inference tools fail, both theoretically and practically. One solution to this problem is to reduce the massive number of covariates by utilizing dimension reduction, such as clusteringbased image compression, to reduce the number of features to a value close to n; see e.g. [4]. This approach can be viewed as the bias/variance tradeoff: some loss in the localization of the predictive features —bias— is tolerated as it comes with less variance —hence higher power— in the statistical model. This is particularly relevant in medical imaging, where localizing predictive features at the voxel level is rarely important: one is typically more interested in the enclosing region.
However, such a method suffers from the arbitrariness of the clustering step and the ensuing highvariance in inference results with different clustering runs, as shown empirically in [6]. [6] also introduced an algorithm called Ensemble of Clustered Desparsified Lasso (ECDL), based on the inference technique developed in [13]
, that provides pvalues for each feature, and controls the Family Wise Error Rate (FWER), i.e. the probability of making one or more false discoveries. In applications, it is however more relevant to control the False Discovery Rate (FDR)
[3], which indicates the expected fraction of false discoveries among all discoveries, since it allows to detect a greater number of variables. In univariate settings, the FDR is easily controlled by the BenjaminiHochberg procedure [3], valid under independence or positive correlation between features. It is unclear whether this can be applied to multivariate statistical settings. A promising method which controls the FDR in multivariate settings is the socalled knockoff inference [2, 5], which has been successfully applied in settings where . However, the method relies on randomly constructed knockoff variables, therefore it also suffers from instability. Our contribution is a new algorithm, called Ensemble of Clustered Knockoffs (ECKO), that i) stabilizes knockoff inference through an aggregation approach; ii) adapts knockoffs to settings. This is achieved by running the knockoff inference on the reduced data and ensembling the ensuing results.The remainder of our paper is organized as follows: Section 2 establishes a rigorous theoretical framework for the ECKO algorithm; Section 3 describes the setup of our experiments with both synthetic and brain imaging data predictive problems, to illustrate the performance of ECKO, followed by details of the experimental results in Section 4; specifically, we benchmark this approach against the procedure proposed in [7], that does not require the clustering step, yet only provides asymptotic () guarantees. We show the benefit of the ECKO approach in terms of both statistical control and statistical power.
2 Theory
2.1 Generalized Linear Models and High Dimensional Setting
Given a design matrix
and a response vector
, we consider that the true underlying model is of the following form:(1) 
where is the true parameter vector, the (unknown) noise magnitude, the noise vector and is a function that depends on the experimental setting (e.g. for the regression problem or e.g. for the classification problem). The columns of refer to the explanatory variables also called features, while the rows of represent the coordinates of different samples in the feature space. We focus on experimental settings in which the number of features is much greater than the number of samples . Additionally, the (true) support denoted by is given by . Let
denotes an estimate of the support given a particular inference procedure. We also define the signaltonoise ratio (SNR) which allows to assess the noise regime of a given experiment:
(2) 
A high SNR means the signal magnitude is strong compared to the noise, hence it refers to an easier inference problem.
2.2 Structured Data
In medical imaging and many other experimental settings, the data stored in the design matrix relate to structured signals. More precisely, the features have a peculiar dependence structure that is related to an underlying spatial organization, for instance the spatial neighborhood in 3D images. Then, the features are generated from a random process acting on this underlying metric space. In our paper, the distance between the th and the th features is denoted by .
2.3 FDR control
In this section, we introduce the false discovery rate (FDR) and a spatial generalization of the FDR that we called FDR. This quantity is important since a desirable property of an inference procedure is to control the FDR or the FDR. In the following, we assume that the true model is the one defined in Section 2.1.
Definition 1
False discovery proportion (FDP). Given an estimate of the support obtained from a particular inference procedure, the false discovery proportion is the ratio of the number selected features that do not belong to the support (false discoveries) divided by the number of selected features (discoveries):
(3) 
Definition 2
FDP. Given an estimate of the support obtained from a particular inference procedure, the false discovery proportion with parameter , denoted FDP is the ratio of the number selected features that are at a distance more than from any feature of the support, divided by the number of selected features:
(4) 
One can notice that for , the FDP and the FDP refer to same quantity .
Definition 3
False Discovery Rate (FDR) and FDR. The false discovery rate and the false discovery rate with parameter which is denoted by FDR are respectively the expectations of the FDP and the FDP:
(5) 
2.4 Knockoff Inference
Initially introduced by [2] to identify variables in genomics, the knockoff filter is an FDP control approach for multivariate models. This method has been improved to work with mildly highdimensional settings in [5], leading to the socalled modelX knockoffs:
Definition 4
ModelX knockoffs [5].
The modelX knockoffs for the family of random variables
are a new family of random variables constructed to satisfy the two properties:
For any subset : ,
where the vector denotes the swap of entries and , 
where is the response vector.
In a nutshell, knockoff procedure first creates extra null variables that have a correlation structure similar to that of the original variables. A test statistic vector is then calculated to measure the strength of the original versus its knockoff counterpart. An example of such statistic is the lassocoefficient difference (LCD) that we use in this paper:
Definition 5
Remark 1
Our first contribution is to extend this procedure computing by aggregating different draws of knockoffs before applying the BenjaminiHochberg (BHq) procedure. More precisely, we first compute draws of knockoff variables and, using Equation 6, we derive the corresponding pvalues , for all and . Then, we aggregate them for each
in parallel, using the quantile aggregation procedure introduced in
[11]:(7) 
We then proceed with the fourth and fifth steps of the knockoff procedure described in Def. 5.
2.5 Dimension reduction
Knockoff (KO) inference is intractable in highdimensional settings, as knockoff generation requires the estimation and inversion of covariance matrices of size . Hence we leverage data structure by introducing a clustering step that reduces data dimension before applying KO inference. As in [6], assuming the features’ signals are spatially smooth, it is relevant to consider a spatiallyconstrained clustering algorithm. By averaging the features with each clustering, we reduce the number of parameters from to , the number of clusters, where . KO inference on clusterbased signal averages will be referred to as Clustered Knockoffs (CKO). However, it is preferable not to fully rely on a particular clustering, as a small perturbation on the input data has a dramatic impact on the clustering solution. We followed the approach used in [9] that aggregates solutions across random clusterings. More precisely, they build different clusterings from different random subsamples of size from the full sample , but always using the same clustering algorithm.
2.6 The Ensemble of Clustered Knockoff Algorithm
The problem is to aggregate the qvalues obtained across CKO runs on different clustering solutions. To do so, we transfer the qvalues from clusters (group of voxels) to features (voxels): given a clustering solution , we assign to each voxel the qvalue of its corresponding cluster. More formally, if, considering the th clustering solution, the th voxel belongs to the th cluster denoted by then the qvalue assigned to this voxel is: if . This procedure hinges on the observation that the FDR is a resolutioninvariant concept —it controls the proportion of false discoveries. In the worst case, this results in a spatial inaccuracy of in the location of significant activity, being the diameter of the clusters. Finally, the aggregated qvalue of the th voxel is the average of the qvalues , , received across C different clusterings: given the FDR definition Equation 5, FDPs can naturally be averaged. The algorithm is summarized in Alg. 1 and represented graphically in Figure 1.
2.7 Theoretical Results
Ensemble of Clustered Knockoffs (ECKO).
Theorem 1
FDR control by the Ensemble of Clustered Knockoffs procedure at the voxel level. Assuming that the true model is the one defined in Equation 1, using the qvalues defined in Section 2.6, the estimated support ensure that the FDR is lower than .
Sketch of the proof
(details are omitted for the sake of space). We first establish that the aggregation procedure yields qvalues that control the FDR. This follows simply from the argument given in the proof of Theorem 3.1 in [11]. Second, we show that broadcasting the values from clusters () to voxels () still controls the FDR, yet with a possible inaccuracy of , where is the supremum of clusters diameters: the FDR is controlled. This comes from the resolution invariance of FDR and the definition of FDR. Third, averagingbased aggregation of the qvalues at the voxel level, controls the FDR. This stems from the definition of the FDR as an expected value.
2.8 Alternative approaches
In the present work, we use two alternatives to the proposed CKO/ECKO approach: the ensemble of clustered desparsified lasso (ECDL) [6] and the APT framework from [7]. As we already noted, ECDL is structured as ECKO. The main differences are that it relies on desparsified lasso rather than knockoff inference and returns pvalues instead of qvalues. The APT approach was proposed to return featurelevel pvalues for binary classification problems (though the generalization to regression is straightforward). It directly works at the voxel level, yet with two caveats:

Statistical control is granted only in the limit

Unlike ECDL and ECKO, it is unclear whether the returned score represents marginal or conditional association of the input features with the output.
For both ECDL and APT, the returned pvalues are converted to qvalues using the standard BHq procedure. The resulting qvalues are questionable, given that BHq is not valid under negative dependence between the input qvalues [3]; on the other hand, practitioners rarely check the hypothesis underlying statistical models. We thus use the procedure in a blackbox mode and check its validity a posteriori.
3 Experiments
3.0.1 Synthetic data.
To demonstrate the improvement of the proposed algorithm, we first benchmark the method on 3D synthetic data set that resembles a medical image with compact regions of interest that display some predictive information. The size of the weight vector is , with 5 regions of interest (ROIs) of size . A design matrix
that represents random brain signal is then sampled according to a multivariate Gaussian distribution. Finally, the response vector
is calculated following linear model assumption with Gaussian noise, which is configured to have , similar to real data settings. An average precisionrecall curve of 30 simulations is calculated to show the relative merits of single cluster Knockoffs inference versus ECKO and ECDL and APT. Furthermore, we also vary the SignaltoNoise Ratio (SNR) of the simulation to investigate the accuracy of FDR control of ECKO with different levels of difficulty in detecting the signal.3.0.2 Real MRI dataset.
We compare singleclustered Knockoffs (CKO), ECKO and ECDL on different MRI datasets downloaded from the Nilearn library [1]. In particular, the following datasets are used:

Haxby [8]. In this functionalMRI (fMRI) dataset, subjects are presented with images of different objects. For the benchmark in our study, we only use the brain signal and responses for images related to faces and houses of subject 2 ().

Oasis [10]. The original collection include data of gray and white matter density probability maps for 416 subjects aged 18 to 96, 100 of which have been clinically diagnosed with very mild to moderate Alzheimer’s disease. The purpose for our inference task is to find regions that predict the age of a subject ().
We chose in all experiments for the algorithms that require clustering step (KO, ECKO and ECDL). In the two cases, we start with a qualitative comparison of the returned results. The brain maps are ternary: all regions outside are zeroed, while regions in get a value of or , depending on whether the contribution to the prediction is positive or negative. For ECKO, a vote is performed to decide whether a voxel is more frequently in a cluster with positive or negative weight.
4 Results
4.0.1 Synthetic data.
A strong demonstration of how ECKO makes an improvement in stabilizing the singleclustering Knockoffs (CKO) is shown in Fig. 2. There is a clear distinction between selection of the orange area at lower right and the blue area at upper right in the CKO result, compared to the ground truth. Moreover, CKO falsely discovers some regions in the middle of the cube. By Contrast, ECKO’s selection is more similar to the true 3D weight cube. While it returns a wider selection than ECKO, ECDL also claims more false discoveries, most visibly in the blue area on upperleft corner. At the same time, APT returns adequate results, but is more conservative than ECKO.
Fig. 2(a) is the result of averaging 30 simulations for the 3D brain synthetic data. ECKO and ECDL obtain almost identical precisionrecall curve: for a precision of at least 90%, both methods have recall rate of around 50%. Meanwhile, CKO falls behind, and in fact it cannot reach a precision of over 40% across all recall rates. APT yields the best precisionrecall compromise, slightly above ECKO and ECDL.
When varying SNR (from to ) and investigating the average
proportion of false discoveries (FDR) made over the average of 30
simulations (Fig. 2(b)), we observe that CKO fails to control
FDR at nominal level 10% in general.
Note that accurate FDR control would be obtained with larger
values, but this makes the whole procedure less useful.
The ECDL controls FDR at low SNR level. However, when the signal is
strong, ECDL might select more false positives.
ECKO, on the other hand, is always reliable —albeit conservative— keeping
FDR below the nominal level even when SNR increases to larger magnitude.
Oasis & Haxby dataset. When decoding the brain signal on
subject 2 of the Haxby dataset using response vector label for watching ’Face
vs. House’, there is a clear resemblance of selection results between ECKO and
ECDL.
Using an FDR threshold of 10%, both algorithms select the same area (with a difference in size), namely a face responsive region in the ventral visual cortex, and agree on the sign of the effect. However, on Oasis dataset, thresholding to control the FDR at 0.1 yields empty selection with ECDL and APT, while ECKO still selects some voxels. This potentially means that ECKO is statistically more powerful than ECDL and APT.
5 Conclusion
In this work, we proposed an algorithm that makes False Discovery Rate (FDR) control possible in highdimensional statistical inference. The algorithm is an integration of clustering algorithm for dimension reduction and aggregation technique to tackle the instability of the original knockoff procedure. Evaluating the algorithm on both synthetic and brain imaging datasets shows a consistent gain of ECKO with respect to CKO in both FDR control and sensitivity. Furthermore, empirical results also suggest that the procedure achieves nonasymptotic statistical guarantees, yet requires the
relaxation for FDR.The number of clusters represents a biasvariance tradeoff: increasing it can reduce the bias (in fact, the value of ), while reducing it improves the conditioning for statistical inference, hence the sensitivity of the knockoff control. We set it to 500 in our experiments. Learning it from the data is an interesting research direction.
We note that an assumption of independence between hypothesis tests is required for the algorithm to work, which is often not the case in realistic scenarios. Note that this is actually the case for all FDRcontrolling procedures that rely on the BHq algorithm. As a result, making the algorithm work with relaxed assumption is a potential direction for our future study. Furthermore, the doubleaggregation procedure makes the algorithm more expensive, although it results in embarrassingly parallel loops. An interesting challenge is to reduce the computation cost of this procedure. Another avenue to explore for the future is novel generative schemes for knockoff, based e.g. on deep adversarial approaches.
Acknowledgement This research is supported by French ANR (project FASTBIG ANR17CE230011) and Labex DigiCosme (project ANR11LABEX0045DIGICOSME). The authors would like to thank Sylvain Arlot and Matthieu Lerasle for fruitful discussions and helpful comments.
References

[1]
Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., Varoquaux, G.: Machine learning for neuroimaging with scikitlearn. Frontiers in Neuroinformatics 8, 14 (2014)
 [2] Barber, R.F., Candès, E.J.: Controlling the false discovery rate via knockoffs. The Annals of Statistics 43(5), 2055–2085 (Oct 2015), http://arxiv.org/abs/1404.5609, arXiv: 1404.5609
 [3] Benjamini, Y., Hochberg, Y.: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57(1), 289–300 (1995), https://www.jstor.org/stable/2346101
 [4] Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.H.: Correlated variables in regression: Clustering and sparse estimation. Journal of Statistical Planning and Inference 143(11), 1835–1858 (2013)
 [5] Candès, E., Fan, Y., Janson, L., Lv, J.: Panning for gold: ‘modelx’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(3), 551–577
 [6] Chevalier, J.A., Salmon, J., Thirion, B.: Statistical inference with ensemble of clustered desparsified lasso. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., AlberolaLópez, C., Fichtinger, G. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. pp. 638–646. Springer International Publishing, Cham (2018)

[7]
Gaonkar, B., Shinohara, R.T., Davatzikos, C.: Interpreting support vector machine models for multivariate group wise analysis in neuroimaging. Medical Image Analysis 24(1), 190–204 (2015)
 [8] Haxby, J.V., Gobbini, M.I., Furey, M.L., Ishai, A., Schouten, J.L., Pietrini, P.: Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293(5539), 2425–2430 (2001)
 [9] HoyosIdrobo, A., Varoquaux, G., Schwartz, Y., Thirion, B.: Frem – scalable and stable decoding with fast regularized ensemble of models. NeuroImage 180, 160 – 172 (2018), http://www.sciencedirect.com/science/article/pii/S1053811917308182, new advances in encoding and decoding of brain signals
 [10] Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open access series of imaging studies (oasis): Crosssectional mri data in young, middle aged, nondemented, and demented older adults. Journal of Cognitive Neuroscience 19(9), 1498–1507 (2007)
 [11] Meinshausen, N., Meier, L., Bühlmann, P.: pvalues for highdimensional regression. Journal of the American Statistical Association 104(488), 1671–1681 (2009)
 [12] Weichwald, S., Meyer, T., Özdenizci, O., Schölkopf, B., Ball, T., GrosseWentrup, M.: Causal interpretation rules for encoding and decoding models in neuroimaging. NeuroImage 110, 48 – 59 (2015), http://www.sciencedirect.com/science/article/pii/S105381191500052X

[13]
Zhang, C.H., Zhang, S.S.: Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(1), 217–242 (Mar 2014)