1 Model: multi-dataset classification of brain statistical images
Our general goal is to extract and integrate biological knowledge across many brain-imaging studies within the same statistical learning framework. We first outline how analyzing large repositories of fMRI experiments can be cast as a classification problem. Here, success in capturing brain-behavior relationships is measured by out-of-sample prediction accuracy. The proposed model (Figure 1) solves a range of these classification problems within a single statistical estimation procedure and imposes a shared latent structure on the single-dataset classification parameters. These shared model parameters may be viewed as a chain of two dimension reductions. The first reduction layer leverages knowledge about spatial regularities in the brain; it is learned from resting-state data and designed to capture neural activity patterns at different levels of coarseness. The second reduction layer projects data onto directions that are generally relevant for predicting cognitive states. The combination of both reductions yields low-dimensional representations that are less affected by noise and subject variance than the high-dimensional samples, so classification is expected to achieve better out-of-sample prediction performance.
1.1 Problem setting: predicting conditions from brain activity in multiple studies
We first introduce our notation and terminology, and formalize a general prediction problem applicable to any task fMRI dataset. In a single fMRI study, each subject performs different experiments in the scanner. During such an experiment, the subjects are presented with a set of sensory stimuli (i.e., conditions) that aim at recruiting a target set of cognitive processes. We fit a first-level general linear model for every record to obtain z-score maps that quantify the importance of each condition in explaining each voxel. Formally, the statistical maps of a given study form a sequence $(\mathbf{x}_i)_i$ in $\mathbb{R}^p$, where $p$ is the number of voxels in the brain. Each observation $\mathbf{x}_i$ is labelled by the condition $y_i$ whose effect it captures. A single study typically features one or a few (if experiments are repeated) statistical maps per condition and per subject, and may present many distinct conditions. Across studies, the observed brain maps can be modeled as generated from an unknown joint distribution of brain activity and associated cognitive conditions, where variability across trials and subjects acts as confounding noise. In this context, we wish to learn a decoding model that predicts the condition from brain activity measured on new subjects or in new studies. Inspired by recent work (schwartz_mapping_2013; bzdok_semisupervised_2015; rubin_generalized_2016)
, we frame the condition prediction problem as the estimation of a multinomial classification model. Our model estimates the probability vector of an input map $\mathbf{x}$ being labeled by each condition in the condition set $\mathcal{C}$. This vector is modeled as a function of $\mathbf{x}$ that takes the softmax form: for all $j$ in $\mathcal{C}$, its $j$-th coordinate is defined as

$$\hat{y}_j(\mathbf{x}) = \frac{\exp(\mathbf{w}_j^\top \mathbf{x} + b_j)}{\sum_{k \in \mathcal{C}} \exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}. \quad (1)$$
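As a concrete illustration, the multinomial model of Eq. (1) can be sketched in a few lines of numpy. This is a toy sketch: the sizes and the random data below are illustrative stand-ins, not the dimensions or maps of the actual experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_voxels, n_conditions = 40, 200, 5   # toy sizes

X = rng.normal(size=(n_samples, n_voxels))       # stand-ins for z-maps x_i
y = rng.integers(n_conditions, size=n_samples)   # condition labels y_i
W = rng.normal(scale=0.01, size=(n_voxels, n_conditions))
b = np.zeros(n_conditions)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)         # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

probas = softmax(X @ W + b)                      # Eq. (1), one row per map
# cross-entropy between predicted probabilities and true labels
loss = -np.mean(np.log(probas[np.arange(n_samples), y]))
```

Fitting then amounts to minimizing `loss` with respect to `W` and `b`, as described next.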
Fitting the model weights is done by minimizing the cross-entropy between the predicted probability vectors and the true labels, with respect to the weights, with or without imposing parameter regularization. In this model, an input image is classified among all conditions presented in the whole study. It is possible to restrict this classification to the set of conditions used in a given experiment — the empirical results of this work can be reproduced in this setting.

The challenge of model parameter estimation.
A major inconvenience of the vanilla multinomial model lies in the ratio between the limited number of samples provided by a typical fMRI dataset and the overwhelming number of model weights to be estimated. Fitting the model amounts to estimating one discriminative brain map per condition, i.e., millions of parameters (4M for the 23 conditions of HCP), whereas most brain-imaging studies include less than a hundred subjects and therefore yield only a few thousand samples. This makes it hard to approximate the population parameters well enough for successful generalization, especially because the variance between subjects is high compared to the variance between conditions. This obstacle is usually tackled in one of two major ways in brain imaging: 1) imposing sparsity or a-priori structure on the model weights, or 2) reducing the dimension of the input data by spatial clustering or univariate feature selection by ANOVA. However, on the one hand, regularization strategies frequently incur costly computational budgets if one wants to obtain interpretable weights (gramfort_identifying_2013), and they introduce artificial bias. On the other hand, existing dimension-reduction techniques developed for fMRI analysis can distort the signal and reduce accuracy (thirion_which_2014). Most importantly, previous statistical approaches are not tuned to identifying conditions from task fMRI data. We therefore propose a dimension reduction that is estimated from data and tuned to capture the common hidden aspects shared by statistical maps across studies — we aggregate several classification models that share parameters.

1.2 Learning a shared representation across studies for decoding
We now consider several fMRI studies. Let $(\mathbf{x}_i)_i$ denote the union of all statistical maps from all datasets. We write $\mathcal{S}$ for the set of all studies, $\mathcal{C}_s$ for the set of conditions of study $s$, $c = \sum_{s \in \mathcal{S}} |\mathcal{C}_s|$ for the total number of conditions, and $I_s$ for the subset of indices corresponding to the samples of study $s$. For each study $s$, we estimate the parameters $\mathbf{W}_s$ of the classification problem defined above. Adapting the multi-task learning framework of (ando_framework_2005), we constrain the weights to share a common latent structure: namely, we fix a latent dimension $l$ and enforce that, for all datasets $s$,

$$\mathbf{W}_s = \mathbf{W}\,\mathbf{U}_s, \quad (2)$$

where the matrix $\mathbf{W}$ in $\mathbb{R}^{p \times l}$ is shared across datasets, and the $\mathbf{U}_s \in \mathbb{R}^{l \times |\mathcal{C}_s|}$ are dataset-specific classification matrices operating on an $l$-dimensional input space. Intuitively, $\mathbf{W}$ should be a “consensus” projection matrix that projects every sample from every dataset onto a lower-dimensional representation in $\mathbb{R}^l$ that is easy to label correctly.
The latent dimension $l$ may be chosen larger than the number of conditions in each study. In this case, regularization is necessary to ensure that the factorization (2) is indeed useful, i.e., that the multi-dataset classification problem does not reduce to separate multinomial regressions on each dataset. To regularize our model, we apply Dropout (srivastava_dropout:_2014) to the projected data representation. Namely, during successive training iterations, we set a random fraction of the reduced data features to zero. This prevents the co-adaptation of the study-specific matrices $\mathbf{U}_s$ and the shared projection $\mathbf{W}$, and ensures that every direction of $\mathbf{W}$ is useful for classifying every dataset. Formally, Dropout amounts to sampling binary diagonal matrices $\mathbf{M}$ in $\mathbb{R}^{l \times l}$ during training, with Bernoulli-distributed coefficients; for every dataset $s$, $\mathbf{U}_s$ is estimated through the task of classifying the Dropout-corrupted reduced data $\mathbf{M}\mathbf{W}^\top \mathbf{x}_i$.
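A short numpy check makes the point about why regularization is needed when the latent dimension exceeds the number of conditions: without any constraint, the factorization is vacuous, since any unconstrained weight matrix can be represented exactly. All sizes below are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
k, c, l = 30, 8, 12                 # input dim, conditions, latent dim (l > c)

M = rng.normal(size=(k, c))         # an arbitrary unconstrained weight matrix

# Embed M in the first c latent directions, then read them back out:
W = np.zeros((k, l))
W[:, :c] = M
U = np.zeros((l, c))
U[:c, :] = np.eye(c)

assert np.allclose(W @ U, M)        # the factorization imposes no constraint
```

Hence, without Dropout (or another regularizer), nothing forces the latent directions to be shared or meaningful.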
In practice, the matrices $\mathbf{W}$ and $(\mathbf{U}_s)_s$ are learned by jointly minimizing the following expected risk, where the objective is the sum of the single-study cross-entropies, averaged over the Dropout noise:

$$\min_{\mathbf{W},\,(\mathbf{U}_s)_s} \;\sum_{s \in \mathcal{S}} \frac{1}{|I_s|} \sum_{i \in I_s} \mathbb{E}_{\mathbf{M}}\!\left[\ell\big(y_i,\; \mathbf{U}_s^\top \mathbf{M}\,\mathbf{W}^\top \mathbf{x}_i\big)\right], \quad (3)$$

where $\ell$ is the cross-entropy loss applied after the softmax of (1), $\mathbf{W}$ is the shared projection, $\mathbf{U}_s$ the study-specific classification matrix, and $\mathbf{M}$ a Dropout mask.
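The factored multi-study classifier with latent-space Dropout can be sketched as a small PyTorch module. This is a minimal sketch under assumed toy sizes (the class name, dimensions, and condition counts are illustrative, not the authors' actual code):

```python
import torch
import torch.nn as nn

class FactoredClassifier(nn.Module):
    """Shared projection plus one softmax head per study (illustrative sketch)."""
    def __init__(self, input_dim, latent_dim, conditions_per_study, dropout_rate=0.5):
        super().__init__()
        # shared projection W, common to all studies
        self.shared = nn.Linear(input_dim, latent_dim, bias=False)
        # Dropout acts on the latent representation, as in the text
        self.dropout = nn.Dropout(dropout_rate)
        # one classification head U_s per study
        self.heads = nn.ModuleList(
            [nn.Linear(latent_dim, c) for c in conditions_per_study])

    def forward(self, x, study):
        z = self.dropout(self.shared(x))   # Dropout-corrupted latent code
        return self.heads[study](z)        # study-specific logits

model = FactoredClassifier(input_dim=336, latent_dim=100,
                           conditions_per_study=[23, 30, 19])
logits = model(torch.randn(8, 336), study=0)   # a mini-batch from study 0
```

Feeding `logits` to `nn.CrossEntropyLoss` (which applies the softmax internally) gives one term of the sum in (3).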
Imposing a common structure on the classification matrices is natural, as the classes to be distinguished do share some common neural organization — brain maps have a correlated spatial structure, while the psychological conditions of the different datasets may trigger shared cognitive primitives underlying human cognition (rubin_generalized_2016; bzdok_semisupervised_2015). With our design, we aim at learning a shared matrix that captures these common aspects and thus benefits the generalization performance of all the classifiers. As this matrix is estimated from data, brain maps from one study are enriched by the maps from all the other studies, even if the conditions to be classified are not shared among studies. In so doing, our modeling approach allows transfer learning among all the classification tasks.
Unfortunately, estimators obtained by solving (3) may have limited generalization performance, as the number of available samples remains relatively small compared to the number of parameters. We address this issue by performing an initial dimension reduction that captures the spatial structure of brain maps.
1.3 Initial dimension reduction using localized rest-fMRI activity patterns
The projection expressed by the shared matrix $\mathbf{W}$ alone ignores the signal structure of statistical brain maps. Taking this structure into account should make it possible to reduce the dimensionality of the data with little signal loss, and possibly with the additional benefit of a denoising effect. Several recent studies (blumensath_spatially_2013) in the brain-imaging domain suggest using fMRI data acquired in experiment-free studies for such dimension reduction. For this reason, we introduce a first dimension reduction that is estimated not from the statistical maps, but from resting-state data. Formally, we enforce $\mathbf{W} = \mathbf{G}\,\mathbf{W}'$, where $\mathbf{G} \in \mathbb{R}^{p \times k}$ (with $k \ll p$) and $\mathbf{W}' \in \mathbb{R}^{k \times l}$. Intuitively, multiplication by $\mathbf{G}^\top$ should summarize the spatial distribution of brain maps, while multiplication by $\mathbf{W}'$, which is estimated by solving (3), should find low-dimensional representations able to capture cognitive features. $\mathbf{W}'$ is now of reasonable size ($k \times l$): solving (3) should yield estimates with better generalization performance. Defining an appropriate matrix $\mathbf{G}$ is the purpose of the next paragraphs.
Resting-state decomposition.
The initial dimension reduction expresses statistical brain maps in terms of their relative contributions over what neuroscience investigators commonly interpret as functional networks. We discover such macroscopic brain networks by performing a sparse matrix factorization over the massive resting-state dataset provided in the HCP900 release (vanessen_human_2012): such a decomposition technique, described e.g. in (mensch_dictionary_2016; mensch_stochastic_2017), efficiently provides (i.e., in the order of a few hours) a given number $k$ of sparse spatial maps that decompose the resting-state signal with good reconstruction performance. That is, it finds a sparse and positive dictionary $\mathbf{D}$ in $\mathbb{R}^{k \times p}$ and loadings $\mathbf{A}$ such that the resting-state brain images are well approximated by $\mathbf{A}\mathbf{D}$. $\mathbf{D}$ is thus a set of slightly overlapping networks — each voxel belongs to at most two networks. To maximally preserve Euclidean distances when performing the reduction, we perform an orthogonal projection onto the span of the components, which amounts to setting $\mathbf{G} = \mathbf{D}^\dagger$, the Moore–Penrose pseudo-inverse of $\mathbf{D}$. Replacing $\mathbf{W}$ by $\mathbf{G}\mathbf{W}'$ in (3), we obtain the reduced expected risk minimization problem, where the input dimension is now the number $k$ of dictionary components:
$$\min_{\mathbf{W}',\,(\mathbf{U}_s)_s} \;\sum_{s \in \mathcal{S}} \frac{1}{|I_s|} \sum_{i \in I_s} \mathbb{E}_{\mathbf{M}}\!\left[\ell\big(y_i,\; \mathbf{U}_s^\top \mathbf{M}\,\mathbf{W}'^\top \mathbf{G}^\top \mathbf{x}_i\big)\right]. \quad (4)$$
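The first-layer reduction, projecting each map onto the dictionary through the pseudo-inverse, can be sketched in numpy. The dictionary below is a random toy stand-in for the sparse resting-state components:

```python
import numpy as np

rng = np.random.default_rng(0)
k, p = 10, 500                          # toy sizes: 10 components, 500 voxels

# a sparse, positive toy dictionary D (k x p)
D = np.abs(rng.normal(size=(k, p)))
D[D < 1.0] = 0.0                        # crude sparsification, for illustration

# orthogonal projection onto the span of the components: G is the pseudo-inverse
G = np.linalg.pinv(D)                   # p x k
x = rng.normal(size=p)                  # one statistical map
loading = x @ G                         # its k-dimensional representation

# a map already in the span of D is represented losslessly
a = rng.normal(size=k)
recovered = (a @ D) @ G
```

Because `D` has full row rank here, `D @ G` is the identity, so `recovered` equals `a`: the reduction loses nothing on signals that live on the networks.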
Multi-scale projection.
Selecting the “best” number of brain networks is an ill-posed problem (eickhoff_connectivitybased_2015): the size of the functional networks that will prove relevant for condition classification is unknown to the investigator. To address this issue, we propose to reduce high-resolution data in a multi-scale fashion: we initially extract three sparse spatial dictionaries $\mathbf{D}_1$, $\mathbf{D}_2$, $\mathbf{D}_3$ with increasing numbers of components $k_1 < k_2 < k_3$. Then, we project the statistical maps onto each of the dictionaries and concatenate the loadings, in a process analogous to projecting onto an overcomplete dictionary in computer vision (e.g., mallat_matching_1993). This amounts to defining the matrix $\mathbf{G}$ as the concatenation

$$\mathbf{G} = \left[\mathbf{D}_1^\dagger,\; \mathbf{D}_2^\dagger,\; \mathbf{D}_3^\dagger\right] \in \mathbb{R}^{p \times (k_1 + k_2 + k_3)}. \quad (5)$$
With this definition, the reduced data carry information about network activations at different scales. This makes the classification maps learned by the model more regular than when using a single-scale dictionary, and indeed yields more interpretable classification maps. However, it brings only a small improvement in terms of predictive accuracy compared to using a single dictionary at the finest scale. We further discuss multi-scale decomposition in Appendix A.2.
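The multi-scale concatenation of Eq. (5) is a one-liner. The component counts below are illustrative toy values, not necessarily those used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 500                                  # toy voxel count
sizes = (16, 64, 256)                    # illustrative per-scale component counts

# one toy dictionary per scale (the real ones are sparse resting-state maps)
dictionaries = [np.abs(rng.normal(size=(k, p))) for k in sizes]

# Eq. (5): concatenate the per-scale projections into a single matrix G
G = np.concatenate([np.linalg.pinv(D) for D in dictionaries], axis=1)

x = rng.normal(size=p)
multiscale_loading = x @ G               # dimension 16 + 64 + 256 = 336
```

Each statistical map is thus described jointly by coarse, medium and fine network loadings.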
1.4 Training with stochastic gradient descent
As illustrated in Figure 1, our model may be interpreted as a three-layer neural network with linear activations and several readout heads, each corresponding to a specific dataset. The model can be trained using stochastic gradient descent, following a previously employed alternated training scheme (collobert_unified_2008): we cycle through the datasets and select, at each iteration, a mini-batch of samples from the current dataset, with the same batch size for all datasets. We then perform a gradient step: the shared projection $\mathbf{W}'$ and the parameters of the current head are updated, while the others are left unchanged. The optimizer thus sees the same number of samples from each dataset, and the expected stochastic gradient is the gradient of (4), so that the empirical risk decreases in expectation and we find a critical point of (4) asymptotically. We use the Adam solver (kingma_adam:_2014), a flavor of stochastic gradient descent, as it allows faster convergence.

Computational cost.
Training the model on projected data takes about 10 minutes on a conventional single-CPU machine with an Intel Xeon 3.21GHz. The initial step of computing the dictionaries from all HCP900 resting-state records (4TB of data) takes 5 hours using (mensch_stochastic_2017), while transforming the data from all the studies with the projection takes around 1 hour. Adding a new dataset with 30 subjects to our model and performing the joint training takes no more than a few minutes. This is much less than the cost of fitting a first-level GLM on each subject of this dataset.
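The alternated training scheme described above can be sketched as a short PyTorch loop. Dimensions, condition counts and iteration counts are toy values; the structure (cycle through studies, one fixed-size mini-batch per step, Adam updates touching only the shared layer and the current head) is what matters:

```python
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, latent = 40, 20                           # toy dimensions
conditions = (5, 7, 3)                               # toy condition counts

shared = nn.Linear(input_dim, latent, bias=False)    # shared projection W'
heads = [nn.Linear(latent, c) for c in conditions]   # one head U_s per study
data = [(torch.randn(120, input_dim), torch.randint(c, (120,)))
        for c in conditions]                         # toy datasets

params = list(shared.parameters()) + [p for h in heads for p in h.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# cycle through studies, drawing one fixed-size mini-batch per iteration
for step, s in zip(range(90), itertools.cycle(range(len(heads)))):
    X, y = data[s]
    idx = torch.randint(len(X), (16,))
    loss = loss_fn(heads[s](shared(X[idx])), y[idx])
    optimizer.zero_grad()
    loss.backward()     # gradients reach only the shared layer and head s
    optimizer.step()
```

Because every study contributes the same number of mini-batches, the expected stochastic gradient matches the gradient of the joint objective.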
2 Experiments
We characterize the behavior and performance of our model on several large, publicly available brain-imaging datasets. First, to validate the relevance of every element of our model, we perform an ablation study. It shows that the multi-scale spatial dimension reduction and the use of multi-dataset classification substantially improve classification performance, and suggests that the proposed model captures an interesting new latent structure of brain images. We further illustrate the effect of transfer learning by systematically varying the number of subjects in a single dataset: we show how multi-dataset learning helps mitigate the decrease in accuracy due to smaller train sizes — a result of much use for analyzing cognitive experiments on small cohorts. Finally, we illustrate the interpretability of our model and show how the latent “cognitive space” can be explored to uncover template brain maps associated with related conditions in different datasets.
2.1 Datasets and tools
Datasets.
Our experimental study features five publicly available task fMRI studies. We use all resting-state records from the HCP900 release (vanessen_human_2012) to compute the sparse dictionaries used in the first dimension reduction. We succinctly describe the conditions of each dataset — we refer the reader to the original publications for further details.

HCP: gambling, working memory, motor, language, social and relational tasks. 800 subjects.

Archi (pinel_fast_2007): localizer protocol, motor, social and relational task. 79 subjects.

Brainomics (papadopoulos_orfanos_brainomics/localizer_2017): localizer protocol. 98 subjects.

Camcan (shafto_cambridge_2014): audio-video task, with frequency variation. 606 subjects.

LA5c consortium (poldrack_phenomewide_2016): task-switching, balloon analog risk taking, stop-signal and spatial working memory capacity tasks — high-level tasks. 200 subjects.
The last four datasets are target datasets, on which we measure out-of-sample prediction performance. The larger HCP dataset serves as a knowledge-transferring dataset, which should boost this performance when included in the multi-dataset model. We register the task time series in the reference MNI space before fitting a general linear model (GLM) and computing the maps (standardized by z-scoring) associated with each base condition — no manual design of contrasts is involved. More details on the pipeline used for z-map extraction are provided in Appendix A.1.
Tools.
We use pytorch (http://pytorch.org/) to define and train the proposed models, nilearn (abraham_machine_2014) to handle brain datasets, and scikit-learn (pedregosa_scikitlearn:_2011) to design the experimental pipelines. Sparse brain decompositions were computed from the whole HCP900 resting-state data. The code for reproducing the experiments is available at http://github.com/arthurmensch/cogspaces. Our model involves a few non-critical hyperparameters: the mini-batch size, the latent dimension $l$, and the Dropout rate applied in the latent cognitive space, where the chosen rate performs slightly better than lower values. We use a multi-scale dictionary, as it yields the best quantitative and qualitative results (using only the largest single-scale dictionary yields comparable predictive accuracy; quantitatively, the multi-scale approach is most beneficial when using dictionaries with fewer components — see Appendix A.2 for a quantitative validation of the multi-scale approach). Finally, test accuracy is measured on half of the subjects of each dataset, which are removed from the training sets beforehand. Benchmarks are repeated 20 times with random split folds to estimate the variance in performance.

2.2 Dimension reduction and transfer improve test accuracy
For the four benchmark studies, the proposed model brings between +1.3% and +13.4% extra test accuracy compared to simple multinomial classification. To quantify which aspects of the model improve performance, we perform an ablation study: we measure the prediction accuracy of six models, from the simplest to the most complete model described in Section 1. The first three experiments study the effect of the initial dimension reduction and of regularization (for these models, the $\ell_2$ and Dropout regularization parameters are estimated by nested cross-validation). The last three experiments measure the performance of the proposed factored model and the effect of multi-dataset classification.

Baseline $\ell_2$-penalized multinomial classification, where we predict the condition from the voxel-level map directly.

Multinomial classification after projection onto a dictionary, i.e., predicting the condition from the dictionary loadings.

Same as experiment 2, using Dropout noise on the projected data.

Factored model in the single-study case: solving (4) with the target study only.

Factored model in a two-study case: using the target study alongside HCP.

Factored model in the multi-study case: using the target study alongside all other studies.
The results are summarized in Figure 2. On average, both dimension reductions (the spatial projection and the latent cognitive projection) are beneficial to generalization performance. Using many datasets for prediction brings a further increase in performance, providing evidence of transfer learning between datasets.
In detail, the comparison between experiments 1, 2 and 3 confirms that projecting brain images onto functional networks of interest is a good strategy for capturing cognitive information (bzdok_semisupervised_2015; blumensath_spatially_2013). Note that, in addition to improving the statistical properties of the estimators, the projection drastically reduces the computational cost of training our full model. Experiments 2 and 3 measure the impact of the regularization method without learning a further latent projection. Using Dropout on the input space performs consistently better than $\ell_2$ regularization; this can be explained in view of (wager_dropout_2013), which interprets input-Dropout as an adaptive regularization on the natural model parametrization.
Experiment 4 shows that Dropout regularization becomes much more powerful when learning a second dimension reduction, i.e., when solving problem (4). Even when using a single study for learning, we observe a significant improvement in performance on three out of four datasets. Learning a latent-space projection together with Dropout-based data augmentation in this space is thus a much better regularization strategy than simple $\ell_2$ or input-Dropout regularization.
Finally, the comparison between experiments 4, 5 and 6 exhibits the expected transfer effect. On three out of four target studies, learning the shared projection matrix using several datasets leads to an accuracy gain that is consistent across folds. The more datasets are used, the higher the accuracy gain — and, as we show next, this gain increases with smaller train sizes. Jointly classifying images from several datasets thus brings extra information to the cognitive model, which allows finding better representative brain maps for the target study. In particular, we conjecture that the large number of subjects in HCP helps modeling inter-subject noise. On the other hand, we observe a negative transfer effect on LA5c, as the tasks of this dataset share few cognitive aspects with the tasks of the other datasets. This encourages us to use richer dataset repositories for further improvement.
2.3 Transfer learning is very effective on small datasets
To further demonstrate the benefits of the multi-dataset model, we vary the size of the target datasets (Archi, Brainomics and CamCan) and compare the performance of the single-study model with the model that aggregates the Archi, Brainomics, CamCan and HCP studies. Figure 3 shows that the effect of transfer learning increases as we reduce the training size of the target dataset. This suggests that the learned data embedding does capture some universal cognitive information and can be learned from different data sources. As a consequence, aggregating a larger study mitigates the small number of training samples in the target dataset. With only 5 subjects, transfer brings a clear accuracy gain on Archi, Brainomics and CamCan alike. Multi-study learning should thus prove very useful for classifying conditions in studies with ten or so subjects, which are still very common in neuroimaging.
2.4 Introspecting classification maps
At prediction time, our multi-dataset model can be collapsed into one multinomial model per dataset, each classifying with the composition of its projection and classification matrices. As in the linear models classically used for decoding, the model weights for each condition can be represented as a brain map. Figure 4 shows the maps associated with digit computation and face viewing for the Archi dataset, comparing models 2, 4 and 5 from the ablation study. Although it is hard to assess the intrinsic quality of the maps, we can see that introducing the second projection layer and the multi-study problem formulation (here, appending the HCP dataset) yields maps with more weight on the high-level functional regions known to be specific to the task: for face viewing, the FFA stands out more compared to the primary visual cortices; for calculations, the weights of the intraparietal sulci become left-lateralized, as has been reported for symbolic number processing (bugden_role_2012).
2.5 Exploring the latent space
Within our model, classification is performed in the same low-dimensional latent space for all datasets, which is learned during training. To further show that this space captures cognitive information, we extract from it template brain images associated with general cognitive concepts. Fitting our model on the Archi, Brainomics, CamCan and HCP studies, we extract representative vectors of the latent space with a k-means clustering over the projected data, and consider the centroids of the clusters. Each centroid can be associated with a brain image that lies in the span of the dictionary components. In doing so, we go backward through the model and obtain a representative brain image for each centroid, with well-delineated spatial regions. Going forward, we compute the classification probability vectors of each centroid for each study. Together, these probability vectors indicate the cognitive functions that the centroid captures. Figure 5 represents six template images together with their probability vectors, shown as word clouds. We clearly obtain interpretable pairs of brain images and cognitive concepts. These pairs capture, across datasets, clusters of experimental conditions with similar brain representations.

3 Discussion
We compare our model to a previously proposed formulation for brain image classification. We show how our model differs from convex multi-task learning, and stress the importance of Dropout.
Task fMRI classification.
Our model is related to a previous semi-supervised classification model (bzdok_semisupervised_2015) that also performs multinomial classification of conditions in a low-dimensional space: the dimension reduction it proposes is the equivalent of our first projection. Our approach differs in two aspects. First, we replace the initial semi-supervised dimension reduction with an unsupervised analysis of resting-state data, using a much more tractable approach that we have shown to preserve cognitive signals. Second, we introduce the additional cognition-aware projection, learned on multiple studies. It substantially improves out-of-sample prediction performance, especially on small datasets, and above all allows uncovering a cognition-aware latent space, as shown in our experiments.
Convex multitask learning.
Due to the Dropout regularization and the fact that the latent dimension is allowed to be larger than the number of conditions, our formulation differs from the classical approach (srebro_maximummargin_2004) to the multi-task problem, which would estimate the classification matrices by solving a convex empirical risk minimization problem with a trace-norm penalty that encourages the stacked weight matrix to be low-rank. We tested this formulation; it does not perform better than the explicit factorization formulation with Dropout regularization. Trace-norm regularized regression has the further drawback of being slower to train, as it typically operates with full gradients, e.g., using FISTA (beck_fast_2009). In contrast, the non-convex explicit factorization model is easily amenable to large-scale stochastic optimization — hence our focus.
Importance of Dropout.
The use of Dropout regularization is crucial in our model. Without Dropout, in the single-study case with the latent dimension larger than the number of conditions, solving the factored problem (4) yields a solution that is worse in terms of empirical risk than solving the simple multinomial problem on the reduced data, which finds a global minimizer of (4). Yet, Figure 2 shows that the model enriched with a latent space (red) has better test accuracy than the simple model (orange), thanks to the Dropout noise applied to the latent-space representation of the input data. Dropout is thus a promising new way of regularizing fMRI models.
4 Conclusion
We proposed and characterized a novel cognitive neuroimaging modeling scheme that blends latent factor discovery and transfer learning. It can be applied jointly to many different cognitive studies without requiring explicit correspondences between the cognitive tasks. The model helps identify the fundamental building blocks underlying the diversity of cognitive processes that the human mind can realize. It produces a basis of cognitive processes whose generalization power is validated quantitatively, and extracts representations of brain activity that ground the transfer of knowledge from existing fMRI repositories to newly acquired task data. The captured cognitive representations will improve as we provide the model with a growing number of studies and cognitive conditions.
5 Acknowledgments
This project has received funding from the European Union's Horizon 2020 Framework Programme for Research and Innovation under grant agreement No. 720270 (Human Brain Project SGA1). Julien Mairal was supported by the ERC grant SOLARIS (No. 714381) and a grant from ANR (MACARON project ANR14CE23000301). We thank Olivier Grisel for his most helpful insights.
References
Appendix A Supplementary material
a.1 Data processing pipeline
Resting-state fMRI.
To extract sparse dictionaries, we perform dictionary learning on resting-state time series from the 900 subjects of the HCP900 dataset. These time series are provided already preprocessed and aligned in MNI space. We apply a smoothing kernel of size 4 mm to the data. We then use the modl library (http://github.com/arthurmensch/modl) to extract dictionaries, streaming the records (in Nifti format) to an online algorithm. We run the algorithm on a grid of regularization parameters that control sparsity, and select the sparsest dictionaries that cover the whole brain. The dictionaries have been uploaded to our reproduction repository for potential further use.
Task fMRI.
To construct the task fMRI datasets (HCP, Archi, Brainomics, Camcan, LA5c), we proceed as follows. We use SPM (via the pypreprocess library, http://github.com/neurospin/pypreprocess) for standard preprocessing of the task fMRI time series: motion correction, slice-time interpolation and common registration to MNI space using the DARTEL algorithm (ashburner2007fast). We use a smoothing kernel of size 4 mm. We then fit a first-level general linear model to each subject and use the regression results to obtain one z-statistic map per base condition (a.k.a. explanatory variable) and per record. This pipeline can be streamlined for fMRI datasets provided in BIDS format (gorgolewski_brain_2016), using the nistats library (http://github.com/nistats/nistats).

a.2 Multi-scale dictionaries
To further investigate the interest of multi-scale projection onto resting-state dictionaries, we enforce a more aggressive geometric dimension reduction than in the main text; that is, we use dictionaries with fewer components than in Section 2. This is typically a way to improve the interpretability of the model at the cost of sacrificing a little out-of-sample accuracy. It is also advisable to restrict the size of the dictionary when using a smaller resting-state dataset than HCP — large dictionaries tend to overfit small datasets.
In practice, we compare the dimension reduction using a single small dictionary to the dimension reduction using the same dictionary together with two larger ones. The results are presented in Figure 6. We observe that the multi-scale projection performs systematically better than the single-scale projection. Note that the difference in performance between the two sets of models is significant, as it is consistent across all folds. The multi-scale approach is thus quantitatively useful when using relatively small dictionaries from resting-state data.