1
Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model highdimensional datasets, e.g. multivariate analysis of activation images or restingstate time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learning can uncover hidden structures in sets of images (e.g. resting state functional MRI) or find subpopulations in large cohorts. By considering different functional neuroimaging applications, we illustrate how scikitlearn, a Python machine learning library, can be used to perform some key analysis steps. Scikitlearn contains a very large set of statistical learning algorithms, both supervised and unsupervised, and its application to neuroimaging data provides a versatile tool to study the brain.
2 Keywords:
machine learning, statistical learning, neuroimaging, scikitlearn, Python
2 Keywords:
machine learning, statistical learning, neuroimaging, scikitlearn, Python
3 Introduction
Interest in applying statistical machine learning to neuroimaging data analysis is growing. Neuroscientists use it as a powerful, albeit complex, tool for statistical inference. The tools are developed by computer scientists who may lack a deep understanding of the neuroscience questions. This paper aims to fill the gap between machine learning and neuroimaging by demonstrating how a generalpurpose machinelearning toolbox, scikitlearn, can provide stateoftheart methods for neuroimaging analysis while keeping the code simple and understandable by both worlds. Here, we focus on software; for a more conceptual introduction to machine learning methods in fMRI analysis, see pereira2009 or mur2009, while hastie2001 provides a good reference on machine learning. We discuss the use of the scikitlearn toolkit as it is a reference machine learning tool and has and a variety of algorithms that is matched by few packages, but also because it is implemented in Python, and thus dovetails nicely in the rich neuroimaging Python ecosystem.
This paper explores a few applications of statistical learning to resolve common neuroimaging needs, detailing the corresponding code, the choice of the methods, and the underlying assumptions. We discuss not only prediction scores, but also the interpretability of the results, which leads us to explore the internal model of various methods. Importantly, the GitHub repository of the paper^{1}^{1}1http://www.github.com/AlexandreAbraham/frontiers2013 provides complete scripts to generate figures. The scope of this paper is not to present a neuroimagingspecific library, but rather code patterns related to scikitlearn. However, the nilearn library –http://nilearn.github.io– is a software package under development that seeks to simplify the use of scikitlearn for neuroimaging. Rather than relying on an immature and blackbox library, we prefer here to unravel simple and didactic examples of code that enable readers to build their own analysis strategies.
The paper is organized as follows. After introducing the scikitlearn toolbox, we show how to prepare the data to apply scikitlearn routines. Then we describe the application of supervised learning techniques to learn the links between brain images and stimuli. Finally we demonstrate how unsupervised learning techniques can extract useful structure from the images.
4 Our tools: scikitlearn and the Python ecosystem
4.1 Basic scientific Python tools for the neuroimager
With its mature scientific stack, Python is a growing contender in the landscape of neuroimaging data analysis with tools such as Nipy (millman2007analysis) or Nipype (gorgolewski2011). The scientific Python libraries used in this paper are:

NumPy. Provides the
ndarray
data type to python, an efﬁcient dimensional data representation for arraybased numerical computation, similar to that used in Matlab (vanderwalt2011). It handles efficient array persistance (input and output) and provides basic operations such as dot product. Most scientific Python libraries, including scikitlearn, use NumPy arrays as input and output data type. 
SciPy: higher level mathematical functions that operate on ndarrays for a variety of domains including linear algebra, optimization and signal processing. SciPy is linked to compiled libraries to ensure high performances (BLAS, Arpack and MKL for linear algebra and mathematical operations). Together, NumPy and SciPy provide a robust scientific environment for numerical computing and they are the elementary bricks that we use in all our algorithms.

Matplotlib, a plotting library tightly integrated into the scientific Python stack (hunter2007). It offers publicationquality figures in different formats and is used to generate the figures in this paper.

Nibabel, to access data in neuroimaging file formats. We use it at the beginning of all our scripts.
4.2 Scikitlearn and the machine learning ecosystem
Scikitlearn (pedregosa2011) is a general purpose machine learning library written in Python. It provides efficient implementations of stateoftheart algorithms, accessible to nonmachine learning experts, and reusable across scientific disciplines and application fields. It also takes advantage of Python interactivity and modularity to supply fast and easy prototyping. There is a variety of other learning packages. For instance, in Python, PyBrain (schaul2010pybrain)
is best at neural networks and reinforcement learning approaches, but its models are fairly black box, and do not match our need to interpret the results. Beyond Python, Weka
(hall2009weka) is a rich machine learning framework written in Java, however it is more oriented toward data mining.Some higher level frameworks provides full pipeline to apply machine learning techniques to neuroimaging. PyMVPA (hanke2009pymvpa) is a Python packaging that does data preparation, loading and analysis, as well as result visualization. It performs multivariate pattern analysis and can make use of external tools such as R, scikitlearn or Shogun (sonnenburg2010). PRoNTo (schrouff2013pronto) is written in Matlab and can easily interface with SPM but does not propose many machine learning algorithms. Here, rather than fullblown neuroimaging analysis pipelines, we discuss lowerlevel patterns that break down how neuroimaging data is input to scikitlearn and processed with it. Indeed, the breadth of machine learning techniques in scikitlearn and the variety of possible applications are too wide to be fully exposed in a highlevel interface. Note that a package like PyMVPA that can rely on scikitlearn for neuroimaging data analysis implements similar patterns behind its highlevel interface.
4.3 Scikitlearn concepts
In scikitlearn, all objects and algorithms accept input data in the form of 2dimensional arrays of size samples features. This convention makes it generic and domainindependent. Scikitlearn objects share a uniform set of methods that depends on their purpose: estimators can fit models from data, predictors can make predictions on new data and transformers convert data from one representation to another.

Estimator. The estimator interface, the core of the library, exposes a fit
method for learning model parameters from training data. All supervised and unsupervised learning algorithms (e.g., for classification, regression or clustering) are available as objects implementing this interface. Machine learning tasks such as feature selection or dimensionality reduction are also provided as estimators.

Predictor. A predictor is an estimator with a predict method that takes an input array X_test and makes predictions for each sample in it. We denote this input parameter “X_test” in order to emphasize that predict generalizes to new data. In the case of supervised learning estimators, this method typically returns the predicted labels or values computed from the estimated model.

Transformer. As it is common to modify or filter data before feeding it to a learning algorithm, some estimators, named transformers, implement a transform method. Preprocessing, feature selection and dimensionality reduction algorithms are all provided as transformers within the library. If the transformation can be inverted, a method called inverse_transform also exists.
When testing an estimator or setting hyperparameters, one needs a reliable metric to evaluate its performance. Using the same data for training and testing is not acceptable because it leads to overly confident model performance, a phenomenon also known as
overfitting. Crossvalidation is a technique that allows one to reliably evaluate an estimator on a given dataset. It consists in iteratively fitting the estimator on a fraction of the data, called training set, and testing it on the leftout unseen data, called test set. Several strategies exists to partition the data. For example, fold crossvalidation consists in dividing (randomly or not) the samples in subsets: each subset is then used once as testing set while the others subsets are used to train the estimator. This is one of the simplest and most widely used crossvalidation strategies. The parameter is commonly set to 5 or 10. Another strategy, sometimes called MonteCarlo crossvalidation, uses many random partitions in the data.For a given model and some fixed value of hyperparameters, the scores on the various test sets can be averaged to give a quantitative score to assess how good the model is. Maximizing this crossvalidation score offers a principled way to set hyperparameters and allows to choose between different models. This procedure is known as model selection. In scikitlearn, hyperparameters tuning can be conviently done with the GridSearchCV estimator. It takes as input an estimator and a set of candidate hyperparameters. Crossvalidation scores are then computed for all hyperparameters combinations, possibly in parallel, in order to find the best one. In this paper, we set the regularization coefficient with grid search in section 7.
5 Data preparation: from MR volumes to a data matrix
Before applying statistical learning to neuroimaging data, standard preprocessing must be applied. For fMRI, this includes motion correction, slice timing correction, coregistration with an anatomical image and normalization to a common template like the MNI (Montreal Neurologic Institute) one if necessary. Reference softwares for these tasks are SPM (friston2007) and FSL (smith2004). A Python interface to these tools is available in nipype Python library (gorgolewski2011). Below we discuss shaping preprocessed data into a format that can be fed to scikitlearn. For the machine learning settings, we need a data matrix, that we will denote , and optionally a target variable to predict, .
5.1 Spatial resampling
Neuroimaging data often come as Nifti files, 4dimensional data (3D scans with time series at each location or voxel) along with a transformation matrix (called affine) used to compute voxel locations from array indices to world coordinates. When working with several subjects, each individual data is registered on a common template (MNI, Talairach…), hence on a common affine, during preprocessing.
Affine matrix can express data anisotropy, when the distance between two voxels is not the same depending on the direction. This information is used by algorithms relying on the spatial structure of the data, for instance the Searchlight.
SciPy routine scipy.ndimage.affine_transform can be used to perform image resampling: changing the spatial resolution of the data^{2}^{2}2An easytouse implementation is proposed in nilearn
. This is an interpolation and alters the data, that is why it should be used carefully. Downsampling is commonly used to reduce the size of data to process. Typical sizes are 2mm or 3mm resolution, but scan spatial resolution is increasing with progress in MR physics. The affine matrix can encode the scaling factors for each direction.
5.2 Signal cleaning
Due to its complex and indirect acquisition process, neuroimaging data often have a low signaltonoise ratio. They contain trends and artifacts that must be removed to ensure maximum machine learning algorithms efficiency. Signal cleaning includes:

Detrending removes a linear trend over the time series of each voxel. This is a useful step when studying fMRI data, as the voxel intensity itself has no meaning and we want to study its variation and correlation with other voxels. Detrending can be done thanks to SciPy (scipy.signal.detrend).

Normalization
consists in setting the timeseries variance to 1. This harmonization is necessary as some machine learning algorithms are sensible to different value ranges.

Frequency filtering
consists in removing high or low frequency signals. Lowfrequency signals in fMRI data are caused by physiological mechanisms or scanner drifts. Filtering can be done thanks to a Fourier transform (
scipy.fftpack.fft) or a Butterworth filter (scipy.signal.butter).
5.3 From 4dimensional images to 2dimensional array: masking
Neuroimaging data are represented in 4 dimensions: 3 spatial dimensions, and one dimension to index time or trials. Scikitlearn algorithms, on the other hand, only accept 2dimensional samples features matrices (see Section 4.3
). Depending on the setting, voxels and time series can be considered as features or samples. For example, in spatial independent component analysis (ICA), voxels are samples.
The reduction process from 4Dimages to feature vectors comes with the loss of spatial structure. It however allows to discard uninformative voxels, such as the ones outside of the brain. Such voxels that only carry noise and scanner artifacts would reduce SNR and affect the quality of the estimation. The selected voxels form a
brain mask. Such a mask is often given along with the datasets or can be computed with software tools such as FSL or SPM.Applying the mask is made easy by NumPy advanced indexing using boolean arrays. Twodimensional masked data will be referred to as X to follow scikitlearn conventions:
5.4 Data visualisation
Across all our examples, voxels of interest are represented on an axial slice of the brain. Some transformations of the original matrix data are required to match matplotlib data format. The following snippet of code shows how to load and display an axial slice overlaid with an activation map. The background is an anatomical scan and its highest voxels are used as synthetic activations.
Note that a background is needed to display partial maps. Overlaying two images can be done thanks to the numpy.ma.masked_array data structure. Several options exist to enhance the overall aspect of the plot. Some of them can be found in the full scripts provided with this paper. It generally boils down to a good knowledge of Matplotlib. Note that the Nipy package provides a plot_map function that is tuned to display activation maps (a background is even provided if needed).
6 Decoding the mental representation of objects in the brain
In the context of neuroimaging, decoding refers to learning a model that predicts behavioral or phenotypic variables from brain imaging data. The alternative that consists in predicting the imaging data given external variables, such as stimuli descriptors, is called encoding (naselaris2011). It is further discussed in the next section.
First, we illustrate decoding with a simplified version of the experiment presented in haxby2001. In the original work, visual stimuli from 8 different categories are presented to 6 subjects during 12 sessions. The goal is to predict the category of the stimulus presented to the subject given the recorded fMRI volumes. This example has already been widely analyzed (hanson2004combinatorial; detre2006multi; otoole2007; hanson2008brain; hanke2009pymvpa) and has become a reference example in matter of decoding. For the sake of simplicity, we restrict the example to one subject and to two categories, faces and houses.
As there is a target variable to predict, this is a supervised learning problem. Here represents the two object categories, a.k.a. classes in machinelearning terms. In such settings, where takes discrete values the learning problem is known as classification, as opposed to regression when the variable can take continuous values, such as age.
6.1 Classification with feature selection and linear SVM
Many classification methods are available in scikitlearn. In this example we chose to combine the use of univariate feature selection and Support Vector Machines (SVM). Such a classification strategy is simple yet efficient when used on neuroimaging data.
After applying a brain mask, the data consist of 40 000 voxels, here the features, for only 1 400 volumes, here the samples. Machine learning with many more features than samples is challenging, due to the socalled curse of dimensionality. Several strategies exist to reduce the number of features. A first one is based on prior neuroscientific knowledge. Here one could restrict the mask to occipital areas, where the visual cortex is located. Feature selection is a second, datadriven, approach that relies on a univariate statistical test for each individual feature. Variables with high individual discriminative power are kept.
Scikitlearn offers a panel of strategies to select features. In supervised learning, the most popular feature selection method is the Ftest. The null hypothesis of this test is that the feature takes the same value independently of the value of
to predict. In scikitlearn,sklearn.feature_selection
proposes a panel
of feature selection strategies. One can choose to take a percentile of the features
(SelectPercentile
), or a fixed number of features (SelectKBest
).
All these objects are implemented as transformers (see
section 4.3).
The code below uses the f_classif
function (ANOVA FTest) along with
the selection of a fixed number of features.
On the reduced feature set, we use a linear SVM classifier,
sklearn.svm.SVC
, to find the hyperplane that maximally separates the samples belonging to the different classes. Classifying a new sample boils down to determining on which side of the hyperplane it lies. With a linear kernel, the separating hyperplane is defined in the input data space and its coefficients can be related to the voxels. Such coefficients can therefore be visualized as an image (after unmasking step described in
5.3) where voxels with high values have more influence on the prediction than the others (see figure 2).6.2 Searchlight
Searchlight (kriegeskorte2006) is a popular algorithm in the neuroimaging community. It runs a predictive model on a spatial neighborhood of each voxel and tests the outofsample prediction performance as proxy measure of the link between the local brain activity and the target behavioral variable. In practice, it entails performing crossvalidation of the model, most often an SVM, on voxels contained in balls centered on each voxel of interest. The procedure implies solving a large number of SVMs and is computationally expensive. Detailing an efficient implementation of this algorithm is beyond the scope of this paper. However, code for searchlight and to generate figure 2 is available in the GitHub repository accompanying the paper.
6.3 Results
Results are shown in figure 2
: first Fscore, that is standard analysis in brain mapping but also the statistic used to select features; second the SVC weights after feature selection and last the Searchlight map. Note that the voxels with larger weights roughly match for all methods and are located in the houseresponsive areas as defined by the original paper. The Searchlight is more expanded and blurry than the other methods as it iterates over a ball around the voxels.
These results match neuroscientific knowledge as they highlight the high level regions of the ventral visual cortex which is known to contain categoryspecific visual areas. While Searchlight only gives a score to each voxel, the SVC can be used afterward to classify unseen brain scans.
Most of the final example script (haxby_decoding.py
on GitHub) is for data loading and result visualization. Only 5 lines are needed to run a scikitlearn classifier. In addition, thanks to the scikitlearn modularity, the SVC can be easily replaced by any other classifier in this example. As all linear models share the same interface, replacing the SVC by another linear model, such as ElasticNet or LogisticRegression, requires changing only one line. Gaussian Naive Bayes is a nonlinear classifier that should perform well in this case, and modifiying display can be done by replacing
coef_ by theta_.7 Encoding brain activity and decoding images
In the previous experiment, the category of a visual stimulus was inferred from brain activity measured in the visual cortex. One can go further by inferring a direct link between the image seen by the subject and the associated fMRI data.
In the experiment of miyawaki2008 several series of binary images are presented to two subjects while activity on the visual cortex is recorded. In the original paper, the training set is composed of random images (where black and white pixels are balanced) while the testing set is composed of structured images containing geometric shapes (square, cross…) and letters. Here, for the sake of simplicity, we consider only the training set and use crossvalidation to obtain scores on unseen data. In the following example, we study the relation between stimuli pixels and brain voxels in both directions: the reconstruction of the visual stimuli from fMRI, which is a decoding task, and the prediction of fMRI data from descriptors of the visual stimuli, which is an encoding task.
7.1 Decoding
In this setting, we want to infer the binary visual stimulus presented to the subject from the recorded fMRI data. As the stimuli are binary, we will treat this problem as a classification problem. This implies that the method presented here cannot be extended asis to natural stimuli described with gray values.
In the original work, miyawaki2008
uses a Bayesian logistic regression promoting sparsity along with a sophisticated multiscale strategy. As one can indeed expect the number of predictive voxels to be limited, we compare the
SVM used above with a logistic regression and a SVM penalized with the norm known to promote sparsity. Thepenalized SVM classifier compared here uses a squarehinge loss while the logistic regression uses a logit function.
value  0.0005  0.001  0.005  0.01  0.05  0.1 

Logistic Regression  0.50 .02  0.50 .02  0.57 .13  0.63 .11  0.70 .12  0.70 .12 
Logistic Regression  0.60 .11  0.61 .12  0.63 .13  0.63 .13  0.64 .13  0.64 .13 
SVM classifier (SVC)  0.50 .06  0.55 .12  0.69 .11  0.71 .12  0.69 .12  0.68 .12 
SVM classifier (SVC)  0.67 .12  0.67 .12  0.67 .12  0.66 .12  0.65 .12  0.65 .12 
Table 1 reports the performance of the different classifiers for various values of C using a 5fold crossvalidation. We first observe that setting the parameter is crucial as performance drops for inappropriate values of C. It is particularly true for regularized models. Both logistic regression and SVM yield similar performances, which is not surprising as they implement similar models.
7.2 Encoding
Given an appropriate model of the stimulus, e.g. one which can provide an approximately linear representation of BOLD activation, an encoding approach allows one to quantify for each voxel to what extent its variability is captured by the model. A popular evaluation method is the predictive score, which uses a prediction on left out data to quantify the decrease in residual norm brought about by fitting a regression function as opposed to fitting a constant. The remaining variance consists of potentially unmodelled, but reproducible signal and spurious noise.
On the Miyawaki dataset, we can observe that mere black and white pixel values can explain a large part of the BOLD variance in many visual voxels. Sticking to the notation that represesents BOLD signal and
the stimulus, we can write an encoding model using the ridge regression estimator:
Note here that the Ridge can be replaced by a Lasso estimator, which can give better prediction performance at the cost of computation time.
7.2.1 Receptive fields
Given the retinotopic structure of early visual areas, it is expected that the voxels well predicted by the presence of a black or white pixel are strongly localized in socalled population receptive fields (prf
). This suggests that only very few stimulus pixels should suffice to explain the activity in each brain voxel of the posterior visual cortex. This information can be exploited by using a sparse linear regression –the Lasso
(tibshirani:96)– to find the receptive fields. Here we use the LassoLarsCV estimator that relies on the LARS algorithm (Efron04leastangle) and crossvalidation to set the Lasso parameter.7.3 Results
Figure 3 gives encoding and decoding results: the relationship between a given image pixel and four voxels of interest in the brain. In decoding settings, Figures 3a and 3c show the classifier’s weights as brain maps for both methods. They both give roughly the same results and we can see that the weights are centered in the V1 and nearby retinotopic areas. Figures 3b and 3d show reconstruction accuracy score using Logistic Regression (LR) and SVM (variable mean_scores in the code above). Both methods give almost identical results. As in the original work (miyawaki2008)
, reconstruction is more accurate in the fovea. This is explained by the higher density of neurons dedicated to foveal representation in the primary visual area.
In encoding settings, figure 3e shows classifiers weights for encoding, that we interpret as receptive fields. We can see that receptive fields of neighboring voxels are neighboring pixels, which is expected from retinotopy: primary visual cortex maps the visual field in a topologically organized manner.
Both encoding and decoding analysis show a link between the selected pixel and brain voxels. In the absence of ground truth, seeing that different methods come to the same conclusion comes as face validity.
8 Restingstate and functional Connectivity analysis
Even in the absence of external behavioral or clinical variable, studying the structure of brain signals can reveal interesting information. Indeed, biswal1995 have shown that brain activation exhibits coherent spatial patterns during rest. These correlated voxel activations form functional networks that are consistent with known taskrelated networks (smith2009).
Biomarkers found via predictive modeling on restingstate fMRI would be particularly useful, as they could be applied to diminished subjects that cannot execute a specific task. Here we use a dataset containing control and ADHD (Attention Disorder Hyperactivity Disorder) patients resting state data (subjects are scanned without giving them any specific task to capture the cerebral background activity).
Resting state fMRI is unlabeled data in the sense that the brain activity at a given instant in time cannot be related to an output variable. In machine learning, this class of problems is known as unsupervised learning. To extract functional networks or regions, we use methods that group together similar voxels by comparing their time series. In neuroimaging, the most popular method is ICA that is the subject of our first example. We then show how to obtained functionallyhomogeneous regions with clustering methods.
8.1 Independent Component Analysis (ICA) to extract networks
ICA is a blind source separation method. Its principle is to separate a multivariate signal into several components by maximizing their nonGaussianity. A typical example is the cocktail party problem where ICA is able to separate voices from several people using signal from microphones located across the room.
8.1.1 ICA in neuroimaging
ICA is the reference method to extract networks from resting state fMRI (kiviniemi2003). Several strategies have been used to syndicate ICA results across several subjects. calhoun2001a propose a dimension reduction (using PCA) followed by a concatenation of timeseries (used in this example). varoquaux2010 use dimension reduction and canonical correlation analysis to aggregate subject data. Melodic (beckmann2004), the ICA tool in the FSL suite, uses a concatenation approach not detailed here.
8.1.2 Application
As data preparation steps, we not only center, but also detrend the time series to avoid capturing linear trends with the ICA. Applying to the resulting time series the FastICA algorithm (Hyvarinen:2000vk) with scikitlearn is straightforward thanks to the transformer concept. The data matrix must be transposed, as we are using spatial ICA, in other words the direction considered as random is that of the voxels and not the time points. The maps obtained capture different components of the signal, including noise components as well as restingstate functional networks. To produce the figures, we extract only 10 components, as we are interested here in exploring only the main signal structures.
8.1.3 Results
On fig. 4 we compare a simple concat ICA as implemented by the code above to more sophisticated multisubject methods, namely Melodic’s concat ICA and CanICA–also implemented using scikitlearn although we do not discuss the code here. We display here only the default mode network as it is a wellknown restingstate network. It is hard to draw conclusions from a single map but, at first sight, it seems that both CanICA and Melodic approaches are less subject to noise and give similar results.
Scikitlearn proposes several other matrix decomposition strategies listed in the module ‘sklearn.decomposition‘. A good alternative to ICA is the dictionary learning that applies a regularization on the extracted components (varoquaux2011). This leads to more sparse and compact components than ICA ones, which are fullbrain and require thresholding.
8.2 Learning functionally homogeneous regions with clustering
From a machine learning perspective, a clustering method aggregates samples into groups (called clusters) maximizing a measure of similarity between samples within each cluster. If we consider voxels of a functional brain image as samples, this measure can be based on functional similarity, leading to clusters of voxels that form functionally homogeneous regions (thirion2006).
8.2.1 Approaches
Several clustering approaches exists, each one having its own pros and cons. Most require setting the number of clusters extracted. This choice depends on the application: a large number of clusters will give a more finegrained description of the data, with a higher fidelity to the original signal, but also a higher model complexity. Some clustering approaches can make use of spatial information and yield spatially contiguous clusters, i.e. parcels. Here we will describe two clustering approaches that are simple and fast.
Ward clustering
uses a bottomup hierarchical approach: voxels are progressively agglomerated together into clusters. In scikitlearn, structural information can be specified via a connectivity graph given to the Ward clustering estimator. This graph is used to allow only merges between neighboring voxels, thus readily producing contiguous parcels. We will rely on the sklearn.feature_extraction.image.grid_to_graph function to construct such a graph using the neighbor structure of an image grid, with optionally a brain mask.
KMeans
is a more topdown approach, seeking cluster centers to evenly explain the variance of the data. Each voxels are then assigned to the nearest center, thus forming clusters. As imposing a spatial model in Kmeans is not easy, it is often advisable to spatially smooth the data.
To apply the clustering algorithms, we run the common data preparation steps and produce a data matrix. As both Ward clustering and Kmeans rely on secondorder statistics, we can speed up the algorithms by reducing the dimensionality while preserving these secondorder statistics with a PCA. Note that clustering algorithms group samples and that here we want to group voxels. So if the data matrix is, as previously a (time points voxels) matrix, we need to transpose it before running the scikitlearn clustering estimators. Scikitlearn provides a WardAgglomeration object to do this feature agglomeration with Ward clustering (michel2012supervisedclustering), but this is not the case when using KMeans.
8.2.2 Results
Clustering results are shown in figure 5. While clustering extracts some known large scale structure, such as the calcarine sulcus on fig 5.a, it is not guaranteed to delineate functionally specific brain regions. Rather, it can be considered as a compression, that is a useful method of summarizing information, as it groups together similar voxels. Note that, as Kmeans does not extract spatiallycontiguous clusters, it gives a number of regions that can be much larger than the number of clusters specified, although some of these regions can be very small. On the opposite, spatiallyconstrained Ward directly creates regions. As it is a bottomup process, it tends to perform best with a large number of clusters. There exist many more clustering techniques exposed in scikitlearn. Determining which is the best one to process fMRI timeseries requires a more precise definition of the target application.
Ward’s clustering and KMeans are among the simplest approaches proposed in the scikitlearn. craddock2011
applied spectral clustering on neuroimaging data, a similar application is available in nilearn as an example.
9 Conclusion
In this paper we have illustrated with simple examples how machine learning techniques can be applied to fMRI data using the scikitlearn Python toolkit in order to tackle neuroscientific problems. Encoding and decoding can rely on supervised learning to link brain images with stimuli. Unsupervised learning can extract structure such as functional networks or brain regions from restingstate data. The accompanying Python code for the machine learning tasks is straightforward. Difficulties lie in applying proper preprocessing to the data, choosing the right model for the problem, and interpreting the results. Tackling these difficulties while providing the scientists with simple and readable code requires building a domainspecific library, dedicated to applying scikitlearn to neuroimaging data. This effort is underway in a nascent project, nilearn, that aims to facilitate the use of scikitlearn on neuroimaging data.
The examples covered in this paper only scratch the surface of applications of statistical learning to neuroimaging. The tool stack presented here shines uniquely in this regard as it opens the door to any combination of the wide range of machine learning methods present in scikitlearn with neuroimagingrelated code. For instance, sparse inverse covariance can extract the functional interaction structure from fMRI timeseries (varoquaux2013)
using the graphlasso estimator. Modern neuroimaging data analysis entails fitting rich models on limited data quantities. These are highdimensional statistics problems which call for statisticallearning techniques. We hope that bridging a generalpurpose machine learning tool, scikitlearn, to domainspecific data preparation code will foster new scientific advances.
Disclosure/ConflictofInterest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Funding
We acknowledge funding from the NiConnect project and NIDA R21 DA034954, SUBSample project from the DIGITEO Institute, France.