1 Introduction
One of the main goals of brain imaging and neuroscience (and, possibly, of most natural sciences) is to improve understanding of the investigated system based on data. In our case, this amounts to inferring descriptive features of brain structure and function from noninvasive measurements. The brain imaging field has come a long way from anatomical maps and atlases toward data-driven feature learning methods, such as seed-based correlation [2], canonical correlation analysis [33], and independent component analysis (ICA) [24, 1]. These methods have been highly successful in revealing known brain features with new details [3] (supporting their credibility), in recovering features that differentiate patients and controls [28] (assisting diagnosis and disease understanding), and in starting a “resting state” revolution after revealing consistent patterns in data from uncontrolled resting experiments [35, 29]. Classification is often used merely as a correctness-checking tool, as the main emphasis is on learning about the brain. A perfect oracle that does not explain its conclusions would be useful, but mainly to facilitate inference of the ways in which the oracle draws these conclusions.
As an oracle, deep learning methods are breaking records in the areas of speech, signal, image, video, and text mining and recognition, improving state-of-the-art classification accuracy by, sometimes, more than 30% where the prior decade struggled to obtain 12% improvements [19, 21]. What differentiates them from other classifiers, however, is automatic feature learning from data, which largely contributes to the improvements in accuracy. Presently, this seems to be the closest solution to an oracle that reveals its methods, a desirable tool for brain imaging.
Another distinguishing feature of deep learning is the depth of the models. Given the already acceptable feature learning results obtained by shallow models, which currently dominate the neuroimaging field, it is not immediately clear what benefits depth would have. Considering the state of multimodal learning, where models are either assumed to be the same for the analyzed modalities [26] or cross-modal relations are sought at the (shallow) level of mixture coefficients [23], deeper models better fit the intuitive notion of cross-modality relations: for example, relations between genetics and phenotypes should be indirect, happening at a deeper conceptual level.
In this work we present our recent advances in applying deep learning methods to functional and structural magnetic resonance imaging (fMRI and sMRI). Both consist of brain volumes, but for sMRI these are static volumes (one per subject/session), while for fMRI a single-subject dataset comprises multiple volumes capturing the changes during an experimental session. Our goal is to validate the feasibility of this application by a) investigating whether a building block of deep generative models, a restricted Boltzmann machine (RBM) [17], is competitive with ICA (a representative model of its class) (Section 2); b) examining the effect of depth in deep learning analysis of structural MRI data (Section 3.3); and c) determining the value of the methods for discovering latent structure in a large-scale (by neuroimaging standards) dataset (Section 3.4). The measure of feature learning performance in a shallow model (a) is comparable with existing methods and known brain physiology. However, this measure cannot be used when deeper models are investigated. As we further demonstrate, classification accuracy does not provide the complete picture either. To be able to visualize the effect of depth and gain insight into the learning process, we introduce a flexible constraint satisfaction embedding method that allows us to control the complexity of the constraints (Section 3.2). By deliberately choosing local constraints, we are able to reflect the transformations that the deep belief network (DBN) [15] learns and applies to the data and gain additional insight.
2 A shallow belief network for feature learning
Prior to investigating the benefits of depth of a DBN in learning representations from fMRI and sMRI data, we would like to find out whether a shallow (single hidden layer) model from this family, the RBM, meets the field’s expectations. As mentioned in the introduction, a number of methods are used for feature learning from neuroimaging data; most of them belong to the single matrix factorization (SMF) class. We do a quick comparison with a small subset of SMF methods on simulated data and continue with a more extensive comparison against ICA as an approach trusted in the neuroimaging field. Similarly to the RBM, ICA relies on a bipartite graph structure, and can even be viewed as an artificial neural network with sigmoid hidden units, as in the case of Infomax ICA [1] that we compare against. Note the difference from the RBM: ICA applies its weight matrix to the (shorter) temporal dimension of the data, imposing independence on the spatial dimension, while the RBM applies its weight matrix (the hidden units’ “receptive fields”) to the high-dimensional spatial dimension instead (Figure 2).
2.1 A restricted Boltzmann machine
A restricted Boltzmann machine (RBM) is a Markov random field that models the data distribution by parameterizing it with the Gibbs distribution over a bipartite graph between visible variables $\mathbf{v}$ and hidden variables $\mathbf{h}$ [10]:
$$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h}) = \sum_{\mathbf{h}} \frac{1}{Z} \exp\left(-E(\mathbf{v}, \mathbf{h})\right),$$
where $Z$ is the normalization term (the partition function) and $E(\mathbf{v}, \mathbf{h})$ is the energy of the system. Each visible variable in the case of fMRI data represents a voxel of an fMRI scan with a real-valued and approximately Gaussian distribution. In this case, the energy is defined as:
$$E(\mathbf{v}, \mathbf{h}) = \sum_{j} \frac{(v_j - a_j)^2}{2\sigma_j^2} - \sum_{i} b_i h_i - \sum_{i,j} \frac{v_j}{\sigma_j} W_{ji} h_i, \tag{1}$$
where $a_j$ and $b_i$ are biases and $\sigma_j$ is the standard deviation of a parabolic containment function for each visible variable $v_j$ centered on the bias $a_j$. In general, the parameters $\sigma_j$ need to be learned along with the other parameters. However, in practice normalizing the distribution of each voxel to have zero mean and unit variance is faster and yet effective [27]. A number of choices affect the quality of interpretation of the representations learned from fMRI by an RBM. Encouraging sparse features via $L_1$ regularization, $\lambda \|\mathbf{W}\|_1$ ($\lambda = 0.1$ gave the best results), and using the hyperbolic tangent as the hidden-unit nonlinearity are essential settings that respectively facilitate spatial and temporal interpretation of the result. The weights were updated using the truncated Gibbs sampling method called contrastive divergence (CD) with a single sampling step (CD-1). Further information on the RBM model can be found in [17, 16].
2.2 Synthetic data
In this section we summarize our comparisons of RBM with SMF models—including Infomax ICA [1], PCA [14], sparse PCA (sPCA) [37], and sparse NMF (sNMF) [18]—on synthetic data with known spatial maps generated to simulate fMRI.
Figure 1a shows the correlation of spatial map (SM) and time course (TC) estimates to the ground truth for RBM, ICA, PCA, sPCA, and sNMF. Correlations are averaged across all sources and datasets. RBM and ICA showed the best overall performance. While sNMF also estimated SMs well, it showed inferior performance on TC estimation, likely due to the nonnegativity constraint. Based on these results and the broad adoption of ICA in the field, we focus on comparing Infomax ICA and RBM.
Figure 1b shows the full set of ground truth sources along with RBM and ICA estimates for a single representative dataset. SMs are thresholded and represented as contours for visualization. Results over all synthetic datasets showed similar performance for RBM and ICA (Figure 1c), with a slight advantage for ICA with regard to SM estimation and a slight advantage for RBM with regard to TC estimation. RBM and ICA also showed comparable performance in estimating cross-correlations, also called functional network connectivity (FNC).
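The comparison to ground truth described above requires pairing each estimated component with a true source before correlations can be averaged. The following is a minimal sketch of one way to do this, assuming (the text does not specify the procedure) a greedy pairing by absolute Pearson correlation, so that the arbitrary order and sign of the estimated components do not matter. The function name `match_to_ground_truth` and the toy data are hypothetical.

```python
import numpy as np


def match_to_ground_truth(estimates, truth):
    """Greedily pair each true source with one estimate by |correlation|.

    estimates, truth: arrays of shape (n_sources, n_samples).
    Returns (pairs, scores): pairs[i] is the index of the estimate matched to
    true source i, scores[i] is the absolute Pearson correlation of that pair.
    """
    n = truth.shape[0]
    # np.corrcoef stacks the rows of both inputs; the top-right block holds
    # correlations between true sources (rows) and estimates (columns).
    corr = np.abs(np.corrcoef(truth, estimates)[:n, n:])
    pairs = np.full(n, -1)
    scores = np.zeros(n)
    for _ in range(n):
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        pairs[i], scores[i] = j, corr[i, j]
        corr[i, :] = -1.0  # remove the matched source ...
        corr[:, j] = -1.0  # ... and the matched estimate from later rounds
    return pairs, scores


rng = np.random.default_rng(0)
truth = rng.standard_normal((4, 500))  # toy "ground truth" sources
order = np.array([2, 0, 3, 1])         # estimates come back shuffled ...
signs = np.array([1.0, -1.0, 1.0, -1.0])[:, None]  # ... and sign-flipped
estimates = signs * truth[order] + 0.05 * rng.standard_normal((4, 500))

pairs, scores = match_to_ground_truth(estimates, truth)
mean_score = scores.mean()
```

Applied to time courses instead of spatial maps, the same function gives the TC correlations; an optimal (rather than greedy) assignment could be substituted when components are strongly correlated with each other.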
2.3 An fMRI data application
Data used in this work comprised task-related scans from 28 (five females) healthy participants, all of whom gave written, informed, IRB-approved consent at Hartford Hospital and were compensated for participation (more detailed information regarding participant demographics is provided in [9]). All participants were scanned during an auditory oddball task (AOD) involving the detection of an infrequent target sound within a series of standard and novel sounds (the task is described in more detail in [4] and [9]).
Scans were acquired at the Olin Neuropsychiatry Research Center at the Institute of Living/Hartford Hospital on a Siemens Allegra 3T dedicated head scanner equipped with 40 mT/m gradients and a standard quadrature head coil [4, 9]. The AOD consisted of two 8-min runs, and scans (volumes) at 2 second TR (0.5 Hz sampling rate) were used for the final dataset. Data were postprocessed using the SPM5 software package [12], motion corrected using INRIalign [11], and subsampled to . The complete fMRI dataset was masked below mean and the mean image across the dataset was removed, giving a complete dataset of size voxels by volumes. Each voxel was then normalized to have zero mean and unit variance.
The RBM was constructed using Gaussian visible units and hyperbolic tangent hidden units. The hyperparameters (0.08 from the searched range) for the learning rate and (0.1 from the searched range) for weight decay were selected as those that showed a reduction of reconstruction error over training and a significant reduction in the span of the receptive fields, respectively. Parameter values outside the ranges resulted in either unstable or slow learning or uninterpretable features. The RBM was then trained with a batch size of for approximately epochs to allow for full convergence of the parameters.
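To make the training setup above more concrete, the following is a minimal NumPy sketch of CD-1 for an RBM with unit-variance Gaussian visible units and a sparsity-encouraging $L_1$ weight penalty. It is not the implementation used for the reported results. Several choices are assumptions made only for illustration: "hyperbolic tangent hidden units" is read here as hidden units taking values in {-1, +1} whose conditional mean is a tanh; the negative phase uses the mean of the visible units instead of a sample; and the function name `train_rbm_cd1`, its default learning rate and penalty, and the toy data are hypothetical and differ from the values quoted in the text.

```python
import numpy as np


def train_rbm_cd1(X, n_hidden, lr=0.01, l1=0.01, batch=50, epochs=30, seed=0):
    """CD-1 for an RBM with unit-variance Gaussian visibles and +/-1 hiddens.

    X: (n_samples, n_visible), each column already zero-mean, unit-variance.
    Returns weights W, visible bias a, hidden bias b, and the per-epoch
    mean-field reconstruction errors.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.01 * rng.standard_normal((d, n_hidden))
    a, b = np.zeros(d), np.zeros(n_hidden)
    errors = []
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
            v0 = X[idx]
            h0 = np.tanh(v0 @ W + b)                   # E[h | v0]
            # sample h in {-1, +1}: P(h = +1) = (1 + tanh) / 2
            h0_s = np.where(rng.random(h0.shape) < 0.5 * (1 + h0), 1.0, -1.0)
            v1 = h0_s @ W.T + a                        # mean of p(v | h)
            h1 = np.tanh(v1 @ W + b)                   # E[h | v1]
            grad_W = (v0.T @ h0 - v1.T @ h1) / len(idx)
            W += lr * (grad_W - l1 * np.sign(W))       # L1 weight penalty
            a += lr * (v0 - v1).mean(axis=0)
            b += lr * (h0 - h1).mean(axis=0)
        recon = np.tanh(X @ W + b) @ W.T + a
        errors.append(float(np.mean((X - recon) ** 2)))
    return W, a, b, errors


# Toy data: 3 binary latent factors mixed into 20 "voxels", then normalized.
rng = np.random.default_rng(1)
S = rng.choice([-1.0, 1.0], size=(400, 3))
X = S @ rng.standard_normal((3, 20)) + 0.1 * rng.standard_normal((400, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)

W, a, b, errors = train_rbm_cd1(X, n_hidden=8)
time_courses = np.tanh(X @ W + b)   # feed-forward pass, one column per feature
```

In this sketch each column of `W` plays the role of a receptive field over voxels, and `time_courses` corresponds to running the trained model in feed-forward mode over the volumes.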
After flipping the sign of negative receptive fields, we identified and labeled spatially distinct features as corresponding to brain regions with the aid of AFNI [5], excluding features that had a high probability of corresponding to white matter, ventricles, or artifacts (e.g., motion, edges).
We normalized the fMRI volume time series to mean zero and used the trained RBM in feedforward mode to compute time series for each fMRI feature. This was done to better compare to ICA, where the mean is removed in PCA preprocessing.
The workflow is outlined in Figure 2, while Figure 3 compares the resulting features with those obtained by Infomax ICA. In general, the RBM performs competitively with ICA while providing (perhaps not surprisingly, given the regularization used) sharper and more localized features. While we recognize that this is a subjective measure, we list more features in Figure S2 of Section 5 and note that RBM features lack negative parts for corresponding features. Note that with regularized weights the RBM algorithm starts to resemble some of the ICA approaches (such as the recent RICA by Le et al. [20]), which may explain the similar performance. However, the differences, and possible advantages, are the generative nature of the RBM and the absence of enforced component orthogonality (at least not explicitly). Moreover, the block structure of the correlation matrix of feature time courses (see Figure S1 in the supplementary material section below) provides a grouping that is more physiologically supported than that provided by ICA. Perhaps this is because ICA, working hard to enforce spatial independence, subtly affects the time courses and, in turn, their cross-correlations. We have observed comparable running times for the (non-GPU) ICA (http://www.nitrc.org/projects/gift) and a GPU implementation of the RBM (https://github.com/nitishsrivastava/deepnet).
3 Validating the depth effect
Since the RBM results demonstrate feature-learning performance competitive with the state of the art (or better), we proceed to investigate the effects of model depth. To do that, we turn from fMRI to sMRI data. It is commonly assumed in the deep learning literature [22] that depth often improves classification accuracy. We investigate whether that is indeed true in the sMRI case. Structural data is convenient for this purpose, as each subject/session is represented by a single volume that has a label: control or patient in our case. Compare this to 4D data, where hundreds of volumes belong to the same subject with the same disease state.
3.1 A deep belief network
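A DBN is a sigmoidal belief network (although other activation functions may be used) with an RBM as the top-level prior. The joint probability distribution of its visible and hidden units is parametrized as follows:
$$P(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2, \dots, \mathbf{h}^l) = P(\mathbf{h}^{l-1}, \mathbf{h}^l) \prod_{k=0}^{l-2} P(\mathbf{h}^k \mid \mathbf{h}^{k+1}), \tag{2}$$
where $l$ is the number of hidden layers, $\mathbf{h}^0 = \mathbf{v}$, $P(\mathbf{h}^{l-1}, \mathbf{h}^l)$ is an RBM, and the $P(\mathbf{h}^k \mid \mathbf{h}^{k+1})$ factor into individual conditionals:
$$P(\mathbf{h}^k \mid \mathbf{h}^{k+1}) = \prod_{i=1}^{n_k} P(h_i^k \mid \mathbf{h}^{k+1}), \tag{3}$$
where $n_k$ is the number of units in layer $k$.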
An important property of the DBN for our goal of feature learning to facilitate discovery is its ability to operate in generative mode with fixed values on chosen hidden units, thus allowing one to investigate the features that the model has learned and/or weighs as important in discriminative decisions. We are, however, not going to use this property in this section, focusing instead on validating the claim that a network’s depth provides benefits for neuroimaging data analysis. We will do this using the discriminative mode of the DBN’s operation, as it provides an objective measure of the depth effect.
DBN training splits into two stages: pretraining and discriminative fine-tuning. A DBN can be pretrained by treating each of its layers as an RBM, trained in an unsupervised way on inputs from the previous layer, and later fine-tuned by treating it as a feed-forward neural network. The latter allows supervised training via the error backpropagation algorithm. We use this scheme in the following by augmenting each DBN with a softmax layer at the fine-tuning stage.
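The pretraining stage described above can be sketched as follows. This is an illustrative NumPy outline only: it assumes sigmoid (Bernoulli) units in every layer, full-batch CD-1 updates, and toy binary inputs, and it omits the supervised fine-tuning stage (softmax layer plus backpropagation) entirely. The function names `pretrain_layer`, `pretrain_dbn`, and `forward` are hypothetical; the layer sizes mirror the 50-50-100 architecture of Section 3.3.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def pretrain_layer(data, n_hidden, rng, lr=0.05, epochs=20):
    """Train one Bernoulli-Bernoulli RBM on `data` (values in [0, 1]), CD-1."""
    n, d = data.shape
    W = 0.01 * rng.standard_normal((d, n_hidden))
    a, b = np.zeros(d), np.zeros(n_hidden)
    for _ in range(epochs):
        h0 = sigmoid(data @ W + b)                       # positive phase
        h0_s = (rng.random(h0.shape) < h0).astype(float)  # sample hidden states
        v1 = sigmoid(h0_s @ W.T + a)                     # one Gibbs step down
        h1 = sigmoid(v1 @ W + b)                         # ... and back up
        W += lr * (data.T @ h0 - v1.T @ h1) / n
        a += lr * (data - v1).mean(axis=0)
        b += lr * (h0 - h1).mean(axis=0)
    return W, b


def pretrain_dbn(X, layer_sizes, seed=0):
    """Greedy stage: each RBM is trained on activations of the layer below."""
    rng = np.random.default_rng(seed)
    layers, inputs = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(inputs, n_hidden, rng)
        layers.append((W, b))
        inputs = sigmoid(inputs @ W + b)  # deterministic up-pass to next layer
    return layers


def forward(X, layers):
    """Feed-forward pass through the stack; the output would feed a softmax."""
    for W, b in layers:
        X = sigmoid(X @ W + b)
    return X


rng = np.random.default_rng(0)
X = (rng.random((60, 30)) < 0.3).astype(float)  # toy binary "volumes"
layers = pretrain_dbn(X, layer_sizes=[50, 50, 100])
top = forward(X, layers)
```

After this stage the weights in `layers` would serve as the initialization of a feed-forward network that is then trained with labels.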
3.2 Nonlinear embedding as a constraint satisfaction problem
A DBN and an RBM operate on data samples, which are brain volumes in the fMRI and sMRI case. A five-minute fMRI experiment with a 2-second sampling period yields 150 of these volumes per subject. For sMRI studies the number of participating subjects varies, but in this paper we operate with datasets of 300 and 3500 subject-volumes. Transformations learned by deep learning methods do not look intuitive in the hidden node space, and generative sampling of the trained model does not provide a sense of whether a model has learned anything useful in the case of MRI data: in contrast to natural images, fMRI and sMRI images do not look very intuitive. Instead, we use a nonlinear embedding method to check whether a model has learned useful information and to assist in investigating what it has, in fact, learned.
One of the purposes of an embedding is to display a complex high-dimensional dataset in a way that is i) intuitive and ii) representative of the data sample. The first requirement usually leads to displaying data samples as points in a 2-dimensional map, while the second is more elusive and each approach addresses it differently. Embedding approaches include relatively simple random linear projections, which provably preserve some neighbor relations [6], and a more complex class of nonlinear embedding approaches [36, 32, 34, 30]. In an attempt to organize the properties of this diverse family, we have aimed at representing nonlinear embedding methods under a single constraint satisfaction problem (CSP) framework (see below). We hypothesize that each method places the samples in a map to satisfy a specific set of constraints. Although this work is not yet complete, it has proven useful in our current study. We briefly outline the ideas in this section to provide enough intuition for the method that we further use in Section 3.
Since we can control the constraints in the CSP framework, to study the effect of deep learning we choose them to do the least amount of work (while still being useful), letting the DBN do (or not) the hard part. A more complicated method such as t-SNE [36] already does complex processing to preserve the structure of a dataset in a 2D map, which makes it hard to infer whether the quality of the map is determined by the deep learning method or by the embedding. While some of the existing methods may have provided “least amount of work” solutions as well, we chose to go with the CSP framework. It explicitly states the constraints that are being satisfied and thus lets us reason about deep learning effects within the constraints, whereas with other methods, where the constraints are implicit, this would have been harder.
A constraint satisfaction problem (CSP) is one requiring a solution that satisfies a set of constraints. One well-known example is the Boolean satisfiability problem (SAT). There are multiple other important CSPs, such as packing, molecular conformations, and, recently, error-correcting codes [7]. The freedom to set up per-point constraints without controlling for their global interactions makes a CSP formulation an attractive representation of the nonlinear embedding problem. Pursuing this property, we use the iterative “divide and concur” (DC) algorithm [13] as the solver for our representation. In the DC algorithm we treat each point on the solution map as a variable and assign a set of constraints that this variable needs to satisfy (more on these later). Each point then gets a “replica” for each constraint it is involved in. The DC algorithm then alternates the divide and concur projections. The divide projection moves each “replica” point to the nearest location in the 2D map that satisfies the constraint it participates in. The concur projection reconciles the locations of all “replicas” of a point by placing them at their average location on the map. The key idea is to avoid local traps by combining the divide and concur steps within the difference map [8]. A single location update is represented by:
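$$x_c = P_c\left(\left(1 + \tfrac{1}{\beta}\right) P_d(x) - \tfrac{1}{\beta}\, x\right), \quad x_d = P_d\left(\left(1 - \tfrac{1}{\beta}\right) P_c(x) + \tfrac{1}{\beta}\, x\right), \quad x \leftarrow x + \beta\,(x_c - x_d), \tag{4}$$
where $P_d$ and $P_c$ denote the divide and concur projections and $\beta$ is a user-defined parameter.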
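A toy illustration of the difference-map update may help here. The sketch below assumes the standard form of the difference map [8], in which one iteration computes $x_c = P_c((1 + 1/\beta) P_d(x) - x/\beta)$ and $x_d = P_d((1 - 1/\beta) P_c(x) + x/\beta)$ and then moves $x$ by $\beta (x_c - x_d)$. The two constraint sets (a unit circle and a horizontal line) are hypothetical stand-ins chosen so that the result is easy to check; they are not the embedding constraints used in this paper, and the function names are illustrative.

```python
import numpy as np


def difference_map(x, P_d, P_c, beta=1.0, n_iter=200):
    """Iterate x <- x + beta * (x_c - x_d) with the two projections P_d, P_c."""
    for _ in range(n_iter):
        x_c = P_c((1 + 1 / beta) * P_d(x) - x / beta)
        x_d = P_d((1 - 1 / beta) * P_c(x) + x / beta)
        x = x + beta * (x_c - x_d)
    # At a fixed point x_c == x_d, and that common point lies in both sets.
    return P_d((1 - 1 / beta) * P_c(x) + x / beta)


def project_circle(p):
    """Nearest point on the unit circle."""
    return p / np.linalg.norm(p)


def project_line(p):
    """Nearest point on the horizontal line y = 0.5."""
    return np.array([p[0], 0.5])


solution = difference_map(np.array([0.3, -0.2]), project_circle, project_line)
```

The iterate `x` itself is generally not a solution; the solution is read off by applying a projection to the fixed point, which is why the function returns the projected point. When no common point exists the iterates keep moving, which is the oscillating behavior discussed in the text.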
While the concur projection differs only by the subsets of “replicas” across the different methods representable in the DC framework, the divide projection is unique to each method and defines the algorithm’s behavior. In this paper, we choose a divide projection that keeps the $k$ nearest neighbors of each point in the higher-dimensional space as its neighbors in the 2D map as well. This is a simple local neighborhood constraint that allows us to assess the effects of the deep learning transformation, leaving most of the mapping decisions to the deep learning.
Note that for a general dataset we may not be able to satisfy this constraint, namely that each point has exactly the same neighbors in 2D as in the original space (and this is indeed what we observe). The DC algorithm, however, is only guaranteed to find the solution if it exists, and oscillates otherwise. Oscillating behavior is detectable and may be used to stop the algorithm. We found it informative to watch the 2D map evolve, as the points that keep oscillating provide additional information about the structure of the data. Another practically important feature of the algorithm is that it is deterministic: given the same parameters (the user-defined parameter of the difference map and the parameters of the divide projection), it converges to the same solution regardless of the initial point. If each of the points participates in each constraint, the complexity of the algorithm is quadratic. With our simple neighborhood constraints it is , for samples/points.
3.3 A schizophrenia structural MRI dataset
We use combined data from four separate schizophrenia studies conducted at Johns Hopkins University (JHU), the Maryland Psychiatric Research Center (MPRC), the Institute of Psychiatry, London, UK (IOP), and the Western Psychiatric Institute and Clinic at the University of Pittsburgh (WPIC) (the data used in Meda et al. [25]). The combined sample comprised 198 schizophrenia patients and 191 matched healthy controls and contained both first-episode and chronic patients [25]. At all sites, whole-brain MRIs were obtained on a 1.5T Signa GE scanner using identical parameters and software. Original structural MRI images were segmented in native space, and the resulting gray and white matter images were then spatially normalized to gray and white matter templates, respectively, to derive the optimized normalization parameters. These parameters were then applied to the whole-brain structural images in native space prior to a new segmentation. The resulting 60465-voxel gray matter images were used in this study. Figure 4 shows example orthogonal slice views of the gray matter data samples of a patient and a healthy control.
The main goal of this section is to evaluate the effect of the depth of a DBN on sMRI. To do so, we investigate whether classification rates improve with depth, sequentially examining DBNs of three depths. From the RBM experiments we have learned that even with a larger number of hidden units (72, 128, and 512) the RBM tends to keep only around 50 features, driving the rest to zero. Classification rate and reconstruction error still slightly improve, however, when the number of hidden units increases. These observations led to our choice of 50 hidden units for the first two layers and 100 for the third. Each hidden unit is connected to all units in the previous layer, which results in an all-to-all connectivity structure between the layers, the more common and conventional approach to constructing these models. Note that larger networks (up to double the number of units) lead to similar results. We pretrain each layer via an unsupervised RBM and discriminatively fine-tune models of depth 1 (50 hidden units in the top layer), 2 (50-50 hidden units in the first and the top layer, respectively), and 3 (50-50-100 hidden units in the first, second, and the top layer, respectively) by adding a softmax layer on top of each of these models and training via backpropagation.
We estimate the accuracy of classification via 10-fold cross-validation on fine-tuned models, splitting the 389-subject dataset into 10 approximately class-balanced folds.
depth  raw  1  2  3
SVM F-score
LR F-score
KNN F-score
We train an RBF-kernel SVM, a logistic regression (LR), and a k-nearest neighbors (KNN) classifier using as their input the activations of the topmost hidden layers of the fine-tuned models on the training data of each fold. Testing is performed likewise, but on the test data. We also perform the same 10-fold cross-validation on the raw data. Table 1 summarizes the precision and recall values as F-scores, together with their standard deviations.
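The evaluation protocol above can be sketched in a few lines. The example below is a simplified stand-in, not the pipeline used for Table 1: it assumes the features (raw data or top-layer activations) are already available as a matrix, uses only a k-nearest-neighbors classifier written in plain NumPy, and replaces the real data with two synthetic, well-separated classes. In the actual protocol the DBN itself would also be fine-tuned on the training portion of each fold. The helper names `stratified_folds`, `knn_predict`, and `f_score` are hypothetical.

```python
import numpy as np


def stratified_folds(y, n_folds, rng):
    """Assign a fold index to every sample, keeping class proportions similar."""
    folds = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds


def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training samples (binary labels 0/1)."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)


def f_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


rng = np.random.default_rng(0)
# Toy stand-in for the feature matrix: two well-separated classes.
y = np.repeat([0, 1], [95, 100])
X = rng.standard_normal((195, 10)) + 3.0 * y[:, None]

folds = stratified_folds(y, n_folds=10, rng=rng)
scores = []
for f in range(10):
    test, train = folds == f, folds != f
    y_pred = knn_predict(X[train], y[train], X[test])
    scores.append(f_score(y[test], y_pred))
mean_f, std_f = np.mean(scores), np.std(scores)
```

Reporting `mean_f` together with `std_f` across folds corresponds to the F-score and standard deviation entries described for the table.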
All models demonstrate a similar trend: the accuracy increases only slightly from the depth-1 to the depth-2 DBN and then improves significantly. Table 1 supports the general claim of the deep learning community that classification rate improves with depth, even for sMRI data. The improvement in classification even for the simple KNN classifier indicates the character of the transformation that the DBN learns and applies to the data: it may be changing the data manifold to organize classes by neighborhoods. Ideally, to draw general conclusions about this transformation we would need to analyze several representative datasets. However, even working with the same data we can get a closer view of the depth effect using the method introduced in Section 3.2.
Although it may seem that the DBN does not provide significant improvements in sMRI classification from depth 1 to depth 2 in this model, it keeps learning potentially useful transformations of the data. We can see that using our simple local neighborhood-based embedding. Figure 5 displays 2D maps of the raw data, as well as of the depth 1, 2, and 3 activations (of a network trained on 335 subjects): the deeper networks place the patient and control groups further apart. Additionally, Figure 5 displays the 54 subjects that the DBN was not trained on. These held-out subjects also show increased separation with depth. This behavior of the DBN is potentially useful for generalization when larger and more diverse data become available.
Our new mapping method has two essential properties that facilitate the conclusion and provide confidence in the result: its already mentioned local properties and the deterministic nature of the algorithm. The latter makes the resulting maps independent of the starting point. The map depends only on the model’s parameter (the size of the neighborhood) and on the data.
3.4 A large-scale Huntington disease dataset
In this section we focus on sMRI data collected from healthy controls and Huntington disease (HD) patients as part of the PREDICT-HD project (www.predicthd.net). Huntington disease is a genetic neurodegenerative disease that results in degeneration of neurons in certain areas of the brain. The project is focused on identifying the earliest detectable changes in thinking skills, emotions, and brain structure as a person begins the transition from health to being diagnosed with Huntington disease. We would like to know whether deep learning methods can assist in answering that question.
For this study, T1-weighted scans were collected at multiple sites (32 international sites), representing multiple field strengths (1.5T and 3.0T) and multiple manufacturers (Siemens, Philips, and GE). The 1.5T T1-weighted scans were an axial 3D volumetric spoiled-gradient echo series ( mm voxels), and the 3.0T T1-weighted scans were a 3D volumetric MPRAGE series ( mm voxels).
depth  raw  1  2  3
SVM F-score
LR F-score
The images were segmented in native space and then normalized to a common template. After correlating the normalized gray matter segmentation with the template and eliminating poorly correlating scans, we obtained a dataset of 3500 scans, of which 2641 were from patients and 859 from healthy controls. We used all of the scans in this imbalanced sample to pretrain and fine-tune the same model architecture (50-50-100) as in Section 3.3 for all three depths. (Note that in both cases we experimented with larger layer sizes, but the results were not sufficiently different to warrant the increase in computation and in the number of parameters to be estimated.)
Table 2 lists the average F-score values for both classes at the raw data and all depth levels. Note the drop from the raw data and then a recovery at depth 3. The limited capacity of levels 1 and 2 has reduced the network’s ability to differentiate the groups, but the representational capacity of the depth-3 network compensates for the initial bottleneck. This confirms our previous observation on the depth effect; it does not, however, yet help with the main question of the PREDICT-HD study. Note also that while Table 1 in the previous section evaluates the generalization ability of the DBN, Table 2 here only demonstrates changes in the DBN’s representational capacity with depth, as we use no testing data. To further investigate the utility of the deep learning approach for scientific discovery, we again augment it with the embedding method of Section 3.2. Figure 7 shows the map of 3500 scans of HD patients and healthy controls. Each point on the map is an sMRI volume, shown in Figures 6 and 7. Although we have used the complete data to train the DBN, discriminative fine-tuning had access only to a binary label: control or patient. In addition to that, we have information about the severity of the disease, from low to high. We have color-coded this information in Figure 7 from bright yellow (low) through orange (medium) to red (high). The network discriminates the patients by disease severity, which results in a spectrum on the map (note that the embedding algorithm does not have access to any label information). Note that neither t-SNE (not shown) nor our new embedding sees the spectrum, or even the patient groups, in the raw data. This is an important property of the method that may help support its future use in discovering new information about the disease.
4 Conclusions
Our investigations show that deep learning has high potential in neuroimaging applications. Even the shallow RBM is already competitive with the model routinely used in the field: it produces physiologically meaningful features that are (desirably) highly focal and have time course cross-correlations that connect them into meaningful functional groups (Section 5). The depth of the DBN does indeed help classification and increases group separation. This is apparent on two sMRI datasets collected under varying conditions, at multiple sites each, from different disease groups, and preprocessed differently. This is strong evidence of the robustness of DBNs. Furthermore, our study shows the high potential of DBNs for exploratory analysis. As Figure 7 demonstrates, a DBN in conjunction with our new mapping method can reveal hidden relations in data. We did find it difficult initially to find workable parameter regions, but we hope that other researchers will not have this difficulty when starting from the baseline that we provide in this paper.
References
 [1] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, November 1995.
 [2] Bharat Biswal, F Zerrin Yetkin, Victor M Haughton, and James S Hyde. Functional connectivity in the motor cortex of resting human brain using echo-planar MRI. Magnetic Resonance in Medicine, 34(4):537–541, 1995.
 [3] M.J. Brookes, M. Woolrich, H. Luckhoo, D. Price, J.R. Hale, M.C. Stephenson, G.R. Barnes, S.M. Smith, and P.G. Morris. Investigating the electrophysiological basis of resting state networks using magnetoencephalography. Proceedings of the National Academy of Sciences, 108(40):16783–16788, 2011.
 [4] V. D. Calhoun, K. A. Kiehl, and G. D. Pearlson. Modulation of temporally coherent brain networks estimated using ICA at rest and during cognitive tasks. Human Brain Mapping, 29(7):828–838, 2008.
 [5] R. W. Cox et al. AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29(3):162–173, 1996.
 [6] T. de Vries, S. Chawla, and M. E. Houle. Finding local anomalies in very high dimensional space. In Proceedings of the 10th IEEE International Conference on Data Mining, pages 128–137. IEEE Computer Society, 2010.
 [7] Nate Derbinsky, José Bento, Veit Elser, and Jonathan S Yedidia. An improved three-weight message-passing algorithm. arXiv preprint arXiv:1305.1961, 2013.
 [8] V. Elser, I. Rankenburg, and P. Thibault. Searching with iterated maps. Proceedings of the National Academy of Sciences, 104(2):418, 2007.
 [9] Nathan Swanson et al. Lateral differences in the default mode network in healthy controls and patients with schizophrenia. Human Brain Mapping, 32:654–664, 2011.
 [10] Asja Fischer and Christian Igel. An introduction to restricted Boltzmann machines. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 14–36. Springer, 2012.
 [11] L. Freire, A. Roche, and J. F. Mangin. What is the best similarity measure for motion correction in fMRI. IEEE Transactions on Medical Imaging, 21:470–484, 2002.
 [12] Karl J Friston, Andrew P Holmes, Keith J Worsley, JP Poline, Chris D Frith, and Richard SJ Frackowiak. Statistical parametric maps in functional imaging: a general linear approach. Human brain mapping, 2(4):189–210, 1994.
 [13] S. Gravel and V. Elser. Divide and concur: A general approach to constraint satisfaction. Physical Review E, 78(3):36706, 2008.
 [14] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Springer, 2009.
 [15] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [16] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
 [17] Geoffrey Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
 [18] P. O. Hoyer. Nonnegative sparse coding. In Neural Networks for Signal Processing, 2002. Proceedings of the 2002 12th IEEE Workshop on, pages 557–565, 2002.
 [19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012.
 [20] Quoc V Le, Alexandre Karpenko, Jiquan Ngiam, and Andrew Y Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, pages 1017–1025, 2011.

 [21] Quoc V. Le, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, and Andrew Y. Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning. 103, 2012.
 [22] N. Le Roux and Y. Bengio. Deep belief networks are compact universal approximators. Neural computation, 22(8):2192–2207, 2010.
 [23] Jingyu Liu and Vince Calhoun. Parallel independent component analysis for multimodal analysis: Application to fMRI and EEG data. In Biomedical Imaging: From Nano to Macro, 2007. ISBI 2007. 4th IEEE International Symposium on, pages 1028–1031. IEEE, 2007.
 [24] M. J. McKeown, S. Makeig, G. G. Brown, T. P. Jung, S. S. Kindermann, A. J. Bell, and T. J. Sejnowski. Analysis of fMRI data by blind separation into independent spatial components. Human Brain Mapping, 6(3):160–188, 1998.
 [25] Shashwath A Meda, Nicole R Giuliani, Vince D Calhoun, Kanchana Jagannathan, David J Schretlen, Anne Pulver, Nicola Cascella, Matcheri Keshavan, Wendy Kates, Robert Buchanan, et al. A large scale (n = 400) investigation of gray matter differences in schizophrenia using optimized voxel-based morphometry. Schizophrenia research, 101(1):95–105, 2008.
 [26] M. Moosmann, T. Eichele, H. Nordby, K. Hugdahl, and V. D. Calhoun. Joint independent component analysis for simultaneous EEG-fMRI: principle and simulation. International Journal of Psychophysiology, 67(3):212–221, 2008.
 [27] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
 [28] V. K. Potluru and V. D. Calhoun. Group learning using contrast NMF: Application to functional and structural MRI of schizophrenia. Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on, pages 1336–1339, May 2008.
 [29] Marcus E. Raichle, Ann Mary MacLeod, Abraham Z. Snyder, William J. Powers, Debra A. Gusnard, and Gordon L. Shulman. A default mode of brain function. Proceedings of the National Academy of Sciences, 98(2):676–682, 2001.
 [30] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
 [31] Mikail Rubinov and Olaf Sporns. Weight-conserving characterization of complex functional brain networks. Neuroimage, 56(4):2068–2079, 2011.
 [32] John W Sammon Jr. A nonlinear mapping for data structure analysis. Computers, IEEE Transactions on, 100(5):401–409, 1969.
 [33] Jing Sui, Tulay Adali, Qingbao Yu, Jiayu Chen, and Vince D. Calhoun. A review of multivariate methods for multimodal fusion of brain imaging data. Journal of Neuroscience Methods, 204(1):68–81, 2012.
 [34] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 [35] M.P. van den Heuvel and H.E. Hulshoff Pol. Exploring the brain network: A review on resting-state fMRI functional connectivity. European Neuropsychopharmacology, 2010.
 [36] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

 [37] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.
5 Supplementary material
The correlation matrices for the RBM and ICA results on the fMRI dataset of Section 2.3 are provided in Figure S1, where the components are ordered separately for each method. Each network is named by its physiological function, but we do not explain these in depth in the current paper. For RBM, modularity is more apparent, both visually and quantitatively. Modularity, as defined in [31], averages across subjects for RBM, and for ICA. These values are significantly greater for RBM (per the paired t-test). Also note that the scale of correlation values differs between RBM and ICA, which highlights that RBM overestimated strong FNC values.
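The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to produce Figure S1: the function names (`pearson`, `fnc_matrix`, `paired_t`) are hypothetical, the time courses are toy data, and the per-subject modularity values are placeholders rather than results from the paper. The modularity measure of [31] itself is not implemented here; the sketch only shows how a correlation matrix is built from component time courses and how a paired t statistic compares two sets of per-subject values.

```python
import math
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fnc_matrix(time_courses):
    """Correlation matrix between component time courses.

    time_courses holds one list of samples per component.
    """
    n = len(time_courses)
    return [
        [pearson(time_courses[i], time_courses[j]) for j in range(n)]
        for i in range(n)
    ]


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Toy time courses for three hypothetical components of one subject.
toy_courses = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
fnc = fnc_matrix(toy_courses)

# Placeholder per-subject modularity values (NOT the paper's results).
mod_rbm = [0.5, 0.6, 0.7, 0.8]
mod_ica = [0.4, 0.4, 0.6, 0.5]
t_stat, dof = paired_t(mod_rbm, mod_ica)
print(fnc[0][1], fnc[0][2], t_stat, dof)
```

In practice one matrix would be computed per subject and per method, a modularity value derived from each matrix, and the two resulting per-subject lists passed to the paired test; a p-value would then be read from the t distribution with the returned degrees of freedom.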