1 Introduction
Functional magnetic resonance imaging (fMRI) of the blood oxygenation-level dependent (BOLD) signal provides a powerful tool for studying temporally coherent patterns in the brain (Damoiseaux et al., 2006; Calhoun et al., 2008; Smith et al., 2009). Intrinsic networks (INs, Biswal et al., 1995) and functional connectivity are important outcomes of fMRI studies which illuminate our understanding of healthy and diseased brain function (Calhoun et al., 2001b; Allen et al., 2012). While deep or nonlinear approaches for estimating INs from fMRI and MRI exist (Hjelm et al., 2014; Plis et al., 2013; Castro et al., 2016), the most widely used tools are generative models with shallow, linear structure. Such models typically use a shared parameterization to learn a common model across subjects, refactoring the data into a constrained space that both provides straightforward analysis and allows for efficient and effective learning algorithms.
The most popular of such methods, independent component analysis (ICA, Bell and Sejnowski, 1995), begins with the hypothesis that the data is a mixture of maximally independent sources. ICA is trainable through one of many relatively simple optimization routines that maximize non-Gaussianity or minimize mutual information (Hyvärinen and Oja, 2000). However, ICA, as with other popular linear methods for separating INs, is order-agnostic in time: each multivariate signal at each time step is treated as independent and identically distributed (i.i.d.). While model degeneracy in time is convenient for learning, as an assumption about the data the explicit lack of temporal dependence necessarily marginalizes out dynamics, which must then be extrapolated in post-hoc analysis.
In addition, ICA, as it is commonly used in fMRI studies, uses the same parameterization across subjects, which allows for either temporal or spatial variability, but not both (Calhoun et al., 2001a). The consequence of this is that ICA is not optimized to represent variation of shape in INs while also representing variation in time courses. This may encourage ICA to exaggerate time course statistics, as any significant variability in shape or size will primarily be accounted for by the time courses.
Despite these drawbacks, the benefits of using ICA for separating independent sources in fMRI data are strongly evident in numerous studies, to the extent that it has become the dominant approach for separating INs and analyzing connectivity (Zuo et al., 2010; Kim et al., 2008; Allen et al., 2012; Damoiseaux et al., 2006; Calhoun et al., 2008; Smith et al., 2009; Calhoun and Adali, 2012). In order to overcome these shortcomings in temporal dynamics and subject/temporal variability, but without abandoning the fundamental strengths of ICA, we extend ICA to model sequences using recurrent neural networks (RNNs). The resulting model, which we call RNN-ICA, naturally represents temporal dynamics through a sequential ICA objective and is easily trainable using backpropagation and gradient descent.
2 Background
Here we will formalize the problem of source separation with temporal dependencies and formulate the solution in terms of maximum likelihood estimation (MLE) and a recurrent model that parameterizes a conditionally independent distribution (i.e., recurrent neural networks).
Let us assume that the data is composed of ordered sequences of length T,

(1) X = (x_1, x_2, …, x_T),

where each element in the sequence, x_t, is an n-dimensional vector and the index t = 1, …, T enumerates the whole sequence. The goal is to find/infer a set of source signals,

(2) S = (s_1, s_2, …, s_T),

such that a subsequence of sources generates a subsequence of data, (s_1, …, s_t) → (x_1, …, x_t), for all t ≤ T. In particular, we are interested in finding a generating function,

(3) x_t = f(s_1, …, s_t, ε_t),

where ε_t is an additional noise variable.
This problem can generally be understood as inference of unobserved or latent configurations from time-series observations. It is convenient to assume that the sources, s_t, are stochastic random variables with well-understood and interpretable noise, such as Gaussian or logistic variables with independence constraints. Representable as a directed graphical model in time, the choice of a-priori model structure, such as the relationship between latent variables and observations, can have consequences on model capacity and inference complexity.
Directed graphical models often require complex approximate inference, which introduces variance into learning. Rather than solving the general problem in Equation 3, we will assume that the generating function, f, is noiseless and that the source sequences, S, have the same dimensionality as the data, with each source signal being composed of a set of conditionally independent components whose density is parameterized by a recurrent neural network (RNN). We will show that the learning objective closely resembles that of noiseless independent component analysis (ICA). Assuming generation is noiseless and preserves dimensionality reduces variance which would otherwise hinder learning with high-dimensional, low-sample-size data, such as fMRI.

2.1 Independent component analysis
ICA (Bell and Sejnowski, 1995) hypothesizes that the observed data is a linear mixture of independent sources: x = As = Σ_i a_i s_i, where s are the sources and the a_i are the columns of a mixing matrix, A. ICA constrains the sources (a.k.a., components) to be maximally independent. This framework does not presuppose any specific definition of component independence, and the algorithms widely used for fMRI typically fall under two primary families, kurtosis-based methods and infomax (Hyvärinen and Oja, 2000), although there are other algorithms providing more flexible density estimation (Fu et al., 2014).

For the infomax algorithm (Bell and Sejnowski, 1995), the model is parameterized by an unmixing matrix, W = A^{-1}, such that s = Wx. In the context of fMRI, the infomax objective seeks to minimize the mutual information of the sources for all subjects at all times. This can be shown to be equivalent to assuming that the prior density of the sources is non-Gaussian and that it factorizes, or p(s) = Π_i p(s_i), where s is an n-dimensional vector. When the sources are drawn from a logistic distribution, it can be shown that infomax is equivalent to maximum likelihood estimation (MLE), with the log-likelihood objective for the empirical density, p̂(x), being transformed by W:

(4) log p(x) = Σ_{i=1}^{n} log p(s_i) + log |det W|, with s = Wx,

where |det W| is the absolute value of the determinant of the Jacobian matrix.
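As a concrete illustration, the objective in Equation 4 can be evaluated directly. The sketch below assumes a standard logistic source prior; the function and variable names are our own, not from any particular ICA package:

```python
import numpy as np

def ica_loglik(X, W):
    """Per-sample MLE/infomax ICA log-likelihood (Eq. 4) under a
    standard logistic source prior.  X: (samples, n); W: (n, n)
    unmixing matrix.  log p(x) = sum_i log p_logistic((Wx)_i)
    + log |det W|."""
    S = X @ W.T                            # candidate sources s = Wx
    # standard logistic log-density: -s - 2*log(1 + exp(-s)),
    # written with logaddexp for numerical stability
    log_ps = -S - 2.0 * np.logaddexp(0.0, -S)
    _, logdet = np.linalg.slogdet(W)       # log |det W|
    return log_ps.sum(axis=1) + logdet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
ll = ica_loglik(X, np.eye(4))              # one log-likelihood per sample
```

Maximizing the sum of `ll` over samples with respect to `W` recovers the infomax/MLE estimate.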
With ICA, example sequences can be generated by applying the inverse of the unmixing matrix to an ordered set of sources. However, one cannot simply sample from the model to generate samples of the observed data: any attempt to do so would produce unordered data, not true sequences. The sources in ICA are constrained to be marginally independent in time; ICA does not explicitly model dynamics, and training on shuffled observed sequences will regularly produce the same source structure.
There are numerous graphical models and methods designed to model sequences, including hidden Markov models (HMMs) and sequential Monte Carlo (SMC, Doucet et al., 2001). HMMs are popular and simple generative directed graphical models in time with tractable inference and learning, and a traditional approach to modeling language. However, HMMs place a high burden on the hidden states, which must encode enough long-range dynamics to model entire sequences. Recurrent neural networks (RNNs), on the other hand, have the capacity to encode long-range dependencies through deterministic hidden units. When used in conjunction with the ICA objective, the resulting algorithm is a novel and, as we will show, much more powerful approach for blind source separation based on a conditional independence assumption.

2.2 Recurrent neural networks
An RNN is a type of neural network with cyclic connections that has seen widespread success in neural machine translation (Cho et al., 2014), sequence-to-sequence learning (Sutskever et al., 2014), sequence generation (Graves, 2013), and numerous other settings. When computing the internal state across a sequence index (such as time or word/character position), RNNs apply the same set of parameters (i.e., connective weights) at each step. This gives the model the properties of translational symmetry and directed dependence across time, which are desirable if we expect directed dependence with the same update rules across the sequence. In addition, this makes RNNs relatively memory-efficient, as one set of parameters is used across the sequence dimension.

RNNs have many forms, but we will focus on those that act as probabilistic models of sequences, i.e.:

(5) p(x_1, …, x_T) = Π_{t=1}^{T} p(x_t | x_1, …, x_{t-1}).
Better known as a “language model” or generative RNN, the exact form of the conditional density typically falls under a family of transformations,

(6) p(x_t | x_1, …, x_{t-1}) = p(x_t; f_o(h_t; θ_o)), with h_t = f_r(x_{t-1}, h_{t-1}; θ_r),

where the h_t are a set of deterministic recurrent states (or “recurrent units”). The recurrent connections, f_r, take the current observation and hidden state as input and output the next recurrent state. The output connections, f_o, take the recurrent states as input at each step and output the parameters for the conditional distribution. Note that the model parameters, θ_r and θ_o, are recurrent: the same parameters are used at every time step and are not unique across the sequence index, t.
The most canonical RNN for sequence modeling has a simple parameterization (e.g., see Figure 1):

(7) h_t = tanh(W_r h_{t-1} + W_x x_{t-1} + b),

where W_r is a square matrix of recurrent weights, W_x are the input weights, and b is a bias term. The mappings between the various variables in the model need not be shallow: an RNN with deep neural networks can model more complex recurrent transitions. Parameterizations that use gating and other types of memory functions, such as long short-term memory (LSTM, Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRUs, Cho et al., 2014), can be used to better model longer sequences and are also widely used.

Training an RNN for simple sequence modeling is easily done with the backpropagation algorithm, using the negative log-likelihood objective over the output conditional distributions:

(8) L(θ) = −Σ_{t=1}^{T} log p(x_t | x_1, …, x_{t-1}; θ).
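As a minimal illustration of Equations 7 and 8, the sketch below computes the sequence negative log-likelihood for a vanilla tanh RNN. For simplicity it assumes a unit-variance Gaussian output distribution (rather than the logistic used later for RNN-ICA), and all dimensions, weights, and names are illustrative:

```python
import numpy as np

def rnn_nll(X, Wr, Wx, b, Wo, bo):
    """Negative log-likelihood (Eq. 8) of a sequence X, shape (T, n),
    under the vanilla tanh RNN of Eq. 7.  The output layer here emits
    the mean of a unit-variance Gaussian over the next observation;
    the same weights are shared across all time steps."""
    T, n = X.shape
    h = np.zeros(Wr.shape[0])                # initial recurrent state
    nll = 0.0
    for t in range(T):
        mu = Wo @ h + bo                     # parameters of p(x_t | x_1..x_{t-1})
        nll += 0.5 * np.sum((X[t] - mu) ** 2) + 0.5 * n * np.log(2 * np.pi)
        h = np.tanh(Wr @ h + Wx @ X[t] + b)  # recurrent update (Eq. 7)
    return nll

rng = np.random.default_rng(1)
n, m = 3, 8                                  # illustrative data/hidden sizes
X = rng.standard_normal((20, n))
Wr = 0.1 * rng.standard_normal((m, m))
Wx = 0.1 * rng.standard_normal((m, n))
b, Wo, bo = np.zeros(m), 0.1 * rng.standard_normal((n, m)), np.zeros(n)
loss = rnn_nll(X, Wr, Wx, b, Wo, bo)
```

In practice the loop would be differentiated with an autodiff framework and the loss minimized by gradient descent, as described next.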
Typically, for efficiency, the loss is computed with minibatches instead of over the entire dataset, randomizing the minibatches at each training epoch. The marginal density, p(x_1), can be learned by fitting to the average marginal across time, either by fitting the parameters of a target distribution directly or by training a neural network to predict the hidden state that generates x_1 (Bahdanau et al., 2014).

3 Methods
3.1 RNN-ICA
RNNs have already been shown to be capable of predicting signal from BOLD fMRI data (Güçlü and van Gerven, 2017), though usually in the supervised setting. An unsupervised RNN framework for sequence modeling can easily be extended to incorporate the infomax objective. Define, as with ICA, a linear transformation from each observation to a source configuration, s_t = W x_t, and define a high-kurtosis and factorized source distribution (such as a logistic or Laplace distribution) for each time step, t, and each fMRI sequence. We apply this transformation to an fMRI time series: S = (W x_1, …, W x_T). The log-likelihood function over the whole sequence, X, can be reparameterized as:

(9) log p(x_1, …, x_T) = Σ_{t=1}^{T} ( Σ_{i=1}^{n} log p(s_{t,i}; φ_{t,i}) + log |det W| ),

where W is the Jacobian of the transformation, s_t = W x_t, and the source distribution, p(s_t; φ_t), has parameters determined by the recurrent states, h_t. A high-kurtosis distribution is desirable to ensure independence of the sources (or minimizing the mutual information, e.g., the infomax objective, Bell and Sejnowski, 1995), so a reasonable choice for the outputs of the RNN at each time step are the mean, μ_{t,i}, and scale, σ_{t,i}, of a logistic distribution:

(10) p(s_{t,i}; μ_{t,i}, σ_{t,i}) = exp(−(s_{t,i} − μ_{t,i}) / σ_{t,i}) / ( σ_{t,i} (1 + exp(−(s_{t,i} − μ_{t,i}) / σ_{t,i}))^2 ).
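A minimal sketch of the RNN-ICA objective in Equations 9 and 10 follows. Here the logistic parameters are passed in directly, whereas in the full model they are produced by the RNN at each step; the function names are our own:

```python
import numpy as np

def logistic_logpdf(s, mu, sigma):
    """Elementwise log-density of the logistic distribution (Eq. 10)."""
    z = (s - mu) / sigma
    return -z - 2.0 * np.logaddexp(0.0, -z) - np.log(sigma)

def rnn_ica_loglik(X, W, mus, sigmas):
    """Sequence log-likelihood of Eq. 9 with s_t = W x_t.  The logistic
    parameters (mus, sigmas) are treated as given here; in the full
    model they are the RNN outputs.  X, mus, sigmas: (T, n); W: (n, n).
    The log |det W| term appears once per time step."""
    S = X @ W.T
    _, logdet = np.linalg.slogdet(W)
    return logistic_logpdf(S, mus, sigmas).sum() + X.shape[0] * logdet

T, n = 10, 4
ll = rnn_ica_loglik(np.zeros((T, n)), np.eye(n),
                    np.zeros((T, n)), np.ones((T, n)))
```

Training maximizes this quantity jointly over the unmixing matrix W and the RNN parameters that generate the per-step means and scales.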
Figure 2 illustrates the network structure for a few time steps as well as the forward and backpropagated signal, and Algorithm 1 demonstrates the training procedure for RNN-ICA. In our model, all network parameters and the ICA weight / unmixing matrix, W, are the same for all subjects at all times. Our treatment assumes the ICA weight matrix is square, which is necessary to ensure a tractable Jacobian determinant and inverse. fMRI data is very high dimensional, so we must resort to some form of dimensionality reduction as preprocessing. A widely used method for dimensionality reduction in ICA studies of fMRI is principal component analysis (PCA) (Calhoun et al., 2001b; Allen et al., 2012), used to reduce the data to match the selected number of sources.
Note that RNNs with deeper architectures have been very successful for generative tasks (e.g., WaveNets, Van Den Oord et al., 2016), and RNN-ICA could benefit from a deeper architecture capable of inferring more complex relationships in the data. However, as fMRI datasets often comprise a low number of training samples, we found it necessary to demonstrate the ability of RNN-ICA to learn meaningful sources with a simple RNN architecture. We leave architectural improvements for RNN-ICA to future research.
4 Experiments and Results
We first apply RNN-ICA to synthetic data simulated to evaluate model performance, and subsequently to real functional magnetic resonance imaging (fMRI) data. fMRI analyses typically fall under two categories: task-based and resting-state analysis. Task experiments typically involve subjects being exposed to a time series of stimuli, from which task-specific components can be extrapolated. In the case of RNN-ICA, this should reveal task-related directed connectivity and spatial variability, in addition to the usual task-relatedness of activity from ICA. Resting-state data is often used to confirm the presence of distinct functional states of the brain. We chose a resting-state dataset with simultaneous electroencephalography (EEG), from which ground-truth subject neurobiological states could be derived. With RNN-ICA, we should be able to find a correspondence between predicted activation as defined in our model and changes in state. This should provide a means to prevent false positives or negatives when interpreting resting-state networks or inter-group differences owing to (systematically) different sleep stages present in the examined cohorts.
4.1 Experiments with simulated data
To test the model, we generated synthetic data using the SimTB toolbox (Erhardt et al., 2012) in a framework developed to assess dynamic functional connectivity (Allen et al., 2012) and described in Figure 1 of Lehmann et al. (2017). Subjects corresponding to two groups, simulated healthy (SimHC) and simulated schizophrenia patients (SimSZ), were generated. A set of time courses was generated for each SimHC and SimSZ subject with the constraint that they have five states (covariance patterns) and a per-group transition probability matrix that dictates state transitions, derived from prior work on real data (Damaraju et al., 2014). The initial state probabilities were also derived from that work. A sequence of 480 time points with a TR of seconds was generated. A total of subjects ( per group) were generated, of which the first from each group were used during training and the remaining samples were used to test the model. The parameters of the hemodynamic response model (delay, undershoot, etc.) used to simulate the data were also varied per subject to introduce some heterogeneity. The known initial state of a subject and a transition probability matrix that governs transitions ensured a ground-truth state-transition vector (a vector of transitions between the five simulated states, unique to each subject).
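The ground-truth state vectors described above can be sketched as samples from a Markov chain. The transition matrix and initial distribution below are illustrative placeholders, not the matrices estimated from real data in Damaraju et al. (2014):

```python
import numpy as np

def sample_state_vector(P, pi0, T, rng):
    """Sample a length-T state-transition vector from a Markov chain
    with transition matrix P (rows sum to 1) and initial distribution
    pi0, mirroring how each simulated subject's five-state ground
    truth is generated."""
    states = [int(rng.choice(len(pi0), p=pi0))]
    for _ in range(T - 1):
        states.append(int(rng.choice(P.shape[1], p=P[states[-1]])))
    return np.array(states)

rng = np.random.default_rng(0)
P = np.full((5, 5), 0.05) + 0.75 * np.eye(5)   # "sticky" 5-state chain
pi0 = np.full(5, 0.2)                          # uniform initial states
state_vector = sample_state_vector(P, pi0, 480, rng)
```

Each simulated subject's time courses are then drawn under the covariance pattern of the active state at each time point.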
An RNN-ICA model was then trained on the subject training data for epochs with model parameters similar to those in subsequent sections. The resultant sources, the source-distribution parameters predicted by the RNN (the means, μ, and scales, σ), and the RNN hidden-unit activations for each subject were then correlated with the subject's ground-truth state vector. The trained model was then run on the test data, and correlations were again computed between the model outputs and state vectors. We then computed group differences between the correlation distributions of the SimHC and SimSZ groups; these are summarized for both training and test cases in Figure 3. Our results show that RNN-ICA generalized group differences well to the test set in this setting, as represented in the hidden-state activations and scaling factor.
4.2 Task experiments
To demonstrate the properties and strengths of our model, we apply our method to task fMRI data. The data used in this work comprise task-related scans from healthy participants and subjects diagnosed with schizophrenia, all of whom gave written, informed, Hartford Hospital and Yale IRB-approved consent at the Institute of Living and were compensated for their participation. All participants were scanned during an auditory oddball task (AOD) involving the detection of an infrequent target sound within a series of standard and novel sounds. More detailed information regarding participant demographics and task details is provided in Swanson et al. (2011).
Scans were acquired at the Olin Neuropsychiatry Research Center at the Institute of Living/Hartford Hospital on a Siemens Allegra 3T dedicated head scanner equipped with 40 gradients and a standard quadrature head coil. The functional scans were acquired transaxially using gradient-echo echo-planar imaging with the following parameters: repetition time (TR) 1.50 s, echo time (TE) 27 ms, field of view 24 cm, acquisition matrix , flip angle , voxel size , slice thickness 4 mm, gap 1 mm, 29 slices, ascending acquisition. Six “dummy” scans were acquired at the beginning to allow for longitudinal equilibrium, after which the paradigm was automatically triggered to start by the scanner. The final AOD dataset consisted of volumes for each subject.
Data underwent standard preprocessing steps using the SPM software package (see Calhoun et al., 2008, for further details). Subject scans were masked below the global mean image, then each voxel was variance normalized. Each voxel time course was then detrended using a 4th-degree polynomial fit, and this was repeated for all subjects. PCA was applied to the complete dataset without whitening, and the first components were kept to reduce the data. Finally, each PCA component had its mean removed before being entered into the model.
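The detrending and PCA-reduction steps above can be sketched as follows, assuming data arranged as time points × voxels; the function names and dimensions are our own, not SPM's:

```python
import numpy as np

def detrend_poly(tc, degree=4):
    """Remove a polynomial trend (4th-degree, as in the text) from each
    column of tc, arranged as (time points, voxels)."""
    t = np.arange(tc.shape[0], dtype=float)
    coeffs = np.polynomial.polynomial.polyfit(t, tc, degree)
    return tc - np.polynomial.polynomial.polyval(t, coeffs).T

def pca_reduce(X, r):
    """Project mean-removed data X (time points, voxels) onto its first
    r principal components, without whitening."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:r].T          # (time points, r) component time courses

rng = np.random.default_rng(2)
t = np.arange(100, dtype=float)
raw = np.outer(t ** 2, np.ones(6)) * 1e-3 + rng.standard_normal((100, 6))
detrended = detrend_poly(raw)
reduced = pca_reduce(detrended, 3)
```

The reduced time courses are what enter the model in place of the full voxel data.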
4.2.1 Model and setup
For use in RNNs, the data was then segmented into windowed data, shuffled, and arranged into random batches. Each subject's PCA loading matrix was comprised of PCA time courses of length . These were segmented into equal-length windowed slices using a fixed window size and stride. The number of components roughly corresponds to the number found in other studies (Calhoun et al., 2001b; Allen et al., 2012; Calhoun et al., 2008), and the chosen window length has been shown to provide a good tradeoff between capturing dynamics and not being overly sensitive to noise (Vergara et al., 2017). The final dataset was comprised of volumes for each of the subjects with PCA time courses each. These were then randomly shuffled at each epoch into batches of volumes each, drawn from random subjects and time points.

We used a simple RNN with recurrent hidden units and a recurrent parameterization as in Equation 7, as we do not anticipate needing to model long-range dependencies that necessitate gated models (Hochreiter and Schmidhuber, 1997). The initial hidden state of the RNN was computed by a feed-forward network with softplus units using dropout. An additional decay cost was imposed on the unmixing matrix, W, for additional regularization. The model was trained using the RMSProp algorithm (Hinton, 2012).

4.2.2 Results
ID  Label  |  Sources: Targets  Novels  |  Means: Targets  Novels  |  Scale: Targets
2  FF  3.1e13 (+)  2.1e08 (+)  
13  CG  2.2e13 (+)  1.5e11 (+)  
14  STG ()  4.9e08 ()  1.4e08 ()  
16  Temp.  2.2e10 (+)  1.9e17 (+)  
17  L FP (+)  2.6e11 (+)  
20  Precun. (+)  8.5e09 (+)  
21  MeFG.  4.1e08 (+)  
34  DMN  1.0e18 ()  1.3e18 ()  1.8e10 (+)  
42  IPL  4.7e19 (+)  4.4e08 (+)  1.3e14 (+)  1.1e08 (+)  
43  MTG ()  2.1e19 (+)  7.0e15 (+)  4.4e15 (+)  1.2e14 (+)  
45  AG (+)  3.9e09 ()  9.4e08 ()  
46  Cere. (+)  5.4e10 (+)  
47  AG (+)  1.4e12 ()  4.5e10 (+)  6.6e13 ()  2.0e11 ()  
57  PCing  1.7e09 () 
p-values from a one-sample t-test over beta-values for the target and novel stimulus for the sources, s, the predicted means, μ, and the predicted scale factor, σ. Beta-values were found for each subject and component using multiple regression onto the target and novel stimulus, and t-tests were performed for each component over all subjects. Among the most significant task-related components to the target stimulus are the middle temporal gyrus, default mode network, and the parietal lobule. A legend for ROI label names can be found in the caption of Figure 4. The (+/−) in the label name specifies the sign of the map in Figure 4, while the (+/−) in the p-values specifies the sign of the corresponding t-value.

Figure 4 shows the back-reconstructed spatial maps. The spatial maps were filtered from the original set, omitting white matter, ventricle, and motion-artifact features. Each of the spatial maps, along with its respective time courses, was sign-flipped to ensure that each back-reconstructed distribution of voxels had positive skew. The maps are highly analogous to those typically found by linear ICA (Calhoun et al., 2001b; Allen et al., 2012), though with more combined positive/negative features in one map.
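The beta-value/t-test procedure described in the caption can be sketched on synthetic data; the regressors and subject count below are hypothetical stand-ins for the actual target/novel stimulus design:

```python
import numpy as np
from scipy import stats

def task_betas(tc, design):
    """Least-squares betas from regressing one component time course
    (length T) onto stimulus regressors (T, k) plus an intercept."""
    D = np.column_stack([np.ones(len(tc)), design])
    beta, *_ = np.linalg.lstsq(D, tc, rcond=None)
    return beta[1:]                    # drop the intercept

# Synthetic stand-in: 30 "subjects" whose component responds to the
# first (target-like) regressor only; the one-sample t-test across
# subjects then mirrors the Table 1 procedure.
rng = np.random.default_rng(3)
design = rng.standard_normal((200, 2))          # hypothetical target/novel regressors
betas = np.array([
    task_betas(design @ np.array([0.5, 0.0]) + rng.standard_normal(200), design)
    for _ in range(30)
])
tvals, pvals = stats.ttest_1samp(betas, 0.0, axis=0)
```

The sign of each t-value indicates the direction of the component's task relationship, as reported in Table 1.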
Figure 5 shows the functional network connectivity (FNC, Calhoun et al., 2001b; Jafri et al., 2008) matrix, in which the components are grouped according to a multi-level community algorithm (Blondel et al., 2008) using the symmetric temporal cross-correlation matrix. For each subject and component, we performed multiple linear regression of the sources, s, the predicted means, μ, and the predicted scale factor, σ, onto the target and novel stimulus. Table 1 shows the p-values from a one-sample t-test on the beta values across subjects for components passing a significance threshold. Many components show similar task-relatedness across the source time courses and predicted means, notably temporal gyrus features, the parietal lobule, and the default mode network (DMN, which is negatively correlated). In addition, the DMN shows the strongest task-relatedness in the scale factor.

In order to analyze how the RNN encodes dynamics, we analyze the Jacobian of the predicted mean of each component at time t with respect to all components at previous times, t′ < t:
(11) J_{ij}(t, t′) = ∂μ_{t,i} / ∂s_{t′,j}.
The derivatives are tractable, as the means, μ_{t,i}, are differentiable functions w.r.t. the inputs. These derivatives can be interpreted as a measure of directed connectivity between components in time, as they represent the predicted change of a future component (as understood through the change of its mean value) given the change of a previous component. While the full Jacobian provides directed connectivity between sources at all pairs of times, t′ < t, to simplify analysis we only looked at next-time terms, t′ = t − 1.
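For a simple tanh RNN with a linear mean output, the next-step Jacobian has a closed form by the chain rule. The sketch below uses our own notation (a linear output layer is an assumption; a deeper output head would require autodiff instead):

```python
import numpy as np

def next_step_jacobian(h_t, Wo, Wx):
    """d mu_{t,i} / d s_{t-1,j} for a tanh RNN with a linear mean
    output, mu_t = Wo h_t + bo, and h_t = tanh(Wr h_{t-1} + Wx s_{t-1} + b).
    By the chain rule this is Wo @ diag(1 - h_t**2) @ Wx: entry (i, j)
    measures how perturbing source j at time t-1 moves the predicted
    mean of source i at time t, i.e., directed connectivity."""
    return Wo @ np.diag(1.0 - h_t ** 2) @ Wx

# Illustrative weights and states (not fitted to data).
rng = np.random.default_rng(4)
n, m = 3, 5
Wo = rng.standard_normal((n, m))
Wr = 0.5 * rng.standard_normal((m, m))
Wx = rng.standard_normal((m, n))
b = rng.standard_normal(m)
h_prev, s_prev = rng.standard_normal(m), rng.standard_normal(n)
h_t = np.tanh(Wr @ h_prev + Wx @ s_prev + b)
J = next_step_jacobian(h_t, Wo, Wx)        # (n, n) directed-connectivity matrix
```

Averaging |J| across time steps and subjects yields the edge weights used in the directed-connectivity graphs.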
A representative graph is given in Figure 6, where the thickness of the edges represents the strength of the directed connection as averaged across time and subjects with the sign removed (i.e., using absolute values). The color/grouping of the nodes corresponds to the similarity in directed connectivity as measured by the Pearson correlation coefficient:

(12) ρ_{ij} = cov(J_i, J_j) / (σ_{J_i} σ_{J_j}),

where cov(·, ·) is the covariance and σ_{J_i} is the standard deviation of component i's connectivity profile across the components indexed by j. Grouping was done by constructing an undirected graph using the Pearson coefficients and clustering the vertices using the same community-based hierarchical algorithm as with the FNC above. An example directed connectivity graph with the spatial maps is given in Figure 7.
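Equation 12 applied to rows of an averaged |Jacobian| matrix is exactly what `np.corrcoef` computes; a minimal sketch with illustrative data:

```python
import numpy as np

# Row i of Jbar holds component i's time/subject-averaged |Jacobian|
# connectivity profile to all components; np.corrcoef(Jbar) is then
# the matrix of Eq. 12 Pearson coefficients between profiles.
rng = np.random.default_rng(6)
Jbar = np.abs(rng.standard_normal((10, 10)))   # illustrative averaged |J|
R = np.corrcoef(Jbar)                          # rho_ij, shape (10, 10)
```

The symmetric matrix `R` is then thresholded into an undirected graph before community clustering.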
Each of the next-step Jacobian terms was used as a time course in a multiple regression onto the target and novel stimulus, with significance tested using a one-sample t-test, as with the time courses, and a two-sample t-test across groups. The resulting task-related directed connectivity is represented in Figure 8 for both targets and novels, with an example graph with spatial maps presented in Figure 9. Group-differentiating relationships are given in Figure 10, with an example graph with spatial maps given in Figure 9.
4.3 Resting state experiments
We evaluated our model on resting-state data to show that RNN-ICA is a viable model and to demonstrate that properties of the network correspond to wake/sleep states. Resting-state functional MRI data was collected from subjects for minutes each ( volumes, TR = 2.08 s) with a Siemens 3T Trio scanner while the subjects transitioned from wakefulness to at most sleep stage N3 (see Tagliazucchi et al., 2012, for more details). This study was approved by the ethics committee of Goethe University. Simultaneous EEG was acquired, facilitating sleep staging per AASM criteria and resulting in a hypnogram per subject (a vector assignment of consecutive 30 s EEG epochs to one of the wakefulness (W), N1, N2, or N3 sleep stages). We discarded the first time points to account for T1 equilibration effects.
After performing rigid-body realignment and slice-timing correction, subject data was warped to MNI space using SPM12. Voxel time courses were then despiked using AFNI. We then regressed out of the voxel time courses their head-motion parameters (and their derivatives and squares) and their mean white-matter and CSF signals. Next, we band-pass filtered the data with a passband of 0.01–0.15 Hz. We extracted mean ROI time courses from nodes defined in BioImage Suite (Papademetris et al., 2006) and reported in Shen et al. (2013).
4.3.1 Model and setup
We used the same model and training procedure as with our task data analysis in the previous section. Of the 55 subjects, 50 subjects were used during training and 5 subjects were left out for testing. We then examined the correspondence between hidden recurrent units of the trained model and subject hypnogram as well as between mean and scale of predictive source distribution and hypnogram. Similar tests were run on the model outputs on the 5 left out test cases.
4.3.2 Results
The activity of several hidden recurrent units of the trained model was predictive of wakefulness across all subjects (see Figure 11 for an example subject). The RNN hidden-unit activity (bound between −1 and 1) stays at the extremes during the awake state, exhibiting a higher standard deviation, and the activity tends towards zero with a lower standard deviation as the subject transitions from wakefulness to sleep. A one-way ANOVA on the absolute mean and standard deviation of hidden-unit activity by hypnogram state shows significant group differences in both mean and standard deviation. Subsequent post-hoc t-tests reveal significant reductions in both from wakefulness and the light-sleep N1 state to the deeper N2 and N3 sleep stages, and also between the N2 and N3 states (all p-values corrected for multiple comparisons). In addition, the scaling factor tended to correlate well with changes of state, as measured by correlation with a smoothed derivative of the hypnogram. Figure 12 shows the correlation coefficients of the RNN hidden units and the component scale factors with the subject hypnogram vector. Several hidden states show consistent correlation with hypnograms, indicating the RNN is encoding subject sleep state. Similarly, some component scale factors also encode sleep states. Surprisingly, however, the source time courses, s, and the means, μ, did not. Finally, some component scale factors correlate somewhat consistently with changes in state across subjects. This indicates that the model is encoding changes of state in terms of uncertainty.
5 Discussion and conclusion
5.1 Summary
In this work, we demonstrate how recurrent neural networks can be used to separate conditionally independent sources, analogous to independent component analysis but with the benefit of modeling temporal dynamics through recurrent parameters. Results show that this approach is effective for modeling both task-related and resting-state functional magnetic resonance imaging (fMRI) data. Using this approach, we are able to separate components similar to those of ICA, with the additional benefit of directly analyzing temporal dynamics through the recurrent parameters.
Notably, in addition to finding similar maps and taskrelatedness as with ICA, we are able to derive directed temporal connectivity which is taskrelated and groupdifferential, and these are derived directly from the parameters of the RNN. In addition, for resting state data, we found that some hidden unit activity corresponded very well with wake/sleep states and that the uncertainty factor was consistent with changes of state, both of which were learned in a completely unsupervised way.
5.2 Related work
Our method introduces deep and nonlinear computations in time to maximum likelihood estimation independent component analysis (MLE ICA) without sacrificing the simplicity of linear relationships between source and observation. MLE ICA has a learning objective equivalent to that of infomax ICA, widely used in fMRI studies, in which the sources are drawn from a factorized logistic distribution (Hyvärinen et al., 2004). While the model learns a linear transformation between data and sources through the unmixing matrix, the source dynamics are encoded by a deep nonlinear transformation with recurrent structure, as represented by an RNN. Alternative nonlinear parameterizations of the ICA transformation that use deep neural networks have been shown to work with fMRI data (Castro et al., 2016). Such approaches allow for deep and nonlinear static spatial maps and are compatible with our learning objective. Temporal ICA as used in group ICA (Calhoun et al., 2009), like spatial ICA, does capture some temporal dynamics, but only as summaries through a one- to two-stage PCA preprocessing step. These temporal summaries can be analyzed, but they are not learned as part of an end-to-end learning objective. Overall, the strength of RNN-ICA compared to these methods is that the dynamics are directly learned as model parameters, which allows for richer and higher-order temporal analyses, as we showed in the previous section.
Recurrent neural networks do not typically incorporate latent variables, as this requires expensive inference. Versions that incorporate stochastic latent variables exist and are trainable via variational methods, with working approaches for sequential data (Chung et al., 2015). However, these require complex inference which introduces variance into learning that may make training with fMRI data challenging. Our method instead incorporates concepts from noiseless ICA, which reduces inference to the inverse of a generative transformation. The consequence is that the temporal analyses are relatively simple, relying on only the tractable computation of the Jacobian of component conditional densities given the activations.
5.3 Future work
The RNN-ICA model provides a unique mode of analysis previously unavailable to fMRI research. Results are encouraging, in that we were able to find both task-related and group-differentiating directed connectivity; however, the broader potential of this approach is unexplored. It is our belief that this method will expand neuroscience research that involves temporal data, leading to new and significant conclusions.
Finally, the uncertainty factor in our resting-state experiments may indicate a novel application of RNN-ICA to imaging data: change-of-state detection. The model we employed was simple and was not intended to take advantage of this. It is quite possible that further modifications could produce a model that reliably predicts changes of state in fMRI and EEG data.
6 Acknowledgements
This work was supported in part by National Institutes of Health grants 2R01EB005846, P20GM103472, and R01EB020407, and National Science Foundation grant #1539067.
References
 Allen et al. (2012) Allen, E. A., Erhardt, E. B., Wei, Y., Eichele, T., and Calhoun, V. D. (2012). Capturing intersubject variability with group independent component analysis of fMRI data: a simulation study. Neuroimage, 59(4):4141–4159.
 Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 Bell and Sejnowski (1995) Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159.
 Biswal et al. (1995) Biswal, B., Zerrin Yetkin, F., Haughton, V. M., and Hyde, J. S. (1995). Functional connectivity in the motor cortex of resting human brain using echo-planar MRI. Magnetic Resonance in Medicine, 34(4):537–541.
 Blondel et al. (2008) Blondel, V. D., Guillaume, J.L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008.
 Calhoun et al. (2001a) Calhoun, V., Adali, T., Pearlson, G., and Pekar, J. (2001a). Spatial and temporal independent component analysis of functional MRI data containing a pair of task-related waveforms. Human Brain Mapping, 13(1):43–53.
 Calhoun and Adali (2012) Calhoun, V. D. and Adali, T. (2012). Multisubject independent component analysis of fMRI: a decade of intrinsic networks, default mode, and neurodiagnostic discovery. IEEE Reviews in Biomedical Engineering, 5:60–73.
 Calhoun et al. (2001b) Calhoun, V. D., Adali, T., Pearlson, G. D., and Pekar, J. J. (2001b). A method for making group inferences from functional MRI data using independent component analysis. Human Brain Mapping, 14.
 Calhoun et al. (2008) Calhoun, V. D., Kiehl, K. A., and Pearlson, G. D. (2008). Modulation of temporally coherent brain networks estimated using ICA at rest and during cognitive tasks. Human Brain Mapping, 29(7):828–838.
 Calhoun et al. (2009) Calhoun, V. D., Liu, J., and Adalı, T. (2009). A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data. Neuroimage, 45(1):S163–S172.
 Castro et al. (2016) Castro, E., Hjelm, R. D., Plis, S., Dinh, L., Turner, J., and Calhoun, V. (2016). Deep independence network analysis of structural brain imaging: Application to schizophrenia. IEEE.
 Cho et al. (2014) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 Chung et al. (2015) Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. (2015). A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2962–2970.
 Cox (1996) Cox, R. W. (1996). AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research, 29(3):162–173.
 Damaraju et al. (2014) Damaraju, E., Allen, E. A., Belger, A., Ford, J., McEwen, S., Mathalon, D., Mueller, B., Pearlson, G., Potkin, S., Preda, A., et al. (2014). Dynamic functional connectivity analysis reveals transient states of dysconnectivity in schizophrenia. NeuroImage: Clinical, 5:298–308.
 Damoiseaux et al. (2006) Damoiseaux, J., Rombouts, S., Barkhof, F., Scheltens, P., Stam, C., Smith, S. M., and Beckmann, C. (2006). Consistent resting-state networks across healthy subjects. Proceedings of the National Academy of Sciences, 103(37):13848–13853.
 Doucet et al. (2001) Doucet, A., De Freitas, N., and Gordon, N. (2001). An introduction to sequential monte carlo methods. In Sequential Monte Carlo methods in practice, pages 3–14. Springer.
 Erhardt et al. (2012) Erhardt, E. B., Allen, E. A., Wei, Y., Eichele, T., and Calhoun, V. D. (2012). SimTB, a simulation toolbox for fMRI data under a model of spatiotemporal separability. Neuroimage, 59(4):4160–4167.
 Fu et al. (2014) Fu, G.S., Phlypo, R., Anderson, M., Li, X.L., et al. (2014). Blind source separation by entropy rate minimization. IEEE Transactions on Signal Processing, 62(16):4245–4255.
 Graves (2013) Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
 Güçlü and van Gerven (2017) Güçlü, U. and van Gerven, M. A. (2017). Modeling the dynamics of human brain activity with recurrent neural networks. Frontiers in computational neuroscience, 11.
 Hinton (2012) Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
 Hjelm et al. (2014) Hjelm, R. D., Calhoun, V. D., Salakhutdinov, R., Allen, E. A., Adali, T., and Plis, S. M. (2014). Restricted Boltzmann machines for neuroimaging: an application in identifying intrinsic networks. NeuroImage, 96:245–260.
 Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
 Hyvärinen et al. (2004) Hyvärinen, A., Karhunen, J., and Oja, E. (2004). Independent component analysis, volume 46. John Wiley & Sons.
 Hyvärinen and Oja (2000) Hyvärinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural networks, 13(4):411–430.
 Jafri et al. (2008) Jafri, M. J., Pearlson, G. D., Stevens, M., and Calhoun, V. D. (2008). A method for functional network connectivity among spatially independent resting-state components in schizophrenia. Neuroimage, 39(4):1666–1681.
 Kim et al. (2008) Kim, D., Burge, J., Lane, T., Pearlson, G. D., Kiehl, K. A., and Calhoun, V. D. (2008). Hybrid ICA–Bayesian network approach reveals distinct effective connectivity differences in schizophrenia. Neuroimage, 42(4):1560–1568.
 Lehmann et al. (2017) Lehmann, B., White, S., Henson, R., CamCAN, and Geerligs, L. (2017). Assessing dynamic functional connectivity in heterogeneous samples. NeuroImage, 157:635–647.
 Papademetris et al. (2006) Papademetris, X., Jackowski, M. P., Rajeevan, N., DiStasio, M., Okuda, H., Constable, R. T., and Staib, L. H. (2006). Bioimage suite: An integrated medical image analysis suite: An update. The insight journal, 2006:209.
 Plis et al. (2013) Plis, S. M., Hjelm, R. D., Salakhutdinov, R., and Calhoun, V. D. (2013). Deep learning for neuroimaging: a validation study. Human Brain Mapping.
 Shen et al. (2013) Shen, X., Tokoglu, F., Papademetris, X., and Constable, R. T. (2013). Groupwise whole-brain parcellation from resting-state fMRI data for network node identification. Neuroimage, 82:403–415.
 Smith et al. (2009) Smith, S. M., Fox, P. T., Miller, K. L., Glahn, D. C., Fox, P. M., Mackay, C. E., Filippini, N., Watkins, K. E., Toro, R., Laird, A. R., et al. (2009). Correspondence of the brain’s functional architecture during activation and rest. Proceedings of the National Academy of Sciences, 106(31):13040–13045.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
 Swanson et al. (2011) Swanson, N., Eichele, T., Pearlson, G., Kiehl, K., Yu, Q., and Calhoun, V. D. (2011). Lateral differences in the default mode network in healthy controls and patients with schizophrenia. Human Brain Mapping, 32(4):654–664.
 Tagliazucchi et al. (2012) Tagliazucchi, E., von Wegner, F., Morzelewski, A., Borisov, S., Jahnke, K., and Laufs, H. (2012). Automatic sleep staging using fMRI functional connectivity data. NeuroImage, 63(1):63–72.
 Van Den Oord et al. (2016) Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
 Vergara et al. (2017) Vergara, V. M., Mayer, A. R., Damaraju, E., and Calhoun, V. D. (2017). The effect of preprocessing in dynamic functional network connectivity used to classify mild traumatic brain injury. Brain and Behavior, 7(10).
 Zuo et al. (2010) Zuo, X.-N., Kelly, C., Adelstein, J. S., Klein, D. F., Castellanos, F. X., and Milham, M. P. (2010). Reliable intrinsic connectivity networks: test–retest evaluation using ICA and dual regression approach. Neuroimage, 49(3):2163–2177.