Diagnosis of psychiatric diseases is challenging, as there are currently no objective biological markers associated with mental disorders. Similar symptoms across different diseases (e.g. the depressive phase of bipolar disorder and unipolar depression) can lead to inaccurate diagnosis and thus to less effective intervention. Worse, there is also no objective biological marker for predicting an individual's response to treatment. As a result, a patient's prescription is often changed several times, and the medications' side effects frequently lead to poor adherence. Such inefficiency in the diagnosis and treatment prognosis of psychiatric disorders adds to the global burden of disease: mental illness ranks first, ahead of cancer and cardiac conditions, in terms of time lost to disability (WHO 2012 report) and costs [1].
In recent years, machine learning techniques have shown success in identifying patients with mental or neurological disorders and in predicting treatment response using brain imaging, especially structural and/or functional MRI (magnetic resonance imaging) data [2, 3, 4, 5, 6, 7]. Most of these approaches rely on extracting imaging features that can discriminate between patients and controls, or predict response to treatment. Typical features include functional connectivity (FC) and amplitude of low-frequency fluctuations (ALFF) for functional MRI (fMRI), and voxel-based morphometry and gray matter thickness/volume for structural MRI (sMRI). Such features may be extracted voxel-wise (each voxel being a small cube of brain tissue) or region-wise, from predefined brain regions (e.g. thalamus, postcentral gyrus).
With deep learning techniques providing outstanding performance in various fields, including image classification, speech recognition, and video classification, this approach is being explored in clinical applications, including those involving medical imaging data [8, 9, 10]. Beyond their potential to surpass the performance of other standard machine learning techniques, deep learning methods are attractive because they can be applied directly to the data, skipping the extraction of hand-designed features, a step that is necessary in almost all other machine learning approaches. Deep neural networks (DNNs) thus allow us to move away from hypothesis-driven feature selection toward data-driven feature discovery. Various deep learning methods (e.g. multi-layer perceptrons, autoencoders, deep belief networks, and convolutional neural networks) have been used to analyze imaging data for psychiatric and neurological disorders, including but not limited to Alzheimer's disease, ADHD, and psychosis [see 5, for a review]. Most of these studies use sMRI to make predictions for neurological disorders; far fewer use fMRI [11, 12, 13, 14], even though fMRI has been shown to be particularly relevant for predictive analysis of psychiatric disorders such as schizophrenia [15, 16]. fMRI measures the blood oxygenation level-dependent (BOLD) signal at every brain voxel by scanning the whole brain every 1-3 s. This produces a movie of brain activity, as reflected in the BOLD signal (the relationship between the BOLD signal and neural activity is still under scrutiny [17]), recorded either in response to a task (e.g. a motor, sensory, or cognitive task) or simply at rest.
Here, our goal is to exploit both spatial and temporal information in the fMRI movie (at the whole-brain voxel level) to distinguish patients with schizophrenia from healthy controls. We propose a recurrent convolutional neural network (R-CNN) consisting of a 3-D CNN followed by a sequential neural network with LSTM (long short-term memory) units. The CNN extracts spatial features, which are fed to the LSTM model; the LSTM uses the dependencies between time points at every spatial location to generate a label (see Figure 1). To our knowledge, this is the first work to apply a recurrent CNN to fMRI data for neurological or psychiatric diagnosis (here, schizophrenia). As discussed earlier, most previous fMRI/machine learning studies, including some that used DNNs, rely on hand-designed features, in particular FC features, which collapse the time dimension into a single number (i.e., the correlation coefficient between a pair of time series). Such approaches also do not keep track of the spatial relationships between locations (e.g. voxels or brain regions). Here, we extend the work of Bashivan et al. [18], who successfully applied an R-CNN (with 2-D convolutions) to EEG data in a mental load classification task, to fMRI data (using 3-D convolutions). We used fMRI data recorded during an auditory oddball task from patients diagnosed with schizophrenia and from healthy controls in the FBIRN dataset [19]. The task is to predict whether a given sample came from a patient or a control, based on the preprocessed voxel-level fMRI BOLD signal, exploiting the temporal and spatial information in the data within an end-to-end deep learning framework.
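To make concrete how FC features collapse the time dimension, the following minimal Python sketch (standard library only) computes the Pearson correlation between two synthetic region time series of 137 volumes and builds a small FC matrix. The helper names `pearson` and `fc_matrix` and the toy signals are hypothetical illustrations, not code from this study. Because the correlation depends only on paired values, applying the same shuffle to both series leaves it unchanged: the temporal ordering is discarded, and that ordering is exactly the information a sequence model such as an LSTM can exploit.

```python
import math
import random
from statistics import fmean


def pearson(x, y):
    """Pearson correlation between two equally long time series."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fc_matrix(series):
    """FC matrix: one correlation coefficient per pair of regions (or voxels)."""
    n = len(series)
    return [[pearson(series[i], series[j]) for j in range(n)] for i in range(n)]


rng = random.Random(0)
T = 137  # volumes per run after preprocessing
region_a = [rng.gauss(0.0, 1.0) for _ in range(T)]
region_b = [0.6 * a + rng.gauss(0.0, 1.0) for a in region_a]  # toy signal partly driven by region_a

r = pearson(region_a, region_b)  # the whole 137-point pair of series becomes one number

# Shuffling the time points (same permutation for both series) leaves FC unchanged,
# i.e. the temporal order of the volumes plays no role in the feature.
order = list(range(T))
rng.shuffle(order)
r_shuffled = pearson([region_a[t] for t in order], [region_b[t] for t in order])
```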
We used the FBIRN phase-II fMRI dataset [19], which includes functional and structural MRI data for patients with schizophrenia or schizoaffective disorder and age- and sex-matched healthy controls. We focused on the subset of fMRI data acquired in response to an auditory oddball (AO) task. Each subject completed 4 runs, each lasting 280 s, with a whole-brain scan acquired every 2 s. A standard preprocessing pipeline was applied to each subject's raw fMRI data using the FSL software package [20]. The pipeline included motion correction, tCompCor denoising, spatial filtering, high-pass temporal filtering, and linear registration to the MNI T1 template through the subject's T1 scan [see 6, for a more detailed description of the preprocessing stages]. The first 3 volumes in each run were discarded because of signal instability, leaving 137 volumes per run. Finally, a universal mask, i.e. the intersection of all subjects' brains, was applied to each subject's data, resulting in a common non-zero brain area of 26,949 voxels. After preprocessing and quality control, N = 95 subjects (46 patients, 49 controls) remained in the study, out of a total of 164 subjects in both scanning sessions. Subjects with missing imaging data, fewer than 4 runs, or excessive motion were excluded [see 6, for details of quality control].
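The volume count and the universal mask can be illustrated with a short Python sketch. It assumes, for illustration only, that each subject's brain extent is given as one flat volume on a shared template grid in which zero marks voxels outside the brain; the function names `universal_mask` and `apply_mask` and the toy values are hypothetical.

```python
TR_SECONDS = 2      # one whole-brain scan every 2 s
RUN_SECONDS = 280   # duration of each run
DISCARDED = 3       # initial volumes dropped for signal instability

volumes_acquired = RUN_SECONDS // TR_SECONDS   # 140 volumes per run
volumes_kept = volumes_acquired - DISCARDED    # 137 volumes per run


def universal_mask(subject_volumes):
    """Indices of template-grid voxels that are non-zero (inside the brain) for every subject."""
    n_voxels = len(subject_volumes[0])
    return [i for i in range(n_voxels) if all(vol[i] != 0 for vol in subject_volumes)]


def apply_mask(run, mask):
    """Keep only the masked voxels of every volume in a run (a list of flat volumes)."""
    return [[volume[i] for i in mask] for volume in run]


# Toy example: three subjects on a 6-voxel template grid (0 = outside the brain).
subjects = [
    [0.0, 1.2, 0.8, 0.5, 0.0, 0.3],
    [0.4, 1.1, 0.9, 0.0, 0.0, 0.2],
    [0.1, 1.3, 0.7, 0.6, 0.2, 0.4],
]
mask = universal_mask(subjects)     # only voxels 1, 2 and 5 lie inside every brain
run = [subjects[0], subjects[2]]    # two toy "volumes" from one run
masked_run = apply_mask(run, mask)
```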
We compared recurrent neural networks (LSTM model) and recurrent-convolutional neural networks (R-CNN) against a linear and a nonlinear baseline classifier in learning features that discriminate between patients with schizophrenia and healthy individuals. We tried several architectures for both LSTMs and R-CNNs. The input data were structured as 4-D tensors (3-D brain volumes over time).
Baseline: We compared our proposed models with a linear SVM and an RBF (radial basis function) kernel SVM. To make the SVM solution computationally feasible, we spatially downsampled the input by grouping neighboring voxels into 569 "supervoxels". This resulted in a feature vector of size 77,953 (137 time points × 569 supervoxels) per run per subject, for a total of 380 samples (95 subjects × 4 runs).
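A minimal sketch of the baseline's input construction follows, assuming (hypothetically) that supervoxels are formed by averaging non-overlapping cubic blocks of voxels. The downsampling rate and the averaging rule are not specified here, so `factor=2`, the toy grid, and the helper names are illustrative only. Each run is flattened into one vector of length time points × supervoxels, which is how 137 × 569 yields 77,953 features.

```python
from itertools import product


def block_labels(coords, factor):
    """Assign each in-mask voxel (x, y, z) to the factor x factor x factor block that contains it."""
    return [(x // factor, y // factor, z // factor) for x, y, z in coords]


def supervoxel_features(run, labels):
    """Average voxels within each block, then concatenate all time points into one vector.

    run: list of T volumes, each a flat list of in-mask voxel values ordered like `labels`.
    Returns (features, number_of_supervoxels); len(features) == T * number_of_supervoxels.
    """
    blocks = sorted(set(labels))
    index = {block: k for k, block in enumerate(blocks)}
    counts = [0] * len(blocks)
    for label in labels:
        counts[index[label]] += 1
    features = []
    for volume in run:
        sums = [0.0] * len(blocks)
        for value, label in zip(volume, labels):
            sums[index[label]] += value
        features.extend(s / c for s, c in zip(sums, counts))  # block means for this time point
    return features, len(blocks)


# Toy grid: 4 x 4 x 4 voxels, downsampled by a hypothetical factor of 2 per axis.
coords = list(product(range(4), repeat=3))
labels = block_labels(coords, factor=2)
run = [[float(x) for x, _, _ in coords] for _ in range(5)]  # 5 toy volumes; value = x coordinate
features, n_supervoxels = supervoxel_features(run, labels)

# Sizes reported for the FBIRN baseline.
feature_length = 137 * 569  # 137 time points x 569 supervoxels per run
n_samples = 95 * 4          # 95 subjects x 4 runs
```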
LSTM model: At each time point, all voxel values of the brain volume are reshaped into a vector and fed directly into a two-layer unidirectional LSTM model. LSTMs are known for their ability to learn long-term dependencies between inputs. In this model, the spatial relationships between voxels are ignored, and the LSTM learns the temporal relationships between activations in different voxels. The model contains two back-to-back LSTM layers with 32 units each.
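To show how the LSTM model consumes a window of flattened volumes, here is a minimal forward pass of two stacked LSTM layers with 32 units in pure Python. It is a sketch with random, untrained weights and toy sizes (50 voxels instead of 26,949); the logistic readout on the last hidden state (`classify_window`, `readout_w`) is a hypothetical stand-in for the output layer, whose exact form is not described here. The gate equations are those of a standard LSTM cell.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class LSTMLayer:
    """Minimal unidirectional LSTM layer; each gate acts on the concatenation [x_t, h_{t-1}]."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, input_dim, units, rng):
        self.units = units
        # Small random (untrained) weights and zero biases, one set per gate.
        self.W = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + units)] for _ in range(units)]
            for gate in self.GATES
        }
        self.b = {gate: [0.0] * units for gate in self.GATES}

    def forward(self, sequence):
        """Return the hidden state after every time step of `sequence` (a list of input vectors)."""
        h = [0.0] * self.units
        c = [0.0] * self.units
        outputs = []
        for x in sequence:
            z = list(x) + h  # current input concatenated with the previous hidden state
            pre = {
                gate: [a + b for a, b in zip(matvec(self.W[gate], z), self.b[gate])]
                for gate in self.GATES
            }
            i = [sigmoid(v) for v in pre["input"]]
            f = [sigmoid(v) for v in pre["forget"]]
            o = [sigmoid(v) for v in pre["output"]]
            g = [math.tanh(v) for v in pre["candidate"]]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]  # update cell state
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]               # new hidden state
            outputs.append(h)
        return outputs


def classify_window(window, layers, readout_w, readout_b):
    """Hypothetical readout: logistic unit on the last hidden state of the top LSTM layer."""
    sequence = window
    for layer in layers:
        sequence = layer.forward(sequence)
    logit = sum(w * h for w, h in zip(readout_w, sequence[-1])) + readout_b
    return sigmoid(logit)  # interpreted as P(patient)


rng = random.Random(0)
n_voxels, units, window_length = 50, 32, 16  # toy sizes; real volumes have 26,949 voxels
layers = [LSTMLayer(n_voxels, units, rng), LSTMLayer(units, units, rng)]
readout_w = [rng.uniform(-0.1, 0.1) for _ in range(units)]
window = [[rng.gauss(0.0, 1.0) for _ in range(n_voxels)] for _ in range(window_length)]
p_patient = classify_window(window, layers, readout_w, 0.0)
```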
R-CNN model: The 3-D activations for each time frame were fed into a 3-D CNN. We used 3-D convolutions followed by max-pooling to extract features that are robust to position and scale and would therefore generalize across individuals. Identical networks with shared weights were used to process each time frame. The output of each CNN was reshaped into a vector and fed as input to the LSTM network at the corresponding time step. Figure 1 shows an overview of the recurrent-convolutional network used here. Besides learning 3-D features, the CNN part reduces the dimension of the input space before it is fed into the LSTM. The same 3-D filter size was used in every convolutional layer, while the number of filters varied with the model architecture, as detailed in Table 1. R-CNN (2,1) contains two back-to-back convolutional layers with 16 filters each, followed by a single convolutional layer with 32 filters, and lastly two back-to-back LSTM layers with 32 units. R-CNN (1,2) reverses the order of the convolutional blocks, i.e. a single convolutional layer followed by two back-to-back convolutional layers. Lastly, R-CNN (2,2,1) contains two back-to-back convolutional layers with 16 filters each, followed by two back-to-back convolutional layers with 32 filters each, then one convolutional layer with 32 filters, and finally two back-to-back LSTM layers with 32 units.
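The dimension reduction performed by the CNN, and where the parameters sit, can be estimated with a small shape-and-parameter calculator. This sketch assumes 'same' padding for the stride-1 convolutions, a hypothetical 3 × 3 × 3 filter size, a hypothetical 40 × 48 × 40 input volume, a max-pooling layer of size 2 and stride 2 after each block of convolutions, and a Keras-style LSTM parameter count (one bias vector per gate); none of these sizes are taken from the study. Because the same CNN weights are shared across time frames, the convolutional parameter count does not grow with the window length.

```python
def conv3d_params(k, in_channels, filters):
    """Weights of one 3-D convolutional layer: k^3 per input channel per filter, plus one bias per filter."""
    return k ** 3 * in_channels * filters + filters


def lstm_params(input_dim, units):
    """Weights of one LSTM layer: four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def rcnn_summary(input_shape, blocks, k=3, lstm_units=(32, 32)):
    """Track volume shape and parameter counts through conv blocks and stacked LSTMs.

    blocks: list of (number_of_conv_layers, filters) pairs, e.g. [(2, 16), (1, 32)] for R-CNN (2,1).
    Assumes 'same' padding, so stride-1 convolutions keep the spatial size.
    """
    spatial = tuple(input_shape)
    channels = 1
    conv_total = 0
    for n_layers, filters in blocks:
        for _ in range(n_layers):
            conv_total += conv3d_params(k, channels, filters)
            channels = filters
        spatial = tuple((d - 2) // 2 + 1 for d in spatial)  # max-pool: size 2, stride 2
    flat = channels
    for d in spatial:
        flat *= d  # length of the vector fed to the LSTM at each time step
    lstm_total = 0
    in_dim = flat
    for units in lstm_units:
        lstm_total += lstm_params(in_dim, units)
        in_dim = units
    return spatial, flat, conv_total, lstm_total


# R-CNN (2,1) on a hypothetical 40 x 48 x 40 input volume with hypothetical 3 x 3 x 3 filters.
spatial, flat, conv_total, lstm_total = rcnn_summary((40, 48, 40), [(2, 16), (1, 32)])
```

Under these assumptions the convolutional layers of R-CNN (2,1) hold about 21 thousand weights, while the two LSTM layers hold about 4.9 million, in line with the LSTM cells containing most of the tunable parameters. For the LSTM-only model, `lstm_params(26949, 32)` gives 3,453,696 weights in the first layer alone.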
2.3 Training Details
We used the Adam optimizer [21] with default parameter settings and a learning rate of 0.0001. We applied 50% dropout to the inputs and outputs of the LSTM cells [22], which contained most of the tunable parameters in our models. In addition, we applied weight regularization to all convolutional and fully connected weights. All 3-D convolutions used a stride of 1, and each block of convolutions was followed by a max-pooling layer of size 2 and stride 2. A batch size of 64 was used for all models. Larger convolutional models were trained synchronously on 16 GPUs. For both the LSTM and R-CNN models, we experimented with the number of time frames included in each sample, namely 16 and 64. We used 10-fold cross-validation to evaluate the performance of each proposed model, training the model independently for each fold. The training, validation, and test sets were subject-disjoint, i.e. all samples from a given subject were used in only one of the three sets. Within each training batch, samples were generated by randomly selecting time windows from different subjects. All possible time windows of the chosen length from all subjects were used for training, validation, and testing. We used the validation set for early stopping: after each training epoch, performance on the validation set was computed; training ran for 10 epochs, and test performance was reported for the network state with the highest validation score.
2.4 Data Preparation
Preprocessed fMRI data were normalized in three steps. For each subject, run, and voxel, (i) the BOLD time series was demeaned, and (ii) the resulting time series was divided by the standard deviation (SD) of activation across the whole brain (all voxels and time points of that run). Then (iii) each voxel's time series was standardized using the mean and SD of that voxel's activation across all subjects and runs. Each fMRI sample (subject/run) was then split into windows of 64 consecutive volumes (128 s), giving 137 - 64 + 1 = 74 windows per run; these windows formed the input for training, validation, and testing of our models. A shorter window of 16 volumes (32 s) was also tested.
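The three normalization steps and the window extraction can be sketched in pure Python as below. This is an illustration under stated assumptions: SDs are population SDs (`statistics.pstdev`), the toy data are random, and step (ii) uses the SD over all voxels and time points of the same run; `normalize` and `sliding_windows` are hypothetical helper names.

```python
import random
from statistics import fmean, pstdev


def normalize(runs):
    """Three-step normalization of a list of runs, each a T x V list (time points x voxels).

    (i) demean each voxel's time series within its run;
    (ii) divide the run by the SD over all of its voxels and time points;
    (iii) z-score each voxel using its mean and SD pooled over all runs (and subjects).
    """
    scaled = []
    for run in runs:
        n_vox = len(run[0])
        means = [fmean(volume[v] for volume in run) for v in range(n_vox)]
        demeaned = [[volume[v] - means[v] for v in range(n_vox)] for volume in run]   # (i)
        whole_brain_sd = pstdev([x for volume in demeaned for x in volume])
        scaled.append([[x / whole_brain_sd for x in volume] for volume in demeaned])  # (ii)
    n_vox = len(scaled[0][0])
    stats = []
    for v in range(n_vox):
        pooled = [volume[v] for run in scaled for volume in run]
        stats.append((fmean(pooled), pstdev(pooled)))
    return [[[(x - mu) / sd for x, (mu, sd) in zip(volume, stats)] for volume in run]
            for run in scaled]                                                          # (iii)


def sliding_windows(run, length):
    """All windows of `length` consecutive volumes (stride 1)."""
    return [run[start:start + length] for start in range(len(run) - length + 1)]


rng = random.Random(1)
# Toy data: 3 runs x 137 volumes x 5 voxels, each voxel with its own offset and scale.
runs = [[[rng.gauss(v, 1.0 + v) for v in range(5)] for _ in range(137)] for _ in range(3)]
normed = normalize(runs)
windows_64 = sliding_windows(normed[0], 64)   # 137 - 64 + 1 windows
windows_16 = sliding_windows(normed[0], 16)   # 137 - 16 + 1 windows
```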
We investigated the effectiveness of LSTMs and R-CNNs in capturing the temporal and spatial structure of the fMRI data to distinguish between patients and controls. The average accuracy of the baseline method (linear SVM) was 57.89%. Both LSTM and R-CNN models performed better than this baseline (Table 1). Our best LSTM model performed slightly better than our best R-CNN model (66.4% vs. 64.9%). We also explored the impact of sample length (time-window size) on the performance of the trained networks and found that larger time windows improved accuracy in all tested models. For the R-CNN models, we varied the number of convolutional layers and found that deeper CNN models did not reduce the error (and slightly increased it).
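Table 1: Test performance of the LSTM and R-CNN models with 16- and 64-volume time windows.
|Model|LSTM|LSTM|RCNN (2, 1)|RCNN (2, 1)|RCNN (1, 2)|RCNN (1, 2)|RCNN (2, 2, 1)|
|---|---|---|---|---|---|---|---|
|Time window (volumes)|16|64|16|64|16|64|64|
|Test performance|60.2%|66.4%|63.1%|64.9%|60.9%|61.4%|63.3%|
|# Conv filters|-|-|16, 32|16, 32|16, 32|16, 32|16, 32, 32|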
The best recurrent-convolutional model was R-CNN (2,1), containing two back-to-back convolutions in the first block, one convolution in the second block, and two back-to-back LSTM layers. We compared samples with 16- and 64-volume time windows while keeping the batch size fixed at 64 samples. For R-CNN (2,1), the 64-volume windows improved classification accuracy by 1.8 percentage points (from 63.1% to 64.9%); for the LSTM model, the improvement was over 6 percentage points (from 60.2% to 66.4%). Comparing the baseline and deep learning models, the test scores of LSTM and R-CNN were 8% better than the linear SVM (accuracy 57.9%, with false positive and false negative rates of 30.6% and 54.0%, respectively), and ??% better than the RBF SVM (accuracy 62.1%, with false positive and false negative rates of 19.9% and 57.1%, respectively).
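As a consistency check, the reported accuracy follows from the false positive and negative rates once the class proportions are fixed: accuracy = 1 - (FPR × N_controls + FNR × N_patients) / (N_controls + N_patients). The sketch below assumes patients are the positive class and uses the subject-level counts (49 controls, 46 patients) as a proxy for the mix of test windows, which is only approximately right because accuracy was computed over windows and averaged over folds. `accuracy_from_rates` is a hypothetical helper.

```python
def accuracy_from_rates(fpr, fnr, n_controls, n_patients):
    """Accuracy implied by false positive/negative rates, with patients as the positive class."""
    errors = fpr * n_controls + fnr * n_patients  # misclassified controls + missed patients
    return 1.0 - errors / (n_controls + n_patients)


# Subject-level class sizes used as a proxy for the proportion of test windows per class.
linear_svm = accuracy_from_rates(0.306, 0.540, n_controls=49, n_patients=46)
rbf_svm = accuracy_from_rates(0.199, 0.571, n_controls=49, n_patients=46)
```

With these assumptions the linear SVM rates give about 58.1% and the RBF SVM rates about 62.1%, close to the reported 57.9% and 62.1%. The high false negative rates mean that both baselines miss more than half of the patients.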
We applied several neural network architectures to learn invariant markers of schizophrenia from whole-brain fMRI data. In particular, we tried sequential LSTM networks alone and in combination with CNN input layers. Although both methods outperformed the linear SVM baseline in distinguishing patients with schizophrenia from healthy control subjects, they fell short of the accuracy obtained with hand-designed connectivity features (74% in [6]). This may be due to the relatively small sample size for an end-to-end deep learning framework. The 2-layer LSTM architecture, used in all our models, helped to learn temporal dependencies in long time windows. The R-CNN model learned better representations than the LSTM model when using short time windows (16 volumes) but failed to match the LSTM performance when using longer time windows (64 volumes). In future work, we will exploit other fMRI data subsets in the FBIRN dataset, acquired in response to three other tasks (working memory, sensorimotor, and breath-hold), to enrich the sample space. Applying transformations (such as wavelets) to the fMRI data before feeding them to the DNN might also help extract the subtle signal in the data, compared to feeding the BOLD signal directly to the network.
-  C. Roehrig, “Mental disorders top the list of the most costly conditions in the United States: $201 billion,” Health Affairs, vol. 35, no. 6, pp. 1130–1135, 2016.
-  G. Orru, W. Pettersson-Yeo, A. F. Marquand, G. Sartori, and A. Mechelli, “Using support vector machine to identify imaging biomarkers of neurological and psychiatric disease: a critical review,” Neurosci Biobehav Rev, vol. 36, no. 4, pp. 1140–52, 2012.
-  E. Zarogianni, T. W. Moorhead, and S. M. Lawrie, “Towards the identification of imaging biomarkers in schizophrenia, using multivariate pattern classification at a single-subject level,” Neuroimage Clin, vol. 3, pp. 279–89, 2013.
-  N. Koutsouleris, R. S. Kahn, A. M. Chekroud, S. Leucht, P. Falkai, T. Wobrock, E. M. Derks, W. W. Fleischhacker, and A. Hasan, “Multisite prediction of 4-week and 52-week treatment outcomes in patients with first-episode psychosis: a machine learning approach,” Lancet Psychiatry, vol. 3, no. 10, pp. 935–946, 2016.
-  S. Vieira, W. H. Pinaya, and A. Mechelli, “Using deep learning to investigate the neuroimaging correlates of psychiatric and neurological disorders: Methods and applications,” Neurosci Biobehav Rev, vol. 74, no. Pt A, pp. 58–75, 2017.
-  M. Gheiratmand, I. Rish, G. A. Cecchi, M. R. G. Brown, R. Greiner, P. I. Polosecki, P. Bashivan, A. J. Greenshaw, R. Ramasubbu, and S. M. Dursun, “Learning stable and predictive network-based patterns of schizophrenia and its clinical symptoms,” NPJ Schizophr, vol. 3, p. 22, 2017.
-  T. Wolfers, J. K. Buitelaar, C. F. Beckmann, B. Franke, and A. F. Marquand, “From estimating activation locality to predicting disorder: A review of pattern recognition for neuroimaging-based psychiatric diagnostics,” Neurosci Biobehav Rev, vol. 57, pp. 328–49, 2015.
-  D. Shen, G. Wu, and H. I. Suk, “Deep learning in medical image analysis,” Annu Rev Biomed Eng, vol. 19, pp. 221–248, 2017.
-  G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. van der Laak, B. van Ginneken, and C. I. Sanchez, “A survey on deep learning in medical image analysis,” Med Image Anal, vol. 42, pp. 60–88, 2017.
-  V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, R. Kim, R. Raman, P. C. Nelson, J. L. Mega, and D. R. Webster, “Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs,” JAMA, vol. 316, no. 22, pp. 2402–2410, 2016.
-  S. M. Plis, D. R. Hjelm, R. Salakhutdinov, E. A. Allen, H. J. Bockholt, J. D. Long, H. J. Johnson, J. S. Paulsen, J. A. Turner, and V. D. Calhoun, “Deep learning for neuroimaging: a validation study,” Front Neurosci, vol. 8, p. 229, 2014.
-  J. Kim, V. D. Calhoun, E. Shim, and J. H. Lee, “Deep neural network with weight sparsity control and pre-training extracts hierarchical features and enhances classification performance: Evidence from whole-brain resting-state functional connectivity patterns of schizophrenia,” Neuroimage, vol. 124, no. Pt A, pp. 127–146, 2016.
-  H. I. Suk, C. Y. Wee, S. W. Lee, and D. Shen, “State-space model with deep learning for functional dynamics estimation in resting-state fMRI,” Neuroimage, vol. 129, pp. 292–307, 2016.
-  S. Sarraf, G. Tofighi, et al., “DeepAD: Alzheimer’s disease classification via deep convolutional neural networks using MRI and fMRI,” bioRxiv, p. 070441, 2016.
-  E. Damaraju, E. A. Allen, A. Belger, J. M. Ford, S. McEwen, D. H. Mathalon, B. A. Mueller, G. D. Pearlson, S. G. Potkin, A. Preda, J. A. Turner, J. G. Vaidya, T. G. van Erp, and V. D. Calhoun, “Dynamic functional connectivity analysis reveals transient states of dysconnectivity in schizophrenia,” Neuroimage Clin, vol. 5, pp. 298–308, 2014.
-  V. D. Calhoun, T. Eichele, and G. Pearlson, “Functional brain networks in schizophrenia: a review,” Front Hum Neurosci, vol. 3, p. 17, 2009.
-  “Connecting the dots,” Nat Neurosci, vol. 12, no. 2, p. 99, 2009. doi: 10.1038/nn0209-99.
-  P. Bashivan, I. Rish, M. Yeasin, and N. Codella, “Learning representations from EEG with deep recurrent-convolutional neural networks,” International Conference on Learning Representations, pp. 1–15, 2016.
-  D. B. Keator, T. G. van Erp, J. A. Turner, G. H. Glover, B. A. Mueller, T. T. Liu, J. T. Voyvodic, J. Rasmussen, V. D. Calhoun, H. J. Lee, A. W. Toga, S. McEwen, J. M. Ford, D. H. Mathalon, M. Diaz, D. S. O’Leary, H. Jeremy Bockholt, S. Gadde, A. Preda, C. G. Wible, H. S. Stern, A. Belger, G. McCarthy, B. Ozyurt, S. G. Potkin, and FBIRN, “The Function Biomedical Informatics Research Network data repository,” Neuroimage, vol. 124, no. Pt B, pp. 1074–9, 2016.
-  M. Jenkinson, C. F. Beckmann, T. E. Behrens, M. W. Woolrich, and S. M. Smith, “FSL,” Neuroimage, vol. 62, no. 2, pp. 782–90, 2012.
-  D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations, pp. 1–13, 2014.
-  W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv:1409.2329, 2014.