Mental disorders manifest in behavior that is driven by disruptions in brain dynamics. Functional MRI captures the nuances of spatiotemporal dynamics that could potentially provide clues to the causes of mental disorders and enable early diagnosis. However, the data obtained for a single subject is high-dimensional, and to be useful for learning and statistical analysis, one needs to collect datasets with a large number of subjects. Yet, for any disorder, demographic, or other type of condition, a single study is rarely able to amass a dataset large enough for reliable generalization. Traditionally, this is approached by handcrafting features Khazaee et al. (2016) of much smaller dimension, effectively performing dimensionality reduction. Often, the dynamics of brain function in these representations vanishes into proxy features such as correlation matrices of functional network connectivity (FNC) (Yan et al., 2017). Efforts that pool data from various studies to increase sample size do exist, but it is difficult to generalize to the study of smaller and more specific disease populations whose data cannot be shared into these pools or are too different from the data already in them.
Our goal is to enable the direct study of brain dynamics in smaller datasets to, in turn, allow an analysis of brain function. In this paper, we show how one can achieve significant improvement in classification directly from dynamical data on small datasets by taking advantage of publicly available large but unrelated datasets. We demonstrate that it is possible to train a model in a self-supervised manner on the dynamics of healthy control subjects from the Human Connectome Project (HCP) (Van Essen et al., 2013) and apply that pre-trained encoder to a completely different dataset collected across multiple sites from healthy controls and schizophrenia subjects.
2 Related Work
Self-supervised representation learning has recently shown success in a number of domains, even reinforcement learning (Anand et al., 2019).
Prior work in brain imaging has been based on unsupervised methods such as linear ICA (Calhoun et al., 2001) and the HMM framework Eavani et al. (2013). Nonlinear approaches to capturing the dynamics have also been proposed, such as RBMs (Hjelm et al., 2014) and an RNN modification of ICA (Hjelm et al., 2018a).
Also, in most cases, researchers in brain imaging deal with small datasets. Here, transfer learning (Mensch et al., 2017; Li et al., 2018; Thomas et al., 2019) might be a way to improve results and, in some cases, to enable learning from data otherwise too small to yield any results. Another way to improve performance is a data-generating approach (Ulloa et al., 2018).
3 Method Description
For self-supervised pre-training, we use the spatio-temporal objective ST-DIM (Anand et al., 2019) to maximize predictability between the current latent state and the future spatial state, and between consecutive spatial states. As a lower bound on mutual information, we use the InfoNCE estimator (Oord et al., 2018). Compared to other available estimators, InfoNCE shows better performance (Hjelm et al., 2018b; Bachman et al., 2019) when a large number of negative samples is available, which is readily the case for time series data.
Let $\{(x_t, x_{t+k})\}$ be a dataset of pairs of values at time points $t$ and $t+k$, sampled from a sequence of length $T$. A pair is called positive if $k = 1$ and negative if $k \neq 1$. A positive pair models the joint distribution and a negative pair the product of marginals. Eventually, the InfoNCE estimator for a batch of $N$ positive pairs $(x_i, y_i)$ is defined as:

$$\mathcal{I}_{\text{NCE}} = \frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp f(x_i, y_i)}{\sum_{j=1}^{N} \exp f(x_i, y_j)},$$
where $f$ is a critic function Tschannen et al. (2019). Specifically, we use a separable critic $f(x, y) = \phi(x)^{\top} \psi(y)$, where $\phi$ and $\psi$ are embedding functions parametrized by neural networks. These embedding functions map the two dimensionally different inputs into a common space, in which the value of the critic is computed. The critic learns embeddings that assign higher values to positive pairs than to negative pairs: $f(x_t, x_{t+1}) \gg f(x_t, x_{t'})$.
We define a latent state $z_t$ as the output of the encoder and a spatial state $c_t$ as the output of the $l$-th layer of the encoder for the input at time point $t$. To optimize the objective between the current latent state and the future spatial state, the critic for the input pair $(x_t, x_{t+1})$ is $\phi(z_t)^{\top} \psi(c_{t+1})$, and for consecutive spatial states it is $\phi(c_t)^{\top} \psi(c_{t+1})$. Finally, the loss is the sum of the InfoNCE objective with the latent-spatial critic and the InfoNCE objective with the spatial-spatial critic.
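The contrastive objective above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it assumes the embeddings $\phi(x)$ and $\psi(y)$ have already been computed for a batch of positive pairs, and uses every other batch element as a negative.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(phi_x, psi_y):
    """InfoNCE objective for a batch of N positive pairs.

    phi_x, psi_y: (N, d) embeddings; row i of phi_x is paired with
    row i of psi_y (positive) and contrasted against every other row
    of psi_y (negatives). The separable critic is f(x, y) = phi(x)^T psi(y).
    Returns the average log-softmax of the positive scores; the mutual
    information lower bound is log(N) plus this value.
    """
    scores = phi_x @ psi_y.T                             # (N, N) critic values
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_norm = np.log(np.exp(scores).sum(axis=1, keepdims=True))
    log_softmax = scores - log_norm
    return float(np.diag(log_softmax).mean())

# Embeddings of truly consecutive windows should score higher than
# embeddings of randomly mismatched windows.
x = rng.normal(size=(128, 16))
aligned = info_nce(x, x + 0.1 * rng.normal(size=(128, 16)))
mismatched = info_nce(x, rng.normal(size=(128, 16)))
```

Maximizing this quantity (equivalently, minimizing its negative as a loss) drives the critic to separate positive from negative pairs, which is exactly what the two ST-DIM terms do for latent-spatial and spatial-spatial pairs.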
To simulate the data, we generate multiple graphs with stable transition matrices. Using these, we generate multivariate time series with vector autoregressive (VAR) and structural vector autoregressive (SVAR) models (Lütkepohl, 2005).
First, we generate VAR time series. Then we split the dataset into training, validation, and test samples. Using these samples, we pre-train an encoder and evaluate it based on its ability to identify consecutive windows sampled from the whole time series.
In the final downstream task, we classify whether a whole time series was generated by VAR or SVAR (undersampled VAR at rate 2). We create graphs with corresponding stable transition matrices, generate samples for each model, and split them into training, validation, and hold-out test sets. Here we also use windows as single time-point inputs.
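The simulation above can be sketched as follows. This is a hedged illustration, not the paper's generator: the node count, series length, and noise level are hypothetical (the paper's values are not given here), the process is first-order, and the SVAR class is modeled simply as the same VAR undersampled at rate 2, as described in the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def stable_transition_matrix(n_nodes, radius=0.95):
    """Random transition matrix rescaled so its spectral radius is
    below 1, which makes the first-order VAR process stable."""
    a = rng.normal(size=(n_nodes, n_nodes))
    a *= radius / np.abs(np.linalg.eigvals(a)).max()
    return a

def simulate_var(a, n_steps, noise_std=1.0):
    """First-order VAR: x_t = A x_{t-1} + e_t with Gaussian noise."""
    x = np.zeros((n_steps, a.shape[0]))
    for t in range(1, n_steps):
        x[t] = a @ x[t - 1] + noise_std * rng.normal(size=a.shape[0])
    return x

a = stable_transition_matrix(n_nodes=10)
var_series = simulate_var(a, n_steps=400)
# The SVAR-labeled class: the same process undersampled at rate 2.
svar_series = simulate_var(a, n_steps=800)[::2]
```

Rescaling by the spectral radius guarantees the generated series does not diverge, so arbitrarily long training sequences can be drawn from one transition matrix.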
4.1.2 Real data
Two independent datasets were used in the current study. The first is a schizophrenia dataset from the Function Biomedical Informatics Research Network (FBIRN) (Keator et al., 2016) (downloaded from the Function BIRN Data Repository, Project Accession Number 2007-BDR-6UHZ1), and the second is a healthy-subject dataset from the 1200 Subject release of the Human Connectome Project (HCP) (Van Essen et al., 2013).
The FBIRN dataset was pre-processed with SPM12 (Penny et al., 2011) in the MATLAB 2016b environment. Slice-timing correction was performed first, and then subject head motion was corrected by the realignment procedure. After that, the data were warped to MNI space using the EPI template and resampled to mm voxels. Finally, the data were smoothed with a 6 mm FWHM Gaussian kernel. The FBIRN dataset consists of subjects, including SZ patients and healthy controls.
The resting-state fMRI HCP data come pre-processed by the pipeline of Glasser et al. (2013). It includes removal of spatial distortions, compensation for subject motion, reduction of the bias field, normalization by a global mean, and final brain masking. The pre-processed HCP data were then warped to MNI space using the EPI template and resampled to mm voxels with the same bounding box, to guarantee that the HCP and FBIRN datasets have the same spatial resolution and dimensions. HCP consists only of healthy controls.
For each dataset, intrinsic connectivity networks (ICNs) were extracted using the pipeline described in Du et al. (2019). These ICNs are presumed to be non-noise components that provide meaningful functional network information and thus were used in training.
The encoder for the simulation experiment consists of convolutional layers, each followed by a ReLU (Glorot et al., 2011), and a final linear layer. The encoder for real data has the same structure with correspondingly adjusted output features, kernel sizes, strides, and linear units. Then, for all possible pairs in the batch, we take the flattened features after the 3rd convolutional layer and the features from the last layer. We embed them using $\phi$ and $\psi$ into vectors of the same dimension and compute the critic scores $\phi^{\top}\psi$, from which we compute the loss. The neural networks are trained using the Adam optimizer (Kingma and Ba, 2014). The weights are initialized using the Xavier scheme (Glorot and Bengio, 2010).
For the simulation experiment, we first train our encoder on windows from the VAR time series using the InfoNCE-based loss, and then train a supervised classifier on windows. This window-based classification provides promising results. However, in similar real problems we are more interested in subjects, i.e., entire time series, rather than a single window. Hence, we perform classification based on the whole time series. In this setting, the entire time series is encoded as a sequence of representations and fed through a biLSTM classifier. Two additional linear layers on top of the last hidden state of the biLSTM map the representation to classification scores.
For the real data, similarly to the simulations, we successfully train our encoder on consecutive windows of fMRI from HCP healthy subjects. The computed features for each window of the whole fMRI sequence are then used to train a biLSTM classifier on the FBIRN dataset, which classifies SZ and HC subjects. Overall, each subject's fMRI is represented as a series of windows overlapping by half, each covering all ICA components over a fixed number of time points.
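The half-overlapping windowing of a subject's component time courses can be sketched as below. The concrete sizes used here (50 components, 200 time points, window length 20) are hypothetical placeholders, since the actual dimensions are not given above.

```python
import numpy as np

def half_overlapping_windows(series, window_len):
    """Split a (components, time) array into windows of window_len
    time points that overlap by half, one encoder input per window."""
    step = window_len // 2
    n_time = series.shape[1]
    starts = range(0, n_time - window_len + 1, step)
    return np.stack([series[:, s:s + window_len] for s in starts])

# Hypothetical sizes: 50 ICA components, 200 time points, window of 20.
windows = half_overlapping_windows(np.zeros((50, 200)), window_len=20)
```

Each resulting window is encoded independently, and the sequence of window representations is what the biLSTM consumes.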
Here we compare an end-to-end supervised model without pre-training (NPT), a model with frozen layers of the pre-trained encoder (FPT), and a model with unfrozen layers of the pre-trained encoder (UFPT).
In the simulation study, we observe that the pre-trained model can easily be fine-tuned with only a small amount of downstream data. Our model can classify a randomly chosen time series as a sample of VAR or SVAR (Figure 2). Notably, with very few training samples, models based on the pre-trained encoder outperform purely supervised models. However, as the number of samples grows, the accuracy achieved with or without pre-training levels out.
As we can see from Figure 2, the real-data results substantiate the insights from the simulation study. The test dataset consists of subjects held out from the training and validation processes and is the same for all tests in the plot. Training data were randomly resampled ten times from the available data pool. In short, self-supervised transferable pre-training consistently helps when very few samples are available, yielding higher AUC.
5 Conclusions and Future Work
As we have demonstrated, self-supervised pre-training of a spatiotemporal encoder on fMRI of healthy subjects provides benefits that transfer across datasets, collection sites, and to schizophrenia disease classification. Learning the dynamics of fMRI helps to improve classification results for schizophrenia on small datasets that otherwise do not provide reliable generalizations. Although the utility of this result is highly promising by itself, we conjecture that direct application to spatiotemporal data will yield benefits beyond improved classification accuracy in future work. Working with ICA components gives a smaller and thus easier-to-handle space that still exhibits all the dynamics of the signal. In the future, we will move beyond ICA pre-processing and work with fMRI data directly. We expect model introspection to yield insight into the spatio-temporal biomarkers of schizophrenia. In future work, we will test the same analogously pre-trained encoder on datasets with various other mental disorders, such as MCI and bipolar disorder. We are optimistic about the outcome because the proposed pre-training is oblivious to the downstream use and is done in a manner quite different from the classifier's operation. It may indeed be learning crucial information about dynamics that contains important clues into the nature of mental disorders.
This study is supported by NIH grant R01EB020407.
Data were provided [in part] by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. Additional data used in this study were downloaded from the Function BIRN Data Repository (http://fbirnbdr.birncommunity.org:8080/BDR/), supported by grants to the Function BIRN (U24-RR021992) Testbed funded by the National Center for Research Resources at the National Institutes of Health, U.S.A.
- Anand et al.  Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R Devon Hjelm. Unsupervised state representation learning in Atari. arXiv preprint arXiv:1906.08226, 2019.
- Bachman et al.  Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.
- Calhoun et al.  Vince D Calhoun, Tulay Adali, Godfrey D Pearlson, and JJ Pekar. A method for making group inferences from functional MRI data using independent component analysis. Human brain mapping, 14(3):140–151, 2001.
- Du et al.  Yuhui Du, Zening Fu, Jing Sui, Shuang Gao, Ying Xing, Dongdong Lin, Mustafa Salman, Md Abdur Rahaman, Anees Abrol, Jiayu Chen, Elliot Hong, Peter Kochunov, Elizabeth A. Osuch, and Vince D. Calhoun. Neuromark: an adaptive independent component analysis framework for estimating reproducible and comparable fmri biomarkers among brain disorders. medRxiv, 2019. doi: 10.1101/19008631. URL https://www.medrxiv.org/content/early/2019/10/16/19008631.
- Eavani et al.  Harini Eavani, Theodore D Satterthwaite, Raquel E Gur, Ruben C Gur, and Christos Davatzikos. Unsupervised learning of functional network dynamics in resting state fmri. In International Conference on Information Processing in Medical Imaging, pages 426–437. Springer, 2013.
- Fedorov et al.  Alex Fedorov, R Devon Hjelm, Anees Abrol, Zening Fu, Yuhui Du, Sergey Plis, and Vince D Calhoun. Prediction of progression to Alzheimers disease with deep InfoMax. arXiv preprint arXiv:1904.10931, 2019.
- Glasser et al.  Matthew F Glasser, Stamatios N Sotiropoulos, J Anthony Wilson, Timothy S Coalson, Bruce Fischl, Jesper L Andersson, Junqian Xu, Saad Jbabdi, Matthew Webster, Jonathan R Polimeni, et al. The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage, 80:105–124, 2013.
- Glorot and Bengio  Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
- Glorot et al.  Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011.
- Hjelm et al.  R Devon Hjelm, Vince D Calhoun, Ruslan Salakhutdinov, Elena A Allen, Tulay Adali, and Sergey M Plis. Restricted Boltzmann machines for neuroimaging: an application in identifying intrinsic networks. NeuroImage, 96:245–260, 2014.
- Hjelm et al. [2018a] R Devon Hjelm, Eswar Damaraju, Kyunghyun Cho, Helmut Laufs, Sergey M Plis, and Vince D Calhoun. Spatio-temporal dynamics of intrinsic networks in functional magnetic imaging data using recurrent neural networks. Frontiers in neuroscience, 12:600, 2018a.
- Hjelm et al. [2018b] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018b.
- Keator et al.  David B Keator, Theo GM van Erp, Jessica A Turner, Gary H Glover, Bryon A Mueller, Thomas T Liu, James T Voyvodic, Jerod Rasmussen, Vince D Calhoun, Hyo Jong Lee, et al. The function biomedical informatics research network data repository. Neuroimage, 124:1074–1079, 2016.
- Khazaee et al.  Ali Khazaee, Ata Ebrahimzadeh, and Abbas Babajani-Feremi. Application of advanced machine learning methods on resting-state fMRI network for identification of mild cognitive impairment and Alzheimer’s disease. Brain Imaging and Behavior, 10(3):799–817, Sep 2016. ISSN 1931-7565. doi: 10.1007/s11682-015-9448-7. URL https://doi.org/10.1007/s11682-015-9448-7.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Li et al.  Hailong Li, Nehal A. Parikh, and Lili He. A novel transfer learning approach to enhance deep neural network classification of brain functional connectomes. Frontiers in Neuroscience, 12:491, 2018. ISSN 1662-453X. doi: 10.3389/fnins.2018.00491. URL https://www.frontiersin.org/article/10.3389/fnins.2018.00491.
- Lütkepohl  Helmut Lütkepohl. New introduction to multiple time series analysis. Springer Science & Business Media, 2005.
- Mensch et al.  Arthur Mensch, Julien Mairal, Danilo Bzdok, Bertrand Thirion, and Gaël Varoquaux. Learning neural representations of human cognition across many fMRI studies. In Advances in Neural Information Processing Systems, pages 5883–5893, 2017.
- Oord et al.  Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Penny et al.  William D Penny, Karl J Friston, John T Ashburner, Stefan J Kiebel, and Thomas E Nichols. Statistical parametric mapping: the analysis of functional brain images. Elsevier, 2011.
- Thomas et al.  Armin W Thomas, Klaus-Robert Müller, and Wojciech Samek. Deep transfer learning for whole-brain fMRI analyses. arXiv preprint arXiv:1907.01953, 2019.
- Tschannen et al.  Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
- Ulloa et al.  Alvaro Ulloa, Sergey Plis, and Vince Calhoun. Improving classification rate of schizophrenia using a multimodal multi-layer perceptron model with structural and functional MR. arXiv preprint arXiv:1804.04591, 2018.
- Van Essen et al.  David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The WU-Minn human connectome project: an overview. Neuroimage, 80:62–79, 2013.
- Yan et al.  Weizheng Yan, Sergey Plis, Vince D Calhoun, Shengfeng Liu, Rongtao Jiang, Tian-Zi Jiang, and Jing Sui. Discriminating schizophrenia from normal controls using resting state functional network connectivity: A deep neural network and layer-wise relevance propagation method. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2017.