Transfer Learning of fMRI Dynamics

11/16/2019 ∙ by Usman Mahmood, et al. ∙ Georgia Institute of Technology, Georgia State University

As a mental disorder progresses, it may affect brain structure, but brain function, as expressed in brain dynamics, is affected much earlier. Capturing the moment when brain dynamics express the disorder is crucial for early diagnosis. The traditional approach to this problem, training classifiers, either proceeds from handcrafted features or requires large datasets to combat the m >> n problem, in which a high-dimensional fMRI volume has only a single label that carries learning signal. Large datasets may not be available for the study of each disorder, and rare disorder types or sub-populations may not warrant them. In this paper, we demonstrate a self-supervised pre-training method that enables us to pre-train directly on fMRI dynamics of healthy control subjects and transfer the learning to much smaller datasets of schizophrenia. Not only do we enable classification of the disorder directly from fMRI dynamics in small data, but we also significantly speed up the learning when possible. This is encouraging evidence of informative transfer learning across datasets and diagnostic categories.




1 Introduction

Mental disorders manifest in behavior that is driven by disruptions in brain dynamics. Functional MRI captures the nuances of spatiotemporal dynamics that could potentially provide clues to the causes of mental disorders and enable early diagnosis. However, the data obtained for a single subject is of high dimensionality $m$, and to be useful for learning and statistical analysis, one needs to collect datasets with a large number of subjects $n$. Yet, for any disorder, demographic group, or other type of condition, a single study is rarely able to amass datasets large enough to leave the $m \gg n$ mode. Traditionally, this is approached by handcrafting features of a much smaller dimension (Khazaee et al., 2016), effectively reducing $m$ via dimensionality reduction. Often, the dynamics of brain function in these representations vanish into proxy features such as correlation matrices of functional network connectivity (FNC) (Yan et al., 2017). Efforts that pool data from various studies and increase $n$ do exist, but they are difficult to generalize to the study of smaller and more specific disease populations whose data cannot be shared to become part of these pools or are too different from the data in them.

Our goal is to enable the direct study of brain dynamics in smaller datasets and, in turn, to allow an analysis of brain function. In this paper, we show how one can achieve significant improvement in classification directly from dynamical data on small datasets by taking advantage of publicly available large but unrelated datasets. We demonstrate that it is possible to train a model in a self-supervised manner on the dynamics of healthy control subjects from the Human Connectome Project (HCP) (Van Essen et al., 2013) and apply that pre-trained encoder to a completely different dataset collected across multiple sites from healthy controls and schizophrenia subjects.

2 Related Work

Recent advances in unsupervised learning, using self-supervised methods that estimate and maximize mutual information, have reduced the gap between supervised and unsupervised learning (Oord et al., 2018; Hjelm et al., 2018b; Bachman et al., 2019). Such success has already influenced neuroimaging in the case of structural MRI (Fedorov et al., 2019) and even reinforcement learning (Anand et al., 2019).

Prior work in brain imaging has been based on unsupervised methods such as linear ICA (Calhoun et al., 2001) and the HMM framework (Eavani et al., 2013). Other nonlinear approaches have also been proposed to capture the dynamics, such as RBMs (Hjelm et al., 2014) and an RNN modification of ICA (Hjelm et al., 2018a).

Moreover, in most cases researchers in brain imaging deal with small datasets. In this setting, transfer learning (Mensch et al., 2017; Li et al., 2018; Thomas et al., 2019) may be a way to improve results and, in some cases, to enable learning from data otherwise too small to yield any results. Another way to improve performance is a data-generating approach (Ulloa et al., 2018).

3 Method Description

For self-supervised pre-training, we use the spatio-temporal objective ST-DIM (Anand et al., 2019) to maximize predictability between the current latent state and the future spatial state, and between consecutive spatial states. As the lower bound on mutual information, we use the InfoNCE estimator (Oord et al., 2018). Compared to other available estimators, InfoNCE shows better performance (Hjelm et al., 2018b; Bachman et al., 2019) when a large number of negative samples is available, as is readily the case for time series data.

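Let $\mathcal{D} = \{(u_t, v_s) : 1 \le t, s \le N\}$ be a dataset of pairs of values at time points $t$ and $s$ sampled from a sequence of length $N$. A pair $(u_t, v_s)$ is called positive if $s = t + 1$ and negative if $s \neq t + 1$. A positive pair models the joint distribution and a negative pair the marginal distributions. The InfoNCE estimator $\mathcal{I}_f$ is then defined as:

$$\mathcal{I}_f(\mathcal{D}) = \sum_{t=1}^{N-1} \log \frac{\exp f(u_t, v_{t+1})}{\sum_{s=1}^{N} \exp f(u_t, v_s)},$$

where $f$ is a critic function (Tschannen et al., 2019). Specifically, we use a separable critic $f(u_t, v_s) = \phi(u_t)^\top \psi(v_s)$, where $\phi$ and $\psi$ are embedding functions parametrized by neural networks. These embedding functions allow the critic to be computed in a common space from two inputs of different dimensionality. The critic learns embeddings such that it assigns higher values to positive pairs than to negative pairs:

$$f(u_t, v_{t+1}) \gg f(u_t, v_s), \quad s \neq t + 1.$$

We define the latent state $z_t$ as the output of the encoder and the spatial state $c_t^l$ as the output of the $l$th layer of the encoder for the input at time point $t$. To optimize the objective between the current latent state and the future spatial state, the critic function for the input pair $(z_t, c_{t+1}^l)$ is $f_1 = \phi(z_t)^\top \psi(c_{t+1}^l)$, and for consecutive spatial states $(c_t^l, c_{t+1}^l)$ it is $f_2 = \psi(c_t^l)^\top \psi(c_{t+1}^l)$. Finally, the loss is the sum of the InfoNCE with $f_1$ and the InfoNCE with $f_2$: $L = \mathcal{I}_{f_1} + \mathcal{I}_{f_2}$.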
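The objective in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `info_nce` and `st_dim_objective` are hypothetical, the inputs are assumed to be already embedded by the two embedding networks, the positive for time point t is assumed to be time point t + 1, and all other time points of the same sequence act as negatives.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, targets):
    """InfoNCE estimate for one sequence of already-embedded states.

    anchors[t] stands for the embedded first element of a pair and
    targets[s] for the embedded second element (assumed equal lengths).
    For anchor t the positive is s = t + 1; every other s is a negative.
    Returns the sum over t of the log-softmax score of the positive (<= 0).
    """
    total = 0.0
    for t in range(len(anchors) - 1):
        # Separable critic: inner product of the two embeddings, for every s.
        scores = [dot(anchors[t], v) for v in targets]
        # Numerically stable log of the softmax denominator.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[t + 1] - log_norm
    return total


def st_dim_objective(latent_emb, spatial_emb):
    """Sum of the two InfoNCE terms: latent-to-spatial and spatial-to-spatial."""
    return info_nce(latent_emb, spatial_emb) + info_nce(spatial_emb, spatial_emb)
```

The returned value is a quantity to maximize; a gradient-based optimizer would typically minimize its negative. In practice, negatives are usually also drawn from other sequences in the batch, which this single-sequence sketch omits.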

4 Experiments

4.1 Datasets

4.1.1 Simulation

To simulate the data, we generate multiple -node graphs with stable transition matrices. Using these, we generate multivariate time series with vector autoregressive (VAR) and structural vector autoregressive (SVAR) models (Lütkepohl, 2005).

First, we generate VAR time series with size . Then we split our dataset into samples for training, for validation, and for testing. Using these samples, we pre-train an encoder and evaluate it on its ability to identify consecutive windows sampled from the whole time series.

In the final downstream task, we classify whether a whole time series was generated by VAR or SVAR (VAR undersampled at rate 2). We create graphs with corresponding stable transition matrices and generate samples ( for each), split into for training, for validation, and for the hold-out test. Here we also use windows as a single time-point input.
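The simulation procedure above could be sketched as follows. This is an illustrative sketch only: the function names, the Gaussian noise model, the row-rescaling used to guarantee stability, and the sizes in the usage note are assumptions and are not taken from the paper.

```python
import random


def stable_transition_matrix(n_nodes, rng, scale=0.9):
    """Random matrix whose absolute row sums all equal `scale` < 1.

    An infinity norm below 1 bounds the spectral radius below 1,
    which is sufficient for the VAR(1) process to be stable.
    """
    matrix = []
    for _ in range(n_nodes):
        row = [rng.uniform(-1.0, 1.0) for _ in range(n_nodes)]
        norm = sum(abs(x) for x in row) or 1.0
        matrix.append([scale * x / norm for x in row])
    return matrix


def simulate_var(a, length, rng, noise_std=1.0):
    """Generate x_t = A x_{t-1} + noise for `length` time points."""
    n = len(a)
    x = [rng.gauss(0.0, noise_std) for _ in range(n)]
    series = []
    for _ in range(length):
        x = [sum(a[i][j] * x[j] for j in range(n)) + rng.gauss(0.0, noise_std)
             for i in range(n)]
        series.append(x)
    return series


def undersample(series, rate=2):
    """Keep every `rate`-th time point of a series."""
    return series[::rate]
```

For example, with `rng = random.Random(0)` and `a = stable_transition_matrix(5, rng)`, a VAR sample of length 100 is `simulate_var(a, 100, rng)`, and a matching undersampled sample is `undersample(simulate_var(a, 200, rng), 2)`, so that both classes have the same length.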

4.1.2 Real data

Two independent datasets were used in the current study. The first is a schizophrenia dataset from the Function Biomedical Informatics Research Network (FBIRN) (Keator et al., 2016; these data were downloaded from the Function BIRN Data Repository, Project Accession Number 2007-BDR-6UHZ1), and the second is a healthy-subject dataset from the 1200 Subject release of the Human Connectome Project (HCP) (Van Essen et al., 2013).

The FBIRN dataset was pre-processed with SPM12 (Penny et al., 2011) in the MATLAB 2016b environment. Slice-timing correction was performed first, and then subject head motion was corrected by the realignment procedure. After that, the data were warped to MNI space using the EPI template and resampled to mm voxels. Finally, the data were smoothed with a 6 mm FWHM Gaussian kernel. The FBIRN dataset consists of subjects, including SZ patients and healthy controls.

The resting-state fMRI HCP data come pre-processed by the pipeline of Glasser et al. (2013). It includes removal of spatial distortions, compensation for subject motion, reduction of the bias field, normalization with a global mean, and final brain masking. The pre-processed HCP data were then warped to MNI space using the EPI template and resampled to the mm voxels using the same bounding box, to guarantee that the HCP and FBIRN datasets have the same spatial resolution and dimensions. HCP consists only of healthy controls.

For each dataset, intrinsic connectivity networks (ICNs) were extracted using the pipeline described in Du et al. (2019). These ICNs are considered to be non-noise components that provide meaningful functional network information and were therefore used in training.

4.2 Training

The encoder for the simulation experiment consists of D convolutional layers with output features , kernel sizes , and stride , each followed by ReLU (Glorot et al., 2011), and then a linear layer with units. For real data, we use D convolutional layers with , and respectively, followed by a linear layer with units. Then, for all possible pairs in the batch, we took the flattened features after the rd convolutional layer and the features from the last layer . We embedded them, using for and for , into a dimensional vector to compute the score of the critic function or . Using these scores, we computed the loss. The neural networks were trained using the Adam optimizer (Kingma and Ba, 2014). The weights were initialized using Xavier initialization (Glorot and Bengio, 2010).

For the simulation experiment, we first train our encoder on windows from the VAR time series using the InfoNCE-based loss, and then train a supervised classifier on windows. This window-based classification provides promising results (accuracy ). However, in similar real problems we are more interested in classifying subjects, i.e., entire time series, than single windows. Hence, we perform classification based on the whole time series. In this setting, the entire time series is encoded as a sequence of representations and fed through a biLSTM classifier. Two additional linear layers with hidden units on top of the last hidden state of the biLSTM map the representation to classification scores.

For the real data, as in the simulations, we successfully train (accuracy ) our encoder on consecutive windows of fMRI from HCP healthy subjects. The features computed for each window of the whole fMRI sequence are then used to train a biLSTM classifier on the FBIRN dataset. The biLSTM classifies SZ and HC subjects. Overall, each subject's fMRI consists of a series of half-overlapping windows of components by time points.
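The windowing step described above, in which a subject's time series is cut into windows that overlap by half, might look like the following sketch. The function name and the time-major input layout are assumptions made for illustration; the window length is left as a parameter because the source does not preserve the value used.

```python
def half_overlapping_windows(series, window):
    """Split a time-major series into windows that overlap by half.

    series: list of time points, each a list of component values.
    Returns a list of windows, each laid out as components x time points.
    """
    if window < 2 or window % 2:
        raise ValueError("window must be an even number >= 2")
    stride = window // 2
    windows = []
    for start in range(0, len(series) - window + 1, stride):
        chunk = series[start:start + window]
        # Transpose from time x components to components x time.
        windows.append([list(component) for component in zip(*chunk)])
    return windows
```

Each window would then be passed through the pre-trained encoder, and the resulting sequence of window representations would form the input to the sequence classifier. Trailing time points that do not fill a complete window are dropped in this sketch.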

Figure 1: VAR vs. SVAR time-series classification accuracy of synthetic data.
Figure 2: Classification results on the progressively larger datasets.

4.3 Results

Here we compare an end-to-end supervised model without pre-training (NPT) against models with frozen layers of the pre-trained encoder (FPT) and with unfrozen layers of the pre-trained encoder (UFPT).

In the simulation study, we observe that the pre-trained model can be fine-tuned with only a small amount of downstream data. Our model can classify a randomly chosen time series as a sample of VAR or SVAR (Figure 1). Note that with very few training samples, models based on the pre-trained encoder outperform supervised models. However, as the number of samples grows, the accuracy achieved with or without pre-training levels out.

As Figure 2 shows, the real data results substantiate the insights gained in the simulation study. The test dataset consists of subjects held out from the training and validation processes and is the same for all tests in the plot. Training data were randomly resampled ten times from the available data pool. In other words, self-supervised transferable pre-training always helps when we have very few samples, yielding higher AUC.
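The evaluation protocol described here, a fixed hold-out test set scored by AUC with training subsets redrawn ten times, can be sketched as below. The helper names are hypothetical, and drawing each training subset without replacement is an assumption; the source only states that training data were randomly resampled ten times.

```python
import random


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def resample_training_sets(pool, size, repeats=10, seed=0):
    """Draw `repeats` random training subsets of `size` subjects from the pool."""
    rng = random.Random(seed)
    return [rng.sample(pool, size) for _ in range(repeats)]
```

For each training size, one model per resampled subset would be trained and scored with `roc_auc` on the same held-out subjects, so that the spread across the ten runs reflects only the choice of training data.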

5 Conclusions and Future Work

As we have demonstrated, self-supervised pre-training of a spatiotemporal encoder on fMRI of healthy subjects provides benefits that transfer across datasets and collection sites, and to schizophrenia classification. Learning the dynamics of fMRI helps to improve classification results for schizophrenia on small datasets that otherwise do not provide reliable generalization. Although this result is highly promising by itself, we conjecture that direct application to spatiotemporal data will bring benefits beyond improved classification accuracy in future work. ICA components form a smaller and thus easier-to-handle space that still exhibits all the dynamics of the signal. In the future, we will move beyond ICA pre-processing and work with fMRI data directly. We expect model introspection to yield insight into the spatio-temporal biomarkers of schizophrenia. We will also test the same, analogously pre-trained encoder on datasets covering other mental disorders, such as MCI and bipolar disorder. We are optimistic about the outcome because the proposed pre-training is oblivious to the downstream use and is done in a manner quite different from the classifier's task. It may indeed be learning crucial information about dynamics that contains important clues to the nature of mental disorders.

6 Acknowledgement

This study is supported by NIH grant R01EB020407.

Data were provided [in part] by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. Additional data used in this study were downloaded from the Function BIRN Data Repository, supported by grants to the Function BIRN (U24-RR021992) Testbed funded by the National Center for Research Resources at the National Institutes of Health, U.S.A.