Interpretable LSTMs For Whole-Brain Neuroimaging Analyses

10/23/2018 · by Armin W. Thomas, et al.



Abstract

The analysis of neuroimaging data poses several strong challenges, in particular due to its high dimensionality, its strong spatio-temporal correlation, and the comparably small sample sizes of the respective datasets. To address these challenges, conventional decoding approaches, such as the searchlight analysis, reduce the complexity of the decoding problem by considering only local clusters of voxels, thereby neglecting the distributed spatial patterns of brain activity underlying many cognitive states. In this work, we introduce the DLight framework, which overcomes these challenges by utilizing a long short-term memory unit (LSTM) based deep neural network architecture to analyze the spatial dependency structure of whole-brain fMRI data. In order to maintain interpretability of the neuroimaging data, we adapt the layer-wise relevance propagation (LRP) method. Thereby, we enable the neuroscientist to study the association that the LSTM has learned between the data and the cognitive state of the individual. We demonstrate the versatility of DLight by applying it to a large fMRI dataset of the Human Connectome Project. We show that the decoding performance of our method scales better with large datasets and, moreover, outperforms conventional decoding approaches, while still detecting physiologically appropriate brain areas for the cognitive states classified. We also demonstrate that DLight is able to detect these areas on several levels of data granularity (i.e., group, subject, trial, time point).

 

Keywords

Decoding, fMRI, whole-brain, deep learning, LSTM, interpretability.

1 Introduction

Neuroimaging research has recently started collecting large corpora of experimental data, often comprising many hundreds of individuals [1, 2]. By collecting these datasets, researchers want to gain insights into the association between the cognitive states of an individual (e.g., while viewing images or performing a specific task) and the underlying brain activity, while also studying the variability of these associations across the population [1]. Yet, many conventional methods used to analyze these data (such as the general linear model [3] or the searchlight analysis [4]) come with several important restrictions: (1) They are mathematically linear to allow an interpretation of the data. Although it has been shown that linear methods are often advantageous when decoding cognitive states of an individual from small clusters of brain activity only [5], it is an open question whether this is also true when larger clusters of brain activity are used (i.e., the whole brain). (2) They often incorporate restrictive regional independence assumptions to maintain computational and statistical feasibility, by analyzing single voxels and brain areas independently of one another. We know, however, that widespread spatial correlations of brain activity persist throughout the whole brain [6, 7]. (3) Their performance in decoding the cognitive state of an individual from the underlying brain activity often does not improve substantially with large datasets [8]. (4) Their ability to study the association between brain activity and the cognitive state of an individual on the level of single time points is often limited, as they require aggregating over group- or subject-level data.

In order to approach these limitations, researchers have recently started applying sophisticated machine learning techniques to neuroimaging data (e.g., [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]). However, none of these approaches has yet found widespread empirical application in the neurosciences, where more conventional analysis methods, such as the searchlight [4] and the GLM [3], are still most commonly used. For this reason, our work will use these two methods for comparison, as they can be seen as the current convention for the analysis of neuroimaging data.

At first sight, the analysis of neuroimaging data seems ideally suited for the application of deep learning (DL) [20, 21] models, due to the availability of large and structured datasets. Although researchers have started exploring the application of DL models to neuroimaging data (e.g., [22, 10, 23, 24, 25, 18, 26]), three major obstacles have so far prevented broad DL usage: (1) No structured framework for the application of DL models to whole-brain neuroimaging data exists. (2) Neuroimaging data are high dimensional, while containing comparably few samples. For example, a typical functional Magnetic Resonance Imaging (fMRI) dataset comprises up to a few hundred samples per subject and recently up to several hundred subjects [1], while each sample contains several hundred thousand dimensions (i.e., voxels). In such classification settings, DL models (as well as more traditional machine learning approaches) have been shown to suffer from overfitting (by too closely capturing those dynamics that are specific to the training data, so that their predictive performance does not generalize well to new data). (3) DL models have often been considered non-linear black box models to the neuroscientist, impeding insight into (and interpretation of) the underlying brain activity. Therefore, even if a DL model could be successfully trained to accurately decode cognitive states from neuroimaging data, the underlying mapping between input and prediction would still be unknown.

To overcome these challenges, we propose the DLight framework, which defines a method to utilize long short-term memory (LSTM) based DL architectures [27, 28] to analyze whole-brain neuroimaging data. In DLight, each whole-brain volume is sliced into a sequence of axial images. These images are then sequentially processed by a convolutional LSTM in order to decode the underlying cognitive state of an individual. By modeling the full spatial dependency structure of the whole brain, DLight is not restricted to local clusters of voxels. To subsequently relate the observed brain activity and cognitive state, DLight applies layer-wise relevance propagation (LRP) [29, 30] to the trained DL model, thereby identifying those areas of the brain whose activity the DL model used to make a decoding decision. In particular, the LRP method decomposes the decoding decision of the DL model into the contributions of the single input voxels to this decision. As the LRP analysis is performed on the level of single brain samples, DLight is able to analyze neuroimaging data on several levels of data granularity, from the level of the group down to the level of single time points.
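As an illustration of this slicing step, the following minimal NumPy sketch (the function name and the assumption that the last array axis is the inferior-superior axis are ours, not part of the original implementation) turns a single 3D fMRI volume into the bottom-to-top sequence of axial images that is fed to the DL model:

```python
import numpy as np

def volume_to_slice_sequence(volume):
    """Turn a 3D fMRI volume (x, y, z) into a sequence of axial images.

    Hypothetical helper: assumes the last axis is the inferior-superior (axial) axis,
    as is the case for MNI-space NIfTI data.
    """
    # Move the axial (z) axis to the front; the result has shape (n_slices, x, y).
    return np.moveaxis(volume, -1, 0)

# Toy example with the slice dimensions used in this work:
# a 74 x 92 x 81 volume becomes a sequence of 81 axial 74 x 92 images.
volume = np.random.rand(74, 92, 81)
slice_sequence = volume_to_slice_sequence(volume)
print(slice_sequence.shape)  # (81, 74, 92)
```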

Here, we will provide an extensive proof of concept of DLight by applying it to an openly available fMRI dataset of the Human Connectome Project [1], in particular, to the data of an N-back task in which 100 subjects viewed images of either body parts, faces, places or tools in two separate experiment runs. Subsequently, we will evaluate the performance of DLight in decoding the four underlying cognitive states (resulting from viewing an image of either of the four stimulus classes) from the fMRI data and localizing the brain areas that the DL model used to do so. We will also demonstrate that the performance of DLight is superior to the conventional searchlight analysis [4] in decoding the cognitive states from the data.

Figure 1: Experimental Paradigm. 100 subjects completed two experiment runs in the fMRI. Each run consisted of eight task and four fixation blocks. The four stimulus types (body, face, place, tool) were presented in separate blocks. Each task block consisted of 10 trials. In each trial, a stimulus was presented for 2 s, followed by a 500 ms interstimulus interval. Subjects performed an N-back task, in which they were asked to respond "target" when the currently presented stimulus was the same as a target stimulus. The target was either presented at the beginning of the block (0-back) or subjects were asked to indicate whether the current stimulus was the same as the stimulus two back (2-back). Half of the blocks used a 2-back and the other half a 0-back condition. For illustrative purposes, only the 0-back task is depicted.

2 Data

2.1 Experimental Paradigm

100 participants performed a version of the N-back task in two separate fMRI runs (see Fig. 1 for an overview). Each of the two runs consisted of eight task blocks (25 s each) and four fixation blocks (15 s each). Within each run, the four different stimulus types (body, face, place and tool) were presented in separate blocks. Half of the task blocks used a 2-back working memory task (participants were asked to respond "target" when the current stimulus was the same as the stimulus 2 back) and the other half a 0-back working memory task (a target cue was presented at the beginning of each block and the participants were asked to respond "target" whenever the target cue was presented in the block). Each task block consisted of 10 trials (2.5 s each). In each trial, a stimulus was presented for 2 s followed by a 500 ms interstimulus interval (ISI). As we were not interested in identifying any effect of the N-back task condition on the evoked brain activity, we pooled the data of both N-back conditions.

2.2 FMRI Data Acquisition

Structural MRI and functional MRI data of 100 unrelated participants for this experiment were provided in a preprocessed format by the Human Connectome Project, WU Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University.

Whole-brain EPI acquisitions were acquired with a 32 channel head coil on a modified 3T Siemens Skyra with TR=720 ms, TE=33.1 ms, flip angle=52 deg, BW=2290 Hz/Px, in-plane FOV=, 72 slices, 2.0 mm isotropic voxels with a multi-band acceleration factor of 8. Two runs were acquired, one with a right-to-left and the other with a left-to-right phase encoding (for further methodological details on MRI data acquisition, see [31]). The Human Connectome Project preprocessing pipeline for functional MRI data (“fMRIVolume”) [32] includes the following steps: gradient unwarping, motion correction, fieldmap-based EPI distortion correction, brain-boundary based registration of EPI to structural T1-weighted scan, non-linear registration into MNI152 space, and grand-mean intensity normalization (for further fMRI preprocessing details, see [31, 32]).

2.3 FMRI Data Preprocessing

In addition to the minimal preprocessing of the fMRI data that was performed by the Human Connectome Project, we applied the following preprocessing steps to the data for all decoding analyses: volume-based smoothing of the fMRI sequences with a 3mm Gaussian kernel and linear detrending and standardization of the single voxel signal time-series (resulting in a zero-centered voxel time-series with unit variance). All of our preprocessing was performed by the use of the Nilearn 0.4 Python library [33].
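For illustration, these two preprocessing steps might be sketched with Nilearn as follows (the file name is a placeholder, and the use of the NiftiMasker interface is our assumption; the original pipeline may have been set up differently):

```python
from nilearn.image import smooth_img
from nilearn.input_data import NiftiMasker

# Placeholder file name for one 4D fMRI run of one subject.
run_img = smooth_img("subject01_run1_bold.nii.gz", fwhm=3)  # 3 mm Gaussian smoothing

# Linear detrending and standardization of every voxel time series
# (zero-centered, unit-variance signals).
masker = NiftiMasker(detrend=True, standardize=True)
voxel_time_series = masker.fit_transform(run_img)  # shape: (n_volumes, n_voxels)
```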

2.4 Decoding Task

The target of all decoding analyses was to decode the four cognitive states of the experiment from the fMRI data.

3 Searchlight Analysis

The searchlight analysis [4] belongs to the class of multi-voxel pattern analysis methods (MVPA) [34] and is a standard tool for decoding cognitive states from fMRI data. Its goal is to identify clusters of voxels in the brain whose activity relates to an underlying cognitive state by probing the ability of a classifier to decode this state from the activity pattern of these voxels. To do this, the entire brain is scanned with a sphere of a given radius (the searchlight) and the performance of the classifier in decoding the cognitive states is evaluated at each location, resulting in a map of decoding accuracies. A common choice for such a classifier is the linear Support Vector Machine (SVM; Supplementary Information 3) [35].

All searchlight analyses reported in the following used a searchlight radius of 5.6mm, in line with the suggestions by Kriegeskorte and colleagues [4]. Importantly, we applied two distinct searchlight approaches for the subject-level and group-level fMRI data.

On the subject-level, we computed the searchlight analysis separately within the fMRI data of each individual subject. For each individual, we first split the data according to the two experiment runs (see section 2) and used the data of the first run to train the SVM and the data of the second run to evaluate its decoding performance. We estimated the SVM parameters in a one-versus-one approach. Here, one SVM classifier is trained to distinguish each unique pair of cognitive states in the training data. A prediction is made according to the class with the most votes from all classifiers.

The previously described subject-level analysis approach does not easily scale to the group-level, as the fMRI data of many subjects easily exceed the working memory capacities of a regular computing station (the fMRI data of one subject span approximately 1 GB). Therefore, we adopted another widely used approach (e.g., [36, 37, 38]) by computing the searchlight analysis based on the β-maps of a subject-level GLM analysis (Supplementary Information 2). Here, we split the fMRI data into distinct training and test datasets by assigning the first-level β-maps of 70 randomly selected subjects to the training and the maps of the remaining 30 subjects to the test dataset. We then trained the SVM searchlight in a one-vs-one approach to distinguish the cognitive states underlying each β-map in the training dataset. Subsequently, we predicted the cognitive states of each β-map of the held-out test data.

For both approaches we defined the decoding accuracy of the searchlight analysis as the maximum decoding accuracy that was achieved at any searchlight location in the brain.
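As an illustration, the subject-level variant of this analysis could be sketched with Nilearn's SearchLight estimator roughly as follows (the function and variable names are placeholders and the original implementation may differ in its details):

```python
from nilearn.decoding import SearchLight
from sklearn.model_selection import PredefinedSplit
from sklearn.svm import SVC

def run_subject_searchlight(fmri_img, labels, mask_img, n_run1, n_run2):
    """Train on run 1, test on run 2, with a 5.6 mm linear-SVM searchlight."""
    # -1 marks training samples (run 1), 0 marks the single test fold (run 2).
    cv = PredefinedSplit(test_fold=[-1] * n_run1 + [0] * n_run2)
    searchlight = SearchLight(
        mask_img,                        # whole-brain mask (NIfTI image)
        radius=5.6,                      # searchlight radius in mm
        estimator=SVC(kernel="linear"),  # linear SVM; multi-class is one-vs-one by default
        cv=cv,
        n_jobs=-1,
    )
    searchlight.fit(fmri_img, labels)    # labels: one cognitive state per fMRI volume
    return searchlight.scores_           # decoding accuracy at every searchlight center
```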

Figure 2: Illustration of the DLight framework. Each whole-brain fMRI volume is sliced into a sequence of axial images. These images are then passed to a DL model consisting of a feature extractor, an LSTM and an output unit. First, the feature extractor reduces the dimensionality of the brain slices through a sequence of eight convolutional layers. The resulting sequence of higher-level slice representations is then fed to a bi-directional LSTM modeling the spatial dependencies between brain slices. Lastly, the DL model outputs a decoding prediction about the underlying cognitive state through a softmax layer with four output neurons (one for each cognitive state in the dataset). Once the prediction is made, DLight utilizes the LRP method to decompose the prediction into the contributions of the single input voxels to the prediction, enabling an analysis of the association between fMRI data and cognitive state.

4 DLight Analysis

4.1 Deep Learning Model

The DL model underlying DLight consists of three distinct computational modules, namely a feature extractor, an LSTM, and an output unit (Fig. 2). The task of this model is to decode the four cognitive states from the whole-brain fMRI data. First, the feature extractor processes each of the axial volume slices through a chain of convolutions [39], resulting in a sequence of higher-level features for each volume slice. These higher-level slice representations are then fed to an LSTM [27], modeling their spatial dependency structure and thereby enabling an analysis of widely distributed (i.e., whole-brain) activity. Lastly, the output unit makes a decoding decision by projecting the output of the LSTM into a lower-dimensional space, spanning the cognitive states to be classified. Here, a probability for each cognitive state is estimated indicating whether the current brain volume belongs to each of the states. This combination of convolutional and recurrent DL models was inspired by previous research, showing that it is generally well-suited to learn the spatio-temporal dependency structure of long sequences of input data [28, 40, 41].

Importantly, the DLight approach is not dependent on any specific architecture of each of these three modules. The DL model architecture described in the following is exemplary. Further research is needed to explore the effect of specific module architectures on the performance of DLight.

4.1.1 Feature Extractor

In line with previous work, we first applied an outer brain mask to the fMRI data of all subjects (Supplementary Information 1) [42], resulting in 81 axial brain slices for each brain sample (Fig. 2) with 74 x 92 pixels per slice.

The feature extractor that we used in our DL model was composed of a sequence of eight convolutional layers [39]. A convolutional layer consists of a set of kernels (or filters) that learn the local features of the input image that are relevant to discriminate a decoding target. These local features are then convolved over the input, resulting in an activation map indicating whether a feature represented by the convolutional kernel is present at each given location of the input. In particular, the kernel is sequentially moved over the input image and at each location of the input, the value of each cell of the kernel matrix is multiplied with the overlapping input pixel value and then summed over the kernel:

$a_{i,j} = \phi\left(b + \sum_{m}\sum_{n} w_{m,n}\, x_{i+m,\, j+n}\right)$   (1)

Here, $b$ represents the bias of the kernel, while $\phi$ represents the activation function. $m$ and $n$ represent the row and column index of the kernel matrix $w$, whereas $i$ and $j$ represent the row and column index of the activation map $a$ ($x$ denotes the input image).

We specified the sequence of convolutional layers as follows (see Fig. 2): conv3-16, conv3-16, conv3-16, conv3-16, conv3-32, conv3-32, conv3-32, conv3-32 (notation: conv(kernel size)-(number of kernels)). All convolution layers were activated through a rectified linear unit (ReLU) function:

$\phi(x) = \max(0, x)$   (2)

Importantly, all even-numbered convolutional layers were moved over the input image with a step size of one pixel and all odd-numbered layers with a step size of two pixels. This sequence of eight convolutional layers resulted in a 960-dimensional representation of each volume slice.
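To make the layer specification concrete, the following Keras sketch builds such a feature extractor for a single axial slice (the use of Keras and of 'same' padding are our assumptions; the original model was implemented directly in tensorflow 1.4):

```python
from tensorflow.keras import layers, models

def build_feature_extractor():
    """Conv stack conv3-16 (x4) followed by conv3-32 (x4) for one 74 x 92 axial slice."""
    model = models.Sequential()
    for i, n_kernels in enumerate([16, 16, 16, 16, 32, 32, 32, 32]):
        stride = 2 if i % 2 == 0 else 1  # layers 1, 3, 5, 7 downsample; the rest keep resolution
        kwargs = {"input_shape": (74, 92, 1)} if i == 0 else {}
        model.add(layers.Conv2D(n_kernels, 3, strides=stride, padding="same",
                                activation="relu", **kwargs))
    model.add(layers.Flatten())  # with 'same' padding this yields a 960-dimensional representation
    return model
```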

4.1.2 Long Short-Term Memory Unit

We applied a bi-directional LSTM comprising two independent LSTM units, both iterating through the entire sequence of input slices, but in reverse order (one from bottom-to-top and the other from top-to-bottom). An LSTM unit contains a hidden cell state $c_t$, which stores information over the input sequence $x_1, \ldots, x_T$ with length $T$ and outputs a vector $h_t$ for each input vector $x_t$ at sequence step $t$. The unit has the ability to add and remove information from $c_t$ through a series of carefully designed gates. In a first step, the LSTM unit decides what information from the cell state is removed. This is done by a fully-connected sigmoid layer, the forget gate $f_t$:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$   (3)
$\sigma(z) = \frac{1}{1 + e^{-z}}$   (4)

The forget gate outputs a number between 0 and 1 for each entry in the cell state $c_{t-1}$ at the previous sequence step $t-1$. Next, the LSTM unit decides what information is going to be stored in the cell state. This operation contains two elements: a sigmoid layer, the input gate $i_t$, which decides which values of the cell state will be updated, and a tanh layer, which creates a new vector of candidate values $\tilde{c}_t$:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$   (5)
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$   (6)
$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$   (7)

Subsequently, the old cell state $c_{t-1}$ is updated into the new cell state $c_t$:

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$   (8)

Lastly, the LSTM computes its output $h_t$. Here, a third sigmoid layer, the output gate $o_t$, decides what part of $c_t$ will be outputted. Subsequently, $c_t$ is passed through another tanh layer to make sure that $h_t$ is scaled between -1 and 1:

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$   (9)
$h_t = o_t \odot \tanh(c_t)$   (10)

Each of the two LSTM units in our DL model contained 40 output neurons. Both LSTM units only passed the output $h_T$ for the last sequence element to the output unit of our DL model.

4.1.3 Output Unit

The output unit of our model comprised a single fully-connected softmax layer with four channels, assigning a probability to each of the four cognitive states that the current fMRI sample belongs to this state:

$p(y = k \mid x) = \frac{e^{z_k}}{\sum_{c=1}^{4} e^{z_c}}$   (11)

Here, $z_k$ denotes the input to the $k$-th output channel.
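Continuing the Keras sketch from above (again only an approximation of the original tensorflow 1.4 implementation), the three modules could be assembled as follows:

```python
from tensorflow.keras import layers, models

def build_dlight_model(n_slices=81, n_states=4):
    """Per-slice feature extractor, bi-directional LSTM, and softmax output unit."""
    feature_extractor = build_feature_extractor()  # sketch defined in section 4.1.1
    return models.Sequential([
        # Apply the same convolutional stack to each of the 81 axial slices.
        layers.TimeDistributed(feature_extractor, input_shape=(n_slices, 74, 92, 1)),
        # Two LSTMs with 40 output neurons each, one per reading direction; only the
        # output for the last sequence element is passed on (return_sequences=False).
        layers.Bidirectional(layers.LSTM(40)),
        # Fully-connected softmax layer assigning a probability to each cognitive state.
        layers.Dense(n_states, activation="softmax"),
    ])
```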

4.1.4 Model training

We split the fMRI data according to the same split that we used for our SVM searchlight group-level analysis (see section 3) by assigning the same 70 subjects to the training data and the remaining 30 subjects to the test dataset. We then iteratively trained the DL model through backpropagation [43] over a period of 60 epochs by the use of the ADAM optimization algorithm as implemented in tensorflow 1.4 [44]. To prevent overfitting, we applied dropout regularization to all network layers [45], global gradient norm clipping (with a clipping threshold of 5) [46], as well as an early stopping of the training (see Supplementary Fig. S1 for an overview of training statistics). During the training, we set the dropout probability to 50% for all network layers, except for the first four convolutional layers, where we reduced the dropout probability to 30% for the first two layers and 40% for the third and fourth layer. Each training epoch was defined as a complete iteration over all samples in the training dataset. We used a learning rate of 0.0001 and a batch size of 32. All network weights were initialized by the use of a normal-distributed random initialization scheme [47]. The DL model was written in tensorflow 1.4 [44] and the interprettensor library (https://github.com/VigneshSrinivasan10/interprettensor).
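A rough sketch of this training setup in the same Keras style might look as follows (the data variables are placeholders, per-tensor norm clipping stands in for global gradient norm clipping, the early-stopping patience is an assumption, and the layer-wise dropout configuration is omitted for brevity):

```python
import tensorflow as tf

model = build_dlight_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=5.0),  # ADAM + norm clipping
    loss="sparse_categorical_crossentropy",  # integer-coded cognitive state labels
    metrics=["accuracy"],
)
model.fit(
    train_volumes, train_labels,              # placeholders: sliced fMRI samples and state labels
    validation_data=(test_volumes, test_labels),
    epochs=60,
    batch_size=32,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)],
)
```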

4.2 Layer-Wise Relevance Propagation

The goal of the LRP method [29, 48, 49] is to understand the contribution of a single dimension $d$ of an input $x$ to the prediction $f(x)$ that is made by a linear or non-linear classifier $f$. We denote the contribution of a single dimension as its relevance $R_d$. One way of decomposing the prediction $f(x)$ is by the sum of the relevance values of each dimension of the input:

$f(x) \approx \sum_{d} R_d$   (12)

Qualitatively, any $R_d < 0$ can be interpreted as evidence against the presence of a classification target, while $R_d > 0$ denotes evidence for the presence of the target. Importantly, LRP assumes that $f(x) > 0$ indicates evidence for the presence of a target.

Let us assume that the relevance $R_j^{(l+1)}$ of a neuron $j$ at network layer $l+1$ for the prediction $f(x)$ is known. We would like to decompose this relevance into the messages $R_{i \leftarrow j}^{(l, l+1)}$ that are sent to those neurons $i$ in layer $l$ which provide the inputs to neuron $j$:

$R_j^{(l+1)} = \sum_{i \in (l)} R_{i \leftarrow j}^{(l, l+1)}$   (13)

Thereby, we can also define the relevance of a neuron $i$ in layer $l$ by summing all messages from the neurons at layer $l+1$:

$R_i^{(l)} = \sum_{j \in (l+1)} R_{i \leftarrow j}^{(l, l+1)}$   (14)

With these two equations, the propagation of relevance from layer $l+1$ to layer $l$ is defined. While the relevance of the output neuron at the last layer $L$ is defined as $R^{(L)} = f(x)$, the dimension-wise relevance scores on the input neurons are given by $R_d = R_d^{(1)}$.

4.2.1 Propagation Rules

For all weighted connections of the DL model (see eqs. 1, 3, 5, 6 and 9), DLight defines the messages $R_{i \leftarrow j}^{(l, l+1)}$ as follows:

$R_{i \leftarrow j}^{(l, l+1)} = \frac{z_{ij} + \frac{\epsilon \cdot \mathrm{sign}(z_j)}{N}}{z_j + \epsilon \cdot \mathrm{sign}(z_j)} \, R_j^{(l+1)}$   (15)

Here, $z_{ij} = x_i^{(l)} w_{ij}$ and $z_j = \sum_i z_{ij} + b_j$, while $\epsilon$ represents a stabilizer term that is necessary to avoid numerical degenerations when $z_j$ is close to 0 (we set $\epsilon$ to a small fixed value) and $N$ indicates the number of lower-layer neurons connected to neuron $j$.

Importantly, an LSTM also applies another type of connection, which we refer to as multiplicative connection (see eqs. 8 and 10). Let $z_j$ be an upper-layer neuron whose value in the forward pass is computed by multiplying two lower-layer neuron values $z_g$ and $z_s$ such that $z_j = z_g \cdot z_s$. These multiplicative connections occur when we multiply the output of a gate neuron $z_g$, whose values range between 0 and 1, with an instance of the hidden cell state, which we will call the source neuron $z_s$. For these types of connections, we set the relevance of the gate neuron to $R_g = 0$ and the relevance of the source neuron to $R_s = R_j$, where $R_j$ denotes the relevance of the upper-layer neuron (as proposed in [50]). The reasoning behind this rule is that the gate neuron already decides in the forward pass how much of the information contained in the source neuron should be retained to make the classification. Even if this rule seems to ignore the values of $z_g$ and $z_s$, these are actually taken into account when computing the relevance $R_j$ from the relevances of the upper-layer neurons to which $z_j$ is connected by weighted connections. We refer the reader to [51, 52] for more information about neural network explanation techniques.
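To make the two propagation rules concrete, the following NumPy sketch implements a simplified ε-rule for a single fully-connected layer (omitting the redistribution term in the numerator for brevity) and the relevance routing of a multiplicative gate connection; the variable names are our own and the original implementation uses the interprettensor library:

```python
import numpy as np

def lrp_weighted(x, w, b, relevance_upper, eps=1e-3):
    """Simplified epsilon rule for a weighted connection (cf. eq. 15).

    x: lower-layer activations (n_in,), w: weights (n_in, n_out),
    b: biases (n_out,), relevance_upper: upper-layer relevances (n_out,).
    """
    z_ij = x[:, None] * w                       # individual contributions z_ij = x_i * w_ij
    z_j = z_ij.sum(axis=0) + b                  # pre-activations of the upper-layer neurons
    denom = z_j + eps * np.sign(z_j)            # stabilized denominator
    messages = z_ij / denom * relevance_upper   # messages R_{i <- j}
    return messages.sum(axis=1)                 # lower-layer relevances (cf. eq. 14)

def lrp_multiplicative(relevance_upper):
    """Multiplicative (gating) connection: the gate neuron receives zero relevance,
    the source (cell-state) neuron inherits all upper-layer relevance."""
    relevance_gate = np.zeros_like(relevance_upper)
    relevance_source = relevance_upper.copy()
    return relevance_gate, relevance_source
```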

4.2.2 LRP in the DLight Framework

In the context of this work, we only decomposed the predictions of the DL model for the actual cognitive state underlying each fMRI sample, as we were solely interested in understanding what the DL model used as evidence in favor of the presence of this state. We neglected the predictions for all other states, because we were not interested in understanding what the DL model used as evidence for the presence of the other, false states (i.e., we set the predictions of the model for these states to zero). Further, we restricted the LRP analysis to those brain samples that the DL model classified correctly. Overall, we can only assume that the DL model has learned a meaningful mapping between the fMRI data and the underlying cognitive states of an individual, if it classified the fMRI volume correctly (see [52] for a more detailed discussion).
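A minimal sketch of this restriction (with model, volume and true_state as placeholders) could look like this:

```python
import numpy as np

probs = model.predict(volume[None])[0]     # softmax output for one fMRI sample (placeholder model)
if np.argmax(probs) == true_state:         # analyze correctly classified samples only
    relevance_seed = np.zeros_like(probs)
    relevance_seed[true_state] = probs[true_state]  # keep only the prediction for the true state
    # relevance_seed is then propagated back through the network with the rules above
```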

5 Results

All experiments reported in the following solely include the data of the 30 subjects from the held-out test dataset (see section 3).

Figure 3: Out-of-sample decoding performance: Comparison of the DLight and searchlight approach. A: Average group-level decoding accuracies of DLight for each cognitive state. Bar heights indicate averages with vertical lines indicating standard errors. B: Average subject-level decoding accuracies of the DLight and searchlight analysis. Subjects are indicated by crosses, whereas colored lines represent the average across subjects. C: Average group-level decoding accuracies of the DLight and searchlight analysis as a function of the number of subjects in the training dataset. Colored bands indicate standard errors. We defined decoding accuracy as the fraction of fMRI samples in the held-out test dataset that were classified correctly.

5.1 DLight Decoding Performance Superior to Searchlight

A key prerequisite for the DLight analysis is that the underlying DL model achieves reasonable performance in the decoding task (see section 2.4). Only then can we assume that the model has learned a meaningful mapping from the brain data to the cognitive states of an individual and interpret the resulting relevance patterns as informative about these states.

Overall, DLight accurately decoded the cognitive states underlying 69% of the fMRI samples in the held-out dataset (61.1%, 73.6%, 74.5%, 66.7% for body, face, place and tool respectively; Fig. 3 A).

Note that DLight’s performance in decoding the four cognitive states from the fMRI data varied over the course of an experiment block (Supplementary Fig. S2). In the early stages of the block, DLight generally did not perform as well as in the middle and later stages. This finding is generally in line with the temporal evolution of the hemodynamic response function (HRF; [53]) measured by the fMRI (the HRF is known to peak 5-10 seconds after the onset of the underlying neuronal activity).

To further evaluate DLight’s performance in decoding the cognitive states from the fMRI data, we compared it to the performance of the searchlight analysis. On the group-level (see section 3), where the searchlight classifier was trained on the first-level β-maps of the same training and test subjects that we also used for the DLight analysis, searchlight achieved an average decoding performance of 60% (DLight improved upon this performance by 15%). To also compare DLight’s decoding performance to the searchlight analysis on the subject-level, we trained a searchlight classifier on the fMRI data of the first experiment run of a subject and then compared the performance of both classifiers in decoding the four cognitive states underlying each fMRI sample in the second experiment run of the same subject. Note that DLight did not see any data of this subject in the training, as it was solely trained on the data of the remaining subjects in the training dataset. Nevertheless, DLight clearly outperformed searchlight by decoding the cognitive states more accurately for 29 out of 30 subjects (while searchlight achieved an average decoding accuracy of 47.2% across subjects, DLight improved upon this performance by 46.9%, with an average decoding accuracy of 69.3%) (Fig. 3 B).

To also test whether the observed differences in decoding performance between searchlight and DLight result from the linearity assumption of the SVM that we used in the searchlight analysis, we replicated our subject-level searchlight analysis using a non-linear radial basis function (RBF) kernel SVM [35, 54, 55] (Supplementary Fig. S3). However, the decoding accuracies achieved by the RBF-kernel SVM were not significantly different from those of the linear-kernel SVM (t(29)=-1.75, p=0.09).

5.2 DLight Decoding Performance Improves with Amount of Available Training Data

Our findings have shown that DLight outperformed the searchlight analysis in decoding the four cognitive states from the fMRI data. Yet, our previous analysis did not indicate how well the performance of both approaches improves with the amount of available training data. A key premise of the DLight approach is that it is better suited to analyze larger corpora of neuroimaging data, when compared to traditional decoding methods. To test this, we repeatedly trained both classifiers on a subset of the original group-level training data (including the data of 5, 10, 15, 20, 25, 30, 35, 40, 50, 60 and 70 subjects), and validated their performance at each iteration on the full held-out test data (Fig. 3 C). Again, the group-level searchlight was trained and evaluated on the β-maps of these subjects, while DLight used their fMRI data.

Overall, the performance of DLight in decoding the cognitive states of the fMRI samples in the test data increased as a function of the amount of available training data, whereas the performance of searchlight plateaued after being trained on the data of only five subjects. Nevertheless, searchlight outperformed DLight in decoding the cognitive states from the fMRI data when only little training data were available (here, 10 or fewer subjects). The decoding advantage of DLight, on the other hand, emerged in this comparison when the data of 50 or more subjects were available in the training dataset.

Figure 4: Group-level statistical contrast maps of a standard two-stage GLM analysis of the relevance values for each cognitive state. Colors indicate t-values. All t-values were false-discovery-rate (FDR) corrected at a rate of 0.005. Only positive t-values are visualized. All data is projected onto the FsAverage5 surface template [56].

5.3 DLight Accurately Identifies Physiologically Meaningful Brain Patterns

From our previous analyses we can conclude that DLight has learned a meaningful mapping between the fMRI data and the underlying cognitive states by demonstrating its performance in accurately decoding these states from the data.

Therefore, we next tested the ability of DLight to identify the brain areas underlying its decoding decisions by decomposing these decisions with the LRP method and analyzing the resulting group-level distribution of relevance values with a standard two-stage GLM approach (Supplementary Information 2). For simplicity, we plotted positive and negative relevance maps separately. Generally, positive relevances indicate voxels whose activity DLight interprets as evidence in favor of the presence of a cognitive state, whereas negative relevances indicate voxels whose activity DLight interprets as evidence against the presence of the cognitive state. An overview of the positive group-level maps is given in Figure 4, whereas negative group-level maps can be found in Supplementary Figure S4.
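As an illustration of the group-level stage of such an analysis, a one-sample t-test of subject-level relevance maps against 0 with FDR correction could be sketched as follows (a SciPy/statsmodels-based sketch of the second stage only, not the authors' exact implementation):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def group_level_relevance_map(subject_maps, fdr_alpha=0.005):
    """Test subject-level relevance maps against 0 and keep FDR-surviving voxels.

    subject_maps: array of shape (n_subjects, n_voxels), one relevance map per subject.
    """
    t_values, p_values = stats.ttest_1samp(subject_maps, popmean=0, axis=0)
    rejected, _, _, _ = multipletests(p_values, alpha=fdr_alpha, method="fdr_bh")
    return np.where(rejected, t_values, 0.0)  # t-values of voxels surviving FDR correction
```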

When identifying subjects’ cognitive state while viewing body stimuli, DLight assigned positive relevance to the activity of the lateral and medial occipital cortex as well as the caudate nucleus and parts of the dorsal somatosensory cortex (Fig. 4 A). For the face state, DLight assigned positive relevance to the activity of the left inferior temporal cortex, the somatosensory and motor cortex, the anterior and posterior cingulate cortex as well as the caudate nucleus (Fig. 4 B). For the place state, DLight assigned positive relevance to the activity of the parahippocampal gyrus, the medial occipital cortex, posterior cingulate cortex, parts of the prefrontal and somatosensory cortex as well as the superior and medial temporal gyrus (Fig. 4 C). Lastly, when decoding the tool state, DLight assigned positive relevance to the activity of the somatosensory and motor cortex, the lateral and medial occipital cortex, parts of the precuneus, superior and medial frontal gyrus as well as the thalamus and right inferior temporal cortex (Fig. 4 D).

In contrast, when identifying the body state, DLight generally assigned negative relevance to the activity of the left medial and right inferior frontal as well as left superior temporal gyrus (Supplementary Fig. S4 A). For the face state, DLight assigned negative relevance to the activity of the medial and lateral occipital cortex, the superior, medial and inferior frontal gyrus, as well as the ventromedial prefrontal cortex and thalamus (Supplementary Fig. S4 B). When identifying the place state, DLight assigned negative relevance to the activity of parts of the motor and somatosensory cortex, lateral occipital cortex, anterior cingulate cortex, caudate nucleus, inferior temporal cortex, precuneus and thalamus (Supplementary Fig. S4 C). For the tool state, DLight assigned negative relevance to the activity of the medial and lateral prefrontal cortex, the posterior cingulate cortex, caudate nucleus as well as the medial occipital cortex and the superior and medial temporal gyrus (Supplementary Fig. S4 D).

Figure 5: Exemplar DLight relevance maps at different levels of data granularity for the subject with the highest decoding accuracy. A: Subject-level statistical contrast maps of the DLight approach (for further details, see Supplementary Information 3). Colors indicate t-values. B: Statistical maps of the first experiment trial of the first run for each stimulus class. Colors indicate t-values, testing the distribution of relevance values across the sequence of brain samples in this trial against 0. C: Relevance maps for an exemplar brain sample (i.e., TR) of each cognitive state. All data is projected onto the FsAverage5 surface template [56].

The networks identified by DLight overlap in large parts with those identified by a standard two-stage GLM analysis of the same fMRI data (Supplementary Fig. S5) as well as those identified by a meta-analysis with the Neurosynth database (Supplementary Information 4 & Supplementary Fig. S6): Both the group-level GLM and the meta-analysis also associated the lateral occipital cortex with the body state (Supplementary Fig. S5 A and S6 A), the parahippocampal gyrus with the place state (Supplementary Fig. S5 C and S6 C) and the lateral occipital cortex with the tool state (Supplementary Fig. S5 D and S6 D). While the meta-analysis and DLight associated the activity of the inferior temporal cortex with the face state (Supplementary Fig. S6 B), the group-level GLM did not associate these areas with this cognitive state.

Overall, DLight has identified a set of physiologically meaningful brain areas that are in line with those identified by traditional means of analyzing neuroimaging data.

5.4 DLight Identifies Physiologically Meaningful Brain Patterns on Different Levels of Data Granularity

A distinguishing feature of DLight, when compared to conventional approaches to analyzing neuroimaging data, is its ability to study the association between brain activity and individuals’ cognitive state on the level of single time points (i.e., TRs). This enables the study of this association on different levels of data granularity, from single brain samples up to the level of a group of subjects.

This ability is exemplified in Figure 5, where the distribution of relevance values for the four cognitive states is visualized on three different levels of data granularity of an exemplar subject (namely, the subject with the highest decoding accuracy in Fig. 3 B): First, on the level of the overall distribution of relevance values of each cognitive state of this subject (Fig. 5 A; incorporating an average of 123 TRs per cognitive state), then on the level of the first experiment trial of each cognitive state in the first experiment run (Fig. 5 B; incorporating an average of 31 TRs per cognitive state) and lastly on the level of a single brain sample of each cognitive state (Fig. 5 C; incorporating a single TR per cognitive state).

On all three levels, DLight utilized a specific set of brain regions to identify each of the four cognitive states. Importantly, these regions generally overlap with those identified by the DLight group-level analysis (Fig. 4): For the body state, DLight assigned positive relevance to the activity of the lateral occipital cortex of this subject, while it generally assigned positive relevance to the activity of the somatosensory and motor cortex for the face state, the parahippocampal gyrus and occipital cortex for the place state and the lateral occipital cortex for the tool state. Interestingly, for the face state, DLight did not assign high positive relevance to the motor and somatosensory cortex on the level of the exemplified time point, but rather to the inferior temporal cortex. However, the inferior temporal cortex is well-known to be associated with the processing of face images [57, 58]. This illustrates DLight’s ability to holistically integrate spatially distributed brain activity by identifying a cognitive state even through varying brain activity patterns.

Figure 6: Distribution of relevance values over the course of a single experiment block for the face and place stimulus classes. Each volume depicts the results of a t-test over the relevance data of all subjects in the first task block of the first experiment run. For each plotted volume, we grouped the relevance data over two consecutive TRs and tested the resulting distribution of relevance values for each voxel against 0. Colors indicate the t-values of this test. All t-values were FDR-corrected at a rate of 0.05. Only positive t-values are visualized. All data is projected onto the FsAverage5 surface template [56].

5.5 Temporal Distribution of Relevance Values Mimics Hemodynamic Response

To further test DLight’s ability to analyze single fMRI volumes, we next studied the distribution of relevance values over the course of a single experiment block (Fig. 6). In particular, we plotted this distribution as a function of the fMRI sampling-time (i.e., TR) over all subjects for the first block of a stimulus class in the first experiment run. We restricted this analysis to the face and place stimulus classes, as the neural networks involved in processing face and place stimuli, respectively, have been widely characterized [57, 58]. An overview of the different brain networks is given in Figure S6 B-C, where the results of a meta analysis with the Neurosynth database are visualized (see Supplementary Information 4 for methodological details). Again, we separated positive (Fig. 6) and negative (Supplementary Fig. S7) relevance maps.

Interestingly, the relevance patterns mimic the hemodynamic response of the fMRI, with a peak in positive relevance in the fusiform face area (FFA) and parahippocampal place area (PPA) [57, 58] between 6-12 seconds after the onset of the experiment trial. Conversely, negative relevances peaked in the PPA when decoding face stimuli (Supplementary Fig. S7 A) and in the FFA when decoding place stimuli (Supplementary Fig. S7 B), indicating that DLight has learned to distinguish these two neural networks.

6 Discussion

Summary

Here, we have provided an extensive proof of concept of the DLight framework for the analysis of whole-brain neuroimaging data by the application and interpretation of LSTM-based DL models. In particular, DLight processes a whole-brain fMRI volume by first separating it into a sequence of axial slices. These slices are sequentially processed by an LSTM-based DL model to decode the underlying cognitive state of an individual. DLight then utilizes the LRP method to decompose the decoding decision into the contributions of the input voxels to this decision, enabling an analysis of the association between fMRI data and cognitive state.

To demonstrate the power of DLight, we applied it to an openly available fMRI dataset of 100 subjects viewing images of body parts, faces, places or tools. We have shown that DLight 1) decodes the cognitive states of the individuals more accurately from the fMRI data than the searchlight analysis, 2) scales well in its ability to decode these states with the amount of available training data, 3) is able to identify a set of physiologically appropriate brain regions associated with these cognitive states, and 4) is able to identify these brain regions on different levels of data granularity, down to the level of single time points.

DLight Decoding Performance Superior to Searchlight

We have shown that DLight’s performance in decoding the cognitive states from the fMRI data is superior to the searchlight analysis (both, across subjects and within their individual data). Yet, it is unclear which feature of DLight is driving this decoding advantage. Generally, there are two major differences between both approaches: 1) DLight is based on a non-linear decoding model, whereas the searchlight analysis that we implemented used a linear-kernel SVM. 2) DLight utilizes whole-brain fMRI data, whereas searchlight is restricted to local clusters of brain activity only.

To answer this question, we replicated our SVM searchlight analyses with a non-linear RBF-kernel SVM (Supplementary Fig. S3). However, the performance of the RBF-kernel SVM did not significantly improve upon the performance of the linear-kernel SVM. The advantage of DLight in decoding the different cognitive states from the fMRI data therefore does not solely result from the non-linear decoding model, but must result from the conjunction of the non-linearity with the ability to process whole-brain fMRI data.

DLight Decoding Performance Improves with Amount of Available Training Data

Our analyses also demonstrated that the decoding performance of DLight improves well with the amount of available training data (Fig. 3 C). The performance of searchlight, on the other hand, leveled out after being trained on the data of only a few subjects (Fig. 3 C). Nevertheless, searchlight performed better in decoding the cognitive states when only little data were available (here, the data of 10 or fewer subjects in the training dataset). The decoding performance of DLight is therefore only superior when larger corpora of data are available.

However, in this comparison the decoding problem of the group-level searchlight was generally simpler than that of DLight, as it was trained and evaluated on fewer samples, where each sample represented a denoised aggregate of the actual underlying fMRI data, which was used to train and evaluate DLight. We assume that a classifier generally performs better when discriminating few samples that have a high signal-to-noise ratio than when discriminating many samples with a low ratio of signal-to-noise. This assumption finds further support when comparing both decoding approaches on the subject-level fMRI data, where they were tested on the exact same data (see Fig. 3 B). In this comparison, DLight again clearly outperformed searchlight by decoding the cognitive states more accurately from the fMRI data of 29 out of 30 subjects.

DLight Accurately Identifies Physiologically Meaningful Brain Patterns

In addition to DLight’s ability to accurately decode the four cognitive states from the neuroimaging data, it also successfully identified a set of physiologically meaningful brain regions underlying its decoding decisions (Fig. 4 and Supplementary Fig. S4). These networks overlap in large parts with those identified by a traditional two-stage GLM analysis of the same fMRI data (Supplementary Fig. S5) and the results of a meta analysis with the Neurosynth database (Supplementary Fig. S6). Interestingly, the networks identified by DLight have a much wider extent, incorporating an array of additional brain areas.

In contrast to traditional means of analyzing neuroimaging data, which often incorporate strict local independence assumptions by analyzing single groups of voxels independent of one another, DLight is able to integrate spatially distributed whole-brain activity. Thereby, DLight is able to identify widespread networks of brain activity, which might not be identified by a traditional GLM or searchlight analysis. For example, if the activity pattern of a cluster of voxels is by itself not informative about the cognitive state of an individual, but solely in conjunction with the activity pattern of another, distinctly located cluster, this dependency will not be identified by a GLM analysis. DLight, on the other hand, is able to identify the association between these clusters and the underlying cognitive state of the individual, due to its ability to integrate the spatial dependencies between and within axial brain slices.

However, when comparing the brain maps of DLight with those of more conventional means of analyzing neuroimaging data, it is important to keep in mind that these maps do not directly reflect the data, but the interpretation of these data through the underlying DL model. In particular, they reflect the evidence that the DL model sees in the activity pattern of a cluster of voxels for the presence of a cognitive state. For example, if a cluster of voxels is consistently inactive while an individual is in cognitive state A, but active while the individual is in another cognitive state B, this activity pattern will result in negative β-coefficients for these voxels in a traditional GLM contrast analysis. DLight, on the other hand, might interpret the absence of voxel activity in cognitive state A as evidence in favor of the presence of this state (as systematic inactivity also conveys evidence about the presence of a cognitive state) and thereby assign positive relevances to these voxels when decoding cognitive state A.

DLight Identifies Brain Patterns on Different Levels of Data Granularity

Our analyses have demonstrated that DLight accurately detects physiologically meaningful patterns of brain activity, which are associated with the four cognitive states of our data across different levels of data granularity (Fig. 4 - 5): From the level of a group of subjects down to the level of single time points. This feature clearly discriminates DLight from other approaches to analyzing neuroimaging data, which are mostly restricted to aggregating group- or subject-level data.

Importantly, we found that the distribution of relevance values generated by DLight is not necessarily constant across the granularity levels within a cognitive state, but exhibits variability. For example, DLight was able to identify the cognitive state underlying the perception of face stimuli through both activity in the inferior temporal cortex and activity in the motor and somatosensory cortex (Fig. 5).

Even further, when studying the temporal distribution of relevance values across an experiment block, we found that this distribution exhibits strong variability and mimics the hemodynamic fluctuation of brain activity observed by the fMRI (Fig. 6). This, however, does not indicate that DLight has learned the functional form of the hemodynamic response, but rather that it sees the most evidence for the presence of a cognitive state in the activity of a cluster of voxels when their hemodynamic response is strongest.

Overall, these findings demonstrate that DLight has gained a holistic understanding of the spatial distribution of brain activity, as it does not persistently assign the same relevance to the same brain regions when decoding a cognitive state, but shifts its relevance weighting depending on the currently observed pattern of brain activity.

7 Conclusion

Neuroimaging data have a complex spatio-temporal dependency structure that renders modeling and decoding of experimental data a challenging endeavor. With DLight, we propose a new computational framework for the analysis and interpretation of whole-brain neuroimaging data that scales well to large datasets and is mathematically non-linear, while still maintaining interpretability of the data. Moreover, it uses LSTMs to harness the spatial dependency structure of the whole brain and thereby outperforms conventional decoding approaches in decoding cognitive states from large neuroimaging datasets. We have demonstrated that DLight is a versatile tool that can be used to identify associations between the brain activity of individuals and their underlying cognitive states, by the use of the LRP method. Importantly, DLight is able to identify these associations not just on the group- or subject-level, but also on the level of single trials and time points.

Thereby, DLight advances conventional methods for the analysis of whole-brain neuroimaging data by providing a new, scalable and interpretable framework that can be applied to large whole-brain datasets at different levels of data granularity. We hope that DLight will enable future studies to identify the holistic associations between individuals’ brain activity and their cognitive states, for example, to identify the complex patterns of brain activity underlying many psychiatric disorders (such as depression [59] or schizophrenia [60]).

Acknowledgements

This work was supported by the Fraunhofer Society (through the MPI-FhG collaboration project "Theory & Practice for Reduced Learning Machines"), the German Ministry for Education and Research (through the Berlin Big Data Center under Grant 01IS14013A and the Berlin Center for Machine Learning under Grant 01IS180371), as well as by the Institute for Information & Communications Technology Promotion funded by the Korea government (MSIT) (No. 2017-0-00451, No. 2017-0-01779).

References

  • [1] D. C. Van Essen, S. M. Smith, D. M. Barch, T. E. Behrens, E. Yacoub, K. Ugurbil, W.-M. H. Consortium, et al., “The wu-minn human connectome project: an overview,” NeuroImage, vol. 80, pp. 62–79, 2013.
  • [2] R. A. Poldrack, D. M. Barch, J. Mitchell, T. Wager, A. D. Wagner, J. T. Devlin, C. Cumba, O. Koyejo, and M. Milham, “Toward open sharing of task-based fmri data: the openfmri project,” Frontiers in Neuroinformatics, vol. 7, p. 12, 2013.
  • [3] K. J. Friston, A. P. Holmes, K. J. Worsley, J.-P. Poline, C. D. Frith, and R. S. Frackowiak, “Statistical parametric maps in functional imaging: a general linear approach,” Human Brain Mapping, vol. 2, no. 4, pp. 189–210, 1994.
  • [4] N. Kriegeskorte, R. Goebel, and P. Bandettini, “Information-based functional brain mapping,” Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 10, pp. 3863–3868, 2006.
  • [5] M. Misaki, Y. Kim, P. A. Bandettini, and N. Kriegeskorte, “Comparison of multivariate classifiers and response normalizations for pattern-information fmri,” Neuroimage, vol. 53, no. 1, pp. 103–118, 2010.
  • [6] A. G. Huth, S. Nishimoto, A. T. Vu, and J. L. Gallant, “A continuous semantic space describes the representation of thousands of object and action categories across the human brain,” Neuron, vol. 76, no. 6, pp. 1210–1224, 2012.
  • [7] A. G. Huth, W. A. de Heer, T. L. Griffiths, F. E. Theunissen, and J. L. Gallant, “Natural speech reveals the semantic maps that tile human cerebral cortex,” Nature, vol. 532, no. 7600, p. 453, 2016.
  • [8] G. Schohn and D. Cohn, “Less is more: Active learning with support vector machines,” in ICML, pp. 839–846, Citeseer, 2000.
  • [9] S. J. Hanson, T. Matsuka, and J. V. Haxby, “Combinatorial codes in ventral temporal lobe for object recognition: Haxby (2001) revisited: is there a “face” area?,” Neuroimage, vol. 23, no. 1, pp. 156–166, 2004.
  • [10] A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux, “Extracting universal representations of cognition across brain-imaging studies,” arXiv preprint arXiv:1809.06035, 2018.
  • [11] A. Mensch, J. Mairal, D. Bzdok, B. Thirion, and G. Varoquaux, “Learning neural representations of human cognition across many fmri studies,” in Advances in Neural Information Processing Systems, pp. 5883–5893, 2017.
  • [12] S. Aydore, B. Thirion, O. Grisel, and G. Varoquaux, “Using feature grouping as a stochastic regularizer for high-dimensional noisy data,” arXiv preprint arXiv:1807.11718, 2018.
  • [13] E. D. Dohmatob, A. Gramfort, B. Thirion, and G. Varoquaux, “Benchmarking solvers for tv-l1 least-squares and logistic regression in brain imaging,” in Pattern Recognition in Neuroimaging, 2014 International Workshop on, pp. 1–4, IEEE, 2014.
  • [14] I. Rish, B. Thyreau, B. Thirion, M. Plaze, M.-l. Paillere-martinot, C. Martelli, J.-l. Martinot, J.-b. Poline, and G. A. Cecchi, “Discriminative network models of schizophrenia,” in Advances in Neural Information Processing Systems, pp. 252–260, 2009.
  • [15] D. Bzdok, M. Eickenberg, O. Grisel, B. Thirion, and G. Varoquaux, “Semi-supervised factored logistic regression for high-dimensional neuroimaging data,” in Advances in neural information processing systems, pp. 3348–3356, 2015.
  • [16] J. Loula, G. Varoquaux, and B. Thirion, “Decoding fmri activity in the time domain improves classification performance,” NeuroImage, 2017.
  • [17] W. Hu, B. Cai, V. Calhoun, and Y.-P. Wang, “Multi-modal brain connectivity study using deep collaborative learning,” in Graphs in Biomedical Image Analysis and Integrating Medical Imaging and Non-Imaging Modalities, pp. 66–73, Springer, 2018.
  • [18] H.-I. Suk, S.-W. Lee, D. Shen, A. D. N. Initiative, et al., “Hierarchical feature representation and multimodal fusion with deep learning for ad/mci diagnosis,” NeuroImage, vol. 101, pp. 569–582, 2014.
  • [19] D. Zhang, D. Shen, A. D. N. Initiative, et al., “Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in alzheimer’s disease,” NeuroImage, vol. 59, no. 2, pp. 895–907, 2012.
  • [20] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
  • [21] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning, vol. 1. MIT press Cambridge, 2016.
  • [22] S. M. Plis, D. R. Hjelm, R. Salakhutdinov, E. A. Allen, H. J. Bockholt, J. D. Long, H. J. Johnson, J. S. Paulsen, J. A. Turner, and V. D. Calhoun, “Deep learning for neuroimaging: a validation study,” Frontiers in neuroscience, vol. 8, p. 229, 2014.
  • [23] M. Yousefnezhad and D. Zhang, “Anatomical pattern analysis for decoding visual stimuli in human brains,” Cognitive Computation, vol. 10, no. 2, pp. 284–295, 2018.
  • [24] S. Sarraf and G. Tofighi, “Classification of alzheimer’s disease using fmri data and deep learning convolutional neural networks,” arXiv preprint arXiv:1603.08631, 2016.
  • [25] D. Nie, H. Zhang, E. Adeli, L. Liu, and D. Shen, “3d deep learning for multi-modal imaging-guided survival time prediction of brain tumor patients,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 212–220, Springer, 2016.
  • [26] D. Petrov, B. A. Kuznetsov, T. G. van Erp, J. A. Turner, L. Schmaal, D. Veltman, L. Wang, K. Alpert, D. Isaev, A. Zavaliangos-Petropulu, et al., “Deep learning for quality control of subcortical brain 3d shape models,” arXiv preprint arXiv:1808.10315, 2018.
  • [27] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [28] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634, 2015.
  • [29] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PLOS ONE, vol. 10, no. 7, p. e0130140, 2015.
  • [30] S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek, “The layer-wise relevance propagation toolbox for artificial neural networks,” Journal of Machine Learning Research, vol. 17, no. 114, pp. 1–5, 2016.
  • [31] K. Uğurbil, J. Xu, E. J. Auerbach, S. Moeller, A. T. Vu, J. M. Duarte-Carvajalino, C. Lenglet, X. Wu, S. Schmitter, P. F. Van de Moortele, et al., “Pushing spatial and temporal resolution for functional and diffusion mri in the human connectome project,” NeuroImage, vol. 80, pp. 80–104, 2013.
  • [32] M. F. Glasser, S. N. Sotiropoulos, J. A. Wilson, T. S. Coalson, B. Fischl, J. L. Andersson, J. Xu, S. Jbabdi, M. Webster, J. R. Polimeni, et al., “The minimal preprocessing pipelines for the human connectome project,” NeuroImage, vol. 80, pp. 105–124, 2013.
  • [33] A. Abraham, F. Pedregosa, M. Eickenberg, P. Gervais, A. Mueller, J. Kossaifi, A. Gramfort, B. Thirion, and G. Varoquaux, “Machine learning for neuroimaging with scikit-learn,” Frontiers in Neuroinformatics, vol. 8, p. 14, 2014.
  • [34] K. A. Norman, S. M. Polyn, G. J. Detre, and J. V. Haxby, “Beyond mind-reading: multi-voxel pattern analysis of fmri data,” Trends in Cognitive Sciences, vol. 10, no. 9, pp. 424–430, 2006.
  • [35] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
  • [36] S. M. Helfinstein, T. Schonberg, E. Congdon, K. H. Karlsgodt, J. A. Mumford, F. W. Sabb, T. D. Cannon, E. D. London, R. M. Bilder, and R. A. Poldrack, “Predicting risky choices from brain activity patterns,” Proceedings of the National Academy of Sciences of the United States of America, vol. 111, no. 7, pp. 2470–2475, 2014.
  • [37] N. W. Schuck, M. B. Cai, R. C. Wilson, and Y. Niv, “Human orbitofrontal cortex represents a cognitive map of state space,” Neuron, vol. 91, no. 6, pp. 1402–1412, 2016.
  • [38] M. Weygandt, A. Schaefer, A. Schienle, and J.-D. Haynes, “Diagnosing different binge-eating disorders based on reward-related brain activation patterns,” Human brain mapping, vol. 33, no. 9, pp. 2135–2146, 2012.
  • [39] Y. LeCun, Y. Bengio, et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
  • [40] N. McLaughlin, J. Martinez del Rincon, and P. Miller, “Recurrent convolutional network for video-based person re-identification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1325–1334, 2016.
  • [41] A. Marban, V. Srinivasan, W. Samek, J. Fernández, and A. Casals, “A recurrent convolutional neural network approach for sensorless force estimation in robotic surgery,” arXiv preprint arXiv:1805.08545, 2018.
  • [42] H. Jang, S. M. Plis, V. D. Calhoun, and J.-H. Lee, “Task-specific feature extraction and classification of fmri volumes using a deep neural network initialized with a deep belief network: Evaluation using sensorimotor tasks,” NeuroImage, vol. 145, pp. 314–328, 2017.
  • [43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.
  • [44] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning.,” in OSDI, vol. 16, pp. 265–283, 2016.
  • [45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [46] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning, pp. 1310–1318, 2013.
  • [47] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256, 2010.
  • [48] G. Montavon, S. Bach, A. Binder, W. Samek, and K.-R. Müller, “Explaining nonlinear classification decisions with deep taylor decomposition,” Pattern Recognition, vol. 65, pp. 211–222, 2017.
  • [49] I. Sturm, S. Lapuschkin, W. Samek, and K.-R. Müller, “Interpretable deep neural networks for single-trial eeg classification,” Journal of neuroscience methods, vol. 274, pp. 141–145, 2016.
  • [50] L. Arras, G. Montavon, K.-R. Müller, and W. Samek, “Explaining recurrent neural network predictions in sentiment analysis,” in Proceedings of the EMNLP’17 Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA), pp. 159–168, Association for Computational Linguistics, 2017.
  • [51] W. Samek, T. Wiegand, and K.-R. Müller, “Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models,” ITU Journal: ICT Discoveries - Special Issue 1 - The Impact of Artificial Intelligence (AI) on Communication Networks and Services, vol. 1, no. 1, pp. 39–48, 2018.
  • [52] G. Montavon, W. Samek, and K.-R. Müller, “Methods for interpreting and understanding deep neural networks,” Digital Signal Processing, vol. 73, pp. 1–15, 2018.
  • [53] M. A. Lindquist, J. M. Loh, L. Y. Atlas, and T. D. Wager, “Modeling the hemodynamic response function in fmri: efficiency, bias and mis-modeling,” Neuroimage, vol. 45, no. 1, pp. S187–S198, 2009.
  • [54] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction to kernel-based learning algorithms,” IEEE transactions on neural networks, vol. 12, no. 2, pp. 181–201, 2001.
  • [55] B. Schölkopf and A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
  • [56] B. Fischl, “Freesurfer,” Neuroimage, vol. 62, no. 2, pp. 774–781, 2012.
  • [57] J. V. Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini, “Distributed and overlapping representations of faces and objects in ventral temporal cortex,” Science, vol. 293, no. 5539, pp. 2425–2430, 2001.
  • [58] H. R. Heekeren, S. Marrett, P. A. Bandettini, and L. G. Ungerleider, “A general mechanism for perceptual decision-making in the human brain,” Nature, vol. 431, no. 7010, p. 859, 2004.
  • [59] J. Zhang, J. Wang, Q. Wu, W. Kuang, X. Huang, Y. He, and Q. Gong, “Disrupted brain connectivity networks in drug-naive, first-episode major depressive disorder,” Biological psychiatry, vol. 70, no. 4, pp. 334–342, 2011.
  • [60] B. Rashid, M. R. Arbabshirani, E. Damaraju, M. S. Cetin, R. Miller, G. D. Pearlson, and V. D. Calhoun, “Classification of schizophrenia and bipolar patients using static and dynamic resting-state fmri brain connectivity,” NeuroImage, vol. 134, pp. 645–657, 2016.
  • [61] G. H. Glover, “Deconvolution of impulse response in event-related bold fmri,” NeuroImage, vol. 9, no. 4, pp. 416–429, 1999.
  • [62] K. Gorgolewski, C. D. Burns, C. Madison, D. Clark, Y. O. Halchenko, M. L. Waskom, and S. S. Ghosh, “Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python,” Frontiers in Neuroinformatics, vol. 5, p. 13, 2011.
  • [63] A. Holmes and K. Friston, “Generalisability, random effects & population inference,” NeuroImage, vol. 7, p. S754, 1998.
  • [64] T. Yarkoni, R. A. Poldrack, T. E. Nichols, D. C. Van Essen, and T. D. Wager, “Large-scale automated synthesis of human functional neuroimaging data,” Nature Methods, vol. 8, no. 8, p. 665, 2011.

Supplementary Information

1. Outer Brain Mask

For each functional volume, we identified those voxels whose activity exceeded 5% of the maximum voxel intensity within that volume. We then kept for further analysis only those voxels positioned between the first and last voxel fulfilling this property along each of the three spatial dimensions of any functional brain volume in our dataset. This resulted in a brain mask spanning 74 x 92 x 81 voxels (X x Y x Z). Next, we split the data of each volume into 81 slices along the axial dimension (i.e., from bottom to top), resulting in a stack of 81 images comprising 74 x 92 voxels each.
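A minimal NumPy sketch of this masking procedure is given below; the array shape, the function name, and the exact bounding-box logic are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def outer_brain_mask(volumes, threshold=0.05):
    """Sketch of the outer brain mask described above.

    `volumes` is assumed to be a 4D array of shape (X, Y, Z, T)
    holding all functional volumes of the dataset.
    """
    # Voxels whose intensity exceeds 5% of the maximum intensity
    # within the same volume (i.e., the same time point).
    vol_max = volumes.max(axis=(0, 1, 2), keepdims=True)   # (1, 1, 1, T)
    above = (volumes > threshold * vol_max).any(axis=-1)   # (X, Y, Z)

    # Keep everything between the first and last voxel fulfilling the
    # criterion along each of the three spatial dimensions.
    coords = np.argwhere(above)
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    mask = np.zeros_like(above)
    mask[lo[0]:hi[0] + 1, lo[1]:hi[1] + 1, lo[2]:hi[2] + 1] = True
    return mask   # spans, e.g., 74 x 92 x 81 voxels for this dataset
```

Splitting a masked volume into its 81 axial slices then amounts to indexing along the last spatial axis, e.g., `slices = [vol[:, :, z] for z in range(vol.shape[2])]`.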

2. General Linear Model Analysis

Traditionally, first-level (i.e. subject-level) fMRI data are analyzed by the use of a fixed-effects GLM [3] of the following form:

$$ Y = X\beta + \epsilon \qquad (16) $$

Here, $Y$ represents a $T \times V$ dimensional matrix containing the multivariate time series data of $V$ voxels and $T$ time points. $X$ represents the design matrix, which is composed of $T \times P$ real-valued data points, where each column represents one of $P$ predictors. $\beta$ represents a $P \times V$ dimensional matrix of regression coefficients quantifying the estimated contribution of each predictor to the time series signal of each of the $V$ voxels, and $\epsilon$ represents a $T \times V$ dimensional matrix of error terms.
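As a point of reference, the following sketch estimates the coefficients of equation 16 by ordinary least squares with NumPy; the shapes are the ones defined above, and the autoregressive noise model used in the actual analyses (see below) is omitted for brevity.

```python
import numpy as np

def fit_glm_ols(Y, X):
    """Ordinary least-squares estimate of equation (16).

    Y: (T, V) voxel time series, X: (T, P) design matrix.
    """
    beta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)   # (P, V) coefficients
    eps = Y - X @ beta                                   # (T, V) error terms
    return beta, eps
```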

In our analyses, we included one predictor for each of the 4 experiment conditions in the design matrix (each representing a boxcar function for the occurrences of the respective stimulus condition). As previously noted, we aggregated the data over the 2-back and 0-back task conditions, as differences between these conditions were not of interest for our analysis. Subsequently, we convolved these predictors with a canonical hemodynamic response function (HRF) [61], as implemented in NiPy 0.4.1 [62], to generate the main model predictors. Further, we added temporal derivative terms derived from each predictor, an intercept, as well as an indicator variable of the session index to the design matrix, all of which we treated as confounds of no interest. The derivative terms were computed according to the cosine drift model as implemented in NiPy 0.4.1 [62].
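The snippet below illustrates how such a task predictor can be constructed, assuming a standard double-gamma HRF as a stand-in for NiPy's canonical HRF; onsets, durations, and the repetition time are placeholders.

```python
import numpy as np
from scipy.stats import gamma

def double_gamma_hrf(tr, duration=32.0):
    """Simple double-gamma HRF, sampled at the repetition time."""
    t = np.arange(0.0, duration, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0   # response minus undershoot
    return hrf / hrf.sum()

def task_predictor(onsets, durations, n_scans, tr):
    """Boxcar function for one condition, convolved with the HRF."""
    frame_times = np.arange(n_scans) * tr
    boxcar = np.zeros(n_scans)
    for onset, dur in zip(onsets, durations):
        boxcar[(frame_times >= onset) & (frame_times < onset + dur)] = 1.0
    return np.convolve(boxcar, double_gamma_hrf(tr))[:n_scans]
```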

Subsequently, $Y$ was smoothed with a 6mm FWHM Gaussian kernel, and $Y$ and $X$ were temporally filtered with a Butterworth high-pass filter with a cutoff of 128s, as implemented in Nilearn 0.4.1 [33].
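A hedged sketch of this step using Nilearn is shown below; the function name, the repetition time, and the input shapes are assumptions, and the actual analyses relied on the Nilearn 0.4.1 implementations cited above.

```python
from nilearn.image import smooth_img
from nilearn.signal import clean

def smooth_and_filter(bold_img, Y, X, t_r=0.72):
    """Spatial smoothing of the data and temporal filtering of Y and X.

    bold_img: 4D NIfTI image, Y: (T, V) voxel time series,
    X: (T, P) design matrix; t_r is a placeholder repetition time.
    """
    smoothed_img = smooth_img(bold_img, fwhm=6.0)   # 6 mm FWHM Gaussian kernel
    # Butterworth high-pass filter with a 128 s cutoff (1/128 Hz).
    Y_filt = clean(Y, high_pass=1.0 / 128, t_r=t_r,
                   detrend=False, standardize=False)
    X_filt = clean(X, high_pass=1.0 / 128, t_r=t_r,
                   detrend=False, standardize=False)
    return smoothed_img, Y_filt, X_filt
```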

$\beta$-coefficients and error terms were then estimated by the use of a first-level autoregressive model, as implemented in NiPy 0.4.1 [62]. Subsequently, we computed first-level linear contrasts for these predictors, each representing a contrast between one of the stimulus classes and all others.
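For illustration, a one-versus-all contrast vector of this kind could be built as follows; the predictor ordering and the zero weights for confound columns are assumptions.

```python
import numpy as np

def one_vs_all_contrast(class_idx, n_conditions=4, n_confounds=0):
    """Contrast weighting one stimulus class against the mean of the others."""
    c = np.full(n_conditions, -1.0 / (n_conditions - 1))
    c[class_idx] = 1.0
    return np.concatenate([c, np.zeros(n_confounds)])

# Voxel-wise contrast estimates: c^T beta, with beta of shape (P, V).
# contrast_map = one_vs_all_contrast(0, 4, n_confounds=6) @ beta
```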

In addition to these first-level subject contrasts, we also computed second-level contrast estimates by the use of the standard two-stage procedure for a random-effects group-level analysis as proposed by Holmes & Friston [63]. Here, the linear contrasts of the estimated regression coefficients of each subject are treated as random effects in a second-level analysis, where their distribution is assessed by the use of a one-sample t-test.
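Conceptually, this second stage reduces to a voxel-wise one-sample t-test over the subjects' first-level contrast estimates, as in the following sketch (the array shape is an assumption):

```python
import numpy as np
from scipy import stats

def second_level_ttest(subject_contrasts):
    """Random-effects group analysis: test subject contrasts against zero.

    subject_contrasts: array of shape (n_subjects, n_voxels).
    """
    t_vals, p_vals = stats.ttest_1samp(subject_contrasts, popmean=0.0, axis=0)
    return t_vals, p_vals
```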

Importantly, we deviated from this analysis approach for the analysis of the DLight relevance data, as obtained according to equations 12-15. For DLight, we estimated the first-level GLM coefficients without any temporal derivatives and without convolving the predictors with a canonical HRF, as we did not assume any specific temporal dependencies of these data. However, we did include an additional regressor in the first-level GLM analyses indicating the overall sum of relevance values for each individual fMRI volume. This is necessary, as the sum of relevance values contained in each volume depends on the output of the DL model; the resulting variability in relevance values across fMRI volumes therefore needs to be accounted for when estimating the first-level GLM coefficients. Further, we did not compute linear first-level contrasts between each condition and all others (as for the fMRI data), but solely contrasts between each coefficient and 0, as we were not interested in testing whether DLight assigns more relevance to the activity of one brain area in one condition than in all others, but solely in identifying the overall network of brain areas that DLight utilized to identify a stimulus class. Lastly, we smoothed the relevance data with a 3mm FWHM Gaussian kernel (instead of 6mm as for the fMRI data, since the fMRI data used to compute the relevance maps had already been smoothed with a 3mm FWHM Gaussian kernel).
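A minimal sketch of the additional regressor, assuming the relevance data are arranged as a (T x V) matrix analogous to the fMRI time series:

```python
import numpy as np

def add_relevance_sum_regressor(X, R):
    """Append the per-volume sum of relevance values to the design matrix.

    X: (T, P) design matrix (no HRF convolution, no derivatives),
    R: (T, V) matrix of voxel-wise relevance values.
    """
    total_relevance = R.sum(axis=1, keepdims=True)   # (T, 1)
    return np.hstack([X, total_relevance])           # (T, P + 1)
```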

3. Support Vector Machine

Given a training dataset of $N$ data points $\{(x_k, y_k)\}_{k=1}^{N}$, where $x_k$ is the real-valued input pattern and $y_k$ the output pattern of the $k$-th data point, the SVM [35] generally represents the following classifier:

$$ y(x) = \mathrm{sign}\!\left[\sum_{k=1}^{N} \alpha_k y_k K(x, x_k) + b\right] \qquad (17) $$

Here, $\alpha_k$ are positive real constants and $b$ is a real constant, whereas $K(x, x_k)$ represents the kernel function. We used a linear kernel function, as implemented in Nilearn 0.4.1 [33]:

$$ K(x, x_k) = x^\top x_k \qquad (18) $$
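The decision rule in equations 17-18 can be reproduced with a linear-kernel SVM from scikit-learn (which Nilearn builds on); the data below are random placeholders and only serve to show how the fitted dual coefficients map onto equation 17.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.randn(100, 500)                  # 100 patterns, 500 voxels
y_train = np.where(rng.randn(100) > 0, 1, -1)  # labels in {-1, +1}

clf = SVC(kernel="linear").fit(X_train, y_train)

# Equation (17): dual_coef_ stores alpha_k * y_k for the support vectors,
# and the linear kernel of equation (18) is a plain dot product.
x_new = rng.randn(500)
score = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_
prediction = np.sign(score)                    # matches clf.predict([x_new])
```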

4. Neurosynth

The goal of Neurosynth [64] is to provide an automated meta-analysis database relating cognitive states and brain activity. For a specific cognitive state (e.g., “pain”), the Neurosynth database incorporates a large record of neuroimaging studies that used this term at a high frequency (>1 in 1000 words) in the article text. For these studies, the database includes the activation coordinates from all tables reported in these studies, producing a large set of term-to-activation mappings. Based on these mappings, the Neurosynth database provides two types of activation maps: a forward-inference map, indicating the probability that a specific activation occurred given the presence of a specific term in the article (P(activation|term)), as well as a reverse-inference map, indicating the probability that a specific term occurred in the article given a specific brain activation (P(term|activation)).
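The two map types are related through Bayes' rule, which the following toy computation illustrates; all probabilities are made-up numbers, not values from the Neurosynth database.

```python
# Toy illustration of reverse inference via Bayes' rule.
p_act_given_term = 0.30   # forward inference, P(activation | term)
p_term = 0.05             # base rate of the term across studies
p_act = 0.10              # base rate of activation at this voxel

p_term_given_act = p_act_given_term * p_term / p_act
print(p_term_given_act)   # 0.15 = P(term | activation)
```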

For our analyses, we used the latter, reverse-inference maps, as recommended by the Neurosynth authors [64], and extracted the thresholded (FDR corrected) reverse-inference maps for the four stimulus classes (indicated by the terms “body”, “face”, “place” and “tools”; Supplementary Fig. S6). These maps indicate a z-value for the reverse-inference probability of each coordinate in MNI space.

Figure S1: Training statistics as a function of the training epochs. Red indicates the test data, whereas green indicates the training dataset. An epoch is defined as a full iteration over the training data. We define decoding accuracy as the fraction of samples that were classified correctly.
Figure S2: Average decoding accuracy of DLight over the course of an experiment block. The red line indicates the average across subjects with standard errors surrounding it.
Figure S3: Out-of-sample decoding performance of the searchlight approach with a non-linear RBF-kernel SVM on the individual subject level. As before, we trained the searchlight on the data of the first experiment run of each subject, before predicting the cognitive state for each volume of the second experiment run. We performed this prediction exercise only within the data of the subjects in the held-out test dataset. We fixed the searchlight radius to 5.6mm, while we set the kernel parameter of the RBF kernel to 1 across all subjects. Left: Comparison of the RBF-searchlight decoding performance with the DLight approach. Right: Comparison of the RBF-searchlight with our previous searchlight implementation that utilizes a linear-kernel SVM. Each cross represents an individual subject. Colored lines indicate averages across subjects.
Figure S4: Group-level statistical contrast maps of a standard two-stage GLM analysis of the relevance values for each cognitive state. Replication of Fig. 4, except that here only negative t-values are visualized.
Figure S5: Group-level statistical contrast maps of a standard two-stage GLM analysis for each cognitive state. Colors indicate t-values. All t-values were FDR corrected at a rate of 0.05. All data is projected onto the FsAverage5 [56] surface template.
Figure S6: Contrast maps of a meta analysis with the Neurosynth database for the terms "body", "face", "place" and "tools". Colors indicate z-values. All z-values were FDR corrected at a rate of 0.01. All data is projected onto the FsAverage5 [56] surface template.
Figure S7: Distribution of relevance values over the course of a single trial for the faces and places stimulus classes. Replication of Fig. 6, except that here only negative t-values are visualized.