## 1 Introduction

Autism spectrum disorders (ASD) affect the structure and function of the brain. To better target the underlying roots of ASD for diagnosis and treatment, efforts to identify reliable biomarkers are growing [8]. Significant progress has been made using functional magnetic resonance imaging (fMRI) to characterize brain remodeling in ASD [9]. Recently, emerging research on Graph Neural Networks (GNNs) has combined deep learning with graph representations and applied the integrated approach to fMRI analysis in different neuro-disorders [11]. Most existing approaches (based on the Graph Convolutional Network (GCN) [10]) require all nodes in the graph to be present during training and thus do not naturally generalize to unseen nodes. Moreover, while it is necessary to interpret which features in the data serve as evidence for the model, no tool currently exists that can interpret and explain GNNs, and recent CNN explanation algorithms cannot work directly on graph input.

Our main contributions include the following three points: 1) We develop a method to integrate all the available connectivity, geometric, and anatomic information, together with task-fMRI (tfMRI) related parameters, into graphs for deep learning. Our approach alleviates the problem of predetermining the best features and measures of functional connectivity, which is often ambiguous due to the intrinsically complex structure of task-fMRI. 2) We propose a generalizable GNN inductive learning model to more accurately classify ASD vs. healthy controls (HC). Different from the spectral GCN [10], our GNN classifier is based on graph isomorphism; it can be applied to multigraphs with different nodes/edges (e.g. sub-graphs) and learns local graph information without being bound to the whole graph structure. 3) The GNN architecture enables us to train the model on whole graphs and validate it on sub-graphs. We directly evaluate importance scores on sub-graphs and nodes (i.e. regions of interest (ROIs)) by examining model responses, without resampling values for the occluded features. The 2-stage pipeline to interpret important sub-graphs/ROIs, which are defined as biomarkers in our setting, is shown in Fig. 1.

## 2 Methodology

### 2.1 Graph Definition

We first parcellate the brain into ROIs based on its T1 structural MRI and define the ROIs as graph nodes. We define an undirected multigraph $G = (V, E)$, where node attributes $h_i \in \mathbb{R}^{d_v}$ and edge attributes $e_{ij} \in \mathbb{R}^{d_e}$, and $d_v$, $d_e$ are the attribute dimensions of nodes and edges respectively. For node attributes, we concatenate handcrafted features: degree of connectivity, General Linear Model (GLM) coefficients, mean and standard deviation of the task-fMRI signal, and ROI center coordinates. We applied the Box-Cox transformation [13] to make each feature follow a normal distribution (parameters are learned from the training set and applied to both the training and testing sets). The edge attribute $e_{ij}$ of nodes $i$ and $j$ includes the Pearson correlation, the partial correlation calculated using residual fMRI, and a term derived from $d_{ij}$, the geometric distance between the centers of the two ROIs. We thresholded the edges under the 95th percentile of partial correlation values to ensure sparsity, for efficient computation and to avoid oversmoothing.

### 2.2 Graph Neural Network (GNN) Classifier

The architecture of our proposed GNN is shown in Fig. 2 (node and edge attribute definitions and kernel sizes are denoted). The model inductively learns node representations by recursively aggregating and transforming the feature vectors of its neighboring nodes. Below, we define the layers in the proposed GNN classifier.
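Concretely, one such neighborhood-aggregation step (formalized in the next subsection) can be sketched as follows. This is a minimal numpy illustration, not the actual implementation: the mean-over-neighbors normalization and the shape of the edge network `edge_mlp` are assumptions.

```python
import numpy as np

def edge_conditioned_step(h, edges, edge_attrs, W, edge_mlp, act=np.tanh):
    """One aggregation step: each node combines its own transformed features
    with messages from 1-hop neighbors, where every message is the neighbor's
    feature vector mapped through an edge-conditioned matrix."""
    n, d_in = h.shape
    d_out = W.shape[0]
    out = h @ W.T                      # self term via a learnable matrix W
    deg = np.zeros(n)
    msg = np.zeros((n, d_out))
    for (i, j), e in zip(edges, edge_attrs):
        theta = edge_mlp(e).reshape(d_out, d_in)  # edge attrs -> weight matrix
        msg[i] += theta @ h[j]
        msg[j] += theta @ h[i]         # undirected graph: message both ways
        deg[i] += 1
        deg[j] += 1
    out += msg / np.maximum(deg, 1)[:, None]      # degree-normalized aggregation
    return act(out)
```

Stacking such steps lets a node's representation depend on increasingly distant neighborhoods, which is what allows the classifier to operate on sub-graphs later in the pipeline.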

#### 2.2.1 Convolutional Layer

Following Message Passing Neural Networks (NNconv) [7], which is invariant to graph symmetries, we leverage the node degree in the embedding. The embedded representation at the $k$th convolutional layer is

$$h_i^{(k)} = \sigma\Big(W^{(k)} h_i^{(k-1)} + \frac{1}{|\mathcal{N}(i)|}\sum_{j \in \mathcal{N}(i)} \mathrm{MLP}^{(k)}(e_{ij})\, h_j^{(k-1)}\Big), \tag{1}$$

where $\sigma$ is a nonlinear activation function (we use relu here), $\mathcal{N}(i)$ is node $i$'s 1-hop neighborhood, $W^{(k)}$ is a learnable propagation matrix, $\mathrm{MLP}^{(k)}$ denotes a Multi-Layer Perceptron, which maps the edge attributes $e_{ij}$ to a $d_k \times d_{k-1}$ matrix, and we initialize $h_i^{(0)}$ to the node attributes defined in Section 2.1.

#### 2.2.2 Pooling Aggregation Layer

To make sure that down-sampling layers behave idiomatically with respect to different graph sizes and structures, we adopt the approach in [2] for reducing graph nodes. The choice of which nodes to drop is based on projecting the node attributes onto a learnable vector $p$; the nodes receiving lower scores experience less feature retention. Fully written out, the operation of this pooling layer (computing a pooled graph from an input graph with feature matrix $X$ and adjacency matrix $A$) is expressed as follows:

$$y = \frac{Xp}{\|p\|}, \quad \mathbf{i} = \mathrm{top}_k(y), \quad X' = (X \odot \tanh(y))_{\mathbf{i},:}, \quad A' = A_{\mathbf{i},\mathbf{i}}, \tag{2}$$

Here $\|\cdot\|$ is the $L_2$ norm, $\mathrm{top}_k$ finds the indices corresponding to the $k$ largest elements in vector $y$, $\odot$ is (broadcasted) element-wise multiplication, and $(\cdot)_{\mathbf{i},\mathbf{j}}$ is an indexing operation which takes elements at row indices specified by $\mathbf{i}$ and column indices specified by $\mathbf{j}$ (a colon denotes all indices). The pooling operation trivially retains sparsity by requiring only a projection, a point-wise multiplication and a slicing into the original feature and adjacency matrices. Different from [2], we constrain the norm of $p$ (implemented by adding an additional regularization loss) to avoid identifiability issues.
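A minimal numpy sketch of this top-k style pooling, written directly from Eq. (2) (the vector `p` would be learned in practice; here it is just an input):

```python
import numpy as np

def topk_pool(X, A, p, ratio=0.5):
    """Top-k pooling: score nodes by projecting features onto a learnable
    vector p, keep the ceil(ratio * N) highest-scoring nodes, gate their
    features by tanh(score), and slice the adjacency matrix accordingly."""
    y = X @ p / np.linalg.norm(p)                 # projection scores y = Xp/||p||
    k = int(np.ceil(ratio * X.shape[0]))
    idx = np.argsort(y)[::-1][:k]                 # indices of the k largest scores
    X_pooled = X[idx] * np.tanh(y[idx])[:, None]  # gated feature retention
    A_pooled = A[np.ix_(idx, idx)]                # row/column slicing of adjacency
    return X_pooled, A_pooled, idx
```

Because the gate is `tanh(y)` rather than a hard selection, gradients flow back into `p`, which is what makes the node-scoring vector learnable.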

#### 2.2.3 Readout Layer

Lastly, we seek a "flattening" operation that preserves information about the input graph in a fixed-size representation. Concretely, to summarise the output graph of the $k$th conv-pool block, we use

$$s^{(k)} = \frac{1}{N_k}\sum_{i=1}^{N_k} h_i^{(k)} \,\Big\Vert\, \max_{i=1,\dots,N_k} h_i^{(k)}, \tag{3}$$

where $N_k$ is the number of graph nodes, $h_i^{(k)}$ is the $i$th node's feature vector, $\max$ operates element-wise, and $\Vert$ denotes concatenation. The final summary vector is obtained as the concatenation of all those summaries (i.e. $s = s^{(1)} \,\Vert\, \dots \,\Vert\, s^{(K)}$) and is submitted to an MLP to obtain the final predictions.
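The readout of Eq. (3) amounts to two reductions and a concatenation; a small numpy sketch:

```python
import numpy as np

def readout(H):
    """Fixed-size summary of one block's output graph: concatenate the
    node-wise mean and the element-wise max of the node feature matrix H
    (shape N x d), giving a vector of length 2d regardless of N."""
    return np.concatenate([H.mean(axis=0), H.max(axis=0)])

def summarize(block_outputs):
    """Final summary vector: concatenation of every conv-pool block's readout."""
    return np.concatenate([readout(H) for H in block_outputs])
```

Since the output length depends only on the feature dimension and the number of blocks, graphs (or sub-graphs) with different node counts map to same-size vectors, which is what lets the classifier accept sub-graphs at interpretation time.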

### 2.3 Explain Input Data Sensitivity

To explain input data sensitivity, we first cluster the whole brain graph into sub-graphs. We then investigate the predictive power of each sub-graph and further assign an importance score to each ROI.

#### 2.3.1 Network Community Clustering

From now on we add the subscript $m$ to denote the graph of the $m$th instance, $G_m$, $m = 1, \dots, M$, where $M$ is the number of graphs. Concatenating the sparsified non-negative partial correlation matrices for all the graphs, we create a 3rd-order tensor $\mathcal{T}$ of dimension $N \times N \times M$, where $N$ is the number of nodes. Non-negative PARAFAC [3] tensor decomposition is applied to tensor $\mathcal{T}$ to discover overlapping functional brain networks. Given decomposition rank $C$, $\mathcal{T} \approx \sum_{c=1}^{C} a_c \circ b_c \circ t_c$, where the loading vectors $a_c, b_c \in \mathbb{R}^N$, $t_c \in \mathbb{R}^M$, and $\circ$ denotes the vector outer product; $a_c = b_c$ since the connectivity matrix is symmetric. The $i$th element of $a_c$ provides the membership of region $i$ in the community $c$. Here, we consider region $i$ to belong to community $c$ if its loading $a_{ic}$ exceeds a threshold [12]. This gives us a collection of community indices indicating region membership.

#### 2.3.2 Graph Salience Mapping

After decomposing all the brain networks into community sub-graphs $G_m^c$, we use a salience mapping method to assign each sub-graph an importance score. In our classification setting, the probability of class $\hat{y}$ (0: HC, 1: ASD) given the original network $G_m$ is estimated from the predictive score of the model: $p(\hat{y} \mid G_m)$. To calculate $p(\hat{y} \mid G_m^c)$, different from a CNN or GCN, we can directly input the sub-graph into the pre-trained classifier. We denote $y_m$ as the class label for instance $m$ and define the Evidence for Correct Class (ECC) for each community:

$$\mathrm{ECC}_c = \frac{1}{M}\sum_{m=1}^{M} \tanh\Big(\log_2 \frac{p(y_m \mid G_m^c) + \epsilon}{1 - p(y_m \mid G_m^c) + \epsilon}\Big), \tag{4}$$

where the Laplace correction $\epsilon$ is used to avoid zero denominators.
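A minimal sketch of computing ECC for one community, given the classifier's probabilities for the correct class on each instance (numpy-based; the value of the correction `eps` is illustrative, not taken from the paper):

```python
import numpy as np

def ecc(correct_class_probs, eps=1e-3):
    """Evidence for Correct Class for one community sub-graph: average over
    instances of tanh(log2 odds-ratio) of the model's probability for the
    correct class, with a Laplace-style correction eps so the denominator
    never reaches zero."""
    p = np.asarray(correct_class_probs, dtype=float)
    odds = (p + eps) / (1.0 - p + eps)          # corrected odds-ratio
    return float(np.mean(np.tanh(np.log2(odds))))  # bounded in (-1, 1)
```

A sub-graph whose probabilities sit near 0.5 contributes scores near zero, while confidently correct (or incorrect) predictions push the score toward +1 (or -1).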

Note that the log odds-ratio is commonly used in logistic regression to make the predictive scores more separable. The nonlinear tanh function is used for bounding ECC. ECC can be positive or negative: a positive value provides evidence for the correct class, whereas a negative value provides evidence against it. The final importance score for node $i$ is calculated by summing the ECC over the communities containing node $i$. The larger the score, the more plausibly the node can serve as a distinguishing marker.

## 3 Experiments and Results

### 3.1 Data Acquisition and Preprocessing

We tested our method on a group of 75 ASD children and 43 age- and IQ-matched healthy controls collected at the Yale Child Study Center. Each subject underwent a task fMRI scan (BOLD, TR = 2000 ms, TE = 25 ms) acquired on a Siemens MAGNETOM Trio TIM 3T scanner. During the fMRI scans, subjects performed the "biopoint" task, viewing point-light animations of coherent and scrambled biological motion in a block design [9]. The fMRI data were preprocessed following the pipeline in [14].

The mean time series for each node was extracted by bootstrapping from a random fraction of the voxels in the ROI (given an atlas) of the preprocessed images. We augmented the ASD data 10 times and the HC data 20 times, resulting in 750 ASD graphs and 860 HC graphs respectively. We split the data into 5 folds based on subjects; four folds were used as training data and the left-out fold was used for testing. The node and edge attributes follow the definitions in Section 2.1. Specifically, the GLM parameters of the "biopoint" task are: the coefficient of the biological motion regressor, the coefficient of the scrambled motion regressor, and the coefficients of these two regressors' temporal derivatives.
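The bootstrap-based augmentation described above can be sketched as follows; the fraction of voxels drawn per sample is not specified in this excerpt, so `frac` here is an illustrative assumption:

```python
import numpy as np

def bootstrap_roi_series(voxel_ts, frac=1/3, n_samples=10, rng=None):
    """Augment one ROI by repeatedly averaging the time series of a random
    voxel subset. voxel_ts: (n_voxels, T) array of per-voxel time series;
    frac: assumed fraction of voxels drawn per sample (illustrative)."""
    rng = np.random.default_rng(rng)
    n_vox, T = voxel_ts.shape
    k = max(1, int(frac * n_vox))
    out = np.empty((n_samples, T))
    for s in range(n_samples):
        idx = rng.choice(n_vox, size=k, replace=False)  # random voxel subset
        out[s] = voxel_ts[idx].mean(axis=0)             # ROI mean time series
    return out
```

Running this per ROI with `n_samples=10` for ASD subjects and `n_samples=20` for HC subjects yields the per-subject graph copies described above.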

| Model | RF(V) | RF(E) | RF(V+E) | GNN(r=0.3) | GNN(r=0.5) | GNN(r=0.8) |
|---|---|---|---|---|---|---|
| Accuracy | | | | | | |
| F-score | | | | | | |
| Precision | | | | | | |
| Recall | | | | | | |

Table 1: Classification performance of the random forest (RF) baselines and the proposed GNN with different pooling ratios r.

### 3.2 Step 1: Train ASD/HC Classification Model

Firstly, we tested classifier performance using the proposed GNN on the Destrieux atlas [5] (148 ROIs). Since our pipeline integrates interpretation and classification, we applied a random forest (RF) with 1000 trees as an additional "reality check": the other existing graph classification models either cannot match the performance of a GNN [2, 7] or do not have a straightforward and reliable interpretation ability [1]. We flattened the node features (V), edge features (E), or both (V+E) as inputs to the RF. Our GNN configuration results in 2746 trainable parameters, and we tried different pooling ratios (r = 0.3, 0.5, 0.8 in Fig. 2), implemented based on [6]. We applied softmax after the network output and combined the cross-entropy loss with the regularization loss of Section 2.2.2 as the objective function. We used the Adam optimizer with an initial learning rate of 0.001, decayed by a fixed factor every 50 epochs. We trained the network for 300 epochs on all of the splits and measured instance classification by accuracy, F-score, precision and recall (see Table 1). Our proposed model significantly outperformed the alternative method, due to its ability to embed high-dimensional features based on the structural relationship. We selected the best GNN model for the next step: interpreting biomarkers.

### 3.3 Step 2: Interpret and Explain Biomarkers

We put forth the hypothesis that the more accurate the classifier, the more reliable the biomarkers it identifies. We used the best RF model (by testing-set accuracy) and its RF-based feature importances (mean Gini impurity decrease) as a standard method for comparison. For GNN interpretation, we likewise chose the model with the best testing-set accuracy. Further, to be comparable with the RF, all of the interpretation experiments were performed on the training set only. The interpretation results are shown in Fig. 3, where the top 30 important ROIs (averaged over node features and instances) selected by the RF are shown in yellow and the top 30 important ROIs selected by our proposed GNN in red. Nine important ROIs were selected by both methods. In addition, for node attribute importance, we averaged the importance score over ROIs and instances for the RF. For the GNN, we averaged the gradient-based saliency over all nodes and instances, which quantifies the sensitivity of each node attribute. Fig. 3(c) shows the importance relative to the most important node attribute: our proposed method assigned more uniform importance to each node attribute, among which the biological motion parameter was the most important. In addition, similar features, the mean/std of the task-fMRI signal (tf_mean/tf_std) and the ROI coordinates, received similar scores, which makes more sense for human interpretation. Notice that our proposed pipeline is also able to identify sub-graph importance from Eq. (4), which is helpful for understanding the interaction between different brain regions. We selected the top 2 sub-graphs (out of 20 communities) and used Neurosynth [15] to decode the functional keywords associated with them (shown in Fig. 4). These networks are both associated with high-level social behaviors. To illustrate the predictive power of the 2 sub-graphs, we retrained the network using the graph slicing on the 19 ROIs of the 2 sub-graphs as input; accuracy on the testing set (in the split of the best model) was comparable to using the whole graph.

### 3.4 Evaluation: Robustness Discussion

To examine the potential influence of different graph-building strategies on the reliability of the network estimates, the functional and anatomical data were registered and parcellated with the Destrieux atlas (A1) and the Desikan-Killiany atlas (A2) [4]. We also examined the robustness of the results with respect to the number of clusters in the decomposition. The results are shown in Fig. 5. We ranked the ECCs for each node and indicate the top 30 ROIs in A1 and the top 15 ROIs in A2; the atlas and number of clusters are indicated on the left of each sub-figure. The orbitofrontal cortex and ventromedial prefrontal cortex were selected in all cases; these regions are related to social motivation and have previously been shown to be associated with ASD [9]. We also validated the results by decoding, with Neurosynth, the neurological functions of the important ROIs that overlapped between settings.

## 4 Conclusion and Future Work

In this paper, we proposed a framework to discover ASD brain biomarkers from task-fMRI using a GNN. It achieved improved accuracy and more interpretable features than the baseline method. We also showed that our method performs robustly across different atlases and hyper-parameters. Future work will include investigating more hyper-parameters (e.g. a suitable size for the sub-graph communities), and testing the results on functional atlases and different graph definition methods. The pipeline can be generalized to other feature importance analysis problems, such as resting-state fMRI biomarker discovery and vessel cancer detection.

## Acknowledgments

This work was supported by NIH Grant R01 NS035193.

## References

- [1] Adebayo, J., et al.: Sanity checks for saliency maps. In: Advances in Neural Information Processing Systems. pp. 9505–9515 (2018)
- [2] Cangea, C., et al.: Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287 (2018)
- [3] Carroll, J.D., Chang, J.J.: Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika 35(3), 283–319 (1970)
- [4] Desikan, R.S., et al.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage 31(3), 968–980 (2006)
- [5] Destrieux, C., et al.: Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. Neuroimage 53(1), 1–15 (2010)
- [6] Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. CoRR abs/1903.02428 (2019)
- [7] Gilmer, J., et al.: Neural message passing for quantum chemistry. In: ICML 2017. pp. 1263–1272. JMLR. org (2017)
- [8] Goldani, A.A., et al.: Biomarkers in autism. Frontiers in psychiatry 5 (2014)
- [9] Kaiser, M.D., et al.: Neural signatures of autism. PNAS (2010)
- [10] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
- [11] Ktena, S.I., et al.: Distance metric learning using graph convolutional networks: Application to functional brain networks. In: MICCAI (2017)
- [12] Loe, C.W., Jensen, H.J.: Comparison of communities detection algorithms for multiplex. Physica A: Statistical Mechanics and its Applications 431, 29–45 (2015)
- [13] Nishii, R.: Box-Cox transformation. Encyclopedia of Mathematics (2001)
- [14] Yang, D., et al.: Brain responses to biological motion predict treatment outcome in young children with autism. Translational psychiatry 6(11), e948 (2016)
- [15] Yarkoni, T., et al.: Large-scale automated synthesis of human functional neuroimaging data. Nature methods 8(8), 665 (2011)
