1 Introduction
There is an increasing focus on applying deep learning to unstructured data in the medical domain, especially using Graph Convolutional Networks (GCNs) [defferrard2016convolutional]. Multiple applications have been demonstrated so far, including Autism Spectrum Disorder prediction with manifold learning to distinguish between diseased and healthy brains [ktena2018metric], matrix completion to predict missing values in medical data [vivar2018multi], and finding drug similarity using graph auto-encoders [ma2018drug]. In this paper, we study the task of Alzheimer's disease and Autism Spectrum Disorder prediction with complementary imaging and non-imaging multimodal data.
In the above works, GCNs had a remarkable impact on the usage of multimodal medical data. One key difference to previous learning-based methods is that patients are set in relation to each other with a neighborhood graph, often by associating them through non-imaging data like gender, age, clinical scores or other meta-information. On this graph, patients can be considered as nodes, patient similarities are represented as edge weights, and features from e.g. imaging modalities are incorporated through graph signal processing. GCNs then provide a principled manner of learning optimal graph filters that minimize an objective. Here, we use node-level classification for our disease prediction task.
A simple analogy to node-based classification of the population is image segmentation with CNNs, where each pixel is a node and the image grid is the graph. In such domains, filters with a constant size can acquire semantic features over the whole grid domain, given convolutions over a constant number of equidistant neighbors. In the case of irregular graphs, however, the varying number of neighbors and their distances from each other lead to heterogeneous density and local structure. Applying filters with a constant kernel size over the whole graph domain might therefore not produce semantic and comparable features.
In medical datasets, graphs defined on patients' data exhibit similar heterogeneity, as each patient may have a distinct combination of non-imaging data and a different number of neighbors. A concrete example is shown in Fig. 1 (left), which depicts a population graph of 150 subjects for Alzheimer's disease classification, arranged in clusters of varying density and local topology (regions a, b and c). Such heterogeneity in the graph structure should be considered in order to learn cluster-specific features. A model capable of producing similar intra-cluster and different inter-cluster features can be designed by applying multi-sized kernels to the same input. To this end, we propose InceptionGCN, inspired by the successful inception architecture [szegedy2015going] for CNNs. Our model leverages spectral convolutions with different kernel sizes and chooses optimal features to solve the classification problem.
To the best of our knowledge, there is little related literature that focuses on the receptive fields of GCN filters. Earlier works [defferrard2016convolutional, kipf2016semi] use GCNs with a constant filter size for the node-based classification task and show the superiority of GCNs, but do not address the heterogeneity of the graph. In [liu2018geniepath], a method is proposed that determines a receptive path for each node, rather than a field, for performing the convolutions for representation learning. Irrespective of nearest neighbors, the aim is to perform convolutions with selected nodes in the receptive field. In [xu2018representation], a DenseNet-like architecture [huang2017densely] is proposed, in which outputs from consecutive layers are concatenated. Here, the receptive field is addressed in an indirect way, since the output features of successive layers depend on multiple previous layers through skip connections. Another work [hamilton2017inductive] uses features that are either fixed, hand-designed or based on aggregator functions. Moreover, the method needs a predefined order of nodes, which is difficult to obtain.
In this paper we show that InceptionGCN is an improvement in terms of both performance and convergence. Our contributions are: (1) we analyze the interdependence of graph structure and filter sizes on one artificial and two public medical datasets, and in doing so motivate the need for multiple kernel sizes; (2) we propose our novel InceptionGCN model with multiple filter kernel sizes, validate it on artificial and clinical data, and show improved performance over regular GCN architectures; (3) we demonstrate the robustness of our model towards different approaches for constructing the graph adjacency from non-imaging data.
2 Methodology
Traditional models [parisot2017spectral] use a constant filter size throughout all layers, which forces the features of every node to be learned from neighbors at a fixed number of hops, without consideration of cluster size and shape. Our proposed InceptionGCN model overcomes this limitation by varying the filter size across the GC-layers in order to produce class-separable output features. This property of our model is highly desirable when each class distribution has a distinct variance and/or when the classes are heavily overlapping. In this setting, we aim to solve the disease classification task by incorporating the semantics of the varied associations coming from different graphs within the population. We provide a detailed description of the model, starting from the affinity graph construction, followed by the mathematical background and a discussion of the proposed model architecture.
2.1 Affinity graph construction
The construction of the affinity graph is crucial to accurately model the interactions among the patients and should be designed carefully. The affinity graph $G = (V, E, W)$ is constructed on the entire population of patients (including training and testing samples), where $V$ is the set of $|V| = N$ vertices, $E$ are the edge connections of the graph and $W$ are the weights of the edges. Considering each patient as a node in the graph, $W$ incorporates the similarities between the patients with respect to the non-imaging data $M$. The features $x_v$ at every node $v$ are fetched from imaging data. First, we construct a binarized edge graph $B$ representing the connections. Mathematically, $B$ can be defined as

$$B(v, w) = \begin{cases} 1 & \text{if } |M(v) - M(w)| \leq \beta \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

where $M(v)$ and $M(w)$ are the values of the non-imaging element for nodes $v$ and $w$, and $\beta$ is the threshold for that element. The weight matrix $W$ weights the edges based on the correlation distance between the features at every node. The weight matrix elements are defined as $W(v, w) = \exp\left(-\rho(x_v, x_w)^2 / (2\sigma^2)\right)$, with $\rho$ being the correlation distance and $\sigma$ being the width of the kernel. This weight computation and the value of $\sigma$ are identical to the procedure described in [parisot2017spectral], to provide equal grounds for comparison. The final affinity matrix is constructed as $A = B \circ W$, with $\circ$ being the Hadamard product.
2.2 Mathematical background of spectral convolution and localization of filters for inception modules
Let $L = I_N - D^{-1/2} A D^{-1/2}$ be the normalized version of the graph Laplacian of $A$ including self-loops, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$ and $I_N$ is the identity matrix. Since $L$ is real, symmetric and positive semi-definite, it is diagonalizable by its eigenvectors $U = [u_0, \ldots, u_{N-1}]$ such that $L = U \Lambda U^{\top}$, where $\Lambda = \operatorname{diag}(\lambda_0, \ldots, \lambda_{N-1})$ holds the corresponding eigenvalues. The graph Fourier transform of a signal $x$ defined at each node is given by $\hat{x} = U^{\top} x$, and the inverse Fourier transform by $x = U \hat{x}$. With this, the spectral convolution can be defined as the multiplication of the signal with a learnable filter $g_\theta$ in the Fourier domain, $g_\theta \ast x = U g_\theta(\Lambda) U^{\top} x$, which results in interpreting $g_\theta$ as a function of the eigenvalues $\Lambda$ [kipf2016semi]. In order to avoid the computationally prohibitive matrix multiplications necessary to perform the Fourier transform of the signal $x$, we redefine the filter using the Chebyshev polynomial parameterization $g_\theta(\Lambda) = \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda})$, where $\theta$ is a vector of Chebyshev coefficients and $T_k$ is the Chebyshev polynomial of degree $k$ [kipf2016semi, defferrard2016convolutional]. Since $L = U \Lambda U^{\top}$, we can write $g_\theta(\Lambda)$ as a function of the rescaled Laplacian $\tilde{L} = 2L/\lambda_{\max} - I_N$. Therefore, we can perform the spectral filtering of a signal $x$ with $y = g_\theta(L)\, x = \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x$. The value at vertex $j$ of the filter centered at vertex $i$ is given by

$$\left(g_\theta(L)\, \delta_i\right)_j = \sum_{k=0}^{K} \theta_k \left(T_k(\tilde{L})\right)_{i,j}, \qquad (2)$$

where $\delta_i$ is the Kronecker delta function centered at vertex $i$. Inspired by [hammond2011wavelets], we now explain how filters with a specific receptive field can be derived. Let $G$ be a weighted graph, $L$ be the graph Laplacian (normalized or unnormalized), and $s > 0$ be an integer (here $s$ stands for the hop neighborhood); then for any two vertices $i$ and $j$:

$$d_G(i, j) > s \implies \left(L^s\right)_{i,j} = 0, \qquad (3)$$

where $d_G(i, j)$ is the shortest path distance between $i$ and $j$, i.e. the minimum number of edges on any path connecting $i$ and $j$ (independent of the sum of the edge weights along that path). Therefore, following eq. 2, the spectral filters represented by a $K$-th order polynomial of the Laplacian are exactly $K$-hop localized.
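The localization property of eq. (3) can be checked numerically. The following toy sketch (our own example, not from the paper) builds the unnormalized Laplacian of a 5-node path graph and verifies that $(L^s)_{ij}$ vanishes whenever $i$ and $j$ are more than $s$ hops apart:

```python
import numpy as np

# Toy path graph on 5 nodes: 0 - 1 - 2 - 3 - 4 (our own example).
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

D = np.diag(A.sum(axis=1))
L = D - A  # unnormalized graph Laplacian; the lemma holds for either variant

# Nodes 0 and 4 are 4 hops apart, so (L^s)[0, 4] must vanish for s < 4.
entries = [np.linalg.matrix_power(L, s)[0, 4] for s in range(1, 5)]
print(entries)  # the first three entries are exactly zero
```

The same zero pattern holds for the normalized Laplacian used in the model, since both variants share the sparsity structure of the adjacency matrix.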
2.3 Inception modules
The localization of a filter is defined by taking all the neighbors at a distance of $K$ hops into account for the spectral convolution with a signal $x$. A filter with a fixed $K$ used on the full dataset can be defined as $y_K = g_{\theta}^{K}(L)\, x$. Here, $y_K$ describes the output of a filter with a neighborhood of $K$-hop distance. To account for different sizes and variances of clusters and structure in the data, instead of using one filter we now use $S$ filters with varying neighborhood sizes $K_1, \ldots, K_S$. These combined filters are the centerpiece of the inception module, as they simultaneously consider the close proximity of a signal and the broader neighborhood situation. Every filter $s$ of the module has its own parameter vector $\theta_s$ and performs a convolution on the dataset, returning an output vector $y_{K_s}$. The outputs of the filters are merged by an aggregator function $\Gamma$ to determine the output of the inception module as $y = \Gamma(y_{K_1}, \ldots, y_{K_S})$, where every $\theta_s$ with $K_s + 1$ entries is the learnable parameter vector of the respective filter of the inception module. To merge the outputs within each inception module we propose two aggregators $\Gamma$: (1) concatenation and (2) max-pooling. Our model architecture is illustrated in Fig. 1. It is built from inception modules in sequence. Each inception module consists of several GC-layers in parallel with filters of different kernel sizes $K_s$. We apply ReLU at the output of each GC-layer. For the training set, a labelled subset of graph nodes is chosen, for which the loss is computed and gradients are backpropagated. We apply cross-entropy loss as the optimization function. Due to the graph connections, the training process on the labelled data is transferred to the unlabeled data by signal diffusion, which corresponds to the behavior of a standard GCN.
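A forward pass of one inception module can be sketched in plain numpy as follows. This is a minimal illustration under assumed sizes (8 nodes, 4 input features, two parallel filters with $K = 1$ and $K = 5$); it is not the paper's implementation, and the per-feature weight matrices of a full GC-layer are omitted, so each filter holds only one Chebyshev coefficient vector:

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def cheb_filter(L, x, theta):
    """Apply a K-hop Chebyshev filter (K = len(theta) - 1) to node signal x."""
    lmax = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lmax - np.eye(len(L))       # rescale spectrum to [-1, 1]
    Tx_prev, Tx = x, L_tilde @ x                    # T_0 x and T_1 x
    out = theta[0] * Tx_prev
    for k in range(1, len(theta)):
        out = out + theta[k] * Tx
        Tx_prev, Tx = Tx, 2.0 * L_tilde @ Tx - Tx_prev  # Chebyshev recurrence
    return out

def inception_module(L, x, thetas, aggregator="concat"):
    """Parallel GC-layers with different kernel sizes, merged by an aggregator."""
    ys = [np.maximum(cheb_filter(L, x, th), 0.0) for th in thetas]  # ReLU per layer
    if aggregator == "concat":
        return np.concatenate(ys, axis=-1)
    return np.max(np.stack(ys), axis=0)             # element-wise max-pooling

rng = np.random.default_rng(0)
A = (rng.random((8, 8)) > 0.6).astype(float)
A = np.triu(A, 1); A = A + A.T                      # symmetric adjacency, no self-loops
L = normalized_laplacian(A)
x = rng.standard_normal((8, 4))                     # 8 nodes, 4 features
thetas = [rng.standard_normal(2), rng.standard_normal(6)]  # K = 1 and K = 5 filters
print(inception_module(L, x, thetas).shape)
```

The concat aggregator stacks the filter outputs along the feature dimension, while max-pooling keeps the element-wise maximum over the parallel GC-layer outputs, matching the two aggregation options described above.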
3 Experiments and Results
In this section, we provide two main experimental setups to show (1) the sensitivity of spectral convolutions to different graphs and kernel sizes of the filters and (2) the superiority of InceptionGCN over other baseline methods. We show our results on two multimodal medical datasets and thoroughly analyze both the baseline [parisot2017spectral] and the proposed model. Finally, we provide insights into generalized design choices for building a data- and task-specific model.
3.1 Datasets
TADPOLE [marinescu2018tadpole]:
This dataset is a subset of the Alzheimer's Disease Neuroimaging Initiative (adni.loni.usc.edu), consisting of 557 patients with 354 multimodal features per patient. The target is to classify each patient into one of three classes: Cognitively Normal (CN), Mild Cognitive Impairment (MCI) or Alzheimer's Disease (AD). Features are extracted from MR and PET imaging, cognitive tests, CSF and clinical assessments. The APOE gene status constitutes another factor assisting in patient classification, as it provides a risk factor of developing AD. FDG-PET imaging measures the brain cell metabolism, where cells affected by AD show reduced metabolism. Furthermore, demographics (age, gender) are provided. We construct a binarized graph with each element of the demographic data, APOE status and FDG-PET measures. We choose $\beta = 2$ for age and $\beta = 0$ for the remaining three elements. The edges are weighted based on $W$, i.e. the feature similarity measure. We construct the 'Mixed' affinity graph by averaging all the graphs weighted with $W$, and 'Mixed (no)' without weighting.
ABIDE [abraham2017deriving]: The Autism Brain Imaging Data Exchange (ABIDE) aggregates data from 20 different sites and openly shares 1112 resting-state functional magnetic resonance imaging (R-fMRI) datasets with corresponding phenotypic elements (e.g. gender) for two classes: healthy and Autism Spectrum Disorder (ASD). We choose 871 subjects, divided into healthy (468) and ASD (403). For a fair comparison, we follow the same preprocessing steps as the baseline method [parisot2017spectral]. We construct two affinity graphs from the non-imaging elements gender and site, choosing $\beta = 0$ for both graphs.
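The affinity construction of Sec. 2.1 applied to such non-imaging elements can be sketched as below. This is a hedged illustration on synthetic toy data; the variable names and sizes are our own, while the Gaussian kernel of the correlation distance follows the weighting adopted from [parisot2017spectral]:

```python
import numpy as np

def binary_graph(meta, beta):
    """B[v, w] = 1 if the non-imaging values differ by at most beta (eq. 1)."""
    diff = np.abs(meta[:, None] - meta[None, :])
    return (diff <= beta).astype(float)

def feature_similarity(X, sigma=1.0):
    """Gaussian kernel of the pairwise correlation distance between node features."""
    rho = 1.0 - np.corrcoef(X)                 # correlation distance, rows = patients
    return np.exp(-rho ** 2 / (2.0 * sigma ** 2))

# Toy population (our own synthetic data): 10 patients, 30 imaging features, ages.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))
age = rng.integers(55, 90, size=10).astype(float)

# Affinity = Hadamard product of the binary graph and the feature-based weights.
Aff = binary_graph(age, beta=2.0) * feature_similarity(X)
np.fill_diagonal(Aff, 0.0)                     # no self-loops
```

A graph per non-imaging element (age, gender, APOE, FDG-PET, or site) would be built the same way with its own threshold $\beta$, and the 'Mixed' graph by averaging them.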
3.2 Experiments on medical datasets
In this subsection we present both the experimental setups mentioned above and discuss our findings on the medical datasets.
Effect of different kernel size on spectral convolution:
Our first set of experiments is designed to investigate the optimal kernel size of the filter required for each graph. The baseline model [parisot2017spectral] with two GC-layers in sequence is used to find the graph-specific filter sizes (i.e. the values of $K$). We investigate the performance of the model with the same input (features and graph) and $K_1, K_2 \in \{1, \ldots, 6\}$ for the two layers. Here, $K = 1$ and $K = 6$ indicate the kernel size of one-hop (smallest) and six-hop (largest) neighborhoods respectively. We select the values of the two $K$s corresponding to the best performance in the heatmap and incorporate them into our proposed InceptionGCN model as the different kernel sizes. This guarantees that the sequential GCN is performing at its optimum when compared to our method. We discuss the validity of this setting in a later section.
Results:
Fig. 2 shows the corresponding results as heatmaps.
Smaller values of $K$ learn local features and larger values learn global features. The performance differs with the change of $K_1$ and $K_2$ by a margin of 8% on average, which indicates that spectral convolution models are sensitive to the selection of $K$. The accuracy increases with the value of $K$, but saturates with a further increase. For most of the graphs, a larger $K_1$ combined with a smaller $K_2$ is the best combination, since the initial layer filters look at global features.
Each affinity graph shows a different structure over the same vertices and yields varied results for the same combination of the two $K$s. A similar trend is seen for ABIDE, which confirms the sensitivity towards $K$.
Comparison of InceptionGCN against sequential GCN approaches: We compare against four baselines. Parisot et al. [parisot2017spectral] is the traditional GCN with a fixed $K$ in both layers. We modify the same architecture of [parisot2017spectral] with the best combination of the two $K$s mentioned above as a second baseline. We evaluate our aggregator function for a proper selection of activations from the individual GC-layers of the inception module by comparing against baselines that use each of the two selected $K$s alone. This comparison shows that the aggregator is not biased towards any particular kernel size. In this setting, each graph yields a different performance for all methods, showing the effect of the different neighborhood affinities, as reported in Tab. 2.
Our model outperforms the baselines [parisot2017spectral] by an average margin of 4.12% for the TADPOLE dataset.
The comparative results for ABIDE are given in Tab. 3. Our model performs comparably to the baseline [parisot2017spectral], but is not able to outperform it. Interestingly, the mixed graph with feature-based edge weighting performs worse than the unweighted case. This confirms the non-discriminative nature of the features: images collected from different sites make it harder for the model to learn class-discriminative features.
3.3 Experiments on simulated data
Given the contrasting performance on the two datasets, we investigate the model in more detail to better understand the spectral model and to derive better design choices for user-specific tasks. These experiments are specifically designed to investigate only the choice of the kernel size of the filters.
We generate two 2-dimensional clusters $c_1$ and $c_2$ with normal Gaussian distributions of 300 points each in the Euclidean domain, each distribution representing one class. We construct the graph based on the Euclidean distance between the features and threshold it to sparsify the graph. Hence, the graph is highly correlated with the labels. In order to keep the experiment easy to interpret, we fix the means $\mu_1$ and $\mu_2$ of $c_1$ and $c_2$ respectively and vary the corresponding variances $\sigma_1$ and $\sigma_2$. For the features we consider two settings: class-discriminative, where the (x, y)-coordinates of each point are used as features, and class-indiscriminative, where we randomly sample the features from a uniform distribution for both classes. Both settings are shown in Fig. 3 (a) and (b). For the model architecture, we keep the number of layers at 1 for both the baseline model [parisot2017spectral] and InceptionGCN and train both networks for 200 epochs with a learning rate of 0.2.
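The data generation of this experiment can be sketched as follows (a minimal illustration; the concrete means, distance threshold and random seed are our assumptions, since only the variances and cluster sizes are stated above):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_clusters(mu1, mu2, s1, s2, n=300):
    """Two 2-D Gaussian clusters of n points each, one per class."""
    c1 = rng.normal(mu1, s1, size=(n, 2))
    c2 = rng.normal(mu2, s2, size=(n, 2))
    X = np.vstack([c1, c2])
    y = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]
    return X, y

def euclidean_graph(X, theta=1.0):
    """Connect points closer than theta; the threshold sparsifies the graph."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    A = (d < theta).astype(float)
    np.fill_diagonal(A, 0.0)
    return A

# Assumed means (0, 0) and (3, 3); variances are the experiment's free parameters.
X, y = make_clusters(mu1=(0, 0), mu2=(3, 3), s1=0.5, s2=0.1)
A = euclidean_graph(X, theta=1.0)
```

For the class-indiscriminative setting, the node features would instead be sampled uniformly at random while the graph above is kept unchanged.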
Results and interpretation:
The results of this experiment are illustrated with boxplots in Fig. 3 (c). Each box shows the classification accuracy for values of $K$ ranging from 1 to 10 for the baseline model with class-discriminative features. Keeping $\sigma_1 = 0.5$, we vary $\sigma_2$ over [0.1, 0.5, 1.0]. We repeat the experiments with $\sigma_1 = 1.0$. We observe that when the two clusters are clearly separable, the model is less sensitive to the value of $K$. The last two boxplots also show that with higher variance, the model becomes sensitive to $K$. Similar trends are observed when $\sigma_1$ is changed to 1.0; however, a consistent drop in accuracy is observed with $\sigma_2 = 1.0$. If there is large variance in the data, filters with a larger receptive field produce generalized global features.
Further, we apply our model to the simulated data with only one inception module incorporating two GC-layers with different $K = [1, 10]$. We compare the results of a single-layered GCN with $K = [1, 5, 10]$ against the one-layered inception module for four different settings. The superiority of our model is seen mainly in the challenging scenarios where the variance of both classes is high (cf. Tab. 1). We also report the results for class-indiscriminative features, where the performance of all models drops drastically since the features are totally random. InceptionGCN outperforms the baseline in most of the cases.
Tab. 1: Mean accuracy (± SD) on simulated data with (a) class-discriminative and (b) class-indiscriminative features; two variance settings per column group.

K                                           |          σ = 0.1              |          σ = 1.0
(a) GCN (K = 1)                             | 98.50 ± 01.38 | 94.50 ± 01.83 | 95.67 ± 02.49 | 92.50 ± 02.61
    GCN (K = 10)                            | 99.00 ± 01.11 | 93.67 ± 04.93 | 95.50 ± 07.98 | 91.00 ± 04.42
    InceptionGCN (1 layer, [K1,K2]=[1,10])  | 94.83 ± 03.02 | 97.00 ± 02.56 | 92.00 ± 03.56 | 94.33 ± 03.56
(b) GCN (K = 1)                             | 49.33 ± 06.84 | 50.33 ± 07.48 | 49.50 ± 04.60 | 50.00 ± 06.28
    GCN (K = 10)                            | 60.33 ± 16.78 | 53.50 ± 10.99 | 50.83 ± 06.02 | 55.33 ± 14.79
    InceptionGCN (1 layer, [K1,K2]=[1,10])  | 66.50 ± 17.12 | 64.00 ± 17.95 | 48.00 ± 07.88 | 69.00 ± 24.79
Tab. 2: Mean accuracy (± SD) for the TADPOLE dataset.

Affinity                              | Age           | Gender        | APOE          | FDG           | Mixed         | Mixed (no)
Parisot et al. [parisot2017spectral]  | 82.55 ± 04.78 | 84.59 ± 04.82 | 82.68 ± 05.70 | 84.46 ± 05.46 | 82.04 ± 05.71 | 82.11 ± 04.94
Baselines
GCN [K1, K2]                          | 86.42 ± 03.95 | 87.52 ± 03.51 | 85.33 ± 04.75 | 86.61 ± 04.53 | 83.42 ± 05.93 | 81.95 ± 05.92
GCN (K = K1)                          | 85.46 ± 05.60 | 86.19 ± 04.91 | 85.08 ± 05.21 | 86.55 ± 04.55 | 81.85 ± 06.28 | 81.36 ± 05.98
GCN (K = K2)                          | 86.42 ± 03.98 | 84.59 ± 04.82 | 78.75 ± 04.45 | 84.46 ± 05.46 | 80.86 ± 05.69 | 80.99 ± 04.71
InceptionGCN
concat                                | 88.35 ± 03.03 | 88.06 ± 04.39 | 88.14 ± 03.20 | 86.99 ± 03.98 | 84.35 ± 06.97 | 83.62 ± 06.09
maxpool                               | 88.53 ± 03.27 | 88.19 ± 03.83 | 88.49 ± 03.05 | 87.65 ± 05.11 | 84.11 ± 04.50 | 83.87 ± 05.07
Tab. 3: Mean accuracy (± SD) for the ABIDE dataset.

Affinity                              | Gender        | Site          | Mixed         | Mixed (no)
Parisot et al. [parisot2017spectral]  | 67.39 ± 04.76 | 67.39 ± 01.49 | 67.85 ± 00.63 | 69.80 ± 04.35
Baselines
GCN [K1, K2]                          | 68.19 ± 05.38 | 69.00 ± 04.07 | 70.26 ± 03.70 | 70.26 ± 04.58
GCN (K = K1)                          | 66.70 ± 06.90 | 68.65 ± 04.31 | 69.91 ± 07.50 | 69.80 ± 03.90
GCN (K = K2)                          | 65.78 ± 06.50 | 68.65 ± 04.31 | 69.00 ± 03.80 | 69.46 ± 04.69
InceptionGCN
concat                                | 66.36 ± 05.66 | 67.97 ± 04.43 | 66.70 ± 06.27 | 69.23 ± 06.66
maxpool                               | 67.05 ± 05.47 | 67.39 ± 05.80 | 66.02 ± 05.92 | 69.11 ± 06.68
4 Discussion and Conclusion
In this work we have introduced InceptionGCN, a novel architecture that captures the local and global context of heterogeneous graph structures with multiple kernel sizes. The validation included an investigation of spectral convolution parameters and the behaviour of the proposed model given varying input data, in comparison to a recently proposed baseline method [parisot2017spectral].
Our findings show that applying different-sized filters to the same input features and graph improves the process of feature learning at multiple scales. Such rich and heterogeneous features help the model to learn better filters for classification. We tested the method on two publicly available medical datasets for Alzheimer's and Autism disease prediction, in order to analyze the robustness of the model towards different features, graph affinities and tasks. Our results show that both the spectral convolution baseline and the proposed model obtained high classification accuracies for TADPOLE (cf. Tab. 2), with a clear margin of InceptionGCN over the baselines. In the case of the ABIDE dataset, however, both methods had comparable performance, which was considerably lower than on TADPOLE (cf. Tab. 3). To investigate the different performances of both models, we utilized simulated data with i) different degrees of class overlap in the feature space and ii) entirely random features, forcing the GCN models to rely on connectivity alone (Tab. 1). We conclude that while both GCN models are very sensitive to the variance of the data, our model is superior in the case of large variances and overlapping class clusters. The main factors affecting the performance of a GCN are the features, the graph and the filters. Based on all the experiments, we discuss each of these factors in detail.
Influence of the graph:
For the ABIDE dataset, images are collected from 20 different sites and imaging conditions, which adds considerable heterogeneity to the data. Consequently, the affinity graph based on site information consists of 20 disjoint clusters. Building a graph based on site information allows only the neighbors (i.e. samples from the same site) to contribute to the feature learning. This has less clinical relevance to the classification task, whereas for TADPOLE, the risk factors and demographics are clinically relevant. Such relevance of the graph can be determined using the graphs’ energy function provided in [gansner2013coast].
Next, the mixed affinity graph performs worst overall in terms of accuracy (cf. Tab. 2 and Tab. 3) and standard deviation (SD) (cf. Tab. 3). This indicates that a straightforward creation of the mixed affinity graph by averaging impairs the inherent structure of each graph, and important clinical semantics from the individual graphs may get lost. This is confirmed by the unequal performance observed for each affinity graph, which may even indicate a ranking of the relevance of each non-imaging element to the objective. A more elegant way to combine all the affinity graphs is to rank them during training [kazi2018multi].
Influence of the features: The importance of a proper feature choice becomes clear in the tests on simulated data. When using randomly sampled features for every node (cf. Tab. 1), the overall performance drops drastically. A large standard deviation in the performance shows that the filters are not learned properly and the model does not converge. The same behavior can be seen for the TADPOLE and ABIDE datasets when comparing the mixed and mixed (no) settings (cf. Tab. 2 and 3). Since the features of the ABIDE dataset do not separate the nodes into different clusters as well as those of the TADPOLE dataset (Fig. 4), the performance of the models drops for ABIDE when the feature similarity ($W$), which is used for graph construction, is included. At the same time, the models receive a performance boost when the meaningful features of TADPOLE are included in the graph generation process.
Influence of the kernel size: We investigated the effect of the features and of the heterogeneity of the graph on the choice of $K$. Our results show that in the case of class-separable features, a larger value of $K$ gives more compact features. From Tab. 1, it is clear that InceptionGCN performs better when the classes have large and different variances. In such a case, InceptionGCN with multiple $K$s manages to capture class-discriminative features for the nodes. If the clusters are compact ($\sigma = 0.1$), the choice of $K$ does not matter. From Fig. 3 (c), we see that the model is not sensitive to $K$ if the clusters are compact, whereas it becomes sensitive when the variance increases. In the case of class-indiscriminative features and a less relevant graph (as for ABIDE), a larger kernel size helps to learn global class-discriminative features.
Sequential model vs. InceptionGCN: Choosing the values of the two $K$s from the sequential model (GCN) for a parallel setting might seem ambiguous. In Tab. 2, however, the role of the aggregator function is clearly visible in the performance, since the baselines cover all the possible combinations that the final output of our model can take. Furthermore, our proposed model converges 1.63 times faster in terms of epochs compared to the baseline method when trained with an early stopping criterion with a window size of 25, owing to a better feature learning process.
Future scope: Potential improvements of the InceptionGCN model include out-of-sample inference (i.e. inductive learning), which would highly improve the usability of the model. Another area of investigation is the integration of multiple affinity graphs into one model. Furthermore, the InceptionGCN model structure itself can also be optimized, first by using a learnable preprocessing step to obtain the neighborhood values $K_s$, and second, by analyzing the number of hidden units in each GC-layer and the overall number of inception modules necessary.
5 Acknowledgement
The authors would like to thank Dr. Benedikt Westler for his help and support in understanding the TADPOLE dataset. The study was carried out with financial support of Freunde und Förderer der Augenklinik, München, Germany, Carl Zeiss Meditec AG, Oberkochen, Germany and the German Federal Ministry of Education and Research (BMBF) in connection with the foundation of the German Center for Vertigo and Balance Disorders (DSGZ) (grant number 01 EO 0901).