Non-IID Graph Neural Networks

by   Yiqi Wang, et al.
Michigan State University

Graph classification is an important task on graph-structured data with many real-world applications. The goal of the graph classification task is to train a classifier using a set of training graphs. Recently, Graph Neural Networks (GNNs) have greatly advanced the task of graph classification. When building a GNN model for graph classification, the graphs in the training set are usually assumed to be identically distributed. However, in many real-world applications, graphs in the same dataset can have dramatically different structures, which indicates that these graphs are likely non-identically distributed. Therefore, in this paper, we aim to develop graph neural networks for graphs that are non-identically distributed. Specifically, we propose a general non-IID graph neural network framework, i.e., Non-IID-GNN. Given a graph, Non-IID-GNN can adapt any existing graph neural network model to generate a sample-specific model for this graph. Comprehensive experiments on various graph classification benchmarks demonstrate the effectiveness of the proposed framework. We will release the code of the proposed framework upon the acceptance of the paper.


1. Introduction

Graphs are natural representations for many types of real-world data, such as social networks (Yanardag and Vishwanathan, 2015; Hamilton et al., 2017; Kipf and Welling, 2016; Veličković et al., 2017), biological networks (Schomburg et al., 2004; Borgwardt et al., 2005; Dobson and Doig, 2003; Shervashidze et al., 2011) and chemical molecules (Duvenaud et al., 2015; Gilmer et al., 2017; Dai et al., 2016). A crucial step in performing downstream tasks on graph data is to learn better representations. Deep neural networks have demonstrated great capabilities in representation learning for Euclidean data and thus have advanced numerous fields, including speech recognition (Nassif et al., 2019), computer vision (He et al., 2016) and natural language processing (Devlin et al., 2018). However, they cannot be directly applied to graph data due to its complex topological structure. Recently, Graph Neural Networks (GNNs) have generalized deep neural networks to graph data; they typically transform, propagate and aggregate node features across the graph. They have boosted the performance of many graph-related tasks such as node classification (Kipf and Welling, 2016; Hamilton et al., 2017), link prediction (Schütt et al., 2017; Zhang et al., 2018b; Gao and Ji, 2019) and graph classification (Ying et al., 2018; Ma et al., 2019). In this work, we aim to advance Graph Neural Networks for graph classification.

Figure 1. An illustrative example of varied structural information and its impact on the performance of Graph Neural Network based graph classification. Panels: (a) node size distribution; (b) a graph with 31 nodes; (c) a graph with 302 nodes; (d) classification performance.

In graph classification, each graph is treated as a data sample, and the goal is to train a classification model on a set of training graphs that can predict the label of an unlabeled graph by leveraging its node features and graph structure. There are numerous real-world applications of graph classification. For example, it can be used to infer whether a protein functions as an enzyme, where proteins are represented as graphs (Dobson and Doig, 2003), and it can be applied to forecast Alzheimer’s disease progression, where individual brains are represented as graphs (Song et al., 2019). In reality, graphs in the same training set can present distinct structural information. Figure 1(a) shows the distribution of the number of nodes for protein graphs in the D&D dataset (Dobson and Doig, 2003), where the number of nodes varies dramatically. We further illustrate two graphs from D&D in Figures 1(b) and 1(c), respectively. These two graphs present very different structural information, such as the number of edges, density and diameter. This naturally raises a question: does varied structural information of graphs in the training set affect GNN-based graph classification performance? To investigate this question, we divide the graphs in D&D into two sets based on the number of nodes, one consisting of graphs with a small number of nodes and the other containing graphs with a large number of nodes. Then, we split each set into a training set and a test set. We train two GCN models (Kipf and Welling, 2016) on the two training sets separately and test their performance on the two test sets. (The GCN model was originally designed for the semi-supervised node classification task; we add a max-pooling layer to generate a graph-level representation for graph classification.) The results are shown in Figure 1(d). The GNN model trained on graphs with a small number of nodes achieves much better performance on test graphs with a small number of nodes than on those with a large number of nodes. A similar observation can be made for the GNN model trained on graphs with a large number of nodes. Thus, varied structural information can affect GNN-based graph classification performance. These preliminary investigations indicate that graphs in the same training set may follow different distributions; in other words, they are non-identically distributed. (We repeated the investigation in more settings, including more datasets and other types of structural information, and made very consistent observations.) In fact, this observation is consistent with existing work. For example, it is evident in (Tillman, 2009) that, due to differences in individual brains, the distribution of brain data can vary remarkably across individuals. However, the majority of existing GNN models assume that graphs in the training set are identically distributed.

In this paper, we propose to design graph neural networks for non-identically distributed graphs. In particular, we focus on two challenges: (a) how to capture the non-identical distributions of graphs; and (b) how to integrate them to build graph neural networks for graph classification. Our attempt to tackle these two challenges leads to a novel graph neural network model for graph classification. The main contributions of this paper are summarised as follows:

  • We introduce a principled approach to model non-identically distributed graphs mathematically;

  • We propose a novel graph neural network framework, Non-IID-GNN, for graph classification, which can learn graph-level representations for independent and non-identically distributed graphs; and

  • We design comprehensive experiments on numerous graph datasets from various domains to verify the effectiveness of the proposed Non-IID-GNN framework.

The rest of the paper is organized as follows. In Section 2, we briefly review related work. We introduce the details of the proposed framework in Section 3. In Section 4, we present experimental results with discussions. We conclude the paper and discuss future work in Section 5.

2. Related Work

Graph Neural Networks have recently drawn great interest due to their strong representation capacity on graph-structured data in many real-world applications (Schomburg et al., 2004; Duvenaud et al., 2015; Borgwardt et al., 2005; Gilmer et al., 2017; Yanardag and Vishwanathan, 2015). Most graph neural networks fit within the framework of “neural message passing” (Gilmer et al., 2017), with the basic idea of updating node representations iteratively by aggregating the representations of neighboring nodes. Generally, graph neural networks can be divided into two categories: spectral approaches and non-spectral approaches. Spectral methods define parameterized filters based on graph spectral theory, using the graph Fourier transform and the graph Laplacian (Bruna et al., 2014; Defferrard et al., 2016; Li et al., 2018; Kipf and Welling, 2017), while non-spectral methods define parameterized filters based on nodes’ spatial relations by aggregating information from neighboring nodes directly (Gao et al., 2018; Hamilton et al., 2017; Niepert et al., 2016; Velickovic et al., 2018). More details on the development of these two approaches can be found in (Wu et al., 2019; Zhou et al., 2018; Defferrard et al., 2016).

Graph neural networks have achieved great success in a wide variety of tasks, including node classification (Kipf and Welling, 2016; Hamilton et al., 2017), link prediction (Schütt et al., 2017; Zhang et al., 2018b; Gao and Ji, 2019) and graph classification (Ying et al., 2018; Ma et al., 2019). In this work, we mainly focus on graph classification. In graph classification, one of the most important steps is to obtain a good graph-level representation from all the node representations. A straightforward way is to summarize the graph representation by globally combining the node representations, for example by summing or averaging them (Duvenaud et al., 2015). Another approach introduces a “virtual node” that is connected to all the nodes and computes its representation with an attention mechanism; this representation is then used as the graph representation (Li et al., 2015). In addition, some approaches use learnable deep neural networks to aggregate features across nodes (Gilmer et al., 2017; Zhang et al., 2018a). DGCNN (Zhang et al., 2018a) first selects a fixed number of nodes for all the graph samples and then applies a convolutional neural network to these nodes to learn a graph representation. The methods introduced above are inherently flat and fail to capture graph structural information when computing the graph representation. Recently, some works (Defferrard et al., 2016; Fey et al., 2018; Simonovsky and Komodakis, 2017) have investigated learning hierarchical graph representations by leveraging deterministic graph clustering algorithms, which typically first group nodes into supernodes in each layer, then learn the supernode representations layer by layer, and finally learn the graph representation from these supernode representations. There also exist end-to-end models that learn hierarchical graph representations, such as DiffPool (Ying et al., 2018), which introduces a differentiable module to softly assign nodes to different clusters. Furthermore, some methods (Gao and Ji, 2019; Lee et al., 2019) propose principles for selecting the most important nodes to form a reduced-size graph in each network layer. A recently proposed method, ASAP (Ranjan et al., 2019), learns a soft cluster assignment based on a self-attention mechanism and selects the most important clusters to form the pooled graph. In addition, EigenPooling (Ma et al., 2019) is a recently proposed graph pooling method from the spectral view, which introduces a pooling operation based on the graph Fourier transform and is able to capture local structural information.

Figure 2. An overview of the proposed Non-IID graph neural network.

3. The Proposed Framework

The majority of traditional graph neural networks assume that graphs in the same training data are identically distributed and thus they train a unified GNN model for all graphs. However, our preliminary study suggests that graphs in the same training data are not identically distributed. Therefore, in this section, we detail the proposed framework Non-IID-GNN that has been designed for non-identically distributed graphs.

3.1. The Overall Design

The results in Figure 1(d) indicate that we should not build a unified model for non-identically distributed graphs. This observation naturally motivates us to develop distinct GNN models for graphs with different distributions. However, we face tremendous challenges. First, we have no explicit knowledge about the underlying distributions of graphs. Second, if we separately train different models for different graphs, we have to split the training graphs among the models, and the training data for each model could then be very limited. For example, in the extreme case where one graph has a unique distribution, we only have one training sample for the corresponding model. Third, even if we can train distinct GNN models well for Non-IID graphs, which trained model should we adopt to make the prediction for an unlabelled graph during the test stage?

In this work, we propose a Non-IID graph neural network framework, i.e., Non-IID-GNN, which can tackle the aforementioned challenges simultaneously. An overview of the architecture of Non-IID-GNN is shown in Figure 2. The basic idea of Non-IID-GNN is as follows: for a graph sample $G_i$, it approximates the distribution information of $G_i$ by applying an adaptor network to its structural information; the output serves as the adaptor parameters that adapt each GNN block for $G_i$, and the adapted GNN model can be viewed as a graph classification model specific to $G_i$. The underlying distribution of a given sample may have different influences on different GNN blocks. Thus, for each GNN block, we introduce one adaptor network. To solve the first challenge of having no knowledge about the underlying distribution, we develop an adaptor network to approximate the distribution information of a graph through its observed structural information. To tackle the second challenge, we jointly learn a specific model for each graph, and these models share the same set of learnable parameters, namely those of the adaptor networks and the GNN blocks. With this design, the third challenge is addressed automatically: given an unlabeled graph $G_j$, the trained Non-IID-GNN generates an adapted GNN model to predict its label. Next, we introduce the details of the adaptor network and the adapted graph neural network for each graph, and then describe how to train the framework and how to use it at test time.

3.2. The Adaptor Network

The goal of the adaptor network is to approximate the distribution information of a given graph, since we do not have knowledge about the underlying distribution of the graph. In particular, we use the graph structure information as input to the adaptor network to achieve this goal. The intuition is that the structural differences between graphs come from their different distributions; thus, we want to estimate the distribution from the observed structural information via a powerful adaptor network. Graph neural networks often consist of several consecutive filtering and pooling layers, which can be viewed as different blocks of the graph neural network model. As mentioned before, the distribution of a graph may influence each GNN block differently. Thus, we build an adaptor network to generate adaptor parameters for each block of the graph neural network. We first extract a vector $s_i$ to denote the structural information of a given graph sample $G_i$. We discuss $s_i$ in more detail in the experiment section. As shown in the left part of Figure 2, the adaptor networks take the graph structure information of $G_i$ as input and generate the adaptation parameters for each block of the graph neural network model. Assuming we have $K$ blocks in the graph neural network, we have $K$ independent adaptor networks. Note that these adaptor networks share the same input, while their outputs can differ because the model parameters of different blocks in the graph neural network can differ. Specifically, the adaptor network that learns the adaptor parameters for the $k$-th block of the graph neural network can be expressed as follows:

$$\phi_i^{k} = h_k(s_i; \Omega_k), \tag{1}$$

where $\Omega_k$ denotes the parameters of the $k$-th adaptor network and $\phi_i^{k}$ denotes its output, which will be used to adapt the $k$-th learning block of the graph neural network model. The adaptor $h_k(\cdot)$ can be modeled using any function. In this work, we use feed-forward neural networks because of their strong capability to approximate functions. For convenience, we summarize the process of the $K$ adaptor networks for $G_i$ as follows:

$$\Phi_i = H(s_i; \Omega_H), \tag{2}$$

where $\Phi_i = \{\phi_i^{1}, \dots, \phi_i^{K}\}$ contains the generated adaptation parameters of the graph sample $G_i$ for all the GNN blocks and $\Omega_H = \{\Omega_1, \dots, \Omega_K\}$ denotes the parameters of the adaptor networks.
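As a minimal sketch of the adaptor networks described above, the following Python snippet builds one small feed-forward network per GNN block and feeds the same structural vector to all of them. The hidden size, the per-block output sizes, the initialization and the helper names (`make_mlp`, `run_adaptors`) are assumptions made for illustration; they are not taken from the paper.

```python
import random

def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Create the parameters of a one-hidden-layer feed-forward network."""
    def init(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W1": init(in_dim, hidden_dim), "b1": [0.0] * hidden_dim,
            "W2": init(hidden_dim, out_dim), "b2": [0.0] * out_dim}

def affine(x, W, b):
    """Compute x @ W + b for a vector x and a matrix W stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def mlp_forward(params, x):
    """Forward pass: affine map, ReLU, affine map."""
    hidden = [max(0.0, v) for v in affine(x, params["W1"], params["b1"])]
    return affine(hidden, params["W2"], params["b2"])

def run_adaptors(adaptors, s):
    """Feed the same structural vector s to every block's adaptor network."""
    return [mlp_forward(params, s) for params in adaptors]

rng = random.Random(0)
out_dims = [8, 8, 4]             # hypothetical adaptor-output sizes, one per GNN block
adaptors = [make_mlp(3, 16, d, rng) for d in out_dims]
s = [0.31, 0.45, 0.02]           # hypothetical (rescaled) structural vector of one graph
phi = run_adaptors(adaptors, s)  # one adaptor-parameter vector per block
```

All adaptor networks share the input but keep separate weights, so each block can receive adaptor parameters of a different size.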

3.3. The Adapted Graph Neural Network

Any existing graph neural network model can be adapted by the Non-IID-GNN framework to generate sample-specific models based on the sample’s graph structure information. Therefore, we first generally introduce the GNN model for graph classification and describe how it can be adapted to a specific given sample. Then, we use two concrete examples to illustrate how to adapt a specific GNN model.

3.3.1. A General Adapted Framework

Figure 3. An overview of the GNN framework

A typical GNN framework for graph classification usually contains two types of layers, i.e., the filtering layer and the pooling layer. The filtering layer takes the graph structure and node representations as input and generates refined node representations as output. The pooling layer takes the graph structure and node representations as input and produces a coarsened graph with a new graph structure and new node representations. An overview of a general GNN framework for graph classification is shown in Figure 3, where we have, in total, $N_p$ pooling layers, each of which follows $N_f$ stacked filtering layers. Hence, there are, in total, $(N_f + 1) N_p$ learning blocks in this GNN framework. A graph-level representation can be obtained from these layers and further used to perform the prediction. Given a graph sample $G_i$, we need to adapt each of the layers according to its distribution information identified by the adaptor network to generate a GNN model specific to $G_i$. Next, we first introduce the formulation of the filtering layer and the pooling layer, and then describe how they can be adapted with the adaptation parameters learned by the adaptor networks.

Without loss of generality, when introducing a filtering layer or a pooling layer, we use an adjacency matrix $A \in \mathbb{R}^{n \times n}$ and node representations $F \in \mathbb{R}^{n \times d}$ to denote the input of these layers, where $n$ is the number of nodes and $d$ is the dimension of node features. Then, the operation of a filtering layer can be described as follows:

$$F' = f(A, F; \theta_f), \tag{3}$$

where $\theta_f$ denotes the parameters in the filtering layer and $F' \in \mathbb{R}^{n \times d'}$ denotes the refined node representations with dimension $d'$ generated by the filtering layer. Assuming $\phi_f$ is the corresponding adaptor parameter for this filtering layer, we adapt the model parameters of this filtering layer as follows:

$$\hat{\theta}_f = \theta_f \diamond \phi_f, \tag{4}$$

where $\hat{\theta}_f$ denotes the adapted model parameters, which have exactly the same shape as the original model parameters $\theta_f$, and $\diamond$ is the adaptation operator. The adaptation operator can have various designs, which can be chosen according to the specific GNN model. We provide the details of the adaptation operator when we introduce concrete examples in the following subsections. Then, with the adapted model parameters, we can define the adapted filtering layer as follows:

$$F' = f(A, F; \hat{\theta}_f). \tag{5}$$
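The process of a pooling layer can be described as follows:

$$A_p, F_p = p(A, F; \theta_p), \tag{6}$$

where $\theta_p$ denotes the parameters of the pooling layer, $A_p \in \mathbb{R}^{n_p \times n_p}$ with $n_p < n$ is the adjacency matrix of the newly generated coarsened graph and $F_p \in \mathbb{R}^{n_p \times d_p}$ denotes the learned node representations of the coarsened graph. Similarly, we adapt the model parameters of the pooling layer as follows:

$$\hat{\theta}_p = \theta_p \diamond \phi_p, \tag{7}$$

which leads to the following adapted pooling layer:

$$A_p, F_p = p(A, F; \hat{\theta}_p), \tag{8}$$

where $\phi_p$ denotes the adaptation parameters generated by the adaptor network for this pooling layer.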

To summarize, given a graph sample $G_i$, its specific adaptor parameters $\Phi_i$ learned by the adaptor networks, and a GNN framework with the model parameters of all layers summarized in $\Theta$, we can generate an adapted GNN specific to the sample $G_i$ as $\mathrm{GNN}(\cdot\,; \Theta \diamond \Phi_i)$. Here, we summarize the layer-wise adaptation operation using $\Theta \diamond \Phi_i$. Next, we use GCN (Kipf and Welling, 2016) and Diffpool (Ying et al., 2018) as examples to illustrate how to adapt a specific GNN model.

3.3.2. Adapted GCN: Non-IID-GCN

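Graph Convolutional Network (GCN) (Kipf and Welling, 2016) was originally proposed for the semi-supervised node classification task. The filtering layer in GCN is defined as follows:

$$F' = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} F W\big), \tag{9}$$

where $\tilde{A} = A + I$ represents the adjacency matrix with self-loops, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $W \in \mathbb{R}^{d \times d'}$ denotes the trainable weight matrix of the filtering layer and $\sigma(\cdot)$ is a nonlinear activation function. With the adaptation parameter $\phi_f$ for the corresponding filtering layer, the adapted filtering layer can be represented as follows:

$$F' = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} F (W \diamond \phi_f)\big). \tag{10}$$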
Specifically, we adopt FiLM (Perez et al., 2018) as the adaptation operator. In this case, the adaptor parameter $\phi_f$ is a vector that we split into two parts of equal size, $\gamma_f$ and $\beta_f$, and the adaptation operation can be expressed as follows:

$$W \diamond \phi_f = br(\gamma_f) \odot W + br(\beta_f), \tag{11}$$

where $br(\cdot)$ is a broadcasting function that repeats a vector so that $br(\gamma_f)$ and $br(\beta_f)$ have the same shape as $W$, and $\odot$ denotes the element-wise multiplication between two matrices.
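The FiLM-style adaptation of a weight matrix can be sketched in a few lines of Python. This is an illustrative sketch only: the names `split_adaptor_output` and `film_adapt` are hypothetical, and the choice to broadcast the scale and shift vectors across rows (one value per output feature) is an assumption, since the broadcast axis is not recoverable from the text here.

```python
def split_adaptor_output(phi):
    """Split the adaptor output into a scale part (gamma) and a shift part (beta)."""
    half = len(phi) // 2
    return phi[:half], phi[half:]

def film_adapt(W, gamma, beta):
    """Return W' with W'[r][c] = gamma[c] * W[r][c] + beta[c].

    Assumption: gamma and beta hold one value per column of W and are
    repeated over all rows so that they match the shape of W.
    """
    assert all(len(row) == len(gamma) == len(beta) for row in W)
    return [[gamma[c] * row[c] + beta[c] for c in range(len(row))] for row in W]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 input features, 2 output features
phi = [2.0, 0.5, 1.0, -1.0]               # hypothetical adaptor output of length 2 * 2
gamma, beta = split_adaptor_output(phi)
W_adapted = film_adapt(W, gamma, beta)
# W_adapted == [[3.0, 0.0], [7.0, 1.0], [11.0, 2.0]]
```

The adapted matrix has the same shape as the original one, so it can be dropped into the filtering layer without changing anything else in the model.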

To use GCN for graph classification, we introduce a node-wise max-pooling layer to generate the graph representation from the node representations as follows:

$$g = \mathrm{maxpool}(F), \tag{12}$$

where $g$ denotes the graph-level representation and $\mathrm{maxpool}(\cdot)$ takes the maximum over all the nodes. Note that the max-pooling operation does not involve learnable parameters, and thus no adaptation is needed for it. An adapted GCN framework, which we call Non-IID-GCN, consists of $N_f$ adapted filtering layers followed by a single max-pooling layer that generates the graph-level representation, i.e., $N_p = 1$. The graph representation is then used to make the prediction.
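To make the flat GCN pipeline concrete, here is a small dependency-free Python sketch of one filtering layer followed by node-wise max pooling. It assumes the symmetric normalization of Kipf and Welling with self-loops and a ReLU activation; the toy graph, the features and the matrix `W_adapted` (standing in for an already adapted weight matrix) are made-up values for illustration.

```python
import math

def matmul(X, Y):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(A_hat, F, W):
    """One filtering layer: ReLU(A_hat @ F @ W)."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(A_hat, F), W)]

def max_pool(F):
    """Node-wise max pooling: the maximum of each feature over all nodes."""
    return [max(column) for column in zip(*F)]

A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]     # toy path graph with three nodes
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # two input features per node
W_adapted = [[1.0, -1.0], [0.5, 2.0]]     # stands in for an adapted weight matrix
A_hat = normalize_adjacency(A)
H = gcn_layer(A_hat, F, W_adapted)        # refined node representations
g = max_pool(H)                           # graph-level representation
```

Because max pooling has no parameters, only the weight matrix passed to `gcn_layer` changes from one graph sample to another.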

3.3.3. Adapted Diffpool: Non-IID-Diffpool

Diffpool is a hierarchical graph-level representation learning method for graph classification (Ying et al., 2018). The filtering layer in Diffpool is the same as eq. (9), and its corresponding adapted version is shown in eq. (10). Its pooling layer is defined as follows:

$$S = \mathrm{softmax}\big(f_p(A, F; \theta_p)\big), \tag{13}$$

$$F_p = S^{\top} F, \tag{14}$$

$$A_p = S^{\top} A S, \tag{15}$$

where $f_p$ is a filtering layer embedded in the pooling layer and $S$ is a soft-assignment matrix, which softly assigns each node to a supernode to generate a coarsened graph. Specifically, the structure and the node representations of the coarsened graph are generated by eq. (15) and eq. (14), respectively, where $F$ is the output of the filtering layers that precede the pooling layer. To adapt the pooling layer, we only need to adapt eq. (13), which is done in the same way as in eq. (10) because it is also a filtering layer. The adapted Diffpool model, which we call Non-IID-Diffpool, can thus be determined by choosing proper values for $N_p$ and $N_f$, with all the filtering layers and pooling layers adapted.
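The coarsening step of the Diffpool pooling layer can be sketched as follows. To keep the example short, the output of the embedded filtering layer is passed in directly as a score matrix `Z` (an assumption for illustration; in the model it would come from an adapted filtering layer), and all numeric values are made up.

```python
import math

def matmul(X, Y):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(column) for column in zip(*X)]

def row_softmax(Z):
    """Apply a numerically stable softmax to every row of Z."""
    result = []
    for row in Z:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

def diffpool(A, F, Z):
    """Coarsen a graph: S = softmax(Z), F_p = S^T F, A_p = S^T A S."""
    S = row_softmax(Z)
    S_T = transpose(S)
    F_p = matmul(S_T, F)
    A_p = matmul(matmul(S_T, A), S)
    return A_p, F_p, S

A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # toy path graph
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                 # node representations
Z = [[2.0, 0.0], [0.5, 0.5], [0.0, 2.0]]                 # hypothetical assignment scores
A_p, F_p, S = diffpool(A, F, Z)                          # coarsened graph with 2 supernodes
```

Each row of `S` sums to one, so the total feature mass and the total edge weight are preserved by the coarsening.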

3.4. Training and Test

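Given a graph sample $G_i$ with adjacency matrix $A_i$ and feature matrix $F_i$, the Non-IID-GNN framework performs the classification as follows:

$$\hat{y}_i = \mathrm{GNN}\big(A_i, F_i; \Theta \diamond H(s_i; \Omega_H)\big), \tag{16}$$

where $s_i$ is the structural information vector of $G_i$, $H(\cdot\,; \Omega_H)$ denotes the adaptor networks with parameters $\Omega_H$, $\Theta$ denotes the shared GNN parameters and $\diamond$ is the adaptation operator. During training, we are given a set of graphs $\{G_i\}_{i=1}^{N}$ as training samples, where each graph $G_i$ is associated with a ground-truth label $y_i$. Then, the objective function of Non-IID-GNN can be represented as follows:

$$\min_{\Theta, \Omega_H} \sum_{i=1}^{N} \ell(\hat{y}_i, y_i), \tag{17}$$

where $N$ is the number of training samples and $\ell(\cdot, \cdot)$ is a loss function that measures the difference between the predicted labels and the ground-truth labels. In this work, we use cross-entropy as the loss function and adopt ADAM (Kingma and Ba, 2014) to optimize the objective.

During the test phase, the label of a given sample $G_j$ can be inferred using eq. (16), i.e., $\hat{y}_j = \mathrm{GNN}\big(A_j, F_j; \Theta \diamond H(s_j; \Omega_H)\big)$. Specifically, the graph structural information of the sample is first used as the input of the adaptor networks to identify its distribution information, which is then used to adapt the shared model parameters $\Theta$ to generate a sample-specific model $\mathrm{GNN}(\cdot\,; \Theta \diamond H(s_j; \Omega_H))$. This sample-specific model finally performs the classification for this sample.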
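The loss and the test-time decision can be illustrated with a short Python sketch. It assumes that the adapted GNN outputs one vector of unnormalized class scores (logits) per graph and that the objective sums the cross-entropy over the training graphs; the scores and labels below are made up, and the ADAM optimization step itself is not shown.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(softmax(logits)[label])

def objective(all_logits, labels):
    """Sum of the per-graph cross-entropy losses over the training set."""
    return sum(cross_entropy(z, y) for z, y in zip(all_logits, labels))

def predict(logits):
    """Test-time prediction: the class with the highest score."""
    return max(range(len(logits)), key=lambda c: logits[c])

all_logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]  # hypothetical outputs for 3 graphs
labels = [0, 1, 1]
loss = objective(all_logits, labels)
predictions = [predict(z) for z in all_logits]
```

Dividing the sum by the number of training graphs would rescale the loss without changing its minimizer.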

4. Experiments

In this section, we conduct comprehensive experiments to verify the effectiveness of the proposed Non-IID-GNN framework. We first describe the implementation details of the proposed framework. Then, we evaluate the performance of the framework by comparing the original GCN and Diffpool models with their versions adapted by the Non-IID-GNN framework. Next, we analyse the importance of different components of the adaptation operator in the proposed model via an ablation study. Finally, we conduct some case studies to further our understanding of the proposed method.

4.1. Experimental Settings

In this work, we focus on the task of graph classification. In order to demonstrate the effectiveness of the proposed Non-IID-GNN framework, we carried out graph classification tasks on eight datasets from various domains with a variety of representative baselines. Next, we describe the datasets and the baselines.

4.1.1. Datasets

Some major statistics of the eight datasets are shown in Table 1, and more details are given as follows:

  • D&D (Dobson and Doig, 2003) is a dataset of protein structures. Each protein is represented as a graph, where each node represents an amino acid and each edge between two nodes denotes that they are less than 6 Ångstroms apart.

  • ENZYMES (Shervashidze et al., 2011) is a dataset of protein tertiary structures of six classes of enzymes.

  • PROTEINS_full (Borgwardt et al., 2005) is a dataset of protein structures, where each graph represents a protein and each node represents a secondary structure element (SSE) in the protein.

  • NCI1 and NCI109 (Shervashidze et al., 2011) are two datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, provided by the National Cancer Institute (NCI).

  • COLLAB (Yanardag and Vishwanathan, 2015) is a dataset of scientific collaboration networks, which describes the collaboration patterns of researchers from three different research fields.

  • REDDIT-BINARY and REDDIT-MULTI-5K (Yanardag and Vishwanathan, 2015) are two datasets of online discussion threads crawled from different subreddits on Reddit, where each node represents a user and each edge between two user nodes represents an interaction between them.

Dataset            #Graphs   #Classes   #Nodes (avg ± std)
DD                 1,178     2          284.3 ± 272.0
ENZYMES            600       6          32.6 ± 14.9
PROTEINS_full      1,113     2          39.06 ± 45.8
NCI1               4,110     2          29.87 ± 13.5
NCI109             4,127     2          29.68 ± 13.6
COLLAB             5,000     3          74.49 ± 62.3
REDDIT-BINARY      2,000     2          429.63 ± 554.0
REDDIT-MULTI-5K    4,999     5          508.52 ± 452.6

Table 1. The statistics of the eight datasets. #Graphs denotes the number of graphs. #Classes denotes the number of graph classes. #Nodes (avg ± std) denotes the average and standard deviation of the number of nodes among the graphs.

4.1.2. Baselines

The proposed framework is a general framework that can be applied to any graph neural network to improve its performance on graph classification tasks. Here we apply the proposed framework to two graph neural networks: a basic graph convolutional network (GCN) (Kipf and Welling, 2016) and an advanced graph convolutional network with hierarchical pooling, Diffpool (Ying et al., 2018). The corresponding versions adapted by the proposed framework are Non-IID-GCN and Non-IID-Diffpool, respectively. To validate the effectiveness of the proposed model, we compare Non-IID-GCN and Non-IID-Diffpool with GCN and Diffpool on the task of graph classification on eight different datasets. Besides these two baselines, to further demonstrate our model’s capacity to capture the non-identical distributions of graph samples by using graph structural properties, we also develop a baseline method, Multi-GCN, which learns multiple graph convolutional networks for graph samples with different structural information. More details of the baseline methods are as follows:

  • GCN (Kipf and Welling, 2016) is originally proposed for semi-supervised node classification. It consists of a stack of GCN layers, where a new representation of each node is computed via transforming and aggregating node representations of its neighbouring nodes. Finally, a graph representation is generated from node representations in the last GCN layer via a global max-pooling layer, and then used for graph classification.

  • Diffpool (Ying et al., 2018) is a recently proposed method which has achieved state-of-the-art performance on the graph classification task. It proposes a differentiable graph pooling approach to hierarchically generate a graph-level representation by coarsening the input graph level by level.

  • Multi-GCN consists of several GCN models trained on different subsets of the training dataset. As shown in Figure 4, we first cluster the data samples in the training set into different training subsets with the K-means method, based on the graph structural information. Note that in this work, the structural information of a graph includes the number of nodes, the number of edges and the graph density. Then we train a separate model on each training subset. During the test phase, given a test graph sample, we first compute the Euclidean distance between its graph structural properties and the centroids of the different training subsets. Then, we choose the model trained on the closest training subset to predict the label of this graph sample. In this experiment, we set the number of clusters to 2 and 3, and denote the corresponding frameworks as Multi-GCN-2 and Multi-GCN-3.

Figure 4. The framework of Multi-GCN, with three clusters illustrated in the figure. Here we group training samples into three clusters and train a GCN model for each cluster. Given a test sample, we measure the distance between this sample and the centroids of the three clusters and use the model of the closest cluster to perform the prediction for this sample.
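The clustering and routing steps of the Multi-GCN baseline can be sketched as below. This is an illustrative sketch under stated assumptions: the K-means loop uses a simple deterministic initialization (the first k points) and a fixed number of iterations, the feature vectors are made up, and the per-cluster GCN models are represented only by their cluster index.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda c: distance(point, centroids[c]))

def kmeans(points, k, iterations=20):
    """Basic Lloyd iterations; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Hypothetical structural vectors: [number of nodes, number of edges, density]
train_features = [[30.0, 60.0, 0.14], [300.0, 700.0, 0.016],
                  [32.0, 70.0, 0.14], [310.0, 750.0, 0.016]]
centroids = kmeans(train_features, k=2)

# At test time, each graph is routed to the model of its closest cluster.
small_graph = [28.0, 55.0, 0.15]
large_graph = [320.0, 800.0, 0.015]
model_for_small = nearest(small_graph, centroids)
model_for_large = nearest(large_graph, centroids)
```

Since the three features live on very different scales, a practical implementation would probably rescale them before clustering; the text above does not say whether this was done.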
Method             DD        ENZYMES   PROTEINS_full   NCI1      NCI109    COLLAB    REDDIT-BINARY   REDDIT-MULTI-5K
GCN                0.7716    0.5176    0.7662          0.7715    0.7574    0.6986    0.8189          0.5039
Diffpool           0.7823    0.5771    0.7894          0.8017*   0.7718    0.704     0.8972          0.5646*
Multi-GCN-2        0.7435    0.4521    0.7950          0.7743    0.7515    0.699     0.7938          0.5088
Multi-GCN-3        0.7435    0.4604    0.7962*         0.7749    0.7555    0.6815    0.8888          0.470
Non-IID-GCN        0.7931*   0.5592    0.7788          0.7877    0.7705    0.7316    0.9039          0.5293
Non-IID-Diffpool   0.7856    0.5854*   0.7939          0.7932    0.7755*   0.738*    0.9292*         0.5541

Table 2. Comparison of graph classification performance, in terms of accuracy, of six methods on eight datasets. Here we implement two versions of Multi-GCN, Multi-GCN-2 and Multi-GCN-3, where we group training samples into 2 and 3 clusters, respectively. The best accuracy on each dataset is marked with an asterisk.

4.2. Graph Classification Performance Comparison

To conduct the graph classification task, for each graph dataset, we randomly shuffle the dataset and then split it into a training set and a test set. We train all the models on the training set and evaluate their performance on the test set, with accuracy as the measure. We repeat this process multiple times and report the average performance. The GCN/Non-IID-GCN model consists of a stack of filtering layers followed by a single max-pooling layer; all filtering layers share the same hidden dimension, and ReLU (Nair and Hinton, 2010) activation is applied after each filtering layer. For Diffpool/Non-IID-Diffpool, we follow the setting of the original paper (Ying et al., 2018). We adopt a fully-connected network to implement the adaptor network in the Non-IID-GNN frameworks. Its input dimension is the same as the dimension of the graph structural information. Specifically, in this work, we use the number of nodes, the number of edges and the graph density as the graph structural information.
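The structural information used as adaptor input can be computed directly from an adjacency matrix, as in the following sketch. The density formula, 2m / (n(n - 1)) for an undirected graph without self-loops, is the standard definition and is an assumption here, since the text does not spell out which definition is used; the function name `structural_vector` is hypothetical.

```python
def structural_vector(A):
    """Return [number of nodes, number of edges, density] for an undirected graph.

    A is a symmetric 0/1 adjacency matrix without self-loops. The density is
    taken to be 2m / (n(n - 1)), which is an assumption of this sketch.
    """
    n = len(A)
    m = sum(A[i][j] for i in range(n) for j in range(i + 1, n))
    density = 2.0 * m / (n * (n - 1)) if n > 1 else 0.0
    return [float(n), float(m), density]

A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # toy path graph with three nodes
s = structural_vector(A)               # 3 nodes, 2 edges, density 2/3
```

The resulting three-dimensional vector fixes the input dimension of every adaptor network.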

The results of graph classification are shown in Table 2. We can make the following observations from the table:

  • The adapted GCN model, Non-IID-GCN, consistently outperforms the original GCN model on all the datasets. Similarly, Non-IID-Diffpool outperforms the original Diffpool model on six of the eight datasets. These observations indicate that the sample-wise adaptation performed by the Non-IID-GNN framework improves the performance of GNN models. Furthermore, the adaptation in the Non-IID-GNN framework can be applied to both flat GNN models such as GCN and GNN models with a hierarchical pooling scheme such as Diffpool.

  • The Multi-GCN frameworks outperform the original GCN model on some of the datasets, such as PROTEINS and NCI1. They achieve even better performance than Non-IID-GCN on the PROTEINS dataset, which suggests that the Multi-GCN framework is a viable way to deal with non-identical underlying distributions in a graph dataset, especially when the clusters are well formed. However, it is not always easy to find such clusters, and in most cases the clusters we find cannot precisely represent the multiple underlying distributions in the dataset, which might lead to unsatisfactory performance. This is likely why Multi-GCN performs even worse than the original GCN model on some of the datasets, such as ENZYMES and D&D. On the other hand, the proposed Non-IID-GNN framework provides a principled way to identify the distribution information of graph samples and incorporate it into the GNN frameworks, which leads to consistently strong performance on all the datasets.
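The prediction step of the Multi-GCN baseline (route a test sample to the model of its closest cluster) can be sketched as follows. This is only an illustration under stated assumptions: the feature vector used for clustering, the Euclidean distance, and the names `closest_cluster` and `multi_model_predict` are hypothetical, and the per-cluster models are replaced by stand-in functions.

```python
import math
from typing import Callable, Sequence


def closest_cluster(x: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Index of the centroid nearest to x (Euclidean distance is an assumption)."""
    distances = [math.dist(x, c) for c in centroids]
    return distances.index(min(distances))


def multi_model_predict(x: Sequence[float], graph: object,
                        centroids: Sequence[Sequence[float]],
                        models: Sequence[Callable[[object], str]]) -> str:
    """Predict with the model that was trained on the closest cluster."""
    return models[closest_cluster(x, centroids)](graph)


# Hypothetical centroids of three clusters and stand-ins for three trained GCNs.
centroids = [[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]]
models = [lambda g: "model-0", lambda g: "model-1", lambda g: "model-2"]

chosen = multi_model_predict([9.0, 8.0], graph=None, centroids=centroids, models=models)
print(chosen)  # model-1
```

The sketch also makes the weakness discussed above visible: the prediction depends entirely on how well the centroids separate the underlying distributions.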

Methods Datasets
GCN 0.7716 0.5176 0.7662 0.7715 0.7574 0.6986 0.8189 0.5039
Non-IID-GCN 0.781 0.5225 0.7761 0.779 0.7596 0.7084 0.8517 0.5173
Non-IID-GCN 0.7797 0.54 0.7793 0.7877 0.7714 0.7116 0.8878 0.5188
Non-IID-GCN 0.7931 0.5592 0.7788 0.7878 0.7705 0.7316 0.9039 0.5293
Table 3. Comparison of graph classification performance in terms of accuracy between variants of Non-IID-GCN and the original GCN model.

4.3. Ablation Study of Adaptation Operator

In this subsection, we investigate the effectiveness of the different components of the adaptor operator in eq. (11) used in our model. Specifically, we want to investigate whether the element-wise multiplication term and the element-wise addition term both play important roles in the adaptor operator. To achieve this goal, we define the following variants of the Non-IID-GCN framework with different adaptor operators:

  • Non-IID-GCN (multiplication only): This variant uses an adaptor operator with only the element-wise multiplication operation. Instead of using eq. (11) in full, the adaptation process keeps only its multiplicative term.

  • Non-IID-GCN (addition only): This variant uses an adaptor operator with only the element-wise addition operation. Instead of using eq. (11) in full, the adaptation process keeps only its additive term.
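The difference between the full operator and the two variants can be sketched in a few lines. The sketch assumes that eq. (11) has a FiLM-style form, i.e. adapted parameters = gamma ⊙ theta + beta, where theta are the shared parameters and gamma and beta are produced by the adaptor network; the symbol names and the function `adapt` are illustrative assumptions, not the paper's notation.

```python
from typing import List


def adapt(theta: List[float], gamma: List[float], beta: List[float],
          use_mul: bool = True, use_add: bool = True) -> List[float]:
    """Element-wise adaptation of shared parameters theta.

    Assumed full form: gamma * theta + beta (element-wise). Disabling one
    term gives the multiplication-only or addition-only variant.
    """
    adapted = []
    for t, g, b in zip(theta, gamma, beta):
        value = g * t if use_mul else t
        if use_add:
            value += b
        adapted.append(value)
    return adapted


theta = [1.0, -2.0, 0.5]  # shared model parameters (toy values)
gamma = [2.0, 1.0, 0.0]   # hypothetical multiplicative adaptation
beta = [0.5, 0.5, 0.5]    # hypothetical additive adaptation

full = adapt(theta, gamma, beta)                     # [2.5, -1.5, 0.5]
mul_only = adapt(theta, gamma, beta, use_add=False)  # [2.0, -2.0, 0.0]
add_only = adapt(theta, gamma, beta, use_mul=False)  # [1.5, -1.5, 1.0]
print(full, mul_only, add_only)
```

Under this assumed form, the multiplicative term rescales each shared parameter while the additive term shifts it, which is why the two terms can plausibly capture different aspects of the adaptation.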

(a) Embeddings with node size
(b) Embeddings with edge size
(c) Embeddings with graph density
(d) Sample demonstration
Figure 5. Case study on the D&D dataset: (a), (b) and (c) show the model embeddings for different graph samples in D&D, colored by the number of nodes, the number of edges and the graph density, respectively, where the red triangle denotes the basic GCN model shared among different samples. (d) shows the model embeddings for seven graph samples (three marked with green stars, with fewer than 100 nodes, and four marked with blue dots, with more than 300 nodes), which are misclassified by a unified GCN but correctly classified by the proposed Non-IID-GCN.

Following the previous experimental setting, we investigated the performance of the multiplication-only and addition-only variants of Non-IID-GCN on the graph classification task on eight datasets. We compared their results with those of Non-IID-GCN with the full adaptor operator as described in eq. (11). The results are presented in Table 3. From the table, we can observe that both variants outperform the original GCN model, which indicates that both the multiplicative term and the additive term are effective for the adaptation, and that using either one of them can already adapt the original model in a reasonable manner. On the other hand, the full Non-IID-GCN model outperforms both variants on most of the datasets, which suggests that the adaptation effects of the two terms are complementary and that combining them during the adaptation process can further enhance the performance.
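The comparisons in this paragraph can be checked directly against the numbers in Table 3. The short sketch below counts, for each pair of rows, on how many of the eight datasets the first row has strictly higher accuracy. The keys `variant_1` and `variant_2` are hypothetical labels for the second and third rows of the table, in the order they appear; the helper `wins` is likewise only illustrative.

```python
from typing import Dict, List

# Accuracies copied from Table 3, one value per dataset, in table order.
table3: Dict[str, List[float]] = {
    "GCN":       [0.7716, 0.5176, 0.7662, 0.7715, 0.7574, 0.6986, 0.8189, 0.5039],
    "variant_1": [0.781, 0.5225, 0.7761, 0.779, 0.7596, 0.7084, 0.8517, 0.5173],
    "variant_2": [0.7797, 0.54, 0.7793, 0.7877, 0.7714, 0.7116, 0.8878, 0.5188],
    "full":      [0.7931, 0.5592, 0.7788, 0.7878, 0.7705, 0.7316, 0.9039, 0.5293],
}


def wins(a: List[float], b: List[float]) -> int:
    """Number of datasets on which row a is strictly better than row b."""
    return sum(x > y for x, y in zip(a, b))


print(wins(table3["variant_1"], table3["GCN"]))   # 8
print(wins(table3["variant_2"], table3["GCN"]))   # 8
print(wins(table3["full"], table3["variant_1"]))  # 8
print(wins(table3["full"], table3["variant_2"]))  # 6
```

Both single-term variants beat the original GCN on all eight datasets, and the full operator beats the two variants on eight and six datasets respectively, which is consistent with the statement that it is better on most, but not all, of the datasets.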

4.4. Case Study

To further illustrate the effectiveness of the proposed framework, we conducted some exploratory experiments on the D&D dataset. First, we visualize the distribution of embeddings of sample-specific model parameters for different graph samples. Specifically, we take the adapted parameters (i.e., the adapted parameters in eq. (11)) of the first filtering layer of each sample-specific Non-IID-GCN model and then use t-SNE (Maaten and Hinton, 2008) to project these parameters to 2-dimensional embeddings. We visualize these 2-d embeddings as scatter plots in Figure 5(a), Figure 5(b) and Figure 5(c). Note that in these three figures, the red triangle denotes the embedding of the parameters of the original GCN model. For each point in these three figures, we use color to represent the scale of values in terms of node size, edge size and density for Figure 5(a), Figure 5(b) and Figure 5(c), respectively. Specifically, a deeper red color indicates a larger value, while a deeper blue color indicates a smaller value.

We make two observations from these three figures. First, the proposed Non-IID-GCN framework indeed generates different models for different graph samples, and they differ from the original model, which is denoted by the red triangle in each figure. Second, points with similar colors lie close to each other in all three figures, which means that graph samples with similar structural information (the number of nodes, the number of edges and graph density) share similar models. This indicates that the proposed framework is able to approximate the underlying distribution information from the graph structure and incorporate this information during the adaptation process.

In addition, in Figure 5(d), we illustrate the sample-specific model parameters for seven samples with different numbers of nodes, which are misclassified by the original, unified GCN model but correctly classified by the proposed Non-IID-GCN framework. Non-IID-GCN has generated seven different GCN models for these graph samples, each of which correctly predicts the label of the corresponding sample.

5. Conclusion

In this paper, we propose a general graph neural network framework, Non-IID-GNN, to deal with graphs that are non-identically distributed. Given a graph sample, the Non-IID-GNN framework approximates its underlying distribution information from its structural information. The framework can then adapt any existing GNN-based graph classification model to generate a specific model for this sample, which is used to predict the label of this sample. Comprehensive experiments demonstrated that the Non-IID-GNN framework can effectively adapt both flat GNN models and hierarchical GNN models to enhance their performance. An interesting future direction is to better infer the underlying distribution given a graph sample. Instead of using hand-engineered graph properties to approximate the underlying distribution information of a given sample, we can design more sophisticated algorithms to achieve this goal.

References

  • K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H. Kriegel (2005) Protein function prediction via graph kernels. Bioinformatics 21 (suppl_1), pp. i47–i56. Cited by: §1, §2, 3rd item.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2014) Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Cited by: §2.
  • H. Dai, B. Dai, and L. Song (2016) Discriminative embeddings of latent variable models for structured data. In International conference on machine learning, pp. 2702–2711. Cited by: §1.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §2, §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • P. D. Dobson and A. J. Doig (2003) Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology 330 (4), pp. 771–783. Cited by: §1, §1, 1st item.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §1, §2, §2.
  • M. Fey, J. Eric Lenssen, F. Weichert, and H. Müller (2018) SplineCNN: fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 869–877. Cited by: §2.
  • H. Gao and S. Ji (2019) Graph u-nets. arXiv preprint arXiv:1905.05178. Cited by: §1, §2.
  • H. Gao, Z. Wang, and S. Ji (2018) Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1416–1424. Cited by: §2.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §1, §2, §2.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §1, §2, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.4.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §1, §2, §3.3.1, §3.3.2, 1st item, §4.1.2.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §2.
  • J. Lee, I. Lee, and J. Kang (2019) Self-attention graph pooling. arXiv preprint arXiv:1904.08082. Cited by: §2.
  • R. Li, S. Wang, F. Zhu, and J. Huang (2018) Adaptive graph convolutional neural networks. In Thirty-second AAAI conference on artificial intelligence, Cited by: §2.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.
  • Y. Ma, S. Wang, C. C. Aggarwal, and J. Tang (2019) Graph convolutional networks with eigenpooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 723–731. Cited by: §1, §2.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §4.4.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §4.2.
  • A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan (2019) Speech recognition using deep neural networks: a systematic review. IEEE Access 7, pp. 19143–19165. Cited by: §1.
  • M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023. Cited by: §2.
  • E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) Film: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §3.3.2.
  • E. Ranjan, S. Sanyal, and P. P. Talukdar (2019) ASAP: adaptive structure aware pooling for learning hierarchical graph representations. arXiv preprint arXiv:1911.07979. Cited by: §2.
  • I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg (2004) BRENDA, the enzyme database: updates and major new developments. Nucleic acids research 32 (suppl_1), pp. D431–D433. Cited by: §1, §2.
  • K. Schütt, P. Kindermans, H. E. S. Felix, S. Chmiela, A. Tkatchenko, and K. Müller (2017) Schnet: a continuous-filter convolutional neural network for modeling quantum interactions. In Advances in neural information processing systems, pp. 991–1001. Cited by: §1, §2.
  • N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt (2011) Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §1, 2nd item, 4th item.
  • M. Simonovsky and N. Komodakis (2017) Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3693–3702. Cited by: §2.
  • T. Song, S. R. Chowdhury, F. Yang, H. Jacobs, G. El Fakhri, Q. Li, K. Johnson, and J. Dutta (2019) Graph convolutional neural networks for alzheimer’s disease classification. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 414–417. Cited by: §1.
  • R. E. Tillman (2009) Structure learning with independent non-identically distributed data. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1041–1048. Cited by: §1.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §2.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §2.
  • P. Yanardag and S. Vishwanathan (2015) Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. Cited by: §1, §2, 5th item, 6th item.
  • Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In Advances in neural information processing systems, pp. 4800–4810. Cited by: §1, §2, §3.3.1, §3.3.3, 2nd item, §4.1.2, §4.2.
  • M. Zhang, Z. Cui, M. Neumann, and Y. Chen (2018a) An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
  • Z. Zhang, H. Yang, J. Bu, S. Zhou, P. Yu, J. Zhang, M. Ester, and C. Wang (2018b) ANRL: attributed network representation learning via deep neural networks.. In IJCAI, Vol. 18, pp. 3155–3161. Cited by: §1, §2.
  • J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434. Cited by: §2.