1. Introduction
Graphs are natural representations for many real-world data such as social networks (Yanardag and Vishwanathan, 2015; Hamilton et al., 2017; Kipf and Welling, 2016; Veličković et al., 2017), biological networks (Schomburg et al., 2004; Borgwardt et al., 2005; Dobson and Doig, 2003; Shervashidze et al., 2011) and chemical molecules (Duvenaud et al., 2015; Gilmer et al., 2017; Dai et al., 2016). A crucial step in performing downstream tasks on graph data is to learn good representations. Deep neural networks have demonstrated great capabilities in representation learning for Euclidean data and thus have advanced numerous fields including speech recognition (Nassif et al., 2019), computer vision (He et al., 2016) and natural language processing (Devlin et al., 2018). However, they cannot be directly applied to graph data due to its complex topological structure. Recently, Graph Neural Networks (GNNs) have generalized deep neural networks to graph data; they typically perform transforming, propagating and aggregating node features across the graph. They have boosted the performance of many graph-related tasks such as node classification (Kipf and Welling, 2016; Hamilton et al., 2017), link prediction (Schütt et al., 2017; Zhang et al., 2018b; Gao and Ji, 2019) and graph classification (Ying et al., 2018; Ma et al., 2019). In this work, we aim to advance Graph Neural Networks for graph classification.

In graph classification, each graph is treated as a data sample, and the goal is to train a classification model on a set of training graphs that can predict the label of an unlabeled graph by leveraging its associated node features and graph structure. Graph classification has numerous real-world applications. For example, it can be used to infer whether a protein functions as an enzyme, where proteins are denoted as graphs (Dobson and Doig, 2003); and it can be applied to forecast Alzheimer's disease progression, where individual brains are represented as graphs (Song et al., 2019). In reality, graphs in the same training set can present distinct structural information. Figure 1(a) demonstrates the distribution of the number of nodes for protein graphs in the D&D dataset (Dobson and Doig, 2003), where the number of nodes varies dramatically across graphs. We further illustrate two graphs from D&D in Figures 1(b) and 1(c), respectively. It can be observed that these two graphs present very different structural information such as the number of edges, density and diameter. This naturally raises a question: does the varied structural information of graphs in the training set affect GNN-based graph classification performance?
To investigate this question, we divide the graphs in D&D into two sets based on the number of nodes: one consisting of graphs with a small number of nodes and the other containing graphs with a large number of nodes. Then, we split each set into a training set and a test set. We train two GCN models¹ (Kipf and Welling, 2016) on the two training sets separately, and test their performance on the two test sets. The results are shown in Figure 1(d). The GNN model trained on graphs with a small number of nodes achieves much better performance on the test graphs with a small number of nodes than on those with a large number of nodes. Similar observations can be made for the GNN model trained on graphs with a large number of nodes. Thus, varied structural information can impact GNN-based graph classification performance. These preliminary investigations indicate that graphs in the same training set may follow different distributions². In other words, they are not identically distributed. In fact, this observation is consistent with existing work. For example, it is evident in (Tillman, 2009) that due to differences in individual brains, the distribution of brain data can vary remarkably across individuals. However, the majority of existing GNN models assume that graphs in the training set are identically distributed.

¹The GCN model was originally designed for the semi-supervised node classification task; we include a max-pooling layer to generate a graph-level representation for the graph classification task.
²Note that we have repeated the investigation with more settings, such as more datasets and other types of structural information, and we make very consistent observations.

In this paper, we propose to design graph neural networks for non-identically distributed graphs. In particular, we target two challenges: (a) how to capture the non-identical distributions of graphs; and (b) how to integrate them to build graph neural networks for graph classification. Our attempt to tackle these two challenges leads to a novel graph neural network model for graph classification. The main contributions of this paper are summarized as follows:

We introduce a principled approach to mathematically model non-identically distributed graphs;

We propose a novel graph neural network framework, NonIID-GNN, for graph classification, which can learn graph-level representations for independent but non-identically distributed graphs; and

We design comprehensive experiments on numerous graph datasets from various domains to verify the effectiveness of the proposed NonIID-GNN framework.
The rest of the paper is organized as follows. In Section 2, we briefly review related work. We introduce the details of the proposed framework in Section 3. In Section 4, we present experimental results with discussions. We conclude the paper with future work in Section 5.
2. Related Work
Graph Neural Networks have recently drawn great interest due to their strong representational capacity on graph-structured data in many real-world applications (Schomburg et al., 2004; Duvenaud et al., 2015; Borgwardt et al., 2005; Gilmer et al., 2017; Yanardag and Vishwanathan, 2015). Most graph neural networks fit within the framework of "neural message passing" (Gilmer et al., 2017), whose basic idea is to update node representations iteratively by aggregating the representations of neighboring nodes. Generally, graph neural networks can be divided into two categories: spectral approaches and non-spectral approaches. Spectral methods define parameterized filters based on graph spectral theory, using the graph Fourier transform and the graph Laplacian (Bruna et al., 2014; Defferrard et al., 2016; Li et al., 2018; Kipf and Welling, 2017), while non-spectral methods define parameterized filters based on nodes' spatial relations by aggregating information from neighboring nodes directly (Gao et al., 2018; Hamilton et al., 2017; Niepert et al., 2016; Velickovic et al., 2018). More details on the development of these two families can be found in (Wu et al., 2019; Zhou et al., 2018; Defferrard et al., 2016).

Graph neural networks have achieved great success in a wide variety of tasks including node classification (Kipf and Welling, 2016; Hamilton et al., 2017), link prediction (Schütt et al., 2017; Zhang et al., 2018b; Gao and Ji, 2019) and graph classification (Ying et al., 2018; Ma et al., 2019). In this work, we mainly focus on graph classification. In graph classification, one of the most important steps is to obtain a good graph-level representation from all the node representations. A straightforward way is to summarize the graph representation by globally combining the node representations, such as summing or averaging all the node representations (Duvenaud et al., 2015). Another approach introduces a "virtual node" connected to all nodes and computes its representation with an attention mechanism, using it as the graph representation (Li et al., 2015). In addition, some approaches use learnable deep neural networks to aggregate features among different nodes (Gilmer et al., 2017; Zhang et al., 2018a). DGCNN (Zhang et al., 2018a) first selects a fixed number of nodes for every graph sample and then applies a convolutional neural network to these nodes to learn a graph representation. The methods introduced above are inherently flat and fail to capture graph structural information when computing the graph representation. Recently, some works (Defferrard et al., 2016; Fey et al., 2018; Simonovsky and Komodakis, 2017) investigate learning hierarchical graph representations by leveraging deterministic graph clustering algorithms: they typically first group nodes into supernodes in each layer, then learn the supernode representations layer by layer, and finally learn the graph representation from these supernode representations. There also exist end-to-end models that learn hierarchical graph representations, such as DiffPool (Ying et al., 2018), which introduces a differentiable module to softly assign nodes to different clusters. Furthermore, some methods (Gao and Ji, 2019; Lee et al., 2019) propose principles to select the most important nodes to form a reduced-size graph in each network layer. A recently proposed method, ASAP (Ranjan et al., 2019), learns a soft cluster assignment based on a self-attention mechanism and selects the most important clusters to form the pooled graph. In addition, EigenPooling (Ma et al., 2019) is a lately proposed graph pooling method from the spectral view, which introduces a pooling operation based on the graph Fourier transform and is able to capture local structural information.

3. The Proposed Framework
The majority of traditional graph neural networks assume that graphs in the same training data are identically distributed, and thus they train a unified GNN model for all graphs. However, our preliminary study suggests that graphs in the same training data are not identically distributed. Therefore, in this section, we detail the proposed framework, NonIID-GNN, which is designed for non-identically distributed graphs.
3.1. The Overall Design
The results in Figure 1(d) indicate that we should not build a unified model for non-identically distributed graphs. This observation naturally motivates us to develop distinct GNN models for graphs with different distributions. However, we face tremendous challenges. First, we have no explicit knowledge about the underlying distributions of the graphs. Second, if we separately train different models for different graphs, we have to split the training graphs across models, so the training data for each model could be very limited; in the extreme case where one graph has a unique distribution, we have only one training sample for the corresponding model. Third, even if we could train distinct GNN models for non-IID graphs, which trained model should we adopt to make a prediction for an unlabeled graph at test time?
In this work, we propose a non-IID graph neural network framework, NonIID-GNN, which tackles the aforementioned challenges simultaneously. An overview of the architecture of NonIID-GNN is shown in Figure 2. The basic idea of NonIID-GNN is as follows: it approximates the distribution information of a graph sample $\mathcal{G}_i$ by applying an adaptor network to its structural information; the output serves as adaptor parameters that adapt each GNN block for $\mathcal{G}_i$, and the adapted GNN model can be viewed as a graph classification model specific to $\mathcal{G}_i$. The underlying distribution of a given sample may influence different GNN blocks differently; thus, we introduce one adaptor network for each GNN block. To solve the first challenge, i.e., no knowledge about the underlying distribution, we develop an adaptor network to approximate the distribution information of a graph from its observed structural information. To tackle the second challenge, we jointly learn a specific model for each graph, where these models share the same set of learnable parameters, i.e., the adaptor networks and GNN blocks. With this design, the third challenge is addressed automatically: given an unlabeled graph $\mathcal{G}$, the trained NonIID-GNN generates an adapted GNN model to predict its label. Next, we introduce the details of the adaptor network, the adapted graph neural network for each graph, and how to train the framework and use it at test time.
3.2. The Adaptor Network
The goal of the adaptor network is to approximate the distribution information of a given graph, since we do not have knowledge of the graph's underlying distribution. In particular, we use the graph structure information as the input of the adaptor network. The intuition is that the structural differences between graphs stem from their different distributions; thus, we estimate the distribution from the observed structural information via a powerful adaptor network. Graph neural networks often consist of several subsequent filtering and pooling layers, which can be viewed as different blocks of the graph neural network model. As mentioned before, the distribution of a graph may influence each GNN block differently. Thus, we build one adaptor network to generate adaptor parameters for each block of the graph neural network. We first extract a vector $\mathbf{s}_i$ to denote the structural information of a given graph sample $\mathcal{G}_i$; we discuss the details of $\mathbf{s}_i$ in the experiment section. As shown in the left part of Figure 2, the adaptor networks take the graph structure information $\mathbf{s}_i$ of $\mathcal{G}_i$ as input and generate the adaptation parameters for each block of the graph neural network model. Assuming we have $K$ blocks in the graph neural network, we then have $K$ independent adaptor networks. Note that these adaptor networks share the same input, while their outputs can differ, since the model parameters of different blocks in the graph neural network can be different. Specifically, the adaptor network that learns the adaptor parameters for the $j$-th block of the graph neural network can be expressed as follows:

(1)  $\phi_i^j = h_j(\mathbf{s}_i; \theta_j)$

where $\theta_j$ denotes the parameters of the $j$-th adaptor network and $\phi_i^j$ denotes its output, which will be used to adapt the $j$-th learning block of the graph neural network model. The adaptor $h_j(\cdot)$ can be modeled by any function. In this work, we use feed-forward neural networks due to their strong capability in approximating arbitrary functions. For convenience, we summarize the process of the $K$ adaptor networks for $\mathcal{G}_i$ as follows:

(2)  $\Phi_i = h(\mathbf{s}_i; \theta)$

where $\Phi_i = \{\phi_i^1, \dots, \phi_i^K\}$ contains the generated adaptation parameters of the graph sample $\mathcal{G}_i$ for all the GNN blocks and $\theta = \{\theta_1, \dots, \theta_K\}$ denotes the parameters of the adaptor networks.
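As a concrete illustration, the adaptor networks can be sketched as small feed-forward networks that map the shared structural vector to per-block adaptation parameters. The two-layer architecture, all sizes, and the example structural vector below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Illustrative sketch of the adaptor networks h_1 ... h_K: each one maps the
# shared structural vector s to the adaptation parameters phi for one GNN
# block. Sizes and architecture here are assumptions for illustration.
rng = np.random.default_rng(0)

def make_adaptor(s_dim, phi_dim, hidden=16):
    """Parameters of a two-layer feed-forward adaptor network."""
    return {"W1": rng.standard_normal((s_dim, hidden)) * 0.1,
            "b1": np.zeros(hidden),
            "W2": rng.standard_normal((hidden, phi_dim)) * 0.1,
            "b2": np.zeros(phi_dim)}

def adaptor_forward(params, s):
    """phi = h_j(s; theta_j), a ReLU network as in eq. (1)."""
    h = np.maximum(0.0, s @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

# One independent adaptor per GNN block (K = 3 here), all sharing input s.
s = np.array([284.0, 715.0, 0.018])  # e.g. #nodes, #edges, density
adaptors = [make_adaptor(s_dim=3, phi_dim=16) for _ in range(3)]
Phi = [adaptor_forward(p, s) for p in adaptors]  # Phi_i of eq. (2)
```

Note that the adaptors share the input `s` but keep independent weights, mirroring the one-adaptor-per-block design above.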
3.3. The Adapted Graph Neural Network
Any existing graph neural network model can be adapted by the NonIID-GNN framework to generate sample-specific models based on each sample's graph structure information. Therefore, we first introduce a general GNN model for graph classification and describe how it can be adapted for a given sample. Then, we use two concrete examples to illustrate how to adapt specific GNN models.
3.3.1. A General Adapted Framework
A typical GNN framework for graph classification usually contains two types of layers, i.e., the filtering layer and the pooling layer. The filtering layer takes the graph structure and node representations as input and generates refined node representations as output. The pooling layer takes the graph structure and node representations as input and produces a coarsened graph with a new graph structure and new node representations. An overview of a general GNN framework for graph classification is shown in Figure 3, where we have $P$ pooling layers in total, each of which follows a stack of filtering layers; hence, there are $K$ learning blocks in this GNN framework. A graph-level representation can be obtained from these layers and further utilized to perform the prediction. Given a graph sample $\mathcal{G}_i$, we adapt each of the $K$ blocks according to its distribution information identified by the adaptor networks to generate a GNN model specific to $\mathcal{G}_i$. Next, we first introduce the formulation of the filtering layer and the pooling layer, and then describe how they can be adapted with the adaptation parameters learned by the adaptor networks.
Without loss of generality, when introducing a filtering layer or a pooling layer, we use an adjacency matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ and node representations $\mathbf{H} \in \mathbb{R}^{n \times d}$ to denote the input of these layers, where $n$ is the number of nodes and $d$ is the dimension of the node features. The operation of a filtering layer can then be described as follows:

(3)  $\mathbf{H}' = f(\mathbf{A}, \mathbf{H}; \mathbf{W}_f)$

where $\mathbf{W}_f$ denotes the parameters of the filtering layer and $\mathbf{H}' \in \mathbb{R}^{n \times d'}$ denotes the refined node representations with dimension $d'$ generated by the filtering layer. Assuming $\phi$ is the corresponding adaptor parameter vector for this filtering layer, we adapt the model parameters of this filtering layer as follows:

(4)  $\tilde{\mathbf{W}}_f = \mathbf{W}_f \diamond \phi$

where $\tilde{\mathbf{W}}_f$ is the adapted model parameter, which has exactly the same shape as the original model parameter $\mathbf{W}_f$, and $\diamond$ is the adaptation operator. The adaptation operator can have various designs, which can be determined according to the specific GNN model; we provide the details of the adaptation operator when we introduce concrete examples in the following subsections. Then, with the adapted model parameters, we can define the adapted filtering layer as follows:

(5)  $\mathbf{H}' = f(\mathbf{A}, \mathbf{H}; \tilde{\mathbf{W}}_f)$
The process of a pooling layer can be described as follows:

(6)  $\mathbf{A}', \mathbf{H}' = p(\mathbf{A}, \mathbf{H}; \mathbf{W}_p)$

where $\mathbf{W}_p$ denotes the parameters of the pooling layer, $\mathbf{A}' \in \mathbb{R}^{n' \times n'}$ is the adjacency matrix of the newly generated coarsened graph with $n'$ supernodes, and $\mathbf{H}' \in \mathbb{R}^{n' \times d'}$ is the learned node representations of the coarsened graph. Similarly, we adapt the model parameters of the pooling layer as follows:

(7)  $\tilde{\mathbf{W}}_p = \mathbf{W}_p \diamond \phi_p$

which leads to the following adapted pooling layer:

(8)  $\mathbf{A}', \mathbf{H}' = p(\mathbf{A}, \mathbf{H}; \tilde{\mathbf{W}}_p)$

where $\phi_p$ is the adaptation parameter vector generated by the adaptor network for this pooling layer.
To summarize, given a graph sample $\mathcal{G}_i$, its specific adaptor parameters $\Phi_i$ learned by the adaptor networks, and a GNN framework with the model parameters of all layers summarized in $\Theta$, we can generate an adapted GNN specific to the sample as $\mathrm{GNN}(\cdot; \Theta \diamond \Phi_i)$, where we summarize the layer-wise adaptation operations as $\Theta \diamond \Phi_i$. Next, we use GCN (Kipf and Welling, 2016) and Diffpool (Ying et al., 2018) as examples to illustrate how to adapt a specific GNN model.
3.3.2. Adapted GCN: NonIID-GCN
Graph Convolutional Network (GCN) (Kipf and Welling, 2016) was originally proposed for the semi-supervised node classification task. The filtering layer in GCN is defined as follows:

(9)  $\mathbf{H}' = \sigma\left(\hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H} \mathbf{W}\right)$

where $\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ represents the adjacency matrix with self-loops, $\hat{\mathbf{D}}$ is the diagonal degree matrix of $\hat{\mathbf{A}}$, $\mathbf{W} \in \mathbb{R}^{d \times d'}$ denotes the trainable weight matrix of the filtering layer, and $\sigma(\cdot)$ is a non-linear activation function. With the adaptation parameter $\phi$ for the corresponding filtering layer, the adapted filtering layer can be represented as follows:

(10)  $\mathbf{H}' = \sigma\left(\hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H} \left(\mathbf{W} \diamond \phi\right)\right)$

Specifically, we adopt FiLM (Perez et al., 2018) as the adaptation operator. In this case, the dimension of the adaptor parameter $\phi$ is $2d'$, i.e., $\phi \in \mathbb{R}^{2d'}$. We split $\phi$ into two parts, $\gamma \in \mathbb{R}^{d'}$ and $\beta \in \mathbb{R}^{d'}$, and the adaptation operation can then be expressed as follows:

(11)  $\mathbf{W} \diamond \phi = \mathbf{W} \odot B(\gamma) + B(\beta)$

where $B(\cdot)$ is a broadcasting function that repeats a vector $d$ times; hence, $B(\gamma)$ and $B(\beta)$ have the same shape as $\mathbf{W}$, and $\odot$ denotes element-wise multiplication between two matrices.
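A minimal sketch of this FiLM-style operator, assuming W has shape (d, d') and phi stacks gamma and beta into a length-2d' vector:

```python
import numpy as np

# Minimal sketch of the FiLM-style adaptation operator of eq. (11),
# assuming W has shape (d, d') and phi stacks gamma and beta (length 2*d').
def adapt_weight(W, phi):
    d, d_out = W.shape
    gamma, beta = phi[:d_out], phi[d_out:]
    # B(.) broadcasts a length-d' vector over the d rows of W.
    return W * gamma[None, :] + beta[None, :]

W = np.ones((4, 3))
phi = np.concatenate([np.full(3, 2.0), np.full(3, -1.0)])  # gamma=2, beta=-1
W_tilde = adapt_weight(W, phi)  # every entry becomes 1*2 + (-1) = 1
```

Since gamma scales and beta shifts each output dimension, the adapted weight keeps the shape of W, as required by eq. (4).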
To utilize GCN for graph classification, we introduce a node-wise max-pooling layer to generate the graph representation from the node representations as follows:

(12)  $\mathbf{h}_{\mathcal{G}} = \max(\mathbf{H})$

where $\mathbf{h}_{\mathcal{G}}$ denotes the graph-level representation and $\max(\cdot)$ takes the maximum over all the nodes. Note that the max-pooling operation does not involve learnable parameters, and thus no adaptation is needed for it. An adapted GCN framework, which we call NonIID-GCN, consists of a stack of adapted filtering layers followed by a single max-pooling layer that generates the graph-level representation. The graph representation is then utilized to make the prediction.
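A toy sketch of one GCN filtering layer (eq. (9)) followed by the node-wise max pooling (eq. (12)); the graph, feature and weight sizes below are illustrative, and the weight would first be adapted via eq. (11) in NonIID-GCN:

```python
import numpy as np

# Toy sketch of a GCN filtering layer (eq. 9) followed by node-wise max
# pooling (eq. 12); graph, feature and weight sizes are illustrative.
def gcn_filter(A, H, W):
    """H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

def max_pool(H):
    """One graph-level vector: the per-dimension max over all nodes."""
    return H.max(axis=0)

A = np.array([[0., 1., 0.],   # 3-node path graph
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[1., 0.], [0., 1.], [1., 1.]])
W = np.full((2, 4), 0.5)      # in NonIID-GCN, W would first be adapted
h_G = max_pool(gcn_filter(A, H, W))  # graph-level representation, shape (4,)
```

Because max pooling has no parameters, only the filter weight W participates in the adaptation, matching the remark above.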
3.3.3. Adapted Diffpool: NonIID-Diffpool
Diffpool is a hierarchical graph-level representation learning method for graph classification (Ying et al., 2018). The filtering layer in Diffpool is the same as eq. (9), and its corresponding adapted version is shown in eq. (10). Its pooling layer is defined as follows:

(13)  $\mathbf{S} = \mathrm{softmax}\left(f(\mathbf{A}, \mathbf{H}; \mathbf{W}_p)\right)$

(14)  $\mathbf{H}' = \mathbf{S}^{\top} \mathbf{Z}$

(15)  $\mathbf{A}' = \mathbf{S}^{\top} \mathbf{A} \mathbf{S}$

where $f(\mathbf{A}, \mathbf{H}; \mathbf{W}_p)$ is a filtering layer embedded in the pooling layer and $\mathbf{S}$ is a soft-assignment matrix, which softly assigns each node to a supernode to generate a coarsened graph. Specifically, the structure and the node representations of the coarsened graph are generated by eq. (15) and eq. (14), respectively, where $\mathbf{Z}$ is the output of the filtering layers preceding the pooling layer. To adapt the pooling layer, we only need to adapt eq. (13), which follows the same scheme as eq. (10), since it is also a filtering layer. The adapted Diffpool model, which we call NonIID-Diffpool, can thus be obtained by choosing proper numbers of filtering and pooling layers, with all of them adapted.
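The pooling computation of eqs. (13)-(15) can be sketched as below; for brevity, the embedded filtering layer is simplified to a plain linear map A Z W_pool rather than the full normalized GCN filter of eq. (9):

```python
import numpy as np

# Toy sketch of the Diffpool pooling layer, eqs. (13)-(15). The embedded
# filtering layer is simplified to A @ Z @ W_pool for brevity.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diffpool(A, Z, W_pool):
    """Softly assign n nodes to n' supernodes and coarsen the graph."""
    S = softmax(A @ Z @ W_pool, axis=1)  # soft assignment (eq. 13), (n, n')
    H_new = S.T @ Z                      # supernode features   (eq. 14)
    A_new = S.T @ A @ S                  # coarsened structure  (eq. 15)
    return A_new, H_new

A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
Z = np.eye(3)                    # toy node representations
W_pool = np.full((3, 2), 0.1)    # maps 3 nodes onto 2 supernodes
A_new, H_new = diffpool(A, Z, W_pool)
```

Each row of S sums to one, so every node distributes its mass across the supernodes; in NonIID-Diffpool, W_pool is the weight adapted via eq. (11).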
3.4. Training and Test
Given a graph sample $\mathcal{G}$ with adjacency matrix $\mathbf{A}$, feature matrix $\mathbf{X}$ and structural information $\mathbf{s}$, the NonIID-GNN framework performs the classification as follows:

(16)  $\hat{y} = \mathrm{GNN}\left(\mathbf{A}, \mathbf{X}; \Theta \diamond h(\mathbf{s}; \theta)\right)$

During training, we are given a set of graphs $\{\mathcal{G}_1, \dots, \mathcal{G}_N\}$ as training samples, where each graph $\mathcal{G}_i$ is associated with a ground-truth label $y_i$. The objective function of NonIID-GNN can then be represented as follows:

(17)  $\min_{\Theta, \theta} \sum_{i=1}^{N} \mathcal{L}(\hat{y}_i, y_i)$

where $N$ is the number of training samples and $\mathcal{L}$ is a loss function that measures the difference between the predicted labels and the ground-truth labels. In this work, we use cross-entropy as the loss function and adopt Adam (Kingma and Ba, 2014) to optimize the objective.

During the test phase, the label of a given sample can be inferred using eq. (16). Specifically, the graph structural information $\mathbf{s}$ of the sample is first utilized as the input of the adaptor networks to identify its distribution information, which is then utilized to adapt the shared model parameters $\Theta$ and generate a sample-specific model. This sample-specific model finally performs the classification for this sample.
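Putting the pieces together, the inference path of eq. (16) — structural features, adaptor, adapted filtering, pooling, prediction — can be sketched end to end. The single linear adaptor and all sizes below are simplifying assumptions, not the paper's configuration:

```python
import numpy as np

# End-to-end sketch of eq. (16): structural features -> adaptor -> adapted
# GCN filter -> max pooling -> prediction. The single linear adaptor and
# all sizes are simplifying assumptions for illustration only.
rng = np.random.default_rng(1)

def structural_info(A):
    """s = (#nodes, #edges, density) of an undirected graph."""
    n = A.shape[0]
    m = A.sum() / 2.0
    density = 2.0 * m / (n * (n - 1)) if n > 1 else 0.0
    return np.array([n, m, density])

def predict(A, X, W, W_cls, theta):
    s = structural_info(A)
    phi = theta @ s                            # minimal linear adaptor
    d_out = W.shape[1]
    W_tilde = W * phi[:d_out] + phi[d_out:]    # FiLM adaptation (eq. 11)
    A_hat = A + np.eye(A.shape[0])
    D = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    H = np.maximum(0.0, D @ A_hat @ D @ X @ W_tilde)  # adapted filtering
    logits = H.max(axis=0) @ W_cls             # max pooling + classifier
    return int(np.argmax(logits))

A = np.array([[0., 1.], [1., 0.]])
X = np.eye(2)
W = rng.standard_normal((2, 4)) * 0.1
W_cls = rng.standard_normal((4, 2)) * 0.1
theta = rng.standard_normal((8, 3)) * 0.01     # adaptor weights, phi in R^8
label = predict(A, X, W, W_cls, theta)         # predicted class index
```

During training, W, W_cls and theta would be optimized jointly under the cross-entropy objective of eq. (17), so every graph shares the same parameters while receiving its own adapted model.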
4. Experiments
In this section, we conduct comprehensive experiments to verify the effectiveness of the proposed NonIID-GNN framework. We first describe the implementation details of the proposed framework. Then, we evaluate its performance by comparing the original GCN and Diffpool models with their adapted versions under the NonIID-GNN framework. Next, we analyze the importance of the different components of the adaptor operator via an ablation study. Finally, we conduct case studies to further facilitate the understanding of the proposed method.
4.1. Experimental Settings
In this work, we focus on the task of graph classification. To demonstrate the effectiveness of the proposed NonIID-GNN framework, we carry out graph classification on eight datasets from various domains against a variety of representative baselines. Next, we describe the datasets and the baselines.
4.1.1. Datasets
Some major statistics of the eight datasets are shown in Table 1, and more details are introduced as follows:

D&D (Dobson and Doig, 2003) is a dataset of protein structures. Each protein is represented as a graph, where each node represents an amino acid and an edge connects two nodes if they are less than 6 Ångströms apart.

ENZYMES (Shervashidze et al., 2011) is a dataset of protein tertiary structures of six classes of enzymes.

PROTEINS_full (Borgwardt et al., 2005) is a dataset of protein structures, where each graph represents a protein and each node represents a secondary structure element (SSE) in the protein.

NCI1 and NCI109 (Shervashidze et al., 2011) are two datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, provided by the National Cancer Institute (NCI).

COLLAB (Yanardag and Vishwanathan, 2015) is a dataset of scientific collaboration networks, which describes the collaboration patterns of researchers from three different research fields.

REDDIT-BINARY and REDDIT-MULTI-5K (Yanardag and Vishwanathan, 2015) are two datasets of online discussion threads crawled from different subreddits on Reddit, where each node represents a user and each edge between two user nodes represents an interaction between them.
Table 1. Statistics of the datasets. #Nodes (avg ± std) denotes the average and standard deviation of the number of nodes among the graphs.

Datasets          #Graphs   #Class   #Nodes (avg ± std)
D&D                 1,178        2       284.3 ± 272.0
ENZYMES               600        6        32.6 ±  14.9
PROTEINS_full       1,113        2       39.06 ±  45.8
NCI1                4,110        2       29.87 ±  13.5
NCI109              4,127        2       29.68 ±  13.6
COLLAB              5,000        3       74.49 ±  62.3
REDDIT-BINARY       2,000        2      429.63 ± 554.0
REDDIT-MULTI-5K     4,999        5      508.52 ± 452.6
4.1.2. Baselines
The proposed framework is general and can be applied to any graph neural network to improve its performance on graph classification tasks. Here we apply it to two graph neural networks: a basic graph convolutional network, GCN (Kipf and Welling, 2016), and an advanced graph convolutional network with hierarchical pooling, Diffpool (Ying et al., 2018). The corresponding adapted versions are NonIID-GCN and NonIID-Diffpool, respectively. To validate the effectiveness of the proposed model, we compare NonIID-GCN and NonIID-Diffpool with GCN and Diffpool on the task of graph classification on eight different datasets. Besides these two baselines, to further demonstrate our model's capacity to capture non-identical distributions of graph samples via graph structural properties, we also develop a baseline method, MultiGCN, which learns multiple graph convolutional networks for graph samples with different structural information. More details of the baseline methods are as follows:

GCN (Kipf and Welling, 2016) was originally proposed for semi-supervised node classification. It consists of a stack of GCN layers, where a new representation of each node is computed by transforming and aggregating the node representations of its neighboring nodes. Finally, a graph representation is generated from the node representations of the last GCN layer via a global max-pooling layer and then used for graph classification.

Diffpool (Ying et al., 2018) is a recently proposed method that has achieved state-of-the-art performance on the graph classification task. It proposes a differentiable graph pooling approach to hierarchically generate a graph-level representation by coarsening the input graph level by level.

MultiGCN consists of several GCN models trained on different subsets of the training dataset. As shown in Figure 4, we first cluster the training samples into different training subsets via the K-means method based on their graph structural information. Note that in this work, the structural information of a graph includes the number of nodes, the number of edges and the graph density. We then train a separate model on each training subset. During the test phase, given a test graph sample, we first compute the Euclidean distance between its graph structural properties and the centroids of the different training subsets. Then, we choose the model trained on the closest training subset to predict the label of this graph sample. In this experiment, we set the number of clusters to 2 and 3, and denote the corresponding frameworks as MultiGCN2 and MultiGCN3.
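The test-time routing step of MultiGCN can be sketched as a nearest-centroid lookup over the structural features; the centroid values below are toy numbers, not learned from the datasets:

```python
import numpy as np

# Sketch of MultiGCN's test-time routing (an illustration, not the paper's
# code): a test graph is routed to the GCN trained on the cluster whose
# centroid of structural features (#nodes, #edges, density) is nearest.
def route(test_feat, centroids):
    """Index of the Euclidean-nearest centroid."""
    dists = np.linalg.norm(centroids - test_feat[None, :], axis=1)
    return int(np.argmin(dists))

centroids = np.array([[ 30.0,  60.0, 0.14],   # "small graph" cluster (toy)
                      [300.0, 700.0, 0.02]])  # "large graph" cluster (toy)
test_feat = np.array([280.0, 650.0, 0.03])
model_idx = route(test_feat, centroids)       # -> 1: use the large-graph model
```

This hard routing is the key difference from NonIID-GNN, which instead adapts a single shared model continuously rather than picking one of a few discrete models.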
Table 2. Graph classification accuracy of the compared methods on the eight datasets.

Methods           D&D     ENZYMES  PROTEINS_full  NCI1    NCI109  COLLAB  REDDIT-BINARY  REDDIT-MULTI-5K
GCN               0.7716  0.5176   0.7662         0.7715  0.7574  0.6986  0.8189         0.5039
Diffpool          0.7823  0.5771   0.7894         0.8017  0.7718  0.7040  0.8972         0.5646
MultiGCN2         0.7435  0.4521   0.7950         0.7743  0.7515  0.6990  0.7938         0.5088
MultiGCN3         0.7435  0.4604   0.7962         0.7749  0.7555  0.6815  0.8888         0.4700
NonIID-GCN        0.7931  0.5592   0.7788         0.7877  0.7705  0.7316  0.9039         0.5293
NonIID-Diffpool   0.7856  0.5854   0.7939         0.7932  0.7755  0.7380  0.9292         0.5541
4.2. Graph Classification Performance Comparison
To conduct the graph classification task, for each graph dataset, we randomly shuffle the dataset and then split it into a training set and a test set. We train all the models on the training set and evaluate their performance on the test set with accuracy as the measure. We repeat this process several times and report the average performance. The GCN/NonIID-GCN model consists of a stack of filtering layers followed by a single max-pooling layer, and ReLU (Nair and Hinton, 2010) activation is applied after each filtering layer. For Diffpool/NonIID-Diffpool, we follow the settings of the original paper (Ying et al., 2018). We adopt fully-connected networks to implement the adaptor networks in the NonIID-GNN frameworks; their input dimension is the same as the dimension of the graph structural information. Specifically, in this work, we use the number of nodes, the number of edges and the graph density as the graph structural information in the implementation. The results of graph classification are shown in Table 2. We can make the following observations from the table:

The adapted GCN model, NonIID-GCN, consistently outperforms the original GCN model on all the datasets. Similar observations can be made when comparing NonIID-Diffpool with the original Diffpool model. These observations demonstrate that the sample-wise adaptation performed by the NonIID-GNN framework indeed improves the performance of GNN frameworks. Furthermore, the adaptation in the NonIID-GNN framework can be successfully applied both to flat GNN models such as GCN and to GNN models with a hierarchical pooling scheme such as Diffpool.

The MultiGCN frameworks outperform the original GCN model on some of the datasets such as PROTEINS_full and NCI1. They even achieve better performance than NonIID-GCN on the PROTEINS_full dataset, which demonstrates that the MultiGCN framework is a potential way to deal with non-identical underlying distributions in a graph dataset, especially when the clusters are nicely generated. However, it is not always easy to find such clusters, and most commonly the clusters we find cannot precisely represent the multiple underlying distributions in the dataset, which can lead to unsatisfactory performance. This is likely why MultiGCN performs even worse than the original GCN model on some of the datasets such as ENZYMES and D&D. In contrast, the proposed NonIID-GNN framework provides a principled way to identify the distribution information of graph samples and elegantly incorporate it into GNN frameworks, which consistently provides outstanding performance on all the datasets.
Table 3. Ablation study of the adaptation operator; -mul and -add denote the multiplication-only and addition-only variants, respectively.

Methods           D&D     ENZYMES  PROTEINS_full  NCI1    NCI109  COLLAB  REDDIT-BINARY  REDDIT-MULTI-5K
GCN               0.7716  0.5176   0.7662         0.7715  0.7574  0.6986  0.8189         0.5039
NonIID-GCN-mul    0.7810  0.5225   0.7761         0.7790  0.7596  0.7084  0.8517         0.5173
NonIID-GCN-add    0.7797  0.5400   0.7793         0.7877  0.7714  0.7116  0.8878         0.5188
NonIID-GCN        0.7931  0.5592   0.7788         0.7878  0.7705  0.7316  0.9039         0.5293
4.3. Ablation Study of Adaptation Operator
In this subsection, we investigate the effectiveness of the different components of the adaptor operator in eq. (11). Specifically, we want to investigate whether the scaling term $\gamma$ and the shifting term $\beta$ play important roles in the adaptor operator. To achieve this goal, we define the following variants of the NonIID-GCN framework with different adaptor operators:

NonIID-GCN-mul: a variant of the adaptor operator with only the element-wise multiplication operation. Instead of eq. (11), the adaptation process is expressed as $\mathbf{W} \diamond \phi = \mathbf{W} \odot B(\gamma)$.

NonIID-GCN-add: a variant of the adaptor operator with only the element-wise addition operation. Instead of eq. (11), the adaptation process is expressed as $\mathbf{W} \diamond \phi = \mathbf{W} + B(\beta)$.

Following the previous experimental setting, we investigate the performance of NonIID-GCN-mul and NonIID-GCN-add on the graph classification task on the eight datasets. We compare their results with those of NonIID-GCN using the full adaptor operator described in eq. (11). The results are presented in Table 3. From the table, we can observe that both NonIID-GCN-mul and NonIID-GCN-add outperform the original GCN model, which indicates that both the $\gamma$ term and the $\beta$ term are effective for adaptation, and utilizing either of them can already adapt the original model in a reasonable manner. On the other hand, the full NonIID-GCN model outperforms both variants on most of the datasets, which demonstrates that the adaptation effects of the $\gamma$ term and the $\beta$ term are complementary, and combining them during the adaptation process further enhances the performance.
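The two ablated operators and the full operator differ only in which term of the FiLM adaptation they keep, as the following toy computation makes explicit (gamma and beta are the two halves of the adaptor output, broadcast over the rows of W):

```python
import numpy as np

# Toy computation contrasting the multiplication-only and addition-only
# variants with the full FiLM-style operator; gamma scales and beta shifts
# each output dimension, broadcast over the rows of W.
W = np.ones((2, 3))
gamma = np.array([2.0, 3.0, 4.0])
beta = np.array([0.5, 0.5, 0.5])

W_mul = W * gamma[None, :]                    # multiplication-only variant
W_add = W + beta[None, :]                     # addition-only variant
W_full = W * gamma[None, :] + beta[None, :]   # full operator
```

The full operator can realize both pure scaling (beta = 0) and pure shifting (gamma = 1), which is one way to see why the two partial variants are complementary.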
4.4. Case Study
To further illustrate the effectiveness of the proposed framework, we conducted some exploratory experiments on the D&D dataset. First, we visualize the distribution of the embeddings of the sample-specific model parameters for different graph samples. Specifically, we take the adaptation parameters (i.e., the parameters in eq. (11)) of the first filtering layer of each sample-specific Non-IID-GCN model and then utilize t-SNE (Maaten and Hinton, 2008) to project these parameters to 2-dimensional embeddings. We visualize these 2-d embeddings as scatter plots in Figure (a), Figure (b) and Figure (c). Note that in these three figures, the red triangle denotes the embedding of the parameters of the original GCN model. For each point in these figures, we use color to represent the scale of values in terms of node size, edge size and density for Figure (a), Figure (b) and Figure (c), respectively. Specifically, a deeper red color indicates a larger value, while a deeper blue color indicates a smaller value. We make two observations from these figures. First, the proposed Non-IID-GCN framework indeed generates different models for different graph samples, and these models differ from the original model, which is denoted as a red triangle in the figures. Second, points with similar colors stay close to each other in all three figures, which means that graph samples with similar structural information (the number of nodes, the number of edges and the graph density) share similar models. This indicates that the proposed framework is able to approximate the underlying distribution information from the graph structure and incorporate this information during the adaptation process.
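The structural properties used for coloring above (node size, edge size and density) can be computed directly from a graph's adjacency matrix. A minimal sketch, assuming a simple undirected graph without self-loops:

```python
import numpy as np

def graph_stats(adj):
    # Returns (number of nodes, number of edges, density) for a simple
    # undirected graph given its symmetric 0/1 adjacency matrix.
    adj = np.asarray(adj)
    n = adj.shape[0]
    m = int(np.triu(adj, k=1).sum())  # count each undirected edge once
    density = 2.0 * m / (n * (n - 1)) if n > 1 else 0.0
    return n, m, density

# Toy graph: a triangle (nodes 0, 1, 2) plus a pendant node 3.
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
print(graph_stats(adj))  # (4, 4, 0.666...)
```

Values like these, computed per graph, are what each scatter point is colored by in the visualizations.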
In addition, in Figure (d), we illustrate the sample-specific model parameters for seven samples with different numbers of nodes, which are misclassified by the original, unified GCN model but correctly classified by the proposed Non-IID-GCN framework. It is clear that Non-IID-GCN generates seven different GCN models for these graph samples, each of which successfully predicts the label of the corresponding sample.
5. Conclusion
In this paper, we propose a general graph neural network framework, Non-IID-GNN, to deal with graphs that are non-identically distributed. Given a graph sample, the Non-IID-GNN framework first approximates its underlying distribution information from its structural information; it can then adapt any existing GNN-based graph classification model to generate a specific model for this sample, which is utilized to predict its label. Comprehensive experiments demonstrate that the Non-IID-GNN framework can effectively adapt both flat and hierarchical GNN models to enhance their performance. An interesting future direction is to better infer the underlying distribution of a given graph sample: instead of utilizing hand-engineered graph properties to approximate the underlying distribution information, we could design more sophisticated algorithms to achieve this goal.
References
 Protein function prediction via graph kernels. Bioinformatics 21 (Suppl. 1), pp. i47–i56. Cited by: §1, §2, 3rd item.
 Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings. Cited by: §2.

 Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711. Cited by: §1.
 Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852. Cited by: §2.
 BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
 Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology 330 (4), pp. 771–783. Cited by: §1, §1, 1st item.
 Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §1, §2, §2.

 SplineCNN: fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 869–877. Cited by: §2.
 Graph U-Nets. arXiv preprint arXiv:1905.05178. Cited by: §1, §2.
 Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1416–1424. Cited by: §2.
 Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1263–1272. Cited by: §1, §2, §2.
 Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §1, §2, §2.
 Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.4.
 Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §1, §2, §3.3.1, §3.3.2, 1st item, §4.1.2.
 Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. Cited by: §2.
 Self-attention graph pooling. arXiv preprint arXiv:1904.08082. Cited by: §2.

 Adaptive graph convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
 Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §2.
 Graph convolutional networks with eigenpooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 723–731. Cited by: §1, §2.
 Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §4.4.
 Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML10), pp. 807–814. Cited by: §4.2.
 Speech recognition using deep neural networks: a systematic review. IEEE Access 7, pp. 19143–19165. Cited by: §1.
 Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023. Cited by: §2.
 FiLM: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §3.3.2.
 ASAP: adaptive structure aware pooling for learning hierarchical graph representations. arXiv preprint arXiv:1911.07979. Cited by: §2.
 BRENDA, the enzyme database: updates and major new developments. Nucleic acids research 32 (suppl_1), pp. D431–D433. Cited by: §1, §2.
 Schnet: a continuousfilter convolutional neural network for modeling quantum interactions. In Advances in neural information processing systems, pp. 991–1001. Cited by: §1, §2.
 Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12 (Sep), pp. 2539–2561. Cited by: §1, 2nd item, 4th item.
 Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3693–3702. Cited by: §2.
 Graph convolutional neural networks for alzheimer’s disease classification. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 414–417. Cited by: §1.
 Structure learning with independent non-identically distributed data. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1041–1048. Cited by: §1.
 Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1.
 Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. Cited by: §2.
 A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §2.
 Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. Cited by: §1, §2, 5th item, 6th item.
 Hierarchical graph representation learning with differentiable pooling. In Advances in neural information processing systems, pp. 4800–4810. Cited by: §1, §2, §3.3.1, §3.3.3, 2nd item, §4.1.2, §4.2.
 An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
 ANRL: attributed network representation learning via deep neural networks.. In IJCAI, Vol. 18, pp. 3155–3161. Cited by: §1, §2.
 Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434. Cited by: §2.