1 Introduction
Graphs have complicated structures and a myriad of real-world applications. Recently, great effort has been devoted to applying deep learning methods to graph data analysis. Many newly proposed graph learning approaches are inspired by Convolutional Neural Networks (CNNs) (LeCun and Bengio, 1998), which have been highly successful in learning from two-dimensional image data (grid structure). The convolution and pooling layers in CNNs have been redefined to process graph data. A multitude of Graph Convolutional Networks (GCNs) (Shuman et al., 2013) have been proposed, which learn node-level representations by aggregating feature information from neighbors (spatial-based approaches) (Hamilton et al., 2017) or by introducing filters from the perspective of graph signal processing (spectral-based approaches) (Bengio and LeCun, 2014). On the other hand, similar to the original pooling layer in CNNs, a graph pooling module (Defferrard et al., 2016; Zhang et al., 2018) can reduce the variance and computational complexity by downsampling the original feature data, which is of vital importance, particularly for graph-level classification tasks. Recently, hierarchical pooling methods that learn hierarchical representations of graphs have been proposed (Ying et al., 2018; Gao and Ji, 2019; Lee et al., 2019) and show state-of-the-art performance on graph classification tasks.
However, the diverse properties of graphs impose significant challenges on existing graph representation learning techniques. The graphs to be learned vary in size (i.e., number of nodes and edges) and in structural properties (e.g., average node degree, diameter, and clustering coefficient). Using a unified graph neural network with a single set of hyperparameters can cause trouble in both the graph convolution operation and the graph pooling operation.
For example, when performing node representation learning (by the graph convolution operation), a small output embedding size suffices for simple and small graphs, as shown in Figure 1(a), since a large embedding size could result in overfitting. By contrast, a large output embedding size is necessary for complex and large graphs in order to learn complex structural properties, as shown in Figure 1(b). This creates a conflict when processing a set of irregular, heterogeneous graphs. As another example, when coarsening graphs (by the graph pooling operation) for graph representation learning, graphs of different sizes are coarsened to graphs of a consistent size in order to finally obtain fixed-size embedding vectors. In the state-of-the-art graph pooling model DiffPool (Ying et al., 2018), the graph coarsening process is based on two factors, the largest graph size among all graphs (i.e., the maximum number of nodes) $N_{\max}$ and the coarsening ratio $r$, and all graphs are coarsened to graphs with $r \cdot N_{\max}$ nodes. For graphs with fewer than $r \cdot N_{\max}$ nodes, this leads to unexpected node split operations (upsampling) instead of node merge operations (downsampling) and introduces inaccuracies into the pooling process. If instead the coarsening ratio is set too small, upsampling of small graphs can be avoided, but large graphs lose too much information during the hierarchical coarsening process. Thus, pooling is also problematic when processing a set of irregular, heterogeneous graphs.
Table 1: Accuracy (%) on the three PROTEINS subsets with different embedding sizes.
graph size    dim=10    dim=60    dim=120
small         74.70     73.91     73.55
medium        71.56     73.72     71.47
large         81.32     81.57     82.72
Table 2: Accuracy (%) on the three PROTEINS subsets with different coarsening ratios (one column per ratio setting).
graph size
small         76.34     71.76     73.91
medium        73.23     74.41     73.50
large         82.63     82.11     83.42
To verify our analysis, we divide the PROTEINS dataset (Borgwardt et al., 2005) into three subsets according to graph size (number of nodes) and run the state-of-the-art graph pooling model DiffPool (Ying et al., 2018) on each subset separately to perform graph classification. The dataset is evenly partitioned by node count into small, medium, and large graphs. The average accuracy results with different embedding sizes (dim=10, 60, and 120) on the three subsets are listed in Table 1; the result for each dimension is obtained by averaging multiple results with different coarsening ratios. We can see that, as expected, small graphs prefer a small embedding size while large graphs prefer a large embedding size. Furthermore, we obtain better results on large graphs. We then choose the best embedding size for each subset (dim=10 for small graphs, dim=60 for medium graphs, and dim=120 for large graphs) and vary the coarsening ratio to see its effect on accuracy for graphs of different sizes. The results with different coarsening ratios are listed in Table 2. As expected, small graphs prefer a small coarsening ratio, while large graphs prefer a large coarsening ratio.
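The even partition by node count can be sketched as follows. This is only an illustration under our own assumptions: the helper name `partition_by_size` and the node counts are hypothetical, and ties or remainders may be handled differently in the actual experiments.

```python
def partition_by_size(node_counts, parts=3):
    """Split graph indices into `parts` near-equal groups ordered by node count."""
    # Sort graph indices from the smallest graph to the largest one.
    order = sorted(range(len(node_counts)), key=lambda i: node_counts[i])
    base, extra = divmod(len(order), parts)
    groups, start = [], 0
    for p in range(parts):
        # The first `extra` groups receive one additional graph.
        size = base + (1 if p < extra else 0)
        groups.append(order[start:start + size])
        start += size
    return groups

# Hypothetical node counts for seven graphs (not taken from PROTEINS).
sizes = [12, 40, 7, 300, 25, 95, 18]
small, medium, large = partition_by_size(sizes)
```

Each returned group holds indices into `sizes`, so the same indices can be used to select the corresponding graphs and labels.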
The diverse properties of graph datasets and their effect on the preferred GNN hyperparameter settings motivate us to use a multiplex GNN structure to learn graph features in a diverse way. A common way to strengthen traditional CNN convolution layers is to use multiple convolution kernels in order to learn multiple local features. The success of CNNs on image data inspires us to concurrently use multiple graph convolution networks and multiple graph pooling networks to learn graph representations. In addition, the effect of graph size on performance motivates us to use a priori properties of graphs to guide the learning process.
In this paper, we propose MxPool for hierarchical graph representation learning in graph classification tasks (our code is available at https://github.com/JucatL/MxPool/). MxPool comprises multiple graph convolution networks (with different hyperparameters) to learn node-level representations and multiple graph pooling networks to coarsen the graph. The node-level representations produced by the multiple convolution networks and the coarsened graphs produced by the multiple pooling networks are each merged in a learnable way. The a priori properties of graphs are considered during the merge process, where the to-be-merged vectors/matrices from different GNNs are given different weights. The merged node representations and the merged coarsened graph are then used to generate a new coarsened graph, which is used in the next layer. This multiplex structure can adapt to graphs with diverse properties and can extract useful information from different perspectives.
We conduct extensive experiments on numerous graph classification benchmarks and show that MxPool outperforms other state-of-the-art graph representation learning methods. For example, MxPool achieves 81.13% accuracy on the D&D dataset, while the second-best method, DiffPool, achieves 80.01% (Table 4).
2 Related Work
In this section, we review the recent literature on GNNs, graph convolution variants, and graph pooling variants.
Graph Neural Networks Inspired by Traditional Deep Learning Techniques. GNNs have recently drawn considerable attention due to their superiority in a wide variety of graph-related tasks, including node classification (Kipf and Welling, 2017), link prediction (Schlichtkrull et al., 2018), and graph classification (Dai et al., 2016). Many of these GNN models are inspired by traditional learning techniques. Inspired by the huge success of convolutional networks in the computer vision domain, a large number of Graph Convolutional Networks (GCNs) have emerged. Besides the convolution operation, the pooling operation, another key component of CNNs, has also inspired the research community to propose graph pooling operations. There are also GNN designs originating from other learning approaches. Inspired by Recurrent Neural Networks (RNNs), You et al. (2018a) apply Graph RNN to the graph generation problem. DGNN (Ma et al., 2018) uses LSTM (Hochreiter and Schmidhuber, 1997) to learn node representations in dynamic graphs. Inspired by the attention mechanism (Vaswani et al., 2017), Graph Attention Networks (GATs) (Velickovic et al., 2017) introduce attention into GCNs by differentiating the influence of neighbors. Graph AutoEncoders (GAEs) (Wang et al., 2016) originate from the autoencoder mechanism widely used for unsupervised learning and are suitable for learning node representations of graphs. GCPN (You et al., 2018b) uses Reinforcement Learning (RL) for goal-directed molecular graph generation.
Graph Convolution. Graph convolution operations fall into two categories, spectral-based approaches and spatial-based approaches. Bengio and LeCun (2014) first introduce convolution for graph data from the spectral domain using the graph Laplacian matrix. There are numerous other spectral-based graph convolution methods, such as ChebNet (Defferrard et al., 2016), 1stChebNet (Kipf and Welling, 2017), and AGCN (Li et al., 2018). In contrast, spatial-based convolution methods define graph convolution based on a node's spatial relations: the representation of a node is aggregated with its neighbors' representations to obtain a new representation for this node. In order to explore the depth and breadth of a node's receptive field, multiple graph convolution layers are stacked together, so that the features of neighbors two or more hops away can be learned. For example, GGNNs (Li et al., 2015), MPNN (Gilmer et al., 2017), GraphSage (Hamilton et al., 2017), PATCHY-SAN (Niepert et al., 2016), and DCNN (Atwood and Towsley, 2016) all fall into the spatial-based category.
Graph Pooling. The graph pooling operation is of vital importance for graph classification tasks (Zhang et al., 2018). It either coarsens a graph into subgraphs (Defferrard et al., 2016; Ying et al., 2018) or sums/averages the node representations (Duvenaud et al., 2015; Gilmer et al., 2017), yielding a compact graph-level representation. Graph coarsening approaches obtain hierarchical graph representations either by deterministic pooling methods or by learned pooling methods. Deterministic pooling methods (Defferrard et al., 2016; Simonovsky and Komodakis, 2017) use graph clustering algorithms to obtain the next-level coarsened graph, which is then processed by GNNs, following a two-stage approach. Learned pooling methods (Ying et al., 2018; Lee et al., 2019; Diehl, 2019; Gao and Ji, 2019), on the other hand, seek to learn the hierarchical structure and have been shown to outperform deterministic pooling methods. DiffPool (Ying et al., 2018) was the first to propose learned graph pooling. It learns a soft cluster assignment matrix in layer $l$, which contains the probabilities of nodes being assigned to clusters. A cluster in layer $l$ is reduced to a node in layer $l+1$. A GNN with input node features and adjacency matrix is used to generate the soft assignment matrix, based on which we can compute the cluster embeddings (i.e., node features in the next layer) and the coarsened adjacency matrix denoting the connectivity strength between each pair of clusters. Besides DiffPool, numerous graph pooling methods have emerged recently, including gPool (Gao and Ji, 2019), SAGPool (Lee et al., 2019), EigenPooling (Ma et al., 2019), Relational Pooling (Murphy et al., 2019), and StructPool (Yuan and Ji, 2020). However, to the best of our knowledge, none of the existing pooling methods employs a multiplex structure to learn graph representations in a diverse way.
3 Proposed Method
In this section, we propose MxPool to learn graph representations for graph-level classification tasks. Before going into the details, we first introduce some notation and the problem setting.
Problem Setting. A graph can be represented as $G = (A, X)$, where $A \in \{0,1\}^{n \times n}$ denotes the adjacency matrix ($n$ is the number of nodes contained in $G$), and $X \in \mathbb{R}^{n \times d}$ denotes the node feature matrix ($d$ is the dimension of features). In the graph classification setting, given a set of graphs, each associated with a label, we aim to train a model that takes an unseen graph as input and predicts its corresponding label. To make the prediction, it is important to extract useful information from multiple perspectives, including both graph structure and node features.
3.1 Overview
MxPool is a multi-layer hierarchical GNN model. Each layer of MxPool consists of a convolution operation and a pooling operation. The convolution operation aims to learn node-level representations, while the pooling operation aims to learn a coarsened graph. The new coarsened graph is then used as input to the next layer. This process can be repeated several times, yielding a multi-layer GNN model that learns hierarchical graph representations. Both the convolution operation and the pooling operation are important for graph representation learning. To simplify the illustration, we choose GCN (Kipf and Welling, 2017) as the convolution layer and DiffPool (Ying et al., 2018), a differentiable pooling method, as the pooling layer, but the model can be extended to other convolution/pooling variants as well.
Different from other hierarchical GNNs, MxPool launches multiple GCNs to learn node-level representations and multiple pooling networks to coarsen the graph. The node-level representations produced by the multiple GCNs are merged in a learnable way that considers the a priori graph properties (e.g., number of nodes/edges, diameter, and average node degree), and the coarsened graphs produced by the multiple pooling networks are merged as well. The merged node embeddings and the merged coarsened graph are used to generate a new coarsened graph with a new set of node features. This multiplex structure helps extract useful information from different perspectives and adapts to graphs of different sizes. We provide an illustrative example in Figure 3.
The GCN (Kipf and Welling, 2017) learns node representations "horizontally", as it can only pass messages between nodes through edges. Differentiable pooling (Ying et al., 2018) "vertically" summarizes the node features into a higher-level graph representation. Multiplexing "diversely" learns node representations or graph representations from different perspectives. Merging "synthetically" combines the diverse results and puts more attention on one or more perspectives according to the a priori graph features. Since the convolution, pooling, and merging operations are all differentiable, we can define an end-to-end differentiable graph representation learning framework in a hierarchical manner.
3.2 Multiplex Convolution
In our model, we use GCN for the convolution operation. The original GCN (Kipf and Welling, 2017) stacks several convolutional layers, and a single convolutional layer can be written as
$$H^{(k)} = \mathrm{ReLU}\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(k-1)} W^{(k-1)}\right) \qquad (1)$$
where $H^{(k)}$ are the node embeddings computed after $k$ steps, with $H^{(0)} = X$, $\tilde{A} = A + I$, and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $W^{(k-1)} \in \mathbb{R}^{d \times d'}$ is a trainable weight matrix, where $d'$ denotes the output embedding size. Equation (1) can be understood as a message passing process. The node embeddings are the "messages" transferred along edges, which are used to generate new node embeddings in the next round. A total of $K$ convolutional layers are stacked to learn node representations, and the output matrix $H^{(K)}$ can be viewed as the final node representations learned by the GCN model.
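As a concrete illustration of the message passing step, the sketch below implements one GCN layer in the standard Kipf and Welling form: add self-loops, normalize the adjacency matrix symmetrically, multiply by the embeddings and the weight matrix, and apply ReLU. It uses plain Python lists to stay dependency-free; the function names and the toy input are our own assumptions and do not come from the MxPool code base.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, h, w):
    """One propagation step: ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    n = len(adj)
    # Add self-loops so that each node also keeps its own message.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degrees of the self-looped graph (row sums of A~).
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalization: entry (i, j) is divided by sqrt(deg_i * deg_j).
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(a_hat, h), w)
    # ReLU non-linearity.
    return [[max(0.0, v) for v in row] for row in out]

# Toy input (hypothetical values): two connected nodes, scalar features, identity weight.
adj = [[0.0, 1.0], [1.0, 0.0]]
h0 = [[1.0], [3.0]]
w0 = [[1.0]]
h1 = gcn_layer(adj, h0, w0)
```

In this toy case both nodes end up with the average of the two input features, which shows how a single step mixes a node's own message with that of its neighbor.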
In the multi-layer GNN, suppose there are $L$ layers in total. At each layer $l$, with the input node feature matrix $X^{(l)} \in \mathbb{R}^{n_l \times d_l}$ and the adjacency matrix $A^{(l)} \in \mathbb{R}^{n_l \times n_l}$ generated from the previous layer, we learn an embedding matrix $Z^{(l)} \in \mathbb{R}^{n_l \times d_{l+1}}$. Here, we use $d_{l+1}$ to denote the output embedding dimension, since it determines the dimension of the input node embeddings at the next layer, as introduced later. For simplicity, we use $Z^{(l)} = \mathrm{GCN}_{\mathrm{emb}}(A^{(l)}, X^{(l)})$ to denote the GCN process (containing $K$ iterations of message passing). Initially, $A^{(0)}$ is the original graph's adjacency matrix, and $X^{(0)}$ is the original graph's node feature matrix.
In MxPool, we use multiple GCNs to learn multiple sets of node embeddings. These GCNs can be trained with different sets of hyperparameters $\theta$, such as the dimension of the weight matrix $W$. Suppose there are $C$ GCNs running concurrently at each layer $l$; we then have $C$ sets of node embeddings $Z_1^{(l)}, \dots, Z_C^{(l)}$. Let $\theta_i$ be the hyperparameter set of the $i$th GCN. Then, at layer $l$, the node embeddings produced by the $i$th GCN are
$$Z_i^{(l)} = \mathrm{GCN}_{\mathrm{emb},i}\left(A^{(l)}, X^{(l)}; \theta_i\right), \quad i = 1, \dots, C \qquad (2)$$
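Use Graph's A Priori Properties for Merging. We then use the input graph's a priori properties to merge these diverse sets of node embeddings. In our implementation, we use the number of nodes, the number of edges, and the average node degree to construct a 3-dimensional graph properties vector. Let $\mathbf{f} \in \mathbb{R}^{m}$ denote the a priori graph properties vector of a specific input graph, where $m$ is the number of graph properties to be considered. Based on $\mathbf{f}$, we employ a softmax normalization to obtain the attention weight of each GCN:
$$\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_C) = \mathrm{softmax}\left(W_{\mathrm{att}} \mathbf{f}\right) \qquad (3)$$
where $W_{\mathrm{att}} \in \mathbb{R}^{C \times m}$ is the weight matrix of a shared linear transformation that is applied to every input graph. Then, the multiple sets of node embeddings $Z_1^{(l)}, \dots, Z_C^{(l)}$ are first weighted and then merged into one set of node embeddings using a neural network:
$$Z^{(l)} = \mathrm{MLP}_{\mathrm{emb}}\left(\big\Vert_{i=1}^{C} \alpha_i \cdot Z_i^{(l)}\right) \qquad (4)$$
where "$\Vert$" denotes a row-wise concatenation operation, "$\cdot$" denotes scalar multiplication, and $\mathrm{MLP}_{\mathrm{emb}}$ is a single-layer MLP neural network. One hyperparameter of $\mathrm{MLP}_{\mathrm{emb}}$ is the output embedding dimension $d_{l+1}$, i.e., $Z^{(l)} \in \mathbb{R}^{n_l \times d_{l+1}}$, which is set by averaging the output dimensions of the multiple GCNs. Suppose $Z_i^{(l)} \in \mathbb{R}^{n_l \times d_{l+1,i}}$; we can set $d_{l+1} = \frac{1}{C} \sum_{i=1}^{C} d_{l+1,i}$.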
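The merging step can be illustrated with a small dependency-free sketch: build the 3-dimensional properties vector, turn it into softmax attention weights through a shared linear map, scale each GCN's embeddings by its weight, concatenate them per node, and apply a linear map. Several parts are assumptions made for illustration only: the average degree is computed as 2|E|/n for an undirected graph, the single-layer MLP is reduced to a bias-free linear map without activation, and all names and toy values are hypothetical.

```python
import math

def graph_properties(adj):
    """Return [number of nodes, number of edges, average node degree]."""
    n = len(adj)
    # Count each undirected edge once (upper triangle of the adjacency matrix).
    edges = sum(1 for i in range(n) for j in range(i + 1, n) if adj[i][j])
    return [float(n), float(edges), 2.0 * edges / n]

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_embeddings(embeddings, props, w_att, w_merge):
    """Attention-weighted concatenation of several embedding matrices, then a linear map."""
    # One attention score per GCN, from a shared linear map of the properties vector.
    scores = [sum(w * p for w, p in zip(row, props)) for row in w_att]
    alpha = softmax(scores)
    n = len(embeddings[0])
    # Scale each GCN's embeddings by its weight, then concatenate them per node.
    concat = [[a * v for a, z in zip(alpha, embeddings) for v in z[r]] for r in range(n)]
    return alpha, matmul(concat, w_merge)

# Toy triangle graph with two hypothetical embedding sets of widths 1 and 3.
adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
z1 = [[1.0], [2.0], [3.0]]
z2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
props = graph_properties(adj)
w_att = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # untrained weights give uniform attention
w_merge = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # output width (1 + 3) / 2 = 2
alpha, merged = merge_embeddings([z1, z2], props, w_att, w_merge)
```

With zero attention weights the two GCNs contribute equally; after training, graphs with different properties vectors would receive different `alpha` values.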
3.3 Multiplex Pooling
We follow DiffPool (Ying et al., 2018) to construct our multiplex pooling layer. We learn to assign nodes to clusters at each layer $l$ using the node embeddings and adjacency matrix generated from the previous layer. Specifically, at each layer $l$, we learn $P$ cluster assignment matrices $S_1^{(l)}, \dots, S_P^{(l)}$, and each cluster assignment matrix is generated as follows:
$$S_j^{(l)} = \mathrm{softmax}\left(\mathrm{GCN}_{\mathrm{pool},j}\left(A^{(l)}, X^{(l)}; \phi_j\right)\right), \quad j = 1, \dots, P \qquad (5)$$
Note that $\mathrm{GCN}_{\mathrm{pool},j}$ is a GCN different from the $\mathrm{GCN}_{\mathrm{emb},i}$ used in the convolution layer, although these two GNNs consume the same input data. Each row of $S_j^{(l)}$ corresponds to one of the $n_l$ nodes at layer $l$, and each column of $S_j^{(l)}$ corresponds to one of the $n_{l+1,j}$ clusters, so that $S_j^{(l)} \in \mathbb{R}^{n_l \times n_{l+1,j}}$. $\phi_j$ denotes the hyperparameter set of the $j$th pooling GCN. One important hyperparameter is the coarsening ratio, which determines the number of clusters to be assigned. Different pooling networks can have different numbers of clusters.
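To make the role of the coarsening ratio concrete, the following sketch derives the number of clusters from the ratio and the largest graph size, following the DiffPool setup described in the Introduction, and flags graphs that would be upsampled. The truncation to an integer, the helper names, and the two ratios are assumptions made for illustration; the node counts 620 and 4 are the maximum and minimum PROTEINS graph sizes listed in Table 3.

```python
def num_clusters(max_nodes, ratio):
    """Clusters per coarsened graph: ratio times the largest graph size (at least 1)."""
    return max(1, int(max_nodes * ratio))

def is_upsampled(n_nodes, clusters):
    """A graph is upsampled when it has fewer nodes than the number of clusters."""
    return clusters > n_nodes

max_nodes = 620  # largest PROTEINS graph according to Table 3
# Hypothetical coarsening ratios for two pooling networks.
clusters_by_ratio = {ratio: num_clusters(max_nodes, ratio) for ratio in (0.25, 0.5)}

# The smallest PROTEINS graph (4 nodes) would be split into many more clusters than nodes.
smallest_is_upsampled = is_upsampled(4, clusters_by_ratio[0.25])
largest_is_upsampled = is_upsampled(620, clusters_by_ratio[0.25])
```

Running several pooling networks with different ratios gives each graph at least one cluster count that is closer to its own size.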
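These generated assignment matrices are merged into a single assignment matrix in a way similar to that of the convolution operation:
$$S^{(l)} = \mathrm{MLP}_{\mathrm{pool}}\left(\big\Vert_{j=1}^{P} \beta_j \cdot S_j^{(l)}\right) \qquad (6)$$
where $\beta_j$ is the attention weight of the $j$th pooling network, obtained from the a priori graph properties as in Equation (3), and $\mathrm{MLP}_{\mathrm{pool}}$ is a single-layer MLP neural network. Given the number of nodes at the next layer, $n_{l+1}$, $\mathrm{MLP}_{\mathrm{pool}}$ should be configured to output an assignment matrix with $n_{l+1}$ columns, i.e., $S^{(l)} \in \mathbb{R}^{n_l \times n_{l+1}}$.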
The merged node embeddings $Z^{(l)}$ from Equation (4) and the merged assignment matrix $S^{(l)}$ from Equation (6) are used to generate embeddings for each of the clusters (Ying et al., 2018). The adjacency matrix $A^{(l)}$ and the merged assignment matrix $S^{(l)}$ are also used to generate a coarsened adjacency matrix denoting the edge weights between each pair of clusters:
$$X^{(l+1)} = {S^{(l)}}^{\top} Z^{(l)}, \qquad A^{(l+1)} = {S^{(l)}}^{\top} A^{(l)} S^{(l)} \qquad (7)$$
Here, since $S^{(l)} \in \mathbb{R}^{n_l \times n_{l+1}}$ and $Z^{(l)} \in \mathbb{R}^{n_l \times d_{l+1}}$, we have the cluster embeddings $X^{(l+1)} \in \mathbb{R}^{n_{l+1} \times d_{l+1}}$. Similarly, we have the coarsened adjacency matrix $A^{(l+1)} \in \mathbb{R}^{n_{l+1} \times n_{l+1}}$.
Note that the coarsened graph is a fully connected weighted graph, so the coarsened adjacency matrix $A^{(l+1)}$ is a real-valued matrix, and each of its entries denotes the edge weight between two clusters. The cluster embeddings $X^{(l+1)}$ and the coarsened adjacency matrix $A^{(l+1)}$ are then used as input to the next layer, where one cluster at layer $l$ corresponds to one node at layer $l+1$.
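The coarsening step follows the DiffPool rule: the cluster embeddings are the assignment-weighted sums of node embeddings, and the coarsened adjacency matrix is the assignment matrix applied on both sides of the adjacency matrix. The sketch below uses plain Python lists and a hard (0/1) assignment so that the result is easy to check by hand; a learned assignment matrix would contain soft probabilities instead. All names and values are illustrative assumptions.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def coarsen(adj, z, s):
    """Return (S^T Z, S^T A S): cluster embeddings and coarsened adjacency matrix."""
    s_t = transpose(s)
    x_next = matmul(s_t, z)               # one embedding row per cluster
    a_next = matmul(matmul(s_t, adj), s)  # edge weights between clusters
    return x_next, a_next

# Path graph 0 - 1 - 2; nodes 0 and 1 go to cluster 0, node 2 goes to cluster 1.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
z = [[1, 2], [3, 4], [5, 6]]
s = [[1, 0], [1, 0], [0, 1]]
x_next, a_next = coarsen(adj, z, s)
```

The diagonal entry of `a_next` for cluster 0 counts the edge inside that cluster in both directions, while the off-diagonal entries carry the single edge between the two clusters.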
3.4 Computational Complexity Analysis
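Finally, we discuss the time complexity of MxPool. Suppose there are $L$ layers in total and that, at each layer $l$, there are $C$ convolution GCNs. The $i$th GCN takes $O(e_l d_l)$ time for message passing and $O(n_l d_l d_{l+1,i})$ time for matrix multiplication (linear transformation), where $e_l$, $n_l$, and $d_l$ are respectively the number of edges, the number of nodes, and the node embedding size of layer $l$'s input graph, and $d_{l+1,i}$ is the $i$th GCN's output node embedding size. For simplicity of analysis, we assume that each GCN contains only one convolution layer. The total time for the $C$ GCNs is $O\left(\sum_{i=1}^{C} (e_l d_l + n_l d_l d_{l+1,i})\right)$. The time for learning the attention weights of the GCNs is $O(mC)$, where $m$ is the number of considered graph properties. The time for merging the GCNs' results is $O(n_l D_l d_{l+1})$, where $D_l = \sum_{i=1}^{C} d_{l+1,i}$ is the length of the concatenation of node embeddings. Hence, the total time for multiplex convolution at layer $l$ is $O\left(C e_l d_l + n_l d_l D_l + mC + n_l D_l d_{l+1}\right)$.
The multiplex pooling step has the same GCN process and attention weight assignment process, with $P$ pooling GCNs. In addition, the time for generating the assignment matrices is $O\left(\sum_{j=1}^{P} n_l n_{l+1,j}\right)$, where $n_{l+1,j}$ is the number of output clusters of the $j$th pooling network, and the time for merging the assignment matrices is $O(n_l Q_l n_{l+1})$, where $Q_l = \sum_{j=1}^{P} n_{l+1,j}$. Hence, the time for multiplex pooling at each layer is $O\left(P e_l d_l + n_l d_l Q_l + mP + n_l Q_l + n_l Q_l n_{l+1}\right)$. The time for generating the next layer's node embeddings and adjacency matrix with dense matrix products is $O\left(n_l n_{l+1} d_{l+1} + n_l^2 n_{l+1} + n_l n_{l+1}^2\right)$.
Therefore, the total time is the sum of these three terms over all $L$ layers, with $n_0 = n$ and $e_0 = |E|$, where $n$ and $|E|$ are the number of nodes and edges of the input graph, and $e_l = n_l^2$ if $l \geq 1$, since the coarsened graphs are fully connected. MxPool introduces additional convolution networks and pooling networks, which bring additional cost. However, the main computational cost results from the $O(n_l^2 n_{l+1})$ term for generating the next-layer coarsened graph, especially for large graphs where $n$ is large.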
4 Experiments
In this section, we compare MxPool with state-of-the-art graph representation learning methods on graph classification tasks.
Table 3: Statistics of the benchmark datasets.
dataset     graphs   classes   [min, max] nodes   [min, max] edges   [min, max] avg. degree
D&D         1178     2         [30, 5748]         [63, 14267]        [7.22, 17.87]
ENZYMES     600      6         [2, 125]           [1, 149]           [2.00, 10.46]
PROTEINS    1113     2         [4, 620]           [5, 1049]          [3.43, 10.14]
NCI109      4127     2         [4, 111]           [3, 119]           [2.50, 5.54]
COLLAB      5000     3         [60, 492]          [60, 40120]        [13.94, 952.02]
RDT-M12K    11929    11        [2, 3782]          [1, 5171]          [4.00, 26.37]
4.1 Experimental Settings
Datasets. In our experiments, we use four bioinformatics protein datasets, D&D (Dobson and Doig, 2003), ENZYMES (Borgwardt et al., 2005), PROTEINS (Borgwardt et al., 2005), and NCI109 (Wale et al., 2008), and two social network datasets, COLLAB (Yanardag and Vishwanathan, 2015) and RDT-M12K (Yanardag and Vishwanathan, 2015). Each of these datasets includes hundreds to thousands of graphs. The details of these datasets are provided in Table 3, which also lists the minimum and maximum number of nodes, number of edges, and average node degree of the graphs in each dataset. The graphs exhibit great diversity in size and complexity. The nodes in the bioinformatics graphs have categorical features. For the social graphs, whose nodes do not have features, we use an uninformative feature vector for all nodes. This allows all models to be compared with the same input representations.
Baselines. We consider the following state-of-the-art methods for graph classification as baselines. GraphSAGE (Hamilton et al., 2017) is a graph convolution framework proposed for semi-supervised node classification; combined with global mean-pooling over the learned node representations, it can be used for graph representation learning. SortPool (Zhang et al., 2018) is a global pooling method that uses sorting for pooling. It is built upon the GCN layer, where the node features are sorted before being fed into traditional 1D convolutional and dense layers. gPool (Gao and Ji, 2019) performs pooling by adaptively selecting the top-k nodes, based on their scalar projection values on a trainable projection vector, to form a smaller graph. SAGPool (Lee et al., 2019) is a Self-Attention Graph Pooling method for GNNs in the context of hierarchical graph pooling. The self-attention mechanism is used to distinguish between the nodes that should be dropped and the nodes that should be retained. GIN (Xu et al., 2019), short for Graph Isomorphism Network, is shown to be more powerful than traditional GNNs and as powerful as the Weisfeiler-Lehman graph isomorphism test. DiffPool (Ying et al., 2018) is the first end-to-end trainable graph pooling method that learns hierarchical representations of graphs.
Experimental Setup. In order to remove unwanted bias towards the training data, we use 10-fold cross-validation for all the baselines and our approach. Since DiffPool is the key component of MxPool, we use the same hyperparameters for the DiffPool baseline as in MxPool. For the hyperparameters of GraphSAGE, SortPool, gPool, SAGPool, and GIN, we follow the experimental setups described in their original papers. We adopt the widely used evaluation metric for graph classification, accuracy, to evaluate performance. The final score of each test fold is obtained as the mean of three runs, which smooths out the effect of unfavorable random weight initializations. All models are trained using one NVIDIA GeForce RTX 2080 Ti GPU.
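The evaluation protocol (10 folds, three runs per fold, mean accuracy) can be sketched as below. This is a simplified illustration rather than the evaluation script used for the reported numbers: the fold construction, the seeds, and the `train_and_eval` callback (a stand-in for training a model and returning its test accuracy) are our own assumptions.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, train_and_eval, k=10, runs=3):
    """Mean over folds of the mean test score of `runs` differently seeded runs."""
    folds = k_fold_indices(n, k)
    fold_scores = []
    for f, test_idx in enumerate(folds):
        # All samples outside the current fold are used for training.
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        run_scores = [train_and_eval(train_idx, test_idx, seed=r) for r in range(runs)]
        fold_scores.append(sum(run_scores) / runs)
    return sum(fold_scores) / k

def dummy_train_and_eval(train_idx, test_idx, seed):
    # Placeholder score: stands in for training a model and measuring test accuracy.
    return len(test_idx) / (len(train_idx) + len(test_idx))

folds = k_fold_indices(100)
score = cross_validate(100, dummy_train_and_eval)
```

Averaging the three runs inside each fold before averaging across folds keeps a single unlucky initialization from dominating the score of that fold.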
MxPool Configurations. We implement MxPool using PyTorch. The convolution GNN model is the original Graph Convolution Network model (Kipf and Welling, 2017). The pooling GNN model is the DiffPool model. The model configurations for the convolution GNN and the pooling GNN are the same as in DiffPool. In addition, MxPool comprises multiple graph convolution networks and multiple graph pooling networks with different sets of hyperparameters to learn graph features from different perspectives. We concurrently run 3 graph convolution networks and 3 graph pooling networks. The a priori graph properties vector used in our model contains 3 graph properties: the number of nodes, the number of edges, and the average node degree. These a priori graph properties are computed for each graph before training starts. We train our networks using the Adam optimizer with a learning rate of 0.001.
4.2 Accuracy Results on Graph Classification
Table 4: Graph classification accuracy (%). OOR denotes an out-of-resource error.
Baselines                          D&D     ENZYMES   PROTEINS   NCI109   COLLAB   RDT-M12K
BaseLine (Errica et al., 2020)     78.07   61.72     75.16      66.95    55.65    23.58
GraphSAGE (Hamilton et al., 2017)  72.36   33.25     70.48      76.50    68.25    42.20
SortPool (Zhang et al., 2018)      78.32   31.29     73.54      70.80    73.76    31.44
gPool (Gao and Ji, 2019)           75.01   48.33     71.63      74.52    71.12    OOR
SAGPool (Lee et al., 2019)         76.94   43.99     72.91      72.51    79.27    43.25
GIN (Xu et al., 2019)              75.57   48.32     71.65      75.44    79.48    47.22
DiffPool (Ying et al., 2018)       80.01   62.17     75.96      80.10    71.78    47.05
MxPool (Ours)                      81.13   69.53     78.40      83.05    77.20    47.52
We evaluate our proposed MxPool on six benchmark datasets and compare it with several state-of-the-art baselines. The baseline method proposed in (Errica et al., 2020) is structure-agnostic and only exploits node features. The accuracy results are reported in Table 4. For gPool, SAGPool, and GIN, we use their PyTorch Geometric implementations to test graph classification accuracy. For the gPool baseline (Gao and Ji, 2019), we encounter an out-of-resource error when processing the RDT-M12K dataset, and we denote this case by 'OOR'.
From the table, we observe that our proposed MxPool approach obtains the best performance on 5 out of 6 benchmark datasets. GIN and SAGPool are slightly better than MxPool on the COLLAB dataset, but MxPool performs better than GIN and SAGPool on the other datasets; averaged over all six datasets, MxPool improves upon GIN and SAGPool by 6.53% and 7.99%, respectively. The DiffPool model is the most similar one to our MxPool, since we extend DiffPool by employing the multiplex structure. Our model outperforms DiffPool by an average of 3.29%, which can be attributed to the effect of the multiplex structure. In particular, MxPool shows a significant performance improvement over DiffPool on multi-class classification. For example, MxPool outperforms DiffPool on the ENZYMES dataset (with 6 classes) by 7.36% and on the COLLAB dataset (with 3 classes) by 5.42%, while it outperforms DiffPool on the other datasets by an average of only 1.75%.
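The average improvements quoted above can be recomputed directly from the rows of Table 4. The short check below does this with plain Python; the accuracy lists are copied from the table in column order (D&D, ENZYMES, PROTEINS, NCI109, COLLAB, RDT-M12K), and the helper name `mean_gain` is our own.

```python
# Accuracy rows copied from Table 4, in column order.
mxpool   = [81.13, 69.53, 78.40, 83.05, 77.20, 47.52]
diffpool = [80.01, 62.17, 75.96, 80.10, 71.78, 47.05]
gin      = [75.57, 48.32, 71.65, 75.44, 79.48, 47.22]
sagpool  = [76.94, 43.99, 72.91, 72.51, 79.27, 43.25]

def mean_gain(ours, other):
    """Average accuracy difference (in percentage points) over the listed datasets."""
    return sum(a - b for a, b in zip(ours, other)) / len(ours)

gain_over_diffpool = mean_gain(mxpool, diffpool)  # about 3.29
gain_over_gin = mean_gain(mxpool, gin)            # about 6.525, reported as 6.53
gain_over_sagpool = mean_gain(mxpool, sagpool)    # about 7.99

# Gain over DiffPool on the four datasets other than ENZYMES and COLLAB.
others = [0, 2, 3, 5]
gain_rest = mean_gain([mxpool[i] for i in others], [diffpool[i] for i in others])  # about 1.75
```

The per-dataset gaps over DiffPool on ENZYMES and COLLAB are 7.36 and 5.42 points, which is why those two datasets dominate the overall 3.29-point average.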
4.3 Effects of Multiplex Convolution/Pooling
Our motivation for this work is to use a multiplex hierarchical structure to deal with the diversity and complexity challenges in graph representation learning. Multiple graph convolutional networks with different sets of hyperparameters are used to learn node representations. The node embedding size, a hyperparameter of GCN, plays an important role in determining the quality of the node representations, so we vary the node embedding size across the GCN networks. Likewise, multiple graph pooling networks (i.e., DiffPool) with different sets of hyperparameters are used to coarsen graphs. The compression ratio, a hyperparameter of DiffPool, plays an important role in determining the quality of the graph representation, so we vary the compression ratio across the DiffPool networks.
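Table 5: Accuracy (%) of MxPool variations, where $c_1, c_2, c_3$ denote the three graph convolution settings and $p_1, p_2, p_3$ the three graph pooling settings.
Variations   Setting                          D&D     ENZYMES   PROTEINS   NCI109   COLLAB
SCSP         $c_1$, $p_1$                     80.01   61.42     75.48      80.07    71.31
SCSP         $c_2$, $p_1$                     79.33   62.17     74.47      80.10    71.33
SCSP         $c_3$, $p_1$                     78.75   61.44     75.14      78.49    71.14
MCSP         $c_1, c_2, c_3$; $p_1$           80.47   68.22     76.30      81.89    73.43
SCSP         $c_1$, $p_1$                     80.01   61.42     75.48      80.07    71.31
SCSP         $c_1$, $p_2$                     78.64   60.75     75.96      78.47    71.78
SCSP         $c_1$, $p_3$                     78.79   61.58     74.52      78.95    71.02
SCMP         $c_1$; $p_1, p_2, p_3$           80.06   63.02     76.01      80.30    73.05
MCMP         $c_1, c_2, c_3$; $p_1, p_2, p_3$ 81.13   69.53     78.40      83.05    77.20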
In order to verify the effectiveness of multiplex convolution and multiplex pooling, we run MxPool with a single convolution network and a single pooling network (SCSP), multiple convolution networks and a single pooling network (MCSP), a single convolution network and multiple pooling networks (SCMP), and multiple convolution networks and multiple pooling networks (MCMP). The accuracy results on 5 datasets for graph classification are shown in Table 5. Since the suitable node embedding sizes and compression ratios differ across datasets, we use $c_1, c_2, c_3$ to denote three different graph convolution parameters (i.e., node embedding sizes) and $p_1, p_2, p_3$ to denote three different graph pooling parameters (i.e., coarsening ratios). Note that their values differ from dataset to dataset. The detailed parameter settings are available on our GitHub project page.
From the table, we observe that the multiplex structure improves performance over the singular structure. With the pooling network fixed to $p_1$, multiplexing three convolution networks with hyperparameters $c_1, c_2, c_3$ consistently performs better than using a single convolution network with either $c_1$, $c_2$, or $c_3$. A similar trend can be observed when multiplexing pooling networks. Both multiplex convolution and multiplex pooling play an important role in improving performance, and the best choice is to use them together (i.e., MCMP).
4.4 Distribution of Attention Weights
In order to learn graph features from a diverse dataset in which graphs exhibit very different properties, MxPool employs multiple convolution networks and multiple pooling networks to learn graph features from different aspects. By exploiting the a priori graph properties (e.g., number of nodes, number of edges, and average degree), MxPool assigns different attention weights to different convolution networks and to different pooling networks. In order to verify that the attention weights can be learned through our end-to-end structure, we run the MxPool training process on the NCI109 dataset and record the learned attention weights after training is completed.
The recorded attention weight distribution of the multiple convolution networks is shown in Figure 4. We group the attention weights by the number of nodes in the graph and show the average attention weight of each group. That is, we show the attention weight distribution for graphs of different sizes, so that we can see whether the distribution varies with graph size. The share of attention weight given to each convolution network is depicted in Figure 4(a). We observe that graphs with different numbers of nodes indeed obtain different attention weight distributions, and that the attention weight of each convolution network clearly depends on the graph size (number of nodes in the graph). Figure 4(b) and Figure 4(c) show similar trends for graphs with different numbers of edges and with different average node degrees.
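The grouping behind such a plot can be sketched as follows: each recorded graph contributes its node count and its attention weight vector, graphs are binned by node count, and the weights are averaged within each bin. The fixed-width binning, the function name, and the sample records are assumptions for illustration; they are not the recorded NCI109 values.

```python
def mean_weights_by_size(records, bin_width):
    """Average attention weight vectors of graphs whose node counts fall in the same bin."""
    sums, counts = {}, {}
    for n_nodes, weights in records:
        # Lower edge of the node-count bin this graph belongs to.
        key = (n_nodes // bin_width) * bin_width
        if key not in sums:
            sums[key] = [0.0] * len(weights)
            counts[key] = 0
        sums[key] = [s + w for s, w in zip(sums[key], weights)]
        counts[key] += 1
    return {key: [s / counts[key] for s in sums[key]] for key in sorted(sums)}

# Hypothetical records: (number of nodes, attention weights of three convolution networks).
records = [
    (12, [0.6, 0.3, 0.1]),
    (18, [0.4, 0.4, 0.2]),
    (35, [0.2, 0.3, 0.5]),
]
averages = mean_weights_by_size(records, bin_width=10)
```

The same helper can be reused with the number of edges or the average node degree as the first element of each record to obtain the other two groupings.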
The recorded attention weight distribution of the multiple pooling networks is shown in Figure 5, where we observe a trend similar to that of the convolution networks in Figure 4. These results verify that MxPool has indeed learned different attention weights for different convolution/pooling networks. This supports our idea of constructing multiplex structures, because the multiplex structure helps learn graph features from graphs of different sizes.
4.5 Number of Convolution/Pooling Networks
The number of convolution/pooling networks is a hyperparameter in MxPool. In the previous experiments, we used a fixed number of convolution/pooling GNNs. In this experiment, we vary the number of convolution/pooling networks from 1 to 5 and evaluate the performance, keeping the number of convolution networks equal to the number of pooling networks. The graph classification accuracy on the ENZYMES and PROTEINS datasets is listed in Table 6. From the table, we can see that the best performance is achieved when the number is set to around 3 to 4. As the number increases further, the performance drops. This may be because too many networks, with a large number of parameters, cause overfitting.
Dataset    1      2      3      4      5
ENZYMES    62.53  59.83  69.53  65.33  61.71
PROTEINS   76.25  74.31  78.40  77.36  76.64
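As a small worked check of the reported accuracies, the sketch below picks the best-performing number of networks per dataset. It assumes that the five accuracy columns correspond to 1 through 5 networks, in that order, as the experiment description suggests. The helper name is hypothetical.

```python
# Accuracies as reported in Table 6; column order assumed to be 1..5 networks.
accuracy = {
    "ENZYMES":  [62.53, 59.83, 69.53, 65.33, 61.71],
    "PROTEINS": [76.25, 74.31, 78.40, 77.36, 76.64],
}

def best_network_count(scores):
    """Return the 1-based number of networks with the highest accuracy."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index + 1

best = {name: best_network_count(scores) for name, scores in accuracy.items()}
```

Under this reading, both datasets peak at three networks, with four networks as the second-best setting.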
5 Conclusion
In this paper, we proposed MxPool, a simple but effective multiplex GNN architecture for hierarchical graph representation learning. MxPool comprises multiple graph convolution networks to learn node-level representations and multiple graph pooling networks to coarsen the graph. A priori graph properties are used to assign an attention weight to each convolution/pooling network, so that the diversity challenge of graph representation learning can be addressed. Our results show that MxPool improves performance over state-of-the-art graph representation learning methods. Future work includes designing unpooling layers to form an encoder-decoder learning structure for node classification and link prediction tasks.