SimPool: Towards Topology Based Graph Pooling with Structural Similarity Features

06/03/2020, by Yaniv Shulman et al., Aleph Zero Records

Deep learning methods for graphs have seen rapid progress in recent years, with much focus awarded to generalising Convolutional Neural Networks (CNN) to graph data. CNNs are typically realised by alternating convolutional and pooling layers, where the pooling layers subsample the grid and exchange spatial or temporal resolution for increased feature dimensionality. Whereas the generalised convolution operator for graphs has been studied extensively and proven useful, hierarchical coarsening of graphs is still challenging since nodes in graphs have no spatial locality and no natural order. This paper proposes two main contributions. The first is a differentiable module that calculates structural similarity features based on the adjacency matrix. These structural similarity features may be used with various algorithms; however, in this paper the focus, and the second main contribution, is on integrating these features with a revisited pooling layer, DiffPool arXiv:1806.08804, to propose a pooling layer referred to as SimPool. This is achieved by linking the concept of network reduction by means of structural similarity in graphs with the concept of hierarchical localised pooling. Experimental results demonstrate that, as part of an end-to-end Graph Neural Network architecture, SimPool calculates node cluster assignments that more closely resemble the locality-preserving pooling operations used by CNNs operating on local receptive fields in the standard grid. Furthermore, the experimental results demonstrate that these features are useful in inductive graph classification tasks with no increase in the number of parameters.


1 Introduction

Deep learning methods have proven very successful at capturing hidden patterns in Euclidean data and have obtained state-of-the-art results in various applications such as time series regression, similarity metric learning, machine vision and natural language processing. However, there are many applications where data is best represented in the form of graphs, such as the representation of molecules in chemistry, relationship networks, recommender systems and applications in traffic management Wu_2020 . Machine learning algorithms that are designed to operate on regular grid data are not readily applicable to graphs since these generally have irregular structure with varying numbers of unordered nodes hamilton2017representation . Due to the irregularity of graphs and the lack of natural order of nodes and edges it is challenging to generalise some common grid operations such as shifting, convolutions and coarsening to arbitrary graphs. However, in recent years much progress has been made towards unifying deep learning frameworks that operate on regular grids with learning frameworks for graphs of arbitrary topology (7974879). In particular, one such family of models are Graph Neural Networks (GNN) that utilise similar mechanics to other Artificial Neural Networks (ANN) and Convolutional Neural Networks (CNN) Goodfellow-et-al-2016 to enable a unified approach to learning on both standard grid and arbitrary graph represented data hamilton2017representation .

Similarly to CNNs, GNNs learn representations of nodes and graphs as points in a vector space so that geometric relationships between vectors in the embedding space provide sufficient information about the node and graph structure to solve a learning task. An important aspect of end-to-end CNNs operating on regular grid input is the hierarchical coarsening of the grid by localised pooling. CNNs take advantage of the stationarity (shift-invariance) and compositionality of grid arranged data by utilising local statistics, and impose a prior on the data by virtue of the CNN architecture 7974879 . CNNs are typically realised by alternating convolutional and pooling layers where the pooling layers subsample the grid and exchange spatial or temporal resolution for increased feature dimensionality. The output features are translation invariant or covariant depending on whether hierarchical grid coarsening is performed by means of pooling or the resolution is kept constant 7974879 .

Whereas pooling operations can be naturally defined on generalised grid graphs, extending these operations to graph data of arbitrary topology is challenging since nodes in graphs have no spatial locality and no natural order. However, coarsening of the graph is a critical step in generating graph embeddings and is required at least once, at the point of merging the individual node embeddings into a representation of the entire graph, as in the models in DBLP:conf/icml/LiGDVK19 ; 7974879 ; Zhang2018AnED ; 10.5555/3305381.3305512 . Transformation of all nodes into a single representation is referred to as global pooling and is useful in inductive learning tasks where graphs typically have different numbers of nodes. However, global pooling does not take the topology of the graph into account since all node embeddings are aggregated at once using the same procedure. Whereas much attention has been devoted to GNNs in general, including global pooling strategies, the hierarchical coarsening of graphs as part of an end-to-end graph embedding network does not seem to have been as intensely researched.

In social studies the concept of structural equivalence lorrain1971structural is an important explanatory factor in the study of social homogeneity Borgatti_Structural_Equivalence . A common definition of structural equivalence is that two nodes are structurally equivalent if they share the same neighbourhoods. Given an undirected graph G = (V, E), a pair of nodes u, v ∈ V are structurally equivalent if N(u) = N(v), where N(w) is the set of nodes neighbouring w (connected to w by an edge) leicht2005vertex . For directed graphs, the definition changes such that u and v have incoming edges from the same nodes and outgoing edges to the same nodes. Structural equivalence was introduced as a method for reducing models of social networks, e.g. to block models. However, it rapidly gained importance as an approach to formalise the concept of relational role or position, based on the idea that structurally equivalent nodes share many social attributes Borgatti_Structural_Equivalence .

In the remainder of this paper the focus is on bridging the gap between CNNs and GNNs to enable GNNs to perform hierarchical coarsening of a graph similarly to CNNs that operate on local receptive fields in the standard grid, such that nearby nodes are more likely to belong to the same cluster in the coarsened graph. This is achieved by linking the concept of network reduction by means of structural similarity in graphs with the concept of hierarchical localised pooling. To summarise, the key contributions of this work are:

  1. A differentiable module for calculating structural similarity features based on the adjacency matrix by defining a differentiable representation of the top-k operator that is applicable to graphs of various sizes. Furthermore, these features may be used in conjunction with various algorithms to calculate node pooling assignments, or used in learning tasks where graph structure conveys critical information to augment or replace the node features.

  2. Revisiting the DiffPool algorithm proposed in NIPS2018_7729 and integrating it with the structural similarity features to propose SimPool, a pooling layer which calculates node pooling assignments that more closely resemble the locality-preserving pooling operations used by CNNs that operate on local receptive fields in the standard grid.

2 Related Work

There has been extensive research on GNNs in recent years Wu_2020 ; hamilton2017representation . These include methods related to spectral graph convolutions that generalise a convolutional network through the Graph Fourier Transform DBLP:journals/corr/BrunaZSL13 ; Henaff2015DeepCN ; 10.5555/3157382.3157527 ; DBLP:journals/corr/KipfW16 . GNNs such as DBLP:journals/corr/KipfW16 ; column_networks ; graphsage ; DBLP:conf/icml/LiGDVK19 ; simonovsky2017dynamic typically use an approach of generating embeddings for a node or a graph by iteratively aggregating the features of neighbouring nodes. These methods feature a number of desirable attributes such as localised representations, incorporating graph structure, leveraging node features, and being usable in inductive learning settings since they are capable of generating embeddings for nodes or graphs not present during training hamilton2017representation .

Previous research includes hierarchical coarsening of graphs by combining GNNs with deterministic graph clustering algorithms, such as the SortPooling layer proposed in Zhang2018AnED that sorts the node features consistently before inputting them into 1-D convolutional and dense layers, the use of the VoxelGrid algorithm in simonovsky2017dynamic , and the use of the Graclus algorithm 10.1109/TPAMI.2007.1115 for clustering combined with localised spectral convolution 10.5555/3157382.3157527 . A somewhat related pooling approach is suggested in DBLP:conf/icml/GaoJ19 , which performs down-sampling on graph data by selecting the top-k subset of nodes having the largest projection magnitude on a 1-D trainable projection vector. A different end-to-end approach is taken by NIPS2018_7729 where in each pooling layer a differentiable soft assignment matrix is learned by a GNN which computes assignment weights for every node to each of the clusters in the coarsened graph. A self-attention based method coined SAGPool is proposed in 35137c0f4e904fc0ab786021ead07852 where an attention mask is computed by a GNN to determine which nodes are passed on to the next layer.

3 Problem Description

Methods utilising node features for calculating cluster assignments or node selection will, by design, assign nodes having similar neighbourhood features to the same cluster. Methods that select nodes by sorting, ranking or an attention mechanism, where such operations are based on node features, will likewise by design select nodes having similar features and neighbourhoods, and thus lose information simply due to weak role similarity with some other real or virtual nodes. These issues are common to all the methods that perform coarsening by message passing GNNs such as Zhang2018AnED ; DBLP:conf/icml/GaoJ19 ; NIPS2018_7729 ; 35137c0f4e904fc0ab786021ead07852 . This type of clustering/filtering is strongly related to the notion of role similarity and is likely, especially in graphs with repeating structures (e.g. molecular datasets), to assign distant nodes to the same cluster NIPS2018_7729 . Whilst role similarity based clustering is a useful concept in its own right, it is dissimilar to the pooling layers of grid graphs typically used by CNNs, which compose receptive fields of adjacent non-overlapping partitions of the data and are thus able to leverage local statistics of the data such as stationarity and compositionality 7974879 .

4 Preliminaries

In this section a summary of related methods is given to set the notation and naming conventions used in subsequent sections. Let G = (V, E) denote a graph where V and E are the sets of nodes and edges respectively. Typically each node v ∈ V is associated with a node feature vector, and each edge is associated with an edge feature vector. Furthermore, let |·| denote the cardinality of a set (i.e., the number of elements in it), and let n_l and d_l denote the number of nodes (clusters) and the dimensionality of the node feature vectors in layer l respectively.

4.1 Graph matching networks

Graph Matching Networks (GMN) DBLP:conf/icml/LiGDVK19 propose a message passing global pooling GNN architecture that transforms a set of node and edge features into an embedding vector. GMN introduces three layers, two of which, the encoder layer and the propagation layer, are used in this work. The encoder layer defined by DBLP:conf/icml/LiGDVK19 transforms the node and edge features separately as follows:

h_i^{(0)} = \mathrm{MLP}_{\mathrm{node}}(x_i) \;\; \forall i \in V, \qquad e_{ij} = \mathrm{MLP}_{\mathrm{edge}}(x_{ij}) \;\; \forall (i,j) \in E \qquad (1)

The propagation layer transforms a set of node representations to new node representations as follows:

m_{j \to i} = f_{\mathrm{message}}\big(h_i^{(t)}, h_j^{(t)}, e_{ij}\big), \qquad h_i^{(t+1)} = f_{\mathrm{node}}\big(h_i^{(t)}, \textstyle\sum_{j : (j,i) \in E} m_{j \to i}\big) \qquad (2)

Where f_message is usually an MLP applied to the concatenated inputs, and f_node can be either an MLP or a Recurrent Neural Network (RNN) core. The encoder is normally the first hidden layer in the GMN model, and by stacking multiple propagation layers the representation of each node accumulates information from an increasingly large neighbourhood.
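To make the message passing scheme above concrete, the following is a minimal numpy sketch of one propagation step in the spirit of (2). It is not the authors' implementation; the function and parameter names (propagation_step, W_msg, W_node) are illustrative, and single dense layers stand in for the message and node-update MLPs.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def propagation_step(h, edges, edge_feat, W_msg, W_node):
    """One message-passing step in the spirit of Eq. (2).

    h         : (n, d) node representations h^(t)
    edges     : list of (i, j) index pairs
    edge_feat : (m, d_e) edge features aligned with `edges`
    W_msg     : (2*d + d_e, d_m) weights of a single-layer message MLP
    W_node    : (d + d_m, d_out) weights of a single-layer node-update MLP
    """
    n = h.shape[0]
    agg = np.zeros((n, W_msg.shape[1]))
    for (i, j), e in zip(edges, edge_feat):
        # message from node j to node i, computed from both endpoints and the edge feature
        agg[i] += relu(np.concatenate([h[i], h[j], e]) @ W_msg)
    # node update from the previous state and the aggregated incoming messages
    return relu(np.concatenate([h, agg], axis=1) @ W_node)
```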

4.2 Graph convolutional network

One commonly used variant of approximate spectral convolution GNN is the Graph Convolutional Network (GCN) DBLP:journals/corr/KipfW16 , having the layer-wise forward propagation rule:

H^{(l+1)} = \sigma\big(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\big) \qquad (3)

Where Ã = A + I_N is the undirected adjacency matrix of the graph with added self-connections; I_N is the identity matrix; W^(l) is the trainable weight matrix for layer l; σ(·) is an activation function; H^(l) is the matrix of activations in the l-th layer; D̃_ii = Σ_j Ã_ij; and H^(0) = X is the node feature matrix.
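As a point of reference, a minimal numpy sketch of the propagation rule (3) is given below; it is a plain re-statement of the GCN rule from DBLP:journals/corr/KipfW16 rather than code from this paper.

```python
import numpy as np

def gcn_layer(A, H, W, activation=lambda x: np.maximum(x, 0.0)):
    """One GCN layer, Eq. (3): H' = sigma(D~^-1/2 (A + I) D~^-1/2 H W)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                    # adjacency with self-connections
    d = A_tilde.sum(axis=1)                    # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalisation
    return activation(A_hat @ H @ W)
```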

4.3 DiffPool

DiffPool NIPS2018_7729 enables the construction of deep end-to-end multi-layer GNN models with hierarchical pooling by incorporating a differentiable layer that pools graph nodes. At the core of the model is an assignment matrix S^(l), calculated by a GNN, that learns soft cluster assignments of each node at pooling layer l to a cluster in the following coarsened pooling layer l+1, where each row of S^(l) denotes the probabilities of the corresponding node in pooling layer l being assigned to each of the clusters in pooling layer l+1.

S^{(l)} = \mathrm{softmax}\big(\mathrm{GNN}_{l,\mathrm{pool}}\big(A^{(l)}, X^{(l)}\big)\big) \qquad (4)

Where A^(l) and X^(l) are the graph adjacency matrix and the node feature matrix in layer l respectively.
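For clarity, the sketch below shows how one DiffPool coarsening step uses the assignment matrix of (4) to produce the pooled features and adjacency matrix, following NIPS2018_7729; the assignment logits and node embeddings are assumed to come from whatever pooling and embedding networks are used.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diffpool_step(A, Z, S_logits):
    """One DiffPool coarsening step.

    A        : (n, n) adjacency matrix at the current layer
    Z        : (n, d) node embeddings produced by an embedding GNN
    S_logits : (n, c) unnormalised assignment scores, e.g. from Eq. (4)
    """
    S = softmax(S_logits, axis=1)   # soft assignment of each node to c clusters
    X_next = S.T @ Z                # (c, d) pooled node features
    A_next = S.T @ A @ S            # (c, c) pooled adjacency matrix
    return X_next, A_next, S
```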

5 SimPool

5.1 Graph structural similarity matrix

Let R be a graph similarity matrix where R_ij is given by a real-valued function that quantifies the similarity between nodes i and j in G. In particular, let R denote the graph structural similarity matrix whose calculation is defined as:

R_{ij} = \mathrm{sim}_{\cos}\big(\tilde{A}^{p}_{:,i}, \tilde{A}^{p}_{:,j}\big) \qquad (5)

Where Ã is a symmetric adjacency matrix with optionally added self-connections, p controls the neighbourhood size, sim_cos is the standard cosine similarity measure, and Ã^p_{:,i} and Ã^p_{:,j} are the i-th and j-th column vectors of Ã^p respectively. Adding self-connections is important to improve the similarity representation in certain instances, such as when a graph is undirected and non-reflexive (i.e., has no edges from a node to itself). In such graphs, if two nodes are connected it is not possible for them to be structurally equivalent. In particular, without self-connections the similarity is zero for nodes that are connected directly by an edge but do not share any other common neighbours.

In the case where the adjacency matrix is asymmetric, the calculation of R is modified as follows:

R_{ij} = \mathrm{sim}_{\cos}\big(C_{:,i}, C_{:,j}\big), \qquad C = \big[\tilde{A}^{p} ; (\tilde{A}^{p})^{\top}\big] \qquad (6)

Where [· ; ·] concatenates its inputs along the columns, returning a 2n × n matrix C so that each column describes both the incoming and outgoing connectivity of the corresponding node. Note that the choice of similarity measure is flexible and other similarity (distance) measures can be used, such as the Hamming distance; however, in this work the scope is limited to the cosine similarity measure as described in this section.
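A minimal numpy sketch of the structural similarity matrix of (5) and (6) is shown below. The handling of the asymmetric case (stacking A^p on top of its transpose so that every column describes both incoming and outgoing connectivity) follows the reading of (6) given above and should be treated as an assumption rather than the paper's exact formulation.

```python
import numpy as np

def structural_similarity(A, p=1, add_self_loops=True, eps=1e-12):
    """Structural similarity matrix: cosine similarity between columns of A^p."""
    A = np.asarray(A, dtype=float)
    if add_self_loops:
        A = A + np.eye(A.shape[0])
    Ap = np.linalg.matrix_power(A, p)
    if not np.allclose(Ap, Ap.T):
        # asymmetric case (Eq. 6): stack A^p on top of its transpose so each
        # column captures both incoming and outgoing connectivity
        Ap = np.concatenate([Ap, Ap.T], axis=0)
    C = Ap / (np.linalg.norm(Ap, axis=0) + eps)   # column-normalise
    return C.T @ C                                # R[i, j] = cos(col i, col j)
```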

5.2 Structural similarity features

The utility of R becomes apparent when each node is associated with its corresponding row of R; this set of features is referred to as the structural similarity features. The size of the neighbourhood affecting the calculation of the similarity measure depends on Ã^p and is controlled by the power p. Note that the parameter p can be adjusted for each pooling layer independently, which may be useful to increase in layers where connectivity is sparse. These features capture the topological similarity between nodes in the graph and can be utilised by various algorithms, optionally combined with additional node features such as the node labels or attributes, to calculate node cluster assignments that strongly relate to structural similarity. By utilising R for pooling calculations such as node cluster assignments, GNNs may now be more closely aligned with CNNs that operate on grids, where typically the pooling operation depends only on structure, and role similarity is propagated through the composition of filters that operate on the data.

5.3 Indices mapping trick

In datasets that contain graphs of various sizes, as is typically the case in inductive learning tasks such as graph classification, the dimensionality of the structural similarity features varies from example to example. Thus it is impossible to use these features with many standard ANN layers that require a fixed input dimensionality. Consequently, additional processing of the structural similarity features is proposed that results in features of constant dimensionality. Intuitively, it is vital to retain the indices of the nodes that are most similar to a given node, together with some notion of their significance. Unfortunately, the standard top-k operation for selecting the most similar nodes is not differentiable and cannot be trivially used with gradient based training. However, a method referred to as the indices mapping trick is proposed as an alternative differentiable representation of the top-k operator:

(7)

Where the top-k operator returns the matrix containing the indices (assuming one-based indexing) of the top-k values for each row of its input, ordered by descending input value (not descending indices). These indices correspond to the top-k most similar nodes for each node in order of descending similarity. A scalar determines the magnitude of separation between mapped indices, where lower values increase separation but reduce the possibility of preserving information about similarity magnitude; a column vector of ones of appropriate dimension is used in the mapping; ⊙ is the Hadamard product operator; the element-wise indicator operator returns a binary matrix where an output element is 1 when the corresponding input element differs from 0, and 0 when the corresponding input element is 0; and the result is the final processed structural similarity features matrix. Note that if the graph has fewer than k nodes then the output is padded with zeroes.

Assuming a non-negative adjacency matrix, the result is a non-overlapping mapping of the graph node indices to a bounded range, organised in a matrix such that the first column contains the mapped indices of the most similar node to each corresponding node, the second column contains the mapped indices of the second most similar nodes, and so forth. The indices mapping trick (7) fulfils the following three key criteria:

  1. The typical sparseness of a graph is exploited for dimensionality reduction in a way that retains the significant topological information inherent in the graph structural similarity features.

  2. The dimensionality of the resulting features is constant to enable use of ANN layers that can accept only inputs of fixed dimensionality.

  3. End-to-end differentiability is maintained to enable use of gradient based optimisation methods.

Since there is substantial redundancy in the calculation as defined in (7), an alternative, more efficient method of calculation is suggested:

(8)

Where the first operator returns a matrix containing the column indices only, and the second returns the indices of zero-valued elements in its matrix argument.
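Since the exact constants of (7) and (8) are not reproduced here, the following is only a plausible forward-pass sketch of the indices mapping trick: one-based node indices of the top-k most similar nodes are mapped into a small positive range, ordered by descending similarity, zeroed where the similarity itself is zero, and zero-padded when the graph has fewer than k nodes. In the paper the operation is built from differentiable primitives, which this plain numpy sketch does not attempt to mirror.

```python
import numpy as np

def indices_mapping(R, k, c=1.0):
    """Plausible forward computation of the indices mapping trick (Eq. 7).

    R : (n, n) structural similarity matrix
    k : number of most similar nodes to keep per node
    c : scalar controlling the separation of the mapped indices
    Returns an (n, k) matrix of mapped indices, zero-padded if n < k.
    """
    n = R.shape[0]
    k_eff = min(k, n)
    Z = np.zeros((n, k))
    top = np.argsort(-R, axis=1)[:, :k_eff]      # top-k indices per row, by descending similarity
    for i in range(n):
        sims = R[i, top[i]]
        mapped = (top[i] + 1) / (c * n)          # one-based indices mapped into (0, 1/c]
        Z[i, :k_eff] = mapped * (sims != 0)      # zero out entries with no similarity
    return Z
```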

5.4 Assignment matrix

Following the definition in NIPS2018_7729 , let S^(l) denote the learned soft cluster assignment matrix of each node at pooling layer l to a cluster in the following coarsened pooling layer l+1. To perform effective pooling, in each pooling layer an assignment matrix needs to be learned that considers the coarsened graph connectivity at that layer. Therefore it is desirable to use features related to the graph connectivity, such as the structural similarity features, to calculate the cluster assignments. To formulate the calculation, isomorphism of graphs by means of permutation of the graph adjacency matrix must also be considered. First, a proof is given that a permutation of the adjacency matrix results in R permuted in the same manner.

Proposition 1. An isomorphic permutation of a symmetric adjacency matrix results in R permuted in the same manner. Formally, let A be a symmetric adjacency matrix, assume R is calculated from A as in (5), and let P = P_1 P_2 ... P_m be any permutation matrix where each P_i is a symmetric elementary permutation matrix; then the similarity matrix calculated from P A P^T equals P R P^T.

Proof: First establish that (P A P^T)^p = P A^p P^T. By the symmetry of the elementary permutation matrices and standard transposition identities, P^T P = I, therefore (P A P^T)^p = P A (P^T P) A (P^T P) ... A P^T = P A^p P^T by repeatedly collapsing the central brackets. Furthermore, by the definition in (6), the cosine similarity between two columns is invariant to applying the same permutation P to both of them, and permuting A in this way permutes the columns (and rows) of the matrix whose columns are compared; therefore the similarity matrix calculated from P A P^T equals P R P^T, concluding the proof of Proposition 1.

Therefore using R for calculating cluster assignments is, in the case of a symmetric adjacency matrix, permutation invariant as long as the function applied to it is permutation invariant (e.g. a GMN). However, permutations of the adjacency matrix in conjunction with the indices mapping trick (7) do not result in trivial permutations of the mapped features, but rather in a corresponding permutation of the rows together with a consistent replacement of values that reflects the permutation of node indices in the graph.
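Proposition 1 can also be checked numerically with a few lines of numpy; the snippet below builds a random symmetric adjacency matrix with self-connections and compares the similarity of the permuted graph with the permuted similarity matrix. It is offered purely as an illustrative sanity check.

```python
import numpy as np

def cosine_columns(M, eps=1e-12):
    """Cosine similarity between all pairs of columns of M."""
    C = M / (np.linalg.norm(M, axis=0) + eps)
    return C.T @ C

rng = np.random.default_rng(0)
n = 8
A = rng.integers(0, 2, size=(n, n)).astype(float)
A = np.triu(A, 1)
A = A + A.T + np.eye(n)          # random symmetric adjacency with self-connections

perm = rng.permutation(n)
P = np.eye(n)[perm]              # permutation matrix

# Proposition 1: R(P A P^T) = P R(A) P^T
assert np.allclose(cosine_columns(P @ A @ P.T), P @ cosine_columns(A) @ P.T)
```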

When the graph sizes in the dataset are relatively small, an RNN core can be used as the first stage in the calculation of the assignment matrix, since it can accept inputs of various lengths and calculate fixed sized outputs. The RNN core can be used stand-alone, combined with subsequent dense layers, or as the first component of a permutation invariant GNN layer such as the GMN encoder DBLP:conf/icml/LiGDVK19 . Utilising an RNN core is done without any loss of information. However, when the graphs increase in size the use of an RNN can be prohibitive in terms of computational resources, and performance may be suboptimal due to the inherent difficulties in processing very long input sequences. Therefore the indices mapping trick (7) can be used to fix the dimensionality of the structural similarity features so that any standard ANN layer can be used.

5.5 Complexity

When the indices mapping trick is used, the cosine similarity calculation can be done iteratively on a subset of nodes, and by retaining the indices and values of only the top-k most similar nodes the storage complexity is reduced accordingly. Furthermore, since the similarity features are deterministically calculated from the adjacency matrix with no learned parameters involved, the calculation of the structural similarity features for the input graphs (where the majority of the complexity lies) can be done offline once as a dataset preprocessing step. In addition, observe that the similarity is zero for any two nodes having a geodesic distance larger than two. Therefore, for many if not most real-world graphs, a substantial reduction in computation can be achieved by calculating the similarity only between nodes having a geodesic distance of two or less, so that the amortised complexity depends on the mean node degree rather than on the full number of node pairs.
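As an illustration of the preprocessing argument above, the sketch below (using scipy sparse matrices, which the paper does not prescribe) evaluates the cosine similarity only for node pairs that share at least one neighbour, i.e. pairs within geodesic distance two for p = 1.

```python
import numpy as np
from scipy import sparse

def sparse_structural_similarity(A, eps=1e-12):
    """Similarity restricted to node pairs within geodesic distance two (p = 1)."""
    A = sparse.csr_matrix(A, dtype=float)
    A = A + sparse.eye(A.shape[0], format="csr")     # add self-connections
    col_norms = np.sqrt(A.multiply(A).sum(axis=0)).A1 + eps
    gram = (A.T @ A).tocoo()                         # non-zero only for pairs sharing a neighbour
    vals = gram.data / (col_norms[gram.row] * col_norms[gram.col])
    return sparse.coo_matrix((vals, (gram.row, gram.col)), shape=A.shape)
```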

5.6 DiffPool revised

Having defined the structural similarity features, they can now be utilised in an end-to-end hierarchical graph pooling architecture. For this purpose the DiffPool model suggested by NIPS2018_7729 is revised, and the utility of the suggested structural similarity features as well as additional improvements to the original DiffPool algorithm are evaluated. The changes to the original DiffPool include the following:

  1. A change to the calculation of the adjacency matrix at pooling layer l+1.

  2. Changes to the calculation of the assignment matrix at pooling layer l.

  3. Proposition of a new regularisation term to encourage the model to assign nodes to available clusters and to distribute nodes evenly across assigned clusters.

  4. Removal of the auxiliary link prediction objective.

Calculation of the adjacency matrix. Typically all elements of the pre-activation pooled adjacency matrix are non-negative, therefore adding a saturating activation restricts the post-activation values to a bounded range. This change seemed to result in improved performance and increased stability during training.

A^{(l+1)} = \sigma\big(S^{(l)\top} A^{(l)} S^{(l)}\big) \qquad (9)

Calculation of the assignment matrix. Since the structural similarity features are calculated from the adjacency matrix and encode neighbourhood connectivity, and in addition a similarity based order is imposed on the features when the indices mapping trick is applied, there is no longer a necessity to use a GNN layer. On the contrary, using an MLP or an RNN can lead to information from all nodes being considered when calculating cluster assignments, rather than only the local neighbourhood of a node, and may result in improved performance especially in deeper layers where the connectivity is high. Therefore it is suggested that the assignment matrix may be calculated with any arbitrary ANN layer or subnetwork, including GNNs.
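A minimal sketch of this option, computing the assignment matrix from the index-mapped structural features with a small MLP instead of a GNN, is given below; the layer sizes and names are placeholders, and a softmax over clusters produces the row-stochastic assignments.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def assignment_from_mapped_features(Z, W1, W2):
    """Soft cluster assignments from index-mapped structural features.

    Z  : (n, k) output of the indices mapping trick
    W1 : (k, h) weights of the hidden dense layer
    W2 : (h, c) weights of the output layer, c = clusters in the next pooling layer
    """
    H = np.maximum(Z @ W1, 0.0)       # hidden layer with ReLU
    return softmax(H @ W2, axis=1)    # (n, c) row-stochastic assignment matrix
```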

Additional regularisation. In the original DiffPool model NIPS2018_7729 the authors suggest regularising the cluster assignment by minimising the entropy term:

L_E = \frac{1}{n} \sum_{i=1}^{n} H\big(S^{(l)}_{i,:}\big) \qquad (10)

Where H denotes the entropy function. This regularisation term encourages the rows of S^(l) to resemble one-hot-encoded vectors and therefore results in an assignment of nodes that is close to "unique". However, it was observed in experiments that this regularisation technique also encourages the model to effectively utilise only a small number of clusters, typically as low as one or two clusters at most. This may explain the statement in NIPS2018_7729 that training the pooling GNN using only the gradient signal was not effective and that training often converged to a spurious local minimum early on. To mitigate this behaviour it is proposed to incorporate an additional regularisation term. Intuitively there are two goals that are desirable to achieve: the first is encouraging the model to utilise as many clusters as is useful, thus improving utilisation of the overall model capacity, and the second aims at distributing nodes uniformly across the assigned clusters. To achieve these goals a second regularisation term is defined as follows:

(11)

Where the term involves a column vector of ones of appropriate dimension. Combining both regularisation terms reduces the solution space drastically and obtains a minimum where nodes are assigned "uniquely" to clusters yet are spread uniformly across all available clusters.
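The sketch below computes the entropy regulariser (10) and one plausible realisation of the uniform-usage term, here taken as the KL divergence between the average cluster mass and the uniform distribution. The exact form of (11) is not recoverable from the text above, so L_U should be read as an illustrative stand-in rather than the paper's definition.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=axis)

def assignment_regularisers(S):
    """Regularisation terms for a soft assignment matrix S of shape (n, c).

    L_E: mean entropy of the rows of S (Eq. 10), pushing each node towards a
         near one-hot assignment.
    L_U: illustrative uniform-usage term, KL(q || uniform) where q is the
         average cluster mass, pushing nodes to spread over all clusters.
    """
    n, c = S.shape
    L_E = entropy(S, axis=1).mean()
    q = S.sum(axis=0) / n              # fraction of total node mass per cluster
    L_U = np.log(c) - entropy(q)       # KL divergence from the uniform distribution
    return L_E, L_U
```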

Removal of the link prediction objective. The authors of NIPS2018_7729 explain that the auxiliary link prediction objective and its corresponding loss term \|A^{(l)} - S^{(l)} S^{(l)\top}\|_F, where \|\cdot\|_F denotes the Frobenius norm, were introduced to encode the intuition that nearby nodes should be pooled together. Furthermore, it seems plausible that this auxiliary task also prevents the training process from collapsing at an early stage into a spurious local minimum where at most one or two clusters are utilised, hence limiting the capacity of the model. Since the structural similarity features encode the notion that neighbouring nodes have features that are "close", and the introduced regularisation term encourages utilising the full model capacity, the link prediction objective appears to be redundant.

6 Experimental Results

6.1 Goals

A number of inductive graph classification tasks are performed in order to evaluate the effectiveness of the structural similarity features and the effect of the proposed changes to DiffPool, with the goal of answering the following questions:

  1. How does the use of structural similarity features compare to the use of node features for calculating cluster assignments?

  2. What is the effect of the loss of information and permutation invariance resulting from utilising only the ordered indices mapping of the top-k structural similarity features due to the application of the indices mapping trick?

  3. What is the effect of increasing the parameter p, the power of the adjacency matrix?

  4. What is the effect of the proposed regularisation term (11)?

  5. How does SimPool compare to other recently proposed methods for hierarchical graph pooling on graph classification tasks?

6.2 Datasets

The benchmark datasets used are summarised in Table 1. These datasets are revised versions of common benchmarks used in graph classification tasks, where isomorphic graphs have been removed ivanov2019understanding (datasets are available at https://github.com/nd7141/graph_datasets).

Dataset Size Avg. Nodes Avg. Edges Classes
Enzymes Borgwardt_2005 595 32.48 62.17 6
D&D Dobson_2003 1178 284.32 715.66 2
Table 1: Summary of datasets used for inductive learning graph classification experiments.

6.3 Model architecture

All experiments share a common simple feed-forward model architecture; however, layer parameters and inputs are modified per experiment as appropriate. The architecture and parameters used for each of the datasets are summarised in Tables 2 and 3; these values are generally unchanged unless explicitly noted. Furthermore, no experimentation with more complex architectures, intermediate readout layers or skip connections was performed. For optimisation, Adam DBLP:journals/corr/KingmaB14 with the default parameter values and a fixed learning rate is employed. Note that there was no attempt to find an optimal architecture or hyperparameter settings for the experiments, but merely to measure the relative accuracy for different hyperparameter settings and between the different feature types under this basic network configuration. The indices mapping trick was applied only in the calculation of the assignment matrices; elsewhere the structural similarity features were used as is. Lastly, the assignment matrices are calculated as per equation (4).

Output Module Layers Units Activation
Enzymes D&D
GMN encoder dense 512 1024 ReLU (10.5555/3104322.3104425, )
GMN propagation dense 512 1024 ReLU
GMN propagation dense 512/256 1024 linear
GMN encoder dense 512 1024 ReLU
GMN propagation dense 512 1024 ReLU
GMN propagation dense 512/8 1024/32 linear/softmax
GCN 512 2048 ReLU
MLP dense 256 2048 ReLU
dense 4 8 softmax
GCN 1024 4096 ReLU
predictions MLP dense 6 2 softmax
Table 2: Common architecture and parameters used for all experiments. For the GMN modules, when a single value is specified for the units it refers to both the node and the edge sub-networks.
Dataset scalar scalar train/val split epochs
Enzymes 1 12 1.0 0 1.0 1.0 0.9/0.1 100
D&D 25 0.4 230
Table 3: Default parameter values used for all experiments by dataset where not stated explicitly otherwise. These values are likely not optimal but rather chosen as experimental baselines. The epochs column indicates the number of epochs used for training.

6.4 Assignment features

For evaluating the utility of the structural similarity features, a graph classification task is repeated with the same model architecture and parameters where the only variant is the features used for calculating the assignment matrix. Three variants are compared: structural similarity features, node features, and a concatenation of both.

Figures 2 and 3 illustrate the values of the loss terms during training and the number of different clusters utilised, where a node cluster assignment is determined by the argmax of the corresponding row in the assignment matrix. Generally, a combination of low values for both loss terms represents a desirable policy of cluster assignment that is both uniform across all available clusters and unique for each node. The figures indicate that in these experiments the "node" and "both" variants seem to result in similar training patterns, whereas using only the structural features for calculating assignments allows the model to follow a different and distinct training path. In the case of the D&D dataset, using only the structural similarity features for cluster assignments enables the model to utilise substantially more clusters at both pooling layers. It also appears that training is more stable and assignments become distinct in earlier stages of training despite more clusters being utilised effectively. Furthermore, using the "structural" variant consistently attains the maximal accuracy obtained on the validation data, as summarised in Table 4.

Figure 4 illustrates cluster assignments for a number of random graphs chosen from the Enzymes dataset using the proposed structural similarity features combined with the indices mapping trick. In comparison, Figure 5 illustrates cluster assignments by the same model using the node features for the cluster assignment calculations. The plots indicate that the proposed algorithm achieves its stated goals and clusters the nodes in a manner that generally preserves the locality of nodes in the coarsened graph, whereas using the node features does not generally preserve topology and is hard to interpret in terms of graph structure. In addition, the number of nodes in each cluster in Figure 4 appears to be reasonably uniform, which is yet another indication that the proposed algorithm is successful in achieving its stated goals.

Dataset Max. Validation Accuracy Epoch
Structural Both Node Structural Both Node
Enzymes 0.7667 0.7 0.7167 53 80 33
D&D 0.7915 0.7881 0.7712 35 48 140
Table 4: Maximal accuracy obtained on the validation data for the experiments in section 6.4. The Epoch columns denote the first epoch where the maximal accuracy was obtained.

6.5 Top-k

In this section the effect of applying the indices mapping trick (7) is evaluated by repeating the task with diminishing values of k to simulate increasing information loss. In addition, an experiment is conducted to evaluate the structural similarity features with no information loss and no loss of permutation invariance, by modifying the GMN encoder such that its dense layer is replaced with an RNN core whilst maintaining the same activation and number of units. The results indicate that the choice of k has a substantial impact on the maximum accuracy obtained and that increasing k beyond a certain point does not necessarily improve performance. Furthermore, the indices mapping trick seems to substantially increase performance whilst using fewer parameters and computational resources.

Max. Validation Accuracy
k=3 k=6 k=9 k=12 k=15 k=18 RNN (full similarity matrix)
0.6833 0.6833 0.7 0.7667 0.7167 0.733
Table 5: Maximal accuracy obtained on the validation data of the Enzymes dataset for different values of k, as well as for the use of the structural similarity matrix in its entirety with an RNN core.

6.6 Similarity neighbourhood size

In this section the effect of modifying the neighbourhood size p in equations (5) and (6) is evaluated. For this purpose only the "structural" variant was used. The validation accuracy results are summarised in Table 6. The results indicate that a smaller neighbourhood size yielded better accuracy by a small margin in the tested settings. Other aspects of the training, such as the loss values and the number of clusters used by the model, were observed to be similar across the different settings of p used in this experiment. These results may be explained by SimPool retaining information from all nodes across pooled layers, so that there is little or no benefit in artificially increasing the graph connectivity as expressed in the adjacency matrix.

Dataset Max. Validation Accuracy Epoch
p=1 p=2 p=3 p=1 p=2 p=3
Enzymes 0.7667 0.7 0.7 53 78 32
D&D 0.7915 0.7881 0.7797 35 113 162
Table 6: Maximal accuracy obtained on the validation data for different neighbourhood sizes used for calculating the structural similarity features. The Epoch columns denote the first epoch where the maximal accuracy was obtained.

6.7 Regularisation term

In this section the impact of increasing the weight of the loss term (11) is explored. Figure 1 illustrates the values of the loss term during training and the number of different clusters utilised, where a node cluster assignment is determined by the argmax of the corresponding row in the assignment matrix. It is evident that increasing the weight encourages the model to increase the number of assigned clusters in the subsequent pooling layer, possibly in exchange for increased softness of the cluster assignments. However, it is notable that in the first pooling layer the model was able to learn assignment policies that utilise an increasing number of clusters while maintaining a close to unique assignment of nodes.

6.8 Performance study

To complete the experimental analysis, a number of inductive graph classification tasks are performed and the obtained accuracy is compared against recent similar methods. For this experiment full 10-fold cross validation is performed in all experiments. The overall architecture, units and activations are summarised in Table 2, the parameters used by the SimPool layers are summarised in Table 7, and Table 8 summarises the accuracy results obtained. To calculate the results, the maximum accuracy achieved in each fold is used and the mean and standard deviation across folds are calculated. The results stated for methods other than SimPool are taken from the referenced papers. Since there are differences in evaluation methods across the different sources, the results are indicative and should not be considered as accurate benchmarking.

Dataset scalar scalar epochs
Enzymes 1 15 0.8 0 1.0 1.0 100
D&D 32 0.4 60
Table 7: Parameter values used for all experiments in section 6.8 by dataset. These values may not be optimal.
Method Dataset
Enzymes D&D
DiffPool NIPS2018_7729 0.6253 0.8064
SortPool Zhang2018AnED 0.5712 NIPS2018_7729 Zhang2018AnED
g-U-Nets DBLP:conf/icml/GaoJ19 -
SAGPool 35137c0f4e904fc0ab786021ead07852 -
SimPool TBD
Table 8: Comparison of accuracy achieved for different methods and datasets. The results stated for methods other than SimPool are taken from the referenced papers. Since there are differences in evaluation methods across different sources the results are indicative and should not be considered as accurate benchmarking.
Scalar 0 0.5 1.0 1.5 2.0
Accuracy 0.6833 0.7 0.7167 0.7
Table 9: Maximal accuracy obtained on the validation data of the Enzymes dataset for different scalars of the loss term (11).
Figure 1: Training statistics on the Enzymes dataset for both assignment matrices with different scalars of the loss term (11); first pooling layer (left column) and second pooling layer (right column). (a) and (b): number of different clusters selected; (c) and (d): loss term representing the uniqueness of node cluster assignments.

7 Conclusion

In this paper a method for generating features based on structural similarity that are useful for hierarchical coarsening of graphs is proposed. The method is differentiable, can be integrated with many algorithms including end-to-end deep representation learning models, and can also be augmented with additional features such as the node features themselves. Furthermore, SimPool is proposed, a pooling layer based on a revisited DiffPool layer that enables end-to-end GNN models to pool neighbouring nodes together in the coarsened graph, encouraging the model to learn useful locality-preserving pooling in a manner that is closer to the pooling operations used by CNNs that operate on local receptive fields in the standard grid. Experimental results indicate that the method is successful in fulfilling its stated goals and contributes towards achieving state-of-the-art results in inductive graph classification tasks when integrated as part of an end-to-end GNN architecture.

References

  • [1] S. Borgatti and T. Grosser. Structural equivalence: Meaning and measures. International Encyclopedia of the Social and Behavioral Sciences, 12 2015.
  • [2] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(1):47–56, Jan. 2005.
  • [3] R. L. Breiger, S. A. Boorman, and P. Arabie. An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling, 1975.
  • [4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [5] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In Y. Bengio and Y. LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • [6] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, page 3844–3852, Red Hook, NY, USA, 2016. Curran Associates Inc.
  • [7] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell., 29(11):1944–1957, Nov. 2007.
  • [8] P. Dobson and A. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of molecular biology, 330(4):771–783, 7 2003.
  • [9] H. Gao and S. Ji. Graph u-nets. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2083–2092. PMLR, 2019.
  • [10] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 1263–1272. JMLR.org, 2017.
  • [11] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
  • [12] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 1025–1035, Red Hook, NY, USA, 2017. Curran Associates Inc.
  • [13] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arXiv:1709.05584, 2017. Published in the IEEE Data Engineering Bulletin, September 2017.
  • [14] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. ArXiv, abs/1506.05163, 2015.
  • [15] S. Ivanov, S. Sviridov, and E. Burnaev. Understanding isomorphism bias in graph data sets, 2019.
  • [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [17] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.
  • [18] J. Lee, I. Lee, and J. Kang. Self-attention graph pooling. Jan. 2019.
  • [19] E. A. Leicht, P. Holme, and M. E. J. Newman. Vertex similarity in networks. arXiv:physics/0510143, 2005.
  • [20] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli. Graph matching networks for learning the similarity of graph structured objects. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 3835–3845. PMLR, 2019.
  • [21] F. Lorrain and H. C. White. Structural equivalence of individuals in social networks. The Journal of Mathematical Sociology, 1(1):49–80, 1971.
  • [22] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, page 807–814, Madison, WI, USA, 2010. Omnipress.
  • [23] T. Pham, T. Tran, D. Phung, and S. Venkatesh. Column networks for collective classification. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, page 2485–2491. AAAI Press, 2017.
  • [24] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. arXiv:1704.02901, 2017. Accepted to CVPR 2017; extended version.
  • [25] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21, 2020.
  • [26] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4800–4810. Curran Associates, Inc., 2018.
  • [27] M. Zhang, Z. Cui, M. Neumann, and Y. Chen. An end-to-end deep learning architecture for graph classification. In AAAI, 2018.