# Graph Neural Networks with Parallel Neighborhood Aggregations for Graph Classification

We focus on graph classification using a graph neural network (GNN) model that precomputes the node features using a bank of neighborhood aggregation graph operators arranged in parallel. These GNN models have a natural advantage of reduced training and inference time due to the precomputations but are also fundamentally different from popular GNN variants that update node features through a sequential neighborhood aggregation procedure during training. We provide theoretical conditions under which a generic GNN model with parallel neighborhood aggregations (PA-GNNs, in short) are provably as powerful as the well-known Weisfeiler-Lehman (WL) graph isomorphism test in discriminating non-isomorphic graphs. Although PA-GNN models do not have an apparent relationship with the WL test, we show that the graph embeddings obtained from these two methods are injectively related. We then propose a specialized PA-GNN model, called SPIN, which obeys the developed conditions. We demonstrate via numerical experiments that the developed model achieves state-of-the-art performance on many diverse real-world datasets while maintaining the discriminative power of the WL test and the computational advantage of preprocessing graphs before the training process.


## 1 Introduction

Graph neural networks (GNNs) have recently emerged as one of the most popular machine learning models for processing and analyzing graph-structured data [GDL, gamma20spm]. GNNs have gained significant and steady attention due to their extraordinary success in solving many challenging tasks in a variety of scientific disciplines such as computational pharmacology [decagon], molecular chemistry [MoIGAN], physics [phy], finance [finance1, finance2], and wireless communications [wicom, opt], to name a few.

The authors are with the Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore, 560012, India.
Email: siddhant.doshi@outlook.com, spchepuri@iisc.ac.in
This work was supported in part by the SERB SRG/2019/000619 grant and Pratiksha Trust Fellowship.

Solving machine learning tasks with graphs such as node property prediction, link prediction, or graph property prediction requires an efficient representation of the underlying graph [dong_gsp, graphBook]. In this work, we focus on semi-supervised graph classification, wherein we are given multiple graphs, each with an associated categorical label and data about the nodes in each graph. The goal is to train a machine learning model that processes the graph-structured data and nodal attributes (i.e., signals associated with the nodes) to predict labels of unseen graphs. For example, the graphs may represent different communities of people in a social network (or different chemical compounds), and the task is to identify the type of community (respectively, whether a compound is an enzyme or not).

### 1.1 Related works

A majority of GNNs learn the representation vector of a node (or node embedding) through a sequential neighborhood aggregation procedure during the training process. In each iteration (also referred to as a GNN layer), the representation vector of a node is computed by aggregating representation vectors from its local one-hop neighbors via a first-order graph filtering operation [gamma20spm]. Cascading many such layers with non-linear activation functions allows GNNs to learn structural information beyond the node's one-hop neighborhood and thus achieve good generalization. Let us call GNN models that sequentially aggregate the neighbor node embeddings SA-GNNs (here, SA stands for sequential aggregation). Different choices of sequential neighborhood aggregation functions lead to popular GNN variants such as graph convolutional networks (GCN) [GCN], GraphSAGE [GraphSAGE], graph attention networks (GAT) [GAT], and graph isomorphism networks (GIN) [GIN]. Such SA-GNN models can suffer from gradient issues, oversmoothing, or the so-called bottleneck effects [GCN, Li18deepGCN]. Further, for graph classification, the representation vectors of all the nodes in a graph have to be pooled (or read out) to obtain a representation vector of an entire graph, referred to as the graph embedding. Different graph pooling schemes have been proposed for graph classification with SA-GNNs [SAG, diffpool, DGCNN, ECC].

Recently, GNN models that precompute nodal features by performing neighborhood aggregation at multiple scales in a non-sequential or parallel manner through integer powers of graph operators have been proposed. Examples of such architectures include scalable inception graph neural networks (SIGN) [SIGN] and graph-augmented MLPs (GA-MLPs) [GAMLP]. These GNN architectures are analogous to inception modules for convolutional neural networks (CNNs), where a filterbank with convolutional filters of different sizes is used [inception]. Let us call GNN models in which a bank of non-sequential or parallel neighborhood aggregation functions, i.e., graph filterbanks [Tremblay17graphfilterbank], gathers node features from different neighborhoods PA-GNNs (here, PA stands for parallel aggregation). Since nodal features in PA-GNNs are precomputed, these models benefit from reduced training and inference time compared to SA-GNNs while preserving the structural information about the underlying graph. A natural question to ask is, how do PA-GNN models perform on machine learning tasks with graphs?

While experimental results for node prediction tasks have been presented with the SIGN [SIGN] and GA-MLP [GAMLP] architectures, we focus on graph classification tasks with PA-GNN models in this work. For graph classification with GNNs, understanding the discriminative power of the model plays a crucial role. Although most neural network models are developed based on empirical and experimental evidence, theoretical results are available that characterize and analyze the discriminative power of a few popular GNN variants, mostly SA-GNNs, in distinguishing non-isomorphic graph structures [GIN, Morris] by relating them to the well-known Weisfeiler-Lehman (WL) graph isomorphism test [WL_1968, babai]. Specifically, by leveraging the similarity of the neighborhood aggregation iterations in SA-GNNs to the vertex refinement updates in the WL graph isomorphism test, one can determine whether a GNN model is as powerful as the WL test, or build powerful GNN models such as GIN [GIN]. For instance, [GIN] provides examples of node-level aggregation functions which, when used, fail to distinguish graphs that the WL test would distinguish, and the discriminative power of integer powers of graph operators is studied in [GAMLP]. Graph classification methods based on SA-GNNs but without any theoretical characterization of their discriminative power are also commonly used [DGCNN, diffpool]. Non-GNN models such as graph kernel-based approaches [kernel1, social] for graph classification often do not scale well, as computing the kernel matrix for a large number of graphs becomes intractable.

### 1.2 Main results and contributions

Unlike SA-GNNs, the PA-GNN models considered in this work have several parallel branches, with each branch focusing on a different neighborhood depth and yielding representation vectors of all the nodes in the graph. Therefore, to combine the representations from different branches, we propose a new architecture called simple and parallel graph isomorphism network (SPIN). In SPIN, we perform branch-level readouts to obtain a representation vector of the entire graph at every branch. These yield graph embeddings associated with different neighborhood depths. Further, we pool these branch-level graph representation vectors through a graph-level readout to obtain a single representation vector of the entire graph. We focus on studying the discriminative power of an end-to-end PA-GNN model and provide conditions on the branch-level and graph-level readout functions under which PA-GNNs are as powerful as the WL test. Our main contributions and results are summarized as follows.

• We propose a generic PA-GNN model, which aggregates node features from different neighborhoods in a non-sequential manner through a bank of graph operators arranged in parallel, pools node-level representations from each branch through branch-level readouts, and finally pools graph embeddings from multiple branches through a graph-level readout.

• We provide theoretical conditions on the branch-level and graph-level readouts under which PA-GNN models are as powerful as the WL test. Unlike SA-GNN models, the procedure to compute graph embeddings in PA-GNN models does not admit the same form as the updates in the WL test. Towards this end, we show how the embeddings generated by PA-GNNs can be injectively mapped to the updates in the WL test, thereby maintaining the discriminative power of the 1-WL test. We also show that simple graph-level readout functions, such as max or mean, fail to discriminate regular graphs with the same node type.

• Based on these generic conditions, we propose SPIN. Specifically, we present two variants of SPIN with and without an attention mechanism at the branch-level readouts to capture the most relevant node features at each branch. Further, we theoretically show that introducing an attention mechanism as a node-level function does not reduce the discriminative power of the model.

We validate the developed model and theory through extensive numerical experiments on twelve diverse benchmark datasets for graph classification: five datasets from the social domain, five datasets from the chemical domain, and two datasets related to brain networks. We demonstrate that PA-GNN models with precomputed node features perform on par with SA-GNN models while maintaining the discriminative power of the WL test and the computational advantage of preprocessing graphs before the training process.

### 1.3 Organization

The rest of the paper is organized as follows. In Section 2, we provide the relevant background on SA-GNN models and the WL test. In Section 3, we present a generic PA-GNN model for learning graph embeddings and also provide a theoretical characterization of the PA-GNN model, which we specialize as SPIN in Section 4. In Section 5, we present results on a variety of datasets. The paper finally concludes in Section 6.

Software and datasets to reproduce the results in the paper are available at https://github.com/siddhant-doshi/SPIN.

## 2 Preliminaries

In this section, we briefly introduce a generic version of GNNs that update node features via sequential neighborhood aggregation. We then describe the Weisfeiler-Lehman graph isomorphism test, which is used to characterize the discriminative power of GNNs.

### 2.1 Notation

Let $G = (V, E)$ denote a graph with node set $V$ and edge set $E$. Each node $v \in V$ has a feature (or attribute) vector $x_v \in \mathbb{R}^{d}$. A graph with $N$ nodes has an adjacency matrix $A \in \mathbb{R}^{N \times N}$ and input feature matrix $X = [x_1, \ldots, x_N]^T \in \mathbb{R}^{N \times d}$. We denote the inner product between two vectors $a$ and $b$ as $\langle a, b \rangle$. Let $\mathcal{N}_v^{(r)}$ denote the set of $r$-hop neighboring nodes of node $v$. We frequently use an extension of a set called a multiset, which is defined as follows.

###### Definition (Multiset).

A multiset is a collection of elements in which elements may occur more than once. It is a 2-tuple $X = (S, m)$, where $S$ is the underlying set and $m: S \to \mathbb{N}_{\geq 1}$ is a function that gives the multiplicity of each element $s \in S$ as $m(s)$.
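As a concrete illustration (ours, not from the paper), Python's `collections.Counter` realizes exactly this 2-tuple view of a multiset: the keys form the underlying set $S$ and the counts give the multiplicity function $m$.

```python
from collections import Counter

# A multiset of node labels: the element "b" occurs twice.
labels = Counter(["a", "b", "b", "c"])

# Underlying set S and multiplicity m("b").
S = set(labels)
m_b = labels["b"]

# Two multisets are equal iff they agree element-wise with multiplicity,
# so a sorted tuple of (element, count) pairs is an injective encoding.
encoding = tuple(sorted(labels.items()))
```

This injective encoding is the same idea used by the hashing function in the WL test discussed later.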

### 2.2 GNNs with sequential neighborhood aggregation

Most GNNs follow a sequential architecture comprising a cascade of several local neighborhood aggregation layers, where each layer computes the representation vector of a node by aggregating feature vectors from its 1-hop neighboring nodes. Cascading $K$ such local aggregation layers captures the structural information within the $K$-hop neighborhood of a node. The procedure can be viewed as iteratively updating the representation vector of a node as

 x_v^{(k+1)} = \phi^{(k)}\left( x_v^{(k)},\, f^{(k)}\big( \{ x_i^{(k)}, \forall i \in \mathcal{N}_v^{(1)} \} \big) \right), \qquad (1)

where $x_v^{(k)} \in \mathbb{R}^{d_k}$ is the $d_k$-dimensional representation vector of node $v$ at the $k$-th layer with $x_v^{(0)} = x_v$ being its input feature vector, $f^{(k)}(\cdot)$ is a graph operator that acts as the local aggregation function and propagates node features, and $\phi^{(k)}(\cdot)$ combines the neighborhood information of a node with its own representation vector. Several variants of SA-GNNs have been proposed with different choices of $f^{(k)}$ and $\phi^{(k)}$, such as GCN [GCN], GraphSAGE [GraphSAGE], and GIN [GIN], to name a few. An example of a basic SA-GNN model of the form (1) is

 X^{(k+1)} = \sigma\big( X^{(k)} W_{\mathrm{self}}^{(k)} + A X^{(k)} W_{\mathrm{neigh}}^{(k)} \big),

where $X^{(k)} = [x_1^{(k)}, \ldots, x_N^{(k)}]^T$. The matrices $W_{\mathrm{self}}^{(k)}$ and $W_{\mathrm{neigh}}^{(k)}$ are trainable parameter matrices and $\sigma(\cdot)$ is an elementwise non-linearity (e.g., a ReLU). Here, the graph operator $A$ that performs the neighborhood aggregation is a first-order graph filter.
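As a minimal NumPy sketch (with a toy graph and randomly drawn weights of our choosing, not the paper's implementation), one such sequential-aggregation layer reads:

```python
import numpy as np

def sa_gnn_layer(X, A, W_self, W_neigh):
    """One sequential-aggregation layer: X' = ReLU(X W_self + A X W_neigh)."""
    return np.maximum(X @ W_self + A @ X @ W_neigh, 0.0)

# Toy 3-node path graph 0-1-2 with 2-dimensional input features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
rng = np.random.default_rng(0)
W_self = rng.normal(size=(2, 4))
W_neigh = rng.normal(size=(2, 4))

# Cascading K such layers captures K-hop structural information.
H = sa_gnn_layer(X, A, W_self, W_neigh)
```

Note that the aggregation `A @ X` happens inside the layer, so it is repeated at every training step, which is precisely what PA-GNNs avoid.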

For graph property prediction or classification, given a set of graphs and their labels, the representation vector $e_G$ of an entire graph $G$ is required to predict its label $\hat{y}_G = \xi(e_G)$, where $\xi(\cdot)$ is a trained decoder. The graph embedding $e_G$ is computed using a node-level pooling or readout function that operates on the node representation vectors as

 e_G = \mathrm{readout}\big( \{ x_v^{(K)}, \forall v \in V \} \big), \qquad (2)

where typical choices for the readout function are concatenation, summation, mean, max [JK, SAG], hierarchical pooling [diffpool], and sort pooling [DGCNN].

### 2.3 Weisfeiler-Lehman isomorphism test

Determining whether two graphs are isomorphic is a difficult problem with no known polynomial-time solution [garey]. The Weisfeiler-Lehman (WL) vertex refinement algorithm [babai] produces a canonical form for each graph under test; when the canonical forms are not equivalent, the graphs are not isomorphic. However, for non-isomorphic graphs that lead to the same canonical form, the WL test is not useful in distinguishing the graphs under test. The WL vertex refinement algorithm (also referred to as the one-dimensional WL or 1-WL test) iteratively updates the label of each node based on the labels of its neighboring nodes and assigns it a unique label. For a graph $G$, the 1-WL vertex refinement iteration is given as

 l_v^{(t+1)} = \varphi\big( l_v^{(t)},\, \{ l_i^{(t)}, \forall i \in \mathcal{N}_v^{(1)} \} \big), \qquad (3)

where $l_v^{(t)}$ is the unique label for node $v$ at the $t$-th iteration. Here, $\{ l_i^{(t)}, \forall i \in \mathcal{N}_v^{(1)} \}$ is a multiset of the neighborhood labels for node $v$, as different nodes can have identical labels, and $\varphi(\cdot)$ is an injective hashing function that maps a multiset of neighborhood labels to a distinct label. The iterative procedure in Equation (3) is applied simultaneously to the two graphs under test to refine the labels until convergence.

A GNN model is said to be as powerful as the 1-WL test if it generates different graph embeddings for two graphs identified as non-isomorphic by the 1-WL test. The 1-WL vertex refinement iterations in (3) are similar in nature to the feature update iterations of SA-GNNs in (1). This similarity has been leveraged to theoretically characterize the discriminative power of some of the popular GNN variants, such as GCN [GCN] and GraphSAGE [GraphSAGE], and to build GIN [GIN], which is as powerful as the 1-WL test.
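For reference, a compact toy implementation (ours) of the 1-WL refinement in Equation (3): each new label is the pair (own label, sorted multiset of neighbor labels), which is an injective choice of $\varphi$, and the sorted multiset of final labels serves as the canonical form.

```python
def wl_refine(adj, labels, iters):
    """Run 1-WL vertex refinement for `iters` iterations.

    adj    : list of neighbor lists, adj[v] = neighbors of node v
    labels : initial node labels (e.g., node types or degrees)
    Returns the sorted multiset of final node labels, used as the
    graph's canonical form for comparison.
    """
    labels = list(labels)
    for _ in range(iters):
        # New label = (own label, multiset of neighbor labels); using the
        # signature itself as the label keeps the update injective.
        labels = [(labels[v], tuple(sorted(labels[u] for u in adj[v])))
                  for v in range(len(adj))]
    return sorted(labels)

# Triangle vs. 3-node path with identical initial labels:
# the 1-WL test separates them after one refinement step.
triangle = [[1, 2], [0, 2], [0, 1]]
path = [[1], [0, 2], [1]]
different = wl_refine(triangle, [0, 0, 0], 2) != wl_refine(path, [0, 0, 0], 2)
```

Two regular graphs of the same size and degree with identical initial labels would produce equal canonical forms here, which is exactly the known failure mode of the 1-WL test.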

## 3 GNNs with parallel neighborhood aggregation

In this section, we present a generic PA-GNN model based on parallel neighborhood aggregations. The model presented in this section is generic as we do not restrict it to a specific aggregation or readout procedure. We also provide theoretical conditions under which the presented model is provably as powerful as the 1-WL test.

### 3.1 A generic PA-GNN model

The representation vectors of a node capturing the structural information at multiple scales, i.e., at different neighborhood depths, can be obtained simultaneously by choosing an appropriate neighborhood aggregation graph operator. These node embeddings from different depths are then combined using readout functions to obtain graph embeddings. Formally, the proposed generic PA-GNN model with $R+1$ branches, indexed $r = 0, 1, \ldots, R$, has the following three main components.

Parallel neighborhood aggregation: At the $r$th branch, the intermediate representation vector $z_v^{(r)}$ of node $v$ is computed by aggregating feature vectors from its $r$-hop neighbor nodes as

 z_v^{(r)} = g^{(r)}\big( f_v^{(r)}( \{ x_j, \forall j \in \mathcal{N}_v^{(r)} \} ) \big), \qquad (4)

where $f_v^{(r)}(\cdot)$ is a graph operator that performs the neighborhood aggregation and $g^{(r)}(\cdot)$ is a learnable transformation function. The neighborhood aggregations can be efficiently precomputed outside the training process as they do not depend on the learnable parameters.

Branch-level graph pooling: At the $r$th branch, we have a pooling function $\Omega(\cdot)$ that maps the set of node embeddings at that branch to an intermediate graph embedding $s_G^{(r)}$ as

 s_G^{(r)} = \Omega\big( \{ z_v^{(r)}, \forall v \in V \} \big), \quad r = 0, 1, \ldots, R. \qquad (5)

Global readout: Finally, we have a global readout function that combines the graph embeddings from all the branches to obtain the final graph embedding as

 e_G = \psi\big( \{ s_G^{(0)}, \ldots, s_G^{(R)} \} \big), \qquad (6)

where $\psi(\cdot)$ is the global readout function.

For node and link prediction tasks, one may compute the final representation vector $h_v$ of node $v$ by combining the transformed intermediate representation vectors from all the branches as

 h_v = \Theta\big( z_v^{(0)}, z_v^{(1)}, \ldots, z_v^{(R)} \big), \qquad (7)

where $\Theta(\cdot)$ is a learnable global aggregation function. The branch-level and graph-level readouts are usually not required for node and link prediction tasks.

In what follows, we provide conditions on the functions $f^{(r)}$, $g^{(r)}$, $\Omega$, and $\psi$ that theoretically characterize the discriminative power of a PA-GNN model.

### 3.2 Theoretical characterization

PA-GNN models of the form described in Section 3.1 are fundamentally different from SA-GNN models: since the neighborhood aggregations are precomputed rather than iterative, PA-GNN models reduce to standard neural networks during training. More importantly, the global readout in Equation (6) combines branch-level graph embeddings, unlike the node-level readout in Equation (2). Therefore, in this work, we extend the theoretical framework of [GIN] for analyzing and characterizing the discriminative power of SA-GNN models to PA-GNN models.

Our next theorem states the conditions required for PA-GNNs to be as powerful as the 1-WL test.

###### Theorem 1.

A PA-GNN model with $R+1$ branches maps two non-isomorphic graphs $G_1$ and $G_2$, as identified by the 1-WL test, to two different embeddings if:

1. Each branch produces node representation vectors according to Equation (4) with injective functions $f^{(r)}$ and $g^{(r)}$.

2. The branch-level and global readout functions $\Omega$ and $\psi$ in Equations (5) and (6), respectively, are also injective.

We prove Theorem 1 in Appendix A. The proof extends the setting in [GIN] from iterative feature vector updates to simultaneous and parallel feature vector computations. To do so, we leverage the fact that the precomputed local aggregation at the $r$-th branch can be realized by $r$ successive local aggregations and that the composition of injective multiset functions is injective. We then show via mathematical induction that there always exists an injective function $\eta$ such that $h_v = \eta(l_v^{(R)})$, where $l_v^{(R)}$ is the label generated for node $v$ at the $R$-th iteration by the WL vertex refinement algorithm [cf. Equation (3)] and $h_v$ is the representation vector of node $v$ produced by a PA-GNN model with $R+1$ branches [cf. Equation (7)]. Thus, choosing the functions in Equations (4)-(6) appropriately, we obtain a PA-GNN model that is provably as powerful as the 1-WL test. We next build one such PA-GNN model, which we call the simple and parallel graph isomorphism network (SPIN).

## 4 Simple and parallel graph isomorphism network (SPIN)

SPIN specializes the functions in the generic PA-GNN model so that the conditions provided in Theorem 1 are satisfied. Specifically, SPIN has three main components: parallel neighborhood aggregations, branch-level readouts, and a global readout, as illustrated in Figure 1.

### 4.1 Parallel neighborhood aggregations

Let us collect the neighborhood feature vectors of node $v$ at the $r$-th branch in the multiset $\{ x_j, \forall j \in \mathcal{N}_v^{(r)} \}$. The neighborhood aggregation operator at the $r$th branch is then a multiset function. For $f^{(r)}$, we use integer powers of a graph operator $\bar{A}$. Let us define the matrix of aggregated feature vectors as

 B^{(r)} = [ b_1^{(r)}, \ldots, b_N^{(r)} ]^T = \bar{A}^r X,

where the graph filtering operation $\bar{A}^r X$ implements a sum of the $r$-hop neighbor node embeddings.

Some choices of the graph operator $\bar{A}$ are the adjacency matrix $A$, which yields the basic sum aggregator and is injective; the normalized adjacency matrix $D^{-1/2} A D^{-1/2}$ with the diagonal degree matrix $D$; or a linear combination of the two. When the pair of graphs under test are regular graphs of the same size but with different node degrees, degree-normalized graph operators may lead to the same node embeddings; they are hence not injective and less powerful than the 1-WL test [GAMLP, Morris]. Although such cases are rare in practice, when encountered (easily verified by prescreening the graphs to be tested), we can use a linear combination of the two operators, where the adjacency matrix retains the degree information and the normalized operator provides the advantages of normalization [GCN].
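The precomputation step can be sketched as follows (a NumPy illustration under the sum-aggregator choice $\bar{A} = A$; function and variable names are ours):

```python
import numpy as np

def precompute_branch_features(A, X, R):
    """Precompute B^{(r)} = A^r X for branches r = 0, 1, ..., R.

    Done once before training: the r-th branch sums features over
    r-hop walks, so no aggregation is needed during training itself.
    """
    branch_features = [X]                      # r = 0: raw input features
    for _ in range(R):
        branch_features.append(A @ branch_features[-1])
    return branch_features

# Toy 3-node path graph and 2-dimensional node features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.eye(3)[:, :2]
B = precompute_branch_features(A, X, R=2)      # each B[r] has shape (3, 2)
```

Computing $A \cdot B^{(r-1)}$ iteratively, rather than forming $A^r$ explicitly, keeps the preprocessing cost at one sparse matrix-vector product per branch.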

### 4.2 Branch-level readouts

In a PA-GNN model with $R+1$ branches, we have representation vectors of all the nodes in the graph at each branch. We combine these node representation vectors to compute branch-level graph embeddings $s_G^{(r)}$ for $r = 0, 1, \ldots, R$. At the $r$th branch, this amounts to pooling the $r$-hop representation vectors of all the nodes in the graph. We perform this pooling through a weighted summation as

 s_G^{(r)} = \sum_{v \in V} \alpha_v^{(r)}\, g^{(r)}\big( b_v^{(r)} \big) = \sum_{v \in V} \alpha_v^{(r)} z_v^{(r)},

where the local features are first transformed as $z_v^{(r)} = g^{(r)}(b_v^{(r)})$. The weights $\alpha_v^{(r)}$ may be used to focus on the most relevant local features (discussed later on). Lemma 1 suggests that modeling the transformation functions $g^{(r)}$ as single-layer perceptrons (SLPs) cannot distinguish multiset local neighborhood features.

###### Lemma 1.

For two distinct multisets $X_1 \neq X_2$, the weighted summations of their linear mappings can be equal with ReLU (or leaky ReLU) as the non-linearity, i.e., $\sum_{x \in X_1} \alpha_x\, \sigma(W x) = \sum_{x \in X_2} \alpha_x\, \sigma(W x)$, for any linear mapping $W$.

We prove the lemma in Appendix B. This lemma is a generalization of [GIN, Lemma 7] to the case with a weighted sum, which allows us to include an attention mechanism as discussed next. Thus, we use multi-layer perceptrons (MLPs) to model $g^{(r)}$.

Inspired by the attention mechanism [GAT] for pooling, i.e., self-attention pooling [SAG], we design the weights through an attention mechanism that focuses on the most relevant parts of the local features and retains the most important ones. Formally, the weight of node $v$ at the $r$-th branch is computed as

 \alpha_v^{(r)} = \mathrm{ATTENTION}\big( w^{(r)}, z_v^{(r)} \big) = \frac{\exp\big( \beta_v^{(r)} \big)}{\sum_{u \in V} \exp\big( \beta_u^{(r)} \big)} \qquad (8)

with $\beta_v^{(r)} = \langle w^{(r)}, z_v^{(r)} \rangle$. Here, $w^{(r)}$ is a learnable vector that extracts the attention weight for node $v$ at the $r$-th branch based on its local feature vector $z_v^{(r)}$. For instance, $z_v^{(r)}$ carries the information about the $r$-hop neighborhood of node $v$, and $\alpha_v^{(r)}$ tells how significant that information is in comparison with the $r$-hop information of the other nodes. Our next lemma states that the attention mechanism preserves injectivity.

###### Lemma 2.

For two distinct features $z_1$ and $z_2$, the weighted features $\alpha_1 z_1$ and $\alpha_2 z_2$ are also distinct, if $\alpha_i = \mathrm{ATTENTION}(w, z_i)$ for $i = 1, 2$, and for any $w$.

We prove the lemma in Appendix C. Figure 2(a) illustrates the weights produced by the attention block at different branches for sample graphs from the COLLAB [social] and PROTEINS [proteins] datasets. We observe that the same nodes are given different weights in different branches. When the attention mechanism is not used, the weights are all set to 1.
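A sketch of the attention-weighted branch-level readout in Equation (8) (NumPy, with a randomly initialized vector standing in for the learned $w^{(r)}$; all names are ours):

```python
import numpy as np

def branch_readout_with_attention(Z, w):
    """Weighted-sum branch readout with softmax attention over nodes.

    Z : (N, d) matrix of transformed node features z_v at one branch
    w : (d,) attention vector for this branch (learned in practice)
    Returns the branch embedding s = sum_v alpha_v z_v and the weights.
    """
    beta = Z @ w                                  # beta_v = <w, z_v> per node
    beta = beta - beta.max()                      # shift for numerical stability
    alpha = np.exp(beta) / np.exp(beta).sum()     # softmax over the nodes
    return alpha @ Z, alpha

rng = np.random.default_rng(1)
Z = rng.normal(size=(5, 4))                       # 5 nodes, 4-dim features
w = rng.normal(size=4)
s, alpha = branch_readout_with_attention(Z, w)
```

Setting all weights to a constant $1$ instead of the softmax recovers the plain sum readout used by SPIN-non-att.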

Finally, we need to combine the representation vectors of all the nodes in the graph to arrive at a branch-level graph representation vector. Commonly used readout operators for computing the representation vector of an entire graph are the mean, the max, or a combination of the max and mean readout functions [JK]. Our next lemma states that mean and max graph-level readout functions fail to distinguish graph structures.

###### Lemma 3.

Consider two undirected graphs $G_1$ and $G_2$ with different numbers of nodes and different structural connectivity. The embeddings $e_{G_1}$ and $e_{G_2}$ of the entire graphs $G_1$ and $G_2$, respectively, obtained from mean or max graph-level readout functions are equal if both graphs are regular and have the same type of nodes (i.e., the same feature vectors).

We prove the lemma in Appendix D. A similar result, namely that mean and max node-level aggregators used to sequentially update the node features are less powerful, was provided in [GIN]. The main difference is that the mean and max functions here are used to obtain the graph embeddings. Some example graphs that cannot be discriminated using mean or max readouts are shown in Fig. 2(b), where $G_1$ and $G_2$ are regular and have the same type of nodes (indicated by the same color). Therefore, we use the basic summation operator, which preserves injectivity, to obtain the branch-level graph embedding. In other words, each branch of SPIN is as powerful as the 1-WL test.
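Lemma 3 is easy to verify numerically. In a toy check of our own construction: on two regular graphs with identical features on every node, the per-node branch embeddings coincide, so mean and max readouts cannot separate the graphs, while the sum readout separates them through the node count.

```python
import numpy as np

# A triangle and a 4-cycle: both 2-regular, different numbers of nodes,
# with the identical feature vector x on every node.
x = np.array([1.0, 2.0])
Z1 = np.tile(x, (3, 1))    # per-node branch embeddings of the triangle
Z2 = np.tile(x, (4, 1))    # per-node branch embeddings of the 4-cycle

# Mean and max readouts produce equal graph embeddings (cannot separate).
mean_equal = np.allclose(Z1.mean(axis=0), Z2.mean(axis=0))
max_equal = np.allclose(Z1.max(axis=0), Z2.max(axis=0))

# The sum readout gives 3x vs. 4x, which separates the two graphs.
sum_equal = np.allclose(Z1.sum(axis=0), Z2.sum(axis=0))
```

This mirrors the example in Fig. 2(b): only the injectivity-preserving sum distinguishes the two regular graphs.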

### 4.3 Global readout

Global readout refers to the pooling of the branch-level graph embeddings to obtain a single representation for an entire graph. From Lemma 3, to retain the discriminative power of the branch-level embeddings, we use the injectivity-preserving concatenation function (and not mean or max) to perform the global readout and obtain the representation vector of an entire graph as

 e_G = \psi\big( \{ s_G^{(0)}, \ldots, s_G^{(R)} \} \big) = \big[ s_G^{(0)T}, \ldots, s_G^{(R)T} \big]^T, \qquad (9)

where $\psi(\cdot)$ is the multiset concatenation function.

This completes the SPIN architecture, which satisfies the conditions provided in Theorem 1 and thus is as powerful as the 1-WL test.

### 4.4 Computational complexity

The time complexity of GCN (an SA-GNN model) with $K$ graph convolution layers is about $\mathcal{O}(K N d^2 + K E d)$ for a graph having $E$ edges and $N$ nodes with each node having a $d$-dimensional feature vector [survey], where for simplicity we retain $d$-dimensional features in all the hidden layers. Here, the term $\mathcal{O}(K N d^2)$ corresponds to the computations involved in the feature transformations, whereas the term $\mathcal{O}(K E d)$ is due to the neighborhood aggregations performed during training. GIN has a similar complexity to GCN, as it also performs a sequential neighborhood aggregation. Although GraphSAGE performs neighborhood sampling to reduce the computation cost, the neighbor nodes are identified during training. The number of edges increases the compute requirements of SA-GNN models, particularly for dense graphs, where $E$ approaches $N^2$. In contrast, a PA-GNN model performs the neighborhood aggregations beforehand, due to which its training runtime is independent of the number of edges in the graphs. Specifically, a PA-GNN model with $R+1$ branches has a time complexity of about $\mathcal{O}(R N d^2)$. The time complexity due to the branch-level and global readouts does not depend on the size and structure of the graph.

## 5 Experiments

This section describes the experiments performed to evaluate SPIN and compare its graph classification accuracy with state-of-the-art GNN variants.

We evaluate SPIN on twelve diverse benchmark datasets for binary as well as multiclass graph classification tasks. Specifically, we evaluate on five chemical domain datasets: D&D [DD], PROTEINS [proteins], NCI1 [nci1], ENZYMES [enzymes] and OGBG-MOLHIV [ogb], five social domain datasets: IMDB-BINARY, IMDB-MULTI, COLLAB, REDDIT-BINARY and REDDIT-MULTI [social], and two brain network datasets: OHSU and PEKING-1 [brain]. All the datasets are publicly available [public, brain, ogb] and are commonly used for evaluating GNNs. More details about these datasets are provided in Table 4 at the end of the paper.

### 5.1 Experimental setup

We follow the evaluation procedure described in [Fair], which suggests a standard procedure for evaluating machine learning models for graph classification. Based on the sources of dataset curation, the social and chemical datasets (except OGBG-MOLHIV) are collectively referred to as the TU datasets, and OGBG-MOLHIV is referred to as the open graph benchmark (OGB) dataset. We next describe the experimental setup for each of them.

#### 5.1.1 TU datasets

For all experiments with the TU datasets, we use input features as suggested by [Fair]. Specifically, for molecular graphs from the chemical domain, the nodes are augmented with the one-hot encodings of their atom types. For the ENZYMES dataset, we append the available 18-dimensional node attributes with the one-hot encodings of the atom types. We conduct experiments on the predefined stratified data splits provided by [Fair]. We use an inner holdout technique with a 90%-10% training-validation split, and each selected model is trained three times on a testing fold to eliminate any weight-initialization biases. We use 10-fold cross-validation and report the average testing accuracy over all the folds with its standard deviation in Tables 1 and 2.

#### 5.1.2 OGB dataset

We consider a molecular graph prediction dataset, referred to as OGBG-MOLHIV [moleculenet]. This is a large-scale dataset with 41127 graphs to be classified. The nodes in the graphs are augmented with the available 9-dimensional input node attributes. We use the same scaffold data split for evaluation as in [ogb]. As the dataset is highly skewed, we report the area under the ROC curve (AUROC) for the testing set. The process is repeated thrice to avoid any weight-initialization biases, and the standard deviation is reported in Table 2.

#### 5.1.3 Brain datasets

We also consider two brain datasets, namely, OHSU and PEKING-1, which are used for hyperactivity/impulsivity and gender classification studies, respectively [brain]. Graphs are constructed using a CC200 parcellation of the brain functional magnetic resonance imaging (fMRI) data by mapping each region of the brain to a node and modeling the similarity between these regions through the edges. As with the social domain datasets, we use the nodal degree information as input features for these brain datasets. We use a 90%-10% training-validation split for model selection and train each selected model three times to avoid initialization biases. We perform 5-fold cross-validation and report the average testing accuracy and the standard deviation across all the folds in Table 3.

### 5.2 Training and baselines

For all the datasets, we train SPIN in a supervised manner using the cross-entropy loss function, which for a minibatch $\mathcal{B}$ of $M$ graphs and a $K$-class classification problem is defined as

 $L = -\frac{1}{M}\sum_{G_i \in \mathcal{B}} \sum_{k=1}^{K} y_{ik} \log\big([\mathrm{softmax}(\mathrm{MLP}(e_{G_i}))]_k\big)$, (10)

where $y_i \in \{0,1\}^K$ is the one-hot label associated with graph $G_i$. The graph label is predicted using an MLP classifier with the graph embedding $e_{G_i}$ as input. The components of the $K$-dimensional output vector after the softmax function are interpreted as the probabilities assigned to the $K$ classes. We perform experiments on SPIN with the attention mechanism (referred to as SPIN-att) and without attention (referred to as SPIN-non-att, obtained by disabling the attention mechanism) to validate the performance gain achieved using attention. We implement early stopping while training, i.e., we terminate the training process if there is no significant gain in validation accuracy after a certain number of training epochs, and use standard techniques like L2 regularization, dropout, and batch normalization to avoid overfitting. All the hyperparameters, namely, the batch size, learning rate, intermediate node representation vector dimensions, depth of the GNN model, and L2 regularization parameters, are summarized in Table 5 at the end of the paper.
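As a minimal sketch of how the loss in Eq. (10) is evaluated, assuming a generic linear classifier in place of the paper's MLP (all names, shapes, and values below are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy_loss(graph_embeddings, one_hot_labels, classifier):
    """Minibatch cross-entropy loss of Eq. (10).

    graph_embeddings: (M, d) array, one row per graph embedding e_{G_i}.
    one_hot_labels:   (M, K) array of one-hot labels y_i.
    classifier:       callable mapping (M, d) embeddings to (M, K) logits,
                      standing in for the MLP classifier.
    """
    probs = softmax(classifier(graph_embeddings))  # (M, K) class probabilities
    M = graph_embeddings.shape[0]
    return -np.sum(one_hot_labels * np.log(probs)) / M

# Toy usage with a random linear classifier on random embeddings.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
embeddings = rng.normal(size=(5, 4))
labels = np.eye(3)[[0, 2, 1, 0, 2]]
loss = cross_entropy_loss(embeddings, labels, lambda e: e @ W)
```

In a full training loop this scalar would be minimized by backpropagation through both the classifier and the transformation functions.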

We compare the graph classification performance of SPIN with non-graph methods, indicated as Baseline in Tables 1, 2, and 3. The molecular fingerprint technique [MFT1, MFT2] is used as the baseline for the chemical datasets, while (non-graph) multi-layer perceptrons [deepsets] are used for the ENZYMES, social domain, and brain datasets. Such a comparison with non-graph methods reveals the capability of GNNs in exploiting the graph topology. Furthermore, we compare SPIN with GNN variants commonly used for graph classification, namely, DGCNN [DGCNN], DiffPool [diffpool], ECC [ECC], GIN [GIN], and GraphSAGE [GraphSAGE].

### 5.3 Results and discussion

Tables 1, 2, and 3 report the performance of SPIN on the social domain, chemical domain, and brain datasets, respectively. Firstly, SPIN can capture and exploit the underlying graph structure, which can be seen from the results as it outperforms the baseline on all the social domain and brain datasets, and on NCI1 from the chemical domain. On the D&D, PROTEINS, and ENZYMES datasets, no GNN model exceeds the baseline, suggesting that the structural information is not important for these datasets. Imposing an inductive bias to learn the graph-structural information for these three datasets deteriorates the model’s performance. Interestingly, on the D&D and PROTEINS datasets, SPIN outperforms the other GNN variants and is on par with the baseline, demonstrating its selective nature of utilizing the topological information whenever needed. Furthermore, SPIN-att outperforms SPIN-non-att on all the social domain datasets and on PEKING-1 from the brain datasets. This might be due to the degree information encoded as the input features for the social domain datasets, which aids in identifying and attending to the relevant nodes for the classification task at hand. In the chemical domain datasets, all the nodes play an important role, each representing a particular type of atom; see in Fig. 2(a) that all the nodes in a sample graph from the PROTEINS dataset have similar attention weights. Suppressing the features of an atom in a molecular graph has a negative impact on predicting the type of that molecule, as can be seen from the results. On the D&D and PROTEINS datasets, SPIN-non-att outperforms SPIN-att and the other GNN variants.

In summary, SPIN performs competitively with or better than existing GNNs on the benchmark datasets while incurring less training and inference time, as its training runtime does not depend on the graph structure (the number of edges).

## 6 Conclusions

We have presented a theoretical framework to characterize and analyze the discriminative power of PA-GNN models, which are GNNs that perform neighborhood aggregation in parallel and before the training procedure. Consequently, the training computations of PA-GNN models are independent of the graph structure, yet the models capture the graph structure well. We have provided conditions under which PA-GNN models are provably as powerful as the 1-WL test. Although the node embedding aggregation in PA-GNN models apparently has a different form than the iterative label update procedure of the 1-WL algorithm, we have shown that the node labels from the 1-WL test are injectively related to the node embeddings generated by a PA-GNN model. We have also presented an example GNN model, namely, SPIN, that obeys the prescribed theoretical conditions. We have demonstrated via experiments that SPIN outperforms state-of-the-art methods on a majority of graph classification benchmark datasets related to social, chemical, and brain networks.

## Appendix A Proof for Theorem 1

We prove that a PA-GNN model satisfying the conditions provided in Theorem 1 is as powerful as the 1-WL test. Unlike the iterative node embedding update procedure in SA-GNN models, the procedure to compute the node embeddings in PA-GNN models apparently has a different form than the iterative label update procedure in the 1-WL vertex refinement algorithm. In what follows, we show that the node labels generated by the 1-WL algorithm can be mapped to the node embeddings generated by a PA-GNN model through an injective function. The proof has two parts. In the first part, we relate the embedding of a node from a PA-GNN model to its label from the 1-WL algorithm. In the second part, we relate the graph embeddings from a PA-GNN model to those of the 1-WL test.

For a graph $G = (\mathcal{V}, \mathcal{E})$, recall that the 1-WL vertex refinement update equation for a node $v$ is given by

 $l_v^{(t+1)} = \varphi\big(l_v^{(t)}, \{l_j^{(t)}, \forall j \in \mathcal{N}_v^{(1)}\}\big)$,

where $\mathcal{N}_v^{(1)}$ is the set of one-hop neighbors of node $v$. The 1-WL iteration is initialized with the input features as $l_v^{(0)} = x_v$. Also, recall that the node embeddings generated using a PA-GNN model with $R+1$ branches are given by

 $h_v = \Theta\big(z_v^{(0)}, z_v^{(1)}, \cdots, z_v^{(R)}\big)$

with $z_v^{(0)} = c_v^{(0)}(x_v)$ and $z_v^{(r)} = c_v^{(r)}(\{x_j, \forall j \in \mathcal{N}_v^{(r)}\})$ for $r \geq 1$, where $c_v^{(r)}$ is the aggregation function at the $r$-th branch.

Let us introduce the embedding $h_v^{(r)}$ of node $v$ that gathers information from its $r$-hop neighborhood (i.e., from the first $r+1$ branches), and define it as $h_v^{(r)} = \Theta_r(z_v^{(0)}, \ldots, z_v^{(r)})$ with $\Theta_r$ being the transformation function that operates on the intermediate embeddings. Let us also introduce the composite function $f \circ g$, where the notation $f \circ g$ means that the function $f$ is composed with the function $g$. We frequently use the fact that composition preserves injectivity.
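The parallel precomputation of the branch features can be sketched as follows, assuming sum aggregation over $r$-hop walks for $c_v^{(r)}$ and concatenation for the transformation step (one simple admissible choice, not necessarily the paper's exact operators):

```python
import numpy as np

def parallel_aggregations(A, X, R):
    """Precompute branch features z^(0), ..., z^(R) for all nodes at once.

    A: (n, n) adjacency matrix; X: (n, d) input node features.
    Branch r here uses A^r X, i.e., sums features over length-r walks --
    one concrete choice of neighborhood aggregation.
    """
    Z, Ar = [X], np.eye(A.shape[0])
    for _ in range(R):
        Ar = Ar @ A          # walk counts of the next length
        Z.append(Ar @ X)     # z^(r)_v for every node v, computed in parallel
    return Z                 # list of R+1 arrays, each of shape (n, d)

def node_embeddings(Z):
    # Transformation step: concatenate the branch outputs, which keeps the
    # individual branch features recoverable (an injectivity-friendly choice).
    return np.concatenate(Z, axis=1)

# 4-node path graph 0-1-2-3 with scalar node features [0, 1, 2, 3].
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
X = np.arange(4, dtype=float).reshape(-1, 1)
H = node_embeddings(parallel_aggregations(A, X, R=2))  # shape (4, 3)
```

Because the aggregations involve only the fixed graph and input features, they can be computed once before training, which is the source of the runtime advantage discussed in the paper.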

### a.1 Relating the node embeddings in a PA-GNN model to the labels in the 1-WL algorithm

To begin with, we analyze the relation between the node embeddings generated non-iteratively using a PA-GNN model and the labels generated by the iterative 1-WL algorithm. Through induction, we show that there always exists an injective function $\tau_t$ such that $h_v^{(t)} = \tau_t(l_v^{(t)})$ for $t = 0, 1, \ldots, R$. In other words, the node embedding generated by the first $t+1$ branches of a PA-GNN model gathering information from the $t$-hop neighborhood of node $v$ can be mapped to the label generated at the $t$-th iteration of the 1-WL test. Let us assume that all the functions involved in generating the node embeddings, namely, the aggregation functions $c_v^{(r)}$ and the transformation functions $\Theta_r$, are injective.

Let us first verify the base case. For a single branch PA-GNN model with $R = 0$, we have

 $h_v^{(0)} = \Theta_0(z_v^{(0)}) = \Theta_0(c_v^{(0)}(x_v))$

with $l_v^{(0)} = x_v$, so that $h_v^{(0)} = \tau_0(l_v^{(0)})$ with the injective function $\tau_0 = \Theta_0 \circ c_v^{(0)}$. Next, for a two branch PA-GNN model with $R = 1$, we have

 $h_v^{(1)} = \Theta_1(z_v^{(0)}, z_v^{(1)}) = \Theta_1\big(c_v^{(0)}(x_v), c_v^{(1)}(\{x_j, \forall j \in \mathcal{N}_v^{(1)}\})\big)$.

As the composition of injective functions is also injective, the above equation simplifies to

 $h_v^{(1)} = \rho_1(x_v, \{x_j, \forall j \in \mathcal{N}_v^{(1)}\})$

for some injective function $\rho_1$. Since

 $l_v^{(1)} = \varphi(l_v^{(0)}, \{l_j^{(0)}, \forall j \in \mathcal{N}_v^{(1)}\}) = \varphi(x_v, \{x_j, \forall j \in \mathcal{N}_v^{(1)}\})$,

we have $h_v^{(1)} = \tau_1(l_v^{(1)})$ with $\tau_1 = \rho_1 \circ \varphi^{-1}$.

Similarly, for a three branch PA-GNN model with $R = 2$, we have

 $h_v^{(2)} = \Theta_2\big(c_v^{(0)}(x_v), c_v^{(1)}(\{x_j, \forall j \in \mathcal{N}_v^{(1)}\}), c_v^{(2)}(\{x_t, \forall t \in \mathcal{N}_j^{(1)}, \forall j \in \mathcal{N}_v^{(1)}\})\big)$.

As $\Theta_2$ and the aggregation functions $c_v^{(r)}$ are injective, $h_v^{(2)}$ can be expressed as

 $h_v^{(2)} = \rho_2\big(x_v, \{x_j, \forall j \in \mathcal{N}_v^{(1)}\}, \{x_t, \forall t \in \mathcal{N}_j^{(1)}, \forall j \in \mathcal{N}_v^{(1)}\}\big)$ (11)

for some injective function $\rho_2$. In fact, the last set in Equation (11) represents the embeddings at the 2-hop neighbors of node $v$.

In the 1st iteration of the 1-WL update, the labels depend on their 1-hop neighbors; similarly, in the 2nd iteration, the labels depend on their 1-hop and 2-hop neighbors. The second iteration of the 1-WL update can be written as

 $l_v^{(2)} = \varphi\big(\varphi(l_v^{(0)}, \{l_u^{(0)}, \forall u \in \mathcal{N}_v^{(1)}\}), \{\varphi(l_j^{(0)}, \{l_t^{(0)}, \forall t \in \mathcal{N}_j^{(1)}\}), \forall j \in \mathcal{N}_v^{(1)}\}\big)$. (12)

As $\varphi$ in the 1-WL algorithm is an injective hashing function, we have

 $l_v^{(2)} = \varphi_2\big(l_v^{(0)}, \{l_j^{(0)}, \forall j \in \mathcal{N}_v^{(1)}\}, \{l_t^{(0)}, \forall t \in \mathcal{N}_j^{(1)}, \forall j \in \mathcal{N}_v^{(1)}\}\big)$

for some injective function $\varphi_2$. Hence, $h_v^{(2)} = \tau_2(l_v^{(2)})$ with $\tau_2 = \rho_2 \circ \varphi_2^{-1}$.

Next, we assume there exists an injective mapping function $\tau_t$ up to $t = R-1$ such that $h_v^{(t)} = \tau_t(l_v^{(t)})$, and prove that there exists an injective mapping for the $R$-th iteration. The embedding for node $v$ from an $(R+1)$-branch PA-GNN model is

 $h_v^{(R)} = \Theta_R\big(c_v^{(0)}(x_v), \cdots, c_v^{(R-1)}(\{x_j, \forall j \in \mathcal{N}_v^{(R-1)}\}), c_v^{(R)}(\{x_j, \forall j \in \mathcal{N}_v^{(R)}\})\big)$,

which can be alternatively represented as

 $h_v^{(R)} = \Theta_R\big(\Theta_{R-1}^{-1} \circ \Theta_{R-1}(U), c_v^{(R)}(\{x_j, \forall j \in \mathcal{N}_v^{(R)}\})\big)$,

where

 $U = \big\{c_v^{(0)}(x_v), \cdots, c_v^{(R-1)}(\{x_j, \forall j \in \mathcal{N}_v^{(R-1)}\})\big\}$.

Therefore,

 $h_v^{(R)} = \Theta_R\big(\Theta_{R-1}^{-1}(h_v^{(R-1)}), c_v^{(R)}(\{x_j, \forall j \in \mathcal{N}_v^{(R)}\})\big)$.

Replacing $h_v^{(R-1)}$ with $\tau_{R-1}(l_v^{(R-1)})$ and the node features $x_j$ with the initial labels $l_j^{(0)}$, we get

 $h_v^{(R)} = \Theta_R\big(\Theta_{R-1}^{-1}(\tau_{R-1}(l_v^{(R-1)})), c_v^{(R)}(\{l_j^{(0)}, \forall j \in \mathcal{N}_v^{(R)}\})\big)$.

For some injective function $\rho_R$, the above equation simplifies to

 $h_v^{(R)} = \rho_R(l_v^{(R-1)}, \{l_j^{(0)}, \forall j \in \mathcal{N}_v^{(R)}\})$.

Similarly, the 1-WL update at the $R$-th iteration is

 $l_v^{(R)} = \varphi(l_v^{(R-1)}, \{l_j^{(R-1)}, \forall j \in \mathcal{N}_v^{(1)}\})$.

We have seen above that at the $R$-th iteration, the 1-WL update depends on all the labels from nodes within the $R$-hop neighborhood of $v$. So, for some injective function $\varphi_{R-1}$, we can write $l_v^{(R)}$ as

 $l_v^{(R)} = \varphi_{R-1}(l_v^{(R-1)}, \{l_j^{(0)}, \forall j \in \mathcal{N}_v^{(R)}\})$.

Hence, $h_v^{(R)} = \tau_R(l_v^{(R)})$ with $\tau_R = \rho_R \circ \varphi_{R-1}^{-1}$.

### a.2 Relating graph-level readouts from a PA-GNN model and the 1-WL algorithm

Next, we establish the second condition by analyzing the relation between the graph-level embedding $e_1$ generated by a PA-GNN model and the embedding $e_2$ generated using the final labels (after the $R$-th iteration) from the 1-WL test for an entire graph.

With PA-GNN models, we first obtain the embedding $s_G^{(r)}$ for an entire graph at each branch by a branch-level readout of the node representation vectors. Then the embedding of an entire graph is computed through a global readout as

 $e_1 = \psi_1(s_G^{(0)}, s_G^{(1)}, \cdots, s_G^{(R)})$

with $s_G^{(r)}$ obtained by pooling the node embeddings $\{z_v^{(r)}, \forall v \in \mathcal{V}\}$ of branch $r$. Here, the branch-level readout function operates individually on each branch, and the global readout function $\psi_1$ produces the graph embedding by pooling the branch-level embeddings. We assume both readout functions to be injective.

The graph representation generated from the node labels of the 1-WL algorithm after $R$ iterations is given by

 $e_2 = \psi_2(\{l_v^{(R)}, \forall v \in \mathcal{V}\})$,

where $\psi_2$ is an injective graph pooling function.

We show that there always exists an injective function $\Psi$ that maps $e_2$ to $e_1$, i.e., $e_1 = \Psi(e_2)$. To do so, since the readout functions are injective, we introduce an injective function $\Delta$ such that

 $e_1 = \Delta(\{z_v^{(0)}, \cdots, z_v^{(R)}, \forall v \in \mathcal{V}\}) = \Delta\big(\Theta^{-1}(\{h_v^{(R)}, \forall v \in \mathcal{V}\})\big)$.

As established previously, we have $h_v^{(R)} = \tau_R(l_v^{(R)})$. Thus

 $e_1 = \Delta\big(\Theta^{-1}(\{\tau_R(l_v^{(R)}), \forall v \in \mathcal{V}\})\big) = \Psi(e_2)$

with $\Psi$ an injective function, as it is a composition of injective functions.

In essence, all the functions involved in a PA-GNN model should be injective for it to be at least as powerful as the 1-WL test.

## Appendix B Proof for Lemma 1

For two distinct multisets $\mathcal{X}_1$ and $\mathcal{X}_2$, we next prove that the weighted summations of their linear mappings with a ReLU or Leaky-ReLU nonlinearity can be equal, i.e.,

 $\sum_{x_i \in \mathcal{X}_1} \alpha_i\, \sigma(W x_i) = \sum_{x_i \in \mathcal{X}_2} \beta_i\, \sigma(W x_i)$ (13)

for an arbitrary linear mapping $W$ and arbitrary weights $\alpha_i$ for $\mathcal{X}_1$ and $\beta_i$ for $\mathcal{X}_2$. Here the nonlinearity $\sigma$ is either $\mathrm{ReLU}$ or $\mathrm{L\text{-}ReLU}$, where ReLU is defined as

 $\mathrm{ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$

and Leaky-ReLU is defined as

 $\mathrm{L\text{-}ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ cx & \text{if } x \leq 0 \end{cases} \quad \text{for } c > 0.$

### b.1 The case with ReLU

By definition, $\mathrm{ReLU}(Wx)$ is elementwise equal to $Wx$ or zero depending on $Wx$ being elementwise positive or negative, respectively. Let us introduce the symbols $\succ$ and $\prec$ to, respectively, denote elementwise greater-than or elementwise less-than inequalities for vectors.

When $Wx \prec 0$ for all $x \in \mathcal{X}_1 \cup \mathcal{X}_2$, we have $\mathrm{ReLU}(Wx) = 0$ for all $x$. Then

 $\sum_{x_i \in \mathcal{X}_1} \alpha_i\, \mathrm{ReLU}(W x_i) = \sum_{x_i \in \mathcal{X}_2} \beta_i\, \mathrm{ReLU}(W x_i) = 0$

irrespective of the weights $\alpha_i$ and $\beta_i$. When $Wx \succ 0$, $\mathrm{ReLU}(Wx) = Wx$. Thus

 $\sum_{x_i \in \mathcal{X}_1} \alpha_i\, \mathrm{ReLU}(W x_i) = W\Big(\sum_{x_i \in \mathcal{X}_1} \alpha_i x_i\Big) \quad \text{and} \quad \sum_{x_i \in \mathcal{X}_2} \beta_i\, \mathrm{ReLU}(W x_i) = W\Big(\sum_{x_i \in \mathcal{X}_2} \beta_i x_i\Big)$.

Hence, we can see that if the weighted sums $\sum_{x_i \in \mathcal{X}_1} \alpha_i x_i = \sum_{x_i \in \mathcal{X}_2} \beta_i x_i$, their linear mappings followed by the ReLU nonlinearity can be equal.

### b.2 The case with Leaky-ReLU

For $Wx \succ 0$, the proof remains exactly the same as in the case of ReLU, as ReLU and Leaky-ReLU coincide for positive inputs. When $Wx \prec 0$, $\mathrm{L\text{-}ReLU}(Wx) = cWx$. Since scalar multiplication does not affect linearity, the above argument holds.

Example. Consider a multiset $\mathcal{X}_1$ with weights $\alpha_i$ and a multiset $\mathcal{X}_2$ with weights $\beta_i$ such that the two multisets are different but their weighted sums are equal. For any linear transformation $W$, their weighted sums followed by ReLU or Leaky-ReLU are then the same.
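The concrete numbers of the example were lost in extraction; the sketch below uses our own illustrative values satisfying the stated conditions (distinct multisets, equal weighted sums, $Wx \succ 0$):

```python
import numpy as np

def weighted_relu_sum(W, xs, weights):
    # sum_i weight_i * ReLU(W x_i)
    return sum(a * np.maximum(W @ x, 0.0) for a, x in zip(weights, xs))

# Distinct multisets whose weighted sums coincide: 1*1 + 1*2 == 1*3.
X1, alphas = [np.array([1.0]), np.array([2.0])], [1.0, 1.0]
X2, betas = [np.array([3.0])], [1.0]

W = np.array([[2.5]])  # W x > 0 on every element above, so ReLU acts as identity
s1 = weighted_relu_sum(W, X1, alphas)
s2 = weighted_relu_sum(W, X2, betas)
```

Here the two multisets differ, yet `s1` equals `s2`, which is exactly the loss of injectivity Lemma 1 describes.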

## Appendix C Proof for Lemma 2

From the conditions provided in Theorem 1, we choose injective functions for the aggregation function $c_v^{(r)}$ and the transformation function $\Theta$. Therefore, the intermediate embeddings produced for two different feature vectors are distinct.

Consider two distinct intermediate embeddings $z_1$ and $z_2$ generated from the same branch for two non-isomorphic graphs. Let us denote the node representation vectors with attention as $\alpha_i z_i$, where $\alpha_i$ is computed with an arbitrary vector $w$. The attention mechanism is defined as

 $\alpha_i = \mathrm{ATTENTION}(w, z_i) = \frac{1}{n_i} \exp(\beta_i)$,

where $\beta_i = \mathrm{ReLU}(\langle w, z_i \rangle)$. Here, we have introduced the normalization factor $n_i$ due to the softmax operation.

We show that for distinct non-zero vectors $z_1 \neq z_2$ (element-wise inequality), the attention mechanism preserves injectivity, i.e., $\alpha_1 z_1 \neq \alpha_2 z_2$. In particular, we show this for the two cases with the vectors $z_1$ and $z_2$ being linearly independent, and linearly dependent but not identical.
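A minimal sketch of this attention mechanism, with the softmax normalization written out explicitly (shapes and variable names are our own):

```python
import numpy as np

def attention_weights(w, Z):
    """Softmax attention from Appendix C (sketch):
    beta_i = ReLU(<w, z_i>), alpha_i = exp(beta_i) / sum_j exp(beta_j)."""
    beta = np.maximum(Z @ w, 0.0)
    e = np.exp(beta - beta.max())  # shift for numerical stability
    return e / e.sum()

def attend(w, Z):
    # Attended node representations alpha_i * z_i.
    return attention_weights(w, Z)[:, None] * Z

# Two linearly independent embeddings keep distinct attended vectors.
w = np.array([1.0, 0.5])
Z = np.array([[1.0, 0.0], [0.0, 1.0]])
att = attend(w, Z)
```

The positive scalar weights rescale but never collapse linearly independent vectors, which is the crux of the injectivity argument that follows.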

To begin with, we analyze the conditions under which $\alpha_1 z_1 = \alpha_2 z_2$. Assuming that $\alpha_1 z_1 = \alpha_2 z_2$, we have

 $\frac{1}{n_1} \exp(\beta_1)\, z_1 = \frac{1}{n_2} \exp(\beta_2)\, z_2$,

which implies that $z_2 = p\, z_1$ for some scalar $p$. In other words, for $\alpha_1 z_1 = \alpha_2 z_2$, the vectors $z_1$ and $z_2$ have to be linearly dependent.

Now, we inspect the conditions on $p$ under which $z_1$ and $z_2$ are linearly dependent, i.e., that lead to $\alpha_1 z_1 = \alpha_2 z_2$. When $z_2 = p\, z_1$, we have

 $\exp(\beta_1)\, z_1 = p \exp(\beta_2)\, z_1$.

Since the scaling factors $1/n_1$ and $1/n_2$ are positive, with a slight abuse of notation, we have absorbed them in the constant $p$. Substituting $\beta_i = \mathrm{ReLU}(\langle w, z_i \rangle)$, we get

 $\exp\big(\mathrm{ReLU}(\langle w, z_1 \rangle)\big)\, z_1 = p \exp\big(\mathrm{ReLU}(\langle w, p z_1 \rangle)\big)\, z_1$.

When $\langle w, z_1 \rangle > 0$, using the fact that the scaled ReLU can be written as $\mathrm{ReLU}(p\, b) = p\, \mathrm{ReLU}(b)$ for $p > 0$, we obtain the equation

 $b - p\, b^{p} = 0$

with $b = \exp\big(\mathrm{ReLU}(\langle w, z_1 \rangle)\big)$. The general solution to the above equation is given in terms of $W_n$, the analytic continuation of the product log function [Wn]. However, since as