From Graph Low-Rank Global Attention to 2-FWL Approximation

06/14/2020 · by Omri Puny, et al.

Graph Neural Networks (GNNs) are known to have an expressive power bounded by that of the vertex coloring algorithm (Xu et al., 2019a; Morris et al., 2018). However, for rich node features, such a bound does not exist and GNNs can be shown to be universal, namely, have the theoretical ability to approximate arbitrary graph functions. It is well known, however, that expressive power alone does not imply good generalization. In an effort to improve generalization of GNNs we suggest the Low-Rank Global Attention (LRGA) module, taking advantage of the efficiency of low rank matrix-vector multiplication, that improves the algorithmic alignment (Xu et al., 2019b) of GNNs with the 2-folklore Weisfeiler-Lehman (FWL) algorithm; 2-FWL is a graph isomorphism algorithm that is strictly more powerful than vertex coloring. Concretely, we: (i) formulate 2-FWL using polynomial kernels; (ii) show LRGA aligns with this 2-FWL formulation; and (iii) bound the sample complexity of the kernel's feature map when learned with a randomly initialized two-layer MLP. The latter means the generalization error can be made arbitrarily small when training LRGA to learn the 2-FWL algorithm. From a practical point of view, augmenting existing GNN layers with LRGA produces state of the art results on most datasets in a GNN standard benchmark.







1 Introduction

In many domains, data can be represented as a graph, where entities interact, have meaningful relations and a global structure. The need to infer and gain a better understanding of such data arises in many instances, such as social networks, citations and collaborations, chemoinformatics, and epidemiology. In recent years, along with the major evolution of artificial neural networks, graph learning has also gained a powerful new tool: graph neural networks (GNNs). Since they first originated as recurrent models (Gori2005; Scarselli2009), GNNs have become a central interest and the main tool in graph learning.

Perhaps the most commonly used family of GNNs are message-passing neural networks (Gilmer2017), built by aggregating messages from local neighborhoods at each layer. Since information is only kept at the vertices and propagated via the edges, these models' complexity scales linearly with $n + m$, where $n$ and $m$ are the number of vertices and edges in the graph, respectively. In a recent analysis of the expressive power of such models, (xu2018how; morris2018weisfeiler) have shown that message-passing neural networks are at most as powerful as the first Weisfeiler-Lehman (WL) test, also known as vertex coloring. The $k$-WL tests are a hierarchy of algorithms of increasing power and complexity aimed at solving graph isomorphism. This bound on the expressive power of GNNs led to the design of new architectures (morris2018weisfeiler; maron2019provably) mimicking higher orders of the $k$-WL family, resulting in more powerful yet complex models that scale super-linearly in $n$, hindering their usage for larger graphs.

Although expressive power bounds on GNNs exist, empirically GNNs are able to fit the train data well on many datasets. This indicates that the expressive power of these models might not be the main roadblock to successful generalization. Therefore, we focus our efforts in this paper on strengthening GNNs from a generalization point of view. Towards improving the generalization of GNNs we propose the Low-Rank Global Attention (LRGA) module, which can be augmented to any GNN layer. We define a $\kappa$-rank attention matrix, where $\kappa$ is a parameter, that requires $O(n(d+\kappa))$ memory and can be applied in $O(n\kappa(d+\kappa))$ computational complexity. This is in contrast to standard attention modules that apply an $n \times n$ attention matrix to node data with $O(n^2)$ memory and $O(n^2 d)$ computational complexity.

To theoretically justify LRGA we first note that bounds on the expressive power of GNNs vanish when a graph has informative node features, which is often believed to be the case for real-life data. We therefore restrict our attention to a class of graphs called rich feature graphs, which have their structural information encoded in the node features. For this class of graphs we show that GNNs are indeed universal, consistent with the empirical success of GNNs in fitting train data. To give grounds for the expected improved generalization properties of LRGA compared to generic GNNs over this class of graphs, we show that it aligns with the 2-folklore WL (2-FWL) algorithm; 2-FWL is a strictly more powerful graph isomorphism algorithm than vertex coloring (which bounds message-passing GNNs). To do so, we adopt the notion of algorithmic alignment introduced in (Xu2019algoalign), stating that a neural network aligns with an algorithm if it can simulate it with simple modules.

Our theoretical analysis of LRGA under the rich feature graph assumption therefore includes: (i) formulating the 2-FWL algorithm, which is strictly stronger than vertex coloring, using polynomial kernels; (ii) showing that LRGA aligns with this formulation of 2-FWL, meaning that LRGA can approximate (for a sufficiently high rank $\kappa$) the update step of the 2-FWL algorithm with simple functions; and (iii) bounding the sample complexity of the LRGA module when learning the 2-FWL update rule. Although our bound is exponential in the graph size, it nevertheless implies that LRGA can provably learn the 2-FWL step when training each module independently.

We evaluated our model on a set of benchmark datasets including tasks of graph classification, graph regression, node labeling and link prediction from (dwivedi2020benchmarking; Hu2020). LRGA improves state-of-the-art performance on most of the datasets. We further perform an ablation study on the choice of the rank $\kappa$ and its effect on model performance.

2 Related Work

Graph neural networks.

In recent years, graph learning tasks such as graph and node classification, graph regression and link prediction have been approached using graph neural networks, achieving state of the art results on many benchmark datasets compared to previous classical graph kernel methods (Kriege2020). Since first introduced as recurrent models in the early works of (Gori2005; Scarselli2009), many variants of GNNs have been proposed trying to generalize pooling and convolution operations to graphs.

Pooling operations have been used in the task of graph classification (ying2018hierarchical; Zhang2018), whereas generalized convolution methods have a broader range of applications. Convolutional networks include local (Duvenaud2015; Li2015; Battaglia2016; Niepert2016; Hamilton2017; monti2017geometric; Velickovic2018; Bresson2017Gated; xu2018how), spectral (Bruna2014; Defferrard2016; Kipf2016), and equivariant (Kondor2018; Ravanbakhsh2017; Maron2019) methods. Many of these can be formulated in the message-passing framework (Gilmer2017). This framework assigns new features to nodes by aggregating their neighborhoods and their own feature and applying an update function. Methods which use an anisotropic aggregation of neighborhoods, e.g., (Velickovic2018), are referred to as attention or gated models.
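As a concrete illustration, the aggregate-and-update pattern described above can be sketched in a few lines; the sum aggregation, the self-loop, and the linear-plus-ReLU update below are illustrative choices of ours, not a specific published architecture:

```python
import numpy as np

def message_passing_layer(A, X, update):
    """One generic message-passing step: sum neighbor features (plus the node's
    own feature via a self-loop), then apply a learnable update function."""
    agg = (A + np.eye(len(A))) @ X
    return update(agg)

rng = np.random.default_rng(4)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])          # path graph on 3 nodes
X = rng.standard_normal((3, 2))       # node features
W = rng.standard_normal((2, 2))       # weights of the (toy) update function
H = message_passing_layer(A, X, lambda Z: np.maximum(Z @ W, 0.0))
print(H.shape)  # (3, 2)
```

Because the aggregation touches each edge once, the cost of such a layer is linear in the number of edges, which is the scaling property discussed in Section 1.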

Attention mechanisms.

The first use of an attention mechanism in deep learning was in the context of natural language processing. Intuitively, attention provides an adaptive importance metric for interactions between pairs of elements, e.g., words in a sentence, pixels in an image or nodes in a graph. Previous graph attention works (Li2015; Velickovic2018; Abu-El-Haija2018; Bresson2017Gated) restrict learning the attention scores to the local neighborhoods of the nodes in the graph. A broader survey of graph attention networks is provided by (Lee2018).

Another line of work incorporates global aggregations in graphs using node embeddings (You2019position; Pei2020). The motivation for global aggregation in graphs is that local aggregations cannot capture long-range relations, which may be important when node homophily does not hold.

3 Low-rank global attention (LRGA)

We consider a graph $G=(V,E)$, where $V$ is the vertex-set of size $n$ and $E$ is the edge-set. Each vertex $i \in V$ carries an input feature vector $x_i \in \mathbb{R}^{d_0}$, where $d_0$ is the input feature dimension. The input vertices' feature vectors are summarized in a matrix $X^0 \in \mathbb{R}^{n \times d_0}$; in turn, $X^l \in \mathbb{R}^{n \times d}$ represents the output of the $l$-th layer of a neural network. We propose the Low-rank global attention (LRGA) module that is added to any graph neural network layer, denoted here generically as $\mathrm{GNN}$, in the following way:

$$X^{l+1} = \left[\,\mathrm{LRGA}(X^l),\ \mathrm{GNN}(X^l),\ X^l\,\right] \tag{1}$$
where the brackets imply concatenation along the feature dimension. The LRGA module is defined for an input feature matrix $X \in \mathbb{R}^{n \times d}$ via

$$\mathrm{LRGA}(X) = \frac{1}{\eta(X)}\left[\, m_1(X)\left(m_2(X)^{\top} m_3(X)\right),\ m_4(X)\,\right] \tag{2}$$

where $m_1,\ldots,m_4$ are MLPs operating on the feature dimension, that is, $m_i : \mathbb{R}^{d} \to \mathbb{R}^{\kappa}$ applied to every row of $X$, and $\kappa$ is a parameter representing the rank of the attention module. Lastly, $\eta(X)$ is a global normalization factor:

$$\eta(X) = \frac{1}{n}\,\mathbf{1}^{\top} \left(m_1(X)\, m_2(X)^{\top}\right) \mathbf{1} \tag{3}$$

where $\mathbf{1} \in \mathbb{R}^{n}$ is the all-ones vector. The matrix $m_1(X)\,m_2(X)^{\top}$ can be thought of as a $\kappa$-rank attention matrix that acts globally on the graph's node features. The normalization $\eta(X)$ represents the empirical expectation of the row sums in $m_1(X)\,m_2(X)^{\top}$, so the rows of $\frac{1}{\eta(X)}\, m_1(X)\, m_2(X)^{\top}$ sum to one on average.
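To make equations 2 and 3 concrete, here is a minimal numpy sketch of the module; the random single-layer maps stand in for the learned MLPs $m_1,\ldots,m_4$, and the function names are ours, not the paper's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(d_in, d_out):
    """Random single-layer map (linear + ReLU) standing in for a learned m_i."""
    W = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
    return lambda X: np.maximum(X @ W, 0.0)

def lrga(X, ms):
    """LRGA sketch: X is (n, d) node features, ms are four maps d -> kappa."""
    n = X.shape[0]
    M1, M2, M3, M4 = (m(X) for m in ms)                  # each (n, kappa)
    # Global normalization (eq. 3) computed without forming the n x n matrix:
    eta = (np.ones(n) @ M1) @ (M2.T @ np.ones(n)) / n
    # Rank-kappa attention applied to M3, again never materializing n x n:
    att = M1 @ (M2.T @ M3)
    return np.concatenate([att, M4], axis=1) / eta       # (n, 2 * kappa)

n, d, kappa = 10, 8, 4
X = rng.standard_normal((n, d))
out = lrga(X, [mlp(d, kappa) for _ in range(4)])
print(out.shape)  # (10, 8)
```

The output has $2\kappa$ columns, which are then concatenated with the GNN branch as in equation 1.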

Computational complexity.

Standard attention models (vaswani2017attention; Luong2015) require explicitly computing the attention score between all possible pairs in the set, meaning that their memory requirement and computational cost scale as $O(n^2)$ and $O(n^2 d)$, respectively. This makes global attention seem impractical for large sets, or large graphs in our case. Previous attention models applied to graphs resorted to local attention (or gating) mechanisms, masking the allowed pairs with the adjacency matrix.

We address the global attention computational challenge by working with bounded-rank (i.e., $\kappa \ll n$) attention matrices, and avoid the need to construct the attention matrix in memory by replacing the standard entry-wise normalization ($\mathrm{softmax}$ or $\mathrm{sigmoid}$) with the global normalization $\eta$. In turn, the memory requirement of LRGA is $O(n(d+\kappa))$, and using low-rank matrix-vector multiplications LRGA allows applying global attention in $O(n\kappa(d+\kappa))$ computation cost, i.e., linear in $n$.
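The saving comes purely from re-associating the matrix product, as this small numpy check illustrates (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, kappa, d = 1000, 8, 16
U = rng.standard_normal((n, kappa))   # stands for m1(X)
V = rng.standard_normal((n, kappa))   # stands for m2(X)
Z = rng.standard_normal((n, d))       # stands for m3(X)

# Naive order materializes an n x n attention matrix: O(n^2) memory, O(n^2 d) time.
naive = (U @ V.T) @ Z

# Re-associated order stays low-rank: O(n kappa) memory, O(n kappa d) time.
low_rank = U @ (V.T @ Z)

assert np.allclose(naive, low_rank)   # identical result, very different cost
```

Matrix multiplication is associative, so the two orders agree exactly (up to floating-point error); only the intermediate sizes differ.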

4 Theoretical Analysis

The LRGA module (equation 2) has an interesting justification in the context of graph neural networks, which we explore next. In essence, we restrict our attention to a certain graph class with informative node features, called rich feature graphs. For rich feature graphs we show several results: first, that GNNs are universal for these graphs, i.e., can approximate arbitrary graph functions. Since it is well known that universality alone does not imply good generalization, we further justify the LRGA module by showing it is algorithmically aligned (in the sense of (Xu2019algoalign)) with the 2-FWL algorithm, a powerful polynomial-time graph isomorphism approximation algorithm.

4.1 Rich features can make GNNs universal

The theoretical expressive power of GNNs has been shown to be bounded by that of the vertex coloring algorithm (a.k.a. Weisfeiler-Lehman or WL) (xu2018how; Morris2019), which is a polynomial-time approximate graph isomorphism algorithm. However, it is clear that more expressive node features can make the graph isomorphism problem easy to solve. In fact, many real-world datasets consist of graphs with meaningful vertex features which encompass structural information. Here, we are interested in evaluating the power of GNNs when the node features are informative. As a model for informative node features we define rich feature graphs and prove that for this model GNNs are universal, i.e., have maximal expressive power.

Notation. Let $G=(V,E)$ be a graph with $d$ features per node, i.e., $x_i \in \mathbb{R}^{d}$ for $i \in V$. In matrix form, $X \in \mathbb{R}^{n \times d}$. We further break $X$ into blocks, $X = [X^1, X^2, \ldots, X^{2s-1}, X^{2s}]$, where consecutive blocks $X^{2t-1}, X^{2t}$, $t \in [s]$, contain the same number of columns.

Definition 1 (Rich feature graph).

A graph $G$ with node features $X$ is a rich feature graph if, for some $s$, there exists a block structure $X=[X^1,\ldots,X^{2s}]$ so that for all $i,j \in V$, the vector $\left( (X^{1} (X^{2})^{\top})_{ij}, \ldots, (X^{2s-1} (X^{2s})^{\top})_{ij} \right) \in \mathbb{R}^{s}$ represents the isomorphism type of the pair $(i,j)$.

Note that in matrix notation the isomorphism types are encoded by a tensor $\mathcal{T} \in \mathbb{R}^{n \times n \times s}$ with $\mathcal{T}_{::t} = X^{2t-1}(X^{2t})^{\top}$. The isomorphism type of a pair $(i,j)$, which represents either an edge or a node of graph $G$, summarizes all the information this pair carries in graph $G$. More precisely, isomorphism type is an equivalence relation defined by: $(i_1,j_1)$ and $(i_2,j_2)$ have the same isomorphism type iff the following conditions hold: (i) $i_1=j_1 \Leftrightarrow i_2=j_2$; (ii) $x_{i_1}=x_{i_2}$ and $x_{j_1}=x_{j_2}$; and (iii) $(i_1,j_1)\in E \Leftrightarrow (i_2,j_2)\in E$.

Hence, rich feature graphs carry all their information in their node features. As it turns out, every graph can be represented as a rich feature graph. Indeed, let $A$ be the adjacency matrix of $G$, and $y_i$, $i \in [n]$, unique representatives of the nodes' features. Then, $B = A + \mathrm{diag}(y)$ represents the isomorphism types of pairs in $G$. Using the singular value decomposition (SVD) we can write $B = U\Sigma V^{\top} = (U\Sigma^{1/2})(V\Sigma^{1/2})^{\top}$, and $X = [U\Sigma^{1/2}, V\Sigma^{1/2}]$ is a rich feature representation for $G$. Note that in general the isomorphism type is represented as a tensor $\mathcal{T} \in \mathbb{R}^{n \times n \times s}$.
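The adjacency-plus-diagonal construction can be checked numerically; this numpy sketch (our own illustration, with unique diagonal values standing in for node-feature representatives) builds the rich features via SVD and verifies that the two feature blocks reproduce the pair-type matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.integers(0, 2, size=(n, n)).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # symmetric adjacency, zero diagonal
y = np.arange(1.0, n + 1)                    # unique per-node representatives
B = A + np.diag(y)                           # encodes the pair isomorphism types

U, S, Vt = np.linalg.svd(B)
# Rich features: concatenate the two square-root factors column-wise.
X = np.concatenate([U * np.sqrt(S), Vt.T * np.sqrt(S)], axis=1)

# The two blocks recover B as an outer product: X1 @ X2.T == B.
X1, X2 = X[:, :n], X[:, n:]
assert np.allclose(X1 @ X2.T, B)
```

Here the block pair $(X^1, X^2)$ plays exactly the role required by Definition 1 with $s = 1$.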
For rich feature graphs, existing GNNs are universal if they have a global attribute block. In particular, we prove that the GNN in (battaglia2018relational) is universal in this case (proof in the supplementary):

Proposition 1.

Graph neural networks can approximate an arbitrary continuous function over the class of rich feature graphs with node features in some compact set $\Omega \subset \mathbb{R}^{d}$.

4.2 2-FWL via a polynomial kernel

Towards the goal of showing algorithmic alignment of LRGA and 2-FWL on rich feature graphs, we start by formulating the 2-FWL algorithm with polynomial kernels. In the next section we use this formulation to make the algorithmic alignment claim.

2-FWL algorithm. 2-Folklore Weisfeiler-Lehman (2-FWL) (grohe2015pebble; grohe2017descriptive) is part of the WL hierarchy of polynomial-time iterative graph isomorphism algorithms, which recolor $k$-tuples of vertices at each step according to neighborhood aggregation. Upon reaching a stable coloring, the algorithm stops, and if the color histograms of two graphs differ, the graphs are deemed not isomorphic. The 2-FWL algorithm is equivalent to 3-WL and strictly stronger than vertex coloring.

Let $G$ be a colored graph with isomorphism types of pairs of vertices represented via a tensor $\mathcal{T} \in \mathbb{R}^{n \times n \times s}$; that is, $\mathcal{T}_{ij:} \in \mathbb{R}^{s}$ represents the isomorphism type of the pair $(i,j)$. $\mathcal{C}^{0} = \mathcal{T}$ is the initial coloring of the vertex pairs and is set as the input to the 2-FWL algorithm. $\mathcal{C}^{l}$ denotes the coloring after the $l$-th recoloring step. A recoloring step in the algorithm aggregates information from the multiset of neighborhood colors for each pair. We represent the multiset of neighborhood colors of the tuple $(i,j)$ with a matrix $Y_{ij} \in \mathbb{R}^{n \times 2s}$. That is, any permutation of the rows of $Y_{ij}$ represents the same multiset. The rows of $Y_{ij}$, which represent the elements in the multiset, are $\left[\mathcal{C}^{l}_{ik:}, \mathcal{C}^{l}_{kj:}\right]$, $k \in [n]$. See the inset for an illustration.

The 2-FWL update step of a pair $(i,j)$ from $\mathcal{C}^{l}$ to $\mathcal{C}^{l+1}$ is done by concatenating the previous pair color and an encoding of the multiset of neighborhood colors:

$$\mathcal{C}^{l+1}_{ij:} = \left[\,\mathcal{C}^{l}_{ij:},\ \mathrm{enc}\!\left(Y_{ij}\right)\right] \tag{4}$$
$\mathrm{enc}$ is the encoding function that is invariant to the row-order of its input and maps different multisets to different target vectors.
Multiset encoding. As shown in (maron2019provably), the multiset encoding function $\mathrm{enc}$ can be defined using the collection of Power-sum Multi-symmetric Polynomials (PMPs). That is, given a multiset $Y \in \mathbb{R}^{n \times 2s}$ the encoding is defined by

$$\mathrm{enc}(Y)_{\alpha} = \sum_{k=1}^{n} Y_{k:}^{\alpha} \tag{5}$$

where $Y_{k:}^{\alpha} = \prod_{t=1}^{2s} Y_{kt}^{\alpha_t}$, and $\alpha \in \mathbb{N}^{2s}$ is a multi-index with $|\alpha| \le n$.

Let us focus on computing a single output coordinate of the function $\mathrm{enc}$ applied to a particular multiset $Y_{ij}$. This can be efficiently computed using matrix multiplication (maron2019provably): Let $\alpha = (\beta, \gamma)$, where $\beta, \gamma \in \mathbb{N}^{s}$. Then,

$$\mathrm{enc}(Y_{ij})_{\alpha} = \sum_{k=1}^{n} \mathcal{C}_{ik:}^{\beta}\, \mathcal{C}_{kj:}^{\gamma} = \left(\mathcal{C}^{\beta} \cdot \mathcal{C}^{\gamma}\right)_{ij}$$

By $\mathcal{C}^{\beta}$ we mean that we apply the multi-power to the feature dimension, i.e., $(\mathcal{C}^{\beta})_{ij} = \prod_{t=1}^{s} \mathcal{C}_{ijt}^{\beta_t}$. This implies that computing the multiset encoding amounts to calculating the monomials $\mathcal{C}^{\beta}, \mathcal{C}^{\gamma}$ and their matrix multiplications $\mathcal{C}^{\beta} \cdot \mathcal{C}^{\gamma}$. Thus the 2-FWL update rule, equation 4, can be written in the following matrix form, where for notational simplicity we denote $C = \mathcal{C}^{l}$:

$$\mathcal{C}^{l+1} = \left[\,C,\ \left(C^{\beta} \cdot C^{\gamma}\right)_{|\beta| + |\gamma| \le n}\right] \tag{6}$$
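For the simplest case $s = 1$, the identity behind this matrix form — that one coordinate of the multiset encoding is an entrywise power followed by a single matrix product — can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
C = rng.random((n, n))   # current pair coloring (s = 1 feature, for simplicity)
beta, gamma = 2, 3       # one multi-power choice with |beta| + |gamma| <= n

# Brute force: for each pair (i, j), sum the multiset contributions over k.
brute = np.array([[sum(C[i, k] ** beta * C[k, j] ** gamma for k in range(n))
                   for j in range(n)] for i in range(n)])

# Matrix form: entrywise powers followed by one matrix multiplication.
fast = (C ** beta) @ (C ** gamma)

assert np.allclose(brute, fast)
```

The same identity holds coordinate-wise for general $s$, with the entrywise power replaced by the multi-power over the feature dimension.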
2-FWL via polynomial kernels. Next, we formulate 2-FWL using polynomial kernels. Let $X \in \mathbb{R}^{n \times d}$ be the node feature matrix at iteration $l$ of the algorithm, and $C$ the colors it defines on the vertex pairs via the block structure $X = [X^1, \ldots, X^{2s}]$. We show it is possible to compute $\mathcal{C}^{l+1}$ directly from $X$ using polynomial feature maps. Indeed,

$$\left(C^{\beta} \cdot C^{\gamma}\right)_{ij} = \sum_{k=1}^{n} C_{ik:}^{\beta}\, C_{kj:}^{\gamma} = \sum_{k=1}^{n} \prod_{t=1}^{s} \langle x_i^{2t-1}, x_k^{2t} \rangle^{\beta_t} \prod_{t=1}^{s} \langle x_k^{2t-1}, x_j^{2t} \rangle^{\gamma_t} = \sum_{k=1}^{n} \langle \phi_{\beta}(x_i), \psi_{\beta}(x_k) \rangle\, \langle \psi_{\gamma}(x_k), \phi_{\gamma}(x_j) \rangle \tag{7}$$

where the second equality uses the feature maps of the (homogeneous) polynomial kernels (Vapnik1998), $\langle x, y \rangle^{c} = \langle \phi_c(x), \phi_c(y) \rangle$; the third equality reformulates the feature maps on the vectors $x_i$, $x_j$ and $x_k$; and the last equality is due to the closure of kernels under multiplication. We denote the final feature maps by $\phi_{\beta}, \psi_{\beta}, \phi_{\gamma}, \psi_{\gamma}$. Now, let $\Phi_{\beta}(X), \Psi_{\beta}(X), \Phi_{\gamma}(X), \Psi_{\gamma}(X)$ denote applying the corresponding feature map to every row of $X$; then we have:

$$C^{\beta} \cdot C^{\gamma} = \Phi_{\beta}(X)\left(\Psi_{\beta}(X)^{\top} \Psi_{\gamma}(X)\right) \Phi_{\gamma}(X)^{\top}$$

Therefore, $\mathcal{C}^{l+1}$ can be written directly as a function of the node features $X$ using the feature maps:

$$X^{l+1} = \left[\,X,\ \Phi_{\beta}(X)\left(\Psi_{\beta}(X)^{\top} \Psi_{\gamma}(X)\right),\ \Phi_{\gamma}(X)\right] \tag{8}$$
4.3 Algorithmic alignment with 2-FWL

Our goal is to show that LRGA is algorithmically aligned with 2-FWL over rich feature graphs, providing some justification for its improved generalization properties. We consider the notion of algorithmic alignment as introduced in (Xu2019algoalign), which intuitively means that the neural network can efficiently simulate the algorithm via simple modules. We first show how LRGA can implement 2-FWL efficiently, where each MLP needs to approximate a simple polynomial (i.e., monomial) feature map. Then, we employ a result that polynomials have bounded sample complexity to show that each learnable module of LRGA can provably learn the 2-FWL update rule when trained with a two-layer MLP and gradient descent.

First, we show that LRGA (equation 2) can implement a single multi-power of the 2-FWL update rule in equation 6. To implement all multi-powers one would require a multi-head LRGA. However, this would be required only if all feature vectors in $\mathcal{C}^{l+1}$ should be separated; we found that in practice a single head is sufficient. The single-head 2-FWL update rule is $\mathcal{C}^{l+1} = [C, C^{\beta} \cdot C^{\gamma}]$. Using equation 8 we can write this rule over the input node features $X$:

$$X^{l+1} = \left[\,X,\ \Phi_{\beta}(X)\left(\Psi_{\beta}(X)^{\top} \Psi_{\gamma}(X)\right),\ \Phi_{\gamma}(X)\right]$$

It can be readily checked that the updated node features $X^{l+1}$ indeed define the updated colors with a single head. To finish the argument, note that this update equation has the same form as the LRGA module. Therefore, using the universal approximation theorem (hornik1989multilayer), we can take MLPs so that $m_1 \approx \Phi_{\beta}$, $m_2 \approx \Psi_{\beta}$, $m_3 \approx \Psi_{\gamma}$, and $m_4 \approx \Phi_{\gamma}$, all over some compact feature domain $\Omega$. The resulting LRGA module will approximate, to an arbitrary precision, a single-head 2-FWL step. Note that the normalization $\eta$ in equation 2 is a multiplication by a scalar and therefore has no influence on the colors (except if it is zero, which is assumed not to be the case). We showed:

Theorem 1.

The LRGA module in equation 2 can simulate a single-head 2-FWL update rule under the rich feature graph assumption.

Bound on LRGA sample complexity. We conclude this section by proving that the learnable modules in LRGA, namely the MLPs $m_i$, can provably learn the feature maps $\Phi$. Let us denote by $\phi$ one of these feature maps. As we show in the supplementary, all the outputs of $\phi$ consist of monomials $x^{\alpha}$, where $x \in \mathbb{R}^{d}$ and $\alpha \in \mathbb{N}^{d}$, $|\alpha| \le n$, so that $\phi : \mathbb{R}^{d} \to \mathbb{R}^{D}$, where $D$ is the dimension of the space of all $d$-variate polynomials of degree at most $n$. We will consider a single output coordinate of $\phi$, namely a monomial $x^{\alpha}$, noting that generalization to the vector output case can be done using union bounds as in Theorem 3.5 in (Xu2019algoalign).

Corollary 6.2 in (arora2019fine) provides a bound on the sample complexity, denoted $\mathcal{N}(g, \epsilon, \delta)$, of a polynomial of the form

$$g(x) = \sum_{j} a_j \left(b_j^{\top} x\right)^{p_j}$$

where $p_j = 1$ or $p_j$ is even, $a_j \in \mathbb{R}$, $b_j \in \mathbb{R}^{d}$; $\epsilon, \delta$ are the relevant PAC learning constants, and the learner is an over-parameterized, randomly initialized two-layer MLP trained with gradient descent. It is not immediately clear, however, how to use this theorem to learn an arbitrary monomial $x^{\alpha}$, since $g$ has the above particular form. Nevertheless, we show how it can be generalized to this case.

Let $W \subset \mathbb{R}^{d}$ be a set of interpolation points, one per multi-index $\beta$ with $|\beta| = n$, and note that there are $\binom{n+d-1}{d-1}$ elements in $W$. We assume some fixed ordering in $W$ is prescribed. Define the sample matrix (multivariate Vandermonde) $V$ by $V_{w\beta} = w^{\beta}$. Lemma 2.8 in (wendland2004scattered) implies that $W$ can be chosen so that $V$ is non-singular. Let $\lambda = \|V^{-1}\|_{\infty}$ (i.e., the induced matrix norm); note that $\lambda$ is dependent only upon $W$.

Lemma 1.

Fix a multi-index $\alpha$ with $|\alpha| = n$. Then, there exist coefficients $c_w$, $w \in W$, so that $x^{\alpha} = \sum_{w \in W} c_w \left(w^{\top} x\right)^{n}$ for all $x \in \mathbb{R}^{d}$.

The lemma is proven in the supplementary. We can use this lemma in the following way: Assume $n$ is even, or otherwise consider $n+1$ in its place. Further assume that the MLP is two-layer and over-parameterized, operating on inputs of the form $[x, 1]$ (i.e., we assume there is a constant plugged into an extra coordinate). We consider training with random initialization and gradient descent using data $(x_i, x_i^{\alpha})$, where $x_i$ is sampled i.i.d. from some distribution over $\Omega$.
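Lemma 1 can be verified numerically in a small case; the sketch below (with $d = 2$, $n = 2$, and hand-picked interpolation points, all our own illustrative choices) solves the scaled Vandermonde system for the monomial $x_1 x_2$:

```python
import numpy as np
from math import factorial

d, deg = 2, 2
betas = [(2, 0), (1, 1), (0, 2)]                     # multi-indices of degree 2
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # one point per multi-index

def multinom(c, b):
    """Multinomial coefficient c! / (b1! * b2!)."""
    return factorial(c) // (factorial(b[0]) * factorial(b[1]))

# System matrix: A[r, c] = multinomial(beta_r) * W_c ** beta_r,
# i.e. the multivariate Vandermonde scaled by the multinomial coefficients.
A = np.array([[multinom(deg, b) * (w[0] ** b[0]) * (w[1] ** b[1]) for w in W]
              for b in betas])
target = np.array([0.0, 1.0, 0.0])                   # coefficient vector of x1*x2
coeffs = np.linalg.solve(A, target)

# Check x1 * x2 == sum_w c_w (w^T x)^deg at an arbitrary point.
x = np.array([0.7, -1.3])
approx = sum(c * (w @ x) ** deg for c, w in zip(coeffs, W))
assert np.isclose(approx, x[0] * x[1])
```

Here the recovered combination is $x_1 x_2 = -\tfrac{1}{2}x_1^2 - \tfrac{1}{2}x_2^2 + \tfrac{1}{2}(x_1+x_2)^2$, a sum of even powers of linear forms, which is exactly the shape required by Corollary 6.2.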

Let $g : \mathbb{R}^{d+1} \to \mathbb{R}$ be defined as $g(z) = \sum_{w \in W} c_w \left([w, 1]^{\top} z\right)^{n}$, where the $c_w$ are as promised by Lemma 1. Then, the learning setup described above is equivalent to training the MLP using data of the form $([x_i, 1], g([x_i, 1]))$, where $x_i$ is sampled i.i.d. from a distribution over $\Omega$, i.e., a distribution over $\mathbb{R}^{d+1}$ concentrated on the hyperplane $z_{d+1} = 1$. Now, using Corollary 6.2 from (arora2019fine) proves that $g$, and hence the monomial $x^{\alpha}$, is learnable by the MLP. The sample complexity can be bounded in this case by

$$\mathcal{N} = O\!\left(\frac{\left(n\, \lambda\, |W| \max_{w \in W} \|[w,1]\|_2^{\,n}\right)^{2} + \log(1/\delta)}{\epsilon^{2}}\right)$$

where $\lambda$ and $W$ are as above (derivation in the supplementary). The asymptotic behaviour of $\lambda$ is out of scope for this paper, but in any case $\mathcal{N}$ grows exponentially with the size of the graph $n$. We can say, however, that for a fixed graph size $n$ and feature dimension $d$, $\mathcal{N}$ can be considered as a (very large) constant.
Discussion. The LRGA module is shown to be theoretically powerful when restricted to rich feature graphs and a large rank parameter $\kappa$. In practice, the edge structure is only partially manifested in the node features, and $\kappa$ is kept low for computational efficiency. For these reasons LRGA complements GNNs, which in turn transfer edge information to the node representation.

5 Experiments

We evaluated our method on various tasks, including graph regression, graph classification, node classification and link prediction. The datasets we used are from two benchmarks: (i) Benchmarking GNNs (dwivedi2020benchmarking); and (ii) the Open Graph Benchmark (OGB) (Hu2020). Each benchmark has its own evaluation protocol designed for a fair comparison among different models. These protocols define consistent splits of the data into train/val/test sets, set a budget on the size of the evaluated models, define a stopping criterion for reporting test results, and require training with several different initializations to measure the stability of the results. We follow these protocols.
Implementation details of LRGA. We implemented the LRGA module according to the description in Section 3 (see equations 2, 3) using the PyTorch framework and the DGL (wang2019dgl) and PyTorch Geometric (fey2019fast) libraries. Each LRGA module contains the MLPs $m_1, \ldots, m_4$. Each $m_i$ is a single-layer MLP (linear with ReLU activation) mapping the feature dimension to the rank $\kappa$. A full implementation of a layer follows equation 1, where we added another single-layer MLP after the concatenation for the purpose of reducing the feature dimension size. It should be noted that in the OGB benchmark datasets we did not use the skip connections, for better performance. In addition, as advised in (wang2019dgl), we used batch and graph normalization at the end of each layer. For the CIFAR10 and MNIST classification tasks we used dropout.
Baselines. We compare performance with the following state-of-the-art baselines: MLP, GCN (Kipf2016), GraphSAGE (Hamilton2017), GIN (xu2018how), DiffPool (ying2018hierarchical), GAT (Velickovic2018), MoNet (monti2017geometric), GatedGCN (Bresson2017Gated), Node2Vec (grover2016node2vec) and Matrix Factorization (Hu2020), where a distinct embedding is assigned to each node and is learned end-to-end together with an MLP predictor.

5.1 Benchmarking Graph Neural Networks (dwivedi2020benchmarking)

Datasets. This benchmark contains the following datasets (a full description is found in the supplementary): (i) ZINC, a graph regression task on a molecular dataset evaluated with the MAE metric; (ii) MNIST and CIFAR10, the image classification problems converted to graph classification using a super-pixel representation (Knyazev_superpixel); (iii) CLUSTER and PATTERN, node classification tasks which aim to classify embedded node structures (abbe2017community); (iv) TSP, a link prediction variation of the Traveling Salesman Problem (joshi2019efficient) on 2D Euclidean graphs. All tasks are fully supervised learning tasks.

Evaluation protocol. All models are restricted to a fixed parameter budget and number of layers. The learning rate and its decay are set according to a predetermined scheduler using the validation loss. The stopping criterion is reached when the learning rate drops below a specified threshold. All results are averaged over a set of predetermined fixed seeds, and the standard deviation is reported as well. The data splits are as specified in the benchmark.
| Model | CLUSTER #Param / Acc ± std | PATTERN #Param / Acc ± std | CIFAR10 #Param / Acc ± std | MNIST #Param / Acc ± std | TSP #Param / F1 ± std |
|---|---|---|---|---|---|
| MLP | 104305 / 20.97 ± 0.01 | 103629 / 50.13 ± 0.00 | 106017 / 56.78 ± 0.12 | 105717 / 95.18 ± 0.18 | 94394 / 0.548 ± 0.003 |
| GCN | 101655 / 47.82 ± 4.91 | 100923 / 74.36 ± 1.59 | 101657 / 54.46 ± 0.10 | 101365 / 89.99 ± 0.15 | 108738 / 0.627 ± 0.003 |
| GraphSAGE | 99139 / 53.90 ± 4.12 | 98607 / 81.25 ± 3.84 | 102907 / 66.08 ± 0.24 | 102691 / 97.20 ± 0.17 | 98450 / 0.663 ± 0.003 |
| GIN | 103544 / 52.54 ± 1.03 | 100884 / 98.25 ± 0.38 | 105654 / 53.28 ± 3.70 | 105434 / 93.96 ± 1.30 | 118574 / 0.657 ± 0.001 |
| DiffPool | - | - | 108042 / 57.99 ± 0.45 | 106538 / 95.02 ± 0.42 | - |
| GAT | 110700 / 54.12 ± 1.21 | 109936 / 90.72 ± 2.04 | 110704 / 65.48 ± 0.33 | 110400 / 95.62 ± 0.13 | 109250 / 0.669 ± 0.001 |
| MoNet | 104227 / 45.95 ± 3.39 | 103775 / 97.89 ± 0.89 | 104229 / 53.42 ± 0.43 | 104049 / 90.36 ± 0.47 | 94274 / 0.637 ± 0.01 |
| GatedGCN | 104355 / 54.20 ± 3.58 | 104003 / 97.24 ± 1.19 | 104357 / 69.37 ± 0.48 | 104217 / 97.47 ± 0.13 | 94946 / 0.802 ± 0.001 |
| LRGA + GatedGCN | 93482 / 62.11 ± 3.47 | 104663 / 98.68 ± 0.16 | 93485 / 70.65 ± 0.18 | 93395 / 98.20 ± 0.03 | 103347 / 0.798 ± 0.001 |

Table 1: Performance on the benchmark GNN datasets.

Results. Tables 1 and 2 (left) summarize the results of training and evaluating our model according to the evaluation protocol; LRGA combined with GatedGCN achieves state-of-the-art performance on most of the datasets in the benchmark. In order to obey the parameter budget when LRGA is combined with GatedGCN, we reduce the width of the GatedGCN layers. While improving SOTA for CLUSTER, PATTERN, CIFAR10, and MNIST, we found that LRGA did not improve GatedGCN on TSP and ZINC. In order to see if LRGA can improve GatedGCN with a higher parameter budget, we enlarged the parameter budget and evaluated all models with this increased budget on the ZINC dataset. As seen in Table 2 (left), our model improved SOTA by a large margin in this case. We further explored the contribution of LRGA to other GNN architectures; see Table 2 (right). All models in the table were evaluated with the same augmented LRGA module size and in two versions of their own size: the original setting as it appears in the benchmark versus a reduced model that fits the parameter budget. Observing Table 2 (right, compared to left), we see that LRGA improved all the GNNs considerably when augmented to GNNs without the budget restriction (even compared to larger-parameter versions of the GNNs), while improving GCN, GAT, and GIN in the reduced setting.

| Model | ZINC #Param | ZINC MAE ± std | ZINC (large) #Param | ZINC (large) MAE ± std |
|---|---|---|---|---|
| MLP | 106970 | 0.681 ± 0.005 | 2289351 | 0.7035 ± 0.003 |
| GCN | 103077 | 0.469 ± 0.002 | 2189531 | 0.479 ± 0.007 |
| GraphSage | 105031 | 0.410 ± 0.005 | 2176751 | 0.439 ± 0.006 |
| GIN | 103079 | 0.408 ± 0.008 | 2028509 | 0.382 ± 0.008 |
| DiffPool | 110561 | 0.466 ± 0.006 | 2291521 | 0.448 ± 0.005 |
| GAT | 102385 | 0.463 ± 0.002 | 2080881 | 0.471 ± 0.005 |
| MoNet | 106002 | 0.407 ± 0.007 | 2244343 | 0.372 ± 0.01 |
| GatedGCN | 105875 | 0.363 ± 0.009 | 2134081 | 0.338 ± 0.003 |
| LRGA + GatedGCN | 94457 | 0.367 ± 0.008 | 1989730 | 0.285 ± 0.01 |

| Model | MAE ± std (model size i) | MAE ± std (model size ii) |
|---|---|---|
| LRGA + GCN | 0.457 ± 0.004 | 0.433 ± 0.008 |
| LRGA + GAT | 0.438 ± 0.007 | 0.432 ± 0.016 |
| LRGA + GIN | 0.363 ± 0.010 | 0.355 ± 0.032 |

Table 2: Results on the ZINC dataset.

| Model | ogbl-ppa Hits@100 ± std | ogbl-collab Hits@10 ± std |
|---|---|---|
| Matrix Factorization | 0.3229 ± 0.0094 | 0.3805 ± 0.0018 |
| Node2Vec | 0.2226 ± 0.0083 | 0.4281 ± 0.0140 |
| GCN | 0.1155 ± 0.0153 | 0.3329 ± 0.0190 |
| GraphSAGE | 0.1063 ± 0.0244 | 0.3121 ± 0.0620 |
| LRGA + GCN | 0.2988 ± 0.0211 | 0.4363 ± 0.0121 |
| LRGA + GCN (large) | 0.3426 ± 0.016 | 0.4541 ± 0.0091 |

5.2 Link prediction datasets from the OGB benchmark (Hu2020)


We further evaluate LRGA on semi-supervised learning tasks, including graphs with hundreds of thousands of nodes, from the OGB benchmark: (i) ogbl-ppa, a graph of proteins with biological connections as edges; (ii) ogbl-collab, an author collaboration graph. The evaluation metric for both tasks is Hits@K; more details are in the supplementary.

Evaluation protocol. All models use the same hidden layer size and number of layers. Test results are reported at the best validation epoch, averaged over several random seeds.
Results. The inset table summarizes the results on ogbl-ppa and ogbl-collab. It should be noted that the first two rows correspond to node embedding methods, while the rest are GNNs. Augmenting GCN with LRGA achieves a major improvement on these datasets. Larger versions of LRGA + GCN achieve SOTA results on these datasets, while still using fewer parameters than node embedding methods such as Matrix Factorization, which uses more than 60M parameters to achieve distinct embeddings for each node. For comparison, our large model uses around 1M parameters and achieves superior results.

Figure 1: Ablation study on CLUSTER dataset.

5.3 Ablation Study

We investigated the effect of the attention rank $\kappa$ on the performance of LRGA on the CLUSTER dataset, which contains graphs of varying sizes. Our experimental setting included fixing the GNN's hidden dimension size and varying $\kappa$. Figure 1 shows that accuracy increases with the rank until it reaches a plateau, a fact that could be attributed to saturating the expressiveness of the LRGA module. Moreover, the maximal accuracy is achieved at a $\kappa$ value that corresponds to the maximal graph size in the dataset, smaller than what the theory predicts as a function of the graph size $n$.

6 Conclusions

In this work, we introduced the LRGA module which, to the best of our knowledge, is the first application of global attention to graph learning. We provide a theoretical analysis justifying LRGA from the point of view of improved generalization rather than expressive power. Since we show that GNNs are universal for a certain, reasonable class of graphs, we suggest a module that algorithmically aligns with 2-FWL, a more powerful algorithm than the one bounding the expressive power of GNNs. The algorithmic alignment is shown by formulating 2-FWL with simple polynomial modules and then showing that the LRGA module can simulate them. Interesting future work is to incorporate the graph structural information directly into the global attention module.

This research was supported in part by the European Research Council (ERC Consolidator Grant, "LiftMatch" 771136) and the Israel Science Foundation (Grant No. 1830/17).


Appendix A Proof of Proposition 1


Every graph function can be formulated as a function of the isomorphism type tensor $\mathcal{T}$, and we will approximate such arbitrary continuous functions with a GNN. Let $f$ be a continuous invariant graph function (i.e., agnostic to the ordering of the graph nodes) defined over the isomorphism type tensors $\mathcal{T}$. Define $g(X) = f(\mathcal{T})$, where $\mathcal{T}$ is defined from $X$ as in Definition 1. $g$ is an invariant set function, since it is a composition of invariant and equivariant functions (see, e.g., (Maron2019) for a definition of equivariance); it is also continuous as a composition of continuous functions. Hence $g$ can be approximated over $\Omega$ using DeepSets (zaheer2017deep) (due to DeepSets universality). Since the GNN in (battaglia2018relational) includes DeepSets as a particular case, it can approximate $g$ as well. ∎

Appendix B 2-FWL via polynomial kernels

In this section, we give a full characterization of the feature maps of the final polynomial kernel we use to formulate the 2-FWL algorithm. A key tool for the derivation is the multinomial theorem, which we state here in a slightly different form to fit our setting.

Multinomial theorem. Let us define a set of variables $z_t = x_t y_t$, $t \in [q]$, composed of products of corresponding $x$ and $y$ coordinates. Then,

$$\left(\sum_{t=1}^{q} z_t\right)^{c} = \sum_{|\mu| = c} \binom{c}{\mu} z^{\mu}$$

where $\binom{c}{\mu} = \frac{c!}{\mu_1! \cdots \mu_q!}$, and the notation $z^{\mu} = \prod_{t=1}^{q} z_t^{\mu_t}$. The sum is over all possible multi-indices $\mu \in \mathbb{N}^{q}$ which sum to $c$, in total $\binom{c+q-1}{q-1}$ elements.

Recall that we wish to compute $\left(C^{\beta} \cdot C^{\gamma}\right)_{ij}$ as in equation 7 in the paper. We will now follow the equalities in equation 7 to derive the final feature map. The second equality uses the feature maps of the (homogeneous) polynomial kernels (Vapnik1998), $\langle x, y \rangle^{c} = \langle \phi_c(x), \phi_c(y) \rangle$, which can be derived from the multinomial theorem.

Suppose the dimensions of $x, y$ are $q$. Then, $\phi_c(x)$ consists of monomials of degree $c$ of the form $\sqrt{\binom{c}{\mu}}\, x^{\mu}$, $|\mu| = c$. In total the size of the feature map is $\binom{c+q-1}{q-1}$.

The third equality reformulates the feature maps on the corresponding vectors.

The last equality is due to the closure of kernels under multiplication. The final feature map, corresponding to the product kernel, is composed of all possible products of elements of the individual feature maps, i.e.,

where , and for all . The size of the final feature map is where .
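Closure under multiplication has a concrete feature-map counterpart: the product kernel's feature map is the tensor (Kronecker) product of the individual maps, i.e., all pairwise products of coordinates. A toy check with two linear feature maps (the matrix A and the maps here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

phi1 = lambda x: x       # feature map of k1(x, y) = x . y
phi2 = lambda x: A @ x   # feature map of k2(x, y) = (A x) . (A y)

def product_feature_map(x):
    """Feature map of the product kernel k1 * k2: all pairwise products
    of phi1 and phi2 coordinates (a Kronecker / tensor product)."""
    return np.kron(phi1(x), phi2(x))

x = rng.standard_normal(3)
y = rng.standard_normal(3)
k1 = phi1(x) @ phi1(y)
k2 = phi2(x) @ phi2(y)
k_prod = product_feature_map(x) @ product_feature_map(y)
print(np.isclose(k_prod, k1 * k2))  # True
```

This relies on the identity kron(a, b) · kron(c, d) = (a · c)(b · d), which is exactly why the product of two kernels is again a kernel.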

Appendix C Bound on LRGA sample complexity

c.1 Proof of Lemma 1


Using the multinomial theorem we have: , where the coefficients are positive multinomial coefficients. This equation defines a linear relation between the monomial basis and the power functions. The matrix of this system is multiplied by a positive diagonal matrix with the multinomial coefficients on its diagonal. By inverting this matrix and solving the system, the lemma is proved. ∎
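The linear system in the proof can be illustrated in the smallest nontrivial case: recovering the degree-2 monomial x₁x₂ as a combination of power functions (b · x)². This is a hedged toy example — the choice of directions is illustrative, and it reproduces the familiar identity x₁x₂ = ((x₁+x₂)² − x₁² − x₂²)/2:

```python
import numpy as np

# Degree-2, two variables: express the monomial x1*x2 as a linear
# combination of power functions (b . x)**2 for directions b.
directions = [np.array([1.0, 0.0]),
              np.array([0.0, 1.0]),
              np.array([1.0, 1.0])]

def power_coeffs(b):
    # (b1 x1 + b2 x2)^2 = b1^2 x1^2 + 2 b1 b2 x1 x2 + b2^2 x2^2
    # -> coefficients in the basis (x1^2, x1*x2, x2^2)
    return np.array([b[0] ** 2, 2 * b[0] * b[1], b[1] ** 2])

M = np.stack([power_coeffs(b) for b in directions], axis=1)  # 3 x 3 system
target = np.array([0.0, 1.0, 0.0])                           # the monomial x1*x2
c = np.linalg.solve(M, target)
print(c)  # [-0.5 -0.5  0.5], i.e. x1*x2 = ((x1+x2)^2 - x1^2 - x2^2) / 2

# numeric check at an arbitrary point
x = np.array([0.7, -1.3])
lhs = x[0] * x[1]
rhs = sum(ci * float(b @ x) ** 2 for ci, b in zip(c, directions))
print(np.isclose(lhs, rhs))  # True
```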

c.2 Derivation of sample complexity bound

Corollary 6.2 in (arora2019fine) provides a bound on the sample complexity of learning a polynomial of the form


where the constants are the relevant PAC learning constants, and the learner is an over-parameterized, randomly initialized two-layer MLP trained with gradient descent:

In our case the target function is of this form, and by Lemma 1 there exist coefficients such that the required expansion holds. The sample complexity bound given by Corollary 6.2 is therefore:

Let us bound the first term in the numerator of the sample complexity expression:

The first inequality is immediate; the second is by Lemma 1, absorbing the constants into the main term. From the above, the bound follows.

Appendix D Implementation Details

In this section we describe the datasets on which we performed our evaluation. In addition, we specify the hyperparameters for the experiments section of the paper. The rest of the model configurations are determined directly by the evaluation protocols defined by the benchmarks. It is worth noting that most of our experiments ran on a single Tesla V100 GPU, if not stated otherwise. We restricted our parameter search to two hyperparameters (except for CIFAR10 and MNIST, where we additionally searched over different dropout values), since the rest of the parameters were dictated by the evaluation protocol. The model sizes were restricted by the allowed parameter budget.

Dataset #Graphs #Nodes Avg. Nodes Avg. Edges #Classes
ZINC 12K 9-37 23.16 49.83 -
CLUSTER 12K 40-190 117.20 4301.72 6
PATTERN 14K 50-180 117.47 4749.15 2
MNIST 70K 40-75 70.57 564.53 10
CIFAR10 60K 85-150 117.63 941.07 10
TSP 12K 50-500 275.76 6894.04 2
ogbl-ppa 1 576,289 - 30,326,273 -
ogbl-collab 1 235,868 - 1,285,465 -
Table 3: Summary of the benchmarking GNN and OGB link prediction datasets

d.1 Benchmarking Graph Neural Networks (dwivedi2020benchmarking)


This benchmark contains six main datasets:

  1. ZINC, a molecular graph dataset with a graph regression task, where each node represents an atom and each edge represents a bond. The regression target is a property known as the constrained solubility (with mean absolute error as the evaluation metric). Additionally, the node features represent the atom's type and the edge features represent the type of bond. The hyperparameter range used in our search was and . For the reported results we used , and the average time for a single epoch (whole training) was seconds ( minutes).

  2. MNIST and CIFAR10, the well-known image classification problems converted to graph classification tasks using a superpixel representation (Knyazev_superpixel), which represents small regions of homogeneous intensity as nodes. The edges of the graph are obtained by applying the k-nearest neighbors algorithm to the node coordinates. Node features are a concatenation of the superpixel intensity (RGB for CIFAR10, greyscale for MNIST) and its image coordinates. Edge features are the k-nearest distances. For the CIFAR10 and MNIST datasets our search range was ,  and . The chosen hyperparameters for the CIFAR10 dataset were , with an additional dropout of . The average time for a single epoch (whole training) is seconds ( hours). We used the same hyperparameters for the MNIST dataset, except the dropout, which was changed to . Average time per epoch (whole training) is seconds ( hours).

  3. CLUSTER and PATTERN, node classification tasks which aim to identify embedded node structures in stochastic block model graphs (abbe2017community). The goal of the task is to assign each node to the stochastic block it originated from, while the structure of the graph is governed by two probabilities that define the intra-block and cross-block edges. A single representative from each block is assigned an initial feature that indicates its block, while the rest of the nodes have no features. We searched hyperparameters over the range and . The hyperparameters for the CLUSTER dataset were . Average time per epoch (whole training) is seconds ( hours). For the PATTERN dataset we used . Average running time per epoch (whole training) is seconds ( hours), on a single Tesla P100.

  4. TSP, a link prediction task derived from the NP-hard classical Traveling Salesman Problem (joshi2019efficient). Given a 2D Euclidean graph, the goal is to choose the edges that participate in the minimal-edge-weight tour of the graph. The evaluation metric for the task is the F1 score of the positive class. Our hyperparameter search was in the range and ; the results shown in the paper use , and the average running time per epoch (whole training) is seconds ( hours), on a single Tesla P100.
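The superpixel-graph construction described in item 2 — k-NN edges over node coordinates, with distances as edge features — can be sketched as follows. This is an illustrative reimplementation, not the benchmark's exact preprocessing:

```python
import numpy as np

def knn_graph(coords, k):
    """Build directed k-NN edges from 2D node coordinates.

    Returns (edges, dists): for every node i, an edge to each of its k
    nearest neighbors (excluding itself), with the Euclidean distance
    standing in for the benchmark's edge feature.
    """
    n = coords.shape[0]
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))       # n x n pairwise distances
    np.fill_diagonal(D, np.inf)            # exclude self-loops
    nbrs = np.argsort(D, axis=1)[:, :k]    # k nearest per node
    edges = [(i, int(j)) for i in range(n) for j in nbrs[i]]
    dists = np.array([D[i, j] for i, j in edges])
    return edges, dists

rng = np.random.default_rng(0)
coords = rng.random((10, 2))   # toy "superpixel" centers
edges, dists = knn_graph(coords, k=3)
print(len(edges))  # 10 * 3 = 30 directed edges
```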
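The TSP evaluation metric, F1 of the positive class, can be computed directly from the true/false positive counts. A minimal sketch with toy labels:

```python
def positive_f1(y_true, y_pred):
    """F1 score of the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(positive_f1(y_true, y_pred))  # precision 2/3, recall 2/3 -> F1 = 2/3
```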

d.2 Link prediction datasets from the OGB benchmark (Hu2020)


In order to provide a more complete evaluation of our model, we also evaluate it on semi-supervised link prediction tasks. We searched over the same hyperparameter range and used the same values in both tasks. The two datasets were:

  1. ogbl-ppa, an undirected, unweighted graph. Nodes represent types of proteins and edges signify biological connections between proteins. The initial node feature is a 58-dimensional one-hot vector that indicates the protein's species of origin. The learning task is to predict new connections between nodes. The train/validation/test split sizes are M/M/M. The evaluation metric is Hits@K (Hu2020). Average running time was 4.5 minutes per epoch and 1.5 hours for the whole training.

  2. ogbl-collab, a graph that represents a network of collaborations between authors. Every author in the network is represented by a node and each collaboration by an edge. Initial node features are obtained by combining word embeddings of papers by that author (a 128-dimensional vector). Additionally, each collaboration is annotated with the year of collaboration and the number of collaborations in that year as an edge weight. The train/validation/test split sizes are M/K/K. As in the previous dataset, the evaluation metric is Hits@K. Average running time was 5.22 seconds per epoch and 17.4 minutes for the whole training.
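The Hits@K metric used by both OGB datasets can be sketched as follows: take the K-th highest negative score as a threshold and count the fraction of positive edges scored strictly above it. The toy scores below are assumptions for illustration:

```python
import numpy as np

def hits_at_k(pos_scores, neg_scores, k):
    """Hits@K: fraction of positive edges ranked above the K-th best negative.

    Follows the OGB link-prediction convention: the K-th highest negative
    score is the threshold, and positives strictly above it count as hits.
    """
    if len(neg_scores) < k:
        return 1.0
    threshold = np.sort(neg_scores)[-k]
    return float(np.mean(pos_scores > threshold))

pos = np.array([0.9, 0.8, 0.3, 0.6])
neg = np.array([0.7, 0.5, 0.4, 0.2, 0.1])
print(hits_at_k(pos, neg, k=2))  # threshold 0.5 -> 3 of 4 positives above = 0.75
```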