
Permutohedral-GCN: Graph Convolutional Networks with Global Attention

by Hesham Mostafa, et al.

Graph convolutional networks (GCNs) update a node's feature vector by aggregating features from its neighbors in the graph. This ignores potentially useful contributions from distant nodes. Identifying such useful distant contributions is challenging due to scalability issues (too many nodes can potentially contribute) and oversmoothing (aggregating features from too many nodes risks swamping out relevant information and may result in nodes having different labels but indistinguishable features). We introduce a global attention mechanism where a node can selectively attend to, and aggregate features from, any other node in the graph. The attention coefficients depend on the Euclidean distance between learnable node embeddings, and we show that the resulting attention-based global aggregation scheme is analogous to high-dimensional Gaussian filtering. This makes it possible to use efficient approximate Gaussian filtering techniques to implement our attention-based global aggregation scheme. By employing an approximate filtering method based on the permutohedral lattice, the time complexity of our proposed global aggregation scheme only grows linearly with the number of nodes. The resulting GCNs, which we term permutohedral-GCNs, are differentiable and trained end-to-end, and they achieve state of the art performance on several node classification benchmarks.


page 1

page 2

page 3

page 4


1 Introduction

Graph convolutional networks (GCNs) and their variants achieve excellent performance in transductive node classification tasks (Defferrard et al., 2016; Kipf & Welling, 2016). GCNs use a message passing scheme where each node aggregates messages/features from its neighboring nodes in order to update its own feature vector. After $K$ message passing rounds, a node is able to aggregate information from nodes up to $K$ hops away in the graph. This message passing scheme has three shortcomings: 1) a node does not consider information from nodes that are more than $K$ hops away, 2) messages from individual nodes can get swamped by messages from other nodes, making it hard to isolate important contributions from individual nodes, 3) node features can become indistinguishable as their aggregation ranges overlap, which severely degrades the performance of downstream tasks when the indistinguishable nodes belong to different classes. This over-smoothing phenomenon (Chen et al., 2019) is observed even for small values of $K$ (Li et al., 2018).

We propose a global attention-based aggregation scheme that simultaneously addresses the issues above. We modify and extend graph attention networks (GATs) (Veličković et al., 2017) to enable the attention mechanism to operate globally, i.e., a node can aggregate features from any other node in the graph weighted by an attention coefficient. This sidesteps the inherently limited aggregation range of GCNs, and allows a node to aggregate features from distant nodes without having to aggregate features from all nodes in between. Two nearby nodes in the graph can effectively aggregate information from two very different sets of nodes, alleviating the issue of over-smoothing as information does not have to come solely from their two overlapping neighborhoods.

Naive global pair-wise attention is not scalable though, as it would have to consider all pairs of nodes, and computational overhead would grow quadratically with the number of nodes. To address this, we formulate a new attention mechanism where the strength of pair-wise attention depends on the Euclidean distance between learnable node embeddings. Attention-weighted global aggregation is then analogous to a Gaussian filtering operation for which approximate techniques with linear complexity exist. In particular, we use approximate filtering based on the permutohedral lattice (Adams et al., 2010). The lattice-based filtering scheme is differentiable and error backpropagates not only through the node features, but also through the node embeddings. The resulting networks, which we call permutohedral-GCNs (PH-GCNs), are thus fully differentiable.

In PH-GCNs, node embeddings are directly learned based on the error signal of the training task (node classification task in our case). Since the attention weights between nodes increase as they move closer in the embedding space, the training error signal will move two nodes closer in the embedding space if the training loss would improve by having them integrate more information from each other. This is in contrast to prior work based on random walks (Perozzi et al., 2014) or matrix factorization (Belkin & Niyogi, 2002) methods that learn embeddings in an unsupervised manner based only on the graph structure.

In our attention-weighted aggregation scheme, a node aggregates features mostly from its close neighbors in the embedding space. Our scheme can thus be seen as a mechanism to establish soft aggregation neighborhoods in a task-relevant manner independently of the graph structure. Since graph structure is not considered in our attention-based global aggregation scheme, we concatenate node features obtained from the global aggregation scheme with node features obtained by conventional aggregation from the node’s graph neighborhood. One half of the node’s feature vector is thus obtained by attention-weighted aggregation from all nodes in the graph, while the other half is obtained by attention-weighted aggregation of only the nodes in the graph neighborhood. By combining the structure-agnostic global aggregation scheme with traditional neighborhood-based aggregation, we show that PH-GCNs are able to reach state of the art performance on several node classification tasks.

2 Related work

In GCNs, information flows along the edges of the graph. The manner in which information propagates depends on the local structure of the neighborhoods: densely connected neighborhoods rapidly expand a node’s influence while tree-like neighborhoods diffuse information slowly. A common way to communicate information across multiple neighborhood hubs is by constructing deeper (hierarchical) GCN architectures. A drawback of such GCN layer stacking is that it doesn’t disseminate information at a uniform rate across the graph neighborhoods (Xu et al., 2018). Densely connected nodes with a large sphere of influence will aggregate messages from a large number of nodes, which will over-smooth their features and make them indistinguishable from their close neighbors in the graph. On the other hand, sparsely connected nodes receive limited direct information from other nodes in the graph. Jumping knowledge networks (Xu et al., 2018) use skip connections to alleviate some of these problems by controlling the influence radius of each node separately. However, they still operate on a local scale which limits their effectiveness in capturing long-range node interactions. To capture contributions from distant nodes, structurally non-local aggregation approaches are needed (Wang et al., 2018).

Dynamic graph networks change the graph structure to connect semantically related nodes together without adding unnecessary depth. The node features are then updated by aggregating messages along the new edges. Dynamic graph CNNs rebuild the graph after each layer (using kd-trees) using the node features computed in the previous layer (Wang et al., 2019). Data-driven pooling techniques provide an alternative way to change the graph structure by clustering nodes that are semantically relevant. These techniques will typically coarsen the graph by either selecting top scoring nodes (Gao & Ji, 2019), or computing a soft assignment matrix (Ying et al., 2018). While rewiring the graph structure can provide long-range connections across the graph, it forces the network to learn how to retain the original graph since the original graph encodes semantically relevant relationships.

This inefficiency is addressed by graph networks that combine a traditional neighborhood aggregation scheme with a long-range aggregation scheme. Position-aware graph neural networks achieve this by using anchor-sets. A node in the graph aggregates node feature information from these anchor sets weighted by the distance between the node and the anchor-set (You et al., 2019). Position-aware graph networks capture the local network structure as well as retain the global network position of a given node. Alternatively, Geom-GCN proposes to map the graph nodes to an embedding space via various node embedding methods (Pei et al., 2020). Geom-GCN then uses the geometric relationships defined on this embedding space to build structural neighborhoods that are used in addition to the graph neighborhoods in a bi-level aggregation scheme. In contrast to our proposal, Geom-GCN uses pre-computed embeddings while we propose to learn the embeddings jointly with the graph network in an efficient and fully differentiable fashion.

Our proposed graph networks are built on top of an efficient attention mechanism. Attention was applied to graph neural networks by Veličković et al. (2017). The resulting graph attention networks (GATs) aggregate features from a node’s neighbors after weighting them by attention coefficients that are produced by an attention network. The attention network is a Perceptron layer that operates on the concatenation of the features of a pair of nodes. In contrast to GAT, our attention mechanism uses Euclidean distances between learned node embeddings to compute the attention coefficients. This formulation results in an attention mechanism that is analogous to high-dimensional Gaussian filtering. Approximate Gaussian filtering methods such as the permutohedral lattice can then be used to realize scalable global attention.

The permutohedral lattice has been used in convolutional neural networks operating on sparse inputs (Su et al., 2018). Permutohedral lattice convolutions have also been used to extend the standard convolutional filters so that they encompass not only pixels in a spatial neighborhood, but also neighboring pixels in the color or intensity spaces (Jampani et al., 2016; Wannenwetsch et al., 2019). An extension of this approach uses learned feature spaces, where the filter centered on a voxel encompasses nearby voxels in the learned feature space (Joutard et al., 2019). While these approaches share the common feature of attending more strongly to nearby elements in fixed or learned feature spaces, they are not true attention mechanisms as the attention coefficients are not normalized.

3 Methods

3.1 Graph convolutional networks with global attention

We consider neural networks operating on a directed graph $G = (V, E)$, where $V$ is a set of nodes/vertices and $E$ is the set of edges. $v_i$ denotes the $i$-th node and $\mathcal{N}_i$ is the structural neighborhood of $v_i$, i.e., the indices of all nodes connected by an edge to node $v_i$. Each node $v_i$ is associated with a feature vector $h_i \in \mathbb{R}^{F}$, where $F$ is the dimension of the input feature space. We now describe how our graph convolutional layer with global attention uses $G$ and $\{h_i\}$ to generate for each node $v_i$ an output feature vector $h'_i \in \mathbb{R}^{2F'}$, where $2F'$ is the dimension of the output feature space.

The graph layer’s learnable parameters are a projection matrix $W \in \mathbb{R}^{F' \times F}$, and a set of parameters $\theta$ parameterizing the generic attention function $\mathcal{A}_\theta$. The unnormalized attention coefficient between nodes $v_i$ and $v_j$ is given by:

$$\hat{\alpha}_{ij} = \mathcal{A}_\theta(W h_i, W h_j). \tag{1}$$

To stop nodes from indiscriminately attending to all attention targets, we apply a softmax to the unnormalized attention coefficients of each node so that they sum to 1. We distinguish between structural attention where a node attends only to its graph neighbors, and global attention where a node attends to all nodes in the graph. The structural attention coefficients are given by:

$$\alpha^{S}_{ij} = \frac{\exp(\hat{\alpha}_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(\hat{\alpha}_{ik})}, \qquad j \in \mathcal{N}_i, \tag{2}$$

and they are used to aggregate features from the node’s neighborhood:

$$h^{S}_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha^{S}_{ij}\, W h_j\Big). \tag{3}$$

$\sigma$ is a non-linear activation function. The global attention coefficients are given by

$$\alpha^{G}_{ij} = \frac{\exp(\hat{\alpha}_{ij})}{\sum_{v_k \in V} \exp(\hat{\alpha}_{ik})}, \tag{4}$$

and they are used to aggregate features from all nodes in the graph:

$$h^{G}_i = \sigma\Big(\sum_{v_j \in V} \alpha^{G}_{ij}\, W h_j\Big). \tag{5}$$

We concatenate the structurally aggregated and the globally aggregated feature vectors to yield the final feature vector $h'_i$:

$$h'_i = h^{S}_i \,\Vert\, h^{G}_i, \tag{6}$$

where $\Vert$ is the concatenation operator. We denote the action of our graph convolutional layer with global attention acting in the manner described above by $h'_i = \Phi_{W,\theta}(v_i)$. A standard technique is to use multiple attention heads (Veličković et al., 2017; Vaswani et al., 2017) and concatenate the outputs of the different heads. Using $M$ attention heads, the output feature vector has dimension $2MF'$ and is given by:

$$h'_i = \Big\Vert_{m=1}^{M} \Phi_{W^{m},\theta^{m}}(v_i). \tag{7}$$

In its current form, the attention-based global aggregation scheme is hardly practical since evaluating the global attention coefficients and implementing the global aggregation scheme in Eq. 5 would need to consider all pairs of nodes in the graph. In the following section, we show that for a particular choice of the attention function $\mathcal{A}_\theta$, global aggregation (Eqs. 4 and 5) can be efficiently implemented using a single approximate filtering step.
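The two aggregation pathways of a single attention head can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`softmax`, `matvec`, `aggregate`, `layer`, `dot_attention`) are hypothetical, `tanh` stands in for the unspecified non-linearity, and the dot-product attention in the example is only a placeholder for the generic attention function. The global pathway loops over all node pairs, which is the quadratic cost the paragraph above describes.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def aggregate(weights, vectors):
    """Weighted sum of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(a * v[d] for a, v in zip(weights, vectors)) for d in range(dim)]

def layer(features, neighbors, W, attention, sigma=math.tanh):
    """One head: structural output concatenated with global output, per node."""
    projected = [matvec(W, h) for h in features]
    n = len(features)
    outputs = []
    for i in range(n):
        # Structural pathway: softmax over the graph neighbors of node i only.
        nbrs = neighbors[i]
        a_s = softmax([attention(projected[i], projected[j]) for j in nbrs])
        h_s = [sigma(x) for x in aggregate(a_s, [projected[j] for j in nbrs])]
        # Global pathway: softmax over all nodes, O(N) per node, O(N^2) overall.
        a_g = softmax([attention(projected[i], projected[j]) for j in range(n)])
        h_g = [sigma(x) for x in aggregate(a_g, projected)]
        outputs.append(h_s + h_g)  # list concatenation
    return outputs

def dot_attention(a, b):
    """Placeholder attention function (assumption): plain dot product."""
    return sum(x * y for x, y in zip(a, b))

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[0, 1], [0, 1, 2], [1, 2]]  # self-loops included (assumption)
W = [[1.0, 0.0], [0.0, 1.0]]             # identity projection for the demo
outputs = layer(features, neighbors, W, dot_attention)
```

Each output vector has twice the projected dimension. For node 1, whose neighborhood happens to be the whole graph, the structural and global halves coincide.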

3.2 Attention-based global aggregation and non-local filtering

In image denoising, non-local image filters achieve superior performance compared to local filters under general statistical assumptions (Buades et al., 2004). Unlike local filters which update a pixel value based only on a spatially restricted image patch around the pixel, non-local filters average all image pixels weighted by how similar they are to the current pixel (Buades et al., 2005). For $N$ points, assume point $i$ has position $p_i$ in the $D$-dim Euclidean similarity space, and an associated feature vector $f_i$. The similarity space may for example be the color space. A general non-local filtering operation can be written as:

$$f'_i = \sum_{j=1}^{N} w(p_i, p_j)\, f_j, \tag{8}$$

where $f'_i$ is the output feature at position $p_i$ and $w(\cdot, \cdot)$ is the weighting function. The most common weighting function is the Gaussian kernel $w(p_i, p_j) = \exp(-\lambda^2 \|p_i - p_j\|^2 / 2)$, where $\|\cdot\|$ is the 2-norm and $\lambda$ the inverse of the standard deviation. The weighting function we use is the exponential decay kernel $w(p_i, p_j) = \exp(-\lambda \|p_i - p_j\|)$, which yields the non-local filtering equation:

$$f'_i = \sum_{j=1}^{N} \exp(-\lambda \|p_i - p_j\|)\, f_j. \tag{9}$$

Going back to the attention mechanism in Eq. 1, GAT uses a single feedforward layer operating on the concatenated features of a pair of nodes to produce the attention coefficient between the pair. An alternative attention mechanism uses the dot product between the pair of feature vectors (Vaswani et al., 2017). We introduce an attention mechanism based on Euclidean distances:

$$\mathcal{A}_\theta(W h_i, W h_j) = -\lambda \|E W h_i - E W h_j\|, \tag{10}$$

where $E \in \mathbb{R}^{D \times F'}$ is an embedding matrix that embeds the node’s transformed features into the $D$-dim node similarity space. Combining Eqs. 1, 4, and 5, and using the Euclidean distance form of attention in Eq. 10, the attention-weighted globally aggregated features at node $v_i$ are then:

$$h^{G}_i = \sigma\left(\frac{\sum_{j} \exp(-\lambda \|E W h_i - E W h_j\|)\, W h_j}{\sum_{j} \exp(-\lambda \|E W h_i - E W h_j\|)}\right), \tag{11}$$

which can be evaluated using two applications of the non-local filtering operation given by Eq. 9. To see that, identify:

$$p_i = E W h_i, \qquad f_i = W h_i. \tag{12}$$

Equation 11 can then be written as:

$$h^{G}_i = \sigma\left(\frac{\sum_{j} \exp(-\lambda \|p_i - p_j\|)\, f_j}{\sum_{j} \exp(-\lambda \|p_i - p_j\|) \cdot 1}\right). \tag{13}$$

Both the numerator and denominator are standard non-local filtering operations (all points/nodes in the denominator have a unity feature). In practice, we implement Eq. 13 using only one non-local filtering step: we append 1 to the projected feature vector of every node ($W h_j$) to get an $(F'+1)$-dim vector and then execute the non-local filtering step; we take the first $F'$ entries in the resulting feature vectors and normalize by the normalizing factor at position $F'+1$.
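The "append 1, filter once, then normalize" trick described above can be checked with a short plain-Python sketch. It uses an exact, quadratic-cost filter with an exponential decay kernel rather than any approximate method, and it omits the non-linearity; the function names `nonlocal_filter` and `global_aggregate` and the symbol `lam` for the distance scaling coefficient are assumptions made for this illustration.

```python
import math

def nonlocal_filter(positions, feats, lam):
    """Exact O(N^2) filter: out_i = sum_j exp(-lam * ||p_i - p_j||) * f_j."""
    out = []
    for p_i in positions:
        acc = [0.0] * len(feats[0])
        for p_j, f_j in zip(positions, feats):
            w = math.exp(-lam * math.dist(p_i, p_j))
            for d, x in enumerate(f_j):
                acc[d] += w * x
        out.append(acc)
    return out

def global_aggregate(positions, feats, lam):
    """Softmax-normalized aggregation obtained from one filtering pass."""
    # Append a constant 1 so the last channel accumulates the normalizer.
    augmented = [f + [1.0] for f in feats]
    filtered = nonlocal_filter(positions, augmented, lam)
    # Divide the feature channels by the normalizer in the last channel.
    return [[x / row[-1] for x in row[:-1]] for row in filtered]

positions = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # node embeddings (toy values)
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]      # projected features (toy values)
aggregated = global_aggregate(positions, feats, lam=1.5)
```

Because the last channel of the filtered output is exactly the sum of the kernel weights, dividing by it reproduces the softmax weighting without a second filtering pass.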

3.3 Permutohedral lattice filtering

Exactly evaluating the non-local filtering operation in Eq. 13 for all nodes still scales as $O(N^2)$, where $N$ is the number of nodes. However, approximate algorithms such as KD-trees (Adams et al., 2009) and permutohedral lattice filtering (Adams et al., 2010) have a more favorable scaling behavior of $O(N \log N)$ and $O(N)$, respectively. We use permutohedral lattice filtering as our approximate filtering algorithm because it is end-to-end differentiable with respect to both the node embeddings and the node features.

Figure 1: Illustration of the action of a single attention head in the PH-GCN layer. The structural aggregation pathway is similar to GAT (Veličković et al., 2017), except that we use attention based on Euclidean distances between embeddings. Global aggregation is approximated by filtering in the permutohedral lattice which involves three steps: splatting, blurring, and slicing. The $F'$-dimensional output feature vectors from each pathway are concatenated to yield the final $2F'$-dimensional output vector.

The permutohedral lattice in $D$-dim space is an integer lattice that lives within the $D$-dim hyperplane $\{x \in \mathbb{R}^{D+1} : x \cdot \mathbf{1} = 0\}$. The lattice tessellates the hyperplane with uniform simplices. There are efficient techniques for finding the enclosing simplex of any point in the hyperplane, as well as for finding the neighboring lattice points of any point in the lattice. The permutohedral lattice in the 2D plane is illustrated in Fig. 1. Approximate filtering using the permutohedral lattice involves three steps:

  • Splatting : The $D$-dim location vectors $p_i$ are projected onto the $D$-dim hyperplane of the permutohedral lattice. The feature vectors for all lattice points are initialized to zero. The lattice points defining the enclosing simplex of each projected location are found and for each location $p_i$, the associated feature vector $f_i$ is added to the feature vector of each enclosing lattice point after being scaled by its proximity to the lattice point.

  • Blurring : This step applies the filtering kernel over the lattice points. The filtering kernel decays exponentially with distance. Therefore, we calculate the output feature of a lattice point using only a small neighborhood of lattice points since contributions from more distant points would be small. This is one of the key approximations of the method: a lattice point only aggregates features from its neighborhood (weighted by the filtering kernel), instead of aggregating features from all lattice points.

  • Slicing : After applying the approximate filtering operation over the lattice points, the output feature vector for point $p_i$ is evaluated as the sum of the output feature vectors of its enclosing lattice points, weighted by the proximity of each enclosing lattice point to $p_i$.

These three steps are illustrated in Fig. 1. Figure 1 shows the two parallel pathways used in a PH-GCN layer: the structural aggregation pathway uses attention coefficients based on the Euclidean distance between node embeddings to aggregate features from a node’s immediate neighborhood, while the global aggregation pathway uses the attention coefficients to aggregate from all nodes. The $F'$-dimensional output feature vectors from each pathway are concatenated to yield the final $2F'$-dimensional output feature vector.
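To make the splat, blur, and slice steps concrete, the following is a deliberately simplified one-dimensional analogue on a regular integer grid, where the "enclosing simplex" of a point is just the interval between two neighboring grid points. It is not the permutohedral lattice algorithm of Adams et al. (2010) and not the authors' code; the function name `splat_blur_slice`, the linear interpolation weights, and the default blur radius are assumptions chosen for illustration.

```python
import math

def splat_blur_slice(positions, feats, lam, radius=3):
    """Approximate out_i = sum_j exp(-lam * |p_i - p_j|) * f_j on a 1-D grid."""
    dim = len(feats[0])

    # Splat: distribute each feature onto the two enclosing grid points,
    # weighted by proximity (linear interpolation weights).
    lattice = {}
    for p, f in zip(positions, feats):
        lo = math.floor(p)
        frac = p - lo
        for point, w in ((lo, 1.0 - frac), (lo + 1, frac)):
            cell = lattice.setdefault(point, [0.0] * dim)
            for d in range(dim):
                cell[d] += w * f[d]

    # Blur: each occupied grid point gathers only from grid points at most
    # `radius` hops away, weighted by the exponential decay kernel.
    kernel = {k: math.exp(-lam * abs(k)) for k in range(-radius, radius + 1)}
    blurred = {}
    for q in lattice:
        acc = [0.0] * dim
        for k, w in kernel.items():
            src = lattice.get(q + k)
            if src is not None:
                for d in range(dim):
                    acc[d] += w * src[d]
        blurred[q] = acc

    # Slice: read the result back at each input position by interpolating
    # between the blurred values of its two enclosing grid points.
    out = []
    for p in positions:
        lo = math.floor(p)
        frac = p - lo
        out.append([(1.0 - frac) * blurred[lo][d] + frac * blurred[lo + 1][d]
                    for d in range(dim)])
    return out

result = splat_blur_slice([0.0, 1.0], [[1.0], [0.0]], lam=1.0)
```

The work per input point is constant in the splat and slice steps, and the blur touches a fixed number of neighbors per occupied grid point, so the total cost in this sketch grows linearly with the number of points. Points farther apart than the blur radius do not interact at all, which is the approximation the blurring step introduces.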

Figure 2: (a) Two sample graphs in an inductive node classification task. The graphs have a chain structure where each element in the chain can either be motif_1, motif_2, or a single node (the spacer node). Each node is connected to itself (self edges not shown). Nodes with the same color have the same feature vector. The red nodes are present in motif_1 and motif_2. The goal is to classify each red node in the graph based on whether it is present in the dominant motif, i.e., the motif which occurs most frequently. In the middle graph for example, motif_1 is dominant, hence red nodes in motif_1 have label 1 while red nodes in motif_2 have label 0 (the labels are shown inside the nodes). (b) Accuracy of GAT and PH-GCN as a function of training iterations. Mean and standard error bars from 10 trials.

4 Experimental results

4.1 Node classification based on motif counts

We first illustrate the power of the PH-GCN layer with its combination of structural and global aggregation on a synthetic inductive node classification task. The task is illustrated in Fig. 2. We generate random graphs formed by a random combination of two motifs. We classify nodes in each motif based on whether the motif they are in is the motif that occurs most frequently in the graph. This is a difficult task as it requires information to flow across the whole graph in order to detect the frequency of occurrence of the different motifs and label the nodes in each motif accordingly.

We train a 3-layer GAT network and a 3-layer PH-GCN network. In every training iteration, we sample a new random chain graph composed of 10 elements (each element is either motif_1, motif_2, or a spacer node). We report the mean test accuracy for classifying two randomly selected nodes in 100 randomly sampled graphs. As shown in Fig. 2, GAT completely fails to solve the task and its performance stays at chance level, while PH-GCN is consistently able to learn the task.
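The labeling rule of this synthetic task can be sketched as follows. Only the element sequence and the labels are modeled; the internal structure of the motifs is not reproduced. The names `sample_chain` and `red_node_labels`, the uniform sampling of elements, and the choice to resample when the two motifs tie are assumptions, since the text does not specify them.

```python
import random
from collections import Counter

ELEMENTS = ("motif_1", "motif_2", "spacer")

def sample_chain(length=10, rng=random):
    """Sample chain elements until one motif strictly outnumbers the other."""
    while True:
        chain = [rng.choice(ELEMENTS) for _ in range(length)]
        counts = Counter(chain)
        if counts["motif_1"] != counts["motif_2"]:  # assumption: no ties
            return chain

def red_node_labels(chain):
    """Per element: 1 if it is the dominant motif, 0 if not, None for spacers."""
    counts = Counter(chain)
    dominant = "motif_1" if counts["motif_1"] > counts["motif_2"] else "motif_2"
    return [None if e == "spacer" else int(e == dominant) for e in chain]

chain = sample_chain(rng=random.Random(0))
labels = red_node_labels(chain)
```

The label of a red node depends on a count over the whole chain, which is why a model needs information from every element to classify a single node.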

The failure of GAT is expected as a 3-layer network can only aggregate node features from up to 3 hops away in the graph. This is clearly insufficient to judge whether a motif is dominant in the graph. PH-GCN on the other hand is able to learn that a node label correlates with the frequency of occurrence of its motif in the graph. This requires the neighborhood aggregation part of PH-GCN to detect the motif structure, and the global aggregation part to compare the frequency of occurrence of the two motifs. This synthetic example shows that PH-GCNs can solve problems beyond the ability of standard graph convolutional networks using the same number of layers.
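The limited reach of purely structural aggregation can be illustrated with a small sketch that tracks which nodes can influence a given node after a number of message passing rounds. For simplicity the chain here has one node per element, which is an assumption; the helper name `receptive_field` is hypothetical.

```python
def receptive_field(adj, node, rounds):
    """Nodes whose features can reach `node` after `rounds` of neighbor aggregation."""
    reached = {node}
    for _ in range(rounds):
        # One round extends the reachable set by one hop.
        reached |= {v for u in reached for v in adj[u]}
    return reached

n = 10
chain_graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
field = receptive_field(chain_graph, node=0, rounds=3)
```

With three rounds, the node at one end of this ten-node chain sees only itself and the next three nodes, so it cannot count how often each motif occurs across the whole chain.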

4.2 Transductive node classification

Dataset                    Cora  Citeseer  Pubmed  Cornell  Texas  Wisconsin  Actor
Number of nodes            2708      3327   19717      183    183        251   7600
Number of edges            5429      4732   44338      295    309        499  33544
Node feature dimensions    1433      3703     500     1703   1703       1703    931
Number of classes             7         6       3        5      5          5      5

Table 1: Properties of graph datasets
Cora Citeseer Pubmed Cornell Texas Wisconsin Actor
Table 2: Percentage classification accuracy. Mean and standard deviation from 10 different data splits. The Geom-GCN results were taken from the original paper.

We test PH-GCN on the following transductive node classification graphs:

  • Citation graphs : Nodes represent papers and edges represent citations. Node features are binary 0/1 vectors indicating the absence or presence of certain key words in the paper. Each node/paper belongs to an academic topic and the goal is to predict the label/topic of the test nodes. We benchmark on three citation graphs: Cora, Citeseer, and Pubmed.

  • WebKB graphs : The graphs capture the structure of the webpages of computer science departments in various universities. Nodes represent webpages, and edges webpage links. As in the citation graphs, node features are bag of words binary vectors. Each node is manually classed as ‘course’, ‘faculty’, ‘student’, ‘project’, or ‘staff’. We benchmark on three WebKB graphs: Cornell, Texas, and Wisconsin.

  • Actor co-occurrence graph : The graph was constructed by crawling Wikipedia articles. Each node represents an actor, and an edge indicates one actor occurs on another’s Wikipedia page. Node features are bag of words binary vectors. Nodes are classified into five categories based on the topic of the actor’s Wikipedia page. This graph is a subgraph of the film-director-actor-writer graph in (Tang et al., 2009).

Table 1 summarizes the properties of the graph datasets we use.

We compare the performance of PH-GCN against two baselines that only use structural aggregation: GCN (Kipf & Welling, 2016) and GAT (Veličković et al., 2017). In addition we also test against Geom-GCN (Pei et al., 2020) which uses a combination of structural aggregation and aggregation based on node proximity in an embedding space. Geom-GCN uses three different algorithms to create the node embeddings: Isomap, Poincare embeddings, and struc2vec, which leads to three different Geom-GCN variants: Geom-GCN-I, Geom-GCN-P, and Geom-GCN-S, respectively. The node embeddings used by Geom-GCN are fixed and depend only on the graph structure. Node embeddings in PH-GCN, however, are dynamic and depend on the node features. Most importantly, PH-GCN node embeddings are directly trained using the loss function of the task at hand.

PH-GCN has two novel aspects: an attention mechanism based on Euclidean distances, and an efficient scheme for attention-based global aggregation that is built on top of the new attention mechanism. We separately evaluate the performance of the new attention mechanism by replacing the attention mechanism used in GAT by our Euclidean distance attention mechanism. We term the resulting networks GAT-EDA.

For each node classification task, we randomly split the nodes of each class into a 60%-20%-20% split for training, testing, and validation respectively. During training, we monitor the validation loss and report test accuracy for the model with the smallest validation loss. We repeat all experiments 10 times for 10 different random splits and report the mean and standard deviation of the test accuracy. The hyper-parameters we tune are the number of attention heads, dropout probabilities, and the fixed learning rate. We tune these hyper-parameters to obtain best validation loss.
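A per-class random split of this kind might be sketched as below. The function name `stratified_split`, the rounding of the per-class counts, and the fixed seed are assumptions for illustration; the text only specifies the 60%-20%-20% proportions and that the split is done within each class.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split node indices per class into train/validation/test lists."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, y in enumerate(labels):
        by_class[y].append(node)
    train, val, test = [], [], []
    for nodes in by_class.values():
        rng.shuffle(nodes)
        n_train = round(fractions[0] * len(nodes))  # rounding is an assumption
        n_val = round(fractions[1] * len(nodes))
        train += nodes[:n_train]
        val += nodes[n_train:n_train + n_val]
        test += nodes[n_train + n_val:]             # remainder goes to test
    return train, val, test

labels = [0] * 10 + [1] * 5
train_idx, val_idx, test_idx = stratified_split(labels)
```

Repeating the call with ten different seeds would give the ten random splits over which the mean and standard deviation are reported.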

For PH-GCN and GAT-EDA, we always use an embedding dimension of 4. In the blurring step in the permutohedral lattice used in PH-GCN, a lattice point aggregates features from neighboring lattice points up to three hops away in all directions, i.e., we use a lattice filter of width 7. The distance scaling coefficient $\lambda$ (Eq. 10), which scales distances in the embedding space, was set separately for structural aggregation and for global aggregation, with the larger value used for global aggregation. The larger $\lambda$ causes the attention coefficients to decay much faster with distance when doing global aggregation. This is needed to make global attention more selective as it has a much larger number of attention targets (all nodes in the graph) compared to attention in the structural aggregation step (which only considers the node’s neighbors). All the networks we test have two layers. We use the Adam optimizer (Kingma & Ba, 2014) for all experiments.
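The effect of the distance scaling coefficient on selectivity, and the relation between the blur radius and the filter width, can be seen in a small numeric sketch. The distances and the two coefficient values below are made-up illustrative numbers, not the settings used in the experiments, and `attention_weights` is a hypothetical helper.

```python
import math

def attention_weights(distances, lam):
    """Softmax over scores -lam * distance, as in distance-based attention."""
    scores = [-lam * d for d in distances]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

distances = [0.0, 0.5, 1.0, 2.0, 4.0]          # embedding distances to five targets
w_small = attention_weights(distances, lam=0.5)  # illustrative small coefficient
w_large = attention_weights(distances, lam=4.0)  # illustrative large coefficient

hops = 3
filter_width = 2 * hops + 1  # three hops in each direction plus the center point
```

With the larger coefficient, most of the attention mass concentrates on the nearest targets, which is the selectivity the paragraph above asks of global attention.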

Figure 3: Visualization of the learned embeddings for the two attention heads in the first PH-GCN layer when learning on (a) the Wisconsin graph and (b) the Cornell graph. We used a 2D embedding space. The top row shows the embeddings of all nodes in the two graphs for the two attention heads, colored according to their class label. Nodes belonging to the same class tend to cluster together. Bottom row shows the relation between the number of hops separating two nodes (graph distance), and the distance between their learned embeddings. No clear correlation exists.

The accuracy results are summarized in Table 2. For Cora and Citeseer, the performance of PH-GCN is not significantly different from GCN and GAT. There is a small but significant improvement of PH-GCN over GCN and GAT on the Pubmed dataset. The lack of consistent major improvement on the three citation graphs can be attributed to their high assortativity (Pei et al., 2020) where nodes belonging to the same class tend to connect together. In highly assortative graphs, a node does not need long-range aggregation in order to accumulate features from similar nodes, as similar nodes are close together in the graph anyway. Hence structural aggregation alone should perform well. Similar observations have been made regarding Geom-GCN (Pei et al., 2020) which fails to show consistent improvements on the citation graphs. For the WebKB graphs and the actor co-occurrence graph, the performance advantage of PH-GCN over GCN and GAT is large and significant. For the WebKB graphs, the accuracy standard deviation can be large since the graphs are small, and differences in the training/validation/testing splits significantly alter the learning problem.

Across all graphs, PH-GCN either outperforms all Geom-GCN variants or outperforms two out of the three variants. One of the issues with Geom-GCN is that it has to pick an embedding algorithm for generating the fixed node embeddings before the start of training, and there is no clear way to select the embedding algorithm that would perform best on the task. As shown in Table 2, different Geom-GCN embeddings perform best on different tasks, and the accuracy difference between the different embeddings can be large. PH-GCN completely sidesteps this issue by learning the embeddings while learning the task which consistently leads to high performance.

Except for the Cora and Citeseer graphs where the performance of PH-GCN and GAT-EDA are not significantly different, PH-GCN performs significantly better than GAT-EDA. Both share the same novel Euclidean distance attention mechanism. This shows the performance advantage of PH-GCN cannot all be attributed to the new attention mechanism we use, but that global attention-based aggregation plays a crucial role.

4.3 PH-GCN embeddings

In this section, we investigate the properties of the node embeddings learned by PH-GCN. Each PH-GCN layer can use multiple attention heads. Each attention head learns its own node embeddings, and the Euclidean distances between these node embeddings translate to attention coefficients between the nodes. To visualize these embeddings, we use a 2D embedding space and train a 2-layer PH-GCN on the Wisconsin and Cornell graphs. We use two attention heads in the first PH-GCN layer. Figure 3 shows the embeddings learned by the two heads in both tasks, colored according to their class labels. It is clear PH-GCN learns embeddings where nodes belonging to the same class are more clustered together. PH-GCN learns embeddings that minimize the classification loss; by placing nodes of the same class closer together in the embedding space, nodes belonging to one class will predominantly aggregate information from each other and not from nodes belonging to other classes. Nodes belonging to the same class will thus have similar output feature vectors (since they all aggregate information from the same embedding neighborhood). In other words, by learning embeddings that cluster same-class nodes together, PH-GCN reduces the within-class variance of the node features. This simplifies the classification task for the second (output) layer.

The embeddings learned by PH-GCN are thus directly tied to the task at hand. In Fig. 3, we plot the relation between the distance between a pair of nodes in the graph (minimum number of edges that need to be traversed to get from one node to another), and the distance between their learned embeddings. It is clear no correlation exists between the two. For all node pairs that are $k$ edges apart in the graph, there is very large variability in their separation distance in embedding space, as evidenced by the large standard deviation bars. This shows that the learned embeddings do not represent the graph structure. Instead, the embeddings represent the aggregation neighborhoods needed to perform well on the task.
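An analysis of this kind, grouping embedding distances by graph distance, could be computed along the following lines. This is a sketch under assumptions: the graph is treated as undirected and given as an adjacency list, each unordered pair is counted once, and the helper names `hop_distances` and `embedding_distance_by_hops` are hypothetical.

```python
import math
from collections import deque, defaultdict
from statistics import mean, pstdev

def hop_distances(adj, source):
    """Breadth-first search: minimum number of edges from `source` to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def embedding_distance_by_hops(adj, embeddings):
    """Map hop count -> (mean, std) of embedding distances over node pairs."""
    groups = defaultdict(list)
    for i in range(len(adj)):
        for j, hops in hop_distances(adj, i).items():
            if j > i:  # count each unordered pair once
                groups[hops].append(math.dist(embeddings[i], embeddings[j]))
    return {k: (mean(v), pstdev(v)) for k, v in sorted(groups.items())}

adj = [[1], [0, 2], [1]]                           # toy path graph 0 - 1 - 2
embeddings = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]  # toy 2D embeddings
stats = embedding_distance_by_hops(adj, embeddings)
```

In this toy example the two end nodes are two hops apart in the graph yet coincide in the embedding space, a small-scale version of the decoupling between graph distance and embedding distance described above.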

5 Discussion

Attention-based feature aggregation is a powerful mechanism that finds use in deep learning models ranging from natural language processing (NLP) models (Vaswani et al., 2017) to graph neural networks (Veličković et al., 2017) to generative models (Zhang et al., 2018). It enables a model to capture relevant long-range information within a large context. Attention mechanisms, however, intrinsically have an unfavorable quadratic scaling behavior with the number of attention targets (words in an NLP setting, or nodes in a graph setting). This has traditionally been addressed by limiting the range of the attention mechanism (using small word contexts in NLP setting, or using limited node neighborhoods in graph settings).

Using approximate Gaussian filtering methods to implement global attention makes it possible to use larger attention contexts beyond what has been possible using exact attention. Our approximate formulation of global attention is directly applicable to a range of models beyond graph networks. One example is transformer models (Vaswani et al., 2017). Recently, an approximate global attention mechanism based on locality sensitive hashing has been used in the transformer architecture (Kitaev et al., 2020) to break the quadratic scaling behavior. Unlike our approximate filtering approach, however, it has a non-differentiable step (the hashing step), and it scales as $O(N \log N)$ while our approach scales as $O(N)$, where $N$ is the number of attention targets.

Learning informative node embeddings in graphs is a long-standing problem (Goyal & Ferrara, 2018). The majority of prior work, however, first chooses a similarity measure, and then finds node embeddings that put similar nodes closer together. The similarity measure could for example be adjacency in the graph (Belkin & Niyogi, 2002) or the co-occurrence frequency of two nodes in random walks over the graph (Perozzi et al., 2014). What the right similarity measure is, however, depends on the task at hand. In PH-GCN, we dispense with similarity-based node embedding approaches, and directly optimize the embeddings based on the task loss. Nodes are close together in the embedding space if the task loss improves by having them aggregate information from each other. Our approach naturally simplifies the learning problem on the graph as we do not need a separate embedding generation step.