1 Introduction
Graph convolutional networks (GCNs) and their variants achieve excellent performance in transductive node classification tasks (Defferrard et al., 2016; Kipf & Welling, 2016). GCNs use a message passing scheme where each node aggregates messages/features from its neighboring nodes in order to update its own feature vector. After $K$ message passing rounds, a node is able to aggregate information from nodes up to $K$ hops away in the graph. This message passing scheme has three shortcomings: 1) a node does not consider information from nodes that are more than $K$ hops away; 2) messages from individual nodes can get swamped by messages from other nodes, making it hard to isolate important contributions from individual nodes; 3) node features can become indistinguishable as their aggregation ranges overlap, which severely degrades the performance of downstream tasks when the indistinguishable nodes belong to different classes. This oversmoothing phenomenon (Chen et al., 2019) is observed even for small values of $K$ (Li et al., 2018).
We propose a global attention-based aggregation scheme that simultaneously addresses the issues above. We modify and extend graph attention networks (GATs) (Veličković et al., 2017) to enable the attention mechanism to operate globally, i.e., a node can aggregate features from any other node in the graph weighted by an attention coefficient. This sidesteps the inherent limited aggregation range of GCNs, and allows a node to aggregate features from distant nodes without having to aggregate features from all nodes in between. Two nearby nodes in the graph can effectively aggregate information from two very different sets of nodes, alleviating the issue of oversmoothing as information does not have to come solely from their two overlapping neighborhoods.
Naive global pairwise attention is not scalable though, as it would have to consider all pairs of nodes, and computational overhead would grow quadratically with the number of nodes. To address this, we formulate a new attention mechanism where the strength of pairwise attention depends on the Euclidean distance between learnable node embeddings. Attention-weighted global aggregation is then analogous to a Gaussian filtering operation for which approximate techniques with linear complexity exist. In particular, we use approximate filtering based on the permutohedral lattice (Adams et al., 2010). The lattice-based filtering scheme is differentiable and error backpropagates not only through the node features, but also through the node embeddings. The resulting networks, which we call permutohedral-GCNs (PHGCNs), are thus fully differentiable.
In PHGCNs, node embeddings are directly learned based on the error signal of the training task (node classification task in our case). Since the attention weights between nodes increase as they move closer in the embedding space, the training error signal will move two nodes closer in the embedding space if the training loss would improve by having them integrate more information from each other. This is in contrast to prior work based on random walks (Perozzi et al., 2014) or matrix factorization (Belkin & Niyogi, 2002) methods that learn embeddings in an unsupervised manner based only on the graph structure.
In our attention-weighted aggregation scheme, a node aggregates features mostly from its close neighbors in the embedding space. Our scheme can thus be seen as a mechanism to establish soft aggregation neighborhoods in a task-relevant manner independently of the graph structure. Since graph structure is not considered in our attention-based global aggregation scheme, we concatenate node features obtained from the global aggregation scheme with node features obtained by conventional aggregation from the node’s graph neighborhood. One half of the node’s feature vector is thus obtained by attention-weighted aggregation from all nodes in the graph, while the other half is obtained by attention-weighted aggregation of only the nodes in the graph neighborhood. By combining the structure-agnostic global aggregation scheme with traditional neighborhood-based aggregation, we show that PHGCNs are able to reach state-of-the-art performance on several node classification tasks.
2 Related work
In GCNs, information flows along the edges of the graph. The manner in which information propagates depends on the local structure of the neighborhoods: densely connected neighborhoods rapidly expand a node’s influence while tree-like neighborhoods diffuse information slowly. A common way to communicate information across multiple neighborhood hubs is by constructing deeper (hierarchical) GCN architectures. A drawback of such GCN layer stacking is that it does not disseminate information at a uniform rate across the graph neighborhoods (Xu et al., 2018). Densely connected nodes with a large sphere of influence will aggregate messages from a large number of nodes, which will oversmooth their features and make them indistinguishable from their close neighbors in the graph. On the other hand, sparsely connected nodes receive limited direct information from other nodes in the graph. Jumping knowledge networks (Xu et al., 2018) use skip connections to alleviate some of these problems by controlling the influence radius of each node separately. However, they still operate on a local scale, which limits their effectiveness in capturing long-range node interactions. To capture contributions from distant nodes, structurally non-local aggregation approaches are needed (Wang et al., 2018).
Dynamic graph networks change the graph structure to connect semantically related nodes together without adding unnecessary depth. The node features are then updated by aggregating messages along the new edges. Dynamic graph CNNs rebuild the graph after each layer (using kd-trees) from the node features computed in the previous layer (Wang et al., 2019). Data-driven pooling techniques provide an alternative way to change the graph structure by clustering nodes that are semantically relevant. These techniques will typically coarsen the graph by either selecting top-scoring nodes (Gao & Ji, 2019), or computing a soft assignment matrix (Ying et al., 2018). While rewiring the graph structure can provide long-range connections across the graph, it forces the network to learn how to retain the original graph since the original graph encodes semantically relevant relationships.
This inefficiency is addressed by graph networks that combine a traditional neighborhood aggregation scheme with a long-range aggregation scheme. Position-aware graph neural networks achieve this by using anchor-sets. A node in the graph aggregates node feature information from these anchor-sets weighted by the distance between the node and the anchor-set (You et al., 2019). Position-aware graph networks capture the local network structure as well as retain the global network position of a given node. Alternatively, GeomGCN proposes to map the graph nodes to an embedding space via various node embedding methods (Pei et al., 2020). GeomGCN then uses the geometric relationships defined on this embedding space to build structural neighborhoods that are used in addition to the graph neighborhoods in a bi-level aggregation scheme. In contrast to our proposal, GeomGCN uses pre-computed embeddings while we propose to learn the embeddings jointly with the graph network in an efficient and fully differentiable fashion.
Our proposed graph networks are built on top of an efficient attention mechanism. Attention was applied to graph neural networks by Veličković et al. (2017). The resulting graph attention networks (GATs) aggregate features from a node’s neighbors after weighting them by attention coefficients that are produced by an attention network. The attention network is a perceptron layer that operates on the concatenation of the features of a pair of nodes. In contrast to GAT, our attention mechanism uses Euclidean distances between learned node embeddings to compute the attention coefficients. This formulation results in an attention mechanism that is analogous to high-dimensional Gaussian filtering. Approximate Gaussian filtering methods such as the permutohedral lattice can then be used to realize scalable global attention.
The permutohedral lattice has been used in convolutional neural networks operating on sparse inputs (Su et al., 2018). Permutohedral lattice convolutions have also been used to extend the standard convolutional filters so that they encompass not only pixels in a spatial neighborhood, but also neighboring pixels in the color or intensity spaces (Jampani et al., 2016; Wannenwetsch et al., 2019). An extension of this approach uses learned feature spaces, where the filter centered on a voxel encompasses nearby voxels in the learned feature space (Joutard et al., 2019). While these approaches share the common feature of attending more strongly to nearby elements in fixed or learned feature spaces, they are not true attention mechanisms as the attention coefficients are not normalized.
3 Methods
3.1 Graph convolutional networks with global attention
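We consider neural networks operating on a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ where $\mathcal{V} = \{v_1, \ldots, v_N\}$ is a set of $N$ nodes/vertices and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges. $v_i$ denotes the $i$-th node and $\mathcal{N}_i$ is the structural neighborhood of $v_i$, i.e., the indices of all nodes connected by an edge to node $v_i$. Each node $v_i$ is associated with a feature vector $\mathbf{h}_i \in \mathbb{R}^{F}$, where $F$ is the dimension of the input feature space. We now describe how our graph convolutional layer with global attention uses $\{\mathbf{h}_i\}_{i=1}^{N}$ and $\mathcal{G}$ to generate for each node $v_i$ an output feature vector $\mathbf{h}'_i \in \mathbb{R}^{F'}$, where $F'$ is the dimension of the output feature space.
The graph layer’s learnable parameters are a projection matrix $\mathbf{W} \in \mathbb{R}^{\frac{F'}{2} \times F}$, and a set of parameters $\theta$ parameterizing the generic attention function $\mathcal{A}_\theta$. The unnormalized attention coefficient between nodes $v_i$ and $v_j$ is given by:
$$e_{ij} = \mathcal{A}_\theta(\mathbf{W}\mathbf{h}_i, \mathbf{W}\mathbf{h}_j) \tag{1}$$
To stop nodes from indiscriminately attending to all attention targets, we apply a softmax to the unnormalized attention coefficients of each node so that they sum to 1. We distinguish between structural attention where a node attends only to its graph neighbors, and global attention where a node attends to all nodes in the graph. The structural attention coefficients are given by:
$$\alpha^{S}_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} \tag{2}$$
and they are used to aggregate features from the node’s neighborhood:
$$\mathbf{h}^{S}_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha^{S}_{ij}\, \mathbf{W}\mathbf{h}_j\Big) \tag{3}$$
where $\sigma$ is a nonlinear activation function. The global attention coefficients are given by
$$\alpha^{G}_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{N} \exp(e_{ik})} \tag{4}$$
and they are used to aggregate features from all nodes in the graph:
$$\mathbf{h}^{G}_i = \sigma\Big(\sum_{j=1}^{N} \alpha^{G}_{ij}\, \mathbf{W}\mathbf{h}_j\Big) \tag{5}$$
We concatenate the structurally aggregated and the globally aggregated feature vectors to yield the final feature vector $\mathbf{h}'_i$:
$$\mathbf{h}'_i = \mathbf{h}^{S}_i \,\Vert\, \mathbf{h}^{G}_i \tag{6}$$
where $\Vert$ is the concatenation operator. We denote the action of our graph convolutional layer with global attention acting in the manner described above by $\mathbf{h}'_i = \mathcal{L}_i(\{\mathbf{h}_j\}, \mathcal{G}; \mathbf{W}, \theta)$. A standard technique is to use multiple attention heads (Veličković et al., 2017; Vaswani et al., 2017) and concatenate the outputs of the different heads. Using $M$ attention heads, the output feature vector has dimension $MF'$ and is given by:
$$\mathbf{h}'_i = \big\Vert_{m=1}^{M}\, \mathcal{L}_i(\{\mathbf{h}_j\}, \mathcal{G}; \mathbf{W}^{m}, \theta^{m}) \tag{7}$$
In its current form, the attention-based global aggregation scheme is hardly practical since evaluating the global attention coefficients and implementing the global aggregation scheme in Eq. 5 would need to consider all pairs of nodes in the graph. In the following section, we show that for a particular choice of the attention function $\mathcal{A}_\theta$, global aggregation (Eqs. 4 and 5) can be efficiently implemented using a single approximate filtering step.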
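To make the cost of the exact scheme concrete, the following is a minimal NumPy sketch of one layer that evaluates both the structural and the global softmax directly over all node pairs. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of ReLU as the nonlinearity, and the requirement that every node has a self edge are all assumptions made for the example.

```python
import numpy as np

def softmax_masked(scores, mask):
    """Row-wise softmax restricted to the entries where mask is True."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                              # masked entries become 0
    return weights / weights.sum(axis=1, keepdims=True)

def naive_global_attention_layer(h, adj, W, attention_fn):
    """Exact (quadratic-cost) structural + global attention aggregation.

    h:   (N, F) input node features.
    adj: (N, N) boolean adjacency; assumed to contain self edges so that
         every row has at least one attention target.
    W:   (F_half, F) projection matrix.
    attention_fn: maps projected features (N, F_half) to an (N, N) matrix
         of unnormalized attention coefficients.
    """
    z = h @ W.T                                   # projected features
    e = attention_fn(z)                           # all N*N pairwise coefficients
    alpha_s = softmax_masked(e, adj)              # attend to graph neighbors only
    alpha_g = softmax_masked(e, np.ones_like(adj, dtype=bool))  # attend to all nodes
    h_s = np.maximum(alpha_s @ z, 0.0)            # ReLU stands in for the nonlinearity
    h_g = np.maximum(alpha_g @ z, 0.0)
    return np.concatenate([h_s, h_g], axis=1)     # (N, 2 * F_half)
```

Any pairwise scoring function can be passed as `attention_fn`, for example `lambda z: z @ z.T` for dot-product attention. The `(N, N)` matrix `e` is what makes this version impractical for large graphs.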
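3.2 Attention-based global aggregation and non-local filtering
In image denoising, non-local image filters achieve superior performance compared to local filters under general statistical assumptions (Buades et al., 2004). Unlike local filters which update a pixel value based only on a spatially restricted image patch around the pixel, non-local filters average all image pixels weighted by how similar they are to the current pixel (Buades et al., 2005). For $N$ points, assume point $i$ has position $\mathbf{p}_i \in \mathbb{R}^{D}$ in the $D$-dim Euclidean similarity space, and an associated feature vector $\mathbf{v}_i$. The similarity space may for example be the color space. A general non-local filtering operation can be written as:
$$\mathbf{v}'_i = \sum_{j=1}^{N} f(\mathbf{p}_i, \mathbf{p}_j)\, \mathbf{v}_j \tag{8}$$
where $\mathbf{v}'_i$ is the output feature at position $\mathbf{p}_i$ and $f$ is the weighting function. The most common weighting function is the Gaussian kernel $f(\mathbf{p}_i, \mathbf{p}_j) = \exp(-\lambda^2 \lVert \mathbf{p}_i - \mathbf{p}_j \rVert^2 / 2)$ where $\lVert \cdot \rVert$ is the 2-norm and $\lambda$ the inverse of the standard deviation. The weighting function we use is the exponential decay kernel $f(\mathbf{p}_i, \mathbf{p}_j) = \exp(-\lambda \lVert \mathbf{p}_i - \mathbf{p}_j \rVert)$, which yields the non-local filtering equation:
$$\mathbf{v}'_i = \sum_{j=1}^{N} \exp(-\lambda \lVert \mathbf{p}_i - \mathbf{p}_j \rVert)\, \mathbf{v}_j \tag{9}$$
Going back to the attention mechanism in Eq. 1, GAT uses a single feedforward layer operating on the concatenated features of a pair of nodes to produce the attention coefficient between the pair. An alternative attention mechanism uses the dot product between the pair of feature vectors (Vaswani et al., 2017). We introduce an attention mechanism based on Euclidean distances:
$$\mathcal{A}_\theta(\mathbf{W}\mathbf{h}_i, \mathbf{W}\mathbf{h}_j) = -\lambda \lVert \mathbf{E}\mathbf{W}\mathbf{h}_i - \mathbf{E}\mathbf{W}\mathbf{h}_j \rVert \tag{10}$$
where $\mathbf{E} \in \mathbb{R}^{D \times \frac{F'}{2}}$ is an embedding matrix that embeds the node’s transformed features into the $D$-dim node similarity space. Combining equations 1, 4, and 5, and using the Euclidean distance form of attention in Eq. 10, the attention-weighted globally aggregated features at node $v_i$ are then:
$$\mathbf{h}^{G}_i = \sigma\left(\frac{\sum_{j=1}^{N} \exp(-\lambda \lVert \mathbf{E}\mathbf{W}\mathbf{h}_i - \mathbf{E}\mathbf{W}\mathbf{h}_j \rVert)\, \mathbf{W}\mathbf{h}_j}{\sum_{j=1}^{N} \exp(-\lambda \lVert \mathbf{E}\mathbf{W}\mathbf{h}_i - \mathbf{E}\mathbf{W}\mathbf{h}_j \rVert)}\right) \tag{11}$$
which can be evaluated using two applications of the non-local filtering operation given by Eq. 9. To see that, identify:
$$\mathbf{p}_i = \mathbf{E}\mathbf{W}\mathbf{h}_i, \qquad \mathbf{v}_i = \mathbf{W}\mathbf{h}_i \tag{12}$$
Equation 11 can then be written as:
$$\mathbf{h}^{G}_i = \sigma\left(\frac{\sum_{j=1}^{N} \exp(-\lambda \lVert \mathbf{p}_i - \mathbf{p}_j \rVert)\, \mathbf{v}_j}{\sum_{j=1}^{N} \exp(-\lambda \lVert \mathbf{p}_i - \mathbf{p}_j \rVert) \cdot 1}\right) \tag{13}$$
Both the numerator and denominator are standard non-local filtering operations (all points/nodes in the denominator have a unity feature). In practice, we implement Eq. 13 using only one non-local filtering step: we append 1 to the projected feature vector of every node ($\mathbf{W}\mathbf{h}_j$) to get an $(\frac{F'}{2}+1)$-dim vector and then execute the non-local filtering step; we take the first $\frac{F'}{2}$ entries in the resulting feature vectors and normalize by the normalizing factor at position $\frac{F'}{2}+1$.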
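The single-pass trick described above can be checked numerically. The sketch below is an illustration only: it assumes the exponential decay kernel exp(-lam * distance), uses an exact quadratic-cost filter in place of any approximate one, and omits the final nonlinearity. The names `nonlocal_filter`, `global_aggregation_via_filter`, `global_aggregation_via_softmax`, `E`, and `lam` are hypothetical.

```python
import numpy as np

def pairwise_distances(p):
    """Euclidean distance between every pair of rows of p, shape (N, N)."""
    diff = p[:, None, :] - p[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def nonlocal_filter(p, v, lam):
    """Exact non-local filter: out_i = sum_j exp(-lam * ||p_i - p_j||) * v_j."""
    return np.exp(-lam * pairwise_distances(p)) @ v

def global_aggregation_via_filter(z, E, lam):
    """Attention-weighted global aggregation using one filtering pass."""
    p = z @ E.T                                               # node embeddings
    z_aug = np.concatenate([z, np.ones((z.shape[0], 1))], axis=1)  # append 1
    out = nonlocal_filter(p, z_aug, lam)
    return out[:, :-1] / out[:, -1:]                          # divide by the ones channel

def global_aggregation_via_softmax(z, E, lam):
    """Reference: explicit softmax over negative scaled embedding distances."""
    p = z @ E.T
    e = -lam * pairwise_distances(p)
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ z
```

For random inputs the two functions agree up to floating-point error, which is the reason a fast approximate filter can stand in for the explicit softmax.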
3.3 Permutohedral lattice filtering
Exactly evaluating the non-local filtering operation in Eq. 13 for all nodes still scales as $O(N^2)$ where $N$ is the number of nodes. However, approximate algorithms such as KD-trees (Adams et al., 2009) and permutohedral lattice filtering (Adams et al., 2010) have a more favorable scaling behavior of $O(N \log N)$ and $O(N)$, respectively. We use permutohedral lattice filtering as our approximate filtering algorithm because it is end-to-end differentiable with respect to both the node embeddings and the node features.
The permutohedral lattice in $D$-dim space is an integer lattice that lives within the $D$-dim hyperplane $\{\mathbf{x} \in \mathbb{R}^{D+1} \mid \mathbf{x} \cdot \mathbf{1} = 0\}$. The lattice tessellates the hyperplane with uniform simplices. There are efficient techniques for finding the enclosing simplex of any point in the hyperplane, as well as for finding the neighboring lattice points of any point in the lattice. The permutohedral lattice in the 2D plane is illustrated in Fig. 1. Approximate filtering using the permutohedral lattice involves three steps:
Splatting: The $D$-dim location vectors $\mathbf{p}_i$ are projected onto the $D$-dim hyperplane of the permutohedral lattice. The initial feature vectors for all lattice points are initialized to zero. The lattice points defining the enclosing simplex of each projected location are found, and for each location $\mathbf{p}_i$, the associated feature vector $\mathbf{v}_i$ is added to the feature vector of each enclosing lattice point after being scaled by its proximity to the lattice point.

Blurring: This step applies the filtering kernel over the lattice points. The filtering kernel decays exponentially with distance. Therefore, we calculate the output feature of a lattice point using only a small neighborhood of lattice points since contributions from more distant points would be small. This is one of the key approximations of the method: a lattice point only aggregates features from its neighborhood (weighted by the filtering kernel), instead of aggregating features from all lattice points.

Slicing: After applying the approximate filtering operation over the lattice points, the output feature vector for point $i$ is evaluated as the sum of the output feature vectors of its enclosing lattice points, weighted by the proximity of each enclosing lattice point to $\mathbf{p}_i$.
These three steps are illustrated in Fig. 1. Figure 1 shows the two parallel pathways used in a PHGCN layer: the structural aggregation pathway uses attention coefficients based on the Euclidean distance between node embeddings to aggregate features from a node’s immediate neighborhood, while the global aggregation pathway uses the attention coefficients to aggregate from all nodes. The $\frac{F'}{2}$-dimensional output feature vectors from each pathway are concatenated to yield the final $F'$-dimensional output feature vector.
Figure 2: (a) Two sample graphs in an inductive node classification task. The graphs have a chain structure where each element in the chain can either be motif_1, motif_2, or a single node (the spacer node). Each node is connected to itself (self edges not shown). Nodes with the same color have the same feature vector. The red nodes are present in motif_1 and motif_2. The goal is to classify each red node in the graph based on whether it is present in the dominant motif, i.e., the motif which occurs most frequently. In the middle graph for example, motif_1 is dominant, hence red nodes in motif_1 have label 1 while red nodes in motif_2 have label 0 (the labels are shown inside the nodes). (b) Accuracy of GAT and PHGCN as a function of training iterations. Mean and standard error bars from 10 trials.
4 Experimental results
4.1 Node classification based on motif counts
We first illustrate the power of the PHGCN layer with its combination of structural and global aggregation on a synthetic inductive node classification task. The task is illustrated in Fig. 2. We generate random graphs formed by a random combination of two motifs. We classify nodes in each motif based on whether the motif they are in is the motif that occurs most frequently in the graph. This is a difficult task as it requires information to flow across the whole graph in order to detect the frequency of occurrence of the different motifs and label the nodes in each motif accordingly.
We train a 3-layer GAT network and a 3-layer PHGCN network. In every training iteration, we sample a new random chain graph composed of 10 elements (each element is either motif_1, motif_2, or a spacer node). We report the mean test accuracy for classifying two randomly selected nodes in 100 randomly sampled graphs. As shown in Fig. 2, GAT completely fails to solve the task and its performance stays at chance level, while PHGCN is consistently able to learn the task.
The failure of GAT is expected as a 3layer network can only aggregate node features from up to 3 hops away in the graph. This is clearly insufficient to judge whether a motif is dominant in the graph. PHGCN on the other hand is able to learn that a node label correlates with the frequency of occurrence of its motif in the graph. This requires the neighborhood aggregation part of PHGCN to detect the motif structure, and the global aggregation part to compare the frequency of occurrence of the two motifs. This synthetic example shows that PHGCNs can solve problems beyond the ability of standard graph convolutional networks using the same number of layers.
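The labeling rule of this synthetic task can be sketched at the level of chain elements. The snippet below is a hypothetical illustration: the internal structure of the motifs is not modeled, and resampling chains in which both motifs occur equally often is an assumption, since tie handling is not specified here.

```python
import random

ELEMENTS = ("motif_1", "motif_2", "spacer")

def sample_chain(n_elements=10, rng=random):
    """Sample a chain of elements and return it with its dominant motif.

    Chains with a tie between the two motifs are resampled (an assumption).
    """
    while True:
        chain = [rng.choice(ELEMENTS) for _ in range(n_elements)]
        count_1, count_2 = chain.count("motif_1"), chain.count("motif_2")
        if count_1 != count_2:
            return chain, ("motif_1" if count_1 > count_2 else "motif_2")

def red_node_labels(chain, dominant):
    """Label shared by the red nodes of each element: 1 if the element is the
    dominant motif, 0 if it is the other motif, None for spacer nodes."""
    return [None if el == "spacer" else int(el == dominant) for el in chain]
```

Because the label of a red node depends on counts taken over the entire chain, no fixed-radius neighborhood around the node is sufficient to determine it.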
4.2 Transductive node classification
Table 1: Properties of the graph datasets.

                        | Cora | Citeseer | Pubmed | Cornell | Texas | Wisconsin | Actor
Number of nodes         | 2708 | 3327     | 19717  | 183     | 183   | 251       | 7600
Number of edges         | 5429 | 4732     | 44338  | 295     | 309   | 499       | 33544
Node feature dimensions | 1433 | 3703     | 500    | 1703    | 1703  | 1703      | 931
Number of classes       | 7    | 6        | 3      | 5       | 5     | 5         | 5
Table 2: Mean and standard deviation of the test accuracy of GCN, GAT, GeomGCNI, GeomGCNP, GeomGCNS, GATEDA, and PHGCN on the Cora, Citeseer, Pubmed, Cornell, Texas, Wisconsin, and Actor graphs.
We test PHGCN on the following transductive node classification graphs:

Citation graphs : Nodes represent papers and edges represent citations. Node features are binary 0/1 vectors indicating the absence or presence of certain key words in the paper. Each node/paper belongs to an academic topic and the goal is to predict the label/topic of the test nodes. We benchmark on three citation graphs: Cora, Citeseer, and Pubmed.

WebKB graphs : The graphs capture the structure of the webpages of computer science departments in various universities. Nodes represent webpages, and edges represent webpage links. As in the citation graphs, node features are bag-of-words binary vectors. Each node is manually classed as ‘course’, ‘faculty’, ‘student’, ‘project’, or ‘staff’. We benchmark on three WebKB graphs: Cornell, Texas, and Wisconsin.

Actor co-occurrence graph : The graph was constructed by crawling Wikipedia articles. Each node represents an actor, and an edge indicates that one actor occurs on another’s Wikipedia page. Node features are bag-of-words binary vectors. Nodes are classified into five categories based on the topic of the actor’s Wikipedia page. This graph is a subgraph of the film-director-actor-writer graph of Tang et al. (2009).
Table 1 summarizes the properties of the graph datasets we use.
We compare the performance of PHGCN against two baselines that only use structural aggregation: GCN (Kipf & Welling, 2016) and GAT (Veličković et al., 2017). In addition, we also test against GeomGCN (Pei et al., 2020), which uses a combination of structural aggregation and aggregation based on node proximity in an embedding space. GeomGCN uses three different algorithms to create the node embeddings: Isomap, Poincaré embeddings, and struc2vec, which leads to three different GeomGCN variants: GeomGCNI, GeomGCNP, and GeomGCNS, respectively. The node embeddings used by GeomGCN are fixed and depend only on the graph structure. Node embeddings in PHGCN, however, are dynamic and depend on the node features. Most importantly, PHGCN node embeddings are directly trained using the loss function of the task at hand.
PHGCN has two novel aspects: an attention mechanism based on Euclidean distances, and an efficient scheme for attention-based global aggregation that is built on top of the new attention mechanism. We separately evaluate the performance of the new attention mechanism by replacing the attention mechanism used in GAT by our Euclidean distance attention mechanism. We term the resulting networks GATEDA.
For each node classification task, we randomly split the nodes of each class into a 60%-20%-20% split for training, testing, and validation respectively. During training, we monitor the validation loss and report test accuracy for the model with the smallest validation loss. We repeat all experiments 10 times for 10 different random splits and report the mean and standard deviation of the test accuracy. The hyperparameters we tune are the number of attention heads, dropout probabilities, and the fixed learning rate. We tune these hyperparameters to obtain the best validation loss.
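A per-class random split of this kind can be sketched as follows. This is a minimal illustration rather than the exact procedure used in the experiments; the function name, the rounding rule, and the `seed` argument are assumptions.

```python
import numpy as np

def per_class_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split node indices into three parts, class by class.

    The first two parts receive the rounded fractions of each class;
    the third part receives whatever remains of that class.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    parts = ([], [], [])
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))  # shuffle within class
        n_first = int(round(fractions[0] * len(idx)))
        n_second = int(round(fractions[1] * len(idx)))
        parts[0].extend(idx[:n_first])
        parts[1].extend(idx[n_first:n_first + n_second])
        parts[2].extend(idx[n_first + n_second:])
    return tuple(np.sort(np.array(part, dtype=int)) for part in parts)
```

Calling the function with 10 different values of `seed` gives 10 different random splits. Splitting within each class keeps the class proportions roughly equal across the three parts, which matters for the small WebKB graphs.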
For PHGCN and GATEDA, we always use an embedding dimension of 4. In the blurring step in the permutohedral lattice used in PHGCN, a lattice point aggregates features from neighboring lattice points up to three hops away in all directions, i.e., we use a lattice filter of width 7. The distance scaling coefficient $\lambda$ (Eq. 10), which scales distances in the embedding space, was set to a larger value for global aggregation than for structural aggregation. The larger $\lambda$ causes the attention coefficients to decay much faster with distance when doing global aggregation. This is needed to make global attention more selective as it has a much larger number of attention targets (all nodes in the graph) compared to attention in the structural aggregation step (which only considers the node’s neighbors). All the networks we test have two layers. We use the ADAM optimizer (Kingma & Ba, 2014) for all experiments.
The accuracy results are summarized in Table 2. For Cora and Citeseer, the performance of PHGCN is not significantly different from GCN and GAT. There is a small but significant improvement of PHGCN over GCN and GAT on the Pubmed dataset. The lack of consistent major improvement on the three citation graphs can be attributed to their high assortativity (Pei et al., 2020), where nodes belonging to the same class tend to connect together. In highly assortative graphs, a node does not need long-range aggregation in order to accumulate features from similar nodes, as similar nodes are close together in the graph anyway. Hence structural aggregation alone should perform well. Similar observations have been made regarding GeomGCN (Pei et al., 2020), which fails to show consistent improvements on the citation graphs. For the WebKB graphs and the actor co-occurrence graph, the performance advantage of PHGCN over GCN and GAT is large and significant. For the WebKB graphs, the accuracy standard deviation can be large since the graphs are small, and differences in the training/validation/testing splits significantly alter the learning problem.
Across all graphs, PHGCN either outperforms all GeomGCN variants or outperforms two out of the three variants. One of the issues with GeomGCN is that it has to pick an embedding algorithm for generating the fixed node embeddings before the start of training, and there is no clear way to select the embedding algorithm that would perform best on the task. As shown in Table 2, different GeomGCN embeddings perform best on different tasks, and the accuracy difference between the different embeddings can be large. PHGCN completely sidesteps this issue by learning the embeddings while learning the task which consistently leads to high performance.
Except for the Cora and Citeseer graphs, where the performance of PHGCN and GATEDA is not significantly different, PHGCN performs significantly better than GATEDA. Both share the same novel Euclidean distance attention mechanism. This shows that the performance advantage of PHGCN cannot all be attributed to the new attention mechanism we use, but that global attention-based aggregation plays a crucial role.
4.3 PHGCN embeddings
In this section, we investigate the properties of the node embeddings learned by PHGCN. Each PHGCN layer can use multiple attention heads. Each attention head learns its own node embeddings, and the Euclidean distances between these node embeddings translate to attention coefficients between the nodes. To visualize these embeddings, we use a 2D embedding space and train a 2-layer PHGCN on the Wisconsin and Cornell graphs. We use two attention heads in the first PHGCN layer. Figure 3 shows the embeddings learned by the two heads in both tasks, colored according to their class labels. It is clear that PHGCN learns embeddings where nodes belonging to the same class are more clustered together. PHGCN learns embeddings that minimize the classification loss; by placing nodes of the same class closer together in the embedding space, nodes belonging to one class will predominantly aggregate information from each other and not from nodes belonging to other classes. Nodes belonging to the same class will thus have similar output feature vectors (since they all aggregate information from the same embedding neighborhood). In other words, by learning embeddings that cluster same-class nodes together, PHGCN reduces the within-class variance of the node features. This simplifies the classification task for the second (output) layer.
The embeddings learned by PHGCN are thus directly tied to the task at hand. In Fig. 3, we plot the relation between the distance between a pair of nodes in the graph (minimum number of edges that need to be traversed to get from one node to another), and the distance between their learned embeddings. It is clear that no correlation exists between the two. For all node pairs that are a given number of edges apart in the graph, there is very large variability in their separation distance in embedding space, as evidenced by the large standard deviation bars. This shows that the learned embeddings do not represent the graph structure. Instead, the embeddings represent the aggregation neighborhoods needed to perform well on the task.
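The comparison between graph distance and embedding distance can be reproduced with a breadth-first search per node. The following is a minimal sketch under stated assumptions: the graph is given as undirected adjacency lists, the embeddings as a NumPy array with one row per node, and both function names are hypothetical.

```python
from collections import defaultdict, deque

import numpy as np

def hop_distances(adj_list, source):
    """Breadth-first search: minimum number of edges from source to each
    reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj_list[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def embedding_distance_by_hops(adj_list, emb):
    """Group node pairs by hop distance and return, for each hop count, the
    mean and standard deviation of their Euclidean embedding distance."""
    buckets = defaultdict(list)
    for i in range(len(adj_list)):
        for j, hops in hop_distances(adj_list, i).items():
            if j > i:  # count each unordered pair once
                buckets[hops].append(np.linalg.norm(emb[i] - emb[j]))
    return {h: (float(np.mean(d)), float(np.std(d))) for h, d in sorted(buckets.items())}
```

A large standard deviation within each hop count, together with a flat mean across hop counts, would correspond to the absence of correlation described above.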
5 Discussion
Attention-based feature aggregation is a powerful mechanism that finds use in deep learning models ranging from natural language processing (NLP) models (Vaswani et al., 2017) to graph neural networks (Veličković et al., 2017) to generative models (Zhang et al., 2018). It enables a model to capture relevant long-range information within a large context. Attention mechanisms, however, intrinsically have an unfavorable quadratic scaling behavior with the number of attention targets (words in an NLP setting, or nodes in a graph setting). This has traditionally been addressed by limiting the range of the attention mechanism (using small word contexts in NLP settings, or using limited node neighborhoods in graph settings).
Using approximate Gaussian filtering methods to implement global attention makes it possible to use larger attention contexts beyond what has been possible using exact attention. Our approximate formulation of global attention is directly applicable to a range of models beyond graph networks. One example is transformer models (Vaswani et al., 2017). Recently, an approximate global attention mechanism based on locality-sensitive hashing has been used in the transformer architecture (Kitaev et al., 2020) to break the quadratic scaling behavior. Unlike our approximate filtering approach, however, it has a non-differentiable step (the hashing step), and it scales as $O(N \log N)$ while our approach scales as $O(N)$ where $N$ is the number of attention targets.
Learning informative node embeddings in graphs is a longstanding problem (Goyal & Ferrara, 2018). The majority of prior work, however, first chooses a similarity measure, and then finds node embeddings that put similar nodes closer together. The similarity measure could for example be adjacency in the graph (Belkin & Niyogi, 2002) or the co-occurrence frequency of two nodes in random walks over the graph (Perozzi et al., 2014). What the right similarity measure is, however, depends on the task at hand. In PHGCN, we dispense with similarity-based node embedding approaches, and directly optimize the embeddings based on the task loss. Nodes are close together in the embedding space if the task loss improves by having them aggregate information from each other. Our approach naturally simplifies the learning problem on the graph as we do not need a separate embedding generation step.
References
 Adams et al. (2009) Adams, A., Gelfand, N., Dolson, J., and Levoy, M. Gaussian kd-trees for fast high-dimensional filtering. In ACM SIGGRAPH 2009 papers, pp. 1–12. 2009.
 Adams et al. (2010) Adams, A., Baek, J., and Davis, M. Fast high-dimensional filtering using the permutohedral lattice. In Computer Graphics Forum, volume 29, pp. 753–762. Wiley Online Library, 2010.
 Belkin & Niyogi (2002) Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pp. 585–591, 2002.
 Buades et al. (2004) Buades, A., Coll, B., and Morel, J. On image denoising methods. CMLA Preprint, 5, 2004.
 Buades et al. (2005) Buades, A., Coll, B., and Morel, J. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pp. 60–65. IEEE, 2005.
 Chen et al. (2019) Chen, D., Lin, Y., Li, W., Li, P., Zhou, J., and Sun, X. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. arXiv preprint arXiv:1909.03211, 2019.
 Defferrard et al. (2016) Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852, 2016.
 Gao & Ji (2019) Gao, H. and Ji, S. Graph U-Nets. In Proceedings of The 36th International Conference on Machine Learning, 2019.
 Goyal & Ferrara (2018) Goyal, P. and Ferrara, E. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
 Jampani et al. (2016) Jampani, V., Kiefel, M., and Gehler, P. Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4452–4461, 2016.
 Joutard et al. (2019) Joutard, S., Dorent, R., Isaac, A., Ourselin, S., Vercauteren, T., and Modat, M. Permutohedral attention module for efficient non-local neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 393–401. Springer, 2019.
 Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kipf & Welling (2016) Kipf, T. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Kitaev et al. (2020) Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
 Li et al. (2018) Li, Q., Han, Z., and Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Pei et al. (2020) Pei, H., Wei, B., Chang, K. C.-C., Lei, Y., and Yang, B. Geomgcn: Geometric graph convolutional networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1e2agrFvS.
 Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710, 2014.
 Su et al. (2018) Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, M.-H., and Kautz, J. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2530–2539, 2018.
 Tang et al. (2009) Tang, J., Sun, J., Wang, C., and Yang, Z. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 807–816, 2009.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
 Veličković et al. (2017) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
 Wang et al. (2018) Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803, 2018.
 Wang et al. (2019) Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., and Bronstein, M. M. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5), 2019.
 Wannenwetsch et al. (2019) Wannenwetsch, A., Kiefel, M., Gehler, P., and Roth, S. Learning task-specific generalized convolutions in the permutohedral lattice. In German Conference on Pattern Recognition, pp. 345–359. Springer, 2019.
 Xu et al. (2018) Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018.
 Ying et al. (2018) Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Advances in neural information processing systems, pp. 4800–4810, 2018.
 You et al. (2019) You, J., Ying, R., and Leskovec, J. Position-aware graph neural networks. In Proceedings of The 36th International Conference on Machine Learning, 2019.
 Zhang et al. (2018) Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.