Spectral Graph Attention Network

Variants of Graph Neural Networks (GNNs) for representation learning have been proposed recently and have achieved fruitful results in various fields. Among them, Graph Attention Networks (GATs) first employ a self-attention strategy to learn attention weights for each edge in the spatial domain. However, learning attentions over edges only attends to the local information of graphs and greatly increases the number of parameters. In this paper, we first introduce attention in the spectral domain of graphs. Accordingly, we present Spectral Graph Attention Network (SpGAT), which learns representations for different frequency components with weighted filters and graph wavelet bases. In this way, SpGAT can better capture global patterns of graphs efficiently, with far fewer learned parameters than GAT. We thoroughly evaluate the performance of SpGAT on the semi-supervised node classification task and verify the effectiveness of the learned attentions in the spectral domain.




1 Introduction

Figure 1: Motivation: separating the low- and high-frequency signals in both images and graphs contributes to feature learning. Panels: (a) original image; (b) low-frequency component (e.g., background); (c) high-frequency component (e.g., outlines); (d) original graph; (e) low-frequency reconstruction; (f) high-frequency reconstruction. For the graph part, an unweighted barbell graph is reconstructed keeping only the lower half of the frequency components in (e) and only the upper half in (f). Edge colors represent the reconstructed edge weights, as measured by the color bars in (e) and (f).
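The graph half of this decomposition can be reproduced in a few lines. Below is a minimal NumPy sketch (not the authors' code; `barbell_adjacency` is an illustrative helper) that splits the normalized Laplacian of a barbell graph into its low- and high-frequency halves, as in panels (e) and (f):

```python
import numpy as np

def barbell_adjacency(m=5):
    # Two m-cliques joined by a single bridge edge.
    n = 2 * m
    A = np.zeros((n, n))
    A[:m, :m] = 1.0
    A[m:, m:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[m - 1, m] = A[m, m - 1] = 1.0
    return A

A = barbell_adjacency()
d = A.sum(axis=1)
L = np.diag(d ** -0.5) @ (np.diag(d) - A) @ np.diag(d ** -0.5)
lam, U = np.linalg.eigh(L)          # eigenvalues in ascending order
k = len(lam) // 2
L_low = U[:, :k] @ np.diag(lam[:k]) @ U[:, :k].T    # low-frequency half
L_high = U[:, k:] @ np.diag(lam[k:]) @ U[:, k:].T   # high-frequency half
```

By construction the two halves sum back to the full Laplacian; plotting the entries of `L_low` and `L_high` as edge weights reproduces the qualitative pattern in Figure 1(e)-(f).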

Graph Neural Networks (GNNs) Wu et al. (2019c) aim to carry the expressive capability of deep neural networks from grid-like data (e.g., images and sequences) over to graph structures. The fruitful progress of GNNs in the past decade has made them a crucial class of tools for a variety of applications, from social networks Li et al. (2019), computer vision Kampffmeyer et al. (2018); Zeng et al. (2019) and text classification Yao et al. (2019), to chemistry Liao et al. (2019).

Graph Attention Network (GAT) Veličković et al. (2018), as one central type of GNN, introduces the attention mechanism to further refine the convolution process of generic GCNs Kipf and Welling (2017). Specifically, during node aggregation, GAT assigns a self-attention weight to each edge, which captures the local similarity among neighborhoods and further boosts the expressive power of GNNs, since the weight itself is learnable. Many variants have been proposed since GAT Morris et al. (2019); Wang et al. (2019); Wu et al. (2019b).

GAT, along with its variants, considers attention in a straightforward way: learning edge attentions in the spatial domain. In this sense, the attention can capture the local structure of graphs, i.e., the information from neighbors. However, it is unable to explicitly encode the global structure of graphs. Furthermore, computing the attention weights for every edge in the graph is inefficient, especially for large graphs.

In computer vision, a natural image can be decomposed into a low-spatial-frequency component containing the smoothly changing structure, e.g., background, and a high-spatial-frequency component describing the rapidly changing fine details, e.g., outlines. Figures 1(a)-1(c) depict an example of low- and high-frequency components of a panda image. Obviously, the contributions of different frequency components vary across downstream tasks. To accommodate this phenomenon, Chen et al. (2019b) proposed Octave Convolution (OctConv) to factorize convolutional feature maps into two groups of different spatial frequencies and process them with different convolutions at their corresponding frequencies.

In graph representation learning, this decomposition into low and high frequencies arises even more naturally, since graph signal processing (GSP) provides a way to directly divide the low- and high-frequency components based on the ascending ordered eigenvalues of the graph Laplacian. The eigenvectors associated with small eigenvalues carry smoothly varying signals, encouraging neighboring nodes to share similar values. In contrast, the eigenvectors associated with large eigenvalues carry sharply varying signals across edges Donnat et al. (2018). As demonstrated in Figures 1(d)-1(f), a barbell graph tends to retain the information inside the clusters when it is reconstructed with only low-frequency components, but retains the information between the clusters when reconstructed with only high-frequency ones. As pointed out by Wu et al. (2019a); Maehara (2019), the low- and high-frequency components in the spectral domain may reflect the local and global structural information of graphs in the spatial domain, respectively. Moreover, recent works Donnat et al. (2018); Maehara (2019) reveal that the low- and high-frequency components of graphs contribute with different importance to the learning of modern GNNs.
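The smoothness claim above can be checked numerically: for a unit eigenvector $u_i$ of the Laplacian, the Dirichlet energy $u_i^\top L u_i$, which sums squared signal differences across edges, equals the eigenvalue $\lambda_i$. A small illustrative sketch (not from the paper):

```python
import numpy as np

# Dirichlet energy u^T L u of an eigenvector u equals its eigenvalue, so
# small-eigenvalue eigenvectors vary smoothly across edges while
# large-eigenvalue eigenvectors vary sharply.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A          # unnormalized graph Laplacian
lam, U = np.linalg.eigh(L)              # ascending eigenvalues
energies = np.array([U[:, i] @ L @ U[:, i] for i in range(len(lam))])
```

For a connected graph, the smallest energy is exactly zero (the constant eigenvector), the smoothest possible signal.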

Inspired by these recent works, we propose to extend the attention mechanism to the spectral domain of graphs to explicitly encode the structural information of graphs from a global perspective. Accordingly, we present Spectral Graph Attention Network (SpGAT). In SpGAT, we choose graph wavelets as the spectral bases and decompose them into low- and high-frequency components with respect to their indices. Then we construct two distinct convolutional kernels according to the low- and high-frequency components and apply the attention mechanism on these two kernels to capture their importance, respectively. Finally, the pooling function, as well as the activation function, is applied to produce the output. Figure 2 provides an overview of the design of SpGAT. Furthermore, we employ a Chebyshev polynomial approximation to compute the spectral wavelets of graphs and propose the variant SpGAT-Cheby, which is more efficient on large graphs. We thoroughly validate the performance of SpGAT and SpGAT-Cheby on five challenging benchmarks against eleven competitive baselines; SpGAT and SpGAT-Cheby achieve state-of-the-art results on most datasets. The contributions of this paper are summarized as follows:


  • To better exploit the local and global structural information of graphs, we propose to extend attention to the spectral domain rather than the spatial domain and design SpGAT. To the best of our knowledge, SpGAT is the first attempt to apply the attention mechanism in the spectral domain of graphs.

  • Compared with the traditional GAT, which needs to compute attentions for each edge, SpGAT only employs the attention operation on the low- and high-frequency components in the spectral domain. We show that SpGAT has nearly the same parameter complexity as GCN.

  • To accelerate the computation of the spectral wavelets, we adopt a Chebyshev polynomial approximation that reduces the computational complexity and achieves up to 7.9x acceleration on benchmark datasets.

  • Extensive experiments show the superiority of SpGAT and demonstrate the rationale behind attention in the spectral domain.

Figure 2: The overview of SpGAT.

2 Preliminary

We denote an undirected graph as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges, with $|\mathcal{V}| = n$. The adjacency matrix is defined as a symmetric matrix $A \in \mathbb{R}^{n \times n}$, where $A_{ij} = 1$ indicates an edge $(v_i, v_j)$. We denote the node degree matrix as $D$, a diagonal matrix where $D_{ii} = \sum_j A_{ij}$ represents the degree of node $v_i$.

The original version of GCN is developed by Kipf and Welling (2017). The feed-forward GCN layer is defined as:

$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)}\big), \qquad (1)$$

where $H^{(l)}$ is the matrix of hidden vectors output by the $l$-th layer, with $H^{(0)} = X$ as the input features. $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ refers to the normalized adjacency matrix with self-loops, $\tilde{A} = A + I$, where $\tilde{D}$ is the corresponding degree matrix of $\tilde{A}$. $\sigma(\cdot)$ refers to the activation function, such as ReLU. $W^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$ refers to the learnable parameters of the $l$-th layer, where $d_l$ and $d_{l+1}$ refer to the feature dimensions of the input and output, respectively.
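The GCN propagation rule amounts to one normalized matrix product per layer. A minimal NumPy sketch (`gcn_layer` is an illustrative name, not the authors' implementation; the self-loop renormalization follows Kipf and Welling (2017)):

```python
import numpy as np

def gcn_layer(A, H, W):
    # H' = ReLU( D~^{-1/2} (A + I) D~^{-1/2} H W )
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * np.outer(d_inv_sqrt, d_inv_sqrt)
    return np.maximum(A_hat @ H @ W, 0.0)         # ReLU activation
```

In practice `A_hat` is precomputed once as a sparse matrix and reused across layers.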

From the spatial perspective, GCN can be viewed as feature aggregation among the neighbors of each node in the spatial domain of graphs. Therefore, we rewrite Eq. (1) in a more general form:

$$h_i^{(l+1)} = \sigma\Big(\Gamma_{j \in \mathcal{N}_i}\big(\alpha_{ij}\, h_j^{(l)} W^{(l)}\big)\Big), \qquad (2)$$

where $\mathcal{N}_i$ refers to the neighborhood set of node $v_i$ in the graph (usually, we include $v_i$ in $\mathcal{N}_i$); $\alpha_{ij}$ refers to the aggregation weight of neighbor $v_j$ for node $v_i$; and $\Gamma$ refers to the aggregation function over the neighbors' outputs, such as $\mathrm{sum}$ and $\mathrm{mean}$. GCN can be viewed as the special case of Eq. (2) where $\alpha_{ij} = \hat{A}_{ij}$ and $\Gamma = \mathrm{sum}$.

Based on Eq. (2), Veličković et al. (2018) introduce the attention mechanism on graphs and propose the Graph Attention Network (GAT). Concretely, instead of employing the (normalized) adjacency matrix as the aggregation weight, GAT computes the weight by a self-attention strategy, namely:

$$\alpha_{ij} = \operatorname{softmax}_j\Big(\mathrm{LeakyReLU}\big(\mathbf{a}^\top \big[W h_i \,\|\, W h_j\big]\big)\Big), \qquad (3)$$

where $\mathbf{a}$ refers to the learnable attention weight vector and $\|$ denotes concatenation. On one hand, compared with GCN, it is expensive for GAT to compute the attention weight for every edge in the spatial domain, since $K$-head GAT maintains $K$ sets of transformation and attention parameters while GCN maintains a single weight matrix. On the other hand, the self-attention strategy in GAT only considers the local structural information of graphs, i.e., the neighborhoods; it ignores the global structural information.
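For concreteness, a dense, unoptimized sketch of the attention computation in Eq. (3) (illustrative only, not the authors' code; real implementations score only existing edges and vectorize the loops):

```python
import numpy as np

def gat_attention(H, W, a, A):
    # alpha_ij = softmax_j( LeakyReLU( a^T [W h_i || W h_j] ) ),
    # taken over j in N(i), with self-loops included.
    Z = H @ W
    n = Z.shape[0]
    e = np.full((n, n), -np.inf)            # -inf masks non-edges
    for i in range(n):
        for j in range(n):
            if A[i, j] or i == j:
                s = float(a @ np.concatenate([Z[i], Z[j]]))
                e[i, j] = s if s > 0 else 0.2 * s   # LeakyReLU
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    return alpha / alpha.sum(axis=1, keepdims=True)
```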

3 Spectral Graph Attention Network

Beyond neighbor aggregation in the spatial domain, following Kipf and Welling (2017); Jin et al. (2019); Chang et al. (2020), GCN can also be understood through graph signal processing in the spectral domain:

$$y = U g_\theta U^\top x, \qquad (4)$$

where $x \in \mathbb{R}^n$ is a signal on the nodes, $U$ contains the spectral bases extracted from the graph, and $g_\theta$ is a diagonal filter parameterized by $\theta$. Given Eq. (4), GCN can be viewed as a spectral graph convolution based on the graph Fourier transform with a first-order Chebyshev polynomial approximation Kipf and Welling (2017). Further, we can separate the spectral graph convolution into two stages Xu et al. (2019):

$$\text{feature transformation:}\quad X' = H^{(l)} W^{(l)}, \qquad \text{graph convolution:}\quad H^{(l+1)} = \sigma\big(U F U^\top X'\big). \qquad (5)$$

In Eq. (5), $F$ is the diagonal matrix of the graph convolution kernel. For instance, the graph convolution kernel of GCN is $F = I - \Lambda$, where $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ contains the eigenvalues of the normalized graph Laplacian in ascending order, while the spectral bases $U$ for GCN are the corresponding eigenvectors.

3.1 The Construction of the SpGAT Layer

In this section, we describe the construction of the SpGAT layer. From the graph signal processing perspective, the diagonal entries of $F$ can be treated as frequencies on the graph. We denote the diagonal values with small / large indices as the low / high frequencies, respectively; the corresponding spectral bases in $U$ are the low- and high-frequency components. As discussed in Section 1, the low- and high-frequency components carry different structural information about graphs. In this vein, we first split the spectral bases into two groups and rewrite Eq. (5) as follows:

$$H^{(l+1)} = \sigma\Big(\Gamma\big(U_L F_L U_L^\top X',\; U_H F_H U_H^\top X'\big)\Big), \qquad (6)$$

where $U_L = U_{:, 1:k}$ and $U_H = U_{:, k+1:n}$ are the low- and high-frequency components, respectively. Here $\rho = k / n$ is a hyper-parameter that decides the splitting boundary between low and high frequencies. When $\Gamma = \mathrm{sum}$, Eq. (6) is equivalent to the graph convolution stage in Eq. (5).

In Eq. (6), $F_L$ and $F_H$ can be viewed as the importance of the low and high frequencies. Therefore, we introduce learnable attention weights by exploiting a re-parameterization trick:

$$F_L = \alpha_L I, \qquad F_H = \alpha_H I. \qquad (7)$$

In Eq. (7), $F_L$ and $F_H$ are the two diagonal matrices parameterized by the two learnable attention weights $\alpha_L$ and $\alpha_H$, respectively. To ensure $\alpha_L$ and $\alpha_H$ are positive and comparable, we normalize them with the softmax function.

Theoretically, there are many other approaches to re-parameterize $F_L$ and $F_H$, such as self-attention w.r.t. the spectral bases. However, such re-parameterizations cannot reflect the nature of the low- and high-frequency components. Moreover, they may introduce many additional learnable parameters, especially for large graphs, which can prohibit efficient training given the limited amount of labeled data, such as in the graph-based semi-supervised learning setting. Meanwhile, we validate that our re-parameterization, although simple, is efficient and effective in practice.
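Putting Eqs. (6) and (7) together, a minimal sketch of the spectral attention step might look as follows (illustrative names, not the authors' code; `raw_alpha` holds the two unnormalized attention scalars, and sum aggregation is assumed):

```python
import numpy as np

def spgat_conv(U, X_prime, rho, raw_alpha):
    # Split the spectral bases at index k = rho * n into low/high groups,
    # apply one softmax-normalized scalar attention per group (Eq. (7)),
    # and sum the two filtered branches (Eq. (6) with sum aggregation).
    n = U.shape[1]
    k = max(1, int(rho * n))
    B_L, B_H = U[:, :k], U[:, k:]
    exp = np.exp(raw_alpha - np.max(raw_alpha))
    a_L, a_H = exp / exp.sum()              # positive, comparable weights
    Y_L = a_L * (B_L @ (B_L.T @ X_prime))   # low-frequency branch
    Y_H = a_H * (B_H @ (B_H.T @ X_prime))   # high-frequency branch
    return Y_L + Y_H
```

With equal attention scores the two branches reconstruct exactly half of the input signal each, since the split bases together span the whole space.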

3.2 Choice of Spectral Bases

Another important issue in SpGAT is the choice of the spectral bases. While Fourier bases have been the common choice for constructing spectral graph convolutions, recent works Donnat et al. (2018); Xu et al. (2019) observed advantages of spectral wavelets over traditional Fourier bases in graph embedding techniques. Instead of Fourier bases, we therefore choose graph wavelets as the spectral bases in SpGAT.

Formally, a spectral graph wavelet $\psi_i$ is the signal resulting from the modulation, in the spectral domain, of a signal centered around the associated node $v_i$ Hammond et al. (2011); Shuman et al. (2013). Given the graph $\mathcal{G}$, the graph wavelet transform employs a set of wavelets $\psi_s = (\psi_{s,1}, \ldots, \psi_{s,n})$ as bases. Concretely, the spectral graph wavelets are given by:

$$\psi_s = U G_s U^\top, \qquad (8)$$

where $U$ contains the eigenvectors of the normalized graph Laplacian $L$, and $G_s = \operatorname{diag}\big(e^{-s\lambda_1}, \ldots, e^{-s\lambda_n}\big)$ is a scaling matrix with the heat kernel scaled by a hyperparameter $s$. The inverse of the graph wavelets is obtained by simply replacing each $e^{-s\lambda_i}$ in $G_s$ with the $e^{s\lambda_i}$ corresponding to the inverse heat kernel Donnat et al. (2018). Smaller indices in the graph wavelets correspond to low-frequency components, and vice versa.
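A direct, eigen-decomposition-based construction of the wavelet bases and their inverse from Eq. (8) can be sketched as follows (illustrative, not the authors' code):

```python
import numpy as np

def graph_wavelets(L, s):
    # psi_s = U diag(e^{-s lambda}) U^T; the inverse simply replaces
    # e^{-s lambda} with e^{+s lambda}, as described above.
    lam, U = np.linalg.eigh(L)
    psi = U @ np.diag(np.exp(-s * lam)) @ U.T
    psi_inv = U @ np.diag(np.exp(s * lam)) @ U.T
    return psi, psi_inv
```

By construction `psi @ psi_inv` recovers the identity, so the transform is exactly invertible.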

The benefits of spectral graph wavelet bases over Fourier bases mainly fall into three aspects. 1. Since real-world networks are usually sparse, graph wavelet bases are typically much sparser than Fourier bases (see the density statistics reported by Xu et al. (2019)); this sparseness makes them more computationally efficient to use. 2. In spectral graph wavelets, the signal resulting from the heat kernel filter is typically localized both on the graph and in the spectral domain Shuman et al. (2013). By adjusting the scaling parameter $s$, one can easily constrain the range of the localized neighborhood: smaller values of $s$ generally correspond to smaller neighborhoods. 3. Since the eigenvalue information is implicitly contained in the wavelets by construction, we suffer no information loss in the re-parameterization.

Therefore, the SpGAT layer with graph wavelet bases can be written as:

$$X' = H^{(l)} W^{(l)}, \qquad H^{(l+1)} = \sigma\Big(\Gamma\big(\psi_{s,L}\, F_L\, \psi_{s,L}^{-1} X',\; \psi_{s,H}\, F_H\, \psi_{s,H}^{-1} X'\big)\Big). \qquad (9)$$

In Eq. (9), aiming to further reduce the parameter complexity, we share the parameters of the feature transformation stage between the low- and high-frequency branches, i.e., a single $W^{(l)}$. In this way, the parameter complexity becomes nearly the same as that of GCN and much less than that of GAT with $K$-head attention. Compared with GAT, which captures the local structure of graphs in the spatial domain, the proposed SpGAT can better handle global information by explicitly combining the low- and high-frequency features in the spectral domain.

4 Fast Approximation of Spectral Wavelets via Chebyshev Polynomials

In SpGAT, directly computing the transformation according to Eq. (8) is expensive for large graphs, since diagonalizing the Laplacian commonly requires $O(n^3)$ computation. Fortunately, we can employ Chebyshev polynomials to quickly approximate the spectral graph wavelets without eigen-decomposition Hammond et al. (2011).

Theorem 1.

Let $s$ be the fixed scaling parameter of the heat kernel filter and $M$ be the degree of the Chebyshev polynomial approximation for the scaled wavelet (a larger $M$ yields a more accurate approximation but a higher computational cost). The graph wavelet is given by

$$\psi_s \approx e^{-s}\Big(I_0(s)\, I + 2\sum_{k=1}^{M} (-1)^k I_k(s)\, T_k(\tilde{L})\Big),$$

where $\tilde{L} = L - I$ is the normalized graph Laplacian shifted so that its spectrum lies in $[-1, 1]$, $T_k(\cdot)$ is the $k$-th order Chebyshev polynomial, and $I_k(\cdot)$ is the modified Bessel function of the first kind.

Theorem 1 can be derived from Section 6 of Hammond et al. (2011). To further accelerate the computation, we build a look-up table for the Bessel function to avoid additional integral operations. With this Chebyshev polynomial approximation, the computational cost of the spectral graph wavelets decreases to $O(M \times |\mathcal{E}|)$. Since real-world graphs are usually sparse, this difference can be very significant. We denote SpGAT with the Chebyshev polynomial approximation as SpGAT-Cheby.
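A sketch of this fast approximation follows, interpreting the Bessel terms in Theorem 1 as the modified Bessel functions $I_k$ of the first kind that arise from the Chebyshev expansion of the heat kernel (an assumption on our part; `wavelet_chebyshev` is an illustrative name and SciPy's `iv` supplies $I_k$):

```python
import numpy as np
from scipy.special import iv   # modified Bessel function of the first kind

def wavelet_chebyshev(L_norm, s, M=20):
    # Approximates psi_s = expm(-s * L_norm) for a normalized Laplacian
    # whose spectrum lies in [0, 2], without eigen-decomposition, using
    # e^{-s(x+1)} = e^{-s} [I_0(s) + 2 sum_k (-1)^k I_k(s) T_k(x)]
    # with x = lambda - 1 in [-1, 1].
    n = L_norm.shape[0]
    Lt = L_norm - np.eye(n)                 # shift spectrum to [-1, 1]
    T_prev, T_curr = np.eye(n), Lt          # T_0(Lt), T_1(Lt)
    acc = iv(0, s) * T_prev - 2.0 * iv(1, s) * T_curr
    for k in range(2, M + 1):
        T_next = 2.0 * Lt @ T_curr - T_prev  # Chebyshev recurrence
        acc += 2.0 * ((-1.0) ** k) * iv(k, s) * T_next
        T_prev, T_curr = T_curr, T_next
    return np.exp(-s) * acc
```

For sparse Laplacians each recurrence step is a sparse matrix-vector (or matrix-matrix) product, which is where the $O(M \times |\mathcal{E}|)$ cost per signal comes from.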

5 Related Works

Spectral convolutional networks on graphs. Existing methods for defining a convolutional operation on graphs can be broadly divided into two categories: spectral-based and spatial-based methods Zhang et al. (2018). We focus on spectral graph convolutions in this paper. Spectral CNN Bruna et al. (2014) first attempts to generalize CNNs to graphs based on the spectrum of the graph Laplacian and defines the convolutional kernel in the spectral domain. Boscaini et al. (2015) further employs the windowed Fourier transform to define a local spectral CNN approach. ChebyNet Defferrard et al. (2016) introduces a fast localized convolutional filter on graphs via Chebyshev polynomial approximation. Vanilla GCN Kipf and Welling (2017) further extends spectral graph convolutions to networks of significantly larger scale through several simplifications. Khasanova and Frossard (2017) learns graph-based features on images that are inherently invariant to isometric transformations. CayleyNets Levie et al. (2018) alternatively introduces Cayley polynomials that allow efficient computation of spectral filters on graphs. FastGCN Chen et al. (2018) and ASGCN Huang et al. (2018) further accelerate the training of vanilla GCN via sampling approaches. The Lanczos algorithm is utilized in LanczosNet Liao et al. (2019) to construct low-rank approximations of the graph Laplacian for convolution. SGC Wu et al. (2019a) further reduces the complexity of vanilla GCN by successively removing the non-linearities and collapsing weights between consecutive layers. Despite their effective performance, all these convolution-theorem-based methods lack a strategy to explicitly treat low- and high-frequency components with different importance.

Spectral graph wavelets. Theoretically, the lifting scheme was proposed for constructing wavelets that can be adapted to irregular graphs in Sweldens (1998). Hammond et al. (2011) defines wavelet transforms appropriate for graphs and describes a fast algorithm for their computation via Chebyshev polynomial approximation. On the application side, Tremblay and Borgnat (2014) utilizes graph wavelets for multi-scale community mining, obtaining a local view of the graph from each node. Donnat et al. (2018) introduces the property of graph wavelets that describes information diffusion and accordingly learns structural node embeddings. Xu et al. (2019) first attempts to construct graph neural networks with graph wavelets. These works emphasize the locality and sparsity properties of graph wavelets for graph signal processing, both theoretically and practically.

Space/spectrum-aware feature representation. In computer vision, Chen et al. (2019b) first defines space-aware feature representations based on scale-space theory and reduces the spatial redundancy of vanilla CNN models by proposing the Octave Convolution (OctConv) model. Durall et al. (2019) further leverages octave convolutions for stabilizing GANs. To our knowledge, this is the first time spectrum-aware feature representations have been considered in the irregular graph domain and established within graph convolutional neural networks.

6 Experiments

Dataset Nodes Edges Classes Features Label rate
Citeseer 3,327 4,732 6 3,703 3.6%
Cora 2,708 5,429 7 1,433 5.2%
Pubmed 19,717 44,338 3 500 0.3%
Coauthor CS 18,333 81,894 15 6,805 1.6%
Amazon Photo 7,487 119,043 8 745 2.1%
Table 1: The overview of dataset statistics.

6.1 Datasets

Following the practice of previous works, we focus on five node classification benchmark datasets under the semi-supervised setting, covering different graph sizes and feature types. (1) Three citation networks: Citeseer, Cora and Pubmed Sen et al. (2008), where the task is to classify the research topic of each paper. (2) A coauthor network: Coauthor CS, from the KDD Cup 2016 challenge (https://kddcup2016.azurewebsites.net), where the task is to predict the most active field of study for each author. (3) A co-purchase network: Amazon Photo McAuley et al. (2015), where the task is to predict the category of products from Amazon. For the citation networks, we follow the public split provided by Yang et al. (2016), i.e., 20 labeled nodes per class in each dataset for training and 500 / 1,000 labeled samples for validation / test, respectively. For the other two datasets, we follow the splits of Shchur et al. (2018); Chen et al. (2019a). A statistical overview of all datasets is given in Table 1. Label rate denotes the ratio of labeled nodes used during training.

Model Citeseer Cora Pubmed Coauthor CS Amazon Photo
Perozzi et al. (2014)
Yang et al. (2016)
Li et al. (2016)
Defferrard et al. (2016)
Kipf and Welling (2017)
Hamilton et al. (2017b)
Verma et al. (2018)
Veličković et al. (2018)
Bai et al. (2019)
Xu et al. (2019)
Bianchi et al. (2019)
Morris et al. (2019)
Table 2: Experimental results (in percent) on semi-supervised node classification.

6.2 Baselines

We thoroughly evaluate the performance of SpGAT against 11 representative baselines. Among them, DeepWalk Perozzi et al. (2014) and Planetoid Yang et al. (2016) are traditional graph embedding methods. ChebyNet Defferrard et al. (2016), GCN Kipf and Welling (2017), GWNN Xu et al. (2019) and ARMA Bianchi et al. (2019) are spectral-based GNNs. GGNN Li et al. (2016), GraphSAGE Hamilton et al. (2017a), GAT Veličković et al. (2018), Verma et al. (2018), Bai et al. (2019) and Morris et al. (2019) are spatial-based GNNs. In addition, we also implement the variant of SpGAT with the Chebyshev polynomial approximation, denoted SpGAT-Cheby.

6.3 Experimental setup

For all experiments, we construct a 2-layer network of our model using TensorFlow Abadi et al. (2015) with 64 hidden units. We train the model with the Adam optimizer Kingma and Ba (2014) using a fixed initial learning rate, and apply early stopping with a window size of 100; most training runs stop within 200 steps, as expected. We initialize the weight matrices following Glorot and Bengio (2010), employ L2 regularization on the weights, and apply dropout to the input and hidden layers to prevent overfitting Srivastava et al. (2014).

For constructing the wavelets, we follow the per-dataset suggestions for the scaling parameter $s$ from Xu et al. (2019) on Citeseer, Cora, Pubmed, Coauthor CS and Amazon Photo. In addition, we employ grid search to determine the best proportion $\rho$ of low-frequency components in SpGAT; the impact of this parameter is discussed in Section 6.5.1. Unless otherwise specified, we use the concat aggregation function in SpGAT.

Figure 3: The performance of SpGAT w.r.t. the proportion of low-frequency components $\rho$. The best fraction is marked with a red vertical line.

6.4 Performance on Semi-supervised Node Classification

Table 2 summarizes the results on all datasets. For all baselines, we reuse the results reported in the literature Veličković et al. (2018); Chen et al. (2019a). From Table 2, we have the following findings. (1) Clearly, the attention-based GNNs (GAT, SpGAT and SpGAT-Cheby) achieve the best performance across all datasets, validating that the attention mechanism can capture important patterns from either the spatial or the spectral perspective. (2) Specifically, SpGAT achieves the best performance on four datasets; on Pubmed in particular, the best accuracy of 80.5% achieved by SpGAT-Cheby improves on the previous best (79.0%), which is a remarkable boost given how challenging this benchmark is. (3) Meanwhile, compared with the baselines, SpGAT-Cheby also achieves better performance on three datasets and even attains the best performance on two of them. (4) Compared with sum aggregation, concat aggregation seems the better choice for SpGAT, possibly because concat aggregation preserves the significant signals learned by SpGAT. (5) It is worth noting that to achieve these results, both SpGAT and SpGAT-Cheby only employ attention on the low- and high-frequency filters of graphs in the spectral domain, while GAT needs to learn attention weights for every edge in the spatial domain. This verifies that SpGAT is more efficient than GAT, since the spectral domain contains the meaningful signals and can capture the global information of graphs.

6.5 Ablation Studies

6.5.1 The impact of low-frequency components proportion

To evaluate the impact of the hyperparameter $\rho$, we fix the other hyperparameters and vary $\rho$ linearly, running SpGAT on Citeseer, Cora and Pubmed, respectively. Figure 3 depicts the mean (bold line) and variance (shaded area) for each $\rho$ on the three datasets. As shown in Figure 3, the mean-value curves of the three datasets exhibit a similar pattern: the best performance is achieved when $\rho$ is small. The best proportions of low-frequency components for Citeseer, Cora and Pubmed are all small fractions. In other words, consistently, only a small fraction of components needs to be treated as low-frequency in SpGAT.

6.5.2 The ablation study on low- and high-frequency Components

To further elaborate the importance of the low- and high-frequency components in SpGAT, we conduct an ablation study by testing SpGAT with only low- or only high-frequency components, w.r.t. the best proportions found in Section 6.5.1. Specifically, we manually set $F_L$ or $F_H$ to 0 during the testing stage to observe how the learned low- and high-frequency components affect the classification accuracy. As shown in Table 3, both low- and high-frequency components are essential for the model. Meanwhile, we find that with a very small proportion (5%-15%) of low-frequency components, SpGAT can achieve results comparable to the full model. This indicates that the low-frequency components contain more of the information that contributes to the feature representations learned by the model.

6.5.3 The learned attention on low- and high-frequency components

To investigate the results in Section 6.5.2, we further show the learned attentions of SpGAT w.r.t. the best proportion for Citeseer, Cora and Pubmed in Table 4. Interestingly, despite the small proportion, the attention weight learned for the low-frequency components is consistently much larger than that of the high-frequency components in each layer. Hence, SpGAT successfully captures the importance of the low- and high-frequency components of graphs in the spectral domain. Moreover, as pointed out by Donnat et al. (2018); Maehara (2019), the low-frequency components of graphs usually carry smoothly varying signals that reflect locality in graphs. This implies that local structural information is important for these datasets, which may also explain why GAT performs well on them.

Figure 4: The t-SNE visualization of SpGAT compared with other baselines on Citeseer (panels (a)-(d)) and Pubmed (panels (e)-(h)). Each color corresponds to a different class that the embeddings belong to.
Methods Citeseer Cora Pubmed
SpGAT with low-frequency only 57.70 66.80 76.70
SpGAT with high-frequency only 70.90 82.40 80.40
SpGAT (full) 71.60 83.50 80.50
Table 3: The results of the ablation study on low- and high-frequency components.

6.6 Time Efficiency of Chebyshev Polynomials Approximation

As discussed in Section 4, we propose a fast approximation of the spectral wavelets via Chebyshev polynomials. To elaborate its efficiency, we compare the time cost of calculating the wavelets via eigen-decomposition (SpGAT) and via the fast approximation (SpGAT-Cheby). We report the mean time cost of SpGAT and SpGAT-Cheby with second-order Chebyshev polynomials over 10 runs on Cora, Citeseer and Pubmed, respectively. As shown in Table 5, the fast approximation greatly accelerates the training process; in particular, SpGAT-Cheby runs up to 7.9x faster than SpGAT. This validates the efficiency of the proposed fast approximation.

Dataset Citeseer Cora Pubmed
Attention weight $\alpha_L$ / $\alpha_H$ $\alpha_L$ / $\alpha_H$ $\alpha_L$ / $\alpha_H$
Learned value (first layer) 0.838 / 0.162 0.722 / 0.278 0.860 / 0.140
Learned value (second layer) 0.935 / 0.065 0.929 / 0.071 0.928 / 0.072
Table 4: Learned attention weights $\alpha_L$ and $\alpha_H$ of SpGAT for the low and high frequencies w.r.t. the best proportion of low-frequency components.

6.7 t-SNE Visualization of Learned Embeddings

To qualitatively evaluate the effectiveness of the features learned by SpGAT, we depict the t-SNE visualization Maaten and Hinton (2008) of the learned embeddings on Citeseer and Pubmed in Figure 4, compared with several baselines. The SpGAT representations exhibit discernible clustering in the projected 2D space. In Figure 4, the color indicates the class label in the datasets. Compared with the other methods, the intersections between different classes are more clearly separated for SpGAT, which verifies its discriminative power across classes.

Dataset Eigen-decomposition (SpGAT) Fast approximation (SpGAT-Cheby)
Citeseer 11.23 5.19 (~2.2x)
Cora 5.79 2.78 (~2.1x)
Pubmed 1185.12 150.79 (~7.9x)
Table 5: Running time (in seconds) for obtaining the spectral wavelets with SpGAT and SpGAT-Cheby on Citeseer, Cora and Pubmed.

7 Conclusion

In this paper, we propose SpGAT, a novel spectral-based graph convolutional neural network that learns graph representations with respect to different frequency components in the spectral domain. By introducing distinct trainable attention weights for the low- and high-frequency components, SpGAT can effectively capture both local and global information in graphs and enhance the performance of GNNs. Furthermore, a fast Chebyshev polynomial approximation is proposed to accelerate the spectral wavelet calculation. To the best of our knowledge, this is the first attempt to apply the attention mechanism in the spectral domain of graphs. We expect SpGAT to shed light on building more efficient architectures for learning with graphs.


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Cited by: §6.3.
  • S. Bai, F. Zhang, and P. H. S. Torr (2019) Hypergraph convolution and hypergraph attention.. arXiv preprint arXiv:1901.08150. External Links: Link Cited by: §6.2, Table 2.
  • F. M. Bianchi, D. Grattarola, L. Livi, and C. Alippi (2019) Graph neural networks with convolutional arma filters.. arXiv preprint arXiv:1901.01343. External Links: Link Cited by: §6.2, Table 2.
  • D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst (2015) Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. In Computer Graphics Forum, Vol. 34, pp. 13–23. Cited by: §5.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2014) Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR 2014). Cited by: §5.
  • H. Chang, Y. Rong, T. Xu, W. Huang, H. Zhang, P. Cui, W. Zhu, and J. Huang (2020) A restricted black-box adversarial framework towards attacking graph embedding models. In AAAI 2020: The Thirty-Fourth AAAI Conference on Artificial Intelligence. Cited by: §3.
  • D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun (2019a) Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. arXiv preprint arXiv:1909.03211. Cited by: §6.1, §6.4.
  • J. Chen, T. Ma, and C. Xiao (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. In ICLR 2018 : International Conference on Learning Representations 2018, External Links: Link Cited by: §5.
  • Y. Chen, H. Fan, B. Xu, Z. Yan, Y. Kalantidis, M. Rohrbach, S. Yan, and J. Feng (2019b) Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution.. arXiv preprint arXiv:1904.05049. Cited by: §1, §5.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. neural information processing systems, pp. 3844–3852. Cited by: §5, §6.2, Table 2.
  • C. Donnat, M. Zitnik, D. Hallac, and J. Leskovec (2018) Learning structural node embeddings via diffusion wavelets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1320–1329. Cited by: §1, §3.2, §3.2, §5, §6.5.3.
  • R. Durall, F. Pfreundt, and J. Keuper (2019) Stabilizing gans with octave convolutions. arXiv preprint arXiv:1905.12534. Cited by: §5.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §6.3.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017a) Representation learning on graphs: methods and applications.. IEEE Data(base) Engineering Bulletin 40, pp. 52–74. Cited by: §6.2.
  • W. L. Hamilton, Z. Ying, and J. Leskovec (2017b) Inductive representation learning on large graphs. neural information processing systems, pp. 1024–1034. Cited by: Table 2.
  • D. K. Hammond, P. Vandergheynst, and R. Gribonval (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129–150. Cited by: §3.2, §4, §4, §5.
  • W. Huang, T. Zhang, Y. Rong, and J. Huang (2018) Adaptive sampling towards fast graph representation learning. In Advances in Neural Information Processing Systems, pp. 4563–4572. Cited by: §5.
  • M. Jin, H. Chang, W. Zhu, and S. Sojoudi (2019) Power up! robust graph convolutional network against evasion attacks based on graph powering. arXiv preprint arXiv:1905.10029. Cited by: §3.
  • M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang, and E. P. Xing (2018)

    Rethinking knowledge graph propagation for zero-shot learning

    arXiv preprint arXiv:1805.11724. External Links: Link Cited by: §1.
  • R. Khasanova and P. Frossard (2017) Graph-based isometry invariant representation learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1847–1856. Cited by: §5.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.3.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. international conference on learning representations. Cited by: §1, §2, §3, §5, §6.2, Table 2.
  • R. Levie, F. Monti, X. Bresson, and M. M. Bronstein (2018) Cayleynets: graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing 67 (1), pp. 97–109. Cited by: §5.
  • J. Li, Y. Rong, H. Cheng, H. Meng, W. Huang, and J. Huang (2019) Semi-supervised graph classification: A hierarchical graph perspective. In The World Wide Web Conference (WWW), San Francisco, CA, USA, pp. 972–982. Cited by: §1.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel (2016) Gated graph sequence neural networks.. In ICLR (Poster), External Links: Link Cited by: §6.2, Table 2.
  • R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel (2019) LanczosNet: multi-scale deep graph convolutional networks. In ICLR 2019 : 7th International Conference on Learning Representations, Cited by: §1, §5.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §6.7.
  • T. Maehara (2019) Revisiting graph neural networks: all we have is low-pass filters. arXiv preprint arXiv:1905.09550. Cited by: §1, §6.5.3.
  • J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel (2015) Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52. Cited by: §6.1.
  • C. Morris, M. Ritzert, M. Fey, W. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2019) Weisfeiler and leman go neural: higher-order graph neural networks. AAAI 2019 : Thirty-Third AAAI Conference on Artificial Intelligence 33 (1), pp. 4602–4609. External Links: Link Cited by: §1, §6.2, Table 2.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §6.2, Table 2.
  • P. Sen, G. M. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad (2008) Collective classification in network data. Ai Magazine 29 (3), pp. 93–106. Cited by: §6.1.
  • O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018) Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868. Cited by: §6.1.
  • D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst (2013)

    The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains

    IEEE Signal Processing Magazine 30 (3), pp. 83–98. Cited by: §3.2, §3.2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §6.3.
  • W. Sweldens (1998) The lifting scheme: a construction of second generation wavelets. SIAM journal on mathematical analysis 29 (2), pp. 511–546. Cited by: §5.
  • N. Tremblay and P. Borgnat (2014) Graph wavelets for multiscale community mining. IEEE Transactions on Signal Processing 62 (20), pp. 5227–5239. Cited by: §5.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In ICLR 2018 : International Conference on Learning Representations 2018, Cited by: §1, §2, §6.2, §6.4, Table 2.
  • N. Verma, E. Boyer, and J. Verbeek (2018) FeaStNet: feature-steered graph convolutions for 3d shape analysis. In

    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

    pp. 2598–2606. External Links: Link Cited by: §6.2, Table 2.
  • X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu (2019) Heterogeneous graph attention network. In The World Wide Web Conference, pp. 2022–2032. Cited by: §1.
  • F. Wu, A. H. Souza, T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger (2019a) Simplifying graph convolutional networks. In ICML 2019 : Thirty-sixth International Conference on Machine Learning, pp. 6861–6871. Cited by: §1, §5.
  • Q. Wu, H. Zhang, X. Gao, P. He, P. Weng, H. Gao, and G. Chen (2019b) Dual graph attention networks for deep latent representation of multifaceted social effects in recommender systems. In The World Wide Web Conference, pp. 2091–2102. Cited by: §1.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019c) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §1.
  • B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng (2019) Graph wavelet neural network. international conference on learning representations. Cited by: §3.2, §3.2, §3, §5, §6.2, §6.3, Table 2.
  • Z. Yang, W. W. Cohen, and R. Salakhutdinov (2016) Revisiting semi-supervised learning with graph embeddings. In ICML 2016, pp. 40–48. Cited by: §6.1, §6.2, Table 2.
  • L. Yao, C. Mao, and Y. Luo (2019) Graph convolutional networks for text classification. AAAI 2019 : Thirty-Third AAAI Conference on Artificial Intelligence 33, pp. 7370–7377. External Links: Link Cited by: §1.
  • R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan (2019) Graph convolutional networks for temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7094–7103. Cited by: §1.
  • Z. Zhang, P. Cui, and W. Zhu (2018) Deep learning on graphs: a survey. arXiv preprint arXiv:1812.04202. Cited by: §5.