SoGCN: Second-Order Graph Convolutional Networks

10/14/2021 · by Peihao Wang, et al.

Graph Convolutional Networks (GCN) with multi-hop aggregation is more expressive than one-hop GCN but suffers from higher model complexity. Finding the shortest aggregation range that achieves comparable expressiveness and minimizes this side effect remains an open question. We answer this question by showing that multi-layer second-order graph convolution (SoGC) is sufficient to attain the ability of expressing polynomial spectral filters with arbitrary coefficients. Compared to models with one-hop aggregation, multi-hop propagation, and jump connections, SoGC possesses filter representational completeness while being lightweight, efficient, and easy to implement. Thereby, we suggest that SoGC is a simple design capable of forming the basic building block of GCNs, playing the same role as 3 × 3 kernels in CNNs. We build our Second-Order Graph Convolutional Networks (SoGCN) with SoGC and design a synthetic dataset to verify its filter fitting capability to validate these points. For real-world tasks, we present the state-of-the-art performance of SoGCN on the benchmark of node classification, graph classification, and graph regression datasets.







1 Introduction

Graph Convolutional Networks (GCNs) have gained popularity in recent years. Researchers have shown that non-localized multi-hop GCNs (Liao et al., 2019; Luan et al., 2019; Abu-El-Haija et al., 2019) perform better than localized one-hop GCNs (Defferrard et al., 2016; Kipf and Welling, 2017; Wu et al., 2019). In Convolutional Neural Networks (CNNs), however, the expressiveness of localized kernels has been demonstrated in image recognition both experimentally (Simonyan and Zisserman, 2014) and theoretically (Zhou, 2020). These contradictory observations motivate us to search for a maximally localized Graph Convolution (GC) kernel with guaranteed feature expressiveness.

Kernel Expressiveness Localized Complexity
Vanilla GCN Very Low
GIN Medium
Multi-hop Full
SoGCN (Ours) Full
Table 1: Comparison of different GC kernels in terms of expressiveness, localization, and time complexity. In this table, n denotes the number of neighbors around a graph node, d the dimension of input features, and K the aggregation length of multi-hop GCs. We compute the time complexity with respect to the method of Abu-El-Haija et al. (2019).

Most existing GCN layers adopt localized graph convolution based on a one-hop aggregation scheme as the basic building block (Kipf and Welling, 2017; Hamilton et al., 2017; Xu et al., 2019). The effectiveness of these one-hop models rests on the intuition that a richer class of convolutional functions can be recovered by stacking multiple one-hop layers. However, extensive works (Li et al., 2018; Oono and Suzuki, 2019; Cai and Wang, 2020) have shown performance limitations of such designs, which indicates this hypothesis may not hold. Liao et al. (2019); Luan et al. (2019); Abu-El-Haija et al. (2019) observed that multi-hop aggregation performed in each layer can lead to significant improvements in prediction accuracy. However, a longer-range aggregator introduces extra hyperparameters and higher computational cost. This design also contradicts the compositionality principle of deep learning, namely that neural networks benefit from deep connections and localized kernels (LeCun et al., 2015).

Recent studies point out that one-hop GCNs suffer from filter incompleteness (Hoang and Maehara, 2019). By relating low-pass filtering on the graph spectrum to over-smoothing, one can show that one-hop filtering leads to performance limitations. The natural remedy of adding a more complex graph kernel, such as multi-hop connections, seems to work in practical settings (Abu-El-Haija et al., 2019). The question is: “what is the simplest graph kernel with the full expressive power of graph convolution?” We show that with a second-order graph kernel of “two-hop” connections, we can approximate arbitrarily complicated graph relationships. Intuitively, this means we should extract a contrast between graph nodes that are almost-connected (via a shared neighbor) and those that are directly connected.

We construct the two-hop graph kernel with second-order polynomials in the adjacency matrix and call it the Second-Order GC (SoGC). We show that SoGC is the “sweet spot” balancing localization and fitting capability. To justify this conclusion, we introduce a Layer Spanning Space (LSS) framework to quantify the filter representation power of multi-layer GCs. Our LSS works by mapping the composition of GC filters with arbitrary coefficients to polynomial multiplication (Section 3.1).

Under this LSS framework, we can show that SoGCs can approximate any linear GCNs in channel-wise filtering (Theorem 1). Vanilla GCN and GIN (first-order polynomials in adjacency matrix) cannot represent all polynomial filters in general; multi-hop GCs (higher-order polynomials in adjacency matrix) do not contribute more expressiveness (Section 3.2). In this sense, SoGC is the most localized GC kernel with the full representation power.

To validate our theory, we build our Second-Order Graph Convolutional Networks (SoGCN) by layering up SoGC layers (Section 3.3). We reproduce our theoretical results on a synthetic dataset for filter fitting power testing (Section 4.1). On the public benchmark datasets (Dwivedi et al., 2020), SoGCN, using simple graph topological features, consistently boosts the performance of our baseline model (i.e., vanilla GCN) to a level comparable to the state-of-the-art GNN models (with more complex attention/gating mechanisms). We also verify that our SoGCN fits extensive real-world tasks, including network node classification, super-pixel graph classification, and molecule graph regression (Section 4.3).

To the best of our knowledge, this work is the first study that identifies the distinctive competence of the two-hop neighborhood in the context of expressing a polynomial filter with arbitrary coefficients. Our SoGC is a special but non-trivial case of polynomially approximated graph filters (Defferrard et al., 2016). Kipf and Welling (2017) conducted an ablation study with GC kernels of different orders but missed the effectiveness of second-order relationships. The work of Abu-El-Haija et al. (2019) considered multi-hop graph kernels; however, they did not identify the critical importance of the two-hop form. In contrast, we clarify the prominence of SoGCs both theoretically and experimentally.

Our research on GCNs using purely topological relationships is orthogonal to work using geometric relations (Monti et al., 2017; Fey et al., 2018; Pei et al., 2020), expressive edge features (Li et al., 2016; Gilmer et al., 2017), or hyper-edges (Morris et al., 2019; Maron et al., 2018, 2019). It is also independent of graph sampling procedures (Rong et al., 2019; Hamilton et al., 2017).

(a) Vanilla GC
(b) Our SoGC
(c) Multi-Hop GC
Figure 1: Vertex-domain interpretations of vanilla GC, SoGC, and multi-hop GC. Denote by Ã the first-hop aggregator, Ã² the second-hop aggregator, and Ã^K the K-th hop aggregator. Nodes in the same colored ring share the same weights. (a) Vanilla GC only aggregates information from the first-hop neighbor nodes. (b) SoGC incorporates additional information from the second-hop (almost-connected) neighborhood. (c) Multi-hop GC simply repeats mixing information from every neighborhood within K hops.

2 Related Work

Spectral GCNs.

Graph convolution is defined as element-wise multiplication on the graph spectrum (Hammond et al., 2011). Bruna et al. (2014) first proposed a spectral GCN with respect to this definition. ChebyNet (Defferrard et al., 2016) approximates graph filters using Chebyshev polynomials. Vanilla GCN (Kipf and Welling, 2017; Wu et al., 2019) further reduces the GC layer to a degree-one polynomial by lumping the first-order and constant terms. GIN (Xu et al., 2019) disentangles the effect of self-connections and pairwise neighboring connections by adding a separate mapping for central nodes. However, these simplifications cause performance limitations (Oono and Suzuki, 2019; Cai and Wang, 2020). APPNP (Klicpera et al., 2019) uses Personalized PageRank to derive a fixed polynomial filter. Bianchi et al. (2019) proposes a multi-branch GCN architecture to simulate ARMA graph filters. GCNII (Ming Chen et al., 2020) incorporates identity mapping and initial mapping to relieve the over-smoothing problem and deepen GCNs. However, these models are not easy to implement and introduce additional hyper-parameters.

Multi-Hop GCNs.

To exploit multi-hop information, Liao et al. (2019) propose using the Lanczos algorithm to construct low-rank approximations of the graph Laplacian for graph convolution. Luan et al. (2019) devise two architectures, Snowball GCN and Truncated Krylov GCN, to capture neighborhoods at various distances. To simulate neighborhood delta functions, Abu-El-Haija et al. (2019) repeatedly mix multi-hop features to identify more topological information. JKNet (Xu et al., 2018) combines all feature activations of previous layers to learn adaptive, structure-aware representations of different graph substructures. These models exhibit the strength of multi-hop GCNs over one-hop GCNs while leaving the propagation length as a hyper-parameter. Meanwhile, long-range aggregation in each layer incurs higher complexity (Table 1).

Expressiveness of GCNs.

Most works on GCN expressiveness are restricted to the over-smoothing problem: Li et al. (2018) first posed the over-smoothing problem; Hoang and Maehara (2019) indicate GCNs are no more than low-pass filters; Luan et al. (2019); Oono and Suzuki (2019) demonstrate the asymptotic convergence of feature activations to a subspace; Cai and Wang (2020) examine the decreasing Dirichlet energy. These analytic frameworks do not provide constructive improvements for building more powerful GCNs. Ming Chen et al. (2020) first proposed to assess a GCN's overall expressiveness via its ability to express polynomial filters with arbitrary coefficients. But their theory is only applicable to transductive learning on a single graph, and does not upper bound the degree of the graph polynomial filters.

3 Second-Order Graph Convolution

We begin by introducing our notation. We are interested in learning on a finite graph set 𝒢. Assume each graph G ∈ 𝒢 is simple and undirected, associated with a finite vertex set V, an edge set E, and a symmetric normalized adjacency matrix Ã (Chung and Graham, 1997; Shi and Malik, 2000). Without loss of generality and for simplicity, |V| = N for every G ∈ 𝒢. We denote a single-channel signal supported on graph G as x ∈ R^N, a vectorization of a function on V.

Graph Convolution (GC) is defined as the class of Linear Shift-Invariant (LSI) operators with respect to the adjacency matrix (Sandryhaila and Moura, 2013). This property enables GC to extract features regardless of where the local structure falls. A single-channel GC can be written as a mapping g: R^N → R^N. According to Defferrard et al. (2016), a GC can be approximated by a polynomial in the adjacency matrix (Figure 1(c); we can replace the Laplacian matrix in Defferrard et al. (2016) with the normalized adjacency matrix since L̃ = I − Ã):

g_θ(Ã) x = ∑_{k=0}^{K} θ_k Ã^k x,    (1)
where θ = (θ_0, …, θ_K) represents the kernel weights. Equation 1 indicates that graph convolution can be interpreted as a linear combination of features aggregated by powers of Ã. Thereby, the hyperparameter K reflects the localization of a GC kernel.
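To make the polynomial-filter form concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code; `normalized_adjacency` and `poly_filter` are hypothetical helper names) that applies such a filter by repeated one-hop propagation instead of forming dense matrix powers:

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalize a binary adjacency matrix: D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return A * np.outer(d_inv_sqrt, d_inv_sqrt)

def poly_filter(A_norm, x, theta):
    """Apply g_theta(A) x = sum_k theta_k A^k x.
    Each hop is one propagation A @ h, so no explicit matrix power is built."""
    out = theta[0] * x
    h = x
    for t in theta[1:]:
        h = A_norm @ h          # one more hop of aggregation
        out = out + t * h
    return out
```

The loop touches each edge once per hop, which is why the aggregation length K directly controls both locality and cost.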

Previous work of Kipf and Welling (2017) simplified Equation 1 to a one-hop kernel, the so-called vanilla GC (Figure 1(a)). We formulate vanilla GC as below:

g_θ(Ã) x = θ (I + Ã) x.    (2)
We are interested in the overall representation power of graph convolutional networks in expressing a polynomial filter (cf. Equation 1) with arbitrary degrees and coefficients. At first glance, one-hop GC (cf. Equation 2) might approximate any high-order GC kernel by stacking multiple layers. However, that is not the case (see formal arguments in Section 3.2). In contrast, we will show that plugging a second-order term into vanilla GC attains this approximation ability (see formal arguments in Theorem 1). We name this improved design Second-Order Graph Convolution (SoGC), as it can be written as a second-order polynomial in the adjacency matrix:

g_θ(Ã) x = (θ_0 I + θ_1 Ã + θ_2 Ã²) x.    (3)
We illustrate its vertex-domain interpretation in Figure 1(b). The critical insight is that graph filter approximation can be viewed as a polynomial factorization problem. It is known that any univariate polynomial over the reals can be factorized into sub-polynomials of degree at most two. Based on this fact, we show that stacking enough SoGCs (and varying their parameters) achieves a decomposition of any polynomial filter.
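As a sanity check on the factorization view, the following NumPy sketch (ours, with hypothetical helper names) verifies that composing two SoGC kernels acts exactly like multiplying their coefficient polynomials: the composed filter's coefficients are the convolution of the two coefficient vectors.

```python
import numpy as np

def sogc(S, x, th):
    """Single-channel SoGC (the Equation 3 shape): (th0*I + th1*S + th2*S^2) x,
    where S is the graph shift operator (e.g., a normalized adjacency)."""
    Sx = S @ x
    return th[0] * x + th[1] * Sx + th[2] * (S @ Sx)

def apply_poly(S, x, coeffs):
    """Apply sum_k coeffs[k] * S^k x (coefficients ordered low-to-high)."""
    out, h = coeffs[0] * x, x
    for c in coeffs[1:]:
        h = S @ h
        out = out + c * h
    return out
```

Because polynomial multiplication is coefficient convolution, `sogc(S, sogc(S, x, a), b)` matches `apply_poly(S, x, np.convolve(a, b))`, which is the vertex-domain face of the ring-isomorphism argument in Section 3.1.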

Figure 2: Visualizing output activations in the graph spectral domain for vanilla GCN, SoGCN, and their GRU variants. The test is conducted on a graph from the ZINC dataset. The spectrum is defined as a projection of activations onto the graph eigenvectors. SoGCN preserves the higher-order spectrum, while vanilla GCN shows over-smoothing. See Appendix F for more visualizations on the ZINC dataset.

3.1 Representation Power Justification

To be more precise, we introduce our Layer Spanning Space (LSS) framework in this subsection and mathematically prove that an arbitrary GC kernel can be decomposed into finitely many SoGC kernels.

First, to characterize the overall graph filter space and the filters expressed by a single layer, we define a graph filter space in which every polynomial filter has degree no more than K:

Definition 1.

(Graph filter space) Suppose the parameter space is Θ = R^{K+1}. The Linear Shift-Invariant (LSI) graph filter space of degree K with respect to a finite graph set 𝒢 is defined as F_K = {g_θ : θ ∈ Θ}, where g_θ follows the definition in Equation 1. (See Appendix A for a remark on how F_K relates to the term “Linear Shift-Invariant”.)

We further provide Definition 2 and Lemma 1 to discuss the upper limit of the dimension of F_K.

Definition 2.

(Spectrum capacity) Let the spectrum set Λ be the union of the eigenvalues of the adjacency matrices of all graphs in 𝒢. The spectrum capacity M is the cardinality of Λ, i.e., the number of distinct graph eigenvalues. In particular, M attains its maximum if the graph adjacency matrices share no common eigenvalues other than the trivial ones.

Lemma 1.

The filter space F_K of degree K has dimension min(K + 1, M) as a vector space, where M is the spectrum capacity.

Lemma 1 follows from Theorem 3 of Sandryhaila and Moura (2013). See the complete proof in Appendix C.

According to Lemma 1, one can define an ambient filter space of degree M − 1 (i.e., F_{M−1}). Suppose a GCN consists of K-hop convolutional kernels g^{(1)}, …, g^{(L)} ∈ F_K, where the superscript indicates the layer number and L denotes the network depth. We can consider each layer to be sampled from F_K. We intend to justify this GCN's filter representation power via its Layer Spanning Space (LSS):

S(F_K) = {g^{(L)} ∘ ⋯ ∘ g^{(1)} : g^{(l)} ∈ F_K, l = 1, …, L},    (4)
where the whole LSS is constructed by varying the parameters of each g^{(l)} over Θ. When S(F_K) covers the entire ambient space F_{M−1}, we say the GCN composed of GC kernels in F_K has full filter representation power.

As F_K is a function space, S(F_K) can be analytically tricky. To investigate this space, we define a mapping ρ: F_K → P_K, where P_K denotes the polynomial vector space of degree at most K:

ρ(g_θ) = θ_0 + θ_1 z + ⋯ + θ_K z^K.    (5)
We provide Lemma 2 to reveal a useful property of ρ:

Lemma 2.

ρ is a ring isomorphism when K = M − 1.

The proof can be found in Appendix D.

This ring isomorphism signifies that the composition of filters in F_K is identical to polynomial multiplication. Therefore, one can study the LSS through the polynomials that can be factorized into the corresponding sub-polynomials. For example, Equation 4 is identical to:

ρ(S(F_K)) = {p^{(L)} ⋯ p^{(1)} : p^{(l)} ∈ P_K, l = 1, …, L}.    (6)
In the rest of this subsection, we will show that multi-layer SoGCs (i.e., kernels in F_2) can attain full representation power. That is, S(F_2) covers the whole ambient filter space when L is as large as ⌈(M − 1)/2⌉. We summarize a formal argument in the following theorem:

Theorem 1.

For any g ∈ F_{M−1}, there exist kernels g^{(1)}, …, g^{(L)} ∈ F_2 with coefficients θ^{(1)}, …, θ^{(L)} such that g = g^{(L)} ∘ ⋯ ∘ g^{(1)}, where L = ⌈(M − 1)/2⌉.


Proving Theorem 1 requires a fundamental polynomial factorization theorem, rephrased as below:

Lemma 3.

(Fundamental theorem of algebra) Over the field of reals, the degree of an irreducible non-trivial univariate polynomial is either one or two.

For any g ∈ F_{M−1}, apply ρ (cf. Equation 5) to map the kernel g to a polynomial p. By Lemma 1, deg p ≤ M − 1. By Lemma 3, factorize p into a series of polynomials of degree at most two, and then merge first-order polynomials into second-order ones until a single or no first-order sub-polynomial remains. As a consequence, p can be written as p = p^{(L)} ⋯ p^{(1)}, where L = ⌈(M − 1)/2⌉. If M − 1 is even, deg p^{(l)} = 2 for every l. Otherwise, all terms have degree two except for at most one whose degree is one.

The last step is to apply the inverse of the morphism ρ, formulated as below:

g^{(l)} = ρ^{−1}(p^{(l)}),    l = 1, …, L.

Since ρ^{−1} is also a ring isomorphism, we have:

g = ρ^{−1}(p) = ρ^{−1}(p^{(L)} ⋯ p^{(1)}) = g^{(L)} ∘ ⋯ ∘ g^{(1)},

where g^{(l)} ∈ F_2 by definition, which implies the claimed decomposition. ∎

Theorem 1 can be regarded as a universal approximation theorem for linear GCNs. Although nonlinear activation is not considered within our theoretical framework, we make the reasonable hypothesis that achieving linear filter completeness can also boost GCNs with nonlinearity (Ming Chen et al., 2020). See our experiments in Section 4.3.

Theorem 1 implies that multi-layer SoGC can implement arbitrary filtering effects and extract features at any positions on the spectrum (Figure 2). Theorem 1 also coincides with Dehmamy et al. (2019) on how GCNs built on SoGC kernels could utilize depth to raise fitting accuracy (Figure 3).
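The constructive content of Theorem 1 can be checked numerically. The sketch below (our illustration; `quadratic_factors` is a hypothetical helper, and numerical root finding stands in for the exact algebraic argument) splits a filter's coefficient polynomial into real factors of degree at most two and verifies that their product recovers the original coefficients.

```python
import numpy as np
from functools import reduce

def quadratic_factors(coeffs):
    """Split a real polynomial (coefficients low-to-high) into real factors of
    degree <= 2: each conjugate root pair yields a quadratic, real roots are
    paired greedily, leaving at most one linear factor (cf. Lemma 3)."""
    c = np.asarray(coeffs, dtype=float)
    lead = c[-1]
    roots = np.roots(c[::-1])            # np.roots expects high-to-low order
    factors, reals = [], []
    used = np.zeros(len(roots), dtype=bool)
    for i, r in enumerate(roots):
        if used[i]:
            continue
        if abs(r.imag) > 1e-9:           # conjugate pair -> real quadratic
            j = next(k for k in range(len(roots))
                     if not used[k] and k != i
                     and abs(roots[k] - r.conjugate()) < 1e-6)
            used[i] = used[j] = True
            # (z - r)(z - conj(r)) = |r|^2 - 2 Re(r) z + z^2, low-to-high
            factors.append(np.array([(r * r.conjugate()).real, -2 * r.real, 1.0]))
        else:
            used[i] = True
            reals.append(r.real)
    while len(reals) >= 2:               # pair real roots into quadratics
        a, b = reals.pop(), reals.pop()
        factors.append(np.array([a * b, -(a + b), 1.0]))
    if reals:                            # at most one linear factor remains
        factors.append(np.array([-reals[0], 1.0]))
    factors[0] = factors[0] * lead       # absorb the leading coefficient
    return factors
```

Each returned factor is the coefficient vector of one SoGC layer; multiplying them back together (`reduce(np.convolve, factors)`) reproduces the target filter.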

3.2 Compared with Other Graph Convolution

In this subsection, we show that vanilla GCN (and GIN) do not attain full expressiveness in terms of filter representation. We also contrast our SoGC with multi-hop GCs (i.e., higher-order GCs in our terminology) to further reveal the prominence of SoGC. A brief comparison is summarized in Table 1.

Vanilla vs. second-order.

Vanilla GCN (Kipf and Welling, 2017) is a typical one-hop GCN that lumps graph node self-connections and pairwise neighboring connections. Compared with SoGC, vanilla GC is more localized and computationally cheaper. However, this design has severe performance limitations (Hoang and Maehara, 2019; Wu et al., 2019; Li et al., 2018; Oono and Suzuki, 2019; Cai and Wang, 2020). We illustrate this issue in terms of filter approximation power based on the LSS framework.

Suppose a GCN stacks L vanilla GC layers. Applying the mapping ρ to its spanned LSS, the isomorphic polynomial space is:

{c (1 + z)^L : c ∈ R},    (7)

where c is the product of the per-layer weights. According to Equation 7, no matter how large L is or how an optimizer tunes the parameters, the LSS stays one-dimensional, which implies it degenerates to a negligible subspace inside P_{M−1}. GIN (Xu et al., 2019) disentangles the weights for neighborhoods and central nodes. We can write this GC layer as g_θ(Ã) x = (θ_0 I + θ_1 Ã) x. The LSS of GIN is isomorphic to the polynomial space:

{∏_{l=1}^{L} (θ_0^{(l)} + θ_1^{(l)} z) : θ^{(l)} ∈ R²}.    (8)
This polynomial space contains exactly the polynomials that split over the real domain. However, not all polynomials can be factorized into first-order polynomials over the reals. The expected number of real roots of a degree-n polynomial with zero-mean random coefficients grows only logarithmically in n (Ibragimov and Maslova, 1971). As the ambient dimension grows, polynomials with all-real roots occupy only a small proportion of the ambient space (Li, 2011), which indicates GIN does not have full expressiveness in terms of filter representation either.
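The degeneracy of stacked vanilla GC layers is easy to verify in coefficient space. This small sketch (ours, with a hypothetical helper name) composes one-hop kernels θ(1 + z) and shows the result is always a scalar multiple of (1 + z)^L, i.e., a one-dimensional LSS:

```python
import numpy as np
from functools import reduce

def stacked_vanilla_coeffs(thetas):
    """Coefficients (low-to-high) of composing vanilla GC layers theta*(1 + z).
    Composition of filters is multiplication of their coefficient polynomials,
    i.e., convolution of coefficient vectors."""
    return reduce(np.convolve, [t * np.array([1.0, 1.0]) for t in thetas])
```

Whatever the per-layer weights are, only the overall scale changes; the filter shape is pinned to (1 + z)^L.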

Higher-order vs. second-order.

Higher-order GCs refer to polynomial filters with degree larger than two (i.e., K ≥ 3). They can model multi-hop GCNs such as Luan et al. (2019); Liao et al. (2019); Abu-El-Haija et al. (2019). Compared to SoGCs, higher-order GCs have equivalent expressive power, since they can be reduced to SoGCs. However, we point out four limitations of adopting higher-order kernels: 1) From our polynomial factorization perspective, fitting graph filters using higher-order GCs requires coefficient sparsity, which brings about learning difficulty. Abu-El-Haija et al. (2019) overcome this problem by adding lasso regularization and extra training procedures. Adopting SoGC avoids these troubles, since decomposition into second-order polynomials results in at most one zero coefficient (see Section 3.1). 2) Eigenvalues of graph adjacency matrices diminish when powered. This leads to a decreasing numerical rank of the powered adjacency matrix and makes aggregating larger-scale information ineffective. SoGCs alleviate this problem by avoiding higher-order powering operations. 3) Higher-order GC lacks nonlinearity. SoGCN strikes a better balance between the expressive power of low-level layers and nonlinearity among them. 4) Multi-hop aggregation consumes more computational resources (see Table 1). In contrast, SoGC matches the time complexity of vanilla GCN by fixing the kernel size to two.
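Limitation 2 (diminishing eigenvalues under powering) can be observed directly. The sketch below (our construction; `numerical_rank`, `path_norm_adj`, and the tolerance are our choices) builds the self-loop-augmented normalized adjacency of a path graph and shows its numerical rank drops sharply after taking the 8th power:

```python
import numpy as np

def numerical_rank(M, tol=1e-2):
    """Number of singular values above tol times the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

def path_norm_adj(n):
    """Symmetrically normalized adjacency of an n-node path with self-loops,
    as in the renormalization of Kipf & Welling; eigenvalues lie in (-1, 1]."""
    A = np.eye(n)
    idx = np.arange(n - 1)
    A[idx, idx + 1] = A[idx + 1, idx] = 1.0
    d = A.sum(axis=1)
    return A / np.sqrt(np.outer(d, d))
```

Since all non-leading eigenvalues have magnitude below one, repeated powering pushes most of the spectrum toward zero, leaving long-range aggregators with little usable signal.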

Figure 3: Relation between test MAE and the number of GC layers. Experiments are conducted on the synthetic Band-Pass dataset. Each model has 16 channels per hidden layer with varying depth.

3.3 Implementation of Second-Order Graph Convolutional Networks

In this subsection, we introduce the other building blocks used to establish our Second-Order Graph Convolutional Networks (SoGCN), following the latest trends in GCN design. We promote SoGC to its multi-channel version analogous to Kipf and Welling (2017). Then we prepend a feature embedding layer, cascade multiple SoGC layers, and append a readout module. Suppose the network input X is supported on graph G; denote the output of the l-th layer as H^{(l)}, the final node-level output as Y, and the graph-level output as y. We formulate our novel deep GCN built with SoGC (cf. Equation 3) as follows:

H^{(0)} = E(X),    H^{(l+1)} = σ(H^{(l)} W_0^{(l)} + Ã H^{(l)} W_1^{(l)} + Ã² H^{(l)} W_2^{(l)}),    Y = R(H^{(L)}),    (9)
where W_k^{(l)} are trainable weights for the linear filters; E is an equivariant embedder (Maron et al., 2018) with its own parameters; σ is an activation function. For node-level readout, R can be a decoder (with its own parameters) or an output activation (e.g., softmax) in place of the prior layer. For graph-level output, R should be an invariant readout function (Maron et al., 2018), e.g., channel-wise sum, mean, or max. In practice, we adopt ReLU as the nonlinear activation, a multi-layer perceptron (MLP) as the embedding function, another MLP for node regression readout, and sum pooling (Xu et al., 2019) for graph classification readout.
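For reference, the multi-channel SoGC layer described above can be sketched in a few lines of NumPy (our sketch; the weight names `W0`, `W1`, `W2` and the use of plain dense arrays are our assumptions, not the authors' implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sogc_layer(A_norm, H, W0, W1, W2):
    """One multi-channel SoGC layer on node features H (N x C_in):
    H' = relu(H W0 + A H W1 + A^2 H W2), computed with two propagations."""
    AH = A_norm @ H
    return relu(H @ W0 + AH @ W1 + (A_norm @ AH) @ W2)
```

Reusing the intermediate product `AH` keeps the cost at two sparse-friendly propagations per layer, matching the two-hop locality of the kernel.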

We also provide a variant of SoGCN integrated with a Gated Recurrent Unit (GRU) (Cho et al., 2014), termed SoGCN-GRU. A GRU can utilize its gating mechanism to preserve and forget information. We hypothesize that a GRU can be trained to remove redundant signals and retain lost features on the spectrum. Similar to Li et al. (2016); Gilmer et al. (2017), we append a shared GRU module after each GC layer, which takes the signal before the GC layer as the hidden state and the signal after the GC layer as the current input. We show by experiment that the GRU helps SoGCN suppress noise and enhance features on the spectrum (Figure 2). Our empirical study in Table 5 also indicates that the effectiveness of GRU for spectral GCNs is general. Hence, we suggest including this recurrent module as another basic building block of our SoGCNs.

Model #Param Test MAE
High-Pass Low-Pass Band-Pass
Vanilla 4611 0.308 0.317 0.559
GIN 4627 0.344 0.096 0.274
SoGCN 12323 0.021 0.023 0.050
3rd-Order 16179 0.021 0.022 0.045
4th-Order 20035 0.021 0.022 0.049
Table 2: The performance of graph node signal regression with High-Pass, Low-Pass, and Band-Pass filters (over graph spectral space) as learning target. Each model has 16 GC layers and 16 channels of hidden layers.
Vanilla GCN
Vanilla GCN + GRU
Test MAE ± s.d.
0.407 ± 0.028 (result of 3WLGNN with 100k parameters; with 500k parameters the test MAE increases to 0.427 ± 0.011)
Test ACC ± s.d. (%)
90.705 ± 0.218  55.710 ± 0.381  53.445 ± 2.029  63.880 ± 0.074
96.020 ± 0.090  61.332 ± 0.849  57.932 ± 0.168  70.194 ± 0.216
95.535 ± 0.205  64.223 ± 0.455  57.732 ± 0.323  75.824 ± 1.823
90.805 ± 0.032  65.911 ± 2.515  58.064 ± 0.131  85.482 ± 0.037
97.312 ± 0.097  65.767 ± 0.308  50.454 ± 0.145  50.516 ± 0.001
96.485 ± 0.252  55.255 ± 1.527  58.384 ± 0.236  85.590 ± 0.011
97.340 ± 0.143  67.312 ± 0.311  60.404 ± 0.419  84.480 ± 0.122
95.075 ± 0.961  59.175 ± 1.593  57.130 ± 6.539  85.661 ± 0.353
96.785 ± 0.113  66.338 ± 0.155  68.167 ± 1.164  85.735 ± 0.037
97.729 ± 0.159  68.208 ± 0.271  67.994 ± 2.619  85.711 ± 0.047
Table 3: Results and comparison with other GNN models on ZINC, CIFAR10, MNIST, CLUSTER and PATTERN datasets. For ZINC dataset, the parameter budget is set to 500k. For CIFAR10, MNIST, CLUSTER and PATTERN datasets, the parameter budget is set to 100k. Red: the best model, Green: good models.

4 Experiments

4.1 Synthetic Graph Spectrum Dataset for Filter Fitting Power Testing

To validate the expressiveness of SoGCN and its power to fit arbitrary graph filters, we build a Synthetic Graph Spectrum (SGS) dataset for the node signal filtering regression task. We construct the SGS dataset from random graphs. The learning task is to simulate three types of hand-crafted filtering functions on the graph spectrum (defined over the graph eigenvectors): high-pass, low-pass, and band-pass. There are 1k training graphs, 1k validation graphs, and 2k testing graphs for each filtering function. Each graph is undirected and comprises 80 to 120 nodes. Appendix E covers more details of our SGS dataset. We choose Mean Absolute Error (MAE) as the evaluation metric.
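Ground-truth targets of this kind can be generated by filtering directly in the spectral domain. Below is a hedged sketch (our reconstruction; the exact response functions used in the dataset are specified in Appendix E, so the responses here are only illustrative shapes):

```python
import numpy as np

def spectral_filter(A_norm, x, response):
    """Ground-truth filtered signal U diag(h(lam)) U^T x, where (lam, U) is the
    eigendecomposition of the symmetric normalized adjacency and `response`
    maps eigenvalues to filter gains."""
    lam, U = np.linalg.eigh(A_norm)
    return U @ (response(lam) * (U.T @ x))

# illustrative responses on the adjacency spectrum (our choices)
low_pass = lambda lam: 1.0 / (1.0 + np.exp(-10.0 * lam))   # favors lam near 1
band_pass = lambda lam: np.exp(-10.0 * lam ** 2)           # favors lam near 0
```

Because U is orthogonal for a symmetric matrix, an all-ones response reproduces the input exactly, which gives a convenient correctness check for the generator.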

Experimental Setup.

We compare SoGCN with vanilla GCN (Kipf and Welling, 2017), GIN (Xu et al., 2019), and higher-order GCNs on the synthetic dataset. To evaluate each model's expressiveness purely on the GC kernel design, we remove ReLU activations for all tested models. We adopt the Adam optimizer (Kingma and Ba, 2015) in our training process, with a batch size of 128. The learning rate begins at 0.01 and decays by half once the validation loss stagnates for more than 10 training epochs.

Results and Discussion.

Table 2 summarizes the quantitative comparisons. SoGCN achieves superior performance on all three tasks, outperforming vanilla GCN and GIN, which implies that the SoGC kernel does benefit from explicit disentangling of the second-hop neighborhood. Our results also show that higher-order (3rd-order and 4th-order) GCNs do not improve performance further, even though they incorporate many more parameters. SoGCN is more expressive and strikes a better trade-off between performance and model size.

Figure 3 plots MAE results as we vary the depth of GC layers for each graph kernel type. Vanilla GCN and GIN cannot benefit from depth, while SoGC and higher-order GCs can leverage depth to span a larger LSS, contributing to the remarkable filtering results. SoGC and higher-order GCs perform very similarly as the layer count increases, which suggests higher-order GCs do not obtain more expressiveness than SoGC.

Model ogb-protein
#Param  Time / Ep.  ROC-AUC ± s.d.
Vanilla GCN 96880 3.47 ± 0.40 72.16 ± 0.55
GIN 128512 4.33 ± 0.27 76.77 ± 0.20
GCNII 227696 4.96 ± 0.29 74.79 ± 1.17
GCNII* 424304 5.09 ± 0.17 72.50 ± 2.49
APPNP 96880 6.56 ± 0.37 65.37 ± 1.15
GraphSage 193136 6.51 ± 0.13 77.53 ± 0.30
SoGCN 192512 4.88 ± 0.36 79.28 ± 0.47
4th-Order GCN 320512 8.89 ± 0.82 78.95 ± 0.57
6th-Order GCN 448512 9.76 ± 0.64 78.61 ± 0.42
Table 4: The performance of node-level multi-label classification on the ogb-protein dataset. We compare each model considering the following dimensions: the number of parameters, training time (in seconds) per epoch (Ep.), and final test ROC-AUC (%).
Model  Test MAE ± s.d.  Test ACC ± s.d. (%)
Vanilla GCN  (Baseline)  (Baseline)  (Baseline)
SoGCN  (↓ 0.129)  (↑ 6.080)  (↑ 10.628)
4th-Order GCN  (↓ 0.124)  (↑ 5.462)  (↑ 8.520)
6th-Order GCN  (↓ 0.106)  (↑ 5.587)  (↑ 7.977)
Vanilla GCN + GRU
SoGCN + GRU  (↓ 0.166)  (↑ 7.024)  (↑ 12.498)
4th-Order GCN + GRU
6th-Order GCN + GRU
Table 5: Results of the ablation study on ZINC, MNIST and CIFAR10 datasets. Vanilla GCN is the comparison baseline; the numbers in (↓) and (↑) represent the performance gain over the baseline (lower MAE, higher accuracy).

4.2 OGB Benchmarks

We choose the Open Graph Benchmark (OGB) (Hu et al., 2020) to compare our SoGC with other GCNs in terms of parameter counts, training time per epoch, and test ROC-AUC. We only demonstrate the results for predicting the presence of protein functions (multi-label graph classification). We refer interested readers to Appendix G for more results on OGB.

Experimental Setup.

The chosen models mainly include spectral-domain models: vanilla GCN, GIN, APPNP (Klicpera et al., 2019), GCNII (Ming Chen et al., 2020), our SoGCN, and two higher-order GCNs. We also report the performance of GraphSage, a vertex-domain GNN baseline, for reference. We build GIN, GCNII, GraphSage, and APPNP based on the official implementations in PyTorch Geometric (Fey and Lenssen, 2019). Every model consists of three GC layers and the same node embedder and readout modules. We borrow the method of Dwivedi et al. (2020) to compute the number of parameters. We run an exclusive training program on an Nvidia Quadro P6000 GPU to measure the training time per epoch. We follow the same training and evaluation procedures as the OGB benchmarks to ensure fair comparisons. We train each model until convergence (~1k epochs for vanilla GCN, GIN, GraphSage, and ~3k epochs for SoGCN, higher-order GCNs, APPNP, GCNII).

Results and Discussion.

Table 4 reports the ROC-AUC score for each model on the ogb-protein dataset. Our SoGCN achieves the best performance among all presented GCNs, while its parameter count and time complexity are only slightly higher than GIN's (consistent with Table 1). SoGC is more expressive than other existing graph filters (such as APPNP and GCNII), and also outperforms the message-passing GNN baseline GraphSage. Compared with higher-order (4th and 6th) GCNs, the ROC-AUC score of SoGCN surpasses all of them while reducing model complexity significantly.

4.3 GNN Benchmarks

We follow the benchmarks outlined in Dwivedi et al. (2020) for evaluating GNNs on several datasets across a variety of artificial and real-world tasks. We evaluate our SoGCN on a real-world chemistry dataset (ZINC molecules) for the graph regression task, two semi-artificial computer vision datasets (CIFAR10 and MNIST superpixels) for the graph classification task, and two artificial social network datasets (CLUSTER and PATTERN) for node classification.

Experimental Setup.

We compare our proposed SoGCN and SoGCN-GRU with state-of-the-art GNNs: vanilla GCN, GIN, GraphSage, GAT (Veličković et al., 2018), MoNet (Monti et al., 2017), GatedGCN (Bresson and Laurent, 2017) and 3WL-GNN (Maron et al., 2019). To ensure fair comparisons, we follow the same training and evaluation pipelines (including optimizer settings) and data splits as the benchmarks. Furthermore, we adjust our model's depth and width to satisfy the parameter budgets specified in the benchmark. Note that we do not use any geometric information to encode rich graph edge relationships, as in models such as GatedGCN-E-PE. We only employ graph connectivity information for all tested models.

Results and Discussion.

Table 3 reports the benchmark results. Our SoGCN makes a small computational change to GCN by adopting the second-hop neighborhood, yet it outperforms models with complicated message-passing mechanisms, such as GAT and GraphSage. With the GRU module, SoGCN-GRU tops almost all state-of-the-art GNNs on the ZINC, MNIST and CIFAR10 datasets. In Figure 2, we visualize the spectrum of the last layer's feature activation on the ZINC dataset. One can see that SoGC can extract features on high-frequency bands and the GRU can further sharpen these patterns. However, the GRU does not lift accuracy on the CLUSTER and PATTERN datasets for the node classification task. Following Li et al. (2018), we attribute this slight performance drop on CLUSTER and PATTERN to the GRU suppressing the low-frequency band.

Ablation Study.

To contrast the performance gains produced by different aggregation ranges and by the GRU on the benchmarks, we evaluate vanilla GCN, SoGCN, 4th-Order GCN, and 6th-Order GCN as well as their GRU variants on the ZINC, MNIST and CIFAR10 datasets. Table 5 presents the results of our ablation study, which are consistent with our observations in Sections 4.1 and 4.2. As shown by the ablation study, adopting second-hop aggregation yields a large performance gain (vanilla GCN vs. SoGCN). However, higher-order GCNs are not capable of boosting performance further over SoGCN; on the contrary, higher-order GCs can even lead to a performance drop (4th-Order GCN vs. 6th-Order GCN vs. SoGCN). We also test the GRU's effectiveness for each presented model, but the gain brought by the GRU is not as large as that from adding second-hop aggregation. Figure 2 shows our SoGC can extract patterns on the spectrum alone; the GRU plays the role of enhancing these features.

5 Conclusion

What should be the basic convolutional block of GCNs? To answer this, we seek the most localized graph convolution (GC) kernel with full expressiveness. We establish our LSS framework to assess GC layers of different aggregation ranges and show that the second-order graph convolutional filter, termed SoGC, possesses full representation power, unlike one-hop GCs. It is therefore the simplest and most efficient GC building block, which we adopt to establish our SoGCN. Both synthetic and benchmark experiments demonstrate the strength of our theoretical design. We also conduct an empirical study of the GRU's effects in spectral GCNs. Interesting directions for future work include analyzing two-hop aggregation schemes with message-passing GNNs and proving the universality of nonlinear GCNs.


  • S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. V. Steeg, and A. Galstyan (2019) Mixhop: higher-order graph convolutional architectures via sparsified neighborhood mixing. In ICML, Cited by: Table 1, §1, §1, §1, §1, §2, §3.2.
  • F. M. Bianchi, D. Grattarola, L. Livi, and C. Alippi (2019) Graph neural networks with convolutional arma filters. In CoRR, Cited by: §G.2, §2.
  • X. Bresson and T. Laurent (2017) Residual gated graph convnets. arXiv:1711.07553. Cited by: §4.3.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2014) Spectral networks and locally connected networks on graphs. In ICLR, Cited by: §2.
  • C. Cai and Y. Wang (2020) A note on over-smoothing for graph neural networks. In ICML, Cited by: §H.2, §1, §2, §2, §3.2.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv:1406.1078. Cited by: §H.2, §3.3.
  • F. R. Chung and F. C. Graham (1997) Spectral graph theory. American Mathematical Soc.. Cited by: §3.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, Cited by: §1, §1, §2, §3, footnote 1.
  • N. Dehmamy, A. Barabási, and R. Yu (2019) Understanding the representation power of graph neural networks in learning graph topology. In NeurIPS, Cited by: §3.1.
  • V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson (2020) Benchmarking graph neural networks. arXiv:2003.00982. Cited by: §1, §4.2, §4.3.
  • M. Fey, J. Eric Lenssen, F. Weichert, and H. Müller (2018) Splinecnn: fast geometric deep learning with continuous b-spline kernels. In CVPR, Cited by: §1.
  • M. Fey and J. E. Lenssen (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, Cited by: §G.2, §H.1, §4.2.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §H.2, §H.2, §1, §3.3.
  • B. Girault, P. Gonçalves, and É. Fleury (2015) Translation on graphs: an isometric shift operator. SPL. Cited by: §3.3.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NeurIPS, Cited by: §G.2, §H.1, §1, §1.
  • D. K. Hammond, P. Vandergheynst, and R. Gribonval (2011) Wavelets on graphs via spectral graph theory. ACHA. Cited by: §2.
  • N. Hoang and T. Maehara (2019) Revisiting graph neural networks: all we have is low-pass filters. arXiv:1905.09550. Cited by: §1, §2, §3.2.
  • W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open graph benchmark: datasets for machine learning on graphs. arXiv:2005.00687. Cited by: §G.2, §4.2.
  • I. A. Ibragimov and N. B. Maslova (1971) The mean number of real zeros of random polynomials. I. Coefficients with zero mean. Theory of Probability & Its Applications 16, pp. 228–248. Cited by: §3.2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §G.1, §H.1, §1, §1, §1, §2, §3.2, §3.3, §3, §4.1.
  • J. Klicpera, A. Bojchevski, and S. Günnemann (2019) Predict then propagate: graph neural networks meet personalized pagerank. In ICLR, Cited by: §G.2, §2, §4.2.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature. Cited by: §1.
  • Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, Cited by: §1, §2, §3.2, §4.3.
  • W. Li (2011) Probability of all real zeros for random polynomial with the exponential ensemble. Preprint. Cited by: §3.2.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2016) Gated graph sequence neural networks. In ICLR, Cited by: §H.2, §H.2, §1, §3.3.
  • R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel (2019) Lanczosnet: multi-scale deep graph convolutional networks. In ICLR, Cited by: §1, §1, §2, §3.2.
  • S. Luan, M. Zhao, X. Chang, and D. Precup (2019) Break the ceiling: stronger multi-scale deep graph convolutional networks. In NeurIPS, Cited by: §1, §1, §2, §2, §3.2.
  • H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman (2019) Provably powerful graph networks. In NeurIPS, Cited by: §1, §4.3.
  • H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman (2018) Invariant and equivariant graph networks. In ICLR, Cited by: §1, §3.3.
  • M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li (2020) Simple and deep graph convolutional networks. In ICML, Cited by: §G.2, §2, §2, §3.1, §4.2.
  • F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In CVPR, Cited by: §1, §4.3.
  • C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2019) Weisfeiler and leman go neural: higher-order graph neural networks. In AAAI, Cited by: §1.
  • K. Oono and T. Suzuki (2019) Graph neural networks exponentially lose expressive power for node classification. In ICLR, Cited by: §H.2, §1, §2, §2, §3.2.
  • H. Pei, B. Wei, K. C. Chang, Y. Lei, and B. Yang (2020) Geom-gcn: geometric graph convolutional networks. In ICLR, Cited by: §1.
  • Y. Rong, W. Huang, T. Xu, and J. Huang (2019) Dropedge: towards deep graph convolutional networks on node classification. In ICLR, Cited by: §1.
  • A. Sandryhaila and J. M. Moura (2013) Discrete signal processing on graphs. IEEE Trans. Signal Process. Cited by: Appendix C, Appendix F, §3.1, §3.
  • J. Shi and J. Malik (2000) Normalized cuts and image segmentation. TPAMI. Cited by: §3.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Cited by: §1.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. In ICLR, Cited by: §4.3.
  • M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang (2019) Deep graph library: a graph-centric, highly-performant package for graph neural networks. arXiv:1909.01315. Cited by: §H.1.
  • F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger (2019) Simplifying graph convolutional networks. In ICML, Cited by: §1, §2, §3.2.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In ICLR, Cited by: §G.1, §1, §2, §3.2, §3.3, §4.1.
  • K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka (2018) Representation learning on graphs with jumping knowledge networks. In ICML, Cited by: §2.
  • D. Zhou (2020) Universality of deep convolutional neural networks. ACHA. Cited by: §1.

Appendix A Remark on Definition 1

Let us rewrite the following Definition 1:

We claim that such functions are all linear shift-invariant (LSI) with respect to the adjacency matrix.


Given an arbitrary graph , any filter associated with it can be written as below:

where is the eigendecomposition of . Therefore, is also diagonalized by the eigenvectors of . By Lemma 4:

Lemma 4.

Diagonalizable matrices A and B are simultaneously diagonalizable if and only if AB = BA.

we conclude that the filter commutes with the adjacency matrix, which establishes shift-invariance. ∎
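As a quick numerical illustration of this remark (a sketch of our own, not code from the paper), any polynomial filter built from an adjacency matrix is diagonalized by the same eigenvectors and therefore commutes with it:

```python
import numpy as np

# Hypothetical check: a polynomial filter H = c0*I + c1*A + c2*A^2 built from
# a symmetric adjacency matrix A commutes with A, i.e., it is shift-invariant.
rng = np.random.default_rng(0)
n = 6
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1)
A = (A + A.T).astype(float)      # symmetric adjacency of an undirected graph

c = [0.5, -1.0, 0.25]            # arbitrary filter coefficients
H = c[0] * np.eye(n) + c[1] * A + c[2] * (A @ A)

# H is a polynomial in A, hence HA = AH (they share an eigenbasis).
assert np.allclose(H @ A, A @ H)
```

The same check passes for any coefficient vector, since powers of A always commute with A.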

Appendix B Ring Isomorphism

We introduce a mathematical device that bridges the gap between the filter space and the polynomial space .

Since is finite, we can construct a block diagonal matrix , with the adjacency matrix of every graph on the diagonal:

Remark 1.

The spectrum capacity in Definition 2 equals the number of eigenvalues of , counted without multiplicity.

Eigenvalues of adjacency matrices signify graph similarity. The spectrum capacity identifies a set of graphs by enumerating their structural patterns. Even if the graph set grows extremely large (to guarantee generalization capability), the distribution of the spectrum provides an upper bound on , so our theories retain their generality.

Now we construct a matrix space by applying a ring homomorphism to every element in :


Concretely, we write the matrix space as follows:


In the rest of this section, we prove that is a ring isomorphism.

Figure 4: An example of graph spectrum in our SGS dataset and its corresponding high-pass, low-pass and band-pass filtered output using our hand-crafted filters.

First, we can verify that is a ring homomorphism because it preserves summation and multiplication. Second, its surjectivity follows from the definition of (cf. Equation 14).

Finally, we show injectivity as follows. Consider any pair of with parameters ; there exist and such that . After applying , we have their images . Let , where denotes the all-zero vector of length ; then we have:

Hence injectivity follows. ∎

Appendix C Proof of Lemma 1


One can show that is a vector space by verifying that linear combinations over are closed (or simply by the ring isomorphism ).

Due to the isomorphism, . Lemma 1 then follows from Theorem 3 of Sandryhaila and Moura (2013). We briefly sketch the proof below.

Let denote the minimal polynomial of . We have . Suppose . First, cannot be larger than , because is a spanning set. If , then there exists some polynomial with such that , contradicting the minimality of . Therefore, can only be .

Suppose . Consider any whose polynomial has . By polynomial division, there exist unique polynomials and such that


where . Substituting into Equation 15 gives:

Therefore, form a basis of , i.e., . ∎

Appendix D Proof of Lemma 2


Consider a mapping :


When , (as ), which implies that is a ring isomorphism as well. Since function composition preserves the isomorphism property, we conclude the proof by showing that . ∎

Remark 2.

The assumption that each graph has the same number of vertices is made only for simplicity. Lemmas 1 and 2 still hold when the vertex counts vary, since the construction of (cf. Equation 12) does not depend on this assumption.

Remark 3.

The graph set needs to be finite, otherwise might be uncountable. We leave the discussion of infinite graph sets to future study.

Appendix E Synthetic Graph Spectrum Dataset

Our Synthetic Graph Spectrum (SGS) dataset is designed to test the filter fitting power of spectral GCNs. It includes three types of graph signal filters: High-Pass (HP), Low-Pass (LP), and Band-Pass (BP). For each type, we generate 1k, 1k, and 2k undirected graphs, along with graph signals and ground-truth responses, for the training, validation, and test sets, respectively. Each graph has 80–120 nodes and 80–350 edges. Models are trained on each dataset to learn the corresponding filter, supervised with the MAE loss.

For each sample, we generate an undirected Erdős–Rényi random graph with normalized adjacency matrix , i.e., the existence of an edge between each pair of nodes follows a Bernoulli distribution . In our experiments, we set to satisfy . We also compute the eigendecomposition , where are the eigenvalues and are the corresponding eigenvectors.
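A minimal numpy sketch of one such sample graph (the node count and edge probability below are illustrative choices, not the paper's exact settings): an Erdős–Rényi graph, its symmetrically normalized adjacency, and the eigendecomposition that defines the graph Fourier basis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 0.04                          # assumed sizes; expected ~n*(n-1)/2*p edges

# Erdős–Rényi: each potential edge exists with probability p (Bernoulli).
A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # undirected: symmetric, no self-loops

# Symmetric normalization D^{-1/2} A D^{-1/2} (guarding isolated nodes).
deg = A.sum(axis=1)
d_inv_sqrt = np.where(deg > 0, deg, 1.0) ** -0.5
A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Graph Fourier basis: eigenvalues (frequencies) and eigenvectors of A_hat.
lam, U = np.linalg.eigh(A_hat)
assert np.allclose(U @ np.diag(lam) @ U.T, A_hat)
```

The eigenvalues of the normalized adjacency lie in [-1, 1], which is what makes the rescaled frequency axis used below well defined.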

Figure 5: Visualize the spectrum of outputs from vanilla GCN, GIN and SoGCN on the SGS Band-Pass dataset.

Next, we generate input graph signals on the spectral domain. Sampling each frequency independently from a distribution tends to generate white noise, so we instead synthesize each spectrum by summing random functions. We notice that a mixture of beta and Gaussian functions is a powerful model for constructing diverse curves by tuning the shape parameters and . We sum two discretized beta functions and four discretized Gaussian functions with random parameters to generate the signal spectra. Equation 17 details the generation process and the hyper-parameters chosen in our experiments, where is the PDF of the distribution and denotes the PDF of the distribution.


We retrieve the vertex-domain signals via the inverse graph Fourier transform: . Gaussian noise is then added to the vertex-domain signals to simulate observation errors: .
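The spectrum-synthesis idea can be sketched as follows (a self-contained illustration; the shape-parameter ranges and noise level are our own guesses, not the paper's exact hyper-parameters, and the identity matrix stands in for the real eigenvector basis):

```python
import numpy as np
from math import gamma

def beta_pdf(t, a, b):
    """PDF of Beta(a, b) on (0, 1), via the gamma function."""
    return t**(a - 1) * (1 - t)**(b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

def gauss_pdf(t, mu, sigma):
    """PDF of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
n = 100
t = np.linspace(1e-3, 1 - 1e-3, n)     # frequencies rescaled to (0, 1)

# Sum two discretized beta curves and four discretized Gaussian curves.
s = np.zeros(n)
for _ in range(2):
    s += beta_pdf(t, rng.uniform(1, 5), rng.uniform(1, 5))
for _ in range(4):
    s += gauss_pdf(t, rng.uniform(0, 1), rng.uniform(0.05, 0.2))

# Inverse graph Fourier transform x = U s, plus Gaussian observation noise.
U = np.eye(n)                          # stand-in basis to keep this sketch self-contained
x = U @ s + 0.01 * rng.standard_normal(n)
```

With the real eigenvector matrix of a sampled graph in place of the identity, this produces one noisy vertex-domain input signal per sample.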

We design three filters , , in Equation 18:


where . For supervision, we apply each filter to the synthetic inputs to generate the ground-truth outputs: . Figure 4 illustrates an example of the generated spectral signals and the ground-truth responses of the three filters.
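The filtering step itself can be sketched like this (the exact HP/LP/BP response curves are given by the paper's Equation 18; the simple exponential responses below are hypothetical stand-ins used only to show the mechanics of spectral filtering, y = U diag(h(λ)) Uᵀ x):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
M = rng.standard_normal((n, n))
M = (M + M.T) / 2                       # toy symmetric matrix as a graph surrogate
lam, U = np.linalg.eigh(M)              # frequencies and graph Fourier basis

# Illustrative filter responses over the eigenvalues (not Equation 18 itself).
filters = {
    "low-pass":  lambda l: np.exp(-4 * (l - l.min())),
    "high-pass": lambda l: np.exp(-4 * (l.max() - l)),
    "band-pass": lambda l: np.exp(-8 * (l - np.median(l)) ** 2),
}

x = rng.standard_normal(n)
# Filter in the spectral domain, then transform back to the vertex domain.
y = {name: U @ (h(lam) * (U.T @ x)) for name, h in filters.items()}
```

Each ground-truth response in the dataset is produced by exactly this transform-scale-inverse-transform pattern, with the paper's responses substituted for the stand-ins.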

Appendix F More Visualizations of Spectrum

For multi-channel node signals , where is the number of nodes and the number of signal channels, the spectrum of is computed as . More information about the graph spectrum and the graph Fourier transform can be found in Sandryhaila and Moura (2013).
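In numpy, the multi-channel spectrum computation is one matrix product (a sketch with assumed names; a random symmetric matrix stands in for the normalized adjacency):

```python
import numpy as np

rng = np.random.default_rng(6)
N, C = 10, 3
M = rng.standard_normal((N, N))
M = (M + M.T) / 2                  # toy symmetric "adjacency"
_, U = np.linalg.eigh(M)           # orthonormal graph Fourier basis

X = rng.standard_normal((N, C))    # N nodes, C signal channels
X_hat = U.T @ X                    # one spectrum (column) per channel
assert np.allclose(U @ X_hat, X)   # the inverse transform recovers X
```

Each column of X_hat is one of the per-channel spectrum curves plotted in Figures 5 and 6.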

Figure 5 shows the output spectra of vanilla GCN, GIN, and SoGCN on the synthetic Band-Pass dataset. The visualizations are consistent with the results in Table 2 and Figure 3 of the main text. Vanilla GCN loses almost the entire band-pass frequency range, resulting in very poor performance. GIN learns to pass part of the middle-frequency band but still falls short of the ground truth. SoGCN's filtering response is close to the ground-truth response, showing its strong ability to represent graph signal filters.

We arbitrarily sample graph data from the ZINC dataset as input and visualize the output spectra of vanilla GCN, SoGCN, and their GRU variants in Figure 6. Each curve in the visualization represents the spectrum of one output channel, i.e., each column of is plotted as a curve.

Figure 6: More visualizations of output spectrum on the ZINC dataset.

Appendix G More Experiments

(a) Relation on High-Pass dataset
(b) Relation on Low-Pass dataset
Figure 7: Relations between test MAE and layer size. Each model has 16 channels per hidden layer with varying layer size.

G.1 Additional Experiments on SGS Dataset

We supplement two experiments comparing vanilla GCN (Kipf and Welling, 2017), GIN (Xu et al., 2019), SoGCN, and 4th-order GCN on the synthetic High-Pass and Low-Pass datasets, respectively. From Figure 7, we conclude that SoGCN and higher-order GCNs perform comparably on the High-Pass and Low-Pass datasets and achieve remarkable filtering capability, while vanilla GCN and GIN do not converge to comparable results even as the layer size increases. This is consistent with the earlier results on the Band-Pass dataset presented in the main text.

G.2 Additional Experiments on OGB Benchmark

In the main text, we demonstrated our results on the ogb-protein dataset (Hu et al., 2020) for node-level tasks. In this subsection, we also show SoGCN's effectiveness on the ogb-molhiv dataset (Hu et al., 2020) for graph-level tasks. As with the experiments on ogb-protein, we evaluate the GCN models in terms of total parameter count, training time per epoch, and test ROC-AUC.

Experiment Setup

Again, we choose vanilla GCN, GIN, GraphSage (Hamilton et al., 2017), APPNP (Klicpera et al., 2019), GCNII (Ming Chen et al., 2020), ARMA (Bianchi et al., 2019), our SoGCN, and two higher-order GCNs for performance comparison. We adopt the example code for vanilla GCN and GIN provided by OGB, and reimplement GCNII, GraphSage, APPNP, and ARMA based on the official code in PyTorch Geometric (Fey and Lenssen, 2019). Following the benchmark's guidelines, we add edge features to the fanned-out node features during propagation. Every model has the same depth, width, and auxiliary modules. The timing, training, and evaluation procedures conform to the descriptions in the main text. We train vanilla GCN, GIN, APPNP, and GraphSage for ~100 epochs, and SoGCN, higher-order GCNs, GCNII, and ARMA for ~500 epochs.

Model          | #Param    | Time / Ep. (s) | ROC-AUC (%) ± s.d.
---------------|-----------|----------------|-------------------
Vanilla GCN    | 527,701   | 25.57 ± 1.37   | 76.06 ± 0.97
GIN            | 980,706   | 29.01 ± 1.24   | 75.58 ± 1.40
GCNII          | 524,701   | 24.19 ± 1.26   | 77.04 ± 1.03
APPNP          | 327,001   | 13.56 ± 1.32   | 68.00 ± 1.36
GraphSage      | 976,201   | 24.43 ± 1.39   | 76.90 ± 1.36
ARMA           | 8,188,201 | 43.14 ± 0.99   | 76.91 ± 1.75
SoGCN          | 1,426,201 | 27.02 ± 1.28   | 77.26 ± 0.85
4th-Order GCN  | 2,326,201 | 32.24 ± 1.10   | 77.24 ± 1.21
6th-Order GCN  | 3,226,201 | 37.64 ± 1.15   | 77.10 ± 0.72
Table 6: Performance of graph-level multi-label classification on the ogb-molhiv dataset. We compare each model considering the following dimensions: the number of parameters, training time (in seconds) per epoch (Ep.), and final test ROC-AUC (%).

Results and Discussion.

Table 6 reports the ROC-AUC score of each model on the ogb-molhiv dataset. We reach the same conclusion as in the main text. On ogb-molhiv, we note that GCNII is another lightweight yet effective model. However, GCNII only allows inputs whose channel number equals the output dimension; one needs additional blocks (e.g., linear modules) to support varying hidden dimensions, which introduces more parameters and higher complexity (e.g., on the ogb-protein dataset).

Appendix H Implementation Details

We open-source our implementation of SoGCN. All of our code, datasets, hyper-parameters, and runtime configurations can be found in the released repository.

H.1 Second-Order Graph Convolution

Our SoGC can be implemented with a message-passing scheme (Hamilton et al., 2017) (cf. Equation 19). We regard the normalized adjacency matrix as a one-hop aggregator (message propagator); to compute a power of it, we invoke the propagator multiple times. After passing messages twice, we transform and mix the aggregated information from the two hops via a linear block.


where is the input feature vector of node and denotes the output for node . is the degree of vertex , and is the set of 's neighboring vertices. is the feature representation of 's first-hop neighborhood, computed by aggregating information once from the directly neighboring nodes. is the feature representation of 's second-hop neighborhood, computed by aggregating the neighbors' . are the weight matrices (i.e., the layer parameters).

This design reduces computation time by reusing previously aggregated information and avoiding matrix power operations on . In practice, SoGC is easy to implement: our message-passing design conforms to mainstream graph learning frameworks such as Deep Graph Library (Wang et al., 2019) and PyTorch Geometric (Fey and Lenssen, 2019). One can simply add another group of parameters and invoke the "propagation" method of vanilla GC (Kipf and Welling, 2017) twice to simulate SoGC. For clarity, we provide pseudo-code for general -order GCs in Algorithm 1; SoGC corresponds to setting the order to 2.

  Input: Graph , node degrees , input features , GC order , weight matrices .
  Output: Feature representation .
  for  to  do
     for  do
     end for
  end for
Algorithm 1 -Order Graph Convolution
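A minimal dense-matrix sketch of Algorithm 1 (our own illustration with assumed variable names; a real implementation would use sparse message passing as in DGL or PyTorch Geometric). Each iteration reuses the previous hop's aggregation, so K hops cost K propagations rather than explicit matrix powers:

```python
import numpy as np

def k_order_gc(A_hat, X, Ws):
    """Y = sum_k (A_hat^k X) W_k, with K = len(Ws) - 1 hops of aggregation."""
    H = X.copy()                # hop-0 features
    Y = H @ Ws[0]
    for W in Ws[1:]:
        H = A_hat @ H           # one more propagation, reusing the previous hop
        Y = Y + H @ W
    return Y

rng = np.random.default_rng(4)
n, d_in, d_out = 8, 5, 3
A = rng.integers(0, 2, (n, n)).astype(float)
A = np.triu(A, 1)
A = A + A.T
deg = np.maximum(A.sum(1), 1.0)
A_hat = A / np.sqrt(deg[:, None] * deg[None, :])   # normalized adjacency

Ws = [rng.standard_normal((d_in, d_out)) for _ in range(3)]  # K = 2 -> SoGC
X = rng.standard_normal((n, d_in))
Y = k_order_gc(A_hat, X, Ws)

# Equivalent direct form: X W0 + (A_hat X) W1 + (A_hat^2 X) W2.
Y_ref = X @ Ws[0] + (A_hat @ X) @ Ws[1] + (A_hat @ A_hat @ X) @ Ws[2]
assert np.allclose(Y, Y_ref)
```

Passing a list of three weight matrices instantiates SoGC; longer lists give the higher-order GCs used in the ablation study.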

H.2 Gated Recurrent Unit

We give two motivations for using the Gated Recurrent Unit (GRU) (Cho et al., 2014): 1) GRU has served as a basic building block in message-passing GNN architectures (Li et al., 2016; Gilmer et al., 2017); we make an explorative attempt to first introduce it into spectral GCNs. 2) By selectively maintaining information from the previous layer and canceling the dominance of DC components (Figure 6), GRU can also relieve the side effect of ReLU, which has been proved to act as a special low-pass filter (Oono and Suzuki, 2019; Cai and Wang, 2020).

Similar to Li et al. (2016) and Gilmer et al. (2017), we append a shared GRU module after each GC layer, which takes the signals before the GC layer as the hidden state and those after the GC layer as the current input. We formulate its implementation by replacing Equation 10 with Equation 20 below.


where is the input, represents the hidden state, and denotes the parameters of the GRU. Figure 6 illustrates the output spectra of vanilla GCN + GRU and SoGCN + GRU. One can see that, without the filtering power of SoGCN, vanilla GCN + GRU fails to extract sharp patterns on the spectrum. We therefore suggest that it is SoGC that mainly contributes to the higher expressiveness.
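The wiring described above can be sketched with a minimal GRU cell of our own (hypothetical parameter names; not the paper's implementation, which uses a standard framework GRU): the pre-GC features play the hidden state h and the GC output plays the input x.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_cell(x, h, P):
    """Standard GRU update: gates blend the old state h with a candidate state."""
    z = sigmoid(x @ P["Wz"] + h @ P["Uz"])            # update gate
    r = sigmoid(x @ P["Wr"] + h @ P["Ur"])            # reset gate
    h_tilde = np.tanh(x @ P["Wh"] + (r * h) @ P["Uh"])  # candidate state
    return (1 - z) * h + z * h_tilde                  # gated blend

rng = np.random.default_rng(5)
n, d = 4, 6
P = {k: 0.1 * rng.standard_normal((d, d))
     for k in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}

h_prev = rng.standard_normal((n, d))   # features before the GC layer (hidden state)
x_gc = rng.standard_normal((n, d))     # features after the GC layer (GRU input)
h_new = gru_cell(x_gc, h_prev, P)      # output fed to the next GC layer
assert h_new.shape == (n, d)
```

Because the gates can keep components of h_prev unchanged, this update lets the network retain information that ReLU alone would attenuate.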