1 Introduction
Graph Convolutional Networks (GCNs) have gained popularity in recent years. Researchers have shown that non-localized multi-hop GCNs (Liao et al., 2019; Luan et al., 2019; Abu-El-Haija et al., 2019) outperform localized one-hop GCNs (Defferrard et al., 2016; Kipf and Welling, 2017; Wu et al., 2019). However, in Convolutional Neural Networks (CNNs), the expressiveness of localized kernels has been demonstrated for image recognition both experimentally (Simonyan and Zisserman, 2014) and theoretically (Zhou, 2020). These contradictory observations motivate us to search for a maximally localized Graph Convolution (GC) kernel with guaranteed feature expressiveness.

Table 1: Comparison of GC kernels.

Kernel       | Expressiveness | Localized | Complexity
Vanilla GCN  | Very Low       | Yes       | Low
GIN          | Medium         | Yes       | Low
Multi-hop    | Full           | No        | High
SoGCN (Ours) | Full           | Yes       | Low
Most existing GCN layers adopt localized graph convolution based on a one-hop aggregation scheme as the basic building block (Kipf and Welling, 2017; Hamilton et al., 2017; Xu et al., 2019). The effectiveness of these one-hop models rests on the intuition that a richer class of convolutional functions can be recovered by stacking multiple one-hop layers. However, extensive works (Li et al., 2018; Oono and Suzuki, 2019; Cai and Wang, 2020) have shown performance limitations of such designs, which indicates this hypothesis may not hold. Liao et al. (2019); Luan et al. (2019); Abu-El-Haija et al. (2019) observed that multi-hop aggregation performed in each layer can lead to significant improvements in prediction accuracy. However, a longer-range aggregator introduces extra hyperparameters and higher computational cost. This design also contradicts the compositionality principle of deep learning, namely that neural networks benefit from deep connections and localized kernels (LeCun et al., 2015).

Recent studies point out that one-hop GCNs suffer from filter incompleteness (Hoang and Maehara, 2019). By relating low-pass filtering on the graph spectrum to over-smoothing, one can show that one-hop filtering leads to performance limitations. One natural solution, adding a more complex graph kernel such as multi-hop connections, seems to work in practical settings (Abu-El-Haija et al., 2019). The question is: what is the simplest graph kernel with the full expressive power of graph convolution? We show that with a second-order graph kernel of "two-hop" connections, we can approximate arbitrarily complicated graph relationships. Intuitively, this means we should extract a contrast between graph nodes that are almost connected (via their neighbors) and those that are directly connected.
We construct the two-hop graph kernel with second-order polynomials in the adjacency matrix and call it the Second-Order GC (SoGC). We show that SoGC is the "sweet spot" balancing localization and fitting capability. To justify this conclusion, we introduce a Layer Spanning Space (LSS) framework to quantify the filter representation power of multi-layer GCs. Our LSS works by mapping the composition of GC filters with arbitrary coefficients to polynomial multiplication (Section 3.1).
Under this LSS framework, we show that SoGCs can approximate any linear GCN in channel-wise filtering (Theorem 1). Vanilla GCN and GIN (first-order polynomials in the adjacency matrix) cannot represent all polynomial filters in general; multi-hop GCs (higher-order polynomials in the adjacency matrix) do not contribute more expressiveness (Section 3.2). In this sense, SoGC is the most localized GC kernel with full representation power.
To validate our theory, we build our Second-Order Graph Convolutional Network (SoGCN) by layering up SoGC layers (Section 3.3). We reproduce our theoretical results on synthetic datasets for filtering power testing (Section 4.1). On the public benchmark datasets (Dwivedi et al., 2020), SoGCN, using only simple graph topological features, consistently boosts the performance of our baseline model (i.e., vanilla GCN) to a level comparable to state-of-the-art GNN models (which use more complex attention/gating mechanisms). We also verify that SoGCN fits extensive real-world tasks, including network node classification, superpixel graph classification, and molecule graph regression (Section 4.3).
To the best of our knowledge, this work is the first study that identifies the distinctive competence of the two-hop neighborhood in the context of expressing a polynomial filter with arbitrary coefficients. Our SoGC is a special but non-trivial case of polynomially approximated graph filters (Defferrard et al., 2016). Kipf and Welling (2017) conducted an ablation study with GC kernels of different orders but missed the effectiveness of second-order relationships. The work of Abu-El-Haija et al. (2019) discussed multi-hop graph kernels; however, they did not identify the critical importance of the two-hop form. In contrast, we clarify the prominence of SoGCs in both theory and experiments.
Our research on GCNs using purely topological relationships is orthogonal to work using geometric relations (Monti et al., 2017; Fey et al., 2018; Pei et al., 2020), expressive edge features (Li et al., 2016; Gilmer et al., 2017), or hyperedges (Morris et al., 2019; Maron et al., 2018, 2019). It is also independent of graph sampling procedures (Rong et al., 2019; Hamilton et al., 2017).
2 Related Work
Spectral GCNs.
Graph convolution is defined as element-wise multiplication on the graph spectrum (Hammond et al., 2011). Bruna et al. (2014) first proposed spectral GCNs with respect to this definition. ChebyNet (Defferrard et al., 2016) approximates graph filters using Chebyshev polynomials. Vanilla GCN (Kipf and Welling, 2017; Wu et al., 2019) further reduces the GC layer to a degree-one polynomial with lumping of the first-order and constant terms. GIN (Xu et al., 2019) disentangles the effect of self-connections and pairwise neighboring connections by adding a separate mapping for central nodes. However, these simplifications cause performance limitations (Oono and Suzuki, 2019; Cai and Wang, 2020). APPNP (Klicpera et al., 2019) uses Personalized PageRank to derive a fixed polynomial filter. Bianchi et al. (2019) proposes a multi-branch GCN architecture to simulate ARMA graph filters. GCNII (Ming Chen et al., 2020) incorporates identity mapping and initial mapping to relieve the over-smoothing problem and deepen GCNs. However, these models are not easy to implement and introduce additional hyperparameters.
Multi-Hop GCNs.
To exploit multi-hop information, Liao et al. (2019) proposes to use the Lanczos algorithm to construct low-rank approximations of the graph Laplacian for graph convolution. Luan et al. (2019) devises two architectures, Snowball GCN and Truncated Krylov GCN, to capture neighborhoods at various distances. To simulate neighborhood delta functions, Abu-El-Haija et al. (2019) repeatedly mixes multi-hop features to identify more topological information. JKNet (Xu et al., 2018) combines the feature activations of all previous layers to learn adaptive, structure-aware representations of different graph substructures. These models exhibit the strength of multi-hop GCNs over one-hop GCNs while leaving the propagation length as a hyperparameter. Meanwhile, long-range aggregation in each layer incurs higher complexity (Table 1).
Expressiveness of GCNs.
Most of the work on GCN expressiveness is restricted to the over-smoothing problem: Li et al. (2018) first poses the over-smoothing problem; Hoang and Maehara (2019) indicates GCNs are no more than low-pass filters; Luan et al. (2019); Oono and Suzuki (2019) demonstrate the asymptotic convergence of feature activations to a subspace; Cai and Wang (2020) examines the decreasing Dirichlet energy. These analytic frameworks do not provide constructive improvements for building more powerful GCNs. Ming Chen et al. (2020) first proposed to assess a GCN's overall expressiveness via its ability to express polynomial filters with arbitrary coefficients. But their theory is only applicable to transductive learning on a single graph, and it does not upper-bound the degree of the graph polynomial filters.
3 Second-Order Graph Convolution
We begin by introducing our notation. We are interested in learning on a finite graph set $\mathcal{G}$. Assume each graph $G \in \mathcal{G}$ is simple and undirected, associated with a finite vertex set $\mathcal{V}$, an edge set $\mathcal{E}$, and a symmetric normalized adjacency matrix $\tilde{A}$ (Chung and Graham, 1997; Shi and Malik, 2000). Without loss of generality and for simplicity, we assume $|\mathcal{V}| = N$ for every $G \in \mathcal{G}$. We denote a single-channel signal supported on a graph as $\mathbf{x} \in \mathbb{R}^N$, a vectorization of a function $x: \mathcal{V} \to \mathbb{R}$.

Graph Convolution (GC) is defined as a Linear Shift-Invariant (LSI) operator with respect to the adjacency matrix (Sandryhaila and Moura, 2013). This property enables GC to extract features regardless of where a local structure falls. A single-channel GC can be written as a mapping $h(\tilde{A}): \mathbb{R}^N \to \mathbb{R}^N$. According to Defferrard et al. (2016), a GC can be approximated by a polynomial in the adjacency matrix (Figure 1(c)) (we can replace the Laplacian matrix in Defferrard et al. (2016) with the normalized adjacency matrix, since $\tilde{L} = I - \tilde{A}$):
$h_{\mathbf{w}}(\tilde{A}) = \sum_{k=0}^{K} w_k \tilde{A}^k \qquad (1)$
where $\mathbf{w} = (w_0, w_1, \ldots, w_K)$ represents the kernel weights. Equation 1 indicates that graph convolution can be interpreted as a linear combination of features aggregated by the powers $\tilde{A}^k$. Thereby, the hyperparameter $K$ reflects the localization of a GC kernel.
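To make Equation 1 concrete, the following is a minimal NumPy sketch of applying a polynomial graph filter by iterated one-hop propagation; the toy path graph and the coefficients are our own illustrative choices, not values from the paper.

```python
import numpy as np

# Toy symmetric normalized adjacency of a 4-node path graph (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))

def poly_filter(A, w, x):
    """Apply h_w(A) x = sum_k w[k] * A^k x by repeated one-hop propagation (Equation 1)."""
    out, Akx = w[0] * x, x
    for wk in w[1:]:
        Akx = A @ Akx              # next power of A applied to x
        out = out + wk * Akx
    return out

x = np.random.randn(4)                           # a single-channel node signal
y = poly_filter(A_norm, [0.5, 1.0, -0.25], x)    # a degree-2 (two-hop) filter
```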
Previous work by Kipf and Welling (2017) simplified Equation 1 to a one-hop kernel, the so-called vanilla GC (Figure 1(a)). We formulate vanilla GC as below:
$h_{w}(\tilde{A}) = w\,\tilde{A} \qquad (2)$
We are interested in the overall representation power of graph convolutional networks for expressing a polynomial filter (cf. Equation 1) with arbitrary degrees and coefficients. At first glance, one-hop GC (cf. Equation 2) can approximate any high-order GC kernel by stacking multiple layers. However, that is not the case (see the formal arguments in Section 3.2). In contrast, when the second-order term is plugged into vanilla GC, we will show that this approximation ability can be attained (see Theorem 1). We name this improved design Second-Order Graph Convolution (SoGC), as it can be written as a second-order polynomial in the adjacency matrix:
$h_{\mathbf{w}}(\tilde{A}) = w_0 I + w_1 \tilde{A} + w_2 \tilde{A}^2 \qquad (3)$
We illustrate its vertex-domain interpretation in Figure 1(b). The critical insight is that graph filter approximation can be viewed as a polynomial factorization problem. It is known that any univariate polynomial over the reals can be factorized into sub-polynomials of degree at most two. Based on this fact, we show that stacking enough SoGCs (and varying their parameters) can achieve a decomposition of any polynomial filter.
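For illustration, here is a minimal PyTorch sketch of a multi-channel SoGC layer following Equation 3; the dense adjacency, module name, and dimensions are assumptions made for brevity rather than the reference implementation (see Appendix H for the message-passing form).

```python
import torch
import torch.nn as nn

class SoGCLayer(nn.Module):
    """Second-order graph convolution w0*I + w1*A + w2*A^2, with one weight
    matrix per order for multi-channel features (illustrative sketch)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lins = nn.ModuleList(nn.Linear(in_dim, out_dim, bias=False) for _ in range(3))

    def forward(self, adj_norm, x):
        # adj_norm: (N, N) symmetric normalized adjacency; x: (N, in_dim) node features.
        ax = adj_norm @ x            # one-hop aggregation
        aax = adj_norm @ ax          # two-hop aggregation, reusing the one-hop result
        return self.lins[0](x) + self.lins[1](ax) + self.lins[2](aax)

layer = SoGCLayer(16, 32)
out = layer(torch.eye(5), torch.randn(5, 16))    # placeholder graph, output shape (5, 32)
```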
3.1 Representation Power Justification
To be more precise, we introduce our Layer Spanning Space (LSS) framework in this subsection and mathematically prove that an arbitrary GC kernel can be decomposed into finitely many SoGC kernels.
First, to characterize the overall graph filter space and the filters expressed by a single layer, we define a graph filter space in which every polynomial filter has degree no more than $K$:
Definition 1.
(Graph filter space) Suppose the parameter space is $\mathbb{R}^{K+1}$. The Linear Shift-Invariant (LSI) graph filter space of degree $K$ with respect to a finite graph set $\mathcal{G}$ is defined as $\mathcal{H}_K = \{h_{\mathbf{w}} : \mathbf{w} \in \mathbb{R}^{K+1}\}$, where $h_{\mathbf{w}}$ follows the definition in Equation 1. (See Appendix A for a remark on how $\mathcal{H}_K$ relates to the term "Linear Shift-Invariant".)
We further provide Definition 2 and Lemma 1 to discuss the upper limit of $\mathcal{H}_K$'s dimension.
Definition 2.
(Spectrum capacity) Let the spectrum set be $\sigma(\mathcal{G}) = \bigcup_{G \in \mathcal{G}} \sigma(\tilde{A}_G)$, where $\sigma(\tilde{A}_G)$ denotes the set of eigenvalues of $\tilde{A}_G$. The spectrum capacity $\Lambda = |\sigma(\mathcal{G})|$ is the cardinality of all distinct graph eigenvalues. In particular, $\Lambda$ attains its maximum if the graph adjacency matrices share no common eigenvalues other than the trivial one.
Lemma 1.
The filter space $\mathcal{H}_K$ with degree $K \geq \Lambda - 1$ has dimension $\Lambda$ as a vector space.
Lemma 1 follows from Theorem 3 of Sandryhaila and Moura (2013). See the complete proof in Appendix C.
According to Lemma 1, one can define an ambient filter space of degree $\Lambda - 1$ (i.e., $\mathcal{H}_{\Lambda-1}$). Suppose a GCN consists of $K$-hop convolutional kernels $h^{(1)}, \ldots, h^{(L)} \in \mathcal{H}_K$, where the superscript indicates the layer number and $L$ denotes the network depth. We can consider each layer to be sampled from $\mathcal{H}_K$. We intend to quantify this GCN's filter representation power via its Layer Spanning Space (LSS):
$\mathcal{F}(\mathcal{H}_K, L) = \left\{ h^{(L)} \circ h^{(L-1)} \circ \cdots \circ h^{(1)} : h^{(l)} \in \mathcal{H}_K \right\} \qquad (4)$
where the whole LSS is constructed by varying the parameters of each $h^{(l)}$ over $\mathbb{R}^{K+1}$. When $\mathcal{F}(\mathcal{H}_K, L)$ covers the entire ambient space $\mathcal{H}_{\Lambda-1}$, we say the GCN composed of GC kernels in $\mathcal{H}_K$ has full filter representation power.
As $\mathcal{H}_K$ is a function space, the LSS can be analytically tricky. To investigate this space, we define a mapping $\rho: \mathcal{H}_K \to \mathbb{P}_K$, where $\mathbb{P}_K$ denotes the polynomial vector space of degree at most $K$:
$\rho(h_{\mathbf{w}}) = \sum_{k=0}^{K} w_k z^k \qquad (5)$
We provide Lemma 2 to reveal a useful property of $\rho$:
Lemma 2.
$\rho$ is a ring isomorphism when $K = \Lambda - 1$.
The proof can be found in Appendix D.
This ring isomorphism signifies that the composition of filters in $\mathcal{H}_K$ corresponds to polynomial multiplication. Therefore, one can study the LSS through the polynomials that can be factorized into the corresponding sub-polynomials. For example, Equation 4 is identical to:
$\rho\big(\mathcal{F}(\mathcal{H}_K, L)\big) = \left\{ p^{(L)} \cdot p^{(L-1)} \cdots p^{(1)} : p^{(l)} \in \mathbb{P}_K \right\} \qquad (6)$
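The correspondence between layer composition and polynomial multiplication can also be checked numerically. The sketch below is our own illustration (random symmetric matrix and arbitrary coefficients, assumed only for the example):

```python
import numpy as np
P = np.polynomial.polynomial

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2                    # any symmetric (diagonalizable) matrix works here

def filt(A, w):
    """Matrix polynomial sum_k w[k] * A^k."""
    out, Ak = np.zeros_like(A), np.eye(len(A))
    for wk in w:
        out += wk * Ak
        Ak = Ak @ A
    return out

w1, w2 = [0.3, -1.0, 0.5], [2.0, 0.1, -0.7]        # two second-order kernels
composed = filt(A, w2) @ filt(A, w1)               # composing the two layers
merged = filt(A, P.polymul(w1, w2))                # one filter with the multiplied polynomials
print(np.allclose(composed, merged))               # True
```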
In the rest of this subsection, we show that stacking $L$ SoGC layers (i.e., kernels in $\mathcal{H}_2$) can attain full representation power. That is, $\mathcal{F}(\mathcal{H}_2, L)$ covers the whole ambient filter space when $L$ is as large as $\lceil (\Lambda - 1)/2 \rceil$. We summarize a formal argument in the following theorem:
Theorem 1.
For any $h \in \mathcal{H}_{\Lambda-1}$, there exist $h^{(1)}, \ldots, h^{(L)} \in \mathcal{H}_2$ with suitable coefficients such that $h = h^{(L)} \circ h^{(L-1)} \circ \cdots \circ h^{(1)}$, where $L = \lceil (\Lambda - 1)/2 \rceil$.
Proof.
Proving Theorem 1 requires a fundamental polynomial factorization theorem, rephrased as below:
Lemma 3.
(Fundamental theorem of algebra) Over the field of real numbers, the degree of an irreducible non-trivial univariate polynomial is either one or two.
For any $h \in \mathcal{H}_{\Lambda-1}$, apply $\rho$ (cf. Equation 5) to map the kernel $h$ to a polynomial $p = \rho(h)$. By Lemma 1, $\deg p \leq \Lambda - 1$. By Lemma 3, factorize $p$ into a series of polynomials of degree at most two, and then merge first-order polynomials into second-order ones until a single first-order sub-polynomial or none remains. As a consequence, $p$ can be written as $p = p^{(L)} p^{(L-1)} \cdots p^{(1)}$, where $L = \lceil (\Lambda - 1)/2 \rceil$. If the degree of $p$ is even, $\deg p^{(l)} = 2$ for every $l$. Otherwise, all factors have degree two except for at most one whose degree is one.
The last step is to apply the inverse morphism $\rho^{-1}$. Since $\rho^{-1}$ is also a ring isomorphism, we have
$h = \rho^{-1}(p) = \rho^{-1}\big(p^{(L)}\big) \circ \cdots \circ \rho^{-1}\big(p^{(1)}\big),$
where $\rho^{-1}\big(p^{(l)}\big) \in \mathcal{H}_2$ by definition, which implies $h \in \mathcal{F}(\mathcal{H}_2, L)$. ∎
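The constructive step of the proof can be mirrored numerically: factor a target filter polynomial into real factors of degree at most two and check that their product recovers the original coefficients. The sketch below uses an arbitrary degree-4 example of our own choosing:

```python
import numpy as np
P = np.polynomial.polynomial

target = np.array([1.0, -2.0, 0.5, 3.0, -1.0])     # w_0..w_4 of an arbitrary filter
roots = P.polyroots(target)

real = sorted(r.real for r in roots if abs(r.imag) < 1e-9)
cplx = [r for r in roots if r.imag > 1e-9]         # one representative per conjugate pair

# Each conjugate pair and each pair of real roots yields a real quadratic factor.
factors = [np.array([abs(r) ** 2, -2 * r.real, 1.0]) for r in cplx]
while len(real) >= 2:
    r1, r2 = real.pop(), real.pop()
    factors.append(np.array([r1 * r2, -(r1 + r2), 1.0]))
if real:                                           # at most one first-order factor remains
    factors.append(np.array([-real.pop(), 1.0]))

rebuilt = np.array([target[-1]])                   # leading coefficient
for f in factors:
    rebuilt = P.polymul(rebuilt, f)
print(np.allclose(rebuilt, target))                # True: a product of (at most) quadratics
```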
Theorem 1 can be regarded as a universal approximation theorem for linear GCNs. Although nonlinear activation is not considered within our theoretical framework, we make the reasonable hypothesis that achieving linear filter completeness also boosts GCNs with nonlinearity (Ming Chen et al., 2020). See our experiments in Section 4.3.
3.2 Compared with Other Graph Convolution
In this subsection, we show that vanilla GCN (and GIN) does not attain full expressiveness in terms of filter representation. We also contrast our SoGC with multi-hop GCs (i.e., higher-order GCs in our terminology) to further reveal the prominence of SoGC. A brief comparison is summarized in Table 1.
Vanilla vs. second-order.
Vanilla GCN (Kipf and Welling, 2017) is a typical one-hop GCN with lumping of graph node self-connections and pairwise neighboring connections. Compared with SoGC, vanilla GC is more localized and computationally cheaper. However, this design has severe performance limitations (Hoang and Maehara, 2019; Wu et al., 2019; Li et al., 2018; Oono and Suzuki, 2019; Cai and Wang, 2020). We illustrate this issue in terms of filter approximation power based on the LSS framework.
Suppose a GCN stacks $L$ vanilla GC layers $h_{w^{(1)}}, \ldots, h_{w^{(L)}}$. Applying the mapping $\rho$ to its spanned LSS, the isomorphic polynomial space is:
$\rho(\mathcal{F}) = \left\{ c\, z^L : c \in \mathbb{R} \right\} \qquad (7)$
where $c = \prod_{l=1}^{L} w^{(l)}$. According to Equation 7, no matter how large $L$ is or how an optimizer tunes the parameters $w^{(l)}$, the image remains a one-dimensional set of monomials, which implies the LSS degenerates to a negligible subspace inside $\mathbb{P}_{\Lambda-1}$. GIN (Xu et al., 2019) disentangles the weights for neighborhoods and central nodes. We can write this GC layer as $h(\tilde{A}) = a I + b \tilde{A}$. The LSS of GIN is isomorphic to the polynomial space:
$\rho(\mathcal{F}) = \left\{ \prod_{l=1}^{L} \big(a^{(l)} + b^{(l)} z\big) : a^{(l)}, b^{(l)} \in \mathbb{R} \right\} \qquad (8)$
This polynomial space contains exactly the polynomials that split over the reals. However, not all polynomials can be factorized into first-order real polynomials. The expected number of real roots of a degree-$n$ polynomial with zero-mean random coefficients grows only logarithmically in $n$ (Ibragimov and Maslova, 1971). As the ambient dimension grows, all-real-root polynomials occupy only a small proportion of the ambient space (Li, 2011), which indicates that GIN does not have full expressiveness in terms of filter representation either.
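A quick numerical illustration of this scarcity (our own sketch, with an arbitrary degree and sample size) counts how often a random polynomial has only real roots:

```python
import numpy as np

rng = np.random.default_rng(0)
deg, trials, all_real = 8, 2000, 0
for _ in range(trials):
    w = rng.standard_normal(deg + 1)                       # zero-mean random coefficients
    roots = np.polynomial.polynomial.polyroots(w)
    all_real += bool(np.all(np.abs(roots.imag) < 1e-9))
print(all_real / trials)   # typically a tiny fraction: most filters are out of GIN's reach
```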
Higher-order vs. second-order.
Higher-order GCs refer to polynomial filters with degree of at least three (i.e., $K \geq 3$). They can model multi-hop GCNs such as those of Luan et al. (2019); Liao et al. (2019); Abu-El-Haija et al. (2019). Compared to SoGCs, higher-order GCs have equivalent expressive power, since they can be reduced to SoGCs. However, we point out four limitations of adopting higher-order kernels: 1) From our polynomial factorization perspective, fitting graph filters using higher-order GCs requires coefficient sparsity, which brings about learning difficulty. Abu-El-Haija et al. (2019) overcome this problem by adding lasso regularization and extra training procedures. Adopting SoGC avoids these troubles, since decomposition into second-order polynomials results in at most one zero coefficient (see Section 3.1). 2) Eigenvalues of graph adjacency matrices diminish when powered. This leads to a decreasing numerical rank of $\tilde{A}^k$ and makes aggregating larger-scale information ineffective. SoGCs alleviate this problem by avoiding higher-order powering operations. 3) Higher-order GC lacks nonlinearity. SoGCN strikes a better balance between the expressive power of low-level layers and the nonlinearity among them. 4) Multi-hop aggregation consumes more computational resources (see Table 1). In contrast, SoGC matches the time complexity of vanilla GCN by fixing the kernel size to two. A small numerical illustration of the second point is sketched below.
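The sketch is our own (graph size, edge probability, and thresholds are arbitrary); it checks how much of the spectrum of a normalized adjacency survives powering:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.triu(rng.random((50, 50)) < 0.1, 1)          # random undirected graph (illustrative)
A = (A + A.T).astype(float)
d = np.maximum(A.sum(axis=1), 1)
A_norm = A / np.sqrt(np.outer(d, d))                # eigenvalues lie in [-1, 1]

eigs = np.linalg.eigvalsh(A_norm)
for k in (1, 4, 8):
    surviving = np.mean(np.abs(eigs) ** k > 1e-2)
    print(k, surviving)   # the fraction of non-negligible eigenvalues shrinks as k grows
```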
3.3 Implementation of Second-Order Graph Convolutional Networks
In this subsection, we introduce the other building blocks used to establish our Second-Order Graph Convolutional Network (SoGCN), following the latest trends in GCN design. We promote SoGC to its multi-channel version, analogous to Kipf and Welling (2017). Then we prepend a feature embedding layer, cascade multiple SoGC layers, and append a readout module. Suppose the network input $X$ is supported on graph $G$; denote the output of the $l$-th layer as $X^{(l)}$, the final node-level output as $Y$, and the graph-level output as $y$. We formulate our deep GCN built with SoGC (cf. Equation 3) as follows:
$X^{(0)} = \mathrm{Embed}(X; \theta) \qquad (9)$
$X^{(l)} = \sigma\!\left( \sum_{k=0}^{2} \tilde{A}^k X^{(l-1)} W_k^{(l)} \right), \quad l = 1, \ldots, L \qquad (10)$
$Y = \mathrm{Readout}(X^{(L)}; \phi) \qquad (11)$
where $W_k^{(l)}$ are trainable weights for the linear filters; $\mathrm{Embed}$ is an equivariant embedder (Maron et al., 2018) with parameters $\theta$; $\sigma$ is an activation function. For node-level readout, $\mathrm{Readout}$ can be a decoder (with parameters $\phi$) or an output activation (e.g., softmax) in place of the prior layer. For graph-level output, $\mathrm{Readout}$ should be an invariant readout function (Maron et al., 2018), e.g., channel-wise sum, mean, or max. In practice, we adopt ReLU as the nonlinear activation $\sigma$, a multi-layer perceptron (MLP) as the embedding function, another MLP for node-regression readout, and sum pooling (Xu et al., 2019) for graph-classification readout.

We also provide a variant of SoGCN integrated with a Gated Recurrent Unit (GRU)
(Girault et al., 2015), termed SoGCN-GRU. According to Cho et al. (2014), a GRU can use its gating mechanism to preserve and forget information. We hypothesize that a GRU can be trained to remove redundant signals and retain lost features on the spectrum. Similar to Li et al. (2016); Gilmer et al. (2017), we append a shared GRU module after each GC layer, which takes the signal before the GC layer as the hidden state and the signal after the GC layer as the current input. We show by experiment that the GRU helps SoGCN suppress noise and enhance features on the spectrum (Figure 2). Our empirical study in Table 5 also indicates that the effectiveness of GRU for spectral GCNs is general. Hence, we suggest including this recurrent module as another basic building block of SoGCNs.

Table 2: Filter fitting results on the SGS synthetic datasets (Test MAE).

Model     | #Param | High-Pass | Low-Pass | Band-Pass
Vanilla   | 4611   | 0.308     | 0.317    | 0.559
GIN       | 4627   | 0.344     | 0.096    | 0.274
SoGCN     | 12323  | 0.021     | 0.023    | 0.050
3rd-Order | 16179  | 0.021     | 0.022    | 0.045
4th-Order | 20035  | 0.021     | 0.022    | 0.049
Table 3: Results on the GNN benchmarks of Dwivedi et al. (2020). ZINC reports Test MAE ± s.d. (lower is better); MNIST, CIFAR10, CLUSTER, and PATTERN report Test ACC ± s.d. (%).

Model             | ZINC (MAE)   | MNIST        | CIFAR10      | CLUSTER      | PATTERN
Vanilla GCN       | 0.367±0.011  | 90.705±0.218 | 55.710±0.381 | 53.445±2.029 | 63.880±0.074
Vanilla GCN + GRU | 0.295±0.005  | 96.020±0.090 | 61.332±0.849 | 57.932±0.168 | 70.194±0.216
GAT               | 0.384±0.007  | 95.535±0.205 | 64.223±0.455 | 57.732±0.323 | 75.824±1.823
MoNet             | 0.292±0.006  | 90.805±0.032 | 65.911±2.515 | 58.064±0.131 | 85.482±0.037
GraphSage         | 0.398±0.002  | 97.312±0.097 | 65.767±0.308 | 50.454±0.145 | 50.516±0.001
GIN               | 0.387±0.015  | 96.485±0.252 | 55.255±1.527 | 58.384±0.236 | 85.590±0.011
GatedGCN          | 0.350±0.020  | 97.340±0.143 | 67.312±0.311 | 60.404±0.419 | 84.480±0.122
3WLGNN            | 0.407±0.028* | 95.075±0.961 | 59.175±1.593 | 57.130±6.539 | 85.661±0.353
SoGCN             | 0.238±0.017  | 96.785±0.113 | 66.338±0.155 | 68.167±1.164 | 85.735±0.037
SoGCN-GRU         | 0.201±0.006  | 97.729±0.159 | 68.208±0.271 | 67.994±2.619 | 85.711±0.047

* Result of 3WLGNN with 100k parameters; with 500k parameters, the test MAE increases to 0.427±0.011.
4 Experiments
4.1 Synthetic Graph Spectrum Dataset for Filter Fitting Power Testing
To validate the expressiveness of SoGCN and its power to fit arbitrary graph filters, we build a Synthetic Graph Spectrum (SGS) dataset for a node signal filtering regression task. We construct the SGS dataset with random graphs. The learning task is to simulate three types of handcrafted filtering functions on the graph spectrum (defined over the graph eigenvectors): high-pass, low-pass, and band-pass. There are 1k training graphs, 1k validation graphs, and 2k testing graphs for each filtering function. Each graph is undirected and comprises 80 to 120 nodes. Appendix E covers more details of our SGS dataset. We choose Mean Absolute Error (MAE) as the evaluation metric.
Experimental Setup.
We compare SoGCN with vanilla GCN (Kipf and Welling, 2017), GIN (Xu et al., 2019), and higher-order GCNs on the synthetic dataset. To evaluate each model's expressiveness purely on the GC kernel design, we remove ReLU activations for all tested models. We adopt the Adam optimizer (Kingma and Ba, 2015) in our training process, with a batch size of 128. The learning rate begins at 0.01 and decays by half once the validation loss stagnates for more than 10 training epochs.
Results and Discussion.
Table 2 summarizes the quantitative comparisons. SoGCN achieves superior performance on all three tasks, outperforming vanilla GCN and GIN, which implies that the SoGC graph convolutional kernel does benefit from explicit disentangling of the second-hop neighborhood. Our results also show that higher-order (3rd-order and 4th-order) GCNs do not improve the performance further, even though they incorporate many more parameters. SoGCN is more expressive and strikes a better trade-off between performance and model size.
Figure 3 plots MAE results as we vary the number of GC layers for each graph kernel type. Vanilla GCN and GIN cannot benefit from depth, while SoGC and higher-order GCs can leverage depth to span a larger LSS, contributing to the remarkable filtering results. SoGC and higher-order GCs perform very closely once the number of layers increases, which suggests higher-order GCs do not obtain more expressiveness than SoGC.
Table 4: Results on the ogb-protein dataset.

Model         | #Param | Time / Ep. | ROC-AUC ± s.d.
Vanilla GCN   | 96880  | 3.47±0.40  | 72.16±0.55
GIN           | 128512 | 4.33±0.27  | 76.77±0.20
GCNII         | 227696 | 4.96±0.29  | 74.79±1.17
GCNII*        | 424304 | 5.09±0.17  | 72.50±2.49
APPNP         | 96880  | 6.56±0.37  | 65.37±1.15
GraphSage     | 193136 | 6.51±0.13  | 77.53±0.30
SoGCN         | 192512 | 4.88±0.36  | 79.28±0.47
4th-Order GCN | 320512 | 8.89±0.82  | 78.95±0.57
6th-Order GCN | 448512 | 9.76±0.64  | 78.61±0.42
Table 5: Ablation study on ZINC (Test MAE) and MNIST/CIFAR10 (Test ACC, %). Entries report the improvement over the vanilla GCN baseline (MAE reduction for ZINC, accuracy gain for MNIST/CIFAR10).

Model               | ZINC       | MNIST      | CIFAR10
Vanilla GCN         | (Baseline) | (Baseline) | (Baseline)
SoGCN               | 0.129      | 6.080      | 10.628
4th-Order GCN       | 0.124      | 5.462      | 8.520
6th-Order GCN       | 0.106      | 5.587      | 7.977
Vanilla GCN + GRU   |            |            |
SoGCN + GRU         | 0.166      | 7.024      | 12.498
4th-Order GCN + GRU |            |            |
6th-Order GCN + GRU |            |            |
4.2 OGB Benchmarks
We choose the Open Graph Benchmark (OGB) (Hu et al., 2020) to compare our SoGC with other GCNs in terms of parameter count, training time per epoch, and test ROC-AUC. We only demonstrate the results for predicting the presence of protein functions (multi-label graph classification). We refer interested readers to Appendix G for more results on OGB.
Experimental Setup.
The chosen models mainly include spectral-domain models: vanilla GCN, GIN, APPNP (Klicpera et al., 2019), GCNII (Ming Chen et al., 2020), our SoGCN, and two higher-order GCNs. We also report the performance of GraphSage, a vertex-domain GNN baseline, for reference. We build GIN, GCNII, GraphSage, and APPNP based on the official implementations in PyTorch Geometric (Fey and Lenssen, 2019). Every model consists of three GC layers and the same node embedder and readout modules. We borrow the method of Dwivedi et al. (2020) to compute the number of parameters. We run an exclusive training program on an Nvidia Quadro P6000 GPU to measure the training time per epoch. We follow the same training and evaluation procedures as the OGB benchmarks to ensure fair comparisons. We train each model until convergence (~1k epochs for vanilla GCN, GIN, and GraphSage; ~3k epochs for SoGCN, higher-order GCNs, APPNP, and GCNII).

Results and Discussion.
Table 4 reports the ROC-AUC score for each model on the ogb-protein dataset. Our SoGCN achieves the best performance among all presented GCNs, while its parameter count and time complexity are only slightly higher than GIN's (consistent with Table 1). SoGC is more expressive than other existing graph filters (such as APPNP and GCNII), and it also outperforms the message-passing GNN baseline GraphSage. Compared with higher-order (4th- and 6th-order) GCNs, the ROC-AUC score of SoGCN surpasses all of them while reducing model complexity significantly.
4.3 GNN Benchmarks
We follow the benchmarks outlined in Dwivedi et al. (2020) for evaluating GNNs on several datasets across a variety of artificial and real-world tasks. We evaluate our SoGCN on a real-world chemistry dataset (ZINC molecules) for graph regression, two semi-artificial computer vision datasets (CIFAR10 and MNIST superpixels) for graph classification, and two artificial social network datasets (CLUSTER and PATTERN) for node classification.
Experimental Setup.
We compare our proposed SoGCN and SoGCN-GRU with state-of-the-art GNNs: vanilla GCN, GIN, GraphSage, GAT (Veličković et al., 2018), MoNet (Monti et al., 2017), GatedGCN (Bresson and Laurent, 2017), and 3WLGNN (Maron et al., 2019). To ensure fair comparisons, we follow the same training and evaluation pipelines (including optimizer settings) and data splits as the benchmarks. Furthermore, we adjust our model's depth and width so that it satisfies the parameter budgets specified in the benchmark. Note that we do not use any geometric information to encode rich graph edge relationships, as in models such as GatedGCN-E-PE. We only employ graph connectivity information for all tested models.
Results and Discussion.
Table 3 reports the benchmark results. Our model SoGCN makes small computational changes to GCN by adopting the second-hop neighborhood, yet it outperforms models with complicated message-passing mechanisms, such as GAT and GraphSage. With the GRU module, SoGCN-GRU tops almost all state-of-the-art GNNs on the ZINC, MNIST, and CIFAR10 datasets. In Figure 2, we visualize the spectrum of the last layer's feature activations on the ZINC dataset. One can see that SoGC can extract features on high-frequency bands, and the GRU further sharpens these patterns. However, the GRU does not lift accuracy on the CLUSTER and PATTERN datasets for node classification. Following Li et al. (2018), we attribute the slight performance drop on CLUSTER and PATTERN to the GRU suppressing the low-frequency band.
Ablation Study.
To contrast the performance gains produced by different aggregation ranges and by the GRU on the benchmarks, we evaluate vanilla GCN, SoGCN, 4th-Order GCN, and 6th-Order GCN as well as their GRU variants on the ZINC, MNIST, and CIFAR10 datasets. Table 5 presents the results of our ablation study, which are consistent with our observations in Sections 4.1 and 4.2. As shown by the ablation study, adopting the second-hop aggregation yields a large performance gain (vanilla GCN vs. SoGCN). However, higher-order GCNs are not capable of boosting the performance further over SoGCN. On the contrary, higher-order GCs can even lead to a performance drop (4th-Order GCN vs. 6th-Order GCN vs. SoGCN). We also test the GRU's effectiveness for each presented model, but the gain brought by the GRU is not as large as that of adding second-hop aggregation. Figure 2 shows that SoGC can extract patterns on the spectrum alone; the GRU plays the role of enhancing these features.
5 Conclusion
What should be the basic convolutional block for GCNs? To answer this, we seek the most localized graph convolution (GC) kernel with full expressiveness. We establish our LSS framework to assess GC layers of different aggregation ranges. We show that the second-order graph convolutional filter, termed SoGC, possesses the full representation power that one-hop GCs lack. Hence, it serves as the simplest efficient GC building block, which we adopt to establish our SoGCN. Both synthetic and benchmark experiments exhibit the prominence of our theoretical design. We also make an empirical study of the GRU's effects in spectral GCNs. Interesting directions for future work include analyzing two-hop aggregation schemes within message-passing GNNs and proving the universality of nonlinear GCNs.
References
Abu-El-Haija et al. MixHop: higher-order graph convolutional architectures via sparsified neighborhood mixing. In ICML.
Bianchi et al. Graph neural networks with convolutional ARMA filters. CoRR.
Bresson and Laurent. Residual gated graph ConvNets. arXiv:1711.07553.
Bruna et al. Spectral networks and locally connected networks on graphs. In ICLR.
Cai and Wang. A note on over-smoothing for graph neural networks. In ICML.
Cho et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078.
Chung and Graham. Spectral graph theory. American Mathematical Society.
Defferrard et al. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS.
Dehmamy et al. Understanding the representation power of graph neural networks in learning graph topology. In NeurIPS.
Dwivedi et al. Benchmarking graph neural networks. arXiv:2003.00982.
Fey et al. SplineCNN: fast geometric deep learning with continuous B-spline kernels. In CVPR.
Fey and Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
Gilmer et al. Neural message passing for quantum chemistry. In ICML.
Girault et al. Translation on graphs: an isometric shift operator. IEEE Signal Processing Letters.
Hamilton et al. Inductive representation learning on large graphs. In NeurIPS.
Hammond et al. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis.
Hoang and Maehara. Revisiting graph neural networks: all we have is low-pass filters. arXiv:1905.09550.
Hu et al. Open Graph Benchmark: datasets for machine learning on graphs. arXiv:2005.00687.
Ibragimov and Maslova. The mean number of real zeros of random polynomials. I. Coefficients with zero mean. Theory of Probability & Its Applications, 16, pp. 228-248.
Kingma and Ba. Adam: a method for stochastic optimization. In ICLR.
Kipf and Welling. Semi-supervised classification with graph convolutional networks. In ICLR.
Klicpera et al. Predict then propagate: graph neural networks meet personalized PageRank. In ICLR.
LeCun et al. Deep learning. Nature.
Li et al. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
Li. Probability of all real zeros for random polynomial with the exponential ensemble. Preprint.
Li et al. Gated graph sequence neural networks. In ICLR.
Liao et al. LanczosNet: multi-scale deep graph convolutional networks. In ICLR.
Luan et al. Break the ceiling: stronger multi-scale deep graph convolutional networks. In NeurIPS.
Maron et al. Provably powerful graph networks. In NeurIPS.
Maron et al. Invariant and equivariant graph networks. In ICLR.
Ming Chen et al. Simple and deep graph convolutional networks. In ICML.
Monti et al. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR.
Morris et al. Weisfeiler and Leman go neural: higher-order graph neural networks. In AAAI.
Oono and Suzuki. Graph neural networks exponentially lose expressive power for node classification. In ICLR.
Pei et al. Geom-GCN: geometric graph convolutional networks. In ICLR.
Rong et al. DropEdge: towards deep graph convolutional networks on node classification. In ICLR.
Sandryhaila and Moura. Discrete signal processing on graphs. IEEE Transactions on Signal Processing.
Shi and Malik. Normalized cuts and image segmentation. TPAMI.
Simonyan and Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Veličković et al. Graph attention networks. In ICLR.
Wang et al. Deep Graph Library: a graph-centric, highly-performant package for graph neural networks. arXiv:1909.01315.
Wu et al. Simplifying graph convolutional networks. In ICML.
Xu et al. How powerful are graph neural networks? In ICLR.
Xu et al. Representation learning on graphs with jumping knowledge networks. In ICML.
Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis.
Appendix A Remark on Definition 1
Let us restate Definition 1:
We claim that the functions in $\mathcal{H}_K$ are all Linear Shift-Invariant (LSI) with respect to the adjacency matrix.
Proof.
Given an arbitrary graph $G$ with normalized adjacency matrix $\tilde{A}$, any filter $h_{\mathbf{w}}$ associated with it can be written as below:
$h_{\mathbf{w}}(\tilde{A}) = \sum_{k=0}^{K} w_k \tilde{A}^k = U \Big( \sum_{k=0}^{K} w_k D^k \Big) U^\top,$
where $\tilde{A} = U D U^\top$ is the eigendecomposition of $\tilde{A}$. Therefore, $h_{\mathbf{w}}(\tilde{A})$ is also diagonalized by the eigenvectors of $\tilde{A}$. By Lemma 4:
Lemma 4.
Diagonalizable matrices $A$ and $B$ are simultaneously diagonalizable if and only if $AB = BA$.
we conclude that $h_{\mathbf{w}}(\tilde{A})$ commutes with $\tilde{A}$: for any signal $\mathbf{x}$, $h_{\mathbf{w}}(\tilde{A})\,\tilde{A}\,\mathbf{x} = \tilde{A}\,h_{\mathbf{w}}(\tilde{A})\,\mathbf{x}$. ∎
Appendix B Ring Isomorphism
We introduce a mathematical device that bridges the gap between the filter space $\mathcal{H}_K$ and the polynomial space $\mathbb{P}_K$.
Since $\mathcal{G}$ is finite, we can construct a block diagonal matrix $\mathbf{A}$ with the adjacency matrix of every graph on the diagonal:
$\mathbf{A} = \mathrm{diag}\big(\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_{|\mathcal{G}|}\big) \qquad (12)$
Remark 1.
The spectrum capacity $\Lambda$ in Definition 2 equals the number of eigenvalues of $\mathbf{A}$ counted without multiplicity.
Eigenvalues of adjacency matrices signify graph similarity. The spectrum capacity identifies a set of graphs by enumerating their structural patterns. Even if the graph set becomes extremely large (to guarantee generalization capability), the distribution of the spectrum provides an upper bound on $\Lambda$, so our theory retains its generality.
Now we construct a matrix space by applying a ring homomorphism to every element of $\mathcal{H}_K$:
(13) 
Concretely, we write the matrix space as follows:
(14) 
In the rest of this section, we prove that this mapping is a ring isomorphism.
Proof.
First, we can verify that the mapping is a ring homomorphism because it preserves addition and multiplication. Second, surjectivity follows from the definition of the matrix space (cf. Equation 14).
Finally, we show its injectivity as follows: Consider any pair of with parameters , there exists and such that . After applying , we have their images . Let , where denote the allzero vector of length , then we have:
Hence, injectivity follows. ∎
Appendix C Proof of Lemma 1
Proof.
One can show $\mathcal{H}_K$ is a vector space by verifying that it is closed under linear combination (this also follows from the ring isomorphism $\rho$).
Due to the isomorphism, $\mathcal{H}_K$ and its image share the same dimension. Then Lemma 1 follows from Theorem 3 of Sandryhaila and Moura (2013). We briefly summarize the proof below.
Let $m$ denote the minimal polynomial of $\mathbf{A}$. We have $\deg m = \Lambda$. Suppose $\dim \mathcal{H}_K = d$. First, $d$ cannot be larger than $\Lambda$, because $\{I, \mathbf{A}, \ldots, \mathbf{A}^{\Lambda-1}\}$ is a spanning set. If $d < \Lambda$, then there exists some polynomial $q$ with $\deg q < \Lambda$ such that $q(\mathbf{A}) = 0$. This contradicts the minimality of $m$. Therefore, $d$ can only be $\Lambda$.
Suppose $K \geq \Lambda$. For any $h_{\mathbf{w}} \in \mathcal{H}_K$ whose polynomial $p = \rho(h_{\mathbf{w}})$ has $\deg p \geq \Lambda$, by polynomial division there exist unique polynomials $q$ and $r$ such that
$p = q \cdot m + r, \qquad (15)$
where $\deg r < \Lambda$. We insert $\mathbf{A}$ into Equation 15 and obtain $p(\mathbf{A}) = q(\mathbf{A})\, m(\mathbf{A}) + r(\mathbf{A}) = r(\mathbf{A})$, since $m(\mathbf{A}) = 0$.
Therefore, $\{I, \mathbf{A}, \ldots, \mathbf{A}^{\Lambda-1}\}$ forms a basis of the image of $\mathcal{H}_K$, i.e., $\dim \mathcal{H}_K = \Lambda$. ∎
Appendix D Proof of Lemma 2
Proof.
Consider the following mapping:
(16) 
When , (as ), which implies is a ring isomorphism as well. Since function composition preserves isomorphism property, we can conclude the proof by showing that . ∎
Remark 2.
The assumption that each graph has the same number of vertices is made only for the sake of simplicity. Lemma 1 and Lemma 2 still hold when the vertex numbers vary, since the construction of $\mathbf{A}$ (cf. Equation 12) is independent of this assumption.
Remark 3.
The graph set $\mathcal{G}$ needs to be finite; otherwise the spectrum set might be uncountable. We leave the discussion of infinite graph sets for future study.
Appendix E Synthetic Graph Spectrum Dataset
Our Synthetic Graph Spectrum (SGS) dataset is designed for testing the filter fitting power of spectral GCNs. It includes three types of graph signal filters: High-Pass (HP), Low-Pass (LP), and Band-Pass (BP). For each type, we generate 1k, 1k, and 2k undirected graphs, along with graph signals and ground-truth responses, for the training, validation, and test sets, respectively. Each graph has 80~120 nodes and 80~350 edges. Models are trained on each dataset to learn the corresponding filter with supervision on the MAE loss.
For each sample, we generate an undirected Erdős–Rényi random graph with normalized adjacency matrix $\tilde{A}$, i.e., the existence of an edge between each pair of nodes follows a Bernoulli distribution. In our experiments, we set the edge probability so that the target edge counts are satisfied. We also compute the eigendecomposition $\tilde{A} = U D U^\top$, where the diagonal entries of $D$ are the eigenvalues and the columns of $U$ are the corresponding eigenvectors.

Next, we generate input graph signals on the spectral domain. Independently sampling each frequency from a distribution tends to generate white noise. Hence, we synthesize the spectrum by summing random functions. We notice that a mixture of beta functions and Gaussian functions is a powerful model for constructing diverse curves by tuning the shape parameters. We sum two discretized beta functions and four discretized Gaussian functions with random parameters to generate signal spectra. Equation 17 elaborates the generation process and the hyperparameters chosen in our experiments, where the summands are the PDFs of beta and Gaussian distributions, respectively.
(17)
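A rough sketch of this generation procedure is shown below; the graph size, edge probability, bump counts, and distribution parameters are placeholders, since the exact hyperparameters of Equation 17 are not reproduced here.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(0)

# Erdos-Renyi graph and its symmetric normalized adjacency (placeholder size/probability).
n = int(rng.integers(80, 121))
A = np.triu(rng.random((n, n)) < 0.03, 1)
A = (A + A.T).astype(float)
d = np.maximum(A.sum(axis=1), 1)
A_norm = A / np.sqrt(np.outer(d, d))
eigvals, U = np.linalg.eigh(A_norm)          # graph spectrum and Fourier basis

# Spectrum = sum of random beta and Gaussian bumps over the frequency axis (cf. Equation 17).
t = np.linspace(0, 1, n)
spec = sum(beta.pdf(t, a=rng.uniform(1, 5), b=rng.uniform(1, 5)) for _ in range(2))
spec = spec + sum(norm.pdf(t, loc=rng.uniform(0, 1), scale=rng.uniform(0.05, 0.2)) for _ in range(4))

x = U @ spec                                 # inverse graph Fourier transform
x = x + 0.01 * rng.standard_normal(n)        # Gaussian observation noise (placeholder scale)
```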
We can retrieve the vertex-domain signals via the inverse graph Fourier transform $\mathbf{x} = U \hat{\mathbf{x}}$. Then Gaussian noise is added to the vertex-domain signals to simulate observation errors.
Appendix F More Visualizations of Spectrum
For multi-channel node signals $X \in \mathbb{R}^{N \times C}$, where $N$ is the number of nodes and $C$ is the number of signal channels, the spectrum of $X$ is computed by $\hat{X} = U^\top X$. More information about the graph spectrum and the graph Fourier transform can be found in Sandryhaila and Moura (2013).
Figure 5 shows the output spectra of vanilla GCN, GIN, and SoGCN on the synthetic Band-Pass dataset. The visualizations are consistent with the results in Table 2 and Figure 3 in the main text. Vanilla GCN loses almost the entire pass band, resulting in very poor performance. GIN learns to pass part of the middle-frequency band but still falls short of the ground truth. SoGCN's filtering response is close to the ground-truth response, showing its strong ability to represent graph signal filters.
We arbitrarily sample graph data from the ZINC dataset as input and visualize the output spectra of vanilla GCN, SoGCN, and their GRU variants in Figure 6. Each curve in the visualization represents the spectrum of one output channel, i.e., each column of $\hat{X}$ is plotted as a curve.
Appendix G More Experiments
G.1 Additional Experiments on SGS Dataset
We supplement two experiments to compare vanilla GCN (Kipf and Welling, 2017), GIN (Xu et al., 2019), SoGCN, and 4th-order GCN on the synthetic High-Pass and Low-Pass datasets, respectively. From Figure 7, we conclude that SoGCN and higher-order GCNs perform closely on the High-Pass and Low-Pass datasets and achieve remarkable filtering capability, while vanilla GCN and GIN cannot converge to comparable results even as the number of layers increases. This conclusion is consistent with the previous results on the Band-Pass dataset presented in the main text.
G.2 Additional Experiments on OGB Benchmark
In the main text, we demonstrated our results on the ogb-protein dataset (Hu et al., 2020) for node-level tasks. In this subsection, we also show SoGCN's effectiveness on the ogb-molhiv dataset (Hu et al., 2020) for graph-level tasks. As with the experiments on ogb-protein, we evaluate the different GCN models in terms of their total parameter counts, training time per epoch, and test ROC-AUC.
Experimental Setup.
Again, we choose vanilla GCN, GIN, GraphSage (Hamilton et al., 2017), APPNP (Klicpera et al., 2019), GCNII (Ming Chen et al., 2020), ARMA (Bianchi et al., 2019), our SoGCN, and two higher-order GCNs for performance comparison. We adopt the example code for vanilla GCN and GIN provided in OGB. We reimplemented GCNII, GraphSage, APPNP, and ARMA based on the official code in PyTorch Geometric (Fey and Lenssen, 2019). According to the benchmark's guideline, we add edge features to fan-out node features during propagation. Every model has the same depth and width, as well as the same other modules. The timing, training, and evaluation procedures conform to the descriptions in the main text. We train vanilla GCN, GIN, APPNP, and GraphSage for ~100 epochs, and SoGCN, higher-order GCNs, GCNII, and ARMA for ~500 epochs.
Table 6: Results on the ogb-molhiv dataset.

Model         | #Param    | Time / Ep. | ROC-AUC ± s.d.
Vanilla GCN   | 527,701   | 25.57±1.37 | 76.06±0.97
GIN           | 980,706   | 29.01±1.24 | 75.58±1.40
GCNII         | 524,701   | 24.19±1.26 | 77.04±1.03
APPNP         | 327,001   | 13.56±1.32 | 68.00±1.36
GraphSage     | 976,201   | 24.43±1.39 | 76.90±1.36
ARMA          | 8,188,201 | 43.14±0.99 | 76.91±1.75
SoGCN         | 1,426,201 | 27.02±1.28 | 77.26±0.85
4th-Order GCN | 2,326,201 | 32.24±1.10 | 77.24±1.21
6th-Order GCN | 3,226,201 | 37.64±1.15 | 77.10±0.72
Results and Discussion.
Table 6 reports the ROC-AUC score for each model on the ogb-molhiv dataset. We reach the same conclusion as in the main text. On the ogb-molhiv dataset, we notice that GCNII is another lightweight yet effective model. However, GCNII only allows inputs whose channel number equals the output dimension. One needs to add additional blocks (e.g., linear modules) to support varying hidden dimensions, which introduces more parameters and higher complexity (e.g., on the ogb-protein dataset).
Appendix H Implementation Details
We open-source our implementation of SoGCN at https://github.com/yuehaowang/SoGCN. All of our code, datasets, hyperparameters, and runtime configurations can be found there.
H.1 Second-Order Graph Convolution
Our SoGC can be implemented using a message-passing scheme (Hamilton et al., 2017) (cf. Equation 19). We regard the normalized adjacency matrix as a one-hop aggregator (message propagator). When we compute the powers of $\tilde{A}$, we invoke the propagator multiple times. After passing messages twice, we transform and mix the aggregated information from the two hops via a linear block.
$m_i^{(1)} = \sum_{j \in \mathcal{N}(i)} \frac{1}{\sqrt{d_i d_j}}\, x_j, \qquad m_i^{(2)} = \sum_{j \in \mathcal{N}(i)} \frac{1}{\sqrt{d_i d_j}}\, m_j^{(1)}, \qquad y_i = W_0\, x_i + W_1\, m_i^{(1)} + W_2\, m_i^{(2)} \qquad (19)$
where $x_i$ is the input feature vector of node $i$ and $y_i$ denotes the output for node $i$. $d_i$ is the degree of vertex $i$, and $\mathcal{N}(i)$ is the set of $i$'s neighboring vertices. $m_i^{(1)}$ is the feature representation of $i$'s first-hop neighborhood; it is computed by aggregating information once from the directly neighboring nodes. $m_i^{(2)}$ is the feature representation of $i$'s second-hop neighborhood; it is computed by feature aggregation over the neighbors' $m_j^{(1)}$. $W_0$, $W_1$, $W_2$ are the weight matrices (a.k.a. layer parameters).
Our design reduces computational time by reusing previously aggregated information and avoiding explicit power operations on $\tilde{A}$. In practice, our SoGC is easy to implement. Our message-passing design conforms to mainstream graph learning frameworks, such as Deep Graph Library (Wang et al., 2019) and PyTorch Geometric (Fey and Lenssen, 2019). One can simply add another group of parameters and invoke the "propagation" method of vanilla GC (Kipf and Welling, 2017) twice to simulate our SoGC. For the sake of clarity, we provide the pseudocode for general-order GCs in Algorithm 1; our SoGC corresponds to passing $K = 2$. A Python sketch in this spirit is given below.
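The class name and the use of a dense or sparse adjacency tensor below are our assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class PolyGCLayer(nn.Module):
    """General order-K graph convolution; SoGC corresponds to K=2.
    The one-hop propagator (normalized adjacency) is reused K times."""
    def __init__(self, in_dim, out_dim, K=2):
        super().__init__()
        self.lins = nn.ModuleList(nn.Linear(in_dim, out_dim, bias=False) for _ in range(K + 1))

    def forward(self, adj_norm, x):
        # adj_norm: normalized adjacency (dense or sparse); x: (N, in_dim) node features.
        out, h = self.lins[0](x), x
        for lin in self.lins[1:]:
            h = torch.sparse.mm(adj_norm, h) if adj_norm.is_sparse else adj_norm @ h
            out = out + lin(h)     # mix in the next hop's aggregated features
        return out

sogc = PolyGCLayer(16, 32, K=2)    # our SoGC: pass K=2
```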
H.2 Gated Recurrent Unit
We note two motivations behind using the Gated Recurrent Unit (GRU) (Cho et al., 2014): 1) GRU has served as a basic building block in message-passing GNN architectures (Li et al., 2016; Gilmer et al., 2017); we make an explorative attempt to introduce it into spectral GCNs. 2) By selectively maintaining information from the previous layer and canceling the dominance of DC components (Figure 6), GRU can also relieve the side effect of ReLU, which has been shown to act as a special low-pass filter (Oono and Suzuki, 2019; Cai and Wang, 2020).
Similar to Li et al. (2016); Gilmer et al. (2017), we append a shared GRU module after each GC layer, which takes the signal before the GC layer as the hidden state and the signal after the GC layer as the current input. We formulate the implementation by replacing Equation 10 with Equation 20 below.
$X^{(l)} = \mathrm{GRU}\!\left( \sum_{k=0}^{2} \tilde{A}^k X^{(l-1)} W_k^{(l)},\; X^{(l-1)};\; \psi \right) \qquad (20)$
where the first argument of the GRU is the current input (the signal after the GC layer), the second argument is the hidden state (the signal before the GC layer), and $\psi$ denotes the parameters of the GRU. Figure 6 illustrates the spectrum outputs of vanilla GCN + GRU and SoGCN + GRU. One can see that, without the filtering power of SoGCN, vanilla GCN + GRU fails to extract sharp patterns on the spectrum. We therefore suggest that it is SoGC that mainly contributes to the higher expressiveness.
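A sketch of the SoGCN-GRU block corresponding to Equation 20 is given below, reusing the PolyGCLayer sketch from Appendix H.1; the naming and shapes are our own assumptions.

```python
import torch
import torch.nn as nn

class SoGCGRUBlock(nn.Module):
    """One SoGC layer followed by a GRU cell shared across layers (cf. Equation 20)."""
    def __init__(self, sogc, gru_cell):
        super().__init__()
        self.sogc, self.gru = sogc, gru_cell

    def forward(self, adj_norm, x):
        h = self.sogc(adj_norm, x)     # signal after the GC layer -> current GRU input
        return self.gru(h, x)          # signal before the GC layer -> GRU hidden state

gru_cell = nn.GRUCell(32, 32)          # one GRU cell shared by all layers
block = SoGCGRUBlock(PolyGCLayer(32, 32, K=2), gru_cell)
out = block(torch.eye(5), torch.randn(5, 32))
```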