# Overcoming Oversmoothness in Graph Convolutional Networks via Hybrid Scattering Networks

Geometric deep learning (GDL) has made great strides towards generalizing the design of structure-aware neural network architectures from traditional domains to non-Euclidean ones, such as graphs. This gave rise to graph neural network (GNN) models that can be applied to graph-structured datasets arising, for example, in social networks, biochemistry, and material science. Graph convolutional networks (GCNs) in particular, inspired by their Euclidean counterparts, have been successful in processing graph data by extracting structure-aware features. However, current GNN models (and GCNs in particular) are known to be constrained by various phenomena that limit their expressive power and ability to generalize to more complex graph datasets. Most models essentially rely on low-pass filtering of graph signals via local averaging operations, thus leading to oversmoothing. Here, we propose a hybrid GNN framework that combines traditional GCN filters with band-pass filters defined via the geometric scattering transform. We further introduce an attention framework that allows the model to locally attend over the combined information from different GNN filters at the node level. Our theoretical results establish the complementary benefits of the scattering filters to leverage structural information from the graph, while our experiments show the benefits of our method on various learning tasks.


## 1 Introduction

Deep learning is typically most effective when the structure of the data can be used to design the architecture of the relevant network. For example, the design of recurrent neural networks is informed by the sequential nature of time-series data. Similarly, the design of convolutional neural networks is based in part on the fact that the pixels of an image are arranged in a rectangular grid. The success of neural networks in these, as well as many other applications, has inspired the rise of geometric deep learning [1, 2], which aims to extend the success of deep learning to other forms of structured data and to develop intelligent methods for data sets that have a non-Euclidean structure.

A common approach in geometric deep learning is to model the data by a graph. In many applications, this is done by defining edges between data points that interact in a specific way, e.g., “friends” on a social network. In many other applications, one may construct a graph from a high-dimensional data set, either by defining an edge between each point and its $k$-nearest neighbors or by defining weighted edges via a similarity kernel. Inspired by the increasing ubiquity of graph-structured data, numerous recent works have shown graph neural networks (GNNs) to perform well in a variety of fields including biology, chemistry and social networks [3, 4, 5]. In these methods, the graph is often considered in conjunction with a set of node features, which contain “local” information about, e.g., each user of a social network.

One common family of tasks consists of so-called graph-level tasks, where one seeks to learn a whole-graph representation for the purposes of, e.g., predicting properties of proteins [6, 7, 8]. Another common family of tasks, which has been the primary focus of graph convolutional networks (GCNs) [5], consists of node-level tasks such as node classification. There, the entire data set is modeled as one large graph and the network aims to produce a useful representation of each node using both the node features and the graph structure. This work is typically conducted in the semi-supervised setting where one only knows the labels of a small fraction of the nodes.

Many popular state-of-the-art GNNs essentially aim to promote similarity between adjacent nodes, which may be interpreted as a smoothing operation. While this is effective in certain settings, it can also cause a decrease in performance because of the oversmoothing problem [9], where nodes become increasingly indistinguishable from one another after each subsequent layer. One promising approach for addressing this problem is the graph attention network [10], which uses attention mechanisms computed from node features to learn adaptive weighting in its message passing operations. However, this approach still fundamentally aims at enforcing similarity among neighbors, albeit in an adaptive manner.

Here, we propose to augment traditional GNN architectures by also including novel band-pass filters, in addition to conventional GCN-style filters that essentially perform low-pass filtering [11], in order to extract richer representations for each node. (Throughout this text, we will use the term GCN to refer to the network introduced by Kipf and Welling in [5]. We will use the term GNN to refer to graph neural networks, spectral, convolutional, or otherwise, in general.) This approach is based on the geometric scattering transform [12, 13, 14], whose construction is inspired by the Euclidean scattering transform introduced by Mallat in [15], and utilizes iterative cascades of graph wavelets [16, 17] and pointwise nonlinear activation functions to produce useful graph data representations.

The main contribution of this work is two hybrid GNN frameworks that utilize both traditional GCN-style filters and also novel filters based on the scattering transform. This approach is based on the following simple idea: GCN-based networks are very useful, but as they aim to enforce similarity among nodes, they essentially focus on low-frequency information. Wavelets, on the other hand, are naturally equipped to capture high-frequency information. Therefore, in a hybrid network, the different channels will capture different types of information. Such a network will therefore be more powerful than a network that only uses one style of filter or the other.

We also introduce complementary GNN modules that enhance the performance of such hybrid scattering models, including (i) the graph residual convolution, an adaptive low-pass filter that corrects high-frequency noise, and (ii) an attention framework that enables the aggregation of information from different filters individually at every node. We present theoretical results, based on a new notion of graph structural difference, that highlight the sensitivity of scattering filters to graph regularity patterns not captured by GCN filters. Extensive empirical experiments on (transductive) semi-supervised node classification demonstrate the ability of hybrid scattering networks to (i) alleviate oversmoothing and (ii) generalize to complex (low-homophily) datasets. Moreover, we also present empirical evidence that our framework translates well to inductive graph-level tasks.

The remainder of this paper is organized as follows. We review related work on GNN models and geometric scattering in Sec. 2 and introduce important concepts that will be used throughout this work in Sec. 3. We then formulate the hybrid scattering network in Sec. 4, followed by a theoretical study of the benefits of such models in Sec. 5. In Sec. 6, we present empirical results before concluding in Sec. 7.

## 2 Related Work

Theoretical analyses [9, 11] of GCN and related models show that they may be viewed as Laplacian smoothing operations and, from the signal processing perspective, essentially act as low-pass filters on the graph features. One approach towards addressing this problem is the graph attention network proposed by [10], which uses self-attention mechanisms to adaptively reweight local neighborhoods. In [18], the authors construct a low-rank approximation of the graph Laplacian that efficiently gathers multi-scale information and demonstrate the effectiveness of their method on citation networks and the QM8 quantum chemistry dataset. In [19], the authors take an approach similar to GCN, but use multiple powers of the adjacency matrix to learn higher-order neighborhood information. Finally, in [20] the authors used graph wavelets to extract higher-order neighborhood information.

In addition to the learned networks discussed above, several works [12, 13, 14] have introduced different variations of the graph scattering transform. These papers aim to extend the Euclidean scattering transform of Mallat [15] to graph-structured data and propose predesigned, wavelet-based networks. In [14, 12, 21, 22], extensive theoretical studies of these networks show that they have desirable stability, invariance, and conservation of energy properties. The practical utility of these networks has been established in [13], which primarily focuses on graph classification, and in [23, 24, 25], which used the graph scattering transform to generate molecules. Building on these results, which use handcrafted formulations of the scattering transform, recent work [26] has proposed a framework for a data-driven tuning of the traditionally handcrafted geometric scattering design that maintains the theoretical properties from traditional designs, while also showing strong empirical results in whole-graph settings.

## 3 Geometric Deep Learning Background

### 3.1 Graph Signal Processing

Let $G = (V, E, w)$ be a weighted graph, characterized by a set of nodes (also called vertices) $V \coloneqq \{v_1, \dots, v_n\}$, a set of undirected edges $E \subset V \times V$, and a function $w : E \to (0, \infty)$ assigning positive edge weights to the edges. Let $X \in \mathbb{R}^{n \times d_X}$ be a node features matrix. We shall interpret the $i$-th row of $X$ as representing the features of the node $v_i$, and therefore, we shall denote these rows by $X[v_i] \in \mathbb{R}^{1 \times d_X}$. The columns of $X$, on the other hand, will be denoted by $x_i$. Each of these columns may be naturally identified with a graph signal, i.e., a function $x_i : V \to \mathbb{R}$, $x_i(v_j) = x_i[j]$. In what follows, for simplicity, we will not distinguish between the vectors $x_i$ and the functions $x_i$ and will refer to both as graph signals.

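We define the weighted adjacency matrix $W \in \mathbb{R}^{n \times n}$ of $G$ by $W[v_i, v_j] \coloneqq w(v_i, v_j)$ if $\{v_i, v_j\} \in E$, and set all other entries to zero. We further define the degree matrix $D \coloneqq \mathrm{diag}(d_1, \dots, d_n)$ as the diagonal matrix with each diagonal element $d_i \coloneqq \sum_{j=1}^n W[v_i, v_j]$ being the degree of a node $v_i$. In the following, we will also use the shorthand $d_v$ to denote the degree of $v$. We consider the combinatorial graph Laplacian matrix $L \coloneqq D - W$ and the symmetric normalized Laplacian given by

$$\mathcal{L} \coloneqq D^{-1/2} L D^{-1/2} = I_n - D^{-1/2} W D^{-1/2}.$$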

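It is well known that this matrix is symmetric, positive semi-definite, and admits an orthonormal basis of eigenvectors $q_1, \dots, q_n$ such that

$$\mathcal{L} q_i = \lambda_i q_i, \qquad 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n \le 2.$$

Therefore, we may write

$$\mathcal{L} = Q \Lambda Q^T = \sum_{i=1}^n \lambda_i q_i q_i^T,$$

where $\Lambda \coloneqq \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ and $Q$ is the orthogonal matrix whose $i$-th column is $q_i$.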

We will use the eigendecomposition of $\mathcal{L}$ to define the graph Fourier transform, with the eigenvectors $q_1, \dots, q_n$ being interpreted as Fourier modes. The notion of oscillation on irregular domains like graphs is delicate, but can be reframed in terms of increasing variation of the modes, with the eigenvalues $\lambda_1 \le \dots \le \lambda_n$ interpreted as (squared) frequencies. (This interpretation originates from motivating the graph Fourier transform via the combinatorial graph Laplacian $L$ and the variation of a signal $x$.) The Fourier transform of a graph signal $x \in \mathbb{R}^n$ is defined by $\hat{x}[i] = \langle x, q_i \rangle$ for $1 \le i \le n$, and the inverse Fourier transform may be computed by $x = \sum_{i=1}^n \hat{x}[i] q_i$. It will frequently be convenient to write these equations in matrix form as $\hat{x} = Q^T x$ and $x = Q \hat{x}$.

We recall that in the Euclidean setting, the convolution of two signals in the spatial domain corresponds to the pointwise multiplication of their Fourier transforms. Therefore, we may define the convolution $g \star x$ of a signal $x$ with a filter $g$ by the rule that $g \star x$ is the unique vector such that $\widehat{(g \star x)}[i] = \hat{g}[i] \hat{x}[i]$ for $1 \le i \le n$. Applying the inverse Fourier transform, one may verify that

$$g \star x = \sum_{i=1}^n \hat{g}[i] \hat{x}[i] q_i = \sum_{i=1}^n \hat{g}[i] q_i q_i^T x = Q \hat{G} Q^T x, \tag{1}$$

where $\hat{G} \coloneqq \mathrm{diag}(\hat{g}[1], \dots, \hat{g}[n])$. Hence, convolutional graph filters can be parameterized by considering the Fourier coefficients in $\hat{G}$.
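As a minimal numerical sketch of the graph Fourier transform and of Eq. 1, the snippet below uses a hypothetical toy graph (an unweighted path on four nodes) and the illustrative filter coefficients $\hat{g}[i] = 2 - \lambda_i$; neither the graph nor the variable names come from the paper. It checks that filtering in the spectral domain agrees with applying the corresponding matrix in the node domain.

```python
import numpy as np

# Hypothetical toy graph (not from the paper): an unweighted path on 4 nodes.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = W.shape[0]
d = W.sum(axis=1)                       # node degrees
S = W / np.sqrt(np.outer(d, d))         # D^{-1/2} W D^{-1/2}
L_norm = np.eye(n) - S                  # symmetric normalized Laplacian

# eigh is meant for symmetric matrices: ascending eigenvalues,
# orthonormal eigenvectors stored as the columns of Q.
lam, Q = np.linalg.eigh(L_norm)

x = np.array([1.0, -2.0, 0.5, 3.0])     # an arbitrary graph signal
x_hat = Q.T @ x                         # graph Fourier transform
x_rec = Q @ x_hat                       # inverse graph Fourier transform

# Convolution as in Eq. 1 with illustrative coefficients g_hat[i] = 2 - lambda_i.
g_hat = 2.0 - lam
y_spectral = Q @ (g_hat * x_hat)        # Q G_hat Q^T x
y_spatial = (np.eye(n) + S) @ x         # same operator, applied without Q
```

Here `y_spectral` and `y_spatial` coincide up to floating-point error because $Q(2I_n - \Lambda)Q^T = 2I_n - \mathcal{L}$.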

### 3.2 Spectral Graph Neural Network Constructions

A graph filter is a function $F : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d'}$ that transforms a node feature matrix $X$ into a new feature matrix $X'$. GNNs typically feature several layers, each of which produces a new set of features by filtering the output of the previous layer. We will usually let $X^0$ denote the initial node feature matrix, which is the input to the network, and let $X^\ell$ denote the node feature matrix after the $\ell$-th layer.

In light of Eq. 1, a natural way to construct learned graph filters would be to directly learn the Fourier coefficients in $\hat{G}$. Indeed, this was the approach used in the pioneering work of Bruna et al. [27]. However, this approach has several drawbacks. First, it results in $n$ learnable parameters in each convolutional filter. Therefore, a network using such filters would not scale well to large data sets due to the computational cost. At the same time, such filters are not necessarily well-localized in space and are prone to overfitting [28]. Moreover, networks of the form introduced in [27] typically cannot be generalized to different graphs [1]. However, recent work [29] has shown that this latter issue can be overcome by formulating the Fourier coefficients as smooth functions of the Laplacian eigenvalues, $\hat{g}[i] = \hat{g}(\lambda_i)$, $1 \le i \le n$. In particular, this will be the case for the filters used in the networks considered in this work.

A common approach (e.g., used in [28, 5, 30, 31, 18]) to formulate such filters is by using polynomials of the Laplacian eigenvalues to set $\hat{g}(\lambda_i) \coloneqq \sum_{j=0}^k \theta_j \lambda_i^j$ (or equivalently $\hat{G} \coloneqq \sum_{j=0}^k \theta_j \Lambda^j$) for some $k \ll n$. It can be verified [28] that this approach yields convolutional filters that are $k$-localized in space and that can be written as $g_\theta \star x = \sum_{j=0}^k \theta_j \mathcal{L}^j x$. This reduces the number of trainable parameters in each filter from $n$ to $k + 1$ and allows one to perform convolution without explicitly computing the spectral decomposition of the Laplacian, which is expensive for large graphs.

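One particularly noteworthy network that uses this method is [28], which writes the filters in terms of the Chebyshev polynomials defined by $T_0(x) = 1$, $T_1(x) = x$, and $T_j(x) = 2x T_{j-1}(x) - T_{j-2}(x)$. They first renormalize the eigenvalue matrix $\tilde{\Lambda} \coloneqq 2\Lambda/\lambda_n - I_n$ and then define $\hat{G} \coloneqq \sum_{j=0}^k \theta_j T_j(\tilde{\Lambda})$. This gives rise to a $k$-localized filtering operation of the form

$$g_\theta \star x = \sum_{j=0}^k \theta_j T_j(\tilde{\mathcal{L}}) x, \tag{2}$$

where $\tilde{\mathcal{L}} \coloneqq 2\mathcal{L}/\lambda_n - I_n$.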
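The Chebyshev filter in Eq. 2 can be evaluated with the three-term recurrence, so that only matrix-vector products with the rescaled Laplacian are needed. The following is a minimal sketch under stated assumptions: the function name `chebyshev_filter`, the toy graph, and the default `lam_max=2.0` (an upper bound on the largest eigenvalue rather than its exact value) are illustrative choices, not taken from [28].

```python
import numpy as np

def chebyshev_filter(L_norm, x, theta, lam_max=2.0):
    """Evaluate sum_j theta[j] * T_j(L_tilde) @ x via the Chebyshev recurrence."""
    n = L_norm.shape[0]
    L_tilde = 2.0 * L_norm / lam_max - np.eye(n)   # rescaled Laplacian
    t_prev = x                                     # T_0(L_tilde) x = x
    out = theta[0] * t_prev
    if len(theta) > 1:
        t_curr = L_tilde @ x                       # T_1(L_tilde) x
        out = out + theta[1] * t_curr
        for j in range(2, len(theta)):
            # T_j = 2 * L_tilde * T_{j-1} - T_{j-2}
            t_next = 2.0 * (L_tilde @ t_curr) - t_prev
            out = out + theta[j] * t_next
            t_prev, t_curr = t_curr, t_next
    return out

# Hypothetical toy graph: a triangle with one pendant node.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = W.sum(axis=1)
L_norm = np.eye(4) - W / np.sqrt(np.outer(d, d))
x = np.array([1.0, 0.0, -1.0, 2.0])
theta = [0.5, -1.0, 0.25]                          # arbitrary coefficients, k = 2
y = chebyshev_filter(L_norm, x, theta)
```

Since each recurrence step multiplies by `L_tilde` once, a filter with coefficients $\theta_0, \dots, \theta_k$ only mixes information within $k$ hops.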

### 3.3 Graph Convolutional Networks

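One of the most widely used GNNs is the Graph Convolutional Network (GCN) [5]. This network is derived from the Chebyshev construction [28] mentioned above by setting $k = 1$ in Eq. 2 and approximating $\lambda_n \approx 2$, which yields

$$g_{\theta_0, \theta_1} \star x \approx \theta_0 x + \theta_1 (\mathcal{L} - I_n) x = \theta_0 x - \theta_1 D^{-1/2} W D^{-1/2} x.$$

To further reduce the number of trainable parameters, the authors then set $\theta \coloneqq \theta_0 = -\theta_1$. The resulting convolutional filter has the form

$$g_\theta \star x = \theta \left(I_n + D^{-1/2} W D^{-1/2}\right) x. \tag{3}$$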

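One may verify that $I_n + D^{-1/2} W D^{-1/2} = 2 I_n - \mathcal{L}$, and therefore, Eq. 3 essentially corresponds to setting $\hat{g}(\lambda_i) \coloneqq \theta (2 - \lambda_i)$ in Eq. 1. The eigenvalues of $I_n + D^{-1/2} W D^{-1/2}$ take values in $[0, 2]$. Thus, to avoid vanishing or exploding gradients, the authors use a renormalization trick

$$I_n + D^{-1/2} W D^{-1/2} \to \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}, \tag{4}$$

where $\tilde{W} \coloneqq I_n + W$ and $\tilde{D}$ is a diagonal matrix with $\tilde{D}[v_i, v_i] \coloneqq \sum_{j=1}^n \tilde{W}[v_i, v_j]$ for $1 \le i \le n$. Setting $A \coloneqq \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}$ and using multiple channels, we obtain a layer-wise propagation rule of the form $x_j^\ell = \sigma\big(\sum_{i=1}^{N_{\ell-1}} \theta_{i,j}^\ell A x_i^{\ell-1}\big)$, where $N_\ell$ is the number of channels used in the $\ell$-th layer and $\sigma$ is an elementwise nonlinearity. In matrix form we write

$$X^\ell = F_{\mathrm{gcn}}(X^{\ell-1}) = \sigma\left(A X^{\ell-1} \Theta^\ell\right). \tag{5}$$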

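We interpret the matrix $A$ as computing a localized average of each channel around each node and the matrix $\Theta^\ell$ as sharing information across channels. This filter can also be observed at the node level as

$$X^\ell[v] = \sigma\left(\sum_{w \in \overline{\mathcal{N}}_v} \frac{1}{\sqrt{(d_v + 1)(d_w + 1)}} X^{\ell-1}[w] \Theta^\ell\right),$$

where $\mathcal{N}_v$ denotes the one-step neighborhood of node $v$ and $\overline{\mathcal{N}}_v \coloneqq \mathcal{N}_v \cup \{v\}$. This process can be split into three steps:

$$\begin{aligned}
X_a^\ell[w] &= X^{\ell-1}[w] \Theta^\ell \quad \text{for all } w \in \overline{\mathcal{N}}_v, && \text{(6a)} \\
X_b^\ell[v] &= \sum_{w \in \overline{\mathcal{N}}_v} \frac{1}{\sqrt{(d_v + 1)(d_w + 1)}} X_a^\ell[w], && \text{(6b)} \\
X_c^\ell[v] &= \sigma\left(X_b^\ell[v]\right), && \text{(6c)}
\end{aligned}$$

which we refer to as the transformation step (Eq. 6a), the aggregation step (Eq. 6b) and the activation step (Eq. 6c).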

As discussed earlier, the GCN filter described above may be viewed as a low-pass filter that suppresses high-frequency information. For simplicity, we focus on the convolution in Eq. 3 before the renormalization. This convolution essentially corresponds to pointwise Fourier multiplication by $\hat{g}(\lambda) = 2 - \lambda$, which is strictly decreasing in $\lambda$. Therefore, repeated applications of this filter effectively zero out the higher frequencies. This is consistent with the oversmoothing problem discussed in [9].
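The smoothing effect can be observed numerically. The sketch below builds the renormalized propagation matrix from Eq. 4, implements one layer of Eq. 5 (with ReLU as an assumed choice of $\sigma$), and applies the propagation matrix repeatedly to an arbitrary signal. The toy graph and all function names are hypothetical. For a connected graph, the iterates align with the vector of square roots of the augmented degrees, so the original signal is forgotten.

```python
import numpy as np

def gcn_propagation_matrix(W):
    """A = D~^{-1/2} (I + W) D~^{-1/2}, the renormalization trick of Eq. 4."""
    W_tilde = W + np.eye(W.shape[0])
    d_tilde = W_tilde.sum(axis=1)
    return W_tilde / np.sqrt(np.outer(d_tilde, d_tilde))

def gcn_layer(A, X, Theta):
    """One layer of Eq. 5; ReLU is an assumed choice for the nonlinearity."""
    return np.maximum(A @ X @ Theta, 0.0)

# Hypothetical toy graph: a 5-cycle with one chord between nodes 0 and 2.
n = 5
W = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]:
    W[i, j] = W[j, i] = 1.0

A = gcn_propagation_matrix(W)
x = np.array([1.0, -1.0, 2.0, 0.0, 3.0])
smoothed = x.copy()
for _ in range(100):                      # 100 purely linear propagation steps
    smoothed = A @ smoothed

# Eigenvector of A for eigenvalue 1: square roots of the augmented degrees.
stationary = np.sqrt(W.sum(axis=1) + 1.0)
cosine = abs(smoothed @ stationary) / (
    np.linalg.norm(smoothed) * np.linalg.norm(stationary))
```

A cosine similarity close to one indicates that, after many propagation steps, the signal carries essentially only degree information.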

### 3.4 Graph Attention Networks

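Another popular network that is widely used for node classification tasks is the graph attention network (GAT) [10], which uses an attention mechanism to guide and adjust the aggregation of features from adjacent nodes. First, the node features $X^{\ell-1}$ are linearly transformed to $\bar{X}^\ell \coloneqq X^{\ell-1} \Theta^\ell$ using a learned weight matrix $\Theta^\ell$. Then, the aggregation coefficients are learned via

$$\alpha_{v \leftarrow u} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\left[\bar{X}^\ell[v] \,\|\, \bar{X}^\ell[u]\right] a\right)\right)}{\sum_{w \in \overline{\mathcal{N}}_v} \exp\left(\mathrm{LeakyReLU}\left(\left[\bar{X}^\ell[v] \,\|\, \bar{X}^\ell[w]\right] a\right)\right)},$$

where $a$ is a shared attention vector and $\|$ denotes horizontal concatenation. The output feature corresponding to a single attention head is given by $X^\ell[v] = \sigma\big(\sum_{u \in \overline{\mathcal{N}}_v} \alpha_{v \leftarrow u} \bar{X}^\ell[u]\big)$. To increase the expressivity of this network, the authors then use a multi-headed attention mechanism, with $\Gamma$ heads, to generate concatenated features

$$X^\ell[v] = \Big\|_{\gamma=1}^{\Gamma} \sigma\left(\sum_{u \in \overline{\mathcal{N}}_v} \alpha^\gamma_{v \leftarrow u} \bar{X}^\ell_\gamma[u]\right). \tag{7}$$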
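The attention coefficients of a single head can be sketched as follows. This is an illustrative dense implementation, not the reference code of [10]: the function names, the toy graph, the random inputs, and the LeakyReLU slope of 0.2 are assumptions. It uses the fact that the product of the concatenated features with the attention vector splits into one term per node.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    # The slope value is an assumed default.
    return np.where(z > 0, z, slope * z)

def gat_attention(W, X_bar, a):
    """Return alpha with alpha[v, u] = alpha_{v<-u} for u in N_v ∪ {v}, else 0."""
    n, d = X_bar.shape
    src = X_bar @ a[:d]                     # contribution of X_bar[v]
    dst = X_bar @ a[d:]                     # contribution of X_bar[u]
    scores = leaky_relu(src[:, None] + dst[None, :])
    mask = (W + np.eye(n)) > 0              # neighbors plus the node itself
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical inputs: a path on 4 nodes with random transformed features.
rng = np.random.default_rng(0)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X_bar = rng.normal(size=(4, 3))
a = rng.normal(size=6)
alpha = gat_attention(W, X_bar, a)
head_output = np.maximum(alpha @ X_bar, 0.0)   # single head, ReLU assumed
```

Each row of `alpha` is a probability distribution over the closed neighborhood of the corresponding node, so the aggregation remains a weighted local average.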

### 3.5 Challenges in Geometric Deep Learning

Many GNN models, including GCN [5] and GAT [10], are subject to the so-called oversmoothing problem [9], caused by aggregation steps (such as Eq. 6b) that essentially consist of localized averaging operations. As discussed in Sec. 3.3 and also [11], from a signal processing point of view, this corresponds to a low-pass filtering of graph signals. Moreover, as discussed in [32], these networks are also subject to underreaching. Most GNNs (including GCN and GAT) can only relate information from nodes within a distance equal to the number of GNN layers, and because of the oversmoothing problem, they typically use a small number of layers in practice. Therefore, the oversmoothing and underreaching problems combine to significantly limit the ability of GNNs to capture long-range dependencies. In Sec. 4, we will introduce a hybrid network, which aims to address these challenges by using both GCN-style channels and channels based on the geometric scattering transform discussed below.

### 3.6 Geometric Scattering

In this section, we review the geometric scattering transform constructed in [13] for graph classification and show how it may be adapted for node-level tasks. As we shall see, this node-level geometric scattering will address the challenges discussed above in Sec. 3.5, by using band-pass filters that capture high-frequency information and have wider receptive fields.

The geometric scattering transform uses wavelets based upon raising the lazy random walk matrix

$$P \coloneqq \frac{1}{2}\left(I_n + W D^{-1}\right)$$

to dyadic powers $P^{2^k}$, which can be interpreted as differing degrees of resolution. Entrywise, we note that

$$(PX)[v] = \frac{1}{2} X[v] + \frac{1}{2} \sum_{w \in \mathcal{N}_v} d_w^{-1} X[w]. \tag{8}$$

Thus, $P$ may be viewed as a localized averaging operator analogous to those used in, e.g., GCN, and the powers $P^{2^k}$ may be viewed as low-pass filters which suppress high frequencies. In order to better retain this high-frequency information, we define multiscale graph diffusion wavelets by subtracting these low-pass filters at different scales [17]. Specifically, for $k \in \mathbb{N}_0$, we define a wavelet $\Psi_k$ at scale $2^k$ by

$$\Psi_0 \coloneqq I_n - P, \qquad \Psi_k \coloneqq P^{2^{k-1}} - P^{2^k}, \quad k \ge 1. \tag{9}$$

From a frequency perspective, we may interpret each $\Psi_k$ as capturing information at a different frequency band. From a spatial perspective, we may view $\Psi_k$ as encoding information on how a $2^k$-step neighborhood differs from a smaller $2^{k-1}$-step one. Such wavelets are usually organized in a filter bank $\{\Psi_k\}_{0 \le k \le K}$, along with a low-pass filter $\Phi_K \coloneqq P^{2^K}$. One should note that these wavelets are not self-adjoint operators on the standard (unweighted) $\ell^2$ space. However, it was shown in [22] that this wavelet filter bank is a self-adjoint non-expansive frame on a suitably weighted inner-product space.
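The wavelet filter bank of Eq. 9 can be built by repeated squaring of the lazy random walk matrix. The following is a minimal dense sketch on a hypothetical toy graph; the function names are illustrative. It also exposes two properties that follow directly from the definitions: the wavelets and the low-pass filter telescope to the identity, and every wavelet annihilates the degree vector, which is fixed by the lazy random walk matrix.

```python
import numpy as np

def lazy_random_walk(W):
    """P = (I + W D^{-1}) / 2; dividing by d scales column w by 1 / d_w."""
    d = W.sum(axis=0)
    return 0.5 * (np.eye(W.shape[0]) + W / d)

def diffusion_wavelets(P, K):
    """Return [Psi_0, ..., Psi_K] and the low-pass filter Phi_K = P^(2^K)."""
    wavelets = [np.eye(P.shape[0]) - P]     # Psi_0 = I - P
    P_pow = P                               # P^(2^0)
    for _ in range(1, K + 1):
        P_next = P_pow @ P_pow              # P^(2^k) by squaring
        wavelets.append(P_pow - P_next)     # Psi_k = P^(2^(k-1)) - P^(2^k)
        P_pow = P_next
    return wavelets, P_pow

# Hypothetical toy graph: a triangle with one pendant node.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = W.shape[0]
d = W.sum(axis=1)
P = lazy_random_walk(W)
wavelets, low_pass = diffusion_wavelets(P, K=3)

telescoped = sum(wavelets) + low_pass       # equals the identity matrix
```

Squaring keeps the number of matrix products linear in $K$, although for large sparse graphs one would typically apply $P$ to the signal repeatedly rather than form dense powers.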

The geometric scattering transform is a multi-layered architecture that iteratively applies wavelet convolutions and nonlinear activation functions in an alternating cascade as illustrated in Fig. 1. It is parameterized by paths $p \coloneqq (k_1, \dots, k_m)$. Formally, we define

$$U_p x \coloneqq \Psi_{k_m} \circ \sigma \circ \Psi_{k_{m-1}} \cdots \sigma \circ \Psi_{k_2} \circ \sigma \circ \Psi_{k_1} x, \tag{10}$$

where $\sigma$ is a nonlinear activation function. (In a slight deviation from previous work, here $U_p$ does not include the outermost nonlinearity in the cascade.) We note that the nonlinearity might vary in each step of the cascade. However, we will suppress this possible dependence to avoid cumbersome notation. We also note that in our theoretical results, if we assume, for example, that $\sigma$ is strictly increasing, this assumption is intended to apply to all nonlinearities used in the cascade. In our experiments, we use the absolute value, i.e., $\sigma(\cdot) = |\cdot|$.

When applied in the context of graph classification, the scattering features are frequently aggregated by taking $q$-th-order moments,

$$S_{p,q} x \coloneqq \sum_{i=1}^n \left|(U_p x)[v_i]\right|^q. \tag{11}$$

However, since our current focus is on node-level tasks we will not use these moments here.

The original formulations of geometric scattering were fully designed networks without any learned convolutions between channels. Here, we will incorporate learning by defining the following scattering propagation rule similar to the one used by GCN:

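$$X^\ell = F_{p\text{-scat}}(X^{\ell-1}) = \sigma\left(U_p\left(X^{\ell-1} \Theta^\ell\right)\right). \tag{12}$$

Analogously to GCN, we note that the scattering propagation rule can be decomposed into three steps:

$$\begin{aligned}
X_{a'}^\ell[w] &= X^{\ell-1}[w] \Theta^\ell \quad \text{for all } w \in \overline{\mathcal{N}}_v, && \text{(13a)} \\
X_{b'}^\ell[v] &= \left(U_p\left(X_{a'}^\ell\right)\right)[v], && \text{(13b)} \\
X_{c'}^\ell[v] &= \sigma\left(X_{b'}^\ell[v]\right)^q. && \text{(13c)}
\end{aligned}$$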

Importantly, we note that the scattering transform addresses the underreaching problem as wavelets that are leveraged in the aggregation step (Eq. 13b) have larger receptive fields than most traditional GNNs. However, scattering does not result in oversmoothing because the subtraction results in band-pass filters rather than low-pass filters. In this manner, the scattering transform addresses the challenges discussed in the previous subsection.
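The cascade in Eq. 10 and the propagation rule in Eq. 12 can be sketched as below. The helper names, the toy graph, the random weights, and the example path are hypothetical; the absolute value is used as the nonlinearity, and, following Eq. 10, the path operator itself does not apply the outermost nonlinearity.

```python
import numpy as np

def diffusion_wavelets(W, K):
    """Dense wavelets Psi_0, ..., Psi_K built from P = (I + W D^{-1}) / 2."""
    n = W.shape[0]
    P_pow = 0.5 * (np.eye(n) + W / W.sum(axis=0))
    wavelets = [np.eye(n) - P_pow]
    for _ in range(K):
        P_next = P_pow @ P_pow
        wavelets.append(P_pow - P_next)
        P_pow = P_next
    return wavelets

def scattering_path(wavelets, path, X, sigma=np.abs):
    """U_p X for p = (k_1, ..., k_m): Psi_{k_1} first, sigma between wavelets."""
    out = wavelets[path[0]] @ X
    for k in path[1:]:
        out = wavelets[k] @ sigma(out)
    return out

def scattering_layer(wavelets, path, X, Theta, sigma=np.abs):
    """Propagation rule sigma(U_p(X Theta)) as in Eq. 12."""
    return sigma(scattering_path(wavelets, path, X @ Theta, sigma))

# Hypothetical toy inputs.
rng = np.random.default_rng(0)
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = W.shape[0]
X = rng.normal(size=(n, 3))
Theta = rng.normal(size=(3, 2))
wavelets = diffusion_wavelets(W, K=2)
out = scattering_layer(wavelets, (0, 1), X, Theta)   # second-order path (0, 1)
```

A path of length one yields a first-order scattering channel, while longer paths compose wavelets at several scales and therefore widen the receptive field.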

## 4 Hybrid Scattering Networks

Here, we introduce two hybrid networks that combine aspects of GCN and the geometric scattering transform discussed in the previous section. Our networks will use both low-pass and band-pass filters in different channels to capture different types of information. As a result, our hybrid networks will have greater expressive power than either traditional GCNs, which only use low-pass filters, or a pure scattering network, which only uses band-pass filters.

In our low-pass channels, we use modified GCN filters, which are similar to those used in Eq. 5 but use higher powers of $A$ and include bias terms. Specifically, we use a channel update rule of the form

$$X^\ell_{\mathrm{low},i} \coloneqq \sigma\left(A^{r_i} X^{\ell-1} \Theta^\ell_{\mathrm{low},i} + B^\ell_{\mathrm{low},i}\right) \tag{14}$$

for $1 \le i \le C_{\mathrm{low}}$. We note, in particular, that the use of higher powers of $A$ enables a wider receptive field (of radius $r_i$), without increasing the number of trainable parameters (unlike in GCN).

Similarly, in our band-pass channels, we use a modified version of Eq. 12, with an added bias term, and our update rule is given by

$$X^\ell_{\mathrm{band},i} \coloneqq \sigma\left(U_{p_i}\left(X^{\ell-1} \Theta^\ell_{\mathrm{band},i}\right) + B^\ell_{\mathrm{band},i}\right) \tag{15}$$

for $1 \le i \le C_{\mathrm{band}}$. Here, similarly to Eq. 10, $p_i$ is a path that determines the cascade of wavelets used in the $i$-th channel.

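Aggregation module. Each hybrid layer uses a set of $C_{\mathrm{low}}$ low-pass and $C_{\mathrm{band}}$ band-pass channels to transform the node features $X^{\ell-1}$, as described in Eq. 14 and Eq. 15, respectively. The resulting filter responses are aggregated to new $d_\ell$-dimensional node representations $X^\ell$ via an aggregation module

$$X^\ell \coloneqq \mathrm{AGG}\left(X^\ell_{\mathrm{low},1}, \dots, X^\ell_{\mathrm{low},C_{\mathrm{low}}}, X^\ell_{\mathrm{band},1}, \dots, X^\ell_{\mathrm{band},C_{\mathrm{band}}}\right). \tag{16}$$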

Later, in Sections 4.1 and 4.2, we will discuss two models that use different aggregation modules.

Graph Residual Convolution. After aggregating the outputs of the (low-pass) GCN channels and (band-pass) scattering channels in Eq. 16, we apply the graph residual convolution, which acts as a low-pass filter and aims to eliminate any high-frequency noise introduced by the scattering channels. This noise can arise, for example, if label rates differ across graph substructures. In this case, the scattering features may learn the difference between labeled and unlabeled nodes and thereby produce high-frequency noise.

This filter uses a modified diffusion matrix given by

$$A_{\mathrm{res}}(\alpha) = \frac{1}{\alpha + 1}\left(I_n + \alpha W D^{-1}\right),$$

where the hyperparameter $\alpha$ determines the magnitude of the low-pass filtering. Choosing $\alpha = 0$ yields the identity (no filtering), while $\alpha \to \infty$ results in the random walk matrix $R \coloneqq W D^{-1}$. Thus, $A_{\mathrm{res}}(\alpha)$ can be interpreted as lying between a completely lazy random walk that never moves and a non-resting one that moves at every time step.

The full residual convolution update rule is given by

$$X^{\ell+1} \coloneqq A_{\mathrm{res}}(\alpha) X^\ell \Theta_{\mathrm{res}} + B_{\mathrm{res}}.$$

The multiplication with the weight matrix $\Theta_{\mathrm{res}}$ corresponds to a fully connected layer applied to the output from Eq. 16 (after low-pass filtering with $A_{\mathrm{res}}(\alpha)$), with each neuron learning a linear combination of the signals output by the aggregation module.
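A minimal sketch of the graph residual convolution follows. The function names and the toy graph are hypothetical, and the weights are random placeholders. The two limiting cases described above can be checked directly: a value of zero for the hyperparameter gives the identity, and very large values approach the random walk matrix.

```python
import numpy as np

def residual_diffusion(W, alpha):
    """A_res(alpha) = (I + alpha * W D^{-1}) / (alpha + 1)."""
    n = W.shape[0]
    random_walk = W / W.sum(axis=0)          # W D^{-1}: column w scaled by 1 / d_w
    return (np.eye(n) + alpha * random_walk) / (alpha + 1.0)

def residual_convolution(W, X, Theta_res, B_res, alpha):
    """Low-pass filter the features, then apply a fully connected layer."""
    return residual_diffusion(W, alpha) @ X @ Theta_res + B_res

# Hypothetical toy graph: a triangle with one pendant node.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = W.shape[0]
rng = np.random.default_rng(0)
X = rng.normal(size=(n, 5))
Theta_res = rng.normal(size=(5, 2))
B_res = np.zeros(2)

no_filtering = residual_diffusion(W, alpha=0.0)       # identity
almost_walk = residual_diffusion(W, alpha=1e9)        # close to W D^{-1}
out = residual_convolution(W, X, Theta_res, B_res, alpha=0.5)
```

For every nonnegative value of the hyperparameter, the columns of the diffusion matrix sum to one, so it is a convex combination of the identity and the random walk matrix.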

### 4.1 Scattering GCN

The Scattering GCN (Sc-GCN) network was first introduced in [33]. Here, the aggregation module concatenates the filter responses horizontally yielding wide node representations of the form

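$$\left[X^\ell_{\mathrm{low},1} \,\|\, \dots \,\|\, X^\ell_{\mathrm{low},C_{\mathrm{low}}} \,\|\, X^\ell_{\mathrm{band},1} \,\|\, \dots \,\|\, X^\ell_{\mathrm{band},C_{\mathrm{band}}}\right]. \tag{17}$$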

Sc-GCN then learns relationships between the channels via the graph residual convolution.

Configuration. The primary goal of Sc-GCN is to alleviate oversmoothing in popular semi-supervised node-level tasks on, e.g., citation networks. As regularity patterns in such datasets are often dominated by low-frequency information such as inter-cluster node similarities, we choose our parameters to focus on low-frequency information. We use three low-pass filters, with receptive fields of radius $r = 1, 2, 3$, and two band-pass filters. We use $\sigma(\cdot) = |\cdot|$ as our nonlinearity in all steps except the outermost nonlinearity. Inspired by the aggregation step in classical geometric scattering [13], for the outermost nonlinearity, we additionally apply the $q$-th power at the node level, i.e., $\sigma(\cdot) = |\cdot|^q$.

The paths $p_i$ and the parameter $\alpha$ from the graph residual convolution are tuned as hyperparameters of the network.
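To make the data flow of a Sc-GCN layer concrete, the following sketch computes low-pass responses (Eq. 14) and band-pass responses (Eq. 15) and concatenates them as in Eq. 17. It is an illustration under stated assumptions rather than the authors' implementation: bias terms are set to zero, the outermost nonlinearity is taken to be the absolute value raised to the power `q`, and the function name, radii, paths, toy graph and random weights are all hypothetical.

```python
import numpy as np

def sc_gcn_features(W, X, thetas_low, thetas_band, radii, paths, q=1.0):
    """Concatenate low-pass and band-pass channel responses (biases omitted)."""
    n = W.shape[0]
    W_tilde = W + np.eye(n)
    d_tilde = W_tilde.sum(axis=1)
    A = W_tilde / np.sqrt(np.outer(d_tilde, d_tilde))    # GCN matrix (Eq. 4)
    P = 0.5 * (np.eye(n) + W / W.sum(axis=0))            # lazy random walk

    def wavelet(k):
        if k == 0:
            return np.eye(n) - P
        half = np.linalg.matrix_power(P, 2 ** (k - 1))   # P^(2^(k-1))
        return half - half @ half                        # minus P^(2^k)

    def outer(Z):
        return np.abs(Z) ** q                            # assumed outer nonlinearity

    channels = []
    for r, Theta in zip(radii, thetas_low):              # low-pass channels
        channels.append(outer(np.linalg.matrix_power(A, r) @ X @ Theta))
    for path, Theta in zip(paths, thetas_band):          # band-pass channels
        Z = X @ Theta
        for i, k in enumerate(path):
            Z = wavelet(k) @ (Z if i == 0 else np.abs(Z))
        channels.append(outer(Z))
    return np.concatenate(channels, axis=1)

# Hypothetical configuration: three low-pass and two band-pass channels.
rng = np.random.default_rng(0)
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
hidden = 2
thetas_low = [rng.normal(size=(3, hidden)) for _ in range(3)]
thetas_band = [rng.normal(size=(3, hidden)) for _ in range(2)]
H = sc_gcn_features(W, X, thetas_low, thetas_band,
                    radii=(1, 2, 3), paths=[(1,), (2,)], q=2.0)
```

The resulting wide representation `H` would then be passed to the graph residual convolution, which learns how to combine the channels.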

### 4.2 Geometric Scattering Attention Network

An important observation in Sc-GCN above is that the model decides globally how to combine different channel information. The network first concatenates all of the features from the low-pass and band-pass channels in Eq. 17 and then combines these features via multiplication with the weight matrix $\Theta_{\mathrm{res}}$. However, for complex tasks or datasets, important regularity patterns may vary significantly across different graph regions. In such settings, a model should ideally attend locally over the aggregation and adaptively combine filter responses at different nodes.

This observation inspired the design of the Geometric Scattering Attention Network (GSAN) [34]. Drawing inspiration from  [35], GSAN uses an aggregation module based on a multi-head node-level attention framework. However, the attention mechanism used in GSAN differs from [35] by attending over the combination of the different filter responses rather than over the combination of node features from neighboring nodes. We will first focus on the processing performed independently by each attention head, and then discuss a multi-head configuration.

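In a slight deviation from Sc-GCN, the weight matrices in Eq. 14 and Eq. 15 are shared across the filters of each attention head. Therefore, both aggregation steps (Eq. 6b and Eq. 13b) take the same input, denoted by $\bar{X}^\ell$. Next, we define $\bar{X}^\ell_{\mathrm{low},i}$ and $\bar{X}^\ell_{\mathrm{band},i}$ to be the outputs of the aggregation steps, Eq. 6b and Eq. 13b, with the bias terms set to zero. We then compute attention score vectors that will be used to determine the importance of each of the filter responses for every single node. We calculate

$$e^\ell_{\mathrm{low},i} = \mathrm{LeakyReLU}\left(\left[\bar{X}^\ell \,\|\, \bar{X}^\ell_{\mathrm{low},i}\right] a\right),$$

with $e^\ell_{\mathrm{band},i}$ defined analogously and $a$ being a learned attention vector that is shared across all filters of the attention head. These attention scores are then normalized across all filters using the softmax function. Specifically, we define

$$\alpha^\ell_{\mathrm{low},j} \coloneqq \frac{\exp\left(e^\ell_{\mathrm{low},j}\right)}{\sum_{i=1}^{C_{\mathrm{low}}} \exp\left(e^\ell_{\mathrm{low},i}\right) + \sum_{i=1}^{C_{\mathrm{band}}} \exp\left(e^\ell_{\mathrm{band},i}\right)},$$

where the exponential function is applied elementwise, and define $\alpha^\ell_{\mathrm{band},j}$ analogously. Finally, for every node, the filter responses are summed together, weighted by the corresponding (node-to-filter) attention score. We also normalize by the number of filters $C \coloneqq C_{\mathrm{low}} + C_{\mathrm{band}}$, which gives

$$X^\ell = C^{-1} \tilde{\sigma}\left(\sum_{j=1}^{C_{\mathrm{low}}} \alpha^\ell_{\mathrm{low},j} \odot \bar{X}^\ell_{\mathrm{low},j} + \sum_{j=1}^{C_{\mathrm{band}}} \alpha^\ell_{\mathrm{band},j} \odot \bar{X}^\ell_{\mathrm{band},j}\right),$$

where $\alpha \odot X$ denotes the Hadamard (elementwise) product of $\alpha$ with each column of $X$. We further use $\tilde{\sigma}(\cdot) = \mathrm{ReLU}(\cdot)$ in the equation above.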
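The node-level attention over filter responses for one attention head can be sketched as follows. This is an illustration only: the function names, the LeakyReLU slope, the use of ReLU as the outer nonlinearity, and the random stand-ins for the filter responses are assumptions. The softmax is taken across filters separately at every node, so each node receives its own mixture of filter responses.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    # The slope value is an assumed default.
    return np.where(z > 0, z, slope * z)

def gsan_head(X_bar, responses, a):
    """Combine filter responses with per-node attention over the filters."""
    C = len(responses)
    # One score vector of length n per filter, stacked to shape (C, n).
    scores = np.stack([
        leaky_relu(np.concatenate([X_bar, R], axis=1) @ a) for R in responses
    ])
    scores = scores - scores.max(axis=0, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    alpha = weights / weights.sum(axis=0, keepdims=True)  # normalize over filters
    # alpha[i][:, None] multiplies every column of R by the node-wise scores.
    combined = sum(alpha[i][:, None] * R for i, R in enumerate(responses))
    return np.maximum(combined, 0.0) / C, alpha

# Hypothetical inputs: 5 nodes, 3 features, 4 stand-in filter responses.
rng = np.random.default_rng(0)
n, d = 5, 3
X_bar = rng.normal(size=(n, d))
responses = [rng.normal(size=(n, d)) for _ in range(4)]
a = rng.normal(size=2 * d)
X_out, alpha = gsan_head(X_bar, responses, a)
```

In contrast to the attention coefficients of GAT, which weight neighboring nodes, the scores here weight filters, so `alpha` has one row per filter and one column per node.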

Multi-head Attention. Similar to other applications of attention mechanisms [35], we use multi-head attention to stabilize the training, thus yielding as output of the aggregation module (by a slight abuse of notation)

$$X^\ell = \Big\|_{\gamma=1}^{\Gamma} X^\ell_\gamma\left(\Theta^\ell_\gamma, \alpha^\ell_\gamma\right), \tag{18}$$

concatenating $\Gamma$ attention heads. As explained above, each attention head has individual trained parameters $\Theta^\ell_\gamma$ and $\alpha^\ell_\gamma$ and outputs a filter response $X^\ell_\gamma$. Similar to Sc-GCN, this concatenation results in wide node representations that are further refined by the graph residual convolution.

Configuration. For GSAN, we set $C_{\mathrm{low}} = C_{\mathrm{band}} = 3$, giving the model balanced access to both low-pass and band-pass information. The aggregation process of the attention layer is shown in Fig. 3, where $U_1, U_2, U_3$ represent three first-order scattering transformations with $U_1 x \coloneqq \Psi_1 x$, $U_2 x \coloneqq \Psi_2 x$ and $U_3 x \coloneqq \Psi_3 x$. The number of attention heads $\Gamma$ and the parameter $\alpha$ from the graph residual convolution are tuned as hyperparameters of the network.

## 5 Theory of Hybrid Scattering Networks

In this section, we analyze the ability of scattering filters to capture information not captured by traditional GCN filters. As discussed in Section 3.3, GCN filters effectively act as localized averaging operators and focus primarily on low-frequency information. By contrast, band-pass filters are able to retain high-frequency information and a broader class of regularity patterns. As a simple example, we recall Lemma 1 of [33]. This result considers a two-periodic signal on a cycle graph with an even number of nodes. It shows that Eq. 3 yields a constant signal as filter response, while the scattering filter from Eq. 9 still produces a two-coloring signal. Inductively, we note that this result may be extended to any finite linear cascade of such filters, i.e., to repeated applications of either filter without intermediate nonlinearities. Moreover, Lemma 2 of the same paper shows that this result may be generalized considerably and can be extended to any two-coloring of a regular bipartite graph.

In the context of semi-supervised learning, these results imply that on a bipartite graph, a GCN-based method cannot be used to predict which part of the vertex set a given node belongs to (assuming that the input feature is a two-coloring of the graph). However, scattering methods preserve two-colorings and therefore will likely be successful for this task. Here, we will further develop this theory and analyze the advantages of our hybrid scattering networks on more general graphs.
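The cycle-graph example can be reproduced numerically. The sketch below uses a hypothetical instance (a cycle on six nodes with the two-periodic signal alternating between 3 and 1) and compares the unrenormalized GCN filter from Eq. 3, with the scalar weight set to one, against the wavelet $\Psi_0 = I_n - P$ from Eq. 9.

```python
import numpy as np

n = 6                                        # even number of nodes
W = np.zeros((n, n))
for i in range(n):                           # cycle graph: i ~ i + 1 (mod n)
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
d = W.sum(axis=1)                            # every node has degree 2

x = np.array([3.0, 1.0] * (n // 2))          # two-periodic (two-coloring) signal

gcn_filter = np.eye(n) + W / np.sqrt(np.outer(d, d))   # I + D^{-1/2} W D^{-1/2}
P = 0.5 * (np.eye(n) + W / d)                # lazy random walk matrix
psi_0 = np.eye(n) - P                        # wavelet Psi_0

gcn_response = gcn_filter @ x                # [4, 4, 4, 4, 4, 4]: constant
scattering_response = psi_0 @ x              # [1, -1, 1, -1, 1, -1]: two-coloring
```

After the GCN filter, the two classes of nodes are indistinguishable, whereas the wavelet response still alternates in sign between them.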

Our analysis will be based on a notion of node discriminability444Formal definitions are provided in Appendix. A.1, , which intuitively corresponds to a network producing different representations of two nodes and in its hidden layers. We will let denote the -step node neighborhood4 of a node (including itself), and for we will let denote the corresponding induced subgraph.4 We will say two induced subgraphs and are isomorphic4 if they have identical geometric structure and write to indicate that is an isometry (geometry preserving map) between them. In the definition below, we introduce a class of graph-intrinsic node features that encode graph topology information. We will use these features as the input for GNN models in order to produce geometrically informed node representations.

###### Definition 1 (Intrinsic Node Features).

A nonzero node feature matrix is -intrinsic if for any such that is isomorphic to , we have .

These intrinsic node features encode important -local geometric information in the sense that if , then the -step neighborhoods of and must have different geometric structure. To further understand this definition, we observe that the degree vector, or any (elementwise) function of it, is a one-intrinsic node feature matrix. Setting to the average node degree of nodes in yields -intrinsic node features. As a slightly more complicated example, features with the number of triangles contained in are also -intrinsic. Fig. 4 illustrates how such node features can help distinguish nodes from different graph structures.
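As a rough illustration of such features, the sketch below computes node degrees and the number of triangles in the induced subgraph on a k-step neighborhood. The helper names (`k_step_neighborhood`, `triangles_in_neighborhood`) and the toy graph are hypothetical and chosen only for this example; the adjacency matrix is assumed to be symmetric and unweighted.

```python
import numpy as np

def k_step_neighborhood(adj, v, k):
    """Indices of all nodes within k hops of v, including v itself."""
    reach = np.zeros(adj.shape[0], dtype=bool)
    reach[v] = True
    for _ in range(k):
        # Add every node adjacent to an already reached node.
        reach = reach | (adj[reach].sum(axis=0) > 0)
    return np.flatnonzero(reach)

def triangles_in_neighborhood(adj, v, k):
    """Number of triangles in the subgraph induced by the k-step neighborhood of v."""
    nodes = k_step_neighborhood(adj, v, k)
    sub = adj[np.ix_(nodes, nodes)]
    # trace(A^3) counts each triangle six times (3 start nodes x 2 directions).
    return int(np.trace(sub @ sub @ sub)) // 6

# Toy graph: a triangle 0-1-2 with a tail 2-3-4.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
adj = np.zeros((5, 5), dtype=int)
for i, j in edges:
    adj[i, j] = adj[j, i] = 1

degree = adj.sum(axis=1)
tri_1hop = [triangles_in_neighborhood(adj, v, 1) for v in range(5)]
tri_2hop = [triangles_in_neighborhood(adj, v, 2) for v in range(5)]
print(degree, tri_1hop, tri_2hop)
```

The triangle count depends only on the induced subgraph of the chosen neighborhood, so two nodes whose neighborhoods are isomorphic receive the same value.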

As a concrete example, consider the task of predicting traffic on a network of Wikipedia articles, such as the Chameleon dataset [36], with nodes and edges corresponding to articles and hyperlinks. Here, intrinsic node features provide insights into the topological context of each node. A high degree suggests a widely cited article, which is likely to have a lot of traffic, while counting -cliques or considering average degrees from nearby nodes can shed light on the density of the local graph region.

The following theorem characterizes situations when GCN filters cannot discriminate nodes based on the information represented in intrinsic node features.

###### Theorem 1.

Let , , and such that . Let be any -intrinsic node feature matrix and let be the output of an -layer GCN, i.e., , with each layer as in Eq. 5. Then, for any , we have . Therefore, the nodes and cannot be discriminated based on the filtered node features from an -layer GCN.

Below, we provide a sketch of the main ideas of the proof. For a complete proof, please see Appendix A.2.

###### Proof Sketch.

We use induction to show for all and all . The case follows from Definition 1. One may then establish the inductive step by showing that the steps Eq. 6a-6c preserve this equality for all . Therefore, for all and so an -layer GCN cannot be used to distinguish these nodes. ∎

Next, we introduce a notion of structural difference that allows us to analyze situations where scattering networks have strictly greater discriminatory power than GCNs.

###### Definition 2 (Structural Difference).

For two -isomorphic induced subgraphs , a structural difference relative to and a node feature matrix is manifested at if or . (Note that is only relevant for nodes in the boundary because for all interior nodes , as the degree is 1-intrinsic.) For , we further assume .

Notably, if the -step neighborhoods of two nodes are -isomorphic and is any -intrinsic node feature matrix, then no structural difference relative to and can be manifested at . Theorem 2 stated below shows that a scattering network will be able to produce different representations of two nodes and if there are structural differences in their surrounding neighborhoods, assuming (among a few other mild assumptions) that a certain pathological case is avoided. The next definition characterizes this pathological situation that arises, roughly speaking, when two sums and coincide, even though for all . We note that although this condition seems complicated, it is highly unlikely to occur in practice.

###### Definition 3 (Coincidental Correspondence).

Let and such that . Let be a node feature matrix, and let

$$\Delta_u \coloneqq \{w \in \mathcal{N}_u : \text{structural difference relative to } \phi \text{ and } X \text{ at } w\}.$$

We say that a node is subject to coincidental correspondence relative to and if is non-empty and

$$\sum_{w \in \Delta_u} d_w^{-1} X[w] = \sum_{w \in \phi(\Delta_u)} d_w^{-1} X[w]. \tag{19}$$

We further say that a set of nodes is subject to coincidental correspondence if there exists at least one that is subject to coincidental correspondence.
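The condition in Eq. 19 can be made concrete with a small numerical check. The sketch below is a hypothetical helper, not part of the method: it takes the set of nodes with a structural difference (`delta`), the isomorphism as a dictionary (`phi`), the degrees, and the features, and tests whether the two degree-weighted sums agree. The toy values are invented to show that the sums can coincide even though every individual feature differs from its image.

```python
import numpy as np

def weighted_sum(nodes, deg, X):
    """Sum of d_w^{-1} X[w] over the given nodes."""
    nodes = np.asarray(nodes)
    return (X[nodes] / deg[nodes, None]).sum(axis=0)

def is_coincidental(delta, phi, deg, X):
    """True if delta is non-empty and both sides of Eq. 19 agree."""
    if len(delta) == 0:
        return False
    image = [phi[w] for w in delta]
    return bool(np.allclose(weighted_sum(delta, deg, X),
                            weighted_sum(image, deg, X)))

# Invented toy data: nodes 0, 1 are mapped to nodes 2, 3.
deg = np.array([2.0, 2.0, 2.0, 2.0])
phi = {0: 2, 1: 3}
delta = [0, 1]

# Features differ pairwise (1 vs 2, 4 vs 3), yet the weighted sums agree.
X_pathological = np.array([[1.0], [4.0], [2.0], [3.0]])
# Changing one value breaks the coincidence.
X_generic = np.array([[1.0], [4.0], [2.0], [5.0]])

print(is_coincidental(delta, phi, deg, X_pathological))
print(is_coincidental(delta, phi, deg, X_generic))
```

The first feature matrix had to be tuned by hand to make the sums match, which is consistent with the observation that the condition is unlikely to occur in practice.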

###### Remark 1.

When assuming there is no coincidental correspondence relative to and on a set of nodes for some , we will implicitly assume this to also hold for diffused node features obtained from multiple diffusions via or cascades of wavelets and nonlinear activations with spatial support4 up to a radius of .

###### Theorem 2.

As in Theorem 1, let , , and such that , and consider any K-intrinsic node feature matrix . Further assume that there exists at least one structural difference rel. to and in , and let be the smallest positive integer such that a structural difference is manifested in . If the nonlinearity is strictly monotonic and is not subject to coincidental correspondence rel. to and , then one can define a scattering configuration such that scattering features defined as in Eq. 12 discriminate and .

Theorem 2 provides a large class of situations where scattering filters are able to discriminate between two nodes even if GCN filters are not. In particular, even if the step neighborhoods of and are isomorphic, there may exist a in this step neighborhood such that a -intrinsic node feature takes different values at and . Theorem 2 shows that scattering will produce differing representations of these two nodes (i.e., and ), while Theorem 1 shows that GCN will not. The following two remarks discuss minor variations of Theorem 2.

###### Remark 2.

The absolute value operation is not monotonic. Therefore, Theorem 2 cannot be directly applied in this case. However, the above result and all subsequent results can be extended to the case as long as certain pathological cases are avoided. Namely, Theorem 2 will remain valid as long as the features at are assumed not to be the negatives of the features at , i.e., for nodes , and similarly to Remark 1, we also must assume that the diffused node features are not negatives of each other. We note that while this condition is complex, it is rarely violated in practice.

###### Remark 3.

A result analogous to Theorem 2 is also valid for any permutation of the three steps Eq. 13a-13c (even with added bias terms as in Eq. 15). In particular, it applies to the update rule

$$X^{\ell} = \sigma\left(U_p(X^{\ell-1})\,\Theta^{\ell} + B^{\ell}\right).$$

We provide a sketch of the proof of Theorem 2 here and defer the details to the appendix. The key to the proof will be applying Lemma 1 stated below. In order to explain these ideas, we must first introduce some notation that partitions the neighborhoods from Theorem 2 into a set of node layers that are parameterized by their minimal distance from .

###### Notation 1.

Let , , , such that . Assume that there exists at least one node in , where a structural difference is manifested rel. to and node features . We define

 Vdiff\coloneqq{u∈NDv–: struc. difference rel. to ϕ and X at u}.

and we fix the following notations:

1. Let and define the node set . Note that these are exactly the nodes in where a structural difference is manifested relative to and .

2. Let be the number of paths of minimal length between and vertices in , and denote these by with for .

3. Let , and define the generalized path from to as .

With this notation established, we may now state the following lemma which is critical to the proof of Theorem 2.

###### Lemma 1.

Let , , and such that . Assume there exists at least one node in where a structural difference is manifested relative to and to , and let be the smallest positive integer such that a structural difference is manifested in . Assume that is not subject to coincidental correspondence relative to and , and let be the generalized path as defined in Notation 1. Then, for , the nodes in are exactly the nodes in where a structural difference is manifested relative to and filtered node features .

Below, we provide a sketch of the main ideas of the proof. A complete proof is provided in Appendix A.3.

###### Proof Sketch.

The main idea of the proof is to use induction to show that at every step the matrix propagates the information about the structural difference one node layer closer towards .

By Notation 1(i), we have . Therefore, the case is immediate. In the inductive step, it suffices to show

 Yj+1[u]≠Yj+1[ϕ(u)]

for all and

 Yj+1[w]=Yj+1[ϕ(w)]

for all under the assumption that these results are already valid for . This can be established by writing and using the inductive hypothesis together with the definition of .∎
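The layer-by-layer propagation used in this argument can be visualized in a simplified setting. The sketch below does not reproduce the lemma: instead of two isomorphic neighborhoods, it compares two signals on the same path graph that differ at a single end node, and it assumes the diffusion matrix is the lazy random walk matrix `P = (I + W D^{-1}) / 2`. Each application of `P` moves the set of nodes where the two diffused signals differ one hop closer to the opposite end.

```python
import numpy as np

n = 6  # path graph 0 - 1 - ... - 5 (arbitrary size)
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

# Assumed diffusion operator: lazy random walk matrix.
P = 0.5 * (np.eye(n) + W @ np.diag(1.0 / W.sum(axis=1)))

# Two signals that agree everywhere except at the far end node n - 1.
x = np.ones(n)
x_alt = x.copy()
x_alt[n - 1] += 1.0

y, y_alt = x.copy(), x_alt.copy()
supports = []      # nodes where the diffused signals differ after j steps
first_hit = None   # first step at which the difference reaches node 0
for j in range(1, n):
    y, y_alt = P @ y, P @ y_alt
    differing = np.flatnonzero(~np.isclose(y, y_alt))
    supports.append(differing.tolist())
    if first_hit is None and 0 in differing:
        first_hit = j

print(supports)
print("difference reaches node 0 after", first_hit, "steps")
```

Because all entries of `P` are non-negative, no cancellation can occur in this toy setting; in the general case, ruling out such cancellations is the role of the coincidental correspondence assumption.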

We now use Lemma 1 to sketch the proof of Theorem 2. The complete proof is provided in Appendix A.4.

###### Proof Sketch for Theorem 2.

We need to show that we can choose the parameters in a way that guarantees . For simplicity, we set . In this case, since the nonlinearity is strictly monotonic, and therefore injective, it suffices to show that we can construct such that

$$U_p(X)[v] \neq U_p(X)[\phi(v)]. \tag{20}$$

Using binary expansion, we may choose , , such that and set . Given , we define truncated paths and let for For , we let denote the empty path of length 0 and set

Recalling the generalized path defined in Notation 1, we will use induction to show that is equal to the set of nodes in such that a structural difference is manifested relative to and for . Since and , this will imply Eq. 20 and thus prove the theorem. Analogously to the base case in the proof of Lemma 1, the case follows from Notation 1(i). In the inductive step, it suffices to show that

 Zi+1[u]≠Zi+1[ϕ(u)],

for all and

 Zi+1[w]=Zi+1[ϕ(w)],

for all , under the assumption that these results are true for . Since