1 Introduction
Deep learning is typically most effective when the structure of the data can be used to design the architecture of the relevant network. For example, the design of recurrent neural networks is informed by the sequential nature of time-series data. Similarly, the design of convolutional neural networks is based in part on the fact that the pixels of an image are arranged in a rectangular grid. The success of neural networks in these, as well as many other applications, has inspired the rise of geometric deep learning
[1, 2], which aims to extend the success of deep learning to other forms of structured data and to develop intelligent methods for data sets that have a non-Euclidean structure. A common approach in geometric deep learning is to model the data by a graph. In many applications, this is done by defining edges between data points that interact in a specific way, e.g., “friends” on a social network. In many other applications, one may construct a graph from a high-dimensional data set, either by defining an edge between each point and its
nearest neighbors or by defining weighted edges via a similarity kernel. Inspired by the increasing ubiquity of graph-structured data, numerous recent works have shown graph neural networks (GNNs) to perform well in a variety of fields including biology, chemistry and social networks [3, 4, 5]. In these methods, the graph is often considered in conjunction with a set of node features, which contain “local” information about, e.g., each user of a social network. One common family of tasks is so-called graph-level tasks, where one seeks to learn a whole-graph representation for the purposes of, e.g., predicting properties of proteins [6, 7, 8]. Another common family of tasks, which has been the primary focus of graph convolutional networks (GCNs) [5], is node-level tasks such as node classification. There, the entire data set is modeled as one large graph and the network aims to produce a useful representation of each node using both the node features and the graph structure. This work is typically conducted in the semi-supervised setting where one only knows the labels of a small fraction of the nodes.
Many popular state-of-the-art GNNs essentially aim to promote similarity between adjacent nodes, which may be interpreted as a smoothing operation. While this is effective in certain settings, it can also cause a decrease in performance because of the oversmoothing problem [9], where nodes become increasingly indistinguishable from one another after each subsequent layer. One promising approach for addressing this problem is the graph attention network [10], which uses attention mechanisms computed from node features to learn adaptive weighting in its message passing operations. However, this approach still fundamentally aims at enforcing similarity among neighbors, albeit in an adaptive manner.
Here, we propose to augment traditional GNN architectures by also including novel band-pass filters, in addition to conventional GCN-style filters that essentially perform low-pass filtering [11], in order to extract richer representations for each node.^1 This approach is based on the geometric scattering transform [12, 13, 14], whose construction is inspired by the Euclidean scattering transform introduced by Mallat in [15], and utilizes iterative cascades of graph wavelets [16, 17] and pointwise nonlinear activation functions to produce useful graph data representations.

(^1 Throughout this text, we will use the term GCN to refer to the network introduced by Kipf and Welling in [5]. We will use the term GNN to refer to graph neural networks (spectral, convolutional, or otherwise) in general.)
The main contribution of this work is two hybrid GNN frameworks that utilize both traditional GCN-style filters and also novel filters based on the scattering transform. This approach is based on the following simple idea: GCN-based networks are very useful, but as they aim to enforce similarity among nodes, they essentially focus on low-frequency information. Wavelets, on the other hand, are naturally equipped to capture high-frequency information. Therefore, in a hybrid network, the different channels will capture different types of information. Such a network will therefore be more powerful than a network that only uses one style of filter or the other.
We also introduce complementary GNN modules that enhance the performance of such hybrid scattering models, including (i) the graph residual convolution, an adaptive low-pass filter that corrects high-frequency noise, and (ii) an attention framework that enables the aggregation of information from different filters individually at every node. We present theoretical results, based on a new notion of graph structural difference, that highlight the sensitivity of scattering filters to graph regularity patterns not captured by GCN filters. Extensive empirical experiments demonstrate the ability of hybrid scattering networks, for (transductive) semi-supervised node classification, to (i) alleviate oversmoothing and (ii) generalize to complex (low-homophily) datasets. Moreover, we also present empirical evidence that our framework translates well to inductive graph-level tasks.
The remainder of this paper is organized as follows. We review related work on GNN models and geometric scattering in Sec. 2 and introduce important concepts that will be used throughout this work in Sec. 3. We then formulate the hybrid scattering network in Sec. 4, followed by a theoretical study of the benefits of such models in Sec. 5. In Sec. 6, we present empirical results before concluding in Sec. 7.
2 Related Work
Theoretical analyses [9, 11] of GCN and related models show that they may be viewed as Laplacian smoothing operations and, from the signal processing perspective, essentially perform low-pass filtering of the graph features. One approach towards addressing this problem is the graph attention network proposed by [10], which uses self-attention mechanisms to address these shortcomings by adaptively reweighting local neighborhoods. In [18], the authors construct a low-rank approximation of the graph Laplacian that efficiently gathers multiscale information and demonstrate the effectiveness of their method on citation networks and the QM8 quantum chemistry dataset. In [19], the authors take an approach similar to GCN, but use multiple powers of the adjacency matrix to learn higher-order neighborhood information. Finally, in [20] the authors used graph wavelets to extract higher-order neighborhood information.
In addition to the learned networks discussed above, several works [12, 13, 14] have introduced different variations of the graph scattering transform. These papers aim to extend the Euclidean scattering transform of Mallat [15] to graph-structured data and propose predesigned, wavelet-based networks. In [14, 12, 21, 22], extensive theoretical studies of these networks show that they have desirable stability, invariance, and conservation of energy properties. The practical utility of these networks has been established in [13], which primarily focuses on graph classification, and in [23, 24, 25], which used the graph scattering transform to generate molecules. Building on these results, which use handcrafted formulations of the scattering transform, recent work [26] has proposed a framework for data-driven tuning of the traditionally handcrafted geometric scattering design that maintains the theoretical properties of traditional designs, while also showing strong empirical results in whole-graph settings.
3 Geometric Deep Learning Background
3.1 Graph Signal Processing
Let $G = (V, E, w)$ be a weighted graph, characterized by a set of nodes (also called vertices) $V = \{v_1, \dots, v_n\}$, a set of undirected edges $E$, and a function $w : E \to (0, \infty)$ assigning positive edge weights to the edges. Let $X \in \mathbb{R}^{n \times d}$ be a node features matrix. We shall interpret the $i$-th row of $X$ as representing the features of the node $v_i$, and therefore, we shall denote these rows by $x_{v_1}, \dots, x_{v_n} \in \mathbb{R}^d$. The columns of $X$, on the other hand, will be denoted by $x^1, \dots, x^d \in \mathbb{R}^n$. Each of these columns may be naturally identified with a graph signal, i.e., a function $x : V \to \mathbb{R}$, $x(v_i) = x_i$. In what follows, for simplicity, we will not distinguish between the vectors $x \in \mathbb{R}^n$ and the functions $x : V \to \mathbb{R}$ and will refer to both as graph signals.

We define the weighted adjacency matrix $A \in \mathbb{R}^{n \times n}$ of $G$ by $A_{ij} = w(v_i, v_j)$ if $\{v_i, v_j\} \in E$, and set all other entries to zero. We further define the degree matrix $D$ as the diagonal matrix with each diagonal element $D_{ii} = \deg(v_i)$ being the degree of a node $v_i$. In the following, we will also use the shorthand $d_i = \deg(v_i)$ to denote the degree of $v_i$. We consider the combinatorial graph Laplacian matrix $L = D - A$ and the symmetric normalized Laplacian given by

$$L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I_n - D^{-1/2} A D^{-1/2}.$$
It is well known that this matrix is symmetric, positive semidefinite, and admits an orthonormal basis of eigenvectors $q_1, \dots, q_n$ such that $L_{\mathrm{sym}} q_i = \lambda_i q_i$ with $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n \le 2$. Therefore, we may write $L_{\mathrm{sym}} = Q \Lambda Q^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ and $Q$ is the orthogonal matrix whose $i$-th column is $q_i$.

We will use the eigendecomposition of $L_{\mathrm{sym}}$ to define the graph Fourier transform, with the eigenvectors $q_i$ being interpreted as Fourier modes. The notion of oscillation on irregular domains like graphs is delicate, but can be reframed in terms of increasing variation of the modes, with the eigenvalues $\lambda_i$ interpreted as (squared) frequencies.^2

(^2 This interpretation originates from motivating the graph Fourier transform via the combinatorial graph Laplacian $L$, with the variation of a signal $x$ measured by $x^\top L x$.)

The Fourier transform of a graph signal $x$ is defined by $\hat{x}(\lambda_i) = \langle x, q_i \rangle$ for $1 \le i \le n$, and the inverse Fourier transform may be computed by
$x = \sum_{i=1}^{n} \hat{x}(\lambda_i)\, q_i$. It will frequently be convenient to write these equations in matrix form as $\hat{x} = Q^\top x$ and $x = Q \hat{x}$.

We recall that in the Euclidean setting, the convolution of two signals in the spatial domain corresponds to the pointwise multiplication of their Fourier transforms. Therefore, we may define the convolution of a signal $x$ with a filter $g$ by the rule that $g \star x$ is the unique vector such that $\widehat{g \star x}(\lambda_i) = \hat{g}(\lambda_i)\,\hat{x}(\lambda_i)$ for $1 \le i \le n$. Applying the inverse Fourier transform, one may verify that

$$g \star x = Q \hat{G} Q^\top x, \qquad (1)$$

where $\hat{G} = \mathrm{diag}(\hat{g}(\lambda_1), \dots, \hat{g}(\lambda_n))$. Hence, convolutional graph filters can be parameterized by considering the Fourier coefficients in $\hat{G}$.
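To make Eq. 1 concrete, the following sketch filters a signal through the graph Fourier transform of a small path graph. The graph and the filter coefficients are illustrative choices, not taken from the text:

```python
import numpy as np

# Adjacency matrix of a 4-node path graph (illustrative).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalized Laplacian

lam, Q = np.linalg.eigh(L_sym)                    # frequencies and Fourier modes

def spectral_filter(x, g_hat):
    """Eq. (1): convolve x with a filter given by Fourier coefficients g_hat."""
    return Q @ (g_hat * (Q.T @ x))                # Q diag(g_hat) Q^T x

x = np.array([1.0, -1.0, 1.0, -1.0])
assert np.allclose(spectral_filter(x, np.ones(4)), x)       # all-ones = identity
low_passed = spectral_filter(x, (lam <= 1.0).astype(float)) # keep low frequencies
```

Note that this direct approach requires the full eigendecomposition of the Laplacian, which is exactly the cost that the polynomial constructions discussed next avoid.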
3.2 Spectral Graph Neural Network Constructions
A graph filter is a function that transforms a node feature matrix $H \in \mathbb{R}^{n \times d}$ into a new feature matrix $H' \in \mathbb{R}^{n \times d'}$. GNNs typically feature several layers, each of which produces a new set of features by filtering the output of the previous layer. We will usually let $H^{(0)} = X$ denote the initial node feature matrix, which is the input to the network, and let $H^{(\ell)}$ denote the node feature matrix after the $\ell$-th layer.
In light of Eq. 1, a natural way to construct learned graph filters would be to directly learn the Fourier coefficients in $\hat{G}$. Indeed, this was the approach used in the pioneering work of Bruna et al. [27]. However, this approach has several drawbacks. First, it results in $n$ learnable parameters in each convolutional filter. Therefore, a network using such filters would not scale well to large data sets due to the computational cost. At the same time, such filters are not necessarily well-localized in space and are prone to overfitting [28]. Moreover, networks of the form introduced in [27] typically cannot be generalized to different graphs [1]. However, recent work [29] has shown that this latter issue can be overcome by formulating the Fourier coefficients as smooth functions $\hat{g}(\lambda_i)$ of the Laplacian eigenvalues $\lambda_i$, $1 \le i \le n$. In particular, this will be the case for the filters used in the networks considered in this work.
A common approach (e.g., used in [28, 5, 30, 31, 18]) to formulate such filters is by using polynomials of the Laplacian eigenvalues to set $\hat{g}(\lambda_i) = \sum_{k=0}^{K} \theta_k \lambda_i^k$ (or equivalently $\hat{G} = \sum_{k=0}^{K} \theta_k \Lambda^k$) for some $K \in \mathbb{N}$. It can be verified [28] that this approach yields convolutional filters that are $K$-localized in space and that can be written as $g \star x = \sum_{k=0}^{K} \theta_k L_{\mathrm{sym}}^k\, x$. This reduces the number of trainable parameters in each filter from $n$ to $K + 1$ and allows one to perform convolution without explicitly computing the spectral decomposition of the Laplacian, which is expensive for large graphs.
One particularly noteworthy network that uses this method is [28], which writes the filters in terms of the Chebyshev polynomials defined by $T_0(x) = 1$, $T_1(x) = x$, and $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$. They first renormalize the eigenvalues via $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I_n$ and then define $\tilde{L} = 2 L_{\mathrm{sym}}/\lambda_{\max} - I_n$. This gives rise to a localized filtering operation of the form

$$g_\theta \star x = \sum_{k=0}^{K} \theta_k\, T_k(\tilde{L})\, x, \qquad (2)$$

where $\theta = (\theta_0, \dots, \theta_K)$ is a vector of trainable Chebyshev coefficients.
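The Chebyshev recursion behind Eq. 2 can be sketched in a few lines; the toy graph, signal, and coefficients below are illustrative:

```python
import numpy as np

def cheb_filter(L_sym, x, theta, lam_max=2.0):
    """Apply sum_k theta_k T_k(L_tilde) x via the Chebyshev recursion
    T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x), without any eigendecomposition.
    Assumes len(theta) >= 2."""
    n = L_sym.shape[0]
    L_tilde = 2.0 * L_sym / lam_max - np.eye(n)    # rescale spectrum into [-1, 1]
    t_prev, t_curr = x, L_tilde @ x                # T_0(L~) x and T_1(L~) x
    out = theta[0] * t_prev + theta[1] * t_curr
    for theta_k in theta[2:]:
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + theta_k * t_curr
    return out

# Two-node toy graph with L_sym = [[1, -1], [-1, 1]].
L_sym = np.array([[1.0, -1.0], [-1.0, 1.0]])
y = cheb_filter(L_sym, np.array([1.0, 2.0]), theta=[1.0, 0.0, 0.0])
assert np.allclose(y, [1.0, 2.0])                  # theta = (1, 0, 0) is the identity
```

Only matrix-vector products with the (sparse) Laplacian are needed, which is why this construction scales to large graphs.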
3.3 Graph Convolutional Networks
One of the most widely used GNNs is the Graph Convolutional Network (GCN) [5]. This network is derived from the Chebyshev construction [28] mentioned above by setting $K = 1$ in Eq. 2 and approximating $\lambda_{\max} \approx 2$, which yields

$$g_\theta \star x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x.$$

To further reduce the number of trainable parameters, the authors then set $\theta = \theta_0 = -\theta_1$. The resulting convolutional filter has the form

$$g_\theta \star x = \theta\,\big(I_n + D^{-1/2} A D^{-1/2}\big)\, x. \qquad (3)$$

One may verify that $I_n + D^{-1/2} A D^{-1/2} = 2 I_n - L_{\mathrm{sym}}$, and therefore, Eq. 3 essentially corresponds to setting $\hat{g}(\lambda_i) = 2 - \lambda_i$ in Eq. 1. The eigenvalues of $I_n + D^{-1/2} A D^{-1/2}$ take values in $[0, 2]$. Thus, to avoid vanishing or exploding gradients, the authors use a renormalization trick

$$I_n + D^{-1/2} A D^{-1/2} \;\to\; \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, \qquad (4)$$
where $\tilde{A} = A + I_n$ and $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \tilde{d}_i = \sum_{j} \tilde{A}_{ij}$ for $1 \le i \le n$. Setting $\tilde{A}_{\mathrm{sym}} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ and using multiple channels, we obtain a layer-wise propagation rule in which $d_\ell$ denotes the number of channels used in the $\ell$-th layer and $\sigma$ is an elementwise nonlinearity. In matrix form we write

$$H^{(\ell)} = \sigma\big(\tilde{A}_{\mathrm{sym}}\, H^{(\ell-1)}\, \Theta^{(\ell)}\big), \qquad (5)$$

with a trainable weight matrix $\Theta^{(\ell)} \in \mathbb{R}^{d_{\ell-1} \times d_\ell}$. We interpret the matrix $\tilde{A}_{\mathrm{sym}}$ as computing a localized average of each channel around each node and the matrix $\Theta^{(\ell)}$ as sharing information across channels. This filter can also be observed at the node level as

$$h_i^{(\ell)} = \sigma\Big(\Theta^{(\ell)\top} \sum_{v_j \in \mathcal{N}(v_i) \cup \{v_i\}} \frac{\tilde{A}_{ij}}{\sqrt{\tilde{d}_i \tilde{d}_j}}\; h_j^{(\ell-1)}\Big),$$

where $\mathcal{N}(v_i)$ denotes the one-step neighborhood of node $v_i$ and $h_i^{(\ell)}$ denotes the $i$-th row of $H^{(\ell)}$. This process can be split into three steps:

$$\bar{h}_i^{(\ell)} = \Theta^{(\ell)\top} h_i^{(\ell-1)}, \qquad (6a)$$

$$\tilde{h}_i^{(\ell)} = \sum_{v_j \in \mathcal{N}(v_i) \cup \{v_i\}} \frac{\tilde{A}_{ij}}{\sqrt{\tilde{d}_i \tilde{d}_j}}\; \bar{h}_j^{(\ell)}, \qquad (6b)$$

$$h_i^{(\ell)} = \sigma\big(\tilde{h}_i^{(\ell)}\big), \qquad (6c)$$

which we refer to as the transformation step (Eq. 6a), the aggregation step (Eq. 6b) and the activation step (Eq. 6c).
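The three steps above can be sketched directly in numpy; the toy graph and the random weight matrix are illustrative placeholders for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)            # toy 3-node graph
A_tilde = A + np.eye(3)                           # renormalization trick: self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_sym = D_inv_sqrt @ A_tilde @ D_inv_sqrt         # renormalized operator (Eq. 4)

def gcn_layer(H, Theta):
    H_bar = H @ Theta                             # transformation step (Eq. 6a)
    H_tilde = A_sym @ H_bar                       # aggregation step   (Eq. 6b)
    return np.maximum(H_tilde, 0.0)               # activation step    (Eq. 6c), ReLU

H0 = rng.normal(size=(3, 4))                      # input node features
H1 = gcn_layer(H0, rng.normal(size=(4, 2)))
assert H1.shape == (3, 2) and (H1 >= 0).all()
```

Note that the dense matrix products here are for illustration only; practical implementations exploit the sparsity of the adjacency matrix.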
As discussed earlier, the GCN filter described above may be viewed as a low-pass filter that suppresses high-frequency information. For simplicity, we focus on the convolution in Eq. 3 before the renormalization. This convolution essentially corresponds to pointwise Fourier multiplication by $\hat{g}(\lambda) = 2 - \lambda$, which is strictly decreasing in $\lambda$. Therefore, repeated applications of this filter effectively zero out the higher frequencies. This is consistent with the oversmoothing problem discussed in [9].
3.4 Graph Attention Networks
Another popular network that is widely used for node classification tasks is the graph attention network (GAT) [10], which uses an attention mechanism to guide and adjust the aggregation of features from adjacent nodes. First, the node features $h_i$ are linearly transformed to $\bar{h}_i = \Theta h_i$ using a learned weight matrix $\Theta$. Then, the aggregation coefficients are learned via

$$\alpha_{ij} = \operatorname{softmax}_j\Big(\mathrm{LeakyReLU}\big(a^\top [\bar{h}_i \,\Vert\, \bar{h}_j]\big)\Big),$$

where $a$ is a shared attention vector and $\Vert$ denotes horizontal concatenation. The output feature corresponding to a single attention head is given by $h_i' = \sigma\big(\sum_{v_j \in \mathcal{N}(v_i) \cup \{v_i\}} \alpha_{ij}\, \bar{h}_j\big)$. To increase the expressivity of this network, the authors then use a multi-headed attention mechanism, with $\Gamma$ heads, to generate concatenated features

$$h_i' = \big\Vert_{\gamma=1}^{\Gamma}\, \sigma\Big(\sum_{v_j \in \mathcal{N}(v_i) \cup \{v_i\}} \alpha_{ij}^{\gamma}\, \bar{h}_j^{\gamma}\Big). \qquad (7)$$
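A single attention head can be sketched as follows. The weights are random placeholders for learned parameters, the LeakyReLU slope is an illustrative choice, and the output nonlinearity $\sigma$ is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)            # toy 3-node graph
X = rng.normal(size=(3, 4))                       # node features
Theta = rng.normal(size=(4, 2))                   # learned transform (random here)
a = rng.normal(size=4)                            # shared attention vector, length 2d'

X_bar = X @ Theta                                 # transformed features
N = A + np.eye(3)                                 # neighborhoods including self
out = np.zeros_like(X_bar)
for i in range(3):
    nbrs = np.where(N[i] > 0)[0]
    # Raw scores a^T [x_bar_i || x_bar_j] for each neighbor j, then LeakyReLU.
    e = np.array([np.concatenate([X_bar[i], X_bar[j]]) @ a for j in nbrs])
    e = np.where(e > 0, e, 0.2 * e)               # LeakyReLU
    alpha = np.exp(e) / np.exp(e).sum()           # softmax over the neighborhood
    out[i] = alpha @ X_bar[nbrs]                  # attention-weighted aggregation
assert out.shape == (3, 2)
```

Each row of `out` is a convex combination of its neighborhood's transformed features, which is exactly the adaptive reweighting described above.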
3.5 Challenges in Geometric Deep Learning
Many GNN models, including GCN [5] and GAT [10], are subject to the so-called oversmoothing problem [9], caused by aggregation steps (such as Eq. 6b) that essentially consist of localized averaging operations. As discussed in Sec. 3.3 and also in [11], from a signal processing point of view, this corresponds to a low-pass filtering of graph signals. Moreover, as discussed in [32], these networks are also subject to underreaching: most GNNs (including GCN and GAT) can only relate information from nodes within a distance equal to the number of GNN layers, and because of the oversmoothing problem, they typically use a small number of layers in practice. Therefore, the oversmoothing and underreaching problems combine to significantly limit the ability of GNNs to capture long-range dependencies. In Sec. 4, we will introduce a hybrid network, which aims to address these challenges by using both GCN-style channels and channels based on the geometric scattering transform discussed below.
3.6 Geometric Scattering
In this section, we review the geometric scattering transform constructed in [13] for graph classification and show how it may be adapted for node-level tasks. As we shall see, this node-level geometric scattering will address the challenges discussed above in Sec. 3.5 by using band-pass filters that capture high-frequency information and have wider receptive fields.
The geometric scattering transform uses wavelets based upon raising the lazy random walk matrix

$$P = \frac{1}{2}\big(I_n + A D^{-1}\big)$$

to dyadic powers $2^k$, which can be interpreted as differing degrees of resolution. Entrywise, we note that

$$(P x)_i = \frac{1}{2}\, x_i + \frac{1}{2} \sum_{v_j \in \mathcal{N}(v_i)} \frac{A_{ij}}{d_j}\, x_j. \qquad (8)$$

Thus, $P$ may be viewed as a localized averaging operator analogous to those used in, e.g., GCN, and the powers $P^{2^k}$ may be viewed as low-pass filters which suppress high frequencies. In order to better retain this high-frequency information, we define multiscale graph diffusion wavelets by subtracting these low-pass filters at different scales [17]. Specifically, we set $\Psi_0 = I_n - P$ and, for $k \ge 1$, we define a wavelet at scale $2^k$ by

$$\Psi_k = P^{2^{k-1}} - P^{2^k}. \qquad (9)$$

From a frequency perspective, we may interpret each $\Psi_k$ as capturing information at a different frequency band. From a spatial perspective, we may view $\Psi_k x$ as encoding information on how a $2^k$-step neighborhood differs from a smaller $2^{k-1}$-step one. Such wavelets are usually organized in a filter bank $\{\Psi_k, \Phi_J\}_{0 \le k \le J}$, along with a low-pass filter $\Phi_J = P^{2^J}$. One should note that these wavelets are not self-adjoint operators on the standard $\ell^2$ space. However, it was shown in [22] that this wavelet filter bank is a self-adjoint non-expansive frame on a suitably weighted inner-product space.
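The lazy random walk and the resulting wavelet filter bank can be sketched on a small illustrative graph. Note how the band-pass filters plus the low-pass telescope back to the identity, so no information is lost across bands:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)         # illustrative 4-node graph
P = 0.5 * (np.eye(4) + A @ np.diag(1.0 / A.sum(axis=1)))  # lazy random walk

def wavelet_bank(P, J):
    """Psi_0 = I - P and Psi_k = P^(2^(k-1)) - P^(2^k) for k = 1..J (Eq. 9),
    together with the low-pass Phi_J = P^(2^J)."""
    bank = [np.eye(P.shape[0]) - P]
    for k in range(1, J + 1):
        bank.append(np.linalg.matrix_power(P, 2 ** (k - 1))
                    - np.linalg.matrix_power(P, 2 ** k))
    return bank, np.linalg.matrix_power(P, 2 ** J)

bank, low_pass = wavelet_bank(P, J=2)
assert np.allclose(sum(bank) + low_pass, np.eye(4))  # the bank telescopes to I
assert np.allclose(P.sum(axis=0), 1.0)               # P is column-stochastic
```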
The geometric scattering transform is a multilayered architecture that iteratively applies wavelet convolutions and nonlinear activation functions in an alternating cascade, as illustrated in Fig. 1. It is parameterized by paths $p = (j_1, \dots, j_m)$. Formally, we define

$$U_p\, x = \Psi_{j_m}\, \sigma\big(\Psi_{j_{m-1}} \cdots\, \sigma(\Psi_{j_1} x) \cdots\big), \qquad (10)$$

where $\sigma$ is a nonlinear activation function.^3 We note that the nonlinearity might vary in each step of the cascade. However, we will suppress this possible dependence to avoid cumbersome notation. We also note that in our theoretical results, if we assume, for example, that $\sigma$ is strictly increasing, this assumption is intended to apply to all nonlinearities used in the cascade. In our experiments, we use the absolute value, i.e., $\sigma(\cdot) = |\cdot|$.

(^3 In a slight deviation from previous work, here $U_p$ does not include the outermost nonlinearity in the cascade.)
When applied in the context of graph classification, the scattering features are frequently aggregated by taking $q$-th order moments,

$$S_{p,q}\, x = \sum_{i=1}^{n} |U_p\, x(v_i)|^q. \qquad (11)$$

However, since our current focus is on node-level tasks, we will not use these moments here.
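A single scattering path $U_p$ from Eq. 10 can be sketched as follows, using the absolute value as the nonlinearity and a small illustrative graph:

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)            # 3-node path graph
P = 0.5 * (np.eye(3) + A @ np.diag(1.0 / A.sum(axis=1)))
Psi = {1: P - np.linalg.matrix_power(P, 2),       # wavelets from Eq. 9
       2: np.linalg.matrix_power(P, 2) - np.linalg.matrix_power(P, 4)}

def scatter(x, path):
    """U_p x = Psi_{j_m} |Psi_{j_{m-1}} ... |Psi_{j_1} x| ...| (Eq. 10),
    with |.| applied between consecutive wavelet convolutions."""
    out = Psi[path[0]] @ x
    for j in path[1:]:
        out = Psi[j] @ np.abs(out)                # nonlinearity between wavelets
    return out

x = np.array([1.0, 0.0, -1.0])
u = scatter(x, (1, 2))                            # second-order path p = (1, 2)
assert u.shape == (3,)
assert np.allclose(scatter(x, (1,)), Psi[1] @ x)  # a first-order path is linear
```

The nonlinearity between the wavelets is what lets second-order paths recover energy that a purely linear band-pass cascade would leave scattered across bands.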
The original formulations of geometric scattering were fully designed networks without any learned convolutions between channels. Here, we will incorporate learning by defining the following scattering propagation rule, similar to the one used by GCN:

$$H^{(\ell)} = \sigma\big(U_p\, H^{(\ell-1)}\, \Theta^{(\ell)}\big). \qquad (12)$$
Analogously to GCN, we note that the scattering propagation rule can be decomposed into three steps:

$$\bar{h}_i^{(\ell)} = \Theta^{(\ell)\top} h_i^{(\ell-1)}, \qquad (13a)$$

$$\tilde{h}_i^{(\ell)} = \big(U_p\, \bar{H}^{(\ell)}\big)_i, \qquad (13b)$$

$$h_i^{(\ell)} = \sigma\big(\tilde{h}_i^{(\ell)}\big). \qquad (13c)$$
Importantly, we note that the scattering transform addresses the underreaching problem, as the wavelets leveraged in the aggregation step (Eq. 13b) have larger receptive fields than those of most traditional GNNs. However, scattering does not result in oversmoothing, because the subtraction in Eq. 9 results in band-pass filters rather than low-pass filters. In this manner, the scattering transform addresses the challenges discussed in the previous subsection.
4 Hybrid Scattering Networks
Here, we introduce two hybrid networks that combine aspects of GCN and the geometric scattering transform discussed in the previous section. Our networks will use both low-pass and band-pass filters in different channels to capture different types of information. As a result, our hybrid networks will have greater expressive power than either traditional GCNs, which only use low-pass filters, or a pure scattering network, which only uses band-pass filters.
In our low-pass channels, we use modified GCN filters, which are similar to those used in Eq. 5 but use higher powers of $\tilde{A}_{\mathrm{sym}}$ and include bias terms. Specifically, we use a channel update rule of the form

$$H_k^{(\ell)} = \sigma\big(\tilde{A}_{\mathrm{sym}}^{\,k}\, H^{(\ell-1)}\, \Theta_k^{(\ell)} + B_k^{(\ell)}\big) \qquad (14)$$

for $1 \le k \le K_{\mathrm{low}}$. We note, in particular, that the use of higher powers of $\tilde{A}_{\mathrm{sym}}$ enables a wider receptive field (of radius $k$), without increasing the number of trainable parameters (unlike in GCN).
Similarly, in our band-pass channels, we use a modified version of Eq. 12, with an added bias term, and our update rule is given by

$$H_k^{(\ell)} = \sigma\big(U_{p^k}\, H^{(\ell-1)}\, \Theta_k^{(\ell)} + B_k^{(\ell)}\big) \qquad (15)$$

for $1 \le k \le K_{\mathrm{band}}$. Here, similarly to Eq. 10, $p^k$ is a path that determines the cascade of wavelets used in the $k$-th channel.
Aggregation module. Each hybrid layer uses a set of $K_{\mathrm{low}}$ low-pass and $K_{\mathrm{band}}$ band-pass channels to transform the node features, as described in Eq. 14 and Eq. 15, respectively. The resulting filter responses are aggregated to new $d_\ell$-dimensional node representations via an aggregation module

$$H^{(\ell)} = \mathrm{AGG}\big(H_1^{(\ell)}, \dots, H_{K_{\mathrm{low}} + K_{\mathrm{band}}}^{(\ell)}\big). \qquad (16)$$
Later, in Sections 4.1 and 4.2, we will discuss two models that use different aggregation modules.
Graph Residual Convolution. After aggregating the outputs of the (low-pass) GCN channels and (band-pass) scattering channels in Eq. 16, we apply the graph residual convolution, which acts as a low-pass filter and aims to eliminate high-frequency noise introduced by the scattering channels. This noise can arise, for example, if there are different label rates in different graph substructures. In this case, the scattering features may learn the difference between labeled and unlabeled nodes and thereby produce high-frequency noise.
This filter uses a modified diffusion matrix given by

$$A_{\mathrm{res}}(\alpha) = \frac{1}{1 + \alpha}\big(I_n + \alpha\, A D^{-1}\big),$$

where the hyperparameter $\alpha$ determines the magnitude of the low-pass filtering. Choosing $\alpha = 0$ yields the identity (no filtering), while $\alpha \to \infty$ results in the random walk matrix $A D^{-1}$. Thus, $A_{\mathrm{res}}(\alpha)$ can be interpreted as lying between a completely lazy random walk that never moves and a non-resting one that moves at every time step. The full residual convolution update rule is given by

$$H_{\mathrm{out}} = A_{\mathrm{res}}(\alpha)\, H^{(\ell)}\, \Theta.$$

The multiplication with the weight matrix $\Theta$ corresponds to a fully connected layer applied to the output from Eq. 16 (after low-pass filtering with $A_{\mathrm{res}}(\alpha)$), with each neuron learning a linear combination of the signals output by the aggregation module.
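The interpolation behavior of the residual diffusion matrix can be sketched as follows; the exact normalization by $1/(1+\alpha)$ is our reading of the interpolation between identity and random walk described above, and should be treated as an assumption:

```python
import numpy as np

def residual_matrix(A, alpha):
    """A_res(alpha) = (I + alpha * A D^{-1}) / (1 + alpha): alpha = 0 gives the
    identity (no filtering), alpha -> infinity the random walk matrix A D^{-1}."""
    n = A.shape[0]
    random_walk = A @ np.diag(1.0 / A.sum(axis=1))
    return (np.eye(n) + alpha * random_walk) / (1.0 + alpha)

A = np.array([[0, 1], [1, 0]], dtype=float)
rw = A @ np.diag(1.0 / A.sum(axis=1))
assert np.allclose(residual_matrix(A, 0.0), np.eye(2))      # completely lazy walk
assert np.allclose(residual_matrix(A, 1e9), rw, atol=1e-8)  # near the random walk
```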
4.1 Scattering GCN
The Scattering GCN (ScGCN) network was first introduced in [33]. Here, the aggregation module concatenates the filter responses horizontally, yielding wide node representations of the form

$$H^{(\ell)} = \big[H_1^{(\ell)} \,\Vert\, \cdots \,\Vert\, H_{K_{\mathrm{low}} + K_{\mathrm{band}}}^{(\ell)}\big]. \qquad (17)$$
ScGCN then learns relationships between the channels via the graph residual convolution.
Configuration. The primary goal of ScGCN is to alleviate oversmoothing in popular semi-supervised node-level tasks on, e.g., citation networks. As regularity patterns in such datasets are often dominated by low-frequency information such as inter-cluster node similarities, we choose our parameters to focus on low-frequency information. We use three low-pass filters, with receptive fields of increasing radius, and two band-pass filters. We use the absolute value as our nonlinearity in all steps except the outermost one. Inspired by the aggregation step in classical geometric scattering [13], for the outermost nonlinearity, we additionally apply a power $q$ at the node level, i.e., $\sigma(\cdot) = |\cdot|^q$.
The paths $p^k$ and the parameter $\alpha$ from the graph residual convolution are tuned as hyperparameters of the network.
4.2 Geometric Scattering Attention Network
An important observation regarding ScGCN is that the model decides globally how to combine different channel information. The network first concatenates all of the features from the low-pass and band-pass channels in Eq. 17 and then combines these features via multiplication with the weight matrix. However, for complex tasks or datasets, important regularity patterns may vary significantly across different graph regions. In such settings, a model should ideally attend locally over the aggregation and adaptively combine filter responses at different nodes.
This observation inspired the design of the Geometric Scattering Attention Network (GSAN) [34]. Drawing inspiration from [35], GSAN uses an aggregation module based on a multi-head node-level attention framework. However, the attention mechanism used in GSAN differs from [35] by attending over the combination of the different filter responses rather than over the combination of node features from neighboring nodes. We will first focus on the processing performed independently by each attention head, and then discuss a multi-head configuration.
In a slight deviation from ScGCN, the weight matrices in Eq. 14 and Eq. 15 are shared across the filters of each attention head. Therefore, both aggregation steps (Eq. 6b and Eq. 13b) take the same input, denoted by $\bar{H}^{(\ell)}$. Next, we define $\tilde{H}_k^{(\ell)}$, $1 \le k \le K$ with $K = K_{\mathrm{low}} + K_{\mathrm{band}}$, to be the outputs of the aggregation steps, Eq. 6b and Eq. 13b, with the bias terms set to zero. We then compute attention score vectors $e_k \in \mathbb{R}^n$ that will be used to determine the importance of each of the filter responses for every single node. We calculate

$$e_k = \mathrm{LeakyReLU}\big(\big[\bar{H}^{(\ell)} \,\Vert\, \tilde{H}_k^{(\ell)}\big]\, a\big),$$

with $a$ being a learned attention vector that is shared across all filters of the attention head. These attention scores are then normalized across all filters using the softmax function. Specifically, we define

$$w_k = \frac{\exp(e_k)}{\sum_{k'=1}^{K} \exp(e_{k'})},$$

where the exponential function is applied elementwise. Finally, for every node, the filter responses are summed together, weighted by the corresponding (node-to-filter) attention score. We also normalize by the number of filters $K$, which gives

$$H^{(\ell)} = \frac{1}{K} \sum_{k=1}^{K} w_k \odot \tilde{H}_k^{(\ell)},$$

where $\odot$ denotes the Hadamard (elementwise) product of $w_k$ with each column of $\tilde{H}_k^{(\ell)}$.
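The per-head aggregation just described can be sketched as follows; all shapes, the random weights, and the LeakyReLU slope are illustrative assumptions:

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def attend_over_filters(H_bar, responses, a):
    """Per-node softmax attention over K filter responses (GSAN-style).
    H_bar: shared transformed input (n x d); responses: list of K (n x d)
    aggregation outputs; a: attention vector of length 2d."""
    K = len(responses)
    # One score per node and filter: e_k = LeakyReLU([H_bar || H_k] a).
    scores = np.stack(
        [leaky_relu(np.concatenate([H_bar, Hk], axis=1) @ a) for Hk in responses],
        axis=1)                                    # shape (n, K)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)           # softmax across the K filters
    out = sum(w[:, k:k + 1] * responses[k] for k in range(K)) / K
    return out, w

rng = np.random.default_rng(1)
H_bar = rng.normal(size=(5, 3))
responses = [rng.normal(size=(5, 3)) for _ in range(4)]
out, w = attend_over_filters(H_bar, responses, rng.normal(size=6))
assert out.shape == (5, 3) and np.allclose(w.sum(axis=1), 1.0)
```

In contrast to the GAT head sketched earlier, the softmax here runs over the filters at each node, not over the node's neighbors.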
Multi-head Attention. Similar to other applications of attention mechanisms [35], we use multi-head attention to stabilize the training, thus yielding as output of the aggregation module (by a slight abuse of notation)

$$H^{(\ell)} = \big\Vert_{\gamma=1}^{\Gamma}\, H^{(\ell), \gamma}, \qquad (18)$$

concatenating $\Gamma$ attention heads. As explained above, each attention head has individually trained parameters and outputs a filter response $H^{(\ell), \gamma}$. Similar to ScGCN, this concatenation results in wide node representations that are further refined by the graph residual convolution.
Configuration. For GSAN, we set $K_{\mathrm{low}} = K_{\mathrm{band}}$, giving the model balanced access to both low-pass and band-pass information. The aggregation process of the attention layer is shown in Fig. 3, where the band-pass channels are three first-order scattering transformations. The number of attention heads $\Gamma$ and the parameter $\alpha$ from the graph residual convolution are tuned as hyperparameters of the network.
5 Theory of Hybrid Scattering Networks
In this section, we analyze the ability of scattering filters to capture information not captured by traditional GCN filters. As discussed in Section 3.3, GCN filters effectively act as localized averaging operators and focus primarily on low-frequency information. By contrast, band-pass filters are able to retain high-frequency information and a broader class of regularity patterns. As a simple example, we recall Lemma 1 of [33]. This result considers a two-periodic signal on a cycle graph with an even number of nodes. It shows that Eq. 3 yields a constant signal as filter response, while the scattering filter from Eq. 9 still produces a two-coloring signal. Inductively, we note that this result may be extended to any finite linear cascade of such filters. Moreover, Lemma 2 of the same paper shows that this result may be generalized considerably and can be extended to any two-coloring of a regular bipartite graph.
In the context of semi-supervised learning, these results imply that on a bipartite graph, a GCN-based method cannot be used to predict which part of the vertex set a given node belongs to (assuming that the input feature is a two-coloring of the graph). However, scattering methods preserve two-colorings and therefore will likely be successful for this task. Here, we will further develop this theory and analyze the advantages of our hybrid scattering networks on more general graphs.
Our analysis will be based on a notion of node discriminability,^4 which intuitively corresponds to a network producing different representations of two nodes $v$ and $w$ in its hidden layers. We will let $N_k(v)$ denote the $k$-step node neighborhood^4 of a node $v$ (including $v$ itself), and for $V' \subset V$ we will let $G(V')$ denote the corresponding induced subgraph.^4 We will say two induced subgraphs $G(N_k(v))$ and $G(N_k(w))$ are isomorphic^4 if they have identical geometric structure and write $\phi : N_k(v) \to N_k(w)$ to indicate that $\phi$ is an isometry (geometry preserving map) between them. In the definition below, we introduce a class of graph-intrinsic node features that encode graph topology information. We will use these features as the input for GNN models in order to produce geometrically informed node representations.

(^4 Formal definitions are provided in Appendix A.1.)
Definition 1 (Intrinsic Node Features).
A nonzero node feature matrix $X$ is $K$-intrinsic if, for any $v, w \in V$ such that $G(N_K(v))$ is isomorphic to $G(N_K(w))$ via an isometry $\phi$ with $\phi(v) = w$, we have $x_v = x_w$.
These intrinsic node features encode important local geometric information in the sense that if $x_v \ne x_w$, then the $K$-step neighborhoods of $v$ and $w$ must have different geometric structure. To further understand this definition, we observe that the degree vector, or any (elementwise) function of it, is a one-intrinsic node feature matrix. Setting $x_v$ to the average node degree of nodes in $N_K(v)$ yields $K$-intrinsic node features. As a slightly more complicated example, features with $x_v$ the number of triangles contained in $G(N_K(v))$ are also $K$-intrinsic. Fig. 4 illustrates how such node features can help distinguish nodes from different graph structures.
As a concrete example, consider the task of predicting traffic on a network of Wikipedia articles, such as the Chameleon dataset [36], with nodes and edges corresponding to articles and hyperlinks. Here, intrinsic node features provide insights into the topological context of each node. A high degree suggests a widely cited article, which is likely to have a lot of traffic, while counting cliques or considering average degrees from nearby nodes can shed light on the density of the local graph region.
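Two of the intrinsic feature maps mentioned above, node degrees and per-node triangle counts, can be computed directly from the adjacency matrix; the graph below is illustrative:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)         # one triangle on nodes 0, 1, 2

degrees = A.sum(axis=1)                           # 1-intrinsic feature
# diag(A^3) counts closed walks of length 3; each triangle at a node
# contributes two such walks (one per orientation).
triangles = np.diag(np.linalg.matrix_power(A, 3)) / 2

assert np.allclose(degrees, [2, 2, 3, 1])
assert np.allclose(triangles, [1, 1, 1, 0])
```

Both vectors depend only on the local graph structure, so relabeling the nodes of isomorphic neighborhoods leaves the feature values unchanged, as Definition 1 requires.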
The following theorem characterizes situations when GCN filters cannot discriminate nodes based on the information represented in intrinsic node features.
Theorem 1.
Let $v, w \in V$ and $K \in \mathbb{N}$ such that $G(N_K(v))$ is isomorphic to $G(N_K(w))$. Let $X$ be any $K$-intrinsic node feature matrix and let $H^{(L)}$ be the output of an $L$-layer GCN applied to $H^{(0)} = X$, with each layer as in Eq. 5. Then, for any $L \le K$, we have $h_v^{(L)} = h_w^{(L)}$. Therefore, the nodes $v$ and $w$ cannot be discriminated based on the filtered node features from an $L$-layer GCN.
Below, we provide a sketch of the main ideas of the proof. For a complete proof, please see Appendix A.2.
Proof Sketch.
Next, we introduce a notion of structural difference that allows us to analyze situations where scattering networks have strictly greater discriminatory power than GCNs.
Definition 2 (Structural Difference).
For two isomorphic induced subgraphs $G(N_k(v)) \simeq G(N_k(w))$ with isometry $\phi$, a structural difference relative to $\phi$ and a node feature matrix $X$ is manifested at $u \in N_k(v)$ if $x_u \ne x_{\phi(u)}$ or $d_u \ne d_{\phi(u)}$.^5 For , we further assume .

(^5 Note that the degree condition is only relevant for nodes in the boundary^4 because $d_u = d_{\phi(u)}$ for all interior^4 nodes $u$, as the degree is 1-intrinsic.)
Notably, if the step neighborhoods of two nodes are isomorphic and is any intrinsic node feature matrix, then no structural difference relative to and can be manifested at . Theorem 2 stated below shows that a scattering network will be able to produce different representations of two nodes and if there are structural differences in their surrounding neighborhoods, assuming (among a few other mild assumptions) that a certain pathological case is avoided. The next definition characterizes this pathological situation that arises, roughly speaking, when two sums and coincide, even though for all . We note that although this condition seems complicated, it is highly unlikely to occur in practice.
Definition 3 (Coincidental Correspondence).
Let and such that . Let be a node feature matrix, and let
We say that a node is subject to coincidental correspondence relative to and if is nonempty and
(19) 
We further say that a set of nodes is subject to coincidental correspondence if there exists at least one that is subject to coincidental correspondence.
Remark 1.
When assuming there is no coincidental correspondence relative to and on a set of nodes for some , we will implicitly assume this to also hold for diffused node features obtained from multiple diffusions via or cascades of wavelets and nonlinear activations with spatial support^{4} up to a radius of .
Theorem 2.
As in Theorem 1, let $v, w \in V$ and $K \in \mathbb{N}$ such that $G(N_K(v)) \simeq G(N_K(w))$, and consider any $K$-intrinsic node feature matrix $X$. Further assume that there exists at least one structural difference relative to $\phi$ and $X$ in $N_K(v)$, and let $k$ be the smallest positive integer such that a structural difference is manifested in $N_k(v)$. If the nonlinearity $\sigma$ is strictly monotonic and $N_k(v)$ is not subject to coincidental correspondence relative to $\phi$ and $X$, then one can define a scattering configuration such that scattering features defined as in Eq. 12 discriminate $v$ and $w$.
Theorem 2 provides a large class of situations where scattering filters are able to discriminate between two nodes even if GCN filters are not. In particular, even if the step neighborhoods of and are isomorphic, it is possible for there to exist a in this step neighborhood such that a intrinsic node feature takes different values at and . Theorem 2 shows that scattering will produce differing representations of these two nodes (i.e., and ), while Theorem 1 shows that GCN will not. The following two remarks discuss minor variations of Theorem 2.
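The contrast between the two theorems can be illustrated on a toy graph. The construction below is entirely hypothetical (the paper's graphs and symbols are elided here): on a path graph, two nodes are given identical 1-hop surroundings but different 2-hop surroundings, so a single GCN-style mean aggregation cannot separate them, while a diffusion wavelet of the common form P − P², built from a lazy random walk, can.

```python
import numpy as np

# Toy path graph 0-1-2-3-4-5 (hypothetical example, not from the paper).
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1
deg = A.sum(axis=1)

# Features chosen so nodes 1 and 3 look identical within one hop
# (same degree, same self feature, same neighbor sum) but differ
# two hops away: node 5 carries the outlier value 9.
x = np.array([1.0, 0.0, 2.0, 0.0, 1.0, 9.0])

# One GCN-style mean aggregation over the 1-hop neighborhood (incl. self).
A_hat = A + np.eye(6)
gcn = (A_hat @ x) / A_hat.sum(axis=1)
print(bool(np.isclose(gcn[1], gcn[3])))   # True: one GCN layer cannot tell them apart

# Lazy random-walk diffusion and the wavelet P - P^2, one common
# scattering-style construction, compares one- and two-step diffusions.
P = 0.5 * (np.eye(6) + A / deg[:, None])
psi = (P - P @ P) @ x
print(bool(np.isclose(psi[1], psi[3])))   # False: the wavelet discriminates them
```

The design choice here mirrors the theorem's setting: the structural difference sits outside the radius a single local averaging filter can reach, but inside the larger support of the band-pass wavelet.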
Remark 2.
The absolute value operation is not monotonic. Therefore, Theorem 2 cannot be directly applied in this case. However, the above result and all subsequent results can be extended to the case as long as certain pathological cases are avoided. Namely, Theorem 2 remains valid as long as the features at are assumed not to be the negatives of the features at , i.e., for nodes , and similarly to Remark 1, we must also assume that the diffused node features are not negatives of each other. We note that while this condition is complex, it is rarely violated in practice.
Remark 3.
We will provide a full sketch of the proof of Theorem 2 here and provide details in the appendix. The key to the proof will be applying Lemma 1 stated below. In order to explain these ideas, we must first introduce some notation that partitions the neighborhoods from Theorem 2 into a set of node layers parameterized by their minimal distance from .
Notation 1.
Let , , , such that . Assume that there exists at least one node in , where a structural difference is manifested rel. to and node features . We define
and we fix the following notations:

Let and define the node set . Note that these are exactly the nodes in where a structural difference is manifested relative to and .

Let be the number of paths of minimal length between and vertices in , and denote these by with for .

Let , and define the generalized path from to as .
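The layer decomposition in Notation 1 amounts to partitioning a node's neighborhood by shortest-path distance from the center. A breadth-first-search sketch of this partition follows; the function name, adjacency dictionary, and toy graph are all hypothetical, chosen only to make the layering concrete.

```python
from collections import deque

def distance_layers(adj, source, radius):
    """Partition nodes within `radius` of `source` into layers
    layers[r] = nodes at shortest-path distance exactly r (plain BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand past the neighborhood radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    layers = [[] for _ in range(radius + 1)]
    for v, d in dist.items():
        layers[d].append(v)
    return [sorted(layer) for layer in layers]

# Toy graph: path 0-1-2-3 plus a branch 1-4.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
print(distance_layers(adj, 0, 2))  # [[0], [1], [2, 4]]
```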
With this notation established, we may now state the following lemma which is critical to the proof of Theorem 2.
Lemma 1.
Let , , and such that . Assume there exists at least one node in where a structural difference is manifested relative to and to , and let be the smallest positive integer such that a structural difference is manifested in . Assume that is not subject to coincidental correspondence relative to and , and let be the generalized path as defined in Notation 1. Then, for , the nodes in are exactly the nodes in where a structural difference is manifested relative to and filtered node features .
Below, we provide a sketch of the main ideas of the proof. A complete proof is provided in Appendix A.3.
Proof Sketch.
The main idea of the proof is to use induction to show that at every step the matrix propagates the information about the structural difference one node layer closer towards .
By Notation 1(i), we have . Therefore, the case is immediate. In the inductive step, it suffices to show
for all and
for all under the assumption that these results are already valid for . This can be established by writing and using the inductive hypothesis together with the definition of .∎
We now use Lemma 1 to sketch the proof of Theorem 2. The complete proof is provided in Appendix A.4.
Proof Sketch for Theorem 2.
We need to show that we can choose the parameters in a way that guarantees . For simplicity, we set . In this case, since is strictly monotonic, and therefore injective, it suffices to show that we can construct such that
(20) 
Using binary expansion, we may choose , , such that and set . Given , we define truncated paths and let for . For , we let denote the empty path of length 0 and set
Recalling the generalized path defined in Notation 1, we will use induction to show that is equal to the set of nodes in such that a structural difference is manifested relative to and for . Since and , this will imply Eq. 20 and thus prove the theorem. Analogously to the base case in the proof of Lemma 1, the case follows from Notation 1(i). In the inductive step, it suffices to show that
for all and
for all , under the assumption that these results are true for . Since