1 Introduction and motivations
Graph neural networks (GNNs) Sperduti (1993); Goller and Kuchler (1996); Gori et al. (2005); Scarselli et al. (2008); Bruna et al. (2014); Defferrard et al. (2016); Kipf and Welling (2017); Battaglia et al. (2016, 2018) and in particular their Message Passing formulation (MPNN) Gilmer et al. (2017); Brandstetter et al. (2022) have become the standard ML tool for dealing with different types of relations and interactions, ranging from social networks to particle physics and drug design. One of the often cited drawbacks of traditional GNN models is their poor ‘explainability’, making it hard to know why and how they make certain predictions Ying et al. (2019); Yuan et al. (2020), and in which situations they may work and when they would fail. Limitations of GNNs that have attracted attention are oversmoothing Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020), oversquashing and bottlenecks Alon and Yahav (2021); Topping et al. (2022), and performance on heterophilic data Pei et al. (2020); Zhu et al. (2020); Chien et al. (2021); Bo et al. (2021); Yan et al. (2021) – where adjacent nodes usually have different labels.
Contributions.
We propose a Gradient Flow Framework where the GNN equations follow the direction of steepest descent of a learnable energy. Thanks to this framework we can (i) interpret GNNs as a multi-particle dynamics where the learned parameters determine pairwise attractive and repulsive potentials in the feature space. This sheds light on how GNNs can adapt to heterophily and explains their performance and the smoothness of the prediction. (ii) The framework leads to residual convolutional models where the channel-mixing is performed by a shared symmetric bilinear form $W$ inducing attraction and repulsion via its positive and negative eigenvalues, respectively. We theoretically investigate the interaction of the graph spectrum with the spectrum of the channel-mixing, proving that if there is more mass on the negative eigenvalues of $W$, then the dynamics is dominated by the graph high frequencies, which could be desirable on heterophilic graphs. We also extend results of Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020) by showing that when we drop the residual connection intrinsic to the gradient flow framework, graph convolutional models always induce a low-frequency-dominated dynamics, independent of the sign and magnitude of the spectrum of the channel-mixing. We also discuss how simple choices make common architectures fit our framework, and conduct thorough ablation studies to corroborate the theoretical analysis on the role of the spectrum of $W$. (iii) We crystallize an instance of our framework into a linear, residual, convolutional model that retains explainability and achieves competitive performance on homophilic and heterophilic real-world graphs whilst being faster than GCN.
Related work.
Our analysis is related to investigating GNNs as filters on the graph spectrum Defferrard et al. (2016); Hammond et al. (2019); Balcilar et al. (2020); He et al. (2021) and studying the oversmoothing effect Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020), and partly adopts techniques similar to Oono and Suzuki (2020). The key difference is that we also consider the spectrum of the ‘channel-mixing’ matrix. The concept of gradient flows has been a standard tool in physics and differential geometry Eells and Sampson (1964), from which they were adopted for image processing Kimmel et al. (1997), and more recently also used in ML Sander et al. (2022) for the analysis of Transformers Vaswani et al. (2017). Our continuous-time evolution equations follow the spirit of neural ODEs Haber and Ruthotto (2018); Chen et al. (2018); Biloš et al. (2021) and the study of GNNs as continuous dynamical systems Xhonneux et al. (2020); Chamberlain et al. (2021a); Eliasof et al. (2021); Chamberlain et al. (2021b), and more generally resonate with the geometric deep learning blueprint Bronstein et al. (2021); Topping et al. (2022); Bodnar et al. (2022); Di Giovanni et al. (2022).
Outline.
In Section 2, we review the continuous and discrete Dirichlet energy and the associated gradient flow framework. We formalize the notions of oversmoothing and of low/high-frequency-dominated dynamics to investigate GNNs and study the dominant components in their evolution. We extend the graph Dirichlet energy to allow for a nontrivial norm for the feature edge-gradient. This leads to gradient flow equations that diffuse the features and oversmooth in the limit. Accordingly, in Section 3 we introduce a more general energy with a symmetric channel-mixing matrix $W$ giving rise to attractive and repulsive pairwise terms via its positive and negative eigenvalues, and show that the negative spectrum can induce a high-frequency-dominant dynamics. In Section 4 we first compare our framework with continuous GNN models, and we then discretize the equations and provide a ‘recipe’ for making standard GNN architectures fit a gradient flow framework. We adapt the spectral analysis to discrete time, showing that gradient flow convolutional models can generate a dynamics dominated by the high frequencies via the negative eigenvalues of $W$, while this is impossible if we drop the residual connection. In Section 5 we corroborate our theoretical analysis on the role of the spectrum of $W$ via ablation studies on graphs with varying homophily. Experiments on real-world datasets show a competitive performance of our model despite its simplicity and reduced number of parameters.
2 Gradient-flow formalism
Notations adopted throughout the paper.
Let $\mathsf{G} = (\mathsf{V}, \mathsf{E})$ be an undirected graph with $n$ nodes. We denote by $F \in \mathbb{R}^{n \times d}$ the matrix of $d$-dimensional node features, by $F_i$ its $i$-th row (transposed), by $f^r$ its $r$-th column, and by $\mathrm{vec}(F) \in \mathbb{R}^{nd}$ the vectorization of $F$ obtained by stacking its columns. Given a symmetric matrix $B$, we let $\lambda^{B}_{+}$ and $\lambda^{B}_{-}$ denote its most positive and most negative eigenvalues, respectively, and $\rho_{B}$ be its spectral radius. If $B \succeq 0$, then $\lambda^{B}_{\min}$ denotes the smallest positive eigenvalue of $B$. $\dot F$ denotes the temporal derivative, $\otimes$ is the Kronecker product, and ‘a.e.’ means almost every w.r.t. the Lebesgue measure and usually refers to data in the complement of some lower-dimensional subspace. Proofs and additional results appear in the Appendix.
Starting point: a geometric parallelism.
To motivate a gradient-flow approach for GNNs, we start from the continuous case (see Section A.1 for details). Consider a smooth map $f:\Omega\to\mathbb{R}^d$ with a constant metric in $\mathbb{R}^d$ represented by a symmetric matrix $H \succeq 0$. The Dirichlet energy of $f$ is defined by

$\mathcal{E}^{\mathrm{Dir}}(f) = \frac{1}{2}\int_\Omega \|\nabla f(x)\|_H^2\,dx \qquad (1)$
and measures the ‘smoothness’ of $f$. A natural approach to finding minimizers of $\mathcal{E}^{\mathrm{Dir}}$ – called harmonic maps – was introduced in Eells and Sampson (1964) and consists in studying the gradient flow of $\mathcal{E}^{\mathrm{Dir}}$, wherein a given map $f(0)$ is evolved according to $\dot f(t) = -\nabla_f\,\mathcal{E}^{\mathrm{Dir}}(f(t))$. This type of evolution equation has historically been at the core of variational and PDE-based image processing; in particular, gradient flows of the Dirichlet energy were shown Kimmel et al. (1997) to recover the Perona–Malik nonlinear diffusion Perona and Malik (1990).
Motivation: GNNs for node classification.
We wish to extend the gradient flow formalism to node classification on graphs. Assume we have a graph $\mathsf{G}$, node features, and labels on a subset of the nodes, and that we want to predict the labels on the remaining nodes. A GNN typically evolves the features via some parametric rule, $F(t+1) = \mathrm{GNN}_\theta(F(t), t)$, and uses a decoding map for the prediction. In graph convolutional models Defferrard et al. (2016); Kipf and Welling (2017), the update consists of two operations: applying a shared linear transformation to the features (‘channel mixing’) and propagating them along the edges of the graph (‘diffusion’). Our goal consists in studying when the evolution of the features is the gradient flow of some parametric class of energies $\mathcal{E}_\theta$, which generalize the Dirichlet energy. This means that the parameters can be interpreted as ‘finding the right notion of smoothness’ for our task. We evolve the features by $\dot F(t) = -\nabla_F\,\mathcal{E}_\theta(F(t))$, with prediction $\psi(F(T))$ for some optimal time $T$.
Why a gradient flow? Since $\frac{d}{dt}\mathcal{E}_\theta(F(t)) = -\|\nabla_F\,\mathcal{E}_\theta(F(t))\|^2 \le 0$, the energy dissipates along the gradient flow. Accordingly, this framework allows us to explain the GNN dynamics as flowing the node features in the direction of steepest descent of $\mathcal{E}_\theta$. Indeed, we find that parametrizing an energy leads to equations governed by attractive and repulsive forces that can be controlled via the spectrum of symmetric ‘channel-mixing’ matrices. This shows that by learning to distribute more mass over the negative (positive) eigenvalues of the channel-mixing, gradient flow models can generate dynamics dominated by the higher (respectively, lower) graph frequencies and hence tackle different homophily scenarios. The gradient flow framework also leads to sharing of the weights across layers (since we parametrize the energy rather than the evolution equations, as usually done in GNNs), allowing us to reduce the number of parameters without compromising performance (see Table 1).
Analysis on graphs: preliminaries.
Given a connected graph $\mathsf{G}$ with self-loops, its adjacency matrix $A$ is defined as $a_{ij} = 1$ if $(i,j)\in\mathsf{E}$ and zero otherwise. We let $D = \mathrm{diag}(d_1,\dots,d_n)$ be the degree matrix and write $\bar A := D^{-1/2} A D^{-1/2}$. Let $F \in \mathbb{R}^{n\times d}$ be the matrix representation of a signal. Its graph gradient is $(\nabla F)_{ij} := F_j/\sqrt{d_j} - F_i/\sqrt{d_i}$. We define the Laplacian as $\Delta := \frac{1}{2}\,\mathrm{div}(\nabla)$ (the divergence div is the adjoint of $\nabla$), represented by $\Delta = I - \bar A$. We refer to the eigenvalues $\{\lambda_\ell\}$ of $\Delta$ as frequencies: the lowest frequency is always $\lambda_0 = 0$ while the highest frequency is $\lambda_{n-1} \le 2$ Chung and Graham (1997). As for the continuum case, the gradient allows us to define a (graph) Dirichlet energy as Zhou and Schölkopf (2005)

$\mathcal{E}^{\mathrm{Dir}}(F) := \frac{1}{4}\sum_i\sum_{j:(i,j)\in\mathsf{E}} \|(\nabla F)_{ij}\|^2 \qquad (2)$
where the extra factor $\frac{1}{2}$ is for convenience. As for manifolds, $\mathcal{E}^{\mathrm{Dir}}$ measures smoothness. If we stack the columns of $F$ into $\mathrm{vec}(F)$, the gradient flow of $\mathcal{E}^{\mathrm{Dir}}$ yields the heat equation on each channel:
(3) 
for $1 \le r \le d$. Similarly to Cai and Wang (2020), we rely on $\mathcal{E}^{\mathrm{Dir}}$ to assess whether a given dynamics is a smoothing process. A different choice of Laplacian, with non-normalized adjacency, induces an analogous Dirichlet energy. Throughout this paper, we rely on the following definitions (see Section A.3 for further equivalent formulations and justifications):
Definition 2.1.
$F(t)$ initialized at $F(0)$ is smoothing if $\mathcal{E}^{\mathrm{Dir}}(F(t)) \le c\,\mathcal{E}^{\mathrm{Dir}}(F(0))$, with $c$ a constant only depending on $\mathsf{G}$ and $F(0)$. Oversmoothing occurs if either $\mathcal{E}^{\mathrm{Dir}}(F(t)) \to 0$ or $\mathcal{E}^{\mathrm{Dir}}(F(t)/\|F(t)\|) \to 0$ for $t\to\infty$.
Our notion of ‘oversmoothing’ is a relaxed version of the definition in Rusch et al. (2022) – although in the linear case one always finds an exponential decay of $\mathcal{E}^{\mathrm{Dir}}$. We note that $\mathcal{E}^{\mathrm{Dir}}(F) = 0$ iff $\nabla f^r = 0$ for each column $f^r$. As in Oono and Suzuki (2020), this corresponds to a loss of separation power along the solution, where nodes with equal degree become indistinguishable since we converge to $\ker(\Delta) = \mathrm{span}\{D^{1/2}\mathbf{1}\}$ (if we replaced $\Delta$ with the unnormalized Laplacian $D - A$ then we would not even be able to separate nodes with different degrees in the limit).
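The smoothing behaviour of the heat flow in eq. 3 can be checked numerically. Below is a minimal NumPy sketch; the 5-node path graph, feature dimension, and step size are hypothetical stand-ins rather than anything prescribed by the paper.

```python
import numpy as np

# Hypothetical toy graph: a 5-node path with self-loops (as the text assumes).
n, d = 5, 3
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
A += np.eye(n)                              # self-loops
deg = A.sum(axis=1)
A_bar = A / np.sqrt(np.outer(deg, deg))     # normalized adjacency D^{-1/2} A D^{-1/2}
Delta = np.eye(n) - A_bar                   # normalized Laplacian

def dirichlet(F):
    """Graph Dirichlet energy: (1/4) * sum over ordered pairs (i,j) in E of
    ||F_j / sqrt(d_j) - F_i / sqrt(d_i)||^2 (equals 0.5 * tr(F^T Delta F))."""
    G = F / np.sqrt(deg)[:, None]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if A[i, j] > 0 and i != j:
                total += 0.25 * np.sum((G[j] - G[i]) ** 2)
    return total

# Explicit Euler steps of the gradient flow F' = -Delta F (heat equation, eq. 3):
# the Dirichlet energy decreases monotonically along the flow.
rng = np.random.default_rng(0)
F = rng.standard_normal((n, d))
tau = 0.1
energies = [dirichlet(F)]
for _ in range(50):
    F = F - tau * Delta @ F
    energies.append(dirichlet(F))
```

Since the step size satisfies $|1-\tau\lambda| \le 1$ for all frequencies $\lambda \in [0,2]$, every spectral component is non-expanding and the energy sequence is non-increasing.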
To motivate the next definition, consider $\dot F(t) = \bar A F(t)$. Despite $F(t)$ being unbounded for a.e. $F(0)$, the low-frequency components are growing the fastest and indeed there exists $F_\infty$ with $\mathcal{E}^{\mathrm{Dir}}(F_\infty) = 0$ s.t. $F(t)/\|F(t)\| \to F_\infty$ for $t\to\infty$. We formalize this scenario – including the opposite case of high-frequency components being dominant – by studying $\mathcal{E}^{\mathrm{Dir}}(F(t)/\|F(t)\|)$, i.e. the Rayleigh quotient of $\Delta$.
Definition 2.2.
$F(t)$ initialized at $F(0)$ is Low/High-Frequency-Dominant (LFD/HFD) if $\mathcal{E}^{\mathrm{Dir}}(F(t)/\|F(t)\|) \to 0$ (respectively, $\mathcal{E}^{\mathrm{Dir}}(F(t)/\|F(t)\|) \to \lambda_{n-1}/2$) for $t\to\infty$.
We report a consequence of Definition 2.2 and refer to Section A.3 for additional details and motivations for the characterizations of LFD and HFD.
Lemma 2.3.
$F(t)$ is LFD (HFD) iff for each $t_j\to\infty$ there exist a subsequence $t_{j_k}\to\infty$ and $F_\infty$ s.t. $F(t_{j_k})/\|F(t_{j_k})\| \to F_\infty$ and $\Delta F_\infty = 0$ ($\Delta F_\infty = \lambda_{n-1}F_\infty$, respectively).
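Definition 2.2 can be illustrated numerically: a smoothing flow drives the Rayleigh quotient of the Laplacian to 0 (LFD), while a sharpening flow drives it to the top of the spectrum (HFD). The star graph, step size, and iteration count below are hypothetical choices for illustration only.

```python
import numpy as np

# Toy 6-node star graph with self-loops (hypothetical example).
n, tau = 6, 0.2
A = np.zeros((n, n))
A[0, 1:] = A[1:, 0] = 1.0
A += np.eye(n)
deg = A.sum(axis=1)
Delta = np.eye(n) - A / np.sqrt(np.outer(deg, deg))
lam = np.linalg.eigvalsh(Delta)             # graph frequencies, lam[0] = 0

def rayleigh(f):
    """Rayleigh quotient of the Laplacian, the quantity tracked in Definition 2.2."""
    return float(f @ Delta @ f) / float(f @ f)

rng = np.random.default_rng(1)
f_low = rng.standard_normal(n)
f_high = f_low.copy()
for _ in range(400):
    f_low = f_low - tau * Delta @ f_low     # smoothing: low frequencies dominate (LFD)
    f_low /= np.linalg.norm(f_low)
    f_high = f_high + tau * Delta @ f_high  # sharpening: high frequencies dominate (HFD)
    f_high /= np.linalg.norm(f_high)
```

The normalized iterates behave like power iteration: the LFD flow converges to the kernel of $\Delta$ and the HFD flow to the eigenspace of the largest frequency.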
If a graph is homophilic, adjacent nodes are likely to share the same label and we expect a smoothing dynamics, or one enhancing the low-frequency components, to be successful at node classification tasks Wu et al. (2019); Klicpera et al. (2019). In the opposite case of heterophily, the high-frequency components might contain more relevant information for separating classes Bo et al. (2021); Bodnar et al. (2022) – the prototypical example being the eigenvector of $\Delta$ associated with the largest frequency, which separates the two sides of a regular bipartite graph. In other words, the class of heterophilic graphs contains instances where signals should be sharpened by increasing $\mathcal{E}^{\mathrm{Dir}}$ rather than smoothed out. Accordingly, an ideal framework for learning on graphs must accommodate both of these opposite scenarios by being able to induce either an LFD or an HFD dynamics.
Parametric Dirichlet energy: channel-mixing as metric in feature space.
In eq. 1 a constant nontrivial metric in $\mathbb{R}^d$ leads to the mixing of the feature channels. We adapt this idea by considering a symmetric positive semi-definite matrix $W \in \mathbb{R}^{d\times d}$ with $W = \Theta^\top\Theta$ and using it to generalize $\mathcal{E}^{\mathrm{Dir}}$ as

$\mathcal{E}^{\mathrm{Dir}}_W(F) := \frac{1}{4}\sum_i\sum_{j:(i,j)\in\mathsf{E}} \|(\nabla(F\Theta^\top))_{ij}\|^2 \qquad (4)$
We note the analogy with eq. 1, where the sum over the nodes replaces the integration over the domain and the derivative at a point is replaced by the gradient along the edge $(i,j)$. We generally treat $W$ as learnable weights and study the gradient flow of $\mathcal{E}^{\mathrm{Dir}}_W$:
$\dot F(t) = -\nabla_F\,\mathcal{E}^{\mathrm{Dir}}_W(F(t)) = -\Delta F(t)\,W \qquad (5)$
We see that eq. 5 generalizes eq. 3. Below ‘smoothing’ is intended as in Definition 2.1.
Proposition 2.4.
Proposition 2.4 implies that no weight matrix $W$ in eq. 5 can separate the limit embeddings of nodes with the same degree and input features. If $W$ has a trivial kernel, then nodes with the same degree converge to the same representation and oversmoothing occurs as per Definition 2.1. Differently from Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020), oversmoothing occurs independently of the spectral radius of the ‘channel-mixing’ if its eigenvalues are positive – even for equations which lead to residual GNNs when discretized Chen et al. (2018). According to Proposition 2.4, we do not expect eq. 5 to succeed on heterophilic graphs, where smoothing processes are generally harmful – this is confirmed in Figure 2 (see the prod curve). To remedy this problem, we generalize eq. 5 to a gradient flow that can be HFD as per Definition 2.2.
3 A general parametric energy for pairwise interactions
We extend the Dirichlet energy associated with $W$ to an energy accounting for mutual – possibly repulsive – interactions in feature space. We first rewrite the energy in eq. 4 as
$\mathcal{E}^{\mathrm{Dir}}_W(F) = \frac{1}{2}\sum_i \langle F_i, W F_i\rangle - \frac{1}{2}\sum_{i,j} \bar a_{ij}\,\langle F_i, W F_j\rangle, \qquad \bar a_{ij} := a_{ij}/\sqrt{d_i d_j} \qquad (6)$
Replacing the two occurrences of $W$ with distinct symmetric matrices $\Omega$ and $W$ leads to
$\mathcal{E}^{\mathrm{tot}}(F) := \frac{1}{2}\sum_i \langle F_i, \Omega F_i\rangle - \frac{1}{2}\sum_{i,j} \bar a_{ij}\,\langle F_i, W F_j\rangle \qquad (7)$
with associated gradient flow of the form (see Appendix B)
$\dot F(t) = -\nabla_F\,\mathcal{E}^{\mathrm{tot}}(F(t)) = -F(t)\,\Omega + \bar A F(t)\,W \qquad (8)$
Note that eq. 8 is the gradient flow of some energy iff both $\Omega$ and $W$ are symmetric.
A multiparticle system point of view: attraction vs repulsion.
Consider the $d$-dimensional node features as particles in $\mathbb{R}^d$ with energy $\mathcal{E}^{\mathrm{tot}}$. While the $\Omega$-term is independent of the graph topology and represents an external field in the feature space, the second term constitutes a potential energy, with the bilinear form $W$ determining the pairwise interactions of adjacent node representations. Given a symmetric $W$, we write $W = \Theta_+^\top\Theta_+ - \Theta_-^\top\Theta_-$, by decomposing the spectrum of $W$ into positive and negative values. We can rewrite $\mathcal{E}^{\mathrm{tot}}$ as a balance of attractive and repulsive terms, i.e.
$\mathcal{E}^{\mathrm{tot}}(F) = \frac{1}{2}\sum_i \langle F_i, (\Omega - W) F_i\rangle + \frac{1}{4}\sum_{i,j}\|(\nabla(F\Theta_+^\top))_{ij}\|^2 - \frac{1}{4}\sum_{i,j}\|(\nabla(F\Theta_-^\top))_{ij}\|^2 \qquad (9)$
The gradient flow of $\mathcal{E}^{\mathrm{tot}}$ minimizes the attractive term and maximizes the repulsive one. The matrix $W$ encodes repulsive pairwise interactions via its negative-definite component, which leads to terms increasing along the solution. The latter affords a ‘sharpening’ effect desirable on heterophilic graphs, where we need to disentangle adjacent node representations and hence ‘magnify’ the edge-gradient.
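The splitting of a symmetric channel-mixing into attractive and repulsive components can be computed explicitly from its eigendecomposition; a minimal sketch (the 4×4 matrix is a random stand-in):

```python
import numpy as np

# Split a symmetric channel-mixing W into an attractive (PSD) and a repulsive
# (NSD) part: W = Theta_plus^T Theta_plus - Theta_minus^T Theta_minus.
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
W = (B + B.T) / 2                        # symmetric, indefinite in general

mu, Psi = np.linalg.eigh(W)              # W = Psi @ diag(mu) @ Psi.T
pos, neg = mu > 0, mu < 0
Theta_plus = np.sqrt(mu[pos])[:, None] * Psi[:, pos].T     # rows scaled by sqrt(+eigs)
Theta_minus = np.sqrt(-mu[neg])[:, None] * Psi[:, neg].T   # rows scaled by sqrt(-eigs)

# W is recovered as the difference of the two semidefinite parts.
W_rebuilt = Theta_plus.T @ Theta_plus - Theta_minus.T @ Theta_minus
```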
Spectral analysis of the channel-mixing.
We will now show that eq. 8 can lead to an HFD dynamics. To this end, we assume that $\Omega = 0$, so that eq. 8 becomes $\dot F(t) = \bar A F(t)\,W$. According to eq. 9 the negative eigenvalues of $W$ lead to repulsion. We show that the latter can induce an HFD dynamics as per Definition 2.2. We let $P_{\lambda_{n-1}}$ be the orthogonal projection into the eigenspace of $\Delta$ associated with the eigenvalue $\lambda_{n-1}$.
Proposition 3.1.
If $|\lambda^W_-|\,(\lambda_{n-1}-1) > \lambda^W_+$, then $\dot F(t) = \bar A F(t)\,W$ is HFD for a.e. $F(0)$: we have $\mathcal{E}^{\mathrm{Dir}}(F(t)/\|F(t)\|) \to \lambda_{n-1}/2$
and $F(t)/\|F(t)\|$ converges to $F_\infty$ such that $\Delta F_\infty = \lambda_{n-1} F_\infty$, for $t\to\infty$.
Proposition 3.1 shows that if enough mass of the spectrum of the ‘channel-mixing’ $W$ is distributed over the negative eigenvalues, then the evolution is dominated by the graph high frequencies. This analysis is made possible in our gradient flow framework, where $W$ must be symmetric. The HFD dynamics induced by negative eigenvalues of $W$ is confirmed in Figure 2 (neg-prod curve in the bottom chart).
A more general energy.
Equations with a source term may have better expressive power Xhonneux et al. (2020); Chen et al. (2020); Thorpe et al. (2021). In our framework this means adding an extra energy term of the form $-\beta\sum_i \langle F_i, \tilde W F_i(0)\rangle$ to eq. 7, with some learnable $\beta$ and symmetric $\tilde W$. This leads to the following gradient flow:
$\dot F(t) = -F(t)\,\Omega + \bar A F(t)\,W + \beta F(0)\tilde W \qquad (10)$
We also observe that one could replace the fixed matrix $\bar A$ with a more general symmetric graph vector field $\mathcal{A}$ satisfying $\mathcal{A}_{ij} = 0$ if $(i,j)\notin\mathsf{E}$. This in particular includes the case where $\mathcal{A}$ is learned based on the initial encoding via an attention mechanism Vaswani et al. (2017); Veličković et al. (2018). In this case, the pairwise energy generalizes to
$\mathcal{E}^{\mathcal{A}}(F) := \frac{1}{2}\sum_i \langle F_i, \Omega F_i\rangle - \frac{1}{2}\sum_{i,j} \mathcal{A}_{ij}\,\langle F_i, W F_j\rangle \qquad (11)$
Since in the experiments we have not observed improvements from learning $\mathcal{A}$, and this option does make the model slower, we stick to the special choice $\mathcal{A} = \bar A$. We also note that when $\Omega = W$, then eq. 8 becomes $\dot F(t) = -\Delta F(t)\,W$. We perform a spectral analysis of this case in Section B.2.
Nonlinear activations.
The inner products in the formulation of $\mathcal{E}^{\mathrm{tot}}$ can be combined with nonlinear activations, i.e. terms of the form $\sigma(\langle F_i, W F_j\rangle)$ (the gradient flow derivation is reported in Section B.3).
While nonlinear activations offer greater expressive power, the analysis presented thus far and the following comparisons with existing GNN models are investigated in the linear case for a few reasons. First, in the spirit of Wu et al. (2019); Oono and Suzuki (2020); Chen et al. (2020), by dropping nonlinear maps we can perform spectral analysis in closed form. Second, in all our experiments we have seen no gain in performance when including nonlinear activations. Third, we can always ‘push the nonlinear maps’ in either the encoding block or the decoding one without affecting the linear gradient flow as discussed in Section 4.
4 Comparison with GNNs
In this Section, we study standard GNN models from the perspective of our gradient flow framework.
4.1 Continuous case
Continuous GNN models replace layers with continuous time. In contrast with Proposition 3.1, we show that three main linearized continuous GNN models are either smoothing or LFD as per Definition 2.2. The linearized PDE-GCN model Eliasof et al. (2021) corresponds to choosing $\Omega = W = \theta_t^\top\theta_t$ and $\beta = 0$ in eq. 10, for some time-dependent family $\{\theta_t\}$:

$\dot F(t) = -\Delta F(t)\,\theta_t^\top\theta_t.$

The CGNN model Xhonneux et al. (2020) can be derived from eq. 10 by setting the channel-mixing of the graph term to the identity, $\Omega = I - W$ and $\tilde W = I$:

$\dot F(t) = -\Delta F(t) + F(t)\,W + F(0).$

Finally, in linearized GRAND Chamberlain et al. (2021a) a row-stochastic matrix $\mathcal{A}$ is learned from the encoding via an attention mechanism and we have

$\dot F(t) = (\mathcal{A} - I)F(t).$

We note that if $\mathcal{A}$ is not symmetric, then GRAND is not a gradient flow.
Proposition 4.1.
PDE-GCN, CGNN and GRAND satisfy the following:
(i) PDE-GCN is a smoothing model: $t \mapsto \mathcal{E}^{\mathrm{Dir}}(F(t))$ is decreasing.
(ii) For a.e. $F(0)$ it holds: CGNN is never HFD, and if we remove the source term, then CGNN is LFD.
(iii) If $\mathsf{G}$ is connected, GRAND satisfies $F(t) \to F_\infty$ as $t\to\infty$, with all rows of $F_\infty$ equal.
By (ii), the source-free CGNN evolution is LFD independently of $W$. Moreover, by (iii), oversmoothing occurs for GRAND as per Definition 2.1. On the other hand, Proposition 3.1 shows that the negative eigenvalues of $W$ can make the source-free gradient flow in eq. 8 HFD. Experiments in Section 5 confirm that the gradient flow model outperforms CGNN and GRAND on heterophilic graphs.
4.2 Discrete case
We now describe a discrete version of our gradient flow model and compare it to ‘discrete’ GNNs, where discrete time steps correspond to different layers. In the spirit of Chen et al. (2018), we use the explicit Euler scheme with step size $\tau$ to solve eq. 10. In the gradient flow framework we parametrize the energy rather than the actual equations, which leads to symmetric channel-mixing matrices that are shared across the layers. Since the matrices are square, an encoding block is used to process input features and generally reduce the input dimension $p$ to the hidden dimension $d$. Moreover, the iterations inherently lead to a residual architecture because of the explicit Euler discretization:
$F(t+\tau) = F(t) + \tau\big({-F(t)\,\Omega} + \bar A F(t)\,W + \beta F(0)\tilde W\big) \qquad (12)$
with prediction produced by a decoder $\psi_{\mathrm{DE}}:\mathbb{R}^{n\times d}\to\mathbb{R}^{n\times k}$, where $k$ is the number of label classes, and integration time of the form $T = m\tau$, so that $m$ represents the number of layers. Although eq. 12 is linear, we can include nonlinear activations in the encoder and decoder maps. Until this point, we have considered equations that minimize a multi-particle energy. We now study when GNNs minimize an energy of this kind.
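A minimal sketch of the resulting architecture follows: linear encoder, shared symmetric matrices reused at every residual Euler step of eq. 12, and a linear decoder. The graph, all dimensions, and the random parameters are hypothetical stand-ins, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d, k = 10, 16, 8, 3                     # nodes, input dim, hidden dim, classes

# Random undirected graph with self-loops (hypothetical).
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T + np.eye(n)
deg = A.sum(axis=1)
A_bar = A / np.sqrt(np.outer(deg, deg))

# Shared symmetric channel-mixing matrices: we parametrize the energy, not the layers.
Omega = np.diag(rng.standard_normal(d))
W0 = rng.standard_normal((d, d))
W = (W0 + W0.T) / 2
W_tilde = np.diag(rng.standard_normal(d))
beta = 1.0

enc = rng.standard_normal((p, d))             # encoder (single linear layer)
dec = rng.standard_normal((d, k))             # decoder (single linear layer)

X = rng.standard_normal((n, p))               # raw input features
F0 = X @ enc
F, tau, layers = F0.copy(), 0.25, 8
for _ in range(layers):                       # same weights reused at every layer
    F = F + tau * (-F @ Omega + A_bar @ F @ W + beta * F0 @ W_tilde)
Y = F @ dec                                   # per-node class logits
```

Because only the encoder, decoder, and three shared matrices are learned, the parameter count does not grow with depth.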
Are discrete GNNs gradient flows?
Given a (learned) symmetric graph vector field $\mathcal{A}$ satisfying $\mathcal{A}_{ij} = 0$ if $(i,j)\notin\mathsf{E}$, consider a family of linear GNNs with shared weights of the form
$F(t+1) = F(t) - F(t)\,\Omega + \mathcal{A}F(t)\,W \qquad (13)$
Symmetry is the key requirement to interpret GNNs in eq. 13 in a gradient flow framework.
Lemma 4.2.
Equation 13 is the unit step size discrete gradient flow of $\mathcal{E}^{\mathcal{A}}$, with $\mathcal{E}^{\mathcal{A}}$ defined in eq. 11, iff $\Omega$ and $W$ are symmetric.
Lemma 4.2 provides a recipe for turning standard architectures into a gradient flow, with symmetry being the key requirement. When eq. 13 is a gradient flow, the underlying GNN dynamics becomes explainable in terms of minimizing a multi-particle energy by learning attractive and repulsive directions in feature space, as discussed in Section 3. In Section C.2, we show how Lemma 4.2 covers linear versions of GCN Kipf and Welling (2017); Wu et al. (2019), GAT Veličković et al. (2018), GraphSAGE Hamilton et al. (2017) and GCNII Chen et al. (2020), to name a few.
Oversmoothing analysis in the discrete setting.
By Proposition 3.1 we know that the continuous version of eq. 12 can be HFD thanks to the negative eigenvalues of $W$. The next result represents a discrete counterpart of Proposition 3.1 and shows that residual, symmetrized graph convolutional models can be HFD. Below, $P_{\lambda_{n-1}}$ is the projection into the eigenspace associated with the eigenvalue $\lambda_{n-1}$, and we report the explicit value of $\delta$ in eq. 29 in Section C.3. We let:
$|\lambda^W_-|\,(\lambda_{n-1}-1) > \lambda^W_+ + \delta \qquad (14)$
Theorem 4.3.
Given $F(t+\tau) = F(t) + \tau\bar A F(t)\,W$, with $W$ symmetric, if eq. 14 holds then $\mathcal{E}^{\mathrm{Dir}}(F(t)/\|F(t)\|) \to \lambda_{n-1}/2$,
hence the dynamics is HFD for a.e. $F(0)$, and in fact there exists $F_\infty$ s.t. $F(t)/\|F(t)\| \to F_\infty$ with $\Delta F_\infty = \lambda_{n-1}F_\infty$. Conversely, if $\mathsf{G}$ is not bipartite, then for a.e. $F(0)$ the system $F(t+\tau) = \tau\bar A F(t)\,W$, with $W$ symmetric, is LFD independent of the spectrum of $W$.
Theorem 4.3 shows that linear discrete gradient flows can be HFD due to the negative eigenvalues of $W$. This differs from statements that standard GCNs act as low-pass filters and thus oversmooth in the limit. Indeed, in these cases the spectrum of $W$ is generally ignored Wu et al. (2019); Chen et al. (2020) or required to be sufficiently small in terms of singular value decomposition Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020), when no residual connection is present. On the other hand, Theorem 4.3 emphasizes that the spectrum of $W$ plays a key role in enhancing the high frequencies when enough mass is distributed over the negative eigenvalues, provided that a residual connection exists – this is confirmed by the neg-prod curve in Figure 2.
The residual connection from a spectral perspective.
Given a sufficiently small step size so that the right-hand side of inequality 14 is satisfied, the dynamics is HFD for a.e. $F(0)$ if there is enough mass in the negative spectrum of $W$, i.e. ‘more mass’ on the negative eigenvalues than on the positive ones. This means that, differently from Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020), there is no requirement on the minimal magnitude of the spectral radius of $W$ coming from the graph topology, as long as $\tau$ is small enough. Conversely, without a residual term, the dynamics is LFD for a.e. $F(0)$, independently of the sign and magnitude of the eigenvalues of $W$. This is also confirmed by the GCN curve in Figure 2.
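The contrast drawn here can be isolated in a one-channel toy experiment, where the channel-mixing collapses to a negative scalar so that all of its spectral mass is negative; the graph and constants below are hypothetical choices for illustration.

```python
import numpy as np

# 6-node star with self-loops (hypothetical); the channel-mixing is the scalar w < 0.
n, tau, w = 6, 0.2, -1.0
A = np.zeros((n, n))
A[0, 1:] = A[1:, 0] = 1.0
A += np.eye(n)
deg = A.sum(axis=1)
A_bar = A / np.sqrt(np.outer(deg, deg))
Delta = np.eye(n) - A_bar
lam = np.linalg.eigvalsh(Delta)                 # graph frequencies

def rayleigh(f):
    return float(f @ Delta @ f) / float(f @ f)

rng = np.random.default_rng(4)
f_res = rng.standard_normal(n)
f_nores = f_res.copy()
for _ in range(600):
    f_res = f_res + tau * w * (A_bar @ f_res)   # residual step: high frequencies win
    f_res /= np.linalg.norm(f_res)
    f_nores = w * (A_bar @ f_nores)             # no residual: low frequencies win
    f_nores /= np.linalg.norm(f_nores)
```

With the residual connection the negative scalar amplifies the top frequency (the Rayleigh quotient approaches the largest eigenvalue of the Laplacian), whereas without it the dynamics is dominated by the lowest frequency regardless of the sign of the channel-mixing.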
Gradient flow as spectral GNNs.
We finally discuss eq. 12 from the perspective of spectral GNNs, as in Balcilar et al. (2020). Let us assume that $\Omega = 0$ and $\beta = 0$. We can write eq. 12 as (see Section C.3)
$F(t+\tau) = F(t) + \tau\,\Phi(I-\Lambda)\Phi^\top F(t)\,W \qquad (15)$
with $\Delta = \Phi\Lambda\Phi^\top$ the eigendecomposition of the graph Laplacian. Namely, if we let $\{\lambda_\ell\}$ be the spectrum of $\Delta$ with associated orthonormal basis of eigenvectors $\{\phi_\ell\}$, and we introduce $\{\mu_r\}$ defined by the eigendecomposition $W = \Psi\,\mathrm{diag}(\mu_1,\dots,\mu_d)\,\Psi^\top$, then we can rephrase eq. 15 as the system
$z_{\ell r}(t+\tau) = \big(1 + \tau\,\mu_r(1-\lambda_\ell)\big)\,z_{\ell r}(t), \qquad z_{\ell r}(t) := \langle \phi_\ell, F(t)\,\psi_r\rangle \qquad (16)$
Accordingly, for each projection into the $\ell$-th eigenvector of $\Delta$, we have a spectral function in the graph frequency domain given by $h_{\mu_r}(\lambda) = 1 + \tau\mu_r(1-\lambda)$. If $\mu_r > 0$ we have a low-pass filter, while if $\mu_r < 0$ we have a high-pass filter. Moreover, we see that along the eigenvectors of $W$, if $\mu_r < 0$ then the dynamics is equivalent to flipping the sign of the edge weights, which offers a direct comparison with methods proposed in Bo et al. (2021); Yan et al. (2021), where some ‘attentive’ mechanism is proposed to learn negative edge weights based on feature information. The very same procedure of changing the sign of the edge weights – which is equivalent to reversing the time orientation for the evolution of the features – can indeed be accomplished via the repulsive forces generated by the negative spectrum of $W$.
5 Experiments
In this section we evaluate the gradient flow framework. We corroborate the spectral analysis using synthetic data with controllable homophily. We confirm that negative (positive) eigenvalues of the channel-mixing are essential in heterophilic (homophilic) scenarios, where the gradient flow should align with an HFD (LFD) dynamics respectively. We show that the gradient flow in eq. 12 – a linear, residual, symmetric graph convolutional model – achieves competitive performance on homophilic and heterophilic real-world graphs.
Methodology.
We crystallize the framework in the model presented in eq. 12, with the encoder and decoder implemented as single linear layers or MLPs, and we set $\tilde W$ to be diagonal. For the real-world experiments we consider diagonally-dominant (DD), diagonal (D) and time-dependent choices for the structure of $W$ that offer explicit control over its spectrum. In the DD case, we consider a symmetric $W_0$ with zero diagonal and $w \in \mathbb{R}^d$ defined by $w_r = q_r\sum_s |(W_0)_{rs}|$, and set $W = \mathrm{diag}(w) + W_0$. Due to the Gershgorin Theorem the eigenvalues of $W$ belong to the union of the intervals $[w_r - \sum_s |(W_0)_{rs}|,\ w_r + \sum_s |(W_0)_{rs}|]$, so the model ‘can’ easily redistribute mass in the spectrum of $W$ via $q$. This generalizes the decomposition of the channel-mixing in Chen et al. (2020), providing a justification in terms of its spectrum, and turns out to be more efficient w.r.t. the hidden dimension, as shown in Figure 4 in the Appendix. For the D case we take $W$ to be diagonal, with entries sampled and fixed – i.e., we do not train over $W$ – and only learn the remaining parameters. We also include a time-dependent model where $W$ varies across layers. To investigate the role of the spectrum of $W$ on synthetic graphs, we construct three additional variants named sum, prod and neg-prod respectively, where the prod (neg-prod) variant has only nonnegative (nonpositive) eigenvalues.
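The spectral constraints of the three ablation variants can be sanity-checked directly. The construction below (sum as W0 + W0^T, prod as W0^T W0, neg-prod as -W0^T W0) is a plausible reading of the description above, with an arbitrary dimension:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
W0 = rng.standard_normal((d, d))

W_sum = W0 + W0.T          # symmetric, indefinite in general: attraction and repulsion
W_prod = W0.T @ W0         # PSD: only nonnegative eigenvalues (attraction only)
W_negprod = -W0.T @ W0     # NSD: only nonpositive eigenvalues (repulsion only)

eig_sum = np.linalg.eigvalsh(W_sum)
eig_prod = np.linalg.eigvalsh(W_prod)
eig_negprod = np.linalg.eigvalsh(W_negprod)
```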
Complexity and number of parameters.
If we treat the number of layers as a constant, the discrete gradient flow scales linearly in the number of nodes and edges, where $p$ and $d$ are the input-feature and hidden dimension respectively, with $d < p$ usually. Our model is in fact faster than GCN, as confirmed in Figure 5 in Appendix D. Since the encoder and decoder are single linear layers (MLPs), we can bound the number of parameters by $O(pd + d^2 + dk)$, with $k$ the number of label classes, in the DD variant, while the D variant only requires $O(pd + d + dk)$ parameters. Further ablation studies appear in Figure 4 in the Appendix, showing that the DD variant outperforms sum and GCN – especially in the lower hidden dimension regime – on real-world benchmarks with varying homophily.
Synthetic experiments and ablation studies.
To investigate our claims in a controlled environment we use the synthetic Cora dataset of (Zhu et al., 2020, Appendix G). Graphs are generated for target levels of homophily via preferential attachment – see Section D.3 for details. Figure 2 confirms the spectral analysis and offers explainability in terms of performance and smoothness of the predictions. Each curve – except GCN – represents one variant of eq. 12, as described in ‘methodology’. Figure 2 (top) reports the test accuracy vs. true label homophily. Neg-prod is better than prod on low homophily and vice versa on high homophily. This confirms Proposition 3.1, where we have shown that the gradient flow can lead to an HFD dynamics – generally desirable with low homophily – through the negative eigenvalues of $W$. Conversely, the prod configuration (where we have an attraction-only dynamics) struggles in low-homophily scenarios even though a residual connection is present. Both prod and neg-prod are ‘extreme’ choices and serve the purpose of highlighting that turning off one side of the spectrum can be damaging depending on the underlying homophily. In general though, ‘neutral’ variants like sum and DD are indeed more flexible and better performing. In fact, our model outperforms GCN especially in low-homophily scenarios, confirming Theorem 4.3, where we have shown that without a residual connection convolutional models are LFD – and hence more sensitive to the underlying homophily – irrespectively of the spectrum of $W$. This is further confirmed by additional ablation studies in Figure 3.
In Figure 2 (bottom) we compute the homophily of the prediction (cross) for a given method and compare it with the homophily (circle) of the prediction read from the encoding (i.e. graph-agnostic). The homophily here is a proxy to assess whether the evolution is smoothing, the goal being to explain the smoothness of the prediction via the spectrum of $W$ as per our theoretical analysis.
For neg-prod the homophily after the evolution is lower than that of the encoding, supporting the analysis that negative eigenvalues of $W$ enhance the high frequencies. The opposite behaviour occurs in the case of prod and explains why, in the low-homophily regime, prod underperforms: the prediction is smoother than the true homophily warrants. The DD and sum variants adapt better to the true homophily. We note how the encoding compensates when the dynamics can only either attract or repulse (i.e. when the spectrum of $W$ has a fixed sign) by decreasing or increasing the initial homophily respectively.
Real-world experiments.
We test against a range of datasets with varying homophily Sen et al. (2008); Rozemberczki et al. (2021); Pei et al. (2020) (see Section D.4 for additional details). We use results provided in (Yan et al., 2021, Table 1), which include standard baselines such as GCN Kipf and Welling (2017), GraphSAGE Hamilton et al. (2017), GAT Veličković et al. (2018), PairNorm Zhao and Akoglu (2019) and recent models tailored towards the heterophilic setting (GGCN Yan et al. (2021), Geom-GCN Pei et al. (2020), H2GCN Zhu et al. (2020) and GPRGNN Chien et al. (2021)). For Sheaf Bodnar et al. (2022), a recent top performer on heterophilic datasets, we took the best performing variant (out of six provided) for each dataset. We also include the continuous baselines CGNN Xhonneux et al. (2020) and GRAND Chamberlain et al. (2021a) to provide empirical evidence for Proposition 4.1. Splits taken from Pei et al. (2020) are used in all the comparisons. The model discussed in ‘methodology’ is a very simple architecture with shared parameters across layers and a runtime smaller than GCN and more recent models like GGCN designed for heterophilic graphs (see Figure 5 in the Appendix). Nevertheless, it achieves competitive results on all datasets, performing on par with or better than more complex recent models. Moreover, comparison with the ‘time-dependent’ variant confirms that by sharing weights across layers we do not lose performance. We note that on heterophilic graphs a short integration time is usually needed, due to the topology being harmful and the negative eigenvalues of $W$ leading to exponential behaviour (see Appendix D).
Texas  Wisconsin  Cornell  Film  Squirrel  Chameleon  Citeseer  Pubmed  Cora  
Hom level  0.11  0.21  0.30  0.22  0.22  0.23  0.74  0.80  0.81 
#Nodes  183  251  183  7,600  5,201  2,277  3,327  18,717  2,708 
#Edges  295  466  280  26,752  198,493  31,421  4,676  44,327  5,278 
#Classes  5  5  5  5  5  5  7  3  6 
GGCN  
GPRGNN  
H2GCN  
GCNII  
GeomGCN  
PairNorm  
GraphSAGE  
GCN  
GAT  
MLP  
CGNN  
GRAND  
Sheaf (max)  
(DD)  
(D)  
timedep (DD) 
6 Conclusions
In this work, we developed a framework for explainable GNNs where the evolution can be interpreted as minimizing a multi-particle learnable energy. This translates into studying the interaction between the spectrum of the graph and the spectrum of the ‘channel-mixing’, leading to a better understanding of when and why the induced dynamics is low (high) frequency dominated. From a theoretical perspective, we refined existing asymptotic analyses of GNNs to account for the role of the spectrum of the channel-mixing as well. From a practical perspective, our framework allows for ‘educated’ choices, resulting in a simple, more explainable convolutional model that achieves competitive performance on homophilic and heterophilic benchmarks while being faster than GCN. Our results refute the folklore of graph convolutional models being too simple for complex benchmarks.
Limitations and future works.
We limited our attention to a constant bilinear form $W$, which might be excessively rigid. It is possible to derive non-constant alternatives that are aware of the features or of the position in the graph. The main challenge amounts to matching the requirement for local ‘heterogeneity’ with efficiency: we reserve this question for future work. Our analysis is also a first step into studying the interaction of the graph and ‘channel-mixing’ spectra; we did not explore other dynamics that are neither LFD nor HFD as per our definitions. The energy formulation points to new, more ‘physics-inspired’ models; this will be explored in future work.
Societal impact.
Our work sheds light on the actual dynamics of GNNs and could hence improve their explainability, which is crucial for assessing their impact on large-scale applications. We also show that instances of our framework achieve competitive performance on heterophilic data despite being faster than GCN, providing evidence for efficient methods with a reduced footprint.
References
 On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations. Cited by: §1.
 Analyzing the expressive power of graph neural networks in a spectral perspective. In International Conference on Learning Representations, Cited by: §1, §4.2.
 Interaction networks for learning about objects, relations and physics. Advances in neural information processing systems 29. Cited by: §1.
 Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §1.
 Experiment tracking with Weights and Biases. Software available from wandb.com. Cited by: §D.2, §D.5.
 Neural flows: efficient alternative to neural odes. In Advances in Neural Information Processing Systems, Vol. 34. Cited by: §1.
 Beyond low-frequency information in graph convolutional networks. In AAAI. AAAI Press. Cited by: §1, §2, §4.2.
 Neural sheaf diffusion: a topological perspective on heterophily and oversmoothing in gnns. arXiv preprint arXiv:2202.04579. Cited by: §1, §2, §5.
 Message passing neural pde solvers. arXiv preprint arXiv:2202.03376. Cited by: §1.
 Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478. Cited by: §1.
 Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Cited by: §1.
 A note on oversmoothing for graph neural networks. arXiv preprint arXiv:2006.13318. Cited by: §A.2, §1, §1, §1, §2, §2, §4.2, §4.2.
 Grand: graph neural diffusion. In International Conference on Machine Learning, pp. 1407–1418. Cited by: §C.1, §1, §4.1, §5.
 Beltrami flow and neural diffusion on graphs. Advances in Neural Information Processing Systems 34. Cited by: §1.
 Simple and deep graph convolutional networks. In International Conference on Machine Learning, pp. 1725–1735. Cited by: §C.2, §3, §3, §4.2, §4.2, §5.
 . Advances in neural information processing systems 31. Cited by: §D.2, §1, §2, §4.2.
 Adaptive universal generalized pagerank graph neural network. In 9th International Conference on Learning Representations, ICLR 2021. Cited by: §1, §5.
 Spectral graph theory. (92). Cited by: §A.2, §2.
 Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems 29. Cited by: §1, §1, §2.
 Heterogeneous manifolds for curvature-aware graph embedding. arXiv preprint arXiv:2202.01185. Cited by: §1.
 Harmonic mappings of riemannian manifolds. American journal of mathematics 86 (1), pp. 109–160. Cited by: §A.1, §1, §2.
 Pdegcn: novel architectures for graph neural networks motivated by partial differential equations. Advances in Neural Information Processing Systems 34. Cited by: §B.3, §1, §4.1.
 Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds. Cited by: §D.2.
 Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272. Cited by: §1.
 Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of International Conference on Neural Networks (ICNN’96), Vol. 1, pp. 347–352. Cited by: §1.
 A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005, Vol. 2, pp. 729–734. Cited by: §1.
 Stable architectures for deep neural networks. Inverse Problems 34. Cited by: §1.
 Inductive representation learning on large graphs. Advances in neural information processing systems 30. Cited by: §C.2, §4.2, §5.
 The spectral graph wavelet transform: fundamental theory and fast computation. In Vertex-Frequency Analysis of Graph Signals, pp. 141–175. Cited by: §1.
 BernNet: learning arbitrary graph spectral filters via bernstein approximation. Advances in Neural Information Processing Systems 34. Cited by: §1.
 From high energy physics to low level vision. In International Conference on Scale-Space Theories in Computer Vision, pp. 236–247. Cited by: §A.1, §1, §2.
 Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR ’17. Cited by: §B.2, §C.2, §1, §2, §4.2, §5.
 Diffusion improves graph learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Cited by: §2.
 Revisiting graph neural networks: all we have is low-pass filters. arXiv preprint arXiv:1905.09550. Cited by: §1, §1, §1, §2, §4.2, §4.2.
 Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, Cited by: §A.2, §1, §1, §1, §2, §2, §3, §4.2, §4.2.
 PyTorch: an imperative style, high-performance deep learning library. In NeurIPS. Cited by: §D.2.
 Geomgcn: geometric graph convolutional networks. In 8th International Conference on Learning Representations, ICLR 2020. Cited by: §D.4, §D.4, §1, §5.
 Scalespace and edge detection using anisotropic diffusion. PAMI 12 (7), pp. 629–639. Cited by: §2.
 Multiscale attributed node embedding. Journal of Complex Networks 9 (2), pp. cnab014. Cited by: §5.
 Graph-coupled oscillator networks. In International Conference on Machine Learning. Cited by: §2.
 Sinkformers: transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pp. 3515–3530. Cited by: §1.
 The graph neural network model. IEEE transactions on neural networks 20 (1), pp. 61–80. Cited by: §1.
 Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §5.
 Pitfalls of graph neural network evaluation. In NIPS workshop. Cited by: §D.4.
 Encoding labeled graphs by labeling raam. Advances in Neural Information Processing Systems 6. Cited by: §1.
 GRAND++: graph neural diffusion with a source term. In International Conference on Learning Representations, Cited by: §3.
 Understanding oversquashing and bottlenecks on graphs via curvature. International Conference on Learning Representations. Cited by: §1, §1.
 Attention is all you need. Advances in neural information processing systems 30. Cited by: §1, §3.
 Graph attention networks. In International Conference on Learning Representations. Cited by: §C.2, §3, §4.2, §5.
 Simplifying graph convolutional networks. In International conference on machine learning, pp. 6861–6871. Cited by: §C.2, §2, §3, §4.2, §4.2.
 Continuous graph neural networks. In International Conference on Machine Learning, pp. 10432–10441. Cited by: §1, §3, §4.1, §5.
 Two sides of the same coin: heterophily and oversmoothing in graph convolutional neural networks. arXiv preprint arXiv:2102.06462. Cited by: §D.4, §1, §4.2, §5.
 Gnnexplainer: generating explanations for graph neural networks. Advances in neural information processing systems 32. Cited by: §1.
 Explainability in graph neural networks: a taxonomic survey. arXiv preprint arXiv:2012.15445. Cited by: §1.
 Pairnorm: tackling oversmoothing in gnns. arXiv preprint arXiv:1909.12223. Cited by: §5.
 Regularization on discrete spaces. In Joint Pattern Recognition Symposium, pp. 361–368. Cited by: §2.
 Beyond homophily in graph neural networks: current limitations and effective designs. Advances in Neural Information Processing Systems 33, pp. 7793–7804. Cited by: §D.3, §1, §5, §5.
Appendix A Proofs and additional details of Section 2
A.1 Discussion on continuous Dirichlet energy and harmonic maps
In this subsection we briefly expand on the formulation of the continuous Dirichlet energy in Section 2 to provide more context. Consider a smooth map f: (M, g) → (N, h), where N is usually a larger manifold we embed M into, and g, h are Riemannian metrics on the domain and codomain respectively. The Dirichlet energy of f is defined by