Graph Neural Networks as Gradient Flows

06/22/2022
by   Francesco Di Giovanni, et al.
Twitter

Dynamical systems minimizing an energy are ubiquitous in geometry and physics. We propose a gradient flow framework for GNNs where the equations follow the direction of steepest descent of a learnable energy. This approach allows us to explain the GNN evolution from a multi-particle perspective as learning attractive and repulsive forces in feature space via the positive and negative eigenvalues of a symmetric "channel-mixing" matrix. We perform spectral analysis of the solutions and conclude that gradient flow graph convolutional models can induce a dynamics dominated by the graph high frequencies, which is desirable for heterophilic datasets. We also describe structural constraints on common GNN architectures allowing them to be interpreted as gradient flows. We perform thorough ablation studies corroborating our theoretical analysis and show competitive performance of simple and lightweight models on real-world homophilic and heterophilic datasets.


1 Introduction and motivations

Graph neural networks (GNNs) Sperduti (1993); Goller and Kuchler (1996); Gori et al. (2005); Scarselli et al. (2008); Bruna et al. (2014); Defferrard et al. (2016); Kipf and Welling (2017); Battaglia et al. (2016, 2018) and in particular their Message Passing formulation (MPNN) Gilmer et al. (2017); Brandstetter et al. (2022) have become the standard ML tool for dealing with different types of relations and interactions, ranging from social networks to particle physics and drug design. One of the often-cited drawbacks of traditional GNN models is their poor ‘explainability’, making it hard to know why and how they make certain predictions Ying et al. (2019); Yuan et al. (2020), and in which situations they may work or fail. Limitations of GNNs that have attracted attention are over-smoothing Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020), over-squashing and bottlenecks Alon and Yahav (2021); Topping et al. (2022), and performance on heterophilic data Pei et al. (2020); Zhu et al. (2020); Chien et al. (2021); Bo et al. (2021); Yan et al. (2021), where adjacent nodes usually have different labels.

Contributions.

We propose a gradient flow framework where the GNN equations follow the direction of steepest descent of a learnable energy. Thanks to this framework we can (i) interpret GNNs as a multi-particle dynamics where the learned parameters determine pairwise attractive and repulsive potentials in feature space. This sheds light on how GNNs can adapt to heterophily and explains their performance and the smoothness of the prediction. (ii) The framework leads to residual convolutional models where the channel-mixing is performed by a shared symmetric bilinear form inducing attraction and repulsion via its positive and negative eigenvalues, respectively. We theoretically investigate the interaction of the graph spectrum with the spectrum of the channel-mixing, proving that if more mass is distributed over the negative eigenvalues of the channel-mixing matrix, then the dynamics is dominated by the high graph frequencies, which can be desirable on heterophilic graphs. We also extend results of Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020) by showing that when we drop the residual connection intrinsic to the gradient flow framework, graph convolutional models always induce a low-frequency-dominated dynamics, independent of the sign and magnitude of the spectrum of the channel-mixing. We also discuss how simple choices make common architectures fit the framework and conduct thorough ablation studies to corroborate the theoretical analysis of the role of the channel-mixing spectrum. (iii) We crystallize an instance of our framework into a linear, residual, convolutional model that retains explainability and achieves competitive performance on homophilic and heterophilic real-world graphs whilst being faster than GCN.

Figure 1: Actual dynamics of the gradient flow model: attractive and repulsive forces lead to a non-smoothing process able to separate labels.

Related work.

Our analysis is related to investigating GNNs as filters on the graph spectrum Defferrard et al. (2016); Hammond et al. (2019); Balcilar et al. (2020); He et al. (2021) and to studying the over-smoothing effect Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020), and partly adopts techniques similar to Oono and Suzuki (2020). The key difference is that we also consider the spectrum of the ‘channel-mixing’ matrix. Gradient flows have been a standard tool in physics and differential geometry Eells and Sampson (1964), from which they were adopted for image processing Kimmel et al. (1997), and more recently also used in ML Sander et al. (2022) for the analysis of Transformers Vaswani et al. (2017). Our continuous-time evolution equations follow the spirit of Neural ODEs Haber and Ruthotto (2018); Chen et al. (2018); Biloš et al. (2021) and the study of GNNs as continuous dynamical systems Xhonneux et al. (2020); Chamberlain et al. (2021a); Eliasof et al. (2021); Chamberlain et al. (2021b), and more generally resonate with the geometric deep learning blueprint Bronstein et al. (2021); Topping et al. (2022); Bodnar et al. (2022); Di Giovanni et al. (2022).

Outline.

In Section 2, we review the continuous and discrete Dirichlet energy and the associated gradient flow framework. We formalize the notions of over-smoothing and of low (high)-frequency-dominated dynamics to investigate GNNs and study the dominant components in their evolution. We extend the graph Dirichlet energy to allow for a non-trivial norm for the feature edge-gradient. This leads to gradient flow equations that diffuse the features and over-smooth in the limit. Accordingly, in Section 3 we introduce a more general energy with a symmetric channel-mixing matrix, giving rise to attractive and repulsive pairwise terms via its positive and negative eigenvalues, and show that the negative spectrum can induce a high-frequency-dominated dynamics. In Section 4 we first compare with continuous GNN models and then discretize the equations, providing a ‘recipe’ for making standard GNN architectures fit a gradient flow framework. We adapt the spectral analysis to discrete time, showing that gradient flow convolutional models can generate a dynamics dominated by the high frequencies via the negative eigenvalues of the channel-mixing matrix, while this is impossible if we drop the residual connection. In Section 5 we corroborate our theoretical analysis of the role of the channel-mixing spectrum via ablation studies on graphs with varying homophily. Experiments on real-world datasets show competitive performance of our model despite its simplicity and reduced number of parameters.

2 Gradient-flow formalism

Notations adopted throughout the paper.

Let G be an undirected graph with n nodes. We work with a matrix of d-dimensional node features whose rows are the (transposed) feature vectors of the individual nodes and whose columns are the feature channels across all nodes; its vectorization is obtained by stacking the columns. Given a symmetric matrix, we refer to its most positive and most negative eigenvalues and to its spectral radius; if the matrix is positive semi-definite, we also refer to its smallest positive eigenvalue. A dot denotes the temporal derivative, ⊗ is the Kronecker product, and ‘a.e.’ means almost every w.r.t. the Lebesgue measure, usually referring to data in the complement of some lower-dimensional subspace. Proofs and additional results appear in the Appendix.

Starting point: a geometric parallelism.

To motivate a gradient-flow approach for GNNs, we start from the continuous case (see Section A.1 for details). Consider a smooth map from a domain into a feature space equipped with a constant metric. Its Dirichlet energy is defined by

(1)

and measures the ‘smoothness’ of the map. A natural approach to finding minimizers of the Dirichlet energy (called harmonic maps) was introduced in Eells and Sampson (1964) and consists in studying its gradient flow, wherein a given map is evolved in the direction of steepest descent of the energy. Evolution equations of this type have historically been at the core of variational and PDE-based image processing; in particular, gradient flows of the Dirichlet energy were shown in Kimmel et al. (1997) to recover the Perona-Malik nonlinear diffusion Perona and Malik (1990).
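For reference, in the simplest Euclidean setting (a sketch in assumed notation; eq. 1 in the paper additionally allows a constant metric on the feature space), the Dirichlet energy of a smooth map f and its gradient flow read

\mathcal{E}^{\mathrm{Dir}}(f) = \frac{1}{2}\int \|\nabla f(x)\|^{2}\,\mathrm{d}x, \qquad \partial_t f = -\nabla_f\,\mathcal{E}^{\mathrm{Dir}}(f) = \Delta f,

i.e. the heat (harmonic map) flow, whose stationary points are exactly the harmonic maps mentioned above.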

Motivation: GNNs for node-classification.

We wish to extend the gradient flow formalism to node classification on graphs. Assume we have a graph, node features, and labels on a subset of the nodes, and that we want to predict the labels on the remaining nodes. A GNN typically evolves the features via some parametric rule and uses a decoding map for the prediction. In graph convolutional models Defferrard et al. (2016); Kipf and Welling (2017), the evolution rule consists of two operations: applying a shared linear transformation to the features (‘channel mixing’) and propagating them along the edges of the graph (‘diffusion’). Our goal consists in studying when this evolution is the gradient flow of some parametric class of energies generalizing the Dirichlet energy. This means that the parameters can be interpreted as ‘finding the right notion of smoothness’ for our task. We evolve the features along the gradient flow and produce the prediction from the features at some optimal integration time.

Why a gradient flow? Since the energy decreases at a rate given by the squared norm of its gradient, it dissipates along the gradient flow. Accordingly, this framework allows us to explain the GNN dynamics as flowing the node features in the direction of steepest descent of the energy. Indeed, we find that parametrizing an energy leads to equations governed by attractive and repulsive forces that can be controlled via the spectrum of symmetric ‘channel-mixing’ matrices. This shows that by learning to distribute more mass over the negative (positive) eigenvalues of the channel-mixing, gradient flow models can generate dynamics dominated by the higher (respectively, lower) graph frequencies and hence tackle different homophily scenarios. The gradient flow framework also leads to sharing of the weights across layers (since we parametrize the energy rather than the evolution equations, as usually done in GNNs), allowing us to reduce the number of parameters without compromising performance (see Table 1).
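As a minimal worked step for the dissipation claim (standard chain rule; the symbols for the learnable energy and the evolving features are assumed rather than recovered from the source):

\frac{\mathrm{d}}{\mathrm{d}t}\,\mathcal{E}_{\theta}(\mathbf{F}(t)) \;=\; \big\langle \nabla\mathcal{E}_{\theta}(\mathbf{F}(t)),\, \dot{\mathbf{F}}(t)\big\rangle \;=\; -\big\|\nabla\mathcal{E}_{\theta}(\mathbf{F}(t))\big\|^{2} \;\le\; 0,

so the learned energy can only decrease along the evolution.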

Analysis on graphs: preliminaries.

Given a connected graph with self-loops, its adjacency matrix has entry one for every pair of connected nodes and zero otherwise. We let the degree matrix be the diagonal matrix of node degrees and work with the symmetrically normalized adjacency. Given a matrix representation of a signal on the nodes, its graph gradient along an edge is the difference of the degree-normalized feature values at its endpoints. We define the Laplacian as the divergence of the gradient (the divergence being the adjoint of the gradient operator), represented by the identity minus the normalized adjacency. We refer to the eigenvalues of the Laplacian as frequencies: the lowest frequency is always zero, while the highest frequency is at most two Chung and Graham (1997). As in the continuum case, the gradient allows us to define a (graph) Dirichlet energy Zhou and Schölkopf (2005)

(2)

where the extra normalization factor is for convenience. As for manifolds, the Dirichlet energy measures smoothness. If we stack the columns of the feature matrix into a single vector, the gradient flow of the graph Dirichlet energy yields the heat equation on each channel:

(3)
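In standard notation (assumed here rather than recovered from the source: F the n x d feature matrix with rows f_i, a_{ij} the adjacency entries, d_i the degrees, and Δ = I − D^{-1/2} A D^{-1/2} the normalized Laplacian), eqs. 2 and 3 take the form

\mathcal{E}^{\mathrm{Dir}}(\mathbf{F}) \;=\; \frac{1}{4}\sum_{i,j} a_{ij}\Big\|\frac{\mathbf{f}_i}{\sqrt{d_i}} - \frac{\mathbf{f}_j}{\sqrt{d_j}}\Big\|^{2} \;=\; \frac{1}{2}\,\mathrm{trace}\big(\mathbf{F}^{\top}\boldsymbol{\Delta}\mathbf{F}\big), \qquad \dot{\mathbf{F}}(t) \;=\; -\nabla_{\mathbf{F}}\,\mathcal{E}^{\mathrm{Dir}}(\mathbf{F}(t)) \;=\; -\boldsymbol{\Delta}\mathbf{F}(t),

which acts as the heat equation independently on each feature channel.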

Similarly to Cai and Wang (2020), we rely on the Dirichlet energy to assess whether a given dynamics is a smoothing process. A different choice of Laplacian, built from the non-normalized adjacency, induces an analogous Dirichlet energy. Throughout this paper, we rely on the following definitions (see Section A.3 for further equivalent formulations and justifications):

Definition 2.1.

A solution initialized at given node features is smoothing if its Dirichlet energy is bounded by a constant depending only on the initial features and the graph. Over-smoothing occurs if the Dirichlet energy of the (possibly normalized) solution vanishes as time goes to infinity.

Our notion of ‘over-smoothing’ is a relaxed version of the definition in Rusch et al. (2022), although in the linear case one always finds an exponential decay of the Dirichlet energy. We note that the Dirichlet energy vanishes iff each feature channel lies in the kernel of the Laplacian. As in Oono and Suzuki (2020), this corresponds to a loss of separation power along the solution, where nodes with equal degree become indistinguishable in the limit (if we replaced the normalized Laplacian with the non-normalized one, we would not even be able to separate nodes with different degrees in the limit).

To motivate the next definition, consider a linear dynamics whose solution is unbounded for a.e. initial condition, yet whose low-frequency components grow the fastest, so that the normalized solution converges to the lowest-frequency eigenspace of the Laplacian. We formalize this scenario, including the opposite case where the high-frequency components are dominant, by studying the Rayleigh quotient of the Laplacian along the normalized solution.
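For concreteness, the Rayleigh quotient referenced here is the standard one (assumed notation, with Δ the normalized Laplacian and ρ_Δ its spectral radius):

\mathrm{R}(\mathbf{f}) \;=\; \frac{\langle \mathbf{f}, \boldsymbol{\Delta}\mathbf{f}\rangle}{\langle \mathbf{f}, \mathbf{f}\rangle} \;\in\; [0, \rho_{\Delta}],

which vanishes exactly on the kernel of the Laplacian (lowest frequency) and attains ρ_Δ exactly on the highest-frequency eigenspace.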

Definition 2.2.

A solution initialized at given node features is Low/High-Frequency-Dominant (LFD/HFD) if the Rayleigh quotient of the normalized solution converges to the lowest (respectively, highest) graph frequency as time goes to infinity.

We report a consequence of Definition 2.2 and refer to Section A.3 for additional details and motivations for the characterizations of LFD and HFD.

Lemma 2.3.

A solution is LFD (respectively, HFD) iff for every sequence of times tending to infinity there exists a subsequence along which the normalized solution converges to a limit lying in the lowest-frequency (respectively, highest-frequency) eigenspace of the Laplacian.

If a graph is homophilic, adjacent nodes are likely to share the same label and we expect a smoothing dynamics that enhances the low-frequency components to be successful at node classification tasks Wu et al. (2019); Klicpera et al. (2019). In the opposite case of heterophily, the high-frequency components might contain more relevant information for separating classes Bo et al. (2021); Bodnar et al. (2022), the prototypical example being the eigenvector of the Laplacian associated with the largest frequency, which separates the two sides of a regular bipartite graph. In other words, the class of heterophilic graphs contains instances where signals should be sharpened by increasing the Dirichlet energy rather than smoothed out. Accordingly, an ideal framework for learning on graphs must accommodate both of these opposite scenarios by being able to induce either an LFD or an HFD dynamics.

Parametric Dirichlet energy: channel-mixing as metric in feature space.

In eq. 1 a constant non-trivial metric on the feature space leads to the mixing of the feature channels. We adapt this idea by considering a symmetric positive semi-definite channel-mixing matrix and using it to generalize the graph Dirichlet energy as

(4)

We note the analogy with eq. 1, where the sum over the nodes replaces the integration over the domain and the derivative along a coordinate direction is replaced by the gradient along an edge. We generally treat the channel-mixing matrix as learnable weights and study the gradient flow of the resulting energy:

(5)
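One natural form of eqs. 4 and 5 (an assumed reconstruction, consistent with the sketch after eq. 3, with W symmetric positive semi-definite) is

\mathcal{E}^{\mathrm{Dir}}_{\mathbf{W}}(\mathbf{F}) \;=\; \frac{1}{4}\sum_{i,j} a_{ij}\Big\langle \frac{\mathbf{f}_i}{\sqrt{d_i}} - \frac{\mathbf{f}_j}{\sqrt{d_j}},\; \mathbf{W}\Big(\frac{\mathbf{f}_i}{\sqrt{d_i}} - \frac{\mathbf{f}_j}{\sqrt{d_j}}\Big)\Big\rangle, \qquad \dot{\mathbf{F}}(t) \;=\; -\boldsymbol{\Delta}\mathbf{F}(t)\,\mathbf{W},

so that choosing W = I recovers eq. 3.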

We see that eq. 5 generalizes eq. 3. Below ‘smoothing’ is intended as in Definition 2.1.

Proposition 2.4.

Let P denote the projection onto the kernel of the Laplacian (applied channel-wise). Equation 5 is smoothing: its Dirichlet energy decays along the solution, and in fact the solution converges exponentially fast to a limit determined by the projection of the initial features onto the kernel of the Laplacian.

Proposition 2.4 implies that no weight matrix in eq. 5 can separate the limit embeddings of nodes with the same degree and input features. If the channel-mixing matrix has a trivial kernel, then nodes with the same degree converge to the same representation and over-smoothing occurs as per Definition 2.1. Differently from Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020), over-smoothing occurs independently of the spectral radius of the ‘channel-mixing’ if its eigenvalues are positive, even for equations which lead to residual GNNs when discretized Chen et al. (2018). According to Proposition 2.4, we do not expect eq. 5 to succeed on heterophilic graphs, where smoothing processes are generally harmful; this is confirmed in Figure 2 (see the prod curve). To remedy this problem, we generalize eq. 5 to a gradient flow that can be HFD as per Definition 2.2.

3 A general parametric energy for pairwise interactions

We extend the Dirichlet energy associated with the channel-mixing matrix to an energy accounting for mutual, possibly repulsive, interactions in feature space. We first rewrite the energy in eq. 4 as

(6)

Replacing the two occurrences of the channel-mixing matrix with independent symmetric matrices, one for the node-wise term and one for the pairwise term, leads to

(7)

with associated gradient flow of the form (see Appendix B)

(8)
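A plausible reconstruction of eqs. 7 and 8 in the assumed notation above (with Ω and W symmetric, Ā = D^{-1/2} A D^{-1/2} the normalized adjacency and ā_{ij} its entries) is

\mathcal{E}(\mathbf{F}) \;=\; \frac{1}{2}\sum_{i}\langle \mathbf{f}_i, \boldsymbol{\Omega}\,\mathbf{f}_i\rangle \;-\; \frac{1}{2}\sum_{i,j}\bar a_{ij}\,\langle \mathbf{f}_i, \mathbf{W}\mathbf{f}_j\rangle, \qquad \dot{\mathbf{F}}(t) \;=\; -\mathbf{F}(t)\boldsymbol{\Omega} + \bar{\mathbf{A}}\mathbf{F}(t)\mathbf{W};

in particular, choosing Ω = W recovers a Laplacian evolution of the form of eq. 5 (without the positive semi-definiteness constraint), since Δ = I − Ā.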

Note that eq. 8 is the gradient flow of some energy iff both matrices are symmetric.

A multi-particle system point of view: attraction vs repulsion.

Consider the d-dimensional node features as particles in feature space with energy given by eq. 7. While the first term is independent of the graph topology and represents an external field in feature space, the second term constitutes a potential energy, with a bilinear form determining the pairwise interactions of adjacent node representations. Given a symmetric channel-mixing matrix, we decompose it into a positive-definite and a negative-definite component according to the sign of its eigenvalues, and rewrite the pairwise term accordingly, i.e.

(9)

The gradient flow minimizes the attractive terms associated with the positive eigenvalues and maximizes the repulsive terms associated with the negative ones. The channel-mixing matrix thus encodes repulsive pairwise interactions via its negative-definite component, which leads to edge-gradient terms that increase along the solution. The latter affords a ‘sharpening’ effect desirable on heterophilic graphs, where we need to disentangle adjacent node representations and hence ‘magnify’ the edge-gradient.

Spectral analysis of the channel-mixing.

We will now show that eq. 8 can lead to an HFD dynamics. To this end, we assume that the external-field term vanishes, so that eq. 8 reduces to the pairwise interaction term alone. According to eq. 9, the negative eigenvalues of the channel-mixing matrix lead to repulsion; we show that the latter can induce an HFD dynamics as per Definition 2.2. We let P denote the orthogonal projection onto the eigenspace of the Laplacian associated with the highest frequency. Moreover, we define a quantity comparing the mass that the channel-mixing matrix places on its negative eigenvalues with the mass it places on its positive ones.

Proposition 3.1.

If the negative spectral mass of the channel-mixing outweighs the positive one in the sense of the quantity just defined, then the dynamics is HFD for a.e. initial condition: the Rayleigh quotient converges to the highest graph frequency, and the normalized solution converges to a limit lying in the corresponding highest-frequency eigenspace of the Laplacian.

Proposition 3.1 shows that if enough mass of the spectrum of the ‘channel-mixing’ is distributed over the negative eigenvalues, then the evolution is dominated by the graph high frequencies. This analysis is made possible in our gradient flow framework, where the channel-mixing matrix must be symmetric. The HFD dynamics induced by its negative eigenvalues is confirmed in Figure 2 (neg-prod curve in the bottom chart).

A more general energy.

Equations with a source term may have better expressive power Xhonneux et al. (2020); Chen et al. (2020); Thorpe et al. (2021). In our framework this means adding to eq. 7 an extra energy term coupling the evolving features with the initial encoding through learnable weights. This leads to the following gradient flow:

(10)

We also observe that one could replace the fixed normalized adjacency with a more general symmetric graph vector field supported on the edges of the graph. This in particular includes the case where the vector field is learned from the initial encoding via an attention mechanism Vaswani et al. (2017); Veličković et al. (2018). In this case, the pairwise energy generalizes to

(11)

Since in the experiments we have not observed improvements from learning the graph vector field, and this option makes the model slower, we stick to the fixed normalized adjacency. We also note that when the external-field and pairwise matrices coincide, eq. 8 reduces to a Laplacian evolution of the form of eq. 5 (without the positive semi-definiteness constraint). We perform a spectral analysis of this case in Section B.2.

Non-linear activations.

The inner products in the formulation of the energies above can be combined with non-linear activations; the corresponding gradient flow derivation is reported in Section B.3.

While non-linear activations offer greater expressive power, the analysis presented thus far and the following comparisons with existing GNN models are investigated in the linear case for a few reasons. First, in the spirit of Wu et al. (2019); Oono and Suzuki (2020); Chen et al. (2020), by dropping non-linear maps we can perform spectral analysis in closed form. Second, in all our experiments we have seen no gain in performance when including non-linear activations. Third, we can always ‘push the non-linear maps’ in either the encoding block or the decoding one without affecting the linear gradient flow as discussed in Section 4.

4 Comparison with GNNs

In this Section, we study standard GNN models from the perspective of our gradient flow framework.

4.1 Continuous case

Continuous GNN models replace discrete layers with continuous time. In contrast with Proposition 3.1, we show that three main linearized continuous GNN models are either smoothing or LFD as per Definition 2.2. The linearized PDE-GCN model Eliasof et al. (2021) corresponds to a specific choice of the matrices in eq. 10, built from a time-dependent family of weights:

The CGNN model Xhonneux et al. (2020) can also be derived from eq. 10 by a particular choice of its terms:

Finally, in linearized GRAND Chamberlain et al. (2021a) a row-stochastic matrix is learned from the encoding via an attention mechanism and drives a diffusion-type evolution. We note that if this matrix is not symmetric, then GRAND is not a gradient flow.

Proposition 4.1.

The three linearized models above satisfy the following:

  • PDE-GCN is a smoothing model: its Dirichlet energy does not increase along the evolution.

  • For a.e. initial condition, CGNN is never HFD, and if we remove the source term the dynamics is LFD.

  • If the graph is connected, the GRAND evolution converges to a limit in which all node representations coincide.

By (ii), the source-free CGNN evolution is LFD independently of the channel-mixing weights. Moreover, by (iii), over-smoothing occurs for GRAND as per Definition 2.1. On the other hand, Proposition 3.1 shows that the negative eigenvalues of the channel-mixing matrix can make the source-free gradient flow in eq. 8 HFD. Experiments in Section 5 confirm that the gradient flow model outperforms CGNN and GRAND on heterophilic graphs.

4.2 Discrete case

We now describe a discrete version of our gradient flow model and compare it to ‘discrete’ GNNs where discrete time steps correspond to different layers. In the spirit of Chen et al. (2018), we use an explicit Euler scheme with a fixed step size to solve eq. 10. In the gradient flow framework we parametrize the energy rather than the actual equations, which leads to symmetric channel-mixing matrices that are shared across the layers. Since these matrices are square, an encoding block is used to process the input features and generally reduce the input dimension to a smaller hidden dimension. Moreover, the iterations inherently lead to a residual architecture because of the explicit Euler discretization:

(12)

with the prediction produced by a decoder mapping the hidden features to the label classes, and with the integration time equal to an integer number of steps, so that the number of steps represents the number of layers. Although eq. 12 is linear, we can include non-linear activations in the encoder and decoder. Until this point, we have considered equations that minimize a multi-particle energy. We now study when GNNs minimize such an energy.
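Before turning to that question, the following minimal sketch (not the authors' code; class and parameter names such as GradientFlowGNN, tau and W_raw are illustrative, and the source term of eq. 10 is omitted) shows the shared-weight, residual explicit-Euler update of eq. 12 with a symmetric channel-mixing matrix and a diagonal external-field term.

```python
import torch
import torch.nn as nn

class GradientFlowGNN(nn.Module):
    """Residual explicit-Euler steps of F' = -F*Omega + A_bar F W with shared weights."""

    def __init__(self, in_dim, hidden_dim, num_classes, num_layers, tau=0.5):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden_dim)       # reduces input to hidden dimension
        self.decoder = nn.Linear(hidden_dim, num_classes)  # maps hidden features to label logits
        self.W_raw = nn.Parameter(torch.randn(hidden_dim, hidden_dim) / hidden_dim ** 0.5)
        self.omega = nn.Parameter(torch.zeros(hidden_dim)) # diagonal external-field term
        self.num_layers = num_layers
        self.tau = tau                                     # Euler step size

    def forward(self, x, adj_norm):
        # adj_norm: dense symmetrically normalized adjacency D^{-1/2} A D^{-1/2}, shape (n, n)
        W = 0.5 * (self.W_raw + self.W_raw.t())            # symmetry makes the step a gradient flow
        F = self.encoder(x)
        for _ in range(self.num_layers):                   # weights are shared across layers
            F = F + self.tau * (adj_norm @ F @ W - F * self.omega)
        return self.decoder(F)

# Illustrative usage with random data:
# x, adj_norm = torch.randn(100, 32), torch.eye(100)
# logits = GradientFlowGNN(32, 16, 7, num_layers=4)(x, adj_norm)
```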

Are discrete GNNs gradient flows?

Given a (learned) symmetric graph vector field supported on the edges of the graph, consider a family of linear GNNs with shared weights of the form

(13)

Symmetry is the key requirement to interpret GNNs in eq. 13 in a gradient flow framework.

Lemma 4.2.

Equation 13 is the unit-step-size discrete gradient flow of the energy defined in eq. 11 iff its channel-mixing matrices are symmetric.

Lemma 4.2 provides a recipe for making standard architectures into a gradient flow, with symmetry being the key requirement. When eq. 13 is a gradient flow, the underlying GNN dynamics becomes explainable in terms of minimizing a multi-particle energy by learning attractive and repulsive directions in feature space as discussed in Section 3. In Section C.2, we show how Lemma 4.2 covers linear versions of GCN Kipf and Welling (2017); Wu et al. (2019), GAT Veličković et al. (2018), GraphSAGE Hamilton et al. (2017) and GCNII Chen et al. (2020) to name a few.
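A hypothetical illustration of this recipe (a sketch, not taken from the paper's code): symmetrizing the channel-mixing weight of a linear GCN-style layer and using a residual, shared-weight update makes each step interpretable as a discrete gradient flow in the sense of Lemma 4.2.

```python
import torch

def symmetric_residual_step(F, adj_norm, W_raw, tau=1.0):
    """One residual propagation step with a symmetrized channel-mixing weight."""
    W_sym = 0.5 * (W_raw + W_raw.t())       # symmetry is the key requirement of Lemma 4.2
    return F + tau * adj_norm @ F @ W_sym   # residual step; reuse the same W_sym at every layer
```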

Over-smoothing analysis in discrete setting.

By Proposition 3.1 we know that the continuous version of eq. 12 can be HFD thanks to the negative eigenvalues of the channel-mixing matrix. The next result is a discrete counterpart of Proposition 3.1 and shows that residual, symmetrized graph convolutional models can be HFD. Below we use the projection onto the eigenspace of the Laplacian associated with the highest frequency, and we report an explicit constant entering the condition in eq. 29 in Section C.3. We let:

(14)
Theorem 4.3.

Given the residual update in eq. 12 with a symmetric channel-mixing matrix, if the condition in eq. 14 holds then the high-frequency components grow the fastest, hence the dynamics is HFD for a.e. initial condition. Conversely, if the graph is not bipartite, then for a.e. initial condition the same system without the residual connection is LFD, independently of the spectrum of the channel-mixing matrix.

Theorem 4.3 shows that linear discrete gradient flows can be HFD due to the negative eigenvalues of the channel-mixing matrix. This differs from statements that standard GCNs act as low-pass filters and thus over-smooth in the limit. Indeed, in those cases the spectrum of the channel-mixing is generally ignored Wu et al. (2019); Chen et al. (2020), or required to be sufficiently small in terms of its singular values Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020), when no residual connection is present. On the other hand, Theorem 4.3 emphasizes that the spectrum of the channel-mixing plays a key role in enhancing the high frequencies when enough mass is distributed over the negative eigenvalues, provided that a residual connection exists; this is confirmed by the neg-prod curve in Figure 2.

The residual connection from a spectral perspective.

Given a sufficiently small step size, so that the right-hand side of inequality 14 is satisfied, the dynamics is HFD for a.e. initial condition whenever ‘there is more mass’ in the negative spectrum of the channel-mixing than in the positive one. This means that, differently from Nt and Maehara (2019); Oono and Suzuki (2020); Cai and Wang (2020), there is no requirement on the minimal magnitude of the spectral radius of the channel-mixing coming from the graph topology, as long as the step size is small enough. Conversely, without a residual term, the dynamics is LFD for a.e. initial condition independently of the sign and magnitude of the eigenvalues of the channel-mixing. This is also confirmed by the GCN curve in Figure 2.

Gradient flow as spectral GNNs.

We finally discuss eq. 12 from the perspective of spectral GNNs as in Balcilar et al. (2020). Assume for simplicity that the extra external-field and source terms are dropped. We can then write eq. 12 (see Section C.3) as

(15)

in terms of the eigendecomposition of the graph Laplacian. Namely, if we take the graph frequencies with an associated orthonormal basis of Laplacian eigenvectors, and project the feature channels onto this eigenbasis, then we can rephrase eq. 15 as the system

(16)

Accordingly, for each projection onto an eigenvector of the channel-mixing matrix, we obtain a spectral function in the graph frequency domain: if the associated eigenvalue of the channel-mixing is positive we have a low-pass filter, while if it is negative we have a high-pass filter. Moreover, we see that along the eigenvectors of the channel-mixing associated with negative eigenvalues, the dynamics is equivalent to flipping the sign of the edge weights, which offers a direct comparison with the methods proposed in Bo et al. (2021); Yan et al. (2021), where an ‘attentive’ mechanism is used to learn negative edge weights based on feature information. The very same procedure of changing the sign of the edge weights, which is equivalent to reversing the time orientation of the evolution, can indeed be accomplished via the repulsive forces generated by the negative spectrum of the channel-mixing.
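As a worked example under the simplifying assumption above (no external-field or source term, assumed notation), one Euler step reads

\mathbf{F}(t+\tau) \;=\; \mathbf{F}(t) + \tau\,\bar{\mathbf{A}}\,\mathbf{F}(t)\,\mathbf{W}, \qquad \bar{\mathbf{A}} = \mathbf{I} - \boldsymbol{\Delta};

if φ is an eigenvector of Δ with frequency λ and ψ an eigenvector of W with eigenvalue μ, the component of F along φψ^⊤ is multiplied at every step by 1 + τμ(1 − λ). For high frequencies λ > 1 this factor exceeds one precisely when μ < 0, while for small enough step size it stays below one when μ > 0, matching the low-pass/high-pass behaviour described above.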

5 Experiments

In this section we evaluate the gradient flow framework. We corroborate the spectral analysis using synthetic data with controllable homophily. We confirm that negative (positive) eigenvalues of the channel-mixing are essential in heterophilic (homophilic) scenarios, where the gradient flow should be HFD (LFD) respectively. We show that the gradient flow in eq. 12, a linear, residual, symmetric graph convolutional model, achieves competitive performance on real-world datasets.

Methodology.

We crystallize the framework in the model presented in eq. 12, with the encoder and decoder implemented as single linear layers or MLPs, and with the external-field matrix taken to be diagonal. For the real-world experiments we consider diagonally-dominant, diagonal and time-dependent choices for the structure of the channel-mixing matrix that offer explicit control over its spectrum. In the diagonally-dominant case, we consider a symmetric matrix with zero diagonal paired with a learned diagonal whose entries are tied to the corresponding absolute row sums; by the Gershgorin theorem this locates the eigenvalues explicitly, so the model can easily re-distribute mass in the spectrum of the channel-mixing via the diagonal. This generalizes the decomposition of the weights in Chen et al. (2020), providing a justification in terms of the spectrum, and turns out to be more efficient w.r.t. the hidden dimension, as shown in Figure 4 in the Appendix. In the diagonal case, we take the channel-mixing matrix to be diagonal, with entries sampled and fixed rather than trained, and only learn the remaining weights. We also include a time-dependent model where the channel-mixing varies across layers. To investigate the role of the spectrum of the channel-mixing on synthetic graphs, we construct three additional variants, named sum, prod and neg-prod respectively, where the prod (neg-prod) variants have only non-negative (non-positive) eigenvalues.
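A hedged sketch of such spectrum-controlled parametrizations (the exact diagonally-dominant construction in the paper is not fully recoverable here; prod, neg-prod and sum below simply realize the stated sign constraints on the eigenvalues):

```python
import torch

def channel_mixing(W0, variant="sum"):
    """Build a symmetric channel-mixing matrix with controlled spectrum from a square W0."""
    if variant == "prod":        # positive semi-definite: attraction only
        return W0.t() @ W0
    if variant == "neg-prod":    # negative semi-definite: repulsion only
        return -(W0.t() @ W0)
    if variant == "sum":         # general symmetric: eigenvalues of either sign
        return W0 + W0.t()
    raise ValueError(f"unknown variant: {variant}")
```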

Complexity and number of parameters.

If we treat the number of layers as a constant, the discrete gradient flow scales linearly in the number of edges and in the input and hidden dimensions, with the hidden dimension usually smaller than the input one, and is in fact faster than GCN, as confirmed in Figure 5 in Appendix D. Since the encoder and decoder are single linear layers (MLPs), the number of parameters is dominated by the encoder, the decoder and the channel-mixing matrix, and is smaller in the diagonal variant than in the diagonally-dominant one. Further ablation studies appear in Figure 4 in the Appendix, showing that the diagonally-dominant variant outperforms sum and GCN, especially in the lower hidden dimension regime, on real-world benchmarks with varying homophily.

Figure 2: Experiments on synthetic datasets with controlled homophily.

Synthetic experiments and ablation studies.

To investigate our claims in a controlled environment we use the synthetic Cora dataset of (Zhu et al., 2020, Appendix G). Graphs are generated for target levels of homophily via preferential attachment; see Section D.3 for details. Figure 2 confirms the spectral analysis and offers explainability in terms of performance and smoothness of the predictions. Each curve, except GCN, represents one version of the channel-mixing matrix as described in ‘methodology’, and we implement eq. 12 accordingly. Figure 2 (top) reports the test accuracy vs true label homophily. Neg-prod is better than prod on low homophily and vice versa on high homophily. This confirms Proposition 3.1, where we have shown that the gradient flow can lead to an HFD dynamics, which is generally desirable with low homophily, through the negative eigenvalues of the channel-mixing. Conversely, the prod configuration (where we have an attraction-only dynamics) struggles in low-homophily scenarios even though a residual connection is present. Both prod and neg-prod are ‘extreme’ choices and serve the purpose of highlighting that turning off one side of the spectrum can be damaging, with the damage depending on the underlying homophily. In general, though, ‘neutral’ variants like sum and the diagonally-dominant one are indeed more flexible and better performing. In fact, the diagonally-dominant variant outperforms GCN especially in low-homophily scenarios, confirming Theorem 4.3, where we have shown that without a residual connection convolutional models are LFD, and hence more sensitive to the underlying homophily, irrespective of the spectrum of the channel-mixing. This is further confirmed by additional ablation studies in Figure 3.

In Figure 2 (bottom) we compute the homophily of the prediction (cross) for a given method and compare it with the homophily (circle) of the prediction read from the encoding (i.e. graph-agnostic). The homophily here is a proxy to assess whether the evolution is smoothing, the goal being to explain the smoothness of the prediction via the spectrum of the channel-mixing as per our theoretical analysis. For neg-prod the homophily after the evolution is lower than that of the encoding, supporting the analysis that negative eigenvalues of the channel-mixing enhance the high frequencies. The opposite behaviour occurs for prod and explains why, in the low-homophily regime, prod under-performs: its prediction is smoother than the true homophily warrants. The diagonally-dominant and sum variants adapt better to the true homophily. We note how the encoding compensates when the dynamics can only attract or only repulse (i.e. when the spectrum of the channel-mixing has a definite sign) by decreasing or increasing the initial homophily, respectively.

Real world experiments.

We test against a range of datasets with varying homophily Sen et al. (2008); Rozemberczki et al. (2021); Pei et al. (2020) (see Section D.4 for additional details). We use the results provided in (Yan et al., 2021, Table 1), which include standard baselines such as GCN Kipf and Welling (2017), GraphSAGE Hamilton et al. (2017), GAT Veličković et al. (2018), PairNorm Zhao and Akoglu (2019) and recent models tailored towards the heterophilic setting (GGCN Yan et al. (2021), Geom-GCN Pei et al. (2020), H2GCN Zhu et al. (2020) and GPRGNN Chien et al. (2021)). For Sheaf Bodnar et al. (2022), a recent top performer on heterophilic datasets, we took the best-performing variant (out of the six provided) for each dataset. We also include the continuous baselines CGNN Xhonneux et al. (2020) and GRAND Chamberlain et al. (2021a) to provide empirical evidence for Proposition 4.1. Splits taken from Pei et al. (2020) are used in all the comparisons. The model discussed in ‘methodology’ is a very simple architecture with shared parameters across layers and a run time smaller than that of GCN and of more recent models like GGCN designed for heterophilic graphs (see Figure 5 in the Appendix). Nevertheless, it achieves competitive results on all datasets, performing on par with or better than more complex recent models. Moreover, the comparison with the ‘time-dependent’ variant confirms that by sharing weights across layers we do not lose performance. We note that on heterophilic graphs a short integration time is usually needed, due to the topology being harmful and the negative eigenvalues of the channel-mixing leading to exponential behaviour (see Appendix D).

              Texas  Wisconsin  Cornell    Film  Squirrel  Chameleon  Citeseer  Pubmed   Cora
Hom level      0.11     0.21      0.30     0.22     0.22      0.23       0.74     0.80   0.81
#Nodes          183      251       183    7,600    5,201     2,277      3,327   18,717  2,708
#Edges          295      466       280   26,752  198,493    31,421      4,676   44,327  5,278
#Classes          5        5         5        5        5         5          6        3      7

Compared methods (rows of the original table): GGCN, GPRGNN, H2GCN, GCNII, Geom-GCN, PairNorm, GraphSAGE, GCN, GAT, MLP, CGNN, GRAND, Sheaf (max), and our gradient flow model in its diagonally-dominant (DD), diagonal (D) and time-dependent (DD) variants.

Table 1: Results on heterophilic and homophilic datasets

6 Conclusions

In this work, we developed a framework for explainable GNNs where the evolution can be interpreted as minimizing a multi-particle learnable energy. This translates into studying the interaction between the spectrum of the graph and the spectrum of the ‘channel-mixing’ leading to a better understanding of when and why the induced dynamics is low (high) frequency dominated. From a theoretical perspective, we refined existing asymptotic analysis of GNNs to account for the role of the spectrum of the channel-mixing as well. From a practical perspective, our framework allows for ‘educated’ choices resulting in a simple, more explainable convolutional model that achieves competitive performance on homophilic and heterophilic benchmarks while being faster than GCN. Our results refute the folklore of graph convolutional models being too simple for complex benchmarks.

Limitations and future works.

We limited our attention to a constant bilinear form for the channel-mixing, which might be excessively rigid. It is possible to derive non-constant alternatives that are aware of the features or of the position in the graph. The main challenge amounts to matching the requirement for local ‘heterogeneity’ with efficiency; we reserve this question for future work. Our analysis is also a first step into studying the interaction of the graph and ‘channel-mixing’ spectra; we did not explore other dynamics that are neither LFD nor HFD as per our definitions. The energy formulation points to new, more ‘physics’-inspired models; this will be explored in future work.

Societal impact.

Our work sheds light on the actual dynamics of GNNs and could hence improve their explainability, which is crucial for assessing their impact on large-scale applications. We also show that instances of our framework achieve competitive performance on heterophilic data while being faster than GCN, providing evidence for efficient methods with a reduced footprint.

References

  • U. Alon and E. Yahav (2021) On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • M. Balcilar, G. Renton, P. Héroux, B. Gaüzère, S. Adam, and P. Honeine (2020) Analyzing the expressive power of graph neural networks in a spectral perspective. In International Conference on Learning Representations, Cited by: §1, §4.2.
  • P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende, et al. (2016) Interaction networks for learning about objects, relations and physics. Advances in neural information processing systems 29. Cited by: §1.
  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §1.
  • L. Biewald (2020) Experiment tracking with weights and biases. Note: Software available from wandb.com External Links: Link Cited by: §D.2, §D.5.
  • M. Biloš, J. Sommer, S. S. Rangapuram, T. Januschowski, and S. Günnemann (2021) Neural flows: efficient alternative to neural odes. In Advances in Neural Information Processing Systems, Vol. 34. External Links: Link Cited by: §1.
  • D. Bo, X. Wang, C. Shi, and H. Shen (2021) Beyond low-frequency information in graph convolutional networks. In AAAI. AAAI Press. Cited by: §1, §2, §4.2.
  • C. Bodnar, F. Di Giovanni, B. P. Chamberlain, P. Liò, and M. M. Bronstein (2022) Neural sheaf diffusion: a topological perspective on heterophily and oversmoothing in gnns. arXiv preprint arXiv:2202.04579. Cited by: §1, §2, §5.
  • J. Brandstetter, D. Worrall, and M. Welling (2022) Message passing neural pde solvers. arXiv preprint arXiv:2202.03376. Cited by: §1.
  • M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković (2021) Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478. Cited by: §1.
  • J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2014) Spectral networks and locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, Cited by: §1.
  • C. Cai and Y. Wang (2020) A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318. Cited by: §A.2, §1, §1, §1, §2, §2, §4.2, §4.2.
  • B. Chamberlain, J. Rowbottom, M. I. Gorinova, M. Bronstein, S. Webb, and E. Rossi (2021a) Grand: graph neural diffusion. In International Conference on Machine Learning, pp. 1407–1418. Cited by: §C.1, §1, §4.1, §5.
  • B. Chamberlain, J. Rowbottom, D. Eynard, F. Di Giovanni, X. Dong, and M. Bronstein (2021b) Beltrami flow and neural diffusion on graphs. Advances in Neural Information Processing Systems 34. Cited by: §1.
  • M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li (2020) Simple and deep graph convolutional networks. In International Conference on Machine Learning, pp. 1725–1735. Cited by: §C.2, §3, §3, §4.2, §4.2, §5.
  • R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018) Neural ordinary differential equations. Advances in neural information processing systems 31. Cited by: §D.2, §1, §2, §4.2.
  • E. Chien, J. Peng, P. Li, and O. Milenkovic (2021) Adaptive universal generalized pagerank graph neural network. In 9th International Conference on Learning Representations, ICLR 2021, External Links: Link Cited by: §1, §5.
  • F. R. Chung and F. C. Graham (1997) Spectral graph theory. (92). Cited by: §A.2, §2.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems 29. Cited by: §1, §1, §2.
  • F. Di Giovanni, G. Luise, and M. Bronstein (2022) Heterogeneous manifolds for curvature-aware graph embedding. arXiv preprint arXiv:2202.01185. Cited by: §1.
  • J. Eells and J. H. Sampson (1964) Harmonic mappings of riemannian manifolds. American journal of mathematics 86 (1), pp. 109–160. Cited by: §A.1, §1, §2.
  • M. Eliasof, E. Haber, and E. Treister (2021) Pde-gcn: novel architectures for graph neural networks motivated by partial differential equations. Advances in Neural Information Processing Systems 34. Cited by: §B.3, §1, §4.1.
  • M. Fey and J. E. Lenssen (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds. Cited by: §D.2.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272. Cited by: §1.
  • C. Goller and A. Kuchler (1996) Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of International Conference on Neural Networks (ICNN’96), Vol. 1, pp. 347–352. Cited by: §1.
  • M. Gori, G. Monfardini, and F. Scarselli (2005) A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2, pp. 729–734. Cited by: §1.
  • E. Haber and L. Ruthotto (2018) Stable architectures for deep neural networks. Inverse Problems 34. Cited by: §1.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. Advances in neural information processing systems 30. Cited by: §C.2, §4.2, §5.
  • D. K. Hammond, P. Vandergheynst, and R. Gribonval (2019) The spectral graph wavelet transform: fundamental theory and fast computation. In Vertex-Frequency Analysis of Graph Signals, pp. 141–175. Cited by: §1.
  • M. He, Z. Wei, H. Xu, et al. (2021) BernNet: learning arbitrary graph spectral filters via bernstein approximation. Advances in Neural Information Processing Systems 34. Cited by: §1.
  • R. Kimmel, N. Sochen, and R. Malladi (1997) From high energy physics to low level vision. In International Conference on Scale-Space Theories in Computer Vision, pp. 236–247. Cited by: §A.1, §1, §2.
  • T. N. Kipf and M. Welling (2017) Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR ’17. External Links: Link Cited by: §B.2, §C.2, §1, §2, §4.2, §5.
  • J. Klicpera, S. Weißenberger, and S. Günnemann (2019) Diffusion improves graph learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Cited by: §2.
  • H. Nt and T. Maehara (2019) Revisiting graph neural networks: all we have is low-pass filters. arXiv preprint arXiv:1905.09550. Cited by: §1, §1, §1, §2, §4.2, §4.2.
  • K. Oono and T. Suzuki (2020) Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, Cited by: §A.2, §1, §1, §1, §2, §2, §3, §4.2, §4.2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, Cited by: §D.2.
  • H. Pei, B. Wei, K. C. Chang, Y. Lei, and B. Yang (2020) Geom-gcn: geometric graph convolutional networks. In 8th International Conference on Learning Representations, ICLR 2020, External Links: Link Cited by: §D.4, §D.4, §1, §5.
  • P. Perona and J. Malik (1990) Scale-space and edge detection using anisotropic diffusion. PAMI 12 (7), pp. 629–639. Cited by: §2.
  • B. Rozemberczki, C. Allen, and R. Sarkar (2021) Multi-scale attributed node embedding. Journal of Complex Networks 9 (2), pp. cnab014. Cited by: §5.
  • T. K. Rusch, B. P. Chamberlain, J. Rowbottom, S. Mishra, and M. M. Bronstein (2022) Graph-coupled oscillator networks. In International Conference on Machine Learning, Cited by: §2.
  • M. E. Sander, P. Ablin, M. Blondel, and G. Peyré (2022) Sinkformers: transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pp. 3515–3530. Cited by: §1.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE transactions on neural networks 20 (1), pp. 61–80. Cited by: §1.
  • P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §5.
  • O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018) Pitfalls of graph neural network evaluation. In NIPS workshop. Cited by: §D.4.
  • A. Sperduti (1993) Encoding labeled graphs by labeling raam. Advances in Neural Information Processing Systems 6. Cited by: §1.
  • M. Thorpe, T. M. Nguyen, H. Xia, T. Strohmer, A. Bertozzi, S. Osher, and B. Wang (2021) GRAND++: graph neural diffusion with a source term. In International Conference on Learning Representations, Cited by: §3.
  • J. Topping, F. Di Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein (2022) Understanding over-squashing and bottlenecks on graphs via curvature. International Conference on Learning Representations. Cited by: §1, §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1, §3.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations, External Links: Link Cited by: §C.2, §3, §4.2, §5.
  • F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger (2019) Simplifying graph convolutional networks. In International conference on machine learning, pp. 6861–6871. External Links: Link Cited by: §C.2, §2, §3, §4.2, §4.2.
  • L. Xhonneux, M. Qu, and J. Tang (2020) Continuous graph neural networks. In International Conference on Machine Learning, pp. 10432–10441. External Links: Link Cited by: §1, §3, §4.1, §5.
  • Y. Yan, M. Hashemi, K. Swersky, Y. Yang, and D. Koutra (2021) Two sides of the same coin: heterophily and oversmoothing in graph convolutional neural networks. arXiv preprint arXiv:2102.06462. Cited by: §D.4, §1, §4.2, §5.
  • Z. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec (2019) Gnnexplainer: generating explanations for graph neural networks. Advances in neural information processing systems 32. Cited by: §1.
  • H. Yuan, H. Yu, S. Gui, and S. Ji (2020) Explainability in graph neural networks: a taxonomic survey. arXiv preprint arXiv:2012.15445. Cited by: §1.
  • L. Zhao and L. Akoglu (2019) Pairnorm: tackling oversmoothing in gnns. arXiv preprint arXiv:1909.12223. Cited by: §5.
  • D. Zhou and B. Schölkopf (2005) Regularization on discrete spaces. In Joint Pattern Recognition Symposium, pp. 361–368. External Links: Link Cited by: §2.
  • J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra (2020) Beyond homophily in graph neural networks: current limitations and effective designs. Advances in Neural Information Processing Systems 33, pp. 7793–7804. Cited by: §D.3, §1, §5, §5.

Appendix A Proofs and additional details of Section 2

A.1 Discussion on continuous Dirichlet energy and harmonic maps

In this subsection we briefly expand on the formulation of the continuous Dirichlet energy in Section 2 to provide more context. Consider a smooth map between Riemannian manifolds, where the codomain is usually a larger manifold we embed into, with Riemannian metrics on the domain and codomain respectively. The Dirichlet energy of such a map is defined by