The majority of graph neural networks (GNNs) are based on the message passing paradigm (Gilmer2017), wherein node features are learned by means of a non-linear propagation on the graph. Multiple recent works have pointed to the limitations of the message passing approach. These include; limited expressive power (shervashidze2011weisfeiler; xu2018how; bouritsas2020improving; bodnar2021weisfeiler), the related oversmoothing problem (NT2019; Oono2019) and the bottleneck phenomena (Alon2020; wu2020graph), which render such approaches inefficient, especially in deep GNNs. Multiple alternatives have been proposed, among which are higher-order methods (maron2019provably; bodnar2021weisfeiler) and decoupling the propagation and input graphs by modifying the topology, often referred to as graph rewiring. Topological modifications can take different forms such as graph sampling (Hamilton2017), NN (Klicpera2019), using the complete graph (Vaswani2017; Alon2020), latent graph learning (wang2019dynamic; kazi2020differentiable), or multi-hop filters (wu2019simplifying; rossi2020sign). However, there is no agreement in the literature on when and how to modify the graph, and a single principled framework for doing so.
A somewhat underappreciated fact is that GNNs are intimately related to diffusion equations (Chamberlain2021a), a connection that was exploited in the early work of Scarselli et al. (Scarselli2008). Diffusion PDEs have been historically important in computer graphics (sun2009concise; bronstein2010scale; litman2013learning; patane2016star)caselles1997geodesic; chan2001active; bertalmio2000image), and image processing (perona1990scale; sochen1998general; weickert1998anisotropic; tomasi1998bilateral; durand2002fast; buades2005non)belkin2003laplacian) and diffusion maps (coifman2006diffusion), as well as the family of PageRank algorithms (page1999pagerank; chakrabarti2007dynamic)
. In deep learning, differential equations are used as models of neural networks(Chamberlain2021a; Chen2018a; Dupont2019; Xhonneux2020; Queiruga2020; Zhuang2020a) and for physics-informed learning (Raissi2017a; cranmer2019learning; Sanchez-Gonzalez2019; cranmer2020lagrangian; shlomi2020graph; Li2020c).
In this paper, we propose a novel class of GNNs based on the discretised non-Euclidean diffusion PDE in joint positional and feature space, inspired by the Beltrami flow (sochen1998general) used two decades ago in the image processing literature for edge-preserving image denoising. We show that the discretisation of the spatial component of the Beltrami flow offers a principled view on positional encoding and graph rewiring, whereas the discretisation of the temporal component can replace GNN layers with more flexible adaptive numerical schemes. Based on this model, we introduce Beltrami Neural Diffusion (BLEND) that generalises a broad range of GNN architectures and shows state-of-the-art performance on many popular benchmarks. In a broader perspective, our approach explores new tools from PDEs and differential geometry that are less well known in the graph ML community.
Kimmel et al. (kimmel1997high; sochen1998general; kimmel2000images) considered images as 2-manifolds (parametric surfaces) embedded in some larger ambient space as where is a scaling factor, are the 2D positional coordinates of the pixels, and are the -dimensional colour or feature coordinates (with or for grayscale or RGB images, or when using patches as features (buades2005non)). In these works, the image is evolved along the gradient flow of a functional called the Polyakov action (polyakov1981quantum), which roughly measures the smoothness of the embedding111Explicitly, by minimising the functional with respect to the embedding one finds Euler-Lagrange (EL) equations that can be used to dictate the evolution process of the embedding itself.. For images embedded in Euclidean space with the functional minimised with respect to both the embedding and the metric , one obtains the following PDE:
and boundary conditions as appropriate. Here is the Laplace-Beltrami operator, the Laplacian operator induced on by the Euclidean space we embed the image into. Namely, the embedding of the manifold allows us to pull-back the Euclidean distance structure on the image: the distance between two nearby points and is given by
where is a matrix called the Riemannian metric. The fact that the distance is a combination of the positional component (distance between pixels in the plane, ) and the colour component (distance between the colours of the pixels, ) is crucial as it allows edge-preserving image diffusion.
When dealing with images, the evolution of the first two components of is a nuisance amounting to the reparametrisation of the manifold and can be ignored. For grayscale images (the case when and ), this is done by projection along the dimension , in which case the Beltrami flow takes the form of an inhomogeneous diffusion equation of ,
determining the speed of diffusion at each point, can be interpreted as an edge indicator: diffusion is weak across edges where . The result is an adaptive diffusion (perona1990scale) popular in image processing due to its ability to denoise images while preserving their edges. For cases with (multiple colour channels), equation (3) is applied to each channel separately; however, the metric couples the channels, which results in their gradients becoming aligned (kimmel2012numerical).
In the limit case , equation (3) becomes the simple homogeneous isotropic diffusion , where
is the standard Euclidean Laplacian operator. The solution is given in closed form as the convolution of the initial image and a Gaussian kernel with time-dependent variance,
and can be considered a simple linear low-pass filtering. In the limit , the image becomes constant and equal to the average colour.222Assuming appropriate boundary conditions.
Another interpretation of the Beltrami flow is passing a Gaussian on the manifold (see Figure 1, bottom), which can locally be expressed as non-linear filtering with the bilateral kernel (tomasi1998bilateral) dependent on the joint positional and colour distance (Figure 1, top),
For , the bilateral filter (6) reduces to a simple convolution with a time-dependent Gaussian.
3 Discrete Beltrami flow on graphs
We now develop the analogy of Beltrami flow for graphs. We consider a graph to be a discretisation of a continuous structure (manifold), and show that the evolution of the feature coordinates in time amounts to message passing layers in GNNs, whereas the evolution of the positional coordinates amounts to graph rewiring, which is used in some GNN architectures.
3.1 Graph Beltrami flow
Let be an undirected graph, where and denote node and edge sets, respectively. We further assume node-wise -dimensional features for . Denote by the embedding of the graph in a joint space , where is a -dimensional space with a metric representing the node coordinates (for simplicity, we will assume unless otherwise stated). We refer to and as the positional, feature and joint coordinates of node , respectively, and arrange them into the matrices , , and , of sizes , , and .
For images, Beltrami flow amounts to evolving the embedding along , with a diffusivity map.333Note that in equation (3) the diffusivity function coincides with . Similarly to (sochen1998general, Section 4.2) we have neglected the extra term appearing in (3). Thus, we consider the graph Beltrami flow to be the discrete diffusion equation
We motivate our definition as follows: and are the discrete analogies of the gradient and divergence , both with respect to a graph that can be interpreted as the numerical stencil for the discretisation of the continuous Laplace-Beltrami operator in (3). Note that can be different from the input (referred to as ‘rewiring’). As discussed in Section 3.3, most GNNs use (input graph is used for diffusion, no rewiring). Alternatively, the positional coordinates of the nodes can be used to define a new graph topology either with , for some radius , or using nearest neighbours. This new rewiring is precomputed using the input positional coordinates (i.e., ) or updated throughout the diffusion (i.e., ). Therefore, (7) can be compactly rewritten as
The function is the diffusivity controlling the diffusion strength between nodes and and is assumed to be normalised: . The dependence of the diffusivity on the embedding matches the smooth PDE analysed in e.g. (sochen1998general, Section 4.2) and is consistent with the form of attention mechanism used in e.g. (Velickovic2018; Vaswani2017). In matrix-form, we can also rewrite (7) as
where we emphasise the evolution of both the positional and feature components, coupled through the matrix-valued function
representing the diffusivity. The graph Beltrami flow produces an evolution of the joint positional and feature coordinates, . In Section 3.3 we will show how the evolution of the feature coordinates results in feature diffusion or message passing on the graph, the core of GNNs. As noted in Section 2, in the smooth case the Beltrami flow is obtained as gradient flow of an energy functional when minimised with respect to both the embedding and the metric on the surface (an image). When the embedding takes values in the Euclidean space, this leads to equations of the form (3) with no channel-mixing and an exact form of the diffusivity determined by the pull-back of the Euclidean metric. To further motivate our approach, it is tempting to investigate whether a similar conclusion can be attained here. Although in the discrete case the operation of pull-back is not well-defined, we are able to derive that the gradient flow of a modified graph Dirichlet energy gives rise to an equation of the form (7). We note though that the gradient flow does not recover the exact form of the diffusivity implemented in this paper. This is not a limitation of the theory and should be expected: by requiring the gradient flow to avoid channel-mixing and imitate the image analogy in (sochen1998general) and by inducing a discrete pull-back condition, we are imposing constraints on the problem. We leave the theoretical implications for future work and refer to Appendix B for a more thorough discussion, including definitions and proofs.
Under structural assumptions on the diffusivity, graph Beltrami flow (7) is the gradient flow of the discrete Polyakov functional.
3.2 Numerical solvers
Explicit vs implicit schemes
Equation (7) is solved numerically, which in the simplest case is done by replacing the continuous time derivative with forward time difference:
Here denotes the discrete time index (iteration) and is the time step (discretisation parameter). Rewriting (9
) compactly in matrix-vector form withleads to the explicit Euler scheme:
where and the matrix (diffusion operator) is given by
The solution to the diffusion equation is computed by applying scheme (10) multiple times in sequence, starting from some initial . It is ‘explicit’ because the update is done directly by the application of the diffusion operator on (as opposed to implicit schemes of the form arising from backward time differences that require inversion of the diffusion operator (Weickert1998)).
Multi-step and adaptive schemes
Higher-order approximation of temporal derivatives amount to using intermediate fractional steps, which are then linearly combined. Runge-Kutta (RK) (runge1895numerische; kutta1901beitrag), ubiquitously used in numerical analysis, is a classical family of explicit numerical schemes, including Euler as a particular case. The Dormand-Prince (DOPRI) (dormand1980family)
is an RK method based on fifth and fourth-order approximations, the difference between which is used as an error estimate guiding the time step size(shampine1986some).
Adaptive spatial discretisation and rewiring
Many numerical PDE solvers also employ adaptive spatial discretisation. The choice of the stencil (mesh) for spatial derivatives is done based on the character of the solution at these points; in the simulation of phenomena such as shock waves it is often desired to use denser sampling in the respective regions of the domain, which can change in time. A class of techniques for adaptive rewiring of the spatial derivatives are known as Moving Mesh (MM) methods (hawken1991review). Interpreting the graph in (7) as the numerical stencil for the discretisation of the continuous Laplace-Beltrami operator in (3), we can regard rewiring as a form of MM.
3.3 Relation to graph neural networks
Equation (9) has the structure of many GNN architectures of the ‘attentional’ type (gdlbook), where the discrete time index corresponds to a (convolutional or attentional) layer of the GNN and multiple diffusion iterations amount to a deep GNN. In the diffusion formalism, the time parameter acts as a continuous analogy of the layers, in the spirit of neural differential equations (Chen2018a). Typical GNNs amount to explicit single-step (Euler) discretisation schemes, whereas our continuous interpretation can exploit more efficient numerical schemes.
GNNs as instances of graph Beltrami flow
The graph Beltrami framework leads to a family of graph neural networks that generalise many popular architectures (see Table 1). For example, GAT (Velickovic2018) can be obtained as a particular setting of our framework where the input graph is fixed and only the feature coordinates are evolved. Equation (10) in this case becomes
and corresponds to the update formula of GAT with a residual connection and the assumption of no non-linearity between the layers. The role of the diffusivity is played by a learnable parametricattention function, which is generally time-dependent: . This results in separate attention parameters per layer , which can be learned independently. Our intentionally simplistic choice of a time-independent attention function amounts to parameter sharing across layers. We will show in Section 5.1 that this leads to a smaller model that is less likely to overfit.
Another popular architecture MoNet (Monti2016) uses linear diffusion of the features with weights dependent on the structure of the graph expressed as ‘pseudo-coordinates’, which can be cast as attention of the form . Transformers (vaswani2017attention) can be interpreted as feature diffusion on a fixed complete graph with (gdlbook). Positional encoding (used in Transformers as well as in several recent GNN architectures (bouritsas2020improving; dwivedi2020generalization)) amounts to attention dependent on both and , which allows the diffusion to adapt to the local structure of the graph; importantly, the positional coordinates are precomputed. Similarly, DeepSets (zaheer2017deep) and PointNet (qi2017pointnet) architectures can be interpreted as GNNs applied on a graph with no edges (), where each node is treated independently of the others (gdlbook). DIGL (Klicpera2019) performs graph rewiring as a pre-processing step using personalised page rank as positional coordinates, which are then fixed and not evolved. In the point cloud methods, DGCNN (wang2019dynamic) and DGM (kazi2020differentiable), the graph is constructed based on the feature coordinates and rewired adaptively (in our notation, ).
|ChebNet||Features||Fixed||Fixed||Explicit fixed step|
|GAT||Features||Fixed||Explicit fixed step|
|MoNet||Features||Fixed||Explicit fixed step|
|Transformer||Features||Fixed||Explicit fixed step|
|DeepSet/PointNet||Features||Fixed||Explicit fixed step|
|DIGL||Features||Fixed||Explicit fixed step|
|DGCNN/DGM||Features||Adaptive||Explicit fixed step|
|Beltrami||Positions||Adaptive||Explicit adaptive step /|
Multiple authors have recently argued in favor of decoupling the input graph from the graph used for diffusion. Such rewiring can take the form of graph sampling (Hamilton2017) to address scalability issues, data denoising (Klicpera2019), removal of information bottlenecks (Alon2020), or larger multi-hop filters (rossi2020sign). The graph construction can also be made differentiable and a task-specific rewiring can be learned (wang2019dynamic; kazi2020differentiable). The statement of Klicpera et al. (Klicpera2019) that ‘diffusion improves graph learning’, leading to the eponymous paradigm (DIGL), can be understood as a form of diffusion on the graph connectivity independent of the features. Specifically, the authors used as node positional encoding the Personalised PageRank (PPR), which can be interpreted as the steady-state of a diffusion process
where is the random walk graph Laplacian and is a parameter such that
represents the restart probability. The resulting positional encoding of dimensioncan be used to rewire the graph by NN sampling, which corresponds to using in our framework.
All the aforementioned GNN architectures can be seen as an explicit discretisation of equation (7) with fixed step size. On the other hand, our continuous diffusion framework offers an additional advantage of employing more efficient numerical schemes with adaptive step size. Graph rewiring of the form can be interpreted as adaptive spatial discretisation (moving mesh method).
There are multiple theoretical and empirical arguments (liu2019hyperbolic; chami2019hyperbolic) in favor of using hyperbolic spaces to represent real-life ‘small-world’ graphs (in particular, scale-free networks can be obtained as NN graphs in such spaces (boguna2021network)). Our framework allows using a non-Euclidean metric for the positional coordinates (Figure 2). In Section 5 we show that hyperbolic positional encodings allow for a significant reduction in model size with only a marginal degradation of performance.
The diffusivity function which we assumed time-independent and which lead to parameter sharing across layers (i.e., updates of the form ) can be made time-dependent of the form , where and denote shared and layer-dependent parameters, respectively.
As we noted, the Beltrami flow diffuses each channel separately. A more general variant of diffusion allowing for feature mixing is the Onsager diffusion (onsager1931reciprocal) of the form , where the matrix-valued function acts across the channels. GCN (kipf2017) can be regarded a particular setting thereof, with update of the form .
Finally, we note that the Beltrami flow amounts to linear aggregation with non-linear coefficients, or the ‘attentional’ flavor of GNNs (gdlbook). The more general message passing flavor (Gilmer2017) is possible using a generic non-linear equation of the form .
4 BLEND: Beltrami Neural Diffusion
Beltrami Neural Diffusion (BLEND) is a novel class of graph neural network architectures derived from the graph Beltrami framework. We assume an input graph with nodes and -dimensional node-wise features represented as a matrix . We further assume a -dimensional positional encoding of the graph nodes. BLEND architectures implement a learnable joint diffusion process of and and runs it for time , to produce an output node embeddings ,
where , are learnable positional and feature encoders and is a learnable decoder (possibly changing the output dimensions). Here the in Equations (2) and (8) is absorbed by and made learnable. is given by the graph Beltrami flow equation (8), where the diffusivity function (attention) is also learnable. The choice of attention function depends on the geometry of the positional encoding and for Euclidean encodings we find the scaled dot product attention (Vaswani2017) performs well, in which case
where and are learned matrices, and
is a hyperparameter.
5 Experimental results
In this section, we compare the proposed Beltrami framework to popular GNN architectures on standard node classification benchmarks and provide a detailed study of the choice of the positional encoding space. Implementation details, including runtimes and hyperparameter tuning are given in Section C and code is available at https://github.com/twitter-research/graph-neural-pde.
In our experiments, we use the following datasets: Cora(mccallum2000automating)
, Citeseer(sen2008collective), Pubmed (namata2012query), CoauthorCS (shchur2018pitfalls), Amazon, Computer, and Photo (mcauley2015image), and OGB-arxiv (hu2020open). Since many works using the first three datasets rely on the Planetoid splits (yang2016revisiting), we included them Table 2, together with a more robust evaluation on 100 random splits with 20 random initialisations (shchur2018pitfalls). In all cases the largest connected component is used and dataset statistics are given in Appendix A.
We compare to the following GNN architectures: GCN (kipf2017), GAT (Velickovic2018), MoNet (Monti2016) and GraphSAGE (Hamilton2017), and recent ODE-based GNN models: CGNN (Xhonneux2020), GDE (Poli2019), GODE (Zhuang2020a), and two versions of LanczosNet (liao2019lanczosnet). We use two variants of our method: using fixed input graph (BLEND) and using
NN graph in the positional coordinates (BLEND-knn).
5.1 Node Classification
In these experiments, we followed the methodology of (shchur2018pitfalls) using 20 random weight initialisations for datasets with fixed Planetoid splits and 100 random splits otherwise. Where available, results from (shchur2018pitfalls) are reported. Hyperparameters with the highest validation accuracy were chosen and results are reported on a test set that is used only once. Hyperparameter search used Ray Tune (Liaw2018)
with a thousand random trials using an asynchronous hyperband scheduler with a grace period and half life of ten epochs. The code to reproduce our results is included with the submission and will be released publicly following the review process. Experiments ran on AWS p2.8xlarge machines, each with 8 Tesla V100-SXM2 GPUs.
For all datasets excepting ogb-arxiv, adaptive explicit Dormand-Prince scheme was used as the numerical solver; for ogb-arxiv, we used the Runge-Kutta method. For the two smallest datasets (Cora and Citeseer) we performed direct backpropagation through each step of the numerical integrator. For the larger datasets, to reduce memory complexity, we use Pontryagin’s maximum principle to propagate gradients backwards in time(pontryagin2018mathematical). For the larger datasets, kinetic energy and Jacobian regularisation (Finlay2020; Kelly2020) was employed. The regularisation ensures the learned dynamics is well-conditioned and easily solvable by a numeric solver, which reduced training time. We use constant initialisation for the attention weights, , so training starts from a well-conditioned system that induces small regularisation penalty terms (Finlay2020).
The space complexity of BLEND is dominated by evaluating attention (13) over edges and is where is the edge set following rewiring and is dimension of features and is the dimension of positional encoding. The runtime complexity is , split between the forward and backward pass and can be dominated by either depending on the number of function evaluations (, ).
Tables 2–3 summarise the results of our experiments. BLEND outperforms other GNNs in most of the experiments, showing state-of-the-art results on some datasets. Another important point to note is that compared GNNs use different sets of parameters per layer, whereas in BLEND, due to our choice of a time-independent attention, parameters are shared. This results in significantly fewer parameters: for comparison, the OGB versions of GCN, SAGE and GAT used in the ogb-arxiv experiment require 143K, 219K and 1.63M parameters respectively, compared to only 70K in BLEND.
5.2 Positional encoding
In the second experiment, we investigated the impact of the positional encodings and vary the dimensionality and underlying geometry, in order to showcase the flexibility of our framework.
Figure 3 shows that for all datasets BLEND is superior to a Euclidean model where positional encodings are not used, which corresponds to (BLEND w/o positional in Figure 3) and a version of GAT where attention is over a concatenation of the same positional encodings used in BLEND and the features. The only exception is CoathorCS, where the performance without positional encodings is unchanged.
We experimented with three forms of positional encoding: DIGL PPR embeddings of dimension (Klicpera2019), DeepWalk embeddings (perozzi2014deepwalk) of dimensions –, and hyperbolic embeddings in the Poincare ball (chamberlain2017neural; nickel2018learning) of dimension –. Positional encodings are calculated as a preprocessing step and input as the
coordinates to BLEND. We calculated DeepWalk node embeddings using PyTorch Geometric’s Node2Vec implementation with parameters, , context size 20, and 16 walks per node. We trained with walk length ranging between 40 and 120 and took the embeddings with the highest accuracy for the link prediction task. Shallow hyperbolic embeddings were generated using the HGCN (chami2019hyperbolic) implementation with the default parameters provided by the authors.
Figure 4 (left) compares the performance of the best model using DIGL -dimensional positional encodings against the best performing -dimensional hyperbolic positional encodings with tuned over the range 2-16. The average performance with DIGL positional encodings is 85.48, compared to 85.28 for hyperbolic encodings. Only in one case the DIGL encodings outperform the best hyperbolic encodings. Figure 4 (right) show the change in performance with the dimension of the positional encoding using a fixed hyperparameter configuration. As expected, we observe monotonic increase in performance as function of . Importantly most of the performance is captured by either 16 dimensional hyperbolic or positional encodings, with a small additional margin gained for going up to , which is impractical for larger graphs.
5.3 Additional ablations
In addition to studying the affect of positional encodings , we performed ablation studies on the step size used in the integrator as well as different forms of attention.
In Figure 5 we studied the affect of changing the step size of the integrator using the explicit Euler method with a fixed terminal time set to be the optimal terminal time. The left hand side of the figure shows the performance using the adaptive stepsize Dopri5 for comparison. Dopri5 gives the most consistent performance and is best if three of the six datasets. Details of additional attention functions and their relative performance can be found in Appendix A.1.
6 Related work
Image processing, computer vision, and graphics.
Following the Perona-Malik scheme (perona1990scale), PDE-based approaches flourished in the 1990s with multiple versions of anisotropic (weickert1998anisotropic) and non-Euclidean (sochen1998general)
diffusion used primarily for image denoising. The realisation of these ideas in the form of nonlinear filters(tomasi1998bilateral; buades2005non) was adopted in the industry. PDE-based methods were also used for low-level vision tasks including inpainting (bertalmio2000image) and image segmentation (caselles1997geodesic; chan2001active). In computer graphics, fundamental solutions (‘heat kernels’) of diffusion equations were used as shape descriptors (sun2009concise). The closed-form expression of such solutions using the Laplace-Beltrami operator served as inspiration for some of the early approaches for GNNs (HeBrLe2015; defferrard2016convolutional; kipf2017; LeMoBrBr2017).
Neural differential equations
The interpretation of neural networks as discretised differential equations (‘neural ODEs’) (Chen2018a) was an influential recent result with multiple follow-up works (Dupont2019; Finlay2020; Liu2019; sdes2020). In deep learning on graphs, this mindset has been applied to GNN architectures (Avelar2019; Poli2019) and continuous message passing (Xhonneux2020). Continuous GNNs were also explored in (gu2020implicit) who, similarly to (Scarselli2008)
, addressed the solutions of fixed point equations. Ordinary Differential Equations on Graph Networks (GODE)(Zhuang2020a) approach the problem using the technique of invertible ResNets. Finally, (Sanchez-Gonzalez2019) used graph-based ODEs to generate physics simulations.
Solving PDEs with deep learning has been explored by (Raissi2017a). Neural networks appeared in (Li2020c) to accelerate PDE solvers with applications in the physical sciences. These have been applied to problems where the PDE can be described on a graph (Li2020d). (belbute2020combining) consider the problem of predicting fluid flow and use a PDE inside a GNN. These approaches differ from ours in that they solve a given PDE, whereas we use the notion of discretising PDEs as a principle to understand and design GNNs.
Neural ODEs on graphs
The most similar work to this is GRAND (Chamberlain2021a) of which BLEND can be considered a non-Euclidean extension. In addition there are several other works that apply the neural ODE framework to graphs. In GDE (Poli2019), GODE (Zhuang2020a) and CGNN (Xhonneux2020)
, the goal is to adapt neural ordinary differential equations to graphs. In contrast, we consider non-Euclidean partial differential equations and regard GNNs as particular instances of their discretisation (both spatial and temporal). We can naturally use spaces with any metric, in particular, extending recent works on hyperbolic graph embeddings. None of the previous techniques explore the link to differential geometry. More specifically, GDE, GODE, and CGNN consider Neural ODEs of the canonical formwhere is a graph neural network (GODE), the message passing component of a GNN (CGNN), or restricting to be layers of bijective functions on graphs (GDE). Furthermore, in CGNN only ODEs with closed form solutions are considered. (Sanchez-Gonzalez2019) on the other hand is quite distinct as they are not concerned with GNN design and instead use graph-based ODEs to generate physics simulations.
We developed a new class of graph neural networks based on the discretisation of a non-Euclidean diffusion PDE called Beltrami flow. We represent the graph structure and node features as a manifold in a joint continuous space, whose evolution by a parametric diffusion PDE (driven by the downstream learning task) gives rise to feature learning, positional encoding, and possibly also graph rewiring. Our experimental results show very good performance on popular benchmarks with a small fraction of parameters used by other GNN models. Perhaps most importantly, our framework establishes important links between GNNs, differential geometry, and PDEs – fields with a rich history and profound results that in our opinion are still insufficiently explored in the graph ML community.
While we show that our framework generalises many popular GNN architectures we seek to use the graph as a numerical representation of an underlying continuous space. This view of the graph as an approximation of a continuous latent structure is a common paradigm of manifold learning (tenenbaum2000global; belkin2003laplacian; coifman2006diffusion) and network geometry (boguna2021network). If adopted in graph ML, this mindset offers a rigorous mathematical framework for formalising and generalising some of the recent trends in the field, including the departure from the input graph as the basis for message passing (Alon2020), latent graph inference (kipf2018neural; wang2019dynamic; franceschi2019learning; kazi2020differentiable) higher-order (bodnar2021weisfeiler; maron2019provably) and directional (beaini2020directional) message passing (which can be expressed as anisotropic diffusion arising from additional structure of the underlying continuous space and different discretisation of the PDEs e.g. based on finite elements), and exploiting efficient numerical PDE solvers (Chen2018a). We leave these exciting directions for future research.
GNNs have recently become increasingly utilized in industrial applications e.g. in recommender systems and social networks, and hence could potentially lead to a negative societal impact if used improperly. We would like to emphasize that our paper does not study such potential negative applications and the mathematical framework we develop could help to interpret and understand existing GNN models. We believe that better understanding of ML models is key to managing their potential societal implications and preventing their negative impact.
The assumption that the graph can be modelled as a discretisation of some continuous space makes our framework applicable only to cases where the edge and node features are continuous in nature. Applications e.g. to knowledge graphs with categorical attributes could only be addressed by first embedding such attributes in a continuous space. Finally, the structural result presented in Theorem1 that link our model to the Polyakov action have not been implemented. This will be addressed in future works.
We thank Nils Hammerla and Gabriele Corso for feedback on early version of this manuscript. MB and JR are supported in part by ERC Consolidator grant no 724228 (LEMAN).
Appendix A Additional experimental results and implementation details
The statistics for the largest connected components of the experimental datasets are given in Table 4.
Replication of results and hyper-parameters
Code to regenerate our experimental results together with hyperparameters for all datasets is provided. The hyperparameters are listed in best_params.py. Refer to the README.md for instructions to run the experiments.
Numerical ODE solver
We use the library torchdiffeq (Chen2018a) to discretise the continuous time evolution and learn the system dynamics. The Pontryagin maximum / adjoint method is used to replace backpropagation for all datasets, with the exception of Cora and Citeseer due to the high memory complexity of applying backpropagation directly through the computational graph of the numerical integrator.
Decoupling the terminal integration time between inference and training
At training time we use a fixed terminal time that is tuned as a hyperparameter. For inference, we treat as a flexible parameter to which we apply a patience and measure the validation performance throughout the integration.
We report transductive node classification results in Tables 2 and 3 in the main paper. To facilitate direct comparison with many previous works, Table 2 uses the Planetoid splits given by (yang2016revisiting). As discussed in e.g. (shchur2018pitfalls), there are many limitations with results based on these fixed dataset splits and so we report a more robust set of experimental results in Table 3. Here we use 100 random splits with 20 random weight initializations. In each case 20 labelled nodes per class are used at training time, with the remaining labels split between test and validation. This methodology is consistent with (shchur2018pitfalls).
In Section 5.2 three types of positional encoding are described: DIGL PPR (Klicpera2019), DeepWalk (perozzi2014deepwalk) and hyperbolic embeddings (chamberlain2017neural; nickel2018learning). We provide the latter two as preprocessed pickle files within our repo. The DIGL PPR positional encodings are too large and so these are automatically generated and saved whenever code that requires them is run.
a.1 Diffusivity (attention) function
In section 4 of the main paper one choice of attention function is described. Additionally four alternatives were considered, which achieved roughly equivalent performance:
We additionally conducted an ablation study using these functions, results are given in Table 5.
|cosine_sim||83.8 0.5||73.5 1.1||79.0 2.3||92.8 0.1||84.8 0.7||93.2 0.4|
|exp_kernel||83.6 1.8||75.0 1.3||79.9 1.4||92.8 0.2||84.8 0.5||93.2 0.5|
|pearson||83.7 1.5||75.2 1.3||92.8 0.2||84.8 0.5|
|scaled dot||79.5 1.4||92.9 0.6|
Appendix B Theory of Beltrami flow
This section proves Theorem 1 in the main paper:
Under structural assumptions on the diffusivity, graph Beltrami flow
is the gradient flow of the discrete Polyakov functional.
b.1 Polyakov action and Beltrami flow on manifolds
In this section we briefly review the analysis in (kimmel1997high) to motivate the introduction of a discrete Polyakov action on graphs. Assume that and are Riemannian manifolds with coordinates and respectively. The Polyakov action for an embedding can be written as
with the volume form on associated with the metric . We restrict to the case where is the -dimensional Euclidean space , so that the functional becomes
The quantity above can be rewritten more geometrically as
From equation (19) we see that the Polyakov action is measuring the smoothness of the embedding - more precisely its Dirichlet energy with respect to the metric - on each feature channel . Given a geometric object , it is natural to find an optimal way of mapping it to a larger space, in this case . Accordingly, one can minimize the functional in (19) either with respect to the embedding or with respect to both the embedding and the metric . If we choose to minimize by varying both and we find that the metric must be the metric induced on by the map , namely its pullback. In local coordinates, this amounts to the constraint below:
Once the condition on is satisfied, the Euler-Lagrange equations for each feature channel become
where the operator is the Laplace-Beltrami operator on associated with the metric . As a specific example, if we embed an image into via a grey color mapping , the gradient flow associated with the functional and hence the Euler-Lagrange equation (21) for the grey channel is exactly the one reported in Section 2.
Therefore, from a high-level perspective, the Beltrami flow derived from the Polyakov action consists in minimizing a functional representing the Dirichlet energy of the embedding computed with respect to a metric depending - precisely via the pullback - on the embedding itself.
b.2 A discrete Polyakov action
We consider the graph counterpart to the problem of minimizing a generalized Dirichlet energy with respect to the embedding. Let be a simple, undirected and unweighted graph with . Also let and denote the space of signal functions and respectively. Recall that the graph gradient of is defined by
Its adjoint operator is called graph divergence and satisfies
Assume now we have a graph embedding of the form for some scaling , with . The coordinates and are called positional and feature coordinates respectively. In analogy with the Polyakov action (19), we introduce a modified Dirichlet energy measuring the smoothness of a graph embedding across its different channels: given a family of nonnegative maps , with , and , we define
where is the graph adjacency matrix. Defining the following -weighted norm for each
then (22) can be written as
Beyond the notational similarity with (19), the quantity sums the integrals over all channels - i.e. summations on the vertex set - of the norms of the gradients of exactly as for the continuum Polyakov action. The dependence of such norms on the embedding is to take into account that in the smooth case the Beltrami flow imposes the constraint (20), meaning that the metric with respect to which we compute the gradient norm of the embedding depends on the embedding itself. We note that the choice
yields the classic Dirichlet energy of a multi-channel graph signal
From now on we refer to the function in (22) as the discrete Polyakov action.
b.3 Proof of Theorem 1
We recall that given an embedding , we are interested in studying a discrete diffusion equation of the form
The coupling is called diffusivity. We now prove that the differential system above is the gradient flow of the discrete Polyakov action under additional assumptions on the structure of the diffusivity. We restate Theorem 1 in a more precise way:
Theorem 1. Let be a family of maps satisfying the assumptions
There exist a family of maps , with , such that
If we write , then we require
Then, the gradient flow associated with is given by (25), with the diffusivity satisfying
where the dependence of on is as in (26).
Remark. We observe that by taking to be the constant function one, we recover the classical case (24). In fact, for such choice the diffusivity satisfies and the gradient flow is given by the graph Laplacian, namely
Once we choose the family of maps as in the statement, the discrete Polyakov action is a map . To ease the notation, given a vector we write . To prove the result we now simply need to compute the gradient of the functional : given and we have
From the assumptions we can expand the partial derivative of the map as
where we have simply written Therefore, we can write (27) as
We now use the assumption (ii) to see that and similarly for . Up to renaming dummy indices we get
By inspection, we see that the right hand side can be rewritten as
with the diffusivity given in the statement of Theorem 1. Therefore, the gradient flow associated with is