Beltrami Flow and Neural Diffusion on Graphs

18 October 2021 · Benjamin Paul Chamberlain et al. (Twitter)

We propose a novel class of graph neural networks based on the discretised Beltrami flow, a non-Euclidean diffusion PDE. In our model, node features are supplemented with positional encodings derived from the graph topology and jointly evolved by the Beltrami flow, simultaneously producing continuous feature learning and topology evolution. The resulting model generalises many popular graph neural networks and achieves state-of-the-art results on several benchmarks.


1 Introduction

The majority of graph neural networks (GNNs) are based on the message passing paradigm (Gilmer2017), wherein node features are learned by means of a non-linear propagation on the graph. Multiple recent works have pointed to the limitations of the message passing approach: limited expressive power (shervashidze2011weisfeiler; xu2018how; bouritsas2020improving; bodnar2021weisfeiler), the related oversmoothing problem (NT2019; Oono2019), and the bottleneck phenomenon (Alon2020; wu2020graph), which render such approaches inefficient, especially in deep GNNs. Multiple alternatives have been proposed, among which are higher-order methods (maron2019provably; bodnar2021weisfeiler) and decoupling the propagation and input graphs by modifying the topology, often referred to as graph rewiring. Topological modifications can take different forms, such as graph sampling (Hamilton2017), k-NN graphs (Klicpera2019), using the complete graph (Vaswani2017; Alon2020), latent graph learning (wang2019dynamic; kazi2020differentiable), or multi-hop filters (wu2019simplifying; rossi2020sign). However, there is no agreement in the literature on when and how to modify the graph, nor a single principled framework for doing so.

A somewhat underappreciated fact is that GNNs are intimately related to diffusion equations (Chamberlain2021a), a connection that was exploited in the early work of Scarselli et al. (Scarselli2008). Diffusion PDEs have been historically important in computer graphics (sun2009concise; bronstein2010scale; litman2013learning; patane2016star), computer vision (caselles1997geodesic; chan2001active; bertalmio2000image), and image processing (perona1990scale; sochen1998general; weickert1998anisotropic; tomasi1998bilateral; durand2002fast; buades2005non), where they created an entire trend of variational and PDE-based approaches. In machine learning and data science, diffusion equations underpin such popular manifold learning methods as eigenmaps (belkin2003laplacian) and diffusion maps (coifman2006diffusion), as well as the family of PageRank algorithms (page1999pagerank; chakrabarti2007dynamic). In deep learning, differential equations are used as models of neural networks (Chamberlain2021a; Chen2018a; Dupont2019; Xhonneux2020; Queiruga2020; Zhuang2020a) and for physics-informed learning (Raissi2017a; cranmer2019learning; Sanchez-Gonzalez2019; cranmer2020lagrangian; shlomi2020graph; Li2020c).

Main contributions

In this paper, we propose a novel class of GNNs based on the discretised non-Euclidean diffusion PDE in joint positional and feature space, inspired by the Beltrami flow (sochen1998general) used two decades ago in the image processing literature for edge-preserving image denoising. We show that the discretisation of the spatial component of the Beltrami flow offers a principled view on positional encoding and graph rewiring, whereas the discretisation of the temporal component can replace GNN layers with more flexible adaptive numerical schemes. Based on this model, we introduce Beltrami Neural Diffusion (BLEND) that generalises a broad range of GNN architectures and shows state-of-the-art performance on many popular benchmarks. In a broader perspective, our approach explores new tools from PDEs and differential geometry that are less well known in the graph ML community.

2 Background

Beltrami flow

Kimmel et al. (kimmel1997high; sochen1998general; kimmel2000images) considered images as 2-manifolds (parametric surfaces) embedded in a larger ambient space $\mathbb{R}^{2+d}$ as $\mathbf{z}(x_1, x_2) = [x_1, x_2, \alpha\mathbf{f}(x_1, x_2)]$, where $\alpha$ is a scaling factor, $(x_1, x_2)$ are the 2D positional coordinates of the pixels, and $\mathbf{f}$ are the $d$-dimensional colour or feature coordinates (with $d = 1$ or $d = 3$ for grayscale or RGB images, or larger $d$ when using patches as features (buades2005non)). In these works, the image is evolved along the gradient flow of a functional called the Polyakov action (polyakov1981quantum), which roughly measures the smoothness of the embedding (explicitly, by minimising the functional with respect to the embedding one finds Euler-Lagrange (EL) equations that dictate the evolution of the embedding itself). For images embedded in Euclidean space, with the functional minimised with respect to both the embedding $\mathbf{z}$ and the metric $\mathbf{g}$, one obtains the following PDE:

$\frac{\partial \mathbf{z}}{\partial t} = \Delta_g\,\mathbf{z}$    (1)

and boundary conditions as appropriate. Here $\Delta_g$ is the Laplace-Beltrami operator, the Laplacian induced on the manifold by the Euclidean space $\mathbb{R}^{2+d}$ into which we embed the image. Namely, the embedding of the manifold allows us to pull back the Euclidean distance structure onto the image: the distance between two nearby points $\mathbf{x}$ and $\mathbf{x} + d\mathbf{x}$ is given by

$ds^2 = \|d\mathbf{x}\|^2 + \alpha^2\,\|d\mathbf{f}\|^2 = d\mathbf{x}^\top\big(\mathbf{I} + \alpha^2\,\nabla\mathbf{f}\,\nabla\mathbf{f}^\top\big)\,d\mathbf{x} = d\mathbf{x}^\top\mathbf{g}\,d\mathbf{x}$    (2)

where $\mathbf{g} = \mathbf{I} + \alpha^2\,\nabla\mathbf{f}\,\nabla\mathbf{f}^\top$ is a $2\times 2$ matrix called the Riemannian metric. The fact that the distance is a combination of the positional component (the distance between pixels in the plane, $\|d\mathbf{x}\|$) and the colour component (the distance between the colours of the pixels, $\alpha\|d\mathbf{f}\|$) is crucial, as it allows edge-preserving image diffusion.

When dealing with images, the evolution of the first two components of $\mathbf{z}$ is a nuisance amounting to a reparametrisation of the manifold and can be ignored. For grayscale images (the case $d = 1$, with a scalar intensity $f$), this is done by projecting onto the $f$ component, in which case the Beltrami flow takes the form of an inhomogeneous diffusion equation of $f$,

$\frac{\partial f}{\partial t} = \frac{1}{\sqrt{1+\alpha^2\|\nabla f\|^2}}\;\mathrm{div}\!\left(\frac{\nabla f}{\sqrt{1+\alpha^2\|\nabla f\|^2}}\right)$    (3)
Figure 1: Two interpretations of the Beltrami flow: position-dependent bilateral kernel (top) and a Gaussian passed on the manifold (bottom).

The diffusivity

$a(\mathbf{x}) = \frac{1}{\sqrt{1+\alpha^2\,\|\nabla f(\mathbf{x})\|^2}},$    (4)

determining the speed of diffusion at each point, can be interpreted as an edge indicator: diffusion is weak across edges, where $\|\nabla f\|$ is large. The result is an adaptive diffusion (perona1990scale) popular in image processing due to its ability to denoise images while preserving their edges. For $d > 1$ (multiple colour channels), equation (3) is applied to each channel separately; however, the metric $\mathbf{g}$ couples the channels, which results in their gradients becoming aligned (kimmel2012numerical).

Special cases

In the limit case $\alpha \to 0$, equation (3) becomes the simple homogeneous isotropic diffusion $\frac{\partial f}{\partial t} = \Delta f$, where $\Delta$ is the standard Euclidean Laplacian operator. The solution is given in closed form as the convolution of the initial image with a Gaussian kernel of time-dependent variance,

$f(\mathbf{x}, t) = f(\mathbf{x}, 0) \ast \frac{1}{4\pi t}\,e^{-\|\mathbf{x}\|^2/4t},$    (5)

and can be considered a simple linear low-pass filter. In the limit $t \to \infty$, the image becomes constant and equal to the average colour (assuming appropriate boundary conditions).

Another interpretation of the Beltrami flow is passing a Gaussian on the manifold (see Figure 1, bottom), which can locally be expressed as non-linear filtering with the bilateral kernel (tomasi1998bilateral) dependent on the joint positional and colour distance (Figure 1, top),

$f(\mathbf{x}, t) \approx \frac{\int f(\mathbf{y}, 0)\, e^{-\frac{\|\mathbf{x}-\mathbf{y}\|^2 + \alpha^2 |f(\mathbf{x},0)-f(\mathbf{y},0)|^2}{4t}}\, d\mathbf{y}}{\int e^{-\frac{\|\mathbf{x}-\mathbf{y}\|^2 + \alpha^2 |f(\mathbf{x},0)-f(\mathbf{y},0)|^2}{4t}}\, d\mathbf{y}}.$    (6)

For $\alpha = 0$, the bilateral filter (6) reduces to a simple convolution with a time-dependent Gaussian.
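The bilateral-filter view (6) is straightforward to prototype. The following is a minimal NumPy sketch, not the filter used in the cited references: each pixel is replaced by an average of its neighbours with weights decaying in the joint positional and colour distance, so smoothing is strong inside flat regions and weak across edges. The window radius and parameter values are illustrative assumptions.

    import numpy as np

    def bilateral_step(f, alpha=1.0, t=1.0, radius=3):
        # One pass of a bilateral-style filter: weights decay with both the
        # positional distance ||x - y||^2 and the colour distance |f(x) - f(y)|^2.
        h, w = f.shape
        out = np.zeros_like(f)
        for i in range(h):
            for j in range(w):
                i0, i1 = max(0, i - radius), min(h, i + radius + 1)
                j0, j1 = max(0, j - radius), min(w, j + radius + 1)
                patch = f[i0:i1, j0:j1]
                yy, xx = np.mgrid[i0:i1, j0:j1]
                pos2 = (yy - i) ** 2 + (xx - j) ** 2          # positional distance
                col2 = (patch - f[i, j]) ** 2                 # colour distance
                wgt = np.exp(-(pos2 + alpha ** 2 * col2) / (4 * t))
                out[i, j] = (wgt * patch).sum() / wgt.sum()
        return out

    # usage: a noisy step edge is smoothed in the flat regions but preserved at the jump
    f = np.hstack([np.zeros((32, 16)), np.ones((32, 16))]) + 0.1 * np.random.randn(32, 32)
    f_smoothed = bilateral_step(f, alpha=10.0, t=2.0)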

3 Discrete Beltrami flow on graphs

We now develop the analogue of the Beltrami flow for graphs. We consider a graph to be a discretisation of a continuous structure (a manifold), and show that the evolution of the feature coordinates in time amounts to the message passing layers of GNNs, whereas the evolution of the positional coordinates amounts to graph rewiring, as used in some GNN architectures.

3.1 Graph Beltrami flow

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected graph, where $\mathcal{V}$ and $\mathcal{E}$ denote the node and edge sets, respectively, with $|\mathcal{V}| = n$. We further assume node-wise $d$-dimensional features $\mathbf{x}_i \in \mathbb{R}^d$ for $i = 1, \ldots, n$. Denote by $\mathbf{z}_i = [\mathbf{u}_i, \alpha\mathbf{x}_i]$ the embedding of the graph in a joint space $\mathbb{R}^{d'} \times \mathbb{R}^d$, where $\mathbb{R}^{d'}$ is a $d'$-dimensional space with a metric representing the node coordinates (for simplicity, we will assume it Euclidean unless otherwise stated). We refer to $\mathbf{u}_i$, $\mathbf{x}_i$, and $\mathbf{z}_i$ as the positional, feature, and joint coordinates of node $i$, respectively, and arrange them into the matrices $\mathbf{U}$, $\mathbf{X}$, and $\mathbf{Z}$, of sizes $n\times d'$, $n\times d$, and $n\times(d'+d)$.

For images, the Beltrami flow amounts to evolving the embedding $\mathbf{z}$ by a diffusion of the form $\frac{\partial\mathbf{z}}{\partial t} = \mathrm{div}\big(a\,\nabla\mathbf{z}\big)$, with $a$ a diffusivity map. (Note that in equation (3) the diffusivity function coincides with (4); similarly to (sochen1998general, Section 4.2), we have neglected the extra term appearing in (3).) Thus, we consider the graph Beltrami flow to be the discrete diffusion equation

$\dot{\mathbf{u}}_i(t) = \sum_{j:(i,j)\in\mathcal{E}'} a\big(\mathbf{z}_i(t), \mathbf{z}_j(t)\big)\,\big(\mathbf{u}_j(t) - \mathbf{u}_i(t)\big), \qquad \dot{\mathbf{x}}_i(t) = \sum_{j:(i,j)\in\mathcal{E}'} a\big(\mathbf{z}_i(t), \mathbf{z}_j(t)\big)\,\big(\mathbf{x}_j(t) - \mathbf{x}_i(t)\big).$    (7)

We motivate our definition as follows: the node-wise differences $\mathbf{z}_j - \mathbf{z}_i$ and the weighted sums over the neighbours are the discrete analogies of the gradient $\nabla$ and divergence $\mathrm{div}$, both taken with respect to a graph $\mathcal{G}' = (\mathcal{V}, \mathcal{E}')$ that can be interpreted as the numerical stencil for the discretisation of the continuous Laplace-Beltrami operator in (3). Note that $\mathcal{E}'$ can be different from the input $\mathcal{E}$ (referred to as 'rewiring'). As discussed in Section 3.3, most GNNs use $\mathcal{E}' = \mathcal{E}$ (the input graph is used for diffusion, no rewiring). Alternatively, the positional coordinates of the nodes can be used to define a new graph topology, either with $\mathcal{E}' = \{(i, j) : \|\mathbf{u}_i - \mathbf{u}_j\| < r\}$ for some radius $r$, or using $k$ nearest neighbours in the positional space. This rewiring can be precomputed using the input positional coordinates (i.e., $\mathbf{U}(0)$) or updated throughout the diffusion (i.e., $\mathbf{U}(t)$). Therefore, (7) can be compactly rewritten as

$\dot{\mathbf{z}}_i(t) = \sum_{j:(i,j)\in\mathcal{E}'} a\big(\mathbf{z}_i(t), \mathbf{z}_j(t)\big)\,\big(\mathbf{z}_j(t) - \mathbf{z}_i(t)\big).$

The function $a$ is the diffusivity controlling the diffusion strength between nodes $i$ and $j$, and is assumed to be normalised: $\sum_{j:(i,j)\in\mathcal{E}'} a(\mathbf{z}_i, \mathbf{z}_j) = 1$. The dependence of the diffusivity on the embedding $\mathbf{z}$ matches the smooth PDE analysed in e.g. (sochen1998general, Section 4.2) and is consistent with the form of attention mechanisms used in e.g. (Velickovic2018; Vaswani2017). In matrix form, we can also rewrite (7) as

$\frac{d}{dt}\big[\mathbf{U}(t),\,\mathbf{X}(t)\big] = \big(\mathbf{A}(\mathbf{Z}(t)) - \mathbf{I}\big)\,\big[\mathbf{U}(t),\,\mathbf{X}(t)\big],$    (8)

where we emphasise the evolution of both the positional and feature components, coupled through the matrix-valued function $\mathbf{A}(\mathbf{Z}) = \big(a(\mathbf{z}_i, \mathbf{z}_j)\big)$ representing the diffusivity. The graph Beltrami flow produces an evolution of the joint positional and feature coordinates, $\mathbf{Z}(t) = [\mathbf{U}(t), \mathbf{X}(t)]$. In Section 3.3 we show how the evolution of the feature coordinates results in feature diffusion or message passing on the graph, the core of GNNs. As noted in Section 2, in the smooth case the Beltrami flow is obtained as the gradient flow of an energy functional minimised with respect to both the embedding and the metric on the surface (an image). When the embedding takes values in Euclidean space, this leads to equations of the form (3) with no channel mixing and an exact form of the diffusivity determined by the pull-back of the Euclidean metric. To further motivate our approach, it is tempting to investigate whether a similar conclusion can be attained here. Although in the discrete case the operation of pull-back is not well defined, we show that the gradient flow of a modified graph Dirichlet energy gives rise to an equation of the form (7). We note, though, that the gradient flow does not recover the exact form of the diffusivity implemented in this paper. This is not a limitation of the theory and should be expected: by requiring the gradient flow to avoid channel mixing and imitate the image analogy in (sochen1998general), and by inducing a discrete pull-back condition, we impose constraints on the problem. We leave the theoretical implications for future work and refer to Appendix B for a more thorough discussion, including definitions and proofs.

Theorem 1.

Under structural assumptions on the diffusivity, graph Beltrami flow (7) is the gradient flow of the discrete Polyakov functional.
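To make the discrete flow (7) concrete, here is a minimal NumPy sketch of its right-hand side for an arbitrary edge set and diffusivity. The Gaussian diffusivity at the end is purely illustrative; in BLEND the diffusivity is a learnable attention function (Section 4).

    import numpy as np

    def beltrami_rhs(Z, edges, diffusivity):
        # Right-hand side of (7): dz_i/dt = sum_{j:(i,j) in E'} a(z_i, z_j) (z_j - z_i),
        # where Z is an n x (d'+d) matrix of joint positional/feature coordinates
        # and `edges` is a list of directed pairs (i, j) defining the rewired graph E'.
        n = Z.shape[0]
        raw = np.array([diffusivity(Z[i], Z[j]) for i, j in edges])
        norm = np.full(n, 1e-12)                       # guard against isolated nodes
        for (i, _), aij in zip(edges, raw):
            norm[i] += aij
        dZ = np.zeros_like(Z)
        for (i, j), aij in zip(edges, raw):
            dZ[i] += (aij / norm[i]) * (Z[j] - Z[i])   # normalised diffusivity
        return dZ

    # illustrative diffusivity: a Gaussian of the joint positional/feature distance
    gaussian = lambda zi, zj: np.exp(-np.sum((zi - zj) ** 2))

    Z = np.random.randn(10, 4)                          # 10 nodes, d' + d = 4
    edges = [(i, j) for i in range(10) for j in range(10) if i != j]
    print(beltrami_rhs(Z, edges, gaussian).shape)       # (10, 4)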

3.2 Numerical solvers

Explicit vs implicit schemes

Equation (7) is solved numerically, which in the simplest case is done by replacing the continuous time derivative with a forward time difference:

$\frac{\mathbf{z}_i^{(k+1)} - \mathbf{z}_i^{(k)}}{\tau} = \sum_{j:(i,j)\in\mathcal{E}'} a\big(\mathbf{z}_i^{(k)}, \mathbf{z}_j^{(k)}\big)\,\big(\mathbf{z}_j^{(k)} - \mathbf{z}_i^{(k)}\big)$    (9)

Here $k$ denotes the discrete time index (iteration) and $\tau$ is the time step (discretisation parameter). Rewriting (9) compactly in matrix-vector form with $\mathbf{A}^{(k)} = \big(a(\mathbf{z}_i^{(k)}, \mathbf{z}_j^{(k)})\big)$ leads to the explicit Euler scheme:

$\mathbf{Z}^{(k+1)} = \mathbf{Q}^{(k)}\,\mathbf{Z}^{(k)},$    (10)

where $\mathbf{Z}^{(k)} = \big[\mathbf{U}^{(k)}, \mathbf{X}^{(k)}\big]$ and the matrix (diffusion operator) is given by

$\mathbf{Q}^{(k)} = \mathbf{I} + \tau\big(\mathbf{A}^{(k)} - \mathbf{I}\big).$

The solution to the diffusion equation is computed by applying scheme (10) multiple times in sequence, starting from some initial $\mathbf{Z}^{(0)}$. It is 'explicit' because the update is done directly by the application of the diffusion operator to $\mathbf{Z}^{(k)}$ (as opposed to implicit schemes of the form $\hat{\mathbf{Q}}^{(k)}\mathbf{Z}^{(k+1)} = \mathbf{Z}^{(k)}$ arising from backward time differences, which require inversion of the diffusion operator (Weickert1998)).
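A minimal NumPy sketch of the explicit scheme (10) follows; the softmax-of-distances attention is an illustrative stand-in for a learned diffusivity and operates on the complete graph for simplicity.

    import numpy as np

    def explicit_euler(Z0, attention, tau=0.25, steps=20):
        # Explicit Euler scheme (10): Z <- Q Z with Q = I + tau * (A(Z) - I),
        # where `attention` maps the current state to a row-stochastic diffusivity matrix.
        Z = Z0.copy()
        for _ in range(steps):
            A = attention(Z)            # diffusivity recomputed from the current state
            Z = Z + tau * (A @ Z - Z)   # one diffusion step
        return Z

    def toy_attention(Z):
        # softmax of negative pairwise squared distances (illustrative only)
        D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        W = np.exp(-D2)
        return W / W.sum(axis=1, keepdims=True)

    Z_T = explicit_euler(np.random.randn(30, 8), toy_attention)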

Multi-step and adaptive schemes

Higher-order approximations of temporal derivatives amount to using intermediate fractional steps, which are then linearly combined. Runge-Kutta (RK) (runge1895numerische; kutta1901beitrag), ubiquitously used in numerical analysis, is a classical family of explicit numerical schemes that includes Euler as a particular case. Dormand-Prince (DOPRI) (dormand1980family) is an RK method based on fifth- and fourth-order approximations, the difference between which is used as an error estimate guiding the time step size (shampine1986some).
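In the continuous-time formulation, the choice of scheme can be delegated to an off-the-shelf solver. The sketch below uses the torchdiffeq library (also used in our implementation, see Appendix A) with the adaptive Dormand-Prince solver; the untrained all-pairs dot-product attention is a simplifying assumption for illustration, and the tolerances and terminal time are arbitrary.

    import torch
    from torchdiffeq import odeint   # library used in our implementation (Appendix A)

    class BeltramiODEFunc(torch.nn.Module):
        # dZ/dt = (A(Z) - I) Z with an untrained dot-product attention over all node pairs.
        def __init__(self, dim):
            super().__init__()
            self.W = torch.nn.Linear(dim, dim, bias=False)

        def forward(self, t, Z):
            A = torch.softmax(self.W(Z) @ Z.t(), dim=1)   # row-stochastic diffusivity
            return A @ Z - Z

    func = BeltramiODEFunc(8)
    Z0 = torch.randn(50, 8)
    t = torch.tensor([0.0, 4.0])                          # integrate up to terminal time T = 4
    # dopri5 adapts the step size using the embedded 4th/5th-order error estimate
    ZT = odeint(func, Z0, t, method='dopri5', rtol=1e-3, atol=1e-4)[-1]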

Adaptive spatial discretisation and rewiring

Many numerical PDE solvers also employ adaptive spatial discretisation. The choice of the stencil (mesh) for spatial derivatives is made based on the character of the solution at these points; in the simulation of phenomena such as shock waves, it is often desirable to use denser sampling in the respective regions of the domain, which can change in time. A class of techniques for adapting the spatial discretisation in this way is known as Moving Mesh (MM) methods (hawken1991review). Interpreting the graph in (7) as the numerical stencil for the discretisation of the continuous Laplace-Beltrami operator in (3), we can regard rewiring as a form of MM.

3.3 Relation to graph neural networks

Equation (9) has the structure of many GNN architectures of the 'attentional' type (gdlbook), where the discrete time index $k$ corresponds to a (convolutional or attentional) layer of the GNN and multiple diffusion iterations amount to a deep GNN. In the diffusion formalism, the time parameter $t$ acts as a continuous analogue of the layers, in the spirit of neural differential equations (Chen2018a). Typical GNNs amount to explicit single-step (Euler) discretisation schemes, whereas our continuous interpretation can exploit more efficient numerical schemes.

GNNs as instances of graph Beltrami flow

The graph Beltrami framework leads to a family of graph neural networks that generalises many popular architectures (see Table 1). For example, GAT (Velickovic2018) can be obtained as a particular setting of our framework in which the input graph is fixed and only the feature coordinates are evolved. Equation (10) in this case becomes

$\mathbf{X}^{(k+1)} = \big[\mathbf{I} + \tau\big(\mathbf{A}(\mathbf{X}^{(k)}) - \mathbf{I}\big)\big]\,\mathbf{X}^{(k)}$    (11)

and corresponds to the update formula of GAT with a residual connection and the assumption of no non-linearity between the layers. The role of the diffusivity is played by a learnable parametric attention function, which is generally time-dependent: $a\big(\mathbf{x}_i, \mathbf{x}_j; \theta(t)\big)$. This results in separate attention parameters $\theta^{(k)}$ per layer $k$, which can be learned independently. Our intentionally simplistic choice of a time-independent attention function amounts to parameter sharing across layers. We will show in Section 5.1 that this leads to a smaller model that is less likely to overfit.

Another popular architecture, MoNet (Monti2016), uses linear diffusion of the features with weights dependent on the structure of the graph expressed as 'pseudo-coordinates', which can be cast as attention of the form $a(\mathbf{u}_i, \mathbf{u}_j)$. Transformers (vaswani2017attention) can be interpreted as feature diffusion on a fixed complete graph with attention $a(\mathbf{x}_i, \mathbf{x}_j)$ (gdlbook). Positional encoding (used in Transformers as well as in several recent GNN architectures (bouritsas2020improving; dwivedi2020generalization)) amounts to attention dependent on both $\mathbf{u}$ and $\mathbf{x}$, which allows the diffusion to adapt to the local structure of the graph; importantly, the positional coordinates are precomputed. Similarly, the DeepSets (zaheer2017deep) and PointNet (qi2017pointnet) architectures can be interpreted as GNNs applied on a graph with no edges ($\mathcal{E}' = \emptyset$), where each node is treated independently of the others (gdlbook). DIGL (Klicpera2019) performs graph rewiring as a pre-processing step using Personalised PageRank as positional coordinates, which are then fixed and not evolved. In the point cloud methods DGCNN (wang2019dynamic) and DGM (kazi2020differentiable), the graph is constructed based on the feature coordinates and rewired adaptively (in our notation, $\mathcal{E}'(\mathbf{X}(t))$).

Method | Evolution | Diffusivity | Graph | Discretisation
ChebNet | Features | Fixed | Fixed | Explicit, fixed step
GAT | Features | Learned (attention) | Fixed | Explicit, fixed step
MoNet | Features | Learned (attention) | Fixed | Explicit, fixed step
Transformer | Features | Learned (attention) | Fixed (complete) | Explicit, fixed step
DeepSet/PointNet | Features | — | Fixed (no edges) | Explicit, fixed step
DIGL | Features | Fixed | Fixed (rewired) | Explicit, fixed step
DGCNN/DGM | Features | Learned (attention) | Adaptive | Explicit, fixed step
Beltrami (BLEND) | Positions + features | Learned (attention) | Adaptive | Explicit, adaptive step / implicit
Table 1: GNN architectures interpreted as particular instances of our framework; for some methods, the attentional variant is considered.

Graph rewiring

Multiple authors have recently argued in favour of decoupling the input graph from the graph used for diffusion. Such rewiring can take the form of graph sampling (Hamilton2017) to address scalability issues, data denoising (Klicpera2019), removal of information bottlenecks (Alon2020), or larger multi-hop filters (rossi2020sign). The graph construction can also be made differentiable, so that a task-specific rewiring can be learned (wang2019dynamic; kazi2020differentiable). The statement of Klicpera et al. (Klicpera2019) that 'diffusion improves graph learning', leading to the eponymous paradigm (DIGL), can be understood as a form of diffusion on the graph connectivity independent of the features. Specifically, the authors used as node positional encoding the Personalised PageRank (PPR), which can be interpreted as the steady state of a diffusion process

$\dot{\mathbf{U}}(t) = \beta\big(\mathbf{U}(0) - \mathbf{U}(t)\big) - (1-\beta)\,\boldsymbol{\Delta}_{\mathrm{rw}}\,\mathbf{U}(t),$    (12)

where $\boldsymbol{\Delta}_{\mathrm{rw}}$ is the random walk graph Laplacian and $\beta$ is a parameter representing the restart probability. The resulting positional encoding of dimension $d' = n$ can be used to rewire the graph by $k$-NN sampling, which corresponds to using $\mathcal{E}'(\mathbf{U}(0))$ in our framework.
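As an illustration of this pipeline (and not the DIGL implementation itself), the sketch below computes a dense personalised PageRank matrix to serve as positional encoding and then rewires the graph by k-NN in those coordinates; sparse or approximate PPR would be required for larger graphs.

    import numpy as np

    def ppr_positional_encoding(A, beta=0.15):
        # Dense personalised PageRank matrix used as positional encoding
        # (beta plays the role of the restart probability).
        n = A.shape[0]
        P = A / A.sum(axis=1, keepdims=True)       # row-stochastic transition matrix
        return beta * np.linalg.inv(np.eye(n) - (1 - beta) * P)

    def knn_rewire(U, k=8):
        # Rewire the graph by connecting each node to its k nearest neighbours
        # in the positional coordinates U (one row per node).
        D = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(D, np.inf)                # exclude self-loops
        nbrs = np.argsort(D, axis=1)[:, :k]
        return [(i, j) for i in range(U.shape[0]) for j in nbrs[i]]

    # usage on a toy graph; a path is added so no node is isolated
    A = (np.random.rand(20, 20) < 0.2).astype(float)
    A = np.maximum(A, A.T)
    np.fill_diagonal(A, 0)
    A[np.arange(19), np.arange(1, 20)] = 1
    A[np.arange(1, 20), np.arange(19)] = 1
    edges = knn_rewire(ppr_positional_encoding(A), k=4)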

Numerical schemes

All the aforementioned GNN architectures can be seen as an explicit discretisation of equation (7) with a fixed step size. On the other hand, our continuous diffusion framework offers the additional advantage of employing more efficient numerical schemes with adaptive step size. Graph rewiring of the form $\mathcal{E}'(\mathbf{U}(t))$ can be interpreted as adaptive spatial discretisation (a moving mesh method).

3.4 Extensions

Figure 2: Graph Beltrami flow with hyperbolic positional coordinates.

Non-Euclidean geometry

There are multiple theoretical and empirical arguments (liu2019hyperbolic; chami2019hyperbolic) in favour of using hyperbolic spaces to represent real-life 'small-world' graphs (in particular, scale-free networks can be obtained as $k$-NN graphs in such spaces (boguna2021network)). Our framework allows using a non-Euclidean metric for the positional coordinates (Figure 2). In Section 5 we show that hyperbolic positional encodings allow for a significant reduction in model size with only a marginal degradation of performance.

Time-dependent diffusivity

The diffusivity function, which we assumed time-independent and which leads to parameter sharing across layers (i.e., updates of the form $a(\mathbf{z}_i^{(k)}, \mathbf{z}_j^{(k)}; \theta)$), can be made time-dependent, of the form $a(\mathbf{z}_i^{(k)}, \mathbf{z}_j^{(k)}; \theta, \theta^{(k)})$, where $\theta$ and $\theta^{(k)}$ denote shared and layer-dependent parameters, respectively.

Onsager diffusion

As we noted, the Beltrami flow diffuses each channel separately. A more general variant of diffusion allowing for feature mixing is the Onsager diffusion (onsager1931reciprocal), in which a matrix-valued function acting across the channels couples them during the diffusion. GCN (kipf2017) can be regarded as a particular setting thereof, with an update of the form $\mathbf{X}^{(k+1)} = \hat{\mathbf{A}}\,\mathbf{X}^{(k)}\mathbf{W}^{(k)}$, where $\hat{\mathbf{A}}$ is the normalised adjacency matrix and $\mathbf{W}^{(k)}$ mixes the channels.

MPNNs

Finally, we note that the Beltrami flow amounts to linear aggregation with non-linear coefficients, i.e. the 'attentional' flavour of GNNs (gdlbook). The more general message passing flavour (Gilmer2017) is possible using a generic non-linear equation of the form $\dot{\mathbf{z}}_i(t) = \Phi\big(\mathbf{z}_i(t), \{\mathbf{z}_j(t) : (i,j) \in \mathcal{E}'\}\big)$.

4 BLEND: Beltrami Neural Diffusion

Beltrami Neural Diffusion (BLEND) is a novel class of graph neural network architectures derived from the graph Beltrami framework. We assume an input graph $\mathcal{G}$ with $n$ nodes and $d$-dimensional node-wise features represented as a matrix $\mathbf{X}_{\mathrm{in}}$. We further assume a $d'$-dimensional positional encoding $\mathbf{U}_{\mathrm{in}}$ of the graph nodes. BLEND architectures implement a learnable joint diffusion process of the positional and feature coordinates and run it for time $T$ to produce output node embeddings

$\mathbf{Y} = \xi\big(\mathbf{Z}(T)\big), \qquad \mathbf{Z}(0) = \big[\psi(\mathbf{U}_{\mathrm{in}}),\, \phi(\mathbf{X}_{\mathrm{in}})\big],$

where $\psi$ and $\phi$ are learnable positional and feature encoders and $\xi$ is a learnable decoder (possibly changing the output dimensions). Here the scaling $\alpha$ in equations (2) and (8) is absorbed into $\phi$ and made learnable. The evolution $\mathbf{Z}(t)$ is given by the graph Beltrami flow equation (8), where the diffusivity function (attention) $a$ is also learnable. The choice of the attention function depends on the geometry of the positional encoding; for Euclidean encodings we find that the scaled dot product attention (Vaswani2017) performs well, in which case

$a(\mathbf{z}_i, \mathbf{z}_j) = \mathrm{softmax}\!\left(\frac{(\mathbf{W}_K\mathbf{z}_i)^\top \mathbf{W}_Q\mathbf{z}_j}{d_k}\right),$    (13)

where $\mathbf{W}_K$ and $\mathbf{W}_Q$ are learned matrices and $d_k$ is a hyperparameter.
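A sketch of this attention and of the encode-diffuse-decode pipeline is given below. The names (DotProductDiffusivity, enc_pos, enc_feat, dec) are hypothetical, the encoders and decoder are reduced to single linear layers, the attention is evaluated densely over all node pairs rather than over edges, and an explicit Euler loop replaces the adaptive solver; it is meant only to make the structure of BLEND concrete.

    import torch

    class DotProductDiffusivity(torch.nn.Module):
        # Scaled dot-product attention (13): a(z_i, z_j) = softmax_j((W_K z_i)^T W_Q z_j / d_k).
        def __init__(self, dim, d_k=16):
            super().__init__()
            self.WK = torch.nn.Linear(dim, d_k, bias=False)
            self.WQ = torch.nn.Linear(dim, d_k, bias=False)
            self.d_k = d_k

        def forward(self, Z):
            scores = self.WK(Z) @ self.WQ(Z).t() / self.d_k
            return torch.softmax(scores, dim=1)            # row-normalised diffusivity

    # encode -> diffuse -> decode (illustrative shapes and module names)
    n, pos_dim, feat_dim, hid, n_classes = 100, 16, 32, 64, 7
    enc_pos = torch.nn.Linear(pos_dim, hid)
    enc_feat = torch.nn.Linear(feat_dim, hid)
    dec = torch.nn.Linear(2 * hid, n_classes)
    attn = DotProductDiffusivity(2 * hid)

    U_in, X_in = torch.randn(n, pos_dim), torch.randn(n, feat_dim)
    Z = torch.cat([enc_pos(U_in), enc_feat(X_in)], dim=1)  # joint positional/feature coordinates
    tau = 0.25
    for _ in range(10):                                    # explicit Euler stand-in for the solver
        A = attn(Z)
        Z = Z + tau * (A @ Z - Z)
    logits = dec(Z)                                        # node classification logits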

5 Experimental results

In this section, we compare the proposed Beltrami framework to popular GNN architectures on standard node classification benchmarks and provide a detailed study of the choice of the positional encoding space. Implementation details, including runtimes and hyperparameter tuning, are given in Appendix C, and code is available at https://github.com/twitter-research/graph-neural-pde.

Datasets

In our experiments, we use the following datasets: Cora (mccallum2000automating), Citeseer (sen2008collective), Pubmed (namata2012query), CoauthorCS (shchur2018pitfalls), Amazon Computers and Photo (mcauley2015image), and OGB-arxiv (hu2020open). Since many works using the first three datasets rely on the Planetoid splits (yang2016revisiting), we include them in Table 2, together with a more robust evaluation on 100 random splits with 20 random initialisations (shchur2018pitfalls). In all cases the largest connected component is used; dataset statistics are given in Appendix A.

Baselines

We compare to the following GNN architectures: GCN (kipf2017), GAT (Velickovic2018), MoNet (Monti2016), and GraphSAGE (Hamilton2017), as well as recent ODE-based GNN models: CGNN (Xhonneux2020), GDE (Poli2019), GODE (Zhuang2020a), and two versions of LanczosNet (liao2019lanczosnet). We use two variants of our method: one using the fixed input graph (BLEND) and one using a $k$-NN graph built from the positional coordinates (BLEND-kNN).

5.1 Node Classification

Method | Cora | Citeseer | Pubmed
GCN | 81.9±0.8 | 69.5±0.9 | 79.0±0.5
GAT | 82.8±0.5 | 71.0±0.6 | 77.0±1.3
MoNet | 82.2±0.7 | 70.0±0.6 | 77.7±0.6
GS-maxpool | 77.4±1.0 | 67.0±1.0 | 76.6±0.8
Lanczos | 79.5±1.8 | 66.2±1.9 | 78.3±0.3
AdaLanczos | 80.4±1.1 | 68.7±1.0 | 78.1±0.4
CGNN | 81.7±0.7 | 68.1±1.2 | 80.2±0.3
GDE | 83.8±0.5 | 72.5±0.5 | 79.9±0.3
GODE | 83.3±0.3 | 72.4±0.6 | 80.1±0.3
BLEND | 84.2±0.6 | 74.4±0.7 | 80.7±0.7
BLEND-kNN | 83.1±0.8 | 73.3±0.9 | 81.5±0.5
Table 2: Performance (test accuracy ± std) of different GNN models using the Planetoid splits.

In these experiments, we followed the methodology of (shchur2018pitfalls), using 20 random weight initialisations for datasets with fixed Planetoid splits and 100 random splits otherwise. Where available, results from (shchur2018pitfalls) are reported. Hyperparameters with the highest validation accuracy were chosen and results are reported on a test set that is used only once. Hyperparameter search used Ray Tune (Liaw2018) with a thousand random trials using an asynchronous hyperband scheduler with a grace period and half-life of ten epochs. The code to reproduce our results is included with the submission and will be released publicly following the review process. Experiments ran on AWS p2.8xlarge machines, each with 8 Tesla V100-SXM2 GPUs.

Implementation details

For all datasets except ogb-arxiv, the adaptive explicit Dormand-Prince scheme was used as the numerical solver; for ogb-arxiv, we used the Runge-Kutta method. For the two smallest datasets (Cora and Citeseer), we performed direct backpropagation through each step of the numerical integrator. For the larger datasets, to reduce memory complexity, we use Pontryagin's maximum principle to propagate gradients backwards in time (pontryagin2018mathematical). For the larger datasets, kinetic energy and Jacobian regularisation (Finlay2020; Kelly2020) was also employed. The regularisation ensures that the learned dynamics are well-conditioned and easily solvable by a numerical solver, which reduced training time. We use constant initialisation for the attention weights, so training starts from a well-conditioned system that induces small regularisation penalty terms (Finlay2020).

The space complexity of BLEND is dominated by evaluating the attention (13) over edges and is $\mathcal{O}\big(|\mathcal{E}'|(d + d')\big)$, where $\mathcal{E}'$ is the edge set following rewiring, $d$ is the dimension of the features, and $d'$ is the dimension of the positional encoding. The runtime complexity scales the same quantity by the numbers of function evaluations in the forward and backward passes ($N_f$, $N_b$), and can be dominated by either depending on their relative size.

Tables 2 and 3 summarise the results of our experiments. BLEND outperforms other GNNs in most of the experiments, showing state-of-the-art results on some datasets. Another important point to note is that the compared GNNs use different sets of parameters per layer, whereas in BLEND, due to our choice of a time-independent attention, parameters are shared. This results in significantly fewer parameters: for comparison, the OGB versions of GCN, SAGE, and GAT used in the ogb-arxiv experiment require 143K, 219K, and 1.63M parameters respectively, compared to only 70K in BLEND.

Method | Cora | Citeseer | Pubmed | Coauthor CS | Computers | Photo | ogb-arxiv
GCN | 81.5±1.3 | 71.9±1.9 | 77.8±2.9 | 91.1±0.5 | 82.6±2.4 | 91.2±1.2 | 71.74±0.29
GAT | 81.8±1.3 | 71.4±1.9 | 78.7±2.3 | 90.5±0.6 | 78.0±19 | 85.7±20 | 73.01±0.19
GAT-ppr | 81.6±0.3 | 68.5±0.2 | 76.7±0.3 | 91.3±0.1 | 85.4±0.3 | 90.9±0.3 | –
MoNet | 81.3±1.3 | 71.2±2.0 | 78.6±2.3 | 90.8±0.6 | 83.5±2.2 | 91.2±2.3 | –
GS-mean | 79.2±7.7 | 71.6±1.9 | 77.4±2.2 | 91.3±2.8 | 82.4±1.8 | 91.4±1.3 | 71.49±0.27
GS-maxpool | 76.6±1.9 | 67.5±2.3 | 76.1±2.3 | 85.0±1.1 | – | 90.4±1.3 | –
CGNN | 81.4±1.6 | 66.9±1.8 | 66.6±4.4 | 92.3±0.2 | 80.3±2.0 | 91.4±1.5 | 58.70±2.50
GDE | 78.7±2.2 | 71.8±1.1 | 73.9±3.7 | 91.6±0.1 | 82.9±0.6 | 92.4±2.0 | 56.66±10.9
BLEND | 84.8±0.9 | 75.9±1.3 | 79.5±1.4 | 92.9±0.2 | 86.9±0.6 | 92.9±0.6 | 72.56±0.1
BLEND-kNN | 82.5±0.9 | 73.4±0.5 | 80.9±0.7 | 92.3±0.1 | 86.7±0.6 | 93.5±0.3 | –
Table 3: Performance (test accuracy ± std) of different GNN models using random splits. The OGB reference implementation of GAT has 1.5M parameters vs. our 70K. BLEND-kNN pre-processes the graph using the DIGL methodology (Klicpera2019) (Section 3.3), which constructs an n-dimensional representation of each node (an n-by-n matrix) and then sparsifies it into a kNN graph. The ogb-arxiv dataset has >150K nodes and runs out of memory; this is a limitation not of BLEND but of DIGL. Other forms of initial rewiring are possible, but we chose to compare with DIGL (arguably the most popular graph rewiring method), and so this entry is missing.

5.2 Positional encoding

Figure 3: An ablation study showing BLEND with and without positional encodings as well as GAT, the most similar conventional GNN with positional encodings added

In the second experiment, we investigate the impact of the positional encodings, varying their dimensionality and underlying geometry in order to showcase the flexibility of our framework.

Figure 3 shows that for all datasets BLEND is superior to a Euclidean model where positional encodings are not used, which corresponds to $d' = 0$ ('BLEND w/o positional' in Figure 3), and to a version of GAT where attention is computed over a concatenation of the features and the same positional encodings used in BLEND. The only exception is CoauthorCS, where the performance without positional encodings is unchanged.

We experimented with three forms of positional encoding: DIGL PPR embeddings of dimension $n$ (Klicpera2019), DeepWalk embeddings (perozzi2014deepwalk) of varying dimension, and hyperbolic embeddings in the Poincaré ball (chamberlain2017neural; nickel2018learning) of varying dimension. Positional encodings are calculated as a preprocessing step and input as the positional coordinates $\mathbf{U}(0)$ to BLEND. We calculated DeepWalk node embeddings using PyTorch Geometric's Node2Vec implementation with parameters $p = 1$, $q = 1$, context size 20, and 16 walks per node. We trained with walk lengths ranging between 40 and 120 and took the embeddings with the highest accuracy on the link prediction task. Shallow hyperbolic embeddings were generated using the HGCN (chami2019hyperbolic) implementation with the default parameters provided by the authors.
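For reference, DeepWalk-style encodings of the kind described above can be generated with PyTorch Geometric's Node2Vec module roughly as follows. The function name, embedding dimension, number of epochs, and learning rate are illustrative assumptions and may differ from the settings used for the reported results.

    import torch
    from torch_geometric.nn import Node2Vec

    def deepwalk_positional_encoding(edge_index, num_nodes, dim=64, epochs=5):
        # Train Node2Vec embeddings (p = q = 1 makes the random walks DeepWalk-like)
        # and return one positional vector per node.
        model = Node2Vec(edge_index, embedding_dim=dim, walk_length=80, context_size=20,
                         walks_per_node=16, p=1.0, q=1.0, num_nodes=num_nodes, sparse=True)
        loader = model.loader(batch_size=128, shuffle=True)
        opt = torch.optim.SparseAdam(list(model.parameters()), lr=0.01)
        model.train()
        for _ in range(epochs):
            for pos_rw, neg_rw in loader:
                opt.zero_grad()
                model.loss(pos_rw, neg_rw).backward()   # skip-gram loss with negative sampling
                opt.step()
        return model().detach()                         # (num_nodes, dim) positional encodings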

Figure 4 (left) compares the performance of the best model using the DIGL $n$-dimensional positional encodings against the best performing $d'$-dimensional hyperbolic positional encodings, with $d'$ tuned over the range 2-16. The average performance with DIGL positional encodings is 85.48, compared to 85.28 for hyperbolic encodings. Only in one case do the DIGL encodings outperform the best hyperbolic encodings. Figure 4 (right) shows the change in performance with the dimension of the positional encoding using a fixed hyperparameter configuration. As expected, we observe a monotonic increase in performance as a function of $d'$. Importantly, most of the performance is captured by 16-dimensional hyperbolic or Euclidean positional encodings, with a small additional margin gained by going up to $d' = n$, which is impractical for larger graphs.

Figure 4: Left: performance comparison between Euclidean and hyperbolic positional embeddings. Right: results of the positional embedding ablation. Hyperbolic or Euclidean embeddings of dimension 16 achieve performance comparable to Euclidean embeddings of dimension $d' = n$.

5.3 Additional ablations

Figure 5: Step size against accuracy for the explicit Euler scheme compared to the adaptive dopri5 solver.

In addition to studying the effect of positional encodings, we performed ablation studies on the step size used in the integrator as well as on different forms of attention.

In Figure 5 we study the effect of changing the step size of the integrator using the explicit Euler method, with a fixed terminal time set to the optimal terminal time. The left-hand side of the figure shows the performance using the adaptive step size Dopri5 solver for comparison. Dopri5 gives the most consistent performance and is best in three of the six datasets. Details of additional attention functions and their relative performance can be found in Appendix A.1.

6 Related work

Image processing, computer vision, and graphics.

Following the Perona-Malik scheme (perona1990scale), PDE-based approaches flourished in the 1990s with multiple versions of anisotropic (weickert1998anisotropic) and non-Euclidean (sochen1998general) diffusion used primarily for image denoising. The realisation of these ideas in the form of nonlinear filters (tomasi1998bilateral; buades2005non) was adopted in industry. PDE-based methods were also used for low-level vision tasks including inpainting (bertalmio2000image) and image segmentation (caselles1997geodesic; chan2001active). In computer graphics, fundamental solutions ('heat kernels') of diffusion equations were used as shape descriptors (sun2009concise). The closed-form expression of such solutions using the Laplace-Beltrami operator served as inspiration for some of the early approaches to GNNs (HeBrLe2015; defferrard2016convolutional; kipf2017; LeMoBrBr2017).

Neural differential equations

The interpretation of neural networks as discretised differential equations ('neural ODEs') (Chen2018a) was an influential recent result with multiple follow-up works (Dupont2019; Finlay2020; Liu2019; sdes2020). In deep learning on graphs, this mindset has been applied to GNN architectures (Avelar2019; Poli2019) and continuous message passing (Xhonneux2020). Continuous GNNs were also explored in (gu2020implicit), who, similarly to (Scarselli2008), addressed the solutions of fixed point equations. Ordinary Differential Equations on Graph Networks (GODE) (Zhuang2020a) approaches the problem using the technique of invertible ResNets. Finally, (Sanchez-Gonzalez2019) used graph-based ODEs to generate physics simulations.

Physics-inspired learning

Solving PDEs with deep learning has been explored by (Raissi2017a). Neural networks were used in (Li2020c) to accelerate PDE solvers with applications in the physical sciences, and have been applied to problems where the PDE can be described on a graph (Li2020d). (belbute2020combining) consider the problem of predicting fluid flow and use a PDE inside a GNN. These approaches differ from ours in that they solve a given PDE, whereas we use the notion of discretising PDEs as a principle to understand and design GNNs.

Neural ODEs on graphs

The work most similar to ours is GRAND (Chamberlain2021a), of which BLEND can be considered a non-Euclidean extension. In addition, there are several other works that apply the neural ODE framework to graphs. In GDE (Poli2019), GODE (Zhuang2020a), and CGNN (Xhonneux2020), the goal is to adapt neural ordinary differential equations to graphs. In contrast, we consider non-Euclidean partial differential equations and regard GNNs as particular instances of their discretisation (both spatial and temporal). We can naturally use spaces with any metric, in particular extending recent works on hyperbolic graph embeddings. None of the previous techniques explores the link to differential geometry. More specifically, GDE, GODE, and CGNN consider neural ODEs of the canonical form $\dot{\mathbf{X}}(t) = f(\mathbf{X}(t))$, where $f$ is a graph neural network (GODE), the message passing component of a GNN (CGNN), or is restricted to be layers of bijective functions on graphs (GDE). Furthermore, in CGNN only ODEs with closed-form solutions are considered. (Sanchez-Gonzalez2019), on the other hand, is quite distinct, as they are not concerned with GNN design and instead use graph-based ODEs to generate physics simulations.

7 Conclusion

We developed a new class of graph neural networks based on the discretisation of a non-Euclidean diffusion PDE called the Beltrami flow. We represent the graph structure and node features as a manifold in a joint continuous space, whose evolution by a parametric diffusion PDE (driven by the downstream learning task) gives rise to feature learning, positional encoding, and possibly also graph rewiring. Our experimental results show very good performance on popular benchmarks with a small fraction of the parameters used by other GNN models. Perhaps most importantly, our framework establishes important links between GNNs, differential geometry, and PDEs – fields with a rich history and profound results that, in our opinion, are still insufficiently explored in the graph ML community.

Future directions

While we show that our framework generalises many popular GNN architectures, we seek to use the graph as a numerical representation of an underlying continuous space. This view of the graph as an approximation of a continuous latent structure is a common paradigm of manifold learning (tenenbaum2000global; belkin2003laplacian; coifman2006diffusion) and network geometry (boguna2021network). If adopted in graph ML, this mindset offers a rigorous mathematical framework for formalising and generalising some of the recent trends in the field, including the departure from the input graph as the basis for message passing (Alon2020), latent graph inference (kipf2018neural; wang2019dynamic; franceschi2019learning; kazi2020differentiable), higher-order (bodnar2021weisfeiler; maron2019provably) and directional (beaini2020directional) message passing (which can be expressed as anisotropic diffusion arising from additional structure of the underlying continuous space and different discretisations of the PDEs, e.g. based on finite elements), and exploiting efficient numerical PDE solvers (Chen2018a). We leave these exciting directions for future research.

Societal impact

GNNs have recently become increasingly utilised in industrial applications, e.g. in recommender systems and social networks, and hence could potentially lead to negative societal impact if used improperly. We would like to emphasise that our paper does not study such potential negative applications, and the mathematical framework we develop could help to interpret and understand existing GNN models. We believe that a better understanding of ML models is key to managing their potential societal implications and preventing their negative impact.

Limitations

The assumption that the graph can be modelled as a discretisation of some continuous space makes our framework applicable only to cases where the edge and node features are continuous in nature. Applications, e.g. to knowledge graphs with categorical attributes, could only be addressed by first embedding such attributes in a continuous space. Finally, the structural result presented in Theorem 1 that links our model to the Polyakov action has not yet been implemented; this will be addressed in future work.

8 Acknowledgements

We thank Nils Hammerla and Gabriele Corso for feedback on an early version of this manuscript. MB and JR are supported in part by ERC Consolidator grant No. 724228 (LEMAN).

Appendix A Additional experimental results and implementation details

Datasets

The statistics for the largest connected components of the experimental datasets are given in Table 4.

Dataset Type Classes Features Nodes Edges Label rate
Cora citation 7 1433 2485 5069 0.056
Citeseer citation 6 3703 2120 3679 0.057
PubMed citation 3 500 19717 44324 0.003
Coauthor CS co-author 15 6805 18333 81894 0.016
Computers co-purchase 10 767 13381 245778 0.015
Photos co-purchase 8 745 7487 119043 0.021
OGB-Arxiv citation 40 128 169343 1166243 0.005
Table 4: Dataset Statistics

Replication of results and hyper-parameters

Code to regenerate our experimental results together with hyperparameters for all datasets is provided. The hyperparameters are listed in best_params.py. Refer to the README.md for instructions to run the experiments.

Numerical ODE solver

We use the torchdiffeq library (Chen2018a) to discretise the continuous time evolution and learn the system dynamics. The Pontryagin maximum / adjoint method is used in place of direct backpropagation for all datasets except Cora and Citeseer; for the larger datasets, backpropagating directly through the computational graph of the numerical integrator has prohibitively high memory complexity.

Decoupling the terminal integration time between inference and training

At training time we use a fixed terminal time $T$ that is tuned as a hyperparameter. For inference, we treat $T$ as a flexible parameter to which we apply a patience criterion, measuring the validation performance throughout the integration.

Data splits

We report transductive node classification results in Tables 2 and 3 in the main paper. To facilitate direct comparison with many previous works, Table 2 uses the Planetoid splits given by (yang2016revisiting). As discussed in e.g. (shchur2018pitfalls), there are many limitations with results based on these fixed dataset splits and so we report a more robust set of experimental results in Table 3. Here we use 100 random splits with 20 random weight initializations. In each case 20 labelled nodes per class are used at training time, with the remaining labels split between test and validation. This methodology is consistent with (shchur2018pitfalls).

Positional encodings

In Section 5.2 three types of positional encoding are described: DIGL PPR (Klicpera2019), DeepWalk (perozzi2014deepwalk) and hyperbolic embeddings (chamberlain2017neural; nickel2018learning). We provide the latter two as preprocessed pickle files within our repo. The DIGL PPR positional encodings are too large and so these are automatically generated and saved whenever code that requires them is run.

A.1 Diffusivity (attention) function

In Section 4 of the main paper, one choice of attention function is described. Four alternatives were additionally considered, which achieved roughly equivalent performance (a sketch of two such variants is given after the list):

  • Scaled dot

    (14)
  • Cosine similarity

    (15)
  • Pearson correlation

    (16)
  • Exponential kernel

    (17)
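The exact parametrisations of (14)-(17) are not reproduced above; as a rough illustration of the kind of functions involved, the sketch below implements plausible cosine-similarity and exponential-kernel diffusivities over dense key/query projections. These are assumptions for illustration, not the precise forms used in the ablation.

    import torch
    import torch.nn.functional as F

    def cosine_sim_diffusivity(Z, WK, WQ):
        # row-normalised diffusivity from the cosine similarity of key/query projections
        K, Q = F.normalize(Z @ WK, dim=1), F.normalize(Z @ WQ, dim=1)
        return torch.softmax(K @ Q.t(), dim=1)

    def exp_kernel_diffusivity(Z, WK, WQ, sigma=1.0):
        # row-normalised diffusivity from a Gaussian kernel of the key/query distance
        K, Q = Z @ WK, Z @ WQ
        return torch.softmax(-torch.cdist(K, Q) ** 2 / (2 * sigma ** 2), dim=1)

    Z = torch.randn(30, 8)
    WK, WQ = torch.randn(8, 16), torch.randn(8, 16)
    A_cos, A_exp = cosine_sim_diffusivity(Z, WK, WQ), exp_kernel_diffusivity(Z, WK, WQ)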

We additionally conducted an ablation study using these functions; the results are given in Table 5.

Method | Cora | Citeseer | Pubmed | CoauthorCS | Computers | Photo
cosine_sim | 83.8±0.5 | 73.5±1.1 | 79.0±2.3 | 92.8±0.1 | 84.8±0.7 | 93.2±0.4
exp_kernel | 83.6±1.8 | 75.0±1.3 | 79.9±1.4 | 92.8±0.2 | 84.8±0.5 | 93.2±0.5
pearson | 83.7±1.5 | 75.2±1.3 | – | 92.8±0.2 | 84.8±0.5 | –
scaled dot | – | – | 79.5±1.4 | – | – | 92.9±0.6
Table 5: Performance (test accuracy ± std) with different attention functions. Scaled dot product attention performs best in four of the six experiments.

Appendix B Theory of Beltrami flow

This section proves Theorem 1 in the main paper:

Theorem 2.

Under structural assumptions on the diffusivity, graph Beltrami flow

$\dot{\mathbf{z}}_i(t) = \sum_{j:(i,j)\in\mathcal{E}'} a\big(\mathbf{z}_i(t), \mathbf{z}_j(t)\big)\,\big(\mathbf{z}_j(t) - \mathbf{z}_i(t)\big)$    (18)

is the gradient flow of the discrete Polyakov functional.

B.1 Polyakov action and Beltrami flow on manifolds

In this section we briefly review the analysis in (kimmel1997high) to motivate the introduction of a discrete Polyakov action on graphs. Assume that $(M, g)$ and $(N, h)$ are Riemannian manifolds with local coordinates $x$ and $z$, respectively. The Polyakov action for an embedding $Z : M \to N$ can be written as

$S[Z, g, h] = \int_M g^{\mu\nu}(x)\,\partial_\mu Z^a(x)\,\partial_\nu Z^b(x)\,h_{ab}(Z(x))\,d\mu_g,$

with $d\mu_g$ the volume form on $M$ associated with the metric $g$. We restrict to the case where $N$ is the $(d'+d)$-dimensional Euclidean space $\mathbb{R}^{d'+d}$, so that the functional becomes

$S[Z, g] = \sum_a \int_M g^{\mu\nu}(x)\,\partial_\mu Z^a(x)\,\partial_\nu Z^a(x)\,d\mu_g.$

The quantity above can be rewritten more geometrically as

$S[Z, g] = \sum_a \int_M \|\nabla_g Z^a\|_g^2\; d\mu_g.$    (19)

From equation (19) we see that the Polyakov action measures the smoothness of the embedding - more precisely, its Dirichlet energy with respect to the metric $g$ - on each channel $Z^a$. Given a geometric object $M$, it is natural to find an optimal way of mapping it to a larger space, in this case $\mathbb{R}^{d'+d}$. Accordingly, one can minimise the functional in (19) either with respect to the embedding $Z$ or with respect to both the embedding $Z$ and the metric $g$. If we choose to minimise by varying both $Z$ and $g$, we find that the metric must be the one induced on $M$ by the map $Z$, namely its pullback. In local coordinates, this amounts to the constraint below:

$g_{\mu\nu}(x) = \sum_a \partial_\mu Z^a(x)\,\partial_\nu Z^a(x).$    (20)

Once the condition on $g$ is satisfied, the Euler-Lagrange equations for each channel $Z^a$ give rise to the gradient flow

$\frac{\partial Z^a}{\partial t} = \Delta_g Z^a,$    (21)

where $\Delta_g$ is the Laplace-Beltrami operator on $M$ associated with the metric $g$. As a specific example, if we embed an image into $\mathbb{R}^3$ via a grey colour mapping $(x_1, x_2) \mapsto [x_1, x_2, \alpha f(x_1, x_2)]$, the gradient flow associated with the functional, and hence the Euler-Lagrange equation (21) for the grey channel, is exactly the one reported in Section 2.

Therefore, from a high-level perspective, the Beltrami flow derived from the Polyakov action consists in minimizing a functional representing the Dirichlet energy of the embedding computed with respect to a metric depending - precisely via the pullback - on the embedding itself.

B.2 A discrete Polyakov action

We consider the graph counterpart of the problem of minimising a generalised Dirichlet energy with respect to the embedding. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a simple, undirected and unweighted graph with $|\mathcal{V}| = n$. Also let $\mathcal{F}(\mathcal{V})$ and $\mathcal{F}(\mathcal{E})$ denote the spaces of signal functions on $\mathcal{V}$ and $\mathcal{E}$, respectively. Recall that the graph gradient of $f \in \mathcal{F}(\mathcal{V})$ is defined by

$(\nabla f)_{ij} = f_j - f_i, \qquad (i, j) \in \mathcal{E}.$

Its adjoint operator $\mathrm{div} = \nabla^{*}$ is called the graph divergence and satisfies

$\langle \nabla f, F \rangle_{\mathcal{F}(\mathcal{E})} = \langle f, \mathrm{div}\,F \rangle_{\mathcal{F}(\mathcal{V})} \quad \text{for all } f \in \mathcal{F}(\mathcal{V}),\; F \in \mathcal{F}(\mathcal{E}).$
Assume now we have a graph embedding of the form $\mathbf{z}_i = [\mathbf{u}_i, \alpha\mathbf{x}_i]$ for some scaling $\alpha > 0$, with $\mathbf{u}_i \in \mathbb{R}^{d'}$ and $\mathbf{x}_i \in \mathbb{R}^{d}$. The coordinates $\mathbf{u}_i$ and $\mathbf{x}_i$ are called positional and feature coordinates, respectively. In analogy with the Polyakov action (19), we introduce a modified Dirichlet energy measuring the smoothness of a graph embedding across its different channels: given a family of nonnegative maps depending on the embedding, one per channel, we define

(22)

where $\mathbf{A} = (a_{ij})$ is the graph adjacency matrix. Defining the following weighted norm for each channel,

then (22) can be written as

(23)

Beyond the notational similarity with (19), the quantity sums over all channels the integrals - i.e., summations over the vertex set - of the norms of the gradients of the embedding, exactly as for the continuum Polyakov action. The dependence of such norms on the embedding accounts for the fact that in the smooth case the Beltrami flow imposes the constraint (20), meaning that the metric with respect to which we compute the gradient norm of the embedding depends on the embedding itself. We note that the choice of taking these maps to be identically equal to one    (24)    yields the classic Dirichlet energy of a multi-channel graph signal,

$\mathcal{E}_{\mathrm{Dir}}(\mathbf{Z}) = \frac{1}{2} \sum_{(i,j) \in \mathcal{E}} \|\mathbf{z}_j - \mathbf{z}_i\|^2.$

From now on we refer to the functional in (22) as the discrete Polyakov action.

B.3 Proof of Theorem 1

We recall that given an embedding $\mathbf{Z}$, we are interested in studying a discrete diffusion equation of the form

$\dot{\mathbf{z}}_i(t) = \sum_{j:(i,j)\in\mathcal{E}} a\big(\mathbf{z}_i(t), \mathbf{z}_j(t)\big)\,\big(\mathbf{z}_j(t) - \mathbf{z}_i(t)\big).$    (25)

The coupling $a$ is called the diffusivity. We now prove that the differential system above is the gradient flow of the discrete Polyakov action under additional assumptions on the structure of the diffusivity. We restate Theorem 1 in a more precise way:

Theorem 1. Let be a family of maps satisfying the assumptions

  • There exist a family of maps , with , such that

    (26)
  • If we write , then we require

Then, the gradient flow associated with is given by (25), with the diffusivity satisfying

where the dependence of on is as in (26).

Remark. We observe that by taking to be the constant function one, we recover the classical case (24). In fact, for such choice the diffusivity satisfies and the gradient flow is given by the graph Laplacian, namely

Proof.

Once we choose the family of maps as in the statement, the discrete Polyakov action is a map . To ease the notation, given a vector we write . To prove the result we now simply need to compute the gradient of the functional : given and we have

(27)

From the assumptions we can expand the partial derivative of the map as

where we have simply written Therefore, we can write (27) as

We now use the assumption (ii) to see that and similarly for . Up to renaming dummy indices we get

By inspection, we see that the right hand side can be rewritten as

with the diffusivity given in the statement of Theorem 1. Therefore, the gradient flow associated with is