Rewiring Networks for Graph Neural Network Training Using Discrete Geometry

Abstract

Information over-squashing is a phenomenon of inefficient information propagation between distant nodes on networks. It is an important problem that is known to significantly impact the training of graph neural networks (GNNs), as the receptive field of a node grows exponentially. To mitigate this problem, a preprocessing procedure known as rewiring is often applied to the input network. In this paper, we investigate the use of discrete analogues of classical geometric notions of curvature to model information flow on networks and rewire them. We show that these classical notions achieve state-of-the-art performance in GNN training accuracy on a variety of real-world network datasets. Moreover, compared to the current state-of-the-art, these classical notions exhibit a clear advantage in computational runtime by several orders of magnitude.

Keywords:

Discrete curvature, geometric deep learning, graph neural networks, graph rewiring, information over-squashing.

1 Introduction

The abundance of available data has resulted in data captured by structures beyond vectors living in Euclidean space. Much fundamental information is encoded in data that exhibit more complex structures, some with a distinct geometric characterization, such as networks. The driving premise underlying geometric deep learning is that this geometry encompasses information that is crucial to take into consideration when developing machine learning techniques (specifically, deep learning) to handle these data (geometric-learning). Thus, the inherent, non-Euclidean geometry of the data structures, as well as the space they live in, are fundamental aspects to understand and build into deep learning architectures.

In this paper, we consider network data and an important problem associated with training graph neural networks (GNNs). Specifically, we study the problem of information over-squashing (bottleneck; bottleneck-bronstein), which amounts to inefficient information propagation between distant nodes on a graph. This phenomenon is especially significant in tree-like graphs, where multiple nodes lead to a single node—namely, the “bottleneck.” A common approach to mitigate this problem is to perform graph rewiring on the input network data by adding or suppressing edges in the network in order to alleviate such bottlenecks and increase the efficiency of information flow over a network. Recent pioneering work by bottleneck-bronstein models information flow on a network using notions of discrete curvature and uses this network curvature information to perform graph rewiring prior to training GNNs, yielding the current state-of-the-art for GNN training in the presence of bottlenecks. In particular, a novel discrete curvature, the balanced Forman curvature, was introduced and utilized to identify bottlenecks and rewire graphs prior to training for increased efficiency in information propagation over networks (bottleneck-bronstein).

The discretization of classical notions of smooth geometry has been actively studied in recent decades (see, for instance, najman-romon), resulting in various definitions of discrete curvature. An important motivation behind such discretizations is the application of geometric methods to statistics and machine learning tasks for data exhibiting discrete geometric structure, such as network learning by sampling (e.g., barkanass2020geometric; sigbeku2021curved). In this work, we return to these fundamental principles and study the alleviation of information over-squashing by graph rewiring following the procedure used by bottleneck-bronstein. We systematically test and compare the performance of several classical discrete curvature notions against the recently proposed balanced Forman curvature on several benchmarking datasets, and find that these classical discrete curvature notions are able to achieve state-of-the-art performance in terms of accuracy for GNN training. Moreover, the computation of these classical discrete curvatures is much more efficient and runs several orders of magnitude faster than the state-of-the-art.

The remainder of this paper is organized as follows. Section 2 discusses GNNs and the problem of information over-squashing in further detail, and briefly surveys various approaches to reducing over-squashing. In Section 3, we then turn to the mathematical details of discrete curvature and formally present the notions studied in this paper, including the balanced Forman curvature recently proposed by bottleneck-bronstein. We also give an overview of the procedure for identifying network bottlenecks and performing graph rewiring adopted by bottleneck-bronstein, which uses discretizations of smooth curvature concepts; this is the same procedure that we implement with the various classical discrete curvature notions. In Section 4, we describe the data we study and our experimental design and setup. In Section 5, we demonstrate the method on a wide variety of benchmarking datasets and present the accuracy and computational runtime results. Finally, we close in Section 6 with a discussion and some proposals for future research.

2 Graph Neural Networks and Information Over-Squashing

Neural networks are systems of algorithms that aim to identify underlying relationships in data in a manner similar to how biological neural networks in brains function. They consist of collections of artificial “neurons” and “synapses” that are typically organized into layers. Among deep neural networks, a specific class is adapted to handling graph or network data (collections of vertices connected by edges that mathematically describe a dependency structure), which is the focus of this paper.

The main difference between traditional deep neural networks and GNNs lies in the message passing algorithm (message-passing). Briefly, in message passing, at each layer and for each node, features from the neighboring nodes are aggregated before updating the features of the target node; a minimal sketch is given below. This is the mechanism by which the network captures the information from the graph structure of the data.
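The following sketch illustrates the aggregate-then-update pattern in its simplest (GCN-like) form; the mean aggregation, the ReLU nonlinearity, and all names are our own illustrative choices, not a prescription from the papers cited here.

import numpy as np

def message_passing_layer(A, H, W):
    # One schematic message-passing layer: every node averages the features
    # of its neighborhood (including itself, via added self-loops), then
    # applies a learned linear map W and a nonlinearity.
    A_tilde = A + np.eye(len(A))              # self-loops keep each node's own features
    deg = A_tilde.sum(axis=1, keepdims=True)  # neighborhood sizes
    H_agg = (A_tilde @ H) / deg               # aggregate: mean over neighbors
    return np.maximum(H_agg @ W, 0.0)         # update: ReLU(H_agg W)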

2.1 Information Over-Squashing

Training GNNs presents new issues in comparison to standard neural network training due to the discrete geometric structure of network data. A particularly important challenge that has recently gained much research interest is that of information over-squashing, also known as the bottleneck problem (e.g., bottleneck; bottleneck-bronstein).

In over-squashing, the principal concern is that the influence of certain node features (which may be important) may be too small and eventually have minimal or no impact on the features of distant nodes on the network when performing message passing over the GNN. This is particularly problematic in the context of network data, since the receptive field of a graph node is known to grow exponentially (bottleneck).

Example 1.

In a binary tree, let the $k$-jump neighborhood of a node $v$ be the set of nodes in the graph whose shortest path to $v$ contains at most $k$ edges. Then the receptive field of the root doubles with every jump: there are roughly twice as many nodes in the $(k+1)$-jump neighborhood as in the $k$-jump neighborhood for any integer $k \geq 1$. See Figure 1.

Figure 1: The receptive field of the root node in the binary tree grows exponentially. When information from a node reaches the root, it is “squashed” together with the information from all of the other nodes in the right-hand subtree.
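As a quick numerical check of Example 1, the following sketch counts the $k$-jump neighborhood sizes of the root in a balanced binary tree (the tree construction and root label follow networkx conventions and are not from the original paper):

import networkx as nx

# Balanced binary tree of height 8; networkx labels the root as node 0.
T = nx.balanced_tree(r=2, h=8)
dist = nx.single_source_shortest_path_length(T, 0)
for k in range(1, 6):
    # Size of the k-jump neighborhood of the root: 2^(k+1) - 1 nodes.
    print(k, sum(1 for d in dist.values() if d <= k))
# Prints sizes 3, 7, 15, 31, 63: the receptive field roughly doubles per jump.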

Thus, over-squashing is a crucial issue to take into consideration when training on graph data, especially when the long-range dependencies in the data are important for the learning task. It is primarily caused by the poor propagation of long-distance information through certain specific edges in the graph. As an illustration, consider two components that are cliques (or two densely connected, clique-like graphs) connected only by a single edge, illustrated in Figure 2(a). When propagating information from a node in the source component to a node in the target component, over-squashing is likely to happen as the information is crowded or “squashed” together with all other node features from the source component. This happens on the edge connecting the two components, which, here, is the main source of over-squashing in the graph and is called a bottleneck.

2.2 Mitigating Over-Squashing

Bottlenecks may be alleviated to reduce over-squashing via graph rewiring, which adds or suppresses edges in the graph to obtain a new graph with the same nodes and node features, but a different set of edges. The goal of the rewiring is to better support the bottleneck and give alternative routes of access between components, which improves the information flow between components and reduces the risk that features become crowded out (over-squashed). Edges that have little impact on the flow of information in the graph can be deleted to control the size of the graph. An example of a rewired graph with an alleviated bottleneck is shown in Figure 2(b).

Several approaches to bottleneck alleviation have been proposed in the recent literature. For example, digl propose graph diffusion convolution (GDC) as a graph rewiring approach using a discretization of the gas diffusion equation to model the propagation of information on a network. However, this method fails to capture long-range dependencies on networks (bottleneck-bronstein). There also exist other bottleneck alleviation methods that do not entail rewiring. For example, much in the spirit of the work of this paper, cgnn propose a curvature GNN (CGNN) which, instead of adding or deleting edges, assigns specific weights to graph edges as a measure of how much information flows over each edge, where the weights are determined by discrete curvature. The work of bottleneck-bronstein uses a hybrid of these two methods, in which graph rewiring is performed driven by a newly proposed discrete curvature.

While graph rewiring effectively alleviates the over-squashing problem, it does present limitations. First, on some types of data, the rewiring approach may not be applicable at all; an example is chemical data where the graphs represent molecules, and adding or deleting edges changes the molecule under study entirely. Second, rewiring alters the structure (topology) of the graph and changes the information that may be captured from the graph connectivity, which can negatively impact feature recognition (bottleneck). In this case, there is a trade-off to consider between reducing bottlenecks and changing the graph topology. It is therefore important to obtain a measure of the severity of the bottleneck (the bottleneckness of the graph) and use it as a guide when performing graph rewiring.

Figure 2: Graph rewiring reduces over-squashing. (fig:rewiring1): A graph with a bottleneck (blue edge). (fig:rewiring2): A possible rewiring that alleviates the bottleneck.

2.3 Quantifying Over-Squashing

In this paper, we use the Jacobian as a measure of the bottleneckness of a graph. Consider a graph with $n$ nodes; take two nodes $u, v$ that are at a distance $r$ from each other, where $r$ is the number of edges on the shortest path between these nodes. To quantify over-squashing, we need to measure how much of an impact the feature vector of $v$ has on the feature vector of $u$ after $r$-many forward passes (i.e., message passing is performed $r$ times); denote this impact by $\partial h_u^{(r)} / \partial h_v^{(0)}$, where $h_u^{(r)}$ is the feature vector of $u$ after $r$ rounds of message passing. Then the Jacobian

\[ \frac{\partial h_u^{(r)}}{\partial h_v^{(0)}} \tag{1} \]

for $r$-distance dependencies quantifies the over-squashing in a graph (bottleneck-bronstein).

bottleneck-bronstein show that the bounds on the entries of the Jacobian of $r$-distance dependencies are proportional to the respective $r$th powers of the normalized augmented adjacency matrix $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A}$ is the adjacency matrix augmented with self-loops and $\tilde{D}$ its diagonal degree matrix, i.e.,

\[ \left| \frac{\partial h_u^{(r)}}{\partial h_v^{(0)}} \right| \leq c \, (\hat{A}^r)_{uv} \tag{2} \]

for a constant $c > 0$. The powers of the normalized augmented adjacency matrix then measure the degree to which a given graph is prone to over-squashing. In other words, bottlenecks are associated with the small-valued entries of the powers of the matrix. Note that only nonzero values are considered, since a zero value indicates that no information can travel between the two nodes.
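A minimal sketch of this diagnostic, assuming the standard construction of $\hat{A}$ above (function and variable names are ours; this is not the authors' script):

import numpy as np
import networkx as nx

def min_nonzero_entries_of_powers(G, powers=(5, 10, 20, 40)):
    # Minimum nonzero entries of powers of the normalized augmented
    # adjacency matrix in (2); the faster these values decay in r,
    # the more prone the graph is to over-squashing.
    A = nx.to_numpy_array(G)
    A_tilde = A + np.eye(len(A))                 # augment with self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # normalized augmented adjacency
    out, P = {}, np.eye(len(A))
    for r in range(1, max(powers) + 1):
        P = P @ A_hat                            # P = A_hat^r
        if r in powers:
            out[r] = P[P > 0].min()
    return out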

It is also important to note that in our setting of graphs, the Jacobian (1) is computed as a discrete derivative. In our work, we assume that the Jacobian is computed by numerical approximations; no further details were provided by bottleneck-bronstein.

There also exist other measures to quantify over-squashing. For example, the Cheeger constant is a direct measure of over-squashing that captures how easy or difficult it is to totally disconnect a graph; however, it is known to be NP-hard to compute (cheeger).

3 Discrete Geometry and Curvature

In this section, we turn to the mathematical aspects of discrete curvature, which may be used to model information flow on a network (e.g., cgnn; bottleneck-bronstein). Here, we present the origins of smooth geometry and curvature, and discuss the evolution towards discrete notions. Most importantly, we define all discrete curvatures that will be implemented in our study.

3.1 Ricci Curvature and Ricci Flow

The Ricci curvature of differential geometry is, roughly speaking, a measure that quantifies the extent to which a Riemannian manifold locally differs from Euclidean space in various tangential directions. In particular, Ricci curvature determines whether two geodesics shot in parallel from two nearby points on a given manifold tend to converge, remain parallel, or diverge along the manifold. The curvature is positive if the geodesics converge to a single point; zero if the geodesics remain parallel; and negative if the geodesics diverge; see Figure 3. The quicker the convergence or divergence, the larger the absolute value of the Ricci curvature.

Figure 3: View of two geodesics shot in parallel from blue points on example manifolds to illustrate Ricci curvature. (fig:sphere1), (fig:sphere2) Two perspectives on a sphere, where the geodesics converge at the top of the sphere. (fig:plane) On a plane, the geodesics remain parallel. (fig:hyperbolic) On a hyperbolic manifold, the geodesics diverge.

Ricci Flow.

Ricci curvature can be used to smooth a manifold via the Ricci flow, namely the partial differential equation

\[ \frac{\partial g}{\partial t} = -2 \, \mathrm{Ric}(g), \tag{3} \]

where $g$ denotes the Riemannian metric and $\mathrm{Ric}$ the Ricci curvature (ricci-flow). It should be noted that in most discretizations the 2-dimensional version of the flow is adopted (see, e.g., GuYau). In this dimension, $\mathrm{Ric} = K g$, where $K$ denotes the classical Gauss curvature, thus the Ricci flow becomes

\[ \frac{\partial g}{\partial t} = -2 K g. \]

In the discrete setting of meshes or networks, the PDE above becomes an ODE, thus the flow is reversible, which is a fact of practical importance in many applications and, in particular, the one we study in this paper. Also observe that regions where $K > 0$ tend to shrink, while those with $K < 0$ tend to expand.

Example 2.

Consider the example manifold in Figure 4 for an intuition of how Ricci flow may be used to smooth a manifold. In Figure 4(a), the Ricci flow is illustrated by the color and thickness of the arrows, which indicate how much, as well as the direction in which, an expansion produces the smoothed version of the manifold illustrated in Figure 4(b).

In the above Example 2, the regions of negative Ricci curvature, where the Ricci flow is illustrated with blue arrows, can be seen as a bottleneck of the manifold. This observation motivates a discretization of manifolds to graphs, together with corresponding notions of Ricci curvature and Ricci flow, so that they may be used to model and reduce information over-squashing.

Figure 4: Visualisation of the Ricci flow for a 2D shape. (fig:ricci-flow1) The arrows show the direction and relative magnitude of the flow at the current time step $t$. Red and blue arrows correspond to points with positive and negative curvature respectively, and figuratively represent the metric tensor. (fig:ricci-flow2) The manifold at a later time step, expanded and shrunk according to the arrows in (fig:ricci-flow1).

From Manifolds to Graphs.

In certain instances, there is a natural reduction of manifolds to graphs. For example, images can be represented in a discrete manner by meshes, which can be seen as 4-regular graphs, while in graphics, data is encoded as triangular meshes whose 1-skeleta are also graphs.

Concretely, for the three types of curvature discussed above (positive, zero, and negative), there exist natural graph analogies. For a sphere where curvature is positive, a clique is a suitable representation: two parallel geodesics shot from two nearby points on a sphere meet at the top of a sphere, and, likewise, two edges from two adjacent points (connected directly by an edge) in a clique can meet at a common node to create a triangle. For a plane where curvature is flat, a rectangular grid is an appropriate graphical representation: parallel lines on a plane remain parallel forever, and edges from two adjacent points remain parallel. Finally, a hyperbolic manifold with negative curvature may be represented by a binary tree. See Figure 5 for graphical examples of manifolds with positive, zero, and negative curvature.

Figure 5: Graph analogies of manifolds. (fig:clique-analogy) A graphical representation of a sphere; (fig:grid-analogy) a graphical representation of a plane; (fig:tree-analogy) a graphical representation of a hyperbolic manifold. The example initial pairs of nodes and the edges joining them are highlighted in red; the example choices of the next edges and points are highlighted in orange.

3.2 Discrete Curvature

With the above intuition of discretizing manifolds to graphs, it is natural to correspondingly define discretized versions of curvature. On graphs, discrete curvatures are traditionally node-based measures; discrete Ricci curvature, however, is an edge-based measure. This is not only natural, given that classical curvature is a directional measure, hence attached to vectors; it also allows for a better and deeper understanding of networks, which are defined by the relationships between their nodes, i.e., by their edges (najman-romon; discrete-curvature).

In the discrete Ricci curvature, the edge endpoints correspond to the two nearby points on the manifold from which parallel geodesics are shot to determine the Ricci curvature. We note that discrete Ricci curvature can also be defined for graph nodes by aggregating (e.g., averaging) the discrete curvature of incident edges; however, the notion of node curvature does not play a role in our study of over-squashing, which is an edge-specific phenomenon.

There is no single, established definition of discrete curvature. Depending on heuristics, there are many types. Here, we outline the first and best-known discrete curvatures historically proposed for networks. The driving motivation is that, in analogy to the Ricci flow for manifolds, the bottlenecks will have the lowest discrete curvature in the graph.

Note that in our setting, we work with undirected networks and these curvatures are defined for undirected networks. Analogues for directed networks exist, but since the interest of this work is to explore the role of various discrete curvatures on networks and their performance in reducing over-squashing as studied by bottleneck-bronstein, who study the undirected case, we follow suit in our work. Furthermore, we work with unweighted networks, which give rise to combinatorial properties of graphs that lend computational benefits.

1D Forman curvature.

Definition 3.

For two nodes $v_1, v_2$ in a graph and an edge $e$ between them, the general 1D Forman curvature of $e$ is given by (forman-curvature):

\[ F(e) = w_e \left( \frac{w_{v_1}}{w_e} + \frac{w_{v_2}}{w_e} - \sum_{e_{v_1} \sim e} \frac{w_{v_1}}{\sqrt{w_e w_{e_{v_1}}}} - \sum_{e_{v_2} \sim e} \frac{w_{v_2}}{\sqrt{w_e w_{e_{v_2}}}} \right), \tag{4} \]

where $e_{v_1}$ and $e_{v_2}$ denote the edges other than $e$ that are adjacent to nodes $v_1$ and $v_2$ respectively; $w_e$, $w_{e_{v_1}}$, and $w_{e_{v_2}}$ denote the weights of edges $e$, $e_{v_1}$, and $e_{v_2}$ respectively; and $w_{v_1}$ and $w_{v_2}$ denote the weights of the nodes $v_1$ and $v_2$ respectively.

Recall, however, that here we study unweighted graphs, which means that only combinatorial weights of nodes and edges are considered, i.e., the weights of all nodes and edges are equal to $1$. In this case, (4) becomes simply

\[ F(e) = 4 - \deg(v_1) - \deg(v_2), \tag{5} \]

where $\deg(v)$ denotes the degree of node $v$. Note that the first term is $4$ rather than $2$ because the node $v_1$ is counted as a neighbor in $\deg(v_2)$ and vice versa.

In our setting, the 1D Forman curvature is given by (5), which is a very simple expression that is extremely fast to compute and depends only on the degrees of the endpoints of the edge under consideration. The highest value of the curvature is equal to $2$ and is attained when the edge is disconnected from the rest of the graph. The 1D Forman curvature is negative for the majority of edges in general, as it is always negative when the edge is directly connected to at least $3$ other edges.

The drawback of the simplicity of this curvature is that it is not always very descriptive, even in comparison with the curvature values of other edges in the graph, since in the model case of combinatorial weights, the 1D Forman curvature gives information only about the number of edges directly connected to the edge under consideration. For example, for two clique-like subgraphs connected by one edge, as in Figure 2(a), the bottleneck would be correctly identified as the edge with the lowest curvature. However, this curvature would generally assign lower values to clique-like components of the graph than to tree-like components, as an edge in a clique is connected directly to all of the other edges in the clique, while, e.g., an edge in a binary tree is directly connected to at most $4$ other edges. Thus, mainly in the combinatorial case, the measure of interest is the relative curvature in comparison to the curvature of other edges in the graph.
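A minimal sketch of (5) and of ranking edges by it (the graph used in the usage lines is an arbitrary networkx example, not one of our datasets):

import networkx as nx

def forman_1d(G, u, v):
    # Combinatorial 1D Forman curvature (5) of the edge (u, v).
    return 4 - G.degree(u) - G.degree(v)

# Relative values are what matter: the candidate bottleneck is the edge
# with the lowest curvature in the graph.
G = nx.karate_club_graph()
bottleneck = min(G.edges, key=lambda e: forman_1d(G, *e))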

Augmented Forman curvature.

The augmented Forman curvature or 2D Forman curvature attempts to solve the above-mentioned drawbacks of 1D Forman curvature.

Definition 4.

For two nodes $v_1, v_2$ in a graph and an edge $e$ between them, the augmented Forman curvature or 2D Forman curvature is given by (2d-forman):

\[ F^{\#}(e) = w_e \left( \sum_{f > e} \frac{w_e}{w_f} + \sum_{v < e} \frac{w_v}{w_e} - \sum_{e' \parallel e} \left| \sum_{f > e, \, f > e'} \frac{\sqrt{w_e w_{e'}}}{w_f} - \sum_{v < e, \, v < e'} \frac{w_v}{\sqrt{w_e w_{e'}}} \right| \right), \tag{6} \]

where $e' \parallel e$ denotes that $e'$ is parallel to $e$, i.e., $e$ and $e'$ have a common higher or lower dimensional graph face (e.g., they are two edges that are both part of the same triangle, which is in turn equivalent to them having a common neighbour); $<$ denotes the face relation (e.g., $v < e$ means the node $v$ is an endpoint of the edge $e$, and $e < f$ means $e$ is an edge of the triangle $f$); and the rest of the notation is as in Definition 3 (here the faces $f$ are also weighted, with weights $w_f$).

The mathematical definition of the augmented Forman curvature captured in (6) is significantly more complex than that of the 1D Forman curvature. However, following 2d-forman, we can consider solely $3$-cycles, i.e., triangles, and choose again only combinatorial weights. This reduces (6) to the following form that relates to (5) (2d-forman):

\[ F^{\#}(e) = 4 - \deg(v_1) - \deg(v_2) + 3m, \tag{7} \]

where $m$ is the number of triangles containing the edge $e$ under consideration.

The idea is that the curvature (7) increases relative to (5) if an edge is contained in some triangles. More precisely, the factor of $3$ multiplying $m$ in (7) guarantees that edges that create a triangle together with the edge under consideration do not contribute negatively to the curvature, but positively instead. Indeed, there should intuitively be no problem with information over-squashing within a $3$-cycle, which is the simplest form of a clique.

For each pair of edges that create a triangle with $e$, the curvature increases by $1$, while for the 1D Forman curvature the pair would simply decrease the curvature by $2$, due to its contribution to the degrees of the endpoints of $e$. If an edge is not a member of a triangle with $e$, it contributes negatively to the augmented Forman curvature by decreasing it by $1$, just as in the 1D version. Hence, the augmented version maintains a balance between the growth of the degrees of the endpoints and the creation of $3$-cycles.
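A sketch of the combinatorial form (7), counting triangles as common neighbors of the edge's endpoints (function name ours):

import networkx as nx

def forman_augmented(G, u, v):
    # Combinatorial augmented (2D) Forman curvature (7): the 1D value (5)
    # plus 3 for every triangle containing the edge (u, v).
    m = len(set(G[u]) & set(G[v]))  # common neighbors = triangles on (u, v)
    return 4 - G.degree(u) - G.degree(v) + 3 * m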

Haantjes curvature.

The Haantjes curvature (haantjes2) is less common than the Forman curvatures, even though its definition is by far the simplest and most intuitive of all the discrete network curvatures; see haantjes for its even simpler network adaptations.

Definition 5.

Consider a graph where all weights are equal to $1$ (i.e., the combinatorial case). For two nodes $v_1, v_2$ in a graph and an edge $e$ between them, the Haantjes curvature is given by

\[ \kappa_H(e) = m, \tag{8} \]

where $m$ is as in (7), i.e., the number of triangles containing the edge $e$.

The original Haantjes curvature is a metric curvature, thus in the network case it takes into account solely edge weights. However, in practice, one can devise new edge weights that incorporate both the original ones as well as the given vertex weights (see haantjes). Definition 5 is commonly used in graphics settings and simply counts the triangles adjacent to a given edge, i.e., the number of $3$-cycles containing the edge under consideration, $m$.

As a consequence, the Haantjes curvature is indeed higher for clique-like components of a graph than for tree-like components (each edge of a $k$-clique is part of $k - 2$ triangles, but there are no triangles in a tree). The Haantjes curvature is trivially nonnegative, which is also in contrast with the 1D Forman curvature, where the majority of edges usually have negative curvature. The augmented Forman curvature can now be thought of as a balance between the 1D Forman and Haantjes curvatures. (Note that an “augmented” Haantjes curvature, which also takes face weights into account, has been introduced in haantjes.)
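In code, the combinatorial Haantjes curvature (8) is a one-liner (a sketch with our own naming, for a networkx graph G):

def haantjes(G, u, v):
    # Combinatorial Haantjes curvature (8) of the edge (u, v): the number
    # of 3-cycles containing it; always nonnegative.
    return len(set(G[u]) & set(G[v]))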

Balanced Forman curvature.

The most recent proposal for discrete curvature that will also be studied in this paper is the balanced Forman curvature (BFC). It was introduced specifically in the context of over-squashing on graphs (bottleneck-bronstein).

Definition 6 (Balanced Forman curvature, (bottleneck-bronstein)).

Consider an edge $e = (v_1, v_2)$. Let $d_k = \deg(v_k)$ for $k = 1, 2$; $\#_{\Delta}$, the number of triangles containing $e$; $\#_{\square}^{v_k}$ for $k = 1, 2$, the number of neighbors of $v_k$ that create a $4$-cycle (square) that contains $e$ and does not contain any diagonals (see Figure 6); and $\gamma_{\max}$ the maximal number of such $4$-cycles containing $e$ that traverse a common node. Then the balanced Forman curvature is defined as $\mathrm{BFC}(e) = 0$ if $\min\{d_1, d_2\} = 1$, and

\[ \mathrm{BFC}(e) = \frac{2}{d_1} + \frac{2}{d_2} - 2 + 2 \frac{\#_{\Delta}}{\max\{d_1, d_2\}} + \frac{\#_{\Delta}}{\min\{d_1, d_2\}} + \frac{(\gamma_{\max})^{-1}}{\max\{d_1, d_2\}} \left( \#_{\square}^{v_1} + \#_{\square}^{v_2} \right) \tag{9} \]

otherwise.

The idea behind the BFC is to preserve a balance between the complexity of computation (in the spirit of the simple formulation of the classical Forman curvature) and the richness of structural information associated with neighboring edges. In particular, the BFC formulation takes into account $3$- and $4$-cycles, as well as “loose” neighboring edges, i.e., those that do not create $3$- or $4$-cycles. Here, loose edges have a negative curvature contribution, $3$-cycles have a positive curvature contribution, and $4$-cycles have a zero curvature contribution to the BFC. These contributions are normalized by node degrees.

Note that the BFC is similar to the Forman and Haantjes curvatures in that $3$-cycles are explicitly taken into consideration, while $4$-cycles are accounted for via loose edges. In contrast to the BFC, however, loose edges are treated in the Forman and Haantjes curvatures as potential components of $4$-cycles and make a negative curvature contribution there.

Figure 6: Illustration of the diagonal-free $4$-cycle counts $\#_{\square}^{v_k}$ from Definition 6 on example graphs.
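A sketch of Definition 6 under our reading of it (in particular, the treatment of $\gamma_{\max}$ as a per-node count of diagonal-free $4$-cycles is our interpretation; for the authoritative version see the bottleneck-bronstein repository). The function expects a networkx graph G:

def balanced_forman_curvature(G, i, j):
    # Balanced Forman curvature (9) of the edge (i, j) for an unweighted,
    # undirected graph G, following our reading of Definition 6.
    Ni, Nj = set(G[i]), set(G[j])
    di, dj = G.degree(i), G.degree(j)
    if min(di, dj) == 1:
        return 0.0
    n_tri = len(Ni & Nj)  # triangles containing (i, j)
    # Neighbors of i (resp. j) lying on a 4-cycle through (i, j) with no diagonals.
    sq_i = [k for k in Ni - Nj - {j} if (set(G[k]) & Nj) - Ni - {i}]
    sq_j = [w for w in Nj - Ni - {i} if (set(G[w]) & Ni) - Nj - {j}]
    # gamma_max: maximal number of such 4-cycles traversing a common node.
    counts = [len((set(G[k]) & Nj) - Ni - {i}) for k in sq_i] + \
             [len((set(G[w]) & Ni) - Nj - {j}) for w in sq_j]
    gamma_max = max(counts, default=0)
    curv = (2 / di + 2 / dj - 2
            + 2 * n_tri / max(di, dj) + n_tri / min(di, dj))
    if gamma_max > 0:
        curv += (len(sq_i) + len(sq_j)) / (gamma_max * max(di, dj))
    return curv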

3.3 Discrete Ricci Flow: The Stochastic Discrete Ricci Flow

We now outline the discretization of Ricci flow that will be implemented in our experimental work, namely, the stochastic discrete Ricci flow (SDRF) algorithm (bottleneck-bronstein). Specifically, it is a graph rewiring algorithm that was introduced with the aim of addressing the problem of over-squashing in GNN training; it is designed to support edges with low curvature, which are identified as the bottlenecks, by adding new edges around them, thereby increasing curvature and the efficiency of message passing. It operates very much in the same spirit as the Ricci flow, where, in particular, regions of negative or low curvature are identified and compensated by an opposite effect, depending on the degree of negativity, in order to smooth the manifold. Additionally, it incorporates a mechanism to prevent a blow-up in the size of the graph. The algorithm thus takes in a graph and produces another graph in which the regions around the most negatively curved edges of the input graph are augmented with additional edges to increase the curvature at those regions.

At each iteration, the algorithm chooses the edge with the smallest curvature, forms candidate edges whose addition would support the edge under consideration, and samples the edge to add from the candidates with softmax probability (regulated by a temperature parameter $\tau$) with the aim of increasing curvature; the quantity governing the sampling is the improvement in the curvature of the edge under consideration, calculated as the difference between its curvature before and after adding the support edge. The algorithm then chooses the edge with the highest curvature and, if this curvature value surpasses a certain threshold, removes this edge from the graph, thus ensuring a bound on the size of the graph. The process repeats until either convergence is reached, in the sense that there are no additional candidates and no edges to remove, or the maximum number of iterations is reached.

Input: graph $G = (V, E)$, temperature $\tau > 0$, max number of iterations, discrete curvature Curv, optional Curv upper bound $C^+$
repeat
     for edge $(i, j)$ with minimal discrete curvature $\mathrm{Curv}(i, j)$ do
          Calculate the vector $x$ of improvements $x_{kl}$ to $\mathrm{Curv}(i, j)$ from adding the edge $(k, l)$, where $k \in B_1(i)$, $l \in B_1(j)$;
          Sample an index $(k, l)$ with probability $\mathrm{softmax}(\tau x)_{kl}$ and add the edge $(k, l)$ to $E$.
     end for
     Remove the edge $(i, j)$ with maximal discrete curvature $\mathrm{Curv}(i, j)$ if $\mathrm{Curv}(i, j) > C^+$.
until convergence or max iterations reached
Algorithm 1 Stochastic Discrete Ricci Flow (SDRF)
Example 7.

See Figure 2 for an example run of Algorithm 1.

Notice here that the curvature computation is incorporated into the SDRF algorithm when the softmax probability is computed for each candidate selection.
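A compact sketch of Algorithm 1 follows (our own simplified rendering, with $B_1(\cdot)$ taken as the closed 1-hop neighborhood; the reference implementation lives in the repository linked in Section 4):

import math
import random

def sdrf(G, curv, tau=20.0, max_iter=100, c_plus=None):
    # curv(G, u, v) is any of the edge curvatures defined above; tau is the
    # softmax temperature and c_plus the optional removal bound C+.
    for _ in range(max_iter):
        # Edge with minimal curvature and its candidate support edges.
        u, v = min(G.edges, key=lambda e: curv(G, *e))
        base = curv(G, u, v)
        candidates, gains = [], []
        for k in list(G[u]) + [u]:          # k in B_1(u)
            for l in list(G[v]) + [v]:      # l in B_1(v)
                if k != l and not G.has_edge(k, l):
                    G.add_edge(k, l)
                    gains.append(curv(G, u, v) - base)  # improvement x_kl
                    G.remove_edge(k, l)
                    candidates.append((k, l))
        if not candidates:
            break                            # convergence: nothing left to add
        # Sample a support edge with softmax(tau * x) probability
        # (shifted by the max gain for numerical stability).
        mx = max(gains)
        weights = [math.exp(tau * (x - mx)) for x in gains]
        G.add_edge(*random.choices(candidates, weights=weights)[0])
        # Remove the most positively curved edge if it exceeds the bound.
        u, v = max(G.edges, key=lambda e: curv(G, *e))
        if c_plus is not None and curv(G, u, v) > c_plus:
            G.remove_edge(u, v)
    return G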

4 Data and Experimental Setup

In this section, we describe the datasets used and give details on the setup of our experimental study. The aim is to test the performance, in terms of accuracy and computational runtime, of various discrete curvatures in the SDRF algorithm, which is designed to reduce information over-squashing in training GNNs. With this aim in mind, the experiments were set up to closely align with the setup of bottleneck-bronstein in order to better relate the findings. Furthermore, to ensure fairness of method evaluation, we also evaluated performance on additional datasets that were not studied by bottleneck-bronstein, resulting in a wider variety of dataset applications and an independent implementation of their proposed BFC. Recall, however, that an important difference between our work and theirs is that only one curvature—the BFC—was used in their curvature-based rewiring.

4.1 Datasets

We used the following 12 benchmarking datasets in our experimental study:

  • Cora (cora) and Citeseer (Citeseer): Large citation datasets containing information about the presence of specific words in publications

  • Pubmed (pubmed): Large citation dataset of publications related to diabetes, classified into one of three classes

  • Cornell, Texas, and Wisconsin (ctw): Small datasets containing information about World Wide Web pages collected from the computer science departments of the corresponding universities

  • Chameleon, Squirrel (chameleon-squirrel), and Actor (actor): Large datasets based on the Wikipedia networks

  • Computers and Photo (computers-photo): Large e-commerce (Amazon) datasets

  • Coauthor CS (cgnn): Large citation dataset with papers in computer science

The datasets Computers, Photo and Coauthor CS were not evaluated in bottleneck-bronstein. The details of the datasets are summarized in Table 1.

Cora Citeseer Pubmed Cornell Texas Wisconsin
Nodes 2485 2120 19717 140 135 184
Edges 5069 7358 44324 219 251 362
Features 1433 778 500 1703 1703 1703
Classes 7 6 3 5 5 5
Undirected Yes Yes Yes No No No

Chameleon Squirrel Actor Computers Photo Coauthor CS
Nodes 832 2186 4388 13381 7487 18333
Edges 12355 65224 21907 245778 119043 81894
Features 2323 2089 931 767 745 6805
Classes 5 5 5 10 8 15
Undirected No No No Yes Yes Yes

Table 1: Details of datasets. ‘Undirected’ specifies whether the network is undirected by default. If not, it is made undirected for the experiments.

4.2 Experimental Design

We tested the performance of no curvature (i.e., no rewiring), the 1D Forman curvature, the augmented Forman curvature, the Haantjes curvature, and the balanced Forman curvature in the SDRF algorithm for graph rewiring. The implementation of the SDRF algorithm was taken from the repository associated with bottleneck-bronstein, available at https://github.com/jctops/understanding-oversquashing. Other design choices and setup parameters, such as data loading, selection of the largest connected component, network type, hyperparameters, and seeds, were set following bottleneck-bronstein; Table 2 presents the hyperparameters used for training the GNN models.

Cora Citeseer Pubmed Cornell Texas Wisconsin
Dropout 0.3396 0.4103 0.3749 0.2911 0.216 0.2452
Hidden depth 1 1 3 1 1 1
Hidden dimension 128 64 128 128 64 64
Learning rate 0.0244 0.0199 0.0112 0.0056 0.0229 0.2452
Weight decay 0.1076 0.4551 0.0138 0.0366 0.0137 0.1559
Max iterations 100 84 166 126 89 136
Temperature τ 163 180 115 145 22 12
Removal bound 0.95 0.22 14.43 0.88 1.64 7.95
Patience 100 10 10 100 100 100

Chameleon Squirrel Actor Computers Photo Coauthor CS
Dropout 0.4886 0.3079 0.3424 0.3396 0.3396 0.3396
Hidden depth 1 1 1 1 1 1
Hidden dimension 32 32 64 128 128 128
Learning rate 0.0268 0.0299 0.0129 0.0244 0.0244 0.0244
Weight decay 0.4056 0.0158 0.0126 0.1076 0.1076 0.1076
Max iterations 2442 1396 3249 100 100 100
Temperature τ 252 436 106 163 163 163
Removal bound 2.84 5.88 0 0.95 0.95 0.95
Patience 10 10 10 10 10 10

Table 2: Training hyperparameters. Note here that for the Squirrel dataset we set τ = 163 for rewiring using the augmented Forman curvature (the same value as for Cora), since for τ = 436 we encounter integer overflow.

Software and Data Availability.

The full implementation of the SDRF algorithm with all curvatures studied incorporated and datasets are freely and publicly available at https://github.com/jakubbober/discrete-curvature-rewiring.

5 Results: Supervised Learning with Graph Rewiring

We now present the results of the supervised learning task with SDRF-based graph rewiring on each of the 12 datasets discussed in the previous section. We report results on accuracy and computational runtime.

5.1 Accuracy

Each experiment was run for 100 seeds, and we report 95% confidence intervals of the mean accuracies using a $z$-score of 1.96. For reference and performance comparison, the 95% confidence intervals for SDRF rewiring using the BFC reported by bottleneck-bronstein are also given for the relevant datasets.
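Concretely, each reported interval is of the form mean ± 1.96 · s/√100, which the following sketch computes (function name ours):

import numpy as np

def ci95(accuracies):
    # Mean accuracy over seeds and its 95% confidence half-width,
    # using the normal z-score of 1.96.
    a = np.asarray(accuracies)
    return a.mean(), 1.96 * a.std(ddof=1) / np.sqrt(len(a))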

The best two results are highlighted for each dataset in each accuracy table: the best in red bold, the second best in black bold (excluding the reference BFC results from bottleneck-bronstein). The None curvature row represents results without any rewiring. OOM indicates that an out-of-memory error occurred. N/A in the reference BFC row for the Computers, Photo, and Coauthor CS datasets indicates that there are no reference results, as these datasets were not studied by bottleneck-bronstein.

Cora Citeseer Pubmed Cornell Texas Wisconsin
None 81.65 ± 0.25 72.14 ± 0.31 77.74 ± 0.40 48.50 ± 0.60 59.19 ± 0.38 50.24 ± 0.54
1D 81.15 ± 0.24 72.17 ± 0.28 77.76 ± 0.37 52.75 ± 0.82 64.59 ± 1.11 52.70 ± 0.71
Augmented 81.56 ± 0.24 72.12 ± 0.30 77.70 ± 0.40 55.43 ± 0.62 65.48 ± 1.23 52.62 ± 0.74
Haantjes 81.55 ± 0.25 72.19 ± 0.30 77.75 ± 0.38 56.29 ± 0.50 63.33 ± 0.94 55.81 ± 0.77
BFC 81.38 ± 0.25 72.09 ± 0.28 OOM 58.39 ± 0.64 61.11 ± 0.68 48.86 ± 0.91
Reference BFC 82.76 ± 0.23 72.58 ± 0.20 79.10 ± 0.11 57.54 ± 0.34 70.35 ± 0.60 61.55 ± 0.86

Chameleon Squirrel Actor Computers Photo Coauthor CS
None 47.38 ± 0.45 38.16 ± 0.32 27.82 ± 0.24 41.74 ± 1.41 56.4 ± 2.85 90.89 ± 0.11
1D 44.88 ± 0.43 36.83 ± 0.27 29.41 ± 0.26 42.24 ± 1.58 54.93 ± 3.46 90.83 ± 0.11
Augmented 43.54 ± 0.88 36.75 ± 0.25 29.81 ± 0.30 42.93 ± 1.56 54.44 ± 3.01 90.89 ± 0.11
Haantjes 46.14 ± 0.55 36.59 ± 0.26 29.36 ± 0.24 42.95 ± 1.74 56.74 ± 3.12 90.86 ± 0.10
BFC 46.92 ± 0.73 37.82 ± 0.36 29.11 ± 0.25 41.55 ± 1.91 54.29 ± 3.13 OOM
Reference BFC 44.46 ± 0.17 37.67 ± 0.23 28.35 ± 0.06 N/A N/A N/A

Cora Citeseer Pubmed Cornell Texas Wisconsin
None 81.55 ± 0.23 72.21 ± 0.29 77.90 ± 0.36 48.11 ± 0.60 59.33 ± 0.40 49.95 ± 0.49
1D 81.10 ± 0.24 72.45 ± 0.29 77.90 ± 0.35 51.00 ± 0.88 68.07 ± 1.09 54.51 ± 0.84
Augmented 81.57 ± 0.25 72.22 ± 0.27 77.89 ± 0.38 53.89 ± 0.63 64.81 ± 1.20 56.49 ± 0.79
Haantjes 81.56 ± 0.24 72.10 ± 0.28 77.71 ± 0.41 57.18 ± 0.57 64.78 ± 1.15 55.86 ± 0.76
BFC 81.25 ± 0.25 72.04 ± 0.29 OOM 54.61 ± 0.50 58.37 ± 0.67 56.19 ± 0.84
Reference BFC 82.76 ± 0.23 72.58 ± 0.20 79.10 ± 0.11 57.54 ± 0.34 70.35 ± 0.60 61.55 ± 0.86

Chameleon Squirrel Actor Computers Photo Coauthor CS
None 46.86 ± 0.44 38.25 ± 0.33 27.69 ± 0.22 42.45 ± 1.55 53.39 ± 2.75 90.90 ± 0.10
1D 44.99 ± 0.40 36.49 ± 0.29 29.66 ± 0.26 41.11 ± 1.86 55.57 ± 3.14 90.82 ± 0.10
Augmented 42.69 ± 0.65 36.70 ± 0.26 29.98 ± 0.25 41.97 ± 1.71 56.19 ± 2.96 90.90 ± 0.12
Haantjes 45.97 ± 0.51 36.83 ± 0.24 29.52 ± 0.21 42.38 ± 1.60 55.34 ± 2.93 90.88 ± 0.11
BFC 46.62 ± 0.70 37.61 ± 0.34 29.34 ± 0.28 42.11 ± 1.65 54.51 ± 2.89 OOM
Reference BFC 44.46 ± 0.17 37.67 ± 0.23 28.35 ± 0.06 N/A N/A N/A

Table 3: 95% confidence intervals of mean accuracies (in percent) for the given datasets and curvature types, over two experimental runs (the first two tables correspond to the first run, the last two to the second).

The results reported in Table 3 indicate that SDRF rewiring generally increases the training performance. The reference BFC results reported in bottleneck-bronstein are also generally comparable to the BFC results of the performed experiment, although we note a tendency for our computation of BFC-based SDRF rewiring to be on the lower side, though still in the general range where we are able to claim reproducibility.

In particular, we note that the performance of the classical curvatures is generally better than the performance without any rewiring, and often better than the performance of the BFC. For some results in Table 3, the simplest form of curvature—the 1D Forman curvature—tends to give the best results. This suggests that the edges with large sums of endpoint degrees are the graph bottlenecks and suffer from over-squashing. The results for the Haantjes curvature are the best for some of the other datasets, which suggests that association with many $3$-cycles helps an edge to reduce over-squashing. Although with less frequency, the augmented Forman curvature also yields the best results for certain experiments, which could mean that maintaining the balance between the two metrics can reduce over-squashing most effectively.

Note, however, that rerunning the experiments yielded results that differ quite significantly, especially for the small datasets (Cornell, Texas, Wisconsin). For example, Table 3 shows that the Haantjes curvature seems to generally give the best results in the first run, while the augmented Forman curvature performs best in the second run. More importantly, the corresponding results (dataset–curvature pairs) for the two runs are often not within each other's confidence intervals, indicating a lack of robustness. One explanation for this phenomenon is overfitting of the average accuracy to one instance of the SDRF rewiring. This can have a significant impact on average performance, especially for the small datasets, for which the rewiring of multiple edges can have a greater impact on the graph structure than for larger datasets. The results for these datasets also differ significantly between curvature types, and in relation to the performance without any rewiring. Moreover, the BFC results differ more significantly from the reference BFC results for these datasets than for the others.

To further investigate the intuition that adding or deleting edges on smaller graphs impacts the overall graph structure more significantly (taking into account that the hyperparameters in Table 2 are very high relative to graph size), we re-ran the experiments for the Cora, Citeseer, Cornell, Texas, and Wisconsin datasets with rewiring performed for each seed. These datasets were selected because their rewiring was the fastest (as discussed further below under computational runtime). The test results of these experiments are shown in Table 4.

Cora Citeseer Cornell Texas Wisconsin
None 81.63 ± 0.24 72.13 ± 0.29 48.04 ± 0.60 59.74 ± 0.36 50.11 ± 0.53
1D 81.15 ± 0.26 72.14 ± 0.31 53.39 ± 0.81 67.00 ± 1.28 55.54 ± 0.89
Augmented 81.64 ± 0.25 72.05 ± 0.29 54.93 ± 0.59 64.56 ± 1.15 55.49 ± 0.82
Haantjes 81.64 ± 0.24 72.19 ± 0.33 56.50 ± 0.60 62.96 ± 0.92 55.95 ± 0.72
BFC 81.18 ± 0.27 72.12 ± 0.29 53.07 ± 0.74 59.19 ± 0.69 54.24 ± 0.93
Reference BFC 82.76 ± 0.23 72.58 ± 0.20 57.54 ± 0.34 70.35 ± 0.60 61.55 ± 0.86

Cora Citeseer Cornell Texas Wisconsin
None 81.56 ± 0.23 72.24 ± 0.29 48.46 ± 0.56 59.48 ± 0.40 49.97 ± 0.52
1D 81.24 ± 0.23 72.30 ± 0.29 52.75 ± 0.80 67.74 ± 1.26 55.62 ± 0.79
Augmented 81.69 ± 0.22 72.23 ± 0.30 55.39 ± 0.68 64.93 ± 1.10 55.27 ± 0.79
Haantjes 81.49 ± 0.24 72.21 ± 0.27 55.61 ± 0.58 63.11 ± 1.05 56.08 ± 0.82
BFC 81.07 ± 0.25 72.01 ± 0.32 53.00 ± 0.73 60.30 ± 0.80 54.59 ± 0.88
Reference BFC 82.76 ± 0.23 72.58 ± 0.20 57.54 ± 0.34 70.35 ± 0.60 61.55 ± 0.86

Table 4: 95% confidence intervals (in percent) for selected datasets with graph rewiring performed for each seed, run twice.

Table 4 presents results from two runs with rewiring for every seed, which are shown to be significantly more robust. The sizes of the confidence intervals are comparable to those reported previously in Table 3, but only two pairs of corresponding runs are not contained in the confidence intervals of one another (namely, Cornell–Haantjes and Texas–BFC). As there are 25 dataset–curvature pairs for which the experiments were run, the mean results are indeed robust, and it is reasonable to consider the results as close to independent and identically distributed (i.i.d.): the probability that two or more out of 25 means of i.i.d. random variables are not within the corresponding 95% confidence intervals is approximately 36%, which is high.

Furthermore, the results of these additional experiments are significantly worse than the reference BFC results. This is likely because the accuracies for differently rewired graphs are averaged, as opposed to using the rewiring with the best validation accuracy for benchmarking. In contrast, the results in Table 3 are slightly better than the reference BFC results for some dataset–curvature pairs and slightly worse for others. When using the framework in practice, for the best results, training can be performed for several different seeds and the model with the best validation accuracy, along with its most effective rewired graph structure, can be chosen.

We summarize in Table 5 the test results for the rewiring instances and model parameters that achieved the best validation accuracy in the second run of the experiments reported in Table 4. Only the second run is considered, but this does not have a significantly negative impact on the robustness of the results since, as previously justified, the results in Table 4 are robust.

Cora Citeseer Cornell Texas Wisconsin
None 82.34 74.19 50.0 51.85 51.35
1D 82.23 70.48 60.71 74.07 54.05
Augmented 83.35 74.19 57.14 74.07 59.46
Haantjes 83.55 73.23 53.57 66.67 59.46
BFC 82.84 74.03 56.94 62.96 54.05
Reference BFC 82.76 72.58 57.54 70.35 61.55

Table 5: Accuracy in % for the best rewiring instances and models from experiments presented in Table 4.

The main conclusion we draw from these experiments is that there is no single curvature type with the best mean performance overall across the multiple datasets, but it is reasonable to conclude that using the classical curvatures for SDRF-based rewiring can lead to significant performance improvements, often achieving better results than the BFC. For every dataset, SDRF-based rewiring almost always yields the best test accuracy when using one of the three classical curvature types rather than the BFC (although no rewiring may also yield the best results). Often, the two best test accuracies are achieved using classical curvatures.

Determining Bottleneckness: Jacobian Bounds.

We now turn to an assessment of the bottleneckness of each of our datasets as a further validation of our accuracy conclusions. As described above in Section 2, heuristically, the bottleneckness is more severe the faster the minimum nonzero values of the powers of the normalized augmented adjacency matrix (2) decrease.

Figure 7 presents the log-log plot of the minimum nonzero entries of powers of the normalized augmented adjacency matrix for most of the studied datasets. Pubmed, Computers, Photo, and Coauthor CS, which are absent from this plot, produced adjacency matrices that were too large, causing the processes responsible for computing the powers to be killed.

Figure 7: Log-log plot of minimum non-zero values of powers of normalized augmented adjacency matrix (2)

From this figure, we see that the decay of the values is the slowest for the Cora and Citeseer datasets. This confirms our previous conclusions: according to the previously reported results, the difference in accuracy between no rewiring and rewiring with the different curvatures was minimal for these datasets, suggesting that they are not significantly affected by over-squashing. Similar conclusions can be drawn for the Squirrel dataset, for which the plot in Figure 7 also decreases more slowly (especially across the lower powers), and which is also reported to have the best accuracy with no rewiring in Table 3.

To capture the reduction of over-squashing via rewiring, we ran experiments computing the powers of the matrix values as above both before and after rewiring a graph. The results of these experiments for several datasets, curvature types, and matrix powers are presented in Table 6.

Curvature Power Cora Citeseer Cornell Texas Wisconsin Chameleon Squirrel Actor
None 5 8.69e-10 1.25e-08 1.29e-10 7.84e-11 3.55e-11 4.24e-10 1.23e-11 2.99e-10
None 10 4.74e-15 1.53e-14 1.67e-20 6.14e-21 1.26e-21 1.25e-17 4.63e-16 2.94e-17
None 20 1.86e-18 2.09e-22 2.79e-40 3.77e-41 1.59e-42 2.35e-31 1.49e-21 8.62e-34
None 40 8.09e-19 2.48e-24 7.78e-80 1.42e-81 2.53e-84 5.51e-62 2.21e-42 7.44e-67
1D 5 3.33e-09 1.77e-08 2.34e-06 2.30e-06 2.36e-07 1.26e-10 8.62e-11 9.23e-10
1D 10 1.14e-14 1.55e-14 1.66e-06 1.04e-08 1.71e-08 2.57e-15 1.43e-14 9.18e-13
1D 20 3.74e-17 5.15e-22 2.93e-08 2.35e-09 4.31e-09 1.64e-14 4.27e-14 5.19e-12
1D 40 1.09e-18 9.27e-24 1.32e-13 2.74e-13 1.87e-13 2.06e-18 3.70e-17 1.18e-15
Augmented 5 8.31e-10 9.36e-09 3.17e-06 6.57e-07 9.85e-08 1.04e-09 7.37e-11 3.30e-10
Augmented 10 2.87e-15 1.55e-14 7.89e-06 2.95e-08 1.31e-08 4.85e-15 2.00e-14 5.14e-13
Augmented 20 1.85e-18 2.92e-22 8.90e-08 3.80e-09 2.77e-09 4.48e-14 4.52e-14 2.27e-11
Augmented 40 8.55e-19 2.88e-24 7.58e-13 4.44e-13 1.89e-13 2.90e-17 4.46e-17 1.19e-15
Haantjes 5 1.10e-09 1.56e-08 4.98e-06 2.23e-06 2.28e-07 1.41e-09 1.10e-10 1.03e-09
Haantjes 10 1.43e-14 1.68e-14 8.96e-07 1.49e-09 1.31e-08 5.01e-15 1.33e-14 2.56e-13
Haantjes 20 7.91e-19 2.27e-22 5.36e-08 1.58e-09 2.81e-09 3.27e-14 9.37e-14 8.55e-12
Haantjes 40 9.65e-19 3.61e-24 7.06e-12 5.19e-13 2.80e-13 4.43e-18 2.84e-17 3.17e-15
BFC 5 1.09e-09 9.42e-09 7.83e-07 6.44e-07 1.47e-07 1.49e-09 1.47e-10 4.30e-10
BFC 10 1.01e-14 1.29e-14 2.85e-06 1.13e-06 1.72e-07 1.07e-10 1.65e-10 4.81e-09
BFC 20 3.30e-17 1.86e-22 2.56e-08 2.29e-08 1.12e-08 3.87e-11 1.60e-11 1.06e-10
BFC 40 1.10e-18 8.71e-24 1.88e-13 2.40e-13 1.11e-13 3.98e-16 6.13e-17 3.86e-16

Table 6: Minimum nonzero values of respective powers of normalized augmented adjacency matrices of datasets after SDRF-based rewiring using respective curvature types.

Figure 8 presents plots analogous to that of Figure 7, but for the datasets after rewiring using each curvature type. These are graphical representations of the results in Table 6.

Figure 8: Log-log plots of minimum nonzero values of powers of normalized augmented adjacency matrices after rewiring using different discrete curvature types: (a) 1D; (b) Augmented; (c) Haantjes; (d) BFC.

The rewiring instances chosen for this experiment are those with the best validation accuracy used for Table 5. For the datasets not evaluated there (Chameleon, Squirrel, Actor), rewiring instances from the second run of the experiment in Table 3 were selected. The data presented in Table 6 are rounded to three significant figures. The matrix powers reported for each curvature type are 5, 10, 20, and 40.

From Table 6 and the comparison of Figure 7 to Figure 8, we see that rewiring successfully slows the decay of the Jacobian bounds. The minimum nonzero values of the powers of the normalized augmented adjacency matrices are lower by many orders of magnitude without rewiring than with SDRF-based rewiring using any of the discrete curvatures. The only two exceptions are the Cora and Citeseer datasets, for which the values for the respective powers are similar with and without rewiring; see Table 6. This also confirms that these two datasets have low bottleneckness and are not particularly prone to over-squashing.

5.2 Computational Runtime

We now assess computation runtime of the SDRF algorithm for graph rewiring based on each curvature type. We measure the runtime for one rewiring process per curvature type and per dataset; the measurements are given in Table 7.

Cora Citeseer Pubmed Cornell Texas Wisconsin
1D 5.86 5.31 53.57 0.34 0.41 0.61
Augmented 6.16 5.48 107.55 0.19 0.21 0.49
Haantjes 2.88 4.27 39.65 0.13 0.13 0.14
BFC 27.64 12.26 OOM 34.95 21.2 21.24

Chameleon Squirrel Actor Computers Photo Coauthor CS
1D 86.51 900 418.06 4369.07 853.52 47.15
Augmented 251.25 901.71 872.78 10504.34 2262.72 88.10
Haantjes 53.58 531.31 105.36 462.14 139.97 30.12
BFC 1627.61 2006.78 5121.31 6431.32 1274.44 OOM

Table 7: Computation times of the SDRF rewiring given in seconds.

The runtimes here are reported for only one instance per dataset and curvature type, in order to avoid the influence of spurious computational issues, such as CPU and GPU occupancy by other processes, which would become much more significant over repeated iterations. The interest here is rather in the comparison of the longer computation times, which shows the difference in computational complexity at scale.

From these results, we see that all of the classical discrete curvatures studied have a significantly shorter computation time than the BFC. The slowest among the three classical curvatures, at scale, is the augmented Forman curvature. This is expected, as it essentially needs to perform the same calculations as the 1D Forman and Haantjes curvatures combined (computing the degrees of the endpoints and the adjacent triangles for each edge).

For the Computers and Photo datasets, however, the computation of the augmented Forman curvature took longer than the computation of the BFC. This suggests that for some types of graphs, possibly bigger or denser graphs (notice from Table 1 that the edge-to-node ratio is very high for these two datasets), the BFC computation can outperform the augmented Forman curvature computation in terms of computation time. Nevertheless, the 1D Forman and Haantjes curvatures remain quicker to compute.

6 Discussion

In this paper, we systematically and comprehensively studied the role of various classical and novel discrete curvatures in mitigating the over-squashing problem in training GNNs. Specifically, following the work of bottleneck-bronstein, we adapted discretizations of Ricci curvature and Ricci flow, whose smooth, manifold-valued analogues correspond to important characteristics relevant to the information over-squashing problem on networks—namely, information flow on a network and the bottleneckness of a network, respectively. In bottleneck-bronstein—considered to be the current state-of-the-art in mitigating the over-squashing problem in GNN training, and classified among the top 1.5% of submissions to the 2022 International Conference on Learning Representations (ICLR) with an Honorable Mention—the BFC was proposed as a discrete notion of Ricci curvature, while the SDRF algorithm was proposed as a discrete notion of Ricci flow. In our work, we tested a wide range of classical discrete curvatures against the BFC in the implementation of the SDRF algorithm. We found that the more classical curvatures were able to achieve performance of the same order as the BFC in training accuracy and, at times, outperformed the BFC. Moreover, they far outperformed it in computational runtime. From this systematic study, we find that the impact of the contribution by bottleneck-bronstein lies in the SDRF algorithm, rather than the BFC. We also conclude that almost any of the more classical discrete curvatures may be used in place of the BFC together with the SDRF algorithm, in favor of more efficient computational runtimes, which is an important consideration when studying very large networks.

Directions of future study include exploring the performance of classical discrete curvatures that take into account the directedness of graphs in the SDRF and other rewiring methods. Alternative discrete geometric approaches to mitigating the over-squashing problem that do not involve rewiring may also be explored, in the spirit of the CGNN (cgnn). Here, other computational notions of geometry for networks may also be investigated, such as those arising from topological data analysis, where persistent homology concurrently captures the topology of data as well as its integral geometry. Such an approach would be particularly interesting when the goal is to preserve the topology of a graph, as the CGNN does.

References