Graphon Neural Networks and the Transferability of Graph Neural Networks

06/05/2020 ∙ Luana Ruiz et al. ∙ University of Pennsylvania

Graph neural networks (GNNs) rely on graph convolutions to extract local features from network data. These graph convolutions combine information from adjacent nodes using coefficients that are shared across all nodes. As a byproduct, coefficients can also be transferred to different graphs, thereby motivating the analysis of transferability across graphs. In this paper we introduce graphon NNs as limit objects of GNNs and prove a bound on the difference between the output of a GNN and its limit graphon-NN. This bound vanishes with growing number of nodes if the graph convolutional filters are bandlimited in the graph spectral domain. This result establishes a tradeoff between discriminability and transferability of GNNs.


1 Introduction

Graph neural networks (GNNs) are the counterpart of convolutional neural networks (CNNs) for learning problems involving network data. Like CNNs, GNNs have gained popularity due to their superior performance in a number of learning tasks (Kipf and Welling, 2017; Defferrard et al., 2016; Gama et al., 2018; Bronstein et al., 2017). Aside from this ample empirical evidence, GNNs are proven to work well because of properties such as invariance and stability (Ruiz et al., 2019; Gama et al., 2019a), which they share with CNNs (Bruna and Mallat, 2013).

A defining characteristic of GNNs is that their number of parameters does not depend on the size (i.e., the number of nodes) of the underlying graph. This is because graph convolutions are parametrized by graph shifts in the same way that time and spatial convolutions are parametrized by delays and translations. From a complexity standpoint, the independence between the GNN parametrization and the graph is beneficial because there are fewer parameters to learn. Perhaps more importantly, the fact that its parameters are not tied to the underlying graph suggests that a GNN can be transferred from graph to graph. It is then natural to ask to what extent the performance of a GNN is preserved when its graph changes. The ability to transfer a machine learning model with performance guarantees is usually referred to as transfer learning or transferability.

In GNNs, there are two typical scenarios where transferability is desirable. The first involves applications in which we would like to reproduce a model trained on one graph on multiple other graphs without retraining. This would be the case, for instance, when replicating a GNN trained to analyze NYC air pollution data on the air pollution sensor network of Philadelphia. The second concerns problems where the network size changes over time. In this scenario, we would like the GNN model to be robust to nodes being added to or removed from the network, i.e., for it to be transferable in a scalable way. An example is recommender systems based on user similarity networks in which the user base grows by the day.

Both of these scenarios involve solving the same task on networks that, although different, can be seen as being of the same “type”. This motivates studying the transferability of GNNs within families of graphs that share certain structural characteristics. We propose to do so by focusing on collections of graphs associated with the same graphon. A graphon is a bounded symmetric kernel $\mathbf{W}: [0,1]^2 \to [0,1]$ that can be interpreted as a graph with an uncountable number of nodes. Graphons are suitable representations of graph families because they are the limit objects of sequences of graphs in which the density of certain structural “motifs” is preserved. They can also be used as generating models for undirected graphs where, if we associate nodes $i$ and $j$ with points $u_i$ and $u_j$ in the unit interval, $\mathbf{W}(u_i, u_j)$ is the weight of the edge $(i,j)$. The main result of this paper (Theorem 2) shows that GNNs are transferable between deterministic graphs obtained from a graphon in this way.

Theorem (GNN transferability, informal). Let $\Phi(\mathcal{H}; \cdot)$ be a GNN with fixed parameters $\mathcal{H}$. Let $G_n$ and $G_m$ be deterministic graphs with $n$ and $m$ nodes obtained from a graphon $\mathbf{W}$. Under mild conditions, $\|\Phi(\mathcal{H}; G_n) - \Phi(\mathcal{H}; G_m)\| \to 0$ as $n, m \to \infty$.

An important consequence of this result is the existence of a trade-off between transferability and discriminability, which is related to a restriction on the passing band of the graph convolutional filters of the GNN. Its proof is based on the definition of the graphon neural network (Section 4), a theoretical limit object of independent interest that can be used to generate GNNs from a common family. The interpretation of graphon neural networks as generating models for GNNs is important because it identifies the graph as a flexible parameter of the learning architecture and allows adapting the GNN not only by changing its weights, but also by changing the underlying graph.

The rest of this paper is organized as follows. Section 2 goes over related work. Section 3 introduces preliminary definitions and discusses GNNs and graphon information processing. The aforementioned contributions are presented in Sections 4 and 5. In Section 6, the transferability of GNNs is illustrated in a numerical experiment where the performance of a recommender system is analyzed on networks of growing size. We finish with concluding remarks in Section 7. Proofs are deferred to the supplementary material.

2 Related Work

Graphons and convergent graph sequences have been broadly studied in mathematics (Lovász, 2012; Lovász and Szegedy, 2006; Borgs et al., 2008, 2012) and have found applications in statistics (Wolfe and Olhede, 2013; Xu, 2018; Gao et al., 2015), game theory (Parise and Ozdaglar, 2019), network science (Avella-Medina et al., 2018; Vizuete et al., 2020) and controls (Gao and Caines, 2018). Recent works also use graphons to study network information processing in the limit (Ruiz et al., 2020a,b; Morency and Leus, 2017). In particular, Ruiz et al. (2020) study the convergence of graph signals and graph filters by introducing the theory of signal processing on graphons. The use of limit and continuous objects, e.g. neural tangent models (Jacot et al., 2018), is also common in the analysis of the behavior of neural networks.

A concept related to transferability is the notion of stability of GNNs to graph perturbations. This is studied in (Gama et al., 2019a) building on stability analyses of graph scattering transforms (Gama et al., 2019b). These results do not consider graphs of varying size. Transferability as the number of nodes in a graph grows is analyzed in (Levie et al., 2019a), following up on the work of Levie et al. (2019b) which studies the transferability of spectral graph filters. This work looks at graphs as discretizations of generic topological spaces, which yields a different asymptotic regime relative to the graphon limits we consider in this paper.

3 Preliminary Definitions

We go over the basic architecture of a graph neural network and formally introduce graphons and graphon data. These concepts will be important in the definition of graphon neural networks in Section 4 and in the derivation of a transferability bound for GNNs in Section 5.

3.1 Graph neural networks

GNNs are deep convolutional architectures with two main components per layer: a bank of graph convolutional filters, or graph convolutions, and a nonlinear activation function. The graph convolution couples the data with the underlying network, lending GNNs the ability to learn accurate representations of network data.

Networks are represented as graphs $G = (V, E, w)$, where $V$, $|V| = n$, is the set of nodes, $E \subseteq V \times V$ is the set of edges, and $w: E \to \mathbb{R}$ is a function assigning weights to the edges of $G$. We restrict our attention to undirected graphs, so that $w(i,j) = w(j,i)$. Network data are modeled as graph signals $x \in \mathbb{R}^n$, where each element $[x]_i$ corresponds to the value of the data at node $i$ (Shuman et al., 2013; Ortega et al., 2018). In this setting, it is natural to model data exchanges as operations parametrized by the graph. This is done by considering the graph shift operator (GSO) $S \in \mathbb{R}^{n \times n}$, a matrix that encodes the sparsity pattern of $G$ by satisfying $[S]_{ij} \neq 0$ if and only if $(i,j) \in E$ or $i = j$. In this paper, we use the adjacency matrix $A$ as the GSO, but other examples include the degree matrix $D$ and the graph Laplacian $L = D - A$.

The GSO effects a shift, or diffusion, of data on the network. Note that, at each node $i$, the shift operation $Sx$ is given by $[Sx]_i = \sum_{j:(i,j)\in E} [S]_{ij} x_j$, i.e., nodes shift their data values to neighbors according to their proximity as measured by $S$. This notion of shift allows defining the convolution operation on graphs. In time or space, the filter convolution is defined as a weighted sum of data shifted through delays or translations. Analogously, we define the graph convolution as a weighted sum of data shifted to neighbors at most $K-1$ hops away. Explicitly,

$y = h *_S x := \sum_{k=0}^{K-1} h_k S^k x$    (1)

where $h = [h_0, \ldots, h_{K-1}]$ are the filter coefficients and $*_S$ denotes the convolution operation with GSO $S$. Because the graph is undirected, $S$ is symmetric and diagonalizable as $S = V \Lambda V^{\mathsf{T}}$, where $\Lambda$ is a diagonal matrix containing the graph eigenvalues and $V$ forms an orthonormal eigenvector basis that we call the graph spectral basis. Replacing $S$ by its diagonalization in (1) and calculating the change of basis $\hat{y} = V^{\mathsf{T}} y$, we get

$\hat{y} = \sum_{k=0}^{K-1} h_k \Lambda^k V^{\mathsf{T}} x = \sum_{k=0}^{K-1} h_k \Lambda^k \hat{x}$    (2)

from which we conclude that the graph convolution has a spectral representation $h(\lambda) = \sum_{k=0}^{K-1} h_k \lambda^k$, which only depends on the coefficients $h_k$ and on the eigenvalues of $S$.
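For illustration, the following NumPy sketch (not from the paper; the random graph, signal, and filter taps are placeholders) implements the graph convolution (1) and verifies that it matches its spectral form (2).

```python
import numpy as np

def graph_convolution(h, S, x):
    """Graph convolution y = sum_k h[k] S^k x, as in (1)."""
    y = np.zeros_like(x)
    z = x.copy()                 # z = S^k x, starting with k = 0
    for hk in h:
        y += hk * z
        z = S @ z                # shift the signal one more hop
    return y

def graph_convolution_spectral(h, S, x):
    """Same filter evaluated in the graph spectral domain, as in (2)."""
    lam, V = np.linalg.eigh(S)   # S symmetric: S = V diag(lam) V^T
    x_hat = V.T @ x
    h_lam = sum(hk * lam**k for k, hk in enumerate(h))  # spectral response h(lambda)
    return V @ (h_lam * x_hat)

# Example: random undirected weighted graph, K = 3 filter taps
rng = np.random.default_rng(0)
A = rng.random((20, 20)); S = np.triu(A, 1); S = S + S.T
x = rng.standard_normal(20)
h = [0.5, 0.3, 0.2]
assert np.allclose(graph_convolution(h, S, x),
                   graph_convolution_spectral(h, S, x))
```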

Denoting the nonlinear activation function $\sigma$, the $\ell$th layer of a GNN is written as

$x_\ell^f = \sigma\Big( \sum_{g=1}^{F_{\ell-1}} h_\ell^{fg} *_S x_{\ell-1}^g \Big)$    (3)

for each feature $f$, $1 \le f \le F_\ell$. The quantities $F_\ell$ and $F_{\ell-1}$ are the numbers of features at the output of layers $\ell$ and $\ell-1$ respectively, for $1 \le \ell \le L$. The GNN output is $y^f = x_L^f$, while the input features at the first layer, which we denote $x_0^g$, are the input data $x^g$, $1 \le g \le F_0$. For a more succinct representation, this GNN can also be expressed as the map $y = \Phi(\mathcal{H}; S, x)$, where the set $\mathcal{H} = \{h_\ell^{fg}\}_{\ell, f, g}$ groups all learnable parameters.
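A minimal sketch of the layer recursion (3) and of the resulting map $\Phi(\mathcal{H}; S, x)$, assuming a tanh nonlinearity and a tensor layout for the filter coefficients that is our own convention rather than the authors' implementation:

```python
import numpy as np

def gnn_layer(H_l, S, X_in):
    """One GNN layer as in (3): X_in is n x F_{l-1}, H_l is K x F_{l-1} x F_l."""
    n, _ = X_in.shape
    K, _, F_out = H_l.shape
    Y = np.zeros((n, F_out))
    Z = X_in.copy()                      # Z = S^k X_in
    for k in range(K):
        Y += Z @ H_l[k]                  # mix shifted features with the k-th taps
        Z = S @ Z
    return np.tanh(Y)                    # pointwise nonlinearity sigma

def gnn(H, S, x):
    """GNN map Phi(H; S, x): H is a list of per-layer coefficient tensors."""
    X = x.reshape(-1, 1)                 # F_0 = 1 input feature
    for H_l in H:
        X = gnn_layer(H_l, S, X)
    return X
```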

In (3), note that the GNN parameters $\mathcal{H}$ do not depend on $n$, the number of nodes of $G$. This means that, once the model is trained and these parameters are learned, the GNN can be used to perform inference on any other graph by replacing $S$ in (3). In this case, the goal of transfer learning is for the model to maintain a similar enough performance on the same task over different graphs. A question that then arises is: for which graphs are GNNs transferable? To answer this question, we focus on graphs belonging to “graph families” identified by graphons.

3.2 Graphons and graphon data

A graphon is a bounded, symmetric, measurable function $\mathbf{W}: [0,1]^2 \to [0,1]$ that can be thought of as an undirected graph with an uncountable number of nodes. This can be seen by relating nodes $i$ and $j$ with points $u_i, u_j \in [0,1]$, and edges $(i,j)$ with weights $\mathbf{W}(u_i, u_j)$. This construction suggests a limit object interpretation and, in fact, it is possible to define sequences of graphs $\{G_n\}$ that converge to $\mathbf{W}$.

3.2.1 Graphons as limit objects

To characterize the convergence of a graph sequence $\{G_n\}$, we consider arbitrary unweighted and undirected graphs $F = (V', E')$ that we call “graph motifs”. Homomorphisms of $F$ into $G = (V, E, w)$ are adjacency-preserving maps $\beta: V' \to V$ in which $(i,j) \in E'$ implies $(\beta(i), \beta(j)) \in E$. There are $|V|^{|V'|}$ maps from $V'$ to $V$, but only some of them are homomorphisms. Hence, we can define a density of homomorphisms $t(F, G)$, which represents the relative frequency with which the motif $F$ appears in $G$.

Homomorphisms of graphs into graphons are defined analogously. Denoting $t(F, \mathbf{W})$ the density of homomorphisms of the graph $F$ into the graphon $\mathbf{W}$, we then say that a sequence $\{G_n\}$ converges to the graphon $\mathbf{W}$ if, for all finite, unweighted and undirected graphs $F$,

$\lim_{n \to \infty} t(F, G_n) = t(F, \mathbf{W}).$    (4)
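As a concrete example of a homomorphism density (our illustration, not from the paper), for the triangle motif $K_3$ every homomorphism into a simple graph $G$ corresponds to a closed walk of length 3, so $t(K_3, G) = \mathrm{tr}(A^3)/n^3$:

```python
import numpy as np

def triangle_density(A):
    """Homomorphism density t(K3, G) = tr(A^3) / n^3 for a simple unweighted graph."""
    n = A.shape[0]
    return np.trace(A @ A @ A) / n**3
```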

It can be shown that every graphon is the limit object of a convergent graph sequence, and every convergent graph sequence converges to a graphon (Lovász, 2012, Chapter 11). Thus, a graphon identifies an entire collection of graphs. Regardless of their size, these graphs can be considered similar in the sense that they belong to the same “graphon family”.

A simple example of a convergent graph sequence is obtained by evaluating the graphon. In particular, in this paper we are interested in deterministic graphs $G_n$ constructed by assigning regularly spaced points $u_i = (i-1)/n$ to nodes $1 \le i \le n$ and weights $\mathbf{W}(u_i, u_j)$ to edges $(i,j)$, i.e.

$[S_n]_{ij} = \mathbf{W}(u_i, u_j)$    (5)

where $S_n$ is the adjacency matrix of $G_n$. An example of a stochastic block model graphon and of an $n$-node deterministic graph drawn from it are shown at the top of Figure 1, from left to right. A sequence $\{G_n\}$ generated in this fashion satisfies the condition in (4) and therefore converges to $\mathbf{W}$ (Lovász, 2012, Chapter 11).
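A small sketch of the construction in (5), using a two-block stochastic block model graphon similar to the one in Figure 1; the block probabilities below are placeholders, not the values used in the figure.

```python
import numpy as np

def sbm_graphon(u, v, p=0.8, q=0.2):
    """Two-community stochastic block model graphon (placeholder probabilities)."""
    same_block = (u < 0.5) == (v < 0.5)
    return np.where(same_block, p, q)

def deterministic_graph(W, n):
    """Adjacency matrix [S_n]_ij = W(u_i, u_j) with regularly spaced u_i, as in (5)."""
    u = np.arange(n) / n                  # u_i = (i - 1)/n
    return W(u[:, None], u[None, :])

S100 = deterministic_graph(sbm_graphon, 100)   # 100-node graph in this graphon family
```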

3.2.2 Graphon information processing

Data on graphons can be seen as an abstraction of network data on graphs with an uncountable number of nodes. Graphon data are defined as graphon signals $X: [0,1] \to \mathbb{R}$ mapping points of the unit interval to the real numbers (Ruiz et al., 2020). The coupling between this data and the graphon is given by the integral operator $T_{\mathbf{W}}$, which is defined as

$(T_{\mathbf{W}} X)(v) = \int_0^1 \mathbf{W}(u, v) X(u)\, du.$    (6)

Since $\mathbf{W}$ is bounded and symmetric, $T_{\mathbf{W}}$ is a self-adjoint Hilbert–Schmidt operator. This allows expressing the graphon in the operator's spectral basis as $\mathbf{W}(u,v) = \sum_{i \in \mathbb{Z} \setminus \{0\}} \lambda_i \varphi_i(u) \varphi_i(v)$ and rewriting $T_{\mathbf{W}}$ as

$(T_{\mathbf{W}} X)(v) = \sum_{i \in \mathbb{Z} \setminus \{0\}} \lambda_i \varphi_i(v) \int_0^1 \varphi_i(u) X(u)\, du$    (7)

where the eigenvalues $\lambda_i$, $i \in \mathbb{Z} \setminus \{0\}$, are ordered according to their sign and in decreasing order of absolute value, i.e. $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0 \ge \cdots \ge \lambda_{-2} \ge \lambda_{-1}$, and accumulate around 0 as $|i| \to \infty$ (Lax, 2002, Theorem 3, Chapter 28).

Similarly to the GSO, $T_{\mathbf{W}}$ defines a notion of shift on the graphon. We refer to it as the graphon shift operator (WSO), and use it to define the graphon convolution as a weighted sum of at most $K-1$ data shifts. Explicitly,

$Y = T_H X := \sum_{k=0}^{K-1} h_k \big(T_{\mathbf{W}}^{(k)} X\big)$    (8)

where $T_{\mathbf{W}}^{(k)} = T_{\mathbf{W}} \circ T_{\mathbf{W}}^{(k-1)}$ and $T_{\mathbf{W}}^{(0)} = I$ is the identity operator (Ruiz et al., 2020). The operator $T_H$ stands for the convolution with graphon $\mathbf{W}$, and $h_0, \ldots, h_{K-1}$ are the filter coefficients. Using the spectral decomposition in (7), $T_H X$ can also be written as

$(T_H X)(v) = \sum_{i \in \mathbb{Z} \setminus \{0\}} \sum_{k=0}^{K-1} h_k \lambda_i^k\, \varphi_i(v) \int_0^1 \varphi_i(u) X(u)\, du$    (9)

where we note that, like the graph convolution, $T_H$ has a spectral representation which only depends on the graphon eigenvalues $\lambda_i$ and the coefficients $h_k$.
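Numerically, the operator (6) and the graphon convolution (8) can be approximated on a regular grid of $N$ points, where the integral becomes an average over grid values; a rough sketch of this discretization (ours, not the paper's code):

```python
import numpy as np

def graphon_shift(W_grid, X_grid):
    """Riemann-sum approximation of (T_W X)(v) = int W(u, v) X(u) du, cf. (6)."""
    N = len(X_grid)
    return (W_grid.T @ X_grid) / N

def graphon_convolution(h, W_grid, X_grid):
    """Graphon convolution Y = sum_k h_k T_W^(k) X, cf. (8), on an N-point grid."""
    Y = np.zeros_like(X_grid)
    Z = X_grid.copy()                     # Z = T_W^(k) X, starting with the identity
    for hk in h:
        Y += hk * Z
        Z = graphon_shift(W_grid, Z)
    return Y

# On the grid, this is just a graph convolution with GSO S = W_grid / N,
# which previews why GNNs instantiated from WNNs can share parameters (Section 4.1).
```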

4 Graphon Neural Networks

Similarly to how sequences of graphs converge to graphons, we can think of a sequence of GNNs converging to a graphon neural network (WNN). This limit architecture is defined by a composition of layers consisting of graphon convolutions and nonlinear activation functions, tailored to process data supported on graphons. Denoting the nonlinear activation function $\sigma$, the $\ell$th layer of a graphon neural network can be written as

$X_\ell^f = \sigma\Big( \sum_{g=1}^{F_{\ell-1}} T_{H_\ell}^{fg} X_{\ell-1}^g \Big)$    (10)

for $1 \le f \le F_\ell$, where $F_\ell$ stands for the number of features at the output of layer $\ell$, $1 \le \ell \le L$. The WNN output is given by $Y^f = X_L^f$, and the input features at the first layer, $X_0^g$, are the input data $X^g$ for $1 \le g \le F_0$. A more succinct representation of this WNN can be obtained by writing it as the map $Y = \Phi(\mathcal{H}; \mathbf{W}, X)$, where $\mathcal{H}$ groups the filter coefficients at all layers. Note that the parameters in $\mathcal{H}$ are agnostic to the graphon.

4.1 WNNs as generating models for GNNs

Figure 1: Example of a graphon neural network (WNN) given by $Y = \Phi(\mathcal{H}; \mathbf{W}, X)$, and of a graph neural network (GNN) instantiated from it as $y_n = \Phi(\mathcal{H}; S_n, x_n)$. The graphon $\mathbf{W}$, shown in the top left corner, is a stochastic block model with fixed intra-community and inter-community probabilities, and the graphon signal $X$ is plotted in the bottom left corner. The graph $G_n$ (top right corner) and the graph signal $x_n$ (bottom right corner) are obtained from $\mathbf{W}$ and $X$ according to (11). Note that the parameter set $\mathcal{H}$ is shared between the WNN and the GNN.

Comparing the GNN and WNN maps [cf. Section 4] $y = \Phi(\mathcal{H}; S, x)$ and $Y = \Phi(\mathcal{H}; \mathbf{W}, X)$, we see that they can have the same set of parameters $\mathcal{H}$. On graphs belonging to a graphon family, this means that GNNs can be built as instantiations of the WNN and, therefore, WNNs can be seen as generative models for GNNs. We consider GNNs built from a WNN by defining $u_i = (i-1)/n$ for $1 \le i \le n$ and setting

$[S_n]_{ij} = \mathbf{W}(u_i, u_j) \quad \text{and} \quad [x_n]_i = X(u_i)$    (11)

where $S_n$ is the GSO of $G_n$, the deterministic graph obtained from $\mathbf{W}$ as in Section 3.2, and $x_n$ is the deterministic graph signal obtained by evaluating the graphon signal $X$ at the points $u_i$. An example of a WNN and of a GNN instantiated from it in this way is shown in Figure 1. Considering GNNs as instantiations of WNNs is interesting because it allows looking at graphs not as fixed GNN hyperparameters, but as parameters that can be tuned. In other words, it allows GNNs to be adapted both by optimizing the weights in $\mathcal{H}$ and by changing the graph $G_n$. This makes the learning model scalable and adds flexibility in cases where there are uncertainties associated with the graph.

Conversely, we can also define WNNs induced by GNNs. The WNN induced by the GNN $y_n = \Phi(\mathcal{H}; S_n, x_n)$ is defined as $Y_n = \Phi(\mathcal{H}; \mathbf{W}_n, X_n)$, and it is obtained by constructing a partition $I_1 \cup \cdots \cup I_n$ of $[0,1]$ with $I_i = [(i-1)/n, i/n)$ to define

$\mathbf{W}_n(u, v) = \sum_{i=1}^{n} \sum_{j=1}^{n} [S_n]_{ij}\, \mathbb{1}(u \in I_i)\, \mathbb{1}(v \in I_j) \quad \text{and} \quad X_n(u) = \sum_{i=1}^{n} [x_n]_i\, \mathbb{1}(u \in I_i)$    (12)

where $\mathbf{W}_n$ is the graphon induced by $G_n$ and $X_n$ is the graphon signal induced by the graph signal $x_n$. This definition is useful because it allows comparing GNNs with WNNs.
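A sketch of the instantiation (11) and induction (12) steps; the induced graphon and signal are the piecewise-constant (step) functions over the partition $I_1, \ldots, I_n$, and the function names here are illustrative:

```python
import numpy as np

def instantiate(W, X, n):
    """Eq. (11): sample the graphon and graphon signal at u_i = (i - 1)/n."""
    u = np.arange(n) / n
    S_n = W(u[:, None], u[None, :])
    x_n = X(u)
    return S_n, x_n

def induced_graphon(S_n):
    """Eq. (12): step-function graphon W_n(u, v) = [S_n]_ij for u in I_i, v in I_j."""
    n = S_n.shape[0]
    def W_n(u, v):
        i = np.minimum((np.asarray(u) * n).astype(int), n - 1)
        j = np.minimum((np.asarray(v) * n).astype(int), n - 1)
        return S_n[i, j]
    return W_n

def induced_signal(x_n):
    """Eq. (12): step-function signal X_n(u) = [x_n]_i for u in I_i."""
    n = len(x_n)
    def X_n(u):
        i = np.minimum((np.asarray(u) * n).astype(int), n - 1)
        return x_n[i]
    return X_n
```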

4.2 Approximating WNNs with GNNs

Consider GNNs instantiated from a WNN as in (11). For increasing $n$, $G_n$ converges to $\mathbf{W}$, so we can expect these GNNs to become increasingly similar to the WNN. In other words, the output $y_n = \Phi(\mathcal{H}; S_n, x_n)$ of the GNN and the output $Y = \Phi(\mathcal{H}; \mathbf{W}, X)$ of the WNN should grow progressively close, and $y_n$ can be used to approximate $Y$. We wish to quantify how good this approximation is for different values of $n$. Naturally, the continuous output $Y$ cannot be compared with the discrete output $y_n$ directly. In order to make this comparison, we consider the output of the WNN induced by $\Phi(\mathcal{H}; S_n, x_n)$, which is given by $Y_n = \Phi(\mathcal{H}; \mathbf{W}_n, X_n)$ [cf. (12)]. We also consider the following assumptions.

As 1.

The graphon $\mathbf{W}$ is $A_1$-Lipschitz, i.e. $|\mathbf{W}(u_2, v_2) - \mathbf{W}(u_1, v_1)| \le A_1 (|u_2 - u_1| + |v_2 - v_1|)$.

As 2.

The convolutional filters $h(\lambda) = \sum_{k=0}^{K-1} h_k \lambda^k$ are $A_2$-Lipschitz in $\lambda$ and non-amplifying, i.e. $|h(\lambda)| \le 1$.

As 3.

The graphon signal $X$ is $A_3$-Lipschitz.

As 4.

The activation functions are normalized Lipschitz, i.e. $|\sigma(x_2) - \sigma(x_1)| \le |x_2 - x_1|$, and $\sigma(0) = 0$.

Theorem 1 (WNN approximation by GNN).

Consider the $L$-layer WNN given by $Y = \Phi(\mathcal{H}; \mathbf{W}, X)$, with $F_\ell = F$ features at layers $1 \le \ell \le L-1$ and $F_0 = F_L = 1$. Let the graphon convolutions [cf. (9)] be such that the spectral response $h(\lambda)$ is constant for $|\lambda| < c$. For the GNN instantiated from this WNN as $y_n = \Phi(\mathcal{H}; S_n, x_n)$ [cf. (11)], under Assumptions 1 through 4 the distance $\|Y - Y_n\|$ between the WNN output and the output $Y_n = \Phi(\mathcal{H}; \mathbf{W}_n, X_n)$ of the WNN induced by this GNN [cf. (12)] is bounded by the sum of two terms: a transferability term proportional to $\|X\|$, which grows with $L$, $F$, $A_1$ and $A_2$, with the number $B_{nc}$ of eigenvalues of $T_{\mathbf{W}_n}$ whose absolute value is at least $c$, and with the inverse of $\delta_{nc}$, the minimum distance between the eigenvalues of $T_{\mathbf{W}}$ and $T_{\mathbf{W}_n}$ on either side of $c$, and which vanishes as $n \to \infty$; and a fixed error term measuring the distance $\|X - X_n\|$ between the graphon signal and the induced graphon signal.

From Theorem 1, we conclude that a graphon neural network can be approximated with performance guarantees by the GNN $\Phi(\mathcal{H}; S_n, x_n)$, where the graph $G_n$ and the signal $x_n$ are obtained from $\mathbf{W}$ and $X$ as in (11). In this case, the approximation error is controlled by the transferability constant and the fixed error term. The fixed error term is unrelated to the GNN architecture and measures the difference between the graphon signal $X$ and the graph signal $x_n$. The transferability constant depends on the graphon, the parameters of the GNN, and the size of the graph. The dependence on the graphon is given by the Lipschitz constant $A_1$, which is smaller for graphons with less variability. The dependence on the architecture happens through the number of layers $L$ and the number of features $F$, as well as through the parameters $A_2$, $c$, and $B_{nc}$ of the graph convolutions. Although these parameters can be tuned, note that, in general, deeper and wider architectures have larger approximation error. For better approximation, the convolutional filters should have limited variability, which is controlled by both the Lipschitz constant $A_2$ and the length of the band over which the filters are allowed to vary. The number of eigenvalues $B_{nc}$ in this band should remain finite (i.e. we need $c > 0$) for asymptotic convergence, which is guaranteed by the fact that the eigenvalues of $T_{\mathbf{W}_n}$ converge to the eigenvalues of $T_{\mathbf{W}}$ (Lovász, 2012, Chapter 11.6) and, therefore, $\delta_{nc}$ converges to the minimum eigengap of the graphon around $c$. Finally, WNNs are increasingly well approximated by GNNs as the size of the graph $n$ grows, as expected from the limit behavior of convergent graph sequences.

5 Transferability of Graph Neural Networks

The main result of this paper is that GNNs are transferable between graphs of different sizes associated with the same graphon. This result follows from Theorem 1 by the triangle inequality.

Theorem 2 (GNN transferability).

Let $G_n$ and $G_m$, with graph signals $x_n$ and $x_m$, be graphs and graph signals obtained from the graphon $\mathbf{W}$ and the graphon signal $X$ as in (11), with $n \le m$. Consider the $L$-layer GNNs given by $\Phi(\mathcal{H}; S_n, x_n)$ and $\Phi(\mathcal{H}; S_m, x_m)$, with $F_\ell = F$ features at layers $1 \le \ell \le L-1$ and $F_0 = F_L = 1$. Let the graph convolutions [cf. (2)] be such that the spectral response $h(\lambda)$ is constant for $|\lambda| < c$. Then, under Assumptions 1 through 4, the distance between the outputs of the WNNs induced by these two GNNs [cf. (12)] is bounded by the sum of a transferability term, which depends on $L$, $F$, $A_1$ and $A_2$, on the maximum number of eigenvalues of $T_{\mathbf{W}_n}$ and $T_{\mathbf{W}_m}$ whose absolute value is at least $c$, and on the minimum distance between the eigenvalues of $T_{\mathbf{W}}$ and those of $T_{\mathbf{W}_n}$ and $T_{\mathbf{W}_m}$ around $c$, and which vanishes as $n, m \to \infty$; and a fixed error term measuring the distances of the induced graphon signals $X_n$ and $X_m$ to the graphon signal $X$.

Theorem 2 compares the vector outputs of the same GNN (with the same parameter set $\mathcal{H}$) on $G_n$ and $G_m$ by bounding the norm difference between the WNNs induced by $\Phi(\mathcal{H}; S_n, x_n)$ and $\Phi(\mathcal{H}; S_m, x_m)$. This result is useful in two important ways. First, it means that, provided that its design parameters are chosen carefully, a GNN trained on a given graph can be transferred to multiple other graphs in the same graphon family with performance guarantees. This is desirable in problems where the same task has to be replicated on different networks, because it eliminates the need for retraining the GNN on every graph. Second, it implies that GNNs are scalable, as the graph on which a GNN is trained can be smaller than the graphs on which it is deployed, and vice-versa. This is helpful in problems where the graph size can change, e.g. recommender systems with a growing customer base. In this case, the advantage of transferability is mainly that training GNNs on smaller graphs is easier than training them on large graphs.

When transferring GNNs between graphs, the performance guarantee is measured by the transferability constant and the fixed error term. The fixed error term measures the difference between the graph signals $x_n$ and $x_m$ through their distance to the graphon signal $X$. Therefore, its contribution is small if both signals are associated with the same data model. The transferability constant depends on the graphs $G_n$ and $G_m$ and, implicitly, on the graphon through the Lipschitz constant $A_1$, which measures its variability. It also depends on the width and depth of the GNN, and on the convolutional filter parameters $A_2$, $c$, and the number of eigenvalues in the band. These are design parameters, which can be tuned. In particular, keeping these design parameters fixed, Theorem 2 implies that a GNN trained on the graph $G_n$ is asymptotically transferable to any graph $G_m$ in the same family with $m \ge n$. This is because, as $m \to \infty$, the eigengap term converges to a fixed eigengap of the graphon around $c$. On the other hand, the restriction on the spectral response for $|\lambda| < c$ reflects a restriction on the passing band of the graph convolutions, suggesting a trade-off between the transferability and discriminability of GNNs.

6 Numerical Results

To illustrate Theorem 2, we simulate the problem of transferring a GNN-based recommender system between user networks of different sizes using the MovieLens 100k dataset (Harper and Konstan, 2016). This dataset contains 100,000 integer ratings between 1 and 5 given by users to movies. Each user is seen as a node of a user similarity network, and the collection of user ratings to a given movie is a signal on this graph. To build the user network, a matrix $R$ is defined where $[R]_{ui}$ is the rating given by user $u$ to movie $i$, or 0 if this rating does not exist. The proximity between users $u$ and $v$ is then calculated as the pairwise correlation between rows $[R]_u$ and $[R]_v$, and each user is connected to their 40 nearest neighbors. The graph signals $x_i$ are the column vectors consisting of the user ratings to movie $i$.
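A sketch of this construction: rating matrix, pairwise correlations between user rows, and sparsification to the 40 nearest neighbors. The handling of missing ratings and of ties below is our simplification, not necessarily the authors' exact preprocessing.

```python
import numpy as np

def user_similarity_network(R, k=40):
    """Build a kNN user graph from an (n_users x n_movies) rating matrix R (0 = missing)."""
    C = np.corrcoef(R)                       # pairwise correlation between user rows
    np.fill_diagonal(C, -np.inf)             # exclude self-similarity
    S = np.zeros_like(C)
    for u in range(C.shape[0]):
        nbrs = np.argsort(C[u])[-k:]         # the k most similar users
        S[u, nbrs] = C[u, nbrs]
    return np.maximum(S, S.T)                # symmetrize to obtain an undirected GSO
```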

Given a network with $n$ users, we implement a GNN¹ with the goal of predicting the ratings given by user 405, who is the user that has rated the most movies in the dataset (737 ratings). This GNN has one convolutional layer, followed by a readout layer at node 405 that maps that node's features to a one-hot vector of dimension 5 (corresponding to ratings 1 through 5). To generate the input data, we pick the movies rated by user 405 and generate the corresponding movie signals by zeroing out the ratings of user 405 while keeping the ratings given by other users. This data is then split between training and test sets, with part of the training data used for validation. Only training data is used to build the user network in each split.

¹We use the GNN library available at https://github.com/alelab-upenn/graph-neural-networks, implemented with PyTorch.
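A schematic PyTorch version of this model, with one graph convolutional layer and a readout at the target user's node producing logits over the 5 rating classes; the numbers of features and filter taps are placeholders, since the exact values are not reproduced here.

```python
import torch
import torch.nn as nn

class MovieGNN(nn.Module):
    def __init__(self, S, target_node, K=5, F=32):
        super().__init__()
        self.register_buffer("S", S)                       # fixed GSO (user network)
        self.target = target_node                          # e.g., the node of user 405
        self.h = nn.Parameter(0.1 * torch.randn(K, 1, F))  # filter taps h_k, 1 -> F features
        self.readout = nn.Linear(F, 5)                     # logits over ratings 1..5

    def forward(self, x):                                  # x: (n,) movie signal
        Z = x.unsqueeze(-1)                                # (n, 1), F_0 = 1
        Y = 0
        for k in range(self.h.shape[0]):
            Y = Y + Z @ self.h[k]                          # accumulate h_k S^k x
            Z = self.S @ Z
        Y = torch.relu(Y)
        return self.readout(Y[self.target])                # logits for the 5 rating classes
```

Such a model would be trained with a cross-entropy loss (e.g. nn.CrossEntropyLoss) on the rating class of user 405, as described next.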

To analyze transferability, we start by training GNNs on user subnetworks consisting of random groups of $n$ users that include user 405. We optimize the cross-entropy loss using ADAM, and keep the models with the best validation RMSE over 40 epochs. Once the weights $\mathcal{H}$ are learned, we use them to define a GNN on the full user network, and test both the subnetwork GNN and the full-network GNN on the movies in the test set. The goal here is to assess how the performance difference, i.e. the difference between the test RMSE obtained on the subnetwork and on the full user network, changes with $n$. The evolution of the difference between these RMSEs, relative to the RMSE obtained on the subnetworks, is plotted in Figure 2 for 50 random splits. We observe that this difference decreases as the size of the subnetwork increases, conforming with the transferability bound of Theorem 2.

Figure 2: Relative RMSE difference on the test set between the GNN evaluated on the $n$-user subnetwork and the same GNN deployed on the full user network. Average over 50 random splits. Error bars are shown at reduced scale.

7 Conclusions

We have introduced WNNs and shown that they can be used as generating models for GNNs. We have also demonstrated that GNNs can be used to approximate WNNs arbitrarily well, with an approximation error that decays asymptotically with the graph size $n$. This result is used to prove transferability of GNNs on deterministic graphs associated with the same graphon. The extent to which a GNN is transferable depends on the graphon, the parameters of the learning architecture, and the numbers of nodes of both graphs. In particular, the GNN output difference decays asymptotically with the graph sizes for graph convolutional filters with small passing bands, suggesting a trade-off between representation power and stability. Finally, GNN transferability was demonstrated in a numerical experiment where we observed that a recommender system trained on a subnetwork of users and deployed on the full user network is increasingly transferable with the size of the subnetwork.

Broader Impact

A very important implication of GNN transferability is allowing learning models to be replicated in different networks without the need for redesign. This can potentially save both data and computational resources. However, since our work utilizes standard training procedures of graph neural networks, it may inherit any potential biases present in supervised training methods, e.g. data collection bias.

References

Proof of Theorem 1

To prove Theorem 1, we interpret graphon convolutions as generative models for graph convolutions. Given the graphon $\mathbf{W}$ and a graphon convolution $Y = T_H X$ written as in (9), i.e.

$(T_H X)(v) = \sum_{i \in \mathbb{Z} \setminus \{0\}} \sum_{k=0}^{K-1} h_k \lambda_i^k\, \varphi_i(v) \int_0^1 \varphi_i(u) X(u)\, du,$

we can generate graph convolutions by defining $u_i = (i-1)/n$ for $1 \le i \le n$ and setting

$y_n = \sum_{i} \sum_{k=0}^{K-1} h_k \hat{\lambda}_i^k\, \hat{\varphi}_i \hat{\varphi}_i^{\mathsf{T}} x_n$    (13)

where $[S_n]_{ij} = \mathbf{W}(u_i, u_j)$ is the GSO of $G_n$, the deterministic graph obtained from $\mathbf{W}$ as in Section 3.2, $[x_n]_i = X(u_i)$ is the deterministic graph signal obtained by evaluating the graphon signal $X$ at the points $u_i$, and $\hat{\lambda}_i$ and $\hat{\varphi}_i$ are the eigenvalues and eigenvectors of $S_n/n$ respectively. It is also possible to define graphon convolutions induced by graph convolutions. The graph convolution in (13) induces a graphon convolution $Y_n = T_{H_n} X_n$, obtained by constructing a partition $I_1 \cup \cdots \cup I_n$ of $[0,1]$ with $I_i = [(i-1)/n, i/n)$ and defining

$(T_{H_n} X_n)(v) = \sum_{i} \sum_{k=0}^{K-1} h_k \lambda_i(T_{\mathbf{W}_n})^k\, \varphi_i^{(n)}(v) \int_0^1 \varphi_i^{(n)}(u) X_n(u)\, du$    (14)

where $\mathbf{W}_n$ is the graphon induced by $G_n$, $X_n$ is the graphon signal induced by the graph signal $x_n$ [cf. (12)], and $\lambda_i(T_{\mathbf{W}_n})$ and $\varphi_i^{(n)}$ are the eigenvalues and eigenfunctions of $T_{\mathbf{W}_n}$; on step functions, $T_{\mathbf{W}_n}$ acts as $S_n/n$, so these eigenvalues coincide with the $\hat{\lambda}_i$ above.

Theorem 1 is a direct consequence of the following theorem, which states that graphon convolutions can be approximated by graph convolutions on large graphs.

Theorem 3.

Consider the graphon convolution given by $Y = T_H X$ as in (9), where the spectral response $h(\lambda)$ is constant for $|\lambda| < c$. For the graph convolution instantiated from $T_H$ as in (13), under Assumptions 1 through 3 the distance between $Y$ and the output $Y_n$ of the graphon convolution induced by this graph convolution [cf. (14)] is bounded by the sum of a transferability term, which involves $A_1$ and $A_2$, the number of eigenvalues of $T_{\mathbf{W}_n}$ whose absolute value is at least $c$, and the minimum distance between the eigenvalues of $T_{\mathbf{W}}$ and $T_{\mathbf{W}_n}$ around $c$, and which vanishes as $n \to \infty$; and a term proportional to $\|X - X_n\|$. In particular, since $x_n$ is obtained by evaluating the $A_3$-Lipschitz graphon signal $X$ as in (13), Proposition 3 bounds $\|X - X_n\|$ by $A_3/n$, so the overall bound vanishes as $n \to \infty$.
Proof of Theorem 3.

To prove Theorem 3, we need the following three propositions.

Proposition 1.

Let $\mathbf{W}$ be an $A_1$-Lipschitz graphon, and let $\mathbf{W}_n$ be the graphon induced by the deterministic graph $G_n$ obtained from $\mathbf{W}$ as in (5). The norm of $T_{\mathbf{W}} - T_{\mathbf{W}_n}$ satisfies

$\|T_{\mathbf{W}} - T_{\mathbf{W}_n}\| \le \dfrac{2A_1}{n}.$

Proof.

Partitioning the unit interval as $I_i = [(i-1)/n, i/n)$ for $1 \le i \le n$ (the same partition used to obtain $S_n$, and thus $\mathbf{W}_n$, from $\mathbf{W}$), we can use the graphon's Lipschitz property to derive, for $u \in I_i$ and $v \in I_j$,

$|\mathbf{W}(u,v) - \mathbf{W}_n(u,v)| = |\mathbf{W}(u,v) - \mathbf{W}(u_i,u_j)| \le A_1\big(|u - u_i| + |v - u_j|\big) \le \dfrac{2A_1}{n}.$

We can then write

$\|\mathbf{W} - \mathbf{W}_n\|^2 = \displaystyle\int_0^1 \int_0^1 |\mathbf{W}(u,v) - \mathbf{W}_n(u,v)|^2\, du\, dv \le \left(\dfrac{2A_1}{n}\right)^2,$

which, since the operator norm of $T_{\mathbf{W}} - T_{\mathbf{W}_n}$ is bounded by the Hilbert–Schmidt norm $\|\mathbf{W} - \mathbf{W}_n\|$ of its kernel, implies

$\|T_{\mathbf{W}} - T_{\mathbf{W}_n}\| \le \|\mathbf{W} - \mathbf{W}_n\| \le \dfrac{2A_1}{n}.$ ∎

Proposition 2.

Let $T$ and $T'$ be two self-adjoint operators on a separable Hilbert space whose spectra are partitioned as $\sigma \cup \Sigma$ and $\sigma' \cup \Sigma'$ respectively, with $\sigma \cap \Sigma = \emptyset$ and $\sigma' \cap \Sigma' = \emptyset$. If there exists $d > 0$ such that $\mathrm{dist}(\sigma, \Sigma') \ge d$ and $\mathrm{dist}(\sigma', \Sigma) \ge d$, then the spectral projections $E_T(\sigma)$ and $E_{T'}(\sigma')$ satisfy

$\|E_T(\sigma) - E_{T'}(\sigma')\| \le \dfrac{\pi}{2} \dfrac{\|T - T'\|}{d}.$

Proof.

See (Seelmann, 2014). ∎

Proposition 3.

Let $X$ be an $A_3$-Lipschitz graphon signal, and let $X_n$ be the graphon signal induced by the deterministic graph signal $x_n$ obtained from $X$ as in (11) and (13). The norm of $X - X_n$ satisfies

$\|X - X_n\| \le \dfrac{A_3}{n}.$

Proof.

Partitioning the unit interval as $I_i = [(i-1)/n, i/n)$ for $1 \le i \le n$ (the same partition used to obtain $x_n$, and thus $X_n$, from $X$), we can use the Lipschitz property of $X$ to derive, for $u \in I_i$,

$|X(u) - X_n(u)| = |X(u) - X(u_i)| \le A_3 |u - u_i| \le \dfrac{A_3}{n}.$

We can then write