Implementation of Principal Neighbourhood Aggregation for Graph Neural Networks in PyTorch, DGL and PyTorch Geometric.
Graph Neural Networks (GNNs) have been shown to be effective models for different predictive tasks on graph-structured data. Recent work on their expressive power has focused on isomorphism tasks and countable feature spaces. We extend this theoretical framework to include continuous features - which occur regularly in real-world input domains and within the hidden layers of GNNs - and we demonstrate the requirement for multiple aggregation functions in this setting. Accordingly, we propose Principal Neighbourhood Aggregation (PNA), a novel architecture combining multiple aggregators with degree-scalers (which generalize the sum aggregator). Finally, we compare the capacity of different models to capture and exploit the graph structure via a benchmark containing multiple tasks taken from classical graph theory, which demonstrates the capacity of our model.
Graph Neural Networks (GNNs) have been an active research field for the last ten years with great advancements in graph representation learning scarselli2009 ; Bronstein_2017 ; battaglia2018relational ; hamilton2017representation . However, it is difficult to understand the effectiveness of new GNNs due to the lack of standardized benchmarks dwivedi2020benchmarking and of theoretical frameworks for their expressive power.
In fact, most work in this domain has focused on improving the GNN architectures on a set of graph benchmarks, without evaluating the capacity of the networks to properly characterize the graphs' structural properties. Only recently have there been significant studies on the expressive power of various GNN models xu2018gin ; garg2020generalization ; luan_break_2019_snowball ; morris2018weisfeiler ; murphy2019relational ; sato2020survey . However, these have mainly focused on the capacity to distinguish different graph topologies, with little work done on understanding their capacity to capture and exploit the underlying features of the graph structure.
Alternatively, some work focuses on generalizing convolutional neural networks (CNNs) to graphs using the spectral domain, as first proposed by Bruna et al. bruna_spectral_2014_spectral . To improve the efficiency of the spectral analysis and the performance of the models, Chebyshev polynomials were developed defferrard_convolutional_2017_spectral and later generalized into Cayley filters levie_cayleynets_2018_spectral or replaced by wavelet transforms xu_graph_2019_spectral . In our work, we look at the capacity of different GNN models to understand certain aspects of the spectral decomposition, namely the graph Laplacian and the spectral radius, as they constitute fundamental aspects of the graphs' spectral properties. Although the spectral properties and filters are not explicitly encoded in our architecture, a powerful enough spatial GNN should still be able to learn them effectively.

Previous work on tasks taken from classical graph theory focuses on evaluating the performance of GNN models on a single task such as shortest paths velikovic2019neural ; xu2019neural ; Graves2016 , graph moments dehmamy2019understanding or the travelling salesman problem dwivedi2020benchmarking ; joshi2019efficient . Instead, we take a different approach by developing a multi-task benchmark containing problems at both the node level and the graph level. In particular, we look at the ability of each GNN to predict single-source shortest paths, eccentricity, Laplacian features, connectivity, diameter and spectral radius. Many of these tasks are based on algorithms using dynamic programming and, therefore, are known to be well suited for GNNs xu2019neural . We believe this multi-task approach ensures that the GNNs are able to understand multiple properties simultaneously, which is fundamental for solving complex graph problems. Moreover, efficiently sharing parameters between the tasks suggests a deeper understanding of the structural features of the graphs. Furthermore, we explore the generalization ability of the networks by testing on graphs larger than those present in the training set.

We hypothesize that the aggregation layers of current GNNs are unable to extract enough information from the nodes' neighbourhoods in a single layer, which limits their expressive power and learning abilities. In fact, recent works show that different aggregators perform better on different tasks velikovic2019neural ; xu2018gin , that GNNs do not excel at learning node clusterings because they do not properly characterize the neighbourhoods noutahi_towards_2020_lapool , and that they are unable to reliably detect substructures chen_can_2020_substructure .
We first prove mathematically the need for multiple aggregators by proposing a solution for the uncountable multiset injectivity problem introduced by xu2018gin . Then, we propose the concept of degree-scalers as a generalization to the sum aggregation, which allow the network to amplify or attenuate signals based on the degree of each node. Combining the above, we design the proposed Principal Neighbourhood Aggregation (PNA) network and demonstrate empirically that using multiple aggregation strategies concurrently improves the performance of the GNN on graph theory problems.
Dehmamy et al. dehmamy2019understanding have also found empirically that using multiple aggregators (mean, sum and normalized mean), which extract similar statistics from the input message, improves the performance of GNNs on the task of graph moments. In contrast, our work extends the theoretical framework by deriving the necessity to use complementary aggregators. Accordingly, we propose using different statistical aggregations to allow each node to understand the distribution of the messages it receives, and we generalize the mean aggregation as the first moment of a set of possible n-moment aggregations.
We present a consistently well-performing and parameter-efficient encode-process-decode architecture hamrick2018relational for GNNs. This differs from traditional GNNs by allowing a variable number of convolutions, overcoming the limitation of GNNs to distributed local algorithms angluin1980local described by sato2020survey .
Using this model, we compare the multi-task performance of some of the most widely used models in the literature (GCN kipf2016gcn , GAT velikovic2017gat , GIN xu2018gin and MPNN gilmer2017mpnn ) with our PNA, and a clear hierarchy arises. In particular, we observe that the proposed PNA, formed by the combination of various aggregators and scalers, significantly outperforms these baselines. The fact that this outperformance was consistent across all tasks and all the architecture layouts we experimented with further supports our hypothesis.
In this section, we first explain the motivation behind using multiple aggregators concurrently. Then, we present the idea of degree-based scalers, linking to the prior related work of GNN expressiveness. Finally, we detail the design of graph convolutional layers which leverage the proposed Principal Neighbourhood Aggregation.
One of the main concerns of this manuscript is the ability to understand a one-hop node neighbourhood using a single GNN layer. Doing so will reduce the effects of over-smoothing zhao2019pairnorm ; chen2019measuring ; wang2019improving and allow the depth of the network to focus on understanding the interactions of far-away nodes and support more complex latent state.
Most work in the literature uses only a single aggregation method, with mean, sum and max aggregators being the most used in the state-of-the-art models xu2018gin ; kipf2016gcn ; gilmer2017mpnn ; velikovic2019neural . In Figure 1, we observe how different neighbourhood aggregators fail to discriminate between different messages when using a single GNN layer.
We formalize our observations in the theorem given below.
Theorem 1. In order to discriminate between multisets of size n whose underlying set is ℝ, at least n aggregators are needed.

Proposition 1. The moments of the multiset (as defined in Equation 4) constitute a valid example using n aggregators.
We prove Theorem 1 in Appendix A and Proposition 1 in Appendix B. Note that, unlike Xu et al. xu2018gin , we consider a continuous input feature space; this better represents many real-world tasks where the observed values have uncertainty, and better models the latent node features within a neural network's representations. Using continuous features makes the multiset uncountable and voids the injectivity proof of the sum aggregation presented by Xu et al. xu2018gin .
To make further progress, we redefine the concepts of aggregators and scalers. An aggregator is a continuous function of a multiset which computes a statistic on the neighbouring nodes, such as the mean, max or standard deviation. Continuity is important with continuous input spaces, as small variations in the input should result in small variations of the aggregator's output. Scalers are applied to the aggregated value and perform either an amplification or an attenuation of the incoming messages, dependent on the number of messages being aggregated (usually the node degree). In this framework, we may re-express the sum aggregator as a mean aggregator followed by a linear-degree scaling (see Section 2.2).
Theorem 1 proves that the number of independent aggregators used is a limiting factor of the expressiveness of GNNs. To demonstrate this empirically, we leverage four aggregators, namely mean, maximum, minimum and standard deviation. Furthermore, we note that this concept can be generalized to the normalized moment aggregator, which allows for a variable number of aggregators and extracts higher-order information about the distribution whenever the degree of the graph is high.
The following subsections will detail the aggregators we leveraged in our architectures. We also provide descriptions of a few additional aggregation functions of interest in Appendix D.
The mean is the most common message aggregator in the literature, wherein each node receives a weighted average or weighted sum of its incoming messages. Equation 1 presents, on the left, the general mean equation and, on the right, the direct-neighbour formulation, where X is any multiset, X^(t) are the nodes' features at layer t, N(i) is the neighbourhood of node i and d_i = |N(i)|. For clarity we define E[f(X)], where X is a multiset of size d, as E[f(X)] = (1/d) Σ_{x∈X} f(x).

\mu(X) = E[X], \qquad \mu_i(X^{(t)}) = \frac{1}{d_i} \sum_{j \in N(i)} X_j^{(t)} \qquad (1)
The max and min aggregators, also often used in the literature, are very useful for discrete tasks, when extrapolating such tasks to unseen distributions of graphs velikovic2019neural and for domains where credit assignment is important. Alternatively, we present the softmax and softmin aggregators in Appendix D, which are differentiable and work for weighted graphs, but do not perform as well on our benchmarks.

\max_i(X^{(t)}) = \max_{j \in N(i)} X_j^{(t)}, \qquad \min_i(X^{(t)}) = \min_{j \in N(i)} X_j^{(t)} \qquad (2)
The standard deviation (STD or σ) is used to quantify the spread of the neighbouring nodes' features, such that a node can assess the diversity of the signals it receives. Equation 3 presents, on the left, the standard deviation formulation and, on the right, the STD of a graph neighbourhood. ReLU is the rectified linear unit, used to avoid negative values caused by numerical errors, and ε is a small positive number that ensures σ is differentiable.

\sigma(X) = \sqrt{E[X^2] - E[X]^2}, \qquad \sigma_i(X^{(t)}) = \sqrt{\mathrm{ReLU}\left(\mu_i\left((X^{(t)})^2\right) - \mu_i\left(X^{(t)}\right)^2\right) + \epsilon} \qquad (3)
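As a concrete reference, the four aggregators above can be sketched in a few lines of NumPy. This is our own illustrative code, not the paper's implementation; `aggregate` is a hypothetical helper name.

```python
import numpy as np

def aggregate(msgs, eps=1e-5):
    """Apply the mean, max, min and std aggregators to a stack of
    neighbour messages of shape (num_neighbours, feat_dim)."""
    mu = msgs.mean(axis=0)
    # np.maximum with 0 plays the role of the ReLU: it guards against small
    # negative values from numerical error; eps keeps the sqrt differentiable.
    var = np.maximum((msgs ** 2).mean(axis=0) - mu ** 2, 0.0)
    return {
        "mean": mu,
        "max": msgs.max(axis=0),
        "min": msgs.min(axis=0),
        "std": np.sqrt(var + eps),
    }
```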
The mean and variance being the first and second (central) moments of a signal (n = 1, n = 2), additional moments could be useful to better describe the signal, such as the skewness (n = 3) and the kurtosis (n = 4). This becomes more important when the degree of a node is high, because the four previous aggregators are then insufficient to describe the neighbourhood accurately. The central moments, normalized by taking the n-th root so that they stay on the scale of the signal, are presented in Equation 4, for which we develop a corresponding node aggregator in Equation 5, where n is the degree of the desired moment. A ReLU can once again be used for even values of n to avoid negative moments caused by numerical errors.

M_n(X) = \sqrt[n]{E\left[(X - \mu)^n\right]} \qquad (4)

M_{n,i}(X^{(t)}) = \sqrt[n]{\mathrm{ReLU}\left(\mu_i\left(\left(X^{(t)} - \mu_i(X^{(t)})\right)^n\right)\right) + \epsilon} \qquad (5)
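A sketch of the normalized moment aggregator in NumPy; the signed root for odd n (where the moment may legitimately be negative, e.g. skewness) is our own handling choice, not prescribed by the text.

```python
import numpy as np

def moment(msgs, n, eps=1e-5):
    """Normalized n-th central moment of neighbour messages.
    msgs: (num_neighbours, feat_dim)."""
    mu = msgs.mean(axis=0)
    m = ((msgs - mu) ** n).mean(axis=0)
    if n % 2 == 0:
        # even moments cannot be negative: clamp numerical error, add eps
        return (np.maximum(m, 0.0) + eps) ** (1.0 / n)
    # odd moments (e.g. skewness, n = 3) may be negative: take a signed root
    return np.sign(m) * np.abs(m) ** (1.0 / n)
```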
Xu et al. xu2018gin show that the use of mean and max aggregators by themselves fails to distinguish between neighbourhoods with identical node feature distributions but differing cardinalities; the same applies to the other aggregators described above. They propose the sum aggregator to discriminate between such multisets. Redefining the sum aggregator as the composition of a mean aggregator and a scaling linear in the degree d of each node allows us to generalise this property:
Theorem 2. The mean aggregation composed with any scaling linear to an injective function of the neighbourhood size can generate injective functions on bounded multisets of countable elements.
We formalize and prove Theorem 2 in Appendix C. Thus, the results proven in xu2018gin about the sum aggregator become a particular case of this theorem, and we can use any kind of injective scaler to discriminate between multisets of various sizes.
Recent work shows that summation aggregation does not generalize well to unseen graphs velikovic2019neural , especially when they are of larger size. One reason is that a small change in the degree causes the messages and gradients to be amplified or attenuated exponentially (a linear amplification at each layer causes an exponential amplification after multiple layers). Although there are different strategies to deal with this problem, we propose using a logarithmic amplification to reduce this effect. Note that the logarithm is injective for positive values, and the degree d is non-negative, so log(d + 1) is well defined.
Another motivation for using logarithmic scalers is to better describe the neighbourhood influence of a given node. Suppose we have a social network where nodes A, B and C have respectively 5 million, 1 million and 100 followers. On a linear scale, nodes B and C appear more similar than nodes A and B; however, this does not accurately model their relative influence. Hence, the logarithmic scale discriminates better between messages received by influencer and follower nodes.
We propose the logarithmic scaler S_amp presented in Equation 6, where δ is the average amplification in the training set and d is the degree of the node receiving the message.

S_{\mathrm{amp}}(d) = \frac{\log(d+1)}{\delta}, \qquad \delta = \frac{1}{|\mathrm{train}|} \sum_{i \in \mathrm{train}} \log(d_i + 1) \qquad (6)
We may further generalize this scaler in Equation 7, where α is a variable parameter that is negative for attenuation, positive for amplification and zero for no scaling. Other definitions of S(d, α) can be used, such as a linear scaling, as long as the function is injective for d > 0.

S(d, \alpha) = \left(\frac{\log(d+1)}{\delta}\right)^{\alpha} \qquad (7)
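The scaler family is straightforward to compute; the following sketch (our own code, with made-up example degrees) precomputes δ from the training-set degrees once:

```python
import numpy as np

def scaler(d, alpha, delta):
    """Degree scaler S(d, alpha) = (log(d + 1) / delta) ** alpha.
    alpha = 1 amplifies, alpha = -1 attenuates, alpha = 0 is the identity."""
    return (np.log(d + 1) / delta) ** alpha

# delta is the average of log(d + 1) over all nodes in the training set
# (the degrees below are hypothetical example values)
train_degrees = np.array([1, 2, 3, 4, 10])
delta = np.log(train_degrees + 1).mean()
```

By construction, the amplification and attenuation scalers are reciprocal for any degree.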
Combining the aggregators and scalers presented in the previous sections, we now propose the Principal Neighbourhood Aggregation (PNA). The PNA performs a total of twelve operations: four neighbour aggregations with three scalers each, summarized in Equation 8, with ⊗ being the tensor product. The aggregators are defined in Equations 1-3, while the scalers are defined in Equation 7.

\bigoplus = \begin{bmatrix} I \\ S(D, \alpha=1) \\ S(D, \alpha=-1) \end{bmatrix} \otimes \begin{bmatrix} \mu \\ \sigma \\ \max \\ \min \end{bmatrix} \qquad (8)
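Concretely, the tensor product in Equation 8 amounts to concatenating every scaled aggregation. A minimal NumPy sketch for a single node (our own naming, no learned parameters):

```python
import numpy as np

def pna_aggregate(msgs, delta, eps=1e-5):
    """The twelve PNA operations for one node:
    {identity, amplification, attenuation} x {mean, std, max, min}.
    msgs: (num_neighbours, F); returns a vector of length 12 * F."""
    d = msgs.shape[0]
    mu = msgs.mean(axis=0)
    std = np.sqrt(np.maximum((msgs ** 2).mean(axis=0) - mu ** 2, 0.0) + eps)
    aggregators = [mu, std, msgs.max(axis=0), msgs.min(axis=0)]
    scalers = [1.0, np.log(d + 1) / delta, delta / np.log(d + 1)]
    # tensor product of scalers and aggregators, flattened by concatenation
    return np.concatenate([s * a for s in scalers for a in aggregators])
```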
As mentioned earlier, higher degree graphs such as social networks could benefit from further aggregators (e.g. using the moments proposed in Equation 5). We insert the PNA operator within the framework of a message passing neural network gilmer2017mpnn , obtaining the following GNN layer:
X_i^{(t+1)} = U\left(X_i^{(t)}, \bigoplus_{(j,i) \in E} M\left(X_i^{(t)}, E_{j \to i}, X_j^{(t)}\right)\right) \qquad (9)

where M and U are neural networks (for our benchmarks, a linear layer was enough). U reduces the size of the concatenated message (in ℝ^(12F)) back to ℝ^F, where F is the dimension of the hidden features in the network. As in the MPNN paper gilmer2017mpnn , we employ multiple towers to improve computational complexity and generalization performance.
Using twelve operations per kernel requires additional weights per input feature in the update function U, which could seem to make the model just quantitatively, not qualitatively, more powerful than an ordinary MPNN with a single aggregator gilmer2017mpnn . However, the overall increase in parameters in the GNN model is modest and, as per our analysis, the usage of a single aggregation method is a likely limiting factor in GNNs.
This is comparable to convolutional neural networks (CNNs), where a simple 3x3 convolutional kernel requires 9 weights per feature (1 weight per neighbour). Using a CNN with a single weight per kernel would clearly reduce its computational capacity, since the feedforward network would not be able to compute derivatives or the Laplacian operator. Hence, it is intuitive that GNNs should also require multiple weights per node, as previously demonstrated in Theorem 1. We demonstrate this observation empirically by running experiments on baseline models with larger dimensions of the hidden features (and, therefore, more parameters).
We compare various GNN layers, including PNA, on common architectures formed by such layers, followed by three fully-connected layers for node labels and a set2set (S2S) vinyals2015order readout function for graph labels. For the following experiments (the code for all the aggregators, scalers, models, architectures and dataset generation is available at https://github.com/lukecavabarrett/pna ), we used an architecture with a GRU after the aggregation function and a variable number L of repeated middle convolutional layers. In particular, we want to highlight:
Gated Recurrent Units (GRU) cho2014gru applied after the update function of each layer, as in gilmer2017mpnn ; li2015gated . Their ability to retain information from previous layers proved effective when increasing L.
Weight sharing in all the GNN layers from the second to the L-th (i.e. all but the first) makes the architecture follow an encode-process-decode configuration battaglia2018relational ; hamrick2018relational . This is a strong prior which works well on all our experimental tasks, with a parameter-efficient architecture that allows the model to have a variable number of layers, L.
Variable depth L, decided at inference time (based on the size of the input graph and/or other heuristics). This is important when using models on distributions of graphs with a variety of different sizes. In our experiments, we only used heuristics dependent on the number of nodes N; for the final architecture, we settled on a depth given by a fixed floor function of N. It would be interesting to test heuristics based on properties of the graph, such as the diameter, or an adaptive computation time heuristic graves2016adaptive based on, for example, the convergence of the nodes' features velikovic2019neural . We leave these analyses to future work.

This architecture layout (represented in Figure 2) was determined based on the combination of its downstream performance and parameter efficiency. We note that all attempted architectures have yielded similar comparative performance of the GNN layers.
Skip connections after each GNN layer, all feeding into the fully-connected layers, were also tried. They are known to improve learning in deep architectures he_deep_2015_resnet , especially for GNNs, where they reduce over-smoothing luan_break_2019_snowball . However, in the presence of GRUs, they did not give significant performance improvements on our benchmarks.
In this subsection, we present the details of the four graph convolution layers from existing models that we used to compare the performance of the PNA.
Graph Convolutional Networks (GCNs) kipf2016gcn use a form of normalized mean aggregator followed by a linear transformation and an activation function, as defined in Equation 10. Here, Â = A + I is the adjacency matrix with self-connections, D̂ is its degree matrix, W is a trainable weight matrix and b a learnable bias.

X^{(t+1)} = \sigma\left(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} X^{(t)} W + b\right) \qquad (10)
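For reference, one GCN propagation step of Equation 10 can be written directly in NumPy. This is a dense-matrix sketch of ours with a ReLU nonlinearity; practical implementations use sparse operations.

```python
import numpy as np

def gcn_layer(A, X, W, b=0.0):
    """sigma(D^-1/2 (A + I) D^-1/2 X W + b), with sigma = ReLU."""
    A_hat = A + np.eye(A.shape[0])              # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W + b, 0.0)  # ReLU activation
```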
Graph Attention Networks (GATs) velikovic2017gat perform a linear transformation of the input features followed by an aggregation of the neighbourhood as a weighted sum of the transformed features, where the weights α are set by an attention mechanism, as defined in Equation 11. Here, W is a trainable projection matrix. As in the original paper, we employ multi-head attention.

X_i^{(t+1)} = \sigma\left(\sum_{j \in N(i) \cup \{i\}} \alpha_{ij} W X_j^{(t)}\right) \qquad (11)
Message Passing Neural Networks (MPNNs) gilmer2017mpnn perform a transformation before and after an arbitrary aggregator ⊕, as defined in Equation 13, where M and U are neural networks and ⊕ is a single aggregator. In particular, we test models with sum and max aggregators, as they are the most used in the literature. As with PNA layers, we found that linear transformations are sufficient for M and U and, as in the original paper gilmer2017mpnn , we employ multiple towers.

X_i^{(t+1)} = U\left(X_i^{(t)}, \bigoplus_{j \in N(i)} M\left(X_i^{(t)}, X_j^{(t)}\right)\right) \qquad (13)
Following previous work velikovic2019neural ; you2019positionaware , the benchmark contains undirected unweighted graphs of a wide variety of types (we provide, in parentheses, the approximate proportion of the graphs in the overall benchmark). Letting N be the total number of nodes:

Erdős-Rényi erdos1960 (20%): with a random probability of edge creation between each pair of nodes

Barabási-Albert albert2002 (20%): the number of edges for a new node is taken at random

Grid (5%): m x k 2d grid graph with m k = N and m and k as close as possible

Caveman watts1999caveman (5%): with m cliques of size k, with m k = N and m and k as close as possible

Tree (15%): generated with a power-law degree distribution with exponent 3

Ladder graphs (5%)

Line graphs (5%)

Star graphs (5%)

Caterpillar graphs (10%): with a backbone of size b (drawn at random), and N - b pendent vertices uniformly connected to the backbone

Lobster graphs (10%): with a backbone of size b (drawn at random), t (drawn at random) pendent vertices uniformly connected to the backbone, and N - b - t additional pendent vertices uniformly connected to the previous pendent vertices.
Additional randomness was introduced to the generated graphs by randomly toggling edges, without strongly impacting the average degree and main structure. If m is the number of edges and m' the number of "missing edges" (m' = N(N-1)/2 - m), the probabilities p and p' of an existing and of a missing edge being toggled are:

(14)
After performing the random toggling, we discarded graphs containing singleton nodes, as these are in no way affected by the choice of aggregation. For the presented results we used graphs of small sizes (15 to 50 nodes) as they were already sufficient to demonstrate clear differences between the models.
The graph property tasks consist of a range of individual properties for each node and global properties of the entire graph. In the multi-task benchmark, we consider three node labels and three graph labels.
Single-source shortest-path lengths: length of the shortest path from a node to all the others, where the source node is specified via a one-hot vector. The labels of nodes outside the connected component of the source are set to 0. Note that, since the graph is unweighted, this task corresponds to performing a breadth-first search (BFS).
Eccentricity: for every node v, the length of the longest shortest path from v to any other node within its connected component.
Laplacian features: LX, where L = D - A is the Laplacian matrix of the graph and X are the input node feature vectors.
Connected: whether the graph is connected.
Diameter: the longest shortest path between any two nodes within the same connected component.
Spectral radius: the largest absolute value of the eigenvalues of the adjacency matrix A (always real since A is real and symmetric).

As input features, the network is provided with two vectors of size N. The first is a one-hot vector representing which node is the starting point for the single-source shortest-path task. The second is a random feature vector, with each element sampled i.i.d.
Apart from taking part in the Laplacian features task, this random feature vector also provides a “unique identifier” for the nodes in other tasks. This allows for addressing some of the problems highlighted in garg2020generalization ; chen_can_2020_substructure ; e.g. the task of whether a graph is connected could be performed by continually aggregating the maximum feature of the neighbourhood and then checking whether they are all equal in the readout. Similar strengthening via random features was also concurrently discovered by sato2020random .
While having clear differences, these tasks also share related subroutines, e.g. tasks (1, 2, 4, 5) can all be expressed via graph traversals, and the diameter is the maximum of all node eccentricities. While we do not give this sharing of subroutines as prior to the models as in velikovic2019neural , we expect models that have the capacity of understanding and exploiting the graph structure to pick up on these commonalities, efficiently share parameters and reinforce each other during the training.
Tasks are normalised by dividing each label by the maximum value of their label (among all nodes in node labels) in the training set; since all labels are non-negative, this results in all tasks having normalised labels between 0 and 1. This normalisation allows for a better equilibrium between the various tasks during the training and validation. The model’s predictive power on the benchmark is calculated as the average of the mean squared errors (MSE) on the (normalised) tasks.
We trained the models using the Adam optimizer for a maximum of 10,000 epochs, using early stopping with a patience of 1,000 epochs. Learning rates, weight decay, dropout and other hyper-parameters, such as the number of towers/attention heads, were tuned on the validation set for each model. For each model, we ran 10 training runs with different seeds and different hyper-parameters (close to the tuned values) and report the five with the lowest validation error.
The multi-task results are presented in Figure 3(a), where we observe that the proposed PNA model consistently outperforms state-of-the-art models, and in Figure 3(b), where we note that the PNA performs consistently better on all tasks. The baseline represents the MSE from predicting the average of the training set for all tasks (equivalently, the variance of each task).
The trend of these multi-task results follows, but amplifies, the differences in the average performance of the models when trained separately on the individual tasks, which suggests that the PNA model can better capture and exploit the common sub-units of these tasks. Further, the PNA performed best on all architecture layouts we attempted. We should note that GIN was the only architecture whose performance suffered when switching from an architecture without weight sharing to the encode-process-decode architecture; in all other cases, the GIN model performed in between the MPNN and GAT models.
We note in Figure 3(b) that GCN, GAT and GIN are unable to estimate the Laplacian transformation of the features and perform very close to the baseline on the other node tasks. This shows a strong limitation of these models' capacity and can be attributed to the over-smoothing effect. This limitation is slightly reduced using skip connections, but their general performance remains close to the baseline.
In order to demonstrate that the performance improvements of the PNA model are not due to the (relatively small) number of additional parameters it has compared to the other models (about 15%), we ran tests on all the other models with latent size increased from 16 to 20 features. The results of these models, presented in Table 1, suggest that even when baseline models are given 30% more parameters than the PNA, they are qualitatively less capable of capturing the graphs’ structure.
| Model | # params (size 16) | Avg score (size 16) | # params (size 20) | Avg score (size 20) |
|---|---|---|---|---|
| PNA | 8350 | -3.13 | - | - |
| MPNN (sum) | 7294 | -2.53 | 11186 | -2.19 |
| MPNN (max) | 8032 | -2.50 | 12356 | -2.23 |
| GAT | 6694 | -2.26 | 10286 | -2.08 |
| GCN | 6662 | -2.04 | 10246 | -1.96 |
| GIN | 7272 | -1.99 | 11168 | -1.91 |
Finally, we explored the extrapolation of the models to larger graphs: we trained models on graphs of sizes between 15 and 25, validated them on graphs between 25 and 30 and evaluated their performance on graphs between 20 and 50. This task presents many challenges, two of the most significant being that, firstly, unlike in velikovic2019neural , the models are given neither step-wise supervision nor training on extendable subroutines and, secondly, the models have to cope with their architectures being extended to more hidden layers than they were trained with, which can sometimes cause problems with rapidly increasing feature scales.
Due to the aforementioned challenges, as expected, the performance of the models (as a proportion of the baseline performance) gradually worsens, with some of the models exhibiting feature explosions. However, the PNA model once again consistently outperformed all the other models on all graph sizes. Our results also follow the findings in velikovic2019neural , i.e. that, among single aggregators, max tends to perform best when extrapolating to larger graphs. For the PNA, we believe that it converges to a better aggregator combining the advantages of each operation.
We have extended the theoretical framework in which GNNs are analysed to continuous features and proven the need for multiple aggregators in such circumstances. We also generalize the aggregation by presenting degree-scalers and propose the use of a logarithmic scaling. Taking all of the above into consideration, we have presented a method, Principal Neighbourhood Aggregation, composed of multiple aggregators and degree-scalers. With the goal of understanding the capacity of GNNs to capture graph structures, we have proposed a novel multi-task benchmark and an encode-process-decode architecture for solving it. We believe that our findings constitute a step towards establishing a hierarchy of models w.r.t. their expressive power and, in this sense, the PNA model appears to outperform the prior art in GNN layer design.
Proof of Theorem 1.
Let D be the n-dimensional subspace of ℝ^n formed by all tuples (x_1, ..., x_n) such that x_1 ≤ x_2 ≤ ... ≤ x_n, and notice how D is in bijection with the collection of the aforementioned multisets. We defined an aggregator as a continuous function from multisets to reals, which corresponds to a continuous function D → ℝ.

Assume by contradiction that it is possible to discriminate between all the multisets of size n using only n-1 aggregators, viz. g_1, ..., g_(n-1).

Define f : D → ℝ^(n-1) to be the function mapping each multiset X to its output vector (g_1(X), ..., g_(n-1)(X)). Since g_1, ..., g_(n-1) are continuous, so is f, and, since we assumed these aggregators are able to discriminate between all the multisets, f is injective.

As D is an n-dimensional Euclidean subspace, it is possible to define an (n-1)-sphere entirely contained within it, i.e. S^(n-1) ⊂ D. According to the Borsuk-Ulam theorem [41, 42], there are two distinct (in particular, antipodal) points x, x' ∈ S^(n-1) satisfying f(x) = f(x'), showing f not to be injective; hence the required contradiction. ∎

Note: n aggregators are actually sufficient. A simple example is to use g_i(X) = x_(i), where x_(i) is the i-th smallest item in X. It is clear that the multiset whose elements are {g_1(X), ..., g_n(X)} is X itself, which can hence be uniquely determined by the n aggregators.
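The sufficiency note can be checked mechanically: sorting is exactly the tuple of the n order-statistic aggregators, and any two orderings of the same multiset produce the same tuple (a tiny sketch of ours):

```python
import numpy as np

def order_statistic_aggregators(x):
    """g_i(X) = i-th smallest element of X; the full tuple recovers X."""
    return np.sort(np.asarray(x, dtype=float))
```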
Proof of Proposition 1.
For n ≤ 2, we can trivially uniquely determine the original multiset, so assume n > 2, and hence knowledge of n, μ and σ. Let X = (x_1, ..., x_n) be the multiset to be found, and define Y = (x_1 - μ, ..., x_n - μ).

Notice how E[Y] = 0, E[Y^2] = σ^2 and, for 2 < k ≤ n, E[Y^k] = M_k(X)^k, i.e. all the symmetric power sums p_k = Σ_i y_i^k (1 ≤ k ≤ n) of Y are uniquely determined by the moments.

Additionally, the elementary symmetric sums e_k of Y, i.e. the sums of the products of all the sub-multisets of size k (1 ≤ k ≤ n), are determined as follows: e_1, the sum of all elements, is equal to p_1; e_2, the sum of the products of all pairs in Y, is (p_1^2 - p_2)/2; e_3, the sum of the products of all triplets, is (p_1^3 - 3 p_1 p_2 + 2 p_3)/6, and so on. In general, e_k can be computed using the following recursive formula (Newton's identities [43]), with e_0 = 1:

e_k = \frac{1}{k} \sum_{i=1}^{k} (-1)^{i-1} e_{k-i} \, p_i
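The recursion above can be verified numerically: for the multiset {1, 2}, the power sums are p_1 = 3 and p_2 = 5, and the recursion returns e_1 = 1 + 2 = 3 and e_2 = 1 · 2 = 2 (illustrative code of ours):

```python
def elementary_from_power_sums(p):
    """Newton's identities: e_k = (1/k) * sum_{i=1..k} (-1)**(i-1) * e_{k-i} * p_i.
    p: list [p_1, ..., p_n] of power sums; returns [e_0, e_1, ..., e_n]."""
    e = [1.0]  # e_0 = 1
    for k in range(1, len(p) + 1):
        s = sum((-1) ** (i - 1) * e[k - i] * p[i - 1] for i in range(1, k + 1))
        e.append(s / k)
    return e
```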
Consider the polynomial P(t) = Π_{i=1..n} (t - y_i), i.e. the unique polynomial of degree n with leading coefficient 1 whose roots are the elements of Y. This defines a_0, ..., a_n, the coefficients of P, i.e. the real numbers for which P(t) = Σ_{k=0..n} a_k t^k. Using Vieta's formulas [44], a_(n-k) = (-1)^k e_k, so every coefficient of P is uniquely determined, and hence so is P. By the fundamental theorem of algebra, P has n (possibly repeated) roots, which are the elements of Y, hence uniquely determining the latter.

Finally, X can be easily determined by adding μ to each element of Y. ∎
Note: the proof above assumes knowledge of n. In the case that n is variable (as in GNNs), so that we have multisets of up to n elements, an extra aggregator will be needed. An example of such an aggregator is the mean multiplied by any injective scaler, which would allow the degree of the node to be inferred.
Proof of Theorem 2.
Let K be the countable input feature space from which the elements of the multisets are taken, and let N bound the cardinality of the multisets. Since K is countable, let Z be an injection from K to the natural numbers such that Z(x) ≥ 1 for all x ∈ K.

Let f be the injective function on the neighbourhood sizes used by the scaler and, without loss of generality, assume f(d) > 0 (otherwise, for the rest of the proof, consider f as f(d) - min_d f(d) + 1, which is positive for all d). Since the sizes are bounded, f can only take values in the finite set {f(1), ..., f(N)}, and since f is injective, f(d_1) ≠ f(d_2) for d_1 ≠ d_2.

Let β be an integer strictly greater than N and consider g(x) = β^(-Z(x)). Every value g(x) lies in (0, β^(-1)], so μ(g(X)) ∈ (0, β^(-1)] for every multiset X.

We proceed to show that the cardinality of X can be uniquely determined, and that X itself can be determined as well, by showing that there exists an injection h over the multisets.

Let us define h as a function that scales the mean of g over X linearly by the injective function f of the cardinality, for some λ > 0:

h(X) = \lambda \, f(|X|) \, \mu(g(X))

We want to show that the value of |X| can be uniquely inferred from the value of h(X). Assume by contradiction that there exist multisets X_1 and X_2 of size at most N such that |X_1| ≠ |X_2| but h(X_1) = h(X_2); since f is injective, f(|X_1|) ≠ f(|X_2|), so without loss of generality let f(|X_1|) > f(|X_2|). For a suitable choice of Z and β, the possible values of μ(g(·)) for the two sizes are separated, so that h(X_1) > h(X_2), which is a contradiction. So it is impossible for the size of a multiset to be ambiguous given the value of h.

Let us define s as the function mapping X to s(X) = |X| μ(g(X)) = Σ_{x∈X} β^(-Z(x)), which can be computed from h(X) once |X| is known. Considering the k-th digit after the decimal point in the base-β representation of s(X), it can be inferred how many elements x with Z(x) = k the multiset X contains (since |X| < β, no carrying occurs); so all the elements in X can be determined, and hence h is injective over the multisets in question. ∎
Note: this proof is a generalization of the one by Xu et al. [6] on the sum aggregator.
Apart from the aggregators we described and used above, there are other aggregators that we have experimented with or that are used in the literature; some examples are given below. Domain-specific metrics could also be an effective choice.
As an alternative to max and min, softmax and softmin are differentiable and can be weighted in the case of weighted graphs or attention networks. They also allow an asymmetric message passing in the direction of the strongest signal. Equation 15 presents the direct-neighbour formulation of the softmax and softmin operations, where X^(t) are the nodes' features at layer t and N(i) is the neighbourhood of node i:

\mathrm{softmax}_i(X^{(t)}) = \sum_{j \in N(i)} \frac{\exp(X_j^{(t)})}{\sum_{k \in N(i)} \exp(X_k^{(t)})} X_j^{(t)}, \qquad \mathrm{softmin}_i(X^{(t)}) = -\mathrm{softmax}_i(-X^{(t)}) \qquad (15)
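A NumPy sketch of these aggregators (our own code; the max-subtraction before the exponential is the usual trick for numerical stability and does not change the weights):

```python
import numpy as np

def softmax_aggregate(msgs):
    """Weighted sum of messages, with weights given by the softmax of the
    messages themselves, feature-wise over the neighbourhood axis.
    msgs: (num_neighbours, F)."""
    w = np.exp(msgs - msgs.max(axis=0))  # subtract max for numerical stability
    w = w / w.sum(axis=0)
    return (w * msgs).sum(axis=0)

def softmin_aggregate(msgs):
    """Softmin is the negated softmax of the negated messages."""
    return -softmax_aggregate(-msgs)
```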