Redundancy-Free Computation Graphs for Graph Neural Networks

06/09/2019 ∙ Zhihao Jia et al. ∙ Stanford University and Microsoft

Graph Neural Networks (GNNs) are based on repeated aggregations of information across nodes' neighbors in a graph. However, because common neighbors are shared between different nodes, this leads to repeated and inefficient computations. We propose Hierarchically Aggregated computation Graphs (HAGs), a new GNN graph representation that explicitly avoids redundancy by managing intermediate aggregation results hierarchically, eliminating repeated computations and unnecessary data transfers in GNN training and inference. We introduce an accurate cost function to quantitatively evaluate the runtime performance of different HAGs and use a novel HAG search algorithm to find optimized HAGs. Experiments show that the HAG representation significantly outperforms the standard GNN graph representation by increasing the end-to-end training throughput by up to 2.8x and reducing the aggregations and data transfers in GNN training by up to 6.3x and 5.6x, while maintaining the original model accuracy.


1 Introduction

Graph neural networks (GNNs) have shown state-of-the-art performance across a number of tasks with graph-structured data, such as social networks, molecule networks, and webpage graphs Kipf and Welling (2016); Hamilton et al. (2017); Ying et al. (2018); Xu et al. (2019); Duvenaud et al. (2015). GNNs use a recursive neighborhood aggregation scheme — in a GNN layer, each node aggregates its neighbors’ activations from the previous GNN layer and uses the aggregated value to update its own activations. The activations of the final GNN layer are used for prediction tasks, such as node classification, graph classification, or link prediction.

Due to the clustering nature of real-world graphs, different nodes in a graph may share a number of common neighbors. For example, in webpage graphs, different websites under the same domain generally have a number of common links (i.e., neighbors). As another example, in recommender systems, users in the same group may be interested in the same items.

However, existing GNN representations do not capture these common neighbors in real-world graphs, leading to redundant and unnecessary computation in both GNN training and inference. In particular, existing GNN representations define the computation in each GNN layer with a GNN computation graph (referred to as a GNN-graph). For each node $v$ in the input graph, the GNN-graph includes an individual tree structure that describes how to compute $v$'s activations by aggregating the previous-layer activations of $v$'s neighbors. Figure 1b shows the GNN-graph of the input graph in Figure 1a: for each node, the previous-layer activations of its neighbors are aggregated to compute its new activations for the current layer (see the top portion of Figure 1b), and the new activations of the other nodes are computed similarly. Notice that this representation results in redundant computation and data transfers: even in this small example, two partial aggregations over shared neighbors are each performed twice. In wider and multi-layer GNNs, the redundancies in existing GNN representations account for a significant fraction of all computation.

Figure 1: Comparison between a GNN-graph and an equivalent HAG. (a) Input graph; (b) 1-layer GNN computation graph (GNN-graph); (c) HAG that avoids redundant computation. The GNN-graph computes each node's new activations by aggregating the previous-layer activations of that node's neighbors. Because nodes in the input graph share common neighbors, the GNN-graph performs redundant computation (e.g., some pairs of shared neighbors are aggregated twice). By identifying such common computational patterns, the HAG avoids repeated computation.

In this paper, we propose a new GNN representation called Hierarchically Aggregated computation Graphs (HAGs). Figure 1c shows one possible HAG for the input graph in Figure 1a. HAGs are functionally equivalent to standard GNN-graphs (produce the same output), but represent common neighbors across different nodes using aggregation hierarchies, which eliminates redundant computation and unnecessary data transfers in both GNN training and inference. In addition, a HAG is agnostic to any particular GNN model, and can be used to eliminate redundancy for arbitrary GNNs.

For a GNN-graph, there exist numerous equivalent HAGs with different aggregation hierarchies and runtime performance. Finding HAGs with optimized performance is challenging since the number of possible HAGs is exponential in the input graph size. We introduce an accurate cost function to quantitatively estimate the performance of different HAGs and develop a novel HAG search algorithm to automatically find optimized HAGs.

Theoretically, we prove that the search algorithm finds HAGs with strong performance guarantees: (1) for GNNs whose neighborhood aggregations require a specific ordering of a node's neighbors, the algorithm finds a globally optimal HAG under the cost function; and (2) for other GNNs, the algorithm finds HAGs whose runtime performance is at least a (1 − 1/e)-approximation of that of globally optimal HAGs, using the submodularity property Mossel and Roch (2007). Empirically, the algorithm finds highly optimized HAGs for real-world graphs, reducing the number of aggregations by up to 6.3×.

Our HAG abstraction maintains the predictive performance of GNNs but leads to much faster training and inference. We evaluate the performance of HAGs on five real-world datasets and along three dimensions: (a) end-to-end training and inference performance; (b) number of aggregations; and (c) size of data transfers. Experiments show that HAGs increase the end-to-end training and inference performance by up to 2.8× and 2.9×, respectively. In addition, compared to GNN-graphs, HAGs reduce the number of aggregations and the size of data transfers by up to 6.3× and 5.6×, respectively.

To summarize, our contributions are:


  • We propose HAG, a new GNN graph representation to eliminate redundant computation and data transfers in GNNs.

  • We define a cost model to quantitatively evaluate the runtime performance of different HAGs and develop a HAG search algorithm to automatically find optimized HAGs. Theoretically, we prove that the HAG search algorithm finds at least a (1 − 1/e)-approximation of globally optimal HAGs under the cost model.

  • We show that HAGs significantly outperform GNN-graphs, increasing GNN training and inference performance by up to 2.8× and 2.9×, respectively, and reducing the aggregations and data transfers in GNN-graphs by up to 6.3× and 5.6×, respectively.

2 Related Work

Graph neural networks have been used to solve various real-world tasks with relational structures Kipf and Welling (2016); Hamilton et al. (2017); Ying et al. (2018); Xu et al. (2019); Duvenaud et al. (2015). FastGCN Chen et al. (2018) and SGC Wu et al. (2019) accelerate GNN training by importance sampling and by removing nonlinearities, respectively. This paper solves an orthogonal problem: how to optimize GNN efficiency while maintaining network accuracy. HAG is agnostic to any particular GNN model and provides a general approach that can be automatically applied to eliminate redundancy for arbitrary GNN models.

Join-trees are a tree decomposition technique that maps a graph into a corresponding tree structure to solve optimization problems on the graph, such as query optimization Flum et al. (2002). Although a join-tree provides a possible way to find optimal HAGs for a GNN-graph, its time complexity is exponential in the treewidth of the GNN-graph Arnborg et al. (1987), and real graphs tend to have very large treewidths. For example, Adcock et al. (2016) shows that the treewidths of real-world social networks grow linearly with the network size, making it infeasible to use join-trees to find optimal HAGs.

Computation reduction in neural networks.

Several techniques have been proposed to reduce computation in neural networks, including weights pruning Han et al. (2015) and quantization Han et al. (2016). These techniques reduce computation at the cost of modifying networks, resulting in decreased accuracy (as reported in these papers). By contrast, we propose a new GNN representation that accelerates GNN training by eliminating redundancy in GNN-graphs while maintaining the original network accuracy.

3 Hierarchically Aggregated Computation Graphs (HAGs)

Set Aggregate:
    GCN Kipf and Welling (2016), with summation aggregation
    GraphSAGE-P Hamilton et al. (2017), with element-wise max-pooling aggregation
Sequential Aggregate:
    GraphSAGE-LSTM Hamilton et al. (2017), with an LSTM over the ordered neighbors
    N-ary Tree-LSTM Tai et al. (2015), with an LSTM-style aggregation over the ordered in-neighbors
Table 1: Existing GNNs described in our abstraction, grouped by Aggregate type. GraphSAGE-P and GraphSAGE-LSTM are the pooling and LSTM variants of GraphSAGE, respectively; σ and max indicate element-wise non-linear activation and max functions. For a sequential Aggregate, $n_i(v)$ denotes the $i$-th in-neighbor of node $v$.
1: $h_v^{(0)} \leftarrow x_v$ for all $v \in \mathcal{V}$
2: for $k = 1$ to $K$ do
3:     for $v \in \mathcal{V}$ do
4:         $a_v^{(k)} \leftarrow \textsc{Aggregate}(\{h_u^{(k-1)} \mid u \in \mathcal{N}(v)\})$
5:         $h_v^{(k)} \leftarrow \textsc{Update}(a_v^{(k)}, h_v^{(k-1)})$
6: $\mathcal{L} \leftarrow \textsc{Loss}(\{h_v^{(K)} \mid v \in \mathcal{V}\})$
7: Goal: minimize $\mathcal{L}$
Algorithm 1 An abstraction for GNNs. $\mathcal{V}$ is the set of nodes in the input graph, and $\mathcal{N}(v)$ denotes the set of neighbors of node $v$.

GNN abstraction.

A GNN takes an input graph and node features as inputs and iteratively learns representations for individual nodes over the entire graph through a number of GNN layers. Algorithm 1 shows an abstraction for GNNs: $h_v^{(k)}$ is the learned activation of node $v$ at layer $k$, and we initialize $h_v^{(0)}$ with the input node features $x_v$. At the $k$-th layer, $a_v^{(k)}$ denotes the aggregated activations of $v$'s neighbors, which is combined with $h_v^{(k-1)}$ to compute an updated activation $h_v^{(k)}$. The learned node activations of the final layer (i.e., $h_v^{(K)}$) are used for predictions, and a GNN model generally minimizes a loss function $\mathcal{L}$ that takes the final node activations as inputs (line 6).
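To make the abstraction concrete, the following is a minimal Python/NumPy sketch of Algorithm 1. The `aggregate` and `update` callables are illustrative stand-ins (a GCN-style summation and a ReLU update), not the exact formulas of any model in Table 1.

```python
import numpy as np

def gnn_forward(neighbors, x, num_layers, aggregate, update):
    """neighbors: dict mapping each node to a list of its neighbors.
    x: dict mapping each node to its input feature vector (h_v^(0))."""
    h = dict(x)                                   # line 1: h_v^(0) = x_v
    for _ in range(num_layers):                   # line 2: for each GNN layer
        a = {v: aggregate([h[u] for u in neighbors[v]]) for v in neighbors}  # line 4
        h = {v: update(a[v], h[v]) for v in neighbors}                       # line 5
    return h                                      # final activations, fed to the loss

# Illustrative GCN-style instantiation (summation aggregate, ReLU update).
dim = 16
W = np.random.randn(dim, dim)
aggregate = lambda hs: np.sum(hs, axis=0) if hs else np.zeros(dim)
update = lambda a, h: np.maximum(W @ (a + h), 0.0)
```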

Existing GNN models use a GNN computation graph (GNN-graph) to describe the computation in each GNN layer, as shown in Figure 1b. For each node $v$ in the input graph, the GNN-graph includes an individual tree structure that defines how to compute the activations of node $v$ by aggregating the previous-layer activations of $v$'s neighbors (i.e., $h_u^{(k-1)}$ for $u \in \mathcal{N}(v)$). GNN-graphs are efficient at expressing direct neighborhood relations between nodes, but they cannot capture common neighbors shared across multiple nodes, leading to redundant computation in GNN training and inference.

3.1 HAG Definition

We propose Hierarchically Aggregated computation Graphs (HAGs) for GNNs, which eliminate redundancy in GNN-graphs by hierarchically managing and reusing intermediate aggregation results. Compared to a GNN-graph, a HAG includes a new set of aggregation nodes, each of which represents the intermediate aggregation result for a subset of nodes (i.e., an aggregation over a subset of $\mathcal{V}$). Similar to edges in GNN-graphs, an edge $(u, v)$ in a HAG denotes an aggregation relation: computing $v$'s activations requires aggregating $u$'s activations.

Our HAG abstraction is general and applicable to many existing GNN models. Table 1 shows how to use our abstraction to define existing GNNs, which can be further divided into two categories.


  • Set Aggregate. Most GNNs assume the neighbors of a node have no ordering, and the aggregations are associative and commutative operations that are invariant to the order in which the aggregations are performed. Examples include GCN with summation aggregations and GraphSAGE-P with element-wise pooling aggregations (Table 1). Note that set aggregations in GNNs are designed to be order invariant and thus can be performed in a hierarchical fashion as we do in HAGs.

  • Sequential Aggregate. Another class of GNNs requires a specific ordering of a node's neighbors, and their aggregations are not commutative. Examples include the N-ary Tree-LSTM Tai et al. (2015) and the LSTM variant of GraphSAGE Hamilton et al. (2017). However, HAGs can be applied to sequential aggregations as well: rather than identifying common subsets of neighbors, we identify common prefixes of the aggregated neighbor sequences, which can then be reused across nodes (see the toy example after this list).
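As a toy illustration (our own example, not taken from the paper), the snippet below shows why a set Aggregate can reuse any shared partial result, while a sequential Aggregate can only reuse a shared prefix of the neighbor ordering.

```python
import numpy as np

h = {u: np.random.randn(4) for u in "ABCD"}   # previous-layer activations of four nodes

# Set aggregate (e.g., summation): associative and commutative, so the partial
# result for {A, B} can be computed once and reused by every node whose
# neighbor set contains both A and B.
partial_AB = h["A"] + h["B"]
agg_v1 = partial_AB + h["C"]                  # neighbors {A, B, C}
agg_v2 = partial_AB + h["D"]                  # neighbors {A, B, D}
assert np.allclose(agg_v1, h["C"] + h["B"] + h["A"])   # order does not matter

# Sequential aggregate (e.g., an LSTM over neighbors): order matters, so only
# a common prefix such as (A, B) can be shared across nodes.
def seq_step(state, x):                       # stand-in for one recurrent step
    return np.tanh(state + x)

prefix_AB = seq_step(seq_step(np.zeros(4), h["A"]), h["B"])
agg_u1 = seq_step(prefix_AB, h["C"])          # ordered neighbors (A, B, C)
agg_u2 = seq_step(prefix_AB, h["D"])          # ordered neighbors (A, B, D)
```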

We use $\mathcal{V}$ to denote the nodes in the input graph and $\mathcal{A}$ to denote the aggregation nodes added in a HAG. The standard GNN-graph representation can be considered a special case of the HAG representation with no intermediate aggregation nodes (i.e., $\mathcal{A} = \emptyset$). We further define two additional functions for each node:

First, $\hat{a}_v$ is the aggregation result of node $v$:

$\hat{a}_v = \textsc{Aggregate}\big(\{h_u \mid u \in \hat{\mathcal{N}}(v) \cap \mathcal{V}\} \cup \{\hat{a}_u \mid u \in \hat{\mathcal{N}}(v) \cap \mathcal{A}\}\big)$

where $\hat{\mathcal{N}}(v)$ denotes the in-neighbors of node $v$ in a HAG. Note that $\hat{a}_v$ is recursively defined, and there exists a sequential ordering to evaluate $\hat{a}_v$ for all nodes since each HAG is acyclic.

Second, we use $\mathit{cover}(v)$ to describe how to compute $\hat{a}_v$ using the input activations from the previous layer:

$\hat{a}_v = \textsc{Aggregate}(\{h_u \mid u \in \mathit{cover}(v)\})$ (1)

$\mathit{cover}(v)$ defines the coverage of node $v$ in a HAG: the input-graph nodes whose previous-layer activations are used as inputs to compute $\hat{a}_v$. For the HAG example in Figure 1c, the coverage of each aggregation node is exactly the set of input-graph nodes whose activations feed into it.

For a set Aggregate, $\mathit{cover}(v)$ is an unordered set:

$\mathit{cover}(v) = \{u \mid u \in \hat{\mathcal{N}}(v) \cap \mathcal{V}\} \cup \bigcup_{w \in \hat{\mathcal{N}}(v) \cap \mathcal{A}} \mathit{cover}(w)$ (2)

For a sequential Aggregate, $\mathit{cover}(v)$ is an ordered list:

$\mathit{cover}(v) = c(u_1) \oplus c(u_2) \oplus \cdots \oplus c(u_m)$ (3)

where $u_1, \ldots, u_m$ are the ordered in-neighbors of $v$, $\oplus$ denotes list concatenation, $c(u) = \langle u \rangle$ for $u \in \mathcal{V}$, and $c(u) = \mathit{cover}(u)$ for $u \in \mathcal{A}$.
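The coverage of every node can be computed in a single pass over the HAG in topological order. The helper below is our own sketch of Equations 2 and 3, not code from the paper's implementation; input-graph in-neighbors contribute themselves, while aggregation-node in-neighbors contribute their (already computed) coverage.

```python
def compute_cover(targets_topo, in_neighbors, graph_nodes, sequential=False):
    """targets_topo: nodes with in-edges in the HAG (aggregation nodes and
    input-graph nodes), ordered so that every aggregation node appears after
    its in-neighbors. in_neighbors: dict node -> in-neighbors in the HAG
    (ordered lists if `sequential` is True). graph_nodes: set of input-graph nodes."""
    cover = {}

    def contribution(u):
        # Base case of Eq. 2/3: an input-graph node contributes itself;
        # an aggregation node contributes its own coverage.
        if u in graph_nodes:
            return [u] if sequential else {u}
        return cover[u]

    for w in targets_topo:
        parts = [contribution(u) for u in in_neighbors[w]]
        if sequential:
            cover[w] = [x for p in parts for x in p]            # Eq. 3: concatenation
        else:
            cover[w] = set().union(*parts) if parts else set()  # Eq. 2: union
    return cover
```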

3.2 GNNs with HAGs

1: $h_v^{(0)} \leftarrow x_v$ for all $v \in \mathcal{V}$
2: for $k = 1$ to $K$ do
3:     for $v \in \mathcal{V}$ do
4:         $\hat{a}_v \leftarrow h_v^{(k-1)}$
5:     for $v \in \mathcal{A}$ (in topological order) do
6:         $\hat{a}_v \leftarrow \textsc{Aggregate}(\{\hat{a}_u \mid u \in \hat{\mathcal{N}}(v)\})$
7:     for $v \in \mathcal{V}$ do
8:         $a_v^{(k)} \leftarrow \textsc{Aggregate}(\{\hat{a}_u \mid u \in \hat{\mathcal{N}}(v)\})$
9:         $h_v^{(k)} \leftarrow \textsc{Update}(a_v^{(k)}, h_v^{(k-1)})$
Algorithm 2 A GNN abstraction with HAGs. $\hat{a}_v$ denotes the result of the intermediate aggregation at node $v$ in a GNN layer. We omit layer-index superscripts on $\hat{a}_v$ to indicate that $\hat{a}$ does not need to be memorized for back propagation, and its memory can be reused across all layers.

Existing GNNs are defined with GNN-graphs as shown in Algorithm 1. We extend the GNN abstraction in Algorithm 2 to make it applicable to HAGs as well. The extension does not require any modification to a GNN model; the only difference is how the neighborhood aggregations (i.e., $a_v^{(k)}$) are computed in each GNN layer. In Algorithm 2, we first compute the results of the intermediate aggregation nodes and save them in $\hat{a}$ (lines 5-6). We then compute the neighborhood aggregations $a_v^{(k)}$ for the nodes in the input graph using the intermediate aggregation results $\hat{a}$ (lines 7-8).
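For concreteness, here is a minimal Python sketch of one GNN layer evaluated on a HAG, following the structure of Algorithm 2 and assuming a set Aggregate; the function and variable names are ours, not the paper's implementation.

```python
def hag_layer(h_prev, graph_nodes, agg_nodes_topo, in_neighbors, aggregate, update):
    """h_prev: dict node -> previous-layer activation h_v^(k-1).
    agg_nodes_topo: aggregation nodes in topological order.
    in_neighbors: dict node -> in-neighbors in the HAG."""
    a_hat = dict(h_prev)                      # lines 3-4: seed with h_v^(k-1)
    for w in agg_nodes_topo:                  # lines 5-6: intermediate aggregations
        a_hat[w] = aggregate([a_hat[u] for u in in_neighbors[w]])
    h_next = {}
    for v in graph_nodes:                     # lines 7-9: per-node aggregation + update
        a_v = aggregate([a_hat[u] for u in in_neighbors[v]])
        h_next[v] = update(a_v, h_prev[v])
    return h_next                             # a_hat can be discarded (not needed for backprop)
```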

Memory overhead.

Although Algorithm 2 introduces the new intermediate variables $\hat{a}_v$, the memory overhead for storing them is negligible, since $\hat{a}$ is not used for back propagation and can be kept in a constant amount of memory that is reused across all GNN layers. In the experiments, we show that HAGs can increase the training throughput by 2.8× at the cost of 0.1% memory overhead.

We define a GNN-graph and a HAG to be equivalent for a GNN model if (1) the GNN model outputs the same activations (i.e., $h_v^{(k)}$) at each GNN layer, and (2) the GNN model computes the same gradients for all trainable parameters in back propagation. We can use equivalent graphs interchangeably for both inference and training, since equivalent graphs produce the same outputs and gradients by definition. Theorem 1 provides a necessary and sufficient condition for graph equivalence; we prove the theorem in the Appendix.

Theorem 1.

A GNN-graph and a HAG are equivalent if and only if $\mathit{cover}(v) = \mathcal{N}(v)$ for all $v \in \mathcal{V}$, where $\mathcal{N}(v)$ is the set (or ordered list) of $v$'s neighbors in the input graph and $\mathit{cover}(v)$ is defined in Equations 2 and 3.

Equivalent graphs achieve the same model accuracy but can have very different runtime performance. Theorem 1 provides an efficient way to check equivalence between GNN-graphs and HAGs, and it can be used as an oracle when searching for optimized HAGs for any GNN-graph.
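One way to use Theorem 1 as such an oracle is to compare the coverage of every input-graph node against its neighbor set (or ordered neighbor list). The check below is a sketch built on the compute_cover helper above and is not part of the paper's implementation.

```python
def is_equivalent(hag_cover, gnn_neighbors, sequential=False):
    """hag_cover: dict v -> cover(v) computed on the HAG (set or ordered list).
    gnn_neighbors: dict v -> neighbors of v in the input GNN-graph."""
    for v, nbrs in gnn_neighbors.items():
        expected = list(nbrs) if sequential else set(nbrs)
        if hag_cover.get(v) != expected:      # Theorem 1: cover(v) must equal N(v)
            return False
    return True
```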

4 HAG Search Algorithm

For an arbitrary GNN model and an input GNN-graph, our goal is to find an equivalent HAG with optimized runtime performance. We define a realistic cost function to quantitatively evaluate the runtime performance of arbitrary HAGs, and introduce a HAG search algorithm that automatically finds an optimized HAG with the following theoretical guarantees:


  • For GNNs with sequential Aggregate, the HAG search algorithm can find globally optimal HAGs under the cost function.

  • For GNNs with set Aggregate, finding an optimal HAG is NP-hard, by a reduction from the NP-hard maximum coverage problem (see the Appendix for the proof). The search algorithm finds at least a (1 − 1/e)-approximation of globally optimal HAGs based on the submodularity property Mossel and Roch (2007).

4.1 Cost Function

We introduce a realistic cost function that quantitatively evaluates the runtime performance of a HAG by measuring the computation cost of performing one epoch of GNN training on the HAG.

The computation cost of a GNN model includes aggregating the neighbors of each node by calling Aggregate and updating the activations of each node via Update, as shown in Algorithm 2. For a GNN model $\mathcal{M}$, we denote the cost of performing one binary Aggregate on two elements by $\alpha$ and the cost of computing one Update by $\beta$. In Algorithm 2, computing an aggregation over $|\hat{\mathcal{N}}(v)|$ in-neighbors requires $|\hat{\mathcal{N}}(v)| - 1$ binary aggregations, whose cost is $(|\hat{\mathcal{N}}(v)| - 1)\,\alpha$. Therefore, the total computation cost of training a GNN model $\mathcal{M}$ on a HAG $\hat{\mathcal{G}}$ is

$\mathrm{cost}(\mathcal{M}, \hat{\mathcal{G}}) = \alpha \sum_{v \in \mathcal{V} \cup \mathcal{A}} \big(|\hat{\mathcal{N}}(v)| - 1\big) + \beta\,|\mathcal{V}|$

Since the Update term $\beta\,|\mathcal{V}|$ is determined by the input graph, our goal is to minimize the first term, i.e., the total number of binary aggregations, as much as possible.
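A direct implementation of this cost function is straightforward. The helper below is our own sketch; alpha and beta are passed in as model-specific constants.

```python
def hag_cost(graph_nodes, agg_nodes, in_neighbors, alpha, beta):
    """Computation cost of one training epoch on a HAG: a node with d
    in-neighbors needs d - 1 binary aggregations; each input-graph node
    additionally needs one Update."""
    aggregations = sum(max(len(in_neighbors.get(v, ())) - 1, 0)
                       for v in list(graph_nodes) + list(agg_nodes))
    return alpha * aggregations + beta * len(list(graph_nodes))
```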

4.2 Search Algorithm

1: Input: A GNN-graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ and a GNN model $\mathcal{M}$.
2: Output: An equivalent HAG $\hat{\mathcal{G}} = (\mathcal{V} \cup \mathcal{A}, \hat{\mathcal{E}})$
3:
4: function Redundancy($v_1$, $v_2$)
5:     if $\mathcal{M}$ has a set Aggregate then
6:         $R \leftarrow |\{w \mid \{v_1, v_2\} \subseteq \hat{\mathcal{N}}(w)\}|$
7:     else
8:         $R \leftarrow |\{w \mid \langle v_1, v_2 \rangle$ is a prefix of $\hat{\mathcal{N}}(w)\}|$
9:     return $R$
10:
11: $\hat{\mathcal{E}} \leftarrow \mathcal{E}$; $\mathcal{A} \leftarrow \emptyset$
12: while $|\mathcal{A}| <$ capacity do
13:     $(v_1, v_2) \leftarrow \arg\max_{(v_1, v_2)}$ Redundancy($v_1$, $v_2$)
14:     if Redundancy($v_1$, $v_2$) $> 1$ then
15:         $\mathcal{A} \leftarrow \mathcal{A} + \{w\}$, where $w$ is a new aggregation node
16:         $\hat{\mathcal{E}} \leftarrow \hat{\mathcal{E}} + (v_1, w) + (v_2, w)$
17:         for $u \in \mathcal{V} \cup \mathcal{A}$ do
18:             if $u$ aggregates both $v_1$ and $v_2$ (as in Redundancy) then
19:                 $\hat{\mathcal{E}} \leftarrow \hat{\mathcal{E}} - (v_1, u) - (v_2, u) + (w, u)$
20: return $\hat{\mathcal{G}}$
Algorithm 3 A HAG search algorithm to automatically find an equivalent HAG for a GNN-graph with optimized runtime performance. Redundancy($v_1$, $v_2$) calculates the number of nodes aggregating both $v_1$ and $v_2$. $\mathcal{A}$ is the set of aggregation nodes in a HAG. Recall that $\hat{\mathcal{N}}(v)$ is an ordered list for a sequential Aggregate (see Equation 3).

We present a HAG search algorithm that finds a globally optimal HAG for GNNs with a sequential Aggregate and a (1 − 1/e)-approximation of globally optimal HAGs for GNNs with a set Aggregate. In addition to an input GNN-graph and a GNN model, the algorithm takes a hyper-parameter, capacity, defining an upper limit on the number of intermediate aggregation nodes (i.e., $|\mathcal{A}| \le$ capacity).

Algorithm 3 shows the pseudocode of the HAG search algorithm. We start with an input GNN-graph, and iteratively insert aggregation nodes into the current HAG to merge highly redundant aggregations and remove unnecessary computation and data transfers.

In each iteration, we find the binary aggregation with the highest redundancy and insert a new aggregation node $w$ into the HAG to represent the result of that binary aggregation (lines 12-16). All nodes containing this binary aggregation can then directly use the output of $w$ without recomputing the aggregation (lines 17-19). The HAG search algorithm thus iteratively reduces the computation cost of the HAG by eliminating the most redundant binary aggregation in each iteration.
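The sketch below illustrates this greedy loop for a set Aggregate. It recomputes the redundancy counts from scratch in every iteration instead of maintaining the heap analyzed in the Appendix, so it is simpler but asymptotically slower; all names are ours rather than the paper's implementation.

```python
from itertools import combinations

def hag_search_set(neighbors, capacity):
    """neighbors: dict v -> set of neighbors in the GNN-graph.
    Returns (in_neighbors, agg_nodes) describing the resulting HAG."""
    in_nbrs = {v: set(nbrs) for v, nbrs in neighbors.items()}
    agg_nodes = []
    while len(agg_nodes) < capacity:
        # Redundancy(v1, v2): number of nodes that aggregate both v1 and v2.
        counts = {}
        for ins in in_nbrs.values():
            for pair in combinations(sorted(ins, key=str), 2):
                counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        (v1, v2), redundancy = max(counts.items(), key=lambda kv: kv[1])
        if redundancy <= 1:
            break                              # no redundant binary aggregation left
        w = ("agg", len(agg_nodes))            # new aggregation node (lines 15-16)
        agg_nodes.append(w)
        in_nbrs[w] = {v1, v2}
        for u in list(in_nbrs):                # rewire consumers (lines 17-19)
            if u != w and {v1, v2} <= in_nbrs[u]:
                in_nbrs[u] -= {v1, v2}
                in_nbrs[u].add(w)
    return in_nbrs, agg_nodes
```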

For a GNN model with a sequential Aggregate, Theorem 2 shows that Algorithm 3 finds an equivalent HAG with globally optimal computation cost. We prove the theorem in the Appendix.

Theorem 2.

For any GNN-graph and any GNN model with a sequential Aggregate, Algorithm 3 returns an equivalent HAG with globally minimized cost, provided the capacity is sufficiently large.

For a GNN model with a set Aggregate, Theorem 3 shows that Algorithm 3 finds a HAG that is at least a (1 − 1/e)-approximation of globally optimal HAGs. We prove the theorem in the Appendix.

Theorem 3.

For any GNN-graph $\mathcal{G}$ and any GNN model with a set Aggregate, Algorithm 3 gives a (1 − 1/e)-approximation of globally optimal HAGs under the cost function. More specifically, let $\hat{\mathcal{G}}$ be the HAG returned by Algorithm 3 and $\hat{\mathcal{G}}_o$ a globally optimal HAG under the same capacity constraint; then

$\mathrm{cost}(\mathcal{M}, \mathcal{G}) - \mathrm{cost}(\mathcal{M}, \hat{\mathcal{G}}) \ge (1 - 1/e)\,\big(\mathrm{cost}(\mathcal{M}, \mathcal{G}) - \mathrm{cost}(\mathcal{M}, \hat{\mathcal{G}}_o)\big)$

i.e., the cost reduction achieved by $\hat{\mathcal{G}}$ is at least a $(1 - 1/e)$ fraction of the optimal cost reduction.

Time complexity.

The overall time complexity of Algorithm 3 is $O(\mathrm{capacity} \cdot |\hat{\mathcal{V}}| + (|\mathcal{E}| + \mathrm{capacity}) \log |\hat{\mathcal{V}}|)$, where $\hat{\mathcal{V}} = \mathcal{V} \cup \mathcal{A}$ (see the Appendix for the proof).

5 Experiments

Our HAG abstraction maintains predictive performance of GNNs but leads to much faster runtime performance. This section evaluates the runtime performance of HAGs on five real-world graph datasets. We evaluate HAGs along three dimensions: (a) end-to-end training and inference performance; (b) number of aggregations; and (c) size of data transfers.

5.1 Implementation

Existing frameworks such as TensorFlow Abadi et al. (2016) and PyTorch Pyt (2017) are designed for dense, spatial data structures (e.g., images and text) and have limited support for irregular data structures such as graphs. As a result, GNN models in existing frameworks translate graph structures into sparse adjacency matrices and use matrix operations to perform GNN training.

We implemented the following operations in TensorFlow r1.13 to support GNN training with HAGs. First, graph_to_hag automatically transforms an input GNN-graph to an equivalent HAG with optimized performance. Second, hag_aggregate takes a HAG and nodes’ activations as inputs, and computes the aggregated activations of all nodes. Finally, hag_aggregate_grad computes the gradients of hag_aggregate for back propagation.

Our implementation minimizes changes to existing GNN programs: a GNN application can directly use all HAG optimizations by only modifying a few lines of code.

5.2 Experimental Setup

Name # Nodes # Edges
Node Classification
BZR Kriege and Mutzel (2012) 6,519 137,734
PPI Zitnik and Leskovec (2017) 56,944 1,612,348
REDDIT Hamilton et al. (2017) 232,965 57,307,946
Graph Classification
IMDB Yanardag and Vishwanathan (2015) 19,502 197,806
COLLAB Yanardag and Vishwanathan (2015) 372,474 12,288,900
Table 2: Datasets used in the experiments.

Datasets.

Table 2 summarizes the public datasets used in our experiments. BZR is a chemical compound dataset, where each node is an atom and each edge is a chemical bond between two atoms Kriege and Mutzel (2012). PPI contains a number of protein-protein interaction graphs, each of which corresponds to a different human tissue Zitnik and Leskovec (2017). REDDIT is an online discussion forum dataset, with each node being a Reddit post and each edge representing a commenting relation between posts. For both PPI and REDDIT, we directly use the preprocessed data from Hamilton et al. (2017). IMDB and COLLAB are two collaboration datasets for graph classification Yanardag and Vishwanathan (2015). IMDB is a movie collaboration dataset, with each node representing an actor/actress, while COLLAB is a scientific collaboration dataset, with each node representing a researcher.

All experiments were performed running TensorFlow r1.13 on NVIDIA Tesla V100 GPUs. Following previous work Kipf and Welling (2016); Hamilton et al. (2017), each GNN model has two GNN layers and one SoftMax layer. For the graph classification datasets, each GNN model also includes a mean-pooling layer to gather graph-level activations. For all experiments, we set the maximum capacity of aggregation nodes in a HAG to a fixed budget that achieves high performance on real-world graphs (Section 5.5 studies the effect of this capacity).

Figure 2: End-to-end performance comparison between GNN-graphs and HAGs. We measure the per-epoch training time and inference latency on a 2-layer GCN model with 16 hidden dimensions in each layer. The performance numbers are normalized by the GNN-graph numbers.

5.3 End-to-End Performance

We first measure the per-epoch training time and inference latency to run a 2-layer GCN model on different graph datasets. We follow previous work Hamilton et al. (2017); Kriege and Mutzel (2012); Yanardag and Vishwanathan (2015) to split the datasets into training/validation/testing sets, and use the testing sets to measure the inference latency.

Figure 2 compares the per-epoch training time and inference latency between GNN-graphs and HAGs. Compared to GNN-graphs, HAGs improve the training and inference performance by up to 2.8× and 2.9×, respectively, while maintaining the same network accuracy. We note that this improvement is achieved completely automatically and that computing a HAG is inexpensive. Because the improvement is essentially free, we believe there is no reason not to prefer HAGs over GNN-graphs.

(a) Set Aggregations.
(b) Sequential Aggregations.
Figure 3: Comparing the number of aggregations and the amount of data transfers between GPU threads to perform aggregations (lower is better). The y-axes are normalized by the GNN-graph numbers, and the last column in each figure is the geometric mean over all datasets.

5.4 Aggregation Performance

We further compare the aggregation performance of GNN-graphs and HAGs on the following two metrics: (1) the number of binary aggregations performed in each GNN layer; and (2) the size of data transfers between GPU threads to perform the aggregations. Note that aggregating a neighbor’s activations requires transferring the activations from GPU global memory to a thread’s local memory.

Figure 3 shows the comparison results. For GNNs with set aggregations, HAGs reduce the number of aggregations by 1.5-6.3× and the size of data transfers by 1.3-5.6×. For GNNs with sequential aggregations, HAGs reduce aggregations and data transfers by up to 1.8× and 1.9×, respectively.

Although the search algorithm finds globally optimal HAGs for sequential aggregations (Theorem 2) and only a (1 − 1/e)-approximation of globally optimal HAGs for set aggregations (Theorem 3), we observe that the performance improvement is more significant for set aggregations. Because set aggregations are permutation invariant, any common subset of neighbors can be reused, not just common prefixes, so set aggregations expose more potential redundancy than sequential aggregations. As a result, HAGs achieve larger improvements for set aggregations, even though optimal solutions are harder to compute.

It is also worth noting that the HAG search algorithm can find highly optimized HAGs even on very sparse graphs. For example, on the COLLAB dataset with a graph density of 0.01%, our algorithm reduces the number of aggregations and data transfers by 3.3× and 2.2×, respectively.

5.5 Capacity

Figure 4: Comparing different HAGs and their per-epoch GCN training time on the COLLAB dataset. The red line indicates the training time of the best discovered HAG by the search algorithm.

We study how different values of capacity affect the runtime performance of the generated HAGs. Recall that capacity is an upper bound on the number of aggregation nodes in a HAG. In our HAG search algorithm, a larger value of capacity allows the algorithm to eliminate more redundant aggregations and therefore achieves lower cost.

Figure 4 shows that a larger value of capacity can consistently improve the end-to-end training performance, which indicates that the cost function is an appropriate metric to evaluate and compare the performance of different HAGs.

By gradually increasing the capacity, the search algorithm eventually finds a HAG with 150K aggregation nodes, which consumes 6MB of memory (0.1% memory overhead) while improving the training performance by 2.8×.

6 Conclusion

We have introduced HAG, a new GNN graph representation to eliminate redundant computation and data transfers in GNNs. We propose a cost function to quantitatively evaluate the runtime performance of different HAGs and use a HAG search algorithm to find optimized HAGs. Our experiments show that HAGs significantly outperform existing GNN-graphs by improving the end-to-end training performance and reducing the aggregations and data transfers in GNN training.

References

Appendix A Proof of Theorem 1

Proof.

It suffices to prove that if $\mathit{cover}(v) = \mathcal{N}(v)$ for all $v \in \mathcal{V}$, then the GNN-graph and the HAG generate the same outputs (i.e., $h_v^{(k)}$) for every GNN layer.

We prove this by induction. Assume that the GNN-graph and the HAG generate the same outputs for the $(k-1)$-th layer; we prove that the two graphs produce the same outputs for the $k$-th GNN layer.

In Algorithm 2, $a_v^{(k)}$ is the aggregation result of node $v$, which is defined as

$a_v^{(k)} = \textsc{Aggregate}(\{h_u^{(k-1)} \mid u \in \mathit{cover}(v)\}) = \textsc{Aggregate}(\{h_u^{(k-1)} \mid u \in \mathcal{N}(v)\})$

This proves that Algorithm 1 and Algorithm 2 compute the same $a_v^{(k)}$. In addition, both algorithms use the same Update function that takes $a_v^{(k)}$ and $h_v^{(k-1)}$ as inputs and computes $h_v^{(k)}$, which implies that the two algorithms compute the same $h_v^{(k)}$. ∎

Appendix B Proof of Theorem 2

Proof.

Sequential aggregations require a specific ordering of a node's neighbors. Let $\mathcal{N}(v) = \langle u_1, u_2, \ldots \rangle$ denote the ordered list of node $v$'s neighbors, and let $P_i(v) = \langle u_1, \ldots, u_i \rangle$ denote the list of the first $i$ elements of $\mathcal{N}(v)$, where $u_j$ is the $j$-th neighbor of node $v$.

Each $P_i(v)$ represents a necessary intermediate aggregation step for computing $a_v$ (since sequential aggregations are not commutative), and therefore any equivalent HAG must compute the aggregation of $P_i(v)$ as an intermediate result. Counting the number of distinct $P_i(v)$ (where $v \in \mathcal{V}$ and $2 \le i \le |\mathcal{N}(v)|$) thus provides a lower bound on the number of aggregations any equivalent HAG must perform. Letting $\mathrm{lb}$ be the number of distinct $P_i(v)$ that must be computed by any equivalent HAG and assuming $\hat{\mathcal{G}}_o$ is a globally optimal HAG under the cost model, we have

$\#\mathrm{aggregations}(\hat{\mathcal{G}}_o) \ge \mathrm{lb}$

Assuming $\hat{\mathcal{G}}$ is the output HAG of Algorithm 3, we prove that $\#\mathrm{aggregations}(\hat{\mathcal{G}}) = \mathrm{lb}$ by contradiction. If $\#\mathrm{aggregations}(\hat{\mathcal{G}}) > \mathrm{lb}$, then $\hat{\mathcal{G}}$ must fall into one of the following two cases.

Case 1. $\hat{\mathcal{G}}$ computes at least one aggregation that is not a prefix of any $\mathcal{N}(v)$, indicating that $\hat{\mathcal{G}}$ performs some useless aggregations; this contradicts the fact that every intermediate aggregation added to $\hat{\mathcal{G}}$ must be used at least once.

Case 2. $\hat{\mathcal{G}}$ computes the aggregation of some $P_i(v)$ multiple times. However, each iteration of Algorithm 3 reduces the number of aggregations by at least 1, and the number of aggregations in the initial GNN-graph is finite; with a sufficiently large capacity, the algorithm therefore runs until no binary aggregation remains redundant, so no $P_i(v)$ is aggregated more than once in $\hat{\mathcal{G}}$, which contradicts the precondition of Case 2. ∎

Appendix C Proof of Theorem 3

Proof.

The idea of the proof is to build a monotone submodular function Cormen et al. (2009) based on the cost model.

For any GNN-graph $\mathcal{G}$ and an equivalent HAG $\hat{\mathcal{G}}$, we define

$\mathrm{gain}(\hat{\mathcal{G}}) = \mathrm{cost}(\mathcal{M}, \mathcal{G}) - \mathrm{cost}(\mathcal{M}, \hat{\mathcal{G}})$ (4)
$\phantom{\mathrm{gain}(\hat{\mathcal{G}})} = \alpha\big(|\mathcal{E}| + |\mathcal{A}| - |\hat{\mathcal{E}}|\big)$ (5)

where $\mathcal{A}$ is the set of aggregation nodes in $\hat{\mathcal{G}}$, and $\mathcal{E}$ and $\hat{\mathcal{E}}$ are the sets of edges in $\mathcal{G}$ and $\hat{\mathcal{G}}$, respectively. $\mathrm{gain}(\hat{\mathcal{G}})$ measures the number of aggregations that can be saved by using $\hat{\mathcal{G}}$ for GNN training.

We begin by defining a subset relation between different HAGs. For two HAGs $\hat{\mathcal{G}}_1$ and $\hat{\mathcal{G}}_2$, we define $\hat{\mathcal{G}}_1 \subseteq \hat{\mathcal{G}}_2$ iff $\mathcal{A}_1 \subseteq \mathcal{A}_2$, where $\mathcal{A}_1$ and $\mathcal{A}_2$ are the aggregation nodes in $\hat{\mathcal{G}}_1$ and $\hat{\mathcal{G}}_2$, respectively.

Prove that $\mathrm{gain}$ is monotone. We show that $\mathrm{gain}(\hat{\mathcal{G}}_1) \le \mathrm{gain}(\hat{\mathcal{G}}_2)$ for all $\hat{\mathcal{G}}_1 \subseteq \hat{\mathcal{G}}_2$. This is true since $\hat{\mathcal{G}}_1 \subseteq \hat{\mathcal{G}}_2$ indicates that $\hat{\mathcal{G}}_2$ contains all aggregation nodes in $\hat{\mathcal{G}}_1$, which implies that $\hat{\mathcal{G}}_2$ can save at least as many aggregations as $\hat{\mathcal{G}}_1$.

Prove that $\mathrm{gain}$ is submodular. We show that $\mathrm{gain}(\hat{\mathcal{G}}_1 + w) - \mathrm{gain}(\hat{\mathcal{G}}_1) \ge \mathrm{gain}(\hat{\mathcal{G}}_2 + w) - \mathrm{gain}(\hat{\mathcal{G}}_2)$ for all $\hat{\mathcal{G}}_1 \subseteq \hat{\mathcal{G}}_2$ and any aggregation node $w$. This inequality holds because $\mathrm{gain}(\hat{\mathcal{G}} + w) - \mathrm{gain}(\hat{\mathcal{G}})$ measures the number of aggregations we can further save by adding aggregation node $w$ to the existing HAG, which monotonically decreases as we add more aggregation nodes to the HAG.

Let $\hat{\mathcal{G}}_i$ denote the resulting HAG after the $i$-th iteration of Algorithm 3; $\hat{\mathcal{G}}_i$ includes exactly $i$ aggregation nodes. Let $\hat{\mathcal{G}}_o$ denote the optimal HAG under the cost model with $c = \mathrm{capacity}$ aggregation nodes. We claim via induction that for $0 \le i \le c$,

$\mathrm{gain}(\hat{\mathcal{G}}_i) \ge \big(1 - (1 - 1/c)^i\big)\,\mathrm{gain}(\hat{\mathcal{G}}_o)$ (6)

The base case $i = 0$ is trivially true. In the $(i+1)$-th step, Algorithm 3 selects an aggregation node $w_{i+1}$ by maximizing the marginal gain $\mathrm{gain}(\hat{\mathcal{G}}_i + w) - \mathrm{gain}(\hat{\mathcal{G}}_i)$. Observe that the aggregation nodes of $\hat{\mathcal{G}}_o$ not yet in $\hat{\mathcal{G}}_i$ form a set of at most $c$ elements whose addition to $\hat{\mathcal{G}}_i$ yields a gain of at least $\mathrm{gain}(\hat{\mathcal{G}}_o)$. The submodularity implies that

$\mathrm{gain}(\hat{\mathcal{G}}_o \cup \hat{\mathcal{G}}_i) - \mathrm{gain}(\hat{\mathcal{G}}_i) \le \sum_{w \in \mathcal{A}_o \setminus \mathcal{A}_i} \big(\mathrm{gain}(\hat{\mathcal{G}}_i + w) - \mathrm{gain}(\hat{\mathcal{G}}_i)\big)$

and this implies that the selected aggregation node $w_{i+1}$ has marginal value

$\mathrm{gain}(\hat{\mathcal{G}}_{i+1}) - \mathrm{gain}(\hat{\mathcal{G}}_i) \ge \frac{1}{c}\big(\mathrm{gain}(\hat{\mathcal{G}}_o) - \mathrm{gain}(\hat{\mathcal{G}}_i)\big)$

Assuming that Inequality 6 holds for $i$, we have

$\mathrm{gain}(\hat{\mathcal{G}}_{i+1}) \ge \mathrm{gain}(\hat{\mathcal{G}}_i) + \frac{1}{c}\big(\mathrm{gain}(\hat{\mathcal{G}}_o) - \mathrm{gain}(\hat{\mathcal{G}}_i)\big) \ge \big(1 - (1 - 1/c)^{i+1}\big)\,\mathrm{gain}(\hat{\mathcal{G}}_o)$

which proves Inequality 6. Therefore, we have

$\mathrm{gain}(\hat{\mathcal{G}}_c) \ge \big(1 - (1 - 1/c)^c\big)\,\mathrm{gain}(\hat{\mathcal{G}}_o) \ge (1 - 1/e)\,\mathrm{gain}(\hat{\mathcal{G}}_o)$

By taking in the definition of $\mathrm{gain}$ in Equations 4 and 5, we obtain the $(1 - 1/e)$-approximation on the cost reduction stated in Theorem 3. ∎

Appendix D Time Complexity of Algorithm 3

Theorem 4.

The overall time complexity of Algorithm 3 is $O(\mathrm{capacity} \cdot |\hat{\mathcal{V}}| + (|\mathcal{E}| + \mathrm{capacity}) \log |\hat{\mathcal{V}}|)$, where $\hat{\mathcal{V}} = \mathcal{V} \cup \mathcal{A}$.

Proof.

We use a heap to maintain the redundancy score of each potential node pair and only update the heap when we add or remove edges in $\hat{\mathcal{E}}$. Since the depth of the heap is at most $O(\log |\hat{\mathcal{V}}|)$ (because there can be at most $|\hat{\mathcal{V}}|^2$ node pairs), querying the most redundant binary aggregation and applying each update takes $O(\log |\hat{\mathcal{V}}|)$ time.

First, we calculate the number of queries and updates to the heap structure:

  • The algorithm iteratively pulls the most redundant binary aggregation from the heap and adds a corresponding aggregation node to $\mathcal{A}$. Since the number of aggregation nodes in $\mathcal{A}$ is smaller than capacity, the total number of queries is $O(\mathrm{capacity})$.

  • The algorithm inserts two new edges into $\hat{\mathcal{E}}$ in line 16 and removes edges from $\hat{\mathcal{E}}$ in line 19 (each invocation replaces two edges with one). Since line 16 can be invoked at most capacity times, the total number of invocations of line 19 is $O(|\mathcal{E}| + \mathrm{capacity})$. Therefore, the overall number of heap updates is $O(|\mathcal{E}| + \mathrm{capacity})$.

Second, the enumeration over all vertices in $\hat{\mathcal{V}}$ (line 17) takes $O(|\hat{\mathcal{V}}|)$ time per iteration, i.e., $O(\mathrm{capacity} \cdot |\hat{\mathcal{V}}|)$ in total. Therefore, the overall time complexity of Algorithm 3 is $O(\mathrm{capacity} \cdot |\hat{\mathcal{V}}| + (|\mathcal{E}| + \mathrm{capacity}) \log |\hat{\mathcal{V}}|)$. ∎