Graph neural networks (GNNs) are taking the world of machine learning (ML) by storm [58, 223]. They have been used in a plethora of complex problems such as node classification, graph classification, or edge prediction [110, 77]. Example areas of application are social sciences (e.g., studying human interactions), bioinformatics (e.g., analyzing protein structures), chemistry (e.g., designing compounds), medicine (e.g., drug discovery), cybersecurity (e.g., identifying intruder machines), entertainment services (e.g., predicting movie popularity), linguistics (e.g., modeling relationships between words), transportation (e.g., finding efficient routes), and others [223, 251, 244, 58, 100, 49, 66, 118, 108, 57, 23, 89]. Some recent celebrated success stories are cost-effective and fast placement of high-performance chips, simulating complex physics [170, 177], guiding mathematical discoveries, or significantly improving the accuracy of protein folding prediction.
GNNs uniquely generalize both traditional deep learning [95, 138, 15] and graph processing [152, 176, 96]. Still, contrary to the former, they do not operate on regular grids and highly structured data (as in, e.g., image processing); instead, the data in question is highly unstructured and irregular, and the resulting computations are data-driven and lack straightforward spatial or temporal locality. Moreover, contrary to the latter, vertices and/or edges are associated with complex data and processing. For example, in many GNN models, each vertex $v$ has an assigned $k$-dimensional feature vector, and each such vector is combined with the vectors of $v$'s neighbors; this process is repeated iteratively. Thus, while the overall style of such GNN computations resembles label propagation algorithms such as PageRank [167, 33], it comes with additional complexity due to the high dimensionality of the vertex features.
Yet, this is only how the simplest GNN models, such as basic Graph Convolution Networks (GCN), work. In many, if not most, GNN models, high-dimensional data may also be attached to every edge, and complex updates to the edge data take place at every iteration. For example, in the Graph Attention Network (GAT) model, to compute the scalar weight of a single edge $(u, v)$, one must first concatenate linear transformations of the feature vectors of both vertices $u$ and $v$, and then construct a dot product of the resulting vector with a trained parameter vector. Other models come with even more complexity. For example, in the Gated Graph ConvNet (G-GCN) model, the edge weight may be a multidimensional vector.
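To make the GAT-style edge weight concrete, here is a minimal numpy sketch; the shapes and variable names are illustrative assumptions, not the original implementation (the full GAT model additionally applies a LeakyReLU and normalizes the scores with a softmax over each vertex's neighborhood):

```python
import numpy as np

rng = np.random.default_rng(0)
k, kp = 4, 3                            # input / transformed feature dimensions

W = rng.standard_normal((kp, k))        # shared linear transform of vertex features
a = rng.standard_normal(2 * kp)         # trained attention parameter vector
h_u = rng.standard_normal(k)            # feature vector of vertex u
h_v = rng.standard_normal(k)            # feature vector of vertex v

# Concatenate the linearly transformed endpoint features, then take a dot
# product with the parameter vector to obtain the (unnormalized) edge weight.
score = a @ np.concatenate([W @ h_u, W @ h_v])
```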
| Symbol | Description |
| --- | --- |
| **Structure of graph inputs** | |
| $G = (V, E)$ | A graph; $V$ and $E$ are sets of vertices and edges. |
| $n, m$ | Numbers of vertices and edges in $G$; $n = \lvert V \rvert$, $m = \lvert E \rvert$. |
| $N(v), N^{-}(v)$ | Neighbors of $v$ and in-neighbors of $v$. |
| $d_v, d$ | The degree of a vertex $v$ and the maximum degree in a graph. |
| $A, D$ | The graph adjacency and the degree matrices. |
| $\hat{A}, \hat{D}$ | $A$ and $D$ matrices with self-loops ($\hat{A} = A + I$, $\hat{D} = D + I$). |
| $\tilde{A}$ | Normalization: $\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$ (symmetric) and $\hat{D}^{-1} \hat{A}$ (random-walk). |
| **Structure of GNN computations** | |
| $L, k$ | The number of GNN layers and input features. |
| $X$ | Input (vertex) feature matrix. |
| $Y, H$ | Output (vertex) feature matrix, hidden (vertex) feature matrix. |
| $x_v, y_v, h_v^{(l)}$ | Input, output, and hidden feature vector of a vertex $v$ (layer $l$). |
| $W^{(l)}$ | A parameter matrix in layer $l$. |
| $\sigma$ | Element-wise activation and/or normalization. |
| $\cdot$, $\odot$ | Matrix multiplication and element-wise multiplication. |
At the same time, parallel and distributed processing have essentially become synonyms for computational efficiency. Virtually every modern computing architecture is parallel: cores form a socket while sockets form a non-uniform memory access (NUMA) compute node. Nodes may be further clustered into blades, chassis, and racks [25, 180, 87]. Numerous memory banks enable data distribution. All these parts of the architectural hierarchy run in parallel. Even a single sequential core offers parallelism in the form of vectorization, pipelining, or instruction-level parallelism (ILP). On top of that, such architectures are often heterogeneous: processing units can be CPUs, GPUs, Field Programmable Gate Arrays (FPGAs), or others. How to harness all these rich features to achieve more performance in GNN workloads?
To help answer this question, we systematically analyze different aspects of GNNs, focusing on the amount of parallelism and distribution in these aspects. We use fundamental theoretical parallel computing machinery, for example the Work-Depth model, to reveal architecture-independent insights. We put special focus on the linear algebra formulation of computations in GNNs, and we investigate the sparsity and density of the associated tensors. This offers further insights into performance-critical features of GNN computations, and facilitates applying parallelization mechanisms such as vectorization. In general, our investigation will help to develop more efficient GNN computations.
For a systematic analysis, we propose an in-depth taxonomy of parallelism in GNNs. The taxonomy identifies fundamental forms of parallelism in GNNs. While some of them have direct equivalents in traditional deep learning, we also illustrate others that are specific to GNNs.
To ensure wide applicability of our analysis, we cover a large number of different aspects of the GNN landscape. Among others, we consider different categories of GNN models (e.g., spatial, spectral, convolutional, attentional, message passing), a large selection of GNN models (e.g., GCN, SGC, GAT, G-GCN), parts of GNN computations (e.g., inference, training), building blocks of GNNs (e.g., layers, operators/kernels), programming paradigms (e.g., SAGA-NN, GReTA), execution schemes behind GNNs (e.g., reduce, activate, different tensor operations), GNN frameworks (e.g., NeuGraph), GNN accelerators (e.g., HyGCN), GNN-driven ML tasks (e.g., node classification, edge prediction), mini-batching vs. full-batch training, different forms of sampling, and asynchronous GNN pipelines.
We finalize our work with general insights into parallel and distributed GNNs, and a set of research challenges and opportunities. Thus, our work can serve as a guide when developing parallel and distributed solutions for GNNs executing on modern architectures, and for choosing the next research direction in the GNN landscape.
1.1 Complementary Analyses
We discuss related works on the theory and applications of GNNs. There exist general GNN surveys [48, 223, 251, 244, 58, 100, 49, 241], works on theoretical aspects (the spatial–spectral dichotomy [63, 11], the expressive power of GNNs, or heterogeneous graphs [229, 224]), analyses of GNNs for specific applications (knowledge graph completion, traffic forecasting [119, 195], symbolic computing, recommender systems, text classification, or action recognition), explainability of GNNs, and works on software (SW) and hardware (HW) accelerators and SW/HW co-design. We complement these works as we focus on parallelism and distribution of GNN workloads.
1.2 Scope of this Work & Related Domains
We focus on GNNs, but we also cover parts of the associated domains. In the graph embeddings area, one develops methods for finding low-dimensional representations of elements of graphs, most often vertices [211, 68, 210, 69]. As such, GNNs can be seen as a part of this area, because one can use a GNN to construct an embedding. However, we exclude non-GNN related methods for constructing embeddings, such as schemes based on random walks [168, 97] or graph kernel designs [203, 46, 131].
2 Graph Neural Networks: Overview
We overview GNNs. Table I explains the most important notation. We first summarize a GNN computation and GNN-driven downstream ML tasks (§ 2.1). We then discuss different parts of a GNN computation in more detail, providing both the basic knowledge and general opportunities for parallelism and distribution. This includes the input GNN datasets (§ 2.2), the mathematical theory and formulations for GNN models that form the core of GNN computations (§ 2.3), GNN inference vs. GNN training (§ 2.4), and the programmability aspects (§ 2.5). We finish with a taxonomy of parallelism in GNNs (§ 2.6) and parallel & distributed theory used for formal analyses (§ 2.7).
2.1 GNN Computation: A High-Level Summary
We overview a GNN computation in Figure 1. The input is a graph dataset, which can be a single graph (usually a large one, e.g., a brain network), or several graphs (usually many small ones, e.g., chemical molecules). The input usually comes with input feature vectors that encode the semantics of a given task. For example, if the input nodes and edges model – respectively – papers and citations between these papers, then each node could come with an input feature vector being a one-hot bag-of-words encoding, specifying the presence of words in the abstract of a given publication. Then, a GNN model – underlying the training and inference process – uses the graph structure and the input feature vectors to generate the output feature vectors. In this process, intermediate hidden latent vectors are often created. Note that hidden features may be updated iteratively more than once (we refer to a single such iteration, which updates all the hidden features, as a GNN layer). The output feature vectors are then used for the downstream ML tasks such as node classification or graph classification.
A single GNN layer is summarized in Figure 2. In general, one first applies a certain graph-related operation to the features. For example, in the GCN model, one aggregates the features of the neighbors of each vertex $v$ into the feature vector of $v$ using summation. Then, a selected operation related to traditional neural networks is applied to the feature vectors. A common choice is an MLP or a plain linear projection. Finally, one often uses some form of non-linear activation (e.g., ReLU) and/or normalization.
One key difference between GNNs and traditional deep learning is the possible dependencies between input data samples, which make the parallelization of GNNs much more challenging. We show GNN data samples in Figure 3. A single sample can be a node (a vertex), an edge (a link), a subgraph, or a graph itself. One may aim to classify samples (assign labels from a discrete set) or conduct regression (assign continuous values to samples). Both vertices and edges have inter-dependencies: vertices are connected with edges, while edges share common vertices. The seminal work by Kipf and Welling focuses on node classification. Here, one is given a single graph as input, data samples are single vertices, and the goal is to classify all unlabeled vertices.
Graphs – when used as basic data samples – are usually independent [231, 225] (cf. Figure 3, 3rd column). An example use case is classifying chemical molecules. This setting resembles traditional deep learning (e.g., image recognition), where samples (single pictures) also have no explicit dependencies. Note that, as chemical molecules may differ in size, load balancing issues may arise. This also has analogies in traditional deep learning; e.g., sampled videos may also have varying sizes. However, graph classification may also feature graph samples with inter-dependencies (cf. Figure 3, 4th column). This is useful when studying, for example, relations between network communities.
2.2 Input Datasets & Output Structures in GNNs
A GNN computation starts with the input graph $G$, modeled as a tuple $(V, E)$; $V$ is a set of vertices and $E$ is a set of edges; $n = \lvert V \rvert$ and $m = \lvert E \rvert$. $N(v)$ denotes the set of vertices adjacent to vertex (node) $v$, $d_v$ is $v$'s degree, and $d$ is the maximum degree in $G$ (all symbols are listed in Table I). The adjacency matrix (AM) of a graph is $A \in \{0, 1\}^{n \times n}$. $A$ determines the connectivity of vertices: $A_{uv} = 1 \Leftrightarrow (u, v) \in E$. The input, output, and hidden feature vectors of a vertex $v$ are denoted with, respectively, $x_v$, $y_v$, and $h_v$. We have $x_v \in \mathbb{R}^k$ and $h_v, y_v \in \mathbb{R}^{O(k)}$, where $k$ is the number of input features. These vectors can be grouped in matrices, denoted respectively as $X$, $Y$, and $H$. If needed, we use the iteration index $l$ to denote the latent features in an iteration (GNN layer) $l$ ($h_v^{(l)}$, $H^{(l)}$). Sometimes, for clarity of equations, we omit the index $l$.
2.3 GNN Mathematical Models
A GNN model defines a mathematical transformation that takes as input (1) the graph structure $A$ and (2) the input features $X$, and generates the output feature matrix $Y$. Unless specified otherwise, $X$ models vertex features. The exact way of constructing $Y$ based on $A$ and $X$ is an area of intense research; hundreds of different GNN models have been developed [223, 251, 244, 58, 100, 49, 241]. We now discuss different categories of GNN models; see Figure 4 for a summary. Importantly for parallel and distributed execution, one can formulate most GNN models using either the local formulation (LC), based on functions operating on single edges or vertices, or the global formulation (GL), based on operations on matrices grouping all vertex- and edge-related vectors.
2.3.1 Local (LC) GNN Formulations
In many GNN models, the latent feature vector of a given node $v$ is obtained by applying a permutation-invariant aggregator function $\bigoplus$, such as sum or max, over the feature vectors of the neighbors of $v$. Moreover, the feature vector of each neighbor of $v$ may additionally be transformed by a function $\psi$. Finally, the outcome of $\bigoplus$ may also be transformed with another function $\phi$. The sequence of these three transformations forms one GNN layer. We denote such a GNN model formulation (based on $\psi$, $\bigoplus$, $\phi$) as local (LC). Formally, the equation specifying the feature vector of a vertex $v$ in the next GNN layer is as follows:

$$h_v^{(l+1)} = \phi\left( h_v^{(l)},\; \bigoplus_{u \in N(v)} \psi\left( h_v^{(l)}, h_u^{(l)} \right) \right)$$
Depending on the details of $\psi$, one can further distinguish three GNN classes: Convolutional GNNs (C-GNNs), Attentional GNNs (A-GNNs), and Message-Passing GNNs (MP-GNNs). In short, in these three classes of models, $\psi$ scales each neighbor's feature vector with – respectively – a fixed scalar coefficient (C-GNNs), a learnable function that returns a scalar coefficient (A-GNNs), or a learnable function that returns a vector (MP-GNNs).
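The generic $\psi$/$\bigoplus$/$\phi$ scheme above can be sketched in a few lines of Python; the function names and the toy convolutional instantiation below are our own, for illustration only:

```python
import numpy as np

# One LC-style GNN layer: h_v' = phi(h_v, (+)_{u in N(v)} psi(h_v, h_u)).
# psi, aggregate ((+)), and phi are pluggable user-supplied functions.
def lc_layer(adj, H, psi, aggregate, phi):
    H_new = np.empty_like(H)
    for v, nbrs in adj.items():
        msgs = [psi(H[v], H[u]) for u in nbrs]                  # psi per neighbor
        agg = aggregate(msgs) if msgs else np.zeros_like(H[v])  # (+) over messages
        H_new[v] = phi(H[v], agg)                               # phi on the result
    return H_new

# Convolutional-flavor example: fixed mean coefficients, phi averages with self.
adj = {0: [1, 2], 1: [0], 2: [0]}
H = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
out = lc_layer(
    adj, H,
    psi=lambda hv, hu: hu,
    aggregate=lambda ms: np.mean(ms, axis=0),
    phi=lambda hv, agg: 0.5 * (hv + agg),
)
```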
As an example, consider the seminal GCN model by Kipf and Welling. Here, $\bigoplus$ is a sum over $N(v) \cup \{v\}$, $\psi$ acts on each neighbor $u$'s feature vector by multiplying it with a scalar $1/\sqrt{\hat{d}_u \hat{d}_v}$, and $\phi$ is a linear projection with a trainable parameter matrix $W$ followed by a non-linearity $\sigma$. Thus, the LC formulation is given by $h_v^{(l+1)} = \sigma\left( \sum_{u \in N(v) \cup \{v\}} \frac{1}{\sqrt{\hat{d}_u \hat{d}_v}} W^{(l)} h_u^{(l)} \right)$. Note that each iteration $l$ may have a different projection matrix $W^{(l)}$.
There are many ways in which one can parallelize GNNs in the LC formulation. Here, the first-class citizens are “fine-grained” functions being evaluated for vertices and edges. Thus, one could execute these functions in parallel over different vertices, edges, and graphs, parallelize a single function over the feature dimension or over the graph structure, pipeline a sequence of functions within a GNN layer or across GNN layers, or fuse parallel execution of functions. We discuss all these aspects in the following sections.
2.3.2 Global (GL) GNN Formulations
Many GNN models can also be formulated using operations on matrices $A$, $X$, $H^{(l)}$, $W^{(l)}$, and others. We will refer to this approach as the global (GL) linear algebraic approach.
For example, the GL formulation of the GCN model is $H^{(l+1)} = \sigma\left( \tilde{A} H^{(l)} W^{(l)} \right)$. $\tilde{A}$ is the normalized adjacency matrix with self-loops (cf. Table I): $\tilde{A} = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$. This normalization incorporates the coefficients $1/\sqrt{\hat{d}_u \hat{d}_v}$ shown in the LC formulation above (the original GCN paper gives more details about normalization).
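The LC and GL formulations of GCN describe the same computation. A toy numpy check of this equivalence (our own sketch, using ReLU as $\sigma$), comparing the matrix product against the per-vertex sum:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, kp = 5, 3, 2

# Random undirected graph, then add self-loops and normalize symmetrically.
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
A_hat = A + np.eye(n)                              # A with self-loops
d_hat = A_hat.sum(1)                               # degrees incl. self-loops
A_tilde = A_hat / np.sqrt(np.outer(d_hat, d_hat))  # D^-1/2 (A+I) D^-1/2

H = rng.standard_normal((n, k))
W = rng.standard_normal((k, kp))
relu = lambda M: np.maximum(M, 0)

H_gl = relu(A_tilde @ H @ W)                       # global (GL) formulation

H_lc = np.zeros((n, kp))                           # local (LC) formulation
for v in range(n):
    acc = np.zeros(k)
    for u in np.nonzero(A_hat[v])[0]:              # N(v) united with {v}
        acc += H[u] / np.sqrt(d_hat[u] * d_hat[v]) # fixed coefficients
    H_lc[v] = relu(W.T @ acc)                      # projection + non-linearity

assert np.allclose(H_gl, H_lc)
```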
Many GL models use higher powers of $\tilde{A}$ (or other normalizations of $A$). Based on this criterion, GL models can be linear (L) (if only the 1st power of $\tilde{A}$ is used), polynomial (P) (if a polynomial of $\tilde{A}$ is used), and rational (R) (if a rational function of $\tilde{A}$ is used). This aspect impacts how to best parallelize a given model, as we illustrate in Section 4. For example, the GCN model is linear.
2.4 GNN Inference vs. GNN Training
A series of GNN layers stacked one after another, as detailed in Figure 2 and in § 2.3, constitutes GNN inference. GNN training consists of three parts: forward pass, loss computation, and backward pass. The forward pass has the same structure as GNN inference. For example, in classification, the loss is obtained as follows: $\mathcal{L} = \sum_{i \in \mathcal{D}} \ell\left( \hat{y}_i, y_i \right)$, where $\mathcal{D}$ is a set of all the labeled samples, $\hat{y}_i$ is the final prediction for sample $i$, and $y_i$ is the ground-truth label for sample $i$. In practice, one often uses the cross-entropy loss for $\ell$; other functions may also be used.
Backpropagation outputs the gradients of all the trainable weights in the model. A standard chain rule is used to obtain mathematical formulations for respective GNN models. For example, the gradients for the first GCN layer, assuming a total of two layers ($L = 2$), softmax outputs $Z$, and the cross-entropy loss, are as follows:

$$\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \left( \tilde{A} X \right)^\top \left[ \left( \tilde{A}^\top \left( Z - \bar{Y} \right) \left( W^{(2)} \right)^\top \right) \odot \sigma'\left( \tilde{A} X W^{(1)} \right) \right]$$

where $\bar{Y}$ is a matrix grouping all the ground-truth vertex labels; cf. Table I for other symbols. This equation reflects the forward propagation formula (cf. § 2.3.2); the main differences are the use of transposed matrices (because backward propagation involves propagating information in the reverse direction on the input graph edges) and the derivative of the non-linearity, $\sigma'$.
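Such a chain-rule gradient can be validated numerically. Below is a small numpy sketch (toy sizes, ReLU as the non-linearity, softmax with cross-entropy; all names are our own) that compares the analytic gradient of the first-layer weights of a two-layer GCN with central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, h, c = 6, 4, 5, 3                   # vertices, input feats, hidden, classes

A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)                    # undirected toy graph
A_hat = A + np.eye(n)                     # self-loops
d_hat = A_hat.sum(1)
A_t = A_hat / np.sqrt(np.outer(d_hat, d_hat))   # normalized adjacency

X = rng.standard_normal((n, k))
W1 = rng.standard_normal((k, h)) * 0.1
W2 = rng.standard_normal((h, c)) * 0.1
Y = np.eye(c)[rng.integers(0, c, n)]      # one-hot ground-truth labels
relu = lambda M: np.maximum(M, 0)

def loss(W1):                             # forward pass + summed cross-entropy
    H1 = relu(A_t @ X @ W1)
    S = A_t @ H1 @ W2
    P = np.exp(S - S.max(1, keepdims=True)); P /= P.sum(1, keepdims=True)
    return -np.log((P * Y).sum(1)).sum()

# Analytic gradient per the chain rule: transposed matrices and ReLU's sigma'.
H1pre = A_t @ X @ W1
S = A_t @ relu(H1pre) @ W2
P = np.exp(S - S.max(1, keepdims=True)); P /= P.sum(1, keepdims=True)
G = (A_t.T @ ((P - Y) @ W2.T)) * (H1pre > 0)
dW1 = (A_t @ X).T @ G

# Central finite-difference check, entry by entry.
num = np.zeros_like(W1); eps = 1e-6
for i in range(k):
    for j in range(h):
        Wp = W1.copy(); Wp[i, j] += eps
        Wm = W1.copy(); Wm[i, j] -= eps
        num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

assert np.allclose(dW1, num, atol=1e-4)
```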
The structure of backward propagation depends on whether full-batch or mini-batch training is used. Parallelizing mini-batch training is more challenging due to the inter-sample dependencies; we analyze it in Section 3.
2.5 GNN Programming Models and Operators
Recent works that originated in the systems community come with programming and execution models that facilitate GNN computations. In general, they each provide a set of programmable kernels, a.k.a. operators (also referred to as UDFs – User-Defined Functions), that enable implementing the GNN functions both in the LC formulation ($\psi$, $\bigoplus$, $\phi$) and in the GL formulation (matrix products and others). Figure 5 shows both LC and GL formulations, and how they translate to the programming kernels.
The most widespread programming/execution model is SAGA (“Scatter-ApplyEdge-Gather-ApplyVertex”), used in many GNN libraries. In the Scatter operator, the feature vectors of the vertices adjacent to a given edge $e$ are processed (e.g., concatenated) to create the data specific to the edge $e$. Then, in ApplyEdge, this data is transformed (e.g., passed through an MLP). Scatter and ApplyEdge together implement the $\psi$ function. Then, Gather aggregates the outputs of ApplyEdge for each vertex, using a selected commutative and associative operation. This enables implementing the $\bigoplus$ function. Finally, ApplyVertex conducts some user-specified operation on the aggregated vertex vectors (implementing $\phi$).
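A minimal sequential sketch of one SAGA-style layer may look as follows; the UDF names mirror the operators above, but the code is an illustrative sketch, not any particular library's API:

```python
import numpy as np

def run_saga_layer(edges, H, scatter, apply_edge, aggregate, apply_vertex):
    msgs = {}                                        # per-destination message lists
    for (u, v) in edges:
        e = scatter(H[u], H[v])                      # Scatter: build edge data
        msgs.setdefault(v, []).append(apply_edge(e)) # ApplyEdge: transform it
    H_new = np.zeros_like(H)
    for v in range(H.shape[0]):
        # Gather/Aggregate: combine incoming messages with a commutative op.
        agg = aggregate(msgs.get(v, [np.zeros(H.shape[1])]))
        H_new[v] = apply_vertex(agg, H[v])           # ApplyVertex
    return H_new

# Example: mean aggregation of source features, identity transforms.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
H = np.arange(6, dtype=float).reshape(3, 2)
out = run_saga_layer(
    edges, H,
    scatter=lambda hu, hv: hu,                       # edge data = source features
    apply_edge=lambda e: e,
    aggregate=lambda ms: np.mean(ms, axis=0),
    apply_vertex=lambda agg, hv: agg,
)
```

Note that each of the four UDFs is independent per edge or per vertex, which is precisely what makes this model amenable to parallelization.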
Note that, to express the edge-related kernels Scatter and ApplyEdge, the LC formulation provides a generic function $\psi$. On the other hand, to express these kernels in the GL formulation, one adds an element-wise product between the adjacency matrix and some other matrix that is a result of matrix operations providing the desired effect. For example, to compute a “vanilla attention” model on graph edges, one uses a product of $X$ with itself transposed, i.e., $A \odot \left( X X^\top \right)$.
Other operators, proposed in GReTA, FlexGraph, and others, are similar. For example, GReTA has one additional operator, Activate, which enables a separate specification of activation. On the other hand, GReTA does not provide a dedicated kernel for applying the $\phi$ function.
We illustrate the relationships between operators and GNN functions from the LC and GL formulations in Figure 5. Here, we use the name Aggregate instead of Gather to denote the kernel implementing the $\bigoplus$ function. This is because “Gather” has traditionally been used to denote bringing several objects together into an array (another name sometimes used in this context is “Reduce”).
2.6 Taxonomy of Parallelism in GNNs
In traditional deep learning, there are two fundamental ways to parallelize processing a neural network: data parallelism and model parallelism, which – respectively – partition data samples and neural weights among different workers. Model parallelism can further be divided into pipeline parallelism (different NN layers are processed in parallel) and operator parallelism (a single sample or neural activity is processed in parallel).
We overview the parallelism taxonomy in Figure 6, and show how it translates to parallelism in GNNs in Figure 7. It is similar to that of traditional DL, in that it also has data parallelism and model parallelism. However, there are certain differences that we identify and analyze.
For example, as we detail in Section 3, data parallelism in GNNs has two variants: mini-batch parallelism (when one parallelizes processing a mini-batch, and updates the weights after each mini-batch) and graph [partition] parallelism (when one parallelizes a batch due to the inability to store a given batch on one worker, and only updates the weights after the whole batch). Note that graph partition parallelism could also be applied to a large mini-batch, if that mini-batch cannot be stored on a single worker. Mini-batch parallelism further divides into dependent mini-batch parallelism (whenever samples have dependencies between one another) and independent mini-batch parallelism (no dependencies between samples). Graph partition parallelism and dependent mini-batch parallelism are much more challenging than their equivalent forms in traditional deep learning because of dependencies between data samples.
Model parallelism in GNNs also has several variants. First, in pipeline parallelism, we distinguish macro-pipeline parallelism (pipelining the actual GNN layers) and micro-pipeline parallelism (pipelining the processing of samples within a single GNN layer). Second, single operators can also be parallelized (operator parallelism) in different ways (feature parallelism when updating in parallel different features in a single feature vector, and graph [structure] parallelism when processing in parallel a single feature by assigning different workers to different neighbors of a given vertex). Finally, the UpdateEdge and UpdateVertex kernels come with dense neural network operations and thus one can apply traditional artificial neural network (ANN) parallelism to these kernels. We refer to this form in general as ANN parallelism and we further distinguish ANN-pipeline parallelism and ANN-model parallelism.
2.7 Parallel and Distributed Models and Algorithms
We use formal models for reasoning about parallelism. For a single machine (shared memory), we use the work-depth (WD) analysis, an established approach for bounding run-times of parallel algorithms. The work of an algorithm is the total number of operations, and the depth is the longest sequential chain of execution in the algorithm (assuming an infinite number of parallel threads executing the algorithm); the depth forms a lower bound on the algorithm's execution time [39, 42]. One usually wants to minimize depth while preventing work from increasing too much.
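As a classic work-depth example, summing $n$ values with a binary tree has work $n - 1$ and depth $\lceil \log_2 n \rceil$; aggregating a vertex's $d$ neighbor vectors in a GNN has the same structure. A small sketch (ours, for illustration) that counts both quantities:

```python
# Simulate a parallel binary-tree reduction: each while-iteration is one
# parallel round (depth += 1); each pairwise combination is one unit of work.
def tree_reduce(xs):
    work, depth = 0, 0
    while len(xs) > 1:
        nxt = []
        for i in range(0, len(xs) - 1, 2):   # all pairs combined in parallel
            nxt.append(xs[i] + xs[i + 1])
            work += 1
        if len(xs) % 2:                      # odd element carried to next round
            nxt.append(xs[-1])
        xs, depth = nxt, depth + 1
    return xs[0], work, depth

total, work, depth = tree_reduce(list(range(8)))   # work = 7, depth = 3
```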
In multi-machine (distributed-memory) settings, one is often interested in understanding the algorithm cost in terms of the amount of communication (i.e., communicated data volume), synchronization (i.e., the number of global “supersteps”), and computation (i.e., work), and minimizing these factors. A popular model used in this setting is Bulk Synchronous Parallel (BSP).
3 Data Parallelism
In traditional deep learning, a basic form of data parallelism is to parallelize the processing of input data samples within a mini-batch. Each worker processes its own portion of samples, computes partial updates of the model weights, and synchronizes these updates with other workers using established strategies such as parameter servers or allreduce. As samples (e.g., pictures) are independent, it is easy to parallelize their processing, and synchronization is only required when updating the model parameters. In GNNs, mini-batch parallelism is more complex because, very often, there are dependencies between data samples (cf. Figure 3 and § 2.1). Moreover, the input datasets as a whole are often massive. Thus, regardless of whether and how mini-batching is used, one is often forced to resort to graph partition parallelism because no single server can fit the dataset. We now detail both forms of GNN data parallelism. We illustrate them in Figure 8.
3.1 Graph Partition Parallelism
Some graphs may have more than 250 billion vertices and beyond 10 trillion edges [147, 18], and each vertex and/or edge may have a large associated feature vector. Thus, one inevitably must distribute such graphs over different workers as they do not fit into one server's memory. We refer to this form of GNN parallelism as graph partition parallelism, because it is rooted in the established problem of graph partitioning [52, 122] and the associated mincut problem [90, 122, 84]. The main objective in graph partition parallelism is to distribute the graph across workers in such a way that both communication between the workers and work imbalance among workers are minimized.
We illustrate variants of graph partitioning in Figure 9. When distributing a graph over different workers and servers, one can specifically distribute vertices (edge [structure] partitioning, i.e., edges are partitioned), edges (vertex [structure] partitioning, i.e., vertices are partitioned), or edge and/or vertex input features (edge/vertex [feature] partitioning, i.e., edge and/or vertex input feature vectors are partitioned). Importantly, these methods can be combined, e.g., nothing prevents using both edge and feature vector partitioning together. Edge partitioning is probably the most widespread form of graph partitioning, but it comes with large communication and work imbalance when partitioning graphs with skewed degree distributions. Vertex partitioning alleviates these issues to a certain degree, but if a high-degree vertex is distributed among many workers, it may also lead to large overheads in maintaining a consistent distributed vertex state. Differences between edge and vertex partitioning are covered extensively in rich literature [55, 73, 94, 52, 122, 51, 50, 9, 124, 75, 104, 121]. Feature partitioning was not addressed in the graph processing area because, in traditional distributed graph algorithms, vertices and/or edges are usually associated with scalar values.
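The trade-off between the two structure-partitioning styles can be illustrated with a toy sketch (the hash-based partitioners below are our own arbitrary choices): with edge [structure] partitioning, the cost shows up as cut edges (communication); with vertex [structure] partitioning, it shows up as vertex replicas (distributed-state synchronization):

```python
from collections import defaultdict

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 4), (4, 5)]
P = 2                          # number of workers
part = lambda v: v % P         # toy hash partitioner for vertices

# Edge [structure] partitioning: vertices are assigned to workers; every edge
# whose endpoints live on different workers entails communication.
cut = sum(1 for (u, v) in edges if part(u) != part(v))

# Vertex [structure] partitioning: edges are assigned to workers; a vertex
# touched by edges on several workers must be replicated and kept consistent.
replicas = defaultdict(set)
for (u, v) in edges:
    w = (u + v) % P            # toy hash partitioner for edges
    replicas[u].add(w)
    replicas[v].add(w)
avg_replication = sum(len(s) for s in replicas.values()) / len(replicas)
```

On this toy graph, the vertex-hash scheme cuts 6 of 7 edges, while the edge-hash scheme keeps the average replication factor below 1.4; real partitioners (e.g., METIS-style mincut) do far better than hashing.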
Partitioning entails communication when a given part of a graph depends on another part kept on a different server. This may happen during a graph related operator (Scatter, Aggregate) if edges or vertices are partitioned, and during a neural network related operator (UpdateEdge, UpdateVertex) if feature vectors are partitioned.
3.1.1 Full-Batch Training
Graph partition parallelism is commonly used to alleviate the large memory requirements of full-batch training. In full-batch training, one must store all the activations for each feature in each vertex in each GNN layer. Thus, a common approach for executing and parallelizing this scheme is to use distributed-memory large-scale clusters that can hold the massive input datasets in their combined memories, together with graph partition parallelism. Still, using such clusters may be expensive, and it still does not alleviate the slow convergence. Hence, mini-batching is often used.
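A back-of-envelope sketch of these activation-memory requirements (the concrete numbers are our own illustrative assumptions, with fp32 activations):

```python
# Activation memory of full-batch training: k features per vertex per layer.
def fullbatch_activation_bytes(n, k, L, bytes_per_value=4):
    return n * k * L * bytes_per_value

# e.g., 100M vertices, 256 hidden features, 3 layers:
gib = fullbatch_activation_bytes(100_000_000, 256, 3) / 2**30
# roughly 286 GiB of activations alone -- beyond a single commodity server.
```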
3.2 Mini-Batch Parallelism
In GNNs, if data samples are independent graphs, then mini-batch parallelism is similar to traditional deep learning. First, one mini-batch is a set of such graph samples, with no dependencies between them. Second, samples (e.g., molecules) may have different sizes (causing potential load imbalance), similarly to, e.g., videos. This setting is common in graph classification or graph regression. We illustrate this in Figure 8 (right), and we refer to it as independent mini-batch parallelism. Note that while such graph samples may have different sizes (e.g., molecules can have different counts of atoms and bonds), their corresponding feature vectors are of the same dimensionality.
Yet, in most GNN computations, mini-batch parallelism is much more challenging because of inter-sample dependencies (dependent mini-batch parallelism). As a concrete example, consider node classification. Similarly to graph partition parallelism, one may experience load imbalance issues, e.g., because vertices may differ in their degrees. Moreover, a key challenge in GNN mini-batching is the information loss when selecting the target vertices forming a given mini-batch. In traditional deep learning, one picks samples randomly. In GNNs, straightforwardly applying such a strategy would result in very low accuracy: a randomly selected subset of nodes may not even be connected, and it will most likely be very sparse; due to the missing edges, a lot of information about the graph structure is lost during the Aggregate or Scatter operator. This information loss challenge was circumvented in the early GNN works with full-batch training [128, 223] (cf. § 3.1.1). Unfortunately, full-batch training comes with slow convergence (because the model is updated only once per epoch, which may require processing billions of vertices) and the above-mentioned large memory requirements. Hence, two more recent approaches that specifically address mini-batching were proposed: incorporating support vertices, and appropriately selecting target vertices.
3.2.1 Support Vertices
In a line of works initiated by GraphSAGE, one adds some neighbors of sampled target vertices as so-called support vertices to the mini-batch. These support vertices are only used to increase the accuracy of predictions for target vertices (i.e., they are not used as target vertices in that mini-batch). Specifically, when executing the Scatter and Aggregate kernels for each of the target vertices in a mini-batch, one also considers the pre-selected support vertices. Hence, the results of Scatter and Aggregate are more accurate. Support vertices of each target vertex $v$ usually come not only from the 1-hop, but also from the $k$-hop neighborhoods of $v$, where $k$ may be as large as the graph's diameter. The exact selection of support vertices depends on the details of each scheme. In GraphSAGE, they are sampled (for each target vertex) for each GNN layer before the actual training.
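Such layer-wise support-vertex sampling can be sketched as follows; this is our own simplified version with illustrative names (GraphSAGE's actual implementation differs in details such as per-target sampling and fixed-size tensors):

```python
import random

# For each GNN layer, sample up to `fanout` neighbors of the current frontier;
# the sampled vertices become the support set for that layer.
def sample_support(adj, targets, fanouts, rng):
    frontier = set(targets)
    support = []                     # one sampled vertex set per GNN layer
    for fanout in fanouts:
        nxt = set()
        for v in frontier:
            nbrs = adj.get(v, [])
            nxt.update(rng.sample(nbrs, min(fanout, len(nbrs))))
        support.append(nxt)
        frontier = nxt               # next layer samples around these vertices
    return support

adj = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0, 5], 4: [1], 5: [3]}
support = sample_support(adj, targets=[0], fanouts=[2, 1], rng=random.Random(0))
```

Note that sampling is independent per frontier vertex, so each layer's step parallelizes trivially over vertices.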
We illustrate support vertices in Figure 10. Here, note that the target vertices within each mini-batch may be clustered but may also be spread across the graph (depending on a specific scheme [139, 101, 60, 61]). Support vertices, indicated with darker shades of each mini-batch color, are located up to 2 hops away from their target vertices.
One challenge related to support vertices is the overhead of their pre-selection. For example, in GraphSAGE, one has to – in addition to the forward and backward propagation passes – conduct as many sampling steps as there are layers in a GNN, to select support vertices for each layer and for each vertex. While this can be alleviated with parallelization schemes also used for forward and backward propagation, it inherently increases the depth of a GNN computation by a multiplicative constant factor.
Another associated challenge is called the neighborhood explosion and is related to the memory overhead due to maintaining potentially many such vertices. In the worst case, for each vertex in a mini-batch, assuming keeping all its neighbors up to $k$ hops away, one has to maintain $O\left( d^k \right)$ state. Even if some of these vertices are target vertices in that mini-batch and thus are already maintained, when increasing $k$, their ratio becomes lower. GraphSAGE alleviates this by sampling a constant number of vertices from each neighborhood instead of keeping all the neighbors, but the memory overhead may still be large. We show an example neighborhood explosion in Figure 11.
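Even with a constant per-neighborhood fanout $s$, the support set still grows geometrically with the number of layers $L$; a one-line worst-case count (illustrative):

```python
# Worst-case number of support vertices per target over L layers when keeping
# up to s sampled neighbors per vertex per hop: s + s^2 + ... + s^L.
def worst_case_support(s, L):
    return sum(s ** l for l in range(1, L + 1))

# e.g., fanout 10 over 3 layers already needs up to 1110 support vertices
# per single target vertex.
```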
3.2.2 Appropriate Selection of Target Vertices
More recent GNN mini-batching works focus on the appropriate selection of target nodes included in mini-batches, such that support vertices are not needed for high accuracy. For example, Cluster-GCN first clusters a graph and then assigns clusters to be mini-batches [65, 230]. This way, one reduces (but does not eliminate) the loss of information, because a mini-batch usually contains a tightly knit community of vertices. We illustrate this in Figure 10 (right). However, one has to additionally compute graph clustering as a form of preprocessing. This can be parallelized with one of many established parallel clustering routines [122, 29, 36, 30].
We first observe that the key difference between graph partition parallelism and mini-batch parallelism is the timing of updating the model weights: it takes place after processing the whole batch (for the former) versus after each mini-batch (for the latter). Other differences are as follows. First, the primary objective when partitioning a graph is to minimize communication and work imbalance across workers. Contrarily, in mini-batching, one aims at a selection of target vertices that maximizes accuracy. Second, each vertex belongs to some partition(s), but not each vertex is necessarily included in a mini-batch. Third, while mini-batch parallelism has a variant with no inter-sample dependencies, graph partition parallelism nearly always deals with a connected graph and has to consider such dependencies.
We also note that one could consider the asynchronous execution of different mini-batches. This would entail asynchronous GNN training, with model updates being conducted asynchronously. Such a scheme could slow down convergence, but would offer potential for more parallelism.
3.3 Work-Depth Analysis: Full-Batch vs. Mini-Batch
We analyze the work and depth of different GNN training schemes that use full-batch or mini-batch training; see Table II.
| Method | Work & depth in one training iteration |
|---|---|
| *Full-batch training schemes:* | |
| *Mini-batch training schemes:* | |
First, all methods have a common term in work, O(L(mk + nk²)), which equals the number of layers L times the number of operations conducted in each layer: O(mk) for sparse graph operations (Aggregate) and O(nk²) for dense neural network operations (UpdateVertex); here, n and m denote the numbers of vertices and edges, and k is the feature dimension. This is the total work for full-batch methods. Mini-batch schemes have additional work terms. Schemes based on support vertices (GraphSAGE, VR-GCN, FastGCN) have terms that reflect how they pick these vertices. GraphSAGE and VR-GCN have a particularly high term, growing as s^L, due to the neighborhood explosion (s is the number of vertices sampled per neighborhood). FastGCN alleviates the neighborhood explosion by sampling vertices per whole layer, so its additional work grows with L·s rather than s^L. Then, approaches that focus on appropriately selecting target vertices (GraphSAINT, Cluster-GCN) do not have the work terms related to the neighborhood explosion. Instead, they have preprocessing terms. Cluster-GCN's preprocessing cost depends on the selected clustering method, which heavily depends on the input graph size (n, m). GraphSAINT, on the other hand, does stochastic mini-batch selection, the work of which does not necessarily grow with n or m.
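The dominant work terms above can be compared with a back-of-the-envelope cost model. This is our own illustrative sketch (symbols n, m, k, L, s, b are shorthand for graph size, feature dimension, layer count, per-neighborhood sample size, and mini-batch size; constant factors and scheme-specific details are omitted):

```python
# Rough per-iteration operation counts matching the asymptotic terms
# discussed in the text; these are illustrations, not exact cost models.

def work_full_batch(n, m, k, L):
    """O(L(mk + nk^2)): edge-wise Aggregate plus dense UpdateVertex."""
    return L * (m * k + n * k ** 2)

def work_graphsage_extra(b, s, k, L):
    """Extra term for b targets with neighborhood explosion: ~ b * s^L * k^2."""
    return b * (s ** L) * k ** 2

def work_fastgcn_extra(s, k, L):
    """Layer-wise sampling: the extra term grows as L*s, not s^L."""
    return L * s * k ** 2
```

Plugging in even modest values of s and L shows the s^L term dwarfing the L·s term, which is the formal content of the neighborhood-explosion discussion.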
In terms of depth, all the full-batch schemes depend on the number of layers L. Then, in each layer, the two bottleneck operations are the dense neural network operation (UpdateVertex, e.g., a matrix-vector multiplication) and the sparse graph operation (Aggregate). They take O(log k) and O(log d) depth, respectively (d being the maximum vertex degree). Mini-batch schemes are similar, with the main difference being an O(log s) instead of the O(log d) term for the schemes based on support vertices. This is because Aggregate in these schemes is applied over sampled neighbors only. Moreover, in Cluster-GCN and GraphSAINT, the neighborhoods may have up to d vertices, yielding the O(log d) term. They do, however, have an additional preprocessing depth term that depends on the used sampling or clustering scheme.
To summarize, full-batch and mini-batch GNN training schemes have similar depth. Note that this is achieved using graph partition parallelism in full-batch training methods and mini-batch parallelism in mini-batching schemes. Contrarily, the overall work in mini-batching may be larger, due to the overheads from support vertices or the additional preprocessing when selecting target vertices using elaborate approaches. However, mini-batching comes with faster convergence and usually lower memory pressure.
3.4 Tradeoff Between Parallelism & Convergence
The efficiency tradeoff between the amount of parallelism within a mini-batch and the convergence speed, controlled by the mini-batch size, is an important aspect of parallelizing traditional ANNs. In short, small mini-batches accelerate convergence but may limit parallelism, while large mini-batches may slow down convergence but offer more parallelism. In GNNs, finding the "right" mini-batch size is much more complex because of the inter-sample dependencies. For example, a large mini-batch consisting of vertices that are not even connected would result in very low accuracy. On the other hand, if a mini-batch is small but consists of tightly connected vertices that form a cluster, then the accuracy of the updates based on processing that mini-batch can be high.
4 Model Parallelism
In traditional neural networks, models are often large. In GNNs, the models (i.e., the weight matrices) are usually small and often fit into the memory of a single machine. However, numerous forms of model parallelism are heavily used to improve throughput; we provided an overview in § 2.6 and in Figure 7.
In the following model analysis, we often picture the linear algebra objects and operations used. For clarity, we indicate their shapes, densities, and dimensions using small figures; see Table III for a list. Interestingly, GNN models in the LC formulations heavily use dense matrices and vectors with dimensionalities dominated by the feature dimension k, and the associated operations. On the other hand, the GL formulations use both sparse and dense matrices of different shapes (square, rectangular, vectors), and the used matrix multiplications can be dense–dense (GEMM, GEMV), dense–sparse (SpMM), and sparse–sparse (SpMSpM). Other operations are elementwise matrix products or rational sparse matrix powers. This rich diversity of operations immediately illustrates a huge potential for parallel and distributed techniques to be used with different classes of models.
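For instance, a GL-formulation graph convolution layer chains an SpMM (sparse adjacency matrix times dense feature matrix) with a GEMM (dense feature matrix times dense weight matrix). The following minimal, stdlib-only sketch is our own illustration, with the adjacency normalization and the nonlinearity omitted and the CSR-like representation chosen only for brevity:

```python
# One GL-style layer: out = (A @ H) @ W, where A is sparse (stored as
# row -> list of (column, value)) and H, W are dense (lists of rows).

def spmm(a_rows, H):
    """Sparse-dense product: (A @ H)[i] = sum over nonzeros A[i,j] * H[j]."""
    k = len(H[0])
    out = []
    for row in a_rows:
        acc = [0.0] * k
        for j, val in row:
            for c in range(k):
                acc[c] += val * H[j][c]
        out.append(acc)
    return out

def gemm(X, W):
    """Dense-dense product X @ W."""
    kw = len(W[0])
    return [[sum(x[i] * W[i][c] for i in range(len(W))) for c in range(kw)]
            for x in X]

# 3-vertex path graph (unnormalized weights), k = 2 features, identity W.
A = [[(1, 1.0)], [(0, 1.0), (2, 1.0)], [(1, 1.0)]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
print(gemm(spmm(A, H), W))  # [[0.0, 1.0], [2.0, 1.0], [0.0, 1.0]]
```

Note how the SpMM touches only the nonzeros of A (work proportional to mk), while the GEMM is dense (work proportional to nk²), mirroring the two work terms discussed earlier.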
| Symbol | Description | Used often in |
|---|---|---|
| | **Matrices and vectors** | |
| , | Dense vectors | LC models |
| | Dense matrices | GL & LC models |
| , | Dense matrices | GL models |
| | Sparse matrix | GL models |
| | **Matrix multiplications (dimensions as stated above)** | |
| | GEMM, dense tall matrix × dense square matrix | GL models |
| | GEMM, dense square matrix × dense square matrix | GL models |
| | GEMM, dense tall matrix × dense tall matrix | GL models |
| | GEMV, dense matrix × dense vector | LC models |
| | SpMM, sparse matrix × dense matrix | GL models |
| | **Elementwise matrix products and other operations** | |
| | Elementwise product of a sparse matrix and some object | GL models |
| | SpMSpM, sparse matrix × sparse matrix | GL models |
| | Rational sparse matrix power | GL models |
| , | Vector dot product, elementwise vector product | LC models |
| , | Vector concatenation, sum of vectors | LC models |
4.1 Operator Parallelism
When analyzing operator parallelism, we first focus on the LC formulations, and then proceed to the GL formulation.
4.1.1 Parallelism in LC Formulations of GNN Models
We illustrate generic work and depth equations of LC GNN formulations in Figure 12. Overall, work is the sum of any preprocessing costs, any postprocessing costs, and the work of a single GNN layer times the number of layers L. In the considered generic formulation in Eq. (8), the work of one layer equals the work needed to evaluate ψ for each edge, ⊕ for each vertex, and φ for each vertex. Depth is analogous, with the main difference that the depth of a single GNN layer is a plain sum of the depths of computing ψ, ⊕, and φ (each function is evaluated in parallel over all vertices and edges, hence no multiplication with n or m).
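These generic equations can be written down as simple cost formulas. The sketch below is a back-of-the-envelope model of our own (function and parameter names are ours), not the paper's exact terms:

```python
# Generic work/depth of an LC-formulation GNN:
#   work  = pre-work  + L * (m * W_psi + n * W_agg + n * W_phi) + post-work
#   depth = pre-depth + L * (D_psi + D_agg + D_phi)             + post-depth
# because psi / aggregate / phi run in parallel over all edges/vertices.

def lc_work(n, m, L, w_psi, w_agg, w_phi, w_pre=0, w_post=0):
    """Total work: per-edge psi, per-vertex aggregate and phi, times L layers."""
    return w_pre + L * (m * w_psi + n * w_agg + n * w_phi) + w_post

def lc_depth(L, d_psi, d_agg, d_phi, d_pre=0, d_post=0):
    """Total depth: per-layer depths simply add up, with no n or m factor."""
    return d_pre + L * (d_psi + d_agg + d_phi) + d_post
```

Plugging in, e.g., D_psi = log k, D_agg = log d, and D_phi = log k reproduces the per-layer depth bottlenecks analyzed below.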
We now analyze the work and depth of many specific GNN models, by focusing on the three functions forming these models: ψ, ⊕, and φ. The analysis outcomes are in Tables V and VI. We select the representative models based on a recent survey. We also indicate whether a model belongs to the class of convolutional (C-GNN), attentional (A-GNN), or message-passing (MP-GNN) models (cf. § 2.3.1).
Analysis of ψ. We show the analysis results in Table V. We provide the formulation of ψ for each model, and we also illustrate all algebraic operations needed to obtain ψ. All C-GNN models have their ψ determined during preprocessing. This preprocessing corresponds to the adjacency matrix row normalization, the column normalization, or the symmetric normalization. In all these cases, the derivation takes logarithmic depth and work linear in the number of edges. Then, A-GNNs and MP-GNNs have much more complex formulations of ψ than C-GNNs. Details depend on the model but, importantly, nearly all the models have work at most quadratic in k and depth logarithmic in k. Even the most computationally intense model, GAT, despite having the highest work among the considered models, also has logarithmic depth. This means that computing ψ in all the considered models can be effectively parallelized. As for the sparsity pattern and type of operations involved in evaluating ψ, most models use GEMV. All the considered A-GNN models also use transposition of dense vectors. GAT also uses vector concatenation and a sum of up to d vectors. Finally, one considered MP-GNN model uses an elementwise MV product. In general, each considered GNN model uses dense matrix and vector operations to obtain ψ for each of the associated edges.
Analysis of ⊕. The aggregation operator ⊕ is almost always a commutative and associative operation such as min, max, or plain sum [209, 79]. While it operates on vectors of dimensionality k, each dimension can be computed independently of the others. Thus, to compute ⊕ over a neighborhood, one needs O(log d) depth and O(dk) work, using established parallel tree reduction algorithms. Hence, ⊕ is the bottleneck in depth in all the considered models. This is because d (the maximum vertex degree) is usually much larger than k.
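The tree reduction underlying this bound can be sketched directly (our own minimal illustration): combining neighbor vectors pairwise halves their count in every round, so the number of sequential combine rounds is ceil(log2(d)) while the total work stays proportional to d·k.

```python
# Pairwise tree reduction over neighbor feature vectors: the number of
# rounds (the depth) grows logarithmically in the neighborhood size d,
# and each of the k dimensions reduces independently.

def tree_reduce(vectors, combine):
    """Reduce in pairwise rounds; returns (result, number_of_rounds)."""
    rounds = 0
    while len(vectors) > 1:
        vectors = [combine(vectors[i], vectors[i + 1]) if i + 1 < len(vectors)
                   else vectors[i]
                   for i in range(0, len(vectors), 2)]
        rounds += 1
    return vectors[0], rounds

vec_sum = lambda a, b: [x + y for x, y in zip(a, b)]
neighbors = [[1.0, 2.0]] * 8              # d = 8 neighbor vectors, k = 2
agg, rounds = tree_reduce(neighbors, vec_sum)
print(agg, rounds)                        # [8.0, 16.0] after 3 rounds = log2(8)
```

Each round's combines are independent and could run in parallel, which is exactly why the depth is the round count rather than the neighbor count.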
Analysis of φ. The analysis of φ is shown in Table VI (for the same models as in Table V). We show the total model work and depth. All the models entail dense matrix-vector products and sums of dense vectors. Depth is logarithmic. Work varies, being the highest for GAT.
We also illustrate the operator parallelism in the LC formulation, focusing on the GNN programming kernels, in the top part of Figure 13. We provide the corresponding generic work-depth analysis in Table IV, and we also assess communication and synchronization (discussed separately in § 4.1.5). The four programming kernels follow the work and depth of the corresponding LC functions.
| Reference | Class | Formulation for ψ | Dimensions & density of one execution of ψ | Pr? | Work & depth of one execution of ψ |
|---|---|---|---|---|---|
| GraphSAGE (mean) | C-GNN | | | | |
| Vanilla attention | A-GNN | | | | |
| MoNet | A-GNN | | | | |
| Attention-based GNNs | A-GNN | | | | |
| G-GCN | MP-GNN | | | | |
| GraphSAGE (pooling) | MP-GNN | | | | |
| EdgeConv "choice 1" | MP-GNN | | | | |
| EdgeConv "choice 5" | MP-GNN | | | | |
| Reference | Class | Formulation of φ (ψ as stated in Table V) | Dimensions & density of computing φ, excluding ψ | Work & depth (a whole training iteration or inference, including ψ from Table V) |
|---|---|---|---|---|
| GraphSAGE (mean) | C-GNN | | | |
| Vanilla attention | A-GNN | | | |
| Attention-based GNNs | A-GNN | | | |
| GraphSAGE (pooling) | MP-GNN | | | |
| EdgeConv "choice 1" | MP-GNN | | | |
| EdgeConv "choice 5" | MP-GNN | | | |
| Reference | Type | Algebraic formulation | Dimensions & density of the derivation | #I | Work & depth (one whole training iteration or inference) |
|---|---|---|---|---|---|
| GraphSAGE (mean) | L | | | | |
| Dot Product | L | | | | |
| EdgeConv "choice 1" | L | | | | |
| DeepWalk | P | | | 1 | |
| ChebNet | P | | | 1 | |
| DCNN, GDC | P | | | 1 | |
| Node2Vec | P | | | 1 | |
| LINE, SDNE | P | | | 1 | |
| Auto-Regress [250, 256] | R | | | 1 | |
| PPNP [230, 129, 43] | R | | | 1 | |
| ARMA, ParWalks | R | | | 1 | |
4.1.2 Parallelism in GL Formulations of GNN Models
Parallelism in GL formulations is analyzed in Table VII. The models with both LC and GL formulations (e.g., GCN) have the same work and depth. Thus, fundamentally, they offer the same amount of parallelism. However, the GL formulations based on matrix operations come with potential for different parallelization approaches than the ones used for the LC formulations. For example, there are more opportunities to use vectorization, because one is not forced to vectorize the processing of feature vectors for each vertex or edge separately (as in the LC formulation); instead, one can vectorize the derivation of the whole feature matrix.
There are also models designed in the GL formulations with no known LC formulations, cf. Tables V–VI. These are models that use polynomial and rational powers of the adjacency matrix, cf. § 2.3.2 and Figure 4. These models have only one iteration. They also offer parallelism, as indicated by the logarithmic depth (or square logarithmic depth for rational models, which require inverting the adjacency matrix). While they have one iteration, making the multiplicative L term vanish, they require deriving a given power