Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs compared to dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully-connected neural networks. We present a family of parallel algorithms for training GNNs. These algorithms are based on their counterparts in dense and sparse linear algebra, but they had not been previously applied to GNN training. We show that they can asymptotically reduce communication compared to existing parallel GNN training methods. We implement a promising and practical version that is based on 2D sparse-dense matrix multiplication using torch.distributed. Our implementation parallelizes over GPU-equipped clusters. We train GNNs on up to a hundred GPUs on datasets that include a protein network with over a billion edges.
Graph Neural Networks (GNNs) are types of neural networks that use the connectivity information that is natural in datasets that can be represented as graphs, such as molecules, transportation and social networks, the power grid, and proteins. The neighborhood connectivity information in GNNs is unrestricted and potentially irregular, giving them greater applicability than convolutional neural networks (CNNs), which impose a fixed regular neighborhood structure. GNNs have been successful in many application domains and often take advantage of specialized variations such as recurrent GNNs, spatial-temporal GNNs, graph convolutional networks, and graph autoencoders. High-quality surveys of GNNs describe these variations and their applications in more detail [31, 34]. GNNs are also provably quite powerful and, for example, are known to be equivalent to the powerful Weisfeiler-Lehman algorithm for graph isomorphism when the GNNs' underlying aggregation and combination operators are sufficiently flexible.
Two dominant uses of GNNs are node embedding, which predicts certain properties of individual vertices in a large graph, and graph embedding, which predicts certain properties of whole graphs. The presentation of this work follows the node embedding case, but the general techniques we introduce are applicable to the graph embedding case as well.
As with CNNs, the training algorithms for GNNs are based on variations of gradient descent, and typically use the idea of mini-batching, in which a small set of training samples is evaluated and then used to update the model in a single step. Mini-batching has two purposes. First, it allows the neural network to train with a smaller memory footprint. Second, it strikes a good balance between achieving high performance through higher arithmetic intensity and achieving good convergence. Unlike images in a database, vertices of a graph are dependent on each other, which is one reason behind GNNs' expressiveness. However, in the case of GNNs, this dependency makes it hard to process a mini-batch of vertices. After only a few layers, the chosen mini-batch ends up being dependent on the whole graph. This phenomenon, known as neighborhood explosion, completely nullifies the memory reduction goals.
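The neighborhood-explosion effect can be seen in a few lines of code. The following sketch (a toy Erdős–Rényi-style graph and BFS-style frontier expansion, illustrative only and not part of our implementation) counts how many vertices a k-layer GNN must touch to produce outputs for a 32-vertex mini-batch:

```python
import random

def khop_reach(adj, seeds, k):
    """Return the set of vertices a k-layer GNN must touch to
    compute outputs for the given seed mini-batch."""
    reached = set(seeds)
    frontier = set(seeds)
    for _ in range(k):
        nxt = set()
        for v in frontier:
            nxt.update(adj[v])
        frontier = nxt - reached
        reached |= nxt
    return reached

# Toy random graph: 10,000 vertices, average degree ~10.
random.seed(0)
n, d = 10_000, 10
adj = [[] for _ in range(n)]
for _ in range(n * d // 2):
    u, v = random.randrange(n), random.randrange(n)
    adj[u].append(v)
    adj[v].append(u)

batch = list(range(32))          # a mini-batch of 32 vertices
for k in range(4):
    print(k, len(khop_reach(adj, batch, k)))
# The reached set grows roughly by a factor of the average degree
# per layer until it saturates at nearly the whole graph.
```

After only three layers, the mini-batch's dependency set covers most of the 10,000-vertex graph, which is the memory blow-up described above.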
To overcome neighborhood explosion, researchers resort to sophisticated sampling-based algorithms that can help GNN training have a smaller memory footprint by reducing the number of k-hop neighbors considered. Sampling algorithms, however, come with approximation errors. Here, we use the aggregate memory of a cluster or supercomputer to train GNNs without mini-batching, similar to other work that uses distributed memory to train GNNs [35, 18]. In particular, ROC showed that (1) full gradient descent can be competitive with mini-batching in terms of performance, and (2) sampling-based methods can lead to lower accuracy. We build on this work by presenting distributed algorithms with reduced communication. Our distributed algorithms are general: while presented for full gradient descent, they can be easily modified to operate in a mini-batch setting.
Training of GNNs can be memory-limited on single-node machines, so we support training on distributed-memory architectures, where the two primary challenges are communication costs and load balance. This paper primarily focuses on minimizing communication costs for GNN training. Some of the best algorithms presented in this paper, namely the 2D and 3D algorithms, also automatically address load balance through a combination of random vertex permutations and the implicit partitioning of the adjacencies of high-degree vertices.
The primary contribution of our paper is the presentation of parallel GNN training algorithms that reduce communication and are fundamentally different from existing approaches to GNN training. On P processes, our 2D algorithm, which consumes optimal memory, communicates a factor of √P fewer words than commonly utilized vertex-partitioning based approaches. The 3D algorithm we describe reduces the number of words communicated by another asymptotic factor at the expense of higher memory consumption. Our work presents algorithmic recipes for obtaining the fastest GNN implementations at large scale.
Our paper does not attempt to provide the most optimized GNN-specific framework. Our distributed algorithms can be implemented in any system that allows arbitrary divisions of tensors to processes, such as Mesh-TensorFlow. We opted to use PyTorch for our proof-of-concept demonstration due to its ubiquity and excellent support for existing GNN models through PyTorch Geometric. All of our experiments are run on the Summit supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). We did not optimize single-node kernels and used the existing implementations in cuSPARSE that are easily called from PyTorch. Faster implementations of key kernels such as sparse matrix times tall-skinny dense matrix (SpMM) exist [33, 2] and would decrease our overall runtime. As a by-product, more optimized SpMM implementations are equivalent, from a relative cost perspective, to running on clusters with slower networks; both increase the relative cost of communication in the overall runtime, making our reduced-communication algorithms more beneficial.
Our current implementations operate on the standard real field, but they can be trivially extended to support arbitrary aggregate operations to increase the expressive power of GNNs. For example, many distributed libraries, such as the Cyclops Tensor Framework and the Combinatorial BLAS, allow the user to overload scalar addition operations through their semiring interface, which is exactly the neighborhood aggregate function when applied to graphs. Finally, we note that while our focus is on GNN training, all of our algorithms are applicable to GNN inference.
Parallelism opportunities in the training of deep neural networks (DNNs) have been studied intensively in recent years. For DNNs generally, the two broad classes of parallelism are model and data parallelism. Data parallelism replicates the DNN model in the memory of each process and only partitions the data. Data parallelism can be sub-classified into sample (also called batch) and domain parallelism. In the particular case of convolutional neural networks (CNNs), domain parallelism is often referred to as spatial parallelism. Model parallelism, on the other hand, partitions the model explicitly. In the common case, each DNN layer can be partitioned across all processes and layers can be computed in their original order. Alternatively, inter-layer pipeline parallelism can be exploited for certain parts of the computation, but not all. For CNNs, further dimensions of parallelism in the form of filter and channel parallelism are exploitable as special cases of model parallelism.
This might seem like a daunting list of parallelism opportunities to consider for GNN training. Fortunately, as shown in the next section, GNN training is simply a series of algebraic transformations on sparse and dense matrices. Consequently, one can achieve highly parallel algorithms without even considering the semantic meaning of the dimensions partitioned by the algorithm. Ours is similar to an approach taken in earlier work on parallelizing the training of fully-connected and convolutional neural networks. Differently for GNNs, the issues of sparsity and load balance play a prominent role in performance and scalability. Our paper provides several different classes of algorithms that take advantage of the many different parallelism opportunities available. We asymptotically analyze the communication costs of the algorithms we present, as well as of potential alternatives.
Existing work in parallel GNN training implements its algorithms in specialized frameworks [18, 22, 35]. This requires practitioners to port their models and code to those frameworks, which might be impossible given the lack of an ecosystem to rapidly implement different GNN algorithms. We implement our algorithms using PyTorch, utilizing the torch.distributed and PyTorch Geometric libraries. Given the wide availability and popularity of PyTorch, not to mention the vast set of GNN variants implemented in PyTorch Geometric, any practitioner with access to a distributed cluster can easily utilize our algorithms to scale their models.
The other PyTorch-based distributed graph embedding libraries we are aware of are PyTorch-BigGraph (PBG) and the Deep Graph Library (DGL). PBG's website explicitly says that it is not for use with models such as graph convolutional networks and deep networks. Consequently, while it presents some interesting ideas, PBG does not seem to have the expressiveness required to implement GNNs. By contrast, our distributed algorithms can be used to implement anything that is supported by PyTorch Geometric, which already implements the vast majority of top GNN models in the literature. On the other hand, DGL is an active large-scale project that provides a convenient graph-based interface instead of exposing sparse matrices to the user, and it automatically fuses operations to avoid unnecessary data movement and computation. Our algorithmic work is complementary and can be incorporated into DGL in the future.
Published details of data partitioning in various GNN training systems are scant. ROC advocates a specialized graph partitioning method and shows that it scales better than random vertex and edge partitioning. AliGraph mentions that it implements both a graph-partitioning-based approach and a 2D partitioning approach, but does not give any details or provide communication cost analyses.
Table I summarizes the notation used in our paper. In particular, when analyzing communication costs we use the model where each message takes a constant α time units of latency regardless of its size, plus an inverse-bandwidth term of β time units per word in the message, to reach its destination. Thus, sending a message of n words takes α + βn time. In addition, we use nnz(A) when referring to the number of nonzeros in the sparse adjacency matrix A, which is equal to the number of edges in the graph with self loops added. We also use d for the average degree of a vertex in A, i.e. d = nnz(A)/n.
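As a quick illustration of the α-β model (with placeholder constants, not measurements from any particular machine), the following sketch shows why aggregating many small messages into one large message amortizes the latency term:

```python
# Alpha-beta communication cost model: sending a message of
# n words costs alpha (latency) + beta * n (inverse bandwidth).
ALPHA = 1e-6   # seconds per message (hypothetical value)
BETA = 1e-9    # seconds per word (hypothetical value)

def msg_time(n_words, alpha=ALPHA, beta=BETA):
    """Time to deliver one message of n_words words."""
    return alpha + beta * n_words

# Many small messages vs. one aggregated message of the same volume:
many = 1000 * msg_time(100)
one = msg_time(100 * 1000)
print(many, one)   # aggregation pays the latency alpha only once
```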
Table I: Symbols and Notations
|Symbol|Description|
|A|Modified adjacency matrix of the graph (n × n)|
|H^l|Embedding matrix in layer l (n × f)|
|W^l|Weight matrix in layer l (f × f)|
|Y^l|Matrix form of ∂L/∂H^l (n × f)|
|Z^l|Input matrix to the activation function (n × f)|
|G^l|Matrix form of ∂L/∂Z^l (n × f)|
|f|Length of the feature vector per vertex|
|h_v|Feature vector for vertex v|
|L|Total layers in the GNN|
|P|Total number of processes|
Consider a dataset that is represented as a graph G = (V, E), such as a protein network, social network, grid transmission network, or the union of tangled high-energy particle tracks. Here, V is the set of vertices (nodes) and E is the set of edges. We can consider the classification of the nodes or the edges. Without loss of generality, we will describe a GNN for node classification. The goal of the so-called node embedding problem is to map the nodes of a graph into a low-dimensional embedding space such that the similarity of two nodes is approximated by the similarity of their embeddings in the low-dimensional space. Here, the embedding h_v of vertex v is typically an f-dimensional vector of floating-point values. In addition to node and edge classification, GNNs can also be used to classify graphs or perform regression on graphs. In this case, the input would be a set of graphs, such as the atomistic structures of a set of molecules.
Let A be the sparse adjacency matrix of the graph with added self-connections. The addition of self-connections ensures that each node does not forget its embedding when going from layer l to layer l+1. The rows and columns of A are also often normalized, so for an undirected graph one actually uses D^(-1/2) (A + I) D^(-1/2) due to its favorable spectral properties. Here, I is the identity matrix and D is a diagonal matrix of modified vertex degrees. To avoid notational burden, we will still refer to this modified adjacency matrix as A. We also distinguish between A and its transpose A^T explicitly in order to present a general training algorithm that works for both directed and undirected graphs. H^0 = X is a dense matrix of input node features. These features are application-dependent attributes on graph nodes. A high-quality embedding can be achieved by using a neural network that uses the topology of the graph. In particular, the GNN forward propagation processes the input features matrix at level l using the following simple equation:

Z^l = A^T H^(l-1) W^l,    H^l = σ(Z^l)
Here, W^l is the trainable matrix that holds the model parameters at the l-th level of the neural network, and σ is the activation function, such as ReLU. Consequently, the most time-consuming operations are the multiplication of a sparse matrix with a dense matrix (SpMM) and dense matrix multiplication (GEMM). Backpropagation also relies on the same computational primitives. We provide backpropagation derivations in Section III-D.
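The forward layer can be sketched in a few lines of code; the adjacency-list storage, unnormalized edge weights, and ReLU below are illustrative choices for a toy example, not our cuSPARSE-backed implementation:

```python
# One GNN forward layer: H_out = relu((A @ H) @ W). The A @ H step is
# the SpMM (A stored as per-row adjacency lists with weights), and the
# @ W step is the dense GEMM. Pure-Python sketch for clarity.

def spmm(rows, vals, h):
    """Sparse (n x n) times dense (n x f): out[i] = sum_j A[i,j] * h[j]."""
    f = len(h[0])
    out = [[0.0] * f for _ in rows]
    for i, (cols, ws) in enumerate(zip(rows, vals)):
        for j, w in zip(cols, ws):
            for k in range(f):
                out[i][k] += w * h[j][k]
    return out

def gemm(h, w):
    """Dense (n x f) times dense (f x g)."""
    g = len(w[0])
    return [[sum(row[k] * w[k][c] for k in range(len(w))) for c in range(g)]
            for row in h]

def relu(h):
    return [[x if x > 0 else 0.0 for x in row] for row in h]

def gnn_layer(rows, vals, h, w):
    return relu(gemm(spmm(rows, vals, h), w))

# Tiny 3-vertex path graph with self-loops (symmetric, so A = A^T here).
rows = [[0, 1], [0, 1, 2], [1, 2]]
vals = [[1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, -1.0], [0.5, 1.0]]
print(gnn_layer(rows, vals, H, W))
```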
The backpropagation recurrences, in matrix form, are:

G^L = ∂L/∂Z^L (computed from the loss function)
G^(l-1) = (A G^l (W^l)^T) ⊙ σ′(Z^(l-1))
∂L/∂W^l = (A^T H^(l-1))^T G^l

where ⊙ denotes elementwise multiplication and σ′ is the derivative of the activation function (for ReLU, σ′(z) = 1 iff z > 0 and 0 otherwise; see the forward propagation equations for the definitions of Z^l and H^l).
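The weight-gradient formula can be sanity-checked numerically. The sketch below (a hypothetical one-layer toy model with loss L = Σ relu(A H W), not our training loss) compares the analytic gradient (A H)^T G against central finite differences:

```python
# Numerical check of dL/dW = (A H)^T G for Z = A H W and the toy
# loss L = sum(relu(Z)), where G = dL/dZ = relu'(Z). Pure Python.

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    return [list(col) for col in zip(*x)]

def loss(a, h, w):
    z = matmul(matmul(a, h), w)
    return sum(v if v > 0 else 0.0 for row in z for v in row)

A = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
H = [[0.2, -0.4], [0.5, 0.1], [-0.3, 0.7]]
W = [[0.6, -0.2], [0.3, 0.8]]

# Analytic gradient: G = relu'(Z), dL/dW = (A H)^T G.
AH = matmul(A, H)
Z = matmul(AH, W)
G = [[1.0 if v > 0 else 0.0 for v in row] for row in Z]
analytic = matmul(transpose(AH), G)

# Central finite-difference gradient, one entry of W at a time.
eps = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        W[i][j] += eps
        up = loss(A, H, W)
        W[i][j] -= 2 * eps
        down = loss(A, H, W)
        W[i][j] += eps          # restore W
        numeric[i][j] = (up - down) / (2 * eps)

print(analytic)
print(numeric)
```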
The second step of the backpropagation, the weight update W^l ← W^l − η ∂L/∂W^l, is simply the gradient descent step. This step does not require communication, so it is not discussed in our analysis in the following section.
In this section, we present 1D, 2D, and 3D parallel algorithms for GNN training and analyze their communication costs. The presented communication costs are for one epoch, which is a single pass over the whole dataset.
In this regime, the matrices A and H are distributed to processes in block rows, where each process receives n/P consecutive rows. For example, given a matrix M, we write M_i to denote the block row owned by the i-th process. To simplify the algorithm description, we use M_ij to denote the j-th block column of the block row M_i, although the whole block row is owned by a single process.
Let T be the intermediate product A^T H. For each processor p_i, the computation is:

T_i = Σ_j (A^T)_ij H_j
The row-wise algorithm forms one block row of the output at a time, and each processor may potentially need to access all of H to form a single block row of T. However, only a portion of H is locally available at any time in parallel algorithms. The algorithm, thus, performs multiple iterations to fully form one block row of T. Algorithm 1 shows the pseudocode of the algorithm.
A^T is partitioned by rows, and H^(l-1) is partitioned by rows. This yields a 1D block row multiplication. W^l is fully replicated on each process, and is multiplied with A^T H^(l-1) after communication. The first multiplication is essentially a sparse matrix times a (tall-skinny) dense matrix, also known as sparse matrix times multiple dense vectors (SpMM).
In the context of distributed SpMM, we use c(A, P) to refer to max_i c_i, where c_i is the minimum number of dense matrix rows process p_i needs to receive from other processes in order for it to perform its local matrix multiplication. This count needs to be multiplied by f because each such row is of length f, i.e. each node carries a payload of size f in the form of a feature vector. This is illustrated in Figure 1.
Graph and hypergraph partitioning tools can be applied as pre-processing to heuristically minimize the c(A, P) metric. Note that a non-adversarial c(A, P) is never higher than n, which can be achieved by a random partitioning. Nevertheless, we chose to present the bounds in terms of c(A, P) because it can be lower than n thanks to the aforementioned tools. The per-process communication cost of this step is thus α P + β c(A, P) f.
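For a concrete feel of this metric, the sketch below (a hypothetical helper, not our implementation) counts the distinct remote rows of the dense matrix that each process must receive under a given 1D block-row partitioning:

```python
def remote_rows_per_process(adj, owner, nprocs):
    """For 1D block-row SpMM, count how many distinct remote rows of H
    each process must receive: process p needs row j of H whenever one
    of its local sparse rows has a nonzero in a column j owned elsewhere."""
    needed = [set() for _ in range(nprocs)]
    for i, cols in enumerate(adj):
        p = owner[i]
        for j in cols:
            if owner[j] != p:
                needed[p].add(j)
    return [len(s) for s in needed]

# 6-vertex cycle graph, 2 processes, block rows of 3 vertices each.
adj = [[5, 1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]
owner = [0, 0, 0, 1, 1, 1]
counts = remote_rows_per_process(adj, owner, 2)
print(counts)   # each process needs the 2 rows across the cut
```

Multiplying the maximum of these counts by the feature length f gives the per-process bandwidth term discussed above.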
H^l, the result of the activation, is partitioned by rows, as is Z^l. No further communication is necessary here to use H^l in Eq. 1 for layer l+1.
A is partitioned by columns, and G^l is partitioned by rows. This yields a 1D outer product multiplication. The communication results from the reduction of the low-rank matrices, each of size n × f, formed on each process. The result of this reduction is A G^l, which is partitioned by rows just like G^l, so that the next iteration of the backpropagation does not incur a redistribution cost.
The intermediate low-rank products are dense unless there are empty rows in each part of A. For the sake of initial analysis, let us assume we have an Erdős–Rényi graph where each possible directed edge occurs with probability d/n. This results in dn edges in expectation. Following the analysis in Section 4.1.2 of earlier work, the expected number of non-empty rows on each part is approximately nd/P for large P (i.e. P ≫ d). The storage cost of the intermediate low-rank product is then expected to be on the order of ndf/P, which is high but memory-scalable with increasing processor counts. At large scale (i.e. when P > d), this is better than storing the intermediate products in dense format, which would require a storage cost of nf per processor.
The theoretical sparsity analysis in the previous paragraph makes a case for taking advantage of sparsity in intermediate low-rank products for large P. The practical issue is that sparse reduction algorithms carry high overheads that make this approach hard to optimize, though there have been several encouraging developments in this area.
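The expected-sparsity estimate can be checked with a quick simulation under the stated Erdős–Rényi assumption (illustrative parameters only):

```python
import random

def expected_nonempty_rows(n, d, P):
    """Closed form: a row is non-empty within a block of n/P columns
    with probability 1 - (1 - d/n)^(n/P)."""
    return n * (1.0 - (1.0 - d / n) ** (n / P))

def simulate_nonempty_rows(n, d, P, trials=20):
    """Monte Carlo estimate of non-empty rows per column block."""
    random.seed(1)
    total = 0
    width = n // P            # columns per part
    p_empty = (1.0 - d / n) ** width
    for _ in range(trials):
        nonempty = 0
        for _ in range(n):    # each row independently
            if random.random() > p_empty:
                nonempty += 1
        total += nonempty
    return total / trials

n, d, P = 1000, 8, 64
print(expected_nonempty_rows(n, d, P))  # about n*d/P when P >> d
print(simulate_nonempty_rows(n, d, P))
```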
At this point, each processor has O(nf) data that need to be reduce-scattered in order to get the intermediate product in a block row-distributed state. The communication cost is α log P + β n f (P−1)/P [11, 28], which we round up to α log P + β n f to reduce clutter. The last step of multiplying the block row distributed A G^l with the replicated (W^l)^T to yield the block row distributed G^(l-1) does not require any communication.
Algebraically, there are two matrix multiplications in this step of the backpropagation. However, we can reuse the intermediate product A^T H^(l-1) that we computed in the previous equation, at the expense of increasing the memory footprint slightly. Then the only task is to multiply (A^T H^(l-1))^T and G^l, which is a small 1D outer product that requires an all-reduce on low-rank matrices of size f × f. This multiplication has communication cost α log P + β f^2.
Given that the embedding (i.e., feature vector) lengths are different for each layer of the GNN, we use the superscript f^l to denote the length of the feature vector in layer l. This results in the following communication bound per epoch:

T_1D = Σ_l [ α (P + 2 log P) + β ( c(A, P) f^l + n f^l + f^l f^(l+1) ) ]
To reduce clutter, we can consider the 'average' feature vector length f, resulting in the simplified formula:

T_1D ≈ L ( α (P + 2 log P) + β ( c(A, P) f + n f + f^2 ) )
The ability to treat A as if it is A^T and vice versa gives the algorithm the freedom to trade 1D outer products for 1D block row/column multiplications. Given that A = A^T, and since c(A, P) ≤ n, the resulting (simplified) communication cost for the symmetric case is

T_sym ≈ L ( α (2P + log P) + β ( 2 c(A, P) f + f^2 ) )
There are several other 1D algorithms we considered. For example, we could partition A row-wise as opposed to column-wise. This makes the forward propagation an outer product instead of a 1D block row multiplication. Conversely, the first step of the backpropagation, A G^l (W^l)^T, becomes a 1D block row multiplication as opposed to an outer product. The last step of the backpropagation, (A^T H^(l-1))^T G^l, still costs the same in terms of communication, as it performs a small 1D outer product. Consequently, we would still be performing one large outer product, one small outer product, and one block row multiplication as before, resulting in the same total communication cost.
We also considered transposing A between forward and backward propagation, ensuring that both A and A^T are partitioned the same way (either by rows or columns). If the input is undirected, this is already the case without additional transpositions and degenerates to the symmetric case presented above. Similarly, if one can afford to store two copies of the adjacency matrix, then the symmetric case bound in Equation 2 applies. The communication cost of transposition is α P + β nnz(A)/P per process, and it has to be paid only twice per epoch (once after forward propagation and once after backpropagation), not at every layer. The communication cost of this transposing variant would therefore be the symmetric case bound of Equation 2 plus an additional 2 ( α P + β nnz(A)/P ) per epoch.
We experimented with graph partitioning to evaluate its potential for us. We ran Metis on the Reddit data. For 64 processes, Metis' partitions suggested a sizable reduction in total communication volume over a random block row distribution. However, the total runtime of our bulk-synchronous algorithm is dictated by the maximum communication per process, which showed only a modest percentage reduction over random 1D block row partitioning (measured in edges cut for the process with maximum communication). These numbers are also optimistic, as they do not take into account the need to perform individualized "request and send" operations for exploiting the graph partitioning results.
For example, instead of relying on more efficient broadcast operations as done in Algorithm 1, each process that owns a part of A would (depending on its sparsity structure) individually request a subset of rows of H from their owners during forward propagation. This effectively increases the latency cost, slightly increases the bandwidth costs due to the communication of request indices, and finally makes it impossible to benefit from collective communication libraries such as NCCL and Gloo. Finally, we note that, given the scale-free nature of most graph datasets, graph partitioning is unlikely to produce an asymptotic improvement despite its added computational costs.
1.5D algorithms are attractive from the perspective of reducing communication when the matrices to be multiplied are very different in size. Here, we use the word 'size' to denote the number of nonzeros in a sparse matrix or the product of its dimensions in a dense matrix. One downside of 1.5D algorithms compared to 2D algorithms is their additional memory consumption due to replication. This was less of a concern for dense deep neural network training, where either the model or the data was replicated for parallelism. However, for GNN training, memory is at a premium. Hence, 2D algorithms, which do not use any extra memory, are preferable.
In the case of GNN training, the relative sizes of the two input matrices are nnz(A) = dn and nf. When the average degree d of the graph is close to the embedding vector length f, the matrices are of similar size. This is indeed the case for most datasets in practice, which makes it hard to justify the added memory burden of 1.5D algorithms.
Processors are logically organized on a √P × √P square mesh, indexed by their row and column indices so that the (i, j)-th processor is denoted by p_ij. Matrices are assigned to processors according to a 2D block decomposition. For a given n × n matrix, each process gets a submatrix of dimensions (n/√P) × (n/√P) in its local memory. For example, A is partitioned as shown below and A_ij is assigned to processor p_ij.
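The 2D block decomposition maps a global index pair to its owning process as in the following sketch (assuming √P evenly divides n; illustrative only):

```python
import math

def block_2d_owner(i, j, n, P):
    """Map global entry (i, j) of an n x n matrix to its owner on a
    sqrt(P) x sqrt(P) process grid under 2D block decomposition."""
    q = math.isqrt(P)          # process grid is q x q
    b = n // q                 # block dimension (assume q divides n)
    return (i // b, j // b)

# 8 x 8 matrix on a 2 x 2 grid: blocks of size 4 x 4.
print(block_2d_owner(0, 0, 8, 4))   # (0, 0)
print(block_2d_owner(3, 5, 8, 4))   # (0, 1)
print(block_2d_owner(6, 2, 8, 4))   # (1, 0)
```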
The pseudocode for the general rectangular case is listed in Algorithm 2. The first phase of the algorithm is based on a variant of the Scalable Universal Matrix Multiplication Algorithm (SUMMA). When BCast(A_ij) is called, the owner of A_ij becomes the root and broadcasts its submatrix to all the processes along the i-th process row. Symmetrically, BCast(H_ij) means that whoever is the owner of H_ij broadcasts its submatrix along the j-th process column. The cost of a single broadcast of an m-word message to P processes has a lower bound of α log P + β m, but high-level algorithms such as SUMMA can avoid the log P factor in the latency term through pipelining. Consequently, we will also not spell out the additional log P factor for our row and column broadcasts.
The variables that are significant only at the broadcasting processes are the local column and row ranges of the matrices to be broadcast. Here we use colon notation, where the range from i to j is denoted by i:j and an empty colon specifies the full possible range. For example, M(:, i) denotes the i-th column and M(1:k, :) denotes the first k rows of any two-dimensional object M. The SpMM and GEMM calls are local sparse-dense and dense-dense matrix multiplications, respectively.
We start by analyzing the special case of a square process grid to give the intuition. For each processor p_ij, the computation of the intermediate product T = A^T H is:

T_ij = Σ_k (A^T)_ik H_kj
Both A^T and H^(l-1) are partitioned into a √P × √P process grid. This yields a 2D multiplication, which we can do with an optimized SUMMA algorithm. To compute a submatrix of A^T H^(l-1), each process in the submatrix's row must broadcast its A^T submatrices and each process in the submatrix's column must broadcast its H^(l-1) submatrices. W^l is fully replicated on each process, and is multiplied with A^T H^(l-1) after communication. However, this also requires communicating the (n/√P) × (f/√P) sized blocks of A^T H^(l-1) along the processor row, something we label as "partial SUMMA".
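A single-process simulation of the SUMMA schedule (illustrative only; our actual implementation uses torch.distributed broadcasts across real processes) verifies that the stage-wise block products reproduce the full multiplication:

```python
# Minimal dense SUMMA sketch on a q x q logical grid: in stage k,
# block column k of A is broadcast along process rows and block row k
# of B along process columns; every process (i, j) accumulates
# A_ik @ B_kj into its output block.

def matmul(x, y):
    return [[sum(x[i][t] * y[t][j] for t in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def block(m, i, j, b):
    """Extract the (i, j) block of size b x b from matrix m."""
    return [row[j * b:(j + 1) * b] for row in m[i * b:(i + 1) * b]]

def summa(a, b_mat, q):
    n = len(a)
    bs = n // q
    c = [[0.0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):            # logical process (i, j)
            for k in range(q):        # SUMMA stage k
                aik = block(a, i, k, bs)     # received via row broadcast
                bkj = block(b_mat, k, j, bs) # received via column broadcast
                part = matmul(aik, bkj)
                for r in range(bs):
                    for s in range(bs):
                        c[i * bs + r][j * bs + s] += part[r][s]
    return c

A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
B = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
print(summa(A, B, 2) == matmul(A, B))
```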
Z^l is partitioned on a 2D process grid. When σ is elementwise, no communication is needed. However, when σ is not elementwise, in particular for log_softmax, each process needs to share its block of Z^l with the entire process row.
A and G^l are partitioned on a 2D process grid, and the multiplication results in the same communication pattern as Equation 1. We also need to account for σ′(Z^(l-1)). Recall that this needs communication when σ is not elementwise; it has the same communication pattern as Equation 2. The communication cost per process is

α √P + β ( nnz(A) + n f ) / √P
Strictly speaking, there are two matrix multiplications here. The first is multiplying A^T and H^(l-1), which would have been a 2D SUMMA SpMM. However, we can reuse the same intermediate product from the previous equation, as we have done in the case of the 1D block row algorithm. This increases storage by an additive nf/P term on each processor. The intermediate product A^T H^(l-1) is already partitioned on a √P × √P process grid. The second multiplication, which is the only multiplication we have to pay for in computing this equation, is between the transpose of the previously saved intermediate product and G^l. This is a 2D dense SUMMA on two matrices with nf elements each, resulting in a small f × f output. The final allgather is to keep W^l replicated. Overall, the communication cost due to this equation is

α √P + β ( 2 n f / √P + f^2 )
The most important conclusion from this total communication formula is the √P in the denominator of the bandwidth term, which did not exist in the 1D algorithm's communication cost. To get a sense of how much better the 2D algorithm could be at large process counts, assume that (1) random partitioning is used, so c(A, P) = n, (2) the average degree of the graph is comparable to the feature vector length f, which also means nnz(A) ≈ n f, and (3) f^2 is negligible compared to n f / √P. Then, the 2D algorithm would only move a 1/√P-th of the data moved by the 1D algorithm. However, the latency cost of the 2D algorithm is higher, so it is not an appropriate method for large-scale parallel training on small graphs where latency is the dominant cost.
When the process grid is non-square, the 2D SUMMA algorithm is still well-defined and relatively easy to implement. Consider the forward propagation equation where A^T and H^(l-1) are 2D block partitioned on a p_r × p_c grid. Each process needs to access a (1/p_r)-th of A^T and a (1/p_c)-th of H^(l-1) to form its own piece of the intermediate output A^T H^(l-1). However, we now need to consider the communication due to the second phase of Algorithm 2, which communicates this intermediate matrix along the process row. In the resulting forward propagation communication cost, the sparse-matrix term shrinks as the p_c/p_r ratio grows (with lower-order terms involving the greatest common factor of p_r and p_c). This suggests that if the average vertex degree of the graph is significantly larger than the feature vector length, then there are potential savings in terms of sparse matrix communication by increasing the p_c/p_r ratio. However, this comes at the expense of increasing the sum of the other two terms, which correspond to dense matrix communication. This is because the sum of two numbers whose product is fixed is minimized when those numbers are equal; put differently, the square has the smallest perimeter of all rectangles of a given area. Consequently, given the unclear benefit-to-cost ratio of using non-square grids, our implementations focus on square grids in this work.
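The square-grid argument can be checked numerically; the cost function below is an illustrative stand-in for the two dense-communication terms (each inversely proportional to one grid dimension), not the exact cost expression:

```python
# For fixed P = p_r * p_c, the sum of two terms proportional to
# 1/p_r and 1/p_c is minimized when p_r = p_c (the square grid),
# by the same arithmetic-geometric-mean argument as the perimeter
# analogy in the text.
def dense_cost(pr, pc, nf=1.0):
    return nf / pr + nf / pc

P = 64
grids = [(pr, P // pr) for pr in (1, 2, 4, 8, 16, 32, 64)]
costs = {g: dense_cost(*g) for g in grids}
best = min(costs, key=costs.get)
print(best, costs[best])
```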
(Figure: Block Split 3D distribution of the matrices.)
For ease of presentation, let us assume that processes are logically organized on a √(P/c) × √(P/c) × c mesh, indexed by three indices, though in general each of the three process dimensions can be different. Our matrix-to-process-mesh assignment follows the Split-3D-SpGEMM approach of Azad et al. We call our variation Split-3D-SpMM, because the primary GNN operation is SpMM as opposed to SpGEMM. Each 2D plane in this 3D process mesh is called a "layer".
Considering a single SpMM, such as the A^T H^(l-1) in forward propagation, we note that the two input matrices are split differently. After 3D distribution, each local submatrix of A is of dimensions (n/√(P/c)) × (n/(c √(P/c))). That means the number of rows of each local submatrix of A is c times its number of columns. By contrast, H^(l-1) is split along the rows across layers and each local piece is of dimensions (n/(c √(P/c))) × (f/√(P/c)). This data distribution choice makes each local SpMM shorter and fatter than the 2D distribution does, potentially alleviating some of the issues with local SpMM scaling we observe with our 2D implementation.
We note that each process holds only a (1/P)-th of the input matrices in Split-3D-SpMM, so there is no replication at the input stage. The replication happens in intermediate stages, as we explain in detail below.
The easiest way to think about a 3D multiplication algorithm is to consider it as c independent 2D algorithms executing at each layer, followed by a reduction. To compute a submatrix of A^T H^(l-1), each A^T submatrix broadcasts itself to the rest of its process row, and each H^(l-1) submatrix broadcasts itself to the rest of its process column. In each 2D SUMMA iteration, each process on a given layer receives a submatrix from A^T and a submatrix from H^(l-1), multiplies them, and adds the result to a running sum.
Once these 2D SUMMA iterations that have been executing independently at each layer complete, processes have partial sums for A^T H^(l-1) that need to be reduced across the third process dimension (also called a "fiber"). The partial sums after the 2D SUMMA iterations complete are dense matrices, with n f c / P elements each. These partial sums are then reduce-scattered along the fiber dimension to get the product in a Block Split 3D format. This results in the following communication cost just to form A^T H^(l-1):

α √(P/c) + β ( ( nnz(A) + n f ) / √(P c) + n f c / P )
We now need to multiply this intermediate product with W^l. Similar to the Block 2D case, we will perform a partial Split-3D-SpMM where the second input matrix does not need to be communicated because it is replicated. The total communication cost, with this partial step added, is:

α √(P/c) + β ( ( nnz(A) + 2 n f ) / √(P c) + n f c / P )
Z^l is partitioned on a 3D process mesh. When σ is elementwise, no communication is needed. However, when σ is not elementwise, in particular for log_softmax, each process needs to share its block of Z^l with the entire process row within a layer. This is equivalent to an all-gather per layer. No cross-layer or cross-row communication is needed, as the output of log_softmax for a row of Z^l depends only on the values within that row.
A and G^l are partitioned on a 3D process grid, and the multiplication results in the same communication pattern as Equation 1. We also need to account for σ′(Z^(l-1)). Recall that this needs communication when σ is not elementwise; it has the same communication pattern as Equation 2, hence totaling:

α √(P/c) + β ( ( nnz(A) + 2 n f ) / √(P c) + n f c / P )
We store the intermediate product A^T H^(l-1) that has been computed in the previous step, and reuse it here as we have done in the other algorithms. Multiplying the transpose of this intermediate product with G^l is a dense 3D SUMMA on two matrices with nf elements each, resulting in a small f × f output. Note that each of these inputs is Block Split 3D. As before, the final allgather is to keep W^l replicated. The bandwidth cost of the all-gather strictly dominates the bandwidth cost of the fiber reduction, so we do not include the cost of the fiber reduction in the bandwidth term. The resulting communication cost is:

α √(P/c) + β ( 2 n f / √(P c) + f^2 )
Ignoring latency terms that are strictly dominated by the α √(P/c) terms, the total per-epoch communication cost is approximately

T_3D ≈ L ( α √(P/c) + β ( ( nnz(A) + n f ) / √(P c) + n f c / P + f^2 ) )
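As a rough illustration of how the bandwidth terms of the three algorithms scale with P, consider only the leading dense-matrix terms: order nf per process for 1D, nf/√P for 2D, and nf/√(Pc) for 3D (constants and sparse-matrix terms dropped; a hedged sketch, not the exact expressions):

```python
import math

def words_1d(n, f, P):
    return n * f                      # leading term: full rows of H

def words_2d(n, f, P):
    return n * f / math.sqrt(P)       # leading term with 2D blocking

def words_3d(n, f, P, c=4):
    return n * f / math.sqrt(P * c)   # leading term with c layers

n, f = 1_000_000, 128
for P in (16, 64, 256, 1024):
    print(P, words_1d(n, f, P), words_2d(n, f, P), words_3d(n, f, P))
# 2D moves a factor of sqrt(P) fewer words than 1D, and 3D improves
# on 2D by another factor of sqrt(c), at the cost of replication.
```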
Although it provides an asymptotic reduction in communication costs, we do not implement this 3D algorithm due to (1) its high constants, (2) its implementation complexity, and, most importantly, (3) its need to do a factor of c replication in its intermediate stages.
We ran our experiments on two of the largest datasets previously used in GNN research, the Reddit and Amazon datasets. In addition, we use a much larger protein similarity network, which pushes the boundaries of large-scale GNN training. This 'protein' network comes from the data repository of the HipMCL algorithm. It is an induced subgraph, containing 1/8th of the original vertices, of the sequence similarity network that contained all the isolate genomes in the IMG database at the time. The characteristics of the datasets are documented in Table VI. The feature (input) and label (output) embedding lengths of the protein dataset largely follow the literature on this topic, where the protein sequences are assumed to be initially processed independently to obtain their 128-length embeddings (either via CNNs or LSTMs) before running GNNs on those embeddings to benefit from the connectivity information.
We verified that our parallel implementation not only achieves the same training accuracy in the same number of epochs as the serial implementations in PyTorch, but it also outputs the same embeddings up to floating point accumulation errors. Consequently, we only provide performance numbers as the accuracy numbers are identical to the serial cases.
All of our experiments are run on the Summit supercomputer at ORNL, which has IBM AC922 nodes with 6 NVIDIA V100 GPUs. Each GPU has 16GB of HBM2 memory. Each Summit node has two sockets, each with one POWER9 CPU and three GPUs. Within a socket, every GPU is connected to the others and to the host CPU with NVLink 2.0, providing 100GB/s total bidirectional bandwidth. Across sockets, Summit uses the IBM X-Bus interconnect with 64GB/s. Between nodes, Summit uses dual-rail EDR InfiniBand with 23GB/s bandwidth.
We implement our 3-layer GNN architecture mostly in PyTorch Geometric (PyG). Within PyG, we use torch.distributed with the NCCL backend for our communication primitives. For our SpMM calls, we separately call cuSPARSE's csrmm2 function in a C++ extension. We compile our C++ backend with CUDA 10.1.
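For readers unfamiliar with the csrmm2 semantics, the local multiply it performs can be expressed in a few lines of pure numpy. This is only a stand-in sketch of the CSR sparse-times-dense product, not the paper's C++/cuSPARSE extension:

```python
import numpy as np

def csr_spmm(indptr, indices, data, B):
    # C = A @ B where A is given in CSR form (indptr, indices, data);
    # mirrors what the local cuSPARSE csrmm2 call computes.
    n = len(indptr) - 1
    C = np.zeros((n, B.shape[1]))
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            C[i] += data[k] * B[indices[k]]
    return C

# A 3x3 sparse matrix in CSR form times a 3x2 dense matrix.
indptr, indices, data = [0, 1, 3, 4], [1, 0, 2, 1], [2.0, 1.0, 3.0, 4.0]
B = np.arange(6.0).reshape(3, 2)
A_dense = np.array([[0, 2, 0], [1, 0, 3], [0, 4, 0]], dtype=float)
assert np.allclose(csr_spmm(indptr, indices, data, B), A_dense @ B)
```

On the GPU, this loop is replaced by the vendor kernel; the CSR layout is what makes the per-row work proportional to the number of nonzeros rather than the matrix dimension.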
Recall that each node on Summit has 6 GPUs. As such, our implementation is single-node multi-GPU when using at most 6 GPUs, but multi-node multi-GPU in all other cases.
For the Amazon and Protein datasets, we opt to randomly generate feature values for simplicity and use the whole graph as the training set. This does not affect performance, and in practice users could use any values for the feature vectors. For Reddit, we use the input feature vectors and training split of Hamilton et al., as they are already provided within PyG. We run each dataset for a fixed number of epochs. We do not report numbers for Amazon or Protein on the smallest device counts, as the data does not fit in memory for those configurations. Jia et al. observed the same behavior with PyG.
For the data we were able to collect on Amazon, we see that the time spent communicating dense matrices, denoted "dcomm" in Figure 3, goes down as more devices are added. This is consistent with the bounds discussed in Section IV-C.
We observe that local SpMM does not scale with increasing process count. There are two fundamental reasons for this:
SpMM performance degrades as the matrix gets sparser. Yang et al. demonstrate that when the average number of nonzeros per row (i.e., the degree) goes down from 62 to 8, the sustained GFlops rates are cut by a factor of 3. They specifically evaluated cuSPARSE's csrmm2 function, which is the same SpMM function we use, but the performance degradation due to increased sparsity is present in most of the SpMM implementations they evaluated. Not only is Amazon already sparse with a low average degree, but it gets even sparser as its sparse adjacency matrix is 2D partitioned across larger device counts. This phenomenon is known as hypersparsity and decreases the average degree of 2D-partitioned submatrices by a factor of the square root of the device count.
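The hypersparsity arithmetic is simple to verify with illustrative numbers (the constants below are examples, not measurements from the paper): 2D-partitioning an n-by-n matrix with average degree d over a sqrt(p)-by-sqrt(p) grid leaves each submatrix with nnz/p nonzeros spread over n/sqrt(p) rows, so the local average degree is d/sqrt(p).

```python
import math

def local_avg_degree(d, p):
    # nnz per submatrix = (n * d) / p; rows per submatrix = n / sqrt(p)
    # => local degree = d / sqrt(p)
    return d / math.sqrt(p)

# Illustrative: a degree-64 graph on a 4x4 (p = 16) process grid
# leaves each local submatrix with average degree 16.
assert math.isclose(local_avg_degree(64, 16), 16.0)
```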
Since the dense matrices (e.g., the activation matrix) are also 2D partitioned, the number of columns in each local dense submatrix also goes down by a factor of the square root of the device count, making the dense matrix "skinnier". For example, the SpMM calls for the middle layer go from multiplying with a dense matrix that has 16 columns at the smallest device count to multiplying with one that has only 2 columns at the largest. The performance degradation in this extremely skinny regime is also well documented.
These two factors have a multiplicative detrimental impact on local SpMM performance. Future work will involve using more sophisticated SpMM implementations, such as the merge-based method, that are less impacted by these factors. Also, wider networks are known to improve the testing accuracy of GNNs, so we expect a trend toward larger numbers of activations in hidden layers in the future, potentially making the skinny dense matrix issue less relevant.
We also note that sparse communication ("scomm" in Figure 3) does not scale. This is again due to Amazon's sparsity: sparse matrix communication here ends up being latency-bound. Furthermore, as shown by the total communication analysis in Section IV-C5, each epoch requires many sparse broadcasts, each taking under a millisecond at the largest process counts. On the Summit supercomputer, inter-node communication is latency-bound at that point, and further reductions in data volume cannot compensate for the increased latency.
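The latency-bound regime can be illustrated with the standard alpha-beta cost model, time = alpha + beta * words. The constants below are assumed for illustration only, not measured Summit values: once alpha dominates, shrinking the message barely changes its cost.

```python
# Assumed illustrative constants: alpha = per-message latency (s),
# beta = per-word inverse bandwidth (s/word).
alpha, beta = 2e-6, 4e-11

def msg_time(words):
    # Alpha-beta (latency-bandwidth) cost of one message.
    return alpha + beta * words

# Halving an already-small message barely reduces its cost:
big, small = msg_time(10_000), msg_time(5_000)
assert small > 0.9 * big   # latency dominates, so the saving is marginal
```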
Fortunately, we still see an overall speedup in epoch throughput when going from 16 to 64 processes, as shown in Figure 2. This is because the most costly operation in training on the Amazon dataset is the communication of dense matrices (such as activation matrices, their derivatives, and intermediate products). Our communication analysis in Section IV-C accurately predicts this bottleneck: the number of words moved due to communicating dense matrices is more than a factor of 2 larger than the number of words moved due to communicating sparse matrices. The difference is amplified multiplicatively by the gap between the average feature vector length and the average degree.
For the Reddit dataset, we see in Figure 3 that SpMM scales as we increase the process count. However, communication does not scale as the device count increases. This is because, like the sparse matrix communication for Amazon, communication for Reddit ends up being latency-bound: the average cost of a broadcast is on the order of a millisecond, and broadcasts this cheap on Summit are latency-bound.
For our protein network experiments, we see performance improvements on two fronts. First, total communication goes down as the process count increases, consistent with the bounds derived earlier. Second, the SpMM time also goes down, although this speedup is limited for the same reasons discussed above. Specifically, the performance of SpMM degrades as the matrix gets sparser; at the largest process counts, the average degree of the local submatrices of our protein network becomes a small fraction of the row length.
We are unable to compare our results with NeuGraph or ROC for multiple reasons. First, neither code is publicly available, and the ROC authors (whose implementation is the faster of the two) did not answer our requests for code access. Comparing reported numbers is not meaningful because we do not have access to a hardware system with the same configuration as theirs. More importantly, our analysis predicts that our 2D implementation is only competitive with 1D approaches at sufficiently large device counts. Given that the largest GPU counts reported by NeuGraph and ROC are 8 and 16, respectively, the benefits of our 2D implementation would not be apparent. We would like to highlight that our experiments run on significantly larger GPU counts than these contemporary studies.
We presented distributed-memory parallel GNN training algorithms that asymptotically reduce communication costs compared to the commonly used vertex partitioning methods. Our algorithms were inspired by their existing counterparts in numerical linear algebra. Our results show that even simple variants of our algorithms can be scalable when implemented using off-the-shelf tools such as PyTorch.
The memory consumption of GNNs can be high due to the need to store the activations generated during forward propagation so they can be used during backpropagation. If full-batch gradient descent is used, or if graph sampling does not provide an asymptotic reduction, the memory cost grows with the product of the number of layers, vertices, and feature lengths, which is prohibitive for deep networks. Consequently, we envision future work where our distributed training algorithms are carefully combined with sophisticated sampling-based methods to achieve the best of both worlds.
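A back-of-the-envelope calculation (with illustrative sizes, not figures from the paper) shows why storing one n-by-f activation per layer quickly becomes prohibitive in full-batch training:

```python
def activation_bytes(layers, vertices, features, bytes_per_value=4):
    # fp32 storage for one dense activation matrix per layer.
    return layers * vertices * features * bytes_per_value

# Illustrative: 3 layers, 100M vertices, 128-length features -> ~153.6 GB,
# far beyond a single 16GB V100.
gb = activation_bytes(3, 100_000_000, 128) / 1e9
assert abs(gb - 153.6) < 1e-9
```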
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE 1752814 and by the National Science Foundation under Award No. 1823034. This work is also supported by the Advanced Scientific Computing Research (ASCR) Program of the Department of Energy Office of Science under contract No. DE-AC02-05CH11231.
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Computing Surveys (CSUR) 52(4), pp. 1–43.
Proceedings of Machine Learning and Systems (MLSys), pp. 187–198.