Towards Efficient Large-Scale Graph Neural Network Computing

10/19/2018 · by Lingxiao Ma, et al.

Recent deep learning models have moved beyond low-dimensional regular grids such as image, video, and speech, to high-dimensional graph-structured data, such as social networks, brain connections, and knowledge graphs. This evolution has led to large graph-based irregular and sparse models that go beyond what existing deep learning frameworks are designed for. Further, these models are not easily amenable to efficient, at-scale acceleration on parallel hardware (e.g., GPUs). We introduce NGra, the first parallel processing framework for graph-based deep neural networks (GNNs). NGra presents a new SAGA-NN model for expressing deep neural networks as vertex programs, with each layer in well-defined (Scatter, ApplyEdge, Gather, ApplyVertex) graph operation stages. This model not only allows GNNs to be expressed intuitively, but also facilitates the mapping to an efficient dataflow representation. NGra addresses the scalability challenge transparently through automatic graph partitioning and chunk-based stream processing out of GPU core or over multiple GPUs, which carefully considers data locality, data movement, and the overlapping of parallel processing and data movement. NGra further achieves efficiency through highly optimized Scatter/Gather operators on GPUs, despite the sparsity involved. Our evaluation shows that NGra scales to large real graphs that none of the existing frameworks can handle directly, while achieving up to about a 4× speedup over the multiple-baseline design on TensorFlow, even at small scales.


1 Introduction

Deep learning, in the form of deep neural networks (DNNs), has been gaining popularity due to its success in areas such as speech, vision, and natural language processing. In these areas, the coordinates of the underlying data representation often have a regular grid structure, which is friendly to hardware acceleration (e.g., GPU) with massive SIMD-style data-parallelism. There is an emerging trend of applying deep learning models on data with an irregular graph structure [9, 22, 25, 13, 15, 4, 5, 29], driven by the importance of graph data such as social networks, knowledge graphs, and graphs in bioinformatics and neuroscience (e.g., protein-protein interactions or neuron connections in brains), and advancing the state-of-the-art prediction results in their targeted applications (e.g., classification, embedding, and query-answering). These graph-based neural networks (GNNs) typically apply neural network models over the features associated with vertices and edges in a graph, and propagate and aggregate the results to produce the next-level features.

None of the existing solutions supports GNNs well. Existing graph processing engines [28, 26, 11, 6, 41] often provide a Gather-Apply-Scatter (GAS)-like vertex-program model, but are incapable of expressing and supporting neural network architectures within the graph constructs. Deep learning frameworks such as TensorFlow [3], PyTorch [2], MXNet [8], and CNTK [42] are designed to express neural networks as dataflow graphs, but do not support a graph-propagation model directly. In addition, none of them offers the needed scalability to handle large graphs, nor do they support efficient GPU-based implementations of graph propagation operators (which translate into sparse operations). This lack of support has seriously limited the ability to explore the full potential of GNNs at scale, as the combination of DNNs and large graph structures poses significant challenges at the system level.

In this paper, we present NGra, the first system to support large-scale GNNs, from an easy-to-express programming model to a scalable and efficient parallel processing engine on GPUs. NGra naturally combines dataflow with a vertex-program abstraction in a new model called SAGA-NN (Scatter-ApplyEdge-Gather-ApplyVertex with Neural Networks). While SAGA can be considered a variant of the GAS model, the user-defined functions in the SAGA-NN model allow users to express neural network computations over vertex or edge data (which are treated as tensors) using a dataflow abstraction, rather than computations designed for traditional graph processing (e.g., algorithms such as PageRank, connected components, and shortest path).

Just as with DNNs, efficient use of GPUs is critical to the performance of GNNs and is more so due to the additional challenge of handling large graph structures. To achieve scalability beyond the physical limitation of GPUs, NGra transparently partitions a graph (vertex and edge data) into chunks, and translates a GNN algorithm expressed in the SAGA-NN model into a dataflow graph with operators at the chunk granularity, through which it enables chunk-based parallel stream processing on a single or multiple GPUs.

The efficiency of the NGra engine then hinges heavily on how well NGra manages and schedules the parallel stream processing, and on the implementation of the key graph propagation operators, Scatter and Gather, on GPUs. NGra pays careful attention to data locality to minimize swapping data in and out of GPU memory and to maximize the reuse of data chunks while they are in GPU memory, overlapping data movement and computation in a streaming fashion. For the multi-GPU case, it uses a ring-based streaming mechanism that avoids redundant data movement from host memory by exchanging data chunks among GPUs directly. The Scatter and Gather stages in the SAGA-NN model conduct the vertex data propagation along edges and behave as matrix multiplications on sparse structures. It is notoriously difficult to perform sparse matrix operations on data-parallel hardware like GPUs. NGra therefore introduces special operators, supported by a graph propagation engine, into the dataflow graph and optimizes their execution on GPUs. Note that, unlike the traditional graph processing scenarios that other GPU-based graph engines focus on, in GNN scenarios the mutable vertex data itself may not fit in GPU device memory, since the data of each vertex can be a feature vector rather than a simple scalar. Our scheme therefore prefers to exploit parallelism in per-vertex data access, improving memory access efficiency.

We implement NGra by extending TensorFlow with the vertex-program abstraction and custom operators for the graph propagation procedure. We demonstrate that NGra can scale to support a variety of GNN algorithms on large graphs containing millions of vertices, with hundreds of feature dimensions and hundreds of millions of edges, by leveraging the host memory of a single server and the computation power of its GPU(s), which cannot be achieved by using existing deep learning frameworks directly. Compared with TensorFlow on small graphs that it can support on a GPU, NGra obtains up to about a 4× speedup. We also extensively evaluate the improvements brought by the multiple optimizations in NGra to demonstrate their effectiveness.

The rest of the paper is organized as follows. Section 2 introduces the SAGA-NN programming abstraction. Section 3 describes the components, mechanisms, and optimizations in the NGra system. Section 4 presents the ring-based streaming scheme that NGra uses to scale to multiple GPUs. Section 5 illustrates the use of the SAGA-NN model with applications. Section 6 discusses the implementation and evaluation of NGra. We discuss related work in Section 7 and conclude in Section 8.

2 NGra Programming Abstraction

A graph-based neural network (GNN) is a general neural network architecture defined according to a graph structure. Each vertex or edge in the graph can be associated with tensor data (normally a vector) as its feature or embedding. A GNN can be stacked in multiple layers, with an iterative propagation procedure conducted layer-by-layer over the same graph. In each layer, the vertex or edge features are transformed and propagated along edges and aggregated at the target vertices to produce the new features for the next layer. The transformation can be an arbitrary DNN computation. The graph may also contain a label for each vertex, each edge, or the entire graph, for computing a loss function at the top layer. A feed-forward computation is then performed from the bottom layer to the top layer, with back-propagation conducted in the reverse direction. Figure 1 illustrates the feed-forward computation of a 2-layer GNN.

Figure 1: Feed-forward computation of a 2-layer GNN.

We now use the Gated Graph ConvNet (G-GCN) algorithm [29, 4] as a concrete example. Graph ConvNet generalizes the convolution operation, typically applied on image datasets, to an arbitrary graph (e.g., a knowledge graph). Gated Graph ConvNet further incorporates a gating mechanism, so that the model can learn which edges are more important for the learning target. The feed-forward computation of G-GCN at each layer is formalized recursively in Example 2.1.

Example 2.1.

Let $h_v^{\ell}$ denote the feature vector of vertex $v$ at layer $\ell$, and let $W_H^{\ell}$, $W_C^{\ell}$, and $W^{\ell}$ be the weight parameters to learn. The gated graph ConvNet algorithm recursively defines the feature of a vertex as follows:

$$h_v^{\ell+1} = \text{ReLU}\Big(W^{\ell} \sum_{u \to v} \eta_{u,v} \odot h_u^{\ell}\Big)$$

where $\odot$ is the element-wise multiplication and $\eta_{u,v}$ (for each edge $(u,v)$) acts as the edge gate, computed by:

$$\eta_{u,v} = \text{sigmoid}\big(W_H^{\ell}\, h_u^{\ell} + W_C^{\ell}\, h_v^{\ell}\big)$$

// G-GCN at layer ℓ; params p = [W_H, W_C, W]
def ApplyEdge(edge, p):
    η = sigmoid(p.W_H ⊗ edge.src + p.W_C ⊗ edge.dest)
    return η ⊙ edge.src
set Gather.accumulator = sum
accum = Gather(acc)
// compute new vertex data
def ApplyVertex(vertex, accum, p):
    return ReLU(p.W ⊗ accum)
return vertex^{ℓ+1}

Figure 2: Gated Graph ConvNet at layer ℓ in the SAGA-NN model, where ⊗ refers to matrix multiplication.

2.1 SAGA-NN Model

To express the recursive computation at each layer of a GNN, which contains both graph propagation and DNN computation, we propose a SAGA-NN (Scatter-ApplyEdge-Gather-ApplyVertex with Neural Networks) vertex-program abstraction. SAGA-NN defines four stages of a feed-forward computation in each layer of a GNN: Scatter, ApplyEdge, Gather, and ApplyVertex.

Figure 3: SAGA-NN Stages for each layer of GNN.

SAGA-NN provides two user-defined functions (UDFs) for ApplyEdge and ApplyVertex, respectively, to declare the neural network computations on edges and vertices. The ApplyEdge function defines the computation on each edge, which takes edge and p as inputs, where edge is the abstraction of edge data and p contains the learnable parameters of the GNN model. Each edge is a tuple of tensors [src, dest, data] representing the data of the source and destination vertices connected by the edge, as well as the edge associated data; e.g., edge weight. This function can be used to apply a neural network model on edge and p, and output an intermediate tensor value associated with the edge. The ApplyVertex function defines the computation on a vertex, which takes as input a vertex tensor data vertex, the vertex aggregation accum and learnable parameters p, and returns the new vertex data through applying a neural network model. The SAGA-NN abstraction builds on a dataflow framework, so users can symbolically define the dataflow graphs in UDFs by connecting mathematical operations (e.g., add, tanh, sigmoid, matmul) provided by the underlying framework.

The other two stages, Scatter and Gather, perform data propagation and prepare data collections as inputs for ApplyEdge and ApplyVertex. They are triggered and conducted by the system implicitly and do not require users to provide explicit UDFs. Figure 2 illustrates the description of G-GCN (at layer ℓ) in the SAGA-NN model.

2.2 Execution Semantics

For each GNN layer, the four-stage execution flow of a feed-forward computation is illustrated in Figure 3. It starts from the Scatter stage, where the tensor data vertex of each vertex is passed onto the adjacent edges to construct edge data edge, containing both the source and destination vertex data. The subsequent ApplyEdge stage then invokes a parallel computation defined by the ApplyEdge function on the edge data to produce an intermediate tensor value for each edge as its output. The Gather stage then propagates those outputs along the edges and aggregates them at the destination vertices through commutative and associative accumulate operations. Finally, the ApplyVertex stage executes the computation defined in the ApplyVertex function on all vertices to produce updated vertex data for the next layer.
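The four stages above can be sketched end-to-end in plain NumPy, using a sum accumulator and a G-GCN-style gated edge function. This is an illustrative sketch of the execution semantics only; the graph layout, random weights, and function bodies here are stand-ins, not NGra's actual operators.

```python
import numpy as np

# Toy graph: edge list (src, dst), 4 vertices, feature dim 3.
src = np.array([0, 1, 2, 0])
dst = np.array([1, 2, 3, 3])
H = np.random.rand(4, 3)                  # vertex features at layer l
W_H, W_C, W = (np.random.rand(3, 3) for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Scatter: pass vertex data onto adjacent edges.
edge_src, edge_dst = H[src], H[dst]

# ApplyEdge: gated message per edge (illustrative G-GCN-style gate).
eta = sigmoid(edge_src @ W_H + edge_dst @ W_C)
acc = eta * edge_src

# Gather: sum-accumulate edge outputs at their destination vertices.
accum = np.zeros_like(H)
np.add.at(accum, dst, acc)

# ApplyVertex: produce the next-layer features.
H_next = np.maximum(accum @ W, 0.0)       # ReLU
```

Vertex 0 has no incoming edges, so its accumulator stays zero; every other row of `accum` is the gated sum of its in-neighbors' messages.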

A GNN training process also involves a backward computation phase in each stage of a layer for the back-propagation of gradients. The backward computations are invoked in the reverse order of the feed-forward computation: the process starts from the backward-ApplyVertex stage, which takes the gradient passed from layer ℓ+1 as input to update the corresponding learnable parameters on vertices, and outputs the gradient of accum and a partial vertex gradient. The backward-Gather stage then takes the gradient of accum as input and computes its output gradient for each edge based on the accumulate function used in Gather. The subsequent backward-ApplyEdge stage takes that edge gradient as input to update the learnable parameters relevant to edges, and outputs the partial vertex gradients for both the source and destination vertices. Finally, the backward-Scatter stage accumulates all the partial gradients for a vertex to generate the final vertex gradient and passes it to layer ℓ−1.

NGra can automatically generate the corresponding back-propagation execution for each layer of a GNN defined in the SAGA-NN model, because the UDFs for ApplyEdge and ApplyVertex are expressed as dataflow computations over regular tensors and can therefore leverage the auto-differentiation provided by deep learning frameworks. We choose not to expose UDFs for Scatter and Gather, because such UDFs would be highly coupled with the propagation procedure, whose computations flow through the irregular graph structure and are hard to express as the dataflow that NGra optimizes; users would then have to implement the corresponding derivative functions of the UDFs, a serious burden. Following the same principle, NGra also avoids exposing user-defined aggregation methods. It instead provides a set of default ones, including max, sum, and concatenation, which can be chosen by setting Gather.accumulator (as shown in Figure 2).

By carefully combining dataflow with the vertex-program abstraction, SAGA-NN inherits benefits from both. The dataflow abstraction makes it easy to express neural network architectures in GNNs and to leverage auto-differentiation. The vertex-program in SAGA-NN allows users to express computations naturally by thinking locally as a vertex, and captures common patterns in GNNs as well-defined stages, thereby enabling graph-related optimizations (e.g., efficient graph propagation procedures) and helping produce a streaming-based dataflow graph with an optimized scheduling strategy.

3 NGra System

NGra provides a combination of the dataflow and vertex-program abstractions as its user interface. Underneath this abstraction, NGra mainly consists of: 1) a front-end that translates an algorithm implemented in the SAGA-NN model into a chunk-granularity dataflow graph, enabling GNN computation on large graphs on a GPU; 2) an optimization layer that produces a scheduling strategy to minimize data movement between host and GPU device memory, and recognizes opportunities to fuse operations and remove redundant computations; 3) a set of efficient propagation operator kernels that support streaming-based processing to overlap data movement and computation on the GPU; and 4) a dataflow execution runtime. Because NGra largely leverages existing dataflow-based deep learning frameworks for the dataflow execution runtime, this section focuses on the design of the first three, which are the main contributions of NGra.

Figure 4: Chunk-based dataflow graph for a destination interval at a G-GCN layer. For clear visualization, the swap-out (D2H) of the output tensors of operations in the SAG phase is hidden inside the SAG sub-graph.

3.1 Chunk-Based Streaming Dataflow

When exploiting computation power of GPU, existing deep learning frameworks assume that the input and output tensor data of a single operator in the dataflow graph can be fully held in GPU device memory. However, when treating the vertex or edge data of an entire graph as a single tensor or matrix, this assumption often fails to hold for large graphs, which limits the scale of the graph that the system can support efficiently. To address this problem, NGra splits the vertex and edge data of the graph into small chunks that can fit into GPU device memory and constructs a dataflow graph with operators processing computations at the chunk granularity.

NGra splits vertex and edge data into chunks through a 2D partitioning that tiles the adjacency matrix representing the edges in the graph, with vertex data split into the corresponding equally-sized disjoint vertex id intervals. This way, the edges in each edge chunk connect the vertices in two vertex chunks as sources and destinations, respectively. As illustrated in Figure 4, an edge chunk contains the edges connecting the source vertices in one vertex chunk and the destination vertices in another. In each edge chunk, for feed-forward computation, edges are laid out in a compressed sparse column (CSC) format with edges clustered by destination vertex id, whereas for back-propagation computation, edges are arranged in a compressed sparse row (CSR) format. NGra also makes a best effort to re-encode vertex ids to equalize the number of edges across edge chunks for balanced chunk-granularity computation.
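The 2D partitioning can be sketched as follows: split vertex ids into equal intervals and bucket each edge into the chunk indexed by its (source interval, destination interval) pair, clustering edges by destination id within a chunk as CSC would. The chunk container and interval size here are illustrative stand-ins, not NGra's storage format.

```python
import numpy as np

def partition_edges(src, dst, interval):
    """Bucket edges into a 2D grid of chunks keyed by
    (source vertex interval, destination vertex interval)."""
    chunks = {}
    for s, d in zip(src, dst):
        key = (s // interval, d // interval)
        chunks.setdefault(key, []).append((s, d))
    # Within each chunk, cluster edges by destination id (CSC-style).
    for key in chunks:
        chunks[key].sort(key=lambda e: e[1])
    return chunks

src = np.array([0, 1, 5, 6, 2, 7])
dst = np.array([4, 6, 1, 2, 7, 0])
grid = partition_edges(src, dst, interval=4)   # 8 vertices, 2 intervals
```

With interval size 4, chunk (0, 1) holds the edges whose sources fall in [0, 4) and destinations in [4, 8), already ordered by destination id.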

By splitting data into chunks, NGra is able to generate a dataflow graph whose operators operate on data chunks that fit in GPU memory. In the dataflow graph, the outputs of the operators in the Scatter stage form a grid of edge data chunks whose edge data is the tuple [src, dest, data] (see Section 2.1). Each edge data chunk is then processed by the operators in the ApplyEdge stage to produce another grid of edge data chunks with edge data acc (as in Figure 2). The operators in the Gather stage then accumulate these edge data chunks based on their destination vertices to generate the corresponding vertex data chunks as inputs for the ApplyVertex stage that follows.

A naive scheduling strategy for this dataflow graph is to execute the operators stage-by-stage. However, since the size of the output data chunk grids of a stage can be large and may not be accommodated in the GPU device memory, they will be swapped out to host memory after the completion of the current stage and before the beginning of the next, thereby losing the opportunity for the operators in the next stage to reuse the output data chunks that already reside in GPU device memory in the current stage. For example, when the Gather stage outputs a vertex chunk with the accumulated vertex data, the scheduler may choose to schedule the operators in the same Gather stage to produce a next vertex chunk, or, it may immediately schedule the operators in the following ApplyVertex stage to consume the current vertex chunk since it is already in the GPU device memory.

NGra therefore adopts a scheduling strategy for every GNN layer as illustrated in Figure 4. Each output vertex chunk of the Gather stage, which maintains the accumulated values for the destination vertices in that chunk, may be used as both input and output by multiple operators in the same Gather stage, one for each source vertex chunk. The scheduler tries to hold it in GPU device memory and keep reusing it for those Gather operators until it is consumed by the operators in the following ApplyVertex stage. As in Figure 4, NGra continuously executes the operators in the Scatter, ApplyEdge, and Gather (S-A-G) stages, repeating this pattern across the source vertex chunks to produce the final accumulated vertex chunk, followed by the corresponding operators in the ApplyVertex stage that consume it. NGra then schedules the operators that produce the next output vertex chunk of the Gather stage.
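The resulting schedule amounts to a loop nest: for each destination chunk, stream all S-A-G sub-graphs that feed its accumulator, then run ApplyVertex, rather than finishing each stage globally. A schematic of the operator issue order this implies (the tuple encoding is illustrative):

```python
def chunk_schedule(num_chunks):
    """Emit the operator issue order the chunk-reuse scheduling
    implies: keep one Gather output chunk resident in GPU memory,
    reuse it across all source chunks, then consume it with
    ApplyVertex before moving to the next destination chunk."""
    order = []
    for i in range(num_chunks):          # destination vertex chunk i
        for j in range(num_chunks):      # source vertex chunk j
            order.append(("SAG", j, i))  # Scatter-ApplyEdge-Gather
        order.append(("ApplyVertex", i))
    return order

plan = chunk_schedule(3)
```

Each `("ApplyVertex", i)` appears immediately after the last S-A-G step for destination chunk `i`, so the accumulated chunk never round-trips through host memory between stages.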

NGra employs explicit device-to-host (D2H) and host-to-device (H2D) operators to conduct the data swap between host and GPU device memory. During a training process, the intermediate feature data (e.g., the result of matrix multiplication in the ApplyEdge stage as in Figure 4) relevant to vertex or edge chunks may be used in back-propagation. They may be swapped out to host memory during feed-forward computation and swapped back in during back-propagation.

Employing explicit data movement operators enables streaming-based scheduling, to overlap data swap with other computations in operators that are independent of the data transferred. For example, at the beginning of the execution of each GNN layer, the operators in the Scatter stage can overlap with their corresponding H2D operators; i.e., the scatter operation on the current vertex chunk can overlap with the operator that loads the next vertex chunk from host memory to GPU device memory.

Figure 5: Optimized layer-wise dataflow graph in G-GCN.
Figure 6: Propagation kernels.

3.2 Dataflow Graph Optimization

NGra further optimizes a generated dataflow graph to remove redundant computations and fuse operations by considering the semantics of the SAGA-NN model. Consider the matrix multiplication operations in the ApplyEdge stage in Figure 4: they perform computations between the vertex data scattered to the edges and a learnable parameter (W_H or W_C) shared by all edges. Because a vertex may have multiple adjacent edges to which its data is scattered, the same multiplication for a vertex can be conducted multiple times, which is redundant. NGra therefore moves computations that relate only to source or destination vertices out of the ApplyEdge stage of the current layer and into the ApplyVertex stage of the previous layer. Figure 5 shows the optimized dataflow graph with the matrix multiplications moved into the ApplyVertex stage.
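The hoisting is correct because scattering then multiplying equals multiplying then scattering: a NumPy check, where the per-edge version performs one matrix product per edge (redundant for repeated sources) and the hoisted version performs one per vertex. Names and sizes here are illustrative.

```python
import numpy as np

np.random.seed(0)
src = np.array([0, 0, 1, 2])   # vertex 0 appears on two edges
H = np.random.rand(3, 4)       # 3 vertices, feature dim 4
W_H = np.random.rand(4, 4)     # shared learnable parameter

# ApplyEdge version: one matrix product per edge (4 products),
# so the product for vertex 0 is computed twice.
per_edge = H[src] @ W_H

# Hoisted version: one matrix product per vertex (3 products),
# then scatter the precomputed results to the edges.
hoisted = (H @ W_H)[src]
```

Both produce identical edge tensors, so the optimization changes cost, not semantics; the saving grows with the average vertex degree.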

NGra supports operator fusion as another optimization. In some GNN algorithms, the ApplyEdge function performs no complex neural network computation, only element-wise operations such as +, −, ×, tanh, sigmoid, and ReLU. In this case, the Scatter, ApplyEdge, and Gather stages can be reduced to a single propagation procedure using a special customized operator. NGra automatically detects this case and replaces the sub-graph of these three stages with the special operator. As shown in Figure 5, for G-GCN, after the matrix multiplications are moved to the ApplyVertex stage, the SAG phase for each chunk contains only element-wise operations (e.g., + and sigmoid), so the whole SAG phase is replaced with a fused-gather operation.

3.3 Propagation Kernels on GPUs

Scatter and Gather are the two key stages handling the propagation procedure over the often sparse edge structure of a graph. To support both stages efficiently on GPUs, NGra provides customized scatter/gather operation kernels optimized for GPU execution. The design carefully considers the data structure layout to let the kernels better leverage the GPU memory hierarchy (i.e., shared memory and registers) and massive parallelism. A sparse graph structure often leads to random access on vertex data; however, unlike in traditional graph processing scenarios, in most GNN algorithms the data of each vertex is a dense vector rather than a scalar. We therefore prefer to exploit GPU parallelism in per-vertex data access with a best effort.

Scatter Kernel.   The scatter operator passes vertex feature data, from both source and destination, to the edges. In feed-forward computation, edges are arranged in a CSC format. We therefore organize the incoming edges of a single destination vertex as a group and assign a thread block to process them. For a vertex with a large in-degree, we split the edge set into consecutive subgroups for multiple thread blocks. In each thread block, the process of scattering source vertex data consists of two phases, as illustrated in Figure 6(a). First, the threads fetch the source vertex ids from the edges in the edge group, stored in contiguous address space, and cache them in shared memory. Second, the threads copy the vertex data to the edge data storage in parallel along the dimensions of the vertex feature vector. This way, in most cases, an instruction in a thread-warp has highly efficient coalesced memory access. The destination vertex data can be accessed in a similar way; because destination data is shared by multiple edges, it can be cached in registers after being accessed once. In back-propagation, a similar process is conducted on the CSR edge format.

Gather Kernel.   The gather operator collects the feature vectors on the edges into their destination vertices and reduces them into a single vector per destination vertex through a user-chosen accumulate function. We employ a similar principle of exploiting parallelism as for the scatter operator. The process is illustrated in Figure 6(b). A block of threads cooperatively enumerates the edge group, accumulates the features of each edge into a temporary vector in registers, and then writes the result back to the corresponding destination vertex.
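Semantically, the gather kernel is a segmented reduction over per-destination edge groups, with the feature dimension processed in parallel. A NumPy emulation of that reduction (it models the semantics only, not the thread-block mapping; the function name and accumulator keywords are illustrative):

```python
import numpy as np

def gather(edge_feats, dst, num_vertices, accumulator="sum"):
    """Reduce per-edge feature vectors into one vector per
    destination vertex, mimicking the gather operator's semantics."""
    dim = edge_feats.shape[1]
    if accumulator == "sum":
        out = np.zeros((num_vertices, dim))
        np.add.at(out, dst, edge_feats)       # unbuffered scatter-add
    elif accumulator == "max":
        out = np.full((num_vertices, dim), -np.inf)
        np.maximum.at(out, dst, edge_feats)   # element-wise running max
    else:
        raise ValueError(accumulator)
    return out

feats = np.array([[1., 2.], [3., 4.], [5., 6.]])
dst = np.array([0, 0, 2])                      # two edges land on vertex 0
summed = gather(feats, dst, 3)
```

Swapping the accumulator keyword selects the reduction, mirroring how Gather.accumulator selects sum or max in the SAGA-NN programs.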

4 Parallel Processing with Multiple GPUs

NGra further exploits the parallelism of multiple GPUs in a single server. Figure 7 shows the interconnection architecture of a typical 8-GPU server, where GPUs are connected to CPU/DRAM (host memory) via a multi-level PCI-Express (PCIe) interface hierarchy.

4.1 Multi-GPU Architecture

In a multi-GPU system, the GPU interconnect has a hierarchical structure, leading to multiple levels of locality when transferring data across different GPUs or between host and GPU device memory. For example, as shown in Figure 7, the communication between GPU 0 and GPU 1 achieves the highest performance as they are attached to the same PCIe switch. GPU 0 needs to go through two PCIe switches and one PCIe host bridge when communicating with a device like GPU 2 or GPU 3, therefore experiencing longer latency and lower throughput. GPU 0 cannot conduct P2P access when communicating with GPUs located in other NUMA nodes (e.g., GPU 4 or 7), and hence performs even worse.

In addition, the upper-level link bandwidth is shared by the GPU devices under the same root. For example, GPU 0 and GPU 1 share the communication bandwidth between the PCIe host bridge and the PCIe switch at which they are rooted. As a result, their communications with GPU devices under other PCIe switches may interfere with each other when conducted concurrently, leaving their local PCIe bandwidth under-utilized. This makes the root-level links more likely to become the bottleneck in parallel computation. We therefore carefully design the streaming scheme around the locality characteristics of the multi-GPU system to achieve higher parallelism and better performance.

Figure 7: Multi-GPU architecture

4.2 Ring-Based Parallel Streaming

Consider a layer of GNN computation: a vanilla solution to exploit multi-GPU parallelism is to run each dataflow sub-graph that produces one output vertex chunk of the ApplyVertex stage (as shown in Figure 4) on a single GPU, and let multiple GPUs execute different sets of such sub-graphs. Each such sub-graph outputs one vertex chunk but takes the set of vertex chunks containing all vertices as input. When these sub-graphs run on multiple GPUs at the same time, the same set of vertex chunks is loaded from host memory into the device memory of each GPU. This makes data transfer bottlenecked at the root-level PCIe links shared by all the GPUs. We therefore design a ring-based streaming scheme that allows GPUs to reuse the data chunks already loaded from host memory by exchanging them directly. We organize the data-sharing path among the multiple GPUs as a ring, illustrated as the red dotted circle in Figure 7. Because both PCIe and QPI links support duplex [39, 16], the simultaneous data transfers on the ring do not interfere with each other on bandwidth. With this scheme, each vertex chunk is loaded from host memory once to enter the ring, and is then passed to all the GPUs on the ring in order.

Figure 8: Ring-based streaming in a 4-GPU setting.

Scheduling in Ring-Based Streaming.   In ring-based streaming, a GPU needs to take the following two actions: 1) loading a data chunk from the host memory or from the device memory of the previous GPU in the ring; 2) performing local computations. In order to enable the overlap between the two actions, NGra introduces an explicit device-to-device (D2D) operator and employs a coordinated scheduling as illustrated in Figure 8.

In step 1, only GPU 0 and GPU 2 load vertex data chunks 1 and 3 from the host memory, respectively. After loading, in step 2, GPU 0 and GPU 2 start computations based on chunks 1 and 3 while loading chunks 2 and 4, respectively. At the same time, GPU 1 and GPU 3 start fetching chunks 1 and 3 from GPU 0 and GPU 2. In steps 3 and 4, they enter the whole-ring forward mode, where no data is loaded from the host memory. In step 5, unable to get input vertex data from the ring any more, GPU 0 and GPU 2 read data chunks 5 and 7, respectively, from the host memory. Additionally, GPU 1 and GPU 3 drop data chunks 3 and 1 after processing them locally, because those chunks have already been consumed by all the GPUs in the ring. All the GPUs in step 6 take the same actions as in step 5, except that the data chunk numbers are shifted, and the process enters the whole-ring forward mode again after that step. The whole process continues in such a pipelined fashion until all vertex data chunks have been loaded and processed.

This scheme leaves GPU 1 and GPU 3 idle in step 1 because it already makes full use of the system bandwidth; more concurrent loading would make things worse. For example, if all the GPUs loaded different vertex data chunks from the host memory simultaneously, throughput would be limited by root links such as the upper link of the PCIe switch in Figure 7. In this case, GPU 0 and GPU 1 would each get only half the bandwidth due to the shared upper link. If a single GPU needs time T to transfer a data chunk into GPU device memory, two concurrent transfers take 2T before computation can start. Such longer transfer latency would occur each time the ring loads new data chunks from the host memory, leading to extra stalls in the entire processing pipeline.
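The key invariant of the scheme can be checked with a small simulation: chunks enter the ring only at the designated loader GPUs (0 and 2 in the 4-GPU example) and are forwarded GPU-to-GPU, yet every GPU still sees every chunk exactly once. This model tracks chunk hand-offs only, not timing or bandwidth, and the chunk-to-loader assignment is a simplified round-robin rather than the exact step schedule above.

```python
def ring_stream(chunks, ring, loaders):
    """Simulate chunk hand-offs on a GPU ring: each chunk enters at
    a loader GPU (one host-to-device transfer), then is forwarded in
    ring order until every GPU has processed it once."""
    processed = {gpu: [] for gpu in ring}
    host_loads = []
    for i, chunk in enumerate(chunks):
        entry = loaders[i % len(loaders)]   # round-robin over loaders
        host_loads.append((chunk, entry))   # the only host transfer
        start = ring.index(entry)
        for hop in range(len(ring)):        # device-to-device forwarding
            processed[ring[(start + hop) % len(ring)]].append(chunk)
    return processed, host_loads

procd, loads = ring_stream(chunks=[1, 2, 3, 4],
                           ring=[0, 1, 2, 3], loaders=[0, 2])
```

Each chunk costs one host-to-device transfer instead of one per GPU, which is exactly the saving over the vanilla scheme.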

In addition, analyzing the multi-GPU architecture helps decide which GPU(s) should perform data loading in ring-based streaming. With host memory as the root, GPUs as leaves, and data links (with their bandwidths) as edges, we can build a bandwidth tree, from which we obtain a maximal fat-tree by gradually removing GPUs that would share upper-link bandwidth with other GPUs. On top of Figure 8, we get a maximal bandwidth fat-tree (solid red line) from the original bandwidth tree (dashed green line). Data loading from host memory happens only on the GPUs that belong to this maximal fat-tree.

def ApplyEdge(edge, p):
    return edge.src                        // no computation on edges
set Gather.accumulator = sum
accum = Gather(acc)
def ApplyVertex(vertex, accum, p):         // p = [W_H, W_C]
    // compute new hidden state
    return σ(p.W_H ⊗ vertex + p.W_C ⊗ accum)
return vertex^{ℓ+1}

Figure 9: Communication Neural Net (CommNet)
def ApplyEdge(edge, p):
    return edge.src × edge.data            // weight by static edge weight
set Gather.accumulator = sum
accum = Gather(acc)
def ApplyVertex(vertex, accum, p):         // p = [W]
    // compute new vertex data
    return ReLU(p.W ⊗ accum)
return vertex^{ℓ+1}

Figure 10: Graph Convolutional Network (GCN)
Layer(vertex^l):
  params: W^l, W^l_pool, b^l
  edge^l = Scatter(vertex^l)
  // ApplyEdge: fully-connected layer on the source vertex feature
  acc = sigmoid(edge^l.src.data x W^l_pool + b^l)
  set Gather.accumulator = max
  accum = Gather(acc)
  // compute new vertex data
  vertex^{l+1} = sigmoid(accum x W^l)
  return vertex^{l+1}
Figure 11: Max-Pooling GCN (MP-GCN)
Layer(vertex^l):
  params: W^l_label (one per edge label)
  edge^l = Scatter(vertex^l)
  // ApplyEdge: per-edge-label linear transform
  acc = edge^l.src.data x W^l_label[edge.label]
  set Gather.accumulator = sum
  accum = Gather(acc)
  // compute new vertex data using GRU
  vertex^{l+1} = GRU(vertex^l, accum)
  return vertex^{l+1}
Figure 12: Gated Graph Neural Network (GG-NN)

5 Applications

NGra can support many different types of graph-based neural networks [4, 5, 9, 15, 13, 22, 25, 29, 38]. In this section, we present the layer-wise programs of several representative GNN models from the literature.

Communication neural net (CommNet)  [38] is a model where cooperating agents learn to communicate among themselves before taking actions. Each agent is controlled by a deep feed-forward network, which additionally receives the summed transmissions of the other connected agents. This network can be used to solve multiple learning-to-communicate tasks such as traffic control. In CommNet, there is no computation on the edge, so the ApplyEdge stage is simply a passthrough (see Figure 9). Each agent gets the summed transmissions via the Gather stage and generates its new hidden state in the ApplyVertex stage.

Graph convolutional networks (GCN)  [15, 22] generalize the notion of convolution to operations on an arbitrary graph. The algorithm has been applied to many semi-supervised or unsupervised graph clustering problems, such as entity classification in a knowledge graph. In GCN, there is computation (without neural networks) on the edge for weighting neighbor activations. In the GCN program (see Figure 10), the Scatter operation feeds vertex features into the ApplyEdge function, which multiplies them by a static edge weight determined by the vertex degrees. The Gather stage returns the weighted sum of the neighbors' activations, to which the ApplyVertex stage applies a fully-connected layer.
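One GCN layer written stage-by-stage in the SAGA-NN style can be sketched in NumPy as below; the shapes, the edge list encoding, and the ReLU choice are illustrative assumptions, not NGra's implementation:

```python
import numpy as np

# A GCN layer as explicit Scatter / ApplyEdge / Gather / ApplyVertex stages.
# edges: list of (src, dst); weights: one static edge weight per edge.

def gcn_layer(H, edges, weights, W):
    # Scatter: feed source-vertex features onto each edge.
    scattered = [H[src] for src, dst in edges]
    # ApplyEdge: multiply by the static, degree-derived edge weight.
    acc = [w * h for w, h in zip(weights, scattered)]
    # Gather: sum weighted neighbor activations per destination vertex.
    accum = np.zeros_like(H)
    for (src, dst), a in zip(edges, acc):
        accum[dst] += a
    # ApplyVertex: a fully-connected layer on the aggregated features.
    return np.maximum(accum @ W, 0.0)
```

The edge-by-edge loop makes the four stages visible; the whole S-A-G phase is mathematically a weighted sparse-matrix product against the feature matrix, which is what NGra's fused propagation kernels exploit.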

Max-pooling GCN (MP-GCN)  [13] applies a max-pooling aggregator to the computed neighbor features, which can effectively capture different aspects of the neighborhood set. In MP-GCN, there is NN-based computation on the edge, with max aggregation instead of mean. In its program (see Figure 11), the Scatter stage passes the source vertex feature vector into the fully-connected neural network defined in the ApplyEdge function. The Gather stage then returns the element-wise max over the neighbors' features, to which the ApplyVertex stage applies a fully-connected layer.

Gated GCN (G-GCN)  [4, 29] is our running example, which incorporates a gating mechanism into GCN so that the model can learn which edges are more important for the learning target. Its computation pattern differs from that of max-pooling GCN in that the NN-based computation on edges requires the feature vectors of both the source and destination vertices. The Scatter stage must therefore propagate the feature data of both endpoints onto the edge (see Figure 2).
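The distinguishing point above, that ApplyEdge needs both endpoints, can be sketched as follows. The exact gating form (a sigmoid of a linear function of both features, modulating the source activation) follows the usual gated-GCN formulation and is an assumption about the running example:

```python
import numpy as np

# G-GCN edge computation: Scatter places BOTH endpoint features on the
# edge so the gate can be data-dependent. Wg, Ug are learnable matrices.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_edges(H, edges, Wg, Ug):
    out = []
    for src, dst in edges:
        # ApplyEdge: gate computed from source AND destination features...
        gate = sigmoid(H[src] @ Wg + H[dst] @ Ug)
        # ...modulates the source activation flowing along the edge.
        out.append(gate * H[src])
    return out
```

Contrast this with GCN or MP-GCN above, where ApplyEdge reads only the source feature; the two-sided dependency is what forces Scatter to move twice the vertex data per edge.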

Gated graph neural networks (GG-NN)  [25] apply recurrent neural networks (RNNs) to walks on a graph structure and unroll the recurrence for a fixed number of steps. This model has been used for NLP tasks and also in quantum chemistry for fast estimation of organic molecule properties. GG-NN has NN-based edge computation, but uses different parameters for different edge labels (the model assumes discrete edge types). GG-NN also has dense computation on vertices, where the ApplyVertex function is a Gated Recurrent Unit (GRU). In the GG-NN program (see Figure 12), edges with different labels thus use different parameters in the ApplyEdge function.

6 Evaluation

We implement NGra on top of TensorFlow (v1.7) with about 2,900 lines of C++ code and 3,000 lines of Python code. NGra extends TensorFlow with a front-end to transform SAGA-NN programs into a chunk-granularity dataflow graph, several (fused) scatter/gather operators for efficient graph propagation, and a ring-based streaming scheduling scheme.

In this section, we present detailed evaluation results that demonstrate the efficiency and scalability of NGra, and compare it with a state-of-the-art system, TensorFlow.

Experimental setup.   We evaluate NGra on a multi-GPU server equipped with dual 2.6 GHz Intel Xeon E5-2690v4 processors (28 cores in total), 512 GB of memory, and 8 NVIDIA Tesla P100 GPUs. The installed operating system is Ubuntu 16.04, with CUDA 8.0 and cuDNN 6.

Table 1 lists the datasets used for evaluation: the Pubmed citation network [36], protein-protein interaction graphs [18], the BlogCatalog social network [24], the Reddit online discussion forum [13], and Wikidata [30]. The feature column in Table 1 gives the size of the vertex feature vector, and the label column gives the number of label classes. We test the performance of our system on the task of vertex classification, e.g., classifying academic papers into different subjects in the Pubmed citation dataset, which contains a sparse bag-of-words feature vector for each document and a list of citation links between documents.

Our evaluation uses the 5 popular GNN applications introduced in Section 5. Note that only CommNet, GCN, and GG-NN can be directly supported by existing TensorFlow operators, because they involve no or only simple computation on the edge, in which case the propagation can be treated as a sparse matrix multiplication. We fix the number of layers in each GNN. All performance numbers in our experiments are averaged over 10 epochs.

Dataset vertex# edge# feature label
pubmed 19.7K 108.4K 500 3
protein 43.5K 205.6K 29 3
BlogCatalog 10.3K 668.0K 128 39
reddit_small 46.6K 1.4M 602 41
reddit_middle 233.0K 23.2M 602 41
reddit_full 2.2M 571.0M 300 50
enwiki 3.2M 222.1M 300 12
Table 1: Datasets (K: thousand, M: million).

6.1 Efficient Graph Propagation

NGra’s efficient Scatter and Gather kernels play an important role in handling the sparsity of graph propagation.

Micro-benchmark on Synthetic Data.   To evaluate the performance of our propagation kernels, we compare NGra with TensorFlow and cuSPARSE [1] on a simple sparse-dense matrix multiplication workload, which can be implemented in the SAGA model by simply specifying the ApplyEdge phase as a multiplication with the edge feature. For TensorFlow, we directly use its sparse_tensor_dense_matmul operator. The inputs of the computation are a sparse matrix with density (i.e., the percentage of non-zero values) varying from 0.01% to 10%, and a dense matrix. The performance results are shown in Figure 13. Our propagation kernels achieve significant speedups over TensorFlow, and improve performance even over cuSPARSE. These performance gaps are mainly due to our careful kernel optimization and a GPU threading model designed specifically for GNN scenarios, where the vertex data is often a feature vector.
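The equivalence the micro-benchmark relies on can be made concrete: a sparse-dense matmul is exactly a scatter-multiply-gather over the nonzeros, with the ApplyEdge phase multiplying by the edge value and the Gather phase summing. A reference sketch (not NGra's kernel):

```python
import numpy as np

# Sparse x dense matmul expressed edge-by-edge in SAGA style.
# nnz: list of (row, col, value) entries of the sparse matrix A.

def saga_spmm(nnz, num_rows, B):
    """Compute A @ B where A is given by its nonzeros."""
    out = np.zeros((num_rows, B.shape[1]))
    for r, c, v in nnz:
        # ApplyEdge: scale the gathered row of B by the edge value;
        # Gather (sum): accumulate into the destination row.
        out[r] += v * B[c]
    return out
```

Each nonzero touches a whole row of B, i.e., a dense feature vector per "vertex"; this is the access pattern the optimized kernels parallelize across the feature dimension.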

Figure 13: Propagation kernel time of TensorFlow(TF), cuSPARSE, and NGra(NG) on graphs with different density.

Real Applications on Small Data.   To compare with TensorFlow on real applications, the whole graph data must fit in device memory, as TensorFlow cannot support graphs larger than a single GPU's memory. We therefore use the first 4 small datasets in Table 1 to run all 3 applications that TensorFlow supports. Table 2 lists the comparison results. Overall, NGra outperforms TensorFlow on every application and dataset, with improvements of at least 7.7%. Among the 3 applications, GCN sees the largest average improvement, mainly because graph propagation accounts for a larger share of GCN's computation cycles. The improvements also depend on the dataset. For example, the blog graph is denser, which leads to higher graph propagation overhead that NGra's kernels can reduce; the average improvement over all applications is therefore largest on the blog dataset, while the averages on the remaining datasets are lower, e.g., 35.7% and 23.4%.

dataset pubmed protein blog reddit-small
NG TF NG TF NG TF NG TF
GCN 8.2 13.6 14.8 20.7 8.4 32.5 44.2 113.3
CommNet 14.2 18.6 27.4 33.5 10.6 35.8 62.4 132.4
GG-NN 37.6 41.9 77.7 83.7 23.6 49.4 127.3 195.3
Table 2: Iteration time (ms) comparison with TensorFlow.
Figure 14: Streaming scheduling strategies comparison on different applications. (Data: reddit_middle)

6.2 Scaling-up on a Single GPU

Figure 15: Scaling up performance of NGra on different applications.
Figure 16: Speed up of NGra with different applications on large graphs.

NGra uses the chunk-based streaming mechanism to support graphs that do not fit in GPU memory. We first evaluate different scheduling strategies in this mechanism, as introduced in Section 3, and then demonstrate NGra’s performance on real applications through comparisons with NGra’s baseline versions.

Streaming Scheduling Strategy.   The scheduling strategy of a chunk-based dataflow graph heavily affects overall performance, as it determines the amount of data swapping introduced in Section 3.1. We demonstrate the benefit of our strategy by comparing with two alternatives: stage-based and dest-order scheduling. In the stage-based strategy, the Scatter, ApplyEdge, and Gather (S-A-G) stages are composed as one stage and ApplyVertex as another, and the two stages are executed one after the other; this introduces one data swap between the two stages. The dest-order strategy prefers to schedule operators in the Scatter stage along the direction in which the destination vertex chunk changes; in this case, for each source vertex chunk, the accum data needs to be swapped in and out once. Figure 14 shows the comparison results for the 5 applications on the reddit_middle dataset. Compared to the stage-based strategy, NGra's scheduling improves performance by 24.9% to 35.1% across the applications; compared to the dest-order strategy, by 60.1% to 93.1%. These results demonstrate the benefit of avoiding data swapping and the importance of NGra's scheduling.
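A toy model makes the asymptotic difference between the three strategies explicit. Counting only swaps of accum chunks on an n x n chunk grid (an illustrative model, not measured behavior):

```python
# Accum-chunk swap counts for the three scheduling strategies
# on an n x n grid of edge chunks.

def swaps(strategy, n):
    if strategy == "ngra":
        # The accum chunk stays resident across all n source chunks,
        # then is consumed in place by ApplyVertex: no swap.
        return 0
    if strategy == "stage":
        # All accum chunks are produced before ApplyVertex runs, so each
        # of the n chunks is swapped out once and back in once.
        return 2 * n
    if strategy == "dest-order":
        # The destination changes fastest: each accum chunk is swapped
        # in and out once per source chunk.
        return 2 * n * n
    raise ValueError(strategy)
```

The model predicts the observed ordering: NGra's strategy beats stage-based, which in turn beats dest-order, and the gap to dest-order grows with the number of chunks.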

Benefit of Streaming on Real Applications.   To evaluate the performance gain of streaming in NGra, we implement a baseline version of NGra, denoted NG-base, by disabling streaming and the optimized graph propagation. NG-base can still handle large graphs by partitioning them into chunks and processing the chunks sequentially. We also implement a version that uses only chunk-based streaming, denoted NG-stream, which overlaps data transmission with computation. We compare the end-to-end performance of NG-base, NG-stream, and NGra (NG) on 3 applications in Figure 15, constructing datasets of different scales by duplicating the reddit_small dataset 1, 4, 9, and 16 times. Compared to NG-base, NG-stream improves performance by 33.2%, 29.3%, and 23.0% for the 3 applications, respectively. By further using the optimized graph propagation, NGra speeds up these applications even more over NG-base.

6.3 Scaling-out on Multiple GPUs

NGra scales GNN computation to multiple GPUs with ring-based parallel streaming. We compare this mechanism with a baseline without the ring-based strategy, denoted non-ring. Figure 16 shows the comparison results for the 5 applications on two large datasets, enwiki and reddit_full. Note that the ring-based mechanism only takes effect with multiple GPUs, so the 1-GPU data points of the two mechanisms are identical. The results clearly show the benefit of ring-based streaming: when scaling the computation from 1 GPU to 2 GPUs, the non-ring mechanism achieves only a limited average speedup, while our ring-based one scales nearly linearly. This is mainly because, without the ring-based design, each of the two GPUs loads input data through shared PCIe links concurrently, which easily becomes the bottleneck of the system. The ring-based mechanism achieves near-linear speedup because the second GPU can load data directly from the first one, avoiding pressure on the shared upper PCIe links.

From Figure 16, we also observe near-linear scalability for the ring-based mechanism until the computation crosses NUMA nodes. Because the current TensorFlow implementation can hardly support NUMA-aware tensor allocation, reading data across NUMA nodes becomes suboptimal. Our further experiments show additional speedup on average when we manually enable NUMA-aware tensor allocation. Overall, the ring-based mechanism in NGra improves performance by about 2x on average when using multiple GPUs.

7 Related Work

Much real-world data, e.g., web graphs, social networks, and knowledge graphs, is naturally organized as graph structures from which tremendously valuable information can be extracted. A large number of graph processing systems have been proposed and developed to analyze such data through iterative, propagation-based computation. Pregel [28] first proposed the vertex-program model. It was extended by subsequent work such as GraphLab [26] and PowerGraph [11], which propose the GAS model to exploit more parallelism in edge-related operations. The GAS model is also widely adopted and further extended by a line of follow-up work that optimizes different aspects, including graph layout, sequential data access, and secondary storage (e.g., GraphChi [23], Grace [33], X-Stream [35], Chaos [34], and FlashGraph [45]), distributed shared memory and RDMA (e.g., Grappa [31] and GraM [40]), NUMA-awareness, scheduling, and load balancing (e.g., Galois [32], Mizan [19], and Polymer [44]), and graph partitioning (e.g., PowerLyra [6] and BiGraph [7]). All of this work concentrates on computation conducted on CPUs.

Another line of graph systems focuses on exploiting the computation power of GPUs for graph processing. Medusa [46] provides simple programming abstractions for GPU-based graph processing and automatically conducts parallel execution on multiple GPUs. CuSha [20] mainly explores new graph representations to allow faster graph processing. Neither can process graphs exceeding the GPU memory capacity. Totem [10] statically partitions the graph between GPU and host memory to balance their computation loads, which may not be achievable, especially for large graphs, since the ratio of memory capacity to computation power differs between GPU and CPU. GraphReduce [37] can process out-of-memory graphs on a single GPU; it optimizes memory coalescing by using two different formats, but the benefit can easily be cancelled by the redundant data transfers. GTS [21] can also process out-of-memory graphs on multiple GPUs, but it has no mechanism to avoid redundant vertex data loads from host to device memory in the multi-GPU case. Garaph [27] exploits edge-centric parallelism and dynamic scheduling to achieve the best performance on hybrid CPU/GPU platforms. Lux [17] investigates the placement of graph data over the memory hierarchies of CPUs across multiple nodes. Graphie [14] proposes methods to address the challenges that arise when the set of active vertices changes throughout the execution. All these systems focus only on supporting traditional graph algorithms such as PageRank, connected components, and shortest paths.

TuX2 [41] pioneered the effort to study the gap between graph computation and traditional machine learning computation, while NGra moves further to connect graph processing with deep learning, which is well supported by dataflow frameworks such as TensorFlow [3], PyTorch [2], MXNet [8], and CNTK [42]. A similar past effort is GraphX [12], a graph system built over a general dataflow engine, Spark [43], whose target is instead to connect graph processing with batch-like map-reduce workloads in a single workflow pipeline.

8 Conclusion

GNNs represent an emerging computation model that arises naturally from the need to apply neural network models on large graphs. Supporting efficient and scalable parallel computation for GNN training is demanding due to its inherent complexity. NGra is the first to target GNNs, with a new programming abstraction, which is then mapped and optimized as dataflows to execute efficiently on GPUs.

References

  • [1] cuSPARSE, Retrieved September, 2018. https://developer.nvidia.com/cusparse.
  • [2] PyTorch, Retrieved September, 2018. http://pytorch.org.
  • [3] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (GA, 2016), USENIX Association, pp. 265–283.
  • [4] Bresson, X., and Laurent, T. Residual gated graph convnets. In International Conference on Learning Representations (ICLR) (2018).
  • [5] Bui, T. D., Ravi, S., and Ramavajjala, V. Neural graph learning: Training neural networks using graphs. In Proceedings of 11th ACM International Conference on Web Search and Data Mining (WSDM) (2018).
  • [6] Chen, R., Shi, J., Chen, Y., and Chen, H. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. In Proceedings of the Tenth European Conference on Computer Systems (2015), EuroSys’15, ACM.
  • [7] Chen, R., Shi, J., Zang, B., and Guan, H. Bipartite-oriented distributed graph partitioning for big learning. In Proceedings of 5th Asia-Pacific Workshop on Systems (2014), APSys’14, ACM.
  • [8] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS Workshop on Machine Learning Systems (LearningSys) (2016).
  • [9] Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (2016), pp. 3844–3852.
  • [10] Gharaibeh, A., Beltrão Costa, L., Santos-Neto, E., and Ripeanu, M. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (New York, NY, USA, 2012), PACT ’12, ACM, pp. 345–354.
  • [11] Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI (2012), vol. 12, p. 2.
  • [12] Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., and Stoica, I. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (2014), OSDI’14, USENIX.
  • [13] Hamilton, W. L., Ying, R., and Leskovec, J. Inductive representation learning on large graphs. In NIPS (2017).
  • [14] Han, W., Mawhirter, D., Wu, B., and Buland, M. Graphie: Large-scale asynchronous graph traversals on just a gpu. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (2017), PACT’17.
  • [15] Henaff, M., Bruna, J., and LeCun, Y. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015).
  • [16] Intel Corp. An introduction to the intel quickpath interconnect, 2009. https://www.intel.com/content/www/us/en/io/quickpath-technology/quick-path-interconnect-introduction-paper.html.
  • [17] Jia, Z., Kwon, Y., Shipman, G., McCormick, P., Erez, M., and Aiken, A. A distributed multi-gpu system for fast graph processing. Proc. VLDB Endow. 11, 3 (Nov. 2017), 297–310.
  • [18] Kersting, K., Kriege, N. M., Morris, C., Mutzel, P., and Neumann, M. Benchmark data sets for graph kernels, 2016.
  • [19] Khayyat, Z., Awara, K., Alonazi, A., Jamjoom, H., Williams, D., and Kalnis, P. Mizan: A system for dynamic load balancing in large-scale graph processing. In Proceedings of the 8th ACM European Conference on Computer Systems (2013), EuroSys’13, ACM.
  • [20] Khorasani, F., Vora, K., Gupta, R., and Bhuyan, L. N. Cusha: Vertex-centric graph processing on gpus. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing (New York, NY, USA, 2014), HPDC ’14, ACM, pp. 239–252.
  • [21] Kim, M.-S., An, K., Park, H., Seo, H., and Kim, J. Gts: A fast and scalable graph processing method based on streaming topology to gpus. In Proceedings of the 2016 International Conference on Management of Data (New York, NY, USA, 2016), SIGMOD ’16, ACM, pp. 447–461.
  • [22] Kipf, T. N., and Welling, M. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR) (2016).
  • [23] Kyrola, A., Blelloch, G., and Guestrin, C. GraphChi: Large-scale graph computation on just a PC. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (2012), OSDI’12, USENIX.
  • [24] Lei, T., and Huan, L. Relational learning via latent social dimensions. In KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (New York, NY, USA, 2009), ACM, pp. 817–826.
  • [25] Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. International Conference on Learning Representations (ICLR) (2016).
  • [26] Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., and Hellerstein, J. M. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5, 8 (Apr. 2012).
  • [27] Ma, L., Yang, Z., Chen, H., Xue, J., and Dai, Y. Garaph: Efficient gpu-accelerated graph processing on a single machine with balanced replication. In 2017 USENIX Annual Technical Conference (USENIX ATC 17) (Santa Clara, CA, 2017), USENIX Association, pp. 195–207.
  • [28] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (2010), SIGMOD’10, ACM.
  • [29] Marcheggiani, D., and Titov, I. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017), Association for Computational Linguistics, pp. 1506–1515.
  • [30] Meta. Data dumps — meta, discussion about wikimedia projects, 2018. [Online; accessed 3-May-2018].
  • [31] Nelson, J., Holt, B., Myers, B., Briggs, P., Ceze, L., Kahan, S., and Oskin, M. Latency-tolerant software distributed shared memory. In 2015 USENIX Annual Technical Conference (2015), USENIX ATC’15, USENIX.
  • [32] Nguyen, D., Lenharth, A., and Pingali, K. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), SOSP’13, ACM.
  • [33] Prabhakaran, V., Wu, M., Weng, X., McSherry, F., Zhou, L., and Haradasan, M. Managing large graphs on multi-cores with graph awareness. In 2012 USENIX Annual Technical Conference (2012), USENIX ATC’12, USENIX.
  • [34] Roy, A., Bindschaedler, L., Malicevic, J., and Zwaenepoel, W. Chaos: Scale-out graph processing from secondary storage. In Proceedings of the 25th Symposium on Operating Systems Principles (2015), SOSP’15, ACM.
  • [35] Roy, A., Mihailovic, I., and Zwaenepoel, W. X-Stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), SOSP’13, ACM.
  • [36] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine 20, 1 (2008), 61–80.
  • [37] Sengupta, D., Song, S. L., Agarwal, K., and Schwan, K. Graphreduce: Processing large-scale graphs on accelerator-based systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (New York, NY, USA, 2015), SC ’15, ACM, pp. 28:1–28:12.
  • [38] Sukhbaatar, S., Szlam, A., and Fergus, R. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems (NIPS) (2016), pp. 2244–2252.
  • [39] Wikipedia. PCI Express, 2018. https://en.wikipedia.org/wiki/PCI_Express.
  • [40] Wu, M., Yang, F., Xue, J., Xiao, W., Miao, Y., Wei, L., Lin, H., Dai, Y., and Zhou, L. GraM: Scaling graph computation to the trillions. In Proceedings of the Sixth ACM Symposium on Cloud Computing (2015), SoCC’15, ACM.
  • [41] Xiao, W., Xue, J., Miao, Y., Li, Z., Chen, C., Wu, M., Li, W., and Zhou, L. TuX2: Distributed Graph Computation for Machine Learning. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) (Boston, MA, 2017), USENIX Association, pp. 669–682.
  • [42] Yu, D., Eversole, A., Seltzer, M., Yao, K., Kuchaiev, O., Zhang, Y., Seide, F., Huang, Z., Guenter, B., Wang, H., Droppo, J., Zweig, G., Rossbach, C., Gao, J., Stolcke, A., Currey, J., Slaney, M., Chen, G., Agarwal, A., Basoglu, C., Padmilac, M., Kamenev, A., Ivanov, V., Cypher, S., Parthasarathi, H., Mitra, B., Peng, B., and Huang, X. An introduction to computational networks and the computational network toolkit. Tech. rep., October 2014.
  • [43] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (2012), NSDI’12, USENIX.
  • [44] Zhang, K., Chen, R., and Chen, H. NUMA-aware graph-structured analytics. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2015), PPoPP’15, ACM.
  • [45] Zheng, D., Mhembere, D., Burns, R., Vogelstein, J., Priebe, C. E., and Szalay, A. S. FlashGraph: Processing billion-node graphs on an array of commodity SSDs. In 13th USENIX Conference on File and Storage Technologies (2015), FAST’15, USENIX.
  • [46] Zhong, J., and He, B. Medusa: A parallel graph processing system on graphics processors. SIGMOD Rec. 43, 2 (Dec. 2014), 35–40.

3 NGra System

NGra provides a combination of the dataflow and vertex-program abstractions as its user interface. Underneath this abstraction, NGra mainly consists of 1) a front-end that translates an algorithm implemented in the SAGA-NN model into a chunk-granularity dataflow graph, enabling GNN computation on large graphs on GPUs; 2) an optimization layer that produces a scheduling strategy to minimize data movement between host and GPU device memory, and that recognizes opportunities to fuse operations and remove redundant computations; 3) a set of efficient propagation operation kernels that support streaming-based processing to overlap data movement and computation on GPUs; and 4) a dataflow execution runtime. Because NGra largely leverages existing dataflow-based deep learning frameworks for the dataflow execution runtime, this section focuses on the design of the first three, which are the main contributions of NGra.

Figure 4: Chunk-based dataflow graph for a destination interval at a G-GCN layer. The swap-out of output tensors of operations from the SAG phase when connecting to D2H is hidden in the SAG sub-graph for a clear visualization.

3.1 Chunk-Based Streaming Dataflow

When exploiting computation power of GPU, existing deep learning frameworks assume that the input and output tensor data of a single operator in the dataflow graph can be fully held in GPU device memory. However, when treating the vertex or edge data of an entire graph as a single tensor or matrix, this assumption often fails to hold for large graphs, which limits the scale of the graph that the system can support efficiently. To address this problem, NGra splits the vertex and edge data of the graph into small chunks that can fit into GPU device memory and constructs a dataflow graph with operators processing computations at the chunk granularity.

NGra splits vertex and edge data into chunks through a 2D partitioning that tiles the adjacency matrix representing the edges of the graph, with the vertex data split into corresponding equally-sized disjoint vertex-id intervals. This way, the edges in each edge chunk connect the vertices in two vertex chunks as sources and destinations, respectively: as illustrated in Figure 4, an edge chunk contains the edges whose source vertices fall in one vertex chunk and whose destination vertices fall in another. In each edge chunk, for feed-forward computation, edges are laid out in compressed sparse column (CSC) format, clustered by destination vertex id, whereas for back-propagation computation, edges are arranged in compressed sparse row (CSR) format. NGra also makes a best effort to re-encode vertex ids to equalize the numbers of edges across edge chunks for balanced chunk-granularity computation.

By splitting data into chunks, NGra is able to generate a dataflow graph with operators operating on data chunks that fit in GPU memory. In the dataflow graph, the outputs of the operators in the Scatter stage are a grid of edge data chunks with edge data the tuple [src,dest,data] (in Section 2.1). Each edge data chunk is then processed by the operators in the ApplyEdge stage to produce another grid of edge data chunks with edge data acc (as in Figure 2). The operators in the Gather stage then accumulate these edge data chunks based on the destination vertices to generate the corresponding vertex data chunks as the inputs for the ApplyVertex stage that follows.

A naive scheduling strategy for this dataflow graph is to execute the operators stage-by-stage. However, since the size of the output data chunk grids of a stage can be large and may not be accommodated in the GPU device memory, they will be swapped out to host memory after the completion of the current stage and before the beginning of the next, thereby losing the opportunity for the operators in the next stage to reuse the output data chunks that already reside in GPU device memory in the current stage. For example, when the Gather stage outputs a vertex chunk with the accumulated vertex data, the scheduler may choose to schedule the operators in the same Gather stage to produce a next vertex chunk, or, it may immediately schedule the operators in the following ApplyVertex stage to consume the current vertex chunk since it is already in the GPU device memory.

NGra therefore adopts, for every GNN layer, the scheduling strategy illustrated in Figure 4. Each output vertex chunk of the Gather stage, which maintains the accumulated values for the destination vertices in the chunk, may be read and updated by multiple operators in the same Gather stage, one for each source vertex chunk. The scheduler tries to hold it in GPU device memory and keep reusing it across those Gather operators until it is consumed by the operators of the following ApplyVertex stage. As in Figure 4, NGra continuously executes operators in the Scatter, ApplyEdge, and Gather (S-A-G) stages, repeating this pattern over the source vertex chunks to produce the final accumulated vertex chunk, followed by the corresponding operators of the ApplyVertex stage that consume it. It then schedules the operators that produce the next output vertex chunk of the Gather stage.
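The per-destination-interval loop described above can be sketched directly against the 2D edge-chunk grid. The sketch below assumes a passthrough ApplyEdge and sum Gather, and an arbitrary apply_vertex callback; it is illustrative, not NGra's scheduler:

```python
import numpy as np

# Process one destination interval j: run S-A-G over every source chunk
# while the accum chunk stays "resident", then apply ApplyVertex once.

def process_dest_interval(j, grid, H, interval_size, apply_vertex):
    feat = H.shape[1]
    accum = np.zeros((interval_size, feat))        # held in device memory
    src_intervals = sorted(i for (i, jj) in grid if jj == j)
    for i in src_intervals:                        # repeat S-A-G per source chunk
        for src, dst in grid[(i, j)]:
            # passthrough ApplyEdge + sum Gather into the resident accum
            accum[dst - j * interval_size] += H[src]
    return apply_vertex(accum)                     # consume accum immediately
```

Note that accum is allocated once, updated across all source chunks, and consumed without ever leaving "device memory" in this model, which is exactly the reuse the scheduler is designed to preserve.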

NGra employs explicit device-to-host (D2H) and host-to-device (H2D) operators to conduct the data swap between host and GPU device memory. During a training process, the intermediate feature data (e.g., the result of matrix multiplication in the ApplyEdge stage as in Figure 4) relevant to vertex or edge chunks may be used in back-propagation. They may be swapped out to host memory during feed-forward computation and swapped back in during back-propagation.

Employing explicit data movement operators enables streaming-based scheduling, to overlap data swap with other computations in operators that are independent of the data transferred. For example, at the beginning of the execution of each GNN layer, the operators in the Scatter stage can overlap with their corresponding H2D operators; i.e., the scatter operation on the current vertex chunk can overlap with the operator that loads the next vertex chunk from host memory to GPU device memory.

Figure 5: Optimized layer-wise dataflow graph in G-GCN.
Figure 6: Propagation kernels.

3.2 Dataflow Graph Optimization

NGra further optimizes a generated dataflow graph by removing redundant computations and fusing operations based on the semantics of the SAGA-NN model. Consider the matrix multiplication operations in the ApplyEdge stage as in Figure 4: they multiply the vertex data scattered to the edges by a learnable parameter matrix that is shared by all the edges. Because a vertex may have multiple adjacent edges to which its data is scattered, the same multiplication for a vertex can be conducted multiple times, which is redundant. NGra therefore moves computations that depend only on source or destination vertices out of the ApplyEdge stage of the current layer and into the ApplyVertex stage of the previous layer. Figure 5 shows the optimized dataflow graph with the matrix multiplications moved into the ApplyVertex stage.
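The equivalence that justifies this rewrite can be checked in a few lines of numpy; the weight matrix, toy graph, and shapes below are assumptions for illustration only:

```python
import numpy as np

# Why hoisting the multiplication out of ApplyEdge is safe: W @ x_u depends
# only on the source vertex u, so computing it once per vertex and then
# scattering is equivalent to recomputing it on every adjacent edge.

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
X = rng.standard_normal((3, 4))          # 3 vertices, feature dim 4
src = np.array([0, 0, 1, 2, 2])          # 5 edges; vertices 0 and 2 fan out

per_edge = np.stack([W @ X[u] for u in src])   # multiply on every edge
per_vertex = (X @ W.T)[src]                    # multiply once per vertex, then scatter

assert np.allclose(per_edge, per_vertex)
```

With fan-out greater than one, the per-vertex form performs strictly fewer matrix multiplications for the same result.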

NGra supports operator fusion as another optimization. In some GNN algorithms, the ApplyEdge function performs no complex neural network computation, only element-wise operations such as +, -, ×, tanh, sigmoid, and ReLU. In this case, the Scatter, ApplyEdge, and Gather stages can be reduced to a single propagation procedure implemented by a special customized operator. NGra automatically detects this case and replaces the subgraph of these three stages with the special operator. As shown in Figure 5, for G-GCN, after the matrix multiplications are moved to the ApplyVertex stage, the SAG phase for each chunk contains only element-wise operations (e.g., + and sigmoid), so the whole SAG phase is replaced with a fused-gather operation.
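A minimal sketch of why the fusion is valid when ApplyEdge is element-wise (identity here, for simplicity): scatter followed by sum-gather is exactly one adjacency-matrix propagation. The toy graph and dense adjacency matrix are illustrative only; a real implementation would use a sparse format:

```python
import numpy as np

# When ApplyEdge is element-wise (identity here), the whole S-A-G phase
# collapses into a single propagation operator: scatter + gather-by-sum
# equals an adjacency-matrix multiply.

X = np.arange(8, dtype=np.float32).reshape(4, 2)   # 4 vertices, feature dim 2
src = np.array([0, 1, 3]); dst = np.array([1, 2, 2])

# Unfused: three separate stages (Scatter, identity ApplyEdge, sum Gather).
accum = np.zeros_like(X)
np.add.at(accum, dst, X[src])

# Fused: one propagation operator.
A = np.zeros((4, 4), dtype=np.float32)
A[dst, src] = 1.0
assert np.allclose(accum, A @ X)
```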

3.3 Propagation Kernels on GPUs

Scatter and Gather are the two key stages that handle propagation over the often sparse edge structure of a graph. To support both stages efficiently on GPUs, NGra provides customized scatter/gather kernels optimized for GPU execution. The design carefully lays out data structures so the kernels can better leverage the GPU memory hierarchy (i.e., shared memory and registers) and massive parallelism. Although a sparse graph structure often leads to random access on vertex data, in most GNN algorithms, unlike traditional graph processing, the data of each vertex is a dense vector rather than a scalar; we therefore make a best effort to exploit GPU parallelism within per-vertex data access.

Scatter Kernel.   The scatter operator passes vertex feature data, from both source and destination vertices, to edges. In feed-forward computation, edges are arranged in CSC format. We therefore organize the incoming edges of a single destination vertex into a group and assign a thread block to process them. For a vertex with a large in-degree, we split the edge set into consecutive subgroups handled by multiple thread blocks. In each thread block, scattering source vertex data consists of two phases, as illustrated in Figure 6(a). First, the threads fetch the source vertex ids of the edge group, stored in contiguous address space, and cache them in shared memory. Second, the threads copy the vertex data to the edge data storage in parallel along the dimensions of the vertex feature vector. This way, in most cases, an instruction in a thread warp enjoys highly efficient coalesced memory access. The destination vertex data can be accessed in a similar way; because destination data is shared by multiple edges, it can be cached in registers after being accessed once. In back-propagation, a similar process is conducted on the CSR edge format.
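Functionally (ignoring the GPU threading details), the scatter kernel performs the following per-group row copies; the CSC arrays and features below are toy assumptions:

```python
import numpy as np

# Functional view of the scatter kernel: in CSC order, the incoming edges of
# each destination vertex are contiguous, so one "thread block" (loop
# iteration here) can fetch that group's source ids and copy whole feature
# rows into edge storage.

X = np.arange(12, dtype=np.float32).reshape(4, 3)  # 4 vertices, feature dim 3
col_ptr = np.array([0, 2, 3, 5])    # CSC: edge-group boundaries for dsts 0..2
src_ids = np.array([1, 3, 0, 0, 2])  # source vertex id per edge

edge_data = np.empty((len(src_ids), X.shape[1]), dtype=X.dtype)
for dst in range(len(col_ptr) - 1):            # one group per destination
    lo, hi = col_ptr[dst], col_ptr[dst + 1]
    edge_data[lo:hi] = X[src_ids[lo:hi]]       # parallel row copies on the GPU

assert np.array_equal(edge_data, X[src_ids])   # equivalent to a flat gather
```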

Gather Kernel.   The gather operator collects the feature vectors on edges into their destination vertices, reducing them into a single vector per destination vertex through a user-provided accumulate function. We employ the same principle for exploiting parallelism as in the scatter operator. The process is illustrated in Figure 6(b). A block of threads cooperatively enumerates the edge group, accumulates the features of each edge into a temporary vector held in registers, and then writes the result back to the corresponding destination vertex.
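The gather kernel is, functionally, a segmented reduction over the same CSC edge groups; this toy numpy sketch uses sum as the user-provided accumulate function:

```python
import numpy as np

# Functional view of the gather kernel: a segmented reduction of edge feature
# vectors into each destination vertex. The accumulate function is sum here;
# the edge features and CSC boundaries are toy data.

edge_feat = np.ones((5, 3), dtype=np.float32)  # 5 edges, feature dim 3
col_ptr = np.array([0, 2, 3, 5])               # edge groups for dsts 0..2

out = np.zeros((3, 3), dtype=np.float32)
for dst in range(len(col_ptr) - 1):            # one thread block per group
    lo, hi = col_ptr[dst], col_ptr[dst + 1]
    out[dst] = edge_feat[lo:hi].sum(axis=0)    # accumulate in a register vector

assert np.array_equal(out[:, 0], np.array([2.0, 1.0, 2.0]))  # in-degrees
```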

4 Parallel Processing with Multiple GPUs

NGra further exploits the parallelism of multiple GPUs in a single server. Figure 7 shows the interconnection architecture of a typical 8-GPU server, where GPUs are connected to CPU/DRAM (host memory) via a multi-level PCI-Express (PCIe) interface hierarchy.

4.1 Multi-GPU Architecture

In a multi-GPU system, the GPU interconnect has a hierarchical structure, leading to multiple levels of locality when transferring data across GPUs or between host and GPU device memory. For example, as shown in Figure 7, communication between GPU 0 and GPU 1 achieves the highest performance because they are attached to the same PCIe switch. GPU 0 must traverse two PCIe switches and one PCIe host bridge when communicating with devices like GPU 2 or 3, which introduces longer latency and lower throughput. GPU 0 cannot conduct P2P access when communicating with GPUs located in other NUMA nodes (e.g., GPU 4 or 7), and hence performs even worse.

In addition, the upper-level link bandwidth is shared by the GPU devices under the same root. For example, GPU 0 and GPU 1 share the communication bandwidth between the PCIe host bridge and the PCIe switch at which they are rooted. As a result, their communications with GPU devices under other PCIe switches may interfere with each other when conducted concurrently, leaving their local PCIe bandwidth under-utilized. This makes the root-level links more likely to become the bottleneck in parallel computation. We therefore carefully design the streaming scheme around the locality characteristics of the multi-GPU system to achieve higher parallelism and better performance.

Figure 7: Multi-GPU architecture

4.2 Ring-Based Parallel Streaming

Consider a layer of GNN computation: a vanilla solution to exploit multi-GPU parallelism is to run each dataflow subgraph (as shown in Figure 4) that produces one output vertex chunk of the ApplyVertex stage on a single GPU, and let multiple GPUs execute different sets of such subgraphs. Each such subgraph outputs one vertex chunk (e.g., ) but takes a set of vertex chunks containing all vertices (e.g., , , and ) as input. When these subgraphs run on multiple GPUs at the same time, the same set of vertex chunks is loaded from host memory into the device memory of each GPU, so data transfer bottlenecks at the root-level PCIe links shared by all the GPUs. We therefore design a ring-based streaming scheme that lets GPUs reuse data chunks already loaded from host memory by exchanging them directly. We organize the data-sharing path among the GPUs as a ring, illustrated as the red dotted circle in Figure 7. Because both PCIe and QPI links are duplex [39, 16], the simultaneous data transfers on the ring do not interfere with each other on bandwidth. With this scheme, each vertex chunk is loaded from host memory once to enter the ring and is then passed to all the GPUs on the ring in order.
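A toy trace of the ring (single entry GPU for simplicity; NGra actually uses multiple entry GPUs chosen from the bandwidth tree) shows the key invariant: each chunk is read from host memory once, yet every GPU still processes every chunk:

```python
from collections import defaultdict

# Toy trace of ring-based streaming. Each chunk enters the ring through one
# host-memory load, then hops GPU-to-GPU around the ring, so host-side PCIe
# traffic is one load per chunk instead of one load per chunk per GPU.

def ring_trace(num_gpus, num_chunks):
    """Count host loads and record which GPUs consumed each chunk."""
    host_loads = 0
    seen = defaultdict(set)                 # chunk id -> GPUs that processed it
    for c in range(num_chunks):
        host_loads += 1                     # chunk enters the ring once
        for gpu in range(num_gpus):         # then travels the whole ring
            seen[c].add(gpu)
    return host_loads, seen

loads, seen = ring_trace(4, 8)
assert loads == 8                                # one host read per chunk
assert all(len(s) == 4 for s in seen.values())   # every GPU saw every chunk
```

Without the ring, the same workload would require `num_gpus * num_chunks` host loads over the shared root links.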

Figure 8: Ring-based streaming in a 4-GPU setting.

Scheduling in Ring-Based Streaming.   In ring-based streaming, a GPU needs to take the following two actions: 1) loading a data chunk from the host memory or from the device memory of the previous GPU in the ring; 2) performing local computations. In order to enable the overlap between the two actions, NGra introduces an explicit device-to-device (D2D) operator and employs a coordinated scheduling as illustrated in Figure 8.

In step 1, only GPU 0 and GPU 2 load vertex data chunks 1 and 3 from the host memory, respectively. After loading, in step 2, GPU 0 and GPU 2 start computations on chunks 1 and 3 while loading chunks 2 and 4, respectively. At the same time, GPU 1 and GPU 3 start fetching chunks 1 and 3 from GPU 0 and GPU 2. In steps 3 and 4, the GPUs enter the whole-ring forwarding mode, in which no data is loaded from the host memory. In step 5, unable to get input vertex data from the ring any more, GPU 0 and GPU 2 read data chunks 5 and 7, respectively, from the host memory. Additionally, GPU 1 and GPU 3 drop data chunks 3 and 1 after processing them locally, because those chunks have already been consumed by all the GPUs in the ring. In step 6, all the GPUs take the same actions as in step 5 with the data chunk numbers shifted, and the process enters the whole-ring forwarding mode again after that step. The whole process continues in such a pipelined fashion until all vertex data chunks have been loaded and processed.

This scheme leaves GPU 1 and GPU 3 idle in step 1 because it already makes full use of the system bandwidth; more concurrent loading would make things worse. For example, if all the GPUs loaded different vertex data chunks from the host memory simultaneously, throughput would be limited by root links such as the upper link of the PCIe switch in Figure 7. In this case, GPU 0 and GPU 1 would each get only half the bandwidth due to the shared upper link. Suppose a single GPU needs time T to transfer a data chunk into GPU device memory; two concurrent transfers would then take 2T before computation can start. Such longer transfer latency would be incurred each time the ring loads new data chunks from the host memory, leading to extra stalls in the entire processing pipeline.

In addition, analyzing the multi-GPU architecture helps decide which GPU(s) should perform data loading in ring-based streaming. With host memory as the root, GPUs as leaves, and data links (with their bandwidths) as edges, we can build a bandwidth tree, from which we obtain a maximal bandwidth fat-tree by gradually removing GPUs that would share upper-link bandwidth with other GPUs. On top of Figure 8, we get the maximal bandwidth fat-tree (solid red line) from the original bandwidth tree (dashed green line). Data loading from the host memory happens only on the GPUs that belong to this maximal fat-tree.
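A minimal sketch of the loader selection, under the assumed Figure 7 topology where GPUs share PCIe switches in pairs; `pick_loaders` is a hypothetical helper, not NGra code:

```python
# Picking loader GPUs from the bandwidth tree (sketch): among GPUs that share
# an upper-level link, only one should stream from host memory; the others
# receive chunks over the ring. Groups below assume the pairwise PCIe-switch
# sharing of the example 8-GPU server.

def pick_loaders(switch_groups):
    """Keep one GPU per group of GPUs that share upper-link bandwidth."""
    return [group[0] for group in switch_groups]

# 8 GPUs, pairs sharing a PCIe switch.
assert pick_loaders([[0, 1], [2, 3], [4, 5], [6, 7]]) == [0, 2, 4, 6]
```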

Figure 9: Communication Neural Net (CommNet). [Layer program: ApplyEdge is a passthrough; Gather accumulates by sum; ApplyVertex computes the new vertex data from parameters W_H and W_C.]
Figure 10: Graph Convolutional Network (GCN). [Layer program: ApplyEdge multiplies the scattered vertex data by edge.data; Gather accumulates by sum; ApplyVertex computes the new vertex data with parameter W.]
Figure 11: Max-Pooling GCN (MP-GCN). [Layer program: Gather accumulates by max; ApplyVertex computes the new vertex data with parameters W, W_pool, and b.]
Figure 12: Gated Graph Neural Network (GG-NN). [Layer program: Gather accumulates by sum; ApplyVertex applies a GRU over the previous vertex data and the accumulated value.]

5 Applications

NGra can support many different types of graph-based neural networks [4, 5, 9, 15, 13, 22, 25, 29, 38]. In this section, we present the layer-wise programs of several representative GNN models from the literature.

Communication neural net (CommNet)  [38] is a model in which cooperating agents learn to communicate among themselves before taking actions. Each agent is controlled by a deep feed-forward network, which additionally receives the summed transmissions of the other connected agents. This network can be used to solve multiple tasks that require learned communication, such as traffic control. In CommNet, there is no computation on the edge, so the ApplyEdge stage is simply a passthrough (see Figure 9). Each agent gets the summed transmissions via the Gather stage and generates its new hidden state in the ApplyVertex stage.

Graph convolutional networks (GCN)  [15, 22] generalize the notion of CNNs to an operation that works on an arbitrary graph. This algorithm has been applied in many semi-supervised or unsupervised graph clustering problems, such as entity classification in a knowledge graph. In GCN, there is computation (without neural networks) on the edge for weighted neighbor activation. In the GCN program (see Figure 10), the Scatter operation feeds vertex features into the ApplyEdge function, which multiplies them by the static edge weight determined by the vertex degrees. The Gather stage returns the weighted sum of the neighbors' activations, to which the ApplyVertex stage applies a fully-connected layer.
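The GCN layer above can be written out in SAGA-NN stages with plain numpy; the weight matrix, normalization weights, and toy graph are assumptions for illustration:

```python
import numpy as np

# A GCN layer expressed in SAGA-NN stages (illustrative, not NGra code):
# Scatter feeds source features to edges, ApplyEdge weights them by a static
# edge weight, Gather sums into destinations, ApplyVertex applies a dense
# layer with ReLU.

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))                  # 3 vertices, feature dim 4
src = np.array([0, 1, 2, 2]); dst = np.array([1, 2, 0, 1])
w_edge = np.full(len(src), 0.5)                  # static normalization weights
W = rng.standard_normal((4, 4))                  # layer parameter

edge_in = X[src]                                 # Scatter
edge_out = w_edge[:, None] * edge_in             # ApplyEdge: weight by edge.data
accum = np.zeros_like(X)
np.add.at(accum, dst, edge_out)                  # Gather: accumulator = sum
H = np.maximum(accum @ W, 0.0)                   # ApplyVertex: ReLU(accum @ W)

assert H.shape == X.shape
```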

Max-pooling GCN (MP-GCN)  [13] applies a max-pooling operator to each of the computed features, which can effectively capture different aspects of the neighborhood set. In MP-GCN, there is NN-based computation on the edge, with max aggregation instead of mean. In its program (see Figure 11), the Scatter stage passes the source vertex feature vector into the fully-connected neural network defined in the ApplyEdge function. The Gather stage then returns the element-wise max of the neighbors' features, to which the ApplyVertex stage applies a fully-connected layer.

Gated GCN (G-GCN)  [4, 29] is our running example; it incorporates the gate mechanism into GCN so that the model can learn which edges are more important for the learning target. Its computation pattern differs from that of max-pooling GCN in that the NN-based computation on edges requires the feature vectors of both the source and destination vertices. The Scatter stage must therefore propagate the feature data of both vertices onto the edge (see Figure 2).

Gated graph neural networks (GG-NN)  [25] apply recurrent neural networks (RNNs) to walks on a graph structure and unroll the recurrence for a fixed number of steps. This model has been used for NLP tasks and also in quantum chemistry for fast estimation of organic molecule properties. GG-NN has NN-based edge computation, but uses different parameters for different edge labels (the model assumes discrete edge types). GG-NN also has dense computation on vertices, where the ApplyVertex function is a Gated Recurrent Unit (GRU). In the GG-NN program (see Figure 12), different edges can use different parameters in the ApplyEdge function.

6 Evaluation

We implement NGra on top of TensorFlow (v1.7) with about 2,900 lines of C++ code and 3,000 lines of Python code. NGra extends TensorFlow with a front-end to transform SAGA-NN programs into a chunk-granularity dataflow graph, several (fused) scatter/gather operators for efficient graph propagation, and a ring-based streaming scheduling scheme.

In this section, we present detailed evaluation results that demonstrate the efficiency and scalability of NGra, and compare it with a state-of-the-art framework, TensorFlow.

Experimental setup.   We evaluate NGra on a multi-GPU server equipped with dual 2.6 GHz Intel Xeon E5-2690v4 processors (28 cores in total), 512 GB of memory, and 8 NVIDIA Tesla P100 GPUs. The installed operating system is Ubuntu 16.04, with CUDA 8.0 and cuDNN 6.

Table 1 lists the datasets used for evaluation: the Pubmed citation network [36], protein-protein interaction graphs [18], the BlogCatalog social network [24], the Reddit online discussion forum [13], and Wikidata [30]. The feature column in Table 1 gives the size of the vertex feature vector, and the label column gives the number of label classes. We test the performance of our system on the task of vertex classification, e.g., classifying academic papers into different subjects in the Pubmed citation dataset, which contains sparse bag-of-words feature vectors for each document and a list of citation links between documents.

Our evaluation uses the 5 popular GNN applications introduced in Section 5. Note that only CommNet, GCN, and GG-NN can be directly supported by TensorFlow operators, since they have no or only simple computation on the edge, in which case the propagation can be treated as a sparse multiplication. We set the number of layers in each GNN. All performance numbers in our experiments are averaged over 10 epochs.

Dataset vertex# edge# feature label
pubmed 19.7K 108.4K 500 3
protein 43.5K 205.6K 29 3
BlogCatalog 10.3K 668.0K 128 39
reddit_small 46.6K 1.4M 602 41
reddit_middle 233.0K 23.2M 602 41
reddit_full 2.2M 571.0M 300 50
enwiki 3.2M 222.1M 300 12
Table 1: Datasets (K: thousand, M: million).

6.1 Efficient Graph Propagation

NGra’s efficient Scatter and Gather kernels play an important role in handling the sparsity of graph propagation.

Micro-benchmark on Synthetic Data.   To evaluate the performance of our propagation kernels, we compare NGra with TensorFlow and cuSPARSE [1] on a simple sparse-dense matrix multiplication workload, which can be implemented in the SAGA-NN model by simply specifying the ApplyEdge phase as a multiplication with the edge feature. For TensorFlow, we directly use its sparse_tensor_dense_matmul operator. The inputs of the computation are a dense matrix and a sparse matrix whose density (i.e., the percentage of non-zero values) varies from 0.01% to 10%. The performance results are shown in Figure 13. Compared to TensorFlow, our propagation kernels can speed up by to . Even compared to cuSPARSE, NGra can improve the performance by to . The large performance gaps are mainly due to our careful kernel optimization and a GPU threading model designed specifically for GNN scenarios, where the vertex data is often a feature vector.

Figure 13: Propagation kernel time of TensorFlow(TF), cuSPARSE, and NGra(NG) on graphs with different density.

Real Applications on Small Data.   To compare with TensorFlow on real applications, we need to fit the whole graph data in device memory, as TensorFlow cannot support graphs larger than a single GPU's memory. We therefore use the first 4 small datasets in Table 1 to run all 3 applications that TensorFlow supports. Table 2 lists the comparison results. Overall, NGra outperforms TensorFlow by margins ranging from 7.7% to across all the applications and datasets. Among the 3 applications, the average improvement of GCN () is the largest, mainly because GCN's graph propagation takes more computation cycles than the others'. The improvements also depend on the dataset. For example, the blog graph is denser, which leads to higher graph propagation overhead that NGra can speed up. The average improvement of all applications on the blog dataset is , while the numbers for the rest of the datasets are 35.7%, 23.4%, and , respectively.

dataset pubmed protein blog reddit-small
NG TF NG TF NG TF NG TF
GCN 8.2 13.6 14.8 20.7 8.4 32.5 44.2 113.3
CommNet 14.2 18.6 27.4 33.5 10.6 35.8 62.4 132.4
GG-NN 37.6 41.9 77.7 83.7 23.6 49.4 127.3 195.3
Table 2: Iteration time (ms) comparison with TensorFlow.
Figure 14: Streaming scheduling strategies comparison on different applications. (Data: reddit_middle)

6.2 Scaling-up on a Single GPU

Figure 15: Scaling up performance of NGra on different applications.
Figure 16: Speed up of NGra with different applications on large graphs.

NGra uses the chunk-based streaming mechanism to support graphs that do not fit in GPU memory. We first evaluate different scheduling strategies in this mechanism, as introduced in Section 3, and then demonstrate NGra’s performance on real applications through comparisons with NGra’s baseline versions.

Streaming Scheduling Strategy.   The scheduling strategy for a chunk-based dataflow graph heavily affects overall performance, as it determines the amount of data swapping introduced in Section 3.1. We demonstrate the benefit of our strategy by comparing with two alternatives: the stage-based and dest-order scheduling strategies. In the stage-based strategy, Scatter, ApplyEdge, and Gather (S-A-G) are composed into one stage and ApplyVertex into another, and the two stages are executed one after the other; this introduces one round of data swapping between the two stages. The dest-order strategy prefers to schedule operators in the Scatter stage along the direction in which the destination vertex chunk changes; in this case, for each source vertex chunk, the accum data needs to be swapped in and out once. Figure 14 shows the comparison results for 5 applications on the reddit_middle dataset. Compared to the stage-based strategy, NGra's scheduling outperforms by 24.9% to 35.1% across the applications. Compared to the dest-order strategy, NGra improves performance by 60.1% to 93.1%. These results demonstrate the benefit of avoiding data swapping and the importance of NGra's scheduling.

Benefit of Streaming on Real Applications.   To evaluate the performance gain of streaming in NGra, we implement a baseline version of NGra, denoted NG-base, by disabling both streaming and the optimized graph propagation. NG-base can still handle large graphs by partitioning them into chunks and processing the chunks sequentially. We also implement a version that uses only chunk-based streaming, denoted NG-stream, which overlaps data transmission with computation. We compare the end-to-end performance of NG-base, NG-stream, and NGra (NG) on 3 applications in Figure 15. We construct datasets at different scales by simply duplicating the reddit_small dataset 1, 4, 9, and 16 times. Compared to NG-base, NG-stream improves performance by 33.2%, 29.3%, and 23.0%, respectively, for the 3 applications. By further using the optimized graph propagation, NGra speeds up these applications by , , and over NG-base, respectively.

6.3 Scaling-out on Multiple GPUs

NGra enables scaling GNN computation to multiple GPUs with ring-based parallel streaming. We compare this mechanism with a baseline without the ring-based strategy, denoted non-ring. Figure 16 shows the comparison results for 5 applications on two large datasets, enwiki and reddit_full. Note that the ring-based mechanism only applies with multiple GPUs, so the 1-GPU data points are the same. The results clearly show the benefits of the ring-based streaming mechanism. For example, when scaling the computation from 1 GPU to 2 GPUs, the average speedup of the non-ring mechanism is only , while our ring-based one reaches . This is mainly because, without the ring-based design, each of the two GPUs loads input data through shared PCIe links concurrently, which easily becomes the bottleneck of the system. The ring-based mechanism allows near-linear speedup because the second GPU can load data directly from the first one, avoiding pressure on the shared upper PCIe links.

From Figure 16, we also observe near-linear scalability for our ring-based mechanism before crossing NUMA nodes. As the current TensorFlow implementation can hardly support NUMA-aware tensor allocation, reading data across NUMA nodes becomes suboptimal. Our further experiments show that we can get speedup on average if we manually enable NUMA-aware tensor allocation. Generally, the ring-based mechanism in NGra improves performance by about 2× on average when using multiple GPUs.

7 Related Work

Much real-world data, e.g., web graphs, social networks, and knowledge graphs, is naturally graph-structured, and tremendously valuable information can be extracted from it. A large number of graph processing systems have been proposed and developed to analyze such data through iterative propagation-based computation. Pregel [28] first proposed the vertex-program model. It was extended by subsequent work like GraphLab [26] and PowerGraph [11], which proposed the GAS model to exploit more parallelism in edge-related operations. The GAS model was further adopted and extended by a line of follow-up work optimizing different aspects, including graph layout, sequential data access, and secondary storage (e.g., GraphChi [23], Grace [33], XStream [35], Chaos [34], and FlashGraph [45]); distributed shared memory and RDMA (e.g., Grappa [31] and GraM [40]); NUMA-awareness, scheduling, and load balancing (e.g., Galois [32], Mizan [19], and Polymer [44]); and graph partitioning (e.g., PowerLyra [6] and BiGraph [7]). All of these works concentrate on computation conducted on CPUs.

Another series of graph systems focuses on exploiting the computation power of GPUs for graph processing. Medusa [46] provides simple programming abstractions for GPU-based graph processing and automatically conducts parallel execution on multiple GPUs. CuSha [20] mainly focuses on exploring new graph representations to allow faster graph processing. Neither can process graphs exceeding the GPU memory capacity. Totem [10] statically partitions the graph across GPU and host memory to balance their computation loads, which may not be achievable, especially for large graphs, since the ratios of memory capacity and computation power between GPU and CPU are not aligned. GraphReduce [37] can process out-of-memory graphs on a single GPU; it optimizes memory coalescing by using two different formats, the benefit of which can easily be cancelled by redundant data transfers. GTS [21] can also process out-of-memory graphs on multiple GPUs, but has no mechanism to avoid redundantly loading vertex data from host to device memory in the multi-GPU case. Garaph [27] exploits edge-centric parallelism and dynamic scheduling to achieve the best performance on hybrid CPU/GPU platforms. Lux [17] investigates the placement of graph data over the memory hierarchy of CPUs across multiple nodes. Graphie [14] proposes methods to address the challenges that arise when the set of active vertices changes throughout the execution. All of these systems focus on supporting traditional graph algorithms like PageRank, connected components, and shortest paths.

TuX [41] pioneers the effort of studying the gap between graph computation and traditional machine learning, while NGra moves further to connect graph processing and deep learning, which is well supported by dataflow frameworks like TensorFlow [3], PyTorch [2], MXNet [8], and CNTK [42]. A similar past effort is GraphX [12], a graph system built over a general dataflow engine, Spark [43], whose target is to connect graph processing with batch-like map-reduce workloads in a single workflow pipeline.

8 Conclusion

GNNs represent an emerging computation model that arises naturally from the need to apply neural network models on large graphs. Supporting efficient and scalable parallel computation for GNN training is demanding due to its inherent complexity. NGra is the first to target GNNs, with a new programming abstraction, which is then mapped and optimized as dataflows to execute efficiently on GPUs.

References

  • [1] cuSPARSE, Retrieved September, 2018. https://developer.nvidia.com/cusparse.
  • [2] PyTorch, Retrieved September, 2018. http://pytorch.org.
  • [3] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (GA, 2016), USENIX Association, pp. 265–283.
  • [4] Bresson, X., and Laurent, T. Residual gated graph convnets. In International Conference on Learning Representations (ICLR) (2018).
  • [5] Bui, T. D., Ravi, S., and Ramavajjala, V. Neural graph learning: Training neural networks using graphs. In Proceedings of 11th ACM International Conference on Web Search and Data Mining (WSDM) (2018).
  • [6] Chen, R., Shi, J., Chen, Y., and Chen, H. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. In Proceedings of the Tenth European Conference on Computer Systems (2015), EuroSys’15, ACM.
  • [7] Chen, R., Shi, J., Zang, B., and Guan, H. Bipartite-oriented distributed graph partitioning for big learning. In Proceedings of 5th Asia-Pacific Workshop on Systems (2014), APSys’14, ACM.
  • [8] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS Workshop on Machine Learning Systems (LearningSys) (2016).
  • [9] Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (2016), pp. 3844–3852.
  • [10] Gharaibeh, A., Beltrão Costa, L., Santos-Neto, E., and Ripeanu, M. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (New York, NY, USA, 2012), PACT ’12, ACM, pp. 345–354.
  • [11] Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI (2012), vol. 12, p. 2.
  • [12] Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., and Stoica, I. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (2014), OSDI’14, USENIX.
  • [13] Hamilton, W. L., Ying, R., and Leskovec, J. Inductive representation learning on large graphs. In NIPS (2017).
  • [14] Han, W., Mawhirter, D., Wu, B., and Buland, M. Graphie: Large-scale asynchronous graph traversals on just a gpu. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (2017), PACT’17.
  • [15] Henaff, M., Bruna, J., and LeCun, Y. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015).
4 Parallel Processing with Multiple GPUs

NGra further exploits the parallelism of multiple GPUs in a single server. Figure 7 shows the interconnection architecture of a typical 8-GPU server, where GPUs are connected to CPU/DRAM (host memory) via a multi-level PCI-Express (PCIe) interface hierarchy.

4.1 Multi-GPU Architecture

In a multi-GPU system, the GPU interconnect has a hierarchical structure, leading to multiple levels of locality when transferring data across GPUs or between host and GPU device memory. For example, as shown in Figure 7, communication between GPU 0 and GPU 1 achieves the highest performance because they are attached to the same PCIe switch. GPU 0 must traverse two PCIe switches and one PCIe host bridge to communicate with a device like GPU 2 or 3, which introduces longer latency and lower throughput. GPU 0 cannot use P2P access at all when communicating with GPUs located in the other NUMA node (e.g., GPU 4 or 7), and hence performs even worse.

In addition, the upper-level link bandwidth is shared by the GPU devices under the same root. For example, GPU 0 and GPU 1 share the bandwidth of the link between the PCIe host bridge and the PCIe switch they are attached to. As a result, their communications with GPU devices under other PCIe switches may interfere with each other when conducted concurrently, leaving their local PCIe bandwidth under-utilized. This makes the root-level links more likely to become the bottleneck in parallel computation. We therefore carefully design the streaming scheme around the locality characteristics of the multi-GPU system to achieve higher parallelism and better performance.
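The locality tiers described above can be made concrete with a small sketch. This is illustrative code of ours, not part of NGra; the topology constants are assumed from the description of Figure 7 (GPUs 0-3 on one NUMA node under two PCIe switches {0,1} and {2,3}, GPUs 4-7 on the other).

```python
# Illustrative classification of GPU-pair communication locality, assuming
# the 8-GPU topology of Figure 7 (not NGra code).
PCIE_SWITCH = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
NUMA_NODE = {g: 0 if g < 4 else 1 for g in range(8)}

def locality(src, dst):
    """Return the communication tier of a GPU pair (faster tiers first)."""
    if PCIE_SWITCH[src] == PCIE_SWITCH[dst]:
        return "same-switch"   # P2P through a single PCIe switch: fastest
    if NUMA_NODE[src] == NUMA_NODE[dst]:
        return "same-numa"     # P2P through two switches + the host bridge
    return "cross-numa"        # no P2P; transfers staged through host memory

assert locality(0, 1) == "same-switch"
assert locality(0, 3) == "same-numa"
assert locality(0, 7) == "cross-numa"
```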

Figure 7: Multi-GPU architecture

4.2 Ring-Based Parallel Streaming

Consider one layer of GNN computation. A vanilla solution to exploit multi-GPU parallelism is to run a dataflow subgraph (as shown in Figure 4), which produces one output vertex chunk of the ApplyVertex stage, on a single GPU, and let multiple GPUs execute different sets of such subgraphs. Each such subgraph outputs one vertex chunk and takes as input the set of vertex chunks covering all vertices. When these subgraphs run on multiple GPUs at the same time, the same set of vertex chunks is loaded from host memory into the device memory of each GPU, so data transfer bottlenecks at the root-level PCIe links shared by all the GPUs. We therefore design a ring-based streaming scheme that lets GPUs reuse data chunks already loaded from host memory by exchanging them directly. We organize the data-sharing path among the GPUs as a ring, illustrated as the red dotted circle in Figure 7. Because both PCIe and QPI links are duplex [39, 16], the simultaneous data transfers on the ring do not interfere with each other on bandwidth. With this scheme, each vertex chunk is loaded from host memory once to enter the ring and is then passed to all the GPUs on the ring in order.
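The routing idea can be sketched in a few lines (an illustrative helper of ours, not NGra's scheduler): a chunk enters the ring at one loader GPU and is forwarded GPU-to-GPU until every GPU has consumed it, so each chunk crosses the host-to-GPU links exactly once.

```python
# Sketch of ring-based chunk routing: the order in which GPUs on `ring`
# consume a chunk that entered at `loader` (illustrative, not NGra code).
def ring_route(loader, ring):
    i = ring.index(loader)
    return [ring[(i + k) % len(ring)] for k in range(len(ring))]

ring = [0, 1, 2, 3]  # the 4-GPU ring of Figure 8
# A chunk loaded by GPU 0 is forwarded 0 -> 1 -> 2 -> 3:
assert ring_route(0, ring) == [0, 1, 2, 3]
# A chunk loaded by GPU 2 is forwarded 2 -> 3 -> 0 -> 1:
assert ring_route(2, ring) == [2, 3, 0, 1]
```

Regardless of ring size, host memory sends each chunk only once; all other copies travel over GPU-to-GPU links, which keeps the shared root links free.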

Figure 8: Ring-based streaming in a 4-GPU setting.

Scheduling in Ring-Based Streaming.   In ring-based streaming, a GPU needs to take the following two actions: 1) loading a data chunk from the host memory or from the device memory of the previous GPU in the ring; 2) performing local computations. In order to enable the overlap between the two actions, NGra introduces an explicit device-to-device (D2D) operator and employs a coordinated scheduling as illustrated in Figure 8.

In step 1, only GPU 0 and GPU 2 load vertex data chunks 1 and 3 from host memory, respectively. In step 2, GPU 0 and GPU 2 start computing on chunks 1 and 3 while loading chunks 2 and 4, respectively; at the same time, GPU 1 and GPU 3 start fetching chunks 1 and 3 from GPU 0 and GPU 2. In steps 3 and 4, the GPUs enter the whole-ring forward mode, in which no data is loaded from host memory. In step 5, unable to obtain further input vertex data from the ring, GPU 0 and GPU 2 read data chunks 5 and 7, respectively, from host memory. Meanwhile, GPU 1 and GPU 3 drop data chunks 3 and 1 after processing them locally, because those chunks have already been consumed by all the GPUs in the ring. In step 6, all GPUs take the same actions as in step 5, except that the chunk numbers are shifted, and after that step the process enters the whole-ring forward mode again. The whole process continues in this pipelined fashion until all vertex data chunks have been loaded and processed.

This scheme leaves GPU 1 and GPU 3 idle in step 1 because it already makes full use of the system bandwidth; more concurrent loading would make things worse. For example, if all the GPUs loaded different vertex data chunks from host memory simultaneously, throughput would be limited by root links such as the upper link of the PCIe switch in Figure 7. In that case, GPU 0 and GPU 1 would each get only half the bandwidth of their shared upper link. If a single GPU needs time T to transfer a data chunk into GPU device memory, two concurrent transfers take 2T before computation can start. Such inflated transfer latency would recur each time the ring loads new data chunks from host memory, introducing extra stalls into the entire processing pipeline.
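A back-of-envelope model of this contention argument, under the simple assumption that k concurrent transfers over one shared link of bandwidth B each receive B/k:

```python
# Illustrative contention model (assumed fair bandwidth sharing, not a
# measured PCIe model): k concurrent loads over one shared link each get
# bandwidth B/k, so each takes k times as long as an uncontended load.
def transfer_time(chunk_bytes, link_bandwidth, concurrent_loads):
    return chunk_bytes / (link_bandwidth / concurrent_loads)

T = transfer_time(1.0, 1.0, 1)                 # one GPU loading alone
assert transfer_time(1.0, 1.0, 2) == 2 * T     # two concurrent loads: 2T
assert transfer_time(1.0, 1.0, 4) == 4 * T     # four concurrent loads: 4T
```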

In addition, analyzing the multi-GPU architecture helps decide which GPU(s) should perform data loading in ring-based streaming. Taking host memory as the root, GPUs as leaves, and data links (with their bandwidths) as edges, we build a bandwidth tree, from which we obtain a maximal fat-tree by gradually removing GPUs that share upper-link bandwidth with other GPUs. On top of Figure 8, we get a maximal bandwidth fat-tree (solid red lines) from the original bandwidth tree (dashed green lines). Data loading from host memory happens only on the GPUs that belong to this maximal fat-tree.
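One simple way to realize this pruning, sketched under our own assumptions (greedy selection, illustrative link names; not NGra's actual algorithm): among GPUs that share the same upper-level link, keep only one as a host-memory loader.

```python
# Illustrative loader selection in the spirit of the "maximal fat-tree"
# pruning: at most one loader GPU per shared upper-level link.
def pick_loaders(gpus, upper_link_of):
    loaders, used_links = [], set()
    for g in gpus:                          # greedy: first GPU per link
        if upper_link_of[g] not in used_links:
            used_links.add(upper_link_of[g])
            loaders.append(g)
    return loaders

# In the 4-GPU setting of Figure 8, GPUs 0/1 share one switch uplink and
# GPUs 2/3 another (link names "sw0"/"sw1" are ours), so only GPU 0 and
# GPU 2 load from host memory -- matching the step-1 schedule above.
assert pick_loaders([0, 1, 2, 3],
                    {0: "sw0", 1: "sw0", 2: "sw1", 3: "sw1"}) == [0, 2]
```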

Figure 9: Layer-wise SAGA-NN program for Communication Neural Net (CommNet)
Figure 10: Layer-wise SAGA-NN program for Graph Convolutional Network (GCN)
Figure 11: Layer-wise SAGA-NN program for Max-Pooling GCN (MP-GCN)
Figure 12: Layer-wise SAGA-NN program for Gated Graph Neural Network (GG-NN)

5 Applications

NGra can support many different types of graph-based neural networks [4, 5, 9, 15, 13, 22, 25, 29, 38]. In this section, we present the layer-wise programs of several representative GNN models from the literature.

Communication neural net (CommNet)  [38] is a model in which cooperating agents learn to communicate among themselves before taking actions. Each agent is controlled by a deep feed-forward network, which additionally receives the summed transmissions of the other connected agents. This network can be used to solve learning-to-communicate tasks such as traffic control. In CommNet, there is no computation on the edge, so the ApplyEdge stage is simply a passthrough (see Figure 9). Each agent receives the summed transmissions via the Gather stage and generates its new hidden state in the ApplyVertex stage.
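A hedged sketch of one CommNet layer in the SAGA-NN spirit (our illustrative code, not the paper's figure program; the weight matrices W_h and W_c and the tanh nonlinearity are assumptions):

```python
import numpy as np

# Illustrative CommNet layer: ApplyEdge is a passthrough, Gather sums the
# transmissions of connected agents, ApplyVertex combines an agent's own
# hidden state with the gathered sum.
def commnet_layer(H, edges, W_h, W_c):
    acc = np.zeros_like(H)
    for src, dst in edges:       # Scatter + passthrough ApplyEdge
        acc[dst] += H[src]       # Gather with a sum accumulator
    return np.tanh(H @ W_h + acc @ W_c)   # ApplyVertex

H = np.eye(3)                    # 3 agents with one-hot hidden states
edges = [(0, 1), (2, 1)]         # agents 0 and 2 transmit to agent 1
out = commnet_layer(H, edges, np.eye(3), np.eye(3))
assert np.allclose(out[1], np.tanh([1.0, 1.0, 1.0]))  # sum of both inputs plus own state
```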

Graph convolutional networks (GCN)  [15, 22] generalize the notion of CNNs to an operation that works on an arbitrary graph. The algorithm has been applied to many semi-supervised and unsupervised graph clustering problems, such as entity classification in a knowledge graph. In GCN, the edge computation (which involves no neural network) weights neighbor activations. In the GCN program (see Figure 10), the Scatter operation feeds vertex features into the ApplyEdge function, which multiplies them by a static edge weight determined by the vertex degrees. The Gather stage returns the weighted sum of the neighbors' activations, to which the ApplyVertex stage applies a fully-connected layer.
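A minimal sketch of this GCN layer (illustrative, not NGra's implementation; the 1/deg normalization and ReLU are our assumptions for a concrete example):

```python
import numpy as np

# Illustrative GCN layer: static degree-based edge weights, sum Gather,
# and a fully-connected ApplyVertex layer.
def gcn_layer(H, edges, W):
    deg = np.zeros(H.shape[0])
    for s, d in edges:
        deg[d] += 1
    acc = np.zeros_like(H)
    for s, d in edges:                 # ApplyEdge: weight by 1/deg(dst)
        acc[d] += H[s] / deg[d]        # Gather: weighted sum of neighbors
    return np.maximum(acc @ W, 0)      # ApplyVertex: FC layer + ReLU

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = gcn_layer(H, [(0, 2), (1, 2)], np.eye(2))
assert np.allclose(out[2], [0.5, 0.5])  # mean of neighbors 0 and 1
```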

Max-pooling GCN (MP-GCN)  [13] applies a max-pooling operator to the computed neighbor features, which can effectively capture different aspects of the neighborhood set. In MP-GCN, the edge computation is NN-based and the aggregation is max instead of mean. In its program (see Figure 11), the Scatter stage passes the source vertex feature vector into the fully-connected neural network defined in the ApplyEdge function. The Gather stage then returns the element-wise max over the neighbors' features, to which the ApplyVertex stage applies a fully-connected layer.
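The distinguishing step is the Gather with a max accumulator; a sketch under our assumptions (single-layer FC with ReLU as the edge network; vertices with no in-edges keep the -inf sentinel):

```python
import numpy as np

# Illustrative MP-GCN aggregation: an FC net on each edge, then an
# element-wise max over incoming messages instead of a weighted sum.
def mp_gcn_gather(H, edges, W_pool, b):
    acc = np.full_like(H, -np.inf)     # max-accumulator identity
    for s, d in edges:
        msg = np.maximum(H[s] @ W_pool + b, 0)  # ApplyEdge: FC + ReLU
        acc[d] = np.maximum(acc[d], msg)        # Gather with max accumulator
    return acc

H = np.array([[1.0, 3.0], [2.0, 1.0], [0.0, 0.0]])
acc = mp_gcn_gather(H, [(0, 2), (1, 2)], np.eye(2), np.zeros(2))
assert np.allclose(acc[2], [2.0, 3.0])  # element-wise max of the two messages
```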

Gated GCN (G-GCN)  [4, 29] is our running example; it incorporates a gate mechanism into GCN so that the model can learn which edges are more important for the learning target. Its computation pattern differs from that of Max-pooling GCN in that the NN-based computation on edges requires the feature vectors of both the source and destination vertices, so the Scatter stage must propagate the feature data of both endpoints onto the edge (see Figure 2).
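The essential difference can be shown in a short sketch (illustrative, not the paper's program; the sigmoid gate over a sum of two linear maps is our assumed form of the edge network):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative G-GCN edge computation: the gate depends on BOTH endpoint
# features, so Scatter must place both on the edge.
def g_gcn_gather(H, edges, W_src, W_dst):
    acc = np.zeros_like(H)
    for s, d in edges:
        gate = sigmoid(H[s] @ W_src + H[d] @ W_dst)  # ApplyEdge: learned gate
        acc[d] += gate * H[s]                        # Gather: gated sum
    return acc

H = np.array([[1.0, 0.0], [0.0, 1.0]])
acc = g_gcn_gather(H, [(0, 1)], np.zeros((2, 2)), np.zeros((2, 2)))
assert np.allclose(acc[1], [0.5, 0.0])  # sigmoid(0) = 0.5 gates the source feature
```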

Gated graph neural networks (GG-NN)  [25] apply recurrent neural networks (RNNs) to walks on a graph structure and unroll the recurrence for a fixed number of steps. This model has been used for NLP tasks and, in quantum chemistry, for fast estimation of organic molecule properties. GG-NN has NN-based edge computation, but with different parameters for different edge labels (the model assumes discrete edge types). GG-NN also has dense computation on vertices, where the ApplyVertex function is a Gated Recurrent Unit (GRU). In the GG-NN program (see Figure 12), different edge types thus use different parameters in the ApplyEdge function.

6 Evaluation

We implement NGra on top of TensorFlow (v1.7) with about 2,900 lines of C++ code and 3,000 lines of Python code. NGra extends TensorFlow with a front-end to transform SAGA-NN programs into a chunk-granularity dataflow graph, several (fused) scatter/gather operators for efficient graph propagation, and a ring-based streaming scheduling scheme.

In this section, we present detailed evaluation results and demonstrate the efficiency and scalability of NGra, including comparisons with a state-of-the-art system, TensorFlow.

Experimental setup.   We evaluate NGra on a multi-GPU server equipped with dual 2.6 GHz Intel Xeon E5-2690v4 processors (28 cores in total), 512 GB of memory, and 8 NVIDIA Tesla P100 GPUs. The server runs Ubuntu 16.04 with CUDA 8.0 and cuDNN 6.

Table 1 lists the datasets used in our evaluation: the Pubmed citation network [36], protein-protein interaction graphs [18], the BlogCatalog social network [24], the Reddit online discussion forum [13], and Wikidata [30]. The feature column in Table 1 gives the size of the vertex feature vector, and the label column gives the number of label classes. We test system performance on the task of vertex classification, e.g., classifying academic papers into different subjects in the Pubmed citation dataset, which contains a sparse bag-of-words feature vector for each document and a list of citation links between documents.

Our evaluation uses the 5 popular GNN applications introduced in Section 5. Note that only CommNet, GCN, and GG-NN can be supported directly with TensorFlow operators, because they have no (or only simple) computation on the edge, in which case the propagation can be treated as a sparse multiplication. We set the number of layers in each GNN. All performance numbers in our experiments are averaged over 10 epochs.

Dataset vertex# edge# feature label
pubmed 19.7K 108.4K 500 3
protein 43.5K 205.6K 29 3
BlogCatalog 10.3K 668.0K 128 39
reddit_small 46.6K 1.4M 602 41
reddit_middle 233.0K 23.2M 602 41
reddit_full 2.2M 571.0M 300 50
enwiki 3.2M 222.1M 300 12
Table 1: Datasets (K: thousand, M: million).

6.1 Efficient Graph Propagation

NGra's efficient Scatter and Gather kernels play an important role in handling the sparsity of graph propagation.

Micro-benchmark on Synthetic Data.   To evaluate the performance of our propagation kernels, we compare NGra with TensorFlow and cuSPARSE [1] on a simple sparse-dense matrix multiplication workload, which can be implemented in the SAGA-NN model by specifying the ApplyEdge phase as a multiplication with the edge feature. For TensorFlow, we directly use its sparse_tensor_dense_matmul operator. The inputs are a sparse matrix with varying graph density (i.e., the percentage of non-zero values) from 0.01% to 10%, and a dense matrix. The performance results are shown in Figure 13. Our propagation kernels consistently outperform TensorFlow's, and NGra improves on even cuSPARSE across all densities. The large performance gaps are mainly due to our careful kernel optimization and a GPU threading model designed specifically for GNN scenarios, where the vertex data is typically a feature vector.
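The workload is equivalent to SpMM: multiplying a sparse adjacency-like matrix by a dense feature matrix. A sketch (our illustrative helper, not an NGra kernel) checking that the Scatter/ApplyEdge/Gather formulation computes the same result as a dense matmul:

```python
import numpy as np

# Illustrative SAGA-style SpMM: ApplyEdge scales the source row by the
# edge value; Gather sums into the destination row.
def saga_spmm(edge_list, edge_vals, X, n_rows):
    Y = np.zeros((n_rows, X.shape[1]))
    for (r, c), v in zip(edge_list, edge_vals):
        Y[r] += v * X[c]
    return Y

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.1) * rng.random((6, 6))   # ~10%-dense matrix
X = rng.random((6, 4))
edges = [(r, c) for r in range(6) for c in range(6) if A[r, c] != 0]
vals = [A[r, c] for r, c in edges]
assert np.allclose(saga_spmm(edges, vals, X, 6), A @ X)  # matches dense matmul
```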

Figure 13: Propagation kernel time of TensorFlow (TF), cuSPARSE, and NGra (NG) on graphs with different densities.

Real Applications on Small Data.   To compare with TensorFlow on real applications, the whole graph data must fit in device memory, since TensorFlow cannot support a graph larger than a single GPU's memory. We therefore use the first 4 small datasets in Table 1 to run the 3 applications that TensorFlow supports. Table 2 lists the comparison results. Overall, NGra outperforms TensorFlow on every application and dataset, with improvements starting at 7.7%. Among the 3 applications, the average improvement of GCN is the largest, mainly because GCN's graph propagation takes more computation cycles than the others'. The improvements also depend on the dataset. For example, the blog graph is denser, which leads to higher graph propagation overhead that NGra can reduce; accordingly, the average improvement across all applications is highest on the blog dataset, while the average improvements on the remaining datasets are smaller (e.g., 35.7% and 23.4%).

dataset pubmed protein blog reddit-small
NG TF NG TF NG TF NG TF
GCN 8.2 13.6 14.8 20.7 8.4 32.5 44.2 113.3
CommNet 14.2 18.6 27.4 33.5 10.6 35.8 62.4 132.4
GG-NN 37.6 41.9 77.7 83.7 23.6 49.4 127.3 195.3
Table 2: Iteration time (ms) comparison with TensorFlow.
Figure 14: Streaming scheduling strategies comparison on different applications. (Data: reddit_middle)

6.2 Scaling-up on a Single GPU

Figure 15: Scaling up performance of NGra on different applications.
Figure 16: Speed up of NGra with different applications on large graphs.

NGra uses the chunk-based streaming mechanism to support graphs that do not fit in GPU memory. We first evaluate the different scheduling strategies of this mechanism, introduced in Section 3, and then demonstrate NGra's performance on real applications through comparisons with baseline versions of NGra.

Streaming Scheduling Strategy.   The scheduling strategy of a chunk-based dataflow graph heavily affects overall performance, as it determines the amount of data swapping introduced in Section 3.1. We demonstrate the benefit of our strategy by comparing it with two alternatives: stage-based and dest-order scheduling. In the stage-based strategy, Scatter, ApplyEdge, and Gather (S-A-G) are composed as one stage and ApplyVertex as another, and the two stages are executed one after the other, introducing one round of data swapping between them. The dest-order strategy prefers to schedule operators in the Scatter stage along the direction in which the destination vertex chunk changes; in this case, for each source vertex chunk, the accum data must be swapped in and out once. Figure 14 shows the comparison results for the 5 applications on the reddit_middle dataset. Compared to the stage-based strategy, NGra's scheduling performs better by 24.9% to 35.1% across applications; compared to the dest-order strategy, NGra improves performance by 60.1% to 93.1%. These results demonstrate the benefit of avoiding data swapping and the importance of NGra's scheduling.

Benefit of Streaming on Real Applications.   To evaluate the performance gain of streaming in NGra, we implement a baseline version of NGra with both streaming and the optimized graph propagation disabled, denoted NG-base. NG-base can still handle a large graph by partitioning it into chunks and processing them sequentially. We also implement a version that uses only chunk-based streaming, denoted NG-stream, which overlaps data transfer with computation. We compare the end-to-end performance of NG-base, NG-stream, and NGra (NG) on 3 applications in Figure 15, constructing datasets of different scales by duplicating the reddit_small dataset 1, 4, 9, and 16 times. Compared to NG-base, NG-stream improves performance by 33.2%, 29.3%, and 23.0% for the 3 applications, respectively. With the optimized graph propagation added, NGra speeds up these applications over NG-base even further.

6.3 Scaling-out on Multiple GPUs

NGra scales GNN computation to multiple GPUs with ring-based parallel streaming. We compare this mechanism with a baseline without the ring-based strategy, denoted non-ring. Figure 16 shows the comparison results for the 5 applications on two large datasets, enwiki and reddit_full. Note that the ring-based mechanism only takes effect with multiple GPUs, so the 1-GPU data points are identical. The results clearly show the benefit of ring-based streaming: for example, when scaling from 1 GPU to 2 GPUs, the non-ring mechanism falls well short of the ideal speedup, while our ring-based mechanism scales nearly linearly. This is mainly because, without the ring-based design, both GPUs load input data concurrently through shared PCIe links, which easily become the bottleneck of the system. The ring-based mechanism achieves near-linear speedup because the second GPU can load data directly from the first one, avoiding pressure on the shared upper PCIe links.

From Figure 16, we also observe near-linear scalability for the ring-based mechanism up to the point where the ring crosses NUMA nodes. Because the current TensorFlow implementation hardly supports NUMA-aware tensor allocation, reading data across NUMA nodes becomes suboptimal. Our further experiments show additional speedup on average when we manually enable NUMA-aware tensor allocation. Overall, the ring-based mechanism in NGra improves performance by about 2x on average when using multiple GPUs.

7 Related Work

Much real-world data is graph-structured, e.g., web graphs, social networks, and knowledge graphs, from which tremendous valuable information can be extracted. A large number of graph processing systems have been proposed and developed to analyze such data through iterative, propagation-based computation. Pregel [28] first proposed the vertex-program model. It was extended by subsequent work like GraphLab [26] and PowerGraph [11], which proposed the GAS model to exploit more parallelism in edge-related operations. The GAS model was in turn adopted and extended by a line of follow-up work optimizing different aspects, including graph layout, sequential data access, and secondary storage (e.g., GraphChi [23], Grace [33], X-Stream [35], Chaos [34], and FlashGraph [45]); distributed shared memory and RDMA (e.g., Grappa [31] and GraM [40]); NUMA-awareness, scheduling, and load balancing (e.g., Galois [32], Mizan [19], and Polymer [44]); and graph partitioning (e.g., PowerLyra [6] and BiGraph [7]). All of these works concentrate on computation on CPUs.

Another series of graph systems focuses on exploiting the computational power of GPUs for graph processing. Medusa [46] provides simple programming abstractions for GPU-based graph processing and automatically parallelizes execution across multiple GPUs. CuSha [20] mainly explores new graph representations to allow faster graph processing. Neither can process graphs exceeding the GPU memory capacity. Totem [10] statically partitions the graph between GPU and host memory to balance their computation loads, which may not be achievable, especially for large graphs, since the ratio of memory capacity to computation power differs between GPU and CPU. GraphReduce [37] can process out-of-memory graphs on a single GPU; it optimizes memory coalescing by using two different formats, a benefit easily cancelled by the redundant data transfers involved. GTS [21] can also process out-of-memory graphs on multiple GPUs, but it has no mechanism to avoid redundant vertex data loads from host to device memory in the multi-GPU case. Garaph [27] exploits edge-centric parallelism and dynamic scheduling to achieve the best performance on hybrid CPU/GPU platforms. Lux [17] investigates the placement of graph data in the memory hierarchy of CPUs across multiple nodes. Graphie [14] proposes methods to address the challenges that arise when the set of active vertices changes throughout execution. All of these systems focus only on traditional graph algorithms such as PageRank, connected components, and shortest paths.

TuX2 [41] pioneered the effort of studying the gap between graph computation and traditional machine learning, while NGra moves further to connect graph processing with deep learning, which is well supported by dataflow frameworks like TensorFlow [3], PyTorch [2], MXNet [8], and CNTK [42]. A similar past effort is GraphX [12], a graph system built over a general dataflow engine, Spark [43], whose target is to connect graph processing with batch-like map-reduce workloads in a single workflow pipeline.

8 Conclusion

GNNs represent an emerging computation model that arises naturally from the need to apply neural network models to large graphs. Supporting efficient and scalable parallel computation for GNN training is demanding due to its inherent complexity. NGra is the first system to target GNNs, offering a new programming abstraction that is mapped and optimized as dataflow to execute efficiently on GPUs.

References

  • [1] cuSPARSE, Retrieved September, 2018. https://developer.nvidia.com/cusparse.
  • [2] PyTorch, Retrieved September, 2018. http://pytorch.org.
  • [3] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (GA, 2016), USENIX Association, pp. 265–283.
  • [4] Bresson, X., and Laurent, T. Residual gated graph convnets. In International Conference on Learning Representations (ICLR) (2018).
  • [5] Bui, T. D., Ravi, S., and Ramavajjala, V. Neural graph learning: Training neural networks using graphs. In Proceedings of 11th ACM International Conference on Web Search and Data Mining (WSDM) (2018).
  • [6] Chen, R., Shi, J., Chen, Y., and Chen, H. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. In Proceedings of the Tenth European Conference on Computer Systems (2015), EuroSys'15, ACM.
  • [7] Chen, R., Shi, J., Zang, B., and Guan, H. Bipartite-oriented distributed graph partitioning for big learning. In Proceedings of 5th Asia-Pacific Workshop on Systems (2014), APSys’14, ACM.
  • [8] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS Workshop on Machine Learning Systems (LearningSys) (2016).
  • [9] Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (2016), pp. 3844–3852.
  • [10] Gharaibeh, A., Beltrão Costa, L., Santos-Neto, E., and Ripeanu, M. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (New York, NY, USA, 2012), PACT ’12, ACM, pp. 345–354.
  • [11] Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI (2012), vol. 12, p. 2.
  • [12] Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., and Stoica, I. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (2014), OSDI’14, USENIX.
  • [13] Hamilton, W. L., Ying, R., and Leskovec, J. Inductive representation learning on large graphs. In NIPS (2017).
  • [14] Han, W., Mawhirter, D., Wu, B., and Buland, M. Graphie: Large-scale asynchronous graph traversals on just a gpu. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (2017), PACT’17.
  • [15] Henaff, M., Bruna, J., and LeCun, Y. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015).
  • [16] Intel Corp. An introduction to the intel quickpath interconnect, 2009. https://www.intel.com/content/www/us/en/io/quickpath-technology/quick-path-interconnect-introduction-paper.html.
  • [17] Jia, Z., Kwon, Y., Shipman, G., McCormick, P., Erez, M., and Aiken, A. A distributed multi-gpu system for fast graph processing. Proc. VLDB Endow. 11, 3 (Nov. 2017), 297–310.
  • [18] Kersting, K., Kriege, N. M., Morris, C., Mutzel, P., and Neumann, M. Benchmark data sets for graph kernels, 2016.
  • [19] Khayyat, Z., Awara, K., Alonazi, A., Jamjoom, H., Williams, D., and Kalnis, P. Mizan: A system for dynamic load balancing in large-scale graph processing. In Proceedings of the 8th ACM European Conference on Computer Systems (2013), EuroSys’13, ACM.
  • [20] Khorasani, F., Vora, K., Gupta, R., and Bhuyan, L. N. Cusha: Vertex-centric graph processing on gpus. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing (New York, NY, USA, 2014), HPDC ’14, ACM, pp. 239–252.
  • [21] Kim, M.-S., An, K., Park, H., Seo, H., and Kim, J. Gts: A fast and scalable graph processing method based on streaming topology to gpus. In Proceedings of the 2016 International Conference on Management of Data (New York, NY, USA, 2016), SIGMOD ’16, ACM, pp. 447–461.
  • [22] Kipf, T. N., and Welling, M. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR) (2016).
  • [23] Kyrola, A., Blelloch, G., and Guestrin, C. GraphChi: Large-scale graph computation on just a PC. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (2012), OSDI’12, USENIX.
  • [24] Lei, T., and Huan, L. Relational learning via latent social dimensions. In KDD ’09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2009), ACM, pp. 817–826.
  • [25] Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. International Conference on Learning Representations (ICLR) (2016).
  • [26] Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., and Hellerstein, J. M. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5, 8 (Apr. 2012).
  • [27] Ma, L., Yang, Z., Chen, H., Xue, J., and Dai, Y. Garaph: Efficient GPU-accelerated graph processing on a single machine with balanced replication. In 2017 USENIX Annual Technical Conference (USENIX ATC 17) (Santa Clara, CA, 2017), USENIX Association, pp. 195–207.
  • [28] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (2010), SIGMOD’10, ACM.
  • [29] Marcheggiani, D., and Titov, I. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017), Association for Computational Linguistics, pp. 1506–1515.
  • [30] Meta. Data dumps — meta, discussion about wikimedia projects, 2018. [Online; accessed 3-May-2018].
  • [31] Nelson, J., Holt, B., Myers, B., Briggs, P., Ceze, L., Kahan, S., and Oskin, M. Latency-tolerant software distributed shared memory. In 2015 USENIX Annual Technical Conference (2015), USENIX ATC’15, USENIX.
  • [32] Nguyen, D., Lenharth, A., and Pingali, K. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), SOSP’13, ACM.
  • [33] Prabhakaran, V., Wu, M., Weng, X., McSherry, F., Zhou, L., and Haradasan, M. Managing large graphs on multi-cores with graph awareness. In 2012 USENIX Annual Technical Conference (2012), USENIX ATC’12, USENIX.
  • [34] Roy, A., Bindschaedler, L., Malicevic, J., and Zwaenepoel, W. Chaos: Scale-out graph processing from secondary storage. In Proceedings of the 25th Symposium on Operating Systems Principles (2015), SOSP’15, ACM.
  • [35] Roy, A., Mihailovic, I., and Zwaenepoel, W. X-Stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), SOSP’13, ACM.
  • [36] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine 20, 1 (2008), 61–80.
  • [37] Sengupta, D., Song, S. L., Agarwal, K., and Schwan, K. Graphreduce: Processing large-scale graphs on accelerator-based systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (New York, NY, USA, 2015), SC ’15, ACM, pp. 28:1–28:12.
  • [38] Sukhbaatar, S., Szlam, A., and Fergus, R. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems (NIPS) (2016), pp. 2244–2252.
  • [39] Wikipedia. PCI Express, 2018. https://en.wikipedia.org/wiki/PCI_Express.
  • [40] Wu, M., Yang, F., Xue, J., Xiao, W., Miao, Y., Wei, L., Lin, H., Dai, Y., and Zhou, L. GraM: Scaling graph computation to the trillions. In Proceedings of the Sixth ACM Symposium on Cloud Computing (2015), SoCC’15, ACM.
  • [41] Xiao, W., Xue, J., Miao, Y., Li, Z., Chen, C., Wu, M., Li, W., and Zhou, L. TuX2: Distributed Graph Computation for Machine Learning. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) (Boston, MA, 2017), USENIX Association, pp. 669–682.
  • [42] Yu, D., Eversole, A., Seltzer, M., Yao, K., Kuchaiev, O., Zhang, Y., Seide, F., Huang, Z., Guenter, B., Wang, H., Droppo, J., Zweig, G., Rossbach, C., Gao, J., Stolcke, A., Currey, J., Slaney, M., Chen, G., Agarwal, A., Basoglu, C., Padmilac, M., Kamenev, A., Ivanov, V., Cypher, S., Parthasarathi, H., Mitra, B., Peng, B., and Huang, X. An introduction to computational networks and the computational network toolkit. Tech. rep., October 2014.
  • [43] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (2012), NSDI’12, USENIX.
  • [44] Zhang, K., Chen, R., and Chen, H. NUMA-aware graph-structured analytics. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2015), PPoPP’15, ACM.
  • [45] Zheng, D., Mhembere, D., Burns, R., Vogelstein, J., Priebe, C. E., and Szalay, A. S. FlashGraph: Processing billion-node graphs on an array of commodity SSDs. In 13th USENIX Conference on File and Storage Technologies (2015), FAST’15, USENIX.
  • [46] Zhong, J., and He, B. Medusa: A parallel graph processing system on graphics processors. SIGMOD Rec. 43, 2 (Dec. 2014), 35–40.

5 Applications

NGra can support many different types of graph-based neural networks [4, 5, 9, 15, 13, 22, 25, 29, 38]. In this section, we present the layer-wise programs of several representative GNN models from the literature.

Communication neural net (CommNet)  [38] is a model in which cooperating agents learn to communicate with one another before taking actions. Each agent is controlled by a deep feed-forward network that additionally receives the summed transmissions of the other agents connected to it. This model can be applied to learning-to-communicate tasks such as traffic control. In CommNet there is no computation on edges, so the ApplyEdge stage is simply a passthrough (see Figure 4.2). Each agent receives the summed transmissions of its neighbors via the Gather stage, and produces its new hidden state in the ApplyVertex stage.
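The stage decomposition above can be sketched as follows. This is a minimal NumPy illustration of one CommNet layer in the four SAGA-NN stages, not NGra's actual API: the stage names follow the paper, while the dense computation (weights W_h, W_c and the tanh nonlinearity) follows the CommNet formulation and is chosen here for illustration.

```python
import numpy as np

def commnet_layer(H, edges, W_h, W_c):
    """One CommNet layer in SAGA-NN stages.

    H:     (num_vertices, d) matrix of per-agent hidden states.
    edges: list of (src, dst) pairs; dst receives src's transmission.
    """
    # Scatter: each source vertex places its hidden state on its out-edges.
    edge_vals = [H[src] for src, _ in edges]
    # ApplyEdge: passthrough -- CommNet performs no per-edge computation.
    # Gather: each vertex sums the transmissions arriving on its in-edges.
    C = np.zeros_like(H)
    for (_, dst), val in zip(edges, edge_vals):
        C[dst] += val
    # ApplyVertex: combine a vertex's own state with the gathered sum.
    return np.tanh(H @ W_h + C @ W_c)

rng = np.random.default_rng(0)
d = 4
H = rng.standard_normal((3, d))
W_h = rng.standard_normal((d, d))
W_c = rng.standard_normal((d, d))
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
H_next = commnet_layer(H, edges, W_h, W_c)
```

In a real multi-layer model this function would be applied repeatedly, with the Gather stage mapped to an optimized sparse operator rather than the Python loop shown here.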

Graph convolutional networks (GCN)  [15, 22] generalize the notion of convolution in CNNs to operations on arbitrary graphs. The algorithm has been applied to many semi-supervised and unsupervised graph clustering problems, such as entity classification in a k