Distributed Hybrid CPU and GPU training for Graph Neural Networks on Billion-Scale Graphs

12/31/2021
by Da Zheng, et al.

Graph neural networks (GNNs) have shown great success in learning from graph-structured data and are widely used in applications such as recommendation, fraud detection, and search. In these domains the graphs are typically large, containing hundreds of millions or even billions of nodes. To tackle this scale, we develop DistDGLv2, a system that extends DistDGL for training GNNs in a mini-batch fashion, using distributed hybrid CPU/GPU training. DistDGLv2 places the graph data in distributed CPU memory and performs mini-batch computation on GPUs. It distributes the graph and its associated data (initial features) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGLv2 follows a synchronous training approach and allows the ego-networks that form mini-batches to include non-local nodes. To minimize the overheads of distributed computation, DistDGLv2 uses a multi-level graph partitioning algorithm with min-edge cuts and multiple balancing constraints, which localizes computation at both the machine level and the GPU level and statically balances the computation. DistDGLv2 further deploys an asynchronous mini-batch generation pipeline that makes all computation and data access asynchronous, so as to fully utilize every hardware resource (CPU, GPU, network, PCIe). This combination allows DistDGLv2 to train high-quality models while achieving high parallel efficiency and memory scalability. We demonstrate DistDGLv2 on various GNN workloads: it achieves a 2-3X speedup over DistDGL and an 18X speedup over Euler, and completes an epoch in only 5-10 seconds on graphs with hundreds of millions of nodes on a cluster with 64 GPUs.
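
The abstract describes two stages: an offline partitioning step and an online distributed training loop. The sketches below illustrate both using DGL's publicly documented dgl.distributed Python API, on which DistDGL and DistDGLv2 are built; exact class and argument names vary across DGL releases, and the dataset (Reddit), file paths ('ip_config.txt', '4part_data'), fan-outs, and hyperparameters are illustrative assumptions rather than the paper's configuration.

    # Offline: METIS-based multi-level min-edge-cut partitioning with
    # balancing constraints (training nodes and edges per partition).
    import dgl

    g = dgl.data.RedditDataset()[0]              # any DGLGraph with features
    dgl.distributed.partition_graph(
        g,
        graph_name='reddit',
        num_parts=4,                             # one partition per machine
        out_path='4part_data',
        balance_ntypes=g.ndata['train_mask'],    # balance training nodes
        balance_edges=True)                      # balance edge counts

Each trainer process (one per GPU) then connects to the distributed graph, whose structure and features live in CPU memory spread across the cluster, samples ego-networks for its mini-batches, and runs the GNN computation on its local GPU:

    import dgl
    import dgl.nn as dglnn
    import torch as th
    import torch.nn as nn
    import torch.nn.functional as F

    class SAGE(nn.Module):
        """3-layer GraphSAGE over sampled mini-batch blocks."""
        def __init__(self, in_feats, n_hidden, n_classes):
            super().__init__()
            self.layers = nn.ModuleList([
                dglnn.SAGEConv(in_feats, n_hidden, 'mean'),
                dglnn.SAGEConv(n_hidden, n_hidden, 'mean'),
                dglnn.SAGEConv(n_hidden, n_classes, 'mean')])

        def forward(self, blocks, x):
            for i, (layer, block) in enumerate(zip(self.layers, blocks)):
                x = layer(block, x)
                if i != len(self.layers) - 1:
                    x = F.relu(x)
            return x

    dgl.distributed.initialize('ip_config.txt')  # machine list (assumed file)
    th.distributed.init_process_group('gloo')    # backend is an assumption
    g = dgl.distributed.DistGraph('reddit',
                                  part_config='4part_data/reddit.json')

    # Give each trainer a disjoint, evenly sized slice of the training nodes.
    train_nid = dgl.distributed.node_split(g.ndata['train_mask'],
                                           g.get_partition_book())

    # Sample 3-hop ego-networks (fan-outs 15/10/5) from the distributed
    # graph; the sampled neighborhood may include non-local nodes.
    sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
    dataloader = dgl.dataloading.DistNodeDataLoader(
        g, train_nid, sampler, batch_size=1024, shuffle=True)

    dev_id = 0                                   # assumed: one trainer per GPU
    device = th.device('cuda', dev_id)
    model = SAGE(g.ndata['feat'].shape[1], 256, 41).to(device)  # Reddit: 41 classes
    model = th.nn.parallel.DistributedDataParallel(
        model, device_ids=[dev_id], output_device=dev_id)
    opt = th.optim.Adam(model.parameters(), lr=0.001)

    for input_nodes, seeds, blocks in dataloader:
        # Pull input features out of distributed CPU memory, then move the
        # mini-batch graph and tensors to the GPU for the GNN computation.
        x = g.ndata['feat'][input_nodes].to(device)
        y = g.ndata['label'][seeds].to(device).long()
        blocks = [b.to(device) for b in blocks]
        loss = F.cross_entropy(model(blocks, x), y)
        opt.zero_grad()
        loss.backward()                          # gradients all-reduced by DDP
        opt.step()

The min-edge-cut partitioning keeps most sampled neighbors on the machine that owns the seed nodes (the owner-compute rule), while DistDGLv2's asynchronous pipeline additionally overlaps the CPU-side sampling and feature lookup in this loop with the GPU-side forward and backward passes.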


