DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks

04/14/2021
by Vasimuddin Md, et al.

Full-batch training of Graph Neural Networks (GNNs) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible. It is challenging due to the large memory capacity and bandwidth requirements on a single compute node and the high communication volumes across multiple nodes. In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters via an efficient shared-memory implementation, communication reduction using a minimum-vertex-cut graph partitioning algorithm, and communication avoidance using a family of delayed-update algorithms. Our results on four common GNN benchmark datasets (Reddit, OGB-Products, OGB-Papers, and Proteins) show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets, over baseline DGL implementations running on a single CPU socket.
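
The communication-avoidance idea in the abstract, aggregating with a cached copy of remote data that is refreshed only periodically, can be sketched in a few lines. The toy NumPy example below is a hypothetical illustration of that delayed-update pattern on a two-partition vertex-cut graph, not DistGNN's or DGL's actual API; the partition layout, edge lists, `aggregate` helper, and `sync_interval` schedule are all invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): a graph is split by a vertex cut into two
# partitions; each rank owns a block of vertex features. Edges crossing
# the cut need remote features. Delayed update: aggregate with a cached,
# possibly stale snapshot of remote features and refresh that snapshot
# only every `sync_interval` epochs, trading staleness for communication.

num_p0, num_p1, dim = 4, 3, 8
owned = [rng.normal(size=(num_p0, dim)),   # features owned by rank 0
         rng.normal(size=(num_p1, dim))]   # features owned by rank 1

# local_edges[p]: (dst, src) pairs within partition p
# cut_edges[p]:   (dst, remote_src) pairs whose source is on the other rank
local_edges = [[(0, 1), (2, 3)], [(0, 1), (1, 2)]]
cut_edges = [[(0, 0), (3, 2)], [(2, 1)]]

# Stale snapshots of the other partition's features, refreshed lazily.
remote_cache = [owned[1].copy(), owned[0].copy()]

def aggregate(p):
    """Mean-aggregate neighbor features on partition p; cut edges read
    the cached remote snapshot instead of communicating every epoch."""
    out = np.zeros((len(owned[p]), dim))
    deg = np.zeros(len(owned[p]))
    for dst, src in local_edges[p]:
        out[dst] += owned[p][src]
        deg[dst] += 1
    for dst, rsrc in cut_edges[p]:
        out[dst] += remote_cache[p][rsrc]
        deg[dst] += 1
    return out / np.maximum(deg, 1)[:, None]

sync_interval = 4  # exchange fresh remote features every 4th epoch only
for epoch in range(8):
    if epoch % sync_interval == 0:
        # The only cross-partition communication in this sketch.
        remote_cache[0] = owned[1].copy()
        remote_cache[1] = owned[0].copy()
    for p in (0, 1):
        h = aggregate(p)
        # Stand-in for a GNN layer + optimizer step: nudge features toward
        # the aggregate so the cached copies actually go stale between syncs.
        owned[p] += 0.1 * (h - owned[p])
```

Between refreshes, each partition computes entire epochs with no cross-node traffic; how far the refresh can be delayed before staleness hurts accuracy is the trade-off that delayed-update schemes navigate.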

Related research

07/13/2020
Deep Graph Library Optimizations for Intel(R) x86 Architecture
The Deep Graph Library (DGL) was designed as a tool to enable structure ...

01/29/2022
Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity
More than 70% of these idle compute are cheap CPUs with few cores that ar...

08/06/2023
Communication-Free Distributed GNN Training with Vertex Cut
Training Graph Neural Networks (GNNs) on real-world graphs consisting of...

11/11/2022
DistGNN-MB: Distributed Large-Scale Graph Neural Network Training on x86 via Minibatch Sampling
Training Graph Neural Networks, on graphs containing billions of vertice...

09/14/2022
Tuple Packing: Efficient Batching of Small Graphs in Graph Neural Networks
When processing a batch of graphs in machine learning models such as Gra...

09/10/2017
Applying ACO To Large Scale TSP Instances
Ant Colony Optimisation (ACO) is a well known metaheuristic that has pro...

10/11/2020
DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs
Graph neural networks (GNN) have shown great success in learning from gr...
