GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism

08/19/2023
by Jingji Chen, et al.

Current distributed full-graph GNN training methods adopt a variant of data parallelism, namely graph parallelism, in which the whole graph is partitioned into multiple subgraphs and each GPU processes one of them. This incurs high communication overhead because of inter-partition message passing at every layer. To address this, we propose a new training method named GNNPipe that instead adopts model parallelism, which has a lower worst-case asymptotic communication complexity than graph parallelism. To keep GPU utilization high, we combine model parallelism with a chunk-based pipelined training method, in which each GPU concurrently processes a different chunk of graph data at a different layer. We further propose hybrid parallelism, which combines model and graph parallelism when model-level parallelism alone is insufficient. We also introduce several techniques to preserve convergence speed and model accuracy despite the embedding staleness introduced by pipelining. Extensive experiments show that our method reduces per-epoch training time by up to 2.45x (on average 2.03x) and reduces communication volume and overhead by up to 22.51x and 27.21x (on average 10.27x and 14.96x), respectively, while achieving model accuracy and convergence speed comparable to graph parallelism.
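
To make the chunk-based pipelining concrete, the following is a minimal, hypothetical sketch (in Python, not the authors' implementation) of the schedule such a design implies: each GPU owns one contiguous group of GNN layers, the graph's vertices are split into chunks, and at each pipeline step GPU g works on chunk (step - g), so different GPUs process different chunks at different layers concurrently. The function name pipeline_schedule and all parameters are our own illustration.

def pipeline_schedule(num_gpus: int, num_chunks: int):
    """Yield, for each pipeline step, the (gpu, chunk) pairs that run
    concurrently: GPU g processes chunk (step - g) in its layer group."""
    num_steps = num_chunks + num_gpus - 1  # pipeline fill + drain
    for step in range(num_steps):
        active = []
        for gpu in range(num_gpus):
            chunk = step - gpu
            if 0 <= chunk < num_chunks:
                active.append((gpu, chunk))
        yield step, active

if __name__ == "__main__":
    # Example: 4 GPUs, each owning one group of GNN layers; 8 vertex chunks.
    for step, active in pipeline_schedule(num_gpus=4, num_chunks=8):
        print(f"step {step:2d}: " + ", ".join(f"GPU{g}:chunk{c}" for g, c in active))

In this schedule only adjacent layer groups exchange chunk embeddings, which is the intuition behind the lower worst-case communication complexity compared to the per-layer, all-partition message passing of graph parallelism.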
