Data-parallel distributed training of very large models beyond GPU capacity

11/29/2018
by Samuel Matzek, et al.

GPUs have limited memory, which makes it difficult to train wide and/or deep models whose training process exceeds GPU memory capacity. This paper shows how an open source tool called Large Model Support (LMS) can utilize a high bandwidth NVLink connection between CPUs and GPUs to train deep convolutional networks. LMS performs tensor swapping between CPU memory and GPU memory such that only the minimal set of tensors required in a training step is kept in GPU memory. It is also shown how LMS can be combined with an MPI-based distributed deep learning module to train models in a data-parallel fashion across multiple GPUs, with each GPU using CPU memory for its own tensor swapping. The hardware architecture that enables the high bandwidth GPU-CPU link is discussed, as well as the associated set of software tools available in the PowerAI package.
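The tensor-swapping idea can be illustrated outside of LMS itself. The sketch below is a minimal, assumption-laden approximation in PyTorch, not the paper's LMS implementation: activations saved for the backward pass are packed off to CPU memory during the forward pass and copied back to the GPU only when backward needs them. The model, layer sizes, and training step are purely illustrative.

import torch
import torch.nn as nn

def pack_to_cpu(tensor):
    # Called when autograd saves a tensor for backward: move it to CPU memory
    # and remember the device it came from.
    return tensor.device, tensor.to("cpu", non_blocking=True)

def unpack_from_cpu(packed):
    # Called when backward needs the tensor again: copy it back to the GPU.
    device, cpu_tensor = packed
    return cpu_tensor.to(device, non_blocking=True)

# Illustrative model and batch; sizes are arbitrary assumptions.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(64, 4096, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

# Activations saved during this forward pass live in CPU memory until backward.
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

In the setting described by the paper, the same principle is applied per GPU in a data-parallel job: each worker swaps its own tensors to CPU memory over the NVLink connection while an MPI-based module (the distributed deep learning library shipped with PowerAI) aggregates gradients across GPUs.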


