Synkhronos: a Multi-GPU Theano Extension for Data Parallelism

10/11/2017
by Adam Stooke, et al.

We present Synkhronos, an extension to Theano for multi-GPU computations leveraging data parallelism. Our framework provides automated execution and synchronization across devices, allowing users to continue to write serial programs without risk of race conditions. The NVIDIA Collective Communications Library (NCCL) is used for high-bandwidth inter-GPU communication. Further enhancements to the Theano function interface include input slicing (with aggregation) and input indexing, which perform common data-parallel computation patterns efficiently. One example use case is synchronous SGD, which has recently been shown to scale well for a growing set of deep learning problems. When training ResNet-50, we achieve a near-linear speedup of 7.5x on an NVIDIA DGX-1 using 8 GPUs, relative to Theano-only code running on a single GPU in isolation. Yet Synkhronos remains general to any data-parallel computation programmable in Theano. By implementing parallelism at the level of individual Theano functions, our framework uniquely addresses a niche between manual multi-device programming and prescribed multi-GPU training routines.
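To illustrate the programming model the abstract describes, below is a minimal sketch of one synchronous SGD step written as a serial program, with input slicing and aggregation handled by the framework. The specific calls shown (synk.fork, synk.function, synk.distribute) are illustrative assumptions modeled on the description above, not verified signatures; actual usage may differ (for example, inputs may need to be wrapped in framework-managed data objects).

    # Illustrative sketch only: the synk.* names and signatures are
    # assumptions based on the abstract, not a verified Synkhronos API.
    import numpy as np
    import theano
    import theano.tensor as T
    import synkhronos as synk

    synk.fork()  # hypothetical: spawn one worker process per GPU

    # Build an ordinary (serial) Theano graph: linear model with MSE loss.
    x = T.matrix('x')
    y = T.vector('y')
    w = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='w')
    loss = T.mean((T.dot(x, w) - y) ** 2)
    g = T.grad(loss, w)

    # Hypothetical drop-in replacement for theano.function: the framework
    # slices the inputs across GPUs, runs the function on each device, and
    # aggregates the scalar output (here, by averaging).
    train = synk.function(inputs=[x, y], outputs=loss,
                          updates=[(w, w - 0.01 * g)])

    synk.distribute()  # hypothetical: replicate functions and shared
                       # variables to the worker GPUs

    # The call looks serial; input slicing and synchronization of the
    # shared parameter w across devices (e.g., NCCL all-reduce) would
    # happen behind the scenes.
    batch_x = np.random.randn(1024, 10).astype(theano.config.floatX)
    batch_y = np.random.randn(1024).astype(theano.config.floatX)
    mean_loss = train(batch_x, batch_y)

This is the niche the abstract identifies: the user writes one Theano function as if for a single device, and parallel execution is attached at the granularity of that function rather than through a prescribed training loop.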


