Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning

10/20/2021
by Ningning Xie et al.

We present a novel characterization of the mapping of multiple parallelism forms (e.g. data and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and greatly reduces the space of software-to-hardware mappings. We experimentally verify the substantial effect of these mappings on all-reduce performance (up to 448x). We offer a novel syntax-guided program synthesis framework that is able to decompose reductions over one or more parallelism axes into sequences of collectives in a hierarchy- and mapping-aware way. For 69% of parallelism placements and user-requested reductions, our framework synthesizes programs that outperform the default all-reduce implementation when evaluated on different GPU hierarchies (max 2.04x, average 1.27x). We complement our synthesis tool with a simulator exceeding 90% accuracy, which reduces the need for massive evaluations of synthesis results to determine a small set of optimal programs and mappings.
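The core idea, decomposing a single reduction over a parallelism axis into a sequence of per-level collectives that respect the hardware hierarchy, can be illustrated with a minimal sketch. The Python below models devices as in-memory arrays and shows one classic hierarchy-aware decomposition (reduce-scatter within a node, all-reduce across nodes, all-gather within a node). The function names and the fixed two-level hierarchy are illustrative assumptions, not the paper's synthesis framework, which searches over many such decompositions per placement.

# Minimal, hypothetical sketch of a hierarchy-aware all-reduce
# decomposition. Devices are plain numpy arrays; no real GPU
# communication happens here. Names are illustrative only.
import numpy as np

def allreduce_flat(shards):
    """Baseline: one flat all-reduce over every device."""
    total = np.sum(shards, axis=0)
    return [total.copy() for _ in shards]

def allreduce_hierarchical(shards, devices_per_node):
    """Decomposed schedule: reduce-scatter within each node,
    all-reduce the partial chunks across nodes, then all-gather
    within each node to reassemble the full result."""
    nodes = [shards[i:i + devices_per_node]
             for i in range(0, len(shards), devices_per_node)]
    # 1) Reduce-scatter inside each node: conceptually, device k of a
    #    node keeps chunk k of the node-local sum.
    chunked = [np.array_split(np.sum(node, axis=0), devices_per_node)
               for node in nodes]
    # 2) Cross-node all-reduce, one per chunk index, between peer devices.
    for k in range(devices_per_node):
        chunk_sum = np.sum([c[k] for c in chunked], axis=0)
        for c in chunked:
            c[k] = chunk_sum.copy()
    # 3) All-gather inside each node to reconstruct the full tensor.
    full = [np.concatenate(c) for c in chunked]
    return [full[i // devices_per_node].copy() for i in range(len(shards))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shards = [rng.standard_normal(8) for _ in range(4)]  # 2 nodes x 2 GPUs
    flat = allreduce_flat(shards)
    hier = allreduce_hierarchical(shards, devices_per_node=2)
    assert all(np.allclose(a, b) for a, b in zip(flat, hier))

Running the sketch confirms that the decomposed schedule computes the same result as the flat all-reduce; the question the paper's synthesizer and simulator address is which of the many valid decompositions is fastest for a given parallelism placement and hardware hierarchy.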


