Enabling Data Movement and Computation Pipelining in Deep Learning Compiler

10/29/2022
by Guyue Huang, et al.

Pipelining between data loading and computation is a critical tensor program optimization for GPUs. Multi-stage pipelining across the GPU's multi-level buffer hierarchy is particularly indispensable on the latest NVIDIA Ampere GPUs to reduce resource idleness and guarantee kernel performance. Currently, this optimization is accessible only through expert-written libraries such as cuBLAS rather than through a tensor program transformation, which makes it inextensible to new operators and uncomposable with prior tensor-compiler optimizations. We present ALCOP, an automatic pipelining framework built on the TVM infrastructure that overcomes three critical obstacles in generating pipelined code: detection of pipelining-applicable buffers, program transformation for multi-level, multi-stage pipelining, and efficient schedule-parameter search incorporating static analysis. Experiments show that ALCOP generates programs with a 1.23x average speedup (up to 1.73x) over vanilla TVM. On end-to-end models, ALCOP improves upon TVM by up to 1.18x and upon XLA by up to 1.64x. Moreover, our performance model significantly improves the efficiency of schedule tuning: it finds schedules reaching 99% of the performance given by exhaustive search while requiring 40x fewer trials.
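The multi-stage pipelining the abstract describes can be illustrated with a minimal Python sketch (not ALCOP's actual transformation, and the names here are hypothetical): loads are issued STAGES iterations ahead of the compute that consumes them, so a circular buffer of STAGES slots, standing in for GPU shared-memory stages, keeps data movement and computation overlapped.

```python
# Minimal software-pipelining sketch, assuming a tile-wise reduction.
# On Ampere GPUs the "issue load" steps would map to cp.async copies
# into shared memory; here they are modeled as plain buffer writes.

STAGES = 3  # pipeline depth (number of in-flight buffer slots)

def pipelined_sum(tiles):
    """Sum all tiles, prefetching STAGES tiles ahead of the compute."""
    buf = [None] * STAGES  # circular buffer modeling shared memory
    n = len(tiles)
    total = 0
    # Prologue: fill the pipeline with the first STAGES loads.
    for i in range(min(STAGES, n)):
        buf[i % STAGES] = tiles[i]          # issue load for tile i
    # Steady state: compute on one slot, then prefetch the tile
    # STAGES iterations ahead into the slot just freed.
    for i in range(n):
        total += sum(buf[i % STAGES])       # compute consumes slot i
        nxt = i + STAGES
        if nxt < n:
            buf[nxt % STAGES] = tiles[nxt]  # issue load for tile nxt
    return total
```

In a real kernel the load issue is asynchronous, so the prologue hides the latency of the first STAGES copies behind the steady-state computes; the sketch only shows the buffer-rotation structure that a pipelining transformation must generate.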


