Learning to Schedule Halide Pipelines for the GPU

12/13/2020
by Luke Anderson, et al.

We present a new algorithm to automatically generate high-performance GPU implementations of complex imaging and machine learning pipelines, directly from high-level Halide algorithm code. It is fully automatic, requiring no schedule templates or hand-optimized kernels, and it targets a range of computations significantly broader than that addressed by existing autoschedulers. We address the scalability challenge of extending previous approaches to schedule large, real-world programs, while enabling a broad set of program rewrites that account for the nested parallelism and memory hierarchy introduced by GPU architectures. We achieve this with two techniques: a hierarchical sampling strategy that groups programs into buckets based on their structural similarity and then samples representatives for evaluation, allowing us to explore a large space while considering only a subset of it; and a pre-pass that 'freezes' decisions for the lowest-cost sections of a program, allowing more time to be spent on the important stages. We then apply an efficient cost model combining machine learning, program analysis, and GPU architecture knowledge. Our method scales combinatorially better than previous work with respect to the deeper nested parallelism required by GPUs. We evaluate its performance on a diverse suite of real-world imaging and machine learning pipelines. Our results are on average 1.66X faster than existing automatic solutions (and up to 5X faster), and competitive with what the best human experts were able to achieve in an active effort to beat our automatic results.
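The two search-space reduction ideas described in the abstract (bucketed sampling by structural similarity, and freezing low-cost stages) can be sketched in a few lines. The sketch below is illustrative only: `structural_hash`, `estimate_cost`, and `stage_cost` are assumed callables standing in for the paper's structural-similarity measure and learned cost model, not its actual API.

```python
import random
from collections import defaultdict

def hierarchical_sample(candidates, num_buckets, structural_hash, estimate_cost):
    """Group candidate schedules by structural similarity, then evaluate
    one representative from each of a sampled subset of buckets."""
    buckets = defaultdict(list)
    for cand in candidates:
        buckets[structural_hash(cand)].append(cand)

    # Sample a subset of buckets rather than enumerating the full space.
    sampled = random.sample(list(buckets), min(num_buckets, len(buckets)))

    best, best_cost = None, float("inf")
    for key in sampled:
        rep = random.choice(buckets[key])  # one representative per bucket
        cost = estimate_cost(rep)          # assumed learned cost model
        if cost < best_cost:
            best, best_cost = rep, cost
    return best

def freeze_cheap_stages(stages, stage_cost, budget_fraction=0.1):
    """Hypothetical pre-pass: lock in scheduling decisions for stages that
    together account for at most `budget_fraction` of total estimated cost,
    so search effort is concentrated on the expensive stages."""
    total = sum(stage_cost(s) for s in stages)
    frozen, free, spent = [], [], 0.0
    for s in sorted(stages, key=stage_cost):
        if spent + stage_cost(s) <= budget_fraction * total:
            frozen.append(s)
            spent += stage_cost(s)
        else:
            free.append(s)
    return frozen, free
```

In the full algorithm, frozen stages keep their current scheduling decisions while the search continues to refine the remaining, more expensive stages.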


research
03/14/2019

Stripe: Tensor Compilation via the Nested Polyhedral Model

Hardware architectures and machine learning (ML) libraries evolve rapidl...
research
09/16/2019

Model-Based Warp-Level Tiling for Image Processing Programs on GPUs

The efficient execution of image processing pipelines on GPUs is an area...
research
04/17/2020

GEVO: GPU Code Optimization using Evolutionary Computation

GPUs are a key enabler of the revolution in machine learning and high pe...
research
10/29/2022

Enabling Data Movement and Computation Pipelining in Deep Learning Compiler

Pipelining between data loading and computation is a critical tensor pro...
research
04/23/2020

Cpp-Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System at Scale

The Cpp-Taskflow project addresses the long-standing question: How can w...
research
10/20/2021

Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning

We present a novel characterization of the mapping of multiple paralleli...
