High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

07/01/2022
by   William S. Moses, et al.

While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model (CUDA) into another (CPU threads) based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 76% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer, which makes use of transpiled CUDA PyTorch kernels, outperforms the PyTorch CPU native backend by 2.7×.

