Pushing the Limits of Online Auto-tuning: Machine Code Optimization in Short-Running Kernels

07/14/2017
by Fernando Endo, et al.

We propose an online auto-tuning approach for computing kernels. Unlike existing online auto-tuners, which regenerate code through long compilation chains from source to binary code, our approach deploys auto-tuning directly at the level of machine code generation. This allows auto-tuning to pay off even in very short-running applications. As a proof of concept, we demonstrate our approach on two benchmarks that run for only hundreds of milliseconds to a few seconds. In a CPU-bound kernel, the average speedups range from 1.10 to 1.58 depending on the target micro-architecture, and reach 2.53 in the most favourable conditions (all run-time overheads included). In a memory-bound kernel, which is less favourable to our run-time auto-tuning optimizations, the average speedups range from 1.04 to 1.10, and reach 1.30 in the best configuration. Despite the short execution times of our benchmarks, the overhead of our run-time auto-tuning is between 0.2 and 4.2 % of the total application execution time. By simulating the CPU-bound application on 11 different CPUs, we show that, despite the clear hardware disadvantage of In-Order (io) cores compared with equivalent Out-of-Order (ooo) cores, online auto-tuning on io CPUs obtains an average speedup of 1.03 and an energy efficiency improvement of 39 % over the SIMD reference on ooo CPUs.
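The general idea of online auto-tuning summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper generates machine code at run time, whereas this sketch assumes a fixed set of pre-written Python variants standing in for generated kernels. The names `OnlineTuner`, `timed_call`, and the three `sum_squares_*` functions are hypothetical and chosen only for illustration. Every call still does useful work, so the cost of trying a slower variant is paid inside the application run, which is why speedups have to be reported with run-time overheads included.

```python
import time


def timed_call(fn, arg):
    """Run fn(arg) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


class OnlineTuner:
    """Hypothetical online tuner: tries one candidate per kernel call,
    then keeps calling the fastest variant seen so far."""

    def __init__(self, reference, candidates, measure=timed_call):
        self.best = reference            # variant currently considered fastest
        self.best_cost = None            # its measured cost (None until measured)
        self.pending = list(candidates)  # variants not evaluated yet
        self.measure = measure           # injectable, so tests can avoid real timing

    def __call__(self, arg):
        if self.best_cost is None:
            # First call: measure the reference kernel to get a baseline.
            result, self.best_cost = self.measure(self.best, arg)
            return result
        if self.pending:
            # Exploration: evaluate the next candidate on real application data.
            candidate = self.pending.pop(0)
            result, cost = self.measure(candidate, arg)
            if cost < self.best_cost:
                self.best, self.best_cost = candidate, cost
            return result
        # Exploitation: all candidates tried, use the best one untimed.
        return self.best(arg)


# Three functionally equivalent variants of a toy kernel (assumed for illustration).
def sum_squares_loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_builtin(n):
    return sum(i * i for i in range(n))


def sum_squares_closed(n):
    return (n - 1) * n * (2 * n - 1) // 6


if __name__ == "__main__":
    tuner = OnlineTuner(sum_squares_loop, [sum_squares_builtin, sum_squares_closed])
    results = [tuner(100_000) for _ in range(10)]
    print(tuner.best.__name__, results[0])
```

A single timing per variant is noisy; a more careful tuner would average several calls per candidate and would bound the exploration budget so that tuning still pays off in a run lasting well under a second.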

