An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels

11/04/2020
by Nilanjan Goswami, et al.

The growing deployment of power- and energy-efficient throughput accelerators (GPUs) in data centers demands stronger power-performance co-optimization capabilities in GPUs. Realizing exascale computing with accelerators requires further improvements in power efficiency. With hardware support for kernel concurrency in accelerators, simultaneous execution of inter- and intra-workload kernels promises higher throughput at a lower energy budget. Improving the performance-per-watt of these architectures calls for a systematic empirical study of real-world throughput workloads with concurrent kernel execution. To this end, we propose a multi-kernel throughput workload generation framework that facilitates aggressive energy and performance management of exascale data centers and stimulates synergistic power-performance co-optimization of throughput architectures. We also present a multi-kernel throughput benchmark suite, built on the framework, that encapsulates symmetric, asymmetric, and co-existing (kernels that often appear together) workloads. On average, our analysis reveals that spatial and temporal concurrency in kernel execution on throughput architectures reduces energy consumption by 32% on the NVIDIA K20 across 12 benchmarks. Concurrency and increased utilization are often correlated, but they do not imply a significant deviation in power dissipation. A diversity analysis of the proposed multi-kernel workloads confirms characteristic variation and power-profile diversity within the suite. In addition, we report several findings regarding power-performance co-optimization of concurrent throughput workloads.
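
The kernel concurrency described above refers to a GPU's ability to co-schedule independent kernels on the same device. Purely as an illustration (not part of the proposed framework or benchmark suite), the following minimal CUDA sketch launches two hypothetical kernels, scale_kernel and offset_kernel, on separate streams so the hardware may overlap their execution:

#include <cuda_runtime.h>

// Two small, independent kernels (hypothetical examples for illustration only).
__global__ void scale_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void offset_kernel(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));

    // Kernels launched on distinct streams have no ordering dependency, so the
    // hardware scheduler may run them concurrently if SM resources allow.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    scale_kernel<<<n / 256, 256, 0, s0>>>(x, n);
    offset_kernel<<<n / 256, 256, 0, s1>>>(y, n);

    cudaDeviceSynchronize();  // wait for both kernels before sampling power/energy
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    return 0;
}

Whether the two launches actually overlap depends on the device's resources and scheduler, which is exactly the kind of behavior a concurrent-kernel benchmark suite must expose and measure.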


