8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks

08/26/2020
by Charlene Yang, et al.

Performance optimization can be a daunting task, especially as hardware architectures become more and more complex. This paper takes a kernel from the materials science code BerkeleyGW and demonstrates a few performance analysis and optimization techniques. Despite challenges such as high register usage, low occupancy, complex data access patterns, and the existence of several long-latency instructions, we have achieved 3.7 TFLOP/s of double-precision performance on an NVIDIA V100 GPU with 8 optimization steps. This is 55% of the theoretical peak, 6.7 TFLOP/s, at the nominal frequency of 1312 MHz, and 70% of the more customized peak based on our 58% FMA ratio, 5.3 TFLOP/s. An array of techniques used to analyze this OpenACC kernel and optimize its performance is shown, including the use of the hierarchical Roofline performance model and the performance tool Nsight Compute. This kernel exhibits computational characteristics that are commonly seen in many high-performance computing (HPC) applications, and the techniques are expected to be very helpful to a general audience of HPC developers and computational scientists as they pursue more performance on NVIDIA GPUs.
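The peak figures quoted in the abstract can be checked with a little arithmetic. The sketch below is not taken from the paper: it assumes the V100's published configuration of 80 SMs with 32 FP64 units each, counts an FMA as 2 FLOPs, and reads the abstract's "58" figure as the fraction of floating-point instructions that are FMAs. The function names and the adjusted-peak formula are illustrative assumptions, not the authors' definitions.

```python
def peak_tflops(num_sms, fp64_units_per_sm, clock_mhz, flops_per_instr=2.0):
    """Theoretical FP64 peak in TFLOP/s, assuming one instruction per unit per cycle."""
    return num_sms * fp64_units_per_sm * flops_per_instr * clock_mhz * 1e6 / 1e12


def fma_adjusted_peak(all_fma_peak, fma_ratio):
    """Peak when only a fraction of FP instructions are FMAs (assumed model).

    An FMA counts as 2 FLOPs and any other FP instruction as 1 FLOP, so the
    average is (2*r + (1 - r)) FLOPs per instruction against 2 for pure FMA.
    """
    return all_fma_peak * (1.0 + fma_ratio) / 2.0


# Assumed V100 configuration: 80 SMs, 32 FP64 units per SM, 1312 MHz nominal clock.
theoretical = peak_tflops(80, 32, 1312)
customized = fma_adjusted_peak(theoretical, 0.58)  # 0.58 is the assumed FMA ratio
achieved = 3.7  # TFLOP/s, from the abstract

print(f"theoretical peak: {theoretical:.1f} TFLOP/s")
print(f"customized peak:  {customized:.1f} TFLOP/s")
print(f"fraction of theoretical peak: {achieved / theoretical:.0%}")
print(f"fraction of customized peak:  {achieved / customized:.0%}")
```

Under these assumptions the two peaks round to 6.7 and 5.3 TFLOP/s, and 3.7 TFLOP/s corresponds to roughly 55% and 70% of them respectively, which agrees with the percentages stated in the abstract.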


research
05/03/2023

Prediction of Performance and Power Consumption of GPGPU Applications

Graphics Processing Units (GPUs) have become an integral part of High-Pe...
research
05/17/2023

Optimization and Portability of a Fusion OpenACC-based FORTRAN HPC Code from NVIDIA to AMD GPUs

NVIDIA has been the main provider of GPU hardware in HPC systems for ove...
research
10/22/2018

Double-precision FPUs in High-Performance Computing: an Embarrassment of Riches?

Among the (uncontended) common wisdom in High-Performance Computing (HPC...
research
01/13/2022

Development and performance of a HemeLB GPU code for human-scale blood flow simulation

In recent years, it has become increasingly common for high performance ...
research
07/07/2020

On the Efficient Evaluation of the Exchange Correlation Potential on Graphics Processing Unit Clusters

The predominance of Kohn-Sham density functional theory (KS-DFT) for the...
research
01/07/2020

High-Performance Statistical Computing in the Computing Environments of the 2020s

Technological advances in the past decade, hardware and software alike, ...
research
11/02/2017

Acceleration of tensor-product operations for high-order finite element methods

This paper is devoted to GPU kernel optimization and performance analysi...
