Performance Optimization of SU3_Bench on Xeon and Programmable Integrated Unified Memory Architecture

02/28/2021
by Jesmin Jahan Tithi, et al.

SU3_Bench is a microbenchmark developed to explore performance portability across multiple programming models and methodologies using a simple but nontrivial mathematical kernel. The kernel is derived from the MILC lattice quantum chromodynamics (LQCD) code. SU3_Bench is bandwidth bound and generates regular compute and data access patterns, so on most traditional CPU- and GPU-based systems its performance is determined mainly by the achievable memory bandwidth. Although SU3_Bench is a simple kernel, in practice its subtleties require tuning to reach peak performance for a given programming model and hardware, which makes performance portability challenging. In this paper, we share some of the challenges in obtaining peak performance for SU3_Bench on a state-of-the-art Intel Xeon machine. These challenges arise from the nuances of variable definition, the nature of compiler-provided default constructors, how memory is accessed at object creation time, and the NUMA effects on the machine. We discuss how to tackle them to improve SU3_Bench's performance by 2× compared to the original OpenMP implementation available on GitHub. The findings offer a useful lesson for similar kernels. Expanding on the performance portability aspects, we also show early results from porting SU3_Bench to the new Intel Programmable Integrated Unified Memory Architecture (PIUMA), which is characterized by a more balanced flops-to-byte ratio. We show that pipeline throughput, rather than the usual memory bandwidth or flops, determines SU3_Bench's performance on PIUMA. Finally, we show how to improve performance on PIUMA and how it compares with performance on Xeon, which has around one order of magnitude more flops per byte.
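To make the "bandwidth bound" claim concrete, the following is a minimal Python sketch of a 3×3 complex matrix product (the kind of operation an SU(3) kernel performs) with a rough flops-per-byte estimate. It is not taken from the SU3_Bench sources. The function names, the single-precision default, and the assumption that only one input matrix and one output matrix move through memory per product (with the other operand staying in cache) are all illustrative assumptions.

```python
# Illustrative sketch only; names and traffic model are assumptions,
# not the SU3_Bench implementation.

def su3_matmul(a, b):
    """Multiply two 3x3 complex matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def flops_per_matmul():
    """Count real floating-point operations in one 3x3 complex product."""
    # Per output element: 3 complex multiplies (6 real flops each)
    # plus 2 complex additions (2 real flops each).
    return 9 * (3 * 6 + 2 * 2)

def arithmetic_intensity(bytes_per_real=4, matrices_moved=2):
    """Estimate flops per byte of memory traffic for one product.

    Assumption: `matrices_moved` matrices (here one read, one written)
    travel to or from main memory; everything else is served from cache.
    """
    bytes_per_matrix = 9 * 2 * bytes_per_real  # 9 complex entries
    return flops_per_matmul() / (matrices_moved * bytes_per_matrix)

# Identity matrix, handy for sanity-checking the product.
IDENTITY = [[1 + 0j if i == j else 0j for j in range(3)] for i in range(3)]

if __name__ == "__main__":
    print("flops per product:", flops_per_matmul())
    print("estimated flops per byte:", arithmetic_intensity())
```

Under these assumptions the estimate comes out to roughly one flop per byte. A machine that can execute many more flops than that per byte of memory bandwidth would be limited by memory traffic on such a kernel, which is consistent with the behavior the abstract describes for traditional CPU and GPU systems.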

research
09/16/2023

Comparative evaluation of bandwidth-bound applications on the Intel Xeon CPU MAX Series

In this paper we explore the performance of Intel Xeon MAX CPU Series, r...
research
10/31/2020

An analytic performance model for overlapping execution of memory-bound loop kernels on multicore CPUs

Complex applications running on multicore processors show a rich perform...
research
03/04/2021

ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

The A64FX CPU is arguably the most powerful Arm-based processor design t...
research
08/29/2022

Improving the Efficiency of OpenCL Kernels through Pipes

In an effort to lower the barrier to the adoption of FPGAs by a broader ...
research
10/13/2020

PIUMA: Programmable Integrated Unified Memory Architecture

High performance large scale graph analytics is essential to timely anal...
research
01/21/2020

Lattice QCD on a novel vector architecture

The SX-Aurora TSUBASA PCIe accelerator card is the newest model of NEC's...
research
09/03/2022

Ridgeline: A 2D Roofline Model for Distributed Systems

In this short paper, we introduce the Ridgeline model, an extension of t...
