Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems

08/23/2022
by Prasoon Sinha, et al.

Scientists are increasingly exploring and utilizing the massive parallelism of general-purpose accelerators such as GPUs for scientific breakthroughs. As a result, datacenters, hyperscalers, national computing centers, and supercomputers have procured hardware to support this evolving application paradigm. These systems contain hundreds to tens of thousands of accelerators, enabling peta- and exa-scale levels of compute for scientific workloads. Recent work demonstrated that power management (PM) can impact application performance in CPU-based HPC systems, even when machines have the same architecture and SKU (stock keeping unit). This variation occurs due to manufacturing variability and the chip's PM. However, while modern HPC systems widely employ accelerators such as GPUs, it is unclear how much this variability affects applications. Accordingly, we seek to characterize the extent of variation due to GPU PM in modern HPC and supercomputing systems. We study a variety of applications that stress different GPU components on five large-scale computing centers with modern GPUs: Oak Ridge's Summit, Sandia's Vortex, TACC's Frontera and Longhorn, and Livermore's Corona. These clusters use a variety of cooling methods and GPU vendors. In total, we collect over 18,800 hours of data across more than 90% of the GPUs in these clusters. Regardless of the application, cluster, GPU vendor, and cooling method, our results show significant variation: 8% average performance variation even though the GPU architecture and vendor SKU are identical within each cluster, with outliers up to 1.5X slower than the median GPU. These results highlight the difficulty in efficiently using existing GPU clusters for modern HPC and scientific workloads, and the need to embrace variability in future accelerator-based systems.
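To make the two headline figures concrete (variation across nominally identical GPUs, and outliers measured against the median GPU), the following is a minimal sketch of how such numbers could be computed from per-GPU runtimes of the same benchmark. The function name `variation_summary`, the GPU labels, the runtimes, and the choice of (max - min) / median as the spread metric are all assumptions made for illustration; the abstract does not state the authors' exact definition.

```python
from statistics import median


def variation_summary(runtimes_s):
    """Summarize the spread of per-GPU runtimes relative to the median GPU.

    runtimes_s maps a GPU label to the runtime (in seconds) of one
    benchmark on that GPU. The spread metric is an illustrative
    assumption, not necessarily the metric used in the paper.
    """
    values = list(runtimes_s.values())
    med = median(values)
    # Slowdown of each GPU relative to the median GPU (1.0 == median).
    slowdown = {gpu: t / med for gpu, t in runtimes_s.items()}
    # Range of runtimes expressed as a percentage of the median runtime.
    spread_pct = 100.0 * (max(values) - min(values)) / med
    worst_gpu = max(slowdown, key=slowdown.get)
    return {
        "median_s": med,
        "spread_pct": spread_pct,
        "worst_gpu": worst_gpu,
        "worst_slowdown": slowdown[worst_gpu],
    }


# Hypothetical runtimes for five GPUs of the same SKU (made-up values).
runtimes = {"gpu0": 10.0, "gpu1": 10.2, "gpu2": 9.8, "gpu3": 10.4, "gpu4": 15.0}
summary = variation_summary(runtimes)
print(summary["worst_gpu"], round(summary["worst_slowdown"], 2))
```

Normalizing to the median rather than the mean keeps a single slow GPU from shifting the baseline it is compared against, which matches the abstract's phrasing of outliers as slower "than the median GPU".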


research
06/01/2020

Exceeding Conservative Limits: A Consolidated Analysis on Modern Hardware Margins

Modern large-scale computing systems (data centers, supercomputers, clou...
research
09/20/2022

Application Experiences on a GPU-Accelerated Arm-based HPC Testbed

This paper assesses and reports the experience of ten teams working to p...
research
01/13/2022

Development and performance of a HemeLB GPU code for human-scale blood flow simulation

In recent years, it has become increasingly common for high performance ...
research
05/05/2022

ChASE – A Distributed Hybrid CPU-GPU Eigensolver for Large-scale Hermitian Eigenvalue Problems

As modern massively parallel clusters are getting larger with beefier co...
research
07/07/2020

On the Efficient Evaluation of the Exchange Correlation Potential on Graphics Processing Unit Clusters

The predominance of Kohn-Sham density functional theory (KS-DFT) for the...
research
12/26/2017

The L-CSC cluster: greenest supercomputer in the world in Green500 list of November 2014

The L-CSC (Lattice Computer for Scientific Computing) is a general purpo...
research
11/28/2018

The L-CSC cluster: Optimizing power efficiency to become the greenest supercomputer in the world in the Green500 list of November 2014

The L-CSC (Lattice Computer for Scientific Computing) is a general purpo...
