Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis

08/23/2022
by Hamdy Abdelkhalik, et al.

Graphics processing units (GPUs) are now considered the leading hardware for accelerating general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitecture features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and to build more efficient workloads and applications. Many works have studied the recent Nvidia architectures, such as Volta and Turing, comparing them to their successor, Ampere. However, some microarchitecture features, such as the clock cycles required by the different instructions, have not been extensively studied for the Ampere architecture. In this paper, we study the clock cycles per instruction for the various data types found in the instruction-set architecture (ISA) of Nvidia GPUs. Using microbenchmarks, we measure the clock cycles for PTX ISA instructions and their SASS ISA counterparts. We further calculate the clock cycles needed to access each memory unit. We also demystify the new version of the tensor core unit found in the Ampere architecture by using the WMMA API and measuring its clock cycles per instruction and throughput for the different data types and input shapes. The results of this work should guide software developers and hardware architects. Furthermore, clock cycles per instruction are widely used by performance modeling simulators and tools to model and predict the performance of the hardware.
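
As a rough illustration of the arithmetic behind a clock-cycle microbenchmark, the sketch below turns pairs of cycle-counter readings into an average cycles-per-instruction figure. It is a minimal sketch and not the authors' method: the function names, the timer-overhead correction, the use of a median over repeated runs, and every number in the sample data are assumptions made for illustration only.

```python
from statistics import median


def cycles_per_instruction(start_cycle, end_cycle, n_instructions, timer_overhead=0):
    """Average cycles per instruction from two cycle-counter readings.

    timer_overhead is the (assumed) cost in cycles of reading the counter
    itself, subtracted so that it is not attributed to the instructions.
    """
    if n_instructions <= 0:
        raise ValueError("n_instructions must be positive")
    elapsed = end_cycle - start_cycle - timer_overhead
    if elapsed < 0:
        raise ValueError("timer overhead exceeds the measured interval")
    return elapsed / n_instructions


def median_cycles_per_instruction(readings, n_instructions, timer_overhead=0):
    """Median over repeated runs, which damps occasional outlier runs."""
    return median(
        cycles_per_instruction(start, end, n_instructions, timer_overhead)
        for start, end in readings
    )


# Hypothetical (start, end) counter readings from three runs of a kernel
# that executes 16 dependent instructions between the two reads.
sample_readings = [(1000, 1066), (2000, 2064), (3000, 3070)]
print(median_cycles_per_instruction(sample_readings, n_instructions=16, timer_overhead=2))
```

On a real GPU the readings would come from a hardware cycle counter sampled immediately before and after the instruction sequence under test; here they are hard-coded so the arithmetic can be followed by hand.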


