Dissecting the NVidia Turing T4 GPU via Microbenchmarking

03/18/2019
by   Zhe Jia, et al.
0

In 2019, the rapid rate at which GPU manufacturers refresh their designs, coupled with their reluctance to disclose microarchitectural details, is still a hurdle for those software designers who want to extract the highest possible performance. Last year, these very reasons motivated us to dissect the Volta GPU architecture using microbenchmarks. The introduction in August 2018 of Turing, NVidia's latest architecture, pressed us to update our study. In this report, we examine Turing and compare it quantitatively against previous NVidia GPU generations. Specifically, we study the T4 GPU: a low-power board aiming at inference applications. We describe its improvements against its inference-oriented predecessor: the P4 GPU based on the Pascal architecture. Both T4 and P4 GPUs achieve significantly higher frequency-per-Watt figures than their full-size counterparts. We study the performance of the T4's TensorCores, finding a much higher throughput on low-precision operands than on the P4 GPU. We reveal that Turing introduces new instructions that express matrix math more succinctly. We map Turing's instruction space, finding the same encoding as Volta, and additional instructions. We reveal that the Turing TU104 chip has the same memory hierarchy depth as the Volta GV100; cache levels sizes on the TU104 are frequently twice as large as those found on the Pascal GP104. We benchmark each constituent of the T4 memory hierarchy and find substantial overall performance improvements over its P4 predecessor. We studied how clock throttling affects compute-intensive workloads that hit power or thermal limits. Many of our findings are novel, published here for the first time. All of them can guide high-performance software developers get closer to the GPU's peak performance.

READ FULL TEXT

page 5

page 7

page 9

page 11

page 19

page 21

page 35

page 39

research
04/18/2018

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Every year, novel NVIDIA GPU designs are introduced. This rapid architec...
research
08/23/2022

Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis

Graphics processing units (GPUs) are now considered the leading hardware...
research
02/18/2020

Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs

GPUs are prevalent in modern computing systems at all scales. They consu...
research
04/05/2021

GPU Domain Specialization via Composable On-Package Architecture

As GPUs scale their low precision matrix math throughput to boost deep l...
research
04/16/2017

In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performan...
research
12/09/2021

High performance computing on Android devices – a case study

High performance computing for low power devices can be useful to speed ...
research
02/01/2020

Foreground object segmentation in RGB-D data implemented on GPU

This paper presents a GPU implementation of two foreground object segmen...

Please sign up or login with your details

Forgot password? Click here to reset