Log In Sign Up

Dissecting the NVidia Turing T4 GPU via Microbenchmarking

by   Zhe Jia, et al.

In 2019, the rapid rate at which GPU manufacturers refresh their designs, coupled with their reluctance to disclose microarchitectural details, is still a hurdle for those software designers who want to extract the highest possible performance. Last year, these very reasons motivated us to dissect the Volta GPU architecture using microbenchmarks. The introduction in August 2018 of Turing, NVidia's latest architecture, pressed us to update our study. In this report, we examine Turing and compare it quantitatively against previous NVidia GPU generations. Specifically, we study the T4 GPU: a low-power board aiming at inference applications. We describe its improvements against its inference-oriented predecessor: the P4 GPU based on the Pascal architecture. Both T4 and P4 GPUs achieve significantly higher frequency-per-Watt figures than their full-size counterparts. We study the performance of the T4's TensorCores, finding a much higher throughput on low-precision operands than on the P4 GPU. We reveal that Turing introduces new instructions that express matrix math more succinctly. We map Turing's instruction space, finding the same encoding as Volta, and additional instructions. We reveal that the Turing TU104 chip has the same memory hierarchy depth as the Volta GV100; cache levels sizes on the TU104 are frequently twice as large as those found on the Pascal GP104. We benchmark each constituent of the T4 memory hierarchy and find substantial overall performance improvements over its P4 predecessor. We studied how clock throttling affects compute-intensive workloads that hit power or thermal limits. Many of our findings are novel, published here for the first time. All of them can guide high-performance software developers get closer to the GPU's peak performance.


page 5

page 7

page 9

page 11

page 19

page 21

page 35

page 39


Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Every year, novel NVIDIA GPU designs are introduced. This rapid architec...

Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis

Graphics processing units (GPUs) are now considered the leading hardware...

Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs

GPUs are prevalent in modern computing systems at all scales. They consu...

GPU Domain Specialization via Composable On-Package Architecture

As GPUs scale their low precision matrix math throughput to boost deep l...

In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performan...

High performance computing on Android devices – a case study

High performance computing for low power devices can be useful to speed ...

Foreground object segmentation in RGB-D data implemented on GPU

This paper presents a GPU implementation of two foreground object segmen...

1 Control information

Control words appeared first with the Kepler architecture, which substantially replaced dynamic hardware scheduling with static software scheduling. Control words encode instruction scheduling decisions taken by the compiler [8] that the hardware must enforce. The design choice to use software scheduling in Kepler was a departure from the previous design (Fermi): designers replaced a complex hardware scheduler with a simpler, more efficient one that occupied less silicon area and consumed less power. Overall, software scheduling enabled simpler on-chip control logic, leading to higher compute density per area of silicon and better energy efficiency.

On Turing and Volta, 128 bits contain one instruction together with the control information associated with only that instruction.

Pre-Volta architectures pack one control word with multiple instruction words into a bundle. In each bundle, the first word contains control information, and the remaining words (3 on Pascal and Maxwell, 7 on Kepler) encode one instruction each. Each control word affects how the architecture schedules the instructions within the bundle.

The following excerpt shows a bundle of Pascal instructions decoded by nvdisasm. The bundle contains four 64-bit words. The first word, which has a hexadecimal dump but no corresponding disassembled instruction, is a control word. The remaining three words are instructions.

                                                        /* 0x000f8800fe2007f1 */
 /*0288*/         @P5 LDG.E.CI R66, [R86+0x100];        /* 0xeed4a00010055642 */
 /*0290*/        @!P5 MOV R66, RZ;                      /* 0x5c9807800ffd0042 */
 /*0298*/         @P6 LDG.E.CI R67, [R86+0x180];        /* 0xeed4a00018065643 */

Control information is encoded as follows on the different GPU generations:

  • on Kepler, each control word contains 6 zeroes as its most significant bits, 2 zeroes as its least significant bits, and 7 sections of 8 bits each;

  • on Pascal and Maxwell, each control word contains one zero as its most significant bit, and 3 sections of 21 bits each;

  • on Turing and Volta, each control section contains 2 zeroes as its most significant bits, and 1 section of 21 bits. For every 128 bits corresponding to one instruction, control information is preceded and followed by bits encoding the instruction itself.

Sections containing control information are organized in the same way on Turing, Volta, Pascal and Maxwell. Each section contains 6 fields, organized as follows:

Width (bits) 4 6 3 3 1 4
Meaning Reuse Wait Read Write Yield Stall
flags barrier barrier barrier flag cycles
mask index index

Fields have the following meaning:

Reuse flags. Each hardware thread on Turing, Volta, Pascal and Maxwell has a 2-way associative Content-Addressable Memory (CAM) for each of the four conceptual source registers operand positions. This memory is intended to allow data reuse between instructions without accessing any register ports: this relieves pressure on the register file, and helps reducing register bank conflicts (we discuss register bank conflicts at length in Section 8.1). Reuse flags control this mechanism as follows: an instruction may flag for saving into the reuse set any combination of up to its first four arguments. Each instruction will attempt to service register reads for its first 4 arguments from the respective reuse slots before resorting to loading values via register file ports. E.g., if the last two reuse-saved registers in the second instruction source operand position were R98 and R99, either of those registers may be used in the second position of instructions without contributing to register bank conflicts. The four bits in the reuse flags map the first to fourth source operands with the least to most significant bits, respectively.

Wait barrier mask; Read/Write barrier index. While most instructions have fixed latency and can be statically scheduled by the assembler, instructions involving memory and shared resources typically have variable latency. Turing, Volta, Pascal and Maxwell use dependency barriers to track the completion of variable-latency instructions and resolve data hazards. When a variable-latency instruction writes to a register, the assembler associates it to one of the 6 available barriers by setting the corresponding write barrier number field. When a later instruction consumes that register, the assembler marks the instruction as waiting on that barrier by setting the bit corresponding to that barrier in the wait barrier mask. The hardware will stall the later instruction until the results of the earlier one are available. An instruction may wait on multiple barriers, which explains why the wait barrier mask is a bitmask, not an index.

Read dependency barriers. Read dependency barriers serve to protect against write-after-read hazards. Unbuffered instructions that write the contents of registers to memory need the registers to remain unchanged during the operation. To guarantee that, the assembler associates them to a barrier by populating the corresponding read barrier number field. Later instructions writing to the same register will wait on that barrier.

Stall cycles. This 4-bit field indicates how long the scheduler should wait before issuing the next instruction, ranging from 0 to 15 cycles. On Pascal and Maxwell, if the combination of this field and the yield flag contain a special combination of bits, the two dispatchers in a processing block can dispatch two consecutive instructions of a warp at the same time (dual issue). On Turing and Volta there is only one dispatcher in a processing block, and we do not observe dual issue in the generated code.

Yield flag. As its predecessors, the Turing architecture uses a one-bit yield flag to balance the workload assigned to a processing block. When this bit is set, the scheduler prefers to issue the next instruction from the current warp. When the bit is cleared, the scheduler prefers to switch to another warp, making all register reuse flags for the next instruction ineffective. This costs one extra cycle to switch to another warp.

Warp A Index
0 1 2 3
Warp B index 4 48.9 72.4 72.5 73.1
5 73.4 46.7 72.5 73.1
6 73.2 72.8 47.0 73.2
7 72.9 72.7 72.7 46.2
V100 GPU
Warp A Index
0 1 2 3
Warp B index 4 42.27 66.05 66.04 65.29
5 66.05 41.98 66.04 66.04
6 66.02 66.04 42.06 66.04
7 66.04 66.04 66.02 42.08
Table 1: This experiment reveals the same mapping between warps and schedulers on Turing and Volta: warps with the same index modulo 4 are mapped to the same scheduler. We vary the indices of two active warps (A and B) and measure their aggregate throughput. When the indices collide modulo 4 (i.e., they are mapped to the same scheduler) performance drops. All values are in single-precision GFLOPS.

2 Processing Blocks and Schedulers

The Turing streaming multiprocessor (SM) is partitioned into four processing blocks, each containing a dedicated warp scheduler and dispatch unit [3]. Instructions from the same warp are allocated to a specific processing block, and can only access the processing units within that block.

We found that warps are mapped to schedulers (and processing blocks) on Turing and Volta according to the same, simple rule:

This is demonstrated with a benchmark composed of 8 warps running on a single SM simultaneously, of which only 2 are active with loops of FFMA instructions, while the remaining 6 are idle.

We repeat the experiments varying the warp index of each of the two active warps (Warp A and B), while measuring each time the aggregate arithmetic throughput achieved by the two warps. The results (see Table 1) show that whenever the two warps have the same index modulo 4 (e.g., 0 and 4, 1 and 5, …), their aggregate performance drops, which suggests that they are mapped to the same scheduler.

These findings are consistent between Turing and Volta.

Furthermore, these results indicate that every block of your workload must use at least 128 threads to fully utilize the processing units on one SM of Turing and Volta.

3 Instruction word format

3.1 Opcodes

Turing and Volta use more bits to encode their instructions than in previous architectures.

Unlike previous architectures (Pascal, Maxwell and Kepler), which organize the opcode in the most significant bits of the instruction, Turing and Volta architectures place the opcode in the least significant bits of the first 64-bit word of the instruction. Turing opcodes vary in length from 10 to 13 bits. For an extensive opcode reference that compares Pascal, Volta and Turing, see the Appendix.

3.2 Operands

As in previous architectures, instruction operands on Turing can be registers of different types, memory addresses (constant, shared or global), or an immediate value. Predication is regulated by 4 bits: the first bit is a negation flag, and the remaining 3 bits encode a predicate register index.

4 L1 data cache

Turing adopts the same combined L1 data cache / shared memory design as Volta. This design reduces the cache hit latency and improves the bandwidth with respect to the Pascal architecture.

As the geometry of the L1 data cache is concerned, our findings agree with what reported in the Turing and Volta architecture whitepapers [10, 3]. Specifically, the T4 offers twice as much L1 data capacity, and twice as high a bandwidth as the P4 GPU.

As performance is concerned, our experiments show that on a T4 GPU, the L1 data cache offers approximately 3.7 more bandwidth than its P4 predecessor.

4.1 Latency and bandwidth

The L1 data cache hit latency we measured on the T4 GPU is 32 cycles, compared to 82 cycles on the P4 (see Figure 6).

Before Turing and Volta, Kepler was the most recent architecture to combine its L1 cache and its shared memory. Kepler’s L1 cache read hit latency is 35 clock cycles. Turing exhibits a better L1 latency than Kepler in clock cycles, despite the T4 being clocked almost twice as high as the K80 (1,590 vs. 875 MHz).

We use the following benchmark to measure the L1 data cache load bandwidth. The benchmark scans an array with 32-bit elements; every warp accesses all the elements in the array:

__global__ void l1_bw(  uint32_t *startClk, uint32_t *stopClk,
                        float *dsink, uint32_t *posArray )
    // Thread index
    uint32_t tid = threadIdx.x;
    // Side-effect variable, intended to avoid compiler elimination of this code
    float sink = 0;
    // Warm up the L1 cache by populating it
    for (uint32_t i = tid; i<L1_SIZE; i+=THREADS_NUM) {
      float * ptr = posArray+i;
      asm volatile ("{\t\n"
        ".reg .f32 data;\n\t"
        " data, [%1];\n\t"
        "add.f32 %0, data, %0;\n\t"
        "}" : "+f"(sink) : "l"(ptr) : "memory"
    // Synchronize all threads
    asm volatile ("bar.sync 0;");
    // Start timing
    uint32_t start = 0;
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(start) :: "memory");
    // Load data from L1 cache, accumulate
    for (uint32_t i = 0; i<L1_SIZE; i+=THREADS_NUM) {
      float * ptr = posArray+i;
      // every warp loads all data in l1 cache
      for (uint32_t j = 0; j<THREADS_NUM; j+=WARP_SIZE) {
        uint32_t offset = (tid+j)%THREADS_NUM;
        asm volatile ("{\t\n"
          ".reg .f32 data;\n\t"
          " data, [%1];\n\t"
          "add.f64 %0, data, %0;\n\t"
          "}" : "+f"(sink) : "l"(ptr+offset) : "memory"
    // Synchronize all threads
    asm volatile ("bar.sync 0;");
    // Stop timing
    uint32_t stop = 0;
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(stop) :: "memory");
    // Write time and data back to memory
    startClk[tid]  = start;
    stopClk[tid]   = stop;
    dsink[tid]     = sink;
T4 V100 P100 P4 M60 K80
Theoretical upper bound 64.0 128.0 64.0 64.0 128.0 128.0 bytes/cycle
Measured throughput 58.8 108.3 31.3 15.7 15.7 68.6 bytes/cycle
Table 3: L1 cache load throughput per SM.

We report L1 data bandwidths we measured across GPU devices in Table 3, together with their theoretical upper bounds.

The actual bandwidth we measure on the T4 GPU is 58.83 bytes/cycle/SM, i.e., 3.7 higher than that of the P4 GPU, i.e., 15.7 bytes/cycle/SM. This bandwidth comparison expressed in cycle counts is meaningful, because the T4 and P4 cards run at very similar graphics clock frequencies .

We calculate the theoretical throughput by multiplying the LSU count per SM by the number of bytes that each LSU can load per cycle per instruction.

Historically, architectures that employ an L1 cache combined with shared memory (Turing, Volta and Kepler) exhibit a higher L1 bandwidth than architectures where the L1 cached and the shared memory are separate (Pascal and Maxwell).

4.2 Geometry

According to the Turing whitepaper [3], load/store operations can use a L1 data cache of 32 KiB or 64 KiB in size.

Our experiments based on Mei and Chu’s fine-grained pointer-chase technique [9] were unable to detect the whole configured size, and fell 7 KiB short of the nominal L1 data cache size, on both Volta and Turing architectures (see Table 4).

In our experimental setup, the shared memory is configured to a size of 64 KiB. We then employed a benchmark that scans a variable length array A twice. As long as the size of A exceeds 25 KiB, we detected cache misses.

At this time we are unable to explain this 7-KiB discrepancy. We conjecture it is the result of a newly applied replacement policy that we discuss below. We confirm that it is not associated to the ECC feature (error correction).

Configured size of shared memory (KiB) 32 64
Expected size of L1 data cache (KiB) 64 32
Detected size of L1 data cache (KiB) 57 25
Table 4: Detectable L1 data cache size with the pointer-chase benchmark on the T4 GPU.

Table 2 describes the remainder of L1 data cache geometry as we discover it. The line size, load and update granularity of Turing’s L1 data cache are the same as on the Volta, Pascal and Maxwell GPUs.

In our previous report for Volta [2], we discovered an improved L1 cache replacement policy on Volta with respect to its predecessors. Turing also features a L1 cache replacement policy that aims at preserving large arrays from eviction caused by sparse memory accesses.

We employed a benchmark that scans a variable length array twice, and recorded positions and latency data when L1 cache miss happens. We found that when the L1 data cache saturates, Turing randomly evicts 4 consecutive cache lines (128 B). We observed that once a block of cache lines are evicted, the second scan will cause more cache lines from the same set to be evicted.

5 Unified L2 cache

Turing employs an L2 cache that is unified for data, instructions and constant memory, as the previous GPU generations do. The L2 cache on a T4 GPU is a 16-way, set-associative cache having a size of 4,096 KiB, a cache line of 64 B, and an average latency of 188 clock cycles (Figure 6).

Turing Volta Pascal Pascal Maxwell Kepler
T4 V100 P100 P4 M60 K80
Throughput (GB/s) 1,270 2,155 1,624 979 446 339
Table 5: L2 data cache load throughput.

We use the following benchmark to measure L2 load bandwidth, on all the GPUs considered:

__global__ void l2_bw(float *dsink, uint32_t *posArray)
    // block and thread index
    UINT tid = threadIdx.x;
    UINT bid = blockIdx.x;
    // accumulator; side effect to prevent code elimination
    float sink = 0;
    // load data from l2 cache and accumulate
    for (UINT i = 0; i<L2_SIZE; i+=THREADS_NUM) {
      DTYPE* ptr = posArray+i;
      // every warp loads all data in l2 cache
      for ( UINT j=0; j < THREADS; j+=32 ){
        UINT offset = (tid+j)%THREADS;
        asm volatile ("{\t\n"
          ".reg .f32 data;\n\t"
          " data, [%1];\n\t"
          "add.f32 %0, data, %0;\n\t"
          "}" : "+f"(sink) : "l"(ptr+offset) : "memory"
    // side effect: store the result
    dsink[tid] = sink;

Note that we warm up the L2 cache before launching this kernel (code not shown for brevity). The benchmark contains a simple floating-point accumulation into variable sink, which is later written to global memory; this accumulation intentionally creates a side effect intended to prevent the compiler from eliminating the entire benchmark code. The marginal cost of this accumulation is negligible with respect to the data access latency.

The bandwidth we measure on the T4 device (see results in Table 5) is 30% higher than the P4’s, and 59% of the one measured on the larger V100 GPU.

Figure 7: We detect the size of instruction cache level with a benchmark based on sequences of identical instructions of increasing length. We then chart the average inverse throughput: each plateau reveals the size of a cache level. Top charts: boundaries of the first two levels in the hierarchy. Bottom charts: limits of the last level, misses to global memory.

6 Instruction cache hierarchy

In this section, we map experimentally the size and the organization of the instruction cache hierarchy. In practice, that consists in (1) detecting the size of each cache level and (2) determining how cache levels are distributed within the architectural blocks (scheduler, SM, entire chip) of the GPU.

6.1 Taxonomy

All GPU architectures we considered, including Turing, feature three levels of instruction caches. To avoid confusion, note that on Turing and Volta the three levels are named differently (L0, L1, L2) than on previous architectures (L1, L1.5, L2). We adopt this established taxonomy for consistency with the NVidia’s whitepapers [3, 10] and with prior literature. Pay attention to expressions like “the second level of instruction caches”: this expression refers to L1 on Turing and Volta, but to L1.5 on Pascal, Maxwell and Kepler.

6.2 Size

To detect the size of each cache level, we study how the average inverse throughput (i.e., average clocks per instruction, or CPI) achieved by a long sequence of instructions changes as a function of sequence length. As we increase the length of a sequence, we expect to see a constant CPI value until the sequence exceeds the cache size. Indeed, experimental results show plateaus and ramps (Figure 7) which correspond to cache level sizes and transitions from one level to the following. In the figure, the bottom charts focus on the three instruction cache levels, whereas the bottom charts focus on the transition between the last cache level and global device memory.

We report all findings in Table 2. Turing enjoys better inverse throughput than its predecessors when accessing the second and third instruction cache levels.

Experimental setup. Our benchmark measures the average CPI seen by a sequence of instructions of given length that exert no pressure on the data cache hierarchy. We iterate measurements for sequence sizes starting from the cache line size up to the larger plausible size of L3. The benchmark executes each sequence twice, but only times the second execution, so that we only measure capacity misses and conflict misses, but not cold misses.

  • On Pascal, Maxwell and Kepler, we employ the same technique as in our previous report [2] for the sake of consistency, i.e., long sequences of FFMA instructions, whose register operands are chosen so that each instruction experiences no register dependence with its neighbors.

  • On Volta and Turing, we switched to a simpler method that uses NOP sequences rather than FFMA. This choice circumvents NVCC’s undesired generation of 2-cycle stalls between consequent FFMA instructions on these two GPUs.

Figure 8: Aggressor-victim experiments designed to detect what architectural block (a scheduler, an SM, the entire chip) owns each level of the instruction cache, by observing how an aggressor warps that causes instruction cache pressure affects a victim warp’s performance. Top left: each L0 instruction cache is private to a scheduler. Top right: an L1 instruction cache is not private to a scheduler. Bottom left: each L1 instruction cache is private to one SM. Bottom right: the L2 instruction cache is common among all SMs.

6.3 Organization

Across the different GPU architectures, levels in the instruction memory hierarchy are organized as follows:

  • on Turing and Volta, each L0 instruction cache is private to one scheduler/processing block;

  • on all GPUs considered, each L1 instruction cache is private to an SM;

  • on Pascal, Maxwell and Kepler each L1.5 instruction cache is private to one SM; the L1.5 instruction cache does not exist on Turing and Volta;

  • on all GPUs considered, the L2 cache is unified (i.e., it caches instructions and data) and it is shared across all SMs.

On architectures older than Turing, we provided experimental support for these claims in our previous report [2]. For claims about Turing, we collected evidence using experiments designed as follows.

Our experiments measure the interaction between an aggressor warp and a victim warp. Both warps loop through sequences of NOP instructions of chosen length:

  • the victim warp only runs a fixed-length NOP sequence, typically designed to fit within a certain instruction cache level; we call it the victim probing sequence;

  • the aggressor warp runs, in addition to the same probing sequence as the victim, and before it, a variable-length NOP sequence, designed to thrash a given cache level, potentially evicting instruction cache entries.

We monitor whether the evictions caused by the aggressor warp only affect its own performance, or they affect the victim as well: if the victim is unaffected, then the smallest cache level that fits the fixed-length victim probing sequence is private to the architectural block where the two warps are running (i.e., GPU processing block or SM); else, the cache level is shared between the two warps and located outside the block considered. In our experiments, both warps monitor their performance by measuring their inverse throughput (CPI).

Results show that each L0 instruction cache is private to a processing block, that each L1 instruction cache is private to an SM, and that the L2 cache is shared among all SMs (Figure 8).

To examine the relation between levels L0, L1 and schedulers (or GPU processing blocks), we use experiments where the aggressor and victim warps run on the same SM, but different processing blocks. We use increasingly longer sequences in the aggressor warp. To exclude compulsory misses from the measurements, we let the aggressor and then the victim warm up the caches by running each their respective sequence once.

We observe that:

  • as the aggressor sequence grows while remaining below L0 capacity, only the aggressor experiences a slowdown (top left chart in Fig. 8), whereas the victim is unaffected. This indicates that the two warps access distinct L0 caches, private to each processing block;

  • as the instruction sequence grows above L0 capacity (top right chart) and into L1, both warps slow down similarly, which indicates that the two warps share L1.

Next, we examine the relation between levels L1 and L2, and SMs, with similarly constructed experiments. This time, the two warp run on separate SMs (SM0 and SM1).

We observe that:

  • as the aggressor sequence exceeds L0 but remains within L1 capacity, only the aggressor warp experiences a slow-down corresponding to L1 hit rates (bottom left); the victim, still running a sequence fitting L0 (16 KiB), is unaffected. This indicates that different SMs have distinct L1 caches;

  • as the aggressor sequence exceeds L2 capacity (bottom right chart), both victim and aggressor experience slowdowns; This indicates that different SMs access the same L2 cache.

7 Constant memory hierarchy

The constant memory is a cached window of global memory, reserved for data declared with the __constant__ keyword, plus kernel invocation parameters and immediate constants. We find that Turing has three levels of constant cache memory, which have the geometry and properties described in Table 2 and latency as in Figure 10.

The constant memory hierarchy used in Turing did not change significantly from previous generations. Across all the GPU generations we considered, the following properties hold true:

  • the L1 constant cache uses a non-LRU replacement policy;

  • each SM possesses two private levels of constant caches, which we denote as L1 and L1.5 constant cache (accesses to either of each level within an SM do not affect the same cache levels on other SMs);

  • the L2 cache is the third level of constant cache. It is shared among all SMs and is unified for instruction and data.

Figure 9: An aggressor-victim experiment shows that the L1.5 constant cache and the L1 instruction cache coincide. We measure the miss rates experienced by the scan of a constant array pre-cached in constant L1.5 cache (victim) that follows a long sequence of identical FFMA instructions (aggressor), intentionally designed to cause L1 instruction cache pressure. As the aggressor’s sequence length increases, the victim suffers increasing miss rates.
Figure 10: Latency of concurrent loads from constant memory within a warp depends on where the data is found in the cache hierarchy (L1, L1.5, or L2) and on the count of distinct locations referenced. The hardware broadcasts accesses to the same location.

On Turing as in Volta, the second levels of the constant and the instruction cache are backed by the same hardware cache. More precisely, the L1.5 constant cache and the L1 instruction cache coincide. To prove this claim, we run an aggressor-victim experiment, in which we show that instruction sequences of increasing length (aggressor) evict pre-populated entries in the L1.5 constant cache. We detect these evictions by recording the execution time of a constant array scan (victim) that we execute after the aggressor. We use instruction sequences composed of identical FFMA instructions.

Experimental results (Figure 9) show that longer instruction sequences in the aggressor cause correspondingly higher miss rates in the victim. We observed victim miss rates vary from 0% to 100%.

As in previous architectures, constant memory accesses on Turing support broadcasting (see Figure 10). When all threads within a warp access the same address, the constant memory sends data to all threads simultaneously. When threads visit diverging addresses, the accesses are serialized.

Figure 11: Register bank conflicts affect the execution time of instructions. Charted is the execution time taken by long sequences of identical FFMA instructions, as we vary one source register (RX). In both sequences R6, the destination sequence, is irrelevant. In sequence FFMA R6, R97, R99, RX, the choice of RX causes a conflict when RX

is odd: the other two source operands are already using both ports from bank 1, and a third access cannot occur in the same clock cycle. In sequence

FFMA R6, R98, R99, RX, no choice of RX can cause a conflict because R98 and R99 are on different banks.

8 Registers

8.1 Register File Banks

Turing and Volta use a physical register file of 16,384, 32-bit elements in each processing block. Thread-visible logical registers are allocated in increments of 8, or aggregate increments of 256 for all 32 threads in a warp. These register files are organized in 2 banks with dual 32-bit ports each, with logical registers belonging to the bank with the index matching their name, modulo-2. Each port can satisfy only one 32-bit read per clock cycle, and instructions in compiled code requiring three or more operands (such as FFMA, the single-precision floating-point fused multiply-and-add instruction) will suffer a stall in execution due to a register bank conflict if any three source registers’ names map to either dual-ported bank.

For example:

  • instruction FFMA R15, R11, R12, R13 has no conflict, since source operands R11 and R13 can be serviced by bank 1’s two ports, R12 can be serviced by one of bank 0’s ports, and destination register R15 does not use an additional port from bank 1;

  • instruction FFMA R18, R10, R12, R16 suffers a conflict because R10, R12 and R16 are all in bank 0. (The destination of R18 is irrelevant.)

Architectures prior to Volta used 4, single-ported banks, requiring substantially more constrained register scheduling by the compiler, but there are opportunities for improvements even on the newest devices. In our technical report on Volta [2], we demonstrated performance increases of up to 15% by minimizing bank conflicts through careful register re-assignment.

Figure 11 illustrates the effect of register bank conflicts on instruction latency on the T4 GPU. We use long sequences of identical FFMA instructions in which we vary one source register index (RX) to cause conflicts. Since the T4 GPU has dual-ported register banks, a conflict will only happen when all three 32-bit source registers in an FFMA instruction belong to the same bank. In every instruction of form FFMA R6, R97, R99, RX in the benchmark, R97 and R99 are in bank 1; if RX also sits in bank 1, a conflict will occur. (R6 is irrelevant as it is a destination register.) In instruction sequence FFMA R6, R98, R99, RX, because R98 and R99 sit in different banks, there is no choice of RX that can cause three reads from the same bank.

8.2 Uniform Registers

As per NVidia’s documentation, Turing introduces a new feature intended to improve the maximum achievable arithmetic throughput of the main, floating-point capable datapaths, by adding a separate, integer-only, scalar datapath (named the uniform datapath) that operates in parallel with the main datapath.

This design is intended to accelerate numerical, array-based, compute-bound workloads that occupy the main datapaths almost completely with floating-point instructions, typically FFMA or HMMA, but also contain a few integer operations, typically updating array indices, loop indices or pointers; or performing array or loop boundary checks.

These few integer instructions spoil the instruction mix, and prevent the main datapaths from ingesting a 100% pure stream of FFMA or HMMA. In these circumstances, even a small fraction of integer instructions can hurt the overall arithmetic throughput, lowering it significantly from its theoretical maximum.

On Turing, the compiler has the option to push these integer operations onto the separate uniform datapath, out of the way of the main datapath. To do so, the compiler must emit uniform datapath instructions.

Regular instructions can access both uniform and regular registers. Uniform datapath instructions, instead, focus on uniform instructions almost exclusively.

While at this time we have not been able to stimulate the generation of uniform datapath instructions by the compiler, we were able to enumerate the 64 uniform registers supported by Turing (including a Uniform Zero Register URZ and 63 general-purpose uniform registers UR0UR62) by systematically disassembling packed uniform instructions.

8.3 Regular Registers

Instructions on Turing still supports the 256 regular registers (including the general-purpose R0R254 and the Zero Register RZ).

We found that the cuobjdump --dump-resource-usage command (that prints a kernel’s register usage) reports a count that includes both regular and uniform registers. The upper limit of total registers used in any CUDA kernel is 256, unchanged from Volta.

We confirmed this result by patching the register count in the section header of a CUDA kernel to values above 256, and determining that cuobjdump only recognizes 256 registers at most.

Figure 12:

Shared memory latency increases under contention. Both axes use exponential scales. We used a stride factor multiplying the thread index as an offset to load data from shared memory. Each thread visits one 32-bit element and measures the average access latency. The benchmark warms shared memory before recording clock cycles.

Figure 13: Theoretical and measured shared memory bandwidth on the considered GPUs. The theoretical limits are given by the minimum of product and product The meaning of all factors in these products is explained in Table 2.

9 Shared memory

The T4 GPU has up to 64 KiB of shared memory (configurable by the user) that offers low latency and high memory bandwidth. In this section, we characterize shared memory performance, including performance under contention.

9.1 Latency

Turing’s shared memory enjoys relatively low latency among the GPUs we examined (Figure 12). Only the V100 GPU exhibits lower shared memory latency than the T4 GPU.

On all GPUs except for Kepler, the measured average access latency monotonically increases with the number of conflicts in a warp. Kepler is the only GPU adopting dual-ported shared memory banks, allowing any two threads to alias on any given bank without penalty and resolving two further aliases at a time for conflicted banks.

9.2 Bandwidth

Due to their large number of streaming multiprocessors, the V100 and P100 GPUs provide the highest theoretical and measured shared memory bandwidth (Figure 13).

As benchmarking is concerned, on Kepler, Maxwell, Pascal and Volta, we were able to rely on nvprof to collect shared memory metrics. On Turing, because nvprof does not support shared memory metrics collection on that GPU, we resorted to adopting the following custom-tailored benchmark:

// Pointer-chasing shared memory bandwidth benchmark
// dData  : Pointer-chase array
// dSink  : Side-effect destination variable (prevents code elimination)
// repeat : Count of pointer-chase steps requested
// To ensure all LSUs in an SM are used, use >= 128 threads
#define THREAD_NUM 1024
// shared memory per block
__global__ void bandwidthTest(uint32_t * dData,
                              uint32_t * dSink,
                              uint32_t   repeat){
  // Pointer-chase starting position in shared memory
  uint32_t sid = threadIdx.x;
  // The pointer-chase array in shared memory
  __shared__ DTYPE shrData[PCHASE_SIZE];
  // Initialize the pointer-chase array in shared memory
  for (uint32_t i = sid; i<PCHASE_SIZE; i+=THREAD_NUM)
    shrData[i] = dData[i];
  // Synchronize threads in a same block
  // Scan the shared-memory array with the p-chase method
  unsigned next=sid;
  for (uint32_t j = 0; j < repeat; j++) {
    next = shrData[next];
  // Side effect to prevent the compiler from eliminating this code
  dSink[sid] = next;

This benchmark performs pointer-chase accesses to the shared memory with a varying number of steps. We invoke as many threads and blocks as possible to provide enough pressure on load/store units. We measured the execution time as we increased pointer-chase step count.

We cross-verified the correctness and accuracy of this benchmark by running it on all architectures other than Turing (on which shared memory metrics are supported) and confirming that the bandwidths it measures match those computed from nvprof metrics.

Figure 14: Theoretical and actual global memory bandwidth on all GPUs considered. Theoretical bounds are derived from NVidia’s whitepapers. Actual bandwidths are the results of our benchmark, which loads data from a global memory array and stores it into another global memory array.

10 Global memory

We measured the actual global memory bandwidth and compared it against its theoretical limit for all the GPUs considered (Figure 14).

Thanks to their adoption of HBM2 memory, V100 and P100 boards feature a significantly higher bandwidth than GPUs based on GDDR memory. The P100 outperforms GDDR-based GPUs boards but suffers from a large gap between actual and theoretical performance. Compared to the P4 GPU, the T4 GPU enjoys a higher global bandwidth because of GDDR6 memory. However, the actual-to-theoretical bandwidth ratio on the T4 board is lower than on the P4 board (68.8% vs. 84.4%)

Figure 15: Global memory access latency seen by the pointer chase benchmark as it sweeps TLBs. The benchmarks perform a traditional pointer chase after a TLB warm-up scan, calculating the average global memory access latency with a stride of TLB page entry size.

11 TLBs

On Turing and on all other architectures that we examined, we found that

  • the L1 data cache is indexed by virtual addresses, and

  • the L2 data cache is indexed by physical addresses.

Because L2 is a physical cache, accesses to it involve the TLBs. We prove this claim by scanning a large array with L1 data cache enabled; we size the array to exceed the L1 TLB coverage, so that accesses in the benchmark would cause at least one level of TLB miss if L1 data cache were indexed by physical address. As expected, we saw no TLB misses in the second scan, as long as the stride is big enough to cache all accesses in L1 data cache. The same benchmark shows that addressing data in L2 data cache goes through the TLBs when the L1 data cache is disabled.

Figure 15 shows that, within the available global memory size, there are two levels of TLB on the Turing GPUs. The L1 TLB has 2 MiB page entries and 32 MiB coverage. The coverage of the L2 TLB is about 8192 MiB, which is the same as Volta.

12 Native instructions

Turing and Volta’s instructions typically exhibit lower latency than Pascal and older GPU generations, but Turing does not seem to offer instruction latency improvements over Volta. In this section, we report the latency of common instructions on Turing, Volta and Pascal in Table 6.

Architecture Instructions Latency (cycles)
POPC, FLO, MUFU, F2F, F2I, I2F, I2I 14
Table 6: Latency of frequently used instructions on Volta and Pascal.

As the Turing whitepaper [3] mentions, the dependent-issue latency for core FMA math operations is 4 clock cycles, the same as on Volta.

On Turing, we found that most integer, single- and half-precision instructions have similar latencies as those on Volta, whereas double-precision instructions increased their latency above 40 cycles.

On Volta, most integer and single-precision instructions have a latency of 4 cycles. In our previous work we determined that most Volta double-precision instructions have a latency of 8 cycles, and half-precision instructions have a latency of 6 cycles.

On Maxwell and Pascal, instructions IMAD and IMUL have a long latency because they are emulated.

On Pascal, most integer and single-precision instructions have a latency of 6 cycles; double-precision instructions have a latency of 8 cycles; more complex instructions, some of which run on the SFU, require 14 cycles.

Experimental setup. Measuring dependent issue instruction latency on a software-scheduled GPU requires the use of custom-tailored benchmarks designed as follows. To measure the latency of instruction A, we add a second instruction B that depends on A, then set the control word that regulates A’s execution:

  • if A has fixed latency, we choose a B that consumes A’s output. We decrease A’s stall cycles in its control word, till A’s result consumed by B is incorrect. The last stall value producing correct results is A’s latency;

  • if A has variable latency, we choose a B of known latency, then set control flags to create an artificial read/write dependency between A and B. We let the scheduler wait for the dependency, then measure the pair’s cumulative latency with a bracket of CS2R instructions, and obtain A’s latency by subtracting B’s known one.

Figure 16: Throughput of atomicAdd operations on global memory, measured in four contention scenarios.

13 Atomic operations

Our measurements show that atomic operations on shared memory have a slightly longer latency on Turing than on Volta, but shorter than Pascal and older generations. In Table 7, we report those latencies expressed in clock cycles. The comparison is meaningful even in real terms because the different GPUs adopt similar clock frequencies (reported in Table 2)

As atomics on global memory are concerned, latency seems to have increased on the T4 device compared with V100. The M60 GPU had the best latency among all GPU considered.

Notably, Kepler is the only architecture where shared memory atomics are slower than global memory one, and by a large margin (4 to 8). This is due to Kepler’s lack of hardware support for shared memory atomics. Moreover, its emulated atomics degrade quickly under contention. Later architectures support atomics in hardware, and offer low-latency atomics, even in presence of contention.

Shared memory Global memory
Contention T4 V100 P100 P4 M60 K80 T4 V100 P100 P4 M60 K80
none 8 6 15 16 17 93 76 36 26 30 24 29
2 threads 10 7 17 18 19 214 72 31 31 50 26 69
4 threads 14 11 19 25 25 460 73 32 48 50 41 96
8 threads 22 18 30 30 31 952 81 41 48 51 41 152
16 threads 37 24 46 46 47 1,936 97 58 50 51 46 264
32 threads 69 66 78 78 79 4,257 116 76 50 51 46 488
Table 7: Latency of atomic operations on shared and global memory, in clock cycles.

We measured these latencies with benchmarks designed in the following manner: we determine the latency of atomic instruction A by following it with a load instruction B, of known latency, that visits the same location. We deduce A’s latency from that of pair (A,B) as described in the previous section.

Figure 16 reports the throughput measured on GPUs from Kepler to Turing in presence of contention, in four scenarios:

  • Scenario 1, one block of 1,024 threads. Of these, threads access the same address, while the others access distinct, sequential addresses in global memory. 8 groups of threads access the same L2 cache line;

  • Scenario 2, one block of 1,024 threads. Of these, threads access the same address, while the others access sequential L2 cache lines in global memory, with every group of threads accessing a single L2 cache line;

  • Scenario 3, a variable number of blocks, of 1,024 threads each. All threads in all blocks access the same address; heavy contention exists among blocks;

  • Scenario 4, a variable number of blocks, of 1,024 threads each. All threads within a block access the same address. Different blocks access distinct addresses; no contention exists among blocks.

The T4 GPU doesn’t achieve the highest throughput in the scenarios with contention and the scenarios on single SM. The only scenario in which the T4 GPU provides the best performance is on multiple SMs and without contention among SMs. In all scenarios, from Maxwell to Pascal the aggregate throughput increase substantially.

Figure 17: Floating-point performance of cuBLAS and CUTLASS matrix multiplication on a T4 GPU running at 1,590 MHz.

14 New Tensor Core instructions

The Turing architecture refreshes its Tensor Cores by offering support for a wider range of operand types than Volta. Specifically, Tensor Cores as introduced in Volta were designed to offer high throughput when performing matrix math on half-precision floating point operands; on Turing, Tensor Cores add support for short integer operands: int8, int4 and int1.

Moreover, Turing offers new instructions that allow to express matrix math more succinctly. To demonstrate that, we will compare the Volta and the Turing code generated by the compiler for the same warp-level primitive wmma::mma_sync(). Readers will recognize this example from Chapter 4.3 of our technical report on Volta [2].

When targeting Volta, NVCC compiles one example invocation of the primitive into the following 16 HMMA.884.F32.F32.* instructions:

  HMMA.884.F32.F32.STEP0 R8,  R26.reuse.COL,    R16.reuse.COL,  R8  ;
  HMMA.884.F32.F32.STEP1 R10, R26.reuse.COL,    R16.reuse.COL,  R10 ;
  HMMA.884.F32.F32.STEP2 R4,  R26.reuse.COL,    R16.reuse.COL,  R4  ;
  HMMA.884.F32.F32.STEP3 R6,  R26.COL,          R16.COL,        R6  ;
  HMMA.884.F32.F32.STEP0 R8,  R20.reuse.COL,    R18.reuse.COL,  R8  ;
  HMMA.884.F32.F32.STEP1 R10, R20.reuse.COL,    R18.reuse.COL,  R10 ;
  HMMA.884.F32.F32.STEP2 R4,  R20.reuse.COL,    R18.reuse.COL,  R4  ;
  HMMA.884.F32.F32.STEP3 R6,  R20.COL,          R18.COL,        R6  ;
  HMMA.884.F32.F32.STEP0 R8,  R22.reuse.COL,    R12.reuse.COL,  R8  ;
  HMMA.884.F32.F32.STEP1 R10, R22.reuse.COL,    R12.reuse.COL,  R10 ;
  HMMA.884.F32.F32.STEP2 R4,  R22.reuse.COL,    R12.reuse.COL,  R4  ;
  HMMA.884.F32.F32.STEP3 R6,  R22.COL,          R12.COL,        R6  ;
  HMMA.884.F32.F32.STEP0 R8,  R2.reuse.COL,     R14.reuse.COL,  R8  ;
  HMMA.884.F32.F32.STEP1 R10, R2.reuse.COL,     R14.reuse.COL,  R10 ;
  HMMA.884.F32.F32.STEP2 R4,  R2.reuse.COL,     R14.reuse.COL,  R4  ;
  HMMA.884.F32.F32.STEP3 R6,  R2.COL,           R14.COL,        R6  ;

When targeting Turing, NVCC compiles the same primitive invocation into only 4 HMMA instructions of a new kind, that contain the new .1688 infix:

  # Turing rendition
  HMMA.1688.F32 R8,  R12, R22, R8 ;
  HMMA.1688.F32 R4,  R12, R23, R4 ;
  HMMA.1688.F32 R8,  R2,  R24, R8 ;
  HMMA.1688.F32 R4,  R2,  R25, R4 ;

15 Arithmetic performance

We evaluated arithmetic performance by benchmarking matrix-matrix multiplications using functions from the cuBLAS 10.1 library and template functions from cutlass 1.2, on integer operands and floating-point ones of different precisions. We report arithmetic throughput in TOPS and TFLOPS, when operating on integer and floating-point values respectively. In all experiments, the T4 GPU was running at a clock frequency of 1,590 MHz.

In half, single and double precision, cuBLAS provides higher arithmetic throughput than cutlass. This is because the cuBLAS library has been specifically optimized for the Turing architecture. For int8 precision, two APIs are available in cuBLAS 10.1:

  • BLAS-like extension function cublasGemmEx, which invokes native CUDA core implementations, and

  • the new light-weight cublasLtMatmul function, which supports int8 native TensorCore implementations.

For int8, the throughput of (cublasLtMatmul) is much higher than the throughput of (cublasGemmEx). At the time of this writing, only cutlass supports int4 and int1 matrix multiplication on NVidia GPUs.

Except in double precision, benchmarks don’t achieve near-peak performance. For int8 and int4, cutlass implementations don’t achieve 50% of theoretical throughput on the T4 GPU (Figure 17).

In Table 8 we compare the arithmetic throughputs achieved on T4 and P4 GPUs on matrix multiplication at different precisions, with both boards running at the respective top frequencies (1,590 and 1,531 MHz). The T4 GPU enjoys a higher throughput in half precision and int8 precision, thanks to Tensor Cores usage.

Because the T4 and the P4 GPU have the same number of CUDA cores, we measure similar arithmetic throughput in matrix multiplication on the two boards, in double and single precision. Note that double-precision performance is hampered by the small number of native FP64 cores available (only two per SM), as both architectures are optimized for inference, where lower precision is more frequently employed.

T4 P4
Double precision 253 231 GFLOPS
Single precision 7,174 6,944 GFLOPS
Half precision 41,616 6,571 GFLOPS
Int8 precision 74,934 24,172 GOPS
Int4 precision 114,384 - GOPS
Int1 precision 552,230 - GOPS
Table 8: Arithmetic throughput of matrix multiplication on inference-oriented GPUs on floating point and integer types.
Figure 18: Clock frequency observed on a T4 GPU while continuously computing cuBLAS matrix multiplication. The application clock frequency is set to 1,590 MHz.
Figure 19: Temperature and clock frequency of the T4 card when computing a cublasSgemm repeatedly. The application clock frequency is set to 1,590 MHz.

16 Performance throttling

Most GPUs include forms of clock throttling and/or power-state throttling to prevent exceeding either the power or thermal envelopes if the workload is particularly demanding or the heat dissipation is insufficient.

Our experiments show that the small form-factor T4 and P4 boards, designed for inference applications, achieve a significantly higher frequency-per-Watt rating than their full-size counterparts. At the same time, they are more prone to clock throttling than their full-size counterparts (K80, P100, V100, and M60) because of

  • their smaller size, which limits their heat sinks’ heat transfer rate, and

  • their maximum power limits set by the manufacturer, which is significantly lower (70W) on low-power, small form-factor boards than on full-size boards (250W).

Experimental setup: All GPU specimens we examined adopt passive cooling. Our K80, P100, V100 and M60 experiments ran on Dell PowerEdge C4130 servers, which are Tesla-qualified. Our T4 and P4 experiments ran on HPE Proliant DL360 Gen9 servers. This server model does not appear in NVidia’s Tesla-qualified server catalog. Power and thermal performance of a GPU also depend on the server that hosts it, and could be suboptimal on a non-qualified server. The server generation immediately following the one we employed (HPE Proliant DL360 Gen10) is Tesla-qualified, but we were unable to arrange for an upgrade before the publication of this manuscript.

In our experiments, we were able to trigger clock throttling on the T4 GPU consistently, using benchmarks based on cuBLAS matrix multiplication kernels cublas<t>gemm. On the T4 GPU, clock throttling triggers for two reasons:

  • power-limit throttling: instantaneous power exceeds the power limit set by the manufacturer (70W on the T4 GPU);

  • thermal throttling: the GPU reaches its maximum operating temperature (85°C on the T4 card).

Compared to power limit throttling, thermal throttling causes a more severe clock frequency reduction.

Figure 20: Clock frequency normalized to thermal design power (TDP) of all considered GPUs when computing an identical cublas<t>sgemm function on 10241024 matrices repeatedly. On every GPU, we set the application clock frequency to its maximum supported value.

16.1 Power-limit throttling

On the T4 and P4 GPUs, we saw power-limit throttling trigger very early in our cuBLAS-based matrix multiplication experiments. On the other hand, the V100, P100, M60 and K80 GPUs barely experienced any power-limit throttling, due to the larger margin between actual power consumption and its limit.

To confirm the cause of throttling, we designed an experiment that invokes cuBLAS<t>gemm kernels with input matrices of growing size. We observed the T4 GPU exceeded its power limit more and more frequently, and lower its clock rates more and more, with growing input sizes. The reduced clock frequency eventually hurts overall arithmetic throughput. See Figure 18.

In the experiment, we set the application clock for graphics on the T4 card to 1,590 MHz, and prevent GPU temperatures from exceeding the maximum operating temperature of the T4 GPU. We record the clock frequency of the T4 card while computing cublas<t>gemm in half precision.

16.2 Thermal throttling

We characterized thermal throttling with a benchmark that repeatedly launches a cublas<t>gemm kernel on a large matrix. We observed that below 85 degrees C (the maximum operating temperature), power limit throttling causes the T4 GPU to reduce its graphics clock with the growth of temperature. As soon as the temperature reaches 85 degrees C, thermal throttling triggers in addition to power-limit throttling, causing a more dramatic clock frequency step-down, depicted in Figure 19.

16.3 Power-limit throttling across GPU devices

We compared the power-limit throttling behavior of the different GPUs, by recording graphics clock over time while all cards computed endless repetitions of the same cublasSgemm kernel on 10241024 input matrices.

We noticed substantial differences between low-power GPUs (e.g., T4 and P4) and the full form-factor GPUs (K80, M60, P100, V100). We observe clock throttling only on the T4 and the P4 GPUs. Both cards are only able to run at their highest supported clock frequency for a few seconds at the very beginning of the experiment. As temperatures increased, clock throttling intervened and clock frequency decreased (Figure 20).

On full-height, full-length GPUs, we could not raise power consumption enough to approach the limits and trigger throttling.

Experimental setup: in all experiments, we set all graphics clocks to the highest supported value for each device. We turned off the AutoBoost features wherever available. We also ensured that only power-limit throttling was active.