1 Control information
Control words appeared first with the Kepler architecture, which substantially replaced dynamic hardware scheduling with static software scheduling. Control words encode instruction scheduling decisions taken by the compiler that the hardware must enforce. The design choice to use software scheduling in Kepler was a departure from the previous design (Fermi): designers replaced a complex hardware scheduler with a simpler, more efficient one that occupied less silicon area and consumed less power. Overall, software scheduling enabled simpler on-chip control logic, leading to higher compute density per area of silicon and better energy efficiency.
On Turing and Volta, 128 bits contain one instruction together with the control information associated with only that instruction.
Pre-Volta architectures pack one control word with multiple instruction words into a bundle. In each bundle, the first word contains control information, and the remaining words (3 on Pascal and Maxwell, 7 on Kepler) encode one instruction each. Each control word affects how the architecture schedules the instructions within the bundle.
The following excerpt shows a bundle of Pascal instructions decoded by nvdisasm. The bundle contains four 64-bit words. The first word, which has a hexadecimal dump but no corresponding disassembled instruction, is a control word. The remaining three words are instructions.
Control information is encoded as follows on the different GPU generations:
on Kepler, each control word contains 6 zeroes as its most significant bits, 2 zeroes as its least significant bits, and 7 sections of 8 bits each;
on Pascal and Maxwell, each control word contains one zero as its most significant bit, and 3 sections of 21 bits each;
on Turing and Volta, each control section contains 2 zeroes as its most significant bits, and 1 section of 21 bits. For every 128 bits corresponding to one instruction, control information is preceded and followed by bits encoding the instruction itself.
Sections containing control information are organized in the same way on Turing, Volta, Pascal and Maxwell. Each section contains 6 fields, with the following meaning:
Reuse flags. Each hardware thread on Turing, Volta, Pascal and Maxwell has a 2-way associative Content-Addressable Memory (CAM) for each of the four conceptual source register operand positions. This memory allows data reuse between instructions without accessing any register ports: this relieves pressure on the register file and helps reduce register bank conflicts (we discuss register bank conflicts at length in Section 8.1). Reuse flags control this mechanism as follows: an instruction may flag any combination of its first four source operands for saving into the reuse set. Each instruction attempts to service register reads for its first 4 source operands from the respective reuse slots before resorting to loading values via register file ports. For example, if the last two registers saved for reuse in the second source operand position were R98 and R99, either of those registers may be used in the second position of subsequent instructions without contributing to register bank conflicts. The four bits in the reuse flags map the first to fourth source operands to the least to most significant bits, respectively.
Wait barrier mask; Read/Write barrier index. While most instructions have fixed latency and can be statically scheduled by the assembler, instructions involving memory and shared resources typically have variable latency. Turing, Volta, Pascal and Maxwell use dependency barriers to track the completion of variable-latency instructions and resolve data hazards. When a variable-latency instruction writes to a register, the assembler associates it with one of the 6 available barriers by setting the corresponding write barrier number field. When a later instruction consumes that register, the assembler marks the instruction as waiting on that barrier by setting the bit corresponding to that barrier in the wait barrier mask. The hardware will stall the later instruction until the results of the earlier one are available. An instruction may wait on multiple barriers, which explains why the wait barrier mask is a bitmask, not an index.
Read dependency barriers. Read dependency barriers serve to protect against write-after-read hazards. Unbuffered instructions that write the contents of registers to memory need those registers to remain unchanged during the operation. To guarantee that, the assembler associates them with a barrier by populating the corresponding read barrier number field. Later instructions writing to the same register will wait on that barrier.
Stall cycles. This 4-bit field indicates how long the scheduler should wait before issuing the next instruction, ranging from 0 to 15 cycles. On Pascal and Maxwell, if this field and the yield flag together contain a special combination of bits, the two dispatchers in a processing block can dispatch two consecutive instructions of a warp at the same time (dual issue). On Turing and Volta there is only one dispatcher in a processing block, and we do not observe dual issue in the generated code.
Yield flag. Like its predecessors, the Turing architecture uses a one-bit yield flag to balance the workload assigned to a processing block. When this bit is set, the scheduler prefers to issue the next instruction from the current warp. When the bit is cleared, the scheduler prefers to switch to another warp, making all register reuse flags for the next instruction ineffective; switching to another warp costs one extra cycle.
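Taken together, the six fields described above account for all 21 bits of a control section (4 reuse bits + 6 wait-mask bits + 3 read-barrier bits + 3 write-barrier bits + 1 yield bit + 4 stall bits). The following sketch unpacks a Pascal/Maxwell control word under an assumed field order (stall, yield, write barrier, read barrier, wait mask, reuse flags, from least to most significant bits); the order is taken from prior reverse-engineering efforts and should be treated as an assumption here.

```python
# Sketch: unpack one 21-bit control section (Pascal/Maxwell layout).
# Field order, LSB to MSB, is an assumption based on prior reverse engineering:
#   4-bit stall, 1-bit yield, 3-bit write barrier index,
#   3-bit read barrier index, 6-bit wait barrier mask, 4-bit reuse flags.
def decode_section(section):
    return {
        "stall":          section        & 0xF,
        "yield":         (section >> 4)  & 0x1,
        "write_barrier": (section >> 5)  & 0x7,
        "read_barrier":  (section >> 8)  & 0x7,
        "wait_mask":     (section >> 11) & 0x3F,
        "reuse":         (section >> 17) & 0xF,
    }

def sections_pascal(control_word):
    # A Pascal/Maxwell control word: one zero as MSB, then 3 sections of 21 bits,
    # stored from least to most significant bits.
    return [decode_section((control_word >> (21 * i)) & 0x1FFFFF)
            for i in range(3)]
```

A Kepler control word would be unpacked analogously, as 7 sections of 8 bits bracketed by the fixed zero bits.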
Table 1: aggregate arithmetic throughput achieved by two active warps (A and B) as a function of their warp indices, measured on Turing and on Volta.
2 Processing Blocks and Schedulers
The Turing streaming multiprocessor (SM) is partitioned into four processing blocks, each containing a dedicated warp scheduler and dispatch unit. Instructions from the same warp are allocated to a specific processing block, and can only access the processing units within that block.
We found that warps are mapped to schedulers (and processing blocks) on Turing and Volta according to the same, simple rule: a warp is assigned to the scheduler whose index equals the warp index modulo 4.
This is demonstrated with a benchmark composed of 8 warps running on a single SM simultaneously, of which only 2 are active with loops of FFMA instructions, while the remaining 6 are idle.
We repeat the experiments varying the warp index of each of the two active warps (Warp A and B), while measuring each time the aggregate arithmetic throughput achieved by the two warps. The results (see Table 1) show that whenever the two warps have the same index modulo 4 (e.g., 0 and 4, 1 and 5, …), their aggregate performance drops, which suggests that they are mapped to the same scheduler.
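The mapping rule suggested by these measurements fits in two lines of Python (a statement of our inference, not an API):

```python
def scheduler_of(warp_index, num_schedulers=4):
    # Warps are distributed round-robin over the four processing blocks:
    # two warps contend for the same scheduler iff their indices are
    # congruent modulo 4.
    return warp_index % num_schedulers

# Pairs such as (0, 4) and (1, 5) share a scheduler; (0, 1) does not.
```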
These findings are consistent between Turing and Volta.
Furthermore, these results indicate that every block in a workload must use at least 128 threads to fully utilize the processing units in one SM of Turing and Volta.
3 Instruction word format
Turing and Volta use more bits to encode their instructions than previous architectures do.
Unlike previous architectures (Pascal, Maxwell and Kepler), which organize the opcode in the most significant bits of the instruction, Turing and Volta architectures place the opcode in the least significant bits of the first 64-bit word of the instruction. Turing opcodes vary in length from 10 to 13 bits. For an extensive opcode reference that compares Pascal, Volta and Turing, see the Appendix.
As in previous architectures, instruction operands on Turing can be registers of different types, memory addresses (constant, shared or global), or an immediate value. Predication is regulated by 4 bits: the first bit is a negation flag, and the remaining 3 bits encode a predicate register index.
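As an illustration, the 4-bit predicate field can be decoded as follows. The text above fixes only the field's composition; placing the negation flag in the most significant bit is our assumption.

```python
def decode_predicate(bits4):
    # 4-bit predication field: 1 negation flag + 3-bit predicate register index.
    negated = bool((bits4 >> 3) & 0x1)   # assumed: MSB holds the negation flag
    index = bits4 & 0x7                  # predicate register P0..P7
    return negated, index

# Example: 0b1010 would denote a negated access to predicate register P2.
```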
4 L1 data cache
Turing adopts the same combined L1 data cache / shared memory design as Volta. This design reduces the cache hit latency and improves the bandwidth with respect to the Pascal architecture.
As far as the geometry of the L1 data cache is concerned, our findings agree with what is reported in the Turing and Volta architecture whitepapers [10, 3]. Specifically, the T4 offers twice as much L1 data capacity, and twice as high a bandwidth, as the P4 GPU.
As far as performance is concerned, our experiments show that on a T4 GPU, the L1 data cache offers approximately 3.7 times the bandwidth of its P4 predecessor.
4.1 Latency and bandwidth
The L1 data cache hit latency we measured on the T4 GPU is 32 cycles, compared to 82 cycles on the P4 (see Figure 6).
Before Turing and Volta, Kepler was the most recent architecture to combine its L1 cache and its shared memory. Kepler’s L1 cache read hit latency is 35 clock cycles. Turing exhibits a better L1 latency than Kepler in clock cycles, despite the T4 being clocked almost twice as high as the K80 (1,590 vs. 875 MHz).
We use the following benchmark to measure the L1 data cache load bandwidth. The benchmark scans an array with 32-bit elements; every warp accesses all the elements in the array:
Table 3: measured L1 data cache load bandwidths and their theoretical upper bounds (bytes/cycle) for the GPUs considered.
We report L1 data bandwidths we measured across GPU devices in Table 3, together with their theoretical upper bounds.
The actual bandwidth we measure on the T4 GPU is 58.83 bytes/cycle/SM, i.e., 3.7 times that of the P4 GPU (15.7 bytes/cycle/SM). This bandwidth comparison expressed in cycle counts is meaningful because the T4 and P4 cards run at very similar graphics clock frequencies.
We calculate the theoretical throughput by multiplying the LSU count per SM by the number of bytes that each LSU can load per cycle per instruction.
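For the T4, the calculation goes as follows; the per-SM LSU count of 16 (4 LSUs in each of the 4 processing blocks) and the 4 bytes loaded per LSU per cycle are assumptions based on the published SM organization, used here only as a sketch.

```python
# Theoretical L1 load throughput = LSUs per SM * bytes loaded per LSU per cycle.
lsu_per_sm = 16       # assumed T4 figure: 4 LSUs x 4 processing blocks
bytes_per_lsu = 4     # one 32-bit load per LSU per cycle
theoretical = lsu_per_sm * bytes_per_lsu   # bytes/cycle/SM

# Measured figures from the text, for the T4-vs-P4 comparison.
measured_t4 = 58.83   # bytes/cycle/SM
measured_p4 = 15.70
speedup = measured_t4 / measured_p4        # ~3.7x
```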
Historically, architectures that employ an L1 cache combined with shared memory (Turing, Volta and Kepler) exhibit a higher L1 bandwidth than architectures where the L1 cache and the shared memory are separate (Pascal and Maxwell).
According to the Turing whitepaper, load/store operations can use an L1 data cache of 32 KiB or 64 KiB in size.
Our experiments, based on Mei and Chu's fine-grained pointer-chase technique, were unable to detect the whole configured size, falling 7 KiB short of the nominal L1 data cache size on both the Volta and Turing architectures (see Table 4).
In our experimental setup, the shared memory is configured to a size of 64 KiB. We then employed a benchmark that scans a variable length array A twice. As long as the size of A exceeds 25 KiB, we detected cache misses.
At this time we are unable to explain this 7-KiB discrepancy. We conjecture it is the result of a newly applied replacement policy that we discuss below. We confirm that it is not associated with the ECC feature (error correction).
Table 4: expected vs. detected L1 data cache size.

| Configured size of shared memory (KiB) | 32 | 64 |
|---|---|---|
| Expected size of L1 data cache (KiB) | 64 | 32 |
| Detected size of L1 data cache (KiB) | 57 | 25 |
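The twice-scan detection method can be modeled with a toy LRU cache: misses in the second scan appear exactly when the array exceeds capacity. This is a behavioral sketch of the methodology only — as discussed below, Turing's actual replacement policy is not plain LRU.

```python
from collections import OrderedDict

def second_scan_misses(array_lines, capacity_lines):
    # Scan an array of cache lines twice through a toy LRU cache and
    # count the misses observed during the second scan only.
    cache = OrderedDict()
    misses = 0
    for scan in range(2):
        for line in range(array_lines):
            if line in cache:
                cache.move_to_end(line)     # refresh recency on a hit
            else:
                if scan == 1:
                    misses += 1
                if len(cache) >= capacity_lines:
                    cache.popitem(last=False)  # evict the LRU line
            cache[line] = True
    return misses

# An array that fits sees no second-scan misses; one line over capacity
# thrashes a sequential-scan LRU cache completely.
```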
Table 2 describes the remainder of L1 data cache geometry as we discover it. The line size, load and update granularity of Turing’s L1 data cache are the same as on the Volta, Pascal and Maxwell GPUs.
In our previous report on Volta, we discovered an improved L1 cache replacement policy on Volta with respect to its predecessors. Turing also features an L1 cache replacement policy that aims at preserving large arrays from eviction caused by sparse memory accesses.
We employed a benchmark that scans a variable-length array twice, recording the positions and latencies of L1 cache misses. We found that when the L1 data cache saturates, Turing randomly evicts 4 consecutive cache lines (128 B). We also observed that once a block of cache lines is evicted, the second scan causes more cache lines from the same set to be evicted.
5 Unified L2 cache
Turing employs an L2 cache that is unified for data, instructions and constant memory, as the previous GPU generations do. The L2 cache on a T4 GPU is a 16-way, set-associative cache having a size of 4,096 KiB, a cache line of 64 B, and an average latency of 188 clock cycles (Figure 6).
We use the following benchmark to measure L2 load bandwidth, on all the GPUs considered:
Note that we warm up the L2 cache before launching this kernel (code not shown for brevity). The benchmark contains a simple floating-point accumulation into variable sink, which is later written to global memory; this accumulation creates a side effect that prevents the compiler from eliminating the entire benchmark code. The marginal cost of this accumulation is negligible with respect to the data access latency.
The bandwidth we measure on the T4 device (see results in Table 5) is 30% higher than the P4's, and 59% of that measured on the larger V100 GPU.
6 Instruction cache hierarchy
In this section, we map experimentally the size and the organization of the instruction cache hierarchy. In practice, this consists of (1) detecting the size of each cache level and (2) determining how cache levels are distributed among the architectural blocks (scheduler, SM, entire chip) of the GPU.
All GPU architectures we considered, including Turing, feature three levels of instruction caches. To avoid confusion, note that on Turing and Volta the three levels are named differently (L0, L1, L2) than on previous architectures (L1, L1.5, L2). We adopt this established taxonomy for consistency with NVidia's whitepapers [3, 10] and with prior literature. Pay attention to expressions like "the second level of instruction caches": this expression refers to L1 on Turing and Volta, but to L1.5 on Pascal, Maxwell and Kepler.
To detect the size of each cache level, we study how the average inverse throughput (i.e., average clocks per instruction, or CPI) achieved by a long sequence of instructions changes as a function of sequence length. As we increase the length of a sequence, we expect to see a constant CPI value until the sequence exceeds the cache size. Indeed, experimental results show plateaus and ramps (Figure 7) which correspond to cache level sizes and transitions from one level to the following. In the figure, the top charts focus on the three instruction cache levels, whereas the bottom charts focus on the transition between the last cache level and global device memory.
We report all findings in Table 2. Turing enjoys better inverse throughput than its predecessors when accessing the second and third instruction cache levels.
Experimental setup. Our benchmark measures the average CPI seen by a sequence of instructions of given length that exerts no pressure on the data cache hierarchy. We iterate measurements for sequence sizes starting from the cache line size up to the largest plausible size of the third cache level. The benchmark executes each sequence twice, but only times the second execution, so that we only measure capacity misses and conflict misses, but not cold misses.
On Pascal, Maxwell and Kepler, we employ the same technique as in our previous report for the sake of consistency, i.e., long sequences of FFMA instructions, whose register operands are chosen so that each instruction experiences no register dependence with its neighbors.
On Volta and Turing, we switched to a simpler method that uses NOP sequences rather than FFMA. This choice circumvents NVCC's undesired generation of 2-cycle stalls between consecutive FFMA instructions on these two GPUs.
Across the different GPU architectures, levels in the instruction memory hierarchy are organized as follows:
on Turing and Volta, each L0 instruction cache is private to one scheduler/processing block;
on all GPUs considered, each L1 instruction cache is private to an SM;
on Pascal, Maxwell and Kepler each L1.5 instruction cache is private to one SM; the L1.5 instruction cache does not exist on Turing and Volta;
on all GPUs considered, the L2 cache is unified (i.e., it caches instructions and data) and it is shared across all SMs.
On architectures older than Turing, we provided experimental support for these claims in our previous report. For claims about Turing, we collected evidence using experiments designed as follows.
Our experiments measure the interaction between an aggressor warp and a victim warp. Both warps loop through sequences of NOP instructions of chosen length:
the victim warp only runs a fixed-length NOP sequence, typically designed to fit within a certain instruction cache level; we call it the victim probing sequence;
the aggressor warp runs the same probing sequence as the victim, preceded by a variable-length NOP sequence designed to thrash a given cache level, potentially evicting instruction cache entries.
We monitor whether the evictions caused by the aggressor warp affect only its own performance, or whether they affect the victim as well: if the victim is unaffected, then the smallest cache level that fits the fixed-length victim probing sequence is private to the architectural block where the two warps are running (i.e., GPU processing block or SM); else, the cache level is shared between the two warps and located outside the block considered. In our experiments, both warps monitor their performance by measuring their inverse throughput (CPI).
Results show that each L0 instruction cache is private to a processing block, that each L1 instruction cache is private to an SM, and that the L2 cache is shared among all SMs (Figure 8).
To examine the relation between levels L0, L1 and schedulers (or GPU processing blocks), we use experiments where the aggressor and victim warps run on the same SM, but different processing blocks. We use increasingly longer sequences in the aggressor warp. To exclude compulsory misses from the measurements, we let the aggressor and then the victim warm up the caches by running each their respective sequence once.
We observe that:
as the aggressor sequence grows while remaining below L0 capacity, only the aggressor experiences a slowdown (top left chart in Fig. 8), whereas the victim is unaffected. This indicates that the two warps access distinct L0 caches, private to each processing block;
as the instruction sequence grows above L0 capacity (top right chart) and into L1, both warps slow down similarly, which indicates that the two warps share L1.
Next, we examine the relation between levels L1 and L2, and SMs, with similarly constructed experiments. This time, the two warps run on separate SMs (SM0 and SM1).
We observe that:
as the aggressor sequence exceeds L0 but remains within L1 capacity, only the aggressor warp experiences a slow-down corresponding to L1 hit rates (bottom left); the victim, still running a sequence fitting L0 (16 KiB), is unaffected. This indicates that different SMs have distinct L1 caches;
as the aggressor sequence exceeds L2 capacity (bottom right chart), both victim and aggressor experience slowdowns. This indicates that different SMs access the same L2 cache.
7 Constant memory hierarchy
The constant memory is a cached window of global memory, reserved for data declared with the __constant__ keyword, plus kernel invocation parameters and immediate constants. We find that Turing has three levels of constant cache memory, which have the geometry and properties described in Table 2 and latency as in Figure 10.
The constant memory hierarchy used in Turing did not change significantly from previous generations. Across all the GPU generations we considered, the following properties hold true:
the L1 constant cache uses a non-LRU replacement policy;
each SM possesses two private levels of constant caches, which we denote as L1 and L1.5 constant cache (accesses to either of each level within an SM do not affect the same cache levels on other SMs);
the L2 cache is the third level of constant cache. It is shared among all SMs and is unified for instruction and data.
On Turing, as on Volta, the second levels of the constant and the instruction cache are backed by the same hardware cache. More precisely, the L1.5 constant cache and the L1 instruction cache coincide. To prove this claim, we run an aggressor-victim experiment, in which we show that instruction sequences of increasing length (aggressor) evict pre-populated entries in the L1.5 constant cache. We detect these evictions by recording the execution time of a constant array scan (victim) that we execute after the aggressor. We use instruction sequences composed of identical FFMA instructions.
Experimental results (Figure 9) show that longer instruction sequences in the aggressor cause correspondingly higher miss rates in the victim. We observed victim miss rates varying from 0% to 100%.
As in previous architectures, constant memory accesses on Turing support broadcasting (see Figure 10). When all threads within a warp access the same address, the constant memory sends data to all threads simultaneously. When threads access divergent addresses, the accesses are serialized.
8.1 Register File Banks
Turing and Volta use a physical register file of 16,384 32-bit elements in each processing block. Thread-visible logical registers are allocated in increments of 8, or aggregate increments of 256 for all 32 threads in a warp. These register files are organized in 2 banks with dual 32-bit ports each, with logical registers belonging to the bank with the index matching their name, modulo 2. Each port can satisfy only one 32-bit read per clock cycle, and instructions in compiled code requiring three or more operands (such as FFMA, the single-precision floating-point fused multiply-and-add instruction) will suffer a stall in execution due to a register bank conflict if three of their source registers' names map to the same dual-ported bank. For example:
instruction FFMA R15, R11, R12, R13 has no conflict, since source operands R11 and R13 can be serviced by bank 1’s two ports, R12 can be serviced by one of bank 0’s ports, and destination register R15 does not use an additional port from bank 1;
instruction FFMA R18, R10, R12, R16 suffers a conflict because R10, R12 and R16 are all in bank 0. (The destination of R18 is irrelevant.)
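The two examples above reduce to a simple parity predicate (a sketch of the rule stated above, not compiler logic):

```python
def bank(reg):
    # Turing/Volta: two register banks, selected by the register number's parity.
    return reg % 2

def ffma_conflict(srcs):
    # A dual-ported bank serves two reads per cycle; three source reads
    # hitting the same bank stall the instruction by one cycle.
    return any(sum(1 for r in srcs if bank(r) == b) >= 3 for b in (0, 1))

# FFMA R15, R11, R12, R13: source banks (1, 0, 1) -> no conflict.
# FFMA R18, R10, R12, R16: source banks (0, 0, 0) -> conflict.
```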
Architectures prior to Volta used 4 single-ported banks, which required substantially more constrained register scheduling by the compiler; yet there are opportunities for improvement even on the newest devices. In our technical report on Volta, we demonstrated performance increases of up to 15% by minimizing bank conflicts through careful register re-assignment.
Figure 11 illustrates the effect of register bank conflicts on instruction latency on the T4 GPU. We use long sequences of identical FFMA instructions in which we vary one source register index (RX) to cause conflicts. Since the T4 GPU has dual-ported register banks, a conflict will only happen when all three 32-bit source registers in an FFMA instruction belong to the same bank. In every instruction of form FFMA R6, R97, R99, RX in the benchmark, R97 and R99 are in bank 1; if RX also sits in bank 1, a conflict will occur. (R6 is irrelevant as it is a destination register.) In instruction sequence FFMA R6, R98, R99, RX, because R98 and R99 sit in different banks, there is no choice of RX that can cause three reads from the same bank.
8.2 Uniform Registers
As per NVidia’s documentation, Turing introduces a new feature intended to improve the maximum achievable arithmetic throughput of the main, floating-point capable datapaths, by adding a separate, integer-only, scalar datapath (named the uniform datapath) that operates in parallel with the main datapath.
This design is intended to accelerate numerical, array-based, compute-bound workloads that occupy the main datapaths almost completely with floating-point instructions, typically FFMA or HMMA, but also contain a few integer operations, typically updating array indices, loop indices or pointers; or performing array or loop boundary checks.
These few integer instructions spoil the instruction mix, and prevent the main datapaths from ingesting a 100% pure stream of FFMA or HMMA. In these circumstances, even a small fraction of integer instructions can hurt the overall arithmetic throughput, lowering it significantly from its theoretical maximum.
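The dilution effect is simple arithmetic: every main-datapath issue slot spent on an integer instruction is a slot not spent on FFMA or HMMA. A back-of-the-envelope model (ignoring latency and dependency effects, our simplification):

```python
def main_datapath_utilization(int_fraction):
    # Fraction of main-datapath issue slots left for FFMA/HMMA when a given
    # fraction of the dynamic instruction mix is integer work that must
    # share the same datapath.
    return 1.0 - int_fraction

# One integer instruction in every ten issue slots caps FFMA throughput at
# 90% of peak; offloading it to the uniform datapath restores the full 100%.
```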
On Turing, the compiler has the option to push these integer operations onto the separate uniform datapath, out of the way of the main datapath. To do so, the compiler must emit uniform datapath instructions.
Regular instructions can access both uniform and regular registers. Uniform datapath instructions, instead, access uniform registers almost exclusively.
While at this time we have not been able to stimulate the generation of uniform datapath instructions by the compiler, we were able to enumerate the 64 uniform registers supported by Turing (including a Uniform Zero Register URZ and 63 general-purpose uniform registers UR0–UR62) by systematically disassembling packed uniform instructions.
8.3 Regular Registers
Instructions on Turing still support the 256 regular registers (including the general-purpose registers R0–R254 and the Zero Register RZ).
We found that the cuobjdump --dump-resource-usage command (that prints a kernel’s register usage) reports a count that includes both regular and uniform registers. The upper limit of total registers used in any CUDA kernel is 256, unchanged from Volta.
We confirmed this result by patching the register count in the section header of a CUDA kernel to values above 256, and determining that cuobjdump only recognizes 256 registers at most.
9 Shared memory
The T4 GPU has up to 64 KiB of shared memory (configurable by the user) that offers low latency and high memory bandwidth. In this section, we characterize shared memory performance, including performance under contention.
Turing’s shared memory enjoys relatively low latency among the GPUs we examined (Figure 12). Only the V100 GPU exhibits lower shared memory latency than the T4 GPU.
On all GPUs except for Kepler, the measured average access latency monotonically increases with the number of conflicts in a warp. Kepler is the only GPU adopting dual-ported shared memory banks, allowing any two threads to alias on any given bank without penalty and resolving two further aliases at a time for conflicted banks.
Due to their large number of streaming multiprocessors, the V100 and P100 GPUs provide the highest theoretical and measured shared memory bandwidth (Figure 13).
As far as benchmarking is concerned, on Kepler, Maxwell, Pascal and Volta we were able to rely on nvprof to collect shared memory metrics. On Turing, because nvprof does not support shared memory metrics collection on that GPU, we adopted the following custom-tailored benchmark:
This benchmark performs pointer-chase accesses to shared memory with a varying number of steps. We invoke as many threads and blocks as possible to put enough pressure on the load/store units, and we measure the execution time as we increase the pointer-chase step count.
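The pointer-chase access pattern at the heart of this benchmark can be sketched in a few lines (a host-side model; the actual benchmark runs the chase in shared memory on the GPU):

```python
def make_chain(n, stride=1):
    # Build a circular pointer chain over n slots: each slot stores the index
    # of the next slot to visit, so every load depends on the previous one.
    return [(i + stride) % n for i in range(n)]

def chase(chain, steps, start=0):
    p = start
    for _ in range(steps):
        p = chain[p]   # serialized, dependent accesses
    return p
```

Because each access depends on the previous one, increasing the step count lengthens a chain of dependent loads, which is what exposes latency and, under many concurrent warps, sustained bandwidth.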
We cross-verified the correctness and accuracy of this benchmark by running it on all architectures other than Turing (on which shared memory metrics are supported) and confirming that the bandwidths it measures match those computed from nvprof metrics.
10 Global memory
We measured the actual global memory bandwidth and compared it against its theoretical limit for all the GPUs considered (Figure 14).
Thanks to their adoption of HBM2 memory, V100 and P100 boards feature a significantly higher bandwidth than GPUs based on GDDR memory. The P100 outperforms GDDR-based GPU boards but suffers from a large gap between actual and theoretical performance. Compared to the P4 GPU, the T4 GPU enjoys a higher global bandwidth because of GDDR6 memory. However, the actual-to-theoretical bandwidth ratio on the T4 board is lower than on the P4 board (68.8% vs. 84.4%).
On Turing and on all other architectures that we examined, we found that
the L1 data cache is indexed by virtual addresses, and
the L2 data cache is indexed by physical addresses.
Because L2 is a physical cache, accesses to it involve the TLBs. We prove this claim by scanning a large array with L1 data cache enabled; we size the array to exceed the L1 TLB coverage, so that accesses in the benchmark would cause at least one level of TLB miss if L1 data cache were indexed by physical address. As expected, we saw no TLB misses in the second scan, as long as the stride is big enough to cache all accesses in L1 data cache. The same benchmark shows that addressing data in L2 data cache goes through the TLBs when the L1 data cache is disabled.
Figure 15 shows that, within the available global memory size, there are two levels of TLB on the Turing GPUs. The L1 TLB has 2 MiB page entries and 32 MiB coverage. The coverage of the L2 TLB is about 8192 MiB, which is the same as Volta.
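One detail these figures imply, assuming coverage is simply entry size times entry count (our model):

```python
MiB = 1 << 20

# L1 TLB on Turing: 2 MiB page entries, 32 MiB total coverage.
l1_entry_size = 2 * MiB
l1_coverage = 32 * MiB
l1_entries = l1_coverage // l1_entry_size   # implied entry count: 16

# L2 TLB coverage, as measured: about 8192 MiB, the same as Volta.
l2_coverage = 8192 * MiB
```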
12 Native instructions
Turing and Volta’s instructions typically exhibit lower latency than Pascal and older GPU generations, but Turing does not seem to offer instruction latency improvements over Volta. In this section, we report the latency of common instructions on Turing, Volta and Pascal in Table 6.
Table 6: latency of common instructions on Pascal, Volta and Turing, in clock cycles.

| Architecture | Instructions | Latency (cycles) |
|---|---|---|
| Pascal | BFE, BFI, IADD, IADD32I, FADD, FMUL, FFMA, FMNMX, HADD2, HMUL2, HFMA2, IMNMX, ISCADD, LOP, LOP32I, LOP3, MOV, MOV32I, SEL, SHL, SHR, VADD, VABSDIFF, … | 6 |
| | DADD, DMUL, DFMA, DMNMX | 8 |
| | FSET, DSET, DSETP, ISETP, FSETP | 12 |
| | POPC, FLO, MUFU, F2F, F2I, I2F, I2I | 14 |
| Volta | IADD3, SHF, LOP3, SEL, MOV, FADD, FFMA, FMUL, ISETP, FSET, FSETP | 4 |
| | IMAD, FMNMX, DSET, DSETP | 5 |
| | HADD2, HMUL2, HFMA2 | 6 |
| | DADD, DMUL, DFMA | 8 |
| | FLO, BREV, MUFU | 14 |
| Turing | IADD3, SHF, LOP3, SEL, MOV, FADD, FFMA, FMUL, ISETP, FSET, FSETP | 4 |
| | IMAD, FMNMX, DSET, DSETP | 5 |
| | HADD2, HMUL2, HFMA2 | 6 |
| | POPC, FLO, BREV, MUFU | 15 |
| | DFMA, DSET, DSETP | 54 |
As the Turing whitepaper mentions, the dependent-issue latency for core FMA math operations is 4 clock cycles, the same as on Volta.
On Turing, we found that most integer, single- and half-precision instructions have similar latencies as those on Volta, whereas double-precision instructions increased their latency above 40 cycles.
On Volta, most integer and single-precision instructions have a latency of 4 cycles. In our previous work we determined that most Volta double-precision instructions have a latency of 8 cycles, and half-precision instructions have a latency of 6 cycles.
On Maxwell and Pascal, instructions IMAD and IMUL have a long latency because they are emulated.
On Pascal, most integer and single-precision instructions have a latency of 6 cycles; double-precision instructions have a latency of 8 cycles; more complex instructions, some of which run on the SFU, require 14 cycles.
Experimental setup. Measuring dependent-issue instruction latency on a software-scheduled GPU requires the use of custom-tailored benchmarks designed as follows. To measure the latency of instruction A, we add a second instruction B that depends on A, then set the control word that regulates A's execution:
if A has fixed latency, we choose a B that consumes A's output. We decrease A's stall cycles in its control word until A's result, as consumed by B, becomes incorrect. The last stall value producing correct results is A's latency;
if A has variable latency, we choose a B of known latency, then set control flags to create an artificial read/write dependency between A and B. We let the scheduler wait for the dependency, then measure the pair’s cumulative latency with a bracket of CS2R instructions, and obtain A’s latency by subtracting B’s known one.
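The fixed-latency procedure (first bullet) can be mimicked with a toy timing model, in which B reads a correct value only if it is stalled at least as long as A's true latency. This is a simulation of the method, not GPU code:

```python
def measure_fixed_latency(true_latency, max_stall=15):
    # Model: B reads A's output `stall` cycles after A issues; the read is
    # correct iff stall >= A's true latency. Sweep the stall field downward
    # and report the last value that still produced a correct result.
    def b_reads_correct(stall):
        return stall >= true_latency
    for stall in range(max_stall, -1, -1):
        if not b_reads_correct(stall):
            return stall + 1   # smallest stall that was still correct
    return 0
```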
13 Atomic operations
Our measurements show that atomic operations on shared memory have a slightly longer latency on Turing than on Volta, but shorter than on Pascal and older generations. In Table 7, we report those latencies expressed in clock cycles. The comparison is meaningful even in real terms, because the different GPUs adopt similar clock frequencies (reported in Table 2).
As far as atomics on global memory are concerned, latency seems to have increased on the T4 device compared with the V100. The M60 GPU had the best latency among all the GPUs considered.
Notably, Kepler is the only architecture where shared memory atomics are slower than global memory ones, and by a large margin (4 to 8×). This is due to Kepler's lack of hardware support for shared memory atomics. Moreover, its emulated atomics degrade quickly under contention. Later architectures support atomics in hardware and offer low-latency atomics, even in the presence of contention.
(Table 7: atomic operation latencies, in clock cycles, on shared memory and on global memory.)
We measured these latencies with benchmarks designed in the following manner: we determine the latency of atomic instruction A by following it with a load instruction B, of known latency, that visits the same location. We deduce A’s latency from that of pair (A,B) as described in the previous section.
Figure 16 reports the throughput measured on GPUs from Kepler to Turing in the presence of contention, in four scenarios:
Scenario 1, one block of 1,024 threads. Of these, a subset of threads accesses the same address, while the others access distinct, sequential addresses in global memory; groups of 8 threads fall within the same L2 cache line;
Scenario 2, one block of 1,024 threads. Of these, a subset of threads accesses the same address, while the others access sequential L2 cache lines in global memory, with each group of threads accessing a single L2 cache line;
Scenario 3, a variable number of blocks, of 1,024 threads each. All threads in all blocks access the same address; heavy contention exists among blocks;
Scenario 4, a variable number of blocks, of 1,024 threads each. All threads within a block access the same address. Different blocks access distinct addresses; no contention exists among blocks.
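The address patterns above can be sketched in a single kernel as follows (a sketch under our assumptions: the names are ours; R, the number of threads sharing one address, and G, the group size, vary across runs; here 8 consecutive 32-bit words stand for one 32-byte L2 cache line):

```cuda
__global__ void atomic_contention(unsigned *buf, int scenario, int R, int G) {
    int t = threadIdx.x;
    unsigned *addr = buf;
    switch (scenario) {
    case 1: addr = buf + (t < R ? 0 : t);           break; // distinct sequential words
    case 2: addr = buf + (t < R ? 0 : (t / G) * 8); break; // one L2 line per group
    case 3: addr = buf;                             break; // all blocks contend
    case 4: addr = buf + blockIdx.x;                break; // one address per block
    }
    atomicAdd(addr, 1u); // the contended atomic under measurement
}
```

Aggregate throughput is then the total count of atomics completed, divided by the elapsed time.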
The T4 GPU does not achieve the highest throughput in the scenarios with contention, nor in the single-SM scenarios. The only scenario in which the T4 GPU provides the best performance is the one spanning multiple SMs without contention among them. In all scenarios, aggregate throughput increased substantially from Maxwell to Pascal.
14 New Tensor Core instructions
The Turing architecture refreshes its Tensor Cores by offering support for a wider range of operand types than Volta. Specifically, Tensor Cores as introduced in Volta were designed to offer high throughput when performing matrix math on half-precision floating point operands; on Turing, Tensor Cores add support for short integer operands: int8, int4 and int1.
Moreover, Turing offers new instructions that allow programmers to express matrix math more succinctly. To demonstrate this, we compare the Volta and Turing code that the compiler generates for the same warp-level primitive wmma::mma_sync(). Readers will recognize this example from Chapter 4.3 of our technical report on Volta.
When targeting Volta, NVCC compiles one example invocation of the primitive into the following 16 HMMA.884.F32.F32.* instructions:
When targeting Turing, NVCC compiles the same primitive invocation into only 4 HMMA instructions of a new kind, which carry the new .1688 infix:
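For reference, the primitive appears in context as sketched below (a minimal sketch assuming the 16×16×16 fragment shape; the kernel name and layouts are ours). Compiling it with -arch=sm_70 versus -arch=sm_75 reproduces the two instruction sequences discussed above:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 output tile: C = A * B + C.
__global__ void tile_mma(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);      // the primitive compared above
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```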
15 Arithmetic performance
We evaluated arithmetic performance by benchmarking matrix-matrix multiplications using functions from the cuBLAS 10.1 library and template functions from cutlass 1.2, on integer operands and floating-point ones of different precisions. We report arithmetic throughput in TOPS and TFLOPS, when operating on integer and floating-point values respectively. In all experiments, the T4 GPU was running at a clock frequency of 1,590 MHz.
In half, single and double precision, cuBLAS provides higher arithmetic throughput than cutlass. This is because the cuBLAS library has been specifically optimized for the Turing architecture. For int8 precision, two APIs are available in cuBLAS 10.1:
BLAS-like extension function cublasGemmEx, which invokes native CUDA core implementations, and
the new light-weight cublasLtMatmul function, which supports int8 native TensorCore implementations.
For int8, the throughput of cublasLtMatmul is much higher than that of cublasGemmEx. At the time of this writing, only cutlass supports int4 and int1 matrix multiplication on NVidia GPUs.
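As an illustration of the first API, an int8 GEMM through cublasGemmEx might be set up as sketched below (an assumption-laden sketch: the wrapper name and parameter choices are ours, and cuBLAS imposes alignment restrictions on int8 dimensions and leading dimensions that real code must satisfy):

```cuda
#include <cstdint>
#include <cublas_v2.h>

// int8 inputs, int32 accumulation; column-major, no transposition.
cublasStatus_t int8_gemm(cublasHandle_t h, int m, int n, int k,
                         const int8_t *A, const int8_t *B, int32_t *C) {
    const int32_t alpha = 1, beta = 0;  // scalars match the compute type
    return cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha, A, CUDA_R_8I, m,
                        B, CUDA_R_8I, k,
                        &beta, C, CUDA_R_32I, m,
                        CUDA_R_32I,            // compute in int32
                        CUBLAS_GEMM_DEFAULT);  // CUDA core path
}
```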
Except in double precision, benchmarks do not achieve near-peak performance. For int8 and int4, cutlass implementations do not achieve 50% of the theoretical throughput on the T4 GPU (Figure 17).
In Table 8 we compare the arithmetic throughput achieved by the T4 and P4 GPUs on matrix multiplication at different precisions, with both boards running at their respective top frequencies (1,590 and 1,531 MHz). The T4 GPU enjoys a higher throughput in half precision and int8 precision, thanks to its use of Tensor Cores.
Because the T4 and the P4 GPU have the same number of CUDA cores, we measure similar arithmetic throughput in matrix multiplication on the two boards, in double and single precision. Note that double-precision performance is hampered by the small number of native FP64 cores available (only two per SM), as both architectures are optimized for inference, where lower precision is more frequently employed.
16 Performance throttling
Most GPUs include forms of clock throttling and/or power-state throttling to prevent exceeding either the power or thermal envelopes if the workload is particularly demanding or the heat dissipation is insufficient.
Our experiments show that the small form-factor T4 and P4 boards, designed for inference applications, achieve a significantly higher frequency-per-Watt rating than their full-size counterparts. At the same time, they are more prone to clock throttling than their full-size counterparts (K80, P100, V100, and M60) because of
their smaller size, which limits their heat sinks’ heat transfer rate, and
their maximum power limit set by the manufacturer, which is significantly lower (70W) on low-power, small form-factor boards than on full-size boards (250W).
Experimental setup: All GPU specimens we examined adopt passive cooling. Our K80, P100, V100 and M60 experiments ran on Dell PowerEdge C4130 servers, which are Tesla-qualified. Our T4 and P4 experiments ran on HPE Proliant DL360 Gen9 servers. This server model does not appear in NVidia’s Tesla-qualified server catalog. Power and thermal performance of a GPU also depend on the server that hosts it, and could be suboptimal on a non-qualified server. The server generation immediately following the one we employed (HPE Proliant DL360 Gen10) is Tesla-qualified, but we were unable to arrange for an upgrade before the publication of this manuscript.
In our experiments, we were able to trigger clock throttling on the T4 GPU consistently, using benchmarks based on cuBLAS matrix multiplication kernels cublas<t>gemm. On the T4 GPU, clock throttling triggers for two reasons:
power-limit throttling: instantaneous power exceeds the power limit set by the manufacturer (70W on the T4 GPU);
thermal throttling: the GPU reaches its maximum operating temperature (85°C on the T4 card).
Compared to power-limit throttling, thermal throttling causes a more severe clock frequency reduction.
16.1 Power-limit throttling
On the T4 and P4 GPUs, we saw power-limit throttling trigger very early in our cuBLAS-based matrix multiplication experiments. On the other hand, the V100, P100, M60 and K80 GPUs barely experienced any power-limit throttling, due to the larger margin between actual power consumption and its limit.
To confirm the cause of throttling, we designed an experiment that invokes cublas<t>gemm kernels with input matrices of growing size. We observed that the T4 GPU exceeded its power limit more and more frequently, and lowered its clock rates more and more, as the input size grew. The reduced clock frequency eventually hurts overall arithmetic throughput. See Figure 18.
In the experiment, we set the application clock for graphics on the T4 card to 1,590 MHz, and prevented GPU temperatures from exceeding the T4 GPU’s maximum operating temperature. We recorded the clock frequency of the T4 card while it computed cublas<t>gemm in half precision.
16.2 Thermal throttling
We characterized thermal throttling with a benchmark that repeatedly launches a cublas<t>gemm kernel on a large matrix. We observed that below 85°C (the maximum operating temperature), power-limit throttling causes the T4 GPU to reduce its graphics clock as temperature grows. As soon as the temperature reaches 85°C, thermal throttling triggers in addition to power-limit throttling, causing the more dramatic clock frequency step-down depicted in Figure 19.
16.3 Power-limit throttling across GPU devices
We compared the power-limit throttling behavior of the different GPUs by recording their graphics clock over time, while all cards computed endless repetitions of the same cublasSgemm kernel on 1,024×1,024 input matrices.
We noticed substantial differences between the low-power GPUs (T4 and P4) and the full form-factor GPUs (K80, M60, P100, V100). We observed clock throttling only on the T4 and P4 GPUs. Both cards were only able to run at their highest supported clock frequency for a few seconds at the very beginning of the experiment. As temperatures increased, clock throttling intervened and clock frequency decreased (Figure 20).
On full-height, full-length GPUs, we could not raise power consumption enough to approach the limits and trigger throttling.
Experimental setup: in all experiments, we set all graphics clocks to the highest supported value for each device. We turned off the AutoBoost features wherever available. We also ensured that only power-limit throttling was active.
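The setup above corresponds to nvidia-smi invocations of the following shape (a sketch: the device index and the clock pair shown are the T4’s; substitute each board’s own supported values):

```shell
# Disable AutoBoost where supported, pin application clocks, then sample
# SM clock, power draw, and temperature once per second during the run.
nvidia-smi -i 0 --auto-boost-default=0
nvidia-smi -i 0 -ac 5001,1590   # memory,graphics clocks in MHz (T4 values)
nvidia-smi -i 0 --query-gpu=clocks.sm,power.draw,temperature.gpu \
           --format=csv,noheader -l 1
```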
-  Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting the NVidia Volta GPU architecture via microbenchmarking,” in 2018 GPU Technology Conference, 2018. [Online]. Available: http://on-demand.gputechconf.com/gtc/2018/presentation/s8122-dissecting-the-volta-gpu-architecture-through-microbenchmarking.pdf
-  ——, “Dissecting the NVidia Volta GPU architecture via microbenchmarking,” 2018. [Online]. Available: https://arxiv.org/abs/1804.06826
-  NVidia Turing GPU Architecture. NVIDIA Corporation, 2018.
-  Y. Hou, “Asfermi,” 2011. [Online]. Available: https://github.com/hyqneuron/asfermi
-  C. Wang, Z. Jia, and K. Chen, “Tuning performance on Kepler GPUs: An introduction to Kepler assembler and its usage in CNN optimization,” in GPU Technology Conference Presentation, vol. 6173, 2015.
-  S. Gray, “maxas,” 2016. [Online]. Available: https://github.com/NervanaSystems/maxas/wiki/Control-Codes
-  H. Wong, M. M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, “Demystifying GPU microarchitecture through microbenchmarking,” in 2010 IEEE International Symposium on Performance Analysis of Systems Software (ISPASS), March 2010, pp. 235–246.
-  X. Zhang, G. Tan, S. Xue, J. Li, K. Zhou, and M. Chen, “Understanding the GPU microarchitecture to achieve bare-metal performance tuning,” in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’17. New York, NY, USA: ACM, 2017, pp. 31–43.
-  X. Mei and X. Chu, “Dissecting GPU memory hierarchy through microbenchmarking,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 1, pp. 72–86, Jan 2017.
-  NVidia Tesla V100 GPU Architecture, The World’s Most Advanced Data Center GPU. NVIDIA Corporation, 2017.
-  “NVidia cuobjdump and nvdisasm,” 2016. [Online]. Available: https://docs.nvidia.com/cuda/cuda-binary-utilities/