1 Introduction
Deep learning’s reliance on matrixmultiplication for compute has driven both research and industry to develop matrixmultiplication accelerator hardware — collectively called Tensor Core Units (TCUs) in this paper. TCUs come under the guise of different marketing terms, be it NVIDIA’s Tensor Cores [18], Google’s Tensor Processing Unit [10], Intel KNL’s AVX extensions [76], Apple A11’s Neural Engine [2], or ARM’s Machine Learning Processor [3]
. TCUs are designed to accelerate Multilayer Perceptrons (MLP), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), or Deep Neural Network (DNN) in general TCUs vary in implementation
[71, 54, 18, 87, 76, 74, 75, 36, 40, 43, 79, 48], and are prevalent [4, 11, 24, 70, 10, 9, 8, 1] in edge devices, mobile, and the cloud.Although TCUs are prevalent and promise increase in performance and/or energy efficiency, they suffer from over specialization — with only matrixmultiplication operations being supported. This limits their applicability to general algorithms and makes them confined to narrowly specialized libraries and application domains. Take NVIDIA’s Tensor Cores, for example. For matrix multiplication, the NVIDIA Volta architecture achieves a throughput increase — with each Streaming Multiprocessor (SM) capable of performing half precision operations per cycle using the TCUs compared to half precision operations per cycle without the TCUs. This is enabled by the fact that NVIDIA’s Volta GPUs dedicate a large portion of the SM processing unit (or subcore) chip area to TCUs, shown in Figure 1. Currently algorithms other than general matrixmatrix multiplication (GEMM) do not utilize the TCUs— resulting in an idle TCU and low chip utilization for these algorithms.
The objective of the paper is to expand the class of algorithms that can execute on TCUs— enabling the TCU to be used for nonGEMM kernels. We choose reduction and scan, since a large body of work [30, 32, 31, 60, 82, 84, 68, 33] has shown that they are key primitives of data parallel implementations of radix sort, quicksort, string comparison, lexical analysis, stream compaction, polynomial evaluation, solving recurrence equations, and histograms. We formulate a simple mapping between reduction or scan and TCUs. Then we develop a library of warp, block, and gridlevel primitives for reduction and scan that utilize the NVIDIA TCUs, and present how they can be used in concert to achieve nearideal performance. Although this paper targets GPUs, the motivation, methods, and observations are applicable to a wide number of TCU implementations.
While the formulation is the main objective, we show that our reduction and scan algorithms is either order of magnitude faster or rivals the fastest GPU implementation, with much lower programming complexity. We also show that by leveraging the TCU we free up the general purpose ALUs on the SMs for tasks that cannot be expressed in terms of TCU operations.
The key contributions of the paper are as follows:

We show how to use the TCU to compute both reduction and scan — we believe we are the first to formulate these algorithms in terms of TCUs operations.

We measure our algorithms and show that we are orders of magnitude better than stateofart algorithms for small segment sizes — this is common in mathematics (e.g. evaluating polynomials), scientific applications (e.g. finite difference), and machine learning (e.g. batch norm) — and are comparable to the fastest algorithm for large segments.

We find that we are more energy efficient and decrease the utilization of general purpose ALUs — making them free for tasks that cannot be represented on TCUs.

We provide insights into how to relax the current CUDA TCU programming interface constraints.
This paper is divided as follows: we first describe the current NVIDIA TCUs and show the performance of GEMM and GEMV computation in Section 2. In Section 3, we give a background of reduction and scan and show the TCU algorithms for reduction (Section 4) and scan (5). We then compare our implementation against stateoftheart in Section 6. Section 7 describes the related work, before we conclude in Section 8.
2 NVIDIA Tensor Cores
A marquee feature of NVIDIA’s Volta (Tesla V100) architecture is its TCUs— a programmable matrixmultiplyandaccumulate hardware units, called Tensor Cores^{1}^{1}1We will use TCU and Tensor Core interchangeably in this paper. by NVIDIA. Figure 1 illustrates the processing block within each SM, with the V100 containing SMs and each having processing blocks. In turn, each processing block contains Tensor Cores — for a total of Tensor Cores on the V100 and achieving a throughput improvement over previous generation Tesla P100 [22]. This section first describes the current usage of the NVIDIA TCUs, then details the current TCU API and presents some evaluation results that motivate our work.
Each Tensor Core provides a tensor processing array capable of performing the operation within a cycle, where , , and are matrices. The inputs and must be in half precision format while the accumulators, and , can be either single or half precision. Each Tensor Core can perform FMA operations per cycle. Using Tensor Cores, each SM can perform floating point operations per cycle since each each FMA consists of two floating point operations and each SM contains Tensor Cores. This is an SM throughput increase compared to Pascal for floating point operations [22].
2.1 Current Usage
Tensor Cores have been only used to accelerate GEMM operations, most prominently through NVIDIA’s CUDA libraries. The NVIDIA libraries, such as cuBLAS [15] and cuDNN [16], require users to optin to use the Tensor Cores and both libraries accelerate GEMM computation (HGEMM for cuBLAS and convolution and recurrent neural networks for cuDNN). NVIDIA also provides the CUTLASS (CUDA Templates for Linear Algebra Subroutines) [17] library, which is a CUDA templated library that provides the building block primitives to write high performance GEMMlike kernels. Deep learning frameworks such as NVCaffe [52], Caffe2 [5], MXNet [14]
, PyTorch
[23][25], and TensorRT [19] leverage the NVIDIA libraries for DNN training [13] and inference.2.2 Programming Interface
NVIDIA also provides a CUDA C++ Warp Matrix Multiply and Accumulate (WMMA) [7] API to program the Tensor Cores directly. The current WMMA API provides warplevel matrix operations for matrix load (load_matrix_sync), matrix store (store_matrix_sync), and matrix multiply and accumulate (mma_sync). These APIs operate on a special data type fragment, which holds a matrix tile in threadlocal registers. A helper function to fill a matrix fragment with a scalar constant (fill_fragment) is provided as well. NVIDIA has not provided any API for explicitly calling the TCU at sub warp level — neither in the IR nor in the PTX [20, 21].
A load_matrix_sync distributes values of the matrix across the warp lanes. Threads within a warp utilize multiple Tensor Cores concurrently to perform the mma_sync operation — collaborating to compute the , with , , denoting the matrix dimensions. The API imposes limitations on the dimensions — requiring the shape to be either , , or .
Listing 1 shows a simple CUDA kernel that computes a matrix multiplication within a warp using the WMMA API. Lines 1–1 declare the matrix fragments. The API supports kinds of matrices — matrix_a (), matrix_b (), and accumulator ( or ) — with each having their own internal data layout ^{2}^{2}2The mapping between individual matrix elements to their residing thread(s) is purposely opaque [7] and undocumented. We will discuss how we alleviate some of the constraints in 6.1. as well as loading, storing, and computing semantics. Users specify both the data type and the shape of the fragments. For both the and
matrix kinds, users specify whether the matrix is in column or rowmajor order. Users specify the stride between rows using the leading dimension and load the data from either shared or global memory — Lines
1–1. Line 1 initializes the matrix_c elements to zero by broadcasting the scalar value into each index of the fragment. Once the data is loaded, users perform the matrix multiplication operation — Line 1 — and store the results — Line 1.The kernel in Listing 1 can be generalized to implement GEMM for arbitrary matrix dimensions in a manner similar to tiled matrix multiplication. For example, a naive implementation (referred to as WMMA HGEMM (naïve)) assigns a strip of rows from matrix and a strip of columns from matrix columns to each warp to compute a tile of the output . Each warp iterates through the A rows and B columns by loading tiles of A and B from global memory into the fragments using load_matrix_sync, then performing mma_sync, and repeating. After the entire rows of A and columns of B have been consumed, the warp uses store_matrix_sync to store the accumulated values into global memory. An optimized implementation (referred to as WMMA HGEMM) utilizes persistent threads where each thread block collaboratively loads multiple tiles of matrix and from global memory to shared memory so that some of the tiles can be reused across warps. The tiles are then loaded into fragments and the mma_sync operation is performed.
2.3 GEMM Evaluation
We evaluate the GEMM performance using Tensor Cores on a Tesla V100 PCIE GPU with CUDA through cuBLAS, CUTLASS (version ), and hand written kernels using the WMMA API. The results are shown in Figure 4.
For half precision GEMM (HGEMM), shown in Figure (a)a, cuBLAS HGEMM with Tensor Cores achieves a maximum performance of TFLOPS — approximately the peak performance — and over that of cuBLAS without the use of TCUs. For mixed precision GEMM (MGEMM), shown in Figure (b)b, we achieved maximum performance of TFLOPS on NVIDIA TCUs using cuBLAS, approximately the peak performance, and is about the performance of cuBLAS without Tensor Cores (the degradation of performance compared to HGEMM is due to output bytes count being twice as large). CUTLAS MGEMM is more performant that it’s HGEMM, we suspect this is due to optimizations for mixed precision that are absent for halfprecision.
2.4 GEMV Evaluation
The order of magnitude speedup of GEMM with TCU raises the question: can we formulate other algorithms in terms of matrix multiplication and also benefit from the TCU? The most obvious algorithm is matrixvector multiplication (GEMV).
We implement HGEMV (half precision GEMV) and MGEMV (mixedprecision GEMV) with HGEMM or MGEMM call of dimension — with required to be at least by the cuBLAS API. This method wastes at least memory loads and performs extra flops. We evaluate our implementations against cuBLAS SGEMV, since half precision GEMV is not present in cuBLAS.
Figure 5 shows that even when accounting for both resource and computation waste, HGEMV, implemented using cuBLAS HGEMM with Tensor Cores, outperforms the cuBLAS SGEMV by at least . Naïve (nonoptimized ^{3}^{3}3A user implicitly performs tiling when utilizing the WMMA API.) HGEMV and MGEMV are super imposed atop each other since the overhead of using mixedprecision is dwarfed by inefficient memory access. Both naïve versions still outperform cuBLAS SGEMV for large input sizes.
The WMMA HGEMV implementation saturates at about 900 GFLOPS in Figure 5. This is because each FMA requires one matrix element and one vector element, each is 2 bytes in size, to perform the FMA operations. That is we need to fetch one byte from the global memory into the shared memory for every floatingpoint operation performed. Note that there is no reuse of matrix elements and heavy reuse of vector elements. Assume that the vectors are mostly loaded from the L2 cache due to heavy reuse, the 900 GB/s global memory bandwidth of V100 is the hardware limitation that prevents the any GEMV implementation to go beyond 900 GFLOPS in half precision.
3 Conventional Reduction and Scan on GPU
The GEMV evaluation shows that the matrixmultiplication performance of NVIDIA TCUs is high enough to tolerate resource and computation waste in algorithms. Driven by this observation, we examine how to formulate two widely used collectives — reduction and scan — to utilize TCUs. Both reduction and scan are memorybandwidth bound so we expect that even with some resource and compute waste, their HGEMMbased implementations can beat or match the most highly tuned traditional implementations.
Given a vector , the reduction, also called fold or total, of is defined by its sum . Segmented reduction is defined as reductions on subsets of the input vector. In a regular segmented reduction, all segments are the same size^{4}^{4}4Irregular segmented reduction is implemented in terms of regular segmented reduction, and is elided from this paper.. The scan operation, also called prefix sum, is defined by the vector . Segmented scan is defined similarly to segmented reduction.
State of the art libraries such as CUB [62], MAGMA [26], OpenMP runtime [38], implement both reduction and scan in terms of warp, block, and devicelevel collectives, as illustrated in Figure 6. The building blocks — warplevel reduction and scan, are commonly implemented using shuffle instructions [12]. The shuffle implementations, shown in Listing 2, allow threads within a warp to share values via registers without using shared memory or blocklevel synchronization.
4 TCU Reduction Algorithm
Intuitively, reduction can be represented as a special case of a matrix multiplication, since
The challenge is to map generic input sizes onto the fixed tensor shapes supported by the TCU. For simplicity, this paper will use the matrix configuration as an example to implement the segmented reduction. Other configurations can be used in a similar manner to perform segmented reduction for multiples of or . It is natural to extend our algorithms to support mixed precision computation.
We use to represent a regular segmented reduction — partial reductions of the input uniformly partitioned into element subsets. We will also use as the matrix which has the first row set to 1 and the rest to 0 — i.e. , and the notation to denote a matrix where all the elements are the constant value .
To make our formulation nonWMMA API specific, we present our algorithms in an API neutral way. In the following sections, we use LoadTile in place of the load_matrix_sync which accepts a matrix layout (default is rowmajor) and stride (default is ). We abstract store_matrix_sync to make it addressable as if it was a linear array. We will also use the notation to denote the mma_sync operation.
4.1 Warplevel Reduction
We look at warplevel reduction first, since it is the building block for both block and gridlevel reductions. Even though using shuffle instructions for warplevel reduction is efficient, it can be a bottleneck due to its limited throughput. On the NVIDIA Volta architecture only warp shuffle operations can be performed within a clock cycle per SM.
This section shows the formulation of warplevel reduction onto the TCU when the segment size is , , and multiples of or
. Support for arbitrary segment sizes can be realized either by padding the input with zeros or by masking through the
matrix. We find that padding introduces minimal overhead and is required in some cases to maintain the memory alignment imposed by the TCU API.Segment Size 16:
The algorithm, shown in Algorithm 1 and Figure 7, performs warplevel reduction on elements which represent segments of size . The data is loaded from memory into a columnmajor order fragment (matrix ). Each row is then reduced using . The result — first row of — is stored in the output.
Depending on the TCU API’s flexibility, one can formulate the algorithm as either where is in columnmajor order or where is in rowmajor order. The formulation is more friendly to loading of A elements since the input vector elements naturally form the rowmajor layout for A. However, there is no performance difference when using the WMMA API since: (1) the API not expose subfragment operations and (2) the CUDA compiler performs the transposition through registers. We formulate the algorithm as to make it amicable for a more relaxed API, since the element output is in linear rowmajor order and can be written to consecutive locations in the global memory.
Segment Size 256:
For handling the segments of size , one follows a pattern similar to . First, all elements are loaded onto the TCU. The rows are reduced using the same procedure as . Finally, we reduce the first row to get a scalar. The algorithm is shown in Algorithm 2 and is a single iteration of the algorithm illustrated in Figure 8.
Segment Size Multiples of 256:
With the above and warplevel primitives, we can express segments that are multiples of either (denoted by ) or (denoted by ). We will first look at the algorithm, since it will be used for representing the algorithm.
A naïve way is to implement the segmented reduction as repeated applications of the , shown in Figure 8. While this is correct, it is work inefficient — wasting one matrix multiplication for each iteration. Instead of performing two reductions in each iteration, we can implement a work efficient segmented reduction by first reducing each row of the matrix () in each iteration and use the row of reduced values as an accumulator. In the final iteration, the final row is reduced into a scalar. Figure 9 illustrates the algorithm.
Segment Size Multiples of 16:
Similar to , segmented reduction where the segment size is multiples of () can be performed in two ways. The first is a strided segmented reduction, shown in Figure 10 (for the case where ). During each iteration , a warp loads segments (each of length ) into the matrix with a stride of , i.e., the beginning of each 16element segment is 16N elements away from the beginning of the next segment in the original input vector. The columns of are then reduced and accumulated into the first row of . This repeats for N iterations. This method is simple, works for arbitrary multiple of , and leverages GPU cache for small . For large this method suffers from bad locality, consistently misses the cache, and results in poor performance.
For , one can implement in terms of — shown in Algorithm 3. This makes better use of cache locality and reduces uncoalesced memory accesses. The left over, , can be implemented using the strided segmented reduction method.
4.2 Blocklevel Reduction
When the segment size is large, collaborative reduction within a block becomes profitable. We follow standard practice [62] to implement blocklevel reduction, but differ in that we still use the TCU to perform reduction on the partially reduced values. Algorithm 4 shows how we use warplevel reduction to implement the blocklevel kernel.
4.3 Gridlevel Reduction
When the segment size is very large a gridlevel reduction might be needed. A naïve gridlevel reduction for a list of length involves two kernel launches. The first kernel launch performs a segmented reduction with the output stored in a list of partials. A second kernel then reduces the partials into a scalar. Although it is simple to implement, it achieves performance that’s on par with the fastest algorithm.
5 TCU Scan Algorithm
Unlike reduction, it might be less intuitive to represent scan as a matrix multiplication. For a vector of elements, we can store it in rowmajor order within a matrix — with .
We notice that a rowwise scan can then be obtained by multiplying the matrix with an upper diagonal matrix — with the values of the upper diagonals being and the rest .
Similarly, to get the scan of each column one can use a lower diagonal matrix. We use a strict lower diagonal, i.e. the diagonal is , to get an exclusive scan of each column.
We use the matrix to create a matrix where each element is the reduction of the row of . That is, all elements in the row of are of the same value — the sum of the all elements preceding the row of , i.e. . The matrix can be generated by multiplying with a matrix with all element values set to 1. We then add to the matrix to generate the scan of — which is read in linear rowmajor order.
Throughout this section we will use to represent the upper diagonal matrix where the upper diagonal values are one, and use to represent the strict lower diagonal matrix where the values bellow the lower diagonal are one — i.e. and .
5.1 Warplevel Scan
With the above formulation, we follow a similar structure to Section 4: first introducing warplevel primitives before presenting the block and gridlevel primitives. We write to represent a regular segmented scan. Since the process of building warplevel, blocklevel, and gridlevel scans from is very similar to that of reduction, we will only highlight the key differences.
Segment Size 16:
Is the equation and is illustrated in steps , , and in Figure 11.
Segment Size 256:
Is implemented using matrix multiplications shown in Figure 11 and presented symbolically above.
Segment Size Multiples of 16:
Is similar to strided reduction, with the difference being that we broadcast the last column rather than the reduced value, shown in Algorithm 5.
Segment Size Multiples of 256:
5.2 Blocklevel Scan
Algorithm 7 shows how to perform the scan at the block level. It first computes the segmented scan using the warp primitives and stores the reduced values into a partials list. A scan is performed on the partial list and the values are added to the intermediate results to get the output.
Algorithm 7 also exercises the TCU to perform the scan on the partially reduced values across tiles. On Line 16 we use the offset of the last row () and as the leading dimension when loading the tile. This loads the last row of across tiles into . Line 17 then performs an exclusive scan on the last column of the and stores the results into the list of partials^{5}^{5}5 The implementation of is performed by just loading the last column values into the first row and performing an TCU version of the exclusive scan algorithm. Formulating the intermediate operation this way is needed to adhere to the CUDA WMMA API’s byte alignment constraint for loading fragments..
5.3 Gridlevel scan
Similar to reduction, the segmented scan is used as a building block for the gridlevel scan. The gridlevel scan uses a text book implementation, scanthenpropagate strategy, and involves kernel calls. The first kernel uses segmented scan and produces partially reduced values for each block. The second kernel performs a scan on the partially reduced values. The third kernel then uniformly adds the partially scanned values to their corresponding segments.
6 Evaluation
We evaluate our implementation on an Intel Xeon E52698 with CentOS , CUDA Driver , and CUDA Runtime installed. We use the Tesla V100PCIE GPU with 16GB of GPU Memory. The Tesla V100 sports HBM2 with a theoretical peak bandwidth of or billion half precision elements per second. All the results below show the throughput of the algorithms in terms of billions of half precision elements per second.
We implemented ^{6}^{6}6Accessible at https://www.removed/for_blind_review. our algorithms as a C++ header library. The API is designed to be similar to CUB’s high level API — providing functions such as TCU::SegmentedReduce, TCU::Reduce, TCU::SegmentedScan, and TCU::Scan. We employ autotuning to select the ideal algorithm, number of warps (or independent TCU operations) per block, coarsening factor (the number segments to perform within a warp), and block dimensions for the userprovided segment size.
6.1 Relaxing the WMMA API Constraints
Constraints arise when using the current WMMA API for nonGEMM computation. These limitations would not exist if one is to perform just GEMM computation. The constraints observed were:

Loads or stores must be performed at fragment granularity.

Loading and storing fragments can only be performed using global or shared memory; constant memory cannot be used.

The memory layout for the matrix kinds are not the same, making casting between matrix kinds a kluge.
We address these limitations in different ways within our implementation. For (1) and (2) we use knowledge about the internal layout of the fragment [53], we implemented API enhancements tailored to our usage. Listing 3 shows our API to support extended usage of the matrix_b fragment for both storing a constant matrix into a fragment (matrix_b_set_upper_triangular) and loading partial fragments (matrix_b_get_first_column).
Although we can use the layout information to shuffle registers to address (3), we express the cast in terms of load/store available through the WMMA API. For example, to cast a matrix in the matrix_a format to matrix_b format, we first store the matrix into shared memory and then perform a load from memory to matrix_b. These extensions can be leveraged by compiler runtimes to perform casting between fragment kinds.
We observe that API extensions using fragment layout information requires less block synchronization, uses less sharedmemory — resulting in an increased performance by up to for our algorithms. For brevity we omit these results.
6.2 Optimizing CUB for Half Precision
CUB is a C++ template library that contains multiple algorithms for the collectives. The CUB library contains fastest [63, 66] implementation for the reduction and scan collectives and is used by libraries such as Thrust [29] as well as most deep learning frameworks [25, 14, 23, 5, 68]. We compare against the latest release of CUB [62] (version ) and evaluate against different parameters of the collectives. As an optimization, warplevel shuffle reduction and scan are implemented in PTX for integer, float, and double data types, since NVCC is currently unable to use the shuffle instruction’s predicate to guard against invalid peers [12, 69]. CUB does not contain these shufflebased optimizations for half precision. To make the evaluations fair and more technically meaningful, we implement these optimization for the half precision data type in CUB. The modified CUB is used for the evaluation to provide a more aggressive base of comparison.
6.3 Warp and Blocklevel Reduction and Scan
Theoretically our warplevel TCU reduction algorithms require less than one fourth of the cycles of the warplevel shuffle reduction. For example, consider performing a warplevel : the warplevel reduction show in Listing 2 requires iterations of element reduction to reduce each segment. The total cycles is therefore , since each shuffle instruction and addition take cycles. On the other hand, our algorithm performs the reduction using two matrix multiplications or cycles — since each TCU WMMA matrix multiplication requires cycles. However, reduction is known to be memory bound, with the ideal performance bounded by memory copy speed.
We evaluate the TCU segmented reduction algorithms against cub::DeviceSegmentedReduce::Sum by fixing the number of input elements and varying the segment size, shown in Figure 12. When the segment size is less than , the algorithm is used. Its performance then degrades for large segment sizes due to its strided access pattern resulting in uncoalesced global memory access. When the segment size is larger than , the algorithm is used, but again suffers from performance degradation after segment size due to low occupancy. When the segment size is large — greater than , the blocklevel reduction is used. Our TCU implementation achieves more than of the peak throughput consistently for variable segment size and is always better than the CUB implementation.
When segment size is large and the number of segments is small, the performance of both CUB and our implementation drops. Since each segment gets mapped onto a block, a small number of segments causes some SMs to be idle. For example when segment size is , both CUB and our implementation achieve an occupancy of around and SM efficiency of around . A better strategy for these cases would be to assign multiple thread blocks to collaboratively reduce each segment when the size of the segment is very large. This can be achieved using CUDA 9’s cooperative groups [6]. Such optimization is outside the focus of this paper, which is to propose and evaluate TCUbased reduction and scan primitives.
Our TCU implementation largely outperforms CUB’s devicewide segmented reduction for different segment size. Through profiling, we identified the major predictors of performance to be, in the order of importance, the number of halfprecision floatingpoint instructions (inst_fp_16 in NVProf metric), warp instructions (inst_inter_thread_communication), and integer instructions (inst_integer). Consistently, our implementation’s halfprecision instructions is approximately equal to the number of total elements () while CUB requires a much larger number. Moreover CUB requires large number of integer and warp shuffle instructions while our implementation uses no warp shuffle instructions and a smaller number of integer instructions. This contributes to the performance improvement observed for segment size .
We examined the power consumption by measuring the average power within the execution phase of the kernel using NVVP. Our implementation uses less power compared to CUB across different segment sizes. Again, this is because of the efficient use of the and ALUs as well as better SM and DRAM utilization. We note that our algorithm leaves the ALUs (other than the TCU) idle, allowing less contention on these units.
CUB provides a cub::WarpReduce, applicable for segment sizes and , to compute a parallel reduction of elements within a warp. CUB also provides cub::BlockReduce to perform reduction within a block. These two primitives require users to partition the data and construct the kernel. Since CUB’s devicewide segmented reduction does not perform well for segment size smaller then , we evaluate our TCU implementations against cub::WarpReduce and cub::BlockReduce implementations, shown in Figure 13. The cub::WarpReduce implementation is tunable on block size. The cub::BlockReduce implementation is tunable on block size, thread coarsening factor, and reduction algorithms. We compare our implementation to the best CUB implementation. We find that our TCU implementations is still faster for segment size smaller than 1024, and is comparable to cub::BlockReduce for the other cases.
We evaluate TCU segmented scan algorithms against Thrust’s implementation (inclusive_scan_by_key), since CUB has no user visible API for segmented scan. The Thrust implementation utilizes CUB’s internal warp and blocklevel scan to implement the scanbykey operation. We use different segment sizes with a fixed number of input elements — the results are shown in Figure 14. Thrust, consistent with previous observation [42], has constant performance irrespective of the segment size. Our scan TCU implementations achieve more than of the peak throughput and is faster than thrust for small segment sizes. We observe lower power consumption compared to Thrust — observing it to be either equivalent in power usage or up to less. Our segmented scan is not ideal for large segment sizes since, as explained in Section 4, only a small number of blocks get launched and thus the GPU is under utilized. This can be remedied using the same strategy described for reduction.
CUB provides cub::WarpScan to compute a parallel scan of items partitioned within a warp, and cub::BlockScan within a block. Similar to reduction, these two primitives require more programming effort from the users to partition the data and construct the kernel. We evaluate our TCU segmented scan against the best cub::WarpScan and cub::BlockScan implementations, shown in Figure 13. The CUB scan implementations have the same tunable parameters as the CUB reduction implementations. We can see that our TCU implementations are still largely faster for small segment size, and are at least comparable to cub::BlockScan in other cases.
6.4 Gridlevel Reduction and Scan
Unlike the warp and blocklevel operations, this paper does not attempt to optimize gridlevel operations — opting to use a naïve implementation for the gridlevel collectives. The naïve implementation involves multiple kernel launches. We include the evaluation results to show that even our naïve gridlevel implementation achieves performance that is better or comparable to that of CUB and Thrust.
We compare against both CUB and Thrust for full reduction, Figure 15, and scan, Figure 16. For both cases, our implementation uses the blocklevel algorithm. Even with our naïve gridlevel implementation, we are able to mach the performance of CUB and are considerably faster than the Thrust implementation. For reduction and scan, the TCU implementation is slightly faster than CUB with large input sizes being bounded by memory bandwidth and is within (for reduction) of peek memory copy bandwidth.
For scan, our current unoptimized implementation internally uses CUB to reduce the partial values (kernel 2 described in Section 5.3). CUB, however, fails for inputs larger than which causes our implementation to fail. Future optimized implementations would not use CUB.
7 Related Work
The mapping of some algorithms onto matrix multiplication has already been well studied [49, 83, 56, 72]. Similarly, both reduction and scan are well studied from a performance and application [30, 32, 31, 78, 46, 64, 55] aspect on a wide range of architectures and have been heavily evaluated across GPU architectures [39, 41, 61, 85, 77]. To our knowledge however, there has been no attempt at mapping either reduction or scan in terms of matrix multiplication.
Considerable research has been done on the development of performance portable reduction and scan kernels [34, 28, 80, 57]. These compilers express the algorithms as system of alternative building blocks that are then composed and autotuned at compile time for both the target architecture and the input characteristics. These tools are synergistic with our technique, since we are able to add our algorithm as another building block to implement reduction or scan.
Previous work [85, 65, 62, 35, 73] has also shown that optimizations can be made to either avoid or hide the overhead of multikernel launches. These optimizations would enable our gridlevel operations to be competitive for large sizes when compared to stateoftheart methods. Other research looked at specific cases of scan, in [58] the authors look at performing scan on tuples while minimizing global reads and facilitating latency hiding. Recently there has been some work in applying scan and reduction to optimize database queries [81, 37, 47].
Work describing NVIDIA’s Tensor Cores is scarce. In [53], the authors use microbenchmarks to discern microarchitectural details of the V100 architecture. In [59, 45] use half precision and Tensor Cores to implement iterative solvers. They use half precision along with low quality solvers to compute the initial conditions and then switch to both higher precision solvers for subsequent iterations. The authors also examine the error incurred when using Tensor Cores and halfprecision for HPC workloads. In [44, 67, 27] the authors devise schemes to accelerate neural network training with limited loss of precision using half precision and Tensor Cores.
8 Conclusion
This paper leverages specialized hardware that was developed to optimize matrix multiplication for deep learning to implement both reduction and scan. This is in the same vein to early GPU research which implemented these algorithms in terms of shaders. We point to directions for future API and architectural changes to relax some of the TCU constraints such as loading fragments from constant, extracting single row or single column, etc. — resulting in a simplified implementation that achieves more energy efficiency and performance.
In this paper we showed how to map both reduction and scan onto TCU. We believe we are the first to formulate these algorithms to exercise the TCU. Our implementation achieves up to speedup on reduction, up to on scan, and showed that the performance rivals the state of the art implementation in the worst cases. We observed a less power consumption for reduction and for scan. We are able to make use the otherwise idle TCUs— enabling better GPU utilization for kernels that exercise the general purpose ALUs.
Future work would leverage the techniques described to map more algorithms and functions onto TCUs. We are specifically interested in transcendental and special functions (such as and
), since the NVIDIA special function units have been observed to be the bottleneck in HPC applications. We also want to express neural network layers in terms of TCUs— using a cuDNN like API. Some layer implementations and layer fusion opportunities are natural extensions of our work: such as the computation of variance in batchnorm
[50, 86, 51] or the evaluation of special functions in activation layers.References
 [1] Amazon EC2 Instances with Up to 8 NVIDIA Tesla V100 GPUs (P3). https://goo.gl/GyrFWN. Accessed: 2018724.
 [2] Apple A11 Bionic. https://www.apple.com/iphonex/#a11. Accessed: 2018724.
 [3] Arm Machine Learning Processor. https://developer.arm.com/products/processors/machinelearning/armmlprocessor. Accessed: 2018724.
 [4] Azure IoT Edge. https://azure.microsoft.com/enus/services/iotedge. Accessed: 2018724.
 [5] Caffe2. https://caffe2.ai. Accessed: 2018724.
 [6] Cooperative Groups: Flexible CUDA Thread Programming. https://devblogs.nvidia.com/cooperativegroups/. Accessed: 2018724.
 [7] CUDA C Programming Guide. https://docs.nvidia.com/cuda/cudacprogrammingguide/index.html#wmma. Accessed: 2018724.
 [8] FPGA Cloud Compute. https://cloud.baidu.com/product/fpga.html. Accessed: 2018724.
 [9] FPGA Cloud server. https://cn.aliyun.com/product/ecs/fpga. Accessed: 2018724.
 [10] Google Cloud TPU. https://cloud.google.com/tpu. Accessed: 2018724.
 [11] Google Edge TPU. https://cloud.google.com/edgetpu. Accessed: 2018724.
 [12] Kepler Shuffle: Tips and Tricks. http://ondemand.gputechconf.com/gtc/2013/presentations/S3174KeplerShuffleTipsTricks.pdf. Accessed: 2018724.
 [13] Mixed Precision Training. https://docs.nvidia.com/deeplearning/sdk/mixedprecisiontraining. Accessed: 2018724.
 [14] MXNet. https://mxnet.apache.org. Accessed: 2018724.
 [15] NVIDIA cuBLAS. https://developer.nvidia.com/cublas. Accessed: 2018724.
 [16] NVIDIA cuDNN. https://developer.nvidia.com/cudnn. Accessed: 2018724.
 [17] NVIDIA CUTLASS. https://devblogs.nvidia.com/cutlasslinearalgebracuda. Accessed: 2018724.
 [18] NVIDIA Tensor Cores. https://www.nvidia.com/enus/datacenter/tensorcore. Accessed: 2018724.
 [19] NVIDIA TensorRT. https://developer.nvidia.com/tensorrt. Accessed: 2018724.
 [20] NVVM IR Specification 1.5. https://docs.nvidia.com/cuda/nvvmirspec/index.html#nvvmintrinwarplevelmatrix. Accessed: 2018724.
 [21] Parallel Thread Execution ISA Version 6.2. https://docs.nvidia.com/cuda/parallelthreadexecution/index.html#warplevelmatrixinstructions. Accessed: 2018724.
 [22] Programming Tensor Cores in CUDA 9. https://devblogs.nvidia.com/programmingtensorcorescuda9. Accessed: 2018724.
 [23] PyTorch. https://pytorch.org. Accessed: 2018724.

[24]
Snapdragon Artificial Intelligence.
https://www.qualcomm.com/snapdragon/artificialintelligence. Accessed: 2018724.  [25] TensorFlow. https://www.tensorflow.org. Accessed: 2018724.
 [26] Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The plasma and magma projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009.
 [27] Andrew Anderson, Servesh Muralidharan, and David Gregg. Efficient multibyte floating point data formats using vectorization. IEEE Transactions on Computers, 66(12):2081–2096, 2017.
 [28] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. PetaBricks: a language and compiler for algorithmic choice, volume 44. ACM, 2009.
 [29] Nathan Bell and Jared Hoberock. Thrust: A productivityoriented library for cuda. In GPU computing gems Jade edition, pages 359–371. Elsevier, 2011.
 [30] Guy E Blelloch. Scans as primitive parallel operations. IEEE Transactions on computers, 38(11):1526–1538, 1989.
 [31] Guy E Blelloch. Prefix sums and their applications. Technical report, Technical Report CMUCS90190, School of Computer Science, Carnegie Mellon University, 1990.
 [32] Guy E Blelloch, Michael A Heroux, and Marco Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical report, CarnegieMellon Univ Pittsburgh PA School of Computer Science, 1993.
 [33] Timothy M Chan. More algorithms for allpairs shortest paths in weighted graphs. SIAM Journal on Computing, 39(5):2075–2089, 2010.
 [34] LiWen Chang, Izzat El Hajj, Christopher Rodrigues, Juan GómezLuna, and Wenmei Hwu. Efficient kernel synthesis for performance portable programming. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 12. IEEE Press, 2016.
 [35] Gaurav Chaurasia, Jonathan RaganKelley, Sylvain Paris, George Drettakis, and Fredo Durand. Compiling high performance recursive filters. In Proceedings of the 7th Conference on HighPerformance Graphics, pages 85–94. ACM, 2015.
 [36] YuHsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energyefficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of SolidState Circuits, 52(1):127–138, 2017.
 [37] Mateus SH Cruz, Yusuke Kozawa, Toshiyuki Amagasa, and Hiroyuki Kitagawa. Accelerating set similarity joins using gpus. In Transactions on LargeScale Dataand KnowledgeCentered Systems XXVIII, pages 1–22. Springer, 2016.
 [38] Leonardo Dagum and Ramesh Menon. Openmp: an industry standard api for sharedmemory programming. IEEE computational science and engineering, 5(1):46–55, 1998.
 [39] Yuri Dotsenko, Naga K Govindaraju, PeterPike Sloan, Charles Boyd, and John Manferdelli. Fast scan algorithms on graphics processors. In Proceedings of the 22nd annual international conference on Supercomputing, pages 205–213. ACM, 2008.
 [40] Zidong Du, Shaoli Liu, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Qi Guo, Xiaobing Feng, Yunji Chen, et al. An accelerator for high efficient vision processing. IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems, 36(2):227–240, 2017.
 [41] Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran. Lowlevel functional gpu programming for parallel algorithms. In Proceedings of the 5th International Workshop on Functional HighPerformance Computing, pages 31–37. ACM, 2016.
 [42] Marco Eilers. Multireduce and multiscan on modern gpus. Department of Computer Science, University of Copenhagen. Master’s thesis, 2014.
 [43] Yao Fu, Ephrem Wu, Ashish Sirasao, Sedny Attia, Kamran Khan, and Ralph Wittig. Deep learning with int8 optimization on xilinx devices. White Paper, 2016.
 [44] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
 [45] Azzam Haidar, Panruo Wu, Stanimire Tomov, and Jack Dongarra. Investigating half precision arithmetic to accelerate dense linear system solvers. In Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for LargeScale Systems, page 10. ACM, 2017.
 [46] Mark Harris, Shubhabrata Sengupta, and John D Owens. Parallel prefix sum (scan) with cuda. GPU gems, 3(39):851–876, 2007.
 [47] Kaixi Hou, Weifeng Liu, Hao Wang, and Wuchun Feng. Fast segmented sort on gpus. In Proceedings of the International Conference on Supercomputing, page 12. ACM, 2017.
 [48] Miao Hu, John Paul Strachan, Zhiyong Li, Emmanuelle M Grafals, Noraica Davila, Catherine Graves, Sity Lam, Ning Ge, Jianhua Joshua Yang, and R Stanley Williams. Dotproduct engine for neuromorphic computing: programming 1t1m crossbar to accelerate matrixvector multiplication. In Proceedings of the 53rd annual design automation conference, page 19. ACM, 2016.
 [49] Xiaohan Huang and Victor Y Pan. Fast rectangular matrix multiplication and applications. Journal of complexity, 14(2):257–299, 1998.
 [50] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[51]
X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie,
Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu.
Highly Scalable Deep Learning Training System with MixedPrecision: Training ImageNet in Four Minutes.
ArXiv eprints, July 2018.  [52] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 [53] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. Dissecting the nvidia volta gpu architecture via microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
 [54] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. Indatacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 1–12. IEEE, 2017.
 [55] HeeSeok Kim, Shengzhao Wu, Liwen Chang, and W Hwu Wenmei. A scalable tridiagonal solver for gpus. In Parallel Processing (ICPP), 2011 International Conference on, pages 444–453. IEEE, 2011.
 [56] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
 [57] Rasmus Wriedt Larsen and Troels Henriksen. Strategies for regular segmented reductions on gpu. In Proceedings of the 6th ACM SIGPLAN International Workshop on Functional HighPerformance Computing, pages 42–52. ACM, 2017.
 [58] Sepideh Maleki, Annie Yang, and Martin Burtscher. Higherorder and tuplebased massivelyparallel prefix sums, volume 51. ACM, 2016.
 [59] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. Nvidia tensor core programmability, performance & precision. arXiv preprint arXiv:1803.04014, 2018.
 [60] Michael D McCool, Arch D Robison, and James Reinders. Structured parallel programming: patterns for efficient computation. Elsevier, 2012.
 [61] Trevor L McDonell, Manuel MT Chakravarty, Gabriele Keller, and Ben Lippmeier. Optimising purely functional gpu programs. ACM SIGPLAN Notices, 48(9):49–60, 2013.
 [62] D Merrill. CUB v1.8.0: CUDA Unbound, a library of warpwide, blockwide, and devicewide GPU parallel primitives. NVIDIA Research, 2018.
 [63] Duane Merrill and Michael Garland. Singlepass parallel prefix scan with decoupled lookback. Technical report, NVIDIA Technical Report NVR2016002, 2016.
 [64] Duane Merrill and Andrew Grimshaw. Parallel scan for stream architectures. University of Virginia, Department of Computer Science, Charlottesville, VA, USA, Technical Report CS200914, 2009.
 [65] Duane G Merrill and Andrew S Grimshaw. Revisiting sorting for gpgpu stream architectures. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 545–546. ACM, 2010.
 [66] Bruce Merry. A performance comparison of sort and scan libraries for gpus. Parallel Processing Letters, 25(04):1550007, 2015.
 [67] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018.

[68]
Rory Mitchell and Eibe Frank.
Accelerating the xgboost algorithm using gpu computing.
PeerJ Computer Science, 3:e127, 2017.  [69] Wilt Nicholas. The cuda handbook: A comprehensive guide to gpu programming, 2013.
 [70] Paavo Pärssinen. Modern mobile graphics processors. Science: Internet, Data and Things (CSE4000), Spring 2018, page 211.
 [71] Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, and Mark A Horowitz. Convolution engine: balancing efficiency & flexibility in specialized computing. In ACM SIGARCH Computer Architecture News, volume 41, pages 24–35. ACM, 2013.
 [72] Stephan Rabanser, Oleksandr Shchur, and Stephan Günnemann. Introduction to tensor decompositions and their applications in machine learning. arXiv preprint arXiv:1711.10781, 2017.
 [73] Jonathan RaganKelley, Andrew Adams, Dillon Sharlet, Connelly Barnes, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. Halide: decoupling algorithms from schedules for highperformance image processing. Communications of the ACM, 61(1):106–115, 2017.
 [74] Brandon Reagen, Robert Adolf, Paul Whatmough, GuYeon Wei, and David Brooks. Deep learning for computer architects. Synthesis Lectures on Computer Architecture, 12(4):1–123, 2017.
 [75] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel HernándezLobato, GuYeon Wei, and David Brooks. Minerva: Enabling lowpower, highlyaccurate deep neural network accelerators. In ACM SIGARCH Computer Architecture News, volume 44, pages 267–278. IEEE Press, 2016.
 [76] Andres Rodriguez, Eden Segal, Etay Meiri, Evarist Fomenko, Young Jim Kim, Haihao Shen, and Barukh Ziv. Lower Numerical Precision Deep Learning Inference and Training. Technical report, Intel, January 2018.
 [77] Shubhabrata Sengupta, Mark Harris, and Michael Garland. Efficient parallel scan algorithms for gpus. NVIDIA, Santa Clara, CA, Tech. Rep. NVR2008003, 1(1):1–17, 2008.
 [78] Shubhabrata Sengupta, Aaron E Lefohn, and John D Owens. A workefficient stepefficient prefix sum algorithm. In Workshop on edge computing using new commodity architectures, pages 26–27, 2006.
 [79] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. Isaac: A convolutional neural network accelerator with insitu analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News, 44(3):14–26, 2016.
 [80] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. Lift: a functional dataparallel ir for highperformance gpu code generation. In Code Generation and Optimization (CGO), 2017 IEEE/ACM International Symposium on, pages 74–85. IEEE, 2017.
 [81] Hiroki Tokura, Toru Fujita, Koji Nakano, Yasuaki Ito, and Jacir L Bordim. Almost optimal columnwise prefixsum computation on the gpu. The Journal of Supercomputing, 74(4):1510–1521, 2018.

[82]
JeanBaptiste Tristan, Joseph Tassarotti, and Guy Steele.
Efficient training of lda on a gpu by meanformode estimation.
In International Conference on Machine Learning, pages 59–68, 2015.  [83] Charles Van Loan. A survey of matrix computations. Handbooks in operations research and management science, 3:247–321, 1992.
 [84] Xiaolong Xie, Yun Liang, Xiuhong Li, and Wei Tan. Culda_cgs: Solving largescale lda problems on gpus. arXiv preprint arXiv:1803.04631, 2018.
 [85] Shengen Yan, Guoping Long, and Yunquan Zhang. Streamscan: fast scan algorithms for gpus without global barrier synchronization. In ACM SIGPLAN Notices, volume 48, pages 229–238. ACM, 2013.
 [86] Yang You, Igor Gitman, and Boris Ginsburg. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017.
 [87] Yuhao Zhu, Matthew Mattina, and Paul Whatmough. Mobile machine learning hardware at arm: A systemsonchip (soc) perspective. arXiv preprint arXiv:1801.06274, 2018.
Comments
There are no comments yet.