Accelerating Reduction and Scan Using Tensor Core Units

by   Abdul Dakkak, et al.

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or 16x16) to accelerate the convolutional and recurrent neural networks in deep learning workloads. In this paper we leverage NVIDIA's TCU to express both reduction and scan with matrix multiplication and show the benefits -- in terms of program simplicity, efficiency, and performance. Our algorithm exercises the NVIDIA TCUs which would otherwise be idle, achieves 89 bandwidth, and is orders of magnitude faster (up to 100x for reduction and 3x for scan) than state-of-the-art methods for small segment sizes -- common in machine learning and scientific applications. Our algorithm achieves this while decreasing the power consumption by up to 22



There are no comments yet.


page 1

page 5

page 6

page 7

page 10


Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

Sparse general matrix-matrix multiplication (spGEMM) is an essential com...

A Computational Model for Tensor Core Units

To respond to the need of efficient training and inference of deep neura...

Blocking Techniques for Sparse Matrix Multiplication on Tensor Accelerators

Tensor accelerators have gained popularity because they provide a cheap ...

Matrix Engines for High Performance Computing:A Paragon of Performance or Grasping at Straws?

Matrix engines or units, in different forms and affinities, are becoming...

TCUDB: Accelerating Database with Tensor Processors

The emergence of novel hardware accelerators has powered the tremendous ...

Unitary Learning for Deep Diffractive Neural Network

Realization of deep learning with coherent diffraction has achieved rema...

Similarity Search with Tensor Core Units

Tensor Core Units (TCUs) are hardware accelerators developed for deep ne...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deep learning’s reliance on matrix-multiplication for compute has driven both research and industry to develop matrix-multiplication accelerator hardware — collectively called Tensor Core Units (TCUs) in this paper. TCUs come under the guise of different marketing terms, be it NVIDIA’s Tensor Cores [18], Google’s Tensor Processing Unit [10], Intel KNL’s AVX extensions [76], Apple A11’s Neural Engine [2], or ARM’s Machine Learning Processor [3]

. TCUs are designed to accelerate Multilayer Perceptrons (MLP), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), or Deep Neural Network (DNN) in general TCUs vary in implementation 

[71, 54, 18, 87, 76, 74, 75, 36, 40, 43, 79, 48], and are prevalent [4, 11, 24, 70, 10, 9, 8, 1] in edge devices, mobile, and the cloud.

Although TCUs are prevalent and promise increase in performance and/or energy efficiency, they suffer from over specialization — with only matrix-multiplication operations being supported. This limits their applicability to general algorithms and makes them confined to narrowly specialized libraries and application domains. Take NVIDIA’s Tensor Cores, for example. For matrix multiplication, the NVIDIA Volta architecture achieves a throughput increase — with each Streaming Multiprocessor (SM) capable of performing half precision operations per cycle using the TCUs compared to half precision operations per cycle without the TCUs. This is enabled by the fact that NVIDIA’s Volta GPUs dedicate a large portion of the SM processing unit (or subcore) chip area to TCUs, shown in Figure 1. Currently algorithms other than general matrix-matrix multiplication (GEMM) do not utilize the TCUs— resulting in an idle TCU and low chip utilization for these algorithms.

Figure 1: Each processing block (subcore) in the NVIDIA Tesla V100 PCI-E architecture contains TCUs. In total, TCUs are available — achieving a theoretical peek of TFLOPS.

The objective of the paper is to expand the class of algorithms that can execute on TCUs— enabling the TCU to be used for non-GEMM kernels. We choose reduction and scan, since a large body of work [30, 32, 31, 60, 82, 84, 68, 33] has shown that they are key primitives of data parallel implementations of radix sort, quicksort, string comparison, lexical analysis, stream compaction, polynomial evaluation, solving recurrence equations, and histograms. We formulate a simple mapping between reduction or scan and TCUs. Then we develop a library of warp-, block-, and grid-level primitives for reduction and scan that utilize the NVIDIA TCUs, and present how they can be used in concert to achieve near-ideal performance. Although this paper targets GPUs, the motivation, methods, and observations are applicable to a wide number of TCU implementations.

While the formulation is the main objective, we show that our reduction and scan algorithms is either order of magnitude faster or rivals the fastest GPU implementation, with much lower programming complexity. We also show that by leveraging the TCU we free up the general purpose ALUs on the SMs for tasks that cannot be expressed in terms of TCU operations.

The key contributions of the paper are as follows:

  1. We show how to use the TCU to compute both reduction and scan — we believe we are the first to formulate these algorithms in terms of TCUs operations.

  2. We measure our algorithms and show that we are orders of magnitude better than state-of-art algorithms for small segment sizes — this is common in mathematics (e.g. evaluating polynomials), scientific applications (e.g. finite difference), and machine learning (e.g. batch norm) — and are comparable to the fastest algorithm for large segments.

  3. We find that we are more energy efficient and decrease the utilization of general purpose ALUs — making them free for tasks that cannot be represented on TCUs.

  4. We describe the current usage and the programmability of the NVIDIA TCUs and evaluate GEMM on the TCUs using cuBLAS [15], CUTLASS [17] and the CUDA TCU API. The results show that TCUs can be profitably used in a much wider range of libraries.

  5. We provide insights into how to relax the current CUDA TCU programming interface constraints.

This paper is divided as follows: we first describe the current NVIDIA TCUs and show the performance of GEMM and GEMV computation in Section 2. In Section 3, we give a background of reduction and scan and show the TCU algorithms for reduction (Section 4) and scan (5). We then compare our implementation against state-of-the-art in Section 6. Section 7 describes the related work, before we conclude in Section 8.

2 NVIDIA Tensor Cores

A marquee feature of NVIDIA’s Volta (Tesla V100) architecture is its TCUs— a programmable matrix-multiply-and-accumulate hardware units, called Tensor Cores111We will use TCU and Tensor Core interchangeably in this paper. by NVIDIA. Figure 1 illustrates the processing block within each SM, with the V100 containing SMs and each having processing blocks. In turn, each processing block contains Tensor Cores — for a total of Tensor Cores on the V100 and achieving a throughput improvement over previous generation Tesla P100 [22]. This section first describes the current usage of the NVIDIA TCUs, then details the current TCU API and presents some evaluation results that motivate our work.

(a) GEMM with half precision input and half precision output.
(b) Mixed precision GEMM with half precision input and single precision output.
Figure 4: General matrix-matrix multiplication (GEMM) performance using Tensor Cores for both half- ((a)a) and mixed- ((b)b) precision on a V100 PCI-E GPU with a clock frequency of MHz and a TFLOPS peek performance. The inputs are square matrices with variable dimensions. The optimized and naïve WMMA GEMM algorithms are described in the text.
1#include <mma.h>
2using namespace nvcuda::wmma;
3__global__ void dot_wmma_16x16(half *a, half *b, half *c) {
4   fragment<matrix_a, 16, 16, 16, half, col_major> a_frag;
5   fragment<matrix_b, 16, 16, 16, half, row_major> b_frag;
6   fragment<accumulator, 16, 16, 16, half> c_frag;  
7   load_matrix_sync(a_frag, a, /* leading dim */ 16);  
8   load_matrix_sync(b_frag, b, /* leading dim */ 16);  
9   fill_fragment(c_frag, 0.0f);  
10   mma_sync(c_frag, a_frag, b_frag, c_frag);  
11   store_matrix_sync(c, c_frag, 16, row_major);  
Listing 1: A simple CUDA kernel performing matrix multiplication () in half precision using the CUDA WMMA API.

Each Tensor Core provides a tensor processing array capable of performing the operation within a cycle, where , , and are matrices. The inputs and must be in half precision format while the accumulators, and , can be either single or half precision. Each Tensor Core can perform FMA operations per cycle. Using Tensor Cores, each SM can perform floating point operations per cycle since each each FMA consists of two floating point operations and each SM contains Tensor Cores. This is an SM throughput increase compared to Pascal for floating point operations [22].

2.1 Current Usage

Tensor Cores have been only used to accelerate GEMM operations, most prominently through NVIDIA’s CUDA libraries. The NVIDIA libraries, such as cuBLAS [15] and cuDNN [16], require users to opt-in to use the Tensor Cores and both libraries accelerate GEMM computation (HGEMM for cuBLAS and convolution and recurrent neural networks for cuDNN). NVIDIA also provides the CUTLASS (CUDA Templates for Linear Algebra Subroutines) [17] library, which is a CUDA templated library that provides the building block primitives to write high performance GEMM-like kernels. Deep learning frameworks such as NVCaffe [52], Caffe2 [5], MXNet [14]

, PyTorch 


, TensorFlow 

[25], and TensorRT [19] leverage the NVIDIA libraries for DNN training [13] and inference.

2.2 Programming Interface

NVIDIA also provides a CUDA C++ Warp Matrix Multiply and Accumulate (WMMA) [7] API to program the Tensor Cores directly. The current WMMA API provides warp-level matrix operations for matrix load (load_matrix_sync), matrix store (store_matrix_sync), and matrix multiply and accumulate (mma_sync). These APIs operate on a special data type fragment, which holds a matrix tile in thread-local registers. A helper function to fill a matrix fragment with a scalar constant (fill_fragment) is provided as well. NVIDIA has not provided any API for explicitly calling the TCU at sub warp level — neither in the IR nor in the PTX [20, 21].

A load_matrix_sync distributes values of the matrix across the warp lanes. Threads within a warp utilize multiple Tensor Cores concurrently to perform the mma_sync operation — collaborating to compute the , with , , denoting the matrix dimensions. The API imposes limitations on the dimensions — requiring the shape to be either , , or .

Listing 1 shows a simple CUDA kernel that computes a matrix multiplication within a warp using the WMMA API. Lines 11 declare the matrix fragments. The API supports kinds of matrices — matrix_a (), matrix_b (), and accumulator ( or ) — with each having their own internal data layout 222The mapping between individual matrix elements to their residing thread(s) is purposely opaque [7] and undocumented. We will discuss how we alleviate some of the constraints in 6.1. as well as loading, storing, and computing semantics. Users specify both the data type and the shape of the fragments. For both the and

matrix kinds, users specify whether the matrix is in column- or row-major order. Users specify the stride between rows using the leading dimension and load the data from either shared or global memory — Lines 

11. Line 1 initializes the matrix_c elements to zero by broadcasting the scalar value into each index of the fragment. Once the data is loaded, users perform the matrix multiplication operation — Line 1 — and store the results — Line 1.

The kernel in Listing 1 can be generalized to implement GEMM for arbitrary matrix dimensions in a manner similar to tiled matrix multiplication. For example, a naive implementation (referred to as WMMA HGEMM (naïve)) assigns a strip of rows from matrix and a strip of columns from matrix columns to each warp to compute a tile of the output . Each warp iterates through the A rows and B columns by loading tiles of A and B from global memory into the fragments using load_matrix_sync, then performing mma_sync, and repeating. After the entire rows of A and columns of B have been consumed, the warp uses store_matrix_sync to store the accumulated values into global memory. An optimized implementation (referred to as WMMA HGEMM) utilizes persistent threads where each thread block collaboratively loads multiple tiles of matrix and from global memory to shared memory so that some of the tiles can be re-used across warps. The tiles are then loaded into fragments and the mma_sync operation is performed.

Figure 5:

General matrix-vector multiplication (GEMV) performance using Tensor Cores on a V100 PCI-E GPU. GEMV can be implemented in terms of a GEMM (with dimensions

) or calling the GEMV method in CUBLAS (which currently does not support half precision).

2.3 GEMM Evaluation

We evaluate the GEMM performance using Tensor Cores on a Tesla V100 PCI-E GPU with CUDA through cuBLAS, CUTLASS (version ), and hand written kernels using the WMMA API. The results are shown in Figure 4.

For half precision GEMM (HGEMM), shown in Figure (a)a, cuBLAS HGEMM with Tensor Cores achieves a maximum performance of TFLOPS — approximately the peak performance — and over that of cuBLAS without the use of TCUs. For mixed precision GEMM (MGEMM), shown in Figure (b)b, we achieved maximum performance of TFLOPS on NVIDIA TCUs using cuBLAS, approximately the peak performance, and is about the performance of cuBLAS without Tensor Cores (the degradation of performance compared to HGEMM is due to output bytes count being twice as large). CUTLAS MGEMM is more performant that it’s HGEMM, we suspect this is due to optimizations for mixed precision that are absent for half-precision.

2.4 GEMV Evaluation

The order of magnitude speedup of GEMM with TCU raises the question: can we formulate other algorithms in terms of matrix multiplication and also benefit from the TCU? The most obvious algorithm is matrix-vector multiplication (GEMV).

We implement HGEMV (half precision GEMV) and MGEMV (mixed-precision GEMV) with HGEMM or MGEMM call of dimension — with required to be at least by the cuBLAS API. This method wastes at least memory loads and performs extra flops. We evaluate our implementations against cuBLAS SGEMV, since half precision GEMV is not present in cuBLAS.

Figure 5 shows that even when accounting for both resource and computation waste, HGEMV, implemented using cuBLAS HGEMM with Tensor Cores, outperforms the cuBLAS SGEMV by at least . Naïve (non-optimized 333A user implicitly performs tiling when utilizing the WMMA API.) HGEMV and MGEMV are super imposed atop each other since the overhead of using mixed-precision is dwarfed by inefficient memory access. Both naïve versions still outperform cuBLAS SGEMV for large input sizes.

The WMMA HGEMV implementation saturates at about 900 GFLOPS in Figure 5. This is because each FMA requires one matrix element and one vector element, each is 2 bytes in size, to perform the FMA operations. That is we need to fetch one byte from the global memory into the shared memory for every floating-point operation performed. Note that there is no reuse of matrix elements and heavy reuse of vector elements. Assume that the vectors are mostly loaded from the L2 cache due to heavy reuse, the 900 GB/s global memory bandwidth of V100 is the hardware limitation that prevents the any GEMV implementation to go beyond 900 GFLOPS in half precision.

3 Conventional Reduction and Scan on GPU

Figure 6: The reduction algorithm is

composed of warp-level reduction that reduces each segment and is used to

implement block-level reduction that further reduces each segment of partially reduced values. The partially reduced values are reduced across the grid

to perform full reduction.

The GEMV evaluation shows that the matrix-multiplication performance of NVIDIA TCUs is high enough to tolerate resource and computation waste in algorithms. Driven by this observation, we examine how to formulate two widely used collectives — reduction and scan — to utilize TCUs. Both reduction and scan are memory-bandwidth bound so we expect that even with some resource and compute waste, their HGEMM-based implementations can beat or match the most highly tuned traditional implementations.

1__device__ half warp_reduce(half val) {
2  for (int offset = WARP_SIZE/2; offset > 0; offset /= 2)
3    val += __shfl_down_sync(0xFFFFFFFFU, val, mask);
4  return val;
6__device__ half warp_scan(half val) {
7  for (int offset = 1; offset < WARP_SIZE; offset *= 2) {
8    auto n = __shfl_up_sync(0xFFFFFFFFU, val, mask);
9    if (laneid >= offset) val += n;
10  }
11  return val;
Listing 2: NVIDIA’s recommended warp-level reduction and scan implementations utilizing shuffle instructions.

Given a vector , the reduction, also called fold or total, of is defined by its sum . Segmented reduction is defined as reductions on subsets of the input vector. In a regular segmented reduction, all segments are the same size444Irregular segmented reduction is implemented in terms of regular segmented reduction, and is elided from this paper.. The scan operation, also called prefix sum, is defined by the vector . Segmented scan is defined similarly to segmented reduction.

State of the art libraries such as CUB [62], MAGMA [26], OpenMP runtime [38], implement both reduction and scan in terms of warp-, block-, and device-level collectives, as illustrated in Figure 6. The building blocks — warp-level reduction and scan, are commonly implemented using shuffle instructions [12]. The shuffle implementations, shown in Listing 2, allow threads within a warp to share values via registers without using shared memory or block-level synchronization.

4 TCU Reduction Algorithm

Intuitively, reduction can be represented as a special case of a matrix multiplication, since

The challenge is to map generic input sizes onto the fixed tensor shapes supported by the TCU. For simplicity, this paper will use the matrix configuration as an example to implement the segmented reduction. Other configurations can be used in a similar manner to perform segmented reduction for multiples of or . It is natural to extend our algorithms to support mixed precision computation.

We use to represent a regular segmented reduction — partial reductions of the input uniformly partitioned into element subsets. We will also use as the matrix which has the first row set to 1 and the rest to 0 — i.e. , and the notation to denote a matrix where all the elements are the constant value .

To make our formulation non-WMMA API specific, we present our algorithms in an API neutral way. In the following sections, we use LoadTile in place of the load_matrix_sync which accepts a matrix layout (default is row-major) and stride (default is ). We abstract store_matrix_sync to make it addressable as if it was a linear array. We will also use the notation to denote the mma_sync operation.

4.1 Warp-level Reduction

We look at warp-level reduction first, since it is the building block for both block- and grid-level reductions. Even though using shuffle instructions for warp-level reduction is efficient, it can be a bottleneck due to its limited throughput. On the NVIDIA Volta architecture only warp shuffle operations can be performed within a clock cycle per SM.

This section shows the formulation of warp-level reduction onto the TCU when the segment size is , , and multiples of or

. Support for arbitrary segment sizes can be realized either by padding the input with zeros or by masking through the

matrix. We find that padding introduces minimal overhead and is required in some cases to maintain the memory alignment imposed by the TCU API.

Segment Size 16:

The algorithm, shown in Algorithm 1 and Figure 7, performs warp-level reduction on elements which represent segments of size . The data is loaded from memory into a column-major order fragment (matrix ). Each row is then reduced using . The result — first row of — is stored in the output.

1:Initialize matrix.
5:if  then
Algorithm 1 The algorithm.

Depending on the TCU API’s flexibility, one can formulate the algorithm as either where is in column-major order or where is in row-major order. The formulation is more friendly to loading of A elements since the input vector elements naturally form the row-major layout for A. However, there is no performance difference when using the WMMA API since: (1) the API not expose sub-fragment operations and (2) the CUDA compiler performs the transposition through registers. We formulate the algorithm as to make it amicable for a more relaxed API, since the element output is in linear row-major order and can be written to consecutive locations in the global memory.

Figure 7: The algorithm

each warp loads elements into the matrix in column major order,

performs the TCU operation where the matrix has ones for the first row, and then

the result, which is in the first row of , is stored into the output vector.

Segment Size 256:

For handling the segments of size , one follows a pattern similar to . First, all elements are loaded onto the TCU. The rows are reduced using the same procedure as . Finally, we reduce the first row to get a scalar. The algorithm is shown in Algorithm 2 and is a single iteration of the algorithm illustrated in Figure 8.

1:Initialize matrix
6:if  then
Algorithm 2 The algorithm.
Figure 8: The work-inefficient algorithm

initializes the matrix with all zeros and

loads the input elements into a matrix in column major order.

A dot product where the matrix has the first row as ones and the rest of the values are zeros is performed to reduce each row into a scalar.

the dot product reduces the first row into a scalar.

If the segmented reduction size is equal to the matrix size (i.e. ) or for the last iteration, then the first element of the matrix is stored in the output array, otherwise

the first element of matrix is used as the first element of the matrix and the procedure is iterated starting from step


Segment Size Multiples of 256:

With the above and warp-level primitives, we can express segments that are multiples of either (denoted by ) or (denoted by ). We will first look at the algorithm, since it will be used for representing the algorithm.

A naïve way is to implement the segmented reduction as -repeated applications of the , shown in Figure 8. While this is correct, it is work inefficient — wasting one matrix multiplication for each iteration. Instead of performing two reductions in each iteration, we can implement a work efficient segmented reduction by first reducing each row of the matrix () in each iteration and use the row of reduced values as an accumulator. In the final iteration, the final row is reduced into a scalar. Figure 9 illustrates the algorithm.

Figure 9: The work-efficient algorithm

loads input elements into matrix in each iteration. It then

performs a matrix multiplication for between and with . The final vector is reduced

by performing the operation with the

result stored in the output.

Segment Size Multiples of 16:

Similar to , segmented reduction where the segment size is multiples of () can be performed in two ways. The first is a strided segmented reduction, shown in Figure 10 (for the case where ). During each iteration , a warp loads segments (each of length ) into the matrix with a stride of , i.e., the beginning of each 16-element segment is 16N elements away from the beginning of the next segment in the original input vector. The columns of are then reduced and accumulated into the first row of . This repeats for N iterations. This method is simple, works for arbitrary multiple of , and leverages GPU cache for small . For large this method suffers from bad locality, consistently misses the cache, and results in poor performance.

For , one can implement in terms of — shown in Algorithm 3. This makes better use of cache locality and reduces uncoalesced memory accesses. The left over, , can be implemented using the strided segmented reduction method.

Figure 10: A strided algorithm for

loads elements where the stride between each row is .

We then perform the matrix multiplication and

use the matrix as an accumulator for the next iteration where

we again load the next elements with the leading dimension set to . The

matrix multiplication is performed and

the first row is stored in the output vector.
1:Initialize matrix
4: Number of 256 segments
5:for  do
9:end for
10: Reduce rest of segments using
11:if  then
Algorithm 3 The Coalesced algorithm.
4:if  then
5:sync threads
6:if  then
8:end if
Algorithm 4 The Block-level algorithm.

4.2 Block-level Reduction

When the segment size is large, collaborative reduction within a block becomes profitable. We follow standard practice [62] to implement block-level reduction, but differ in that we still use the TCU to perform reduction on the partially reduced values. Algorithm 4 shows how we use warp-level reduction to implement the block-level kernel.

4.3 Grid-level Reduction

When the segment size is very large a grid-level reduction might be needed. A naïve grid-level reduction for a list of length involves two kernel launches. The first kernel launch performs a segmented reduction with the output stored in a list of partials. A second kernel then reduces the partials into a scalar. Although it is simple to implement, it achieves performance that’s on par with the fastest algorithm.

5 TCU Scan Algorithm

Unlike reduction, it might be less intuitive to represent scan as a matrix multiplication. For a vector of elements, we can store it in row-major order within a matrix — with .

We notice that a row-wise scan can then be obtained by multiplying the matrix with an upper diagonal matrix — with the values of the upper diagonals being and the rest .

Similarly, to get the scan of each column one can use a lower diagonal matrix. We use a strict lower diagonal, i.e. the diagonal is , to get an exclusive scan of each column.

We use the matrix to create a matrix where each element is the reduction of the row of . That is, all elements in the row of are of the same value — the sum of the all elements preceding the row of , i.e. . The matrix can be generated by multiplying with a matrix with all element values set to 1. We then add to the matrix to generate the scan of — which is read in linear row-major order.

Throughout this section we will use to represent the upper diagonal matrix where the upper diagonal values are one, and use to represent the strict lower diagonal matrix where the values bellow the lower diagonal are one — i.e. and .

5.1 Warp-level Scan

With the above formulation, we follow a similar structure to Section 4: first introducing warp-level primitives before presenting the block- and grid-level primitives. We write to represent a regular segmented scan. Since the process of building warp-level, block-level, and grid-level scans from is very similar to that of reduction, we will only highlight the key differences.

Segment Size 16:

Is the equation and is illustrated in steps


, and

in Figure 11.

Segment Size 256:

Is implemented using matrix multiplications shown in Figure 11 and presented symbolically above.

Figure 11: The algorithm

loads 256 elements from the input vector into a matrix and

initializes the matrix to . The


matrix multiplications are performed to compute the prefix sum of each row and column.

A row wise reduction is performed on the and added to the matrix.

The result is stored in the output vector.

If the segment size is a multiple of , then the last element of (position ) is broadcasted into the matrix and the procedure is repeated.

Segment Size Multiples of 16:

Is similar to strided reduction, with the difference being that we broadcast the last column rather than the reduced value, shown in Algorithm 5.

1:Initialize matrix.
5:for  do
10:     if  then
13:     end if
14:end for
Algorithm 5 The algorithm.

Segment Size Multiples of 256:

Only a small modification to is needed to implement and is illustrated in Figure 11. Algorithm 6 shows that we keep track of the sum (last element of the matrix) and broadcast it to the matrix after each iteration. The matrix is then used when performing subsequent iterations.

1:Initialize and matrices.
4:for  do
12:end for
Algorithm 6 The algorithm.

5.2 Block-level Scan

Algorithm 7 shows how to perform the scan at the block level. It first computes the segmented scan using the warp primitives and stores the reduced values into a partials list. A scan is performed on the partial list and the values are added to the intermediate results to get the output.

Algorithm 7 also exercises the TCU to perform the scan on the partially reduced values across tiles. On Line 16 we use the offset of the last row () and as the leading dimension when loading the tile. This loads the last row of across tiles into . Line 17 then performs an exclusive scan on the last column of the and stores the results into the list of partials555 The implementation of is performed by just loading the last column values into the first row and performing an TCU version of the exclusive scan algorithm. Formulating the intermediate operation this way is needed to adhere to the CUDA WMMA API’s byte alignment constraint for loading fragments..

1:Initialize and matrices.
3: Assumed to be less than 16
5: Partial sums
7:for  do
14:     sync threads
15:     if  then
17:           Exclusive scan
18:     end if
19:     sync threads
20:     for  do
24:     end for
26:end for
Algorithm 7 The Block-level algorithm.

5.3 Grid-level scan

Similar to reduction, the segmented scan is used as a building block for the grid-level scan. The grid-level scan uses a text book implementation, scan-then-propagate strategy, and involves kernel calls. The first kernel uses segmented scan and produces partially reduced values for each block. The second kernel performs a scan on the partially reduced values. The third kernel then uniformly adds the partially scanned values to their corresponding segments.

6 Evaluation

We evaluate our implementation on an Intel Xeon E5-2698 with CentOS , CUDA Driver , and CUDA Runtime installed. We use the Tesla V100-PCIE GPU with 16GB of GPU Memory. The Tesla V100 sports HBM2 with a theoretical peak bandwidth of or billion half precision elements per second. All the results below show the throughput of the algorithms in terms of billions of half precision elements per second.

We implemented 666Accessible at https://www.removed/for_blind_review. our algorithms as a C++ header library. The API is designed to be similar to CUB’s high level API — providing functions such as TCU::SegmentedReduce, TCU::Reduce, TCU::SegmentedScan, and TCU::Scan. We employ auto-tuning to select the ideal algorithm, number of warps (or independent TCU operations) per block, coarsening factor (the number segments to perform within a warp), and block dimensions for the user-provided segment size.

6.1 Relaxing the WMMA API Constraints

Constraints arise when using the current WMMA API for non-GEMM computation. These limitations would not exist if one is to perform just GEMM computation. The constraints observed were:

  1. Loads or stores must be performed at fragment granularity.

  2. Loading and storing fragments can only be performed using global or shared memory; constant memory cannot be used.

  3. The memory layout for the matrix kinds are not the same, making casting between matrix kinds a kluge.

We address these limitations in different ways within our implementation. For (1) and (2) we use knowledge about the internal layout of the fragment  [53], we implemented API enhancements tailored to our usage. Listing 3 shows our API to support extended usage of the matrix_b fragment for both storing a constant matrix into a fragment (matrix_b_set_upper_triangular) and loading partial fragments (matrix_b_get_first_column).

using frag_b = fragment<matrix_b, 16, 16, 16, half, row_major>;
__device__ int matrix_b_get_row_idx() {
    const int laneIdx = threadIdx.x % warpSize;
    return laneIdx&0x10 >> 2 + laneIdx&0x0B;
__device__ void matrix_b_set_upper_triangular(frag_b &f) {
#pragma unroll
    for (int ii = 0; ii < f.num_elements; ii++)
      f.x[ii] = matrix_b_get_row_idx() < ii ? 0.0f : 1.0f;
__device__ void matrix_b_get_first_column(half* out, frag_b f) {
    const int laneid = threadIdx.x % warpSize;
    if (laneid & 0x04) return ; // avoid redundant writes
    out[matrix_b_get_row_idx()] = f.x[0];
Listing 3: The WMMA API can only perform load/store from shared or global memory and lacks the ability to fill an TCU fragment from constant memory or operate on sub-fragments. This code shows how we enhance the NVIDIA WMMA API, using knowledge of the fragment layout, to create an upper triangular matrix and get the first column of a fragment for the matrix_b fragment kind.

Although we can use the layout information to shuffle registers to address (3), we express the cast in terms of load/store available through the WMMA API. For example, to cast a matrix in the matrix_a format to matrix_b format, we first store the matrix into shared memory and then perform a load from memory to matrix_b. These extensions can be leveraged by compiler runtimes to perform casting between fragment kinds.

We observe that API extensions using fragment layout information requires less block synchronization, uses less shared-memory — resulting in an increased performance by up to for our algorithms. For brevity we omit these results.

6.2 Optimizing CUB for Half Precision

CUB is a C++ template library that contains multiple algorithms for the collectives. The CUB library contains fastest [63, 66] implementation for the reduction and scan collectives and is used by libraries such as Thrust [29] as well as most deep learning frameworks [25, 14, 23, 5, 68]. We compare against the latest release of CUB [62] (version ) and evaluate against different parameters of the collectives. As an optimization, warp-level shuffle reduction and scan are implemented in PTX for integer, float, and double data types, since NVCC is currently unable to use the shuffle instruction’s predicate to guard against invalid peers [12, 69]. CUB does not contain these shuffle-based optimizations for half precision. To make the evaluations fair and more technically meaningful, we implement these optimization for the half precision data type in CUB. The modified CUB is used for the evaluation to provide a more aggressive base of comparison.

6.3 Warp- and Block-level Reduction and Scan

Theoretically our warp-level TCU reduction algorithms require less than one fourth of the cycles of the warp-level shuffle reduction. For example, consider performing a warp-level : the warp-level reduction show in Listing 2 requires iterations of element reduction to reduce each segment. The total cycles is therefore , since each shuffle instruction and addition take cycles. On the other hand, our algorithm performs the reduction using two matrix multiplications or cycles — since each TCU WMMA matrix multiplication requires cycles. However, reduction is known to be memory bound, with the ideal performance bounded by memory copy speed.

Figure 12: We evaluate the segmented reduction for the algorithms presented on different segment sizes (between and ) for a fixed element list. Through a combination of the algorithms presented, for the range between and we are able to achieve throughput within and of ideal throughput (the theoretical peak is billion half precision elements per second). The bar on top of the figure shows the best performing algorithm for each range of segment sizes.

We evaluate the TCU segmented reduction algorithms against cub::DeviceSegmentedReduce::Sum by fixing the number of input elements and varying the segment size, shown in Figure 12. When the segment size is less than , the algorithm is used. Its performance then degrades for large segment sizes due to its strided access pattern resulting in uncoalesced global memory access. When the segment size is larger than , the algorithm is used, but again suffers from performance degradation after segment size due to low occupancy. When the segment size is large — greater than , the block-level reduction is used. Our TCU implementation achieves more than of the peak throughput consistently for variable segment size and is always better than the CUB implementation.

Figure 13: Segmented

reduction and

scan are evaluated in terms of billions of half-precision elements per second (-axis) for segment sizes between and (-axis). The best configurations for our implementation as well as CUB are selected.

When segment size is large and the number of segments is small, the performance of both CUB and our implementation drops. Since each segment gets mapped onto a block, a small number of segments causes some SMs to be idle. For example when segment size is , both CUB and our implementation achieve an occupancy of around and SM efficiency of around . A better strategy for these cases would be to assign multiple thread blocks to collaboratively reduce each segment when the size of the segment is very large. This can be achieved using CUDA 9’s cooperative groups  [6]. Such optimization is outside the focus of this paper, which is to propose and evaluate TCU-based reduction and scan primitives.

Figure 14: We evaluate the segmented scan for the algorithms presented on different segment sizes for a fixed element list. Through a combination of the algorithms presented, for the range between and we are able to achieve throughput within and of ideal throughput (the theoretical peak is billion half precision elements per second).

Our TCU implementation largely outperforms CUB’s device-wide segmented reduction for different segment size. Through profiling, we identified the major predictors of performance to be, in the order of importance, the number of half-precision floating-point instructions (inst_fp_16 in NVProf metric), warp instructions (inst_inter_thread_communication), and integer instructions (inst_integer). Consistently, our implementation’s half-precision instructions is approximately equal to the number of total elements () while CUB requires a much larger number. Moreover CUB requires large number of integer and warp shuffle instructions while our implementation uses no warp shuffle instructions and a smaller number of integer instructions. This contributes to the performance improvement observed for segment size .

We examined the power consumption by measuring the average power within the execution phase of the kernel using NVVP. Our implementation uses less power compared to CUB across different segment sizes. Again, this is because of the efficient use of the and ALUs as well as better SM and DRAM utilization. We note that our algorithm leaves the ALUs (other than the TCU) idle, allowing less contention on these units.

CUB provides a cub::WarpReduce, applicable for segment sizes and , to compute a parallel reduction of elements within a warp. CUB also provides cub::BlockReduce to perform reduction within a block. These two primitives require users to partition the data and construct the kernel. Since CUB’s device-wide segmented reduction does not perform well for segment size smaller then , we evaluate our TCU implementations against cub::WarpReduce and cub::BlockReduce implementations, shown in Figure 13. The cub::WarpReduce implementation is tunable on block size. The cub::BlockReduce implementation is tunable on block size, thread coarsening factor, and reduction algorithms. We compare our implementation to the best CUB implementation. We find that our TCU implementations is still faster for segment size smaller than 1024, and is comparable to cub::BlockReduce for the other cases.

We evaluate TCU segmented scan algorithms against Thrust’s implementation (inclusive_scan_by_key), since CUB has no user visible API for segmented scan. The Thrust implementation utilizes CUB’s internal warp- and block-level scan to implement the scan-by-key operation. We use different segment sizes with a fixed number of input elements — the results are shown in Figure 14. Thrust, consistent with previous observation [42], has constant performance irrespective of the segment size. Our scan TCU implementations achieve more than of the peak throughput and is faster than thrust for small segment sizes. We observe lower power consumption compared to Thrust — observing it to be either equivalent in power usage or up to less. Our segmented scan is not ideal for large segment sizes since, as explained in Section 4, only a small number of blocks get launched and thus the GPU is under utilized. This can be remedied using the same strategy described for reduction.

Figure 15: A full reduction implementation based on the description in Section 4 achieves performance on par to CUB.

CUB provides cub::WarpScan to compute a parallel scan of items partitioned within a warp, and cub::BlockScan within a block. Similar to reduction, these two primitives require more programming effort from the users to partition the data and construct the kernel. We evaluate our TCU segmented scan against the best cub::WarpScan and cub::BlockScan implementations, shown in Figure 13. The CUB scan implementations have the same tunable parameters as the CUB reduction implementations. We can see that our TCU implementations are still largely faster for small segment size, and are at least comparable to cub::BlockScan in other cases.

6.4 Grid-level Reduction and Scan

Unlike the warp- and block-level operations, this paper does not attempt to optimize grid-level operations — opting to use a naïve implementation for the grid-level collectives. The naïve implementation involves multiple kernel launches. We include the evaluation results to show that even our naïve grid-level implementation achieves performance that is better or comparable to that of CUB and Thrust.

Figure 16: A full scan implementation based on the description in Section 5 achieves performance comparable to CUB.

We compare against both CUB and Thrust for full reduction, Figure 15, and scan, Figure 16. For both cases, our implementation uses the block-level algorithm. Even with our naïve grid-level implementation, we are able to mach the performance of CUB and are considerably faster than the Thrust implementation. For reduction and scan, the TCU implementation is slightly faster than CUB with large input sizes being bounded by memory bandwidth and is within (for reduction) of peek memory copy bandwidth.

For scan, our current unoptimized implementation internally uses CUB to reduce the partial values (kernel 2 described in Section 5.3). CUB, however, fails for inputs larger than which causes our implementation to fail. Future optimized implementations would not use CUB.

7 Related Work

The mapping of some algorithms onto matrix multiplication has already been well studied [49, 83, 56, 72]. Similarly, both reduction and scan are well studied from a performance and application [30, 32, 31, 78, 46, 64, 55] aspect on a wide range of architectures and have been heavily evaluated across GPU architectures [39, 41, 61, 85, 77]. To our knowledge however, there has been no attempt at mapping either reduction or scan in terms of matrix multiplication.

Considerable research has been done on the development of performance portable reduction and scan kernels [34, 28, 80, 57]. These compilers express the algorithms as system of alternative building blocks that are then composed and auto-tuned at compile time for both the target architecture and the input characteristics. These tools are synergistic with our technique, since we are able to add our algorithm as another building block to implement reduction or scan.

Previous work [85, 65, 62, 35, 73] has also shown that optimizations can be made to either avoid or hide the overhead of multi-kernel launches. These optimizations would enable our grid-level operations to be competitive for large sizes when compared to state-of-the-art methods. Other research looked at specific cases of scan, in [58] the authors look at performing scan on tuples while minimizing global reads and facilitating latency hiding. Recently there has been some work in applying scan and reduction to optimize database queries [81, 37, 47].

Work describing NVIDIA’s Tensor Cores is scarce. In [53], the authors use microbenchmarks to discern micro-architectural details of the V100 architecture. In [59, 45] use half precision and Tensor Cores to implement iterative solvers. They use half precision along with low quality solvers to compute the initial conditions and then switch to both higher precision solvers for subsequent iterations. The authors also examine the error incurred when using Tensor Cores and half-precision for HPC workloads. In [44, 67, 27] the authors devise schemes to accelerate neural network training with limited loss of precision using half precision and Tensor Cores.

8 Conclusion

This paper leverages specialized hardware that was developed to optimize matrix multiplication for deep learning to implement both reduction and scan. This is in the same vein to early GPU research which implemented these algorithms in terms of shaders. We point to directions for future API and architectural changes to relax some of the TCU constraints such as loading fragments from constant, extracting single row or single column, etc. — resulting in a simplified implementation that achieves more energy efficiency and performance.

In this paper we showed how to map both reduction and scan onto TCU. We believe we are the first to formulate these algorithms to exercise the TCU. Our implementation achieves up to speedup on reduction, up to on scan, and showed that the performance rivals the state of the art implementation in the worst cases. We observed a less power consumption for reduction and for scan. We are able to make use the otherwise idle TCUs— enabling better GPU utilization for kernels that exercise the general purpose ALUs.

Future work would leverage the techniques described to map more algorithms and functions onto TCUs. We are specifically interested in transcendental and special functions (such as and

), since the NVIDIA special function units have been observed to be the bottleneck in HPC applications. We also want to express neural network layers in terms of TCUs— using a cuDNN like API. Some layer implementations and layer fusion opportunities are natural extensions of our work: such as the computation of variance in batchnorm 

[50, 86, 51] or the evaluation of special functions in activation layers.


  • [1] Amazon EC2 Instances with Up to 8 NVIDIA Tesla V100 GPUs (P3). Accessed: 2018-7-24.
  • [2] Apple A11 Bionic. Accessed: 2018-7-24.
  • [3] Arm Machine Learning Processor. Accessed: 2018-7-24.
  • [4] Azure IoT Edge. Accessed: 2018-7-24.
  • [5] Caffe2. Accessed: 2018-7-24.
  • [6] Cooperative Groups: Flexible CUDA Thread Programming. Accessed: 2018-7-24.
  • [7] CUDA C Programming Guide. Accessed: 2018-7-24.
  • [8] FPGA Cloud Compute. Accessed: 2018-7-24.
  • [9] FPGA Cloud server. Accessed: 2018-7-24.
  • [10] Google Cloud TPU. Accessed: 2018-7-24.
  • [11] Google Edge TPU. Accessed: 2018-7-24.
  • [12] Kepler Shuffle: Tips and Tricks. Accessed: 2018-7-24.
  • [13] Mixed Precision Training. Accessed: 2018-7-24.
  • [14] MXNet. Accessed: 2018-7-24.
  • [15] NVIDIA cuBLAS. Accessed: 2018-7-24.
  • [16] NVIDIA cuDNN. Accessed: 2018-7-24.
  • [17] NVIDIA CUTLASS. Accessed: 2018-7-24.
  • [18] NVIDIA Tensor Cores. Accessed: 2018-7-24.
  • [19] NVIDIA TensorRT. Accessed: 2018-7-24.
  • [20] NVVM IR Specification 1.5. Accessed: 2018-7-24.
  • [21] Parallel Thread Execution ISA Version 6.2. Accessed: 2018-7-24.
  • [22] Programming Tensor Cores in CUDA 9. Accessed: 2018-7-24.
  • [23] PyTorch. Accessed: 2018-7-24.
  • [24]

    Snapdragon Artificial Intelligence. Accessed: 2018-7-24.
  • [25] TensorFlow. Accessed: 2018-7-24.
  • [26] Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The plasma and magma projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009.
  • [27] Andrew Anderson, Servesh Muralidharan, and David Gregg. Efficient multibyte floating point data formats using vectorization. IEEE Transactions on Computers, 66(12):2081–2096, 2017.
  • [28] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. PetaBricks: a language and compiler for algorithmic choice, volume 44. ACM, 2009.
  • [29] Nathan Bell and Jared Hoberock. Thrust: A productivity-oriented library for cuda. In GPU computing gems Jade edition, pages 359–371. Elsevier, 2011.
  • [30] Guy E Blelloch. Scans as primitive parallel operations. IEEE Transactions on computers, 38(11):1526–1538, 1989.
  • [31] Guy E Blelloch. Prefix sums and their applications. Technical report, Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, 1990.
  • [32] Guy E Blelloch, Michael A Heroux, and Marco Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.
  • [33] Timothy M Chan. More algorithms for all-pairs shortest paths in weighted graphs. SIAM Journal on Computing, 39(5):2075–2089, 2010.
  • [34] Li-Wen Chang, Izzat El Hajj, Christopher Rodrigues, Juan Gómez-Luna, and Wen-mei Hwu. Efficient kernel synthesis for performance portable programming. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 12. IEEE Press, 2016.
  • [35] Gaurav Chaurasia, Jonathan Ragan-Kelley, Sylvain Paris, George Drettakis, and Fredo Durand. Compiling high performance recursive filters. In Proceedings of the 7th Conference on High-Performance Graphics, pages 85–94. ACM, 2015.
  • [36] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.
  • [37] Mateus SH Cruz, Yusuke Kozawa, Toshiyuki Amagasa, and Hiroyuki Kitagawa. Accelerating set similarity joins using gpus. In Transactions on Large-Scale Data-and Knowledge-Centered Systems XXVIII, pages 1–22. Springer, 2016.
  • [38] Leonardo Dagum and Ramesh Menon. Openmp: an industry standard api for shared-memory programming. IEEE computational science and engineering, 5(1):46–55, 1998.
  • [39] Yuri Dotsenko, Naga K Govindaraju, Peter-Pike Sloan, Charles Boyd, and John Manferdelli. Fast scan algorithms on graphics processors. In Proceedings of the 22nd annual international conference on Supercomputing, pages 205–213. ACM, 2008.
  • [40] Zidong Du, Shaoli Liu, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Qi Guo, Xiaobing Feng, Yunji Chen, et al. An accelerator for high efficient vision processing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 36(2):227–240, 2017.
  • [41] Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran. Low-level functional gpu programming for parallel algorithms. In Proceedings of the 5th International Workshop on Functional High-Performance Computing, pages 31–37. ACM, 2016.
  • [42] Marco Eilers. Multireduce and multiscan on modern gpus. Department of Computer Science, University of Copenhagen. Master’s thesis, 2014.
  • [43] Yao Fu, Ephrem Wu, Ashish Sirasao, Sedny Attia, Kamran Khan, and Ralph Wittig. Deep learning with int8 optimization on xilinx devices. White Paper, 2016.
  • [44] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.
  • [45] Azzam Haidar, Panruo Wu, Stanimire Tomov, and Jack Dongarra. Investigating half precision arithmetic to accelerate dense linear system solvers. In Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, page 10. ACM, 2017.
  • [46] Mark Harris, Shubhabrata Sengupta, and John D Owens. Parallel prefix sum (scan) with cuda. GPU gems, 3(39):851–876, 2007.
  • [47] Kaixi Hou, Weifeng Liu, Hao Wang, and Wu-chun Feng. Fast segmented sort on gpus. In Proceedings of the International Conference on Supercomputing, page 12. ACM, 2017.
  • [48] Miao Hu, John Paul Strachan, Zhiyong Li, Emmanuelle M Grafals, Noraica Davila, Catherine Graves, Sity Lam, Ning Ge, Jianhua Joshua Yang, and R Stanley Williams. Dot-product engine for neuromorphic computing: programming 1t1m crossbar to accelerate matrix-vector multiplication. In Proceedings of the 53rd annual design automation conference, page 19. ACM, 2016.
  • [49] Xiaohan Huang and Victor Y Pan. Fast rectangular matrix multiplication and applications. Journal of complexity, 14(2):257–299, 1998.
  • [50] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [51] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu.

    Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes.

    ArXiv e-prints, July 2018.
  • [52] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  • [53] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. Dissecting the nvidia volta gpu architecture via microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
  • [54] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 1–12. IEEE, 2017.
  • [55] Hee-Seok Kim, Shengzhao Wu, Li-wen Chang, and W Hwu Wen-mei. A scalable tridiagonal solver for gpus. In Parallel Processing (ICPP), 2011 International Conference on, pages 444–453. IEEE, 2011.
  • [56] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
  • [57] Rasmus Wriedt Larsen and Troels Henriksen. Strategies for regular segmented reductions on gpu. In Proceedings of the 6th ACM SIGPLAN International Workshop on Functional High-Performance Computing, pages 42–52. ACM, 2017.
  • [58] Sepideh Maleki, Annie Yang, and Martin Burtscher. Higher-order and tuple-based massively-parallel prefix sums, volume 51. ACM, 2016.
  • [59] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. Nvidia tensor core programmability, performance & precision. arXiv preprint arXiv:1803.04014, 2018.
  • [60] Michael D McCool, Arch D Robison, and James Reinders. Structured parallel programming: patterns for efficient computation. Elsevier, 2012.
  • [61] Trevor L McDonell, Manuel MT Chakravarty, Gabriele Keller, and Ben Lippmeier. Optimising purely functional gpu programs. ACM SIGPLAN Notices, 48(9):49–60, 2013.
  • [62] D Merrill. CUB v1.8.0: CUDA Unbound, a library of warp-wide, blockwide, and device-wide GPU parallel primitives. NVIDIA Research, 2018.
  • [63] Duane Merrill and Michael Garland. Single-pass parallel prefix scan with decoupled look-back. Technical report, NVIDIA Technical Report NVR-2016-002, 2016.
  • [64] Duane Merrill and Andrew Grimshaw. Parallel scan for stream architectures. University of Virginia, Department of Computer Science, Charlottesville, VA, USA, Technical Report CS2009-14, 2009.
  • [65] Duane G Merrill and Andrew S Grimshaw. Revisiting sorting for gpgpu stream architectures. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 545–546. ACM, 2010.
  • [66] Bruce Merry. A performance comparison of sort and scan libraries for gpus. Parallel Processing Letters, 25(04):1550007, 2015.
  • [67] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018.
  • [68] Rory Mitchell and Eibe Frank.

    Accelerating the xgboost algorithm using gpu computing.

    PeerJ Computer Science, 3:e127, 2017.
  • [69] Wilt Nicholas. The cuda handbook: A comprehensive guide to gpu programming, 2013.
  • [70] Paavo Pärssinen. Modern mobile graphics processors. Science: Internet, Data and Things (CS-E4000), Spring 2018, page 211.
  • [71] Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, and Mark A Horowitz. Convolution engine: balancing efficiency & flexibility in specialized computing. In ACM SIGARCH Computer Architecture News, volume 41, pages 24–35. ACM, 2013.
  • [72] Stephan Rabanser, Oleksandr Shchur, and Stephan Günnemann. Introduction to tensor decompositions and their applications in machine learning. arXiv preprint arXiv:1711.10781, 2017.
  • [73] Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly Barnes, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. Halide: decoupling algorithms from schedules for high-performance image processing. Communications of the ACM, 61(1):106–115, 2017.
  • [74] Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, and David Brooks. Deep learning for computer architects. Synthesis Lectures on Computer Architecture, 12(4):1–123, 2017.
  • [75] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In ACM SIGARCH Computer Architecture News, volume 44, pages 267–278. IEEE Press, 2016.
  • [76] Andres Rodriguez, Eden Segal, Etay Meiri, Evarist Fomenko, Young Jim Kim, Haihao Shen, and Barukh Ziv. Lower Numerical Precision Deep Learning Inference and Training. Technical report, Intel, January 2018.
  • [77] Shubhabrata Sengupta, Mark Harris, and Michael Garland. Efficient parallel scan algorithms for gpus. NVIDIA, Santa Clara, CA, Tech. Rep. NVR-2008-003, 1(1):1–17, 2008.
  • [78] Shubhabrata Sengupta, Aaron E Lefohn, and John D Owens. A work-efficient step-efficient prefix sum algorithm. In Workshop on edge computing using new commodity architectures, pages 26–27, 2006.
  • [79] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News, 44(3):14–26, 2016.
  • [80] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. Lift: a functional data-parallel ir for high-performance gpu code generation. In Code Generation and Optimization (CGO), 2017 IEEE/ACM International Symposium on, pages 74–85. IEEE, 2017.
  • [81] Hiroki Tokura, Toru Fujita, Koji Nakano, Yasuaki Ito, and Jacir L Bordim. Almost optimal column-wise prefix-sum computation on the gpu. The Journal of Supercomputing, 74(4):1510–1521, 2018.
  • [82] Jean-Baptiste Tristan, Joseph Tassarotti, and Guy Steele.

    Efficient training of lda on a gpu by mean-for-mode estimation.

    In International Conference on Machine Learning, pages 59–68, 2015.
  • [83] Charles Van Loan. A survey of matrix computations. Handbooks in operations research and management science, 3:247–321, 1992.
  • [84] Xiaolong Xie, Yun Liang, Xiuhong Li, and Wei Tan. Culda_cgs: Solving large-scale lda problems on gpus. arXiv preprint arXiv:1803.04631, 2018.
  • [85] Shengen Yan, Guoping Long, and Yunquan Zhang. Streamscan: fast scan algorithms for gpus without global barrier synchronization. In ACM SIGPLAN Notices, volume 48, pages 229–238. ACM, 2013.
  • [86] Yang You, Igor Gitman, and Boris Ginsburg. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017.
  • [87] Yuhao Zhu, Matthew Mattina, and Paul Whatmough. Mobile machine learning hardware at arm: A systems-on-chip (soc) perspective. arXiv preprint arXiv:1801.06274, 2018.