APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores

Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by the limited precision support on GPUs (e.g., int1 and int4). To break such restrictions, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization benefits on Ampere GPU Tensor Cores. Specifically, APNN-TC first incorporates a novel emulation algorithm to support arbitrary short bit-width computation with int1 compute primitives and XOR/AND Boolean operations. Second, APNN-TC integrates arbitrary precision layer designs to efficiently map our emulation algorithm to Tensor Cores with novel batching strategies and specialized memory organization. Third, APNN-TC embodies a novel arbitrary precision NN design to minimize memory access across layers and further improve performance. Extensive evaluations show that APNN-TC can achieve significant speedup over CUTLASS kernels and over various NN models built with natively supported precisions, such as ResNet and VGG.

1. Introduction

Over recent years, the demand for faster and more efficient deep neural networks (DNNs) has kept growing. Prior work approaches faster and more efficient DNNs from different aspects, such as model pruning (liu2020autocompress; niu2020towards; ma2020pconv; dualInfer), kernel factorization (mobilenet_2017_howard; sandler2018mobilenetv2; xception; DSXplore), and data quantization (training-qnn; qcnn). Among these efforts, quantization-based DNN acceleration (searchlowbit; training-qnn; qcnn) stands out for requiring minimal modification of the original model architecture, lowering memory consumption, and improving runtime performance.

To accelerate quantized DNNs, many specialized cores have been introduced to support low-precision dense matrix-matrix multiplication, such as Tensor Processing Units (TPUs) (jouppi2017datacenter), Neural Network Processors (NNPs) (hickmann2020intel), and GPU Tensor Cores (choquette2021nvidia). For example, NVIDIA introduced Tensor Cores in the Volta architecture (choquette2018volta) to support FP16 matrix-matrix multiplication. In the Turing architecture, NVIDIA extended this support to more precisions (e.g., int1 and int4) and bit-level operations (e.g., XOR) (TPDS). The recent Ampere architecture adds further precision support and bit-level operations (e.g., AND). However, these specialized cores still support only a limited set of precisions through architecture-level efforts, while quantized DNNs usually require arbitrary precisions (e.g., 1-bit weights and 2-bit activations). In this paper, our key question is whether we can support arbitrary precision neural networks with the limited precisions available on Tensor Cores.

We identify three major challenges in accelerating arbitrary precision DNNs on Ampere GPU Tensor Cores.

Lack of mathematical emulation design. To support arbitrary precisions (e.g., int1 weights and int2 activations), one naive approach is to represent these low-precision values with supported higher-precision values (e.g., int4). However, this approach introduces extra overhead and prevents efficient execution of quantized DNNs on Tensor Cores. Another approach is to emulate with int1 compute primitives. However, with int1 precision, Tensor Cores support only two bit-level operations (i.e., XOR and AND), so mathematical emulation designs are required to support the multiplication and addition in quantized DNNs. Moreover, quantized DNNs may have diverse input data (e.g., -1/+1 or 0/1), and different data may require different emulation designs.

Lack of efficient implementation for arbitrary precision NN layers. To accelerate APNNs on Tensor Cores, we need to efficiently map arbitrary precision NN layers to Tensor Cores with specialized compute primitives and memory architectures. Existing works on accelerating binary neural networks simply split NN layers into small matrix tiles to match Tensor Core compute primitives and improve parallelism. However, naively borrowing these strategies fails to exploit the data locality in NN layer computation, especially for our emulation workload. Moreover, arbitrary precision computation usually operates at the bit level (e.g., int3 or int5), while existing hardware devices such as CPUs and GPUs usually operate at the word or byte level. Specialized bit operations and data organization are required to support efficient bit-level computation and avoid uncoalesced memory access.

Lack of efficient NN framework designs. One standard approach to build quantized neural networks is to stack a sequence of NN layers, such as a convolution layer followed by a pooling layer and a quantization layer. However, this approach ignores the data reuse opportunity across NN layers and leads to unnecessary memory overhead. For example, on NNs with 2-bit activations, there are two semantically equivalent implementations – quantizing after reading 32-bit activations from the previous layer, or quantizing to 2-bit values before writing to global memory for the next layer. While these two implementations provide the same semantics, the former transfers 32 bits per activation while the latter transfers only 2 bits.

Figure 1. The overview of APNN framework.

To this end, we propose APNN-TC to accelerate Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores, as illustrated in Figure 1. First, we propose an AP-BIT emulation design to support arbitrary-precision computation with 1-bit compute primitives. Our AP-BIT algorithm can adaptively select operators (e.g., XOR or AND) to support diverse input data (e.g., -1/1 or 0/1). Second, we build efficient AP-Layer design including an arbitrary-precision matrix-matrix multiplication (APMM) layer for fully connected layers and an arbitrary-precision convolution (APConv) layer for convolution layers. We propose a set of memory and computation designs (e.g., batch-based double caching and channel-major data organization) to fully exploit Tensor Core computation and minimize memory access. We also incorporate a performance analysis to automatically tune the hyper-parameters in APMM and APConv. Third, we propose an efficient APNN design to improve the performance at the framework level. It includes a minimal-traffic dataflow to support various precisions over APNN layers and a semantic-aware kernel fusion to minimize the data movement across layers.

In summary, we make the following contributions in this paper.

  • We develop APNN-TC to accelerate neural networks with arbitrary precision on Ampere GPU Tensor Cores.

  • We propose three novel techniques: a) an AP-BIT emulation design to support arbitrary-precision computation; b) an efficient AP-Layer design to achieve high performance at the layer level; c) an efficient APNN design to minimize the data movement across layers.

  • Extensive experiments show that APNN-TC can achieve up to speedup over CUTLASS kernels and speedup over CUBLAS kernels. APNN-TC can also consistently outperform NNs implemented with built-in int8, half, or single precision. For example, with -bit weights and -bit activations, APNN-TC can achieve more than latency reduction and higher throughput than the single-precision NN with only accuracy drop.

2. Related Works

2.1. APNN algorithm designs

Arbitrary precision (lower than INT8) neural network (APNN) algorithms have been widely studied (HanMD15; TPDS; BinaryConnect; DoReFa; zhang2018lq; HAQ; OLCEL; li2019bstc; geng2020o3bnn) to fully explore the spectrum of NN performance and accuracy and cater to diverse application requirements. In addition to the precisions widely supported on modern GPUs (e.g., int1, int4, and int8), these APNNs usually utilize more diverse precisions such as int2, int3, and int5. APNNs may also use different precisions for weights and activations (e.g., 1-bit weights and 2-bit activations). Compared with INT8 quantized neural networks, APNNs provide better performance and memory efficiency at the cost of (slightly) degraded accuracy. Popular APNNs include DoReFa-Net (DoReFa) for 1-bit weights and 2-bit activations, LQ-Nets (zhang2018lq) for 1-4 bits, HAQ (HAQ) for 1-8 bits, OLAccel (OLCEL) for 4 bits, and BSTC (li2019bstc) and TCBNN (TPDS) for 1 bit. In this paper, we follow LQ-Nets (zhang2018lq), which starts from a full-precision NN and adopts the quantization error minimization (QEM) strategy to generate quantized NNs.

2.2. APNN Hardware Supports

While many APNN algorithms have been designed, hardware support is still limited. One direction is to build FPGA- and ASIC-based implementations (HAQ; OLCEL) to demonstrate the performance benefits of APNNs. However, these implementations usually require specialized hardware designs to support arbitrary-precision computation and cannot be applied to GPUs. Another direction is to utilize the built-in precisions on GPUs for quantized neural networks. Taking the popular PyTorch (pytorch) framework as an example, it supports FP32, FP16, and BF16 models on GPUs and int8 quantization on x86 CPUs with AVX2 support. Recently, BSTC (li2019bstc) and BTC (TPDS) accelerate binary neural networks on GPUs by exploiting the int1 compute primitive. However, existing works can only build on the limited precisions supported on GPUs (e.g., int1, int4, and int8) and cannot fully exploit the performance benefits of APNNs. In this paper, we build the first generalized framework to accelerate arbitrary-precision neural networks on Ampere GPU Tensor Cores.

2.3. Tensor Cores

Tensor Cores are specialized cores for accelerating neural networks in terms of matrix-matrix multiplications. Tensor Cores have been available in NVIDIA GPUs since the Volta architecture (volta). Different from CUDA Cores, which compute scalar values with individual threads, Tensor Cores compute at the matrix level with all threads in a warp (raihan2019modeling). For example, the 1-bit Tensor Core compute primitive takes two int1 input matrices A and B of shape $8 \times 128$ and $128 \times 8$ and generates an int32 output matrix C of shape $8 \times 8$ (TPDS). In the Volta architecture, Tensor Cores support only half-precision computation (jia2018dissecting). To support more quantized neural networks, Tensor Cores added more precisions, including int1, int4, and int8, in the Turing architecture (jia2019dissecting). Regarding int1 precision, Tensor Cores support only the XOR logical operation in the Turing architecture and recently added the AND logical operation in the Ampere architecture (ampere). Despite these hardware efforts to support more precisions, arbitrary precisions are still not supported. This is the first work to support arbitrary-precision computation on Ampere GPU Tensor Cores with the int1 precision and both XOR and AND operations.

Figure 2. Illustration of the AP-Bit Operation Template with a 1-bit weight W and a 2-bit feature X, which can be generalized to arbitrary weight bits and feature bits. Note that the 1-bit feature matrices $X_1$ and $X_0$ in the dashed box are batched into a single large matrix during computation, which will be discussed in Section 4.1.

3. AP-Bit Emulation Design

In this section, we design an AP-BIT emulation on Tensor Cores to support arbitrary-precision computation. We first design an AP-Bit operation template that supports arbitrary-precision computation with the 1-bit compute primitive on Tensor Cores. Then, we propose a data adaptive operator selection to automatically support various input data (e.g., -1/+1 and 0/1) with bitwise XOR and AND on Tensor Cores. Here, we focus on the algorithm design for small matrices (i.e., input matrices of shape $8 \times 128$ and $128 \times 8$ and an output matrix of shape $8 \times 8$) that fit directly on the Tensor Core compute primitive. We will discuss the efficient computation of large matrices in the next section.

3.1. AP-Bit Operation Template Design

The AP-Bit operation template takes a weight matrix $W$ with $p$-bit elements and a feature matrix $X$ with $q$-bit elements, and computes $W \times X$ with 1-bit operations on Tensor Cores to generate a 32-bit output matrix $C$. Our key observation is that each arbitrary-bit scalar digit can be decomposed into a sequence of 1-bit scalar digits, so that the arbitrary-precision computation can be conducted with only 1-bit operations and shift operations. Formally, to support scalar-level arbitrary-precision computation of a 1-bit weight $w$ and a 2-bit feature $x$ with int1, we can first decompose 1-bit values $x_1$ and $x_0$ from the 2-bit feature as

$$x = 2\,x_1 + x_0, \qquad x_1, x_0 \in \{0, 1\}.$$

Suppose we have a 1-bit operation $\otimes$ (e.g., the bmma API of Tensor Cores) that takes 1-bit inputs and generates 32-bit outputs. We can then compute $w \cdot x$ as

$$w \cdot x = 2\,(w \otimes x_1) + (w \otimes x_0).$$

We illustrate our AP-Bit operation template in Figure 2. Here, we focus on a 1-bit weight matrix $W$ of shape $8 \times 128$ and a 2-bit feature matrix $X$ of shape $128 \times 8$ to illustrate our algorithm design. A naive approach is to use 4-bit integers to represent each 1-bit element $w$ and 2-bit element $x$, and then use the int4 compute primitive on Tensor Cores. However, this approach would lead to unnecessary memory and computation overhead. Instead, we propose to exploit the int1 compute primitive on Tensor Cores to support arbitrary-precision computation by dynamically adjusting the memory and computation requirement. In particular, the first step is to conduct bit decomposition by splitting each 2-bit element $x$ into two 1-bit elements $x_1$ and $x_0$:

$$x = 2\,x_1 + x_0.$$

These 1-bit elements are then packed into 1-bit matrices $X_1$ and $X_0$. The second step is to conduct batch-based Tensor Core computation on these 1-bit matrices with the bmma API and generate 32-bit output matrices

$$C_1 = \text{bmma}(W, X_1), \qquad C_0 = \text{bmma}(W, X_0).$$

These matrices can be computed directly with the bmma API since all of them match the shape of the 1-bit compute primitive. We also note that Tensor Core primitives for int1, int4, and int8 generate 32-bit output matrices to accumulate a large number of bit-operation outputs and avoid overflow. The third step is to conduct bit combination and generate the final output matrix $C$:

$$c_{ij} = 2\,c^{(1)}_{ij} + c^{(0)}_{ij} \qquad (1)$$

Here, $c_{ij}$, $c^{(1)}_{ij}$, and $c^{(0)}_{ij}$ refer to the scalar elements of matrices $C$, $C_1$, and $C_0$, respectively. For notational simplicity, we abbreviate Equation 1 as $C = 2\,C_1 + C_0$ in the following sections to represent the scalar multiplication and elementwise addition. We note that $C = W \times X$ mathematically.
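As a quick sanity check (a worked example we add here, not from the original text), take a scalar weight $w = 1$ and a 2-bit feature $x = 3 = (11)_2$, so $x_1 = 1$ and $x_0 = 1$:

$$2\,(w \otimes x_1) + (w \otimes x_0) = 2 \cdot 1 + 1 = 3 = w \cdot x.$$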

It is not hard to see that this computation can be generalized to matrices with arbitrary bit widths $p$ and $q$. Formally, given a $p$-bit weight matrix $W$ and a $q$-bit feature matrix $X$, we can first decompose $W$ into 1-bit matrices $W_{p-1}, \dots, W_0$ and $X$ into 1-bit matrices $X_{q-1}, \dots, X_0$. For each element, we have

$$w = \sum_{i=0}^{p-1} 2^i\, w_i, \qquad x = \sum_{j=0}^{q-1} 2^j\, x_j. \qquad (2)$$

Then, we invoke the bmma API $p \times q$ times, once for each combination of $W_i$ and $X_j$:

$$C_{ij} = \text{bmma}(W_i, X_j), \qquad 0 \le i < p,\; 0 \le j < q.$$

Finally, we conduct bit combination to generate the 32-bit output matrix $C$:

$$C = \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} 2^{i+j}\, C_{ij}.$$

Cost Analysis. The cost of arbitrary-precision computation comes from three parts: bit decomposition, Tensor Core computation, and bit combination. Given a $p$-bit weight matrix $W$ and a $q$-bit feature matrix $X$ of shape $n \times n$, bit decomposition takes $O\big((p+q)\,n^2\big)$ time, since we need $p$ operations to split each $p$-bit element of $W$ into 1-bit elements and another $q$ operations to split each $q$-bit element of $X$ into 1-bit elements. Bit combination takes $O(p\,q\,n^2)$ time, since we have $p \times q$ matrices of shape $n \times n$ that need to be added elementwise. This overhead is negligible compared with the $O(p\,q\,n^3)$ complexity of the Tensor Core computation. Note that only 1-bit compute primitives are used for this expensive matrix-matrix multiplication, which significantly reduces the overall latency.
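To make the three-step template concrete, the host-side sketch below (our own illustration; the function name apbit_gemm_ref and the dense uint32 storage are hypothetical, and values are assumed to use the 0/1 encoding of Case-I in Section 3.2) emulates a p-bit-by-q-bit GEMM with nothing but 1-bit AND products and shifted accumulation, mirroring Equation (2) and the bit-combination step.

```cuda
// Host-side reference of the AP-Bit emulation (illustrative, not the GPU kernel).
// W holds p-bit values of an M x K weight matrix, X holds q-bit values of a
// K x N feature matrix, both with non-negative 0/1-style encodings.
#include <cstdint>
#include <vector>

std::vector<int32_t> apbit_gemm_ref(const std::vector<uint32_t>& W,  // M x K, one value per entry
                                    const std::vector<uint32_t>& X,  // K x N, one value per entry
                                    int M, int N, int K, int p, int q) {
    std::vector<int32_t> C(M * N, 0);
    for (int i = 0; i < p; ++i) {            // weight bit plane W_i
        for (int j = 0; j < q; ++j) {        // feature bit plane X_j
            for (int m = 0; m < M; ++m) {
                for (int n = 0; n < N; ++n) {
                    int32_t acc = 0;         // plays the role of the 32-bit bmma accumulator
                    for (int k = 0; k < K; ++k) {
                        uint32_t w_bit = (W[m * K + k] >> i) & 1u;   // bit decomposition of W
                        uint32_t x_bit = (X[k * N + n] >> j) & 1u;   // bit decomposition of X
                        acc += static_cast<int32_t>(w_bit & x_bit);  // 1-bit AND, accumulated (popc)
                    }
                    C[m * N + n] += acc << (i + j);   // bit combination: C += 2^(i+j) * C_ij
                }
            }
        }
    }
    return C;
}
```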

3.2. Data Adaptive Operator Selection

While arbitrary-precision computation operates on bit-0 and bit-1, these two bit values may actually encode diverse values. For example, the 1-bit weight matrix in a neural network may encode $-1$ and $+1$, instead of $0$ and $1$, in order to improve the accuracy of the network. In this case, bit-0 indicates the value $-1$ and bit-1 indicates the value $+1$. To support this diversity in the encoded data, we introduce data adaptive operator selection by adopting different bit operations on Tensor Cores (i.e., XOR and AND). In particular, we support three cases, where we first conduct bit operations and then accumulate with popc (i.e., population count, which counts the number of set bits). In Case-I, both $W$ and $X$ encode $0$ and $1$, where we choose the logical AND operation. For example, given a 1-bit vector $w$ and a 1-bit vector $x$, we use the AND operation to compute the dot product as

$$w \cdot x = \text{popc}\big(\text{AND}(w, x)\big).$$

In Case-II, both $W$ and $X$ encode $-1$ and $+1$, where we select the logical XOR operation. For example, given two 1-bit vectors $w$ and $x$, we first map bit-0 to $-1$ and bit-1 to $+1$ and compute the dot product as

$$w \cdot x = n - 2\,\text{popc}\big(\text{XOR}(w, x)\big).$$

Here, $n$ is the length of the vector.

In Case-III, $W$ encodes $-1$ and $+1$, while $X$ encodes $0$ and $1$. For example, we may need to compute the multiplication of a 1-bit vector $w$ with values in $\{-1, +1\}$ and a 1-bit vector $x$ with values in $\{0, 1\}$. This case happens frequently in neural networks with a 1-bit weight matrix and a $q$-bit feature matrix with $q > 1$. In this case, naively adopting XOR or AND does not work, since the products take three values $-1$, $0$, and $+1$ that cannot be easily encoded with 1 bit. To this end, we incorporate a linear transformation on $w$ and compute with only the AND operation. Our key observation is that $w$ can be transformed into a vector $w'$ with only $0$ and $1$ by adding a constant all-ones vector $J$: $w' = (w + J)/2$. Then, we compute $w' \cdot x$ with the AND operation as in Case-I. Finally, we recover the value by another linear transformation:

$$w \cdot x = 2\,(w' \cdot x) - (J \cdot x) = 2\,\text{popc}\big(\text{AND}(w', x)\big) - \text{popc}\big(\text{AND}(J, x)\big).$$

Note that $J$ is a constant vector that can be cached in a Tensor Core fragment and does not introduce extra memory overhead.
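To make the three cases concrete, the following scalar sketch (our own reference expressions, not the paper's Tensor Core code) spells out the popc-based dot products for 32-element packed bit vectors; on the GPU, the bit operation is performed by bmma and the accumulation by POPC, while __builtin_popcount stands in for popc() here.

```cuda
#include <cstdint>

// Case-I: both vectors encode {0, 1}.  dot(a, b) = popc(a AND b).
int dot_case1(uint32_t a, uint32_t b) {
    return __builtin_popcount(a & b);
}

// Case-II: both vectors encode {-1, +1} (bit 1 -> +1, bit 0 -> -1).
// Matching bits contribute +1, differing bits -1:  dot = n - 2 * popc(a XOR b).
int dot_case2(uint32_t a, uint32_t b, int n /* vector length, here up to 32 */) {
    return n - 2 * __builtin_popcount(a ^ b);
}

// Case-III: w encodes {-1, +1}, x encodes {0, 1}. Re-encode w as
// w01 = (w + 1) / 2 in {0, 1} (the constant-vector transformation), compute
// with AND as in Case-I, then recover with a linear correction:
//   dot = 2 * popc(w01 AND x) - popc(J AND x),  J = all-ones constant vector.
int dot_case3(uint32_t w01, uint32_t x) {
    const uint32_t J = 0xFFFFFFFFu;   // constant all-ones vector
    return 2 * __builtin_popcount(w01 & x) - __builtin_popcount(J & x);
}
```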

4. Arbitrary Precision Layer Design

In this section, we propose the Arbitrary-Precision Matrix Multiplication (APMM) for fully connected layers and Arbitrary-Precision Convolution (APConv) for convolution layers.

4.1. Arbitrary-Precision Matrix Multiplication

Arbitrary-Precision Matrix Multiplication (APMM) takes the decomposed 1-bit weight matrices $W_0, \dots, W_{p-1}$ and the decomposed 1-bit feature matrices $X_0, \dots, X_{q-1}$, and computes the output matrix $C$. By default, APMM generates 32-bit output to avoid data overflow on large matrices and to match the 32-bit output of Tensor Core compute primitives. APMM also supports arbitrary-precision output (e.g., int2) when APMM is used as a hidden layer in neural networks (NNs) and its output is consumed by the next APMM-based NN layer.

Considering that APMM essentially computes an arbitrary-precision GEneral Matrix-Matrix multiplication (GEMM) kernel with multiple Binary Matrix-MAtrix multiplication (BMMA) kernels, one naive strategy is to build upon existing BMMA kernels (TPDS; li2019bstc). In particular, we could use existing BMMA kernels to multiply each pair of $W_i$ and $X_j$ and accumulate into the output matrix $C$. However, this approach shows significant inefficiency for two reasons. First, it ignores the data reuse opportunity, since the same weight matrix tile from $W_i$ can be multiplied with feature matrix tiles from different 1-bit feature matrices $X_j$. Second, it requires extra communication across BMMA kernels, such that reducing the partial results $C_{ij}$ into $C$ leads to significant global memory access. We show our efficient APMM design in Figure 3. It includes a batch-based double caching to facilitate data reuse and a memory-efficient bit combination to accelerate the accumulation and optionally generate the arbitrary-precision output. Here, we illustrate the design with a 1-bit $W$ and a 2-bit $X$ for notational simplicity, while arbitrary-precision $p$ and $q$ are supported.

Figure 3. Illustration of APMM. GL: GLobal memory. SHMEM: SHared Memory. FRAG: FRAGment.

(a) Batch-based Double Caching. Batch-based double caching exploits two GPU memory hierarchies (i.e., shared memory and fragments located in registers) to cache matrix tiles and facilitate data reuse in APMM computation, as illustrated in Figure 3(a). Considering the limited size of shared memory and fragments, we tile the weight matrices and feature matrices such that these tiles can be cached in GPU memory hierarchies. Formally, given $W$ of shape $M \times K$ and $X$ of shape $K \times N$, we first tile $W$ into block matrix tiles of shape $M_b \times K_b$ and $X$ into block matrix tiles of shape $K_b \times N_b$. Here, each GPU block multiplies one pair of block matrix tiles and generates an output matrix tile of shape $M_b \times N_b$. Considering that Tensor Cores compute at the warp level, we further tile the block tiles into warp matrix tiles such that each warp computes an output tile of shape $M_w \times N_w$. To match the bmma compute primitive of Tensor Cores, each warp slides along the $M$, $N$, and $K$ dimensions of its tile in bmma-sized steps during computation. Note that these tiling sizes have a significant impact on performance, which will be analyzed in Section 4.3.

Batch-based double caching first adopts a batch strategy to improve inter-thread parallelism and achieve high performance. Existing works on binary neural networks (li2019bstc; TPDS) report that the GEMM sizes in NN workloads are usually small and therefore use small matrix tiling sizes to improve inter-thread parallelism. However, this approach leads to low intra-thread parallelism and prevents data reuse. Instead, our batch strategy virtually transforms multiple small BMMAs into one large BMMA. In particular, given $W_i$ of shape $M \times K$ and $X_j$ of shape $K \times N$, we batch these small matrices into $\bar{W}$ of shape $(p \cdot M) \times K$ and $\bar{X}$ of shape $K \times (q \cdot N)$ and compute them with a single large BMMA, as sketched below. Here, we implement a "virtual" batch strategy during the data loading procedure by dynamically deciding the global memory address of the corresponding matrix tile, such that no additional memory movement is involved.
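A minimal sketch of this address resolution, under the assumption of column-major 1-bit feature planes (the helper name and plane-pointer layout are ours, not the paper's), is:

```cuda
// "Virtual" batching: the q 1-bit feature planes X_0..X_{q-1}, each K x N bits
// in column-major order, are treated as one K x (q*N) matrix without ever being
// copied. A block that wants batched column n_bar resolves which plane and
// which real column to read from at load time.
__device__ const unsigned* virtual_batch_ptr(const unsigned* const* X_planes, // q plane base pointers
                                             int n_bar, int N, int K_words)   // K_words = K / 32
{
    int plane = n_bar / N;   // which 1-bit matrix X_j
    int col   = n_bar % N;   // column inside that matrix
    // column `col` of a column-major plane starts at word col * K_words
    return X_planes[plane] + static_cast<long long>(col) * K_words;
}
```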

Batch-based double caching then exploits two GPU memory hierarchies to facilitate data reuse at different levels. The first level is shared memory caching to reuse matrix tiles from $\bar{W}$ and $\bar{X}$. Here, a naive strategy is that each warp independently loads a weight tile and a feature tile for computation. However, we observe that the same weight tile may be multiplied with feature tiles from different 1-bit feature matrices, such as $X_1$ and $X_0$, as illustrated in Figure 3(a). To this end, our design requires all warps to first collaboratively load weight data and feature data from global memory to shared memory. Then, each warp fetches its own matrix tiles from shared memory. In this way, we significantly reduce global memory access by exploiting the fast shared memory.

The second level is fragment caching to keep output tiles resident in the same Tensor Core fragments. Since Tensor Core compute primitives require accumulation in 32-bit Tensor Core fragments, the output tiles usually consume a large memory space compared with the 1-bit input data. Moving output tiles between shared memory and Tensor Core fragments may lead to heavy shared memory access. Moreover, existing dissecting works (jia2019dissecting; jia2018dissecting) reveal that fragments are composed of registers and that one GPU block of 8 warps can provide up to 256 KB of fragment storage, which is much larger than shared memory. To this end, as we iterate through the K dimension during computation, we continuously use multiple fragments to cache the output tiles, which reduces shared memory access and leaves more shared memory for caching feature and weight tiles.
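To make the fragment-level flow concrete, the device-side sketch below (our own illustration, not the APNN-TC kernel) shows how one warp could multiply a 1-bit weight plane by a 1-bit feature plane with the CUDA WMMA sub-byte API (b1 precision, 8x8x128 shape, AND bit operation with POPC accumulation on Ampere) while keeping the 32-bit accumulator resident in a fragment. Pointer arithmetic, tiling, and double caching are simplified assumptions.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp multiplies weight plane W_i by feature plane X_j, sliding along K,
// then scales the partial tile by 2^(i+j); the caller accumulates the shifted
// tiles over all (i, j) plane pairs to form C. Assumes row-major W planes and
// column-major X planes packed 32 bits per unsigned int, K a multiple of 128,
// and an sm_80+ device for the AND bit operation.
__device__ void warp_bmma_plane(const unsigned* Wi,      // 8 x K bits, row-major
                                const unsigned* Xj,      // K x 8 bits, col-major
                                int K, int shift,        // shift = i + j
                                int* C_ij_tile)          // 8 x 8 int32 partial tile
{
    wmma::fragment<wmma::matrix_a, 8, 8, 128,
                   wmma::experimental::precision::b1, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 8, 8, 128,
                   wmma::experimental::precision::b1, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 8, 8, 128, int> c_frag;
    wmma::fill_fragment(c_frag, 0);

    for (int k = 0; k < K; k += 128) {                   // slide along K in bmma-sized steps
        wmma::load_matrix_sync(a_frag, Wi + k / 32, K);  // leading dimension in bits
        wmma::load_matrix_sync(b_frag, Xj + k / 32, K);
        // AND + population-count accumulation (Ampere); Turing offers XOR only.
        wmma::bmma_sync(c_frag, a_frag, b_frag, c_frag,
                        wmma::experimental::bmmaBitOpAND,
                        wmma::experimental::bmmaAccumulateOpPOPC);
    }
    for (int t = 0; t < c_frag.num_elements; ++t)        // weight this plane pair by 2^(i+j)
        c_frag.x[t] <<= shift;
    wmma::store_matrix_sync(C_ij_tile, c_frag, 8, wmma::mem_row_major);
}
```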

(b) Memory-efficient Bit Combination. Bit combination consumes the 32-bit BMMA outputs and generates the 32-bit APMM output as $C = \sum_{i,j} 2^{i+j} C_{ij}$. Bit combination can also generate arbitrary-precision output when APMM is utilized as an NN hidden layer and its output is consumed by the next NN layer. Overall, bit combination takes only quadratic computation complexity, which is significantly lower than the cubic computation complexity of GEMM operations. However, there are two potential memory bottlenecks in bit combination, which have a significant performance impact. The first one is global memory access when reducing the 32-bit BMMA outputs into the 32-bit APMM output. In a naive implementation that conducts BMMA and bit combination independently, bit combination usually introduces a latency similar to that of the BMMA kernel itself. The main reason is that, while Tensor Cores provide significantly higher computation throughput than CUDA Cores, the global memory bandwidth remains the same. The second one is shared memory access when converting 32-bit APMM outputs to arbitrary-precision outputs. In this procedure, we usually need to pack low-bit values (e.g., 2-bit) held in registers of different threads into a single memory-aligned value (e.g., 32-bit) before storing to global memory. Relying on shared memory for this data exchange across threads may lead to heavy shared memory access.

Memory-efficient bit combination includes two novel designs to mitigate memory overhead. The first design includes a semantic-aware workload allocation and an in-shared-memory reduction. In particular, at the data loading phase of BMMA, we load feature tiles and weight tiles of the same spatial location such that their multiplication outputs can be reduced directly. As illustrated in Figure 3, instead of loading a feature tile of only $X_1$ or only $X_0$, we load two feature tiles from both $X_1$ and $X_0$ with the same matrix index. In this way, we can reduce $C_1$ and $C_0$ directly in shared memory and mitigate global memory access while not degrading BMMA performance.

The second design incorporates an element-wise routine and inter-thread communication to pack low-bit values and mitigate shared memory overhead. The element-wise routine is a user-defined interface to provide diverse support for quantization and batch normalization across NN layers. This routine applies to individual 32-bit reduced values in registers. Given a 32-bit value in a register, this routine may quantize it into a $q$-bit value that is still stored in the 32-bit register with the remaining bits set to zero. This routine also includes bit decomposition (Equation 2) that splits this $q$-bit value in a register into $q$ 1-bit values in registers. After that, we use the __ballot_sync API to enable inter-thread communication and directly pack the 1-bit values across threads into 32-bit values that can be stored to global memory.
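For the packing step, the sketch below (our illustration; the function name is hypothetical) shows how __ballot_sync can gather one bit per thread of a warp into a single 32-bit word without going through shared memory.

```cuda
// Each of the 32 threads in a warp holds one bit of a decomposed low-bit output
// in a register; __ballot_sync packs the 32 predicate bits into one 32-bit word
// (lane i sets bit i), which a single lane then writes to global memory.
__device__ void pack_bitplane_and_store(int my_bit,          // 0 or 1, this thread's bit
                                        unsigned* out_word)  // destination of the packed word
{
    unsigned packed = __ballot_sync(0xFFFFFFFFu, my_bit);
    if ((threadIdx.x & 31) == 0)                             // one lane per warp stores
        *out_word = packed;
}
```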

4.2. Arbitrary-Precision Convolution (APConv)

APConv takes the decomposed 1-bit weight matrices and the decomposed 1-bit feature matrices, whose dimensions cover the number of output channels, the number of input channels, the kernel size, and the batch size, and generates the output matrix. Existing works on bit-level convolution usually adopt a direct convolution design (li2019bstc; TPDS) to improve GPU utilization. However, these methods ignore the data reuse opportunity and introduce heavy global memory access. In addition, APConv on a $p$-bit weight and a $q$-bit feature has roughly $p \times q$ times the workload of a binary convolution (BConv) on the same weight and feature size, which easily leads to high GPU utilization. To this end, APConv incorporates the same batch-based double caching design as APMM to mitigate global memory access. However, two key challenges still distinguish APConv from APMM. The first is data organization, where naively reading the feature map may easily lead to un-coalesced memory access. The second is data padding, where simply padding zeros may lead to erroneous results. To tackle these challenges, we propose a channel-major data organization and an input-aware padding design.

Figure 4. Illustration of Channel-Major Data Organization (NPHWC). P indicates the number of bits. $x_{c,h,w}$ indicates the image pixel at the $c$-th channel, $h$-th height, and $w$-th width.

(a) Channel-Major Data Organization. Channel-major data organization transforms un-coalesced and unaligned memory access into coalesced and aligned access to improve performance. Traditional data organization for convolution usually employs an NCHW design, as illustrated in Figure 4(a). However, naively borrowing this design for APConv leads to un-aligned and un-coalesced memory access for two reasons. First, multiple $q$-bit (e.g., 3-bit) elements usually cannot be packed into an aligned 32-bit element, which is required for valid GPU reads and writes, and using a 32-bit element to store a single $q$-bit element introduces extra memory overhead. Second, convolution operations usually read only a few continuous elements (or bits) due to the small kernel size, which may lead to un-coalesced memory access.

We design a channel-major data organization as illustrated in Figure 4(b). There are two key design choices. First, we split a $q$-bit feature matrix into $q$ 1-bit feature matrices and store each 1-bit feature matrix consecutively. In this way, we provide aligned memory access for each 1-bit feature matrix and support arbitrary precision $q$. Second, we consecutively store all channels of elements with the same spatial location. Since convolution layers usually have many channels, this usually leads to coalesced memory access during computation.
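As a concrete reading of this layout, the index helpers below (our own assumption of the exact nesting order; not code from the paper) compute the bit offset and the packed 32-bit word of a pixel bit in the NPHWC organization of Figure 4(b).

```cuda
// NPHWC bit-level offset: the q bit planes (P) of one image are stored
// consecutively, and within a plane all C channel bits of a pixel are
// contiguous, so channel reads are coalesced and each plane stays aligned.
__host__ __device__ inline unsigned long long
nphwc_bit_offset(unsigned n, unsigned p, unsigned h, unsigned w, unsigned c,
                 unsigned P, unsigned H, unsigned W, unsigned C) {
    // n: image, p: bit plane, h/w: spatial position, c: channel
    return ((((static_cast<unsigned long long>(n) * P + p) * H + h) * W + w) * C + c);
}

// Packed 32-bit word index and bit position inside the word.
__host__ __device__ inline unsigned long long nphwc_word(unsigned long long bit) { return bit / 32; }
__host__ __device__ inline unsigned           nphwc_bit (unsigned long long bit) { return unsigned(bit % 32); }
```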

(b) Input-aware Padding Design. Input-aware padding design adaptively adjusts padding values according to the input encoding. As mentioned in Section 3.2, when the weight $W$ encodes $-1$ and $+1$ with bit-0 and bit-1, we cannot naively pad zeros, since bit-0 represents $-1$.

We propose three padding strategies according to the input data. First, when both weight and feature encode $0$ and $1$, we simply pad zeros for features. In this case, the padded features only add extra $0$'s to the products for arbitrary weight values, which does not change the computation result. Second, when both weight and feature encode $-1$ and $+1$, we pad a fixed bit value for features and use an extra counter to track the number of padded positions covered when the convolution weight moves outside the input image frame. We then subtract a correction based on the counter to amend the corresponding convolution results. Third, when the weight encodes $-1$ and $+1$ and the feature encodes $0$ and $1$, we pad zeros for features, which does not change the convolution results.

4.3. Performance Analysis

In our APNN-TC kernel design, there are six tuning knobs – the block tiling sizes $M_b$, $N_b$, $K_b$ and the warp tiling sizes $M_w$, $N_w$, $K_w$. These tiling sizes bring a trade-off between Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP), especially the compute intensity (CI). Here, we focus on the block tiling sizes, since we empirically observe that using 8 warps per block and splitting the block workload evenly across warps provides the best performance (i.e., the warp tiling sizes are fixed fractions of the block tiling sizes). In this subsection, we first analyze the performance impact of individual tuning knobs. Then, we propose an autotuning strategy to maximize performance. Since APMM and APConv share the same batch-based double caching strategy, we use the same autotuning strategy for these two kernels.

4.3.1. Performance Model

TLP refers to the thread-level parallelism in terms of the number of threads in use. Intuitively, larger TLP can improve GPU utilization and kernel performance (li2015transit; li2016x). Formally, given a $p$-bit weight matrix of shape $M \times K$, a $q$-bit feature matrix of shape $K \times N$, and the block tiling sizes $M_b$ and $N_b$, we define the TLP as

$$\text{TLP} = \frac{p\,M}{M_b} \cdot \frac{q\,N}{N_b}. \qquad (3)$$

We ignore the number of threads per block since it is a constant in our evaluation. Intuitively, smaller $M_b$ and $N_b$ improve TLP, which suggests small block tiling sizes, especially for small matrices.

Compute intensity (CI) refers to the ratio of computation to memory access on each thread block. We aim to improve CI for two reasons. First, a higher CI indicates less memory access and better performance. While the amount of computation remains the same, the amount of memory access may be reduced significantly by data reuse and hyper-parameter tuning. Second, a higher CI on a thread block provides more opportunities for latency hiding. Formally, for one block tile, we compute the amount of global memory access as $(M_b + N_b) \cdot K_b$ when reading a weight tile and a feature tile, and the amount of computation as $M_b \cdot N_b \cdot K_b$ from the matrix-matrix multiplication. Finally, we compute CI as

$$\text{CI} = \frac{M_b \cdot N_b \cdot K_b}{(M_b + N_b) \cdot K_b} = \frac{M_b \cdot N_b}{M_b + N_b}. \qquad (4)$$

Note that CI increases when $M_b$ and $N_b$ increase. We also observe that CI is independent of $K_b$, such that we can use a smaller $K_b$ to leave space for larger $M_b$ and $N_b$, especially when the shared memory size is a limiting factor. In our evaluation, we fix $K_b$ to a small constant by default.

4.3.2. Auto-tuning

During APNN-TC kernel design, there is a large search space over the complex interaction between the matrix size ($M$, $N$, and $K$), the weight bit $p$, the feature bit $q$, and the block tiling sizes $M_b$ and $N_b$. Note that the selected parameters may also differ across GPUs according to their computation and memory capabilities. To this end, we propose a heuristic algorithm to provide a faster search over this large space. Formally, given the matrix size $M$, $N$, $K$, the weight bit $p$, and the feature bit $q$, the algorithm selects $M_b$ and $N_b$ in two steps. First, we compute the TLP of each combination of $M_b$ and $N_b$ and put these combinations in a priority queue, where a higher TLP leads to a higher priority. Second, we pop individual combinations from the priority queue. We stick to the first combination (the one with the highest TLP) if its TLP is already smaller than a threshold $\alpha$. Otherwise, we continuously pop and select combinations from the priority queue to improve CI while ensuring that TLP remains larger than $\alpha$. We set $\alpha$ empirically in our evaluation. Note that different block tiling sizes share the same data layout, so there is no overhead when consecutively executing two layers with different block tiling sizes.
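The following host-side sketch (our own rendering; the candidate set, the threshold, and the exact TLP/CI formulas are assumptions based on Equations (3) and (4)) illustrates the two-step selection.

```cuda
#include <queue>
#include <utility>
#include <vector>

// Candidates are ordered by TLP; we keep popping while TLP stays above the
// threshold, preferring the candidate with the highest compute intensity
// among those that remain parallel enough.
struct TileCand {
    int Mb, Nb;      // block tiling sizes along M and N
    double tlp, ci;  // thread-level parallelism and compute intensity
};

TileCand select_tile(int M, int N, int p, int q,
                     const std::vector<std::pair<int,int>>& candidates,
                     double tlp_threshold) {
    auto by_tlp = [](const TileCand& a, const TileCand& b) { return a.tlp < b.tlp; };
    std::priority_queue<TileCand, std::vector<TileCand>, decltype(by_tlp)> pq(by_tlp);

    for (auto [Mb, Nb] : candidates) {
        double tlp = (double(p) * M / Mb) * (double(q) * N / Nb);  // Eq. (3), assumed form
        double ci  = double(Mb) * Nb / (Mb + Nb);                  // Eq. (4), assumed form
        pq.push({Mb, Nb, tlp, ci});
    }

    TileCand best = pq.top();                    // highest-TLP candidate
    if (best.tlp < tlp_threshold) return best;   // not enough parallelism: stop here

    while (!pq.empty() && pq.top().tlp >= tlp_threshold) {
        if (pq.top().ci > best.ci) best = pq.top();  // trade surplus TLP for higher CI
        pq.pop();
    }
    return best;
}
```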

5. Arbitrary Precision Neural Network Design

In this section, we introduce our Arbitrary Precision Neural Network (APNN) design. We first introduce a minimal-traffic dataflow on supporting various precisions across layers in APNN. Then, we incorporate a semantic-aware kernel fusion to minimize the memory access across layers.

5.1. Minimal-Traffic Dataflow

Given an int8 RGB image, APNN computes a sequence of NN layers with $p$-bit weights and $q$-bit activations and finally generates the int32 output logits. All intermediate layers compute at arbitrary precision by taking $p$-bit weights and $q$-bit activations and generating $q$-bit outputs. Note that the int1 Tensor Core compute primitive can only generate 32-bit outputs, so an extra quantization layer is required to quantize them into $q$-bit activations for the next layer. For performance, we quantize all weights during the initialization of an APNN, before model inference. To effectively maintain and transfer arbitrary-bit data, we pack the data bit by bit for both the weights and the feature maps, following the data organization discussed in Section 4.2.

The input layer and the output layer have different precisions from the intermediate layers. Following the common practice with int8 image inputs, the input layer requires an extra quantization layer that quantizes the 8-bit inputs into $q$-bit activations. The output of the input layer is thus a quantized arbitrary-bit feature map that serves as the input to the following intermediate layers. In the output layer, the Tensor Core computation results are directly used for the final softmax logits, so we do not apply quantization after the output layer.

5.2. Semantic-aware Kernel Fusion

Besides APMM and APConv discussed previously, there are still several important layers in an APNN, including quantization, Batch Normalization (BN), pooling, and ReLU. Given the scalars $x$ in a layer, quantization element-wise converts int32 values $x$ into $q$-bit values $x_q$:

$$x_q = \left\lfloor \frac{x}{s} \right\rfloor + z.$$

Here, $z$ is a $q$-bit scalar zero-point, $s$ is the scaling scalar, and $\lfloor \cdot \rfloor$ is the floor function. BN (ioffe2015batch) is another major component in NNs for tackling the covariate shift problem and facilitating NN training:

$$y = \gamma \, \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta, \qquad (5)$$

where $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ are the expectation and variance across the batch, and $\gamma$ and $\beta$ are two learned parameters. Pooling splits the feature map spatially into grids and generates one scalar output for each grid by computing the average or the maximum value in the grid. ReLU takes individual input values $x$ and generates output values $\max(x, 0)$.

While these operations have linear time complexity in the size of the feature maps and consume significantly less computation than the APConv and APMM kernels, they may still introduce heavy latency due to expensive memory access. Indeed, while Tensor Cores provide significantly improved computation capability, they share the same memory bandwidth with CUDA Cores. Moreover, we observe that these values are usually computed element-wise and do not require heavy communication across GPU threads. We therefore propose a semantic-aware kernel fusion to minimize memory access. We first fuse APMM/APConv with its following quantization, BN, pooling, and ReLU kernels into a single kernel to minimize global memory access. In particular, these following layers can be applied seamlessly once the convolution results become available in shared memory. This improves the compute intensity of individual convolution kernels while avoiding the global memory access of invoking an additional batch normalization kernel. Second, considering that these following layers usually compute at the scalar level, we can further reduce shared memory access by directly reusing values in registers (li2016critical). For example, when an APMM layer is followed by a BN layer, a quantization layer, and a ReLU layer, we directly compute the output scalar as

$$x_{\text{out}} = \max\!\left( \left\lfloor \frac{1}{s}\left( \gamma\, \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta \right) \right\rfloor + z,\; 0 \right).$$

Note that we only need to load the scalar $x$ once into a register, which avoids unnecessary shared memory access.
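A hedged sketch of such a fused per-element epilogue, with parameter names following Equation (5) and the quantization formula (the clamp to $q$ bits is an assumption we add for safety), could look like:

```cuda
#include <cmath>

// A 32-bit APMM/APConv result held in a register is batch-normalized,
// quantized to q bits, and passed through ReLU without touching shared or
// global memory in between.
__device__ __forceinline__ int fused_bn_quant_relu(int x_raw,
                                                   float gamma, float beta,
                                                   float mean, float var, float eps,
                                                   float s, int z, int q) {
    float bn = gamma * (float(x_raw) - mean) / sqrtf(var + eps) + beta;  // BN, Eq. (5)
    int   xq = int(floorf(bn / s)) + z;                                  // quantization
    xq       = max(xq, 0);                                               // ReLU
    int   hi = (1 << q) - 1;                                             // clamp to q bits (assumed)
    return min(xq, hi);
}
```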

6. Evaluation

In this section, we evaluate APNN-TC under diverse precisions and show the benefits of arbitrary-precision computation in performance and accuracy.

(a) Over CUTLASS-GEMM-INT4
(b) Over CUBLAS-GEMM-INT8.
Figure 5. APMM Performance on RTX 3090.
(a) Over CUTLASS-GEMM-INT4.
(b) Over CUBLAS-GEMM-INT8.
Figure 6. APMM Performance on A100.

We evaluate on both an NVIDIA RTX 3090 and an NVIDIA Tesla A100. The RTX 3090 GPU is in an Ubuntu 16.04 system with an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, 64 GB DDR3 DRAM, and gcc-7.5.0, using CUDA-11.1, CUTLASS-2.5, and cuBLAS-11.1. The A100 GPU is in a Linux 3.10.0 system with an AMD EPYC 7742 64-core CPU, 1 TB DDR4, and gcc-9.1.0, using CUDA-11.1, CUTLASS-2.5, and cuBLAS-11.3. All reported results are averaged over 200 executions.

6.1. APLayer Evaluation

6.1.1. APMM Performance

We compare our APMM designs with NVIDIA implementations of low-bit GEMM (i.e., int1, int4, and int8) that are accelerated by Tensor Cores. For int8, we compare with the cuBLAS implementation, namely cublas-gemm-int8. Since int1 and int4 are not supported in cuBLAS, we compare with the CUTLASS implementations, namely cutlass-gemm-int1 and cutlass-gemm-int4. Following popular settings in NNs, we compute the multiplication of a weight matrix and a feature matrix whose batch dimension uses a popular batch size and whose remaining dimensions cover typical fully connected layer sizes. According to the precision of our APMM kernel, we name it APMM-wxay, where x indicates the weight bits and y indicates the activation bits. For example, APMM-w1a2 indicates 1-bit weights and 2-bit activations. While our APMM is general enough to support arbitrary precision, we show 8 popular bit combinations due to page limits. If both the weight bits and the activation bits are at most 4 (e.g., w1a2, w1a3, w1a4, w2a2), we compare against cutlass-gemm-int4. If either the weight bits or the activation bits are larger than 4, we compare against cublas-gemm-int8. For each matrix size, we show the speedup of cutlass-gemm-int1 against cutlass-gemm-int4 and cublas-gemm-int8 as the performance benefit of sticking to binary neural networks (li2019bstc; TPDS). Since the Tensor Core compute primitives support only 32-bit outputs, all GEMM kernels take low-bit inputs (e.g., int1, int4, and int8) and generate 32-bit outputs.

(a) Over CUTLASS-Conv-INT4.
(b) Over CUTLASS-Conv-INT8.
Figure 7. APConv Performance on RTX 3090.
(a) Over CUTLASS-Conv-INT4.
(b) Over CUTLASS-Conv-INT8.
Figure 8. APConv Performance on A100.

Figure 5 shows the results of APMM on the RTX 3090. We compare APMM with cutlass-gemm-int4 in Figure 5(a) and with cublas-gemm-int8 in Figure 5(b). Overall, we have three major observations. First, APMM usually achieves significant speedup over the baselines; for example, APMM-w1a2 over cutlass-gemm-int4 and APMM-w5a1 over cublas-gemm-int8. This result demonstrates the performance benefit of emulating arbitrary precision with int1 compute primitives over sticking to int4 or int8 compute primitives. Second, APMMs with various weight and activation bits usually show similar performance on small matrices. For example, APMM-w1a2, APMM-w1a3, APMM-w1a4, and APMM-w2a2 achieve almost the same speedup when N=128 and N=256, even though these kernels have different computation overhead (e.g., 2 BMMAs for APMM-w1a2 and 4 BMMAs for APMM-w2a2). This benefit comes from our batch-based double caching (Section 4.1(a)), where individual small BMMAs are batched into a large BMMA and computed simultaneously. Surprisingly, our arbitrary-precision computation can even outperform cutlass-gemm-int1 in such cases due to the improved GPU utilization. Third, we observe a smaller speedup over cublas-gemm-int8 on large matrix sizes, where peak int1 performance is reached. Our investigation shows that, on the RTX 3090, cutlass-gemm-int1 is only moderately faster than cublas-gemm-int8, such that emulation becomes slower than the built-in int8 compute primitives on large matrices once peak int1 performance is reached (e.g., for APMM-w2a8). We argue that NN workloads can still benefit significantly from our APMM, since the fully connected layers in NNs usually have small matrix sizes (e.g., in ResNet-18). We also show the results of APMM on the A100 in Figure 6 with similar observations.

6.1.2. APConv Performance

We compare our APConv designs with NVIDIA implementations of low-bit convolution that are accelerated by Tensor Cores. Since cuBLAS does not support int1, int4, or int8 convolution, we use kernels from CUTLASS, namely cutlass-conv-int1, cutlass-conv-int4, and cutlass-conv-int8. Similar to APMM, we evaluate 8 types of precision, named APConv-wxay. Since convolution kernels have many more hyperparameters than matrix-multiplication kernels, we show the performance under various input and output channels while fixing the input size (a medium feature size), the filter size and stride (the most frequently used settings), and the batch size (for inference). Figures 7 and 8 show the speedup of APConv on the RTX 3090 and the A100, respectively. APConv achieves speedup over both cutlass-conv-int4 and cutlass-conv-int8. This result shows the significant performance benefit of emulating arbitrary precision with int1 over utilizing int4 or int8. Similar to APMM, we also observe a smaller speedup over cutlass-conv-int8 on large channel counts due to the limitation of peak int1 performance. Since the RTX 3090 and the A100 show similar performance, we focus on the RTX 3090 in the following evaluations.

6.2. APNN Evaluation

In this section, we evaluate the overall APNN performance on three mainstream neural network models with the ImageNet dataset. The details of the evaluated NN models and their accuracy under binarized (1-bit), low-bit (1-bit weights with 2-bit activations), and single-precision settings are listed in Table 1.

Dataset Network Input Size Output Size Binary w1a2 Single
ImageNet AlexNet (krizhevsky2012imagenet) 224x224x3 1000 46.1% 55.7% 57.0%
ImageNet VGG-Variant (cai2017deep) 224x224x3 1000 53.4% 68.8% 69.8%
ImageNet ResNet-18 (he2016deep) 224x224x3 1000 51.2% 62.6% 69.6%
Table 1. APNN Evaluation Setting. We list the dataset, network, input size, output size, and the model accuracy under precisions of BNN (i.e., int1), w1a2 (i.e., 1-bit weights with 2-bit activations), and single-precision floating point.
ImageNet-AlexNet ImageNet-VGG_Variant ImageNet-ResNet18
Schemes Latency (batch 8) Throughput (batch 128) Latency (batch 8) Throughput (batch 128) Latency (batch 8) Throughput (batch 128)
CUTLASS-Single 4.43ms 2.89fps 25.24ms 3.89fps 60.96ms 1.51fps
CUTLASS-Half-TC 3.79ms 3.38fps 24.19ms 4.67fps 57.33ms 1.89fps
CUTLASS-INT8-TC 13.10ms 9.77fps 25.77ms 6.52fps 57.09ms 2.85fps
BNN 0.69ms 1.37fps 2.17ms 3.91fps 0.68ms 1.89fps
APNN-w1a2 0.36ms 2.85fps 1.66ms 5.32fps 0.64ms 1.70fps
Table 2. APNN Inference Performance on NVIDIA Ampere RTX3090 GPU. Note that latency is measured under a batch of 8 images, throughput is measured under a batch of 128.

We consider two types of configurations for evaluation. In the first setting, we focus on a specific low-bit configuration (1-bit weights and 2-bit activations, i.e., w1a2) across different neural network models. We choose several baselines: neural networks built with the single-precision floating-point implementation from CUTLASS (cutlass) running on CUDA Cores, the half-precision implementation from CUTLASS running on Tensor Cores, the INT8 implementation from CUTLASS running on Tensor Cores, and the 1-bit binarized neural network running on Tensor Cores based on the state-of-the-art design from (TPDS). As shown in Table 2, our APNN design running on Tensor Cores achieves a significant speedup compared with the CUTLASS INT8, half-, and single-precision implementations, which indicates the practical value of our APNN design in latency-sensitive applications. Meanwhile, at large batch sizes for throughput evaluation, our APNN design also demonstrates a throughput advantage over these "standardized" precision baselines (e.g., 8-bit and half). Compared with the 1-bit binarized neural network running on Tensor Cores, our APNN design demonstrates a significant accuracy improvement (11.67% on average), as listed in Table 1. This makes our APNN design attractive in application settings where BNN accuracy fails to meet the demand. Overall, this study shows that using our APNN design for arbitrary-bit precision computation is a promising way to balance NN model accuracy and runtime performance.

Scheme Latency (ms, batch 8) Throughput (fps, batch 128)
Float 25.24 3.89
Half 24.19 4.66
INT8 25.77 6.52
BNN 2.17 3.91
APNN-w1a2 1.66 5.32
APNN-w2a2 3.08 2.59
APNN-w2a8 14.14 5.65
Table 3. Case Study: APNN of VGG on ImageNet.

In the second setting, we shift our focus to the tradeoff between precision setting and runtime performance on the VGG network. We select several low-bit settings for comparison, including 1-bit weights with 2-bit activations, 2-bit weights with 2-bit activations, and 2-bit weights with 8-bit activations. As shown in Table 3, APNN-TC significantly reduces latency and improves throughput for w1a2 and w2a2 compared with INT8, which shows that APNN-TC can bring benefits to many arbitrary-precision computations. Compared with INT8, APNN-TC with w2a8 shows lower throughput, since we need to compute 16 (=2*8) 1-bit matrix multiplications to emulate the arbitrary-precision computation, which requires more computation than w1a2 with 2 1-bit matrices and w2a2 with 4 1-bit matrices. This also matches the performance of the individual kernels (e.g., Figures 5, 6, 7, and 8). This result indicates that APNN-TC can bring benefits to latency-sensitive applications.

6.3. Additional Studies

We perform several additional studies in this subsection, including the latency breakdown of individual NN layers and the benefit from kernel fusion. We show results from the RTX 3090 and skip results from the A100, since we observe similar trends on these two GPUs.

Latency Breakdown. Figure 9 illustrates the per-layer percentage breakdown of the inference latency for 8 images over three NNs on the RTX 3090 GPU. Clearly, the first layer introduces the most delay, since its input feature size is significantly larger than that of the other layers; its share of the total latency is especially high for AlexNet and VGG_Variant. On the other layers, we observe a roughly balanced latency.

Figure 9. Per-layer latency breakdown of APNN models.
Figure 10. Speedup from APNN Kernel Fusion.

Benefits from Kernel Fusion. Figure 10 investigates the performance benefit of fusing APConv-w1a2, pooling, and quantization into one kernel. Specifically, in the "w/o Fusion" implementation, we implement three global functions for APConv-w1a2 with 32-bit output, pooling, and quantization into $q$-bit outputs, respectively, where each function reads and writes data in global memory. In the "w/ Fusion" implementation, we conduct the same workload in a single kernel. Overall, we observe a clear latency reduction on average. The main reason is that, with fusion, the data shared across APConv, pooling, and quantization can be cached in shared memory, so global memory access is significantly reduced.

Figure 11. Overhead from bit combination and bit decomposition, relative to TC Computation.

Overhead from bit combination and bit decomposition. We show the overhead from bit combination and bit decomposition in Figure 11. We profile the overhead of our APConv designs following the same setting as Section 6.1.2. We show results from APConv-w1a2, since we observe similar overhead across bit settings. On average, we empirically observe 1.16% overhead from bit combination and another 2.02% overhead from bit decomposition, relative to the TC computation alone. The main reason is that bit combination and bit decomposition introduce only quadratic time complexity, which is significantly smaller than the cubic time complexity of the TC computation. Due to this difference in time complexity, the overhead from bit combination decreases as the channel size increases, and we observe a similar trend for bit decomposition.

Figure 12. Comparing APMM and CUTLASS-GEMM.

Comparing APMM and cutlass GEMM under the same bits. Figure 12 shows the performance comparison between APMM and cutlass-gemm when using the same bits. Overall, we observe that APMM-w4a4 can achieve speedup over cutlass-gemm-int4. The main reason is that APMM-w4a4 achieves better parallelism by using int1 computations to emulate the int4 computation, leading to better GPU utilization, especially for small matrix sizes. We note that this speedup of APMM-w4a4 over cutlass-gemm-int4 decreases as the matrix size increases, where more int1 computation resources are required for emulation. We also observe that APMM-w1a1 can achieve speedup over cutlass-gemm-int1. This shows the benefit of our kernel-level optimizations.

w1a2 w1a3 w1a4 w2a2 cutlass-gemm-int4 cutlass-gemm-int1
6.67 6.81 7.06 7.15 15.61 7.92
Table 4. Raw latency of a typical fully-connected layer with batch size , input dimension , and output dimension . Unit: microsecond.

Raw latency of a typical fully-connected layer. Table 4 shows the raw latency of a typical fully-connected layer. Overall, we observe that such a layer requires only around 7 microseconds. Compared with cutlass-gemm-int4, we achieve over 2x speedup on average by using arbitrary-precision computation. We also note that the arbitrary-precision computation is even slightly faster than cutlass-gemm-int1, which matches the results in Section 6.1.1.

7. Discussion

Practical usage of APNN. Arbitrary-precision neural networks have been widely studied to provide diverse tradeoffs between precision and efficiency (HanMD15; TPDS; BinaryConnect; DoReFa; zhang2018lq; HAQ; OLCEL; li2019bstc). While arbitrary precision may slightly reduce accuracy, it shows merit in many practical usages such as smart sensors (sensor; KungZWCM18; McDanelTK17), mask detection (BinaryCoP), and intelligent agriculture (GarofaloT0RB20). In these usages, once a certain accuracy bar is surpassed, other essential metrics such as real-time processing and resource consumption become more important. For example, BinaryCoP (BinaryCoP) utilizes low-power binary neural networks to detect facial-mask wear at entrances to corporate buildings and airports. Another example is XpulpNN (GarofaloT0RB20), which uses quantized neural networks on energy-efficient IoT devices.

Generality to other NNs. This paper reports the results of APNN-TC on the two most time-consuming kernels, GEMM and convolution, from the computer vision domain and showcases the performance on popular vision models (e.g., AlexNet, VGG, and ResNet). Yet, we expect that APNN-TC applies to NNs from various domains, such as natural language processing (NLP). Intuitively, APNN-TC accelerates GEMM and dot products, which are the building blocks of many NLP NNs (Transformer; ZhangFB20; DevlinCLT19), such as the attention layer and the feed-forward layer.

Generality to other processors. APNN-TC utilizes population count (i.e., popc()) and two logical operations (i.e., XOR and AND) to support arbitrary-precision computation on NVIDIA GPUs. Considering the wide support for population count and logical operations, APNN-TC can be easily adapted to diverse processors. For example, AMD GPUs (AMD) support population count (i.e., popcnt() on AMD GPUs) and logical operations (e.g., bitwise XOR). Xeon Phi (Intel) also supports population count and logical operations.

8. Conclusion

In this paper, we design and implement APNN-TC to accelerate arbitrary-precision neural networks on Ampere GPU Tensor Cores. Specifically, APNN-TC contains an int1-based emulation design on Tensor Cores to enable arbitrary-precision computation, an AP-Layer design to efficiently map NN layers onto Tensor Cores, and an APNN design to minimize memory access across NN layers. Extensive evaluations on two Ampere GPUs show that APNN-TC can achieve significant speedup over CUTLASS kernels and over various mainstream NN models, such as ResNet and VGG.

9. Acknowledgements

We thank all anonymous reviewers for their valuable comments. This work was supported in part by NSF 1925717 and 2124039. This work was supported in part by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, under award 66150: "CENATE - Center for Advanced Architecture Evaluation" and PNNL’s Data-Model-Convergence (DMC) LDRD Initiative Computation-Flow-Architecture (CFA) project. The Pacific Northwest National Laboratory is operated by Battelle for the U.S. Department of Energy under contract DE-AC05-76RL01830.

References