1. Introduction
Over recent years, the demand for faster and more efficient deep neural networks (DNNs) has kept growing. Prior work approaches this goal from different angles, such as model pruning (liu2020autocompress; niu2020towards; ma2020pconv; dualInfer), kernel factorization (mobilenet_2017_howard; sandler2018mobilenetv2; xception; DSXplore), and data quantization (trainingqnn; qcnn). Among these efforts, quantization-based DNN acceleration (searchlowbit; trainingqnn; qcnn) stands out for its minimal modification of the original model architecture, lower memory consumption, and better runtime performance.
To accelerate quantized DNNs, many specialized cores have been introduced to support low-precision dense matrix-matrix multiplications, such as Tensor Processing Units (TPUs) (jouppi2017datacenter), Neural Network Processors (NNPs) (hickmann2020intel), and GPU Tensor Cores (choquette2021nvidia). For example, NVIDIA introduces Tensor Cores in the Volta architecture (choquette2018volta) that support FP16 matrix-matrix multiplication. In the Turing architecture, NVIDIA extends the architectural support to more precisions (e.g., int1 and int4) and bit-level operations (e.g., XOR) (TPDS). Recently, the Ampere architecture adds support for additional precisions and bit-level operations (e.g., AND). However, these specialized cores still support only a limited range of precisions at the architecture level, while quantized DNNs usually require arbitrary precisions (e.g., 1-bit weights and 2-bit activations). In this paper, our key question is whether we can support arbitrary-precision neural networks with the limited precisions available on Tensor Cores.
We identify three major challenges in accelerating arbitrary-precision DNNs on Ampere GPU Tensor Cores.
Lack of mathematical emulation design. To support arbitrary precisions (e.g., int1 weights and int2 activations), one naive approach is to represent these low-precision values with supported higher-precision values (e.g., int4). However, this approach introduces extra overhead and prevents efficient execution of quantized DNNs on Tensor Cores. Another approach is to emulate with int1 compute primitives. However, at int1 precision, Tensor Cores support only two bit-level operations (i.e., XOR and AND), so mathematical emulation designs are required to support the multiplication and addition in quantized DNNs. Moreover, quantized DNNs may have diverse input data (e.g., -1/+1 or 0/1), where different data may require different emulation designs.
Lack of efficient implementation for arbitrary-precision NN layers. To accelerate APNNs on Tensor Cores, we need to efficiently map arbitrary-precision NN layers to Tensor Cores with specialized compute primitives and memory architectures. Existing works on accelerating binary neural networks simply split NN layers into small matrix tiles to match Tensor Core compute primitives and improve parallelism. However, naively borrowing these strategies fails to exploit data locality during NN layer computation, especially for our emulation workload. Moreover, arbitrary-precision computation usually operates at the bit level (e.g., int3 or int5), while existing hardware such as CPUs and GPUs usually operates at the word or byte level. Specialized bit operations and data organization are required to support efficient bit-level computation and avoid uncoalesced memory access.
Lack of efficient NN framework designs. One standard approach to building quantized neural networks is to stack a sequence of NN layers, such as a convolution layer followed by a pooling layer and a quantization layer. However, this approach ignores the data reuse opportunity across NN layers and leads to unnecessary memory overhead. For example, on NNs with 2-bit activations, there are two semantically equivalent implementations – quantizing after reading 32-bit activations from the previous layer, or quantizing to 2-bit values before writing to global memory for the next layer. While these two implementations provide the same semantics, the former accesses 32 bits per activation in global memory while the latter accesses only 2 bits per activation.
To this end, we propose APNN-TC to accelerate Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores, as illustrated in Figure 1. First, we propose an AP-BIT emulation design to support arbitrary-precision computation with 1-bit compute primitives. Our AP-BIT algorithm can adaptively select operators (e.g., XOR or AND) to support diverse input data (e.g., -1/+1 or 0/1). Second, we build an efficient AP-Layer design, including an arbitrary-precision matrix-matrix multiplication (APMM) layer for fully connected layers and an arbitrary-precision convolution (APConv) layer for convolution layers. We propose a set of memory and computation designs (e.g., batch-based double caching and channel-major data organization) to fully exploit Tensor Core computation and minimize memory access. We also incorporate a performance analysis to automatically tune the hyperparameters in APMM and APConv. Third, we propose an efficient APNN design to improve the performance at the framework level. It includes a minimal-traffic dataflow to support various precisions over APNN layers and a semantic-aware kernel fusion to minimize the data movement across layers.
In summary, we make the following contributions in this paper.

We develop APNN-TC to accelerate neural networks with arbitrary precision on Ampere GPU Tensor Cores.

We propose three novel techniques: a) an AP-BIT emulation design to support arbitrary-precision computation; b) an efficient AP-Layer design to achieve high performance at the layer level; and c) an efficient APNN design to minimize the data movement across layers.

Extensive experiments show that APNN-TC can achieve significant speedup over CUTLASS kernels and cuBLAS kernels. APNN-TC can also consistently outperform NNs implemented with built-in int8, half, or single precision. For example, with low-bit weights and activations, APNN-TC achieves substantially lower latency and higher throughput than the single-precision NN with only a small accuracy drop.
2. Related Works
2.1. APNN Algorithm Designs
Arbitrary-precision (lower than INT8) neural network (APNN) algorithms have been widely studied (HanMD15; TPDS; BinaryConnect; DoReFa; zhang2018lq; HAQ; OLCEL; li2019bstc; geng2020o3bnn) to fully explore the spectrum of NN performance and NN accuracy and to cater to diverse application requirements. In addition to the precisions widely supported on modern GPUs (e.g., int1, int4, and int8), these APNNs usually utilize more diverse precisions such as int2, int3, and int5. APNNs may also have different precisions for weights and activations (e.g., 1-bit weights and 2-bit activations). Compared with INT8 quantized neural networks, APNNs provide better performance and memory efficiency at the cost of (slightly) degraded accuracy. Popular APNNs include DoReFa-Net (DoReFa) with 1-bit weights and 2-bit activations, LQ-Nets (zhang2018lq) with 1-4 bits, HAQ (HAQ) with 1-8 bits, OLAccel (OLCEL) with 4 bits, and BSTC (li2019bstc) and TCBNN (TPDS) with 1 bit. In this paper, we follow LQ-Nets (zhang2018lq), which starts from a full-precision NN and adopts the quantization error minimization (QEM) strategy to generate quantized NNs.
2.2. APNN Hardware Supports
While many APNN algorithms have been designed, hardware support is still limited. One direction is to build FPGA- and ASIC-based implementations (HAQ; OLCEL) to demonstrate the performance benefits of APNNs. However, these implementations usually require specialized hardware designs to support arbitrary-precision computation and cannot be applied to GPUs. Another direction is to utilize the built-in precisions on GPUs for quantized neural networks. Taking the popular PyTorch (pytorch) framework as an example, it supports FP32, FP16, and BF16 models on GPUs and int8 quantization on x86 CPUs with AVX2 support. Recently, BSTC (li2019bstc) and BTC (TPDS) accelerate binary neural networks on GPUs by exploiting the int1 compute primitive. However, existing works can only build on the limited precisions supported on GPUs (e.g., int1, int4, and int8) and cannot fully exploit the performance benefits of APNNs. In this paper, we build the first generalized framework to accelerate arbitrary-precision neural networks on Ampere GPU Tensor Cores.
2.3. Tensor Cores
Tensor Cores are specialized cores for accelerating neural networks in terms of matrix-matrix multiplications. Tensor Cores have been introduced in recent NVIDIA GPUs since the Volta architecture (volta). Different from CUDA Cores that compute scalar values with individual threads, Tensor Cores compute at the matrix level with all threads in a warp (raihan2019modeling). For example, the 1-bit Tensor Core compute primitive takes two int1 input matrices A and B of shape 8x128 and 128x8 and generates an int32 output matrix C of shape 8x8 (TPDS). In the Volta architecture, Tensor Cores support only half-precision computation (jia2018dissecting). To support more quantized neural networks, Tensor Cores add more precisions, including int1, int4, and int8, in the Turing architecture (jia2019dissecting). Regarding int1 precision, Tensor Cores support only the XOR logical operation in the Turing architecture and recently add the AND logical operation in the Ampere architecture (ampere). Despite these hardware efforts to support more precisions, arbitrary precisions are still not supported. This is the first work to support arbitrary-precision computation on Ampere GPU Tensor Cores with int1 precision and support for both XOR and AND operations.
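To make this primitive concrete, the following sketch shows how one warp can issue a single 8x8x128 bmma through the experimental wmma interface available since CUDA 11. This is an illustrative fragment, not code from APNN-TC, and the pointer types and namespaces follow our reading of the CUDA programming guide.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp multiplies an 8x128 bit tile A by a 128x8 bit tile B and
// accumulates an 8x8 int32 tile C with XOR + population count.
// On Ampere (sm_80), wmma::experimental::bmmaBitOpAND can be used instead.
__global__ void bmma_tile(const unsigned *A, const unsigned *B, int *C) {
    wmma::fragment<wmma::matrix_a, 8, 8, 128,
                   wmma::experimental::precision::b1, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 8, 8, 128,
                   wmma::experimental::precision::b1, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 8, 8, 128, int> c;

    wmma::fill_fragment(c, 0);
    wmma::load_matrix_sync(a, A, 128);   // leading dimension given in bits
    wmma::load_matrix_sync(b, B, 128);
    wmma::bmma_sync(c, a, b, c, wmma::experimental::bmmaBitOpXOR);
    wmma::store_matrix_sync(C, c, 8, wmma::mem_row_major);
}
```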
3. AP-Bit Emulation Design
In this section, we design an AP-BIT emulation on Tensor Cores to support arbitrary-precision computation. We first design an AP-Bit operation template that supports arbitrary-precision computation with the 1-bit compute primitive on Tensor Cores. Then, we propose a data adaptive operator selection to automatically support various input data (e.g., -1/+1 and 0/1) with bitwise XOR and AND on Tensor Cores. Here, we focus on the algorithm design for small matrices (i.e., 8x128 and 128x8 input matrices and an 8x8 output matrix) that fit directly on Tensor Core compute primitives. We will discuss the efficient computation of large matrices in the next section.
3.1. AP-Bit Operation Template Design
The AP-Bit operation template takes a matrix W with p-bit elements and a matrix X with q-bit elements, and computes W × X with 1-bit operations on Tensor Cores to generate a 32-bit output matrix O. Our key observation is that each arbitrary-bit scalar digit can be decomposed into a sequence of 1-bit scalar digits, so that the arbitrary-precision computation can be conducted with only 1-bit operations and shift operations. Formally, to support scalar-level arbitrary-precision computation of a 1-bit weight w and a 2-bit feature x with int1, we can first decompose the 2-bit feature into two 1-bit values x_1 and x_0 as x = 2·x_1 + x_0.
Suppose we have a 1-bit operation bop(·,·) (e.g., the bmma API of Tensor Cores) that takes 1-bit inputs and generates 32-bit outputs; we can then compute w·x as w·x = 2·bop(w, x_1) + bop(w, x_0).
We illustrate our AP-Bit operation template in Figure 2. Here, we focus on a 1-bit weight matrix W of shape 8x128 and a 2-bit feature matrix X of shape 128x8 to illustrate our algorithm design. A naive approach is to use 4-bit integers to represent each 1-bit element w and 2-bit element x, and then use the int4 compute primitive on Tensor Cores. However, this approach would lead to unnecessary memory and computation overhead. Instead, we propose to exploit the int1 compute primitive on Tensor Cores to support arbitrary-precision computation by dynamically adjusting the memory and computation requirement. In particular, the first step is to conduct bit decomposition by splitting each 2-bit element x into two 1-bit elements x_1 and x_0:
These 1-bit elements are then packed into the 1-bit matrices X_1 and X_0. The second step is to conduct batch-based Tensor Core computation on these 1-bit matrices with the bmma API and generate the 32-bit output matrices C_1 = bmma(W, X_1) and C_0 = bmma(W, X_0).
These matrices can be computed directly with the bmma API since W, X_1, and X_0 match the 8x128 and 128x8 bmma tile shapes. We also note that the Tensor Core primitives for int1, int4, and int8 generate 32-bit output matrices to accumulate a large number of bit-operation outputs and avoid overflow. The third step is to conduct bit combination and generate the final output matrix O:
o = 2·c_1 + c_0.    (1)
Here, o, c_1, and c_0 refer to the scalar elements of matrices O, C_1, and C_0, respectively. For notational simplicity, we abbreviate Equation 1 as O = 2·C_1 + C_0 in the following sections to represent the scalar multiplication and element-wise addition. We note that O = W × X mathematically.
It is not hard to see that this computation can be generalized to matrices with arbitrary bits p and q. Formally, given a p-bit weight matrix W and a q-bit feature matrix X, we can first decompose them into the 1-bit matrices W_{p-1}, ..., W_0 and X_{q-1}, ..., X_0. For each element, we have
w = Σ_{i=0}^{p-1} 2^i · w_i,    x = Σ_{j=0}^{q-1} 2^j · x_j.    (2)
Then, we invoke the bmma API p·q times, once for each combination of W_i and X_j: C_{i,j} = bmma(W_i, X_j).
Finally, we conduct bit combination to generate the 32-bit output matrix O = Σ_{i=0}^{p-1} Σ_{j=0}^{q-1} 2^{i+j} · C_{i,j}.
Cost Analysis. The cost of arbitrary-precision computation comes from three parts: bit decomposition, Tensor Core computation, and bit combination. Given a p-bit weight matrix of shape MxK and a q-bit data matrix of shape KxN, bit decomposition has complexity O(p·M·K + q·K·N), since we need p·M·K operations to split each p-bit element of the weight matrix into p 1-bit elements and another q·K·N operations to split each q-bit element of the data matrix into q 1-bit elements. Bit combination has complexity O(p·q·M·N), since we have p·q matrices of shape MxN that need to be added element-wise. This overhead is negligible compared with the O(p·q·M·N·K) complexity of the Tensor Core computation. Note that only 1-bit compute primitives are used for this expensive matrix-matrix multiplication, which significantly reduces the overall latency.
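To illustrate the decompose-multiply-combine flow above outside the Tensor Core path, the following device function is a minimal scalar sketch (assuming the 0/1 encoding of Case I in Section 3.2, with bit planes packed into 32-bit words); the actual design performs the same arithmetic on bmma output tiles.

```cuda
// Emulate a dot product between a p-bit vector and a q-bit vector of 32
// elements. w[i] holds bit plane i of the weight, x[j] holds bit plane j of
// the feature. Each 1-bit partial product is an AND followed by a population
// count (bit decomposition + 1-bit ops), and the result is scaled by 2^(i+j)
// before accumulation (bit combination).
__device__ int ap_dot32(const unsigned *w, int p, const unsigned *x, int q) {
    int acc = 0;
    for (int i = 0; i < p; ++i)
        for (int j = 0; j < q; ++j)
            acc += __popc(w[i] & x[j]) << (i + j);
    return acc;
}
```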
3.2. Data Adaptive Operator Selection
While we compute with bit-0 and bit-1 in arbitrary-precision computation, these two bits may actually encode diverse values. For example, the 1-bit weight matrix in neural networks may encode -1 and +1, instead of 0 and 1, in order to improve the accuracy of neural networks. In this case, bit-0 indicates the value -1 and bit-1 indicates the value +1. To support this diversity in the encoded data, we introduce data adaptive operator selection by adopting different bit operations in Tensor Cores (i.e., XOR and AND). In particular, we support three cases, where we first conduct bit operations and then accumulate with popc (i.e., population count, which counts the number of set bits). Case I is that both operands encode 0 and 1, where we choose the logical AND operation. For example, given a 1-bit vector a and a 1-bit vector b, we use the AND operation to compute a·b = popc(a AND b). Case II is that both operands encode -1 and +1, where we select the logical XOR operation. In this encoding, XOR marks the positions where the two vectors differ (i.e., where the element-wise product is -1), so given two 1-bit vectors a and b we first map {-1, +1} to {0, 1} and compute a·b = n - 2·popc(a XOR b). Here, n is the length of the vector.
Case III is that the weight encodes -1 and +1 while the feature encodes 0 and 1. For example, we may need to compute the multiplication of a 1-bit vector a encoding -1/+1 and a 1-bit vector b encoding 0/1. This case happens frequently in neural networks with a 1-bit weight matrix and a q-bit feature matrix with q > 1. In this case, naively adopting XOR or AND does not work, since the three values -1, 0, and +1 cannot be easily encoded with 1 bit. To this end, we incorporate a linear transformation on a and compute with only the AND operation. Our key observation is that a can be transformed into a vector â that contains only 0 and 1 by adding a constant all-one vector e (and halving): â = (a + e)/2. Then, we compute popc(â AND b) with the AND operation as in Case I. Finally, we recover the true value by another linear transformation: a·b = 2·popc(â AND b) - popc(e AND b). Note that e is a constant vector that can be cached in a Tensor Core fragment and does not introduce extra memory overhead.
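The three cases above can be summarized by the following scalar sketch over 32-element bit-packed vectors (illustrative only; the actual design applies the same corrections to bmma outputs on Tensor Cores, and n denotes the logical vector length).

```cuda
// Case I: both operands encode 0/1.
__device__ int dot_01_01(unsigned a, unsigned b) {
    return __popc(a & b);
}

// Case II: both operands encode -1/+1 (bit 1 means +1, bit 0 means -1).
// XOR marks the positions whose element-wise product is -1.
__device__ int dot_pm1_pm1(unsigned a, unsigned b, int n) {
    return n - 2 * __popc(a ^ b);
}

// Case III: a encodes -1/+1 (its raw bit pattern a_hat already holds the
// shifted 0/1 vector), b encodes 0/1. Compute with AND as in Case I, then
// recover the true value with the linear transformation using popc(b).
__device__ int dot_pm1_01(unsigned a_hat, unsigned b) {
    return 2 * __popc(a_hat & b) - __popc(b);
}
```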
4. Arbitrary-Precision Layer Design
In this section, we propose the Arbitrary-Precision Matrix Multiplication (APMM) for fully connected layers and the Arbitrary-Precision Convolution (APConv) for convolution layers.
4.1. Arbitrary-Precision Matrix Multiplication
Arbitrary-Precision Matrix Multiplication (APMM) takes the decomposed 1-bit weight matrices W_0, ..., W_{p-1} and the decomposed 1-bit feature matrices X_0, ..., X_{q-1}, and computes the output matrix O. By default, APMM generates 32-bit output to avoid data overflow for large matrices and to match the 32-bit output of Tensor Core compute primitives. APMM also supports arbitrary-precision output (e.g., int2) when APMM is used as a hidden layer in neural networks (NNs) and its output is consumed by the next APMM-based NN layer.
Considering that APMM essentially computes an arbitrary-precision GEneral Matrix-Matrix multiplication (GEMM) kernel with multiple Binary Matrix-MAtrix multiplication (BMMA) kernels, one naive strategy is to build upon existing BMMA kernels (TPDS; li2019bstc). In particular, we could use existing BMMA kernels to multiply each pair of W_i and X_j and accumulate into the output matrix O. However, this approach shows significant inefficiency for two reasons. First, it ignores the data reuse opportunity, since the same weight matrix tile from W_i can be multiplied with feature matrix tiles from different 1-bit feature matrices X_j. Second, it requires extra communication across BMMA kernels, such that reducing the partial results C_{i,j} into O leads to significant global memory access. We show our efficient APMM design in Figure 3. It includes a batch-based double caching to facilitate data reuse and a memory-efficient bit combination to accelerate the accumulation and optionally generate the arbitrary-precision output. Here, we illustrate the design with a 1-bit W and a 2-bit X for notational simplicity, while arbitrary-precision p and q are supported.
(a) Batch-based Double Caching. Batch-based double caching exploits two GPU memory hierarchies (i.e., shared memory and fragments located in registers) to cache matrix tiles and facilitate data reuse in the APMM computation, as illustrated in Figure 3(a). Considering the limited size of shared memory and fragments, we tile weight matrices and feature matrices such that these tiles can be cached in the GPU memory hierarchies. Formally, given a weight matrix of shape MxK and a feature matrix of shape KxN, we first tile the weight matrix along the M dimension into block matrix tiles of TS_M rows. Similarly, we tile the feature matrix along the N dimension into block matrix tiles of TS_N columns. Here, each GPU block multiplies one pair of block matrix tiles and generates an output matrix tile of shape TS_M x TS_N. Considering that Tensor Cores compute at the warp level, we further tile the block tiles into warp matrix tiles such that each warp computes an output tile of shape WS_M x WS_N. To match the bmma compute primitive of Tensor Cores, each warp slides along the M, N, and K dimensions during computation. Note that these tiling sizes have a significant impact on performance, which will be analyzed in Section 4.3.
Batch-based double caching first adopts a batch strategy to improve inter-thread parallelism and achieve high performance. Existing works on binary neural networks (li2019bstc; TPDS) report that the GEMM sizes in NN workloads are usually small and use small matrix tiling sizes to improve the inter-thread parallelism. However, this approach leads to low intra-thread parallelism and prevents data reuse. Instead, our batch strategy virtually transforms multiple small BMMAs into a large BMMA. In particular, given the 1-bit matrices W_0, ..., W_{p-1} of shape MxK and X_0, ..., X_{q-1} of shape KxN, we batch these small matrices into one matrix of shape (p·M)xK and one of shape Kx(q·N) and compute them with a single large BMMA. Here, we implement a "virtual" batch strategy during the data loading procedure by dynamically deciding the global memory address of the corresponding matrix tile, such that no additional memory movement is involved.
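A sketch of this virtual batching during data loading (a hypothetical helper, not the paper's code): a request for one row of the batched (p·M)xK weight matrix is redirected on the fly to the owning 1-bit matrix, so no physical concatenation is ever materialized.

```cuda
// The p 1-bit weight matrices (each M x K bits, K a multiple of 32) stay in
// place in global memory; w_planes[i] points to the packed words of W_i.
// A row index into the virtually batched (p*M) x K matrix is mapped back to
// the owning bit plane and its local row.
__device__ const unsigned *virtual_batched_row(
        const unsigned *const *w_planes,  // w_planes[i] -> 1-bit matrix W_i
        int M, int K_words,               // K_words = K / 32
        int vrow) {                       // row index in the batched matrix
    int plane = vrow / M;                 // which 1-bit matrix W_i
    int row   = vrow % M;                 // row inside that matrix
    return w_planes[plane] + row * K_words;
}
```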
Batch-based double caching then exploits two GPU memory hierarchies to facilitate data reuse at different levels. The first level is shared memory caching to reuse matrix tiles from the weight and feature matrices. Here, a naive strategy is that each warp independently loads a weight tile and a feature tile for computation. However, we observe that the same weight tile may be multiplied with feature tiles from different 1-bit feature matrices (e.g., X_1 and X_0), as illustrated in Figure 3(a). To this end, our design requires all warps to first collaboratively load weight data and feature data from global memory to shared memory. Then, each warp fetches its own matrix tiles from shared memory. In this way, we can significantly reduce global memory access by exploiting fast shared memory.
The second level is fragment caching to continuously store output tiles in the same Tensor Core fragments. Since Tensor Core compute primitives require accumulation in 32-bit Tensor Core fragments, the output tiles usually consume a large memory space compared with the 1-bit input data. Moving output tiles between shared memory and Tensor Core fragments may lead to heavy shared memory access. Moreover, existing dissecting works (jia2019dissecting; jia2018dissecting) reveal that fragments are composed of registers and that one GPU block of 8 warps can provide up to 256 KB of fragment storage, which is much larger than shared memory. To this end, as we iterate through the K dimension during computation, we continuously use multiple fragments to cache output tiles of shape WS_M x WS_N, reducing shared memory access and leaving shared memory free to cache more feature and weight tiles.
(b) Memory-efficient Bit Combination. Bit combination consumes the 32-bit BMMA outputs C_{i,j} and generates the 32-bit APMM output as O = Σ_{i,j} 2^{i+j}·C_{i,j}. Bit combination can also generate arbitrary-precision output when APMM is utilized as a NN hidden layer and its output is consumed by the next NN layer. Overall, bit combination takes only O(p·q·M·N) computation, which is significantly lower than the O(p·q·M·N·K) computation of the GEMM operations. However, there are two potential memory bottlenecks in bit combination, which have a significant performance impact. The first one is the global memory access when reducing the 32-bit BMMA outputs to the 32-bit APMM output. In a naive implementation that independently conducts BMMA and bit combination, bit combination usually introduces a latency similar to the BMMA kernel itself. The main reason is that, while Tensor Cores provide significantly higher computation throughput than CUDA Cores, the global memory bandwidth remains the same. The second one is the shared memory access when converting 32-bit APMM outputs to arbitrary-precision outputs. In this procedure, we usually need to pack low-bit values (e.g., 2-bit) in registers from different threads into a single memory-aligned value (e.g., 32-bit) before storing to global memory. Relying on shared memory for data exchange across threads may lead to heavy shared memory access.
Memory-efficient bit combination includes two novel designs to mitigate this memory overhead. The first design includes a semantic-aware workload allocation and an in-shared-memory reduction. In particular, at the data loading phase of BMMA, we load feature tiles and weight tiles of the same spatial location such that their multiplication outputs can be reduced directly. As illustrated in Figure 3, instead of loading a feature tile from only X_1 or only X_0, we load two feature tiles from both X_1 and X_0 with the same matrix index. In this way, we can reduce C_1 and C_0 directly in shared memory and mitigate global memory access without degrading BMMA performance.
The second design incorporates an element-wise routine and inter-thread communication to pack low-bit values and mitigate shared memory overhead. The element-wise routine is a user-defined interface to provide diverse support of quantization and batch normalization across NN layers. This routine applies to individual 32-bit reduced values in registers. Given a 32-bit value in a register, this routine may quantize it into a q-bit value that is still stored in the 32-bit register with the leading bits as zeros. This routine also includes bit decomposition (Equation 2) that splits this q-bit value in a register into q 1-bit values in registers. After that, we use the __ballot_sync API to enable inter-thread communication and directly pack the 1-bit values across threads into 32-bit values that can be stored to global memory.
4.2. Arbitrary-Precision Convolution (APConv)
APConv takes the decomposed 1-bit weight matrices of shape C_out x C_in x K x K, the decomposed 1-bit feature matrices of shape N x C_in x H x W, and generates the output feature map. Here, C_out is the number of output channels, C_in is the number of input channels, K is the kernel size, N is the batch size, and H and W are the spatial dimensions of the feature map. Existing works on bit-level convolution usually adopt a direct convolution design (li2019bstc; TPDS) to improve GPU utilization. However, these methods ignore the data reuse opportunity and introduce heavy global memory access. In addition, APConv on a p-bit weight and a q-bit feature usually has p·q times the workload of a binary convolution on the same weight and feature size, which can easily lead to high GPU utilization. To this end, APConv incorporates the same batch-based double caching design as APMM to mitigate global memory access. However, there are still two key challenges that distinguish APConv from APMM. The first is the data organization, where naively reading the feature map may easily lead to uncoalesced memory access. The second is the data padding, where simply padding zeros may lead to erroneous results. To tackle these challenges, we propose a channel-major data organization and an input-aware padding design.
(a) Channel-Major Data Organization. Channel-major data organization transforms uncoalesced and unaligned memory access into coalesced and aligned access for improving performance. Traditional data organization for bit convolution usually employs an NCHW design, as illustrated in Figure 4(a). However, naively borrowing this design for APConv leads to unaligned and uncoalesced memory access for two reasons. First, multiple q-bit (e.g., 3-bit) elements usually cannot be packed into an aligned 32-bit element, which is required for valid GPU reads and writes, and using a 32-bit element to store a single q-bit element introduces extra memory overhead. Second, convolution operations usually read only a few continuous elements (or bits) due to the kernel size, which may lead to uncoalesced memory access.
We design a channel-major data organization as illustrated in Figure 4(b). There are two key design choices. First, we split a q-bit feature matrix into q 1-bit feature matrices and store each 1-bit feature matrix consecutively. In this way, we provide aligned memory access for each 1-bit feature matrix and support arbitrary precision q. Second, we consecutively store all channels of elements with the same spatial location. Since convolution layers usually have a large number of channels, this usually leads to coalesced memory access during computation.
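One possible realization of this layout is the indexing helper below (hypothetical; the exact packing follows Figure 4(b)): each bit plane is stored as a contiguous block, and inside a plane the channel bits of one spatial position sit next to each other.

```cuda
// Channel-major bit index for bit plane j of an N x C x H x W feature map:
// planes are stored back to back, and within a plane all C channel bits of a
// spatial position (n, h, w) are consecutive, so a warp that reads many
// channels of the same position issues aligned, coalesced word accesses.
__device__ size_t channel_major_bit(int j,                  // bit plane
                                    int n, int c, int h, int w,
                                    int N, int C, int H, int W) {
    size_t plane_bits = (size_t)N * H * W * C;   // bits per 1-bit plane
    size_t pos = ((size_t)n * H + h) * W + w;    // spatial position id
    return (size_t)j * plane_bits + pos * C + c; // channel-major within plane
}
// The packed word and bit offset follow as bit_index / 32 and bit_index % 32.
```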
(b) Input-aware Padding Design. Input-aware padding design adaptively adjusts padding values according to the input values. As mentioned in Section 3.2, when the weight W encodes -1 and +1 with bit-0 and bit-1, we cannot naively pad zeros since bit-0 represents -1.
We propose three padding strategies according to the input data. First, when both weight and feature encode 0 and 1, we simply pad zeros for features. In this case, the padded features only add extra 0's for arbitrary weight values, which does not change the computation result. Second, when both weight and feature encode -1 and +1, there is no zero value to pad; we pad a fixed bit value for features and use an extra counter to track the number of padded elements when the convolution weight moves outside the input image frame. We then use this counter to amend the corresponding convolution results. Third, when the weight encodes -1 and +1 and the feature encodes 0 and 1, we pad 0 to the features, which does not change the convolution results.
4.3. Performance Analysis
In our APNN-TC kernel design, there are six tuning knobs – the block tiling sizes TS_M, TS_N, TS_K, and the warp tiling sizes WS_M, WS_N, WS_K. These tiling sizes bring a tradeoff between Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP), especially the compute intensity (CI). Here, we focus on the block tiling sizes, since we empirically observe that using a fixed number of warps per block and splitting the block workload evenly across warps provides the best performance (i.e., the warp tiling sizes follow directly from the block tiling sizes). In this subsection, we first analyze the performance impact of the individual tuning knobs. Then, we propose an auto-tuning strategy to maximize performance. Since APMM and APConv share the same batch-based double caching strategy, we use the same auto-tuning strategy for these two kernels.
4.3.1. Performance Model
TLP refers to the thread-level parallelism in terms of the number of threads in use. Intuitively, larger TLP can improve GPU utilization and kernel performance (li2015transit; li2016x). Formally, given a p-bit weight matrix of shape MxK, a q-bit feature matrix of shape KxN, and the block tiling sizes TS_M and TS_N, we define the TLP as
TLP = (p·M / TS_M) · (q·N / TS_N).    (3)
We ignore the number of threads per block since it is a constant in our evaluation. Intuitively, smaller TS_M and TS_N may improve TLP, which suggests small block tiling sizes, especially for small matrices.
Compute intensity (CI) refers to the ratio of computation over memory access on each thread block. We aim to improve CI for two reasons. First, a higher CI indicates less memory access and better performance. While the amount of computation remains the same, the amount of memory access may be reduced significantly by data reuse and hyperparameter tuning. Second, a higher CI on a thread block provides more opportunities for latency hiding. Formally, for one matrix tile, we compute the amount of global memory access as TS_M·TS_K + TS_K·TS_N when reading a weight tile and a feature tile. We compute the amount of computation as TS_M·TS_N·TS_K from the matrix-matrix multiplication. Finally, we compute CI as
CI = (TS_M·TS_N·TS_K) / (TS_M·TS_K + TS_K·TS_N) = (TS_M·TS_N) / (TS_M + TS_N).    (4)
Note that CI increases when TS_M and TS_N increase. We also observe that CI is independent of TS_K, such that we can use a smaller TS_K to leave space for larger TS_M and TS_N, especially when the shared memory size is a limiting factor. In our evaluation, we fix TS_K to a small constant by default.
4.3.2. Autotuning
During APNN-TC kernel design, there is a large search space over the complex interaction between the matrix size (M, N, and K), the weight bit p, the feature bit q, and the block tiling sizes TS_M and TS_N. Note that the selected parameters may also differ across GPUs according to their computation and memory capabilities. To this end, we propose a heuristic algorithm to provide a faster search through this large space. Formally, given the matrix size M, N, K, the weight bit p, and the feature bit q, the algorithm selects TS_M and TS_N in two steps. First, we compute the TLP of each combination of TS_M and TS_N. We put these combinations in a priority queue, where a higher TLP leads to a higher priority. Second, we pop individual combinations from the priority queue. We stick to the first combination with the highest TLP if its TLP is already smaller than a threshold α. Otherwise, we continuously pop and select combinations from the priority queue to improve CI while ensuring the TLP stays larger than α. We set α empirically in our evaluation. Note that different block tiling sizes share the same data layout, such that there is no overhead when consecutively executing two layers with different block tiling sizes.
5. Arbitrary Precision Neural Network Design
In this section, we introduce our Arbitrary Precision Neural Network (APNN) design. We first introduce a minimal-traffic dataflow to support various precisions across layers in an APNN. Then, we incorporate a semantic-aware kernel fusion to minimize the memory access across layers.
5.1. Minimal-Traffic Dataflow
Given an int8 RGB image, an APNN computes a sequence of NN layers with p-bit weights and q-bit activations and finally generates int32 output logits. Here, all intermediate layers compute at arbitrary precision by taking p-bit weights and q-bit activations and generating q-bit outputs. Note that the int1 Tensor Core compute primitive can only generate 32-bit outputs, and an extra quantization layer is required to quantize them into q-bit activations for the next layer. For performance considerations, during the initialization of an APNN, we quantize all weights before the model inference computation. To effectively maintain and transfer arbitrary-bit data, we pack the data bit by bit for both weights and feature maps, following the data organization discussed in Section 4.2.
The input layer and the output layer have different precisions from the intermediate layers. As is common practice with int8 image inputs, the input layer requires an extra quantization layer that quantizes the 8-bit inputs into q-bit activations. The output of the input layer is also a quantized arbitrary-bit feature map serving as the input for the following intermediate layers. In the output layer, the Tensor Core computation results are directly used for the final softmax logits computation. Thus, we do not apply quantization after the output layer.
5.2. Semantic-aware Kernel Fusion
Besides the APMM and APConv discussed previously, there are still multiple important layers in an APNN, including quantization, Batch Normalization (BN), pooling, and ReLU. Given the scalars x in a layer, quantization element-wisely converts int32 values x to q-bit values x_q as x_q = ⌊x / s⌋ + z. Here, z is a q-bit scalar zero-point, s is the scaling scalar, and ⌊·⌋ is the floor function. BN (ioffe2015batch) is another major component in NNs for tackling the covariate shift problem and facilitating NN training:
y = γ · (x - E[x]) / sqrt(Var[x] + ε) + β,    (5)
where E[x] and Var[x] are the expectation and variance across the batch, and γ and β are two learned parameters. Pooling splits the feature map spatially into grids and generates one scalar output for each grid by computing the average or the maximum value in the grid. ReLU takes individual input values x and generates output values max(0, x). While these operations have linear time complexity in the size of the feature maps and consume significantly less computation than the APConv and APMM kernels, they may still introduce heavy latency due to expensive memory access. Indeed, while Tensor Cores provide significantly improved computation capability, they share the same memory bandwidth with CUDA Cores on GPUs. Moreover, we observe that these values are usually computed element-wise and do not require heavy communication across GPU threads. We propose a semantic-aware kernel fusion to minimize memory access. We first fuse APMM/APConv with its following quantization, BN, pooling, and ReLU kernels into a single kernel to minimize global memory access. In particular, these following layers can be seamlessly applied once the convolution results become available in shared memory. This improves the compute intensity of individual convolution kernels while reducing the global memory access from invoking an additional batch normalization kernel. Second, considering that these following layers usually compute at the scalar level, we can further reduce shared memory access by directly reusing values in registers (li2016critical). For example, when an APMM layer is followed by a BN layer, a quantization layer, and a ReLU layer, we directly compute the output scalar as x_q = max(0, ⌊(γ·(x - E[x]) / sqrt(Var[x] + ε) + β) / s⌋ + z). Note that we only need to load the scalar once into a register, avoiding unnecessary shared memory access.
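A per-element sketch of such a fused epilogue is given below (parameter names are hypothetical; in the actual kernel this runs on values already resident in registers after APMM/APConv and before the packing step of Section 4.1(b)).

```cuda
// Fused BN + quantization + ReLU for one 32-bit APMM/APConv output held in a
// register. mean/var are the batch statistics, gamma/beta the learned BN
// parameters, and s/z the quantization scale and zero-point from Section 5.2.
__device__ int fused_bn_quant_relu(int acc,
                                   float mean, float var, float eps,
                                   float gamma, float beta,
                                   float s, int z) {
    float bn = gamma * ((float)acc - mean) * rsqrtf(var + eps) + beta;
    int   q  = (int)floorf(bn / s) + z;   // quantization: floor + zero-point
    return q > 0 ? q : 0;                 // ReLU on the quantized value
}
```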
6. Evaluation
In this section, we evaluate APNNTC under diverse precisions and show the benefits of arbitraryprecision computation in performance and accuracy.
We evaluate on both an NVIDIA RTX 3090 and an NVIDIA Tesla A100. The RTX 3090 GPU is in an Ubuntu 16.04 system with an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, 64 GB DDR3 DRAM, and gcc 7.5.0, using CUDA 11.1, CUTLASS 2.5, and cuBLAS 11.1. The A100 GPU is in a Linux 3.10.0 system with an AMD EPYC 7742 64-core CPU, 1 TB DDR4, and gcc 9.1.0, using CUDA 11.1, CUTLASS 2.5, and cuBLAS 11.3. All reported results are the average of 200 executions.
6.1. AP-Layer Evaluation
6.1.1. APMM Performance
We compare our APMM designs with NVIDIA implementations of low-bit GEMM (i.e., int1, int4, and int8) that are accelerated by Tensor Cores. For int8, we compare with the cuBLAS implementation, namely cublas-gemm-int8. Since int1 and int4 are not supported in cuBLAS, we compare with the CUTLASS implementations, namely cutlass-gemm-int1 and cutlass-gemm-int4. Following popular settings in NNs, we compute matrix multiplications whose sizes use a popular batch size and cover typical fully connected layer dimensions. According to the precision of our APMM kernel, we name it APMM-wxay, where x indicates the weight bits and y indicates the activation bits. For example, APMM-w1a2 indicates 1-bit weights and 2-bit activations. While our APMM is general enough to support arbitrary precision, we show 8 popular bit combinations due to page limits. If both the weight bits and the activation bits are no larger than 4 (e.g., w1a2, w1a3, w1a4, w2a2), we compare against cutlass-gemm-int4. If either the weight bits or the activation bits are larger than 4, we compare against cublas-gemm-int8. For each matrix size, we also show the speedup of cutlass-gemm-int1 against cutlass-gemm-int4 and cublas-gemm-int8 as the performance benefit of sticking to binary neural networks (li2019bstc; TPDS). Since the Tensor Core compute primitives support only 32-bit outputs, all GEMM kernels take low-bit inputs (e.g., int1, int4, and int8) and generate 32-bit outputs.
Figure 5 shows the results of APMM on the RTX 3090. We compare APMM with cutlass-gemm-int4 in Figure 5(a) and cublas-gemm-int8 in Figure 5(b). Overall, we have three major observations. First, APMM usually achieves significant speedup over the baselines. For example, APMM-w1a2 achieves large speedup over cutlass-gemm-int4, and APMM-w5a1 achieves large speedup over cublas-gemm-int8. This result demonstrates the performance benefit of emulating arbitrary precision with int1 compute primitives over sticking to int4 or int8 compute primitives. Second, APMMs with various weight and activation bits usually show similar performance on small matrices. For example, APMM-w1a2, APMM-w1a3, APMM-w1a4, and APMM-w2a2 achieve almost the same speedup when N=128 and N=256, even though these kernels have different computation overhead (e.g., p·q = 2 for APMM-w1a2 versus p·q = 4 for APMM-w2a2). This benefit comes from our batch-based double caching (Section 4.1(a)), where individual small BMMAs are batched into a large BMMA and computed simultaneously. Surprisingly, our arbitrary-precision computation can even outperform cutlass-gemm-int1 in such cases due to the improved GPU utilization. Third, we observe a smaller speedup over cublas-gemm-int8 on large matrix sizes, when the peak int1 performance is reached. Our investigation shows that, on the RTX 3090, cutlass-gemm-int1 is only moderately faster than cublas-gemm-int8, such that emulation becomes slower than the built-in int8 compute primitives on large matrices once the peak int1 performance is reached (e.g., for APMM-w2a8). We argue that NN workloads can still benefit significantly from our APMM since the fully connected layers in NNs usually have small matrix sizes (e.g., in ResNet-18). We also show the results of APMM on the A100 in Figure 6 with similar observations.
6.1.2. APConv Performance
We compare our APConv designs with NVIDIA implementations of low-bit convolution that are accelerated by Tensor Cores. Since cuBLAS does not support int1, int4, and int8 convolution, we use kernels from CUTLASS, namely cutlass-conv-int1, cutlass-conv-int4, and cutlass-conv-int8. Similar to APMM, we evaluate 8 types of precision with the name APConv-wxay. Since convolution kernels have many more hyperparameters than matrix-multiplication kernels, we show the performance under various input and output channels while fixing the input size (a medium feature size), the filter size and stride (the most frequently used settings), and the batch size (for inference). Figures 7 and 8 show the speedup of APConv on the RTX 3090 and A100, respectively. APConv achieves significant speedup over cutlass-conv-int4 and cutlass-conv-int8. This result shows the significant performance benefit of emulating arbitrary precision with int1 over utilizing int4 or int8. Similar to APMM, we also observe a smaller speedup over cutlass-conv-int8 on larger channels due to the limit of peak int1 performance. Since the RTX 3090 and A100 provide similar performance, we focus on the RTX 3090 in the following evaluations.
6.2. APNN Evaluation
In this section, we evaluate the overall APNN performance on three mainstream neural network models with the ImageNet dataset. The details of our evaluated NN models and their corresponding binarized, low-bit (1-bit weight with 2-bit activation), and single-precision accuracies are listed in Table 1.

Dataset  Network  Input Size  Output Size  Binary  w1a2  Single
ImageNet  AlexNet (krizhevsky2012imagenet)  224x224x3  1000  46.1%  55.7%  57.0% 
ImageNet  VGGVariant (cai2017deep)  224x224x3  1000  53.4%  68.8%  69.8% 
ImageNet  ResNet18 (he2016deep)  224x224x3  1000  51.2%  62.6%  69.6% 
Schemes  ImageNet-AlexNet (Batch-8 Latency, Throughput)  ImageNet-VGG_Variant (Batch-8 Latency, Throughput)  ImageNet-ResNet18 (Batch-8 Latency, Throughput)
CUTLASS-Single  4.43 ms  2.89 fps  25.24 ms  3.89 fps  60.96 ms  1.51 fps
CUTLASS-Half-TC  3.79 ms  3.38 fps  24.19 ms  4.67 fps  57.33 ms  1.89 fps
CUTLASS-INT8-TC  13.10 ms  9.77 fps  25.77 ms  6.52 fps  57.09 ms  2.85 fps
BNN  0.69 ms  1.37 fps  2.17 ms  3.91 fps  0.68 ms  1.89 fps
APNN-w1a2  0.36 ms  2.85 fps  1.66 ms  5.32 fps  0.64 ms  1.70 fps
We consider two types of configurations for evaluation. In the first setting, we focus on a specific low-bit configuration (1-bit weights and 2-bit activations, i.e., w1a2) across different neural network models. We choose several baselines, including neural networks built with the single-precision floating-point implementation from CUTLASS (cutlass) running on CUDA Cores, the half-precision implementation from CUTLASS running on Tensor Cores, the INT8 implementation from CUTLASS running on Tensor Cores, and the 1-bit binarized neural network running on Tensor Cores based on the state-of-the-art design from (TPDS). As shown in Table 2, our APNN design running on Tensor Cores achieves a significant speedup compared with the CUTLASS INT8, half, and single precision implementations. This indicates the practical usage of our APNN design in latency-sensitive applications. Meanwhile, with large batch sizes for throughput evaluation, our APNN design also demonstrates a throughput advantage over these standard-precision (e.g., int8 and half) baselines. Compared with the 1-bit binarized neural network running on Tensor Cores, our APNN design demonstrates a significant accuracy improvement (11.67% on average), as listed in Table 1. This demonstrates the value of our APNN design in application settings where the BNN accuracy fails to meet the demands. Overall, from this study, we can see that using our APNN design for arbitrary-bit precision computation is an effective way to balance NN model accuracy and runtime performance.
Scheme  Batch-8 Latency (ms)  Throughput (fps)
Float  25.24  3.89 
Half  24.19  4.66 
INT8  25.77  6.52 
BNN  2.17  3.91 
APNN-w1a2  1.66  5.32
APNN-w2a2  3.08  2.59
APNN-w2a8  14.14  5.65
In the second setting, we shift our focus to the tradeoff between model precision and runtime performance on the VGG network. We select several low-bit settings for comparison, including 1-bit weight with 2-bit activation, 2-bit weight with 2-bit activation, and 2-bit weight with 8-bit activation. As shown in Table 3, APNN-TC significantly reduces latency and improves throughput for w1a2 and w2a2 compared with INT8, which shows that APNN-TC can bring benefits for many arbitrary-precision computations. Compared with INT8, APNN-TC with w2a8 shows lower throughput since we need to compute 16 (=2x8) 1-bit matrix multiplications to emulate the arbitrary-precision computation, which requires more computation than w1a2 with 2 1-bit matrices and w2a2 with 4 1-bit matrices. This also matches the performance of the individual kernels (e.g., Figures 5, 6, 7, and 8). This result indicates that APNN-TC can still bring benefits for latency-sensitive applications.
6.3. Additional Studies
We perform several additional studies in this subsection, including the latency breakdown across individual NN layers and the benefit from kernel fusion. We show results from the RTX 3090 and skip results from the A100 since we observe similar trends on these two GPUs.
Latency Breakdown. Figure 9 illustrates the percentage breakdown of the latency for the inference of 8 images over three NNs on the RTX 3090 GPU. Clearly, the first layer introduces the most delay since its input feature size is significantly larger than that of other layers. This percentage is especially high for AlexNet and VGG_Variant. On the other layers, we observe a roughly balanced latency.
Benefits from Kernel Fusion. Figure 10 investigates the performance benefits from fusing APConv-w1a2, pooling, and quantization into one kernel. Specifically, in the "w/o Fusion" implementation, we implement three global functions for APConv-w1a2 with 32-bit output, pooling, and quantizing into low-bit outputs, respectively. Here, each function reads and writes data in global memory. In the "w/ Fusion" implementation, we conduct the same workload in a single kernel. Overall, we observe a significant latency reduction on average. The main reason is that, with fusion, data across APConv, pooling, and quantization can be cached in shared memory, and global memory access is significantly reduced.
Overhead from bit combination and bit decomposition. We show the overhead from bit combination and bit decomposition in Figure 11. We profile the overhead on APConv designs following the same setting as Section 6.1.2. We show results from APConv-w1a2 since we observe similar overhead across bit settings. On average, we empirically observe 1.16% overhead from bit combination and another 2.02% overhead from bit decomposition, compared to the Tensor Core computation alone. The main reason is that bit combination and bit decomposition introduce only quadratic time complexity, which is significantly smaller than the cubic time complexity of the Tensor Core computation. Due to this difference in time complexity, the overhead from bit combination decreases as the channel size increases. We also observe a similar trend for bit decomposition.
Comparing APMM and cutlass GEMM under the same bits. Figure 12 shows the performance comparison between APMM and cutlass-gemm when using the same bits. Overall, we observe that APMM-w4a4 achieves speedup over cutlass-gemm-int4. The main reason is that APMM-w4a4 achieves better parallelism by using int1 computation to emulate int4 computation, leading to better GPU utilization, especially for small matrix sizes. We note that this speedup of APMM-w4a4 over cutlass-gemm-int4 decreases as the matrix size increases, where more int1 computation resources are required for the emulation. We also observe that APMM-w1a1 achieves speedup over cutlass-gemm-int1. This shows the benefit of our kernel-level optimizations.
Kernel  w1a2  w1a3  w1a4  w2a2  cutlass-gemm-int4  cutlass-gemm-int1
Latency (µs)  6.67  6.81  7.06  7.15  15.61  7.92
Raw latency of a typical fully-connected layer. Table 4 shows the raw latency of a typical fully-connected layer. Overall, we observe that such a layer requires only around 7 microseconds. Compared with cutlass-gemm-int4, we achieve more than 2x speedup on average by using arbitrary-precision computation. We also note that the arbitrary-precision computation is even slightly faster than cutlass-gemm-int1, which matches the result in Section 6.1.1.
7. Discussion
Practical usage of APNN. Arbitrary-precision neural networks have been widely studied to provide diverse tradeoffs between precision and efficiency (HanMD15; TPDS; BinaryConnect; DoReFa; zhang2018lq; HAQ; OLCEL; li2019bstc). While arbitrary precision may slightly reduce accuracy, it shows merit in many practical usages such as smart sensors (sensor; KungZWCM18; McDanelTK17), mask detection (BinaryCoP), and intelligent agriculture (GarofaloT0RB20). In these usages, once a certain accuracy bar is surpassed, other essential metrics such as real-time processing and resource consumption become more important. For example, BinaryCoP (BinaryCoP) utilizes low-power binary neural networks to detect facial-mask wear at entrances to corporate buildings and airports. Another example is XpulpNN (GarofaloT0RB20), which uses quantized neural networks on energy-efficient IoT devices.
Generality to other NNs. This paper reports the results of APNN-TC on the two most time-consuming kernels, GEMM and convolution, from the computer vision domain and showcases the performance on popular vision models (e.g., AlexNet, VGG, and ResNet). Yet, we expect that APNN-TC applies to NNs from various domains such as natural language processing (NLP). Intuitively, APNN-TC accelerates GEMM and dot products, which are the building blocks of many NLP NNs (Transformer; ZhangFB20; DevlinCLT19), such as the attention layer and the feed-forward layer.
Generality to other processors. APNN-TC utilizes population count (i.e., popc()) and two logical operations (i.e., XOR and AND) to support arbitrary-precision computation on NVIDIA GPUs. Considering the wide support for popc() and logical operations, APNN-TC can be easily adapted to diverse processors. For example, AMD GPUs (AMD) support population count (i.e., popcnt() on AMD GPUs) and logical operations (e.g., bitwise XOR). Xeon Phi (Intel) also supports population count and logical operations.
8. Conclusion
In this paper, we design and implement APNN-TC to accelerate arbitrary-precision neural networks on Ampere GPU Tensor Cores. Specifically, APNN-TC contains an int1-based emulation design on Tensor Cores to enable arbitrary-precision computation, an efficient AP-Layer design for mapping NN layers onto Tensor Cores, and an APNN design to minimize the memory access across NN layers. Extensive evaluations on two Ampere GPUs show that APNN-TC can achieve significant speedup over CUTLASS kernels and on various mainstream NN models, such as ResNet and VGG.
9. Acknowledgements
We thank all anonymous reviewers for their valuable comments. This work was supported in part by NSF 1925717 and 2124039. This work was supported in part by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, under award 66150: "CENATE - Center for Advanced Architecture Evaluation", and PNNL's Data-Model-Convergence (DMC) LDRD Initiative Computation-Flow-Architecture (CFA) project. The Pacific Northwest National Laboratory is operated by Battelle for the U.S. Department of Energy under contract DE-AC05-76RL01830.