I. Introduction
As the number of parameters in DNNs increases to improve model accuracy on various tasks, reducing inference latency is becoming more challenging. Reducing response time becomes highly critical when real-time services are demanded (e.g., autonomous driving, automatic speech recognition, and neural machine translation). Note that most of the response time is usually consumed by general matrix-to-matrix multiplication (GEMM) or general matrix-to-vector multiplication (GEMV) of high-order time complexity (see Fig.
1). Efficient computation of matrix multiplication, therefore, directly corresponds to response time reduction. Previously, in order to accelerate GEMM operations, both hardware- and software-based approaches have been introduced [li2019edge, reuther2019survey, choudhary2020comprehensive, cheng2017survey, zhou2019edge, he2018survey]. As an effort to reduce latency, few-batch multiplications^1 are strongly preferred for DNN inference at the cost of reduced weight reuse. (^1 In this paper, we refer to either GEMV or GEMM as few-batch multiplication for convenience.) Note that if GEMV is conducted to support single-batch inference, weight matrix data is accessed only once. Such a streaming-like operation is highly memory-bound on modern computing systems based on the von Neumann architecture, where main memory (DRAM) is separated from the processing unit [von1993first]. Moreover, as the weight matrix becomes larger, the portion of memory-bound operations grows as well. Executing workloads with little data reuse on such systems, therefore, prevents efficient utilization of computing resources because of the memory wall (also known as the von Neumann bottleneck) [hennessy2011computer]. To alleviate the memory-access bottleneck from a hardware perspective, in-memory computing (in which computational operations are performed within the memory unit) has been widely studied [eleftheriou2018memory]. In other words, for DNNs, combating the memory bottleneck is pressing enough to call for a new hardware architecture design paradigm.
As a practical solution at the algorithm level, model compression is an effective technique to achieve lower end-to-end latency. Model compression reduces not only off-chip memory accesses on mobile devices (and hence power consumption) but also main-memory bandwidth requirements, by shrinking the memory footprint with negligible accuracy drop [cheng2017survey, choudhary2020comprehensive]. Thus, model compression is being widely studied to accelerate inference computations. Popular model compression techniques include pruning [han2015deep, liu2018rethinking, lee2019network], low-rank approximation [sainath2013low, li2016recovery], and quantization [xu2018alternating, bhandare2019efficient, zhang2018lq].
In this work, we consider quantization because of its simple structure and high compression ratio [xu2018alternating, bhandare2019efficient, zhang2018lq]. The rationale behind quantization for DNNs is that, because DNNs include a lot of redundancy, the number of bits used to represent each parameter can be reduced without noticeable loss of model accuracy. Note that weights and activations need to be quantized at different times. Weights are fixed during inference, and hence, weight quantization is performed in advance, before inference. On the other hand, activation quantization must be conducted on-the-fly, with additional computations (for quantization) during inference. If the quantization algorithm is complicated, the cost of such dynamic quantization may outweigh the gains from quantization. In addition, activation compression may result in serious accuracy degradation if training is not aware of the quantized structure of activations. In this manuscript, thus, we study weight-only quantization, which is enough to accelerate matrix multiplication as we demonstrate later.
In this paper, we propose a novel matrix multiplication algorithm dedicated to quantized DNNs that can be performed on modern computer (von Neumann) architectures. Even though quantization obviously reduces storage requirements on off-chip memories, achieving higher performance with quantized DNNs on CPUs or GPUs is challenging. First, because data transfer on commercial processors is performed with a fixed width (such as 32 bits) while weights can be quantized with an arbitrary number of bits, accessing multiple quantized weights may waste bandwidth. Second, decoding quantized weights may induce additional instructions. Our proposed method, called BiQGEMM^2, addresses such concerns using lookup tables that accept quantized weights as indices. (^2 Non-GEneral Matrix Multiplication for Binary-coding-based Quantized neural networks.) BiQGEMM is built on the observation that quantization leads to a lot of redundant computations. The key idea is that for any real-number sub-vector (of activations) of length $\mu$, the number of possible outcomes of its dot product with a binary vector (of quantized weights) is limited to $2^{\mu}$, all of which can be precomputed and stored in lookup tables that can be reused. We show that by replacing a majority of arithmetic operations with table lookups, BiQGEMM calculates matrix multiplications with high performance and improved bandwidth utilization.
II. Background
II-A. Quantization
DNNs intentionally involve a lot of redundancy to expedite the search for a good local minimum. Thus, the model size of DNNs can often be significantly reduced by various compression algorithms. Quantization is gaining increasing popularity as an effective model compression technique. There exist various quantization formats: dequantized weights can be represented either by fixed-point numbers (based on uniform quantization) or by floating-point numbers (based on codebook lookups or binary-coding quantization).
Note that codebook-based quantization presents high compression ratios for various models with negligible model accuracy degradation because expected values after quantization are still maintained as floating-point numbers [facebook_lut_quant]. Even though codebook-based quantization is highly efficient at reducing the off-chip memory footprint, computational complexity is not reduced at all after dequantization. Fixed-point quantization, on the other hand, reduces both the storage requirement and the computational complexity. Since INT8 quantization, combined with additional techniques, has been shown to maintain the model accuracy of some well-known CNN models, INT8 has been adopted by various commercial tools [bhandare2019efficient]. Note that operations other than GEMV or GEMM need to be redesigned to function in INT8, and INT8-aware retraining may be necessary to avoid serious model accuracy degradation. For example, layer normalization and softmax operations in the attention blocks of Transformers demand floating-point computations [bhandare2019efficient]. Accordingly, frequent conversions between fixed-point and floating-point formats would incur 15%-30% computational overhead [bhandare2019efficient].
As an effort to significantly reduce both computations and footprint, binary-coding-based quantization has been proposed [xu2018alternating, rastegari2016xnor, zhang2018lq]. Since expected values after binary-coding quantization remain floating-point numbers, accuracy degradation is negligible even when only about 3 bits are used for quantization [xu2018alternating, zhang2018lq]. Despite its potential to greatly simplify computations, binary-coding-based quantization has not been useful in practice because it requires the computing system to allow bit-level memory access. In this manuscript, hence, we consider binary-coding-based quantization as a baseline to enable practical usage on commercialized computing systems.
Note that in the case of INT8, activations must also be quantized in order to allow fixed-point GEMV or GEMM, while activation quantization is optional for floating-point-based quantization. Activation quantization inherently demands a dynamic quantization process during inference. Even though inference operations can be made a lot more efficient by previously proposed methods such as the Method of Four Russians [aho1974design] or popcount with XNOR logic [rastegari2016xnor], activation quantization requires 1) modifications to the training algorithm to severely restrict the range of activation values and 2) computational overhead for format conversions [rastegari2016xnor, xu2018alternating]. In this paper, we show that BiQGEMM presents high performance even when activations are maintained as floating-point numbers.
II-B. Binary-coding-based Quantization
When a real-number vector $w \in \mathbb{R}^{n}$ is quantized into $q$ bits by a binary-coding-based quantization method, $w$ is mapped into $q$ scaling factors $\alpha_i \in \mathbb{R}$ and $q$ binary vectors $b_i \in \{-1,+1\}^{n}$ ($1 \le i \le q$). Then, $w$ is approximated as $\sum_{i=1}^{q} \alpha_i b_i$, where scaling factors are shared by multiple elements in $w$. Scaling factors and binary vectors are obtained as follows:

$$\underset{\alpha_i,\, b_i}{\arg\min} \left\| w - \sum_{i=1}^{q} \alpha_i b_i \right\|^2 \qquad (1)$$
such that the quantization error is minimized. Since there is no analytical solution minimizing this quantization error, numerous heuristic approaches have been proposed [guo2017network, xu2018alternating, zhang2018lq, courbariaux2016binarized, amc]. The same principle of binary-coding-based vector quantization can be applied to matrices, where quantization can be independently performed for each row or column. For a weight matrix $W \in \mathbb{R}^{m \times n}$ quantized into $q$ binary matrices $B_i \in \{-1,+1\}^{m \times n}$ with scaling-factor vectors $\alpha_i \in \mathbb{R}^{m}$, multiplication with a real-number vector $x \in \mathbb{R}^{n}$ produces an output vector $y$ as follows:
$$y = \sum_{i=1}^{q} \alpha_i \circ (B_i \cdot x) \qquad (2)$$
where $\circ$ denotes element-wise multiplication (i.e., Hadamard product) and $q$ is the number of quantization bits for weights. Fig. 2 illustrates how multi-bit quantized weight matrices are multiplied by a real-number vector. Note that, for convenience, the binary weight matrices $B_i$ can be concatenated in the vertical direction and multiplied by the vector $x$. Then, element-wise multiplication by the scaling factors $\alpha_i$ produces intermediate partial outputs $y_i$. Finally, the sum of the vectors $y_i$ yields the final output $y$.
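The scheme above can be sketched in a few lines of illustrative Python. The greedy row-wise quantizer below is one of the heuristic approaches mentioned earlier (sign of the residual as $B_i$, mean absolute residual as $\alpha_i$); the function name and shapes are ours, not taken from the paper:

```python
import numpy as np

def greedy_quantize_rows(W, q):
    # Greedy binary-coding quantization (illustrative): approximate W as
    # sum_i diag(alpha_i) * B_i, fitting one (alpha_i, B_i) pair at a time
    # to the residual. alpha_i holds one scaling factor per row of W.
    residual = W.astype(np.float64).copy()
    alphas, Bs = [], []
    for _ in range(q):
        B = np.where(residual >= 0, 1.0, -1.0)        # sign of the residual
        alpha = np.mean(np.abs(residual), axis=1)     # per-row scale
        alphas.append(alpha)
        Bs.append(B)
        residual -= alpha[:, None] * B                # quantize what is left
    return alphas, Bs

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
x = rng.standard_normal(8)
alphas, Bs = greedy_quantize_rows(W, q=3)

# Eq. (2): y = sum_i alpha_i o (B_i . x)
y = sum(a * (B @ x) for a, B in zip(alphas, Bs))

# Dense reference using the dequantized weight matrix.
W_hat = sum(a[:, None] * B for a, B in zip(alphas, Bs))
```

The multiplication by the quantized representation agrees exactly with a dense multiplication by the dequantized matrix, which is the structure Eq. 2 exploits.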
Suppose that the real-number activation vector $x$ is also quantized, by using $q_a$ bits, into binary vectors $a_j \in \{-1,+1\}^{n}$ with scaling factors $\beta_j$ ($1 \le j \le q_a$). The previous output $y$ can now be computed as follows:
$$y = \sum_{i=1}^{q} \sum_{j=1}^{q_a} \beta_j \left( \alpha_i \circ (B_i \cdot a_j) \right) \qquad (3)$$
Eq. 3 suggests that activation quantization would increase the number of computations compared to Eq. 2, even though most of those computations are simple bit-wise logic operations. It should be noted that, without sophisticated hardware support for the bit-wise logic incurred by binary-coding quantization, activation quantization may degrade matrix multiplication performance.
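As a quick consistency check of Eq. 3 (an illustrative sketch; the variable names are ours), the doubly-indexed sum over weight and activation bit-planes reproduces the product of the dequantized weight matrix and the dequantized activation vector, at the cost of $q \cdot q_a$ binary partial products instead of the $q$ of Eq. 2:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, q, qa = 4, 8, 2, 3
Bw = rng.choice([-1.0, 1.0], size=(q, m, n))   # weight bit-planes B_i
alphas = rng.random((q, m))                    # weight scales (per row)
Ba = rng.choice([-1.0, 1.0], size=(qa, n))     # activation bit-planes a_j
betas = rng.random(qa)                         # activation scales

# Eq. (3): q * qa binary partial products.
y = sum(alphas[i] * betas[j] * (Bw[i] @ Ba[j])
        for i in range(q) for j in range(qa))

# Dequantized references: W = sum_i diag(alpha_i) B_i, x = sum_j beta_j a_j.
W = sum(alphas[i][:, None] * Bw[i] for i in range(q))
x = sum(betas[j] * Ba[j] for j in range(qa))
```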
II-C. Natural Language Processing
In order to set a range of parameters (such as matrix sizes) to be used in our experiments and to assess the impact of the proposed algorithm, we investigate natural language processing (NLP) as a representative application of BiQGEMM.
RNNs [rumelhart1986learning] and Transformers [vaswani2017attention]
are being widely accepted as timeseries data analysis tools to process natural language tasks. Long shortterm memory (LSTM)
[hochreiter1997long], compared with conventional RNNs, introduces additional gates in a unit to overcome the long-term dependency and vanishing-gradient problems of vanilla RNNs. As such, most recently proposed RNN-based networks employ LSTM as a basic unit to improve model accuracy on multiple benchmark language models. Transformers presented another noticeable advance in NLP. By breaking the recurrent structure and fully exploiting the attention mechanism [bahdanau2014neural], Transformers better capture the relevance between words in a sentence. Correspondingly, Transformers are primarily used in a wide range of NLP tasks including neural machine translation (NMT) and have been extended to various applications, including BERT [devlin2018bert], with impressive results on GLUE [wang2018glue] and SQuAD [rajpurkar2016squad]. The structure of Transformers can be divided into encoder layers and decoder layers. An encoder layer includes one attention block structured as four ($d \times d$) weight matrices and a feed-forward block with ($d \times 4d$) and ($4d \times d$) matrices, where $d$ is the hidden size. A decoder layer presents two attention blocks and a feed-forward block, while the structure of each block is the same as in the encoder. The number of encoder layers is chosen to be 6 (6) and $d$ is selected to be 512 (1024) for the base (big) model. Weight matrices are fed into matrix multiplication operations, and weight matrix sizes are rapidly increasing to support various complicated tasks with higher model accuracy goals. For example, for NMT, most models that show excellent performance are based on the big version of the Transformer [ahmed2017weighted, shaw2018self, ott2018scaling, edunov2018understanding]. T5, another variant of the Transformer, increases the number of weights to 11 billion and the number of layers to 24 [Raffel2019ExploringTL].
BERT [devlin2018bert] is a pre-training-based model for applications that require only the encoder part of Transformers. BERT models are known to continuously set new records on model accuracy with a large number of encoder layers and a large hidden size (such as 24 and 1024, respectively). Associated with new training algorithms based on the large model of BERT, various advanced models, such as XLNet [yang2019xlnet], RoBERTa [liu2019roberta], and ERNIE [zhang2019ernie, sun2019ernie], are being developed. Ever-increasing requests for higher accuracy demand ever-larger weight matrices. For instance, the biggest weight matrix in the xxlarge model of ALBERT [lan2019albert] is (4096 $\times$ 16384), which requires 256 MB (with FP32) of memory footprint. Such large weight matrices cannot avoid frequent DRAM accesses even if the same parameters are repeatedly reused over the whole network.
As for automatic speech recognition (ASR), similarly, the number of parameters is also increasing to accomplish higher model accuracy [karita2019comparative, karita2019improving, irie2019choice, luscher2019rwth, park2019specaugment, han2019state]. To illustrate, LAS is an end-to-end ASR model based on bidirectional LSTMs, using six encoder layers and two decoder layers whose weight matrix dimensions reach a few thousand [park2019specaugment].
In sum, fast matrix multiplication with matrix dimensions of (at least) a few thousand is essential to realize DNNs for NLP tasks. Such high-performance matrix multiplication needs to assume compressed DNNs because the number of parameters keeps increasing.
II-D. Quantizing Transformers
Now we estimate the number of quantization bits using Transformers that are being widely applied to various NLP tasks. Table
I lists quantization results of the (base model) Transformer (designed to perform English-to-German translation) using uniform quantization [bhandare2019efficient, prato2019fully] and binary-coding-based quantization with greedy approximation [guo2017network]. For the uniform quantization results, we refer to the numbers from [bhandare2019efficient, prato2019fully]. For binary-coding-based quantization with the greedy approximation method (to reduce quantization error), we retrain the model using the quantization-aware training algorithm introduced in [deeptwist] on the WMT13 data set. When retraining the model, all hyperparameters are the same as in the Transformer [vaswani2017attention], except for a 2x larger initial learning rate and the additional hyperparameter of distortion step (introduced in [deeptwist]), which is set to 2000. The baseline results for each quantization case are inherently different due to different initialization conditions and test sets; the number of quantization bits and the model accuracy (given as BLEU scores) are described in Table I. Note that uniform quantization requires activations to be quantized (on-the-fly) as well to enable fixed-point matrix multiplications, while such activation quantization is optional for binary-coding-based quantization. Besides, uniform quantization demands frequent conversions between floating-point and fixed-point formats (with conversion overhead) to maintain high-precision activation functions, while such conversions are not necessary for binary-coding-based quantization. Table II shows the memory usage when weights and activations are quantized into different numbers of bits while the matrix size is fixed to 512-by-512 (the size of an attention layer of the base Transformer). The number of subwords in the test data set is 18 on average, and thus, the batch size is 18.
Note that because of the relatively small dimension of activations, activation quantization does not reduce the memory footprint as much as weight quantization, while more bits need to be assigned to weight quantization for a given target model accuracy, as shown in Table I. This observation is consistent across other matrix sizes. Combining Table I and Table II, for BiQGEMM design considerations, we quantize only weights, and we are mainly interested in a few bits for quantization (e.g., 1 to 3 quantization bits).
Models                        Data format (bits)    English-to-German
                              (weight / activation) BLEU
Ref [bhandare2019efficient]
  Baseline                    32 / 32               27.68
  Uniform                      8 / 8                27.30 (-0.22)
Ref [prato2019fully]
  Baseline                    32 / 32               26.46
  Uniform                      8 / 8                26.38 (-0.80)
                               6 / 6                26.98 (+0.52)
                               4 / 4                18.32 (-8.14)
Ref [deeptwist]
  Baseline                    32 / 32               25.8
  Binary-Coding (Greedy)       4 / 32               25.5 (-0.3)
                               3 / 32               25.3 (-0.5)
                               2 / 32               23.9 (-1.9)
                               1 / 32                0.4 (-25.4)
Data format (bits)       Memory (MB)
W    A    O              W      I      O      Total
32   32   32             1.049  0.037  0.037  1.122
 8    8   32             0.262  0.009  0.037  0.308
 6    6   32             0.197  0.007  0.037  0.240
 4    4   32             0.131  0.005  0.037  0.173
 4   32   32             0.131  0.037  0.037  0.205
 3   32   32             0.098  0.037  0.037  0.172
 2   32   32             0.066  0.037  0.037  0.139

W: weights, A: activations, O: outputs, I: inputs.
III. Methodology
III-A. Motivation and Definitions
Definition 1.
LUT-unit $\mu$ is the length of a sub-vector to be used as an input argument of a table lookup function.
Definition 2.
Given an $n$-by-$b$ matrix denoted by $X$, $\widetilde{X}$ is a $\mu$-by-$(n \cdot b / \mu)$ matrix reshaped from $X$ while maintaining column-wise traversal.
Definition 3.
Given an $m$-by-$n$ matrix denoted by $A$, $A^{(i:j)}$ is a sub-matrix of $A$ formed by the $i$-th to $j$-th columns, when $1 \le i \le j \le n$.
Definition 4.
Given a column vector $v$ of length $n$, $v^{(i:j)}$ is a sub-vector of $v$ comprised of the $i$-th to $j$-th rows, when $1 \le i \le j \le n$.
Definition 5.
$Z \in \{-1,+1\}^{2^{\mu} \times \mu}$ denotes a matrix constructed by concatenating all possible (non-redundant) binary row vectors of length $\mu$.
We assume that a binary weight matrix $B \in \{-1,+1\}^{m \times n}$ and an input matrix $X \in \mathbb{R}^{n \times b}$ are given, where $m$, $n$, and $b$ are the output size, the input size, and the batch size, respectively. Fig. 3 shows an example of a quantized (binary) weight matrix and an input matrix. In Fig. 3, each matrix is equally divided into three parts along the LUT-unit $\mu = 4$. Considering a shaded part in Fig. 3, a row vector (having 4 binary digits) in $B$ is one of $2^{4}$ possible combinations. Correspondingly, each row of the product of such a sub-matrix of $B$ and the corresponding sub-matrix of $X$ is also limited to be one of $2^{4}$ possible vectors. As an attempt to exploit this strictly limited space of available outputs, the product of $Z$ and the reshaped input matrix $\widetilde{X}$ is precomputed and stored in lookup tables. Then, precomputed values are retrieved from the lookup tables using $\mu$-bit binary digits in the weight matrix as a key (or an index). Note that when $2^{\mu} \ll m$, computing efficiency is enhanced because most arithmetic operations can be replaced by retrieval operations if the output size is large enough.
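A minimal sketch of this observation (in Python, with hypothetical names): all $2^{\mu}$ dot products between a LUT-unit input slice and the rows of $Z$ are precomputed once, and any length-$\mu$ binary weight slice then indexes the table with its bit-packed key:

```python
import numpy as np

mu = 4
# Z: all 2^mu binary (+/-1) row vectors of length mu; row k encodes the
# mu-bit binary expansion of k (1 -> +1, 0 -> -1), so k doubles as the key.
codes = np.arange(2 ** mu)
Z = np.where((codes[:, None] >> np.arange(mu - 1, -1, -1)) & 1, 1.0, -1.0)

x_sub = np.array([0.5, -1.0, 2.0, 0.25])   # one LUT-unit slice of an input
lut = Z @ x_sub                            # all 2^mu partial dot products

# Any +/-1 weight slice's dot product with x_sub is already in the table:
w_sub = np.array([1.0, -1.0, -1.0, 1.0])
key = int("".join("1" if v > 0 else "0" for v in w_sub), 2)   # 0b1001 = 9
```

Looking up `lut[key]` returns the same value as computing `w_sub @ x_sub` directly, with no multiply-add at query time.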
III-B. Algorithm Description
When the LUT-unit $\mu$ is given as a parameter, each column in the product of $Z$ and $\widetilde{X}$ becomes the entries of a separate lookup table. Fig. 4 shows an exemplary process of building lookup tables, where we define a sub-vector $x_{t,s}$ of length $\mu$ and a lookup table $L_{t,s}$ corresponding to $x_{t,s}$ as follows:

$$x_{t,s} = x_t^{((s-1)\mu + 1 \,:\, s\mu)} \qquad (4)$$

$$L_{t,s} = Z \cdot x_{t,s} \qquad (5)$$

when $1 \le t \le b$ and $1 \le s \le n/\mu$, where $b$, $n$, and $\mu$ are the batch size, the input size, and the LUT-unit, respectively, and $x_t$ is the $t$-th column of $X$. Then, the product of a sub-matrix $B^{((s-1)\mu+1 : s\mu)}$ and a sub-vector $x_{t,s}$ can be found in the lookup table $L_{t,s}$, instead of performing GEMV. In other words, partial products of GEMM are replaced with table lookups in BiQGEMM.
As shown in Fig. 4(b), building a lookup table can be optimized by using a dynamic programming technique. Specifically, while constructing a lookup table $L_{t,s}$, dynamic programming reduces redundant arithmetic operations (described as the right-sided equations in Fig. 4(b)) compared to the case when GEMV using $Z$ and $x_{t,s}$ is performed (described as the left-sided equations in Fig. 4(b)). Algorithm 1 presents the pseudo code of building a lookup table with dynamic programming. In Fig. 4(b), each equation is annotated with the corresponding line numbers in Algorithm 1. Note that every sub-vector per input induces a distinct lookup table of $2^{\mu}$ entries, and hence, the time complexity of the proposed dynamic programming scheme is calculated as follows:

$$T_{\mathrm{build}} = O\!\left(2^{\mu} \cdot \frac{n}{\mu} \cdot b\right) \qquad (6)$$
$T_{\mathrm{build}}$ obtained by our proposed technique is $\mu$ times less than $O(2^{\mu} \cdot n \cdot b)$, the time complexity of the GEMM-based lookup table construction method (see Fig. 4(a)). Suppose our dynamic programming scheme is combined with multi-threading; then each thread is responsible for constructing one or more lookup tables (i.e., one lookup table is not built cooperatively by two or more threads). Another level of parallelism is achieved by vectorizing independent equations in Fig. 4(b) to utilize SIMD instructions. Note that because of the dependencies among equations in dynamic programming, however, conventional GEMV or GEMM schemes might be a better choice to fill up lookup table entries if a computing system (e.g., a GPU) embeds a sizeable number of simple computation units. In other words, the appropriate scheme to implement lookup tables depends on the characteristics of the processor running BiQGEMM.
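The dynamic-programming construction can be sketched as follows (this is our illustrative reading of the idea behind Algorithm 1, not a transcription; the indexing convention is ours): entry 0 is the all-minus-one dot product, and every other entry is derived from an already-computed entry with a single addition, since flipping one sign from $-1$ to $+1$ adds $2x_j$:

```python
def build_lut_dp(x_sub):
    # Build the 2^mu-entry table with one addition per new entry.
    # Entry k holds z_k . x_sub, where z_k is the +/-1 vector whose sign
    # pattern is the binary expansion of k (MSB first, 1 -> +1, 0 -> -1).
    mu = len(x_sub)
    lut = [0.0] * (1 << mu)
    lut[0] = -sum(x_sub)                 # all bits 0, i.e., all signs -1
    for j in range(mu):                  # bit j controls x_sub[mu - 1 - j]
        step = 2.0 * x_sub[mu - 1 - j]   # flipping -1 -> +1 adds 2 * x
        for k in range(1 << j):
            lut[k | (1 << j)] = lut[k] + step
    return lut
```

Beyond the initial sum, only $2^{\mu} - 1$ additions are needed, versus roughly $\mu \cdot 2^{\mu}$ multiply-adds for the GEMM-based construction of Fig. 4(a).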
Fig. 5 shows an illustrated process of querying partial results from lookup tables. Consecutive LUT-unit ($\mu$) binary digits in a quantized weight matrix are grouped into a sub-vector that can be converted into an integer in $[0, 2^{\mu})$ (e.g., $(-1,+1,+1,-1)$ is converted into the integer 6 when $\mu$ is 4). In other words, the key matrix in Fig. 5 is a bit-packed version of $B$, and each element in the key matrix serves as an index for table lookups^3. (^3 Note that the key matrix instead of $B$ can be loaded in advance into the system, since the weight matrices are fixed during inference.) Partial results retrieved from lookup table entries are accumulated over the $n/\mu$ sub-vectors per input vector, and then BiQGEMM is completed.
All keys in the key matrix are used $b$ (= the number of input vectors) times. For example, the leftmost lookup tables corresponding to every input (i.e., $L_{1,1}, \ldots, L_{b,1}$ in Fig. 5) are commonly accessed by the first column of the key matrix. In other words, different lookup tables are accessed through a shared key. By contiguously placing lookup table entries associated with the same key, as shown in Fig. 6, SIMD operations on CPU are encouraged, and bank conflicts on GPU can be mitigated. As for GPU, scratchpad memory or software-controlled caches (i.e., shared memory in CUDA) can store lookup tables. Then, even if the memory accesses are irregular, multiple threads can fetch multiple data in parallel unless multiple addresses within the same memory bank are accessed. Thus, a penalty of irregular accesses to a lookup table on GPU is not as critical as that on CPU.
A tiling approach for BiQGEMM is illustrated in Fig. 7 and described in Algorithm 2. To build lookup tables (LUTs) without redundancy, BiQGEMM adopts an LUT-stationary tiling scheme. The two input arguments of BiQGEMM are given as a key matrix and an input tensor reshaped from an input matrix (where the reshaped form is determined by the user-defined parameter $\mu$). For tiling, tile width and height need to be specified. Each tile of LUTs is operated by one thread (which is assigned to one SM in the case of GPU), and hence, multiple accesses to lookup tables in a block can be SIMDified (by multiple CUDA cores in the case of GPU). Lookup tables are not constructed in advance (i.e., not taken from DRAM); instead, they are implemented on-the-fly by Algorithm 1 with sub-vectors as inputs (Line 3 in Algorithm 2). After implementing the lookup tables of a tile, the corresponding pairs of tiles of the key matrix are loaded successively and individually, and computed by using the lookup tables (Lines 4-6 in Algorithm 2). With LUT-stationary tiling, partial output results obtained by multiple threads are processed through either sum reduction or atomic additions to obtain the final output values. Since each of the $m \cdot (n/\mu)$ keys is utilized $b$ times, the worst-case time complexity required for retrieval is given as follows:

$$T_{\mathrm{query}} = O\!\left(m \cdot \frac{n}{\mu} \cdot b\right) \qquad (7)$$
To process multi-bit quantization of weight matrices, BiQGEMM assumes that multiple binary weight matrices are concatenated, as described in Fig. 2. Note that such concatenation does not increase the number of lookup tables; thus for BiQGEMM, only the amount of lookup table retrieving operations increases as the number of quantization bits increases. In short, for multi-bit quantized weight matrices, $T_{\mathrm{query}}$ becomes $O(q \cdot m \cdot \frac{n}{\mu} \cdot b)$, where $q$ is the number of quantization bits.
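Putting the build and query phases together, a tiling-free reference version can be sketched as follows (illustrative Python with assumed names; the actual implementation is in C++/CUDA with the tiling of Algorithm 2 and multi-threading):

```python
import numpy as np

def biqgemm_ref(B, X, mu=4):
    # Tiling-free BiQGEMM reference for single-bit weights.
    # B: (m, n) matrix over {-1, +1}; X: (n, b) real inputs.
    m, n = B.shape
    b = X.shape[1]
    assert n % mu == 0
    n_sub = n // mu
    # Offline: bit-pack every length-mu weight slice into an integer key.
    bits01 = (B > 0).astype(np.int64).reshape(m, n_sub, mu)
    keys = bits01 @ (1 << np.arange(mu - 1, -1, -1))        # (m, n_sub)
    # Build: one 2^mu-entry table per (slice, batch column), L = Z . x_sub.
    codes = np.arange(1 << mu)
    Z = np.where((codes[:, None] >> np.arange(mu - 1, -1, -1)) & 1, 1.0, -1.0)
    X_sub = X.reshape(n_sub, mu, b)
    luts = np.einsum('km,smb->skb', Z, X_sub)               # (n_sub, 2^mu, b)
    # Query: accumulate one table lookup per slice per output row.
    Y = np.zeros((m, b))
    for s in range(n_sub):
        Y += luts[s][keys[:, s]]                            # fancy-index lookup
    return Y
```

On random data this reproduces `B @ X` exactly; every length-$\mu$ partial dot product has been replaced by one table lookup.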
III-C. Complexity Analysis
Given an input matrix $X \in \mathbb{R}^{n \times b}$ and a quantized binary weight matrix $B \in \{-1,+1\}^{m \times n}$, a matrix multiplication performed by GEMM has $O(m \cdot n \cdot b)$ time complexity, where $m$, $n$, and $b$ are the output size, the input size, and the batch size, respectively (Fig. 1 shows an example when $b$ is 1). Time complexity analysis of a matrix multiplication performed by BiQGEMM, on the other hand, is divided into the following two parts: i) constructing lookup tables (Eq. 6) and ii) retrieving lookup table entries (Eq. 7). Correspondingly, the time complexity of BiQGEMM is presented as follows:

$$T_{\mathrm{BiQGEMM}} = T_{\mathrm{build}} + T_{\mathrm{query}} \qquad (8)$$

$$= O\!\left(\frac{n}{\mu} \cdot \left(2^{\mu} + m\right) \cdot b\right) \qquad (9)$$

If $m \gg 2^{\mu}$ is satisfied, then $T_{\mathrm{BiQGEMM}}$ can be approximated as

$$T_{\mathrm{BiQGEMM}} \approx O\!\left(\frac{m \cdot n \cdot b}{\mu}\right) \qquad (10)$$

Note that as long as $m \gg 2^{\mu}$, Eq. 10 holds regardless of the choice of algorithm to build lookup tables (i.e., irrespective of a selection between dynamic programming and GEMM-based construction). Then, by using BiQGEMM instead of conventional GEMM, the time complexity of a matrix multiplication is reduced by a factor of $\mu$. For multi-bit quantized weights, the time complexity of both BiQGEMM and GEMM increases linearly with the number of quantization bits.
Since the underlying principles of BiQGEMM are fundamentally different from those of GEMM, rethinking hardware designs is necessary. First, while the performance of FMA units is directly related to GEMM performance, the usage of FMAs in BiQGEMM is limited to constructing lookup tables. Second, while cache designs let GEMM exploit spatial locality in SRAM when loading a matrix through successive data accesses, BiQGEMM cannot efficiently exploit such locality because accesses to lookup table entries are non-sequential in general (note that, nonetheless, such degraded locality is not fatal if BiQGEMM is associated with multi-batch inference on CPU or with scratchpad memory on GPU; see Fig. 6). In addition, because BiQGEMM needs to place lookup tables (which are usually larger than an input matrix) in SRAM, the available range of tile sizes is highly constrained compared to GEMM. Now, let us explain why BiQGEMM still operates efficiently on CPUs and GPUs with quantized weights despite such issues (i.e., low utilization of FMAs and low data access locality). Note that with quantized weights, GEMM needs to decompress bit-packed quantized weight data by 1) performing two-step operations to extract quantized weights bit-wise from a much wider data container (such as INT32) and 2) conducting two-step arithmetic operations to convert data from the form $\{0, 1\}$ into the form $\{-1, +1\}$ (see Algorithm 3). On the other hand, BiQGEMM directly accesses and utilizes the bit-packed weight data as keys (or indices) of lookup tables without such additional decompression steps. It should be noted that for quantized weights, the overhead of decompression can outweigh the gain from the reduced memory footprint, as we demonstrate in the next section.
Consideration of existing hardware architecture designs is one of the keys to understanding the impact of BiQGEMM on system performance. For example, even though a large tile size for BiQGEMM would result in improved data reuse, current CPU or GPU designs allow only a limited tile size for BiQGEMM, such that a large batch size (i.e., a compute-bound workload) might be less favorable to BiQGEMM. A new BiQGEMM-aware hardware design considering the unique computational properties of BiQGEMM, thus, would be an ultimate solution. For example, if the scratchpad memory on GPU could be shared by all processing units, the allowed range of tile sizes for BiQGEMM would be much wider. A new hardware design dedicated to BiQGEMM is, therefore, suggested as important future work.
IV. Experimental Results
IV-A. Setup
We implemented our proposed algorithm BiQGEMM in C++/CUDA with various compilers targeting different machines. Table III presents descriptions of the systems on which tests are performed. For tests on CPUs, the performance of BiQGEMM is compared with Intel MKL (mkl) [wang2014intel], Eigen (eigen) [eigenweb], and an algorithm introduced in [weissteinmatrix] (kCpu). Additionally, BiQGEMM is also run on GPU and compared with cuBLAS (cublas) [nvidia2008cublas], a kernel code in the CUDA samples (kGpu) [volkov2008benchmarking], and XNOR-popcount (xnor) [courbariaux2016binarized]. We generated synthetic matrices filled with random numbers as data sets for the tests.

              Mobile               PC                       GPGPU
Processor     Cortex-A76           i7-7700                  Tesla V100
# of Cores    4                    4                        80 (SMs)
L1 D-cache    64 KB/core           32 KB/core               128 KB/SM
SIMD lanes    4/core               8/core                   16*4/SM
DRAM          8 GB                 16 GB                    16 GB
GB/s [a]      31.8                 35.76                    900
FLOPS         19.36G x 4           57.6G x 4                181.87G x 4
Compiler      LLVM 8.0.7           gcc 5.4.0                nvcc 10.2
OS            Android 9 (4.14.78)  Ubuntu 16.04.6 (4.15.0)  -

[a] maximum memory bandwidth
Our proposed algorithm accepts the LUT-unit $\mu$ as a user-defined parameter that can affect system performance. Let us explain how the LUT-unit is optimized in practice. $\mu$ determines the physical space allocated for lookup tables. If $\mu$ increases, then the number of lookup tables decreases while the number of entries in each lookup table increases exponentially (see Fig. 4(a) and Eq. 6); i.e., there exists a trade-off between the number of LUTs and the number of entries inside each LUT. Combined with the output size $m$, the LUT-unit specifies the relative performance gain of BiQGEMM over GEMM. Specifically, from Eq. 9, for a given output size $m$, we can find the optimal $\mu$ by minimizing $(2^{\mu} + m)/\mu$. Note that in practical situations, hardware resources may limit the maximum $\mu$ (due to internal SRAM size), and thus restrict the tile size as well. Theoretically optimized $\mu$, therefore, should be verified empirically through extensive experiments. We use $\mu = 8$ for our entire tests, which turns out to be close to the theoretically optimized value.
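Under the cost model of Eq. 9, the per-element work is proportional to $(2^{\mu} + m)/\mu$, so a one-line search recovers the theoretically optimal LUT-unit (an illustrative sketch; `optimal_lut_unit` is our name, not part of the implementation):

```python
def optimal_lut_unit(m, mu_candidates=range(1, 17)):
    # Cost model from Eq. (9): per input element, BiQGEMM spends about
    # (2^mu + m) / mu operations (2^mu table-build entries plus m lookups,
    # amortized over the mu weight bits each lookup covers).
    return min(mu_candidates, key=lambda mu: (2 ** mu + m) / mu)
```

For example, $m = 1024$ yields $\mu = 8$ under this model, while $m = 512$ yields $\mu = 7$; hardware constraints (SRAM capacity, tile sizes) then narrow the choice further.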
IV-B. BiQGEMM Runtime Profiling
Fig. 8 represents the runtime portion of each operation when running BiQGEMM on CPU with various output sizes $m$. Operations are mainly categorized into 1) lookup table construction (build), 2) retrieving values (query), and 3) memory replacement for tiling (replace). As discussed in Section III-C, increasing the output size enlarges the proportion of time spent retrieving values, and correspondingly, more arithmetic operations in GEMM are replaced with retrieval operations in BiQGEMM. Note that even when more quantization bits are assigned to each weight, BiQGEMM increases only the retrieving operations, which are relatively inexpensive among the three operations (see Fig. 2). As such, when weights are quantized, BiQGEMM performs better (compared with a GEMM-based scheme) when the output size is larger and weights are quantized with more bits.
IV-C. GEMM with Quantized and Bit-packed Weights
To reduce the memory footprint, bit-packing is essential for quantized models to be densely stored in a general data type, such as INT32. Through bit-packing with few-batch multiplication, memory-bound algorithms are accelerated by reduced memory-bandwidth requirements. However, unpacking is required before running GEMM operations on packed data. Since unpacking fundamentally requires bit-level manipulations, unpacking on CPUs and GPUs may cause a large computational overhead. Indeed, Fig. 9 demonstrates such concerns. Assuming that weights are 1-bit quantized, Fig. 9 compares the runtime of 3 different scenarios (w/o unpack, sGEMM, and w/ unpack) depending on how the binary vectors are processed: 'sGEMM' indicates a case where only one bit is stored in a 32-bit container, while 'w/ unpack' means multiplying bit-packed data after extracting bits through an unpacking process. Since the 'sGEMM' version does not assume bit-packed data formats, quantization does not affect its performance (i.e., performance would be the same as that of full-precision weights). Note that 'w/o unpack' measures runtime when bit-packed data is multiplied by a real vector without unpacking (i.e., products of a 32-bit packed scalar and a vector of length 32), which produces incorrect results but is useful to identify the performance gain from decreased memory access latency. The runtime gap between 'w/o unpack' and 'sGEMM' implies the performance gain from the reduced memory footprint, whereas the difference between 'w/o unpack' and 'w/ unpack' runtime indicates the latency overhead of unpacking operations. Fig. 9 confirms that GEMM with quantized weights is inefficient in terms of response time even though quantization reduces DRAM access latency.
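The decompression that GEMM must perform can be sketched as follows (illustrative Python; the measured kernels are C++/CUDA, and the function name is ours): each 32-bit word is unpacked with shift-and-mask operations, and the extracted bits are remapped from $\{0,1\}$ to $\{-1,+1\}$, exactly the two steps BiQGEMM avoids by consuming packed words directly as lookup keys:

```python
import numpy as np

def unpack_int32(packed_words, total_bits):
    # The two decompression steps GEMM needs before multiplying:
    #   1) shift-and-mask each weight bit out of its 32-bit container;
    #   2) remap {0, 1} to {-1, +1} via 2*b - 1.
    words = np.asarray(packed_words, dtype=np.uint32)
    bits = (words[:, None] >> np.arange(31, -1, -1)) & 1    # step 1: extract
    signs = 2.0 * bits.astype(np.float64) - 1.0             # step 2: remap
    return signs.reshape(-1)[:total_bits]
```

Both steps run per weight element on the critical path of the inner loop, which is why 'w/ unpack' in Fig. 9 is slower than 'w/o unpack' despite reading the same packed data.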
IV-D Comparison with Others
Even though a small batch size is preferred at inference time to reduce response time, recently developed DNNs demand batch sizes larger than 1. For example, an input (in the form of a sequence) to a Transformer encoder contains several subwords (tokens) so that hidden relationships between subwords can be detected. Because all subwords in the input are multiplied by the same weights concurrently, those subwords are processed as a group. The number of subwords, thus, is equivalent to the batch size in terms of computation. Accordingly, we conduct experiments using various batch sizes ranging from 32 to 256, considering the number of subwords used for Transformers and their variants.
Since the unpacking process adds significant overhead to GEMM-based schemes with quantized, bit-packed weights as shown in Section IV-C, the ‘sGEMM’ version (which stores only one bit in a 32-bit container without packing) introduced in the previous subsection is selected for comparison with BiQGEMM. The ‘sGEMM’ version does not benefit from quantization, and therefore 1-bit quantized weights and full-precision weights yield the same measured performance when using MKL (mkl) and Eigen (eigen) (thus, we do not specify whether weights are quantized in Fig. 10 for eigen and mkl). The performance of BiQGEMM is measured when weights are quantized into 1, 2, or 3 bits. Note that the runtime of both BiQGEMM and GEMM with quantized weights increases linearly with the number of quantization bits. Combined with the observation that BiQGEMM 1-bit (BiQGEMM with 1-bit quantization) shows the highest performance in Fig. 10, this implies that BiQGEMM is always faster than GEMM given the same number of quantization bits. Moreover, if the batch size is small enough, BiQGEMM with 2- or 3-bit quantization outperforms full-precision GEMM. Thus, even if latency is the top priority in the inference system design (at the expense of increased memory footprint without quantization), BiQGEMM can be the choice if the number of quantization bits is small enough with allowable model accuracy degradation.
TABLE IV: Runtime (sec) with various weight matrix sizes and batch sizes (1-bit quantized weights)

  weights    batch    runtime (sec)
  (n-by-n)   size     BiQGEMM   kGpu   cublas   xnor
  512        1        4         22     12       18
             32       11        24     20       18
             128      30        39     25       19
             256      58        63     26       19
  1024       1        4         36     14       18
             32       20        57     27       19
             128      70        120    45       21
             256      135       204    64       24
  2048       1        5         93     31       19
             32       47        153    52       23
             128      175       366    109      29
             256      330       661    179      40
  4096       1        7         213    90       23
             32       130       614    130      34
             128      528       1396   339      64
             256      1005      2516   594      109
Fig. 10 shows that when the input size is fixed, a larger output size enhances the speedup of BiQGEMM significantly because of a higher reuse rate of lookup tables and the correspondingly increased number of arithmetic operations replaced by simple table lookups. A large batch size, on the other hand, may have adverse effects on BiQGEMM performed by CPUs or GPUs. Fig. 10 shows that BiQGEMM can be slower than GEMM if the batch size and the number of quantization bits are beyond certain threshold values. In theory, since the query operations of BiQGEMM replace every μ full-precision multiply-accumulates of GEMM with q table lookups (where q is the number of quantization bits and the sub-vector length μ is empirically optimized to be 8), BiQGEMM with under 8-bit quantization is supposed to be always faster than GEMM (of full precision) regardless of batch size. However, in reality, we need to consider limiting factors due to available hardware resources as discussed in Section III. If the batch size increases and data reuse improves correspondingly, the overall computational-efficiency improvement of mkl and eigen can be higher than that of BiQGEMM. For example, when the batch size exceeds 128 in Fig. 10(a), eigen and mkl are faster than BiQGEMM with 3-bit quantization. The specific batch size determining whether BiQGEMM can be faster than GEMM depends on the system configuration. For example, in the case of a mobile CPU with low computational power (see Table III), BiQGEMM outperforms full-precision GEMM even when the batch size becomes larger compared with the PC case, as described in Fig. 10(b).
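The theoretical argument can be checked with back-of-envelope operation counts. This is a sketch under stated assumptions (hypothetical function names; constant factors, table-build optimizations, and memory effects ignored): BiQGEMM performs roughly q·m·(n/μ)·b table lookups plus 2^μ·(n/μ)·b build steps, against m·n·b multiply-accumulates for full-precision GEMM, with sub-vector length μ = 8.

```python
# Back-of-envelope operation counts for BiQGEMM vs. full-precision GEMM.
# m: output size, n: input size, b: batch size, q: quantization bits,
# mu: sub-vector length (empirically 8 in the paper).

def gemm_ops(m, n, b):
    """Multiply-accumulate count of a dense full-precision GEMM."""
    return m * n * b

def biqgemm_ops(m, n, b, q, mu=8):
    """Approximate BiQGEMM cost: table build plus per-row queries."""
    build = (2 ** mu) * (n // mu) * b   # 2^mu entries per sub-vector table
    query = q * m * (n // mu) * b       # q lookups cover mu weights each
    return build + query

# Usage: for a large layer, the query term dominates and the advantage
# shrinks roughly in proportion to q/mu.
m, n, b = 4096, 4096, 32
for q in (1, 2, 3):
    ratio = gemm_ops(m, n, b) / biqgemm_ops(m, n, b, q)
    print(f"q={q}: GEMM/BiQGEMM operation ratio ~ {ratio:.1f}")
```

Since the query term is (q/μ)·m·n·b, any q below μ = 8 keeps the count below GEMM's m·n·b in this model; the crossover observed in practice at large batch sizes comes from hardware utilization, not from the operation counts themselves.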
Even though the experimental results in Fig. 10 assume only one thread, multithreading linearly improves the performance of both BiQGEMM and GEMM, as both can be parallelized by tiling techniques.
IV-E Experiments with GPU
On GPUs, we compare BiQGEMM with kGpu, cuBLAS, and xnor. Both cublas and kGpu assume that only 1 bit is occupied in a 32-bit container (with unnecessary storage of 31 bits) for each quantized weight (i.e., bit-packing is not considered because unpacking is as slow as sGEMM). In the case of xnor, activations are quantized as well, such that matrix multiplication is mainly computed by XNOR and popcount operations without an unpacking procedure. Assuming weights and activations are α- and β-bit quantized, xnor shows a time complexity of O(α·β·m·n·b) (with 32 weight bits processed per XNOR/popcount operation), where m, n, and b are the output size, input size, and batch size, respectively. Although activation quantization can simplify computations further, it demands training-algorithm modifications and computational overhead during inference, as discussed in Section II, at the cost of a model accuracy drop.
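The XNOR/popcount kernel that xnor-style GEMM relies on can be sketched as follows (a minimal illustration with hypothetical names, assuming both operands are 1-bit quantized to {-1, +1} and bit-packed): XNOR marks positions where the signs agree, and a popcount converts the match count into a dot product, so one word-sized operation covers many elements at once.

```python
# Sketch of a binary dot product via XNOR + popcount: with {-1, +1} vectors
# packed bitwise (bit = 1 for +1), matching bits contribute +1 and
# mismatching bits contribute -1 to the dot product.

def bin_dot(w_bits, x_bits, n):
    """Dot product of two {-1, +1} vectors given as n-bit packed integers."""
    mask = (1 << n) - 1
    xnor = (~(w_bits ^ x_bits)) & mask        # 1 where the signs agree
    matches = bin(xnor).count("1")            # popcount
    return 2 * matches - n                    # matches - mismatches

# Usage: identical vectors give +n, complementary vectors give -n.
print(bin_dot(0b1010, 0b1010, n=4))  # → 4
print(bin_dot(0b1010, 0b0101, n=4))  # → -4
```

On real hardware the popcount is a single instruction over a 32- or 64-bit word, which is why xnor's runtime in Table IV is nearly flat across matrix sizes; the trade-off, as noted above, is that activations must be quantized as well.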
cublas is provided as a library developed by the chip vendor, so we select kGpu as the baseline that we modify to implement BiQGEMM (for xnor, we use publicly available code that is also based on kGpu). Table IV shows the runtime with various matrix sizes and batch sizes when each weight is 1-bit quantized (for xnor, activations are also 1-bit quantized). The difference in performance between BiQGEMM and kGpu represents the gain from the reduced bandwidth requirement and the improved computational principles. As shown in Table IV, BiQGEMM is faster than kGpu by 1.08 to 30.42 times (as the weight matrix size increases and the batch size decreases, BiQGEMM becomes relatively faster). Note that if the batch size is small enough (to be memory-bound), BiQGEMM presents the best performance, even compared with xnor.
V Conclusion
We proposed an efficient matrix-to-matrix multiplication technique dedicated to quantized neural networks. When weights are quantized, the available output space of computational results is quite limited, so for a large matrix multiplication, many computations become redundant. BiQGEMM removes such redundancy by replacing multiplications with table lookups. Moreover, because commercial processors enable only a fixed data-transfer width, much memory bandwidth can be wasted when weights are quantized into a few bits. BiQGEMM provides a way to access multiple quantized weights simultaneously regardless of the number of quantization bits. Hence, memory-bandwidth utilization is significantly enhanced by BiQGEMM while the required memory bandwidth is reduced by quantization. We demonstrated that BiQGEMM is considerably faster than previous matrix multiplication schemes, especially when the matrix size is large and the batch size is small.