As the number of parameters in DNNs increases to improve model accuracy on various tasks, reducing inference latency is becoming more challenging. Reducing response time is highly critical when real-time services are demanded (e.g., autonomous driving, automatic speech recognition, and neural machine translation). Note that most of the response time is usually consumed by general matrix-matrix multiplication (GEMM) or general matrix-vector multiplication (GEMV) of high time complexity (see Fig. 1). Efficient computation of matrix multiplication, therefore, directly translates into response time reduction. Previously, in order to accelerate GEMM operations, both hardware- and software-based approaches have been introduced [li2019edge, reuther2019survey, choudhary2020comprehensive, cheng2017survey, zhou2019edge, he2018survey].
As an effort to reduce latency, few-batch multiplications (in this paper, we refer to either GEMV or GEMM as few-batch multiplication for convenience) are strongly preferred for DNN inference at the cost of reduced weight reuse. Note that if GEMV is conducted to support single-batch inference, weight matrix data is accessed only once. Such a streaming-like operation is highly memory-bound on modern computing systems based on the von Neumann architecture, where main memory (DRAM) is separated from the processing unit [von1993first]. Moreover, as the weight matrix becomes larger, the portion of memory-bound operations grows as well. Executing workloads with little data reuse, therefore, prevents efficient utilization of computing resources because of the memory-wall problem (also known as the von Neumann bottleneck) [hennessy2011computer]. To alleviate the memory-access bottleneck from a hardware perspective, in-memory computing (in which computational operations are performed within the memory unit) has been widely studied [eleftheriou2018memory]. In other words, for DNNs, combating the memory bottleneck is pressing enough to call for a new hardware architecture design paradigm.
As a practical solution at the algorithm level, model compression is an effective technique to achieve lower end-to-end latency. Model compression reduces not only off-chip memory accesses (and hence power consumption) on mobile devices but also main memory bandwidth requirements, by shrinking the memory footprint with negligible accuracy drop [cheng2017survey, choudhary2020comprehensive]. Thus, model compression is being widely studied to accelerate inference computations. Popular model compression techniques include pruning [han2015deep, liu2018rethinking, lee2019network], low-rank approximation [sainath2013low, li2016recovery], and quantization [xu2018alternating, bhandare2019efficient, zhang2018lq].
In this work, we consider quantization because of its simple structure and high compression ratio [xu2018alternating, bhandare2019efficient, zhang2018lq]. The rationale behind quantization for DNNs is that the number of bits representing each parameter can be reduced without noticeable model accuracy degradation because DNNs include a lot of redundancy. Note that weights and activations need to be quantized at different times. Weights are fixed during inference, and hence weight quantization is performed in advance of inference. On the other hand, activation quantization must be conducted on-the-fly, with additional computations (for quantization) during inference. If the quantization algorithm is complicated, the cost of dynamic quantization might outweigh the gain from quantization. In addition, activation compression may result in serious accuracy degradation if training is not aware of the quantized structure of activations. In this manuscript, thus, we study weight quantization only, which is enough to accelerate matrix multiplication as we demonstrate later.
In this paper, we propose a novel matrix multiplication algorithm dedicated to quantized DNNs that can be performed on modern computer (von Neumann) architectures. Even though quantization obviously reduces storage requirements on off-chip memory, achieving higher performance with quantized DNNs on CPUs or GPUs is challenging. First, because data transfer on commercial processors is performed with a fixed width (such as 32 bits) while weights can be quantized with an arbitrary number of bits, accessing multiple quantized weights may waste bandwidth. Second, decoding quantized weights may induce additional instructions. Our proposed method, called BiQGEMM (non-GEneral Matrix Multiplication for Binary-coding-based Quantized neural networks), addresses such concerns using lookup tables that accept quantized weights as indices. BiQGEMM is built on the observation that quantization leads to a lot of redundant computations. The key idea is that for any real-number vector (of activations), the number of possible outcomes of a dot product between a length-$\mu$ sub-vector of activations and a binary sub-vector (of quantized weights) is limited to $2^{\mu}$, all of which can be pre-computed and stored in lookup tables that are then re-used. We show that by replacing the majority of arithmetic operations with table lookups, BiQGEMM calculates matrix multiplications with high performance and improved bandwidth utilization.
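The redundancy argument above can be illustrated in a few lines of Python (a toy sketch of the idea, not the paper's implementation; the bit-to-sign encoding is our own choice): for a length-$\mu$ slice of the activation vector, every possible dot product with a $\{-1,+1\}$ binary vector is one of $2^{\mu}$ values, so all of them can be tabulated once and reused for every row of the weight matrix.

```python
mu = 4                       # LUT-unit: length of the binary sub-vector
x = [0.5, -1.25, 2.0, 0.75]  # a length-mu slice of a real-valued activation vector

# Pre-compute all 2**mu possible dot products <b, x> for b in {-1,+1}^mu.
# Key k encodes b: bit j of k set -> b[j] = +1, otherwise b[j] = -1.
lut = [sum((+1 if (k >> j) & 1 else -1) * x[j] for j in range(mu))
       for k in range(2 ** mu)]

# Any binary weight sub-row now resolves to a single table lookup.
b = [+1, -1, -1, +1]         # one quantized weight sub-row
key = sum((1 << j) for j in range(mu) if b[j] == +1)
assert lut[key] == sum(bi * xi for bi, xi in zip(b, x))
```

Every additional weight row reuses the same 16-entry table, which is where the arithmetic savings come from.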
DNNs intentionally involve a lot of redundancy to expedite the search for a good local minimum. Thus, the model size of DNNs can often be significantly reduced by various compression algorithms. Quantization is gaining increasing popularity as an effective model compression technique. There exist various quantization formats, and dequantized weights can be represented either by fixed-point numbers (based on uniform quantization) or by floating-point numbers (based on codebook lookups or binary-coding quantization).
Note that codebook-based quantization presents high compression ratios for various models with negligible model accuracy degradation because expected values after quantization are still maintained as floating-point numbers [facebook_lut_quant]. Even though codebook-based quantization is highly efficient at reducing off-chip memory footprint, computational complexity is not reduced at all after dequantization. Fixed-point quantization, on the other hand, reduces both storage requirements and computational complexity. Since INT8 quantization, combined with additional techniques, can maintain the model accuracy of some well-known CNN models, INT8 has been adopted by various commercial tools [bhandare2019efficient]. Note that operations other than GEMV or GEMM need to be re-designed to function in INT8, while INT8-aware retraining may be necessary to avoid serious model accuracy degradation. For example, the layer normalization and softmax operations in the attention blocks of Transformers demand floating-point computations [bhandare2019efficient]. Accordingly, frequent conversions between fixed-point and floating-point formats would incur 15% to 30% computational overhead [bhandare2019efficient].
As an effort to reduce both computations and footprint significantly, binary-coding-based quantization has been proposed [xu2018alternating, rastegari2016xnor, zhang2018lq]. Since expected values after binary-coding quantization remain floating-point numbers, accuracy degradation is negligible even when only about 3 bits are used for quantization [xu2018alternating, zhang2018lq]. Despite the possibility of highly simplified computations, binary-coding-based quantization has not been useful in practice because the computing system must allow bit-level memory access. In this manuscript, hence, we adopt binary-coding-based quantization as a baseline to enable practical usage on commercial computing systems.
Note that in the case of INT8, activations must also be quantized in order to allow fixed-point GEMV or GEMM, while such activation quantization is optional for floating-point-based quantization. Activation quantization inherently demands a dynamic quantization process during inference. Even though inference operations can be made a lot more efficient by previously proposed methods such as the Method of Four Russians [aho1974design] or popcount- and XOR-logic [rastegari2016xnor], activation quantization requires 1) modifications to the training algorithm to severely restrict the range of activation values and 2) computational overhead for format conversions [rastegari2016xnor, xu2018alternating]. In this paper, we show that BiQGEMM presents high performance even when activations are maintained as floating-point numbers.
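As a concrete illustration of the popcount- and XOR-logic computation mentioned above (a sketch in our own notation, not code from the cited works): when both operands are $\{-1,+1\}$ vectors packed into machine words, their dot product reduces to a few bit operations.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1,+1}^n vectors, each packed so that a set
    bit represents +1 and a clear bit represents -1.
    XNOR marks positions where the signs agree; the dot product is then
    (#agreements) - (#disagreements) = 2 * popcount(xnor) - n."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # mask to n bits
    return 2 * bin(xnor).count("1") - n

# a = (+1, -1, +1, +1), b = (+1, +1, -1, +1) -> 1 - 1 - 1 + 1 = 0
a = 0b1101  # bit j holds element j
b = 0b1011
assert binary_dot(a, b, 4) == 0
```

This is why fully binarized networks are attractive on custom hardware, but it presumes the activations have been binarized as well, which is exactly the requirement BiQGEMM avoids.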
II-B Binary-coding-based Quantization
When a real-number vector $\boldsymbol{w} \in \mathbb{R}^{n}$ is quantized into $q$ bits by a binary-coding-based quantization method, $\boldsymbol{w}$ is mapped into $q$ scaling factors $\alpha_i \in \mathbb{R}$ and $q$ binary vectors $\boldsymbol{b}_i \in \{-1,+1\}^{n}$ ($1 \le i \le q$). Then, $\boldsymbol{w}$ is approximated as $\sum_{i=1}^{q} \alpha_i \boldsymbol{b}_i$, where scaling factors are shared by multiple elements of $\boldsymbol{w}$. Scaling factors and binary vectors are obtained as
$$\{\alpha_i, \boldsymbol{b}_i\}_{i=1}^{q} = \operatorname*{arg\,min}_{\alpha_i,\, \boldsymbol{b}_i} \Big\| \boldsymbol{w} - \sum_{i=1}^{q} \alpha_i \boldsymbol{b}_i \Big\|^2, \qquad (1)$$
such that the quantization error is minimized. Since there is no analytical solution minimizing this quantization error, numerous heuristic approaches have been proposed [guo2017network, xu2018alternating, zhang2018lq, courbariaux2016binarized, amc].
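One common heuristic is a greedy scheme that quantizes the residual one bit at a time: take $\boldsymbol{b}_i = \mathrm{sign}(\boldsymbol{r})$, the least-squares optimal $\alpha_i = \mathrm{mean}(|\boldsymbol{r}|)$, and update $\boldsymbol{r} \leftarrow \boldsymbol{r} - \alpha_i \boldsymbol{b}_i$. The sketch below is our own illustration of this family of heuristics, not the exact algorithm of any cited work.

```python
def greedy_binary_quantize(w, q):
    """Greedily approximate w by sum_i alpha_i * b_i with b_i in {-1,+1}^n.
    Each step takes b_i = sign(residual) and the least-squares optimal
    alpha_i = mean(|residual|), then subtracts alpha_i * b_i from the
    residual before computing the next bit."""
    residual = list(w)
    alphas, bs = [], []
    for _ in range(q):
        b = [1.0 if r >= 0 else -1.0 for r in residual]
        alpha = sum(abs(r) for r in residual) / len(residual)
        residual = [r - alpha * bi for r, bi in zip(residual, b)]
        alphas.append(alpha)
        bs.append(b)
    return alphas, bs

w = [0.9, -1.1, 0.4, -0.2]
alphas, bs = greedy_binary_quantize(w, 3)
approx = [sum(a * b[j] for a, b in zip(alphas, bs)) for j in range(len(w))]
err = sum((wi - ai) ** 2 for wi, ai in zip(w, approx))
```

For this particular toy vector the 3-bit greedy approximation happens to be exact; in general the residual shrinks but does not vanish.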
The same principle of binary-coding-based vector quantization can be applied to matrices, where quantization can be performed independently for each row or column. For a weight matrix $W \in \mathbb{R}^{m \times n}$ quantized into $q$ binary matrices $B_i \in \{-1,+1\}^{m \times n}$ with scaling-factor vectors $\boldsymbol{\alpha}_i \in \mathbb{R}^{m}$, multiplication with a real-number vector $\boldsymbol{x} \in \mathbb{R}^{n}$ produces an output vector $\boldsymbol{y}$ as
$$\boldsymbol{y} = \sum_{i=1}^{q} \boldsymbol{\alpha}_i \circ \left(B_i \cdot \boldsymbol{x}\right), \qquad (2)$$
where $\circ$ denotes element-wise multiplication (i.e., Hadamard product) and $q$ is the number of quantization bits for weights. Fig. 2 illustrates how a multi-bit quantized weight matrix is multiplied by a real-number vector. Note that, for convenience, the binary weight matrices $B_i$ can be concatenated in the vertical direction and multiplied by the vector $\boldsymbol{x}$. Then, element-wise multiplication by the scaling factors produces intermediate partial outputs $\boldsymbol{\alpha}_i \circ (B_i \cdot \boldsymbol{x})$. Finally, the sum of these $q$ vectors yields the final output $\boldsymbol{y}$.
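The computation in Eq. 2 can be written out directly as a toy reference implementation (our own naming; one scaling factor per row, shared by all elements of that row):

```python
def quantized_matvec(alphas, Bs, x):
    """y = sum_i alphas[i] o (Bs[i] @ x), where Bs[i] is an m-by-n
    {-1,+1} matrix and alphas[i] is a length-m vector of per-row
    scaling factors (the Hadamard product in Eq. 2)."""
    m = len(Bs[0])
    y = [0.0] * m
    for alpha, B in zip(alphas, Bs):       # one term per quantization bit
        for r in range(m):
            dot = sum(B[r][c] * x[c] for c in range(len(x)))
            y[r] += alpha[r] * dot         # element-wise scaling per row
    return y

# 1-bit example: m = 2, n = 3
B1 = [[+1, -1, +1],
      [-1, -1, +1]]
alpha1 = [0.5, 2.0]
x = [1.0, 2.0, 3.0]
y = quantized_matvec([alpha1], [B1], x)    # [0.5*(1-2+3), 2.0*(-1-2+3)]
```

Adding more quantization bits simply appends more (alpha, B) pairs, which matches the vertical concatenation of the $B_i$ described above.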
Consider that the real-number activation vector $\boldsymbol{x}$ is also quantized, using $p$ bits, into binary vectors $\boldsymbol{c}_j \in \{-1,+1\}^{n}$ with scaling factors $\beta_j \in \mathbb{R}$ ($1 \le j \le p$). The previous output can now be computed as
$$\boldsymbol{y} = \sum_{i=1}^{q} \sum_{j=1}^{p} \beta_j \, \boldsymbol{\alpha}_i \circ \left(B_i \cdot \boldsymbol{c}_j\right). \qquad (3)$$
Eq. 3 suggests that activation quantization increases the number of computations compared with Eq. 2 (by a factor of $p$), even though most of these computations are simple bit-wise logic. It should be noted that without sophisticated hardware support for the bit-wise logic incurred by binary-coding quantization, activation quantization may degrade matrix multiplication performance.
II-C Natural Language Processing
In order to set a range of parameters (such as matrix size) to be used in our experiments and to assess the impact of the proposed algorithm, we investigate natural language processing (NLP) as a representative application of BiQGEMM.
RNNs [rumelhart1986learning] and Transformers [vaswani2017attention] are widely accepted as time-series analysis tools for natural language tasks. Long short-term memory (LSTM) [hochreiter1997long], compared with conventional RNNs, introduces additional gates in each unit to overcome the long-term dependency and gradient vanishing problems of vanilla RNNs. As such, most recently proposed RNN-based networks employ LSTM as a basic unit to improve model accuracy on multiple benchmark language models. Transformers presented another notable advance in NLP. By breaking the recurrent structure and fully exploiting the attention mechanism [bahdanau2014neural], Transformers better capture the relevance between words in a sentence. Correspondingly, Transformers are primarily used in a wide range of NLP tasks including neural machine translation (NMT) and have been extended to various applications, including BERT [devlin2018bert], with impressive results on GLUE [wang2018glue] and SQuAD [rajpurkar2016squad].
The structure of a Transformer can be divided into encoder layers and decoder layers. An encoder layer includes one attention block structured as four $(d_{model} \times d_{model})$ weight matrices and a feed-forward block with one $(d_{model} \times 4d_{model})$ and one $(4d_{model} \times d_{model})$ matrix, where $d_{model}$ is the hidden size. A decoder layer presents two attention blocks and a feed-forward block, while the structure of each block is the same as in the encoder. The number of encoder layers is 6 for both the base and big models, and $d_{model}$ is 512 for the base model (1024 for the big model). Weight matrices are fed into matrix multiplication operations, and weight matrix sizes are rapidly increasing to support various complicated tasks with higher model accuracy goals. For example, for NMT, most models that show excellent performance are based on the big version of the Transformer [ahmed2017weighted, shaw2018self, ott2018scaling, edunov2018understanding]. T5, another variant of the Transformer, increases the number of weights to 11 billion and the number of layers to 24 [Raffel2019ExploringTL].
BERT [devlin2018bert] is a pre-training-based model for applications that require only the encoder part of the Transformer. BERT models have continuously set new records on model accuracy with a large number of encoder layers and a large hidden size (24 and 1024, respectively, for the large model). Combined with new training algorithms based on the large model of BERT, various advanced models, such as XLNet [yang2019xlnet], RoBERTa [liu2019roberta], and ERNIE [zhang2019ernie, sun2019ernie], are being developed. Ever-increasing demands for higher accuracy call for ever-larger weight matrices. For instance, the biggest weight matrix in the xx-large model of ALBERT [lan2019albert] is (4096-by-16384), which requires 256 MB of memory footprint with FP32. Such large weight matrices cannot avoid frequent DRAM accesses even if the same parameters are repeatedly reused over the whole network.
As for automatic speech recognition (ASR), the number of parameters is similarly increasing to accomplish higher model accuracy [karita2019comparative, karita2019improving, irie2019choice, luscher2019rwth, park2019specaugment, han2019state]. To illustrate, LAS is an end-to-end ASR DNN model based on bi-directional LSTMs, using six encoder layers and two decoder layers [park2019specaugment].
In sum, fast multiplication of matrices whose dimensions are (at least) a few thousand is essential to realize DNNs for NLP tasks. Such high-performance matrix multiplication needs to assume that DNNs are compressed because of the increasing number of parameters.
II-D Quantizing Transformers
Now we estimate the required number of quantization bits using Transformers, which are widely applied to various NLP tasks. Table I lists quantization results of the base-model Transformer (designed to perform English-to-German translation) using uniform quantization [bhandare2019efficient, prato2019fully] and binary-coding-based quantization with greedy approximation [guo2017network]. For the uniform quantization results, we refer to the numbers from [bhandare2019efficient, prato2019fully]. For binary-coding-based quantization with the greedy approximation method (to reduce quantization error), we retrain the model using the quantization-aware training algorithm introduced in [deeptwist] on the WMT13 data set. When retraining the model, all hyper-parameters are the same as in the Transformer [vaswani2017attention], except an initial learning rate enlarged by a factor of 2 and the additional hyper-parameter of distortion step (introduced in [deeptwist]), which is set to 2000. The baseline results for each quantization case are inherently different due to different initialization conditions and test sets; the numbers of quantization bits and model accuracy (given as BLEU scores) are described in Table I. Note that uniform quantization requires activations to be quantized (on-the-fly) as well to enable fixed-point matrix multiplications, while such activation quantization is optional for binary-coding-based quantization. Besides, uniform quantization demands frequent conversions between floating-point and fixed-point formats (with conversion overhead) to maintain high-precision activation functions, while such conversions are not necessary for binary-coding-based quantization.
Table II shows the memory usage when weights and activations are quantized into different numbers of bits while the matrix size is fixed to 512-by-512 (the size of an attention layer of the base Transformer). The number of sub-words in the test data set is 18 on average, and thus the batch size is set to 18. Note that because of the relatively small dimension of activations, activation quantization does not reduce memory footprint as much as weight quantization, while more bits need to be assigned to weight quantization given a target model accuracy, as shown in Table I. This observation is consistent across other matrix sizes. Combining Table I and Table II, for BiQGEMM design considerations, we quantize only weights, and we are mainly interested in a small number of quantization bits (e.g., 1 to 3).
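The accounting behind this comparison can be reproduced with simple arithmetic (our own sketch; the 512-by-512 matrix and batch size 18 come from the text, while scaling factors and outputs are ignored for simplicity):

```python
def footprint_mb(n, batch, w_bits, a_bits):
    """Memory for one n-by-n layer: the weight matrix at w_bits per
    weight plus an n-by-batch activation matrix at a_bits per value.
    Returns (weight_MB, activation_MB)."""
    weights = n * n * w_bits / 8 / 2**20
    activations = n * batch * a_bits / 8 / 2**20
    return weights, activations

w32, a32 = footprint_mb(512, 18, 32, 32)   # full precision
w1, _ = footprint_mb(512, 18, 1, 32)       # 1-bit weights only

# Weight quantization shrinks the dominant n*n term by 32x, while
# activation quantization only touches the much smaller n*batch term.
```

With $n = 512$ and batch 18, the weight matrix dominates (1 MB vs. about 0.035 MB for activations), which is why quantizing activations buys little footprint here.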
TABLE I: Quantization results of the base Transformer (English-to-German).

| Models | | Data format (bits) weight / activation | BLEU |
| Ref [bhandare2019efficient] | Baseline | 32 / 32 | 27.68 |
| | Uniform | 8 / 8 | 27.30 (-0.22) |
| Ref [prato2019fully] | Baseline | 32 / 32 | 26.46 |
| | Uniform | 8 / 8 | 26.38 (-0.80) |
| | | 6 / 6 | 26.98 (+0.52) |
| | | 4 / 4 | 18.32 (-8.14) |
| Ref [deeptwist] | Baseline | 32 / 32 | 25.8 |
| | Binary-coding | 4 / 32 | 25.5 (-0.3) |
| | | 3 / 32 | 25.3 (-0.5) |
| | | 2 / 32 | 23.9 (-1.9) |
| | | 1 / 32 | 0.4 (-25.4) |
TABLE II: Memory usage for each data format.

| Data format (bits) | Memory (MB) |

W: weights, A: activations, O: outputs, I: inputs.
III-A Motivation and Definitions
The LUT-unit $\mu$ is the length of a sub-vector to be used as the input argument of a table lookup.
Given an $m$-by-$n$ matrix $A$, $\hat{A}$ is an $(mn/\mu)$-by-$\mu$ matrix reshaped from $A$ while maintaining column-wise traversal.
Given an $m$-by-$n$ matrix $A$, $A_{(i:j)}$ is a sub-matrix of $A$ formed by its $i$-th to $j$-th columns, where $i \le j$.
Given a column vector $\boldsymbol{v}$ of length $n$, $\boldsymbol{v}_{(i:j)}$ is a sub-vector of $\boldsymbol{v}$ comprised of its $i$-th to $j$-th rows, where $i \le j$.
$B_{\mu} \in \{-1,+1\}^{2^{\mu} \times \mu}$ denotes a matrix constructed by concatenating all possible (non-redundant) binary row vectors of length $\mu$.
We assume that a binary weight matrix $W \in \{-1,+1\}^{m \times n}$ and an input matrix $\boldsymbol{x} \in \mathbb{R}^{n \times b}$ are given, where $m$, $n$, and $b$ are output size, input size, and batch size, respectively. Fig. 3 shows an example of a quantized (binary) weight matrix and an input matrix. In Fig. 3, each matrix is equally divided into three parts according to a LUT-unit $\mu$ of 4. Considering the shaded part in Fig. 3, a row vector (having 4 binary digits) in $W$ is one of $2^{4} = 16$ possible combinations. Correspondingly, each row of the corresponding partial product is also limited to one of $2^{4}$ possible vectors. As an attempt to exploit this strictly limited space of available outputs, the product of $B_{\mu}$ and the (transposed) reshaped input matrix $\hat{\boldsymbol{x}}^{\top}$ is pre-computed and stored in lookup tables. Then, pre-computed values are retrieved from the lookup tables using $\mu$-bit binary digits in the weight matrix as a key (or an index). Note that when $2^{\mu} \ll m$, computing efficiency is enhanced because, for a large enough output size, most arithmetic operations can be replaced by retrieval operations.
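The pre-computation step described above can be sketched in Python (a toy illustration under our own naming; `B_mu` enumerates all binary row vectors, here with $\{-1,+1\}$ entries): the whole set of lookup tables is simply the product of $B_{\mu}$ with the reshaped input.

```python
mu = 2
# B_mu: all 2**mu binary row vectors of length mu (with {-1,+1} entries);
# row k uses bit j of k: set -> +1, clear -> -1.
B_mu = [[(+1 if (k >> j) & 1 else -1) for j in range(mu)]
        for k in range(2 ** mu)]

# An input vector of length n = 4, reshaped into n/mu sub-vectors of length mu
x = [0.5, -1.0, 2.0, 0.25]
subs = [x[i:i + mu] for i in range(0, len(x), mu)]

# Product of B_mu with the reshaped input: each column (here, each inner
# list) is the complete lookup table for one sub-vector, 2**mu entries each.
luts = [[sum(row[j] * s[j] for j in range(mu)) for row in B_mu] for s in subs]

assert len(luts) == 2 and len(luts[0]) == 2 ** mu
```

Every possible partial dot product against any weight sub-row is now sitting in `luts`, waiting to be indexed.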
III-B Algorithm Description
When the LUT-unit $\mu$ is given as a parameter, each column in the product of $B_{\mu}$ and $\hat{\boldsymbol{x}}^{\top}$ becomes the entries of a separate lookup table. Fig. 4 shows an exemplary process of building lookup tables when $\mu$ is 4, where we define a sub-vector $\boldsymbol{x}^{(i,j)}$ of length $\mu$ and a lookup table $Q^{(i,j)}$ corresponding to it as
$$\boldsymbol{x}^{(i,j)} = \left(\boldsymbol{x}_{(i:i)}\right)_{(\mu(j-1)+1\,:\,\mu j)}, \qquad (4)$$
$$Q^{(i,j)} = B_{\mu} \cdot \boldsymbol{x}^{(i,j)}, \qquad (5)$$
when $1 \le i \le b$ and $1 \le j \le n/\mu$, where $b$, $n$, and $\mu$ are the batch size, the input size, and the LUT-unit, respectively. Then, the product of a weight sub-matrix $W_{(\mu(j-1)+1\,:\,\mu j)}$ and the sub-vector $\boldsymbol{x}^{(i,j)}$ can be found in lookup table $Q^{(i,j)}$, instead of being computed by GEMV. In other words, partial products of GEMM are replaced with table lookups in BiQGEMM.
As shown in Fig. 4(b), building a lookup table can be optimized by a dynamic programming technique. Specifically, while constructing a lookup table, dynamic programming removes redundant arithmetic operations (described as the right-sided equations in Fig. 4(b)) compared to the case when GEMV is performed (described as the left-sided equations in Fig. 4(b)). Algorithm 1 presents the pseudo code of building a lookup table with dynamic programming. In Fig. 4(b), each equation is annotated with the corresponding line numbers in Algorithm 1. Note that every sub-vector per input induces a distinct lookup table of $2^{\mu}$ entries, and hence the time complexity of the proposed dynamic programming scheme is
$$T_{build} = O\!\left(\frac{n}{\mu} \cdot 2^{\mu} \cdot b\right). \qquad (6)$$
$T_{build}$ obtained by our proposed technique is $\mu$ times less than $O\!\left(\frac{n}{\mu} \cdot 2^{\mu} \cdot \mu \cdot b\right)$, the time complexity of the GEMM-based lookup table construction method (see Fig. 4(a)). Suppose our dynamic programming scheme is combined with multi-threading; then each thread is responsible for constructing one or more lookup tables (i.e., one lookup table is not constructed cooperatively by two or more threads). Another level of parallelism is achieved by vectorizing independent equations in Fig. 4(b) to utilize SIMD instructions. Because of the dependency among equations in dynamic programming, however, conventional GEMV or GEMM schemes might be a better choice to fill up lookup table entries if a computing system (e.g., a GPU) embeds a sizeable number of simple computation units. In other words, depending on the characteristics of the processor running BiQGEMM, the appropriate scheme to implement lookup tables would differ.
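The dynamic-programming construction can be sketched as follows (our reconstruction of the idea, not the paper's Algorithm 1 verbatim): each table entry is derived from an already-computed entry that differs in exactly one bit, so a single addition replaces a length-$\mu$ dot product.

```python
def build_lut_dp(x_sub):
    """Build the 2**mu-entry lookup table for one length-mu sub-vector.
    Key k encodes a {-1,+1} vector: bit j set -> +1 at position j.
    Entry k is derived from entry (k minus its highest set bit): flipping
    bit j from -1 to +1 adds 2 * x_sub[j], so one add replaces a full
    dot product."""
    mu = len(x_sub)
    lut = [0.0] * (2 ** mu)
    lut[0] = -sum(x_sub)                 # the all-(-1) vector
    for k in range(1, 2 ** mu):
        j = k.bit_length() - 1           # index of the highest set bit
        lut[k] = lut[k - (1 << j)] + 2.0 * x_sub[j]
    return lut

x_sub = [0.5, -1.25, 2.0, 0.75]
lut = build_lut_dp(x_sub)
# Spot-check against a direct dot product for the vector (+1, -1, -1, +1)
assert lut[0b1001] == 0.5 * 1 + (-1.25) * -1 + 2.0 * -1 + 0.75 * 1
```

Each of the $2^{\mu} - 1$ non-trivial entries costs one addition instead of $\mu$ multiply-adds, which is exactly the factor-$\mu$ saving claimed for $T_{build}$.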
Fig. 5 illustrates the process of querying partial results from lookup tables. Consecutive LUT-unit ($\mu$) binary digits in a quantized weight matrix are grouped into a sub-vector that can be converted into an integer in $[0, 2^{\mu})$ (e.g., the binary sub-vector $(0,1,1,0)$ is converted into the integer 6 when $\mu$ is 4). In other words, the key matrix $K$ in Fig. 5 is a $\mu$-bit-packed version of $W$, and each element in $K$ serves as an index for table lookups (note that the matrix $K$, instead of $W$, can be loaded in advance into the system, since weight matrices are fixed during inference). Partial results retrieved from lookup table entries are accumulated for each sub-vector of length $\mu$ per input vector, and BiQGEMM is completed.
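Putting construction and querying together, the whole scheme fits in a short end-to-end sketch (our own toy code; 1-bit weights, a single input vector, and direct table construction for brevity):

```python
def biqgemm_1bit(W, x, mu):
    """Multiply an m-by-n {-1,+1} weight matrix W by a real vector x by
    (1) building one 2**mu-entry lookup table per length-mu slice of x,
    and (2) replacing every length-mu partial dot product with a table
    lookup keyed by the bit-packed weight sub-vector."""
    m, n = len(W), len(x)
    assert n % mu == 0
    # Step 1: one lookup table per sub-vector of x
    luts = []
    for s in range(0, n, mu):
        luts.append([sum((+1 if (k >> j) & 1 else -1) * x[s + j]
                         for j in range(mu)) for k in range(2 ** mu)])
    # Step 2: pack each weight sub-row into an integer key and accumulate
    y = []
    for row in W:
        acc = 0.0
        for t, s in enumerate(range(0, n, mu)):
            key = sum(1 << j for j in range(mu) if row[s + j] == +1)
            acc += luts[t][key]
        y.append(acc)
    return y

W = [[+1, -1, +1, +1], [-1, -1, +1, -1]]
x = [1.0, 2.0, 3.0, 4.0]
assert biqgemm_1bit(W, x, 2) == [sum(w * v for w, v in zip(r, x)) for r in W]
```

In practice the key packing happens offline (the $K$ matrix above), so the inner loop is purely lookups and additions.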
All keys in the key matrix are used $b$ (= the number of input vectors) times. For example, the leftmost lookup tables corresponding to every input in Fig. 5 are all accessed by the first column of the key matrix. In other words, different lookup tables are accessed by a shared key. By contiguously placing lookup table entries associated with the same key, as shown in Fig. 6, SIMD operations on CPUs are encouraged, and bank conflicts on GPUs can be mitigated. As for GPUs, scratchpad memory or software-controlled caches (i.e., shared memory in CUDA) can store lookup tables. Then, even if the memory accesses are irregular, multiple threads can fetch multiple data in parallel unless multiple addresses within the same memory bank are accessed. Thus, the penalty of irregular accesses to a lookup table on a GPU is not as critical as on a CPU.
A tiling approach for BiQGEMM is illustrated in Fig. 7 and described in Algorithm 2. To build lookup tables (LUTs) without redundancy, BiQGEMM adopts an LUT-stationary tiling scheme. The two input arguments of BiQGEMM are given as a key matrix $K$ and an input tensor reshaped from an input matrix (the reshaped form is determined by the user-defined parameter $\mu$). For tiling, tile width and height need to be specified. Each tile of LUTs is operated on by one thread (assigned to one SM in the case of a GPU), and hence multiple accesses to lookup tables in a block can be SIMDified (by multiple CUDA cores in the case of a GPU). Lookup tables are not constructed in advance (i.e., not taken from DRAM); instead, they are implemented on-the-fly by Algorithm 1 with sub-vectors as inputs (Line 3 in Algorithm 2). After implementing the lookup tables, pairs of corresponding tiles in the key matrix and the input are loaded successively and individually, and computed using the tables (Lines 4-6 in Algorithm 2). With LUT-stationary tiling, partial output results obtained by multiple threads are processed through either sum reduction or atomic additions to obtain the final output values. Since each of the $m \cdot n/\mu$ keys is utilized $b$ times, the worst-case time complexity of retrieval is given as
$$T_{query} = O\!\left(m \cdot \frac{n}{\mu} \cdot b\right). \qquad (7)$$
To process multi-bit quantized weight matrices, BiQGEMM assumes that multiple binary weight matrices are concatenated as described in Fig. 2. Note that such concatenation does not increase the number of lookup tables; for BiQGEMM, only the number of lookup-table retrieving operations increases with the number of quantization bits. In short, for multi-bit quantized weight matrices, $T_{query}$ becomes $O\!\left(q \cdot m \cdot \frac{n}{\mu} \cdot b\right)$, where $q$ is the number of quantization bits.
III-C Complexity Analysis
Given an input matrix $\boldsymbol{x} \in \mathbb{R}^{n \times b}$ and a quantized binary weight matrix $W \in \{-1,+1\}^{m \times n}$, a matrix multiplication performed by GEMM exhibits $O(m \cdot n \cdot b)$ time complexity, where $m$, $n$, and $b$ are output size, input size, and batch size, respectively (Fig. 1 shows an example when $b$ is 1). Time complexity analysis of a matrix multiplication performed by BiQGEMM, on the other hand, is divided into the following two parts: i) constructing lookup tables (Eq. 6) and ii) retrieving lookup table entries (Eq. 7). Correspondingly, the time complexity of BiQGEMM is
$$T_{BiQGEMM} = T_{build} + T_{query} \qquad (8)$$
$$= O\!\left(\frac{n}{\mu} \cdot b \cdot \left(2^{\mu} + m\right)\right). \qquad (9)$$
If $2^{\mu} \ll m$ is satisfied, then Eq. 9 can be approximated as
$$T_{BiQGEMM} \approx O\!\left(\frac{m \cdot n \cdot b}{\mu}\right). \qquad (10)$$
Note that as long as $2^{\mu} \ll m$, Eq. 10 holds regardless of the choice of algorithm to build lookup tables (i.e., irrespective of a selection between the dynamic-programming-based and GEMM-based constructions). Then, by using BiQGEMM instead of conventional GEMM, the time complexity of a matrix multiplication is reduced by a factor of $\mu$. For multi-bit quantized weights, the time complexity of both BiQGEMM and GEMM increases linearly with the number of quantization bits.
Since the underlying principles of BiQGEMM are fundamentally different from those of GEMM, rethinking hardware designs is necessary. First, while the performance of FMA units is directly related to GEMM performance, the usage of FMAs in BiQGEMM is limited to constructing lookup tables. Second, while cache designs help GEMM utilize spatial locality in SRAM when loading a matrix through successive accesses, BiQGEMM cannot efficiently exploit such locality because accesses to lookup table entries are non-sequential in general (nonetheless, such degraded locality is not fatal if BiQGEMM is associated with multi-batch inference on CPUs or with scratchpad memory on GPUs; see Fig. 6). In addition, because BiQGEMM needs to place lookup tables (that are usually larger than an input matrix) in SRAM, the available range of tile sizes is highly constrained compared to GEMM. Now, let us explain why BiQGEMM still operates efficiently on CPUs or GPUs with quantized weights despite these issues (i.e., low utilization of FMAs and low data-access locality). Note that with quantized weights, GEMM needs to decompress bit-packed quantized weight data by 1) performing two-step operations to extract quantized weights bit-wise from a much wider data container (such as INT32) and 2) conducting two-step arithmetic operations to convert data of the form $\{0, 1\}$ into the form $\{-1, +1\}$ (see Algorithm 3). On the other hand, BiQGEMM directly accesses and utilizes the bit-packed weight data as keys (or indices) of lookup tables without such decompression steps. It should be noted that for quantized weights, the overhead of decompression can outweigh the gain from the reduced memory footprint, as we demonstrate in the next section.
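The per-weight decompression work that GEMM must perform on bit-packed data can be sketched as follows (our illustration of the kind of work involved, not the paper's Algorithm 3 itself):

```python
def unpack_int32(word):
    """Decompress one 32-bit container of bit-packed binary weights:
    step 1 extracts each bit (shift + mask), step 2 maps {0,1} to
    {-1,+1} via 2*b - 1. Both steps run once per weight, which is the
    overhead BiQGEMM avoids by using the packed word itself as a key."""
    return [2 * ((word >> j) & 1) - 1 for j in range(32)]

w = unpack_int32(0b101)          # bits 0 and 2 set
assert w[0] == +1 and w[1] == -1 and w[2] == +1
```

Thirty-two shifts, masks, and affine maps per container, before any multiply-add is issued, is precisely the decompression cost discussed above.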
Consideration of existing hardware architecture designs is one of the keys to understanding the impact of BiQGEMM on system performance. For example, even though a large tile size for BiQGEMM would result in improved data reuse, current CPU or GPU designs allow only a limited tile size, such that a large batch size (i.e., a compute-bound workload) might be less favorable to BiQGEMM. A new BiQGEMM-aware hardware design considering the unique computational properties of BiQGEMM, thus, would be the ultimate solution. For example, if the scratchpad on a GPU could be shared by all processing units, the allowed range of tile sizes for BiQGEMM would be much wider. New hardware design dedicated to BiQGEMM is, therefore, suggested as important future work.
IV Experimental Results
We implemented our proposed algorithm BiQGEMM in C++/CUDA with various compilers targeting different machines. Table III presents descriptions of the systems where tests are performed. For tests on CPUs, the performance of BiQGEMM is compared with Intel MKL (mkl) [wang2014intel], Eigen (eigen) [eigenweb], and an algorithm introduced in [weissteinmatrix] (kCpu). Additionally, BiQGEMM is also run on a GPU and compared with cuBLAS (cublas) [nvidia2008cublas], a kernel code in the CUDA samples (kGpu) [volkov2008benchmarking], and XNOR-popcount (xnor) [courbariaux2016binarized]. We generated synthetic matrices filled with random numbers as data sets for the tests.
TABLE III: Descriptions of the tested systems.

| # of Cores | 4 | 4 | 80 (SMs) |
| L1D-cache | 64 KB/core | 32 KB/core | 128 KB/SM |
| DRAM | 8 GB | 16 GB | 16 GB |
| FLOPS | 19.36G x 4 | 57.6G x 4 | 181.87G x 4 |
| Compiler | LLVM 8.0.7 | gcc 5.4.0 | nvcc 10.2 |
| OS | Android 9 (4.14.78) | Ubuntu 16.04.6 (4.15.0) | |

(maximum memory bandwidth)
Our proposed algorithm accepts the LUT-unit $\mu$ as a user-defined parameter that can affect system performance. Let us explain how the LUT-unit is optimized in practice. $\mu$ determines the physical space allocated for lookup tables. If $\mu$ increases, the number of lookup tables decreases while the number of entries in each lookup table increases exponentially (see Fig. 4(a) and Eq. 6); i.e., there exists a trade-off between the number of LUTs and the number of entries inside each LUT. Combined with the output size $m$, the LUT-unit specifies the relative performance gain of BiQGEMM over GEMM. Specifically, from Eq. 9, for a given output size $m$, we can find the optimal $\mu$ by minimizing $(2^{\mu} + m)/\mu$. Note that in practical situations, hardware resources may limit the maximum $\mu$ (due to internal SRAM size) and thus restrict tile size as well. Theoretically optimized $\mu$ should therefore be verified empirically through extensive experiments. We use $\mu = 8$ for our entire tests, which turns out to be close to the theoretically optimized value.
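Under the cost model of Eq. 9, the per-output-element cost scales as $(2^{\mu} + m)/\mu$ for fixed $n$ and $b$, so the optimal LUT-unit can be found by a direct search. The sketch below assumes exactly that cost model (`optimal_mu` is our own helper name):

```python
def optimal_mu(m, mu_range=range(1, 17)):
    """Return the LUT-unit minimizing (2**mu + m) / mu, i.e., the sum of
    table-build cost (2**mu entries) and query cost (m lookups), both
    amortized over the mu columns each table covers."""
    return min(mu_range, key=lambda mu: (2 ** mu + m) / mu)

# For output sizes in the range of the NLP models discussed in Section
# II-C, the optimum lands in the neighborhood of 7 to 9.
best = {m: optimal_mu(m) for m in (512, 1024, 2048)}
```

The search confirms that a fixed single-digit LUT-unit is near-optimal across the matrix sizes of interest, so hard-coding one value sacrifices little.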
IV-B BiQGEMM Runtime Profiling
Fig. 8 presents the runtime portion of each operation when running BiQGEMM on a CPU with various output sizes $m$. Operations are mainly categorized into 1) lookup table construction (build), 2) retrieving values (query), and 3) memory replacement for tiling (replace). As discussed in Section III-C, increasing the output size enlarges the proportion of value retrieval, and correspondingly, more arithmetic operations in GEMM are replaced with retrieval operations in BiQGEMM. Note that even when more quantization bits are assigned to each weight, BiQGEMM increases only the retrieving operations, which are relatively inexpensive among the three (see Fig. 2). As such, when weights are quantized, BiQGEMM performs better (compared with a GEMM-based scheme) when the output size is larger and weights are quantized with more bits.
IV-C GEMM with Quantized and Bit-packed Weights
To reduce memory footprint, bit-packing is essential for quantized models so that weights are densely stored in a general data type, such as INT32. Through bit-packing with few-batch multiplication, memory-bound algorithms are accelerated by reduced memory bandwidth requirements. However, unpacking must be performed prior to running GEMM operations on packed data. Since unpacking fundamentally requires bit-level manipulations, unpacking on CPUs and GPUs may cause a large computational overhead. Indeed, Fig. 9 demonstrates these concerns. Assuming that weights are 1-bit quantized, Fig. 9 compares the runtime of three different scenarios (w/o unpack, sGEMM, and w/ unpack) depending on how the binary vectors are processed: 'sGEMM' indicates a case where only one bit is stored in each 32-bit container, while 'w/ unpack' means multiplying bit-packed data after extracting bits through the unpacking process. Since the 'sGEMM' version does not assume bit-packed data formats, quantization does not affect its performance (i.e., performance is the same as with full-precision weights). Note that 'w/o unpack' measures the runtime when bit-packed data is multiplied by a real vector without unpacking (i.e., products of one 32-bit packed scalar and a vector of length 32), which produces incorrect results but is useful to identify the performance gain from decreased memory access latency. The runtime gap between 'w/o unpack' and 'sGEMM' implies the performance gain from the reduced memory footprint, whereas the difference between 'w/o unpack' and 'w/ unpack' indicates the latency overhead of unpacking operations. Fig. 9 confirms that GEMM with quantized weights is inefficient in terms of response time even though quantization reduces DRAM access latency.
IV-D Comparison with Others
Even though a small batch size is preferred for inference to reduce response time, recently developed DNNs demand batch sizes larger than 1. For example, an input (in the form of a sequence) to a Transformer's encoder contains several sub-words (tokens) so as to detect hidden relationships between sub-words. Because all sub-words in the input are multiplied by the same weights concurrently, those sub-words are processed as a group. The number of sub-words, thus, is equivalent to the batch size in terms of computation. Accordingly, we conduct experiments using various batch sizes ranging from 32 to 256, considering the number of sub-words used for Transformers and their variants.
Since the unpacking process adds significant overhead to GEMM-based schemes with quantized bit-packed weights as shown in Section IV-C, the ‘sGEMM’ version (which stores only one bit in a 32-bit container without packing) introduced in the previous subsection is selected for comparison with BiQGEMM. The ‘sGEMM’ version does not benefit from quantization, and therefore 1-bit quantized weights and full-precision weights yield the same performance when measured using MKL (mkl) and Eigen (eigen) (thus, we do not specify whether weights are quantized in Fig. 10 for eigen and mkl). Performance of BiQGEMM is measured when weights are quantized into 1, 2, or 3 bits. Note that the runtime of both BiQGEMM and GEMM with quantized weights increases linearly with the number of quantization bits. Combined with the observation that BiQGEMM 1-bit (BiQGEMM with 1-bit quantization) shows the highest performance in Fig. 10, BiQGEMM is always faster than GEMM given the same number of quantization bits. Moreover, if the batch size is small enough, BiQGEMM with 2- or 3-bit quantization outperforms GEMM with full precision. Thus, even if latency is the top priority in the inference system design (at the expense of increased memory footprint without quantization), BiQGEMM can be the choice if the number of quantization bits is small enough with allowable model accuracy degradation.
Fig. 10 shows that when input size is fixed, a larger output size enhances the speedup of BiQGEMM significantly because of a higher reuse rate of lookup tables and a correspondingly increased number of arithmetic operations replaced by simple table lookups. A large batch size, on the other hand, may have adverse effects on BiQGEMM performed by CPUs or GPUs. Fig. 10 shows that BiQGEMM can be slower than GEMM if the batch size and the number of quantization bits are beyond certain threshold values. In theory, since the time complexity of BiQGEMM is given as $\mathcal{O}\big((2^{\mu}+qm)\cdot(n/\mu)\cdot b\big)$ and the sub-vector length $\mu$ is empirically optimized to be 8, BiQGEMM with under-8-bit quantization is supposed to be always faster than GEMM (of full precision) regardless of batch size. However, in reality, we need to consider limiting factors due to available hardware resources, as discussed in Section III. If the batch size increases and data reuse improves correspondingly, then the overall computational efficiency improvement of mkl and eigen can be higher than that of BiQGEMM. For example, when the batch size exceeds 128 in Fig. 10(a), eigen and mkl are faster than BiQGEMM with 3-bit quantization. The specific batch size determining whether BiQGEMM can be faster than GEMM depends on the system configuration. For example, in the case of a mobile CPU with low computational power (see Table III), BiQGEMM outperforms full-precision GEMM even when the batch size becomes larger compared to the case of a PC, as described in Fig. 10(b).
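This break-even argument can be checked with a quick operation count. The cost model below (table build plus table queries, with sub-vector length `mu = 8`) is our assumed reading of the complexity discussion, used only for illustration, not a measured runtime model:

```python
def gemm_macs(m, n, b):
    """Multiply-accumulate count of a full-precision GEMM
    (m: output size, n: input size, b: batch size)."""
    return m * n * b

def biqgemm_ops(m, n, b, q, mu=8):
    """Assumed cost model: build 2^mu table entries per sub-vector per batch,
    then q * m table queries per sub-vector per batch (no multiplications)."""
    build = (2 ** mu) * (n // mu) * b
    query = q * m * (n // mu) * b
    return build + query
```

For a large output size the build term is amortized, so the query term, roughly a fraction q/mu of the GEMM multiply count, dominates; with q = 3 and mu = 8 that is 3/8 of the GEMM work, while at q = mu = 8 the two models break even before counting the build overhead.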
Even though the experimental results in Fig. 10 assume only one thread, multithreading improves the performance of both BiQGEMM and GEMM almost linearly, since both can be parallelized by tiling techniques.
IV-E Experiments with GPU
On GPU, we compare BiQGEMM with kGpu, cuBLAS, and xnor. Both cublas and kGpu assume that each quantized weight occupies only 1 bit in a 32-bit container (with 31 bits stored unnecessarily); i.e., bit-packing is not considered because unpacking is as slow as sGEMM. In the case of xnor, activations are quantized as well, such that matrix multiplication is mainly computed by XNOR and popcount operations without an unpacking procedure. Assuming weights and activations are $q$- and $p$-bit quantized, xnor shows a time complexity of $\mathcal{O}(qp\cdot mnb/32)$, where $m$, $n$, and $b$ are output size, input size, and batch size, respectively. Although activation quantization can simplify computations further, it demands training algorithm modifications and computational overhead during inference, as discussed in Section II, at the cost of a model accuracy drop.
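For 1-bit weights and activations, the XNOR-and-popcount principle can be sketched as follows (a generic illustration of the technique, not the xnor kernel itself), with each ±1 value stored as a sign bit:

```python
def xnor_popcount_dot(a_word, w_word, width=32):
    """Dot product of two {-1, +1} vectors packed as sign bits (1 -> +1).
    XNOR marks positions where the signs match; each match contributes +1
    and each mismatch -1, hence the result is 2 * popcount - width."""
    mask = (1 << width) - 1
    xnor = ~(a_word ^ w_word) & mask
    return 2 * bin(xnor).count("1") - width
```

One XNOR plus one popcount thus replaces `width` multiply-accumulates, which is the source of the 1/32 factor in the complexity above when 32-bit words are used.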
cublas is provided in library form by the chip vendor, so we select kGpu as the baseline that we modify to implement BiQGEMM (for xnor, we use a publicly available code that is also based on kGpu). Table IV shows runtime with various matrix sizes and batch sizes when each weight is 1-bit quantized (for xnor, activations are also 1-bit quantized). The difference in performance between BiQGEMM and kGpu represents the gain from the reduced bandwidth requirement and from improved computational principles. As shown in Table IV, BiQGEMM is faster than kGpu by 1.08 to 30.42 times (as the weight matrix size increases and the batch size decreases, BiQGEMM becomes relatively faster). Note that if the batch size is small enough (to be memory-bound), BiQGEMM presents the best performance even compared with xnor.
V Conclusion

We proposed an efficient matrix multiplication technique dedicated to quantized neural networks. When weights are quantized, the available output space of computational results is quite limited, such that for a large matrix multiplication, many computations become redundant. BiQGEMM removes such redundancy by replacing multiplications with table lookups. Moreover, because commercial processors enable only a fixed data transfer width, much memory bandwidth can be wasted when weights are quantized into a few bits. BiQGEMM provides a way to access multiple quantized weights simultaneously regardless of the number of quantization bits. Hence, memory bandwidth utilization is enhanced significantly by BiQGEMM while the required memory bandwidth is reduced by quantization. We demonstrated that BiQGEMM is much faster than previous matrix multiplication schemes, especially when the matrix size is large and the batch size is small.