1 Introduction
With the tremendous success of deep learning and deep neural networks (DNNs), there is an urgent need to deploy inference models onto edge-computing platforms/devices. As a major type of model compression technique, DNN quantization has become an essential method to reduce the computation, memory, and storage requirements of on-device inference, especially for platforms capable of customized architecture design, such as FPGA devices and ASIC chips. Generally speaking, DNN quantization learns DNN models in low-bitwidth representation with accuracy close to that of the full-precision models, while accelerating inference.
Various quantization schemes have been investigated, including binary [5, 6, 26, 22], ternary [20, 15, 39], Fixed-point (Fixed) [38, 4, 12, 16, 3, 11], Power-of-Two (PoT) [24, 37, 19, 36], Additive Power-of-Two (APoT) [21], etc. These schemes differ in accuracy and hardware performance. Binary and ternary quantization significantly reduce computation by eliminating multiplication operations, but in general suffer relatively large accuracy loss. Low-bitwidth Fixed quantization has better accuracy; for example, 4-bit Fixed can achieve negligible accuracy loss compared with its 32-bit floating-point counterpart, although it still requires multiplication operations during inference computation.
Different from binary, ternary, and Fixed, PoT is a non-linear quantization scheme: with quantization levels as power-of-two numbers, multiplications can be replaced with bit-shifting operations, thereby reducing computation and speeding up inference. However, PoT still incurs moderate accuracy loss, due to the rigid resolution issue [21]: it exhibits high resolution around the mean but low resolution at the tails of the weight distribution, which cannot be resolved even with a higher bitwidth. Therefore, APoT was proposed, representing quantization levels as the sum of multiple power-of-two numbers, overcoming the rigid resolution issue of PoT while still avoiding multiplication operations.
While exploring various quantization schemes, researchers found that the first and the last layers are important for preserving accuracy; therefore, in the above-mentioned works, the first and the last layers are either left unquantized (in 32-bit floating-point) or quantized with no fewer than 8 bits. Motivated by this, layer-wise multi-precision quantization has been extensively investigated in [31, 33, 29, 10, 9, 34, 28], where various methods/algorithms deal with the large search space of precision assignment for both weights and activations in each layer.
This work proposes a novel DNN quantization framework, namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach. Figure 1 illustrates the proposed quantization framework on the DNN (a) weight tensor and (b) weight matrix. Specifically, each filter in the weight tensor or each row in the weight matrix is assigned a combination of quantization scheme and precision. The candidate schemes and precisions are derived practically to facilitate hardware implementation, and effectively to speed up inference with the given resources on the hardware platforms/devices, while preserving accuracy comparable to the unquantized (32-bit floating-point) models. This highly hardware-informative quantization strategy significantly reduces the search space of the DNN quantization problem, making our framework distinct from existing multi-precision quantization works.
This is the first effort to apply mixed quantization schemes and multiple precisions within layers, targeting simplified operations in hardware inference while preserving accuracy. Specifically, two quantization schemes, i.e., Power-of-Two (PoT) and Fixed-point (Fixed), and two precisions, i.e., 4-bit and 8-bit, are adopted and explored for quantization of weights and activations, to reduce inference computation and preserve accuracy. Furthermore, as demonstrated by previous works that either implicitly use higher precisions for the first and last layers or explicitly assign multiple precisions to different layers, it is essential to use higher precisions for at least parts of the model to boost the accuracy of quantized models close to that of the full-precision models. However, different from existing works, this paper makes the observation that this does not necessarily relate to layer-wise sensitivity; instead, as long as a certain portion of the weights in every layer use higher precision, the quantization error can be mitigated. This observation enables layer-wise uniformity for practical hardware implementation of quantized models, while still enjoying row-wise flexibility of mixed schemes and multiple precisions. The contributions of our quantization framework are summarized as follows:

A novel row-wise mixed-scheme and multi-precision quantization approach.

A highly hardware-informative solution strategy that significantly reduces the problem search space.

The best accuracy performance under the same equivalent precision as existing works.

Significant inference speedup on real devices, compared with network-wise uniform low-bit quantization, i.e., the speed upper bound of the popular layer-wise multi-precision approaches.
2 Related Work
2.1 Quantization Scheme
2.1.1 Linear Quantization Schemes
Linear quantization schemes with uniform quantization levels include binary, ternary, and fixed-point. Both binary and ternary quantization use extremely low precision for DNN models, i.e., binarized (with values of ±1) or ternarized (with values of 0, ±1) levels. Representative binary quantization methods include BinaryConnect [5], Binarized Neural Network (BNN) [6], XNOR-Net [26], and ABC-Net [22]. With weights constrained to ±1, multiplications can be replaced by additions/subtractions, which can even be eliminated by using XNOR and AND operations if activations are quantized to binary as well. Ternary quantization schemes are implemented in TWN [20], TTQ [39], and [15]. Ternary networks also benefit from multiplication-free operations. Although binary and ternary quantization can significantly reduce computation by simplifying operations, they introduce relatively large accuracy loss: in the above studies, accuracy typically degrades notably under binary quantization, and less severely under ternary.

Fixed-point (Fixed) quantization schemes use more bits than binary and ternary to preserve accuracy, and have been implemented with different methods/algorithms. DoReFa-Net [38] first explored this direction by introducing a hyperbolic tangent transformation of weights and activations, with scaling factors to minimize quantization error. PACT [4] improved this method by adding a parameterized clipping threshold on activations. DSQ [12] developed differentiable soft quantization, which evolves the training method to gradually approximate the uniform quantizer. QIL [16] parameterized the quantization interval and trained it with the task loss, avoiding access to the original training data. L2Q [3] introduced a data-distribution loss during training to minimize quantization error. LSQ [11] proposed a differentiable method to learn the quantizer for each layer jointly with the parameters. Generally, 4-bit fixed-point quantized models have negligible accuracy loss compared with 32-bit floating-point models, although they cannot get rid of the expensive multiplications.
2.1.2 NonLinear Quantization Schemes
Non-linear quantization schemes use non-uniform quantization levels, such as the Power-of-Two (PoT) scheme, where quantization levels are power-of-two numbers. The multiplication of a weight (in PoT) and an input (in Fixed) can then be replaced with a bit-shifting operation; therefore, PoT can achieve higher speedup than the Fixed scheme for DNN inference. PoT quantization was first adopted in LogQuant [24], and then combined with model accuracy enhancement techniques in INQ [37], extremely low-bit neural networks with ADMM [19], and LQ-Nets [36]. However, PoT suffers from the rigid resolution phenomenon and cannot attain higher model accuracy even with increased bitwidth. In particular, 4-bit PoT quantization results in noticeable accuracy loss.
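As a concrete illustration (ours, not from the cited works), multiplying an integer fixed-point activation by a PoT weight ±2^k reduces to an arithmetic shift; the hypothetical helper below demonstrates the equivalence on integers:

```python
def pot_multiply(activation: int, exponent: int, sign: int = 1) -> int:
    """Multiply an integer activation by a PoT weight sign * 2**exponent
    using shifts instead of a hardware multiplier."""
    if exponent >= 0:
        result = activation << exponent       # multiply by 2**exponent
    else:
        result = activation >> -exponent      # floor-divide by 2**(-exponent)
    return -result if sign < 0 else result

# A weight of +2**3 = 8 on activation 5: a left shift gives the same product.
assert pot_multiply(5, 3) == 5 * 8
# A weight of -2**-2 = -0.25 on activation 64: right shift by 2, then negate.
assert pot_multiply(64, -2, sign=-1) == -16
```

This is why a PoT GEMM core needs only shifters and adders, while a Fixed GEMM core needs full multipliers.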
To overcome this limitation, Additive Power-of-Two (APoT) [21] represents quantization levels as the sum of multiple power-of-two numbers to mitigate the accuracy loss. The multiplication of an APoT-quantized weight and a Fixed input can then be replaced with bit-shifting operations and additions. Furthermore, MSQ [2] leverages a mixture of an APoT variant and Fixed to maximize resource utilization on FPGA devices; in this work, we instead use mixed schemes of PoT and Fixed, achieving even higher accuracy and further reducing inference computation. All of the above works use single-precision quantization.
2.2 SinglePrecision vs MultiPrecision
A common practice in the above-mentioned quantization works is to keep the first and last layers unquantized (32-bit floating-point for weights/activations, i.e., W32A32) or quantized with no fewer than 8 bits, in order to preserve accuracy; this was first proposed in [13]. Spurred by that, another line of works [31, 33, 29, 10, 9, 34, 28] explores layer-wise multi-precision quantization to boost the accuracy of quantized models.
To deal with the large search space of layer-wise precision assignment for both weights and activations, different methods/algorithms have been leveraged. HAQ [31] uses reinforcement learning to train an agent that decides the bitwidth of each layer. DNAS [33] proposes a multi-precision quantization-aware training technique, where the bitwidth search is converted into a network architecture search. Mixed Precision DNNs [29] shows that bitwidths can be determined by learnable parameters, whose gradients are estimated using the Straight-Through Estimator (STE). HAWQ [10], HAWQ-V2 [9], and PyHessian [34] solve the Hessian matrix to determine the bitwidth for each layer; the general idea is to assign more bits to layers that are sensitive to quantization error. Q-BERT [28] also applies Hessian-based multi-precision quantization to the BERT [7] model for natural language processing.

There are two frameworks applying multi-precision within layers. RVQuant [25] quantizes a small percentage of weights with higher precision irregularly within layers, and therefore cannot achieve inference speedup compared with network-wise uniform low-bit quantization. AutoQ [23] trains a reinforcement learning agent to assign precisions down to the kernel level, which is computationally expensive. Neither of them considers mixed schemes.
3 Proposed RMSMP Quantization
This section discusses the proposed RMSMP quantization framework with a highly hardware-informative strategy, introducing the scheme selection, the precision exploration, and the algorithm solution.
3.1 Mixed Schemes for Simplified Operations
The Fixed-point (Fixed) quantization scheme has superior accuracy, while Power-of-Two (PoT) is the most computationally efficient quantization scheme (with still acceptable accuracy) for speeding up inference, since multiplications can be replaced by bit-shifting operations. Therefore, this work proposes a novel row-wise mixed-scheme quantization approach with Fixed for preserving accuracy and PoT for reducing inference computation. Specifically, for each row in the weight matrix or each filter in the weight tensor, the weights are quantized into either the Fixed scheme or the PoT scheme. A row-wise scheme assignment, instead of a layer-wise one, is used because hardware inference is executed layer by layer on the same pieces of computing resource, i.e., one GEMM (general matrix multiply) core for processing Fixed weights and another GEMM core for processing PoT weights. (Note that under PoT, activations are still quantized into Fixed to support the bit-shifting operation that replaces multiplication.)
The ratio of the two schemes is fixed based on the hardware characteristics and is the same across layers to keep the layer-wise uniformity in the hardware design, so that inference speedup can be obtained in the layer-by-layer execution. On ASIC chips, it is desirable to maximize the portion of PoT-quantized rows to the extent that it does not cause accuracy loss. On FPGA devices, there are heterogeneous computing resources, i.e., DSPs to implement the Fixed GEMM core and LUTs to implement the PoT GEMM core. Therefore, the ratio of the two schemes can be determined OFFLINE, such that LUT utilization is high (along with 100% DSP utilization) to speed up inference, while accuracy is not hurt.
In the training algorithm of our RMSMP quantization, the row-wise scheme assignment is based on the weight distribution of each particular row: if the weight distribution has a smaller variance, the PoT scheme is preferred; otherwise, the Fixed scheme is preferred. A threshold on the variance is set according to the offline-determined ratio of the two schemes.
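A minimal sketch of this variance-based assignment rule (our illustration, not the paper's released code), assuming the PoT ratio has been fixed offline:

```python
def assign_schemes(rows, pot_ratio):
    """Assign 'PoT' to the pot_ratio fraction of rows with the smallest
    weight variance, and 'Fixed' to the rest."""
    def variance(row):
        mean = sum(row) / len(row)
        return sum((w - mean) ** 2 for w in row) / len(row)

    order = sorted(range(len(rows)), key=lambda i: variance(rows[i]))
    n_pot = int(round(pot_ratio * len(rows)))
    schemes = ['Fixed'] * len(rows)
    for i in order[:n_pot]:     # lowest-variance rows get PoT
        schemes[i] = 'PoT'
    return schemes

rows = [[0.01, -0.02, 0.01], [0.5, -0.6, 0.7],
        [0.02, 0.0, -0.01], [0.4, -0.3, 0.5]]
print(assign_schemes(rows, pot_ratio=0.5))  # ['PoT', 'Fixed', 'PoT', 'Fixed']
```

Sorting by variance makes the ratio itself the threshold, matching the offline-determined scheme proportion exactly.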
3.1.1 Fixed-Point (Fixed) Quantizer

The Fixed scheme uses uniform quantization levels, with the quantized weights given by a scaling factor times the quantization levels. With b-bit Fixed, the quantized weights come from the following set:

    Q^F(α, b) = α × { 0, ±1/(2^(b−1)−1), ±2/(2^(b−1)−1), …, ±1 },    (1)

where α denotes the scaling factor. Then the quantizer function from a 32-bit floating-point weight w to a Fixed quantized weight is

    Q^F(w) = α · round( clip(w/α) · (2^(b−1)−1) ) / (2^(b−1)−1),    (2)

where round(·) rounds its argument to the nearest integer, and clip(·) clips a value into the range [−1, 1] according to

    clip(x) = max(−1, min(x, 1)).    (3)
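Under this formulation, a b-bit Fixed quantizer can be sketched as follows (our illustration; the scaling factor α is assumed given):

```python
def fixed_quantize(w: float, alpha: float, b: int) -> float:
    """Quantize a weight to the nearest level in
    alpha * {0, ±1/(2^(b-1)-1), ..., ±1} (Eqs. 1-3)."""
    levels = 2 ** (b - 1) - 1                   # e.g., 7 positive levels for b = 4
    x = max(-1.0, min(w / alpha, 1.0))          # clip into [-1, 1], Eq. 3
    return alpha * round(x * levels) / levels   # round to the nearest level, Eq. 2

# With alpha = 1 and b = 4, 0.33 maps to round(0.33 * 7)/7 = 2/7.
print(fixed_quantize(0.33, 1.0, 4))  # ≈ 0.2857
```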
3.1.2 Power-of-Two (PoT) Quantizer

The PoT scheme uses power-of-two numbers as quantization levels. With b-bit PoT, the quantized weights come from the following set:

    Q^P(α, b) = α × { 0, ±2^(−2^(b−1)+1), ±2^(−2^(b−1)+2), …, ±2^(−1), ±1 }.    (4)

Then the quantizer function from a 32-bit floating-point weight w to a PoT quantized weight is

    Q^P(w) = α · Π( clip(w/α) ),    (5)

where clip(·) is defined in Eq. 3 and Π(·) projects its argument onto the nearest level in Q^P(1, b).
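Analogously, the PoT quantizer of Eqs. 4-5 can be sketched as a nearest-level projection (our illustration, assuming a given α):

```python
def pot_quantize(w: float, alpha: float, b: int) -> float:
    """Project w/alpha onto the nearest level in {0} ∪ {±2^e} for
    e in [-2^(b-1)+1, 0], then rescale by alpha (Eqs. 4-5)."""
    exp_min = -(2 ** (b - 1)) + 1
    levels = [0.0] + [s * 2.0 ** e for e in range(exp_min, 1) for s in (1, -1)]
    x = max(-1.0, min(w / alpha, 1.0))                    # clip into [-1, 1]
    return alpha * min(levels, key=lambda q: abs(q - x))  # nearest-level projection

# With alpha = 1 and b = 4, the levels are {0, ±1/128, ±1/64, ..., ±1/2, ±1}.
print(pot_quantize(0.3, 1.0, 4))  # 0.25, the nearest power of two
```

The clustering of levels near zero in this list is exactly the rigid resolution issue: half of all levels lie below 1/64.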
3.2 Multiple Precisions for Boosting Accuracy
Motivated by previous works that either implicitly use higher precisions for the first and last layers or explicitly assign multiple precisions to different layers, it is essential to use higher precisions for some parts of the model to boost accuracy. However, the layer-wise multi-precision approach is not compatible with layer-by-layer inference execution on the same pieces of hardware. In other words, the layer-wise multi-precision approach incurs large implementation overhead, which may neutralize the expected inference speedup.
Figure 2 demonstrates the proposed multi-precision approach aligned with the row-wise mixed schemes. The majority of the rows use 4-bit precision for weights/activations, i.e., PoT-W4A4 and Fixed-W4A4, because 2-bit incurs large accuracy loss and 3-bit is not suitable for hardware implementation, which prefers operands of 2, 4, 8 bits, etc. To boost accuracy, a higher precision with 8-bit weights and 4-bit activations is used for the Fixed scheme, i.e., Fixed-W8A4. The PoT scheme is not given the higher precision because of its rigid resolution issue. The ratio of PoT-W4A4 : Fixed-W4A4 : Fixed-W8A4 can be determined offline and is the same across layers, in order to keep the layer-wise uniformity in the hardware implementation, such that little overhead is incurred during the layer-by-layer inference execution and inference speedup can be guaranteed. In general, a larger portion of PoT-W4A4 helps with inference speedup but degrades accuracy. By using a small portion of Fixed-W8A4, the accuracy degradation due to the PoT scheme can be mitigated, as shown in Figure 3. This work fixes the percentage of Fixed-W8A4 at 5%, as this is enough to mitigate the accuracy degradation, while an even smaller percentage would yield only marginal further inference speedup.
Figure 3: The effect of the PoT-W4A4 ratio on the accuracy of ResNet-50, ResNet-18, and MobileNetV2 models on the CIFAR-10, CIFAR-100, and ImageNet datasets. In the top row, only Fixed-W4A4 is mixed with PoT-W4A4; in the bottom row, 5% Fixed-W8A4 is used in addition to PoT-W4A4 and Fixed-W4A4.

3.3 RMSMP Quantization Algorithm
Table 1: Top-1/Top-5 accuracy (%) comparison of quantization methods with MobileNetV2, ResNet-18, and ResNet-50 on CIFAR-10, CIFAR-100, and ImageNet.

| Quantization Method | MobileNetV2 Top-1 | MobileNetV2 Top-5 | ResNet-18 Top-1 | ResNet-18 Top-5 | ResNet-50 Top-1 | ResNet-50 Top-5 |
|---|---|---|---|---|---|---|
| CIFAR-10 | | | | | | |
| Baseline (W32A32) | 92.51 | – | 93.62 | – | 93.91 | – |
| Fixed-W4A4 | 91.76 (92.34) | – | 92.9 (93.43) | – | 93.16 (93.73) | – |
| PoT-W4A4 | 90.92 (91.34) | – | 92.14 (92.97) | – | 92.53 (93.15) | – |
| APoT-W4A4 [21] | 91.83 (92.72) | – | 92.94 (93.47) | – | 92.87 (93.68) | – |
| PoT-W4A4 + Fixed-W4A4 | 91.38 (91.77) | – | 92.66 (93.21) | – | 93.14 (93.76) | – |
| APoT-W4A4 + Fixed-W4A4 [2] | 91.99 (92.55) | – | 92.98 (93.65) | – | 93.22 (93.77) | – |
| Fixed-W4A4 + Fixed-W8A4 | 92.54 | – | 93.69 | – | 94.11 | – |
| RMSMP | 92.58 | – | 93.72 | – | 94.03 | – |
| CIFAR-100 | | | | | | |
| Baseline (W32A32) | 71.48 | 91.98 | 74.49 | 92.70 | 77.77 | 94.00 |
| Fixed-W4A4 | 70.22 (71.16) | 90.88 (91.63) | 73.88 (74.37) | 91.72 (92.31) | 76.67 (77.41) | 93.22 (93.85) |
| PoT-W4A4 | 67.11 (68.68) | 89.21 (90.06) | 72.97 (73.88) | 91.65 (92.14) | 74.58 (76.83) | 92.22 (92.74) |
| APoT-W4A4 [21] | 70.21 (71.13) | 90.85 (91.69) | 73.97 (74.33) | 92.03 (92.49) | 76.58 (77.56) | 92.94 (93.55) |
| PoT-W4A4 + Fixed-W4A4 | 68.94 (70.37) | 90.05 (90.89) | 73.41 (74.12) | 91.68 (92.25) | 76.95 (77.28) | 92.87 (93.66) |
| APoT-W4A4 + Fixed-W4A4 [2] | 70.25 (71.50) | 90.92 (91.82) | 74.03 (74.60) | 92.05 (92.63) | 76.97 (77.31) | 92.99 (93.86) |
| Fixed-W4A4 + Fixed-W8A4 | 71.52 | 91.99 | 74.54 | 92.61 | 77.92 | 94.18 |
| RMSMP | 71.51 | 91.97 | 74.61 | 92.69 | 77.85 | 94.13 |
| ImageNet | | | | | | |
| Baseline (W32A32) | 71.88 | 90.29 | 70.25 | 89.48 | 76.51 | 93.09 |
| Fixed-W4A4 | 67.23 (69.26) | 86.03 (88.18) | 68.66 (69.72) | 87.54 (88.67) | 75.58 (76.22) | 92.01 (92.43) |
| PoT-W4A4 | 65.88 (67.93) | 84.83 (86.63) | 67.11 (68.20) | 85.93 (87.14) | 74.77 (75.03) | 91.66 (92.22) |
| APoT-W4A4 [21] | 67.76 (68.97) | 86.42 (88.17) | 68.48 (70.70) | 87.92 (89.60) | 75.49 (76.60) | 92.93 (93.10) |
| PoT-W4A4 + Fixed-W4A4 | 66.20 (68.54) | 85.66 (87.65) | 67.98 (68.94) | 86.75 (88.66) | 75.85 (76.19) | 91.97 (92.77) |
| APoT-W4A4 + Fixed-W4A4 [2] | 67.31 (68.99) | 86.11 (88.04) | 69.22 (70.27) | 88.33 (89.42) | 76.11 (76.22) | 92.58 (92.86) |
| Fixed-W4A4 + Fixed-W8A4 | 68.98 | 89.04 | 70.48 | 89.52 | 76.68 | 93.44 |
| RMSMP | 69.02 | 89.07 | 70.73 | 89.62 | 76.62 | 93.36 |
The RMSMP quantization algorithm can train a DNN model from scratch or quantize a pre-trained model, such that for each layer, the numbers of filters quantized into PoT-W4A4, Fixed-W4A4, and Fixed-W8A4 follow a predefined ratio r1 : r2 : r3, where r1 + r2 + r3 = 1.
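As a toy illustration (ours) of turning such a predefined ratio into per-layer filter counts, with rounding remainders given to Fixed-W4A4 as an arbitrary choice:

```python
def filter_counts(num_filters: int, ratio=(0.60, 0.35, 0.05)):
    """Split a layer's filters among (PoT-W4A4, Fixed-W4A4, Fixed-W8A4)
    according to a predefined ratio; rounding remainders go to Fixed-W4A4."""
    n_pot = int(num_filters * ratio[0])
    n_w8 = int(num_filters * ratio[2])
    n_fixed = num_filters - n_pot - n_w8   # absorb remainders
    return n_pot, n_fixed, n_w8

print(filter_counts(64))  # (38, 23, 3) for the 60:35:5 ratio
```

Because the ratio is the same for every layer, the per-layer counts scale with the filter count alone, which is what keeps the hardware schedule uniform across layers.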
For the assignment of quantization schemes and precisions to the filters of each layer, we use a Hessian-based method to determine which filters should use Fixed-W8A4 (the higher precision). For the remaining filters, we choose between PoT-W4A4 and Fixed-W4A4 based on the variance of the weights in each filter. Once the assignment of quantization scheme and precision (PoT-W4A4, Fixed-W4A4, or Fixed-W8A4) is determined down to the filter level for each layer, the Straight-Through Estimator (STE) [1, 35]

    ∂L/∂w = ∂L/∂Q(w)    (6)

can be used to quantize the model into the specified schemes and precisions, solving the unavailable-gradient issue during backpropagation through the quantization from continuous to discrete values.
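The STE can be illustrated numerically (our sketch, using a stand-in uniform quantizer and a one-parameter squared loss): the forward pass uses the quantized weight, while the backward pass passes the gradient straight through, as in Eq. 6:

```python
def quantize(w: float, step: float = 0.25) -> float:
    """Stand-in quantizer: round w to the nearest multiple of `step`."""
    return step * round(w / step)

def ste_grad(w: float, x: float, y: float):
    """Forward: loss uses the quantized weight w_hat.
    Backward (STE): treat d w_hat / d w as 1, so dL/dw = dL/dw_hat (Eq. 6)."""
    w_hat = quantize(w)
    loss = (w_hat * x - y) ** 2
    grad = 2.0 * (w_hat * x - y) * x   # exact grad w.r.t. w_hat, passed through to w
    return loss, grad

loss, grad = ste_grad(w=0.6, x=1.0, y=0.0)
print(loss, grad)  # w_hat = 0.5, so loss = 0.25 and grad = 1.0
```

Without the straight-through assumption, the true derivative of the piecewise-constant quantizer would be zero almost everywhere and no weight would ever update.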
To determine the filters with the higher precision (Fixed-W8A4), we extend the Hessian-based method in HAWQ [10]: by computing the max eigenvalue of the Hessian matrix of each filter, the filters with the top 5% eigenvalues are assigned the higher precision. The Hessian matrix consists of the second-order derivatives with respect to the weight parameters:

    H = ∂²L/∂w² = ∂(∂L/∂w)/∂w = ∂g/∂w,    (7)

where the first-order gradient g = ∂L/∂w can be calculated through the normal backpropagation process.
To solve for the max eigenvalue λ of the Hessian matrix H efficiently, we employ the power iteration method as proposed in [10, 28]. By introducing an auxiliary variable v, we loop as follows:

    v ← Hv / ‖Hv‖,  with  Hv = ∂(gᵀv)/∂w,    (8)

so that v converges to the eigenvector of the max eigenvalue; the last equality holds because v is independent of w, as proved in [10]. With Eq. 8, we can compute Hv by simple backpropagation of gᵀv, and thereby solve for λ ≈ vᵀHv.
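The loop in Eq. 8 is ordinary power iteration; the self-contained numeric sketch below (ours) uses an explicit symmetric matrix, whereas in the actual algorithm the product Hv would be obtained by backpropagating gᵀv:

```python
import math
import random

def max_eigenvalue(H, iters=50, seed=0):
    """Estimate the largest-magnitude eigenvalue of a symmetric matrix H
    by power iteration: v <- Hv / ||Hv||, then lambda ≈ v^T H v."""
    rng = random.Random(seed)
    n = len(H)
    v = [rng.random() for _ in range(n)]          # random starting vector
    for _ in range(iters):
        hv = [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]                # normalize, Eq. 8
    hv = [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * hv[i] for i in range(n))    # Rayleigh quotient v^T H v

H = [[2.0, 0.0], [0.0, 1.0]]   # eigenvalues 2 and 1
print(round(max_eigenvalue(H), 6))  # 2.0
```

Each iteration needs only one Hessian-vector product, so the Hessian itself is never materialized, which is what makes the per-filter eigenvalue computation affordable.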
Furthermore, the scheme assignment for the remaining filters is determined based on the variances of the filters. Finally, after the filters are assigned their quantization schemes and precisions, we train the quantized model through the STE. Note that we update the filter assignments every 10 epochs. The detailed steps are shown in Algorithm 1.
4 Evaluation
4.1 Experiment Setup
Table 2: Comparison of quantization methods for ResNet-18 on ImageNet.

| Method | Approach | Bit-Width | First/Last Layer | Top-1 (%) | Top-5 (%) |
|---|---|---|---|---|---|
| Baseline | – | W32A32 | – | 70.25 | 89.48 |
| DoReFa [38] | Linear | W4A4 | – | 68.10 | 88.10 |
| PACT [4] | Linear | W4A4 | – | 69.20 | 89.00 |
| DSQ [12] | Linear | W4A4 | – | 69.56 | N/A |
| QIL [16] | Linear | W4A4 | – | 70.10 | N/A |
| L2Q [3] | Linear | W4A4 | – | 65.92 | 86.72 |
| APoT [21] | Non-Lin. | W4A4 | 8-bit / 8-bit | 70.70 | 89.60 |
| LQ-Nets [36] | Non-Lin. | W4A4 | – | 69.30 | 88.80 |
| DNAS [33] | MP-Lin. | Mixed | – | 70.64 | N/A |
| MP-DNN [29] | MP-Lin. | Mixed | ✓ / ✓ | 70.08 | N/A |
| MSQ [2] | MS | W4A4 | – | 70.27 | 89.42 |
| RMSMP (Ours) | MP-MS | W4A4* | ✓ / ✓ | 70.73 | 89.62 |

Table 3: Comparison of quantization methods for ResNet-50 on ImageNet.

| Method | Approach | Bit-Width | First/Last Layer | Top-1 (%) | Top-5 (%) |
|---|---|---|---|---|---|
| Baseline | – | W32A32 | – | 76.51 | 93.09 |
| DoReFa [38] | Linear | W4A4 | – | 71.40 | 88.10 |
| PACT [4] | Linear | W4A4 | – | 76.50 | 93.30 |
| APoT [21] | Non-Lin. | W4A4 | 8-bit / 8-bit | 76.60 | 93.10 |
| LQ-Nets [36] | Non-Lin. | W4A4 | – | 75.40 | 92.40 |
| HAQ [31] | MP-Lin. | Mixed | 8-bit / ✓ | 76.15 | 92.89 |
| MSQ [2] | MS | W4A4 | – | 76.22 | 92.86 |
| RMSMP (Ours) | MP-MS | W4A4* | ✓ / ✓ | 76.62 | 93.36 |

Table 4: Comparison of quantization methods for MobileNetV2 on ImageNet.

| Method | Approach | Bit-Width | First/Last Layer | Top-1 (%) | Top-5 (%) |
|---|---|---|---|---|---|
| Baseline | – | W32A32 | – | 71.88 | 90.29 |
| PACT [4] | Linear | W4A4 | – | 61.40 | N/A |
| DSQ [12] | Non-Lin. | W4A4 | – | 64.80 | N/A |
| HAQ [31] | MP-Lin. | Mixed | 8-bit / ✓ | 67.01 | 87.46 |
| MSQ [2] | MS | W4A4 | – | 68.99 | 88.04 |
| RMSMP (Ours) | MP-MS | W4A4* | ✓ / ✓ | 69.02 | 89.07 |
Table 5: Comparison of quantization methods for BERT on the SST-2 and MNLI datasets.

| Method | Approach | Bit-Width | Evaluation Metric | Result (%) |
|---|---|---|---|---|
| BERT on SST-2 | | | | |
| Baseline | – | W32A32 | Accuracy | 93.00 |
| Fixed | Linear | W4A4 | Accuracy | 92.78 |
| PoT | Non-Lin. | W4A4 | Accuracy | 92.43 |
| PoT + Fixed | Mixed | W4A4 | Accuracy | 92.78 |
| Q-BERT [28] | MP-Lin. | Mixed | Accuracy | 92.66 |
| RMSMP (Ours) | MP-MS | W4A4* | Accuracy | 92.87 |
| BERT on MNLI | | | | |
| Baseline | – | W32A32 | mm-Acc. | 84.90 |
| Fixed | Linear | W4A4 | mm-Acc. | 84.71 |
| PoT | Non-Lin. | W4A4 | mm-Acc. | 84.79 |
| PoT + Fixed | Mixed | W4A4 | mm-Acc. | 84.51 |
| Q-BERT [28] | MP-Lin. | Mixed | mm-Acc. | 84.17 |
| RMSMP (Ours) | MP-MS | W4A4* | mm-Acc. | 84.83 |

The evaluation metric for SST-2 is accuracy, while that for MNLI is mismatched accuracy (mm-Acc.). For both metrics, a higher value is better.
Table 6: Accuracy and FPGA hardware efficiency comparison for ResNet-18 on ImageNet, on Zynq XC7Z020 and XC7Z045 devices.

| Quantization Method | PoT-W4A4 : Fixed-W4A4 : Fixed-W8A4 | First/Last Layer Quant. | Top-1 (%) | Top-5 (%) | XC7Z020 LUT | XC7Z020 DSP | XC7Z020 Throughput (GOP/s) | XC7Z020 Latency (ms) | XC7Z045 LUT | XC7Z045 DSP | XC7Z045 Throughput (GOP/s) | XC7Z045 Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (1) Fixed | 0:100:0 | 8-bit Fixed | 69.72 | 88.67 | 26% | 100% | 29.6 | 122.6 | 21% | 100% | 115.6 | 31.4 |
| (2) Fixed | 0:100:0 | ✓ | 68.66 | 87.54 | 23% | 100% | 36.5 | 99.3 | 19% | 100% | 142.7 | 25.4 |
| (3) PoT | 100:0:0 | 8-bit Fixed | 68.20 | 87.14 | 41% | 100% | 62.4 | 58.1 | 40% | 100% | 290.5 | 12.5 |
| (4) PoT | 100:0:0 | ✓ | 67.11 | 85.93 | 43% | 12% | 72.2 | 50.2 | 43% | 3% | 352.6 | 10.3 |
| (5) PoT + Fixed | 50:50:0 | 8-bit Fixed | 68.94 | 88.66 | 50% | 100% | 50.3 | 72.0 | 48% | 100% | 196.8 | 18.4 |
| (6) PoT + Fixed | 50:50:0 | ✓ | 67.98 | 86.75 | 46% | 100% | 75.8 | 47.8 | 45% | 100% | 296.3 | 12.2 |
| (7) PoT + Fixed | 60:40:0 | 8-bit Fixed | 68.53 | 88.47 | 52% | 100% | 57.0 | 63.6 | – | – | – | – |
| (8) PoT + Fixed | 67:33:0 | 8-bit Fixed | 68.46 | 88.22 | – | – | – | – | 63% | 100% | 245.8 | 14.8 |
| MSQ-1 [2] | 60:40 | ✓ | 69.22 | 88.33 | 53% | 100% | 77.0 | 47.1 | – | – | – | – |
| MSQ-2 [2] | 67:33 | ✓ | 69.14 | 88.17 | – | – | – | – | 66% | 100% | 359.2 | 10.1 |
| RMSMP-1 | 60:35:5 | ✓ | 70.66 | 89.53 | 57% | 100% | 89.0 | 40.7 | – | – | – | – |
| RMSMP-2 | 65:30:5 | ✓ | 70.73 | 89.62 | – | – | – | – | 67% | 100% | 421.1 | 8.6 |
Our RMSMP is evaluated on image classification and natural language processing (NLP) applications. All model training and quantization are conducted on NVIDIA TITAN RTX and GeForce RTX 2080 Ti GPUs, with CUDA 10.2 and PyTorch 1.5 on the Ubuntu 18.04 operating system. The quantization algorithm utilizes the same data augmentation techniques as used in training the baseline (32-bit floating-point) models. Training tricks such as step or cosine learning rate decay and weight regularization are likewise kept the same between baseline training and the quantization algorithm.
The evaluated models for image classification include ResNet-18, ResNet-50 [14], and MobileNetV2 [27] on the CIFAR-10, CIFAR-100 [18], and ImageNet ILSVRC-2012 [17] datasets. Baseline models in 32-bit floating-point representation for CIFAR-10 and CIFAR-100 are trained from scratch and then quantized; for ImageNet, pre-trained 32-bit floating-point models are quantized directly. Initial learning rates are set separately for CIFAR-10, CIFAR-100, and ImageNet. For NLP tasks with BERT [8] models, we evaluate on the Stanford Sentiment Treebank (SST-2) and Multi-Genre Natural Language Inference (MNLI) datasets from the General Language Understanding Evaluation (GLUE) benchmark [30]. The pre-trained BERT models are from the HuggingFace Transformers library [32]. Quantization and fine-tuning are performed simultaneously for 3 epochs. In addition, for all models, the maximum iteration number of the power iteration method is set to 20.
In addition to model accuracy, we demonstrate the hardware efficiency (in terms of resource utilization, throughput, and latency) of the proposed RMSMP compared with other quantization methods by implementing architectures with heterogeneous GEMM cores on FPGA devices, i.e., Zynq XC7Z020 and XC7Z045. Different ratios of quantization schemes/precisions are realized by adjusting the ratio among the processing element (PE) array sizes in the GEMM cores. Our optimal RMSMP implementation utilizes three heterogeneous GEMM cores: one for processing PoT-W4A4 filters, one for Fixed-W4A4, and one for Fixed-W8A4. The working frequency is set to 100 MHz for all implementations.
4.2 Accuracy Performance
Table 1 compares the proposed RMSMP method with other quantization methods for three models on three datasets. RMSMP generally achieves higher top-1 accuracy than methods using Fixed, PoT, APoT, PoT + Fixed, or APoT + Fixed at 4-bit precision, and comparable accuracy to the Fixed-W4A4 + Fixed-W8A4 method.
The results in Tables 2, 3, and 4 demonstrate the effectiveness of the proposed RMSMP method for three models, i.e., ResNet-18, ResNet-50, and MobileNetV2, respectively, on ImageNet using mainly 4-bit precision. For ResNet-18, RMSMP increases top-1 accuracy by 0.48% over the baseline and outperforms the existing methods listed by up to 4.81%. For ResNet-50, RMSMP increases the baseline accuracy by 0.11% and outperforms the existing methods by up to 5.22%. For MobileNetV2, RMSMP achieves the smallest accuracy loss, with up to 7.62% higher top-1 accuracy than the other methods. In addition, comparisons on the BERT language model are provided in Table 5. BERT is a large model with high redundancy, so it is relatively easy to achieve negligible accuracy loss after quantization; all the quantization methods in Table 5 perform very well, and RMSMP yields slightly better accuracy than the other methods on the SST-2 and MNLI datasets.
4.3 Hardware Efficiency
The proposed RMSMP method is compared with other quantization methods on two FPGA boards with the ResNet-18 model on the ImageNet dataset, as displayed in Table 6. The total amount of resources on a specific FPGA board determines the optimal ratio of the PoT-W4A4 : Fixed-W4A4 : Fixed-W8A4 schemes. Specifically, the optimal ratio on XC7Z020 is 60:35:5 (RMSMP-1), yielding 70.66% top-1 accuracy and 40.7 ms latency, while the optimal ratio on XC7Z045 is 65:30:5 (RMSMP-2), yielding 70.73% top-1 accuracy and 8.6 ms latency. RMSMP with the optimal ratio of quantization schemes achieves about 3.0× speedup on XC7Z020 and about 3.7× speedup on XC7Z045 compared with the single-scheme, single-precision method (1) Fixed. Methods (1)(3)(5)(7)(8) employ 8-bit Fixed quantization for the first and last layers to maintain accuracy, which hurts inference speed. Compared with methods (1)(2)(5)(6), RMSMP raises LUT utilization to increase throughput. Additionally, RMSMP achieves higher accuracy and lower latency with full DSP usage compared with method (4). We also compare with MSQ [2] with the first/last layers quantized to 4-bit; our RMSMP outperforms MSQ in both accuracy and inference speed.
5 Conclusion
This work proposes a novel DNN quantization framework, namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach, featuring (i) layer-wise uniformity to fulfill the requirements of practical hardware implementation, (ii) row-wise flexibility for mixed schemes and multiple precisions, (iii) a highly hardware-informative selection of candidate schemes and precisions (bitwidths) that significantly reduces the algorithm search space, and (iv) the best accuracy performance among state-of-the-art methods. Evaluated on image classification and natural language processing applications, our RMSMP achieves superior accuracy, and its inference speedups are further validated on FPGA devices.
Acknowledgment
This work is partly supported by National Science Foundation grants CCF-1901378 and CCF-1937500. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
References

[1]
(2013)
Estimating or propagating gradients through stochastic neurons for conditional computation
. arXiv preprint arXiv:1308.3432. Cited by: §3.3.  [2] (2021) Mix and match: a novel fpgacentric deep neural network quantization framework. In Proceedings of the 2021 IEEE International Symposium on High Performance Computer Architecture, Cited by: §2.1.2, Table 1, §4.3, Table 2, Table 3, Table 4, Table 6.
 [3] (2019) l2q: An ultralow loss quantization method for dnn. The 2019 International Joint Conference on Neural Networks (IJCNN). Cited by: §1, §2.1.1, Table 2.
 [4] (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §1, §2.1.1, Table 2, Table 3, Table 4.
 [5] (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems (NeurIPS), pp. 3123–3131. Cited by: §1, §2.1.1.
 [6] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to+ 1 or1. arXiv preprint arXiv:1602.02830. Cited by: §1, §2.1.1.
 [7] (2018) Bert: pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.2.
 [8] (2019) BERT: pretraining of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §4.1.
 [9] (2019) HAWQv2: hessian aware traceweighted quantization of neural networks. arXiv preprint arXiv:1911.03852. Cited by: §1, §2.2, §2.2.

[10]
(2019)
Hawq: hessian aware quantization of neural networks with mixedprecision.
In
Proceedings of the IEEE International Conference on Computer Vision (ICCV)
, pp. 293–302. Cited by: §1, §2.2, §2.2, §3.3, §3.3.  [11] (2019) Learned step size quantization. International Conference on Learning Representations (ICLR). Cited by: §1, §2.1.1.
 [12] (2019) Differentiable soft quantization: bridging fullprecision and lowbit neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4852–4861. Cited by: §1, §2.1.1, Table 2, Table 4.
 [13] (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems (NeurIPS), pp. 1135–1143. Cited by: §2.2.
 [14] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4.1.
 [15] (2019) Simultaneously optimizing weight and quantizer of ternary neural network using truncated Gaussian approximation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11438–11446. Cited by: §1, §2.1.1.
 [16] (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4350–4359. Cited by: §1, §2.1.1, Table 2.
 [17] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105. Cited by: §4.1.
 [18] (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
 [19] (2018) Extremely low bit neural network: squeeze the last bit out with ADMM. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI). Cited by: §1, §2.1.2.
 [20] (2016) Ternary weight networks. arXiv preprint arXiv:1605.04711. Cited by: §1, §2.1.1.
 [21] (2020) Additive powers-of-two quantization: an efficient non-uniform discretization for neural networks. In International Conference on Learning Representations (ICLR). Cited by: §1, §1, §2.1.2, Table 1, Table 2, Table 3.
 [22] (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems (NeurIPS), pp. 345–353. Cited by: §1, §2.1.1.
 [23] (2019) AutoQ: automated kernel-wise neural network quantization. In International Conference on Learning Representations (ICLR). Cited by: §2.2.
 [24] (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025. Cited by: §1, §2.1.2.
 [25] (2018) Value-aware quantization for training and inference of neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 580–595. Cited by: §2.2.
 [26] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision (ECCV), pp. 525–542. Cited by: §1, §2.1.1.
 [27] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520. Cited by: §4.1.
 [28] (2020) Q-BERT: Hessian based ultra low precision quantization of BERT. In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), pp. 8815–8821. Cited by: §1, §2.2, §2.2, §3.3, Table 5.
 [29] (2020) Mixed precision DNNs: all you need is a good parametrization. In International Conference on Learning Representations (ICLR). Cited by: §1, §2.2, §2.2, Table 2.
 [30] (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. Cited by: §4.1.
 [31] (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.2, §2.2, Table 3, Table 4.
 [32] (2019) HuggingFace’s Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Cited by: §4.1.
 [33] (2018) Mixed precision quantization of ConvNets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090. Cited by: §1, §2.2, §2.2, Table 2.
 [34] (2019) PyHessian: neural networks through the lens of the Hessian. arXiv preprint arXiv:1912.07145. Cited by: §1, §2.2, §2.2.
 [35] (2018) Understanding straight-through estimator in training activation quantized neural nets. In International Conference on Learning Representations (ICLR). Cited by: §3.3.
 [36] (2018) LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382. Cited by: §1, §2.1.2, Table 2, Table 3.
 [37] (2017) Incremental network quantization: towards lossless CNNs with low-precision weights. In International Conference on Learning Representations (ICLR). Cited by: §1, §2.1.2.
 [38] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §1, §2.1.1, Table 2, Table 3.
 [39] (2017) Trained ternary quantization. In International Conference on Learning Representations (ICLR). Cited by: §1, §2.1.1.