HPTQ: Hardware-Friendly Post Training Quantization

09/19/2021 ∙ by Hai Victor Habi, et al. ∙ Sony

Neural network quantization enables the deployment of models on edge devices. An essential requirement for their hardware efficiency is that the quantizers are hardware-friendly: uniform, symmetric, and with power-of-two thresholds. To the best of our knowledge, current post-training quantization methods do not support all of these constraints simultaneously. In this work, we introduce a hardware-friendly post-training quantization (HPTQ) framework, which addresses this problem by synergistically combining several known quantization methods. We perform a large-scale study on four tasks: classification, object detection, semantic segmentation and pose estimation over a wide variety of network architectures. Our extensive experiments show that competitive results can be obtained under hardware-friendly constraints.


1 Introduction

Deep neural networks have shown state-of-the-art performance in many real-world computer vision tasks, such as image classification [18, 20], object detection [42, 31, 29], semantic segmentation [7] and pose estimation [51, 6]. However, the deployment of deep neural networks on edge devices is still considered a challenging task due to limitations on available memory, computational power and power consumption.

Quantization [15] is a common approach to tackle this challenge with minimal performance loss, by reducing the bit-width of network weights and activations. Quantization methods can be roughly divided into two categories: quantization aware training (QAT) and post-training quantization (PTQ). QAT methods [24, 25, 8, 16] retrain the network in order to recover the accuracy degradation caused by quantization and usually achieve better results than PTQ methods. PTQ methods [2, 5, 35, 13] are simpler and add quantization to a given network model without any training process. These methods are usually based on a representative unlabeled dataset that is used for selecting the quantization parameters.

Recently, several works [25, 17, 48] have focused on hardware-friendly quantization schemes, namely, quantizers that are uniform, symmetric and with power-of-two thresholds. Such quantizers reduce computational costs as they allow integer arithmetic without any cross-terms due to zero points and without floating-point scaling [25].

In this work, we introduce a hardware-friendly post-training quantization (HPTQ) method. To the best of our knowledge, current hardware-friendly quantization methods are based on quantization aware training (QAT). This might be due to the difficulty of using power-of-two thresholds, as stated in [37]. HPTQ offers a post-training quantization flow that adapts and synergistically combines several known techniques, namely threshold selection, shift negative correction, channel equalization, per-channel quantization and bias correction.

We extensively examine the performance of our method using 8-bit quantization. We evaluate HPTQ on different network architectures over a variety of tasks, including classification, object detection, semantic segmentation and pose estimation. Additionally, we provide an ablation study demonstrating the effect of each technique on the network performance. To summarize, our contributions are:

  • Introducing HPTQ, a method for hardware-friendly post-training quantization.

  • A large-scale study of post-training quantization on a variety of tasks: classification, object detection, semantic segmentation and pose estimation.

  • Demonstrating that competitive results can be obtained under the hardware-friendly constraints of uniform, symmetric 8-bit quantization with power-of-two thresholds.

2 Background and Basic Notions

In this section we give a short overview of uniform quantization and of the hardware-friendly constraints applied in this work, namely symmetric quantization with power-of-two thresholds.

Uniform Affine Quantization.

A quantizer $Q$ can be formalized as a right-to-left composition $Q = D \circ F$ of an integer-valued function $F$ and a recovering affine operation $D$ (known as de-quantization). The discrete range of $Q$ is called its quantization grid and, if the grid is uniformly spaced, $Q$ is said to be a uniform quantizer.

The constant gap $s$ between two adjacent points in the quantization grid of a uniform quantizer is called its step size and the affine shift $z$ is called the zero point. Using these parameters, a uniform quantizer can be formalized as:

$$Q(x) = s \cdot (x_q - z), \qquad (1)$$

where $x_q = F(x)$ is the image of $x$ under $F$ and is called the quantized integer value of $x$.

Practically, $F$ is defined by a clipping range $[a, b]$ of real values and the number of bits $n$ used to represent the quantized integer values:

$$F(x) = \mathrm{clip}\left(\left\lfloor \frac{x}{s} \right\rceil + z;\ 0,\ 2^{n}-1\right), \qquad (2)$$

where $s = \frac{b-a}{2^{n}-1}$ is the step size and $\lfloor \cdot \rceil$ is the rounding function to the nearest integer. The zero point is then defined as $z = \left\lfloor \frac{-a}{s} \right\rceil$ and a uniform quantizer can be formalized as:

$$Q(x) = s \cdot \left(\mathrm{clip}\left(\left\lfloor \frac{x}{s} \right\rceil + z;\ 0,\ 2^{n}-1\right) - z\right). \qquad (3)$$

Note that the clipping boundaries are usually selected so that the real value 0.0 is a point on the quantization grid.
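To make the affine quantizer of Eqs. (2)–(3) concrete, the following NumPy sketch implements the quantize and de-quantize steps. It is a minimal illustration, not the paper's implementation; the function and variable names are our own.

```python
import numpy as np

def affine_quantize(x, a, b, n_bits):
    """Uniform affine quantizer over the clipping range [a, b] with n_bits integers."""
    s = (b - a) / (2 ** n_bits - 1)                           # step size, Eq. (2)
    z = np.round(-a / s)                                      # zero point, keeps 0.0 on the grid
    x_int = np.clip(np.round(x / s) + z, 0, 2 ** n_bits - 1)  # integer image F(x), Eq. (2)
    return s * (x_int - z), x_int                             # de-quantized Q(x), Eqs. (1) and (3)

# Example: 4-bit quantization of the range [-0.3, 4.2] used in Figure 1.
x = np.array([-0.3, 0.0, 1.7, 4.2])
x_hat, x_int = affine_quantize(x, a=-0.3, b=4.2, n_bits=4)
```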

Symmetric Quantization.

Symmetric quantization is a simplified case of a uniform quantizer that restricts the zero point to $z = 0$. This eliminates the zero-point shift in Eq. 1 and thus enables an efficient hardware implementation of integer arithmetic without any cross-terms [25].

The restriction of the zero point to 0 requires selecting either a signed or an unsigned quantization grid. Let $t > 0$ be a clipping threshold of the quantization range. A signed quantizer is then formalized as:

$$Q_s^{n,t}(x) = s \cdot \mathrm{clip}\left(\left\lfloor \frac{x}{s} \right\rceil;\ -2^{n-1},\ 2^{n-1}-1\right), \qquad (4)$$

where $s = \frac{t}{2^{n-1}}$ is the step size. Similarly, an unsigned quantizer is formalized as:

$$Q_u^{n,t}(x) = s \cdot \mathrm{clip}\left(\left\lfloor \frac{x}{s} \right\rceil;\ 0,\ 2^{n}-1\right), \qquad (5)$$

where $s = \frac{t}{2^{n}}$ is the step size.

Power-of-Two Thresholds.

A uniform, symmetric quantizer (either signed or unsigned) with a power-of-two threshold is said to be a hardware-friendly quantizer [17]. Restricting the threshold of a symmetric quantizer to powers of two (i.e., $t = 2^M$ where $M \in \mathbb{Z}$) enables an efficient hardware implementation that uses integer arithmetic without floating-point scaling [25].
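The sketch below implements the signed and unsigned symmetric quantizers of Eqs. (4)–(5) together with the power-of-two rounding of the threshold. It is a minimal illustration under the step-size conventions reconstructed above; the names are ours.

```python
import numpy as np

def pot_threshold(x_abs_max):
    """Smallest power-of-two threshold t = 2^M (M integer) covering x_abs_max."""
    return 2.0 ** np.ceil(np.log2(x_abs_max))

def symmetric_quantize(x, t, n_bits, signed=True):
    """Hardware-friendly quantizer: uniform, symmetric (zero point 0), threshold t."""
    if signed:
        s = t / 2 ** (n_bits - 1)                             # step size, Eq. (4)
        q = np.clip(np.round(x / s), -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1)
    else:
        s = t / 2 ** n_bits                                   # step size, Eq. (5)
        q = np.clip(np.round(x / s), 0, 2 ** n_bits - 1)
    return s * q

x = np.array([-0.3, 0.0, 1.7, 4.2])
t = pot_threshold(np.max(np.abs(x)))      # 8.0 for this example
x_hat = symmetric_quantize(x, t, n_bits=4, signed=True)
```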

Figure 1 illustrates uniform, symmetric and hardware-friendly 4-bit quantization grids for the same range of real numbers $[-0.3, 4.2]$ to be quantized. Specifically, the figure demonstrates how the symmetry and power-of-two threshold constraints imply sub-optimal clipping ranges compared to the general uniform quantizer. These clipping ranges lead to a loss of representation bins and thus increase the potential rounding noise.

Figure 1: Uniform, Symmetric and Hardware-Friendly Quantizers. Illustration of the loss in quantization bins due to hardware friendly constraints.

3 Method

Given a trained floating point network and a representative dataset $D$ of independent and identically distributed samples, our aim is to quantize the network post-training with hardware-friendly quantizers, namely quantizers that are uniform, symmetric and with power-of-two thresholds. Hardware-Friendly Post Training Quantization (HPTQ) is a three-tier method for addressing this goal. HPTQ consists of a pre-processing stage followed by activation quantization and weight quantization (see Fig. 2). In the resulting network, activations are quantized per tensor and weights are quantized per channel.

Figure 2: The HPTQ framework, consisting of pre-processing (batch-normalization folding, statistics collection, outlier removal), activation quantization (threshold selection, shift negative correction, activation equalization) and weight quantization (threshold selection, bias correction). Dashed lines represent statistical information passing (including its updates), dotted lines represent data passing and solid lines represent an updated network.

3.1 Pre-Processing

The pre-processing stage consists of folding batch-normalization layers into their preceding convolution layers [24], collecting activation statistics using the representative dataset and, finally, removing outliers from the collected statistics.

Batch-Normalization Folding.

A common technique to reduce model size and computational complexity is batch-normalization folding [24] (also known as batch-normalization fusing) in which batch-normalization layers are folded into the weights of their preceding convolution layers.
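As a reference for this step, the sketch below folds batch-normalization parameters into a preceding convolution's weights and bias in the standard way [24]. The array shapes and names are assumptions (Keras-style kernels with the output channel on the last axis).

```python
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mu, var, eps=1e-3):
    """Fold BatchNorm (scale gamma, offset beta, moving mean mu, moving variance var)
    into convolution weights w of shape (kh, kw, C_in, C_out) and bias b of shape (C_out,)."""
    scale = gamma / np.sqrt(var + eps)     # per output channel
    w_folded = w * scale                   # broadcast over the last (output-channel) axis
    b_folded = beta + scale * (b - mu)
    return w_folded, b_folded
```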

Statistics Collection.

In this stage we run inference on all of the samples in the representative dataset $D$ and collect activation statistics for each layer. Specifically, for each layer $l$ we denote the collection of its activations over $D$ by $X_l$. Based on $X_l$ we collect a histogram for each tensor as well as the minimum, maximum and mean values per channel. In the rest of this work we assume that activation tensors have three dimensions $H \times W \times C$, where $H$, $W$ and $C$ are the height, width and number of channels, respectively.
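For illustration, a minimal sketch of the statistics collected here: a per-tensor histogram together with the per-channel minimum, maximum and mean over the representative samples. The bin count and names are assumptions.

```python
import numpy as np

def collect_statistics(activations, n_bins=2048):
    """activations: list of arrays of shape (H, W, C), one per representative sample."""
    stacked = np.stack(activations)                          # shape (N, H, W, C)
    hist, bin_edges = np.histogram(stacked, bins=n_bins)     # per-tensor histogram
    return {
        "hist": hist,
        "bin_edges": bin_edges,
        "min_per_channel": stacked.min(axis=(0, 1, 2)),
        "max_per_channel": stacked.max(axis=(0, 1, 2)),
        "mean_per_channel": stacked.mean(axis=(0, 1, 2)),
    }
```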

Outlier Removal.

In this step we filter out outliers in the activation histograms using the z-score approach described in [1]. Specifically, we remove histogram bins for which the absolute z-score value is larger than a predefined threshold. This restricts the range of each histogram to a predefined number of standard deviations from its activation mean value. See Figure 3 for an example. Note that since this step only updates the histograms, it applies only to the Threshold Selection step (see Figure 2).

Figure 3: Outlier Removal. Left: an input data distribution. Middle: the respective distribution of absolute z-score values. Right: data distribution after outlier removal.
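A sketch of z-score filtering applied to a collected histogram, assuming the mean and standard deviation are estimated from the histogram itself; the function name and interface are ours.

```python
import numpy as np

def remove_outliers(hist, bin_edges, z_threshold):
    """Zero out histogram bins whose center has an absolute z-score above z_threshold."""
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    mean = np.average(centers, weights=hist)
    std = np.sqrt(np.average((centers - mean) ** 2, weights=hist))
    z_scores = np.abs(centers - mean) / std
    return np.where(z_scores <= z_threshold, hist, 0)
```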

3.2 Activation Quantization

This stage consists of three steps: threshold selection, shift negative correction (SNC) and activation equalization. In the threshold selection step, we set power-of-two thresholds per tensor. The SNC step is a trick that improves the quantization of signed activation functions with a small negative range [4]. In the activation equalization step we equalize the expected dynamic ranges of activation channels by applying a modified version of a technique that appears in [36].

Threshold Selection.

Given a fixed bit width $n$, our aim is to find a power-of-two threshold $t$ that minimizes the noise caused by the quantization of each layer in the network. Formally, for each layer $l$ in the network, our objective is to find a threshold that minimizes

$$\frac{1}{N} \sum_{i=1}^{N} \mathrm{ERR}\big(Q_t(X_{l,i}),\ X_{l,i}\big), \qquad (6)$$

where $N$ is the size of the representative dataset, $X_{l,i}$ is the activation tensor of the $l$-th layer on the $i$-th sample (so that $X_l = \{X_{l,i}\}_{i=1}^{N}$) and $\mathrm{ERR}$ is some error measurement.

In an ablation study we examine the effect of several possible quantization error measurements on the actual task accuracy, including $L_p$ norms [38] and Kullback–Leibler (KL) divergence [34]. Our results show that Mean Square Error (MSE) [38] achieves the best performance (see Table 7). Thus, the objective of the threshold selection is to minimize

$$\frac{1}{N} \sum_{i=1}^{N} \big\lVert Q_t(X_{l,i}) - X_{l,i} \big\rVert_2^2. \qquad (7)$$

In practice, we approximate a solution to this minimization problem by estimating the noise based on the histogram of layer $l$ collected in the Statistics Collection step above. The restriction of the threshold to power-of-two values implies that the search space is discrete. Let $x_{\max}$ be the maximal absolute value of an activation in $X_l$ over the representative dataset, collected in the Statistics Collection step above, and define the no-clipping threshold:

$$t_{nc} = 2^{\lceil \log_2 x_{\max} \rceil}. \qquad (8)$$

Note that the clipping noise induced by the threshold $t_{nc}$ is zero and that for any power-of-two threshold larger than $t_{nc}$, the rounding noise is increased. Thresholds smaller than $t_{nc}$ may reduce the rounding noise, albeit at the cost of increasing the clipping noise. Therefore, we search for a threshold minimizing the quantization error starting with $t_{nc}$ and iteratively decreasing it (see Algorithm 1).

Data: quantization error estimator ERR; no-clipping threshold $t_{nc}$; bit-width $n$; number of iterations $K$
Result: threshold value $t^{*}$
$E_{min} \leftarrow \infty$;
$t^{*} \leftarrow t_{nc}$;
for $i$ in $0$ to $K$ do
       $t_i \leftarrow t_{nc} / 2^{i}$;
       $e_i \leftarrow \mathrm{ERR}(t_i, n)$;
       if $e_i < E_{min}$ then
             $E_{min} \leftarrow e_i$;
             $t^{*} \leftarrow t_i$;

return $t^{*}$
Algorithm 1 Constrained threshold selection
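A Python sketch of the search reconstructed in Algorithm 1: start from the no-clipping threshold of Eq. (8), halve it a fixed number of times and keep the candidate with the lowest quantization error (Eq. 7). It reuses the symmetric_quantize helper from the sketch in Section 2 and, as a simplifying assumption, evaluates the error directly on a tensor rather than on a histogram.

```python
import numpy as np

def mse_error(x, t, n_bits, signed=True):
    """MSE between a tensor and its symmetric quantization with threshold t (Eq. 7)."""
    return np.mean((symmetric_quantize(x, t, n_bits, signed) - x) ** 2)

def search_threshold(x, n_bits, n_iters=4, signed=True):
    """Constrained power-of-two threshold selection (Algorithm 1)."""
    t_nc = 2.0 ** np.ceil(np.log2(np.max(np.abs(x))))   # no-clipping threshold, Eq. (8)
    best_t, best_err = t_nc, mse_error(x, t_nc, n_bits, signed)
    for i in range(1, n_iters + 1):
        t = t_nc / 2 ** i                                # iteratively decrease the threshold
        err = mse_error(x, t, n_bits, signed)
        if err < best_err:
            best_t, best_err = t, err
    return best_t
```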

Shift Negative Correction (SNC).

Recent works have shown benefits in using signed, non-linear activation functions, such as Swish [40], PReLU and HSwish [21]. However, a signed symmetric quantization of these functions can be inefficient due to the difference between their negative and positive dynamic ranges. The main idea of SNC is to reduce the quantization noise of a signed activation function with a small negative range (relative to its positive range). This is done by adding a positive constant to the activation values (shifting its values) and using an unsigned quantizer with the same threshold, which effectively doubles the quantization grid resolution. Note that shifting the values may add clipping noise on the one hand, but reduces rounding noise on the other.

This step can be viewed as an adaptation to PTQ of a technique that appears in [4], where activations are shifted and scaled in order to match a given dynamic range of a quantizer. Here, we do not add scaling due to its implied added complexity. Specifically, let $\phi$ be the activation function in some layer in the network, let $t$ be its threshold, calculated in the Threshold Selection step above, and let $m < 0$ be its minimal (negative) activation value over the representative dataset $D$, collected in the Statistics Collection step above. If $|m| \le \alpha \cdot t$ for a hyperparameter $\alpha$, then we replace $\phi$ with a shifted version $\phi + |m|$ and replace the signed quantizer with an unsigned quantizer followed by another shift operation, as follows:

$$Q_s^{n,t}(\phi(x)) \approx Q_u^{n,t}(\phi(x) + |m|) - |m|, \qquad (9)$$

where $Q_s^{n,t}$ is the signed quantizer, $Q_u^{n,t}$ is the unsigned quantizer and $n$ is the bit-width. In practice, the final subtraction of $|m|$ is folded into the following operation in the network.
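A minimal sketch of the shifted unsigned quantization of Eq. (9), assuming the activation values a, the per-tensor threshold t and the collected minimum m are given; names and the unsigned step-size convention follow the sketches above.

```python
import numpy as np

def shift_negative_correction(a, t, m, n_bits):
    """Quantize activations with a small negative range by shifting them onto the
    unsigned grid (Eq. 9): Q_s(a) is approximated by Q_u(a + |m|) - |m|."""
    shift = abs(m)                            # m is the minimal (negative) activation value
    s = t / 2 ** n_bits                       # unsigned step size with the same threshold t
    q = np.clip(np.round((a + shift) / s), 0, 2 ** n_bits - 1)
    return s * q - shift                      # the final subtraction is folded into the next op
```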

Activation Equalization.

In this step, we equalize activation ranges per channel, similarly to the methods presented in [36, 33]. Here, we set the per-channel scale factor according to the value of the threshold that is selected per tensor. The motivation for using this scaling factor to equalize the activation ranges is to exploit the maximum range of the quantization bins for each channel (see Figure 4).

The authors in [36, 33] suggest performing channel equalization by exploiting the positive scale equivariance property of activation functions. It holds for any piece-wise linear activation function in its relaxed form: $\phi(S x) = S \hat{\phi}(x)$, where $\phi$ is a piece-wise linear function, $\hat{\phi}$ is its modified version that fits this requirement and $S = \mathrm{diag}(s_1, \ldots, s_C)$ is a diagonal matrix with $s_c$ denoting the scale factor for channel $c$.

The positive scaling equivariance can be applied to the following set of consecutive layers: a linear operation, a piece-wise linear function and an additional linear operation, as demonstrated in the following equation:

$$W_2\, \phi(W_1 x + b_1) + b_2 = W_2 S\, \hat{\phi}\big(S^{-1} W_1 x + S^{-1} b_1\big) + b_2, \qquad (10)$$

where $W_1$ and $b_1$ are the first layer's weights and bias, and $W_2$ and $b_2$ are the second layer's weights and bias. Although Eq. 10 demonstrates the case of fully-connected layers, it can also be extended to CNNs, where the scaling is performed per channel.

We present a use case of channel equalization named Max Channel Equalization, which can be applied in any quantization scheme. We assume that $\phi$ is one of the following non-linear functions: ReLU, ReLU8 or PReLU. Given the quantization threshold $t$ of a non-linear function as well as the maximal activation value $m_c$ of channel $c$ in the activation tensor of the layer, we set:

$$s_c = \frac{m_c}{t}, \qquad (11)$$

so that the maximal value of each channel in the equalized tensor will be the threshold value (see Figure 4).

Figure 4: An example of Max Channel Equalization using MobileNetV2 [43]. Left: the max value of each channel. Middle: the inverse scale factor for each channel. Right: the max value of each channel after equalization using this scaling factor.
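A sketch of Max Channel Equalization for a pair of fully-connected layers, following Eq. (10) with the per-channel scale of Eq. (11). The shapes, names and the exact scaling convention are assumptions made for illustration.

```python
import numpy as np

def max_channel_equalization(w1, b1, w2, t, channel_max):
    """Rescale consecutive layers (w1, b1) and w2 so that every intermediate channel's
    maximal activation becomes the threshold t (Eqs. 10-11).
    w1: (C, C_in), b1: (C,), w2: (C_out, C), channel_max: (C,) per-channel max activation."""
    s = channel_max / t                        # Eq. (11): s_c = m_c / t
    w1_eq = w1 / s[:, None]                    # S^{-1} W1: scale the rows (output channels)
    b1_eq = b1 / s                             # S^{-1} b1
    w2_eq = w2 * s[None, :]                    # W2 S: scale the columns (input channels)
    return w1_eq, b1_eq, w2_eq
```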

3.3 Weight Quantization

In the Weight Quantization stage we quantize the network’s weights. It was shown in [26, 41] that weight quantization with scaling per channel improves accuracy. Moreover, this work presents an efficient dot product and convolution implementation supporting per-channel quantization. Our Weight Quantization stage consists of per-channel threshold selection and bias correction [36].

Threshold Selection.

As noted above, weight quantization is performed per channel. Its thresholds are selected similarly to the activation thresholds (see Algorithm 1). However, a key difference is that here the search is performed directly on the weight values, as opposed to the statistical values that are used for activations. More precisely, given the weights $W$ of some channel in the network, the initial no-clipping threshold is

$$t_{nc} = 2^{\lceil \log_2 \max_i |w_i| \rceil}, \qquad (12)$$

where $w_i$ are the entries of $W$. Additionally, the error induced by a threshold $t$ is

$$\mathrm{ERR}(t) = \lVert Q_t(W) - W \rVert_2^2, \qquad (13)$$

where $Q_t$ is the signed quantizer with threshold $t$. Note that, as with activations, MSE is selected as the error measurement since it yields the best performance (see Table 10).
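For Eq. (12), a short sketch that computes per-channel no-clipping thresholds for a convolution kernel; the channel axis and names are assumptions.

```python
import numpy as np

def per_channel_no_clipping_thresholds(w):
    """Power-of-two no-clipping thresholds per output channel (Eq. 12).
    w: convolution kernel of shape (kh, kw, C_in, C_out)."""
    abs_max = np.max(np.abs(w), axis=(0, 1, 2))   # max |w_i| per output channel
    return 2.0 ** np.ceil(np.log2(abs_max))
```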

Bias Correction.

Quantization of weights induces bias shifts in activation means that may lead to detrimental behaviour in the following layers [36, 14]. Explicitly, let $y = Wx + b$ be the floating point output of a fully connected layer, where $x$, $W$ and $b$ are the floating-point input activation, weight and bias, respectively. Denote the quantized weights of the layer by $\tilde{W}$ and the corresponding output by $\tilde{y} = \tilde{W}x + b$. The induced bias shift can be expressed as follows:

$$\mathbb{E}[y] - \mathbb{E}[\tilde{y}] = (W - \tilde{W})\,\mathbb{E}[x]. \qquad (14)$$

Several works propose approaches to correct the quantization induced bias. These include using batch-normalization statistics [36], micro training [14] and applying scale and shift per channel [3].

We adopt the solution in [36], in which the bias shift is fixed by modifying the layer's bias vector:

$$\tilde{b} = b + (W - \tilde{W})\,\mathbb{E}[x], \qquad (15)$$

where $\mathbb{E}[x]$ is the per-channel empirical mean obtained in the Statistics Collection stage above. Note that although the above is written for a fully connected layer, it applies to convolutional layers as well, as shown in [36].
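A minimal sketch of the bias correction of Eq. (15) for a fully-connected layer, assuming the per-channel empirical input mean from the Statistics Collection stage is available; names are ours.

```python
import numpy as np

def bias_correction(w_float, w_quant, b, mean_input):
    """Compensate the bias shift of Eq. (14) by updating the bias as in Eq. (15):
    b_corrected = b + (W - W_quantized) E[x].
    w_float, w_quant: (C_out, C_in); b: (C_out,); mean_input: (C_in,) empirical mean of x."""
    return b + (w_float - w_quant) @ mean_input
```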

4 Experimental Results

In this section we evaluate the performance of HPTQ with 8-bit quantization over different tasks and a variety of network architectures. The experiments are divided into two parts. The first part presents an overall performance comparison to the floating point baseline as well as to state-of-the-art quantization approaches. The second part presents an ablation study that analyzes the influence of each technique in HPTQ separately.

4.1 Overall Performance Evaluation

We evaluate the performance of HPTQ on four different tasks: image classification, object detection, semantic segmentation and pose estimation. For each task, we present a comparison between the performance of models quantized by HPTQ and their floating point baselines. Furthermore, for classification and segmentation we provide a comprehensive comparison of HPTQ with both PTQ and QAT state-of-the-art quantization methods.

We use the same set of hyper-parameters for all our experiments. Specifically, the number of image samples in the representative dataset is 500. The z-score threshold in the outlier removal step is . The SNC threshold is . Last, for both activations and weights, the number of iterations performed in Algorithm 1 in the threshold selection search is set to . One should note that fine-tuning the hyper-parameters per network may lead to further improvement. In all of the tables below, Δ is the difference between the performance of the floating point model and the quantized model, PC indicates the use of per-channel weight quantization and PoT indicates power-of-two thresholds.

Classification.

We evaluate HPTQ on the ImageNet classification task [10] using the MobileNetV1 [20], MobileNetV2 [43] and ResNet50 [18] architectures (https://www.tensorflow.org/api_docs/python/tf/keras/applications). Tables 1, 2 and 3 present comparisons of HPTQ with other quantization methods, both PTQ and QAT, for the three architectures. The results show that HPTQ achieves competitive performance despite the hardware-friendly constraints. In the tables below, F-Acc is the floating point accuracy and Q-Acc is the accuracy of the quantized model.

Type Method PC PoT F-Acc Q-Acc Δ

QAT

QT [24] 70.9 70.0 0.9
TQT [25] 71.1 71.1 0.0

PTQ

SSBD [33] 70.9 69.95 0.95
Krishnamoorthi [26] 70.9 70.3 0.6
Wu et al [49] 71.88 70.39 1.49
Lee et al [27] 69.5 68.84 0.66
HPTQ (Our) 70.55 70.41 0.14
Table 1: ImageNet classification [10] with MobileNetV1 [20]
Type Method PC PoT F-Acc Q-Acc Δ

QAT

QT [24] 71.9 70.9 1.0
RVQuant [39] 70.10 70.29 -0.19
TQT [25] 71.7 71.8 -0.10

PTQ

AdaQuant [23] 73.03 73.03 0.0
ZeroQ [5] 73.03 72.91 0.12
SSBD [33] 71.9 71.29 0.61
Wu et al [49] 71.88 71.14 0.74
Krishnamoorthi [26] 71.9 69.7 2.2
Nagel et al [37] 71.72 70.99 0.73
71.16 0.56
DFQ [36] 71.72 70.92 0.8
Lee et al [27] 71.23 69.5 1.73
HPTQ (Our) 71.812 71.46 0.352
Table 2: ImageNet classification [10] with MobileNetV2 [43]
Type Method PC PoT F-Acc Q-Acc Δ

QAT

QT [24] 76.4 74.9 1.5
RVQuant [39] 75.92 75.67 0.25
HAWQ-V3 [50] 77.72 77.58 0.14
LSQ [11] 76.9 76.8 0.1
TQT [25] 76.9 76.5 0.4
FAQ [32] 75.4 75.4 0.0

PTQ

ZeroQ [5] 77.72 77.67 0.05
OCS [52] 76.1 75.9 0.2
SSBD [33] 75.2 74.95 0.25
He et al [19] 75.3 75.03 0.27
Wu et al [49] 76.16 76.05 0.11
Nagel et al [37] 76.07 75.87 0.2
75.88 0.19
Krishnamoorthi [26] 75.2 75.00 0.20
75.1 0.1
HPTQ (Our) 75.106 75.018 0.088
Table 3: ImageNet classification [10] with ResNet50 [18]

Semantic Segmentation.

We evaluate HPTQ on Pascal VOC [12] using DeepLab V3 [7] (https://github.com/tensorflow/models/blob/master/research/deeplab/g3doc/model_zoo.md) with MobileNetV2 [43] as a backbone. Table 4 shows that HPTQ achieves competitive results compared to other PTQ methods.

Type Method PC PoT F-mIoU Q-mIoU Δ

PTQ

DFQ [36] 72.45 72.33 0.12
Nagel et al [37] 72.94 72.44 0.50
72.27 0.67
HPTQ (Our) 75.57 75.38 0.19
Table 4: Semantic segmentation on Pascal VOC [12] using DeepLab V3 with MobileNetV2 [43] as a backbone. F-mIoU is the floating point mean Intersection-over-Union (mIoU) and Q-mIoU is the mIoU of the quantized model.

Object Detection.

We evaluate HPTQ on COCO [30] using the SSD detector [31] with several backbones (https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md). HPTQ achieves Mean Average Precision (mAP) similar to the floating point baseline, as demonstrated in Table 5.

Model F-mAP Q-mAP
SSD MobileNetV2 [43] 320x320 20.2 20.21
SSD MobileNetV2 [43] FPN Lite 320x320 22.2 21.93
SSD ResNet50 [18] V1 FPN 640x640 34.3 34.3
Table 5: Object detection results with HPTQ on COCO [30] using MobileNetV2 [43] and ResNet50 [18] as backbones. F-mAP is the floating point mAP and Q-mAP is the mAP of the quantized model.

Pose-Estimation.

We evaluate HPTQ on the single-person pose estimation task using the LPN network [51] on the LIP (Look into Person) dataset [28]. We use the PCKh metric [28] for evaluation, which is the head-normalized probability of correct keypoints. HPTQ achieves performance similar to the floating point baseline, with only a slight degradation from 81.65 to 81.53 PCKh.

4.2 Ablation Study

We provide an ablation study of HPTQ's performance on the ImageNet classification task [10] using eleven networks (https://www.tensorflow.org/api_docs/python/tf/keras/applications). The study is divided into two parts, analyzing activation quantization and weight quantization.

Table 6 compares the performance of HPTQ in four cases: full floating point, activation quantization only, weight quantization only and joint quantization of both. The comparison shows that activation quantization causes a larger degradation in performance than weight quantization, especially for EfficientNet with the Swish activation function. This might be due to the fact that activation equalization is not applied to these activations.

Network F-Acc Q-Acc (Activation) Q-Acc (Weights) Q-Acc (Both)
MobileNetV1 [20] 70.558 70.48 70.394 70.418
MobileNetV2 [43] 71.812 71.616 71.668 71.46
NasnetMobile [53] 74.376 74.068 74.352 73.888
VGG16 [44] 70.956 70.834 70.946 70.81
InceptionV3 [46] 77.908 77.872 77.844 77.85
InceptionResNetV2 [45] 80.284 80.154 80.32 80.14
ResNet50 [18] 75.106 75.072 75.06 75.018
EfficientNet-B0 [47] 77.2 74.3 77.012 74.216
EfficientNet-B0 ReLU 77.65 77.1 77.568 77.092
DenseNet-121 [22] 74.848 73.252 74.784 73.356
Xception [9] 79.05 79.048 79.062 78.972
Table 6: ImageNet classification [10] accuracy with HPTQ in four cases: full floating-point, activation quantization, weight quantization and both activation and weight quantization.

Activation Quantization Analysis.

In this analysis we evaluate the influence of the different methods used for quantizing the activations (without quantizing the weights). The analysis is performed with eleven different network architectures (EfficientNet-B0 ReLU is a trained version of EfficientNet-B0 with the ReLU activation function instead of Swish; see https://keras.io/api/applications/) on the ImageNet classification [10] task. Table 7 shows an accuracy comparison using four different threshold selection methods without applying any other of the activation quantization steps. NC indicates using the no-clipping threshold. Mean Square Error (MSE), Mean Absolute Error (MAE) and Kullback–Leibler (KL) divergence are three different error measurements for Equation 6.

Network NC MSE MAE KL
MobileNetV1 [20] 70.406 70.434 60.218 70.418
MobileNetV2 [43] 71.25 71.458 65.918 71.482
VGG16 [44] 70.8 70.764 58.37 65.096
ResNet50 [18] 74.612 74.996 67.896 59.556
Table 7: ImageNet classification [10] accuracy with activations quantized using different threshold selection methods (weights are in floating point).

Table 8 shows the incremental accuracy influence on ImageNet classification [10] of the methods used by HPTQ for activation quantization (without quantizing weights). Note that SNC is applied in all of the experiments in the table and its influence is studied separately below. The table shows that all of the methods result in an improvement. Note that fine-tuning the z-score threshold per network may lead to further improvement.

Network Name Baseline +Eq. +MSE Th. +z-score
MobileNetV1 [20] 70.406 70.418 70.48 70.48
MobileNetV2 [43] 71.25 71.34 71.528 71.616
NasnetMobile [53] 18.572 18.484 73.486 74.068
VGG16 [44] 70.8 70.696 70.888 70.834
InceptionV3 [46] 77.658 77.646 77.832 77.872
InceptionResNetV2 [45] 49.132 49.238 80.014 80.154
ResNet50 [18] 74.612 74.654 75.086 75.072
EfficientNet-B0 [47] 13.562 13.736 74.096 74.3
EfficientNet-B0 ReLU 74.298 76.298 76.956 77.1
DenseNet-121 [22] 56.08 55.916 73.28 73.252
Xception [9] 48.718 48.784 78.87 79.048
Table 8: The accuracy influence of the different activation quantization methods used by HPTQ for ImageNet classification [10] when keeping all weights in floating point. Baseline is quantization with no-clipping thresholds, +Eq. means adding max channel equalization, +MSE Th. means replacing the no-clipping thresholds with MSE-selected thresholds and +z-score means applying z-score outlier removal.

Table 9 shows the accuracy improvement achieved by applying Shift Negative Correction (SNC). Specifically, the table compares the performance of several versions of MobileNetV1, each trained with a different non-linear function, under the full activation quantization flow.

Swish Leaky ReLU PReLU SELU
Float 73.522 72.866 73.114 72.032
Without SNC 60.98 71.966 72.548 69.726
With SNC 71.146 72.588 72.548 70.902
Table 9: ImageNet classification accuracy [10] using HPTQ with and without SNC of MobileNetV1 [20] trained with different non-linear functions.

Weight Quantization Analysis.

In this analysis we evaluate the influence of the different methods used for quantizing weights (without quantizing activations). The analysis is performed with eleven different network architectures (EfficientNet-B0 ReLU is a trained version of EfficientNet-B0 with the ReLU activation function instead of Swish; see https://keras.io/api/applications/) on the ImageNet classification [10] task.

Table 10 shows an accuracy comparison of each quantized network using four different threshold selection methods (without applying bias correction). NC indicates using the no-clipping threshold. Mean Square Error (MSE), Mean Absolute Error (MAE) and Kullback–Leibler (KL) divergence are three different error measurements for Equation 6. Similarly to the results for activation quantization in Table 7, the MSE error measurement achieves the best results.

Network NC MSE MAE KL
MobileNetV1 [20] 68.75 68.756 64.242 64.968
MobileNetV2 [43] 69.562 69.758 67.57 62.394
NasnetMobile [53] 74.188 74.232 72.79 73.358
VGG16 [44] 70.944 70.94 67.486 70.472
InceptionV3 [46] 77.768 77.82 70.91 74.28
InceptionResNetV2 [45] 80.244 80.276 78.676 77.112
ResNet50 [18] 75.068 75.11 72.352 73.418
EfficientNet-B0 [47] 76.822 76.822 75.86 75.554
EfficientNet-B0 ReLU 77.078 77.218 76.916 76.674
DenseNet-121 [22] 74.734 74.736 72.102 60.17
Xception [9] 79.006 79.006 77.47 75.374
Table 10: ImageNet classification [10] accuracy with weights quantized using different threshold selection methods (activations are in floating point).

Table 11 shows the incremental accuracy influence of the two methods (per channel quantization and bias correction) used in HPTQ for weight quantization (without quantizing activations) on the ImageNet classification task [10]. This table shows that both of our methods result in improvement.

Network Baseline Per ch. +Bias corr.
MobileNetV1 [20] 0.966 68.756 70.394
MobileNetV2 [43] 0.398 69.758 71.668
NasnetMobile [53] 73.494 74.232 74.352
VGG16 [44] 70.814 70.94 70.946
InceptionV3 [46] 76.42 77.82 77.844
InceptionResNetV2 [45] 80.066 80.276 80.32
ResNet50 [18] 74.718 75.11 75.06
EfficientNet-B0 [47] 2.524 76.822 77.012
EfficientNet-B0 ReLU 0.682 77.218 77.568
DenseNet-121 [22] 72.986 74.736 74.784
Xception [9] 78.786 79.006 79.062
Table 11: The incremental influence of applying per-channel threshold selection (Per ch.) and bias correction (Bias corr.) on ImageNet [10] classification accuracy. Baseline means quantization with MSE threshold applied per tensor.

5 Conclusions

In this work we propose HPTQ, a method for hardware-friendly post-training quantization. HPTQ offers a flow that adapts and synergistically combines several known quantization techniques both for weights and activations. We extensively evaluated the performance of HPTQ on four tasks: classification, object detection, semantic segmentation and pose estimation. Notably, for all of the tasks we demonstrated that competitive results can be obtained under our hardware-friendly constraints of uniform and symmetric quantization with power-of-two thresholds. In addition, we performed an ablation study in which we presented the contributions of each of the methods used by HPTQ.

References

  • [1] C. C. Aggarwal (2015) Outlier analysis. In Data mining, pp. 237–263. Cited by: §3.1.
  • [2] R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry (2018) Post-training 4-bit quantization of convolution networks for rapid-deployment. arXiv preprint arXiv:1810.05723. Cited by: §1.
  • [3] R. Banner, Y. Nahshan, and D. Soudry (2019) Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, Vol. 32, pp. 7950–7958. External Links: Link Cited by: §3.3.
  • [4] Y. Bhalgat, J. Lee, M. Nagel, T. Blankevoort, and N. Kwak (2020) LSQ+: improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 696–697. Cited by: §3.2, §3.2.
  • [5] Y. Cai, Z. Yao, Z. Dong, A. Gholami, M. W. Mahoney, and K. Keutzer (2020) Zeroq: a novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13169–13178. Cited by: §1, Table 2, Table 3.
  • [6] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh (2019) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. IEEE transactions on pattern analysis and machine intelligence 43 (1), pp. 172–186. Cited by: §1.
  • [7] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §1, §4.1.
  • [8] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §1.
  • [9] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. Cited by: Table 10, Table 11, Table 6, Table 8.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.1, §4.2, §4.2, §4.2, §4.2, §4.2, Table 1, Table 10, Table 11, Table 2, Table 3, Table 6, Table 7, Table 8, Table 9.
  • [11] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha (2019) Learned step size quantization. arXiv preprint arXiv:1902.08153. Cited by: Table 3.
  • [12] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §4.1, Table 4.
  • [13] J. Fang, A. Shafiee, H. Abdel-Aziz, D. Thorsley, G. Georgiadis, and J. H. Hassoun (2020) Post-training piecewise linear quantization for deep neural networks. In European Conference on Computer Vision, pp. 69–86. Cited by: §1.
  • [14] A. Finkelstein, U. Almog, and M. Grobman (2019) Fighting quantization bias with bias. arXiv preprint arXiv:1906.03193. Cited by: §3.3.
  • [15] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer (2021) A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630. Cited by: §1.
  • [16] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan (2019) Differentiable soft quantization: bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4852–4861. Cited by: §1.
  • [17] H. V. Habi, R. H. Jennings, and A. Netzer (2020) HMQ: hardware friendly mixed precision quantization block for cnns. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 448–463. External Links: ISBN 978-3-030-58574-7 Cited by: §1, §2.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.1, Table 10, Table 11, Table 3, Table 5, Table 6, Table 7, Table 8.
  • [19] X. He and J. Cheng (2018) Learning compression from limited unlabeled data. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 752–769. Cited by: Table 3.
  • [20] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §4.1, Table 1, Table 10, Table 11, Table 6, Table 7, Table 8, Table 9.
  • [21] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324. Cited by: §3.2.
  • [22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: Table 10, Table 11, Table 6, Table 8.
  • [23] I. Hubara, Y. Nahshan, Y. Hanani, R. Banner, and D. Soudry (2020) Improving post training neural quantization: layer-wise calibration and integer programming. arXiv preprint arXiv:2006.10518. Cited by: Table 2.
  • [24] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §1, §3.1, §3.1, Table 1, Table 2, Table 3.
  • [25] S. R. Jain, A. Gural, M. Wu, and C. H. Dick (2019) Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. arXiv preprint arXiv:1903.08066. Cited by: §1, §1, §2, §2, Table 1, Table 2, Table 3.
  • [26] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §3.3, Table 1, Table 2, Table 3.
  • [27] J. H. Lee, S. Ha, S. Choi, W. Lee, and S. Lee (2018) Quantization for rapid deployment of deep neural networks. arXiv preprint arXiv:1810.05488. Cited by: Table 1, Table 2.
  • [28] X. Liang, K. Gong, X. Shen, and L. Lin (2018) Look into person: joint body parsing & pose estimation network and a new benchmark. IEEE transactions on pattern analysis and machine intelligence 41 (4), pp. 871–885. Cited by: §4.1.
  • [29] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1.
  • [30] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.1, Table 5.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1, §4.1.
  • [32] J. L. McKinstry, S. K. Esser, R. Appuswamy, D. Bablani, J. V. Arthur, I. B. Yildiz, and D. S. Modha (2019) Discovering low-precision networks close to full-precision networks for efficient inference. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), pp. 6–9. Cited by: Table 3.
  • [33] E. Meller, A. Finkelstein, U. Almog, and M. Grobman (2019) Same, same but different-recovering neural network quantization error through weight factorization. arXiv preprint arXiv:1902.01917. Cited by: §3.2, §3.2, Table 1, Table 2, Table 3.
  • [34] S. Migacz (2017) 8-bit inference with tensorrt. External Links: Link Cited by: §3.2.
  • [35] M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort (2020) Up or down? adaptive rounding for post-training quantization. In International Conference on Machine Learning, pp. 7197–7206. Cited by: §1.
  • [36] M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling (2019) Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1325–1334. Cited by: §3.2, §3.2, §3.2, §3.3, §3.3, §3.3, Table 2, Table 4.
  • [37] M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. van Baalen, and T. Blankevoort (2021) A white paper on neural network quantization. arXiv preprint arXiv:2106.08295. Cited by: §1, Table 2, Table 3, Table 4.
  • [38] Y. Nahshan, B. Chmiel, C. Baskin, E. Zheltonozhskii, R. Banner, A. M. Bronstein, and A. Mendelson (2019) Loss aware post-training quantization. arXiv preprint arXiv:1911.07190. Cited by: §3.2.
  • [39] E. Park, S. Yoo, and P. Vajda (2018) Value-aware quantization for training and inference of neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 580–595. Cited by: Table 2, Table 3.
  • [40] P. Ramachandran, B. Zoph, and Q. V. Le (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941. Cited by: §3.2.
  • [41] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European conference on computer vision, pp. 525–542. Cited by: §3.3.
  • [42] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497. Cited by: §1.
  • [43] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: Figure 4, §4.1, §4.1, Table 10, Table 11, Table 2, Table 4, Table 5, Table 6, Table 7, Table 8.
  • [44] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Table 10, Table 11, Table 6, Table 7, Table 8.
  • [45] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: Table 10, Table 11, Table 6, Table 8.
  • [46] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: Table 10, Table 11, Table 6, Table 8.
  • [47] M. Tan and Q. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: Table 10, Table 11, Table 6, Table 8.
  • [48] S. Uhlich, L. Mauch, F. Cardinaux, K. Yoshiyama, J. A. Garcia, S. Tiedemann, T. Kemp, and A. Nakamura (2019) Mixed precision dnns: all you need is a good parametrization. arXiv preprint arXiv:1905.11452. Cited by: §1.
  • [49] H. Wu, P. Judd, X. Zhang, M. Isaev, and P. Micikevicius (2020) Integer quantization for deep learning inference: principles and empirical evaluation. arXiv preprint arXiv:2004.09602. Cited by: Table 1, Table 2, Table 3.
  • [50] Z. Yao, Z. Dong, Z. Zheng, A. Gholami, J. Yu, E. Tan, L. Wang, Q. Huang, Y. Wang, M. Mahoney, et al. (2021) HAWQ-v3: dyadic neural network quantization. In International Conference on Machine Learning, pp. 11875–11886. Cited by: Table 3.
  • [51] Z. Zhang, J. Tang, and G. Wu (2019) Simple and lightweight human pose estimation. arXiv preprint arXiv:1911.10346. Cited by: §1, §4.1.
  • [52] R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang (2019) Improving neural network quantization without retraining using outlier channel splitting. In International conference on machine learning, pp. 7543–7552. Cited by: Table 3.
  • [53] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: Table 10, Table 11, Table 6, Table 8.