Accelerating Neural Network Inference by Overflow Aware Quantization

05/27/2020 · by Hongwei Xie, et al.

The inherent heavy computation of deep neural networks prevents their widespread application. A widely used method for accelerating model inference is quantization, which replaces the input operands of a network with fixed-point values; the majority of the computation cost then lies in integer matrix multiply-accumulation. In practice, a high-bit accumulator leads to partially wasted computation, while a low-bit one typically suffers from numerical overflow. To address this problem, we propose an overflow aware quantization method with a trainable adaptive fixed-point representation, which optimizes the number of bits for each input tensor while prohibiting numerical overflow during the computation. With the proposed method, we are able to fully utilize the available computing power, minimize the quantization loss, and obtain optimized inference performance. To verify the effectiveness of our method, we conduct image classification, object detection, and semantic segmentation tasks on the ImageNet, Pascal VOC, and COCO datasets, respectively. Experimental results demonstrate that the proposed method can achieve comparable performance with state-of-the-art quantization methods while accelerating the inference process by about 2 times.


1 Introduction

To date, as a powerful machine learning architecture, the deep neural network (DNN) has been applied to numerous applications, e.g., image classification [He et al.2016], object detection [Ren et al.2015, Liu et al.2016], and semantic segmentation [Chen et al.2018]. However, to achieve high-end performance on complicated problems, most DNN systems require heavy computational resources for model inference, which inevitably limits DNN deployment on low-cost processors. Such processors are extensively used in billions of commercial products, such as mobile phones, drones, and Internet-of-Things (IoT) devices, which makes DNN acceleration a critical problem in both academia and industry.

(a) With a 32-bit accumulator, one instruction can compute 4 multiply-adds using a 128-bit register.
(b) With a 16-bit accumulator, one instruction can compute 8 multiply-adds at the same time.
Figure 1: A representative example showing that replacing a 32-bit accumulator with a 16-bit one doubles the number of multiply-add operations computed per instruction.

To allow DNN acceleration, researchers have developed various approaches, which can be roughly divided into two groups. The first group of methods focuses on designing or searching for more compact and efficient network structures, which reduce the number of parameters and computations while achieving comparable performance. Representative methods in this category include MobileNet [Howard et al.2017], EfficientNet [Tan and Le2019], ProxylessNAS [Cai et al.2019], and so on. The second group aims at improving the efficiency of the arithmetic computation, i.e., the multiply-accumulate (MAC) operation, as it dominates most of the computation during DNN model inference. One widely used method is to approximate the original floating-point calculation with fixed-point operations to achieve acceleration; this type of method is well known as quantization [Jacob et al.2018]. A representative visualization of quantization is shown in Figure 1(a), where 8-bit fixed-point integers are used to approximate floating-point values and 32-bit fixed-point variables are used to hold the MAC results. Moreover, in addition to speeding up the MAC operations, quantization also enables better parallel computing on modern CPUs.

By comparing Figure 1(b) against Figure 1(a), it can be seen that if 16-bit fixed-point variables are used to hold the MAC results, the degree of parallelism is doubled and the number of I/O operations is halved. However, when a 16-bit holder is used, numerical overflow of the MAC results becomes a frequently occurring problem that must be explicitly considered. A straightforward solution is to use low-bit quantization (below 8 bits) for all operands, which however sacrifices quantization precision and significantly reduces performance. In addition, low-bit quantization still occupies more physical bits than necessary (e.g., 4-bit quantization still requires 8-bit physical operands) on most modern CPUs, leaving part of the computational resources wasted.

To summarize, existing quantization methods utilize a fixed number of bits to represent float values, and both high-bit and low-bit representations have limitations: the former suffers from numerical overflow and the latter leads to model precision degradation. To tackle this problem, we propose a novel method to adaptively determine the quantization precision in DNNs, by optimizing the number of bits for operands while prohibiting overflow on the low-bit MAC result holders. To achieve this, we introduce a trainable quantization range mapping factor into each layer of a DNN, which automatically scales the quantized values to prevent undesirable overflow. In addition, we propose a quantization-overflow aware training framework for learning the quantization parameters, to minimize the performance loss caused by post-training quantization [Krishnamoorthi2018]. To verify the effectiveness of our method, we conducted tests on several state-of-the-art lightweight DNNs for a variety of tasks on different benchmark datasets. Specifically, our experiments include image classification, object detection, and semantic segmentation, tested on the ImageNet, Pascal VOC, and COCO datasets, respectively. Experimental results demonstrate that, compared with state-of-the-art quantization methods, the proposed method achieves comparable performance while accelerating inference by about 2 times.

The main contributions of this paper are listed as follows:

  1. We propose an overflow aware quantization (OAQ) algorithm for accelerating DNNs, which adaptively maximizes the number of bits for operands while prohibiting numeric overflow.

  2. To ensure optimized performance, we design a quantization-overflow aware training framework (QOAT) to automatically learn the parameters used by the proposed OAQ algorithm.

  3. We conduct extensive experiments on three public datasets using state-of-the-art light-weight DNNs. The results verify the effectiveness of our method on a variety of tasks including image classification, object detection, and semantic segmentation.

2 Related Work

In this work, we focus on quantization methods that accelerate inference on off-the-shelf hardware platforms. While non-uniform quantization methods [Stock et al.2019, Gao et al.2019] have also been shown to be effective, they cannot be implemented efficiently on modern CPUs.

A representative early method is binary quantization [Rastegari et al.2016], which quantizes both weights and activations to one bit and replaces multiply-add operators with bit-shift and bit-count operations to speed up computation. This method achieves acceptable performance on common over-parameterized networks, like AlexNet [Krizhevsky et al.2012], but leads to substantial performance degradation on lighter networks, e.g., ResNet-18 [He et al.2016] and MobileNet [Howard et al.2017]. As of today, one of the most widely used quantization methods is 8-bit quantization [Jacob et al.2018, Krishnamoorthi2018, Jain et al.2019], which is extensively applied in different applications on various hardware platforms. 8-bit quantization converts the inference process into integer-only operations, which results in a faster inference process on mobile CPUs. However, when tackling resource-demanding large networks or when deploying on resource-constrained platforms, additional DNN acceleration is still required.

To allow further DNN acceleration, low-bit quantization techniques are under active exploration, where a key problem is to balance inference speed against model performance. [Choi et al.2018] proposes PACT, which trains not only the weights but also the clipping parameters of the clipped ReLU using gradient descent. [Louizos et al.2018] presents RQ, which optimizes the quantization procedure with gradient descent. However, both methods suffer from performance degradation on lightweight networks: when quantizing both weights and activations to 4 bits, PACT [Choi et al.2018] reduces accuracy from 70.9% to 62.44%, and quantizing MobileNet to 6 bits with RQ [Louizos et al.2018] achieves only 68% accuracy. Additionally, existing low-bit techniques focus only on the classification task; evaluation results on other popular tasks, e.g., object detection or semantic segmentation, are rarely reported in the literature. Accelerating inference by exploiting processor characteristics is another research direction. [Zeng et al.2019] proposes to decrease the computational load of multiplications by exploiting the parallel computing capability of modern CPUs. [Gong et al.2019] implements fast 2-bit integer arithmetic with ARM NEON technology and achieves a speed-up over NCNN [Tencent.Inc2017]. However, the reported results still suffer from significant performance loss, e.g., 4-bit quantization of MobileNet-v2 drops performance from 71.8% to 64.8%.

Moreover, there are a number of mixed-precision quantization methods [Wang et al.2019, Wu et al.2018, Dong et al.2019], which focus on searching for an optimal bit-width setup that can achieve high-level acceleration on customized hardware platforms while avoiding performance drop. Compared to those methods, the proposed OAQ framework focuses on off-the-shelf devices, and addresses the numerical overflow problem by designing trainable parameters in each layer of a network.

3 Quantization Prerequisites

In this section, we first present the general algorithmic framework of quantization methods. Subsequently, we analyze the inherent overflow problem of this framework and provide the mathematical conditions that enable overflow aware quantization (OAQ). Detailed steps of the OAQ approach are discussed in the next section.

3.1 Standard Quantization Method

To convert a floating-point real number r into a fixed-point quantized number q, the following affine mapping function is typically used [Jacob et al.2018]:

r = S (q - Z),   (1)

where S is the scale factor and Z is the zero-point parameter. Denoting by q_m^{(i,j)} the element at the i-th row and j-th column of the quantized matrix q_m (m = 1, 2, 3), the matrix multiplication r_3 = r_1 r_2 in the quantized domain can be computed element-wise as:

q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} (q_1^{(i,j)} - Z_1)(q_2^{(j,k)} - Z_2),   (2)

where the multiplier M is defined as

M := S_1 S_2 / S_3,   (3)

which can be implemented by fixed-point multiplication and efficient bit-shift [Jacob et al.2018]. By expanding terms in (2), we are able to obtain:

q_3^{(i,k)} = Z_3 + M \Big( N Z_1 Z_2 - Z_1 \bar{a}_2^{(k)} - Z_2 \bar{a}_1^{(i)} + \sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)} \Big),   (4)

where

\bar{a}_1^{(i)} := \sum_{j=1}^{N} q_1^{(i,j)}, \qquad \bar{a}_2^{(k)} := \sum_{j=1}^{N} q_2^{(j,k)}.   (5)

In (4), the majority of the computation cost lies in the core integer matrix multiply-accumulation:

\sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)}.   (6)

To compute (6), the standard method is to accumulate products of 8-bit values (signed or unsigned) into a 32-bit integer accumulator (also see Figure 1(a)):

\mathbb{I}_{32} \mathrel{+}= \mathbb{I}_{8} \times \mathbb{I}_{8},   (7)

where \mathbb{I}_{b} denotes the space of b-bit representable integers. As a result, a 32-bit register is required for caching the intermediate result. A NEON instruction can compute multiple multiply-adds at the same time, but it is limited by the register size and by the number of multiply-and-accumulate units on board; the limited register resource is usually the main bottleneck. Moreover, we note that under most ARM architectures there is no single instruction that implements (7). To this end, popular mobile inference engines (e.g., TFLite and NCNN) typically rely on

\mathbb{I}_{32} \mathrel{+}= \mathbb{I}_{16} \times \mathbb{I}_{16}   (8)

as a replacement, i.e., VMLAL.S16 on the ARM architecture.
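To make this pipeline concrete, the following NumPy sketch (our own illustration under the conventions of [Jacob et al.2018], not the inference-engine code used in the experiments) walks through Eqs. (1)-(7): it quantizes two random float matrices to 8 bits, performs the core integer MAC with a 32-bit accumulator, rescales with the multiplier M of Eq. (3), and compares the dequantized output against the float reference. The helper names (quant_params, quantize) are ours.

```python
import numpy as np

def quant_params(r_min, r_max, n_bits=8):
    """Scale S and zero-point Z for the affine mapping r = S * (q - Z) of Eq. (1)."""
    q_max = 2 ** n_bits - 1
    S = (r_max - r_min) / q_max
    Z = int(round(-r_min / S))              # zero-point so that r = 0 maps to an integer
    return S, int(np.clip(Z, 0, q_max))

def quantize(r, S, Z, n_bits=8):
    return np.clip(np.round(r / S) + Z, 0, 2 ** n_bits - 1).astype(np.uint8)

rng = np.random.default_rng(0)
r1 = rng.uniform(-1.0, 1.0, size=(4, 64)).astype(np.float32)   # activations
r2 = rng.uniform(-0.5, 0.5, size=(64, 8)).astype(np.float32)   # weights
r3 = r1 @ r2                                                    # float reference

S1, Z1 = quant_params(r1.min(), r1.max())
S2, Z2 = quant_params(r2.min(), r2.max())
S3, Z3 = quant_params(r3.min(), r3.max())
q1, q2 = quantize(r1, S1, Z1), quantize(r2, S2, Z2)

# Core integer MAC of Eqs. (6)-(7): products accumulated in an int32 register.
acc = (q1.astype(np.int32) - Z1) @ (q2.astype(np.int32) - Z2)
M = S1 * S2 / S3                                                # multiplier of Eq. (3)
q3 = np.clip(np.round(M * acc) + Z3, 0, 255).astype(np.uint8)

# Dequantize and compare against the float result.
print(np.abs(S3 * (q3.astype(np.float32) - Z3) - r3).max())
```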

Additionally, we note that loading data between registers and memory is a heavy operation. By applying (8), more register space is available to implement a bigger micro-kernel (see https://engineering.fb.com/ml-applications/qnnpack/), which largely reduces the frequency of transferring intermediate results between registers and memory. To show more details, we evaluated the performance of convolutions on an MTK8167s CPU with different implementations: using VMLAL.S8 and a 4x8 micro-kernel, the computation efficiency is improved by 36%, while a 4x16 micro-kernel yields a 73% speed-up.

3.2 Optimize Quantization Operations

To optimize the efficiency of the current quantization scheme, we seek to use a 16-bit integer as the accumulator:

\mathbb{I}_{16} \mathrel{+}= \mathbb{I}_{8} \times \mathbb{I}_{8}.   (9)

Compared to (7), one NEON instruction here (VMLAL.S8 on the ARM architecture) is able to compute double the amount of multiply-add operations, as shown in Figure 1(a) and Figure 1(b).
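As a quick illustration of why the 16-bit holder is risky (anticipating the discussion below), the toy sketch here (ours, not a measurement from the paper) accumulates the same random 8-bit products into an int32 and an int16 register; the int16 running sum silently wraps around once it leaves [-32768, 32767].

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=1024, dtype=np.int8)   # 1024 signed 8-bit operands
b = rng.integers(-128, 128, size=1024, dtype=np.int8)

# Eq. (7): accumulate the products in a 32-bit register -- no overflow possible here.
acc32 = int(np.sum(a.astype(np.int32) * b.astype(np.int32)))

# Eq. (9): accumulate the very same products in a 16-bit register (as VMLAL.S8 would).
prods = a.astype(np.int16) * b.astype(np.int16)          # each single product fits in int16
acc16 = int(np.cumsum(prods, dtype=np.int16)[-1])        # running sum wraps modulo 2**16

print(acc32, acc16)   # the two results disagree whenever any partial sum overflowed
```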

However, by directly using (9), numerical overflow becomes an unavoidable problem; this is one of the core problems we seek to resolve in this work. To make this possible, we first re-expand (2), taking q_1 as the quantized inputs and q_2 as the quantized weights:

q_3^{(i,k)} = Z_3 + M \Big( \sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)} - Z_2 \sum_{j=1}^{N} q_1^{(i,j)} - Z_1 \sum_{j=1}^{N} q_2^{(j,k)} + N Z_1 Z_2 \Big),   (10)

which can be rewritten as

q_3^{(i,k)} = Z_3 + M \Big( \sum_{j=1}^{N} q_1^{(i,j)} \tilde{q}_2^{(j,k)} + B^{(k)} \Big),   (11)

where

\tilde{q}_2^{(j,k)} := q_2^{(j,k)} - Z_2,   (12)
B^{(k)} := N Z_1 Z_2 - Z_1 \sum_{j=1}^{N} q_2^{(j,k)}.   (13)

In the above equations, q_2, Z_1, Z_2, Z_3, and M are all constant values once training is complete, and thus \tilde{q}_2 and B can be computed in advance to improve inference efficiency.
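The sketch below (again ours; the names w_tilde and bias mirror the reconstruction of Eqs. (12)-(13)) shows how the weight-only terms can be folded offline, leaving only the int16 MAC of Eq. (11) as per-inference work. The int16 accumulation faithfully wraps around if the conditions of Eq. (14) below are violated.

```python
import numpy as np

def fold_constants(q_w, Z_in, Z_w):
    """Precompute the offset weights of Eq. (12) and the per-column constant of Eq. (13)."""
    w_tilde = q_w.astype(np.int16) - np.int16(Z_w)
    N = q_w.shape[0]
    bias = N * Z_in * Z_w - Z_in * q_w.astype(np.int64).sum(axis=0)   # B^(k), exact integer math
    return w_tilde, bias.astype(np.int32)

def quantized_matmul_int16(q_in, w_tilde, bias, M, Z_out):
    """Eq. (11) with the core MAC kept in an int16 accumulator (simulating VMLAL.S8).

    Assumes the operands already satisfy the 8-bit conditions of Eq. (14);
    otherwise the int16 running sum below may silently wrap around.
    """
    acc16 = np.zeros((q_in.shape[0], w_tilde.shape[1]), dtype=np.int16)
    for j in range(w_tilde.shape[0]):                       # one multiply-add step per inner index
        acc16 += q_in[:, j:j + 1].astype(np.int16) * w_tilde[j:j + 1, :]
    out = np.round(M * (acc16.astype(np.int32) + bias)) + Z_out
    return np.clip(out, 0, 255).astype(np.uint8)
```

With q_in the quantized activations and q_w the quantized weights, fold_constants runs once after training, and quantized_matmul_int16 is the only work left at inference time.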

Based on the above equations, we point out that, to allow efficient computation under the 16-bit accumulator of (9), the following three conditions must be satisfied:

q_1^{(i,j)} \in \mathbb{I}_8, \qquad \tilde{q}_2^{(j,k)} \in \mathbb{I}_8, \qquad \sum_{j=1}^{N} q_1^{(i,j)} \tilde{q}_2^{(j,k)} \in \mathbb{I}_{16}.   (14)

It is important to note that the last condition in (14) should hold not only for the final sum but also for all intermediate partial sums. We also point out that the second and last conditions in (14) are not always naturally true; without special consideration, numerical overflow will frequently happen. To guarantee (14) in a DNN system, additional algorithms need to be designed and implemented.
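As a sanity check, the conditions can be verified layer by layer before deployment. The helper below is our own (assuming a signed 8-bit convention for both operands); it tests the operand ranges and every intermediate partial sum of the MAC.

```python
import numpy as np

def overflow_free(q_in, w_tilde):
    """Return True iff the three conditions of Eq. (14) hold for all outputs and partial sums."""
    ok_inputs = np.all((q_in >= -128) & (q_in <= 127))         # condition 1: q1 in I_8
    ok_weights = np.all((w_tilde >= -128) & (w_tilde <= 127))  # condition 2: q2 - Z2 in I_8
    # Condition 3: every partial sum of the int16 accumulation stays inside I_16.
    prods = q_in.astype(np.int64)[:, :, None] * w_tilde.astype(np.int64)[None, :, :]
    partial = np.cumsum(prods, axis=1)                         # exact running sums, per output
    ok_acc = np.all((partial >= -2**15) & (partial <= 2**15 - 1))
    return bool(ok_inputs and ok_weights and ok_acc)
```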

4 Overflow Aware Quantization Framework

This section describes the details of our overflow aware quantization algorithm, covering both the adaptive representation and the training procedure that ensures the three conditions in (14).

4.1 Adaptive Integer Representation

One straightforward method to reduce accumulation overflow is to narrow the range of each quantized value. For example, by using 4-bit quantization instead of 8-bit quantization, real values are mapped to [0, 15] instead of [0, 255]. As a result, numerical overflow becomes significantly less likely to happen, at the cost of wasting a large number of bits and reducing accuracy. To this end, we propose an adaptive float-bit-width method to fully utilize the representation capability without arithmetic overflow.

Specifically, we use a float quantization range mapping factor \alpha to adjust the affine relationship between the real range and the quantized range, as shown in Figure 2. Scaled by \alpha, the original 8-bit (the biggest bit-width that can be used) quantization range [0, 2^8 - 1] is mapped to [0, (2^8 - 1)/\alpha]. By enlarging \alpha, we are able to narrow down the quantized value range until arithmetic overflow is eliminated.

To present this mathematically, the affine function (1) can be written as

r = \frac{r_{\max} - r_{\min}}{2^n - 1} (q - Z).   (15)

By applying the scale factor \alpha to the quantization step,

S_\alpha = \alpha \cdot \frac{r_{\max} - r_{\min}}{2^n - 1},   (16)

the quantized value is narrowed to

q \in \Big[ 0,\ \frac{2^n - 1}{\alpha} \Big],   (17)

where r_{\min} and r_{\max} are the minimum and maximum limits of the real value, and n is the number of bits of the quantized value.

We name this the float-bit-width method since it differs from the traditional integer low-bit representation, whose quantization ranges have to be chosen from a limited bit-width set, e.g., 3-bit for [0, 7] or 4-bit for [0, 15]. With the proposed method, it is feasible to utilize a different quantization range mapping factor in each layer, maximizing the representation capability while prohibiting overflow. Additionally, since \alpha is continuous, it can easily be integrated into the training process of DNNs. Details on adaptively learning \alpha for the weights and activations of each layer are discussed in the next subsection.
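A minimal sketch of the float-bit-width idea, under our reconstruction of Eqs. (15)-(17): enlarging the scale by alpha shrinks the usable integer range to [0, (2^n - 1)/alpha], which corresponds to a fractional "bit-width" of log2((2^n - 1)/alpha + 1). The function name and parameters are our own.

```python
import numpy as np

def quantize_with_alpha(r, r_min, r_max, alpha, n_bits=8):
    """Quantize with the range mapping factor alpha of Eqs. (15)-(17).

    alpha = 1 recovers ordinary 8-bit quantization; alpha = 2 roughly corresponds
    to 7 bits, alpha = 4 to 6 bits, and values in between give a 'float bit-width'.
    """
    q_max = (2 ** n_bits - 1) / alpha                 # narrowed quantized range, Eq. (17)
    S = alpha * (r_max - r_min) / (2 ** n_bits - 1)   # enlarged quantization step, Eq. (16)
    Z = int(round(-r_min / S))                        # zero-point for the scaled mapping
    q = np.clip(np.round(r / S) + Z, 0, np.floor(q_max)).astype(np.uint8)
    return q, S, Z

# Example: alpha = 2.7 maps real values into roughly log2(255 / 2.7) ~ 6.6 "bits".
q, S, Z = quantize_with_alpha(np.linspace(-1, 1, 5), -1.0, 1.0, alpha=2.7)
print(q)
```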

In addition to the range of the quantized values, their distribution is also critical. We expect the quantized values to be centered and gathered around zero, so that the accumulated sum is more likely to stay away from overflow. Initializing the weights with a normal-distribution-like initializer and applying L1/L2 normalization are representative techniques that can be used.

Figure 2: We introduce a float factor \alpha to adjust the affine relationship between the real range and the quantized range, e.g., enlarging \alpha to narrow down the quantized value range.
Figure 3: We insert Quant nodes into each layer of the computation graph to calculate the amount of arithmetic overflow.

4.2 Learning Quantization Range Mapping Factor

To find a proper \alpha for each layer that simultaneously prohibits arithmetic overflow and retains the model's performance, we propose a quantization-overflow aware training framework. As shown in Figure 3, inspired by the simulation of quantization effects in the forward pass [Jacob et al.2018], we add an overflow aware module that simulates neural operations, e.g., Convolution and FullyConnect, with a 16-bit accumulator in order to capture arithmetic overflow in the forward pass.

In the forward pass, 8-bit quantization simulates quantized inference in floating-point arithmetic, which is called FakeQuantization (FakeQuant) [Jacob et al.2018]. To adaptively learn \alpha, we must capture the amount of arithmetic overflow in the quantized inference process. Therefore, we insert a Quantization node (Quant) into each layer of the computation graph in addition to FakeQuant. Quant requires r_{\min} and r_{\max} to produce 8-bit quantized values identical to those of the inference engine; r_{\min} and r_{\max} are scaled by \alpha before being passed into Quant and FakeQuant. For activations, r_{\min} and r_{\max} are aggregated by exponential moving averages (EMA) with a smoothing parameter close to 1, so that the observed ranges are smoothed and the network can enter a more stable state. The convolution operation with a 16-bit accumulator, Conv-INT16, takes the 8-bit quantized values as input to simulate integer-only inference on the inference engine. This calculates the amount of arithmetic overflow accumulated during inference, including all types of overflow defined in the overflow-free conditions (14). An easy way to capture overflow signals in a popular training framework, e.g., TensorFlow or PyTorch, is to compare the results of a regular 32-bit convolution with those of the 16-bit convolution; a more efficient implementation can certainly be written in a low-level language such as C++. Since computing the overflow amount is an integer-only operation, gradient descent is not a suitable way to update \alpha in back-propagation. To address this problem, we use a simple rule instead.

Specifically, when the captured overflow amount is larger than zero, we increase \alpha by:

\alpha \leftarrow \alpha + \min(\lambda_{\max},\ \lambda_t),   (18)

where \lambda_{\max} is the fixed maximum learning rate and \lambda_t is the dynamic learning rate for increasing \alpha, which decays with the training steps. After a large number of iterations, \lambda_t decays to a rather small value to stabilize the training state.

Alternatively, if the captured overflow amount is zero, we decrease \alpha by:

\alpha \leftarrow \alpha - \mu_t,   (19)

where \mu_t is the learning rate for decreasing \alpha, whose properties and physical meaning are similar to those of \lambda_t. To improve training efficiency, the overflow amount is calculated and \alpha is updated only every K steps, e.g., every 10 or 50 iterations.
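The fragment below is a schematic NumPy rendering of this procedure, not the authors' TensorFlow implementation: the overflow amount is measured by comparing an overflow-free int32 accumulation against the simulated int16 accumulation, and alpha is then nudged in the spirit of Eqs. (18)-(19). All learning-rate names and schedules here are our own placeholders.

```python
import numpy as np

def overflow_amount(q_in, w_tilde):
    """Count output entries whose int16 accumulation differs from the int32 reference."""
    ref = q_in.astype(np.int32) @ w_tilde.astype(np.int32)     # overflow-free reference
    acc16 = np.zeros(ref.shape, dtype=np.int16)
    for j in range(w_tilde.shape[0]):                          # int16 running sum, may wrap
        acc16 += q_in[:, j:j + 1].astype(np.int16) * w_tilde[j:j + 1, :]
    return int(np.count_nonzero(ref != acc16.astype(np.int32)))

def update_alpha(alpha, overflow, step, lam0=0.5, lam_max=0.1, mu0=0.05, decay=1e-3):
    """One update of the range mapping factor, following the spirit of Eqs. (18)-(19)."""
    lam_t = lam0 / (1.0 + decay * step)        # dynamic rate for increasing alpha, decays over time
    mu_t = mu0 / (1.0 + decay * step)          # dynamic rate for decreasing alpha
    if overflow > 0:
        return alpha + min(lam_max, lam_t)     # Eq. (18): widen the scale, shrink the integer range
    return max(1.0, alpha - mu_t)              # Eq. (19): reclaim unused range when no overflow
```

During training, overflow_amount would be evaluated on each layer's quantized activations and folded weights every K steps, and that layer's alpha updated accordingly.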

The strategy for inserting Quant is slightly different from that of FakeQuant. As shown in Figure 3, fake quantization of an input or output is inserted after the activation function or after a bypass connection (e.g., an add or a concatenation), since these change the real min-max ranges. For operations such as max-pooling, up-sampling, or padding, FakeQuant is not required. However, the 16-bit neural operation, e.g., Conv-INT16, only accepts quantized integer inputs, so the output of Quant has to be passed into it directly; that is, Quant must be inserted ahead of Conv-INT16 in the computation graph. Finally, Quant and FakeQuant share the same \alpha, r_{\min}, and r_{\max}, so as to strictly simulate the same inference process.

5 Overflow Study

(a) The per-layer activation factor of MobileNet-v1 trained on ImageNet.
(b) The per-layer activation factor of MobileNet-v1-SSD trained on Pascal VOC.
Figure 4: Two representative results of per-layer factor values, implying that the quantized values mostly use between 6 and 7 bits.

In this section, we first present an in-depth analysis of the proposed adaptive scale parameter and explain why we focus on comparing OAQ against state-of-the-art 6-bit quantization methods. Subsequently, we study the overflow sensitivity of different neural network models and show how the overflow ratio affects the overall performance.

5.1 Per-Layer Adaptive Scale

Since our key algorithmic contribution is a method to adaptively learn the quantization range mapping factor \alpha for each layer of a DNN, it is also important to examine how \alpha is distributed across the networks in our experiments. Figure 4 provides two representative sets of per-layer \alpha values, adaptively trained using the proposed quantization-overflow aware training framework. From this figure we observe that the computed scale factor varies across layers, which underlines our 'scale-per-layer' adaptive design. Additionally, the majority of the activation factors fall into a range indicating that the quantized values mostly use between 6 and 7 bits.

Additionally, we estimate the overflow ratio of low-bit quantization with a 16-bit accumulator by random multiply-add simulation. Specifically, we uniformly sample N values from the quantization value range and try to capture an overflow signal, which is computed by applying consecutive multiply-add operations on the sampled numbers with a 16-bit holder. Since 3x3 depth-wise convolutions and 1x1 point-wise convolutions are widely used in light-weight DNN architectures, we representatively choose N from {9, 64, 256, 1024}. A rough non-overflow ratio is then calculated from 100,000 independent simulation runs. As shown in Figure 5, 6-bit quantization remains overflow-free in most cases, while 7-bit and 8-bit quantization are risky.
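The simulation can be reproduced in spirit with a few lines of NumPy (our own sketch; we assume a symmetric signed quantization range and use 10,000 trials rather than the 100,000 behind Figure 5):

```python
import numpy as np

def non_overflow_ratio(n_ops, n_bits, trials=10_000, seed=0):
    """Fraction of random multiply-add chains whose int16 running sum never overflows."""
    rng = np.random.default_rng(seed)
    hi = 2 ** (n_bits - 1)                                       # signed n_bits range [-hi, hi)
    a = rng.integers(-hi, hi, size=(trials, n_ops), dtype=np.int64)
    b = rng.integers(-hi, hi, size=(trials, n_ops), dtype=np.int64)
    partial = np.cumsum(a * b, axis=1)                           # exact running sums of each chain
    ok = np.all((partial >= -2**15) & (partial <= 2**15 - 1), axis=1)
    return float(ok.mean())

for bits in (6, 7, 8):
    print(bits, [round(non_overflow_ratio(n, bits), 3) for n in (9, 64, 256, 1024)])
```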

Both Figure 4 and Figure 5 indicate that at most 6 bits could be used if we relied on plain low-bit quantization to address the overflow problem. Therefore, we focus on comparing OAQ against state-of-the-art 6-bit quantization methods.

Figure 5: The non-overflow ratio of randomly simulated multiply-adds over 9, 64, 256, and 1024 operands, repeated 100,000 times, for different quantization bit-widths with a 16-bit accumulator.

5.2 Overflow Sensitive Study

One may still worry that the quantization range mapping factors learned from the training data cannot be safely applied to all test data; for example, a small amount of arithmetic overflow might cause the entire model to fail. In practice, we did observe some overflow (usually on the order of tens of values after quantization-overflow-aware training) during inference on test data, yet the outputs remained correct. To analyze the impact of overflow quantitatively, we designed overflow simulation experiments: we manually inject overflow into the outputs of selected layers and observe the influence on the final result. We tested the overflow sensitivity of MobileNet-v1 on ImageNet and of MobileNet-v2-SSD on the COCO detection challenge. The results are shown in Figure 6. For the classification model, injecting 0.05% arithmetic overflow into any single layer has no impact on the overall accuracy; when overflow is injected into all layers, the accuracy slowly drops as the overflow ratio grows. For the detection model, overflow in the last layers affects the final result severely: 0.01% overflow renders the model unusable, whereas the model maintains high performance when overflow is injected into the other layers.

(a) Overflow injected into selected layers of MobileNet-v1 (L0: the first layer, L1: the second layer, L26: the penultimate layer, L28: the last layer before Softmax, ALL: overflow injected into all layers), evaluated on ImageNet.
(b) Overflow injected into selected layers of MobileNet-v2-SSD (L0: the first layer, L1: the second layer, LFs: the 6 feature layers that provide inputs to the last headers, LLs: the last 12 headers that regress bounding boxes and classification scores), evaluated on COCO.
Figure 6: Injecting arithmetic overflow into different layers of a neural network, e.g., MobileNet-v1 and MobileNet-v2-SSD. For the classification model, even 0.05% arithmetic overflow does not damage the overall accuracy; for the detection model, 0.01% overflow renders the model unusable.
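The injection itself can be mimicked as below (our own construction; the paper does not specify the exact mechanism): a chosen fraction of a layer's simulated int16 outputs has its sign bit flipped, which is exactly the artifact a wrapped 16-bit accumulator produces.

```python
import numpy as np

def inject_overflow(acc16, ratio, rng):
    """Make a given fraction of int16 outputs look as if their accumulator had wrapped around."""
    out = acc16.copy()
    flat = out.reshape(-1)
    idx = rng.choice(flat.size, size=max(1, int(ratio * flat.size)), replace=False)
    # Shifting by 2**15 flips the sign bit of a 16-bit value, i.e., the value jumps by +/-32768,
    # which is what a wrapped int16 accumulation produces.
    flat[idx] = (flat[idx].astype(np.int32) + 2**15).astype(np.int16)
    return out

rng = np.random.default_rng(0)
layer_out = rng.integers(-1000, 1000, size=(1, 28, 28, 32), dtype=np.int16)  # a mock layer output
corrupted = inject_overflow(layer_out, ratio=0.0005, rng=rng)                # 0.05% injected overflow
print(np.count_nonzero(corrupted != layer_out))                              # number of corrupted entries
```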

6 Experiments

To demonstrate the performance of the proposed OAQ framework, we evaluate both inference accuracy/recall and run-time characteristics on representative public benchmark datasets.

6.1 ImageNet

The first experiment benchmarks MobileNet-v2 [Sandler et al.2018], MobileNet-v1 [Howard et al.2017], and MobileNet-v1 with depth-multiplier 0.25 on the ILSVRC-2012 ImageNet dataset, which consists of 1000 classes, 1.28 million training images, and 50K validation images. We fine-tuned the MobileNet models from the pre-trained model zoo (TF-slim, https://github.com/tensorflow/models/tree/master/research/slim). All of these models were re-trained for 20 epochs. During training, the inputs were randomly cropped and resized to 224x224 before being fed into the network. Since the inputs of the first layer are not suitable for scaling, we skipped learning the quantization range mapping factor of the first layer's weights. This rule was applied to all the following experiments, including PACT [Choi et al.2018], which always uses 8-bit weights in the first layer. We report our evaluation results using Top-1 and Top-5 accuracy.

In our tests, we focus on comparing OAQ against state-of-the-art 6-bit quantization methods, including PACT [Choi et al.2018], RQ [Louizos et al.2018], and SR+DR [Gysel et al.2018]. Comparisons against other selected state-of-the-art methods were also conducted. As shown in Table 1, OAQ even outperforms 8-bit Quantization-Aware Training (QAT) [Jacob et al.2018] in certain cases. From Table 1, we observe that when quantizing MobileNet-v1, OAQ outperforms RQ [Louizos et al.2018] and SR+DR [Gysel et al.2018] by large margins. Although PACT [Choi et al.2018] is better on Top-5, the proposed method consistently performs better on the other metrics, especially on MobileNet-v1-0.25, i.e., 1.36% higher than PACT. We also include DSQ [Gong et al.2019] in the comparison, as it also achieved a speed-up over NCNN on an ARM Cortex-A53 CPU; on MobileNet-v2 our results are significantly better, e.g., 6.84% higher.

Model | Method | Bits | MAC bits | Top1 / Top5
MobileNet-v2 | FP | 32 | 32 | 71.80 / 91.00
MobileNet-v2 | QAT | 8 | 32 | 70.90 / 90.00
MobileNet-v2 | PACT | 6 | 32 | 71.25 / 90.00
MobileNet-v2 | DSQ | 4 | 32 | 64.90 / —
MobileNet-v2 | Ours | adaptive | 16 | 71.64 / 90.10
MobileNet-v1 | FP | 32 | 32 | 70.90 / 89.90
MobileNet-v1 | QAT | 8 | 32 | 70.10 / 88.90
MobileNet-v1 | PACT | 6 | 32 | 70.46 / 89.59
MobileNet-v1 | RQ | 6 | 32 | 68.02 / 88.00
MobileNet-v1 | SR+DR | 6 | 32 | 66.66 / 87.17
MobileNet-v1 | Ours | adaptive | 16 | 70.87 / 89.56
MobileNet-v1-0.25 | FP | 32 | 32 | 49.80 / 74.20
MobileNet-v1-0.25 | QAT | 8 | 32 | 48.00 / 72.80
MobileNet-v1-0.25 | PACT | 6 | 32 | 46.03 / 70.07
MobileNet-v1-0.25 | Ours | adaptive | 16 | 47.38 / 72.14
Table 1: Evaluation on ImageNet, comparing against state-of-the-art low-bit quantization methods.

6.2 Detection on VOC and COCO

Model | Method | Bits | MAC bits | mAP
MobileNet-v1 SSD | FP | 32 | 32 | 73.83
MobileNet-v1 SSD | QAT | 8 | 32 | 72.54
MobileNet-v1 SSD | PACT | 6 | 32 | 70.88
MobileNet-v1 SSD | Ours | adaptive | 16 | 72.53
MobileNet-v2 SSDLite | FP | 32 | 32 | 72.79
MobileNet-v2 SSDLite | QAT | 8 | 32 | 72.02
MobileNet-v2 SSDLite | PACT | 6 | 32 | 70.50
MobileNet-v2 SSDLite | Ours | adaptive | 16 | 71.84
Table 2: Evaluation on the Pascal VOC detection challenge, comparing with 6-bit PACT.
Model | Method | Bits | MAC bits | mAP
MobileNet-v1 SSD | FP | 32 | 32 | 23.7
MobileNet-v1 SSD | QAT | 8 | 32 | 23.0
MobileNet-v1 SSD | PACT | 6 | 32 | 18.4
MobileNet-v1 SSD | Ours | adaptive | 16 | 22.0
MobileNet-v2 SSDLite | FP | 32 | 32 | 22.7
MobileNet-v2 SSDLite | QAT | 8 | 32 | 21.4
MobileNet-v2 SSDLite | PACT | 6 | 32 | 18.1
MobileNet-v2 SSDLite | Ours | adaptive | 16 | 21.8
Table 3: Evaluation on the COCO detection challenge, comparing with 6-bit PACT.

To illustrate the applicability of our method to object detection, we applied OAQ to MobileNet-v1-SSD [Howard et al.2017, Liu et al.2016] and MobileNet-v2-SSDLite [Sandler et al.2018], evaluated on the 2012 Pascal VOC object detection challenge and the 2017 MS COCO detection challenge. We implemented our experiments in TensorFlow and fine-tuned models from the TensorFlow Object Detection API (https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md), reusing only the backbone since the SSD header differs between tasks. We first trained the models with 32-bit floating-point precision to achieve state-of-the-art performance, and subsequently fine-tuned them on VOC with batch size 32 and on COCO with batch size 48.

The results on VOC and COCO are listed in Table 2 and Table 3, respectively. On both the VOC and the COCO detection challenges, the proposed method significantly outperforms 6-bit PACT [Choi et al.2018]. In addition, our method achieves comparable performance with 8-bit QAT [Jacob et al.2018] and shows only a small mAP drop from the original model, i.e., about 1% mAP on VOC and less than 2% mAP on COCO.

6.3 Segmentation on VOC

To demonstrate the generalization of our method to semantic segmentation, we applied OAQ to DeepLab [Chen et al.2018] with MobileNet-v2 backbones (depth-multiplier 0.5 and 1.0). The performance was evaluated on the Pascal VOC segmentation challenge, which contains 1464 training images and 1449 validation images. The results are shown in Table 4. When quantizing the original model to 6 bits with PACT [Choi et al.2018], there is a significant drop in performance, e.g., the MobileNet-v2-dm0.5 backbone drops 10.9% in mIOU. By comparison, the proposed OAQ method achieves comparable performance with QAT, dropping only 1.8% and 0.4% relative to the original model on the MobileNet-v2-dm0.5 and MobileNet-v2 backbones, respectively.

Model | Method | Bits | MAC bits | mIOU
MobileNet-v2-dm0.5 DeepLab | FP | 32 | 32 | 71.8
MobileNet-v2-dm0.5 DeepLab | QAT | 8 | 32 | 70.4
MobileNet-v2-dm0.5 DeepLab | PACT | 6 | 32 | 60.9
MobileNet-v2-dm0.5 DeepLab | Ours | adaptive | 16 | 70.0
MobileNet-v2 DeepLab | FP | 32 | 32 | 75.3
MobileNet-v2 DeepLab | QAT | 8 | 32 | 74.8
MobileNet-v2 DeepLab | PACT | 6 | 32 | 70.4
MobileNet-v2 DeepLab | Ours | adaptive | 16 | 74.9
Table 4: Evaluation on the Pascal VOC segmentation challenge, comparing with 6-bit PACT.
CPU | Inference Engine | MobileNet-v1 | ResNet18
Allwinner V328 | TFLite | 550 | 1370
Allwinner V328 | MNN | 469 | 1605
Allwinner V328 | MNN + OAQ | 341 | 1021
Allwinner V328 | Ours + OAQ | 277 | 895
MTK8167s | TFLite | 387 | 950
MTK8167s | NCNN | 351 | 706
MTK8167s | MNN | 311 | 916
MTK8167s | MNN + OAQ | 220 | 604
MTK8167s | Ours + OAQ | 189 | 585
Table 5: Comparison of inference time (msec) for the MobileNet-v1 and ResNet18 networks.

6.4 Inference Efficiency Benchmark

Finally, we demonstrate the DNN acceleration capability of the proposed method on different low-cost hardware platforms. Specifically, we benchmarked computational efficiency on two selected platforms, i.e., the Allwinner V328 and the MTK8167s, whose processor architectures are ARM Cortex-A7 and ARM Cortex-A35, respectively. In this test, MobileNet-v1 and ResNet-18 were used as representative DNN models for inference. Three popular neural network inference engines for low-cost platforms were selected, i.e., TFLite, MNN [Jiang et al.2020], and NCNN, all running single-threaded on one core.

The experimental results are listed in Table 5, which clearly demonstrates that the proposed method outperforms the competing configurations by wide margins: on both the MTK8167s and the Allwinner V328, our method runs faster than both TFLite and NCNN.

7 Conclusion

In this paper, we propose an overflow aware quantization method that significantly accelerates DNN inference while minimizing the loss of accuracy. To achieve this, we adaptively adjust the number of bits used to represent quantized fixed-point integers. This scheme is incorporated into a novel training framework that adaptively learns an overflow-free quantization range while maintaining high-end performance. Using the proposed method, an extremely light-weight neural network can achieve performance comparable to 8-bit quantization on the ImageNet classification challenge. Comprehensive experiments also verify that our method applies to various dense prediction tasks, e.g., object detection and semantic segmentation, outperforming competing state-of-the-art methods.

References