1 Introduction
To date, as a powerful machine learning architecture, the deep neural network (DNN) has been applied to numerous applications, e.g., image classification [He et al.2016], object detection [Ren et al.2015, Liu et al.2016], and semantic segmentation [Chen et al.2018]. However, to achieve high-end performance on complicated problems, most DNN systems require heavy computational resources for model inference, which inevitably limits DNN deployment on low-cost processors. Such processors are used extensively in billions of commercial products, such as mobile phones, drones, and Internet-of-Things (IoT) devices, which makes DNN acceleration a critical problem in both academia and industry.
To enable DNN acceleration, researchers have developed various approaches, which can be roughly divided into two groups. The first group of methods focuses on designing or searching for more compact and efficient network structures, which reduce the number of parameters and computations while achieving comparable performance. Representative methods in this category include MobileNet [Howard et al.2017], EfficientNet [Tan and Le2019], ProxylessNAS [Cai et al.2019], and so on. The second group aims at improving the efficiency of arithmetic computation, i.e., the multiply-accumulate (MAC) operation, as it dominates most computations during DNN model inference. One widely used method is to approximate the original floating-point calculation with fixed-point operations to achieve computational acceleration; this type of method is well known as quantization [Jacob et al.2018]. A representative visualization of quantization is shown in Figure 1(a), where 8-bit fixed-point integers are used to approximate floating-point values and 32-bit fixed-point variables are used to hold MAC results. Moreover, in addition to speeding up the MAC operations, quantization also enables better parallel computing on modern CPUs.
By comparing Figure 1(b) against Figure 1(a), it can be seen that if 16-bit fixed-point variables are used to hold the MAC result, the degree of parallelism is doubled and the number of I/O operations is halved. However, when a 16-bit holder is used, numerical overflow of MAC results becomes a frequent problem that must be explicitly considered. A straightforward solution is to use low-bit quantization (below 8 bits) for all operands, which however loses quantization precision and significantly reduces performance. In addition, low-bit quantization still occupies full physical operands (e.g., 4-bit quantization still requires 8-bit physical operands) on most modern CPUs, so computational resources are partially wasted.
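To make the overflow issue concrete, the following minimal NumPy sketch (our own illustration; the sizes and random operands are made up and do not correspond to the paper's kernels) accumulates int8 products into a 32-bit and a 16-bit holder and counts how many partial sums no longer match the exact value:

```python
import numpy as np

# Minimal sketch: accumulate int8 x int8 products into a 32-bit vs. 16-bit holder.
rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=256, dtype=np.int8)   # quantized activations (toy)
w = rng.integers(-128, 128, size=256, dtype=np.int8)   # quantized weights (toy)

prod = x.astype(np.int16) * w.astype(np.int16)         # each single product fits in 16 bits

acc32 = np.cumsum(prod, dtype=np.int32)                # standard 32-bit MAC holder
acc16 = np.cumsum(prod, dtype=np.int16)                # 16-bit holder: partial sums may wrap

overflow_steps = int(np.sum(acc32 != acc16))           # partial sums that drifted after wrapping
print(f"{overflow_steps} of {len(prod)} partial sums no longer match the exact result")
```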
To summarize, existing quantization methods use a fixed number of bits to represent float values, and both high-bit and low-bit representations have limitations: the former suffers from numerical overflow and the latter leads to model precision degradation. To tackle this problem, we propose a novel method that adaptively determines the quantization precision in DNNs by optimizing the number of bits for operands while prohibiting overflow of the low-bit MAC result holders. To achieve this, we introduce a trainable quantization range mapping factor into each layer of a DNN, which automatically scales the quantized values to prevent the undesirable overflow. In addition, we propose a quantization-overflow-aware training framework for learning the quantization parameters, to minimize the performance loss caused by post-training quantization [Krishnamoorthi2018]. To verify the effectiveness of our method, we conducted tests on several state-of-the-art lightweight DNNs for a variety of tasks on different benchmarking datasets. Specifically, our experiments include image classification, object detection, and semantic segmentation, tested on the ImageNet, Pascal VOC, and COCO datasets, respectively. Experimental results demonstrate that, compared with state-of-the-art quantization methods, the proposed method achieves comparable performance while speeding up inference by about 2 times.
The main contributions of this paper are listed as follows:

We propose an overflow-aware quantization (OAQ) algorithm for accelerating DNNs, which adaptively maximizes the number of bits used for operands while prohibiting numerical overflow.

To ensure optimized performance, we design a quantization-overflow-aware training (QOAT) framework that automatically learns the parameters used by the proposed OAQ algorithm.

We conduct extensive experiments on three public datasets using state-of-the-art lightweight DNNs. The results verify the effectiveness of our method on a variety of tasks including image classification, object detection, and semantic segmentation.
2 Related Work
In this work, we focus on quantization methods that accelerate inference on off-the-shelf hardware platforms. While non-uniform quantization methods [Stock et al.2019, Gao et al.2019] have also been shown to be effective, they do not allow efficient implementation on modern CPUs.
A representative early method is binary quantization [Rastegari et al.2016], which quantizes both weights and activations to one bit and replaces multiply-adds with bit-shift and bit-count operations to speed up inference. This method achieves acceptable performance on common over-parameterized networks, like AlexNet [Krizhevsky et al.2012], but leads to substantial performance degradation on lightweight networks, e.g., ResNet-18 [He et al.2016] and MobileNet [Howard et al.2017]. As of today, one of the most widely used quantization methods is 8-bit quantization [Jacob et al.2018, Krishnamoorthi2018, Jain et al.2019], which is extensively applied in different applications on various hardware platforms. 8-bit quantization converts the inference process into integer-only operations, which results in faster inference on mobile CPUs. However, when tackling resource-demanding large networks or deploying on resource-constrained platforms, additional DNN acceleration is still required.
To allow further DNN acceleration, low-bit quantization techniques are under active exploration, in which a key problem is to balance inference speed and model performance. [Choi et al.2018] proposes PACT to train not only the weights but also the clipping parameters of the clipped ReLU using gradient descent. [Louizos et al.2018] presents RQ to optimize the quantization procedure with gradient descent. However, both methods suffer from performance degradation on lightweight networks. In fact, when quantizing both weights and activations to 4 bits, PACT [Choi et al.2018] reduces accuracy from 70.9% to 62.44%, and quantizing MobileNet to 6 bits with RQ [Louizos et al.2018] achieves only 68% accuracy. Additionally, existing low-bit techniques focus only on the classification task; evaluation results on other popular tasks, e.g., object detection or semantic segmentation, remain limited in the literature. Accelerating inference by exploiting properties of processors is another research direction. [Zeng et al.2019] proposes to decrease the computational load of multiplications by exploiting the parallel computing capability of modern CPUs. [Gong et al.2019] implements fast 2-bit integer arithmetic with ARM NEON technology and achieves a speed-up over NCNN [Tecent.Inc2017]. However, the reported results still suffer from significant performance loss, e.g., 4-bit quantized MobileNet-v2 drops from 71.8% to 64.8%.
Moreover, there are a number of mixed-precision quantization methods [Wang et al.2019, Wu et al.2018, Dong et al.2019], which focus on searching for an optimal bit-width setup that achieves high-level acceleration on customized hardware platforms while avoiding a performance drop. Compared to those methods, the proposed OAQ framework focuses on off-the-shelf devices and addresses the numerical overflow problem by designing trainable parameters in each layer of a network.
3 Quantization Prerequisites
In this section, we first present the general algorithmic framework of quantization methods. Subsequently, we analyze the inherent overflow problem of this framework and provide mathematical conditions that allow overflow-aware quantization (OAQ). Detailed steps of the OAQ approach are discussed in the next section.
3.1 Standard Quantization Method
To convert a floating-point real number $r$ to a fixed-point quantized number $q$, the following affine mapping function is typically used [Jacob et al.2018]:

$r = S\,(q - Z)$  (1)
where $S$ is the scale factor and $Z$ is the zero-point parameter. Denoting by $q^{(i,j)}$ the element at the $i$-th row and $j$-th column of a matrix $q$, the matrix multiplication $r_3 = r_1 r_2$ in the quantized domain can be computed element-wise as:

$q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} \big(q_1^{(i,j)} - Z_1\big)\big(q_2^{(j,k)} - Z_2\big)$  (2)

where the multiplier $M$ is defined as

$M := \dfrac{S_1 S_2}{S_3}$  (3)
which can be implemented by a fixed-point multiplication and an efficient bit-shift [Jacob et al.2018]. By expanding the terms in (2), we obtain:

$q_3^{(i,k)} = Z_3 + M\Big( N Z_1 Z_2 - Z_1 a_2^{(k)} - Z_2 \bar{a}_1^{(i)} + \sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)} \Big)$  (4)

where

$a_2^{(k)} := \sum_{j=1}^{N} q_2^{(j,k)}, \qquad \bar{a}_1^{(i)} := \sum_{j=1}^{N} q_1^{(i,j)}$  (5)

In (4), the majority of the computation cost lies in the core integer matrix multiplication accumulation:

$\sum_{j=1}^{N} q_1^{(i,j)}\, q_2^{(j,k)}$  (6)
To compute (6), the standard method is to accumulate products of 8-bit values (signed or unsigned) into a 32-bit integer accumulator (also see Figure 1(a)):

$\mathbb{S}(32) \mathrel{+}= \mathbb{S}(8) \times \mathbb{S}(8)$  (7)

where $\mathbb{S}(n)$ represents the space of $n$-bit representable numbers. As a result, a 32-bit register is required for caching the intermediate result. A NEON instruction can compute multiple multiply-adds at the same time, but it is limited by the register size and the number of multiply and accumulate units on board; the limited register resource is usually the main bottleneck. Moreover, we note that under most ARM architectures, there is no instruction that implements (7) directly. To this end, popular mobile inference engines (e.g., TFLite and NCNN) typically rely on

$\mathbb{S}(32) \mathrel{+}= \mathbb{S}(16) \times \mathbb{S}(16)$  (8)

as a replacement, i.e., VMLAL.S16 on the ARM architecture.
Additionally, we note that loading data between registers and memory is a heavy operation. By applying (8), more register space is used to implement a bigger micro-kernel (see https://engineering.fb.com/ml-applications/qnnpack/), which largely reduces the frequency of transferring intermediate results between registers and memory. To show more details, we evaluated the performance of convolutions on an MTK8167s CPU with different implementations. Using VMLAL.S8 with a 4x8 micro-kernel, computation efficiency is improved by 36%, while applying a 4x16 micro-kernel achieves a 73% speed-up.
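For illustration, the following Python sketch (our own toy example, not the engine's code; the scales $S_1, S_2, S_3$ and zero-points are arbitrary) carries out the quantized matrix multiplication of (1)-(7) with a 32-bit accumulator and compares the dequantized result with the floating-point reference:

```python
import numpy as np

def quantize(r, S, Z, n_bits=8):
    """Affine quantization of Eq. (1): r is approximated by S * (q - Z)."""
    q = np.round(r / S) + Z
    return np.clip(q, 0, 2 ** n_bits - 1).astype(np.int32)

# Hypothetical per-tensor scales and zero-points; a real engine calibrates these.
S1, Z1 = 0.03, 128   # inputs
S2, Z2 = 0.03, 128   # weights
S3, Z3 = 0.25, 128   # outputs

rng = np.random.default_rng(0)
r1 = rng.normal(size=(4, 64)).astype(np.float32)
r2 = rng.normal(size=(64, 8)).astype(np.float32)

q1, q2 = quantize(r1, S1, Z1), quantize(r2, S2, Z2)
M = S1 * S2 / S3                        # Eq. (3); fixed-point multiply + shift in practice

core = (q1 - Z1) @ (q2 - Z2)            # core accumulation of Eq. (6), int32 holder as in Eq. (7)
q3 = np.clip(np.round(Z3 + M * core), 0, 255).astype(np.uint8)

# The dequantized output differs from the float reference only by quantization error.
err = np.abs(S3 * (q3.astype(np.float32) - Z3) - r1 @ r2).max()
print(f"max abs error vs. float32 reference: {err:.3f}")
```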
3.2 Optimize Quantization Operations
To optimize the efficiency of the current quantization scheme, we seek to use 16-bit integers as accumulators:

$\mathbb{S}(16) \mathrel{+}= \mathbb{S}(8) \times \mathbb{S}(8)$  (9)

Compared to (7), one NEON instruction here (VMLAL.S8 on the ARM architecture) is able to compute twice the number of multiply-add operations, as shown in Figure 1(a) and Figure 1(b).
However, directly using (9) makes numerical overflow an unavoidable problem; this is one of the core problems we seek to resolve in this work. To make this possible, we first re-expand (2), denoting the quantized inputs by $q_x$ (with zero-point $Z_x$) and the quantized weights by $q_w$ (with zero-point $Z_w$):

$q_3^{(i,k)} = Z_3 + M\Big( \sum_{j=1}^{N} q_x^{(i,j)} q_w^{(j,k)} - Z_x \sum_{j=1}^{N} q_w^{(j,k)} - Z_w \sum_{j=1}^{N} q_x^{(i,j)} + N Z_x Z_w \Big)$  (10)

which can be rewritten as

$q_3^{(i,k)} = Z_3 + M\Big( \sum_{j=1}^{N} q_x^{(i,j)}\, \hat{q}_w^{(j,k)} + b^{(k)} \Big)$  (11)

where

$\hat{q}_w^{(j,k)} := q_w^{(j,k)} - Z_w$  (12)

$b^{(k)} := -Z_x \sum_{j=1}^{N} \hat{q}_w^{(j,k)}$  (13)

In the above equations, $q_w$, $Z_w$, and $Z_x$ are all constant once training is complete, and thus $\hat{q}_w$ and $b^{(k)}$ can be computed in advance to improve inference efficiency.
Based on the above equations, we point out that, to allow efficient computation under (9), the following three conditions must be satisfied:

$q_x^{(i,j)}\, \hat{q}_w^{(j,k)} \in \mathbb{S}(16), \qquad b^{(k)} \in \mathbb{S}(16), \qquad \sum_{j=1}^{N} q_x^{(i,j)}\, \hat{q}_w^{(j,k)} + b^{(k)} \in \mathbb{S}(16)$  (14)

It is important to note that the last condition in (14) must hold not only for the final sum but also for all intermediate partial sums. We also point out that the second and last conditions in (14) are not always naturally true; without special consideration, numerical overflow will happen frequently. To guarantee (14) in a DNN system, additional algorithms need to be designed and implemented.
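As a sanity check, the following sketch (our own; the variable names, shapes, and exact accumulation order are assumptions rather than the paper's implementation) precomputes $\hat{q}_w$ and $b^{(k)}$ offline and verifies the conditions of (14) for a batch of quantized inputs:

```python
import numpy as np

INT16_MIN, INT16_MAX = -(2 ** 15), 2 ** 15 - 1

def fits_int16(a):
    return bool(np.all((a >= INT16_MIN) & (a <= INT16_MAX)))

def overflow_free(q_x, q_w, Z_x, Z_w):
    """Check the three conditions of Eq. (14) for one layer (sketch only)."""
    q_w_hat = q_w.astype(np.int32) - Z_w           # Eq. (12), precomputed offline
    b = -Z_x * q_w_hat.sum(axis=0)                 # Eq. (13), precomputed offline

    prod = q_x.astype(np.int32)[:, :, None] * q_w_hat[None, :, :]
    partial = b + np.cumsum(prod, axis=1)          # intermediate sums of Eq. (11)

    return fits_int16(prod) and fits_int16(b) and fits_int16(partial)

# Toy usage: N = 64 accumulation terms mapped to 8 output channels.
rng = np.random.default_rng(0)
q_x = rng.integers(0, 256, size=(4, 64))
q_w = rng.integers(0, 256, size=(64, 8))
print(overflow_free(q_x, q_w, Z_x=128, Z_w=128))   # likely False without narrowing the ranges
```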
4 Overflow Aware Quantization Framework
This section describes the details of our overflow-aware quantization algorithm, including both the representation and the training procedure, to ensure the three important conditions in (14).
4.1 Adaptive Integer Representation
One straightforward method to reduce accumulation overflow is to narrow the range of each quantized value. For example, by using 4-bit quantization instead of 8-bit quantization, real values are mapped to $[0, 15]$ instead of $[0, 255]$. As a result, numerical overflow becomes significantly less likely to happen, at the cost of wasting a large number of bits and reducing accuracy. To this end, we propose an adaptive float-bit-width method to fully utilize the representation capability without arithmetic overflow.
Specifically, we use a float quantization range mapping factor $\alpha$ to adjust the affine relationship between the real range and the quantized range, as shown in Figure 2. Scaled by $\alpha$, the original 8-bit quantization range (the largest bit-width that can be used) is mapped to a range narrowed by a factor of $\alpha$. By enlarging $\alpha$, we narrow down the quantized value range until arithmetic overflow is eliminated.
To present this mathematically, the affine function (1) can be written as

$q = \mathrm{round}\!\Big(\dfrac{r - r_{\min}}{S}\Big), \qquad S = \dfrac{r_{\max} - r_{\min}}{2^{n} - 1}$  (15)

By applying the scale factor $\alpha$ to $S$,

$S_{\alpha} = \alpha \cdot \dfrac{r_{\max} - r_{\min}}{2^{n} - 1}$  (16)

the quantized value is narrowed to

$q \in \Big[0,\; \dfrac{2^{n} - 1}{\alpha}\Big]$  (17)

where $r_{\min}$ and $r_{\max}$ are the minimum and maximum limits of the real value, and $n$ is the number of bits for the quantized value.
We name this the float-bit-width method since it differs from traditional integer low-bit representations, whose quantization ranges have to be chosen from a limited bit-width set, e.g., 3-bit for $[0, 7]$ or 4-bit for $[0, 15]$. With the proposed method, it is feasible to use a different quantization range mapping factor in each layer, maximizing the representation capability while prohibiting overflow. Additionally, since $\alpha$ is continuous, it can be easily integrated into the training process of DNNs. Details on adaptively learning $\alpha$ for the weights and activations of each layer are discussed in the next subsection.
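A minimal sketch of this scaled quantization is given below; it reflects our reading of (15)-(17), and the exact rounding and clipping behavior of a production implementation may differ:

```python
import numpy as np

def quantize_with_alpha(r, r_min, r_max, alpha, n_bits=8):
    """Quantize real values into a range narrowed by the mapping factor alpha.
    Sketch of Eqs. (15)-(17); formulas are our interpretation of the text."""
    S = alpha * (r_max - r_min) / (2 ** n_bits - 1)        # enlarged step size, Eq. (16)
    q_max = int((2 ** n_bits - 1) / alpha)                  # narrowed quantized range, Eq. (17)
    q = np.clip(np.round((r - r_min) / S), 0, q_max)
    return q.astype(np.int32), S

# Example: alpha = 2 compresses the 8-bit range to roughly 7 effective bits,
# making 16-bit accumulation much less likely to overflow.
r = np.linspace(-1.0, 1.0, 11)
q, S = quantize_with_alpha(r, r_min=-1.0, r_max=1.0, alpha=2.0)
print(q)
```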
In addition to the range of the quantized values, their distribution is also critical. We expect the quantized values to be centered and concentrated around zero, so that the accumulated sums are less likely to overflow. Initializing weights with a normally distributed initializer and applying L1/L2 regularization are representative techniques that can be used.
4.2 Learning Quantization Range Mapping Factor
To find a proper $\alpha$ for each layer that simultaneously prohibits arithmetic overflow and retains the model's performance, we propose a quantization-overflow-aware training framework. As shown in Figure 3, inspired by the simulation of quantization effects in the forward pass [Jacob et al.2018], we add an overflow-aware module that simulates neural operations, e.g., convolution and fully-connected layers, with a 16-bit accumulator to capture arithmetic overflow in the forward pass.
In the forward pass, 8-bit quantization simulates quantized inference using floating-point arithmetic, which is called fake quantization (FQ) [Jacob et al.2018]. To adaptively learn $\alpha$, we need to capture the amount of arithmetic overflow occurring in the quantized inference process. Therefore, we insert a quantization node (Q) into each layer of the computation graph in addition to FQ. Q requires $r_{\min}$ and $r_{\max}$ to produce 8-bit quantized values identical to those on the inference engine, and $r_{\min}$ and $r_{\max}$ are scaled by $\alpha$ before being passed into FQ and Q. For activations, $r_{\min}$ and $r_{\max}$ are aggregated by exponential moving averages (EMA) with a smoothing parameter close to 1, so that the observed ranges are smoothed and the network can enter a more stable state. The created convolution operation with a 16-bit accumulator, ConvINT16, takes the 8-bit quantized values as input to simulate the integer-only inference of the inference engine. It counts the amount of arithmetic overflow that accumulates during inference, covering all types of overflow defined in the overflow-free conditions (14). An easy way to capture this overflow signal in a popular training framework, e.g., TensorFlow or PyTorch, is to compare the results of the regular 32-bit convolution with those of the 16-bit convolution; a more efficient implementation can be written in a low-level language such as C++.
Let $O$ denote the amount of overflow captured in this way. Since obtaining $O$ involves integer-only computation, gradient descent is not a suitable way to update $\alpha$ in back-propagation. To address this problem, we use a simple rule instead. Specifically, when $O$ is greater than zero, we increase $\alpha$ by:

$\alpha \leftarrow \alpha + \min(\lambda_{\mathrm{inc}}, \lambda_{\max})$  (18)

where $\lambda_{\max}$ is a fixed maximum learning rate, and $\lambda_{\mathrm{inc}}$ is a dynamic learning rate for increasing $\alpha$ that decays with the training step. After a large number of iterations, $\lambda_{\mathrm{inc}}$ decays to a rather small value to stabilize the training state.

Alternatively, if $O$ is zero, we decrease $\alpha$ by:

$\alpha \leftarrow \alpha - \lambda_{\mathrm{dec}}$  (19)

where $\lambda_{\mathrm{dec}}$ is the learning rate for decreasing $\alpha$, whose properties and physical meaning are similar to those of $\lambda_{\mathrm{inc}}$. To improve training efficiency, calculating $O$ and updating $\alpha$ are executed only every few steps, e.g., every 10 or 50 iterations.
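The sketch below illustrates this scheme end-to-end; all names, the decay schedule, and the constants are our assumptions, and the convolution is reduced to a matrix product for brevity. The overflow amount is obtained by comparing a 16-bit accumulation against a 32-bit reference, and $\alpha$ is then nudged up or down accordingly.

```python
import numpy as np

def count_overflow(q_x, q_w_hat, b):
    """Count accumulation steps that overflow a signed 16-bit holder by comparing
    a 16-bit simulation against a wide reference (sketch; names/shapes are ours)."""
    prod = q_x.astype(np.int32)[:, :, None] * q_w_hat[None, :, :]
    ref = b + np.cumsum(prod, axis=1)                      # exact partial sums
    sim = (b.astype(np.int16) +
           np.cumsum(prod.astype(np.int16), axis=1, dtype=np.int16))
    return int(np.sum(ref != sim))

def update_alpha(alpha, overflow, step, lr_max=0.05, decay=1e-4, lr_dec=1e-3):
    """Rule-based update in the spirit of Eqs. (18)-(19); the exact schedule and
    constants here are assumptions, not the paper's values."""
    lr_inc = lr_max / (1.0 + decay * step)                 # decaying increase rate
    if overflow > 0:
        return alpha + min(lr_inc, lr_max)                 # widen the safety margin
    return max(1.0, alpha - lr_dec)                        # slowly reclaim precision
```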
The strategy for inserting Q is slightly different from that for FQ. As shown in Figure 3, fake quantization of an input or output is inserted after the activation function or after a bypass connection, e.g., an add or a concatenation, as these change the real min-max ranges. For operations such as max-pooling, up-sampling, or padding, FQ is not required. However, the 16-bit neural operation, e.g., ConvINT16, only accepts quantized integer inputs, so the outputs of Q have to be passed into it directly; that is, Q must be inserted ahead of ConvINT16 in the computation graph. Finally, Q and FQ share the same $r_{\min}$, $r_{\max}$, and $\alpha$, so that they strictly simulate the same inference process.

5 Overflow Study
In this section, we first present an in-depth analysis of the adaptive scale parameter in the proposed framework and explain why we focus on comparing OAQ against state-of-the-art 6-bit quantization methods. Subsequently, we study the overflow sensitivity of different neural network models and show how the overflow ratio affects the overall performance.
5.1 PerLayer Adaptive Scale
Since our key algorithmic contribution is a method to adaptively learn the quantization range mapping factor $\alpha$ for each layer of a DNN, it is also important to examine the distribution of $\alpha$ across networks in our experiments. Figure 4 provides two representative results of per-layer $\alpha$ values, adaptively trained using the proposed quantization-overflow-aware training framework. From this figure, we observe that the learned scale factor varies across layers, supporting our ‘scale-per-layer’ adaptive design. Additionally, the majority of the activation factors fall into a range indicating that the quantized values mostly use between 6 and 7 bits.
Additionally, we estimate the overflow ratio of low-bit quantization with a 16-bit accumulator by random multiply-add simulation. Specifically, we uniformly sampled $N$ values from the quantization value range and tried to capture an overflow signal, computed by applying $N$ consecutive multiply-add operations on a 16-bit holder. Since 3x3 depth-wise convolution and 1x1 point-wise convolution are widely used in lightweight DNN architectures, we chose $N$ from {9, 64, 256, 1024} as representative values. The rough non-overflow ratio was then estimated over 100,000 independent simulation runs. As shown in Figure 5, 6-bit quantization remains overflow-free in most cases, while the 7-bit and 8-bit settings are risky.
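This simulation can be reproduced with a few lines of NumPy (a rough re-creation; the sampling distribution, signedness, and trial count are our assumptions rather than the exact setup used for Figure 5):

```python
import numpy as np

def non_overflow_ratio(n_bits, length, trials=10_000, seed=0):
    """Estimate the probability that `length` consecutive int16 multiply-adds of
    uniformly sampled n_bit values never overflow (assumes symmetric signed values).
    10,000 trials keeps memory modest; the paper reports 100,000 runs."""
    rng = np.random.default_rng(seed)
    hi = 2 ** (n_bits - 1)
    a = rng.integers(-hi, hi, size=(trials, length), dtype=np.int32)
    b = rng.integers(-hi, hi, size=(trials, length), dtype=np.int32)
    partial = np.cumsum(a * b, axis=1)                      # exact running sums
    ok = np.all((partial >= -(2 ** 15)) & (partial <= 2 ** 15 - 1), axis=1)
    return ok.mean()

for length in (9, 64, 256, 1024):
    print(length, {n: round(non_overflow_ratio(n, length), 3) for n in (6, 7, 8)})
```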
5.2 Overflow Sensitivity Study
One may still worry that the quantization range mapping factors learned from training data cannot be applied safely to all test data; for example, a small amount of arithmetic overflow might cause the entire model to fail. In practice, we did observe some overflow during inference on test data (usually on the order of tens of events after quantization-overflow-aware training), yet the outputs remained correct. To analyze the impact of overflow quantitatively, we designed overflow simulation experiments in which we manually inject overflow into the outputs of selected layers and observe the influence on the final result. We tested the overflow sensitivity of MobileNet-v1 on ImageNet and MobileNet-v2-SSD on the COCO detection challenge, respectively. The results are shown in Figure 6. For the classification model, injecting 0.05% arithmetic overflow into any single layer has no impact on the overall accuracy; when it is applied to all layers, the accuracy drops slowly as the overflow ratio grows. For the detection model, overflow in the last layers affects the final result severely, with 0.01% overflow rendering the model unusable, but the model maintains high performance when overflow is injected into other layers.
6 Experiments
To demonstrate the performance of our proposed OAQ framework, we evaluate both inference accuracy/recall and runtime characteristics on representative public benchmarking datasets.
6.1 ImageNet
The first experiment benchmarks MobileNet-v2 [Sandler et al.2018], MobileNet-v1 [Howard et al.2017], and MobileNet-v1 with depth-multiplier 0.25 on the ILSVRC-2012 ImageNet dataset, which consists of 1000 classes, 1.28 million training images, and 50K validation images. We fine-tuned the MobileNet models from the pretrained model zoo (TF-slim, https://github.com/tensorflow/models/tree/master/research/slim). All of these models were retrained for 20 epochs. During training, the inputs were randomly cropped and resized to 224x224 before being fed into the network. Since the inputs of the first layer are not suitable for scaling, we skipped learning the quantization range mapping factor of the first layer's weights; this rule was applied to all the following experiments, including PACT [Choi et al.2018], which always uses 8-bit weights in the first layer. We report our evaluation results using Top-1 and Top-5 accuracy.
In our tests, we focus on comparing OAQ against state-of-the-art 6-bit quantization methods, including PACT [Choi et al.2018], RQ [Louizos et al.2018], and SR+DR [Gysel et al.2018]; comparisons against other selected state-of-the-art methods were also conducted. As shown in Table 1, OAQ even outperforms 8-bit Quantization-Aware Training (QAT) [Jacob et al.2018] in certain cases. From Table 1, we observe that when quantizing MobileNet-v1, OAQ outperforms RQ [Louizos et al.2018] and SR+DR [Gysel et al.2018] by large margins. Although PACT [Choi et al.2018] is better on Top-5, the proposed method consistently performs better on the other metrics, especially on MobileNet-v1-0.25, where it is 1.36% higher than PACT. We also include DSQ [Gong et al.2019] in the comparison, as it also achieved a speed-up over NCNN on an ARM Cortex-A53 CPU; on MobileNet-v2 our result is significantly better, e.g., 6.84% higher.
Table 1: ImageNet classification results (Top-1/Top-5 accuracy, %).
Model  Method  Bits  MAC bits  Top1 / Top5
MobileNet v2  FP  32  32  71.80 / 91.00 
QAT  8  32  70.90 / 90.00  
PACT  6  32  71.25 / 90.00  
DSQ  4  32  64.90 / —  
Our  adaptive  16  71.64 / 90.10  
MobileNet v1  FP  32  32  70.90 / 89.90 
QAT  8  32  70.10 / 88.90  
PACT  6  32  70.46 / 89.59  
RQ  6  32  68.02 / 88.00  
SR+DR  6  32  66.66 / 87.17  
Our  adaptive  16  70.87 / 89.56  
MobileNet v1-0.25  FP  32  32  49.80 / 74.20 
QAT  8  32  48.00 / 72.80  
PACT  6  32  46.03 / 70.07  
Our  adaptive  16  47.38 / 72.14 
6.2 Detection on VOC and COCO
Table 2: Object detection results on Pascal VOC (mAP, %).
Model  Method  Bits  MAC bits  mAP
MobileNetv1 SSD  FP  32  32  73.83 
QAT  8  32  72.54  
PACT  6  32  70.88  
Our  adaptive  16  72.53  
MobileNetv2 SSDLite  FP  32  32  72.79 
QAT  8  32  72.02  
PACT  6  32  70.50  
Our  adaptive  16  71.84 
Table 3: Object detection results on MS COCO (mAP).
Model  Method  Bits  MAC bits  mAP
MobileNetv1 SSD  FP  32  32  23.7 
QAT  8  32  23.0  
PACT  6  32  18.4  
Our  adaptive  16  22.0  
MobileNetv2 SSDLite  FP  32  32  22.7 
QAT  8  32  21.4  
PACT  6  32  18.1  
Our  adaptive  16  21.8 
To illustrate the applicability of our method to object detection, we applied OAQ to MobileNet-v1-SSD [Howard et al.2017, Liu et al.2016] and MobileNet-v2-SSDLite [Sandler et al.2018] and evaluated on the 2012 Pascal VOC object detection challenge and the 2017 MS COCO detection challenge. We implemented our experiments in TensorFlow and fine-tuned models from the TensorFlow Object Detection API (https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md), reusing only the backbone since the SSD header differs across tasks. We first trained the models with 32-bit floating-point precision to achieve state-of-the-art performance, and subsequently fine-tuned them on VOC with batch size 32 and on COCO with batch size 48, respectively.
The results on VOC and COCO are listed in Table 2 and Table 3, respectively. On both detection challenges, the proposed method outperforms 6-bit PACT [Choi et al.2018] significantly. In addition, our method achieves performance comparable to 8-bit QAT [Jacob et al.2018] with only a small mAP drop from the original model, i.e., about 1% mAP on VOC and less than 2% mAP on COCO.
6.3 Segmentation on VOC
To demonstrate that our method generalizes to semantic segmentation, we applied OAQ to DeepLab [Chen et al.2018] with MobileNet-v2 backbones (depth-multipliers 0.5 and 1.0). Performance was evaluated on the Pascal VOC segmentation challenge, which contains 1464 training images and 1449 validation images. The results are shown in Table 4. When quantizing the original model to 6 bits with PACT [Choi et al.2018], there is a significant performance drop, e.g., 10.9% mIOU with the MobileNet-v2-dm0.5 backbone. By comparison, the proposed OAQ method achieves performance comparable to QAT, dropping only 1.8% and 0.4% mIOU on the MobileNet-v2-dm0.5 and MobileNet-v2 backbones, respectively, compared to the original models.
Table 4: Semantic segmentation results on Pascal VOC (mIOU, %).
Model  Method  Bits  MAC bits  mIOU
MobileNetv2 dm0.5 deeplab  FP  32  32  71.8 
QAT  8  32  70.4  
PACT  6  32  60.9  
Our  adaptive  16  70.0  
MobileNetv2 deeplab  FP  32  32  75.3 
QAT  8  32  74.8  
PACT  6  32  70.4  
Our  adaptive  16  74.9 
Table 5: Inference time of MobileNet-v1 and ResNet-18 on low-cost CPUs (lower is better).
CPU  Inference Engine  MobileNetv1  ResNet18
Allwinner V328  TFLite  550  1370 
MNN  469  1605  
MNN + OAQ  341  1021  
Ours + OAQ  277  895  
MTK8167s  TFLite  387  950 
NCNN  351  706  
MNN  311  916  
MNN + OAQ  220  604  
Ours + OAQ  189  585 
6.4 Inference Efficiency Benchmark
Finally, we demonstrate the DNN acceleration capability of the proposed method on different low-cost hardware platforms. Specifically, we benchmarked computational efficiency on two selected platforms, Allwinner V328 and MTK8167s, whose processor architectures are ARM Cortex-A7 and ARM Cortex-A35, respectively. In this test, MobileNet-v1 and ResNet-18 were used as representative DNN models for inference. To run the DNN inference, three popular neural network inference engines for low-cost platforms were selected, i.e., TFLite, MNN [Jiang et al.2020], and NCNN, each using a single-threaded implementation on one core.
The experimental results are listed in Table 5 and clearly demonstrate that the proposed method outperforms the competing engines by wide margins: on both MTK8167s and Allwinner V328, our method achieves considerably faster runtime than both TFLite and NCNN.
7 Conclusion
In this paper, we propose an overflow-aware quantization method that significantly accelerates DNN inference while minimizing the loss of accuracy. To achieve this, we adaptively adjust the number of bits used to represent quantized fixed-point integers. This scheme is incorporated into a novel training framework that adaptively learns an overflow-free quantization range while maintaining high-end performance. Using the proposed method, an extremely lightweight neural network can achieve performance comparable to the 8-bit quantization method on the ImageNet classification challenge. Comprehensive experiments further verify that our method applies to various dense prediction tasks, e.g., object detection and semantic segmentation, outperforming competing state-of-the-art methods.
References
 [Cai et al.2019] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
 [Chen et al.2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
 [Choi et al.2018] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
 [Dong et al.2019] Zhen Dong, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision. arXiv preprint arXiv:1905.03696, 2019.

 [Gao et al.2019] Lianli Gao, Xiaosu Zhu, Jingkuan Song, Zhou Zhao, and Heng Tao Shen. Beyond product quantization: Deep progressive quantization for image retrieval. In IJCAI. AAAI Press, 2019.
 [Gong et al.2019] Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In ICCV, 2019.

 [Gysel et al.2018] Philipp Gysel, Jon Pimentel, Mohammad Motamedi, and Soheil Ghiasi. Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks. TNNLS, 29(11):5784–5789, 2018.
 [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [Howard et al.2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [Jacob et al.2018] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, pages 2704–2713, 2018.
 [Jain et al.2019] Sambhav R Jain, Albert Gural, Michael Wu, and Chris H Dick. Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. arXiv preprint arXiv:1903.08066, 2019.
 [Jiang et al.2020] Xiaotang Jiang, Huan Wang, Yiliu Chen, Ziqi Wu, Lichuan Wang, Bin Zou, Yafeng Yang, Zongyang Cui, Yu Cai, Tianhang Yu, Chengfei Lv, and Zhihua Wu. Mnn: A universal and efficient inference engine. In MLSys, 2020.
 [Krishnamoorthi2018] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
 [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [Liu et al.2016] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
 [Louizos et al.2018] Christos Louizos, Matthias Reisser, Tijmen Blankevoort, Efstratios Gavves, and Max Welling. Relaxed quantization for discretized neural networks. arXiv preprint arXiv:1810.01875, 2018.
 [Rastegari et al.2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV. Springer, 2016.
 [Ren et al.2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
 [Sandler et al.2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
 [Stock et al.2019] Pierre Stock, Armand Joulin, Rémi Gribonval, Benjamin Graham, and Hervé Jégou. And the bit goes down: Revisiting the quantization of neural networks. arXiv preprint arXiv:1907.05686, 2019.
 [Tan and Le2019] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
 [Tecent.Inc2017] Tecent.Inc. Ncnn. https://github.com/Tencent/ncnn, 2017.
 [Wang et al.2019] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
 [Wu et al.2018] Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer. Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090, 2018.
 [Zeng et al.2019] Linghua Zeng, Zhangcheng Wang, and Xinmei Tian. Kcnn: kernel-wise quantization to remarkably decrease multiplications in convolutional neural network. In IJCAI. AAAI Press, 2019.