1 Introduction
Convolutional neural networks (CNNs) have demonstrated state-of-the-art performance in many computer vision and image processing tasks, including classification [16], detection [26], image enhancement [15], and video coding [3]. However, deploying CNNs in practice remains difficult, especially where computing resources are limited, because CNNs have a huge number of parameters and require much computation. Much research has addressed this problem, for example by designing computationally efficient networks, exemplified by MobileNet [10], SqueezeNet [13], and ShuffleNet [30], or by pruning a well-trained network to reduce complexity [17] while maintaining accuracy [8].
A distinctive approach to efficient CNNs is to use low bit-depth for convolutions. In earlier years, networks with extremely low bit-depth were proposed, such as binary neural networks [12], ternary weight networks [19, 33, 22], and XNOR-Net [25]. These networks have not kept up with the current trend toward deeper and wider networks [28, 9, 11]. More recently, new methods have been studied, such as multi-bit compression [7, 31, 6], vector quantization [5], the hashing trick [1], and ADMM [18]. Among them, integer-arithmetic-only inference [14], which quantizes a floating-point-number (FPN) network to integers, seems a good solution. FPN arithmetic is not friendly to digital computing devices: it costs much more computing power than integer arithmetic. In addition, there is no standard of FPN arithmetic, so the implementation of FPN arithmetic is platform dependent [4]; this is a significant drawback for applications concerning interoperability, such as video coding. Integer networks provide the benefits of smaller models, faster inference, and cross-platform consistency.
In previous works, integer networks were always reported to be less accurate than the corresponding FPN networks given the same number of parameters. According to [14], the 8-bit integer ResNet152 achieves 76.7% accuracy, which is only slightly higher than the FPN ResNet50 (76.4%). Likewise, NVIDIA’s TensorRT reported 74.6% accuracy for 8-bit ResNet152, which is even lower [23]. The accuracy decline of very deep neural networks has been a severely discouraging fact about previous integer networks, especially considering that the network complexity has been multiplied to achieve only marginal improvement.
We want to achieve integer networks that perform as well as FPN networks. For this purpose, we analyzed previous quantization methods and observed that, as [23] claimed, if we quantize the weights linearly into 8 bits but keep the other modules unchanged, the accuracy can be as high as before quantization. However, if both weights and activations are quantized into 8 bits linearly, a significant drop in accuracy occurs, which is due to the low-precision activations. Nonetheless, previous works do not address this issue well.
Unlike parameter quantization, which quantizes static weights and biases into integers, activation quantization is dynamic because it quantizes the computed activations into integers during network inference. There are two possibilities for activation quantization: the first is to decide the quantization step on-the-fly, and the second is to determine the quantization step during training. We prefer the second because it not only saves online computation but also provides higher precision. For activation quantization, our key idea is to adapt the bounds of the activation function to feature maps and networks. Bounded activation functions such as ReLU6 [27] have been studied, but ReLU6 is too simple for various networks and datasets. We introduce the Bounded ReLU (BReLU) as the activation function in CNNs. Unlike the widely used ReLU, BReLU has both a lower bound and an upper bound and is linear between the two bounds. The bounds of BReLU are adjustable to suit different feature maps. Note that there is a fundamental tradeoff for BReLU followed by quantization. If the dynamic range of BReLU is large, the quantization step must be large, too (given a predefined bit-depth for activations), which loses precision. But if the dynamic range is small, many features will be deactivated, which limits the learning capability of CNNs. To address this issue, we propose methods to calculate the dynamic range of BReLU adaptively to the training data and networks.
We have verified the proposed method on three different CNN-based tasks: image classification, image super-resolution, and compression artifact reduction, all achieving state-of-the-art performance. The latter two tasks are regression tasks, which natively require high inference accuracy. Previously, low-bit-depth networks were seldom evaluated on such regression tasks. In addition, with the help of CUDA-enabled devices that perform fast short-integer convolutions, we manage to convert 32-bit FPN networks into integer-arithmetic-only networks with 8-bit weights and 7-bit activations. Our integer networks still achieve virtually the same accuracy as FPN networks, but have only 1/4 of the memory cost and run 2× faster on modern GPUs.
2 Related Work
The closest work to ours is [14], which is also the built-in method in Google’s TensorFlow. It applies exponential moving averages for activation quantization, which calculates the bounds of activations on-the-fly. It modifies the bounds after each iteration, which is too frequent to be suitable for parameter learning. Moreover, the exponential moving averages method requires a traversal of each feature map, which is computationally expensive. Another limitation is that all kernels in a concatenation layer are required to have the same quantization. We propose a new method to decide the bounds of BReLU, and we develop a more delicate ratio synchronization mechanism to handle the concatenation layer.
Another integer network method is NVIDIA’s TensorRT [23]. It proposes a relative entropy method to determine the activation bounds. The idea is to minimize the loss of information during quantization. However, minimizing the defined loss cannot ensure that inference accuracy is preserved. In addition, TensorRT relies on floating-point arithmetic, i.e., the resulting network is not integer-only.
3 Method
3.1 Architecture
Given a pretrained 32-bit floating-point-number (float32) convolutional network, our goal is an integer-arithmetic-only network with low-precision integer weights and activations, which will decrease the model size and accelerate inference. Similar to [14], our integer networks use 8-bit integers (int8) for weights and activations. Convolution produces larger numbers, after which we quantize the activations back to int8 numbers, which are used for further convolutions.
As shown in Figure 1, for an integer convolutional layer we perform integer convolution on input and weights to get feature maps . is adjusted by the bias to get , which is then processed by the specific activation function–Bounded ReLU and quantized to get . is the input of the next layer.
To achieve such an integer network, we quantize the weights and biases of a float32 network. For the quantization, we observed that the parameters in a convolutional kernel roughly obey a normal distribution with zero mean, for which linear quantization is efficient in precision and saves complexity compared to quantization with zero-shift. To fully utilize the limited precision of short integers, we map the maximal absolute value of the kernel to the maximal possible absolute value of the integer type. Specifically, for a given float32 convolutional kernel and int8 target,
(1)
where indicates the quantization step for :
(2) 
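The linear, zero-shift-free weight quantization described above can be sketched as follows. This is a minimal illustration rather than the exact formulation: the names `quantize_kernel`, `step`, and `q_max` are assumptions, with `step` standing in for the quantization step of Equation (2) and the rounding standing in for Equation (1).

```python
def quantize_kernel(weights, n_bits=8):
    """Linearly quantize float weights to signed integers without zero shift.

    Assumed notation: `step` is the quantization step, chosen so that the
    largest absolute weight maps to the largest signed integer.
    """
    q_max = 2 ** (n_bits - 1) - 1            # 127 for int8
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0 for _ in weights], 1.0     # degenerate all-zero kernel
    step = max_abs / q_max
    q = [int(round(w / step)) for w in weights]
    return q, step


# Hypothetical kernel values, flattened to a list for simplicity.
kernel = [0.50, -0.25, 0.10, -0.02]
q_kernel, step = quantize_kernel(kernel)
dequantized = [q * step for q in q_kernel]   # float values the integers stand for
```

Because the mapping has no zero point, a float weight of 0 is always represented exactly by the integer 0, and the rounding error of every weight is at most half a step.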
To achieve the best accuracy, it is natural to make the integer networks close to the float32 networks during inference, because the float32 networks are well-trained. Accordingly, we assume a linear mapping between each of the kernels, feature maps, and other parameters of an integer network and its floating-point counterpart:
(3) 
For example, an integer convolutional kernel is an approximation of its floating-point counterpart by . During network inference, these ratios will be accumulated. When the forward propagation is finished, the overall ratio can either be ignored (e.g. for classification) or be reset to 1 by rescaling the output (e.g. for regression). We modify and fine-tune the FPN network before mapping it to an integer network, which does not need any further training.
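To illustrate how ratios accumulate through a layer, the sketch below evaluates one output position of a convolution (a dot product) in both domains. The convention "integer value ≈ ratio × float value", the names `ratio_x`, `ratio_w`, `ratio_y`, and all numeric values are assumptions made for illustration.

```python
def dot(a, b):
    """One output position of a convolution is a dot product."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical float32 values for one receptive field and one kernel.
x_float = [0.20, -0.40, 0.10]
w_float = [0.50, 0.25, -0.75]

# Assumed ratios between the integer and float representations.
ratio_x = 127 / 0.5      # inputs assumed to lie in [-0.5, 0.5]
ratio_w = 127 / 0.75     # largest |w| maps to 127

x_int = [round(v * ratio_x) for v in x_float]
w_int = [round(v * ratio_w) for v in w_float]

y_int = dot(x_int, w_int)          # integer arithmetic only
ratio_y = ratio_x * ratio_w        # the ratios multiply through the layer
y_float = dot(x_float, w_float)
y_restored = y_int / ratio_y       # a single rescale at the network output
```

For a classification network only the ordering of the outputs matters, so the final division can be skipped; for regression the output is rescaled once so that the overall ratio becomes 1.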
A low-precision network can be viewed as a high-precision network corrupted by some noise; for example, noise is introduced by the rounding operation of Equation (1). During the entire process of integer network inference, there are two kinds of noise: that due to weight quantization and that due to activation quantization. Therefore, our work focuses on controlling the quantization noise to ensure the performance of integer networks.
3.2 Activation Quantization With BReLU
3.2.1 Why Bounded ReLU
By discretizing weights before quantizing them into integers, we ensure that the integer weights correspond precisely to the float32 weights. However, integer convolution produces larger numbers. We cannot afford to increase the bit-depth after each convolution, so we need to quantize activations from int32 back to int8. Following [14], we perform activation quantization by multiplication and right shift. Using the symbols in Figure 1,
(4) 
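Quantization by multiplication and right shift can be sketched as below: a real-valued scale is approximated by `m / 2**s` with integer `m` and `s`, so that rescaling an accumulator needs no division or floating-point operation. The function names, the multiplier limit of 255, and the round-to-nearest offset are assumptions for illustration, not details taken from Equation (4).

```python
def find_multiplier_and_shift(scale, max_shift=31, max_multiplier=255):
    """Approximate a positive real `scale` by m / 2**s with integers m and s."""
    best = None
    for s in range(max_shift + 1):
        m = round(scale * (1 << s))
        if m < 1 or m > max_multiplier:
            continue                      # multiplier outside the assumed range
        err = abs(m / (1 << s) - scale)
        if best is None or err < best[0]:
            best = (err, m, s)
    if best is None:
        raise ValueError("scale cannot be represented in the given range")
    return best[1], best[2]


def requantize(acc, m, s):
    """Rescale an integer accumulator using one multiply and one right shift."""
    # Adding half of 2**s before shifting rounds to nearest instead of flooring.
    rounding = (1 << (s - 1)) if s > 0 else 0
    return (acc * m + rounding) >> s


m, s = find_multiplier_and_shift(0.0123)   # hypothetical activation scale
out = requantize(5000, m, s)               # hypothetical int32 accumulator
```

The pair `(m, s)` is fixed offline, so the only per-activation cost at inference time is one integer multiplication, one addition, and one shift.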
This comes with additional noise due to activation quantization. Unlike parameters, activations depend on input data and cannot be trained, which means that activation quantization noise is not removable. For short integers like int8, activation quantization noise is a considerable challenge. For example, our experiment on a 4-layer network (Section 4.2.2) shows that if we use conventional ReLU and always map the largest value into the target range without clipping, the performance drops substantially. Besides, dynamically searching for the largest value in a feature map is computationally expensive. Consequently, we propose Bounded ReLU as the activation function, together with its adaptation to different datasets and networks. BReLU sets a static range for the activation:
(5) 
For both FPN networks and integer networks, we use BReLU with (Figure 2) to replace ReLU. For the upper bound, we have for and for . To fully exploit the limited integer precision, given an integer data type for , is quantized to the maximal possible value of . For example, for int8. Then we discuss how to set the upper bound. Unlike 32-bit floating-point numbers, short integers have smaller and equally divided representation ranges. Quantizing a float32 feature map into a short integer representation is similar to a uniform quantizer [20], and choosing the upper bound is actually restricting the range of activations that are “allowed.” Clearly, a larger upper bound allows more activations to be learned, but also results in a larger quantization step for the uniform quantizer, which will introduce more noise to the output integer representation. Therefore, our methodology is to minimize the upper bound while keeping the learning ability of the floating-point network unaffected, meaning that the network can restore its accuracy after fine-tuning.
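The tradeoff between the upper bound and the quantization step can be made concrete with a small sketch. It assumes a lower bound of 0 and an integer maximum of 127; the function names and the sample feature values are hypothetical.

```python
def brelu(x, upper, lower=0.0):
    """Bounded ReLU: clip x to [lower, upper], linear in between."""
    return min(max(x, lower), upper)


def quantize_activation(x, upper, q_max=127):
    """Apply BReLU, then map [0, upper] uniformly onto the integers 0..q_max."""
    step = upper / q_max                  # a larger bound gives a coarser step
    return int(round(brelu(x, upper) / step))


feature = [-1.2, 0.3, 2.0, 7.5]           # hypothetical float activations
small_bound = [quantize_activation(v, upper=2.0) for v in feature]
large_bound = [quantize_activation(v, upper=8.0) for v in feature]
```

With the small bound, 7.5 is clipped to the same integer as 2.0 (a deactivated feature), while 0.3 is resolved finely; with the large bound, 7.5 survives but 0.3 is represented by only a handful of levels.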
3.2.2 Data Domain Adaption
To achieve a minimal upper bound, it is unavoidable that a group of large values in a feature map is deactivated. Based on the fact that values in a feature map roughly obey a Gaussian distribution with mean μ and standard deviation σ, we experiment with a multiple-of-σ rule, an empirical rule which provides a range covering the vast majority of a normal distribution. Consequently, we deactivate values which lie beyond the vast majority. For example, with the 3σ rule we set the upper bound to the minimal value of the largest 0.15% of values in a feature map.
The upper bound is set by computing over the training dataset, and remains unchanged afterwards. Computing over the entire training set, however, is not a good choice because certain outlier samples might take over the largest values. Thus, we calculate the upper bound in a batch and take the average value over batches as the recommended upper bound for the float32 network. This value will be further refined in the following. This method is adaptable to different feature maps generated from different data, and our experiment shows that the rule is better than ReLU6.
For shallow networks specifically, we introduce the Geometric Progression method, which fixes the upper bounds by the training data from scratch and does not need pretraining. Given the number of layers of a network, the first term and the last term are fixed with regard to the range of input and output, and each term in between is calculated by a geometric progression:
(6) 
Each term is used as the upper bound for the corresponding layer. This method provides a BReLU-enabled float32 network from scratch and simplifies the process of network quantization. It works for shallow networks because their learning ability is largely constrained by nature, so a large number of deactivated features is acceptable.
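A geometric progression between two fixed end terms can be sketched as follows. The indexing is an assumption: term 0 is tied to the input range, term `n_layers` to the output range, and the terms in between serve as per-layer upper bounds. The end values used here are hypothetical.

```python
def geometric_bounds(first, last, n_layers):
    """Return n_layers + 1 terms of a geometric progression from first to last."""
    if first <= 0 or last <= 0:
        raise ValueError("both end terms must be positive")
    if n_layers < 1:
        raise ValueError("the network needs at least one layer")
    ratio = (last / first) ** (1.0 / n_layers)   # common ratio of the progression
    return [first * ratio ** i for i in range(n_layers + 1)]


# Hypothetical 4-layer network: input range 0.5, output range 0.125.
bounds = geometric_bounds(first=0.5, last=0.125, n_layers=4)
```

Since only the two end terms depend on data, the bounds are available before any training, which is what allows a BReLU-enabled network to be trained from scratch.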
3.2.3 Model Domain Adaption
The rule works well for a variety of models because its multiplier is adaptable. With different multipliers we can choose different ratios of deactivated values in a feature map. For example, based on 7-bit activations, we find works fine for VDSR and ResNet152, but works better for ResNet18. This is because ResNet152 is deeper and requires higher precision, thus the 3σ rule provides a smaller upper bound and still provides enough learning ability. In our experiments, we always try a set of multipliers, applying each one to every activation within a network. In theory, the more bits available for activations, the larger the multiplier we prefer, because it provides stronger learning ability. It is also possible that different layers in a convolutional neural network call for different multipliers, which implies that these layers may play different roles for the final accuracy. We will investigate this in future work.
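The batch-averaged calibration of the upper bound can be sketched as below. The multiplier name `k`, the function names, and the synthetic Gaussian batches are assumptions; the per-batch bound is taken as the smallest of the values in the one-sided Gaussian tail beyond `k` standard deviations, mirroring the "minimal value of the largest fraction" description.

```python
import random
from statistics import NormalDist, fmean


def batch_upper_bound(values, k):
    """Smallest of the largest values that fall in the k-sigma Gaussian tail."""
    # One-sided tail mass beyond k sigma: about 0.13% for k = 3, which the
    # 3-sigma rule of thumb rounds to 0.15%.
    tail = 1.0 - NormalDist().cdf(k)
    ordered = sorted(values)
    n_tail = max(1, int(round(tail * len(ordered))))
    return ordered[-n_tail]


def calibrate_upper_bound(batches, k):
    """Average the per-batch bounds so that outlier samples cannot dominate."""
    return fmean([batch_upper_bound(batch, k) for batch in batches])


rng = random.Random(0)
batches = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(8)]
bound_k3 = calibrate_upper_bound(batches, k=3)
bound_k4 = calibrate_upper_bound(batches, k=4)
```

A larger `k` never yields a smaller bound, so sweeping a few values of `k` gives a monotone family of candidate bounds to evaluate by fine-tuning.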
3.3 Quantized BReLU
Beside for the floating-point network, we have to quantize it to get its integer counterpart (Figure 2). Taking the convolutional layer in Figure 1 as an example, we have
(7) 
thus
(8) 
we want to be corresponding to , so we can find a pair of two integers and :
(9) 
to make sure that
(10) 
To achieve and the following ratios, we shall use and to rescale . However, since and are integers, it is difficult to ensure that corresponds to precisely. Thus, we can modify . By dequantizing we have precisely match by , as well as all the values clipped by . Note that is not used for convolution, so we simply ensure that is quantized to :
(11)  
(12)  
(13) 
Once all the BReLUs in the FPN networks are fixed, we continue to train the float32 network. This will reduce the loss of accuracy for float32 network while minimizing the noise for integer network introduced by activation quantization.
3.4 Workarounds
3.4.1 Renormalization
In most cases, input images of a CNN are coded as 8-bit unsigned integers (uint8), which are suitable for integer arithmetic by nature. In TensorRT, the input of 8-bit networks is still kept as floating-point numbers, meaning that the integer inference starts after the first layer. In our integer networks, we simply subtract 128 from the uint8 raw data to get int8 input. (This is mostly due to a computational issue: the convolutional kernel is int8, and it is a bit complex to mix uint8 and int8 in convolution.) Correspondingly, in floating-point networks we also subtract 128 from the raw data and then multiply by . This is slightly different from most existing networks, which usually normalize the input by subtracting the mean and dividing by the standard deviation. Fortunately, the different input normalizations do not seem to influence the accuracy much, as long as we fine-tune the networks by the original training methods.
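The two renormalizations can be sketched side by side. The float scale of 1/256 is an assumption, chosen so that the largest absolute input value becomes 0.5; the function names are hypothetical.

```python
def renormalize_uint8(pixels):
    """Shift uint8 pixel values (0..255) into the int8 range (-128..127)."""
    return [p - 128 for p in pixels]


def renormalize_float(pixels, scale=1.0 / 256.0):
    """Float counterpart: subtract 128, then multiply by an assumed scale."""
    return [(p - 128) * scale for p in pixels]


raw = [0, 64, 128, 255]                  # hypothetical uint8 pixel values
int_input = renormalize_uint8(raw)
float_input = renormalize_float(raw)
```

Under this assumption the integer input is exactly 256 times the float input, so the first layer starts with a known, constant ratio between the two networks.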
3.4.2 Ratio synchronization between feature maps
In the forward propagation, the ratios between integer and FPN networks are accumulated along the forward path. Chances are that two paths merge but with different ratios, for example if one path is deeper than the other (e.g. residual connection). In such cases, we need to adjust the ratios of the involved feature maps by rescaling them explicitly or implicitly. We develop Ratio Synchronization between feature maps which ensures the best quantization steps for kernels and same scale for all involved activations. Specific examples will be detailed in the experiments.
3.4.3 Training
To recover the accuracy lost to quantization, we fine-tune the network parameters, similar to [12]. However, instead of training in the integer domain directly, we fine-tune the FPN weights, thus preserving the full precision after quantization. First we discretize the weights by the quantization step :
(14) 
Then we use the discretized weights for forward propagation. After backward propagation, we update the original weights by the gradients. After fine-tuning, we quantize the discretized kernel , which corresponds to the integer kernel exactly. By avoiding scaling during the training stage, our method does not involve pixel-wise division of activations, thus saving training time and power. Note that biases are not discretized because they will be quantized to int32, which has enough precision, and the difference between float32 biases and int32 biases can be safely ignored.
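One fine-tuning iteration with discretized weights can be sketched as follows. The snapping to multiples of the step is an assumed reading of Equation (14); the gradient values, learning rate, and function names are hypothetical, and a plain gradient step stands in for whatever optimizer is actually used.

```python
def discretize(weights, step):
    """Snap each float weight to the nearest multiple of the quantization step."""
    return [round(w / step) * step for w in weights]


def sgd_update(weights, grads, lr):
    """Update the original (undiscretized) float weights with the gradients."""
    return [w - lr * g for w, g in zip(weights, grads)]


step = 0.5 / 127                          # hypothetical quantization step
weights = [0.5, -0.2501, 0.1003]          # full-precision weights kept in training
w_disc = discretize(weights, step)        # used for forward propagation
grads = [0.1, -0.2, 0.05]                 # hypothetical gradients w.r.t. w_disc
weights = sgd_update(weights, grads, lr=0.01)

# After fine-tuning, the discretized kernel maps to integers without extra error.
w_int = [round(w / step) for w in discretize(weights, step)]
```

Keeping the full-precision copy lets small gradient updates accumulate across iterations even when a single update is too small to move a weight to the next discrete level.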
As indicated in [14], a batch normalization (BN) layer rescales the of the previous convolutional kernel, resulting in different ’s for each output channel. Jacob et al. [14] proposed to merge BN layers into convolutional kernels before training, but failed to recover the precision in the end. Similarly, our experiment shows that removing BN before training leads to a significant drop in accuracy. So we use different quantization steps for different channels during training and merge the BN parameters into the convolution after training.
3.4.4 Algorithm
In summary, given a float32 network and training data, we use Algorithm 1 to convert it into an integer network. Our method converts an FPN network into an integer network in stages, and each stage builds on the best accuracy of the previous one. Our method requires three rounds of fine-tuning: after renormalization, after discretization of weights, and after applying BReLU. By applying this method, we hope to find the best quantization step , the best , and the best integer model.
4 Experiments
We evaluate our method on state-of-the-art network architectures for both classification and regression tasks. Besides, our method promises bit-wise consistency, so we also test on CNNs used in a video codec [3], which allows only integer computation and demands cross-platform consistency. We use an NVIDIA Titan X (Pascal) GPU to measure the computational time.
4.1 Classification
Networks used in classification are the backbones of many computer vision tasks such as object detection and semantic segmentation. Among them, ResNet [9] is a widely used network architecture achieving state-of-the-art classification accuracy. It includes multiple residual blocks as well as convolution and pooling layers. Our integer ResNets use only integer arithmetic during inference.
ResNet features many residual connections, for which we develop the ratio synchronization method.
Taking the basic block in ResNet18 (Figure 3) as an example, the skip connection adds the input of a residual block to its output, directly (type 1) or after convolution (type 2). For type 1, first we rescale by multiplication and right shift using a method similar to Equation (9) to make the ratio of close to , then adjust the quantization step of the second convolutional kernel to make the two ratios exactly the same. For type 2, we perform convolution on the quantized input and do the same for the output feature maps. This procedure is carried out while calculating .
By applying Algorithm 1, we obtain integer ResNets (Table 1). They achieve equivalent precision with higher speed and a quarter of the model size. Here the model size is the memory footprint of the parameters. For storage purposes, compression such as Huffman coding can further reduce it. Compared with Jacob et al. [14], which handles activation quantization by exponential moving averages, and TensorRT [23], which uses relative entropy, our method achieves state-of-the-art performance very close to the baseline accuracy. The significance of this advance is that it serves applications demanding very high accuracy while also providing faster inference and a smaller model size, as shown in Table 2.
Network    Quantized     Top-1 (%)  Top-5 (%)
ResNet18   float32       69.76      89.08
ResNet18   TensorRT      69.56      88.99
ResNet18   Ours          70.10      89.36
ResNet152  float32       78.31      94.06
ResNet152  TensorRT      74.70      91.78
ResNet152  Jacob et al.  76.70      /^{1}
ResNet152  Ours          78.01      93.81

^{1} Not provided in [14].
Network    Quantized  Model size  Conv. time^{1}
ResNet18   float32    44.6 MB     35.1 ms
ResNet18   TensorRT   13.3 MB     /^{2}
ResNet18   Ours       11.1 MB     16.2 ms
ResNet152  float32    230 MB      234 ms
ResNet152  TensorRT   67.6 MB     /^{2}
ResNet152  Ours       57.5 MB     91.9 ms

^{1} Only time for convolutions, average time per batch (50 images).
^{2} Not comparable because TensorRT adopts other optimizations.
To further demonstrate the effect of our method, including Ratio Synchronization (RS) and BReLU, we compare the results of 4 different configurations in Table 3. Even if we deactivate RS and use the simple ReLU6 activation, our result is still significantly better than Jacob et al. [14] and TensorRT. This may be due to our whole quantization algorithm, and we will open our code and models for researchers.
Another finding is that within a single convolutional kernel with batch normalization, there are different quantization steps for different channels, and some of them are so large that the whole channel is 0 after quantization. Since our experiments show that the accuracy is not affected, we believe that these channels can be safely pruned, which will further reduce the model size and speed up inference.
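The effect of per-channel quantization steps on prunable channels can be sketched as follows. The kernel values, the step values, and the function names are hypothetical; each inner list stands for the flattened weights of one output channel.

```python
def quantize_per_channel(kernel, steps):
    """Quantize each output channel of a kernel with its own step."""
    return [[round(w / step) for w in channel]
            for channel, step in zip(kernel, steps)]


def all_zero_channels(q_kernel):
    """Indices of output channels whose quantized weights are all zero."""
    return [i for i, channel in enumerate(q_kernel)
            if all(q == 0 for q in channel)]


kernel = [[0.40, -0.10, 0.25],
          [0.003, -0.002, 0.001],
          [-0.30, 0.20, 0.05]]
# The second step is assumed to be far larger than the weights of its channel.
steps = [0.40 / 127, 0.05, 0.30 / 127]

q_kernel = quantize_per_channel(kernel, steps)
prunable = all_zero_channels(q_kernel)
```

A channel listed in `prunable` contributes nothing to the integer convolution, so removing it changes neither the integer output nor the accuracy measured on the integer network.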
Configuration  Top-1 (%)  Top-5 (%)
FPN            78.314     94.060
ReLU6          77.602     93.650
ReLU6+RS       77.810     93.718
BReLU          77.754     93.656
BReLU+RS       78.008     93.806
4.2 Regression
Many image processing tasks require regression CNNs, such as image super-resolution, denoising, and generative models. Nonetheless, low-precision networks for regression tasks have seldom been reported. Since regression networks do not have average pooling or non-maximum suppression, they are less tolerant of noise during inference. By applying our method to multiple regression tasks, we show that it is also suitable for regression networks.
4.2.1 VDSR for image superresolution
Very deep convolutional networks for image super-resolution (VDSR) [15] is a typical CNN architecture used for regression tasks, constructed from 20 convolutional layers with kernel size 3×3. The output is an image of the same size as the input, but looks better, as if it were “super-resolved.” VDSR is widely used in end-to-end image tasks such as image coloring, image denoising, and image deraining. Our result is shown in Table 4. Besides 7-bit activations, we also evaluated our method with activations down to 4 bits. As shown in Table 5, the accuracy drops significantly from 5-bit to 4-bit. Generally speaking, the deeper or less tolerant to noise a network is, the higher the precision needed for activations.
Single image super-resolution is widely used in computer vision applications ranging from security and surveillance imaging to medical imaging, where more image details are required on demand. Therefore, it requires not only objective evaluation but also subjective quality. Our results in Figure 4 demonstrate that the output of our integer network is both accurate in PSNR and as visually pleasing as the output of the float32 network.
                     float32  int8
PSNR for ×2          37.55    37.51
PSNR for ×3          33.78    33.78
PSNR for ×4          31.44    31.42
Model size (MB)      2.54     0.65
Conv. time (ms)^{1}  120      55.7

^{1} Only time for convolutions, overall time of Set5.
Bit-depth  ×2 PSNR  ×3 PSNR  ×4 PSNR
8          37.52    33.78    31.42
7          37.51    33.78    31.42
6          37.49    33.76    31.40
5          37.39    33.69    31.29
4          36.08    33.24    30.74
float32    37.55    33.78    31.44
4.2.2 VRCNN for compression artifact reduction
VRCNN [3] is a network used in video coding for compression artifact reduction. It works similarly to VDSR and consists of 4 convolutional layers. VRCNN has 2 concatenation layers, each of which has two kinds of kernel sizes, as in Inception Net [29]. We design the ratio synchronization method for concatenation layers, as shown in Algorithm 2. To scale all the involved feature maps to the same ratio, we select the minimal ratio as the reference to ensure that no feature map overflows. Then we adjust each quantization parameter and to reach that ratio, as well as the corresponding to ensure precise synchronization. We perform a similar process on the skip connections in the ResNets above.
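The first step of ratio synchronization, choosing the minimal ratio as reference and rescaling every branch toward it, can be sketched as below. This is only an approximation step under assumed names and a fixed shift of 8; the exact matching obtained by further adjusting the quantization parameters is not reproduced here.

```python
def synchronize_ratios(branch_ratios, shift=8):
    """Rescale every branch toward the smallest ratio with multiply-and-shift.

    Returns the reference ratio and, per branch, a tuple
    (multiplier, shift, effective_ratio), where
    effective_ratio = ratio * multiplier / 2**shift.
    """
    reference = min(branch_ratios)
    plans = []
    for ratio in branch_ratios:
        # reference / ratio <= 1, so the multiplier never exceeds 2**shift
        # and rescaling can only shrink a feature map, never overflow it.
        multiplier = round(reference / ratio * (1 << shift))
        effective = ratio * multiplier / (1 << shift)
        plans.append((multiplier, shift, effective))
    return reference, plans


# Hypothetical ratios of three branches feeding one concatenation layer.
reference, plans = synchronize_ratios([120.0, 95.5, 300.0])
```

After this step the effective ratios agree with the reference only up to the resolution of the integer multiplier, which is why a further adjustment of the quantization parameters is needed for exact synchronization.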
VRCNN is a shallow network, for which we calculated the upper bounds by the simple Geometric Progression method. Specifically, we set the first term to 0.5, which is the absolute max value of the input after renormalization, and the last term to the max absolute value of the output residual. We train the BReLU-FPN network from scratch and quantize it, leading to an integer VRCNN with the same performance as its float32 counterpart.
         BD-rate  Model size (KB)  Inf. time (ms)^{1}
float32  5.3%     214              155
int8     5.3%     54               79

^{1} Entire inference time for a single picture of size 2560×1600.
Performance of VRCNN is evaluated on standard HEVC test sequences with QP ranging from 22 to 37, as shown in Table 6. Note that cross-platform consistency is crucial in video codecs, which is one of the motivations of our work.
5 Conclusion
We have proposed a method for converting FPN networks into integer-arithmetic-only networks without sacrificing accuracy. Our key idea is to replace ReLU with BReLU and adapt it to various datasets and networks, so as to efficiently quantize the activations. For the upper bound of BReLU we have studied two methods, for shallow and deep networks respectively.
We have tested our methods on three tasks, including image super-resolution and compression artifact reduction, for which integer networks have not been reported before. Our method achieves efficient integer networks that have a quarter of the model size and double the speed on modern GPUs, and ensures cross-platform consistency. Our method outperforms Jacob et al. [14] and TensorRT in accuracy and achieves state-of-the-art performance.
In the future, we plan to extend our method to other CNNs that have customized or specialized units, such as leaky ReLU and sigmoid (e.g. in LSTM). We will also conduct further study into BReLU by applying different strategies for different layers.
References
 [1] Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen, ‘Compressing neural networks with the hashing trick’, CoRR, abs/1504.04788, (2015).
 [2] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan, ‘PACT: parameterized clipping activation for quantized neural networks’, CoRR, abs/1805.06085, (2018).
 [3] Yuanying Dai, Dong Liu, and Feng Wu, ‘A convolutional neural network approach for post-processing in HEVC intra coding’, in MultiMedia Modeling, pp. 28–39, Cham, (2017). Springer International Publishing.
 [4] David Goldberg, ‘What every computer scientist should know about floating-point arithmetic’, ACM Comput. Surv., 23(1), 5–48, (March 1991).
 [5] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev, ‘Compressing deep convolutional networks using vector quantization’, CoRR, abs/1412.6115, (2014).
 [6] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan, ‘Deep learning with limited numerical precision’, CoRR, abs/1502.02551, (2015).
 [7] Song Han, Huizi Mao, and William J. Dally, ‘Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding’, (2015).
 [8] Song Han, Jeff Pool, John Tran, and William Dally, ‘Learning both weights and connections for efficient neural network’, in Advances in Neural Information Processing Systems 28, eds., C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 1135–1143, Curran Associates, Inc., (2015).
 [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, ‘Deep residual learning for image recognition’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2016).
 [10] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, ‘Mobilenets: Efficient convolutional neural networks for mobile vision applications’, CoRR, abs/1704.04861, (2017).
 [11] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, ‘Densely connected convolutional networks’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (July 2017).
 [12] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, ‘Binarized neural networks’, in Advances in Neural Information Processing Systems 29, eds., D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 4107–4115, Curran Associates, Inc., (2016).
 [13] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer, ‘Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size’, CoRR, abs/1602.07360, (2016).
 [14] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko, ‘Quantization and training of neural networks for efficient integer-arithmetic-only inference’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2018).
 [15] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, ‘Accurate image super-resolution using very deep convolutional networks’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2016).
 [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, ‘Imagenet classification with deep convolutional neural networks’, in Advances in Neural Information Processing Systems 25, eds., F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, 1097–1105, Curran Associates, Inc., (2012).
 [17] Yann Le Cun, John S. Denker, and Sara A. Solla, ‘Optimal brain damage’, in Proceedings of the 2nd International Conference on Neural Information Processing Systems, NIPS’89, pp. 598–605, Cambridge, MA, USA, (1989). MIT Press.
 [18] Cong Leng, Hao Li, Shenghuo Zhu, and Rong Jin, ‘Extremely low bit neural network: Squeeze the last bit out with ADMM’, CoRR, abs/1707.09870, (2017).
 [19] Fengfu Li and Bin Liu, ‘Ternary weight networks’, CoRR, abs/1605.04711, (2016).
 [20] Y. Linde, A. Buzo, and R. Gray, ‘An algorithm for vector quantizer design’, IEEE Transactions on Communications, 28(1), 84–95, (January 1980).
 [21] Jeffrey L. McKinstry, Steven K. Esser, Rathinakumar Appuswamy, Deepika Bablani, John V. Arthur, Izzet B. Yildiz, and Dharmendra S. Modha, ‘Discovering low-precision networks close to full-precision networks for efficient embedded inference’, CoRR, abs/1809.04191, (2018).
 [22] Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey, ‘Ternary neural networks with fine-grained quantization’, CoRR, abs/1705.01462, (2017).
 [23] Szymon Migacz, ‘8-bit inference with TensorRT’, in GPU technology conference, volume 2, p. 7, (2017).
 [24] Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr, ‘WRPN: wide reduced-precision networks’, CoRR, abs/1709.01134, (2017).
 [25] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, ‘XNOR-Net: Imagenet classification using binary convolutional neural networks’, in Computer Vision – ECCV 2016, eds., Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, pp. 525–542, Cham, (2016). Springer International Publishing.
 [26] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, ‘You only look once: Unified, real-time object detection’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2016).
 [27] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, ‘Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation’, CoRR, abs/1801.04381, (2018).
 [28] K. Simonyan and A. Zisserman, ‘Very Deep Convolutional Networks for Large-Scale Image Recognition’, arXiv e-prints, (September 2014).
 [29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, ‘Going deeper with convolutions’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2015).
 [30] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, ‘Shufflenet: An extremely efficient convolutional neural network for mobile devices’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2018).
 [31] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen, ‘Incremental network quantization: Towards lossless CNNs with low-precision weights’, CoRR, abs/1702.03044, (2017).
 [32] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou, ‘DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients’, CoRR, abs/1606.06160, (2016).
 [33] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally, ‘Trained ternary quantization’, CoRR, abs/1612.01064, (2016).