Convolutional neural networks (CNNs) have demonstrated state-of-the-art performance in many computer vision and image processing tasks, including classification, detection, image enhancement, and video coding. However, deploying CNNs in practice remains difficult, especially where computing resources are limited, because CNNs have a huge number of parameters and require substantial computation. Much research has addressed this problem, for example by designing computationally efficient networks, exemplified by MobileNet, SqueezeNet, and ShuffleNet, or by pruning a well-trained network to reduce complexity while maintaining accuracy.
A distinctive approach to efficient CNNs is to use low bit-depth for convolutions. Early work proposed networks with extremely low bit-depth, such as binary neural networks, ternary weight networks [19, 33, 22], and XNOR-Net. These networks have not kept up with the current trend of deeper and wider networks [28, 9, 11]. More recently, new methods have been studied, such as multi-bit compression [7, 31, 6], vector quantization, the hashing trick, and ADMM. Among them, integer-arithmetic-only inference, which quantizes a floating-point-number (FPN) network to integers, appears to be a good solution. FPN arithmetic is not friendly to digital computing devices: it costs much more computing power than integer arithmetic. In addition, there is no single standard of FPN arithmetic, so the implementation of FPN arithmetic is platform dependent. This is a significant drawback for applications requiring interoperability, such as video coding. Integer networks offer a smaller model, faster inference, and cross-platform consistency.
In previous works, integer networks were consistently reported to be less accurate than the corresponding FPN networks given the same number of parameters. According to , the 8-bit integer ResNet152 achieves 76.7% accuracy, only slightly higher than the FPN ResNet50 (76.4%). Likewise, NVIDIA's TensorRT reported 74.6% accuracy for 8-bit ResNet152, which is even lower . This accuracy decline of very deep networks has been a severely discouraging fact for previous integer networks, especially considering that the network complexity must be multiplied to achieve only marginal improvement.
We aim to achieve integer networks that perform as well as FPN networks. To this end, we analyzed previous quantization methods and observed that, as  claimed, if we quantize the weights linearly into 8-bit but keep the other modules unchanged, the accuracy can remain as high as before quantization. However, if both weights and activations are quantized linearly into 8-bit, a significant drop of accuracy occurs due to the low-precision activations. Previous works do not address this issue well.
Different from parameter quantization, which quantizes static weights and biases into integers, activation quantization is dynamic: it quantizes the computed activations into integers during network inference. There are two possibilities for activation quantization: deciding the quantization step on-the-fly, or determining it during training. We prefer the second, because it not only saves online computation but also provides higher precision. Our key idea for activation quantization is to adapt the bounds of the activation function to the feature maps and networks. Bounded activation functions such as ReLU6 have been studied, but ReLU6 is too simple for diverse networks and datasets. We introduce the Bounded ReLU (BReLU) as the activation function of our CNNs. Different from the widely used ReLU, BReLU has both a lower bound and an upper bound, and is linear between the two. The bounds of BReLU are adjustable to suit different feature maps. Note that BReLU followed by quantization involves a fundamental tradeoff. If the dynamic range of BReLU is large, the quantization step must be large too (given a predefined bit-depth for activations), which loses precision. But if the dynamic range is small, many features are deactivated, which limits the learning capability of the CNN. We therefore propose methods to calculate the dynamic range of BReLU adaptively to the training data and networks.
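As a minimal illustrative sketch (not the authors' implementation), BReLU with an adjustable upper bound can be written as a simple clip; the function name and signature here are our own:

```python
import numpy as np

def brelu(x, upper, lower=0.0):
    """Bounded ReLU: identity between the lower and upper bounds,
    clipped to the bounds outside that range."""
    return np.clip(x, lower, upper)
```

With a small upper bound the quantization step shrinks (more precision) but more activations are clipped, which is exactly the tradeoff described above.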
We have verified the proposed method on three CNN-based tasks: image classification, image super-resolution, and compression artifact reduction, achieving state-of-the-art performance on all of them. The latter two tasks are regression tasks that natively require high inference accuracy; low-bit-depth networks have seldom been evaluated on such regression tasks before. In addition, with the help of CUDA-enabled devices that perform fast short-integer convolutions, we convert 32-bit FPN networks into integer-arithmetic-only networks with 8-bit weights and 7-bit activations. Our integer networks achieve virtually the same accuracy as FPN networks, but have only 1/4 of the memory cost and run 2× faster on modern GPUs.
2 Related Work
The closest work to ours is , which is also the built-in method in Google's TensorFlow. It applies exponential moving averages for activation quantization, calculating the bounds of activations on-the-fly. It modifies the bounds after each iteration, which is too frequent to be suitable for parameter learning. Moreover, the exponential moving averages method requires a traversal of each feature map, which is computationally expensive. Another limitation is that all kernels feeding a concatenation layer are required to have the same quantization. We propose a new method to decide the bounds of BReLU, and we develop a more delicate ratio synchronization mechanism to handle concatenation layers.
Another integer network method is NVIDIA's TensorRT . It proposes a relative entropy method to determine the activation bounds; the idea is to minimize the loss of information during quantization. However, minimizing the defined loss does not ensure preservation of inference accuracy. In addition, TensorRT relies on floating-point arithmetic, i.e., the resulting network is not integer-only.
Given a pretrained 32-bit floating-point-number (float32) convolutional network, our goal is an integer-arithmetic-only network with low-precision integer weights and activations, which decreases the model size and accelerates inference. Similar to , our integer networks use 8-bit integers (int8) for weights and activations. Convolution produces larger numbers, after which we quantize the activations back to int8 for further convolutions.
As shown in Figure 1, an integer convolutional layer performs integer convolution on the input and weights to obtain feature maps, which are adjusted by the bias, processed by the specific activation function (Bounded ReLU), and quantized; the result becomes the input of the next layer.
To achieve such an integer network, we quantize the weights and biases of a float32 network. We observed that the parameters in a convolutional kernel roughly obey a normal distribution with zero mean, for which linear quantization is efficient in precision and saves complexity compared to quantization with a zero-shift. To fully utilize the limited precision of short integers, we map the maximal absolute value of the kernel to the maximal possible absolute value of the integer type. Specifically, for a given float32 convolutional kernel and an int8 target, the kernel is divided by the quantization step and rounded, where the quantization step is the maximal absolute value of the kernel divided by the maximal integer magnitude (Equation (1)).
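This linear weight quantization might be sketched as follows; the function name is ours, and the mapping of max|w| to 127 follows the int8 case described above:

```python
import numpy as np

def quantize_kernel(w, num_bits=8):
    """Linearly quantize a float32 kernel to signed integers.

    The maximal absolute value of the kernel is mapped to the maximal
    representable integer magnitude (127 for int8), so the quantization
    step is max|w| / 127 and w_int = round(w / step).
    """
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    step = np.abs(w).max() / qmax           # quantization step
    w_int = np.round(w / step).astype(np.int8)
    return w_int, step
```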
To achieve the best accuracy, it is natural to make the integer networks close to the float32 networks during inference, because the float32 networks are well-trained. Accordingly, we assume a linear mapping between each of the kernels, feature maps and other parameters of an integer network and its floating-point counterpart:
For example, an integer convolutional kernel is an approximation of its floating-point counterpart up to a ratio. During network inference, these ratios accumulate. When the forward propagation finishes, the overall ratio can either be ignored (e.g., for classification) or be reset to 1 by rescaling the output (e.g., for regression). We modify and finetune the FPN network before mapping it to the integer network, which then needs no further training.
A low-precision network can be viewed as a high-precision network corrupted by noise, for example the noise introduced by the rounding operation of Equation (1). During integer network inference there are two kinds of noise, due to weight quantization and activation quantization, respectively. Our work therefore focuses on controlling the quantization noise to ensure the performance of integer networks.
3.2 Activation Quantization With BReLU
3.2.1 Why Bounded ReLU
By discretizing weights before quantizing them into integers, we ensure that integer weights correspond precisely to float32 weights. However, integer convolution produces larger numbers. Since we cannot afford an increasing bit-depth after each convolution, we must quantize activations from int32 back to int8. Following , we perform activation quantization by a multiplication followed by a right shift, using the symbols in Figure 1.
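The int32-to-int8 requantization by multiplication and right shift might look like the following sketch (our own function, under the assumption that the rescaling factor has been precomputed as an integer multiplier plus a shift):

```python
import numpy as np

def requantize(acc, mult, shift):
    """Requantize int32 accumulators back to int8 with integer
    arithmetic only: (acc * mult) >> shift approximates
    acc * (mult / 2**shift), a fixed-point rescaling."""
    y = (acc.astype(np.int64) * mult) >> shift
    return np.clip(y, -128, 127).astype(np.int8)
```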
This introduces additional noise due to activation quantization. Different from parameters, activations depend on the input data and cannot be trained, so activation quantization noise is not removable. For short integers like int8, activation quantization noise is a considerable challenge. For example, our experiment on a 4-layer network (Section 4.2.2) shows that if we use conventional ReLU and always map the largest value into the target range without clipping, performance drops severely. Besides, dynamically searching for the largest value in a feature map is computationally expensive. Consequently, we propose Bounded ReLU as the activation function, together with its adaptation to different datasets and networks. BReLU sets a static range for the activation, clipping values to a fixed interval.
For both FPN and integer networks, we use BReLU with a lower bound of zero (Figure 2) to replace ReLU; the upper bound is set as described below. To fully exploit the limited integer precision, given an integer data type for the activations, the upper bound is quantized to the maximal possible value of that type, e.g., 127 for int8. We then discuss how to set the upper bound. Unlike 32-bit floating-point numbers, short integers have a smaller, equally divided representation range. Quantizing a float32 feature map into a short-integer representation acts like a uniform quantizer, and choosing the upper bound actually restricts the range of activations that are "allowed." Clearly, a larger upper bound allows more activations to be learned, but also yields a larger quantization step for the uniform quantizer, which introduces more noise into the integer representation. Therefore, our methodology is to minimize the upper bound while keeping the learning ability of the floating-point network unaffected, meaning that the network can restore its accuracy after finetuning.
3.2.2 Data Domain Adaption
To achieve a minimal upper bound, it is unavoidable that a group of large values in a feature map is deactivated. Based on the fact that values in a feature map roughly obey a Gaussian distribution with mean μ and standard deviation σ, we experiment with the n-σ rule, an empirical rule that provides a range covering the vast majority of a normal distribution. Consequently, we deactivate values that lie beyond this majority. For example, with the 3-σ rule we set the upper bound to the minimal value of the largest 0.15% of values in a feature map.
The upper bound is computed over the training dataset and remains unchanged afterwards. Computing it over the entire training set, however, is not a good choice, because certain outlier samples might dominate the largest values. Thus, we calculate the bound per batch and take the average over batches as a recommended value for the float32 network; this value is further refined below. This method adapts to the feature maps generated from different data, and our experiments show that the n-σ rule is better than ReLU6.
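A sketch of this per-batch bound estimation (our own helper; the 0.15% tail fraction is the 3-σ case quoted above):

```python
import numpy as np

def sigma_rule_bound(batches, tail=0.0015):
    """Estimate a BReLU upper bound with the n-sigma rule: per batch,
    take the smallest of the largest `tail` fraction of values
    (0.15% for the 3-sigma rule), then average across batches."""
    bounds = []
    for fmap in batches:
        flat = np.sort(fmap.ravel())
        k = max(1, int(round(tail * flat.size)))
        bounds.append(flat[-k])        # smallest of the top-k values
    return float(np.mean(bounds))
```

Averaging over batches, rather than taking a global maximum, keeps outlier samples from dominating the bound.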
For shallow networks in particular, we introduce the Geometric Progression method, which fixes the bounds from the training data from scratch and does not need pre-training. Given a network of n layers, the first and last terms are fixed with regard to the ranges of the input and output, and the intermediate terms are calculated by a geometric progression.
The i-th term of the progression is used as the upper bound of the i-th layer. This method provides a BReLU-enabled float32 network from scratch and simplifies network quantization. It works for shallow networks because their learning ability is constrained by nature, for which a large amount of deactivated features is acceptable.
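A possible sketch of the Geometric Progression method (our own function; the paper fixes only the first and last terms, so the interpolation below is the natural reading):

```python
def geometric_bounds(first, last, num_layers):
    """BReLU upper bounds per layer via a geometric progression:
    the first and last terms are fixed by the input/output ranges,
    and intermediate terms interpolate geometrically between them."""
    ratio = (last / first) ** (1.0 / (num_layers - 1))  # common ratio
    return [first * ratio ** i for i in range(num_layers)]
```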
3.2.3 Model Domain Adaption
The n-σ rule works well for a variety of models thanks to the adaptability of the bound. With different values of n we can choose different ratios of deactivated values in a feature map. For example, with 7-bit activations, we find the 3-σ rule works well for VDSR and ResNet152, while a larger n works better for ResNet18. This is because ResNet152 is deeper and requires higher precision; the 3-σ rule provides a smaller bound while still preserving enough learning ability. In our experiments, we try a set of values of n and, for each, apply it to every activation within a network. In theory, the more bits available for activations, the larger the bound we prefer, because it provides stronger learning ability. It is also possible to use different values of n for different layers within a network, which would imply that these layers play different roles in the final accuracy; we leave this for future work.
3.3 Quantized BReLU
We want the integer upper bound to correspond to the float32 upper bound, so we find a pair of integers, a multiplier and a shift, that makes the rescaled bound match. To achieve this and the subsequent ratios, we use the multiplier and shift to rescale the feature map. However, since both are integers, it is difficult to ensure that the bounds correspond precisely. Thus, we modify the float32 upper bound: by dequantizing, we make it precisely match the integer bound, as well as all the values clipped by BReLU. Note that the bound itself is not used for convolution, so we simply ensure that it is quantized to the maximal integer value.
Once all the BReLUs in the FPN network are fixed, we continue to train the float32 network. This reduces the loss of accuracy of the float32 network while minimizing the noise introduced into the integer network by activation quantization.
In most cases, the input images of a CNN are coded as 8-bit unsigned integers (uint8), which suit integer arithmetic by nature. In TensorRT, the input of 8-bit networks is still kept as floating-point numbers, meaning that integer inference starts only after the first layer. In our integer networks, we simply subtract 128 from the uint8 raw data to get int8 input.1 Correspondingly, in floating-point networks we also subtract 128 from the raw data and then multiply by a constant. This is slightly different from most existing networks, which usually normalize the input by subtracting the mean and dividing by the standard deviation. Fortunately, the different input normalizations seem not to influence accuracy much, as long as we finetune the networks with the original training methods.

1 This is mostly due to a computational issue: the convolutional kernel is int8, and it is somewhat complex to mix uint8 and int8 in convolution.
3.4.2 Ratio synchronization between feature maps
In forward propagation, the ratios between the integer and FPN networks accumulate along the forward path. It can happen that two paths merge with different ratios, for example when one path is deeper than the other (e.g., a residual connection). In such cases, we adjust the ratios of the involved feature maps by rescaling them explicitly or implicitly. We develop ratio synchronization between feature maps, which ensures the best quantization steps for the kernels and the same scale for all involved activations. Specific examples are detailed in the experiments.
To recover the accuracy lost to quantization, we finetune the network parameters, similar to . However, instead of training in the integer domain directly, we finetune the FPN weights, thus preserving full precision after quantization. First we discretize the weights by the quantization step:
Then we use the discretized weights for forward propagation. After backward propagation, we update the original weights with the gradients. After finetuning, we quantize the discretized kernel, which corresponds exactly to the integer kernel. By avoiding scaling during the training stage, our method does not involve pixel-wise division of activations, thus saving training time and power. Note that biases are not discretized, because they are quantized to int32, which has enough precision; the difference between float32 biases and int32 biases can be safely ignored.
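The discretization step used in this finetuning might look like the following sketch (our own function; in a real training loop the forward pass would use the snapped values while gradients flow to the underlying float32 weights, straight-through style):

```python
import numpy as np

def discretize(w, step):
    """Snap float32 weights onto the quantization grid step * round(w / step).
    The forward pass uses these snapped values; gradient updates are
    applied to the original float32 weights."""
    return step * np.round(w / step)
```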
As indicated in , a batch normalization (BN) layer rescales the quantization step of the previous convolutional kernel, resulting in a different step for each output channel. Jacob et al.  proposed to merge BN layers into convolutional kernels before training, but failed to recover the precision in the end. Similarly, our experiments show that removing BN before training leads to a significant drop of accuracy. We therefore use different quantization steps for different channels during training, and merge the BN parameters into the convolution after training.
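The standard BN folding referred to here can be sketched as follows (our own helper with assumed parameter names; BN multiplies each output channel by gamma / sqrt(var + eps), which is why each channel carries its own quantization step):

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the preceding convolution.
    Each output channel of the kernel is rescaled by its own factor
    gamma / sqrt(var + eps); the bias absorbs the BN shift."""
    scale = gamma / np.sqrt(var + eps)          # per output channel
    w_folded = w * scale[:, None, None, None]   # w: (out_c, in_c, kh, kw)
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded
```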
In summary, given a float32 network and training data, we use Algorithm 1 to convert it into an integer network. Our method converts an FPN network into an integer network in stages, each stage building on the best accuracy of the previous one. The method requires finetuning three times: after re-normalization, after discretization of the weights, and after applying BReLU. In this way, we seek the best quantization steps, the best bounds, and the best integer model.
We evaluate our method on state-of-the-art network architectures for both classification and regression tasks. Besides, our method promises bit-wise consistency, so we also test on CNNs used in a video codec , which allows only integer computation and demands cross-platform consistency. We use an NVIDIA Titan X (Pascal) GPU to measure computational time.
Networks used in classification are the backbones of many computer vision tasks such as object detection and semantic segmentation. Among them, ResNet  is a widely used network architecture achieving state-of-the-art classification accuracy. It includes multiple residual blocks as well as convolution and pooling layers. Our integer ResNets use only integer arithmetic during inference.
ResNet features many residual connections, for which we develop the ratio synchronization method.
Take the basic block of ResNet18 (Figure 3) as an example: the skip connection adds the input of a residual block to its output, either directly (type 1) or after a convolution (type 2). For type 1, we first rescale the skip path by multiplication and right shift, using a method similar to Equation (9), to bring its ratio close to the main path's, and then adjust the quantization step of the second convolutional kernel to make the two ratios exactly equal. For type 2, we perform convolution on the quantized input and do the same for the output feature maps. This procedure is carried out while calculating the bounds.
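The skip-path rescaling described above might be sketched as follows (our own function, reusing the multiplier/shift scheme with an assumed 16-bit shift):

```python
import numpy as np

def sync_skip(x_skip, ratio_skip, ratio_main, shift_bits=16):
    """Rescale the skip-path feature map (multiply + right shift,
    integer arithmetic only) so its integer-to-float ratio matches
    the main path's before the residual add."""
    m = int(round(ratio_main / ratio_skip * (1 << shift_bits)))
    return (x_skip.astype(np.int64) * m) >> shift_bits
```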
By applying Algorithm 1, we obtain integer ResNets (Table 1). They achieve equivalent precision with higher speed and a quarter of the model size. Here the model size is the memory footprint of the parameters; for storage, compression such as Huffman coding could reduce it further. Compared with Jacob et al. , which handles activation quantization by exponential moving averages, and TensorRT , which uses relative entropy, our method achieves state-of-the-art performance very close to the baseline accuracy. The significance of this advancement is that it serves applications demanding very high accuracy, while also providing faster inference and a smaller model, as shown in Table 2.
| Network   | Quantized    | Top-1 (%) | Top-5 (%) |
|-----------|--------------|-----------|-----------|
| ResNet152 | Jacob et al. | 76.70     | /1        |
Not provided in .
| Network   | Quantized | Model size | Conv. time1 |
|-----------|-----------|------------|-------------|
| ResNet18  | float32   | 44.6 MB    | 35.1 ms     |
| ResNet18  | Ours      | 11.1 MB    | 16.2 ms     |
| ResNet152 | float32   | 230 MB     | 234 ms      |
| ResNet152 | Ours      | 57.5 MB    | 91.9 ms     |
Only time for convolutions, average time per batch (50 images).
Not comparable because TensorRT adopts other optimizations.
To further demonstrate the effects of our method, including Ratio Synchronization (RS) and BReLU, we compare the results of four configurations in Table 3. Even if we disable RS and use the simple ReLU6 activation, our result is still significantly better than Jacob et al.  and TensorRT. This may be due to our overall quantization algorithm; we will open-source our code and models for researchers.
Another finding is that within a single convolutional kernel with batch normalization, the quantization steps differ across channels, and some of them are so large that the whole channel becomes 0 after quantization. Since our experiments show that accuracy is not affected, we believe these channels can be safely pruned, further reducing model size and speeding up inference.
| Configuration | Top-1 (%) | Top-5 (%) |
|---------------|-----------|-----------|
Many image processing tasks require regression CNNs, such as image super-resolution, denoising, and generative models. Nonetheless, low-precision networks for regression tasks were seldom reported before. Since regression networks have no average pooling or non-maximum suppression, they are less tolerant to noise during inference. By applying our method to multiple regression tasks, we show that it is also qualified for regression networks.
4.2.1 VDSR for image super-resolution
Very deep convolutional networks for image super-resolution (VDSR)  is a typical CNN architecture for regression tasks, constructed from 20 convolutional layers with 3×3 kernels. The output is an image of the same size as the input, but of better quality, as if "super-resolved." VDSR-like architectures are widely used in end-to-end image tasks such as colorization, denoising, and deraining. Our results are shown in Table 4. Besides 7-bit activations, we also evaluated our method down to 4-bit activations. As shown in Table 5, accuracy drops significantly from 5-bit to 4-bit. Generally speaking, the deeper or less noise-tolerant a network is, the higher the precision needed for activations.
Single-image super-resolution is widely used in computer vision applications ranging from security and surveillance imaging to medical imaging, where more image details are required on demand. Therefore, it requires not only objective evaluation but also subjective quality. Our results in Figure 4 demonstrate that the output of our integer network is both accurate in PSNR and as visually pleasing as the output of the float32 network.
|                  | float32 | Ours  |
|------------------|---------|-------|
| PSNR for ×2      | 37.55   | 37.51 |
| PSNR for ×3      | 33.78   | 33.78 |
| PSNR for ×4      | 31.44   | 31.42 |
| Model size (MB)  | 2.54    | 0.65  |
| Conv. time (ms)1 | 120     | 55.7  |
Only time for convolutions, overall time of Set-5.
| Bit-depth | ×2 PSNR | ×3 PSNR | ×4 PSNR |
|-----------|---------|---------|---------|
4.2.2 VRCNN for compression artifact reduction
VRCNN  is a network used in video coding for compression artifact reduction. It works similarly to VDSR but consists of only 4 convolutional layers. VRCNN has 2 concatenation layers, each of which combines two kernel sizes, as in the Inception network . We design a ratio synchronization method for concatenation layers, as shown in Algorithm 2. To scale all involved feature maps to the same ratio, we select the minimal ratio as the reference, ensuring that no feature map overflows. Then we adjust each quantization parameter to reach that ratio, together with the corresponding bounds, to guarantee precise synchronization. We perform a similar process on the skip connections in the ResNets above.
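The minimal-ratio rule for concatenation layers might be sketched as follows (our own function, again using the multiplier/shift rescaling with an assumed 16-bit shift):

```python
import numpy as np

def sync_concat(branches, ratios, shift_bits=16):
    """Ratio synchronization for a concatenation layer: pick the
    minimal ratio as the reference (so no branch overflows when
    rescaled) and bring every branch to it before concatenating."""
    ref = min(ratios)
    synced = []
    for x, r in zip(branches, ratios):
        m = int(round(ref / r * (1 << shift_bits)))
        synced.append((x.astype(np.int64) * m) >> shift_bits)
    return np.concatenate(synced), ref
```

Choosing the minimal ratio means every branch is scaled down (or kept), which is why no feature map can overflow its integer range.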
VRCNN is a shallow network, for which we calculate the bounds by the simple Geometric Progression method. Specifically, we set the first term to 0.5, which is the maximal absolute value of the input after re-normalization, and the last term to the maximal absolute value of the output residual. We train the BReLU-FPN network from scratch and quantize it, leading to an integer VRCNN with the same performance as its float32 counterpart.
| BD-rate | Model size (KB) | Inf. time (ms)1 |
|---------|-----------------|-----------------|
Entire inference time for a single picture of size 2560×1600.
Performance of VRCNN is evaluated on the standard HEVC test sequences with QP ranging from 22 to 37, as shown in Table 6. Note that cross-platform consistency is crucial in video codecs, which is one of the motivations of our work.
We have proposed a method for converting FPN networks into integer-arithmetic-only networks without sacrificing accuracy. Our key idea is to replace ReLU with BReLU and adapt it to various datasets and networks, so as to efficiently quantize the activations. For the upper bound of BReLU we have studied two methods for shallow and deep networks respectively.
We have tested our method on three tasks, including image super-resolution and compression artifact reduction, for which integer networks were not reported before. Our method achieves efficient integer networks with a quarter of the model size and double the speed on modern GPUs, while ensuring cross-platform consistency. It outperforms Jacob et al.  and TensorRT in accuracy and achieves state-of-the-art performance.
In the future, we plan to extend our method to CNNs with customized or specialized units, such as leaky ReLU and sigmoid (e.g., in LSTMs). We will also study BReLU further by applying different strategies to different layers.
-  Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen, ‘Compressing neural networks with the hashing trick’, CoRR, abs/1504.04788, (2015).
-  Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan, ‘PACT: parameterized clipping activation for quantized neural networks’, CoRR, abs/1805.06085, (2018).
-  Yuanying Dai, Dong Liu, and Feng Wu, ‘A convolutional neural network approach for post-processing in hevc intra coding’, in MultiMedia Modeling, pp. 28–39, Cham, (2017). Springer International Publishing.
-  David Goldberg, ‘What every computer scientist should know about floating-point arithmetic’, ACM Comput. Surv., 23(1), 5–48, (March 1991).
-  Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev, ‘Compressing deep convolutional networks using vector quantization’, CoRR, abs/1412.6115, (2014).
-  Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan, ‘Deep learning with limited numerical precision’, CoRR, abs/1502.02551, (2015).
-  Song Han, Huizi Mao, and William J. Dally, ‘Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding’, (2015).
-  Song Han, Jeff Pool, John Tran, and William Dally, ‘Learning both weights and connections for efficient neural network’, in Advances in Neural Information Processing Systems 28, eds., C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 1135–1143, Curran Associates, Inc., (2015).
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, ‘Deep residual learning for image recognition’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2016).
-  Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, ‘Mobilenets: Efficient convolutional neural networks for mobile vision applications’, CoRR, abs/1704.04861, (2017).
-  Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, ‘Densely connected convolutional networks’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (July 2017).
-  Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, ‘Binarized neural networks’, in Advances in Neural Information Processing Systems 29, eds., D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 4107–4115, Curran Associates, Inc., (2016).
-  Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer, ‘Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size’, CoRR, abs/1602.07360, (2016).
-  Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko, ‘Quantization and training of neural networks for efficient integer-arithmetic-only inference’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2018).
-  Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, ‘Accurate image super-resolution using very deep convolutional networks’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2016).
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, ‘Imagenet classification with deep convolutional neural networks’, in Advances in Neural Information Processing Systems 25, eds., F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, 1097–1105, Curran Associates, Inc., (2012).
-  Yann Le Cun, John S. Denker, and Sara A. Solla, ‘Optimal brain damage’, in Proceedings of the 2Nd International Conference on Neural Information Processing Systems, NIPS’89, pp. 598–605, Cambridge, MA, USA, (1989). MIT Press.
-  Cong Leng, Hao Li, Shenghuo Zhu, and Rong Jin, ‘Extremely low bit neural network: Squeeze the last bit out with ADMM’, CoRR, abs/1707.09870, (2017).
-  Fengfu Li and Bin Liu, ‘Ternary weight networks’, CoRR, abs/1605.04711, (2016).
-  Y. Linde, A. Buzo, and R. Gray, ‘An algorithm for vector quantizer design’, IEEE Transactions on Communications, 28(1), 84–95, (January 1980).
-  Jeffrey L. McKinstry, Steven K. Esser, Rathinakumar Appuswamy, Deepika Bablani, John V. Arthur, Izzet B. Yildiz, and Dharmendra S. Modha, ‘Discovering low-precision networks close to full-precision networks for efficient embedded inference’, CoRR, abs/1809.04191, (2018).
-  Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey, ‘Ternary neural networks with fine-grained quantization’, CoRR, abs/1705.01462, (2017).
-  Szymon Migacz, ‘8-bit inference with tensorrt’, in GPU technology conference, volume 2, p. 7, (2017).
-  Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr, ‘WRPN: wide reduced-precision networks’, CoRR, abs/1709.01134, (2017).
-  Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, ‘Xnor-net: Imagenet classification using binary convolutional neural networks’, in Computer Vision – ECCV 2016, eds., Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, pp. 525–542, Cham, (2016). Springer International Publishing.
-  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, ‘You only look once: Unified, real-time object detection’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2016).
-  Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, ‘Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation’, CoRR, abs/1801.04381, (2018).
-  K. Simonyan and A. Zisserman, ‘Very Deep Convolutional Networks for Large-Scale Image Recognition’, arXiv e-prints, (September 2014).
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, ‘Going deeper with convolutions’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2015).
-  Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, ‘Shufflenet: An extremely efficient convolutional neural network for mobile devices’, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (June 2018).
-  Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen, ‘Incremental network quantization: Towards lossless cnns with low-precision weights’, CoRR, abs/1702.03044, (2017).
-  Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou, ‘Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients’, CoRR, abs/1606.06160, (2016).
-  Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally, ‘Trained ternary quantization’, CoRR, abs/1612.01064, (2016).