1 Introduction
A promising approach to lowering the compute and memory requirements of convolutional deep-learning workloads is the use of low numeric precision algorithms. Operating in lower-precision mode reduces computation as well as data movement and storage requirements. Owing to these efficiency benefits, many existing works propose low-precision deep neural networks (DNNs) [INQ; NNFewMult; LogNN; StochasticRounding; GoogleLowPrecision], even down to 2-bit ternary mode [TTQ; TWN; AALTWN] and 1-bit binary mode [DoReFa; BNN; XNORNET; BWN; FINN]. However, the majority of existing works on low-precision DNNs sacrifice accuracy relative to the baseline full-precision networks. Further, most prior works target reducing the precision of the model parameters (network weights). This primarily benefits the inference step, and only when batch sizes are small.

To improve both the execution efficiency and the accuracy of low-precision networks, we reduce the precision of both activation maps and model parameters, and increase the number of filter maps in a layer. We call networks using this scheme wide reduced-precision networks (WRPN) and find that this scheme matches or surpasses the accuracy of the baseline full-precision network. Although the number of raw compute operations increases as we increase the number of filter maps in a layer, the compute bits required per operation are now a fraction of what is required when using full-precision operations (e.g., going from FP32 AlexNet to 4-bit precision and doubling the number of filters increases the number of compute operations by 4x, but each operation is 8x more efficient than FP32).
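The arithmetic in the example above can be checked directly. This is a minimal sketch with made-up variable names; it assumes a convolution's operation count scales with the product of input and output channel counts, and that per-operation cost scales linearly with operand bit width:

```python
fp32_bits = 32
low_bits = 4
widen = 2

op_growth = widen ** 2               # 2x filters in and out -> 4x FMA ops
per_op_gain = fp32_bits // low_bits  # each 4-bit op is ~8x cheaper than FP32
net_cost = op_growth / per_op_gain   # fraction of baseline compute cost
print(op_growth, per_op_gain, net_cost)  # 4 8 0.5
```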
WRPN offers better accuracy while being computationally less expensive than previously reported reduced-precision networks. We report results on AlexNet [AlexNet], batch-normalized Inception [InceptionBN], and ResNet-34 [ResNet] on the ILSVRC-12 dataset [AlexNet]. We find 4 bits to be sufficient for training deep and wide models while achieving similar or better accuracy than the baseline network. With 4-bit activations and 2-bit weights, we find the accuracy to be at par with the baseline full precision. Making the networks wider and operating with 1-bit precision, we close the accuracy gap with previously reported binary networks and show state-of-the-art results for ResNet-34 (69.85% top-1 with 2x-wide) and AlexNet (48.04% top-1 with 1.3x-wide). To the best of our knowledge, our reported accuracies with binary networks (and even 4-bit precision) are the highest to date.

Our reduced-precision quantization scheme is hardware friendly, allowing for efficient hardware implementations. To this end, we evaluate the efficiency benefits of low-precision operations (4 bits down to 1 bit) on a Titan X GPU, an Arria-10 FPGA and an ASIC. We see that the FPGA and ASIC can deliver significant efficiency gains over FP32 operations (6.5x to 100x), while the GPU cannot take advantage of very low-precision operations.
2 Motivation for reduced-precision activation maps
While most prior works on reduced-precision networks operate with low-precision weights (e.g. [BNN; TTQ; DoReFa; AALTWN; TWN; BWN; FINN]), we find that activation maps occupy a larger memory footprint when using mini-batches of inputs. Using mini-batches of inputs is typical in the training of DNNs and in cloud-based batched inference [TPU]. Figure 1 shows the memory footprint of activation maps and filter maps as batch size changes for four different networks (AlexNet, Inception-ResNet-v2 [IRv2], ResNet-50 and ResNet-101) during the training and inference steps.
As batch size increases, because of filter reuse across batches of inputs, activation maps occupy a significantly larger fraction of memory compared to the filter weights. This aspect is illustrated in Figure 2, which shows the memory requirements of a canonical feed-forward DNN for a hardware-accelerator-based system (e.g., GPU, FPGA, PCIe-connected ASIC device, etc.). During training, all the activation maps (ACT) and weight tensors (W) are allocated in device memory for the forward pass, along with memory for gradient maps during backward propagation. The total memory requirement for the training phase is the sum of the memory required for the activation maps and weights, plus the maximum of the input gradient maps (Z) and the maximum of the back-propagated gradients (X). During inference, memory is allocated for the input (IFM) and output feature maps (OFM) required by a single layer, and these allocations are reused across layers. The total memory allocation during inference is then the maximum IFM plus the maximum OFM required across all layers, plus the sum of all W tensors. At batch sizes of 128 and above, activations occupy more than 98% of the total memory footprint during training.

Overall, reducing the precision of activations and weights reduces memory footprint, bandwidth and storage while also simplifying the requirements for hardware to efficiently support these operations.
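The accounting described above can be sketched in a few lines (the helper names and the per-layer sizes below are ours, purely for illustration):

```python
def training_footprint(acts, weights, in_grads, out_grads):
    # training keeps all activations and weights resident, plus the largest
    # input-gradient map and the largest back-propagated gradient map
    return sum(acts) + sum(weights) + max(in_grads) + max(out_grads)

def inference_footprint(ifms, ofms, weights):
    # inference reuses a single IFM/OFM allocation across layers
    return max(ifms) + max(ofms) + sum(weights)

# hypothetical per-layer sizes in MB for a tiny 3-layer network
acts = [64, 32, 16]      # activation maps for one mini-batch
weights = [1, 4, 8]
grads = [64, 32, 16]     # gradient maps mirror activation shapes

print(training_footprint(acts, weights, grads, grads))    # 253
print(inference_footprint(acts[:-1], acts[1:], weights))  # 109
```

As the batch size grows, the activation and gradient terms scale with it while the weight term stays fixed, which is why activations come to dominate the training footprint.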
3 WRPN scheme and studies on AlexNet
Based on the observation that activations occupy a larger memory footprint than weights, we reduce the precision of activations to speed up the training and inference steps and to cut memory requirements. However, a straightforward reduction in the precision of activation maps leads to a significant drop in model accuracy [DoReFa; XNORNET].
We conduct a sensitivity study in which we reduce the precision of activation maps and model weights for AlexNet on the ILSVRC-12 dataset and train the network from scratch. Table 1 reports our findings. Top-1 single-precision (32-bit weights and activations) accuracy is 57.2%. The accuracy with binary weights and activations is 44.2%, similar to what is reported in [XNORNET]. The 2-bit weight, full-precision activation data point in this table uses the TTQ technique [TTQ]. All other data points are collected using our quantization scheme (described later in Section 5); all runs use the same hyper-parameters, and training is carried out for the same number of epochs as the baseline network. To be consistent with results reported in prior works, we do not quantize the weights and activations of the first and last layers.
We find that, in general, reducing the precision of activation maps and weights hurts model accuracy. Further, reducing the precision of activations hurts model accuracy much more than reducing the precision of the filter parameters. We find TTQ to be quite effective on AlexNet, in that one can lower the precision of weights to 2 bits (while activations are still FP32) without losing accuracy. However, we did not find this scheme effective for other networks such as ResNet or Inception.
To regain model accuracy while working with reduced-precision operands, we increase the number of filter maps in a layer. Although the number of raw compute operations increases when widening the filter maps in a layer, the bits required per compute operation are now a fraction of what full-precision operations require. As a result, with appropriate hardware support, one can significantly reduce the dynamic memory requirements, memory bandwidth and computational energy, and speed up the training and inference process.
Our widening of filter maps is inspired by the Wide ResNet work [WideResNets], in which the depth of the network is reduced and the width of each layer is increased (the operand precision is still FP32). Wide ResNet requires a redesign of the network architecture. In our work, we keep the depth the same as the baseline network but widen the filter maps. We call our approach WRPN: wide reduced-precision networks. In practice, we find this scheme to be very simple and effective: starting with a baseline network architecture, one can change the width of each filter map without changing any other network design parameter or hyper-parameter. Carefully reducing precision and simultaneously widening filters keeps the total compute cost of the network under or at par with the baseline cost.¹

¹Compute cost is the product of the number of FMA operations and the sum of the widths of the activation and weight operands.
Table 2 reports the accuracy of AlexNet when we double the number of filter maps in each layer. With doubled filter maps, AlexNet with 4-bit weights and 2-bit activations exhibits accuracy at par with the full-precision network. Operating with 4-bit weights and 4-bit activations surpasses the baseline accuracy by 1.44%. With binary weights and activations, we better the accuracy of XNOR-NET [XNORNET] by 4%.

When doubling the number of filter maps, AlexNet's raw compute operations grow by 3.9x compared to the baseline full-precision network; however, by using reduced-precision operands the overall compute complexity is a fraction of the baseline. For example, with 4-bit operands for weights and activations and 2x the number of filters, reduced-precision AlexNet has just 49% of the total compute cost of the full-precision baseline (the compute cost comparison is shown in Table 3).
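The 49% figure can be verified against the cost definition in the footnote above (compute cost = number of FMA operations times the sum of activation and weight operand widths):

```python
baseline_cost = 1.0 * (32 + 32)   # normalized op count at FP32 operands
wide_cost = 3.9 * (4 + 4)         # 3.9x the ops, 4-bit weights and activations
print(wide_cost / baseline_cost)  # 0.4875 -> ~49% of baseline
```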
We also experiment with other widening factors. With 1.3x widening of filters and 4 bits of activation precision, one can go as low as 8 bits of weight precision while remaining at par with the baseline accuracy. With 1.1x-wide filters, at least 8-bit weight and 16-bit activation precision is required for accuracy to match the baseline full-precision 1x-wide accuracy. Further, as Table 3 shows, when widening filters by 2x, one needs to lower the precision to at most 8 bits so that the total compute cost does not exceed the baseline compute cost. Thus, there is a trade-off between widening and reducing the precision of network parameters.
In our work, we trade a higher number of raw compute operations for aggressively reduced precision of the operands involved in these operations (activation maps and filter weights), while not sacrificing model accuracy. Apart from the other benefits of reduced-precision activations mentioned earlier, widening the filter maps also improves the efficiency of the underlying GEMM calls for convolution operations, since compute accelerators are typically more efficient on a single kernel performing parallel computation on large data structures than on many small kernels [WideResNets].
4 Studies on deeper networks
We study how our scheme applies to deeper networks. For this, we study ResNet-34 [ResNet] and batch-normalized Inception [InceptionBN] and find similar trends; in particular, 2-bit weights and 4-bit activations continue to provide accuracy at par with the baseline. We use TensorFlow [tensorflow2015whitepaper] and tensorpack [tensorpack] for all our evaluations, and use the ILSVRC-12 train and val datasets for analysis.²

²We will open-source our implementation of reduced-precision AlexNet, ResNet and batch-normalized Inception networks.

4.1 ResNet
ResNet-34 has 3x3 filters in each of its modular layers, with shortcut connections being 1x1. The filter bank width changes from 64 to 512 as depth increases. We use the pre-activation variant of ResNet; the baseline top-1 accuracy of our ResNet-34 implementation using the single-precision 32-bit data format is 73.59%. Binarizing the weights and activations of all layers except the first and the last gives a top-1 accuracy of 60.5%. For binarizing ResNet we did not reorder any layers (as is done in XNOR-NET), and we used the same hyper-parameters and learning-rate schedule as the baseline network. As a reference, for ResNet-18 the gap between XNOR-NET (1-bit weights and activations) and the full-precision network is 18% [XNORNET]. It is also interesting to note that the top-1 accuracy of single-precision AlexNet (57.20%) is lower than the top-1 accuracy of binarized ResNet-34 (60.5%).

We experimented with doubling the number of filters in each layer while reducing the precision of activations and weights. Table 4 shows the results of our analysis. Doubling the number of filters with 4-bit precision for both weights and activations beats the baseline accuracy by 0.9%. 4-bit activations and 2-bit (ternary) weights achieve top-1 accuracy at par with the baseline. Reducing the precision to 2 bits for both weights and activations degrades accuracy by only 0.2% compared to the baseline.
Binarizing the weights and activations with 2x-wide filters gives a top-1 accuracy of 69.85%. This is just 3.7% worse than the baseline full-precision network, at only 15% of the baseline's compute cost. Widening the filters by 3x and binarizing the weights and activations reduces this gap to 1.2%, with the 3x-wide network at 30% of the cost of the full-precision baseline.
Although 4-bit precision seems to be enough for wide networks, we advocate 4-bit activation precision and 2-bit weight precision. With ternary weights one can replace multipliers with adders, and with this configuration there is no loss of accuracy. Further, if some accuracy degradation is tolerable, one can even move to binary circuits for efficient hardware implementation while saving 32x in bandwidth for both weights and activations compared to full-precision networks. All these gains can be realized with a simpler hardware implementation and lower compute cost compared to baseline networks.
To the best of our knowledge, our ResNet binary and ternary (with 2-bit or 4-bit activation) top-1 accuracies are state-of-the-art results in the literature, including unpublished technical reports (with similar data augmentation [PCLpaper1]).
4.2 Batch-normalized Inception
We applied the WRPN scheme to the batch-normalized Inception network [InceptionBN]. This network batch-normalizes all layers and is a variant of GoogLeNet [GoogleNet] in which the 5x5 convolutional filters are replaced by two 3x3 convolutions with up to 128-wide filters. Table 5 shows the results of our analysis. Using 4-bit activations and 2-bit weights and doubling the number of filter banks produces a model almost at par in accuracy with the baseline single-precision network (0.02% loss in accuracy). The wide network with binary weights and activations is within 6.6% of the full-precision baseline network.
5 Hardware-friendly quantization scheme
We adopt the straight-through estimator (STE) approach in our work [STE1]. When quantizing a real number to k bits, the ordinality of the set of quantized numbers is 2^k. Mathematically, this small and finite set would have zero gradients with respect to its inputs. The STE method circumvents this problem by defining an operator that has arbitrary forward and backward operations.

Prior works using the STE approach define operators that quantize the weights based on the expectation of the weight tensors. For instance, TWN [TWN] uses a threshold and a scaling factor for each layer to quantize weights to the ternary domain. In TTQ [TTQ], the scaling factors are learned parameters. XNOR-NET binarizes the weight tensor by computing the sign of the tensor values and then scaling by the mean of the absolute values of each output channel of weights. DoReFa uses a single scaling factor across the entire layer. For quantizing weights to k bits, where k > 1, DoReFa uses:
w_q = 2 * quantize_k( tanh(w) / (2 * max(|tanh(W)|)) + 1/2 ) - 1        (1)
Here w_q is the k-bit quantized version of the input weight w, and quantize_k is a quantization function that quantizes a floating-point number in the range [0, 1] to a k-bit number in the same range. The transcendental tanh operation constrains the weight value to lie between -1 and +1, and the affine transformation after quantization brings the range back to [-1, 1].
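A scalar sketch of Eq. 1 may help (our naming, not DoReFa's; in practice the quantization is applied elementwise to the weight tensor, with the maximum taken over the whole layer):

```python
import math

def quantize_k(x, k):
    # map a value in [0, 1] onto one of 2^k evenly spaced levels in [0, 1]
    levels = (1 << k) - 1
    return round(x * levels) / levels

def dorefa_weight(w, k, max_abs_tanh):
    # max_abs_tanh = max(|tanh(w)|) over the layer's full weight tensor
    x = math.tanh(w) / (2.0 * max_abs_tanh) + 0.5   # squash into [0, 1]
    return 2.0 * quantize_k(x, k) - 1.0             # rescale to [-1, 1]

# the largest-magnitude weight always maps to +/-1
print(dorefa_weight(0.8, 2, math.tanh(0.8)))   # 1.0
print(dorefa_weight(-0.8, 2, math.tanh(0.8)))  # -1.0
```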
We build on these approaches and propose a much simpler scheme. For quantizing weight tensors, we first hard-constrain the values to lie within the range [-1, 1] using a min-max clipping operation (e.g. tf.clip_by_value when using TensorFlow [tensorflow2015whitepaper]). For quantizing activation tensor values, we constrain the values to lie within the range [0, 1]. This step is followed by a quantization step in which a real number is quantized into a k-bit number. This is given as, for k > 1:
w_q = round( (2^(k-1) - 1) * w ) / (2^(k-1) - 1),    a_q = round( (2^k - 1) * a ) / (2^k - 1)        (2)
Here w and a are the input real-valued weight and activation tensors, and w_q and a_q are their quantized versions. One bit is reserved for the sign bit in the case of weight values, hence the use of 2^(k-1) - 1 levels for these quantized values. Thus, weights can be stored and interpreted using signed data types and activations using unsigned data types. With appropriate affine transformations, the convolution operations (the bulk of the compute in the network during the forward pass) can be done using quantized values (integer operations in hardware), followed by scaling with floating-point constants (this scaling can be done in parallel with the convolution operation in hardware). When k = 1, for binary weights we use the BWN approach [BWN], where the binarized weight value is computed from the sign of the input value, followed by scaling with the mean of the absolute values. For binarized activations we use the formulation in Eq. 2. We do not quantize the gradients, and we maintain the weights in reduced-precision format.
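Our scheme thus reduces to clipping followed by rounding; a minimal scalar sketch follows (our naming; a real implementation runs elementwise on tensors, wrapped with the STE so gradients bypass the non-differentiable round):

```python
def quantize_weight(w, k):
    levels = 2 ** (k - 1) - 1      # one bit reserved for the sign
    w = max(-1.0, min(1.0, w))     # hard clip to [-1, 1]
    return round(w * levels) / levels

def quantize_act(a, k):
    levels = 2 ** k - 1            # activations are unsigned
    a = max(0.0, min(1.0, a))      # hard clip to [0, 1]
    return round(a * levels) / levels

print(quantize_weight(0.8, 4))  # 6/7 ~ 0.857: nearest of 7 signed levels
print(quantize_act(0.8, 4))     # 0.8: exactly representable as 12/15
```

Note that Python's built-in round() breaks exact ties to the nearest even value; a hardware implementation would simply fix its own tie-breaking convention.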
For the convolution operation when using WRPN, the forward pass during training (and the inference step) involves matrix multiplication of k-bit signed (weight) and k-bit unsigned (activation) operands. Since gradient values are in 32-bit floating-point format, the backward pass involves a matrix multiplication using a 32-bit gradient operand and a k-bit weight operand for the gradient and weight update.
When k > 1, the hard clipping of tensors to a fixed range maps efficiently to min-max comparator units in hardware, as opposed to transcendental operations, which have long latency. The TTQ and DoReFa schemes involve a division operation and computing a maximum value over the input tensor. Floating-point division is expensive in hardware, and computing the maximum of a tensor is an O(n) operation. Additionally, our quantization parameters are static and do not require any learning or backpropagation, unlike the TTQ approach. We avoid each of these costly operations with a simpler quantization scheme (clipping followed by rounding).
5.1 Efficiency improvements of reduced-precision operations on GPU, FPGA and ASIC
In practice, the effective performance and energy efficiency one can achieve with a low-precision compute operation depends strongly on the hardware that runs these operations. We study the efficiency of low-precision operations on various hardware targets: GPU, FPGA, and ASIC.
For GPU, we evaluate WRPN on an Nvidia Titan X Pascal, and for FPGA we use an Intel Arria-10. We collect performance numbers from both previously reported analysis [FPGAvsGPU] and our own experiments. For the FPGA, we implement the DNN accelerator architecture shown in Figure 3(a). This is a prototypical accelerator design used in various works (e.g., on FPGA [FPGAvsGPU] and on ASIC such as the TPU [TPU]). The core of the accelerator consists of a systolic array of processing elements (PEs) that perform matrix and vector operations, along with on-chip buffers and an off-chip memory management unit. The PEs can be configured to support different precisions: (FP32, FP32), (INT4, INT4), (INT4, TER2), and (BIN1, BIN1). The (INT4, TER2) PE operates on ternary (+1, 0, -1) values and is optimized to include only an adder, since there is no need for a multiplier in this case. The binary (BIN1, BIN1) PE is implemented using XNOR and bit-count. Our RTL design targets the Arria-10 1150 FPGA. For our ASIC study, we synthesize the PE design using Intel 14 nm process technology to obtain area and energy estimates.
Figures 3(b)-(g) summarize our analysis. Figure 3(b) shows the efficiency improvements using first-order estimates, where efficiency is computed from the number of bits used in the operation. By this estimate, we would expect (INT4, INT4) and (BIN1, BIN1) to be 8x and 32x more efficient, respectively, than (FP32, FP32). In practice, however, the efficiency gains from reducing precision depend on whether the underlying hardware can take advantage of such low precisions.
Figure 3(c) shows the performance improvement on the Titan X GPU for various low-precision operations relative to FP32. The GPU achieves only up to 4x improvement over the FP32 baseline, because it provides first-class support only for INT8 operations and cannot take advantage of the lower INT4, TER2, and BIN1 precisions. On the contrary, the FPGA can take advantage of such low precisions, since they are amenable to implementation on its reconfigurable fabric.
Figure 3(d) shows that the FPGA performance improvements from (INT4, INT4), (INT4, TER2), and (BIN1, BIN1) track well with the first-order estimates from Figure 3(b). In fact, for (BIN1, BIN1) the FPGA improvements exceed the first-order estimate. Reducing precision simplifies the design of the compute units and lowers the buffering requirements on the FPGA board. This precision reduction leads to a significant improvement in throughput, due to smaller hardware designs (allowing more parallelism) and shorter circuit delay (allowing higher frequency). Figure 3(e) shows the performance and performance/Watt of the reduced-precision operations on GPU and FPGA. The FPGA performs quite well on very low-precision operations; in terms of performance/Watt, it does better than the GPU at (INT4, INT4) and lower precisions.
An ASIC allows for a truly customized hardware implementation. Our ASIC study provides insight into the upper bound of the efficiency benefits possible from low-precision operations. Figures 3(f) and 3(g) show the improvements in performance and energy efficiency of the various low-precision ASIC PEs relative to a baseline FP32 PE. As the figures show, going to lower precision offers two to three orders of magnitude efficiency improvement.
In summary, FPGA and ASIC are well suited to our WRPN approach. At 2x width, WRPN requires 4x more total operations than the original network; however, at INT4 or lower precision, each operation is 6.5x or more efficient than FP32 on FPGA and ASIC. Hence, WRPN delivers an overall efficiency win.
6 Related work
Reduced-precision DNNs are an active research area. Reducing the precision of weights for an efficient inference pipeline has been very well studied. Works like BinaryConnect (BC) [BWN], ternary-weight networks (TWN) [TWN], fine-grained ternary quantization [PCLpaper1] and INQ [INQ] target reducing the precision of network weights while still using full-precision activations. Accuracy is almost always degraded when quantizing the weights. For AlexNet on ImageNet, TWN loses 5% top-1 accuracy. Schemes like INQ, [ResilientDNNs] and [PCLpaper1] fine-tune to quantize the network weights and sacrifice less accuracy, but are not applicable to training networks from scratch. INQ shows promising results with 5 bits of precision.

XNOR-NET [XNORNET], BNN [BNN], DoReFa [DoReFa] and TTQ [TTQ] target training as well. While TTQ targets weight quantization only, most works targeting activation quantization hurt accuracy. The XNOR-NET approach reduces top-1 accuracy by 12% and DoReFa by 8% when quantizing both weights and activations to 1 bit (for AlexNet on ImageNet). Further, XNOR-NET requires reordering of layers for its scheme to work. Recent work in [GrahamLowPrec] targets low-precision activations and reports accuracy within 1% of the baseline with 5-bit precision and logarithmic (base √2) quantization. With fine-tuning this gap can be narrowed to within 0.6%, but not all layers are quantized.
Operand widths that are not multiples of two introduce hardware inefficiency: memory accesses are no longer DRAM- or cache-boundary aligned, and the end-to-end runtime performance of complicated quantization schemes is unclear. We target end-to-end training and inference using a very simple quantization method, and aim to reduce precision without any loss in accuracy. To the best of our knowledge, our work is the first to study reduced-precision deep and wide networks and to show accuracy at par with the baseline for precision as low as 4-bit activations and 2-bit weights. We report state-of-the-art accuracy for wide binarized AlexNet and ResNet while remaining lower in compute cost.
7 Conclusions
We present the wide reduced-precision networks (WRPN) scheme for DNNs. In this scheme, the numeric precision of both weights and activations is significantly reduced without loss of network accuracy. This result is in contrast to many previous works that find reduced-precision activations to detrimentally impact accuracy; specifically, we find that 2-bit weights and 4-bit activations are sufficient to match baseline accuracy across many networks, including AlexNet, ResNet-34 and batch-normalized Inception. We achieve this result with a new quantization scheme and by increasing the number of filter maps in each reduced-precision layer to compensate for the loss of information capacity induced by reducing the precision.
We motivate this work with our observation that full-precision activations contribute significantly more to the memory footprint than full-precision weight parameters when using the mini-batch sizes common during training and cloud-based inference; furthermore, reducing the precision of both activations and weights greatly reduces compute complexity (40% of baseline for 2-bit weights and 4-bit activations).
The WRPN quantization scheme and computation on low-precision activations and weights are hardware friendly, making them viable for deeply embedded system deployments as well as for cloud-based training and inference servers with low-precision compute fabrics. We compare Titan X GPU, Arria-10 FPGA and ASIC implementations using WRPN and show that our scheme increases performance and energy efficiency at iso-accuracy across each. Overall, reducing the precision allows custom-designed compute units and lower buffering requirements to provide significant improvements in throughput.
References
 (1) https://github.com/ppwwyyxx/tensorpack.
 (2) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 (3) Y. Bengio, N. Léonard, and A. C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
 (4) M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
 (5) M. Courbariaux, Y. Bengio, and J. David. BinaryConnect: Training deep neural networks with binary weights during propagations. CoRR, abs/1511.00363, 2015.
 (6) B. Graham. Lowprecision batchnormalized activations. CoRR, abs/1702.08231, 2017.
 (7) S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 2015.
 (8) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
 (9) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
 (10) N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. Vazir Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In-datacenter performance analysis of a tensor processing unit. ArXiv e-prints, Apr. 2017.

 (11) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 (12) F. Li and B. Liu. Ternary weight networks. CoRR, abs/1605.04711, 2016.
 (13) Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. CoRR, abs/1510.03009, 2015.
 (14) N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with fine-grained quantization. ArXiv e-prints, May 2017.
 (15) D. Miyashita, E. H. Lee, and B. Murmann. Convolutional neural networks using logarithmic data representation. CoRR, abs/1603.01025, 2016.
 (16) E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and G. Boudoukh. Can fpgas beat gpus in accelerating nextgeneration deep neural networks? In Proceedings of the 2017 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, FPGA ’17, pages 5–14, New York, NY, USA, 2017. ACM.
 (17) M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016.
 (18) W. Sung, S. Shin, and K. Hwang. Resiliency of deep neural networks under quantization. CoRR, abs/1511.06488, 2015.
 (19) C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
 (20) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
 (21) Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. H. W. Leong, M. Jahre, and K. A. Vissers. FINN: A framework for fast, scalable binarized neural network inference. CoRR, abs/1612.07119, 2016.
 (22) V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
 (23) G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision and sparsity. CoRR, abs/1610.00324, 2016.
 (24) S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.
 (25) A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. CoRR, abs/1702.03044, 2017.
 (26) S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. DoReFa-Net: Training low-bitwidth convolutional neural networks with low-bitwidth gradients. CoRR, abs/1606.06160, 2016.
 (27) C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. CoRR, abs/1612.01064, 2016.