1 Introduction
Deep learning, spearheaded by convolutional neural networks (CNNs) and recurrent neural networks (RNNs), has been pushing the frontiers in many computer vision applications (LeCun et al., 2015). Deep learning algorithms have achieved or surpassed human-level perception in some applications (He et al., 2016; Xiong et al., 2016), enabling their deployment in real-world applications. The key challenges for deep learning algorithms are their computational complexity and model size, which impede deployment on end-user client devices and thus confine them to cloud-based high-performance servers. For example, AlexNet (Krizhevsky et al., 2012), a popular CNN model, requires 1.45 billion operations per input image with 240MB of weights (Suda et al., 2016).
Many researchers have explored hardware accelerators for CNNs to enable deployment of CNNs in embedded devices and demonstrated good performance at low power consumption (Qiu et al., 2016; Shin et al., 2017). Reducing the data precision is a commonly used technique for improving the energy efficiency of CNNs. Typically, CNNs are trained on high-performance CPUs/GPUs with 32-bit floating-point data. Fixed-point representation with shorter bitwidth for CNN weights and activations has been widely explored (Judd et al., 2015; Gupta et al., 2015; Gysel et al., 2016; Lin et al., 2015), as it significantly reduces the storage requirements, memory bandwidth and power consumption without sacrificing accuracy.
This work focuses on number representation schemes for implementing CNN inference. A representation scheme has the following requirements:

- Accuracy: the representation should achieve the desired network accuracy with limited bitwidth.
- Efficiency: the representation can be implemented efficiently in hardware.
- Consistency: the representation should be consistent across different CNNs.
Based on these requirements, we propose using floating-point numbers for CNN weights and fixed-point numbers for activations. We justify this choice from both algorithmic and hardware implementation perspectives. From the algorithmic perspective, using popular large-scale CNNs such as AlexNet (Krizhevsky et al., 2012), SqueezeNet (Iandola et al., 2016), GoogLeNet (Szegedy et al., 2015) and VGG16 (Simonyan & Zisserman, 2014), we show that the representation range of the weights is the main factor determining inference accuracy, and that range is captured more efficiently by a floating-point format. From the hardware perspective, we show that multiplication can be implemented more efficiently with one floating-point operand and one fixed-point operand generating a fixed-point product.
The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 gives background on different number representation formats and typical hardware implementations of CNNs. Section 4 describes the proposed number representation scheme. Sections 5 and 6 discuss the scheme from the algorithmic and implementation perspectives. Section 7 presents the experimental results, and Section 8 concludes the paper.
2 Related Work
Precision of the neural network weights and activations plays a major role in determining the efficiency of CNN hardware or software implementations. A lot of research focuses on replacing the standard 32-bit floating-point data with reduced-precision data for CNN inference. For example, Gysel (Gysel, 2016) proposes representing both CNN weights and activations as minifloats, i.e., floating-point numbers with shorter bitwidth. Since fixed-point arithmetic is more hardware efficient than floating-point arithmetic, most research focuses on fixed-point quantization. Gupta et al. (Gupta et al., 2015) present the impact of different fixed-point rounding schemes on accuracy. Judd et al. (Judd et al., 2015) demonstrate that the minimum required data precision varies not only across different networks, but also across different layers of the same network. Lin et al. (Lin et al., 2015) present a fixed-point quantization methodology to identify the optimal data precision for all layers of a network. Gysel et al. (Gysel et al., 2016) present Ristretto, a framework for fixed-point quantization and retraining of CNNs based on Caffe (Jia et al., 2014). Researchers have also explored training neural networks directly with fixed-point weights. In (Hammerstrom, 1990), the author presents a hardware architecture for on-chip learning with fixed-point operations. More recently, in (Courbariaux et al., 2014), the authors train neural networks with floating-point, fixed-point and dynamic fixed-point formats and demonstrate that fixed-point weights are sufficient for training. Gupta et al. (Gupta et al., 2015) demonstrate network training with 16-bit fixed-point weights using a stochastic rounding scheme.
Many other approaches for memory reduction of neural networks have been explored. Han et al. (Han et al., 2015) propose a combination of network pruning, weight quantization during training and compression based on Huffman coding to reduce the VGG16 network size by 49X. In (Deng et al., 2015), the authors propose to store both 8-bit quantized floating-point weights and 32-bit full-precision weights. At runtime, quantized or full-precision weights are randomly fetched in order to reduce memory bandwidth. The continuous research effort to reduce data precision has led to many interesting demonstrations with 2-bit weights (Venkatesh et al., 2016) and even binary weights/activations (Courbariaux & Bengio, 2016; Rastegari et al., 2016). Zhou et al. (Zhou et al., 2016) demonstrate AlexNet training with 1-bit weights, 2-bit activations and 6-bit gradients. These techniques require additional retraining and can result in suboptimal accuracies.
In contrast to prior work, this work proposes quantizing the weights of a pretrained neural network into floating-point numbers and implementing the activations in fixed-point format, both for memory reduction and hardware efficiency. It further shows that floating-point representation of weights achieves a better range/accuracy tradeoff than fixed-point representation with the same number of bits, which we demonstrate empirically on state-of-the-art CNNs such as AlexNet (Krizhevsky et al., 2012), VGG16 (Simonyan & Zisserman, 2014), GoogLeNet (Szegedy et al., 2015) and SqueezeNet (Iandola et al., 2016). Although this work is based on quantization alone, without retraining the network, retraining may also be applied to reclaim part of the accuracy loss due to quantization.
3 Background
3.1 Fixed-Point Number Representation
Fixed-point representation is very similar to integer representation. The difference is that an integer has an implicit scaling factor of 1, whereas a fixed-point number has a predefined scaling factor that is a power of 2. Some examples of fixed-point numbers are shown in Fig. 1.
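As a concrete illustration, the short sketch below (a hypothetical helper using numpy, not from the paper) quantizes values to a signed fixed-point format with a power-of-2 scaling factor:

```python
import numpy as np

def to_fixed(x, total_bits=8, frac_bits=6):
    """Quantize x to signed fixed-point with a 2**-frac_bits scaling factor."""
    scale = 2.0 ** frac_bits
    # Representable signed integer range for total_bits bits.
    qmin, qmax = -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale  # the value the hardware actually represents

# 0.3 with 6 fractional bits is stored as round(0.3 * 64) = 19, i.e., 19/64 = 0.296875.
print(to_fixed(np.array([0.3])))
```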
Usually, all fixed-point numbers are assumed to share the same scaling factor during the entire computation. In some scenarios, the computation can be partitioned into different sets, e.g., different CNN layers, each using fixed-point numbers with a different scaling factor. This is also referred to as dynamic fixed-point representation (Courbariaux et al., 2014).

3.2 Floating-Point Number Representation
One example of floating-point number representation is shown in Fig. 2. A floating-point representation typically has three parts: sign, mantissa and exponent. The sign bit determines whether the number is positive or negative, the mantissa determines the significand, and the exponent determines the scale of the value. Usually, some special encodings are reserved for special numbers (e.g., 0, NaN and +/- infinity).
For binary floating-point numbers, the mantissa can assume an implicit bit, as adopted by the IEEE floating-point standard. This ensures that the value of the significand is always between 1 and 2, so the leading 1 can be omitted to save storage space. However, such an implicit bit places a limit on the smallest representable significand.
The exponent is typically represented as an unsigned integer with a bias. For example, an 8-bit exponent with a bias of 127 can represent exponent values from -127 to 128, i.e., 0 - 127 to 255 - 127.
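As an illustration, a minimal decoder for such a format might look as follows (hypothetical field widths, not IEEE 754; the bias and the implicit leading 1 follow the description above):

```python
def decode_minifloat(bits, exp_bits=4, man_bits=3):
    """Decode an integer holding packed sign | exponent | mantissa fields."""
    bias = 2 ** (exp_bits - 1) - 1                    # e.g., 7 for a 4-bit exponent
    man = bits & ((1 << man_bits) - 1)                # mantissa field
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)  # biased exponent field
    sign = -1.0 if (bits >> (man_bits + exp_bits)) & 1 else 1.0
    significand = 1.0 + man / 2.0 ** man_bits         # implicit leading 1 => [1, 2)
    return sign * significand * 2.0 ** (exp - bias)

# Exponent field 7 (= bias) and mantissa field 4 encode 1.5 * 2^0 = 1.5.
print(decode_minifloat((7 << 3) | 4))
```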
3.3 Hardware Implementation of CNNs
CNNs typically consist of multiple convolution layers interspersed with pooling, ReLU and normalization layers, followed by fully-connected layers. Convolution and fully-connected layers are the most compute- and data-intensive layers, respectively (Qiu et al., 2016). The computation in these layers consists of multiply-and-accumulate (MAC) operations. The data path is illustrated in Fig. 3: the input features are multiplied with the weights to produce intermediate data (i.e., partial sums), and these partial sums are accumulated to generate the output features. Since fixed-point arithmetic is typically more efficient for hardware implementation, most hardware accelerators implement the MAC operations in fixed-point representation. The power/area breakdown of a CNN hardware accelerator depends mostly on its data flow architecture. For example, in Eyeriss (Chen et al., 2016), MAC and memory account for about 9% and 52% of each processing element's (PE) area, respectively. For Origami (Cavigelli et al., 2015), MAC and memory account for about 32% and 34% of the area, respectively.
4 Proposed Number Representation Scheme
An overview of our proposed number representation scheme is shown in Fig. 3. Unlike most existing CNN implementations, we propose using a combination of floating-point and fixed-point representations. The network weights are represented as floating-point numbers, while the input/output features are represented as fixed-point numbers. The multiplier takes one floating-point number and one fixed-point number and produces its output, i.e., the intermediate data, in fixed-point format, which can have a wider bitwidth than the input/output features. The accumulator is a standard fixed-point adder, which can also have a higher bitwidth.
From a hardware perspective, it is more efficient to implement multiplication with a floating-point operand and addition with fixed-point numbers. Each multiplication takes one fixed-point input and one floating-point input and produces a fixed-point output. This can be implemented with a multiplier and a shifter: the multiplier multiplies the fixed-point number with the mantissa of the floating-point number, and the shifter shifts the result according to the exponent value. We therefore propose the hardware architecture illustrated in Fig. 3.
The accumulation/addition works on fixed-point numbers, which can be wider (i.e., have a larger bitwidth) than either of the inputs. This part is similar to most fixed-point implementations of CNN accelerators.
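To make the datapath concrete, here is a hedged, bit-level sketch of one MAC step under the proposed scheme (integer arithmetic only; the shift handling and scale bookkeeping are simplified relative to a real design):

```python
def mac_float_weight(acc, act_fixed, w_sign, w_exp, w_man, man_bits=3):
    """acc += act * weight, with the weight given as (sign, unbiased exp, mantissa).

    The fixed-point activation is multiplied by the weight's significand
    (implicit 1 restored), the product is shifted by the exponent, and the
    result goes into a wider fixed-point accumulator. The accumulator is
    scaled by 2**-man_bits relative to the activation's own scale.
    """
    significand = (1 << man_bits) | w_man                   # restore implicit bit
    prod = act_fixed * significand                          # small integer multiplier
    prod = prod << w_exp if w_exp >= 0 else prod >> -w_exp  # barrel shifter
    return acc + (-prod if w_sign else prod)                # wide fixed-point adder

# act = 100, weight = +1.5 * 2^-2 = 0.375; exact product 37.5, i.e., 300 in units of 2^-3.
print(mac_float_weight(0, act_fixed=100, w_sign=0, w_exp=-2, w_man=4))  # 300
```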
5 Algorithmic Perspective
In this section, we investigate and explain why the proposed number representation scheme is better from the algorithmic perspective. Section 5.1 demonstrates that different CNNs can have inconsistent fixed-point bitwidth requirements for representing the weights. Section 5.2 investigates this inconsistency by analyzing CNN weight distributions and properties. Section 5.3 shows that the representation range is the main factor that determines the inference accuracy. Section 5.4 shows that floating-point representation is a more efficient and consistent representation for CNN weights.
5.1 CNN Accuracy with Fixed-Point Weights
To evaluate different number representation schemes, we implement weight quantization based on the Caffe (Jia et al., 2014) framework. To make a fair comparison, we assume that there is always a sign bit for representing negative values in both fixed-point and floating-point numbers. We denote the bitwidth of a floating-point representation as m + e, where m is the number of mantissa bits and e is the number of exponent bits.
We apply weight quantization to four popular CNNs: AlexNet (Krizhevsky et al., 2012), SqueezeNet (Iandola et al., 2016), GoogLeNet (Szegedy et al., 2015) and VGG16 (Simonyan & Zisserman, 2014). We evaluate network accuracy by quantizing all convolutional and fully-connected layer weights. Activations are quantized to 16-bit fixed-point. For each layer, we normalize the weights so that the maximum absolute value equals 1, similar to dynamic fixed-point quantization (Moons & Verhelst, 2016; Gysel et al., 2016; Courbariaux et al., 2014). All accuracy results are top-1 accuracy, normalized to the top-1 accuracy obtained with 32-bit floating-point representation. Top-5 accuracy results correlate with the top-1 results.
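A hypothetical sketch of this per-layer procedure (names and details are ours, not the paper's exact code) is:

```python
import numpy as np

def quantize_layer_fixed(w, total_bits=8):
    """Normalize a layer's weights to [-1, 1], then quantize to signed fixed-point."""
    scale = np.max(np.abs(w))          # per-layer normalization factor
    levels = 2 ** (total_bits - 1)     # all non-sign bits are fractional
    q = np.clip(np.round(w / scale * levels), -levels, levels - 1)
    return q / levels * scale          # dequantized values the network actually uses

w = np.random.randn(64, 3, 11, 11) * 0.05           # toy conv1-like weight tensor
print(np.max(np.abs(w - quantize_layer_fixed(w))))  # worst-case quantization error
```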
The network accuracy results using fixed-point representation are shown in Fig. 4. Two observations can be made here:

- For all networks, the accuracy increases sharply after a certain threshold: the normalized accuracy goes from close to 0 to close to 1 within a difference of 2 or 3 bits. This suggests that something dramatically different happens with these additional bits.
- Among different networks, the required number of bits is very inconsistent. With 7-bit fixed-point numbers, AlexNet and SqueezeNet achieve close to full accuracy, while GoogLeNet and VGG16 have very low accuracy; GoogLeNet and VGG16 need 10 to 11 bits to achieve full accuracy.
This inconsistency in bitwidth requirements across different CNNs poses challenges for hardware implementation. For a design to be general-purpose and future-proof, the designer has to over-provision the bitwidth with margin or use runtime adaptation (Moons & Verhelst, 2016), both of which incur significant overhead.
5.2 Weight Distribution
Several factors could cause the inconsistency in Fig. 4. Network depth is one possibility: similar to the idea in (Lin et al., 2015), fixed-point quantization can be modeled as quantization noise injected at each layer, so the network accuracy may drop further, i.e., accumulate more noise, as the network depth increases. Another possible factor is the number of MAC operations in each layer, since small quantization errors can accumulate over a large number of MAC operations. For example, the number of MAC operations needed to calculate one output in the convolutional and fully-connected layers of AlexNet is (363, 1200, 2304, 1728, 1728, 9216, 4096, 4096).
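These counts follow directly from the layer shapes: MACs per output equal kernel height x kernel width x input channels (per group) for convolutional layers and the fan-in for fully-connected layers. A quick check against the standard AlexNet shapes:

```python
# MACs per output for AlexNet's conv layers (k*k*C_in, per-group channels where
# grouped convolution is used) and fully-connected layers (fan-in).
macs_per_output = {
    "conv1": 11 * 11 * 3,   # 363
    "conv2": 5 * 5 * 48,    # 1200 (grouped: 48 input channels per group)
    "conv3": 3 * 3 * 256,   # 2304
    "conv4": 3 * 3 * 192,   # 1728 (grouped)
    "conv5": 3 * 3 * 192,   # 1728 (grouped)
    "fc6":   256 * 6 * 6,   # 9216
    "fc7":   4096,
    "fc8":   4096,
}
print(list(macs_per_output.values()))
```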
However, none of the above reasons can explain the first observation, about the sharp, rather than gradual, change in accuracy. To investigate further, we plot the weight distributions of four different layers of AlexNet in Fig. 5. Most weights have small values even after the per-layer normalization. The distribution is concentrated at the center, which is also the motivation for the Huffman encoding of weights proposed in (Han et al., 2015).
To better visualize the difference, we plot the same weight distributions in log scale in Fig. 6. This plot makes it easier to spot the differences between the weight distributions and explains the accuracy behavior under fixed-point quantization observed in Fig. 4. Under fixed-point quantization, the layer with the most small-valued weights, conv3, is the most susceptible to quantization errors. With 4-bit fixed-point representation, more than 90% of the weights in conv3 are unrepresentable, i.e., quantized to 0. This also explains why stochastic rounding works better than round-to-nearest, as reported in (Gupta et al., 2015).
Since the layers are cascaded, the accuracy of the entire network is limited by the weakest layer. This is why the network produces close to 0 accuracy at small bitwidths, as shown in Fig. 4. With higher fixed-point bitwidth, a larger number of weights in conv3 become representable, which results in the rapid increase in network accuracy.
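This effect is easy to quantify; the hedged sketch below (illustrative data, following the per-layer normalization of Section 5.1) counts the share of a layer's weights whose nearest fixed-point code is 0:

```python
import numpy as np

def underflow_fraction(w, total_bits):
    """Fraction of weights that round to the zero code at the given bitwidth."""
    scale = np.max(np.abs(w))
    levels = 2 ** (total_bits - 1)
    # A weight rounds to zero when |w| / scale * levels < 0.5.
    return np.mean(np.abs(w) < 0.5 * scale / levels)

w = np.random.laplace(scale=0.01, size=100000)  # stand-in for a concentrated layer
for bits in (4, 7, 10):
    print(bits, underflow_fraction(w, bits))    # underflow share shrinks with bitwidth
```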
The inconsistency in bitwidth requirements observed in Fig. 4 can also be explained by the weight distributions. We pick the layers with the largest and smallest weights from AlexNet and VGG16 and plot their weight distributions in Fig. 7. The layer conv3_1 in VGG16 has more weights with small exponent values, which explains why VGG16 requires more bitwidth in fixed-point representation.
5.3 Range vs. Precision
The results in Fig. 4 and the weight distributions in Fig. 6 show that a network can achieve almost full accuracy, e.g., with 7-bit fixed-point for AlexNet, even when most weights are barely representable, i.e., have only 1 or 2 significant bits. This suggests that representation range, i.e., the ability to represent larger/smaller values, is more important than representation precision, i.e., the ability to differentiate between nearby values.
Since representation range and representation precision are hard to decouple in fixed-point representation, we investigate this using floating-point representation, where the mantissa bitwidth controls the precision and the exponent bitwidth controls the range.
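A hedged sketch of the minifloat weight quantizer used in these experiments might look as follows (our reconstruction, assuming per-layer normalized weights in [-1, 1] and the implicit leading bit; the exp_range argument mimics the clamped-exponent-range variant discussed below):

```python
import numpy as np

def quantize_minifloat(w, man_bits=3, exp_bits=4, exp_range=None):
    """Round w to sign * (1.mantissa) * 2**p for weights pre-normalized to [-1, 1].

    exp_bits gives 2**exp_bits exponent values, p in {-2**exp_bits, ..., -1};
    exp_range, if set, overrides 2**exp_bits to mimic the clamped-range runs.
    """
    n = exp_range if exp_range is not None else 2 ** exp_bits
    sign, mag = np.sign(w), np.abs(w)
    p = np.clip(np.floor(np.log2(np.maximum(mag, 2.0 ** -n))), -n, -1.0)
    sig = mag / 2.0 ** p                                 # significand, target [1, 2)
    sig = np.round(sig * 2 ** man_bits) / 2 ** man_bits  # quantize the mantissa
    sig = np.clip(sig, 1.0, 2.0 - 2.0 ** -man_bits)      # implicit-bit range
    q = sign * sig * 2.0 ** p
    q[mag < 2.0 ** (-n - 1)] = 0.0                       # underflow to zero
    return q
```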
Fig. 8 highlights some of the network accuracy results using floating-point representation with varying exponent bitwidth (i.e., representation range). The floating-point format has a 3-bit mantissa with the implicit bit. The implicit bit limits the value of the significand, so the representation range is controlled by the exponent. With floating-point representation, the networks show a consistent trend: accuracy is almost 0 with a 2-bit exponent and quickly increases to almost full accuracy with a 4-bit exponent. Unlike the behavior seen in Fig. 4, this is expected, as 2 additional exponent bits provide a 4X increase in the span of representable exponent values.
To get more insight into the impact of representation range, we also run experiments with a floating-point-like representation where we limit the exponent range rather than the number of exponent bits. For example, an exponent range of 4 is equivalent to a 2-bit exponent, and an exponent range of 8 is equivalent to a 3-bit exponent. The results of the exponent range experiments are shown in Fig. 9; here the floating-point number has a 2-bit mantissa with the implicit bit. The accuracy behavior is similar to the fixed-point results, i.e., accuracy increases rapidly from almost 0 to almost 1. The relative ordering of exponent range requirements also matches the bitwidth requirements for fixed-point representation. This suggests that representation range is the main factor affecting network accuracy.
With a 2-bit mantissa, some networks saturate at a normalized accuracy of less than 1, as shown in Fig. 9. To assess the effect of precision, the experiments are repeated with a 6-bit mantissa, as shown in Fig. 10. Comparing Fig. 10 with Fig. 9, the additional 4 mantissa bits do not have a significant impact on network accuracy. This also validates the initial hypothesis that representation range is more important than representation precision.
5.4 CNN Accuracy with Floating-Point Weights
Compared to fixed-point representation, floating-point is better for representation range, which increases exponentially with the exponent bitwidth. The results in Fig. 8 show that a 4-bit exponent is adequate and consistent across different networks.
The next question for floating-point representation is how much precision, i.e., how many mantissa bits, is needed. Fig. 11 highlights some of the network accuracy results using floating-point representation with varying mantissa bitwidth. The floating-point number has a 4-bit exponent and uses the implicit bit. Most networks have very high accuracy even with a 1-bit mantissa and achieve full accuracy with a 3-bit mantissa. This is also consistent across different networks, which further demonstrates that the inconsistency seen with fixed-point representation in Fig. 4 stems mainly from inconsistent requirements for representation range rather than representation precision.
We also repeat the experiments with a floating-point representation without the implicit bit in the mantissa. The results are highlighted in Fig. 12 and Fig. 13. The implicit bit limits the range of the significand to [1, 2). Removing it helps further extend the range of the representation, especially for representing small numbers; that is why the network accuracy saturates with a 3-bit exponent instead of a 4-bit one. The implicit bit also improves the precision of the significand. Hence, without it we need a 4-bit mantissa instead of a 3-bit one to achieve full accuracy.
6 Implementation Perspective
This section motivates the proposed number representation scheme from the hardware implementation perspective. The implementation considerations are discussed in Section 6.1. Hardware tradeoff results are presented in Section 6.2.
6.1 Hardware Implementation Considerations
As discussed in Section 3.3, computations in CNNs are typically implemented as MAC operations. For the same 32-bit wide operations, hardware implementations of fixed-point arithmetic can be more efficient than floating-point arithmetic. This is one of the reasons why most previous work focuses on optimizing for fixed-point representation.
The comparison becomes less obvious when the bitwidth is smaller, especially when the number of exponent bits in the floating-point representation is small. For example, as shown in Fig. 14, multiplication by a floating-point number can be implemented with a multiplier (sized by the mantissa bitwidth) and a barrel shifter (sized by the exponent bitwidth). This can be more efficient than multiplying two fixed-point numbers, as the multiplier becomes smaller and the shifter is simple.
6.2 Hardware Tradeoff Results
To validate the claim in Section 6.1, we implement multipliers with different operand configurations using a commercial 16nm process technology and libraries. The results are highlighted in Fig. 15. A floating-point configuration is denoted m + e, i.e., mantissa bitwidth + exponent bitwidth. The baseline is a multiplier with two 8-bit fixed-point operands, i.e., 8x8, and all area and power numbers are normalized to this baseline. For the same bitwidth, the proposed scheme of combining fixed-point and floating-point operands reduces both area and power. The reduction grows with fewer mantissa bits and more exponent bits, as a shifter is more efficient than a multiplier. As an example, a floating-point operand with a 4-bit mantissa and a 4-bit exponent, i.e., 8x(4+4), reduces power and area by more than 20% compared to the 8x8 fixed-point multiplication.
7 Experimental Results
As discussed in Section 5.1, we perform the weight quantization based on Caffe (Jia et al., 2014). We denote the representation as (m, e), where m is the mantissa bitwidth and e is the exponent bitwidth; e = 0 means fixed-point representation.
Table 1: Normalized accuracy of AlexNet with different (m, e) weight configurations.

| | e=0 | e=1 | e=2 | e=3 | e=4 | e=5 |
|---|---|---|---|---|---|---|
| m=1 | 0.002 | 0.002 | 0.003 | 0.920 | 0.918 | 0.918 |
| m=2 | 0.003 | 0.003 | 0.003 | 0.983 | 0.982 | 0.982 |
| m=3 | 0.002 | 0.003 | 0.003 | 0.995 | 0.995 | 0.995 |
| m=4 | 0.016 | 0.003 | 0.003 | 0.997 | 1.001 | 1.001 |
| m=5 | 0.775 | 0.002 | 0.003 | 0.994 | 0.999 | 0.998 |
| m=6 | 0.979 | 0.002 | 0.003 | 0.995 | 0.999 | 0.999 |
| m=7 | 0.996 | 0.002 | 0.003 | 0.995 | 0.999 | 0.999 |
| m=8 | 0.998 | 0.002 | 0.003 | 0.995 | 1.000 | 1.000 |
| m=9 | 1.001 | 0.002 | 0.003 | 0.995 | 1.000 | 1.000 |
| m=10 | 0.999 | 0.002 | 0.003 | 0.995 | 1.001 | 1.001 |
We evaluate the network accuracy with different bitwidth settings. Table 1 highlights some results for AlexNet. The network is non-functioning (i.e., has close to 0 accuracy) when e is small, i.e., when the representation range is limited. To achieve full accuracy, AlexNet requires 8 bits in fixed-point representation, i.e., (7, 0) plus 1 sign bit, or 7 bits in floating-point representation with the (3, 3) configuration. If the implementation only targets AlexNet, the proposed number representation achieves a 12.5% weight storage reduction and an 8% power reduction in multiplication. The benefit increases for CNNs that require more fixed-point bitwidth.
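A grid like Table 1 can be reproduced by sweeping (m, e) with the quantizers sketched earlier and re-running inference; in outline (eval_top1 is a hypothetical hook that runs Caffe inference with every conv/fc weight tensor passed through the given function):

```python
# Sketch only: combines quantize_layer_fixed (Section 5.1) and
# quantize_minifloat (Section 5.3); eval_top1 is a placeholder.
results = {}
for m in range(1, 11):
    for e in range(0, 6):
        if e == 0:
            # (m, 0): fixed-point with m magnitude bits plus a sign bit.
            fn = lambda w, m=m: quantize_layer_fixed(w, total_bits=m + 1)
        else:
            fn = lambda w, m=m, e=e: quantize_minifloat(w, man_bits=m, exp_bits=e)
        results[(m, e)] = eval_top1(fn)  # normalized top-1 accuracy
```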
As discussed earlier, one of the requirements for a number representation scheme is consistency across different networks. This is especially important for the hardware implementation to be future-proof and viable for different CNN models. Some results of the normalized accuracy of different networks are highlighted in Table 2. The 7-bit fixed-point configuration used for AlexNet also works for SqueezeNet, but is not adequate for GoogLeNet and VGG16; a 10-bit fixed-point representation is required to get consistent accuracy across all networks used in this study.
Table 2: Normalized accuracy of different networks for selected (m, e) configurations.

| (m, e) | AlexNet | SqueezeNet | GoogLeNet | VGG16 |
|---|---|---|---|---|
| (7, 0) | 1.00 | 1.00 | 0.85 | 0.02 |
| (10, 0) | 1.00 | 0.99 | 0.99 | 1.00 |
| (3, 4) | 0.99 | 0.99 | 0.99 | 1.00 |
With the proposed number representation scheme, we only need 7-bit floating-point weights, i.e., the (3, 4) configuration. We can therefore replace 11-bit weights (a 10-bit fixed-point number plus a sign bit) with 8-bit weights (a 3-bit mantissa, a 4-bit exponent and a sign bit). This results in a 36% storage reduction for weights and a 50% power reduction in multiplication.
8 Conclusion
In this work, we propose a CNN inference implementation with floating-point weights and fixed-point activations. We motivate the proposed number representation scheme from both algorithmic and hardware implementation perspectives. The proposed scheme can reduce weight storage by up to 36% and multiplier power by up to 50%. Future work will investigate the impact of network topology and training on the number representation requirements.
References
Cavigelli, Lukas, Gschwend, David, Mayer, Christoph, Willi, Samuel, Muheim, Beat, and Benini, Luca. Origami: A convolutional network accelerator. In Proceedings of the 25th Edition on Great Lakes Symposium on VLSI, pp. 199–204. ACM, 2015.

Chen, Yu-Hsin, Krishna, Tushar, Emer, Joel S, and Sze, Vivienne. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 2016.

Courbariaux, Matthieu and Bengio, Yoshua. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

Courbariaux, Matthieu, David, Jean-Pierre, and Bengio, Yoshua. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.

Deng, Zhaoxia, Xu, Cong, Cai, Qiong, and Faraboschi, Paolo. Reduced-precision memory value approximation for deep learning. 2015.

Gupta, Suyog, Agrawal, Ankur, Gopalakrishnan, Kailash, and Narayanan, Pritish. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 2015.

Gysel, Philipp. Ristretto: Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1605.06402, 2016.

Gysel, Philipp, Motamedi, Mohammad, and Ghiasi, Soheil. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168, 2016.

Hammerstrom, Dan. A VLSI architecture for high-performance, low-cost, on-chip learning. In 1990 IJCNN International Joint Conference on Neural Networks, pp. 537–544. IEEE, 1990.

Han, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Iandola, Forrest N, Han, Song, Moskewicz, Matthew W, Ashraf, Khalid, Dally, William J, and Keutzer, Kurt. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

Judd, Patrick, Albericio, Jorge, Hetherington, Tayler, Aamodt, Tor, Jerger, Natalie Enright, Urtasun, Raquel, and Moshovos, Andreas. Reduced-precision strategies for bounded memory in deep neural nets. arXiv preprint arXiv:1511.05236, 2015.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436–444, 2015.

Lin, Darryl D, Talathi, Sachin S, and Annapureddy, V Sreekanth. Fixed point quantization of deep convolutional networks. arXiv preprint arXiv:1511.06393, 2015.

Moons, Bert and Verhelst, Marian. An energy-efficient precision-scalable ConvNet processor in 40-nm CMOS. IEEE Journal of Solid-State Circuits, 2016.

Qiu, Jiantao, Wang, Jie, Yao, Song, Guo, Kaiyuan, Li, Boxun, Zhou, Erjin, Yu, Jincheng, Tang, Tianqi, Xu, Ningyi, Song, Sen, et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26–35. ACM, 2016.

Rastegari, Mohammad, Ordonez, Vicente, Redmon, Joseph, and Farhadi, Ali. XNOR-Net: ImageNet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279, 2016.

Shin, Dongjoo, Lee, Jinmook, Lee, Jinsu, and Yoo, Hoi-Jun. DNPU: An 8.1TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 240–241. IEEE, 2017.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Suda, Naveen, Chandra, Vikas, Dasika, Ganesh, Mohanty, Abinash, Ma, Yufei, Vrudhula, Sarma, Seo, Jae-sun, and Cao, Yu. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 16–25. ACM, 2016.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Venkatesh, Ganesh, Nurvitadhi, Eriko, and Marr, Debbie. Accelerating deep convolutional networks using low-precision and sparsity. arXiv preprint arXiv:1610.00324, 2016.

Xiong, W, Droppo, J, Huang, X, Seide, F, Seltzer, M, Stolcke, A, Yu, D, and Zweig, G. The Microsoft 2016 conversational speech recognition system. arXiv preprint arXiv:1609.03528, 2016.

Zhou, Shuchang, Ni, Zekun, Zhou, Xinyu, Wen, He, Wu, Yuxin, and Zou, Yuheng. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.