1 Introduction
Deep convolutional neural networks (CNNs) have demonstrated state-of-the-art performance in image classification Krizhevsky et al. (2012); Simonyan & Zisserman (2014); He et al. (2015) but have steadily grown in computational complexity. For example, Deep Residual Learning He et al. (2015) set a new record in image classification accuracy at the expense of billions of floating-point multiply-and-add operations per forward pass of a single image and hundreds of megabytes of memory to store the weights of its deepest network.
In order for these large networks to run in real-time applications such as on mobile or embedded platforms, it is often necessary to use low-precision arithmetic and apply compression techniques. Recently, many researchers have successfully deployed networks that compute using 8-b fixed-point representation Vanhoucke et al. (2011); Abadi et al. (2015) and have successfully trained networks with 16-b fixed point Gupta et al. (2015). This work in particular is built upon the idea that algorithm-level noise tolerance of the network can motivate simplifications in hardware complexity.
Interesting directions point towards matrix factorization Denton et al. (2014) and tensorification Novikov et al. (2015) by leveraging the structure of the fully-connected (FC) layers. Another promising area is to prune the FC layers before mapping them to sparse matrix-matrix routines on GPUs Han et al. (2015b). However, many of these techniques target systems that meet specific criteria, such as networks with many large FC layers or accelerators that handle efficient sparse matrix-matrix arithmetic. And with network architectures currently pushing towards deeper stacks of convolutional layers while settling for fewer dense FC layers He et al. (2015); Szegedy et al. (2015), it is difficult to motivate a one-size-fits-all solution to these computational and memory demands.
We propose a general method of representing and computing the dot products in a network that allows networks with minimal constraints on layer properties to run more efficiently in digital hardware. In this paper we explore communicating activations, storing weights, and computing the atomic dot products in the binary logarithmic (base-2 logarithmic) domain for both inference and training. The motivations for moving to this domain are the following:

Training networks with weight decay leads to final weights that are distributed non-uniformly and concentrated around zero.

Similarly, activations are also highly concentrated near zero. Our work uses Rectified Linear Units (ReLU) as the nonlinearity.

Logarithmic representations can encode data with a very large dynamic range in fewer bits than fixed-point representation can Gautschi et al. (2016).

Data representation in the log domain is naturally encoded in digital hardware (as shown in Section 4.3).
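As an illustrative sketch of the dynamic-range argument above (not taken from the paper; the 4-bit width is an arbitrary choice), compare the range of values reachable with the same number of bits in fixed-point versus logarithmic encoding:

```python
# Dynamic range reachable with 4 bits: fixed-point codes vs. log2 exponents.
bits = 4
fixed_point_levels = list(range(2 ** bits))        # 0, 1, ..., 15 (step = 1)
log_levels = [2.0 ** e for e in range(2 ** bits)]  # 2^0, 2^1, ..., 2^15

# Ratio of largest to smallest nonzero representable value.
fixed_dynamic_range = max(fixed_point_levels) / 1
log_dynamic_range = max(log_levels) / min(log_levels)

print(fixed_dynamic_range)  # 15
print(log_dynamic_range)    # 32768.0
```

With the same bit budget, the logarithmic code spans a dynamic range of 2^15 while the fixed-point code spans only 15:1, at the cost of coarser resolution at the top of the range.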
Our contributions are as follows:

we show that networks obtain higher classification accuracies with logarithmic quantization than with traditional fixed-point linear quantization at equivalent resolutions.

we show that activations are more robust to quantization than weights. This is because the number of activations tends to be larger than the number of weights, which are reused many times during convolutions.

we apply our logarithmic data representation to state-of-the-art networks, allowing activations and weights to use only a few bits with almost no loss in classification performance.

we generalize base-2 logarithmic arithmetic to other bases; in particular, we show that base √2 captures the large dynamic range of weights and activations while also providing finer precision across the encoded range of values.

we develop logarithmic backpropagation for efficient training.
2 Related work
Reduced-precision computation. Shin et al. (2016); Sung et al. (2015); Vanhoucke et al. (2011); Han et al. (2015a) analyzed the effects of quantizing the trained weights for inference. For example, Han et al. (2015b) show that convolutional layers in AlexNet Krizhevsky et al. (2012) can be encoded to as little as 5 bits without a significant accuracy penalty. There has also been recent work on training using low-precision arithmetic. Gupta et al. (2015) propose a stochastic rounding scheme to help train networks using 16-bit fixed point. Lin et al. (2015) propose quantized backpropagation and ternary connect. This method reduces the number of floating-point multiplications by casting these operations into powers-of-two multiplies, which are easily realized with bit-shifts in digital hardware. They apply this technique to MNIST and CIFAR10 with little loss in performance. However, their method does not completely eliminate all multiplications end-to-end: during test time the network uses the learned full-resolution weights for forward propagation. Training with reduced precision is motivated by the idea that high-precision gradient updates are unnecessary for the stochastic optimization of networks Bottou & Bousquet (2007); Bishop (1995); Audhkhasi et al. (2013). In fact, some studies show that gradient noise helps convergence. For example, Neelakantan et al. (2015) empirically find that gradient noise can also encourage faster exploration and annealing of the optimization space, which can help network generalization performance.
Hardware implementations. There have been a few but significant advances in the development of specialized hardware for large networks. For example, Farabet et al. (2010) developed Field-Programmable Gate Array (FPGA) systems to perform real-time forward propagation. These groups have also performed comprehensive studies of classification performance and energy efficiency as a function of resolution. Zhang et al. (2015) have also explored the design of convolutions in the context of memory versus compute management under the roofline model. Other works focus on specialized, optimized kernels for general-purpose GPUs Chetlur et al. (2014).
3 Concept and Motivation
Each convolutional and fully-connected layer of a network performs matrix operations that distill down to dot products $z = \mathbf{w}^\top \mathbf{x}$, where $\mathbf{x}$ is the input, $\mathbf{w}$ the weights, and $z$ the activation before it is transformed by the nonlinearity (e.g. ReLU). Using conventional digital hardware, this operation is performed with multiply-and-add operations in floating- or fixed-point representation, as shown in Figure 1(a). However, this dot product can also be computed in the log domain, as shown in Figure 1(b,c).
3.1 Proposed Method 1.
The first proposed method, as shown in Figure 1(b), is to transform one operand to its log2 representation, convert the resulting transformation back to the linear domain, and multiply this by the other operand. This is simply

(1)  $\mathbf{w}^\top \mathbf{x} \simeq \sum_{i=1}^{n} w_i \times 2^{\tilde{x}_i} = \sum_{i=1}^{n} \mathrm{Bitshift}(w_i, \tilde{x}_i),$

where $\tilde{x}_i = \mathrm{Quantize}(\log_2(x_i))$, $\mathrm{Quantize}(\cdot)$ rounds its argument to an integer, and $\mathrm{Bitshift}(a, b)$ is the function that bit-shifts a value $a$ by an integer amount $b$ in fixed-point arithmetic. In floating point, this operation is simply an addition of $b$ to the exponent part of $a$. Taking advantage of the $\mathrm{Bitshift}(a, b)$ operator to perform multiplication obviates the need for expensive digital multipliers.
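As an illustrative sketch of this first method (not the paper's code; integer weights and power-of-two inputs are chosen here so the shift-based result is exact):

```python
import math

def quantize_log2(x):
    """Quantize(log2(x)): round log2 of a positive value to the nearest integer."""
    return int(round(math.log2(x)))

def bitshift(a, b):
    """Bitshift(a, b): multiply integer a by 2**b using shifts (b may be negative)."""
    return a << b if b >= 0 else a >> -b

def dot_method1(w, x):
    """Approximate w.x by shifting each integer weight by Quantize(log2(x_i))."""
    return sum(bitshift(wi, quantize_log2(xi)) for wi, xi in zip(w, x))

w = [3, 5, 2]
x = [4.0, 2.0, 8.0]       # exact powers of two, so no quantization error here
print(dot_method1(w, x))  # 3*4 + 5*2 + 2*8 = 38
```

When the inputs are not exact powers of two, the shift approximates the multiply with the quantization error analyzed in the following sections.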
Quantizing the activations and weights in the log domain ($\log_2(x)$ and $\log_2(w)$) instead of $x$ and $w$ is also motivated by leveraging the structure of the non-uniform distributions of $x$ and $w$. A detailed treatment is given in the next section. In order to quantize, we propose two hardware-friendly flavors. The first option is to simply floor the input. This method computes $\lfloor\log_2(x)\rfloor$ by returning the position of the first 1 bit seen from the most significant bit (MSB). The second option is to round to the nearest integer, which is more precise than the first option. With the latter option, after computing the integer part, the fractional part is inspected in order to determine the rounding direction. This method of rounding is summarized as follows: pick the $m$ bits that follow the leftmost 1 and consider them as a fixed-point number $F$ with 0 integer bits and $m$ fractional bits. Then, if $F \ge \sqrt{2} - 1$, round up to the nearest integer; otherwise round down.
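The two flavors can be sketched for positive integer inputs as follows (an illustrative reading of the scheme above, not the paper's implementation; `bit_length` plays the role of the MSB detector, and the √2 − 1 threshold is the round-to-nearest boundary in the log domain):

```python
import math

def floor_log2(x):
    """floor(log2(x)) for a positive integer: position of the leading 1 bit."""
    return x.bit_length() - 1

def round_log2(x):
    """round(log2(x)) for a positive integer, by inspecting the bits after the
    leading 1: round up exactly when the fraction F satisfies F >= sqrt(2) - 1,
    i.e. when x >= 2**n * sqrt(2)."""
    n = floor_log2(x)
    frac = x / (1 << n) - 1.0   # value of the bits after the leading 1
    return n + 1 if frac >= math.sqrt(2) - 1 else n

print(floor_log2(5))   # 2
print(round_log2(5))   # 2  (5 < 4*sqrt(2) ~ 5.657)
print(round_log2(6))   # 3  (6 >= 5.657)
```

The threshold works because the midpoint between exponents n and n+1 in the log domain is 2^(n+0.5) = 2^n·√2 in the linear domain.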
3.2 Proposed Method 2.
The second proposed method, as shown in Figure 1(c), extends the first method to compute dot products in the log domain for both operands. Additions in the linear domain map to sums of exponentials in the log domain, and multiplications in the linear domain become additions. The resulting dot product is

(2)  $\mathbf{w}^\top \mathbf{x} \simeq \sum_{i=1}^{n} 2^{\tilde{w}_i + \tilde{x}_i} = \sum_{i=1}^{n} \mathrm{Bitshift}(1, \tilde{w}_i + \tilde{x}_i),$

where the log-domain weights are $\tilde{w}_i = \mathrm{Quantize}(\log_2(w_i))$ and the log-domain inputs are $\tilde{x}_i = \mathrm{Quantize}(\log_2(x_i))$.
By transforming both the weights and the inputs, we compute the original dot product by bit-shifting 1 by the integer result $\tilde{w}_i + \tilde{x}_i$ and summing over all $i$.
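A minimal sketch of this second method (illustrative only; positive power-of-two operands are chosen so the result is exact, and negative shifts fall back to division):

```python
import math

def quantize_log2(x):
    """Quantize(log2(|x|)) to the nearest integer."""
    return int(round(math.log2(abs(x))))

def dot_method2(w, x):
    """Approximate w.x with both operands in the log domain: each product
    becomes Bitshift(1, w~_i + x~_i), i.e. 2**(w~_i + x~_i)."""
    total = 0.0
    for wi, xi in zip(w, x):
        shift = quantize_log2(wi) + quantize_log2(xi)
        total += (1 << shift) if shift >= 0 else 1.0 / (1 << -shift)
    return total

w = [4.0, 2.0]
x = [8.0, 16.0]           # all powers of two, so exact
print(dot_method2(w, x))  # 4*8 + 2*16 = 64.0
```

No multiplier is used anywhere: each term of the dot product is a single integer addition of exponents followed by a shift.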
3.3 Accumulation in log domain
Although Fig. 1(b,c) indicates a log-to-linear converter between layers, where the actual accumulation is performed in the linear domain, this accumulation can also be performed in the log domain using the approximation $\log_2(1 + x) \approx x$ for $0 \le x < 1$. For example, let $p_i = \tilde{w}_i + \tilde{x}_i$, $s_n = \sum_{i=1}^{n} 2^{p_i}$, and $\tilde{s}_n = \log_2(s_n)$. When $n = 2$,

(3)  $\tilde{s}_2 = \log_2\left(2^{p_1} + 2^{p_2}\right) \approx \max(p_1, p_2) + 2^{-|p_1 - p_2|},$

and for $n$ in general,

(4)  $\tilde{s}_n \approx \max\left(\tilde{s}_{n-1}, p_n\right) + 2^{-\left|\tilde{s}_{n-1} - p_n\right|}.$

Note that $\tilde{s}_n$ preserves the fractional part of the word during accumulation. Both accumulation in the linear domain and accumulation in the log domain have their pros and cons. Accumulation in the linear domain is simpler but requires larger bit widths to accommodate the large dynamic range of the numbers. Accumulation in the log domain as in (3) and (4) appears more complicated, but is in fact computed using simple bitwise operations in digital hardware.
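The recursion in (4) can be sketched and checked against the exact sum as follows (an illustrative example; the operand values are arbitrary):

```python
import math

def log_accumulate(ps):
    """Accumulate s~_n = log2(sum_i 2**p_i) using the approximation
    log2(1 + x) ~ x for 0 <= x < 1, so each step needs only a max,
    a subtraction, and one power-of-two term (bitwise-friendly)."""
    s = ps[0]
    for p in ps[1:]:
        s = max(s, p) + 2.0 ** (-abs(s - p))
    return s

ps = [3.0, 1.0, 2.0]            # represents 8 + 2 + 4 = 14 in the linear domain
approx = log_accumulate(ps)
exact = math.log2(sum(2.0 ** p for p in ps))
print(round(exact, 3))          # 3.807
print(round(approx, 3))         # ~3.67, close to the exact log-sum
```

The approximation error per step is bounded because log2(1 + 2^-d) and 2^-d agree at d = 0 and both decay toward zero as the exponent gap d grows.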
4 Experiments of Proposed Methods
Here we evaluate our methods as detailed in Sections 3.1 and 3.2 on the classification task of ILSVRC-2012 Deng et al. (2009) using Chainer Tokui et al. (2015). We evaluate method 1 (Section 3.1) on inference (forward pass) in Section 4.1, and method 2 (Section 3.2) on inference in Sections 4.2 and 4.3. For these experiments, we use published models (AlexNet Krizhevsky et al. (2012), VGG16 Simonyan & Zisserman (2014)) from the Caffe model zoo (Jia et al. (2014)) without any fine-tuning or extra retraining. Finally, we evaluate method 2 on training in Section 4.4.
4.1 Logarithmic Representation of Activations
[Table 1: Structure of AlexNet with LogQuant layers inserted. A LogQuant layer follows each ReLU (Conv1-Conv5, FC6, FC7); the LRN and pooling layers are unchanged, and the final FC8 layer is left unquantized. The columns listing per-layer weight counts, input sizes, and FSRs are not recoverable here.]
[Table 2: Structure of VGG16 with LogQuant layers inserted. A LogQuant layer follows each ReLU (Conv1_1-Conv5_3, FC6, FC7); the pooling layers are unchanged, and the final FC8 layer is left unquantized. The columns listing per-layer weight counts, input sizes, and FSRs are not recoverable here.]
This experiment evaluates the classification accuracy using logarithmic activations and 32-b floating point for the weights. In a similar spirit to that of Gupta et al. (2015), we describe the logarithmic quantization layer LogQuant that performs the element-wise operation as follows:

(5)  $\mathrm{LogQuant}(x, \mathrm{bitwidth}, \mathrm{FSR}) := \begin{cases} 0 & x = 0, \\ 2^{\tilde{x}} & \text{otherwise}, \end{cases}$

where

(6)  $\tilde{x} = \mathrm{Clip}\left(\mathrm{Round}(\log_2(|x|)),\ \mathrm{FSR} - 2^{\mathrm{bitwidth}},\ \mathrm{FSR}\right),$

(7)  $\mathrm{Clip}(x, \mathrm{min}, \mathrm{max}) = \begin{cases} \mathrm{min} & x \le \mathrm{min}, \\ \mathrm{max} - 1 & x \ge \mathrm{max}, \\ x & \text{otherwise}. \end{cases}$

These layers perform the logarithmic quantization and computation as detailed in Section 3.1. Tables 1 and 2 illustrate the addition of these layers to the models. The quantizer has a specified full-scale range, whose value in linear scale is $2^{\mathrm{FSR}}$; we refer to it simply as FSR throughout this paper for notational convenience. The FSR values for each layer are shown in Tables 1 and 2; they consist of a global fsr value plus a per-layer offset parameter. This offset parameter is chosen to properly handle the variation of activation ranges from layer to layer, using images from the training set. The fsr is a parameter global to the network, and we sweep it in our experiments to measure its effect on classification accuracy. The bitwidth is the number of bits required to represent a number after quantization. Note that since we apply quantization after the ReLU function, $x$ is zero or positive, so we use an unsigned format without a sign bit for the activations.
In order to evaluate our logarithmic representation, we also detail an equivalent linear quantization layer, described as

(8)  $\mathrm{LinearQuant}(x, \mathrm{bitwidth}, \mathrm{FSR}) = \mathrm{Clip}\left(\mathrm{Round}\left(\frac{x}{\mathrm{step}}\right) \times \mathrm{step},\ 0,\ 2^{\mathrm{FSR}}\right),$

where

(9)  $\mathrm{step} = 2^{\mathrm{FSR} - \mathrm{bitwidth}}.$
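The reference linear quantizer can be sketched the same way (saturating at the top code is our reading of the Clip behavior):

```python
def linear_quant(x, bitwidth, fsr):
    """Element-wise linear quantizer, Eq. (8)-(9): uniform step
    2**(FSR - bitwidth), clipped to the window [0, 2**FSR)."""
    step = 2.0 ** (fsr - bitwidth)
    v = round(x / step) * step
    return min(max(v, 0.0), 2.0 ** fsr - step)  # saturate at the top code

print(linear_quant(0.3, bitwidth=4, fsr=2))    # step 0.25 -> 0.25
print(linear_quant(100.0, bitwidth=4, fsr=2))  # saturates at 4 - 0.25 = 3.75
```

In contrast to LogQuant, the step size here is constant across the whole range, so small values get the same absolute resolution as large ones.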
Figure 2 illustrates the effect of the quantizer on activations following the conv2_2 layer of VGG16. The pre-quantized distribution tends to zero exponentially, and the quantized distribution illustrates how the encoded activations are equalized across many output bins under log quantization, which is not the case for linear quantization. The many small activation values are represented more finely by log quantization than by linear quantization. The total quantization error $\|\mathrm{Quantize}(x) - x\|_1$, where Quantize is either LogQuant or LinearQuant and $x$ is the vectorized activations, is smaller for the log-quantized case than for the linear one. This result is illustrated in Figure 3. Linear quantization produces a distribution of quantization errors that is highly concentrated at small activation values, where most activations lie. Log quantization with the same full-scale range yields significantly lower quantization error in that region, at the expense of a slight increase in error for large activation values. Nonetheless, the total quantization error is lower for log than for linear. We run the models as described in Tables 1 and 2 and test on the validation set without data augmentation, evaluating them over a range of FSRs and bitwidths for both quantizer layers.
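This error comparison can be reproduced in miniature on synthetic data (an illustrative sketch: the exponential distribution and its rate, and the 3-b/FSR = 2 settings, are our choices, not the conv2_2 measurements):

```python
import math
import random

def log_quant(x, bitwidth, fsr):
    # Log quantizer per Eq. (5)-(7), exponent saturating at the window ends.
    if x == 0:
        return 0.0
    lo, hi = fsr - 2 ** bitwidth, fsr
    e = round(math.log2(abs(x)))
    e = lo if e <= lo else (hi - 1 if e >= hi else e)
    return 2.0 ** e

def linear_quant(x, bitwidth, fsr):
    # Linear quantizer per Eq. (8)-(9), saturating at the top code.
    step = 2.0 ** (fsr - bitwidth)
    v = round(x / step) * step
    return min(max(v, 0.0), 2.0 ** fsr - step)

random.seed(0)
acts = [random.expovariate(4.0) for _ in range(10000)]  # concentrated near zero

err_log = sum(abs(log_quant(a, 3, 2) - a) for a in acts)
err_lin = sum(abs(linear_quant(a, 3, 2) - a) for a in acts)
print(err_log < err_lin)  # True: log spends its codes where the mass is
```

Because the synthetic activations, like real post-ReLU activations, pile up near zero, the logarithmic code's dense coverage of small magnitudes wins on total L1 error.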
Figure 4 illustrates the results for AlexNet. Using only 3 bits to represent the activations for both logarithmic and linear quantization, the top-5 accuracy is still very close to that of the original, unquantized model encoded at 32-b floating point. However, logarithmic representations tolerate a much larger dynamic range of FSRs. For example, with 4-b log, we can obtain order-of-magnitude variations in the full-scale range without a significant loss of top-5 accuracy. We see similar results for VGG16, as shown in Figure 5. Table 3 lists the classification accuracies with the optimal FSRs for each case. There are some interesting observations. First, 3-b log performs worse than 3-b linear for AlexNet but better for VGG16, which is a higher-capacity network than AlexNet. Second, a lower-resolution log encoding of the activations matches the top-5 accuracy of the higher-resolution linear encoding for VGG16. Third, with 4-b log, there is no loss in top-5 accuracy relative to the original float32 representation.
[Table 3: Top-5 accuracies of AlexNet and VGG16 with activations encoded as float 32-b, log 3-b, log 4-b, linear 3-b, and linear 4-b; the numeric values are not recoverable here.]
4.2 Logarithmic Representation of Weights of Fully Connected Layers
The FC weights are quantized using the same strategies as those in Section 4.1, except that they include a sign bit. We evaluate the classification performance using log data representation for both the FC weights and the activations jointly, using method 2 of Section 3.2. For comparison, we use linear representation for the FC weights and log for the activations as a reference. For both methods, we use the activations at the optimal bitwidth and FSRs computed in Section 4.1.
Table 4 compares these approaches along with floating point. We observe a small win for log over linear for AlexNet but a small decrease for VGG16. Nonetheless, the log computation is performed without the use of multipliers.
[Table 4: Top-5 accuracies of AlexNet and VGG16 with FC weights and activations quantized jointly: float 32-b vs. log 4-b vs. linear 4-b; the numeric values are not recoverable here.]
An added benefit of log quantization is a reduction of the model size. By quantizing the FC weights down to a few bits including the sign bit, we compress them significantly for free, i.e., without any retraining. This matters because the dense FC layers occupy the large majority of the total model size for both AlexNet and VGG16.
4.3 Logarithmic Representation of Weights of Convolutional Layers
We now represent the convolutional layers using the same procedure. We keep the activations and the FC weights in the log representations chosen above, and compare our method with the linear reference and the ideal floating-point baseline. We also perform the dot products using two different bases: 2 and √2. Note that there is no additional overhead for base √2, as it is computed with the same machinery shown in Equation (4).
Table 5 shows the classification results. They illustrate a noticeable drop in performance when moving from floating point down to 5-b base-2, but a relatively minor drop for 5-b base-√2. Both representations include a sign bit. There are also some important observations here.
[Table 5: Top-5 accuracies of AlexNet and VGG16 with convolutional weights quantized: float 32-b, linear 5-b, base-2 log 5-b, and base-√2 log 5-b; the numeric values are not recoverable here.]
We first observe that the weights of the convolutional layers for AlexNet and VGG16 are more sensitive to quantization than the FC weights are. Each FC weight is used only once per image (at batch size 1), whereas convolutional weights are reused many times across the layer's input activation map. Because of this, the quantization error of each weight influences the dot products across the entire activation volume. Second, we observe that by moving from 5-b base-2 to a finer granularity such as 5-b base-√2, we allow the network to 1) be more robust to quantization errors and degradation in classification performance and 2) retain the practical features of log-domain arithmetic.
The distributions of quantization errors for both 5-b base-2 and 5-b base-√2 are shown in Figure 6. The total quantization error on the weights, $\|\mathrm{Quantize}(w) - w\|_1$, where $w$ is the vectorized weights, is smaller for base √2 than for base 2.
4.4 Training with Logarithmic Representation
We incorporate the log representation during the training phase. The entire algorithm can be computed using method 2 of Section 3.2. Table 6 illustrates the networks that we compare. The proposed log network and the linear network are trained at the same resolution, using 4-bit unsigned activations and signed weights and gradients of matching low bit width, with Algorithm 1 on the CIFAR10 dataset with the simple data augmentation described in He et al. (2015). Note that, unlike BinaryNet Courbariaux & Bengio (2016), we quantize the backpropagated gradients to train the log net. This enables end-to-end training using logarithmic representation at the bit level. For linear quantization, however, we found it necessary to keep the gradients in their unquantized floating-point form in order to achieve good convergence. Furthermore, we include the training curve for BinaryNet, which also uses unquantized gradients.
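The idea of training through quantized gradients can be sketched on a toy problem (an illustrative example only, not the paper's Algorithm 1; the bitwidth/FSR settings and the 1-D least-squares objective are our choices):

```python
import math

def log_quant_signed(g, bitwidth=5, fsr=1):
    """Sign-preserving log quantization of a gradient value: quantize the
    magnitude to a power of two within the FSR window, keep the sign."""
    if g == 0:
        return 0.0
    lo, hi = fsr - 2 ** bitwidth, fsr
    e = round(math.log2(abs(g)))
    e = lo if e <= lo else (hi - 1 if e >= hi else e)
    return math.copysign(2.0 ** e, g)

# One-parameter SGD on (w*x - y)^2 with log-quantized gradients.
w, lr, x, y = 0.0, 0.1, 1.0, 2.0
for _ in range(50):
    grad = 2 * (w * x - y) * x          # d/dw (w*x - y)^2
    w -= lr * log_quant_signed(grad)
print(abs(w - 2.0) < 0.3)               # True: noisy steps, but converges
```

The quantized gradient is correct in sign and within a factor of √2 in magnitude, so the updates behave like noisy but consistently descending steps, which is the intuition behind training entirely in the log domain.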
Fig. 7 illustrates the training results of the log, linear, and BinaryNet variants, where the linear and BinaryNet baselines use unquantized gradients. The final test accuracies indicate that even with quantized gradients, our proposed network with log representation still outperforms the others that use unquantized gradients.
5 Conclusion
In this paper, we describe a method to represent the weights and activations with low resolution in the log domain, which eliminates bulky digital multipliers. This method is also motivated by the non-uniform distributions of weights and activations, making log representation more robust to quantization than linear representation. We evaluate our methods on the classification task of ILSVRC-2012 using pretrained models (AlexNet and VGG16). We also offer extensions that incorporate end-to-end training using log representation, including the gradients.
log quantization  linear quantization  BinaryNet 
Conv  Conv  Conv 
BatchNorm  BatchNorm  BatchNorm 
ReLU  ReLU   
LogQuant  LinearQuant  Binarize 
Conv  Conv  Conv 
BatchNorm  BatchNorm  BatchNorm 
ReLU  ReLU   
LogQuant  LinearQuant  Binarize 
MaxPool  MaxPool  MaxPool 
Conv  Conv  Conv 
BatchNorm  BatchNorm  BatchNorm 
ReLU  ReLU   
LogQuant  LinearQuant  Binarize 
Conv  Conv  Conv 
BatchNorm  BatchNorm  BatchNorm 
ReLU  ReLU   
LogQuant  LinearQuant  Binarize 
MaxPool  MaxPool  MaxPool 
Conv  Conv  Conv 
BatchNorm  BatchNorm  BatchNorm 
ReLU  ReLU   
LogQuant  LinearQuant  Binarize 
Conv  Conv  Conv 
BatchNorm  BatchNorm  BatchNorm 
ReLU  ReLU   
LogQuant  LinearQuant  Binarize 
Conv  Conv  Conv 
BatchNorm  BatchNorm  BatchNorm 
ReLU  ReLU   
LogQuant  LinearQuant  Binarize 
Conv  Conv  Conv 
BatchNorm  BatchNorm  BatchNorm 
ReLU  ReLU   
LogQuant  LinearQuant  Binarize 
MaxPool  MaxPool  MaxPool 
FC  FC  FC 
BatchNorm  BatchNorm  BatchNorm 
ReLU  ReLU   
LogQuant  LinearQuant  Binarize 
FC  FC  FC 
BatchNorm  BatchNorm  BatchNorm 
ReLU  ReLU   
LogQuant  LinearQuant  Binarize 
FC  FC  FC 
    BatchNorm 
References
 Abadi et al. (2015) Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
 Audhkhasi et al. (2013) Audhkhasi, Kartik, Osoba, Osonde, and Kosko, Bart. Noise benefits in backpropagation and deep bidirectional pretraining. In Proceedings of The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2013.
 Bishop (1995) Bishop, Christopher M. Training with noise is equivalent to tikhonov regularization. In Neural Computation, pp. 108–116, 1995.
 Bottou & Bousquet (2007) Bottou, Léon and Bousquet, Olivier. The tradeoffs of large scale learning. In Platt, J.C., Koller, D., Singer, Y., and Roweis, S.T. (eds.), Advances in Neural Information Processing Systems 20, pp. 161–168. Curran Associates, Inc., 2007.
 Chetlur et al. (2014) Chetlur, Sharan, Woolley, Cliff, Vandermersch, Philippe, Cohen, Jonathan, Tran, John, Catanzaro, Bryan, and Shelhamer, Evan. cudnn: Efficient primitives for deep learning. In Proceedings of Deep Learning and Representation Learning Workshop: NIPS 2014, 2014.
 Courbariaux & Bengio (2016) Courbariaux, Matthieu and Bengio, Yoshua. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.
 Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and FeiFei, L. ImageNet: A LargeScale Hierarchical Image Database. In CVPR09, 2009.
 Denton et al. (2014) Denton, Emily, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27 (NIPS2014), pp. 1269–1277, 2014.
 Farabet et al. (2010) Farabet, Clément, Martini, Berin, Akselrod, Polina, Talay, Selçuk, LeCun, Yann, and Culurciello, Eugenio. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 257–260. IEEE, 2010.
 Gautschi et al. (2016) Gautschi, Michael, Schaffner, Michael, Gurkaynak, Frank K., and Benini, Luca. A 65nm CMOS 6.4to29.2pJ/FLOP at 0.8V shared logarithmic floating point unit for acceleration of nonlinear function kernels in a tightly coupled processor cluster. In Proceedings of Solid State Circuits Conference  (ISSCC), 2016 IEEE International. IEEE, 2016.
 Gupta et al. (2015) Gupta, Suyog, Agrawal, Ankur, Gopalakrishnan, Kailash, and Narayanan, Pritish. Deep learning with limited numerical precision. In Proceedings of The 32nd International Conference on Machine Learning (ICML2015), pp. 1737–1746, 2015.
 Han et al. (2015a) Han, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
 Han et al. (2015b) Han, Song, Pool, Jeff, Tran, John, and Dally, William. Learning both weights and connections for efficient neural network. In Proceedings of Advances in Neural Information Processing Systems 28 (NIPS2015), pp. 1135–1143, 2015b.
 He et al. (2015) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
 Jia et al. (2014) Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM, 2014.
 Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 1097–1105, 2012.
 Lin et al. (2015) Lin, Zhouhan, Courbariaux, Matthieu, Memisevic, Roland, and Bengio, Yoshua. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
 Neelakantan et al. (2015) Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V., Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol, and Martens, James. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
 Novikov et al. (2015) Novikov, Alexander, Podoprikhin, Dmitry, Osokin, Anton, and Vetrov, Dmitry. Tensorizing neural networks. In Advances in Neural Information Processing Systems 28 (NIPS2015), pp. 442–450, 2015.

 Shin et al. (2016) Shin, Sungho, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point performance analysis of recurrent neural networks. In Proceedings of The 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016). IEEE, 2016.
 Simonyan & Zisserman (2014) Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Sung et al. (2015) Sung, Wonyong, Shin, Sungho, and Hwang, Kyuyeon. Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488, 2015.
 Szegedy et al. (2015) Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In CVPR 2015, 2015.
 Tokui et al. (2015) Tokui, Seiya, Oono, Kenta, Hido, Shohei, and Clayton, Justin. Chainer: a nextgeneration open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twentyninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
 Vanhoucke et al. (2011) Vanhoucke, Vincent, Senior, Andrew, and Mao, Mark Z. Improving the speed of neural networks on cpus. In Proceedings of Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
 Zhang et al. (2015) Zhang, Chen, Li, Peng, Sun, Guangyu, Guan, Yijin, Xiao, Bingjun, and Cong, Jason. Optimizing FPGAbased accelerator design for deep convolutional neural networks. In Proceedings of 23rd International Symposium on FieldProgrammable Gate Arrays (FPGA2015), 2015.