Deep convolutional neural networks (CNN) have demonstrated state-of-the-art performance in image classification Krizhevsky et al. (2012); Simonyan & Zisserman (2014); He et al. (2015) but have steadily grown in computational complexity. For example, the Deep Residual Learning He et al. (2015) set a new record in image classification accuracy at the expense of billion floating-point multiply-and-add operations per forward-pass of an image and MB of memory to store the weights in its -layer network.
For these large networks to run in real-time applications, such as on mobile or embedded platforms, it is often necessary to use low-precision arithmetic and to apply compression techniques. Recently, many researchers have successfully deployed networks that compute using low-precision fixed-point representation Vanhoucke et al. (2011); Abadi et al. (2015) and have successfully trained networks with 16-bit fixed point Gupta et al. (2015). This work in particular builds upon the idea that the algorithm-level noise tolerance of the network can motivate simplifications in hardware complexity.
Other interesting directions include matrix factorization Denton et al. (2014) and tensorification Novikov et al. (2015), which leverage the structure of the fully-connected (FC) layers. Another promising approach is to prune the FC layers before mapping them to sparse matrix-matrix routines on GPUs Han et al. (2015b). However, many of these methods target systems that meet specific criteria, such as networks with many large FC layers or accelerators that efficiently handle sparse matrix-matrix arithmetic. With network architectures currently moving towards deeper stacks of convolutional layers and fewer dense FC layers He et al. (2015); Szegedy et al. (2015), it is difficult to motivate a one-size-fits-all solution to handle these computational and memory demands.
We propose a general method of representing and computing the dot products in a network that can allow networks with minimal constraint on the layer properties to run more efficiently in digital hardware. In this paper we explore the use of communicating activations, storing weights, and computing the atomic dot-products in the binary logarithmic (base-2 logarithmic) domain for both inference and training. The motivations for moving to this domain are the following:
- Training networks with weight decay leads to final weights that are distributed non-uniformly around 0.
- Logarithmic representations can encode data with very large dynamic range in fewer bits than fixed-point representation can Gautschi et al. (2016).
- Data representation in the log-domain is naturally encoded in digital hardware (as shown in Section 4.3).
Our contributions are as follows:
- We show that networks obtain higher classification accuracies with logarithmic quantization than with linear quantization using traditional fixed-point at equivalent resolutions.
- We show that activations are more robust to quantization than weights. This is because the number of activations tends to be larger than the number of weights, which are reused during convolutions.
- We apply our logarithmic data representation to state-of-the-art networks, allowing activations and weights to use only a few bits with almost no loss in classification performance.
- We generalize base-2 arithmetic to handle different bases. In particular, we show that a finer base captures large dynamic ranges of weights and activations while also providing finer precision across the encoded range of values.
- We develop logarithmic backpropagation for efficient training.
2 Related work
Reduced-precision computation. Shin et al. (2016); Sung et al. (2015); Vanhoucke et al. (2011); Han et al. (2015a) analyzed the effects of quantizing the trained weights for inference. For example, Han et al. (2015b) show that convolutional layers in AlexNet Krizhevsky et al. (2012) can be encoded with as few as 5 bits without a significant accuracy penalty. There has also been recent work on training with low-precision arithmetic. Gupta et al. (2015) propose a stochastic rounding scheme to help train networks using 16-bit fixed-point. Lin et al. (2015) propose quantized back-propagation and ternary connect. This method reduces the number of floating-point multiplications by casting these operations into power-of-two multiplies, which are easily realized with bitshifts in digital hardware. They apply this technique to MNIST and CIFAR10 with little loss in performance. However, their method does not eliminate all multiplications end-to-end: at test time, the network uses the learned full-resolution weights for forward propagation. Training with reduced precision is motivated by the idea that high-precision gradient updates are unnecessary for the stochastic optimization of networks Bottou & Bousquet (2007); Bishop (1995); Audhkhasi et al. (2013). In fact, some studies show that gradient noise helps convergence. For example, Neelakantan et al. (2015) empirically find that gradient noise can also encourage faster exploration and annealing of the optimization space, which can help network generalization performance.
Hardware implementations. There have been a few but significant advances in the development of specialized hardware for large networks. For example, Farabet et al. (2010) developed Field-Programmable Gate Array (FPGA) implementations to perform real-time forward propagation. These groups have also performed a comprehensive study of classification performance and energy efficiency as a function of resolution. Zhang et al. (2015) have also explored the design of convolutions in the context of memory versus compute management under the roofline model. Other works focus on specialized, optimized kernels for general-purpose GPUs Chetlur et al. (2014).
3 Concept and Motivation
Each convolutional and fully-connected layer of a network performs matrix operations that distill down to dot products $y = w^T x$, where $x$ is the input, $w$ the weights, and $y$ the activations before being transformed by the non-linearity (e.g. ReLU). Using conventional digital hardware, this operation is performed with multiply-and-add operations in floating- or fixed-point representation, as shown in Figure 1(a). However, this dot product can also be computed in the log-domain, as shown in Figure 1(b,c).
3.1 Proposed Method 1.
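The first proposed method, shown in Figure 1(b), is to transform one operand to its log representation, convert the resulting transformation back to the linear domain, and multiply this by the other operand. This is simply

$$w^T x \simeq \sum_{i=1}^{n} w_i \times 2^{\tilde{x}_i} = \sum_{i=1}^{n} \mathrm{Bitshift}(w_i, \tilde{x}_i), \qquad (1)$$

where $\tilde{x}_i = \mathrm{Quantize}(\log_2(x_i))$, $\mathrm{Quantize}(\bullet)$ quantizes $\bullet$ to an integer, and $\mathrm{Bitshift}(a, b)$ is the function that bitshifts a value $a$ by an integer $b$ in fixed-point arithmetic. In floating-point, this operation is simply an addition of $b$ with the exponent part of $a$. Taking advantage of the $\mathrm{Bitshift}(a, b)$ operator to perform multiplication obviates the need for expensive digital multipliers.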
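As a minimal sketch of Method 1, the snippet below replaces each multiply with a shift, assuming integer (fixed-point) weights and non-negative integer activations. The helper names `log2_quantize`, `bitshift`, and `dot_method1` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math

def log2_quantize(x: int) -> int:
    """Round log2(x) to the nearest integer (x must be positive)."""
    return round(math.log2(x))

def bitshift(a: int, b: int) -> int:
    """Shift a left by b bits, or right by -b bits when b is negative."""
    return a << b if b >= 0 else a >> -b

def dot_method1(weights, activations):
    """Approximate w^T x with one shift per term instead of one multiply."""
    total = 0
    for w, x in zip(weights, activations):
        if x == 0:
            # Zero activations (e.g. after ReLU) contribute nothing.
            continue
        total += bitshift(w, log2_quantize(x))
    return total

weights = [3, -5, 7]
activations = [8, 0, 2]  # exact powers of two, so no quantization error here
print(dot_method1(weights, activations))  # 3*8 + 7*2 = 38
```

When an activation is not a power of two, the term is approximated: an activation of 6 is encoded as $2^3$, so a weight of 3 contributes 24 rather than 18.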
Quantizing the activations and weights in the log-domain ($\log_2(x)$ and $\log_2(w)$) instead of $x$ and $w$ is also motivated by leveraging the structure of the non-uniform distributions of $x$ and $w$. A detailed treatment is shown in the next section. In order to quantize, we propose two hardware-friendly flavors. The first option is to simply floor the input. This method computes $\lfloor \log_2(w) \rfloor$ by returning the position of the first 1 bit seen from the most significant bit (MSB). The second option is to round to the nearest integer, which is more precise than the first option. With the latter option, after computing the integer part, the fractional part is computed in order to determine the rounding direction. This method of rounding is summarized as follows. Pick the $m$ bits that follow the leftmost 1 and consider them as a fixed-point number $F$ with 0 integer bits and $m$ fractional bits. Then, if $F \ge \sqrt{2} - 1$, round the logarithm up to the nearest integer; otherwise round it down. The threshold follows because $\log_2(1 + F) \ge 1/2$ exactly when $1 + F \ge \sqrt{2}$.
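The two quantization flavors can be sketched in software as follows, assuming positive integer inputs. The function names and the default `m = 4` are illustrative assumptions; because only `m` bits of the fraction are inspected, the rounding decision can differ from exact rounding for values very close to the threshold.

```python
import math

def log2_floor(x: int) -> int:
    """Position of the leading 1 bit, i.e. floor(log2(x)) for x > 0."""
    return x.bit_length() - 1

def log2_round(x: int, m: int = 4) -> int:
    """Round log2(x) to the nearest integer using m bits after the leading 1."""
    k = log2_floor(x)
    below = x - (1 << k)  # the bits that follow the leading 1
    # Align those bits to exactly m fractional bits (truncate or zero-pad).
    frac_bits = (below >> (k - m)) if k >= m else (below << (m - k))
    # sqrt(2) - 1 expressed with m fractional bits, rounded up.
    threshold = math.ceil((math.sqrt(2) - 1) * (1 << m))
    return k + 1 if frac_bits >= threshold else k

for value in (5, 6, 22, 23):
    print(value, log2_floor(value), log2_round(value))
```

Both functions use only a leading-one search, shifts, and a comparison, which mirrors the hardware-friendly intent of the description.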
3.2 Proposed Method 2.
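The second proposed method, shown in Figure 1(c), extends the first method to compute dot products in the log-domain for both operands. Additions in the linear domain map to sums of exponentials in the log-domain, and multiplications in the linear domain become log-domain additions. The resulting dot product is

$$w^T x \simeq \sum_{i=1}^{n} 2^{\mathrm{Quantize}(\log_2(w_i)) + \mathrm{Quantize}(\log_2(x_i))} = \sum_{i=1}^{n} \mathrm{Bitshift}(1, \tilde{w}_i + \tilde{x}_i), \qquad (2)$$

where the log-domain weights are $\tilde{w}_i = \mathrm{Quantize}(\log_2(w_i))$ and the log-domain inputs are $\tilde{x}_i = \mathrm{Quantize}(\log_2(x_i))$.
By transforming both the weights and inputs, we compute the original dot product by bitshifting 1 by the integer result $\tilde{w}_i + \tilde{x}_i$ and summing over all $i$.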
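A minimal sketch of Method 2 follows, with both operands encoded as integer exponents. Handling the sign separately and skipping zeros are assumptions made here for illustration, and the names `log_encode` and `dot_method2` are hypothetical.

```python
import math

def log_encode(v: float):
    """Return (sign, exponent) with |v| ~= 2**exponent, or None for zero."""
    if v == 0:
        return None
    return (-1 if v < 0 else 1, round(math.log2(abs(v))))

def dot_method2(weights, activations):
    """Approximate w^T x by adding exponents and shifting 1 by the sum."""
    total = 0.0
    for w, x in zip(weights, activations):
        enc_w, enc_x = log_encode(w), log_encode(x)
        if enc_w is None or enc_x is None:
            continue  # a zero operand contributes nothing
        sign = enc_w[0] * enc_x[0]
        # math.ldexp(1.0, e) returns 1.0 * 2**e, standing in for Bitshift(1, e).
        total += sign * math.ldexp(1.0, enc_w[1] + enc_x[1])
    return total

print(dot_method2([0.5, -4, 2], [8, 2, 0]))  # 0.5*8 - 4*2 = -4.0
```

Each product costs one integer addition of exponents; only the final accumulation happens in the linear domain.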
3.3 Accumulation in log-domain
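Although Fig. 1(b,c) indicates a logarithm-to-linear converter between layers, where the actual accumulation is performed in the linear domain, this accumulation can also be performed in the log-domain using the approximation $\log_2(1 + x) \simeq x$ for $0 \le x < 1$. For example, let $s_n = w_1 x_1 + \ldots + w_n x_n$, $\tilde{s}_n = \log_2(s_n)$, and $\tilde{p}_i = \tilde{w}_i + \tilde{x}_i$. When $n = 2$,

$$\tilde{s}_2 \simeq \log_2\left(\sum_{i=1}^{2} \mathrm{Bitshift}(1, \tilde{p}_i)\right) \simeq \max(\tilde{p}_1, \tilde{p}_2) + \mathrm{Bitshift}(1, -|\tilde{p}_1 - \tilde{p}_2|), \qquad (3)$$

and for $n$ in general,

$$\tilde{s}_n \simeq \max(\tilde{s}_{n-1}, \tilde{p}_n) + \mathrm{Bitshift}(1, -|\lfloor \tilde{s}_{n-1} \rfloor - \tilde{p}_n|). \qquad (4)$$

Note that $\tilde{s}_i$ preserves the fractional part of the word during accumulation. Both accumulation in the linear domain and accumulation in the log-domain have their pros and cons. Accumulation in the linear domain is simpler but requires larger bit widths to accommodate large dynamic range numbers. Accumulation in the log-domain in (3) and (4) appears to be more complicated, but is in fact simply computed using bit-wise operations in digital hardware.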
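The log-domain accumulation can be sketched as a running update that uses $\log_2(2^a + 2^b) = \max(a, b) + \log_2(1 + 2^{-|a - b|}) \simeq \max(a, b) + 2^{-|a - b|}$. The sketch assumes all terms are positive and that the per-term exponents are integers; the name `log_accumulate` is hypothetical.

```python
import math

def log_accumulate(exponents):
    """Approximate log2(sum(2**p for p in exponents)) for integer exponents p."""
    s = float(exponents[0])
    for p in exponents[1:]:
        # Only the integer part of the running sum sets the shift amount.
        gap = abs(math.floor(s) - p)
        s = max(s, p) + math.ldexp(1.0, -gap)  # 2**(-gap) stands in for Bitshift(1, -gap)
    return s

approx = log_accumulate([3, 1])      # 2**3 + 2**1 = 10
exact = math.log2(2 ** 3 + 2 ** 1)
print(approx, exact)                 # 3.25 versus roughly 3.32
```

The running value keeps its fractional part between steps, while the shift amount is always an integer, so each update is a comparison, a subtraction, and a shift.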
4 Experiments of Proposed Methods
Here we evaluate our methods as detailed in Sections 3.1 and 3.2 on the classification task of ILSVRC-2012 Deng et al. (2009) using Chainer Tokui et al. (2015). We evaluate method 1 (Section 3.1) on inference (forward pass) in Section 4.1. Similarly, we evaluate method 2 (Section 3.2) on inference in Sections 4.2 and 4.3. For those experiments, we use published models (AlexNet Krizhevsky et al. (2012), VGG16 Simonyan & Zisserman (2014)) from the Caffe model zoo (Jia et al. (2014)) without any fine-tuning (or extra retraining). Finally, we evaluate method 2 on training in Section 4.4.
4.1 Logarithmic Representation of Activations
|layer||# Weight||# Input|
|layer||# Weight||# Input|
This experiment evaluates the classification accuracy using logarithmic activations and 32b floating point for the weights. In a similar spirit to Gupta et al. (2015), we describe the logarithmic quantization layer LogQuant, which performs an element-wise operation parameterized by a full-scale range and a bit width.
These layers perform the logarithmic quantization and computation as detailed in Section 3.1. Tables 1 and 2 illustrate the addition of these layers to the models. The quantizer has a specified full scale range (FSR), and this range in linear scale is $2^{\mathrm{FSR}}$; we refer to it simply as FSR throughout this paper for notational convenience. The FSR values for each layer are shown in Tables 1 and 2; they are given as a global parameter fsr plus a per-layer offset. This offset is chosen to properly handle the variation of activation ranges from layer to layer, using images from the training set. The fsr is a parameter that is global to the network and is tuned in the experiments to measure the effect of FSR on classification accuracy. The bitwidth is the number of bits required to represent a number after quantization. Note that since we apply quantization after the ReLU function, the input $x$ is 0 or positive, so we use an unsigned format without a sign bit for activations.
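One form of the LogQuant operation that is consistent with this description can be sketched as below. The exact clipping behavior is an assumption made for illustration: exponents at or above `fsr` saturate at `fsr - 1`, and exponents at or below `fsr - 2**bitwidth` flush to zero, so that zero plus the remaining exponents fit in `2**bitwidth` codes. The name `log_quant` is hypothetical.

```python
import math

def log_quant(x: float, bitwidth: int, fsr: int) -> float:
    """Quantize a non-negative value to a power of two inside the full-scale range."""
    if x <= 0:
        return 0.0
    e = round(math.log2(x))
    if e <= fsr - 2 ** bitwidth:
        return 0.0               # too small to represent: flush to zero (assumption)
    e = min(e, fsr - 1)          # saturate at the top of the range (assumption)
    return math.ldexp(1.0, e)    # 2**e

# 3-bit codes with fsr = 4 (illustrative values only)
print([log_quant(v, 3, 4) for v in (0.0, 0.01, 5.0, 100.0)])
```

With these illustrative settings, 5.0 maps to 4.0, 100.0 saturates at 8.0, and 0.01 is flushed to 0.0.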
In order to evaluate our logarithmic representation, we also define an equivalent linear quantization layer for comparison.
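A linear quantizer with the same two parameters might look like the sketch below. The step size `2**(fsr - bitwidth)` and saturation at the largest representable level are assumptions made for illustration, and the name `linear_quant` is hypothetical.

```python
import math

def linear_quant(x: float, bitwidth: int, fsr: int) -> float:
    """Round x to the nearest multiple of a uniform step, then clip to the range."""
    step = 2.0 ** (fsr - bitwidth)            # uniform step size (assumption)
    q = math.floor(x / step + 0.5) * step     # round to the nearest level
    top = 2.0 ** fsr - step                   # largest of the 2**bitwidth levels
    return min(max(q, 0.0), top)

# 3-bit codes with fsr = 4, so the step is 2.0 (illustrative values only)
print([linear_quant(v, 3, 4) for v in (0.4, 3.1, 5.0, 100.0)])
```

Unlike a logarithmic quantizer, every level here is equally spaced, so small inputs such as 0.4 collapse to zero once they fall below half a step.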
Figure 2 illustrates the effect of the quantizer on activations following the conv2_2 layer used in VGG16. The pre-quantized distribution tends to 0 exponentially, and the log-quantized distribution illustrates how the log-encoded activations are spread nearly uniformly across many output bins, which is not the case for linear quantization. Many smaller activation values are more finely represented by log quantization than by linear quantization. The total quantization error, computed over the vectorized activations for either quantizer, is lower for the log-quantized case than for the linear case. This result is illustrated in Figure 3. Using linear quantization with a step size of 1024, we obtain a distribution of quantization errors that is highly concentrated in one region of error magnitudes. However, log quantization at the same resolution as linear results in a significantly lower number of quantization errors in one region of error magnitudes. This comes at the expense of a slight increase in errors in another region. Nonetheless, the total quantization error is lower for log than for linear.
Figure 4 illustrates the results of AlexNet. Using only 3 bits to represent the activations for both logarithmic and linear quantizations, the top-5 accuracy is still very close to that of the original, unquantized model encoded at floating-point 32b. However, logarithmic representations tolerate a large dynamic range of s. For example, using 4b , we can obtain order of magnitude variations in the full scale without a significant loss of top-5 accuracy. We see similar results for VGG16 as shown in Figure 5. Table 3 lists the classification accuracies with the optimal s for each case. There are some interesting observations. First, b performs worse than b linear for AlexNet but better for VGG16, which is a higher capacity network than AlexNet. Second, by encoding the activations in b , we achieve the same top-5 accuracy compared to that achieved by b linear for VGG16. Third, with b , there is no loss in top-5 accuracy from the original float32 representation.
4.2 Logarithmic Representation of Weights of Fully Connected Layers
The FC weights are quantized using the same strategies as those in Section 4.1, except that they have a sign bit. We evaluate the classification performance using log data representation for both FC weights and activations jointly, using method 2 in Section 3.2. For comparison, we use linear quantization for FC weights and log quantization for activations as a reference. For both methods, we use the optimal log quantization for activations that was determined in Section 4.1.
Table 4 compares the mentioned approaches along with floating point. We observe a small win for log over linear for AlexNet but a decrease for VGG16. Nonetheless, log computation is performed without the use of multipliers.
|Model||Float 32b||Log. 4b||Linear 4b|
An added benefit to quantization is a reduction of the model size. By quantizing down to b including sign bit, we compress the FC weights for free significantly from Gb to Gb for AlexNet and Gb to Gb for VGG16. This is because the dense FC layers occupy and of the total model size for AlexNet and VGG16 respectively.
4.3 Logarithmic Representation of Weights of Convolutional Layers
We now represent the convolutional layers using the same procedure. We keep the representation of activations at b and the representation of weights of FC layers at b , and compare our method with the linear reference and ideal floating point. We also perform the dot products using two different bases: . Note that there is no additional overhead for base- as it is computed with the same equation shown in Equation 4.
Table 5 shows the classification results. The results show a drop in performance from floating point down to 5b base-2, but only a relatively minor drop for 5b with the finer base. Both 5b formats include the sign bit. There are also some important observations here.
|32b||5b||Log 5b||Log 5b|
We first observe that the weights of the convolutional layers for AlexNet and VGG16 are more sensitive to quantization than are FC weights. Each FC weight is used only once per image (batch size of 1), whereas convolutional weights are reused many times across the layer's input activation map. Because of this, the quantization error of each weight influences the dot products across the entire activation volume. Second, we observe that by moving from 5b base-2 to a finer granularity at the same 5b width, we allow the network to 1) be robust to quantization errors and degradation in classification performance and 2) retain the practical features of log-domain arithmetic.
The distributions of quantization errors for both 5b representations are shown in Figure 6. The total quantization error on the vectorized weights is smaller for the finer base than for base-2.
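The effect of a finer logarithmic base can be illustrated with a small sketch. Here the base is written as $2^{1/k}$, and $k = 2$ is used purely as an illustrative assumption for a finer base; the name `log_quant_base` and the parameter `steps_per_octave` are hypothetical.

```python
import math

def log_quant_base(v: float, steps_per_octave: int) -> float:
    """Quantize v > 0 to the nearest power of the base 2**(1/steps_per_octave)."""
    code = round(steps_per_octave * math.log2(v))  # integer code stored in hardware
    return 2.0 ** (code / steps_per_octave)

value = 3.0
coarse = log_quant_base(value, 1)  # base 2: nearest power of two
fine = log_quant_base(value, 2)    # base 2**(1/2): two levels per octave
print(coarse, abs(coarse - value))
print(fine, abs(fine - value))
```

For the value 3.0, base 2 gives 4.0 while the finer base gives about 2.83, so the error shrinks. The stored code remains an integer, and products are still formed by adding codes, which is why the finer base keeps the practical features of log-domain arithmetic.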
4.4 Training with Logarithmic Representation
We incorporate representation during the training phase. This entire algorithm can be computed using Method 2 in Section 3.2. Table 6 illustrates the networks that we compare. The proposed and linear networks are trained at the same resolution using 4-bit unsigned activations and -bit signed weights and gradients using Algorithm 1 on the CIFAR10 dataset with simple data augmentation described in He et al. (2015). Note that unlike BinaryNet Courbariaux & Bengio (2016), we quantize the backpropagated gradients to train -net. This enables end-to-end training using logarithmic representation at the -bit level. For linear quantization however, we found it necessary to keep the gradients in its unquantized floating-point precision form in order to achieve good convergence. Furthermore, we include the training curve for BinaryNet, which uses unquantized gradients.
Fig. 7 illustrates the training results of , linear, and BinaryNet. Final test accuracies for -b, linear-b, and BinaryNet are , , respectively where linear-b and BinaryNet use unquantized gradients. The test results indicate that even with quantized gradients, our proposed network with representation still outperforms the others that use unquantized gradients.
5 Conclusion
In this paper, we describe a method to represent the weights and activations with low resolution in the log-domain, which eliminates bulky digital multipliers. This method is also motivated by the non-uniform distributions of weights and activations, which make log representation more robust to quantization than linear representation. We evaluate our methods on the classification task of ILSVRC-2012 using pretrained models (AlexNet and VGG16). We also offer extensions that incorporate end-to-end training using log representation, including for gradients.
- Abadi et al. (2015) Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
- Audhkhasi et al. (2013) Audhkhasi, Kartik, Osoba, Osonde, and Kosko, Bart. Noise benefits in backpropagation and deep bidirectional pre-training. In Proceedings of The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2013.
- Bishop (1995) Bishop, Christopher M. Training with noise is equivalent to Tikhonov regularization. In Neural Computation, pp. 108–116, 1995.
- Bottou & Bousquet (2007) Bottou, Léon and Bousquet, Olivier. The tradeoffs of large scale learning. In Platt, J.C., Koller, D., Singer, Y., and Roweis, S.T. (eds.), Advances in Neural Information Processing Systems 20, pp. 161–168. Curran Associates, Inc., 2007.
- Chetlur et al. (2014) Chetlur, Sharan, Woolley, Cliff, Vandermersch, Philippe, Cohen, Jonathan, Tran, John, Catanzaro, Bryan, and Shelhamer, Evan. cudnn: Efficient primitives for deep learning. In Proceedings of Deep Learning and Representation Learning Workshop: NIPS 2014, 2014.
- Courbariaux & Bengio (2016) Courbariaux, Matthieu and Bengio, Yoshua. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- Denton et al. (2014) Denton, Emily, Zaremba, Wojciech, Bruna, Joan, LeCun, Yann, and Fergus, Rob. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27 (NIPS2014), pp. 1269–1277, 2014.
- Farabet et al. (2010) Farabet, Clément, Martini, Berin, Akselrod, Polina, Talay, Selçuk, LeCun, Yann, and Culurciello, Eugenio. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 257–260. IEEE, 2010.
- Gautschi et al. (2016) Gautschi, Michael, Schaffner, Michael, Gurkaynak, Frank K., and Benini, Luca. A 65nm CMOS 6.4-to-29.2pJ/FLOP at 0.8V shared logarithmic floating point unit for acceleration of nonlinear function kernels in a tightly coupled processor cluster. In Proceedings of Solid-State Circuits Conference (ISSCC), 2016 IEEE International. IEEE, 2016.
- Gupta et al. (2015) Gupta, Suyog, Agrawal, Ankur, Gopalakrishnan, Kailash, and Narayanan, Pritish. Deep learning with limited numerical precision. In Proceedings of The 32nd International Conference on Machine Learning (ICML2015), pp. 1737–1746, 2015.
- Han et al. (2015a) Han, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
- Han et al. (2015b) Han, Song, Pool, Jeff, Tran, John, and Dally, William. Learning both weights and connections for efficient neural network. In Proceedings of Advances in Neural Information Processing Systems 28 (NIPS2015), pp. 1135–1143, 2015b.
- He et al. (2015) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- Jia et al. (2014) Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM, 2014.
- Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 1097–1105, 2012.
- Lin et al. (2015) Lin, Zhouhan, Courbariaux, Matthieu, Memisevic, Roland, and Bengio, Yoshua. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
- Neelakantan et al. (2015) Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V., Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol, and Martens, James. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
- Novikov et al. (2015) Novikov, Alexander, Podoprikhin, Dmitry, Osokin, Anton, and Vetrov, Dmitry. Tensorizing neural networks. In Advances in Neural Information Processing Systems 28 (NIPS2015), pp. 442–450, 2015.
- Shin et al. (2016) Shin, Sungho, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point performance analysis of recurrent neural networks. In Proceedings of The 41st IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP2016). IEEE, 2016.
- Simonyan & Zisserman (2014) Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Sung et al. (2015) Sung, Wonyong, Shin, Sungho, and Hwang, Kyuyeon. Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488, 2015.
- Szegedy et al. (2015) Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In CVPR 2015, 2015.
- Tokui et al. (2015) Tokui, Seiya, Oono, Kenta, Hido, Shohei, and Clayton, Justin. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
- Vanhoucke et al. (2011) Vanhoucke, Vincent, Senior, Andrew, and Mao, Mark Z. Improving the speed of neural networks on cpus. In Proceedings of Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
- Zhang et al. (2015) Zhang, Chen, Li, Peng, Sun, Guangyu, Guan, Yijin, Xiao, Bingjun, and Cong, Jason. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of 23rd International Symposium on Field-Programmable Gate Arrays (FPGA2015), 2015.