1 Introduction
The annually held ILSVRC competition has seen state-of-the-art classification accuracies from deep networks such as AlexNet by Krizhevsky et al. (2012), VGG by Simonyan & Zisserman (2015), GoogLeNet (Szegedy et al., 2015), and ResNet (He et al., 2015). These networks contain millions of parameters and require billions of arithmetic operations.
Various solutions have been offered to reduce the resource requirements of CNNs. Fixed point arithmetic is less resource-hungry than floating point. Moreover, it has been shown that fixed point arithmetic is adequate for neural network computation (Hammerstrom, 1990). This observation has been leveraged recently to condense deep CNNs. Gupta et al. (2015) show that networks on datasets like CIFAR-10 (10 image classes) can be trained in 16-bit. Further trimming of the same network uses multipliers as small as 7-bit (Courbariaux et al., 2014). Another approach by Courbariaux et al. (2016) uses binary weights and activations, again on the same network.
The complexity of deep CNNs can be split into two parts. First, the convolutional layers contain more than 90% of the required arithmetic operations. By turning these floating point operations into operations with small fixed point numbers, both the chip area and the energy consumption can be significantly reduced. The second resource-intensive layer type is the fully connected layer, which contains over 90% of the network parameters. As a welcome by-product of using bit-width-reduced fixed point numbers, the data transfer to off-chip memory is reduced for fully connected layers. In this paper, we concentrate on approximating convolutional and fully connected layers only. Using fixed point arithmetic is a hardware-friendly way of approximating CNNs: it allows the use of smaller processing elements and reduces the memory requirements without adding any computational overhead such as decompression.
Even though it has been shown that CNNs perform well with small fixed point numbers, there exists no thorough investigation of the delicate trade-off between bit-width reduction and accuracy loss. In this paper we present Ristretto, which automatically finds a balance between bit-width reduction and a given maximum error tolerance. Ristretto performs a fast and fully automated trimming analysis of any given network. This post-training tool can be used for application-specific trimming of neural networks.
2 Mixed Fixed Point Precision
In the next two sections we discuss quantization of a floating point CNN to fixed point. Moreover, we explain dynamic fixed point, and show how it can be used to further decrease network size while maintaining the classification accuracy.
The data path of fully connected and convolutional layers consists of a series of MAC operations (multiplication and accumulation), as shown in Figure 1. The layer activations are multiplied with the network weights, and the results are accumulated to form the output. As shown by Qiu et al. (2016), it is a good approach to use mixed precision, i.e., different parts of a CNN using different bit-widths.
In Figure 1, b_out and b_w refer to the number of bits for layer outputs and layer weights, respectively. Multiplication results are accumulated using an adder tree which gets thicker towards the end. The adder outputs in the first level are b_out + b_w + 1 bits wide, and the bit-width grows by 1 bit in each level. In the last level, the bit-width is b_out + b_w + ceil(log2(n)), where n is the number of multiplication operations per output value. In the last stage, the bias is added to form the layer output. For each network layer, we need to find the right balance between reducing the bit-widths (b_out and b_w) and maintaining a good classification accuracy.
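The growth of the accumulator bit-width through the adder tree can be sketched as follows. This is our own illustration, assuming a balanced binary adder tree; the variable names are not from Ristretto:

```python
import math

def accumulator_bitwidth(bw_out, bw_w, n_mults):
    """Bit-width needed to accumulate n_mults products without overflow.

    A bw_out-bit activation times a bw_w-bit weight yields a
    (bw_out + bw_w)-bit product; each level of a balanced adder tree
    adds one bit, and a tree over n_mults inputs has
    ceil(log2(n_mults)) levels.
    """
    product_bits = bw_out + bw_w
    levels = math.ceil(math.log2(n_mults))
    return product_bits + levels

# e.g. 8-bit activations, 8-bit weights, a 3x3x256 convolution kernel:
# 8 + 8 + ceil(log2(2304)) = 16 + 12 = 28 bits
```

This illustrates why small input bit-widths pay off in hardware: the accumulator width, and hence the adder area, scales directly with b_out + b_w.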
3 Dynamic Fixed Point
The different parts of a CNN have a significant dynamic range. In large layers, the outputs are the result of thousands of accumulations, thus the network parameters are much smaller than the layer outputs. Fixed point has only limited capability to cover a wide dynamic range. Dynamic fixed point (Williamson, 1991; Courbariaux et al., 2014) is a solution to this problem.
In dynamic fixed point, each number is represented as follows:

(-1)^s * 2^(-fl) * sum_{i=0}^{B-2} 2^i * x_i

Here B denotes the bit-width, s the sign bit, fl the fractional length, and x_i the mantissa bits. The intermediate values in a network have different ranges. Therefore, it is desirable to assign fixed point numbers to groups with a constant fl, such that the number of bits allocated to the fractional part is the same within each group. Each network layer is split into two groups: one for the layer outputs and one for the layer weights. This makes it possible to better cover the dynamic range of both layer outputs and weights, as weights are normally significantly smaller. On the hardware side, dynamic fixed point arithmetic can be realized using bit shifters.
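The scheme above can be sketched in NumPy. The function names, the saturating overflow behavior, and the range-analysis heuristic are our own illustration, not Ristretto's actual code:

```python
import numpy as np

def quantize_dfp(x, bit_width, frac_len):
    """Quantize an array to dynamic fixed point with round-to-nearest.

    bit_width includes the sign bit; frac_len is the per-group
    fractional length, so the quantization step is 2**-frac_len.
    """
    step = 2.0 ** -frac_len
    # Representable range of a signed bit_width-bit number at this scale.
    max_val = (2 ** (bit_width - 1) - 1) * step
    min_val = -(2 ** (bit_width - 1)) * step
    q = np.round(x / step) * step          # round to nearest grid point
    return np.clip(q, min_val, max_val)    # saturate on overflow

def choose_frac_len(x, bit_width):
    """Assumed heuristic: spend just enough integer bits (plus sign) to
    cover the largest magnitude in the group, and give the rest to the
    fractional part."""
    largest = np.max(np.abs(x))
    int_bits = int(np.ceil(np.log2(largest))) + 1  # +1 for the sign bit
    return bit_width - int_bits
```

Weights and layer outputs would each get their own frac_len, which is exactly the per-group constant fl described above.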
Different hardware accelerators for deployment of neural networks have been proposed (Motamedi et al., 2016; Qiu et al., 2016; Han et al., 2016a). The first important step in accelerator design is the compression of the network in question. In the next section we present Ristretto, a tool which can condense any neural network in a fast and automated fashion.
4 Ristretto: Approximation Framework in Caffe
From Caffe to Ristretto
According to Wikipedia, Ristretto is 'a short shot of espresso coffee made with the normal amount of ground coffee but extracted with about half the amount of water'. Similarly, our compressor removes the unnecessary parts of a CNN, while making sure the essence – the ability to predict image classes – is preserved. With its strong community and fast training for deep CNNs, Caffe (Jia et al., 2014) is an excellent framework to build on. Ristretto takes a trained model as input, and automatically brews a condensed network version. Input and output of Ristretto are a network description file (prototxt) and the network parameters. Optionally, the quantized network can be fine-tuned with Ristretto. The resulting fixed point model in Caffe format can then be used by a hardware accelerator.
Quantization flow
Ristretto’s quantization flow has five stages (Figure 2) to compress a floating point network into fixed point.
In the first step, the dynamic range of the weights is analyzed to find a good fixed point representation. For the quantization from floating point to fixed point, we use round-to-nearest.

The second step runs several thousand images through the forward path. The generated layer activations are analyzed to gather statistical parameters, and Ristretto uses enough bits in the integer part of the fixed point numbers to avoid saturation of the layer activations.

Next, Ristretto performs a binary search to find the optimal number of bits for convolutional weights, fully connected weights, and layer outputs. In this step, one network part is quantized while the rest remains in floating point. Since the three network parts should use independent bit-widths (weights of convolutional layers, weights of fully connected layers, and layer outputs), iteratively quantizing one part at a time allows us to find the optimal bit-width for each part. Once a good trade-off between small number representation and classification accuracy is found, the resulting fixed point network is retrained.
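The binary search over bit-widths can be sketched as follows. `evaluate_accuracy` is a hypothetical callback that quantizes one network part to the given bit-width (the rest stays floating point) and returns validation accuracy; we also assume accuracy is monotonic in the bit-width, which is what makes binary search applicable:

```python
def smallest_bitwidth(evaluate_accuracy, baseline, max_drop, lo=2, hi=16):
    """Binary-search the smallest bit-width whose validation accuracy
    stays within max_drop of the floating point baseline.
    """
    best = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        if evaluate_accuracy(mid) >= baseline - max_drop:
            best = mid          # mid bits are enough; try fewer
            hi = mid - 1
        else:
            lo = mid + 1        # too much accuracy loss; need more bits
    return best
```

Running this once per network part (convolutional weights, fully connected weights, layer outputs) yields one bit-width per part, matching the mixed-precision scheme of Section 2.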
Finetuning
In order to make up for the accuracy drop incurred by quantization, the fixed point network is fine-tuned in Ristretto. During this retraining procedure, the network learns how to classify images with fixed point parameters. Since the network weights can only take discrete values, the main challenge lies in the weight update. We adopt the idea of previous work (Courbariaux et al., 2015) which uses full precision shadow weights. Small weight updates are applied to the full precision weights w, whereas the discrete weights w' are sampled from the full precision weights. The sampling during fine-tuning is done with stochastic rounding. This rounding scheme was successfully used by Gupta et al. (2015) for weight updates of 16-bit fixed point networks.

Ristretto uses the fine-tuning procedure illustrated in Figure 3. For each batch, the full precision weights are quantized to fixed point. During forward propagation, these discrete weights are used to compute the layer outputs: each layer turns its input batch into an output according to its layer function, and, assuming the last layer computes the loss, the composition of all layer functions forms the overall CNN function.
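Stochastic rounding as used for the weight sampling can be sketched in NumPy (our own illustration, not Ristretto's code): the probability of rounding up equals the fractional remainder, so the rounding is unbiased in expectation.

```python
import numpy as np

def stochastic_round(x, frac_len, rng):
    """Round x to multiples of 2**-frac_len stochastically.

    P(round up) equals the fractional remainder after scaling, so
    E[stochastic_round(x)] == x, unlike round-to-nearest.
    """
    scaled = x * (2.0 ** frac_len)
    floor = np.floor(scaled)
    remainder = scaled - floor
    round_up = rng.random(np.shape(x)) < remainder
    return (floor + round_up) * (2.0 ** -frac_len)
```

Because the rounding is unbiased, many small gradient updates to the shadow weights eventually push the sampled discrete weights to their best values, even when a single update is far smaller than the quantization step.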
The goal of back propagation is to compute the error gradient with respect to each fixed point parameter. For parameter updates we use the Adam rule by Kingma & Ba (2015). As an important design choice, we do not quantize layer outputs to fixed point during fine-tuning; we use floating point layer outputs instead, which enables Ristretto to analytically compute the error gradient with respect to each parameter. In contrast, the validation of the network is done with fixed point layer outputs.

To achieve the best fine-tuning results, we used a learning rate one order of magnitude lower than that of the last full precision training iteration. Since the choice of hyperparameters for retraining is crucial (Bergstra & Bengio, 2012), Ristretto requires minimal human intervention in this step.
Fast finetuning with fixed point parameters
Ristretto brews a condensed network with fixed point weights and fixed point layer activations. For simulation of the forward propagation in hardware, Ristretto uses full floating point for accumulation. This follows the approach of Gupta et al. (2015) and conforms to our description of the forward data path in hardware (Figure 2). During fine-tuning, the full precision weights need to be converted to fixed point for each batch, but after that all computation can be done in floating point (Figure 3). Therefore Ristretto can fully leverage optimized matrix-matrix multiplication routines for both forward and backward propagation. Thanks to this fast GPU implementation, a fixed point CaffeNet can be tested on the ILSVRC 2014 validation dataset (50k images) in less than 2 minutes (using one Tesla K40 GPU).
5 Results
In this section we present the results of approximating 32-bit floating point networks by condensed fixed point models. All classification accuracies were obtained by running the respective network on the whole validation dataset. We present approximation results of Ristretto for five different networks. First, we consider LeNet (LeCun et al., 1998), which can classify handwritten digits (MNIST dataset). Second, the Full CIFAR-10 model provided by Caffe is used to classify images into 10 different classes. Third, we condense CaffeNet, the Caffe version of AlexNet, which classifies images into the 1000 ImageNet categories. Fourth, we use the BVLC version of GoogLeNet (Szegedy et al., 2015) to classify images of the same dataset. Finally, we approximate SqueezeNet (Iandola et al., 2016), a recently proposed architecture with the classification accuracy of AlexNet, but 50X fewer parameters.

Impact of dynamic fixed point
We used Ristretto to quantize CaffeNet (AlexNet) into fixed point, and compare traditional (static) fixed point with dynamic fixed point. To allow a simpler comparison, all layer outputs and network parameters share the same bit-width. Results show good performance of static fixed point down to 18-bit (Figure 4). However, when the bit-width is reduced further, the accuracy of static fixed point starts to drop significantly, while dynamic fixed point retains a stable accuracy.
We can conclude that dynamic fixed point performs significantly better for such a large network. With dynamic fixed point, we can adapt the number of bits allocated to integer and fractional part, according to the dynamic range of different parts of the network. We will therefore concentrate on dynamic fixed point for the subsequent experiments.
Quantization of individual network parts
In this section, we analyze the impact of quantization on different parts of a floating point CNN.
Table 1 shows the classification accuracy when the layer outputs, the convolution kernels, or the parameters of fully connected layers are quantized to dynamic fixed point.
In all three nets, the convolution kernels and layer activations can be trimmed to 8-bit with an absolute accuracy change of only 0.3%. Fully connected layers are more affected by trimming to 8-bit weights; the absolute change is at most 0.9%. Interestingly, LeNet weights can be trimmed to as low as 2-bit, with an absolute accuracy change below 0.4%.
Table 1: Classification accuracy with one network part quantized to dynamic fixed point.

Fixed point bit-width     16-bit   8-bit   4-bit   2-bit

LeNet, 32-bit floating point accuracy: 99.1%
Layer output              99.1%    99.1%   98.9%   85.9%
CONV parameters           99.1%    99.1%   99.1%   98.9%
FC parameters             99.1%    99.1%   98.9%   98.7%

Full CIFAR-10, 32-bit floating point accuracy: 81.7%
Layer output              81.6%    81.6%   79.6%   48.0%
CONV parameters           81.7%    81.4%   75.9%   19.1%
FC parameters             81.7%    80.8%   79.9%   77.5%

CaffeNet top-1, 32-bit floating point accuracy: 56.9%
Layer output              56.8%    56.7%    6.0%    0.1%
CONV parameters           56.9%    56.7%    0.1%    0.1%
FC parameters             56.9%    56.3%    0.1%    0.1%
Finetuning of all considered network parts
Here we report the accuracy of five networks that were condensed and fine-tuned with Ristretto. All networks use dynamic fixed point parameters as well as dynamic fixed point layer outputs for convolutional and fully connected layers. LeNet performs well in 2/4-bit, while CIFAR-10 and the three ImageNet CNNs can be trimmed to 8-bit (see Table 2). Surprisingly, these compressed networks still perform nearly as well as their floating point baselines. The relative accuracy drops of LeNet, CIFAR-10 and SqueezeNet are very small (0.6%), whereas the approximation of the larger CaffeNet and GoogLeNet incurs a slightly higher cost (0.9% and 2.3%, respectively). We hope to further improve the fine-tuning results of these larger networks in the future.
The SqueezeNet architecture was developed by Iandola et al. (2016) with the goal of a small CNN that performs well on the ImageNet data set. Ristretto can make the already small network even smaller, so that its parameter size is less than 2 MB. This condensed network is wellsuited for deployment in smart mobile systems.
All five 32-bit floating point networks can be approximated well in 8-bit and 4-bit fixed point. For a hardware implementation, this reduces the size of the multiplication units by about one order of magnitude. Moreover, the required memory bandwidth is reduced by 4–8X. Finally, it allows 4–8X more parameters to be held in on-chip buffers. The code for reproducing the quantization and fine-tuning results is available at https://github.com/pmgysel/caffe.
Table 2: Fine-tuning results of the five considered networks.

Network            Layer     CONV        FC          32-bit floating    Fixed point
                   outputs   parameters  parameters  point baseline     accuracy

LeNet (Exp 1)      4-bit     4-bit       4-bit       99.1%              99.0% (98.7%)
LeNet (Exp 2)      4-bit     2-bit       2-bit       99.1%              98.8% (98.0%)
Full CIFAR-10      8-bit     8-bit       8-bit       81.7%              81.4% (80.6%)
SqueezeNet top-1   8-bit     8-bit       8-bit       57.7%              57.1% (55.2%)
CaffeNet top-1     8-bit     8-bit       8-bit       56.9%              56.0% (55.8%)
GoogLeNet top-1    8-bit     8-bit       8-bit       68.9%              66.6% (66.1%)
A previous work by Courbariaux et al. (2014) concentrates on training with limited numerical precision. They can train a dynamic fixed point network on the MNIST dataset using just 7 bits to represent activations and weights. Ristretto does not reduce the resource requirements for training, but concentrates on inference instead. Ristretto can produce a LeNet network with 2-bit parameters and 4-bit activations. Our approach differs in that we train with high numerical precision, then quantize to fixed point, and finally fine-tune the fixed point network.
Other works (Courbariaux et al., 2016; Rastegari et al., 2016) can reduce the bit-width even further, to as low as 1-bit, using more advanced number encodings than dynamic fixed point. Ristretto's strength lies in its capability to approximate a large number of existing floating point models on challenging datasets. For the five considered networks, Ristretto can quantize activations and weights to 8-bit or lower, at an accuracy drop below 2.3% compared to the floating point baseline.
While more sophisticated data compression schemes could be used to achieve a higher reduction in network size, our approach is very hardware-friendly and imposes no additional overhead such as decompression.
6 Conclusion and Future Work
In this work we presented Ristretto, a Caffe-based approximation framework for deep convolutional neural networks. The framework reduces the memory requirements, the area for processing elements, and the overall power consumption of hardware accelerators. A large network like CaffeNet can be quantized to 8-bit for both weights and layer outputs while keeping the network's accuracy change below 1% compared to its 32-bit floating point counterpart. Ristretto is both fast and automated, and we release the code as an open source project.
Ristretto is in its first development stage. We consider adding new features in the future: 1. Shared weights: fetching codebook indices from off-chip memory, instead of real values (Han et al., 2016b). 2. Network pruning, as shown by the same authors. 3. Network binarization, as shown by Courbariaux et al. (2016) and Rastegari et al. (2016). These additional features will help to reduce the bit-width even further, and to reduce the computational complexity of trimmed networks.

References

Bergstra & Bengio (2012) Bergstra, J. and Bengio, Y. Random Search for Hyper-Parameter Optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012.
Courbariaux et al. (2014) Courbariaux, M., David, J.-P., and Bengio, Y. Training Deep Neural Networks with Low Precision Multiplications. arXiv preprint arXiv:1412.7024, 2014.
 Courbariaux et al. (2015) Courbariaux, M., Bengio, Y., and David, J.P. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3105–3113, 2015.
 Courbariaux et al. (2016) Courbariaux, M., Hubara, I., Soudry, D., ElYaniv, R., and Bengio, Y. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or 1. arXiv preprint arXiv:1602.02830, 2016.
 Gupta et al. (2015) Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep Learning with Limited Numerical Precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pp. 1737–1746, 2015.
 Hammerstrom (1990) Hammerstrom, D. A VLSI Architecture for HighPerformance, LowCost, Onchip Learning. In IJCNN International Joint Conference on Neural Networks, 1990, pp. 537–544. IEEE, 1990.
 Han et al. (2016a) Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. EIE: Efficient Inference Engine on Compressed Deep Neural Network. arXiv preprint arXiv:1602.01528, 2016a.
 Han et al. (2016b) Han, S., Mao, H., and Dally, W. J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In International Conference on Learning Representations, 2016b.
 He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385, 2015.
 Iandola et al. (2016) Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J., and Keutzer, K. SqueezeNet: AlexNetlevel accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016.
 Jia et al. (2014) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the ACM International Conference on Multimedia, pp. 675–678. ACM, 2014.
 Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. GradientBased Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Motamedi et al. (2016) Motamedi, M., Gysel, P., Akella, V., and Ghiasi, S. Design Space Exploration of FPGABased Deep Convolutional Neural Networks. In 2016 21st Asia and South Pacific Design Automation Conference (ASPDAC), pp. 575–580. IEEE, 2016.
 Qiu et al. (2016) Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu, J., Tang, T., Xu, N., Song, S., Wang, Y., and Yang, H. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 26–35, 2016.
 Rastegari et al. (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNORNet: ImageNet Classification Using Binary Convolutional Neural Networks. arXiv preprint arXiv:1603.05279, 2016.
 Simonyan & Zisserman (2015) Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for LargeScale Image Recognition. In International Conference on Learning Representations, 2015.

Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
Williamson (1991) Williamson, D. Dynamically Scaled Fixed Point Arithmetic. In IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 315–318. IEEE, 1991.