Today, deep neural networks (DNNs) are applied to many machine learning tasks with great success. They are suited to many problems in image recognition, segmentation and natural language processing [1, 2], delivering state-of-the-art results. Their success is mainly based on improved training methods, on their ability to learn complex relationships directly from raw input data, and on the increasing computational power that makes it possible to process large datasets and to train large models with up to millions of parameters. Despite their success, DNNs are poorly suited to mobile applications or embedded devices because they are computationally complex and need a large memory to store the huge number of parameters. Network reduction methods address this problem and reduce the computational and memory complexity of DNNs without significant performance degradation.
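Consider a generic dense feed-forward network with $L$ layers that computes
$$\mathbf{a}_l = \mathbf{W}_l \mathbf{x}_{l-1} + \mathbf{b}_l, \qquad \mathbf{x}_l = f(\mathbf{a}_l),$$
where $l = 1, \dots, L$ is the layer index and $\mathbf{x}_{l-1} \in \mathbb{R}^{M_{l-1}}$, $\mathbf{a}_l \in \mathbb{R}^{M_l}$ and $\mathbf{x}_l \in \mathbb{R}^{M_l}$ are the input, activation and output of a layer, respectively.
The memory complexity of layer $l$ is $(M_l M_{l-1} + M_l) B$ bits to store the elements of the weight matrix $\mathbf{W}_l \in \mathbb{R}^{M_l \times M_{l-1}}$ and the bias vector $\mathbf{b}_l \in \mathbb{R}^{M_l}$ using a word length of $B$ bits. The parameters are typically stored in floating point (float32) format with $B = 32$. The computational complexity consists of $M_l M_{l-1}$ multiplication and accumulation (MAC) operations.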
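The cost model above can be checked with a few lines of Python. This is only a sketch: the function name `dense_layer_cost` and the layer size (784 inputs, 256 outputs) are hypothetical and chosen for illustration, not taken from the networks used later.

```python
def dense_layer_cost(n_in, n_out, word_length_bits=32):
    """Return (memory in bits, number of MACs) for one dense layer."""
    n_params = n_out * n_in + n_out          # weight matrix plus bias vector
    memory_bits = n_params * word_length_bits
    macs = n_out * n_in                      # one MAC per weight
    return memory_bits, macs

# Hypothetical layer size, used only to illustrate the formulas.
mem_float32, macs = dense_layer_cost(784, 256, word_length_bits=32)
mem_int8, _ = dense_layer_cost(784, 256, word_length_bits=8)
print(mem_float32 // 8, "bytes in float32,", mem_int8 // 8, "bytes with 8 bit words,", macs, "MACs")
```

Reducing the word length changes only the memory term; the number of MACs stays the same, but each MAC becomes cheaper.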
One approach to reduce the computational and memory complexity of trained DNNs is to reduce the number of parameters in the network. Examples are factorization or pruning methods, which either exploit the low-rank or the sparse structure of the weight matrices [3, 4, 5, 6, 7].
Another approach is to quantize all parameters of a DNN, i.e. to map each parameter to the nearest element of a small set of values. Prior works demonstrated that clustering techniques can be used to find such a small set of values that approximates the parameters of a trained DNN with high accuracy. DNNs proved to be very robust to quantization, meaning that only a few cluster centers are needed to represent the parameters of a DNN. Because only the cluster centers must be stored in float32 and because the cluster index of each parameter can be encoded with only a few bits, quantization considerably reduces the memory complexity of trained DNNs. Others used fixed point quantization to reduce the memory complexity. However, during inference the network is still evaluated with floating point arithmetics.
Vanhoucke et al. proposed a method to speed up DNN inference using quantization and a combination of fixed point and float32 arithmetics. We propose a similar, but new method to evaluate DNNs which uses only low precision integer arithmetics, binary shift and clipping operations and does not need any float32 calculations.
We apply uniform quantization to convert the float32 inputs and parameters of a network to low precision integer values. The activation of each network layer is evaluated using only low precision integer calculations. For networks with arbitrary activation functions, the layer activation must be converted back to float32 to evaluate the activation function, resulting in mixed integer/float32 computations. For networks with relu activation functions, we demonstrate that the network can be evaluated using only integer arithmetics, which are much simpler to implement in hardware. Therefore, our method is key to an efficient implementation of DNNs on dedicated hardware. Special hardware can, for example, perform multiple low precision integer operations in the same time as one floating point operation. For example, NVIDIA's recent Pascal architecture can perform 4 int8 MACs in the same time as a single float32 multiplication. This means that a DNN can be evaluated up to 4 times faster.
The contributions of this paper are: 1) We propose a new method to evaluate DNNs that uses uniform quantization and only needs low precision integer arithmetics, binary shift and clipping operations. 2) We discuss how training with regularization influences the distribution of the parameters of a DNN and thus the performance of the quantized DNN.
2 Uniform quantization
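Consider a uniform quantization function $Q(x)$ that maps a value $x \in \mathbb{R}$ to an integer value of word length $B$. As shown in Fig. 1, we distinguish between quantization of signed and unsigned values.
For signed values $x$, we use a quantization function $Q_s$ to map real values from the original value range $[-x_{\max}, x_{\max}]$ to a set of signed integer values that are symmetrically distributed around $0$, i.e. $\{-(2^{B-1}-1), \dots, 2^{B-1}-1\}$. We define the signed quantization function as
$$Q_s(x) = \mathrm{clip}_s\!\left(\mathrm{round}\!\left(\frac{x}{\Delta}\right)\right), \qquad \Delta = \frac{x_{\max}}{2^{B-1}-1},$$
where $\Delta$ is the step width and
$$\mathrm{clip}_s(q) = \max\!\left(-(2^{B-1}-1),\ \min\!\left(q,\ 2^{B-1}-1\right)\right)$$
is the clipping (saturation) of the quantization function, which maps all values in the overload region $|x| > x_{\max}$ to the maximum or minimum value in this set. For unsigned values $x \geq 0$, we use a quantization function $Q_u$ to map real values from the value range $[0, x_{\max}]$ to a set of unsigned integer values $\{0, \dots, 2^{B}-1\}$. We define the unsigned quantization function as
$$Q_u(x) = \mathrm{clip}_u\!\left(\mathrm{round}\!\left(\frac{x}{\Delta}\right)\right).$$
The step width is $\Delta = x_{\max} / (2^{B}-1)$ and the clipping function is
$$\mathrm{clip}_u(q) = \max\!\left(0,\ \min\!\left(q,\ 2^{B}-1\right)\right).$$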
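A minimal NumPy sketch of the two quantization functions may help to make the definitions concrete. The names `quantize_signed` and `quantize_unsigned` are hypothetical, and rounding to the nearest integer is an assumption of this sketch; the step width `delta` is passed in directly.

```python
import numpy as np

def quantize_signed(x, delta, bits):
    """Map x to integers in {-(2**(bits-1)-1), ..., 2**(bits-1)-1}."""
    q_max = 2 ** (bits - 1) - 1
    q = np.round(np.asarray(x, dtype=np.float64) / delta)
    return np.clip(q, -q_max, q_max).astype(np.int64)   # saturate in the overload region

def quantize_unsigned(x, delta, bits):
    """Map non-negative x to integers in {0, ..., 2**bits - 1}."""
    q_max = 2 ** bits - 1
    q = np.round(np.asarray(x, dtype=np.float64) / delta)
    return np.clip(q, 0, q_max).astype(np.int64)

x = np.array([-1.3, -0.2, 0.0, 0.26, 0.9])
q = quantize_signed(x, delta=0.25, bits=3)   # integer levels -3 ... 3
print(q, 0.25 * q)                           # integers and their dequantized values
```

With 3 bits and a step width of 0.25, the values -1.3 and 0.9 fall into the overload region and are saturated to the outermost levels.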
Quantization introduces the quantization noise $\epsilon$, meaning $x = \Delta\, Q(x) + \epsilon$. For given $B$, the step size $\Delta$ must be chosen to balance high resolution (small $\Delta$) against small overload regions (large $\Delta$). One approach is to minimize the mean square quantization error $\mathrm{E}[\epsilon^2]$, which depends on the probability density function $p(x)$ of $x$. A compact $p(x)$ with short tails is desirable to minimize $\mathrm{E}[\epsilon^2]$.
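The trade-off between resolution and overload can be illustrated empirically. The sketch below estimates the mean square quantization error of a signed quantizer on synthetic Gaussian samples; the data, the word length of 4 bits and the three step sizes are assumptions made only for this illustration.

```python
import numpy as np

def quantization_mse(x, delta, bits):
    """Empirical mean square error of signed uniform quantization."""
    q_max = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / delta), -q_max, q_max)
    return float(np.mean((x - delta * q) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=20000)                         # synthetic stand-in for parameters

bits = 4
mse_small = quantization_mse(x, 0.01, bits)        # fine steps, but almost everything overloads
mse_mid = quantization_mse(x, 0.4, bits)           # balanced choice
mse_large = quantization_mse(x, 5.0, bits)         # no overload, but very coarse resolution
print(mse_small, mse_mid, mse_large)
```

Both extreme step sizes give a much larger error than the intermediate one, which is why the step size has to be tuned to the distribution of the values.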
Using uniform quantization, a function $y = wx + b$ with $w \approx \Delta_w Q(w)$ and $x \approx \Delta_x Q(x)$ is approximately
$$y \approx \Delta_w \Delta_x \left( Q(w)\, Q(x) + Q(b) \right),$$
where all terms within the bracket can be computed with low precision integer arithmetics. If $\Delta_w \Delta_x$ is also chosen as the step size for quantization of $b$, only one single floating point multiplication is needed to map the result back to the original value range.
We assume that integer multiplications produce no overflows. This means that a multiplication of integer values of word length $B$ results in an integer of twice the word length, $2B$.
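As a small worked example, the approximation can be evaluated for a three-element dot product. The sketch assumes rounding to the nearest integer, power-of-two step sizes chosen so that the toy values are exactly representable, and a bias kept at the accumulator width of `2 * bits`; these choices and the helper name `quantize` are assumptions of the example.

```python
import numpy as np

def quantize(x, delta, bits, signed=True):
    """Uniform quantization with saturation (rounding to nearest is assumed)."""
    q_max = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    q_min = -q_max if signed else 0
    return np.clip(np.round(np.asarray(x) / delta), q_min, q_max).astype(np.int64)

bits = 8
delta_w, delta_x = 2.0 ** -6, 2.0 ** -7
delta_b = delta_w * delta_x                 # bias shares the step size of the products

w = np.array([0.5, -0.25, 1.0])
x = np.array([0.75, 0.5, 0.125])
b = 0.0625

w_q = quantize(w, delta_w, bits)
x_q = quantize(x, delta_x, bits, signed=False)
b_q = quantize(b, delta_b, 2 * bits)        # assumption: bias stored at accumulator width

acc = int(w_q @ x_q + b_q)                  # integer-only part (the bracket)
y_approx = delta_b * acc                    # the single floating point multiplication
y_exact = float(w @ x + b)
print(acc, y_approx, y_exact)
```

Because the toy values lie exactly on the quantization grid, the integer result reproduces the floating point result; for general values a small quantization error remains.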
3 Quantization of layers
Of course, the word length does not increase from layer to layer. The input of each layer is quantized from $2B$ back to $B$. After training, the layers of a DNN can be converted to quantized layers which only use low precision integer arithmetics to calculate the layer activation. Although only dense layers are considered below, the idea can be easily adapted to convolutional layers as well.
As shown in Fig. 2, the input $\mathbf{x}_{l-1}$ and the parameters $\mathbf{W}_l$ and $\mathbf{b}_l$ are quantized to vectors and matrices containing integer values of word length $B$. $\mathbf{x}_{l-1}$, $\mathbf{W}_l$ and $\mathbf{b}_l$ are quantized with uniform quantization functions which are defined elementwise. We use signed quantization for the parameters $\mathbf{W}_l$ and $\mathbf{b}_l$ and unsigned quantization for $\mathbf{x}_{l-1}$, since we assume unsigned $\mathbf{x}_{l-1}$ in each layer. This is feasible for networks with relu or sigmoid activation functions.
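The quantized layer computes
$$\mathbf{a}_l = \Delta_W \Delta_x \left( Q_s(\mathbf{W}_l)\, Q_u(\mathbf{x}_{l-1}) + Q_s(\mathbf{b}_l) \right) + \boldsymbol{\epsilon}_l,$$
where $\Delta_W$ and $\Delta_x$ are the step sizes used for $\mathbf{W}_l$ and $\mathbf{x}_{l-1}$ and $\boldsymbol{\epsilon}_l$ is the quantization noise. In general, a different word length can be chosen for $\mathbf{x}_{l-1}$, $\mathbf{W}_l$ and $\mathbf{b}_l$ in each layer. However, in each layer, we use the same fixed $B$ for the inputs and parameters and only choose the step sizes such that the quantization error is minimized in each layer, separately. For this purpose, we use training samples to estimate the mean square quantization error and minimize it using a grid search.
We use $\Delta_W \Delta_x$ as the step size to quantize $\mathbf{b}_l$. As described in section 2, the activation of the layer can be converted back to the original value range using just $M_l$ floating point multiplications, one per neuron of the layer. This overhead is small compared to the $M_l M_{l-1}$ multiplications of the matrix-vector product, which can now be evaluated with low precision integer arithmetics.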
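A quantized dense layer with an arbitrary activation function could be sketched as follows. The function names, the random toy data and the choice to keep the bias at the accumulator width of `2 * bits` are assumptions of this example; only the structure (integer MACs, then one floating point multiplication per neuron, then the activation function) follows the description above.

```python
import numpy as np

def quantize(x, delta, bits, signed):
    """Elementwise uniform quantization with saturation."""
    q_max = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    q_min = -q_max if signed else 0
    return np.clip(np.round(np.asarray(x) / delta), q_min, q_max).astype(np.int64)

def quantized_dense(x, W, b, delta_x, delta_w, bits, activation):
    """Dense layer: integer MACs, then one float multiplication per neuron."""
    x_q = quantize(x, delta_x, bits, signed=False)
    W_q = quantize(W, delta_w, bits, signed=True)
    b_q = quantize(b, delta_w * delta_x, 2 * bits, signed=True)  # assumed accumulator width
    acc = W_q @ x_q + b_q                     # integer arithmetic only
    a = (delta_w * delta_x) * acc             # back to the original value range
    return activation(a)

rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(4, 6))
x = rng.uniform(0.0, 1.0, size=6)             # unsigned layer input
b = rng.uniform(-1.0, 1.0, size=4)

y_quant = quantized_dense(x, W, b, delta_x=1 / 255, delta_w=1 / 127, bits=8, activation=np.tanh)
y_float = np.tanh(W @ x + b)
print(np.max(np.abs(y_quant - y_float)))
```

In a deployed network the parameters would be quantized once after training, so only the input quantization and the integer MACs remain at inference time.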
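If each layer of the DNN uses a relu activation function and if we choose the step sizes as powers of 2 (i.e. $\Delta = 2^{-k}$ with integer $k$), the computation of the network output can be further simplified, as shown in Fig. 3. Let $\Delta_W$ and $\Delta_x$ be the step sizes of $\mathbf{W}_l$ and of the layer input $\mathbf{x}_{l-1}$, and let $\Delta_x'$ be the step size of the input of layer $l+1$. Since $\mathrm{relu}(a) = \max(0, a)$ is piecewise linear and the step sizes are positive, we can compute the quantized input of layer $l+1$ by
$$Q_u(\mathbf{x}_l) = \mathrm{clip}_u\!\left( \mathrm{round}\!\left( \frac{\Delta_W \Delta_x}{\Delta_x'} \left( Q_s(\mathbf{W}_l)\, Q_u(\mathbf{x}_{l-1}) + Q_s(\mathbf{b}_l) \right) \right) \right),$$
where the factor $\Delta_W \Delta_x / \Delta_x'$ is a power of 2 and the lower clipping at $0$ implements the relu.
This means, we only need to apply a binary shift followed by a clipping to each element of the integer activation. Because no floating point multiplications are needed, the DNN can be evaluated using only low precision integer arithmetics.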
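The shift-and-clip evaluation of a relu layer can be sketched in NumPy. The exponents `k_w`, `k_x` and `k_out`, the toy parameters and the rounding offset added before the shift are assumptions of this example; the parameters are converted offline, and `relu_layer_int` itself uses integer operations only.

```python
import numpy as np

def relu_layer_int(x_q, W_q, b_q, shift, bits):
    """One relu layer evaluated with integer operations only (shift >= 1 assumed)."""
    acc = W_q @ x_q + b_q                        # integer MACs in a wide accumulator
    acc = (acc + (1 << (shift - 1))) >> shift    # binary shift with rounding
    return np.clip(acc, 0, 2 ** bits - 1)        # relu and saturation in one clipping

# Offline conversion of a toy layer with step sizes 2**-k (hypothetical values).
bits = 8
k_w, k_x, k_out = 6, 7, 7
W = np.array([[0.5, -0.25], [-1.0, 0.5]])
b = np.array([0.125, -0.25])
x = np.array([0.75, 0.5])

W_q = np.round(W * 2 ** k_w).astype(np.int64)
x_q = np.round(x * 2 ** k_x).astype(np.int64)
b_q = np.round(b * 2 ** (k_w + k_x)).astype(np.int64)   # bias uses the product step size

out_q = relu_layer_int(x_q, W_q, b_q, shift=k_w + k_x - k_out, bits=bits)
out_float = np.maximum(W @ x + b, 0.0)
print(out_q, out_q * 2.0 ** -k_out, out_float)
```

The output `out_q` is already in the format expected as the quantized input of the next layer, so layers of this kind can be chained without any floating point operation in between.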
4 Regularization and quantization
Assume that we want to quantize a parameter $w$ with $Q_s(w)$. As described in section 2, this introduces a quantization error $\epsilon$, which depends on the shape of the probability density function of $w$. If $w$ has a high probability to have values within the overload region, the mean squared quantization error will be large. Therefore, a compact distribution with no tails is desirable for a small quantization error.
During training, regularization can control the shape of the distribution of the parameters $\mathbf{W}$ and $\mathbf{b}$. A common way to regularize a network is to augment the cost function with a regularization term $\lambda \lVert \mathbf{W} \rVert_p$, where $\lambda$ is the regularization parameter. For $p = 1$, the regularization term acts like a Laplacian prior for $\mathbf{W}$, promoting sparse weight matrices with long tails in the distribution of the weights. This is the desired way to regularize a DNN if pruning is used for network reduction. After training, small elements in $\mathbf{W}$ can be set to zero to remove (prune) connections from the network. However, it is not desirable for quantization. A regularization term with $p = \infty$ penalizes the largest absolute values in $\mathbf{W}$ and thus promotes weight matrices with a compact distribution of the weights. This is desired for quantization. Our experiments show that a careful choice of the regularization during training is the key to a good model accuracy after quantization.
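The effect of the weight distribution on the quantization error can be illustrated with synthetic data. In the sketch below, Laplacian and uniform samples of equal variance stand in for weights trained with a sparsity-promoting and a compactness-promoting regularizer; this is an assumption made for illustration, not a result of training. The helper `lp_penalty` shows how the regularization term could be computed for $p = 1$ and $p = \infty$.

```python
import numpy as np

def lp_penalty(W, lam, p):
    """Regularization term lam * ||W||_p over all elements of W."""
    w = np.abs(np.ravel(W))
    if np.isinf(p):
        return lam * float(np.max(w))            # only the largest absolute value counts
    return lam * float(np.sum(w ** p) ** (1.0 / p))

def max_abs_quantization_mse(w, bits):
    """MSE of signed quantization with the step size set by the largest value."""
    q_max = 2 ** (bits - 1) - 1
    delta = np.max(np.abs(w)) / q_max
    q = np.clip(np.round(w / delta), -q_max, q_max)
    return float(np.mean((w - delta * q) ** 2))

rng = np.random.default_rng(0)
n = 20000
w_long = rng.laplace(scale=1 / np.sqrt(2), size=n)           # unit variance, long tails
w_compact = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)     # unit variance, no tails

mse_long = max_abs_quantization_mse(w_long, bits=4)
mse_compact = max_abs_quantization_mse(w_compact, bits=4)
print(mse_long, mse_compact)
```

At the same word length and variance, the long-tailed samples force a much larger step size and therefore a larger quantization error than the compact ones.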
All experiments are done using Theano and Keras. We apply network quantization to two networks trained on the MNIST dataset, which contains gray-scale images of pre-segmented handwritten digits that are divided into a training set with 60000 and a test set with 10000 images of size 28x28.
feature maps. Maxpooling with strideis applied after the layers and . The two dense output layers of MNISTnet2 contain and neurons.
For both networks, we use relu activation functions in all hidden layers and a softmax activation function with cross-entropy loss in the output layer. The networks are optimized with Adam for epochs, using dropout with probability . After training, MNISTnet1 and MNISTnet2 achieve the baseline accuracies and , respectively.
In our experiments, we apply network quantization with different word lengths and compare three different quantization methods. These quantization methods use different approaches to choose the step sizes.
The first method uses the maximum absolute values of the inputs and parameters to determine the step sizes for quantization. We call this method "max. absolute value". The second method chooses the step sizes to minimize the mean squared quantization error in each layer, separately. We call this method "min. MSE I". The third method also minimizes the quantization error, but restricts the step sizes to be powers of 2. We call this method "min. MSE II". This method is the most interesting one, because it allows the network to be evaluated using only integer arithmetics, binary shift and clipping.
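The three ways of choosing a step size can be sketched for a single set of signed values. The function names, the grid resolution, the range of exponents and the synthetic Laplacian samples are assumptions of this example; a word length of at least 2 bits is assumed.

```python
import numpy as np

def quantization_mse(x, delta, bits):
    """Empirical mean square error of signed uniform quantization."""
    q_max = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / delta), -q_max, q_max)
    return float(np.mean((x - delta * q) ** 2))

def step_max_abs(x, bits):
    """'max. absolute value': the largest value maps to the largest integer."""
    return float(np.max(np.abs(x)) / (2 ** (bits - 1) - 1))

def step_min_mse(x, bits, n_grid=200):
    """'min. MSE I': grid search over step sizes up to the max-abs step size."""
    upper = step_max_abs(x, bits)
    grid = np.linspace(upper / n_grid, upper, n_grid)
    return float(min(grid, key=lambda d: quantization_mse(x, d, bits)))

def step_min_mse_pow2(x, bits, exponents=range(-16, 9)):
    """'min. MSE II': grid search restricted to powers of 2."""
    grid = [2.0 ** k for k in exponents]
    return min(grid, key=lambda d: quantization_mse(x, d, bits))

rng = np.random.default_rng(1)
w = rng.laplace(scale=0.05, size=5000)           # synthetic stand-in for weights
bits = 4
for name, step in [("max. absolute value", step_max_abs(w, bits)),
                   ("min. MSE I", step_min_mse(w, bits)),
                   ("min. MSE II", step_min_mse_pow2(w, bits))]:
    print(name, step, quantization_mse(w, step, bits))
```

Since the search grid of "min. MSE I" contains the max-abs step size, its error can never be larger than that of the first method on the samples used for the search.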
For MNISTnet1, the results are shown in Fig. 5 (a). MNISTnet1 can be evaluated with integer arithmetics without considerable performance loss. For , we observe that the quantized network performs best if the step size for quantization is chosen to minimize the mean squared quantization error.
In our second experiment, we apply quantization to the MNISTnet2 using three different regularization methods during training, i.e. no regularization with , regularization with and and with and . The results are shown in Fig. 5 (b) to 5 (d). We observe large accuracy degradations for if the MNISTnet2 is trained with and . For all three quantization methods, the accuracy of the quantized networks is even worse than with MNISTnet2 trained without regularization (). However, this is expected since regularization with leads to a distribution with long tails and therefore to large quantization errors. This is shown in Fig. 4.
Quantization works best if the MNISTnet2 is trained with and . Even with integer arithmetics the accuracy of the MNISTnet2 is only reduced slightly if we use the quantization method "min. MSE I". This is astonishing since implies that only three weight values remain after quantization. For , the quantization method "min. MSE II" also leads to an accuracy very close to the baseline. This means, MNISTnet2 can be evaluated using only integer arithmetics, binary shift and clipping operations.
In general, quantization can be used in combination with other model reduction methods like factorization or pruning [3, 4, 5]. However, we believe that one cause for the remarkable robustness of DNNs to quantization noise is the large parameter redundancy in trained DNNs. If this redundancy is removed by other reduction methods, we believe that quantization will lead to larger performance degradation. The optimum trade-off between quantization and other model reduction methods is still an open issue.
We proposed a method to quantize DNNs that were trained with floating point accuracy and to evaluate them using only low precision integer arithmetics. In our experiments, we showed that our method leads to almost no accuracy degradation and can therefore be used to implement trained DNNs efficiently on dedicated hardware. We also demonstrated how regularization during training influences the distribution of weights and thus network quantization. Our experiments show that sparsity-promoting regularization, which is often used in combination with network pruning, leads to DNNs that cannot be quantized easily. Instead, regularizers that lead to a compact distribution of weights are beneficial for quantization.
References
[1] “A tutorial survey of architectures, algorithms, and applications for deep learning,” APSIPA Transactions on Signal and Information Processing, 2014.
[2] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent, “Unsupervised feature learning and deep learning: A review and new perspectives,” CoRR, vol. abs/1206.5538, 2012.
[3] Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun, “Efficient and accurate approximations of nonlinear convolutional networks,” CoRR, vol. abs/1411.4229, 2014.
[4] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun, “Accelerating very deep convolutional networks for classification and detection,” CoRR, vol. abs/1505.06798, 2015.
[5] L. Mauch and B. Yang, “A novel layerwise pruning method for model reduction of fully connected deep neural networks,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[6] L. Mauch and B. Yang, “Selecting optimal layer reduction factors for model reduction of deep neural networks,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[7] Yann Le Cun, John S. Denker, and Sara A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems. 1990, pp. 598–605, Morgan Kaufmann.
[8] Song Han, Huizi Mao, and William J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding,” CoRR, vol. abs/1510.00149, 2015.
[9] Darryl Dexu Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy, “Fixed point quantization of deep convolutional networks,” CoRR, vol. abs/1511.06393, 2015.
[10] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao, “Improving the speed of neural networks on CPUs,” in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
[11] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio, “Theano: a CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010, Oral Presentation.
[12] François Chollet, “Keras,” 2015.
[13] “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, no. 141-142, November 2012.