A major challenge in the deployment of Deep Neural Networks (DNNs) is their high computational cost. Finding effective methods to improve run-time efficiency is still an area of research. We can group various approaches taken by researchers into the following three categories.
Hardware optimization: Specifically designed hardware is deployed to efficiently perform the computations in ML tasks.
Compiler optimization: Compression and fusion techniques coupled with efficient hardware-aware implementations, such as dense and sparse matrix-vector multiplication, are used.
Model optimization: Run-time performance can also be gained by modifying the model structure and the underlying arithmetic operations. While hardware and compiler optimization are typically lossless (i.e., incur no loss in model accuracy), model optimization trades off computational cost (memory, runtime, or power) for model accuracy, for example by scaling the width of the network (zagoruyko2016wide). The goal of model optimization is to improve the trade-off between computational cost and model accuracy. This work falls into this category.
1.1 Architecture optimization
One strategy to construct efficient DNNs is to define a template from which efficient computational blocks can be generated. Multiple instantiations of these blocks are then chained together to form a DNN. SqueezeNet (iandola2016squeezenet), MobileNets (howard2017mobilenets; sandler2018mobilenetv2), ShuffleNets (zhang2018shufflenet; ma2018shufflenet), and ESPNets (mehta2018espnet; mehta2019espnetv2) fall into this category. Complementary to these methods, NASNet (zoph2018learning) and EfficientNet (tan2019efficientnet) search for an optimal composition of blocks restricted to a computational budget (e.g., FLOPS) by changing the resolution, depth, width, or other parameters of each layer.
1.2 Pruning and Compression
Several methods have been proposed to improve runtime performance by detecting and removing computational redundancy. Methods in this category include low-rank acceleration (jaderberg2014speeding), the use of depth-wise convolution in Inception (szegedy2015going), sparsification of kernels in deep compression (han2015deep), re-training redundant neurons in DSD (han2016dsd), depth-wise separable convolution in Xception (chollet2017xception), pruning redundant filters in PFA (suau2018network), finding an optimal sub-network in the lottery ticket hypothesis (frankle2018lottery), and separating channels based on feature resolution in octave convolution (chen2019drop). While some of these compression methods can be applied to a trained network, most add training-time constraints to create a computationally efficient model.
1.3 Low-precision arithmetic and quantization
Another avenue to improve runtime performance (and the focus of this work) is to use low-precision arithmetic. The idea is to use fewer bits to represent weights and activations. Some instances of these strategies already exist in AI compilers, where it is common to cast weights of a trained model from 32 bits to 16 or 8 bits. However, in general, post-training quantization reduces the model accuracy. This can be addressed by incorporating lower-precision arithmetic into the training process (during-training quantization), allowing the resulting model to better adapt to the lower precision. For example, in gupta2015deep; jacob2018quantization the authors use 16 and 8 bits fixed-point representation to train DNNs.
Using fewer bits results in dramatic memory savings. This has motivated research into methods that use a single bit to represent a scalar weight: in courbariaux2015binaryconnect the authors train models with weights quantized to the values in $\{-1, +1\}$. While this results in a high level of compression, model accuracy can drop significantly. li2016ternary and zhu2016trained reduce the accuracy gap between full precision and quantized models by considering ternary quantization (using the values in $\{-1, 0, +1\}$), at the cost of slightly less compression.
To further improve the computational efficiency, the intermediate activation tensors (feature maps) can also be quantized. When this is the case, an implementation can use high-performance operators that act on quantized inputs, for example the convolutional block depicted in Figure 1(left). This idea has been explored in (courbariaux2016binarized; rastegari2016xnor; zhou2016dorefa; hubara2017quantized; mishra2017wrpn; lin2017towards; cai2017deep; ghasemzadeh2018rebnet; zhang2018lq; choi2018bridging), and many other works.
We call a mapping from a tensor with full precision entries to a tensor with the same shape but with values in $\{-1, +1\}$ a binary quantization. When both weights and activations of a DNN are quantized using binary quantization, called a Binary Neural Network (BNN), fast and power-efficient kernels which use bitwise operations can be implemented. Observe that the inner-product between two vectors with entries in $\{-1, +1\}$ can be written as bitwise XNor operations followed by bit-counting (courbariaux2016binarized). However, the quantization of both weights and activations further reduces the model accuracy. In this work, we focus on improving the accuracy of the quantized model through improved quantization. The computational cost remains similar to the previous BNNs (rastegari2016xnor; hubara2017quantized).
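The XNor-and-bit-count identity can be illustrated with a minimal sketch. It assumes each vector with entries in {-1, +1} is packed into an integer whose bit i is 1 when entry i is +1; the helper names `pack_bits` and `binary_dot` are illustrative and not taken from any library.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 entries into an int: bit i is set when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Inner product of two {-1,+1}^n vectors from their packed words."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask      # bit is 1 where the two entries agree
    matches = bin(xnor).count("1")        # bit-counting (popcount)
    return 2 * matches - n                # agreements minus disagreements


a = [1, -1, -1, 1, 1]
b = [1, 1, -1, -1, 1]
reference = sum(x * y for x, y in zip(a, b))
assert binary_dot(pack_bits(a), pack_bits(b), len(a)) == reference
```

Each agreeing position contributes +1 and each disagreeing position contributes -1, which is why the result is twice the number of matches minus the vector length.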
1.4 Main contributions
In this work, we analyze the accuracy of binary quantization when applied to both weights and activations of a DNN, and propose methods to improve the quantization accuracy:
We present an analysis of the quantization error and show that scaled binary quantization is a good approximation (Section 2).
We propose a greedy $k$-bits quantization algorithm (Section 3.2.4).
Experiments on the ImageNet dataset show that the optimal algorithms have reduced quantization error, and lead to improved classification accuracy (Section 5).
2 Low-rank binary quantization
Binary quantization (that maps entries of a tensor to $\{-1, +1\}$) of weights and activation tensors of a neural network can significantly reduce the model accuracy. A remedy to recover this accuracy loss is to scale the binarized tensors with a few full precision values. For example, hubara2017quantized learn a scaling for each channel from the parameters of batch-normalization, and rastegari2016xnor scale the quantized activation tensors using the channel-wise average of pixel values.
In this section, using low-rank matrix analysis, we analyze different scaling strategies. We conclude that multiplying the quantized tensor by a single scalar, which is computationally the most efficient option, has approximately the same accuracy as the more expensive alternatives.
We introduce the rank-1 binary quantization, an approximation to a matrix $X \in \mathbb{R}^{m \times n}$:
$$X \approx S \odot Z, \qquad (1)$$
where $S \in \mathbb{R}^{m \times n}$ is a rank-1 matrix, $Z \in \{-1, +1\}^{m \times n}$, and $\odot$ is element-wise multiplication (Hadamard product). Note that this approximation is also defined for tensors, after appropriate reshaping. For example, for an image classification task, we can reshape the output of a layer of a DNN with shape $h \times w \times c$, where $h$, $w$, and $c$ are height, width, and number of channels, respectively, into an $m \times n$ matrix with $m = hw$ rows and one column per channel ($n = c$).
We define the error of a rank-1 binary quantization as $\|X - S \odot Z\|_F$, where $\|\cdot\|_F$ is the Frobenius norm. Entries of $Z$ are in $\{-1, +1\}$, therefore, the quantization error is equal to $\|Z \odot X - S\|_F$. Note that $\|Z \odot X\|_F^2$ (the total energy), which is equal to the sum of the squared singular values, is the same for any $Z$. Different choices of $Z$ change the distribution of the total energy among the components of the Singular Value Decomposition (SVD) of $Z \odot X$. The optimal rank-1 binary quantization is achieved when most of the energy of $Z \odot X$ is in its first component.
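In rastegari2016xnor, the authors proposed to quantize the activations by applying the sign function and scaling them by their channel-wise average. We can formulate this scaling strategy as a special rank-1 binary quantization $X \approx S_{\text{xnor}} \odot Z_{\text{xnor}}$, where $Z_{\text{xnor}} = \mathrm{sign}(X)$, $S_{\text{xnor}} = \frac{1}{n} |X| \mathbf{1}_n \mathbf{1}_n^\top$, and $\mathbf{1}_n$ is an $n$-dimensional vector with all entries 1.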
In Appendix A we show that the optimal rank-1 binary quantization is given by $Z = \mathrm{sign}(X)$ and $S$ equal to the first component of the SVD of $|X|$, where $\mathrm{sign}(X)$ is the element-wise sign of $X$ and $|X|$ is its element-wise absolute value. Moreover, we empirically analyze the accuracy of the optimal rank-1 binary quantization for a random matrix whose entries are i.i.d. $\mathcal{N}(0, 1)$. This is a relevant example since after the application of Batch Normalization (BN) (ioffe2015batch) activation tensors are expected to have a similar distribution. The first singular value of $|X|$ captures most of the energy, and the first left and right singular vectors are almost constant vectors. Therefore, a scalar multiple of $\mathrm{sign}(X)$ approximates $X$ well: $X \approx v\,\mathrm{sign}(X)$, where $v$ is a scalar. We use this computationally efficient approximation, called scaled binary quantization.
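The claim that the first singular component of the element-wise absolute value of a Gaussian matrix holds most of the energy can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses plain power iteration (the helper `top_singular_value`, the matrix size 40 by 40, and the seed are all choices made here, not taken from the paper). For large matrices the printed fraction should be close to the squared mean of the standard folded normal distribution, 2/pi.

```python
import math
import random


def top_singular_value(a, iters=50):
    """Largest singular value of matrix a (list of rows) via power iteration on a^T a."""
    m, n = len(a), len(a[0])
    v = [1.0 / math.sqrt(n)] * n                      # constant unit start vector
    for _ in range(iters):
        u = [sum(row[j] * v[j] for j in range(n)) for row in a]          # u = a v
        w = [sum(a[i][j] * u[i] for i in range(m)) for j in range(n)]    # w = a^T u
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    u = [sum(row[j] * v[j] for j in range(n)) for row in a]
    return math.sqrt(sum(t * t for t in u))           # ||a v|| with unit v


random.seed(0)
m = n = 40
abs_x = [[abs(random.gauss(0.0, 1.0)) for _ in range(n)] for _ in range(m)]
energy = sum(t * t for row in abs_x for t in row)     # squared Frobenius norm
fraction = top_singular_value(abs_x) ** 2 / energy
print(f"fraction of energy in first component: {fraction:.3f}")
```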
3 Scaled binary quantization
In Section 2 we showed that scaled binary quantization is a good approximation to activation and weight tensors of a DNN. Next we show how we can further improve the accuracy of scaled binary quantization using more bits. To simplify the presentation, (1) we flatten the matrix $X$ in $\mathbb{R}^{m \times n}$ to a vector $\mathbf{x} \in \mathbb{R}^{N}$ with $N = mn$, and (2) we assume the entries of $\mathbf{x}$ are different realizations of a random variable $x$ with an underlying probability distribution $p(x)$. In practice, we compute all statistics using their unbiased estimators from the vector $\mathbf{x}$ (e.g., $\frac{1}{N}\|\mathbf{x}\|_1$ is an unbiased estimator of $\mathbb{E}|x|$). Furthermore, for $f: \mathbb{R} \to \mathbb{R}$, we denote the entrywise application of $f$ to $\mathbf{x}$ by $f(\mathbf{x})$. The quantized approximation of $x$ is denoted by $x^q$, and the error (loss) of quantization is $\mathbb{E}[(x - x^q)^2]$. All optimal solutions are with respect to this error and hold for an arbitrary distribution $p$.
3.1 1-Bit quantization
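A 1-bit scaled binary quantization of $x$ is:
$$x^q = v\,s(x), \qquad (3)$$
which is determined by a scalar $v \ge 0$ and a function $s: \mathbb{R} \to \{-1, +1\}$. Finding the optimal 1-bit scaled binary quantization can be formulated as the following optimization problem:
$$\min_{v,\,s} \int \big(v\,s(x) - x\big)^2 p(x)\,dx. \qquad (4)$$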
3.1.1 Optimal 1-Bit algorithm
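Appendix B derives the least-squares solution of the 1-bit problem: the sign function, paired with the mean absolute value as the scalar. A minimal sketch on a finite sample follows; the helper names `sign`, `optimal_1bit`, and `squared_error` are illustrative, and mapping zero to +1 is an assumption made here.

```python
def sign(t):
    """Sign with the convention sign(0) = +1 (an assumption of this sketch)."""
    return 1 if t >= 0 else -1


def optimal_1bit(x):
    """Return (v, s): v is the mean absolute value, s the entrywise signs."""
    v = sum(abs(t) for t in x) / len(x)
    s = [sign(t) for t in x]
    return v, s


def squared_error(x, v, s):
    return sum((v * si - t) ** 2 for t, si in zip(x, s))


x = [-2.0, 1.0, 3.0]
v, s = optimal_1bit(x)                     # v = 2.0, s = [-1, 1, 1]
# Any other scalar with the same signs gives a larger squared error.
assert squared_error(x, v, s) <= squared_error(x, v + 0.1, s)
assert squared_error(x, v, s) <= squared_error(x, v - 0.1, s)
```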
3.2 $k$-Bits quantization
We can further improve the accuracy of scaled binary quantization by adding more terms to the approximation (3). A $k$-bits scaled binary quantization of $x$ is
$$x^q = \sum_{i=1}^{k} v_i\,s_i(x), \qquad (6)$$
which is determined by a set of $k$ pairs of scalars $v_i$ and functions $s_i: \mathbb{R} \to \{-1, +1\}$. Observe that any permutation of the pairs $(v_i, s_i)$ results in the same quantization. To remove ambiguity, we assume $v_1 \ge v_2 \ge \dots \ge v_k \ge 0$.
When both weights, , and activations, , are quantized using scaled binary quantization (6), their inner-product can be written as:
where and are quantized activations and weights with and bits, respectively, , and . This inner-product can be computed efficiently using bitwise XNors followed by bit-counting (see Figure 1(right) with and ).
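Finding the optimal $k$-bits scaled binary quantization can be formulated as:
$$\min_{v_i,\,s_i} \int \Big(\sum_{i=1}^{k} v_i\,s_i(x) - x\Big)^2 p(x)\,dx \quad \text{s.t.} \quad v_1 \ge \dots \ge v_k \ge 0. \qquad (8)$$
This is an optimization problem with a non-convex domain for all $k$. We solve the optimization for $k = 1$ in Section 3.1 and for $k = 2$ in Section 3.2.2 for an arbitrary distribution $p$. We also provide an approximate solution to (8) in Section 3.2.4 using a greedy algorithm.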
Discussion: A general $k$-bits quantizer maps full precision values to an arbitrary set of $2^k$ numbers, not necessarily in the form of (6). The optimal quantization in this case can be computed using Lloyd's algorithm (lloyd1982least). While a general $k$-bits quantization has more representation power compared to $k$-bits scaled binary quantization, it does not allow an efficient implementation based on bitwise operations. Fixed-point representation (as opposed to floating point) is also in the form of (6) with an additional constant term. However, fixed-point quantization uniformly quantizes the space, therefore, it can be significantly inaccurate for small values of $k$.
3.2.1 Foldable quantization
In this section, we introduce a special family of $k$-bits scaled binary quantizations that allow fast computation of the quantized values. We name this family of quantizations foldable. A $k$-bits scaled binary quantization given by the pairs $(v_i, s_i)$ is foldable if the following conditions are satisfied:
$$s_1(x) = \mathrm{sign}(x), \qquad s_{i+1}(x) = \mathrm{sign}\Big(x - \sum_{j=1}^{i} v_j\,s_j(x)\Big), \quad 1 \le i < k. \qquad (9)$$
When the foldable condition is satisfied, given the $v_i$'s, we can compute the $s_i(x)$'s in (6) efficiently by applying the sign function.
3.2.2 Optimal 2-bits algorithm
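In this section, we present the optimal 2-bits binary quantization algorithm, the solution of (8) for $k = 2$. In Appendix C we show that the optimal 2-bits binary quantization is foldable and the scalars $v_1$ and $v_2$ should satisfy the following optimality conditions:
$$v_1 = \tfrac{1}{2}\Big(\mathbb{E}\big[|x| : |x| \le v_1\big] + \mathbb{E}\big[|x| : |x| > v_1\big]\Big), \qquad (10)$$
$$v_2 = \tfrac{1}{2}\Big(\mathbb{E}\big[|x| : |x| > v_1\big] - \mathbb{E}\big[|x| : |x| \le v_1\big]\Big). \qquad (11)$$
For $x$ with standard normal distribution, the optimal $v_1$ lies on the intersection of the identity line and the average of the conditional expectations in (10).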
For a given vector $\mathbf{x}$ with $N$ entries we can solve for $v_1$ in (10) efficiently. We substitute the conditional expectations in (10) by conditional average operators as their unbiased estimators. (10) implies that for the optimal $v_1$, the average of the entries in $|\mathbf{x}|$ smaller than $v_1$ (an estimator of $\mathbb{E}[|x| : |x| \le v_1]$) and the average of the entries greater than $v_1$ (an estimator of $\mathbb{E}[|x| : |x| > v_1]$) should be equidistant from $v_1$. Note that (10) may have more than one solution; these are local minima of the objective function in (8). We find all the values that satisfy this condition in $O(N \log N)$ time. We first sort the entries of $\mathbf{x}$ based on their absolute value and compute their cumulative sum. Then with one pass we can check whether (10) is satisfied for each element of $|\mathbf{x}|$. We evaluate the objective function in (8) for each local minimum, and retain the best. After $v_1$ is calculated, $v_2$ is simply computed from (11). As explained in Section 4, this process is only done during training. In our experiments, finding the optimal 2-bits quantization increased the training time by 35% compared to the 2-bits greedy algorithm (see Section 3.2.4). Since the optimal 2-bits binary quantization is foldable, after recovering $v_1$ and $v_2$, we have $s_1(x) = \mathrm{sign}(x)$ and $s_2(x) = \mathrm{sign}(x - v_1 s_1(x))$.
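The sort-then-scan procedure described above can be sketched as follows. This is an illustrative reading of the prose, not the authors' implementation: the function names are hypothetical, ties are handled with inclusive comparisons, and a 1-bit fallback is added here for degenerate inputs. Each split of the sorted absolute values yields a candidate first scalar as the mean of the lower-group and upper-group averages; a candidate is kept only if thresholding at that value reproduces the split.

```python
def sign(t):
    return 1 if t >= 0 else -1            # sign(0) = +1 is an assumption


def optimal_2bit(x):
    """Return (v1, v2) for the best foldable 2-bits quantization of the sample x."""
    a = sorted(abs(t) for t in x)         # O(N log N) sort of absolute values
    n = len(a)
    prefix = [0.0]
    for t in a:                           # cumulative sums for O(1) group averages
        prefix.append(prefix[-1] + t)
    total_sq = sum(t * t for t in a)
    best = None
    for i in range(1, n):                 # a[:i] is the lower group, a[i:] the upper
        lower = prefix[i] / i
        upper = (prefix[n] - prefix[i]) / (n - i)
        v1 = 0.5 * (lower + upper)
        if a[i - 1] <= v1 <= a[i]:        # threshold v1 reproduces this split
            err = total_sq - i * lower ** 2 - (n - i) * upper ** 2
            if best is None or err < best[0]:
                best = (err, v1, 0.5 * (upper - lower))
    if best is None:                      # degenerate input: fall back to 1 bit
        return prefix[n] / n, 0.0
    return best[1], best[2]


def quantize_2bit(x, v1, v2):
    """Apply the foldable rule: s1 = sign(x), s2 = sign(x - v1 * s1)."""
    out = []
    for t in x:
        s1 = sign(t)
        s2 = sign(t - v1 * s1)
        out.append(v1 * s1 + v2 * s2)
    return out


x = [-3.0, -1.0, 1.0, 3.0]
v1, v2 = optimal_2bit(x)
assert quantize_2bit(x, v1, v2) == x      # four levels suffice for this sample
```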
3.2.3 Optimal ternary algorithm
The optimization domain of (8) for $k = 2$ over the scalars $v_1 \ge v_2 \ge 0$ is illustrated in Figure 2(right). The boundaries of the domain, $v_2 = 0$ and $v_1 = v_2$, correspond to 1-bit binary and ternary (li2016ternary) quantizations, respectively. The scaled ternary quantization maps each full precision value to $\{-2v, 0, 2v\}$, where $v = v_1 = v_2$. Ternary quantization needs 2 bits for representation. However, when hardware with sparse calculation support is available, for example as in EIE (han2016eie), using ternary quantization can be more efficient compared to general 2-bits quantization. In Appendix D we show that the optimal scaled ternary quantization is foldable and the scalar $v$ should satisfy:
$$v = \tfrac{1}{2}\,\mathbb{E}\big[|x| : |x| > v\big]. \qquad (12)$$
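For a finite sample, a least-squares ternary scalar can be found with a scan similar to the 2-bits case. The sketch below is an assumption-laden illustration: it takes the two scalars to be equal (call the common value `v`), so the levels are -2v, 0, and 2v, and it keeps a candidate only when half of the upper-group average separates the sorted absolute values at the chosen split. The function names are hypothetical.

```python
def sign(t):
    return 1 if t >= 0 else -1            # sign(0) = +1 is an assumption


def optimal_ternary(x):
    """Return v such that levels {-2v, 0, 2v} minimize the squared error on x."""
    a = sorted(abs(t) for t in x)
    n = len(a)
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):        # suffix sums of sorted absolute values
        suffix[i] = suffix[i + 1] + a[i]
    best_v, best_gain = 0.0, -1.0
    for i in range(n):                    # a[i:] map to 2v, a[:i] map to 0
        upper = suffix[i] / (n - i)
        v = 0.5 * upper
        low = a[i - 1] if i > 0 else 0.0
        if low <= v <= a[i]:              # threshold v reproduces this split
            gain = (n - i) * upper ** 2   # squared error = sum(a^2) - gain
            if gain > best_gain:
                best_v, best_gain = v, gain
    return best_v


def quantize_ternary(x, v):
    """Foldable form with equal scalars: v * sign(x) + v * sign(x - v * sign(x))."""
    return [v * sign(t) + v * sign(t - v * sign(t)) for t in x]


x = [0.1, -0.1, 2.0, -2.0]
v = optimal_ternary(x)
xq = quantize_ternary(x, v)               # small entries go to 0, large ones to +/-2v
```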
3.2.4 $k$-bits greedy algorithm
In this section, we propose a greedy algorithm to compute a $k$-bits scaled binary quantization, which we call Greedy Foldable (GF). It is given in Algorithm 1.
In the GF algorithm we compute a sequence of residuals. At each step, we greedily find the best $s_i$ and $v_i$ for the current residual using the optimal 1-bit binary quantization (5). Note that for $k = 1$ the GF is the same as the optimal 1-bit binary quantization.
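A minimal sketch of the greedy residual procedure, written from the prose description above rather than from Algorithm 1 itself. It assumes the optimal 1-bit step is the sign of the residual paired with its mean absolute value; the function names are illustrative.

```python
def sign(t):
    return 1 if t >= 0 else -1            # sign(0) = +1 is an assumption


def greedy_foldable(x, k):
    """Return (scales, signs): k scalars and k sign vectors built greedily."""
    residual = list(x)
    scales, signs = [], []
    for _ in range(k):
        v = sum(abs(r) for r in residual) / len(residual)   # 1-bit optimal scalar
        s = [sign(r) for r in residual]                      # 1-bit optimal signs
        residual = [r - v * si for r, si in zip(residual, s)]
        scales.append(v)
        signs.append(s)
    return scales, signs


def reconstruct(scales, signs):
    n = len(signs[0])
    return [sum(v * s[j] for v, s in zip(scales, signs)) for j in range(n)]


x = [-3.0, -1.0, 1.0, 3.0]
scales, signs = greedy_foldable(x, 2)
assert reconstruct(scales, signs) == x    # this sample is recovered exactly with 2 bits
```

Because each sign vector is the sign of the running residual, the result satisfies the foldable structure by construction.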
A few other papers have tackled $k$-bits binary quantization to train quantized DNNs. In ReBNet (ghasemzadeh2018rebnet), the authors proposed an algorithm similar to Algorithm 1, but considered the $v_i$'s as trainable parameters to be learned by back-propagation. lin2017towards and zhang2018lq find a $k$-bits binary quantization via alternating optimization for the $v_i$'s and $s_i$'s. Note that all these methods produce sub-optimal solutions.
4 Training binary networks
The loss functions in our quantized neural networks are non-differentiable due to the sign function in the quantizers. To address this challenge we use the training algorithm proposed in courbariaux2015binaryconnect. To compute the gradient of the sign function we use the Straight Through Estimator (STE) (bengio2013estimating). During training we keep the full precision weights and use Stochastic Gradient Descent (SGD) to gradually update them in back-propagation. In the forward pass, only the quantized weights are used.
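The interplay between full precision weights, quantized forward pass, and STE can be sketched on a single linear unit. This is a toy illustration under explicit assumptions: the STE variant shown passes the gradient through only where the input magnitude is at most 1 (a common choice in the BNN literature, assumed here rather than taken from this paper), the loss is a squared error, and all names are hypothetical.

```python
def sign(t):
    return 1 if t >= 0 else -1            # sign(0) = +1 is an assumption


def ste_backward(w, grad_out, clip=1.0):
    """Straight-through gradient of sign: identity where |w| <= clip, else zero."""
    return [g if abs(t) <= clip else 0.0 for t, g in zip(w, grad_out)]


def sgd_step(w, x, y, lr=0.1):
    """One SGD update of the full precision weights w for loss (wq . x - y)^2."""
    wq = [sign(t) for t in w]                             # forward uses quantized weights
    pred = sum(a * b for a, b in zip(wq, x))
    grad_wq = [2.0 * (pred - y) * xi for xi in x]         # d loss / d wq
    grad_w = ste_backward(w, grad_wq)                     # pass through the sign
    return [t - lr * g for t, g in zip(w, grad_w)]        # update full precision copy


w = [0.5, -2.0]
w_new = sgd_step(w, x=[1.0, 1.0], y=2.0)
# The first weight moves toward +1; the second is outside the clip range and stays.
```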
During the training we compute quantizers (for both weights and activations) using the online statistics, i.e., the scalars in a -bits scaled binary quantization (6) are computed based on the observed values. During the training we also store the running average of these scalars. During inference we use the stored quantized scalars to improve the efficiency. This procedure is similar to the update of the batch normalization parameters in a standard DNN training (ioffe2015batch).
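The bookkeeping described above resembles running statistics in batch normalization. A minimal sketch for the 1-bit case follows, assuming an exponential moving average with a hypothetical `momentum` parameter (the averaging rule is not specified in the text) and the mean absolute value as the online scalar.

```python
class ScaleTracker:
    """Tracks a running average of the quantization scalar across training batches."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum          # assumed averaging rule, not from the paper
        self.running = None

    def update(self, batch):
        """Training: compute the scalar from the batch and fold it into the average."""
        v = sum(abs(t) for t in batch) / len(batch)
        if self.running is None:
            self.running = v
        else:
            self.running = self.momentum * self.running + (1.0 - self.momentum) * v
        return v                          # the online value is used for this batch

    def inference_scale(self):
        """Inference: reuse the stored scalar instead of recomputing statistics."""
        return self.running


tracker = ScaleTracker(momentum=0.5)
tracker.update([1.0, -1.0])
tracker.update([3.0, -3.0])
stored = tracker.inference_scale()
```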
5 Experiments
We conduct experiments on the ImageNet dataset (deng2009imagenet) using the ResNet-18 architecture (he2016deep). The details of the architecture and training are provided in Appendix E.
We conduct three sets of experiments: (1) evaluate quantization error of activations of a pre-trained DNN, (2) evaluate the quantization error based on the classification accuracy of a post-training quantized network, and (3) evaluate the classification accuracy of during-training quantized networks. We report the quantization errors of the proposed binary quantization algorithms (optimal 1-bit, 2-bits, ternary, and the greedy foldable quantizations) and compare with the state-of-the-art algorithms BWN-Net (rastegari2016xnor), XNor-Net (rastegari2016xnor), TWN-Net (li2016ternary), DoReFa-Net (zhou2016dorefa), ABC-Net (lin2017towards), and LQ-Net (zhang2018lq).
5.1 Quantization error of activations
To quantify the errors of the introduced binary quantization algorithms we adopt the analysis performed by anderson2017high. They show that the angle between $\mathbf{x}$ and $\mathbf{x}^q$ can be used as a measure of the accuracy of a quantization scheme. They prove that when $\mathbf{x}^q = \mathrm{sign}(\mathbf{x})$ and the elements of $\mathbf{x}$ are i.i.d. $\mathcal{N}(0, 1)$, the angle converges to approximately 37 degrees for large $N$.
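The angle measure is easy to reproduce on synthetic data. The sketch below draws a standard normal vector, binarizes it with the sign function, and computes the angle in degrees; the sample size and seed are arbitrary choices made here. In the Gaussian limit the cosine of the angle tends to the square root of 2/pi, which corresponds to roughly 37 degrees.

```python
import math
import random


def angle_degrees(x, xq):
    """Angle between vectors x and xq, in degrees."""
    dot = sum(a * b for a, b in zip(x, xq))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_q = math.sqrt(sum(b * b for b in xq))
    cos = max(-1.0, min(1.0, dot / (norm_x * norm_q)))    # guard against rounding
    return math.degrees(math.acos(cos))


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(20000)]
xq = [1.0 if t >= 0 else -1.0 for t in x]                 # 1-bit sign quantization
theta = angle_degrees(x, xq)
limit = math.degrees(math.acos(math.sqrt(2.0 / math.pi)))  # Gaussian large-N limit
print(f"measured {theta:.2f} degrees, limit {limit:.2f} degrees")
```

Scaling the quantized vector by a positive scalar does not change the angle, so the same measure applies to scaled binary quantization.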
Here we use the real data distribution. We trained a full precision network and computed the activation tensors at each layer for a set of 128 images. In Figure 4 we show the angle between the full precision and quantized activations for different layers. When the optimal quantization is used, a significant reduction in the angle is observed compared to the greedy algorithm. The optimal 2-bits quantization is even better than the greedy 4-bits quantization for later layers of the network, for which activation tensors have a more skewed distribution, making quantization in the form of (6) harder. Furthermore, the accuracy of the optimal quantization has less variance with respect to different input images and different layers of the network.
Figure 4: The angle between the full precision and the quantized activations for different layers of a trained full precision ResNet-18 architecture on ImageNet. The 95% confidence interval over different input images is shown.
5.2 Post-training quantization
In this section we apply post-training quantization to the weights of a pre-trained full precision network. We then use the quantized network for inference and report the classification accuracy. This procedure can result in an acceptable accuracy for a moderate number of bits (e.g., 16 or 8). However, the error significantly grows with a lower number of bits, which is the case in this experiment. Therefore, we only care about the relative differences between different quantization strategies. This experiment demonstrates the effect of quantization errors on the accuracy of the quantized DNNs.
The results are shown in the top half of Table 4. When the optimal 2-bits quantization is used, a significant accuracy improvement (more than one order of magnitude) is observed compared to the greedy 2-bits quantization, which illustrates the effectiveness of the optimal quantization.
5.3 During-training quantization
To achieve higher accuracy we apply quantization during the training, so that the model can adapt to the quantized weights and activations. In the bottom half of Table 4, we report the accuracies of the during-training quantized DNNs, all trained with the same setup. We use 1-bit binary quantization for weights, and use different quantization algorithms for activations. When quantization is applied during-training, significantly higher accuracies are achieved. Similar to the previous experiments the optimal quantization algorithm achieves a better accuracy compared to the greedy.
In Table 5 we report results from the related works in which the ResNet-18 architecture with quantized weights and/or activations is trained on the ImageNet dataset for the classification task. We report the mean and standard deviation of the model accuracy over 5 runs when our algorithms are used. Note that for 1-bit quantization the Greedy Foldable (GF) algorithm is the same as the optimal 1-bit binary quantization. In Opt* we used a 2× larger batch size compared to Opt but with the same number of optimization steps. As shown in Table 5, the proposed quantization algorithms match or improve the accuracies of the state-of-the-art BNNs.
|Method (ResNet-18 on ImageNet)||Val. top-1||Val. top-5|
|DoReFa-Net (zhou2016dorefa) (result taken from zhang2018lq)||2||1||53.4||-|
6 Conclusion
In this work, we analyze the accuracy of binary quantization to train DNNs with quantized weights and activations. We discuss methods to improve the accuracy of quantization, namely scaling and using more bits.
We introduce the rank-1 binary quantization as a general scaling scheme. Based on a singular value analysis we motivate using the scaled binary quantization, a computationally efficient scaling strategy. We define a general $k$-bits scaled binary quantization. We provide provably optimal 1-bit, 2-bits, and ternary quantizations. In addition, we propose a greedy $k$-bits quantization algorithm. We show results for post-training and during-training quantization, and demonstrate significant improvement in accuracy when optimal quantization is used. We compare the proposed quantization algorithms with state-of-the-art BNNs on the ImageNet dataset and show improved classification accuracies.
Appendix A Optimal rank-1 binary quantization
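In this section, we find the optimal rank-1 binary quantization of an $m$ by $n$ matrix $X$ that is discussed in Section 2:
$$\min_{S,\,Z} \|X - S \odot Z\|_F \quad \text{s.t.} \quad \mathrm{rank}(S) = 1,\; Z \in \{-1, +1\}^{m \times n}. \qquad (13)$$
First, observe that the element-wise multiplication by $-1$ and $+1$ does not change the Frobenius norm. Therefore, using $Z \odot Z = \mathbf{1}\mathbf{1}^\top$:
$$\|X - S \odot Z\|_F = \|Z \odot (X - S \odot Z)\|_F = \|Z \odot X - S\|_F. \qquad (14)$$
Furthermore, note that
$$\min_{\mathrm{rank}(S) = 1} \|Z \odot X - S\|_F^2 = \sum_{i=2}^{r} \sigma_i^2. \qquad (15)$$
Here $\sigma_i$ is the $i$'th singular value of $Z \odot X$ and $r$ is its rank. In addition, for any $Z$:
$$\sum_{i=1}^{r} \sigma_i^2 = \|Z \odot X\|_F^2 = \|X\|_F^2. \qquad (16)$$
Hence, to minimize the sum in (15) we need to find a $Z$ for which $\sigma_1$ is maximized:
$$\max_{Z} \sigma_1(Z \odot X) = \max_{Z} \|Z \odot X\|_2. \qquad (17)$$
Here $\|\cdot\|_2$ is the 2-norm of a matrix. Therefore:
$$\max_{Z} \|Z \odot X\|_2 = \max_{Z} \max_{\|\mathbf{a}\|_2 = \|\mathbf{b}\|_2 = 1} \mathbf{a}^\top (Z \odot X)\,\mathbf{b}. \qquad (18)$$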
For any and we have since for we have . Here is the element-wise absolute value of . Note that for and with positive values the inequality becomes an equality. Therefore:
Observe that the element-wise absolute value does not change the vector norm, i.e. , and hence is a unit vector when is. Also for any we have since for we have . So we have
Therefore, we showed that $Z = \mathrm{sign}(X)$ and $S$ equal to the best rank-1 approximation of $|X|$ (i.e., the first term in its SVD) is a solution of (13).
For an $X$ with i.i.d. $\mathcal{N}(0, 1)$ entries we show the singular values and the first left and right singular vectors of $|X|$ in Figure 6. Observe that the first singular value of $|X|$ captures most of the energy. The fraction of energy captured by the first component of the SVD converges to the squared mean of the standard folded normal distribution ($2/\pi \approx 0.64$) for large square matrices. Also note that the first left and right singular vectors are almost constant, i.e., they can be written as $c\,\mathbf{1}$ for some scalar $c$.
Appendix B Optimal scaled binary quantization
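In this section, we solve the following optimization problem corresponding to the optimal 1-bit scaled binary quantization as discussed in Section 3.1:
$$\min_{v,\,s} \int \big(v\,s(x) - x\big)^2 p(x)\,dx, \quad v \ge 0,\; s: \mathbb{R} \to \{-1, +1\}. \qquad (21)$$
Here $p$ is a probability distribution function. First, observe that, since $s(x)^2 = 1$ and $v \ge 0$:
$$\big(v\,s(x) - x\big)^2 = v^2 - 2v\,s(x)\,x + x^2 \ge v^2 - 2v|x| + x^2 = \big(v - |x|\big)^2. \qquad (22)$$
Therefore, the optimal choice for the function $s$ is $s(x) = \mathrm{sign}(x)$. So we can rewrite (21) as follows:
$$\min_{v} \int \big(v - |x|\big)^2 p(x)\,dx. \qquad (23)$$
Setting the gradient of the objective function in (23) with respect to $v$ to zero, we get:
$$v = \int |x|\,p(x)\,dx = \mathbb{E}|x|. \qquad (24)$$
Hence, we showed that the optimal 1-bit scaled binary quantization maps $x$ to $\mathbb{E}|x|\,\mathrm{sign}(x)$.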
Appendix C Optimal 2-bits binary quantization
In this section, we solve the following optimization problem corresponding to the optimal 2-bits binary quantization as discussed in Section 3.2.2:
First, we show that the optimal 2-bits binary quantization is foldable, i.e., and . Observe that
The inequality in (26) holds because , and therefore, . The objective function in (25) is a weighted average of with non-negative weights. For the inequality is strict if . In that case, flipping the value of both and reduces to a strictly smaller value . Hence, the optimal solution of (25) should satisfy for all .
For any and if we consider , the problem reduces to the 1-bit binary quantization for . Based on the result showed in Appendix B for the optimal solution we have . This completes the proof to show that the optimal 2-bits binary quantization is foldable.
Next, we find the optimal values for and . If we substitute and in (25) we can decompose into four segments and write:
Here is the error as a function of and , and is the folded distribution function. Assuming the optimal point occurs in the interior of the domain, it should satisfy the zero gradient condition: . Taking derivative from (27) with respect to and and set it to zero we get:
We can simplify (28) and rewrite and in terms of the following conditional expectations:
Hence, the optimal values of $v_1$ and $v_2$ can be obtained by solving for $v_1$ and $v_2$ in (29). The optimization domain has two boundaries. One is when $v_2 = 0$. This reduces the problem to 1-bit binary quantization. The optimal solution in that case is discussed in Appendix B. The other boundary is when $v_1 = v_2$. This results in the ternary quantization. The optimal solution in this case is discussed in Appendix D.
Appendix D Optimal ternary quantization
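In this section, we find the optimal symmetric ternary quantization, that is, a mapping of a full precision value $x$ to the discrete set $\{-2v, 0, 2v\}$ as discussed in Section 3.2.3. Finding the optimal mapping can be formulated as the following optimization problem:
$$\min_{v,\,s_1,\,s_2} \int \big(v\,s_1(x) + v\,s_2(x) - x\big)^2 p(x)\,dx. \qquad (30)$$
First, the above form is a special case of (25), hence, using the same argument as in Appendix C we can show that there is a foldable optimal solution: $s_1(x) = \mathrm{sign}(x)$ and $s_2(x) = \mathrm{sign}(x - v\,s_1(x))$. Then the total error as a function of $v$ can be written as:
$$e(v) = \int_{0}^{v} x^2\,\tilde{p}(x)\,dx + \int_{v}^{\infty} (x - 2v)^2\,\tilde{p}(x)\,dx, \qquad (31)$$
where $\tilde{p}$ is the folded probability distribution function (the distribution of $|x|$). Taking the derivative of (31) with respect to $v$ and setting it to zero, we get:
$$v = \tfrac{1}{2}\,\mathbb{E}\big[|x| : |x| > v\big]. \qquad (32)$$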
Appendix E Details of training ResNet on ImageNet
In this section, we explain the details of how the DNN results reported in this paper are produced. All results correspond to the ResNet-18 architecture trained on the ImageNet dataset for the classification task. We use the standard training and validation splits of the ImageNet dataset. We followed a similar architecture to XNor-Net (rastegari2016xnor). The convolutional block that we use is depicted in Figure 1(left). We use the ReLU non-linearity before the batch normalization as suggested by (rastegari2016xnor). Also, we find it important to use a bounded dynamic range, and therefore clip the values to . We use 2, 3, 5, and 8 for 1, 2, 3, and 4 bits quantizations, respectively. Similar to the other BNNs, for the first and last layers we use full precision. Also, as suggested by choi2018bridging, we use full precision short-cuts in the ResNet architecture, which adds a small computational/memory overhead. We quantize weights per filter and activations per layer. As in cai2017deep, we use a first-order polynomial learning-rate annealing schedule (from to ) and train for 120 epochs. We do not use weight decay. For the data augmentation we use the standard methods used to train the full precision ResNet architecture. For training we apply random resize and crop to 224×224, followed by random horizontal flipping, color jittering, and lighting. For testing we resize the images to 256×256 followed by a center crop to 224×224.