Any-Precision Deep Neural Networks (AAAI 2021)
We present Any-Precision Deep Neural Networks (Any-Precision DNNs), which are trained with a new method that makes the learned DNNs flexible in numerical precision during inference. At runtime, the same model can be directly set to different bit-widths by truncating the least significant bits, supporting a dynamic speed and accuracy trade-off. When all layers are set to low bit-widths, we show that the model achieves accuracy comparable to dedicated models trained at the same precision. This property facilitates flexible deployment of deep learning models in real-world applications, where trade-offs between model accuracy and runtime efficiency are often sought. Previous literature presents solutions to train models at each individual fixed efficiency/accuracy trade-off point, but how to produce a model that is flexible in runtime precision is largely unexplored. When the demand for an efficiency/accuracy trade-off varies from time to time or even changes dynamically at runtime, it is infeasible to re-train models accordingly, and the storage budget may forbid keeping multiple models. Our proposed framework achieves this flexibility without performance degradation. More importantly, we demonstrate that this achievement is agnostic to model architectures. We experimentally validated our method with different deep network backbones (AlexNet-small, Resnet-20, Resnet-50) on different datasets (SVHN, Cifar-10, ImageNet) and observed consistent results. Code and models will be available at https://github.com/haichaoyu.
While state-of-the-art deep learning models can achieve very high accuracy on various benchmarks, runtime cost is another crucial factor to consider in practice. In general, the capacity of a deep learning model is positively correlated with its complexity. As a result, accurate models mostly run slower, consume more power, and have larger memory footprint as well as model size. In practice, it is inevitable to balance efficiency and accuracy to get a good trade-off when deploying any deep learning models.
A number of approaches have been proposed to address this issue from different perspectives. We observe active research [20, 2, 4] on more efficient deep neural network architectures to support practical usage [15, 32, 26, 35]. Others adaptively modify general deep learning model inference, dynamically determining the execution during the feed-forward pass to save computation at the cost of a potential accuracy drop [10, 28, 31, 29].
Besides these explorations, another important line of research proposes a low-level solution: using fewer bits to represent a deep learning model and its runtime data to achieve a largely reduced runtime cost. It has been shown in the literature that full precision is over-abundant in many applications, so that we can use 8-bit or even 4-bit models without obvious performance degradation.
Some previous works went further in this direction. For example, BNN, XNOR-Net, and others [6, 25, 36] use as little as 1 bit for both the weights and activations of deep neural networks to reduce power usage, memory footprint, running time, and model size. However, ultra-low-precision models always show an obvious accuracy drop. While many methods have been proposed to improve the accuracy of low-precision models, so far we see no silver bullet. Stepping back from uniformly ultra-low-precision models, mixed-precision models have been proposed to serve as a better trade-off [30, 9]. Effective ways have been found to train accurate models with some layers processing in ultra-low precision and some layers in high precision. We illustrate these different paradigms in Figure 1.
When we look at this spectrum of deep learning models in terms of its numerical precision, from full-precision at one end to low-precision at the other, and mixed-precision in between, we have to admit that efficiency/accuracy trade-off always exists in reality and to deploy a model in a specific application scenario we have to find the right trade-off point. Previous methods can provide a specific operating point but what if we demand flexibility as well? It would be a highly favorable property if we can dynamically change the efficiency/accuracy trade-off point given a single model. Preferably, we want to be able to adjust the model, without the need of re-training or re-calibration, to run in high accuracy mode when resources are sufficient and switch to low accuracy mode when resources are limited.
In this paper, we propose a method to train deep learning models to be flexible in numerical precision, namely Any-Precision deep neural networks. After training, we can freely quantize the model layers into various precision levels, without fine-tuning or calibration and without any data. When running in low precision, full precision, or any precision level in between, the model achieves accuracy comparable to models specifically trained under the matched settings. Furthermore, given a fixed computational budget, it can potentially find a better operating point than one trained rigorously.
To summarize, our contributions are:
We introduce the concept of Any-Precision DNN. In runtime we can quantize its layers into different bit-width. Its accuracy changes smoothly with respect to its precision level without drastic performance degradation;
We propose a novel model-agnostic method to train Any-Precision DNN and validate its effectiveness with multiple widely used benchmarks and with multiple neural network architectures;
Recent progress in deep learning inference hardware motivates research on using low-bit integers instead of floating-point values to represent network weights and activations. Binarized Neural Networks and XNOR-Net are early works in this direction that use only 1 bit to represent the weights and activations in DNNs. When training these 1-bit networks, a floating-point copy of the parameters is maintained under the hood to calculate approximated gradients. Usually a sign function is used to quantize the floating-point copy to binary values in the feed-forward pass. Using only 1-bit numerical precision leads to an obvious drop in accuracy in most scenarios. Zhou et al. proposed DoReFa-Net to train with arbitrary bit-widths in weights, activations, and gradients. Since the gradients are also low-bit, a proper implementation could accelerate both the forward and backward passes.
One of the essential problems in learning low-precision DNNs is the quantization operator. Quantization of the real-valued parameters in the feed-forward pass and approximation of the gradients through the quantization operator in the backward pass heavily influence the final model accuracy. For example, the sign function adopted in Binarized NN discards the value distribution variations across layers and hurts performance. In XNOR-Net, a scaling factor is added to each layer to minimize the information loss. Choi et al. proposed a parameterized clipping activation to support arbitrary-bit quantization of activations. Zhang et al. pointed out that having a uniform quantization pattern across layers is suboptimal and proposed a learnable quantizer for each layer to improve the model accuracy.
In the backward pass, most prior works use the Straight-Through Estimator (STE) to approximate the gradients over the quantizers. Cai et al. proposed to use a half-wave Gaussian quantization operator to replace the sign function for better learning efficiency, and a piece-wise continuous function in the backpropagation step to alleviate the gradient mismatch issue in the prior design. Liu et al. also attacked the gradient mismatch problem by introducing a piecewise polynomial function to approximate the sign function. Another interesting recent work from Ding et al. addressed this problem by introducing a new loss function over the value distribution of layer activations.
Besides the performance gap to full-precision models, training binary networks has been reported to be unstable. Tang et al. carefully analyzed the training process and concluded that using the PReLU activation function, a low learning rate, and bipolar regularization on weights could lead to a more stable training process with a better optimum. Zhuang et al. looked at the overall training strategy and proposed a progressive training process. They suggested first training the net with quantized weights and then with quantized activations, first training with high precision and then with low precision, and jointly training the low-bit model with the full-precision one.
A similar joint training strategy has been observed to be effective in this work as well. Since our work is along a direction orthogonal to low-precision DNN training and design, our method can complement these techniques to train better and more flexible DNNs.
The accuracy drop of ultra-low-bit models and emerging hardware designs motivate research on training DNNs with mixed precision. Although most of the knowledge from training low-precision DNNs can be transferred to mixed-precision training as well, an open question is how to specify the bit-width of each layer for both weights and activations. Given a fixed computational budget, the number of potential configurations is exponentially large.
Zhou et al. proposed to find the configuration by solving an optimization problem where the prospective accuracy drop is added as a constraint. They revealed how noise on the feature map relates to accuracy degradation, then estimated the effect of parameter quantization errors in individual layers on the overall model prediction accuracy. Wang et al. proposed to use reinforcement learning to determine the quantization policy. The policy takes the layer configuration and statistics as input to predict the bit-width of weights and activations. When learning the policy, feedback from the hardware is taken into consideration through a hardware simulator generating latency and energy signals. Dong et al. presented a novel second-order quantization method to select the bit-width of each layer as well as the fine-tuning order of layers, based on the layer's Hessian spectrum.
Quantization of a pre-trained model with fine-tuning or calibration on a dataset is another related research topic in the area. Although methods in this area work on a different problem from ours, we partially share the motivation of having flexible quantization control at runtime. Without special treatment, many models collapse even at 8-bit precision in post-training quantization. One recent work from Nagel et al. identified two issues leading to the large accuracy drop: the large variation in the weight ranges across channels, and biased output errors due to quantization errors affecting the following layers. With their method, they are able to alleviate the bias and equalize the weight ranges by rescaling and reparameterization. In this paper, the model we produce can be readily quantized into lower precision without further processing.
In the research area of deep neural network architecture search, the slimmable neural networks by Yu et al. are related to ours in terms of methodology. They presented a method to train a single neural network with an adjustable number of channels in each layer at runtime. Their exploration is limited to the search space of network architectures instead of weights.
Neural networks are generally constructed layer by layer. We denote the input to the $i$-th layer in a neural network model as $x_i$, the weights of the layer as $w_i$, and the biases as $b_i$. The output from this layer can be calculated as

$$y_i = F_i(x_i \mid w_i, b_i). \tag{1}$$

Without loss of generality, we take one channel in a fully-connected layer as a concrete example in the following description and drop the subscript $i$ for simplicity, i.e.,

$$y = w \cdot x + b, \tag{2}$$

where $x, w \in \mathbb{R}^D$ and $b$ is a scalar.
For better computation efficiency, we would like to avoid the floating-point dot product of $D$-dimensional vectors. Instead, we use $N$-bit fixed-point integers to represent the weights as $w_Q$ and the input activations as $x_Q$. Hereafter, we assume $w_Q$ and $x_Q$ are stored as signed integers in their bitwise format. Note that in some related works, elements of $w_Q$ and $x_Q$ could be represented as vectors of $\{-1, 1\}$, and the conversion between these two formats is trivial. With $N$-bit integer weights and activations, as discussed in prior art [36, 25], the computation can be accelerated by leveraging bit-wise operations (and, xnor, bit-count), or even dedicated DNN hardware.
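To make the bit-wise acceleration concrete, here is a minimal Python sketch of the 1-bit case. It assumes each vector holds values in {-1, +1} and is packed into an integer with bit 1 encoding +1 and bit 0 encoding -1; `pack_signs` and `xnor_dot` are hypothetical helper names used only for illustration, not functions from any released code.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int (bit i set means values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_dot(a_bits, b_bits, length):
    """Dot product of two packed {-1,+1} vectors using XNOR and bit-count."""
    mask = (1 << length) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 where the two signs match
    matches = bin(agree).count("1")     # bit-count (popcount)
    return 2 * matches - length         # (#matches) - (#mismatches)


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))         # ordinary dot product
fast = xnor_dot(pack_signs(a), pack_signs(b), len(a))
```

Both `reference` and `fast` evaluate to the same integer; the packed version replaces the multiply-accumulate loop with one XOR, one negation, and one bit-count.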
Early works show that adding a layer-wise real-valued scaling factor $s$ can largely help reduce the output range variation and hence achieve better model accuracy. Since the scaling factor is shared across channels within the same layer, the added computational cost is marginal. Following this setting, with the quantized weights and inputs, we have

$$y' = s \, (w_Q \cdot x_Q) + b. \tag{3}$$

The activations $y'$ are then quantized into $N$-bit fixed-point integers $y_Q$ as the input to the next layer.
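The scaled integer dot product described above can be sketched as follows. This is only an illustration under the assumption that the weights and inputs are already integer codes and that a single scalar scale is applied after the integer accumulation; `quantized_linear` is a hypothetical name.

```python
def quantized_linear(w_q, x_q, scale, bias):
    """One output channel: integer dot product, then one real-valued scale and bias."""
    acc = sum(w * x for w, x in zip(w_q, x_q))   # integer-only accumulation
    return scale * acc + bias                    # scaling applied once, after the sum


w_q = [3, -1, 2]      # integer weight codes (illustrative values)
x_q = [1, 4, 2]       # integer activation codes (illustrative values)
y_raw = quantized_linear(w_q, x_q, scale=0.5, bias=0.25)
```

Because the scale multiplies the accumulated sum rather than each product, the inner loop stays in integer arithmetic.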
We will discuss our quantization functions in detail in the next section. Here we describe the runtime of a trained Any-Precision DNN.
Once training is finished, we can keep the weights at a higher precision level for storage, for example, at 8-bit. As shown in Figure 2, we can simply quantize the weights into a lower bit-width by bit-shifting. We experimentally observe that, with the proposed training framework, the model accuracy changes smoothly and is consistently on par with, or even better than, dedicated models trained at the same bit-width.
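As a rough sketch of the runtime bit-shifting step, the snippet below truncates stored 8-bit codes to a lower bit-width by dropping least-significant bits. It assumes, for simplicity, that the stored weights are unsigned codes in [0, 255]; the function name and the sample values are illustrative only.

```python
def truncate_weights(stored_codes, target_bits, stored_bits=8):
    """Keep only the `target_bits` most significant bits of each stored code."""
    shift = stored_bits - target_bits
    return [code >> shift for code in stored_codes]


stored = [0, 37, 128, 200, 255]            # 8-bit codes (illustrative values)
four_bit = truncate_weights(stored, 4)     # codes now lie in [0, 15]
one_bit = truncate_weights(stored, 1)      # codes now lie in [0, 1]
```

No data or re-training is involved in this step: the lower-precision weights are a pure function of the stored ones.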
A number of quantization functions have been proposed in the literature for weights and activations respectively. Given a pre-trained DNN model, one can quantize its weights into low bit-widths and apply a certain quantization function to the activations accordingly. However, when the number of bits gets smaller, the accuracy quickly drops due to the rough approximation in weights and large variations in activations. The most widely adopted framework to obtain a low-bit model is quantization-aware training, which the proposed method follows.
We take the same fully-connected layer as an example. In training, we maintain floating-point weights $w$ for the actual layer weights $w_Q$. In the feed-forward pass, given input $x_Q$, we follow Equation 3 to compute the raw output $y'$. Prior art shows the importance of the batch normalization (BN) layer in low-precision DNN training, and we follow accordingly: $y'$ is passed into a BN layer and then quantized into $y_Q$ as the input to the next layer.
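We use a uniform quantization strategy similar to Zhou et al. with a scaling factor to approximate the weights. Given the floating-point weight $w$, we first apply the tanh function to normalize it into $[-1, 1]$ and then transform it into $w' \in [0, 1]$, i.e.,

$$w' = \frac{\tanh(w)}{2 \max(|\tanh(w)|)} + 0.5. \tag{4}$$

We then quantize the normalized value into $N$-bit integers $w'_Q$ and a scaling factor $s'$, where

$$w'_Q = \mathrm{INT}(\mathrm{round}(w' \cdot \mathrm{MAX}_N)), \qquad s' = 1 / \mathrm{MAX}_N. \tag{5}$$

Hereafter $\mathrm{MAX}_N$ denotes the upper bound of an $N$-bit integer and $\mathrm{INT}(\cdot)$ converts a floating-point value into an integer.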
Finally, the values are re-mapped back to approximate the range of the floating-point values, yielding the integer weights $w_Q$ and the layer-wise scaling factor $s$. This re-mapping uses $\mathbb{E}(|w|)$, the mean of the absolute values of all floating-point weights in the same layer. Eventually, we approximate $w$ with $s \, w_Q$ and execute the feed-forward pass with the quantized weights as shown in Equation 3; the scaling factor can be applied after the dot product of the integer vectors.
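The weight quantization steps above can be sketched in plain Python as follows. This is one plausible reading rather than the authors' exact formulation: it assumes the upper bound of an N-bit integer is 2^N - 1, that the re-mapping is `2 * code - max_n` with scale `mean(|w|) / max_n`, and that at least one weight is nonzero. All names are hypothetical.

```python
import math


def quantize_weights(weights, n_bits):
    """Return (integer weights w_q, scale s) so that s * w_q approximates the weights."""
    max_n = 2 ** n_bits - 1                               # assumed N-bit upper bound
    t = [math.tanh(w) for w in weights]
    t_max = max(abs(v) for v in t)                        # assumes some weight is nonzero
    normalized = [v / (2 * t_max) + 0.5 for v in t]       # values in [0, 1]
    codes = [int(round(v * max_n)) for v in normalized]   # unsigned N-bit codes
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    w_q = [2 * c - max_n for c in codes]                  # assumed re-mapping to signed range
    scale = mean_abs / max_n                              # assumed layer-wise scale
    return w_q, scale


weights = [0.5, -1.0, 2.0, -0.1]
w_q_1bit, scale_1bit = quantize_weights(weights, 1)
w_q_8bit, scale_8bit = quantize_weights(weights, 8)
```

With one bit, the integer weights reduce to signs and the scale to the mean absolute weight, which matches the intuition of a binary network with a per-layer scaling factor.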
In the backward pass, gradients are computed with respect to the underlying floating-point variable $w$, and updates are applied to $w$ as well. In this way, the relatively unreliable and subtle signals are accumulated gradually, which stabilizes the overall training process. Since not all operations involved are smooth functions that support back-propagation, we use the straight-through estimator (STE) to approximate the gradients. For example, the round operation in Equation 5 has zero derivative almost everywhere. With STE, we assign

$$\frac{\partial\, \mathrm{round}(x)}{\partial x} := 1.$$
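For activation quantization in the feed-forward pass, we obtain the $N$-bit fixed-point representation by first clipping the value to be within $[0, 1]$ and then rounding it onto the $N$-bit grid, i.e., multiplying by $\mathrm{MAX}_N$, rounding to an integer, and scaling back by $1/\mathrm{MAX}_N$.
In practice, we only calculate the integer part as $y_Q$ and absorb the constant scaling factor into the persistent network parameters in the next layer.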
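A minimal sketch of this activation quantizer is given below. It assumes a clipping range of [0, 1] and an upper bound of 2^N - 1 for an N-bit integer; the function returns the integer part and the constant scale separately, mirroring the remark that the scale can be absorbed downstream. The name `quantize_activation` is hypothetical.

```python
def quantize_activation(y, n_bits):
    """Clip to [0, 1], then return (integer code, constant scale 1 / max_n)."""
    max_n = 2 ** n_bits - 1                  # assumed N-bit upper bound
    clipped = min(max(y, 0.0), 1.0)          # assumed clipping range [0, 1]
    code = int(round(clipped * max_n))       # integer part passed to the next layer
    return code, 1.0 / max_n


code_2bit, scale_2bit = quantize_activation(0.7, 2)
code_4bit, _ = quantize_activation(1.5, 4)    # above the range, clipped to 1.0
code_8bit, _ = quantize_activation(-0.3, 8)   # below the range, clipped to 0.0
```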
Let $L$ denote the final loss function. The gradient with respect to the activation is then approximated by back-propagating through the quantizer with the chain rule, where the gradient of the round function is approximated with STE to be 1.
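The effect of the straight-through estimator can be illustrated with a toy scalar example. The sketch below is not the training procedure of the paper: it assumes a single latent floating-point weight, a plain `round` quantizer, a squared-error loss, and hand-written gradients. It only shows how small updates accumulate in the latent value until the quantized value flips.

```python
def train_latent_weight(w, target, steps, lr=0.1):
    """Gradient descent on a latent float weight seen through a round() quantizer."""
    history = []
    for _ in range(steps):
        w_q = float(round(w))               # forward pass uses the quantized value
        grad_wq = 2.0 * (w_q - target)      # d/dw_q of (w_q - target)^2
        grad_w = grad_wq * 1.0              # STE: treat d round(w) / dw as 1
        w -= lr * grad_w                    # the update is applied to the latent weight
        history.append((w, w_q))
    return history


history = train_latent_weight(w=0.2, target=1.0, steps=4)
quantized_path = [w_q for _, w_q in history]
```

The quantized value stays at 0 for the first two steps while the latent weight moves from 0.2 toward 0.6, then flips to 1 and the gradient vanishes. With the true derivative of `round` (zero almost everywhere), the latent weight would never move.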
In prior low-precision models, the bit-width $N$ is fixed during the training process. At runtime, if we alter $N$, the model accuracy drops drastically. To encourage flexibility in the produced model, we propose to dynamically change $N$ within the training stage to align the training and inference processes. However, the distribution of activations varies under different bit-widths $N$, especially when $N$ is small (e.g., 1-bit), as shown in Figure 3. As a result, without special treatment, the dynamically changed $N$ creates conflicts in learning, and the model fails to converge in our experiments.
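One widely adopted technique to adjust the internal feature/activation distribution is Batch Normalization (BatchNorm). It works by normalizing the layer output across the batch dimension as follows:

$$\hat{y}_i = \frac{y_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad i = 1, \dots, B,$$

where $B$ is the batch size, $i$ denotes the index within the current batch, and $\epsilon$ is a small value added to avoid numerical issues. $\mu$ and $\sigma^2$ are the mean and variance, respectively, defined as

$$\mu = \frac{1}{B} \sum_{i=1}^{B} y_i, \qquad \sigma^2 = \frac{1}{B} \sum_{i=1}^{B} (y_i - \mu)^2.$$

During training, the BatchNorm layer keeps calculating running averages for $\mu$ and $\sigma^2$, i.e.,

$$\mu_t = \lambda \mu_{t-1} + (1 - \lambda)\, \mu, \qquad \sigma_t^2 = \lambda \sigma_{t-1}^2 + (1 - \lambda)\, \sigma^2,$$

where $\mu_{t-1}$ and $\sigma_{t-1}^2$ are the values before the current update, and the decay rate $\lambda$ is a hyper-parameter set a priori. But even with the BatchNorm layer, a dynamically changed $N$ leads to failure of convergence in training due to the value distribution variations shown in the toy example in Figure 3.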
In our proposed framework, we adopt dynamically changed BatchNorm layers to work with different $N$ in training. More specifically, assume we have a list of $K$ bit-width candidates $\{n_k\}_{k=1}^{K}$; we keep $K$ copies of the BatchNorm layer parameters and internal states $\{\Phi_k\}_{k=1}^{K}$. When the current training iteration works with $N = n_k$, we reset the BatchNorm layers with data from $\Phi_k$ and update the corresponding copy.
A similar technique has been adopted by Yu et al. when dealing with varied network architectures. Parameters of all BatchNorm layers are kept after training and used in inference. Note that compared with the total number of network parameters, the additional amount from the BatchNorm layers is negligible. We summarize the proposed method in Algorithm 1.
With the proposed algorithm, we can train a DNN that is flexible for runtime bit-width adjustment.
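The idea of keeping one set of BatchNorm statistics per candidate bit-width can be sketched as below. This is a simplified scalar illustration, not the authors' implementation: the class and function names are hypothetical, the affine parameters are omitted, and the running-average convention (decay applied to the old value) is an assumption.

```python
class SwitchableBatchNorm:
    """Hypothetical helper: one copy of running statistics per candidate bit-width."""

    def __init__(self, bit_widths, decay=0.9, eps=1e-5):
        self.decay = decay
        self.eps = eps
        self.stats = {n: {"mean": 0.0, "var": 1.0} for n in bit_widths}
        self.active = bit_widths[0]

    def switch(self, n_bits):
        """Select which copy of the statistics is used and updated."""
        self.active = n_bits

    def forward_train(self, batch):
        mean = sum(batch) / len(batch)
        var = sum((v - mean) ** 2 for v in batch) / len(batch)
        s = self.stats[self.active]          # only the active copy is updated
        s["mean"] = self.decay * s["mean"] + (1 - self.decay) * mean
        s["var"] = self.decay * s["var"] + (1 - self.decay) * var
        return [(v - mean) / (var + self.eps) ** 0.5 for v in batch]

    def forward_eval(self, batch):
        s = self.stats[self.active]
        return [(v - s["mean"]) / (s["var"] + self.eps) ** 0.5 for v in batch]


def train_step(bn, candidates, pre_bn_activations):
    """One iteration: visit every candidate bit-width with its own BatchNorm copy."""
    outputs = {}
    for n in candidates:
        bn.switch(n)
        outputs[n] = bn.forward_train(pre_bn_activations[n])
    return outputs   # per-bit-width losses would be accumulated before one update


bn = SwitchableBatchNorm([1, 8])
outputs = train_step(bn, [1, 8], {1: [0.0, 2.0], 8: [0.9, 1.1]})
```

At inference time, calling `switch(n)` followed by `forward_eval` selects the statistics that were collected at the matching precision.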
Another optional component in our method is adding knowledge distillation in training. Knowledge distillation works by matching the outputs of two networks. In training a network, we can use a more complicated model or an ensemble of models to produce soft targets by adjusting the temperature of the final softmax layer and then use the soft targets to guide the network learning.
In our framework, we apply this idea by generating soft targets from a high-precision model. More specifically, in each training iteration, we first set the quantization bit-width to the highest candidate and run the feed-forward pass to obtain soft targets $y_{\text{soft}}$. Later, instead of accumulating the cross-entropy loss for each precision candidate, we use the KL divergence between the model prediction and $y_{\text{soft}}$ as the loss. In our experiments, we observe that in general knowledge distillation leads to better performance at low bit-widths.
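The distillation loss described above can be sketched with the standard library alone. The logits are made-up numbers, and the direction of the divergence (soft targets as the reference distribution) and the temperature of 1 are assumptions made for this illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


high_precision_logits = [2.0, 0.5, -1.0]    # illustrative output at the highest bit-width
low_precision_logits = [1.0, 0.8, -0.5]     # illustrative output at a lower bit-width

soft_targets = softmax(high_precision_logits)
kd_loss = kl_divergence(soft_targets, softmax(low_precision_logits))
```

The loss is zero only when the low-precision prediction matches the soft targets exactly, so minimizing it pulls every lower precision toward the high-precision output.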
|Dataset||Class Number||Image Number (Train/Test)|
We validate our method with several network architectures and datasets. These networks include an 8-layer CNN (referred to as Model C in prior work), AlexNet, Resnet-8, Resnet-20, and Resnet-50. The datasets include Cifar-10, Street View House Numbers (SVHN), and ImageNet. In Table 1 and Figure 4, we show details of these datasets.
We implement the whole framework in PyTorch. On Cifar-10, we train the AlexNet and Resnet-20 models for 200 epochs with an initial learning rate that is decayed by a constant factor at fixed epochs. On SVHN, the 8-layer CNN and Resnet-8 models are trained for 100 epochs, again with a step-decayed learning rate. We combine the training and extra training data on SVHN as our training dataset. All models on Cifar-10 and SVHN are optimized with the Adam optimizer without weight decay. On ImageNet, we train the Resnet-50 model for 120 epochs with a step-decayed learning rate and the SGD optimizer.
For all models, following Zhou et al., we keep the first and last layers real-valued. In training, we train the networks with bit-width candidates $\{1, 2, 4, 8, 32\}$. Note that when the bit-width is set to 32, it is a full-precision model in which we use floating-point weights and activations. In testing, we evaluate the model at each bit-width in the list. By default, we add knowledge distillation (KD) in training: we use the full-precision model to get soft targets as supervision in the low-precision iterations.
|Models||1 bit||2 bit||4 bit||8 bit||FP32|
We compare our method to very competitive baseline models at each precision level. For each bit-width we tested, we train a dedicated low-precision model following the same training pipeline with a fixed bit-width for weights and activations. We compare the accuracy obtained from our dedicated low-bit models to other recent works in this field to make sure the baseline models are competitive. For example, on Cifar-10, we check the accuracy of our 1-bit model against the one reported in the recent work from Ding et al.
As shown in Table 2, on all three datasets, the proposed Any-Precision DNN achieves comparable performance to the competitive dedicated models.
|Methods \ Runtime bit-width||1 bit||2 bit||3 bit||4 bit||5 bit||6 bit||7 bit||8 bit||FP32|
|Quantize Dedicated Models with bit-shifting|
|Quantize Dedicated Models with bit-shifting and BatchNorm calibration|
| Runtime bit-width | 1 bit | 2 bit | 4 bit | 8 bit |
|---|---|---|---|---|
| Quantize Dedicated Models with Bit-shifting | | | | |
| From 8-bit model | 0.104 | 0.116 | 0.436 | 74.3 |
| After BatchNorm Calibration | | | | |
| From 8-bit model | 0.164 | 0.126 | 6.91 | 74.3 |
|Test \ Train||1,2,4,8||1,8||2,8||4,8|
We compare our method to alternative post-training quantization methods. We experiment with Resnet-20 on Cifar-10. We evaluated two post-training strategies.
The first one directly quantizes dedicated models with bit-shifting. In other words, to obtain a lower-bit model from a trained higher-bit model, as we do with the Any-Precision DNN shown in Figure 2, we simply drop the least-significant bits of all weights. Unsurprisingly, this strategy fails dramatically on the challenging large-scale benchmark, as shown in Table 4. On the smaller Cifar-10 dataset, accuracy drops considerably when quantizing models to very low bit-widths, but for higher target runtime bit-widths the simple strategy proves effective as well (Table 3). We argue that this is because Resnet-20 has a relatively large capacity on Cifar-10, so that rough numerical precision works as well.
The second strategy follows the same bit-shifting to drop bits, with an added BatchNorm calibration process. In the calibration process, BatchNorm statistics are re-calculated by feed-forwarding a number of training samples. As shown in Table 3 and Table 4, the BatchNorm calibration helps a lot in low-bit settings. However, the accuracy is still much lower than that of our method. We can leverage this post-training calibration technique with the proposed framework to fill in the gaps of the training candidate bit-width list; i.e., after training for the 1, 2, 4, 8-bit precision levels, we can further calibrate the model under the remaining 3, 5, 6, 7-bit settings to get the missing copies of the BatchNorm layer parameters. As a result, at runtime we can freely choose any precision level from 1 to 8 bits.
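The calibration idea for untrained bit-widths can be sketched as follows. This is a simplified scalar illustration under several assumptions: `forward_at` is a hypothetical callable that returns the pre-BatchNorm activations of a batch at a given bit-width, statistics are pooled over all calibration samples, and the toy forward function merely rounds inputs onto an N-bit grid.

```python
def calibrate_batchnorm(activation_batches):
    """Re-estimate mean and variance from activations of a few training batches."""
    values = [v for batch in activation_batches for v in batch]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "var": var}


def fill_missing_bit_widths(stats, forward_at, batches, all_bits=range(1, 9)):
    """Add calibrated statistics for every bit-width that has no trained copy."""
    for n in all_bits:
        if n not in stats:
            stats[n] = calibrate_batchnorm([forward_at(n, b) for b in batches])
    return stats


def toy_forward(n_bits, batch):
    """Stand-in for a real forward pass: round inputs onto an N-bit grid."""
    max_n = 2 ** n_bits - 1
    return [round(v * max_n) / max_n for v in batch]


trained = {n: {"mean": 0.0, "var": 1.0} for n in (1, 2, 4, 8)}   # placeholder values
filled = fill_missing_bit_widths(trained, toy_forward, [[0.1, 0.9], [0.4, 0.6]])
```

Only the 3, 5, 6, and 7-bit entries are computed here; the copies obtained during training are left untouched.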
To understand how the dynamically changed BatchNorm layers help in our framework, we visualize the activation value distributions of several layers of an Any-Precision AlexNet. More specifically, we look at how the activation value distribution changes from the 2nd, 4th, to the 6th convolutional layers and the BatchNorm layers after them when the runtime precision level is set to each of four different bit-widths. As shown in Figure 5, when running at low-bit precision, the activation distribution is obviously off from the others after the convolutional layers; the following BatchNorm layer rectifies the distributions; then the next convolutional layer creates this distribution variation again. It is very clear that by keeping multiple copies of the BatchNorm layer parameters for different bit-widths, we can minimize input variations to the convolutional layers and hence have the same set of convolutional layer parameters support Any-Precision at runtime.
We study how the candidate bit-width list used in training the Any-Precision DNN influences the testing performance on other bit-widths.
Table 5 shows the test accuracy of models trained under different bit-width combinations. We observe that training with more candidate bit-widths generally leads to better generalization to the others, and that the candidate bit-width list should cover the extreme cases at runtime. For example, the 1,8-bit combination performs more stably across different runtime bit-widths than the 2,8-bit and 4,8-bit combinations. Since better coverage in training takes longer for the model to converge, this observation can guide the bit-width selection under limited training resources.
|Models||1 bit||2 bit||4 bit||8 bit||FP32|
We study the influence of different knowledge distillation strategies:
w/o KD: no KD is used as shown in Algorithm 1;
KD: the highest bit-width outputs are supervised by groundtruth. The others are supervised by the soft targets from the highest bit-width.
KD recursive: the highest bit-width outputs are supervised by groundtruth. Then every other bit-width outputs are supervised by the soft targets from the nearest superior bit-width.
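The three strategies differ only in which signal supervises each bit-width. The sketch below encodes that assignment; the strategy labels `"none"`, `"kd"`, and `"kd_recursive"` and the function name are hypothetical, and the candidate list is an illustrative choice.

```python
def supervisors(bit_widths, strategy):
    """Map each bit-width to its supervision source under a given KD strategy."""
    ordered = sorted(bit_widths, reverse=True)
    plan = {ordered[0]: "ground truth"}          # highest bit-width: always ground truth
    for higher, current in zip(ordered, ordered[1:]):
        if strategy == "none":
            plan[current] = "ground truth"       # w/o KD
        elif strategy == "kd":
            plan[current] = ordered[0]           # soft targets from the highest bit-width
        elif strategy == "kd_recursive":
            plan[current] = higher               # soft targets from the nearest superior
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return plan


candidates = [1, 2, 4, 8, 32]
plan_kd = supervisors(candidates, "kd")
plan_recursive = supervisors(candidates, "kd_recursive")
```

In the recursive variant the supervision forms a chain from the highest bit-width down to the lowest, whereas in the plain variant every lower bit-width is taught directly by the highest one.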
In Table 6, we observe that KD in general helps improve low-bit performance compared with training w/o KD. An interesting observation is that the KD variations even slightly outperform the dedicated models at 4-bit, 8-bit, and full precision. Our hypothesis is that, with joint training, the KD losses from the low-bit settings regularize the training to avoid overfitting.
In this paper, we introduce Any-Precision DNN to address the practical efficiency/accuracy trade-off dilemma from a new perspective. Instead of seeking a better operating point, we enable runtime adjustment of the model precision level to support a flexible efficiency/accuracy trade-off without additional storage or computation cost. The model can be stored at 8-bit or higher and run at lower bit-widths such as 1-bit or 2-bit. The model accuracy drops gracefully as the bit-width gets smaller. To train an Any-Precision DNN, we propose to have dynamic model-wise quantization in training and employ dynamically changed BatchNorm layers to align activation distributions across different bit-widths. We evaluate our method on three major image classification datasets with multiple network architectures. When running at low bit-widths by simply bit-shifting the pre-trained weights and quantizing the activations, our model achieves accuracy comparable to dedicatedly trained low-precision models.
Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
Neural networks for machine learning. Coursera, video lectures.
Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
Thirty-First AAAI Conference on Artificial Intelligence.