Learned Step Size Quantization

02/21/2019
by Steven K. Esser, et al.

We present here Learned Step Size Quantization, a method for training deep networks such that they can run at inference time using low precision integer matrix multipliers, which offer power and space advantages over high precision alternatives. The essence of our approach is to learn the step size parameter of a uniform quantizer by backpropagation of the training loss, applying a scaling factor to its learning rate, and computing its associated loss gradient by ignoring the discontinuity present in the quantizer. This quantization approach can be applied to activations or weights, using different levels of precision as needed for a given system, and requiring only a simple modification of existing training code. As demonstrated on the ImageNet dataset, our approach achieves better accuracy than all previous published methods for creating quantized networks on several ResNet architectures at 2-, 3- and 4-bit precision.



Code Repositories

lsq-net: Unofficial implementation of LSQ-Net, a neural network quantization framework.
Tiny-YOLO-LSQ: An implementation of YOLO using the LSQ network quantization method.
LSQ-implementation: An unofficial implementation of the paper "Learned Step Size Quantization" (ICLR 2020).
LSQFakeQuantize-PyTorch: FakeQuantize with Learned Step Size (LSQ+) as Observer in PyTorch.

1 Introduction

Deep networks are emerging as components of a number of revolutionary technologies, including image recognition (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), and driving assistance (Xu et al., 2017). Unlocking the full promise of such applications requires a system perspective where task performance, throughput, energy-efficiency, and compactness are all critical considerations to be optimized through co-design of algorithms and deployment hardware. Current research seeks to create deep networks that maintain high accuracy while reducing the precision needed to represent their activations and weights, thereby reducing the computation and memory required for their implementation. The advantages of using such low precision algorithms to create networks for low precision hardware have been demonstrated in several deployed systems (Esser et al., 2016; Jouppi et al., 2017; Qiu et al., 2016). Here we demonstrate a new method for training quantized networks that achieves significantly better performance than prior quantization approaches on the ImageNet dataset across several network architectures.

Our primary contribution is Learned Step Size Quantization (LSQ), a method for training low precision networks that uses the training loss gradient to learn the step size parameter of a uniform quantizer associated with each layer of weights and activations. While there are many methods of quantizing data, most approaches apply a sequence of operations that first scale the data, then round and clip the data to a set of integer values whose range is determined by the number of quantization bits, and then finally rescale the data back to its original range. Strictly speaking, this process is not differentiable due to the use of the round operator, but approximations can be employed to allow for backpropagation. Prior approaches that use backpropagation to learn parameters controlling quantization (Choi et al., 2018b, a; Jung et al., 2018) create a gradient approximation by beginning with the forward function for the quantizer, removing the round function from this equation, then differentiating the remaining operations. In contrast, our approach simply differentiates each operation of the quantizer forward function, passing the gradient through the round function, but allowing the round function to impact downstream operations in the quantizer for the purpose of computing their gradient. As we discuss below, these two methods lead to different gradients for the parameters controlling quantization, with our approach producing a gradient with finer responsiveness across the domain of the quantizer.

Many approaches have been explored for training networks for low precision inference. These have included quantizers where the mapping from inputs to discrete values is i) fixed based on user settings, ii) tuned using statistics from the data, iii) tuned by solving a quantizer error minimization problem during training, or iv) learned using backpropagation to train parameters controlling the quantization process. Uniform quantization is commonly employed, but a few approaches employ asymmetric (different quantization for positive and negative values) or nonuniform mappings, which introduce additional complexity to the quantization process by allowing variable bin sizes. Table 1 summarizes this prior work.


Method Quantizer Tuning Quantizer Distribution Reference
BNN Fixed Uniform (Hubara et al., 2016)
Eedn Fixed Uniform (Esser et al., 2016)
FAQ Fixed Uniform (McKinstry et al., 2018)
TWN Data Statistics Uniform (Li & Liu, 2016)
DoReFa Data Statistics Uniform (Zhou et al., 2016)
XNOR Data Statistics Uniform (Rastegari et al., 2016)
HWGQ Data Statistics Uniform (Cai et al., 2017)
Regularization QE Minimization Uniform (Choi et al., 2018c)
LQ-Nets QE Minimization Nonuniform (Zhang et al., 2018)
QIL Backprop Uniform (Jung et al., 2018)
NICE Backprop Uniform (Baskin et al., 2018)
Distillation Backprop Nonuniform (Polino et al., 2018)
LSQ (ours) Backprop Uniform
TTQ Backprop + Data Statistics Uniform, Asymmetric (Zhu et al., 2016)
Apprentice Backprop + Data Statistics Uniform, Asymmetric (Mishra & Marr, 2017)
PACT Backprop + Data Statistics Uniform (Choi et al., 2018b)
PACT-SAWB Backprop + Data Statistics Uniform (Choi et al., 2018a)
Table 1: Quantization approaches, specifying how a mapping is chosen from inputs to quantized values, and how the quantized values themselves are distributed. QE: Quantization Error, Backprop: Backpropagation.

Our approach falls in the school of directly learning quantization parameters through backpropagation, which has the appealing feature that it seeks a quantization that directly improves the metric of interest, the training loss. In comparison, fixed mapping schemes based on user settings, while attractive for their simplicity, offer no guarantee of optimizing network performance, and quantization error minimization schemes might perfectly minimize quantization error yet still be suboptimal if a different quantization mapping actually minimizes task error. The primary differences of our approach from previous work using backpropagation to learn the quantization mapping are the use of a different approximation to the quantizer gradient, described in detail in Section 2.1, and the application of a scaling factor to the learning rate of the parameters controlling quantization.

2 Methods

Our objective is to train a deep network such that, at inference time, the matrix multiplication operations used in its convolution and fully connected layers can be implemented using low precision integer operations. Our specific approach is to employ a uniform quantizer of the form:

$$\bar{v} = \lfloor \mathrm{clip}(v/s,\, -Q_N,\, Q_P) \rceil \qquad (1)$$
$$\hat{v} = \bar{v} \times s \qquad (2)$$

where $v$ is the data to quantize, $\bar{v}$ is a quantized, scaled integer representation of the data, $\hat{v}$ is the quantized output at the same scale as $v$, $s$ is a step size parameter, $\lfloor z \rceil$ rounds $z$ to the nearest integer, and $\mathrm{clip}(z, -Q_N, Q_P)$ is a signed clip function that returns $z$ with values below $-Q_N$ set to $-Q_N$ and values above $Q_P$ set to $Q_P$. For unsigned data, $Q_N = 0$ and $Q_P$ is the number of positive non-zero quantization levels; for signed data, $Q_P$ is the number of positive and $Q_N$ the number of negative non-zero quantization levels. Thus for an encoding using $b$ bits, $Q_N = 0$ and $Q_P = 2^b - 1$ for unsigned data, and $Q_N = Q_P = 2^{b-1} - 1$ for signed data (which uses one state less than available given $b$ bits so as to preserve symmetry about zero).
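To make equations 1 and 2 concrete, below is a minimal PyTorch sketch of the quantizer forward pass (the function name and signature are our own illustration, not the authors' code):

```python
import torch

def quantize(v: torch.Tensor, s: torch.Tensor, b: int, signed: bool):
    """Uniform quantizer of equations 1 and 2 (forward pass only).

    Returns (v_bar, v_hat): the scaled integer representation and the
    quantized value at the original scale of v.
    """
    if signed:
        Qn = Qp = 2 ** (b - 1) - 1      # symmetric about zero, one state unused
    else:
        Qn, Qp = 0, 2 ** b - 1          # unsigned data (e.g. post-ReLU activations)
    v_bar = torch.clamp(v / s, -Qn, Qp).round()   # equation 1
    v_hat = v_bar * s                             # equation 2
    return v_bar, v_hat
```

For example, with $b = 2$ and signed data, $Q_N = Q_P = 1$, so $\hat{v}$ can only take the values $-s$, $0$, and $s$.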

For inference, we envision computing equation 1 for weights ($w$) offline, for activations ($x$) online, and using the resulting $\bar{w}$ and $\bar{x}$ values as input to the low precision integer matrix multiplication units underlying convolution or fully connected layers. The output of such layers can then be rescaled by the step size for weights ($s_w$) and activations ($s_x$) using a relatively low cost, high precision scalar-tensor multiplication, a step that can potentially be merged with other operations such as the affine transform associated with batch normalization. A simple visualization of this computational flow is provided in Figure 1.


Figure 1: Data flow diagram for computations involved for a single low precision convolution or fully connected layer, as envisioned here. Boxes without borders denote low precision integer data or operations.
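As a rough illustration of the Figure 1 dataflow, the sketch below emulates a low precision fully connected layer, reusing the quantize helper sketched above; the integer tensors are emulated with floats, and the function name and default bit widths are illustrative assumptions rather than part of the method:

```python
import torch
import torch.nn.functional as F

def low_precision_linear(x, w, s_x, s_w, b_x=2, b_w=2):
    """Sketch of the Figure 1 dataflow for a fully connected layer.

    w_bar would be computed offline and x_bar online; both hold integer
    values (emulated here with float tensors), so the matrix multiplication
    could run on a low precision integer unit.  A single high precision
    scalar rescale by s_x * s_w brings the output back to the original scale.
    """
    x_bar, _ = quantize(x, s_x, b_x, signed=False)   # activations: online, unsigned
    w_bar, _ = quantize(w, s_w, b_w, signed=True)    # weights: offline, signed
    y_int = F.linear(x_bar, w_bar)                   # integer-by-integer matmul
    return y_int * (s_x * s_w)                       # high precision scalar-tensor rescale
```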

To train our network, we follow the now widely adopted practice for training quantized networks of maintaining and updating full precision weights, but using quantized weights and activations for the forward pass. For simplicity during training, we use $\hat{w}$ and $\hat{x}$ as input to matrix multiplication layers, which is algebraically equivalent to our proposed inference operations described in the preceding paragraph. For the gradient through the quantization function applied to activations, we use a straight through estimator for the round function gradient (Bengio et al., 2013), and set the gradient of clipped values to zero:

$$\frac{\partial \hat{v}}{\partial v} =
\begin{cases}
1 & \text{if } -Q_N < v/s < Q_P \\
0 & \text{otherwise}
\end{cases} \qquad (3)$$

For the gradient through the quantizer applied to weights, we also use a straight through estimator for the round function, but pass the gradient completely through the clip function, as this avoids weights becoming permanently stuck in the clipped range:

$$\frac{\partial \hat{v}}{\partial v} = 1 \qquad (4)$$
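A common way to realize the gradients of equations 3 and 4 in an autograd framework is a detach-based straight through estimator; the following PyTorch sketch is our own illustration (the step size $s$ is treated as a fixed constant here; learning $s$ is the subject of Section 2.1):

```python
import torch

def quantize_activation(x, s, Qp):
    """Forward: eqs. 1-2 for unsigned activations.
    Backward w.r.t. x: eq. 3 (gradient 1 inside the clip range, 0 for clipped values)."""
    q = torch.clamp(x / s, 0, Qp)            # clip blocks the gradient for clipped values
    q = (q.round() - q).detach() + q         # straight through estimator for the round
    return q * s

def quantize_weight(w, s, Qn, Qp):
    """Forward: eqs. 1-2 for signed weights.
    Backward w.r.t. w: eq. 4 (gradient passes completely through round and clip)."""
    q = torch.clamp(w / s, -Qn, Qp).round()
    return ((q - w / s).detach() + w / s) * s   # value is q * s, gradient path is w
```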

2.1 Learned Step Size Quantization

The step size parameter determines the specific mapping of high precision values to quantized values, which can have a large impact on network performance (in the worst case, an arbitrarily large step size would map all values to zero). Since our objective during learning is to minimize training loss, we choose to learn the step size in a way that also seeks to minimize this loss, specifically by treating $s$ as a parameter to be learned using standard backpropagation. Backpropagating to the step size parameter requires an approximation of the gradient of the quantizer output with respect to step size, $\partial \hat{v} / \partial s$, which is otherwise undefined due to the discontinuous nature of the quantizer.

Here, we propose to approximate $\partial \hat{v} / \partial s$ by using a straight through estimator for the round function in the quantizer, as used in computing $\partial \hat{v} / \partial v$ above, and applying standard differentiation for the remaining operations in the quantizer, including setting the gradient of clipped values to zero. It should be noted that we do not remove the round function itself where it would naturally appear in the gradient equation (specifically, the component of $\partial \hat{v} / \partial s$ derived from the appearance of $\bar{v}$ in equation 2 is dependent on this round function):

$$\frac{\partial \hat{v}}{\partial s} =
\begin{cases}
-v/s + \lfloor v/s \rceil & \text{if } -Q_N < v/s < Q_P \\
-Q_N & \text{if } v/s \le -Q_N \\
Q_P & \text{if } v/s \ge Q_P
\end{cases} \qquad (5)$$

This gradient is visualized in Figure 2.


Figure 2: Quantizer output $\hat{v}$ and our approximated gradient of the quantizer output with respect to the step size parameter, $\partial \hat{v} / \partial s$, shown for $s = 1$ and A) $Q_N = Q_P = 1$, corresponding to use of 2-bit weights, B) $Q_N = Q_P = 3$, corresponding to 3-bit weights and, for the positive portion, 2-bit activations, and C) $Q_N = Q_P = 7$, corresponding to 4-bit weights and, for the positive portion, 3-bit activations.

To provide a deeper understanding of this gradient, it can be noted that $s$ appears twice in the quantizer. In equation 1, $s$ appears as a divisor inside the round function, where it determines which integer valued quantization bin ($\bar{v}$) each real valued input is assigned to. The gradient with respect to $s$ at this location provides the first term in equation 5 when $-Q_N < v/s < Q_P$, and has no impact when $v/s$ falls outside this range, as we do not pass the gradient to this appearance of $s$ for values clipped in the forward pass. The negative sign on this term reflects the fact that as $s$ in equation 1 increases, there is a chance that $\bar{v}$ will drop to a lower magnitude bin.

In equation 2, $s$ appears as a multiplier outside of the round function, where it determines the real value ($\hat{v}$) to which each integer value $\bar{v}$ maps. The gradient with respect to this appearance of $s$ provides the second term in equation 5 when $-Q_N < v/s < Q_P$, and the entire gradient when $v/s$ is clipped. The associated positive sign on these terms reflects the fact that as $s$ in equation 2 increases, $\hat{v}$ increases in proportion to $\bar{v}$.

Prior approaches using backpropagation to learn quantization controlling parameters (Choi et al., 2018b, a; Jung et al., 2018) completely remove the round operation when differentiating the forward pass, equivalent in our derivation to removing the round function from equation 5, so that for $-Q_N < v/s < Q_P$ the two terms cancel and $\partial \hat{v} / \partial s = 0$. This provides a coarser approximation of the gradient, one drawback of which is that $\partial \hat{v} / \partial s = 0$ everywhere $v/s$ falls inside the clip range. However, consider that if $v/s$ sits just below a transition point between quantization bins, a very small decrease in $s$ will cause the corresponding $\bar{v}$ to jump to the next bin and $\hat{v}$ to change by approximately $s$, suggesting that the gradient in this region should be non zero.
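Equations 3, 4 and 5 can be implemented together as a custom autograd function; the sketch below is our own construction with assumed names and interface, not the authors' released code:

```python
import torch

class LSQQuantize(torch.autograd.Function):
    """Forward: equations 1-2.  Backward: equation 3 or 4 for the data
    gradient (depending on is_weight) and equation 5 for the step size."""

    @staticmethod
    def forward(ctx, v, s, Qn, Qp, is_weight):
        ctx.save_for_backward(v, s)
        ctx.Qn, ctx.Qp, ctx.is_weight = Qn, Qp, is_weight
        v_bar = torch.clamp(v / s, -Qn, Qp).round()      # eq. 1
        return v_bar * s                                 # eq. 2

    @staticmethod
    def backward(ctx, grad_out):
        v, s = ctx.saved_tensors
        Qn, Qp = ctx.Qn, ctx.Qp
        q = v / s
        inside = (q > -Qn) & (q < Qp)
        # eq. 5: -v/s + round(v/s) inside the range, -Q_N or Q_P when clipped
        d_s = torch.where(inside, q.round() - q,
                          torch.where(q <= -Qn,
                                      torch.full_like(q, float(-Qn)),
                                      torch.full_like(q, float(Qp))))
        grad_s = (grad_out * d_s).sum().reshape(s.shape)
        # eq. 4 for weights (pass through the clip), eq. 3 for activations
        grad_v = grad_out if ctx.is_weight else grad_out * inside.to(grad_out.dtype)
        return grad_v, grad_s, None, None, None
```

Because the round function is kept inside the expression for d_s, the step size gradient returned here matches equation 5, including the $-Q_N$ and $Q_P$ terms for clipped values.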

For this work, each layer of weights has a distinct $s$, and each layer of activations has a distinct $s$; thus the number of step size parameters in a given network is equal to the number of quantized weight layers plus the number of quantized activation layers. Step size is initialized to 1 for activations, and to the average absolute value of the weights for weight layers. We implemented and tested LSQ in PyTorch.
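One possible packaging of the above, with a per-layer step size parameter and the initialization just described, is sketched below; the lazy initialization of the weight step size on the first forward pass is our own implementation choice, and LSQQuantize refers to the sketch above:

```python
import torch
import torch.nn as nn

class LsqQuantizer(nn.Module):
    """One learnable step size per quantized weight or activation layer."""

    def __init__(self, bits, is_weight):
        super().__init__()
        self.is_weight = is_weight
        if is_weight:   # signed, symmetric: one state unused to preserve symmetry
            self.Qn = self.Qp = 2 ** (bits - 1) - 1
        else:           # unsigned activations
            self.Qn, self.Qp = 0, 2 ** bits - 1
        self.s = nn.Parameter(torch.ones(1))      # activations: initialized to 1
        self.initialized = not is_weight

    def forward(self, v):
        if not self.initialized:                  # weights: initialized to mean |w|
            with torch.no_grad():
                self.s.copy_(v.abs().mean())
            self.initialized = True
        return LSQQuantize.apply(v, self.s, self.Qn, self.Qp, self.is_weight)
```

A convolution or fully connected layer would then hold one LsqQuantizer for its weights and one for its input activations.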

2.2 Step Size Learning Rate Scale

We note that for a given layer, if the updates to $s$ as a result of learning are large relative to the changes to the quantized values $\hat{v}$, then changes to $\hat{v}$ could become highly correlated, driven by the single source $s$. This would likely impact the quality of learning, as it is of course desirable that each $\hat{v}$ in a given population changes independently to promote diverse responses to different inputs.

The magnitude of a parameter update for a given mini-batch in stochastic gradient descent is proportional to its gradient with respect to the training loss, $L$. For $s$, this gradient is:

$$\frac{\partial L}{\partial s} = \sum_i \frac{\partial L}{\partial \hat{v}_i} \frac{\partial \hat{v}_i}{\partial s} \qquad (6)$$

where the sum is over all elements in the corresponding layer. This summation over the population could result in a $\partial L / \partial s$ much larger than a typical $\partial L / \partial \hat{v}_i$, particularly if the values of $\partial \hat{v}_i / \partial s$ are highly correlated, leading to changes in $s$ with a magnitude greatly differing from the changes to $\hat{v}$. To prevent this imbalance from leading to instability in learning, we introduce a step size learning rate scale hyperparameter that is simply a multiplier on the learning rate used for $s$. For simplicity, we use a single such hyperparameter for all weight layers, and a single such hyperparameter for all activation layers. We found that properly setting the learning rate scale was critical to good performance and chose values via experiment on a 2-bit ResNet-18 network (see Section 3.1).
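In practice, the learning rate scale can be applied by placing the step size parameters in their own optimizer parameter groups; the sketch below is illustrative, with placeholder attribute names (weight_quant.s, act_quant.s) and placeholder scale values rather than the tuned values of Section 3.1:

```python
import torch

def build_optimizer(model, lr=0.01, weight_scale=1.0, act_scale=1.0):
    """Plain SGD with separate learning rates for step size parameters.

    weight_scale / act_scale play the role of the step size learning rate
    scale hyperparameters; the default values here are placeholders.
    """
    w_steps, a_steps, others = [], [], []
    for name, p in model.named_parameters():
        if name.endswith("weight_quant.s"):      # assumed module attribute names
            w_steps.append(p)
        elif name.endswith("act_quant.s"):
            a_steps.append(p)
        else:
            others.append(p)
    return torch.optim.SGD(
        [{"params": others},
         {"params": w_steps, "lr": lr * weight_scale},
         {"params": a_steps, "lr": lr * act_scale}],
        lr=lr, momentum=0.9, weight_decay=1e-4)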

2.3 Fine Tuning

All of the quantized networks in this paper use fine tuning, that is, they are initialized using weights from a trained full precision network with equivalent architecture before training in the quantized space. This approach has been explored in a number of works as a means to improve training of quantized networks (Sung et al., 2015; Zhou et al., 2016; Mishra & Marr, 2017; McKinstry et al., 2018).

2.4 Dataset

All experiments in this paper were conducted on the ImageNet dataset (Russakovsky et al., 2015). To allow hyperparameter exploration without knowledge of the final validation set, we split the ImageNet training set into two subsets by moving 50 training images from each class into a new dataset we call train-v, used for validation during hyperparameter sweeps, and using the remaining training images for another dataset we call train-t, used for the corresponding training. All results in this paper use the standard ImageNet training and validation sets, except where it is explicitly noted that they use train-v and train-t.
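For reference, a sketch of how such a split could be built from a standard ImageNet train directory (one folder per class); the paths, helper name, and use of copying rather than moving are our own assumptions, while the 50 images per class follows the text:

```python
import os, random, shutil

def split_imagenet_train(train_dir, train_t_dir, train_v_dir, per_class=50, seed=0):
    """Place `per_class` images per class in train-v and the rest in train-t."""
    rng = random.Random(seed)
    for cls in sorted(os.listdir(train_dir)):
        src = os.path.join(train_dir, cls)
        images = sorted(os.listdir(src))
        rng.shuffle(images)
        for subset, names in (("v", images[:per_class]), ("t", images[per_class:])):
            dst = os.path.join(train_v_dir if subset == "v" else train_t_dir, cls)
            os.makedirs(dst, exist_ok=True)
            for name in names:
                shutil.copy(os.path.join(src, name), os.path.join(dst, name))
```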

2.5 Training Procedure

All networks were trained using stochastic gradient descent with a momentum of 0.9, a softmax cross entropy loss function, and cosine learning rate decay (Loshchilov & Hutter, 2016) without restarts. Full precision controls were trained with a weight decay of 0.0001, with ResNet-18 and ResNet-34 using a learning rate of 0.1 and a batch size of 256, and ResNet-50 using a learning rate of 0.05 and a batch size of 128. Quantized networks were trained with a weight decay of 0.00005 for ResNet-18, and a weight decay of 0.0001 for the larger ResNet-34 and ResNet-50 as suggested in (McKinstry et al., 2018), all with an initial learning rate 10 times lower than that of the full precision networks and the same batch size as the full precision controls. Except where noted, all networks were trained for 90 epochs. Images were resized to 256 × 256, then a 224 × 224 crop was selected for training, with horizontal mirroring applied randomly with 0.5 probability. At test time, a 224 × 224 centered crop was chosen.
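In PyTorch, the recipe above corresponds to a fairly standard setup; a condensed sketch for the ResNet-18/34 case (hyperparameters taken from the text, everything else, including the function name and the absence of input normalization, is illustrative):

```python
import torch
import torchvision.transforms as T

def training_setup(model, full_precision, epochs=90):
    """Optimizer, cosine schedule, loss and transforms following Section 2.5
    (ResNet-18/34 settings; ResNet-50 used lr 0.05 and batch size 128)."""
    lr = 0.1 if full_precision else 0.01            # quantized: 10x lower initial lr
    wd = 1e-4 if full_precision else 5e-5           # 5e-5 for quantized ResNet-18;
                                                    # quantized ResNet-34/50 keep 1e-4
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=wd)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)  # no restarts
    loss_fn = torch.nn.CrossEntropyLoss()           # softmax cross entropy
    train_tf = T.Compose([T.Resize((256, 256)), T.RandomCrop(224),
                          T.RandomHorizontalFlip(0.5), T.ToTensor()])
    test_tf = T.Compose([T.Resize((256, 256)), T.CenterCrop(224), T.ToTensor()])
    return opt, sched, loss_fn, train_tf, test_tf
```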

Experiments were run using the pre-activation version of ResNet (He et al., 2016). We use high precision input activations and weights for the first and last layers, as this standard practice for quantized neural networks has been shown to have a large impact on performance.

3 Results

In the sections below, we first perform hyperparameter sweeps to determine the value of step size learning rate scale to use. Following this we look at the distribution of quantized data, examine quantization error, then compare LSQ to existing quantization methods across several network architectures.

3.1 Step Size Learning Rate Scale Hyperparameter Sweep

To select the activation step size learning rate scale, we trained 6 ResNet-18 networks with 2-bit activations and full precision weights for 9 epochs, setting the learning rate scale to a different member of the candidate set for each run, and using the ImageNet train-v and train-t subsets. We performed a similar sweep for the weight step size learning rate scale using 2-bit weights and full precision activations.

For activations, we found best performance with a step size learning rate scale of , with performance falling off steadily as this value was reduced (Figure 3A). For weights, we found a broad plateau in the hyperparameter space that provided good performance with a step size learning rate scale between and , with best performance at (Figure 3B). Based on this, for all further training we used an activation step size learning rate scale of and a weight step size learning rate scale of . In all remaining sections we used the real ImageNet train and validation sets.


Figure 3: Hyperparameter sweep of step size learning rate scale using 9 epoch training runs on ResNet-18. A) Sweep for activation step size, using 2-bit activations and full precision weights. B) Sweep for weight step size, using 2-bit weights and full precision activations. The corresponding numeric top-1 accuracy is given above each bar, with a number in parentheses indicating the value is too low for the bar to appear within the range shown.

3.2 Distribution of Quantized Data

We examined the distribution of quantized data in a trained ResNet-18 network with 2-bit activations and weights by computing a histogram of $\bar{v}$ for each layer over all data in the test set (Figure 4). We noted that activations were biased towards the zero bin, with of activations in the zero bin, a result not particularly surprising given the use of ReLU neurons in the network. Weights showed a slight preference for the zero bin, with of weights quantized to zero.


Figure 4: For a trained 2-bit ResNet-18, distribution of quantized integer activation values ($\bar{x}$) by layer (A) and averaged for all layers (B), and distribution of quantized integer weight values ($\bar{w}$) by layer (C) and averaged for all layers (D).
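The per-layer histograms can be gathered with a few lines; a sketch (bin handling and names are our own):

```python
import torch

def quantized_histogram(v_bar, Qn, Qp):
    """Count how often each integer quantization bin in [-Qn, Qp] is used."""
    counts = torch.bincount((v_bar + Qn).long().flatten(), minlength=Qn + Qp + 1)
    bins = torch.arange(-Qn, Qp + 1)
    return bins, counts.float() / counts.sum()     # normalized per-layer distribution
```

Applying this to $\bar{x}$ for each activation layer and $\bar{w}$ for each weight layer over the test set yields the distributions shown in Figure 4.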

3.3 Quantization Error

LSQ seeks a quantization that explicitly minimizes training loss and not quantization error (the distance between $v$ and $\hat{v}$ under some metric). We next sought to understand whether LSQ learns a final step size that also implicitly minimizes quantization error. For this purpose, for a given layer we define the final step size learned by LSQ as $\hat{s}$ and let $S$ be a set of discrete candidate step sizes spanning a range around $\hat{s}$. For each layer, on a single batch of test data we computed which value of $s \in S$ minimizes mean absolute error, $E[\,|\hat{v} - v|\,]$, mean square error, $E[(\hat{v} - v)^2]$, and Kullback-Leibler divergence, $D_{KL}(P \,\|\, Q) = E[\log P] - E[\log Q]$, where $P$ and $Q$ are probability distributions of $v$ and $\hat{v}$, respectively. For purposes of relative comparison, we ignore the first term of the Kullback-Leibler divergence, as it does not depend on $\hat{v}$, and approximate the second term as $-E[\log Q]$, where the expectation is over the sample distribution.
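In outline, the per-layer comparison can be reproduced as below; the candidate grid is our own choice, and the Kullback-Leibler term (which needs density estimates for $v$ and $\hat{v}$, e.g. from histograms) is omitted for brevity:

```python
import torch

def quantization_error_sweep(v, s_hat, Qn, Qp, metric="mse", num=100):
    """Return the candidate step size (as a multiple of s_hat) that minimizes a
    quantization error metric over a grid of candidate step sizes."""
    best_ratio, best_err = None, float("inf")
    for ratio in torch.linspace(0.1, 2.0, num):            # illustrative grid
        s = ratio * s_hat
        v_hat = torch.clamp(v / s, -Qn, Qp).round() * s    # eqs. 1-2
        if metric == "mae":
            err = (v_hat - v).abs().mean()                 # mean absolute error
        else:
            err = ((v_hat - v) ** 2).mean()                # mean square error
        if err < best_err:
            best_ratio, best_err = float(ratio), float(err)
    return best_ratio, best_err
```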

Per layer results of the above comparison are shown in Figure 5. To provide a condensed measurement of quantization error, we computed the absolute difference between $\hat{s}$ and the value of $s$ that minimizes each of the above metrics, and averaged this across weight layers and across activation layers (a difference of 0 would indicate that the learned value of step size also minimizes the corresponding quantization error metric). For activations, this difference was for mean absolute error, for mean square error, and for Kullback-Leibler divergence, while for weights, this difference was for mean absolute error, for mean square error, and for Kullback-Leibler divergence.


Figure 5: Mean square error (MSE), mean absolute error (MAE) and Kullback-Leibler divergence (KL) between $v$ and $\hat{v}$ for each layer, across a range of $s$ for activations (A) and weights (B). To facilitate comparison between layers, $s$ is normalized by the actual step size learned by LSQ ($\hat{s}$). The horizontal bar indicates where $s = \hat{s}$.

3.4 Comparison with Other Approaches

We used LSQ to train several ResNet variants where activations and weights both use 2, 3 or 4 bits of precision, and compared our results to published results of other approaches for training quantized networks. To facilitate comparison, we did not consider networks that used full precision for any layer other than the first and last. We trained full precision (baseline) versions of the networks we considered and found scores slightly higher than previously reported, which we attribute to our use of cosine learning rate decay (Loshchilov & Hutter, 2016).

On ResNet-18 (Table 2), ResNet-34 (Table 3), and ResNet-50 (Table 4), we found that LSQ achieved higher accuracy than all prior quantization approaches at the precisions we examined. Accuracy improved with higher precision, and at 4 bits exceeded baseline full precision accuracy on ResNet-18 (+0.4 top-1) and on ResNet-34 (+0.3 top-1), and nearly matched full precision accuracy on ResNet-50 (-0.2 top-1).


Precision Accuracy
Method (a,w) Top-1 Top-5
Baseline (Ours) 32, 32 70.5 89.6
LSQ (Ours) 4, 4 70.9 89.9
QIL 4, 4 70.1
FAQ 4, 4 69.8 89.1
LQ-Nets 4, 4 (nu) 69.3 88.8
PACT 4, 4 69.2 89.0
NICE 4, 4 69.8 89.2
DoReFa* 4, 4 67.5 87.6
Regularization 4, 4 67.3 87.9
LSQ (Ours) 3, 3 70.0 89.4
QIL 3, 3 69.2
LQ-Nets 3, 3 (nu) 68.2 87.9
PACT 3, 3 68.1 88.2
NICE 3, 3 67.7 88.2
DoReFa* 3, 3 67.5 87.6
LSQ (Ours) 2, 2 67.0 87.3
QIL 2, 2 65.7
LQ-Nets 2, 2 (nu) 64.9 85.9
PACT 2, 2 64.4 85.6
DoReFa* 2, 2 62.6 84.4
Regularization 2, 2 61.7 84.4
Table 2: Comparison of low precision ResNet-18 networks on ImageNet. An asterisk indicates results are reported in (Choi et al., 2018b), but not in the original paper. Under precision, "nu" indicates nonuniform quantization.

Precision Accuracy
Method (a,w) Top-1 Top-5
Baseline (Ours) 32, 32 74.1 91.8
LSQ (Ours) 4, 4 74.4 91.8
QIL 4, 4 73.7
NICE 4, 4 73.5 91.4
FAQ 4, 4 73.3 91.3
DoReFa* 4, 4 69.9 89.2
LSQ (Ours) 3, 3 73.8 91.4
QIL 3, 3 73.1
LQ-Nets 3, 3 (nu) 71.9 90.2
NICE 3, 3 71.7 90.8
DoReFa* 3, 3 69.9 89.2
LSQ (Ours) 2, 2 71.1 90.0
QIL 2, 2 70.6
LQ-Nets 2, 2 (nu) 69.8 89.1
DoReFa* 2, 2 67.1 87.3
Table 3: Comparison of low precision ResNet-34 networks on ImageNet. An asterisk indicates results are reported in (Choi et al., 2018b), but not in the original paper. Under precision, "nu" indicates nonuniform quantization.

Precision Accuracy
Method (a,w) Top-1 Top-5
Baseline (Ours) 32, 32 76.9 93.4
LSQ (Ours) 4, 4 76.7 93.3
NICE 4, 4 76.5 93.3
PACT 4, 4 76.5 92.6
FAQ 4, 4 76.3 92.9
LQ-Nets 4, 4 (nu) 75.1 92.4
DoReFa* 4, 4 71.4 89.8
LSQ (Ours) 3, 3 76.1 92.9
PACT 3, 3 75.3 92.6
NICE 3, 3 75.1 92.6
LQ-Nets 3, 3 (nu) 75.1 92.4
DoReFa* 3, 3 69.9 89.2
LSQ (Ours) 2, 2 73.2 91.3
PACT 2, 2 72.2 90.5
LQ-Nets 2, 2 (nu) 71.5 90.3
DoReFa* 2, 2 67.1 87.3
Table 4: Comparison of low precision ResNet-50 networks on ImageNet. An asterisk indicates results are reported in (Choi et al., 2018b), but not in the original paper. Under precision, "nu" indicates nonuniform quantization.

4 Conclusions

The results here demonstrate that LSQ exceeds the performance of all prior approaches for creating quantized networks across several architectures for classifying the ImageNet dataset. Interestingly, LSQ does not appear to minimize quantization error, whether measured using mean square error, mean absolute error, or Kullback-Leibler divergence. The approach itself is simple, requiring only a single additional parameter per weight or activation layer.

Looking to future work, it is likely possible to constrain the step size parameter to powers of 2 without a large degradation in performance. Such an approach would further simplify the hardware necessary for quantization by replacing the multiplications with bit shift operations. Though it remains to be demonstrated, it is possible that the relative insensitivity of LSQ to the specific step size learning rate scale might also mean that it is relatively tolerant to a range of step sizes, and thus a power of 2 constraint would have relatively little impact on performance.

This work is a continuation of a trend towards steadily reducing the number of bits of precision necessary to achieve good performance across a range of network architectures on ImageNet. It is noteworthy that this trend towards higher performance at lower precision strengthens the analogy between artificial neural networks and biological neural networks, which themselves employ synapses represented by perhaps a few bits of information (Bartol Jr et al., 2015) and single bit spikes that may be employed in small spatial and/or temporal ensembles to provide low bit width data representation. Of course, it is as yet unclear exactly how far this analogy can be taken. These recent results in the low precision space suggest that performing classification with neural networks requires dramatically fewer bits of precision than commonly believed even a few years ago, meaning that custom hardware able to operate at these reduced precision levels could gain considerably in energy efficiency and compactness without sacrificing task performance.

References