We propose a method of training quantization clipping thresholds for uniform symmetric quantizers using standard backpropagation and gradient descent. Our quantizers are constrained to use power-of-2 scale-factors and per-tensor scaling for weights and activations. These constraints make our methods better suited for hardware implementations. Training with these difficult constraints is enabled by a combination of three techniques: using accurate threshold gradients to achieve range-precision trade-off, training thresholds in log-domain, and training with an adaptive gradient optimizer. We refer to this collection of techniques as Adaptive-Gradient Log-domain Threshold Training (ALT). We present analytical support for the general robustness of our methods and empirically validate them on various CNNs for ImageNet classification. We are able to achieve floating-point or near-floating-point accuracy on traditionally difficult networks such as MobileNets in less than 5 epochs of quantized (8-bit) retraining. Finally, we present Graffitist, a framework that enables immediate quantization of TensorFlow graphs using our methods. Code available at https://github.com/Xilinx/graffitist .
Machine learning continues to be increasingly pervasive in applications that span the cloud to the edge. Low-precision quantization, such as uniform quantization between two clipping thresholds, is an important technique enabling low-power, high-throughput implementations and a mechanism for managing memory bandwidth requirements in neural network inference. However, this reduced precision leads to commensurate reductions in accuracy.
Retraining weights with quantization-in-the-loop is a useful technique to regain some lost accuracy. However, the quantization thresholds are typically fixed after initial calibration, which means that (a) they cannot adapt to changing weight and activation distributions during training, and (b) they are calibrated on local quantization errors, agnostic to the global neural network loss. We address both of these issues by treating thresholds as learnable parameters whose gradients can be computed through backpropagation and adjusted via gradient descent. Therefore, (a) our thresholds can be trained along with weights during quantized training, and (b) the gradients are computed on the overall loss, meaning the learned thresholds are better suited to the network as a whole.
The idea of learning clipping thresholds via gradient descent is not completely novel. For example, TensorFlow [1, 2] defines gradients with respect to min/max variables in their FakeQuant implementation [35, 34]. However, when computing the gradients they appear to bypass the round function even during forward pass evaluation, which has the mathematical effect of causing the thresholds to train to the minimum and maximum of the input distribution, rather than finding a good range-precision trade-off point. We discuss this in detail in Section 3.5. Only very recently (and independently of our work), others have published work [9] showing the need to keep the round function in the forward pass. We discuss this work at the end of Section 2.
Besides the general method to train quantization thresholds using accurate gradients discussed in Section 3, our work presents several other novel contributions. To our knowledge, it is the first work to demonstrate quantization threshold training with per-tensor and power-of-2 scaling constraints. These are practically useful limitations for maximizing hardware efficiency. To achieve this, we provide an easy-to-implement and fast-converging training scheme (ALT) that only requires training thresholds in the log-domain using an adaptive gradient optimizer. We demonstrate that our implementation and hyperparameter recommendations are robust, analytically in Section 4 and empirically in Section 6. Additionally, in Section 5 we present a framework for automatic quantization of TensorFlow graphs using our methods. Finally, we discuss insights from ALT training in Section 7.
The push to reduce power consumption for inference on the edge and improve throughput and/or latency for inference on the cloud has motivated strong interest in neural network quantization research in just the past few years.
Historically, some of the earlier work in quantization looked at low bit-width weights, and in some cases, activations. BinaryNet [7] proposed quantizing weights and activations to binary values of +1 and -1 and showed that weights could be trained using a straight-through estimator (STE) [4], where quantization is applied in the forward pass but approximated to a clipped identity function in the backward pass. XNOR-Nets [26] use a different network and binarization method with scale-factors based on the maximum per-channel activation values, to achieve better ImageNet performance. Ternary networks [22, 41] add another quantization level at 0, suggesting this helps significantly with accuracy. TTQ [41] also suggests using a codebook to map the two non-zero quantization levels to arbitrary values that can be trained with gradient descent by averaging the gradients of the weights within each quantization bucket.
Continuing the trend for higher accuracy, researchers revisited multi-bit quantization. Initially, the quantization range tended to be fixed, such as in DoReFa-Net [40] where elements are limited to [-1, 1] with the tanh function, or WRPN [25] which limits weights to [-1, 1] and activations to [0, 1] even during floating point training. To improve accuracy further, researchers considered non-fixed quantization ranges. HWGQ [5]
uses ReLU activation and learns a clipping range by minimizing L2 loss of pre- and post-quantized values. PACT [6] learns this ReLU clipping parameter $\alpha$ through gradient descent, using the gradient
$$\frac{\partial y}{\partial \alpha} = \begin{cases} 0 & x \in (-\infty, \alpha) \\ 1 & x \in [\alpha, +\infty) \end{cases} \tag{1}$$
derived using the STE. LQ-Nets [39] achieve state-of-the-art accuracy through a custom quantization error minimization (QEM) algorithm and non-uniform quantization scheme with decision levels derived from a small binary encoding basis. QIL [19] likewise introduces a custom quantization scheme based on trainable quantization intervals.
While more extravagant quantization schemes can be used to push accuracy to the limit, simpler, more realistic hardware-aware quantization schemes are becoming increasingly promising in the industry. NVIDIA’s TensorRT [24] proposed an 8-bit symmetric uniform per-tensor quantization with statically calibrated thresholds. They demonstrate the range-precision trade-off through local Kullback-Leibler (KL) divergence minimization and show good performance for traditional CNNs but do not explore retraining. Google’s TensorFlow Lite supports quantization-aware training [37] which is based on previous work [18] using asymmetric uniform quantization. They demonstrate learning the clipping thresholds through an exponential moving average of the min/max values of the input distributions seen during initial warm-ups on random batches of training data. This is consistent with their gradient definition [35] which does not allow for a range-precision trade-off, as seen from the quantizer transfer curves in Section 3.5. Google’s whitepaper [21] reviews the commercially relevant quantization schemes and design choices such as affine or symmetric uniform quantization, per-tensor or per-channel scaling, and batch normalization [15] considerations for quantization-aware training. IBM’s FAQ [23] initializes clipping thresholds to the 99.99th percentile for 8-bit and the 99.9th percentile for 4-bit, but does not train these thresholds.
FAT [10] does propose training the quantization thresholds through gradient descent while keeping the weights unchanged. They use an unlabeled dataset and train on a root-mean-square-error loss between the original and quantized networks. NICE [3] starts with a clamping parameter located a set number of standard deviations from the mean of the input distribution, and trains it using gradient descent on a derivative found using the STE in a formulation similar to (1).
Independently of our work, IBM’s LSQ [9] found a very similar gradient definition for the quantizer and uses backpropagation to train its parameters. However, our works differ in several interesting ways. For example, they learn the scale-factors directly and do not restrict them to power-of-2. Besides the evident implications for accuracy and hardware implementation, we show in Section 4 that this also has major implications for training stability due to scale dependence of learning rate. As a workaround to these stability issues, they require careful fine-tuning of hyperparameters and consequently retrain for 90 epochs compared to 5 epochs in our case. They also use high precision in the first and last layers to retain performance, as is common in the field. We suspect the high precision and lack of power-of-2 limitations allow for very high accuracy in their low bit-width experiments. Further, they do not explore quantization on more difficult networks such as MobileNets [14, 29]. We address these issues with a different gradient formulation in Section 3 and justify it analytically in Section 4.
A simple design choice for a uniform quantizer is one that uses an affine mapping between the real domain $r$ and the quantized domain $q$, such as
$$r = s \cdot (q - z) \tag{2}$$
where constants $s$ (scale-factor) and $z$ (zero-point) are the quantization parameters. Generally, $s$ is a positive real number, and $z$ is a quantized value that maps to the real zero. (Footnote 1: This formulation satisfies the domain-specific constraint, with neural networks and zero padding, that the real zero be exactly representable in the quantized domain [16, 18, 21].)
While the affine quantizer allows for a direct mapping from floating point values to integers (without the need for lookup tables), there is added cost due to special handling of zero-points and real-valued scale-factors (see Appendix A).
For efficient implementation on fixed-point hardware, we constrain our quantization scheme to use:
No zero-points: By setting the zero-point $z = 0$, the affine quantizer in (2) reduces to a symmetric quantizer:
$$r = s \cdot q \tag{3}$$
This allows us to drop cross-terms that otherwise arise from a matrix multiplication or convolution operation involving zero-points (see Appendix A.1). A natural consequence is that symmetric quantization can be less precise with highly asymmetric or skewed distributions.
Power-of-2 scale-factors: Instead of allowing real-valued scale-factors, we constrain them to the form $s = 2^{-f}$ (where $f$ is an integer denoting the fractional length; $f$ can be positive or negative). This enables scaling using simple bit-shifts without the overhead of a fixed-point multiplier operation (see Appendix A.2). Right bit-shifts are rounded to the nearest integer, with round-half-to-even to prevent bias.
Per-tensor quantization: Elements in a given weight or activation tensor are quantized using a single scale-factor $s$, commonly referred to as per-tensor quantization. While it is common practice to use per-channel quantization for networks with depthwise convolutions such as MobileNets, we find that per-tensor combined with ALT training works well for INT8, and that per-channel may only be necessary for such networks at lower bit-widths (e.g., INT4).
Mid-tread quantizer: In BinaryNet [7], weights and activations are quantized to +1 and -1 using a mid-rise quantization scheme based on a classification threshold at 0. As a result, 0 is not representable in the quantized domain. In contrast, vanilla hardware multipliers accept 0 as a valid input, so a quantization scheme which includes 0 in its quantized domain is more natural and makes better use of existing hardware. Therefore, we restrict our quantizers to mid-tread with classification thresholds at integers .
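To make the power-of-2 constraint above concrete, the following is a minimal sketch of how scaling by a power of 2 can be done with integer shifts and round-half-to-even, with no multiplier. The function name `shift_right_round_half_even` is hypothetical and is not taken from Graffitist or its hardware back end; the sketch only illustrates the rounding rule named in the text.

```python
def shift_right_round_half_even(value, shift):
    """Return value / 2**shift rounded to the nearest integer (ties to even).

    Uses only integer shifts, masks and adds. A non-positive shift is
    treated as a left shift, which is exact and needs no rounding.
    """
    if shift <= 0:
        return value << -shift
    floor = value >> shift                # arithmetic shift: floors toward -inf
    remainder = value - (floor << shift)  # always in [0, 2**shift)
    half = 1 << (shift - 1)
    # Round up when strictly above half, or exactly half with an odd floor.
    if remainder > half or (remainder == half and (floor & 1)):
        floor += 1
    return floor


# 5/2 = 2.5 and 7/2 = 3.5 are both ties; each goes to the nearest even integer.
examples = [shift_right_round_half_even(v, 1) for v in (5, 7, -5, -7)]
```

Because ties alternate between rounding down and rounding up, the rounding error has no systematic sign, which is the bias-avoidance property the text refers to.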
Having constrained our uniform quantizer to use linear mapping with no zero-points, power-of-2 scale-factors, per-tensor and mid-tread quantization, we can proceed to define its forward pass characteristics.
The quantization function $q(x; s)$ for a tensor $x$ is parameterized only by its scale-factor $s$, which depends on threshold $t$ and bit-width $b$ chosen for this tensor. (Footnote 2: While we fix $b$ for each tensor (at compile time) based on the footprint of the fixed-point hardware it maps to (albeit configurable), we allow $t$ (hence $s$) to be trained using backpropagation.) $q(x; s)$ performs quantization by applying four point-wise operations (in order): scale, round, saturate and de-quant, each of which is explained below.
Scale: Tensor elements are scaled such that the lowest power-of-2 larger than raw threshold $t$ (i.e., $2^{\lceil \log_2 t \rceil}$, where $\lceil \cdot \rceil$ denotes ceil) is mapped to the largest value supported in the quantized domain (i.e., $2^{b-1}$ if signed, or $2^b$ if unsigned). (Footnote 3: The ceil function ensures a power-of-2 scale-factor that is initially biased in the direction of having more elements within the clipping range.) Naturally, elements that fall outside the saturation threshold in either direction are clipped.
Round: The scaled tensor elements are rounded to the nearest integers using banker's rounding (round-half-to-even), denoted by $\lfloor \cdot \rceil$ in the equations below. This prevents an overall upward or downward bias which is known to impact end-to-end inference accuracy in neural networks [18].
Saturate: Once scaled and rounded, it is possible for some elements in the tensor to have exceeded the largest value supported in the quantized domain; such elements are clipped: $\mathrm{clip}(x; n, p) = \min(\max(x, n), p)$. Since we apply clipping to the scaled tensor, the clipping limits $(n, p)$ are constants independent of the real bounds. If the tensor is signed, we clip to $(-2^{b-1}, 2^{b-1} - 1)$ and if unsigned, we clip to $(0, 2^b - 1)$.
De-quant: The last step undoes the scaling step. We thereby emulate the effect of quantization while retaining the original scale of the input tensor.
Putting together the point-wise operations from above, the quantization function can be formally written as:
$$q(x; s) := \mathrm{clip}\left(\left\lfloor \frac{x}{s} \right\rceil; n, p\right) \cdot s \tag{4}$$
where $n = -2^{b-1}$, $p = 2^{b-1} - 1$ and $s = 2^{\lceil \log_2 t \rceil} / 2^{b-1}$ for signed data; $n = 0$, $p = 2^b - 1$ and $s = 2^{\lceil \log_2 t \rceil} / 2^b$ for unsigned data.
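As a minimal sketch of the four point-wise operations (scale, round, saturate, de-quant), the function below quantizes a single scalar in pure Python. It assumes clipping limits of $[-2^{b-1}, 2^{b-1}-1]$ for signed data and $[0, 2^b-1]$ for unsigned data, and a scale-factor equal to the power of 2 at or above the threshold divided by $2^{b-1}$ (signed) or $2^b$ (unsigned). The name `quantize` and its arguments are illustrative and are not the Graffitist API.

```python
import math


def quantize(x, log2_t, bits=8, signed=True):
    """Emulate fixed-point quantization of one element x.

    log2_t is the log2 of the raw clipping threshold; the ceil makes the
    effective threshold a power of 2.
    """
    if signed:
        n, p = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        s = 2.0 ** math.ceil(log2_t) / 2 ** (bits - 1)
    else:
        n, p = 0, 2 ** bits - 1
        s = 2.0 ** math.ceil(log2_t) / 2 ** bits
    scaled = x / s                      # scale
    rounded = round(scaled)             # round: Python rounds half to even
    clipped = min(max(rounded, n), p)   # saturate to the integer limits
    return clipped * s                  # de-quant back to the input scale


inside = quantize(0.3, 0.0)      # within range: snapped to the 1/128 grid
saturated = quantize(2.0, 0.0)   # above range: clipped to 127/128
```

With an 8-bit signed quantizer and a threshold of 1, the grid spacing is 1/128, so 0.3 maps to 38/128 and any input above the range saturates at 127/128.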
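In the backward pass we can use gradient descent to simultaneously update quantization thresholds and weights of the quantized network. To do this, we derive the local gradients of our quantizer $q(x; s)$ with respect to scale-factor $s$ and input $x$. By formulating the quantization function in (4) to perform clipping after scaling, the clipping limits $(n, p)$ are constants independent of $s$. Further, we (carefully) use the STE to approximate gradients of round/ceil to 1, without approximating round/ceil to be identity. Specifically, we define $\partial \lfloor x \rceil / \partial x = \partial \lceil x \rceil / \partial x = 1$, but $\lfloor x \rceil \neq x$ and $\lceil x \rceil \neq x$.
Considering the three cases of how $\lfloor x/s \rceil$ compares to $n$ and $p$, we re-write (4) as:
$$q(x; s) = \begin{cases} \lfloor x/s \rceil \cdot s & \text{if } n \le \lfloor x/s \rceil \le p \\ n \cdot s & \text{if } \lfloor x/s \rceil < n \\ p \cdot s & \text{if } \lfloor x/s \rceil > p \end{cases} \tag{5}$$
The local gradient with respect to scale-factor $s$ is:
$$\nabla_s q(x; s) = \begin{cases} \lfloor x/s \rceil - x/s & \text{if } n \le \lfloor x/s \rceil \le p \\ n & \text{if } \lfloor x/s \rceil < n \\ p & \text{if } \lfloor x/s \rceil > p \end{cases} \tag{6}$$
Noting that $\nabla_{(\log_2 t)} s = s \ln(2)$,
$$\nabla_{(\log_2 t)} q(x; s) = s \ln(2) \cdot \begin{cases} \lfloor x/s \rceil - x/s & \text{if } n \le \lfloor x/s \rceil \le p \\ n & \text{if } \lfloor x/s \rceil < n \\ p & \text{if } \lfloor x/s \rceil > p \end{cases} \tag{7}$$
The choice to train thresholds in the log-domain is simple yet very effective for various stability reasons discussed in Section 4.
Similarly, the local gradient with respect to input $x$ is:
$$\nabla_x q(x; s) = \begin{cases} 1 & \text{if } n \le \lfloor x/s \rceil \le p \\ 0 & \text{otherwise} \end{cases} \tag{8}$$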
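The local gradients (7) and (8) can be sketched for a single signed element as follows. This assumes the piecewise form obtained by differentiating the quantizer with the STE applied only to the derivative of round/ceil: inside the clipping range the threshold gradient is proportional to the rounding error, and outside it is proportional to the clipping limit. The name `quantizer_grads` is hypothetical.

```python
import math


def quantizer_grads(x, log2_t, bits=8):
    """Return (dq/dlog2_t, dq/dx) for one signed element x.

    Rounding is kept in the forward computation; only its derivative is
    approximated as 1 (straight-through estimator).
    """
    n, p = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    s = 2.0 ** math.ceil(log2_t) / 2 ** (bits - 1)
    r = round(x / s)                    # round-half-to-even
    ds_dlog2_t = s * math.log(2)        # derivative of s w.r.t. log2(t)
    if r < n:
        return ds_dlog2_t * n, 0.0      # clipped at the lower limit
    if r > p:
        return ds_dlog2_t * p, 0.0      # clipped at the upper limit
    return ds_dlog2_t * (r - x / s), 1.0


grad_inside = quantizer_grads(0.3, 0.0)   # rounding-error term, dq/dx = 1
grad_clipped = quantizer_grads(2.0, 0.0)  # limit term, dq/dx = 0
```

Note that the in-range threshold gradient is non-zero only because the rounding error is kept; treating round as identity would make it vanish.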
To qualitatively understand the role of threshold gradient $\nabla_{(\log_2 t)} L$ and input gradient $\nabla_x L$ during backpropagation, let us consider the following toy problem: a single quantizer optimized using least-square-error loss $L = (q(x; s) - x)^2 / 2$. The overall gradients of $L$ are:
$$\nabla_{(\log_2 t)} L = (q(x; s) - x) \cdot \nabla_{(\log_2 t)} q(x; s) \tag{9}$$
$$\nabla_x L = (q(x; s) - x) \cdot (\nabla_x q(x; s) - 1) \tag{10}$$
Figure 1 shows the forward and backward pass transfer curves for our quantizer. As noted, the exact clipping thresholds of $x$ in the real domain are $x_n = s \cdot (n - 0.5)$ and $x_p = s \cdot (p + 0.5)$.
Role of threshold gradients: As seen from the plots of $\nabla_{(\log_2 t)} L$ vs. $x$ in Figure 2, threshold gradients are positive for $x$ within clipping thresholds $(x_n, x_p)$ and negative otherwise. When most of the input distribution (Gaussian distributed in this example, but the analysis holds in general) falls within $(x_n, x_p)$, the cumulative threshold gradient is positive, causing $\log_2 t$ to decrease (from the update rule $\log_2 t := \log_2 t - \alpha \nabla_{(\log_2 t)} L$, where $\alpha$ is the learning rate). In other words, the limits $(x_n, x_p)$ get pulled inward in favor of larger precision. Similarly, when most of the input distribution falls outside $(x_n, x_p)$, the cumulative threshold gradient is negative, $\log_2 t$ increases, and the limits $(x_n, x_p)$ get pushed outward in favor of larger dynamic range. This technique is naturally robust to ill-behaved distributions (e.g., long tails or outliers) by achieving range-precision trade-off through gradient-based optimization.
Role of input gradients: Using a similar analysis as for threshold gradients, we see that the input gradients $\nabla_x L$ are non-zero for values of $x$ that fall outside $(x_n, x_p)$, biased to keep them from getting clipped. This encourages the weight and activation distributions to be tighter.
To summarize, threshold gradients help train optimal thresholds for clipping weights and activations, whereas input gradients nudge the weights and activations to tighter bounds. By simultaneously training clipping thresholds and weights of the quantized network through backpropagation, we adopt joint (mutual) optimization over a global loss. While the actual loss landscape is non-trivial, the qualitative analysis from our toy problem still holds.
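The qualitative behavior of the toy problem can be checked numerically. The sketch below sums the threshold gradient of a least-square-error loss, assumed here to be $(q - x)^2 / 2$, over Gaussian samples for a threshold that is far too wide and one that is far too narrow. The function names, the sample count and the seed are illustrative choices, not values from the paper.

```python
import math
import random


def threshold_grad_of_loss(x, log2_t, bits=8):
    """d/d(log2 t) of (q(x) - x)^2 / 2 for one signed element x."""
    n, p = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    s = 2.0 ** math.ceil(log2_t) / 2 ** (bits - 1)
    r = round(x / s)                     # round-half-to-even
    if r < n:
        q, dq = n * s, s * math.log(2) * n
    elif r > p:
        q, dq = p * s, s * math.log(2) * p
    else:
        q, dq = r * s, s * math.log(2) * (r - x / s)
    return (q - x) * dq


def cumulative_grad(log2_t, sigma=1.0, samples=10000, seed=0):
    """Sum the threshold gradient over Gaussian-distributed inputs."""
    rng = random.Random(seed)
    return sum(threshold_grad_of_loss(rng.gauss(0.0, sigma), log2_t)
               for _ in range(samples))


wide = cumulative_grad(4.0)     # threshold 16, far wider than the data
narrow = cumulative_grad(-4.0)  # threshold 1/16, far narrower than the data
```

Under these assumptions `wide` is positive, so a gradient-descent step lowers the log threshold (more precision), while `narrow` is negative, so the step raises it (more range).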
In contrast, we find quantizer implementations that define threshold gradients by simply clipping the upstream gradients at the saturation thresholds. For example TensorFlow’s FakeQuant (used for quantization-aware training [37, 18]) defines gradients with respect to min/max thresholds as a clip function (see kernel implementation [35]).
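In the forward pass, the TF-FakeQuant operation [34] is mathematically equivalent to our formulation (except with zero-point), defined as:
$$q(x; n, p) := \left\lfloor \frac{\mathrm{clip}(x; n, p) - n}{(p - n)/(2^b - 1)} \right\rceil \cdot \frac{p - n}{2^b - 1} + n \tag{11}$$
where $n$ and $p$ here denote the min and max thresholds in the real domain.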
However, in the backward pass they likely treat the round function as identity in (11), reducing it to a clip function with clipped gradients. This is consistent with the TF-FakeQuant transfer curves in Figure 3, which show that the gradients with respect to thresholds are trivially clipped to zero for $x$ within the min/max thresholds. In other words, the cumulative gradients always push the limits outward, thus training to the actual min/max of the input distributions and favoring range at the cost of precision. This is particularly bad for distributions with long tails or outliers because the scheme discourages clipping.
Initially, it may seem that with the definition of a gradient with respect to the raw threshold, backpropagation and gradient descent could be immediately used to train it. However, just as training weights in a vanilla neural network requires care in the choice of optimizer and learning rate, here too care must be taken to ensure training stability and convergence. There are three main properties we would like our training procedure to satisfy: numerical stability, scale invariance, and convergence. Here we discuss each of these issues and the engineering tweaks used to solve them.
One obvious problem with training raw thresholds $t$ is that gradient updates could potentially bump a threshold to a negative value, causing $\log_2 t$ and therefore scale-factor $s$ to diverge. If this happens even once, the network as a whole will break. An easy solution is to train $\log_2 t$ as opposed to $t$ itself, since its domain is the entire real line. Using log thresholds is convenient because $\log_2 t$ already appears in the expression for $s$. However, the most important benefit is described in Section 4.2, where the log representation makes ensuring scale invariance very easy.
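A small numerical illustration of this stability argument, using plain gradient-descent steps with made-up gradient and learning-rate values: the same step that drives a raw threshold negative leaves a log-domain threshold perfectly valid. The helper names are hypothetical.

```python
def sgd_step_raw(t, grad, lr):
    """Gradient-descent step on the raw threshold t (can go negative)."""
    return t - lr * grad


def sgd_step_log(log2_t, grad, lr):
    """Gradient-descent step on log2(t); any real value is a valid result."""
    return log2_t - lr * grad


# Illustrative values only: a large gradient with a moderate learning rate.
raw_t = sgd_step_raw(0.5, grad=10.0, lr=0.1)      # threshold becomes negative
log_t = sgd_step_log(-1.0, grad=10.0, lr=0.1)     # log2(t) moves from -1 to -2
recovered_t = 2.0 ** log_t                        # still a positive threshold
```

In the raw domain the update lands at a negative threshold, for which the log is undefined, while in the log domain the threshold simply shrinks by a factor of 2 and stays positive.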
For a given input distribution we prefer that the threshold gradients have similar magnitudes regardless of the position of the threshold itself. This threshold scale invariance is useful for making sure training is not too slow when the thresholds are far from their optimal values. Similarly, the properties of our threshold gradients should not depend on the scale of the input distribution. This input scale invariance is important because it ensures that quantized training behaves the same way for the different weights and activations in the network, even if the variances of their distributions vary over many orders of magnitude.
Unfortunately, neither of these scale invariances holds. Far from improving, Figure 4 shows that in moving from raw threshold training (left) to log threshold training (middle), both scale invariance properties of the threshold gradients actually degrade.
Threshold scale invariance: Updates to the log threshold would be threshold scale invariant if the gradients on both sides of the negative-to-positive jump were flat, as seen in the right plot of Figure 4. However, this is not the case for log threshold gradients (center plot of Figure 4). On the left-of-jump side, as $\log_2 t$ decreases, gradients of (hence updates to) $\log_2 t$ get exponentially smaller, meaning it will converge very slowly to lower optimal values (see the log grad SGD case in the left plots of Figure 5). Similarly, on the right-of-jump side, as $\log_2 t$ increases, updates to $\log_2 t$ increase exponentially, meaning it will converge very quickly and possibly unstably to higher optimal values (see the log grad SGD case in the right plots of Figure 5). In the raw threshold domain, we would like gradients of (hence updates to) $t$ to scale proportionally to $t$. This is also not the case for the left-of-jump side of raw threshold gradients (left plot of Figure 4). In other words, the raw and log threshold gradients are swapped from what we would prefer on the left-of-jump sides.
Input scale invariance: Updates to the log threshold are input scale invariant if the gradients are threshold scale invariant and x-axis shifted copies for varying input scales, as seen in the right plot of Figure 4. However, this is not the case for log threshold gradients (center plot of Figure 4) as the gradient magnitudes depend on the scale of the input. In fact when accounting for the threshold scale dependence, the gradient magnitudes depend quadratically on the scale of the input.
Normed gradients: While neither raw nor log threshold gradients have the desired properties of scale invariance, only minimal modifications to our log threshold gradient are needed to get these properties to hold (see desired log threshold gradient on the right of Figure 4). In particular, if we normalize the gradient by its bias-corrected moving average variance, we achieve a close approximation of the desired gradients (12). To improve stability, we can encapsulate (12) in a clipping function to guarantee no large gradients (13).
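$$v_i \leftarrow \beta v_{i-1} + (1 - \beta) g_i^2, \qquad \hat{v}_i \leftarrow \frac{v_i}{1 - \beta^i}, \qquad \tilde{g}_i \leftarrow \frac{g_i}{\sqrt{\hat{v}_i} + \epsilon} \tag{12}$$
$$\tilde{g}_i \leftarrow \tanh\left(\frac{g_i}{\sqrt{\hat{v}_i} + \epsilon}\right) \tag{13}$$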
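One possible reading of the normed and clipped gradients (12) and (13) is sketched below. It assumes an exponential moving average of the squared gradient with bias correction (as in Adam's second moment) and uses tanh as the clipping function; both choices, along with the class name `NormedGradient` and the default `beta` and `eps` values, are assumptions made for illustration.

```python
import math


class NormedGradient:
    """Normalize a scalar gradient by its bias-corrected moving variance,
    then squash the result with tanh so its magnitude never exceeds 1."""

    def __init__(self, beta=0.999, eps=1e-8):
        self.beta = beta
        self.eps = eps
        self.v = 0.0     # moving average of squared gradients
        self.step = 0    # number of gradients seen so far

    def __call__(self, g):
        self.step += 1
        self.v = self.beta * self.v + (1.0 - self.beta) * g * g
        v_hat = self.v / (1.0 - self.beta ** self.step)   # bias correction
        return math.tanh(g / (math.sqrt(v_hat) + self.eps))


# Gradients six orders of magnitude apart give nearly the same normed value.
large = NormedGradient()(1e3)
small = NormedGradient()(1e-3)
```

On the first step the normalized gradient is close to 1 regardless of the raw magnitude, so both calls return roughly tanh(1), which illustrates the scale invariance the normalization is meant to provide.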
Yet another desired property highlighted in Figure 4 is that near the jump, the ratio of the gradient magnitudes to either side of the jump is to be preserved between the original and normed gradient cases. This is important for the convergence dynamics of the system discussed in Section 4.3. In dynamic situations, the gradient normalization solution (12) approximates this feature as well.
Figure 5 shows training curves on the toy quantization error problem across various bit-widths, input scales, and optimization algorithms. Raw gradient with SGD fails for large input scale $\sigma$ and converges too slowly for small $\sigma$, as we would expect from Sections 4.1 and 4.2. Additionally, its stability once converged depends on $\sigma$. Switching from raw to log threshold gradients, we see that log gradient with Adam performs well, yet log gradient with SGD performs poorly, with weak convergence rates for small $\sigma$ and divergence for large $\sigma$. However, after performing gradient normalization (13), normed log gradient with SGD performs well, demonstrating that lack of proper gradient norming is the main issue preventing convergence using standard gradient descent. Besides the differing convergence rates, another characteristic becomes immediately obvious: stability after convergence. For example, the raw gradient method tends to oscillate wildly between multiple integer-level log thresholds, whereas the normed log gradient method is better behaved and tends to stay within a single integer log threshold band.
Adam optimizer: While gradient norming (13) led to good results with SGD, we note that Adam without this gradient norming also works quite well. It is easy to see why: Adam has built-in gradient norming [20]. Thus we can avoid redefining the gradients by simply using an optimizer that includes adaptive gradients, such as Adam or RMSprop [13]. While RMSprop appears to superficially resemble (13) more closely than Adam, we suspect Adam has better behavior in the absence of gradient clipping due to its use of moments to smooth the gradients. To use Adam safely, we derive rough bounds on the learning rate and momentum parameters to ensure the oscillations seen in Figure 5 for log gradient with Adam do not exceed a single integer bin. This is important because if they move across bins often, the network may have more trouble adapting to the changing distributions from a given quantized layer, in an effect that may be similar to the motivation for batch normalization [15].
One primary cause of the sharp gradient jumps seen in Figure 4 is our insistence on power-of-2 scaling. In the forward pass, features downstream from the quantized layer are completely unaware of intermediate non-power-of-2 scale-factors, so there are sharp jumps at integral $\log_2 t$, similar to what might be observed when using the STE for traditional quantization. The net effect is a bang-bang like operation.
In more detail, for a given input distribution there is some critical integer threshold $\log_2 t^*$ before which the gradients are negative (causing positive threshold updates) and after which the gradients are positive. This negative feedback will force the threshold to oscillate around $\log_2 t^*$. The gradients $g_l$ and $g_h$ on either side of $\log_2 t^*$ tend to be fairly constant within a distance 1 of $\log_2 t^*$ due to power-of-2 scaling. For simplicity, assume $|g_l| > |g_h|$ so that the ratio $r_g = -g_l / g_h > 1$. As $r_g$ grows, we would expect the following behavior: the threshold stays in the higher bin for a while, slowly decaying until reaching the lower bin, at which point a large $|g_l|$ causes it to jump back to the higher bin, where it begins a slow decay again. This behavior can be observed in the left plots of Figure 5 and is shown in more detail in Figure 9.
If normed log gradients and SGD are used together, the dynamics are fairly simple. Let $\Delta \log_2 t_i = -\alpha \tilde{g}_i$ be the SGD update on normed log gradient $\tilde{g}_i$ (13). Then because $|\tilde{g}_i| \le 1$ by design, a given jump in the sawtooth-like pattern will have magnitude bounded by learning rate $\alpha$. Thus by selecting $\alpha \ll 1$, we can ensure convergence within a threshold bin.
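However, in our experiments we used the implementationally simpler approach of unnormed log gradients with the Adam optimizer. While simpler to implement, the analysis is more complicated due to the second-order nature of the optimizer. Adam has three key hyperparameters, $\alpha$, $\beta_1$ and $\beta_2$, and operates by keeping track of a moving mean of gradients $m_i \leftarrow \beta_1 m_{i-1} + (1 - \beta_1) g_i$ and a moving variance $v_i \leftarrow \beta_2 v_{i-1} + (1 - \beta_2) g_i^2$ before applying update rule $\theta_i \leftarrow \theta_{i-1} - \alpha \cdot m_i / \sqrt{v_i}$. In practice, bias correction is used to get $\hat{m}_i$ and $\hat{v}_i$, but when considering settling dynamics for $i \to \infty$, this bias correction is insignificant. Typical values are $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$.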
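For reference, a single scalar Adam update as described in [20] can be sketched as follows, including the bias correction. The function name `adam_step` is hypothetical, and the defaults are the commonly used Adam values rather than the settings recommended by this work.

```python
import math


def adam_step(theta, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to scalar parameter theta.

    m and v are the running first and second moments; step counts from 1.
    Returns the updated (theta, m, v).
    """
    m = beta1 * m + (1.0 - beta1) * g          # moving mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # moving (uncentered) variance
    m_hat = m / (1.0 - beta1 ** step)          # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** step)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# The first step has magnitude close to lr whatever the gradient scale.
theta_big, _, _ = adam_step(0.0, 123.0, 0.0, 0.0, step=1)
theta_tiny, _, _ = adam_step(0.0, 1e-4, 0.0, 0.0, step=1)
```

Because the update divides the mean by the square root of the variance, the step size is set by the learning rate rather than by the raw gradient magnitude, which is the built-in gradient norming referred to in the text.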
In Appendix B, a detailed analysis of convergence for Adam is carried out. From this analysis a simple set of guidelines emerge. First, the learning rate is set to guarantee . Next, we ensure to satisfy the limits of our analysis. Finally, we make sure . These results are summarized in Table 1. For simplicity, we use for all of our training.
Bit-width | 4 | 8 | |
---|---|---|---|
Steps |
We released Graffitist (code available at https://github.com/Xilinx/graffitist), an end-to-end software stack built on top of TensorFlow to quantize and retrain deep neural networks using our ALT method. It is in experimental stages as we continue to add support for more operation types, layer topologies, network styles, graph optimizations, and compression techniques. Graffitist stands on the shoulders of giants and the interface is inspired in part by earlier tools from TensorFlow [36, 37].
Graffitist applies several optimizations to the input graph prior to quantization. For example, it folds batch normalization layers into the weights of the preceding convolutional, fully connected, or depthwise convolutional layers. We adopt the following best practices from [18, 21, 37]: (a) ensure folded batch norms in training and inference graphs are mathematically equivalent (i.e., distributions seen during training match those during inference); (b) apply batch norm corrections for switching between batch and moving average statistics to reduce jitter in training folded weights due to noisy batch updates; (c) freeze batch norm moving mean and variance updates post convergence for improved accuracy. Other optimizations include collapsing concat-of-concat layers into a single concat, splicing identity nodes not involved in control edges, transforming average pool layers into depthwise conv layers with a reciprocal multiplier as weights (the reciprocal being $1/F^2$ where $F$ is the kernel size), and explicitly merging input scales for scale preserving ops such as concat, bias-add, eltwise-add, and maximum (for leaky relu).
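The batch norm folding mentioned above follows from standard algebra and can be sketched per output channel with scalars. This is a generic inference-time folding identity, not Graffitist's actual graph transform; the function name `fold_batch_norm` and the default `eps` are assumptions.

```python
import math


def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-3):
    """Fold a batch norm into the preceding layer for one output channel.

    The layer computes w * x + b, followed by
    gamma * (y - mean) / sqrt(var + eps) + beta.
    Returns (w_folded, b_folded) so that w_folded * x + b_folded is equal.
    """
    scale = gamma / math.sqrt(var + eps)
    return w * scale, beta + (b - mean) * scale


def layer_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-3):
    """Reference: unfused layer followed by batch norm."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


params = dict(w=2.0, b=0.5, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
w_f, b_f = fold_batch_norm(**params)
folded_out = w_f * 3.0 + b_f
reference_out = layer_then_bn(3.0, **params)
```

The folded weights are the ones that get quantized, which is why the text stresses keeping the folded training and inference graphs mathematically equivalent.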
Graffitist allows for quantization in either static mode or retrain mode.
Static Mode. Quantization thresholds (hence scale factors) are determined based on statistics of weights and activations derived from a calibration dataset. Specifically, weight thresholds (per-tensor) are set to the maximum absolute value, and activation thresholds (per-tensor) are chosen so as to minimize the symmetric Kullback-Leibler-J distance [8] for each quantization layer locally. This is done in a strictly topological order to ensure inputs to a layer are quantized (and fixed) prior to quantizing the current layer. The entire optimization and calibration process is automated and only requires a single API call to Graffitist.
Retrain Mode. Quantization thresholds and weights are simultaneously trained on a global loss using our ALT method. Recovery is achieved within 5 epochs of retraining. This requires two separate API calls to Graffitist - first to generate a quantized training graph which can be trained using native TensorFlow, and second to generate the equivalent quantized inference graph which uses weights and thresholds from the previously trained checkpoint.
While Graffitist supports configurable bit-widths for weights and activations, for the scope of this paper we use two modes; INT8: 8/8 (W/A) and INT4: 4/8 (W/A). In the absence of a 4x8 multiplier, the INT4 mode still allows for 50% weight compression (double packing weights per byte) and reduced memory bandwidth cost for fetching weights. The internal precisions for different layer topologies are defined below. Quantization layers marked as indicate that their scale-factors are explicitly merged / shared. To avoid double quantization, input tensors are assumed to be already quantized by the previous layer, with the exception of the primary input (placeholder) which is explicitly quantized.
Compute layers (e.g., conv, matmul, depthwise conv) are quantized as:
where is the input tensor, is the weight tensor, and is the bias tensor. If followed by a ReLU or ReLU6 activation function, the last stage is delayed until after ReLU/ReLU6, and uses unsigned datatype to utilize the extra sign bit.
Eltwise-add layer is quantized as:
where and are the input tensors. Similar to the compute layer case, the last stage is delayed and uses unsigned datatype if followed by ReLU/ReLU6.
Leaky ReLU is quantized as:
where is the input tensor, and is the slope of activation function for negative inputs. The last stage on the previous compute layer is skipped when it is followed by Leaky ReLU. Instead a stage is used to retain high internal precision for the -multiply op.
Average pool is quantized as:
where is the input tensor, and is the reciprocal.
Concat is not quantized because the input scales are merged explicitly, and hence it is lossless:
where , , and are input tensors.
The quantization layer defined in (4) and (6) may be trivially implemented using native TensorFlow ops and tf.stop_gradient as depicted in Figure 6. However this low-level implementation has a large memory footprint during training due to the need for storing intermediate tensors for gradient computation in the backward pass. This impacts the maximum batch size that can fit on a single GPU. To resolve this, Graffitist comes packaged with fused quantization kernels that are pre-compiled for CPU/GPU. The fused implementation is efficient, helps avoid memory overhead and allows training using larger batch sizes compared to the native implementation.
We evaluate ALT on variants of five classes of CNNs trained and validated on ImageNet (ILSVRC14) classification dataset [28]. The networks include VGG {16, 19} [30], Inception v{1, 2, 3, 4} [32, 15, 33, 31], ResNet v1 {50, 101, 152} [12], MobileNet v{1, 2} 1.0 224 [14, 29], and DarkNet 19 [27]. We obtained the models, pre-trained weights (FP32) and pre-processing for each of these networks from the TF-Slim model zoo [38] except for DarkNet 19 which was converted to TensorFlow using DW2TF [11]. Calibration sets are prepared for each network using a batch of 50 unlabeled images, randomly sampled from the validation set, with applied pre-processing. This is used for initializing the thresholds in both static and retrain modes.
We are interested in a scalable, hardware-friendly and production-ready approach to INT8/INT4 quantization that maps well on generic fixed-point hardware. While our simplifying constraints (from Section 3.1) may not be ideal for lower bit-widths, the fundamentals of ALT are more generally applicable by simply removing these constraints. To limit the scope of this paper to the least-common-denominator fixed-point quantization, we do not make comparisons with other state-of-the-art low-bitwidth quantization schemes. Instead we draw comparisons of ALT (wt+th) retraining to static quantization and wt-only retraining. We can derive many interesting insights from this analysis.
When thresholds are not trained, they are initialized to MAX for weights, and KL-J distance calibrated for activations. However when training thresholds, we find it useful to initialize the weight thresholds based on standard deviations or percentile of the weight distribution rather than MAX. Table 2 summarizes the threshold initialization scheme we used in all our experiments.
Mode | Trained | Weight threshold init | Activation threshold init |
---|---|---|---|
Static | | MAX | KL-J |
Retrain | wt | MAX | KL-J |
Retrain | wt,th | 3SD | KL-J |
In Section 4.3, we discussed the post-convergence oscillations of thresholds around the critical integer threshold due to our power-of-2 scaling constraint. When thresholds cross this integer level, it can change the distributions of downstream activations, requiring weights and thresholds of the following layers to adapt to it. To minimize this effect, we incrementally freeze thresholds starting at steps, once every 50 steps in the order of increasing absolute gradient magnitude, if they are on the correct side of (determined using an EMA).
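The incremental freezing schedule described above can be sketched as a small selection routine. The sketch rests on several assumptions that are not specified in the text: the start step (`start_step=1000` is a placeholder), the reading that one threshold is frozen per 50-step interval, and the test for being on the "correct side", taken here to mean that the current log-threshold and its EMA round up to the same integer. All names are hypothetical.

```python
import math


def pick_threshold_to_freeze(step, log2_t, ema_log2_t, abs_grad, frozen,
                             start_step=1000, every=50):
    """Return the name of the next threshold to freeze, or None.

    log2_t, ema_log2_t, abs_grad : dicts keyed by threshold name
    frozen                       : set of names that are already frozen
    start_step                   : placeholder value, not from the paper
    """
    # Only act once every `every` steps after the start step.
    if step < start_step or (step - start_step) % every != 0:
        return None
    # Assumed criterion: current value and EMA fall in the same integer bin.
    candidates = [
        name for name in log2_t
        if name not in frozen
        and math.ceil(log2_t[name]) == math.ceil(ema_log2_t[name])
    ]
    if not candidates:
        return None
    # Freeze in order of increasing absolute gradient magnitude.
    return min(candidates, key=lambda name: abs_grad[name])
```

A training loop would call this routine each step, add any returned name to `frozen`, and stop applying updates to that threshold.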
Before exporting the models to TensorFlow protocol buffers (.pb) for Graffitist to absorb, we make the following synthetic modifications: (i) replace tf.reduce_mean with tf.nn.avg_pool (if any), (ii) remove auxiliary logit layers (if any), and (iii) remove dropouts (if any). Additionally, we disable data augmentation (e.g., random flip / crop) during retraining. These modifications are made because ALT training focuses primarily on learning thresholds through backpropagation, while allowing previously trained weights to be fine-tuned using a relatively small learning rate. As expected, most of the recovery is achieved within a fraction of an epoch as the thresholds converge, and the rest of it (up to 5 epochs) is just weights adjusting to the new thresholds. Because ALT requires so few training steps compared to from-scratch training, and because pre-trained weight distributions are not allowed to change wildly (overfit), we find it best to disable data augmentation and dropout regularization.
Based on the analysis in Sections 4.2 and 4.3, we use the Adam optimizer with parameters and for training thresholds and weights. The initial learning rate is set to for thresholds and for weights. Learning rates are decayed exponentially (with staircase enabled) by a factor of every steps for weights and by a factor of every steps for thresholds, where is the batch size. We use a batch size of 24 for all networks except for ResNet v1 152 and Inception v4 for which a batch of 16 is used. Softmax cross-entropy loss is used to compute quantization threshold gradients and this loss, together with weight regularization (if any), are used to compute weight gradients. Batch norm moving means and variances are frozen after epoch.
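The staircase exponential decay used for both learning rates can be written as a one-line schedule. The sketch below only illustrates the shape of the schedule; the function name is hypothetical, and the numbers in the usage note are placeholders rather than the values used in our experiments.

```python
def staircase_exponential_decay(initial_lr, decay_rate, decay_steps, step):
    """Learning rate after `step` training steps.

    With staircase enabled, the rate is held constant within each interval
    of `decay_steps` steps and multiplied by `decay_rate` at each boundary.
    """
    return initial_lr * decay_rate ** (step // decay_steps)
```

For example, with a placeholder initial rate of 0.01, a decay factor of 0.5 and 1000 decay steps, the rate stays at 0.01 through step 999 and drops to 0.005 at step 1000. Separate schedules of this form would be instantiated for weights and for thresholds.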
Network | Mode | Trained | Precision | Bit-width (W/A) | top-1 (%) | top-5 (%) | Epochs |
---|---|---|---|---|---|---|---|
VGG 16 | | | FP32 | 32/32 | 70.9 | 89.8 | |
VGG 16 | Static | | INT8 | 8/8 | 70.4 | 89.7 | |
VGG 16 | Retrain | wt | FP32 | 32/32 | 71.9 | 90.5 | 1.0 |
VGG 16 | Retrain | wt | INT8 | 8/8 | 71.8 | 90.5 | 1.0 |
VGG 16 | Retrain | wt,th | INT8 | 8/8 | 71.7 | 90.4 | 0.9 |
VGG 16 | Retrain | wt,th | INT4 | 4/8 | 71.5 | 90.3 | 4.0 |
VGG 19 | | | FP32 | 32/32 | 71.0 | 89.8 | |
VGG 19 | Static | | INT8 | 8/8 | 70.4 | 89.7 | |
VGG 19 | Retrain | wt | FP32 | 32/32 | 71.8 | 90.4 | 1.0 |
VGG 19 | Retrain | wt | INT8 | 8/8 | 71.7 | 90.4 | 1.0 |
VGG 19 | Retrain | wt,th | INT8 | 8/8 | 71.7 | 90.4 | 1.0 |
VGG 19 | Retrain | wt,th | INT4 | 4/8 | 71.2 | 90.1 | 2.0 |
Inception v1 | | | FP32 | 32/32 | 69.8 | 89.6 | |
Inception v1 | Static | | INT8 | 8/8 | 68.6 | 88.9 | |
Inception v1 | Retrain | wt | FP32 | 32/32 | 70.3 | 90.0 | 2.8 |
Inception v1 | Retrain | wt | INT8 | 8/8 | 70.6 | 90.3 | 3.5 |
Inception v1 | Retrain | wt,th | INT8 | 8/8 | 70.7 | 90.2 | 2.4 |
Inception v1 | Retrain | wt,th | INT4 | 4/8 | 67.2 | 88.2 | 4.0 |
Inception v2 | | | FP32 | 32/32 | 74.0 | 91.8 | |
Inception v2 | Static | | INT8 | 8/8 | 73.1 | 91.3 | |
Inception v2 | Retrain | wt | FP32 | 32/32 | 74.3 | 92.2 | 3.3 |
Inception v2 | Retrain | wt | INT8 | 8/8 | 74.4 | 92.3 | 4.7 |
Inception v2 | Retrain | wt,th | INT8 | 8/8 | 74.4 | 92.4 | 2.5 |
Inception v2 | Retrain | wt,th | INT4 | 4/8 | 71.9 | 90.8 | 4.8 |
Inception v3 | | | FP32 | 32/32 | 78.0 | 93.9 | |
Inception v3 | Static | | INT8 | 8/8 | 76.8 | 93.3 | |
Inception v3 | Retrain | wt | FP32 | 32/32 | 78.3 | 94.2 | 2.1 |
Inception v3 | Retrain | wt | INT8 | 8/8 | 78.2 | 94.1 | 2.0 |
Inception v3 | Retrain | wt,th | INT8 | 8/8 | 78.3 | 94.3 | 1.2 |
Inception v3 | Retrain | wt,th | INT4 | 4/8 | 76.4 | 93.1 | 4.4 |
Inception v4 | | | FP32 | 32/32 | 80.2 | 95.2 | |
Inception v4 | Static | | INT8 | 8/8 | 79.4 | 94.6 | |
Inception v4 | Retrain | wt | FP32 | 32/32 | 80.2 | 95.2 | ** |
Inception v4 | Retrain | wt | INT8 | 8/8 | 80.1 | 95.3 | 1.7 |
Inception v4 | Retrain | wt,th | INT8 | 8/8 | 80.1 | 95.2 | 1.5 |
Inception v4 | Retrain | wt,th | INT4 | 4/8 | 78.9 | 94.7 | 4.2 |
MobileNet v1 1.0 224 | | | FP32 | 32/32 | 71.0 | 90.0 | |
MobileNet v1 1.0 224 | Static | | INT8 | 8/8 | 0.6 | 3.6 | |
MobileNet v1 1.0 224 | Retrain | wt | FP32 | 32/32 | 71.1 | 90.0 | 3.4 |
MobileNet v1 1.0 224 | Retrain | wt | INT8 | 8/8 | 67.0 | 87.9 | 4.6 |
MobileNet v1 1.0 224 | Retrain | wt,th | INT8 | 8/8 | 71.1 | 90.0 | 2.1 |
MobileNet v1 1.0 224 | Retrain | wt,th | INT4 | 4/8 | – | – | |
MobileNet v2 1.0 224 | | | FP32 | 32/32 | 70.1 | 89.5 | |
MobileNet v2 1.0 224 | Static | | INT8 | 8/8 | 0.3 | 1.2 | |
MobileNet v2 1.0 224 | Retrain | wt | FP32 | 32/32 | 71.7 | 90.7 | 3.2 |
MobileNet v2 1.0 224 | Retrain | wt | INT8 | 8/8 | 68.2 | 89.0 | 2.7 |
MobileNet v2 1.0 224 | Retrain | wt,th | INT8 | 8/8 | 71.8 | 90.6 | 2.2 |
MobileNet v2 1.0 224 | Retrain | wt,th | INT4 | 4/8 | – | – | |
DarkNet 19 | | | FP32 | 32/32 | 73.0 | 91.4 | |
DarkNet 19 | Static | | INT8 | 8/8 | 68.7 | 89.7 | |
DarkNet 19 | Retrain | wt | FP32 | 32/32 | 74.4 | 92.3 | 3.1 |
DarkNet 19 | Retrain | wt | INT8 | 8/8 | 72.9 | 91.6 | 3.8 |
DarkNet 19 | Retrain | wt,th | INT8 | 8/8 | 74.5 | 92.3 | 1.8 |
DarkNet 19 | Retrain | wt,th | INT4 | 4/8 | 73.2 | 91.6 | 2.8 |
ResNet v1 50 | | | FP32 | 32/32 | 75.2 | 92.2 | |
ResNet v1 50 | Static | | INT8 | 8/8 | 74.3 | 91.7 | |
ResNet v1 50 | Retrain | wt | FP32 | 32/32 | 75.4 | 92.5 | 3.7 |
ResNet v1 50 | Retrain | wt | INT8 | 8/8 | 75.3 | 92.3 | 1.0 |
ResNet v1 50 | Retrain | wt,th | INT8 | 8/8 | 75.4 | 92.3 | 1.9 |
ResNet v1 50 | Retrain | wt,th | INT4 | 4/8 | 74.4 | 91.7 | 2.0 |
ResNet v1 101 | | | FP32 | 32/32 | 76.4 | 92.9 | |
ResNet v1 101 | Static | | INT8 | 8/8 | 74.8 | 92.0 | |
ResNet v1 101 | Retrain | wt | FP32 | 32/32 | 76.6 | 93.2 | 1.2 |
ResNet v1 101 | Retrain | wt | INT8 | 8/8 | 76.3 | 93.0 | 1.0 |
ResNet v1 101 | Retrain | wt,th | INT8 | 8/8 | 76.4 | 93.1 | 0.9 |
ResNet v1 101 | Retrain | wt,th | INT4 | 4/8 | 75.7 | 92.5 | 2.0 |
ResNet v1 152 | | | FP32 | 32/32 | 76.8 | 93.2 | |
ResNet v1 152 | Static | | INT8 | 8/8 | 76.2 | 93.0 | |
ResNet v1 152 | Retrain | wt | FP32 | 32/32 | 76.8 | 93.3 | 1.0 |
ResNet v1 152 | Retrain | wt | INT8 | 8/8 | 76.7 | 93.3 | 1.5 |
ResNet v1 152 | Retrain | wt,th | INT8 | 8/8 | 76.7 | 93.3 | 1.4 |
ResNet v1 152 | Retrain | wt,th | INT4 | 4/8 | 76.0 | 93.0 | 1.9 |
Table 3 reports the single-crop ImageNet validation accuracy for 12 networks. Default image sizes are used - for Inception v{3, 4}, for Darknet 19 and for all other networks. Standard pre-processing for each network is applied to center crop, resize, and normalize the input data.
The different trials include pre-trained FP32 baseline, static INT8 run, and 4 retrain runs - FP32 wt-only, INT8 wt-only, INT8 wt+th and INT4 wt+th. FP32 baseline numbers are reported as validated on our end. For an unbiased comparison, we train the FP32 weights using the same procedure (optimizers, learning rates, decay, freeze etc.) as with our quantized retraining, except without the threshold training logic. This FP32 wt-only retraining only serves as a fair baseline to our INT8 and INT4 retrain results. That said, we do not use the retrained FP32 weights to initialize any of our INT8/INT4 retraining runs, and they always start from pre-trained FP32 weights. This is done to keep the overhead of retraining to a minimum.
The validation accuracy and epoch count corresponding to the best checkpoint are noted in Table 3. As we see, all the networks converge within 5 epochs. Variance on the reported accuracy stems from a few sources (in decreasing order of impact): (a) best rather than mean validation (our findings in Section 7.3 suggest this variance is within 0.2%), (b) non-determinism due to inexact floating point math (empirically within 0.1%), (c) round to one decimal (bound to 0.05%). Keeping these variance bounds on accuracy in mind, we can draw interesting insights into the benefits of ALT training.
Our experiments demonstrate floating-point accuracy for 8-bit quantization and near-floating-point accuracy for 4-bit quantization for most networks. We see that static quantization incurs a higher loss than retrained methods. This is expected because (a) weights are not trained to adapt to the quantized network, and (b) quantization thresholds are picked using local statistics instead of being optimized on a global loss. For networks that are easier to quantize to INT8 (e.g., VGGs, Inceptions, ResNets), we find that retraining weights alone while fixing thresholds to their pre-calibrated values (based on Table 2) is sufficient. In such cases, ALT (wt+th) retraining shows no added benefit. However, for networks known to be difficult to quantize (e.g., MobileNets, DarkNets), ALT (wt+th) retraining yields up to 4.1% higher top-1 accuracy than wt-only training for INT8 (MobileNet v1 in Table 3: 71.1% vs. 67.0%), and can match FP32 accuracy even with per-tensor, uniform symmetric, power-of-2 scaling constraints. This demonstrates the range-precision trade-off through trained thresholds in action. For lower precisions such as INT4, we find that wt-only training does not recover, and so ALT (wt+th) retraining is necessary. The INT4 accuracy falls short of FP32, and we believe this may be due to (a) our quantization constraints in Section 3.1, and (b) the first/last layers not retaining full precision (we quantize first/last layers to a minimum of INT8, so that they can be mapped on the same fixed-point hardware used for other layers).
Network | Method | Precision | Quantization scheme | top-1 (%) |
---|---|---|---|---|
MobileNet v1 1.0 224 | | FP32 | | 70.9 |
MobileNet v1 1.0 224 | | INT8 | per-channel, symmetric, real scaling | 70.7 |
MobileNet v1 1.0 224 | | INT8 | per-tensor, asymmetric, real scaling | 70.0 |
MobileNet v1 1.0 224 | Ours | FP32 | | 71.1 |
MobileNet v1 1.0 224 | Ours | INT8 | per-tensor, symmetric, p-of-2 scaling | 71.1 |
MobileNet v2 1.0 224 | | FP32 | | 71.9 |
MobileNet v2 1.0 224 | | INT8 | per-channel, symmetric, real scaling | 71.1 |
MobileNet v2 1.0 224 | | INT8 | per-tensor, asymmetric, real scaling | 70.9 |
MobileNet v2 1.0 224 | Ours | FP32 | | 71.7 |
MobileNet v2 1.0 224 | Ours | INT8 | per-tensor, symmetric, p-of-2 scaling | 71.8 |
For more difficult networks such as MobileNets, it is well known that symmetric, per-tensor quantization post-training or through calibrate-only methods is detrimental [21, 10]. We believe this is due in particular to the use of depthwise convolutions with irregular weight distributions and widely varying ranges between channels. With wt-only retraining we are only able to recover to within about 4% of floating-point accuracy (Table 3). However, with ALT (wt+th) retraining, our results for 8-bit are the highest we have seen using symmetric, power-of-2 scaled, per-tensor quantization, even matching floating-point accuracy with no loss. In Table 4 we draw a few comparisons with Google’s quantization-aware training results for MobileNets. As seen, we incur no loss with INT8 quantization even with stricter constraints. We believe this is because our threshold gradient formulation is able to balance range and precision effectively.
In Figure 7 we analyze the retrained distributions for a few quantized layers in MobileNet v1, highlighting the importance of the range-precision trade-off. As seen with the depthwise separable layers’ weights, the trained thresholds move in from their initialized values by up to 3 integer bins in the log domain, favoring precision over dynamic range. For some other layers, the thresholds move out from their initialized values, favoring range over precision.
Figure 8 shows a histogram of trained threshold deviations for different networks under 8-bit and 4-bit quantized retraining. We find that larger positive deviations are seen in the 8-bit case compared to the 4-bit case. This makes intuitive sense: the method favors range when more bits of precision are available, but cuts back on range when only a few bits of precision are available.
We run validation every 1000 training steps and save the best top-1 score checkpoint. This approach was initially driven by a desire to better understand convergence and stability properties with our method, but we continued using it since intermediate validation was not too expensive for 5 epochs of retraining. However, a valid concern is that this intermediate validation introduces a positive bias to our results through cherry-picking. We compare this positive-biased validation method to simply taking the average of validation scores at fixed intervals: 20%, 40%, 60%, 80% and 100% of the fifth epoch. As noted in Table 5, the differences between these methods on the top-1 accuracy are 0.1% and 0.2% for MobileNet v1 and VGG 16 respectively, suggesting that cherry-picking only results in a minor positive bias on our reported accuracy.
Network | Statistic | top-1 (%) | top-5 (%) | Epochs |
---|---|---|---|---|
MobileNet v1 1.0 224 | | 70.982 | 89.886 | 4.2 |
MobileNet v1 1.0 224 | | 70.986 | 89.860 | 4.4 |
MobileNet v1 1.0 224 | | 71.076 | 89.930 | 4.6 |
MobileNet v1 1.0 224 | | 71.000 | 89.870 | 4.8 |
MobileNet v1 1.0 224 | | 71.022 | 89.944 | 5.0 |
MobileNet v1 1.0 224 | Mean | 71.0 | 89.9 | |
MobileNet v1 1.0 224 | Best | 71.1 | 90.0 | 2.1 |
VGG 16 | | 71.448 | 90.438 | 4.2 |
VGG 16 | | 71.462 | 90.456 | 4.4 |
VGG 16 | | 71.434 | 90.436 | 4.6 |
VGG 16 | | 71.500 | 90.426 | 4.8 |
VGG 16 | | 71.458 | 90.456 | 5.0 |
VGG 16 | Mean | 71.5 | 90.4 | |
VGG 16 | Best | 71.7 | 90.4 | 0.9 |
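The best-versus-mean comparison in Table 5 can be reproduced from the five fifth-epoch samples. The short script below is an illustrative check using the top-1 values from the table; the variable and function names are hypothetical, and the best-checkpoint scores are only available to one decimal place.

```python
from statistics import mean

# Top-1 accuracy (%) sampled at 20%, 40%, 60%, 80% and 100% of the fifth epoch.
top1_samples = {
    "MobileNet v1 1.0 224": [70.982, 70.986, 71.076, 71.000, 71.022],
    "VGG 16": [71.448, 71.462, 71.434, 71.500, 71.458],
}
# Top-1 accuracy (%) of the best checkpoint, as reported to one decimal.
best_top1 = {"MobileNet v1 1.0 224": 71.1, "VGG 16": 71.7}


def best_minus_mean(network):
    """Gap between the best checkpoint and the mean of the samples."""
    rounded_mean = round(mean(top1_samples[network]), 1)
    return round(best_top1[network] - rounded_mean, 1)


for name in top1_samples:
    print(name, best_minus_mean(name))
```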
In Section 3, we presented a quantization threshold training scheme amenable to most generic fixed-point hardware by constraining our method to uniform, symmetric, power-of-2 scaled, per-tensor quantization. We showed that our quantizer’s gradient formulation allowed a unique range-precision trade-off, essential for high accuracy quantized networks. We demonstrated that training was possible despite all of our constraints by utilizing log-domain threshold training and adaptive gradient optimization. In Section 4, we provided analytical arguments for the general robustness of our adaptive-gradient log-domain threshold (ALT) training technique. In Section 5, we presented a framework called Graffitist for automatically quantizing TensorFlow graphs with our methods. In Section 6, we empirically validated our methods on a suite of common ImageNet CNNs. Finally, in Section 7, we provided insightful discussions on ALT training and state-of-the-art results for 8-bit MobileNet quantization.
Our work and results demonstrate the effectiveness of our techniques for high accuracy quantization of neural networks on fixed-point hardware. While our work covers a major use case for quantization, there are many other quantization flavors we could explore in future work. For example, it would be useful to see how well the techniques we designed for strict power-of-2 scaling generalize to non power-of-2 scale-factors. Some additional relaxations of our constraints we could explore include per-channel rather than per-tensor quantization, which could potentially allow for more aggressive bitwidths on difficult networks like MobileNets, and non-symmetric or even non-uniform quantization schemes, where threshold training via backpropagation and gradient descent has been tried with mild success. We would not be surprised to see our methods and analysis techniques have broader applicability for more general classes of quantizers.
We thank Engin Tunali, Ramon Uribe, Sean Settle, Paolo D’Alberto, Armin Banaei, Ephrem Wu, Nicholas Fraser, Ralph Wittig from Xilinx, and Elaina Chai, Boris Murmann from Stanford for valuable discussions and comments.
Consider two real numbers and and their product (this example is adapted from Section 2.2 in [18]). Using the affine mapping from (2) to represent this, we get:
(14) |
which can be expressed as
(15) |
The cross-terms in (15) add complexity and often require special handling to remain efficient. While the added cost can be amortized over several accumulations of a matrix multiplication or convolution operation, it would still require optimizations, both algorithmic and kernel-level (some of which are covered in [17, 18, 21]).
By eliminating zero-points, the cross-terms vanish and the operation simplifies to:
(16) |
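The effect of zero-points on the integer arithmetic can be checked numerically. The sketch below assumes the common affine form r = s(q - z), with real value r, scale s, integer code q and zero-point z; these symbols and the function names are assumptions of the sketch and may differ from the notation of (2). It lists the four integer terms that arise when two affine-quantized values are multiplied and shows that three of them vanish when both zero-points are 0.

```python
def dequantize(scale, q, zero_point=0):
    """Assumed affine mapping: real value = scale * (q - zero_point)."""
    return scale * (q - zero_point)


def product_terms(q1, z1, q2, z2):
    """Integer terms of (q1 - z1) * (q2 - z2), expanded.

    The first term is the plain integer product; the remaining three are
    the cross-terms introduced by non-zero zero-points.
    """
    return (q1 * q2, -q1 * z2, -q2 * z1, z1 * z2)


# Asymmetric case: all four terms contribute to the result.
s1, q1, z1 = 0.5, 10, 3
s2, q2, z2 = 0.25, -7, 5
terms = product_terms(q1, z1, q2, z2)
assert s1 * s2 * sum(terms) == dequantize(s1, q1, z1) * dequantize(s2, q2, z2)

# Symmetric case (zero-points of 0): only the integer product remains.
print(product_terms(q1, 0, q2, 0))
```

In a matrix multiplication the cross-terms become row and column sums that must be accumulated alongside the main product, which is the extra handling that symmetric quantization avoids.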
With positive real scale-factors, the constant multiplier in (16), which is empirically [18] found to be in the interval (0, 1) can be expressed in the normalized form where is a non-negative integer and is in the interval [0.5, 1). In other words, the accumulator (storing ) needs to be scaled by a fixed-point multiplier that approximates and right-shifted by bits (with round-to-nearest):
(17) |
However, by constraining scale-factors to strict power-of-2, the scaling operation reduces to a rather simple bit-shift (with round-to-nearest):
(18) |
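The difference between the two rescaling paths can be illustrated with plain integer arithmetic. The sketch below is not taken from Graffitist or from [18]; the function names, the 31 fractional bits for the fixed-point multiplier, and the tie-breaking rule of the rounding shift (ties toward positive infinity) are assumptions made for illustration. With a real-valued multiplier the accumulator needs an integer multiply followed by a rounding shift, whereas with a power-of-2 scale-factor the rounding shift alone suffices.

```python
import math


def round_shift(value, shift):
    """Arithmetic right shift with round-to-nearest (ties toward +inf)."""
    if shift == 0:
        return value
    return (value + (1 << (shift - 1))) >> shift


def normalize_multiplier(m, frac_bits=31):
    """Split m in (0, 1) into a fixed-point mantissa and a shift count n.

    m is approximated by (mantissa_fixed / 2**frac_bits) * 2**-n with the
    mantissa in [0.5, 1). A mantissa extremely close to 1 could round up to
    2**frac_bits; that corner case is ignored in this sketch.
    """
    mantissa, exponent = math.frexp(m)  # m == mantissa * 2**exponent
    return round(mantissa * (1 << frac_bits)), -exponent


def rescale_real(acc, m, frac_bits=31):
    """Rescale an integer accumulator by a real multiplier m in (0, 1)."""
    mantissa_fixed, n = normalize_multiplier(m, frac_bits)
    return round_shift(acc * mantissa_fixed, frac_bits + n)


def rescale_pow2(acc, n):
    """Rescale an integer accumulator by 2**-n using only a rounding shift."""
    return round_shift(acc, n)
```

For a multiplier that happens to be a power of 2, both paths agree: `rescale_real(1000, 0.125)` and `rescale_pow2(1000, 3)` both give 125, but the second needs no multiplier.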
Let be the period of oscillations at convergence. If we assume , then we can treat the moving variance estimate as if it is a constant . However, we cannot make the same assumption for the relationship between and . Instead, based on our earlier discussion in Section 4.3 of the bang-bang behavior, we assume that a gradient is seen for a single step, then is seen for steps. Then for a given cycle of this behavior, , where is the steady-state minimum mean during the cycle. Because this is steady-state, we can solve for and :
(19) | ||||
(20) |
Adam updates look like or . We can solve for by finding when or . As an intermediate step, we find:
(21) |
Now, we set :
(22) |
The worst case happens when is large, so if we substitute and assume , we get:
(23) | ||||
(24) |
where we replace the large expression in (23) with in (24). We now solve for the critical point of to determine .
(25) | ||||
(26) |
(27) |
To simplify this expression, note that and so . Then and:
(28) |
Further, if , then the right term is negative and the expression has a simple upper bound:
(29) |
In practice, we notice that sometimes noise can cause to stay on the high-gradient side of the threshold boundary for multiple steps, causing the momentum to build up. Thus, to be safe, we recommend designing for .
A rough estimate for the number of steps needed for convergence is . Because of adaptive gradients, should be close to 1, provided we allow enough time for historical variance to decay - steps (this is a problem when historical gradient magnitudes were higher, as is usually the case when , as seen in the small plots of Figure 5). Thus, the overall number of steps would be . Assuming calibration is used, should be close to 1, giving the simplified expression steps.
Finally, we address how to approximate . The operation of crossing a threshold boundary moves some fraction of inputs from the case to the or cases (assume only for simplicity from here on). Using the toy -loss model (9),
(30) |
we see that for any given , the ratio between the gradients in the outer and inner cases is . But since recently switched cases, . As a rough estimate, we might expect . Averaged over the entire input, . The over-design helps address some uncertainty in this measure as well.
Figure 9 shows a re-run of Figure 5 for the case of Adam optimization on log threshold gradients. These plots allow us to validate our Adam convergence analysis above. First we note that , which is an approximate upper bound on and well within the over-design principle. Next, notice that . For example, in the case, while .
Most importantly, we expect the max log-threshold deviation to be upper-bounded by from left to right if our original assumptions hold - that we visit the lower threshold bin for one step and stay in the upper bin for steps. While the bound holds for all , it is close to not holding for . A brief inspection reveals why this is the case: the log threshold spends far more than one step in the lower threshold bin per period, violating our one-step assumption. This violation can be explained by looking at the gradients, which show that the lower threshold bin sometimes has positive gradients, depending on the randomness of the input Gaussian vector. These phenomena motivate our suggestion to over-design by . The cost in additional steps needed to reach convergence seems like a worthwhile trade-off.