Many of the best performing neural network architectures in real-world applications have a large number of parameters. For example, the current standard machine translation architecture, Transformer Vaswani et al. (2017), has layers that contain millions of parameters. Even models that are designed to jointly optimize the performance and the parameter efficiency, such as EfficientNets Tan and Le (2019), still require dozens to hundreds of megabytes, which limits their use in domains like robotics or virtual assistants.
Model compression schemes reduce the memory footprint of overparametrized models. Pruning LeCun et al. (1990) and distillation Hinton et al. (2015) remove parameters by reducing the number of network weights. In contrast, quantization focuses on reducing the bits per weight. This makes quantization particularly interesting when compressing models that have already been carefully optimized in terms of network architecture. Whereas deleting weights or whole hidden units will inevitably lead to a drop in performance, we demonstrate that quantizing the weights can be performed with little to no loss in accuracy.
Popular postprocessing quantization methods, like scalar quantization, replace the floating-point weights of a trained network by a lower-precision representation, like fixed-width integers Vanhoucke et al. (2011). These approaches achieve a good compression rate with the additional benefit of accelerating inference on supporting hardware. However, the errors made by these approximations accumulate in the computations performed during the forward pass, inducing a significant drop in performance Stock et al. (2019).
A solution to address this drifting effect is to directly quantize the network during training. This raises two challenges. First, the discretization operators have a null gradient: the derivative with respect to the input is zero almost everywhere. This requires special workarounds to train a network with these operators. The second challenge that often comes with these workarounds is the discrepancy that appears between the train and test functions implemented by the network. Quantization Aware Training (QAT) Jacob et al. (2018) resolves these issues by quantizing all the weights during the forward pass and using a straight-through estimator (STE) Bengio et al. (2013) to compute the gradient. This works when the error introduced by STE is small, as with int8 fixed-point quantization, but does not suffice in compression regimes where the approximation made by the compression is more severe.
In this work, we show that quantizing only a subset of weights instead of the entire network during training is more stable for high compression schemes. Indeed, by quantizing only a random fraction of the network at each forward pass, most of the weights are updated with unbiased gradients. Interestingly, we show that our method can employ a simpler quantization scheme during the training. This is particularly useful for quantizers with trainable parameters, such as Product Quantizer (PQ), for which our quantization proxy is not parametrized. Our approach simply applies a quantization noise, called Quant-Noise, to a random subset of the weights, see Figure 1. We observe that this makes a network resilient to various types of discretization methods: it significantly improves the accuracy associated with (a) low precision representation of weights like int8; and (b) state-of-the-art PQ. Further, we demonstrate that Quant-Noise can be applied to existing trained networks as a post-processing step, to improve the performance of the network after quantization.
In summary, this paper makes the following contributions:
We introduce the Quant-Noise technique to learn networks that are more resilient to a variety of quantization methods such as int4, int8, and PQ;
Adding Quant-Noise to PQ leads to new state-of-the-art trade-offs between accuracy and model size. For instance, for natural language processing (NLP), we reach 82.5% accuracy on MNLI by compressing RoBERTa to 14 MB. Similarly for computer vision, we report 80.0% top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3 MB;
By combining PQ and int8 to quantize weights and activations for networks trained with Quant-Noise, we obtain extreme compression with fixed-precision computation and achieve 79.8% top-1 accuracy on ImageNet and perplexity on WikiText-103.
2 Related Work
Many compression methods focus on efficient parameterization, via weight pruning LeCun et al. (1990); Li et al. (2016); Huang et al. (2018); Mittal et al. (2018), weight sharing Dehghani et al. (2018); Turc et al. (2019); Lan et al. (2019) or with dedicated architectures Tan and Le (2019); Zhang et al. (2017); Howard et al. (2019). Weight pruning is implemented during training Louizos et al. (2017) or as a fine-tuning post-processing step Han et al. (2015, 2016). Many pruning methods are unstructured, i.e., remove individual weights LeCun et al. (1990); Molchanov et al. (2017). On the other hand, structured pruning methods follow the structure of the weights to reduce both the memory footprint and the inference time of a model Li et al. (2016); Luo et al. (2017); Fan et al. (2019). We refer the reader to Liu et al. (2018) for a review of different pruning strategies. Others have worked on lightweight architectures, by modifying existing models Zhang et al. (2018); Wu et al. (2019); Sukhbaatar et al. (2019a) or developing new networks, such as MobileNet Howard et al. (2019), ShuffleNet Zhang et al. (2017), and EfficientNet Tan and Le (2019) in vision. Finally, knowledge distillation Hinton et al. (2015) has been applied to sentence representation Turc et al. (2019); Sanh et al. (2019a); Sun et al. (2019); Zhao et al. (2019); Jiao et al. (2019), to reduce the size of a BERT model Devlin et al. (2018).
There are extensive studies of scalar quantization to train networks with low-precision weights and activations Courbariaux et al. (2015); Courbariaux and Bengio (2016); Rastegari et al. (2016); McDonnell (2018). These methods benefit from specialized hardware to also improve the runtime during inference Vanhoucke et al. (2011). Other quantization methods such as Vector Quantization (VQ) and PQ Jegou et al. (2011) quantize blocks of weights simultaneously to achieve a higher compression rate Stock et al. (2019); Gong et al. (2014); Joulin et al. (2016); Carreira-Perpiñán and Idelbayev (2017). Closer to our work, several works have focused on simultaneously training and quantizing a network Jacob et al. (2018); Krishnamoorthi (2018); Gupta et al. (2015); Dong et al. (2019). Gupta et al. (2015) assign weights to a quantized bin stochastically, which is specific to scalar quantization but allows training with fixed-point arithmetic. Finally, our method can be interpreted as a form of Bayesian compression Louizos et al. (2017), using the Bayesian interpretation of Dropout Gal and Ghahramani (2016). As opposed to their work, we select our noise to match the weight transformation of a target quantization method without restricting it to a scale mixture prior.
3 Quantizing Neural Networks
In this section, we present the principles of quantization, several standard quantization methods, and describe how to combine scalar and product quantization. For clarity, we focus on the case of a fixed real matrix $W \in \mathbb{R}^{n \times p}$. We suppose that this matrix is split into $m \times q$ blocks $b_{kl}$:
$$W = \begin{pmatrix} b_{11} & \cdots & b_{1q} \\ \vdots & \ddots & \vdots \\ b_{m1} & \cdots & b_{mq} \end{pmatrix}, \tag{1}$$
where the nature of these blocks is determined by the quantization method. A codebook is a set of $K$ vectors, i.e., $\mathcal{C} = \{c[1], \dots, c[K]\}$. Quantization methods compress the matrix $W$ by assigning to each block $b_{kl}$ an index that points to a codeword $c$ in a codebook $\mathcal{C}$, and storing the codebook $\mathcal{C}$ and the resulting indices (as the entries $I_{kl}$ of an index matrix $I$) instead of the real weights. During the inference, they reconstruct an approximation $\widehat{W}$ of the original matrix $W$ whose blocks are $\widehat{b}_{kl} = c[I_{kl}]$.
We distinguish scalar quantization, such as int8, where each block consists of a single weight, from vector quantization, where several weights are quantized jointly.
3.1 Fixed-point Scalar Quantization
Fixed-point scalar quantization methods replace floating-point number representations by low-precision fixed-point representations. They simultaneously reduce a model’s memory footprint and accelerate inference by using fixed-point arithmetic on supporting hardware.
Fixed-point scalar quantization operates on blocks that represent a single weight, i.e., $b_{kl} = W_{kl}$. Floating-point weights are replaced by $N$-bit fixed-point numbers Gupta et al. (2015), with the extreme case of binarization where $N = 1$ Courbariaux et al. (2015). More precisely, the weights are rounded to one of $2^N$ possible codewords. These codewords correspond to bins evenly spaced by a scale factor $s$ and shifted by a bias $z$. Each weight $W_{kl}$ is mapped to its nearest codeword $c$, i.e.,
$$c = \left(\text{round}(W_{kl} / s + z) - z\right) \times s, \tag{2}$$
where we compute the scale and bias as:
$$s = \frac{\max W - \min W}{2^N - 1} \quad \text{and} \quad z = \text{round}(\min W / s).$$
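As an illustration of the rounding scheme described above, the following NumPy sketch quantizes a matrix to $N$-bit integer codes and maps the codes back to floating point. The function name `intn_quantize`, the choice to store codes offset by the bias so that they fall in $[0, 2^N - 1]$, and the clipping step are assumptions made for this example; they are not details taken from the text.

```python
import numpy as np

def intn_quantize(W, n_bits=8):
    """Uniformly quantize W to at most 2**n_bits codewords (hypothetical helper).

    Returns the dequantized matrix, the integer codes, the scale s and the bias z.
    """
    w_min, w_max = float(W.min()), float(W.max())
    n_levels = 2 ** n_bits
    s = (w_max - w_min) / (n_levels - 1)       # spacing between adjacent codewords
    if s == 0.0:
        raise ValueError("W is constant; the scale would be zero")
    z = int(np.round(w_min / s))               # integer bias aligned with min(W)
    # Assumption: codes are stored relative to z so the smallest weight maps to 0.
    codes = np.clip(np.round(W / s) - z, 0, n_levels - 1).astype(np.int64)
    W_hat = (codes + z) * s                    # dequantized weights
    return W_hat, codes, s, z
```

Because the bias is an integer, `(codes + z) * s` equals `round(W / s) * s` whenever no clipping occurs, so each weight moves by at most half a bin width.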
3.2 Product Quantization
Several quantization methods work on groups of weights, such as vectors, to benefit from the correlation induced by the structure of the network. In this work, we focus on Product Quantization for its good performance at extreme compression ratio Stock et al. (2019).
In vector quantization methods, the blocks are predefined groups of weights instead of single weights. The codewords are groups of values, and the index matrix maps groups of weights from the matrix $W$ to these codewords. In this section, we present the Product Quantization framework as it generalizes both scalar and vector quantization. We consider the case where we apply PQ to the columns of $W$ and thus assume that $q = p$.
Traditional vector quantization techniques split the matrix $W$ into its $p$ columns and learn a codebook on the resulting vectors. Instead, Product Quantization splits each column into $m$ subvectors and learns the same codebook for each of the resulting $m \times p$ subvectors. Each quantized vector is subsequently obtained by assigning its subvectors to the nearest codeword in the codebook. Learning the codebook is traditionally done using $k$-means with a fixed number $K$ of centroids, typically $K = 256$ so that the index matrix can be stored using int8. Thus, the objective function is written as:
$$\| W - \widehat{W} \|_2^2 = \sum_{k,l} \| b_{kl} - c[I_{kl}] \|_2^2.$$
PQ shares representations between subvectors, which allows for higher compression rates than intN.
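The Product Quantization procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the plain Lloyd-style $k$-means with random initialization, and the fixed iteration count are choices made for this example, and the dense distance computation is only suitable for small matrices.

```python
import numpy as np

def _nearest_codeword(blocks, codebook):
    """Index of the closest codeword (squared L2 distance) for each block."""
    dists = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def product_quantize(W, d, K, n_iter=10, seed=0):
    """Quantize the columns of W with one codebook shared by all subvectors."""
    n, p = W.shape
    assert n % d == 0, "column length must be a multiple of the subvector size"
    m = n // d
    assert K <= m * p, "cannot have more centroids than subvectors"
    # Row k * p + l of `blocks` is the subvector (k, l), of dimension d.
    blocks = W.reshape(m, d, p).transpose(0, 2, 1).reshape(m * p, d)
    rng = np.random.default_rng(seed)
    codebook = blocks[rng.choice(m * p, size=K, replace=False)].copy()
    for _ in range(n_iter):
        assign = _nearest_codeword(blocks, codebook)
        for k in range(K):
            members = blocks[assign == k]
            if len(members) > 0:               # keep empty clusters unchanged
                codebook[k] = members.mean(axis=0)
    assign = _nearest_codeword(blocks, codebook)
    # Rebuild the approximation by replacing each subvector with its codeword.
    W_hat = codebook[assign].reshape(m, p, d).transpose(0, 2, 1).reshape(n, p)
    return codebook, assign.reshape(m, p), W_hat
```

Only the codebook and the index matrix returned as the second value need to be stored; `W_hat` is recomputed from them at inference time.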
When quantizing a full network rather than a single matrix, extreme compression with PQ induces a quantization drift as reconstruction error accumulates Stock et al. (2019). Indeed, subsequent layers take as input the output of preceding layers, which are modified by the quantization of the preceding layers. This creates a drift in the network activations, resulting in large losses of performance. A solution proposed by Stock et al. (2019), which we call iterative PQ (iPQ), is to quantize layers sequentially from the lowest to the highest, and finetune the upper layers as the lower layers are quantized, under the supervision of the uncompressed (teacher) model. Codewords of each layer are finetuned by averaging the gradients of their assigned elements with gradient steps of the form:
$$c \leftarrow c - \eta \frac{1}{|J_c|} \sum_{(k,l) \in J_c} \frac{\partial \mathcal{L}}{\partial b_{kl}},$$
where $J_c$ is the set of indices $(k,l)$ of the blocks $b_{kl}$ assigned to the codeword $c$, $\mathcal{L}$ is the loss function and $\eta > 0$ is a learning rate. This adapts the upper layers to the drift appearing in their inputs, reducing the impact of the quantization approximation on the overall performance.
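The codeword finetuning step, in which each codeword moves along the average gradient of the blocks assigned to it, can be sketched as below. The function name and the array `block_grads`, assumed to hold one gradient per block as produced by backpropagation, are hypothetical and introduced only for illustration.

```python
import numpy as np

def ipq_centroid_step(codebook, assign, block_grads, lr):
    """One gradient step on the codewords (hypothetical helper).

    codebook:    (K, d) array of codewords
    assign:      (num_blocks,) index of the codeword assigned to each block
    block_grads: (num_blocks, d) gradient of the loss w.r.t. each block
    """
    updated = codebook.copy()
    for k in range(len(codebook)):
        members = assign == k
        if members.any():                      # unused codewords stay as they are
            updated[k] -= lr * block_grads[members].mean(axis=0)
    return updated
```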
3.3 Combining Fixed-Point with Product Quantization
Fixed-point quantization and Product Quantization are often regarded as competing choices, but can be advantageously combined. Indeed, PQ/iPQ compresses the network by replacing vectors of weights by their assigned centroids, but these centroids are in floating-point precision. Fixed-point quantization compresses both activations and weights to fixed-point representations. Combining both approaches means that the vectors of weights are mapped to centroids that are compressed to fixed-point representations, along with the activations. This benefits from the extreme compression ratio of iPQ and the finite-precision arithmetics of intN quantization.
More precisely, for a given matrix, we store the int8 representation of the $K$ centroids of dimension $d$ along with the $\log_2 K$-bit representations of the centroid assignments of the $m \times q$ subvectors. The int8 representation of the centroids is obtained with Eq. (2). The overall storage of the matrix and activations during a forward pass with batch size 1 (recalling that the input dimension is $n$) is
$$M = 8 \times K d + \log_2 K \times m q + 8 \times n \ \text{bits}.$$
In particular, when $K = 256$, the centroid assignments are also stored in int8, which means that every value required for a forward pass is stored in an int8 format. We divide by 4 the float32 overhead of storing the centroids, although the storage requirement associated with the centroids is small compared to the cost of indexing the subvectors for standard networks. In contrast to iPQ alone where we only quantize the weights, we also quantize the activations using int8. We evaluate this approach on both natural language processing and computer vision tasks in Section 5.
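To make the storage accounting concrete, the sketch below adds up the three contributions described above for one layer. The accounting itself (8 bits per centroid entry, $\log_2 K$ bits per subvector index, 8 bits per input activation) and all names are assumptions made for this illustration.

```python
import math

def ipq_int8_storage_bits(n, p, d, K=256, batch_size=1):
    """Bits needed for one n x p layer under iPQ combined with int8 (illustrative)."""
    assert n % d == 0, "input dimension must be a multiple of the subvector size"
    num_subvectors = (n // d) * p              # subvectors per column times columns
    centroid_bits = 8 * K * d                  # int8 centroids
    index_bits = math.ceil(math.log2(K)) * num_subvectors
    activation_bits = 8 * batch_size * n       # int8 input activations
    return centroid_bits + index_bits + activation_bits

# Example: a 1024 x 1024 layer with subvectors of size 8 and 256 centroids.
bits = ipq_int8_storage_bits(1024, 1024, d=8)
print(bits, "bits, i.e.", bits // 8, "bytes")
```

In this example the index matrix accounts for almost all of the footprint, which matches the remark that centroid storage is small compared to the cost of indexing the subvectors.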
Deep networks are not exposed to the noise caused by the quantization drift during training, leading to suboptimal performance. A solution to make the network robust to quantization is to introduce it during training. Quantization Aware Training (QAT) Jacob et al. (2018) exposes the network to quantization during training by quantizing weights during the forward pass. This transformation is not differentiable and gradients are approximated with a straight-through estimator (STE) Bengio et al. (2013); Courbariaux and Bengio (2016). STE introduces a bias in the gradients that depends on the level of quantization of the weights, and thus, the compression ratio. In this section, we propose a simple modification to control this induced bias with a stochastic amelioration of QAT, called Quant-Noise. The idea is to quantize a randomly selected fraction of the weights instead of the full network as in QAT, letting some unbiased gradients flow through unquantized weights. Our general formulation can simulate the effect of both quantization and of pruning during training.
4.1 Training Networks with Quantization Noise
We consider the case of a real matrix as in Section 3. During the training of a network, our proposed Quant-Noise method works as follows: first, we compute blocks related to a target quantization method. Then, during each forward pass, we randomly select a subset of these blocks and apply some distortion to them. During the backward pass, we compute gradients for all the weights, using STE for the distorted weights.
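More formally, given a set of tuples of indices $J \subseteq \{(k,l)\}$ for $1 \le k \le m$, $1 \le l \le q$, and a distortion or noise function $\varphi$ acting on a block, we define an operator $\psi(\cdot \mid J)$ such that, for each block $b_{kl}$, we apply the following transformation:
$$\psi(b_{kl} \mid J) = \begin{cases} \varphi(b_{kl}) & \text{if } (k,l) \in J, \\ b_{kl} & \text{otherwise.} \end{cases}$$
The noise function $\varphi$ simulates the change in the weights produced by the target quantization method (see Section 4.2 for details). We replace the matrix $W$ by the resulting noisy matrix $W_{\text{noise}}$ during the forward pass to compute a noisy output $y_{\text{noise}}$, i.e.,
$$y_{\text{noise}} = x W_{\text{noise}},$$
where $x$ is an input vector. During the backward pass, we compute the gradient on the non-distorted weights and apply STE on the distorted weights, i.e.,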
Note that our approach is equivalent to QAT when the selected set $J$ contains all the tuples of indices. However, an advantage of Quant-Noise over QAT is that unbiased gradients continue to flow via blocks unaffected by the noise. As these blocks are randomly selected for each forward pass, we guarantee that each weight regularly sees gradients that are not affected by the nature of the noise function. As a side effect, our quantization noise regularizes the network in a similar way as DropConnect Wan et al. (2013) or LayerDrop Fan et al. (2019).
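The training procedure of this section can be illustrated on a single linear layer with a hand-written backward pass. This is a sketch under stated assumptions: the function names, the layout of blocks as groups of consecutive rows within a column, and the use of a noise function applied to the whole matrix before masking are choices made for this example.

```python
import numpy as np

def quant_noise_forward(x, W, noise_fn, rate, block_size, rng):
    """Forward pass where a random fraction `rate` of the blocks is distorted."""
    n, p = W.shape
    assert n % block_size == 0
    # Draw the set of selected blocks afresh at every forward pass.
    selected = rng.random((n // block_size, p)) < rate
    mask = np.repeat(selected, block_size, axis=0)    # expand to weight resolution
    W_noise = np.where(mask, noise_fn(W), W)
    return x @ W_noise, W_noise, mask

def ste_weight_grad(x, grad_y):
    """Gradient w.r.t. the noisy weights, reused unchanged for W (STE).

    For blocks that were not selected the noisy and original weights coincide,
    so this gradient carries no approximation error for them.
    """
    return x.T @ grad_y
```

With `rate=1.0` every block is distorted, which corresponds to the QAT setting, while `rate=0.0` recovers ordinary training.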
Composing quantization noises.
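As noise operators are compositionally commutative, we can make a network robust to a combination of quantization methods by composing their noise operators. For two noise operators $\psi_1$ and $\psi_2$, this gives:
$$\psi(W \mid J) = \psi_1 \circ \psi_2 (W \mid J). \tag{9}$$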
This property is particularly useful to combine quantization with pruning operators during training, as well as combining scalar quantization with product quantization.
4.2 Adding Noise to Specific Quantization Methods
In this section, we propose several implementations of the noise function for the quantization methods described in Section 3. We also show how to handle pruning with it.
Fixed-point scalar quantization.
In intN quantization, the blocks are atomic and weights are rounded to their nearest neighbor in the codebook. The function $\varphi_{\text{intN}}$ replaces a weight $w$ with the output of the rounding function defined in Eq. (2), i.e.,
$$\varphi_{\text{intN}}(w) = \left(\text{round}(w / s + z) - z\right) \times s,$$
where the scale $s$ and the bias $z$ are updated during training. In particular, the application of Quant-Noise to int8 scalar quantization is a stochastic amelioration of QAT.
|Quantization Scheme||Language Modeling||Image Classification|
|- trained with QAT||34.1|
|- trained with Quant-Noise|
|- trained with QAT|
|- trained with Quant-Noise|
|- trained with QAT|
|- trained with Quant-Noise|
|iPQ & int8 + Quant-Noise|
As opposed to intN, codebooks in PQ require a clustering step based on weight values. During training, we learn codewords online and use the resulting centroids to implement the quantization noise. More precisely, the noise function $\varphi_{\text{PQ}}$ assigns a selected block $v$ to its nearest codeword in the associated codebook $\mathcal{C}$:
$$\varphi_{\text{PQ}}(v) = \operatorname*{argmin}_{c \in \mathcal{C}} \| v - c \|_2^2.$$
Updating the codebooks online works well. However, empirically, running $k$-means once per epoch is faster and does not noticeably modify the resulting accuracy.
Note that computing the exact noise function for PQ is computationally demanding. We propose a simpler and faster alternative approximation to the operational transformation of PQ and iPQ. The noise function $\varphi_{\text{proxy}}$ simply zeroes out the subvectors of the selected blocks, i.e., $\varphi_{\text{proxy}}(v) = 0$. As a side note, we considered other alternatives, for instance one where the subvectors are mapped to the mean subvector. In practice, we found that these approximations lead to similar performance, see Section 6. This proxy noise function is a form of Structured Dropout and encourages correlations between the subvectors. This correlation is beneficial to the subsequent clustering involved in PQ/iPQ.
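The proxy noise can be sketched in a few lines, together with the mean-subvector alternative mentioned above. The function name, the `mode` argument, and the layout of subvectors as groups of `d` consecutive rows within a column are assumptions made for this illustration.

```python
import numpy as np

def proxy_noise(W, rate, d, rng, mode="zero"):
    """Replace a random fraction `rate` of the subvectors of W (illustrative)."""
    n, p = W.shape
    assert n % d == 0
    m = n // d
    blocks = W.reshape(m, d, p)                # blocks[k, :, l] is subvector (k, l)
    selected = rng.random((m, p)) < rate
    if mode == "zero":
        fill = np.zeros((1, d, 1))             # zero out the selected subvectors
    elif mode == "mean":
        fill = blocks.mean(axis=(0, 2), keepdims=True)   # mean subvector
    else:
        raise ValueError("mode must be 'zero' or 'mean'")
    noisy = np.where(selected[:, None, :], fill, blocks)
    return noisy.reshape(n, p), selected
```

No codebook or cluster assignment is needed here, which is what makes the proxy cheaper than the exact PQ noise.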
Adding pruning to the quantization noise.
The specific form of quantization noise can be adjusted to incorporate additional noise specific to pruning. We simply combine the noise operators of quantization and pruning by composing them following Eq. (9). We consider the pruning noise function of Fan et al. (2019) where they randomly drop predefined structures during training. In particular, we focus on LayerDrop, where the structures are the residual blocks of highway-like layers Srivastava et al. (2015), as most modern architectures, such as ResNet or Transformer, are composed of this structure. More precisely, the corresponding noise operator over residual blocks $v$ is
$$\psi_{\text{LayerDrop}}(v \mid J) = \begin{cases} 0 & \text{if } (k,l) \in J, \\ v & \text{otherwise.} \end{cases}$$
For pruning, we do not use STE to backpropagate the gradient of pruned weights, as dropping them entirely during training has the benefit of speeding convergence Huang et al. (2016). Once a model is trained with LayerDrop, the number of layers kept at inference can be adapted to match computation budget or time constraint.
We demonstrate the impact of Quant-Noise on the performance of several quantization schemes in a variety of settings (see Appendix - Sec. 8.1). We compare iPQ + Quant-Noise with existing work to demonstrate that Quant-Noise leads to extreme compression rates at a reasonable cost in accuracy.
|Language modeling||Sentence Representation||Image Classification|
5.1 Improving Compression with Quant-Noise
Quant-Noise is a regularization method that makes networks more robust to the target quantization scheme or combination of quantization schemes during training. We show the impact of Quant-Noise in Table 1 for a variety of quantization methods: int8/int4 and iPQ.
We experiment in two different settings: a Transformer network trained for language modeling on WikiText-103 and an EfficientNet-B3 convolutional network trained for image classification on ImageNet-1k. Our quantization noise framework is general and flexible: Quant-Noise improves the performance of quantized models for every quantization scheme in both experimental settings. Importantly, Quant-Noise only changes model training by adding a regularization noise similar to dropout, with no impact on the convergence rate or training speed.
This comparison of different quantization schemes shows that Quant-Noise works particularly well with high performance quantization methods, like iPQ, where QAT tends to degrade the performance, even compared to quantizing as a post-processing step. In subsequent experiments in this section, we focus on applications with iPQ because it offers the best trade-off between model performance and compression, and has little negative impact on FLOPS.
Fixed-Point Product Quantization.
Combining iPQ and int8 as described in Section 3.3 allows us to take advantage of the high compression rate of iPQ with a fixed-point representation of both centroids and activations. As shown in Table 1, this combination incurs little loss in accuracy with respect to iPQ + Quant-Noise. Most of the memory footprint of iPQ comes from indexing and not storing centroids, so the compression ratios are comparable.
Complementarity with Weight Pruning and Sharing.
We analyze how Quant-Noise is compatible and complementary with pruning (“+Prune”) and weight sharing (“+Share”), see Appendix for details on weight sharing. We report results for Language modeling on WikiText-103, pre-trained sentence representations on MNLI and object classification on ImageNet-1k in Table 2. The conclusions are remarkably consistent across tasks and benchmarks: Quant-Noise gives a large improvement over strong iPQ baselines. Combining it with sharing and pruning offers additional interesting operating points of performance vs size.
5.2 Comparison with the state of the art
We now compare our approach on the same tasks against the state of the art. We apply our best quantization setup on competitive models and reduce their memory footprint by when combining with weight sharing and pruning, offering extreme compression for good performance.
Natural Language Processing.
In Figure 2, we examine the trade-off between performance and model size. Our quantized RoBERTa offers a competitive trade-off between size and performance with memory reduction methods dedicated to BERT, like TinyBERT, MobileBERT, or AdaBERT.
We compress EfficientNet-B3 from Mb to Mb ( compression) while maintaining high top-1 accuracy ( versus for the original model). As shown in Figure 2, our quantized EfficientNet-B3 is smaller and more accurate than architectures dedicated to optimize on-device performance with limited size like MobileNet or ShuffleNet.
Incorporating pruning noise into quantization is also beneficial. For example, with pruning iPQ+Quant-Noise reduces size by with only a drop of PPL in language modeling. Further, pruning reduces FLOPS by the same ratio as its compression factor, in our case, . By adding sharing with pruning, in language modeling, we achieve an extreme compression ratio of with a drop of PPL with FLOPS reduction from pruning entire shared chunks of layers. For comparison, our MB model has the same performance as the MB Transformer-XL base.
In this section, we study the use of our approach as a post-processing step where a pre-trained model is finetuned with Quant-Noise. We also examine the impact of the level of noise during training as well as the impact of approximating iPQ during training.
|Train without Quant-Noise||25.2||Train without Quant-Noise||82.5|
|+ Finetune with Quant-Noise||20.9||+ Finetune with Quant-Noise||83.4|
|Train with Quant-Noise||20.7||Train with Quant-Noise||83.6|
6.1 Finetuning with Quant-Noise for Post-Processing Quantization
We explore taking existing pre-trained models and post-processing with Quant-Noise instead of training from scratch. For language modeling, we start with the Adaptive Inputs architecture and train for 10 additional epochs. For RoBERTa, we train for 25k more updates. We show that finetuning with Quant-Noise incorporates the benefits and almost matches training from scratch, see Table 3. For example, in language modeling, there is only a 0.2 PPL difference after applying iPQ.
We further examine how to incorporate Quant-Noise more flexibly into pretraining RoBERTa for sentence classification tasks. We take an already pre-trained RoBERTa model and only incorporate Quant-Noise during the sentence classification task transfer learning step. We show in Table 3 that this is also effective at compressing while retaining accuracy after quantization with iPQ.
6.2 Impact of Noise Rate
We analyze the performance for various rates of Quant-Noise in Figure 3 on a Transformer for language modeling. For iPQ, performance is impacted by high rates of quantization noise. For example, a Transformer trained with the proxy noise function degrades with a rate higher than 0.5, i.e., when more than half of the weights are passed through the proxy noise function. We hypothesize that for large quantities of noise, a larger effect of using the proxy rather than the exact PQ noise is observed. For int8 quantization and its noise function, higher rates of noise are slightly worse but the degradation is not as severe. A rate of 1 for int8 quantization is equivalent to the Quantization Aware Training of Krishnamoorthi (2018), as the full matrix is quantized with STE, showing the potential benefit of partial quantization during training.
6.3 Impact of Approximating the Noise Function
We study the impact of approximating quantization noise during training. We focus on the case of iPQ with the approximation described in Section 4.2. In Table 4, we compare the exact noise function for iPQ with its proxy approximation. This approximate noise function does not consider cluster assignments or centroid values and simply zeroes out the selected blocks. For completeness, we include an intermediate approximation where we consider cluster assignments to apply noise within each cluster, but still zero out the vectors. These approximations do not affect the performance of the quantized models. This suggests that increasing the correlation between subvectors that are jointly clustered is enough to maintain the performance of a model quantized with iPQ. Since PQ tends to work well on highly correlated vectors, such as activations in convolutional networks, this is not surprising. Using the approximation presents the advantage of speed and practicality. Indeed, one does not need to compute cluster assignments and centroids for every layer in the network after each epoch. Moreover, the approach is less involved in terms of code.
We show that quantizing a random subset of weights during training maintains performance in the high quantization regime. We validate that Quant-Noise works with a variety of different quantization schemes on several applications in text and vision. Our method can be applied to a combination of iPQ and int8 to benefit from extreme compression ratio and fixed-point arithmetic. Finally, we show that Quant-Noise can be used as a post-processing step to prepare already trained networks for subsequent quantization, to improve the performance of the compressed model.
- Classy vision. Note: https://github.com/facebookresearch/ClassyVision Cited by: §8.1, §8.2.
- Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853. Cited by: §8.1, §8.2.
Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: Training with Quantization Noise for Extreme Fixed-Point Compression, §1, §4.
- . arXiv preprint arXiv:1611.01576. Cited by: Table 5.
-  Faster and just as accurate: a simple decomposition for transformer models. Cited by: Table 6.
- Model compression as constrained optimization, with application to neural nets. part ii: quantization. Cited by: §2.
- AdaBERT: task-adaptive bert compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246. Cited by: Table 6.
- Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §3.1.
- BinaryConnect: training deep neural networks with binary weights during propagations. CoRR. Cited by: §2, §3.1.
- BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1. CoRR. Cited by: §2, §4.
- Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: Table 5.
- Language modeling with gated convolutional networks. In Proc. of ICML, Cited by: §8.2, Table 5.
- Universal transformers. Cited by: §2.
- ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, Cited by: §8.1.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2, §8.1, Table 6.
- Stochastic quantization for learning accurate low-bit deep neural networks. International Journal of Computer Vision 127 (11-12), pp. 1629–1642. Cited by: §2.
- Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556. Cited by: §2, §4.1, §4.2, §8.1, §8.5, Table 6.
Dropout as a bayesian approximation: representing model uncertainty in deep learning. In
international conference on machine learning, pp. 1050–1059. Cited by: §2.
- Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115. Cited by: §2.
- Efficient softmax approximation for gpus. arXiv abs/1609.04309. Cited by: §8.2.
- Deep learning with limited numerical precision. In ICML, Cited by: §2, §3.1.
- Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR. Cited by: §2.
- Learning both weights and connections for efficient neural network. In NIPS, pp. 1135–1143. Cited by: §2.
- Deep residual learning for image recognition. CoRR. Cited by: Table 7.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §2.
- Searching for mobilenetv3. arXiv e-prints. Cited by: §2.
- Condensenet: an efficient densenet using learned group convolutions. In CVPR, Cited by: §2, Table 7.
- Deep networks with stochastic depth. In ECCV, Cited by: §4.2.
Quantization and training of neural networks for efficient integer-arithmetic-only inference.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: Training with Quantization Noise for Extreme Fixed-Point Compression, §1, §2, §4.
- Product quantization for nearest neighbor search. PAMI. Cited by: §2.
- Tinybert: distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351. Cited by: §2, Table 6.
- Fasttext.zip: compressing text classification models. arXiv preprint arXiv:1612.03651. Cited by: §2.
- Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §2, §6.2.
- Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §8.2.
ALBERT: a lite bert for self-supervised learning of language representations. Cited by: §2, Table 6.
- Optimal brain damage. In NIPS, Cited by: §1, §2.
- Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §2.
- Additive powers-of-two quantization: a non-uniform discretization for neural networks. arXiv preprint arXiv:1909.13144. Cited by: §3.1.
- Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §8.1, §8.2, §8.2.
- Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270. Cited by: §2.
Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §8.2.
- Learning sparse neural networks through regularization. arXiv preprint arXiv:1712.01312. Cited by: §2, §2.
- Thinet: a filter level pruning method for deep neural network compression. In ICCV, Cited by: §2.
- ShuffleNet V2: practical guidelines for efficient CNN architecture design. CoRR. Cited by: Table 7.
A tensorized transformer for language modeling. arXiv preprint arXiv:1906.09777. Cited by: Table 5.
- Training wide residual networks for deployment using a single bit for each weight. Cited by: §2.
- Pointer Sentinel Mixture Models. arXiv abs/1609.07843. Cited by: §8.1.
- Recovering from random pruning: on the plasticity of deep convolutional neural networks. In WACV, Cited by: §2.
- Variational dropout sparsifies deep neural networks. In ICML, Cited by: §2.
- Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: §8.1.
- How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), Cited by: §8.2.
- Automatic differentiation in pytorch. Cited by: §8.1.
- Language models are unsupervised multitask learners. Cited by: §8.2.
- Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507. Cited by: Table 5.
- Xnor-net: imagenet classification using binary convolutional neural networks. In ECCV, Cited by: §2.
- Mobilenetv2: inverted residuals and linear bottlenecks. In Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: Table 7.
- DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. Cited by: §2.
- DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: Table 6.
- Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §4.2.
- And the bit goes down: revisiting the quantization of neural networks. CoRR abs/1907.05686. Cited by: §1, §2, §3.2, §3.2, Table 7.
- Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799. Cited by: §2.
- Augmenting self-attention with persistent memory. arXiv preprint arXiv:1907.01470. Cited by: Table 5.
- Patient knowledge distillation for bert model compression. EMNLP. Cited by: §2, Table 6.
- MobileBERT: task-agnostic compression of bert for resource limited devices. Cited by: Table 6.
- On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §8.2.
- EfficientNet: rethinking model scaling for convolutional neural networks. Cited by: §1, §2, Table 1, §8.1, Table 7.
- Well-read students learn better: the impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962. Cited by: §2, Table 6.
- Improving the speed of neural networks on cpus. Cited by: §1, §2.
- Attention is all you need. In NIPS, Cited by: §1.
- Regularization of neural networks using DropConnect. In ICML, Cited by: §4.1.
- GLUE: a multi-task benchmark and analysis platform for natural language understanding. Note: ICLR Cited by: §8.1.
- HAQ: hardware-aware automated quantization. CoRR. Cited by: Table 7.
- A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, Cited by: §8.1.
- Pay less attention with lightweight and dynamic convolutions. In ICLR, Cited by: §2.
- Accelerating neural transformer via an average attention network. arXiv preprint arXiv:1805.00631. Cited by: §2.
- ShuffleNet: an extremely efficient convolutional neural network for mobile devices. CoRR. Cited by: §2.
- Extreme language model compression with optimal subwords and shared projections. arXiv preprint arXiv:1909.11687. Cited by: §2, Table 6.
8.1 Experimental Setting
We assess the effectiveness of Quant-Noise on competitive language and vision benchmarks. We consider Transformers for language modeling, RoBERTa for pre-training sentence representations, and EfficientNet for image classification. Our models are implemented in PyTorch. We use fairseq for language modeling and for pre-training sentence representations, and Classy Vision for EfficientNet.
Pre-Training of Sentence Representations.
8.2 Training Details
To handle the large vocabulary of Wikitext-103, we follow  and  in using adaptive softmax  and adaptive input for computational efficiency. For both input and output embeddings, we use dimension size and three adaptive bands: K, K, and K. We use a cosine learning rate schedule [2, 41] and train with Nesterov’s accelerated gradient. We set the momentum to 0.99 and renormalize gradients if the norm exceeds 0.1. During training, we partition the data into blocks of contiguous tokens that ignore document boundaries. At test time, we respect sentence boundaries. We set LayerDrop to 0.2 and the Quant-Noise value to 0.05, which we selected by searching over the values (0.05, 0.1, 0.2). During training, the block size of Quant-Noise is 8.
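As a rough illustration of how the Quant-Noise value and the block size interact, the sketch below treats the value as the probability that a given block of weights receives noise in a forward pass, and leaves the other blocks untouched. The helper names `int8_round_trip` and `quant_noise`, the flat-list weight layout, and the use of an int8 round trip as the noise function are assumptions made for illustration; this is not the fairseq implementation, and only the forward pass is shown.

```python
import random


def int8_round_trip(block):
    """Hypothetical noise function: uniform 8-bit quantize, then dequantize."""
    lo, hi = min(block), max(block)
    if hi == lo:
        return list(block)
    scale = (hi - lo) / 255.0
    return [lo + round((w - lo) / scale) * scale for w in block]


def quant_noise(weights, p=0.05, block_size=8, rng=random):
    """Apply the noise function to each block independently with probability p."""
    out = list(weights)
    for start in range(0, len(weights), block_size):
        if rng.random() < p:
            block = weights[start:start + block_size]
            out[start:start + block_size] = int8_round_trip(block)
    return out


weights = [i / 10 for i in range(16)]
noisy = quant_noise(weights, p=0.05, block_size=8)
```

With `p=0.0` the weights pass through unchanged, and with `p=1.0` every block is quantized, which corresponds to quantizing all the weights in the forward pass as in QAT.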
The base architecture is a layer model with embedding size and FFN size . We follow  in using the subword tokenization scheme from , which uses bytes as subword units. This eliminates unknown tokens. We train with large batches of size and maintain this batch size using gradient accumulation. We do not use next sentence prediction. We optimize with Adam with a polynomial decay learning rate schedule. We set LayerDrop to 0.2. We set Quant-Noise value to 0.1. We did not run a hyperparameter search to determine the optimal value of Quant-Noise, as training RoBERTa is computationally intensive. During training, the block size of Quant-Noise is 8.
During finetuning, we search over three learning rates (1e-5, 2e-5, 3e-5) and two batch sizes (16 or 32 sentences). The other parameters are set following . We do single-task finetuning, meaning we only tune on the data provided for the given natural language understanding task. We do not perform ensembling. When finetuning models trained with LayerDrop, we apply LayerDrop and Quant-Noise during finetuning as well.
We use the architecture of EfficientNet-B3 defined in Classy Vision and follow the default hyperparameters for training. We set Quant-Noise value to 0.1. We searched over the values (0.05, 0.1, 0.2) to determine the optimal value of Quant-Noise. During training, the block size of Quant-Noise is set to for all convolutions, for depth-wise convolutions, for depth-wise convolutions and for the classifier. For sharing, we shared weights between blocks 9-10, 11-12, 14-15, 16-17, 19-20-21, 22-23 and refer to blocks that share the same weights as a chunk. For LayerDrop, we drop the chunks of blocks defined previously with probability 0.2 and evaluate only with chunks 9-10, 14-15 and 19-20-21.
8.3 Scalar Quantization Details
We closely follow the methodology of PyTorch 1.4. We emulate scalar quantization by quantizing both the weights and the activations. The scales and zero points of the activations are determined by running a few forward passes ahead of the evaluation and are then fixed. We use the Histogram method to compute the scale and zero point; it aims at approximately minimizing the quantization error by adjusting them. This scheme is a refinement of the MinMax scheme. Per-channel quantization is also discussed in Table 9.
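To make the emulation concrete, the following is a minimal sketch of the simpler MinMax scheme, which the Histogram method refines: a scale and zero point are derived from the observed range of a few calibration values and then held fixed while values are quantized and mapped back to floats. The function names, the unsigned 8-bit range, and the choice to extend the range to include zero are assumptions for illustration and are not taken from the PyTorch source.

```python
def minmax_qparams(values, n_bits=8):
    """Scale and zero point from the observed range (MinMax scheme)."""
    qmax = 2 ** n_bits - 1
    # Extend the range to contain 0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def fake_quantize(values, scale, zero_point, n_bits=8):
    """Quantize to unsigned integers, clamp, then map back to floats."""
    qmax = 2 ** n_bits - 1
    out = []
    for v in values:
        q = round(v / scale) + zero_point
        q = max(0, min(qmax, q))  # values outside the calibrated range are clipped
        out.append((q - zero_point) * scale)
    return out


# Calibration: observe activations from a few forward passes, then fix the parameters.
calibration = [-1.0, -0.5, 0.0, 0.25, 2.0]
scale, zero_point = minmax_qparams(calibration)
emulated = fake_quantize(calibration, scale, zero_point)
```

A histogram-based variant would search for a narrower range than the raw minimum and maximum when doing so lowers the overall quantization error, at the cost of clipping a few outliers.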
|Trans XL Large ||970||18.3|
|Compressive Trans ||970||17.1|
|4 Layer QRNN ||575||33.0|
|Trans XL Base ||570||24.0|
|Persis Mem ||506||20.6|
|Tensorized core-2 ||325||18.9|
|Quant-Noise + Share + Prune||10||24.2|
|RoBERTa Base + LD ||480||84.8|
|BERT Base ||420||84.4|
|PreTrained Distil ||257||82.5|
|ALBERT Base ||45||81.6|
|Quant-Noise + Share + Prune||14||82.5|
|HAQ 4 bits ||12.4||76.2|
|iPQ ResNet-50 ||5.09||76.1|
|Quant-Noise + Share + Prune||2.3||77.8|
8.4 iPQ Quantization Details
We quantize FFN with block size 8, embeddings with block size 8, and attention with block size 4. We tuned the block size for attention between the values (4, 8) to find the best performance. Note that during training we apply Quant-Noise to all the layers.
We quantize FFN with block size 4, embeddings with block size 4, and attention with block size 4. We tuned the block size between the values (4, 8) to find the best performance. Note that during training we apply Quant-Noise to all the layers.
We quantize blocks sequentially and end with the classifier. The block sizes are for all convolutions, for depth-wise convolutions, for depth-wise convolutions and for the classifier. Note that during training we apply Quant-Noise to all the weights in InvertedResidual Blocks (except the Squeeze-Excitation subblocks), the head convolution and the classifier.
8.5 Details of Pruning and Layer Sharing
We apply the Every Other Layer strategy from Fan et al. When combining layer sharing with pruning, we train models with shared layers and then prune chunks of shared layers. When sharing layers, the weights of adjacent layers are shared in chunks of two. As a concrete example, consider a model with layers A, B, C, D, E, F, G, H. We share layers A and B, C and D, E and F, G and H. To prune, every other chunk is pruned away; for example, we could prune A, B, E, F.
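The eight-layer example can be written out as a short sketch. The function names are hypothetical, and pruning the chunks at even positions is just one possible offset, chosen here to reproduce the A, B, E, F example.

```python
def share_in_chunks(layers, chunk_size=2):
    """Group adjacent layers into chunks; layers within a chunk share weights."""
    return [layers[i:i + chunk_size] for i in range(0, len(layers), chunk_size)]


def prune_every_other_chunk(chunks):
    """Prune chunks 0, 2, 4, ... and keep the remaining ones."""
    pruned = chunks[0::2]
    kept = chunks[1::2]
    return kept, pruned


layers = list("ABCDEFGH")
chunks = share_in_chunks(layers)
kept, pruned = prune_every_other_chunk(chunks)
print(chunks)  # [['A', 'B'], ['C', 'D'], ['E', 'F'], ['G', 'H']]
print(pruned)  # [['A', 'B'], ['E', 'F']]
print(kept)    # [['C', 'D'], ['G', 'H']]
```

Because pruning removes whole chunks, each surviving chunk still consists of layers that share a single set of weights.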
8.6 Numerical Results for Graphical Diagrams
8.7 Further Ablations
8.7.1 Impact of Quant-Noise for the Vision setup
We provide another study showing the impact of the proportion of elements on which to apply Quant-Noise in Table 8.
|Quantization Scheme||Language Modeling||Image Classification|
|Size||Compress||Test PPL||Size||Compress||Top-1 Acc.|
|Int4 Quant Histogram|
|Int4 Quant Channel|
|Int8 Quant Histogram|
|Int8 Quant Channel|
|Quant-Noise + Share + Prune||10||24.2|
|Quant-Noise + Share + Prune with STE||10||24.5|
8.7.2 Impact of the number of centroids
We quantize with 256 centroids which represents a balance between size and representation capacity. The effect of the number of centroids on performance and size is shown in Figure 4 (a). Quantizing with more centroids improves perplexity — this parameter could be adjusted based on the practical storage constraints.
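The size side of this trade-off can be estimated with simple arithmetic. In product quantization, each block of weights is stored as an index into a codebook, so the storage is the index cost plus the codebook cost. The sketch below assumes a single codebook per matrix and 16-bit centroid values; both are assumptions for illustration, and `ipq_size_bytes` is a hypothetical helper.

```python
import math


def ipq_size_bytes(n_weights, block_size, n_centroids, centroid_bits=16):
    """Approximate storage of one product-quantized weight matrix."""
    n_blocks = math.ceil(n_weights / block_size)
    # Each block stores one index into the codebook.
    index_bits = n_blocks * math.ceil(math.log2(n_centroids))
    # The codebook holds n_centroids vectors of block_size values each.
    codebook_bits = n_centroids * block_size * centroid_bits
    return (index_bits + codebook_bits) / 8


n_weights = 1024 * 1024
fp32_bytes = n_weights * 4
for k in (64, 256, 1024):
    size = ipq_size_bytes(n_weights, block_size=8, n_centroids=k)
    print(k, size, fp32_bytes / size)
```

With 256 centroids each index fits in exactly one byte. The same function also shows why larger block sizes compress more: doubling `block_size` halves the number of indices to store.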
8.7.3 Effect of Initial Model Size
Large, overparameterized models are more easily compressed. In Figure 5, we explore quantizing both shallower and skinnier models. For shallow models, the gap between quantized and non-quantized perplexity does not increase as layers are removed (Figure 5, left). In contrast, there is a larger gap in performance for models with smaller FFN (Figure 5, right). As the FFN size decreases, the weights are less redundant and more difficult to quantize with iPQ.
8.7.4 Difficulty of Quantizing Different Model Structures
Quantization is applied to various portions of the Transformer architecture — the embedding, attention, feedforward, and classifier output. We compare the quantizability of various portions of the network in this section.
Is the order of structures important?
We quantize specific network structures first — this is important as quantizing weight matrices can accumulate reconstruction error. Some structures of the network should be quantized last so the finetuning process can better adjust the centroids. We find that there are small variations in performance based on quantization order (see Figure 6). We choose to quantize FFN, then embeddings, and finally the attention matrices in Transformer networks.
Which structures can be compressed the most?
Finally, we analyze which network structures can be most compressed. During quantization, various matrix block sizes can be chosen as a parameter — the larger the block size, the more compression, but also the larger the potential reduction of performance. Thus, it is important to understand how much each network structure can be compressed to reduce the memory footprint of the final model as much as possible. In Figure 6, we quantize two model structures with a fixed block size and vary the block size of the third between 4 and 32. As shown, the FFN and embedding structures are more robust to aggressive compression, while the attention drastically loses performance as larger block sizes are used.
8.7.5 Approach to intN Scalar Quantization
We compare quantizing per-channel to using a histogram quantizer in Table 9. The histogram quantizer maintains a running min/max and minimizes L2 distance between quantized and non-quantized values to find the optimal min/max. Quantizing per channel learns scales and offsets as vectors along the channel dimension, which provides more flexibility since scales and offsets can be different.
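The extra flexibility of per-channel scales and offsets can be seen on a toy weight matrix whose rows have very different magnitudes. The sketch below is illustrative only: it uses plain min/max ranges for both variants instead of the histogram quantizer, treats each row as a channel, and all function names are hypothetical.

```python
def quantize_row(row, lo, hi, n_bits=4):
    """Uniformly quantize values to 2**n_bits levels on [lo, hi], then dequantize."""
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((min(max(w, lo), hi) - lo) / scale) * scale for w in row]


def l2_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def per_tensor_error(matrix, n_bits=4):
    """One shared range for the whole matrix."""
    flat = [w for row in matrix for w in row]
    lo, hi = min(flat), max(flat)
    return sum(l2_error(row, quantize_row(row, lo, hi, n_bits)) for row in matrix)


def per_channel_error(matrix, n_bits=4):
    """A separate range for each row (channel)."""
    return sum(
        l2_error(row, quantize_row(row, min(row), max(row), n_bits))
        for row in matrix
    )


matrix = [
    [-0.01, 0.0, 0.005, 0.01],  # small-magnitude channel
    [-1.0, 0.2, 0.5, 1.0],      # large-magnitude channel
]
print(per_tensor_error(matrix), per_channel_error(matrix))
```

With a shared range, the small-magnitude channel is represented by only a couple of levels near zero, whereas a per-channel range gives it the full set of levels.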
8.7.6 LayerDrop with STE
For quantization noise, we apply the straight-through estimator (STE) to the remaining weights in the backward pass. We also experiment with applying STE to the backward pass of LayerDrop’s pruning noise. Results are shown in Table 10; we find that this gives slightly worse results.