1 Introduction
Many of the best-performing neural network architectures in real-world applications have a large number of parameters. For example, the current standard machine translation architecture, the Transformer Vaswani et al. (2017), has layers that contain millions of parameters. Even models that are designed to jointly optimize performance and parameter efficiency, such as EfficientNets Tan and Le (2019), still require dozens to hundreds of megabytes, which limits their application in domains like robotics or virtual assistants.
Model compression schemes reduce the memory footprint of overparametrized models. Pruning LeCun et al. (1990) and distillation Hinton et al. (2015) remove parameters by reducing the number of network weights. In contrast, quantization focuses on reducing the number of bits per weight. This makes quantization particularly interesting when compressing models that have already been carefully optimized in terms of network architecture. Whereas deleting weights or whole hidden units will inevitably lead to a drop in performance, we demonstrate that quantizing the weights can be performed with little to no loss in accuracy.
Popular post-processing quantization methods, like scalar quantization, replace the floating-point weights of a trained network by a lower-precision representation, like fixed-width integers Vanhoucke et al. (2011). These approaches achieve a good compression rate with the additional benefit of accelerating inference on supporting hardware. However, the errors made by these approximations accumulate in the computations performed during the forward pass, inducing a significant drop in performance Stock et al. (2019).
A solution to address this drifting effect is to directly quantize the network during training. This raises two challenges. First, the discretization operators have a null gradient: the derivative with respect to the input is zero almost everywhere. Training a network with these operators therefore requires special workarounds. The second challenge, which often comes with these workarounds, is the discrepancy that appears between the train and test functions implemented by the network. Quantization Aware Training (QAT) Jacob et al. (2018) resolves these issues by quantizing all the weights during the forward pass and using a straight-through estimator (STE) Bengio et al. (2013) to compute the gradient. This works when the error introduced by STE is small, as with int8 fixed-point quantization, but does not suffice in compression regimes where the approximation made by the compression is more severe.
In this work, we show that quantizing only a subset of weights instead of the entire network during training is more stable for high compression schemes. Indeed, by quantizing only a random fraction of the network at each forward pass, most of the weights are updated with unbiased gradients. Interestingly, we show that our method can employ a simpler quantization scheme during training. This is particularly useful for quantizers with trainable parameters, such as Product Quantizer (PQ), for which our quantization proxy is not parametrized. Our approach simply applies a quantization noise, called QuantNoise, to a random subset of the weights, see Figure 1. We observe that this makes a network resilient to various types of discretization methods: it significantly improves the accuracy associated with (a) low-precision representation of weights like int8; and (b) state-of-the-art PQ. Further, we demonstrate that QuantNoise can be applied to existing trained networks as a post-processing step, to improve the performance of the network after quantization.
In summary, this paper makes the following contributions:

We introduce the QuantNoise technique to learn networks that are more resilient to a variety of quantization methods such as int4, int8, and PQ;

Adding QuantNoise to PQ leads to new state-of-the-art trade-offs between accuracy and model size. For instance, for natural language processing (NLP), we reach 82.5% accuracy on MNLI by compressing RoBERTa to 14 MB. Similarly, for computer vision, we report 80.0% top-1 accuracy on ImageNet by compressing an EfficientNet-B3 to 3.3 MB;

By combining PQ and int8 to quantize weights and activations for networks trained with QuantNoise, we obtain extreme compression with fixed-precision computation and achieve 79.8% top-1 accuracy on ImageNet, with language modeling perplexity reported on WikiText-103.
2 Related Work
Model compression.
Many compression methods focus on efficient parameterization, via weight pruning LeCun et al. (1990); Li et al. (2016); Huang et al. (2018); Mittal et al. (2018), weight sharing Dehghani et al. (2018); Turc et al. (2019); Lan et al. (2019) or with dedicated architectures Tan and Le (2019); Zhang et al. (2017); Howard et al. (2019). Weight pruning is implemented during training Louizos et al. (2017) or as a fine-tuning post-processing step Han et al. (2015, 2016). Many pruning methods are unstructured, i.e., they remove individual weights LeCun et al. (1990); Molchanov et al. (2017). On the other hand, structured pruning methods follow the structure of the weights to reduce both the memory footprint and the inference time of a model Li et al. (2016); Luo et al. (2017); Fan et al. (2019). We refer the reader to Liu et al. (2018) for a review of different pruning strategies. Others have worked on lightweight architectures, by modifying existing models Zhang et al. (2018); Wu et al. (2019); Sukhbaatar et al. (2019a) or developing new networks, such as MobileNet Howard et al. (2019), ShuffleNet Zhang et al. (2017), and EfficientNet Tan and Le (2019) in vision. Finally, knowledge distillation Hinton et al. (2015) has been applied to sentence representation Turc et al. (2019); Sanh et al. (2019a); Sun et al. (2019); Zhao et al. (2019); Jiao et al. (2019), to reduce the size of a BERT model Devlin et al. (2018).
Quantization.
There are extensive studies of scalar quantization to train networks with low-precision weights and activations Courbariaux et al. (2015); Courbariaux and Bengio (2016); Rastegari et al. (2016); McDonnell (2018). These methods benefit from specialized hardware to also improve the runtime during inference Vanhoucke et al. (2011). Other quantization methods such as Vector Quantization (VQ) and PQ Jegou et al. (2011) quantize blocks of weights simultaneously to achieve a higher compression rate Stock et al. (2019); Gong et al. (2014); Joulin et al. (2016); Carreira-Perpiñán and Idelbayev (2017). Closer to our work, several works have focused on simultaneously training and quantizing a network Jacob et al. (2018); Krishnamoorthi (2018); Gupta et al. (2015); Dong et al. (2019). Gupta et al. (2015) assign weights to a quantized bin stochastically, which is specific to scalar quantization but allows training with fixed-point arithmetic. Finally, our method can be interpreted as a form of Bayesian compression Louizos et al. (2017), using the Bayesian interpretation of Dropout Gal and Ghahramani (2016). As opposed to their work, we select our noise to match the weight transformation of a target quantization method without restricting it to a scale mixture prior.
3 Quantizing Neural Networks
In this section, we present the principles of quantization, several standard quantization methods, and describe how to combine scalar and product quantization. For clarity, we focus on the case of a fixed real matrix $W \in \mathbb{R}^{n \times p}$. We suppose that this matrix is split into $m \times q$ blocks $b_{kl}$:
$$W = \begin{pmatrix} b_{11} & \cdots & b_{1q} \\ \vdots & \ddots & \vdots \\ b_{m1} & \cdots & b_{mq} \end{pmatrix}, \tag{1}$$
where the nature of these blocks is determined by the quantization method. A codebook is a set of $K$ vectors, i.e., $\mathcal{C} = \{c[1], \dots, c[K]\}$. Quantization methods compress the matrix $W$ by assigning to each block $b_{kl}$ an index that points to a codeword $c$ in a codebook $\mathcal{C}$, and storing the codebook $\mathcal{C}$ and the resulting indices (as the entries $I_{kl}$ of an index matrix $I$) instead of the real weights. During inference, they reconstruct an approximation $\widehat{W}$ of the original matrix $W$ such that $\widehat{b}_{kl} = c[I_{kl}]$.
We distinguish scalar quantization, such as int8, where each block consists of a single weight, from vector quantization, where several weights are quantized jointly.
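As a small illustration of the codebook-and-index representation described above, the following sketch assigns each block to its nearest codeword and rebuilds the approximate blocks from the stored indices. The helper names (`nearest_index`, `reconstruct`) and the toy values are assumptions made for this example, not part of the original work.

```python
from typing import List, Sequence

Vector = List[float]


def nearest_index(block: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codeword closest to `block` in squared Euclidean distance."""
    def sq_dist(c: Vector) -> float:
        return sum((b - v) ** 2 for b, v in zip(block, c))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def reconstruct(codebook: Sequence[Vector],
                indices: Sequence[Sequence[int]]) -> List[List[Vector]]:
    """Rebuild the approximate blocks: block (k, l) becomes codebook[indices[k][l]]."""
    return [[list(codebook[i]) for i in row] for row in indices]


# A 2 x 2 grid of blocks, each block holding two weights (toy values).
blocks = [[[0.9, 1.1], [-0.1, 0.2]],
          [[0.1, -0.2], [1.2, 0.8]]]
codebook = [[0.0, 0.0], [1.0, 1.0]]  # two codewords

# Only `codebook` and `indices` need to be stored.
indices = [[nearest_index(b, codebook) for b in row] for row in blocks]
approx = reconstruct(codebook, indices)
```

With scalar quantization each block would hold a single weight; with vector quantization, as here, several weights share one index.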
3.1 Fixed-point Scalar Quantization
Fixed-point scalar quantization methods replace floating-point number representations by low-precision fixed-point representations. They simultaneously reduce a model’s memory footprint and accelerate inference by using fixed-point arithmetic on supporting hardware.
Fixed-point scalar quantization operates on blocks that represent a single weight, i.e., $b_{kl} = W_{kl}$. Floating-point weights are replaced by $N$-bit fixed-point numbers Gupta et al. (2015), with the extreme case of binarization where $N = 1$ Courbariaux et al. (2015). More precisely, the weights are rounded to one of $2^N$ possible codewords. These codewords correspond to bins evenly spaced by a scale factor $s$ and shifted by a bias $z$. Each weight $W_{kl}$ is mapped to its nearest codeword $c$, i.e.,
$$c = \left(\operatorname{round}(W_{kl}/s + z) - z\right) \times s, \tag{2}$$
where we compute the scale and bias as:
$$s = \frac{\max W - \min W}{2^N - 1} \quad \text{and} \quad z = \operatorname{round}(\min W / s).$$
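The uniform rounding scheme can be sketched in a few lines. This is a minimal illustration, assuming the scale is the weight range divided by $2^N - 1$ and the bias is the minimum weight expressed in units of the scale and rounded; the function names are hypothetical, and the clamping of the integer code to $[0, 2^N - 1]$ is an addition of this sketch so that every code fits in $N$ bits.

```python
from typing import List, Tuple


def fit_scale_and_bias(weights: List[float], n_bits: int) -> Tuple[float, int]:
    """Scale s and integer bias z of a uniform n_bits grid covering `weights`."""
    w_min, w_max = min(weights), max(weights)
    s = (w_max - w_min) / (2 ** n_bits - 1)
    if s == 0.0:          # constant tensor: any positive scale works
        s = 1.0
    z = round(w_min / s)
    return s, z


def encode(w: float, s: float, z: int, n_bits: int) -> int:
    """Integer code of w, clamped so that it fits in n_bits (sketch assumption)."""
    code = round(w / s) - z
    return max(0, min(2 ** n_bits - 1, code))


def decode(code: int, s: float, z: int) -> float:
    """Codeword (a multiple of s) represented by an integer code."""
    return (code + z) * s


def compression_rate(n_bits: int) -> float:
    """Size ratio between float32 storage and n_bits storage of the weights."""
    return 32 / n_bits


weights = [-1.0, -0.25, 0.0, 0.4, 1.0]
s, z = fit_scale_and_bias(weights, n_bits=8)
codes = [encode(w, s, z, 8) for w in weights]
dequantized = [decode(c, s, z) for c in codes]
```

For weights inside the fitted range, the reconstruction error of each weight is at most half the scale, which is why the error stays small for int8 and grows quickly for int4 and below.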
We focus on this uniform rounding scheme instead of other non-uniform schemes Choi et al. (2018); Li et al. (2019), because it allows for fixed-point arithmetic with implementations in PyTorch and TensorFlow (see Appendix). The compression rate is $\times 32/N$. The activations are also rounded to $N$-bit fixed-point numbers. With int8 for instance, this leads to faster inference on dedicated hardware. In this work, we consider both int4 and int8 quantization.
3.2 Product Quantization
Several quantization methods work on groups of weights, such as vectors, to benefit from the correlation induced by the structure of the network. In this work, we focus on Product Quantization for its good performance at extreme compression ratio Stock et al. (2019).
Traditional PQ.
In vector quantization methods, the blocks are predefined groups of weights instead of single weights. The codewords are groups of values, and the index matrix maps groups of weights from the matrix $W$ to these codewords. In this section, we present the Product Quantization framework as it generalizes both scalar and vector quantization. We consider the case where we apply PQ to the columns of $W$ and thus assume that $q = p$.
Traditional vector quantization techniques split the matrix $W$ into its $p$ columns and learn a codebook on the resulting $p$ vectors. Instead, Product Quantization splits each column into $m$ subvectors and learns the same codebook for each of the resulting $m \times p$ subvectors. Each quantized vector is subsequently obtained by assigning its subvectors to the nearest codeword in the codebook. Learning the codebook is traditionally done using $k$-means with a fixed number $K$ of centroids, typically $K = 256$ to store the index matrix $I$ using int8. Thus, the objective function is written as:
$$\|W - \widehat{W}\|_2^2 = \sum_{k,l} \|b_{kl} - c[I_{kl}]\|_2^2. \tag{3}$$
PQ shares representations between subvectors, which allows for higher compression rates than intN.
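The following sketch illustrates the PQ pipeline on a tiny matrix: columns are split into subvectors, a single codebook is learned with plain Lloyd's $k$-means, and the objective is the summed squared distance between each subvector and its nearest codeword. It is written in pure Python for readability; the function names, the random initialization, and the fixed iteration count are assumptions of this example rather than the setup used in the experiments.

```python
import random
from typing import List

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split_columns(W: List[List[float]], d: int) -> List[Vector]:
    """Split every column of W (n rows, p columns) into n // d subvectors of size d."""
    n, p = len(W), len(W[0])
    assert n % d == 0, "subvector size must divide the number of rows"
    return [[W[k * d + i][l] for i in range(d)]
            for l in range(p) for k in range(n // d)]


def kmeans(points: List[Vector], K: int, iters: int = 20, seed: int = 0) -> List[Vector]:
    """Plain Lloyd's k-means; returns K centroids shared by all subvectors."""
    rng = random.Random(seed)
    centroids = [list(pt) for pt in rng.sample(points, K)]
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for pt in points:
            j = min(range(K), key=lambda i: sq_dist(pt, centroids[i]))
            clusters[j].append(pt)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def pq_objective(points: List[Vector], centroids: List[Vector]) -> float:
    """Sum of squared distances between each subvector and its nearest codeword."""
    return sum(min(sq_dist(pt, c) for c in centroids) for pt in points)


W = [[0.0, 1.0, 0.1],
     [0.0, 1.0, -0.1],
     [1.0, 0.0, 0.9],
     [1.0, 0.1, 1.1]]
subvectors = split_columns(W, d=2)          # 2 subvectors per column, 6 in total
codebook = kmeans(subvectors, K=2)
error = pq_objective(subvectors, codebook)
```

Because the same codebook serves every subvector, the storage cost is dominated by the indices once the matrix is large.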
Iterative PQ.
When quantizing a full network rather than a single matrix, extreme compression with PQ induces a quantization drift as reconstruction error accumulates Stock et al. (2019). Indeed, subsequent layers take as input the output of preceding layers, which are modified by the quantization of the preceding layers. This creates a drift in the network activations, resulting in large losses of performance. A solution proposed by Stock et al. (2019), which we call iterative PQ (iPQ), is to quantize layers sequentially from the lowest to the highest, and fine-tune the upper layers as the lower layers are quantized, under the supervision of the uncompressed (teacher) model. Codewords of each layer are fine-tuned by averaging the gradients of their assigned elements with gradient steps of the form:
$$c \leftarrow c - \eta \frac{1}{|J_c|} \sum_{(k,l) \in J_c} \frac{\partial \mathcal{L}}{\partial b_{kl}}, \tag{4}$$
where $J_c = \{(k,l) \mid c[I_{kl}] = c\}$, $\mathcal{L}$ is the loss function and $\eta > 0$ is a learning rate. This adapts the upper layers to the drift appearing in their inputs, reducing the impact of the quantization approximation on the overall performance.
3.3 Combining Fixed-Point with Product Quantization
Fixed-point quantization and Product Quantization are often regarded as competing choices, but can be advantageously combined. Indeed, PQ/iPQ compresses the network by replacing vectors of weights by their assigned centroids, but these centroids are in floating-point precision. Fixed-point quantization compresses both activations and weights to fixed-point representations. Combining both approaches means that the vectors of weights are mapped to centroids that are compressed to fixed-point representations, along with the activations. This benefits from the extreme compression ratio of iPQ and the finite-precision arithmetic of intN quantization.
More precisely, for a given matrix, we store the int8 representation of the $K$ centroids of dimension $d$ along with the $\log_2 K$-bit representations of the centroid assignments of the $m \times p$ subvectors. The int8 representation of the centroids is obtained with Eq. (2). The overall storage of the matrix and activations during a forward pass with batch size $1$ is
$$M = 8 \times Kd + \log_2 K \times mp + 8 \times n \ \text{bits}. \tag{5}$$
In particular, when $K = 256$, the centroid assignments are also stored in int8, which means that every value required for a forward pass is stored in an int8 format. We divide by $4$ the float32 overhead of storing the centroids, although the storage requirement associated with the centroids is small compared to the cost of indexing the subvectors for standard networks. In contrast to iPQ alone where we only quantize the weights, we also quantize the activations using int8. We evaluate this approach on both natural language processing and computer vision tasks in Section 5.
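To make the storage accounting concrete, the sketch below counts the bits needed for int8 centroids, centroid assignments, and int8 input activations. It assumes a batch of a single input vector and rounds the index width up to a whole number of bits; the function name and the example layer size are hypothetical.

```python
import math


def pq_int8_storage_bits(K: int, d: int, num_subvectors: int, n: int) -> int:
    """Bits for int8 centroids, centroid assignments, and one int8 input vector."""
    centroid_bits = 8 * K * d                                # K centroids of dimension d
    index_bits = math.ceil(math.log2(K)) * num_subvectors   # one index per subvector
    activation_bits = 8 * n                                  # input of dimension n
    return centroid_bits + index_bits + activation_bits


# Hypothetical 1024 x 1024 layer, subvectors of size 8, K = 256 centroids.
n = p = 1024
d = 8
num_subvectors = (n // d) * p
total_bits = pq_int8_storage_bits(K=256, d=d, num_subvectors=num_subvectors, n=n)
float32_bits = 32 * n * p
ratio = float32_bits / total_bits
```

In this example the index term is far larger than the centroid term, matching the remark that indexing dominates the storage for standard networks.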
4 Method
Deep networks are not exposed to the noise caused by the quantization drift during training, leading to suboptimal performance. A solution to make the network robust to quantization is to introduce it during training. Quantization Aware Training (QAT) Jacob et al. (2018) exposes the network to quantization during training by quantizing weights during the forward pass. This transformation is not differentiable, and gradients are approximated with a straight-through estimator (STE) Bengio et al. (2013); Courbariaux and Bengio (2016). STE introduces a bias in the gradients that depends on the level of quantization of the weights, and thus on the compression ratio. In this section, we propose a simple modification to control this induced bias with a stochastic amelioration of QAT, called QuantNoise. The idea is to quantize a randomly selected fraction of the weights instead of the full network as in QAT, letting unbiased gradients flow through the unquantized weights. Our general formulation can simulate the effect of both quantization and pruning during training.
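A one-dimensional toy problem gives some intuition for why the gradient used by QAT drifts away from the exact gradient as quantization becomes coarser. The sketch below assumes a squared loss on a single weight and a quantizer that snaps to multiples of a step size; both are illustrative choices, not the setting of the experiments.

```python
def round_to_grid(w: float, step: float) -> float:
    """Toy quantizer: snap w to the nearest multiple of `step`."""
    return round(w / step) * step


def loss_grad(w: float, x: float, t: float) -> float:
    """Derivative of (w * x - t) ** 2 with respect to w."""
    return 2.0 * (w * x - t) * x


def ste_grad(w: float, x: float, t: float, step: float) -> float:
    """Gradient used by QAT in this toy setting: the loss is evaluated at the
    quantized weight and the quantizer is treated as the identity in the
    backward pass (straight-through)."""
    return loss_grad(round_to_grid(w, step), x, t)


def gap(w: float, x: float, t: float, step: float) -> float:
    """Distance between the straight-through gradient and the exact gradient."""
    return abs(ste_grad(w, x, t, step) - loss_grad(w, x, t))


w, x, t = 0.33, 1.0, 0.0
for step in (0.01, 0.1, 0.5):
    print(f"step={step}: gradient gap = {gap(w, x, t, step):.3f}")
```

A finer grid keeps the quantized weight close to the real one, so the gap is small; a coarse grid moves the evaluation point further away and the gap grows.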
4.1 Training Networks with Quantization Noise
We consider the case of a real matrix as in Section 3. During the training of a network, our proposed QuantNoise method works as follows: first, we compute blocks related to a target quantization method. Then, during each forward pass, we randomly select a subset of these blocks and apply some distortion to them. During the backward pass, we compute gradients for all the weights, using STE for the distorted weights.
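More formally, given a set of tuples of indices $J \subset \{(k,l)\}$ for $1 \le k \le m$, $1 \le l \le q$, and a distortion or noise function $\varphi$ acting on a block, we define an operator $\psi(\cdot \mid J)$ such that, for each block $b_{kl}$, we apply the following transformation:
$$\psi(b_{kl} \mid J) = \begin{cases} \varphi(b_{kl}) & \text{if } (k,l) \in J, \\ b_{kl} & \text{otherwise.} \end{cases} \tag{6}$$
The noise function $\varphi$ simulates the change in the weights produced by the target quantization method (see Section 4.2 for details). We replace the matrix $W$ by the resulting noisy matrix $W_{\text{noise}}$ during the forward pass to compute a noisy output $y_{\text{noise}}$, i.e.,
$$W_{\text{noise}} = \left(\psi(b_{kl} \mid J)\right)_{kl} \quad \text{and} \quad y_{\text{noise}} = x W_{\text{noise}}, \tag{7}$$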
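where $x$ is an input vector. During the backward pass, we compute the gradient on the non-distorted weights and apply STE on the distorted weights, i.e.,
$$W \leftarrow W - \eta\, x^\top \frac{\partial \mathcal{L}}{\partial y_{\text{noise}}}, \tag{8}$$
where $\mathcal{L}$ is the loss function and $\eta$ is a learning rate.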
Note that our approach is equivalent to QAT when the selected set $J$ contains all the tuples of indices. However, an advantage of QuantNoise over QAT is that unbiased gradients continue to flow via blocks unaffected by the noise. As these blocks are randomly selected for each forward pass, we guarantee that each weight regularly sees gradients that are not affected by the nature of the noise function $\varphi$. As a side effect, our quantization noise regularizes the network in a similar way as DropConnect Wan et al. (2013) or LayerDrop Fan et al. (2019).
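The training procedure of this section can be sketched for a single linear layer. The example below treats each weight as its own block, selects blocks independently with a given rate, and applies one SGD step on a squared loss; the gradient computed for the noisy matrix is applied unchanged to the real weights, which is exact for untouched weights and straight-through for distorted ones. The function names, the squared loss, and the independent per-weight selection are assumptions of this sketch.

```python
import random
from typing import Callable, List

Matrix = List[List[float]]


def quant_noise(W: Matrix, phi: Callable[[float], float], rate: float,
                rng: random.Random) -> Matrix:
    """Apply phi to each weight independently with probability `rate`
    (scalar blocks); the remaining weights are left untouched."""
    return [[phi(w) if rng.random() < rate else w for w in row] for row in W]


def forward(x: List[float], W: Matrix) -> List[float]:
    """y = x W for a row vector x of size n and an n x p matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def ste_step(W: Matrix, x: List[float], target: List[float],
             phi: Callable[[float], float], rate: float, lr: float,
             rng: random.Random) -> float:
    """One in-place SGD step on 0.5 * ||x W_noise - target||^2; returns the loss."""
    W_noise = quant_noise(W, phi, rate, rng)
    y = forward(x, W_noise)
    err = [yj - tj for yj, tj in zip(y, target)]
    for i in range(len(W)):
        for j in range(len(W[0])):
            # Gradient w.r.t. W_noise[i][j] is x[i] * err[j]; it is applied to W.
            W[i][j] -= lr * x[i] * err[j]
    return 0.5 * sum(e * e for e in err)


rng = random.Random(0)
W = [[0.3, -0.6], [1.1, 0.2]]
round_quarter = lambda w: round(w * 4) / 4      # toy noise: round to multiples of 0.25
loss = ste_step(W, x=[1.0, 2.0], target=[0.0, 1.0], phi=round_quarter,
                rate=0.5, lr=0.1, rng=rng)
```

Setting `rate=1.0` distorts every weight at every step and so corresponds to the QAT case, while `rate=0.0` is ordinary training without noise.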
Composing quantization noises.
As noise operators are compositionally commutative, we can make a network robust to a combination of quantization methods by composing their noise operators:
$$\psi(W \mid J) = \psi_1 \circ \psi_2 (W \mid J). \tag{9}$$
This property is particularly useful to combine quantization with pruning operators during training, as well as combining scalar quantization with product quantization.
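Composition of noise operators can be illustrated with two toy noises, one that zeroes a block (pruning-style) and one that rounds its entries (quantization-style). The helper names and the two noise functions are hypothetical; each operator here draws its own random selection of blocks, which is one possible reading of the composition.

```python
import random
from typing import Callable, List

Block = List[float]
Operator = Callable[[List[Block]], List[Block]]


def make_operator(phi: Callable[[Block], Block], rate: float,
                  rng: random.Random) -> Operator:
    """Operator that applies phi to each block with probability `rate`."""
    def op(blocks: List[Block]) -> List[Block]:
        return [phi(b) if rng.random() < rate else list(b) for b in blocks]
    return op


def compose(*ops: Operator) -> Operator:
    """Apply the operators right to left, like function composition."""
    def composed(blocks: List[Block]) -> List[Block]:
        for op in reversed(ops):
            blocks = op(blocks)
        return blocks
    return composed


def zero_out(block: Block) -> Block:
    """Pruning-style noise: drop the whole block."""
    return [0.0 for _ in block]


def round_quarter(block: Block) -> Block:
    """Toy scalar rounding to multiples of 0.25."""
    return [round(w * 4) / 4 for w in block]


rng = random.Random(0)
prune_then_quantize = compose(make_operator(round_quarter, 0.5, rng),
                              make_operator(zero_out, 0.2, rng))
noisy_blocks = prune_then_quantize([[0.3, 0.6], [-0.7, 0.1]])
```

For these two toy noises the order does not matter on any given block, since rounding a zeroed block leaves it at zero and zeroing a rounded block also gives zero.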
4.2 Adding Noise to Specific Quantization Methods
In this section, we propose several implementations of the noise function for the quantization methods described in Section 3. We also show how to handle pruning with it.
Fixed-point scalar quantization.
In intN quantization, the blocks are atomic and weights are rounded to their nearest neighbor in the codebook. The function $\varphi_{\text{intN}}$ replaces weight $W_{kl}$ with the output of the rounding function defined in Eq. (2), i.e.,
$$\varphi_{\text{intN}}(w) = \left(\operatorname{round}(w/s + z) - z\right) \times s, \tag{10}$$
where $s$ and $z$ are updated during training. In particular, the application of QuantNoise to int8 scalar quantization is a stochastic amelioration of QAT.
Quantization Scheme  Language Modeling  Image Classification  
  16-layer Transformer  EfficientNet-B3  
  WikiText-103  ImageNet-1k  
  Size  Compression  PPL  Size  Compression  Top-1  
Uncompressed model  
int4 quantization  
 trained with QAT  34.1  
 trained with QuantNoise  
int8 quantization  
 trained with QAT  
 trained with QuantNoise  
iPQ  
 trained with QAT  
 trained with QuantNoise  
iPQ & int8 + QuantNoise 
Product quantization.
As opposed to intN, codebooks in PQ require a clustering step based on weight values. During training, we learn codewords online and use the resulting centroids to implement the quantization noise. More precisely, the noise function $\varphi_{\text{PQ}}$ assigns a selected block $b$ to its nearest codeword in the associated codebook $\mathcal{C}$:
$$\varphi_{\text{PQ}}(b) = \operatorname*{argmin}_{c \in \mathcal{C}} \|b - c\|_2^2. \tag{11}$$
Updating the codebooks online works well. However, empirically, running $k$-means once per epoch is faster and does not noticeably modify the resulting accuracy.
Note that computing the exact noise function for PQ is computationally demanding. We propose a simpler and faster alternative approximation to the operational transformation of PQ and iPQ. The noise function simply zeroes out the subvectors of the selected blocks, i.e., $\varphi_{\text{proxy}}(b) = 0$. As a side note, we considered other alternatives, for instance one where the subvectors are mapped to the mean subvector. In practice, we found that these approximations lead to similar performance, see Section 6. This proxy noise function is a form of Structured Dropout and encourages correlations between the subvectors. This correlation is beneficial to the subsequent clustering involved in PQ/iPQ.
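The three noise functions discussed for PQ differ mainly in what they need at training time. The sketch below places them side by side: the exact noise requires a codebook, the proxy needs nothing, and the mean-subvector variant needs one running statistic. All names are hypothetical and the functions act on a single block.

```python
from typing import List, Sequence

Vector = List[float]


def phi_pq(block: Vector, codebook: Sequence[Vector]) -> Vector:
    """Exact PQ noise: replace the block by its nearest codeword."""
    def sq_dist(c: Vector) -> float:
        return sum((b - v) ** 2 for b, v in zip(block, c))
    return list(min(codebook, key=sq_dist))


def phi_proxy(block: Vector) -> Vector:
    """Proxy noise: zero out the whole subvector, no codebook required."""
    return [0.0] * len(block)


def mean_subvector(blocks: Sequence[Vector]) -> Vector:
    """Coordinate-wise mean of a collection of subvectors."""
    return [sum(col) / len(blocks) for col in zip(*blocks)]


def phi_mean(block: Vector, mean: Vector) -> Vector:
    """Alternative proxy: map the block to the mean subvector."""
    return list(mean)


codebook = [[0.0, 0.0], [1.0, 1.0]]
block = [0.9, 1.2]
exact = phi_pq(block, codebook)
proxy = phi_proxy(block)
```

Only `phi_pq` requires cluster assignments and centroids to be recomputed as the weights change, which is the cost the proxy avoids.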
Adding pruning to the quantization noise.
The specific form of quantization noise can be adjusted to incorporate additional noise specific to pruning. We simply combine the noise operators of quantization and pruning by composing them following Eq. (9). We consider the pruning noise function of Fan et al. (2019), where predefined structures are randomly dropped during training. In particular, we focus on LayerDrop, where the structures are the residual blocks of highway-like layers Srivastava et al. (2015), as most modern architectures, such as ResNet or Transformer, are composed of this structure. More precisely, the corresponding noise operator over residual blocks $v$ is
$$\psi_{\text{LayerDrop}}(v \mid J) = \begin{cases} 0 & \text{if } v \in J, \\ v & \text{otherwise.} \end{cases}$$
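A minimal sketch of this pruning noise on a stack of residual blocks is given below. Each block computes an update that is added to its input, and a dropped block contributes nothing, so only the identity path remains. The function names and the keep-one-in-every-k pruning rule at inference are assumptions of this example.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def residual_forward(x: Vector, layers: Sequence[Layer], drop_rate: float,
                     rng: random.Random) -> Vector:
    """Stack of residual blocks x <- x + f(x); each block is skipped with
    probability `drop_rate`, which leaves only the identity path."""
    for f in layers:
        if rng.random() < drop_rate:
            continue  # block dropped: x passes through unchanged
        x = [xi + fi for xi, fi in zip(x, f(x))]
    return x


def prune_layers(layers: Sequence[Layer], keep_every: int) -> List[Layer]:
    """Inference-time pruning: keep one block out of every `keep_every`."""
    return [f for i, f in enumerate(layers) if i % keep_every == 0]


rng = random.Random(0)
layers = [lambda v: [0.5 * vi for vi in v] for _ in range(4)]
train_out = residual_forward([1.0, 2.0], layers, drop_rate=0.2, rng=rng)
small_model = prune_layers(layers, keep_every=2)
test_out = residual_forward([1.0, 2.0], small_model, drop_rate=0.0, rng=rng)
```

A dropped block is skipped entirely rather than evaluated, so no gradient is computed for its weights on that step.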
For pruning, we do not use STE to backpropagate the gradient of pruned weights, as dropping them entirely during training has the benefit of speeding up convergence Huang et al. (2016). Once a model is trained with LayerDrop, the number of layers kept at inference can be adapted to match a computation budget or time constraint.
5 Results
We demonstrate the impact of QuantNoise on the performance of several quantization schemes in a variety of settings (see Appendix, Sec. 8.1). We compare iPQ + QuantNoise with existing work to demonstrate that QuantNoise leads to extreme compression rates at a reasonable cost in accuracy.
Language modeling  Sentence Representation  Image Classification  
Comp.  Size  PPL  Comp.  Size  Acc.  Comp.  Size  Acc.  
Unquantized models  
Original model  
+ Sharing  
+ Pruning  
Quantized models  
iPQ  
+ QuantNoise  
+ Sharing  
+ Pruning 
5.1 Improving Compression with QuantNoise
QuantNoise is a regularization method that makes networks more robust to the target quantization scheme or combination of quantization schemes during training. We show the impact of QuantNoise in Table 1 for a variety of quantization methods: int8/int4 and iPQ.
We experiment in two different settings: a Transformer network trained for language modeling on WikiText-103 and an EfficientNet-B3 convolutional network trained for image classification on ImageNet-1k. Our quantization noise framework is general and flexible: QuantNoise improves the performance of quantized models for every quantization scheme in both experimental settings. Importantly, QuantNoise only changes model training by adding a regularization noise similar to dropout, with no impact on the convergence rate or training speed.
This comparison of different quantization schemes shows that QuantNoise works particularly well with high-performance quantization methods, like iPQ, where QAT tends to degrade performance, even compared to quantizing as a post-processing step. In subsequent experiments in this section, we focus on applications with iPQ because it offers the best trade-off between model performance and compression, and has little negative impact on FLOPS.
Fixed-Point Product Quantization.
Combining iPQ and int8 as described in Section 3.3 allows us to take advantage of the high compression rate of iPQ with a fixed-point representation of both centroids and activations. As shown in Table 1, this combination incurs little loss in accuracy with respect to iPQ + QuantNoise. Most of the memory footprint of iPQ comes from indexing and not from storing centroids, so the compression ratios are comparable.
Complementarity with Weight Pruning and Sharing.
We analyze how QuantNoise is compatible and complementary with pruning (“+Prune”) and weight sharing (“+Share”); see Appendix for details on weight sharing. We report results for language modeling on WikiText-103, pre-trained sentence representations on MNLI and object classification on ImageNet-1k in Table 2. The conclusions are remarkably consistent across tasks and benchmarks: QuantNoise gives a large improvement over strong iPQ baselines. Combining it with sharing and pruning offers additional interesting operating points of performance vs. size.
5.2 Comparison with the state of the art
We now compare our approach on the same tasks against the state of the art. We apply our best quantization setup on competitive models and further reduce their memory footprint by combining it with weight sharing and pruning, offering extreme compression for good performance.
Natural Language Processing.
In Figure 2, we examine the trade-off between performance and model size. Our quantized RoBERTa offers a competitive trade-off between size and performance compared with memory reduction methods dedicated to BERT, like TinyBERT, MobileBERT, or AdaBERT.
Image Classification.
We compress EfficientNet-B3 to 3.3 MB while maintaining a high top-1 accuracy of 80.0%. As shown in Figure 2, our quantized EfficientNet-B3 is smaller and more accurate than architectures dedicated to optimizing on-device performance with limited size, like MobileNet or ShuffleNet.
Incorporating pruning noise into quantization is also beneficial. For example, with pruning, iPQ + QuantNoise reduces size further with only a small drop in PPL in language modeling. Further, pruning reduces FLOPS by the same ratio as its compression factor. By adding sharing with pruning, in language modeling, we achieve an extreme compression ratio with a drop in PPL, together with a FLOPS reduction from pruning entire shared chunks of layers. For comparison, our compressed model has the same performance as the Transformer-XL base model.
6 Ablations
In this section, we study the use of our approach as a post-processing step where a pre-trained model is fine-tuned with QuantNoise. We also examine the impact of the level of noise during training as well as the impact of approximating iPQ during training.
Adaptive Input  PPL  RoBERTa  Acc. 

Train without QuantNoise  25.2  Train without QuantNoise  82.5 
+ Finetune with QuantNoise  20.9  + Finetune with QuantNoise  83.4 
Train with QuantNoise  20.7  Train with QuantNoise  83.6 
6.1 Fine-tuning with QuantNoise for Post-Processing Quantization
We explore taking existing pre-trained models and post-processing them with QuantNoise instead of training from scratch. For language modeling, we start with the Adaptive Inputs architecture and train for 10 additional epochs. For RoBERTa, we train for 25k more updates. We show that fine-tuning with QuantNoise incorporates the benefits and almost matches training from scratch, see Table 3. For example, in language modeling, there is only a 0.2 PPL difference after applying iPQ (20.9 versus 20.7).
We further examine how to incorporate QuantNoise more flexibly into pre-training RoBERTa for sentence classification tasks. We take an already pre-trained RoBERTa model and only incorporate QuantNoise during the sentence classification task transfer learning step. We show in Table 3 that this is also effective at compressing while retaining accuracy after quantization with iPQ.
6.2 Impact of Noise Rate
We analyze the performance for various rates of QuantNoise in Figure 3 on a Transformer for language modeling. For iPQ, performance is impacted by high rates of quantization noise. For example, a Transformer trained with the proxy noise function $\varphi_{\text{proxy}}$ degrades with rates higher than 0.5, i.e., when half of the weights are passed through the noise function. We hypothesize that for large quantities of noise, using the proxy rather than the exact PQ noise has a larger effect. For int8 quantization and its noise function, higher rates of noise are slightly worse, but the degradation is not as severe. A rate of 1 for int8 quantization is equivalent to the Quantization Aware Training of Krishnamoorthi (2018), as the full matrix is quantized with STE, showing the potential benefit of partial quantization during training.
6.3 Impact of Approximating the Noise Function
We study the impact of approximating quantization noise during training. We focus on the case of iPQ with the approximation described in Section 4.2. In Table 4, we compare the exact noise function for iPQ, $\varphi_{\text{PQ}}$, with its approximation $\varphi_{\text{proxy}}$. This approximate noise function does not consider cluster assignments or centroid values and simply zeroes out the selected blocks. For completeness, we include an intermediate approximation where we consider cluster assignments to apply noise within each cluster, but still zero out the vectors. These approximations do not affect the performance of the quantized models. This suggests that increasing the correlation between subvectors that are jointly clustered is enough to maintain the performance of a model quantized with iPQ. Since PQ tends to work well on highly correlated vectors, such as activations in convolutional networks, this is not surprising. Using the approximation presents the advantage of speed and practicality. Indeed, one does not need to compute cluster assignments and centroids for every layer in the network after each epoch. Moreover, the approach is simpler to implement.
7 Conclusion
We show that quantizing a random subset of weights during training maintains performance in the high quantization regime. We validate that QuantNoise works with a variety of different quantization schemes on several applications in text and vision. Our method can be applied to a combination of iPQ and int8 to benefit from an extreme compression ratio and fixed-point arithmetic. Finally, we show that QuantNoise can be used as a post-processing step to prepare already trained networks for subsequent quantization, to improve the performance of the compressed model.
References
Classy Vision. https://github.com/facebookresearch/ClassyVision.
Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853.
Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
arXiv preprint arXiv:1611.01576.
Faster and just as accurate: a simple decomposition for transformer models.
Model compression as constrained optimization, with application to neural nets. Part II: quantization.
AdaBERT: task-adaptive BERT compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246.
PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.
BinaryConnect: training deep neural networks with binary weights during propagations. CoRR.
BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1. CoRR.
Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Language modeling with gated convolutional networks. In Proc. of ICML.
Universal transformers.
ImageNet: a large-scale hierarchical image database. In CVPR.
BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Stochastic quantization for learning accurate low-bit deep neural networks. International Journal of Computer Vision 127 (11-12), pp. 1629–1642.
Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556.
Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.
Efficient softmax approximation for GPUs. arXiv abs/1609.04309.
Deep learning with limited numerical precision. In ICML.
Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR.
Learning both weights and connections for efficient neural network. In NIPS, pp. 1135–1143.
Deep residual learning for image recognition. CoRR.
Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Searching for MobileNetV3. arXiv e-prints.
CondenseNet: an efficient DenseNet using learned group convolutions. In CVPR.
Deep networks with stochastic depth. In ECCV.
Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713.
Product quantization for nearest neighbor search. PAMI.
TinyBERT: distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.
FastText.zip: compressing text classification models. arXiv preprint arXiv:1612.03651.
Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342.
Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
ALBERT: a lite BERT for self-supervised learning of language representations.
Optimal brain damage. In NIPS.
Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
Additive powers-of-two quantization: a non-uniform discretization for neural networks. arXiv preprint arXiv:1909.13144.
RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
 Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270. Cited by: §2.

 Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §8.2.
 Learning sparse neural networks through regularization. arXiv preprint arXiv:1712.01312. Cited by: §2, §2.
 Thinet: a filter level pruning method for deep neural network compression. In ICCV, Cited by: §2.
 ShuffleNet V2: practical guidelines for efficient CNN architecture design. CoRR. Cited by: Table 7.

 A tensorized transformer for language modeling. arXiv preprint arXiv:1906.09777. Cited by: Table 5.
 Training wide residual networks for deployment using a single bit for each weight. Cited by: §2.
 Pointer Sentinel Mixture Models. arXiv abs/1609.07843. Cited by: §8.1.

 Recovering from random pruning: on the plasticity of deep convolutional neural networks. In WACV, Cited by: §2.
 Variational dropout sparsifies deep neural networks. In ICML, Cited by: §2.
 Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: §8.1.
 How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), Cited by: §8.2.
 Automatic differentiation in pytorch. Cited by: §8.1.
 Language models are unsupervised multitask learners. Cited by: §8.2.
 Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507. Cited by: Table 5.
 Xnor-net: imagenet classification using binary convolutional neural networks. In ECCV, Cited by: §2.
 Mobilenetv2: inverted residuals and linear bottlenecks. In Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: Table 7.
 DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. Cited by: §2.
 DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: Table 6.
 Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §4.2.
 And the bit goes down: revisiting the quantization of neural networks. CoRR abs/1907.05686. Cited by: §1, §2, §3.2, §3.2, Table 7.
 Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799. Cited by: §2.
 Augmenting self-attention with persistent memory. arXiv preprint arXiv:1907.01470. Cited by: Table 5.
 Patient knowledge distillation for bert model compression. EMNLP. Cited by: §2, Table 6.
 [64] MobileBERT: taskagnostic compression of bert for resource limited devices. Cited by: Table 6.
 On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §8.2.
 EfficientNet: rethinking model scaling for convolutional neural networks. Cited by: §1, §2, Table 1, §8.1, Table 7.
 Well-read students learn better: the impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962. Cited by: §2, Table 6.
 Improving the speed of neural networks on cpus. Cited by: §1, §2.
 Attention is all you need. In NIPS, Cited by: §1.
 Regularization of neural networks using DropConnect. In ICML, Cited by: §4.1.
 GLUE: a multitask benchmark and analysis platform for natural language understanding. Note: ICLR Cited by: §8.1.
 HAQ: hardware-aware automated quantization. CoRR. Cited by: Table 7.
 A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, Cited by: §8.1.
 Pay less attention with lightweight and dynamic convolutions. In ICLR, Cited by: §2.
 Accelerating neural transformer via an average attention network. arXiv preprint arXiv:1805.00631. Cited by: §2.
 ShuffleNet: an extremely efficient convolutional neural network for mobile devices. CoRR. Cited by: §2.
 Extreme language model compression with optimal subwords and shared projections. arXiv preprint arXiv:1909.11687. Cited by: §2, Table 6.
8 Appendix
8.1 Experimental Setting
We assess the effectiveness of Quant-Noise on competitive language and vision benchmarks. We consider Transformers for language modeling, RoBERTa for pre-training sentence representations, and EfficientNet for image classification. Our models are implemented in PyTorch [52]. We use fairseq [50] for the language modeling and sentence representation pre-training tasks, and Classy Vision [1] for EfficientNet.
Language Modeling.
Pre-Training of Sentence Representations.
Image Classification.
8.2 Training Details
Language Modeling
To handle the large vocabulary of Wikitext-103, we follow [12] and [2] in using adaptive softmax [20] and adaptive input for computational efficiency. For both input and output embeddings, we use dimension size and three adaptive bands: K, K, and K. We use a cosine learning rate schedule [2, 41] and train with Nesterov’s accelerated gradient [65]. We set the momentum to 0.99 and renormalize gradients if the norm exceeds 0.1 [51]. During training, we partition the data into blocks of contiguous tokens that ignore document boundaries. At test time, we respect sentence boundaries. We set LayerDrop to 0.2 and the Quant-Noise value to 0.05, which we selected by searching over the values (0.05, 0.1, 0.2) during training. The Quant-Noise block size during training is 8.
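To illustrate how the two Quant-Noise hyperparameters above interact, the following minimal sketch draws one Bernoulli decision per block of 8 consecutive weights with rate 0.05 and applies a noise function only to the selected blocks. The names `apply_quant_noise` and `round_to_grid`, the row-wise block layout, and the grid-rounding noise are assumptions made for illustration; the backward pass (straight through estimator) is not shown.

```python
import numpy as np

def apply_quant_noise(weight, noise_fn, p=0.05, block_size=8, rng=None):
    """Apply noise_fn to a random subset of blocks and leave the rest untouched."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = weight.shape
    assert cols % block_size == 0, "columns must split evenly into blocks"
    # View each row as consecutive blocks of `block_size` weights.
    blocks = weight.reshape(rows, cols // block_size, block_size)
    # One Bernoulli(p) draw per block decides whether that block is noised.
    mask = rng.random((rows, cols // block_size)) < p
    noised = np.where(mask[..., None], noise_fn(blocks), blocks)
    return noised.reshape(rows, cols), mask

def round_to_grid(x, step=0.25):
    # Hypothetical stand-in for a quantizer: snap values to a coarse grid.
    return np.round(x / step) * step

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
w_noised, mask = apply_quant_noise(w, round_to_grid, p=0.05, block_size=8, rng=rng)
```

With a rate of 0.05, roughly one block in twenty is perturbed at each forward pass, so most weights are seen unquantized at any given step.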
RoBERTa
The base architecture is a layer model with embedding size and FFN size . We follow [39] in using the subword tokenization scheme from [53], which uses bytes as subword units. This eliminates unknown tokens. We train with large batches of size and maintain this batch size using gradient accumulation. We do not use next sentence prediction [34]. We optimize with Adam with a polynomial decay learning rate schedule. We set LayerDrop to 0.2 and the Quant-Noise value to 0.1. We did not run a hyperparameter search for the Quant-Noise value, as training RoBERTa is computationally intensive. The Quant-Noise block size during training is 8.
During fine-tuning, we search over three learning rates (1e-5, 2e-5, 3e-5) and two batch sizes (16 or 32 sentences). The other parameters are set following [39]. We do single-task fine-tuning, meaning we only tune on the data provided for the given natural language understanding task. We do not perform ensembling. When fine-tuning models trained with LayerDrop, we apply LayerDrop and Quant-Noise during fine-tuning as well.
EfficientNet
We use the architecture of EfficientNet-B3 defined in Classy Vision [1] and follow the default hyperparameters for training. We set the Quant-Noise value to 0.1, which we selected by searching over the values (0.05, 0.1, 0.2) during training. During training time, the block size of Quant-Noise is set to for all convolutions, for depthwise convolutions, for depthwise convolutions and for the classifier. For sharing, we shared weights between blocks 9-10, 11-12, 14-15, 16-17, 19-20-21, 22-23 and refer to blocks that share the same weights as a chunk. For LayerDrop, we drop the chunks of blocks defined previously with probability 0.2 and evaluate only with chunks 9-10, 14-15 and 19-20-21.
8.3 Scalar Quantization Details
We closely follow the methodology of PyTorch 1.4. We emulate scalar quantization by quantizing the weights and the activations. The scales and zero points of activations are determined by doing a few forward passes ahead of the evaluation and then fixed. We use the Histogram method to compute and , which aims at approximately minimizing the quantization error by adjusting and . This scheme is a refinement of the MinMax scheme. Per channel quantization is also discussed in Table 9.
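The emulation described above can be sketched as follows for the simpler MinMax scheme, which the Histogram method refines. This is a minimal illustration rather than the PyTorch implementation: the function names and the affine convention (integers in [0, 2^N - 1] with a zero point) are assumptions, and the calibration data is a random stand-in for a few forward passes.

```python
import numpy as np

def minmax_qparams(x, n_bits=8):
    """Scale and zero point mapping [min(x), max(x)] onto [0, 2**n_bits - 1]."""
    qmax = 2 ** n_bits - 1
    # Extend the range so that it contains 0 (assumed convention).
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point

def fake_quantize(x, scale, zero_point, n_bits=8):
    """Quantize to integers, then dequantize, emulating intN values in float."""
    qmax = 2 ** n_bits - 1
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
calibration = rng.normal(size=(4, 256))      # stand-in for a few forward passes
scale, zp = minmax_qparams(calibration)      # determined once, then fixed
x = rng.normal(size=256)                     # activations seen at evaluation
x_hat = fake_quantize(x, scale, zp)
```

Values that fall inside the calibrated range are reproduced to within half a quantization step; values outside it are clipped, which is the error that a histogram-based choice of range trades off against step size.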
Table 5

| Model | Size (MB) | PPL |
|---|---|---|
| Trans XL Large [11] | 970 | 18.3 |
| Compressive Trans [54] | 970 | 17.1 |
| GCNN [12] | 870 | 37.2 |
| 4 Layer QRNN [4] | 575 | 33.0 |
| Trans XL Base [11] | 570 | 24.0 |
| Persis Mem [62] | 506 | 20.6 |
| Tensorized core-2 [45] | 325 | 18.9 |
| Quant-Noise | 38 | 20.7 |
| Quant-Noise + Share + Prune | 10 | 24.2 |
Table 6

| Model | Size (MB) | MNLI |
|---|---|---|
| RoBERTa Base + LD [17] | 480 | 84.8 |
| BERT Base [15] | 420 | 84.4 |
| PreTrained Distil [67] | 257 | 82.5 |
| DistilBERT [58] | 250 | 81.8 |
| MobileBERT* [64] | 96 | 84.4 |
| TinyBERT [31] | 55 | 82.8 |
| ALBERT Base [35] | 45 | 81.6 |
| AdaBERT [7] | 36 | 81.6 |
| Quant-Noise | 38 | 83.6 |
| Quant-Noise + Share + Prune | 14 | 82.5 |
Table 7

| Model | Size (MB) | Acc. |
|---|---|---|
| EfficientNet-B7 [66] | 260 | 84.4 |
| ResNet-50 [24] | 97.5 | 76.1 |
| DenseNet-169 [27] | 53.4 | 76.2 |
| EfficientNet-B0 [66] | 20.2 | 77.3 |
| MobileNet-v2 [56] | 13.4 | 71.9 |
| Shufflenet-v2 [44] | 8.7 | 69.4 |
| HAQ 4 bits [72] | 12.4 | 76.2 |
| iPQ ResNet-50 [60] | 5.09 | 76.1 |
| Quant-Noise | 3.3 | 80.0 |
| Quant-Noise + Share + Prune | 2.3 | 77.8 |
8.4 iPQ Quantization Details
Language Modeling
We quantize FFN with block size 8, embeddings with block size 8, and attention with block size 4. We tuned the block size for attention between the values (4, 8) to find the best performance. Note that during training we apply Quant-Noise to all the layers.
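To make the role of the block size concrete, here is a minimal sketch of plain product quantization of a single weight matrix: the matrix is cut into blocks of 8 values and the blocks are clustered into 256 centroids with k-means. It covers only the clustering step, not the iterative finetuning of iPQ; the function names, the row-major block layout, and the fixed number of k-means iterations are assumptions for illustration.

```python
import numpy as np

def nearest_centroid(blocks, codebook):
    """Index of the closest centroid (squared Euclidean distance) for each block."""
    d2 = ((blocks ** 2).sum(axis=1, keepdims=True)
          - 2.0 * blocks @ codebook.T
          + (codebook ** 2).sum(axis=1))
    return d2.argmin(axis=1)

def product_quantize(weight, block_size=8, n_centroids=256, n_iter=10, seed=0):
    """Cluster the blocks of `weight` with k-means; return codebook and assignments."""
    rng = np.random.default_rng(seed)
    blocks = weight.reshape(-1, block_size)          # one row per block
    # Initialise the centroids from randomly chosen blocks.
    init = rng.choice(len(blocks), size=n_centroids, replace=False)
    codebook = blocks[init].copy()
    for _ in range(n_iter):
        assign = nearest_centroid(blocks, codebook)
        for k in range(n_centroids):
            members = blocks[assign == k]
            if len(members) > 0:                     # empty clusters keep their centroid
                codebook[k] = members.mean(axis=0)
    return codebook, nearest_centroid(blocks, codebook)

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 256))                      # stand-in for a weight matrix
codebook, assign = product_quantize(w, block_size=8, n_centroids=256)
w_hat = codebook[assign].reshape(w.shape)            # reconstructed weights
mse = float(((w - w_hat) ** 2).mean())
```

Only `codebook` and `assign` need to be stored: each block of 8 weights is replaced by a single index into the codebook.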
RoBERTa
We quantize FFN with block size 4, embeddings with block size 4, and attention with block size 4. We tuned the block size between the values (4, 8) to find the best performance. Note that during training we apply Quant-Noise to all the layers.
EfficientNet
We quantize blocks sequentially and finish with the classifier. The block sizes are for all convolutions, for depthwise convolutions, for depthwise convolutions and for the classifier. Note that during training we apply Quant-Noise to all the weights in InvertedResidual Blocks (except the Squeeze-Excitation subblocks), the head convolution and the classifier.
8.5 Details of Pruning and Layer Sharing
We apply the Every Other Layer strategy from Fan et al. [17]. When combining layer sharing with pruning, we train models with shared layers and then prune chunks of shared layers. When sharing layers, the weights of adjacent layers are shared in chunks of two. As a concrete example, consider a model with layers A, B, C, D, E, F, G, H. We share layers A and B, C and D, E and F, G and H. To prune, every other chunk is removed; for example, we could prune A, B, E, F.
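The eight-layer example above can be written out as a short sketch. The helper names `share_in_chunks` and `prune_every_other` are hypothetical and only manipulate layer names; they are not part of any library.

```python
def share_in_chunks(layer_names, chunk_size=2):
    """Group adjacent layers into chunks; layers in a chunk share one set of weights."""
    return [layer_names[i:i + chunk_size]
            for i in range(0, len(layer_names), chunk_size)]

def prune_every_other(chunks, offset=0):
    """Drop every other chunk, starting with the chunk at index `offset`."""
    return [chunk for i, chunk in enumerate(chunks) if i % 2 != offset]

layers = list("ABCDEFGH")
chunks = share_in_chunks(layers)
kept = prune_every_other(chunks)
print(chunks)  # [['A', 'B'], ['C', 'D'], ['E', 'F'], ['G', 'H']]
print(kept)    # [['C', 'D'], ['G', 'H']]
```

Pruning at the chunk level keeps shared layers together, so a chunk is either kept whole or removed whole.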
8.6 Numerical Results for Graphical Diagrams
8.7 Further Ablations
8.7.1 Impact of QuantNoise for the Vision setup
We provide another study showing the impact of the proportion of elements on which to apply Quant-Noise in Table 8.

Table 8

| Proportion | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 |
|---|---|---|---|---|---|---|
| Top-1 | 80.66 | 80.83 | 80.82 | 80.88 | 80.92 | 80.64 |
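Table 9. Language Modeling: 16-layer Transformer on Wikitext-103. Image Classification: EfficientNet-B3 on ImageNet-1K.

| Quantization Scheme | Language Modeling: Size | Language Modeling: Compress | Language Modeling: Test PPL | Image Classification: Size | Image Classification: Compress | Image Classification: Top-1 Acc. |
|---|---|---|---|---|---|---|
| Uncompressed model | | | | | | |
| Int4 Quant Histogram | | | | | | |
| + Quant-Noise | | | | | | |
| Int4 Quant Channel | | | | | | |
| + Quant-Noise | | | | | | |
| Int8 Quant Histogram | | | | | | |
| + Quant-Noise | | | | | | |
| Int8 Quant Channel | | | | | | |
| + Quant-Noise | | | | | | |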
Table 10

| Model | Size (MB) | PPL |
|---|---|---|
| Quant-Noise + Share + Prune | 10 | 24.2 |
| Quant-Noise + Share + Prune with STE | 10 | 24.5 |
8.7.2 Impact of the number of centroids
We quantize with 256 centroids, which represents a balance between size and representation capacity. The effect of the number of centroids on performance and size is shown in Figure 4 (a). Quantizing with more centroids improves perplexity, so this parameter can be adjusted to fit practical storage constraints.
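As a rough illustration of the size side of this trade-off, the sketch below estimates the storage of one product-quantized matrix as a function of the number of centroids and the block size. The accounting (one index of ceil(log2 k) bits per block plus a codebook stored in 16-bit floats) and the matrix dimensions are assumptions for illustration, not figures from the experiments.

```python
import math

def pq_size_bytes(n_weights, block_size, n_centroids, centroid_bits=16):
    """Approximate storage for one product-quantized matrix, in bytes."""
    n_blocks = n_weights // block_size
    index_bits = math.ceil(math.log2(n_centroids))       # bits to name one centroid
    codebook_bits = n_centroids * block_size * centroid_bits
    return (n_blocks * index_bits + codebook_bits) / 8

n_weights = 1024 * 4096                                   # hypothetical FFN matrix
fp32_bytes = n_weights * 4
for k in (64, 256, 1024):
    size = pq_size_bytes(n_weights, block_size=8, n_centroids=k)
    print(k, size, round(fp32_bytes / size, 1))           # centroids, bytes, ratio
```

Under these assumptions the index storage dominates, so the size grows with log2 of the number of centroids and shrinks in proportion to the block size.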
8.7.3 Effect of Initial Model Size
Large, overparameterized models are more easily compressed. In Figure 5, we explore quantizing both shallower and skinnier models. For shallow models, the gap between quantized and non-quantized perplexity does not increase as layers are removed (Figure 5, left). In contrast, there is a larger gap in performance for models with smaller FFN (Figure 5, right). As the FFN size decreases, the weights are less redundant and more difficult to quantize with iPQ.
8.7.4 Difficulty of Quantizing Different Model Structures
Quantization is applied to various portions of the Transformer architecture — the embedding, attention, feedforward, and classifier output. We compare the quantizability of various portions of the network in this section.
Is the order of structures important?
We quantize specific network structures first — this is important as quantizing weight matrices can accumulate reconstruction error. Some structures of the network should be quantized last so the finetuning process can better adjust the centroids. We find that there are small variations in performance based on quantization order (see Figure 6). We choose to quantize FFN, then embeddings, and finally the attention matrices in Transformer networks.
Which structures can be compressed the most?
Finally, we analyze which network structures can be most compressed. During quantization, various matrix block sizes can be chosen as a parameter — the larger the block size, the more compression, but also the larger the potential reduction of performance. Thus, it is important to understand how much each network structure can be compressed to reduce the memory footprint of the final model as much as possible. In Figure 6, we quantize two model structures with a fixed block size and vary the block size of the third between 4 and 32. As shown, the FFN and embedding structures are more robust to aggressive compression, while the attention drastically loses performance as larger block sizes are used.
8.7.5 Approach to intN Scalar Quantization
We compare per-channel quantization to a histogram quantizer in Table 9. The histogram quantizer maintains a running min/max and minimizes the L2 distance between quantized and non-quantized values to find the optimal min/max. Per-channel quantization learns scales and offsets as vectors along the channel dimension, which provides more flexibility since the scales and offsets can differ from one channel to the next.
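The difference between the two approaches can be sketched on a toy weight matrix. The range search below is a crude stand-in for a histogram quantizer (a grid search over shrunken min/max ranges that keeps the lowest squared error), not PyTorch's algorithm; the function names, the int4 setting, and the assumption that the range straddles zero are illustrative choices.

```python
import numpy as np

def fake_quant(x, lo, hi, n_bits=4):
    """Uniform quantize-dequantize of x on [lo, hi]; lo and hi broadcast against x."""
    qmax = 2 ** n_bits - 1
    scale = np.maximum(hi - lo, 1e-12) / qmax
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q * scale + lo

def l2_range_search(x, n_bits=4, n_steps=50):
    """Shrink the min/max range towards zero and keep the lowest squared error.

    Assumes min(x) < 0 < max(x), which holds for the toy weights below.
    """
    lo, hi = x.min(), x.max()
    best = (np.inf, lo, hi)
    for t in np.linspace(1.0, 0.2, n_steps):
        err = ((fake_quant(x, t * lo, t * hi, n_bits) - x) ** 2).sum()
        if err < best[0]:
            best = (err, t * lo, t * hi)
    return best[1], best[2]

rng = np.random.default_rng(0)
# Rows (output channels) with very different magnitudes.
w = rng.normal(size=(16, 64)) * rng.uniform(0.1, 5.0, size=(16, 1))

per_tensor = fake_quant(w, w.min(), w.max())             # one min/max overall
lo, hi = l2_range_search(w)
searched = fake_quant(w, lo, hi)                          # one searched range
per_channel = fake_quant(w, w.min(axis=1, keepdims=True),
                         w.max(axis=1, keepdims=True))    # one min/max per row

def mse(a):
    return float(((a - w) ** 2).mean())

print(mse(per_tensor), mse(searched), mse(per_channel))
```

When channels differ widely in magnitude, a single range wastes most quantization levels on the largest channel, which is the flexibility argument made above for per-channel scales and offsets.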
8.7.6 LayerDrop with STE
For quantization noise, we apply the straight through estimator (STE) to the remaining weights in the backward pass. We also experiment with applying STE to the backward pass of LayerDrop’s pruning noise. Results are shown in Table 10; we find that this gives slightly worse results.