1 Introduction
Language model pretraining on large unlabeled corpora has become the driving force behind models such as BERT, XLNet, and RoBERTa [devlin2018bert, yang2019xlnet, liu2019roberta]. Built upon the Transformer [vaswani2017attention], BERT-based [devlin2018bert] models significantly improve state-of-the-art performance when fine-tuned on various Natural Language Processing (NLP) tasks [rajpurkar2016SQuAD, wang2018glue]. Recently, many follow-up works have pushed this line of research even further by increasing model capacity to more than a billion parameters [radford2019language]. Although these models achieve cutting-edge results on various NLP tasks, they have high latency and a prohibitive memory footprint and power consumption for edge inference. This, in turn, has limited their deployment on embedded devices such as cellphones or smart assistants, which now require cloud connectivity to function.
A promising method to address this challenge is quantization, which uses low-bit precision for parameter storage and enables low-bit hardware operations to speed up inference. The reduced memory footprint and accelerated inference can then enable edge deployment on hardware that supports reduced-precision inference, such as FPGAs or domain-specific accelerators. However, in ultra-low-bit settings, e.g., 4 bits, the generalization performance of the quantized model can degrade significantly, which may not be acceptable for a target application. In computer vision, a large and prominent line of work tackles this problem, e.g., with different quantization schemes [krishnamoorthi2018quantizing, zhang2018lq] and mixed-precision quantization [dong2019hawq, wu2018mixed, zhou2018adaptive]. However, very limited work has been done on NLP [xu2018alternating, wang2018hitnet], particularly on BERT-based models, which are actually more in need of model compression and acceleration.
[Figure: The loss landscape for different layers in MNLI and CoNLL-03, illustrated by perturbing the parameters along the first two dominant eigenvectors of the Hessian. The silver sphere shows the point in parameter space to which the BERT model has converged. Layers that exhibit flatter curvature can be quantized to lower bit precision.]
In this paper, we focus on ultra-low-precision quantization of BERT-based models, with the goal of minimizing performance degradation while maintaining hardware efficiency. To achieve this, we incorporate a number of novel techniques and propose Q-BERT. The contributions of our work include:
We apply mixed-precision quantization to BERT, guided by extensive layer-wise analysis of second-order information (i.e., Hessian information). We find that BERT exhibits drastically different Hessian behaviour compared with NN models for computer vision [yao2018Hessian, dong2019hawq]. Therefore, we propose a sensitivity measurement based on both the mean and the variance of the top eigenvalues in order to achieve better mixed-precision quantization, as opposed to [dong2019hawq], which only uses the mean value.
We propose a new quantization scheme, named group-wise quantization, which can alleviate accuracy degradation without a significant increase in hardware complexity. Specifically, in the group-wise quantization scheme, we partition each matrix into different groups, each with its own quantization range and look-up table.
We investigate the bottlenecks in BERT quantization, namely how different factors such as the quantization scheme and modules such as the embedding, self-attention, and fully-connected layers affect the trade-off between NLP performance and the model compression ratio.
We evaluate Q-BERT on four downstream tasks: Sentiment Classification, Natural Language Inference, Named Entity Recognition, and Machine Reading Comprehension. Q-BERT achieves 13× compression ratio in weights, 4× smaller activation size, and 4× smaller embedding size, within at most 2.3% accuracy loss. To the best of our knowledge, this is the first work on BERT quantization to ultra-low bits with acceptable performance loss.
2 Related Work
Model compression
Model compression is a very active area of research. Efforts in this area can be broadly categorized as follows: (i) new architectures that are compact by design [iandola2016squeezenet, howard2017mobilenets]; (ii) automated neural architecture search (NAS) with the reward function set as latency or model size [wang2019haq, wu2019fbnet]; (iii) pruning-based methods to reduce the model size of existing architectures [lecun1990optimal, li2016pruning]; (iv) knowledge distillation from a large model to help train a more compact model [ba2014deep, hinton2015distilling]; (v) hardware and architecture co-design [gholami2018squeezenext]; and (vi) inference quantization [zhang2018lq, dong2019hawq].
Here we focus solely on quantization [courbariaux2015binaryconnect, rastegari2016xnor, li2016ternary, zhou2016dorefa, choi2018pact, Jacob_2018_CVPR, zhang2018lq, dong2019hawq]. One of the challenges here is that ultra-low-precision quantization can lead to significant accuracy degradation. Mixed-precision quantization [wu2018mixed, zhou2018adaptive, wang2019haq] and multi-stage quantization [zhou2017incremental] have been proposed to alleviate this problem. However, the challenge with mixed-precision quantization is that the search space is exponentially large. For instance, if we have three precision options for each layer (2, 4, or 8 bits), then the total search space for the 12 encoder layers of each fine-tuned BERT model [devlin2018bert] becomes $3^{12} \approx 5.3 \times 10^5$ different precision settings. Recently, [dong2019hawq] proposed a second-order sensitivity-based method to address this issue and achieved state-of-the-art results on computer vision tasks. Part of our paper builds upon this prior work and extends the results to include other variations of second-order information instead of just the mean value of the Hessian spectrum.
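The size of this search space can be checked with a few lines of arithmetic. The sketch below assumes a 12-layer encoder (the depth listed in the bit-setting tables of Appendix C) and one independent choice per layer; the variable names are illustrative only.

```python
from itertools import product

bit_options = (2, 4, 8)   # candidate precisions per encoder layer
num_layers = 12           # assumption: BERT-base encoder depth

# Each layer chooses independently, so the count is |options| ** layers.
search_space = len(bit_options) ** num_layers

# Brute-force enumeration on a 3-layer toy case to confirm the counting rule.
small_case = sum(1 for _ in product(bit_options, repeat=3))

print(search_space, small_case)
```

Exhaustively fine-tuning one model per setting is clearly impractical at this scale, which is the motivation for a sensitivity-guided assignment.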
Compressed NLP model
Notable examples of NLP compression work are LSTM- and GRU-based models for machine translation and language modeling [xu2018alternating, wang2018hitnet]. Since the recent introduction of Transformer models, we have observed a significant increase in NLP model size. This is due to the incorporation of very large fully-connected layers and attention matrices in Transformers [vaswani2017attention, devlin2018bert, yang2019xlnet, liu2019roberta, radford2019language]. Model compression is crucial for deploying these models in resource-constrained environments. Pilot works addressing this are [michel2019sixteen, bhandare2019efficient]. From a different angle, [tay2019lightweight, ma2019tensorized] have probed architectural changes to the self-attention layer to make the Transformer lightweight. There have also been attempts to use distillation to reduce large pretrained Transformer models such as BERT [devlin2018bert] in [tang2019distilling, sun2019patient]. However, significant accuracy loss is observed even for relatively small compression ratios. Here we show that the compression ratio can be increased substantially, including a reduction of the embedding layer, with much smaller performance degradation.
3 Methodology
In this section, we introduce our proposed BERT quantization methods, including mixed-precision quantization based on Hessian information, as well as the techniques used for the group-wise quantization scheme.
As in [devlin2018bert], a fine-tuned BERT-base model consists of three parts: the embedding, the Transformer-based encoder layers, and the output layer. Specifically, assuming $x \in X$ is the input word (sentence) and $y \in Y$ is the corresponding label, we have the loss function $L$ defined as:
$$L(\theta) = \sum_{(x_i, y_i)} \mathrm{CE}\big(\mathrm{softmax}(W_c(W_n(\cdots W_1(W_e(x_i))))),\; y_i\big),$$
where CE is the cross-entropy function (or another appropriate loss function), and $\theta$ is a combination of $W_e$, $\{W_1, W_2, \ldots, W_n\}$, and $W_c$. Here, $W_e$ is the embedding table, $\{W_1, W_2, \ldots, W_n\}$ are the encoder layers, and $W_c$ is the output/classifier layer. (Here, we use $W_*$ for both a function and its corresponding parameters without confusion.)
The size of the parameters in the BERT-base model is 91MB for the embedding, 325MB for the encoder, and 0.01MB for the output layer. We do not quantize the output layer due to its negligible size, and focus on quantizing both the embedding and encoder layers. As will be discussed in Sec. 5.1, we find that the embedding layer is more sensitive to quantization than the encoder layers. As a result, we quantize the embedding and encoder parameters in different ways. The quantization schemes we used are explained in detail in the following sections.
3.1 Quantization process
General NN inference is performed in floating-point precision for both weights and activations. Quantization restricts the network weights to a finite set of values defined as follows:
$$Q(z) = q_j, \quad \text{for } z \in (t_j, t_{j+1}],$$
where $Q$ is the quantization operator, $z$ is a real-valued input tensor (an activation or a weight), and $(t_j, t_{j+1}]$ denotes an interval in the real numbers ($j = 0, \ldots, 2^k - 1$). Here $k$ is the quantization precision for a specific layer.
There are multiple choices for the quantization function $Q$. Here we use a uniform quantization function, where the range of floating-point values in a tensor is equally split [zhou2016dorefa, hubara2017quantized] and then represented by unsigned integers in $\{0, \ldots, 2^k - 1\}$. It should be noted that a non-uniform quantizer can potentially further increase the accuracy. However, we focus solely on uniform quantization since it allows a more efficient and easier hardware implementation. To backpropagate gradients through $Q$, which is non-differentiable, we use the Straight-Through Estimator (STE) [bengio2013estimating]. See Appendix A for more details about the forward and backward propagation during the entire quantization process.
3.2 Mixed precision quantization
Different encoder layers attend to different structures [clark2019does], and they are expected to exhibit different sensitivity. Thus, assigning the same number of bits to all the layers is suboptimal. This becomes more critical if the targeted model size is very small, which requires ultra-low precision such as 4 bits or 2 bits. As a result, we explore mixed-precision quantization, where we assign more bits to more sensitive layers in order to retain performance.
In [dong2019hawq], Hessian AWare Quantization (HAWQ) is developed for mixed-bit assignment. The main idea is that the parameters in NN layers with a higher Hessian spectrum (i.e., larger top eigenvalues) are more sensitive to quantization and require higher precision, compared to layers with a small Hessian spectrum (i.e., smaller top eigenvalues). However, there are 7M parameters in each encoder layer of BERT-base. Given that the Hessian of each layer is a matrix of size $7\text{M} \times 7\text{M}$, there is a common misconception that computing second-order statistics is infeasible. However, the Hessian spectrum can be computed by a matrix-free power iteration method [yao2018Hessian], which does not require explicit formation of the operator. To illustrate this, we take the first encoder layer as an example. Denoting the gradient of the loss with respect to the first encoder layer's parameters $W_1$ as $g_1$, for a random vector $v$ with the same dimension as $g_1$, we have
$$\frac{\partial g_1^T v}{\partial W_1} = \frac{\partial g_1^T}{\partial W_1} v + g_1^T \frac{\partial v}{\partial W_1} = \frac{\partial g_1^T}{\partial W_1} v = H_1 v, \qquad (1)$$
where $H_1$ is the Hessian matrix of the first encoder layer. Here the second equality comes from the fact that $v$ is independent of $W_1$. The top eigenvalue can then be computed by power iteration, as shown in Alg. 1 in the Appendix. We denote by $\lambda_i$ the top eigenvalue of the $i$-th encoder layer. Using this approach, we show in Fig. 10 the distribution of the top Hessian eigenvalue for different layers of BERT-base. Different layers exhibit different eigenvalue magnitudes even though all layers have exactly the same structure and size.
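The matrix-free idea can be sketched on a toy problem. The snippet below is not the authors' Alg. 1: it swaps the automatic-differentiation Hessian-vector product of Eq. 1 for a central finite difference of the gradient, and the names `hessian_vector_product`, `top_eigenvalue`, and `grad_fn` are hypothetical. It only illustrates that the top eigenvalue can be estimated from gradient evaluations without ever forming the Hessian.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-3):
    """Approximate H v with a central difference of the gradient (matrix-free)."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def top_eigenvalue(grad_fn, w, num_iters=500, seed=0):
    """Power iteration on the Hessian at w, using only gradient evaluations."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(num_iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eigenvalue = float(v @ hv)      # Rayleigh quotient for the current v
        norm = np.linalg.norm(hv)
        if norm == 0.0:                 # zero curvature along v, nothing to iterate on
            break
        v = hv / norm
    return eigenvalue

# Toy check: the loss 0.5 * w^T A w has gradient A w and Hessian A.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((20, 20)))
A = Q @ np.diag(np.linspace(1.0, 10.0, 20)) @ Q.T   # known spectrum, top eigenvalue 10
w0 = rng.standard_normal(20)

estimate = top_eigenvalue(lambda w: A @ w, w0)
exact = float(np.linalg.eigvalsh(A)[-1])
print(estimate, exact)
```

For a real network one would typically compute the Hessian-vector product by differentiating $g^T v$ as in Eq. 1, which avoids the step-size choice `eps`; the power-iteration loop stays the same.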
The above Hessian-based approach was used in [dong2019hawq], where top eigenvalues are computed and averaged over different training data. More aggressive quantization is performed for layers that have a smaller top eigenvalue, which corresponds to a flatter loss landscape, as in the loss landscape figure for MNLI and CoNLL-03. However, we find that assigning bits based only on the average top eigenvalues is infeasible for many NLP tasks. As shown in Fig. 10, the top Hessian eigenvalues of some layers exhibit very high variance with respect to different portions of the input dataset. As an example, the variance of one layer for SQuAD stays larger than 61.6 while the mean of that layer is around 1.0, even though each data point corresponds to 10% of the entire dataset (which is 9K samples). To address this, we use the following metric instead of just the mean value,
$$\Omega_i \triangleq |\mathrm{mean}(\lambda_i)| + \mathrm{std}(\lambda_i), \qquad (2)$$
where $\lambda_i$ is the distribution of the top eigenvalues of $H_i$, calculated with 10% of the training dataset. (Without confusion, we use $\lambda_i$ for both a single top eigenvalue and its distribution with respect to 10% of the data.) After $\Omega_i$ is computed, we sort the values in descending order and use them as a metric to determine the relative quantization precision. We then perform quantization-aware fine-tuning based on the selected precision setting.
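A minimal sketch of this ranking step follows. It assumes the metric combines the absolute mean and the standard deviation of each layer's top eigenvalues across data subsets, uses the population standard deviation (NumPy's default), and applies a simple "top-k layers get the higher bit-width" rule. That rule, the function names, and the eigenvalue numbers are all illustrative assumptions; the text only says the sorted metric is used to determine precision relatively.

```python
import numpy as np

def sensitivity(top_eigs):
    """Per-layer score |mean| + std over data subsets (rows = layers)."""
    top_eigs = np.asarray(top_eigs, dtype=float)
    return np.abs(top_eigs.mean(axis=1)) + top_eigs.std(axis=1)

def assign_bits(omega, num_high, high_bits=3, low_bits=2):
    """Hypothetical rule: the num_high most sensitive layers get high_bits."""
    order = np.argsort(-omega)                      # layer indices, most sensitive first
    bits = np.full(omega.shape[0], low_bits, dtype=int)
    bits[order[:num_high]] = high_bits
    return bits

# Made-up top eigenvalues: one row per layer, one column per data subset.
eigs = np.array([
    [0.5, 0.6, 0.4],     # small mean, small spread
    [1.0, 9.0, -7.0],    # mean 1.0 but very large spread
    [3.0, 3.1, 2.9],     # large mean, small spread
    [0.1, 0.1, 0.1],     # flat and stable
])
omega = sensitivity(eigs)
bits = assign_bits(omega, num_high=2)
print(omega, bits)
```

Note that the second layer would look harmless under a mean-only criterion, yet it ranks as the most sensitive once the spread is included.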
An important technical point that we need to emphasize is that our method expects that, before quantization is performed, the trained model has converged to a local minimum. That is, the practitioners who trained BERT and fine-tuned it for downstream tasks should have chosen the hyperparameters and number of iterations such that a local minimum has been reached. The necessary optimality conditions are zero gradient and positive curvature (i.e., positive Hessian eigenvalue). In our analysis, we observed that for the three tasks of MNLI, CoNLL-03, and SST-2 the top Hessian eigenvalue is indeed positive (see Fig. 5, and Fig. 25 in the Appendix). However, we find that the BERT model fine-tuned for SQuAD has actually not converged to a local minimum, as is evident from the Hessian eigenvalues shown in Fig. 10(d), where we observe very large negative eigenvalues. Directly visualizing the loss landscape also shows this very clearly, as in Fig. 13. Because of this, our expectation is that performing quantization on SQuAD would lead to higher performance degradation compared to other tasks, and this is indeed the case, as will be discussed next.
[Figure: Group-wise quantization. The value matrices are concatenated together, which results in a 3-d tensor. The same color denotes the same group with a shared quantization range. (a) For layer-wise quantization, the entire 3-d tensor is quantized from a universal quantization range into discrete unsigned integers. (b) A special case of group-wise quantization treats each dense matrix as a group, and every matrix can have its own quantization range. (c) A more general case partitions each dense matrix w.r.t. output neurons and buckets consecutive output neurons into groups.]
3.3 Group-wise Quantization
Assume that the input sequence has $n$ words and each word has a $d$-dim embedding vector ($d = 768$ for BERT-base), i.e., $x = (x(1), \ldots, x(n))^T \in \mathbb{R}^{n \times d}$. In the Transformer encoder, each self-attention head has 4 dense matrices, i.e., $W_k, W_q, W_v \in \mathbb{R}^{\frac{d}{N_h} \times d}$ and $W_o \in \mathbb{R}^{d \times \frac{d}{N_h}}$, where $N_h$ is the number of attention heads. Here $W_k$, $W_q$, $W_v$, and $W_o$ stand for the key, query, value, and output weight matrices. Each self-attention head computes the weighted sum as
$$\mathrm{Att}(x, x(j)) = W_o \sum_{i=1}^{n} \mathrm{softmax}\left(\frac{x(j)^T W_q^T W_k\, x(i)}{\sqrt{d}}\right) W_v\, x(i).$$
Through this reparametrization, the multi-head self-attention (MHSA) will add these features into the final output, that is, we will have $\sum_{i=1}^{N_h} \mathrm{Att}_i(x, x(j))$. Directly quantizing the 4 matrices in MHSA as a whole with the same quantization range can significantly degrade the accuracy, since there are more than 2M parameters in total, which corresponds to $4 \times 12 \times 64 = 3072$ output neurons, and the weights corresponding to each neuron may lie in a different range of real numbers. Channel-wise quantization can be used to alleviate this problem in convolutional neural networks, where each convolutional kernel can be treated as a single output channel and have its own quantization range. However, this cannot be directly applied to dense matrices, since each dense matrix itself is a single kernel. Therefore, we propose group-wise quantization for attention-based models. We treat the individual matrix $W$ with respect to each head in one dense matrix of MHSA as a group, so there will be 12 groups. Furthermore, in each group, we bucket sequential output neurons together as subgroups, e.g., each 6 output neurons as one subgroup, so there are $12 \times \frac{64}{6} = 128$ subgroups in total (the hidden dimension in each head of BERT-base is $\frac{768}{12} = 64$). Each subgroup can have its own quantization range. An illustration is shown in Fig. 17 for $W_v$, where we concatenate the $N_h$ value matrices $W_v$ into a 3-d tensor. For layer-wise quantization, the entire 3-d tensor will be quantized into the same range of discrete numbers, as shown in Fig. 17(a). A special case of group-wise quantization is that we treat each dense matrix as a group, and every matrix can have its own quantization range, as shown in Fig. 17(b). A more general case, in Fig. 17(c), is that we partition each dense matrix with respect to output neurons and bucket consecutive output neurons into groups. The effect of finer group-wise quantization is further investigated in Sec. 4.2.
4 Experiment
In this section, we describe our experiments evaluating the proposed Q-BERT on four different NLP tasks. Details of the datasets are given in Appendix B. To the best of our knowledge, there is no published work on BERT quantization at this point, so we report direct quantization (DirectQ), i.e., quantization without mixed precision and group-wise quantization, as a baseline.




4.1 Main Results
We present results of Q-BERT on the development sets of the four tasks SST-2, MNLI, CoNLL-03, and SQuAD, as summarized in Tab. 1. As one can see, Q-BERT performs significantly better than the DirectQ method across all four tasks in each bit setting. The gap becomes more obvious in ultra-low-bit settings. As an example, in the 4-bit setting, direct quantization (DirectQ) of SQuAD results in 11.5% performance degradation compared to BERT-base. However, for the same 4-bit setting, Q-BERT exhibits only 0.5% performance degradation. Moreover, under the 3-bit setting, the gap between Q-BERT and DirectQ increases even further, to between 9.68% and 27.83% depending on the task.
In order to push the precision setting to even lower bits, we investigate mixed-precision Q-BERT (Q-BERT-MP). As can be seen, Q-BERT with a uniform 2-bit setting has very poor performance across all four tasks, though the memory is reduced by 20% relative to the 3-bit setting. The reason is that not all layers have the same sensitivity to quantization, as is evident from the loss landscape visualizations (see the loss landscape figure for MNLI and CoNLL-03, and the additional loss landscape figure in the Appendix). Intuitively, more sensitive layers need a higher bit precision, while for less sensitive layers a 2-bit setting is already sufficient. To set the mixed precision for each encoder layer of BERT-base, we measure the sensitivity based on Eq. 2, which captures both the mean and the variance of the top eigenvalue of the Hessian shown in Fig. 10. Note that all experiments in Fig. 10 are based on 10 runs, and each run uses 10% of the entire training dataset. We can observe that for most of the lower encoder layers (layers 1-8), the variance is quite large compared to the last three layers. We generally observe that the middle part (layers 4-8) has the largest mean. Beyond their relatively smaller mean, the last three layers also have much smaller variance, which indicates the insensitivity of these layers. Therefore, higher bits will only be assigned to the middle layers according to Eq. 2 for Q-BERT 2/3 MP. (The exact bit settings are included in Appendix C.1.) In this way, with only 5MB of additional memory storage, 2/3-bit Q-BERT-MP is able to keep the performance drop within 2.3% for MNLI and SQuAD and within 1.1% for SST-2 and CoNLL-03, with up to 13× compression ratio in weights. Note that this is up to 6.8% better than Q-BERT with uniform 2 bits.
One consideration for quantization is that 3-bit quantized execution is typically not supported in hardware. It is, however, possible to load 3-bit quantized values and cast them to a higher bit precision, such as 4 or 8 bits, in the execution units. This would still have the benefit of reduced memory volume to/from DRAM. It is also possible to avoid using 3 bits and instead use a mixture of 2 and 4 bits, as shown in Tab. 1. For example, SST-2 Q-BERT-MP with mixed 2/4-bit precision weights has the same model size as 3-bit quantization (53.2MB) and achieves similar accuracy. We observe a similar trend for the other tasks as well.
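The equal-size claim can be reproduced with back-of-the-envelope arithmetic. The sketch below uses the full-precision sizes quoted in Sec. 3 (91MB embedding, 325MB encoder, 0.01MB output) and the SST-2 column of the 2/4-bit table in Appendix C. It assumes all 12 encoder layers have equal size, 8-bit embeddings, a 32-bit output layer, and no storage overhead for quantization ranges or look-up tables; `model_size_mb` is a hypothetical helper.

```python
FP32_BITS = 32
EMBEDDING_MB = 91.0    # full-precision sizes quoted in Sec. 3
ENCODER_MB = 325.0
OUTPUT_MB = 0.01
NUM_LAYERS = 12

def model_size_mb(layer_bits, embedding_bits=8):
    """Approximate model size in MB for one weight bit-width per encoder layer."""
    if len(layer_bits) != NUM_LAYERS:
        raise ValueError("expected one bit-width per encoder layer")
    per_layer_mb = ENCODER_MB / NUM_LAYERS          # assumption: equally sized layers
    encoder = sum(per_layer_mb * b / FP32_BITS for b in layer_bits)
    embedding = EMBEDDING_MB * embedding_bits / FP32_BITS
    return embedding + encoder + OUTPUT_MB          # output layer kept in 32 bits

uniform_3bit = model_size_mb([3] * NUM_LAYERS)
# SST-2 weight bits for layers 1..12, read from the 2/4-bit table in Appendix C.
sst2_2_4_mp = model_size_mb([2, 2, 4, 4, 4, 2, 4, 4, 4, 2, 2, 2])
print(round(uniform_3bit, 1), round(sst2_2_4_mp, 1))
```

Six layers at 4 bits and six at 2 bits average to 3 bits per weight, which is why the two configurations come out at the same size.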
One important observation is that we found SQuAD to be harder to quantize than the other tasks; see Tab. 1(d). For example, 2-bit DirectQ results in more than 10% F1 score degradation. Even Q-BERT has a larger performance drop than on the other tasks in Tab. 1. We studied this phenomenon further through Hessian analysis. In Fig. 10, among all the tasks, it can be clearly seen that SQuAD not only has much larger eigenvalue variance, but also has very large negative eigenvalues. In fact, this shows that the existing BERT model for SQuAD has not reached a local minimum. This is further illustrated in the 3-d loss landscapes of all four tasks in the loss landscape figure for MNLI and CoNLL-03 and in Fig. 13 (and the additional loss landscape figure in the Appendix). It can be clearly seen that for the other three tasks, the stopping point is in a quadratic bowl (at least in the first two dominant eigenvalue directions of the Hessian). However, compared to the others, SQuAD has a totally different structure to its loss landscape. As shown in Fig. 13, the stopping points of different layers on SQuAD have negative curvature directions, which means they have not yet converged to a local minimum. This could well explain why the quantization of SQuAD results in a larger accuracy drop. Our initial attempts to address this by changing training hyperparameters were not successful. We found that the BERT model quickly overfits the training data. However, we emphasize that fixing BERT model training itself is outside the scope of this paper and not possible with academic computational resources.
4.2 Effects of groupwise quantization
We measure the performance gains with different group numbers in Tab. 2. We can observe from the table that performing layer-wise quantization (shown in Fig. 17(a)) is suboptimal for all four tasks (the performance drop is around 7% to 11.5%). However, the performance increases significantly as we increase the number of groups. For example, for 12 groups, the performance degradation is less than 2% for all the tasks. Further increasing the group number from 12 to 128 increases the accuracy by at least 0.3%. However, increasing the group number further from 128 to 768 only increases the performance by about 0.1%. This shows that the performance gain almost saturates around 128 groups. It is also preferable not to have a very large number of groups, since this increases the number of Look-up Tables (LUTs) necessary for each matrix multiplication. This can adversely affect hardware performance, and based on our results there are diminishing returns in terms of accuracy. In all our experiments, we used 128 groups for both Q-BERT and Q-BERT-MP in Sec. 4.1.
Table 2: Effects of group-wise quantization.
# Group    SST-2   MNLI-m/mm     CoNLL-03
Baseline   93.00   84.00/84.40   95.00
1          85.67   76.69/77.00   89.86
12         92.31   82.37/82.95   94.42
128        92.66   83.89/84.17   94.90
768*       92.78   84.00/84.20   94.99
* Here we treat each output neuron as a single group.
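The trend in this table (finer groups, smaller quantization error) can be reproduced qualitatively on synthetic weights. The sketch below is an illustration under stated assumptions, not the authors' implementation: it buckets contiguous output neurons (rows) into groups without the per-head structure, uses a min/max range per group, and measures reconstruction error rather than task accuracy. The synthetic matrix `W` and all function names are hypothetical.

```python
import numpy as np

def quantize_uniform(w, num_bits):
    """Uniform min/max quantization of one tensor; returns dequantized values."""
    w_min, w_max = w.min(), w.max()
    if w_max == w_min:                       # constant tensor, nothing to quantize
        return w.copy()
    delta = (w_max - w_min) / (2 ** num_bits - 1)
    idx = np.round((w - w_min) / delta)      # integer level of every weight
    return idx * delta + w_min

def quantize_groupwise(weight, num_groups, num_bits):
    """Bucket contiguous output neurons (rows); each bucket gets its own range."""
    out = np.empty_like(weight)
    for rows in np.array_split(np.arange(weight.shape[0]), num_groups):
        out[rows] = quantize_uniform(weight[rows], num_bits)
    return out

rng = np.random.default_rng(0)
# Synthetic stand-in: 768 output neurons whose weight scales differ widely.
row_scale = np.logspace(-2, 0, 768)[:, None]
W = rng.standard_normal((768, 64)) * row_scale

def mse(num_groups, num_bits=4):
    return float(np.mean((W - quantize_groupwise(W, num_groups, num_bits)) ** 2))

errors = {g: mse(g) for g in (1, 12, 128, 768)}
print(errors)
```

With a single group, the range is dictated by the largest-scale rows, so small-scale rows collapse onto one or two levels; giving each bucket its own range removes most of that error, with diminishing returns as buckets shrink.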
5 Discussion
In this section, we further investigate the quantization effects on different modules (e.g., word and position embeddings), and we perform a qualitative analysis using attention distributions. This illustrates that Q-BERT better captures the behaviour of the original model than DirectQ in all cases.
5.1 Quantization effects on different modules
Here we investigate the quantization effects with respect to different modules of the BERT model (multi-head self-attention versus feed-forward network, and different embedding layers, i.e., word embedding versus position embedding).
Generally speaking, we find that the embedding layer is more sensitive to quantization than the encoder weights. This is illustrated in Tab. (a), where we use 4-bit layer-wise quantization for the embedding, which results in an unacceptable performance drop of up to 10% for SST-2, MNLI, and CoNLL-03, and even more than 20% for SQuAD. This is despite the fact that we used 8/8 bits for weights/activations. In contrast, the encoder layers account for around 79% of the total parameters (roughly 3.6× the embedding parameter size, i.e., 325MB versus 91MB), yet quantizing them to 4 bits in Tab. 1 leads to less performance loss.
Furthermore, we find that the position embedding is very sensitive to quantization. For instance, quantizing the position embedding to 4 bits generally results in 2% additional performance degradation compared to quantizing the word embedding, even though the position embedding accounts for less than 5% of the entire embedding. This indicates the importance of positional information in Natural Language Understanding tasks. Given that the position embedding accounts for only a small portion of the model size, we can apply mixed-precision quantization to the embedding to further push down the model size with a tolerable accuracy drop, as shown in Appendix C.2.


To study the quantization effects on self-attention layers and fully-connected networks, we conducted extensive experiments under different bit settings for the encoder layers. The results are shown in Tab. (b). Specifically, we adopt the Q-BERT-MP setting in Tab. 1, with a mixture of 2 and 3 bits for encoder weights. To test the robustness of the two modules inside each encoder layer, we further reduce the precision by one bit in the corresponding modules and denote the resulting precision setting 1/2 MP. From Tab. (b), we can conclude that the self-attention layer is generally more robust to quantization than the fully-connected network, since 1/2 MP self-attention results in about 5% performance drop, while 1/2 MP fully-connected worsens this to 11%.
5.2 Qualitative Analysis
We use attention information to analyze qualitatively the difference between Q-BERT and DirectQ.
To do so, we compute the Kullback–Leibler (KL) divergence between the attention distributions for the same input from the corresponding heads of the quantized BERT and the full-precision BERT. It should be noted that we compute the average distance over 10% of the entire training dataset. A smaller KL divergence here means that the outputs of the multi-head attention of the two models are closer to each other. We illustrate this distance score for each individual head in Fig. 22 for SST-2, MNLI, CoNLL-03, and SQuAD. We compared Q-BERT and DirectQ with 4-bit weights, 8-bit embedding, and 8-bit activation. Each scatter point in Fig. 22 denotes the distance w.r.t. one head, and the line chart shows the average results over the 12 heads in one layer. We can clearly see that Q-BERT always incurs a smaller distance to the original baseline model than the DirectQ model, for all the different layers.
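The per-head distance can be sketched as follows. This is a minimal illustration, assuming attention tensors of shape (heads, query positions, key positions) whose last axis sums to one, and assuming the direction KL(full-precision || quantized), which the text does not specify. The perturbed logits stand in for a quantized model and are synthetic.

```python
import numpy as np

def attention_kl(p, q, eps=1e-12):
    """KL(p || q) per head for attention tensors of shape (heads, queries, keys)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)   # one value per (head, query)
    return kl.mean(axis=-1)                             # average over query positions

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)               # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.standard_normal((12, 16, 16))              # 12 heads, 16 tokens
full = softmax(logits)
close = softmax(logits + 0.05 * rng.standard_normal(logits.shape))  # mild distortion
far = softmax(logits + 1.0 * rng.standard_normal(logits.shape))     # strong distortion

kl_close = attention_kl(full, close)
kl_far = attention_kl(full, far)
print(kl_close.mean(), kl_far.mean())
```

Averaging these per-head values over a sample of inputs, and then over the heads of a layer, gives the kind of per-head scatter points and per-layer line described above.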
6 Conclusion
In this work, we perform an extensive analysis of fine-tuned BERT and propose Q-BERT, an effective scheme for quantizing BERT. In order to reduce the model size aggressively by mixed-precision quantization, we proposed a new layer-wise Hessian-based method which captures both the average and the variance of the eigenvalues. Moreover, a new group-wise quantization is proposed to perform fine-grained quantization inside each encoder layer. On four downstream tasks, equipped with the aforementioned methods, Q-BERT achieves 13× compression ratio in weights, 4× smaller activation size, and 4× smaller embedding size, with at most 2.3% accuracy loss. To better understand how different factors affect the trade-off between performance and the model compression ratio in Q-BERT, we conduct controlled experiments to investigate the effect of different quantization schemes and of quantizing different modules in BERT, respectively.
Acknowledgments
We would like to thank Prof. Joseph Gonzalez, Prof. Dan Klein, and Prof. David Patterson for their valuable feedback. This work was supported by a gracious fund from Intel corporation, Berkeley Deep Drive (BDD), and Berkeley AI Research (BAIR) sponsors. We would like to thank the Intel VLAB team for providing us with access to their computing cluster. We also thank gracious support from Google for providing cloud compute. MWM would also like to acknowledge ARO, DARPA, NSF, ONR, and Intel for providing partial support of this work.
Appendix A Detailed quantization process
In the forward pass, each element in the input will be quantized as follows:
where is the round operator, is distance between adjacent quantized points, is a set of integer indices and is the index for the bias. We drop for clarity in the following equations. During inference, the expensive floating-point tensor arithmetic can be replaced by efficient integer arithmetic for the matrix multiplication with , followed by a gathered dequantization operation, which can speed up the computation by orders of magnitude. Since we use a quantization-aware fine-tuning scheme, in the backward pass the Straight-Through Estimator (STE) [bengio2013estimating] is used for computing the gradient for .
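The forward and backward passes described here can be sketched in a few lines. This is a minimal illustration of uniform quantization with a straight-through backward pass, assuming the quantization range is simply the tensor's minimum and maximum; the function names and the returned tuple layout are hypothetical, not the authors' code.

```python
import numpy as np

def quantize_forward(x, num_bits):
    """Clamp to [x_min, x_max], then round to the nearest of 2**num_bits levels."""
    x_min, x_max = float(x.min()), float(x.max())   # assumption: min/max range
    delta = (x_max - x_min) / (2 ** num_bits - 1)   # distance between adjacent levels
    if delta == 0.0:                                # constant tensor
        return np.zeros(x.shape, dtype=np.int64), x.copy(), delta, x_min
    idx = np.round((np.clip(x, x_min, x_max) - x_min) / delta).astype(np.int64)
    return idx, idx * delta + x_min, delta, x_min   # integer indices, dequantized values

def ste_backward(grad_output):
    """Straight-through estimator: treat the rounding step as identity."""
    return grad_output

x = np.array([-1.0, -0.4, 0.1, 0.3, 1.0])
idx, x_q, delta, x_min = quantize_forward(x, num_bits=2)
grad_in = ste_backward(np.ones_like(x))
print(idx, x_q)
```

The integer indices `idx` are what an integer matrix multiplication would consume; the affine map `idx * delta + x_min` is the dequantization step applied afterwards.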
Appendix B Dataset
We apply Q-BERT to Sentiment Classification, Natural Language Inference, Named Entity Recognition, and Machine Reading Comprehension tasks. For Sentiment Classification, we evaluate on the Stanford Sentiment Treebank (SST-2) [socher2013recursive]. For Named Entity Recognition, we use the CoNLL-2003 English benchmark dataset for NER (CoNLL-03) [sang2003introduction]. For Natural Language Inference, we test on Multi-Genre Natural Language Inference (MNLI) [williams2017broad]. For Machine Reading Comprehension, we evaluate on the Stanford Question Answering Dataset (SQuAD) [rajpurkar2016SQuAD].
More specifically, SST-2 is a movie review dataset with binary annotations, where the binary label indicates positive or negative reviews. MNLI is a multi-genre NLI task for predicting whether a given premise-hypothesis pair is entailment, contradiction, or neutral. Its test and development datasets are further divided into in-domain (MNLI-m) and cross-domain (MNLI-mm) splits to evaluate the generality of tested models. CoNLL-03 is a newswire article dataset for predicting the exact span of four annotated entity types: person, location, organization, and miscellaneous. SQuAD is a task of answering a question by extracting the relevant span from the context, where a paragraph of context and a question are provided for each sample.
Appendix C Extra results
Here we describe several additional results.
C.1 Ablation Study of Hessian based Mixed Precision Assignment
To demonstrate the robustness of our Hessian-based mixed-precision method, we conduct an ablation study using the reversed version of 2/3-bit Q-BERT-MP (Q-BERT-MP-rev). Specifically, we assign lower bits to the relatively sensitive layers and higher bits to the less sensitive ones, which means that a layer assigned 2 bits in 2/3-bit Q-BERT-MP is now assigned 3 bits, and vice versa. (The bit settings of 2/3-bit Q-BERT-MP and 2/4-bit Q-BERT-MP are included in Tab. 6 and Tab. 7, respectively.)
We can observe that, even though the model size of Q-BERT-MP-rev is larger than or similar to that of Q-BERT-MP, the performance difference between Q-BERT-MP-rev and 2-bit Q-BERT is within 2% for MNLI, CoNLL-03, and SQuAD and within 4% for SST-2, while that of Q-BERT-MP is beyond 5% for MNLI, CoNLL-03, and SQuAD and 8% for SST-2. This large discrepancy in performance illustrates the advantage of leveraging second-order Hessian information in mixed-precision bit assignment.




C.2 Mixed Precision Quantization for Embedding
As can be seen from Tab. 1, when 2/3 MP is used for quantizing the weight parameters, the model size is bottlenecked by the embedding table size. Also, as observed in Tab. (a), the word embedding is less sensitive. Therefore, in this section, we further push the embedding table to 4-bit (word embedding) and 8-bit (position embedding) mixed precision to reduce the overall model size. Similar to group-wise quantization for weights, in this ultra-low embedding bit setting, we bucket the 768 output neurons in the BERT-base word and position embedding layers into 128 groups in Tab. 5. We adopt the same setting for weights and activations as in Tab. 1, where we employ 128 groups for weights and set 8/8 bits for weight/activation. Note that with around 0.5% performance drop, the embedding table size can be reduced to 11.6MB, which further compresses both the embedding table and the total model size.




C.3 Detailed loss landscape for SST-2
We include the detailed loss landscape analysis for the remaining task, SST-2, as shown in Fig. 25.
Table 6: Bit settings of 2/3-bit Q-BERT-MP.
Layer(s)  Layer Type  Parameter Size (M)  Weight bit (SST-2)  Weight bit (MNLI)  Weight bit (CoNLL-03)  Weight bit (SQuAD) 
Layer 0  Embedding  23.8  8  8  8  8 
Layer 1  Transformer  7.1  2  2  2  2 
Layer 2  Transformer  7.1  2  2  3  2 
Layer 3  Transformer  7.1  2  2  3  2 
Layer 4  Transformer  7.1  3  2  3  3 
Layer 5  Transformer  7.1  3  3  3  3 
Layer 6  Transformer  7.1  3  3  2  3 
Layer 7  Transformer  7.1  3  3  2  3 
Layer 8  Transformer  7.1  3  3  2  3 
Layer 9  Transformer  7.1  3  2  2  3 
Layer 10  Transformer  7.1  2  2  2  2 
Layer 11  Transformer  7.1  2  2  2  2 
Layer 12  Transformer  7.1  2  2  2  2 
Layer 13  FC  0.01  32  32  32  32 
Table 7: Bit settings of 2/4-bit Q-BERT-MP.
Layer(s)  Layer Type  Parameter Size (M)  Weight bit (SST-2)  Weight bit (MNLI)  Weight bit (CoNLL-03)  Weight bit (SQuAD) 
Layer 0  Embedding  23.8  8  8  8  8 
Layer 1  Transformer  7.1  2  2  2  2 
Layer 2  Transformer  7.1  2  2  2  2 
Layer 3  Transformer  7.1  4  2  2  2 
Layer 4  Transformer  7.1  4  4  4  4 
Layer 5  Transformer  7.1  4  4  4  4 
Layer 6  Transformer  7.1  2  4  4  4 
Layer 7  Transformer  7.1  4  4  4  4 
Layer 8  Transformer  7.1  4  4  4  4 
Layer 9  Transformer  7.1  4  2  2  2 
Layer 10  Transformer  7.1  2  2  2  2 
Layer 11  Transformer  7.1  2  2  2  2 
Layer 12  Transformer  7.1  2  2  2  2 
Layer 13  FC  0.01  32  32  32  32 