Language model pre-training from large unlabeled data has become the new driving force for models such as BERT, XLNet, and RoBERTa [devlin2018bert, yang2019xlnet, liu2019roberta]. Built upon the Transformer [vaswani2017attention], BERT-based [devlin2018bert] models significantly improve the state-of-the-art performance when fine-tuned on various Natural Language Processing (NLP) tasks [rajpurkar2016SQuAD, wang2018glue]. Recently, many follow-up works have pushed this line of research even further by increasing the model capacity to billions of parameters [radford2019language]. Though these models achieve cutting-edge results on various NLP tasks, they have high latency and a prohibitive memory footprint and power consumption for edge inference. This, in turn, has limited the deployment of these models on embedded devices like cellphones or smart assistants, which now require cloud connectivity to function.
A promising method to address this challenge is quantization, which uses low bit precision for parameter storage and enables low bit hardware operations to speed up inference. The reduced memory footprint and accelerated inference can then enable edge deployment on hardware that supports reduced precision inference, such as FPGAs or domain specific accelerators. However, in ultra low-bit settings, e.g., 4 bits, the generalization performance of the quantized model can degrade significantly, and this may not be acceptable for a target application. Historically, in the computer vision area, a prominent line of work tackles this problem, e.g., different quantization schemes [krishnamoorthi2018quantizing, zhang2018lq], mixed precision quantization [dong2019hawq, wu2018mixed, zhou2018adaptive], etc. However, there is very limited work on NLP [xu2018alternating, wang2018hitnet], particularly on BERT-based models, which are actually more in need of model compression and acceleration.
In this paper, we focus on ultra low precision quantization of BERT based models, with the goal of minimizing performance degradation while maintaining hardware efficiency. To achieve this, we incorporate a number of novel techniques and propose Q-BERT. The contributions of our work include:
Figure: The loss landscape for different layers in MNLI and CoNLL-03, illustrated by perturbing the parameters along the first two dominant eigenvectors of the Hessian. The silver sphere shows the point in the parameter space to which the BERT model has converged. Layers that exhibit flatter curvature can be quantized to lower bit precision.
We apply mixed-precision quantization on BERT, guided by extensive layer-wise analysis of second order information (i.e., Hessian information). We find that BERT exhibits a drastically different Hessian behaviour, as compared with NN models for computer vision [yao2018Hessian, dong2019hawq]. Therefore, we use a sensitivity metric based on both the mean and the variance of the top eigenvalues, as opposed to [dong2019hawq], which only uses the mean value.
We propose a new quantization scheme, named group-wise quantization, which can alleviate accuracy degradation without a significant increase in hardware complexity. Specifically, in the group-wise quantization scheme, we partition each matrix into different groups, each with its own quantization range and look-up table.
We investigate the bottlenecks in BERT quantization, namely how different factors such as quantization scheme and modules such as embedding, self-attention, and fully-connected layers affect the trade-off between NLP performance and the model compression ratio.
We evaluate Q-BERT in four downstream tasks, including Sentiment Classification, Natural Language Inference, Named Entity Recognition, and Machine Reading Comprehension. Q-BERT achieves compression ratio in weights, smaller activation size, and smaller embedding size, within at most 2.3% accuracy loss. To the best of our knowledge, this is the first work for BERT quantization to ultra low bits with acceptable performance loss.
2 Related Work
Model compression is a very active area of research. Efforts in this area could be broadly categorized as follows: (i) new architectures that are compact by design [iandola2016squeezenet, howard2017mobilenets]; (ii) automated neural architecture search (NAS) with reward function set as latency or model size [wang2019haq, wu2019fbnet]; (iii) pruning based methods to reduce model size of existing architectures [lecun1990optimal, li2016pruning]; (iv) knowledge distillation from a large model to help train a more compact model [ba2014deep, hinton2015distilling]; (v) hardware and architecture co-design [gholami2018squeezenext]; and (vi) inference quantization [zhang2018lq, dong2019hawq].
Here we solely focus on quantization [courbariaux2015binaryconnect, rastegari2016xnor, li2016ternary, zhou2016dorefa, choi2018pact, Jacob_2018_CVPR, zhang2018lq, dong2019hawq]. One of the challenges here is that ultra low precision quantization can lead to significant accuracy degradation. Mixed precision quantization [wu2018mixed, zhou2018adaptive, wang2019haq] and multi-stage quantization [zhou2017incremental] have been proposed to solve or alleviate this problem. However, the challenge with mixed-precision quantization is that the search space is exponentially large. For instance, if we have three precision options for a specific layer (2, 4 or 8-bits), then the total search space of each fine-tuned BERT model [devlin2018bert], which has 12 encoder layers, becomes $3^{12} \approx 5.3 \times 10^5$ different precision settings. Recently, [dong2019hawq] proposed a second-order sensitivity based method to address this issue and achieved state-of-the-art results on computer vision tasks. Part of our paper builds upon this prior work and extends the results to include other variations of second order information instead of just the mean value of the Hessian spectrum.
Compressed NLP model
Notable examples of NLP compression work are LSTM and GRU-based models for machine translation and language modeling [xu2018alternating, wang2018hitnet]. Since the recent introduction of Transformer models, we have observed a significant increase in NLP model size. This is due to the incorporation of very large fully connected layers and attention matrices in Transformers [vaswani2017attention, devlin2018bert, yang2019xlnet, liu2019roberta, radford2019language]. Model compression is crucial for deploying these models in resource constrained environments. Pilot works addressing this are [michel2019sixteen, bhandare2019efficient]. From a different angle, [tay2019lightweight, ma2019tensorized] have probed architectural changes to the self-attention layer to make the Transformer lightweight. There have also been attempts to use distillation to reduce large pre-trained Transformer models such as BERT [devlin2018bert] in [tang2019distilling, sun2019patient]. However, significant accuracy loss is observed even for relatively small compression ratio of . Here we show that this compression ratio could be increased up to , including reduction of embedding layer, with much smaller performance degradation.
In this section, we introduce our proposed BERT quantization methods, including the mixed precision quantization based on Hessian information, as well as techniques used for the group-wise quantizing scheme.
As in [devlin2018bert], a fine-tuned BERTBASE model consists of three parts: embedding; Transformer based encoder layers; and output layer. Specifically, assuming $x \in X$ is the input word (sentence) and $y \in Y$ is the corresponding label, we have the loss function $L$ defined as:
$$L(\theta) = \sum_{(x_i, y_i)} \mathrm{CE}\big(\mathrm{softmax}(W_c(W_n(\cdots(W_1(W_e(x_i)))))),\; y_i\big),$$
where CE is the cross entropy function (or other appropriate loss functions), and $\theta$ is a combination of $W_e$, $W_1, W_2, \ldots, W_n$, and $W_c$. Here, $W_e$ is the embedding table, $W_1, W_2, \ldots, W_n$ are the encoder layers, and $W_c$ is the output/classifier layer. (Here, we use $W_*$ for both a function and its corresponding parameters without confusion.)
The size of parameters in BERTBASE model is 91MB for embedding, 325MB for encoder and 0.01MB for output. We do not quantize the output layer due to its negligible size, and focus on quantizing both the embedding and encoder layers. As will be discussed in Sec. 5.1, we find that the embedding layer is more sensitive to quantization than the encoder layers. As a result, we quantize embedding and encoder parameters in different ways. The quantization schemes we used are explained in detail in the following sections.
3.1 Quantization process
General NN inference is performed in floating point precision for both weights and activations. Quantization restricts the network weights to a finite set of values defined as follows:
$$Q(z) = q_j, \quad \text{for } z \in (t_j, t_{j+1}],$$
where $Q$ is the quantization operator, $z$ is a real valued input tensor (activation or a weight), and $(t_j, t_{j+1}]$ denotes an interval in the real numbers ($j = 0, \ldots, 2^k - 1$). Here $k$ is the quantization precision for a specific layer.
There are multiple choices for the quantization function $Q$. Here we use a uniform quantization function, where the range of floating point values in a tensor is equally split [zhou2016dorefa, hubara2017quantized] and then represented by unsigned integers in $\{0, \ldots, 2^k - 1\}$. It should be noted that a non-uniform quantizer can potentially further increase the accuracy. However, we solely focus on uniform quantization since it allows a more efficient and easier hardware implementation. To backpropagate gradients through $Q$, which is non-differentiable, we use the Straight-Through Estimator (STE) [bengio2013estimating]. See Appendix A for more details about the forward and backward propagation during the entire quantization process.
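As a minimal sketch of the uniform quantizer described above (not the authors' implementation), the following pure-Python code maps the values of a tensor to unsigned integers in the range 0 to 2^k - 1 and back. The function names, the use of the tensor's min/max as the quantization range, and the list-based "tensor" are illustrative assumptions. The straight-through estimator is shown only as its defining rule: the backward pass treats the quantizer as the identity.

```python
def uniform_quantize(values, k):
    """Quantize a list of floats to k bits on a uniform grid over [min, max].

    Returns (indices, dequantized); indices are unsigned integers in [0, 2**k - 1].
    """
    lo, hi = min(values), max(values)
    levels = 2 ** k - 1
    # Guard against a constant tensor, where the range would be zero.
    step = (hi - lo) / levels if hi > lo else 1.0
    indices = [int(round((v - lo) / step)) for v in values]
    dequantized = [lo + i * step for i in indices]
    return indices, dequantized


def ste_backward(grad_output):
    """Straight-through estimator: pass the incoming gradient through unchanged."""
    return list(grad_output)


weights = [-0.9, -0.2, 0.05, 0.4, 1.1]
idx, w_q = uniform_quantize(weights, k=2)
print(idx)
print(w_q)
```

With k bits the rounding error of each element is at most half a grid step, which is why the error grows quickly as k drops to 2 or 3 bits.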
3.2 Mixed precision quantization
Different encoder layers are attending to different structures [clark2019does], and it is expected that they exhibit different sensitivity. Thus, assigning the same number of bits to all the layers is sub-optimal. This scenario is more critical if the targeted model size is very small, which requires ultra low precision such as 4-bits or 2-bits. As a result we explore mixed-precision quantization, where we assign more bits to more sensitive layers in order to retain performance.
In [dong2019hawq], Hessian AWare Quantization (HAWQ) is developed for mixed-bits assignments. The main idea is that the parameters in NN layers with higher Hessian spectrum (i.e., larger top eigenvalues) are more sensitive to quantization and require higher precision, as compared to layers with small Hessian spectrum (i.e., smaller top eigenvalues). However, there exist 7M parameters for each encoder layer in BERTBASE. Given that the Hessian of each layer is a matrix of size $7\text{M} \times 7\text{M}$, there is a common misconception that computing second order statistics is infeasible. However, the Hessian spectrum can be computed by a matrix-free power iteration method [yao2018Hessian], which does not require explicit formation of the operator. To illustrate this, we take the first encoder layer as an example. Denoting the gradient of the first encoder layer as $g_1$, for a random vector $v$ with the same dimension as $g_1$, we have
$$\frac{\partial g_1^T v}{\partial W_1} = \frac{\partial g_1^T}{\partial W_1} v + g_1^T \frac{\partial v}{\partial W_1} = \frac{\partial g_1^T}{\partial W_1} v = H_1 v,$$
where $H_1$ is the Hessian matrix of the first encoder. Here the second equality comes from the fact that $v$ is independent of $W_1$. The top eigenvalue can then be computed by power iteration, as shown in Alg. 1 in the Appendix. We denote by $\lambda_i$ the top eigenvalue of the $i$-th encoder layer. Using this approach, we show in Fig. 10 the distribution of the top Hessian eigenvalue for different layers of BERTBASE. Different layers exhibit different magnitudes of eigenvalues even though all layers have exactly the same structure and size.
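The matrix-free idea can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's Alg. 1: the Hessian-vector product is approximated here by a finite difference of gradients rather than by differentiating the gradient-vector inner product, and `grad_fn`, `toy_grad`, and the diagonal toy Hessian are hypothetical stand-ins for a real model's gradient.

```python
import math
import random


def hessian_vector_product(grad_fn, w, v, eps=1e-4):
    """Approximate H v by a finite difference of gradients: (g(w + eps v) - g(w)) / eps."""
    g0 = grad_fn(w)
    g1 = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    return [(a - b) / eps for a, b in zip(g1, g0)]


def top_eigenvalue(grad_fn, w, iters=100, seed=0):
    """Power iteration on the Hessian using only Hessian-vector products."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eig = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return eig


# Toy loss L(w) = 0.5 * (3 w0^2 + w1^2 + 0.5 w2^2), whose Hessian is diag(3, 1, 0.5).
def toy_grad(w):
    return [3.0 * w[0], 1.0 * w[1], 0.5 * w[2]]


lam = top_eigenvalue(toy_grad, [0.1, -0.2, 0.3])
print(round(lam, 3))
```

The Hessian is never formed: each iteration costs two gradient evaluations, so memory stays linear in the number of parameters. Note that power iteration converges to the eigenvalue of largest magnitude, which may be negative.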
The above Hessian based approach was used in [dong2019hawq], where top eigenvalues are computed and averaged for different training data. More aggressive quantization is performed for layers that have smaller top eigenvalues, which corresponds to a flatter loss landscape as in Fig. LABEL:fig:Hessian-loss-landscape-3. However, we find that assigning bits based only on the average top eigenvalues is infeasible for many NLP tasks. As shown in Fig. 10, the top eigenvalues of the Hessian for some layers exhibit very high variance with respect to different portions of the input dataset. As an example, the variance of the layer for SQuAD stays larger than 61.6 while the mean of that layer is around 1.0, even though each data point corresponds to 10% of the entire dataset (which is 9K samples). To address this, we use the following metric instead of just using the mean value,
$$\Omega_i \triangleq |\mathrm{mean}(\lambda_i)| + \mathrm{std}(\lambda_i), \tag{2}$$
where $\lambda_i$ is the distribution of the top eigenvalues of $H_i$, calculated with 10% of the training dataset. (Without confusion, we use $\lambda_i$ for both a single top eigenvalue and its distribution with respect to 10% of the data.) After $\Omega_i$ is computed for every layer, we sort the values in descending order and use them as a metric to relatively determine the quantization precision. We then perform quantization-aware fine-tuning based on the selected precision setting.
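A small sketch of how such a sensitivity ranking could drive bit assignment is given below. It assumes the metric is the absolute mean plus the standard deviation of the sampled top eigenvalues, uses the population standard deviation, and applies a simple "top `num_high` layers get more bits" rule; the rule, the function names, and the eigenvalue samples are all illustrative assumptions rather than the paper's exact procedure or data.

```python
import statistics


def sensitivity(eigs):
    """Omega = |mean| + std of the top-eigenvalue samples for one layer."""
    return abs(statistics.mean(eigs)) + statistics.pstdev(eigs)


def assign_bits(layer_eigs, num_high, high_bits=3, low_bits=2):
    """Give `high_bits` to the `num_high` most sensitive layers and `low_bits` to the rest."""
    omegas = [sensitivity(e) for e in layer_eigs]
    order = sorted(range(len(omegas)), key=lambda i: omegas[i], reverse=True)
    bits = [low_bits] * len(omegas)
    for i in order[:num_high]:
        bits[i] = high_bits
    return omegas, bits


# Made-up eigenvalue samples for four layers (not taken from the paper).
samples = [
    [1.0, 1.2, 0.8],    # moderate mean, low spread
    [0.5, 6.0, -4.0],   # small mean, high spread
    [3.0, 3.1, 2.9],    # large mean, low spread
    [0.2, 0.3, 0.1],    # small mean, low spread
]
omegas, bits = assign_bits(samples, num_high=2)
print(bits)
```

The second toy layer has a small mean but a large spread, so a mean-only criterion would rank it as insensitive, whereas the combined metric ranks it first.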
An important technical point that we need to emphasize is that our method expects the trained model to have converged to a local minimum before quantization is performed. That is, the practitioners who trained BERT and fine-tuned it for downstream tasks should have chosen the hyper-parameters and number of iterations such that a local minimum has been reached. The necessary optimality conditions are zero gradient and positive curvature (i.e., positive Hessian eigenvalue). In our analysis, we observed that for the three tasks of MNLI, CoNLL-03, and SST-2 the top Hessian eigenvalue is indeed positive (see Fig. 5, and Fig. 25 in Appendix). However, we find that the BERT model fine-tuned for SQuAD has actually not converged to a local minimum, as evident in the Hessian eigenvalues shown in Fig. 10(d), where we observe very large negative eigenvalues. Directly visualizing the loss landscape also shows this very clearly, as in Fig. 13. Because of this, our expectation is that performing quantization on SQuAD would lead to higher performance degradation as compared to other tasks, and this is indeed the case, as will be discussed next.
Figure 17: Illustration of group-wise quantization. The value matrices of all attention heads are concatenated together, which results in a 3-d tensor. The same color denotes the same group with a shared quantization range. As shown in (a), for layer-wise quantization, the entire 3-d tensor will be quantized from a universal quantization range into discrete unsigned integers. A special case of group-wise quantization in (b) is that we treat each dense matrix as a group, and every matrix can have its own quantization range. We show a more general case in (c), where we partition each dense matrix w.r.t. output neuron and bucket consecutive output neurons as a group.
3.3 Group-wise Quantization
Assume that the input sequence has $n$ words and each word has a $d$-dim embedding vector ($d = 768$ for BERTBASE), i.e., $x = (x(1), \ldots, x(n))^T \in \mathbb{R}^{n \times d}$. In the Transformer encoder, each self-attention head has 4 dense matrices, i.e., $W_k, W_q, W_v, W_o \in \mathbb{R}^{\frac{d}{N_h} \times d}$, where $N_h$ is the number of attention heads. Here $W_k$, $W_q$, $W_v$, and $W_o$ stand for the key, query, value and output weight matrices. Each self-attention head computes the weighted sum as
$$\mathrm{Att}(x, x(j)) = W_o \sum_{i=1}^{n} \mathrm{softmax}\left(\frac{x(j)^T W_q^T W_k x(i)}{\sqrt{d}}\right) W_v x(i).$$
Through this reparametrization, the multi-head self-attention (MHSA) will add these features into the final output, that is, we will have $\sum_{i=1}^{N_h} \mathrm{Att}_i(x, x(j))$. Directly quantizing each of the 4 matrices in MHSA as a whole with the same quantization range can significantly degrade the accuracy, since there are more than 2M parameters in total, which corresponds to $4 \times 12 \times 64 = 3072$ output neurons, and the weights corresponding to each neuron may lie in a different range of real numbers. Channel-wise quantization can be used to alleviate this problem in convolutional neural networks, where each convolutional kernel can be treated as a single output channel and have its own quantization range. However, this cannot be directly applied to dense matrices, since each dense matrix itself is a single kernel. Therefore, we propose group-wise quantization for attention-based models. We treat the individual matrix $W$ with respect to each head in one dense matrix of MHSA as a group, so there will be 12 groups. Furthermore, in each group, we bucket sequential output neurons together as sub-groups, e.g., each 6 output neurons as one sub-group, so there are $12 \times \frac{64}{6} = 128$ sub-groups in total (the hidden dimension in each head of BERTBASE is $\frac{768}{12} = 64$). Each sub-group can have its own quantization range. An illustration is shown in Fig. 17 for $W_v$, where we concatenate the $N_h$ value matrices $W_v$ to be a 3-d tensor. For layer-wise quantization, the entire 3-d tensor will be quantized into the same range of discrete numbers, as shown in Fig. 17(a). A special case of group-wise quantization is that we treat each dense matrix as a group, and every matrix can have its own quantization range, as shown in Fig. 17(b). A more general case, in Fig. 17(c), is that we partition each dense matrix with respect to output neuron, and we bucket consecutive output neurons as a group. The effect of finer group-wise quantization is further investigated in Sec. 4.2.
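The effect of giving each group of output neurons its own range can be sketched on a toy matrix. The code below is an illustrative assumption-laden example, not the authors' kernel: rows stand for output neurons, each group uses its own min/max range, and the helper names and the 4x4 matrix are hypothetical.

```python
def quantize_range(values, k):
    """Uniformly quantize floats to k bits over their own [min, max] range."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** k - 1) if hi > lo else 1.0
    return [lo + round((v - lo) / step) * step for v in values]


def groupwise_quantize(matrix, k, rows_per_group):
    """Quantize a matrix (list of rows = output neurons) with one range per group of rows."""
    out = []
    for start in range(0, len(matrix), rows_per_group):
        group = matrix[start:start + rows_per_group]
        flat = [v for row in group for v in row]
        flat_q = quantize_range(flat, k)
        width = len(group[0])
        out.extend(flat_q[i * width:(i + 1) * width] for i in range(len(group)))
    return out


def mse(a, b):
    """Mean squared error between two matrices of the same shape."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


# Toy matrix whose output neurons (rows) live on very different scales.
W = [
    [0.01, -0.02, 0.03, 0.00],
    [0.02, 0.01, -0.01, -0.03],
    [1.50, -2.00, 0.75, 2.50],
    [-1.00, 3.00, -2.50, 0.50],
]
layerwise = groupwise_quantize(W, k=2, rows_per_group=len(W))  # one shared range
grouped = groupwise_quantize(W, k=2, rows_per_group=2)         # two groups
print(mse(W, layerwise), mse(W, grouped))
```

With a single shared range, the small-magnitude rows are forced onto a grid sized for the large rows; with two groups they get a grid matched to their own scale, at the cost of storing one extra range.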
In this section, we describe our experiments on evaluating the proposed Q-BERT on four different NLP tasks. Details of the datasets are shown in Appendix B. To the best of our knowledge, there is no published work done on BERT quantization at this point, so we report Direct quantization (DirectQ), i.e., quantization without mixed-precision and group-wise quantization as a baseline.
4.1 Main Results
We present results of Q-BERT on the development set of the four tasks of SST-2, MNLI, CoNLL-03, and SQuAD, as summarized in Tab. 1. As one can see, Q-BERT performs significantly better compared to the DirectQ method across all four tasks in each bit setting. The gap becomes more obvious for ultra low bit setting. As an example, in 4-bits setting, Direct quantization (DirectQ) of SQuAD results in 11.5% performance degradation as compared to BERTBASE. However, for the same 4-bits setting, Q-BERT only exhibits 0.5% performance degradation. Moreover, under 3-bits setting, the gap between Q-BERT and DirectQ increases even further to 9.68-27.83% for various tasks.
In order to push the precision setting further to lower bits, we investigate the mixed-precision Q-BERT (Q-BERTMP). As can be seen, Q-BERT with the uniform 2-bits setting has very poor performance across all four tasks, though the memory is reduced by 20% against the 3-bits setting. The reason behind this is that not all the layers have the same sensitivity to quantization, as evident from loss landscape visualizations; see Fig. LABEL:fig:Hessian-loss-landscape-3 (and Fig. LABEL:fig:Hessian-loss-landscape-2 in Appendix). Intuitively, for more sensitive layers, higher bit precision needs to be set, while for layers that are less sensitive, the 2-bits setting is already sufficient. To set mixed precision for each encoder layer of BERTBASE, we measure the sensitivity based on Eq. 2, which captures both the mean and the variance of the top eigenvalue of the Hessian shown in Fig. 10. Note that all experiments in Fig. 10 are based on 10 runs and each run uses 10% of the entire training dataset. We can observe that for most of the lower encoder layers (layer 1-8), the variance is quite large compared to the last three layers. We generally observe that the middle part (layer 4-8) has the largest mean top eigenvalue. Beyond the relatively smaller mean, the last three layers also have much smaller variance, which indicates the insensitivity of these layers. Therefore, higher bits will only be assigned to the middle layers according to Eq. 2 for Q-BERT 2/3 MP (the exact bits setting is included in Appendix C.1). In this way, with only an additional 5MB of memory storage, 2/3-bits Q-BERTMP is able to keep the performance drop within 2.3% for MNLI and SQuAD and within 1.1% for SST-2 and CoNLL-03, with up to compression ratio in weights. Note that this is up to 6.8% better than Q-BERT with uniform 2 bits.
One consideration for quantization is that 3-bit quantized execution is typically not supported in hardware. It is however possible to load 3-bit quantized values and cast them to higher bit precision such as 4 or 8 bits in the execution units. This would still have the benefit of reduced memory volume to/from DRAM. It is also possible to avoid using 3 bits and instead use a mixture of 2 and 4 bits as shown in Tab. 1. For example, SST-2 Q-BERTMP with mixed 2/4-bit precision weights has the same model size as the 3 bit quantization in 53.2MB and achieves similar accuracy. We observe a similar trend for other tasks as well.
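The "store in 3 bits, compute in a wider type" idea can be sketched as a pack/unpack pair. This is a hypothetical storage layout chosen for illustration (least-significant bits first, little-endian byte order), not a format described in the paper or used by any particular hardware.

```python
def pack_3bit(values):
    """Pack integers in [0, 7] into bytes, 3 bits per value, least-significant bits first."""
    bitbuf, nbits, out = 0, 0, bytearray()
    for v in values:
        if not 0 <= v <= 7:
            raise ValueError("3-bit values must lie in [0, 7]")
        bitbuf |= v << nbits
        nbits += 3
        while nbits >= 8:
            out.append(bitbuf & 0xFF)
            bitbuf >>= 8
            nbits -= 8
    if nbits:
        out.append(bitbuf & 0xFF)
    return bytes(out)


def unpack_3bit(packed, count):
    """Recover `count` 3-bit values as ordinary Python ints (i.e. widened for compute)."""
    bitbuf = int.from_bytes(packed, "little")
    return [(bitbuf >> (3 * i)) & 0b111 for i in range(count)]


codes = [5, 0, 7, 3, 1, 6, 2, 4]
packed = pack_3bit(codes)
print(len(packed), unpack_3bit(packed, len(codes)))
```

Eight 3-bit codes occupy 24 bits, so memory traffic is that of 3 bytes even though the values are widened (for example to 4 or 8 bits) before arithmetic.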
One important observation is that we found SQuAD to be harder to quantize than the other tasks; see Tab. 1(d). For example, 2-bits DirectQ results in more than 10% F1 score degradation. Even Q-BERT has a larger performance drop as compared to the other tasks in Tab. 1. We studied this phenomenon further through Hessian analysis. In Fig. 10, among all the tasks, it can be clearly seen that SQuAD not only has a much larger eigenvalue variance, but also has very large negative eigenvalues. In fact this shows that the existing BERT model for SQuAD has not reached a local minimum. This is further illustrated in the 3-d loss landscape of all four tasks in Fig. LABEL:fig:Hessian-loss-landscape-3 and Fig. 13 (and Fig. LABEL:fig:Hessian-loss-landscape-2 in Appendix). It can be clearly seen that for the other three tasks, the stopping point is at a quadratic bowl (at least in the first two dominant eigenvalue directions of the Hessian). However, compared to the others, SQuAD has a totally different structure to its loss landscape. As shown in Fig. 13, the stopping points of different layers on SQuAD have negative curvature directions, which means they have not yet converged to a local minimum. This could well explain why the quantization of SQuAD results in a larger accuracy drop. Our initial attempts to address this by changing training hyper-parameters were not successful. We found that the BERT model quickly overfits the training data. However, we emphasize that fixing BERT model training itself is outside the scope of this paper and not possible with academic computational resources.
4.2 Effects of group-wise quantization
We measure the performance gains with different group numbers in Tab. 2. We can observe from the table that performing layer-wise quantization (shown in Fig. 17(a)) is sub-optimal for all four tasks (the performance drop is around 7% to 11.5%). However, the performance significantly increases as we increase the number of groups. For example, for 12 groups, the performance degradation is less than 2% for all the tasks. Further increasing the group number from 12 to 128 increases the accuracy by at least 0.3%. However, increasing the group number further from 128 to 768 improves the performance by no more than 0.1%. This shows that the performance gain almost saturates around 128 groups. It is also preferable not to have a very large number of groups, since this increases the number of Look-up Tables (LUTs) necessary for each matrix multiplication. This can adversely affect hardware performance, and based on our results there are diminishing returns in terms of accuracy. In all our experiments, we used 128 groups for both Q-BERT and Q-BERTMP in Sec. 4.1.
|768 (here we treat each output neuron as a single group)||92.78||84.00/84.20||94.99|
In this Section, we further investigate the quantization effects on different modules, e.g., different embedding layers (e.g., word and position embeddings), and we perform qualitative analysis using attention distribution. This illustrates that Q-BERT better captures the behaviour of the original model as compared to DirectQ in all cases.
5.1 Quantization effects on different modules
Here we investigate the quantization effects with respect to different modules of BERT model (multi-head self-attention versus feed-forward network, and different embedding layers, i.e., word embedding versus position embedding).
Generally speaking, we find that embedding layer is more sensitive than weights for quantization. This is illustrated in Tab. (a)a, where we use 4-bits layerwise quantization for embedding, which results in an unacceptable performance drop up to 10% for SST-2, MNLI, CoNLL-03 and even more than 20% for SQuAD. This is despite the fact that we used 8/8-bits for weights/activations. On the contrary, encoder layers consume around 79% total parameters ( embedding parameter size), while quantizing them to 4-bits in Tab. 1 leads to less performance loss.
Furthermore, we find that position embedding is very sensitive to quantization. For instance, quantizing position embedding to 4 bits results in generally 2% additional performance degradation than quantizing word embedding, even though the position embedding only accounts for less than 5% of the entire embedding. This indicates the importance of positional information in Natural Language Understanding tasks. Given position embedding only accounts for a small portion of model size, we can do mixed-precision quantization for embedding to further push down the model size boundary with a tolerable accuracy drop, as shown in Appendix C.2.
To study the quantization effects on self-attention layers and fully-connected networks, we conducted extensive experiments under different bits settings for the encoder layers. The results are shown in Tab. (b)b. Specifically, we adopt the Q-BERTMP setting in Tab. 1, with a mixture of 2 and 3 bits for encoder weights. To test the robustness of the two modules inside each encoder layer, we further reduce one more bit in the corresponding modules and denote the resulting precision setting 1/2MP. From Tab. (b)b, we can conclude that generally self-attention layer is more robust to quantization than the fully-connected network, since 1/2MP self-attention results in about 5% performance drop while 1/2MP fully-connected will worsen this to 11%.
5.2 Qualitative Analysis
We use attention information to conduct qualitative analysis to analyze the difference between Q-BERT and DirectQ.
To do so, we compute the Kullback–Leibler (KL) divergence between the attention distribution for the same input from the coordinated head of both quantized BERT and full-precision BERT. It should be noted that we compute the average distance out of 10% of the entire training dataset. The smaller KL divergence here means that the output of the multi-head attention of the two models is closer to each other. We illustrate this distance score for each individual head in Fig. 22 for SST-2, MNLI, CoNLL-03 and SQuAD. We compared Q-BERT and DirectQ with 4-bits weights, 8-bits embedding and 8-bits activation. Each scatter point in Fig. 22 denotes the distance w.r.t. one head, and the line chart shows the average results over the 12 heads in one layer. We can clearly see that Q-BERT always incurs a smaller distance to the original baseline model as compared to DirectQ model, for all the different layers.
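The per-head distance can be sketched as follows. The direction of the divergence (full-precision distribution first), the averaging over query positions, and the small smoothing constant are assumptions made for illustration; the attention rows are made-up numbers, not results from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_head_kl(attn_full, attn_quant):
    """Average KL over all query positions of one head.

    Each argument is a list of rows; row j is the attention distribution of query j.
    """
    rows = [kl_divergence(p, q) for p, q in zip(attn_full, attn_quant)]
    return sum(rows) / len(rows)


# Made-up attention rows for one head over a 3-token input.
full = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
close = [[0.68, 0.22, 0.10], [0.12, 0.78, 0.10], [0.30, 0.32, 0.38]]
far = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.5, 0.1, 0.4]]
print(mean_head_kl(full, close), mean_head_kl(full, far))
```

A quantized head whose attention rows stay near the full-precision rows yields a score near zero, which is the sense in which a smaller distance indicates that the quantized model better preserves the original attention behaviour.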
In this work, we perform an extensive analysis of fine-tuned BERT and propose Q-BERT, an effective scheme for quantizing BERT. In order to reduce aggressively the model size by mixed-precision quantization, we proposed a new layer-wise Hessian based method which captures both the average and the variance of the eigenvalues. Moreover, a new group-wise quantization is proposed to perform fine-grained quantization inside each encoder layer. In four downstream tasks, equipped with the aforementioned methods, Q-BERT achieves compression ratio in weights, smaller activation size, and smaller embedding size, with at most 2.3% accuracy loss. To understand better how different factors will affect the trade-off between performance and the model compression ratio in Q-BERT, we conduct controlled experiments to investigate the effect of different quantization schemes and quantizing different modules in BERT, respectively.
We would like to thank Prof. Joseph Gonzalez, Prof. Dan Klein, and Prof. David Patterson for their valuable feedback. This work was supported by a gracious fund from Intel corporation, Berkeley Deep Drive (BDD), and Berkeley AI Research (BAIR) sponsors. We would like to thank the Intel VLAB team for providing us with access to their computing cluster. We also thank gracious support from Google for providing cloud compute. MWM would also like to acknowledge ARO, DARPA, NSF, ONR, and Intel for providing partial support of this work.
Appendix A Detailed quantization process
In the forward pass, each element in the input $X$ will be quantized as follows:
$$X' = \mathrm{Clamp}(X, q_0, q_{2^k-1}), \qquad X^I = \left\lfloor \frac{X' - q_0}{\Delta} \right\rceil, \quad \text{where } \Delta = \frac{q_{2^k-1} - q_0}{2^k - 1}, \qquad Q(X) = \Delta X^I + q_0,$$
where $\lfloor \cdot \rceil$ is the round operator, $\Delta$ is the distance between adjacent quantized points, $X^I$ is a set of integer indices and is the index for the bias. We drop for clarity in the following equations. In inference, the expensive floating point tensor arithmetic can be replaced by efficient integer arithmetic for the matrix multiplication with $X^I$, followed by a gathered dequantization operation, which will reduce the computation time by orders of magnitude. Since we use the quantization-aware fine-tuning scheme, in the backward pass, the Straight-Through Estimator (STE) [bengio2013estimating] is used for computing the gradient for $X^I$.
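To make the "integer arithmetic followed by dequantization" step concrete, here is a minimal sketch for a single dot product. It assumes both operands are uniformly quantized with their own step and offset (named `step` and `q0` here for illustration), expands the product algebraically so that all per-element work is on integer indices, and is not meant to represent the authors' actual kernel.

```python
def to_indices(values, k):
    """Uniform quantization over [min, max]: returns (integer indices, step, offset)."""
    q0, q_max = min(values), max(values)
    step = (q_max - q0) / (2 ** k - 1) if q_max > q0 else 1.0
    return [round((v - q0) / step) for v in values], step, q0


def int_dot_dequant(w_idx, w_step, w_q0, x_idx, x_step, x_q0):
    """Dot product of two quantized vectors using integer sums, then one dequantization."""
    n = len(w_idx)
    s_wx = sum(a * b for a, b in zip(w_idx, x_idx))  # integer multiply-accumulate
    s_w = sum(w_idx)
    s_x = sum(x_idx)
    # Expand sum((w_q0 + i * w_step) * (x_q0 + j * x_step)) into four terms.
    return (w_step * x_step * s_wx + w_step * x_q0 * s_w
            + w_q0 * x_step * s_x + n * w_q0 * x_q0)


w = [0.31, -0.12, 0.05, 0.44]
x = [1.0, 0.5, -0.25, 2.0]
w_idx, w_step, w_q0 = to_indices(w, k=4)
x_idx, x_step, x_q0 = to_indices(x, k=8)
fast = int_dot_dequant(w_idx, w_step, w_q0, x_idx, x_step, x_q0)
# Reference: dequantize every element first, then multiply in floating point.
ref = sum((w_q0 + i * w_step) * (x_q0 + j * x_step) for i, j in zip(w_idx, x_idx))
print(fast, ref)
```

Only the three integer sums scale with the vector length; the floating point work is a fixed handful of multiplications per output, which is where the speedup on integer hardware would come from.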
Appendix B Dataset
We apply Q-BERT on Sentiment Classification, Natural Language Inference, Named Entity Recognition and Machine Reading Comprehension tasks. For Sentiment Classification, we evaluate on Stanford Sentiment Treebank (SST-2) [socher2013recursive]. For Named Entity Recognition, we use CoNLL-2003 English benchmark dataset for NER (CoNLL-03) [sang2003introduction]. For Natural Language Inference, we test on Multi-Genre Natural Language Inference (MNLI) [williams2017broad]. For Machine Reading Comprehension, we evaluate on the Stanford Question Answering Dataset (SQuAD) [rajpurkar2016SQuAD].
More specifically, SST-2 is a movie review dataset with binary annotations, where the binary label indicates positive and negative reviews. MNLI is a multi-genre NLI task for predicting whether a given premise-hypothesis pair is entailment, contradiction or neutral. Its test and development datasets are further divided into in-domain (MNLI-m) and cross-domain (MNLI-mm) splits to evaluate the generality of tested models. CoNLL-03 is a newswire article dataset for predicting the exact span of the annotated four entity types: person, location, organization, and miscellaneous. SQuAD is a task to answer the question by extracting the relevant span from the context, where a paragraph of context and a question are provided for each sample.
Appendix C Extra results
Here we describe several additional results.
C.1 Ablation Study of Hessian based Mixed Precision Assignment
To demonstrate the robustness of our Hessian based Mixed Precision method, we conduct an ablation study using the reversed version of 2/3-bit Q-BERTMP (Q-BERTMP-rev). Specifically, we assign lower bits to relatively sensitive layers and higher bits to less sensitive ones, which means that a layer assigned 2-bit in 2/3-bit Q-BERTMP will be assigned 3-bit. (The bits settings of 2/3-bit Q-BERTMP and 2/4-bit Q-BERTMP are included in Tab. 6 and Tab. 7, respectively.)
We can observe that even though the model size of Q-BERTMP-rev is larger than or similar to that of Q-BERTMP, the performance difference between Q-BERTMP-rev and 2-bit Q-BERT is within 2% for MNLI, CoNLL-03, SQuAD and 4% for SST-2, while that of Q-BERTMP is beyond 5% for MNLI, CoNLL-03, SQuAD and 8% for SST-2. This large discrepancy in performance illustrates the superiority of leveraging second order Hessian information in mixed-precision bits assignment.
C.2 Mixed Precision Quantization for Embedding
As can be seen from Tab. 1, when 2/3 MP is used for quantizing the weight parameters, the bottleneck of the model size is bounded by the embedding table size. Also, observed in Tab. (a)a, we noticed that word embedding is less sensitive. Therefore, in this section, we further push the embedding table to be 4-bit (word embedding) and 8-bit (position embedding) mixed-precision to reduce the entire model size. Similar to group-wise quantization for weights, in this ultra-low embedding bits setting, we bucket the 768 output neurons in BERTBASE word and position embedding layer into 128 groups in Tab. 5. We adopt the same setting for weights and activations in Tab. 1, where we employ 128 groups for weights and set 8/8 bits for weight/activation. Note that with around 0.5% performance drop, the embedding table size can be reduced to 11.6MB, which corresponds to around compression ratio in embedding table and compression ratio in total model size.
C.3 Detailed loss landscape for SST-2
We include the detailed loss landscape analysis for the remaining task SST-2 as shown in Fig. 25.
|Layer(s)||Layer Type||Parameter Size(M)||Weight bit (SST-2)||Weight bit (MNLI)||Weight bit (CoNLL-03)||Weight bit (SQuAD)|