1 Introduction
Pretrained transformer-based models [13] have recently achieved state-of-the-art performance on a variety of natural language processing (NLP) tasks, such as sequence tagging and sentence classification. Among them, BERT models [3], built on the transformer architecture [13], have drawn even more attention because of their strong performance and generality. However, the memory and computing consumption of these models is prohibitive: even the relatively small versions of BERT (e.g., BERT-base) contain more than 100 million parameters. This over-parameterized
characteristic makes it challenging to deploy BERT models on devices with constrained resources, such as smartphones and robots. Therefore, compressing these models is an important demand in industry.
One popular and efficient method for model compression is quantization. To reduce model size, quantization represents the parameters of the model with fewer bits instead of the original 32 bits. With proper hardware, quantization can significantly reduce the memory footprint while accelerating inference. Many works have focused on quantizing models in the computer vision area
[8, 18, 17, 5, 4, 15], while much less has been done in NLP [12, 9, 1, 2, 10]. Pilot works on transformer quantization include [1, 2, 10], which successfully quantized transformer models to 8 or 4 bits while maintaining comparable performance. Moreover, to the best of our knowledge, there are only two published works focusing on BERT quantization [16, 11]. [16] applied 8-bit fixed-precision linear quantization to BERT models and achieved a compression ratio of 4 with little accuracy drop. [11] improved quantization performance with group-wise mixed-precision linear quantization based on the Hessian matrix of the parameter tensors.
However, for the underlying quantization scheme, most of the above transformer quantization works, and especially the BERT quantization works, use linear clustering, which is a rudimentary clustering method. Although it is fast and easy to apply, the quantized results cannot represent the original data distribution well. As a result, [16] only manages to quantize BERT to 8 bits. Although the other BERT quantization work [11] achieves much higher compression ratios without upgrading the quantization scheme, the group-wise method it develops is rather time-consuming and increases latency significantly. Although it is generally believed that replacing linear clustering with a better clustering method can improve the performance of quantized models, the effect of upgrading the quantization scheme is rather underestimated. Therefore, in this paper, we explore the effect of simply upgrading the quantization scheme from linear clustering to k-means clustering and compare the performance of the two schemes. Furthermore, to see the effect on other pretrained language models, we also compare the two quantization schemes on ALBERT models [7], an improved version of BERT.
In summary, we apply k-means and linear quantization to BERT and ALBERT and test their performance on the GLUE benchmark. Through this, we verify that simply upgrading the quantization scheme can yield large performance gains and that simple k-means clustering has great potential as a BERT quantization scheme. Moreover, we show that the number of k-means iterations plays an important role in k-means quantization. Through further comparison, we discover that ALBERT is less robust than BERT in terms of quantization, as parameter sharing has reduced the redundancy of the parameters.
2 Background: BERT and ALBERT
In this section, we briefly introduce the architectures of BERT and ALBERT models and specify the versions of the models used in our experiments.
2.1 BERT
BERT models [3] are a special kind of pretrained transformer-based network.
They mainly consist of embedding layers, encoder blocks, and output layers.
There is no decoder block in BERT models. Each encoder block contains one self-attention layer (which includes three parallel linear layers corresponding to query, key, and value) and three feed-forward layers (each including one linear layer).
For each self-attention layer, BERT utilizes the multi-head technique to further improve its performance.
For the $i$-th self-attention head, there are three weight matrices $W_i^Q$, $W_i^K$, and $W_i^V \in \mathbb{R}^{d \times d_h}$, where $d_h = d / N_h$ ($N_h$ is the number of heads in each self-attention layer and $d$ is the hidden size). Let $X$ denote the input of the corresponding self-attention layer. The output of the $i$-th self-attention head is then calculated as:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(X W_i^Q)(X W_i^K)^{\top}}{\sqrt{d_h}}\right) X W_i^V. \quad (1)$$
Then, for each self-attention layer, the outputs of all its self-attention heads are concatenated sequentially to generate the output of the corresponding layer.
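To make Eq. (1) concrete, the following PyTorch sketch shows how one self-attention head and the subsequent concatenation could be computed; the function names (`attention_head`, `multi_head_self_attention`) and tensor shapes are illustrative assumptions rather than BERT's actual implementation.

```python
import torch
import torch.nn.functional as F

def attention_head(X, W_q, W_k, W_v):
    """One self-attention head following Eq. (1).

    X:            (seq_len, d_model) input of the self-attention layer
    W_q/W_k/W_v:  (d_model, d_head) projections, d_head = d_model / num_heads
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                       # (seq_len, d_head) each
    scores = Q @ K.transpose(-2, -1) / (K.shape[-1] ** 0.5)   # scaled dot-product scores
    return F.softmax(scores, dim=-1) @ V                       # (seq_len, d_head)

def multi_head_self_attention(X, head_weights):
    """Concatenate the outputs of all heads along the feature dimension."""
    return torch.cat([attention_head(X, *w) for w in head_weights], dim=-1)
```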
Specifically, in our work, we use the bert-base-uncased version of BERT models, which has 12 encoder blocks and 12 heads in each self-attention layer, to carry out the following experiments.
2.2 ALBERT
Compared to BERT, ALBERT introduces three main improvements. First, ALBERT models decompose the embedding parameters into the product of two smaller matrices.
Second, they adopt cross-layer parameter sharing to improve parameter efficiency. These two improvements significantly reduce the total number of parameters and make the model more efficient. Moreover, parameter sharing can also stabilize the network parameters.
Third, they replace the next-sentence prediction (NSP) loss with a sentence-order prediction (SOP) loss during pretraining. This makes the models focus on modeling inter-sentence
coherence instead of topic prediction and improves performance on multi-sentence encoding tasks.
Specifically, in this paper, we use the albert-base-v2 version of ALBERT models, which also has 12 encoder blocks (with all parameters shared across layers) and 12 heads in each self-attention layer.
3 Methodology
In this section, we first introduce the quantization process used in our experiments (Section 3.1), then explain the two quantization schemes in detail (Sections 3.2 and 3.3).
3.1 Overview
To compare linear and k-means quantization schemes on pretrained transformer-based models, we test the performance of quantized models on different downstream tasks. Specifically, for each chosen task, the following experiments are carried out sequentially: fine-tuning the pretrained models (BERT and ALBERT) on the downstream task; quantizing the task-specific model; fine-tuning the quantized model. Then the performance of the resulting model is tested on the validation set of each chosen task.
To avoid the effect of other tricks, we simply apply the two quantization schemes (linear and k-means) following a fixed-precision quantization strategy without any additional tricks. We quantize all the weights of the embedding layers and the fully connected layers (except the classification layer). After quantization, each weight vector is represented by a corresponding cluster index vector and a centroid value vector, and each parameter of the weight vector is replaced with the centroid of the cluster to which it belongs.
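As an illustration of this representation, the NumPy sketch below (the function names are our own, not part of the paper's code) converts a weight vector and a given cluster assignment into the cluster index vector plus centroid value vector described above, and reconstructs the quantized weights from them.

```python
import numpy as np

def to_index_centroid_form(weights, assignments, num_clusters):
    """Represent a weight vector as (cluster index vector, centroid value vector).

    `assignments[i]` is the cluster of `weights[i]`; each centroid is the mean
    of the parameters assigned to that cluster (empty clusters default to 0).
    """
    centroids = np.array(
        [weights[assignments == c].mean() if np.any(assignments == c) else 0.0
         for c in range(num_clusters)], dtype=np.float32)
    # For b-bit quantization there are 2**b clusters, so each index needs only
    # b bits instead of the 32 bits of the original float parameter.
    return assignments.astype(np.uint8), centroids

def reconstruct(indices, centroids):
    """Replace every parameter by the centroid of its cluster."""
    return centroids[indices]
```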
After the model is quantized, we further fine-tune it on the corresponding downstream task while keeping it quantized. For the forward pass, we reconstruct each quantized layer from its cluster index vector and centroid value vector. For the backward pass, we update the remaining parameters normally and update the quantized parameters by training the centroid vectors. More specifically, the gradient of each entry in a centroid vector is calculated as the average of the gradients of the parameters that belong to the corresponding cluster. The centroid value vectors are then updated by the same backpropagation method.
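A minimal sketch of this centroid update (our own illustration, assuming the per-parameter gradients of a quantized layer are already available as `weight_grad`):

```python
import numpy as np

def centroid_gradients(weight_grad, indices, num_clusters):
    """Gradient of each centroid = average of the gradients of the parameters
    assigned to that cluster."""
    grads = np.zeros(num_clusters, dtype=np.float32)
    for c in range(num_clusters):
        mask = (indices == c)
        if mask.any():
            grads[c] = weight_grad[mask].mean()
    return grads

# A plain SGD-style step on the centroid value vector would then be:
#   centroids -= learning_rate * centroid_gradients(weight_grad, indices, len(centroids))
# while the (frozen) cluster index vector stays unchanged.
```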
3.2 Linear Quantization
Suppose that we need to quantize a vector $\mathbf{w}$ to $b$ bits ($b$-bit quantization). We first search for its minimum value $w_{\min}$ and maximum value $w_{\max}$. The range $[w_{\min}, w_{\max}]$ is then divided into $2^{b}$ clusters with width

$$\Delta = \frac{w_{\max} - w_{\min}}{2^{b}}. \quad (2)$$

Define the function $g$ as

$$g(w) = \min\!\left(\left\lfloor \frac{w - w_{\min}}{\Delta} \right\rfloor,\; 2^{b}-1\right), \quad (3)$$

whose value lies between $0$ and $2^{b}-1$, so that each parameter $w_{i}$ belongs to the $g(w_{i})$-th cluster. Each $w_{i}$ is then replaced with the centroid of the $g(w_{i})$-th cluster, i.e., the average of all parameters belonging to it. Therefore, the quantization function is

$$Q(w_{i}) = \frac{\sum_{j} w_{j}\,\mathbb{1}[g(w_{j}) = g(w_{i})]}{\sum_{j} \mathbb{1}[g(w_{j}) = g(w_{i})]}, \quad (4)$$

where $\mathbb{1}[\cdot]$ equals $1$ when the statement is true, and $0$ otherwise.
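A small NumPy sketch of Eqs. (2)-(4) (our own illustration; the variable names are not from the paper):

```python
import numpy as np

def linear_quantize(w, num_bits):
    """Linear quantization following Eqs. (2)-(4): uniform bins over [min, max],
    each parameter replaced by the mean of its bin."""
    w = np.asarray(w, dtype=np.float32)
    n_clusters = 2 ** num_bits
    w_min, w_max = float(w.min()), float(w.max())
    width = max((w_max - w_min) / n_clusters, 1e-12)   # Eq. (2); guard against zero width
    indices = np.clip((w - w_min) // width, 0, n_clusters - 1).astype(np.int64)  # Eq. (3)
    centroids = np.zeros(n_clusters, dtype=np.float32)
    for c in range(n_clusters):
        members = w[indices == c]
        if members.size:
            centroids[c] = members.mean()               # Eq. (4): bin average
    return indices, centroids

# Example: quantize a random vector to 3 bits (8 clusters).
# idx, cent = linear_quantize(np.random.randn(1000), num_bits=3)
# quantized_w = cent[idx]
```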
3.3 K-Means Quantization
Suppose that we need to quantize a vector $\mathbf{w}$ to $b$ bits ($b$-bit quantization). For k-means quantization, we leverage k-means clustering with k-means++ initialization to partition the vector into $2^{b}$ clusters.
We first use the k-means++ initialization method to initialize the centroid $c_j$ of each cluster $j$ ($j = 1, \dots, 2^{b}$). Then, each parameter of $\mathbf{w}$ is assigned to its nearest cluster. After all the parameters in $\mathbf{w}$ have been assigned, each centroid is updated as the average of all the parameters belonging to its cluster. The reassignment and centroid-update steps are repeated until convergence or until the maximum number of iterations is reached. The k-means++ initialization proceeds as follows: first, choose a random parameter from the vector as the first centroid; then, assign each remaining parameter a probability of becoming the next centroid according to its distance from the nearest existing centroid, and choose the next centroid based on these probabilities; finally, repeat the probability assignment and centroid selection until all the centroids are generated. To limit the efficiency drop caused by upgrading the quantization scheme, we set the maximum number of k-means iterations to only 3. After k-means clustering is finished, we use the resulting label vector as the cluster index vector and the resulting centroids as the corresponding centroid value vector. Each parameter is then replaced by the centroid of the cluster to which it belongs.
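The procedure above can be sketched as follows (our own NumPy illustration, not the paper's implementation; note that standard k-means++ samples each new centroid with probability proportional to the squared distance to the nearest existing centroid):

```python
import numpy as np

def kmeans_quantize(w, num_bits, max_iter=3, seed=0):
    """K-means quantization with k-means++ initialization; `max_iter` is capped
    at 3 in the paper to limit the extra clustering cost."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=np.float32)
    k = 2 ** num_bits

    # k-means++ initialization.
    centroids = [w[rng.integers(len(w))]]
    for _ in range(k - 1):
        d2 = np.min((w[:, None] - np.asarray(centroids)[None, :]) ** 2,
                    axis=1).astype(np.float64)
        probs = d2 / d2.sum() if d2.sum() > 0 else None   # None -> uniform sampling
        centroids.append(w[rng.choice(len(w), p=probs)])
    centroids = np.array(centroids, dtype=np.float32)

    # Lloyd iterations: reassign parameters, then recompute centroids.
    for _ in range(max_iter):
        indices = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            members = w[indices == c]
            if members.size:
                centroids[c] = members.mean()
    indices = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return indices.astype(np.int64), centroids
```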
4 Experiments
In this section, we first introduce the dataset used in our experiments (Section 4.1), then explain the experimental details of our experiments on BERT and ALBERT (Section 4.2), and finally present the results and the corresponding discussion (Section 4.3).
4.1 Dataset
We test the performance of our quantized models on the General Language Understanding Evaluation (GLUE) benchmark [14], which contains natural language understanding tasks including question answering, sentiment analysis, and textual entailment. Specifically, we use 8 tasks (QNLI, CoLA, RTE, SST-2, MRPC, STS-B, MNLI, and QQP) to test the performance of the different quantization schemes. The evaluation metrics for each task are as follows: Matthews correlation coefficient (mcc) for CoLA; accuracy (acc) for QNLI, RTE, SST-2, and MNLI; accuracy (acc) and F1 score for MRPC and QQP; Pearson and Spearman correlation coefficients (corr) for STS-B. We follow the default split of the dataset. The datasets are available for download here:
https://gluebenchmark.com/tasks.

Table 1: Results of linear quantization for BERT on the GLUE dev sets.

#bits  QNLI  CoLA  RTE  SST-2  MRPC  STS-B  MNLI-m/mm  QQP  average
32 bits  91.7  59.2  72.2  93.1  86.3/90.4  89.7  85.0/84.8  91.6/88.8  83.7
5 bits  88.5  48.4  69.3  89.6  83.8/88.7  88.7  79.8/80.4  88.9/85.3  79.7
4 bits  81.8  19.9  57.0  81.4  75.7/84.5  84.9  71.4/71.9  80.8/75.9  69.4
3 bits  61.3  11.9  56.3  78.9  70.8/81.9  68.6  59.6/61.6  76.5/71.1  60.6
2 bits  60.7  6.6  55.2  77.9  69.6/81.4  47.4  49.6/50.8  74.2/63.2  54.7
1 bit  59.5  0  54.9  77.5  69.9/81.4  37.8  47.3/48.8  74.3/63.3  52.2
Table 2: Results of k-means quantization for BERT on the GLUE dev sets.

#bits  QNLI  CoLA  RTE  SST-2  MRPC  STS-B  MNLI-m/mm  QQP  average
32 bits  91.7  59.2  72.2  93.1  86.3/90.4  89.7  85.0/84.8  91.6/88.8  83.7
5 bits  91.5  60.2  70.8  94.0  87.3/91.0  89.6  84.7/84.9  91.7/88.8  83.9
4 bits  91.7  57.4  70.8  93.6  87.0/91.0  89.6  84.8/84.8  91.6/88.7  83.5
3 bits  91.3  56.9  70.0  93.1  86.0/90.2  89.4  84.4/84.1  91.2/88.1  82.9
2 bits  89.5  50.2  66.1  91.3  84.6/89.2  88.3  81.6/81.9  90.3/87.0  80.4
1 bit  62.2  13.7  54.5  83.0  70.8/81.7  52.2  62.0/62.6  77.1/65.9  59.8
Table 3: Results of linear quantization for ALBERT on the GLUE dev sets.

#bits  QNLI  CoLA  RTE  SST-2  MRPC  STS-B  MNLI-m/mm  QQP  average
32 bits  91.5  58.9  81.6  92.8  90.2/93.1  90.9  84.9/85.1  90.8/87.7  85.2
5 bits  60.1  0  53.1  74.8  68.4/81.2  39.9  43.6/45.6  72.6/65.8  50.9
4 bits  52.3  0  52.7  50.9  68.4/81.2  6.8  35.5/35.2  67.9/56.5  41.1
3 bits  51.4  0  54.2  54.9  68.4/81.2  16.7  35.5/35.4  68.2/56.7  42.7
2 bits  54.0  0  52.7  50.9  68.4/81.2  18.8  35.4/35.3  67.5/53.2  42.6
1 bit  54.3  0  55.6  50.9  68.4/81.2  9.7  35.5/35.3  67.3/52.5  41.9
Table 4: Results of k-means quantization for ALBERT on the GLUE dev sets.

#bits  QNLI  CoLA  RTE  SST-2  MRPC  STS-B  MNLI-m/mm  QQP  average
32 bits  91.5  58.9  81.6  92.8  90.2/93.1  90.9  84.9/85.1  90.8/87.7  85.2
5 bits  91.0  55.9  78.3  92.7  90.7/93.4  90.8  84.2/85.1  90.3/87.1  84.3
4 bits  90.1  48.9  75.5  87.0  84.8/89.3  75.8  82.1/83.1  89.2/85.5  79.6
3 bits  63.5  4.6  53.8  76.5  68.1/80.8  77.7  63.7/65.8  82.9/77.9  61.8
2 bits  61.4  0  59.9  71.6  70.8/82.2  20.4  45.0/45.6  72.7/61.5  49.7
1 bit  50.6  0  56.0  52.2  68.4/81.2  6.3  35.4/35.2  69.8/58.8  41.5
4.2 Experimental Setup
Before quantization, the bert-base-uncased version of BERT models is fine-tuned on the 8 tasks using the Adam optimizer [6] and a linear schedule with a learning rate of 5e-5. As for ALBERT models, we first fine-tune the albert-base-v2 model on QNLI, CoLA, SST-2, MNLI, and QQP, and then further fine-tune on RTE, MRPC, and STS-B starting from the MNLI checkpoint (following the same process as [7]). We use the Adam optimizer and a linear schedule to fine-tune ALBERT, and the learning rate for each task is searched over {1e-5, 2e-5, 3e-5, 4e-5, 5e-5}.
After quantization, we further fine-tune the quantized models on the corresponding tasks. In particular, the learning rates of the quantized layers are multiplied by 10 (i.e., 5e-4 for all the quantized BERT models) while those of the other layers remain the same.
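A minimal PyTorch sketch of such per-group learning rates might look as follows (our own illustration; the "centroid" parameter-name filter and the helper name `build_optimizer` are assumptions, not the actual training code):

```python
import torch

def build_optimizer(model, total_steps, base_lr=5e-5):
    """Adam + linear schedule with a 10x learning rate for quantized layers.

    Assumes the trainable centroid vectors of quantized layers can be
    identified by name (here, any parameter whose name contains "centroid").
    """
    quantized = [p for n, p in model.named_parameters() if "centroid" in n]
    others = [p for n, p in model.named_parameters() if "centroid" not in n]
    optimizer = torch.optim.Adam([
        {"params": quantized, "lr": base_lr * 10},  # e.g. 5e-4 for quantized layers
        {"params": others, "lr": base_lr},          # 5e-5 for the rest
    ])
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: max(0.0, 1.0 - step / total_steps))  # linear decay
    return optimizer, scheduler
```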
4.3 Experimental Results and Discussion
We mainly focus on 1- to 5-bit fixed-precision quantization. The results of linear and k-means quantization for BERT are shown in Table 1 and Table 2 respectively, and a further comparison between the average scores of the two sets of experiments is shown in Figure 1. Similarly, the results and comparison for ALBERT are shown in Table 3, Table 4, and Figure 2 respectively.
4.3.1 BERT
The improvements brought by quantization scheme upgrading. As shown in Table 1, Table 2, and Figure 1, although the models perform worse at lower bit-widths no matter which quantization scheme is used, the models quantized with k-means quantization perform significantly better than those using linear quantization at every bit setting, across all 8 tasks and on average. On the average of the 8 tasks, merely by upgrading the quantization scheme from linear to k-means, the performance degradation compared to the full-precision model drops from (38.8, 34.7, 27.6, 17.1, 4.8) to (28.6, 3.94, 0.9, 0.3, 0.2) for 1- to 5-bit quantization respectively. This shows that great performance improvements can be achieved by upgrading the quantization scheme alone, which indicates that the room for improvement in the quantization scheme itself is much underestimated. To further illustrate this, we repeated several experiments using the group-wise linear quantization scheme developed by [11], which improves on linear quantization and achieves much higher performance than simple linear quantization. The results are shown in Table 5. Compared to group-wise linear quantization, simple k-means quantization achieves higher or comparable performance while saving a huge amount of time.¹

¹ In group-wise quantization, each matrix is partitioned into different groups and each group is quantized separately. For the forward pass, the model needs to reconstruct each quantized group of every layer instead of reconstructing the entire weight matrix of each quantized layer directly, which explains why group-wise quantization is quite time-consuming (a short sketch after Table 5 illustrates this reconstruction overhead). Specifically, in our group-wise quantization experiments, we partition each matrix into 128 groups.
Table 5: Comparison between simple k-means quantization and group-wise linear quantization [11] for BERT.

Model  RTE  MRPC  acceleration
3 bits k-means  70.0  86.0/90.2  22
3 bits group-wise  72.6  84.8/89.6  -
2 bits k-means  66.1  84.6/89.2  16
2 bits group-wise  58.5  72.3/81.1  -
1 bit k-means  54.5  70.8/81.7  10
1 bit group-wise  53.1  70.6/81.4  -
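To make the reconstruction overhead described in the footnote concrete, the following NumPy sketch (our own, with hypothetical names) contrasts the single gather of per-matrix reconstruction with the per-group gathers of group-wise reconstruction:

```python
import numpy as np

def reconstruct_per_matrix(indices, centroids):
    """Plain quantization: one lookup rebuilds the whole weight matrix."""
    return centroids[indices]

def reconstruct_groupwise(group_indices, group_centroids):
    """Group-wise quantization: one lookup per group (e.g. 128 per matrix),
    so the forward pass performs many small gathers per layer, which is the
    main source of the extra latency."""
    return np.concatenate([c[i] for i, c in zip(group_indices, group_centroids)])
```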
The potential of k-means quantization. As shown in Table 2, the model can be compressed well simply using k-means quantization with a fixed-precision strategy, and the quantized models still perform well even at some particularly low bit settings. For instance, on RTE, the model quantized to 3 bits with k-means quantization only suffers a performance degradation of 2.16. For most tasks, including QNLI, SST-2, MRPC, STS-B, MNLI, and QQP, the performance of the quantized models only shows a significant drop at the 1-bit setting. It is worth noting that these results were achieved by simple k-means quantization with a maximum of only 3 iterations and without any tricks, which indicates the great development potential of k-means quantization.
4.3.2 ALBERT
Generally speaking, the two main conclusions drawn from the BERT experiments still hold, as shown in Table 3, Table 4, and Figure 2: we again see great improvements brought by upgrading the quantization scheme and great potential in k-means quantization. However, there are some abnormal results that are worth discussing.
Table 6: Results of 1-bit k-means quantization for ALBERT with different numbers of k-means iterations.

Iterations  QNLI  MRPC  STS-B
3  50.56  68.38/81.22  6.29
5  50.63  68.38/81.22  6.93
10  60.63  68.87/81.30  13.76
20  60.19  69.85/81.83  11.10
The influence of the number of k-means iterations. The first set of abnormal results comes from the 1-bit quantization of QNLI, MRPC, and STS-B. While k-means normally outperforms linear quantization, these results violate this pattern. We believe this is because the distribution of the parameters is so complicated that 3 iterations of k-means cannot fit it well. To validate this hypothesis and further explore the influence of iterations, we repeated the experiments with these abnormal results while extending the number of iterations to 5, 10, and 20. The corresponding results are shown in Table 6. With more iterations, the accuracy of k-means quantization increases and surpasses that of linear quantization. However, overfitting may become troublesome, as the performance decreases for QNLI and STS-B when the number of iterations increases from 10 to 20. Therefore, in k-means quantization, the number of k-means iterations is also an important hyperparameter that needs to be tuned carefully.
The special numbers for CoLA and MRPC. Another set of abnormal results comes from the linear quantization of CoLA and MRPC, which are binary classification tasks. We find that the quantized models output "1" all the time after being fine-tuned. The two constant scores (0 mcc for CoLA and 68.4/81.2 for MRPC in Table 3) are therefore determined solely by the label distribution of the dev sets. In other words, after the model is quantized to 1-5 bits with linear quantization, it almost loses its functionality and becomes difficult to train on these two tasks. Moreover, we ran further experiments at higher bit settings on the two tasks and found that, starting from 6 bits, the results of the quantized models no longer collapse to these two values.
The comparison between BERT and ALBERT. Moreover, we compare the performance of k-means quantization for BERT and ALBERT; the results are shown in Figure 3 and Figure 4. Compared with BERT, which retains 96.1% of its original performance after 2-bit k-means quantization, ALBERT is much less robust to quantization (in our work, robustness to quantization means the ability to be quantized to low bit-width while maintaining high performance). The performance of ALBERT falls to 93.4% and 72.5% after 4-bit and 3-bit k-means quantization respectively. Considering that the major improvement of ALBERT over BERT is parameter sharing, and that quantization can also be viewed as intra-layer parameter sharing, we speculate that parameter sharing and quantization have similar effects, i.e., the redundant information removed by parameter sharing and by quantization partially overlaps. Moreover, through parameter sharing, ALBERT has already removed a great amount of redundant information compared to BERT (the total number of parameters falls from 108M to 12M). Therefore, further applying quantization to ALBERT easily damages useful information, and the robustness of ALBERT to quantization is rather low.

However, from another point of view, parameter sharing already significantly reduces the number of parameters and can thus also be considered a model compression method. Moreover, considering that the performance of full-precision ALBERT is better than that of 4-bit and 3-bit BERT models, which occupy a similar amount of GPU memory, parameter sharing can even achieve better compression performance than simple quantization. However, as a compression method, parameter sharing has a non-negligible drawback: it only reduces memory consumption, whereas most other compression methods reduce both memory consumption and computation (i.e., inference time).
5 Conclusion
In this paper, we compare k-means and linear quantization on BERT and ALBERT models and reach three main conclusions. First, we find that the models quantized with k-means significantly outperform those using linear quantization: great performance improvements can be achieved by simply upgrading the quantization scheme. Second, the model can be compressed to relatively low bit-width using only k-means quantization, even with a simple fixed-precision strategy and without any tricks, which indicates the great development potential of k-means quantization. Third, the number of k-means iterations plays an important role in the performance of quantized models and should be chosen carefully. Besides, by comparing the results of k-means quantization for BERT and ALBERT, we discover that ALBERT is much less robust to quantization than BERT. This indicates that parameter sharing and quantization have some effects in common; therefore, further applying quantization to models with extensive parameter sharing easily damages useful information and thus leads to a significant performance drop.
5.0.1 Acknowledgement
We thank the anonymous reviewers for their thoughtful comments. This work has been supported by the National Key Research and Development Program of China (Grant No. 2017YFB1002102) and Shanghai Jiao Tong University Scientific and Technological Innovation Funds (YG2020YQ01).
References
[1] (2019) Efficient 8-bit quantization of transformer neural machine language translation model. arXiv preprint arXiv:1906.00532.
[2] (2019) Transformers.zip: compressing transformers with pruning and quantization. Technical report, Stanford University, Stanford, California.
[3] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186.
[4] (2019) HAWQ: Hessian aware quantization of neural networks with mixed-precision. In ICCV, pp. 293–302.
[5] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR.
[6] (2015) Adam: a method for stochastic optimization. In ICLR.
[7] (2019) ALBERT: a lite BERT for self-supervised learning of language representations. In ICLR.
[8] (2017) Towards accurate binary convolutional neural network. In NIPS, pp. 345–353.
[9] (2019) Highly efficient neural network language model compression using soft binarization training. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 62–69.
[10] (2019) Fully quantized transformer for machine translation. arXiv preprint arXiv:1910.10485.
[11] (2020) Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI.
[12] (2018) Structured word embedding for low memory neural network language model. In INTERSPEECH, pp. 1254–1258.
[13] (2017) Attention is all you need. In NIPS, pp. 5998–6008.
[14] (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR.
[15] (2019) HAQ: hardware-aware automated quantization with mixed precision. In CVPR, pp. 8612–8620.
[16] (2019) Q8BERT: quantized 8-bit BERT. In NIPS EMC Workshop.
[17] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
[18] (2017) Trained ternary quantization. In ICLR.