Pre-trained transformer-based models have recently achieved state-of-the-art performance on a variety of natural language processing (NLP) tasks, such as sequence tagging and sentence classification. Among them, BERT models, built on the transformer architecture, have drawn even more attention because of their strong performance and generality. However, the memory and compute requirements of these models are prohibitive: even the relatively small versions of BERT (e.g., BERT-base) contain more than 100 million parameters. This over-parameterization makes it challenging to deploy BERT models on devices with constrained resources, such as smartphones and robots. Compressing these models is therefore an important demand in industry.
One popular and efficient method for model compression is quantization. To reduce model size, quantization represents each parameter of the model with fewer bits instead of the original 32. With proper hardware support, quantization can significantly reduce the memory footprint while accelerating inference. Many works have focused on quantizing models in computer vision [8, 18, 17, 5, 4, 15], while much less has been done for NLP [12, 9, 1, 2, 10]. Pilot works on transformer quantization include [1, 2, 10], which successfully quantized transformer models to 8 or 4 bits while maintaining comparable performance. Moreover, to the best of our knowledge, there are only two published works focusing on BERT quantization [16, 11]. One of them applied 8-bit fixed-precision linear quantization to BERT models and achieved a compression ratio of 4 with little accuracy drop. The other improved quantization performance with group-wise mixed-precision linear quantization based on the Hessian matrices of the parameter tensors.
However, as the underlying quantization scheme, most of the above transformer quantization works, especially the BERT quantization works, use linear clustering, which is a rudimentary clustering method. Although it is fast and simple, its quantized results cannot represent the original data distribution well. As a result, the first of the two BERT works only manages to quantize BERT to 8 bits. The second work achieves much higher compression ratios without upgrading the quantization scheme, but the group-wise method it develops is rather time-consuming and increases latency significantly. Although it is generally believed that replacing linear clustering with a better clustering method can improve the performance of quantized models, the effect of such a scheme upgrade is rather underestimated. Therefore, in this paper, we explore the effect of simply upgrading the quantization scheme from linear clustering to k-means clustering, and compare the performance of the two schemes. Furthermore, to examine the effect on other pre-trained language models, we also compare the two quantization schemes on ALBERT models, an improved version of BERT.
In summary, we apply k-means and linear quantization to BERT and ALBERT and test their performance on the GLUE benchmark. Through this, we verify that simply upgrading the quantization scheme can yield large performance gains, and that plain k-means clustering has great potential as a BERT quantization scheme. Moreover, we show that the number of k-means iterations plays an important role in k-means quantization. Through further comparison, we discover that ALBERT is less robust than BERT in terms of quantization, as its parameter sharing has reduced the redundancy of the parameters.
2 Background: BERT and ALBERT
In this section, we briefly introduce the architectures of BERT and ALBERT models and point out the version of the models we used in our experiments.
BERT models are a special kind of pre-trained transformer-based network.
They mainly consist of embedding layers, encoder blocks, and output layers.
There is no decoder block in BERT models. Each encoder block contains one self-attention layer (comprising three parallel linear layers for the query, key, and value) and three feed-forward layers (each containing one linear layer).
For each self-attention layer, BERT utilizes the multi-head technique to further improve performance. For the $i$-th self-attention head, there are three weight matrices $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_h}$, where $d_h = d / N_h$ ($N_h$ is the number of heads in each self-attention layer). Let $X$ denote the input of the corresponding self-attention layer. The output of the $i$-th self-attention head is then calculated as
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(XW_i^Q)(XW_i^K)^{\top}}{\sqrt{d_h}}\right) XW_i^V.$$
Then, for each self-attention layer, the outputs of all its self-attention heads are concatenated sequentially to generate the output of the corresponding layer.
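The per-head computation and the concatenation described above can be sketched as follows (a minimal NumPy illustration; the function names and shapes are our own, and bias terms and the output projection are omitted):

```python
import numpy as np

def attention_head(X, W_q, W_k, W_v):
    """One self-attention head: softmax(Q K^T / sqrt(d_h)) V.

    X: (seq_len, d) layer input; W_q, W_k, W_v: (d, d_h) head projections.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each (seq_len, d_h)
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)              # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                           # (seq_len, d_h)

def multi_head(X, heads):
    # The outputs of all heads are concatenated sequentially along the
    # feature axis to form the layer output.
    return np.concatenate([attention_head(X, *w) for w in heads], axis=-1)
```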
Specifically, in our work, we use the bert-base-uncased version of BERT models, which has 12 encoder blocks and 12 heads for each self-attention layer, to carry out the following experiments.
Compared to BERT, ALBERT introduces three main improvements. First, ALBERT decomposes the embedding parameters into the product of two smaller matrices.
Second, it adopts cross-layer parameter sharing to improve parameter efficiency. These two improvements significantly reduce the total number of parameters and make the model more efficient. Parameter sharing can also stabilize the network parameters.
Third, it replaces the next-sentence prediction (NSP) loss with a sentence-order prediction (SOP) loss during pre-training. This makes the models focus on modeling inter-sentence
coherence instead of topic prediction and improves the performance on multi-sentence encoding tasks.
Specifically, in this paper, we use the albert-base-v2 version of ALBERT models, which also has 12 encoder blocks (where all parameters are shared across layers) and 12 heads for each self-attention layer.
3 Quantization Schemes
3.1 Overall Procedure
To compare linear and k-means quantization schemes on pre-trained transformer-based models, we test the performance of quantized models on different downstream tasks. Specifically, for each chosen task, the following experiments are carried out sequentially: fine-tuning the pre-trained models (BERT and ALBERT) on the downstream task; quantizing the task-specific model; and fine-tuning the quantized model. The performance of the resulting model is then tested on the validation set of each chosen task.
To isolate the effect of the scheme itself, we apply the two quantization schemes (linear and k-means) following a fixed-precision quantization strategy without any additional tricks. We quantize all the weights of the embedding layers and the fully connected layers (except the classification layer). After quantization, each weight vector is represented by a corresponding cluster index vector and a centroid value vector, and each parameter of the weight vector is replaced with the centroid of the cluster to which it belongs.
After the model is quantized, we further fine-tune it on the corresponding downstream tasks while keeping it quantized. For the forward pass, we reconstruct each quantized layer from its cluster index vector and centroid value vector. For the backward pass, while updating the remaining parameters normally, we update the quantized parameters by training the centroids: the gradient of each entry of the centroid value vector is calculated as the average of the gradients of the parameters belonging to the corresponding cluster. The centroid value vectors are then updated by the same back-propagation method.
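The forward reconstruction and the centroid-gradient rule described above can be sketched as follows (an illustrative NumPy fragment; the function names are our own, and the surrounding training loop is omitted):

```python
import numpy as np

def reconstruct(indices, centroids):
    # Forward pass: rebuild a quantized weight vector from its cluster
    # index vector and centroid value vector.
    return centroids[indices]

def centroid_gradients(indices, weight_grad, n_clusters):
    # Backward pass: each centroid's gradient is the average of the
    # gradients of the parameters assigned to its cluster.
    grads = np.zeros(n_clusters)
    for j in range(n_clusters):
        mask = indices == j
        if mask.any():
            grads[j] = weight_grad[mask].mean()
    return grads

def update_centroids(centroids, indices, weight_grad, lr):
    # Plain gradient step on the centroid value vector; the other (non-
    # quantized) parameters would be updated by the optimizer as usual.
    return centroids - lr * centroid_gradients(indices, weight_grad, len(centroids))
```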
3.2 Linear Quantization
Suppose that we need to quantize a vector $\mathbf{w}$ to $b$ bits ($b$-bit quantization). We first search for its minimum value $w_{\min}$ and maximum value $w_{\max}$. The range $[w_{\min}, w_{\max}]$ is then divided into $2^b$ clusters of width
$$\Delta = \frac{w_{\max} - w_{\min}}{2^b}.$$
Define the function $\phi$ as
$$\phi(w) = \min\!\left(\left\lfloor \frac{w - w_{\min}}{\Delta} \right\rfloor,\, 2^b - 1\right),$$
whose value lies between $0$ and $2^b - 1$, such that each parameter $w$ belongs to the $\phi(w)$-th cluster. Each $w$ is then replaced with the centroid of the $\phi(w)$-th cluster, i.e., the average of all parameters belonging to it. Therefore, the quantization function is
$$Q(w) = \frac{\sum_{w' \in \mathbf{w}} \mathbb{1}[\phi(w') = \phi(w)]\, w'}{\sum_{w' \in \mathbf{w}} \mathbb{1}[\phi(w') = \phi(w)]},$$
where $\mathbb{1}[\cdot]$ equals $1$ when the statement is true, and $0$ otherwise.
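A minimal sketch of this linear scheme (a NumPy illustration; the function name and the choice to clip the maximum value into the last cluster are our own):

```python
import numpy as np

def linear_quantize(w, bits):
    """b-bit linear quantization: split [w_min, w_max] into 2^b equal-width
    clusters and replace each parameter with its cluster's mean."""
    n = 2 ** bits
    w_min, w_max = w.min(), w.max()
    width = (w_max - w_min) / n
    # phi(w): cluster index in {0, ..., n-1}; w_max is clipped into the last.
    idx = np.clip(((w - w_min) / width).astype(int), 0, n - 1)
    # Centroid of each cluster = mean of its members (bin center if empty).
    centroids = np.array([w[idx == j].mean() if (idx == j).any()
                          else w_min + (j + 0.5) * width for j in range(n)])
    return idx, centroids, centroids[idx]
```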
3.3 K-Means Quantization
Suppose that we need to quantize a vector $\mathbf{w}$ to $b$ bits ($b$-bit quantization). For k-means quantization, we leverage k-means clustering with k-means++ initialization to partition the vector into $2^b$ clusters.
We first utilize the k-means++ initialization method to obtain an initial centroid $c_j$ for each cluster ($j = 1, \dots, 2^b$). Then, each parameter
is assigned to its nearest cluster. After all the parameters in $\mathbf{w}$ are assigned, each centroid is updated as the average of all the parameters belonging to its cluster. Re-assignment of parameters and updating of centroids are repeated until convergence or until the maximum number of iterations is reached. The k-means++ initialization proceeds as follows: first, choose a random parameter from the vector as the first centroid; then, assign each remaining parameter a probability of becoming the next centroid proportional to its squared distance from the nearest existing centroid, and choose the next centroid according to these probabilities; finally, repeat the probability assignment and centroid selection until all $2^b$ centroids are generated.
To limit the efficiency cost introduced by upgrading the quantization scheme, we set the maximum number of k-means iterations to only 3. After k-means clustering finishes, we use the resulting label vector as the cluster index vector and the resulting centroids as the corresponding centroid value vector. Each parameter is then replaced by the centroid of the cluster to which it belongs.
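The whole procedure, k-means++ initialization followed by at most 3 Lloyd iterations, can be sketched as follows (a NumPy illustration with our own function name and random seeding; empty clusters are simply left unchanged):

```python
import numpy as np

def kmeans_quantize(w, bits, max_iter=3, seed=0):
    """b-bit k-means quantization of a flat weight vector.

    Returns (cluster index vector, centroid value vector)."""
    rng = np.random.default_rng(seed)
    k = 2 ** bits
    # --- k-means++ initialization ---
    centroids = [w[rng.integers(len(w))]]          # first centroid: random pick
    for _ in range(k - 1):
        # Squared distance of each parameter to its nearest existing centroid.
        d2 = np.min((w[:, None] - np.array(centroids)[None, :]) ** 2, axis=1)
        probs = d2 / d2.sum()                      # farther points more likely
        centroids.append(w[rng.choice(len(w), p=probs)])
    centroids = np.array(centroids)
    # --- Lloyd iterations, capped to limit quantization time ---
    for _ in range(max_iter):
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if (idx == j).any():
                centroids[j] = w[idx == j].mean()
    idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return idx, centroids
```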
4 Experiments
In this section, we first introduce the dataset used in our experiments (Section 4.1), then explain the experimental details of our experiments on BERT and ALBERT (Section 4.2), and finally present the results and the corresponding discussion (Section 4.3).
4.1 Dataset
We test the performance of our quantized models on the General Language Understanding Evaluation (GLUE) benchmark, which contains NLU tasks including question answering, sentiment analysis, and textual entailment. Specifically, we use 8 tasks (QNLI, CoLA, RTE, SST-2, MRPC, STS-B, MNLI, and QQP) to test the performance of the different quantization schemes. The evaluation metric for each task is as follows: Matthews correlation coefficient (mcc) for CoLA; accuracy (acc) for QNLI, RTE, SST-2, and MNLI; accuracy (acc) and F1 score for MRPC and QQP; and Pearson and Spearman correlation coefficients (corr) for STS-B. We follow the default split of the dataset. The datasets are available for download at https://gluebenchmark.com/tasks.
4.2 Experimental Setup
Before quantization, the bert-base-uncased version of BERT is fine-tuned on the 8 tasks with the Adam optimizer and a linear learning-rate schedule with a learning rate of 5e-5. As for ALBERT, we first fine-tune the albert-base-v2 model on QNLI, CoLA, SST-2, MNLI, and QQP, and then further fine-tune it on RTE, MRPC, and STS-B starting from the MNLI checkpoint (following the same process as the original ALBERT work). We use the Adam optimizer and a linear schedule to fine-tune ALBERT, and the learning rate for each task is searched over {1e-5, 2e-5, 3e-5, 4e-5, 5e-5}.
After quantization, we further fine-tune the quantized models on the corresponding tasks. In particular, the learning rates of the quantized layers are multiplied by 10 (i.e., 5e-4 for all the quantized BERT models) while those of the other layers remain the same.
4.3 Experimental Results and Discussion
We mainly focus on 1-5 bit fixed-precision quantization. The results of linear and k-means quantization for BERT are shown in Table 1 and Table 2 respectively, and a further comparison between the average scores of the two sets of experiments is shown in Figure 1. Similarly, the results and comparison for ALBERT are shown in Table 3, Table 4, and Figure 2.
The improvements brought by upgrading the quantization scheme. As shown in Table 1, Table 2, and Figure 1, although the models perform worse at lower bit-widths no matter which quantization scheme is used, the models quantized with k-means quantization perform significantly better than those using linear quantization at every bit setting, across all 8 tasks and their average. Averaged over the 8 tasks, merely by upgrading the quantization scheme from linear to k-means, the performance degradation relative to the full-precision model drops from (38.8, 34.7, 27.6, 17.1, 4.8) to (28.6, 3.94, 0.9, 0.3, -0.2) for 1-5 bit quantization respectively. This shows that great performance improvements can be achieved by upgrading the quantization scheme alone, which indicates that the room for improvement in the quantization scheme is much underestimated. To further illustrate this, we repeated several experiments using the group-wise linear quantization scheme mentioned above, which is an improvement on linear quantization and achieves much higher performance than simple linear quantization. The results are shown in Table 5. Compared to group-wise linear quantization, simple k-means quantization achieves higher or comparable performance while saving a huge amount of time.¹

¹ In group-wise quantization, each matrix is partitioned into different groups and each group is quantized separately. For the forward pass, the model needs to reconstruct each quantized group of each layer respectively, instead of directly reconstructing the entire weight matrix of each quantized layer. This explains why group-wise quantization is quite time-consuming. Specifically, in our group-wise quantization experiments, we partition each matrix into 128 groups.
| 3 bits k-means | 70.0 | 86.0/90.2 | 22 |
| 3 bits group-wise | 72.6 | 84.8/89.6 | |
| 2 bits k-means | 66.1 | 84.6/89.2 | 16 |
| 2 bits group-wise | 58.5 | 72.3/81.1 | |
| 1 bit k-means | 54.5 | 70.8/81.7 | 10 |
| 1 bit group-wise | 53.1 | 70.6/81.4 | |
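For contrast, the group-wise scheme described in the footnote above can be sketched as follows (an illustrative NumPy fragment with our own names; we use a simple linear quantizer inside each group, and a small group count rather than the 128 used in our experiments, for brevity):

```python
import numpy as np

def linear_quantize(w, bits):
    # Simple linear quantization of a flat vector (stand-in for the scheme
    # applied inside each group); returns (index vector, centroid vector).
    n = 2 ** bits
    width = (w.max() - w.min()) / n
    idx = np.clip(((w - w.min()) / width).astype(int), 0, n - 1)
    centroids = np.array([w[idx == j].mean() if (idx == j).any() else 0.0
                          for j in range(n)])
    return idx, centroids

def groupwise_quantize(W, bits, n_groups=4):
    # Each row-group is quantized independently, so every group carries its
    # own (index vector, centroid vector) pair.
    groups = np.array_split(W, n_groups, axis=0)
    return [linear_quantize(g.ravel(), bits) + (g.shape,) for g in groups]

def reconstruct(quantized):
    # Per-group reconstruction on every forward pass: n_groups small lookups
    # instead of one, which is the source of the extra inference latency.
    return np.vstack([c[i].reshape(shape) for i, c, shape in quantized])
```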
The potential of k-means quantization. As shown in Table 2, the model can be compressed well simply using k-means quantization with a fixed-precision strategy, and the quantized models still perform well even at some particularly low bit settings. For instance, on RTE, the model quantized to 3 bits with k-means quantization suffers only a 2.16-point performance degradation. For most tasks, including QNLI, SST-2, MRPC, STS-B, MNLI, and QQP, the performance of the quantized models only shows a significant drop in the 1-bit setting. It is worth noting that these results were achieved by simple k-means quantization with a maximum of only 3 iterations and without any tricks, which indicates the great development potential of k-means quantization.
Generally speaking, the two main conclusions drawn from the BERT experiments still hold, as shown in Table 3, Table 4, and Figure 2: we again see great improvements brought by upgrading the quantization scheme, and great potential in k-means quantization. However, there are some abnormal results worth discussing.
The influence of the number of k-means iterations. The first set of abnormal results comes from 1-bit quantization on QNLI, MRPC, and STS-B. While k-means normally outperforms linear quantization, these results violate that pattern. We believe this is because the distribution of the parameters is so complicated that 3 iterations of k-means cannot fit it well. To validate this hypothesis and further explore the influence of the iteration count, we repeated the experiments with these abnormal results while extending the number of iterations to 5, 10, and 20. The corresponding results are shown in Table 6. With more iterations, the accuracy of k-means quantization increases and surpasses linear quantization. However, over-fitting may become troublesome, as the performance on QNLI and STS-B decreases when the number of iterations increases from 10 to 20. Therefore, in k-means quantization, the number of k-means iterations is also an important hyper-parameter that needs to be searched carefully.
The peculiar scores of CoLA and MRPC. Another set of abnormal results comes from the linear quantization of CoLA and MRPC, which are binary classification tasks. We find that the quantized models output "1" all the time after being fine-tuned, so their scores are determined solely by the label distribution of the dev sets. In other words, after the model is quantized to 1-5 bits with linear quantization, it almost loses its functionality and becomes difficult to train on these two tasks. We further ran experiments at higher bit settings on the two tasks and found that, starting from 6 bits, the quantized models no longer produce these degenerate scores.
The comparison between BERT and ALBERT. Moreover, we compare the performance of k-means quantization on BERT and ALBERT; the results are shown in Figure 3 and Figure 4. Compared with BERT, which retains 96.1% of its original performance after 2-bit k-means quantization, ALBERT is much less robust to quantization (in this work, robustness towards quantization means the ability to be quantized to low bit-widths while maintaining high performance): ALBERT's performance falls to 93.4% and 72.5% after 4-bit and 3-bit k-means quantization respectively. Considering that the major improvement of ALBERT over BERT is parameter sharing, and that quantization can also be regarded as intra-layer parameter sharing, we speculate that parameter sharing and quantization have similar effects, i.e., the redundant information removed by parameter sharing and by quantization partially overlaps. After parameter sharing, ALBERT has already removed a great amount of redundant information compared to BERT (the total number of parameters falls from 108M to 12M). Therefore, further applying quantization to ALBERT easily damages useful information, and ALBERT's robustness towards quantization is rather low. From another point of view, parameter sharing itself significantly reduces the number of parameters and can thus also be considered a model compression method. Moreover, considering that full-precision ALBERT outperforms the 4-bit and 3-bit BERT models, which occupy a similar amount of GPU memory, parameter sharing can even achieve better compression performance than simple quantization. However, as a compression method, parameter sharing has a non-negligible drawback: it only reduces memory consumption, whereas most other compression methods reduce both memory consumption and computation (i.e., inference time).
5 Conclusion
In this paper, we compare k-means and linear quantization on BERT and ALBERT models and draw three main conclusions. First, models quantized with k-means significantly outperform those using linear quantization: great performance improvements can be achieved simply by upgrading the quantization scheme. Second, models can be compressed to relatively low bit-widths using only k-means quantization, even with a simple fixed-precision strategy and without any tricks, which indicates the great development potential of k-means quantization. Third, the number of k-means iterations plays an important role in the performance of quantized models and should be chosen carefully. Besides, by comparing the results of k-means quantization on BERT and ALBERT, we discover that ALBERT is much less robust towards quantization than BERT, which indicates that parameter sharing and quantization have some effects in common. Therefore, further applying quantization to models with extensive parameter sharing easily damages useful information and thus leads to a significant performance drop.
Acknowledgments
We thank the anonymous reviewers for their thoughtful comments. This work has been supported by the National Key Research and Development Program of China (Grant No. 2017YFB1002102) and Shanghai Jiao Tong University Scientific and Technological Innovation Funds (YG2020YQ01).
References
- (2019) Efficient 8-bit quantization of transformer neural machine language translation model. arXiv preprint arXiv:1906.00532.
- (2019) Transformers.zip: compressing transformers with pruning and quantization. Technical report, Stanford University, Stanford, California.
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171-4186.
- (2019) HAWQ: Hessian aware quantization of neural networks with mixed-precision. In ICCV, pp. 293-302.
- (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR.
- (2015) Adam: a method for stochastic optimization. In ICLR.
- (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In ICLR.
- (2017) Towards accurate binary convolutional neural network. In NIPS, pp. 345-353.
- (2019) Highly efficient neural network language model compression using soft binarization training. In ASRU, pp. 62-69.
- (2019) Fully quantized transformer for machine translation. arXiv preprint arXiv:1910.10485.
- (2020) Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI.
- (2018) Structured word embedding for low memory neural network language model. In INTERSPEECH, pp. 1254-1258.
- (2017) Attention is all you need. In NIPS, pp. 5998-6008.
- (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR.
- (2019) HAQ: hardware-aware automated quantization with mixed precision. In CVPR, pp. 8612-8620.
- (2019) Q8BERT: quantized 8bit BERT. In NeurIPS EMC Workshop.
- (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
- (2017) Trained ternary quantization. In ICLR.