Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

Various pruning approaches have been proposed to reduce the footprint requirements of Transformer-based language models. Conventional wisdom is that pruning reduces the model expressiveness and thus is more likely to underfit than overfit compared to the original model. However, under the trending pretrain-and-finetune paradigm, we argue that pruning increases the risk of overfitting if pruning was performed at the fine-tuning phase, as it increases the amount of information a model needs to learn from the downstream task, resulting in relative data deficiency. In this paper, we aim to address the overfitting issue under the pretrain-and-finetune paradigm to improve pruning performance via progressive knowledge distillation (KD) and sparse pruning. Furthermore, to mitigate the interference between different strategies of learning rate, pruning and distillation, we propose a three-stage learning framework. We show for the first time that reducing the risk of overfitting can help the effectiveness of pruning under the pretrain-and-finetune paradigm. Experiments on multiple datasets of GLUE benchmark show that our method achieves highly competitive pruning performance over the state-of-the-art competitors across different pruning ratio constraints.




1 Introduction

Recently, Transformer-based language models vaswani2017attention, such as BERT devlin2018bert and GPT-3 brown2020language, have achieved huge success on various natural language processing (NLP) tasks and brought NLP into a new era. These models all adopt the pretrain-and-finetune paradigm, where models are first pre-trained in a self-supervised fashion on a large corpus and then fine-tuned for specific downstream tasks wang2018glue. While effective and prevalent, they suffer from a heavy model size, which hinders their deployment on resource-constrained devices, e.g., mobile phones li2021npas, smart cameras choi2020edge, and autonomous driving cars zhou2021end.

(a) Conventional pruning
(b) Pruning under pretrain-and-finetune paradigm
Figure 1: The comparison between conventional pruning and pruning under the pretrain-and-finetune paradigm. The two symbols in the figure denote the general-purpose language knowledge learned during pre-training and the downstream task-specific knowledge, respectively.

To this end, various weight pruning approaches have been proposed to reduce the footprint requirements of Transformers zhu2018prune; guo2019reweighted; li2020efficient; liu2020autocompress; blalock2020state; gordon2020compressing; xu2021rethinking; li2021npas. They zero out certain weights and then optimize the rest, which is a destruction-plus-learning process. Conventional wisdom holds that pruning helps reduce the overfitting risk, since compressed model structures have fewer parameters and are believed to be less prone to overfitting gerum2020sparsity. However, under the pretrain-and-finetune paradigm, most pruning methods understate the overfitting issue. We argue that model pruning increases the risk of overfitting if it is performed at the fine-tuning phase (see Figure 1): compared with conventional pruning, the model must recover the pruned general-purpose knowledge from the downstream task, which increases the amount of information it needs to learn and results in relative data deficiency. We visualize the overfitting issue on a real-world dataset in Figure 2. In Figure 2 (b), the evaluation performance on the training set keeps improving while the performance on the validation set stays flat throughout training. In Figure 2 (c), the gap becomes more pronounced as the pruning rate increases, and the performance on the validation set even degrades after 2000 training steps. These observations support our hypothesis.

(a) Pruning rate = 0
(b) Pruning rate = 0.8
(c) Pruning rate = 0.95
Figure 2: Visualization of the overfitting issue when pruning all the linear transformation weight matrices of BERT at the fine-tuning stage on MRPC. The yellow and blue lines represent the evaluation performance on the training set and the validation set during training, respectively. We perform iterative magnitude pruning frankle2019lottery during the first 1000 steps and keep knowledge distillation (KD) through the whole training process.

The main question this paper attempts to answer is: how can we reduce the risk of overfitting that pruning causes in pre-trained language models? Answering this question is challenging. First, under the pretrain-and-finetune paradigm, both general-purpose language knowledge and task-specific knowledge are learned, so it is nontrivial to keep the model parameters related to both kinds of knowledge when pruning. Second, the Transformer architecture contains multiple components, e.g., cascaded multi-head self-attention and feed-forward layers devlin2018bert. Third, the downstream task data may be small, as with the MRPC dataset in GLUE wang2018glue or privacy-sensitive data. Thus, the overfitting problem can easily arise, especially under high pruning rate requirements. Some recent progress has been made on addressing the overfitting associated with model compression. However, the reported results are not remarkable and most of this work focuses on the vision domain bai2020few; shen2021progressive.

To address these challenges, we propose SPD, a sparse progressive distillation method for pruning pre-trained language models, with a special focus on the widely used BERT model. The key point of our method is a layer grafting scheme, which randomly grafts the student layers onto the teacher model and updates the parameters of these layers, intertwined with the teacher, in a progressive fashion. This scheme takes advantage of the well-trained parameters of the teacher to reduce the overfitting risk, while striving to maintain the representational learning ability of the compressed student model, thanks to three contributing factors. First, KD mitigates overfitting by forcing the student model to mimic the behavior of the teacher model hinton2015distilling. However, the overfitting problem may still not be negligible. Second, to further reduce the overfitting risk, the layers of the teacher and the student are mixed, which adds regularization for optimizing the student model shen2021progressive. Third, to preserve the expressive power of the student model, we adopt sparse pruning, which achieves higher compression rates than structural pruning in both convolutional and Transformer neural networks zhu2018prune; elsen2020fast; xu2021rethinking.

Specifically, we design a three-stage framework to perform SPD (see Figure 3). In the first stage, we apply iterative magnitude pruning frankle2019lottery on the student layers, during which the pruned layers are randomly grafted onto the teacher model and KD is performed between the grafted model and the teacher. In this stage, only the parameters of the pruned layers are updated and a constant replacing probability ($\le 1.0$) is adopted for the grafting, which reduces the complexity of network optimization. Pruning is completed in this stage. In the second stage, the replacing probability is progressively increased to 1.0, which enables more student layers to orchestrate with each other. In the last stage, we keep the replacing probability at 1.0 and perform KD between the teacher and the grafted model. In other words, all the pruned layers of the student are updated simultaneously in this stage, which further allows them to adapt to each other.

To summarize, our contribution is identifying the overfitting issue of pruning under the pretrain-and-finetune paradigm and proposing the sparse progressive distillation method to address it. We demonstrate the benefits of the proposed three-stage framework through ablation studies. We validate our method on six datasets from the GLUE benchmark. To test whether our method is applicable across tasks, we include both single-sentence and sentence-pair classification tasks. Experimental results show that our method outperforms the leading competitors by a large margin.

2 Related Work

Network Pruning. Prior work has shown that the weight parameters of deep learning models can be reduced without sacrificing accuracy, for example by magnitude-based pruning han2015 and the lottery ticket hypothesis frankle2019lottery. zhu2018prune compared small-dense models and large-sparse models with the same number of parameters and showed that the latter outperform the former, indicating that large-sparse models have better expressive power than their small-dense counterparts. However, under the pretrain-and-finetune paradigm, pruning leads to overfitting, as discussed above.

Knowledge Distillation (KD). KD is a common method for reducing the number of parameters. Its main idea is that a small student model mimics the behaviour of a large teacher model and achieves comparable performance hinton2015distilling; mirzadeh2020improved. On pre-trained language models, sanh2019distilbert; jiao2020tinybert; sun2020mobilebert utilized KD to learn universal language representations from a large corpus. However, current SOTA knowledge distillation methods are not able to reach a high model compression rate (less than 10% remaining weights) while keeping the performance decrease insignificant.

Figure 3: An overview of three-stage sparse progressive knowledge distillation framework. Stage I: iterative pruning with a constant replacing probability and KD between teacher and the grafted model. Stage II: replacing with linearly increasing probability. Stage III: knowledge distillation between teacher and student.
(a) Replacing probability
(b) Pruning rate
(c) Learning rate
Figure 4: An example of three schedulers. The red and green dashed lines are the locations of the end of the first and second stages, respectively.

Progressive Learning. The key idea of progressive learning is that the student learns to update module by module with the teacher shen2021progressive. Using a dual-stage distillation scheme in which student modules are progressively grafted onto the teacher network, it targets the few-shot scenario and uses only a few unlabeled samples to achieve comparable results on CIFAR-10 and CIFAR-100. xu-etal-2020-bert gradually increased the probability of replacing each teacher module with its corresponding student module and trained the student to reproduce the behavior of the teacher. However, the performance of the former method on Transformer-based models is unknown, while the latter shows an obvious performance drop even at a low pruning rate (50%).

3 Sparse Progressive Distillation

We propose a sparse progressive distillation strategy (Figure 3). The orange layers are the ones that will be pruned, while the blue layers are the copies from teacher model.

Figure 5: An overview of the layer-wise KD used in SPD. (a) M sparse student layers have probabilities $p_1, p_2, p_3, \dots, p_M$ to substitute the corresponding teacher layers separately. (b) Teacher model. (c) Grafted model. $L_i$ ($1 \le i \le M + 1$) denotes the distillation loss between the $i$-th layer of the teacher and the $i$-th layer of the grafted model. (d) An illustration of the pruned weight matrices in a Transformer encoder layer.

3.1 Problem Formulation

The teacher model and the student model (shown in Figure 3) are denoted by $f^T$ and $f^S$, respectively, and the grafted model is denoted by $f^G$. All three models have $M$ layers (i.e., the first $M - 1$ layers are encoder layers and the last layer is the output layer). Each encoder layer corresponds to a Transformer encoder. We use an independent Bernoulli random variable $I_i$, which takes the value 1 with probability $p_i$ and the value 0 with probability $1 - p_i$, to indicate whether the $i$-th student layer is grafted onto the model ($I_i = 1$) or not ($I_i = 0$). Accordingly, the $i$-th encoder layer of the grafted model is the $i$-th student layer $f_i^S$ with probability $p_i$ and the $i$-th teacher layer $f_i^T$ with probability $1 - p_i$. Denote by $f_i^T$, $f_i^S$, and $f_i^G$ the behaviour functions induced from the $i$-th encoder of the teacher model, the student model, and the grafted model, respectively. Given input $x$, we have

$$f_i^G(x) = I_i \, f_i^S(x) + (1 - I_i) \, f_i^T(x). \tag{1}$$
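The grafting rule described above can be illustrated with a few lines of Python. This is a minimal sketch rather than the authors' implementation: layers are stand-in callables on a scalar, and the names `graft_layers`, `forward`, `student`, `teacher`, and `replace_prob` are hypothetical.

```python
import random


def graft_layers(student_layers, teacher_layers, replace_prob, rng=random):
    """Build one grafted stack.

    Each position independently takes the student layer with probability
    `replace_prob` (a Bernoulli draw) and the teacher layer otherwise.
    Returns the grafted layers and the per-layer indicator values.
    """
    if len(student_layers) != len(teacher_layers):
        raise ValueError("student and teacher must have the same depth")
    grafted, indicators = [], []
    for s_layer, t_layer in zip(student_layers, teacher_layers):
        use_student = rng.random() < replace_prob  # rng.random() is in [0, 1)
        grafted.append(s_layer if use_student else t_layer)
        indicators.append(use_student)
    return grafted, indicators


def forward(layers, x):
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x


# Toy stand-ins for encoder layers (assumption: real layers act on tensors).
teacher = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0]
student = [lambda x: x + 0.9, lambda x: x * 2.1, lambda x: x - 2.9]

grafted, indicators = graft_layers(student, teacher, 0.5, random.Random(0))
print(indicators, forward(grafted, 1.0))
```

With `replace_prob = 1.0` the grafted stack is exactly the student, and with `replace_prob = 0.0` it is exactly the teacher, which are the two extremes between which the replacing probability moves during training.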


We adopt layer-wise KD to help the grafted layers mimic the behaviors of the teacher layers, as shown in Figure 5. In SPD, the teacher model is the fine-tuned pre-trained model. Because we feed the downstream data into the framework, the grafted network learns both general-purpose language knowledge and task-specific knowledge. We formulate the objective function as

$$\min_{W} \; \sum_{x \in D} \Big[ \sum_{i=1}^{M-1} \lambda_i \, L_{layer}\big(f_i^G(x), f_i^T(x)\big) + L_{pred}\big(f_M^S(x), f_M^T(x)\big) \Big], \tag{2}$$

where $W$ denotes the collection of weights in the first $M - 1$ layers, $D$ denotes the training dataset, $\lambda_i$ is the coefficient of the $i$-th layer loss, $L_{layer}$ is the distillation loss of a Transformer layer, $L_{pred}$ is the distillation loss between the output layers, and $f_M^T$ and $f_M^S$ represent the behaviour functions induced from the output layer of the teacher model and the student model, respectively. In back-propagation, we only update the weight parameters of the replaced student layers, using the gradient of this objective.

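After the convergence of neural network training, we find the sparse weight matrix

$$W^{*} = \Pi_{S}(W), \tag{3}$$

where $\Pi_{S}$ denotes the Euclidean projection onto the set $S$ of weight matrices that satisfy the target sparsity constraint.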
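To make the projection step concrete, the sketch below assumes the constraint set is a cardinality constraint (at most a fixed number of non-zero entries); the set itself is not spelled out in the text, so this is an assumption. Under that assumption, the Euclidean projection keeps the largest-magnitude entries and zeroes the rest, which is magnitude pruning. The function name `project_to_sparsity` is hypothetical.

```python
def project_to_sparsity(weights, pruning_rate):
    """Project a rectangular weight matrix (list of rows) onto the set of
    matrices with at most round((1 - pruning_rate) * n) non-zero entries.

    Assumption: the constraint is on the number of non-zeros. The Euclidean
    projection then keeps the largest-magnitude entries unchanged and sets
    every other entry to zero.
    """
    if not 0.0 <= pruning_rate <= 1.0:
        raise ValueError("pruning_rate must be in [0, 1]")
    if not weights:
        return []
    n_cols = len(weights[0])
    flat = [w for row in weights for w in row]
    keep = round((1.0 - pruning_rate) * len(flat))
    # Indices sorted by decreasing magnitude; the first `keep` survive.
    order = sorted(range(len(flat)), key=lambda i: abs(flat[i]), reverse=True)
    kept = set(order[:keep])
    return [
        [w if (r * n_cols + c) in kept else 0.0 for c, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]


W = [[0.5, -0.1, 2.0],
     [0.05, -1.5, 0.3]]
print(project_to_sparsity(W, 0.5))  # [[0.5, 0.0, 2.0], [0.0, -1.5, 0.0]]
```

Iterative magnitude pruning applies this kind of projection repeatedly with an increasing pruning rate, with training steps in between so the remaining weights can adapt.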

Layer-wise Knowledge Distillation.

KD is the procedure of transferring knowledge from a large model T to a compact model S. The original KD uses the logits of the teacher's output layer to represent its knowledge hinton2015distilling. Later, the activations and features of intermediate layers were also treated as knowledge that can be transferred to the student model romero2015fitnets. To further help the student model mimic the behaviors of the teacher model, we adopt the layer-wise KD approach shown in Figure 5, which enables the student model to learn the knowledge in both the intermediate layers and the output layer.

During KD, each student layer mimics the behavior of one teacher layer. Similar to jiao2020tinybert, we take advantage of the abundant knowledge in the self-attention distributions, the hidden states of each Transformer layer, and the soft logits of the teacher's final output layer to help train the student model. Specifically, we design the KD loss as follows:

$$L_{layer} = L_{hidn} + L_{attn}, \tag{4}$$

where $L_{hidn} = \mathrm{MSE}(H_i^T, H_i^S)$ indicates the difference between hidden states and $L_{attn} = \mathrm{MSE}(A_i^T, A_i^S)$ indicates the difference between attention matrices. $\mathrm{MSE}(\cdot)$ is the mean square error loss function and $i$ is the index of the Transformer layer. For the output layer, $L_{pred} = -\mathrm{softmax}(z^T) \cdot \mathrm{log\_softmax}(z^S / \tau)$ is the soft cross-entropy loss, where $z^T$ and $z^S$ are the soft logits of the teacher and student model, respectively, and $\tau$ is the temperature hyper-parameter.
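The loss terms above can be written out in plain Python for small vectors. This is a sketch under assumptions: hidden states and attention matrices are flattened into lists, the temperature divides only the student logits (as the formula in the text suggests), and all function names are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(z):
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum_exp for v in z]


def encoder_layer_loss(h_teacher, h_student, a_teacher, a_student):
    """Hidden-state MSE plus attention MSE for one Transformer layer."""
    return mse(h_teacher, h_student) + mse(a_teacher, a_student)


def prediction_loss(z_teacher, z_student, temperature=1.0):
    """Soft cross-entropy: -softmax(z_T) . log_softmax(z_S / temperature)."""
    p = softmax(z_teacher)
    log_q = log_softmax([v / temperature for v in z_student])
    return -sum(pi * lq for pi, lq in zip(p, log_q))


print(encoder_layer_loss([1.0, 2.0], [1.0, 4.0], [0.5, 0.5], [0.5, 0.5]))  # 2.0
print(prediction_loss([2.0, 0.0], [2.0, 0.0]))
```

The prediction loss is smallest when the student's distribution matches the teacher's, so minimizing it pulls the student logits toward the teacher's soft labels.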

3.2 Three-stage Sparse Progressive Distillation

Inspired by shen2021progressive, which leverages module grafting to train the student model in the few-shot KD scenario, we propose to use layer replacing in sparse pruning to mitigate overfitting under the pretrain-and-finetune paradigm. Figure 4 gives an example of how the three schedulers work: the replacing probability scheduler $p(t)$, the pruning rate scheduler $s(t)$ (where $s^{*}$ is the target pruning rate), and the learning rate scheduler $lr(t)$. For the pruning scheduler, we perform $n$ rounds of iterative magnitude pruning chen2020lottery on the pre-trained model, as shown in Figure 4. The pruning rate set for the $n$ rounds is $(s_1, s_2, s_3, \dots, s_n)$, as shown in Figure 3, with $s_n = s^{*}$.

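For the remaining two schedulers, we design them over the three stages as follows:

$$p(t) = \begin{cases} p_0, & 0 \le t \le T_1, \\ \min\big(1,\; k\,(t - T_1) + p_0\big), & T_1 < t \le T_2, \\ 1, & T_2 < t \le T_3, \end{cases} \tag{5}$$

$$lr(t) = \begin{cases} k_1 t + lr_1, & 0 \le t \le T_1, \\ k_2 (t - T_1) + lr_2, & T_1 < t \le T_3, \end{cases} \tag{6}$$

where $t$ is the training step; $T_1$, $T_2$, and $T_3$ are the ending steps of the three stages, respectively; $p_0$ is the constant replacing probability in the first stage; $k$ is the slope of the replacing probability curve in the second stage; $lr_1$ and $lr_2$ are the initial learning rates of the first and second stages, respectively; and $k_1$ is the slope of the learning rate curve in the first stage, while $k_2$ is the slope shared by the second and third stages.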
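The two piecewise-linear schedulers can be sketched as plain functions. The sketch follows the description above (constant probability in Stage I, linear increase capped at 1 in Stage II, 1 in Stage III; one linear learning rate segment for Stage I and one shared by Stages II and III). The function names and all numeric values in the usage example are hypothetical, not the paper's settings.

```python
def replacing_probability(t, t1, t2, p0, slope):
    """Replacing probability at training step t.

    Stage I   (t <= t1):      constant p0.
    Stage II  (t1 < t <= t2): linear increase from p0, capped at 1.0.
    Stage III (t > t2):       1.0.
    """
    if t <= t1:
        return p0
    if t <= t2:
        return min(1.0, p0 + slope * (t - t1))
    return 1.0


def learning_rate(t, t1, lr1, lr2, k1, k2):
    """Learning rate at step t: one linear segment for Stage I and a second
    one, restarted from lr2, shared by Stages II and III. Negative slopes
    give a decreasing schedule."""
    if t <= t1:
        return lr1 + k1 * t
    return lr2 + k2 * (t - t1)


# Hypothetical setup: Stage I ends at step 100, Stage II at step 200.
T1, T2, P0 = 100, 200, 0.6
SLOPE = (1.0 - P0) / (T2 - T1)  # chosen so the probability reaches 1 at T2
for step in (0, 100, 150, 200, 250):
    print(step, round(replacing_probability(step, T1, T2, P0, SLOPE), 3))
```

Choosing the slope as `(1 - p0) / (T2 - T1)` makes the replacing probability reach exactly 1 at the end of Stage II, so Stage III starts with a grafted model made up entirely of student layers.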

  Input: teacher model $f^T$: fine-tuned BERT; student model $f^S$, grafted model $f^G$: initialized by teacher model
  Set $T_1$, $T_2$, $T_3$ as the ending training steps of the three stages, respectively
  Set $p_{\mathrm{I}}$, $p_{\mathrm{II}}$ as the replacing probability schedulers of Stage I and Stage II, respectively
  Output: student model $f^S$
  for t = 0 to $T_3$ do
      Stage I, Stage II, Stage III:
      Calculate distillation loss $L$
      Update weight parameters in $f^G$
      Copy pruned layers of grafted model to student
      Stage I:
      if 0 $\le$ t $\le$ $T_1$ then
          Prune student layers
          Update grafted model using $p_{\mathrm{I}}$
      end if
      Stage II:
      if $T_1$ < t $\le$ $T_2$ then
          Update grafted model using $p_{\mathrm{II}}$
      end if
  end for
  return $f^S$ at the end of Stage III
Algorithm 1 Three-stage sparse progressive distillation

Stage I: Iterative Pruning with Constant Replacing Probability. In this stage, we apply iterative magnitude pruning frankle2019lottery. The pruning rate curve with respect to training steps is depicted in Figure 4 (b). We graft the pruned layers of the student model onto the grafted model with a constant replacing probability throughout Stage I. Unlike model compression methods that update all model parameters at once, such as TinyBERT jiao2020tinybert and DistilBERT sanh2019distilbert, SPD only updates the student layers of the grafted model. This reduces the complexity of network optimization, which mitigates the overfitting issue and enables the student layers to learn deeper knowledge from the teacher model. In addition, we use a linear learning rate scheduler that decreases in Stage I, as shown in the first piece of Equation 6 and in Figure 4 (c).

Stage II: Progressive Replacement with Increasing Probability. In the second stage, we aim to enable the student layers to orchestrate with each other. To achieve this goal, we incrementally enlarge the percentage of student layers in the grafted model as depicted in Figure 4 (a). Note that in this stage, the pruning rate of the student model has reached the target value and remains constant, thus there are only two schedulers (i.e., replacing probability scheduler and learning rate scheduler) for parameter tuning, which helps mitigate the interference between different strategies.

Stage III: Knowledge Distillation between Teacher and Student. In the third stage, we continue the KD between the teacher model and the grafted model with the replacing probability = 1.0. In other words, the grafted model consists entirely of the layers of the student model in this stage. The extended KD further allows all the compressed layers of the student model to adapt themselves to each other, which helps improve the performance of the compressed student model. Additionally, we exploit the layer-wise KD throughout all the three stages as stated in Algorithm 1. Figure 5 gives an overview of the layer-wise KD used in our proposed method.
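The control flow of the three stages can be summarized as a small helper that reports what happens at a given training step. It mirrors the structure of Algorithm 1 only: distillation, student-layer updates, and copying back to the student happen at every step; pruning happens only in Stage I; and the replacing probability is constant, increasing, or fixed at 1.0 depending on the stage. The function name `spd_step_plan` and the dictionary keys are hypothetical.

```python
def spd_step_plan(t, t1, t2, t3):
    """Describe the operations performed at training step t.

    t1, t2, t3 are the ending steps of Stage I, II and III.
    """
    if not 0 <= t <= t3:
        raise ValueError("t must lie in [0, t3]")
    # Performed in every stage.
    plan = {
        "distill": True,
        "update_student_layers": True,
        "copy_to_student": True,
    }
    if t <= t1:
        plan.update(stage="I", prune=True, probability="constant")
    elif t <= t2:
        plan.update(stage="II", prune=False, probability="increasing")
    else:
        plan.update(stage="III", prune=False, probability="fixed at 1.0")
    return plan


for step in (0, 15, 25):
    print(step, spd_step_plan(step, 10, 20, 30))
```

Keeping pruning confined to Stage I means that later stages tune at most two schedules at a time, which is the separation of strategies the three-stage design aims for.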

4 Evaluation

4.1 Experimental Setup

Datasets. We evaluate the proposed sparse progressive distillation on the General Language Understanding Evaluation (GLUE) benchmark wang2018glue, which is grouped into three categories of natural language understanding tasks (single-sentence tasks, similarity matching tasks, and natural language inference tasks) according to the purpose of the tasks and the difficulty level of the datasets. We compare our method with the baselines on two single-sentence tasks (CoLA warstadt2018neural and SST-2 socher-etal-2013-recursive), two similarity matching tasks (MRPC dolan2005automatically and STS-B cer-etal-2017-semeval), and two inference tasks (QNLI and RTE). For all of these GLUE tasks, we report results following the evaluation metrics in wang2018glue. More specifically, we report accuracy for SST-2, QNLI and RTE, and F1 for MRPC. For STS-B, we report the Spearman correlation. For CoLA, Matthews correlation is used as the evaluation metric.
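For reference, the four metrics can be computed from scratch as follows. This is a self-contained sketch using the standard textbook definitions, not the evaluation code used for the experiments; the Spearman helper assumes no tied values, and all function names are hypothetical.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions from binary confusion counts."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for binary labels (0.0 if undefined)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def spearman_no_ties(xs, ys):
    """Spearman rank correlation, assuming no ties and at least two points."""
    n = len(xs)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        result = [0] * n
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


print(f1_score(tp=8, fp=2, fn=2))                 # 0.8
print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))  # 1.0
print(spearman_no_ties([1, 2, 3], [30, 20, 10]))  # -1.0
```

Matthews correlation uses all four confusion counts, which is why it is preferred for the class-imbalanced CoLA task, while Spearman correlation only depends on the ranking of the predicted similarity scores.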

Baselines. As shown in Table 1, three groups of baselines are adopted for comparison, according to whether sparse pruning and a progressive strategy are used (i.e., non-sparse, non-progressive; non-sparse, progressive; sparse, non-progressive). For models using neither sparse pruning nor a progressive strategy, we select BERT-PKD sun2019patient, DistilBERT sanh2019distilbert, MiniLM wang2020minilm, and TinyBERT jiao2020tinybert as baselines. We also compare with a model that uses a progressive strategy but not sparse pruning, BERT-of-Theseus xu-etal-2020-bert, and a model that uses sparse pruning but no progressive strategy, BERT-tickets chen2020lottery. Furthermore, we compare with lighter models that have compressed encoder parameters, such as 4-layer DistilBERT, 4-layer TinyBERT, CompressBERT and RPP.

4.2 Implementation Details

We use the fine-tuned BERT as the teacher and also initialize the student with the fine-tuned BERT. Specifically, we first fine-tune the pre-trained BERT on six GLUE tasks wang2018glue (QNLI, SST-2, CoLA, STS-B, MRPC, and RTE) for four epochs. We select the best-performing learning rate from a set of candidate values. The batch size and maximum sequence length are set to 32 and 128, respectively. After fine-tuning, the BERT model is used as the teacher model. In the first stage of the proposed SPD, we vary the number of epochs from 10 to 30. Each student layer has a constant probability in (0, 1] of being grafted onto the teacher. We fix the learning rate and run the experiments with different replacing probabilities from {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. The probability with the best performance is adopted in the first stage. In the second stage, the number of training epochs is set to 10. We adjust the slope of the replacing probability curve so that the replacing probability equals 1 at the end of this stage. We use the same learning rate scheduler for the last two stages, with the learning rate chosen from five candidate values. Model training and evaluation are performed using Python 3.6.8, torch 1.8.0 and CUDA 11.1 on a Quadro RTX6000 GPU and an Intel(R) Xeon(R) Gold 6244 @ 3.60GHz CPU.

(a) No progressive, no KD
(b) Progressive, no KD
(c) No progressive, KD
(d) Progressive, KD (ours)
Figure 6: Comparison of the overfitting problem of the four methods on MRPC.

4.3 Experimental Results

Accuracy vs. Pruning Rate. We evaluate SPD on BERT and compare it with three groups of baselines (i.e., non-sparse and non-progressive; non-sparse and progressive; sparse and non-progressive) on six GLUE benchmark tasks (shown in Table 1).

Method        #Param  QNLI   SST-2  CoLA   STS-B   MRPC   RTE    Avg.
                      (Acc)  (Acc)  (Mcc)  (Spea)  (F1)   (Acc)
BERT          109M    91.6   92.9   57.9   89.1    90.2   72.2   82.3
Non-sparse, non-progressive
BERT-PKD      67M     88.4   91.0   45.5   86.2    85.7   66.5   76.1
DistilBERT    67M     89.2   92.7   51.3   86.9    87.5   59.9   77.9
MiniLM        67M     91.0   92.0   49.2   -       88.4   71.5   -
TinyBERT      67M     91.1   93.0   54.0   90.1    90.6   73.4   82.0
Non-sparse, progressive
Theseus       67M     89.5   91.5   51.1   88.7    89.0   68.2   79.7
Sparse, non-progressive
BERT-tickets  67M     -      -      53.8   88.2    84.9   66.0   -
SPD (ours)    67M     92.0   93.0   61.4   90.1    90.7   72.2   83.2
Table 1: Results are evaluated on the dev set of the GLUE benchmark. The results of DistilBERT and TinyBERT are taken from jiao2020tinybert. Mcc refers to Matthews correlation and Spea refers to Spearman correlation.
Method                 Remain.   CoLA   STS-B   MRPC   RTE
                       Weights   (Mcc)  (Spea)  (F1)   (Acc)
BERT (Teacher)         -         57.9   89.1    90.2   72.2
TinyBERT (w/o. DA)     18%       29.8   -       82.4   -
RPP (w/o. DA)          11.6%     -      -       81.9   67.5
SparseBERT (w/o. DA)   5%        18.1   32.2    81.5   47.3
SPD (ours) (w/o. DA)   10%       48.7   87.8    89.9   69.0
SPD (ours) (w/o. DA)   5%        42.1   85.2    88.7   56.7
Table 2: Results are evaluated on the dev set of the GLUE benchmark under higher pruning rates. DA denotes data augmentation.

Compared with the non-sparse, non-progressive baselines, SPD exceeds all of them on QNLI, CoLA and MRPC, and matches the best of them on SST-2 and STS-B. On RTE, TinyBERT has a 1.6% (relative) higher accuracy than SPD. However, TinyBERT used augmented data, while SPD does not use data augmentation to generate the results in Table 1. On average, SPD achieves relative improvements of 9.3%, 6.8%, and 1.5% over BERT-PKD, DistilBERT, and TinyBERT, respectively. Furthermore, on CoLA, SPD achieves up to 34.9% higher performance than the four non-sparse, non-progressive baselines. For the non-sparse, progressive baseline, we compare SPD with BERT-of-Theseus. Experimental results show that SPD exceeds the latter on all tasks, with a 4.5% increase on average. Among all the tasks, CoLA and RTE show gains of 20.2% and 5.9%, respectively. Compared with the sparse, non-progressive baseline, SPD shows improvements of 14.1%, 2.2%, 6.8%, and 9.4% on CoLA, STS-B, MRPC and RTE, respectively.

On all listed tasks except RTE, SPD even outperforms the teacher model. On RTE, SPD retains exactly the accuracy of the teacher model. On average, the proposed SPD achieves a 1.1% higher accuracy/score than the teacher model. We attribute this performance to three factors: 1) There is redundancy in the original dense BERT model, so pruning the model at a low pruning rate (50%) does not lead to a significant performance drop. 2) SPD mitigates overfitting, which helps the student model learn better. 3) Progressive methods train for more epochs than typical non-progressive methods, which enables SPD to obtain a better student model.

We also compare SPD with other baselines (i.e., 4-layer TinyBERT jiao2020tinybert, RPP guo2019reweighted, and SparseBERT xu2021rethinking) under higher pruning rates. The results are shown in Table 2. For fairness of comparison, we remove data augmentation from the above methods. Experimental results show that SPD outperforms the three baselines in both performance and pruning rate. Compared with TinyBERT, both SPD (10% remaining weights) and SPD (5% remaining weights) win. SPD (10% remaining weights) scores 63.4% and 9.1% higher than TinyBERT on CoLA and MRPC, respectively. Even when compressed to 5% remaining weights in the backbone encoders, SPD outperforms TinyBERT by 41.3% and 7.6% on the same two tasks. Compared with RPP, SPD (10% remaining weights) and SPD (5% remaining weights) both have advantages on MRPC, with 9.8% and 8.3% higher F1 scores, respectively. SPD exceeds SparseBERT on all tasks in Table 2. In particular, on CoLA, SPD (10% remaining weights) and SPD (5% remaining weights) reach 2.69× and 2.33× the Mcc score of SparseBERT, respectively. SparseBERT is competitive with the SOTA when using data augmentation; its performance drop here may be due to its limited ability to mitigate overfitting.
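The percentages quoted in these comparisons are relative gains over the baseline score, and the CoLA comparison with SparseBERT is a ratio. The short check below reproduces a few of them from the values in Table 1 and Table 2; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Table 2, CoLA (Mcc): SPD with 10% remaining weights vs. 4-layer TinyBERT.
print(round(relative_gain(48.7, 29.8), 1))  # 63.4

# Table 2, CoLA (Mcc): SPD vs. SparseBERT, expressed as a ratio.
print(round(48.7 / 18.1, 2))  # 2.69
print(round(42.1 / 18.1, 2))  # 2.33

# Table 1, average score: SPD vs. BERT-PKD.
print(round(relative_gain(83.2, 76.1), 1))  # 9.3
```

Reading the numbers this way explains why a 1.2-point accuracy difference on RTE is reported as a 1.6% difference: the gap is divided by the baseline score rather than stated in absolute points.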

Figure 7: Stage I with replacing probability scheduler vs. Stage I w/o. replacing probability scheduler on RTE (dev set).
Figure 8: Sensitivity analysis of replacing probability on RTE (dev set).

Overfitting Mitigation. We investigate the effect of SPD on mitigating the overfitting issue. Depending on whether the progressive strategy and KD are used, we compare four strategies: (a) no progressive, no KD; (b) progressive, no KD; (c) no progressive, KD; (d) progressive, KD (ours). We evaluate the strategies on both the training set and the validation set (i.e., dev set) of MRPC. The results are shown in Figure 6. The gap between the evaluation results on the training set and the dev set decreases from (a) to (d), which is a strong indication that SPD outperforms the other strategies in mitigating overfitting. Figure 6 (b) and (c) indicate that, compared to progressive only, KD has a bigger impact on mitigating overfitting, as the gap between the dev and training sets with KD only is much smaller than with progressive only. From Figure 6 (a), (b) and (c), we observe that, compared to no progressive and no KD, using either progressive (Figure 6 (b)) or KD (Figure 6 (c)) significantly helps mitigate overfitting. Figure 6 (b), (c) and (d) indicate that combining progressive and KD brings larger benefits than using either alone, as Figure 6 (d) has the smallest gap between the dev and training sets. Together with the experimental results in Table 1 and Table 2, Figure 6 shows that SPD mitigates overfitting and leads to higher performance of the sparse model.

Figure 9: One learning rate scheduler vs. two learning rate schedulers on RTE (dev set).

4.4 Ablation Studies

In this section, we justify the three schedulers used in our method (i.e., the replacing probability, the pruning rate, and the learning rate) and study the sensitivity of our method with respect to each of them.

Figure 10: Influence of the pruning with different ending stages on MRPC (dev set).

Effects of Replacing Probability Scheduler Strategy. In our method, we set the replacing probability in Stage I greater than 0 in order to allow student layers to orchestrate with each other. To verify the benefit of this design, we change the replacing probability to zero and compare it with our method. The result on RTE is shown in Figure 7. Stage I with a replacing probability scheduler (the red curve) shows better performance than Stage I without one, which justifies the scheduler in Stage I. In addition, we study the sensitivity of our method with respect to the value of the replacing probability (Figure 8). We observe that a replacing probability of 0.6 achieves the best performance and that the progressive design is better than the non-progressive one.

Effects of Pruning Rate Scheduler Strategy. For the pruning rate scheduler, we compare the strategies with different pruning ending steps. The results are shown in Figure 10. It is observed that the pruning till the end of Stage I has higher F1 score than other strategies on MRPC.

Effects of Learning Rate Scheduler Strategy. We compare our strategy with the strategy that only has one learning rate scheduler. The results on RTE (Figure 9) indicate that our strategy (i.e., two independent learning rate schedulers) is better. We also evaluate different learning rates on RTE with the pruning rate of 0.9, the replacing probability of 0.8.

5 Conclusion

In this paper, we propose SPD, a sparse progressive distillation method, to address the overfitting issue that arises when Transformer-based language models are pruned under the pretrain-and-finetune paradigm. SPD includes three stages. In the first stage, iterative magnitude pruning is applied to the student layers and the pruned layers are randomly grafted onto the teacher model with a constant replacing probability ($\le 1.0$). KD is performed between the grafted model and the teacher. In the second stage, the replacing probability is progressively increased to 1.0, allowing more student layers to orchestrate with each other. In the last stage, the replacing probability is set to 1.0 and KD continues between the teacher and the grafted model. A series of ablation studies verifies the effectiveness of the three strategies used in SPD. Experimental results on six datasets from the GLUE benchmark demonstrate the superiority of SPD over the leading competitors.


6 Appendix

In this section, we provide the sensitivity analysis of learning rate on RTE and STS-B (on dev set) and the evaluation curves of four tasks (i.e., CoLA, STS-B, MRPC, RTE) with the target pruning rate of 0.95.

Sensitivity Analysis of Learning Rate. The analysis results on RTE and STS-B are shown in Figure 11 and Figure 12, respectively. Results vary with different learning rate settings. Among the eight learning rates listed in the legend of Figure 11, achieves the best performance. For STS-B, gives the best performance among the learning rate choices in Figure 12.

Evaluation Curves of Four Tasks at Target Pruning rate of 0.95. Here, we plot the evaluation curves of CoLA (shown as Figure 13), STS-B (shown as Figure 14), MRPC (shown as Figure 15), RTE (shown as Figure 16) to provide extra evidence to support competitive performance of SPD at Remain. Weights of 5% in Table 2. For each dataset, the x-axis is the training steps while the y-axis depends on the corresponding evaluation metrics. To obtain the curves, we use the same settings as the ones for getting results in Table 2.

Next, we will provide the detailed hyper-parameters settings we used. For CoLA, we set the max sequence length as 128, the learning rate as , the replacing probability in Stage I as 0.8, the number of training epochs as 60, the number of pruning epochs as 30. For STS-B, we use the same setting as CoLA. For MRPC, we set the max sequence length as 128, the learning rate as , the replacing probability in Stage I as 0.8, the number of training epochs as 60, the number of pruning epochs as 30. For RTE, we set the max sequence length as 128, the learning rate as , the replacing probability in Stage I as 0.6, the number of training epochs as 60, the number of pruning epochs as 30.

Figure 11: Sensitivity analysis of learning rate on RTE (dev set).
Figure 12: Sensitivity analysis of learning rate on STS-B (dev set).
Figure 13: Evaluation on CoLA (dev set). Target sparsity is 0.95.
Figure 14: Evaluation on STS-B (dev set). Target sparsity is 0.95.
Figure 15: Evaluation on MRPC (dev set). Target sparsity is 0.95.
Figure 16: Evaluation on RTE (dev set). Target sparsity is 0.95.