[NeurIPS 2022] "A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models", Yuanxin Liu, Fandong Meng, Zheng Lin, Jiangnan Li, Peng Fu, Yanan Cao, Weiping Wang, Jie Zhou
Despite the remarkable success of pre-trained language models (PLMs), they still face two challenges: First, large-scale PLMs are inefficient in terms of memory footprint and computation. Second, on the downstream tasks, PLMs tend to rely on the dataset bias and struggle to generalize to out-of-distribution (OOD) data. In response to the efficiency problem, recent studies show that dense PLMs can be replaced with sparse subnetworks without hurting the performance. Such subnetworks can be found in three scenarios: 1) in the fine-tuned PLMs, 2) in the raw PLMs, followed by fine-tuning in isolation, and even 3) in PLMs without any parameter fine-tuning. However, these results are only obtained in the in-distribution (ID) setting. In this paper, we extend the study on PLMs subnetworks to the OOD setting, investigating whether sparsity and robustness to dataset bias can be achieved simultaneously. To this end, we conduct extensive experiments with the pre-trained BERT model on three natural language understanding (NLU) tasks. Our results demonstrate that sparse and robust subnetworks (SRNets) can consistently be found in BERT, across the aforementioned three scenarios, using different training and compression methods. Furthermore, we explore the upper bound of SRNets using the OOD information and show that there exist sparse and almost unbiased BERT subnetworks. Finally, we present 1) an analytical study that provides insights on how to promote the efficiency of SRNets searching process and 2) a solution to improve subnetworks' performance at high sparsity. The code is available at https://github.com/llyx97/sparse-and-robust-PLM.
Pre-trained language models (PLMs) have enjoyed impressive success in natural language processing (NLP) tasks. However, they still face two major problems. On the one hand, the prohibitive model size of PLMs leads to poor efficiency in terms of memory footprint and computational cost Ganesh et al. (2021); Strubell et al. (2019). On the other hand, despite being pre-trained on large-scale corpora, PLMs still tend to rely on dataset bias Gururangan et al. (2018); McCoy et al. (2019); Zhang et al. (2019); Schuster et al. (2019), i.e., the spurious features of input examples that strongly correlate with the label, during downstream fine-tuning. These two problems pose great challenges to the real-world deployment of PLMs, and they have triggered two separate lines of work.
In terms of the efficiency problem, some recent studies resort to sparse subnetworks as alternatives to the dense PLMs. Li et al. (2020); Michel et al. (2019); Liu et al. (2021a) compress the fine-tuned PLMs in a post-hoc fashion. Chen et al. (2020); Prasanna et al. (2020); Liu et al. (2022); Liang et al. (2021) extend the Lottery Ticket Hypothesis (LTH) Frankle and Carbin (2019) to search for PLM subnetworks that can be fine-tuned in isolation. Taking one step further, Zhao et al. (2020) propose to learn task-specific subnetwork structures via mask training Hubara et al. (2016); Mallya et al. (2018), without fine-tuning any pre-trained parameter. Fig. 1 illustrates these three paradigms. Encouragingly, the empirical evidence suggests that PLMs can indeed be replaced with sparse subnetworks without compromising the in-distribution (ID) performance.
To address the dataset bias problem, numerous debiasing methods have been proposed. A prevailing category of debiasing methods Clark et al. (2019); Utama et al. (2020a); Karimi Mahabadi et al. (2020); He et al. (2019); Schuster et al. (2019); Ghaddar et al. (2021); Utama et al. (2020b)
adjust the importance of training examples, in terms of training loss, according to their bias degree, so as to reduce the impact of biased examples (examples that can be correctly classified based on the spurious features). As a result, the model is forced to rely less on the dataset bias during training and generalizes better to OOD situations.
Although progress has been made in both directions, most existing work tackles the two problems independently. To facilitate real-world application of PLMs, the problems of robustness and efficiency should be addressed simultaneously. Motivated by this, we extend the study on PLM subnetworks to the OOD scenario, investigating whether there exist PLM subnetworks that are both sparse and robust against dataset bias. To answer this question, we conduct large-scale experiments with the pre-trained BERT model Devlin et al. (2019) on three natural language understanding (NLU) tasks that are widely studied in the context of dataset bias. We consider a variety of setups including the three pruning and fine-tuning paradigms, standard and debiasing training objectives, different model pruning methods, and different variants of PLMs from the BERT family. Our results show that BERT does contain sparse and robust subnetworks (SRNets) within a certain sparsity constraint (e.g., less than 70%), giving an affirmative answer to the above question. Compared with a standard fine-tuned BERT, SRNets exhibit comparable ID performance and remarkable OOD improvement. When it comes to a BERT model fine-tuned with a debiasing method, SRNets can preserve the full model’s ID and OOD performance with far fewer parameters. On this basis, we further explore the upper bound of SRNets by making use of the OOD information, which reveals that there exist sparse and almost unbiased subnetworks, even in a standard fine-tuned BERT that is biased.
Despite the intriguing properties of SRNets, we find that the subnetwork searching process still has room for improvement, based on some observations from the above experiments. First, we study the timing to start searching for SRNets during full BERT fine-tuning, and find that the entire training and searching cost can be reduced from this perspective. Second, we refine the mask training method with gradual sparsity increase, which is quite effective in identifying SRNets at high sparsity.
Our main contributions are summarized as follows:
We extend the study on PLM subnetworks to the OOD scenario. To our knowledge, this paper presents the first systematic study on sparsity and dataset bias robustness for PLMs.
We conduct extensive experiments to demonstrate the existence of sparse and robust BERT subnetworks, across different pruning and fine-tuning setups. By using the OOD information, we further reveal that there exist sparse and almost unbiased BERT subnetworks.
We present analytical studies and solutions that can help further refine the SRNets searching process in terms of efficiency and the performance of subnetworks at high sparsity.
Studies on BERT compression can be divided into two classes. The first one focuses on the design of model compression techniques, which include pruning Gordon et al. (2020); Michel et al. (2019); Gale et al. (2019), knowledge distillation Sanh et al. (2019); Sun et al. (2019); Jiao et al. (2020); Liu et al. (2021b), parameter sharing Lan et al. (2020), quantization Zafrir et al. (2019); Zhang et al. (2020), and combining multiple techniques Tambe et al. (2020); Mao et al. (2020); Liu et al. (2021a). The second one, which is based on the lottery ticket hypothesis Frankle and Carbin (2019), investigates the compressibility of BERT at different phases of the pre-training and fine-tuning paradigm. It has been shown that BERT can be pruned to a sparse subnetwork after fine-tuning Gale et al. (2019) and before fine-tuning Chen et al. (2020); Prasanna et al. (2020); Liang et al. (2021); Liu et al. (2022); Gordon et al. (2020), without hurting the accuracy. Moreover, Zhao et al. (2020) show that directly learning subnetwork structures on the pre-trained weights can match fine-tuning the full BERT. In this paper, we follow the second branch of work, and extend the evaluation of BERT subnetworks to the OOD scenario.
To facilitate the development of NLP systems that truly learn the intended task solution, instead of relying on dataset bias, many efforts have been made recently. On the one hand, challenging OOD test sets are constructed Gururangan et al. (2018); McCoy et al. (2019); Zhang et al. (2019); Schuster et al. (2019); Agrawal et al. (2018) by eliminating the spurious correlations in the training sets, in order to establish stricter evaluation. On the other hand, numerous debiasing methods Clark et al. (2019); Utama et al. (2020a); Karimi Mahabadi et al. (2020); He et al. (2019); Schuster et al. (2019); Ghaddar et al. (2021); Utama et al. (2020b) are proposed to discourage the model from learning dataset bias during training. However, little attention has been paid to the influence of pruning on the OOD generalization ability of PLMs. This work presents a systematic study on this question.
Some pioneering attempts have also been made to obtain models that are both compact and robust to adversarial attacks Gui et al. (2019); Ye et al. (2019); Sehwag et al. (2020); Fu et al. (2021); Xu et al. (2021) and spurious correlations Zhang et al. (2021); Du et al. (2021). Specifically, Xu et al. (2021); Du et al. (2021) study the compression and robustness question on PLMs. Different from Xu et al. (2021), which is based on adversarial robustness, we focus on spurious correlations, which are more common than the worst-case adversarial attack. Compared with Du et al. (2021), which focuses on post-hoc pruning of the standard fine-tuned BERT, we thoroughly investigate different fine-tuning methods (standard and debiasing) and subnetworks obtained from the three pruning and fine-tuning paradigms. A more detailed discussion of the relation and difference between our work and previous studies on model compression and robustness is provided in Appendix D.
BERT is composed of an embedding layer, a stack of Transformer layers Vaswani et al. (2017) and a task-specific classifier. Each Transformer layer has a multi-head self-attention (MHAtt) module and a feed-forward network (FFN). MHAtt has four kinds of weight matrices, i.e., the query, key and value matrices $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$, and the output matrix $W_O \in \mathbb{R}^{d \times d}$. FFN consists of two linear layers $W_1 \in \mathbb{R}^{d \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d}$, where $d_{ff}$ is the hidden dimension of FFN.
To obtain the subnetwork of a model parameterized by $\theta$, we apply a binary pruning mask $m \in \{0, 1\}^{|\theta|}$ to its weight matrices, which produces $m \odot \theta$, where $\odot$ is the Hadamard product. For BERT, we focus on the Transformer layers and the classifier. The parameters to be pruned are $\{W_Q, W_K, W_V, W_O, W_1, W_2, W_{cls}\}$, where $W_{cls}$ is the classifier weights.
Magnitude-based pruning Han et al. (2015); Frankle and Carbin (2019) zeroes out parameters with low absolute values. It is usually realized in an iterative manner, namely, iterative magnitude pruning (IMP). IMP alternates between pruning and training and gradually increases the sparsity of subnetworks. Specifically, a typical IMP algorithm consists of four steps: (i) Train the full model to convergence. (ii) Prune a fraction of the parameters with the smallest magnitude. (iii) Re-train the pruned subnetwork. (iv) Repeat (ii)-(iii) until reaching the target sparsity. To obtain subnetworks from the pre-trained BERT, i.e., (b) and (c) in Fig. 1, the subnetwork parameters are rewound to the pre-trained values after (iii), and (i) can be omitted. More details about our IMP implementations can be found in Appendix A.1.1.
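The IMP loop above can be sketched as follows. This is a minimal PyTorch sketch: `train_fn`, the 10% per-iteration ratio, and the global magnitude ranking are illustrative assumptions, not the paper's exact implementation (see Appendix A.1.1 of the paper for those details).

```python
import torch

def global_magnitude_prune(params, masks, sparsity):
    # Score unpruned weights by magnitude (already-pruned entries score 0)
    # and keep only the largest ones, so that a `sparsity` fraction of all
    # weights is zeroed out.
    scores = [p.abs() * m for p, m in zip(params, masks)]
    flat = torch.cat([s.flatten() for s in scores])
    k = int(sparsity * flat.numel())
    if k == 0:
        return masks
    threshold = torch.kthvalue(flat, k).values
    return [(s > threshold).float() for s in scores]

def imp(params, train_fn, target_sparsity, ratio=0.1, rewind_to=None):
    masks = [torch.ones_like(p) for p in params]
    train_fn(params, masks)                        # (i) train the full model
    sparsity = 0.0
    while sparsity < target_sparsity:
        sparsity = min(sparsity + ratio, target_sparsity)
        masks = global_magnitude_prune(params, masks, sparsity)  # (ii) prune
        if rewind_to is not None:                  # optional weight rewinding
            for p, p0 in zip(params, rewind_to):
                p.data.copy_(p0)
        train_fn(params, masks)                    # (iii) re-train subnetwork
    return masks                                   # (iv) loop until target
```

Passing `rewind_to` the pre-trained weights gives the "imp-rw" variant used for subnetworks of the pre-trained BERT; leaving it `None` gives standard IMP.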
Each weight matrix $W$, which is frozen during mask training, is associated with a binary mask $m$ and a real-valued mask $\hat{m}$. In the forward pass, $W$ is replaced with $m \odot W$, where $m$ is derived from $\hat{m}$ through binarization:
$$m_{i,j} = \begin{cases} 1 & \text{if } \hat{m}_{i,j} \ge \phi \\ 0 & \text{otherwise} \end{cases}$$
where $\phi$ is the threshold. In the backward pass, since the binarization operation is not differentiable, we use the straight-through estimator Bengio et al. (2013) to compute the gradients for $\hat{m}$ using the gradients of $m$, i.e., $\frac{\partial \mathcal{L}}{\partial \hat{m}} = \frac{\partial \mathcal{L}}{\partial m}$, where $\mathcal{L}$ is the loss. Then, $\hat{m}$ is updated as $\hat{m} \leftarrow \hat{m} - \eta \frac{\partial \mathcal{L}}{\partial \hat{m}}$, where $\eta$ is the learning rate.
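This forward-binarize / backward-straight-through scheme can be sketched in PyTorch as follows. It is a minimal illustration: the initialization value 0.6 and the threshold 0.5 are arbitrary placeholders, not the paper's settings.

```python
import torch

class Binarize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, m_hat, phi):
        # Forward: hard-threshold the real-valued mask into {0, 1}.
        return (m_hat >= phi).float()

    @staticmethod
    def backward(ctx, grad_out):
        # Backward: straight-through estimator -- pass the gradient of the
        # binary mask directly to the real-valued mask; no gradient for phi.
        return grad_out, None

class MaskedLinear(torch.nn.Module):
    """A linear layer whose frozen weight is gated by a trainable mask."""
    def __init__(self, weight, phi=0.5):
        super().__init__()
        self.weight = torch.nn.Parameter(weight, requires_grad=False)
        self.m_hat = torch.nn.Parameter(torch.full_like(weight, 0.6))
        self.phi = phi

    def forward(self, x):
        m = Binarize.apply(self.m_hat, self.phi)
        return x @ (m * self.weight).t()
```

Because only `m_hat` requires gradients, an ordinary optimizer step on the model updates the subnetwork structure while leaving the pre-trained weights untouched.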
As described in the Introduction, the debiasing methods measure the bias degree of training examples. This is achieved by training a bias model. The inputs to the bias model are hand-crafted spurious features based on our prior knowledge of the dataset bias (Section 4.1.3 describes the details). In this way, the bias model mainly relies on the spurious features to make predictions, which can then serve as a measurement of the bias degree. Specifically, given the bias model's predicted distribution $p_b$ over the classes, the bias degree is defined as $\beta = p_b(y)$, i.e., the probability of the ground-truth class $y$.
Then, $\beta$ can be used to adjust the training loss in several ways, including product-of-experts (PoE) Clark et al. (2019); He et al. (2019); Karimi Mahabadi et al. (2020), example reweighting Schuster et al. (2019); Ghaddar et al. (2021) and confidence regularization Utama et al. (2020a). Here we describe the standard cross-entropy and PoE, and the other two methods are introduced in Appendix A.2.
Standard Cross-Entropy computes the cross-entropy between the predicted distribution $p$ and the ground-truth one-hot distribution as $\mathcal{L}_{CE} = -\log p(y)$.
Product-of-Experts combines the predictions of the main model and the bias model, i.e., $p$ and $p_b$, and then computes the training loss as $\mathcal{L}_{PoE} = -\log \mathrm{softmax}(\log p + \log p_b)(y)$.
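In the PoE formulation of Clark et al. (2019), the combined prediction is the renormalized product of the two experts' probabilities, which is a sum in log space. A minimal PyTorch sketch (the function name and tensor layout are our own):

```python
import torch
import torch.nn.functional as F

def poe_loss(main_logits, bias_log_probs, labels):
    # Sum the log-probabilities of the two "experts"; the softmax inside
    # cross_entropy renormalizes the product distribution. The bias model
    # is fixed, so gradients flow only through main_logits.
    combined = F.log_softmax(main_logits, dim=-1) + bias_log_probs
    return F.cross_entropy(combined, labels)
```

When the bias model is uniform (uninformative), the loss reduces to the standard cross-entropy; when the bias model is confidently correct on an example, that example's gradient contribution shrinks, discouraging the main model from relying on the spurious features.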
Here we define some notations, which will be used in the following sections.
$\mathcal{F}(\theta, \mathcal{L}, t)$: training $\theta$ with loss $\mathcal{L}$ for $t$ steps, where $t$ can be omitted for simplicity.
$\mathcal{P}_{p}(\theta, \mathcal{L})$: pruning $\theta$ using pruning method $p$ and training loss $\mathcal{L}$.
$\mathcal{M}(\theta')$: extracting the pruning mask of subnetwork $\theta'$, i.e., $m = \mathcal{M}(\theta')$.
$p \in \{\text{imp}, \text{imp-rw}, \text{mask}\}$, where “imp” and “imp-rw” denote the standard IMP and IMP with weight rewinding, as described in Section 3.2.1, and “mask” stands for mask training.
$\mathcal{E}(\theta, \mathcal{D})$: evaluating $\theta$ on the test data with distribution $\mathcal{D}$.
We use MNLI Williams et al. (2018) as the ID dataset for NLI. MNLI is comprised of premise-hypothesis pairs, whose relationship may be entailment, contradiction, or neutral. In MNLI, the word overlap between premise and hypothesis is strongly correlated with the entailment class. To counter this problem, the OOD HANS dataset McCoy et al. (2019) is built so that such correlation does not hold.
The ID dataset for paraphrase identification is QQP (https://www.kaggle.com/c/quora-question-pairs), which contains question pairs that are labelled as either duplicate or non-duplicate. In QQP, high lexical overlap is also strongly associated with the duplicate class. The OOD datasets PAWS-qqp and PAWS-wiki Zhang et al. (2019) are built from sentences in Quora and Wikipedia respectively. In PAWS, sentence pairs with high word overlap have a balanced distribution over duplicate and non-duplicate.
FEVER Thorne et al. (2018) (see the licence information at https://fever.ai/download/fever/license.html) is adopted as the ID dataset of fact verification, where the task is to assess whether a given evidence supports or refutes the claim, or whether there is not-enough-info to reach a conclusion. The OOD dataset Fever-Symmetric (v1 and v2) Schuster et al. (2019) is proposed to evaluate the influence of the claim-only bias (the label can be predicted correctly without the evidence).
We mainly experiment with the BERT-base-uncased model Devlin et al. (2019). It has roughly 110M parameters in total, and 84M parameters in the Transformer layers. As described in Section 3.1, we derive the subnetworks from the Transformer layers and report sparsity levels relative to the 84M parameters. To generalize our conclusions to other PLMs, we also consider two variants of the BERT family, namely RoBERTa-base and BERT-large, the results of which can be found in Appendix C.5.
Following Clark et al. (2019), we use a simple linear classifier as the bias model. For HANS and PAWS, the spurious features are based on the word overlapping information between the two input text sequences. For Fever-Symmetric, the spurious features are max-pooled word embeddings of the claim sentence. More details about the bias model and the spurious features are presented in Appendix B.3.1.
Mask training and IMP basically use the same hyper-parameters (adopted from Utama et al. (2020b)) as full BERT. An exception is longer training, because we find that good subnetworks at high sparsity levels require more training to be found. Unless otherwise specified, we select the best checkpoints based on the ID dev performance, without using OOD information. All the reported results are averaged over 4 runs. We defer training details about each dataset, and each training and pruning setup, to Appendix B.3.
Given the fine-tuned full BERT $\theta_{ft}$, where $\theta_{pt}$ and $\theta_{ft}$ are the pre-trained and fine-tuned parameters respectively, the goal is to find a subnetwork $m \odot \theta'$ that satisfies a target sparsity level $s$ and maximizes the ID and OOD performance:
$$\max_{m, \theta'} \; \mathcal{E}(m \odot \theta', \mathcal{D}_{ID}) + \mathcal{E}(m \odot \theta', \mathcal{D}_{OOD}), \quad \text{s.t.} \;\; 1 - \frac{\|m\|_0}{n} \ge s$$
where $\|\cdot\|_0$ is the $L_0$ norm and $n$ is the total number of parameters to be pruned. In practice, the above optimization problem is approached by pruning $\theta_{ft}$ with method $p$ under training loss $\mathcal{L}$, which minimizes the loss on the ID training set. When the pruning method is IMP, the subnetwork parameters will be further fine-tuned and $\theta' \neq \theta_{ft}$. For mask training, only the subnetwork structure is updated and $\theta' = \theta_{ft}$.
We consider two kinds of fine-tuned full BERT, which utilize the standard CE loss and the PoE loss respectively (i.e., $\mathcal{L}_{ft} \in \{\mathcal{L}_{CE}, \mathcal{L}_{PoE}\}$). IMP and mask training are used as the pruning methods (i.e., $p \in \{\text{imp}, \text{mask}\}$). For the standard fine-tuned BERT, both $\mathcal{L}_{CE}$ and $\mathcal{L}_{PoE}$ are examined in the pruning process. For the PoE fine-tuned BERT, we only use $\mathcal{L}_{PoE}$ during pruning. Note that in this work, we mainly experiment with $\mathcal{L}_{CE}$ and $\mathcal{L}_{PoE}$. Example reweighting and confidence regularization are also examined for subnetworks from fine-tuned BERT, the results of which can be found in Appendix C.1.
The results are shown in Fig. 2 (In this paper, we present most results in figures for clear comparisons. Actual values of the results can be found in the code link.). We discuss them from three perspectives. For the full BERT, we can see that standard CE fine-tuning, which achieves good results on the ID dev sets, performs significantly worse on the OOD test sets. This demonstrates that the ID performance of BERT depends, to a large extent, on memorizing the dataset bias.
In terms of the subnetworks, we can derive the following observations: (1) Using any of the four pruning methods, we can compress away a large proportion of the BERT parameters while largely preserving the full model’s ID performance. (2) With standard pruning, i.e., “mask train (std)” or “imp (std)”, we observe small but perceivable improvements over the full BERT on the HANS and PAWS datasets. This suggests that pruning may remove some parameters related to the bias features. (3) The OOD performance of “mask train (poe)” and “imp (poe)” subnetworks is even better, and their ID performance degrades only slightly relative to the full BERT. This shows that introducing the debiasing objective in the pruning process is beneficial. Especially, since mask training does not change the model parameters, the results of “mask train (poe)” imply that the biased “full bert (std)” contains sparse and robust subnetworks (SRNets) that already encode a less biased solution to the task. (4) SRNets can be identified across a wide range of sparsity levels. However, at higher sparsity, the performance of the subnetworks is less desirable. (5) We also find an abnormal increase of the PAWS F1 score at extreme sparsity for some pruning methods, where the corresponding ID performance drops sharply. This is because the class distribution of PAWS is imbalanced (see Appendix B.1), and thus even a naive random-guessing model can outperform the biased full model on PAWS. Therefore, the OOD improvement should only be accepted when there is no large ID performance decline.
Comparing IMP and mask training, the latter performs better in general, except for “mask train (poe)” at high sparsity on QQP and FEVER. This suggests that directly optimizing the subnetwork structure is a better choice than using the magnitude heuristic as the pruning metric.
Fig. 3 presents the results. We can find that: (1) For the full BERT, the OOD performance is obviously promoted with the PoE debiasing method, while the ID performance is sacrificed slightly. (2) Unlike the subnetworks from the standard fine-tuned BERT, the subnetworks of the PoE fine-tuned BERT (the green and blue lines) cannot outperform the full model. However, these subnetworks maintain comparable performance up to a considerable sparsity, in both the ID and OOD settings, making them desirable alternatives to the full model in resource-constrained scenarios. Moreover, this phenomenon suggests that there is great redundancy in the BERT parameters, even when OOD generalization is taken into account. (3) With PoE-based pruning, subnetworks from the standard fine-tuned BERT (the orange line) are comparable with subnetworks from the PoE fine-tuned BERT (the blue line). This means we do not have to fine-tune a debiased BERT before searching for the SRNets. (4) IMP, again, slightly underperforms mask training at moderate sparsity levels, while it is better at high sparsity on the fact verification task.
Given the pre-trained BERT $\theta_{pt}$, a subnetwork $m \odot \theta_{pt}$ is obtained before downstream fine-tuning. The goal is to maximize the ID and OOD performance of the fine-tuned subnetwork, i.e., the result of training $m \odot \theta_{pt}$ with loss $\mathcal{L}$, under the sparsity constraint. Following the LTH Frankle and Carbin (2019), we solve this problem using the train-prune-rewind pipeline. For IMP, the procedure is described in Section 3.2.1, and the mask is extracted from the subnetwork produced by IMP with weight rewinding. For mask training, the subnetwork structure is learned from the fine-tuned parameters (same as the previous section), and the resulting mask is applied to $\theta_{pt}$.
We employ the CE and PoE losses for model fine-tuning (i.e., $\mathcal{L}_{ft} \in \{\mathcal{L}_{CE}, \mathcal{L}_{PoE}\}$). Since we have shown that using the debiasing loss in pruning is conducive, the CE loss is not considered for pruning (i.e., $\mathcal{L}_{prune} = \mathcal{L}_{PoE}$).
The results of subnetworks fine-tuned in isolation are presented in Fig. 4. It can be found that: (1) For standard CE fine-tuning, the “mask train (poe)” subnetworks are superior to “full bert (std)” on the OOD test data, i.e., the subnetworks are less susceptible to the dataset bias during training. (2) In terms of the PoE-based fine-tuning, the “imp (poe)” and “mask train (poe)” subnetworks are generally comparable to “full bert (poe)”. (3) For most of the subnetworks, “poe ft” clearly outperforms “std ft” in the OOD setting, which suggests that it is important to use the debiasing method in fine-tuning, even if the BERT subnetwork structure has already encoded some unbiased information.
Moreover, based on (1) and (2), we can extend the LTH on BERT Chen et al. (2020); Prasanna et al. (2020); Liang et al. (2021); Liu et al. (2022): The pre-trained BERT contains SRNets that can be fine-tuned in isolation, using either standard or debiasing method, and match or even outperform the full model in both the ID and OOD evaluations.
This setup aims at finding a subnetwork inside the pre-trained BERT that can be directly employed for a task. The problem is formulated as maximizing the ID and OOD performance of $m \odot \theta_{pt}$ under the sparsity constraint. Following Zhao et al. (2020), we fix the pre-trained parameters $\theta_{pt}$ and optimize the mask variables $\hat{m}$ via mask training.
As we can see in Fig. 5: (1) With CE-based mask training, the identified subnetworks (at moderate sparsity) in the pre-trained BERT are competitive with the CE fine-tuned full BERT. (2) Similarly, using PoE-based mask training, the subnetworks at moderate sparsity are comparable to the PoE fine-tuned full BERT, which demonstrates that SRNets for a particular downstream task already exist in the pre-trained BERT. (3) “mask train (poe)” subnetworks in the pre-trained BERT can even match the subnetworks found in the fine-tuned BERT (the orange lines) in some cases (e.g., on PAWS and FEVER). Nonetheless, the latter exhibit a better overall performance.
To explore the upper bound of BERT subnetworks in terms of OOD generalization, we include the OOD training data in mask training, and use the OOD test sets for evaluation. Like the previous sections, we investigate three pruning and fine-tuning paradigms, as formulated by Eq. 2, 3 and 4 respectively. We only consider the standard CE for subnetwork and full BERT fine-tuning, which is more vulnerable to the dataset bias. Appendix B.3.3 summarizes the detailed experimental setups.
From Fig. 7 we can observe that: (1) The subnetworks from fine-tuned BERT (“bert-ft subnet”) achieve very high accuracy on HANS, and their ID performance is also close to the full BERT. (2) The subnetworks in the pre-trained BERT (“bert-pt subnet”) also have very high OOD accuracy, while they perform worse than “bert-ft subnet” in the ID setting. (3) “bert-pt subnet + ft” subnetworks, which are fine-tuned in isolation with the CE loss, exhibit the best ID performance and the poorest OOD performance. However, compared to the full BERT, these subnetworks still rely much less on the dataset bias, reaching high HANS accuracy. Jointly, these results show that there consistently exist BERT subnetworks that are almost unbiased towards the MNLI training set bias, under the three kinds of pruning and fine-tuning paradigms.
Compared with searching for subnetworks from the fine-tuned BERT, directly searching from the pre-trained BERT is more efficient in that it dispenses with fine-tuning the full model. However, the former has a better overall performance, as we have shown in Section 4.4. This induces a question: At which point of the BERT fine-tuning process can we find, using mask training, subnetworks comparable to those found after fine-tuning completes? To answer this question, we perform mask training on the model checkpoints from different steps of BERT fine-tuning.
Fig. 7 shows the mask training curves, which start from checkpoints at different fine-tuning steps. We can see that “ft step=0” converges slower and to a worse final accuracy, as compared with “ft to end”, especially on the HANS dataset. However, with 20,000 steps of full BERT fine-tuning, which is only a fraction of “ft to end”, the mask training performance is very competitive. This suggests that the total training cost of SRNet searching can be reduced, by a large amount, in the full model training stage.
To actually reduce the training cost, we need to predict the exact timing to start mask training. This is intractable without information of all the training curves in Fig. 7. A feasible solution is adopting the idea of early-stopping (see Appendix E.1 for detailed discussions). However, accurately predicting the optimal timing (with the least amount of fine-tuning and comparable subnetwork performance to fully fine-tuning) is indeed difficult and we invite follow-up studies to investigate this question.
As the results of Section 4 demonstrate, there is a sharp decline in the subnetworks’ performance at high sparsity. We conjecture that this is because directly initializing mask training at the target high sparsity reduces the model’s capacity too drastically, and thus causes difficulties in optimization. Therefore, we gradually increase the sparsity during mask training, using the cubic sparsity schedule Zhu and Gupta (2018) (see Appendix C.4 for ablation studies). Fig. 8 compares the fixed sparsity used in the previous sections against gradual sparsity increase, across varied mask training epochs. We find that while simply extending the training process is conducive, gradual sparsity increase achieves better results. In particular, “gradual” outperforms “fixed” with lower training cost on all three tasks, except for the PAWS dataset. A similar phenomenon is explained in Section 4.2.2.
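The cubic schedule of Zhu and Gupta (2018) can be written as follows; the final target of 0.7 and the step counts are illustrative placeholders, not the paper's exact hyper-parameters:

```python
def cubic_sparsity(step, total_steps, s_init=0.0, s_final=0.7):
    """Cubic sparsity schedule (Zhu & Gupta, 2018): sparsity rises quickly
    at first, then flattens as it approaches the final target s_final."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return s_final + (s_init - s_final) * (1.0 - t) ** 3
```

At each mask-training step, the binarization threshold can then be re-set so the binary mask matches the currently scheduled sparsity, easing the model into the high-sparsity regime instead of imposing it from the first step.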
In this paper, we investigate whether sparsity and robustness to dataset bias can be achieved simultaneously for PLM subnetworks. Through extensive experiments, we demonstrate that BERT indeed contains sparse and robust subnetworks (SRNets) across a variety of NLU tasks and training and pruning setups. We further use the OOD information to reveal that there exist sparse and almost unbiased BERT subnetworks. Finally, we present analysis and solutions to refine the SRNet searching process in terms of subnetwork performance and searching efficiency.
The limitations of this work are twofold. First, we focus on BERT-like PLMs and NLU tasks, while dataset biases are also common in other scenarios. For example, gender and racial biases exist in dialogue generation systems Dinan et al. (2020) and PLMs Guo et al. (2022). In future work, we would like to extend our exploration to other types of PLMs and NLP tasks (see Appendix E.2 for a discussion). Second, as we discussed in Section 5.1, our analysis of “the timing to start searching for SRNets” mainly serves as a proof-of-concept, and actually reducing the training cost requires predicting the exact timing.
This work was supported by National Natural Science Foundation of China (61976207 and 61906187).
Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432.
Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR.
Gordon, M. A., Duh, K., and Andrews, N. (2020). Compressing BERT: studying the effects of weight pruning on transfer learning. In RepL4NLP@ACL, pp. 143–155.
He, H., Zha, S., and Wang, H. (2019). Unlearn dataset bias in natural language inference by fitting the residual. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pp. 132–142.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2020). ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR.
In Proceedings of Machine Learning Research, Vol. 119, pp. 6575–6585.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding with unsupervised learning. Technical report, OpenAI.
For all authors…
Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
Did you describe the limitations of your work? See Section 6.
Did you discuss any potential negative societal impacts of your work? Currently, we think there are no apparent negative societal impacts related to our work.
Have you read the ethics review guidelines and ensured that your paper conforms to them?
If you are including theoretical results…
Did you state the full set of assumptions of all theoretical results?
Did you include complete proofs of all theoretical results?
If you ran experiments…
Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? We will release the codes and reproduction instructions upon publication.
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See all the figures of our experiments.
Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix B.
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
If your work uses existing assets, did you cite the creators? See Section 4.1.
Did you mention the license of the assets? Licenses of some dataset we used are mentioned in Section 4.1. However, for the other datasets, we were unable to find the licenses.
Did you include any new assets either in the supplemental material or as a URL?
Did you discuss whether and how consent was obtained from people whose data you’re using/curating?
Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?
If you used crowdsourcing or conducted research with human subjects…
Did you include the full text of instructions given to participants and screenshots, if applicable?
Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?
Algo. 1 summarizes our implementation of IMP and IMP with weight rewinding. In practice, we fix the per-iteration pruning ratio and the pruning interval.
As we described in Section 3.2.2 of the main paper, we realize mask training via binarization in the forward pass and gradient estimation in the backward pass. Following Radiya-Dixit and Wang and Liu et al., we adopt a magnitude-based strategy to initialize the real-valued masks. Specifically, we consider two variants. The first one (hard variant) identifies the weights in matrix $W$ with the smallest magnitudes, sets the corresponding elements in $\hat{m}$ to zero, and sets the remaining elements to a fixed value $\alpha$:
$$\hat{m}_{i,j} = \begin{cases} 0 & \text{if } W_{i,j} \in \mathrm{small}(W, s) \\ \alpha & \text{otherwise} \end{cases}$$
where $\mathrm{small}(W, s)$ extracts the weights with the lowest absolute values according to the sparsity level $s$, and $\alpha$ is a hyper-parameter. The second one (soft variant) directly utilizes the absolute values of the weights for mask initialization: $\hat{m}_{i,j} = |W_{i,j}|$.
To control the sparsity of the model, the binarization threshold is adjusted dynamically at a fixed frequency of training steps. In practice, we control the sparsity in a local way, i.e., every weight matrix must satisfy the same sparsity constraint. Algo. 2 summarizes the entire process of mask training.
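The two initialization variants and the threshold-based binarization described above can be sketched as follows. This is an illustrative NumPy version under assumed names: `ALPHA` stands in for the fixed initialization hyper-parameter, and the backward-pass gradient estimation (straight-through) is only noted in a comment since it needs an autodiff framework.

```python
import numpy as np

ALPHA = 0.02  # illustrative value for the fixed-init hyper-parameter

def init_mask_hard(w, sparsity):
    """Hard variant: zero for the lowest-|w| positions, ALPHA elsewhere."""
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w), axis=None)[k - 1] if k > 0 else -np.inf
    return np.where(np.abs(w) > thresh, ALPHA, 0.0)

def init_mask_soft(w):
    """Soft variant: initialize the real-valued mask from |w| directly."""
    return np.abs(w)

def binarize(real_mask, sparsity):
    """Forward pass: keep the top (1 - sparsity) fraction of mask scores.
    During training, gradients would flow to real_mask through this
    non-differentiable step via a straight-through estimator."""
    k = int(real_mask.size * sparsity)
    thresh = np.sort(real_mask, axis=None)[k - 1] if k > 0 else -np.inf
    return (real_mask > thresh).astype(np.float32)
```

The dynamically adjusted threshold is what enforces the local sparsity constraint: each weight matrix is binarized against its own threshold so that all matrices share the same sparsity.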
We have introduced the PoE method in Section 3.3. Here we provide descriptions of the other two debiasing methods, i.e., example reweighting and confidence regularization.
Example Reweighting directly assigns an importance weight to each example's standard CE training loss, according to the example's bias degree:
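A minimal sketch of this idea, assuming (as in common instantiations of example reweighting) that the weight is one minus the bias model's probability for the gold label, so that heavily biased examples contribute less to the loss:

```python
import numpy as np

def reweighted_ce(log_probs, gold, bias_prob_gold):
    """Example reweighting: scale each example's cross-entropy loss by
    (1 - b), where b is the bias model's probability for the gold label.
    Examples the bias model already gets right (b near 1) are down-weighted."""
    ce = -log_probs[np.arange(len(gold)), gold]  # per-example cross-entropy
    weights = 1.0 - bias_prob_gold
    return np.mean(weights * ce)
```

The exact weighting function follows the main paper; the `1 - b` form here is one standard choice.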
Confidence Regularization is based on knowledge distillation Hinton et al. . It involves a teacher model trained with the standard CE loss, whose prediction is used as a supervision signal to train the main model. To account for the bias degree of each training example, the teacher's prediction is smoothed using a scaling function, and the final loss is computed as:
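A sketch of this loss, assuming the scaling function softens the teacher distribution by raising it to the power `1 - b` and renormalizing (one common instantiation; the paper's exact scaling function may differ):

```python
import numpy as np

def conf_reg_loss(student_log_probs, teacher_probs, bias_prob_gold):
    """Confidence regularization: distill from a teacher whose prediction is
    smoothed according to the bias degree b, then minimize cross-entropy
    between the scaled teacher distribution and the student."""
    # Scaling function: b near 1 flattens the teacher toward uniform,
    # so the student is not pushed to be confident on biased examples.
    scaled = teacher_probs ** (1.0 - bias_prob_gold[:, None])
    scaled = scaled / scaled.sum(axis=1, keepdims=True)
    return -np.mean(np.sum(scaled * student_log_probs, axis=1))
```

With `b = 0` this reduces to ordinary distillation from the unscaled teacher.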
We utilize eight datasets from three NLU tasks. The statistics of different dataset splits are summarized in Tab. 1. If a dataset has a test set, we use it for evaluation; otherwise we report results on the dev set. For MNLI and QQP, since the official test server (https://gluebenchmark.com/) only allows two submissions a day, we instead evaluate on the dev sets, following Chen et al. , Liu et al. , Sanh et al. . For FEVER, we use the training and evaluation data processed by Schuster et al.  (https://github.com/TalSchuster/FeverSymmetric).
Tab. 2 shows the distribution of examples over classes. We can see that the distributions of the QQP and PAWS evaluation sets are imbalanced. Specifically, on the OOD PAWS set, where a biased model tends to assign most examples to the duplicate class, simply classifying all examples as non-duplicate can substantially improve accuracy. To account for this, we use the F1 score to evaluate performance on the three paraphrase identification datasets. Specifically, we calculate the weighted average of the per-class F1 scores. However, the class imbalance may still affect the evaluation on PAWS (as discussed in Section 4.2.2), and therefore the OOD improvement should be assessed together with the ID performance.
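The class-weighted F1 used here can be computed as follows (a self-contained sketch equivalent to scikit-learn's `f1_score` with `average='weighted'`):

```python
import numpy as np

def weighted_f1(y_true, y_pred, classes):
    """Weighted average of per-class F1 scores, with each class weighted by
    its frequency in y_true. Unlike accuracy, this penalizes a model that
    trivially predicts the majority class on an imbalanced evaluation set."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total, score = len(y_true), 0.0
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (np.sum(y_true == c) / total) * f1
    return score
```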
Whether all the hypothesis words also belong to the premise.
Whether the hypothesis appears as a continuous subsequence in the premise.
The percentage of the hypothesis words that also appear in the premise.
The minimum of the same similarities described above.
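The lexical-overlap features above can be sketched as simple functions of the tokenized premise and hypothesis. This is an illustrative version (whitespace tokenization, dictionary output names of my choosing), not the exact feature extractor used to train the bias model:

```python
def overlap_features(premise, hypothesis):
    """Hand-crafted lexical-overlap features of the kind used for the
    NLI bias model: full containment, contiguous subsequence, and the
    percentage of hypothesis words appearing in the premise."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    return {
        "all_in_premise": all(w in p for w in h),       # every hyp word in premise
        "is_subsequence": " ".join(h) in " ".join(p),   # contiguous subsequence
        "overlap_pct": sum(w in p for w in h) / max(len(h), 1),
    }
```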
For FEVER, we use the max-pooled word embeddings of the claim sentence, which are also based on the fastText word vectors.
The main training hyper-parameters are shown in Tab. 3, which basically follow Utama et al. [2020b]. Most of the hyper-parameters are the same for different training strategies, except for the number of training epochs (#Epoch) on MNLI. For the standard CE loss and example reweighting, the model is trained for 3 epochs. For PoE and confidence regularization, the model is trained for 5 epochs.
Mask training and IMP basically use the same set of hyper-parameters as full BERT, except for longer training. The number of training epochs for mask training and IMP is 5 on MNLI, and 7 on QQP and FEVER. The hyper-parameters specific to mask training or IMP are summarized in Tab. 4. Unless otherwise specified, we adopt the hard variant of mask initialization (Eq. 5) and fix the subnetwork sparsity to the target sparsity throughout the process of mask training. Some special experimental setups are described as follows:
When we search for subnetworks at low sparsity (e.g., 20%) from a fine-tuned BERT, we find that mask training (with debiasing loss) stably improves the OOD performance, while the ID performance peaks at an early point of training, then slightly drops and recovers later. Therefore, selecting by ID performance favors the early checkpoints, which are not good at OOD generalization. To address this problem, we select the best checkpoint only from the later portion of training, but still according to performance on the ID dev set. This strategy is adopted only for mask training on fine-tuned BERT (for all sparsity levels); in the other cases we select the best checkpoint across the whole of training based on ID performance.
When fine-tuning the searched subnetworks (with their weights rewound to pre-trained values) in isolation, we use the same set of hyper-parameters as full BERT fine-tuning.
The OOD data is used in this setup. Specifically, we utilize the training data of HANS and PAWS for NLI and paraphrase identification respectively. In terms of the FEVER-Symmetric dataset, which does not provide a training set (see Tab. 1), we use the dev set of FEVER-Symm2 and copy the data 10 times to construct the OOD training data. The OOD and ID training data are then combined to form the final training set. Note that the evaluation sets are the same as the other setups, and NO test data is used in mask training.
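The training-set construction described above (using the OOD training split where one exists, and otherwise replicating the small OOD dev set) can be sketched as:

```python
def build_ood_training_set(id_train, ood_train=None, ood_dev=None, copies=10):
    """Combine ID training data with OOD data. When no OOD training split
    exists (as for FEVER-Symmetric), replicate the OOD dev set `copies`
    times so it is not drowned out by the much larger ID training set."""
    ood = list(ood_train) if ood_train is not None else list(ood_dev) * copies
    return list(id_train) + ood
```

The function names and the list-based representation are illustrative; the replication factor of 10 follows the setup described above.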
We mainly experiment with the gradual sparsity increase schedule for subnetworks at 90% sparsity. Concretely, we increase the sparsity from 70% to 90% during the process of mask training. The real-valued mask is initialized using the soft-variant (Eq. 6). This is because we find that the hard-variant is difficult to optimize with sparsity increase.
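The gradual sparsity increase schedule can be sketched as a function of the training step. The linear ramp shape and the fraction of training spent ramping are illustrative assumptions; the paper only specifies the start and target sparsity levels:

```python
def sparsity_at_step(step, total_steps, start=0.7, target=0.9, ramp_frac=0.5):
    """Gradual sparsity increase: raise the sparsity constraint linearly
    from `start` to `target` over the first `ramp_frac` of mask training,
    then hold it fixed at the target for the remaining steps."""
    ramp_steps = int(total_steps * ramp_frac)
    if step >= ramp_steps:
        return target
    return start + (target - start) * step / ramp_steps
```

At each training step, the returned value is used as the sparsity constraint when binarizing the real-valued mask.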
In Section 4, we mainly experiment with the PoE debiasing method. Here, we combine mask training with the other two debiasing methods, namely example reweighting and confidence regularization, and search for SRNets from the CE fine-tuned BERT. Fig. 9 presents the results. As we can see: (1) Pruning with different debiasing methods almost consistently improves the OOD performance over the CE fine-tuned BERT. (2) The confidence regularization method (the grey lines) achieves only mild OOD improvement over the full BERT, while it preserves more ID performance than the other two methods. This phenomenon accords with the results of Utama et al. [2020a], who proposed confidence regularization to achieve a better trade-off between ID and OOD performance.
Fig. 10 shows the results of mask training with the OOD training data. We can see that the general patterns on the paraphrase identification and fact verification datasets are basically the same as on the NLI datasets. Although the identified subnetworks cannot achieve 100% accuracy on PAWS and FEVER-Symmetric as on HANS, they substantially narrow the gap between OOD and ID performance compared with the full BERT. An exception is Symm2, where the upper bound of SRNets appears lower. This is probably because there are not enough examples (708 in total) to represent the data distribution of the FEVER-Symmetric dataset. Overall, we conjecture that the existence of sparse and unbiased subnetworks might be ubiquitous.
Fig. 11 shows the mask training curves on all 8 datasets. Similar to the NLI datasets, mask training on the other two tasks can achieve comparable results to "ft to end" by starting from an intermediate checkpoint of BERT fine-tuning. For QQP, we can start from 15,000 steps of full BERT fine-tuning (44% of the total fine-tuning steps). For FEVER, we can start from 10,000 steps (44% of the total).
As mentioned in Appendix B.3.3, we increase the sparsity from 70% to 90% and adopt the soft variant of mask initialization. To explain the reason for this specific strategy, we present ablation study results in Tab. 5. We can observe that: (1) Replacing the hard variant of mask initialization with the soft variant is beneficial, leading to clear improvements on the QQP, FEVER, Symm1 and Symm2 datasets. (2) Gradually increasing the sparsity further promotes the performance, with the 0.7→0.9 strategy achieving the best results on 7 out of the 8 datasets.
It has been shown by Hendrycks et al. , Tu et al.  that the pre-trained RoBERTa model Liu et al.  has better OOD generalization than BERT. Tu et al.  also show that larger PLMs, which are more computationally expensive, are more robust. To examine whether our conclusions generalize to RoBERTa and larger versions of BERT, we conduct mask training on the standard fine-tuned RoBERTa-base and BERT-large models, using the PoE debiasing loss in the mask training process.
The results are shown in Tab. 6. We can see that, for RoBERTa-base: (1) At 50% sparsity, the searched subnetworks outperform the full RoBERTa (std) by 6.84 points on HANS, with a relatively small drop of 1.74 on MNLI, validating that SRNets can be found in RoBERTa. (2) At 70% sparsity, vanilla mask training produces subnetworks with undesirable ID performance and OOD performance merely comparable to the full model (std). In comparison, when we gradually increase the sparsity level from 50% to 70%, the ID and OOD performance improve simultaneously, demonstrating that gradual sparsity increase is also effective for RoBERTa.
When it comes to BERT-large, the conclusions are basically the same as for BERT-base and RoBERTa-base: (1) We can find 50% sparse SRNets from BERT-large using the original mask training. (2) Gradual sparsity increase is also effective for BERT-large. Additionally, we find that the original mask training exhibits high variance at 70% sparsity, because training fails for some random seeds. In comparison, with gradual sparsity increase, the searched subnetworks achieve better performance with low variance.
Some prior attempts have also been made to obtain compact and robust deep neural networks. We discuss the relationship and difference between these works and our paper from three perspectives:
There are various types of model robustness, including generalization to in-distribution unseen examples, robustness to dataset bias Beery et al. , McCoy et al. , Zhang et al. , Schuster et al. , and robustness to adversarial attacks Goodfellow et al. , etc. Among the research on model compression and robustness, adversarial robustness Gui et al. , Ye et al. , Sehwag et al. , Fu et al. , Xu et al.  and dataset bias robustness Zhang et al. , Du et al.  are the most widely studied. In this paper, we focus on the dataset bias problem, which is more common than worst-case adversarial attacks in real-world applications.
A major direction in robust model compression is about the design of compression methods. Sehwag et al.  investigate the effect of magnitude-based pruning on adversarially trained models. Gui et al. , Ye et al.  treat sparsity and adversarial robustness as a constrained optimization problem, and solve it using the alternating direction method of multipliers (ADMM) framework Zhang et al. . Sehwag et al. , Zhang et al. , Madaan et al.  combine learnable weight mask (i.e., mask training) and robust training objectives. Our study investigates the use of magnitude-based pruning and mask training, which are also widely employed in the literature of BERT compression.
Although the topic of model compression and robustness has been studied for years, it has mostly been investigated in the context of computer vision (CV) tasks and models, and little attention has been paid to the NLP field. Considering the real-world application potential of PLMs, it is critical to study PLM compression and robustness jointly. To this end, some recent studies extend the evaluation of compressed PLMs to consider adversarial robustness Xu et al.  and dataset bias robustness Du et al. .
Although our work shares the same topic as Du et al. , we differ in several aspects. First, the scope and focus of our research questions are different. They aim at analyzing the impact of different compression methods (pruning and knowledge distillation Hinton et al. ) on the OOD robustness of standard fine-tuned BERT. By contrast, we focus on subnetworks obtained from different pruning and fine-tuning paradigms and consider both standard fine-tuning and debiasing fine-tuning. Second, our conclusions are different. The results of Du et al.  suggest that pruning generally has a negative impact on the robustness of BERT. In comparison, we reveal the consistent existence of sparse BERT subnetworks that are more robust to dataset bias than the full model.
A feasible solution is to stop full BERT fine-tuning when there is no significant improvement across several consecutive evaluation steps. The patience of early stopping can be determined by the computational budget. If resources are limited, we can at least directly train the mask on the pre-trained weights, which can still produce SRNets at 50% sparsity (as shown in Section 4.4.2).
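The patience-based stopping rule described above can be sketched as follows; the function name, the default patience, and the improvement threshold are illustrative choices:

```python
def should_stop(dev_scores, patience=3, min_delta=1e-4):
    """Patience-based early stopping: stop fine-tuning when the best dev
    score within the last `patience` evaluations has not improved on the
    earlier best by at least `min_delta`."""
    if len(dev_scores) <= patience:
        return False
    best_before = max(dev_scores[:-patience])
    recent_best = max(dev_scores[-patience:])
    return recent_best < best_before + min_delta
```

A larger patience spends more compute for a potentially better starting checkpoint; a smaller one stops fine-tuning sooner.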
In this work, we focus on NLU tasks and PLMs from the BERT family. However, the methodology we utilize is agnostic to the type of bias, task and backbone model. In theory, it can be flexibly adapted to other scenarios by simply changing the spurious features used to train the bias model (for the three debiasing methods considered in this paper), or by combining the pruning method with another kind of debiasing method that also involves model training. In future work, we would like to extend our exploration to other types of PLMs (e.g., language generation models like GPT Radford et al.  and T5 Raffel et al. ) and other types of NLP tasks (e.g., dialogue generation).