
A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models

by   Yuanxin Liu, et al.

Despite the remarkable success of pre-trained language models (PLMs), they still face two challenges: First, large-scale PLMs are inefficient in terms of memory footprint and computation. Second, on the downstream tasks, PLMs tend to rely on the dataset bias and struggle to generalize to out-of-distribution (OOD) data. In response to the efficiency problem, recent studies show that dense PLMs can be replaced with sparse subnetworks without hurting the performance. Such subnetworks can be found in three scenarios: 1) the fine-tuned PLMs, 2) the raw PLMs and then fine-tuned in isolation, and even inside 3) PLMs without any parameter fine-tuning. However, these results are only obtained in the in-distribution (ID) setting. In this paper, we extend the study on PLMs subnetworks to the OOD setting, investigating whether sparsity and robustness to dataset bias can be achieved simultaneously. To this end, we conduct extensive experiments with the pre-trained BERT model on three natural language understanding (NLU) tasks. Our results demonstrate that sparse and robust subnetworks (SRNets) can consistently be found in BERT, across the aforementioned three scenarios, using different training and compression methods. Furthermore, we explore the upper bound of SRNets using the OOD information and show that there exist sparse and almost unbiased BERT subnetworks. Finally, we present 1) an analytical study that provides insights on how to promote the efficiency of SRNets searching process and 2) a solution to improve subnetworks' performance at high sparsity. The code is available at





Code Repositories


[NeurIPS 2022] "A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models", Yuanxin Liu, Fandong Meng, Zheng Lin, Jiangnan Li, Peng Fu, Yanan Cao, Weiping Wang, Jie Zhou


1 Introduction

Pre-trained language models (PLMs) have enjoyed impressive success in natural language processing (NLP) tasks. However, they still face two major problems. On the one hand, the prohibitive model size of PLMs leads to poor efficiency in terms of memory footprint and computational cost Ganesh et al. (2021); Strubell et al. (2019). On the other hand, despite being pre-trained on large-scale corpora, PLMs still tend to rely on dataset bias Gururangan et al. (2018); McCoy et al. (2019); Zhang et al. (2019); Schuster et al. (2019), i.e., the spurious features of input examples that strongly correlate with the label, during downstream fine-tuning. These two problems pose great challenges to the real-world deployment of PLMs, and they have triggered two separate lines of work.

In terms of the efficiency problem, some recent studies resort to sparse subnetworks as alternatives to the dense PLMs. Li et al. (2020); Michel et al. (2019); Liu et al. (2021a) compress the fine-tuned PLMs in a post-hoc fashion. Chen et al. (2020); Prasanna et al. (2020); Liu et al. (2022); Liang et al. (2021) extend the Lottery Ticket Hypothesis (LTH) Frankle and Carbin (2019) to search for PLM subnetworks that can be fine-tuned in isolation. Taking one step further, Zhao et al. (2020) propose to learn task-specific subnetwork structures via mask training Hubara et al. (2016); Mallya et al. (2018), without fine-tuning any pre-trained parameter. Fig. 1 illustrates these three paradigms. Encouragingly, the empirical evidence suggests that PLMs can indeed be replaced with sparse subnetworks without compromising the in-distribution (ID) performance.

To address the dataset bias problem, numerous debiasing methods have been proposed. A prevailing category of debiasing methods Clark et al. (2019); Utama et al. (2020a); Karimi Mahabadi et al. (2020); He et al. (2019); Schuster et al. (2019); Ghaddar et al. (2021); Utama et al. (2020b) adjusts the importance of training examples, in terms of training loss, according to their bias degree, so as to reduce the impact of biased examples (examples that can be correctly classified based on the spurious features). As a result, the model is forced to rely less on the dataset bias during training and generalizes better to OOD situations.

Figure 1: Three kinds of PLM subnetworks obtained from different pruning and fine-tuning paradigms. (a) Pruning a fine-tuned PLM. (b) Pruning the PLM and then fine-tuning the subnetwork. (c) Pruning the PLM without fine-tuning model parameters. The obtained subnetworks are used for testing.

Although progress has been made in both directions, most existing work tackles the two problems independently. To facilitate real-world application of PLMs, the problems of robustness and efficiency should be addressed simultaneously. Motivated by this, we extend the study on PLM subnetworks to the OOD scenario, investigating whether there exist PLM subnetworks that are both sparse and robust against dataset bias. To answer this question, we conduct large-scale experiments with the pre-trained BERT model Devlin et al. (2019) on three natural language understanding (NLU) tasks that are widely studied in the context of dataset bias. We consider a variety of setups including the three pruning and fine-tuning paradigms, standard and debiasing training objectives, different model pruning methods, and different variants of PLMs from the BERT family. Our results show that BERT does contain sparse and robust subnetworks (SRNets) within a certain sparsity constraint (e.g., less than 70%), giving an affirmative answer to the above question. Compared with a standard fine-tuned BERT, SRNets exhibit comparable ID performance and remarkable OOD improvement. When it comes to a BERT model fine-tuned with a debiasing method, SRNets can preserve the full model’s ID and OOD performance with far fewer parameters. On this basis, we further explore the upper bound of SRNets by making use of the OOD information, which reveals that there exist sparse and almost unbiased subnetworks, even in a standard fine-tuned BERT that is biased.

Despite the intriguing properties of SRNets, we find that the subnetwork searching process still has room for improvement, based on some observations from the above experiments. First, we study the timing to start searching SRNets during full BERT fine-tuning, and find that the entire training and searching cost can be reduced from this perspective. Second, we refine the mask training method with gradual sparsity increase, which is quite effective in identifying SRNets at high sparsity.

Our main contributions are summarized as follows:

  • We extend the study on PLMs subnetworks to the OOD scenario. To our knowledge, this paper presents the first systematic study on sparsity and dataset bias robustness for PLMs.

  • We conduct extensive experiments to demonstrate the existence of sparse and robust BERT subnetworks, across different pruning and fine-tuning setups. By using the OOD information, we further reveal that there exist sparse and almost unbiased BERT subnetworks.

  • We present analytical studies and solutions that can help further refine the SRNets searching process in terms of efficiency and the performance of subnetworks at high sparsity.

2 Related Work

2.1 BERT Compression

Studies on BERT compression can be divided into two classes. The first one focuses on the design of model compression techniques, which include pruning Gordon et al. (2020); Michel et al. (2019); Gale et al. (2019), knowledge distillation Sanh et al. (2019); Sun et al. (2019); Jiao et al. (2020); Liu et al. (2021b), parameter sharing Lan et al. (2020), quantization Zafrir et al. (2019); Zhang et al. (2020), and combining multiple techniques Tambe et al. (2020); Mao et al. (2020); Liu et al. (2021a). The second one, which is based on the lottery ticket hypothesis Frankle and Carbin (2019), investigates the compressibility of BERT on different phases of the pre-training and fine-tuning paradigm. It has been shown that BERT can be pruned to a sparse subnetwork after Gale et al. (2019) and before fine-tuning Chen et al. (2020); Prasanna et al. (2020); Liang et al. (2021); Liu et al. (2022); Gordon et al. (2020), without hurting the accuracy. Moreover, Zhao et al. (2020) show that directly learning subnetwork structures on the pre-trained weights can match fine-tuning the full BERT. In this paper, we follow the second branch of works, and extend the evaluation of BERT subnetworks to the OOD scenario.

2.2 Dataset Bias in NLP Tasks

To facilitate the development of NLP systems that truly learn the intended task solution, instead of relying on dataset bias, many efforts have been made recently. On the one hand, challenging OOD test sets are constructed Gururangan et al. (2018); McCoy et al. (2019); Zhang et al. (2019); Schuster et al. (2019); Agrawal et al. (2018) by eliminating the spurious correlations in the training sets, in order to establish stricter evaluation. On the other hand, numerous debiasing methods Clark et al. (2019); Utama et al. (2020a); Karimi Mahabadi et al. (2020); He et al. (2019); Schuster et al. (2019); Ghaddar et al. (2021); Utama et al. (2020b) are proposed to discourage the model from learning dataset bias during training. However, little attention has been paid to the influence of pruning on the OOD generalization ability of PLMs. This work presents a systematic study on this question.

2.3 Model Compression and Robustness

Some pioneering attempts have also been made to obtain models that are both compact and robust to adversarial attacks Gui et al. (2019); Ye et al. (2019); Sehwag et al. (2020); Fu et al. (2021); Xu et al. (2021) and spurious correlations Zhang et al. (2021); Du et al. (2021). Specifically, Xu et al. (2021); Du et al. (2021) study the compression and robustness question on PLMs. Different from Xu et al. (2021), which is based on adversarial robustness, we focus on spurious correlations, which are more common than worst-case adversarial attacks. Compared with Du et al. (2021), which focuses on post-hoc pruning of the standard fine-tuned BERT, we thoroughly investigate different fine-tuning methods (standard and debiasing) and subnetworks obtained from the three pruning and fine-tuning paradigms. A more detailed discussion of the relation and difference between our work and previous studies on model compression and robustness is provided in Appendix D.

3 Preliminaries

3.1 BERT Architecture and Subnetworks

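BERT is composed of an embedding layer, a stack of Transformer layers Vaswani et al. (2017) and a task-specific classifier. Each Transformer layer has a multi-head self-attention (MHAtt) module and a feed-forward network (FFN). MHAtt has four kinds of weight matrices, i.e., the query, key and value matrices $W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_{model}}$, and the output matrix $W_O \in \mathbb{R}^{d_{model} \times d_{model}}$. FFN consists of two linear layers $W_{in} \in \mathbb{R}^{d_{model} \times d_{FFN}}$, $W_{out} \in \mathbb{R}^{d_{FFN} \times d_{model}}$, where $d_{model}$ is the model's hidden size and $d_{FFN}$ is the hidden dimension of FFN.

To obtain the subnetwork of a model $f(\theta)$ parameterized by $\theta$, we apply a binary pruning mask $m \in \{0, 1\}^{|\theta|}$ to its weight matrices, which produces $f(m \odot \theta)$, where $\odot$ is the Hadamard product. For BERT, we focus on the Transformer layers and the classifier. The parameters to be pruned are $\theta_{pr} = \{W_{cls}\} \cup \{W_Q^l, W_K^l, W_V^l, W_O^l, W_{in}^l, W_{out}^l\}_{l=1}^{L}$, where $W_{cls}$ is the classifier weights and $L$ is the number of Transformer layers.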
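As a minimal sketch of the masking operation described above, the snippet below applies a binary mask to a single toy weight matrix and measures the resulting sparsity. The 2x3 matrix, its values, and the helper names `apply_mask` and `sparsity` are assumptions made for illustration; they are not taken from the paper's code.

```python
# Illustrative sketch: element-wise (Hadamard) masking of one weight matrix.
# The toy values and helper names are assumptions, not the authors' code.

def apply_mask(weights, mask):
    """Return the element-wise product of a weight matrix and a 0/1 mask."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]

def sparsity(mask):
    """Fraction of entries that the mask removes (sets to zero)."""
    flat = [m for row in mask for m in row]
    return 1.0 - sum(flat) / len(flat)

W = [[0.5, -1.2, 0.03],
     [-0.07, 0.9, 0.4]]
M = [[1, 1, 0],
     [0, 1, 1]]

pruned = apply_mask(W, M)
print(pruned)       # entries where the mask is 0 become zero
print(sparsity(M))  # two of the six entries are removed
```

In a real model the same operation would be applied to every prunable matrix, and the sparsity would be computed over all of them jointly.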

3.2 Pruning Methods

3.2.1 Magnitude-based Pruning

Magnitude-based pruning Han et al. (2015); Frankle and Carbin (2019) zeroes out parameters with low absolute values. It is usually realized in an iterative manner, namely, iterative magnitude pruning (IMP). IMP alternates between pruning and training and gradually increases the sparsity of subnetworks. Specifically, a typical IMP algorithm consists of four steps: (i) Training the full model to convergence. (ii) Pruning a fraction of parameters with the smallest magnitude. (iii) Re-training the pruned subnetwork. (iv) Repeating (ii)-(iii) until reaching the target sparsity. To obtain subnetworks from the pre-trained BERT, i.e., (b) and (c) in Fig. 1, the subnetwork parameters are rewound to the pre-trained values after (iii), and (i) can be abandoned. More details about our IMP implementations can be found in Appendix A.1.1.
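The four IMP steps can be sketched on a flat list of weights as follows. This is a toy illustration under stated assumptions: `toy_train` is a hypothetical stand-in for real fine-tuning, the weight values are made up, and the `rewind` flag simply resets the surviving weights to their pre-trained values at the end. The exact way rewinding interleaves with re-training is specified in Appendix A.1.1 and is not reproduced here; for simplicity the sketch also keeps step (i) in both variants.

```python
# Toy sketch of iterative magnitude pruning (IMP); not the authors' implementation.

def magnitude_prune(weights, mask, fraction, budget):
    """Step (ii): drop the smallest-magnitude weights among the survivors."""
    alive = [i for i, m in enumerate(mask) if m == 1]
    # Prune a fraction of the survivors, at least one, never more than the budget.
    n_prune = min(budget, max(1, int(len(alive) * fraction)))
    smallest = sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in smallest:
        new_mask[i] = 0
    return new_mask

def imp(pretrained, train_fn, target_sparsity, fraction=0.2, rewind=False):
    n = len(pretrained)
    mask = [1] * n
    weights = train_fn(list(pretrained), mask)                   # step (i)
    n_to_remove = round(target_sparsity * n)
    while n - sum(mask) < n_to_remove:                           # step (iv)
        budget = n_to_remove - (n - sum(mask))
        mask = magnitude_prune(weights, mask, fraction, budget)  # step (ii)
        weights = train_fn(weights, mask)                        # step (iii)
    if rewind:
        # Keep only the learned structure; restore pre-trained values.
        weights = [w * m for w, m in zip(pretrained, mask)]
    return weights, mask

def toy_train(weights, mask):
    """Hypothetical training stub: rescales surviving weights, keeps pruned ones at zero."""
    return [1.1 * w * m for w, m in zip(weights, mask)]

pretrained = [0.1, -0.5, 0.3, -0.05, 0.8, 0.02, -0.4, 0.6, 0.25, -0.9]
weights, mask = imp(pretrained, toy_train, target_sparsity=0.5, rewind=True)
print(mask)
print(weights)
```

With `rewind=True` the function returns a mask over the pre-trained weights, which is the kind of object needed for paradigms (b) and (c) in Fig. 1; with `rewind=False` it returns the re-trained sparse weights.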

3.2.2 Mask Training

Mask training treats the pruning mask as trainable parameters. Following Mallya et al. (2018); Zhao et al. (2020); Radiya-Dixit and Wang (2020); Liu et al. (2022), we achieve this through binarization in forward pass and gradient estimation in backward pass.

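Each weight matrix $W$, which is frozen during mask training, is associated with a binary mask $M$ and a real-valued mask $\hat{M}$. In the forward pass, $W$ is replaced with $M \odot W$, where $M$ is derived from $\hat{M}$ through binarization:

$$M_{i,j} = \begin{cases} 1 & \text{if } \hat{M}_{i,j} \geq \phi \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

where $\phi$ is the threshold. In the backward pass, since the binarization operation is not differentiable, we use the straight-through estimator Bengio et al. (2013) to compute the gradients for $\hat{M}$ using the gradients of $M$, i.e., $\frac{\partial \mathcal{L}}{\partial \hat{M}} = \frac{\partial \mathcal{L}}{\partial M}$, where $\mathcal{L}$ is the loss. Then, $\hat{M}$ is updated as $\hat{M} \leftarrow \hat{M} - \eta \frac{\partial \mathcal{L}}{\partial M}$, where $\eta$ is the learning rate.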

Following Radiya-Dixit and Wang (2020); Liu et al. (2022), we initialize the real-valued masks according to the magnitude of the original weights. The complete mask training algorithm is summarized in Appendix A.1.2.
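To make the forward binarization and the straight-through update concrete, here is a small sketch on a single linear unit with a squared-error loss. Everything specific in it is an assumption for illustration: the three weights, the input, the target, the threshold, the learning rate, and the choice to initialize the real-valued mask directly with the absolute weight values (the text only says the initialization follows the weight magnitudes).

```python
# Toy sketch of mask training with a straight-through estimator.
# Weights stay frozen; only the real-valued mask is updated.

def binarize(real_mask, threshold):
    """Forward-pass binarization: 1 where the real-valued mask reaches the threshold."""
    return [1 if v >= threshold else 0 for v in real_mask]

def mask_training_step(W, real_mask, x, target, threshold, lr):
    M = binarize(real_mask, threshold)
    y = sum(m * w * xi for m, w, xi in zip(M, W, x))   # forward pass with masked weights
    err = y - target                                   # dL/dy for L = 0.5 * (y - target)^2
    grad_M = [err * w * xi for w, xi in zip(W, x)]     # gradient w.r.t. the binary mask
    # Straight-through estimator: reuse grad_M as the gradient of the real-valued mask.
    new_real_mask = [v - lr * g for v, g in zip(real_mask, grad_M)]
    return new_real_mask, 0.5 * err ** 2

W = [1.0, -2.0, 0.5]                   # frozen weights (made-up values)
real_mask = [abs(w) for w in W]        # assumed magnitude-based initialization
x, target = [1.0, 1.0, 1.0], 1.5
threshold, lr = 0.4, 0.5

for step in range(3):
    real_mask, loss = mask_training_step(W, real_mask, x, target, threshold, lr)
    print(step, binarize(real_mask, threshold), loss)
```

In this toy case the target is reachable only by dropping the second weight, and the update pushes that entry of the real-valued mask below the threshold while the weights themselves never change.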

3.3 Debiasing Methods

As described in the Introduction, the debiasing methods measure the bias degree of training examples. This is achieved by training a bias model. The inputs to the bias model are hand-crafted spurious features based on our prior knowledge of the dataset bias (Section 4.1.3 describes the details). In this way, the bias model mainly relies on the spurious features to make predictions, which can then serve as a measurement of the bias degree. Specifically, given the bias model prediction $p_b = (p_b^1, \dots, p_b^K)$ over the $K$ classes, the bias degree is $\beta = p_b^c$, i.e., the probability of the ground-truth class $c$.

Then, $\beta$ can be used to adjust the training loss in several ways, including product-of-experts (PoE) Clark et al. (2019); He et al. (2019); Karimi Mahabadi et al. (2020), example reweighting Schuster et al. (2019); Ghaddar et al. (2021) and confidence regularization Utama et al. (2020a). Here we describe the standard cross-entropy and PoE, and the other two methods are introduced in Appendix A.2.

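Standard Cross-Entropy computes the cross-entropy between the predicted distribution $p_m$ and the ground-truth one-hot distribution $y$ as $\mathcal{L}_{std} = -y \cdot \log p_m$.

Product-of-Experts combines the predictions of main model and bias model, i.e., $p_m$ and $p_b$, and then computes the training loss as $\mathcal{L}_{poe} = -y \cdot \log \mathrm{softmax}(\log p_m + \log p_b)$.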
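The two losses can be compared on a single example with a short sketch. The probability vectors below are invented for illustration, and the function names `ce_loss` and `poe_loss` are assumptions; the PoE form used here (softmax over the sum of log-probabilities) follows the common formulation in Clark et al. (2019).

```python
import math

# Illustrative comparison of standard cross-entropy and product-of-experts (PoE).
# All probability values are made up for the example.

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(p_main, label):
    """Cross-entropy of the main model's prediction against a one-hot label."""
    return -math.log(p_main[label])

def poe_loss(p_main, p_bias, label):
    """Cross-entropy of the combined (main x bias) distribution against the label."""
    combined = softmax([math.log(pm) + math.log(pb)
                        for pm, pb in zip(p_main, p_bias)])
    return -math.log(combined[label])

p_main = [0.7, 0.2, 0.1]
label = 0

biased_example = [0.9, 0.05, 0.05]   # bias model already confident in the label
conflicting_example = [0.1, 0.8, 0.1]  # bias model points to a wrong class
uniform_bias = [1 / 3, 1 / 3, 1 / 3]

ce = ce_loss(p_main, label)
poe_biased = poe_loss(p_main, biased_example, label)
poe_conflicting = poe_loss(p_main, conflicting_example, label)
poe_uniform = poe_loss(p_main, uniform_bias, label)
print(ce, poe_biased, poe_conflicting, poe_uniform)
```

For the same main-model prediction, the PoE loss is smaller than the standard loss when the bias model is confident in the correct class and larger when the bias model favors a wrong class, which is how biased examples end up contributing less to training. With an uninformative bias model the two losses coincide.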

3.4 Notations

Here we define some notations, which will be used in the following sections.

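  • $\mathcal{A}_{\mathcal{L}}^{t}(f(\theta))$: Training $f(\theta)$ with loss $\mathcal{L}$ for $t$ steps, where $t$ can be omitted for simplicity.

  • $\mathcal{P}_{\mathcal{L}}^{p}(f(\theta))$: Pruning $f(\theta)$ using pruning method $p$ and training loss $\mathcal{L}$.

  • $\mathcal{M}(f(m \odot \theta))$: Extracting the pruning mask of $f(m \odot \theta)$, i.e., $\mathcal{M}(f(m \odot \theta)) = m$.

  • $\mathcal{L} \in \{\mathcal{L}_{std}, \mathcal{L}_{poe}, \mathcal{L}_{reweight}, \mathcal{L}_{confreg}\}$ and $p \in \{\text{imp}, \text{imp-rw}, \text{mask}\}$, where “imp” and “imp-rw” denote the standard IMP and IMP with weight rewinding, as described in Section 3.2.1. “mask” stands for mask training.

  • $\mathcal{E}_{d}(f(\theta))$: Evaluating $f(\theta)$ on the test data with distribution $d \in \{\text{ID}, \text{OOD}\}$.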

4 Sparse and Robust BERT Subnetworks

4.1 Experimental Setups

4.1.1 Datasets and Evaluation

Natural Language Inference

We use MNLI Williams et al. (2018) as the ID dataset for NLI. MNLI is comprised of premise-hypothesis pairs, whose relationship may be entailment, contradiction, or neutral. In MNLI the word overlap between premise and hypothesis is strongly correlated with the entailment class. To solve this problem, the OOD HANS dataset McCoy et al. (2019) is built so that such correlation does not hold.

Paraphrase Identification

The ID dataset for paraphrase identification is QQP, which contains question pairs that are labelled as either duplicate or non-duplicate. In QQP, high lexical overlap is also strongly associated with the duplicate class. The OOD datasets PAWS-qqp and PAWS-wiki Zhang et al. (2019) are built from sentences in Quora and Wikipedia respectively. In PAWS, sentence pairs with high word overlap have a balanced distribution over duplicate and non-duplicate.

Fact Verification

FEVER Thorne et al. (2018) is adopted as the ID dataset of fact verification, where the task is to assess whether a given evidence supports or refutes the claim, or whether there is not-enough-info to reach a conclusion. The OOD dataset Fever-Symmetric (v1 and v2) Schuster et al. (2019) is proposed to evaluate the influence of the claim-only bias (the label can be predicted correctly without the evidence).

For NLI and fact verification, we use Accuracy as the evaluation metric. For paraphrase identification, we evaluate using the F1 score. More details of datasets and evaluation are shown in Appendix


4.1.2 PLM Backbone

We mainly experiment with the BERT-base-uncased model Devlin et al. (2019). It has roughly 110M parameters in total, and 84M parameters in the Transformer layers. As described in Section 3.1, we derive the subnetworks from the Transformer layers and report sparsity levels relative to the 84M parameters. To generalize our conclusions to other PLMs, we also consider two variants of the BERT family, namely RoBERTa-base and BERT-large, the results of which can be found in Appendix C.5.
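Since sparsity is reported relative to the roughly 84M Transformer-layer parameters, the number of weights that survive at a given sparsity level is simple arithmetic, as the short sketch below shows. The 84M figure is the approximate count quoted above; the sparsity levels in the loop are arbitrary illustrative choices.

```python
# Rough arithmetic: parameters kept at a given sparsity, relative to ~84M.
PRUNABLE_PARAMS = 84_000_000  # approximate figure quoted in the text

def remaining_params(sparsity_percent):
    """Number of parameters kept when `sparsity_percent` percent are pruned."""
    return PRUNABLE_PARAMS * (100 - sparsity_percent) // 100

for pct in (50, 70, 90):  # illustrative sparsity levels
    print(f"{pct}% sparsity -> about {remaining_params(pct) / 1e6:.1f}M parameters kept")
```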

4.1.3 Training Details

Following Clark et al. (2019), we use a simple linear classifier as the bias model. For HANS and PAWS, the spurious features are based on the word overlapping information between the two input text sequences. For Fever-Symmetric, the spurious features are max-pooled word embeddings of the claim sentence. More details about the bias model and the spurious features are presented in Appendix


Mask training and IMP basically use the same hyper-parameters (adopted from Utama et al. (2020b)) as full BERT. An exception is longer training, because we find that good subnetworks at high sparsity levels require more training to be found. Unless otherwise specified, we select the best checkpoints based on the ID dev performance, without using OOD information. All the reported results are averaged over 4 runs. We defer training details about each dataset, and each training and pruning setup, to Appendix B.3.

4.2 Subnetworks from Fine-tuned BERT

4.2.1 Problem Formulation and Experimental Setups

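Given the fine-tuned full BERT $f(\theta_{ft}) = \mathcal{A}_{\mathcal{L}_1}(f(\theta_{pt}))$, where $\theta_{pt}$ and $\theta_{ft}$ are the pre-trained and fine-tuned parameters respectively, the goal is to find a subnetwork $f(m \odot \theta'_{ft})$ that satisfies a target sparsity level $s$ and maximizes the ID and OOD performance:

$$\max_{m, \theta'_{ft}} \left( \mathcal{E}_{\text{ID}}(f(m \odot \theta'_{ft})) + \mathcal{E}_{\text{OOD}}(f(m \odot \theta'_{ft})) \right), \quad \text{s.t.} \ \frac{\|m\|_0}{|\theta_{pr}|} = (1 - s) \tag{2}$$

where $\|\cdot\|_0$ is the $L_0$ norm and $|\theta_{pr}|$ is the total number of parameters to be pruned. In practice, the above optimization problem is achieved via $\mathcal{P}_{\mathcal{L}_2}^{p}(f(\theta_{ft}))$, which minimizes the loss $\mathcal{L}_2$ on the ID training set. When the pruning method is IMP, the subnetwork parameters will be further fine-tuned and $\theta'_{ft} \neq \theta_{ft}$. For mask training, only the subnetwork structure is updated and $\theta'_{ft} = \theta_{ft}$.

We consider two kinds of fine-tuned full BERT, which utilize the standard CE loss and PoE loss respectively (i.e., $\mathcal{L}_1 \in \{\mathcal{L}_{std}, \mathcal{L}_{poe}\}$). IMP and mask training are used as the pruning methods (i.e., $p \in \{\text{imp}, \text{mask}\}$). For the standard fine-tuned BERT, both $\mathcal{L}_{std}$ and $\mathcal{L}_{poe}$ are examined in the pruning process. For the PoE fine-tuned BERT, we only use $\mathcal{L}_{poe}$ during pruning. Note that in this work, we mainly experiment with $\mathcal{L}_{std}$ and $\mathcal{L}_{poe}$. $\mathcal{L}_{reweight}$ and $\mathcal{L}_{confreg}$ are also examined for subnetworks from fine-tuned BERT, the results of which can be found in Appendix C.1.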

Figure 2: Results of subnetworks pruned from the CE fine-tuned BERT. “std” means standard, and the shadowed areas denote standard deviations, which also apply to the other figures of this paper.

Figure 3: Results of subnetworks pruned from the PoE fine-tuned BERT. Results of the “mask train (poe)” subnetworks from Fig. 2 (the orange line) are also reported for reference.

4.2.2 Results

Subnetworks from Standard Fine-tuned BERT

The results are shown in Fig. 2 (In this paper, we present most results in figures for clear comparisons. Actual values of the results can be found in the code link.). We discuss them from three perspectives. For the full BERT, we can see that standard CE fine-tuning, which achieves good results on the ID dev sets, performs significantly worse on the OOD test sets. This demonstrates that the ID performance of BERT depends, to a large extent, on memorizing the dataset bias.

In terms of the subnetworks, we can derive the following observations: (1) Using any of the four pruning methods, we can compress a large proportion of the BERT parameters (up to sparsity) and still preserve of the full model’s ID performance. (2) With standard pruning, i.e., “mask train (std)” or “imp (std)”, we can observe small but perceivable improvement over the full BERT on the HANS and PAWS datasets. This suggests that pruning may remove some parameters related to the bias features. (3) The OOD performance of “mask train (poe)” and “imp (poe)” subnetworks is even better, and the ID performance degrades slightly but is still above of the full BERT. This shows that introducing the debiasing objective in the pruning process is beneficial. In particular, as mask training does not change the model parameters, the results of “mask train (poe)” imply that the biased “full bert (std)” contains sparse and robust subnetworks (SRNets) that already encode a less biased solution to the task. (4) SRNets can be identified across a wide range of sparsity levels (from ). However, at higher sparsity of , the performance of the subnetworks is not desirable. (5) We also find that there is an abnormal increase of the PAWS F1 score at sparsity for some pruning methods, when the corresponding ID performance drops sharply. This is because the class distribution of PAWS is imbalanced (see Appendix B.1), and thus even a naive random-guessing model can outperform the biased full model on PAWS. Therefore, an OOD improvement should only be considered meaningful when there is no large ID performance decline.

Comparing IMP and mask training, the latter performs better in general, except for “mask train (poe)” at sparsity on QQP and FEVER. This suggests that directly optimizing the subnetwork structure is a better choice than using the magnitude heuristic as the pruning metric.

Subnetworks from PoE Fine-tuned BERT

Fig. 3 presents the results. We can find that: (1) For the full BERT, the OOD performance is clearly improved with the PoE debiasing method, while the ID performance is sacrificed slightly. (2) Unlike the subnetworks from the standard fine-tuned BERT, the subnetworks of PoE fine-tuned BERT (the green and blue lines) cannot outperform the full model. However, these subnetworks maintain comparable performance at up to sparsity, on both the ID and OOD settings, making them desirable alternatives to the full model in resource-constrained scenarios. Moreover, this phenomenon suggests that there is a great redundancy of BERT parameters, even when OOD generalization is taken into account. (3) With PoE-based pruning, subnetworks from the standard fine-tuned BERT (the orange line) are comparable with subnetworks from the PoE fine-tuned BERT (the blue line). This means we do not have to fine-tune a debiased BERT before searching for the SRNets. (4) IMP, again, slightly underperforms mask training at moderate sparsity levels, while it is better at sparsity on the fact verification task.

Figure 4: Results of BERT subnetworks fine-tuned in isolation. “ft” is short for fine-tuning.

4.3 BERT Subnetworks Fine-tuned in Isolation

4.3.1 Problem Formulation and Experimental Setups

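Given the pre-trained BERT $f(\theta_{pt})$, a subnetwork $f(m \odot \theta_{pt})$ is obtained before downstream fine-tuning. The goal is to maximize the performance of the fine-tuned subnetwork $\mathcal{A}_{\mathcal{L}_1}(f(m \odot \theta_{pt}))$:

$$\max_{m} \left( \mathcal{E}_{\text{ID}}(\mathcal{A}_{\mathcal{L}_1}(f(m \odot \theta_{pt}))) + \mathcal{E}_{\text{OOD}}(\mathcal{A}_{\mathcal{L}_1}(f(m \odot \theta_{pt}))) \right), \quad \text{s.t.} \ \frac{\|m\|_0}{|\theta_{pr}|} = (1 - s) \tag{3}$$

Following the LTH Frankle and Carbin (2019), we solve this problem using the train-prune-rewind pipeline. For IMP, the procedure is described in Section 3.2.1 and $m = \mathcal{M}(\mathcal{P}_{\mathcal{L}_2}^{\text{imp-rw}}(f(\theta_{pt})))$. For mask training, the subnetwork structure is learned from $f(\theta_{ft})$ (same as the previous section) and $m = \mathcal{M}(\mathcal{P}_{\mathcal{L}_2}^{\text{mask}}(f(\theta_{ft})))$.

We employ CE and PoE loss for model fine-tuning (i.e., $\mathcal{L}_1 \in \{\mathcal{L}_{std}, \mathcal{L}_{poe}\}$). Since we have shown that using the debiasing loss in pruning is beneficial, the CE loss is not considered for pruning (i.e., $\mathcal{L}_2 = \mathcal{L}_{poe}$).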

4.3.2 Results

The results of subnetworks fine-tuned in isolation are presented in Fig. 4. It can be found that: (1) For standard CE fine-tuning, the “mask train (poe)” subnetworks are superior to “full bert (std)” on the OOD test data, i.e., the subnetworks are less susceptible to the dataset bias during training. (2) In terms of the PoE-based fine-tuning, the “imp (poe)” and “mask train (poe)” subnetworks are generally comparable to “full bert (poe)”. (3) For most of the subnetworks, “poe ft” clearly outperforms “std ft” in the OOD setting, which suggests that it is important to use the debiasing method in fine-tuning, even if the BERT subnetwork structure has already encoded some unbiased information.

Moreover, based on (1) and (2), we can extend the LTH on BERT Chen et al. (2020); Prasanna et al. (2020); Liang et al. (2021); Liu et al. (2022): The pre-trained BERT contains SRNets that can be fine-tuned in isolation, using either standard or debiasing method, and match or even outperform the full model in both the ID and OOD evaluations.

4.4 BERT Subnetworks Without Fine-tuning

4.4.1 Problem Formulation and Experimental Setups

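This setup aims at finding a subnetwork $f(m \odot \theta_{pt})$ inside the pre-trained BERT, which can be directly applied to a task. The problem is formulated as:

$$\max_{m} \left( \mathcal{E}_{\text{ID}}(f(m \odot \theta_{pt})) + \mathcal{E}_{\text{OOD}}(f(m \odot \theta_{pt})) \right), \quad \text{s.t.} \ \frac{\|m\|_0}{|\theta_{pr}|} = (1 - s) \tag{4}$$

Following Zhao et al. (2020), we fix the pre-trained parameters $\theta_{pt}$ and optimize the mask variables $m$. This process can be represented as $\mathcal{P}_{\mathcal{L}}^{\text{mask}}(f(\theta_{pt}))$, where $\mathcal{L} \in \{\mathcal{L}_{std}, \mathcal{L}_{poe}\}$.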

Figure 5: Results of BERT subnetworks without fine-tuning. Results of the “mask train (poe)” subnetworks from Fig. 2 (the orange line) are also reported for reference.

4.4.2 Results

As we can see in Fig. 5: (1) With CE-based mask training, the identified subnetworks (under sparsity) in pre-trained BERT are competitive with the CE fine-tuned full BERT. (2) Similarly, using PoE-based mask training, the subnetworks under sparsity are comparable to the PoE fine-tuned full BERT, which demonstrates that SRNets for a particular downstream task already exist in the pre-trained BERT. (3) “mask train (poe)” subnetworks in pre-trained BERT can even match the subnetworks found in the fine-tuned BERT (the orange lines) in some cases (e.g., on PAWS and on FEVER under sparsity). Nonetheless, the latter exhibits a better overall performance.

Figure 6: NLI results of BERT subnetworks found using the OOD information. Results of the other two tasks can be found in Appendix C.2.
Figure 7: NLI mask training curves ( sparse), starting from BERT fine-tuned for varied steps. Appendix C.3 shows results of the other two tasks.

4.5 Sparse and Unbiased BERT Subnetworks

4.5.1 Problem Formulation and Experimental Setups

To explore the upper bound of BERT subnetworks in terms of OOD generalization, we include the OOD training data in mask training, and use the OOD test sets for evaluation. Like the previous sections, we investigate three pruning and fine-tuning paradigms, as formulated by Eq. 2, 3 and 4 respectively. We only consider the standard CE for subnetwork and full BERT fine-tuning, which is more vulnerable to the dataset bias. Appendix B.3.3 summarizes the detailed experimental setups.

4.5.2 Results

From Fig. 6 we can observe that: (1) The subnetworks from fine-tuned BERT (“bert-ft subnet”) at sparsity achieve nearly accuracy on HANS, and their ID performance is also close to the full BERT. (2) The subnetworks in the pre-trained BERT (“bert-pt subnet”) also have very high OOD accuracy, while they perform worse than “bert-ft subnet” in the ID setting. (3) “bert-pt subnet + ft” subnetworks, which are fine-tuned in isolation with CE loss, exhibit the best ID performance and the poorest OOD performance. However, compared to the full BERT, these subnetworks still rely much less on the dataset bias, reaching nearly HANS accuracy at sparsity. Jointly, these results show that there consistently exist BERT subnetworks that are almost unbiased towards the MNLI training set bias, under the three kinds of pruning and fine-tuning paradigms.

Figure 8: Comparison between fixed sparsity and gradual sparsity increase for mask training with the standard fine-tuned full BERT. The subnetworks are at sparsity.

5 Refining the SRNets Searching Process

In this section, we study how to further improve the SRNets searching process based on mask training, which generally performs better than IMP, as shown in Section 4.2 and Section 4.3.

5.1 The Timing to Start Searching SRNets

Compared with searching subnetworks from the fine-tuned BERT, directly searching from the pre-trained BERT is more efficient in that it dispenses with fine-tuning the full model. However, the former has a better overall performance, as we have shown in Section 4.4. This raises a question: at which point of the BERT fine-tuning process can we find subnetworks comparable to those found after the end of fine-tuning using mask training? To answer this question, we perform mask training on the model checkpoints from different steps of BERT fine-tuning.

Fig. 7 shows the mask training curves, which start from different . We can see that “ft step=0” converges slower and to a worse final accuracy, as compared with “ft to end”, especially on the HANS dataset. However, with 20,000 steps of full BERT fine-tuning, which is roughly of the “ft to end”, the mask training performance is very competitive. This suggests that the total training cost of SRNet searching can be reduced, by a large amount, in the full model training stage.

To actually reduce the training cost, we need to predict the exact timing to start mask training. This is intractable without information of all the training curves in Fig. 7. A feasible solution is adopting the idea of early-stopping (see Appendix E.1 for detailed discussions). However, accurately predicting the optimal timing (with the least amount of fine-tuning and comparable subnetwork performance to fully fine-tuning) is indeed difficult and we invite follow-up studies to investigate this question.

5.2 SRNets at High Sparsity

As the results of Section 4 demonstrate, there is a sharp decline of the subnetworks’ performance from sparsity. We conjecture that this is because directly initializing mask training to reduces the model’s capacity too drastically, and thus causes some difficulties in optimization. Therefore, we gradually increase the sparsity from during mask training, using the cubic sparsity schedule Zhu and Gupta (2018) (see Appendix C.4 for ablation studies). Fig. 8 compares the fixed sparsity used in the previous sections and the gradual sparsity increase, across varied mask training epochs. We find that while simply extending the training process is helpful, gradual sparsity increase achieves better results. In particular, “gradual” outperforms “fixed” with lower training cost on all the three tasks, except for the PAWS dataset. A similar phenomenon is explained in Section
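The cubic sparsity schedule of Zhu and Gupta (2018) can be sketched as below. The function is a simplified continuous form of their schedule, and the initial sparsity, final sparsity, and number of steps are hypothetical values chosen only to show the shape of the curve; they are not the settings used in the experiments.

```python
# Sketch of a cubic sparsity schedule in the style of Zhu and Gupta (2018).
# The start/end sparsities and step counts below are illustrative assumptions.

def cubic_sparsity(step, total_steps, initial_sparsity, final_sparsity):
    """Sparsity at `step`, rising from the initial to the final value along a cubic curve."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

for step in (0, 25, 50, 75, 100):
    print(step, round(cubic_sparsity(step, 100, 0.7, 0.9), 4))
```

The schedule prunes quickly at the start, when many redundant weights remain, and slows down as the target sparsity is approached, which gives the remaining mask more updates to adapt near the end.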


6 Conclusions and Limitations

In this paper, we investigate whether sparsity and robustness to dataset bias can be achieved simultaneously for PLM subnetworks. Through extensive experiments, we demonstrate that BERT indeed contains sparse and robust subnetworks (SRNets) across a variety of NLU tasks and training and pruning setups. We further use the OOD information to reveal that there exist sparse and almost unbiased BERT subnetworks. Finally, we present analysis and solutions to refine the SRNet searching process in terms of subnetwork performance and searching efficiency.

The limitations of this work are twofold. First, we focus on BERT-like PLMs and NLU tasks, while dataset biases are also common in other scenarios. For example, gender and racial biases exist in dialogue generation systems Dinan et al. (2020) and PLMs Guo et al. (2022). In future work, we would like to extend our exploration to other types of PLMs and NLP tasks (see Appendix E.2 for a discussion). Second, as we discussed in Section 5.1, our analysis of “the timing to start searching SRNets” mainly serves as a proof of concept, and actually reducing the training cost requires predicting the exact timing.

This work was supported by National Natural Science Foundation of China (61976207 and 61906187).


  • A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi (2018) Don’t just assume; look and answer: overcoming priors for visual question answering. In CVPR, pp. 4971–4980. Cited by: §2.2.
  • S. Beery, G. V. Horn, and P. Perona (2018) Recognition in terra incognita. In ECCV (16), Lecture Notes in Computer Science, Vol. 11220, pp. 472–489. Cited by: Appendix D.
  • Y. Bengio, N. Léonard, and A. C. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432. Cited by: §3.2.2.
  • T. Chen, J. Frankle, S. Chang, S. Liu, Y. Zhang, Z. Wang, and M. Carbin (2020) The lottery ticket hypothesis for pre-trained BERT networks. In NeurIPS, pp. 15834–15846. Cited by: §B.1, §1, §2.1, §4.3.2.
  • C. Clark, M. Yatskar, and L. Zettlemoyer (2019) Don’t take the easy way out: ensemble based methods for avoiding known dataset biases. In EMNLP/IJCNLP, pp. 4069–4082. Cited by: §B.3.1, §1, §2.2, §3.3, §4.1.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186. Cited by: §1, §4.1.2.
  • E. Dinan, A. Fan, A. Williams, J. Urbanek, D. Kiela, and J. Weston (2020) Queens are powerful too: mitigating gender bias in dialogue generation. In EMNLP (1), pp. 8173–8188. Cited by: §6.
  • M. Du, S. Mukherjee, Y. Cheng, M. Shokouhi, X. Hu, and A. H. Awadallah (2021) What do compressed large language models forget? robustness challenges in model compression. CoRR abs/2110.08419. Cited by: Appendix D, Appendix D, Appendix D, §2.3.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR, Cited by: §1, §2.1, §3.2.1, §4.3.1.
  • Y. Fu, Q. Yu, Y. Zhang, S. Wu, X. Ouyang, D. D. Cox, and Y. Lin (2021) Drawing robust scratch tickets: subnetworks with inborn robustness are found within randomly initialized networks. In NeurIPS, pp. 13059–13072. Cited by: Appendix D, §2.3.
  • T. Gale, E. Elsen, and S. Hooker (2019) The state of sparsity in deep neural networks. CoRR abs/1902.09574. Cited by: §2.1.
  • P. Ganesh, Y. Chen, X. Lou, M. A. Khan, Y. Yang, H. Sajjad, P. Nakov, D. Chen, and M. Winslett (2021) Compressing large-scale transformer-based models: a case study on BERT. Transactions of the Association for Computational Linguistics 9, pp. 1061–1080. Cited by: §1.
  • A. Ghaddar, P. Langlais, M. Rezagholizadeh, and A. Rashid (2021) End-to-end self-debiasing framework for robust NLU training. In ACL/IJCNLP (Findings), Findings of ACL, Vol. ACL/IJCNLP 2021, pp. 1923–1929. Cited by: §1, §2.2, §3.3.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR (Poster), Cited by: Appendix D.
  • M. A. Gordon, K. Duh, and N. Andrews (2020) Compressing BERT: studying the effects of weight pruning on transfer learning. In RepL4NLP@ACL, pp. 143–155. Cited by: §2.1.
  • S. Gui, H. Wang, H. Yang, C. Yu, Z. Wang, and J. Liu (2019) Model compression with adversarial robustness: A unified optimization framework. In NeurIPS, pp. 1283–1294. Cited by: Appendix D, Appendix D, §2.3.
  • Y. Guo, Y. Yang, and A. Abbasi (2022) Auto-debias: debiasing masked language models with automated biased prompts. In ACL, pp. 1012–1023. Cited by: §6.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In NAACL-HLT, pp. 107–112. Cited by: §1, §2.2.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems 28, pp. 1135–1143. Cited by: §3.2.1.
  • H. He, S. Zha, and H. Wang (2019) Unlearn dataset bias in natural language inference by fitting the residual. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pp. 132–142. Cited by: §1, §2.2, §3.3.
  • D. Hendrycks, X. Liu, E. Wallace, A. Dziedzic, R. Krishnan, and D. Song (2020) Pretrained transformers improve out-of-distribution robustness. In ACL, pp. 2744–2751. Cited by: §C.5.
  • G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531. Cited by: §A.2, Appendix D.
  • I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In NIPS, pp. 4107–4115. Cited by: §1.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2020) TinyBERT: distilling BERT for natural language understanding. In EMNLP (Findings), pp. 4163–4174. Cited by: §2.1.
  • R. Karimi Mahabadi, Y. Belinkov, and J. Henderson (2020) End-to-end bias mitigation by modelling biases in corpora. In ACL, pp. 8706–8716. Cited by: §1, §2.2, §3.3.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR, Cited by: §2.1.
  • Z. Li, E. Wallace, S. Shen, K. Lin, K. Keutzer, D. Klein, and J. E. Gonzalez (2020) Train large, then compress: rethinking model size for efficient training and inference of transformers. CoRR abs/2002.11794. Cited by: §1.
  • C. Liang, S. Zuo, M. Chen, H. Jiang, X. Liu, P. He, T. Zhao, and W. Chen (2021) Super tickets in pre-trained language models: from model compression to improving generalization. In ACL/IJCNLP, pp. 6524–6538. Cited by: §1, §2.1, §4.3.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. Cited by: §C.5.
  • Y. Liu, Z. Lin, and F. Yuan (2021a) ROSITA: refined BERT compression with integrated techniques. In AAAI, pp. 8715–8722. Cited by: §1, §2.1.
  • Y. Liu, F. Meng, Z. Lin, P. Fu, Y. Cao, W. Wang, and J. Zhou (2022) Learning to win lottery tickets in BERT transfer via task-agnostic mask training. CoRR abs/2204.11218. Cited by: §A.1.2, §B.1, §1, §2.1, §3.2.2, §3.2.2, §4.3.2.
  • Y. Liu, F. Meng, Z. Lin, W. Wang, and J. Zhou (2021b) Marginal utility diminishes: exploring the minimum knowledge for BERT knowledge distillation. In ACL/IJCNLP, pp. 2928–2941. Cited by: §2.1.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In ICLR (Poster), Cited by: 1.
  • D. Madaan, J. Shin, and S. J. Hwang (2020) Adversarial neural pruning with latent vulnerability suppression. In ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 6575–6585. Cited by: Appendix D.
  • A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In ECCV, Lecture Notes in Computer Science, Vol. 11208, pp. 72–88. Cited by: §1, §3.2.2.
  • Y. Mao, Y. Wang, C. Wu, C. Zhang, Y. Wang, Q. Zhang, Y. Yang, Y. Tong, and J. Bai (2020) LadaBERT: lightweight adaptation of BERT through hybrid model compression. In COLING, pp. 3225–3234. Cited by: §2.1.
  • T. McCoy, E. Pavlick, and T. Linzen (2019) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In ACL, pp. 3428–3448. Cited by: Appendix D, §1, §2.2, §4.1.1.
  • P. Michel, O. Levy, and G. Neubig (2019) Are sixteen heads really better than one?. In NeurIPS, pp. 14014–14024. Cited by: §1, §2.1.
  • T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin (2018) Advances in pre-training distributed word representations. In LREC, Cited by: 4th item.
  • S. Prasanna, A. Rogers, and A. Rumshisky (2020) When BERT plays the lottery, all tickets are winning. In EMNLP, pp. 3208–3229. Cited by: §1, §2.1, §4.3.2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding with unsupervised learning. In Technical report, OpenAI, Cited by: §E.2.
  • E. Radiya-Dixit and X. Wang (2020) How fine can fine-tuning be? learning efficient language models. In AISTATS, Proceedings of Machine Learning Research, Vol. 108, pp. 2435–2443. Cited by: §A.1.2, §3.2.2, §3.2.2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, pp. 140:1–140:67. Cited by: §E.2.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108. Cited by: §2.1.
  • V. Sanh, T. Wolf, and A. M. Rush (2020) Movement pruning: adaptive sparsity by fine-tuning. In NeurIPS, pp. 20378–20389. Cited by: §B.1.
  • T. Schuster, D. J. Shah, Y. J. S. Yeo, D. Filizzola, E. Santus, and R. Barzilay (2019) Towards debiasing fact verification models. In EMNLP/IJCNLP, pp. 3417–3423. Cited by: §B.1, Appendix D, §1, §1, §2.2, §3.3, §4.1.1.
  • V. Sehwag, S. Wang, P. Mittal, and S. Jana (2019) Towards compact and robust deep neural networks. CoRR abs/1906.06110. Cited by: Appendix D.
  • V. Sehwag, S. Wang, P. Mittal, and S. Jana (2020) HYDRA: pruning adversarially robust neural networks. In NeurIPS, Cited by: Appendix D, Appendix D, §2.3.
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in NLP. In ACL, pp. 3645–3650. Cited by: §1.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu (2019) Patient knowledge distillation for BERT model compression. In EMNLP/IJCNLP, pp. 4322–4331. Cited by: §2.1.
  • T. Tambe, C. Hooper, L. Pentecost, E. Yang, M. Donato, V. Sanh, A. M. Rush, D. Brooks, and G. Wei (2020) EdgeBERT: optimizing on-chip inference for multi-task NLP. CoRR abs/2011.14203. Cited by: §2.1.
  • J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, and A. Mittal (2018) The fact extraction and verification (FEVER) shared task. CoRR abs/1811.10971. Cited by: §4.1.1.
  • L. Tu, G. Lalwani, S. Gella, and H. He (2020) An empirical study on robustness to spurious correlations using pre-trained language models. Trans. Assoc. Comput. Linguistics 8, pp. 621–633. Cited by: §C.5.
  • P. A. Utama, N. S. Moosavi, and I. Gurevych (2020a) Mind the trade-off: debiasing NLU models without degrading the in-distribution performance. In ACL, pp. 8717–8729. Cited by: §C.1, §1, §2.2, §3.3.
  • P. A. Utama, N. S. Moosavi, and I. Gurevych (2020b) Towards debiasing NLU models from unknown biases. In EMNLP, pp. 7597–7610. Cited by: §B.3.2, §1, §2.2, §4.1.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §3.1.
  • A. Williams, N. Nangia, and S. R. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, pp. 1112–1122. Cited by: §4.1.1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. Cited by: §B.2.
  • C. Xu, W. Zhou, T. Ge, K. Xu, J. J. McAuley, and F. Wei (2021) Beyond preserved accuracy: evaluating loyalty and robustness of BERT compression. In EMNLP (1), pp. 10653–10659. Cited by: Appendix D, Appendix D, §2.3.
  • S. Ye, X. Lin, K. Xu, S. Liu, H. Cheng, J. Lambrechts, H. Zhang, A. Zhou, K. Ma, and Y. Wang (2019) Adversarial robustness vs. model compression, or both?. In ICCV, pp. 111–120. Cited by: Appendix D, Appendix D, §2.3.
  • O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat (2019) Q8BERT: quantized 8bit BERT. In EMC2@NeurIPS, pp. 36–39. Cited by: §2.1.
  • D. Zhang, K. Ahuja, Y. Xu, Y. Wang, and A. C. Courville (2021) Can subnetwork structure be the key to out-of-distribution generalization?. In ICML, Proceedings of Machine Learning Research, Vol. 139, pp. 12356–12367. Cited by: Appendix D, Appendix D, §2.3.
  • T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang (2018) A systematic DNN weight pruning framework using alternating direction method of multipliers. In ECCV (8), Lecture Notes in Computer Science, Vol. 11212, pp. 191–207. Cited by: Appendix D.
  • W. Zhang, L. Hou, Y. Yin, L. Shang, X. Chen, X. Jiang, and Q. Liu (2020) TernaryBERT: distillation-aware ultra-low bit BERT. In EMNLP, pp. 509–521. Cited by: §2.1.
  • Y. Zhang, J. Baldridge, and L. He (2019) PAWS: paraphrase adversaries from word scrambling. In NAACL-HLT, pp. 1298–1308. Cited by: Appendix D, §1, §2.2, §4.1.1.
  • M. Zhao, T. Lin, F. Mi, M. Jaggi, and H. Schütze (2020) Masking as an efficient alternative to finetuning for pretrained language models. In EMNLP, pp. 2226–2241. Cited by: §1, §2.1, §3.2.2, §4.4.1.
  • M. Zhu and S. Gupta (2018) To prune, or not to prune: exploring the efficacy of pruning for model compression. In ICLR (Workshop), Cited by: §5.2.


  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work? See Section 6.

    3. Did you discuss any potential negative societal impacts of your work? Currently, we think there are no apparent negative societal impacts related to our work.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? We will release the codes and reproduction instructions upon publication.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Section 4.1 and Appendix B.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See all the figures of our experiments.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix B.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? See Section 4.1.

    2. Did you mention the license of the assets? Licenses of some of the datasets we used are mentioned in Section 4.1. However, for the other datasets, we were unable to find the licenses.

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A More Information of Pruning and Debiasing Methods

A.1 Pruning Methods

A.1.1 Iterative Magnitude Pruning

Algo. 1 summarizes our implementation of IMP and IMP with weight rewinding. In practice, the per-time pruning ratio and the pruning interval are set as listed in Tab. 4.

Input: PLM with pre-trained weights, maximum training steps, pruning interval, per-time pruning ratio, target sparsity level, pruning method (imp or imp-rw)
Output: Pruned subnetwork
Initialize the pruning mask and the number of pruning operations
while t < maximum training steps do
      if (t mod pruning interval) == 0 then
            # For imp, return the subnetwork after some further training
            if pruning method == imp and current sparsity == target sparsity then
                  return the pruned subnetwork
            end if
            Prune from the remaining parameters based on their magnitudes, by the per-time pruning ratio, and update the pruning mask accordingly
            # For imp-rw, return the subnetwork directly after pruning
            if pruning method == imp-rw and current sparsity == target sparsity then
                  return the pruned subnetwork
            end if
      end if
      Update the remaining model parameters via AdamW Loshchilov and Hutter [2019]
end while
Algorithm 1: Iterative Magnitude Pruning (+ weight rewinding)
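The control flow of Algorithm 1 can be sketched on a flat list of weights as follows. This is only an illustrative sketch, not the authors' implementation: the function and argument names are hypothetical, `train_step` stands in for one AdamW update, and the per-time ratio is assumed to be a fraction of the original parameter count, removed from the weights that are still alive.

```python
def imp(weights, max_steps, interval, ratio, target, train_step, rewind=False):
    """Iterative magnitude pruning on a flat list of weights.

    rewind=False mimics "imp": return after some further training.
    rewind=True mimics "imp-rw": return directly after the last pruning,
    so the mask can be applied to rewound (pre-trained) weights.
    Assumption: `ratio` is a fraction of the ORIGINAL number of weights.
    """
    mask = [1] * len(weights)
    per_time = int(round(ratio * len(weights)))
    n_prunings = 0
    for t in range(max_steps):
        if t % interval == 0:
            reached = n_prunings * ratio >= target - 1e-9
            if reached and not rewind:
                return mask  # imp: the last pruning was followed by training
            # Remove the smallest-magnitude weights among those still alive.
            alive = sorted((i for i, m in enumerate(mask) if m),
                           key=lambda i: abs(weights[i]))
            for i in alive[:per_time]:
                mask[i] = 0
            n_prunings += 1
            if rewind and n_prunings * ratio >= target - 1e-9:
                return mask  # imp-rw: return right after pruning
        train_step(weights, mask)  # placeholder for one optimizer update
    return mask


if __name__ == "__main__":
    toy_weights = [(-1) ** i * (i + 1) / 100 for i in range(100)]
    found = imp(toy_weights, max_steps=1000, interval=10, ratio=0.1,
                target=0.7, train_step=lambda w, m: None)
    print("remaining weights:", sum(found))
```

With a no-op `train_step`, the surviving weights are simply the largest-magnitude ones; in practice the training between pruning steps changes which weights survive.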

A.1.2 Mask Training

As described in Section 3.2.2 of the main paper, we realize mask training via binarization in the forward pass and gradient estimation in the backward pass. Following Radiya-Dixit and Wang [2020] and Liu et al. [2022], we adopt a magnitude-based strategy to initialize the real-valued masks. Specifically, we consider two variants. The first one (hard variant, Eq. 5) identifies the weights in each weight matrix with the smallest magnitudes according to the target sparsity level, sets the corresponding elements of the real-valued mask to zero, and sets the remaining elements to a fixed value that involves a hyper-parameter. The second one (soft variant, Eq. 6) directly uses the absolute values of the weights for mask initialization.

To control the sparsity of the model, the threshold is adjusted dynamically at a fixed interval of training steps. In practice, we control the sparsity in a local way, i.e., every weight matrix must satisfy the same sparsity constraint. Algo. 2 summarizes the entire process of mask training.

Input: PLM with pre-trained weights, maximum training steps, threshold update frequency, target sparsity level, threshold, hyper-parameter, initialization method init (hard or soft)
Output: Pruned subnetwork
if init == hard then
      Initialize the real-valued mask according to Eq. 5
      Set the threshold to its pre-defined value
else
      Initialize the real-valued mask according to Eq. 6
      Set the threshold according to the sparsity constraint
end if
while t < maximum training steps do
      Get a mini-batch of examples
      Forward pass through binarization: binarize the real-valued mask with the threshold and apply the binary mask to the weights
      Backward pass through gradient estimation: update the real-valued mask with the gradient estimated for the binary mask
      if (t mod threshold update frequency) == 0 then
            Update the threshold to satisfy the sparsity constraint
      end if
end while
Algorithm 2: Mask Training
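The building blocks of mask training described above (magnitude-based initialization, thresholded binarization, threshold update under a local sparsity constraint, and a straight-through style update) can be sketched for a single weight matrix flattened into a list. All function names and the `fixed_value` argument are hypothetical, and the exact constants of Eq. 5 and Eq. 6 are not reproduced here.

```python
def init_mask_hard(weights, sparsity, fixed_value):
    """Hard variant: zero for the lowest-magnitude weights, a constant elsewhere."""
    n_zero = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    lowest = set(order[:n_zero])
    return [0.0 if i in lowest else fixed_value for i in range(len(weights))]


def init_mask_soft(weights):
    """Soft variant: the real-valued mask is the absolute value of each weight."""
    return [abs(w) for w in weights]


def threshold_for_sparsity(real_mask, sparsity):
    """Threshold such that a `sparsity` fraction of entries fall below it.

    Ties at the threshold are all kept, so the realized sparsity can be
    slightly lower than requested when mask values repeat.
    """
    n_zero = int(round(sparsity * len(real_mask)))
    ranked = sorted(real_mask)
    return ranked[n_zero] if n_zero < len(ranked) else float("inf")


def binarize(real_mask, threshold):
    """Forward pass: keep a weight only if its mask value reaches the threshold."""
    return [1 if m >= threshold else 0 for m in real_mask]


def straight_through_update(real_mask, grad_binary_mask, lr):
    """Backward pass: apply the gradient of the binary mask to the real-valued mask."""
    return [m - lr * g for m, g in zip(real_mask, grad_binary_mask)]


if __name__ == "__main__":
    w = [0.5, -0.1, 0.3, -0.9, 0.05, 0.7, -0.2, 0.4, 0.6, -0.8]
    soft = init_mask_soft(w)
    print(binarize(soft, threshold_for_sparsity(soft, 0.7)))
```

Because the threshold is recomputed per weight matrix, every matrix satisfies the same sparsity level, which corresponds to the local sparsity control described in the text.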

A.2 Debiasing Methods

We have introduced the PoE method in Section 3.3. Here we describe the other two debiasing methods, i.e., example reweighting and confidence regularization.

Example Reweighting directly assigns an importance weight to the standard CE training loss of each example, according to its bias degree.

Confidence Regularization is based on knowledge distillation Hinton et al. [2015]. It involves a teacher model trained with the standard CE loss. The teacher model’s prediction is used as a supervision signal to train the main model. To account for the bias degree of training examples, the teacher’s prediction is smoothed using a scaling function, and the final loss is computed against the smoothed prediction.
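As a rough illustration of the two losses, the sketch below follows a formulation that is common in the debiasing literature: the bias degree is taken to be the bias model's probability on the gold label, example reweighting scales the CE loss by one minus that probability, and confidence regularization raises the teacher distribution to the power of one minus the bias degree before renormalizing. These specific forms, and all names in the code, are assumptions; the exact equations used in the paper are not shown in this text.

```python
import math


def cross_entropy(p_main, gold):
    """Standard CE loss of the main model for one example."""
    return -math.log(p_main[gold])


def example_reweighting_loss(p_main, p_bias, gold):
    """CE loss down-weighted for examples the bias model already gets right."""
    weight = 1.0 - p_bias[gold]  # assumed form of the importance weight
    return weight * cross_entropy(p_main, gold)


def scale_teacher(p_teacher, beta):
    """Smooth the teacher distribution; beta=0 keeps it, beta=1 makes it uniform."""
    powered = [p ** (1.0 - beta) for p in p_teacher]
    total = sum(powered)
    return [p / total for p in powered]


def confidence_regularization_loss(p_main, p_teacher, p_bias, gold):
    """Distillation loss against the bias-aware, smoothed teacher prediction."""
    beta = p_bias[gold]  # assumed definition of the bias degree
    soft_target = scale_teacher(p_teacher, beta)
    return -sum(s * math.log(p) for s, p in zip(soft_target, p_main))


if __name__ == "__main__":
    p_main = [0.7, 0.2, 0.1]
    p_teacher = [0.8, 0.1, 0.1]
    p_bias = [0.9, 0.05, 0.05]  # a highly biased example
    print(example_reweighting_loss(p_main, p_bias, gold=0))
    print(confidence_regularization_loss(p_main, p_teacher, p_bias, gold=0))
```

In both cases, examples that the bias model predicts confidently contribute a weaker or softer training signal, which discourages the main model from relying on the same shortcut.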


Appendix B More Experimental Setups

B.1 Datasets and Evaluations

        NLI                   Paraphrase Identification            Fact Verification
        MNLI      HANS        QQP       PAWS-qqp   PAWS-wiki       FEVER     FEVER-Symm1   FEVER-Symm2
Train   392,702   30,000      363,849   11,988     49,401          242,911   -             -
Dev     9,815     30,000      40,432    677        8,000           16,664    -             708
Test    -         -           -         -          8,000           -         717           712

Table 1: The number of examples in different dataset splits. The splits used for evaluation are highlighted with red color. The dev set for MNLI is MNLI-m.

                 MNLI     HANS
Train  ent       33.3%    50%
       cont      33.3%    50%
       neutral   33.3%    0%
Eval   ent       35.4%    50%
       cont      32.7%    50%
       neutral   31.8%    0%

                 QQP      PAWS-qqp   PAWS-wiki
Train  dulp      36.9%    31.5%      44.2%
       non-dulp  63.1%    68.5%      55.8%
Eval   dulp      36.8%    28.2%      44.2%
       non-dulp  63.2%    71.8%      55.8%

                 FEVER    Symm1      Symm2
Train  supp      41.4%    -          -
       refute    17.2%    -          -
       not-info  41.4%    -          -
Eval   supp      47.9%    52.9%      50%
       refute    52.1%    47.1%      50%
       not-info  0%       0%         0%

Table 2: Data distribution over classes. The abbreviations are: ent (entailment), cont (contradiction), dulp (duplicate), supp (support), not-info (not-enough-info). “Eval” represents the dataset split used for evaluation, as described in Tab. 1.

We utilize eight datasets from three NLU tasks. The statistics of the different dataset splits are summarized in Tab. 1. If a dataset has a test set, we use it for evaluation; otherwise, we report results on the dev set. For MNLI and QQP, since the official test server only allows two submissions a day, we instead evaluate on the dev sets, following Chen et al. [2020], Liu et al. [2022], Sanh et al. [2020]. For FEVER, we use the training and evaluation data processed by Schuster et al. [2019].

Tab. 2 shows the distribution of examples over classes. We can see that the distributions of the QQP and PAWS-qqp evaluation sets are imbalanced. Specifically, on the OOD PAWS-qqp, where a biased model tends to assign most examples to the duplicate class, simply classifying all examples as non-duplicate can substantially improve accuracy. To account for this, we use the F1 score to evaluate the performance on the three paraphrase identification datasets. Specifically, we calculate the weighted average of the F1 score of each class. However, the class imbalance may still affect the evaluation on PAWS (as we discussed in Section 4.2.2), and therefore the OOD improvement should be assessed by also considering the ID performance.
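The class-weighted F1 score mentioned above can be sketched in plain Python as follows. The function name and the toy labels are illustrative assumptions; each class's F1 is weighted by its share of the gold labels, which is one standard reading of "weighted average of the F1 score of each class".

```python
def weighted_f1(gold, pred):
    """Average of per-class F1 scores, weighted by each class's gold support."""
    total = 0.0
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * gold.count(label) / len(gold)
    return total


if __name__ == "__main__":
    # Toy evaluation set with roughly the PAWS-qqp class ratio from Tab. 2.
    gold = ["dup"] * 28 + ["non-dup"] * 72
    all_non_dup = ["non-dup"] * 100
    accuracy = sum(g == p for g, p in zip(gold, all_non_dup)) / len(gold)
    print("accuracy:", accuracy)
    print("weighted F1:", round(weighted_f1(gold, all_non_dup), 4))
```

On such an imbalanced set, a trivial "always non-duplicate" predictor scores lower under weighted F1 than under accuracy, because the duplicate class contributes an F1 of zero.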

B.2 Software and Computational Resources

We use two types of GPU, i.e., Nvidia V100 and TITAN RTX. All the experiments are run on a single GPU. Our code is based on PyTorch and the Hugging Face Transformers library Wolf et al. [2020].

B.3 Training Details

B.3.1 Bias Model

As mentioned in Section 4.1.3, we train the bias model with spurious features. For MNLI and QQP, we adopt the hand-crafted word-overlap features proposed by Clark et al. [2019], which include:

  • Whether all the hypothesis words also belong to the premise.

  • Whether the hypothesis appears as a continuous subsequence in the premise.

  • The percentage of the hypothesis words that appear in the premise.

  • The average of the maximum similarity between each hypothesis word and all the premise words, where the similarity is computed based on the fastText word vectors Mikolov et al. [2018] and the cosine distance.

  • The minimum of the same similarities above.

For FEVER, we use the max-pooled word embeddings of the claim sentence, which are also based on the fastText word vectors.
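The five word-overlap features listed above can be sketched as follows. The whitespace tokenization, the function names, and the tiny `toy_vectors` dictionary (a stand-in for fastText vectors) are assumptions for illustration only, and the sketch expects a non-empty hypothesis whose words all have vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def overlap_features(premise, hypothesis, vectors):
    """Word-overlap features of a (premise, hypothesis) pair."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    p_set = set(p)
    # Best similarity of each hypothesis word against all premise words.
    max_sims = [max(cosine(vectors[a], vectors[b]) for b in p) for a in h]
    return {
        "all_in_premise": all(w in p_set for w in h),
        "is_subsequence": any(p[i:i + len(h)] == h
                              for i in range(len(p) - len(h) + 1)),
        "overlap_ratio": sum(w in p_set for w in h) / len(h),
        "avg_max_sim": sum(max_sims) / len(max_sims),
        "min_max_sim": min(max_sims),
    }


if __name__ == "__main__":
    # Hypothetical 3-d vectors; real features would use fastText embeddings.
    toy_vectors = {"the": (1, 0, 0), "doctor": (0, 1, 0), "lawyer": (0, 0, 1),
                   "saw": (1, 1, 0), "nurse": (0, 1, 1)}
    print(overlap_features("the doctor saw the lawyer",
                           "the lawyer saw the doctor", toy_vectors))
```

A pair such as the one in the usage example has full lexical overlap but a different meaning, which is exactly the kind of shortcut a bias model trained on these features can exploit.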

        #Epoch   Learning Rate   Batch Size   Max Length   Eval Interval   Eval Metric   Optimizer
MNLI    3 or 5   5e-5            32           128          1,000           Acc           AdamW
QQP     3        2e-5            32           128          1,000           F1            AdamW
FEVER   3        2e-5            32           128          500             Acc           AdamW

Table 3: Basic training hyper-parameters.

Mask Training IMP Mask Init Sparsity Schedule magnitude (hard) fixed to 0.01 2 equal to Eval Interval 10% 0.1

Table 4: Basic hyper-parameters related to pruning methods. is the number of optimization steps by training #Epoch epochs.

B.3.2 Full BERT

The main training hyper-parameters are shown in Tab. 3, which basically follow Utama et al. [2020b]. Most of the hyper-parameters are the same for different training strategies, except for the number of training epochs (#Epoch) on MNLI. For the standard CE loss and example reweighting, the model is trained for 3 epochs. For PoE and confidence regularization, the model is trained for 5 epochs.

B.3.3 Mask Training and IMP

Mask training and IMP basically use the same set of hyper-parameters as full BERT, except for longer training. The number of training epochs for mask training and IMP is 5 on MNLI, and 7 on QQP and FEVER. The hyper-parameters specific to mask training or IMP are summarized in Tab. 4. Unless otherwise specified, we adopt the hard variant of mask initialization (Eq. 5) and fix the subnetwork sparsity to the target sparsity throughout the process of mask training. Some special experimental setups are described as follows:

Subnetworks from Fine-tuned BERT

When we search for subnetworks at low sparsity (e.g., 20%) from a fine-tuned BERT, we find that mask training (with debiasing loss) stably improves the OOD performance, while the ID performance peaks at an early point of training and then slightly drops and recovers later. Therefore, the ID performance favors the early checkpoints, which are not good at the OOD generalization. To address this problem, we select the best checkpoint after of training, but still according to the performance on the ID dev set. This strategy is only adopted for mask training on fine-tuned BERT (for all sparsity levels), and in other cases we select the best checkpoint across training based on ID performance.

BERT Subnetworks Fine-tuned in Isolation

When fine-tuning the searched subnetworks (with their weights rewound to pre-trained values) in isolation, we use the same set of hyper-parameters as full BERT fine-tuning.

Sparse and Unbiased BERT Subnetworks

The OOD data is used in this setup. Specifically, we utilize the training data of HANS and PAWS for NLI and paraphrase identification respectively. In terms of the FEVER-Symmetric dataset, which does not provide a training set (see Tab. 1), we use the dev set of FEVER-Symm2 and copy the data 10 times to construct the OOD training data. The OOD and ID training data are then combined to form the final training set. Note that the evaluation sets are the same as the other setups, and NO test data is used in mask training.

Gradual Sparsity Increase

We mainly experiment with the gradual sparsity increase schedule for subnetworks at 90% sparsity. Concretely, we increase the sparsity from 70% to 90% during the process of mask training. The real-valued mask is initialized using the soft-variant (Eq. 6). This is because we find that the hard-variant is difficult to optimize with sparsity increase.

Figure 9: Results of subnetworks pruned from the CE fine-tuned BERT, with different debiasing methods in pruning.
Figure 10: Results of subnetworks found using the OOD information.

Appendix C More Results and Analysis

C.1 More Debiasing Methods

In Section 4, we mainly experiment with the PoE debiasing method. Here, we combine mask training with the other two debiasing methods, namely example reweighting and confidence regularization, and search for SRNets from the CE fine-tuned BERT. Fig. 9 presents the results. As we can see: (1) Pruning with different debiasing methods almost consistently improves the OOD performance over the CE fine-tuned BERT. (2) The confidence regularization method (the grey lines) only achieves mild OOD improvement over the full BERT, while it preserves more ID performance compared with the other two methods. This phenomenon is in accordance with the results of Utama et al. [2020a], who propose the confidence regularization method to achieve a better trade-off between the ID and OOD performance.

C.2 Sparse and Unbiased Subnetworks

Fig. 10 shows the results of mask training with the OOD training data. We can see that the general patterns on the paraphrase identification and fact verification datasets are basically the same as on the NLI datasets. Although the identified subnetworks cannot achieve 100% accuracy on PAWS and FEVER-Symmetric as they do on HANS, they substantially narrow the gap between OOD and ID performance, as compared with the full BERT. An exception is Symm2, where the upper bound of SRNets seems not very high. This is probably because we do not have enough examples (708 in total) to represent the data distribution of the FEVER-Symmetric dataset. Therefore, we conjecture that the existence of sparse and unbiased subnetworks might be ubiquitous.

C.3 The Timing to Start Searching SRNets

Fig. 11 shows the mask training curves on all 8 datasets. Similar to the NLI datasets, mask training on the other two tasks can achieve results comparable to “ft to end” by starting from an intermediate checkpoint of BERT fine-tuning. For QQP, we can start from 15,000 steps of full BERT fine-tuning (44% of the total fine-tuning steps). For FEVER, we can start from 10,000 steps (44% of the total fine-tuning steps).
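The two 44% figures can be checked from the training-set sizes in Tab. 1 and the hyper-parameters in Tab. 3. The short calculation below assumes 3 epochs of full BERT fine-tuning with batch size 32 and one optimizer step per batch (no gradient accumulation); the helper names are illustrative.

```python
import math


def total_steps(num_examples, batch_size=32, epochs=3):
    """Number of optimizer steps for `epochs` passes over the training set."""
    return math.ceil(num_examples / batch_size) * epochs


def start_fraction(start_step, num_examples, batch_size=32, epochs=3):
    """Share of full fine-tuning completed before mask training starts."""
    return start_step / total_steps(num_examples, batch_size, epochs)


if __name__ == "__main__":
    qqp_train, fever_train = 363_849, 242_911  # training-set sizes from Tab. 1
    print("QQP total steps:", total_steps(qqp_train))
    print("QQP start fraction:", round(start_fraction(15_000, qqp_train), 3))
    print("FEVER total steps:", total_steps(fever_train))
    print("FEVER start fraction:", round(start_fraction(10_000, fever_train), 3))
```

Under these assumptions both starting points fall at roughly 44% of full fine-tuning, matching the percentages quoted in the text.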

Figure 11: Mask training curves starting from full BERT checkpoints fine-tuned for varied steps. The sparsity levels are 70%, 70% and 90% for MNLI, QQP and FEVER respectively. At these sparsity levels, the gap between “ft step=0” and “ft to end” is the largest, according to Fig. 5.

C.4 Ablation Studies on Gradual Sparsity Increase

As we mentioned in Appendix B.3.3, we increase the sparsity from 70% to 90% and adopt the soft variant of mask initialization. To explain the reason for using this specific strategy, we present the ablation study results in Tab. 5. We can observe that: (1) Replacing the hard variant of mask initialization with the soft variant is beneficial, which leads to obvious improvements on the QQP, FEVER, Symm1 and Symm2 datasets. (2) Gradually increasing the sparsity further promotes the performance, with the 0.7→0.9 strategy achieving the best results on 7 out of the 8 datasets.

                    MNLI     HANS
fixed    hard       72.09    52.56
         soft       72.63    52.82
gradual  0.2→0.9    73.61    53.90
         0.5→0.9    75.06    54.99
         0.7→0.9    76.84    56.72

                    QQP      PAWS-qqp   PAWS-wiki
fixed    hard       71.64    55.70      49.59
         soft       77.08    46.48      49.38
gradual  0.2→0.9    75.79    51.57      47.94
         0.5→0.9    77.54    50.92      48.86
         0.7→0.9    79.49    46.59      51.15

                    FEVER    Symm1      Symm2
fixed    hard       49.56    27.45      29.75
         soft       72.80    46.67      52.33
gradual  0.2→0.9    73.53    46.47      52.42
         0.5→0.9    77.01    49.87      56.57
         0.7→0.9    79.01    51.74      58.17

Table 5: Ablation studies of the gradual sparsity increase schedule. The numbers of training epochs are 3, 5 and 5 for MNLI, QQP and FEVER respectively. The subnetworks are at 90% sparsity. The numbers in the subscripts are standard deviations.

C.5 Results on RoBERTa-base and BERT-large

It has been shown by Hendrycks et al. [2020], Tu et al. [2020] that the pre-trained model RoBERTa Liu et al. [2019] has better OOD generalization than BERT. Tu et al. [2020] also show that larger PLMs, which are more computationally expensive, are more robust. To examine whether our conclusions can generalize to RoBERTa and larger versions of BERT, we conduct mask training on the standard fine-tuned RoBERTa-base and BERT-large models and use the PoE debiasing loss in the mask training process.

The results are shown in Tab. 6. We can see that, for RoBERTa-base: (1) At 50% sparsity, the searched subnetworks outperform the full RoBERTa (std) by 6.84 points on HANS, with a relatively small drop of 1.74 on MNLI, validating that SRNets can be found in RoBERTa. (2) At 70% sparsity, the vanilla mask training produces subnetworks with undesirable ID performance and OOD performance comparable to the full model (std). In comparison, when we gradually increase the sparsity level from 50% to 70%, the ID and OOD performance are improved simultaneously, demonstrating that gradual sparsity increase is also effective for RoBERTa.

When it comes to BERT-large, the conclusions are basically the same as BERT-base and RoBERTa-base: (1) We can find 50% sparse SRNets from BERT-large using the original mask training. (2) Gradual sparsity increase is also effective for BERT-large. Additionally, we find that the original mask training exhibits high variance at 70% sparsity because the training fails for some random seeds. In comparison, with gradual sparsity increase, the searched subnetworks have better performance and low variance.

RoBERTa-base               MNLI    HANS
full model    std          87.14   68.33
full model    poe          86.56   76.15
mask train    0.5          85.40   75.17
mask train    0.7          83.48   68.63
mask train    0.5~0.7      84.41   71.95

BERT-large                 MNLI    HANS
full model    std          86.84   69.44
full model    poe          86.25   76.27
mask train    0.5          85.47   75.40
mask train    0.7          77.54   60.19
mask train    0.5~0.7      84.83   70.18

Table 6: Results of RoBERTa-base and BERT-large on the NLI task. We conduct mask training with PoE loss on the standard fine-tuned PLMs. “0.5~0.7” denotes gradual sparsity increase. The numbers in the subscripts are standard deviations.

Appendix D Related Work on Model Compression and Robustness

Some prior attempts have also been made to obtain compact and robust deep neural networks. We discuss the relationship and difference between these works and our paper from three perspectives:

Robustness Types

There are various types of model robustness, including generalization to in-distribution unseen examples, robustness towards dataset bias Beery et al. [2018], McCoy et al. [2019], Zhang et al. [2019], Schuster et al. [2019] and robustness towards adversarial attacks Goodfellow et al. [2015]. Among studies on model compression and robustness, adversarial robustness Gui et al. [2019], Ye et al. [2019], Sehwag et al. [2020], Fu et al. [2021], Xu et al. [2021] and dataset bias robustness Zhang et al. [2021], Du et al. [2021] are the most widely studied. In this paper, we focus on the dataset bias problem, which is more common in real-world applications than the worst-case adversarial attack.

Compression Methods

A major direction in robust model compression concerns the design of compression methods. Sehwag et al. [2019] investigate the effect of magnitude-based pruning on adversarially trained models. Gui et al. [2019], Ye et al. [2019] treat sparsity and adversarial robustness as a constrained optimization problem, and solve it using the alternating direction method of multipliers (ADMM) framework Zhang et al. [2018]. Sehwag et al. [2020], Zhang et al. [2021], Madaan et al. [2020] combine learnable weight masks (i.e., mask training) with robust training objectives. Our study investigates the use of magnitude-based pruning and mask training, which are also widely employed in the literature on BERT compression.
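For readers unfamiliar with magnitude-based pruning, the idea can be sketched on a flat list of weights: the fraction of weights with the smallest absolute values is masked out. The function name `magnitude_prune_mask` is hypothetical, and the sketch ignores practical details such as which weight matrices are pruned and whether pruning is applied per layer or globally.

```python
def magnitude_prune_mask(weights, sparsity):
    """Return a 0/1 mask that removes the smallest-magnitude weights.

    `sparsity` is the fraction of weights to prune, e.g. 0.5 for 50%.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute weight value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


if __name__ == "__main__":
    w = [0.9, -0.1, 0.05, -0.7]
    mask = magnitude_prune_mask(w, sparsity=0.5)
    # Apply the mask to obtain the sparse subnetwork weights.
    print([wi * mi for wi, mi in zip(w, mask)])
```

Mask training differs in that the mask itself is learned with a training objective instead of being derived from weight magnitudes.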

Application Fields

Although the topic of model compression and robustness has been studied for years, it is mostly examined in the context of computer vision (CV) tasks and models, and little attention has been paid to the NLP field. Considering the real-world application potential of PLMs, it is critical to study the questions of PLM compression and robustness jointly. To this end, some recent studies extend the evaluation of compressed PLMs to consider adversarial robustness Xu et al. [2021] and dataset bias robustness Du et al. [2021].

Although our work shares its topic with Du et al. [2021], we differ in several aspects. First, the scope and focus of our research questions are different. They aim at analyzing the impact of different compression methods (pruning and knowledge distillation Hinton et al. [2015]) on the OOD robustness of standard fine-tuned BERT. By contrast, we focus on subnetworks obtained from different pruning and fine-tuning paradigms and consider both standard fine-tuning and debiasing fine-tuning. Second, our conclusions are different. The results of Du et al. [2021] suggest that pruning generally has a negative impact on the robustness of BERT. In comparison, we reveal the consistent existence of sparse BERT subnetworks that are more robust to dataset bias than the full model.

Appendix E More Discussions

e.1 How to Predict the Timing to Start Searching SRNets?

A feasible solution is to stop full BERT fine-tuning when there is no significant improvement across several consecutive evaluation steps. The patience of early stopping can be determined based on the computational budget. If our resources are limited, we can at least directly train the mask on , which can still produce SRNets at 50% sparsity (as shown in Section 4.4.2).
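The early-stopping rule described above can be sketched as follows. The class `EarlyStopping`, the helper `stopping_step`, and the threshold `min_delta` (the smallest gain counted as a significant improvement) are hypothetical names and parameters introduced for illustration; suitable values depend on the task and the computational budget.

```python
class EarlyStopping:
    """Signal a stop after `patience` evaluations without significant gain."""

    def __init__(self, patience=3, min_delta=0.1):
        self.patience = patience    # consecutive evaluations to tolerate
        self.min_delta = min_delta  # smallest gain counted as significant
        self.best = float("-inf")
        self.bad_evals = 0

    def should_stop(self, score):
        """Update the state with a new evaluation score (higher is better)."""
        if score > self.best + self.min_delta:
            self.best = score
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def stopping_step(scores, patience=3, min_delta=0.1):
    """Return the index of the evaluation at which fine-tuning would stop."""
    stopper = EarlyStopping(patience, min_delta)
    for i, score in enumerate(scores):
        if stopper.should_stop(score):
            return i
    return None  # the stopping criterion was never met


if __name__ == "__main__":
    dev_accuracy = [70.0, 75.0, 78.0, 78.05, 78.02, 78.08, 80.0]
    print(stopping_step(dev_accuracy))
```

Once the criterion fires, full fine-tuning would stop and the search for SRNets would start from the current checkpoint; a smaller `patience` trades some fine-tuning quality for a lower computational cost.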

e.2 How to Generalize to Other Scenarios?

In this work, we focus on NLU tasks and PLMs from the BERT family. However, the methodology we utilize is agnostic to the type of bias, task and backbone model. Theoretically, it can be flexibly adapted to other scenarios by simply changing the spurious features used to train the bias model (for the three debiasing methods considered in this paper) or by combining the pruning method with another kind of debiasing method that also involves model training. In future work, we would like to extend our exploration to other types of PLMs (e.g., language generation models like GPT Radford et al. [2018] and T5 Raffel et al. [2020]) and other types of NLP tasks (e.g., dialogue generation).