Large-scale pre-trained language models such as BERT Devlin2019BERTPO have been widely used in natural language processing tasks Guu2020REALMRL; Liu2019FinetuneBF; Wadden2019EntityRA; Zhu2020IncorporatingBI. A typical way to train on a supervised downstream dataset is to fine-tune a pre-trained model for a few epochs. In this process, most of the model’s parameters are reused, while a randomly initialized task-specific layer is added to adapt the model to the new task.
Fine-tuning BERT has significantly boosted state-of-the-art performance on natural language understanding (NLU) benchmarks such as GLUE Wang2018GLUEAM and SuperGLUE Wang2019SuperGLUEAS. However, despite the impressive empirical results, this process remains unstable due to the randomness introduced by data shuffling and the initialization of the task-specific layer. This instability was first reported by Devlin2019BERTPO; Dodge2020FineTuningPL, and several approaches have been proposed to address it Lee2020MixoutER; Zhang2020RevisitingFB; Mosbach2020OnTS.
In this study, we consider the fine-tuning stability of BERT from the perspective of sensitivity to input perturbation. This is motivated by Arora2018StrongerGB and Sanyal2020StableRN, who show that, for neural networks with good generalizability, noise injected at the lower layers has very little effect on the higher layers. However, for a well pre-trained BERT, we find that the higher layers are still very sensitive to perturbation of the lower layers (as shown in Figure 1), implying that the high-level representations of the pre-trained BERT may not generalize well to downstream tasks and may consequently lead to instability. This phenomenon coincides with the observation that transferring the top pre-trained layers of BERT slows down learning and hurts performance Zhang2020RevisitingFB. In addition, Yosinski2014HowTA point out that, in transfer learning models for object recognition, the lower pre-trained layers learn more general features while the higher layers closer to the output specialize more to the pre-training tasks. We argue that this result also applies to BERT. Intuitively, if a trained model is insensitive to perturbation of the lower layers’ output, then the model is confident about its output, and vice versa. Based on the above theoretical and empirical results, we propose a simple and effective regularization method to reduce the noise sensitivity of BERT and thus improve the stability and performance of fine-tuned BERT.
To verify our approach, we conduct extensive experiments on different few-sample (fewer than 10k training samples) NLP tasks, including CoLA Warstadt2019NeuralNA, MRPC Dolan2005AutomaticallyCA, RTE Wang2018GLUEAM; Dagan2005ThePR; BarHaim2006TheSP; Giampiccolo2007TheTP, and STS-B Cer2017SemEval2017T1. With the layer-wise noise stability regularization, we obtain strong empirical performance. Compared with other state-of-the-art models, our approach not only improves the fine-tuning stability (a smaller standard deviation) but also consistently improves the overall performance (a larger mean, median, and maximum).
In summary, our main contributions are:
We propose a lightweight and effective regularization method, referred to as Layer-wise Noise Stability Regularization (LNSR), to improve the local Lipschitz continuity of each BERT layer and thus ensure the smoothness of the whole model. The empirical results show that fine-tuned BERT models regularized with LNSR obtain significantly more accurate and stable results. LNSR also outperforms other state-of-the-art methods that aim to stabilize fine-tuning, such as L2-SP Li2018ExplicitIB, Mixout Lee2020MixoutER, and SMART Jiang2020SMARTRA.
We are the first to study the effect of noise stability in NLP tasks. We extend classic theories of adding noise by explicitly constraining the output consistency when noise is added to the input. We theoretically prove that our proposed layer-wise noise stability regularizer is equivalent to a special case of the Tikhonov regularizer, which serves as a more stable regularizer than simply adding noise to the input rifai2011adding.
We investigate the relation between the noise stability property and the generalizability of BERT. We find that, in general, models with good generalizability tend to be insensitive to noise perturbation; the lower layers of BERT show better error resilience, but the higher layers of BERT remain sensitive to perturbation of the lower layers (as depicted in Figure 1).
2 Related Work
Pre-training has been well studied in machine learning and natural language processing Erhan2009TheDO; Erhan2010WhyDU. Mikolov2013DistributedRO and Pennington2014GloveGV proposed to use distributional representations (i.e., word embeddings) for individual words. Dai2015SemisupervisedSL proposed to train a language model or an auto-encoder on unlabeled data and then use the resulting model to fine-tune on downstream tasks. Recently, pre-trained language models such as ELMo Peters2018DeepCW, GPT/GPT-2 Radford2018ImprovingLU; Radford2019LanguageMA, BERT Devlin2019BERTPO, the cross-lingual language model (XLM) Lample2019CrosslingualLM, XLNet Yang2019XLNetGA, RoBERTa Liu2019RoBERTaAR, and ALBERT Lan2020ALBERTAL have attracted increasing attention in the natural language processing community. These models are first pre-trained on large amounts of unlabeled data to capture rich representations of the input, and are then applied to downstream tasks either by providing context-aware embeddings of an input sequence Peters2018DeepCW or by initializing the parameters of the downstream model Devlin2019BERTPO for fine-tuning. Such pre-training approaches deliver decent performance on natural language understanding tasks.
2.2 Instability in Fine-tuning
Fine-tuning instability of BERT has been reported in various previous works. Devlin2019BERTPO report instabilities when fine-tuning BERT on small datasets and resort to performing multiple restarts of fine-tuning and selecting the model that performs best on the development set. Dodge2020FineTuningPL perform a large-scale empirical investigation of the fine-tuning instability of BERT. They find dramatic variations in fine-tuning accuracy across multiple restarts and argue that it might be related to the choice of random seed and the dataset size. Lee2020MixoutER propose a new regularization method named Mixout to improve the stability and performance of fine-tuning BERT. Zhang2020RevisitingFB empirically evaluate the importance of the debiasing step by fine-tuning BERT with both BERTAdam and the standard Adam optimizer Kingma2015AdamAM, and propose a re-initialization method to obtain a better initialization point for fine-tuning. Mosbach2020OnTS analyze the cause of fine-tuning instability and propose a simple but strong baseline (a small learning rate combined with bias correction).
Several regularization approaches have been proposed to stabilize model performance. Loshchilov2019DecoupledWD propose a decoupled weight decay regularizer integrated into the Adam Kingma2015AdamAM optimizer to prevent neural networks from becoming too complex. Gunel2020SupervisedCL use contrastive learning to augment the training set and improve generalization performance. In addition, spectral norm regularization Yoshida2017SpectralNR; Roth2019AdversarialTG is a general method that can be used to constrain the Lipschitz continuity of weight matrices, which can increase the stability of general neural networks.
Several noise-based methods have also been proposed to improve the generalizability of pre-trained language models, including SMART Jiang2020SMARTRA, FreeLB Zhu2020FreeLBEA, and R3F Aghajanyan2020BetterFB. They achieve state-of-the-art performance on the GLUE, SNLI Bowman2015ALA, SciTail Khot2018SciTaiLAT, and ANLI nie-etal-2020-adversarial NLU benchmarks. Most of these algorithms employ adversarial training to improve the robustness of language model fine-tuning. SMART uses an adversarial methodology to encourage models to be smooth within neighborhoods of all the inputs; FreeLB optimizes a direct adversarial loss through iterative gradient ascent steps; R3F removes the adversarial nature of SMART and optimizes the smoothness of the whole model directly. Unlike these methods, our proposed method does not adopt an adversarial training strategy. Instead, we optimize the smoothness of each layer of BERT directly and thus improve the stability of the whole model.
3 Using Noise Stability as a Regularizer
One of the central issues in neural network training is to determine the optimal degree of complexity for the model. A model that is too limited will not sufficiently capture the structure in the data, while one that is too complex will model the noise in the data (the phenomenon of over-fitting). In either case, the performance on new data, that is, the ability of the network to generalize, will be poor. The problem can be regarded as one of finding the optimal trade-off between the high bias of a model that is too inflexible and the high variance of a model with too much freedom geman1992neural; bishop1995training; Novak2018SensitivityAG; bishop1991improving. To control the trade-off between bias and variance in BERT models, we impose an explicit noise regularization method.
3.1 Introduction of Our Method
For a BERT model with multiple layers and a given training set, the general form of the optimization objective combines the task loss with a regularization term.
To define the regularization term, we first take the noise injection position to be the input of a chosen layer. If the regularization is applied at the output of a later layer, we consider the function computed between these two layers. To implement the noise stability regularization, we inject a Gaussian-like noise vector $\varepsilon$ into the input of the injection layer and obtain a point in its neighborhood. Specifically, each element $\varepsilon_i$ is sampled independently from a Gaussian distribution with mean zero and standard deviation $\sigma$, that is, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. The probability density function of the noise distribution can be written as $p(\varepsilon_i) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right)$. Our goal is to minimize the discrepancy between the outputs of this function on the original and perturbed inputs. In our framework, we use a fixed position for noise injection and constrain the output distance on all layers following the injection layer. Each regularized layer has its own regularization weight, and for a given sample the regularization term combines the weighted output distances over these layers.
The overall algorithm is presented in Algorithm 1.
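The procedure described in this section can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `lnsr_penalty`, the toy `tanh` layers, the injection index `b`, and the use of a squared L2 distance as the discrepancy are all assumptions made for the example.

```python
import numpy as np

def lnsr_penalty(layers, x_b, b, sigma, weights, rng):
    """Noise stability penalty for noise injected at the input of layer index b.

    layers  : list of callables; layers[i] maps a hidden state to the next one
    x_b     : clean input to layer b (NumPy array)
    weights : one regularization weight per layer from b to the last layer
    """
    # Zero-mean Gaussian noise with standard deviation sigma.
    noise = rng.normal(0.0, sigma, size=x_b.shape)
    clean, noisy = x_b, x_b + noise
    penalty = 0.0
    for weight, layer in zip(weights, layers[b:]):
        # Propagate the clean and the perturbed hidden states side by side.
        clean, noisy = layer(clean), layer(noisy)
        # Assumed discrepancy: squared L2 distance between the two outputs.
        penalty += weight * float(np.sum((noisy - clean) ** 2))
    return penalty

# Toy usage with three random tanh layers (hypothetical stand-ins for BERT layers).
rng = np.random.default_rng(0)
mats = [rng.normal(scale=0.5, size=(8, 8)) for _ in range(3)]
toy_layers = [lambda h, W=W: np.tanh(W @ h) for W in mats]
hidden = rng.normal(size=8)
reg = lnsr_penalty(toy_layers, hidden, b=0, sigma=0.1, weights=[1.0, 1.0, 1.0], rng=rng)
```

In training, a penalty of this kind would be added to the task loss; the weights of 1.0 mirror the fixed weight used on all layers in this paper.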
3.2 Theoretical Analysis
Regularization is a commonly used technique for reducing function complexity and, as a result, helping the learned model generalize well to unseen examples. In this part, we show theoretically that the proposed LNSR algorithm encourages local Lipschitz continuity and imposes a Tikhonov regularizer under different assumptions. For simplicity, we omit the layer indices in this part and consider a generic parameterized target function and its input. Given a sample, we discuss the general form of the noise stability, defined as the discrepancy between the function's outputs on the input and on its noise-perturbed copy.
Lipschitz continuity. The Lipschitz property reflects the degree of smoothness of a function. Recent theoretical studies on deep learning have revealed the close connection between the Lipschitz property and generalization bartlett2017spectrally; neyshabur2017exploring.
Given a sampled noise vector, minimizing the noise stability is equivalent to minimizing the change in the output relative to the magnitude of the perturbation.
Thus the noise stability regularization can be regarded as minimizing the Lipschitz constant in a local region around the input.
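To make the local Lipschitz view concrete, the sketch below estimates the largest observed ratio between the output change and the perturbation size for Gaussian perturbations around a point. The function name `local_lipschitz_estimate` and the sampling-based estimate are illustrative assumptions; the result is only an empirical lower bound on the true local Lipschitz constant.

```python
import numpy as np

def local_lipschitz_estimate(f, x, sigma, n_samples, rng):
    """Largest observed ||f(x + eps) - f(x)|| / ||eps|| over Gaussian perturbations."""
    fx = f(x)
    best = 0.0
    for _ in range(n_samples):
        eps = rng.normal(0.0, sigma, size=x.shape)
        ratio = np.linalg.norm(f(x + eps) - fx) / np.linalg.norm(eps)
        best = max(best, float(ratio))
    return best

rng = np.random.default_rng(0)
x0 = rng.normal(size=6)
# tanh is 1-Lipschitz, so the estimate should not exceed 1.
tanh_estimate = local_lipschitz_estimate(np.tanh, x0, 0.1, 200, rng)
```

Pushing the squared output change down for a fixed noise sample pushes this ratio down as well, which is the sense in which the regularizer acts on local smoothness.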
Tikhonov regularizer. The Tikhonov regularizer willoughby1979solutions involves constraints on derivatives of the objective function of different orders. In the simplest first-order case, it can be regarded as imposing robustness and shaping a flatter loss surface at the input, which makes the learned function smoother.
Assuming that the magnitude of the noise is small, we can expand the first term with a second-order Taylor approximation, whose first- and second-order coefficients are the Jacobian and Hessian of the function with respect to the input, respectively.
We treat the input and the noise as vectors. Assuming that the noise is independent of the input, and that the derivatives of the function with respect to different elements of the input vector are independent of each other, we expand the second-order term corresponding to the Jacobian. Using the moments of the Gaussian distribution, we can then rewrite the second-order term corresponding to the Hessian in Eq. 6, where the remaining constant is independent of the input. The third term generated from the expansion of Eq. 6 is zero for zero-mean Gaussian noise, which gives the final form of the regularizer in Eq. 10.
When the input and output of the function are both scalar variables, the Tikhonov regularizer willoughby1979solutions takes a general form that constrains the derivatives of the function of each order.
Eq. 10 shows that our proposed regularizer, which enforces noise stability, is equivalent to a special case of the Tikhonov regularizer that involves the first- and second-order derivatives of the objective function.
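The first-order (Jacobian) part of this argument can be checked numerically. For a linear map the Jacobian is the matrix itself and the Hessian vanishes, so the expected squared output change under Gaussian noise with standard deviation sigma equals sigma squared times the squared Frobenius norm of the Jacobian. The sketch below is a Monte Carlo check of that identity; the names `noise_stability`, `A`, and `sigma` are assumptions chosen for the illustration.

```python
import numpy as np

def noise_stability(f, x, sigma, n_samples, rng):
    """Monte Carlo estimate of E ||f(x + eps) - f(x)||^2 with eps ~ N(0, sigma^2 I)."""
    fx = f(x)
    eps = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    return float(np.mean([np.sum((f(x + e) - fx) ** 2) for e in eps]))

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))      # Jacobian of the linear map v -> A v
x = rng.normal(size=5)
sigma = 0.1

estimate = noise_stability(lambda v: A @ v, x, sigma, 20000, rng)
first_order = sigma ** 2 * float(np.sum(A ** 2))   # sigma^2 * ||J||_F^2
```

For a nonlinear function the two quantities agree only approximately, with the gap governed by the higher-order terms discussed above.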
An alternative way to improve robustness is to add noise directly to the input, without explicitly constraining the output stability. rifai2011adding showed that adding noise to the input has the effect of penalizing both the norm of the Jacobian and the trace of the Hessian, where the Hessian term is not constrained to be positive. In contrast, the regularizer induced by our proposed LNSR is guaranteed to be positive because it involves the sum of squares of the first- and second-order derivatives. Moreover, our work relaxes the assumption of an MSE regression loss required by rifai2011adding. By imposing an explicit noise stability constraint on middle-layer representations, we extend the theoretical understanding of noise stability to deep learning algorithms.
In this section, we experimentally demonstrate the effectiveness of the LNSR method on text classification tasks compared with other regularization methods, and confirm that insensitivity to noise promotes the generalizability and stability of BERT.
We conduct experiments on four few-sample (fewer than 10k training samples) text classification tasks from GLUE (https://gluebenchmark.com/). The datasets are described below and summarized in Table 4 in Appendix A.
Corpus of Linguistic Acceptability (CoLA Warstadt2019NeuralNA) consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence. This is a binary classification task and Matthews correlation coefficient (MCC) matthews1975comparison is used to evaluate the performance.
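For reference, the Matthews correlation coefficient for binary labels can be computed from the confusion matrix as sketched below. The function name and the convention of returning 0.0 when the denominator is zero are choices made for this example.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

The score ranges from -1 (total disagreement) to 1 (perfect prediction), with 0 corresponding to chance-level prediction, which makes it suitable for the unbalanced label distribution of acceptability judgments.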
Microsoft Research Paraphrase Corpus (MRPC Dolan2005AutomaticallyCA) is a corpus of sentence pairs with human annotations indicating whether the sentences in each pair are semantically equivalent. The evaluation metric is the average of F1 and Accuracy.
Recognizing Textual Entailment (RTE Wang2018GLUEAM; Dagan2005ThePR; BarHaim2006TheSP; Giampiccolo2007TheTP) is a corpus of textual entailment, in which each example is a sentence pair annotated with whether the first sentence entails the second. The evaluation metric is Accuracy.
Semantic Textual Similarity Benchmark (STS-B Cer2017SemEval2017T1) is a regression task. Each example is a sentence pair that is human-annotated with a similarity score from 1 to 5; the task is to predict these scores. The evaluation metric is the average of the Pearson and Spearman correlation coefficients.
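The combined MRPC and STS-B metrics described above can be sketched as follows. The function names are illustrative, ties in the Spearman ranks are handled by average ranking, and constant inputs (which make the correlations undefined) are not handled.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    a, b = a - a.mean(), b - b.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

def sts_b_score(gold, pred):
    """Average of Pearson and Spearman correlation (STS-B metric)."""
    return 0.5 * (pearson(gold, pred) + spearman(gold, pred))

def f1_accuracy_average(y_true, y_pred):
    """Average of F1 (positive class) and accuracy (MRPC metric)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return 0.5 * (f1 + accuracy)
```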
4.2 Baseline Models
We use BERT Devlin2019BERTPO, a large-scale bidirectional pre-trained language model, as the base model in all experiments. We adopt the PyTorch implementation by Wolf2019HuggingFacesTS.
Fine-tuning. We use the standard BERT fine-tuning method described in Devlin2019BERTPO.
L2-SP Li2018ExplicitIB is a regularization scheme that explicitly promotes the similarity of the final solution to the initial model. It is usually used to prevent pre-trained models from catastrophic forgetting. We adopt the form of .
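As a rough illustration of a penalty that keeps the fine-tuned weights close to the pre-trained ones, the sketch below computes a squared L2 distance to the starting point. The function name, the coefficient `alpha`, and the dictionary layout are assumptions for this example; it does not reproduce the exact form adopted in the experiments, and any separate penalty on newly added layers is omitted.

```python
import numpy as np

def l2_sp_penalty(params, pretrained, alpha):
    """0.5 * alpha * sum of squared distances between current and pre-trained weights.

    params, pretrained : dicts mapping parameter names to NumPy arrays
    alpha              : hypothetical regularization strength
    """
    total = 0.0
    for name, start in pretrained.items():
        total += float(np.sum((params[name] - start) ** 2))
    return 0.5 * alpha * total
```

The penalty is zero at the pre-trained solution and grows as fine-tuning moves the shared parameters away from it.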
Mixout Lee2020MixoutER is a stochastic regularization technique motivated by Dropout Srivastava2014DropoutAS and DropConnect Wan2013RegularizationON. At each training iteration, each model parameter is replaced with its pre-trained value with some probability. The goal is to improve the generalizability of pre-trained language models.
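The replacement step described above can be sketched as a random element-wise mix of current and pre-trained weights. This is a simplified illustration: the function name `mixout_replace` and the probability argument `p` are assumptions, and the rescaling that the original Mixout method applies after mixing is deliberately left out.

```python
import numpy as np

def mixout_replace(weights, pretrained, p, rng):
    """Replace each entry of `weights` with its pre-trained value with probability p."""
    # rng.random draws from [0, 1), so p = 0 keeps every weight and p = 1 reverts all.
    revert = rng.random(weights.shape) < p
    return np.where(revert, pretrained, weights)
```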
SMART Jiang2020SMARTRA imposes a smoothness-inducing adversarial regularizer to control the model complexity at the fine-tuning stage. It also employs a class of Bregman proximal point optimization methods to prevent the model from updating aggressively during fine-tuning.
4.3 Experimental Setup
Our model is implemented in PyTorch on top of the Transformers framework (https://huggingface.co/transformers/index.html). Specifically, we use the learning setup and hyperparameters recommended by Devlin2019BERTPO. We use the Hugging Face implementation of the Adam optimizer Kingma2015AdamAM (without bias correction) with learning rate of ,, and warmup over the first 10% of the total steps. We fine-tune the entire model (340 million parameters); the vast majority of these parameters start as pre-trained weights (BERT-Large-Uncased), and only the classification layer (2048 parameters) is new. Weights of the classification layer are initialized with . We train with a batch size of 32 for 3 epochs. More details of our experimental setup are described in Appendix A.
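A warmup schedule of the kind mentioned above can be sketched as a plain function of the step index. The linear ramp over the first 10% of steps follows the setup described here; the linear decay to zero afterwards, the function name, and the `base_lr` argument are assumptions made for this illustration.

```python
def lr_at_step(step, total_steps, base_lr, warmup_fraction=0.1):
    """Learning rate with linear warmup, then (assumed) linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp linearly from 0 to base_lr during warmup.
        return base_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    # Assumed behaviour after warmup: decay linearly to 0 at the last step.
    return base_lr * max(0.0, (total_steps - step) / remaining)
```

With 100 total steps, the rate peaks at step 10 and reaches zero at step 100.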
4.4 Overall Performance
Table 1 shows the results of all the models on the selected GLUE datasets. We run each task with 25 random seeds. To implement our LNSR, we inject noise at the first layer of BERT-large in all comparisons with the baseline models. As the table shows, our model outperforms all the baseline models in mean and maximum values, which indicates that it generalizes better than the baselines. To verify whether the improvements are significant, we calculate the p-values between the accuracy distributions of standard BERT fine-tuning and our model. We obtain very small p-values in all tasks: RTE: , MRPC: , CoLA: , STS-B: .
Standard deviation is an indicator of the stability of a model’s performance: a higher standard deviation means greater sensitivity to random seeds. Our model shows a lower standard deviation on each task, which means that it is less sensitive to random seeds than the other models. Figure 2 presents a clearer illustration. To sum up, our proposed method can effectively improve the performance and stability of fine-tuning BERT.
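The per-task statistics discussed here (mean, standard deviation, median, and maximum over random seeds) can be computed as sketched below. The function name is illustrative, and the use of the sample standard deviation is an assumption, since the exact estimator is not stated.

```python
import statistics

def summarize_runs(scores):
    """Summary statistics over the evaluation scores of repeated fine-tuning runs."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),      # sample standard deviation (assumed)
        "median": statistics.median(scores),
        "max": max(scores),
    }
```

A lower "std" together with a higher "mean" across seeds corresponds to the stability and performance gains described above.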
5.1 Ablation Study
To verify the effectiveness of our proposed LNSR model, we conduct several ablation experiments, including fine-tuning with more training epochs and noise perturbation without regularization. In the latter, we inject noise directly into the output of a specific layer, propagate the perturbed representation through the rest of the network, and compute the loss; this process is similar to a vector-space representation augmentation. The results are shown in Table 2. We observe that the benefit obtained from longer training is limited. Similarly, fine-tuning with noise perturbation achieves only slightly better results on two of these tasks, showing that simply adding noise without an explicit restriction on the outputs may not be sufficient to obtain good generalizability. In contrast, BERT models with LNSR perform better on each task. This verifies our claim that LNSR can promote the stability of BERT fine-tuning while also improving the generalizability of the BERT model.
5.2 Effects on the Generalizability of Models
We verify the effects of our proposed method on the generalizability of BERT models in two ways: the generalization gap and the models’ performance with fewer training samples. Because of the limited data and the extremely high complexity of the BERT model, a bad fine-tuning starting point makes the adapted model overfit the training data and generalize poorly to unseen data. Both measures give an intuitive picture of a model's generalizability.
Table 3 shows the mean training accuracy, mean evaluation accuracy, and generalization gap of different models on each task. As the table shows, fine-tuning with LNSR can effectively narrow the generalization gap and achieve a higher evaluation score. The narrowing of the generalization gap is also reflected in Figure 3, where we can see higher evaluation accuracy and lower evaluation loss.
We sample subsets from the two relatively larger datasets, CoLA (8.5k training samples) and STS-B (7k training samples), with sampling ratios of 0.15, 0.3, and 0.5. As shown in Figure 4, fine-tuning with LNSR shows a clear advantage with fewer training samples, suggesting that LNSR can effectively promote the model’s generalizability.
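The subsampling step can be sketched as drawing a fixed fraction of the training examples without replacement. The function name, the seeding scheme, and the rounding of the subset size are assumptions for this illustration.

```python
import random

def subsample(examples, ratio, seed=0):
    """Return a random subset containing about `ratio` of the examples."""
    k = round(len(examples) * ratio)
    # A dedicated Random instance keeps the draw reproducible for a given seed.
    return random.Random(seed).sample(examples, k)

# Hypothetical stand-in for a training set of 100 examples.
toy_train = list(range(100))
subsets = {ratio: subsample(toy_train, ratio) for ratio in (0.15, 0.3, 0.5)}
```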
5.3 Sensitivity to the Position of Noise Injection
We briefly discuss the sensitivity to the position of noise injection, as it is a pre-determined hyperparameter of our method. As shown in Figure 5 in Appendix A, the performance of LNSR does not fluctuate much as the position of noise injection changes. All injection positions bring significant improvements over vanilla fine-tuning. Note that, with LNSR, injecting noise into the lower layers usually leads to relatively higher accuracy and stability, implying that LNSR may be more effective when it affects both the lower and higher layers of the network.
5.4 Relationship to Previous Noise-based Approaches
Our method is related to SMART Jiang2020SMARTRA, FreeLB Zhu2020FreeLBEA, and R3F Aghajanyan2020BetterFB. As mentioned before, most of these approaches employ adversarial training strategies to improve the robustness of BERT fine-tuning. SMART solves a supremum problem, using an adversarial methodology to find the largest KL divergence within an ε-ball; FreeLB optimizes a direct adversarial loss through iterative gradient ascent steps; and R3F removes the adversarial nature of SMART and optimizes the smoothness of the whole model directly.
Compared with such adversarial-based algorithms, our method is easier to implement and provides a relatively rigorous theoretical guarantee. The layer-wise design is sensible in that it exploits the hierarchical representations of modern deep neural networks. Studies in knowledge distillation have reported a similar finding: imitating middle-layer representations adriana2015fitnets; zagoruyko2016paying performs better than aligning only the final outputs hinton2015distilling. Moreover, LNSR allows us to use different regularization weights for different layers (we use a fixed weight of 1 on all layers in this paper). We leave the exploration of layer-specific weights to future work.
In this paper, we propose Layer-wise Noise Stability Regularization (LNSR) as a lightweight and effective method to improve generalizability and stability when fine-tuning BERT on few training samples. Our proposed LNSR method is a general technique that improves model output stability while maintaining or improving the original performance. Furthermore, we provide a theoretical analysis of the relationship of our method to Lipschitz continuity and the Tikhonov regularizer. Extensive empirical results show that our proposed method can effectively improve the generalizability and stability of the BERT model.
Hang Hua would like to thank Jeffries for supporting his research.
Appendix A Experimental Details
The model we use for the experiments in Section 4 is the standard BERT-large model, a 24-layer stacked Transformer Vaswani2017AttentionIA encoder with a hidden size of 1024 and 16 self-attention heads. We initialize the pre-trained part of the model with the BERT-Large-Uncased-Whole-Word-Masking weights. The final layer is a classification layer with 2048 parameters, which is about 0.0006% of the 340 million parameters in the model. We initialize the last layer with and each bias is 0. For the position of noise injection, we choose the first layer as the starting point of noise regularization in all main experiments. In the analysis of the sensitivity to the position of noise injection, we also try injecting noise at different layers, as shown in Figure 5. For the Mixout baseline, we use the code from the GitHub repository https://github.com/bloodwass/mixout.git. The other baseline models are implemented by ourselves.
Appendix B Other Experimental Reports
We also report the maximum value obtained when fine-tuning BERT with our proposed LNSR regularizer over a large number of random seeds and several noise injection positions, since the maximum value also reflects the ability of the learning algorithm to reach an optimal point. The results are shown in Table 5. On some tasks, fine-tuning BERT with LNSR is even competitive with fine-tuning state-of-the-art models that adopt more powerful modern architectures and pre-training strategies.
| |RTE|MRPC|CoLA|STS-B|
|Metrics|Accuracy|F1/Accuracy|Matthews corr|Pearson/Spearman corr|
|# of training samples|2.5k|3.7k|8.6k|7k|
|# of validation samples|276|408|1k|1.5k|
|# of test samples|3k|1.7k|1k|1.4k|