1 Introduction
Large-scale pretrained language models such as BERT Devlin2019BERTPO have been widely used in natural language processing tasks Guu2020REALMRL; Liu2019FinetuneBF; Wadden2019EntityRA; Zhu2020IncorporatingBI. A typical approach to a supervised downstream task is to finetune a pretrained model for a few epochs. In this process, most of the model's parameters are reused, while a randomly initialized task-specific layer is added to adapt the model to the new task.
Finetuning BERT has significantly improved state-of-the-art performance on natural language understanding (NLU) benchmarks such as GLUE Wang2018GLUEAM and SuperGLUE Wang2019SuperGLUEAS. However, despite the impressive empirical results, this process remains unstable because of the randomness introduced by data shuffling and by the initialization of the task-specific layer. The instability of finetuning BERT was first reported by Devlin2019BERTPO; Dodge2020FineTuningPL, and several approaches have been proposed to address it Lee2020MixoutER; Zhang2020RevisitingFB; Mosbach2020OnTS.
In this study, we consider the finetuning stability of BERT from the perspective of sensitivity to input perturbation. This is motivated by Arora2018StrongerGB and Sanyal2020StableRN, who show that, for neural networks with good generalizability, noise injected at the lower layers has very little effect on the higher layers. However, for a well-pretrained BERT, we find that the higher layers are still very sensitive to perturbation of the lower layers (as shown in Figure 1), implying that the high-level representations of the pretrained BERT may not generalize well to downstream tasks and consequently lead to instability. This phenomenon coincides with the observation that transferring the top pretrained layers of BERT slows down learning and hurts performance Zhang2020RevisitingFB. In addition, Yosinski2014HowTA also point out that in transfer learning models for object recognition, the lower pretrained layers learn more general features while the higher layers closer to the output specialize more to the pretraining tasks. We argue that this result also applies to BERT. Intuitively, if a trained model is insensitive to perturbation of the lower layers' output, then the model is confident about its output, and vice versa. Based on the above theoretical and empirical results, we propose a simple and effective regularization method to reduce the noise sensitivity of BERT and thus improve the stability and performance of finetuned BERT.
To verify our approach, we conduct extensive experiments on different few-sample (fewer than 10k training samples) NLP tasks, including CoLA Warstadt2019NeuralNA, MRPC Dolan2005AutomaticallyCA, RTE Wang2018GLUEAM; Dagan2005ThePR; BarHaim2006TheSP; Giampiccolo2007TheTP, and STSB Cer2017SemEval2017T1. With the layerwise noise stability regularization, we obtain strong empirical performance. Compared with other state-of-the-art models, our approach not only improves finetuning stability (a smaller standard deviation) but also consistently improves overall performance (a larger mean, median, and maximum).
In summary, our main contributions are:

We propose a lightweight and effective regularization method, referred to as Layerwise Noise Stability Regularization (LNSR), to improve the local Lipschitz continuity of each BERT layer and thus ensure the smoothness of the whole model. The empirical results show that finetuned BERT models regularized with LNSR obtain significantly more accurate and stable results. LNSR also outperforms other state-of-the-art methods for stabilizing finetuning, such as SP Li2018ExplicitIB, Mixout Lee2020MixoutER and SMART Jiang2020SMARTRA.

We are the first to study the effect of noise stability in NLP tasks. We extend classic theories of noise injection by explicitly constraining the consistency of the output when noise is added to the input. We theoretically prove that our proposed layerwise noise stability regularizer is equivalent to a special case of the Tikhonov regularizer, which serves as a more stable regularizer than simply adding noise to the input rifai2011adding.

We investigate the relation of the noise stability property to the generalizability of BERT. We find that in general, models with good generalizability tend to be insensitive to noise perturbation; the lower layers of BERT show a better error resilience property but the higher layers of BERT remain sensitive to the lower layers’ perturbation (as is depicted in Figure 1).
2 Related Work
2.1 Pretraining
Pretraining has been well studied in machine learning and natural language processing Erhan2009TheDO; Erhan2010WhyDU. Mikolov2013DistributedRO and Pennington2014GloveGV proposed to use distributional representations (i.e., word embeddings) for individual words. Dai2015SemisupervisedSL proposed to train a language model or an autoencoder on unlabeled data and then use the obtained model to finetune on downstream tasks. Recently, pretrained language models such as ELMo Peters2018DeepCW, GPT/GPT2 Radford2018ImprovingLU; Radford2019LanguageMA, BERT Devlin2019BERTPO, the cross-lingual language model (XLM) Lample2019CrosslingualLM, XLNet Yang2019XLNetGA, RoBERTa Liu2019RoBERTaAR and ALBERT Lan2020ALBERTAL have attracted increasing attention in the natural language processing community. These models are first pretrained on large amounts of unlabeled data to capture rich representations of the input, and then applied to downstream tasks either by providing context-aware embeddings of an input sequence Peters2018DeepCW or by initializing the parameters of the downstream model Devlin2019BERTPO for finetuning. Such pretraining approaches deliver decent performance on natural language understanding tasks.
2.2 Instability in Finetuning
Finetuning instability of BERT has been reported in several previous works. Devlin2019BERTPO report instabilities when finetuning BERT on small datasets and resort to performing multiple restarts of finetuning and selecting the model that performs best on the development set. Dodge2020FineTuningPL perform a large-scale empirical investigation of the finetuning instability of BERT. They find dramatic variations in finetuning accuracy across multiple restarts and argue that it may be related to the choice of random seed and the dataset size. Lee2020MixoutER propose a new regularization method named Mixout to improve the stability and performance of finetuning BERT. Zhang2020RevisitingFB empirically evaluate the importance of the debiasing step by finetuning BERT with both BERTAdam and the standard Adam optimizer Kingma2015AdamAM, and propose a reinitialization method to obtain a better initialization point for finetuning. Mosbach2020OnTS analyze the cause of finetuning instability and propose a simple but strong baseline (a small learning rate combined with bias correction).
2.3 Regularization
There have been several regularization approaches to stabilizing the performance of models. Loshchilov2019DecoupledWD propose a decoupled weight decay regularizer integrated into the Adam Kingma2015AdamAM optimizer to prevent neural networks from becoming too complex. Gunel2020SupervisedCL use a contrastive learning method to augment the training set and improve generalization performance. In addition, the spectral norm Yoshida2017SpectralNR; Roth2019AdversarialTG serves as a general tool for constraining the Lipschitz continuity of matrices, which can increase the stability of generalized neural networks.
Several noise-based methods have also been proposed to improve the generalizability of pretrained language models, including SMART Jiang2020SMARTRA, FreeLB Zhu2020FreeLBEA and R3F Aghajanyan2020BetterFB. They achieve state-of-the-art performance on the GLUE, SNLI Bowman2015ALA, SciTail Khot2018SciTaiLAT, and ANLI nieetal2020adversarial NLU benchmarks. Most of these algorithms employ adversarial training to improve the robustness of language model finetuning. SMART uses an adversarial methodology to encourage models to be smooth within a neighborhood of each input; FreeLB optimizes a direct adversarial loss through iterative gradient ascent steps; R3F removes the adversarial nature of SMART and optimizes the smoothness of the whole model directly. Unlike these methods, our proposed method does not adopt an adversarial training strategy: we optimize the smoothness of each layer of BERT directly and thus improve the stability of the whole model.
3 Using Noise Stability as a Regularizer
One of the central issues in neural network training is to determine the optimal degree of complexity for the model. A model which is too limited will not sufficiently capture the structure in the data, while one which is too complex will model the noise on the data (the phenomenon of overfitting). In either case, the performance on new data, that is the ability of the network to generalize, will be poor. The problem can be regarded as one of finding the optimal tradeoff between the high bias of a model which is too inflexible and the high variance of a model with too much freedom
geman1992neural; bishop1995training; Novak2018SensitivityAG; bishop1991improving. To control the bias-variance tradeoff of BERT models, we impose an explicit noise regularization method.
3.1 Introduction of Our Method
Denoting the training set as , we give the general form of the optimization objective for a BERT model with layers as follows:
(1) 
To represent , we first define the injection position as the input of layer which is denoted as . If the regularization is operated at the output of layer , we can further denote the function between layer and as , satisfying that
. To implement the noise stability regularization, we inject a Gaussianlike noise vector
to and get a neighborhood . Specifically, each element is sampled independently from a Gaussian distribution with the mean of zero and the standard deviation of
as. The probability density function of the noise distribution can be written as
. Our goal is to minimize the discrepancy between their outputs over defined as . In our framework, we use a fixed position as the position of noise injection and constrain the output distance on all layers following layer . Denoting the regularization weight corresponding to each as , given a sample , the regularization term is given by the following formula:
(2) 
An overall algorithm is represented in Algorithm 1.
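The procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `tanh` layers stand in for Transformer layers, the squared Euclidean distance is an assumed choice of output discrepancy, and all names (`make_layer`, `lnsr_penalty`, `toy_layers`) are hypothetical.

```python
import math
import random


def make_layer(weights, bias):
    """Build a toy layer x -> tanh(W x + b) that operates on plain lists."""
    def layer(x):
        return [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)
        ]
    return layer


def lnsr_penalty(layers, x_b, sigma, lambdas, rng):
    """Layerwise noise stability penalty for one sample (illustrative only).

    layers  : functions applied in order after the injection point
    x_b     : clean representation at the injection point
    sigma   : standard deviation of the zero-mean Gaussian noise
    lambdas : one regularization weight per following layer
    """
    noisy = [xi + rng.gauss(0.0, sigma) for xi in x_b]
    clean = list(x_b)
    penalty = 0.0
    for layer, lam in zip(layers, lambdas):
        clean = layer(clean)  # clean forward pass
        noisy = layer(noisy)  # perturbed forward pass
        # Assumed discrepancy: squared Euclidean distance between outputs.
        penalty += lam * sum((c - n) ** 2 for c, n in zip(clean, noisy))
    return penalty


# Two hypothetical 2x2 layers, with weight 1 on every layer.
toy_layers = [
    make_layer([[0.5, -0.3], [0.8, 0.1]], [0.0, 0.1]),
    make_layer([[1.0, 0.4], [-0.6, 0.9]], [0.2, -0.1]),
]
reg = lnsr_penalty(toy_layers, [1.0, -2.0], sigma=0.1,
                   lambdas=[1.0, 1.0], rng=random.Random(0))
total_loss = 0.7 + reg  # 0.7 is a placeholder for a task loss value
```

The penalty is added to the task loss, so the model is trained on the clean input while being discouraged from changing its intermediate outputs under small input noise.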
3.2 Theoretical Analysis
Regularization is a commonly used technique for reducing function complexity and, as a result, making the learned model generalize well to unseen examples. In this part, we theoretically prove that the proposed LNSR algorithm has the effects of encouraging local Lipschitz continuity and imposing a Tikhonov regularizer under different assumptions. For simplicity, we omit the notations about the layer number in this part, denoting as the target function and as the input of parameterized by . Given a sample , we discuss the general form of the noise stability defined as follows:
(3) 
Lipschitz continuity. The Lipschitz property reflects the degree of smoothness of a function. Recent theoretical studies on deep learning have revealed a close connection between the Lipschitz property and generalization bartlett2017spectrally; neyshabur2017exploring. Given a sampled , minimizing is equivalent to minimizing:
(4) 
Thus the noise stability regularization can be regarded as minimizing the Lipschitz constant in a local region around the input .
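The argument can be written out as a short sketch. The notation here is assumed rather than taken from the extracted text: $f$ is the target function, $x$ its input, $\varepsilon \neq 0$ the sampled noise vector, and $K_x$ a local Lipschitz constant of $f$ around $x$.

```latex
% Assumed notation: f target function, x input, \varepsilon sampled noise,
% K_x a local Lipschitz constant of f in a small ball around x.
\[
  \frac{\lVert f(x+\varepsilon) - f(x) \rVert}{\lVert (x+\varepsilon) - x \rVert}
  \;=\;
  \frac{\lVert f(x+\varepsilon) - f(x) \rVert}{\lVert \varepsilon \rVert}
  \;\le\; K_x .
\]
% For a fixed sample \varepsilon the denominator is a constant, so reducing
% \lVert f(x+\varepsilon) - f(x) \rVert reduces this ratio, which is a sampled
% lower bound on the local Lipschitz constant K_x.
```

Because the ratio is only evaluated at sampled perturbations, this should be read as encouraging a small local Lipschitz constant in expectation, not as a hard bound.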
Tikhonov regularizer. The Tikhonov regularizer willoughby1979solutions involves constraints on the derivatives of the objective function of different orders. For the simplest first-order case, it can be regarded as imposing robustness and shaping a flatter loss surface at the input, which makes the learned function smoother.
Assuming that the magnitude of is small, we can expand the first term as a Taylor approximation as:
(5) 
where and refer to the Jacobian and Hessian of with respect to the input respectively.
Ignoring the higher order term and denoting as the kth output of the function , we can rewrite the regularizer by substituting Eq. 5 in Eq. 3 as:
(6)  
We define the input vector as and noise vector as . Assuming that the noise and the input are independent, and that the derivatives of with respect to different elements of the input vector are independent of each other, we expand the second order term corresponding to the Jacobian as:
(7)  
According to the characteristics of the Gaussian distribution, we also have
(8) 
Thus, we can rewrite the second order term corresponding to the Hessian in Eq. 6 as:
(9)  
where is a constant independent of the input . The third term generated from the expansion of Eq. 6 is zero as we have . Thus we get
(10) 
Considering the case where the input and output of the function are both scalar variables, the Tikhonov regularization willoughby1979solutions takes the general form:
(11) 
Eq. 10 shows that our proposed regularizer ensuring the noise stability is equivalent to a special case of the Tikhonov regularizer, where we involve the first and second order derivatives of the objective function .
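The first-order part of this result can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original derivation: it uses a linear map $f(x) = Jx$ (so the Hessian vanishes) with a hypothetical matrix `J`, and relies on the standard identity $\mathbb{E}\lVert J\varepsilon\rVert^2 = \sigma^2 \lVert J \rVert_F^2$ for zero-mean isotropic Gaussian noise with standard deviation $\sigma$.

```python
import random


def frobenius_sq(J):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(v * v for row in J for v in row)


def expected_sq_change(J, sigma, n_samples, rng):
    """Monte Carlo estimate of E||J(x + eps) - J x||^2 = E||J eps||^2."""
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, sigma) for _ in J[0]]
        total += sum(sum(j * e for j, e in zip(row, eps)) ** 2 for row in J)
    return total / n_samples


J = [[1.0, -2.0, 0.5], [0.3, 0.7, -1.5]]  # hypothetical Jacobian
sigma = 0.1                                 # hypothetical noise scale
estimate = expected_sq_change(J, sigma, 20000, random.Random(0))
closed_form = sigma ** 2 * frobenius_sq(J)
```

With enough samples, `estimate` should be close to `closed_form`, which is the Jacobian penalty that the noise stability term reduces to when second order terms are negligible.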
An alternative way to improve robustness is to add noise directly to the input, without explicitly constraining the output stability. rifai2011adding derive that adding noise to the input has the effect of penalizing both the norm of the Jacobian and the trace of the Hessian , but the Hessian term is not constrained to be positive. In contrast, the regularizer induced by our proposed LNSR is guaranteed to be positive because it involves the sum of squares of the first and second order derivatives. Moreover, our work relaxes the assumption of an MSE regression loss required by rifai2011adding. By imposing an explicit constraint of noise stability on middle-layer representations, we extend the theoretical understanding of noise stability to deep learning algorithms.
RTE  MRPC  CoLA  STSB  
mean  std  max  mean  std  max  mean  std  max  mean  std  max  
FT Devlin2019BERTPO  
SP Li2018ExplicitIB  
Mixout Lee2020MixoutER  
SMART Jiang2020SMARTRA  
LNSR(ours) 
4 Experiments
In this section, we experimentally demonstrate the effectiveness of the LNSR method on text classification tasks compared with other regularization methods, and confirm that insensitivity to noise promotes the generalizability and stability of BERT.
4.1 Data
We conduct experiments on four few-sample (fewer than 10k training samples) text classification tasks of GLUE (https://gluebenchmark.com/). The datasets are described below and summarized in Table 4 in Appendix A.
Corpus of Linguistic Acceptability (CoLA Warstadt2019NeuralNA) consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence. This is a binary classification task and Matthews correlation coefficient (MCC) matthews1975comparison is used to evaluate the performance.
Microsoft Research Paraphrase Corpus (MRPC Dolan2005AutomaticallyCA) is a corpus of sentence pairs with human annotations for whether the sentences in the pair are semantically equivalent. The evaluation metric is the average of F1 and Accuracy.
Recognizing Textual Entailment (RTE Wang2018GLUEAM; Dagan2005ThePR; BarHaim2006TheSP; Giampiccolo2007TheTP) is a corpus of textual entailment, in which each example is a sentence pair annotated with whether the first sentence entails the second. The evaluation metric is Accuracy.
Semantic Textual Similarity Benchmark (STSB Cer2017SemEval2017T1) is a regression task. Each example is a sentence pair human-annotated with a similarity score from 1 to 5; the task is to predict these scores. The evaluation metric is the average of the Pearson and Spearman correlation coefficients.
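For readers unfamiliar with the CoLA metric, the following sketch computes the Matthews correlation coefficient for binary labels from the confusion-matrix counts. The function name and the toy labels are illustrative, and returning 0.0 when the denominator vanishes is an assumed convention.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the table is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy example: 2 true positives, 2 true negatives, 1 false positive, 1 false negative.
mcc = matthews_corrcoef([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

Unlike accuracy, this score stays near zero for a classifier that ignores the input, which is why it is preferred on imbalanced acceptability data.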
4.2 Baseline Models
We use BERT Devlin2019BERTPO, a large-scale bidirectional pretrained language model, as the base model in all experiments. We adopt the PyTorch implementation by Wolf2019HuggingFacesTS.
Finetuning. We use the standard BERT finetuning method described in Devlin2019BERTPO.
SP Li2018ExplicitIB is a regularization scheme that explicitly promotes the similarity of the final solution with the initial model. It is usually used for preventing pretrained models from catastrophic forgetting. We adopt the form of .
Mixout Lee2020MixoutER is a stochastic regularization technique motivated by Dropout Srivastava2014DropoutAS and DropConnect Wan2013RegularizationON. At each training iteration, each model parameter is replaced with its pretrained value with probability . The goal is to improve the generalizability of pretrained language models.
SMART Jiang2020SMARTRA imposes a smoothness-inducing regularizer, computed in an adversarial manner, to control the model complexity at the finetuning stage. It also employs a class of Bregman proximal point optimization methods to prevent the model from updating aggressively during finetuning.
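The parameter replacement step of Mixout described above can be pictured with a small sketch. It covers only the replacement as stated in the text; any rescaling or other details of the published method are deliberately omitted, and the function name, the flat parameter lists, and the probability `p` are hypothetical.

```python
import random


def mixout_replace(current, pretrained, p, rng):
    """Swap each parameter for its pretrained value with probability p.

    Simplified illustration: only the replacement step is shown.
    """
    return [
        w_pre if rng.random() < p else w
        for w, w_pre in zip(current, pretrained)
    ]


current = [0.9, -0.4, 0.2, 1.3]      # hypothetical finetuned parameters
pretrained = [1.0, -0.5, 0.0, 1.0]   # hypothetical pretrained parameters
mixed = mixout_replace(current, pretrained, p=0.5, rng=random.Random(0))
```

Each entry of `mixed` comes either from `current` or from `pretrained`, which keeps the finetuned model close to its pretrained starting point.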
4.3 Experimental Setup
Our model is implemented in PyTorch on top of the Transformers framework (https://huggingface.co/transformers/index.html). Specifically, we use the learning setup and hyperparameters recommended by Devlin2019BERTPO. We use the Huggingface implementation of the Adam Kingma2015AdamAM optimizer (without bias correction) with learning rate of ,, and warmup over the first 10% of the total steps. We finetune the entire model (340 million parameters), of which the vast majority start as pretrained weights (BERT-Large-Uncased); only the classification layer (2048 parameters) is newly initialized. Weights of the classification layer are initialized with . We train with a batch size of 32 for 3 epochs. More details of our experimental setup are described in Appendix A.
RTE  MRPC  CoLA  STSB  
mean  std  max  mean  std  max  mean  std  max  mean  std  max  
FT  
FT (4 Epochs)  
FT+Noise  
LNSR(ours) 
RTE  MRPC  CoLA  STSB  
train / eval / gap  train / eval / gap  train / eval / gap  train / eval / gap  
FT Devlin2019BERTPO  //  //  //  // 
LNSR(ours)  //  //  //  // 
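The warmup mentioned in Section 4.3 (over the first 10% of the total steps) can be sketched as a learning rate schedule. This is an illustration under assumptions: the linear decay after warmup is not stated in the text, `base_lr` is a placeholder because the actual learning rate is not recoverable here, and the function name is hypothetical.

```python
def lr_at_step(step, total_steps, base_lr, warmup_frac=0.1):
    """Linear warmup to base_lr, then an assumed linear decay to zero."""
    warmup_steps = max(1, int(round(total_steps * warmup_frac)))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / decay_steps)


# base_lr=1.0 is a placeholder scale, not the value used in the experiments.
schedule = [lr_at_step(s, 100, 1.0) for s in range(101)]
```

With 100 total steps, the rate rises linearly during the first 10 steps, peaks at `base_lr`, and then falls back to zero at the final step.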
4.4 Overall Performance
Table 1 shows the results of all the models on the selected GLUE datasets. We train on each dataset with 25 random seeds. To implement our LNSR, we uniformly inject noise at the first layer of BERT-large for the comparison with baseline models. As the table shows, our model outperforms all the baseline models in mean and max values, which indicates stronger generalizability than the baselines. The p-values between the accuracy distributions of standard BERT finetuning and our model are calculated to verify whether the improvements are significant. We obtain very small p-values on all tasks: RTE: , MRPC: , CoLA: , STSB: .
Standard deviation is an indicator of the stability of a model's performance: a higher standard deviation means greater sensitivity to random seeds. Our model shows a lower standard deviation on each task, which means it is less sensitive to random seeds than the other models. Figure 2 presents a clearer illustration. To sum up, our proposed method effectively improves the performance and stability of finetuning BERT.
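The per-task statistics discussed above (mean, standard deviation, median, and maximum over random seeds) can be computed as in the sketch below. The scores are made-up placeholders, not results from the paper, and the use of the sample standard deviation is an assumption since the text does not say which estimator is reported.

```python
import statistics


def summarize(scores):
    """Summarize evaluation scores collected over several random seeds."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation (assumed)
        "median": statistics.median(scores),
        "max": max(scores),
    }


# Placeholder scores for four hypothetical seeds.
summary = summarize([70.0, 72.0, 68.0, 74.0])
```

A method is preferable in the sense used here when it raises `mean`, `median`, and `max` while lowering `std` across the same set of seeds.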
5 Analysis
5.1 Ablation Study
To verify the effectiveness of our proposed LNSR model, we conduct several ablation experiments, including finetuning with more training epochs and noise perturbation without regularization. For the latter, we inject noise directly into the output of a specific layer, propagate the perturbed representation forward, and compute the loss; this process is similar to data augmentation in the representation vector space. The results are shown in Table 2. We observe that the benefit obtained from longer training is limited. Similarly, finetuning with noise perturbation achieves only slightly better results on two of these tasks, showing that simply adding noise without an explicit restriction on the outputs may not be sufficient to obtain good generalizability. In contrast, BERT models with LNSR perform better on every task. This supports our claim that LNSR can promote the stability of BERT finetuning while improving the generalizability of the BERT model.
5.2 Effects on the Generalizability of Models
We verify the effects of our proposed method on the generalizability of BERT models in two ways: the generalization gap and the models' performance with fewer training samples. Because of the limited data and the extremely high complexity of the BERT model, a bad finetuning starting point makes the adapted model overfit the training data and generalize poorly to unseen data. Both measures intuitively reflect the generalizability of a model.
Table 3 shows the mean training accuracy, mean evaluation accuracy and generalization gap of different models on each task. As the table shows, finetuning with LNSR effectively narrows the generalization gap and achieves a higher evaluation score. The narrowing of the generalization gap is also reflected in Figure 3, where we can see higher evaluation accuracy and lower evaluation loss.
We sample subsets from the two relatively larger datasets, CoLA (8.5k training samples) and STSB (7k training samples), with sampling ratios of 0.15, 0.3 and 0.5. As shown in Figure 4, finetuning with LNSR shows a clear advantage with fewer training samples, suggesting that LNSR can effectively promote the model's generalizability.
5.3 Sensitivity to the Position of Noise Injection
We briefly discuss the sensitivity to the position of noise injection, as it is a predetermined hyperparameter of our method. As shown in Figure 5 in Appendix A, the performance of LNSR does not fluctuate much as the position of noise injection changes. All injection positions bring significant improvements over vanilla finetuning. Note that, with LNSR, noise injection at the lower layers usually leads to relatively higher accuracy and stability, implying that LNSR may be more effective when it affects both the lower and higher layers of the network.
5.4 Relationship to Previous Noisebased Approaches
Our method is related to SMART Jiang2020SMARTRA, FreeLB Zhu2020FreeLBEA and R3F Aghajanyan2020BetterFB. As mentioned before, most of these approaches employ adversarial training strategies to improve the robustness of BERT finetuning. SMART solves a supremum by using an adversarial methodology to find the largest KL divergence within a small norm ball around each input, FreeLB optimizes a direct adversarial loss through iterative gradient ascent steps, while R3F removes the adversarial nature of SMART and optimizes the smoothness of the whole model directly.
Compared with such adversarial algorithms, our method is easier to implement and provides a relatively rigorous theoretical guarantee. The layerwise design is sensible in that it exploits the hierarchical representations of modern deep neural networks. Studies in knowledge distillation have reported a similar finding: imitating middle-layer representations adriana2015fitnets; zagoruyko2016paying performs better than aligning the final outputs hinton2015distilling. Moreover, LNSR allows us to use different regularization weights for different layers (we use a fixed weight of 1 on all layers in this paper). We leave this exploration to future work.
6 Conclusion
In this paper, we propose Layerwise Noise Stability Regularization (LNSR) as a lightweight and effective method to improve generalizability and stability when finetuning BERT on few training samples. Our proposed LNSR method is a general technique that improves model output stability while maintaining or improving the original performance. Furthermore, we provide a theoretical analysis of the relationship of our model to Lipschitz continuity and the Tikhonov regularizer. Extensive empirical results show that our proposed method can effectively improve the generalizability and stability of the BERT model.
7 Acknowledgements
Hang Hua would like to thank Jeffries for supporting his research.
References
Appendix A Experimental Details
The model we use for the experiments in Section 4 is the standard BERT-large model with 24 stacked Transformer Vaswani2017AttentionIA encoder layers, a hidden size of 1024, and 16 self-attention heads. We initialize the pretrained part of the model with the BERT-Large-Uncased-Whole-Word-Masking weights. The final layer is a classification layer with 2048 parameters, which contains of the total number of parameters in the model. We initialize the last layer with and set each bias to 0. For the position of noise injection, we uniformly choose the first layer as the noise regularization start point. In the analysis of sensitivity to the position of noise injection, we also try injecting noise at different layers, as shown in Figure 5. For the baseline model Mixout, we use the code from the GitHub repository https://github.com/bloodwass/mixout.git. The other baseline models are implemented by ourselves.
Table 4 summarizes dataset statistics used in this work. We use the standard GLUE benchmark datasets downloaded from https://gluebenchmark.com/tasks.
Appendix B Other Experimental Reports
We also report the maximum value obtained when finetuning BERT with our proposed LNSR regularizer over a large number of random seeds and several noise injection positions, since the maximum value also reflects the ability of the learning algorithm to reach an optimal point. The results are shown in Table 5. On some tasks, finetuning BERT with LNSR is even competitive with finetuning state-of-the-art models that adopt more powerful modern architectures and pretraining strategies.
RTE  MRPC  CoLA  STSB  
Task  NLI  Paraphrase  Acceptability  Similarity  
Metrics  Accuracy  F1/Accuracy  Matthews corr  Pearson/Spearman corr  
Number of labels  2  2  2  1  
Number of training samples  2.5k  3.7k  8.6k  7k  
Number of validation samples  276  408  1k  1.5k  
Number of test samples  3k  1.7k  1k  1.4k 
RTE  MRPC  CoLA  STSB  
BERT Devlin2019BERTPO  70.4  88.0  60.6  90.0  
BERT Phang2018SentenceEO  70.0  90.7  62.1  90.9  
LNSR (ours)  
XLNet Yang2019XLNetGA  
RoBERTa Liu2019RoBERTaAR  
ALBERT Lan2020ALBERTAL 