Noise Stability Regularization for Improving BERT Fine-tuning

07/10/2021 · by Hang Hua, et al. · University of Rochester; Baidu, Inc.

Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP tasks. Despite its recent success and wide adoption, this process is unstable when there are only a small number of training samples available. The brittleness of this process is often reflected by the sensitivity to random seeds. In this paper, we propose to tackle this problem based on the noise stability property of deep nets, which is investigated in recent literature (Arora et al., 2018; Sanyal et al., 2020). Specifically, we introduce a novel and effective regularization method to improve fine-tuning on NLP tasks, referred to as Layer-wise Noise Stability Regularization (LNSR). We extend the theories about adding noise to the input and prove that our method provides a more stable regularization effect. We provide supportive evidence by experimentally confirming that well-performing models show a low sensitivity to noise and that fine-tuning with LNSR exhibits clearly higher generalizability and stability. Furthermore, our method also demonstrates advantages over other state-of-the-art algorithms including L2-SP (Li et al., 2018), Mixout (Lee et al., 2020) and SMART (Jiang et al., 2020).


1 Introduction

Large-scale pre-trained language models such as BERT Devlin2019BERTPO have been widely used in natural language processing tasks Guu2020REALMRL; Liu2019FinetuneBF; Wadden2019EntityRA; Zhu2020IncorporatingBI. A typical process of training on a supervised downstream dataset is to fine-tune a pre-trained model for a few epochs. In this process, most of the model's parameters are reused, while a randomly initialized task-specific layer is added to adapt the model to the new task.

Figure 1: Attenuation of injected noise in the BERT-Large-Uncased model on the MRPC task (X-axis: the layer index; Y-axis: the norm of the difference between the original output and the noise-perturbed output). Each curve starts at the layer where a scaled Gaussian noise is injected into its output, with the noise norm set to a fixed fraction of the norm of the original output. As it propagates upward, the injected noise has a rapidly decreasing effect in the lower layers but becomes volatile in the higher layers, which indicates the poor generalizability and brittleness of the top BERT layers. Moreover, models with higher accuracies (marked in the upper right) usually have lower error ratios, i.e., higher noise stability, in the top layers.
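A minimal sketch of how curves like those in Figure 1 can be computed is shown below. It is not the authors' code: a toy stack of Tanh layers stands in for the BERT encoder, and the noise fraction is an illustrative value; the real experiment applies the same procedure to the hidden states of BERT-Large on MRPC.

```python
# Sketch: inject scaled Gaussian noise at one layer's output and track how the
# relative deviation evolves at every subsequent layer (the curve in Figure 1).
import torch
import torch.nn as nn

torch.manual_seed(0)

layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.Tanh()) for _ in range(12)]
)
x = torch.randn(8, 64)        # a batch of toy "sentence" representations
inject_at = 0                 # index of the layer whose output is perturbed
noise_fraction = 0.05         # noise norm as a fraction of the clean output norm

with torch.no_grad():
    clean, noisy = x, x
    curve = []
    for idx, layer in enumerate(layers):
        clean = layer(clean)
        noisy = layer(noisy)
        if idx == inject_at:
            noise = torch.randn_like(clean)
            noise = noise * (noise_fraction * clean.norm() / noise.norm())
            noisy = clean + noise          # inject once, then let it propagate
        # relative deviation of the perturbed pass at this layer (the Y-axis)
        curve.append(((noisy - clean).norm() / clean.norm()).item())

print(curve)  # one value per layer index (the X-axis)
```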

Fine-tuning BERT has significantly boosted state-of-the-art performance on natural language understanding (NLU) benchmarks such as GLUE Wang2018GLUEAM and SuperGLUE Wang2019SuperGLUEAS. However, despite the impressive empirical results, this process remains unstable due to the randomness introduced by data shuffling and the initialization of the task-specific layer. The instability of fine-tuning BERT was first reported by Devlin2019BERTPO; Dodge2020FineTuningPL, and several approaches have been proposed to solve this problem Lee2020MixoutER; Zhang2020RevisitingFB; Mosbach2020OnTS.

In this study, we consider the fine-tuning stability of BERT from the perspective of sensitivity to input perturbation. This is motivated by Arora2018StrongerGB and Sanyal2020StableRN, who show that, for neural networks with good generalizability, noise injected at the lower layers has very little effect on the higher layers. However, for a well pre-trained BERT, we find that the higher layers are still very sensitive to perturbations of the lower layers' outputs (as shown in Figure 1), implying that the high-level representations of the pre-trained BERT may not generalize well on downstream tasks and consequently lead to instability. This phenomenon coincides with the observation that transferring the top pre-trained layers of BERT slows down learning and hurts performance Zhang2020RevisitingFB. In addition, Yosinski2014HowTA point out that in transfer learning models for object recognition, the lower pre-trained layers learn more general features while the higher layers closer to the output specialize more to the pre-training task. We argue that this result also applies to BERT. Intuitively, if a trained model is insensitive to perturbations of the lower layers' outputs, then the model is confident about the output, and vice versa. Based on the above theoretical and empirical results, we propose a simple and effective regularization method to reduce the noise sensitivity of BERT and thus improve the stability and performance of fine-tuned BERT.

To verify our approach, we conduct extensive experiments on different few-sample (fewer than 10k training samples) NLP tasks, including CoLA Warstadt2019NeuralNA, MRPC Dolan2005AutomaticallyCA, RTE Wang2018GLUEAM; Dagan2005ThePR; BarHaim2006TheSP; Giampiccolo2007TheTP, and STS-B Cer2017SemEval2017T1. With the layer-wise noise stability regularization, we obtain strong empirical performance. Compared with other state-of-the-art models, our approach not only improves fine-tuning stability (with a smaller standard deviation) but also consistently improves overall performance (with a larger mean, median, and maximum).

In summary, our main contributions are:

  • We propose a lightweight and effective regularization method, referred to as Layer-wise Noise Stability Regularization (LNSR), to improve the local Lipschitz continuity of each BERT layer and thus ensure the smoothness of the whole model. The empirical results show that fine-tuned BERT models regularized with LNSR obtain significantly more accurate and stable results. LNSR also outperforms other state-of-the-art methods aiming at stabilizing fine-tuning such as L2-SP Li2018ExplicitIB, Mixout Lee2020MixoutER and SMART Jiang2020SMARTRA.

  • We are the first to study the effect of noise stability on NLP tasks. We extend classic theories of training with noise by explicitly constraining the output consistency when noise is added to the input. We theoretically prove that our proposed layer-wise noise stability regularizer is equivalent to a special case of the Tikhonov regularizer, which serves as a more stable regularizer than simply adding noise to the input rifai2011adding.

  • We investigate the relationship between the noise stability property and the generalizability of BERT. We find that, in general, models with good generalizability tend to be insensitive to noise perturbation; the lower layers of BERT show better error resilience, while the higher layers of BERT remain sensitive to perturbations of the lower layers (as depicted in Figure 1).

2 Related Work

2.1 Pre-training

Pre-training has been well studied in machine learning and natural language processing Erhan2009TheDO; Erhan2010WhyDU. Mikolov2013DistributedRO and Pennington2014GloveGV proposed to use distributional representations (i.e., word embeddings) for individual words. Dai2015SemisupervisedSL proposed to train a language model or an auto-encoder with unlabeled data and then leverage the obtained model to fine-tune downstream tasks. Recently, pre-trained language models, such as ELMo Peters2018DeepCW, GPT/GPT-2 Radford2018ImprovingLU; Radford2019LanguageMA, BERT Devlin2019BERTPO, the cross-lingual language model (briefly, XLM) Lample2019CrosslingualLM, XLNet Yang2019XLNetGA, RoBERTa Liu2019RoBERTaAR, and ALBERT Lan2020ALBERTAL, have attracted more and more attention in the natural language processing community. These models are first pre-trained on large amounts of unlabeled data to capture rich representations of the input, and then applied to downstream tasks by either providing context-aware embeddings of an input sequence Peters2018DeepCW or initializing the parameters of the downstream model Devlin2019BERTPO for fine-tuning. Such pre-training approaches deliver decent performance on natural language understanding tasks.

2.2 Instability in Fine-tuning

Fine-tuning instability of BERT has been reported in various previous works. Devlin2019BERTPO report instabilities when fine-tuning BERT on small datasets and resort to performing multiple restarts of fine-tuning and selecting the model that performs best on the development set. Dodge2020FineTuningPL perform a large-scale empirical investigation of the fine-tuning instability of BERT. They find dramatic variations in fine-tuning accuracy across multiple restarts and argue that these might be related to the choice of random seed and the dataset size. Lee2020MixoutER propose a new regularization method named Mixout to improve the stability and performance of fine-tuning BERT. Zhang2020RevisitingFB empirically evaluate the importance of the debiasing step by fine-tuning BERT with both BERTAdam and the standard Adam optimizer Kingma2015AdamAM, and propose a re-initialization method to obtain a better initialization point for fine-tuning optimization. Mosbach2020OnTS analyze the causes of fine-tuning instability and propose a simple but strong baseline (a small learning rate combined with bias correction).

2.3 Regularization

Several regularization approaches have been proposed to stabilize the performance of models. Loshchilov2019DecoupledWD propose a decoupled weight decay regularizer integrated into the Adam Kingma2015AdamAM optimizer to prevent neural networks from becoming too complex. Gunel2020SupervisedCL augment the fine-tuning objective with a supervised contrastive learning term to improve generalization performance. In addition, spectral norm regularization Yoshida2017SpectralNR; Roth2019AdversarialTG is a general method that can be used to constrain the Lipschitz constant of weight matrices, which increases the stability of neural networks.

Several noise-based methods have also been proposed to improve the generalizability of pre-trained language models, including SMART Jiang2020SMARTRA, FreeLB Zhu2020FreeLBEA, and R3F Aghajanyan2020BetterFB. They achieve state-of-the-art performance on the GLUE, SNLI Bowman2015ALA, SciTail Khot2018SciTaiLAT, and ANLI nie-etal-2020-adversarial NLU benchmarks. Most of these algorithms employ adversarial training to improve the robustness of language model fine-tuning. SMART uses an adversarial methodology to encourage models to be smooth within a neighborhood of each input; FreeLB optimizes a direct adversarial loss through iterative gradient ascent steps; R3F removes the adversarial nature of SMART and optimizes the smoothness of the whole model directly. Different from these methods, our proposed method does not adopt an adversarial training strategy; we optimize the smoothness of each layer of BERT directly and thus improve the stability of the whole model.

3 Using Noise Stability as a Regularizer

One of the central issues in neural network training is to determine the optimal degree of complexity for the model. A model that is too limited will not sufficiently capture the structure in the data, while one that is too complex will model the noise in the data (the phenomenon of over-fitting). In either case, the performance on new data, that is, the ability of the network to generalize, will be poor. The problem can be regarded as one of finding the optimal trade-off between the high bias of a model that is too inflexible and the high variance of a model with too much freedom geman1992neural; bishop1995training; Novak2018SensitivityAG; bishop1991improving. To control this trade-off between bias and variance for BERT models, we impose an explicit noise regularization method.

3.1 Introduction of Our Method

Denoting the training set as $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, we give the general form of the optimization objective for a BERT model with $L$ layers, parameterized by $\theta$, as follows:

$$\min_{\theta}\; \frac{1}{N}\sum_{n=1}^{N} \ell\big(f(x_n;\theta),\, y_n\big) \;+\; \mathcal{R}(\theta), \qquad (1)$$

where $\ell$ is the task loss and $\mathcal{R}(\theta)$ is the proposed noise stability regularization term.

To define $\mathcal{R}(\theta)$, we first choose the injection position to be the input of layer $i$, denoted as $x^{(i)}$. If the regularization is applied at the output of layer $j$ ($j \geq i$), we further denote the function between layers $i$ and $j$ as $f_{i,j}$, satisfying $h^{(j)} = f_{i,j}(x^{(i)})$. To implement the noise stability regularization, we inject a Gaussian noise vector $\epsilon$ into $x^{(i)}$ and obtain a neighboring point $x^{(i)} + \epsilon$. Specifically, each element $\epsilon_k$ is independently sampled from a Gaussian distribution with mean zero and standard deviation $\sigma$, i.e., $\epsilon_k \sim \mathcal{N}(0, \sigma^2)$. The probability density function of the noise distribution can be written as $p(\epsilon) = \prod_k \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\big(-\frac{\epsilon_k^2}{2\sigma^2}\big)$. Our goal is to minimize the expected discrepancy between the outputs of $f_{i,j}$ evaluated at $x^{(i)}$ and at $x^{(i)} + \epsilon$. In our framework, we use a fixed position for the noise injection and constrain the output distance at all layers following layer $i$. Denoting the regularization weight corresponding to each layer $j$ as $\lambda_j$, given a sample $x$, the regularization term is

$$\mathcal{R}(\theta) \;=\; \sum_{j=i}^{L} \lambda_j \, \mathbb{E}_{\epsilon}\Big[\, \big\| f_{i,j}(x^{(i)} + \epsilon) - f_{i,j}(x^{(i)}) \big\|_2^2 \,\Big]. \qquad (2)$$

The overall procedure is summarized in Algorithm 1; a minimal implementation sketch is given below.
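The sketch below illustrates the layer-wise term in Eq. (2). It is not the released implementation: it assumes `encoder_layers` is a list of modules mapping a hidden state to the next hidden state, so plugging it into the HuggingFace BERT encoder would additionally require attention masks and tuple-valued layer outputs.

```python
# Minimal sketch of the layer-wise noise stability term in Eq. (2).
import torch
import torch.nn.functional as F

def lnsr_term(encoder_layers, hidden, inject_at=0, sigma=1e-2, lambdas=None):
    """Return sum_j lambda_j * || f_{i,j}(x^(i)+eps) - f_{i,j}(x^(i)) ||^2."""
    if lambdas is None:
        lambdas = [1.0] * len(encoder_layers)   # fixed weight 1 on all layers
    clean, noisy = hidden, hidden
    reg = hidden.new_zeros(())
    for j, layer in enumerate(encoder_layers):
        if j == inject_at:
            # zero-mean Gaussian noise injected at the input of layer i
            noisy = noisy + sigma * torch.randn_like(noisy)
        clean = layer(clean)
        noisy = layer(noisy)
        if j >= inject_at:
            # constrain the output distance at the injection layer and above
            reg = reg + lambdas[j] * F.mse_loss(noisy, clean, reduction="sum")
    return reg

# During fine-tuning, the total objective is the task loss plus this term,
# as in Eq. (1):  loss = task_loss + lnsr_term(layers, embeddings)
```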

3.2 Theoretical Analysis

Regularization is a commonly used technique to reduce function complexity and, as a result, make the learned model generalize well on unseen examples. In this part, we theoretically prove that the proposed LNSR algorithm has the effect of encouraging local Lipschitz continuity and of imposing a Tikhonov regularizer, under different assumptions. For simplicity, we omit the layer indices in this part, denoting the target function as $f$ and its input as $x$, with $f$ parameterized by $\theta$. Given a sample $x$, we discuss the general form of the noise stability defined as

$$\mathcal{R}(x) \;=\; \mathbb{E}_{\epsilon}\Big[\, \big\| f(x + \epsilon) - f(x) \big\|_2^2 \,\Big]. \qquad (3)$$

Lipschitz continuity. The Lipschitz property reflects the degree of smoothness of a function. Recent theoretical studies on deep learning have revealed a close connection between the Lipschitz property and generalization bartlett2017spectrally; neyshabur2017exploring.

Given a sampled $\epsilon$, minimizing $\| f(x+\epsilon) - f(x) \|_2^2$ is equivalent to minimizing

$$\frac{\big\| f(x + \epsilon) - f(x) \big\|_2^2}{\|\epsilon\|_2^2}, \qquad (4)$$

since $\|\epsilon\|_2^2$ is a constant once $\epsilon$ is sampled. Thus the noise stability regularization can be regarded as minimizing the Lipschitz constant in a local region around the input $x$.

Tikhonov regularizer. The Tikhonov regularizer willoughby1979solutions involves constraints on derivatives of the objective function of different orders. In the simplest first-order case, it can be regarded as imposing robustness and shaping a flatter loss surface around the input, which makes the learned function smoother.

Assuming that the magnitude of $\epsilon$ is small, we can expand $f(x + \epsilon)$ as a Taylor approximation:

$$f(x + \epsilon) \;=\; f(x) + J(x)\,\epsilon + \tfrac{1}{2}\,\epsilon^{\top} H(x)\,\epsilon + O(\|\epsilon\|^3), \qquad (5)$$

where $J(x)$ and $H(x)$ refer to the Jacobian and Hessian of $f$ with respect to the input, respectively.

Ignoring the higher-order term and denoting $f_k$ as the $k$-th output of the function $f$ (with corresponding Jacobian row $J_k(x)$ and Hessian $H_k(x)$), we can rewrite the regularizer by substituting Eq. 5 into Eq. 3 as:

$$\mathcal{R}(x) \;\approx\; \mathbb{E}_{\epsilon}\Big[\sum_{k}\Big( J_k(x)\,\epsilon + \tfrac{1}{2}\,\epsilon^{\top} H_k(x)\,\epsilon \Big)^{2}\Big]. \qquad (6)$$

We write the input vector as $x = (x_1, \ldots, x_d)$ and the noise vector as $\epsilon = (\epsilon_1, \ldots, \epsilon_d)$. Assuming that the distributions of the noise and the input are independent, and that the derivatives of $f_k$ with respect to different elements of the input vector are independent of each other, we expand the second-order term corresponding to the Jacobian as:

$$\mathbb{E}_{\epsilon}\big[(J_k(x)\,\epsilon)^2\big] \;=\; \mathbb{E}_{\epsilon}\Big[\sum_{p,q}\frac{\partial f_k}{\partial x_p}\frac{\partial f_k}{\partial x_q}\,\epsilon_p\,\epsilon_q\Big] \;=\; \sigma^2 \sum_{p}\Big(\frac{\partial f_k}{\partial x_p}\Big)^{2}. \qquad (7)$$

According to the properties of the zero-mean Gaussian distribution, we also have

$$\mathbb{E}[\epsilon_p] = 0, \qquad \mathbb{E}[\epsilon_p\,\epsilon_q] = \sigma^2\,\delta_{pq}, \qquad \mathbb{E}[\epsilon_p^{2}\,\epsilon_q^{2}] = \sigma^4 \;\, (p \neq q), \qquad \mathbb{E}[\epsilon_p^{4}] = 3\sigma^4. \qquad (8)$$

Thus, we can rewrite the second-order term corresponding to the Hessian in Eq. 6 as:

$$\mathbb{E}_{\epsilon}\Big[\tfrac{1}{4}\big(\epsilon^{\top} H_k(x)\,\epsilon\big)^{2}\Big] \;=\; \frac{3\sigma^4}{4} \sum_{p}\Big(\frac{\partial^2 f_k}{\partial x_p^{2}}\Big)^{2} + C, \qquad (9)$$

where $C$ is a constant independent of the input $x$. The cross term generated from the expansion of Eq. 6 is zero, since $\mathbb{E}[\epsilon] = 0$ and the odd moments of the Gaussian noise vanish. Thus we get

$$\mathcal{R}(x) \;\approx\; \sigma^2 \sum_{k}\sum_{p}\Big(\frac{\partial f_k}{\partial x_p}\Big)^{2} \;+\; \frac{3\sigma^4}{4} \sum_{k}\sum_{p}\Big(\frac{\partial^2 f_k}{\partial x_p^{2}}\Big)^{2} \;+\; C. \qquad (10)$$

Considering the case where the input and output of the function are both scalar variables, the Tikhonov regularization willoughby1979solutions takes the general form

$$\Omega \;=\; \frac{1}{2}\sum_{r=0}^{R}\int h_r(x)\,\Big(\frac{d^{r} f}{d x^{r}}\Big)^{2}\, dx, \qquad (11)$$

where the $h_r(x)$ are non-negative weighting functions. Eq. 10 shows that our proposed regularizer, which enforces noise stability, is equivalent to a special case of the Tikhonov regularizer that involves the first- and second-order derivatives of the objective function $f$.
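A small numerical sanity check (not from the paper) of the leading Jacobian term in Eq. 10 is sketched below: for small $\sigma$, the Monte Carlo estimate of $\mathbb{E}_\epsilon\|f(x+\epsilon)-f(x)\|^2$ should be close to $\sigma^2\,\|J(x)\|_F^2$, since the Hessian term is of order $\sigma^4$. The toy network is illustrative.

```python
# Sanity check: E||f(x+eps) - f(x)||^2  ~=  sigma^2 * ||J(x)||_F^2  for small sigma.
import torch

torch.manual_seed(0)
f = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.Tanh(), torch.nn.Linear(32, 8)
)
x = torch.randn(16)
sigma = 1e-3

with torch.no_grad():
    mc = torch.stack([
        ((f(x + sigma * torch.randn_like(x)) - f(x)) ** 2).sum()
        for _ in range(20000)
    ]).mean()                                        # Monte Carlo estimate of Eq. (3)

J = torch.autograd.functional.jacobian(f, x)         # shape (8, 16)
first_order = sigma ** 2 * (J ** 2).sum()            # Jacobian term of Eq. (10)

print(float(mc), float(first_order))                 # the two values nearly agree
```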

An alternative for improving robustness is to directly add noise to the input, without explicitly constraining the output stability. rifai2011adding derived that adding noise to the input has the effect of penalizing both the norm of the Jacobian $J(x)$ and the trace of the Hessian $H(x)$, whereas the Hessian term is not constrained to be positive. In contrast, the regularizer induced by our proposed LNSR is guaranteed to be positive, since it involves the sum of squares of the first- and second-order derivatives. Moreover, our work relaxes the assumption of an MSE regression loss required by rifai2011adding. By imposing the explicit constraint of noise stability on middle-layer representations, we extend the theoretical understanding of noise stability to deep learning algorithms.

0:  Input: training set D, perturbation bound (noise standard deviation) sigma, learning rate eta, number of layers L, number of training epochs T, function f with parameters theta, the position of noise injection i, and regularization weights lambda_j for each layer j.
1:  Initialize theta with the pre-trained BERT weights and a randomly initialized task-specific layer
2:  for epoch = 1, ..., T do
3:     for each minibatch B in D do
4:        R <- 0
5:        for each (x, y) in B do
6:           sample noise eps with each element eps_k ~ N(0, sigma^2)
7:           x_tilde^(i) <- x^(i) + eps
8:           forward pass given x^(i) and x_tilde^(i) as inputs of layer i
9:           for j = i, ..., L do
10:              R <- R + lambda_j * || f_{i,j}(x_tilde^(i)) - f_{i,j}(x^(i)) ||_2^2
11:           end for
12:        end for
13:        loss <- (1/|B|) * sum over (x, y) in B of l(f(x; theta), y) + (1/|B|) * R
14:        theta <- theta - eta * grad_theta loss
15:     end for
16:  end for
17:  Output: fine-tuned parameters theta
Algorithm 1 Layer-wise Noise Stability Regularization (LNSR)
RTE MRPC CoLA STS-B
mean std max mean std max mean std max mean std max
FT Devlin2019BERTPO
L2-SP Li2018ExplicitIB
Mixout Lee2020MixoutER
SMART Jiang2020SMARTRA
LNSR (ours)
Table 1: The mean, standard deviation, and maximum performance on the development sets of the RTE, MRPC, CoLA, and STS-B tasks across 25 random seeds when fine-tuning the BERT-Large model with various regularization methods. FT refers to standard BERT fine-tuning. Standard deviation: lower is better.
Figure 2: Performance distribution box plot of each model on the four tasks from 25 random seeds.

4 Experiments

In this section, we experimentally demonstrate the effectiveness of the LNSR method on text classification tasks compared with other regularization methods, and confirm that insensitivity to noise promotes the generalizability and stability of BERT.

4.1 Data

We conduct experiments on four few-sample (fewer than 10k training samples) text classification tasks from GLUE (https://gluebenchmark.com/). The datasets are described below and summarized in Table 4 in Appendix A.

Corpus of Linguistic Acceptability (CoLA Warstadt2019NeuralNA) consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence. This is a binary classification task and Matthews correlation coefficient (MCC) matthews1975comparison is used to evaluate the performance.

Microsoft Research Paraphrase Corpus (MRPC Dolan2005AutomaticallyCA) is a corpus of sentence pairs with human annotations for whether the sentences in each pair are semantically equivalent. The evaluation metric is the average of F1 and accuracy.

Recognizing Textual Entailment (RTE Wang2018GLUEAM; Dagan2005ThePR; BarHaim2006TheSP; Giampiccolo2007TheTP) is a corpus of textual entailment, where each example is a sentence pair annotated with whether the first sentence entails the second. The evaluation metric is accuracy.

Semantic Textual Similarity Benchmark (STS-B Cer2017SemEval2017T1) is a regression task. Each example is a sentence pair that is human-annotated with a similarity score from 1 to 5; the task is to predict these scores. The evaluation metric is the average of the Pearson and Spearman correlation coefficients.

4.2 Baseline Models

We use BERT Devlin2019BERTPO, a large-scale bidirectional pre-trained language model, as the base model in all experiments. We adopt the PyTorch implementation by Wolf2019HuggingFacesTS.

Fine-tuning. We use the standard BERT fine-tuning method described in Devlin2019BERTPO.

L2-SP Li2018ExplicitIB is a regularization scheme that explicitly promotes the similarity of the final solution to the initial model. It is usually used to prevent pre-trained models from catastrophic forgetting. We adopt the form that penalizes the squared L2 distance between the fine-tuned weights and their pre-trained starting point.
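A minimal sketch of such a penalty is shown below. Li2018ExplicitIB use separate coefficients for the pre-trained and the novel parts; the single coefficient `alpha` and the function name here are illustrative only.

```python
# Sketch of an L2-SP-style penalty: pull shared weights toward their
# pre-trained values and shrink the newly added task-specific weights.
import torch

def l2_sp_penalty(model, pretrained_state, alpha=1e-3):
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in pretrained_state:                       # pre-trained part
            ref = pretrained_state[name].to(p.device)
            penalty = penalty + ((p - ref) ** 2).sum()
        else:                                              # new task-specific part
            penalty = penalty + (p ** 2).sum()
    return alpha * penalty
```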

Mixout Lee2020MixoutER is a stochastic regularization technique motivated by Dropout Srivastava2014DropoutAS and DropConnect Wan2013RegularizationON. At each training iteration, each model parameter is replaced with its pre-trained value with a given probability. The goal is to improve the generalizability of pre-trained language models.

SMART Jiang2020SMARTRA imposes a smoothness-inducing adversarial regularizer to control model complexity at the fine-tuning stage. It also employs a class of Bregman proximal point optimization methods to prevent aggressive updating during fine-tuning.

4.3 Experimental Setup

Our models are implemented in PyTorch based on the Transformers framework (https://huggingface.co/transformers/index.html). Specifically, we use the learning setup and hyperparameters recommended by Devlin2019BERTPO. We use the Huggingface implementation of the Adam Kingma2015AdamAM optimizer (without bias correction), with the recommended learning rate and warmup over the first 10% of the total training steps. We fine-tune the entire model (340 million parameters), of which the vast majority start as pre-trained weights (BERT-Large-Uncased), plus a randomly initialized classification layer (2048 parameters). We train with a batch size of 32 for 3 epochs. More details of our experimental setup are described in Appendix A.
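The optimization setup described above can be expressed with the HuggingFace Transformers API roughly as follows. This is a sketch, not the authors' script: the learning-rate value is a placeholder, the step count is computed for an RTE-sized training set, and the `AdamW` class shown here (which exposes the `correct_bias` switch) has been deprecated in recent Transformers releases.

```python
# Sketch: Adam without bias correction plus linear warmup over 10% of steps.
from transformers import (AdamW, BertForSequenceClassification,
                          get_linear_schedule_with_warmup)

model = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2)

num_epochs, batches_per_epoch = 3, 78        # e.g. RTE: ~2.5k samples / batch size 32
num_training_steps = num_epochs * batches_per_epoch

optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)  # no bias correction
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # warmup over the first 10% of steps
    num_training_steps=num_training_steps,
)
```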

RTE MRPC CoLA STS-B
mean std max mean std max mean std max mean std max
FT
FT (4 Epochs)
FT+Noise
LNSR (ours)
Table 2: Ablation study of LNSR on each task. We report the mean, standard deviation, and maximum of the evaluation scores across 25 random seeds. FT refers to standard BERT fine-tuning.
RTE MRPC CoLA STS-B
train / eval / gap train / eval / gap train / eval / gap train / eval / gap
FT Devlin2019BERTPO
LNSR (ours)
Table 3: Comparison of the generalization performance of different models. We report the mean training accuracy, the mean evaluation accuracy, and the generalization gap (training accuracy minus evaluation accuracy) of each model across 20 random seeds.

4.4 Overall Performance

Table 1 shows the results of all the models on the selected GLUE datasets. We train on each dataset with 25 random seeds. To implement our LNSR, we uniformly inject noise at the first layer of BERT-Large for the comparison with the baseline models. As we can see from the table, our model outperforms all the baseline models in mean and maximum values, which indicates stronger generalizability than the other baselines. The p-values between the score distributions of standard BERT fine-tuning and our model are calculated to verify whether the improvements are significant; we obtain very small p-values on all four tasks (RTE, MRPC, CoLA, and STS-B).

Standard deviation is an indicator of the stability of a model's performance; a higher standard deviation means higher sensitivity to random seeds. Our model shows a lower standard deviation on each task, which means it is less sensitive to random seeds than the other models. Figure 2 presents a clearer illustration. In summary, our proposed method can effectively improve the performance and stability of fine-tuning BERT.
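The paper does not state which significance test was used for the p-values above; Welch's two-sample t-test over the per-seed scores is one plausible choice, sketched below with random placeholder scores rather than the reported results.

```python
# Illustrative only: comparing two score distributions across random seeds.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ft_scores = rng.normal(loc=70.0, scale=3.0, size=25)    # placeholder: 25 seeds, standard FT
lnsr_scores = rng.normal(loc=72.0, scale=1.5, size=25)  # placeholder: 25 seeds, LNSR

t_stat, p_value = stats.ttest_ind(lnsr_scores, ft_scores, equal_var=False)
print(p_value)
```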

5 Analysis

5.1 Ablation Study

To verify the effectiveness of our proposed LNSR model, we conduct several ablation experiments, including fine-tuning with more training epochs and noise perturbation without regularization (we inject noise directly into the output of a specific layer and then use the perturbed representation for the forward pass and loss computation; this is similar to a vector-space representation augmentation). The results are shown in Table 2. We observe that the benefit obtained from longer training is limited. Similarly, fine-tuning with noise perturbation only achieves slightly better results on two of these tasks, showing that simply adding noise without an explicit restriction on the outputs may not be sufficient to obtain good generalizability. In contrast, BERT models with LNSR perform better on every task. This verifies our claim that LNSR can promote the stability of BERT fine-tuning and at the same time improve the generalizability of the BERT model.

Figure 3: The top-1 accuracy (top) and loss (bottom) curves for fine-tuning BERT with and without LNSR.

5.2 Effects on the Generalizability of Models

We verify the effects of our proposed method on the generalizability of BERT models in two ways: the generalization gap and the models' performance with fewer training samples. Due to the limited data and the extremely high complexity of the BERT model, a bad fine-tuning starting point makes the adapted model overfit the training data and generalize poorly to unseen data. The generalizability of a model is intuitively reflected by its generalization gap and its performance with fewer training samples.

Table 3 shows the mean training accuracy, the mean evaluation accuracy, and the generalization gap of the different models on each task. As we can see from the table, fine-tuning with LNSR effectively narrows the generalization gap and achieves higher evaluation scores. The narrowing of the generalization gap is also reflected in Figure 3, where we can see higher evaluation accuracy and lower evaluation loss.

We sample subsets from the two relatively larger datasets, CoLA (8.5k training samples) and STS-B (7k training samples), with sampling ratios of 0.15, 0.3, and 0.5. As shown in Figure 4, fine-tuning with LNSR shows a clear advantage with fewer training samples, suggesting that LNSR can effectively promote the model's generalizability. One way of constructing such subsets is sketched below.
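The snippet below is one possible way (not the authors' script) to build the reduced CoLA training sets with the HuggingFace `datasets` library; the same procedure applies to STS-B via the "stsb" GLUE configuration.

```python
# Build reduced training subsets of CoLA at several sampling ratios and seeds.
from datasets import load_dataset

full_train = load_dataset("glue", "cola", split="train")
for ratio in (0.15, 0.3, 0.5):
    for seed in range(20):
        subset = full_train.shuffle(seed=seed).select(
            range(int(ratio * len(full_train)))
        )
        # ... fine-tune on `subset` with and without LNSR, then average over seeds
```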

Figure 4: Mean evaluation score comparison of fine-tuning BERT with and without LNSR on fewer training samples of the CoLA and STS-B tasks. The mean values are calculated over 20 random seeds.

5.3 Sensitivity to the Position of Noise Injection

We briefly discuss the sensitivity to the position of noise injection, as it is a pre-determined hyperparameter of our method. As shown in Figure 5 in Appendix A, we observe that the performance of LNSR does not fluctuate much as the position of noise injection changes. All injection positions bring significant improvements over vanilla fine-tuning. Note that, with LNSR, noise injection at the lower layers usually leads to relatively higher accuracy and stability, implying that LNSR may be more effective when it affects both the lower and the higher layers of the network.

5.4 Relationship to Previous Noise-based Approaches

Our method is related to SMART Jiang2020SMARTRA, FreeLB Zhu2020FreeLBEA, and R3F Aghajanyan2020BetterFB. As mentioned before, most of these approaches employ adversarial training strategies to improve the robustness of BERT fine-tuning. SMART solves a supremum problem with an adversarial methodology, seeking the perturbation within an epsilon-ball that maximizes the KL divergence; FreeLB optimizes a direct adversarial loss through iterative gradient ascent steps; R3F removes the adversarial nature of SMART and optimizes the smoothness of the whole model directly.

Compared with these adversarial-based algorithms, our method is easier to implement and provides a relatively rigorous theoretical guarantee. The layer-wise design is sensible in that it exploits the hierarchical representations of modern deep neural networks. Studies in knowledge distillation report a similar experience: imitating intermediate-layer representations adriana2015fitnets; zagoruyko2016paying performs better than aligning only the final outputs hinton2015distilling. Moreover, LNSR allows us to use different regularization weights for different layers (we use a fixed weight of 1 on all layers in this paper); we leave this exploration for future work.

6 Conclusion

In this paper, we propose Layer-wise Noise Stability Regularization (LNSR) as a lightweight and effective method to improve the generalizability and stability of fine-tuning BERT on few training samples. Our proposed LNSR method is a general technique that improves model output stability while maintaining or improving the original performance. Furthermore, we provide a theoretical analysis of the relationship between our method and Lipschitz continuity and the Tikhonov regularizer. Extensive empirical results show that our proposed method can effectively improve the generalizability and stability of the BERT model.

7 Acknowledgements

Hang Hua would like to thank Jeffries for supporting his research.

References

Appendix A Experimental Details

The model we use for the experiments in Section 4 is the standard BERT-Large model with 24 stacked Transformer Vaswani2017AttentionIA encoder layers, a hidden size of 1024, and 16 self-attention heads. We initialize the pre-trained part of the model with the BERT-Large-Uncased-Whole-Word-Masking weights. The final layer is a classification layer with 2048 parameters, which accounts for only a tiny fraction of the total number of parameters in the model. We initialize the weights of the last layer randomly and set each bias to 0. For the position of noise injection, we uniformly choose the first layer as the noise regularization starting point. In the analysis of sensitivity to the position of noise injection, we also try injecting noise at different layers, as shown in Figure 5. For the baseline model Mixout, we use the code from the GitHub repository https://github.com/bloodwass/mixout.git. The other baseline models are implemented by us.

Table 4 summarizes dataset statistics used in this work. We use the standard GLUE benchmark datasets downloaded from https://gluebenchmark.com/tasks.

Appendix B Other Experimental Reports

We also report the maximum value obtained when fine-tuning BERT with our proposed LNSR regularizer across a large number of random seeds and several noise injection positions, since the maximum value also reflects the ability of the learning algorithm to reach a good optimum. The results are shown in Table 5; on some tasks, fine-tuning BERT with LNSR is even competitive with fine-tuning state-of-the-art models that adopt more powerful modern architectures and pre-training strategies.

                          | RTE      | MRPC         | CoLA          | STS-B
Task                      | NLI      | Paraphrase   | Acceptability | Similarity
Metric                    | Accuracy | F1/Accuracy  | Matthews corr | Pearson/Spearman corr
# of labels               | 2        | 2            | 2             | 1
# of training samples     | 2.5k     | 3.7k         | 8.6k          | 7k
# of validation samples   | 276      | 408          | 1k            | 1.5k
# of test samples         | 3k       | 1.7k         | 1k            | 1.4k
Table 4: Summary of the datasets used in this work.
RTE MRPC CoLA STS-B
BERT Devlin2019BERTPO 70.4 88.0 60.6 90.0
BERT Phang2018SentenceEO 70.0 90.7 62.1 90.9
LNSR (ours)
XLNet Yang2019XLNetGA
RoBERTa Liu2019RoBERTaAR
ALBERT Lan2020ALBERTAL
Table 5: We report the maximum value obtained when fine-tuning the LNSR model with different noise injection positions and random seeds on the four tasks. On some tasks, BERT (standard BERT-Large-Uncased Devlin2019BERTPO) with LNSR even becomes competitive with some newly proposed powerful models (bottom rows).

Figure 5: Performance distribution box plot of each model on the four tasks across 25 random seeds.