Large pretrained (PT) models have become an inevitability in modern Natural Language Processing (NLP) applications (Transformers; BERT; radford2019language; RoBERTa; lewis-etal-2020-bart; brown2020language; ELMO; ULM-FIT). Pretraining techniques such as Masked Language Modeling (MLM) and Causal Language Modeling (CLM) have helped make the best use of language statistics learned from large corpora to achieve improved performance on several NLP benchmarks (glue2; glue8; glue7; glue3; glue4; GLUE; SuperGLUE; glue1; glue5; glue6; glue9; glue10; glue11; SQUADv1; SQUADv2). With the increase in applications of these large models came a growing interest in evaluating the way these models learn to solve natural language tasks.
Recent research exploring the sensitivity of pretrained models to syntax has mostly applied perturbations to text by changing the order of words. Perturbations applied and quantified at this granularity offer only a limited understanding of the learning dynamics of large architectures. Analysing perturbations at a finer granularity, such as subwords (bojanowski2017enriching) or characters (gao2018black; ebrahimi2017hotflip), may provide deeper insight into the insensitivity of neural models. Consider Figure 1, which shows an unperturbed sentence, a word-level perturbed sentence, a subword-level perturbed sentence, and a character-level perturbed sentence. An average reader may find it possible to parse and infer the meaning up to the word-level perturbed sentence, but would have issues inferring any meaning from subword and character-level perturbed text.
In this paper, we define two types of structure in text [Footnote 1: Structure here relates to the organization of characters in the text.]: global, which relates to the absolute position of characters, and local, which relates to the relative position of characters to their immediate neighbors. We observe from experiments in the paper that most perturbations proposed and analyzed in the literature perturb the global structure heavily through different reorderings of words, while the disturbance to the local structure remains limited, thus preserving most of it. We hypothesize that the local structure enables understanding in natural language tasks and that effectively perturbing this structure aids in analyzing the sensitivity of neural models in language tasks. We hence propose two new metrics, the Index Displacement Count metric (IDC) and the Direct Neighbour Displacement metric (DND) [Footnote 2: There is wordplay in the naming of the metrics. IDC, standing for I Don't Care, and DND, standing for Do Not Disturb in internet slang, identify the metric the neural models do not care about in language tasks and the one they care about the most, respectively.], to measure the amount of perturbation to the global and local structures of text respectively.
Our contributions are as follows:
We propose two metrics, IDC (global) and DND (local), to measure the perturbations on global and local structures of text.
We show that the performance of neural models (Transformers and others) on perturbed input has a strong correlation with the proposed DND metric.
We observe that DND has a strong correlation with GLUE scores across different architectures, suggesting that neural language understanding models are generally more sensitive to distortions in local structures than in global structures.
We show that commonly used lexical perturbations distort the global structures and seldom affect the local structures, explaining the insensitivity of large models to such perturbations.
We show that DND has a weak correlation with other metrics (BLEU, Levenshtein), indicating that the common metrics used for evaluation do not measure the dimension captured by DND.
We find that the lack of correlation between the performance of non-pretrained Transformers and IDC is useful in detecting when models do not make use of the positional information present in text, defaulting to bag-of-words models.
2 Related Work
Importance of syntax
Discussions on semantics (culbertson2014language; futrell2020dependency) agree that specific orders of words are necessary for comprehending text. Psycholinguistic research (hale2017models) corroborates this by evaluating the sentence comprehension mechanisms of humans. Hence, interpreting language as a bag-of-words could limit the expressions conveyed through word order (harris1954distributional; le2014distributed), and understanding syntax [Footnote 3: Preference for a specific word order over another, with the preference complying with the choices of an average human speaking that language.] becomes essential.
Prior works have explored the relationship between neural models and syntax. assessing_syntax_1; assessing_syntax_2 both show that BERT (BERT) models have some syntactic capacity. BertHierachicalSyntaxicRepresentation show that BERT represents information hierarchically and conclude that BERT models linguistically relevant aspects in a hierarchical structure. syntax1; syntax2 show that the contextual embeddings that BERT outputs contain syntactic information that could be used in downstream tasks.
While it seems that syntax is both important and, to an extent, understood by the recent family of PT models, it is unclear how much use they make of it. SyntaxSupervisedTraining showed that pretraining BERT on syntax does not seem to improve downstream performance much. LearningWhichFeaturesMatter showed that while models such as BERT do understand syntax, they often prefer not to use that information to solve tasks. ettinger-2020-bert; pham-etal-2019-improving; sinha2020unnatural; gupta2021bert show that large language models are insensitive to minor perturbations, highlighting the lack of syntactic knowledge used in syntax-rich NLP tasks. sinha2021masked show that models pretrained on perturbed inputs still obtain reasonable results on downstream tasks, showing that models that have never been trained on well-formed syntax can obtain results that are close to their peers.
While syntactic information seems vital to language, and large PT models seem to be at least aware of syntax, the lack of sensitivity of neural models to perturbation of syntax motivates further probing.
Text Similarity Metrics
Several popular similarity metrics can be used to measure perturbations. Metrics like BLEU (papineni2002bleu) and ROUGE (lin2004rouge) treat text as a sequence of words, from which a measure of overlap is computed. The Levenshtein distance (OG_Levenshtein; yujian2007normalized), or edit distance, measures the minimum number of single-character edits (insertions, deletions or substitutions) necessary to turn one string into another. parthasarathi2021sometimes observed that learned metrics like BERT-Score (zhang2019bertscore) and BLEURT (sellam-etal-2020-bleurt) are often unaffected by minor perturbations in text, which limits their usefulness in measuring perturbations. sinha2020unnatural propose a POS mini-tree overlap score to interpret the results of the perturbation analysis. The score computes the part-of-speech (PoS) tag neighborhood for every word and estimates the average overlap in the neighborhood for all tokens before and after applying the perturbation. The authors, however, find the working range of the proposed metric to be small and explain the effect only for PT Transformer architectures.
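To make the character-level edit distance discussed above concrete, the following is a minimal Python sketch of the standard dynamic-programming Levenshtein computation. The helper `normalized_levenshtein` divides by the longer string's length; this is a simple normalization assumed here for illustration and is not necessarily the normalization used in the cited works.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] is the distance between the prefix of `a` seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (illustrative normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Because the distance counts edits rather than displaced neighbours, two perturbations with the same Levenshtein distance can differ widely in how much local structure they preserve.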
Several types of reordering perturbation functions and schemes have been explored to study the (in)sensitivity of neural architectures to word order. Perturbation analyses can broadly be split into three categories that involve deletion, paraphrase injection and/or reordering of tokens. sankar-etal-2019-neural explore utterance and word-level perturbations applied to generative dialogue models to highlight their insensitivity to the order of conversational history. On natural language classification tasks, pham2020out define n-grams for different values of n and shuffle them to highlight the insensitivity of pretrained models. They show that shuffling larger n-grams has a lesser effect than shuffling smaller n-grams, hinting that preserving more local structure causes less degradation in performance. Studying textual entailment tasks, sinha2020unnatural perform perturbations on the position of the words, with the criterion that no word remains in its initial position.
hsieh-etal-2019-robustness propose a suite of adversarial attacks that replace one word in the input to cause a model to flip its correct prediction. gupta2021bert combine several types of destructive transformations (such as sorting, reversing, and shuffling words) to remove all informative signal in a text. Along similar lines, wang-etal-2019-improving inject artificial noise by reordering articles or performing minimal deletions to measure the robustness of pretrained language models. Character-level perturbations that perform minimal flips to cause a degenerate response have been explored by ebrahimi2017hotflip; gao2018black. gao2018black quantify the perturbation in Levenshtein distance and draw a correlation to the model's performance. This work is closely related to our own. We, however, demonstrate that our proposed metric is a much more robust explanation of the degradation in performance of models than the Levenshtein distance.
Although the recent literature on perturbation analysis in PT language models was able to observe the extremes of sensitivity and insensitivity, understanding the attributes of text to which PT models and others are sensitive requires a detailed study. We speculate that perturbation analyses done at the granularity of subwords and characters are necessary for properly probing the neural models' insensitivity to word order. Likewise, a generic score quantifying the amount of perturbation is essential for setting up a unified evaluation framework for perturbation functions.
3 DND and IDC
We define two metrics, the Direct Neighbour Displacement (DND) and the Index Displacement Count (IDC), that score how much the local and global structures are perturbed by any reordering-based perturbation function. The global structure here relates to the absolute position of characters in a text, and the local structure relates to the neighbouring characters of any character in a text. Both metrics are measured on the perturbations of characters in text, allowing for a unified framework to compare perturbation functions that are applied at various levels of text granularity [Footnote 4: Pseudo-code and examples of both metrics are shown in Appendix B and Appendix C.]: words, subwords, and characters.
Let a string, $x$, be denoted by a sequence of characters $(c_1, \dots, c_N)$, where $N$ is the length of the string in characters and $p^{x} = (1, \dots, N)$ denotes the positions of the characters in $x$. Let $\pi(\cdot)$ be a perturbation operation.
$$x' = \pi(x)$$
where $x'$ denotes the perturbed string, with the positions of the characters specified by $p^{x'} = (p^{x'}_1, \dots, p^{x'}_N)$, so that $p^{x'}_j$ is the position of $c_j$ in $x'$. IDC is the average absolute displacement of the characters, normalized by the length of the text:
$$\mathrm{IDC}(x, x') = \frac{1}{N^2} \sum_{j=1}^{N} \left| p^{x'}_j - p^{x}_j \right|$$
The denominator normalizes the average by the length of the text [Footnote 5: $N^2$ is used to normalize as we sum $N$ times a number that is between $0$ and $N$, where $N$ is the text length.]. Intuitively, an IDC of 0.3 would imply that characters in the perturbed text have moved 30% of the text length on average. The values of IDC will lie in the range $[0, 0.5]$, where $0.5$ would be obtained by reversing a text at the character level.
For every $j < N$, let $r^{x}_j = p^{x}_{j+1} - p^{x}_j$ indicate the relative position of the right neighbor ($c_{j+1}$) of character $c_j$ with respect to the position of $c_j$ in string $x$. Then, DND is computed as a summation over an indicator variable that indicates when the neighbor to the right of $c_j$ has shifted to a different relative position in $x'$.
$$\mathrm{DND}(x, x') = \frac{1}{N-1} \sum_{j=1}^{N-1} \mathbb{1}\left[ r^{x'}_j \neq r^{x}_j \right]$$
DND measures the amount of distortion to the local neighborhood of every character. Intuitively, a DND of 0.3 would imply that 30% of characters in the perturbed text are no longer followed by their immediate neighbouring character in the unperturbed text. The values of DND will lie in the range $[0, 1]$, where $1$ can be obtained by removing every single neighborhood relation.
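The two intuitions above can be turned into a short Python sketch. It assumes, for illustration only, that a perturbation is given as a list `new_positions`, where `new_positions[j]` is the 0-based position in the perturbed text of the character that was at position `j` in the original text. It further assumes that IDC is the mean absolute displacement divided by the text length and that DND is the fraction of characters whose right neighbour no longer directly follows them. The function names and this input representation are not taken from the paper's released code.

```python
from typing import Sequence


def idc(new_positions: Sequence[int]) -> float:
    """Mean absolute displacement of characters, as a fraction of text length."""
    n = len(new_positions)
    if n == 0:
        return 0.0
    total = sum(abs(new - old) for old, new in enumerate(new_positions))
    # Dividing by n gives the mean displacement; dividing by n again
    # expresses it relative to the text length.
    return total / (n * n)


def dnd(new_positions: Sequence[int]) -> float:
    """Fraction of characters no longer directly followed by their original right neighbour."""
    n = len(new_positions)
    if n < 2:
        return 0.0
    broken = sum(
        1
        for j in range(n - 1)
        # In the original text the right neighbour sits exactly one position later.
        if new_positions[j + 1] - new_positions[j] != 1
    )
    return broken / (n - 1)
```

Under these assumptions, reversing a four-character text (`[3, 2, 1, 0]`) gives an IDC of 0.5 and a DND of 1.0, while swapping its two halves (`[2, 3, 0, 1]`) gives the same IDC of 0.5 but a DND of only 1/3, which illustrates how the two metrics separate global from local disturbance.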
4 Perturbation Functions
Towards conducting a detailed analysis on the effect of perturbations on performance of neural language models, we define three granularities of perturbation functions — word-level, subword-level and character-level. The subwords are taken from the RoBERTa-Base vocabulary. We define the perturbation functions as generic operations that can be applied across the different levels of granularity. Pseudo-code and examples for all perturbations are shown in Appendix B.
Full shuffling randomly shuffles the position of every word, subword, or character, according to the level it is applied to. This transformation should cause a great amount of perturbation to the global and local structure for the specific granularity.
Phrase shuffling creates chunks of contiguous tokens of variable length and shuffles these phrases of words, subwords, or characters. On average, this perturbation has the same impact on the global structure as full shuffling, as the absolute positions of characters tend to change just as much, while having a lesser impact on the local structure.
Unlike the full-shuffling operation, phrase shuffling uses a parameter $\rho$ that controls the average size of the randomly defined contiguous chunks of tokens. To randomly define our phrases, we traverse the text sequentially at the desired granularity. The entire text is initially treated as a single large phrase and is truncated at each token with probability $\rho$ into smaller phrases. The lower the value of $\rho$, the longer the phrases are on average, thus preserving more of the local structure while destroying roughly the same amount of global structure. In the extreme case with $\rho = 1$, phrase shuffling is equivalent to full shuffling, as all phrases will be one token long.
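The phrase-shuffling procedure described above can be sketched in Python as follows. This is an illustrative reading of the prose, not the paper's released implementation: the names `split_into_phrases`, `phrase_shuffle`, `rho` and `seed` are chosen here, and the input is assumed to be a list of tokens already split at the desired granularity.

```python
import random
from typing import List, Optional, Sequence


def split_into_phrases(tokens: Sequence[str], rho: float,
                       rng: random.Random) -> List[List[str]]:
    """Cut the token sequence after each token with probability `rho`."""
    phrases: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        current.append(token)
        if rng.random() < rho:  # close the current phrase here
            phrases.append(current)
            current = []
    if current:  # keep any trailing, unclosed phrase
        phrases.append(current)
    return phrases


def phrase_shuffle(tokens: Sequence[str], rho: float,
                   seed: Optional[int] = None) -> List[str]:
    """Shuffle randomly defined contiguous phrases; order inside a phrase is kept."""
    rng = random.Random(seed)
    phrases = split_into_phrases(tokens, rho, rng)
    rng.shuffle(phrases)
    return [token for phrase in phrases for token in phrase]
```

With `rho = 1.0` every phrase contains a single token, so the function reduces to full shuffling; with `rho = 0.0` the text is one phrase and is returned unchanged.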
Neighbour flip perturbations flip tokens of the chosen granularity with their immediate right neighbor with probability $\rho$. This function has, on average, a smaller impact on the global structure, as the absolute positions of tokens do not change much, but can have an arbitrarily large effect on the local structure.
The perturbation is applied by traversing the string from left to right at the desired granularity and, with probability $\rho$, switching the current token with the following token. The lower $\rho$ is, the less perturbation happens, thus preserving more of the local structure. This transformation never has a large impact on the global metric, thus letting us isolate the impact of perturbations to the different structures.
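A minimal Python sketch of a neighbour flip is given below. The prose leaves open whether a token that has just been swapped can be swapped again; this sketch assumes it cannot (the traversal skips past a swapped pair), so that every token moves by at most one position. The names `neighbour_flip`, `rho` and `seed` are illustrative and not taken from the paper's code.

```python
import random
from typing import List, Optional, Sequence


def neighbour_flip(tokens: Sequence[str], rho: float,
                   seed: Optional[int] = None) -> List[str]:
    """Swap each token with its right neighbour with probability `rho`."""
    rng = random.Random(seed)
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        if rng.random() < rho:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # assumption: a swapped pair is not revisited
        else:
            i += 1
    return out
```

Under this assumption, `rho = 1.0` turns `abcdef` into `badcfe`: every character is displaced by a single position, yet no character is still followed by its original right neighbour.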
We experiment with the GLUE Benchmark (GLUE) datasets, a popular Natural Language Understanding (NLU) benchmark. Of the GLUE suite, we evaluate on 8 tasks: Multi-Genre NLI (MNLI), Corpus of Linguistic Acceptability (CoLA), Quora Question Pairs (QQP), Microsoft Research Paraphrase Corpus (MRPC), Question NLI (QNLI), Recognizing Textual Entailment (RTE), Stanford Sentiment TreeBank (SST), and Semantic Textual Similarity Benchmark (STS-B) (glue1; glue2; glue3; glue4; glue5; glue6; glue7; glue8). The model-agnostic tasks in GLUE, which have diverse textual contexts, dataset sizes, and varying degrees of difficulty, are designed to evaluate the language understanding components of neural language models.
We create perturbed versions of datasets for all tasks with the different perturbation functions defined in § 4.
In total, 20 different variations of our perturbation functions are applied [Footnote 6: The hyperparameters used for the perturbation functions are detailed in Appendix A.]. We maintain an unperturbed version of the dataset to benchmark the model’s performance.
We experiment on neural architectures with different inductive biases: BiLSTMs (Bi-LSTM), ConvNets, PT Transformers (RoBERTa-Base and BART-Base), and a Non-Pretrained (NPT) Transformer (RoBERTa-Base architecture). We also experiment with different tokenization schemes, using byte-pair encoding as well as character-level tokenization. All training, finetuning and evaluation are done on the perturbed version of the dataset [Footnote 7: The training details can be found in Appendix A. Code to reproduce the results is available on GitHub.]. The tokenization for PT Transformer models uses their corresponding vocabulary, while NPT models use the RoBERTa-Base vocabulary and the character-level models use exclusively characters as vocabulary.
The primary objectives of the experiments are to: (1) understand if different degrees of perturbations affect the models alike, (2) verify if correlation exists between the performance across different NLU tasks and the amount of perturbation measured by DND and IDC, (3) investigate the different perturbation operations used in the literature and their distribution on our proposed metrics, and (4) understand if the pretraining of models is important to the studied phenomenon.
6.1 Correlation with other metrics
Towards estimating the relationship the proposed metrics (IDC and DND) have with the existing metrics (BLEU and Levenshtein), we compute the pairwise correlation among the metrics, averaged across all samples in the GLUE validation set, in Figure 5 [Footnote 8: For every correlation, we inverted the value of DND, IDC, and the Levenshtein distance by subtracting the value from $1$ to make the comparison of the different correlations more straightforward. They are measures of perturbation and not similarity and are therefore inversely correlated to the GLUE score.]. Specifically, for every sample in the validation set of the tasks, we perturb it using the different perturbation functions and compute its scores with the different metrics.
We observe that IDC and DND are uncorrelated suggesting that the metrics measure different aspects of the perturbations. Further, we observe that DND only has a weak correlation with BLEU and Levenshtein, indicating that DND measures a previously unmeasured dimension of the structure and similarity in texts.
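As a rough illustration of how such a pairwise comparison can be computed, the sketch below calculates a correlation coefficient for every pair of metrics from per-sample scores. It assumes Pearson correlation, which is an assumption made for this example rather than a statement of the coefficient used in the paper. The function names and the dictionary layout (metric name mapped to a list of scores, one per perturbed sample) are also illustrative.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equally long, non-constant score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # A constant list has zero variance and would make this division undefined.
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(
    scores: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Correlation for every unordered pair of metrics in `scores`."""
    return {
        (a, b): pearson(scores[a], scores[b])
        for a, b in combinations(scores, 2)
    }
```

A value near zero for a pair such as DND and IDC would indicate that the two metrics vary independently across perturbed samples.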
6.2 Comparison of Perturbation Functions
We populate an assorted list of perturbation functions analyzed by parthasarathi2021sometimes that can be applied to examples in GLUE tasks. The 16 different word-level perturbations are categorized as PoS-Tag perturbations, Dependency Tree perturbations, and Random shuffles that include perturbing with different traversal orders of dependency tree — Pre-Order, Post-Order or In-Order —, swapping verbs, adverbs, nouns in a text, reversing sentences among other perturbations.
We perturb the samples across GLUE tasks for every perturbation function, compute the scores with the metrics and compare with the perturbations defined in § 4. The distribution of scores measured by BLEU and Levenshtein covers the entire range of values for most of the word-level functions (shown in Figure 14). In contrast, the distribution of scores computed by DND for the different perturbation functions, shown in Figure 6, indicates that the word-level and subword-level perturbations have a limited impact on the local structure.
Unsurprisingly, we found BLEU to be uninterpretable when the perturbations were done at the character or subword level, rendering it ineffective for our study. Although Levenshtein does better in that regard, we observe that the DND metric correlates strongly with model performance on perturbed samples (in §6.3). The analysis provides a reasonable explanation for the insensitivity observed with the word-level perturbations studied in the literature (sinha2020unnatural; pham2020out; gupta2021bert).
6.3 IDC/DND vs GLUE tasks
We compute the average GLUE score of different models on validation data perturbed with different functions to cover the range of DND scores, as shown in Figure 8. The general trend is that the proposed DND metric has a strong correlation with neural models’ loss in performance on the GLUE benchmark tasks (Figure 7). By computing the correlation between the performance of the different models on the perturbed samples and the amount of perturbation as estimated by the different metrics (Figure 13), we see that the correlation with DND holds for every single architecture and setting tested. On the other hand, IDC is only weakly correlated with performance decay. This implies that local structure, more so than global structure, is necessary for models to understand text. The Levenshtein distance and the BLEU metric both hold some explanatory power, but performance does not increase or decrease monotonically with them, which limits the usefulness of those measures. For example, a model being evaluated on a perturbed text with a DND of can be assumed to have much lower performance than on a perturbed text with a DND of . This is not true for any of the other metrics: the same does not hold for a model being tested with a perturbed sample measuring BLEU and another with a BLEU. Similarly, a given Levenshtein score could represent perturbations that lead to a very low performance, or to a barely affected performance.
Looking at the tasks more closely rather than at an average GLUE score, as in Figures 11, 10 and 9, we see that a few tasks buck the overall trend. In particular, the linguistic acceptability task, CoLA, correlates very strongly with the BLEU-4 metric and only weakly with DND. It is also the only task in the GLUE benchmark that reaches chance-level performance with word-level perturbations (pham2020out). Hence, it is expected that word-level perturbations, as measured by BLEU, are more important for this task than the local structure, as measured by DND.
The STS-B task, a semantic textual similarity task, is barely correlated with IDC but strongly correlated with DND. It does not seem possible to affect performance on this task with word or subword-level perturbations, implying that the bag-of-words information is sufficient to obtain good textual similarity estimates. DND, being able to measure distortions to the bag-of-words information, is able to provide an explanation for the degradation in performance on a task that does not seem to rely on syntactic information, whereas IDC does not seem to provide an explanation for the decay of performance in this task.
6.4 Model specific analysis
The loss in performance of models in GLUE tasks shows a greater degree of correlation with the DND metric than any other metric, as shown in Figure 13. We found our results to be consistent across PT Transformers, NPT Transformers, ConvNets, and BiLSTMs. This indicates that our results generalize to neural language models across different inductive biases, pretrained or not-pretrained, and to the different pretraining techniques.
6.4.1 Pretrained vs Non-Pretrained
Unsurprisingly, PT Transformers outperform every NPT variant across all types of perturbations, as shown in Figure 8. The PT RoBERTa and BART models have a comparable level of degradation across the different perturbations, shown in Figure 9, despite the different pretraining schemes used.
All NPT models also exhibit a strong correlation between the DND metric and their degradation in performance on the GLUE tasks, which indicates that the insensitivity to word-order is not an artifact of pretraining. In contrast with PT models, NPT models have very low correlations between all metrics and performance on the RTE task. This is explained by the fact that most NPT models do not obtain significantly above chance-level performance on the RTE task. As the performance quickly degrades to chance-level once any perturbation is applied, it is hard to measure correlations between the metrics and the task performance.
6.4.2 NPT Transformer and Positional Embeddings
Interestingly, the NPT Transformer, as shown in Figure 10, has a close to zero correlation between its performance and the IDC metric, which other non-Transformer models do not mirror. As the IDC metric measures the changes in absolute position of tokens, the IDC metric being completely uncorrelated with performance implies that the absolute position of tokens has little to no impact on the performance of NPT Transformers. We hypothesize that learning the positional embeddings requires much more data than is present in a single NLU task, leading the NPT model to essentially act as a bag-of-words model.
Towards studying this, we conduct an ablation study on the impact of positional embeddings with NPT and PT Transformers. To do this, we freeze the weights of the positional embeddings to $0$, so that they make no contribution to the overall output of the model. As we are interested in the marginal utility of positional embeddings with relation to NPT Transformers, we report the difference in performance between the model that does not have access to those embeddings and the model that does ($\Delta$ GLUE Score). Without positional embeddings, a model has no information on the relative position of inputs and is forced to use only a bag-of-words level of information from the input text. In Figure 12, we can see a drop in performance of at most , consistent across all levels of perturbations, for the NPT Transformer. This suggests that NPT Transformers barely make any use of the positional embeddings on those tasks.
The positional embeddings of PT models, however, seem strongly impacted by perturbations, suggesting that they make heavy use of the positional embeddings. We see that the impact of those embeddings degrades monotonically with perturbation on the local structure and is somewhat correlated with perturbation on the global structure, which is not observed with the NPT Transformer.
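The effect of the ablation can be illustrated with a deliberately simplified toy encoder in Python. This is not a Transformer and not the paper's experimental code: it adds a positional vector to each token vector, applies a `tanh` nonlinearity, and mean-pools the result. The function name `encode` and the tiny hand-written embedding tables are assumptions made purely for illustration. The point is only that once the positional term is fixed to zero, an order-agnostic pooling model produces the same output for any reordering of its input, i.e. it behaves like a bag-of-words model.

```python
import math
from typing import List, Sequence


def encode(token_ids: Sequence[int],
           token_emb: Sequence[Sequence[float]],
           pos_emb: Sequence[Sequence[float]],
           use_positions: bool = True) -> List[float]:
    """Mean-pool tanh(token + position) vectors; positions can be zeroed out."""
    dim = len(token_emb[0])
    pooled = [0.0] * dim
    for position, token_id in enumerate(token_ids):
        for d in range(dim):
            # Zeroing the positional term mimics freezing the embeddings to 0.
            pos = pos_emb[position][d] if use_positions else 0.0
            pooled[d] += math.tanh(token_emb[token_id][d] + pos)
    return [value / len(token_ids) for value in pooled]


# Toy one-dimensional embeddings (hypothetical values).
token_emb = [[0.0], [1.0]]
pos_emb = [[0.0], [1.0]]

with_pos = (encode([0, 1], token_emb, pos_emb),
            encode([1, 0], token_emb, pos_emb))
without_pos = (encode([0, 1], token_emb, pos_emb, use_positions=False),
               encode([1, 0], token_emb, pos_emb, use_positions=False))
```

In this toy setting the two entries of `without_pos` are identical, while the two entries of `with_pos` differ, mirroring the idea that a model without usable positional information cannot distinguish a text from its reorderings.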
6.5 Character-Level Experimentation
As the results presented so far use subword tokenization, it is possible that the direct correlation between local perturbations and performance decay is caused by the perturbation to the vocabulary. To remove tokenization as a confounding factor for the observed phenomenon, we train character-level BiLSTMs and ConvNets [Footnote 9: As character sequences are much longer than subword sequences for the same text and the memory usage of Transformer models scales quadratically with sequence length, we were not able to run our study on a character-level Transformer.] to evaluate whether the correlations with the DND metric hold without a multi-character vocabulary. Results shown in Figure 13 demonstrate that DND remains as good an explanation of the decay in performance for models with character-level tokenization as for models using a vocabulary of subwords. This result allows generalizing the correlation to neural models beyond the choice of tokenization.
Our results on the importance of local structure could bear some implications for tokenization. Recent research trends (xu27vocabulary; clark2021canine) look at alternatives and improvements to BPE. The current research appears to be pushing towards smaller vocabulary at finer granularity, even exploring simple byte-level representations (xue2021byt5; tay2021charformer).
Through our DND metric, we find that local clumps of characters contain the most essential structural information required to solve several NLU problems. As a large part of the complexity of NLU seems to be contained within the meaning of the specific order of clumps of characters, by having more of that local structure fixed through tokenization, it is possible to inject additional inductive biases into the model. The perturbation analysis discussed in the paper could be used for better construction of vocabulary with improved heuristics.
Local, Global, and Bag-of-Words
Our results on the relative importance of local structure in relation to global structure hint at the possibility that many of the tested NLU tasks can be solved with a bag-of-words formulation. Intuitively, local structure mainly relates to building meaningful words from the characters of a text, whereas the global structure relates to the general order and word-level syntax being maintained. From our experiments, we observe that as long as the local structure is roughly maintained, a majority of NLU tasks can be solved without requiring the global structure. This agrees with similar findings by o2021context. In essence, the structure required to build words seems to be necessary, but much of NLU can be solved with the information of which words (or subwords) are present in the text, without regard to their relative positions. This adds further credibility to similar research that attempts to understand the success of Transformers in NLP by hypothesizing that global attention makes the architectures particularly apt at reflecting over a set of items, like a bag-of-words.
Learning Positional Embeddings is Data Hungry
Our experiments indicate that learning useful positional embeddings requires huge amounts of data and that these embeddings are mostly unused by Transformers that were not pretrained on an extensive corpus. This suggests that for problems where the input order is important and there is limited training data, models with stronger inductive biases such as ConvNets and LSTMs may be a better choice than Transformers that were not pretrained (tran-etal-2018-importance). We leave experimenting with models from specific breakpoints, to study the evolution of the utility of positional embeddings, as future work.
In this work, we propose the Direct Neighbour Displacement metric and the Index Displacement Count metric, which score the perturbation to the local and global structure of tokens in perturbed texts. The results provide a way to quantify perturbations to better understand the inner workings of neural language understanding models. Reflecting on our results, we observe that perturbations on a local level, as measured by DND, explain the (in)sensitivity of pretrained language models to perturbations at different granularities on a variety of natural language understanding tasks. Although the paper primarily focuses on the effects of perturbations on English texts, extending the study to neural models on other languages could be beneficial. In particular, studying whether perturbations have a similar effect on other languages could help in deepening our understanding of cross-language tasks, like machine translation.
We thank Saujas Vaduguru for the useful comments and discussions on early drafts. This research was supported by Apogée Canada, Canada First Research Excellence Fund program and École Polytechnique Startup Fund PIED. SC is supported by a Canada CIFAR AI Chair and an NSERC Discovery Grant.
Appendix A Experiment Details
The results in the paper are averaged over five experiments, run with 5 random seeds. Early stopping was performed after 2 full epochs not resulting in better results on the validation set. All models had similar model sizes, containing between 100 million and 130 million parameters. The ConvNet architecture is the one described in convnet_architecture and the BiLSTM architecture is the one described in bilstm_architecture. Both use the same hidden size, dropout and word embedding size as the RoBERTa-Base model. Pretrained models used a learning rate of 2e-5, a batch size of 32, a maximum of 3 epochs and a weight decay of 0.1. Non-pretrained models used a learning rate of 1e-4, a batch size of 128, a maximum of 50 epochs and a weight decay of 1e-6. All experiments used a warmup ratio of 0.06, as described in RoBERTa. Experiments using characters as input used a maximum sequence length of 2048 inputs. All other experiments used a maximum sequence length of 512. The Winograd Schema Challenge (WNLI) task was omitted from all experiments.
Subword-level perturbations were all done with the RoBERTa-Base vocabulary. On all levels of granularity, we perform experiments with the full shuffling, the n-gram shuffling with and the neighbour flip perturbation with . As the word-level and subword-level perturbations do not permit a sufficient exploration of DND, we populate the continuum of our metric with several more character-level perturbations by testing different values of for both the n-gram shuffling and the neighbour flip perturbation. Experiments are run with the following values of for n-gram shuffling: , and for neighbour flip perturbation: .
Appendix B Pseudocode for Metric and Perturbations
Appendix C DND/IDC Computation Examples
Appendix D Distribution of metrics scoring perturbations
Appendix E Reproducibility Checklist
As per the prescribed Reproducibility Checklist, we provide the information of the following:
Submission of source code: Source code for the perturbations, metrics and models is provided in GitHub. The training code was adapted from the excellent HuggingFace github.
Description of the computing infrastructure used: We used up to 20 NVIDIA V100 32 GB GPUs at a time to run all experiments. All models were trained and evaluated on 1 NVIDIA V100 32 GB GPU for every seed of every model.
Average runtime for each approach: The approximate training time for fine-tuning varies between - hours and the inferencing on standard validation sets was about an hour.
Explanation of evaluation metrics used, with links to code: We add the necessary citations for the metrics considered in the paper and also provide code to reproduce them.
Relevant statistics of the datasets used: We provide the statistics of the datasets used in §5.
Explanation of any data that were excluded, and all pre-processing steps: The details for omitting WNLI data from GLUE benchmark are provided in §A.
Link to downloadable version of data: The datasets used in the paper are from public repositories. Links to the papers that propose the datasets are included in §5.